Master Work

How to search PDF files for a keyword

How to search PDF files for a keyword like "Sticky" and report the PDF files that contain it?
Chinmay Patel

Please attach the file. Your file didn't come through.
noci

I can't see any attached files...

Then again, some PDFs are text-based, while others contain only a scanned image per page.
The former can be searched directly; the latter need OCR, which isn't too reliable...
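To check which kind of PDF you are dealing with before searching, something like the following might help. It is only a sketch, assuming the pdftotext utility (from Poppler or Xpdf, both mentioned later in this thread) is on the PATH; the function names has_text_layer and classify_pdfs are made up for illustration.

```shell
# Sketch: report whether each PDF in a directory has an extractable text layer.
# Assumes pdftotext is on PATH; the helper names are hypothetical.

# has_text_layer FILE: succeeds if pdftotext extracts any non-whitespace text.
has_text_layer() {
    # "-" sends the extracted text to stdout instead of a file.
    # Page breaks come out as form feeds, which count as whitespace here.
    pdftotext "$1" - 2>/dev/null | grep -q '[^[:space:]]'
}

# classify_pdfs DIR: print one line per PDF found in DIR.
classify_pdfs() {
    for f in "$1"/*.pdf; do
        [ -e "$f" ] || continue   # skip the unexpanded pattern if DIR has no PDFs
        if has_text_layer "$f"; then
            echo "text: $f"
        else
            echo "no text (OCR needed?): $f"
        fi
    done
}
```

Calling classify_pdfs with the folder path prints one line per file. Files reported as having no text would have to go through OCR before any keyword search can find anything in them.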
ASKER CERTIFIED SOLUTION
oBdA

This solution is only available to members.
You can use Acrobat DC 2018 to convert the file to an editable form and save it. That facilitates searching.

If you are searching in Windows outside of Adobe, make sure that in Control Panel > Indexing Options > Advanced, on the File Types tab, the pdf extension is set to "Index Properties and File Contents".

If you change this, allow time for the index to rebuild, or rebuild it manually.
Sorry, there was some leftover from testing in line 7; corrected version:
$find = 'sticky'
Add-Type -Path 'C:\Temp\itextsharp.dll'
Get-ChildItem -Path C:\temp -Filter *.pdf | ForEach-Object {
	Write-Host "Processing $($_.Name) ..."
	$pdfReader = New-Object -TypeName 'iTextSharp.text.pdf.pdfreader' -ArgumentList $_.FullName
	For ($i = 1; $i -le $pdfReader.NumberOfPages; $i++) {
		Write-Host " $($i)" -NoNewline
		$page = [iTextSharp.text.pdf.parser.PdfTextExtractor]::GetTextFromPage($pdfReader, $i)
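		# Escape the keyword so regex metacharacters match literally; -match is case-insensitive.
		If ($page -match "\b$([regex]::Escape($find))\b") {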
			$_ | Select-Object -Property @{n='Page'; e={$i}}, Name, FullName, LastWriteTime
			Break
		}
	}
	$pdfReader.Close()
	Write-Host ''
}

Master Work

ASKER

I need this as a single command line, because I am running it from another application.
That's possible, but you need to define what you mean by "report those pdf files".
I have a shared folder with an unknown number of PDF files, and at the end I should have a list of the PDF files that contain the keyword.
I figured as much, but how do you want that list?
As console output, as a CSV file? Do you need different exit codes depending on whether a file was found or not?
Another way to make this work is to query the Windows Search index.  Of course the files would have to be indexed for it to work.  Examples are shown in another question:
https://www.experts-exchange.com/questions/29088159/Get-ChildItem-recurse-Select-String-Pattern-abcd.html
This will now write the output to the console and, if a file path is specified in $outFile, write a CSV to that file as well.
This has a batch wrapper, so save it as Whatever.cmd.
If you need the window to stay open after the script is done, add the following as the last line:
$null = Read-Host -Prompt "Enter return to continue"

@PowerShell.exe -Command "Invoke-Expression -Command ((Get-Content -Path '%~f0' | Select-Object -Skip 2) -join [environment]::NewLine)"
@exit /b %Errorlevel%

$find = 'Sticky'
## set to $null to skip creating the csv export.
$outFile = "C:\Temp\sticky.csv"
Add-Type -Path 'C:\Temp\itextsharp.dll'
$results = Get-ChildItem -Path C:\temp -Filter *.pdf | ForEach-Object {
	Write-Host "Processing $($_.Name) ..."
	$pdfReader = New-Object -TypeName 'iTextSharp.text.pdf.pdfreader' -ArgumentList $_.FullName
	For ($i = 1; $i -le $pdfReader.NumberOfPages; $i++) {
		Write-Host " $($i)" -NoNewline
		$page = [iTextSharp.text.pdf.parser.PdfTextExtractor]::GetTextFromPage($pdfReader, $i)
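		# Escape the keyword so regex metacharacters match literally; -match is case-insensitive.
		If ($page -match "\b$([regex]::Escape($find))\b") {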
			$_ | Select-Object -Property @{n='Page'; e={$i}}, Name, FullName, LastWriteTime
			Break
		}
	}
	$pdfReader.Close()
	Write-Host ''
}
$results | Format-Table -AutoSize
If ($outFile) {
	$results | Export-Csv -NoTypeInformation -Path $outFile
}

Another approach: there is a PDF toolkit named Poppler (available on Linux, through Cygwin on Windows, and maybe also through WSL).

The following command line (using bash from Cygwin) should do the job:

for i in *.pdf ; do pdftotext "$i" t.t ; grep -i sticky t.t && echo "Sticky in $i"; done
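Since the asker wants a plain list of files and was asked about exit codes, the one-liner above could be reshaped roughly as follows. This is a sketch only: find_keyword_pdfs is a hypothetical name, pdftotext is assumed to be on the PATH, and the -w option of grep (whole-word match) is assumed to be available, as it is in GNU and BSD grep.

```shell
# find_keyword_pdfs KEYWORD [DIR]
# Prints the path of every PDF in DIR (default: current directory) whose
# extracted text contains KEYWORD as a whole word, ignoring case.
# Returns 0 if at least one file matched, 1 otherwise.
# Hypothetical helper; assumes pdftotext is on PATH.
find_keyword_pdfs() {
    keyword=$1
    dir=${2:-.}
    found=1
    for f in "$dir"/*.pdf; do
        [ -e "$f" ] || continue   # no PDFs in DIR: the pattern stays unexpanded
        # "-" writes the text to stdout, so no temporary file is needed.
        # grep: -q quiet, -i ignore case, -w whole word, -F literal string
        if pdftotext "$f" - 2>/dev/null | grep -qiwF -- "$keyword"; then
            printf '%s\n' "$f"
            found=0
        fi
    done
    return "$found"
}
```

Usage would be along the lines of find_keyword_pdfs Sticky /path/to/share > matches.txt, with the calling application checking the exit status to see whether anything was found. Quoting "$f" keeps file names with spaces intact.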
> through cygwin on windows, maybe also through WSL

It's actually available natively on Windows. It's one of the Xpdf utilities, described overall in this five-minute EE video Micro Tutorial:

Xpdf - Command Line Utility for PDF Files

I discuss PDFtoText specifically in another five-minute EE video Micro Tutorial:

Xpdf - PDFtoText - Command Line Utility to Convert PDF Files to Plain Text Files

Regards, Joe
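For anyone going the pdftotext route who also wants the page number, as the PowerShell script earlier in the thread reports, a rough sketch follows. It assumes pdfinfo and pdftotext (both shipped with Xpdf and Poppler) are on the PATH and that it runs in a POSIX-style shell such as bash under Cygwin; first_matching_page is a made-up name.

```shell
# first_matching_page KEYWORD FILE
# Prints the number of the first page whose text contains KEYWORD
# (whole word, case-insensitive) and returns 0; returns 1 if no page matches.
# Hypothetical helper; assumes pdfinfo and pdftotext are on PATH.
first_matching_page() {
    # pdfinfo prints a line such as "Pages:          12"
    pages=$(pdfinfo "$2" 2>/dev/null | awk '/^Pages:/ {print $2}')
    [ -n "$pages" ] || return 1
    p=1
    while [ "$p" -le "$pages" ]; do
        # -f and -l restrict text extraction to a single page
        if pdftotext -f "$p" -l "$p" "$2" - 2>/dev/null | grep -qiwF -- "$1"; then
            echo "$p"
            return 0
        fi
        p=$((p + 1))
    done
    return 1
}
```

This starts pdftotext once per page, so it is slower than extracting the whole file in one go; it is probably only worth it when the page number matters.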