Master Work
asked on
How to search pdf files for a key word
How can I search PDF files for a keyword like "Sticky" and report the PDF files that contain that keyword?
Please attach the file. Your file didn't come through.
I can't see files attached...
Then again, some PDFs are text-based, while others contain only scanned page images.
The first kind can be searched directly; the latter need OCR, which isn't too reliable ...
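To tell the two kinds of PDF apart before searching, one rough heuristic is to extract the text and check whether anything readable comes out. The sketch below assumes the `pdftotext` command-line tool (shipped with poppler and Xpdf) is installed and on the PATH; the function names `has_text_layer` and `classify_pdfs` and the `PDFTOTEXT` variable are made up for this illustration.

```shell
#!/bin/sh
# Assumption: pdftotext is on the PATH. Set PDFTOTEXT to point elsewhere if not.
: "${PDFTOTEXT:=pdftotext}"

# has_text_layer FILE
# Succeeds (exit 0) if the extracted text contains at least one letter or digit.
# Note: a missing or failing extractor also yields no text, so such files are
# reported the same way as image-only PDFs.
has_text_layer() {
    "$PDFTOTEXT" "$1" - 2>/dev/null | grep -q '[[:alnum:]]'
}

# classify_pdfs DIR
# Prints one line per PDF in DIR: "text<TAB>path" or "needs-ocr<TAB>path".
classify_pdfs() {
    for f in "$1"/*.pdf; do
        [ -e "$f" ] || continue   # skip the unexpanded pattern if DIR has no PDFs
        if has_text_layer "$f"; then
            printf 'text\t%s\n' "$f"
        else
            printf 'needs-ocr\t%s\n' "$f"
        fi
    done
}
```

For example, `classify_pdfs /path/to/share` lists which files a plain text search can cover and which would need OCR first.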
You can use Acrobat DC 2018 to convert the file to an editable form and save it. That facilitates searching.
If you are searching in Windows outside of Adobe, go to Control Panel, Indexing Options, Advanced, File Types tab, and make sure the pdf extension is set to "Index Properties and File Contents".
If you change this, allow time for the index to rebuild, or rebuild it manually.
Sorry, there was some leftover from testing in line 7; corrected version:
$find = 'sticky'
Add-Type -Path 'C:\Temp\itextsharp.dll'
Get-ChildItem -Path C:\temp -Filter *.pdf | ForEach-Object {
    Write-Host "Processing $($_.Name) ..."
    $pdfReader = New-Object -TypeName 'iTextSharp.text.pdf.pdfreader' -ArgumentList $_.FullName
    For ($i = 1; $i -le $pdfReader.NumberOfPages; $i++) {
        Write-Host " $($i)" -NoNewline
        $page = [iTextSharp.text.pdf.parser.PdfTextExtractor]::GetTextFromPage($pdfReader, $i)
        # Whole-word match; -match is case-insensitive by default.
        # [regex]::Escape keeps special characters in $find from being read as regex syntax.
        If ($page -match "\b$([regex]::Escape($find))\b") {
            # Report the first page with a hit, then stop scanning this file.
            $_ | Select-Object -Property @{n='Page'; e={$i}}, Name, FullName, LastWriteTime
            Break
        }
    }
    $pdfReader.Close()
    Write-Host ''
}
ASKER
I need this with a single command line. Because I am running this from another application.
That's possible, but you need to define what "report those pdf files" means.
ASKER
I have a shared folder with an unknown number of PDF files, and at the end I should have a list of the PDF files that contain the keyword.
I figured as much, but how do you want that list?
As console output, or as a CSV file? Do you need different exit codes depending on whether a file was found or not?
Another way to make this work is to query the Windows Search index. Of course the files would have to be indexed for it to work. Examples are shown in another question:
https://www.experts-exchange.com/questions/29088159/Get-ChildItem-recurse-Select-String-Pattern-abcd.html
This will now write the output to the console and, if a file path is specified in $outFile, write a CSV to that file as well.
This has a batch wrapper, so save it as Whatever.cmd.
If you need the window to stay open after the script is done, add the following as the last line of the script:
$null = Read-Host -Prompt "Enter return to continue"
@PowerShell.exe -Command "Invoke-Expression -Command ((Get-Content -Path '%~f0' | Select-Object -Skip 2) -join [environment]::NewLine)"
@exit /b %Errorlevel%
$find = 'Sticky'
## set to $null to skip creating the csv export.
$outFile = "C:\Temp\sticky.csv"
Add-Type -Path 'C:\Temp\itextsharp.dll'
## Write-Host output goes to the console only; $results collects just the matching file objects.
$results = Get-ChildItem -Path C:\temp -Filter *.pdf | ForEach-Object {
    Write-Host "Processing $($_.Name) ..."
    $pdfReader = New-Object -TypeName 'iTextSharp.text.pdf.pdfreader' -ArgumentList $_.FullName
    For ($i = 1; $i -le $pdfReader.NumberOfPages; $i++) {
        Write-Host " $($i)" -NoNewline
        $page = [iTextSharp.text.pdf.parser.PdfTextExtractor]::GetTextFromPage($pdfReader, $i)
        ## Whole-word, case-insensitive match; the search term is escaped so it is taken literally.
        If ($page -match "\b$([regex]::Escape($find))\b") {
            ## Report the first page with a hit, then stop scanning this file.
            $_ | Select-Object -Property @{n='Page'; e={$i}}, Name, FullName, LastWriteTime
            Break
        }
    }
    $pdfReader.Close()
    Write-Host ''
}
$results | Format-Table -AutoSize
If ($outFile) {
    $results | Export-Csv -NoTypeInformation -Path $outFile
}
Another approach: there is a PDF toolkit named poppler that can handle PDF files (available on Linux, through Cygwin on Windows, and maybe also through WSL).
The following command line (using bash from Cygwin) should do the job:
for i in *.pdf ; do pdftotext "$i" t.t ; grep -i sticky t.t && echo "Sticky in $i"; done
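If the one-liner needs to be called from another application, a slightly more defensive variant might look like the sketch below. It skips the temporary file, copes with spaces in file names, matches whole words only (like the `\b...\b` pattern in the PowerShell script), and returns a grep-style exit code. The function name `find_pdfs_with_word` and the `PDFTOTEXT` variable are made up for this example, and `grep -w` is assumed to be available (it is in GNU, BSD and BusyBox grep, but is not required by POSIX).

```shell
#!/bin/sh
# Assumption: pdftotext (poppler or Xpdf) is on the PATH. Override via PDFTOTEXT.
: "${PDFTOTEXT:=pdftotext}"

# find_pdfs_with_word WORD DIR
# Prints the path of every PDF in DIR whose extracted text contains WORD
# as a whole word, ignoring case.
# Exit status: 0 if at least one file matched, 1 if none did.
find_pdfs_with_word() {
    word=$1
    dir=$2
    found=1
    for f in "$dir"/*.pdf; do
        [ -e "$f" ] || continue   # no PDFs in DIR: the pattern stays unexpanded
        # "-" sends the extracted text to stdout instead of a file.
        # -q: quiet, -i: ignore case, -w: whole word, -F: literal string, no regex.
        if "$PDFTOTEXT" "$f" - 2>/dev/null | grep -qiwF -- "$word"; then
            printf '%s\n' "$f"
            found=0
        fi
    done
    return "$found"
}
```

A call such as `find_pdfs_with_word sticky /path/to/share > matches.txt` writes the list to a file, and the caller can check the exit code to see whether anything was found. Image-only PDFs produce no text and are silently skipped.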
> through cygwin on windows, maybe also through WSL
It's actually available natively on Windows. It's one of the Xpdf utilities, described overall in this five-minute EE video Micro Tutorial:
Xpdf - Command Line Utility for PDF Files
I discuss PDFtoText specifically in another five-minute EE video Micro Tutorial:
Xpdf - PDFtoText - Command Line Utility to Convert PDF Files to Plain Text Files
Regards, Joe