fnjl

C# routines to search PDF files

I would like to search a directory of PDF documents in C# and produce a list of all documents that contain specific text.
Joe Winograd

Two ideas for you:

(1) iTextSharp is the .NET port of iText:
http://sourceforge.net/projects/itextsharp/

It is a robust library for working with PDF files. It targets .NET languages such as C#, whereas iText itself targets Java.

(2) Xpdf is a set of eight command line executables:
http://www.foolabs.com/xpdf/

The only one you'll need is pdftotext.exe, which converts PDF files to plain text. You may download the package here:
http://www.foolabs.com/xpdf/download.html
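
For the C# side, here is a minimal sketch of how pdftotext.exe could be called for every PDF in a folder to list the files that contain a search term. It assumes pdftotext.exe is on the PATH (adjust PdfToTextPath otherwise) and that passing "-" as the output file sends the text to standard output, as the Xpdf documentation describes. The class and method names (PdfTextSearch, BuildArguments, ExtractText, FindMatches) are made up for illustration and are not part of Xpdf.

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.IO;

public static class PdfTextSearch
{
    // Assumption: pdftotext.exe is on the PATH; otherwise set a full path here.
    public static string PdfToTextPath = "pdftotext.exe";

    // Builds the pdftotext command line. The trailing "-" asks pdftotext
    // to write the extracted text to standard output instead of a file.
    // mode is an optional pdftotext flag such as "-layout" (null for default).
    public static string BuildArguments(string pdfPath, string mode)
    {
        string prefix = string.IsNullOrEmpty(mode) ? "" : mode + " ";
        return prefix + "\"" + pdfPath + "\" -";
    }

    // Runs pdftotext on one PDF and returns whatever it wrote to stdout.
    public static string ExtractText(string pdfPath, string mode)
    {
        var startInfo = new ProcessStartInfo(PdfToTextPath, BuildArguments(pdfPath, mode))
        {
            UseShellExecute = false,
            RedirectStandardOutput = true,
            CreateNoWindow = true
        };
        using (Process process = Process.Start(startInfo))
        {
            string text = process.StandardOutput.ReadToEnd();
            process.WaitForExit();
            return text;
        }
    }

    // Returns the PDFs in a directory whose extracted text contains the
    // search term (case-insensitive). The extractor is passed in so a
    // different text extraction method can be swapped in later.
    public static List<string> FindMatches(string directory, string term,
                                           Func<string, string> extract)
    {
        var matches = new List<string>();
        foreach (string pdf in Directory.GetFiles(directory, "*.pdf"))
        {
            string text = extract(pdf);
            if (text.IndexOf(term, StringComparison.OrdinalIgnoreCase) >= 0)
            {
                matches.Add(pdf);
            }
        }
        return matches;
    }
}
```

A call might look like PdfTextSearch.FindMatches(@"C:\MyPdfs", "invoice", p => PdfTextSearch.ExtractText(p, "-layout")), where the folder and search term are placeholders. Scanned PDFs that contain only images will produce no text this way.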

To learn more about downloading/installing it, I suggest watching Xpdf - Command Line Utility for PDF Files - Part 1, a 5-minute Experts Exchange video Micro Tutorial:
https://www.experts-exchange.com/VP_213.html

Another 5-minute EE video Micro Tutorial, Xpdf - Convert PDF Files to Plain Text Files - Part 3, explains the specific tool you need, PDFtoText:
https://www.experts-exchange.com/VP_217.html

Other parts of my Xpdf video series are not relevant for this particular project, but here they are in case you need them for future projects:

Xpdf - Extract Images from PDF Files - Part 2
https://www.experts-exchange.com/VP_215.html

Xpdf - PDFinfo - Command Line Utility to Retrieve Page Count and Other Information from PDF Files
https://www.experts-exchange.com/videos/1098/

Xpdf - PDFdetach - Command Line Utility to Detach Attachments from PDF Files
https://www.experts-exchange.com/videos/1118/

There are five different options for creating the text in PDFtoText:

-layout
-lineprinter
-raw
-table
<null>, which is the default

I find that -layout usually works best for me, but you should experiment with your particular PDFs to see which output option creates files that work best in your program.

Regards, Joe
fnjl

ASKER

Thank you. Your response was very helpful. I have thousands of PDFs that are copies of emails, and I want to extract the FROM:, TO:, SUBJECT:, etc. fields from them. I believe you are telling me that I will need to convert each PDF to text and then search the text for those fields. If there is a simpler solution, or any C# code you know of that maps this out, I would appreciate it; otherwise, I will keep trying to figure it out myself. Thanks again for your help.
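
One way the "convert to text, then interrogate the text" step could look in C# is sketched below. It assumes the converted text has each field on its own line in the form "From: ...", "To: ...", "Subject: ..." and so on, which depends on how the emails were printed to PDF and should be checked against real output. EmailHeaderParser and the list of field names are hypothetical, chosen only for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class EmailHeaderParser
{
    // Matches a line such as "Subject: Quarterly report".
    // Assumption: each field starts its own line and is followed by a colon.
    private static readonly Regex HeaderLine = new Regex(
        @"^[ \t]*(From|To|Cc|Subject|Sent|Date)[ \t]*:[ \t]*(.*)$",
        RegexOptions.Multiline | RegexOptions.IgnoreCase);

    // Returns the first value found for each field. Later occurrences
    // (for example in a quoted reply further down) are ignored.
    public static Dictionary<string, string> Parse(string text)
    {
        var headers = new Dictionary<string, string>(StringComparer.OrdinalIgnoreCase);
        foreach (Match match in HeaderLine.Matches(text))
        {
            string name = match.Groups[1].Value;
            // Trim also removes a trailing '\r' left over from Windows line endings.
            string value = match.Groups[2].Value.Trim();
            if (!headers.ContainsKey(name))
            {
                headers[name] = value;
            }
        }
        return headers;
    }
}
```

Calling EmailHeaderParser.Parse on the text of one converted PDF gives a dictionary that can be read with keys such as "From" or "Subject"; lookups ignore case. Fields that wrap onto a second line would be cut off at the first line with this pattern.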
ASKER CERTIFIED SOLUTION
Joe Winograd