C# routines to search PDF files

fnjl
I would like to be able to search a directory of PDF documents in C# and produce a list of all documents that contain specific text.

Joe Winograd, Developer
Fellow 2017
Most Valuable Expert 2018

Commented:
Two ideas for you:

(1) iTextSharp is the .NET port of iText:
http://sourceforge.net/projects/itextsharp/

It is a robust library for working with PDF files, and it targets C# and other .NET languages (while iText itself targets Java).

(2) Xpdf is a set of eight command line executables:
http://www.foolabs.com/xpdf/

The only one you'll need is pdftotext.exe, which converts PDF files to plain text. You may download the package here:
http://www.foolabs.com/xpdf/download.html

To learn more about downloading/installing it, I suggest watching Xpdf - Command Line Utility for PDF Files - Part 1, a 5-minute Experts Exchange video Micro Tutorial:
http://www.experts-exchange.com/VP_213.html

Another 5-minute EE video Micro Tutorial, Xpdf - Convert PDF Files to Plain Text Files - Part 3, explains the specific tool you need, PDFtoText:
http://www.experts-exchange.com/VP_217.html

Other parts of my Xpdf video series are not relevant for this particular project, but here they are in case you need them for future projects:

Xpdf - Extract Images from PDF Files - Part 2
http://www.experts-exchange.com/VP_215.html

Xpdf - PDFinfo - Command Line Utility to Retrieve Page Count and Other Information from PDF Files
http://www.experts-exchange.com/videos/1098/

Xpdf - PDFdetach - Command Line Utility to Detach Attachments from PDF Files
http://www.experts-exchange.com/videos/1118/

There are five different options for laying out the text that PDFtoText produces:

-layout
-lineprinter
-raw
-table
(no option), which is the default

I find that -layout usually works best for me, but you should experiment with your particular PDFs to see which output option creates files that work best in your program.

Regards, Joe
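
For readers who want a starting point in C#, here is a minimal sketch of the pdftotext approach described above. It is not code from this thread. It assumes pdftotext is on the PATH, that passing "-" as the text-file argument sends the text to standard output, and a runtime where ProcessStartInfo.ArgumentList is available (.NET Core 2.1 or later). The names PdfToTextRunner, BuildStartInfo, ExtractText, and FindPdfsContaining are made up for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.IO;

public static class PdfToTextRunner
{
    // Builds the command line: pdftotext -layout <pdf> -
    // Assumption: "pdftotext" resolves via PATH; pass a full path otherwise.
    public static ProcessStartInfo BuildStartInfo(string pdfPath, string exePath = "pdftotext")
    {
        var psi = new ProcessStartInfo(exePath)
        {
            RedirectStandardOutput = true, // capture the extracted text
            UseShellExecute = false,       // required when redirecting output
            CreateNoWindow = true,
        };
        psi.ArgumentList.Add("-layout");   // the option suggested above; try the others too
        psi.ArgumentList.Add(pdfPath);     // ArgumentList handles spaces in the path
        psi.ArgumentList.Add("-");         // "-" = write text to stdout instead of a file
        return psi;
    }

    // Runs pdftotext on one PDF and returns the extracted text.
    public static string ExtractText(string pdfPath, string exePath = "pdftotext")
    {
        using var process = Process.Start(BuildStartInfo(pdfPath, exePath))
            ?? throw new InvalidOperationException("Could not start pdftotext.");
        string text = process.StandardOutput.ReadToEnd();
        process.WaitForExit();
        if (process.ExitCode != 0)
            throw new IOException($"pdftotext exited with code {process.ExitCode} for {pdfPath}");
        return text;
    }

    // Returns the PDFs in one directory whose extracted text contains searchText.
    public static List<string> FindPdfsContaining(string directory, string searchText)
    {
        var matches = new List<string>();
        foreach (string pdf in Directory.EnumerateFiles(directory, "*.pdf"))
        {
            string text = ExtractText(pdf);
            if (text.Contains(searchText, StringComparison.OrdinalIgnoreCase))
                matches.Add(pdf);
        }
        return matches;
    }
}
```

Calling PdfToTextRunner.FindPdfsContaining with a folder and a phrase would then return the matching file names. A phrase that wraps across two lines in the extracted text will not be found by a plain substring test, so it may be worth normalizing whitespace first. Scanned PDFs without a text layer will produce no usable text at all.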

Author

Commented:
Thank you. Your response was very helpful. What I have is thousands of PDFs that are copies of emails. I want to be able to extract the FROM:, TO:, SUBJECT:, etc. What I believe you are pointing me to is that I will need to convert each PDF to text and then interrogate the text for the answers. If you know of a simpler solution, or of any C# code that maps this out, it would be appreciated. Otherwise, I will keep trying to figure it out myself. Thanks again for your help.
Joe Winograd, Developer
Commented:
That's exactly the sort of thing you can program. I've done it many times: calling pdftotext.exe to create a text file for each PDF, then parsing the text file to find whatever you're looking for, such as FROM:, TO:, SUBJECT:, etc. Most of my programs are used by clients to process thousands of PDFs, sometimes in just a single folder, but often recursing into subfolders to an unlimited depth.

I don't know C#, but I can say that it's relatively straightforward to program in other languages that have strong looping and searching/parsing primitives. Sorry I can't help you with C# code, but I jumped into the question (even though its title says C#) because I thought the real key for a C# programmer would be to know about tools like iTextSharp and PDFtoText. In any case, you're welcome, and good luck with your program. It should be fun to develop!

Regards, Joe
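
The parsing step described above might look roughly like this in C#. This is a sketch, not code from the thread. It assumes pdftotext has already written a .txt file for each PDF, and that each header sits on its own line in the form "From: ...", "To: ...", or "Subject: ...". Real email printouts may wrap long headers or use other labels, so the pattern will likely need adjusting. The names EmailHeaderParser, Parse, and ParseFolder are made up for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text.RegularExpressions;

public static class EmailHeaderParser
{
    // Matches one header line, e.g. "Subject: Quarterly report".
    // [ \t] is used instead of \s so an empty header cannot swallow the next line.
    private static readonly Regex HeaderLine = new Regex(
        @"^[ \t]*(FROM|TO|SUBJECT)[ \t]*:[ \t]*(.*?)[ \t\r]*$",
        RegexOptions.IgnoreCase | RegexOptions.Multiline);

    // Returns the first FROM, TO, and SUBJECT values found in the text.
    // Keys are stored in upper case; missing headers are simply absent.
    public static Dictionary<string, string> Parse(string text)
    {
        var headers = new Dictionary<string, string>(StringComparer.OrdinalIgnoreCase);
        foreach (Match m in HeaderLine.Matches(text))
        {
            string name = m.Groups[1].Value.ToUpperInvariant();
            if (!headers.ContainsKey(name))
                headers[name] = m.Groups[2].Value;
        }
        return headers;
    }

    // Parses every .txt file under root, recursing into subfolders to any depth.
    // The result maps each text file's path to its parsed headers.
    public static Dictionary<string, Dictionary<string, string>> ParseFolder(string root)
    {
        var results = new Dictionary<string, Dictionary<string, string>>();
        foreach (string path in Directory.EnumerateFiles(root, "*.txt", SearchOption.AllDirectories))
            results[path] = Parse(File.ReadAllText(path));
        return results;
    }
}
```

Only the first occurrence of each header is kept, on the assumption that the real headers appear at the top of the page and that later matches are quoted replies. If the PDFs contain whole threads, collecting every match may be more useful.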
