C# routines to search PDF files

I would like to be able to search a directory of PDF documents in C# and produce a list of all documents that contain specific text.

Joe Winograd (Developer) commented:
Two ideas for you:

(1) iTextSharp is the .NET port of iText:

It is a robust library for working with PDF files, geared towards C# (while iText itself is geared towards Java).

(2) Xpdf is a set of eight command line executables:

The only one you'll need is pdftotext.exe, which converts PDF files to plain text. You may download the package here:

To learn more about downloading/installing it, I suggest watching Xpdf - Command Line Utility for PDF Files - Part 1, a 5-minute Experts Exchange video Micro Tutorial:

Another 5-minute EE video Micro Tutorial, Xpdf - Convert PDF Files to Plain Text Files - Part 3, explains the specific tool you need, PDFtoText:

Other parts of my Xpdf video series are not relevant for this particular project, but here they are in case you need them for future projects:

Xpdf - Extract Images from PDF Files - Part 2

Xpdf - PDFinfo - Command Line Utility to Retrieve Page Count and Other Information from PDF Files

Xpdf - PDFdetach - Command Line Utility to Detach Attachments from PDF Files

There are five different options for creating the text in PDFtoText:

<null>, which is the default

I find that -layout usually works best for me, but you should experiment with your particular PDFs to see which output option creates files that work best in your program. Regards, Joe
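For the original question (list every PDF in a directory that contains some text), the pdftotext approach could be sketched in C# roughly as follows. This is only a minimal sketch under some assumptions: the class and method names (PdfSearch, ExtractText, FindMatches) are made up for illustration, the path to pdftotext.exe is supplied by the caller, and the text is read from standard output by passing "-" as the output file instead of writing a text file per PDF. It uses -layout only because that is the option recommended above; other options may suit your PDFs better.

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.IO;

public static class PdfSearch
{
    // Builds the pdftotext command line: -layout mode, with "-" as the
    // output file so the text goes to standard output.
    public static string BuildArguments(string pdfPath)
    {
        return "-layout \"" + pdfPath + "\" -";
    }

    // Runs pdftotext on one PDF and returns whatever text it prints.
    // pdftotextExe is the full path to pdftotext.exe (an assumption:
    // adjust it to wherever you unpacked Xpdf).
    public static string ExtractText(string pdftotextExe, string pdfPath)
    {
        var startInfo = new ProcessStartInfo
        {
            FileName = pdftotextExe,
            Arguments = BuildArguments(pdfPath),
            RedirectStandardOutput = true,
            UseShellExecute = false,
            CreateNoWindow = true
        };
        using (var process = Process.Start(startInfo))
        {
            // Read all output before waiting, so a full pipe cannot block the tool.
            string text = process.StandardOutput.ReadToEnd();
            process.WaitForExit();
            return text;
        }
    }

    // Case-insensitive substring test on the extracted text.
    public static bool ContainsText(string text, string searchTerm)
    {
        return text.IndexOf(searchTerm, StringComparison.OrdinalIgnoreCase) >= 0;
    }

    // Walks the folder (including subfolders) and collects the PDFs
    // whose extracted text contains the search term.
    public static List<string> FindMatches(string pdftotextExe, string folder, string searchTerm)
    {
        var matches = new List<string>();
        foreach (string pdf in Directory.EnumerateFiles(folder, "*.pdf", SearchOption.AllDirectories))
        {
            if (ContainsText(ExtractText(pdftotextExe, pdf), searchTerm))
            {
                matches.Add(pdf);
            }
        }
        return matches;
    }
}
```

A call such as PdfSearch.FindMatches(@"C:\xpdf\pdftotext.exe", @"C:\docs", "invoice") would then return the matching file paths (both paths here are placeholders). Scanned PDFs that contain only images will produce no text and so will never match.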
fnjl (Author) commented:
Thank you. Your response was very helpful. What I have is thousands of PDFs that are copies of emails. I want to be able to extract the FROM:, TO:, SUBJECT:, etc. What I believe you are pointing me to is that I will need to convert each PDF to text and then interrogate the text for the answers. If you know of a simpler solution, or any C# code that maps this out, it would be appreciated; otherwise, I will keep trying to figure it out myself. Thanks again for your help.
Joe Winograd (Developer) commented:
That's exactly the sort of thing you can program. I've done it many times: call pdftotext.exe to create a text file for each PDF, then parse the text file to find whatever you're looking for, such as FROM:, TO:, SUBJECT:, etc. Most of my programs are used by clients to process thousands of PDFs, sometimes in just a single folder, but often recursing into subfolders to an unlimited depth. I don't know C#, but I can say that it's relatively straightforward to program in other languages that have strong looping and searching/parsing primitives. Sorry I can't help you with C# code, but I jumped into the question (even though its title says C#) because I thought the real key for a C# programmer would be to know about libraries and tools like iTextSharp and PDFtoText. In any case, you're welcome, and good luck with your program. It should be fun to develop! Regards, Joe
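The parsing step described above might look something like this in C#. It is a sketch, not a tested solution for any particular set of emails: it assumes each header sits at the start of its own line in the extracted text (for example "From: someone@example.com"), keeps only the first occurrence of each header, and uses a made-up class name, EmailHeaderParser. Real PDFs of emails may wrap long headers across lines or lay them out differently, so the pattern will likely need adjusting.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class EmailHeaderParser
{
    // Matches "From:", "To:" or "Subject:" at the start of a line
    // (optionally indented) and captures the rest of that line.
    private static readonly Regex HeaderPattern = new Regex(
        @"^[ \t]*(From|To|Subject):[ \t]*(.*)$",
        RegexOptions.IgnoreCase | RegexOptions.Multiline);

    // Returns the first value found for each header; lookups on the
    // result are case-insensitive ("from" and "FROM" both work).
    public static Dictionary<string, string> Parse(string text)
    {
        var headers = new Dictionary<string, string>(StringComparer.OrdinalIgnoreCase);
        foreach (Match match in HeaderPattern.Matches(text))
        {
            string name = match.Groups[1].Value;
            if (!headers.ContainsKey(name))
            {
                // Trim also removes a trailing '\r' from Windows line endings.
                headers[name] = match.Groups[2].Value.Trim();
            }
        }
        return headers;
    }
}
```

Here text would be the plain text that pdftotext produced for one PDF, whether read from the generated text file or captured directly from the tool's output.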
