I would like to be able to search a directory of PDF documents in C# and produce a list of all documents that contain specific text.
Two ideas for you:

(1) iTextSharp is the .NET port of iText:

It is a robust library for working with PDF files, and it targets .NET languages such as C# (iText itself targets Java).

(2) Xpdf is a set of eight command line executables:

The only one you'll need is pdftotext.exe, which converts PDF files to plain text. You may download the package here:

To learn more about downloading/installing it, I suggest watching Xpdf - Command Line Utility for PDF Files - Part 1, a 5-minute Experts Exchange video Micro Tutorial:

Another 5-minute EE video Micro Tutorial, Xpdf - Convert PDF Files to Plain Text Files - Part 3, explains the specific tool you need, PDFtoText:

There are five different options for creating the text in PDFtoText:

<null>, which is the default

I find that -layout usually works best for me, but you should experiment with your particular PDFs to see which output option creates files that work best in your program. Regards, Joe
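For a C# starting point, here is a minimal sketch of the pdftotext approach described above. It is not a tested production tool. It assumes pdftotext.exe is on the PATH, passes -layout, and uses "-" as the output file name so that the text comes back on standard output instead of being written to a .txt file. The class and method names (PdfTextSearch, ExtractText, ContainsText, FindPdfsContaining) are made up for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.IO;

public static class PdfTextSearch
{
    // Assumption: pdftotext.exe is on the PATH. Otherwise put its full path here.
    private const string PdfToTextExe = "pdftotext.exe";

    // Runs pdftotext with -layout and returns the extracted text.
    // The trailing "-" asks pdftotext to write the text to standard output.
    public static string ExtractText(string pdfPath)
    {
        var startInfo = new ProcessStartInfo
        {
            FileName = PdfToTextExe,
            Arguments = "-layout \"" + pdfPath + "\" -",
            RedirectStandardOutput = true,
            UseShellExecute = false,   // required when redirecting output
            CreateNoWindow = true
        };

        using (Process process = Process.Start(startInfo))
        {
            // Read all output before waiting, so a full pipe cannot block the child.
            string text = process.StandardOutput.ReadToEnd();
            process.WaitForExit();
            if (process.ExitCode != 0)
            {
                throw new InvalidOperationException(
                    "pdftotext failed for " + pdfPath + " (exit code " + process.ExitCode + ")");
            }
            return text;
        }
    }

    // Case-insensitive substring test. Note that a phrase split across
    // two lines of the extracted text will not be found by this check.
    public static bool ContainsText(string text, string searchText)
    {
        return text.IndexOf(searchText, StringComparison.OrdinalIgnoreCase) >= 0;
    }

    // Returns the paths of all PDFs under 'folder' (including subfolders)
    // whose extracted text contains 'searchText'.
    public static List<string> FindPdfsContaining(string folder, string searchText)
    {
        var matches = new List<string>();
        foreach (string pdfPath in Directory.EnumerateFiles(
                     folder, "*.pdf", SearchOption.AllDirectories))
        {
            if (ContainsText(ExtractText(pdfPath), searchText))
            {
                matches.Add(pdfPath);
            }
        }
        return matches;
    }

    public static void Main(string[] args)
    {
        if (args.Length < 2)
        {
            Console.WriteLine("Usage: PdfTextSearch <folder> <text>");
            return;
        }
        foreach (string path in FindPdfsContaining(args[0], args[1]))
        {
            Console.WriteLine(path);
        }
    }
}
```

Scanned PDFs that contain only images will produce little or no text this way, so they would need OCR first. Swap -layout for another output option if it suits your files better.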


Thank you. Your response was very helpful. What I have is thousands of PDFs that are copies of emails. I want to be able to extract the FROM:, TO:, SUBJECT:, etc. What I believe you are pointing me to is that I will need to convert each PDF to text and then interrogate the text for the answers. If there is a simpler solution, or any C# code that maps this out that you know of, it would be appreciated. Otherwise, I will keep trying to figure it out myself. Thanks again for your help.
That's exactly the sort of thing you can program. I've done it many times: call pdftotext.exe to create a text file for each PDF, then parse the text file to find whatever you're looking for, such as FROM:, TO:, SUBJECT:, etc. Most of my programs are used by clients to process thousands of PDFs, sometimes in just a single folder, but often recursing into subfolders to an unlimited depth. I don't know C#, but I can say that it's relatively straightforward to program in other languages that have strong looping and searching/parsing primitives. Sorry I can't help you with C# code, but I jumped into the question (even though its title says C#) because I thought the real key for a C# programmer would be to know about tools like iTextSharp and PDFtoText. In any case, you're welcome, and good luck with your program. It should be fun to develop! Regards, Joe
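The parsing step described above could look roughly like this in C#. This is a sketch under assumptions, not code from any of the programs mentioned. It assumes each header sits on a single line of the extracted text in the form "From: ...", "To: ..." or "Subject: ...", and that the first occurrence of each is the one you want. The names EmailHeaderParser and ParseHeaders are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class EmailHeaderParser
{
    // Matches a line that begins (after optional indentation) with
    // From:, To: or Subject: in any letter case, and captures the rest
    // of that line. [ \t] is used instead of \s so that an empty header
    // cannot swallow the following line. \r? tolerates Windows line endings.
    private static readonly Regex HeaderLine = new Regex(
        @"^[ \t]*(From|To|Subject)[ \t]*:[ \t]*(.*?)[ \t]*\r?$",
        RegexOptions.IgnoreCase | RegexOptions.Multiline);

    // Returns the first From/To/Subject value found in the text.
    // Keys are compared case-insensitively, so "FROM" and "From" are the same key.
    public static Dictionary<string, string> ParseHeaders(string text)
    {
        var headers = new Dictionary<string, string>(StringComparer.OrdinalIgnoreCase);
        foreach (Match match in HeaderLine.Matches(text))
        {
            string name = match.Groups[1].Value;
            if (!headers.ContainsKey(name))
            {
                headers[name] = match.Groups[2].Value;
            }
        }
        return headers;
    }

    public static void Main()
    {
        // Made-up sample standing in for the text pdftotext would produce.
        string sample =
            "From: alice@example.com\r\n" +
            "To: bob@example.com\r\n" +
            "Subject: Quarterly report\r\n" +
            "\r\n" +
            "Body text goes here.\r\n";

        Dictionary<string, string> headers = ParseHeaders(sample);
        foreach (string name in new[] { "From", "To", "Subject" })
        {
            string value;
            Console.WriteLine(name + ": " +
                (headers.TryGetValue(name, out value) ? value : "(not found)"));
        }
    }
}
```

Header values that wrap onto a second line (a long TO: list, for example) are not handled here. How the wrapping looks depends on the output option you pick in PDFtoText, so check a few converted files before settling on the pattern.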

