fnjl asked:

C# routines to search PDF files

I would like to be able to search a directory of PDF documents in C# and produce a list of all documents that contain specific text.
Document Management

8/22/2022 - Mon
Joe Winograd

Two ideas for you:

(1) iTextSharp is the .NET port of iText:
http://sourceforge.net/projects/itextsharp/

It is a robust library for working with PDF files, and it is geared towards C# (while iText is geared towards Java).

(2) Xpdf is a set of eight command line executables:
http://www.foolabs.com/xpdf/

The only one you'll need is pdftotext.exe, which converts PDF files to plain text. You may download the package here:
http://www.foolabs.com/xpdf/download.html

To learn more about downloading/installing it, I suggest watching Xpdf - Command Line Utility for PDF Files - Part 1, a 5-minute Experts Exchange video Micro Tutorial:
https://www.experts-exchange.com/VP_213.html

Another 5-minute EE video Micro Tutorial, Xpdf - Convert PDF Files to Plain Text Files - Part 3, explains the specific tool you need, PDFtoText:
https://www.experts-exchange.com/VP_217.html

Other parts of my Xpdf video series are not relevant for this particular project, but here they are in case you need them for future projects:

Xpdf - Extract Images from PDF Files - Part 2
https://www.experts-exchange.com/VP_215.html

Xpdf - PDFinfo - Command Line Utility to Retrieve Page Count and Other Information from PDF Files
https://www.experts-exchange.com/videos/1098/

Xpdf - PDFdetach - Command Line Utility to Detach Attachments from PDF Files
https://www.experts-exchange.com/videos/1118/

PDFtoText offers five different options for laying out the extracted text:

-layout
-lineprinter
-raw
-table
<null>, which is the default

I find that -layout usually works best for me, but you should experiment with your particular PDFs to see which output option creates files that work best in your program.

Regards, Joe
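
To make the pdftotext approach concrete for the original question, here is a minimal C# sketch rather than a definitive implementation. It assumes pdftotext.exe from the Xpdf package is reachable on the PATH (adjust PdfToTextPath otherwise) and relies on pdftotext writing to standard output when "-" is given as the output file. The class name PdfTextSearch and its methods are hypothetical names chosen for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.IO;

public static class PdfTextSearch
{
    // Assumption: pdftotext.exe is on the PATH; otherwise set a full path here.
    public static string PdfToTextPath = "pdftotext.exe";

    // Builds the command line: [option] "input.pdf" -
    // The trailing "-" asks pdftotext to write the text to standard output.
    public static string BuildArguments(string pdfPath, string layoutOption)
    {
        string option = string.IsNullOrEmpty(layoutOption) ? "" : layoutOption + " ";
        return option + "\"" + pdfPath + "\" -";
    }

    // Case-insensitive substring test on the extracted text.
    public static bool ContainsText(string text, string searchText)
    {
        return text.IndexOf(searchText, StringComparison.OrdinalIgnoreCase) >= 0;
    }

    // Runs pdftotext on one PDF and returns the extracted text.
    public static string ExtractText(string pdfPath, string layoutOption)
    {
        var startInfo = new ProcessStartInfo
        {
            FileName = PdfToTextPath,
            Arguments = BuildArguments(pdfPath, layoutOption),
            RedirectStandardOutput = true,
            UseShellExecute = false,
            CreateNoWindow = true
        };
        using (Process process = Process.Start(startInfo))
        {
            // Read the output before waiting, so a full pipe cannot block the child.
            string text = process.StandardOutput.ReadToEnd();
            process.WaitForExit();
            return text;
        }
    }

    // Returns the paths of all PDFs in the directory whose text contains searchText.
    public static List<string> FindPdfsContaining(string directory, string searchText, string layoutOption)
    {
        var matches = new List<string>();
        foreach (string pdfPath in Directory.EnumerateFiles(directory, "*.pdf"))
        {
            if (ContainsText(ExtractText(pdfPath, layoutOption), searchText))
            {
                matches.Add(pdfPath);
            }
        }
        return matches;
    }
}
```

A call such as PdfTextSearch.FindPdfsContaining(@"C:\Emails", "invoice", "-layout") would return the matching paths; the folder and search term are placeholders. Passing null or an empty string as the option uses the default output mode. A phrase that wraps across two lines in the extracted text will not match this simple substring test, which is one more reason to experiment with the output options.
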
fnjl

ASKER
Thank you. Your response was very helpful. What I have is thousands of PDFs that are copies of emails. I want to be able to extract the FROM:, TO:, SUBJECT:, etc. What I believe you are pointing me to is that I will need to convert each PDF to text and then interrogate the text for the answers. If you know of a simpler solution, or of any C# code that maps this out, it would be appreciated; otherwise, I will keep trying to figure it out myself. Thanks again for your help.
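
The "interrogate the text" step described above could be sketched in C# along the following lines. This is only an illustration under stated assumptions: it supposes that each header appears at the start of its own line in the converted text (for example "From: someone@example.com"), which depends on how the emails were printed to PDF and on the pdftotext output option used. The class name EmailHeaderParser and the list of header names are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class EmailHeaderParser
{
    // Matches lines such as "From: alice@example.com".
    // Only spaces and tabs are allowed around the colon, so an empty header
    // cannot swallow the following line as its value.
    private static readonly Regex HeaderLine = new Regex(
        @"^[ \t]*(From|To|Cc|Subject|Date|Sent)[ \t]*:[ \t]*(.*?)[ \t]*\r?$",
        RegexOptions.Multiline | RegexOptions.IgnoreCase);

    // Returns the first value found for each header name (lookups ignore case).
    public static Dictionary<string, string> Parse(string text)
    {
        var headers = new Dictionary<string, string>(StringComparer.OrdinalIgnoreCase);
        foreach (Match match in HeaderLine.Matches(text))
        {
            string name = match.Groups[1].Value;
            if (!headers.ContainsKey(name))
            {
                headers[name] = match.Groups[2].Value;
            }
        }
        return headers;
    }
}
```

Keeping only the first occurrence of each header is a deliberate choice here, since a quoted reply further down the page may repeat "From:" or "Subject:" lines. Headers that wrap onto a second line would need extra handling that this sketch does not attempt.
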
ASKER CERTIFIED SOLUTION
Joe Winograd
