fnjl
ASKER
Thank you. Your response was very helpful. What I have is thousands of PDFs that are copies of emails. I want to be able to extract the FROM:, TO:, SUBJECT:, etc. fields. What I believe you are pointing me to is that I will need to convert each PDF to text and then interrogate the text for the answers. If you know of a simpler solution, or any C# code that maps this out, it would be appreciated; otherwise, I will keep trying to figure it out myself. Thanks again for your help.
(1) iTextSharp is the .NET port of iText:
http://sourceforge.net/projects/itextsharp/
It is a robust library for working with PDF files, and it is geared towards C# (while iText is geared towards Java).
(2) Xpdf is a set of eight command line executables:
http://www.foolabs.com/xpdf/
The only one you'll need is pdftotext.exe, which converts PDF files to plain text. You may download the package here:
http://www.foolabs.com/xpdf/download.html
To learn more about downloading/installing it, I suggest watching Xpdf - Command Line Utility for PDF Files - Part 1, a 5-minute Experts Exchange video Micro Tutorial:
https://www.experts-exchange.com/VP_213.html
Another 5-minute EE video Micro Tutorial, Xpdf - Convert PDF Files to Plain Text Files - Part 3, explains the specific tool you need, PDFtoText:
https://www.experts-exchange.com/VP_217.html
Other parts of my Xpdf video series are not relevant for this particular project, but here they are in case you need them for future projects:
Xpdf - Extract Images from PDF Files - Part 2
https://www.experts-exchange.com/VP_215.html
Xpdf - PDFinfo - Command Line Utility to Retrieve Page Count and Other Information from PDF Files
https://www.experts-exchange.com/videos/1098/
Xpdf - PDFdetach - Command Line Utility to Detach Attachments from PDF Files
https://www.experts-exchange.com/videos/1118/
There are five different options for creating the text in PDFtoText:
-layout
-lineprinter
-raw
-table
<null> (no option given), which is the default
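As a rough illustration of how these options are passed, here is a minimal Python sketch that runs pdftotext over a folder of PDFs. It assumes pdftotext is on your PATH and accepts the usual "pdftotext [option] input.pdf output.txt" form; the function names build_command and convert_folder are made up for this example, and the same idea carries over to C# with Process.Start.

```python
import subprocess
from pathlib import Path

# The five output modes listed above; None stands for "no option" (the default).
MODES = {
    "layout": "-layout",
    "lineprinter": "-lineprinter",
    "raw": "-raw",
    "table": "-table",
    None: None,
}


def build_command(pdf_path, txt_path, mode="layout", exe="pdftotext"):
    """Return the argument list for one pdftotext call (hypothetical helper)."""
    if mode not in MODES:
        raise ValueError(f"unknown mode: {mode!r}")
    cmd = [exe]
    if MODES[mode]:
        cmd.append(MODES[mode])
    cmd += [str(pdf_path), str(txt_path)]
    return cmd


def convert_folder(folder, mode="layout", exe="pdftotext"):
    """Convert every *.pdf in folder to a .txt file beside it."""
    outputs = []
    for pdf in sorted(Path(folder).glob("*.pdf")):
        txt = pdf.with_suffix(".txt")
        # check=True raises CalledProcessError if pdftotext reports a failure.
        subprocess.run(build_command(pdf, txt, mode, exe), check=True)
        outputs.append(txt)
    return outputs
```

Calling convert_folder with a different mode on a handful of sample PDFs is a quick way to compare the output of each option.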
I find that -layout usually works best for me, but you should experiment with your particular PDFs to see which output option creates files that work best in your program.
Regards, Joe
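For the "interrogate the text" step, here is a minimal Python sketch of pulling the header fields out of a converted text file. It assumes each header sits on its own line in the form "Label: value" with English labels, which depends entirely on how the emails were printed to PDF; the name extract_headers and the label list are only illustrative, and values that wrap onto a second line are not handled. The same regular expression should work with Regex in C#.

```python
import re

# Assumed header labels; adjust to whatever your PDFs actually contain.
HEADER_RE = re.compile(
    r"^\s*(From|To|Cc|Subject|Sent|Date)\s*:\s*(.*)$", re.IGNORECASE
)


def extract_headers(text):
    """Return the first value found for each header label, keyed in lowercase."""
    headers = {}
    for line in text.splitlines():
        match = HEADER_RE.match(line)
        if match:
            key = match.group(1).lower()
            # setdefault keeps the first occurrence, so a later "From:" in a
            # quoted reply or in the message body does not overwrite it.
            headers.setdefault(key, match.group(2).strip())
    return headers


sample = (
    "From: Alice <alice@example.com>\n"
    "To: Bob <bob@example.com>\n"
    "Subject: Quarterly report\n"
    "\n"
    "Hi Bob,\n"
    "From: this later line is ignored\n"
)
print(extract_headers(sample))
```

Keeping only the first match per label is a simple way to avoid picking up headers from quoted replies further down the page, but it is worth checking against a few of your real files.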