PDF selective txt export from single PDF file

Hi,

I routinely get PDF files for review and analysis. What is the easiest way to search a giant PDF report for text that starts with "Path:" and "MD5:", extract ALL instances of those entries, and put them into an Excel table for analysis?
Vincent D Asked:

Joe Winograd, Fellow & MVE, Developer Commented:
Hi Vincent,
I would start by exporting all the text in each PDF to a plain text file. This 5-minute EE video Micro Tutorial explains where/how to download the Xpdf utilities:
Xpdf - Command Line Utility for PDF Files

This other 5-minute EE video Micro Tutorial explains how to use Xpdf's PDFtoText utility to export all the text in a PDF file to a plain text file:
Xpdf - Convert PDF Files to Plain Text Files

Once you have the text in a file, it is a simple matter to write a program/script that searches for any text, such as "Path:" and "MD5:", and then takes whatever action you want when it finds that text. I have written many programs that call PDFtoText to perform functions like this, such as this Gig here at Experts Exchange:
Rename PDF files based on content
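
As a rough illustration of calling PDFtoText from a program, here is a minimal Python sketch. It assumes the Xpdf `pdftotext` executable is installed and on the PATH; the function names `build_pdftotext_command` and `export_pdf_text` are made up for this example and are not part of Xpdf.

```python
import subprocess
from pathlib import Path


def build_pdftotext_command(pdf_path, txt_path, mode="layout"):
    """Build the argument list for the pdftotext command line.

    mode is "layout" or "raw", matching pdftotext's -layout and -raw options.
    """
    if mode not in ("layout", "raw"):
        raise ValueError("mode must be 'layout' or 'raw'")
    return ["pdftotext", f"-{mode}", str(pdf_path), str(txt_path)]


def export_pdf_text(pdf_path, mode="layout"):
    """Run pdftotext and return the path of the text file it wrote.

    Assumption: pdftotext is on the PATH. The text file is written next to
    the PDF with a .txt extension.
    """
    txt_path = Path(pdf_path).with_suffix(".txt")
    # check=True raises CalledProcessError if pdftotext reports a failure.
    subprocess.run(build_pdftotext_command(pdf_path, txt_path, mode), check=True)
    return txt_path
```

Calling `export_pdf_text("report.pdf")` would then produce `report.txt` in the same folder, ready to be searched.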

Regards, Joe

aikimark Commented:
Do you have Adobe Acrobat or just the reader?

If Acrobat, you should be able to use the Acrobat ActiveX object to directly consume the text.

If not, there are a variety of PDF utilities that will output the text to a .txt file.  You can then use FileSystemObject to read the .txt file and look for whatever values you want.
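
The same read-and-search idea can be sketched in Python instead of FileSystemObject. This is only an illustration: the function name `find_tagged_lines` is hypothetical, and it assumes each value sits on the same line as its "Path:" or "MD5:" label.

```python
def find_tagged_lines(lines, tags=("Path:", "MD5:")):
    """Return (tag, value) for every line that starts with one of the tags."""
    found = []
    for line in lines:
        stripped = line.strip()
        for tag in tags:
            if stripped.startswith(tag):
                # Keep only the text after the label.
                found.append((tag, stripped[len(tag):].strip()))
                break
    return found


if __name__ == "__main__":
    # "report.txt" is a placeholder name for the exported text file.
    with open("report.txt", encoding="utf-8", errors="replace") as handle:
        for tag, value in find_tagged_lines(handle):
            print(tag, value)
```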

It might be helpful if you uploaded a representative sample of one or more PDFs.

Vincent D, Author Commented:
I do have Adobe Acrobat XI Pro (11.0.23). So how do I use it to export/consume the txt?

Vincent D, Author Commented:
Looking to export all instances of "Path:" and "MD5:" to CSV or Excel so I can compare duplicates and review each unique pair of values. The report shows the path of each file and its MD5 hash value for every file that made it into the report.

Joe Winograd, Fellow & MVE, Developer Commented:
I suggest a program to do the following:

• call another program (API, CLI, SDK) to create a text file of the PDF (or a variable with the text), such as the Xpdf utility called PDFtoText that I mentioned earlier (I would use AutoHotkey to write the program, but you may, of course, use any language that is capable of making the command line call to PDFtoText)

• search for all instances of "Path:" and " MD5:" and append the text following them to a CSV file (note that the "following them" aspect may be tricky...could be on the same line...could be on the next line...could even be elsewhere...depends on the structure of the PDF file and the text-exporting program)

• place the MD5 values in column A and the Path values in column B so that it is easy to sort by either

• sort by the MD5 values, look for duplicates, and place "Dup" (or whatever flag you want) in column C if the MD5 value is a match to the value in the previous row

• optionally, delete rows that do not have any duplicates, so that the report contains only those files with duplicates
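
The CSV and duplicate-flagging steps above could be sketched in Python roughly as follows. This is not a finished solution: the function name `write_duplicate_report` is hypothetical, the input is assumed to be a list of (path, md5) pairs already pulled out of the text, and MD5 values are lower-cased so that hex strings differing only in case compare as equal.

```python
import csv


def write_duplicate_report(pairs, out_file):
    """Write MD5 (column A), Path (column B) and a Dup flag (column C).

    pairs is an iterable of (path, md5) tuples; out_file is an open text file.
    """
    # Sort by MD5 so that rows with the same hash end up next to each other.
    rows = sorted((md5.lower(), path) for path, md5 in pairs)
    writer = csv.writer(out_file)
    writer.writerow(["MD5", "Path", "Dup"])
    previous_md5 = None
    for md5, path in rows:
        # Flag a row when its MD5 matches the row directly above it.
        flag = "Dup" if md5 == previous_md5 else ""
        writer.writerow([md5, path, flag])
        previous_md5 = md5


if __name__ == "__main__":
    demo = [("c:\\a.dll", "AA"), ("c:\\b.dll", "BB"), ("c:\\c.dll", "aa")]
    # newline="" is what the csv module expects when writing files.
    with open("report.csv", "w", newline="", encoding="utf-8") as handle:
        write_duplicate_report(demo, handle)
```

The resulting file opens directly in Excel. Deleting the rows without duplicates is left out here because it is optional.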

This would make for a worthy EE Gig. Btw, as aikimark mentioned earlier, it would be very helpful to have a sample PDF to test with, but make sure that it doesn't contain any private/sensitive information. Regards, Joe

Vincent D, Author Commented:
I was able to export txt from PDF to txt as either basic txt, accessible txt or rich txt format. I now need to search txt file to find each pair of Path: and MD5: value. What is the easiest way to scan co considering I don't program or script?

I need to extract each pair of values and export to csv/excel for analysis etc.

Joe Winograd, Fellow & MVE, Developer Commented:
Post a sample of the text file with a few of the "Path:" and "MD5:" pairs, being careful to replace any private/sensitive text with test data.

Btw, what method/product/technique did you use to export the text from the PDF file?

> What is the easiest way to scan co considering I don't program or script?

What do you mean by "scan co"?

Vincent D, Author Commented:
"co" was a typo; please ignore it.

Vincent D, Author Commented:
I used Adobe Acrobat to export to txt and rtf

Vincent D, Author Commented:
Example of what I am trying to get out of txt file....

299.
log3poc.dll
PID(s): 12, 104, 107, 212
Path: c:\yadayada\log3poc.dll
MD5: 123abc4hr8ri4jrjf8fj4jdidjrn (real data is hex value aka 0-9 or A - F)

Vincent D, Author Commented:
I want to pull out ALL paired instances of "Path:" and "MD5:" and export them to CSV/Excel so they pair up correctly, like below. Each path points to the file at the end of the path, and each MD5 is the hash of the file named in the path directly above it.

Example

Column 1            Column 2
Path of file 1      MD5 hash of file 1
Path of file 2      MD5 hash of file 2
Path of file 3      MD5 hash of file 3
Path of file 4      MD5 hash of file 4

Joe Winograd, Fellow & MVE, Developer Commented:
Does the Adobe Acrobat export to TXT always produce the sequence that you show above, i.e., two lines in a row like this:

Path: c:\foldername\filename.filetype
MD5: 32 hex characters
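
If the MD5 values really are 32 hex characters, a program can use that to sanity-check what it extracts. A small Python sketch, with `is_md5` as a hypothetical helper name:

```python
import re

# An MD5 digest written in hex is exactly 32 characters from 0-9 and A-F.
MD5_PATTERN = re.compile(r"[0-9A-Fa-f]{32}")


def is_md5(value):
    """Return True if value (ignoring surrounding spaces) looks like an MD5."""
    return MD5_PATTERN.fullmatch(value.strip()) is not None
```

A value that fails this check is a hint that the text export split or merged lines in a way the search did not expect.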

Vincent D, Author Commented:
"299." is not unique, and there are likely MANY duplicates, so I just want a full export of every pair of "Path:" and "MD5:".

Vincent D, Author Commented:
Yes, to my understanding.

Vincent D, Author Commented:
Path is sometimes one line long and sometimes 2 lines long but it is still after "Path:"

Vincent D, Author Commented:
I found a bunch of instances where "MD5:" is not on line directly below "Path:" but a few lines below. I am under assumption that searching txt for "Path:" first and then for next following "MD5:" should work...if that clarifies it better
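
That "find Path: first, then the next MD5:" idea could look something like the Python sketch below. It rests on the assumption stated above, that the next "MD5:" line always belongs to the most recent "Path:" line, which may not hold for every report. The function name `pair_paths_with_md5` is hypothetical. Lines found between the two labels are not guessed at; they are returned alongside the pair so they can be reviewed by hand (they might be the second line of a long path, or something unrelated).

```python
def pair_paths_with_md5(lines):
    """Pair each "Path:" line with the next "MD5:" line that follows it.

    Returns a list of (path, md5, skipped) tuples, where skipped holds the
    non-empty lines that appeared between the Path: line and the MD5: line.
    """
    pairs = []
    current_path = None
    skipped = []
    for line in lines:
        text = line.strip()
        if text.startswith("Path:"):
            # A new Path: starts a new record; an unmatched earlier Path: is dropped.
            current_path = text[len("Path:"):].strip()
            skipped = []
        elif text.startswith("MD5:") and current_path is not None:
            pairs.append((current_path, text[len("MD5:"):].strip(), skipped))
            current_path = None
        elif current_path is not None and text:
            # Could be a continuation of the path or unrelated text.
            skipped.append(text)
    return pairs
```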

aikimark Commented:
please post sample data

Joe Winograd, Fellow & MVE, Developer Commented:
> I found a bunch of instances where "MD5:" is not on line directly below "Path:" but a few lines below.

As I mentioned earlier, that's a result of how the export to TXT is done. That's why when I recommend the Xpdf utility called PDFtoText, I suggest experimenting with the different output options on your PDFs, such as -layout, -raw, etc.

> I am under assumption that searching txt for "Path:" first and then for next following "MD5:" should work.

May be a good assumption, but could be a bad one.

Do you want to learn how to program so that you can do this yourself or do you want someone to write the program/script for you and deliver a turnkey solution?

Vincent D, Author Commented:
I would like to learn how to myself

Vincent D, Author Commented:
For current project I am open to getting a turn key solution as I have stuff that requires my attention short term. Long term I would like to learn...

Joe Winograd, Fellow & MVE, Developer Commented:
> For current project I am open to getting a turn key solution

Then I presume that you have a budget for acquiring a turnkey solution, which would make it ideal for posting a Gig here at EE.

> I routinely get PDF files for review and analysis.

Since it's not a one-time task, I recommend that the extraction of text from the PDF be included in the program, so that you don't have to run Acrobat manually each time to create the text file.

> Path is sometimes one line long and sometimes 2 lines long but it is still after "Path:"

This raises two other questions. First, is the Path ever more than two lines? Second, are there ever intervening lines between the first and second (or subsequent) lines of the Path, such as:

C:\my PDF files\this is a long path across multiple lines in the PDF file\perhaps with many subfolders
This is actually a line of text between the first and second lines of the Path, not part of the Path.
\and here we finally have the filename.pdf

> I found a bunch of instances where "MD5:" is not on line directly below "Path:" but a few lines below.

These kinds of issues show how tricky it can be to develop a 100% correct solution. There are probably other such issues lurking that you haven't discovered yet, especially since you "routinely get" the files. This is why we've asked numerous times for samples, and even that may not be adequate, as it is common for a program to work on test files and then fail in the real world of production files. But you'll want to expose as many issues as possible in the dev/test/QA phases, which can be done only by attempting to process multiple production files. I can understand not wanting to post production files on the public Internet (which this thread is), as they will almost certainly contain private/sensitive data — another good reason for considering a Gig, where the correspondence is private while the Client works with the Freelancer. Anyway, for us to have any hope of developing a solution that will work for you in a large percentage of the cases, if not all cases, we'll need an adequate set of test files.

On a completely separate thought, I want to point out that one of the approaches that we often take here at Experts Exchange is to ferret out the "real" issue rather than simply answering the question that was posed. In this case, it got me to thinking — what are you really trying to accomplish from a report that contains file paths and MD5 values? The first answer that popped into my mind is looking for duplicate files. If so, there may be better ways than analyzing those PDF files, such as running a duplicate-file-finder program across the folders that are shown in those PDF reports. Of course, there could be other reasons for it, so please let us know what you're really trying to accomplish — and, if you're interested, we can brainstorm with you a better way to achieve the real objective. Regards, Joe
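
For what it's worth, the duplicate-file-finder idea can be sketched in a few lines of Python. This is only an illustration of the approach, not a replacement for a dedicated tool, and it assumes you have read access to the folders named in the reports. The function names `md5_of_file` and `find_duplicate_files` are made up for this example.

```python
import hashlib
import os
from collections import defaultdict


def md5_of_file(path, chunk_size=1 << 20):
    """Return the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_duplicate_files(root):
    """Map each MD5 that occurs more than once under root to its file paths."""
    by_hash = defaultdict(list)
    for folder, _subfolders, filenames in os.walk(root):
        for name in filenames:
            full_path = os.path.join(folder, name)
            by_hash[md5_of_file(full_path)].append(full_path)
    # Keep only hashes shared by two or more files.
    return {md5: sorted(paths) for md5, paths in by_hash.items() if len(paths) > 1}
```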

Joe Winograd, Fellow & MVE, Developer Commented:
I selected all of the comments that are helpful in solving the problem of searching for specific text in PDF files and then exporting the associated values into CSV/Excel format for analysis. I can't say that any post is the "Best" answer, so I selected the first post as the Accepted Solution and all the others as Assisted Solutions, and I split the points evenly between the two participating experts. Regards, Joe
Question has a verified solution.