vpnsol123

VBScript to extract text from a PDF

I have about 1000 PDFs that are named with a sequential number. The first text line in each PDF contains the name of the customer. I need a VBScript that will rename each PDF document using the value on that first line. For example:

Document currently named: D123456.PDF

The first line in the file says: Customer Name: John Smith

I need the document renamed: D123456 John Smith.PDF
aikimark

For one of my clients, I run PDFText.exe in a scheduled task, followed by a VBScript routine that renames (and moves) the file to a new location.  Do a search for PDFText.
Hi vpnsol123,
This problem piqued my interest and I'm writing a program to do it. I think the solution may be helpful for other EE members, so I've decided to write an article about it. I'll generalize the solution to make it more useful for a broader audience. Some examples of the generalizations:

(1) In your case, there are seven lead-in characters in the filename (you gave D123456 as an example), but the program allows any number of lead-in characters (it prompts the user for the number).

(2) In your case, the text in the PDF that contains the string for the new filename begins in column 16 (after "Customer Name: "), but the program allows it to start at any column number (it prompts the user for the column number).

(3) In your case, I think you're looking for the files to be stored in the same folder, but the program allows the renamed files to be stored elsewhere (it prompts the user for the source folder and the destination folder, which may or may not be the same).
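
The three generalizations above can be sketched as a small naming helper. This is only an illustration in Python, not the AutoHotkey script described in this post; the function name `build_new_name`, its parameters, and the choice to replace characters that Windows does not allow in filenames are all assumptions made for the example.

```python
import re
from pathlib import Path

# Characters that Windows does not allow in a filename.
INVALID_CHARS = re.compile(r'[<>:"/\\|?*]')


def build_new_name(old_name, first_line, lead_in=7, start_col=16):
    """Return the new filename for one PDF (hypothetical helper).

    old_name   -- current filename, e.g. "D123456.PDF"
    first_line -- the line of text that holds the customer name
    lead_in    -- number of leading filename characters to keep
    start_col  -- 1-based column where the customer name starts
    """
    path = Path(old_name)
    prefix = path.stem[:lead_in]
    # Columns are 1-based, Python slices are 0-based.
    customer = first_line[start_col - 1:].strip()
    # Assumption: swap illegal filename characters for underscores.
    customer = INVALID_CHARS.sub("_", customer)
    return f"{prefix} {customer}{path.suffix}"


if __name__ == "__main__":
    # "Customer Name: " is 15 characters, so the name starts in column 16.
    print(build_new_name("D123456.PDF", "Customer Name: John Smith"))
    # prints "D123456 John Smith.PDF"
```

To store the renamed file somewhere else, the result can be joined onto a destination folder, for example `Path(dest_folder) / build_new_name(...)`, where `dest_folder` is whatever folder the user picked.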

I'm writing the program in the AutoHotkey scripting language, calling the Xpdf package to convert the PDF file to text (so that it can extract the string for renaming the file). Both are excellent freeware packages. If you are interested in the solution, you'll need to download both. Here are the links:

http://www.autohotkey.com/
http://www.foolabs.com/xpdf/download.html

The Xpdf binaries (both 32-bit and 64-bit) contain seven tools. The only one I'll use is [pdftotext.exe]. The script looks for that file in [Program Files] and [Program Files (x86)] and exits if it doesn't find it. But you may put [pdftotext.exe] wherever you want and simply modify the AutoHotkey script with its location (I'll be providing the source code, of course).
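
As a rough sketch of that lookup (in Python rather than AutoHotkey, and not the actual script), the idea is to try each candidate folder in turn and give up if none contains the tool. The helper names `find_tool` and `default_folders` are hypothetical, and checking only the top level of each folder (no subfolder search) is an assumption.

```python
import os
from pathlib import Path


def find_tool(exe_name, folders):
    """Return the first existing path to exe_name, or None if not found."""
    for folder in folders:
        candidate = Path(folder) / exe_name
        if candidate.is_file():
            return candidate
    return None


def default_folders():
    """Folders to try, taken from the standard Windows environment variables."""
    names = ("ProgramFiles", "ProgramFiles(x86)")
    return [os.environ[name] for name in names if name in os.environ]


if __name__ == "__main__":
    tool = find_tool("pdftotext.exe", default_folders())
    if tool is None:
        raise SystemExit("pdftotext.exe not found; edit the folder list.")
    print(f"Using {tool}")
```

Passing the folder list in as a parameter keeps the "put it wherever you want" case down to a one-line change.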

I'm travelling tomorrow (Tuesday), but hope to be able to write and submit the article to EE on Wednesday or Thursday. I'll post back here when EE publishes it. Regards, Joe
vpnsol123

ASKER

Thanks all -

#aikimark - sounds good
#joewinograd - let me know when you have it documented
The way I launch this is from within a batch file (.cmd).  The first command is a For statement that runs PDFTEXT.EXE against any *.PDF file in the directory.  Following that, I invoke a VBScript (.VBS) file that iterates all the .TXT files in the directory, parsing the text and renaming/moving the files accordingly.

Within the VBScript file, I use RegExp to do the parsing.  However, you might be able to get away with a simpler (native VBScript statements) program to accomplish your parsing.
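
For anyone who wants to see the shape of that two-step flow, here is a minimal sketch in Python instead of batch plus VBScript. It is not the script running at the client site. The regular expression, the function names, and the decision to convert only page 1 are assumptions; `pdftotext` itself is invoked as `pdftotext [options] PDF-file text-file`, with `-f` and `-l` selecting the first and last page to convert.

```python
import re
import subprocess
from pathlib import Path

# Assumed pattern: a line of the form "Customer Name: <name>".
CUSTOMER_RE = re.compile(r"^\s*Customer Name:\s*(.+?)\s*$", re.MULTILINE)


def pdf_to_text(pdftotext_exe, pdf_path):
    """Step 1: convert page 1 of the PDF to a .txt file next to it."""
    txt_path = Path(pdf_path).with_suffix(".txt")
    subprocess.run(
        [str(pdftotext_exe), "-f", "1", "-l", "1", str(pdf_path), str(txt_path)],
        check=True,
    )
    return txt_path


def extract_customer(text):
    """Step 2a: pull the customer name out of the extracted text."""
    match = CUSTOMER_RE.search(text)
    return match.group(1) if match else None


def rename_from_text(pdf_path, text):
    """Step 2b: rename the PDF in place; returns the new path or None."""
    customer = extract_customer(text)
    if customer is None:
        return None  # leave files without a customer line untouched
    pdf_path = Path(pdf_path)
    target = pdf_path.with_name(f"{pdf_path.stem} {customer}{pdf_path.suffix}")
    pdf_path.rename(target)
    return target
```

A driver loop would call `pdf_to_text` for each `*.PDF` in the folder, read the resulting text file, and hand the text to `rename_from_text`. If the customer line always sits in a fixed position, plain string slicing would do in place of the regular expression.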
> let me know when you have it documented

Will do. The EE Page Editors (the folks responsible for refereeing/approving Articles) are very responsive. I don't recollect more than 24 hours ever going by for their review of the Articles I've written.

I want to verify two major assumptions with you:

(1) The filename (the portion before the ".pdf") will always have seven characters.

(2) The customer name will always begin in column 16 on the first line of the first page in the PDF file and will run until the end of the line.

If either of these assumptions is not true, I need to know ASAP. Thanks, Joe
vpnsol123,
I haven't heard back from you on my last two questions, but I went ahead and wrote the Article (and program) assuming the answer to both is YES. I submitted the Article and will let you know as soon as it is published...I'm hoping for later today, or worst case in 24-48 hours. As I mentioned before, the EE Page Editors have been extremely prompt in reviewing and publishing Articles. Regards, Joe
aikimark,
Could you please provide a link to PDFTEXT? When I do a Google search for it, it's not clear to me which hits are for the [pdftext.exe] that you use. As you know from my previous posts, I've been using Xpdf's [pdftotext.exe] for the same purpose, but I'm always on the lookout for new PDF tools and would like to take the [pdftext.exe] that you use for a spin. Thanks much, Joe
vpnsol123,
An update for you on the Article. Before the Page Editors had a chance to review my original submission, I submitted two modifications to it (yesterday). That likely reset the review time frame, so we're probably looking at Monday or Tuesday for publication. In the meantime, you could prepare for it by installing AutoHotkey and [pdftotext.exe], assuming that you're still interested in the solution. Btw, one of the two modifications to the script is that you no longer have to change it manually if you put [pdftotext.exe] in some other location besides [Program Files] or [Program Files (x86)]. If the script doesn't find it in either of those two folders, it now provides a Browse For File dialog so you may simply navigate to it. Regards, Joe
vpnsol123,

The Article was just published:
https://www.experts-exchange.com/Software/Misc/A_11173-How-To-Rename-Move-a-Batch-of-PDF-Files-Based-on-Contents-of-the-Files.html

I'd appreciate any feedback that you have on the Article itself, as well as the script/program. Please let me know if you think the Article is clear and easy to understand. Also, let me know how the program performs for you. As you can see in the Article, I tested it on just a dozen files (10 PDFs and 2 non-PDFs) and it took just one second to run. I'm very interested to know how it performs on your one thousand files. If you could post your Operation Completed dialog with the results of your run, that would be terrific. Thanks, Joe
I have to get into the server room in order to verify which version and source I used for PDFTEXT.EXE
aikimark,
Sounds good. Thanks, Joe
Hi vpnsol123,
I think it would be a good idea to close out this question, assuming, of course, that you're satisfied with the answer(s). The new issue that you raised over at my Article represents a significant requirement, taking it well above-and-beyond this one. This question requires only that the source files be renamed. The new feature requires that the source files be split into multiple, new files. I'll post a similar comment with some additional thoughts at the Article:
https://www.experts-exchange.com/Software/Misc/A_11173-How-To-Rename-Move-a-Batch-of-PDF-Files-Based-on-Contents-of-the-Files.html
Regards, Joe
I just checked the program that's running at my client site and PDFText tells me:
pdftotext version 3.02
Copyright 1996-2007 Glyph & Cog, LLC


=================
I posted the download links in this 2011 EE question:
http:Q_26760439.html#a34678112

http://www.foolabs.com/xpdf/download.html

ftp://ftp.foolabs.com/pub/xpdf/xpdf-3.02pl5-dos6.zip
OK, thanks for checking. Your earlier posts above said [PDFTEXT.EXE], so I figured it was something different, but there was a typo in those posts ("TO" missing in the middle), and what you really meant to say was [PDFTOTEXT.EXE], which is the same library that I've been using (current version is 3.03). You had me excited there for a moment. :)   Regards, Joe
vpnsol123,
I wrote and tested the new program during the weekend. All seems well in my test cases, although they are limited (will be great to see how it performs on your thousand files). I plan to write the new Article today or tomorrow and with some luck EE will publish it before the end of the week. I'll post some more information over at the current Article that will help you to prepare for the new program (which is also an AutoHotkey script, but requires an additional PDF toolkit). In the meantime, I think it would be a good idea to close out this question, unless you're waiting for additional answers/solutions. Regards, Joe
vpnsol123,
The new Article (with the new AutoHotkey script) was just published:
https://www.experts-exchange.com/Software/Misc/A_11211-How-To-Split-Rename-Move-a-Batch-of-PDF-Files-Based-on-Contents-of-the-Files.html

Please read it, run the program on your thousand files, and post the results at the new Article. Also, as mentioned above, I think it would be a good idea to close out this question, unless you're waiting for additional answers/solutions. Regards, Joe
ASKER CERTIFIED SOLUTION
Joe Winograd