
How To Rename-Move a Batch of PDF Files Based on Contents of the Files

Joe Winograd, Fellow&MVE
50+ years in computer industry. Everything from development to sales. CIO. Document imaging. EE MVE 2015, EE MVE 2016, EE FELLOW 2017.
Update 21-May-2015: I temporarily removed the source code and the code snippets to make major changes to the program. Regards, Joe

A recent question here at Experts Exchange piqued my interest, so I decided to provide a thorough solution and publish this Article about it. The Original Poster (OP) of the question has approximately one thousand PDF files containing 7-character sequential alphanumeric file names (and, of course, all of the file extensions are PDF). Although the OP did not state this, it is likely that the sequential alphanumerics represent unique identifiers for his customers, perhaps customer numbers. The alphanumeric file name is cryptic, in no way identifiable with the customer, so the OP would like the file name to contain the customer name in addition to the number. For example, a file might be named:

D123456.PDF

The OP would like this file to be renamed:

D123456 John Smith.PDF

The customer name always begins in column 16 on the first line of the first page in the PDF file (and runs to the end of the line). The OP wants an automated way to rename the thousand PDF files, based on the customer name in the contents of each file – in essence, a batch/mass rename. The program documented in this Article (and provided in source code) performs this function.

Two excellent freeware products are needed for this solution – the AutoHotkey scripting language (the program is written in this) and the Xpdf package to convert the PDF files to text (so the program can extract the customer names for renaming the files). Here are the links for downloading both:

http://ahkscript.org (also, see my EE article: AutoHotkey - Getting Started)
http://www.foolabs.com/xpdf/download.html

The Xpdf binaries (both 32-bit and 64-bit) contain seven tools, but the only one needed is pdftotext.exe. The script looks for this file in Program Files and Program Files (x86), but you may put pdftotext.exe wherever you want and simply navigate to it – the script gives you a file browse dialog if it doesn't find it in Program Files or Program Files (x86).

The program generalizes the solution by allowing any number of characters in the original file name (for the OP it is 7) and any starting column number for the string that will be in the new file name (for the OP it is 16). The latter is an area where significant generalization and/or customization can take place. For example, the string that will be in the new file name may be on some line other than the first; or the string may be in a variable column number preceded by a string that identifies it, such as Account Number:. Because the source code is provided, you may modify the way the program parses the text extracted from the PDF files in order to create the new file names. If you are an EE member who needs something different from a fixed column number on the first line of the first page (and you don't feel comfortable modifying the script), please post your requirement in a comment on this Article and (within reason) I'll modify the script for you and post it here.

For those interested in understanding how the script works, the remainder of this Article shows the entire script broken down into code snippets, with a description of what each snippet does, including screenshots where appropriate. This also acts as a form of documentation for the program.

Code snippet:
 
SetBatchLines, -1 ; run at maximum speed


What it does: Sets the script to run at maximum speed, i.e., no "sleeping" will occur in the program.

Code snippet:
 
temporarily removed


What it does: Checks to see if Xpdf's pdftotext.exe is located in C:\Program Files (x86)\xpdf\ or C:\Program Files\xpdf\. If not, it displays a message and provides a file browse dialog so the user may navigate to pdftotext.exe (or exit the program).
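As a rough illustration only (a Python sketch, not the AutoHotkey source, which is currently withdrawn), the lookup can be expressed as a search over candidate folders. The function name `find_pdftotext` and its parameters are hypothetical; the two default folders are the ones named above.

```python
import os

# Default folders named in the article; any other folder is supplied by the caller.
DEFAULT_FOLDERS = [
    r"C:\Program Files (x86)\xpdf",
    r"C:\Program Files\xpdf",
]

def find_pdftotext(candidates=DEFAULT_FOLDERS, exe_name="pdftotext.exe"):
    """Return the full path of the first pdftotext found, or None if there is none."""
    for folder in candidates:
        path = os.path.join(folder, exe_name)
        if os.path.isfile(path):
            return path
    return None  # the caller would then show a file browse dialog or exit
```

A `None` result corresponds to the case where the program displays its message and offers the file browse dialog.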

[Screenshots: pdftotext not found; find pdftotext]

Code snippet:
 
temporarily removed


What it does: Warns the user that existing files in the destination folder will be overwritten with no warning, and then gives the user the opportunity to exit or continue.

[Screenshot: Overwrite Warning]

Code snippet:
 
temporarily removed


What it does: Initializes some variables.

Code snippet:
 
temporarily removed


What it does: Asks the user for the number of characters in the source file names (before the .PDF extension). If the entry is not an integer and/or not greater than zero, it displays a message and gives the user the opportunity to try again or exit.
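The validation described here can be sketched in Python (an illustration only, not the withdrawn AutoHotkey code). The function name `parse_positive_int` and the wording of the error strings are assumptions; the two failure cases, not an integer and less than 1, are the ones described above.

```python
def parse_positive_int(entry):
    """Return (value, error). Exactly one of the two is None."""
    try:
        value = int(entry.strip())
    except ValueError:
        return None, "must be an integer"   # e.g. "abc" or "7.5"
    if value < 1:
        return None, "must be at least 1"   # e.g. "0" or "-3"
    return value, None
```

For the OP's files, `parse_positive_int("7")` yields `(7, None)`. On an error, the caller would show the message and let the user try again or exit.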

[Screenshots: Enter num first chars; Num first chars must be integer; Num first chars must be at least 1]

Code snippet:
 
temporarily removed


What it does: Asks the user for the starting column number of the string that will be appended to the current file name. If the entry is not an integer and/or not greater than zero, it displays a message and gives the user the opportunity to try again or exit.

[Screenshots: Enter starting column number; Column num must be integer; Column num must be at least 1]

Code snippet:
 
temporarily removed


What it does: Asks the user to enter the full path of the source folder. It allows the user to navigate/browse to it or type/paste it in. It looks for an ending backslash on the path name and if one was not entered, it appends one (in other words, it works whether or not the user includes the ending backslash in the path). It then checks to see if a source folder was entered, and if so, if the folder exists. If either is not true, it gives the user the opportunity to exit or continue. Note: whether or not the source folder can be reported as null with the Browse For Folder dialog depends on the operating system, so the program checks for it.
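The trailing-backslash handling can be sketched in Python (an illustration, not the AutoHotkey script). The name `normalize_folder` is hypothetical, and treating a whitespace-only entry as empty is an assumption.

```python
def normalize_folder(entry):
    """Return the entered path with a trailing backslash, or None if nothing was entered."""
    path = entry.strip()
    if not path:
        return None          # null/empty entry: the caller offers exit or retry
    if not path.endswith("\\"):
        path += "\\"         # append the backslash only when it is missing
    return path
```

With this shape, `C:\Invoices` and `C:\Invoices\` both normalize to the same string, so later code can simply concatenate a file name onto the folder. The existence check would follow as a separate step.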

[Screenshots: Enter source folder; Source folder must be specified]

Code snippet:
 
temporarily removed


What it does: Asks the user to enter the full path of the destination folder. It allows the user to navigate/browse to it or type/paste it in or create it. It looks for an ending backslash on the path name and if one was not entered, it appends one (in other words, it works whether or not the user includes the ending backslash in the path). It then checks to see if a destination folder was entered, and if so, if the folder exists, giving the user the opportunity to create it, exit, or try again to enter the name. Note: whether or not the destination folder can be reported as null with the Browse For Folder dialog depends on the operating system, so the program checks for it.

[Screenshot: Enter destination folder]

Code snippet:
 
temporarily removed


What it does: Checks to see if the source folder is the same as the destination folder. If it is, a message tells the user that the operation will be a file rename; if it isn't, a message tells the user that the operation will be a file move. It asks if the user wishes to continue or exit.
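The rename-versus-move decision can be sketched in Python (an illustration, not the script's own code). The name `operation_kind` is hypothetical. Comparing the folders case-insensitively and ignoring a trailing backslash is an assumption, chosen because Windows paths behave that way; `ntpath` is used so the comparison follows Windows rules on any platform.

```python
import ntpath  # Windows path rules, importable on any platform

def operation_kind(source_folder, destination_folder):
    """Return 'rename' when both folders are the same, otherwise 'move'."""
    def canonical(path):
        # normpath drops a trailing backslash; normcase lowercases the path
        return ntpath.normcase(ntpath.normpath(path))
    if canonical(source_folder) == canonical(destination_folder):
        return "rename"
    return "move"
```

The returned word is what the message to the user would report before asking whether to continue or exit.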

[Screenshots: Source Destination same - Rename; Source Destination different - Move]

Code snippet:
 
temporarily removed


What it does: Asks the user to confirm that the chosen parameters are correct, providing the option to continue or exit.

[Screenshot: Confirm params]

Code snippet:
 
temporarily removed


What it does: Initializes variables that are used to track operational statistics, which will be reported in the Operation Completed dialog.

Code snippet:
 
temporarily removed


What it does: Loops through all of the files in the source folder, sorted in file name order (ascending), ignoring any non-PDF files. It counts the number of PDF files and the number of non-PDF files. It displays a dialog box with a green progress bar that moves to the right during processing, also showing the percentage that is done and the name of the file currently being processed.
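The bookkeeping in this loop can be sketched in Python (an illustration only; the progress dialog itself is not modeled). The names `split_pdfs` and `percent_done` are hypothetical, and sorting without regard to case is an assumption.

```python
def split_pdfs(file_names):
    """Return (PDF names in ascending order, number of non-PDF names)."""
    pdfs = sorted(
        (name for name in file_names if name.lower().endswith(".pdf")),
        key=str.lower,
    )
    return pdfs, len(file_names) - len(pdfs)

def percent_done(processed, total):
    """Whole-number percentage for the progress bar."""
    return 100 * processed // total if total else 100
```

The list passed in would be the names found in the source folder. The first return value drives the loop and the second feeds the non-PDF count in the final statistics.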

[Screenshot: Progress Bar]

Code snippet:
 
temporarily removed


What it does: Calls Xpdf's pdftotext.exe to read the first page of the PDF file into a text file (with the same file name as the PDF file, but with a file extension of .TXT). The -f parameter specifies the first page to convert (1) and the -l specifies the last page to convert (also 1). The -layout parameter maintains the original physical layout of the text so that multiple lines are not concatenated into a single line.
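The call described above can be sketched in Python (not the script's own AutoHotkey code). The `-f`, `-l`, and `-layout` options are the ones named in the text; the function names and file paths are hypothetical.

```python
import subprocess

def pdftotext_command(exe_path, pdf_path, txt_path):
    """Build the argument list: first page only, physical layout preserved."""
    return [exe_path, "-f", "1", "-l", "1", "-layout", pdf_path, txt_path]

def convert_first_page(exe_path, pdf_path, txt_path):
    """Run pdftotext and report whether it exited without error."""
    result = subprocess.run(pdftotext_command(exe_path, pdf_path, txt_path))
    return result.returncode == 0
```

For example, `pdftotext_command("pdftotext.exe", "D123456.PDF", "D123456.TXT")` produces the equivalent of the command line `pdftotext.exe -f 1 -l 1 -layout D123456.PDF D123456.TXT`.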

Code snippet:
 
temporarily removed


What it does: After converting the first page of the PDF file to a text file, it looks in the first line of the text file for the string starting in the column number specified by the user and then concatenates the original file name with that string, putting a space between them (and it deletes the text file). It then renames/moves the file with its new file name. Note that AutoHotkey doesn't have a FileRename command, per se, but instead provides a FileMove command, which serves a dual purpose – it, in essence, does a file rename when the source and destination folders are the same, but does a file move when they are different.
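The name-building step can be sketched in Python (an illustration, not the AutoHotkey code). The name `new_file_name` is hypothetical. Trimming surrounding blanks from the extracted string, and leaving the name unchanged when nothing is found at that column, are assumptions.

```python
import os

def new_file_name(pdf_name, first_line, start_column):
    """Append the text that starts at start_column (1-based) to the original name."""
    stem, extension = os.path.splitext(pdf_name)
    appended = first_line[start_column - 1:].strip()  # column 16 is index 15
    if not appended:
        return pdf_name  # nothing found at that column: keep the name as-is
    return f"{stem} {appended}{extension}"
```

With the OP's values, a first line that has `John Smith` starting in column 16 turns `D123456.PDF` into `D123456 John Smith.PDF`. This sketch does not filter characters that are illegal in file names; a real implementation would need to handle them before the rename/move.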

Code snippet:
 
temporarily removed


What it does: Finalizes and formats all of the statistics from the operation and displays them in an Operation Completed dialog box.

[Screenshot: Operation Completed]

That's it! I hope this helps the OP as well as other EE members. If you find this article to be helpful, please click the thumbs-up icon below. This lets me know what is valuable for EE members and provides direction for future articles. Thanks very much! Regards, Joe
 
64 Comments
 

Expert Comment

by:vpnsol123
This is great!  I am getting an "Expected end of statement line 1 character 18"
error when I run it.
 
LVL 59

Author Comment

by:Joe Winograd, Fellow&MVE
Does this happen on all of the files or just some? Please attach a file where this occurs and I'll troubleshoot it here. Be sure to remove any sensitive/private info from the file. Thanks, Joe
 

Expert Comment

by:vpnsol123
It happens when I try to run the vbscript.  I specify a source folder, a destination folder and the name length and starting position.  Is this correct?

 
LVL 59

Author Comment

by:Joe Winograd, Fellow&MVE
Yes, that's correct. I assume you're entering 7 for the length-before-PDF and 16 for the starting-column-of-name...right?
 
LVL 59

Author Comment

by:Joe Winograd, Fellow&MVE
Do you know how to take screenshots on your PC and post them here? If not, let me know and I'll walk you through it; if so, please post the "Confirm Parameters" dialog box. It looks like this:
[Screenshot: Confirm params]
Thanks, Joe
 

Expert Comment

by:vpnsol123
I do not even get the first pop up box.  It goes directly to the error.  I have saved it as a VBScript.  Am I missing something?

Thanks
 
LVL 59

Author Comment

by:Joe Winograd, Fellow&MVE
It is not a VBScript. It is an AutoHotkey script...they're similar in concept and usage, but are completely different languages. You'll need to follow these steps to run it:

(1) Rename the downloaded file from [Batch-Mass-Rename-Move-PDF-Files.txt] to [Batch-Mass-Rename-Move-PDF-Files.ahk], that is, change the file extension from TXT to AHK.

(2) Install AutoHotkey. Go to this link:
http://www.autohotkey.com/

Click the Download AutoHotkey button, save the install file, and then run it. This will install AutoHotkey.

(3) Download the Xpdf tools from here:
http://www.foolabs.com/xpdf/download.html

Click the [xpdfbin-win-3.03.zip] link to download the Windows files. Unzip the zip file and you will see folders for 32-bit (bin32) and 64-bit (bin64) Windows. Select the right folder for your version of Windows (32-bit or 64-bit) and copy the file called [pdftotext.exe] to wherever you want...the script will automatically find it if you put it in [Program Files\xpdf\] or [Program Files (x86)\xpdf\], but if you put it somewhere else, that's fine...the script gives you a browse-for-file dialog so you may navigate to it.

(4) Run the [Batch-Mass-Rename-Move-PDF-Files.ahk] program by simply double-clicking on it. AutoHotkey is associated with AHK files so it will launch and execute the script.

Regards, Joe
 

Expert Comment

by:vpnsol123
Got it.. Awesome!!!!
 
LVL 59

Author Comment

by:Joe Winograd, Fellow&MVE
I guess that means it worked. :)

Did you capture the Operation Completed dialog? If so, please post it; if not, let me know approximately how long it took to process your thousand files. Thanks!
 

Expert Comment

by:vpnsol123
It worked like a champion.  I will add back the dialog.  Now I am going to really push it.  The output from my program prints one PDF file per customer with multiple invoices.  They all have the same first line.  Is it possible to create a separate PDF file based on that page break?

Thanks again!!
 
LVL 59

Author Comment

by:Joe Winograd, Fellow&MVE
You're welcome...I'm glad to hear it worked like a champion for you. Some questions re your new request:

(1) Are all of the invoices only one page each?

(2) If not, how do you know when one invoice ends and another begins? I'm concerned that page 2 of an invoice may have the same first line as page 1, so there would be no way of knowing, for example, if the second page in the PDF file is page 2 of the first invoice or page 1 of the second invoice.

(3) The original files are named like [D123456.pdf]. The current version of the program renames that to, for example, [D123456 John Smith.pdf]. If the program were to create multiple PDF files from one input file, how would they be named? For example, if [D123456.pdf] contains three invoices, would you want it to create [D123456-1 John Smith.pdf], [D123456-2 John Smith.pdf], and [D123456-3 John Smith.pdf]? If not, what?

Regards, Joe
 

Expert Comment

by:vpnsol123
Each invoice starts with "customer name:"  If it goes to a second page it is a continuation and doesn't have "Customer Name:"

Thanks.
 
LVL 59

Author Comment

by:Joe Winograd, Fellow&MVE
OK, that answers questions (1) and (2). What's the answer to (3)?
 

Expert Comment

by:vpnsol123
Your example with the original file name then a dash and a sequence number works great!!

Thank you again.
 
LVL 59

Author Comment

by:Joe Winograd, Fellow&MVE
vpnsol123,
Yes, it is possible to do it. But this requirement represents a significant difference in functionality, as the current script simply renames (or renames/moves) the source PDF files. This additional requirement means that the source PDF files must be split into multiple, newly-created PDF files. This is a significant enough enhancement that I'd rather write a new Article and script for it. It will also require an additional PDF utility...one that can split a PDF file into multiple PDFs, which the Xpdf tools cannot do. I have already designed the new program and tested a (free) PDF toolkit that can do what's needed...I plan to have the new Article and program written within the next week or so. Regards, Joe
 
LVL 59

Author Comment

by:Joe Winograd, Fellow&MVE
vpnsol123,
The new Article (with the new program) was just published:

http://www.experts-exchange.com/Software/Misc/A_11211-How-To-Split-Rename-Move-a-Batch-of-PDF-Files-Based-on-Contents-of-the-Files.html

Please read it, run the program on your thousand files, and post the results at the new Article. Thanks, Joe
 

Expert Comment

by:igges
joewinograd,
this was just what I was looking for. Perfect! Saved me a lot of time.
igges
 
LVL 59

Author Comment

by:Joe Winograd, Fellow&MVE
igges,
Thanks for letting me know – I appreciate your comments! Regards, Joe
 
 

Administrative Comment

by:Eric AKA Netminder
joewinograd,

Two of the Page Editors have agreed that your article is worthy of being Experts Exchange Approved, and it has been so awarded.

Congratulations!

ericpete
Page Editor
 

Expert Comment

by:Mehmet Adalar
Joe, I am trying to use this tool for my PDF files, but somehow the column number does not work for me. It always brings back the first line of the file; however, my string is on other lines. Can you help me choose that line?
 
LVL 59

Author Comment

by:Joe Winograd, Fellow&MVE
Hi Mehmet,
The program currently looks for the string on the first line (in whatever starting column you specify). I can change it to look for the string on some other line — second, third, whatever. I can even change it to make that an input parameter, i.e., just as the user inputs the column number, the user would input the line number. Does this approach work for you? In particular, is the string to be appended always on the same line in the file (second, third, whatever)? Regards, Joe
 
LVL 59

Author Comment

by:Joe Winograd, Fellow&MVE
Although I haven't heard back from maliadalar (Mehmet Adalar), based on his request above, I modified the program to allow the name string to be on any line of the file, not just the first line. I did this via a new parameter requested from the user in a new input box. I will also take this opportunity to make a few other improvements to the Article and program, as follows:

(1) Remove the comment about downloading/renaming the TXT file since EE now allows AHK files to be uploaded.

(2) Change the program to look for <pdftotext.exe> in the same folder as the script (as well as C:\Program Files (x86)\xpdf\ and C:\Program Files\xpdf\).

(3) Add error checking on calls that can produce an error, including pdftotext.exe, FileAppend, FileDelete, FileMove, and FileReadLine.

(4) Provide an option to save the operational statistics in a text file.

I will make some other changes to the Article (including modified code snippets) and submit it for republication. I will also delete the source code file attached to the Article (Batch-Mass-Rename-Move-PDF-Files.txt) and attach the new source code file (Batch-Mass-Rename-Move-PDF-Files-V2.ahk). I plan to resubmit this within the next few days, but if anyone wants additional changes, please post here and I'll consider them. Regards, Joe
 

Expert Comment

by:cbarber22
I literally have been looking for something like this for a week or two. We had an older version of OMNISOFT PRO 14. The server went out due to Edison hitting our power line with a crane. The previous IT staff had no backups of this server, and I just stepped in. The company purchased omnisoft ultimate but Nuance/OMNISOFT Technical support has been of no help resolving this.
Previous process:
Use the Panasonic built in software RTIV to scan the initial physical papers excluding the separator sheets, then once completed, saving as a pdf to a shared temp location. Use Omnisoft to pickup those files read the data, and save them to another location with the order number which is in the scanned files.

But Omnisoft doesn't seem to have the capability in the newer software, to read the file and use the text in the pdf to get the order numbers out of it, and place that as the files name when saving. I would like to be able to do the scanning all in omnisoft, then read the file, place the ordernumber as the name then save it to its final destination instead of this massive work around involving 2 programs and a batch script. Ideas?
 
LVL 59

Author Comment

by:Joe Winograd, Fellow&MVE
Hi cbarber22,

I'm sure you mean OmniPage Pro 14, not OMNISOFT PRO 14. I have that version, and also have OmniPage Pro 18, as well as the latest release, which Nuance calls OmniPage Ultimate (under the covers, it is OmniPage 19, but for whatever reason Nuance changed the naming convention). I don't believe that there are any features in OmniPage Pro 14 that aren't in OmniPage Pro 18 or OmniPage Ultimate. So when you say that OmniPage "doesn't seem to have the capability in the newer software", I don't think that's the case. Whatever you were doing with the old version 14 you should be able to do with 18 and Ultimate (19). Please describe the previous process/workflow used in OP14 Pro that functioned correctly for this company.

Btw, I don't know what you mean by "this massive work around involving 2 programs and a batch script". But if you're talking about the program presented in this Article, keep in mind that it has nothing to do with OCR. It assumes the existence of a PDF file with text in it. Regards, Joe
 

Expert Comment

by:cbarber22
Hi Joe,
Thank you for the response. I was told by OmniPage when talking to them on the technical support line that they are unable to reproduce this functionality of reading the pdf scanned documents for a specific word "Order #" followed by 6-10 digits to use that as the files name when outputting the saved file. Previously in version 14 we had this running for almost 7 years.

 If this isn't a possibility do you  know of any way to read the file after being scanned in, to rename based on text internal to the individual documents?

The massive work around for us moving forward would be to have the staff individually save each file with a manual typing of these pdf documents. This would be a major inconvenience because of the amount of items which are scanned. This is usually upwards of 1500 pages per day.

I hope I answered anything which was unclear previously; if not, please let me know and I would be more than happy to explain.

Thank you, Chris
 
LVL 59

Author Comment

by:Joe Winograd, Fellow&MVE
Hi Chris,

Yes, that clears it up well. You say that previously in OP14 you had it running, i.e., the ability to name the output file based on text in the OCRed contents of the input file. I've never done that with any version of OP. Can you explain/show how you did that with OP14?

Another question for you: is the "Order #" phrase on the same page in the same location for each file? Regards, Joe
 

Expert Comment

by:cbarber22
Joe,
Unfortunately, I don't know the setup since I wasn't involved, I just stepped in here at this company as the sole IT staff for the time being until I hire staff to assist. The order # is on the sheets which we are scanning from our Panasonic 4085 scanner and should be in the same location on all sheets.
 
LVL 59

Author Comment

by:Joe Winograd, Fellow&MVE
Chris,

Since the "Order #" phrase is in the same location on each page, I have an idea for you. A product called eDoc Zonal OCR was designed to do exactly what you want:

http://www.edocfile.com/eDoc_Zonal_OCR.htm
"eDoc Zonal OCR is a program designed to capture data from scanned files, place the data in a csv file and rename the file based upon the contents of the file."
Its output file type may be PDF or TIFF (you may, of course, ignore the data that it places in the CSV file).

Although the website is not maintained well (the copyright says 2008 and the supported Windows section says XP and Vista), the product is up-to-date, supporting W7 and W8. There's also a similar product there called File by OCR, which OCRs the entire document, i.e., it is not zonal OCR, but you don't need that since the "Order #" phrase is in the same location (a "zone") on each page.

The product is somewhat expensive at $1,095, but if you're doing 1,500 pages per day, that's 375,000 pages in a 250-day work year, or 1,125,000 pages in three years, which is a fair time frame to amortize the license fee, resulting in a cost of less than one-tenth of one cent per page — pretty reasonable!
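The arithmetic behind that per-page figure can be checked with a few lines of Python. The variable names are for illustration only; the inputs are the figures quoted above.

```python
pages_per_day = 1500
work_days_per_year = 250
years = 3
license_fee = 1095.00  # US dollars

total_pages = pages_per_day * work_days_per_year * years  # 1,125,000 pages
cost_per_page_cents = license_fee / total_pages * 100     # just under 0.1 cent

print(f"{total_pages:,} pages, {cost_per_page_cents:.4f} cents per page")
```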

One thing I don't like about this product is its dependence on Microsoft Office Document Imaging (MODI). The last version of Office that included it was Office 2007, although it is possible to install it in Office 2010:

http://support.microsoft.com/kb/982760

But there are mixed reports about trying the Office 2010 technique in Office 2013. Here's one discussion on it:

http://answers.microsoft.com/en-us/office/forum/office_2013_release-other_msftoffice_apps/office-2013-and-installing-ocr-for-documenting/ab7078a3-fd67-4199-a722-6a0596b838a0

A web search for "modi office 2013" will give you plenty more to study.

If you want to learn more about the product and have any of your questions answered, I suggest calling the company at 813-298-2474 and asking for Keith, who is extremely knowledgeable. He wrote the software and is willing to talk about it. He doesn't oversell it — he'll be the first to tell you that its OCR is not perfect, and that you may not be happy with the overall results.

Another idea is to use OmniPage to OCR the files to create what OP calls a PDF Searchable Image file. This will have "Order #" somewhere in the text (assuming, of course, that the OCR is accurate). Depending on how the OCR does, it may not always place it on the same line of the OCRed page (even though it is always in the same "location" on a page, the resulting line numbers could easily vary by plus or minus one or two). I could modify the program in this Article to look for "Order #" anywhere on the page (rather than a specific line number) and then extract the number to its right and use that to rename the file. Regards, Joe
 

Expert Comment

by:cbarber22
Wow, Joe, that was a lot of information, and I will look into the choices you listed above later this evening. I am concerned about using OCR in case there are placement changes, or in case the information is not picked up for some reason, causing us to go back and manually review all of these items. If all else fails, I will update this tonight, asking for assistance in modifying the program, as you described, to look for "Order #" on the page.

Thank you
 

Expert Comment

by:cbarber22
Hi Joe,
Neither was really successful in doing the job we need. Would you mind modifying the code previously written to pick up the text after the word "Order", as described in your previous post?

Thank you,
Chris
 
LVL 59

Author Comment

by:Joe Winograd, Fellow&MVE
Chris,
I'm working on a specification for the new feature. It is almost complete and I expect to be able to post it within an hour or so (the specification, not the modified program — the latter will take longer to code and test). Regards, Joe
 
LVL 59

Author Comment

by:Joe Winograd, Fellow&MVE
Here is a proposed specification for a new parameter to satisfy the requirements of both Chris and Mehmet. A new dialog box will look like this:

[Screenshot: line number of name string]

If a positive integer is entered, such as 1 or 2, the program will look for the occurrence of the string on that line. It will also prompt for the starting column number of the string, as it currently does. However, if 0 (zero) is entered, it will not prompt for a starting column number, but instead will look for the first occurrence of the string anywhere on the first page. When it finds the string, it will look for the first non-blank/legal-in-a-file-name character after the string and will use that as the starting point to extract the name. It will end with the first blank or illegal-in-a-file-name character or at the end of the line. Here are some examples if "Order #" is the string:

(1) Order #: 123456 Name: ABC Company

Extracted name will be 123456 (because ":" is illegal in a file name, and "1" is the first non-blank character, and the blank after the "6" terminates it).

(2) Order #-123456

Extracted name will be -123456 (because "-" is legal in a file name and the end-of-line terminates it).

(3) Order #123 456

Extracted name will be 123 (because the blank after "3" terminates it).

(4) Order #123456/July-2014

Extracted name will be 123456 (because "/" is illegal in a file name and terminates it).
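The four examples above can be reproduced with a short Python sketch of the proposed rule (an illustration of the specification, not the modified AutoHotkey program). The name `extract_after` is hypothetical, and the set of illegal characters is assumed to be the usual Windows list.

```python
# Characters that Windows does not allow in file names (assumed list).
ILLEGAL = set('\\/:*?"<>|')

def extract_after(page_text, marker):
    """Return the name that follows the first occurrence of marker, or None."""
    for line in page_text.splitlines():
        position = line.find(marker)
        if position == -1:
            continue
        rest = line[position + len(marker):]
        start = 0
        # Skip blanks and illegal characters to find the starting point.
        while start < len(rest) and (rest[start].isspace() or rest[start] in ILLEGAL):
            start += 1
        end = start
        # Stop at the first blank, illegal character, or end of line.
        while end < len(rest) and not rest[end].isspace() and rest[end] not in ILLEGAL:
            end += 1
        return rest[start:end]
    return None  # marker not found anywhere on the page
```

Calling `extract_after("Order #: 123456 Name: ABC Company", "Order #")` gives `123456`, matching example (1); the other three examples behave as described.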

Chris and Mehmet (and anyone else who is interested), please let me know if this specification works for you. If not, what changes would you like? Regards, Joe
 
LVL 59

Author Comment

by:Joe Winograd, Fellow&MVE
Chris,

One other thought. You mentioned that you have a Panasonic scanner, so you should take a look at Panasonic's Image Capture Plus software:

http://panasonic.net/pcc/products/scanner/image_capture_plus/

Scroll down at that link until you find a section called, "The OCR Zone Function Supports the Scanning of Standard Forms Such as Invoices", where you will see this:
"When you use the OCR Zone function, text will be recognized while scanning and acquired as an OCR result that is used in the file name and Output Log."
Note the diagram where it says,
"Part of the document text is reflected in the filename"
I've never used this software and haven't checked out the details, but it's worth a look. Since it is Panasonic software, there's a good chance that it will work with your Panasonic scanner. My thanks to Eric Beanland of the PaperPort Google Group for bringing Image Capture Plus (and its OCR Zone function) to my attention. Regards, Joe
 

Expert Comment

by:cbarber22
I will look into it before asking you to add to the above modifications.
 
LVL 59

Author Comment

by:Joe Winograd, Fellow&MVE
I look forward to getting your feedback on it — always like to learn about new products.
 

Expert Comment

by:cbarber22
Hi Joe,
Unfortunately, the application does not use a variable-based option. There are static tags for file names, such as "string-#####", where the string is a hardcoded preset value you select and the # is just a sequential numbering of the scans in order from 1-100000. There are also date stamps and timestamps, but none of those seem to fit what is needed currently. I can provide a screenshot of it in case you want it for reference in the future?
 
LVL 59

Author Comment

by:Joe Winograd, Fellow&MVE
Hi Chris,

Yes, a screenshot would be helpful for future reference. Also, please post a sample of your PDF file — I want to be certain that any new code will handle your PDF form correctly (be sure not to have any sensitive/private data in the posted sample).

If I modify the program as described in the proposed specification above, here's what your new process would look like:

(1) Scan the pages with Panasonic's RTIV software, creating image-only PDF files in a shared temp location (same as previous process).

(2) Use OmniPage to OCR the image-only PDF files in the shared temp location, creating PDF files with text in them (same as previous process).

(3) Use my modified program to read the files created by OmniPage, telling it to look for "Order #" anywhere on the page. It will then rename/move each file, using the text following "Order #". You said that the text is always 6-10 digits, but the program won't care about that — it will simply terminate the string when it finds the first blank or illegal-in-a-file-name character or end-of-line.

Sound good? Anything I'm missing? Regards, Joe
 

Expert Comment

by:cbarber22
No that sounds great Joe. I will go ahead and do some screenshots of the options built into that program for you for future reference and send those over a little later.
 
LVL 59

Author Comment

by:Joe Winograd, Fellow&MVE
Some screenshots will be great. Also, please post a sample of your PDF form.
 

Expert Comment

by:cbarber22
Here is the order form which we would need the highlighted data from to use as the filename
SOBLANK.pdf
 
LVL 59

Author Comment

by:Joe Winograd, Fellow&MVE
Chris,
I'm really glad you posted that sample file — it shows the difficulty we're facing. If you open that file in the latest Adobe Reader XI and do Select All>Copy, here's the text that you get:
Joo Nbr
Sold ~o :
I NV O I CE
Ship To:
Invoice 501382
Crder Nbr 559386 so
Page Nbr
Date 7;-, 112011.
So my program would not find "Order" because the OCR came up with "Crder" (that's a "C" as the first letter, not "O"). This is the quintessential problem with using OCR for a task like this. For many OCR uses, we don't care about the occasional error — our brains are very good at correcting OCR errors. But for usage like this, an OCR error can be deadly. I could have the program look for "Crder", as well as "0rder" (that first character is the number zero, not an upper case letter "O"), but these are band-aids on the real problem — OCR accuracy. Even if the program finds "Order" (or "Crder" or "0rder"), the digits after it could be a problem. For example, the number "0" could become an upper case letter "O"; the number "1" could become a lower case letter "l" (L), lower case letter "i", upper case letter "I", "|" (pipe character), "!" (exclamation mark), or "¡" (inverted exclamation mark). Once again, the program could fix these problems by assuming that the target string must be digits, so it would change an upper case letter "O" to the number "0"; and change any of the "1" look-alikes to the number "1". But that wouldn't work for users who want non-numeric characters to be allowed in the string, so the program would need to have some options to deal with the contingencies of various situations/users. As I'm sure you can appreciate, it gets very complicated very fast, and (I think it's fair to say) beyond the scope of the EE site, which is gratis from the Author's perspective.
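To make the band-aid approach concrete, here is a minimal Python sketch (not part of the program in this Article). It accepts the marker variants mentioned above and maps the digit look-alikes back to digits, under the assumption that the target string must be numeric. The names `MARKER_VARIANTS`, `find_marker`, and `normalize_digits` are hypothetical.

```python
# Marker spellings discussed above: the correct one plus two OCR misreadings.
MARKER_VARIANTS = ("Order", "Crder", "0rder")

# Look-alike characters mapped to the digit they probably represent.
LOOKALIKES = str.maketrans(
    {"O": "0", "l": "1", "i": "1", "I": "1", "|": "1", "!": "1", "¡": "1"}
)

def find_marker(line):
    """Return the index just past the first marker variant found, or -1."""
    for variant in MARKER_VARIANTS:
        position = line.find(variant)
        if position != -1:
            return position + len(variant)
    return -1

def normalize_digits(text):
    """Replace digit look-alikes; only valid when the target must be all digits."""
    return text.translate(LOOKALIKES)
```

On the sample text, `find_marker("Crder Nbr 559386 so")` locates the misread marker, and `normalize_digits("5O138l")` returns `501381`. As noted, this breaks down as soon as non-numeric characters are legitimate in the target string.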

Three other points about the OCR accuracy. First, according to the Properties of the PDF file you posted, the PDF Producer is Adobe Acrobat Pro 11.0.7 Paper Capture Plug-in. So that means you didn't use your OmniPage 14 to OCR it. If you had, I would expect the PDF Producer to say OmniPage 14 — I can say for sure that my OmniPage Ultimate shows the PDF Producer as OmniPage 19. It's possible that OmniPage (14 or 19 or any version in between) would produce better OCR results than the Adobe Acrobat Pro 11.0.7 Paper Capture Plug-in. If you attach a blank of the original PDF file before OCR, I'll run it through OmniPage Ultimate (19) and post the results (I already did that with the PDF you posted — see below — but I'd still rather have a blank before OCR).

Second, the quality of the source document could be the problem. If the original document is poor quality, the scanned image is likely to be poor quality — perhaps too poor for any OCR package to produce accurate text.

Third, the scanning parameters may be bad for OCR. I do nearly all of my scanning at 300dpi/black&white. Occasionally, I'll do 600dpi/b&w, but, strangely enough, that can actually produce worse OCR results. Sometimes (rarely), I'll do 200dpi/grayscale. To learn more about this issue, I recommend visiting Wayne Fulton's excellent site, "A few scanning tips":
http://www.scantips.com/

In particular, the OCR tips section:
http://www.scantips.com/basics04.html

Note the comment about using grayscale for OmniPage Pro, although I rarely find that this is necessary.

I think the problem with the sample you attached is the second and/or third points above. I fed it to OP19 and it did no better than Acrobat. In order to have any hope for this project, you'll need to create much better quality scanned images. Regards, Joe

Expert Comment

by:cbarber22
Hi Joe,
I believe this could be because it was copied twice to block out the business-related data. I will try to replace it shortly with a valid sample that is darker and clearer.

Author Comment

by:Joe Winograd, Fellow&MVE
Chris,
How about simply printing a blank invoice form and then scanning it in with your normal scanning software, ideally at 300dpi/b&w?

Expert Comment

by:cbarber22
This should be significantly better now. Let me know how it goes.
Samplescan.pdf

Author Comment

by:Joe Winograd, Fellow&MVE
Yes, it is significantly better. Here are the OmniPage Ultimate (OP19) OCR results:

L
Invoice 501382
Order Nbr - 559386 SO
Page Nbr - 1
Date - 7/11/2014
CORPORATE OFFICE
INVOICE
Job Nbr
Sold To: Ship To:

The "L" at the top is from the OCR thinking that the lower left corner of one of the empty squares looks like an "L". There was also a form feed character at the end (decimal 12, hex 0C), which is reasonable. Other than that, it did very well!

Btw, one of my imaging programs reports it as 200dpi, not 300dpi. Sometimes those reports are wrong — are you sure it was scanned at 300dpi? Regards, Joe

Expert Comment

by:cbarber22
I did scan it at 300, but it could have picked up prior settings. Is there anything else needed to make sure this functionality works?

Author Comment

by:Joe Winograd, Fellow&MVE
Are all the invoices one page? I just noticed a "Page Nbr" field, which leads me to think that some invoices may be more than one page.

Expert Comment

by:cbarber22
I apologize, I didn't even think of that. It only needs to read the first page. Each item that needs to be named will be separated by a separator sheet, which we have already configured in RTIV to be excluded so that it moves on to the next file.

Author Comment

by:Joe Winograd, Fellow&MVE
OK, so RTIV uses the separator sheet to create separate PDF files, meaning each invoice is in its own PDF file. So my program needs to look for "Order Nbr" only on the first page. That helps! I don't think I need anything else at this time, but let me give it some thought. Regards, Joe
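As a rough illustration of restricting the search to the first page, here is a minimal Python sketch (not the program's AutoHotkey code). It assumes the converted text separates pages with a form feed character (decimal 12, hex 0C), which is what pdftotext emits by default. The function names and the default identifier are hypothetical.

```python
def first_page(text: str) -> str:
    """Return everything before the first form feed (the page separator)."""
    return text.split("\f", 1)[0]


def has_identifier_on_first_page(text: str, identifier: str = "Order Nbr") -> bool:
    """True only if the identifier string occurs on the first page."""
    return identifier in first_page(text)
```

If the converter is already told to output only page 1, the split is redundant but harmless.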

Expert Comment

by:cbarber22
Hi Joe, I wanted to check in and see how everything is going.

Author Comment

by:Joe Winograd, Fellow&MVE
Hi Chris,
Sorry for the delay. Was working on paid engagements most of last week and went on an extended weekend vacation starting Friday. I'm in the airport right now waiting for the (delayed!) flight back home. I'll try to get back into this within the next day or two, depending on how things shake out with the paid consulting projects (as I'm sure you can appreciate, I need to give those priority over the gratis EE work). Regards, Joe

Expert Comment

by:cbarber22
Haha, I appreciate it, and I figured you were busy too. I didn't want to rush you; I just figured that if you had a schedule like mine, you had your hands full.

Author Comment

by:Joe Winograd, Fellow&MVE
Chris,

I hope to have some time tonight or tomorrow night to work on this. I decided to change the spec mentioned in my earlier post with respect to what happens after it finds the first occurrence of the identifier string (such as Order# or Order Nbr). Instead of then looking for the first non-blank/legal-in-a-file-name character after the identifier string, it will look for the first alphanumeric character. I'm thinking that most users won't want a special character to start the file name (of course, anyone with a different requirement can modify that piece of code). With this change, the second example in my earlier post is:

(2) Order #-123456

Extracted name will be 123456 (because "-" is not alphanumeric).
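The revised rule can be sketched roughly as follows, in Python rather than the program's AutoHotkey. This is not the released code. It assumes the extracted name starts at the first alphanumeric character after the identifier and runs to the next whitespace; the stopping point is my assumption for illustration, since the spec above only defines where the name starts. The function name extract_name is hypothetical.

```python
import re


def extract_name(line: str, identifier: str):
    """Return the text after `identifier`, starting at the first alphanumeric.

    Assumption (not from the spec): the name ends at the next whitespace.
    Returns None if the identifier or an alphanumeric character is not found.
    """
    pos = line.find(identifier)
    if pos == -1:
        return None
    rest = line[pos + len(identifier):]
    # Skip leading blanks and special characters such as "-" or ":".
    match = re.search(r"[A-Za-z0-9]\S*", rest)
    return match.group(0) if match else None
```

With the second example above, extract_name("Order #-123456", "Order #") skips the "-" and yields 123456.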

Regards, Joe

Expert Comment

by:cbarber22
No worries, I understand time constraints. I just flew out to Sacramento because a DC just went down.

Author Comment

by:Joe Winograd, Fellow&MVE
Chris (and Mehmet and anyone else who is interested),
I completed the code late Monday evening. All of the tests have worked so far, but I want to run some additional tests tomorrow before releasing it. Regards, Joe

Expert Comment

by:cbarber22
You are amazing Joe, thank you for all your hard work.

Author Comment

by:Joe Winograd, Fellow&MVE
Chris,
Thanks for the kind words — I really appreciate hearing them! I hope your trip to Sacramento was successful and your Data Center is up-and-running again. Things got complicated while performing additional testing on the new program. I think the best approach at this point is for you to contact me at the email address in my profile. Regards, Joe

Expert Comment

by:doctordigital
I am not a programmer. In fact, I'm a layman, but I need to rename a batch of 900 PDFs. I downloaded AutoHotkey and Xpdf. I would appreciate a walk-through of what I am supposed to do to reach the ultimate goal of renaming the PDFs.
I don't see from the posted information where or when to choose the PDFs I'm interested in renaming, and I don't know what to do with the Xpdf program. The PDF has two pages in it, but I only need info from the first page. I have attached the PDF. Ideally, I would like to use the birth date in the new name, so I just need to add the birth date to the file name.
Thank you
Rick
Ironworkers-2014-rev-2-Part1.pdf

Author Comment

by:Joe Winograd, Fellow&MVE
Hi Rick,

It's not a simple matter of walking you through what to do. The program presented in this article is specific to the identifier being on the first line of the first page in a specified column after the PDF is converted to text. In this case, I tested your PDF with the text converter, and the identifier ("Date of Birth: ") does not wind up on the first line of the page with the converted text. Also, with 900 PDFs, I suspect the location of the identifier could vary, such as the case where the person has a 3-line address (perhaps an apartment or suite number), which could cause the Date of Birth to appear in a different line (and possibly a different column) of the converted text file.

In addition, the content of your identifier field has a character in it ("/") that is illegal in a Windows file name. The current program will not handle that properly — it will attempt to do the rename/move with the invalid character in the file name and it will fail. The program would have to be modified to change the two slashes in the birth date to another character, such as a hyphen, period, underscore, space, etc. (i.e., a character that is valid in a Windows file name).
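The kind of modification described here might look like the following Python sketch (again, not the program's AutoHotkey code). The set of characters is the standard list that Windows rejects in file names; the function name, the default replacement character, and the sample date in the usage note are hypothetical.

```python
# Characters that Windows does not allow in a file name.
ILLEGAL_CHARS = '\\/:*?"<>|'


def sanitize_for_filename(value: str, replacement: str = "-") -> str:
    """Swap each illegal character for `replacement` (a hyphen by default)."""
    return "".join(replacement if ch in ILLEGAL_CHARS else ch for ch in value)
```

For a made-up birth date of 01/02/1970, sanitize_for_filename("01/02/1970") gives 01-02-1970, which can then be appended to the file name.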

Since the solution you need goes beyond the scope of this article, I'll send you a note via the EE Message system to discuss the possibility of creating a customized version of this program that does what you need. Regards, Joe

Expert Comment

by:doctordigital
That would be great.  Thanks

Expert Comment

by:Member_2_7970298
How do I obtain a copy of your AutoHotKey script for "How To Rename-Move a Batch of PDF Files Based on Contents of the Files"?
I have what I believe is an application for renaming multiple PDF files and would greatly appreciate receiving a copy of the AutoHotKey script.
Thank You

Author Comment

by:Joe Winograd, Fellow&MVE
Hi New Member,

Perhaps you missed my answer to your same question at my Split-Rename-Move article, so I'll repeat it here for you.

When I removed the source code last year from six articles that I published here at EE, my intention was that the removal be temporary. I began a project to rewrite all of the programs in my portfolio in order to generalize them for a broader audience and to have a standard user interface, including both a GUI (graphical user interface) and, where it makes sense, a CLI (command line interface). It wound up being a much larger effort than I anticipated, and I'm still not ready to post or distribute the source code for this program (or any of the other five published at EE — and I don't know when or even if that will be, for a variety of reasons).

I have created customized versions of these various programs for EE members who became clients of mine. I provided licenses for the run-time programs (the executables, i.e., the compiled EXE files) for an agreed-upon fee, but I did not provide the source code. I did this previously when EE had the "Hire Me" button, but that no longer exists. The mechanism now at EE for such work is the new Gigs feature, if that interests you.

Regards, Joe

Expert Comment

by:Member_2_7970298
Joe,
Thanks for your response.
I appreciate your comments & issues.
I would greatly appreciate it if you could see your way clear to send me your original AutoHotKey script.
I'm trying to learn more about AutoHotKey scripts and especially how it interfaces with Xpdf's pdftotext.exe
Thanks

Author Comment

by:Joe Winograd, Fellow&MVE
Hi Member_2_7970298 (???),

I received your email at my personal email address, which I'll respond to in a moment. I already responded to your post at the AHK forum, which led you to this article, and then to my Split-Rename-Move article. Instead of three different communication venues (EE, AHK, email), let's continue this discussion via just email.

That said, a quick message about your comments is that the Tutorials forum and the Scripts and Functions forum at the AHK boards are the way to go "to learn more about AutoHotKey scripts" (as well as the Tutorial at the AHK docs site).

There's not much to learn about "how it interfaces with Xpdf's pdftotext.exe" — the RunWait command is it. Here's an actual call from one of my programs:

RunWait,%pdftotextEXE% -f 1 -l 1 -raw "%FullFileNameCurrent%" "%DestinationFolder%%FileNameCurrentTXT%"


I'm sure from the names of the variables you can figure out what that line does. Also, I gave you links at the AHK forum to my two 5-minute EE video Micro Tutorials that should help you with learning about how to use the pdftotext.exe tool:
Xpdf - Command Line Utility for PDF Files
Xpdf - Convert PDF Files to Plain Text Files

If you haven't viewed them yet, I think you'll find them to be a worthwhile expenditure of 10 minutes. Regards, Joe
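
For readers more at home in Python than AutoHotkey, a rough equivalent of the RunWait call above might look like this. It is a sketch, not part of the program: the function names and the sample paths in the usage note are hypothetical. The flags -f 1 -l 1 -raw are the same ones shown in the call above and limit the conversion to page 1 with raw text ordering.

```python
import subprocess
from pathlib import Path


def build_pdftotext_command(pdftotext_exe, pdf_path, txt_path):
    """Build the argument list: first page to last page = 1, raw text order."""
    return [str(pdftotext_exe), "-f", "1", "-l", "1", "-raw",
            str(pdf_path), str(txt_path)]


def convert_first_page(pdftotext_exe, pdf_path, dest_folder):
    """Run pdftotext and wait for it to finish, as RunWait does."""
    txt_path = Path(dest_folder) / (Path(pdf_path).stem + ".txt")
    command = build_pdftotext_command(pdftotext_exe, pdf_path, txt_path)
    subprocess.run(command, check=True)  # raises if pdftotext exits non-zero
    return txt_path
```

Calling convert_first_page("pdftotext.exe", "D123456.PDF", "out") would write out/D123456.txt, assuming pdftotext.exe can be found and the out folder exists.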