How To Split-Rename-Move a Batch of PDF Files Based on Contents of the Files

Joe Winograd, EE MVE 2015&2016
50+ yrs in computer industry. Everything from programming to sales. OS kernel dev on mainframes. CIO. Document imaging. EE MVE 2015 & 2016.
Update 21-May-2015: I temporarily removed the source code and the code snippets to make major changes to the program. Regards, Joe

INTRODUCTION

This Article is a follow-up to the Article entitled How To Rename-Move a Batch of PDF Files Based on Contents of the Files, recently published here at Experts Exchange.

I considered adding the new feature (splitting a single document into multiple documents) to that Article and program, but concluded that it is a significant enough enhancement to warrant a new Article and program.

PREVIOUS ARTICLE

To understand this Article, it will be helpful to read the previous Article, but to get things going here right away, here's a summary of the previous problem and solution.

There is a large batch of PDF files, all with cryptic names, such as [D123456.PDF]. Inside each file on the first line of the first page (always starting at a fixed column and running to the end of the line) is a human-friendly identifier for the file, such as [John Smith]. The requirement is to loop through all of the files in a specified folder in an automated fashion, changing the file names from, for example,

D123456.PDF

to

D123456 John Smith.PDF

That is, add the identifier from the first line of the first page to the file name.
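
As an illustration only (the AutoHotkey source is not shown in this Article), here is a minimal Python sketch of the renaming rule. The function name `build_new_name` and the sample first line are hypothetical; the starting column is 1-based, as in the description above.

```python
from pathlib import PurePath


def build_new_name(file_name: str, first_line: str, start_col: int) -> str:
    """Append the identifier found on the first line to the file name.

    start_col is 1-based; the identifier runs from there to the end of the line.
    """
    path = PurePath(file_name)
    identifier = first_line[start_col - 1:].strip()
    return f"{path.stem} {identifier}{path.suffix}"


# Hypothetical first line in which the identifier starts in column 16.
sample_first_line = "Customer Name: John Smith"
print(build_new_name("D123456.PDF", sample_first_line, 16))
# -> D123456 John Smith.PDF
```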

NEW REQUIREMENT

Following publication of the previous Article and the program that implements the solution, the Original Poster (OP) of the question that prompted the Article asked if an enhancement is possible. Specifically, a single PDF file may be composed of what are really multiple PDF files, and the OP wants the program to split the single PDF into multiple PDFs. For example, pages 1 to 3 of [D123456.PDF] may be an invoice for John Smith, while page 4 may be a different invoice, and pages 5 to 6 yet another invoice. With the previous program, the 6-page [D123456.PDF] would simply be renamed to [D123456 John Smith.PDF], still containing all six pages (three invoices). The OP wants the program to split the original PDF file and create three PDFs, one for each of the invoices. The program still has to rename the files based on content, but, in addition, has to provide a suffix for the multiple files, such as

D123456 John Smith-1.PDF
D123456 John Smith-2.PDF
D123456 John Smith-3.PDF
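
The naming scheme above can be sketched in Python as follows. This is an assumption-laden illustration, not the program's code: the function name `split_names` is hypothetical, and it takes one identifier per split document so that each output file can carry the name found on its own first page.

```python
from typing import List


def split_names(base_name: str, identifiers: List[str], ext: str = ".PDF") -> List[str]:
    """Build one output name per split document, with a running -1, -2, ... suffix."""
    return [
        f"{base_name} {identifier}-{number}{ext}"
        for number, identifier in enumerate(identifiers, start=1)
    ]


for name in split_names("D123456", ["John Smith", "John Smith", "John Smith"]):
    print(name)
```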

INSTALLATION INSTRUCTIONS FOR REQUIRED SOFTWARE

The previous solution requires two excellent freeware products – the AutoHotkey scripting language (the program is written in this) and [pdftotext.exe] from the Xpdf package to convert the PDF files to text (so the program can extract the identifying names for renaming the files). This new solution requires another excellent freeware product – PDFtk (the PDF Toolkit) from PDF Labs.

Here are the steps for installation of these three packages:

(1) AutoHotkey – http://ahkscript.org (also, see my EE article: AutoHotkey - Getting Started)

Click the Download button at the page above, save the install file, and then run it.

(2) Xpdf – http://www.foolabs.com/xpdf/download.html

Click the [xpdfbin-win-3.03.zip] link at the page above to download the Windows files. Unzip the zip file and there will be folders for 32-bit Windows (bin32) and 64-bit Windows (bin64). Be sure to select the right folder for your version of Windows (32-bit or 64-bit) and copy the file called [pdftotext.exe] to wherever you want (the Xpdf binaries are "no-install" executables). The script will automatically find it if you put it in [Program Files\xpdf\] or [Program Files (x86)\xpdf\], but if you put it somewhere else, that's fine – the script gives you a browse-for-file dialog so you may navigate to it.

(3) PDFtk – http://www.pdflabs.com/tools/pdftk-the-pdf-toolkit/

Click the [pdftk_server-1.45-windows-setup.msi] link at the page above, save the install file, and then run it. It will create a folder called [Program Files\PDF Labs\PDFtk Server\] or [Program Files (x86)\PDF Labs\PDFtk Server\] with a [bin] folder that contains two files – [pdftk.exe] and [libiconv2.dll]. If you'd like to move those two files, that's fine. The script automatically finds them if you leave them where the installer put them, but if you move them somewhere else, it gives you a browse-for-file dialog so you may navigate to them (place both files in the same folder).
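
The "look in the default folders first, otherwise ask the user" behavior described in steps (2) and (3) might look roughly like this in Python. The helper name `find_tool` is hypothetical, and the browse-for-file fallback is left to the caller; this is a sketch, not the script's actual logic.

```python
import os
from pathlib import Path
from typing import Iterable, Optional


def find_tool(exe_name: str, candidate_dirs: Iterable[Path]) -> Optional[Path]:
    """Return the first existing candidate_dir/exe_name, or None if none exists."""
    for folder in candidate_dirs:
        candidate = Path(folder) / exe_name
        if candidate.is_file():
            return candidate
    return None


# Default locations named in the steps above; the fallback strings are assumptions.
program_dirs = [
    os.environ.get("ProgramFiles", r"C:\Program Files"),
    os.environ.get("ProgramFiles(x86)", r"C:\Program Files (x86)"),
]
pdftotext = find_tool("pdftotext.exe", [Path(d) / "xpdf" for d in program_dirs])
if pdftotext is None:
    print("pdftotext.exe not found in the default folders; ask the user to browse for it.")
```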

ASSUMPTIONS FOR NEW PROGRAM

All of the assumptions for the previous program apply to the new program, namely:

There is a fixed number of characters in the original file name (before the ".pdf"). For example, with file names like [D123456.PDF], that number is 7.

There is a fixed starting column number for the string that will be in the new file name (and it runs to the end of the line). In other words, following the examples above, this is the column number where "John Smith" begins (for the OP, this is 16).

The user specifies the source and destination folders. If they are the same, the program does just a Rename; if they are different, the program does a Rename and a Move.
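
The Rename versus Rename-and-Move decision can be sketched in Python like this. The function name `plan_operation` and the folder paths are hypothetical; `PureWindowsPath` is used so that folder comparison ignores case and trailing slashes, which is an assumption about how the folders should be compared.

```python
from pathlib import PureWindowsPath
from typing import Tuple


def plan_operation(source_dir: str, dest_dir: str,
                   old_name: str, new_name: str) -> Tuple[str, str, str]:
    """Return (action, source path, destination path) for one file."""
    src_folder = PureWindowsPath(source_dir)
    dst_folder = PureWindowsPath(dest_dir)
    # Same folder means a plain rename; different folders mean rename plus move.
    action = "rename" if src_folder == dst_folder else "rename+move"
    return action, str(src_folder / old_name), str(dst_folder / new_name)


print(plan_operation(r"C:\In", r"C:\Out", "D123456.PDF", "D123456 John Smith.PDF"))
```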

Here is the assumption unique to the new program:

The first line of a page contains a string that identifies it as a new document. It is a fixed string (specified by the user) beginning in a fixed column (also specified by the user). An example is that the first line of the first page of an invoice contains "Customer Name:" beginning in column 5, while all subsequent pages of that same invoice do NOT contain "Customer Name:" beginning in column 5.

So the program reads the first line of each page and if it contains the specified new document identifier/separator string (such as "Customer Name:" or "Client Name-" or "Account Number") in the specified starting column (such as 1 or 5 or 10), then it knows this is the first page of a new document; if it does not, then it knows this is a continuation page of the current document.
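
A minimal Python sketch of that test, assuming a 1-based starting column as described above (the function name `starts_new_document` and the sample lines are hypothetical):

```python
def starts_new_document(first_line: str, separator: str, start_col: int) -> bool:
    """True when separator appears in first_line beginning exactly at column start_col."""
    return first_line.startswith(separator, start_col - 1)


# "Customer Name:" begins in column 5 on the first page of an invoice.
print(starts_new_document("    Customer Name: John Smith", "Customer Name:", 5))  # True
# A continuation page does not have the string in that column.
print(starts_new_document("    Invoice details (continued)", "Customer Name:", 5))  # False
```

Note that the string must start in exactly the specified column; the same string in a different column does not count as a separator.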

HOW TO RUN THIS PROGRAM

Download the attached file called Batch-Mass-Split-Rename-Move-PDF-Files.ahk. After downloading it, you may run it by simply double-clicking on it in Windows Explorer or whatever file manager you use. Since its file type is AHK, AutoHotkey will be launched to process it. If you prefer, the file may be turned into an executable via the AutoHotkey compiler, which is installed during the standard installation of AutoHotkey. If you right-click on an AHK file in Windows Explorer or whatever file manager you use, there will be a context menu pick called Compile Script. Select that and it will create an EXE file, which is a stand-alone, no-install executable of the AHK program.

(Screenshot: AutoHotkey Compile Script)

HOW THE PROGRAM WORKS

For those interested in understanding how the script works, the remainder of this Article shows some code snippets, with a description of what each snippet does, including screenshots where appropriate (this also acts as a form of documentation for the program). However, it does not include code snippets that are the same, or substantially the same, as the code snippets in the previous program, which have already been discussed in the previous Article.

Code snippet:
 
temporarily removed

What it does: Although this code is similar to the "Starting Column" code in the previous script, I decided to document it here, as it is part of the major enhancement in this script. This code asks the user for the starting column number of the new document identifier/separator string. If the entry is not an integer and/or not greater than zero, it displays a message and gives the user the opportunity to try again or exit.
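
Since the snippet itself is not shown, here is a hedged Python sketch of the validation rule only (an integer greater than zero). The function name `parse_start_column` is hypothetical, and the retry-or-exit dialog is left out.

```python
from typing import Optional


def parse_start_column(entry: str) -> Optional[int]:
    """Return the entry as a positive integer, or None if it is not valid."""
    text = entry.strip()
    # Only plain ASCII digits are accepted, so "-3", "5.5" and "abc" are rejected.
    if text.isascii() and text.isdigit() and int(text) > 0:
        return int(text)
    return None


for entry in ["16", "0", "-3", "abc"]:
    print(repr(entry), "->", parse_start_column(entry))
```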

(Screenshot: StartCol NDISS)

Code snippet:
 
temporarily removed

What it does: Asks the user to enter the new document identifier/separator string, which is used to split multiple documents that are in a single PDF file into multiple PDFs. It also gives the user the opportunity to exit the program.

(Screenshot: NDISS)

The confirmation dialog is similar to the previous program, but the differences are worth noting here:

(Screenshot: Confirm Parameters)

Code snippet:
 
temporarily removed

What it does: Calls PDFtk to write a text file (known as dump_data) that contains various information about the PDF file. One of the items that it writes to the dump_data file is the number of pages in the PDF file. If PDFtk returns an error code, a Fatal Error dialog is displayed with some helpful information to troubleshoot the error.
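
For readers who want to try this step outside the script, here is a Python sketch of the PDFtk call. It uses PDFtk's `dump_data` operation with `output`; the function names and the error message wording are hypothetical, and the original program's Fatal Error dialog is replaced here by a raised exception.

```python
import subprocess
from typing import List


def dump_data_command(pdftk_exe: str, pdf_path: str, report_path: str) -> List[str]:
    """Build the PDFtk command line that writes the dump_data report to report_path."""
    return [pdftk_exe, pdf_path, "dump_data", "output", report_path]


def run_tool(command: List[str]) -> None:
    """Run an external tool and raise if it returns a non-zero error code."""
    completed = subprocess.run(command)
    if completed.returncode != 0:
        raise RuntimeError(
            f"Error Level {completed.returncode} from {command[0]}: {' '.join(command)}"
        )


print(dump_data_command("pdftk.exe", "D123456.PDF", "dump_data.txt"))
```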

Code snippet:
 
temporarily removed

What it does: Reads all of the lines in the dump_data file looking for the "NumberOfPages:" line. If it finds the line, it stores the number of pages in a variable (numpages); if it doesn't find the line, it displays a Fatal Error dialog.
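
A Python sketch of the same search, assuming the dump_data report has already been read into a string (the function name `number_of_pages` and the sample report text are hypothetical; an exception stands in for the Fatal Error dialog):

```python
def number_of_pages(dump_data_text: str) -> int:
    """Return the page count from the 'NumberOfPages:' line of a dump_data report."""
    for line in dump_data_text.splitlines():
        if line.startswith("NumberOfPages:"):
            return int(line.split(":", 1)[1].strip())
    raise ValueError("NumberOfPages: line not found in dump_data report")


sample_report = "InfoKey: Title\nInfoValue: Invoices\nNumberOfPages: 6\n"
print(number_of_pages(sample_report))  # 6
```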

Code snippet:
 
temporarily removed

What it does: Loops through all of the pages of the current PDF file, calling [pdftotext.exe] to write the contents of each PDF page, one at a time, to a text file. If [pdftotext.exe] returns an error code, a Fatal Error dialog is displayed with some helpful information to troubleshoot the error.
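
The per-page conversion can be sketched in Python as a list of command lines, one per page. The sketch uses the pdftotext options `-f` (first page), `-l` (last page), and `-layout`; the function names and the text-file naming pattern are assumptions, not taken from the original script.

```python
from typing import List


def page_text_command(pdftotext_exe: str, pdf_path: str,
                      page: int, text_path: str) -> List[str]:
    """Build the pdftotext command that converts exactly one page to text."""
    return [pdftotext_exe, "-f", str(page), "-l", str(page), "-layout",
            pdf_path, text_path]


def all_page_commands(pdftotext_exe: str, pdf_path: str,
                      num_pages: int) -> List[List[str]]:
    """One command per page, writing page N to the hypothetical file page_N.txt."""
    return [
        page_text_command(pdftotext_exe, pdf_path, page, f"page_{page}.txt")
        for page in range(1, num_pages + 1)
    ]


for command in all_page_commands("pdftotext.exe", "D123456.PDF", 3):
    print(" ".join(command))
```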

Code snippet:
 
temporarily removed

What it does: Checks the first line of the page starting at the specified column to see if it contains the new document identifier/separator string. If it does, then this page begins a new document, and if it isn't the first document in the file, then it calls PDFtk (with the "shuffle" and "output" parameters) to write out the previous document to a new PDF file with a unique suffix. It also increments the suffix for the next new document. If PDFtk returns an error code, a Fatal Error dialog is displayed with some helpful information to troubleshoot the error.
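
As a rough Python illustration of writing out one document, the sketch below builds a PDFtk command line. Note that it swaps in PDFtk's `cat` operation with a page range rather than the "shuffle" parameter that the script builds up; the function name `extract_pages_command` is hypothetical.

```python
from typing import List


def extract_pages_command(pdftk_exe: str, pdf_path: str, first_page: int,
                          last_page: int, out_path: str) -> List[str]:
    """Build the PDFtk command that copies pages first_page..last_page to out_path."""
    if first_page == last_page:
        page_range = str(first_page)
    else:
        page_range = f"{first_page}-{last_page}"
    return [pdftk_exe, pdf_path, "cat", page_range, "output", out_path]


print(extract_pages_command("pdftk.exe", "D123456.PDF", 1, 3,
                            "D123456 John Smith-1.PDF"))
```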

Code snippet:
 
temporarily removed

What it does: If it is the first document in the PDF file, it sets the suffix to 1 (of course, there is no prior document to write out).

Code snippet:
 
temporarily removed

What it does: For any new document, whether or not the first one in the current PDF file, it renames/moves it to the destination folder.

Code snippet:
 
temporarily removed

What it does: If this page does not have the new document identifier/separator string starting in the specified column, then it is a continuation page, that is, part of the current document. The only action for this is to build up the "shuffle" parameter for the call to PDFtk.

Code snippet:
 
temporarily removed

What it does: Writes out the last document in the PDF file when there are no more pages to process.
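
Taken together, the page loop described in the last several snippets amounts to grouping pages into documents. Here is a compact Python sketch of that grouping, given the first line of every page. The function name `split_into_documents` is hypothetical, and treating leading pages without the separator as part of the first document is an assumption of this sketch.

```python
from typing import List, Tuple


def split_into_documents(first_lines: List[str], separator: str,
                         start_col: int) -> List[Tuple[int, int]]:
    """Return (first_page, last_page) for each document; pages are 1-based."""
    documents = []
    start = None
    for page, line in enumerate(first_lines, start=1):
        is_new = line.startswith(separator, start_col - 1)
        if is_new and start is not None:
            # A new document begins here, so the previous one ended on the prior page.
            documents.append((start, page - 1))
        if is_new or start is None:
            start = page
    if start is not None:
        # No more pages to process: write out the last document.
        documents.append((start, len(first_lines)))
    return documents


# Six pages: invoices on pages 1 to 3, page 4, and pages 5 to 6.
pages = ["    Customer Name: A", "    more", "    more",
         "    Customer Name: B",
         "    Customer Name: C", "    more"]
print(split_into_documents(pages, "Customer Name:", 5))
# -> [(1, 3), (4, 4), (5, 6)]
```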

Code snippet:
 
temporarily removed

What it does: The previous program and this one both write out an Operation Completed dialog with statistics from the run, as shown above. The difference in this new program is that it offers to save the operational statistics in a text file. If the user says Yes, it creates a file with the name Operational_Statistics_YYYY-MM-DD_HH.MM.SS.txt in the destination folder (where YYYY-MM-DD_HH.MM.SS are the ending date and time of the run).

(Screenshot: OpStats saved)

The text file looks like this:

Operational Statistics from Batch-Mass-Split-Rename-Move-PDF-Files
Beginning date and time: 2013-02-11/18:19:22
Number of PDF files processed: 1,969
Number of non-PDF files ignored: 14
Ending date and time: 2013-02-11/18:29:24
Elapsed time (minutes:seconds): 10:2
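
The statistics file name and the elapsed-time value could be produced along these lines in Python. The function names are hypothetical, and this sketch zero-pads the seconds, which differs from the unpadded value in the sample output above.

```python
from datetime import datetime


def stats_file_name(ended: datetime) -> str:
    """Operational_Statistics_YYYY-MM-DD_HH.MM.SS.txt, using the ending date and time."""
    return ended.strftime("Operational_Statistics_%Y-%m-%d_%H.%M.%S.txt")


def elapsed_minutes_seconds(began: datetime, ended: datetime) -> str:
    """Elapsed time as minutes:seconds, with the seconds zero-padded."""
    minutes, seconds = divmod(int((ended - began).total_seconds()), 60)
    return f"{minutes}:{seconds:02d}"


began = datetime(2013, 2, 11, 18, 19, 22)
ended = datetime(2013, 2, 11, 18, 29, 24)
print(stats_file_name(ended))                 # Operational_Statistics_2013-02-11_18.29.24.txt
print(elapsed_minutes_seconds(began, ended))  # 10:02
```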

That's it! I hope this helps the OP as well as other EE members. Although I did a bit of generalization, I realize that the solution is still rather specific to the OP's requirements. However, by providing the source code, I hope that other folks with similar needs will be able to modify the program to suit their purposes.

If you find this article to be helpful, please click the thumbs-up icon below. This lets me know what is valuable for EE members and provides direction for future articles. Thanks very much! Regards, Joe

17 Comments
 
 

Administrative Comment

by:Eric AKA Netminder
joewinograd,

Congratulations; your article has been published.

ericpete
Page Editor

Expert Comment

by:captain
Excellent article, Joe. As a fellow expert in the Acrobat TA, I know that this solves a hot-topic issue that gets asked every so often.

Author Comment

by:Joe Winograd, EE MVE 2015&2016
captainreiss,
Thank you for the compliment – I really appreciate it! Regards, Joe

Expert Comment

by:igges
joewinograd,
this worked perfectly for me, thank you for your work! I will be using this monthly to treat PDF files and save a lot of time.
igges

Author Comment

by:Joe Winograd, EE MVE 2015&2016
Hi igges,
You're welcome! And thanks again to you for letting me know that this article is helpful to you and that the program works well for you. It is especially nice to know that you will be using it regularly to save a lot of time. As an author of numerous articles, I can tell you that it is very rewarding to get feedback like this from EE members. Regards, Joe

Administrative Comment

by:Eric AKA Netminder
Joe,

Your article has been selected by the Page Editors as EE-Approved. Congratulations!

ericpete
Page Editor

Author Comment

by:Joe Winograd, EE MVE 2015&2016
Hi ericpete,
Thanks to you and the other Page Editors for the EE-Approved award – truly appreciated! Regards, Joe

Expert Comment

by:igges
Hi

When using the code I had the problem that the name was not in a fixed column but changed its position. The consequence was that in some generated files the name was cut off, and in others there were too many empty spaces. But in the original PDF file, the names are displayed perfectly aligned.

Cause:
By testing Xpdf directly I found out that, for whatever reason, it changed the alignment when it created the text files.

Solution:
The solution was to add
-fixed 10

into the above script (just after "-layout"). This assumes fixed-pitch (or tabular) text. I settled on the value "10" after trying out several other values. It just worked best for me, but the right value may differ for other files.

Hope that helps others
igges

Author Comment

by:Joe Winograd, EE MVE 2015&2016
Hi igges,

Thanks for posting the problem and your solution. The Xpdf documentation for pdftotext says this for the -fixed option:
-fixed number
Assume fixed-pitch (or tabular) text, with the specified character width (in points). This forces physical layout mode.
I don't understand why the character width in points affects the column number. Seems to me that a character of any number of points can be in a column, and that column number would be the same no matter how many points wide the character is. But your results say otherwise, so I'm perplexed. I'll do some research with the Xpdf folks and post back here with anything of value that I find out. In the meantime, I'm glad to hear that you found a setting that works for you – and thanks for posting it to help other members who may be experiencing a similar issue.

Two other points. First, when I looked at that code segment with the call to pdftotext, I noticed a bug in the ErrorLevel<>0 path. It has this:

MsgBox,16,Fatal Error,Error Level %ErrorLevel% from PDFtk`n\

But the error is coming from pdftotext, not PDFtk, so it should be this:

MsgBox,16,Fatal Error,Error Level %ErrorLevel% from pdftotext`n\

Second, EE now allows AHK file types to be uploaded. So I'm going to fix the MsgBox line above, and may also make some other changes, depending on what I find out from the Xpdf folks. When I do, I'll upload the modified program as an AHK file and remove the portion of the article that talks about having to rename the TXT file to AHK. Regards, Joe

Author Comment

by:Joe Winograd, EE MVE 2015&2016
Oops, don't know how those ending backslashes got in there. Those two lines should have been:

MsgBox,16,Fatal Error,Error Level %ErrorLevel% from PDFtk`n
MsgBox,16,Fatal Error,Error Level %ErrorLevel% from pdftotext`n

Regards, Joe

Expert Comment

by:igges
I have to add that I had this problem with only 1 PDF file out of 3. The "treatment" of the other 2 works smoothly.

Author Comment

by:Joe Winograd, EE MVE 2015&2016
igges,
Thanks for the clarification. I emailed my contact at Xpdf yesterday and will post back here if and when I hear from him. Thanks again for your feedback – it is very helpful in improving the quality and usefulness of articles. Regards, Joe

Expert Comment

by:EgoBubble
Great article. Seems to be exactly what I need. But where is the "Batch-Mass-Split-Rename-Move-PDF-Files.ahk" file necessary for it to work? Please let me know. Thanks.

Author Comment

by:Joe Winograd, EE MVE 2015&2016
Hi EgoBubble,
Thanks for the compliment — much appreciated! I'm not ready to re-post the source code here, but there may be some other way that I can help you. I'll reply to the message you sent via the Messaging System in a short while. Regards, Joe

Expert Comment

by:EgoBubble
Alright, greatly appreciated Joe.

Expert Comment

by:Member_2_7970298
How do I obtain a copy of the AutoHotkey script?

Author Comment

by:Joe Winograd, EE MVE 2015&2016
Hi New Member,

When I removed the source code last year from six articles that I published here at EE, my intention was that the removal be temporary. I began a project to rewrite all of the programs in my portfolio in order to generalize them for a broader audience and to have a standard user interface, including both a GUI (graphical user interface) and, where it makes sense, a CLI (command line interface). It wound up being a much larger effort than I anticipated, and I'm still not ready to post or distribute the source code for this program (or any of the other five published at EE — and I don't know when or even if that will be, for a variety of reasons).

I have created customized versions of these various programs for EE members who became clients of mine. I provided licenses for the run-time programs (the executables, i.e., the compiled EXE files) for an agreed-upon fee, but I did not provide the source code. I did this previously when EE had the "Hire Me" button, but that no longer exists. The mechanism now at EE for such work is the new Gigs feature, if that interests you.

Regards, Joe