How to Combine-Merge PDF Files in Many Subfolders

Joe Winograd, EE Fellow 2017, MVE 2016, MVE 2015
50+ years in computer industry. Everything from development to sales. CIO. Document imaging. EE MVE 2015, EE MVE 2016, EE FELLOW 2017.
Update 21-May-2015: I temporarily removed the source code to make major changes to the program. Regards, Joe

INTRODUCTION

This article presents a solution to a question asked here at Experts Exchange. The situation is that there's a large number of subfolders (400 in the original question), each of which has a number of PDF files (two in the original question). The goal is to combine/merge the PDF files in each subfolder (in ascending date order) into a single PDF file, storing the combined file in each subfolder. The source PDF files in each subfolder may have any file names and the user should be able to specify the file name of the combined file.

REQUIRED SOFTWARE

The method presented in this article requires AutoHotkey, an excellent (free!) programming/scripting language. The quick explanation for installing AutoHotkey is to visit its website. A more comprehensive explanation is to read my EE article, AutoHotkey - Getting Started. After installation, AutoHotkey will own the AHK file type, supporting the solution discussed in the remainder of this article.

The program utilizes another excellent (free!) piece of software: PDF Toolkit (PDFtk). It comes in both command line and GUI versions. The command line version is called PDFtk Server. Don't be misled by "Server" in the name. I don't know why they called it that, but it's just an executable (pdftk.exe, with a supporting DLL, libiconv2.dll) that runs on XP, Vista, W7, and W8. That is, it does not have to run on a "server" OS.

In order to implement the solution in this article, you must have AutoHotkey and PDFtk on your computer (downloads are available at the links above). The solution should work on XP, Vista, W7, and W8 (32-bit and 64-bit), but I did all of the development and testing on W7/64-bit, so that is the only certified platform.

RUNNING THE PROGRAM

Download the source code for the program, which is attached to this article in a plain text file called Combine-Merge-PDF-files-20140826.ahk. Once AutoHotkey is installed, it owns the AHK file type, so simply double-click the downloaded source code file in Windows/File Explorer (or whatever file manager you prefer) to run it. You may also compile the source code into a stand-alone/no-install executable by right-clicking the source code file and selecting Compile Script:

AutoHotkey Compile Script
After compiling it, simply run the EXE file that the compiler created.

SOLUTION DESCRIPTION

The solution works on any number of subfolders and any number of PDF files in each subfolder (it ignores non-PDF files). It provides an option to combine/merge the files in one of three sort orders: by file name, by modified date ascending (oldest first), or by modified date descending (newest first).
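
The two date-based orders can be sketched in AutoHotkey v1 as follows. This is a minimal illustration under stated assumptions, not the program's actual code: the function names BuildDateList and NamesInDateOrder are hypothetical, and Folder is assumed to end with a backslash. It relies on A_LoopFileTimeModified having the form YYYYMMDDHH24MISS, so a plain text sort is also a date sort.

```autohotkey
; AutoHotkey v1 sketch. BuildDateList and NamesInDateOrder are hypothetical
; names for illustration; they are not taken from the original program.

; Collect one "timestamp<TAB>name" line for every PDF in Folder.
; Folder is assumed to end with a backslash.
BuildDateList(Folder)
{
  FileList := ""
  Loop, Files, %Folder%*.pdf
    FileList .= A_LoopFileTimeModified . "`t" . A_LoopFileName . "`n"
  Return FileList
}

; Sort the lines and return the quoted file names separated by spaces.
NamesInDateOrder(FileList, NewestFirst := false)
{
  If (NewestFirst)
    Sort, FileList, R
  Else
    Sort, FileList
  Names := ""
  Loop, Parse, FileList, `n
  {
    If (A_LoopField = "")  ; skip the blank item after the last linefeed
      Continue
    Name := SubStr(A_LoopField, InStr(A_LoopField, "`t") + 1)
    Names .= """" . Name . """ "
  }
  Return RTrim(Names)
}
```

The quoted, space-separated result is the kind of string that could be passed to PDFtk as its list of input files.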

The remainder of this section discusses the solution in detail by going through the user interface, showing the screenshots from various executions of the program (all screenshots are from a W7/64-bit system).

The first step is to check for the installation of PDFtk Server by looking for pdftk.exe in the default locations. If it isn't found, you will see this error dialog:

PDFtk not found
If you installed it in a non-default location, click OK and you will get a browse-for-file dialog:

Navigate to PDFtk
Navigate to pdftk.exe and select it.
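
A check of this kind might look like the sketch below (AutoHotkey v1). It is an illustration only: FirstExisting is a hypothetical helper, the two candidate paths are assumed install locations rather than the list the program actually uses, and the dialog wording is made up for the example.

```autohotkey
; AutoHotkey v1 sketch. FirstExisting is a hypothetical helper and the two
; candidate paths are assumed install locations; adjust them to your system.
FirstExisting(Paths)
{
  For Index, Path in Paths
  {
    If (FileExist(Path))
      Return Path
  }
  Return ""
}

Candidates := []
Candidates.Push(A_ProgramFiles . "\PDFtk Server\bin\pdftk.exe")
Candidates.Push("C:\Program Files (x86)\PDFtk Server\bin\pdftk.exe")
PdftkExe := FirstExisting(Candidates)
If (PdftkExe = "")
{
  ; Not found in the assumed locations: tell the user, then let them browse
  MsgBox, 4112, Error, PDFtk not found in the default locations`nClick OK to navigate to pdftk.exe
  FileSelectFile, PdftkExe, 3, , Navigate to pdftk.exe, Executable (*.exe)
}
```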

The next step is to select the root folder:

Select root folder
You may navigate to it or type/paste it in. It looks for an ending backslash on the path name and if one was not entered, it appends one (in other words, it works whether or not you include the ending backslash in the path). It checks to see if you entered a folder and if the folder exists. If either is not true, it gives you the opportunity to try again or exit the program:

Root folder not specified
Root folder does not exist
Note: whether or not a folder can be reported as null with the browse-for-folder dialog depends on the operating system, so the program checks for it.

Now it's time to enter the parameters for the run:

Enter Parameters
In the top box, enter the name for the combined/merged file (without the .PDF file type). You must enter a name, otherwise it shows this:

Combined file name not specified
The program then checks for characters that are invalid in a file name and displays this dialog if it finds any:

Invalid character
Once the file name is valid, it appends .pdf as the file type. For example, if you enter

combined PDF file

in the box, then the name of each combined/merged file will be

combined PDF file.pdf
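
The name check and the appended file type can be sketched as follows (AutoHotkey v1). HasInvalidFileNameChar is a hypothetical helper written for this illustration; the program's own check may differ. The character set is the one Windows rejects in file names.

```autohotkey
; AutoHotkey v1 sketch. HasInvalidFileNameChar is a hypothetical helper;
; the original program's own check may differ.
HasInvalidFileNameChar(Name)
{
  ; Characters Windows does not allow in a file name: \ / : * ? " < > |
  Return RegExMatch(Name, "[\\/:*?""<>|]") > 0
}

CombinedName := "combined PDF file"
If (CombinedName = "")
  MsgBox, 4112, Error, Combined file name must be specified
Else If (HasInvalidFileNameChar(CombinedName))
  MsgBox, 4112, Error, File name contains an invalid character
Else
  CombinedName .= ".pdf"   ; becomes "combined PDF file.pdf"
```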

Then select a radio button for the order in which to combine/merge the files (default is By file name). Finally, select a radio button for which folders to process: subfolders only (the default), the root folder only, or the root folder and subfolders.

The program now processes the selected folders (if subfolders are selected, they are processed to an unlimited depth). It calls PDFtk to combine all of the PDF files in each subfolder into a single PDF file in each subfolder, with a file name as described above. During processing, it displays a green progress bar that moves to the right so that you know it is working, not hanging. The progress bar also displays the name of the current subfolder being processed and the percentage completion:

Progress Bar
The percentage completion is based on the number of folders (not on time or number of PDFs).

If a call to PDFtk results in an error code, you will see this dialog:

PDFtk fatal error
It shows you the folder causing the error so you may investigate that folder to determine the problem. The most common reason for this error is that an input file has the same name as the output file — PDFtk does not allow this. This would happen if you ran the program a second time, giving it the same combined file name without having first deleted (or moved) the combined file from the previous run.
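
For readers who want to see the shape of such a call, here is a minimal sketch in AutoHotkey v1. It assumes PDFtk's cat operation with an output file, run with the folder as the working directory. The function names and the Hide and UseErrorLevel options are choices made for this illustration, not necessarily what the program does.

```autohotkey
; AutoHotkey v1 sketch. The function names and the run options are assumptions
; for illustration; "cat output" is PDFtk's own merge operation.

; Build the command line for one folder. InputFiles is either a wildcard
; such as *.pdf or a space-separated list of quoted file names.
BuildPdftkCommand(PdftkExe, InputFiles, OutputName)
{
  Return """" . PdftkExe . """ " . InputFiles . " cat output """ . OutputName . """"
}

; Run PDFtk with Folder as the working directory and return its exit code.
MergeFolder(PdftkExe, Folder, InputFiles, OutputName)
{
  Command := BuildPdftkCommand(PdftkExe, InputFiles, OutputName)
  RunWait, %Command%, %Folder%, Hide UseErrorLevel
  Return ErrorLevel   ; PDFtk's exit code, or the word ERROR if the launch failed
}
```

A nonzero return value from MergeFolder is the condition under which a caller would report the folder, as in the dialog above.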

When the program finishes, it displays this Operation Completed dialog:

Operation Completed
The operational statistics are stored in a plain text file in the source folder. The file name of this results file is "Operational_Statistics_YYYY-MM-DD_HH.MM.SS.txt", where the date/time is the beginning time of the run. Since seconds (SS) are in the file name, two runs would have to begin in the same second to produce a duplicate name, so overwriting a file is not a practical concern.

The results file contains this: 

Operational Statistics from Combine-Merge-PDFs
Name of merged file: combined PDF file.pdf
Root folder: D:\0tempD\test combine\
Sort order: By file name
Folders processed: Root folder and subfolders
Number of folders processed: 11
Beginning date and time: 2014-08-26_01.52.39
Ending date and time: 2014-08-26_01.52.42
Elapsed time (minutes:seconds): 0:3

The elapsed time measurement begins after the parameters are entered so that it measures just the processing time, not including the time spent waiting for user input.
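
The time stamp and the elapsed time can be produced along these lines (AutoHotkey v1). This is a sketch under assumptions: FormatElapsed, BeginStamp, Elapsed, and ElapsedText are illustrative names, and the program may compute these values differently.

```autohotkey
; AutoHotkey v1 sketch. Variable and function names are illustrative only.

; Split a number of seconds into the minutes:seconds form used in the results file.
FormatElapsed(TotalSeconds)
{
  Return (TotalSeconds // 60) . ":" . Mod(TotalSeconds, 60)
}

BeginStamp := A_Now                      ; YYYYMMDDHH24MISS, taken after the parameters are entered
FormatTime, begintime, %BeginStamp%, yyyy-MM-dd_HH.mm.ss
; ... process the folders here ...
Elapsed := A_Now
EnvSub, Elapsed, %BeginStamp%, Seconds   ; Elapsed = seconds between the two time stamps
ElapsedText := FormatElapsed(Elapsed)
```

Seconds are not zero-padded here, which matches the 0:3 shown in the sample results file.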

If you find this article to be helpful, please click the thumbs-up icon below. This lets me know what is valuable for EE members and provides direction for future articles. Thanks very much! Regards, Joe
25 Comments
 

Expert Comment

by:David Hood
Great article. However, I cannot get it to work. Every time I run the file, I get the error, "String too long", and the program hangs. I have to kill it from task manager. See attached screen capture.
Capture.JPG
 
LVL 58
Hi David,

First, thanks for the compliment — I really appreciate hearing that! I suppose what's happening is that the string "PDFtkInputFiles" is getting too large. I build it up in a loop, concatenating it to itself, and then feeding it to the PDFtk command line call. To test this theory, please do two things. First, put this MsgBox command right before the RunWait:

MsgBox,4096,Debug,%PDFtkInputFiles%

Post a screenshot of that MsgBox dialog. Second, try the program on a single folder with a small number of files — I want to make sure that it runs at all in your environment.

Btw, you caught me on a very bad Friday and upcoming weekend. I'll have little, if any, time to work on this after this post. But I'll have time on Monday. Regards, Joe
 

Expert Comment

by:David Hood
It worked beautifully if I chose a parent folder that had subfolders containing a few pdf files. However, if there was no subfolder and I ran it on a single folder that contained pdf files, it ran through and gave the successful confirmation but there was no output file.

When I put in the debug code you supplied and ran it on my full set of folders, this is the debug message:
Capture2.JPG
 
LVL 58
> However, if there was no subfolder and I ran it on a single folder that contained pdf files, it ran through and gave the successful confirmation but there was no output file.

That's correct. The program, by design, does not process the root folder — just all of its subfolders. I could change the program to process the root folder, too, but for expediency right now, I suggest that you simply create a new folder and put the folder you want to process under it (and, of course, give the new, higher-level folder as the root folder parameter to the program).

> I have to kill it from task manager.

That works, but there's an easier way to stop the program. Right-click on its icon in the system tray (notification area) and select Exit. The default AutoHotkey icon looks like this:

AutoHotkey default icon
Its systray context menu looks like this:

AutoHotkey systray context menu
Click Exit to terminate the program. Regards, Joe
 
LVL 58
Hi David,

First, I'll explain why you received the "String too long" error message. The debug dialog you posted shows that you are sorting by date (either ascending or descending). In order to achieve that, the program builds up the input files (sorted by date) in a single variable that it feeds to the PDFtk command line call. The problem is that you have a lot of files in one folder, which creates a variable that results in a command line call with more than 8,191 characters, the maximum command-line length that Microsoft documents for the Windows command prompt. You can see that in the dialog you posted, which shows 200 files, but, more importantly, more than 8,000 characters (and more than 8,191 total when combined with the other parameters on the resulting command line).

Given enough time and money, I could rewrite the code to make individual calls to PDFtk with just one file at a time. That is, instead of building up the input variable to feed to PDFtk, I could build up the output file via successive calls to PDFtk. I looked at what it would take to do that and concluded that it is beyond the scope of this article. In lieu of that, I recommend the following workarounds:

o  Sort the files by name instead of date. When sorting by name, I use a wildcard (*.pdf) as the input parameter to PDFtk, so no matter how many files you have, the command line call won't be a large number of characters.

o  If you must sort by date, split the files into multiple folders so that no one folder has more than 8,000 or so characters of file names. Each folder should be in the date sequence you want (ascending or descending). Then combine each folder with the program, and then combine those results with the program.

o  Shorten the source folder name. For example, in the debug dialog you posted, it is "C:\output\08-15-2014\A02", which is 24 characters. Changing that to "C:\x" saves 20 characters for each file. For the 200 files you posted, that saves 4,000 characters! This would allow in the neighborhood of 400 files for you instead of 200.
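
The second workaround can also be approximated in code rather than by moving files. The sketch below (AutoHotkey v1) groups an ordered list of quoted file names into batches that each stay under a character budget. Each batch could then be merged into an intermediate file, and the intermediate files merged in turn. SplitIntoBatches and the 8000-character default are assumptions for illustration, not part of the program. The budget is deliberately below 8,191 to leave room for the PDFtk path and the output file name.

```autohotkey
; AutoHotkey v1 sketch. SplitIntoBatches and the 8000-character default are
; assumptions for illustration, not part of the original program.
; QuotedNames is a simple array of already-quoted file names in merge order.
SplitIntoBatches(QuotedNames, MaxChars := 8000)
{
  Batches := []
  Current := ""
  For Index, Name in QuotedNames
  {
    ; Start a new batch when adding this name would exceed the budget
    If (Current != "" && StrLen(Current) + 1 + StrLen(Name) > MaxChars)
    {
      Batches.Push(Current)
      Current := ""
    }
    Current .= (Current = "" ? "" : " ") . Name
  }
  If (Current != "")
    Batches.Push(Current)
  Return Batches
}
```

Because the names are appended in order, merging the batches in sequence preserves the date order of the original list.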

Second, I added an enhancement to the program that allows the user to include the root folder in the processing. Further, it even allows the root folder alone to be processed, that is, the subfolders will not be processed. It prompts for this via a new drop-down in the "Enter Parameters" dialog called "Select folders to process". It looks like this:

Select folders to process
The default is subfolders only, which is what it was before this enhancement. Right after submitting this post, I'll upload the new source code AHK file containing the enhancement (with today's date in the file name). Please let me know if it works well for you. Regards, Joe
 

Expert Comment

by:David Hood
Joe, thank you. You've gone over and above! I will give it a shot with the sort by name. Sorting by date is preferred, but not a necessity. What I'm actually trying to do is make this an automated process with no user interaction. So, I had edited to specify the sort by date, but I can easily change that. The only problem I have now is that I cannot specify the parent folder. Even if I set the variable, "SourceFolder", the script still wants to get that folder from GUI input. If I figure it out, I'll let you know.

The script is ideal for what I need in that it recurses subfolders and merges all the pdfs in each subfolder. I just need it to run on a nightly basis. The parent folder will always be named the current day's date in format MM-DD-YYYY. I read ahk's documentation, and have tried setting SourceFolder to %A_MM%-%A_DD%-%A_YYYY%. It translates correctly, but it's not inputting it into the script.
 
LVL 58
David,

> Even if I set the variable, "SourceFolder", the script still wants to get that folder from GUI input. If I figure it out, I'll let you know.

Comment out (or delete) this block of code (lines 44-67 in the latest attached AHK file):

Loop
{
  FileSelectFolder,SourceFolder,,2,Navigate to root folder or type/paste name in Folder box
  If (SourceFolder="")
  {
    MsgBox,4149,Error,Root folder must be specified`nClick Retry to try again or Cancel to exit
    IfMsgBox,Cancel
      ExitApp
    Else
      Continue
  }
  StringRight,FolderLastChar,SourceFolder,1
  If (FolderLastChar<>"\")
    SourceFolder:=SourceFolder . "\"
  IfNotExist,%SourceFolder%
  {
    MsgBox,4149,Error,Root folder %SourceFolder% does not exist`nClick Retry to try again or Cancel to exit
    IfMsgBox,Cancel
      ExitApp
    Else
      Continue
  }
  Break
}

In its place, assign the SourceFolder variable to whatever value you want. For example:

SourceFolder:="D:\" . A_MM . "-" . A_DD . "-" . A_YYYY . "\"

Or if there's a higher-level folder that contains the MM-DD-YYYY parent folder, it would be something like this:

SourceFolder:="D:\rootfolder\" . A_MM . "-" . A_DD . "-" . A_YYYY . "\"

Remember to end it with a backslash — the subsequent code expects it to be there.

This will prevent the GUI prompt for the source folder. It sounds as if you've figured out how to hard-code the other variables with assignment statements, so you should be good-to-go now, but don't hesitate to let me know if you have other issues.

Btw, with your comments that I've gone over and above, and that the script is ideal for what you need, it seems fair to say that you have found the Article to be helpful. So if you wouldn't mind giving it a Helpful vote (sometimes called an "upvote"), I'd appreciate it. If you haven't done that before, scroll to the bottom of the Article, but before the reader comments begin (so scroll up from here). On the right side, you'll see this:

article helpful
Just click the green check-mark. If you find other Articles here at EE to be helpful, please upvote them — Authors will truly appreciate it! Regards, Joe
 
LVL 58
David,

I was just thinking — there are some other issues with respect to running unattended. You may have figured these out already, but here goes:

o  With the new enhancement, you'll have to hard-code that variable, too. That is:

FoldersToProcess:="Root folder and subfolders"

or

FoldersToProcess:="Subfolders only"

or

FoldersToProcess:="Root folder only"

o  The easiest thing to do is delete all the code from the "GetParams:" label to the "ParamsOK:" label. In fact, delete everything from the "Loop" statement above that block, which is the GUI for inputting the source folder. In its place, put all your hard-coded assignment statements. And since you're deleting all the GUI code, you may delete the "ButtonOK:" and "BadWinVar(ft)" code at the bottom. In other words, delete everything after the "ExitApp" command at the end of the "Successful Completion" MsgBox code.

o  Speaking of the "Successful Completion" MsgBox, that's another feature to delete for unattended operation. I suggest replacing it with code that writes the operational statistics to a file.

Delete this block of code:

MsgBox,4160,Successful Completion,
(
Root folder: %SourceFolder%
Folders processed: %FoldersToProcess%
Number of folders processed: %NumFolders%
Beginning date and time: %begintime%
Ending date and time: %endtime%
Elapsed time (minutes:seconds): %mins%:%secs%
)

Replace it with this code:

opstats:=SourceFolder . "Operational_Statistics_" . begintime . ".txt"
FileAppend,
(
Operational Statistics from Combine-Merge-PDFs
Root folder: %SourceFolder%
Folders processed: %FoldersToProcess%
Number of folders processed: %NumFolders%
Beginning date and time: %begintime%
Ending date and time: %endtime%
Elapsed time (minutes:seconds): %mins%:%secs%
)
,%opstats%

This will create a plain text file called <Operational_Statistics_YYYY-MM-DD_HH.MM.SS.txt> in the source folder. Since seconds (SS) are in the file name, there will never be an overwriting or duplicate file issue. Proper programming would check for an error from the FileAppend command:

If (ErrorLevel<>0)
{
  ; error processing code here, for example exit with a nonzero exit code
  ExitApp,1
}

But since it's your code and you'll have write permission on the source folder, it's pretty safe to let it rip. Regards, Joe
 
LVL 58
Update on 26-Aug-2014: I changed the new "folders to process" parameter from a drop-down list to radio buttons. So the new Enter Parameters dialog looks like this:

Enter Parameters
I also changed the delivery of the operational statistics from a message box to a plain text file (stored in the source folder) and put some additional information in it. The file name of the results file is <Operational_Statistics_YYYY-MM-DD_HH.MM.SS.txt> and the final dialog box now looks like this:

Operation Completed
The results file contains this:

Operational Statistics from Combine-Merge-PDFs
Name of merged file: combined PDF file.pdf
Root folder: D:\0tempD\test combine\
Sort order: By file name
Folders processed: Root folder and subfolders
Number of folders processed: 11
Beginning date and time: 2014-08-26_01.52.39
Ending date and time: 2014-08-26_01.52.42
Elapsed time (minutes:seconds): 0:3

After making this post, I will modify the Article accordingly and attach the new AHK source code file (with today's date in the file name). Regards, Joe
 
LVL 46

Expert Comment

by:aikimark
This could also be done with VBScript, merging groups of PDFs in batches that fall below the 8K command line string limit.  Then merging the merged files from the prior step(s).
 
LVL 58
Yes, it can be done with any programming language that is able to make a call to the command line PDFtk. I chose AutoHotkey as the programming language, but VBScript would be fine, and many others. There are also tools besides PDFtk that could be used, such as the popular iText (and iTextSharp, its .NET port). It may even be that VBScript (and other languages) have built-in PDF merging subroutines. I don't know, as I rarely use VBScript. Regards, Joe
 
LVL 46

Expert Comment

by:aikimark
I suggested VBScript, since it is part of the Windows OS.  There is nothing to install.

It can also be done with the PowerShell script engine that is part of the Windows OS.
 
LVL 58
Can VBScript and PowerShell merge PDF files without using non-Windows software?
 
LVL 46

Expert Comment

by:aikimark
I don't know about PS, but I use VBScript to launch PDFtk for one of my client applications.  If there is an intrinsic .NET library, then PowerShell would be able to do it.  I don't think one exists, so PDFsharp, iTextSharp (which you mentioned above), CutePDF, and the like, would need to be installed/registered in the .NET GAC.

It is possible that handling PDFs will be such a common requirement that Microsoft will add such functionality to future PS versions.
 
LVL 58
So you suggested VBScript and PowerShell because they're part of the Windows OS and there is nothing to install, but you still need to install non-Windows software, like PDFtk, iText, etc., to perform the PDF merging. I have no objection to installing the third-party AutoHotkey (which is literally a one-minute install) for the programming language, which I've come to know and love. I'm sure I could learn VBScript and PowerShell, but I've been extremely happy with the scripting/programming functionality of AutoHotkey during the past few years. Regards, Joe
 
LVL 46

Expert Comment

by:aikimark
I'm using an older version of PDFtk and don't remember going through an installation process, just downloading and unzipping files.

I would have mentioned KiXtart, but that would be another scripting shell, like AutoHotkey.  The secret sauce is PDFtk.
 
LVL 58
PDFtk (PDF Toolkit) comes in both command line and GUI versions. The command line version is called PDFtk Server and may be downloaded here:
http://www.pdflabs.com/tools/pdftk-server/

Don't be misled by "Server" in the name. I don't know why they called it that, but it's just an executable (pdftk.exe, with a supporting DLL, libiconv2.dll) that runs on XP, Vista, W7, and W8 (it does not have to run on a "server" OS — it also runs on Mac, but I've never used it on that). So you are correct — it doesn't require an installation, but simply access to <pdftk.exe> and the supporting DLL, <libiconv2.dll>. AutoHotkey does require an installation, but if you have a spare 60 seconds, you'll be done. :)  At the end of the day, both PDFtk and AutoHotkey place non-Microsoft software on your Windows PC. Users need to decide for themselves if that's acceptable. Cheers, Joe
 
LVL 46

Expert Comment

by:aikimark
At my client site, I'm limited in what I can put on their PCs as I don't have an admin account.

If Office is installed, one can also use the VBA environment those products provide in place of VBScript as a run-time platform.
 
LVL 58
Would your client let you put PDFtk on their PCs? Installation or not, it is an executable, with all of the potential dangers of an EXE (and DLL).
 
LVL 46

Expert Comment

by:aikimark
They seem to let me put executables (scripts and EXEs/DLLs) on the PC doing the work as long as I don't have to do an installation.  I can do things as long as I don't need any permissions or elevated privs.
 
LVL 58
Well, that's good news. In addition to PDFtk, you can use all the Xpdf tools, as well as the NirSoft utilities. Great stuff!

Also, using the AutoHotkey compiler on your own computer (which is installed as part of a standard AutoHotkey installation), you can compile the program described in this article (How to Combine-Merge PDF Files in Many Subfolders) into a stand-alone/no-install EXE file which can then be run on your client's machines. Indeed, any AutoHotkey program can be compiled into a stand-alone/no-install EXE file with the AutoHotkey compiler — see the first screenshot in this article. Regards, Joe
 

Expert Comment

by:Pinakin Mistry
Dear sir,
Kind Attn. Mr. Joe Winograd
Could you kindly re-attach the removed source code? I am desperately looking for this particular solution for merging several PDF files into a single PDF across multiple subfolders. This is what I need, so please re-upload it. If the software has been updated with added features, please re-upload it soon.

Regards
Pinakin
 
LVL 58
Hi Pinakin,
I am not yet ready to re-attach the source code. However, I did receive your email and may be able to help you in another way, since you have an immediate need. I'll reply to your email soon. Regards, Joe
 

Expert Comment

by:Centex Aps
Hi

Will the "Combine-Merge-PDF-files-20140826.ahk" file not be attached again?
 
LVL 58
Hi Centex,
I've decided not to post the full program. I'll be rewriting the article as a "design roadmap" with some crucial code snippets, such as how to call PDFtk Server, but will not be posting the complete source code. Regards, Joe