Merging 10,000 PDFs into ONE searchable file

Patrick O'Dea
This is a one-off exercise:

I have 10,000 PDFs in about 500 folders.
Each PDF is named after a person and contains a photo of that person.

Example:
johnsmith.pdf has a picture of John Smith!

I want to start at a high-level folder and drill down into all lower folders looking for PDFs.

REQUIREMENT: One large PDF containing all the smaller ones.

Also, I need to be able to search the PDF and find the photo of "John Smith" or whoever.
Developer
Fellow 2017
Most Valuable Expert 2018
Commented:
Hi Patrick,

Here's how I would attack it:

o  Write a program/script in whatever programming/scripting language you prefer to recurse into all subfolders of a specified source folder.

o  Look for all PDF files in each subfolder and extract the file name of each PDF file into a variable.

o  Call a utility to put the file name in the header or footer (whichever you prefer) of the PDF. This makes the name searchable. You have to decide whether you want the fully qualified file name, such as...

C:\photos\webinars\2016Q3\John Smith.pdf

...or just the person's name, or something in between the two. You must have already solved the issue of more than one John Smith being in a subfolder, since there can't be a duplicate file name in the same folder. Of course, there may be a John Smith in more than one subfolder, so a search for just the name in the merged file would get more than one hit. Maybe you don't care about that, or maybe that's a good reason for putting the fully qualified file name in the header/footer. I recommend writing the header/footer to a temporary copy of each file so that the original file is not modified.

o  Call a utility to merge all of the new PDF files that have the file name in the header/footer into a single, combined PDF file.
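As a rough illustration of the first two steps and the naming decision in the third, here is a minimal Python sketch. It only finds the PDFs and works out the text that would go in the header/footer; the stamping and merging utilities are not shown. The function names (find_pdfs, label_for, plan) and the three style values are my own placeholders, not part of any particular tool.

```python
import os
from pathlib import Path
from typing import Iterator, List, Tuple


def find_pdfs(root: Path) -> Iterator[Path]:
    """Yield every PDF under root, recursing into all subfolders."""
    for dirpath, dirnames, filenames in os.walk(root):
        dirnames.sort()  # visit subfolders in a predictable order
        for name in sorted(filenames):
            if name.lower().endswith(".pdf"):
                yield Path(dirpath) / name


def label_for(pdf: Path, root: Path, style: str = "name") -> str:
    """Build the text intended for the header/footer (hypothetical styles).

    "name"     -> just the file name without the extension
    "relative" -> path relative to the source folder (in between)
    "full"     -> fully qualified file name
    """
    if style == "name":
        return pdf.stem
    if style == "relative":
        return str(pdf.relative_to(root))
    if style == "full":
        return str(pdf.resolve())
    raise ValueError(f"unknown style: {style}")


def plan(root: Path, style: str = "name") -> List[Tuple[Path, str]]:
    """Pair each PDF with its label, ready to hand to a stamping utility."""
    return [(pdf, label_for(pdf, root, style)) for pdf in find_pdfs(root)]
```

Calling plan(Path(r"C:\photos"), style="relative") would give one (file, label) pair per PDF. The "relative" style is one way to tell apart two John Smiths in different subfolders without printing the whole drive path.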

If you don't have the expertise to do this, what is your budget for the project? It should make for a very reasonably priced Gig.

Regards, Joe
Joe Winograd
Developer
Fellow 2017
Most Valuable Expert 2018

Commented:
The approach documented in post #a42049716 will work well. In fact, I have working subroutines of all the components described in the post. It would be a matter of combining them into a single program and, of course, testing it thoroughly. That would not be trivial, especially making sure that the program handles error conditions when processing 10,000 PDFs in 500 folders. This is why I mentioned a Gig if the asker does not have the expertise to write the program. But it is very doable using the roadmap in my post.
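To give an idea of what handling error conditions could look like at this scale, here is a small Python sketch of a batch loop that keeps going when one file fails and reports the failures at the end. It is only an assumption about one reasonable structure, not the actual subroutines mentioned above. process_one is a hypothetical stand-in for whatever per-file step is used (for example, the header/footer stamping).

```python
import logging
from pathlib import Path
from typing import Callable, Iterable, List, Tuple

log = logging.getLogger("pdf_batch")


def process_all(
    pdfs: Iterable[Path],
    process_one: Callable[[Path], None],  # hypothetical per-file step
) -> Tuple[List[Path], List[Tuple[Path, str]]]:
    """Run process_one on every PDF; collect failures instead of stopping."""
    succeeded: List[Path] = []
    failed: List[Tuple[Path, str]] = []
    for pdf in pdfs:
        try:
            process_one(pdf)
        except Exception as exc:  # broad on purpose: one bad file must not end the run
            log.warning("skipping %s: %s", pdf, exc)
            failed.append((pdf, str(exc)))
        else:
            succeeded.append(pdf)
    return succeeded, failed
```

With 10,000 files, a damaged or password-protected PDF somewhere is likely, so returning the failed list lets you fix or re-run just those files rather than starting over.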
