Merging 10,000 PDFs into ONE searchable file

This is a one-off exercise:

I have 10,000 PDFs in about 500 folders.
Each PDF is named after a person and contains a photo of that person.

For example, johnsmith.pdf has a picture of John Smith!

I want to start at a high-level folder and drill down into all lower folders looking for PDFs.

REQUIREMENT: One large PDF containing all the smaller ones.

Also, I need to be able to search the PDF and find the photo of "John Smith" or whoever.
Patrick O'Dea Asked:

Joe Winograd (Developer) Commented:
Hi Patrick,

Here's how I would attack it:

o  Write a program/script in whatever programming/scripting language you prefer to recurse into all subfolders of a specified source folder.

o  Look for all PDF files in each subfolder and extract the file name of each PDF file into a variable.

o  Call a utility to put the file name in the header or footer (whichever you prefer) of the PDF — this will make the name searchable. You have to decide if you want the fully qualified file name, such as...

C:\photos\webinars\2016Q3\John Smith.pdf

...or just the person's name, or something in between the two. You must have already solved the issue of more than one John Smith being in a subfolder, since there can't be a duplicate file name in the same folder. Of course, there may be more than one John Smith in multiple subfolders, so a search for just the name in the merged file would get hits on more than one — maybe you don't care about that, or maybe that's a good reason for putting the fully qualified file name in the header/footer. I recommend creating a temporary file with the header/footer so that the original file is not modified.

o  Call a utility to merge all of the new PDF files that have the file name in the header/footer into a single, combined PDF file.
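
The steps above could be sketched roughly as follows in Python. This is only an illustration under stated assumptions, not the program described in this thread. The function names `find_pdfs`, `label_for`, and `merge_with_bookmarks` are hypothetical. The merge step assumes the third-party pypdf package (version 3 or later) is installed, and it swaps in a simpler technique: each label becomes a bookmark in the merged file rather than text stamped into a header or footer, which would need a separate stamping utility.

```python
from pathlib import Path


def find_pdfs(root):
    """Return every PDF under root, recursing into all subfolders, sorted."""
    return sorted(
        p for p in Path(root).rglob("*")
        if p.is_file() and p.suffix.lower() == ".pdf"
    )


def label_for(pdf_path, root, qualified=True):
    """Build the label for one PDF.

    qualified=True  -> path relative to root, without the extension
    qualified=False -> just the file name, without the extension
    """
    pdf_path = Path(pdf_path)
    if qualified:
        return pdf_path.relative_to(root).with_suffix("").as_posix()
    return pdf_path.stem


def merge_with_bookmarks(root, output_file, qualified=True):
    """Merge all PDFs under root into output_file, one bookmark per source."""
    # Assumption: pypdf >= 3 is installed (pip install pypdf).
    # Imported here so the two helpers above work without it.
    from pypdf import PdfWriter

    writer = PdfWriter()
    for pdf in find_pdfs(root):
        # Bookmark (outline entry) instead of a stamped header/footer.
        writer.append(str(pdf), outline_item=label_for(pdf, root, qualified))
    with open(output_file, "wb") as fh:
        writer.write(fh)
```

Writing the output file somewhere outside the source folder keeps a second run from picking up the merged PDF as an input. Using the qualified label is one way to tell apart two people named John Smith in different subfolders.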

If you don't have the expertise to do this, what is your budget for the project? It should make for a very reasonably priced Gig. Regards, Joe

Joe Winograd (Developer) Commented:
The approach documented in post #a42049716 will work well. In fact, I have working subroutines of all the components described in the post. It would be a matter of combining them into a single program and, of course, testing it thoroughly. It would not be trivial, especially to make sure that it is able to handle error conditions when processing 10,000 PDFs in 500 folders. This is why I mentioned a Gig if the asker does not have the expertise to write the program. But it is very doable using the roadmap in my post.
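
As a rough illustration of the error handling mentioned above, one option is to process each file independently and record failures, so that a single damaged PDF does not stop a run of 10,000. This is a minimal sketch, not one of the working subroutines referred to in the post. `process_all` and `process_one` are hypothetical names, and `process_one` stands for whatever per-file step (stamping, merging) is being applied.

```python
import logging


def process_all(paths, process_one):
    """Apply process_one to each path; collect failures instead of stopping."""
    done, failed = [], []
    for path in paths:
        try:
            process_one(path)
            done.append(path)
        except Exception as exc:
            # Broad catch on purpose: any problem with one file is logged
            # and the run continues with the next file.
            logging.warning("Skipping %s: %s", path, exc)
            failed.append((path, str(exc)))
    return done, failed
```

The returned `failed` list can be reviewed after the run to see which files need attention before the merged PDF is treated as complete.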