How to split/break .pdf files based on text in the pdf?

Posted on 2012-09-10
Last Modified: 2012-09-11
Does anyone have a script that can split PDFs based on text in the PDF? The ultimate goal is to split one PDF into individual files and name each new file after the text used to split it.

I am using Visual Studio 2008

Thank you very much.
Question by:kgittinger

    Accepted Solution

    One of the most complicated things is to extract text from a PDF file.

    If you want a simple solution, go with a commercial solution - google for "split pdf by content" to find a number of offerings.

    If you want to do it yourself, take a look at a PDF library or framework that allows you to extract text on a page basis - you will lose the formatting, so that will make it more complicated to identify what you are looking for. But once you have that page content, you should be able to split your documents.
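To illustrate the last step, here is a minimal Python sketch of the splitting logic, assuming the page text has already been extracted into a list of strings (one per page) by whichever library you pick. The `Invoice No:` marker, `group_pages`, and `safe_filename` are hypothetical names invented for this example; substitute whatever text actually identifies a section in your documents.

```python
import re

# Hypothetical marker: a line such as "Invoice No: A100" on the first page
# of each section. Replace with a pattern that fits your own documents.
MARKER = re.compile(r"Invoice No:\s*(\w+)")


def group_pages(page_texts, marker=MARKER):
    """Return [(key, first_page, last_page)] using 1-based inclusive pages.

    A page where the marker matches a new key starts a new group. Pages with
    no match, or with the same key as the current group, extend that group.
    Pages before the first match are skipped.
    """
    groups = []
    for number, text in enumerate(page_texts, start=1):
        match = marker.search(text)
        key = match.group(1) if match else None
        if key is not None and (not groups or groups[-1][0] != key):
            groups.append([key, number, number])
        elif groups:
            groups[-1][2] = number
    return [tuple(group) for group in groups]


def safe_filename(key):
    """Build an output file name from the matched text."""
    # Replace runs of characters that are risky in file names with "_".
    return re.sub(r"[^A-Za-z0-9._-]+", "_", key).strip("_") + ".pdf"
```

Each resulting tuple gives the page range to copy into one output file, and `safe_filename(key)` gives a name for it. Because extracted text loses its formatting, the pattern may need to tolerate odd whitespace or line breaks.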

    As far as PDF libraries go, I would suggest either iTextSharp (which is free, but check the license to see whether your particular application qualifies for free use or needs a paid license) or ABCpdf .NET, which is a great product, but you need to pay for it.

    Assisted Solution

    In one of my projects, I use PDFTEXT to extract the text layer into a .txt file. I then use some VBScript to parse each text file. I use PDFTK to combine or split some of the PDF files on page boundaries.
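A rough sketch of that workflow, written in Python rather than VBScript. It assumes the extraction tool behaves like Xpdf's `pdftotext`, which ends each page of its output with a form-feed character, and it uses pdftk's `cat` operation to copy a page range. The function names and file names are hypothetical.

```python
def split_pages(extracted_text):
    """Split text-extractor output into one string per page.

    Assumes pages are terminated by a form feed ("\f"), as pdftotext does
    by default.
    """
    pages = extracted_text.split("\f")
    # The form feed after the last page leaves an empty trailing entry.
    if pages and pages[-1].strip() == "":
        pages.pop()
    return pages


def pdftk_command(source_pdf, first_page, last_page, output_pdf):
    """Build the argument list that copies a 1-based page range with pdftk."""
    return [
        "pdftk", source_pdf,
        "cat", f"{first_page}-{last_page}",
        "output", output_pdf,
    ]
```

The returned list can be passed to `subprocess.run(..., check=True)` once pdftk is installed and on the PATH. Building the command separately from running it makes the page ranges easy to inspect before any files are written.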

    What kind of splitting do you need?
