  • Status: Solved
  • Priority: Medium
  • Security: Public
  • Views: 878

How to split/break .pdf files based on text in the pdf?

Does anyone have a script that can split PDFs based on text inside the PDF? The ultimate goal is to split one PDF into individual files and name each new file after the text that was used to split it.

I am using Visual Studio 2008

Thank you very much.
Asked by: kgittinger
2 Solutions
 
Karl Heinz Kremer commented:
One of the most complicated things is to extract text from a PDF file.

If you want a simple solution, go with a commercial solution - google for "split pdf by content" to find a number of offerings.

If you want to do it yourself, take a look at a PDF library or framework that allows you to extract text page by page. You will lose the formatting, which makes it more complicated to identify what you are looking for, but once you have the page content you should be able to split your documents.

As far as PDF libraries go, I would suggest either iTextSharp (which is free, but check the license to see whether you can use it for free or need a paid license for your particular application) or ABCpdf .NET, which is a great product, but you need to pay for it:

http://sourceforge.net/projects/itextsharp/
http://www.websupergoo.com/abcpdf-1.htm
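
As a rough illustration of the "extract text per page, then split" idea, here is a minimal Python sketch of the grouping step only. It assumes the page text has already been extracted into a list of strings by whatever library you choose; `plan_splits`, the `fallback` name, and the "Invoice No:" marker are hypothetical names made up for this example, not part of any library mentioned above.

```python
import re

def plan_splits(page_texts, pattern, fallback="unmatched"):
    """Group pages into (name, first_page, last_page) tuples (1-based pages).

    `pattern` must contain one capture group; the captured text becomes
    the output file name for that group of pages.
    """
    marker = re.compile(pattern)
    groups = []
    for page_no, text in enumerate(page_texts, start=1):
        match = marker.search(text)
        if match:
            # Replace characters that are unsafe in file names.
            name = re.sub(r"[^A-Za-z0-9_-]+", "_", match.group(1)).strip("_")
            name = name or fallback
            if groups and groups[-1][0] == name:
                # Same marker as the previous page: same output file.
                groups[-1][2] = page_no
            else:
                groups.append([name, page_no, page_no])
        elif groups:
            # No marker on this page: treat it as a continuation page.
            groups[-1][2] = page_no
        else:
            # Pages before the first marker go into a fallback group.
            groups.append([fallback, page_no, page_no])
    return [tuple(g) for g in groups]

if __name__ == "__main__":
    pages = ["Invoice No: A-100\nline items", "continued", "Invoice No: B/200"]
    for name, first, last in plan_splits(pages, r"Invoice No:\s*(\S+)"):
        print(f"{name}.pdf <- pages {first}-{last}")
```

The resulting page ranges can then be handed to whichever tool or library does the actual splitting. Whether a page without a marker should continue the previous file is a design choice that depends on your documents.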
 
aikimark commented:
In one of my projects, I use PDFTEXT to extract the text layer into a .txt file. I then use some VBScript to parse each text file. I use PDFTK to combine or split some of the PDF files on page boundaries.

What kind of splitting do you need?
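
For reference, the PDFTK step described above could be driven from a script along these lines. This is only a sketch in Python, assuming `pdftk` is installed and on the PATH; it uses pdftk's `cat` operation with a page range, and the helper names `build_pdftk_command` and `split_range` as well as the file names are made up for the example.

```python
import subprocess

def build_pdftk_command(source, first_page, last_page, output):
    """Return the pdftk argument list that copies a page range to a new file."""
    if first_page < 1 or last_page < first_page:
        raise ValueError("page range must satisfy 1 <= first_page <= last_page")
    # pdftk accepts a single page ("3") or a range ("3-5") after "cat".
    if first_page == last_page:
        pages = str(first_page)
    else:
        pages = f"{first_page}-{last_page}"
    return ["pdftk", source, "cat", pages, "output", output]

def split_range(source, first_page, last_page, output):
    """Run pdftk; raises CalledProcessError if pdftk reports a failure."""
    subprocess.run(build_pdftk_command(source, first_page, last_page, output),
                   check=True)

if __name__ == "__main__":
    # Hypothetical file names, shown only to illustrate the command shape.
    print(" ".join(build_pdftk_command("all.pdf", 3, 5, "part.pdf")))
```

Building the argument list separately from running it makes it easy to log or dry-run the commands before any files are written.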
