Get Text from PDF Files

I have an application that needs to open PDF files and extract all of their text. I looked at wPDF, but it can't read many PDF files, and in some of the ones it can read, the text comes out corrupted.

How can I do this?
dokken Asked:
 
aikimark Commented:
From http://www.cs.wisc.edu/~ghost/doc/gsapi.htm we have a
Delphi Ghostscript example:
ftp://mirror.cs.wisc.edu/pub/mirrors/ghost/contrib/gsapi_delphi.zip

Ghostscript API documentation:
http://www.cs.wisc.edu/~ghost/doc/cvs/API.htm

COMGS.DLL is a COM Automation Server and Interface to Aladdin Ghostscript version 5.50 and above.
http://community.wow.net/grt/comgs.html
 
aikimark Commented:
How about...
1. automating Adobe to Open and Save As text
2. doing the same as (1) using an Adobe object
3. Adobe plug-in

===============================================
PDF Forms Toolkit Component
http://www.epublishstore.com/details.asp?ProdID=168

Planet PDF Delphi-related software (page 1 of 2)
http://www.planetpdf.com/mainpage%2Easp?IsPostback=true&Q=delphi&WebPageID=539&qphrasetype=free&PP=1&qscope=allsite

How To Use Adobe Acrobat (PDF) Files in a Delphi Application (recommends TPDF)
http://delphi.about.com/library/howto/htusepdf.htm

LlPDFLib 1.1 (may have something useful for reverse engineering if the other links are not helpful)
LlPDFLib is a pure Object Pascal library for creating PDF documents. The library doesn't use any DLL or external third-party software to generate PDF files. It consists of a TPDFDocument component with properties and methods like Delphi's TPrinter, but designed to generate a PDF file.
http://www.programmersheaven.com/search/Download.asp?FileID=20855

PDF In-the box
http://vclcomponents.com/download.asp?ID_COMPONENT=19080

PDFlib (ActiveX)
http://www.epublishstore.com/details.asp?ProdID=177

activePDF Toolkit
http://www.epublishstore.com/details.asp?ProdID=47

There are several PDF converters on this page, but you'll need to do subsequent processing to get plain text as these conversions result in either RTF or HTML formatted data.
http://www.pinnaclepublishing.com/Store.nsf/cscatgroup?open&DATAIE,X,DD
 
Gwena Commented:
listening :-)
 
dokken (Author) Commented:
Unfortunately, many of the systems won't have Acrobat installed, and I don't want that to be a requirement.

Out of that list I found two components that look like they will work (I haven't tested them yet though) but they're both $4,500.  $4,500 is more than I'm willing to spend at this time for a component.  I'm looking for a much cheaper solution.
 
lottol Commented:
listening :-)
 
dokken (Author) Commented:
Adobe has a DLL called IFilter which was designed for Microsoft Indexing clients and is free to download at http://www.adobe.com/support/downloads/detail.jsp?ftpID=1276.

Unfortunately, I don't have experience working with DLLs. I will give the points to anyone who can provide me with source code for getting the text from PDF files using the IFilter DLL.
 
aikimark Commented:
How about Ghostscript/view or GSView?
http://www.cs.wisc.edu/~ghost/

========================================
IFilter might allow you to use ADO to access the PDF documents, but this was meant to work with Microsoft's Index Server.
 
dokken (Author) Commented:
Ghostscript would probably work for me if it was in Delphi... I'll look into it a little more but actually I would much rather use a DLL from Adobe so I know it will always work.

I found out that IFilter uses a standard COM interface, and the documentation is at http://msdn.microsoft.com/library/default.asp?url=/library/en-us/indexsrv/ixrefint_9sfm.asp.  I don't know anything about COM, but someone should know how I can use this DLL.

I'm bumping the points up to 390... that's all I have.
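
For readers unfamiliar with the IFilter interface mentioned above, here is a minimal sketch of its calling pattern only, not a working COM binding. After a filter has been loaded and its Init method called, the caller asks for chunks with GetChunk until it reports FILTER_E_END_OF_CHUNKS, and for each text chunk calls GetText until it reports FILTER_E_NO_MORE_TEXT. The sketch is in Python rather than Delphi, and the `FakeFilter` class, its `get_chunk`/`get_text` methods, and `extract_all_text` are hypothetical stand-ins invented for illustration. The HRESULT constants are quoted from memory of the Windows SDK header filterr.h and should be checked against the SDK.

```python
from typing import List, Tuple

# HRESULT values as recalled from filterr.h (assumption: verify against the SDK).
S_OK = 0x00000000
FILTER_E_END_OF_CHUNKS = 0x80041700
FILTER_E_NO_MORE_TEXT = 0x80041701
FILTER_S_LAST_TEXT = 0x00041709

# CHUNKSTATE flag marking a chunk that carries text (as opposed to a value).
CHUNK_TEXT = 0x1


class FakeFilter:
    """Hypothetical in-memory stand-in for a loaded, initialised IFilter."""

    def __init__(self, chunks: List[Tuple[int, List[str]]]) -> None:
        self._chunks = list(chunks)      # each entry: (flags, text pieces)
        self._current: List[str] = []

    def get_chunk(self) -> Tuple[int, int]:
        """Mimics IFilter::GetChunk: returns (hresult, chunk flags)."""
        if not self._chunks:
            return FILTER_E_END_OF_CHUNKS, 0
        flags, pieces = self._chunks.pop(0)
        self._current = list(pieces)
        return S_OK, flags

    def get_text(self) -> Tuple[int, str]:
        """Mimics IFilter::GetText: returns (hresult, text piece)."""
        if not self._current:
            return FILTER_E_NO_MORE_TEXT, ""
        return S_OK, self._current.pop(0)


def extract_all_text(flt) -> str:
    """Drain every text chunk from the filter and concatenate the pieces."""
    parts: List[str] = []
    while True:
        hr, flags = flt.get_chunk()
        if hr == FILTER_E_END_OF_CHUNKS:
            break
        if hr != S_OK:
            raise OSError(f"GetChunk failed: 0x{hr:08X}")
        if not flags & CHUNK_TEXT:
            continue                     # skip value (property) chunks
        while True:
            hr, text = flt.get_text()
            if hr == FILTER_E_NO_MORE_TEXT:
                break
            if hr not in (S_OK, FILTER_S_LAST_TEXT):
                raise OSError(f"GetText failed: 0x{hr:08X}")
            parts.append(text)
            if hr == FILTER_S_LAST_TEXT:
                break                    # filter says this was the last piece
    return "".join(parts)
```

In a real program the two methods would be calls through the COM interface exposed by the installed PDF filter; only the nested-loop structure is expected to carry over.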
 
aikimark Commented:
Solutions don't have to be in Delphi, just controllable by your Delphi program.
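
As an illustration of "controllable by your program", here is a minimal sketch of driving the Ghostscript console executable as an external process. It assumes a Ghostscript release that includes the `txtwrite` output device and that the executable is on the PATH under one of the names listed. It is written in Python only to keep the example short; the same command line could be launched from Delphi. The function names `find_ghostscript`, `build_command`, and `extract_text` are made up for this sketch.

```python
import shutil
import subprocess
from pathlib import Path
from typing import List, Optional

# Usual executable names: "gs" on Unix-like systems, console builds on Windows.
GS_CANDIDATES = ("gs", "gswin64c", "gswin32c")


def find_ghostscript() -> Optional[str]:
    """Return the full path of the first Ghostscript executable found, if any."""
    for name in GS_CANDIDATES:
        path = shutil.which(name)
        if path:
            return path
    return None


def build_command(gs_exe: str, pdf_path: str, txt_path: str) -> List[str]:
    """Build the argument list that asks Ghostscript to write plain text."""
    return [
        gs_exe,
        "-q",                 # suppress startup messages
        "-dNOPAUSE",          # do not pause between pages
        "-dBATCH",            # exit when the input has been processed
        "-sDEVICE=txtwrite",  # text-extraction output device
        f"-sOutputFile={txt_path}",
        pdf_path,
    ]


def extract_text(pdf_path: str, txt_path: str) -> str:
    """Run Ghostscript on pdf_path and return the extracted text."""
    gs_exe = find_ghostscript()
    if gs_exe is None:
        raise FileNotFoundError("Ghostscript executable not found on PATH")
    subprocess.run(build_command(gs_exe, pdf_path, txt_path), check=True)
    return Path(txt_path).read_text(encoding="utf-8", errors="replace")
```

Passing the arguments as a list avoids shell quoting problems with file names that contain spaces. The quality of the extracted text still depends on how each PDF encodes its fonts.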
 
dokken (Author) Commented:
Well, what I'm looking for is something like a DLL that can either be distributed with my app or easily obtainable (for free).  The DLL I was directed to should work and since it's from Adobe it should be able to extract text from all PDF files.

I'm still looking into Ghostscript.  Their Delphi demo can't even read a single PDF file.  I posted a message in their newsgroup and was referred to something called pstotext, which I'm going to look into right now.
 
pnh73 Commented:
No comment has been added lately, so it's time to clean up this TA.
I will leave a recommendation in the Cleanup topic area that this question is:

Accept answer from aikimark

Please leave any comments here within the next seven days.
 
PLEASE DO NOT ACCEPT THIS COMMENT AS AN ANSWER!
 
Paul (pnh73)
EE Cleanup Volunteer