Solved

Get Text from PDF Files

Posted on 2002-04-20
Last Modified: 2013-12-02
I have an application that needs to open PDF files and extract all of the text.  I looked at wPDF, but it can't read many PDF files, and for some of the ones it can read, the extracted text comes out corrupted.

How can I do this?
Question by:dokken
11 Comments
 
LVL 45

Expert Comment

by:aikimark
ID: 6956900
How about...
1. automating Adobe to Open and Save As text
2. doing the same as (1) using an Adobe object
3. Adobe plug-in

===============================================
PDF Forms Toolkit Component
http://www.epublishstore.com/details.asp?ProdID=168

Planet PDF Delphi-related software (page 1 of 2)
http://www.planetpdf.com/mainpage%2Easp?IsPostback=true&Q=delphi&WebPageID=539&qphrasetype=free&PP=1&qscope=allsite

How To Use Adobe Acrobat (PDF) Files in a Delphi Application (recommends TPDF)
http://delphi.about.com/library/howto/htusepdf.htm

LlPDFLib 1.1 (may have something useful for reverse engineering if the other links are not helpful)
LlPDFLib is a pure Object Pascal library for creating PDF documents. This library doesn't use any DLL or external third-party software to generate PDF files. The library consists of a TPDFDocument component with properties and methods like Delphi's TPrinter, but designed to generate a PDF file.
http://www.programmersheaven.com/search/Download.asp?FileID=20855

PDF In-The-Box
http://vclcomponents.com/download.asp?ID_COMPONENT=19080

PDFlib (ActiveX)
http://www.epublishstore.com/details.asp?ProdID=177

activePDF Toolkit
http://www.epublishstore.com/details.asp?ProdID=47

There are several PDF converters on this page, but you'll need to do subsequent processing to get plain text, as these conversions produce either RTF- or HTML-formatted data.
http://www.pinnaclepublishing.com/Store.nsf/cscatgroup?open&DATAIE,X,DD
 
LVL 5

Expert Comment

by:Gwena
ID: 6957483
listening :-)
 

Author Comment

by:dokken
ID: 6958626
Unfortunately, many of the systems won't have Acrobat installed, and I don't want that to be a requirement.

Out of that list I found two components that look like they will work (I haven't tested them yet though) but they're both $4,500.  $4,500 is more than I'm willing to spend at this time for a component.  I'm looking for a much cheaper solution.
 

Expert Comment

by:lottol
ID: 6958970
listening :-)
 

Author Comment

by:dokken
ID: 6961156
Adobe has a DLL called IFilter which was designed for Microsoft Indexing clients and is free to download at http://www.adobe.com/support/downloads/detail.jsp?ftpID=1276.

Unfortunately, I don't have experience working with DLLs.  I will give the points to anyone who can provide me with source code for getting the text from PDF files using the IFilter DLL.
 
LVL 45

Expert Comment

by:aikimark
ID: 6961259
How about Ghostscript, Ghostview, or GSView?
http://www.cs.wisc.edu/~ghost/

========================================
IFilter might allow you to use ADO to access the PDF documents, but this was meant to work with Microsoft's Index Server.
 

Author Comment

by:dokken
ID: 6963256
Ghostscript would probably work for me if it was in Delphi... I'll look into it a little more but actually I would much rather use a DLL from Adobe so I know it will always work.

I found out that IFilter uses a standard COM interface, and the documentation is at http://msdn.microsoft.com/library/default.asp?url=/library/en-us/indexsrv/ixrefint_9sfm.asp.  I don't know anything about COM, but someone should know how I can use this DLL.

I'm bumping the points up to 390... that's all I have.
 
LVL 45

Accepted Solution

by:aikimark (earned 390 total points)
ID: 6963393
From http://www.cs.wisc.edu/~ghost/doc/gsapi.htm, a Delphi Ghostscript example:
ftp://mirror.cs.wisc.edu/pub/mirrors/ghost/contrib/gsapi_delphi.zip

Ghostscript API documentation:
http://www.cs.wisc.edu/~ghost/doc/cvs/API.htm

COMGS.DLL is a COM Automation Server and Interface to Aladdin Ghostscript version 5.50 and above.
http://community.wow.net/grt/comgs.html
 
LVL 45

Expert Comment

by:aikimark
ID: 6963499
Solutions don't have to be in Delphi, just controllable by your Delphi program.
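
As a rough sketch of what "controllable by your program" can look like, the snippet below builds a Ghostscript command line and runs it as a child process. It is written in Python only for brevity; a Delphi program could launch the same command line (for example through the Win32 CreateProcess call). Assumptions not taken from this thread: the installed Ghostscript build includes the txtwrite output device (older releases do not have it), the console executable is on the PATH, and the helper names build_gs_text_command and extract_text are purely illustrative.

```python
import subprocess
from pathlib import Path


def build_gs_text_command(pdf_path, txt_path, gs_exe="gs"):
    """Return the argument list for a Ghostscript text-extraction run.

    Assumptions: gs_exe names a Ghostscript console binary on PATH
    ("gs" on Unix; "gswin32c" or "gswin64c" on Windows), and the build
    includes the txtwrite output device.
    """
    return [
        gs_exe,
        "-q",                        # suppress the startup banner
        "-dNOPAUSE",                 # do not prompt between pages
        "-dBATCH",                   # exit after the last input file
        "-dSAFER",                   # restrict file access from the document
        "-sDEVICE=txtwrite",         # write extracted text instead of an image
        f"-sOutputFile={txt_path}",  # where the text goes
        str(pdf_path),
    ]


def extract_text(pdf_path, gs_exe="gs"):
    """Run Ghostscript on pdf_path and return the extracted text.

    Hypothetical helper: writes <name>.txt next to the PDF, then reads it back.
    """
    txt_path = Path(pdf_path).with_suffix(".txt")
    subprocess.run(build_gs_text_command(pdf_path, txt_path, gs_exe), check=True)
    return txt_path.read_text(encoding="utf-8", errors="replace")
```

Keeping the command construction separate from the process launch means the argument list can be inspected or logged without Ghostscript being installed, and the same list can be handed to whatever process API the host language offers.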
 

Author Comment

by:dokken
ID: 6964504
Well, what I'm looking for is something like a DLL that can either be distributed with my app or be easily obtained (for free).  The DLL I was directed to should work, and since it's from Adobe it should be able to extract text from all PDF files.

I'm still looking into Ghostscript.  Their Delphi demo can't even read a single PDF file.  I posted a message in their newsgroup and was referred to something called pstotext, which I'm going to look into right now.
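
For anyone trying the same route, here is a minimal sketch of calling pstotext as an external tool and capturing what it prints. Assumptions not confirmed in this thread: a pstotext executable (which itself relies on Ghostscript) is installed and on the PATH, it writes the extracted text to standard output when given a file name, and that output can be decoded as ISO 8859-1. The helper names pstotext_command and run_pstotext are hypothetical. Python is used only for brevity; the same command could be started from a Delphi application.

```python
import shutil
import subprocess


def pstotext_command(pdf_path, exe="pstotext"):
    """Return the argument list for a plain pstotext run (no options)."""
    return [exe, str(pdf_path)]


def run_pstotext(pdf_path, exe="pstotext"):
    """Run pstotext on pdf_path and return whatever it wrote to stdout.

    Raises FileNotFoundError if the executable is not on PATH, and
    subprocess.CalledProcessError if the tool exits with an error.
    """
    if shutil.which(exe) is None:
        raise FileNotFoundError(f"{exe} was not found on PATH")
    result = subprocess.run(
        pstotext_command(pdf_path, exe),
        capture_output=True,  # collect stdout and stderr instead of printing
        check=True,           # turn a non-zero exit code into an exception
    )
    # Assumption: output is ISO 8859-1; latin-1 decoding never fails on bytes.
    return result.stdout.decode("latin-1")
```

Checking for the executable up front gives a clear error on machines where the tool is not installed, which matters when the extractor is shipped alongside the application rather than assumed to be present.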
 
LVL 1

Expert Comment

by:pnh73
ID: 9003824
No comment has been added lately, so it's time to clean up this TA.
I will leave a recommendation in the Cleanup topic area that this question is:

Accept answer from aikimark

Please leave any comments here within the next seven days.
 
PLEASE DO NOT ACCEPT THIS COMMENT AS AN ANSWER!
 
Paul (pnh73)
EE Cleanup Volunteer