Solved

Get Text from PDF Files

Posted on 2002-04-20
Last Modified: 2013-12-02
I have an application that needs to open PDF files and extract all of the text. I looked at wPDF, but it can't read many PDF files, and for some of the ones it can read, the text comes out corrupted.

How can I do this?
Question by:dokken
11 Comments
 
LVL 45

Expert Comment

by:aikimark
ID: 6956900
How about...
1. automating Adobe to Open and Save As text
2. doing the same as (1) using an Adobe object
3. Adobe plug-in

===============================================
PDF Forms Toolkit Component
http://www.epublishstore.com/details.asp?ProdID=168

Planet PDF Delphi-related software (page 1 of 2)
http://www.planetpdf.com/mainpage%2Easp?IsPostback=true&Q=delphi&WebPageID=539&qphrasetype=free&PP=1&qscope=allsite

How To Use Adobe Acrobat (PDF) Files in a Delphi Application (recommends TPDF)
http://delphi.about.com/library/howto/htusepdf.htm

LlPDFLib 1.1 (may have something useful for reverse engineering if the other links are not helpful)
LlPDFLib is a pure Object Pascal library for creating PDF documents. This library doesn't use any DLL or external third-party software to generate PDF files. The library consists of a TPDFDocument component with properties and methods like Delphi's TPrinter, but designed to generate a PDF file.
http://www.programmersheaven.com/search/Download.asp?FileID=20855

PDF In-The-Box
http://vclcomponents.com/download.asp?ID_COMPONENT=19080

PDFlib (ActiveX)
http://www.epublishstore.com/details.asp?ProdID=177

activePDF Toolkit
http://www.epublishstore.com/details.asp?ProdID=47

There are several PDF converters on this page, but you'll need to do subsequent processing to get plain text as these conversions result in either RTF or HTML formatted data.
http://www.pinnaclepublishing.com/Store.nsf/cscatgroup?open&DATAIE,X,DD
 
LVL 5

Expert Comment

by:Gwena
ID: 6957483
listening :-)
 

Author Comment

by:dokken
ID: 6958626
Unfortunately, many of the systems won't have Acrobat installed, and I don't want that to be a requirement.

Out of that list I found two components that look like they will work (I haven't tested them yet), but they're both $4,500. That is more than I'm willing to spend on a component at this time. I'm looking for a much cheaper solution.

Expert Comment

by:lottol
ID: 6958970
listening :-)
 

Author Comment

by:dokken
ID: 6961156
Adobe has a DLL called IFilter which was designed for Microsoft Indexing clients and is free to download at http://www.adobe.com/support/downloads/detail.jsp?ftpID=1276.

Unfortunately, I don't have experience working with DLLs.  I will give the points to anyone who can provide me with source code for getting the text from PDF files using the IFilter DLL.
 
LVL 45

Expert Comment

by:aikimark
ID: 6961259
How about Ghostscript/Ghostview or GSview?
http://www.cs.wisc.edu/~ghost/

========================================
IFilter might allow you to use ADO to access the PDF documents, but this was meant to work with Microsoft's Index Server.
 

Author Comment

by:dokken
ID: 6963256
Ghostscript would probably work for me if it was in Delphi... I'll look into it a little more but actually I would much rather use a DLL from Adobe so I know it will always work.

I found out that IFilter uses a standard COM interface, and the documentation is at http://msdn.microsoft.com/library/default.asp?url=/library/en-us/indexsrv/ixrefint_9sfm.asp .  I don't know anything about COM, but someone should know how I can use this DLL.

I'm bumping the points up to 390... that's all I have.
 
LVL 45

Accepted Solution

by:
aikimark earned 390 total points
ID: 6963393
From http://www.cs.wisc.edu/~ghost/doc/gsapi.htm we have a
Delphi Ghostscript example:
ftp://mirror.cs.wisc.edu/pub/mirrors/ghost/contrib/gsapi_delphi.zip

Ghostscript API documentation:
http://www.cs.wisc.edu/~ghost/doc/cvs/API.htm

COMGS.DLL is a COM Automation Server and Interface to Aladdin Ghostscript version 5.50 and above.
http://community.wow.net/grt/comgs.html
 
LVL 45

Expert Comment

by:aikimark
ID: 6963499
Solutions don't have to be in Delphi, just controllable by your Delphi program.
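
As a rough illustration of "controllable by your program", here is a minimal sketch of driving the Ghostscript command-line executable as an external process. It is written in Python only to keep it short. It assumes a current Ghostscript build that includes the txtwrite output device (older releases may not have it), and that the executable is on the PATH as "gs" (on Windows the console executable is typically gswin32c or gswin64c). The function names build_gs_text_command and extract_text are hypothetical, made up for this example.

```python
import subprocess
from pathlib import Path


def build_gs_text_command(pdf_path, txt_path, gs_executable="gs"):
    """Return the argument list for a Ghostscript run that writes plain text.

    Assumes the installed Ghostscript provides the txtwrite device.
    """
    return [
        gs_executable,
        "-q",                        # suppress startup banner
        "-dNOPAUSE",                 # do not pause between pages
        "-dBATCH",                   # exit after the last file
        "-sDEVICE=txtwrite",         # text extraction output device
        f"-sOutputFile={txt_path}",  # where the extracted text goes
        str(pdf_path),
    ]


def extract_text(pdf_path, txt_path, gs_executable="gs"):
    """Run Ghostscript as a child process and return the extracted text."""
    command = build_gs_text_command(pdf_path, txt_path, gs_executable)
    subprocess.run(command, check=True)  # raises CalledProcessError on failure
    return Path(txt_path).read_text(encoding="utf-8", errors="replace")
```

The same argument list could be passed to CreateProcess from a Delphi program, so the host application only needs to launch the tool and then read the output file.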
 

Author Comment

by:dokken
ID: 6964504
Well, what I'm looking for is something like a DLL that can either be distributed with my app or easily obtainable (for free).  The DLL I was directed to should work and since it's from Adobe it should be able to extract text from all PDF files.

I'm still looking into Ghostscript.  Their Delphi demo can't even read a single PDF file.  I posted a message in their newsgroup and was referred to something called pstotext, which I'm going to look into right now.
 
LVL 1

Expert Comment

by:pnh73
ID: 9003824
No comment has been added lately, so it's time to clean up this TA.
I will leave a recommendation in the Cleanup topic area that this question is:

Accept answer from aikimark

Please leave any comments here within the next seven days.
 
PLEASE DO NOT ACCEPT THIS COMMENT AS AN ANSWER!
 
Paul (pnh73)
EE Cleanup Volunteer