Solved

Get Text from PDF Files

Posted on 2002-04-20
Last Modified: 2013-12-02
I have an application that needs to open PDF files and extract all of the text. I looked at wPDF, but it can't read many PDF files, and in some of the ones it can read, the text comes out corrupted.

How can I do this?
Question by:dokken
11 Comments
 
LVL 46

Expert Comment

by:aikimark
ID: 6956900
How about...
1. automating Adobe to Open and Save As text
2. doing the same as (1) using an Adobe object
3. Adobe plug-in

===============================================
PDF Forms Toolkit Component
http://www.epublishstore.com/details.asp?ProdID=168

Planet PDF Delphi-related software (page 1 of 2)
http://www.planetpdf.com/mainpage%2Easp?IsPostback=true&Q=delphi&WebPageID=539&qphrasetype=free&PP=1&qscope=allsite

How To Use Adobe Acrobat (PDF) Files in a Delphi Application (recommends TPDF)
http://delphi.about.com/library/howto/htusepdf.htm

LlPDFLib 1.1 (may have something useful for reverse engineering if the other links are not helpful)
LlPDFLib is a pure Object Pascal library for creating PDF documents. The library doesn't use any DLLs or external third-party software to generate PDF files. It consists of a TPDFDocument component with properties and methods like Delphi's TPrinter, but designed to generate a PDF file.
http://www.programmersheaven.com/search/Download.asp?FileID=20855

PDF In-the box
http://vclcomponents.com/download.asp?ID_COMPONENT=19080

PDFlib (ActiveX)
http://www.epublishstore.com/details.asp?ProdID=177

activePDF Toolkit
http://www.epublishstore.com/details.asp?ProdID=47

There are several PDF converters on this page, but you'll need to do subsequent processing to get plain text as these conversions result in either RTF or HTML formatted data.
http://www.pinnaclepublishing.com/Store.nsf/cscatgroup?open&DATAIE,X,DD
 
LVL 5

Expert Comment

by:Gwena
ID: 6957483
listening :-)
 

Author Comment

by:dokken
ID: 6958626
Unfortunately, many of the systems won't have Acrobat installed, and I don't want that to be a requirement.

Out of that list I found two components that look like they will work (I haven't tested them yet though) but they're both $4,500.  $4,500 is more than I'm willing to spend at this time for a component.  I'm looking for a much cheaper solution.

Expert Comment

by:lottol
ID: 6958970
listening :-)
 

Author Comment

by:dokken
ID: 6961156
Adobe has a DLL called IFilter which was designed for Microsoft Indexing clients and is free to download at http://www.adobe.com/support/downloads/detail.jsp?ftpID=1276.

Unfortunately, I don't have experience working with DLLs.  I will give the points to anyone who can provide me with source code for getting the text from PDF files using the IFilter DLL.
 
LVL 46

Expert Comment

by:aikimark
ID: 6961259
How about Ghostscript/Ghostview or GSview?
http://www.cs.wisc.edu/~ghost/

========================================
IFilter might allow you to use ADO to access the PDF documents, but this was meant to work with Microsoft's Index Server.
 

Author Comment

by:dokken
ID: 6963256
Ghostscript would probably work for me if it was in Delphi... I'll look into it a little more but actually I would much rather use a DLL from Adobe so I know it will always work.

I found out that IFilter uses a standard COM interface, and the documentation is at http://msdn.microsoft.com/library/default.asp?url=/library/en-us/indexsrv/ixrefint_9sfm.asp.  I don't know anything about COM, but someone should know how I can use this DLL.

I'm bumping the points up to 390... that's all I have.
 
LVL 46

Accepted Solution

by:
aikimark earned 390 total points
ID: 6963393
From http://www.cs.wisc.edu/~ghost/doc/gsapi.htm we have a
Delphi Ghostscript example:
ftp://mirror.cs.wisc.edu/pub/mirrors/ghost/contrib/gsapi_delphi.zip

Ghostscript API documentation:
http://www.cs.wisc.edu/~ghost/doc/cvs/API.htm

COMGS.DLL is a COM Automation Server and Interface to Aladdin Ghostscript version 5.50 and above.
http://community.wow.net/grt/comgs.html
 
LVL 46

Expert Comment

by:aikimark
ID: 6963499
Solutions don't have to be in Delphi, just controllable by your Delphi program.
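
As a rough illustration of that idea (a sketch under assumptions, not tested against the asker's files): a program can shell out to the Ghostscript command-line executable and read back the text file it writes. The example is in Python for brevity; the same pattern should carry over to Delphi by launching the process with CreateProcess and waiting for it to finish. It assumes a Ghostscript build that includes the txtwrite output device and a console executable named gswin32c on the PATH. Both are assumptions, and the function names are made up for this example.

```python
import shutil
import subprocess
from pathlib import Path


def build_gs_text_command(pdf_path, txt_path, gs_exe="gswin32c"):
    """Build a Ghostscript command line that writes a PDF's text to txt_path.

    Assumption: the installed Ghostscript includes the txtwrite device.
    """
    return [
        gs_exe,
        "-q",                        # suppress the startup banner
        "-dNOPAUSE",                 # do not pause between pages
        "-dBATCH",                   # exit after the last input file
        "-sDEVICE=txtwrite",         # text-extraction output device
        f"-sOutputFile={txt_path}",  # where the extracted text goes
        str(pdf_path),
    ]


def extract_text(pdf_path, gs_exe="gswin32c"):
    """Run Ghostscript and return the extracted text, or None if it is not installed."""
    if shutil.which(gs_exe) is None:
        return None
    txt_path = Path(pdf_path).with_suffix(".txt")
    # check=True raises CalledProcessError if Ghostscript reports a failure
    subprocess.run(build_gs_text_command(pdf_path, txt_path, gs_exe), check=True)
    return txt_path.read_text(errors="replace")
```

If Ghostscript is not installed, extract_text returns None, so the calling program can fall back to another approach instead of crashing.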
 

Author Comment

by:dokken
ID: 6964504
Well, what I'm looking for is something like a DLL that can either be distributed with my app or is easily obtainable (for free).  The DLL I was directed to should work, and since it's from Adobe it should be able to extract text from all PDF files.

I'm still looking into Ghostscript.  Their Delphi demo can't even read a single PDF file.  I posted a message in their newsgroup and was referred to something called pstotext, which I'm going to look into right now.
 
LVL 1

Expert Comment

by:pnh73
ID: 9003824
No comment has been added lately, so it's time to clean up this TA.
I will leave a recommendation in the Cleanup topic area that this question is:

Accept answer from aikimark

Please leave any comments here within the next seven days.
 
PLEASE DO NOT ACCEPT THIS COMMENT AS AN ANSWER!
 
Paul (pnh73)
EE Cleanup Volunteer