Solved

Get Text from PDF Files

Posted on 2002-04-20
Last Modified: 2013-12-02
I have an application that needs to open PDF files and extract all of the text.  I looked at wPDF, but it can't read many PDF files, and for some of the files it can read, the extracted text comes out corrupted.

How can I do this?
Question by:dokken
11 Comments
 
LVL 45

Expert Comment

by:aikimark
ID: 6956900
How about...
1. automating Adobe to Open and Save As text
2. doing the same as (1) using an Adobe object
3. Adobe plug-in

===============================================
PDF Forms Toolkit Component
http://www.epublishstore.com/details.asp?ProdID=168

Planet PDF Delphi-related software (page 1 of 2)
http://www.planetpdf.com/mainpage%2Easp?IsPostback=true&Q=delphi&WebPageID=539&qphrasetype=free&PP=1&qscope=allsite

How To Use Adobe Acrobat (PDF) Files in a Delphi Application (recommends TPDF)
http://delphi.about.com/library/howto/htusepdf.htm

LlPDFLib 1.1 (may have something useful for reverse engineering if the other links are not helpful)
LlPDFLib is a pure Object Pascal library for creating PDF documents. This library doesn't use any DLL or external third-party software to generate PDF files. The library consists of a TPDFDocument component with properties and methods like Delphi's TPrinter, but designed to generate a PDF file.
http://www.programmersheaven.com/search/Download.asp?FileID=20855

PDF In-the box
http://vclcomponents.com/download.asp?ID_COMPONENT=19080

PDFlib (ActiveX)
http://www.epublishstore.com/details.asp?ProdID=177

activePDF Toolkit
http://www.epublishstore.com/details.asp?ProdID=47

There are several PDF converters on this page, but you'll need to do subsequent processing to get plain text as these conversions result in either RTF or HTML formatted data.
http://www.pinnaclepublishing.com/Store.nsf/cscatgroup?open&DATAIE,X,DD
LVL 5

Expert Comment

by:Gwena
ID: 6957483
listening :-)

Author Comment

by:dokken
ID: 6958626
Unfortunately, many of the systems won't have Acrobat installed, and I don't want that to be a requirement.

Out of that list I found two components that look like they will work (I haven't tested them yet), but they're both $4,500.  That is more than I'm willing to spend on a component at this time.  I'm looking for a much cheaper solution.

Expert Comment

by:lottol
ID: 6958970
listening :-)

Author Comment

by:dokken
ID: 6961156
Adobe has a DLL called IFilter which was designed for Microsoft Indexing clients and is free to download at http://www.adobe.com/support/downloads/detail.jsp?ftpID=1276.

Unfortunately, I don't have experience working with DLLs.  I will give the points to anyone who can provide me with source code for getting the text from PDF files using the IFilter DLL.
LVL 45

Expert Comment

by:aikimark
ID: 6961259
How about Ghostscript/Ghostview or GSview?
http://www.cs.wisc.edu/~ghost/

========================================
IFilter might allow you to use ADO to access the PDF documents, but this was meant to work with Microsoft's Index Server.

Author Comment

by:dokken
ID: 6963256
Ghostscript would probably work for me if it were in Delphi... I'll look into it a little more, but I would much rather use a DLL from Adobe so I know it will always work.

I found out that IFilter uses a standard COM interface, and documentation is at http://msdn.microsoft.com/library/default.asp?url=/library/en-us/indexsrv/ixrefint_9sfm.asp.  I don't know anything about COM, but someone should know how I can use this DLL.

I'm bumping the points up to 390... that's all I have.
LVL 45

Accepted Solution

by:
aikimark earned 390 total points
ID: 6963393
From http://www.cs.wisc.edu/~ghost/doc/gsapi.htm we have a
Delphi Ghostscript example:
ftp://mirror.cs.wisc.edu/pub/mirrors/ghost/contrib/gsapi_delphi.zip

Ghostscript API documentation:
http://www.cs.wisc.edu/~ghost/doc/cvs/API.htm

COMGS.DLL is a COM Automation Server and Interface to Aladdin Ghostscript version 5.50 and above.
http://community.wow.net/grt/comgs.html
LVL 45

Expert Comment

by:aikimark
ID: 6963499
Solutions don't have to be in Delphi, just controllable by your Delphi program.
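
As an illustration of that point, here is a minimal sketch of driving Ghostscript from another program through its command line. It is written in Python only for brevity; a Delphi program could launch the same command line with CreateProcess. It assumes a Ghostscript build that includes the txtwrite output device (present in current releases, not necessarily in older ones) and a console executable on the PATH named gs, gswin64c, or gswin32c. The function names are made up for this example.

```python
import shutil
import subprocess
from pathlib import Path


def build_gs_command(pdf_path, txt_path, gs_exe="gs"):
    """Return the argument list for a Ghostscript text-extraction run."""
    return [
        gs_exe,
        "-q",                        # suppress startup messages
        "-dNOPAUSE",                 # do not pause between pages
        "-dBATCH",                   # exit when the input is processed
        "-dSAFER",                   # restrict file access from the input
        "-sDEVICE=txtwrite",         # assumption: this device is available
        f"-sOutputFile={txt_path}",  # where the extracted text is written
        str(pdf_path),
    ]


def find_gs():
    """Look for a Ghostscript console executable under its usual names."""
    for name in ("gs", "gswin64c", "gswin32c"):
        path = shutil.which(name)
        if path:
            return path
    return None


def extract_text(pdf_path, txt_path):
    """Run Ghostscript and return the extracted text as a string."""
    gs_exe = find_gs()
    if gs_exe is None:
        raise FileNotFoundError("Ghostscript executable not found on PATH")
    subprocess.run(build_gs_command(pdf_path, txt_path, gs_exe), check=True)
    return Path(txt_path).read_text(errors="replace")
```

Calling extract_text("input.pdf", "output.txt") would write the text file and return its contents. Text quality depends on the PDF: scanned pages or unusual font encodings may produce little or no usable text.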

Author Comment

by:dokken
ID: 6964504
Well, what I'm looking for is something like a DLL that can either be distributed with my app or obtained easily (for free).  The DLL I was directed to should work, and since it's from Adobe it should be able to extract text from all PDF files.

I'm still looking into Ghostscript.  Their Delphi demo can't even read a single PDF file.  I posted a message in their newsgroup and was referred to something called pstotext, which I'm going to look into right now.
LVL 1

Expert Comment

by:pnh73
ID: 9003824
No comment has been added lately, so it's time to clean up this TA.
I will leave a recommendation in the Cleanup topic area that this question is:

Accept answer from aikimark

Please leave any comments here within the next seven days.
 
PLEASE DO NOT ACCEPT THIS COMMENT AS AN ANSWER!
 
Paul (pnh73)
EE Cleanup Volunteer
Question has a verified solution.