Solved

Why is XPDF missing words and files when extracting text from pdf files?

Posted on 2006-06-23
Last Modified: 2012-06-21
Hello all,

I have a search engine that runs on our website server.  It uses an inverted index created by an indexing routine.  In order to index PDF files, the indexer calls the utility XPDF ( http://www.foolabs.com/xpdf/ ) to extract the text.
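For reference, the extraction step can be sketched roughly as below. This is a minimal sketch, not the actual indexer code: it assumes the `pdftotext` tool from the XPDF package is on the PATH, and the function name `extract_text` and the injectable `runner` parameter are made up for illustration. The point is to keep the exit status and anything written to stderr for each file, so that a skipped file leaves a trace.

```python
import subprocess


def extract_text(pdf_path, runner=subprocess.run):
    """Run pdftotext on one PDF and return (text, returncode, stderr_text).

    `runner` is a hypothetical hook so the call can be replaced in tests;
    by default it is subprocess.run.
    """
    # Passing "-" as the output file makes pdftotext write the text to stdout.
    cmd = ["pdftotext", "-enc", "UTF-8", pdf_path, "-"]
    result = runner(cmd, capture_output=True)
    text = result.stdout.decode("utf-8", errors="replace")
    errors = result.stderr.decode("utf-8", errors="replace")
    return text, result.returncode, errors
```

An indexer built this way could log `returncode` and `errors` next to the file name instead of dropping the file without a record.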

Mostly it works just fine, BUT:

- sometimes it misses words.  For example, if a PDF document is known to contain the phrase "blue sky", a search after indexing may find "blue" but not "sky".

- sometimes it skips files completely.  This seems to be accompanied by cryptic error messages like "Bad Annotation".
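
One way to narrow down the first symptom is to check whether the missing word is already absent from the extracted text, or whether it gets lost later in the indexer. The sketch below is only an illustration: the helper names `tokens` and `missing_words` are hypothetical, and the normalization (case folding, plus NFKC to fold ligatures into plain letters) is an assumption about what might matter, not a known cause.

```python
import re
import unicodedata


def tokens(text):
    """Return the set of lowercase word tokens in the extracted text."""
    # NFKC folds compatibility characters such as the "fi" ligature into
    # plain letters (an assumed, not confirmed, source of mismatches).
    folded = unicodedata.normalize("NFKC", text).lower()
    return set(re.findall(r"\w+", folded))


def missing_words(text, expected):
    """Return the expected words that do not appear as tokens in text."""
    found = tokens(text)
    return [word for word in expected if word.lower() not in found]
```

If `missing_words(text, ["blue", "sky"])` reports "sky" for the raw output of the extraction tool, the word was lost during extraction; if it reports nothing, the problem is more likely in the indexing step.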

Anyone have insights into XPDF?

Thanks.
Question by:xfvgdrthbdtyvhgscv
3 Comments
 
LVL 35

Accepted Solution

by:
[ fanpages ] earned 1000 total points
ID: 16973565
I do not have any previous experience of XPDF.

Have you tried contacting the software authors?  Perhaps there is a fault with it that they are not aware of.

You may also like to try extracting the text from your PDF files into other formats using these tools:

HTML, or XML:
[ http://pdftohtml.sourceforge.net/ ]

RTF, MS-Word DOC, or Text (TXT):
[ http://www.verypdf.com/pdf2word/pdf-to-doc/pdf-to-txt.html ]

HTML, TXT, BMP or JPG:
[ http://www.topshareware.com/Paq-PDFTools-(-pdf2txt-pdf2html-pdf2htm-pdf-to-txt-pdf-to-html)-download-12417.htm ]


BFN,

fp.
 
LVL 1

Author Comment

by:xfvgdrthbdtyvhgscv
ID: 16979619
Well, although it doesn't actually answer the question, I guess I'll have to award the points by default.
 
LVL 35

Expert Comment

by:[ fanpages ]
ID: 16979848
You could have waited for further Expert input; you need not have closed the question at all.

Did you contact the software vendor?

Question has a verified solution.
