Solved

Why is XPDF missing words and files when extracting text from pdf files?

Posted on 2006-06-23
Last Modified: 2012-06-21
Hello all,

I have a search engine that runs on our website server.  It uses an inverted index created by an indexing routine.  In order to index PDF files, the indexer calls the utility XPDF ( http://www.foolabs.com/xpdf/ ) to extract the text.

Mostly it works just fine, BUT:

- sometimes it misses words.  For example, if a PDF document is known to contain the phrase "blue sky", a search after indexing may find "blue" but not "sky".

- sometimes it skips files completely.  This seems to be accompanied by cryptic error messages like "Bad Annotation".
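One way to narrow down both symptoms is to run the extraction step outside the indexer, keeping the error output and checking for words the document is known to contain. The following is only a minimal sketch under some assumptions: that the indexer uses the `pdftotext` program from the XPDF package, that it is on the PATH, and that words are split on non-word characters. The helper names `extract_text`, `tokenize`, and `missing_words` are hypothetical and not part of XPDF.

```python
import re
import subprocess
from typing import Callable, List, Set, Tuple


def extract_text(pdf_path: str, run: Callable = subprocess.run) -> Tuple[str, str, int]:
    """Run pdftotext on one file and return (text, error_output, exit_code).

    Assumption: the pdftotext binary from XPDF is on PATH. Passing "-" as
    the output name asks pdftotext to write the text to stdout.
    """
    result = run(
        ["pdftotext", "-enc", "UTF-8", pdf_path, "-"],
        capture_output=True,
    )
    text = result.stdout.decode("utf-8", errors="replace")
    errors = result.stderr.decode("utf-8", errors="replace")
    return text, errors, result.returncode


def tokenize(text: str) -> Set[str]:
    """Lowercase the text and split it into word tokens (hypothetical tokenizer)."""
    return set(re.findall(r"\w+", text.lower()))


def missing_words(text: str, expected: List[str]) -> List[str]:
    """Return the expected words that do not appear in the extracted text."""
    tokens = tokenize(text)
    return [word for word in expected if word.lower() not in tokens]


if __name__ == "__main__":
    import sys

    # Usage: python check_pdf.py file.pdf blue sky
    path, words = sys.argv[1], sys.argv[2:]
    text, errors, code = extract_text(path)
    print("exit code:", code)
    print("error output:", errors.strip() or "(none)")
    print("missing words:", missing_words(text, words))
```

If a word is reported missing here, the problem lies in the extracted text itself rather than in the indexing routine. If a file produces error output but still yields text, the indexer may be dropping files that could have been indexed.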

Anyone have insights into XPDF?

Thanks.
Question by:xfvgdrthbdtyvhgscv
LVL 35

Accepted Solution

by:
[ fanpages ] earned 500 total points
ID: 16973565
I do not have any previous experience of XPDF.

Have you tried contacting the software authors?  Perhaps there is a fault with it that they are not aware of.

You may also like to try extracting your PDF-based text into other formats using these tools:

HTML, or XML:
[ http://pdftohtml.sourceforge.net/ ]

RTF, MS-Word DOC, or Text (TXT):
[ http://www.verypdf.com/pdf2word/pdf-to-doc/pdf-to-txt.html ]

HTML, TXT, BMP or JPG:
[ http://www.topshareware.com/Paq-PDFTools-(-pdf2txt-pdf2html-pdf2htm-pdf-to-txt-pdf-to-html)-download-12417.htm ]


BFN,

fp.
 
LVL 1

Author Comment

by:xfvgdrthbdtyvhgscv
ID: 16979619
Well, although it doesn't actually answer the question, I guess I'll have to award the points by default.
 
LVL 35

Expert Comment

by:[ fanpages ]
ID: 16979848
You could have waited for further Expert input; you need not have closed the question at all.

Did you contact the software vendor?