Solved

Why is XPDF missing words and files when extracting text from pdf files?

Posted on 2006-06-23
Last Modified: 2012-06-21
Hello all,

I have a search engine that runs on our website server.  It uses an inverted index created by an indexing routine.  In order to index PDF files, the indexer calls the utility XPDF ( http://www.foolabs.com/xpdf/ ) to extract the text.

Mostly it works just fine, BUT:

- sometimes it misses words.  For example, if a PDF document is known to contain the phrase "blue sky", a search after indexing may find "blue" but not "sky".

- sometimes it skips files completely.  This seems to be accompanied by cryptic error messages like "Bad Annotation".
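To narrow down both symptoms, it may help to run the extraction step on its own and keep whatever the tool prints to stderr, so that a skipped file can be tied to a specific message and a missing word can be traced to either extraction or indexing. Below is a minimal sketch, assuming the XPDF text extractor is the `pdftotext` binary and is on the PATH. The function names (`build_command`, `extract_text`, `missing_terms`) are made up for illustration and are not part of XPDF or of the indexer described above.

```python
import re
import subprocess
from typing import Callable, List, NamedTuple


class ExtractResult(NamedTuple):
    path: str
    ok: bool       # True when pdftotext exited with status 0
    text: str      # whatever text was written to stdout
    errors: str    # anything pdftotext printed to stderr


def build_command(pdf_path: str) -> List[str]:
    # "-enc UTF-8" sets the output encoding; the trailing "-" asks
    # pdftotext to write the text to stdout instead of a .txt file.
    return ["pdftotext", "-enc", "UTF-8", pdf_path, "-"]


def extract_text(pdf_path: str, run: Callable = subprocess.run) -> ExtractResult:
    # "run" is injectable only so the sketch can be exercised without
    # pdftotext installed; by default it is subprocess.run.
    proc = run(build_command(pdf_path), capture_output=True,
               encoding="utf-8", errors="replace")
    return ExtractResult(pdf_path, proc.returncode == 0,
                         proc.stdout, proc.stderr.strip())


def missing_terms(text: str, expected: List[str]) -> List[str]:
    # Returns the expected words that do not appear as whole words
    # in the extracted text (case-insensitive).
    words = set(re.findall(r"\w+", text.lower()))
    return [t for t in expected if t.lower() not in words]
```

For a document known to contain "blue sky", calling `missing_terms(extract_text("some.pdf").text, ["blue", "sky"])` (with a hypothetical file name) would show whether "sky" is already absent from the extracted text, which would point at the extraction step rather than the index. Logging `errors` for every file where `ok` is false would show which messages go with the skipped files.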

Anyone have insights into XPDF?

Thanks.
Question by:xfvgdrthbdtyvhgscv
LVL 35

Accepted Solution

by:
[ fanpages ] earned 500 total points
ID: 16973565
I do not have any previous experience of XPDF.

Have you tried contacting the software authors?  Perhaps there is a fault with it that they are not aware of.

You may also like to try extracting your PDF-based text into other formats using these tools:

HTML, or XML:
[ http://pdftohtml.sourceforge.net/ ]

RTF, MS-Word DOC, or Text (TXT):
[ http://www.verypdf.com/pdf2word/pdf-to-doc/pdf-to-txt.html ]

HTML, TXT, BMP or JPG:
[ http://www.topshareware.com/Paq-PDFTools-(-pdf2txt-pdf2html-pdf2htm-pdf-to-txt-pdf-to-html)-download-12417.htm ]


BFN,

fp.
LVL 1

Author Comment

by:xfvgdrthbdtyvhgscv
ID: 16979619
Well, although it doesn't actually answer the question, I guess I'll have to award the points by default.
LVL 35

Expert Comment

by:[ fanpages ]
ID: 16979848
You could have waited for further Expert input; you need not have closed the question at all.

Did you contact the software vendor?