Solved

Why is XPDF missing words and files when extracting text from pdf files?

Posted on 2006-06-23
Last Modified: 2012-06-21
Hello all,

I have a search engine that runs on our website server.  It uses an inverted index created by an indexing routine.  In order to index PDF files, the indexer calls the utility XPDF ( http://www.foolabs.com/xpdf/ ) to extract the text.

Mostly it works just fine, BUT:

- sometimes it misses words.  For example, if a PDF document is known to contain the phrase "blue sky", after indexing a search may find "blue" but not "sky".

- sometimes it skips files completely.  This seems to be accompanied by cryptic error messages like "Bad Annotation".

Anyone have insights into XPDF?

Thanks.
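One way to narrow down both symptoms is to have the indexer record what the extractor actually returned for each file. The following is only a rough sketch, assuming the indexer shells out to the `pdftotext` command-line tool from the Xpdf package (the post does not say which Xpdf tool is called). The names `extract_text`, `missing_words` and `ExtractionResult` are made up for illustration. Per the pdftotext man page, `-enc` sets the output encoding and `-` as the output file sends the text to stdout.

```python
import re
import subprocess
from dataclasses import dataclass


@dataclass
class ExtractionResult:
    path: str
    text: str
    returncode: int
    stderr: str

    @property
    def ok(self) -> bool:
        # Assumption: a zero exit status means the extractor finished normally.
        return self.returncode == 0


def extract_text(pdf_path, base_cmd=("pdftotext", "-enc", "UTF-8")):
    """Run the extractor on one file and keep its exit code and stderr.

    base_cmd is a parameter so a different extractor (or a stub) can be
    swapped in; the default assumes pdftotext is on the PATH.
    """
    cmd = [*base_cmd, pdf_path, "-"]  # "-" asks pdftotext to write to stdout
    proc = subprocess.run(cmd, capture_output=True)
    return ExtractionResult(
        path=pdf_path,
        text=proc.stdout.decode("utf-8", errors="replace"),
        returncode=proc.returncode,
        stderr=proc.stderr.decode("utf-8", errors="replace").strip(),
    )


def missing_words(result, expected_words):
    """Return the expected words that do not appear as whole tokens."""
    tokens = set(re.findall(r"\w+", result.text.lower()))
    return [w for w in expected_words if w.lower() not in tokens]
```

Logging `returncode` and `stderr` per file shows whether a skipped file failed in the extractor or was dropped later by the indexer, and running `missing_words(result, ["blue", "sky"])` on a known document shows whether the word is already absent from the extracted text or is lost during tokenizing.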
Question by:xfvgdrthbdtyvhgscv
LVL 35

Accepted Solution

by:
[ fanpages ] earned 500 total points
ID: 16973565
I do not have any previous experience of XPDF.

Have you tried contacting the software authors?  Perhaps there is a fault with it that they are not aware of.

You may also like to try extracting the text from your PDF files into other formats using these tools:

HTML, or XML:
[ http://pdftohtml.sourceforge.net/ ]

RTF, MS-Word DOC, or Text (TXT):
[ http://www.verypdf.com/pdf2word/pdf-to-doc/pdf-to-txt.html ]

HTML, TXT, BMP or JPG:
[ http://www.topshareware.com/Paq-PDFTools-(-pdf2txt-pdf2html-pdf2htm-pdf-to-txt-pdf-to-html)-download-12417.htm ]


BFN,

fp.
LVL 1

Author Comment

by:xfvgdrthbdtyvhgscv
ID: 16979619
Well, although it doesn't actually answer the question, I guess I'll have to award the points by default.
LVL 35

Expert Comment

by:[ fanpages ]
ID: 16979848
You could have waited for further Expert input; you need not have closed the question at all.

Did you contact the software vendor?
