Solved

Why is XPDF missing words and files when extracting text from pdf files?

Posted on 2006-06-23
Last Modified: 2012-06-21
Hello all,

I have a search engine that runs on our website server.  It uses an inverted index created by an indexing routine.  In order to index PDF files, the indexer calls the utility XPDF ( http://www.foolabs.com/xpdf/ ) to extract the text.

Mostly it works just fine, BUT:

- sometimes it misses words.  For example, if a PDF document is known to contain the phrase "blue sky", after indexing a search may find "blue", but not "sky".

- sometimes it skips files completely.  This seems to be accompanied by cryptic error messages like "Bad Annotation".
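To narrow down both symptoms, one option is to have the indexer record what the extractor reports for each file instead of dropping it silently. The following is a minimal sketch, assuming the indexer can call Xpdf's `pdftotext` command-line tool from Python and that `pdftotext` is on the PATH. The names `build_command`, `extract_text`, `needs_review` and `ExtractResult` are hypothetical helpers, not part of Xpdf. The `-raw` switch (content-stream order) is included only as something to compare against the default output; whether it recovers the missing words is untested here.

```python
import subprocess
from typing import Callable, List, NamedTuple


class ExtractResult(NamedTuple):
    path: str
    text: str
    exit_code: int
    stderr: str


def build_command(pdf_path: str, raw: bool = False) -> List[str]:
    """Build a pdftotext command line that writes UTF-8 text to stdout."""
    cmd = ["pdftotext", "-enc", "UTF-8"]
    if raw:
        # -raw keeps text in content-stream order instead of reading order.
        cmd.append("-raw")
    # A text-file argument of "-" sends the extracted text to stdout.
    cmd += [pdf_path, "-"]
    return cmd


def extract_text(pdf_path: str, raw: bool = False,
                 run: Callable = subprocess.run) -> ExtractResult:
    """Run the extractor and keep its exit code and stderr for logging."""
    proc = run(build_command(pdf_path, raw), capture_output=True)
    return ExtractResult(
        path=pdf_path,
        text=proc.stdout.decode("utf-8", errors="replace"),
        exit_code=proc.returncode,
        stderr=proc.stderr.decode("utf-8", errors="replace").strip(),
    )


def needs_review(result: ExtractResult) -> bool:
    """Flag files with a non-zero exit code, any stderr output, or no text."""
    return (result.exit_code != 0
            or bool(result.stderr)
            or not result.text.strip())
```

Logging every result for which `needs_review` returns true would show whether messages such as "Bad Annotation" coincide with a non-zero exit code and empty output, or whether text was in fact produced and the file was dropped later in the indexing routine. For the missing-word case, searching the saved `text` for the expected phrase tells you whether the word was lost during extraction or during tokenizing.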

Anyone have insights into XPDF?

Thanks.
Question by:xfvgdrthbdtyvhgscv
 
LVL 35

Accepted Solution

by:
[ fanpages ] earned 500 total points
ID: 16973565
I do not have any previous experience of XPDF.

Have you tried contacting the software authors?  Perhaps there is a fault with it that they are not aware of.

You may also like to try extracting the text from your PDF files into other formats using these tools:

HTML, or XML:
[ http://pdftohtml.sourceforge.net/ ]

RTF, MS-Word DOC, or Text (TXT):
[ http://www.verypdf.com/pdf2word/pdf-to-doc/pdf-to-txt.html ]

HTML, TXT, BMP or JPG:
[ http://www.topshareware.com/Paq-PDFTools-(-pdf2txt-pdf2html-pdf2htm-pdf-to-txt-pdf-to-html)-download-12417.htm ]


BFN,

fp.
 
LVL 1

Author Comment

by:xfvgdrthbdtyvhgscv
ID: 16979619
Well, although it doesn't actually answer the question, I guess I'll have to award the points by default.
 
LVL 35

Expert Comment

by:[ fanpages ]
ID: 16979848
You could have waited for further Expert input; you need not have closed the question at all.

Did you contact the software vendor?

Question has a verified solution.
