I have a search engine that runs on our website server. It uses an inverted index built by an indexing routine. To index PDF files, the indexer calls the XPDF utility (http://www.foolabs.com/xpdf/) to extract the text.
Mostly it works just fine, BUT:
- sometimes it misses words. For example, if a PDF document is known to contain the phrase "blue sky", a search after indexing may find "blue" but not "sky".
- sometimes it skips files completely. This seems to be accompanied by cryptic error messages like "Bad Annotation".
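One way to narrow this down is to run the extraction step on its own, outside the indexer, and look at what comes back. Below is a minimal Python sketch, not the indexer's actual code. It assumes the extraction is done with XPDF's pdftotext command-line tool, that pdftotext is on the PATH, and that words are split on non-word characters. The function names extract_text and missing_words are hypothetical helpers made up for this illustration.

```python
import re
import subprocess


def extract_text(pdf_path, tool="pdftotext"):
    """Run pdftotext on one file and return (text, stderr, exit code).

    Assumption: the tool is XPDF's pdftotext. Passing "-" as the output
    file makes it write the extracted text to stdout.
    """
    result = subprocess.run(
        [tool, "-enc", "UTF-8", pdf_path, "-"],
        capture_output=True,
    )
    text = result.stdout.decode("utf-8", errors="replace")
    # Messages such as "Bad Annotation" would be expected to show up here.
    errors = result.stderr.decode("utf-8", errors="replace")
    return text, errors, result.returncode


def missing_words(text, expected_words):
    """Return the expected words that do not appear as whole tokens.

    Assumption: tokens are runs of word characters, compared in lower case.
    The real indexer may tokenize differently.
    """
    tokens = set(re.findall(r"\w+", text.lower()))
    return [word for word in expected_words if word.lower() not in tokens]
```

Calling extract_text on a problem file and passing the result to missing_words(text, ["blue", "sky"]) would show whether "sky" is already absent from the extracted text (for instance split into separate fragments) or whether it only gets lost later in the indexer. Keeping stderr and the exit code per file should also show whether the skipped files are the same ones that produce the error messages.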
Anyone have insights into XPDF?