Accessing data from a PDF file

Hello all.
I have a tax document in PDF format and want to extract data from it using C++. Is there a method by which I can parse the PDF file directly and extract the data?

I might have to deal with several thousand PDF documents.

Please help.

callrs

http://www.codeproject.com/cpp/ExtractPDFText.asp

ASKER CERTIFIED SOLUTION

callrs

andre6b

The answer really depends on whether your PDF document contains the data as text or as images. Text can be extracted relatively easily with one of the tools mentioned above. However, if the data is stored as images, you will have to use an OCR solution. As far as I am aware, there are no reliable open source products. For a commercial option, try FineReader: http://buy.abbyy.com/content/frpro/default.aspx

callrs

Good point. Search Google for "OCR open source" to look for open source solutions.

jhav1594

ASKER

Thanks andre. Are you aware of a technique to find out whether the PDF contains data as text or as images? I would imagine that my PDFs are scanned forms, so the data must be stored as images, but I would like to confirm this.

callrs

>>technique to find out if the pdf contains data as text or images
Read up on the PDF file format.

I don't know personally, but one way may be to attempt an extraction: getting no text back may mean the text is stored as images. However, could "text as images" also refer to encrypted PDF files, i.e. ones that have text selection or printing disabled? If that's the case, these links may help: http:Q_21881833.html "print pdf file", http:Q_21881892.html "unlock pdf file"
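
As a rough illustration of the "read up on the file format" idea, here is a minimal C++ sketch of a first-pass check. It relies on two facts about the format: font dictionaries carry the name /Font, and image XObjects carry /Subtype /Image. The sketch only counts those markers in the raw bytes of the file. The names PdfMarkerCounts, countOccurrences, scanPdfBytes, classify and readFile are made up for this example and do not come from any library. Treat it as a heuristic only: newer PDFs can keep these dictionaries inside compressed object streams, where a raw byte scan will not see them, and a scanned form with an OCR text layer will show both markers.

```cpp
#include <cstddef>
#include <fstream>
#include <iterator>
#include <string>

// Marker counts from a raw byte scan (hypothetical helper type).
struct PdfMarkerCounts {
    std::size_t fonts = 0;   // substring "/Font" (also matches /FontDescriptor etc.)
    std::size_t images = 0;  // "/Subtype /Image" or "/Subtype/Image"
};

// Count non-overlapping occurrences of needle in haystack.
std::size_t countOccurrences(const std::string& haystack, const std::string& needle) {
    std::size_t count = 0;
    for (std::size_t pos = haystack.find(needle); pos != std::string::npos;
         pos = haystack.find(needle, pos + needle.size())) {
        ++count;
    }
    return count;
}

// Scan the uncompressed parts of a PDF for font and image markers.
// Assumption: the relevant dictionaries are not hidden in compressed streams.
PdfMarkerCounts scanPdfBytes(const std::string& bytes) {
    PdfMarkerCounts c;
    c.fonts = countOccurrences(bytes, "/Font");
    c.images = countOccurrences(bytes, "/Subtype /Image") +
               countOccurrences(bytes, "/Subtype/Image");
    return c;
}

// Turn the counts into a cautious guess; this is not a definitive answer.
std::string classify(const PdfMarkerCounts& c) {
    if (c.fonts == 0 && c.images > 0) return "likely scanned images (OCR needed)";
    if (c.fonts > 0 && c.images == 0) return "likely contains text";
    if (c.fonts > 0 && c.images > 0) return "mixed or scanned with a text layer";
    return "undetermined";
}

// Read a whole file as bytes; returns an empty string if it cannot be opened.
std::string readFile(const std::string& path) {
    std::ifstream in(path, std::ios::binary);
    return std::string(std::istreambuf_iterator<char>(in),
                       std::istreambuf_iterator<char>());
}
```

To try it on one document, call classify(scanPdfBytes(readFile(path))) and print the result. For several thousand files, the same call can run in a loop over the file names, and anything that comes back "undetermined" or "mixed" is worth checking by hand or with a real PDF parser.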