jhav1594
Accessing data from a PDF file
Hello all.
I have a tax document in PDF format and want to extract data from it using C++. Is there any method by which I can parse the PDF file directly and extract the data?
I might have to deal with several thousand PDF documents.
Please help.
http://www.codeproject.com/cpp/ExtractPDFText.asp
The answer really depends on whether your PDF document contains the data as text or as images. Text can be extracted relatively easily with one of the tools mentioned above. However, if the data is stored as images, you will have to use an OCR solution. As far as I am aware, there are no reliable open source products. For a commercial one, try FineReader: http://buy.abbyy.com/content/frpro/default.aspx
Good point. Try Googling "OCR open source" to look for open source solutions.
ASKER
Thanks andre. Are you aware of a technique for finding out whether a PDF contains its data as text or as images? I would imagine that my PDFs are scanned forms, so the data must be stored as images, but I would like to confirm this.
>>technique to find out if the pdf contains data as text or images
Read up on the PDF file format.
I don't know personally, but one way may be to attempt an extraction: if no text comes back, the text may be stored as images. However, could "text as images" also refer to encrypted PDF files, i.e. ones that have text selection or printing disabled? If that's the case, these links may help: http:Q_21881833.html "print pdf file", http:Q_21881892.html "unlock pdf file"
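For a quick first pass over a few thousand files, a rough check along these lines might be enough to sort them before running a real extractor or OCR. This is only a sketch and not a PDF parser: it scans the raw bytes for a few name tokens that the PDF format uses (`/Font`, `/Subtype /Image`, `/Encrypt`). The helper names (`countOccurrences`, `scanPdfBytes`, `describe`, `classifyFile`) are made up for illustration, and the heuristic rests on the assumption that these dictionaries are stored uncompressed. Files that pack their objects into compressed object streams will come back as "undetermined", and a scan that already has an OCR text layer will look like a text PDF.

```cpp
#include <cstddef>
#include <fstream>
#include <iterator>
#include <string>

// Count non-overlapping occurrences of `needle` in `haystack`.
std::size_t countOccurrences(const std::string& haystack, const std::string& needle) {
    if (needle.empty()) return 0;
    std::size_t count = 0;
    for (std::size_t pos = haystack.find(needle); pos != std::string::npos;
         pos = haystack.find(needle, pos + needle.size())) {
        ++count;
    }
    return count;
}

// Hypothetical summary of what the raw byte scan found.
struct PdfHints {
    std::size_t fontRefs = 0;      // names starting with /Font (also /FontDescriptor etc.)
    std::size_t imageObjects = 0;  // image XObject markers
    bool encrypted = false;        // an /Encrypt entry was seen
};

// Scan raw PDF bytes for the marker tokens. Assumption: the relevant
// dictionaries are not hidden inside compressed object streams.
PdfHints scanPdfBytes(const std::string& bytes) {
    PdfHints h;
    h.fontRefs = countOccurrences(bytes, "/Font");
    h.imageObjects = countOccurrences(bytes, "/Subtype /Image") +
                     countOccurrences(bytes, "/Subtype/Image");
    h.encrypted = bytes.find("/Encrypt") != std::string::npos;
    return h;
}

// Turn the hints into a rough verdict. This is a guess, not a guarantee.
std::string describe(const PdfHints& h) {
    std::string verdict;
    if (h.fontRefs > 0) {
        verdict = "likely has a text layer";
    } else if (h.imageObjects > 0) {
        verdict = "likely scanned images only";
    } else {
        verdict = "undetermined";
    }
    if (h.encrypted) verdict += " (encrypted)";
    return verdict;
}

// Read a whole file in binary mode and classify it.
// Returns "unreadable" if the file cannot be opened.
std::string classifyFile(const std::string& path) {
    std::ifstream in(path, std::ios::binary);
    if (!in) return "unreadable";
    std::string bytes((std::istreambuf_iterator<char>(in)),
                      std::istreambuf_iterator<char>());
    return describe(scanPdfBytes(bytes));
}
```

Calling `classifyFile("form.pdf")` on each file (the file name is just a placeholder) gives one verdict per document. Anything reported as "likely scanned images only" would go to OCR, and anything "undetermined" is worth opening by hand or running through a proper extraction tool to see whether text comes back.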