Solved

Accessing data from a PDF file

Posted on 2006-06-24
Last Modified: 2010-04-17
Hello all.
I have a tax document in PDF format and want to extract data from it using C++. Is there a method by which I can directly parse the PDF file and extract the data?

I might have to deal with several thousand PDF documents.

Please help.

Question by:jhav1594
 
LVL 30

Expert Comment

by:callrs
ID: 16976597
 
LVL 30

Accepted Solution

by:
callrs earned 500 total points
ID: 16976605
 
LVL 1

Expert Comment

by:andre6b
ID: 16978198
The answer really depends on whether your PDF document contains the data as text or as images. Text can be extracted relatively easily with one of the tools mentioned above. However, if the data is stored as images, you will have to use an OCR solution. As far as I am aware, there are no reliable open source OCR products. For a commercial one, try FineReader: http://buy.abbyy.com/content/frpro/default.aspx
 
LVL 30

Expert Comment

by:callrs
ID: 16978906
Good point. Try searching Google for "OCR open source" to find open source solutions.


 

Author Comment

by:jhav1594
ID: 16987513
Thanks andre. Are you aware of a technique to find out whether a PDF contains its data as text or as images? I would imagine that my PDFs are scanned forms, so the data must be stored as images, but I would like to confirm this.
 
LVL 30

Expert Comment

by:callrs
ID: 16987708
>>technique to find out if the pdf contains data as text or images
Read up on the PDF file format.

I don't know personally, but one way may be to attempt an extraction: if no text comes back, the data may be stored as images. However, could "text as images" also refer to encrypted PDF files, i.e. ones that have text selection or printing disabled? If that's the case, these links may help: http:Q_21881833.html "print pdf file", http:Q_21881892.html "unlock pdf file"
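
The "attempt an extract" idea could be sketched in C++ roughly as follows. This is only an illustration under assumptions: it assumes the pdftotext command-line tool (shipped with Xpdf and Poppler) is installed and on the PATH, and it assumes a POSIX system for popen. The function names and the 20-character threshold are made up for this example and would need tuning on real forms.

```cpp
#include <cctype>
#include <cstddef>
#include <cstdio>
#include <string>

// Count characters that are not whitespace. A scanned page usually yields
// none, or only a form feed per page.
std::size_t countVisibleChars(const std::string& text) {
    std::size_t n = 0;
    for (unsigned char c : text) {
        if (!std::isspace(c)) ++n;
    }
    return n;
}

// Heuristic only: little or no text suggests image-only pages, but an
// encrypted or unusually encoded PDF can also come back empty.
// minChars is an arbitrary threshold chosen for this sketch.
bool looksImageOnly(const std::string& extractedText, std::size_t minChars = 20) {
    return countVisibleChars(extractedText) < minChars;
}

// Runs `pdftotext <file> -` and captures what it writes to stdout.
// Assumptions: pdftotext is on the PATH, popen is available (POSIX; use
// _popen on Windows), and the path contains no double quotes.
std::string extractText(const std::string& pdfPath) {
    std::string cmd = "pdftotext \"" + pdfPath + "\" - 2>/dev/null";
    std::string out;
    FILE* pipe = popen(cmd.c_str(), "r");
    if (!pipe) return out;  // could not start the tool: treated as "no text"
    char buf[4096];
    std::size_t got;
    while ((got = std::fread(buf, 1, sizeof buf, pipe)) > 0) {
        out.append(buf, got);
    }
    pclose(pipe);
    return out;
}
```

For a batch of several thousand files, one would call looksImageOnly(extractText(path)) for each path and send only the files flagged as image-only to OCR. A file that comes back empty is not proof of scanned images, so it is worth checking a few flagged files by hand.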
