Solved

Accessing data from a PDF file

Posted on 2006-06-24
Last Modified: 2010-04-17
Hello all.
I have a tax document in PDF format. I want to extract data from that PDF file using C++. Is there a method by which I can parse the PDF file directly and extract the data?

I might have to deal with several thousand PDF documents.

Please help.

Question by:jhav1594
 
LVL 30

Expert Comment

by:callrs
ID: 16976597
 
LVL 30

Accepted Solution

by:
callrs earned 500 total points
ID: 16976605
 
LVL 1

Expert Comment

by:andre6b
ID: 16978198
The answer really depends on whether your PDF document contains the data as text or as images. Text can be extracted relatively easily by one of the tools mentioned above. However, if the data is stored as images, you will have to use an OCR solution. As far as I am aware, there are no reliable open source OCR products. For a commercial option, try FineReader: http://buy.abbyy.com/content/frpro/default.aspx
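
For the text case, a minimal C++ sketch of batch-friendly extraction is shown here. It assumes a POSIX system and that the `pdftotext` command-line tool (shipped with Xpdf and Poppler) is installed and on the PATH; that tool is an assumption for illustration, not necessarily one of the tools referred to above. The helper names `shellQuote`, `runCommand`, and `extractPdfText` are hypothetical.

```cpp
#include <stdio.h>    // popen, pclose (POSIX)
#include <cstddef>
#include <stdexcept>
#include <string>

// Wrap a string in single quotes for a POSIX shell,
// escaping any single quotes it already contains.
std::string shellQuote(const std::string& s) {
    std::string out = "'";
    for (char c : s) {
        if (c == '\'') out += "'\\''";
        else out += c;
    }
    out += "'";
    return out;
}

// Run a shell command and return everything it wrote to stdout.
std::string runCommand(const std::string& cmd) {
    FILE* pipe = popen(cmd.c_str(), "r");
    if (!pipe) throw std::runtime_error("popen failed for: " + cmd);
    std::string output;
    char buf[4096];
    std::size_t n;
    while ((n = fread(buf, 1, sizeof buf, pipe)) > 0) {
        output.append(buf, n);
    }
    pclose(pipe);
    return output;
}

// Extract the text layer of one PDF. Assumes pdftotext is installed;
// "-" as the output file sends the text to stdout, and -layout asks the
// tool to keep the original physical layout, which can help with forms.
std::string extractPdfText(const std::string& pdfPath) {
    return runCommand("pdftotext -layout " + shellQuote(pdfPath) + " -");
}
```

Calling `extractPdfText` in a loop over a directory listing would cover the several-thousand-document case; the returned string still has to be parsed for the specific tax fields, which depends on the form layout.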
LVL 30

Expert Comment

by:callrs
ID: 16978906
Good point. Search Google for "OCR open source" to look for open source solutions.


 

Author Comment

by:jhav1594
ID: 16987513
Thanks, andre. Are you aware of a technique to find out whether the PDF contains data as text or as images? I would imagine that my PDFs are scanned forms, so the data must be stored as images, but I would like to confirm this.
 
LVL 30

Expert Comment

by:callrs
ID: 16987708
>> technique to find out if the pdf contains data as text or images
Read up on the PDF file format.

I don't know personally, but one way may be to attempt an extraction: if no text comes back, the content is probably stored as images. However, could "text as images" also refer to encrypted PDF files, i.e. ones that have text selection or printing disabled? If that's the case, these links may help: http:Q_21881833.html "print pdf file", http:Q_21881892.html "unlock pdf file"
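
The "attempt an extract and see whether any text comes back" idea could be sketched in C++ roughly as follows. It takes the output of whatever extraction tool is used as a plain string; the names `countVisibleChars`, `classifyByExtractedText`, and `PdfKind`, and the default threshold of 50 characters, are assumptions chosen for illustration. It is only a heuristic: a protected file, or a scan that carries a hidden OCR text layer, can be misclassified.

```cpp
#include <cctype>
#include <cstddef>
#include <string>

enum class PdfKind { LikelyText, LikelyImageOnly };

// Count bytes that are neither whitespace nor control characters.
// Extraction tools often emit form feeds and newlines even for
// empty pages, so those must not count as "text".
std::size_t countVisibleChars(const std::string& text) {
    std::size_t n = 0;
    for (unsigned char c : text) {
        if (!std::isspace(c) && !std::iscntrl(c)) ++n;
    }
    return n;
}

// Heuristic: if extraction returned fewer visible characters than the
// threshold, the pages are probably scanned images and need OCR.
// The threshold is an arbitrary starting point; tune it on real files.
PdfKind classifyByExtractedText(const std::string& extracted,
                                std::size_t minVisibleChars = 50) {
    return countVisibleChars(extracted) >= minVisibleChars
               ? PdfKind::LikelyText
               : PdfKind::LikelyImageOnly;
}
```

Running this over a sample of the documents first would show quickly whether the whole batch is scanned forms, as suspected, before committing to an OCR product.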

Question has a verified solution.
