Solved

Accessing data from a PDF file

Posted on 2006-06-24
6
219 Views
Last Modified: 2010-04-17
Hello all.
I have a tax document in PDF format. I want to extract data from that PDF file using C++. Is there any method by which I can directly parse the PDF file and extract the data?

I might have to deal with several thousand PDF documents.

Please help.

Question by:jhav1594
6 Comments
 
LVL 30

Expert Comment

by:callrs
ID: 16976597
0
 
LVL 30

Accepted Solution

by:
callrs earned 500 total points
ID: 16976605
0
 
LVL 1

Expert Comment

by:andre6b
ID: 16978198
The answer really depends on whether your PDF document actually contains the data as text or as images. Text can be extracted relatively easily by one of the tools mentioned above. However, if the data is stored as images, you will have to use an OCR solution. As far as I am aware, there are no reliable open source products. For a commercial option, try FineReader: http://buy.abbyy.com/content/frpro/default.aspx
LVL 30

Expert Comment

by:callrs
ID: 16978906
Good point. Google for: OCR open source, to try to find open source solutions.


0
 

Author Comment

by:jhav1594
ID: 16987513
Thanks, andre. Are you aware of a technique to find out whether the PDF contains data as text or as images? I would imagine that the PDFs I have are scanned forms, and hence the data must be stored as images, but I would like to confirm this.
0
 
LVL 30

Expert Comment

by:callrs
ID: 16987708
>>technique to find out if the pdf contains data as text or images
Read up about the pdf file format.

I don't know personally, but one way may be: attempt an extract; getting no text back may mean the text is stored as images. However, can "text as images" refer to encrypted PDF files, i.e. ones that have text-select or print disabled? If that's the case, these links may help: http:Q_21881833.html "print pdf file", http:Q_21881892.html "unlock pdf file"
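
A rough way to try the "text or images" check in plain C++, without any PDF library, is to scan the raw bytes of the file for the dictionary names the PDF format uses for fonts and for image XObjects. This is only a sketch under assumptions: the names PdfMarkers, scan_pdf_bytes, guess_content and load_file are made up for illustration, and the heuristic can be fooled. Newer PDFs may keep these dictionaries inside compressed object streams, where a byte scan cannot see them, and a scanned form with a hidden OCR text layer contains both fonts and images. Treat the result as a first filter over a large batch, not as a verdict.

```cpp
#include <cctype>
#include <cstddef>
#include <fstream>
#include <iterator>
#include <string>

// Hypothetical result type for this sketch (not part of any PDF library).
struct PdfMarkers {
    std::size_t font_refs = 0;       // names starting with "/Font"
    std::size_t image_xobjects = 0;  // "/Subtype /Image" entries
};

// Count non-overlapping occurrences of `needle` in `data`.
static std::size_t count_token(const std::string& data, const std::string& needle) {
    std::size_t n = 0;
    for (std::size_t pos = data.find(needle); pos != std::string::npos;
         pos = data.find(needle, pos + needle.size())) {
        ++n;
    }
    return n;
}

// Count "/Subtype" keys whose value is exactly the name "/Image".
// Whitespace between key and value is optional in PDF syntax, and names
// such as "/ImageB" (used in /ProcSet arrays) must not be counted.
static std::size_t count_image_xobjects(const std::string& data) {
    const std::string key = "/Subtype";
    const std::string value = "/Image";
    std::size_t n = 0;
    for (std::size_t pos = data.find(key); pos != std::string::npos;
         pos = data.find(key, pos + key.size())) {
        std::size_t i = pos + key.size();
        while (i < data.size() && std::isspace(static_cast<unsigned char>(data[i]))) {
            ++i;
        }
        if (data.compare(i, value.size(), value) == 0) {
            const std::size_t end = i + value.size();
            if (end >= data.size() ||
                !std::isalnum(static_cast<unsigned char>(data[end]))) {
                ++n;
            }
        }
    }
    return n;
}

// Scan the uncompressed parts of a PDF for font and image markers.
PdfMarkers scan_pdf_bytes(const std::string& data) {
    PdfMarkers m;
    m.font_refs = count_token(data, "/Font");
    m.image_xobjects = count_image_xobjects(data);
    return m;
}

// Turn the counts into a rough guess. The labels are arbitrary.
std::string guess_content(const PdfMarkers& m) {
    if (m.font_refs == 0 && m.image_xobjects > 0) return "image-only";
    if (m.font_refs > 0 && m.image_xobjects == 0) return "text";
    if (m.font_refs > 0) return "mixed";
    return "unknown";
}

// Read a whole file in binary mode; returns an empty string on failure.
std::string load_file(const std::string& path) {
    std::ifstream in(path, std::ios::binary);
    return std::string(std::istreambuf_iterator<char>(in),
                       std::istreambuf_iterator<char>());
}
```

For one file, guess_content(scan_pdf_bytes(load_file("form.pdf"))) gives the label, where form.pdf is a placeholder path. Files that come back "image-only" are the likely OCR candidates; "mixed" and "unknown" still need an actual extraction attempt to settle.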


Question has a verified solution.
