Solved

Accessing data from a PDF file

Posted on 2006-06-24
Last Modified: 2010-04-17
Hello all.
I have a tax document in PDF format and want to extract data from it using C++. Is there a way to parse the PDF file directly and extract the data?

I may have to deal with several thousand PDF documents.

Please help.

Question by:jhav1594
6 Comments
 
LVL 30

Expert Comment

by:callrs
ID: 16976597
0
 
LVL 30

Accepted Solution

by:
callrs earned 2000 total points
ID: 16976605
0
 
LVL 1

Expert Comment

by:andre6b
ID: 16978198
The answer really depends on whether your PDF document actually contains the data as text or as images. Text can be extracted relatively easily with one of the tools mentioned above. However, if the data is stored as images, you will have to use an OCR solution. As far as I am aware, there are no reliable open source OCR products. For a commercial option, try FineReader: http://buy.abbyy.com/content/frpro/default.aspx
0

 
LVL 30

Expert Comment

by:callrs
ID: 16978906
Good point. Try a Google search for "OCR open source" to look for open source solutions.


0
 

Author Comment

by:jhav1594
ID: 16987513
Thanks andre. Are you aware of a technique to find out whether a PDF contains its data as text or as images? I would imagine that my PDFs are scanned forms, so the data must be stored as images, but I would like to confirm this.
0
 
LVL 30

Expert Comment

by:callrs
ID: 16987708
>>technique to find out if the pdf contains data as text or images
Read up on the PDF file format.

I don't know personally, but one way may be to attempt an extraction: getting no text back may mean the text is stored as images. However, could "text as images" also refer to encrypted PDF files, i.e. ones that have text selection or printing disabled? If that's the case, these links may help: http:Q_21881833.html "print pdf file", http:Q_21881892.html "unlock pdf file"
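
A rough C++ sketch of the "attempt an extract" idea, in case it helps. It assumes a POSIX system with the pdftotext command-line tool (from Xpdf or Poppler) installed and on the PATH; nobody in this thread has confirmed that is the tool you have. The function names and the 20-character threshold are made up for illustration, so tune them against a few of your own files. Treat it as a heuristic, not a definitive test: an encrypted or damaged PDF can also come back empty or make pdftotext fail.

```cpp
#include <cassert>
#include <cctype>
#include <cstdio>      // popen/pclose are POSIX extensions declared here
#include <stdexcept>
#include <string>
#include <sys/wait.h>  // WIFEXITED, WEXITSTATUS (POSIX)

// Count characters that are neither whitespace nor control characters.
// pdftotext separates pages with form feeds, which count as whitespace here.
std::size_t countVisibleChars(const std::string& text) {
    std::size_t n = 0;
    for (unsigned char c : text) {
        if (!std::isspace(c) && !std::iscntrl(c)) ++n;
    }
    return n;
}

// Hypothetical heuristic: almost no visible text suggests scanned images.
// The default threshold of 20 is an arbitrary assumption.
bool looksImageOnly(const std::string& extractedText, std::size_t minChars = 20) {
    return countVisibleChars(extractedText) < minChars;
}

// Wrap a path in single quotes so a POSIX shell treats it as one argument.
std::string shellQuote(const std::string& s) {
    std::string out = "'";
    for (char c : s) {
        if (c == '\'') out += "'\\''";  // close quote, escaped quote, reopen
        else out += c;
    }
    out += "'";
    return out;
}

// Run "pdftotext <file> -" (the trailing "-" sends the text to stdout)
// and return whatever it prints. Throws if the tool cannot be run or fails.
std::string extractText(const std::string& pdfPath) {
    assert(!pdfPath.empty());
    const std::string cmd = "pdftotext " + shellQuote(pdfPath) + " - 2>/dev/null";
    FILE* pipe = popen(cmd.c_str(), "r");
    if (!pipe) throw std::runtime_error("could not start pdftotext");
    std::string text;
    char buf[4096];
    std::size_t got = 0;
    while ((got = std::fread(buf, 1, sizeof buf, pipe)) > 0) text.append(buf, got);
    const int status = pclose(pipe);
    if (status == -1 || !WIFEXITED(status) || WEXITSTATUS(status) != 0) {
        // Missing tool, unreadable file, or a permissions/encryption problem.
        throw std::runtime_error("pdftotext failed for " + pdfPath);
    }
    return text;
}
```

For a batch of several thousand files, you would call extractText on each path, catch the exception for files that fail, and send anything where looksImageOnly returns true to an OCR step.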

0
