Solved

Accessing data from a PDF file

Posted on 2006-06-24
Last Modified: 2010-04-17
Hello all.
I have a tax document in PDF format. I want to extract data from that PDF file using C++. Is there any method by which I can directly parse the PDF file and extract the data?

I might have to deal with several thousand PDF documents.

Please help.

Question by:jhav1594
 
LVL 30

Expert Comment

by:callrs
ID: 16976597
 
LVL 30

Accepted Solution

by:
callrs earned 500 total points
ID: 16976605
 
LVL 1

Expert Comment

by:andre6b
ID: 16978198
The answer really depends on whether your PDF document actually contains the data as text or as images. Text can be extracted relatively easily with one of the tools mentioned above. However, if the data is stored as images, you will have to use an OCR solution. There are no reliable open source products as far as I am aware. For a commercial option, try FineReader: http://buy.abbyy.com/content/frpro/default.aspx
 
LVL 30

Expert Comment

by:callrs
ID: 16978906
Good point. Try searching Google for "OCR open source" to find open source solutions.


 

Author Comment

by:jhav1594
ID: 16987513
Thanks andre. Are you aware of a technique to find out whether the PDF contains data as text or as images? I would imagine that my PDFs are scanned forms and hence the data must be stored as images, but I would like to confirm this.
 
LVL 30

Expert Comment

by:callrs
ID: 16987708
>>technique to find out if the pdf contains data as text or images
Read up on the PDF file format.

I don't know personally, but one way may be to attempt an extraction: getting no text back may mean the text is stored as images. However, can "text as images" refer to encrypted PDF files, i.e. ones that have text selection or printing disabled? If that's the case, these links may help: http:Q_21881833.html "print pdf file", http:Q_21881892.html "unlock pdf file"
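
For what it's worth, the "attempt an extract and see if any text comes back" idea could be sketched in C++ roughly as below. This is only a sketch under assumptions: it assumes the pdftotext command-line tool (from Xpdf or Poppler) is installed and on the PATH, and that you are on a POSIX system where popen is available. The function names (looks_like_text, run_and_capture, pdf_has_text_layer) and the 20-character threshold are made up for illustration and are not part of any library.

```cpp
#include <array>
#include <cctype>
#include <cstddef>
#include <cstdio>
#include <stdexcept>
#include <string>

// Returns true if the extracted text contains at least `min_chars`
// letters or digits. The default threshold of 20 is an arbitrary
// assumption; tune it against a few of your own documents.
bool looks_like_text(const std::string& extracted, std::size_t min_chars = 20) {
    std::size_t count = 0;
    for (unsigned char c : extracted) {
        if (std::isalnum(c)) {
            ++count;
        }
    }
    return count >= min_chars;
}

// Runs a shell command and returns everything it wrote to stdout.
// Relies on POSIX popen/pclose, so it is not portable to every platform.
std::string run_and_capture(const std::string& command) {
    FILE* pipe = popen(command.c_str(), "r");
    if (pipe == nullptr) {
        throw std::runtime_error("popen failed for: " + command);
    }
    std::string output;
    std::array<char, 4096> buffer{};
    std::size_t n = 0;
    while ((n = std::fread(buffer.data(), 1, buffer.size(), pipe)) > 0) {
        output.append(buffer.data(), n);
    }
    pclose(pipe);
    return output;
}

// Hypothetical helper: asks pdftotext to write the text to stdout ("-")
// and checks whether anything meaningful came back.
// Note: the quoting here is naive and breaks on paths containing a
// single quote; treat it as illustration only.
bool pdf_has_text_layer(const std::string& pdf_path) {
    const std::string command = "pdftotext '" + pdf_path + "' - 2>/dev/null";
    return looks_like_text(run_and_capture(command));
}
```

You would call pdf_has_text_layer on each file and send the ones that return false to OCR. Keep in mind this is a heuristic: an encrypted or restricted file may also come back empty, and a scanned form that already went through OCR can contain a hidden text layer and would be reported as text.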
