Solved

Convert PDF to ASCII?

Posted on 2000-05-08
Last Modified: 2008-03-10
How can I convert an *entire* Adobe PDF file into ASCII text?

Is there any software or technique for this, in C or any other language?

I know there are some Perl modules that parse the informational headers (which are already in ASCII).
Question by:eng40490
 
LVL 16

Accepted Solution

by:
maneshr earned 50 total points
ID: 2789669
You might want to take a look at PDF2TXT. It is a simple tool (a Perl script) for extracting text from PDF files.

Key Features:

.. Extract Japanese text - JIS, SJIS, EUC and UCS2 encoded strings from
  Japanese PDF files by use of CMap files.
.. Extract document information and bookmarks.
.. Support decoding methods:
    ASCIIHexDecode, ASCII85Decode, FlateDecode, LZWDecode
.. Crypto (pdf2txt_X.XX) and non-crypto (pdf2txt_X.XX_no_crypto) version.
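
To give a feel for what the decoding methods listed above involve, here is a minimal Python sketch of two of them, ASCIIHexDecode and FlateDecode. It is not PDF2TXT's own Perl code; the function names `ascii_hex_decode` and `flate_decode` are made up for illustration, and the sketch assumes you have already cut the raw stream bytes out of the PDF file.

```python
import zlib

# PDF whitespace bytes: NUL, tab, LF, form feed, CR, space.
PDF_WHITESPACE = b"\x00\t\n\x0c\r "


def ascii_hex_decode(data: bytes) -> bytes:
    """Decode an ASCIIHexDecode stream (hypothetical helper name)."""
    # Everything after the '>' end-of-data marker is ignored.
    body = data.split(b">", 1)[0]
    # Whitespace between hex digits is not significant.
    digits = bytes(c for c in body if c not in PDF_WHITESPACE)
    # An odd number of digits is treated as if a trailing 0 followed.
    if len(digits) % 2:
        digits += b"0"
    return bytes.fromhex(digits.decode("ascii"))


def flate_decode(data: bytes) -> bytes:
    """Decode a FlateDecode stream (zlib/deflate), without predictors."""
    return zlib.decompress(data)


if __name__ == "__main__":
    print(ascii_hex_decode(b"48 65 6C6C 6F>"))
    print(flate_decode(zlib.compress(b"BT (Offer) Tj ET")))
```

Decoding the stream only gets you the page's drawing operators. Turning the strings inside them into readable text still depends on each font's encoding, which is the harder part.
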

Requirements:           UNIX

[common]
.. Perl (>= 5.005_03)    CPAN
.. zlib                  http://www.cdrom.com/pub/infozip/zlib/
.. Compress::Zlib        CPAN
.. uncompress            UNIX uncompress command
.. Jcode                 http://openlab.ring.gr.jp/Jcode/
.. Base85.pl             ftp://www.isl.intec.co.jp/pub/person/ishida/freeware/pdf2txt/
.. PDFLZW.pl             ftp://www.isl.intec.co.jp/pub/person/ishida/freeware/pdf2txt/
.. PDFEncoding.pl        ftp://www.isl.intec.co.jp/pub/person/ishida/freeware/pdf2txt/
.. makeCMap.pl*          ftp://www.isl.intec.co.jp/pub/person/ishida/freeware/pdf2txt/
.. aj12.tar.Z            ftp://ftp.oreilly.com/pub/examples/nutshell/cjkv/adobe/
.. aj20.tar.Z            ftp://ftp.oreilly.com/pub/examples/nutshell/cjkv/adobe/

[+crypto version]
.. MD5                   CPAN
.. RC4.pl                ftp://www.isl.intec.co.jp/pub/person/ishida/freeware/pdf2txt/

To get pdf2txt:

  ftp://www.isl.intec.co.jp/pub/person/ishida/freeware/pdf2txt/

====================================================
Here is another way to do the same.

PDF Conversion by E-mail

There are three e-mail options you can use to convert PDF
documents to a format that is more accessible to screen reading
software. The e-mail address you use depends on the conversion
format you want, plain (ASCII) text or HTML, and whether the
PDF is on the Internet or local media.

Option 1
If the PDF is on the Internet, you can mail the URL (web address)
of the PDF in the body of an email message to
pdf2txt@adobe.com (for plain text) or to pdf2html@adobe.com
(for HTML). The convertor will mail back the translation of the
PDF file. You can submit multiple URLs in a single e-mail.

Tip: Some URLs are very long and cumbersome to type. Cutting
and pasting the URL into the mail message will save you some
keystrokes.

Option 2
If the PDF is on local media, such as a hard drive, CD-ROM, or
internal server, it can be submitted as a MIME attachment to an
e-mail message. All converted pdf-documents will be sent back
to the sender as MIME attachments. For plain text, mail the
attached PDF to pdf2txt@adobe.com. For HTML, mail the
attached PDF to pdf2html@adobe.com.
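
If you want to script Option 2, a message with the PDF as a MIME attachment can be built with Python's standard `email` package. This is only a sketch: the function name `build_conversion_request`, the subject line, and the sender address are assumptions, and nothing in this thread confirms what the service expects beyond an attached PDF.

```python
from email.message import EmailMessage


def build_conversion_request(pdf_bytes: bytes, filename: str, sender: str,
                             service: str = "pdf2txt@adobe.com") -> EmailMessage:
    """Build (but do not send) a mail with the PDF attached.

    The default service address is the one quoted in the post; pass
    "pdf2html@adobe.com" instead to ask for HTML.
    """
    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = service
    msg["Subject"] = "PDF conversion request"  # assumed, not required by the post
    msg.set_content("PDF attached for conversion.")
    # Attach the file as application/pdf; the message becomes multipart/mixed.
    msg.add_attachment(pdf_bytes, maintype="application", subtype="pdf",
                       filename=filename)
    return msg


if __name__ == "__main__":
    demo = build_conversion_request(b"%PDF-1.3\n", "doc.pdf", "me@example.com")
    print(demo["To"], demo.get_content_type())
```

Sending the result is a separate step, for example with `smtplib` and whatever mail server you normally use.
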

Option 3

A service hosted by Trace Research Center also allows you to
convert PDF documents.

You can either mail the URL of the PDF or attach the PDF
document itself to your email message and send it to
pdf2txt@sun.trace.wisc.edu (for plain text) or to
pdf2html@sun.trace.wisc.edu (for HTML). The convertor will
mail back the translation of the PDF file.

Adobe would like to thank Dr. Gregg Vanderheiden and the
Trace Research Center (http://trace.wisc.edu) for helping us host
this service.

For further info ....
http://access.adobe.com/access_email.html
 
LVL 16

Expert Comment

by:maneshr
ID: 2797917
Were you able to find a solution to your question?

If so, please let me know which solution you used.

Thanks
 

Author Comment

by:eng40490
ID: 2813196
I tried all the email solutions. They all failed in the same place: they converted 'Offer' to 'oxxxxer', where xxxx is some funny character.

Seems like that's the state of the art in PDF-to-text conversion.
 
LVL 16

Expert Comment

by:maneshr
ID: 2813973
Hmm!!

Even the Adobe site gave you this problem?
Did it do this for all characters or only some?

Let me know.
 

Author Comment

by:eng40490
ID: 2817470
Global Oering
the O,cial List of the SES

I found 5 instances of the first line and 1 instance of the second. All other occurrences of 'F' are converted correctly.

I used pdf2txt@adobe.com, pdf2html@adobe.com, and pdf2txt@sun.trace.wisc.edu.
I used 2 PDF documents.

All suffered from the same problem.
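
A possible reading of the two samples above, not confirmed anywhere in this thread: the missing letters sit exactly where 'ff' ("Offering") and 'ffi' ("Official") would be, and typeset documents often draw those as single ligature glyphs. If a converter emits the Unicode ligature characters rather than dropping them, they can be expanded afterwards. The sketch below assumes that case; `LIGATURE_MAP` and `expand_ligatures` are illustrative names, and it cannot recover letters that were already lost.

```python
# Map the Latin ligature code points (U+FB00..U+FB04) to plain letters.
LIGATURE_MAP = {
    0xFB00: "ff",
    0xFB01: "fi",
    0xFB02: "fl",
    0xFB03: "ffi",
    0xFB04: "ffl",
}


def expand_ligatures(text: str) -> str:
    """Replace ligature characters with their plain-ASCII spelling."""
    return text.translate(LIGATURE_MAP)


def find_ligatures(text: str) -> list[tuple[int, str]]:
    """Return (index, expansion) for every ligature character in text."""
    return [(i, LIGATURE_MAP[ord(ch)])
            for i, ch in enumerate(text) if ord(ch) in LIGATURE_MAP]


if __name__ == "__main__":
    sample = "Global O\ufb00ering\nthe O\ufb03cial List of the SES"
    print(find_ligatures(sample))
    print(expand_ligatures(sample))
```

A targeted table is used here instead of a blanket Unicode normalization so that nothing else in the text is rewritten.
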
 
LVL 16

Expert Comment

by:maneshr
ID: 2818123
I would suggest that you report this problem to Adobe and bring it to their notice. Who knows, this might be a known bug, a version compatibility problem, or something!!

Rgds
 

Author Comment

by:eng40490
ID: 2820031
Actually, only 1 PDF document suffered from the problem. I just sent another for conversion and it's OK, so it looks like the first PDF document has something unusual.
 
LVL 16

Expert Comment

by:maneshr
ID: 2821125
Was the problematic PDF file created using a different version of Acrobat than the one that worked fine?

Does the problematic PDF file have special/international characters in it?

Please let me know.
 

Author Comment

by:eng40490
ID: 2822305
Can't ask the author/creator of the doc. No special or international characters in the problematic *words*. I did not read every word of the doc, so I can't say about the entire doc.
 
LVL 16

Expert Comment

by:maneshr
ID: 2822391
Sorry, can't think of anything else that might cause the problem. :-(