Solved

PDF -> ASCII conversion?

Posted on 2000-05-08
Last Modified: 2008-03-10
I need to convert an *entire* Adobe PDF file into ASCII.

Is there any software or technique for this, in C or any other language?

I know there are some Perl modules that parse the informational headers (which are already in ASCII).
Question by:eng40490
 
LVL 16

Accepted Solution

by:
maneshr earned 50 total points
ID: 2789669
You might want to take a look at PDF2TXT. It is a simple tool (a Perl script) for extracting text from PDF files.

Key Features:

.. Extract Japanese text - JIS, SJIS, EUC and UCS2 encoded strings from
  Japanese PDF files by use of CMap files.
.. Extract document information and bookmark.
.. Support decoding methods:
    ASCIIHexDecode, ASCII85Decode, FlateDecode, LZWDecode
.. Crypto (pdf2txt_X.XX) and non-crypto (pdf2txt_X.XX_no_crypto) version.
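
To make the decoding methods listed above more concrete, here is a minimal Python sketch of three of them (ASCIIHexDecode, ASCII85Decode, FlateDecode). It is not taken from pdf2txt, which is a Perl script; the function names and the `apply_filters` helper are hypothetical, and LZWDecode is left out because the Python standard library has no decoder for it. It also assumes the stream bytes have already been cut out of the PDF file.

```python
import base64
import zlib


def ascii_hex_decode(data: bytes) -> bytes:
    """Decode hex digits up to the '>' end marker, ignoring whitespace."""
    body = data.split(b">", 1)[0]
    digits = b"".join(body.split())
    if len(digits) % 2:
        # An odd final digit is treated as if it were followed by 0.
        digits += b"0"
    return bytes.fromhex(digits.decode("ascii"))


def ascii85_decode(data: bytes) -> bytes:
    """Decode Ascii85 text, stripping the optional '<~' and the '~>' marker."""
    body = data.strip()
    if body.startswith(b"<~"):
        body = body[2:]
    if body.endswith(b"~>"):
        body = body[:-2]
    return base64.a85decode(body)


def flate_decode(data: bytes) -> bytes:
    """Inflate zlib-compressed data."""
    return zlib.decompress(data)


# Hypothetical lookup table keyed by the filter names listed above.
DECODERS = {
    "ASCIIHexDecode": ascii_hex_decode,
    "ASCII85Decode": ascii85_decode,
    "FlateDecode": flate_decode,
}


def apply_filters(data: bytes, filter_names: list) -> bytes:
    """Apply the named filters in order, as a stream's filter list would."""
    for name in filter_names:
        data = DECODERS[name](data)
    return data
```

A stream that was compressed and then Ascii85-encoded would be decoded with `apply_filters(raw, ["ASCII85Decode", "FlateDecode"])`. The result is still page content (drawing operators and strings), not plain text, so this covers only the first step of what a tool like pdf2txt does.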

Requirements:           UNIX

[common]
.. Perl (>= 5.005_03)    CPAN
.. zlib                  http://www.cdrom.com/pub/infozip/zlib/
.. Compress::Zlib        CPAN
.. uncompress            UNIX uncompress command
.. Jcode                 http://openlab.ring.gr.jp/Jcode/
.. Base85.pl             ftp://www.isl.intec.co.jp/pub/person/ishida/freeware/pdf2txt/
.. PDFLZW.pl             ftp://www.isl.intec.co.jp/pub/person/ishida/freeware/pdf2txt/
.. PDFEncoding.pl        ftp://www.isl.intec.co.jp/pub/person/ishida/freeware/pdf2txt/
.. makeCMap.pl*          ftp://www.isl.intec.co.jp/pub/person/ishida/freeware/pdf2txt/
.. aj12.tar.Z            ftp://ftp.oreilly.com/pub/examples/nutshell/cjkv/adobe/
.. aj20.tar.Z            ftp://ftp.oreilly.com/pub/examples/nutshell/cjkv/adobe/

[+crypto version]
.. MD5                   CPAN
.. RC4.pl                ftp://www.isl.intec.co.jp/pub/person/ishida/freeware/pdf2txt/

To get pdf2txt:

  ftp://www.isl.intec.co.jp/pub/person/ishida/freeware/pdf2txt/

====================================================
Here is another way to do the same.

PDF Conversion by E-mail

There are three e-mail options you can use to convert PDF
documents to a format that is more accessible to screen reading
software. The e-mail address you use depends on the conversion
format you want, plain (ASCII) text or HTML, and whether the
PDF is on the Internet or local media.

Option 1
If the PDF is on the Internet, you can mail the URL (web address)
of the PDF in the body of an email message to
pdf2txt@adobe.com (for plain text) or to pdf2html@adobe.com
(for HTML). The convertor will mail back the translation of the
PDF file. You can submit multiple URLs in a single e-mail.

Tip: Some URLs are very long and cumbersome to type. Cutting
and pasting the URL into the mail message will save you some
keystrokes.

Option 2
If the PDF is on local media, such as a hard drive, CD-ROM, or
internal server, it can be submitted as a MIME attachment to an
e-mail message. All converted pdf-documents will be sent back
to the sender as MIME attachments. For plain text, mail the
attached PDF to pdf2txt@adobe.com. For HTML, mail the
attached PDF to pdf2html@adobe.com.
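
Options 1 and 2 can be scripted. The following is a rough Python sketch that only builds the two kinds of request message. The addresses come from the text above; the function names, the subject line, and the placeholder body text are assumptions, and nothing here confirms that the service still answers.

```python
from email.message import EmailMessage

TEXT_ADDRESS = "pdf2txt@adobe.com"    # plain-text conversion
HTML_ADDRESS = "pdf2html@adobe.com"   # HTML conversion


def build_url_request(sender, urls, want_html=False):
    """Option 1: put one PDF URL per line in the message body."""
    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = HTML_ADDRESS if want_html else TEXT_ADDRESS
    msg["Subject"] = "PDF conversion request"  # assumed; not required by the text
    msg.set_content("\n".join(urls) + "\n")
    return msg


def build_attachment_request(sender, pdf_bytes, filename, want_html=False):
    """Option 2: send a local PDF as a MIME attachment."""
    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = HTML_ADDRESS if want_html else TEXT_ADDRESS
    msg["Subject"] = "PDF conversion request"  # assumed; not required by the text
    msg.set_content("PDF attached for conversion.\n")
    msg.add_attachment(
        pdf_bytes, maintype="application", subtype="pdf", filename=filename
    )
    return msg
```

Either message could then be handed to `smtplib.SMTP(...).send_message(msg)` using whatever mail server you have access to.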

Option 3

A service hosted by Trace Research Center also allows you to
convert PDF documents.

You can either mail the URL of the PDF or attach the PDF
document itself to your email message and send it to
pdf2txt@sun.trace.wisc.edu (for plain text) or to
pdf2html@sun.trace.wisc.edu (for HTML). The convertor will
mail back the translation of the PDF file.

Adobe would like to thank Dr. Gregg Vanderheiden and the
Trace Research Center (http://trace.wisc.edu) for helping us host
this service.

For further info ....
http://access.adobe.com/access_email.html
 
LVL 16

Expert Comment

by:maneshr
ID: 2797917
Were you able to find a solution to your question?

If so, please let me know which solution you used.

Thanks
 

Author Comment

by:eng40490
ID: 2813196
I tried all the email solutions. They all failed in the same place: they converted 'Offer' to 'oxxxxer', where xxxx is some funny character.

Seems like that's the state of the art in PDF -> text conversion.

 
LVL 16

Expert Comment

by:maneshr
ID: 2813973
Hmm!!

Even the Adobe service gave you this problem?
Did it do this for all characters or only some?

Let me know.
 

Author Comment

by:eng40490
ID: 2817470
Global Oering
the O,cial List of the SES

I found 5 instances of the first line and 1 instance of the second. All other occurrences of 'F' are converted correctly.

I used pdf2txt@adobe.com, pdf2html@adobe.com, and pdf2txt@sun.trace.wisc.edu.
I used 2 PDF documents.

All suffered from the same problem.
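
One possible explanation, which nobody in this thread confirmed: the letters missing from the two sample lines are "ff" (Offering) and "ffi" (Official), and fonts often draw those sequences as a single ligature glyph. If the converter emits the Unicode ligature characters (U+FB00 for ff, U+FB03 for ffi), they can be folded back to plain letters afterwards. The Python sketch below assumes that is what happened; the function name and the `extra_map` parameter are hypothetical. If the converter emits some unrelated placeholder character instead, only a hand-built replacement table like `extra_map` would help.

```python
import unicodedata


def expand_ligatures(text, extra_map=None):
    """Replace ligature characters with their plain-letter spelling.

    extra_map is an optional, hand-built table for placeholder characters
    that a particular converter is known to emit in place of a ligature.
    """
    for bad, good in (extra_map or {}).items():
        text = text.replace(bad, good)
    # NFKC compatibility normalization maps U+FB00..U+FB04 to ff, fi, fl,
    # ffi, ffl. It also rewrites other compatibility characters, such as
    # superscripts and full-width forms.
    return unicodedata.normalize("NFKC", text)
```

With this, `expand_ligatures("O\ufb00er")` gives "Offer" and `expand_ligatures("the O\ufb03cial List")` gives "the Official List". Because NFKC touches more than ligatures, a narrower table covering only U+FB00 through U+FB04 may be preferable when the rest of the text must stay byte-for-byte intact.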
 
LVL 16

Expert Comment

by:maneshr
ID: 2818123
I would suggest that you report this problem to Adobe and bring it to their notice. Who knows, this might be a known bug caused by a version compatibility problem, or something similar!

Rgds
 

Author Comment

by:eng40490
ID: 2820031
Actually, only 1 PDF doc suffered from the problem. I just sent another for conversion and it's OK, so it looks like the first PDF document has something unusual.
 
LVL 16

Expert Comment

by:maneshr
ID: 2821125
Was the problematic PDF file created using a different version of Acrobat than the one that worked fine?

Does the problematic PDF file have special/international characters in it?

Please let me know.
 

Author Comment

by:eng40490
ID: 2822305
I can't ask the author/creator of the doc. There are no special or international characters in the problematic *words*. I did not read every word of the doc, so I can't say about the entire doc.
 
LVL 16

Expert Comment

by:maneshr
ID: 2822391
Sorry, I can't think of anything else that might cause the problem. :-(

Question has a verified solution.
