  • Status: Solved
  • Priority: Medium
  • Security: Public
  • Views: 807

pdf->ascii convert??

How can I convert an *entire* Adobe PDF file into ASCII?

Any software? Any technique? In C or any other language.

I know there are some Perl modules that parse the informational headers (which are already in ASCII).
Asked by: eng40490
1 Solution
 
maneshr Commented:
You might want to take a look at PDF2TXT. It is a simple tool (a Perl script) that extracts text from PDF files.

Key Features:

.. Extract Japanese text - JIS, SJIS, EUC and UCS2 encoded strings from
  Japanese PDF files by use of CMap files.
.. Extract document information and bookmarks.
.. Support decoding methods:
    ASCIIHexDecode, ASCII85Decode, FlateDecode, LZWDecode
.. Crypto (pdf2txt_X.XX) and non-crypto (pdf2txt_X.XX_no_crypto) version.
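
For anyone wondering what the decoding methods in the feature list involve, here is a minimal Python sketch of three of them using only the standard library. This is not PDF2TXT's Perl code. The function names are made up for illustration, LZWDecode is left out because the Python standard library has no decoder for it, and the sketch assumes the raw stream bytes have already been pulled out of the PDF file.

```python
import base64
import zlib


def ascii_hex_decode(data: bytes) -> bytes:
    """ASCIIHexDecode: pairs of hex digits, whitespace ignored, '>' ends the data."""
    body = data.split(b">", 1)[0]
    digits = b"".join(body.split())  # drop all ASCII whitespace
    if len(digits) % 2:
        digits += b"0"  # an odd final digit is treated as if followed by 0
    return bytes.fromhex(digits.decode("ascii"))


def ascii85_decode(data: bytes) -> bytes:
    """ASCII85Decode: data ends with '~>'; a leading '<~' is tolerated if present."""
    body = data.strip()
    if body.startswith(b"<~"):
        body = body[2:]
    end = body.find(b"~>")
    if end != -1:
        body = body[:end]
    return base64.a85decode(body)


def flate_decode(data: bytes) -> bytes:
    """FlateDecode: zlib/deflate compressed data (predictor parameters not handled)."""
    return zlib.decompress(data)


# Hypothetical lookup table keyed by the PDF filter names.
DECODERS = {
    "ASCIIHexDecode": ascii_hex_decode,
    "ASCII85Decode": ascii85_decode,
    "FlateDecode": flate_decode,
}


def decode_stream(data: bytes, filters) -> bytes:
    """Apply the named filters one after another, in the order given."""
    for name in filters:
        data = DECODERS[name](data)
    return data
```

A stream that was deflated and then ASCII85-encoded would be decoded with decode_stream(raw_bytes, ["ASCII85Decode", "FlateDecode"]). The decoded result is still a PDF content stream with drawing operators in it, so pulling the readable text out of it is a separate step.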

Requirements:           UNIX

[common]
.. Perl (>= 5.005_03)    CPAN
.. zlib                  http://www.cdrom.com/pub/infozip/zlib/
.. Compress::Zlib        CPAN
.. uncompress            UNIX uncompress command
.. Jcode                 http://openlab.ring.gr.jp/Jcode/
.. Base85.pl             ftp://www.isl.intec.co.jp/pub/person/ishida/freeware/pdf2txt/
.. PDFLZW.pl             ftp://www.isl.intec.co.jp/pub/person/ishida/freeware/pdf2txt/
.. PDFEncoding.pl        ftp://www.isl.intec.co.jp/pub/person/ishida/freeware/pdf2txt/
.. makeCMap.pl*          ftp://www.isl.intec.co.jp/pub/person/ishida/freeware/pdf2txt/
.. aj12.tar.Z            ftp://ftp.oreilly.com/pub/examples/nutshell/cjkv/adobe/
.. aj20.tar.Z            ftp://ftp.oreilly.com/pub/examples/nutshell/cjkv/adobe/

[+crypto version]
.. MD5                   CPAN
.. RC4.pl                ftp://www.isl.intec.co.jp/pub/person/ishida/freeware/pdf2txt/

To get pdf2txt:

  ftp://www.isl.intec.co.jp/pub/person/ishida/freeware/pdf2txt/

====================================================
Here is another way to do the same.

PDF Conversion by E-mail

There are three e-mail options you can use to convert PDF
documents to a format that is more accessible to screen reading
software. The e-mail address you use depends on the conversion
format you want, plain (ASCII) text or HTML, and whether the
PDF is on the Internet or local media.

Option 1
If the PDF is on the Internet, you can mail the URL (web address)
of the PDF in the body of an email message to
pdf2txt@adobe.com (for plain text) or to pdf2html@adobe.com
(for HTML). The convertor will mail back the translation of the
PDF file. You can submit multiple URLs in a single e-mail.

Tip: Some URLs are very long and cumbersome to type. Cutting
and pasting the URL into the mail message will save you some
keystrokes.

Option 2
If the PDF is on local media, such as a hard drive, CD-ROM, or
internal server, it can be submitted as a MIME attachment to an
e-mail message. All converted PDF documents will be sent back
to the sender as MIME attachments. For plain text, mail the
attached PDF to pdf2txt@adobe.com. For HTML, mail the
attached PDF to pdf2html@adobe.com.
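
If you want to script Options 1 and 2 instead of typing the mail by hand, the messages could be built roughly like this in Python. This is only a sketch: the function names and the subject line are made up, the sender address is whatever you supply, and the thread does not say whether the service cares about anything other than the body and the attachment. Nothing is sent here; the code only constructs the messages.

```python
from email.message import EmailMessage
from pathlib import Path
from typing import Iterable


def url_request(sender: str, urls: Iterable[str],
                to: str = "pdf2txt@adobe.com") -> EmailMessage:
    """Option 1: put one or more PDF URLs in the body of the message."""
    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = to
    msg["Subject"] = "PDF conversion request"  # assumption: subject is arbitrary
    msg.set_content("\n".join(urls) + "\n")
    return msg


def attachment_request(sender: str, pdf_path: str,
                       to: str = "pdf2txt@adobe.com") -> EmailMessage:
    """Option 2: send a local PDF as a MIME attachment."""
    path = Path(pdf_path)
    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = to
    msg["Subject"] = "PDF conversion request"  # assumption: subject is arbitrary
    msg.set_content("PDF attached for conversion.\n")
    msg.add_attachment(
        path.read_bytes(),
        maintype="application",
        subtype="pdf",
        filename=path.name,
    )
    return msg
```

Pass to="pdf2html@adobe.com" to ask for HTML instead of plain text. The finished message can be handed to smtplib's SMTP.send_message() on whatever mail server you normally use.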

Option 3

A service hosted by Trace Research Center also allows you to
convert PDF documents.

You can either mail the URL of the PDF or attach the PDF
document itself to your email message and send it to
pdf2txt@sun.trace.wisc.edu (for plain text) or to
pdf2html@sun.trace.wisc.edu (for HTML). The convertor will
mail back the translation of the PDF file.

Adobe would like to thank Dr. Gregg Vanderheiden and the
Trace Research Center (http://trace.wisc.edu) for helping us host
this service.

For further info ....
http://access.adobe.com/access_email.html
 
maneshr Commented:
Were you able to find a solution to your question?

If so, please let me know which solution you used.

Thanks
 
eng40490 (Author) Commented:
I tried all the email solutions. They all failed in the same place: they converted 'Offer' to 'oxxxxer', where xxxx is some funny character.

Seems like that's the state of the art in pdf->txt conversion.
 
maneshr Commented:
Hmm!

Even the Adobe service gave you this problem?
Did it do this for all characters or only some?

Let me know.
 
eng40490 (Author) Commented:
Global Oering
the O,cial List of the SES

I found 5 instances of the first line and 1 instance of the second. All other occurrences of 'F' are converted correctly.

I used pdf2txt@adobe.com, pdf2html@adobe.com, and pdf2txt@sun.trace.wisc.edu.
I used 2 PDF documents.

All suffered from the same problem.
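
One possible reading of the two samples above, not confirmed anywhere in this thread: the missing letters are exactly "ff" in "Offering" and "ffi" in "Official", and those letter groups are often typeset as single ligature glyphs. If a converter emits the Unicode ligature characters for such glyphs, NFKC normalization expands them back to plain letters. The sketch below shows that with Python's standard unicodedata module. The function name is made up, and this would not repair output where the glyph has already been dropped or replaced by an unrelated character, as in the samples.

```python
import unicodedata


def expand_ligatures(text: str) -> str:
    """Expand compatibility characters such as the ff/ffi ligatures.

    NFKC also rewrites other compatibility characters (for example
    full-width letters), so it changes more than just ligatures.
    """
    return unicodedata.normalize("NFKC", text)


# U+FB00 is the "ff" ligature, U+FB03 is the "ffi" ligature.
print(expand_ligatures("Global O\ufb00ering"))   # Global Offering
print(expand_ligatures("the O\ufb03cial List"))  # the Official List
```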
 
maneshr Commented:
I would suggest that you report this problem to Adobe and bring it to their notice. Who knows, this might be a known bug caused by a version compatibility problem or something similar.

Regards
 
eng40490 (Author) Commented:
Actually, only 1 PDF document suffered from the problem. I just sent another for conversion and it's OK, so it looks like the first PDF document has something unusual in it.
 
maneshr Commented:
Was the problematic PDF file created using a different version of Acrobat than the one that worked fine?

Does the problematic PDF file have special or international characters in it?

Please let me know.
 
eng40490 (Author) Commented:
I can't ask the author/creator of the document. There are no special or international characters in the problematic *words*. I did not read every word of the document, so I can't speak for the entire document.
 
maneshr Commented:
Sorry, I can't think of anything else that might cause the problem. :-(
Question has a verified solution.
