Solved

pdf2txt and doc2txt

Posted on 2003-10-28
Last Modified: 2007-12-19
Hello

I need to find applications that convert PDF and DOC files to txt.

thanks

Mario
Question by:Mario_castro
26 Comments
 
LVL 5

Expert Comment

by:willy134
ID: 9634517
For PDF, use
pdftotext

I don't know of a DOC converter. I guess you could open the file in OpenOffice and then either save it as a PDF and run pdftotext, or just save it as a txt file.
0
 

Author Comment

by:Mario_castro
ID: 9634566
But my Linux distribution doesn't have pdftotext. Do you have a URL where I can download it?
0
 
LVL 9

Expert Comment

by:majorwoo
ID: 9634949
xpdf usually ships with a pdftotext utility; you can get it from:

http://www.foolabs.com/xpdf/download.html

or, for an RPM-based system,

http://www.rpmfind.net/linux/rpm2html/search.php?query=xpdf&submit=Search+...
0
 
LVL 20

Expert Comment

by:Gns
ID: 9635209
On my "intranet index machine" running htdig, I've successfully used doc2html.pl as "external parser", which is more of a wrapper than anything else:-). It uses several parsers that convert from a range of formats to html ... which in turn can rather easily be turned into text. You'll find it at http://www.htdig.org/files/contrib/parsers/ ... amongst some others.
rtf2html and catdoc (parsers used by the above) from http://www.45.free.net/~vitus/ice/catdoc/ would fit your bill...:-) Especially catdoc, since it reads M$ word files and outputs text;-)
You might also be interested in xlHtml and ppt2html (http://www.xlhtml.org).

Regarding pdf ... you have at least two options. Either install the xpdf package (almost all distros have it), which contains the pdftotext program... or do a pdf2ps followed by a ps2ascii ... these are part of ghostscript, which is very commonly installed to handle your PostScript(-ish) printing needs:-). (I'm too slow typing... majorwoo already covered xpdf).

See, not even any need to use a html2txt util:-).

If all else fails, the strings command could be used:-).

-- Glenn
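
A minimal sketch of the pdf2ps followed by ps2ascii route mentioned above, driven from Python. It assumes both Ghostscript wrapper scripts are on the PATH and accept `input output` arguments; the function names and the intermediate file name are made up for illustration.

```python
import subprocess
import tempfile
from pathlib import Path


def ghostscript_commands(pdf_path, ps_path, txt_path):
    """Build the two commands: PDF -> PostScript, then PostScript -> text."""
    return [
        ["pdf2ps", str(pdf_path), str(ps_path)],
        ["ps2ascii", str(ps_path), str(txt_path)],
    ]


def pdf_to_text_via_ps(pdf_path, txt_path):
    """Run both steps, keeping the intermediate .ps file in a temp directory."""
    with tempfile.TemporaryDirectory() as tmp:
        ps_path = Path(tmp) / "intermediate.ps"  # hypothetical file name
        for cmd in ghostscript_commands(pdf_path, ps_path, txt_path):
            # check=True raises CalledProcessError if a step fails
            subprocess.run(cmd, check=True)
```

Calling `pdf_to_text_via_ps("report.pdf", "report.txt")` would only work where Ghostscript is installed, so it does not help on a host where nothing can be installed.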
0
 
LVL 9

Expert Comment

by:majorwoo
ID: 9635365
or sed!
0
 
LVL 20

Expert Comment

by:Gns
ID: 9635390
:-)
It shouldn't have to come to that... catdoc and pdftotext should do nicely;-).

-- Glenn
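
As a rough sketch of how an indexer could pick between catdoc and pdftotext by file extension: the helper names below are hypothetical, and the code assumes `pdftotext file.pdf -` writes the text to standard output and `catdoc file.doc` does the same. phpdig itself is written in PHP, so this Python version only illustrates the idea.

```python
import shutil
import subprocess
from pathlib import Path

# Command template per extension; "{src}" is replaced by the input path.
CONVERTERS = {
    ".pdf": ["pdftotext", "{src}", "-"],
    ".doc": ["catdoc", "{src}"],
}


def pick_command(path, which=shutil.which):
    """Return the command list for this file, or None if no tool is usable."""
    template = CONVERTERS.get(Path(path).suffix.lower())
    if template is None or which(template[0]) is None:
        return None
    return [arg.format(src=path) for arg in template]


def extract_text(path):
    """Run the chosen converter and return its output as text."""
    cmd = pick_command(path)
    if cmd is None:
        raise RuntimeError(f"no converter available for {path}")
    result = subprocess.run(cmd, capture_output=True, check=True)
    return result.stdout.decode("utf-8", errors="replace")
```

The `which` parameter exists only so the lookup can be replaced in tests; by default it checks the real PATH.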
0
 

Author Comment

by:Mario_castro
ID: 9635944
Oops, I have a problem: this software is for use with phpdig, a search application that needs two modules to search PDF and DOC files. It is developed in PHP, and I need to put this search engine on my site, which is hosted on a non-dedicated server, so I can't install any applications there.

any solutions?

Thanks

mario
0
 
LVL 20

Expert Comment

by:Gns
ID: 9636163
Ok, and they don't have catdoc and/or pdftotext or... already?
Then you're smoked more or less:-(.

-- Glenn
0
 
LVL 20

Expert Comment

by:Gns
ID: 9636171
Well, apart from using sed or strings as the parser.

-- Glenn
0
 
LVL 9

Expert Comment

by:majorwoo
ID: 9636933
and good luck with that ;-(
0
 
LVL 20

Expert Comment

by:Gns
ID: 9640446
Strings will work somewhat for "normal" M$ doc files (You'll likely index a hefty amount of garbage too)... Pdf is a tougher cookie, since you'll need to "decode" the postscriptish language (so you don't index that), and perhaps also "unscramble" some binary parts... not easily done with sed (or awk ... or perl).
The easy thing is to try and convince the server owner to install some of the utils above.

-- Glenn
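
To see why a strings-style pass over a DOC file picks up garbage along with the text, here is a small Python approximation of what `strings` does (runs of at least four printable ASCII characters). The sample bytes are invented for illustration and are not a real Word file.

```python
import re


def printable_runs(data: bytes, min_len: int = 4):
    """Return runs of printable ASCII (plus tab) at least min_len long."""
    pattern = re.compile(rb"[\x20-\x7e\t]{%d,}" % min_len)
    return [m.group().decode("ascii") for m in pattern.finditer(data)]


# Invented sample: binary header bytes, body text, then template and font names.
sample = (
    b"\xd0\xcf\x11\xe0\x00\x01Hello world\x00\x02\x03"
    b"Normal.dot\x00ab\x00Times New Roman\x00"
)

print(printable_runs(sample))
# ['Hello world', 'Normal.dot', 'Times New Roman']
```

Only the first run is document text; the template and font names would end up in the index as well.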
0
 
LVL 4

Expert Comment

by:Robson
ID: 9641293
I recommend antiword for reading DOC files.
0
 
LVL 5

Expert Comment

by:willy134
ID: 9642828
Do you really want to install all of this on a web server, or are you just trying to convert the files to text?
0
 
LVL 20

Expert Comment

by:Gns
ID: 9642871
Why should we doubt Mario's word, willy134? I'm guessing he's setting this up on a hosting service of some kind that simply doesn't provide the tools he needs. Might be wrong though (wouldn't be the first time that that happens either:-).

-- Glenn
0
 
LVL 5

Expert Comment

by:willy134
ID: 9643728
I just was looking at the original question and it didn't mention web server stuff.
0
 
LVL 20

Expert Comment

by:Gns
ID: 9643972
True. His other comment does though.

-- Glenn
0
 
LVL 20

Expert Comment

by:Gns
ID: 10183147
Several helpful answers covering several situations (possibly excluding the exact situation Mario has though)... Otoh, it feels silly splitting 50 points.
PAQ - no refund?

-- Glenn
0
 
LVL 9

Expert Comment

by:majorwoo
ID: 10183952
fine by me...
0
 
LVL 44

Expert Comment

by:Karl Heinz Kremer
ID: 10242157
There is no way you can extract text from a PDF file without a PDF parser. There are two problems with using tools like sed or strings. First, depending on the PDF creator, even normal text will be stored as binary data. When Adobe tools are used to create PDF documents, the more recent the tools are, the more likely it is that the content is binary (actually compressed). The second problem is the incremental updates PDF allows. You may find text in a file (assuming that the text is actually uncompressed) that is no longer supposed to be in the document: somebody may have removed one or more pages from the document and saved the updated document. A PDF creator will (if not specifically asked to do otherwise) only store the new and updated data, and will leave the original content intact. You can actually recover the old version of the document by stripping off the new part at the end of the file. So using a non-PDF-aware approach for indexing purposes will lead to a corrupt index.
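
A toy illustration of the compression point, under the assumption that the content stream uses the common Flate (zlib) filter. The stream layout here is heavily simplified (no object numbers, no /Length entry) and the helper names are hypothetical; it is not a real PDF parser.

```python
import zlib

# Toy page content: PDF text operators wrapped around one sentence.
page_content = b"BT /F1 12 Tf 72 720 Td (Searchable sentence) Tj ET"


def make_stream(data: bytes, compress: bool) -> bytes:
    """Wrap data in a simplified stream object, optionally Flate-compressed."""
    body = zlib.compress(data) if compress else data
    header = b"<< /Filter /FlateDecode >>" if compress else b"<< >>"
    return header + b"\nstream\n" + body + b"\nendstream\n"


def read_stream(obj: bytes) -> bytes:
    """Undo make_stream: decompress the body when the header says so."""
    header, rest = obj.split(b"\nstream\n", 1)
    body = rest.rsplit(b"\nendstream", 1)[0]
    return zlib.decompress(body) if b"/FlateDecode" in header else body


plain = make_stream(page_content, compress=False)
packed = make_stream(page_content, compress=True)

print(b"Searchable sentence" in plain)   # visible to a byte-level search
print(b"Searchable sentence" in packed)  # hidden once the stream is compressed
print(read_stream(packed) == page_content)
```

A byte-level tool such as strings or sed sees only the compressed bytes; the sentence reappears only after a step that understands the stream filter.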

The DOC format has similar mechanisms to retain old information, so it's also necessary to use a tool that is aware of these "hidden" data structures.
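
The incremental-update behaviour described for PDF can be sketched with a toy byte string. The assumption here is that each saved revision ends with its own `%%EOF` marker; the sample data is invented, and the heuristic is only a rough one (linearized PDFs, for example, can contain more than one `%%EOF` without having been updated).

```python
EOF_MARKER = b"%%EOF"


def revisions(data: bytes):
    """Return the file prefixes that end at each %%EOF marker, oldest first."""
    prefixes = []
    pos = 0
    while True:
        idx = data.find(EOF_MARKER, pos)
        if idx == -1:
            break
        end = idx + len(EOF_MARKER)
        prefixes.append(data[:end])
        pos = end
    return prefixes


# Invented example: the first save has two pages, the update drops page 2.
original = b"%PDF-1.4\n(page 1 text)\n(page 2 secret)\n%%EOF"
update = b"\n(page 2 removed)\n%%EOF"
updated_file = original + update

revs = revisions(updated_file)
print(len(revs))                    # number of saved revisions found
print(revs[0] == original)          # stripping the appended part recovers the old file
print(b"secret" in updated_file)    # the removed text is still in the bytes
```

An indexer that scans raw bytes would therefore still index the text of the removed page.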
0
 
LVL 44

Expert Comment

by:Karl Heinz Kremer
ID: 10242163
No comment has been added lately, so it's time to clean up this TA.
I will leave a recommendation in the Cleanup topic area that this question is:
PAQ'd and pts forfeited
Please leave any comments here within the next seven days.

PLEASE DO NOT ACCEPT THIS COMMENT AS AN ANSWER!

khkremer
EE Cleanup Volunteer
0
 
LVL 20

Expert Comment

by:Gns
ID: 10251333
... As said (in perhaps not such an eloquent way) above. Thanks for your insight though khkremer (Yeah, I sneaked a look at your track record "mr PDF":-). (Seconding your recommendation BTW:).

-- Glenn
0
 
LVL 44

Expert Comment

by:Karl Heinz Kremer
ID: 10251445
... and I'm still waiting for the really hard PDF questions :-)
0
 
LVL 44

Expert Comment

by:Karl Heinz Kremer
ID: 10251658
I forgot one reason why this will not work: The PDF standard does not require you to output text as "text strings": You could for example first print all "a" characters on a page, then all "b", and so on... This means that a text extraction tool has to use the relative position of every "text piece" on a page to sort the data back into something that more or less resembles your original text. So, don't do it with strings or sed. I hope we can put this to rest now :-)
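
A minimal sketch of that sorting step, assuming a parser has already produced each text piece with its page coordinates. The `Fragment` type and the sample data are hypothetical, and pieces are grouped into lines only when they share exactly the same y value, which a real extractor could not rely on.

```python
from typing import NamedTuple


class Fragment(NamedTuple):
    x: float
    y: float
    text: str


def reading_order(fragments):
    """Group fragments by baseline (y), then order each line left to right."""
    lines = {}
    for frag in fragments:
        lines.setdefault(frag.y, []).append(frag)
    out = []
    # PDF coordinates start at the bottom-left, so larger y is higher on the page.
    for y in sorted(lines, reverse=True):
        out.append("".join(f.text for f in sorted(lines[y], key=lambda f: f.x)))
    return "\n".join(out)


# "banana" on one line and "nab" below it, drawn as all a's, then b's, then n's.
drawn = [
    Fragment(10, 700, "a"), Fragment(30, 700, "a"), Fragment(50, 700, "a"),
    Fragment(10, 680, "a"),
    Fragment(0, 700, "b"), Fragment(20, 680, "b"),
    Fragment(20, 700, "n"), Fragment(40, 700, "n"), Fragment(0, 680, "n"),
]

print("".join(f.text for f in drawn))  # aaaabbnnn (order in the file)
print(reading_order(drawn))            # banana, then nab on the next line
```

The file order is what a byte-level tool would see; only the position-aware pass gives back the words.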
0
 
LVL 20

Expert Comment

by:Gns
ID: 10251689
Yup. It's a strange beast all right:-). So any perl/php/sed/awk/whatever would need to be a full-fledged (well, not really... You'd just need ... "semirendering", and be damned with those cases that it cannot grok:-) pdf parser. Yucky at best:-).

-- Glenn
0
 
LVL 1

Accepted Solution

by:
Computer101 earned 0 total points
ID: 10300575
PAQed - no points refunded (of 50)

Computer101
E-E Admin
0
