Solved

pdf2txt and doc2txt

Posted on 2003-10-28
Last Modified: 2007-12-19
Hello

I need to find applications that convert PDF and DOC files to plain text.

thanks

Mario
Question by:Mario_castro
26 Comments
 
LVL 5

Expert Comment

by:willy134
ID: 9634517
for pdf do
pdftotext

I don't know about a doc converter.  I guess you could open it in openoffice and then either save it as a pdf and run pdftotext or just save it as a txt file.
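
As a rough sketch of the pdftotext route, here is one way to wrap the call in Python. The helper names `pdftotext_command` and `convert_pdf` are made up for illustration; the only things assumed about the tool itself are the usual `pdftotext [options] input.pdf output.txt` form and its `-layout` option.

```python
import shutil
import subprocess
from pathlib import Path


def pdftotext_command(pdf_path, txt_path=None, keep_layout=False):
    """Build the argument list for a pdftotext call without running it."""
    pdf = Path(pdf_path)
    # Default to the same name with a .txt suffix (hypothetical convention
    # chosen for this sketch).
    txt = Path(txt_path) if txt_path else pdf.with_suffix(".txt")
    cmd = ["pdftotext"]
    if keep_layout:
        cmd.append("-layout")  # try to preserve the physical page layout
    cmd += [str(pdf), str(txt)]
    return cmd


def convert_pdf(pdf_path, txt_path=None, keep_layout=False):
    """Run pdftotext if it is installed; return the output path, else None."""
    if shutil.which("pdftotext") is None:
        return None  # tool is not on PATH on this machine
    cmd = pdftotext_command(pdf_path, txt_path, keep_layout)
    subprocess.run(cmd, check=True)
    return cmd[-1]
```

Calling `convert_pdf("report.pdf")` would write `report.txt` next to the input when pdftotext is available, and return `None` when it is not.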
0
 

Author Comment

by:Mario_castro
ID: 9634566
But my Linux distribution doesn't have pdftotext. Do you have a URL where I can download it?
0
 
LVL 9

Expert Comment

by:majorwoo
ID: 9634949
xpdf usually ships with a pdftotext utility; you can get it from:

http://www.foolabs.com/xpdf/download.html

or, for an RPM-based system:

http://www.rpmfind.net/linux/rpm2html/search.php?query=xpdf&submit=Search+...
0
LVL 20

Expert Comment

by:Gns
ID: 9635209
On my "intranet index machine" running htdig, I've successfully used doc2html.pl as "external parser", which is more of a wrapper than anything else:-). It uses several parsers that convert from a range of formats to html ... which in turn can rather easily be turned into text. You'll find it at http://www.htdig.org/files/contrib/parsers/ ... amongst some others.
rtf2html and catdoc (parsers used by the above) from http://www.45.free.net/~vitus/ice/catdoc/ would fit your bill...:-) Especially catdoc, since it reads M$ word files, and output text;-)
You might also be interested in xlHtml and ppt2html (http://www.xlhtml.org).

Regarding pdf ... you have at least two options. Either install the xpdf package (almost all distros have it) that contains the pdftotext program... or do a pdf2ps followed by a ps2ascii ... these are part of ghostscript, which is very commonly installed to handle your PostScript(-ish) printing needs:-). (I'm too slow typing... majorwoo already covered xpdf).

See, not even any need to use a html2txt util:-).

If all else fails, the strings command could be used:-).
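
The options above (pdftotext, or pdf2ps followed by ps2ascii, and catdoc for Word files) could be tried in order of preference. A minimal sketch, assuming the tools take their arguments in the common `tool input output` form; the `CANDIDATES` table and `plan_conversion` are hypothetical names for this example, not part of any of those packages.

```python
import shutil

# Candidate pipelines per file suffix, in order of preference.
# "{src}" and "{dst}" are placeholders filled in by plan_conversion().
CANDIDATES = {
    ".pdf": [
        [["pdftotext", "{src}", "{dst}"]],
        [["pdf2ps", "{src}", "{dst}.ps"],
         ["ps2ascii", "{dst}.ps", "{dst}"]],
    ],
    ".doc": [
        [["catdoc", "{src}"]],  # catdoc prints the text on stdout
    ],
}


def plan_conversion(src, dst, suffix, is_installed=shutil.which):
    """Return the first pipeline whose tools are all installed, or None."""
    for pipeline in CANDIDATES.get(suffix, []):
        if all(is_installed(step[0]) for step in pipeline):
            return [[arg.format(src=src, dst=dst) for arg in step]
                    for step in pipeline]
    return None
```

The function only plans the commands; running each step (for example with `subprocess.run`) and capturing catdoc's stdout is left to the caller. On a box with ghostscript but no xpdf, the PDF plan falls through to the two-step pdf2ps/ps2ascii route.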

-- Glenn
0
 
LVL 9

Expert Comment

by:majorwoo
ID: 9635365
or sed!
0
 
LVL 20

Expert Comment

by:Gns
ID: 9635390
:-)
It shouldn't have to come to that... catdoc and pdftotext should do nicely;-).

-- Glenn
0
 

Author Comment

by:Mario_castro
ID: 9635944
Oops, I have a further problem: this software is to be installed with phpDig, a search application that needs two modules to search inside PDF and DOC files. It is developed in PHP, and I need to put this search engine on my site, which is hosted on a non-dedicated (shared) server, so I can't install any applications there.

any solutions?

Thanks

mario
0
 
LVL 20

Expert Comment

by:Gns
ID: 9636163
Ok, and they don't have catdoc and/or pdftotext or... already?
Then you're smoked more or less:-(.

-- Glenn
0
 
LVL 20

Expert Comment

by:Gns
ID: 9636171
Well, apart from using sed or strings as the parser.

-- Glenn
0
 
LVL 9

Expert Comment

by:majorwoo
ID: 9636933
and good luck with that ;-(
0
 
LVL 20

Expert Comment

by:Gns
ID: 9640446
Strings will work somewhat for "normal" M$ doc files (you'll likely index a hefty amount of garbage too)... Pdf is a tougher cookie, since you'll need to "decode" the postscriptish language (so you don't index that), and perhaps also "unscramble" some binary parts... not easily done with sed (or awk ... or perl).
The easy thing is to try and convince the server owner to install some of the utils above.
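
To get a feel for what a strings-style parser actually hands to the indexer, here is a small Python imitation of the idea: pull out every run of printable ASCII of at least four bytes. It is a sketch of the approach only, not a reimplementation of the real strings command, and the sample bytes are invented.

```python
import re


def ascii_strings(data, min_len=4):
    """Return runs of printable ASCII that are at least min_len bytes long."""
    pattern = re.compile(rb"[\x20-\x7e]{%d,}" % min_len)
    return [m.group().decode("ascii") for m in pattern.finditer(data)]


# Invented sample: a few binary header bytes, body text, then a font name
# of the kind a word processor stores alongside the text.
sample = b"\xd0\xcf\x11\xe0\x00\x01Hello world\x00\x02\x03Times New Roman\x00ab\x00"

print(ascii_strings(sample))  # ['Hello world', 'Times New Roman']
```

The body text comes out, but so does the font name, which is exactly the sort of garbage that ends up in the index. Text stored in a 16-bit encoding would not show up at all with this pattern.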

-- Glenn
0
 
LVL 4

Expert Comment

by:Robson
ID: 9641293
I recommend antiword for reading DOC files.
0
 
LVL 5

Expert Comment

by:willy134
ID: 9642828
do you really want to install this all on a web server or are you just trying to convert them to text?
0
 
LVL 20

Expert Comment

by:Gns
ID: 9642871
Why should we doubt Mario's word, willy134? I'm guessing he's setting this up on a hosting service of some kind that simply doesn't provide the tools he needs. Might be wrong though (wouldn't be the first time that happens either:-).

-- Glenn
0
 
LVL 5

Expert Comment

by:willy134
ID: 9643728
I just was looking at the original question and it didn't mention web server stuff.
0
 
LVL 20

Expert Comment

by:Gns
ID: 9643972
True. His other comment does though.

-- Glenn
0
 
LVL 20

Expert Comment

by:Gns
ID: 10183147
Several helpful answers covering several situations (possibly excluding the exact situation Mario has though)... Otoh, it feels silly splitting 50 points.
PAQ - no refund?

-- Glenn
0
 
LVL 9

Expert Comment

by:majorwoo
ID: 10183952
fine by me...
0
 
LVL 44

Expert Comment

by:Karl Heinz Kremer
ID: 10242157
There is no way you can extract text from a PDF file without a PDF parser. There are two problems with using tools like sed or strings. First, depending on the PDF creator, even normal text will be stored as binary data. When using Adobe tools to create PDF documents, the more recent these tools are, the more likely it is that the content is binary (actually compressed). The second problem is the incremental updates PDF allows. You may find text in a file (assuming that the text is actually uncompressed) that is no longer supposed to be in the document: somebody may have removed one or more pages from the document and saved the updated document. A PDF creator will (if not specifically asked to do otherwise) only store the new and updated data, and will leave the original content intact. You can actually recover the old version of the document by stripping off the new part at the end of the file. So using a non-PDF-aware approach for indexing purposes will lead to a corrupt index.

The DOC format has similar mechanisms to retain old information, so it's also necessary to use a tool that is aware of these "hidden" data structures.
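
Both PDF problems can be illustrated with a few lines of Python. This is only a sketch on invented data: the "content stream" is a hand-written snippet compressed with zlib (the same deflate scheme PDF's FlateDecode filter uses), and counting `%%EOF` markers is a crude heuristic for spotting appended updates, not a real parser. Some files have more than one marker for other reasons.

```python
import zlib

# Invented page content: the same text-showing operation repeated.
raw_stream = b"BT /F1 12 Tf (Quarterly report) Tj ET\n" * 20
compressed = zlib.compress(raw_stream)

# A byte-level search finds the words only after decompression.
found_in_compressed = b"Quarterly" in compressed
found_after_inflate = b"Quarterly" in zlib.decompress(compressed)


def count_eof_markers(data):
    """Count %%EOF markers; more than one hints at appended updates."""
    return data.count(b"%%EOF")


# Invented skeleton of a file that was saved, then updated incrementally.
sample_pdf = b"%PDF-1.4\n(original body)\n%%EOF\n(appended update)\n%%EOF\n"
```

Here `found_in_compressed` is false and `found_after_inflate` is true, which is why strings or sed see nothing useful in such a stream. `count_eof_markers(sample_pdf)` returns 2, and everything before the first marker is still the old version of the document.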
0
 
LVL 44

Expert Comment

by:Karl Heinz Kremer
ID: 10242163
No comment has been added lately, so it's time to clean up this TA.
I will leave a recommendation in the Cleanup topic area that this question is:
PAQ'd and pts forfeited
Please leave any comments here within the next seven days.

PLEASE DO NOT ACCEPT THIS COMMENT AS AN ANSWER!

khkremer
EE Cleanup Volunteer
0
 
LVL 20

Expert Comment

by:Gns
ID: 10251333
... As said (in perhaps not such an eloquent way) above. Thanks for your insight though khkremer (Yeah, I sneaked a look at your track record "mr PDF":-). (Seconding your recommendation BTW:).

-- Glenn
0
 
LVL 44

Expert Comment

by:Karl Heinz Kremer
ID: 10251445
... and I'm still waiting for the really hard PDF questions :-)
0
 
LVL 44

Expert Comment

by:Karl Heinz Kremer
ID: 10251658
I forgot one reason why this will not work: The PDF standard does not require you to output text as "text strings": You could for example first print all "a" characters on a page, then all "b", and so on... This means that a text extraction tool has to use the relative position of every "text piece" on a page to sort the data back into something that more or less resembles your original text. So, don't do it with strings or sed. I hope we can put this to rest now :-)
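
A toy version of that position-based sorting, in Python. The `Fragment` type, the coordinates, and the `line_tolerance` value are all invented for this sketch; a real extractor would get positions from the parsed content stream and would also have to deal with fonts, spacing, and rotation. It assumes the default PDF coordinate system, where larger y means higher on the page.

```python
from typing import NamedTuple


class Fragment(NamedTuple):
    x: float
    y: float
    text: str


def reading_order(fragments, line_tolerance=2.0):
    """Group fragments into lines by y, then sort each line left to right."""
    lines = []
    # Top of the page first: sort by descending y.
    for frag in sorted(fragments, key=lambda f: -f.y):
        if lines and abs(lines[-1][0].y - frag.y) <= line_tolerance:
            lines[-1].append(frag)  # same baseline, same line
        else:
            lines.append([frag])
    return ["".join(f.text for f in sorted(line, key=lambda f: f.x))
            for line in lines]


# The page reads "abba" / "cab", but was drawn as all a's, then b's, then c.
drawn = [
    Fragment(0, 700, "a"), Fragment(30, 700, "a"), Fragment(10, 680, "a"),
    Fragment(10, 700, "b"), Fragment(20, 700, "b"), Fragment(20, 680, "b"),
    Fragment(0, 680, "c"),
]

print(reading_order(drawn))  # ['abba', 'cab']
```

Reading the fragments in file order would give "aaabbbc", which is what a strings-style tool would index.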
0
 
LVL 20

Expert Comment

by:Gns
ID: 10251689
Yup. It's a strange beast all right:-). So any perl/php/sed/awk/whatever would need to be a full-fledged (well, not really... You'd just need ... "semirendering", and be damned with those cases that it cannot grok:-) pdf parser. Yucky at best:-).

-- Glenn
0
 
LVL 1

Accepted Solution

by:
Computer101 earned 0 total points
ID: 10300575
PAQed - no points refunded (of 50)

Computer101
E-E Admin
0
