pdf2txt and doc2txt

Hello

I need to find applications that convert PDF and DOC files to text.

thanks

Mario
0
Mario_castro
Asked:
1 Solution
 
willy134Commented:
For PDF, use pdftotext.

I don't know of a DOC converter. I guess you could open the file in OpenOffice and then either save it as a PDF and run pdftotext, or just save it as a text file.
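
As a rough illustration of the pdftotext route, the call can be wrapped in a few lines of Python. This is only a sketch: it assumes `pdftotext` is on the PATH, and the helper names `build_pdftotext_cmd` and `pdf_to_text` are made up for this example.

```python
import shutil
import subprocess


def build_pdftotext_cmd(pdf_path, txt_path="-"):
    """Build the argument list for pdftotext; "-" sends the text to stdout."""
    return ["pdftotext", pdf_path, txt_path]


def pdf_to_text(pdf_path):
    """Return the text of a PDF, or None if pdftotext is not installed."""
    if shutil.which("pdftotext") is None:
        return None  # tool missing, the caller has to fall back to something else
    result = subprocess.run(
        build_pdftotext_cmd(pdf_path),
        capture_output=True,
        check=True,  # raise CalledProcessError if pdftotext reports a failure
    )
    return result.stdout.decode("utf-8", errors="replace")
```

Returning `None` when the tool is missing keeps the "not installed" case separate from a conversion that ran and failed.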
0
 
Mario_castroAuthor Commented:
But my Linux distribution doesn't have pdftotext. Do you have a URL where I can download it?
0
 
majorwooCommented:
xpdf usually ships with a pdftotext utility, you can get it from:

http://www.foolabs.com/xpdf/download.html

or, for an RPM-based system:

http://www.rpmfind.net/linux/rpm2html/search.php?query=xpdf&submit=Search+...
0
 
GnsCommented:
On my "intranet index machine" running htdig, I've successfully used doc2html.pl as an "external parser", which is more of a wrapper than anything else:-). It uses several parsers that convert from a range of formats to HTML ... which in turn can rather easily be turned into text. You'll find it at http://www.htdig.org/files/contrib/parsers/ ... amongst some others.
rtf2html and catdoc (parsers used by the above) from http://www.45.free.net/~vitus/ice/catdoc/ would fit your bill...:-) Especially catdoc, since it reads M$ Word files and outputs text;-)
You might also be interrested in xlHtml and ppt2html (http://www.xlhtml.org).

Regarding PDF ... you have at least two options. Either install the xpdf package (almost all distros have it), which contains the pdftotext program... or do a pdf2ps followed by a ps2ascii ... these are part of ghostscript, which is very commonly installed to handle your PostScript(-ish) printing needs:-). (I'm too slow at typing... majorwoo already covered xpdf).

See, not even any need to use a html2txt util:-).

If all else fails, the strings command could be used:-).

-- Glenn
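
To make the "use whatever is installed" idea concrete, here is a small sketch that picks the first converter found on the PATH. The table and the name `pick_command` are assumptions for illustration only. The two-step pdf2ps/ps2ascii route is left out to keep it short, and `strings` is listed purely as the last resort mentioned above.

```python
import shutil

# Candidate commands per file type, in order of preference.
# "{src}" is replaced by the input file; each command writes text to stdout.
CANDIDATES = {
    ".pdf": [["pdftotext", "{src}", "-"]],
    ".doc": [["catdoc", "{src}"], ["strings", "{src}"]],
}


def pick_command(extension, src, which=shutil.which):
    """Return the first usable command for this extension, or None."""
    for template in CANDIDATES.get(extension.lower(), []):
        if which(template[0]) is not None:  # is the program installed?
            return [src if part == "{src}" else part for part in template]
    return None


print(pick_command(".doc", "report.doc"))
```

The `which` parameter exists only so the lookup can be replaced in a test; by default it is `shutil.which`.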
0
 
majorwooCommented:
or sed!
0
 
GnsCommented:
:-)
It shouldn't have to come to that... catdoc and pdftotext should do nicely;-).

-- Glenn
0
 
Mario_castroAuthor Commented:
Oops, I have a serious problem: this software is to be installed for phpdig, a search application that needs two modules to search PDF and DOC files. It is developed in PHP, and I need to put this search engine on my site, which is hosted on a non-dedicated server, so I can't install any applications there.

any solutions?

Thanks

mario
0
 
GnsCommented:
Ok, and they don't have catdoc and/or pdftotext or... already?
Then you're smoked more or less:-(.

-- Glenn
0
 
GnsCommented:
Well, apart from using sed or strings as the parser.

-- Glenn
0
 
majorwooCommented:
and good luck with that ;-(
0
 
GnsCommented:
Strings will work somewhat for "normal" M$ doc files (you'll likely index a hefty amount of garbage too)... PDF is a tougher cookie, since you'll need to "decode" the postscriptish language (so you don't index that), and perhaps also "unscramble" some binary parts... not easily done with sed (or awk ... or perl).
The easy thing is to try to convince the server owner to install some of the utils above.

-- Glenn
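
For anyone curious what the strings approach amounts to, here is a minimal Python imitation. It is a sketch, not a replacement for the real tool: it keeps runs of at least four printable ASCII bytes, and the sample bytes are invented for the example.

```python
import re

# Runs of at least four printable ASCII characters, roughly what the
# strings command reports by default.
PRINTABLE_RUN = re.compile(rb"[\x20-\x7e]{4,}")


def ascii_strings(data):
    """Return the printable ASCII runs found in a bytes object."""
    return [m.group().decode("ascii") for m in PRINTABLE_RUN.finditer(data)]


# Invented sample: real text mixed with binary bytes and a short run.
sample = b"\x00\x01Hello world\x00ab\x00\xd0\xcf\x11\xe0junk\xff"
print(ascii_strings(sample))
```

Anything that happens to look like text is kept, which is where the garbage in the index comes from.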
0
 
RobsonCommented:
I recommend antiword for reading DOC files.
0
 
willy134Commented:
do you really want to install this all on a web server or are you just trying to convert them to text?
0
 
GnsCommented:
Why should we doubt Mario's word, willy134? I'm guessing he's setting this up on a hosting service of some kind that simply doesn't provide the tools he needs. Might be wrong though (wouldn't be the first time that happens either:-).

-- Glenn
0
 
willy134Commented:
I just was looking at the original question and it didn't mention web server stuff.
0
 
GnsCommented:
True. His other comment does though.

-- Glenn
0
 
GnsCommented:
Several helpful answers covering several situations (possibly excluding the exact situation Mario has though)... Otoh, it feels silly splitting 50 points.
PAQ - no refund?

-- Glenn
0
 
majorwooCommented:
fine by me...
0
 
Karl Heinz KremerCommented:
There is no way you can extract text from a PDF file without a PDF parser. There are two problems with using tools like sed or strings. First, depending on the PDF creator, even normal text will be stored as binary data. When using Adobe tools to create PDF documents, the more recent these tools are, the more likely it is that the content is binary (actually compressed). The second problem is the incremental updates PDF allows. You may find text in a file (assuming that the text is actually uncompressed) that is no longer supposed to be in the document: somebody may have removed one or more pages from the document and saved the updated document. A PDF creator will (if not specifically asked to do otherwise) only store the new and updated data, and will leave the original content intact. You can actually recover the old version of the document by stripping off the new part at the end of the file. So using a non-PDF-aware approach for indexing purposes will lead to a corrupt index.

The DOC format has similar mechanisms to retain old information, so it's also necessary to use a tool that is aware of these "hidden" data structures.
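
The compression point can be seen with nothing but Python's zlib module. This is a sketch under assumptions: the content stream below is invented, and real PDF files wrap such streams in objects and filters that are not modelled here.

```python
import zlib

# An invented content stream in PDF's text-drawing syntax.
content = b"BT /F1 12 Tf (Quarterly report for indexing) Tj ET"

# PDF writers commonly store such streams deflate-compressed.
stored = zlib.compress(content)


def scan_finds(word, blob):
    """True if a plain byte-level scan (what strings or sed does) sees the word."""
    return word in blob


print(scan_finds(b"report", stored))                   # scan of the stored bytes
print(scan_finds(b"report", zlib.decompress(stored)))  # scan after decoding
```

A byte-level scan only works on the second form, and getting there already means parsing the file structure to locate and decode each stream.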
0
 
Karl Heinz KremerCommented:
No comment has been added lately, so it's time to clean up this TA.
I will leave a recommendation in the Cleanup topic area that this question is:
PAQ'd and pts forfeited
Please leave any comments here within the next seven days.

PLEASE DO NOT ACCEPT THIS COMMENT AS AN ANSWER!

khkremer
EE Cleanup Volunteer
0
 
GnsCommented:
... As said (in perhaps not such an eloquent way) above. Thanks for your insight though khkremer (Yeah, I sneaked a look at your track record "mr PDF":-). (Seconding your recommendation BTW:).

-- Glenn
0
 
Karl Heinz KremerCommented:
... and I'm still waiting for the really hard PDF questions :-)
0
 
Karl Heinz KremerCommented:
I forgot one reason why this will not work: The PDF standard does not require you to output text as "text strings": You could for example first print all "a" characters on a page, then all "b", and so on... This means that a text extraction tool has to use the relative position of every "text piece" on a page to sort the data back into something that more or less resembles your original text. So, don't do it with strings or sed. I hope we can put this to rest now :-)
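
A toy version of that sorting step, assuming the text pieces have already been pulled out of the page as hypothetical `(x, y, text)` tuples. The function name and the tolerance value are made up for this sketch, and real extractors deal with far more (fonts, rotation, columns).

```python
def reading_order(fragments, line_tolerance=2.0):
    """Join (x, y, text) fragments into lines, top to bottom, left to right.

    PDF's y axis grows upwards, so a larger y is higher on the page.
    """
    lines = []  # each entry: [y of the line, list of (x, text)]
    for x, y, text in sorted(fragments, key=lambda f: -f[1]):
        if lines and abs(lines[-1][0] - y) <= line_tolerance:
            lines[-1][1].append((x, text))  # same baseline, same line
        else:
            lines.append([y, [(x, text)]])  # start a new line
    return "\n".join(
        " ".join(text for _, text in sorted(parts)) for _, parts in lines
    )


# Fragments in the arbitrary order a PDF writer might emit them.
fragments = [
    (120, 700, "world"),
    (72, 650, "second"),
    (72, 700, "Hello"),
    (130, 650.5, "line"),
]
print(reading_order(fragments))
```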
0
 
GnsCommented:
Yup. It's a strange beast all right:-). So any perl/php/sed/awk/whatever would need to be a full-fledged (well, not really... You'd just need ... "semirendering", and be damned with those cases that it cannot grok:-) PDF parser. Yucky at best:-).

-- Glenn
0
 
Computer101Commented:
PAQed - no points refunded (of 50)

Computer101
E-E Admin
0
