Extracting text from an MS Word (pre-2007 .doc) file in C#

I am implementing a Lucene search for documents on a site. I have used PDFBox to extract text from PDFs and an XML parser to extract text from MS Word 2007 files. However, I still cannot read the older .doc format. I have tried NPOI, POI.NET, etc. without much luck.

I can use File.OpenText(path), but it also returns some cryptic markup that messes up my search results.
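As a stopgap for the cryptic markup problem, one option is to read the file as raw bytes and keep only runs of printable characters, much like the Unix `strings` tool does. The sketch below is not a real .doc parser; it is a crude filter under stated assumptions. The class name `DocTextSketch`, the method `ExtractPrintableRuns`, and the `minRun` threshold are all hypothetical names chosen for illustration. It assumes the text of interest is stored as 8-bit ASCII-range characters; a file that stores its text as 16-bit characters would need a second pass, and some binary noise will still get through.

```csharp
using System;
using System.IO;
using System.Text;

public static class DocTextSketch
{
    // Hypothetical helper: keeps runs of at least minRun printable ASCII
    // bytes (0x20..0x7E) and joins the surviving runs with single spaces.
    // Everything else (control bytes, binary structures) is dropped.
    public static string ExtractPrintableRuns(byte[] data, int minRun = 4)
    {
        var result = new StringBuilder();
        var run = new StringBuilder();

        foreach (byte b in data)
        {
            bool printable = b >= 0x20 && b <= 0x7E;
            if (printable)
            {
                run.Append((char)b);
            }
            else
            {
                // A non-printable byte ends the current run.
                Flush(run, result, minRun);
            }
        }

        // Do not lose a run that reaches the end of the data.
        Flush(run, result, minRun);
        return result.ToString();
    }

    // Appends the run to the result if it is long enough, then resets it.
    private static void Flush(StringBuilder run, StringBuilder result, int minRun)
    {
        if (run.Length >= minRun)
        {
            if (result.Length > 0)
            {
                result.Append(' ');
            }
            result.Append(run.ToString());
        }
        run.Clear();
    }
}
```

Usage would be along the lines of `string text = DocTextSketch.ExtractPrintableRuns(File.ReadAllBytes(path));`, with the result passed to the Lucene indexer. Raising `minRun` trades recall for less noise in the index.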

Does anyone have samples for POI, or know how to read .doc files without needing an Office installation or Interop (MS doesn't recommend that)?
Asked by: robasta
 
_Katka_ commented:
Hi, I guess you'll have to use the Office interop assemblies.

Here's a good tutorial on how to accomplish that:

http://eggheadcafe.com/tutorials/aspnet/b6f75379-840c-4745-a76c-04d43694333b/read-a-word-document-do.aspx

regards,
Kate
 
robasta (author) commented:
I did not want to use Interop because MS discourages it (http://support.microsoft.com/kb/257757) and because of licensing issues.

Aspose does the job, but it's not free.

Free solutions:
1. Use this MS DLL to get document properties: http://blogs.msdn.com/erikaehrli/archive/2005/11/30/dsofileproperties.aspx
2. Use IFilters (http://www.codeproject.com/KB/cs/IFilter.aspx)