Use C# to extract text from a Word document without Interop

Hi

Does anyone have any experience of opening Word documents and saving them as text files? Or of just extracting the words (not the formatting, etc.)?

I don't want to install Word on my server and use the Interop libraries. Is there another way to do this? I don't mind paying for some kind of DLL if it's not too expensive (less than about $99).

Many thanks in advance...
Ben
Asked by benwilliamson
Gautham Janardhan commented:
I got this from another forum. It works for a small Word document, but I don't know how it will fare against a document with heavy formatting.

// Requires: using System; using System.Collections; using System.IO;

StreamReader reader = null;
StreamWriter writer = null;

// Maps each lower-cased word to a comma-separated list of its positions, sorted by word.
SortedList table = new SortedList();
//Hashtable table = new Hashtable();

string logFile = "logfile.txt";

try
{
    reader = new StreamReader(textBox1.Text); // opens the file as text
    writer = new StreamWriter(logFile, false);

    int h = 0; // running position of the current word in the file

    // Iterate one word at a time. Each word's position list is updated for every instance encountered.
    for (string line = reader.ReadLine(); line != null; line = reader.ReadLine())
    {
        string[] words = GetWords(line);

        foreach (string word in words)
        {
            string iword = word.ToLower();

            h++;
            if (table.ContainsKey(iword))
            {
                table[iword] = table[iword] + "," + "'" + h + "'";
            }
            else
            {
                table[iword] = "'" + h + "'";
            }
        }
    }

    // Write one line per distinct word, followed by its positions.
    foreach (DictionaryEntry entry in table)
    {
        writer.WriteLine("{0} ({1})", entry.Key, entry.Value);
    }
}
catch (Exception c)
{
    // writer is still null if opening either file failed
    if (writer != null)
        writer.WriteLine(c.Message);
}
finally
{
    if (reader != null)
        reader.Close();
    if (writer != null)
        writer.Close();
}

// Splits a line into its words (runs of letters and digits).
static string[] GetWords(string line)
{
    ArrayList al = new ArrayList(); // for intermediate results

    int i = 0;
    string word;
    char[] characters = line.ToCharArray();

    while ((word = GetNextWord(line, characters, ref i)) != null)
        al.Add(word);

    string[] words = new string[al.Count];
    al.CopyTo(words);
    return words;
}

// Returns the next word starting at or after index i, or null when the line is exhausted.
static string GetNextWord(string line, char[] characters, ref int i)
{
    // skip anything that is not a letter or digit
    while (i < characters.Length && !Char.IsLetterOrDigit(characters[i]))
        i++;

    if (i == characters.Length)
        return null;

    int start = i;

    // find the end of the word
    while (i < characters.Length && Char.IsLetterOrDigit(characters[i]))
        i++;

    // return the word
    return line.Substring(start, i - start);
}
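
To make it easier to see what the code above produces, here is a minimal, self-contained sketch of the same word-position index run on in-memory strings instead of a file. It is not the original poster's code: the names WordIndexDemo, SplitWords and BuildIndex are made up for illustration, and it swaps the non-generic SortedList for a SortedDictionary<string, List<int>> (assuming C# 7 or later). The splitting rule (runs of letters and digits) is the same one GetNextWord uses.

```csharp
using System;
using System.Collections.Generic;

public static class WordIndexDemo
{
    // Splits a line into runs of letters/digits (same rule as GetNextWord above).
    public static List<string> SplitWords(string line)
    {
        var words = new List<string>();
        int i = 0;
        while (i < line.Length)
        {
            // skip separators
            while (i < line.Length && !char.IsLetterOrDigit(line[i])) i++;
            int start = i;
            // consume one word
            while (i < line.Length && char.IsLetterOrDigit(line[i])) i++;
            if (i > start) words.Add(line.Substring(start, i - start));
        }
        return words;
    }

    // Maps each lower-cased word to the 1-based positions at which it occurs.
    public static SortedDictionary<string, List<int>> BuildIndex(IEnumerable<string> lines)
    {
        var index = new SortedDictionary<string, List<int>>(StringComparer.Ordinal);
        int position = 0;
        foreach (string line in lines)
        {
            foreach (string word in SplitWords(line))
            {
                position++;
                string key = word.ToLowerInvariant();
                if (!index.TryGetValue(key, out List<int> positions))
                {
                    positions = new List<int>();
                    index[key] = positions;
                }
                positions.Add(position);
            }
        }
        return index;
    }

    public static void Main()
    {
        var index = BuildIndex(new[] { "The cat sat.", "The dog sat too!" });
        foreach (var entry in index)
            Console.WriteLine("{0} ({1})", entry.Key, string.Join(",", entry.Value));
    }
}
```

For the two sample lines this prints cat (2), dog (5), sat (3,6), the (1,4) and too (7), one per line. Note that both versions treat the input as plain text, so they only give sensible results if the file really contains readable text.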
sjturner2 commented:
This product should do the job, but it costs 399 euros.

http://www.independentsoft.de/word/index.html
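
If the documents are in the newer .docx format (an assumption; this does not work for legacy binary .doc files), a free alternative may be possible, because a .docx file is a ZIP archive whose main text lives in word/document.xml. The following is only a rough sketch under that assumption, using System.IO.Compression (.NET 4.5 or later) and LINQ to XML. The class name DocxText and the method Extract are made up for illustration. It emits one line per paragraph and ignores tabs, line breaks, headers, footers and footnotes; text inside nested containers such as text boxes may appear twice.

```csharp
using System;
using System.IO;
using System.IO.Compression;
using System.Linq;
using System.Text;
using System.Xml.Linq;

public static class DocxText
{
    // WordprocessingML main namespace used by word/document.xml.
    static readonly XNamespace W =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // Returns the body text of a .docx stream, one line per paragraph.
    public static string Extract(Stream docx)
    {
        using (var zip = new ZipArchive(docx, ZipArchiveMode.Read, true))
        {
            ZipArchiveEntry entry = zip.GetEntry("word/document.xml");
            if (entry == null)
                throw new InvalidDataException("word/document.xml not found; is this a .docx file?");

            using (Stream xml = entry.Open())
            {
                XDocument doc = XDocument.Load(xml);
                var sb = new StringBuilder();

                // Each w:p is a paragraph; its visible text is in the w:t descendants.
                foreach (XElement p in doc.Descendants(W + "p"))
                {
                    sb.AppendLine(string.Concat(p.Descendants(W + "t").Select(t => t.Value)));
                }
                return sb.ToString();
            }
        }
    }

    public static void Main(string[] args)
    {
        if (args.Length != 1)
        {
            Console.Error.WriteLine("usage: DocxText <file.docx>");
            return;
        }
        using (FileStream fs = File.OpenRead(args[0]))
            Console.Write(Extract(fs));
    }
}
```

The output of Extract could be written to a text file with File.WriteAllText, or fed line by line into a word splitter such as GetWords from the earlier comment.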

Question has a verified solution.