Solved

Word to Text

Posted on 2004-04-02
Last Modified: 2008-02-01
Hi All

How can I get the text from a Word document?

Thanks
Question by:lakkiprasanna
 
LVL 14

Expert Comment

by:sudhakar_koundinya
ID: 10740144
You can use the textmining API for this.
 
LVL 14

Expert Comment

by:sudhakar_koundinya
ID: 10740150
Download the API from www.textmining.org

Regards
 
LVL 14

Expert Comment

by:sudhakar_koundinya
ID: 10740210
A sample:

//package org.prithvi.test;

import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.util.Date;

import org.textmining.text.extraction.WordExtractor;

/**
 * Extracts the text of every .doc file in a folder with the textmining
 * WordExtractor and writes it next to the source file as
 * <name>.textmining.txt. Per-file results are collected in
 * success.textmining.log and error.textmining.log.
 *
 * @version 1.0
 */
public class Word2Text {

  static int success = 0;
  static int error = 0;
  static StringBuffer errorLog = new StringBuffer();
  static StringBuffer successLog = new StringBuffer();

  public static void main(String[] args) throws Exception {
    if (args.length < 1) {
      System.err.println("Usage: java Word2Text <folder>");
      return;
    }

    Date start = new Date();
    searchFolder(args[0]);
    Date end = new Date();
    long elapsed = end.getTime() - start.getTime();

    successLog.append("\n\nTotal number of successfully parsed files: " + success);
    successLog.append("\n\nTotal number of files: " + (success + error));
    successLog.append("\n\nTotal execution time (in milliseconds): " + elapsed);

    errorLog.append("\n\nTotal number of bad files: " + error);
    errorLog.append("\n\nTotal number of files: " + (success + error));
    errorLog.append("\n\nTotal execution time (in milliseconds): " + elapsed);

    writeLog("success.textmining.log", successLog);
    writeLog("error.textmining.log", errorLog);
  }

  /** Writes a log buffer to a file; the stream is closed even if the write fails. */
  static void writeLog(String fileName, StringBuffer log) throws IOException {
    try (OutputStream out = new FileOutputStream(fileName)) {
      out.write(log.toString().getBytes());
    }
  }

  /** Extracts the text of one Word file and records the timings or the failure. */
  public static void searchFile(String strFile) {
    Date start = new Date();
    long size = new File(strFile).length();
    String outFile = strFile + ".textmining.txt";

    try {
      String text;
      try (InputStream in = new FileInputStream(strFile)) {
        text = new WordExtractor().extractText(in);
      }
      Date parsed = new Date();

      try (OutputStream out = new FileOutputStream(outFile)) {
        out.write(text.getBytes());
      }
      Date end = new Date();

      successLog.append("\n\nFile: " + strFile);
      successLog.append("\nFile size (in bytes): " + size);
      successLog.append("\nOutput file: " + outFile);
      successLog.append("\nStart time: " + start);
      successLog.append("\nEnd time: " + end);
      successLog.append("\nParse time (in milliseconds): " + (parsed.getTime() - start.getTime()));
      successLog.append("\nWrite time (in milliseconds): " + (end.getTime() - parsed.getTime()));
      success++;
    } catch (Exception exe) {
      Date end = new Date();
      errorLog.append("\n\nFile: " + strFile);
      errorLog.append("\nFile size (in bytes): " + size);
      errorLog.append("\nStart time: " + start);
      errorLog.append("\nEnd time: " + end);
      errorLog.append("\nTime (in milliseconds): " + (end.getTime() - start.getTime()));
      errorLog.append("\nException: " + exe);
      error++;
    }
  }

  /** Runs searchFile on every .doc file directly inside the given folder. */
  public static void searchFolder(String strFolder) {
    File folder = new File(strFolder);
    String[] files = folder.list(); // null if not a directory or not readable
    if (!folder.isDirectory() || files == null) {
      errorLog.append("\n\n" + strFolder + " is not a readable directory");
      return;
    }
    for (int i = 0; i < files.length; i++) {
      // Strings are immutable, so the lower-cased copy has to be tested directly;
      // calling toLowerCase() and discarding the result would skip "X.DOC".
      if (files[i].toLowerCase().endsWith(".doc")) {
        searchFile(new File(folder, files[i]).getPath());
      }
    }
  }
}
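
A folder scan like the one above depends on matching the file extension regardless of case. Java strings are immutable, so toLowerCase() only helps if its return value is used. Below is a minimal, standard-library-only sketch of that check. The class and method names (DocNames, isDocFile, docFilesOnly) are made up for illustration and are not part of the textmining or POI APIs.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

final class DocNames {

  // True when the name ends in ".doc", ignoring case.
  static boolean isDocFile(String name) {
    // toLowerCase returns a new String; the original is never modified.
    return name.toLowerCase(Locale.ROOT).endsWith(".doc");
  }

  // Keeps only the Word file names from a directory listing.
  static List<String> docFilesOnly(String[] names) {
    List<String> result = new ArrayList<>();
    if (names == null) { // File.list() returns null for a non-directory
      return result;
    }
    for (String name : names) {
      if (isDocFile(name)) {
        result.add(name);
      }
    }
    return result;
  }
}
```

The array passed to docFilesOnly would typically be the result of new File(folder).list(); handling null here avoids a NullPointerException when the path is not a readable directory.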
 
LVL 13

Expert Comment

by:Webstorm
ID: 10740216
Hi  lakkiprasanna,

See also http://jakarta.apache.org/poi/
 
LVL 30

Expert Comment

by:mayankeagle
ID: 10740415
You also have: http://api.openoffice.org
 
LVL 30

Expert Comment

by:mayankeagle
ID: 10740424
By the way, why does this question have 495 points ;-) ? 500 is unlucky? :-)
 
LVL 14

Accepted Solution

by:
sudhakar_koundinya earned 495 total points
ID: 10746354
Hi ALL

FYI

Text Mining is basically built on top of the POI API.


POI
It concentrates on MS Office document formats, for both reading and writing.


Text Mining

It concentrates on getting only the text out of different document formats (i.e. reading only). The developer of the Text Mining API is also a developer of the POI API. In the near future you should be able to reach the different APIs through a single API at this site (for text extraction only).

Some of the problems that arise in the POI API are fixed in textmining. Jakarta has not yet released a fully functional POI (they are still working on HWPF, i.e. the Word document format).

Currently POI supports the Word 97 to Word 2003 formats, whereas textmining also supports the Word 6.0 format. I have contacted the developer of the POI API about Word 2.x support, and I have contributed sample code for the Word 2.x format myself.
The code for Word 2.x is listed at
http://www.textmining.org/modules.php?op=modload&name=News&file=article&sid=8&mode=nested&order=1&thold=0
So we may see Word 2.x support in the near future (a positive hope :-))


Neither textmining nor the POI API supports fast-saved (complex) documents.
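
Because support depends on the file format version, it can help to check what kind of file you have before handing it to an extractor. Word 97 to 2003 .doc files are stored in OLE2 compound files, which always start with the eight bytes D0 CF 11 E0 A1 B1 1A E1. The sketch below tests only for that signature using the standard library. The names (Ole2Sniffer, hasOle2Signature, looksLikeOle2File) are hypothetical, and it is my assumption, not something stated by either library, that a file without the signature (for example RTF saved with a .doc name) is not worth passing to an OLE2-based extractor. The check cannot tell Word versions apart and says nothing about fast-saved documents.

```java
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

final class Ole2Sniffer {

  // First eight bytes of every OLE2 compound file.
  private static final byte[] OLE2_SIGNATURE = {
      (byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0,
      (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1
  };

  // True when the given leading bytes start with the OLE2 signature.
  static boolean hasOle2Signature(byte[] head) {
    if (head == null || head.length < OLE2_SIGNATURE.length) {
      return false;
    }
    for (int i = 0; i < OLE2_SIGNATURE.length; i++) {
      if (head[i] != OLE2_SIGNATURE[i]) {
        return false;
      }
    }
    return true;
  }

  // Reads the first eight bytes of a file and checks them.
  static boolean looksLikeOle2File(String path) throws IOException {
    byte[] head = new byte[OLE2_SIGNATURE.length];
    try (InputStream in = new FileInputStream(path)) {
      int read = 0;
      while (read < head.length) {
        int n = in.read(head, read, head.length - read);
        if (n < 0) {
          return false; // file is shorter than the signature
        }
        read += n;
      }
    }
    return hasOle2Signature(head);
  }
}
```

A caller could log files that fail looksLikeOle2File separately from real extraction errors, so that unsupported formats are easy to tell apart from damaged documents.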


Regards