Solved

read word doc. and grab data

Posted on 2008-06-23
Last Modified: 2011-10-19
I need to read Word documents that contain certain headings, grab these headings and the text under each heading, and insert them into a MySQL database without losing the Word formatting, using Perl or JavaScript. If there are a few new headings that are not in my list, they should all be put together in a separate variable as one.
Question by:tomappu
 
LVL 39

Expert Comment

by:Adam314
What format do you plan to use to maintain the format in your database?
 
LVL 17

Expert Comment

by:mjcoyne
For an example of how to extract data from MS Word documents, have a look at http://www.wellho.net/solutions/perl-using-perl-to-read-microsoft-word-documents.html.  Adam's question is a good one, though...

I suppose you could, as in the example given at the link above, store the formats as paragraph style names extracted from the documents, which could then be re-created if needed, perhaps using something like Win32::Word::Writer  (see http://search.cpan.org/~johanl/Win32-Word-Writer-0.02/lib/Win32/Word/Writer.pm).
 
LVL 2

Author Comment

by:tomappu
I will basically be using a MySQL database with the text datatype for storing the data. What I meant by format was to save the text with the exact rich text formatting of the Word file, for example: bold, indentation, colors, tables etc. I have attached an example file below. The file contains Heading1 and some text under it. How do I grab the text under each such heading? The grabbing is the main issue here; I can insert the results into the database myself.


1.doc
 
LVL 17

Accepted Solution

by:
mjcoyne earned 250 total points
If I take your example Word file, save it as ee-test.doc, and run it through this code (after borrowing heavily from the link I gave you above):

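#!/usr/bin/perl
use strict;
use warnings;
use Win32::OLE;
use Win32::OLE::Enum;

my $inputdoc  = 'C:/Documents and Settings/Mike/Desktop/ee-test.doc';
my $outputdoc = 'ee-test-data.txt';

# GetObject asks Word (via OLE) to open the file; it returns undef on failure.
my $document = Win32::OLE->GetObject($inputdoc)
    or die "Cannot open $inputdoc: " . Win32::OLE->LastError() . "\n";
open(my $fh, '>', $outputdoc) or die "Cannot write $outputdoc: $!\n";

print "Extracting text from $inputdoc...\n";

# Walk the document's Paragraphs collection one paragraph at a time.
my $paragraphs = $document->Paragraphs();
my $enumerate  = Win32::OLE::Enum->new($paragraphs);

while (defined(my $paragraph = $enumerate->Next())) {
    my $text = $paragraph->{Range}->{Text};
    $text =~ s/[\n\r]//g;    # strip the paragraph mark
    $text =~ s/\x0b/\n/g;    # a manual line break (vertical tab) becomes a newline
    print $fh "$text\n";
}

close($fh) or die "Cannot close $outputdoc: $!\n";
undef $enumerate;
undef $document;

print "Done.  Text saved in $outputdoc.\n";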


The output, contained in ee-test-data.txt as text, is:

HEADING 1

Sometext blah blah&&&&..


HEADING 2

Sometext blah blah&&&&..


HEADING 3

Sometext blah blah&&&&..


So it's a simple matter to extract the text from the binary Word file.  You'd just replace the code that prints to the output file with code that instead inserts it into your database.

But, I'm still a bit confused...  If you're going to save the data as "datatype text", then by definition you're going to lose "bold, indentation, colors, tables etc", as text does not, of course, include these attributes.

This is why I suggested that in the database you could associate each text (data) entry with a paragraph style, which does tell Word to use styling like bold, italics, indents, etc...
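
To make the heading question concrete, here is a minimal sketch of grouping paragraphs under their headings once you have each paragraph's style name and text. It is not from the thread: the function name group_by_heading and the [style, text] pair layout are made up for illustration, and it assumes the headings use Word's built-in styles whose names start with "Heading" (true for an English-language Word, the names are localized elsewhere). Headings that are not in your list, along with the text under them, are collected into a single string.

```perl
use strict;
use warnings;

# Hypothetical helper: group paragraph text under known headings.
# $paragraphs: array ref of [style_name, text] pairs in document order.
# $known:      array ref of the heading titles you expect.
# Returns (\%sections, $other), where $other holds every unexpected
# heading plus the text beneath it, concatenated into one string.
sub group_by_heading {
    my ($paragraphs, $known) = @_;
    my %is_known = map { lc($_) => 1 } @$known;
    my %sections;
    my $other  = '';
    my $target = \$other;    # text before the first heading lands in $other

    for my $p (@$paragraphs) {
        my ($style, $text) = @$p;
        next if $text =~ /^\s*$/;    # skip empty paragraphs

        if ($style =~ /^Heading/i) {    # assumption: built-in heading styles
            if ($is_known{ lc $text }) {
                $sections{$text} //= '';
                $target = \$sections{$text};
            }
            else {
                $other .= "$text\n";    # keep the unexpected heading itself
                $target = \$other;
            }
            next;
        }
        $$target .= "$text\n";          # body text goes under the current heading
    }
    return (\%sections, $other);
}

my ($sections, $other) = group_by_heading(
    [
        [ 'Heading 1', 'HEADING 1' ], [ 'Normal', 'Sometext blah blah' ],
        [ 'Heading 1', 'SURPRISE' ],  [ 'Normal', 'Unexpected text' ],
    ],
    [ 'HEADING 1', 'HEADING 2' ],
);
print "$_:\n$sections->{$_}" for sort keys %$sections;
print "Other:\n$other";
```

In the extraction loop you would build the pairs as you go, probably by reading the style name from something like $paragraph->{Style}->{NameLocal} (as the wellho.net example appears to do; treat that property path as an assumption and check it against your Word version). The style name stored next to each text entry is also what would let you re-create the formatting later.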

 
LVL 17

Expert Comment

by:mjcoyne
BTW, you mentioned you want to "save the text with exact rich text format as in word file".  Are these Word files saved as RTF files, or as MS Word's binary .doc format?