Solved

Read a Word doc and grab data

Posted on 2008-06-23
Last Modified: 2011-10-19
I need to read Word documents that contain certain headings, grab those headings and the text under each one, and insert them into a MySQL database without losing the Word formatting, using Perl or JavaScript.  If there are a few new headings that are not on my list, they should all be put together in a separate variable as one.
Question by:tomappu

Expert Comment

by:Adam314
ID: 21847749
What format do you plan to use to maintain the format in your database?

Expert Comment

by:mjcoyne
ID: 21849399
For an example of how to extract data from MS Word documents, have a look at http://www.wellho.net/solutions/perl-using-perl-to-read-microsoft-word-documents.html.  Adam's question is a good one, though...

I suppose you could, as in the example given at the link above, store the formats as paragraph style names extracted from the documents, which could then be re-created if needed, perhaps using something like Win32::Word::Writer  (see http://search.cpan.org/~johanl/Win32-Word-Writer-0.02/lib/Win32/Word/Writer.pm).
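As a rough sketch of that idea, the loop below prints each paragraph's style name next to its text. It assumes Word is installed on the machine, and the file path is a made-up placeholder; the Style/NameLocal lookup follows the approach in the linked example.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Win32::OLE;
use Win32::OLE::Enum;

# Hypothetical path; point this at your own document.
my $inputdoc = 'C:/temp/example.doc';

my $document = Win32::OLE->GetObject($inputdoc)
    or die "Cannot open $inputdoc: " . Win32::OLE->LastError() . "\n";

my $enumerate = Win32::OLE::Enum->new($document->Paragraphs());

while (defined(my $paragraph = $enumerate->Next())) {
    # NameLocal is the style name as Word displays it, e.g. "Heading 1" or "Normal".
    my $style = $paragraph->{Style}->{NameLocal};
    my $text  = $paragraph->{Range}->{Text};
    $text =~ s/[\n\r]//g;    # strip paragraph terminators
    print "[$style] $text\n";
}

undef $enumerate;
undef $document;
```

The style name could then be stored in its own column alongside the text, so the pairing survives in the database.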

Author Comment

by:tomappu
ID: 21853284
I will basically be using a MySQL database with the text datatype for storing the data. What I meant by format was saving the text with the exact rich text formatting it has in the Word file, for example bold, indentation, colors, tables, etc.  I have attached an example file below.  The file contains Heading1 and some text under it. How do I grab the text under each such heading?  The grabbing is the main issue here; I can insert the text into the database myself.


1.doc

Accepted Solution

by:mjcoyne (earned 250 total points)
ID: 21855179
If I take your example Word file, save it as ee-test.doc, and run it through this code (after borrowing heavily from the link I gave you above):

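#!/usr/bin/perl
use strict;
use warnings;
use Win32::OLE;
use Win32::OLE::Enum;

my $inputdoc = 'C:/Documents and Settings/Mike/Desktop/ee-test.doc';
my $outputdoc = "ee-test-data.txt";

# Open the Word document via OLE; fail early if it cannot be loaded.
my $document = Win32::OLE->GetObject($inputdoc)
    or die "Cannot open $inputdoc: " . Win32::OLE->LastError() . "\n";
open(my $fh, '>', $outputdoc) or die "Cannot write $outputdoc: $!\n";

print "Extracting text from $inputdoc...\n";

# Walk every paragraph in the document.
my $paragraphs = $document->Paragraphs();
my $enumerate = Win32::OLE::Enum->new($paragraphs);

while (defined(my $paragraph = $enumerate->Next())) {
    my $text = $paragraph->{Range}->{Text};
    $text =~ s/[\n\r]//g;    # strip paragraph terminators
    $text =~ s/\x0b/\n/g;    # Word's manual line break (vertical tab) becomes a newline
    print $fh "$text\n";
}

close($fh) or die "Cannot close $outputdoc: $!\n";
undef $enumerate;
undef $document;

print "Done.  Text saved in $outputdoc.\n";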


The output, contained in ee-test-data.txt as text, is:

HEADING 1

Sometext blah blah&&&&..


HEADING 2

Sometext blah blah&&&&..


HEADING 3

Sometext blah blah&&&&..


So it's a simple matter to extract the text from the binary Word file.  You'd just replace the code that prints to the output file with code that instead inserts it into your database.
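A minimal sketch of what that database step might look like with DBI. The connection details, the table name "sections" and its columns are all assumptions made up for illustration, and the DBD::mysql driver needs to be installed.

```perl
use strict;
use warnings;
use DBI;

# Assumed connection details; replace with your own.
my $dbh = DBI->connect(
    'DBI:mysql:database=docs;host=localhost',
    'dbuser', 'dbpassword',
    { RaiseError => 1, AutoCommit => 1 },
);

# Assumed table: CREATE TABLE sections (heading VARCHAR(255), body TEXT)
my $sth = $dbh->prepare('INSERT INTO sections (heading, body) VALUES (?, ?)');

# Stand-in data; in the real script this would come from the paragraph loop.
my %sections = (
    'HEADING 1' => 'Sometext blah blah',
    'HEADING 2' => 'Sometext blah blah',
);

for my $heading (sort keys %sections) {
    # Placeholders let DBI handle quoting of the inserted text.
    $sth->execute($heading, $sections{$heading});
}

$dbh->disconnect;
```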

But, I'm still a bit confused...  If you're going to save the data as "datatype text", then by definition you're going to lose "bold, indentation, colors, tables etc", as text does not, of course, include these attributes.

This is why I suggested that in the database you could associate each text (data) entry with a paragraph style, which does tell Word to use styling like bold, italics, indents, etc...
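For the grouping part of the question, here is one possible sketch in plain Perl. It takes [style, text] pairs (which a paragraph loop over the document could produce) plus the list of expected headings, and pools anything under an unexpected heading into a single entry. The function name, the 'OTHER' key, and the assumption that heading paragraphs use a style whose name starts with "Heading" are all mine, not something the example file confirms.

```perl
use strict;
use warnings;

# Group [style, text] pairs under their headings. Headings that are not in
# the known list are pooled, heading text plus body, under one 'OTHER' key.
sub group_by_heading {
    my ($paragraphs, $known) = @_;
    my %is_known = map { $_ => 1 } @$known;
    my (%sections, $current);

    for my $p (@$paragraphs) {
        my ($style, $text) = @$p;
        next if $text eq '';    # skip empty paragraphs
        if ($style =~ /^Heading/) {
            if ($is_known{$text}) {
                $current = $text;
                $sections{$current} = '' unless exists $sections{$current};
            }
            else {
                $current = 'OTHER';
                $sections{OTHER} .= "$text\n";    # keep the unknown heading itself
            }
        }
        elsif (defined $current) {
            $sections{$current} .= "$text\n";     # body text under the current heading
        }
    }
    return \%sections;
}

# Usage with made-up data: HEADING 9 is not in the known list.
my $sections = group_by_heading(
    [
        [ 'Heading 1', 'HEADING 1' ], [ 'Normal', 'Sometext' ],
        [ 'Heading 1', 'HEADING 9' ], [ 'Normal', 'Extra text' ],
    ],
    [ 'HEADING 1', 'HEADING 2', 'HEADING 3' ],
);
print "$_:\n$sections->{$_}\n" for sort keys %$sections;
```

Text that appears before the first heading is dropped in this sketch; whether that is acceptable depends on the documents.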


Expert Comment

by:mjcoyne
ID: 21856942
BTW, you mentioned you want to "save the text with exact rich text format as in word file".  Are these Word files saved as RTF files, or as MS Word's binary .doc format?
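If they are binary .doc files, one option worth trying is to have Word itself write an RTF copy, which is text markup and so could go into a text column. This is only a sketch: the paths are placeholders, it assumes Word is installed, and the value 6 is Word's wdFormatRTF constant for the FileFormat argument of SaveAs.

```perl
use strict;
use warnings;
use Win32::OLE;

# Hypothetical paths; adjust to your own files.
my $inputdoc = 'C:/temp/example.doc';
my $rtfpath  = 'C:\\temp\\example.rtf';

my $document = Win32::OLE->GetObject($inputdoc)
    or die "Cannot open $inputdoc: " . Win32::OLE->LastError() . "\n";

# Save a copy in RTF format (6 = wdFormatRTF).
$document->SaveAs($rtfpath, 6);

# Close without saving further changes (0 = wdDoNotSaveChanges).
$document->Close(0);
undef $document;
```

This converts the whole document at once; splitting the result per heading would still need extra work.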