Solved

Read a Word document and grab data

Posted on 2008-06-23
Last Modified: 2011-10-19
I need to read Word documents that contain certain headings, grab those headings and the text under each one, and insert them into a MySQL database without losing the Word formatting, using Perl or JavaScript. If there are new headings that are not on my list, they should all be put together in a single separate variable.
Question by:tomappu
LVL 39

Expert Comment

by:Adam314
ID: 21847749
What format do you plan to use to maintain the format in your database?
 
LVL 17

Expert Comment

by:mjcoyne
ID: 21849399
For an example of how to extract data from MS Word documents, have a look at http://www.wellho.net/solutions/perl-using-perl-to-read-microsoft-word-documents.html.  Adam's question is a good one, though...

I suppose you could, as in the example given at the link above, store the formats as paragraph style names extracted from the documents, which could then be re-created if needed, perhaps using something like Win32::Word::Writer  (see http://search.cpan.org/~johanl/Win32-Word-Writer-0.02/lib/Win32/Word/Writer.pm).
 
LVL 2

Author Comment

by:tomappu
ID: 21853284
I will basically be using a MySQL database with the TEXT datatype to store the data. What I meant by format was to save the text with the exact rich text formatting of the Word file, for example bold, indentation, colors, tables, etc. I have attached an example file below. The file contains Heading1 and some text under it. How do I grab the text under each such heading? The grabbing is the main issue here; I can insert the text into the database myself.


1.doc
 
LVL 17

Accepted Solution

by:mjcoyne (earned 250 total points)
ID: 21855179
If I take your example Word file, save it as ee-test.doc, and run it through this code (after borrowing heavily from the link I gave you above):

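#!/usr/bin/perl
use strict;
use warnings;
use Win32::OLE;
use Win32::OLE::Enum;

my $inputdoc  = 'C:/Documents and Settings/Mike/Desktop/ee-test.doc';
my $outputdoc = 'ee-test-data.txt';

# Attach to the Word document through OLE automation.
my $document = Win32::OLE->GetObject($inputdoc)
    or die "Cannot open $inputdoc: " . Win32::OLE->LastError() . "\n";
open(my $fh, '>', $outputdoc) or die "Cannot write $outputdoc: $!\n";

print "Extracting text from $inputdoc...\n";

# Walk the document's Paragraphs collection one paragraph at a time.
my $paragraphs = $document->Paragraphs();
my $enumerate  = Win32::OLE::Enum->new($paragraphs);

while (defined(my $paragraph = $enumerate->Next())) {
    my $text = $paragraph->{Range}->{Text};
    $text =~ s/[\n\r]//g;    # strip paragraph marks
    $text =~ s/\x0b/\n/g;    # turn Word's manual line breaks into newlines
    print $fh "$text\n";
}

close($fh) or die "Cannot close $outputdoc: $!\n";
undef $enumerate;
undef $document;

print "Done.  Text saved in $outputdoc.\n";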


The output, contained in ee-test-data.txt as text, is:

HEADING 1

Sometext blah blah&&&&..


HEADING 2

Sometext blah blah&&&&..


HEADING 3

Sometext blah blah&&&&..


So it's a simple matter to extract the text from the binary Word file.  You'd just replace the code that prints to the output file with code that inserts the text into your database.
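
As a rough sketch of that replacement step, something like the following could work. It assumes a DBI handle that is already connected to MySQL, and the table and column names (sections, heading, body) are made up for illustration, so adjust them to your own schema.

```perl
use strict;
use warnings;

# Insert one row per heading using a prepared statement with placeholders.
# Assumption: $dbh is a connected DBI handle, created elsewhere with e.g.
#   DBI->connect('DBI:mysql:database=docs;host=localhost', $user, $pass,
#                { RaiseError => 1 });
# The table "sections" and its columns "heading" and "body" are hypothetical.
sub store_sections {
    my ($dbh, $sections) = @_;    # $sections: { heading text => body text }

    my $sth = $dbh->prepare(
        'INSERT INTO sections (heading, body) VALUES (?, ?)'
    );

    my $rows = 0;
    for my $heading (sort keys %$sections) {
        # Placeholders let the driver quote the text safely.
        $sth->execute($heading, $sections->{$heading});
        $rows++;
    }
    return $rows;    # number of rows submitted
}
```

With RaiseError set on the handle, a failed insert dies with the driver's message, so the loop itself needs no extra error checks.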

But, I'm still a bit confused...  If you're going to save the data as "datatype text", then by definition you're going to lose "bold, indentation, colors, tables etc", as text does not, of course, include these attributes.

This is why I suggested that, in the database, you could associate each text (data) entry with a paragraph style, which tells Word to apply styling like bold, italics, indents, etc.
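
Here is a minimal sketch of how the grouping could look once each paragraph's style name is available. It assumes the headings in the document use a style whose name starts with "Heading" (the sample file may not), and that each paragraph has already been read into a [style, text] pair, for instance from $paragraph->{Style}->{NameLocal} and $paragraph->{Range}->{Text} inside the loop above. The function name group_by_heading and the sample data are made up for illustration.

```perl
use strict;
use warnings;

# Group paragraphs under the heading that precedes them.
# $paragraphs: array ref of [ $style_name, $text ] pairs.
# $wanted:     array ref of the heading texts you expect.
# Returns a hash ref (heading => list of { style, text }) and one string
# holding everything that belongs to headings not in the wanted list.
sub group_by_heading {
    my ($paragraphs, $wanted) = @_;
    my %is_wanted = map { lc($_) => 1 } @$wanted;
    my (%sections, $current);
    my $other = '';

    for my $p (@$paragraphs) {
        my ($style, $text) = @$p;
        $text =~ s/[\r\n]+\z//;    # drop the trailing paragraph mark
        next if $text eq '';

        if ($style =~ /^Heading/i) {    # assumption: headings use "Heading ..." styles
            # Start a new section; unknown headings are routed to $other.
            $current = $is_wanted{ lc $text } ? $text : undef;
            if (defined $current) { $sections{$current} ||= [] }
            else                  { $other .= "$text\n" }
        }
        elsif (defined $current) {
            # Keep the style name next to the text so it can be re-applied later.
            push @{ $sections{$current} }, { style => $style, text => $text };
        }
        else {
            $other .= "$text\n";    # also catches text before the first heading
        }
    }
    return (\%sections, $other);
}

# Usage with made-up data:
my @paragraphs = (
    [ 'Heading 1', "HEADING 1\r" ],
    [ 'Normal',    "Some text\r" ],
    [ 'Heading 1', "SURPRISE\r" ],
    [ 'Normal',    "Extra text\r" ],
);
my ($sections, $other) = group_by_heading(\@paragraphs, ['HEADING 1']);
print "$_: ", scalar @{ $sections->{$_} }, " paragraph(s)\n" for sort keys %$sections;
print "Other:\n$other";
```

Storing the style name per paragraph keeps the database column plain text while leaving enough information to rebuild the styling later, which is the trade-off described above.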

 
LVL 17

Expert Comment

by:mjcoyne
ID: 21856942
BTW, you mentioned you want to "save the text with exact rich text format as in word file".  Are these Word files saved as RTF files, or as MS Word's binary .doc format?
