Solved

Read Word doc and grab data

Posted on 2008-06-23
Last Modified: 2011-10-19
I need to read Word documents that contain certain headings, grab those headings and the text under each one, and insert them into a MySQL database without losing the Word formatting, using Perl or JavaScript. If there are a few new headings that are not on my list, they should all be put together in one separate variable.
Question by:tomappu
LVL 39

Expert Comment

by:Adam314
ID: 21847749
What format do you plan to use to maintain the format in your database?
 
LVL 17

Expert Comment

by:mjcoyne
ID: 21849399
For an example of how to extract data from MS Word documents, have a look at http://www.wellho.net/solutions/perl-using-perl-to-read-microsoft-word-documents.html.  Adam's question is a good one, though...

I suppose you could, as in the example given at the link above, store the formats as paragraph style names extracted from the documents, which could then be re-created if needed, perhaps using something like Win32::Word::Writer  (see http://search.cpan.org/~johanl/Win32-Word-Writer-0.02/lib/Win32/Word/Writer.pm).
 
LVL 2

Author Comment

by:tomappu
ID: 21853284
I will basically be using a MySQL database with the text datatype to store the data. What I meant by format was saving the text with the exact rich text formatting of the Word file, for example bold, indentation, colors, tables, etc. I have attached an example file below. The file contains Heading1 and some text under it. How do I grab the text under each such heading? The grabbing is the main issue here; I can insert the text into the database myself.


1.doc
 
LVL 17

Accepted Solution

by:
mjcoyne earned 1000 total points
ID: 21855179
If I take your example Word file, save it as ee-test.doc, and run it through this code (after borrowing heavily from the link I gave you above):

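#!/usr/bin/perl
use strict;
use warnings;
use Win32::OLE;
use Win32::OLE::Enum;

my $inputdoc  = 'C:/Documents and Settings/Mike/Desktop/ee-test.doc';
my $outputdoc = 'ee-test-data.txt';

# Open the document through Word's OLE automation interface
my $document = Win32::OLE->GetObject($inputdoc)
    or die "Cannot open $inputdoc: " . Win32::OLE->LastError() . "\n";
open(my $fh, '>', $outputdoc) or die "Cannot write $outputdoc: $!\n";

print "Extracting text from $inputdoc...\n";

my $paragraphs = $document->Paragraphs();
my $enumerate  = Win32::OLE::Enum->new($paragraphs);

while (defined(my $paragraph = $enumerate->Next())) {
    my $text = $paragraph->{Range}->{Text};
    $text =~ s/[\n\r]//g;    # strip the paragraph terminators
    $text =~ s/\x0b/\n/g;    # Word's manual line break (vertical tab) becomes a newline
    print $fh "$text\n";
}

close($fh) or die "Cannot close $outputdoc: $!\n";
undef $enumerate;
undef $document;

print "Done.  Text saved in $outputdoc.\n";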


The output, contained in ee-test-data.txt as text, is:

HEADING 1

Sometext blah blah&&&&..


HEADING 2

Sometext blah blah&&&&..


HEADING 3

Sometext blah blah&&&&..


So it's a simple matter to extract the text from the binary Word file.  You'd just replace the code that prints to the output file with code that instead inserts it into your database.
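
As a rough illustration of that replacement, here is a minimal sketch of a helper that inserts one row per paragraph through a DBI-style database handle. The helper name store_paragraphs, the table doc_text and its columns are all assumptions made for the example; nothing in this thread specifies a schema.

```perl
use strict;
use warnings;

# Hypothetical helper: insert one row per non-empty paragraph.
# The table doc_text and its columns are assumptions, not from the thread.
# $dbh is expected to behave like a DBI database handle (prepare/execute).
sub store_paragraphs {
    my ($dbh, $source, @paragraphs) = @_;
    my $sth = $dbh->prepare(
        'INSERT INTO doc_text (source_file, position, body) VALUES (?, ?, ?)'
    );
    my $position = 0;
    for my $text (@paragraphs) {
        next if $text =~ /^\s*$/;    # skip empty paragraphs
        # Placeholders let the driver handle quoting of the text
        $sth->execute($source, ++$position, $text);
    }
    return $position;                # number of rows inserted
}

# Possible use with a real connection (DSN, user and password are placeholders):
#   use DBI;
#   my $dbh = DBI->connect('DBI:mysql:database=docs', 'user', 'password',
#                          { RaiseError => 1 });
#   store_paragraphs($dbh, 'ee-test.doc', @texts);
```

In the extraction loop you would collect each $text into an array instead of printing it, then pass the array to the helper once the document has been read.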

But, I'm still a bit confused...  If you're going to save the data as "datatype text", then by definition you're going to lose "bold, indentation, colors, tables etc", as text does not, of course, include these attributes.

This is why I suggested that in the database you could associate each text (data) entry with a paragraph style, which does tell Word to use styling like bold, italics, indents, etc...
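
To make that concrete, here is a minimal sketch of how style names could also be used to group the text under each heading, with headings that are not on the expected list folded into a single entry, as the question asks. It assumes each paragraph has already been reduced to a [style, text] pair, that heading styles have names starting with "Heading" (this depends on the document and on the Word locale), and that the catch-all key is called OTHER. The function name group_by_heading is hypothetical.

```perl
use strict;
use warnings;

# Group [style, text] pairs into sections keyed by heading text.
# $known is a hash ref of expected headings; any other heading, and the
# text under it, is collected under the single key 'OTHER' (an assumed name).
sub group_by_heading {
    my ($known, @paragraphs) = @_;
    my (%sections, $current);
    for my $pair (@paragraphs) {
        my ($style, $text) = @$pair;
        next if $text =~ /^\s*$/;              # ignore empty paragraphs
        if ($style =~ /^Heading/i) {           # assumption: heading style names
            if ($known->{$text}) {
                $current = $text;
                $sections{$current} ||= [];
            }
            else {
                $current = 'OTHER';
                # Keep the unknown heading itself so it is not lost
                push @{ $sections{OTHER} }, [$style, $text];
            }
            next;
        }
        next unless defined $current;          # text before the first heading
        push @{ $sections{$current} }, [$style, $text];
    }
    return \%sections;
}
```

Inside the Win32::OLE loop, the pairs might be built from $paragraph->{Style}->{NameLocal} and $paragraph->{Range}->{Text}; the style lookup is an assumption based on the article linked earlier and has not been run against the example file. Each stored pair keeps the style name next to the text, so the styling can be re-applied later.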

 
LVL 17

Expert Comment

by:mjcoyne
ID: 21856942
BTW, you mentioned you want to "save the text with exact rich text format as in word file".  Are these Word files saved as RTF files, or as MS Word's binary .doc format?
