Solved

Read a Word document and grab data

Posted on 2008-06-23
Last Modified: 2011-10-19
I need to read Word documents with certain headings, grab these headings and the text under each heading from the Word document, and insert them into a MySQL database without losing the Word formatting, using Perl or JavaScript.  If there are a few new headings that are not on my list, they should all be put together in a separate variable as one.
Question by:tomappu
LVL 39

Expert Comment

by:Adam314
ID: 21847749
What format do you plan to use to maintain the format in your database?
LVL 17

Expert Comment

by:mjcoyne
ID: 21849399
For an example of how to extract data from MS Word documents, have a look at http://www.wellho.net/solutions/perl-using-perl-to-read-microsoft-word-documents.html.  Adam's question is a good one, though...

I suppose you could, as in the example given at the link above, store the formats as paragraph style names extracted from the documents, which could then be re-created if needed, perhaps using something like Win32::Word::Writer  (see http://search.cpan.org/~johanl/Win32-Word-Writer-0.02/lib/Win32/Word/Writer.pm).
LVL 2

Author Comment

by:tomappu
ID: 21853284
I will basically be using a MySQL database with the TEXT datatype for storing the data. What I meant by format was to save the text with the exact rich text formatting it has in the Word file, for example: bold, indentation, colors, tables, etc.  I have attached an example file below.  The file contains Heading1 and some text under it. How do I grab the text under each such heading?  The grabbing is the main issue here; I can insert the text into the database myself.


1.doc
LVL 17

Accepted Solution

by:
mjcoyne earned 1000 total points
ID: 21855179
If I take your example Word file, save it as ee-test.doc, and run it through this code (after borrowing heavily from the link I gave you above):

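#!/usr/bin/perl
use strict;
use warnings;
use Win32::OLE;
use Win32::OLE::Enum;

my $inputdoc  = 'C:/Documents and Settings/Mike/Desktop/ee-test.doc';
my $outputdoc = 'ee-test-data.txt';

# Open the document through Word's OLE automation interface
my $document = Win32::OLE->GetObject($inputdoc)
    or die "Cannot open $inputdoc: " . Win32::OLE->LastError() . "\n";
open(my $fh, '>', $outputdoc) or die "Cannot write $outputdoc: $!\n";

print "Extracting text from $inputdoc...\n";

my $paragraphs = $document->Paragraphs();
my $enumerate  = Win32::OLE::Enum->new($paragraphs);

# Walk the paragraphs in document order
while (defined(my $paragraph = $enumerate->Next())) {
    my $text = $paragraph->{Range}->{Text};
    $text =~ s/[\n\r]//g;    # strip paragraph terminators
    $text =~ s/\x0b/\n/g;    # turn Word's manual line break (vertical tab) into a newline
    print $fh "$text\n";
}

close($fh) or die "Cannot close $outputdoc: $!\n";
undef $enumerate;
undef $document;

print "Done.  Text saved in $outputdoc.\n";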


The output, contained in ee-test-data.txt as text, is:

HEADING 1

Sometext blah blah&&&&..


HEADING 2

Sometext blah blah&&&&..


HEADING 3

Sometext blah blah&&&&..


So it's a simple matter to extract the text from the binary Word file.  You'd just replace the code that prints to the output file with code that instead inserts it into your database.
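
As a rough sketch of that replacement step, the sub below sends heading/body pairs to the database through a prepared statement. The table name doc_sections, its columns heading and body, and the sub name store_sections are all made up for illustration; $dbh is assumed to be an already connected DBI handle.

```perl
use strict;
use warnings;

# Hypothetical table: doc_sections(heading VARCHAR(255), body TEXT).
# $dbh is assumed to be a connected DBI handle, for example from
#   DBI->connect('dbi:mysql:database=docs', $user, $pass, { RaiseError => 1 });
sub store_sections {
    my ($dbh, $sections) = @_;    # $sections: array ref of [heading, body] pairs

    my $sth = $dbh->prepare(
        'INSERT INTO doc_sections (heading, body) VALUES (?, ?)'
    );

    my $count = 0;
    for my $section (@$sections) {
        my ($heading, $body) = @$section;
        # Placeholders leave the quoting of the text to the driver
        $sth->execute($heading, $body);
        $count++;
    }
    return $count;    # number of rows sent to the database
}
```

In the extraction loop you would collect the pairs in an array instead of printing them, then call store_sections($dbh, \@pairs) once at the end.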

But, I'm still a bit confused...  If you're going to save the data as "datatype text", then by definition you're going to lose "bold, indentation, colors, tables etc", as text does not, of course, include these attributes.

This is why I suggested that in the database you could associate each text (data) entry with a paragraph style, which does tell Word to use styling like bold, italics, indents, etc...
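
One way to picture this, and the original request to pool unlisted headings, is the sketch below. It works on plain [style name, text] records, so it runs without Word. Several things here are assumptions: that headings use style names starting with "Heading", the record shape, the sub name group_by_heading, and the '_other' key. With Win32::OLE the style name would presumably be read from each paragraph's Style property, as the linked wellho.net example does; that call is not exercised here.

```perl
use strict;
use warnings;

# Group paragraph records into sections keyed by heading text.
# Each record is [style_name, text]. A style name starting with "Heading"
# (an assumption about the document) marks the start of a new section.
# Headings that are not in the known list are pooled under '_other'.
sub group_by_heading {
    my ($records, $known_list) = @_;
    my %known = map { lc($_) => 1 } @$known_list;

    my (%sections, $current);
    for my $rec (@$records) {
        my ($style, $text) = @$rec;
        my $entry = { style => $style, text => $text };

        if ($style =~ /^Heading/i) {
            if ($known{ lc($text) }) {
                $current = $text;
            }
            else {
                # Keep the unlisted heading's title inside the pooled text
                $current = '_other';
                push @{ $sections{$current} }, $entry;
            }
            next;
        }

        next unless defined $current;    # skip text before the first heading
        push @{ $sections{$current} }, $entry;
    }
    return \%sections;
}

my $sections = group_by_heading(
    [
        [ 'Heading 1', 'HEADING 1' ],
        [ 'Normal',    'Sometext blah blah' ],
        [ 'Heading 1', 'SURPRISE' ],
        [ 'Normal',    'Unlisted text' ],
    ],
    [ 'HEADING 1', 'HEADING 2', 'HEADING 3' ],
);

printf "%s: %d paragraph(s)\n", $_, scalar @{ $sections->{$_} }
    for sort keys %$sections;
```

Because each stored entry keeps its style name next to its text, the style could go into its own database column and be reapplied later when the document is rebuilt.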

LVL 17

Expert Comment

by:mjcoyne
ID: 21856942
BTW, you mentioned you want to "save the text with exact rich text format as in word file".  Are these Word files saved as RTF files, or as MS Word's binary .doc format?