tomappu
asked on
Read Word doc and grab data
I need to read Word documents that contain certain headings, grab those headings and the text under each one, and insert them into a MySQL database without losing the Word formatting, using Perl or JavaScript. If there are a few new headings that are not in my list, they should all be put together in a single separate variable.
What format do you plan to use to maintain the format in your database?
For an example of how to extract data from MS Word documents, have a look at http://www.wellho.net/solutions/perl-using-perl-to-read-microsoft-word-documents.html. Adam's question is a good one, though...
I suppose you could, as in the example given at the link above, store the formats as paragraph style names extracted from the documents, which could then be re-created if needed, perhaps using something like Win32::Word::Writer (see http://search.cpan.org/~johanl/Win32-Word-Writer-0.02/lib/Win32/Word/Writer.pm).
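Once the paragraphs are out of the document, the grouping itself is simple. Here is a minimal sketch in JavaScript, assuming the extraction step has already produced an ordered list of `{ style, text }` pairs. That input shape, the `groupByHeading` name, and the "Heading N" style-name pattern are assumptions for illustration, not something taken from the linked example. It keeps the style name with each paragraph so the format could be re-created later, and it lumps everything under unlisted headings into one bucket.

```javascript
// Sketch only: `paragraphs` is a hypothetical array of { style, text }
// objects in document order, produced by whatever reads the Word file.
function groupByHeading(paragraphs, knownHeadings) {
  const known = new Set(knownHeadings.map((h) => h.toLowerCase()));
  const sections = {}; // heading title -> body paragraphs (with style names)
  const other = [];    // headings not in the list, plus their body text
  let current = null;  // bucket that receives the next body paragraph

  for (const { style, text } of paragraphs) {
    const trimmed = text.trim();
    if (trimmed === '') continue; // skip empty paragraphs

    // Assumption: heading paragraphs use a style named like "Heading 1".
    if (/^Heading\s*\d+$/i.test(style)) {
      if (known.has(trimmed.toLowerCase())) {
        sections[trimmed] = sections[trimmed] || [];
        current = sections[trimmed];
      } else {
        // Unknown heading: keep the heading itself and switch to the shared bucket.
        other.push({ style, text: trimmed });
        current = other;
      }
    } else if (current) {
      // Body paragraph; text before the first heading is ignored.
      current.push({ style, text: trimmed });
    }
  }
  return { sections, other };
}
```

Each entry in `sections` could then be serialized (for example as JSON) into a TEXT column, with `other` stored as the single catch-all value.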
ASKER
I will basically be using a MySQL database with the TEXT datatype for storing the data. What I meant by format was saving the text with the exact rich-text formatting of the Word file, for example bold, indentation, colors, tables, etc. I have attached an example file below. The file contains Heading1 and some text under it. How do I grab the text under each such heading? The grabbing is the main issue here; I can insert the data into the database myself.
1.doc
BTW, you mentioned you want to "save the text with exact rich text format as in word file". Are these Word files saved as RTF files, or as MS Word's binary .doc format?
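If you are not sure which one you have, the first few bytes of the file usually tell you. A rough sketch for Node.js follows; the function name `detectWordFormat` and its return labels are made up for illustration. RTF files begin with the text `{\rtf`, binary .doc files are OLE compound files that begin with the signature `D0 CF 11 E0 A1 B1 1A E1`, and a ZIP signature suggests a newer .docx container. Note that the OLE signature is shared with other Office formats, so a match only narrows things down.

```javascript
// Sketch only: classify a file by its leading "magic" bytes.
// `bytes` is a Buffer (or array of byte values) holding the start of the file.
function detectWordFormat(bytes) {
  const buf = Buffer.from(bytes);
  const OLE_SIGNATURE = Buffer.from([0xd0, 0xcf, 0x11, 0xe0, 0xa1, 0xb1, 0x1a, 0xe1]);
  const ZIP_SIGNATURE = Buffer.from([0x50, 0x4b, 0x03, 0x04]);

  // RTF is plain text and starts with the control word {\rtf
  if (buf.subarray(0, 5).toString('latin1') === '{\\rtf') return 'rtf';

  // OLE compound file: binary .doc, but also other legacy Office files
  if (buf.length >= 8 && buf.subarray(0, 8).equals(OLE_SIGNATURE)) return 'ole-binary';

  // ZIP container: possibly .docx
  if (buf.length >= 4 && buf.subarray(0, 4).equals(ZIP_SIGNATURE)) return 'zip';

  return 'unknown';
}
```

In practice you would read the first eight bytes of the uploaded file (for example with `fs.readFileSync(path).subarray(0, 8)`) and pass them in.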