karnovsk

Problem reading MS Word 2000 files as binary stream

I need to read MS Word 2000 files as a stream of bytes.

However, whatever I do (e.g. using read function), I get only 6 bytes. Interestingly, these bytes are the same for all files!

I have no problem reading other types of binary files, including earlier versions of Word.

What is it? Some trick of MS?
vermeylen

Hi,
The "debug" utility (still available on Windows 2000!) shows that MS Word 2000 files have an "EOF" character as the 7th byte. From a command prompt, try:
c:\>debug myword.doc
-d
(Enter d at the "-" prompt.)
A hex dump of the Word document is printed. The first 8 bytes are:
D0 CF 11 E0 A1 B1 1A E1
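1A (Ctrl-Z, which DOS and Windows treat as End of File when a file is read in text mode) is the seventh byte, which is why a read stops after 6 bytes.
Dumping a little further (enter "d" at the "-" prompt again, "q" to quit) showed that the EOF character appears every now and then in the Word document.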
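
For anyone without debug at hand, the same check can be sketched in Perl. This is only an illustration: the sub name header_hex is made up for this example, and it assumes the handle is switched to binary mode with binmode() so that the dump is not cut short at the 1A byte.

```perl
use strict;
use warnings;

# Return the first $n bytes of a file as a hex string such as "D0 CF 11".
# header_hex is a hypothetical helper name, not part of any library.
sub header_hex {
    my ($path, $n) = @_;
    open(my $fh, '<', $path) or die "Cannot open $path: $!";
    binmode($fh);                          # raw bytes, no text-mode handling
    my $got = read($fh, my $buf, $n);      # number of bytes read, undef on error
    die "Read failed on $path: $!" unless defined $got;
    close($fh);
    return join ' ', map { sprintf '%02X', ord($_) } split //, $buf;
}

# Only runs when a file name is given on the command line.
if (@ARGV) {
    print header_hex($ARGV[0], 8), "\n";
}
```

Run against a Word 2000 document, this should print the same eight bytes that debug shows.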
The following script reads until EOF, prints the characters, positions the file pointer after the EOF character and continues until the next EOF:

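$pos = 0;
open(DOC, "c:\\temp\\test.doc") or die "Cannot open file: $!";
while (1 == 1) {
    # Reposition just past the last EOF character and resume reading
    seek DOC, $pos, 0;
    # defined() is needed: a plain truth test would also stop on the character "0"
    while (defined($char = getc(DOC))) {
        $pos++;
        print $char;
    }
    $pos++;   # skip the EOF character itself
    print "\nEnd of file character found, continue? (CTRL-C to quit)\n";
    $reply = <STDIN>;
}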

However I have no clue when the MSWord file really reaches End of File...
Dirk
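
A likely explanation for the behaviour described above: on Windows, Perl opens files in text mode by default, and in text mode a 1A byte (Ctrl-Z) is taken as end of file. Calling binmode() on the handle switches that off, so there should be no need to seek past each EOF character. Here is a minimal sketch, assuming the whole document fits in memory; the sub name read_binary and the script name readdoc.pl are made up for this example.

```perl
use strict;
use warnings;

# Read a whole file as raw bytes. binmode() turns off text-mode
# handling, so 1A (Ctrl-Z) and CR/LF pairs arrive untouched.
# read_binary is a hypothetical helper name.
sub read_binary {
    my ($path) = @_;
    open(my $fh, '<', $path) or die "Cannot open $path: $!";
    binmode($fh);
    local $/;                  # slurp mode: read up to the real end of file
    my $bytes = <$fh>;
    close($fh);
    return defined $bytes ? $bytes : '';
}

# Usage (hypothetical script name): perl readdoc.pl c:\temp\test.doc
if (@ARGV) {
    my $data = read_binary($ARGV[0]);
    printf "Read %d bytes; size on disk is %d\n", length($data), -s $ARGV[0];
}
```

If the two numbers printed match, the file was read to its true end, which also answers the question of where the document really stops: at the size reported by the file system, not at the first 1A byte.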
ASKER CERTIFIED SOLUTION
karnovsk

Dave Cross
No comment has been added lately, so it's time to clean up this TA.
I will leave a recommendation in the Cleanup topic area that this question is:

PAQ/Refund

Please leave any comments here within the next seven days.

PLEASE DO NOT ACCEPT THIS COMMENT AS AN ANSWER!

davorg
EE Cleanup Volunteer