• Status: Solved

Corrupted word doc - document.xml corrupted

This Word document contains a full semester of university work belonging to my friend. She had NO backups and the file was kept on a USB key (hard to believe but true).
I have tried a number of recovery tools including onlinerecovery.com, DataNumen, Corrupt Docx etc. They all fail.
I tried renaming it as .zip and extracting document.xml - no luck.
I tried a hex editor to see if I could see any text - nothing.
I'm not interested in recovering the images in the file (got those OK). I just need the text if possible (or can you point me to tools which might work?).
Thank you very much.
Dave Baldwin (Fixer of Problems) commented:
Since you Zipped it, it is not a valid 'docx' anymore.  I was able to open it with 7-zip and see a lot of XML files.  I can't tell if what you want is there but some are readable.
owenharris63 (Author) commented:
It was a docx simply renamed as a zip. It was never zipped.
That just tricks WinZip etc. into thinking it is a zip file.
If you treat it as a zip file (which it isn't), the culprit appears to be document.xml within the docx file. It says it's 2 MB in size, but attempting to extract it yields an error message.
My understanding is that Office 2010 saves .docx files in some sort of Open XML format.
Dave Baldwin (Fixer of Problems) commented:
It had a PK zip signature at the beginning of the file which is why I tried 7-zip.  I looked at another 'docx' file and it did too.  I didn't realize that 'docx' files were also zipped.

I had the same problem with 'document.xml' but the other 'xml' files seemed to open fine.  I couldn't extract or view 'document.xml'.
owenharris63 (Author) commented:
Yes, thanks Dave - that's the problem I am facing. This is one of my students who consistently gets top marks. This is worth 40% of her mark so I'm doing all I can to get this fixed for her.
Dave Baldwin (Fixer of Problems) commented:
It appears that the error is a ZIP error which probably means an error in the XML also.  It's the ZIP error that is preventing it from being extracted.
owenharris63 (Author) commented:
The file was NEVER zipped. Using WinZip is just a sneaky way of looking into the .docx file (which I understand is stored in Open XML format) to unpack and work around the broken bits. There seems to be a CRC integrity error in the file, so all the tools report that they can't find the text within this document. Whatever the error is, it is confusing both the Zip software and Word. Thanks anyway.
Dave Baldwin (Fixer of Problems) commented:
You misunderstand as I did.  I just checked about 20 'docx' files and they are all ZIP files.  This page http://en.wikipedia.org/wiki/Open_Packaging_Conventions talks about the convention of using PKZIP to 'package'  Office Open XML files like 'docx' and 'xlsx' files.

Normally, 'docx' files are XML packaged inside a PKZIP container.
As Dave says, the DOCX file format is just a collection of XML-based layout files and other files like images packaged into one using standard "zip" compression.  Here are all the files unzipped:
|   [Content_Types].xml
|
+---customXml
|   |   item1.xml
|   |   item2.xml
|   |   itemProps1.xml
|   |   itemProps2.xml
|   |
|   \---_rels
|           item1.xml.rels
|           item2.xml.rels
|
+---docProps
|       app.xml
|       core.xml
|       custom.xml
|
+---word
|   |   document.xml
|   |   endnotes.xml
|   |   fontTable.xml
|   |   footer1.xml
|   |   footnotes.xml
|   |   numbering.xml
|   |   settings.xml
|   |   styles.xml
|   |   stylesWithEffects.xml
|   |   webSettings.xml
|   |
|   +---media
|   |       image1.jpg
|   |       image2.jpg
|   |       image3.png
|   |       image4.emf
|   |       image4.png
|   |
|   +---theme
|   |       theme1.xml
|   |
|   \---_rels
|           document.xml.rels
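
As an aside, a listing like this, together with a check of which members refuse to decompress, can be produced with a few lines of Python. This is only a minimal sketch using the standard zipfile module; the function name find_damaged_members is made up for illustration, and it takes the whole .docx as bytes.

```python
import io
import zipfile
import zlib


def find_damaged_members(package_bytes: bytes) -> list[str]:
    """Return the names of members that cannot be fully read back."""
    damaged = []
    with zipfile.ZipFile(io.BytesIO(package_bytes)) as zf:
        for info in zf.infolist():
            try:
                with zf.open(info) as member:
                    # Read to the end so the CRC check at end-of-member runs.
                    while member.read(1 << 16):
                        pass
            except (zipfile.BadZipFile, zlib.error, EOFError):
                damaged.append(info.filename)
    return damaged
```

Called as find_damaged_members(open("some.docx", "rb").read()), it would be expected to name only document.xml for a file behaving like the one in this thread, since the other members open fine.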


The Word document does not open in Word and only shows a generic error on my XP SP3 PC with Office 2003 plus the Office 2007-2010-2013 file format converter installed.  I will try to debug it later.

Here's the details of the web page error I get when I view the extracted "document.xml" in the "XML Editor" that is part of MS Office and opens it with color-coded tagging and properly indented lines within Internet Explorer.

Message: An invalid character was found inside an entity reference.

Line: 2
Char: 1584323
Code: 0
URI: file:///C:/Documents and Settings/Bill/My Documents/Downloads/Reflective-Journal-Submission-/word/document.xml

The XML page cannot be displayed

Cannot view XML input using XSL style sheet. Please correct the error and then click the Refresh button, or try again later.

When I scroll to the end I see this where it is unable to display any more:

An invalid character was found inside an entity reference. Error processing resource 'file:///C:/Documents and Settings/Bil...

<w:b/><w:bCs/><w:sz w:val="52"/><w:szCs w:val="52"/></w:rPr><w:tab/></w:r><w:r ...

I will look to see if I can find this in a text editor later to see if there are any oddball characters in there, but it is probably just that the installed XSL style sheets on my PC are older and don't support newer ones.

Unfortunately I need to get to bed right now and don't know when I will be able to sit down and study it in detail.
Here are the document's properties, from App.xml and Core.xml:

Revision: 20
Created: 2013-08-31 02:53:00
Modified: 2013-10-15 11:41:00
TotalTime: 384
Pages: 21
Paragraphs: 80
Lines: 286
Words: 6036
Characters: 34411
Characters With Spaces: 40367
Application: Microsoft Macintosh Word
App Version: 14.0000
Doc Security: 0

The opening and closing XML tags that are used to contain text-based content in the file Document.xml are:
<w:t>the text content here</w:t>
When I open the Document.xml file after extracting the affected docx file to its own folder, I only have three instances of this tag:

<w:t>Reflective Journal</w:t>
<w:t>The Student's Real Name</w:t>
<w:t>Legal Advice Clinic</w:t>

From what I can see there is no other text present, and the bulk remainder of the contents seems to comprise embedded graphic data such as images and possibly WordArt, Autoshapes, or other drawn objects encoded into some kind of encoded data that resembles BASE64 or Unicode, as is commonly used to store embedded images in emails.
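
For anyone wanting to repeat the search for text runs on a file that is too damaged for an XML parser, a plain pattern match is enough. The following is a rough sketch, not a proper WordprocessingML reader: the function name extract_text_runs is hypothetical, and it assumes the extracted document.xml has already been read into a string.

```python
import html
import re

# Matches <w:t> and <w:t xml:space="preserve">, but not <w:tab/> or <w:tbl>.
TEXT_RUN = re.compile(r"<w:t(?:\s[^>]*)?>(.*?)</w:t>", re.DOTALL)


def extract_text_runs(document_xml: str) -> list[str]:
    """Return the content of every complete <w:t>...</w:t> run, in order."""
    return [html.unescape(run) for run in TEXT_RUN.findall(document_xml)]
```

Because it never builds a tree, it keeps working past invalid character references and stops only where the data itself stops, so a truncated file simply yields fewer runs.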

From what I know of Office documents that are saved for the web (as html), it can use VML (Vector Markup Language) to display drawn embedded graphic elements.  This is tagged in much the same way as XML, in that every coordinate, dimension, fill colour, border, gradient, etc is specified in separate tags.

Embedded images are normally just given unique IDs and are then cross-referenced to the actual image file names in other XML files which are all packed into the zipped-up monolithic DOCX file.

In this case it appears that some of the graphic content has been replicated as huge blocks of BASE64-like data for some mysterious "compatibility" reasons, and it has inflated the file size of what should be a small file into a pretty large one.  I can see this within the normal embedded graphic element tags:

o:gfxdata="UEsDBBQABgAIAAAAIQB ............ &#xA;LnhtbFBLBQYAAAPMAAxBgAAAAA=&#xA;"
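
One thing that can be checked directly from the snippet above: its first characters decode, as Base64, to the bytes "PK" 03 04, which is the signature that starts a ZIP local file header. That suggests (but on its own does not prove) that each gfxdata block is itself a small zipped package stored as Base64. A minimal sketch, with a made-up function name, that strips the &#xA; line-feed references and tests the start of such a value:

```python
import base64


def gfxdata_starts_with_zip(value: str) -> bool:
    """True if the Base64 text begins with the ZIP local file header signature."""
    cleaned = value.replace("&#xA;", "").replace("\n", "").strip()
    head = cleaned[:16]  # 16 Base64 characters decode to exactly 12 bytes
    return base64.b64decode(head).startswith(b"PK\x03\x04")
```

Only the first 16 characters are decoded, so damage further into the block does not affect the check.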

The Document.xml file seems to end abruptly with such a huge block of data without having the correct XML closing tags like:


to close the starting tags


Bear in mind that this may have been cut off when unzipping the DOCX file, when the Document.xml file still within the Word document may actually be complete but with some odd character that cannot be deciphered to open it normally or with WinZip.

I will try and add these closing tags later and see what happens.

I found a previous discussion which talks about what I discovered with the graphic data:

I will try and mess with the data and see what I can ascertain, but on the face of it I believe that the originator has lost all the textual content from this document.

I suppose it is possible that something like this could have been caused by opening, saving, reopening, and resaving the same document on different operating systems, in different versions of MS Word, and perhaps also other MS Office alternatives like OpenOffice or LibreOffice.  Imagine that the originator created this in Word 2010 on a Mac and saved it as DOCX, then later opened it in OpenOffice on a Linux computer and accidentally resaved it as an *.ODF document, then later went back and resaved it to *.DOCX again.  This would be especially critical if, during one of the saves, a conversion didn't complete while writing back to a slow or damaged Flash Drive which was ejected too early.

There are loads of possibilities, but in my opinion the logical step right now would be to immediately cease using the USB Flash Drive and try and hunt down some previous saved versions of the document using a data recovery program, targeting the USB Flash Drive and any other computers that the originator may have had this document opened in or saved to.

A simple, and sometimes effective, "undelete" program that does just what it says is Recuva by Piriform (http://www.piriform.com/recuva); however, my choice of retail data recovery program is GetDataBack by Runtime.org (http://www.runtime.org/data-recovery-software.htm).

If trying to recover files from a computer's hard drive, the recommended way is to install the software onto another computer and then attach the drive to be searched as a slave drive on that host, so as to avoid accidentally overwriting the areas containing files marked as deleted which have yet to be populated by new files.

Maybe this is an area where experts who hang around the Digital Forensics topic area would be able to help, seeing as they reconstruct damaged files on a regular basis.  You could remove the "Microsoft Office Suite" zone from this question and add the Digital Forensics one (http://www.experts-exchange.com/Security/Digital_Forensics/) and see if you get any suggestions there, but I don't think even they will see any more text-based content in the file than I have seen.

If the author has ever printed this document to paper, in any of its revision states, then scanning the hardcopy with an OCR application would at least recover some of the content.  If it has ever been "printed" to PDF, then it may be possible to extract the content to standard text and images, but it all depends on what created the PDF and whether it wrote proper text to the PDF or just dumped one "screenshot" type image of the screen into it.

owenharris63 (Author) commented:
Wow, thanks BillDL. That's much appreciated. There were 4 graphics in the file, which were recovered but weren't needed anyway. Unfortunately for my student she didn't have any backups (who doesn't know to do backups in this day and age?) and had everything on her flash drive. She was using a MacBook Pro, I believe.
I will do as you have suggested.
owenharris63 (Author) commented:
OK, dumb question. How do I add tags and topics to an open question? I can't see any options here and nothing in the help section (and I'm guessing it's in there somewhere).
Teksquisite (Security Technology Editor) commented:
Same thing here (as above) - able to view graphics only and the error with hexadecimal digit expected - .xml. It is going to take someone with forensics software to work with this puppy.
owenharris63 (Author) commented:
Thanks anyway! :)
owenharris63 (Author) commented:
Bill suggested above: You could remove the "Microsoft Office Suite" zone from this question and add the Digital Forensics one (http://www.experts-exchange.com/Security/Digital_Forensics/)
owenharris63 (Author) commented:
I didn't know how to do that so I opened a new question and linked it to this one.
owenharris63 (Author) commented:
I agree with Teksquisite. That was indeed above and beyond the call of duty and much appreciated. My student is desperate so anything anyone can do is appreciated (in fact I only ever post here if I have exhausted all other avenues). If it wasn't for my student I would have closed this and awarded it to Bill.
And sorry for posting a new one.
You may try DataNumen Word Repair at


to see if it can repair your file. Good luck!
OK, an update, but not the results you and the Author were praying for.  It is incomplete and only contains the facing "cover" page.
Also, footnotes extracted from "footnotes.xml"

The Hon A.M. Gleeson AC, "Advocacy", Paper to NSW Bar Association's Bar Practice Course, November 2001

The Discipline of the Law, Butterworths, London, 1979, p.7-8

Address upon the occasion of first presiding as Chief Justice at Melbourne on 7th May, 1952 in Jesting Pilate, the Law Book Co, Melbourne, 1965 p.251.

Here are the images used.  You might want to suggest to the student that it is a much better idea to first resize images to be embedded rather than embedding huge images and squashing them to fit.

First of all, a brief summary of where we have reached.

Old Word 97/2000/2002/2003 *.doc files are single files with all the contents embedded into that file in various formats (binary, unicode, and more).  Word 2007 and upwards *.docx files use multiple separate XML files to declare unique IDs and thereafter these are used to create interdependencies and relationships between all the various forms of data and to act as "layout" templates where placement, size, behaviour, etc of all the elements in the document are specified in tags.  Every element must have an <opening:tag> and a </closing:tag>, or given as a <standalone:tag/>.   It is something like the HTML code of a web page being loaded by a browser and bringing together all the other referenced files from internal or external sources and displaying them all as originally intended.

A *.docx file is packed using standard "Zip" compression.  This can be seen quite easily by opening a docx file in a plain text editor and looking down to the end for the "PK" markers that accompany the list of files shown with the relative paths to their containing folders inside the archive, like this extract shows:


When a *.docx file is opened, the separate files are temporarily unpacked, just as most Unzipping programs do.  When the contents are edited, the modified XML files are all repacked back to the single docx file again.  The same thing is usually also possible with a standard Zip/Unzip app like WinZip, 7-Zip, WinRAR, etc, ie. internal files can be updated in-place via the program's graphic interface or command line.

Zip compression is such that to unzip files it needs to parse each file and reconstruct it again to the original format.  Any errors with certain files that prevent them being parsed can halt the whole unzipping process, and this is what was happening when trying to open the affected docx file in Word (and other word processing apps like OpenOffice and LibreOffice), and also with Unzipping programs when renamed as a *.zip file.  The individual files in the docx file are all visible in WinZip, and as a folder/file tree in programs like my old "QuickView Plus", and all are viewable apart from document.xml, so it is not the actual zip container which is damaged.

Being a zip file containing all the layout and resource files in separate sub-folders, and with errors pointing to "document.xml" as being corrupt, the logical proposal would be to do one of two things:

1. Open the *.docx file as though it was a zip file, open document.xml in a text editor from within the archive and fix it, then finally save it back into the archive so that it updates the file in there with the changes


2. Unzip the *.docx file, open the uncompressed document.xml from the folder it was unzipped into with a text editor and fix it, save the file, and then create a new package.
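
The "create a new package" step in option 2 can also be scripted. Here is a minimal sketch with Python's standard zipfile module; the function name replace_member is hypothetical, and it copies every member except the named one, so the damaged member never has to be decompressed.

```python
import io
import zipfile


def replace_member(package_bytes: bytes, name: str, new_content: bytes) -> bytes:
    """Rebuild the package, swapping in new content for one member."""
    out = io.BytesIO()
    with zipfile.ZipFile(io.BytesIO(package_bytes)) as src, \
            zipfile.ZipFile(out, "w", zipfile.ZIP_DEFLATED) as dst:
        for info in src.infolist():
            if info.filename == name:
                # Write the repaired content instead of reading the bad member.
                dst.writestr(name, new_content)
            else:
                dst.writestr(info, src.read(info))
    return out.getvalue()
```

The returned bytes would be written out under a .docx name. Whether Word accepts the result still depends entirely on the repaired XML being valid.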

The following problems hampered this:

1. Renamed to a Zip file, Winzip and other common "unzipping" applications errored out and were unable to properly read and decompress the affected XML file from within the archive.  WinZip reported of document.xml: "Invalid compressed data to expand (inflate) the file".

2. Word was finding issues with bad syntax in that there seemed to be erroneous "hex" values, but there were other issues with the completeness of the contents.
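
When an unzipping tool gives up with an "inflate" error like the one in point 1, it is still possible to decompress the member by hand and keep whatever comes out before the damaged region. The sketch below is an illustration under assumptions, not what any of the tools mentioned here actually do: the name salvage_member is invented, it only handles deflate-compressed members, and it assumes the archive's file list (central directory) is still readable, as it was in this case.

```python
import io
import struct
import zipfile
import zlib


def salvage_member(package_bytes: bytes, name: str, chunk_size: int = 1024) -> bytes:
    """Inflate as much of one ZIP member as possible, stopping at the first bad chunk."""
    with zipfile.ZipFile(io.BytesIO(package_bytes)) as zf:
        info = zf.getinfo(name)
    if info.compress_type != zipfile.ZIP_DEFLATED:
        raise ValueError("this sketch only handles deflate-compressed members")

    # A local file header is 30 fixed bytes, then the file name and extra field.
    header = package_bytes[info.header_offset:info.header_offset + 30]
    name_len, extra_len = struct.unpack("<HH", header[26:30])
    start = info.header_offset + 30 + name_len + extra_len
    compressed = package_bytes[start:start + info.compress_size]

    inflater = zlib.decompressobj(-zlib.MAX_WBITS)  # raw deflate, no zlib header
    recovered = bytearray()
    for pos in range(0, len(compressed), chunk_size):
        try:
            recovered += inflater.decompress(compressed[pos:pos + chunk_size])
        except zlib.error:
            break  # keep everything decoded before the damaged region
    return bytes(recovered)
```

The output is not CRC-checked, so the tail of what it returns may be garbage, and nothing after the damaged point is recovered. A small chunk size is used so that less good data is lost with the chunk that raises the error.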

I use a little program named "Universal Extractor" to unpack loads of different packaged and compressed files.  It uses free and open source 3rd-party decoders to parse the different formats of files, and it was able to unpack the affected docx file without error.

It was immediately apparent that there were massive blocks of encoded data within "document.xml" similar to the sort of code that is used to store embedded digital images in emailed messages and web pages saved as single *.MHT files.  It was inside these blocks of code where hex character references, normally specified as &#xAAA; (where the AAA has to be a number/letter or combination of letters/numbers from 0–9 and/or A to F - http://myhandbook.info/codes_htmlchr.html), used letters beyond F, eg. &#xGS5; and in some cases were not terminated by the semicolon, and would therefore not be valid.
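
Rather than hunting for these by eye, the bad references can be located with a short scan. This is a minimal sketch (the function name find_bad_hex_refs is made up) that reports the offset of every "&#x" sequence not followed by one or more hex digits and a closing semicolon:

```python
import re

VALID_HEX_REF = re.compile(r"&#x[0-9A-Fa-f]+;")


def find_bad_hex_refs(xml_text: str) -> list[int]:
    """Return offsets of '&#x' sequences that are not well-formed hex references."""
    bad = []
    pos = xml_text.find("&#x")
    while pos != -1:
        if not VALID_HEX_REF.match(xml_text, pos):
            bad.append(pos)
        pos = xml_text.find("&#x", pos + 3)
    return bad
```

For example, in the string 'a&#xA;b&#xGS5;c&#x41d' it flags offsets 7 and 15: the reference with letters beyond F and the one missing its semicolon.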

It appears that this block of encoded bloat is superfluous replication of embedded image data that may only be needed in rare compatibility cases.

Trying to run a file like this through an XML Syntax Checker or code validator hangs it because of all the encoded stuff it has to parse, and validators rarely go beyond the first error encountered, so you have to fix one, validate, fix, ad nauseam.

I deleted all of these blocks of BASE64/UUENCODE junk in a text editor ready to save it as a slimmed-down XML file, but realised that the "document.xml" file was incomplete.  It was abruptly truncated immediately after, or in the middle of, one of these massive blocks of gobbledegook code and there were therefore no closing tags for about 15 of the opening tags.

I saved the file and kept adding missing closing tags until I got it to open without error in LibreOffice, then saved it out as a DOC file, but sadly the main content beyond the facing page with the logos appears to have been down beyond the point where document.xml was abruptly truncated.
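
Working out which closing tags are missing can be automated by tracking the open tags in order. The sketch below is a rough illustration with an invented function name, missing_closing_tags. It assumes any half-written tag at the point of truncation has already been trimmed off by hand, and it does not cope with ">" inside attribute values.

```python
import re

# Groups: "/" if closing tag, tag name, attributes, "/" if self-closing.
TAG = re.compile(r"<(/?)([\w:.-]+)([^<>]*?)(/?)>")


def missing_closing_tags(xml_text: str) -> str:
    """Return the closing tags needed to balance a truncated XML fragment."""
    stack = []
    for closing, name, _attrs, self_closing in TAG.findall(xml_text):
        if self_closing:
            continue  # <w:b/> style tags open and close themselves
        if closing:
            if stack and stack[-1] == name:
                stack.pop()
        else:
            stack.append(name)
    # Innermost element must be closed first.
    return "".join(f"</{name}>" for name in reversed(stack))
```

Appending the returned string to the trimmed file gives something that is at least balanced, which is the same result as adding the tags one at a time until the file opens.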

Now, what you must acknowledge is that I am not a forensic scientist specialising in digital data.  Far from it actually, but I enjoy tinkering at an advanced amateur level.  It IS possible that the utility I used to extract all the files from the affected DOCX file reached a problem area in the document.xml code and only extracted it down to that point.  Trying to examine the contents of a ZIP or DOCX file in a hex editor is something that I am not equipped to try and decipher.  It is just an endless lump of code without legible text strings because of the compression used.

I wish I could have given you good news and presented a fully recovered document to save your student's skin.

owenharris63 (Author) commented:
Bill went above and beyond on this one and did as much as possible without asking the NSA or someone on an American spy show (who can instantly hack in, instantly guess passwords, and recover data with seconds to spare in real time while people are shooting etc). I really appreciate your efforts. Thank you!
Thank you Owen.  I appreciated the desperation of the plight and was desperate to try and help.  While I was doing evening classes in computing at the local college quite a number of years back, only a few people had expensive 16MB USB Flash Drives or laptops (yes, that long ago), so we had to save docs to floppies and do lots of zipping/unzipping and conversion  back and forward between different versions of Word, Excel, and Access.  I lost a very lengthy and important document days before I needed to have it submitted for the last "exam" and had to type it all again because my backup was corrupt.  I commiserate with the student and hope she is able to recreate it.
Question has a verified solution.
