Using Word's XML Structure to Store My Own Information

DrTribos
Hello Experts,

I'm looking at creating a simple word processor for developing a specific document type.  At present I am using MS Word to create the documents.  Going forward I would like to still be able to use MS Word as a viewer.

So what I thought was:
create a custom xmlFolder in the root of a dotx file (as described in www.experts-exchange.com/Q_24284711)

This seems to work (after figuring out that there must not be any spaces in the filenames AND that each file must have an .xml extension).

This solution seems great.  My application would be able to treat the word file as a 'zip-like' archive, go straight to my target folder allowing me to edit etc.  I can then render my data as a word document allowing it to also open in MS Word.  Problem is that if someone opens AND saves the file in MS Word my custom folder is deleted.  Opening the file and then closing it seems to have no impact; it's saving that seems to be the problem.  
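
The 'zip-like' archive idea can be sketched as follows. This is only an illustration in Python using the standard zipfile module, not a tested recipe for Word: the part name myData/notes.xml and the helper names are made up for the example, and a tiny stand-in archive is used in place of a real .dotx.

```python
import io
import zipfile


def add_custom_part(package_bytes, part_name, payload):
    """Return a copy of the zip package with one extra part added or replaced."""
    out = io.BytesIO()
    with zipfile.ZipFile(io.BytesIO(package_bytes)) as src, \
            zipfile.ZipFile(out, "w", zipfile.ZIP_DEFLATED) as dst:
        for item in src.infolist():
            # Copy every existing part except an older copy of our own part.
            if item.filename != part_name:
                dst.writestr(item, src.read(item.filename))
        dst.writestr(part_name, payload)
    return out.getvalue()


def read_custom_part(package_bytes, part_name):
    """Return the part's bytes, or None if the part is no longer in the package."""
    with zipfile.ZipFile(io.BytesIO(package_bytes)) as pkg:
        if part_name in pkg.namelist():
            return pkg.read(part_name)
    return None


# Stand-in for a real .dotx: any zip archive behaves the same way here.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as z:
    z.writestr("word/document.xml", "<doc/>")

# Hypothetical custom part name, chosen only for this example.
package = add_custom_part(buf.getvalue(), "myData/notes.xml", b"<notes>draft</notes>")
print(read_custom_part(package, "myData/notes.xml"))
```

A None result from read_custom_part is one way the application could notice that the custom folder was dropped when the file was saved elsewhere.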

I think this is related to:
www.experts-exchange.com/Q_25349248
www.experts-exchange.com/Q_27559661
But I thought I was doing something different than using xml tags...

My questions:
- what is my best option for embedding my own data in a file and being able to open that file in MS Word?
- should I use a different format, e.g. LibreOffice or OpenOffice, that would still open in MS Word?

Interested to know what pitfalls I might expect.

Look forward to your thoughts

Commented:
The ODF format used by LibreOffice, OpenOffice.org, AbiWord, etc. is cleaner than Microsoft's equivalent XML format, and it can easily be converted to one of the many MS DOC formats. You will probably find it more comfortable to make a detour through ODF.

There is another problem. When you start editing a file with MS Word or LibreOffice, a locking mechanism prevents several users from editing the same file at the same time. Look at the files that appear in your directory while you edit a file with MS Word or LibreOffice. One way to lock the file you want to modify could be to imitate that behaviour by creating a valid lock file of the same kind.

Another way could be to make the file you are changing read-only between checking it out and checking it back in after the modification.
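
A minimal sketch of the read-only idea, assuming Python and plain file permissions (the helper name set_read_only and the temporary demo file are invented for the example). Note that this flag is only advisory: a user can still clear it or use "Save As", so it discourages accidental overwrites rather than preventing them.

```python
import os
import stat
import tempfile


def set_read_only(path, read_only=True):
    """Clear (or restore) the write permission bits on a file."""
    mode = os.stat(path).st_mode
    write_bits = stat.S_IWUSR | stat.S_IWGRP | stat.S_IWOTH
    if read_only:
        os.chmod(path, mode & ~write_bits)
    else:
        # Restore write access for the owner only.
        os.chmod(path, mode | stat.S_IWUSR)


# Demo on a throwaway file standing in for the exported document.
fd, demo_path = tempfile.mkstemp(suffix=".docx")
os.close(fd)

set_read_only(demo_path)
locked = not (os.stat(demo_path).st_mode & stat.S_IWUSR)

set_read_only(demo_path, False)
unlocked = bool(os.stat(demo_path).st_mode & stat.S_IWUSR)

os.remove(demo_path)
print(locked, unlocked)
```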

Author

Commented:
Thanks for your comment.  I will need to do some testing with ODF and how it opens in MS Word.  The thing I want most is for my embedded folder to NOT get deleted.  

I did not make it clear in my question that I only want the file to open in MS Word, Open Office etc. for the purpose of viewing.   Editing should be done in a small purpose built word processor (which is planned to be written in Java).  

The idea is that the word processor will be able to export to DOCX, ODF, HTML, PDF etc. so that document creators can distribute their documents to colleagues without needing to provide them with the special word processor.  Ideally their colleagues would be able to add comments to the document and send it back for consideration.

I am expecting there will be many users of this purpose-built document, and not all of them will have a DMS, so checking in and checking out cannot be relied on as a solution.

The read-only option is something that we will need to do, although I am not sure what the best way to implement it is.

"When you start editing a file with MS-Word or LibreOffice, there are locking systems preventing several users to edit the same file at the same time."
I'm not really expecting this to be an issue, as it is not intended that these files are edited outside of my purpose-built word processor...

Commented:
Your idea of making a small purpose-built editor is interesting. There is already a framework for doing that: the TinyMCE editor. It is not Java but JavaScript. It is easy to integrate into a web application. See http://www.tinymce.com/

If that application produces flat ODF documents (i.e. not zip compressed XML), it is easy to convert these into MS-DOC, PDF or whatever.

Author

Commented:
My documents require very specific formatting and use bookmarks, macro buttons, specific style and layout.  It is currently in MS Word.  The problem is inexperienced Word users (almost everyone) keep on breaking the documents.  Oh, and experienced Word users who like to tinker break them as well...

Will look at the JavaScript link, thanks.

Author

Commented:
kewl - http://www.tinymce.com/tryit/full.php  but seems to run on server, not PC (still kewl though)

Commented:
TinyMCE doesn't run on the server-side, but on the client-side. It runs inside of your browser. So, for starting it, you just point your browser to a local file.

If you find that your browser is too heavy to launch, with a little bit of Java knowledge you can write a small app that opens a window on your PC with the JavaScript running inside.

References for running standalone Javascripts:

Author

Commented:
Thought I posted this last night... From TinyMCE

The examples might not work properly on the local file system due to security settings in your browser. Please use a real webserver

Can't seem to adjust the security settings to fix.  Anyway, I think this question has gone somewhat off topic!

Does anyone have experience with creating a custom structure in a DOCX or ODF file and working with MS Word?
Will the custom structure always get overwritten?
Is this happening because of the US Lawsuit?
Would making the file read-only prevent an overwrite, or can that fail on occasion?
Is there a way to stop the file from being saved without using read-only?

Thanks,

Author

Commented:
Could Office File Validation become an issue?  (http://technet.microsoft.com/en-us/library/gg985445%28v=office.12%29.aspx)

About Office File Validation

Office File Validation helps detect and prevent a kind of exploit known as a file format attack or file fuzzing attack. File format attacks exploit the integrity of a file, and they occur when someone intentionally modifies the structure of a file to add malicious code. Usually the malicious code is run remotely and is used to elevate the privilege of restricted accounts on the computer. As a result, attackers could gain access to a computer that they did not previously have access to. This could enable an attacker to read sensitive information on the computer’s hard disk drive or install malware, such as a worm or a key logging program. The Office File Validation feature helps prevent file format attacks by scanning and validating files before they are opened and then notifying the user if the file may have been compromised.

To validate files, Office File Validation compares a file’s structure to a predefined file schema, which is a set of rules that determine what a readable file resembles. The file does not pass validation if Office File Validation determines that a file’s structure does not follow all rules that are described in the schema.

To run Office File Validation on either Office 2003 or Office 2007 you must first apply the Office File Validation files to the computers that are running either Office 2003 or Office 2007.

Commented:
I have good experience creating ODF files with xsltproc, starting from XML or HTML input. xsltproc is a processor that transforms XML into XML. It understands a language called XSL.

The fastest way to be productive is to make an ODF file with the general structure of what you want to get. You create that file with LibreOffice, save it as flat XML (with the extension .fodt), and copy this template into the XSL code.

I attach an example of an ODF file in flat XML containing "Hello world".

A lot of the material at the beginning of the file is not needed. If it bothers you, you can experiment by cutting parts of it out.

The body of the document sits at the bottom:
 <office:body>
  <office:text>
   <text:sequence-decls>
    <text:sequence-decl text:display-outline-level="0" text:name="Illustration"/>
    <text:sequence-decl text:display-outline-level="0" text:name="Table"/>
    <text:sequence-decl text:display-outline-level="0" text:name="Text"/>
    <text:sequence-decl text:display-outline-level="0" text:name="Drawing"/>
   </text:sequence-decls>
   <text:p text:style-name="Standard">Hello world!</text:p>
  </office:text>
 </office:body>

Now a typical XSL program that changes the first paragraph of the document body into a paragraph containing "Goodbye, folks!" could look like:
<?xml version="1.0" encoding="utf-8"?>

<xsl:stylesheet
  version="1.0"
  xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
  xmlns:office="urn:oasis:names:tc:opendocument:xmlns:office:1.0"
  xmlns:text="urn:oasis:names:tc:opendocument:xmlns:text:1.0">

  <xsl:output method="xml" indent="yes" encoding="utf-8"/>

  <xsl:template match="@*|node()">
    <!-- identity transformation: copies every node that no other template matches -->
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>

  <xsl:template match="office:text/text:p[1]">
      <!-- applies only to the first paragraph of the body text; other paragraphs are copied unchanged -->
      <text:p text:style-name="Standard">Goodbye, folks!</text:p>
  </xsl:template>

</xsl:stylesheet>

This XSL code was saved into a file called change-some-paragraph.xsl, and I ran it by invoking xsltproc with the following arguments:
xsltproc -o goodbye.fodt change-some-paragraph.xsl hello.fodt

The result was a .fodt file containing "Goodbye, folks!" instead of "Hello world!". The next step is to convert the .fodt into .doc, which can be done with scripts like unoconv.
hello.fodt.txt
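
For comparison, the same replacement can be sketched without XSLT. This is a different technique from the stylesheet above: plain Python with the standard xml.etree.ElementTree module, run on a cut-down stand-in for hello.fodt (the HELLO string and the function name are invented for the example). Prefixes that are not registered here are rewritten by ElementTree as ns0, ns1 and so on, so treat it as an illustration for small files rather than a replacement for xsltproc.

```python
import xml.etree.ElementTree as ET

OFFICE = "urn:oasis:names:tc:opendocument:xmlns:office:1.0"
TEXT = "urn:oasis:names:tc:opendocument:xmlns:text:1.0"

# Keep the familiar prefixes when the tree is written back out.
ET.register_namespace("office", OFFICE)
ET.register_namespace("text", TEXT)


def replace_first_paragraph(fodt_xml, new_text):
    """Replace the content of the first body paragraph of a flat ODT string."""
    root = ET.fromstring(fodt_xml)
    first = root.find(f"{{{OFFICE}}}body/{{{OFFICE}}}text/{{{TEXT}}}p")
    if first is not None:
        # Drop any child elements (spans etc.) and set the new plain text.
        for child in list(first):
            first.remove(child)
        first.text = new_text
    return ET.tostring(root, encoding="unicode")


# Cut-down stand-in for hello.fodt, with a second paragraph added
# to show that only the first one is changed.
HELLO = (
    f'<office:document xmlns:office="{OFFICE}" xmlns:text="{TEXT}">'
    '<office:body><office:text>'
    '<text:p text:style-name="Standard">Hello world!</text:p>'
    '<text:p text:style-name="Standard">Second paragraph</text:p>'
    '</office:text></office:body></office:document>'
)

goodbye = replace_first_paragraph(HELLO, "Goodbye, folks!")
print(goodbye)
```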

Author

Commented:
Hey - thanks for that!  I'll take a look.  Today I was supposed to finish early, but I got hammered with 'stuff'.  Clocking off now.  Cheers,

Author

Commented:
Thanks for your help - this gives me a great start