Convert FO to DOCX

Posted on 2014-01-02
Last Modified: 2014-01-03
Could somebody recommend a good library (other than XMLmind FO Converter) that can convert XSL-FO files to DOCX?

It has to be for Windows (native or .net), but not Java.
Question by:zc2
LVL 60

Expert Comment

by:Geert Bormans
ID: 39753436
Hi zc2,

I tend to stay far away from the usual free stuff that promises the world but never does what you need. Personally, I am done with the tools that do all sorts of crap magic to make a Word file look remotely like the PDF you intended to get out of the FO... they make the docx file unsuitable for further processing.

Off the top of my head: Ecrion generates Word from FO (but I have never been really happy with the results; it might have improved since). As far as I know, none of the other big FO processor vendors ever bothered with Word (FOP tried a bit for RTF, but you did not want Java).

If this is for serious production work you might want to look at
My co-workers use Aspose for all sorts of Word document normalisation in our workflows.
It is not free however and it is not a generic XSL-FO to docx transformer.

But it gives you a reliable Word object model to program against and it gives you control over styles, so I assume that, given a bit of XSLT to clean out some of the FO overhead and a few style definitions, you can make this a reliable transformer for predictable documents.

Personally I have done quite a few workflows in the past that have XSL-FO in the middle. You can enrich your XSL-FO with smart ids, out-of-namespace class attributes and processing instructions that don't hinder the PDF generation but help you with the next steps... in a way, you pass "style" information transparently through the XSL-FO... maybe that helps to keep the next steps lighter.
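
To make the idea of passing style hints through the FO a bit more concrete, here is a minimal Python sketch that reads such hints back out. The `hint` namespace URI, the `hint:class` attribute and the style names are made up for illustration, and the sample is a trimmed fragment rather than a complete FO document. Whether your FO processor really tolerates foreign-namespace attributes is something to test with your own toolchain.

```python
import xml.etree.ElementTree as ET

FO_NS = "http://www.w3.org/1999/XSL/Format"
HINT_NS = "urn:example:word-hints"  # hypothetical namespace, not part of XSL-FO

# Trimmed FO fragment (assumption: real input would also carry page masters etc.)
SAMPLE_FO = """\
<fo:root xmlns:fo="http://www.w3.org/1999/XSL/Format"
         xmlns:hint="urn:example:word-hints">
  <fo:block id="sec-1" hint:class="Heading1" font-size="18pt">Introduction</fo:block>
  <fo:block id="par-1" hint:class="BodyText">Some body text.</fo:block>
  <fo:block id="par-2">No hint on this one.</fo:block>
</fo:root>
"""


def collect_hints(fo_xml, default="Normal"):
    """Return (block id, intended Word style) pairs in document order."""
    root = ET.fromstring(fo_xml)
    hints = []
    for block in root.iter(f"{{{FO_NS}}}block"):
        # Blocks without a hint fall back to the default style name.
        style = block.get(f"{{{HINT_NS}}}class", default)
        hints.append((block.get("id"), style))
    return hints


hints = collect_hints(SAMPLE_FO)
for block_id, style in hints:
    print(block_id, "->", style)
```

The same lookup could just as well live in the XSLT that prepares the FO for the Word step; the point is only that the hint travels with the block.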

If you are looking for a good and generic FO2DOCX, I don't think this is helpful. If you are looking for a tool that might give you reliable docx files stemming from an XSL-FO you control yourself, and you don't mind doing some handwork yourself, I think Aspose is something worth looking at.

Good luck

LVL 18

Author Comment

ID: 39754098
Geert, thank you for such a comprehensive answer.

So, do I understand it correctly: even if we purchase the Aspose developer license (we need a royalty-free license for redistribution), we still need to write code that reads the FO tree and then calls the Aspose API methods for each "Formatting Object" in the input?

Currently I'm trying to implement the same using the MS Open XML SDK (it seems the SDK does not require MS Office to be installed, at least my tests tell me so). But I have to work at a low level, dealing with WordprocessingML objects, which are not exactly a pleasure to work with.

Another problem here: FO objects can be nested, but as I understand it, WordprocessingML is a flat structure.

The input FO is produced by us, so I could add some additional processing instructions there if that would help (I hope it will not affect the other processor, which produces PDF from the same FO), but I don't yet see how I could ease my task by incorporating additional markup into the FO.
LVL 60

Accepted Solution

Geert Bormans earned 500 total points
ID: 39754381
Yes, your understanding is correct. We use Aspose mainly to normalize between different versions of Word. The object model to program against is much easier than the MS SDK, but it isn't cheap, and if you need free redistribution, that does not seem an option. And yes, you still need quite a bit of programming.

The low-level WordprocessingML objects are a mess, but by the sound of it that might be your best option. There is no hierarchy in Word XML, that is true. But the good thing is that you have to create the Word XML, not read from it. Starting from existing Word XML documents would leave you with a lot of complex grouping to do :-)
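
To show how little is involved in the "create" direction, here is a rough Python sketch (standard library only, not the Open XML SDK) that writes a tiny .docx. A .docx file is a zip package of XML parts; as far as I know the three parts below are the minimum for a document that opens. The function name `build_docx` and the sample text are just for illustration, and each paragraph is a single w:p holding one w:r.

```python
import io
import zipfile
from xml.sax.saxutils import escape

W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"

CONTENT_TYPES = (
    '<?xml version="1.0" encoding="UTF-8" standalone="yes"?>'
    '<Types xmlns="http://schemas.openxmlformats.org/package/2006/content-types">'
    '<Default Extension="rels" '
    'ContentType="application/vnd.openxmlformats-package.relationships+xml"/>'
    '<Default Extension="xml" ContentType="application/xml"/>'
    '<Override PartName="/word/document.xml" ContentType="application/'
    'vnd.openxmlformats-officedocument.wordprocessingml.document.main+xml"/>'
    '</Types>'
)

PACKAGE_RELS = (
    '<?xml version="1.0" encoding="UTF-8" standalone="yes"?>'
    '<Relationships '
    'xmlns="http://schemas.openxmlformats.org/package/2006/relationships">'
    '<Relationship Id="rId1" Type="http://schemas.openxmlformats.org/'
    'officeDocument/2006/relationships/officeDocument" '
    'Target="word/document.xml"/>'
    '</Relationships>'
)


def build_docx(paragraphs):
    """Return the bytes of a .docx with one unstyled w:p per input string."""
    body = "".join(
        f'<w:p><w:r><w:t xml:space="preserve">{escape(text)}</w:t></w:r></w:p>'
        for text in paragraphs
    )
    document = (
        '<?xml version="1.0" encoding="UTF-8" standalone="yes"?>'
        f'<w:document xmlns:w="{W_NS}"><w:body>{body}</w:body></w:document>'
    )
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as package:
        package.writestr("[Content_Types].xml", CONTENT_TYPES)
        package.writestr("_rels/.rels", PACKAGE_RELS)
        package.writestr("word/document.xml", document)
    return buffer.getvalue()


docx_bytes = build_docx(["First paragraph.", "Second & last paragraph."])
# To inspect the result, write docx_bytes to a file ending in .docx.
```

Styles, tables, lists and headers each add more parts and more markup, which is where an SDK or a commercial object model starts to pay off.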

In the end, Word objects are "w:p" (paragraphs) and "w:r" (runs) with some styling description inside them. There is some added complexity for tables and lists of course, and maybe for equations and graphics...
At the end of the day, if you know the hierarchy of the nested fo:blocks, you could assign them styles in Word, map the deepest nesting of the fo:blocks to a styled "w:p" and make separate "w:r" elements for the mixed content.
You are comfortable enough using XSLT, so an option could be to put the complexity of the flattening and the style mapping in an XSLT, so you would have less work pushing the lot to SDK objects in .net code.
If you do it that way, smart id generation or class-like constructs can tell you which nested block has which intended style for Word, so you could simplify your mapping logic.
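
A rough sketch of that flattening, in Python rather than XSLT just to keep it short. Everything here is an assumption for illustration: the `hint:class` attribute and its namespace, the sample fragment, and the choice of joining the ancestor classes with "/" to form a style key. Each innermost stretch of text becomes one paragraph entry with a list of runs, and a nested fo:block closes whatever paragraph was open in its parent.

```python
import xml.etree.ElementTree as ET

FO = "{http://www.w3.org/1999/XSL/Format}"
HINT_CLASS = "{urn:example:word-hints}class"  # hypothetical hint attribute

SAMPLE = """\
<fo:block xmlns:fo="http://www.w3.org/1999/XSL/Format"
          xmlns:hint="urn:example:word-hints" hint:class="Section">
  <fo:block hint:class="Title">Chapter one</fo:block>
  <fo:block hint:class="Note">
    <fo:block hint:class="Para">Plain and <fo:inline font-weight="bold">bold</fo:inline> text.</fo:block>
  </fo:block>
</fo:block>
"""


def flatten(block, path=(), out=None):
    """Return a flat list of (style key, runs); a run is (text, properties)."""
    out = [] if out is None else out
    path = path + (block.get(HINT_CLASS, "Block"),)
    runs = []

    def flush():
        # Emit the paragraph collected so far, if it has any runs.
        if runs:
            out.append(("/".join(path), list(runs)))
            runs.clear()

    def add(text, props):
        # Whitespace-only text between blocks is indentation, not content.
        if text and text.strip():
            runs.append((text, props))

    add(block.text, {})
    for child in block:
        if child.tag == FO + "block":
            flush()  # a nested block ends the paragraph of its parent
            flatten(child, path, out)
        elif child.tag == FO + "inline":
            add("".join(child.itertext()), dict(child.attrib))
        add(child.tail, {})
    flush()
    return out


paragraphs = flatten(ET.fromstring(SAMPLE))
for style, runs in paragraphs:
    print(style, runs)
```

The style key would then be looked up in a table of Word paragraph styles, and the run properties mapped to character formatting. Nested fo:inline elements and tables are deliberately left out of this sketch.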

Just thinking in the wild here, not sure if it makes sense in your particular project
LVL 60

Expert Comment

by:Geert Bormans
ID: 39754396
One option often overlooked.

I have been pretty successful with some not too complex layouts by generating HTML with CSS, plus a dotx file for templating that describes the styles and the header/footer stuff, and combining them into Word automatically. It works if the layout is not too complex, and it leaves a lot of messy coding to the Word import filter. Of all the stuff I tried for getting XML into Word, that one gives pretty decent results at a low cost of entry. Not sure what the .net guys use for it here, but I think you can use the SDK for that too.

Just a thought
LVL 18

Author Closing Comment

ID: 39754586
Geert, thank you.
I will continue studying the OpenXML SDK, even though I'm at the very beginning of it (currently I don't even understand the role of those "runs" and why they have to be inside the paragraphs).
LVL 60

Expert Comment

by:Geert Bormans
ID: 39754852

A paragraph is a logical block-level unit; it can have a paragraph style. It is a very common use of the concept of a paragraph.
A run is a sequence of characters that share a common property: that could be a character style, but also track-changes information and the like. Many sorts of events can break a run into multiple runs, so getting stuff out of Word XML can be tough, but getting stuff in is just a matter of breaking things apart in a serial fashion.

The example posted here had an i nested in a b; this would lead to five different runs in one p (I numbered them).
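
The snippet itself is not preserved in this thread. As a stand-in, here is a small Python sketch with a made-up input of the same shape (an i nested in a b inside one p) that breaks the mixed content apart serially and numbers the resulting runs. The function name `to_runs` and the sample text are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical input: one paragraph, an <i> nested inside a <b>.
SAMPLE = "<p>one <b>two <i>three</i> four</b> five</p>"


def to_runs(element, active=frozenset()):
    """Split mixed content into (text, active formatting tags) runs."""
    runs = []
    if element.text:
        runs.append((element.text, sorted(active)))
    for child in element:
        # Text inside the child carries the child's tag as extra formatting.
        runs.extend(to_runs(child, active | {child.tag}))
        # Text after the child falls back to the parent's formatting.
        if child.tail:
            runs.append((child.tail, sorted(active)))
    return runs


runs = to_runs(ET.fromstring(SAMPLE))
for number, (text, formatting) in enumerate(runs, start=1):
    print(number, repr(text), formatting)
```

Every change in formatting starts a new run, so the sample yields plain, bold, bold and italic, bold, and plain again: five runs in one paragraph.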
LVL 18

Author Comment

ID: 39755230
I see, thank you!
