XSL to clean up Word's XHTML?

Does anyone know of an XSL stylesheet that will take a Word doc (saved as XHTML) and clean it up?

By this I mean return the paragraphs only, and maybe bullet points as well.

Drop everything else (indentation, images, line art, page numbering, etc.). Also tidy up "empty" paragraphs.

PS I have used XML and XSL in the past (for web applications) but never with MS Office, so any help is appreciated :)
eamonrocheAsked:
PeterCiuffettiCommented:
Hi,

When I have had to convert Word documents to (usable) XML, I started with a utility called UpCast:

http://www.infinity-loop.de/products/upcast/

This converts directly from the .doc format to an XML format.  You get to choose from among a number of DTDs specifying the output format, one of which is relatively simplified with all styles being set up for control from an external CSS file.

I then found that writing XSL for this simplified format was much easier than writing XSL for the XML output of Word.

Pete

eamonrocheAuthor Commented:
I understand your comment but.....

There will be about 200 users writing the Word docs, and I do not want to install UpCast on each of their PCs. I will need them to do "Save as Web Page", and then I need an XSL file to tidy up the resulting XHTML file. Thanks.

PeterCiuffettiCommented:
Can you make this a web server function?  So for example, after one of the users saves their Word document, they could upload it, via a web form, to a server you provide to convert documents to XML.   If you are working in Java on your server, you could get the JAR version of UpCast.  Then the server could use its API to do the conversion after the save.

eamonrocheAuthor Commented:
I see what you are getting at. If there were a COM component version of UpCast that I could call from an ASP script, then this would work.
eamonrocheAuthor Commented:
....but really I don't mind if the XHTML produced by Word is messy. Surely there must be a smart XSL file that will clean this up.
PeterCiuffettiCommented:
Hi again,

I couldn't find any. I took a look at the XSD schemas offered through Microsoft here (http://www.microsoft.com/downloads/details.aspx?FamilyID=fe118952-3547-420a-a412-00a2662442d9&DisplayLang=en).

Not surprisingly, the markup language is quite complex. Given all the styles, tables, footers, headers, footnotes, etc., "cleaning it up" would require quite a bit of XSL. If you just wanted the paragraphs, I suppose it wouldn't be too bad, but you would still have to work around the objects that can break up a paragraph and then reassemble the pieces. If you give me a short list of the elements you want to capture, I can try to put something small together, and then maybe you could build on that.

Pete
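As a rough illustration of the "paragraphs and bullets only" idea discussed above, a whitelist-style XSLT 1.0 stylesheet might look like the sketch below. It is not a tested solution for real Word output, and it rests on several assumptions: the file has already been made well-formed XML (Word's "Save as Web Page" output often is not, so a pass through a tool such as HTML Tidy may be needed first), its elements are in the XHTML namespace, and bullets arrive as ul/ol/li rather than as styled paragraphs. The `document` wrapper element is a made-up name for illustration.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:h="http://www.w3.org/1999/xhtml"
    exclude-result-prefixes="h">

  <xsl:output method="xml" indent="yes" encoding="UTF-8"/>

  <!-- Root: emit a plain wrapper (hypothetical name) and process only the body -->
  <xsl:template match="/">
    <document>
      <xsl:apply-templates select="//h:body"/>
    </document>
  </xsl:template>

  <!-- Keep a paragraph only if it has text left after treating
       non-breaking spaces as ordinary spaces; output its text content only,
       which also discards images, spans and inline styling -->
  <xsl:template match="h:p[normalize-space(translate(., '&#160;', ' '))]">
    <p>
      <xsl:value-of select="normalize-space(translate(., '&#160;', ' '))"/>
    </p>
  </xsl:template>

  <!-- Any other paragraph is "empty": drop it -->
  <xsl:template match="h:p"/>

  <!-- Keep lists, but rebuild them without attributes -->
  <xsl:template match="h:ul | h:ol">
    <xsl:element name="{local-name()}">
      <xsl:apply-templates select="h:li"/>
    </xsl:element>
  </xsl:template>

  <xsl:template match="h:li">
    <li>
      <xsl:value-of select="normalize-space(translate(., '&#160;', ' '))"/>
    </li>
  </xsl:template>

  <!-- Everything else: look inside elements for paragraphs and lists,
       but discard the elements themselves and any stray text -->
  <xsl:template match="*">
    <xsl:apply-templates/>
  </xsl:template>
  <xsl:template match="text()"/>

</xsl:stylesheet>
```

Note that this approach also discards headings and table structure (paragraphs inside table cells survive, the cells do not). If headings are wanted, templates for h:h1, h:h2 and so on could be added in the same style as the paragraph template.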
eamonrocheAuthor Commented:
Pete
Thanks for your help. The job won't be starting for a few months; I was just doing some preparation. I will have a go myself first and see how I get on.
Eamon

Question has a verified solution.