SteveFarndon2000

asked on

How do I get an MS-Word .doc into .html format?

Hi Experts,

I want to convert MS-Word .doc files into HTML documents that render exactly what you would see if you did a 'print preview' in Word on the original .doc version.

If I open the doc in Word and use the 'save as webpage' option then view the output in Firefox I just see the main body of each page, fine, but no page headers or page footers.

I've looked around for a utility that will do such a conversion, but all have failed to show the page headers and page footers. If I use a utility, it needs to be something I can run from a command prompt that accepts the input file name and output file name as parameters. Even the ones with a GUI failed to recognise the page headers, etc.

The only solution that I can come up with is using a virtual printer called Print2eDoc from Gnostice which is installed as a printer under WinXP and 'prints' the Word print output to JPEG files, one for each page of the Word doc. Then I have to encapsulate these JPEGs in HTML before I can render in Firefox. This is messy because (a) any graphics in the Word doc degrade when put through the conversion (b) the text in the Word doc degrades as well. Very bad. There is also the hassle of building the HTML document myself.
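The "building the HTML document myself" step could be scripted. Below is a minimal Python sketch, not a tested solution. It assumes the virtual printer writes one .jpg per page into a single folder and that the file names sort into page order (for example zero-padded page numbers). The function names `build_page_wrapper` and `write_wrapper` are made up for illustration.

```python
from html import escape
from pathlib import Path


def build_page_wrapper(image_names, title="Document"):
    """Return an HTML string showing one <img> per printed page, in order."""
    parts = [
        "<!DOCTYPE html>",
        "<html>",
        f'<head><meta charset="utf-8"><title>{escape(title)}</title></head>',
        "<body>",
    ]
    for number, name in enumerate(image_names, start=1):
        # One block per page image; alt text carries the page number.
        parts.append(
            f'<div class="page"><img src="{escape(name, quote=True)}" '
            f'alt="Page {number}"></div>'
        )
    parts += ["</body>", "</html>"]
    return "\n".join(parts)


def write_wrapper(image_dir, output_file):
    """Collect *.jpg files from image_dir (sorted by name) and write the wrapper.

    Assumption: sorting by file name gives the correct page order.
    """
    names = sorted(p.name for p in Path(image_dir).glob("*.jpg"))
    Path(output_file).write_text(build_page_wrapper(names), encoding="utf-8")
    return len(names)
```

This only removes the manual HTML step. It does nothing about the loss of quality in the JPEGs themselves.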

How can I get the whole doc to appear in HTML?

Am I missing a setting in 'save as webpage' in Word?
Is there a much better utility which will capture Word print output to HTML?
Is there a much better conversion utility that does doc to HTML?
Is there a way of tweaking the HTML that 'save as webpage' produces so that the page headers/footers are rendered (the header/footer information is stored in a subfolder by the same name as the output file)?

Thanks for all your help.

Big points because I need a good answer quickly.
Brian B

Everything I have seen suggests that Word does not do a good job of saving to HTML for anything but IE. However, some searching has turned up a number of different DOC to HTML converter tools.
You can accomplish this with a free HTML editor. Copy the text from the Word document (Test-Word-2010.docx) and paste it into the HTML document (Test00.html).

It doesn't take that long & you don't really have to know any HTML to get a result!
SteveFarndon2000

ASKER

Tbone2k,

And those converter tools are...?
paulsauve,

Thanks but this is OK just for a one-off. I need to automate the process since I will be doing it many times a day every day. I know I can automate Word via macro/scripting/VB app but that only applies to operations inside Word. I can't automate the copy-and-paste of the Word doc into another app such as your HTML editor.

Am also using Word2000 so docx doesn't apply.

Do you have a procedure I can automate?
Sorry - can't help you with that!

BUT, you can easily convert all your files to pdf and put up a list of links to files in an available pdf directory... Conversion to PDF should conserve all the headers & footers.

I think that this would be easier to automate than the conversion to HTML.
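The "list of links" part of this suggestion is easy to script. Here is a rough Python sketch, assuming the PDFs have already been produced by some other tool and sit in the same directory as the index page. The function name `build_pdf_index` is hypothetical.

```python
from html import escape
from urllib.parse import quote


def build_pdf_index(pdf_names, heading="Documents"):
    """Return an HTML page with one link per PDF file name, sorted by name."""
    items = [
        # quote() makes the name safe as a URL; escape() makes it safe as text.
        f'<li><a href="{quote(name)}">{escape(name)}</a></li>'
        for name in sorted(pdf_names)
    ]
    return "\n".join(
        [
            "<!DOCTYPE html>",
            "<html>",
            f'<head><meta charset="utf-8"><title>{escape(heading)}</title></head>',
            "<body>",
            f"<h1>{escape(heading)}</h1>",
            "<ul>",
            *items,
            "</ul>",
            "</body>",
            "</html>",
        ]
    )
```

The PDF conversion itself is not covered here; this only builds the page that links to the results.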
Tbone2k,

Any news about those converters?
Not being facetious, but try http://www.google.com/search?hl=en&rls=com.microsoft%3Aen-us&q=word+html+converter&aq=f&aqi=g1g-v9&aql=&oq=&redir_esc=&ei=dWjATaq-FMu1tge59eDDBQ

Different converters do a better job on different DOC styles, particularly the free online ones. That's why it's good to try more than one.
I tried a few, and, like you, noted that headers and footers are NOT converted! Sorry I couldn't be of more help!
Tbone2k,

You will have to be more specific in order to get some points. From the start I requested a converter that I could run from a command line prompt. Online converters are not practicable.

A solution is a recommendation to a converter that works, not a suggestion as to ones that might work. Just need to know which one that is...

Thank you.
Steve, I tried to provide an answer based on the fact that you weren't getting an exact response, i.e. command line utility. How you choose to accept the answer is up to you as long as it is within EE guidelines.

Balancing the request to get a fast answer with a complete one, I finally got a chance to find one application that was recommended.... Word Cleaner: http://www.convertwordtohtml.com/
Tbone2k,

WordCleaner was one that I'd already tried with no joy. I was surprised they charge money for this product and someone must have spotted this fault by now. Obviously not.

Next.

If you still need help, you might want to try the request attention link in your question to flag it as still in need of a solution.
Do you need to do this directly from Word? (File | Save As)
Just a thought...if you found an online tool that works for you, you can use Word to automate the process of loading the data into a hidden browser object that can handle the conversion if it is pure HTML/JavaScript, then have the output redirected to a location of your choice.

If you have such a site and want to pursue this, let me know and I'll see if I can whip up the VBA to make that happen.
If command-line/batch conversion is a requirement, you may want to investigate the OmniFormat series of products:

http://www.omniformat.com/

That requirement really throws a wrench into the works.  Without it, you have a lot more options (cold comfort, I know).

The problem with going from Word to HTML will always be the quality of the output.  Headers and Footers just don't translate well except to PDF.
> I want to convert MS-Word .doc files in to HTML documents that render
> exactly what you would see if did 'print preview' in Word on the original .doc version.

But you can't have that "exactly". Print Preview is page-oriented, and the page header and page footer appear on each page. In HTML there are no pages. Let's say you have a 5-page Word document. In Print Preview, it will be 5 pages, each with a header and footer. How is it supposed to look in HTML? Then there's a whole mountain of print options: depending on the printer and on the paper, the same Word document will look very different. In fact, your requirement should look like this: "I want a webpage looking like my Word document when it's printed on my color printer (or is it b/w?) in Portrait format on paper 11X8.5, with margins 0.5 from each side,....." (and we could continue here for a couple more pages, covering all the printing options in Word). If you switch to Normal view or Web view in Word, you won't see your header and footer, and for a very good reason: you can't have page headers if you have no pages to begin with.

Because of that, indeed, the option to produce JPEG images is probably the best, because only that will show on the webpage something that does not exist on the webpage - the paper page.
...continuing... I looked at a random book at books.google.com, and what Google shows is true HTML, not Flash or an image. Right-click on the book and you will see it. How that actually works, though, is somewhat of a mystery (see the discussion at http://stackoverflow.com/questions/3896008/how-google-books-and-google-docs-viewer-work )

Another approach is HTML 5: http://www.20thingsilearned.com/foreword/2

All this seems to be very far from being offered as some utility that you could just grab and utilize for your project.

I'm with vadim on this...the web page was specifically designed to allow the recipient to adjust formatting and have that independent of content.  Over the years, this has been adjusted to allow the creator to make the item appear closer to the original concept, but what happens when you show something in a font that doesn't exist on the recipient's machine?  What if the preview is set to legal paper? Or index card? or European A4 (similar to American letter but not quite the same)?

As indicated, the only real, viable solution to that is to draw the web page as an image (which is essentially what PDF does) so that it looks exactly the way you want.  However, that still doesn't guarantee that it will *print* the same way.

So that leads to the question of, "how would you like it to appear, and who do you expect to view it?"  If the recipient will have the same machine (or the same fonts and settings as your machine), then we could probably create a web page from a Word document that appears like that on the recipient's machine, but that would probably be a custom job, not a free tool (or even a 3rd party tool that you pay for.)
I've seen one product that does a credible job of replicating a word "page" but it does it in such a weird way that manipulating the page is all but impossible.

The application would create separate HTML pages for each page of the document and set an AP div with 7.5" x 10" dimensions.  Each LINE of text would be converted to a div and those would also be absolutely positioned.  The end result is a credible attempt at mimicking a Word doc but an absolute nightmare to handle and edit.  
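To make the layout described above concrete, here is a small Python sketch of that kind of output: a fixed-size page div with every line of text absolutely positioned inside it. This is not the product in question. The function name, the fixed line height, and the left offset are assumptions for illustration only.

```python
from html import escape


def render_positioned_page(lines, width_in=7.5, height_in=10.0, line_height_in=0.25):
    """Render text lines as absolutely positioned divs inside a fixed-size page div."""
    out = [
        # The page div is the positioning context for every line inside it.
        '<div class="page" style="position:relative;'
        f'width:{width_in}in;height:{height_in}in;">'
    ]
    for index, text in enumerate(lines):
        # Assumed layout: evenly spaced lines, all starting at the left edge.
        top = index * line_height_in
        out.append(
            f'<div style="position:absolute;left:0in;top:{top}in;">'
            f"{escape(text)}</div>"
        )
    out.append("</div>")
    return "\n".join(out)
```

Even in this toy version you can see the problem: inserting one line means recomputing the position of every line after it.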

Be d*mned if I can remember the name of it now, though...
I agree with Paul that the only viable solution is PDF

If you install Acrobat professional you can even have it monitor a directory for new files and have it convert them on the fly.
An alternative is a free PDF writer, which lets you save as PDF.
I am certain that server programs exist that will convert on the fly too.
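The "monitor a directory" idea can be approximated with a simple polling script if a product with that feature is not available. The Python sketch below only detects new .doc files. The actual conversion is left out, because which converter to call is exactly the open question in this thread. The function name `find_new_docs` is made up.

```python
from pathlib import Path


def find_new_docs(watch_dir, seen):
    """Return .doc files in watch_dir whose names are not yet in `seen`.

    `seen` is a set of file names that is updated in place, so calling this
    repeatedly (for example once a minute) reports each file only once.
    """
    new_files = sorted(
        p for p in Path(watch_dir).glob("*.doc") if p.name not in seen
    )
    seen.update(p.name for p in new_files)
    return new_files
```

A caller would loop, sleep between calls, and hand each returned path to whatever command-line converter is chosen.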
Remove 35726243 from the split.  My post doesn't answer the question unless I remember the name of the program.  
@jason1178, I disagree. The mere information that something like this exists, is valuable - whoever will find this Q in the KB, will do some additional research and find the product.
In that case, I would like to also add http:#35505511 and http:#35516809 to the mix since it suggests online tools ahead of the other posts.
from vadimapp1:
<< But you can't have that "exactly". Print Preview is page-oriented, and page  header and page footer appear on each page. In HTML there are no pages. Let's say, you have 5-page Word document. In Print Preview, it will be 5 pages, each with header and footer. How is it supposed to look in HTML? Then there's a whole mountain of print options - depending on the printer and on the paper, ...>>

The point about the massive number of variations in the print options is spurious. The Print Preview option uses whatever paper size, margins etc (see 'Page Setup') have been set for that section. yes? An HTML version of a page will do the same thing, i.e. use the current settings for the Word Section. In fact that seems to be exactly what Word's own 'Save as Webpage' option does except for the bug that doesn't render any header/footer. The HTML could and should have a new <DIV> tag for each page of Word document inside a single HTML page.
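As a rough illustration of the one-DIV-per-page idea, here is a Python sketch. It assumes the page bodies have already been split out as HTML fragments and that the header and footer are plain text. How to get those pieces out of Word is not addressed. The function name and the `{page}`/`{total}` placeholders are invented for this example.

```python
from html import escape


def paginate_with_header_footer(pages, header, footer):
    """Wrap each page body in its own div, repeating the header and footer.

    `pages` holds HTML fragments (inserted as-is); `header` and `footer` are
    plain text. The footer may contain {page} and {total} placeholders.
    """
    total = len(pages)
    out = []
    for number, body in enumerate(pages, start=1):
        # page-break-after asks the browser to start a new sheet when printing.
        out.append('<div class="page" style="page-break-after:always;">')
        out.append(f'<div class="page-header">{escape(header)}</div>')
        out.append(f'<div class="page-body">{body}</div>')
        footer_text = footer.format(page=number, total=total)
        out.append(f'<div class="page-footer">{escape(footer_text)}</div>')
        out.append("</div>")
    return "\n".join(out)
```

For example, `paginate_with_header_footer(["<p>one</p>", "<p>two</p>"], "Quarterly Report", "Page {page} of {total}")` gives two page divs, each with its own header and footer.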

I see that you are all waiting for comments/points allocation. I'm back after being away for a while and will add more comments v. soon. Thanks for all your comments so far.
Steve,

We all agree about what a conversion SHOULD do, but the sad fact is that no one has made a tool that does a good (or even tolerable) job of it.  
Jason, that's partly true.  The real problem is that different people have different definitions of what is a good job.  For example, if it's simply taking the text and putting bold and italic tags as needed, that's probably been done.  When you get more complex and start adding floating images inside complex tables with formulas and headings that reference page numbers along with some macros, I'm not sure that it's practical to build a tool to handle all that.
At some point, "good enough" is probably fine.  I think the real problem is that for anything beyond a basic document, I haven't found any good tools.  That means writing a custom solution for those cases, and that's not likely to be done on a free/cheap site like EE.
SOLUTION
Vadim Rapp

This solution is only available to members.
> We all agree about what a conversion SHOULD do, but the sad fact is that no one has made a tool that does a good (or even tolerable) job of it.  

If the objective is having the page look as close to the print preview as possible, then isn't PDF exactly that?
SOLUTION
This solution is only available to members.
ASKER CERTIFIED SOLUTION
This solution is only available to members.
This question has been classified as abandoned and is closed as part of the Cleanup Program. See the recommendation for more details.