  • Status: Solved
  • Priority: Medium
  • Security: Public
  • Views: 410

How do I get an MS-Word .doc into .html format?

Hi Experts,

I want to convert MS-Word .doc files into HTML documents that render exactly what you would see if you did 'print preview' in Word on the original .doc version.

If I open the doc in Word and use the 'save as webpage' option then view the output in Firefox I just see the main body of each page, fine, but no page headers or page footers.

I've looked around for a utility that will do such a conversion, but all have failed to show the page headers and page footers. If I use a utility, I need it to be something I can run from a command prompt that accepts the input file name and output file name as parameters. Even the ones with a GUI failed to recognise the page headers, etc.

The only solution that I can come up with is using a virtual printer called Print2eDoc from Gnostice which is installed as a printer under WinXP and 'prints' the Word print output to JPEG files, one for each page of the Word doc. Then I have to encapsulate these JPEGs in HTML before I can render in Firefox. This is messy because (a) any graphics in the Word doc degrade when put through the conversion (b) the text in the Word doc degrades as well. Very bad. There is also the hassle of building the HTML document myself.
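
For what it's worth, the "building the HTML document myself" step can be scripted. Below is a minimal Python sketch, assuming the virtual printer has already written one JPEG per page and that the images sit in the same folder as the HTML file. The function name `build_page_gallery` and the file names are made up for illustration; nothing here comes from Print2eDoc itself.

```python
import html
import sys
from pathlib import Path


def build_page_gallery(image_paths, title="Converted document"):
    """Return an HTML string with one <img> per printed page, in the order given."""
    parts = [
        "<!DOCTYPE html>",
        "<html><head><meta charset='utf-8'>",
        f"<title>{html.escape(title)}</title>",
        "<style>img.page { display: block; margin: 1em auto; border: 1px solid #999; }</style>",
        "</head><body>",
    ]
    for number, path in enumerate(image_paths, start=1):
        # Assumption: images live next to the HTML file, so only the file name is used.
        src = html.escape(Path(path).name, quote=True)
        parts.append(f'<img class="page" src="{src}" alt="Page {number}">')
    parts.append("</body></html>")
    return "\n".join(parts)


# Hypothetical command-line use: python wrap_pages.py out.html page1.jpg page2.jpg
if __name__ == "__main__" and len(sys.argv) > 2:
    out_file, *images = sys.argv[1:]
    Path(out_file).write_text(build_page_gallery(images), encoding="utf-8")
```

This only removes the hand-editing; it does nothing about the loss of quality in the JPEGs themselves.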

How can I get the whole doc to appear in HTML?

Am I missing a setting in 'save as webpage' in Word?
Is there a much better utility which will capture Word print output to HTML?
Is there a much better conversion utility that does doc to HTML?
Is there a way of tweaking the HTML that 'save as webpage' produces so that the page headers/footers are rendered (the header/footer information is stored in a subfolder by the same name as the output file)?

Thanks for all your help.

Big points because I need a good answer quickly.
Asked by: SteveFarndon2000
3 Solutions
 
Brian B (Independent Technology Professional) commented:
Everything I have seen suggests that Word does not do a good job of saving to HTML for anything but IE. However, some searching has turned up a number of different DOC to HTML converter tools.
 
Paul Sauvé commented:
You can accomplish this with a free HTML editor. Copy the text from the Word document (Test-Word-2010.docx) and paste it into the HTML document (Test00.html).

It doesn't take that long & you don't really have to know any HTML to get a result!
 
SteveFarndon2000 (Managing Director, Author) commented:
Tbone2k,

And those converter tools are...?
 
SteveFarndon2000 (Managing Director, Author) commented:
paulsauve,

Thanks but this is OK just for a one-off. I need to automate the process since I will be doing it many times a day every day. I know I can automate Word via macro/scripting/VB app but that only applies to operations inside Word. I can't automate the copy-and-paste of the Word doc into another app such as your HTML editor.

I am also using Word 2000, so .docx doesn't apply.

Do you have a procedure I can automate?
 
Paul Sauvé commented:
Sorry - can't help you with that!

BUT, you can easily convert all your files to PDF and put up a list of links to the files in an available PDF directory... Conversion to PDF should preserve all the headers & footers.

I think that this would be easier to automate than the conversion to HTML.
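
As an illustration of automating DOC to PDF from a command prompt, here is a minimal Python sketch. It assumes LibreOffice is installed and that its `soffice` executable is on the PATH; LibreOffice is not a tool anyone in this thread has tested, and the helper names are hypothetical. Whether the resulting PDF matches Word's own pagination exactly would need checking on real documents.

```python
import subprocess
import sys
from pathlib import Path


def build_convert_command(doc_path, out_dir, soffice="soffice"):
    """Build the LibreOffice headless command line for a DOC -> PDF conversion."""
    return [
        soffice,
        "--headless",            # run without opening a window
        "--convert-to", "pdf",   # target format
        "--outdir", str(out_dir),
        str(doc_path),
    ]


def convert_to_pdf(doc_path, out_dir, soffice="soffice"):
    """Run the conversion and return the path where the PDF is expected to appear."""
    subprocess.run(build_convert_command(doc_path, out_dir, soffice), check=True)
    return Path(out_dir) / (Path(doc_path).stem + ".pdf")


# Hypothetical command-line use: python doc2pdf.py input.doc output_folder
if __name__ == "__main__" and len(sys.argv) == 3:
    print(convert_to_pdf(sys.argv[1], sys.argv[2]))
```

Taking the input file and output folder as parameters keeps it usable from a batch file or scheduled task.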
 
SteveFarndon2000 (Managing Director, Author) commented:
Tbone2k,

Any news about those converters?
 
Brian B (Independent Technology Professional) commented:
Not being facetious, but try http://www.google.com/search?hl=en&rls=com.microsoft%3Aen-us&q=word+html+converter&aq=f&aqi=g1g-v9&aql=&oq=&redir_esc=&ei=dWjATaq-FMu1tge59eDDBQ

Different converters do a better job on different DOC styles, particularly the free online ones. That's why it's good to try more than one.
 
Paul Sauvé commented:
I tried a few, and, like you, noted that headers and footers are NOT converted! Sorry I couldn't be of more help!
 
SteveFarndon2000 (Managing Director, Author) commented:
Tbone2k,

You will have to be more specific in order to get some points. From the start I requested a converter that I could run from a command line prompt. Online converters are not practicable.

A solution is a recommendation of a converter that works, not a suggestion of ones that might work. I just need to know which one that is...

Thank you.
 
Brian B (Independent Technology Professional) commented:
Steve, I tried to provide an answer based on the fact that you weren't getting an exact response, i.e. command line utility. How you choose to accept the answer is up to you as long as it is within EE guidelines.

Balancing the request to get a fast answer with a complete one, I finally got a chance to find one application that was recommended.... Word Cleaner: http://www.convertwordtohtml.com/
 
SteveFarndon2000 (Managing Director, Author) commented:
Tbone2k,

WordCleaner was one that I'd already tried with no joy. I'm surprised they charge money for this product; surely someone must have spotted this fault by now. Obviously not.

Next.

 
Brian B (Independent Technology Professional) commented:
If you still need help, you might want to try the request attention link in your question to flag it as still in need of a solution.
 
aikimark commented:
Do you need to do this directly from Word? (File | Save As)
 
rspahitz commented:
Just a thought...if you found an online tool that works for you, you can use Word to automate the process of loading the data into a hidden browser object that can handle the conversion if it is pure HTML/JavaScript, then have the output redirected to a location of your choice.

If you have such a site and want to pursue this, let me know and I'll see if I can whip up the VBA to make that happen.
 
Jason C. Levine (No one) commented:
If command-line/batch conversion is a requirement, you may want to investigate the OmniFormat series of products:

http://www.omniformat.com/

That requirement really throws a wrench into the works.  Without it, you have a lot more options (cold comfort, I know).

The problem with going from Word to HTML will always be the quality of the output.  Headers and Footers just don't translate well except to PDF.
 
Vadim Rapp commented:
> I want to convert MS-Word .doc files in to HTML documents that render
> exactly what you would see if did 'print preview' in Word on the original .doc version.

But you can't have that "exactly". Print Preview is page-oriented, and the page header and page footer appear on each page. In HTML there are no pages. Let's say you have a 5-page Word document. In Print Preview, it will be 5 pages, each with a header and footer. How is it supposed to look in HTML? Then there's a whole mountain of print options: depending on the printer and the paper, the same Word document will look very different. In fact, your requirement should look like this: "I want a webpage that looks like my Word document when it's printed on my color printer (or is it b/w?) in Portrait format on paper 11X8.5, with margins 0.5 from each side,....." (and we could continue here for a couple more pages, covering all the printing options in Word). If you switch to Normal view or Web view in Word, you won't see your header and footer, and for a very good reason: you can't have page headers if you have no pages to begin with.

Because of that, indeed, the option to produce JPEG images is probably the best, because only that will show on the webpage something that does not exist on the webpage - the paper page.
 
Vadim Rapp commented:
...continuing... I looked at a random book at books.google.com, and what Google shows is true HTML, not Flash or an image; right-click on the book and you will see it. How that actually works, though, is somewhat of a mystery (see the discussion at http://stackoverflow.com/questions/3896008/how-google-books-and-google-docs-viewer-work )

Another approach is HTML 5: http://www.20thingsilearned.com/foreword/2

All this seems to be very far from being offered as some utility that you could just grab and utilize for your project.

 
rspahitz commented:
I'm with vadim on this...the web page was specifically designed to allow the recipient to adjust formatting and have that independent of content.  Over the years, this has been adjusted to allow the creator to make the item appear closer to the original concept, but what happens when you show something in a font that doesn't exist on the recipient's machine?  What if the preview is set to legal paper? Or index card? or European A4 (similar to American letter but not quite the same)?

As indicated, the only real, viable solution to that is to draw the web page as an image (which is essentially what PDF does) so that it looks exactly the way you want.  However, that still doesn't guarantee that it will *print* the same way.

So that leads to the question of, "how would you like it to appear, and who do you expect to view it?"  If the recipient will have the same machine (or the same fonts and settings as your machine), then we could probably create a web page from a Word document that appears like that on the recipient's machine, but that would probably be a custom job, not a free tool (or even a 3rd party tool that you pay for.)
 
Jason C. Levine (No one) commented:
I've seen one product that does a credible job of replicating a Word "page", but it does it in such a weird way that manipulating the page is all but impossible.

The application would create separate HTML pages for each page of the document and set an AP div with 7.5" x 10" dimensions.  Each LINE of text would be converted to a div and those would also be absolutely positioned.  The end result is a credible attempt at mimicking a Word doc but an absolute nightmare to handle and edit.  

Be d*mned if I can remember the name of it now, though...
 
Michel Plungjan (IT Expert) commented:
I agree with Paul that the only viable solution is PDF.

If you install Acrobat Professional, you can even have it monitor a directory for new files and convert them on the fly.
An alternative is a free PDF writer that lets you save as PDF.
I am certain that server programs exist that will convert on the fly too.
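
The "monitor a directory" idea can also be approximated without Acrobat. Here is a rough sketch in Python, using simple polling; `find_new_docs`, `watch`, and the `handle` callback are hypothetical names, and the actual conversion step is left to whatever converter is chosen.

```python
import time
from pathlib import Path


def find_new_docs(folder, seen):
    """Return .doc files in `folder` not reported before, and remember them in `seen`."""
    new_files = sorted(p for p in Path(folder).glob("*.doc") if p.name not in seen)
    seen.update(p.name for p in new_files)
    return new_files


def watch(folder, handle, interval=5.0):
    """Poll `folder` forever, calling handle(path) once for each new .doc file."""
    seen = set()
    while True:
        for doc in find_new_docs(folder, seen):
            handle(doc)  # e.g. call a command-line converter here
        time.sleep(interval)
```

Polling is crude but has no dependencies; a file that is still being written when it is picked up would need extra care.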
 
Jason C. Levine (No one) commented:
Remove 35726243 from the split.  My post doesn't answer the question unless I remember the name of the program.  
 
Vadim Rapp commented:
@jason1178, I disagree. The mere information that something like this exists is valuable: whoever finds this Q in the KB can do some additional research and find the product.
 
Brian B (Independent Technology Professional) commented:
In that case, I would also like to add http:#35505511 and http:#35516809 to the mix, since they suggest online tools ahead of the other posts.
 
SteveFarndon2000 (Managing Director, Author) commented:
From vadimrapp1:
<< But you can't have that "exactly". Print Preview is page-oriented, and page  header and page footer appear on each page. In HTML there are no pages. Let's say, you have 5-page Word document. In Print Preview, it will be 5 pages, each with header and footer. How is it supposed to look in HTML? Then there's a whole mountain of print options - depending on the printer and on the paper, ...>>

The point about the massive number of variations in the print options is spurious. The Print Preview option uses whatever paper size, margins, etc. (see 'Page Setup') have been set for that section. Yes? An HTML version of a page will do the same thing, i.e. use the current settings for the Word section. In fact, that seems to be exactly what Word's own 'Save as Webpage' option does, except for the bug that doesn't render any header/footer. The HTML could and should have a new <DIV> tag for each page of the Word document inside a single HTML page.
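
A rough sketch of the "one <DIV> per page" layout, in Python. It assumes the text has already been split into pages and that the header and footer text are known; it does not read anything from a Word file, and the function name, class names, and default page size (8.5 x 11 in with 1 in margins) are illustrative assumptions only.

```python
import html


def render_pages(pages, header, footer, width_in=8.5, height_in=11.0, margin_in=1.0):
    """Return one HTML document with a fixed-size <div class="page"> per page."""
    css = (
        ".page {{ position: relative; width: {w}in; height: {h}in; "
        "margin: 0.25in auto; padding: {m}in; box-sizing: border-box; "
        "border: 1px solid #999; overflow: hidden; }}\n"
        ".page-header {{ position: absolute; top: 0.4in; left: {m}in; right: {m}in; }}\n"
        ".page-footer {{ position: absolute; bottom: 0.4in; left: {m}in; right: {m}in; }}"
    ).format(w=width_in, h=height_in, m=margin_in)
    out = [
        "<!DOCTYPE html>",
        "<html><head><meta charset='utf-8'>",
        f"<style>\n{css}\n</style>",
        "</head><body>",
    ]
    total = len(pages)
    for number, body in enumerate(pages, start=1):
        # The header and footer are repeated inside every page div.
        out.append('<div class="page">')
        out.append(f'<div class="page-header">{html.escape(header)}</div>')
        out.append(f'<div class="page-body">{html.escape(body)}</div>')
        out.append(
            f'<div class="page-footer">{html.escape(footer)} (page {number} of {total})</div>'
        )
        out.append("</div>")
    out.append("</body></html>")
    return "\n".join(out)
```

The hard part, deciding which text falls on which page for a given Page Setup, is exactly what this sketch takes as given.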

I see that you are all waiting for comments/points allocation. I'm back after being away for a while and will add more comments v. soon. Thanks for all your comments so far.
 
Jason C. Levine (No one) commented:
Steve,

We all agree about what a conversion SHOULD do, but the sad fact is that no one has made a tool that does a good (or even tolerable) job of it.  
 
rspahitz commented:
Jason, that's partly true.  The real problem is that different people have different definitions of what is a good job.  For example, if it's simply taking the text and putting bold and italic tags as needed, that's probably been done.  When you get more complex and start adding floating images inside complex tables with formulas and headings that reference page numbers along with some macros, I'm not sure that it's practical to build a tool to handle all that.
At some point, "good enough" is probably fine.  I think the real problem is that for anything beyond a basic document, I haven't found any good tools.  That means writing a custom solution for those cases, and that's not likely to be done on a free/cheap site like EE.
 
Vadim Rapp commented:
> The Print Preview option uses whatever paper size, margins etc (see 'Page Setup') have been set for that section. yes?

Not quite. See what I mean in this screencast I made. Note how the selection of Zebra label printer, with its narrow page, results in different text flow, so the text fits the page. And with yet another Zebra printer, it showed no text at all, for whatever reason.

vadimrapp1-476073.flv

As for the page headers and footers, as I said, I don't think it's a bug: you can't have a page header when you don't have the page itself. Word does not show them in Normal view or in Web page view either, for the same reason: there's no page.

> In fact that seems to be exactly what Word's own 'Save as Webpage' option does

If you save as webpage, long paragraphs will be as wide as possible, for the same reason - with no pages, hence no page width, there's no reason to go to the next line.
 
Vadim Rapp commented:
> We all agree about what a conversion SHOULD do, but the sad fact is that no one has made a tool that does a good (or even tolerable) job of it.  

If the objective is having the page look as close to the print preview as possible, then isn't PDF exactly that?
 
rspahitz commented:
PDF is good for "near exact" rendering, but HTML with CSS was designed to do the same.

The issue here goes back to the fact that HTML is about content, not presentation.  As such, there is no real concept of a page.  With no page, there's no such thing as a page header and footer.

Steve, what you seem to be asking for is a way to dynamically modify an HTML document so that, given certain printer criteria, the document will be re-rendered with DIVs to show header and footer information.
I think that's a great idea! :)
I don't know of any tools that offer that. :(
Of course, you wouldn't want to embed those DIVs in your original because if you changed the margins, it would have to remove them and rebuild them, so you'll have the issue of handling that in a temp file that gets built from the original, then the temp file gets submitted to the printer.  For small documents (most HTML pages) that would work reasonably fast.
 
Jason C. Levine (No one) commented:
Hah!  I finally remembered/figured out the name of the conversion tool that did a decent job, but it's not exactly what's being asked for.

http://www.docudesk.com/deskUNPDF-PRO-PDF-Converter.shtml

We used it to go from PDF to HTML, but it should be trivial to go from DOC->PDF->HTML automatically.  This thing does attempt to render a perfect (margins and headers/footers included) version in HTML but the lines of text will all be in separate divs.  

 
Guy Hengel [angelIII / a3] (Billing Engineer) commented:
This question has been classified as abandoned and is closed as part of the Cleanup Program. See the recommendation for more details.