Problems importing HTML from a browser into Word


For various reasons, I need to import HTML into a Word document.  I display the HTML in Chrome, select the entire page, then paste it into Word. It works great, except for two problems I need help with.

First, if a word is surrounded by bold tags, when it is pasted into Word, it is accurately bolded, but it is surrounded now by non-breaking spaces, which sometimes make a small mess of the layout in Word.

For example, this page has two short sentences, each with a word bolded, one with <b> and the other with CSS:

If you load that page in Chrome, select all, and paste into Word, you'll see the non-breaking spaces surrounding the bolded words.  Is there a way to prevent the non-breaking spaces?  Or identify specifically the non-breaking spaces that were wrongly created like that? I know I could search/replace a non-breaking space with a regular space, but that would interfere with other places where I *want* non-breaking spaces.

If I show the html page in Internet Explorer and paste into Word, the problem does not exist, but for a multitude of reasons I need to use Chrome.

The other problem is that sometimes two words that are separate in the HTML will run together without a space in Word.  It happens only infrequently, and always one of the words is bolded and the other is not bolded.  In a 50-page document, this might happen 5 or 6 times.

I have only minimal skills in VBA, but I thought someone might be able to write a tiny program that would find instances where a non-bold character is immediately adjacent to a bold character.  That would allow me to do a quick search and fix all of the instances rather than scouring the entire document visually.

Thanks in advance for any help on these issues.

You can do this manually.
1. Find/replace all ^s with ^s (font: bold)
2. Position the cursor at the start of the document
3. Find ^$^s (font: bold)

You can repeat step three by clicking the Find Next button on the dialog.
StevenMiles (Author) commented:
Hi, aikimark,
Thank you for responding.  Maybe I'm not understanding, but I don't think this helps me.  If I replace all of the nbsp's with bold-nbsp's, and then search for a bold character adjacent to a bold nbsp, I've just located every place where Word wrongly imported nbsp's around a bold word.  But there are nbsp's that I *want* that are adjacent to bold words, too.

And that doesn't address the more aggravating problem: that sometimes I'll have two words together, like appleorange, and "apple" is not bold but "orange" is bold.  Do you have a way to find those instances?
The first part of the problem you described is finding non-breaking space characters that follow a bold-faced word or character.  I'm providing a means to find those in your document.  You would skip non-breaking space characters that follow a non-bold character.

The find text is <*>
with wildcards enabled.
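
The search described above (a non-breaking space that directly follows a bold character) could also be scripted. The following is only a rough sketch, not the method from this thread: the macro name FindNbspAfterBold is made up for illustration, and it assumes that only the character immediately before the non-breaking space matters. It starts at the end of the current selection, so running it again moves on to the next hit.

```vba
' Hypothetical helper (name invented for illustration).
' Selects the next non-breaking space whose preceding character is bold.
Sub FindNbspAfterBold()
    Dim rng As Range

    ' Search from the end of the current selection to the end of the document.
    Set rng = ActiveDocument.Range(Selection.End, ActiveDocument.Content.End)
    With rng.Find
        .ClearFormatting
        .Text = "^s"              ' ^s is Word's code for a non-breaking space
        .Forward = True
        .Wrap = wdFindStop
        .MatchWildcards = False
    End With

    Do While rng.Find.Execute
        If rng.Start > 0 Then
            ' Test the single character just before the non-breaking space.
            If ActiveDocument.Range(rng.Start - 1, rng.Start).Font.Bold = True Then
                rng.Select
                Exit Sub
            End If
        End If
        ' Not a match: continue searching after this non-breaking space.
        rng.Collapse wdCollapseEnd
    Loop

    MsgBox "No more non-breaking spaces after bold text."
End Sub
```

Non-breaking spaces that follow non-bold text are skipped, as described above. Whether a selected one should be kept is still a manual decision.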

I did not address the second part of your problem. That will require some code.  This code will find and select mixed formatted words. It starts with the current cursor position, so you'll need to move the cursor after each invocation.
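Option Explicit

Sub FindMixed()
    Dim oWd As Range
    ' Scan each word from the cursor position to the end of the document.
    For Each oWd In ActiveDocument.Range(Selection.Start, ActiveDocument.Range.End).Words
        ' Font.Bold is wdUndefined when a range mixes bold and non-bold text.
        If oWd.Font.Bold = wdUndefined Then
            Debug.Print oWd.Text, oWd.Font.Bold
            oWd.Select  ' select the mixed word so it can be fixed
            Exit For
        End If
    Next oWd
End Sub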
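
Because FindMixed stops at the first hit, a one-pass variant may be handier for a long document. The sketch below is an assumption-laden alternative, not part of the original answer: the name HighlightMixedWords is invented, and it ignores the trailing space that Word attaches to each item in the Words collection so that a bold word followed by a non-bold space is not reported.

```vba
' Hypothetical helper (name invented for illustration).
' Highlights every word that mixes bold and non-bold characters,
' e.g. "appleorange" where only "orange" is bold.
Sub HighlightMixedWords()
    Dim oWd As Range
    Dim oCore As Range
    Dim nHits As Long

    For Each oWd In ActiveDocument.Words
        ' Work on a copy and trim trailing regular/non-breaking spaces,
        ' so their formatting does not count toward the test.
        Set oCore = oWd.Duplicate
        oCore.MoveEndWhile Cset:=" " & ChrW(160), Count:=wdBackward

        ' Skip items that were nothing but spaces.
        If oCore.Start < oCore.End Then
            ' wdUndefined means the range contains both bold and non-bold text.
            If oCore.Font.Bold = wdUndefined Then
                oCore.HighlightColorIndex = wdYellow
                nHits = nHits + 1
            End If
        End If
    Next oWd

    MsgBox nHits & " mixed-format word(s) highlighted."
End Sub
```

After running it, the yellow highlights mark the places to check. Highlighting can be cleared afterwards from the Home tab.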


StevenMiles (Author) commented:
Hi again,
Yeah, I think part of the problem was that I didn't explain myself well enough.

Your code for finding mixed formatted words is exactly right.

Now I think I can describe the issue of the nbsp's better, and I'll bet some short code segment will fix that, too. See if the following makes sense:

The imported document puts nbsp's next to bold words, but there are also bold words next to *nbsp's that I want to keep*, so the find operation we have been discussing will find lots of instances that I don't want to change.

But: I do know exactly what the text is for all of the nbsp's that I want to *keep*, so I could just replace EVERY nbsp with a regular space, and then do a search/replace of, for example, "Bill Smith" with "BillnbspSmith".

The problem with the blanket replacement is that there are nbsp's in *tables* that are needed to create layout spacing, so I can't just do a blanket replacement of all nbsp's with regular spaces.  BUT, I *can* replace all nbsp's *that are adjacent to a character* with regular spaces.  See?  If a nbsp is adjacent to a letter, I can freely replace it with a regular space, and that would solve the problem.  Can you write a snippet that will find and replace those instances?
Enable wildcards and do a find/replace all of
with a single space character.
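
For anyone who prefers a macro to the Find and Replace dialog, the same idea (swap a non-breaking space for a regular space only when it touches a letter) might look roughly like this. It is a sketch under stated assumptions rather than the pattern given in this answer: the name ReplaceNbspNextToLetters is invented, and "letter" is assumed to mean A-Z or a-z only, so accented letters and digits are not treated as neighbors.

```vba
' Hypothetical helper (name invented for illustration).
' Replaces a non-breaking space with a regular space when the character
' immediately before or after it is a letter (assumed here: A-Z, a-z).
Sub ReplaceNbspNextToLetters()
    Dim doc As Document
    Dim rng As Range
    Dim prevCh As String
    Dim nextCh As String

    Set doc = ActiveDocument
    Set rng = doc.Content
    With rng.Find
        .ClearFormatting
        .Text = "^s"              ' ^s is Word's code for a non-breaking space
        .Forward = True
        .Wrap = wdFindStop
        .MatchWildcards = False
    End With

    Do While rng.Find.Execute
        prevCh = ""
        nextCh = ""
        If rng.Start > 0 Then
            prevCh = doc.Range(rng.Start - 1, rng.Start).Text
        End If
        If rng.End < doc.Content.End Then
            nextCh = doc.Range(rng.End, rng.End + 1).Text
        End If

        ' Only touch non-breaking spaces that sit next to a letter.
        If prevCh Like "[A-Za-z]" Or nextCh Like "[A-Za-z]" Then
            rng.Text = " "
        End If

        ' Continue after this position whether or not it was replaced.
        rng.Collapse wdCollapseEnd
    Loop
End Sub
```

Runs of non-breaking spaces used for table layout are left alone unless one of them touches a letter, in which case only that one is changed. Try it on a copy of the document first.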
