Problems importing HTML from a browser into Word

Hi,

For various reasons, I need to import HTML into a Word document. I display the HTML in Chrome, select the entire page, then paste it into Word. It works great, except for two problems I need help with.

First, if a word is surrounded by bold tags, when it is pasted into Word, it is accurately bolded, but it is surrounded now by non-breaking spaces, which sometimes make a small mess of the layout in Word.

For example, this page has two short sentences, each with a word bolded, one with <b> and the other with CSS:
http://www.App33.com/bold.html

If you load that page in Chrome, select all, and paste into Word, you'll see the non-breaking spaces surrounding the bolded words. Is there a way to prevent the non-breaking spaces? Or identify specifically the non-breaking spaces that were wrongly created like that? I know I could search/replace a non-breaking space with a regular space, but that would interfere with other places where I *want* non-breaking spaces.

If I show the html page in Internet Explorer and paste into Word, the problem does not exist, but for a multitude of reasons I need to use Chrome.

The other problem is that sometimes two words that are separate in the HTML will run together without a space in Word. It happens only infrequently, and always one of the words is bolded and the other is not bolded. In a 50-page document, this might happen 5 or 6 times.

I have only minimal skills in VBA, but I thought someone might be able to write a tiny program that would find instances where a non-bold character is immediately adjacent to a bold character. That would allow me to do a quick search and fix all of the instances rather than scouring the entire document visually.

Thanks in advance for any help on these issues.

aikimark

You can do this manually.
1. find/replace all ^s with ^s (font: bold)
2. position the cursor at the start of the document
3. find ^$^s (font: bold)

You can repeat step three by clicking the Find Next button on the dialog.
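
For anyone who prefers a macro to the manual steps above, here is a minimal sketch. It is not part of the original reply and it takes a slightly different route: instead of reformatting every non-breaking space as bold, it leaves formatting alone and selects the next non-breaking space whose preceding character is bold. The name FindNbspAfterBold is hypothetical, and the macro assumes the text of interest is in the main body of the active document.

```vba
Option Explicit

' Hypothetical helper (not from the thread): selects the next non-breaking
' space that directly follows a bold character, searching forward from the
' end of the current selection.
Sub FindNbspAfterBold()
    Dim rng As Range
    Set rng = Selection.Range
    rng.Collapse wdCollapseEnd          ' start searching after the selection

    With rng.Find
        .ClearFormatting
        .Text = "^s"                    ' ^s = non-breaking space
        .MatchWildcards = False
        .Forward = True
        .Wrap = wdFindStop              ' stop at the end of the document
    End With

    Do While rng.Find.Execute
        If rng.Start > 0 Then
            ' Check the single character just before the match.
            If ActiveDocument.Range(rng.Start - 1, rng.Start).Font.Bold = True Then
                rng.Select
                Exit Sub
            End If
        End If
        rng.Collapse wdCollapseEnd      ' continue after this match
    Loop

    MsgBox "No further non-breaking space after bold text was found."
End Sub
```

Because the search starts at the end of the selection, running the macro again moves on to the next match.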

StevenMiles

ASKER

Hi, aikimark,
Thank you for responding. Maybe I'm not understanding, but I don't think this helps me. If I replace all of the nbsp's with bold-nbsp's, and then search for a bold character adjacent to a bold nbsp, I've just located every place where Word wrongly imported nbsp's around a bold word. But there are nbsp's that I *want* that are adjacent to bold words, too.

And that doesn't address the more aggravating problem: that sometimes I'll have two words together, like appleorange, and "apple" is not bold but "orange" is bold. Do you have a way to find those instances?

aikimark

The first part of the problem you described is finding non-breaking space characters that follow a bold-faced word or character. I'm providing a means to find those in your document. You would skip non-breaking space characters that follow a non-bold character.

The find text is <*>
with wildcards enabled.

I did not address the second part of your problem. That will require some code. This code will find and select mixed formatted words. It starts with the current cursor position, so you'll need to move the cursor after each invocation.

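Option Explicit

' Selects the first word at or after the cursor whose characters are not all
' bold or all non-bold (for example, "appleorange" with only "orange" bold).
Sub FindMixed()
    Dim oWd As Range
    ' Walk every word from the cursor position to the end of the document.
    For Each oWd In ActiveDocument.Range(Selection.Start, ActiveDocument.Range.End).Words
        ' Font.Bold is wdUndefined when a range mixes bold and non-bold text.
        If oWd.Font.Bold = wdUndefined Then
            ' Log the word to the Immediate window, then select it in the document.
            Debug.Print oWd.Text, oWd.Font.Bold
            oWd.Select
            Exit For
        End If
    Next
End Sub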
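
Since the macro above has to be re-aimed by moving the cursor after each hit, a possible variant is sketched below. It is an illustration, not part of the original answer: it starts from the end of the current selection, so it can be run repeatedly, and it reports when nothing more is found. FindNextMixed is a hypothetical name.

```vba
Option Explicit

' Hypothetical variant of FindMixed: searches from the END of the current
' selection, so repeated runs step through the document hit by hit.
Sub FindNextMixed()
    Dim rng As Range
    Dim oWd As Range

    Set rng = ActiveDocument.Content
    If Selection.End >= rng.End Then
        MsgBox "Already at the end of the document."
        Exit Sub
    End If
    rng.Start = Selection.End           ' search only the text after the selection

    For Each oWd In rng.Words
        ' wdUndefined means the word mixes bold and non-bold characters.
        If oWd.Font.Bold = wdUndefined Then
            oWd.Select
            Exit Sub
        End If
    Next

    MsgBox "No more mixed-format words found."
End Sub
```

Assigning the macro to a keyboard shortcut or toolbar button makes it behave much like a Find Next command.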

StevenMiles

ASKER

Hi again,
Yeah, I think part of the problem was that I didn't explain myself well enough.

Your code for finding mixed formatted words is exactly right.

Now I think I can describe the issue of the nbsp's better, and I'll bet some short code segment will fix that, too. See if the following makes sense:

The imported document puts nbsp's next to bold words, but there are also bold words next to *nbsp's that I want to keep*, so the find operation we have been discussing will find lots of instances that I don't want to change.

But: I do know exactly what the text is for all of the nbsp's that I want to *keep*, so I could just replace EVERY nbsp with a regular space, and then do a search/replace of, for example, "Bill Smith" with "BillnbspSmith".

The problem with the blanket replacement is that there are nbsp's in *tables* that are needed to create layout spacing, so I can't just do a blanket replacement of all nbsp's with regular spaces. BUT, I *can* replace all nbsp's *that are adjacent to a character* with regular spaces. See? If a nbsp is adjacent to a letter, I can freely replace it with a regular space, and that would solve the problem. Can you write a snippet that will find and replace those instances?
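
To make the request concrete, here is a rough sketch of the kind of macro being asked for. It is an illustration only and is not the accepted answer. It finds each non-breaking space in the main body of the document and replaces it with a regular space only when the character immediately before or after it is a letter. The names ReplaceNbspNextToLetters and IsLetter are hypothetical, and "letter" is assumed here to mean unaccented A-Z or a-z.

```vba
Option Explicit

' Hypothetical macro (not from the thread): replaces a non-breaking space
' with a regular space only when a letter sits directly before or after it.
Sub ReplaceNbspNextToLetters()
    Dim rng As Range
    Dim prevCh As String
    Dim nextCh As String

    Set rng = ActiveDocument.Content
    With rng.Find
        .ClearFormatting
        .Text = "^s"                    ' ^s = non-breaking space
        .MatchWildcards = False
        .Forward = True
        .Wrap = wdFindStop              ' stop at the end of the document
    End With

    Do While rng.Find.Execute
        prevCh = ""
        nextCh = ""
        If rng.Start > 0 Then
            prevCh = ActiveDocument.Range(rng.Start - 1, rng.Start).Text
        End If
        If rng.End < ActiveDocument.Content.End Then
            nextCh = ActiveDocument.Range(rng.End, rng.End + 1).Text
        End If

        If IsLetter(prevCh) Or IsLetter(nextCh) Then
            rng.Text = " "              ' swap in a regular space
        End If
        rng.Collapse wdCollapseEnd      ' continue after this position
    Loop
End Sub

' Assumption: only unaccented ASCII letters count as letters.
Private Function IsLetter(ByVal ch As String) As Boolean
    IsLetter = (ch Like "[A-Za-z]")
End Function
```

Non-breaking spaces that sit between other non-breaking spaces, or next to digits or punctuation, are left untouched. The spaces to keep (such as the one in "Bill Smith") would then be restored with an ordinary find/replace afterwards. Running it on a copy of the document first is advisable.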

ASKER CERTIFIED SOLUTION

aikimark

This solution is only available to members.