Regex to match a term only when it is not inside HTML tags

Amkick asked:
Hi there,

I want to match occurrences of a term that do not appear within XHTML tags. So in the following example I want to match only the second occurrence of "font":

<font color="red">font might be red</font>

Of course, I need to be flexible, and "font" might be replaced by any other term. I just want matches that are not within an XHTML tag <....>

I tried this:
rex = new Regex("(?:?<=<)(?:?!>)\\b" + strHighlight + "\\b(?:?!=>)(?:?=<)", RegexOptions.IgnoreCase);

But that does not seem to work, probably because it is not the right way to say that strHighlight should not be within a <....> tag...

Thanks for your help!

Amkick

kaufmed (Most Valuable Expert 2011, Top Expert 2015)

Commented:
If the text you are searching for is always just the tag name, then you should be able to use the following:

(?<!</?)font
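
To see what this lookbehind does and does not exclude, here is a minimal sketch written as a C# top-level program. The variable names are mine, and the second sample string is the one the asker posts later in the thread. The pattern rejects "font" only when "<" or "</" sits directly before it, so tag names are skipped but attribute values are not.

```csharp
using System;
using System.Text.RegularExpressions;

// Pattern as posted above: reject "font" only when "<" or "</" comes right before it.
var tagNameOnly = new Regex(@"(?<!</?)font");

string sample1 = "<font color=\"red\">font might be red</font>";
string sample2 = "<tag name=\"whatever font you like\"/> stuf about a font and other stuff.";

int count1 = tagNameOnly.Matches(sample1).Count;
int count2 = tagNameOnly.Matches(sample2).Count;

Console.WriteLine($"sample1: {count1}, sample2: {count2}");
// sample1: 1, sample2: 2
```

The second count includes the "font" inside the name attribute, which is why this pattern only helps when the search term is always the tag name itself.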

Terry Woods, IT Guru (Most Valuable Expert 2011)

Commented:
Try pattern:
font(?![^<>]*>)

Author

Commented:
Hi guys, thanks for the comments. I guess I did not make myself clear. I am implementing a highlighting function that marks found terms within an HTML document. As users can query anything, they might also query words that appear between < and >. I don't want those occurrences to be matched. So I am looking for a single regex that matches only the second occurrence of the word font, both in the example above and in:

<tag name="whatever font you like"/> stuf about a font and other stuff.

The word font is a variable in my script.

Terry Woods

Commented:
I'm clear on that - I thought you would just modify my pattern to include a variable like this:

rex = new Regex(strHighlight + "(?![^<>]*>)", RegexOptions.IgnoreCase);

kaufmed

Commented:
My suggestion would be:

rex = new Regex("(?i)(?!<<[^<]*)\bfont\b(?![^>]*>)")

kaufmed

Commented:
Nix that.

Terry Woods

Commented:
Kaufmed's post just now reminded me that the word boundaries are important - this would give:

rex = new Regex("\\b" + strHighlight + "\\b(?![^<>]*>)", RegexOptions.IgnoreCase);
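
As a rough illustration of how this pattern could drive the highlighting, here is a minimal sketch written as a C# top-level program. The helper name BuildHighlighter and the [highlight] markers in the replacement are assumptions for the example, and the call to Regex.Escape is my addition (it is not in the thread) so that a user-supplied term containing regex metacharacters is treated literally. Note that \b only behaves as intended when the term starts and ends with a word character.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper (not from the thread): builds the "not inside a tag" regex for any term.
// The lookahead (?![^<>]*>) rejects a match when the next angle bracket after it is ">".
Regex BuildHighlighter(string term) =>
    new Regex(@"\b" + Regex.Escape(term) + @"\b(?![^<>]*>)", RegexOptions.IgnoreCase);

var rex = BuildHighlighter("font");
string html = "<font color=\"red\">font might be red</font>";

// Wrap each accepted match in illustrative highlight markers.
string marked = rex.Replace(html, m => "[highlight]" + m.Value + "[/highlight]");

Console.WriteLine(marked);
// <font color="red">[highlight]font[/highlight] might be red</font>
```

Only the occurrence in the text content is wrapped; the tag names in <font ...> and </font> are left alone because a ">" follows them before any "<".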

kaufmed
Commented:
You might try this, but I still can't guarantee it. HTML is difficult to parse with regex.

(?i)(?<=<(?=(\S+))[^<]*>[^>]*)\bfont\b(?=(?:[^<]*</\1>)?)

kaufmed

Commented:
Terry,

Yours would fail to find the second "font" in:

<tag name="whatever font you like"/> the size of the font is > 5

There's probably little chance of receiving such text, though.

Terry Woods

Commented:
Wouldn't the > 5 actually be &gt; 5 ?

kaufmed

Commented:
Possibly. It depends on the doctype of the HTML document, I believe, as to whether an unencoded gt is allowed. If that's the case, then it's fine--as you know  ; )
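
The edge case discussed here can be checked with a small sketch, written as a C# top-level program. It uses Terry's pattern with the term hard-coded, and the variable names are mine. With a raw ">" in the text, the lookahead sees a ">" before any "<" and rejects the word; with "&gt;" the word is found.

```csharp
using System;
using System.Text.RegularExpressions;

// Terry's pattern with the search term fixed to "font".
var rex = new Regex(@"\bfont\b(?![^<>]*>)", RegexOptions.IgnoreCase);

string raw = "<tag name=\"whatever font you like\"/> the size of the font is > 5";
string encoded = "<tag name=\"whatever font you like\"/> the size of the font is &gt; 5";

int rawCount = rex.Matches(raw).Count;
int encodedCount = rex.Matches(encoded).Count;

Console.WriteLine($"raw: {rawCount}, encoded: {encodedCount}");
// raw: 0, encoded: 1
```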

Terry Woods

Commented:
kaufmed, your latest pattern requires a tag to be found before and after the word being searched for - that's ok for complete HTML pages, but not so good if we're doing something like searching the user defined content of a CMS.

Ironically, I think your latest pattern will fail if you have a > character (must be > rather than &gt;) *before* the word you're looking for:
<tag name="whatever font you like"/> the size of the something is > 5 but I want to find font please

One other thing I'm not sure about: are (X)HTML tags valid if they have leading spaces inside the tag, e.g. < strong>?
You'd just need to add a \s* to your pattern to fix that though.

There's nothing like peer review to keep you on your toes... lol

Terry Woods

Commented:
Oh, and does ASP.NET allow wildcards in lookbehinds? PHP doesn't, I believe.

kaufmed

Commented:
Terry Woods wrote: "Oh, and does ASP.NET allow wildcards in lookbehinds?"
Yes, it does.

kaufmed

Commented:
Terry Woods wrote: "Ironically, I think your latest pattern will fail if you have a > character (must be > rather than &gt;) *before* the word you're looking for"
I didn't check it extensively, but it seems to handle your suggested scenario:
[Screenshot: untitled.PNG]

Terry Woods

Commented:
Sorry - you're right, I see it does - this part of the pattern:
[^<]*>
matches:
name="whatever font you like"/> the size of the something is >

and:
[^>]*
matches:
5 but I want to find

(Was it intentional to work that way though?)

kaufmed

Commented:
No, I'd have to say that was dumb luck. There's probably some way to break my pattern.

Author

Commented:
Hmmm. As (even) you two are having doubts this will work, I have changed things to a more controlled situation. I now want to remove all highlight markers that appear within a name attribute. So:

<a name="this is a [highlight]test[/highlight]">[highlight]test[/highlight]

should come back as

<a name="this is a test">[highlight]test[/highlight]

because the regex would match on the first two occurrences of [/?highlight] only. Can you assist with this one too? Thanks so much.

Terry Woods
Commented:
Try replacing:
(?<=<\s*[a-z]+[^>]*name\s*=\s*"[^["]*)\[[^\]]*\]
with an empty string.
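
A minimal sketch of this replacement, written as a C# top-level program, might look as follows. One caveat, based on my own tracing rather than anything stated in the thread: the class [^["]* in the posted pattern cannot cross a "[", so in a single Replace pass it appears to remove only the opening marker and leave [/highlight] behind. The sketch therefore uses [^"]* instead, which is an assumption about the intent. The variable names are also mine.

```csharp
using System;
using System.Text.RegularExpressions;

// Variant of the posted pattern: [^"]* instead of [^["]* (my assumption about the intent).
// The variable-length lookbehind requires that the bracketed marker sits inside
// the still-open quoted value of a name attribute.
var stripInName = new Regex(@"(?<=<\s*[a-z]+[^>]*name\s*=\s*""[^""]*)\[[^\]]*\]");

string input = "<a name=\"this is a [highlight]test[/highlight]\">[highlight]test[/highlight]";
string cleaned = stripInName.Replace(input, "");

Console.WriteLine(cleaned);
// <a name="this is a test">[highlight]test[/highlight]
```

This removes any [...] group inside the name value, not only highlight markers, so it suits the controlled situation described above better than arbitrary input.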

Author

Commented:
Regexes with variable-length lookbehinds and such are the closest things to magic I know. My thanks go out to these two wizards.