How can I determine whether a given string is a single HTML/XML token?

Hi Guys,

I am trying to confirm whether or not a string token is, in and of itself, an HTML or XML token. To match, it must start with '<', end with '>', and contain zero or more slash characters followed by one or more additional characters. There must be no additional '<' or '>' characters within the string. Examples of matching strings:

<b>
<BR>
<html xmlns="http://www.w3.org/1999/xhtml" >

I have done this so far using the following regex:
@"^<\/*.>$"


Unfortunately, whilst this is generally OK, it does not handle the edge cases where the string starts with one tag and ends with another, e.g.:

<b></b>
<b>This text will be displayed in bold</b>

So, how can I cause the regex to match only a single tag, and return no match if there is more than one tag in the string?

I am not necessarily stuck on using regex if there is a better alternative suggestion...

Chris Bray
chrisbray Asked:
 
chrisbray (Author) Commented:
In the end, I gave up on the regex and used string handling and a little LINQ to provide the answer:

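// Requires: using System.Linq; (for Count with a predicate)
// True only when the string is one bracketed tag: exactly one '<' and one '>'.
return str.Length > 2 && str.StartsWith("<") && str.EndsWith(">")
    && str.Count(c => c == '<') == 1 && str.Count(c => c == '>') == 1;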


This meets all the tests devised whilst being pretty quick in normal usage.  I hope that this is helpful to someone else faced with a similar issue.
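
For anyone wanting to try the check in isolation, here is a minimal self-contained sketch. The class name `TagCheck`, the method name `IsSingleTag`, and the null guard are illustrative assumptions; only the return expression comes from the post above.

```csharp
using System;
using System.Linq;

public static class TagCheck
{
    // Hypothetical wrapper around the expression from the post.
    public static bool IsSingleTag(string str)
    {
        // Assumption: a null string is treated as "not a tag".
        if (str == null) return false;

        // Must be longer than "<>", be wrapped in angle brackets,
        // and contain exactly one '<' and exactly one '>'.
        return str.Length > 2 && str.StartsWith("<") && str.EndsWith(">")
            && str.Count(c => c == '<') == 1 && str.Count(c => c == '>') == 1;
    }
}
```

With this sketch, `TagCheck.IsSingleTag("<b>")` and `TagCheck.IsSingleTag("<BR>")` return true, while `"<b></b>"`, `"<b>test</b>"` and `"<BR><BR>"` return false because each contains more than one '<'.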

Chris Bray
 
Bardobrave Commented:
What about something like this?

@"^<\/*.>(<\/*.>)*$"

This adds the possibility of one or more additional tags after the one your current pattern matches.
 
chrisbray (Author) Commented:
Hi Bardobrave,

That solves one of the edge cases, but not the other.  This still fails the test by returning a match when it shouldn't:

<b></b>

However, it does not provide a match for this one:

<BR><BR>

What makes it worse is that it breaks one of the working ones:

<BR>

This is a valid tag, but is reported as not matching when using your regex.

Chris Bray
 
chrisbray (Author) Commented:
I have found an issue in my starting regex, which should have a + to match one or more other characters. It was only matching a single character before, so <BR> and longer tags did not match. Here is the new regex, which works APART from the edge cases reported previously:

@"^<\/*.+>$"

To test for yourself if a proposed regex is working, these are the failing edge cases:

<b>test</b>
<b></b>
<BR><BR>

In each case the problem is that the string opens and closes with a tag.  If the string does not close with a tag it works fine.
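
One likely explanation is that `.+` also matches '<' and '>', so it happily spans everything between the first '<' and the last '>'. A possible regex-only fix, sketched below, swaps `.+` for the negated character class `[^<>]+`, which forbids any further angle brackets inside the tag. The pattern and the names `TagRegex` and `IsSingleTag` are assumptions for illustration and were not proposed in this thread. Note that, unlike `.`, the negated class also matches newline characters.

```csharp
using System.Text.RegularExpressions;

public static class TagRegex
{
    // Assumed pattern (not from the thread): '<', then one or more characters
    // that are neither '<' nor '>', then '>' at the end of the string.
    private static readonly Regex SingleTag = new Regex(@"^<[^<>]+>$");

    // Hypothetical helper; a null input is treated as "no match".
    public static bool IsSingleTag(string str)
    {
        return str != null && SingleTag.IsMatch(str);
    }
}
```

Under these assumptions, `<b>`, `<BR>`, `</b>` and the long `<html ...>` example match, while the three failing edge cases listed above do not, because each contains a second '<' that `[^<>]+` cannot consume.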

Chris Bray
 
chrisbray (Author) Commented:
No solution of any kind was forthcoming, and my experiments with regex were unsuccessful in eliminating the edge cases, so I created an answer of my own.
Question has a verified solution.
