Regex HTML Validation

I am using a regular expressions replace in C# to validate some html. It seems the HTML has opening <td.*?> but in some cases, the closing </td> tag has been omitted.

My goal is to find occurances of <td> tags (with/or without attributes), and add the closing tag before the next opening <td.*?> tag.
This will be exectued on multiple sets of HTML.

html snippet:
<tr class="odd">
...
<td class="tright">0</td>
<td class="tright">-</td>
</tr>
<tr class="odd">
<td class="left"><a href="/test/test.html?unique_id=1270">UNIQUE</a></td>
<td class="left"><a href="/test/test.html?unique_id=1271">UNIQUE</a></td>
<td class="left"><a href="/test/test.html?unique_id=1272">UNIQUE</a>    <------ Error is here
<td class="right">0</td><td class="right">0</td>

The expressions I have built have not worked. I won't include my expressions with this question yet, just in case it adds any additional complications.

Would any regex guru please offer me some advice?

Thanks in advance.

LVL 10
rivusglobalAsked:
Who is Participating?

[Webinar] Streamline your web hosting managementRegister Today

x
 
ozoConnect With a Mentor Commented:
#!/usr/bin/perl
$_='
<tr class="odd">
...
<td class="tright">0</td>
<td class="tright">-</td>
</tr>
<tr class="odd">
<td class="left"><a href="/test/test.html?unique_id=1270">UNIQUE</a></td>
<td class="left"><a href="/test/test.html?unique_id=1271">UNIQUE</a></td>
<td class="left"><a href="/test/test.html?unique_id=1272">UNIQUE</a>    <------ Error is here
<td class="right">0</td><td class="right">0</td>
';
print "error:$1\n" if m#(<td\b[^>]*>((?!</td\s*>).)*(<td\b|$))#si;
0
 
adg080898Connect With a Mentor Commented:
This is a common question with regular expressions. What you ask is too complicated to be done reliably with regular expressions.

What would happen if one of the table cells itself included a table? Regexes would not (and could not) handle it and it would screw up.

For something this complicated, you should have a full blown parser. However, it is possible to do what you ask, but not simple...

The only way to reliably handle it is to maintain a stack. Scan forward for ANY html tag, and push an entry to the stack on an opening tag, and pop from the stack on a closing tag. (You would need to know which tags required closing tags - some tags don't have closing tags). This way, whenever you encounter a new tag, you could look examine the stack and make sure that the next tag is appropriate.

In your example, upon encountering a "<td", "</table", or "<tr" tag, you would make sure that the top of the stack is not "<td". If it was, you could emit the required "</td" tag.
0
 
HonorGodCommented:
Basically, you are asking "is the input 'valid HTML'?"
As adq stated, you can not identify this with regular expressions.
You would need to parse the input, and you really need an HTML validator.
0
The new generation of project management tools

With monday.com’s project management tool, you can see what everyone on your team is working in a single glance. Its intuitive dashboards are customizable, so you can create systems that work for you.

 
Ralf KlattPrincipal ConsultantCommented:
Hi,

@adg ...

... you're definitely right with your statement ... I've subscribed to this question about an hour ago just to see if there'd be a way to handle this in a different way I'd resumed already ... reading your answer made me think about the Visual Studio development environment ... how can the Visual Studio development environment tell me that I forgot a "TD" ending tag? ... Visual Studio 2005 will not even compile a web project if I forgot to end a table definition ...

... makes me think that there must be a way ...

... what I did was to have a closer look at a "well formed" web page (a large one) of my own ... and to tell ... a count on "<td" brought 14.777 results and a count on "</td>" brought only 15.034 results ... on the same (well formed!!!) page ...

... it's not that I wouldn't have an idea on getting deeper into a possible solution ... it's just that I'd just paste html code like the one shown into a (somehow intelligent) html editor ... just waiting that it shows where errors occur!


Best regards,
Raisor
0
 
rivusglobalAuthor Commented:
adg,

Thanks for your response, I will implement a stack base solution to validate this code rather than using regular expressions.

Unfortunately, the HTML my code is receiving programatically has errors, and with the amount of HTML needed to be parsed with other regular expressions, I have to ensure the HTML is properly formed.

I thought someone might have some ninja RegEx replace solution, seeing how the stack implementation is more in-depth.

Here's was my original, and unsuccessful RegEx approach:

@"(<td.*?>.*?)[^</td>].*?<td", @"$1</td><td"
0
 
adg080898Commented:
If you only want to validate your pages (not actually develop your own validator), I suggest you look at:

http://validator.w3.org/

It might be easier to make a C# program that generates a webpage that uses the file upload service and lets the above site do the validating.

They even have a CSS validator:

http://jigsaw.w3.org/css-validator/
0
 
adg080898Commented:
I am just starting to learn C# (the language is easy to learn (when you already know C++), the problem is the enormous .net library!).

Perhaps there are some parsing libraries in .net you can use to drastically simplify your problem.

You might want to post to a question about parsing in the C# area on EE:

http://www.experts-exchange.com/Programming/Programming_Languages/C_Sharp/


0
 
Ralf KlattPrincipal ConsultantCommented:
Hi,

The problem described is a logic one ... if you're the "creator" of the html code you'll place the table definition tags programmatically -> meaning, once implemented there'll never be a missing closing tag ... if you leverage foreign code [written by someone else] you'll come to a "downlevel-issue" as adg has very well explained!

A good plan would be to split your html file into parts:

1. Take off the header section
2. Take off the body section
3. Count the tables in the body section
4. Split off each table
5. Count each TR in each of the tables
6. Split off each TR in each of the tables
7. Count each TD in each of your TRs in each of your tables
8. Perform a "InString" search in each TD element for a closing ("</td>") element
9. When you've reached the missing one you'll have the count state
10. On the "count state basis" you'll be able to add a "</td>" string right before the next starting "<td>" element (meaning to add it to the end of the splitted indexed chunk)

... a good thing would be to always use te "low case" function when comparing ... as some elements might be written differently ... some low case and some others upper case ...


Hope this helps!
Best regards,
Raisor

0
 
rivusglobalAuthor Commented:
Thanks for everyone's time and suggestions.

ozo, you are a ninja. I modified your example to fit our needs and it's running exactly as anticipated.

Here's the resulting C# expression we used.

_tdFix = Regex.Replace(_tdFix, @"(<td\b[^>]*>((?!</td\s*>).)*)(<td\b|$)", @"$1</td><td", RegexOptions.IgnoreCase);

See you in the Perl TA.

Salute!
0
All Courses

From novice to tech pro — start learning today.