Regex HTML Validation

I am using a regular expression replace in C# to validate some HTML. The HTML has opening <td.*?> tags, but in some cases the closing </td> tag has been omitted.

My goal is to find occurrences of <td> tags (with or without attributes) and add the missing closing tag before the next opening <td.*?> tag.
This will be executed on multiple sets of HTML.

html snippet:
<tr class="odd">
...
<td class="tright">0</td>
<td class="tright">-</td>
</tr>
<tr class="odd">
<td class="left"><a href="/test/test.html?unique_id=1270">UNIQUE</a></td>
<td class="left"><a href="/test/test.html?unique_id=1271">UNIQUE</a></td>
<td class="left"><a href="/test/test.html?unique_id=1272">UNIQUE</a> <------ Error is here
<td class="right">0</td><td class="right">0</td>

The expressions I have built have not worked. I won't include my expressions with this question yet, just in case it adds any additional complications.

Would any regex guru please offer me some advice?

Thanks in advance.

SOLUTION

adg080898

This solution is only available to members.

HonorGod

Basically, you are asking "is the input 'valid HTML'?"
As adg stated, you cannot identify this with regular expressions.
You would need to parse the input, and you really need an HTML validator.

Ralf Klatt

Hi,

@adg ...

... you're definitely right with your statement ... I've subscribed to this question about an hour ago just to see if there'd be a way to handle this in a different way I'd resumed already ... reading your answer made me think about the Visual Studio development environment ... how can the Visual Studio development environment tell me that I forgot a "TD" ending tag? ... Visual Studio 2005 will not even compile a web project if I forgot to end a table definition ...

... makes me think that there must be a way ...

... what I did was to have a closer look at a "well formed" web page (a large one) of my own ... and to tell ... a count of "<td" gave 14,777 results and a count of "</td>" gave 15,034 results ... on the same (well formed!!!) page ...

... it's not that I wouldn't have an idea on getting deeper into a possible solution ... it's just that I'd just paste html code like the one shown into a (somehow intelligent) html editor ... just waiting that it shows where errors occur!

Best regards,
Raisor

rivusglobal

ASKER

adg,

Thanks for your response. I will implement a stack-based solution to validate this code rather than using regular expressions.

Unfortunately, the HTML my code receives programmatically has errors, and given the amount of HTML that has to be parsed with other regular expressions, I have to ensure the HTML is properly formed.

I thought someone might have some ninja RegEx replace solution, seeing how the stack implementation is more in-depth.

Here was my original, unsuccessful RegEx approach:

@"(<td.*?>.*?)[^</td>].*?<td", @"$1</td><td"
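For readers wondering why this attempt fails: [^</td>] is a character class, so it matches a single character that is not '<', '/', 't', 'd' or '>'. It does not mean "text not followed by </td>". The sketch below wraps the pattern and replacement exactly as posted so the effect can be tried out; the class and method names are made up for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class OriginalAttempt
{
    // Pattern and replacement exactly as posted above.
    public static string Apply(string html)
    {
        return Regex.Replace(html, @"(<td.*?>.*?)[^</td>].*?<td", @"$1</td><td");
    }

    // True when the single character c is accepted by the class [^</td>],
    // i.e. when c is none of '<', '/', 't', 'd', '>'.
    public static bool ClassMatches(char c)
    {
        return Regex.IsMatch(c.ToString(), @"^[^</td>]$");
    }
}
```

Because the lazy group $1 captures as little as possible, the cell content ends up outside the group and is dropped by the replacement. On the already well-formed input <td>a</td><td>b</td>, Apply returns <td></td><td>b</td>.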

adg080898

If you only want to validate your pages (not actually develop your own validator), I suggest you look at:

http://validator.w3.org/

It might be easier to make a C# program that generates a webpage that uses the file upload service and lets the above site do the validating.

They even have a CSS validator:

http://jigsaw.w3.org/css-validator/

adg080898

I am just starting to learn C# (the language is easy to learn (when you already know C++), the problem is the enormous .net library!).

Perhaps there are some parsing libraries in .net you can use to drastically simplify your problem.

You might want to post a question about parsing in the C# area on EE:

https://www.experts-exchange.com/Programming/Programming_Languages/C_Sharp/

Ralf Klatt

Hi,

The problem described is a logic one ... if you're the "creator" of the html code you'll place the table definition tags programmatically -> meaning, once implemented there'll never be a missing closing tag ... if you leverage foreign code [written by someone else] you'll come to a "downlevel-issue" as adg has very well explained!

A good plan would be to split your html file into parts:

1. Take off the header section
2. Take off the body section
3. Count the tables in the body section
4. Split off each table
5. Count each TR in each of the tables
6. Split off each TR in each of the tables
7. Count each TD in each of your TRs in each of your tables
8. Perform an "InString" search in each TD element for a closing ("</td>") element
9. When you reach the missing one, you'll have the count state
10. On the basis of that count state, you'll be able to add a "</td>" string right before the next starting "<td>" element (meaning to add it to the end of the split, indexed chunk)

... a good thing would be to always use the "lower case" function when comparing ... as some elements might be written differently ... some in lower case and some others in upper case ...
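The steps above could be collapsed into a single pass that remembers whether a cell is currently open. This is only a rough sketch under some assumptions: it looks at <td> and <tr> tags only, compares case-insensitively, and does not handle nested tables, comments, or scripts. CellScanner and CloseOpenCells are hypothetical names, not part of any library.

```csharp
using System;
using System.Text;
using System.Text.RegularExpressions;

public static class CellScanner
{
    // Only the tags that matter here: <td ...>, </td>, <tr ...>, </tr>.
    // Group 1 is "/" for a closing tag, group 2 is the tag name.
    private static readonly Regex Tag =
        new Regex(@"<(/?)(td|tr)\b[^>]*>", RegexOptions.IgnoreCase);

    public static string CloseOpenCells(string html)
    {
        var output = new StringBuilder();
        bool cellOpen = false; // true while a <td> has not been closed yet
        int last = 0;          // end of the previously copied tag

        foreach (Match m in Tag.Matches(html))
        {
            // Copy the text between the previous tag and this one.
            output.Append(html, last, m.Index - last);

            bool closing = m.Groups[1].Length > 0;
            bool isCell = m.Groups[2].Value.Equals("td", StringComparison.OrdinalIgnoreCase);

            if (isCell && closing)
            {
                cellOpen = false;
            }
            else
            {
                // A new <td>, a <tr>, or a </tr> ends any cell that is still open.
                if (cellOpen) output.Append("</td>");
                cellOpen = isCell; // only an opening <td> leaves a cell open
            }

            output.Append(m.Value);
            last = m.Index + m.Length;
        }

        // Copy the tail and close a cell left open at the end of the input.
        output.Append(html, last, html.Length - last);
        if (cellOpen) output.Append("</td>");
        return output.ToString();
    }
}
```

For example, CellScanner.CloseOpenCells("<tr><td>a<td>b</tr>") returns "<tr><td>a</td><td>b</td></tr>". Well-formed input passes through unchanged.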

Hope this helps!
Best regards,
Raisor

ASKER CERTIFIED SOLUTION

ozo

This solution is only available to members.

rivusglobal

ASKER

Thanks for everyone's time and suggestions.

ozo, you are a ninja. I modified your example to fit our needs and it's running exactly as anticipated.

Here's the resulting C# expression we used.

_tdFix = Regex.Replace(_tdFix, @"(<td\b[^>]*>((?!</td\s*>).)*)(<td\b|$)", @"$1</td><td", RegexOptions.IgnoreCase);
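A few properties of the expression above are worth knowing before reusing it:

- Without RegexOptions.Singleline, '.' does not match a line break, so a cell whose next <td starts on the following line is not repaired.
- The match consumes the following <td, so two unclosed cells in a row are not both fixed in one pass.
- If the $ branch matches, the replacement still appends a literal <td.

The variant below is a sketch that tries to avoid these by using a lookahead instead of consuming the next tag. Also stopping at </tr is an added assumption, and TdFixer is a hypothetical name. It is still not a real HTML parser and will misbehave on nested tables.

```csharp
using System;
using System.Text.RegularExpressions;

public static class TdFixer
{
    // Group 1: an opening <td ...> plus its content, stopping before the next
    // <td, </td or </tr. The lookahead then requires that the content ended at
    // a <td, a </tr or the end of the input, i.e. that no </td> was found.
    private const string Pattern =
        @"(<td\b[^>]*>(?:(?!</?td\b|</tr\b).)*)(?=<td\b|</tr\b|\z)";

    public static string CloseOpenCells(string html)
    {
        // Singleline lets '.' cross line breaks; nothing after group 1 is consumed.
        return Regex.Replace(
            html,
            Pattern,
            "$1</td>",
            RegexOptions.IgnoreCase | RegexOptions.Singleline);
    }
}
```

Usage would look like _tdFix = TdFixer.CloseOpenCells(_tdFix);. Note that the inserted </td> lands directly before the next <td, so any whitespace or line break after the cell content ends up inside the cell.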

See you in the Perl TA.

Salute!