Solved

Regex HTML Validation

Posted on 2006-11-24
Last Modified: 2008-01-09
I am using a regular expression replace in C# to validate some HTML. The HTML has opening <td.*?> tags, but in some cases the closing </td> tag has been omitted.

My goal is to find occurrences of <td> tags (with or without attributes) and add the closing tag before the next opening <td.*?> tag.
This will be executed on multiple sets of HTML.

html snippet:
<tr class="odd">
...
<td class="tright">0</td>
<td class="tright">-</td>
</tr>
<tr class="odd">
<td class="left"><a href="/test/test.html?unique_id=1270">UNIQUE</a></td>
<td class="left"><a href="/test/test.html?unique_id=1271">UNIQUE</a></td>
<td class="left"><a href="/test/test.html?unique_id=1272">UNIQUE</a>    <------ Error is here
<td class="right">0</td><td class="right">0</td>

The expressions I have built have not worked. I won't include my expressions with this question yet, just in case it adds any additional complications.

Would any regex guru please offer me some advice?

Thanks in advance.

Question by:rivusglobal
9 Comments
 
LVL 8

Assisted Solution

by:adg080898
adg080898 earned 50 total points
ID: 18009706
This is a common question with regular expressions. What you ask is too complicated to be done reliably with regular expressions.

What would happen if one of the table cells itself included a table? Regexes would not (and could not) handle it and it would screw up.

For something this complicated, you should have a full blown parser. However, it is possible to do what you ask, but not simple...

The only way to reliably handle it is to maintain a stack. Scan forward for ANY html tag, push an entry onto the stack on an opening tag, and pop from the stack on a closing tag. (You would need to know which tags require closing tags, since some tags don't have them.) This way, whenever you encounter a new tag, you can examine the stack and make sure that the next tag is appropriate.

In your example, upon encountering a "<td", "</table", or "<tr" tag, you would make sure that the top of the stack is not "<td". If it is, you emit the required "</td" tag.
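
A minimal sketch of that stack idea, written in Python rather than C# purely for illustration. The function name `close_open_cells`, the `VOID` list, and the choice to also treat "</tr" as a boundary are assumptions of this sketch, not something from the question. It only repairs a cell whose "<td" is on top of the stack, and it will misread attribute values that contain ">".

```python
import re

# Matches any opening or closing tag; captures "/" (if closing) and the tag name.
TAG = re.compile(r"<(/?)([a-zA-Z][a-zA-Z0-9]*)\b[^>]*>")

# Tags that never take a closing tag (partial, illustrative list).
VOID = {"br", "hr", "img", "input", "meta", "link"}

def close_open_cells(html):
    """Emit a missing </td> before the next <td, <tr, </tr or </table."""
    out = []
    stack = []   # names of the elements that are currently open
    pos = 0
    for m in TAG.finditer(html):
        closing = m.group(1) == "/"
        name = m.group(2).lower()
        out.append(html[pos:m.start()])   # text between the previous tag and this one
        pos = m.end()
        boundary = (not closing and name in ("td", "tr")) or \
                   (closing and name in ("tr", "table"))
        if boundary and stack and stack[-1] == "td":
            out.append("</td>")           # the cell on top of the stack was never closed
            stack.pop()
        if closing:
            if name in stack:
                # Pop back to (and including) the matching opening tag.
                while stack.pop() != name:
                    pass
        elif name not in VOID:
            stack.append(name)
        out.append(m.group(0))
    out.append(html[pos:])
    return "".join(out)

print(close_open_cells("<tr><td>a</td><td>b<td>c</td></tr>"))
```

Because nested tables push their own "table", "tr" and "td" entries, a cell that contains a complete inner table is left untouched.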
 
LVL 41

Expert Comment

by:HonorGod
ID: 18009872
Basically, you are asking "is the input 'valid HTML'?"
As adg stated, you cannot determine this with regular expressions.
You would need to parse the input, and you really need an HTML validator.
 
LVL 15

Expert Comment

by:Raisor
ID: 18009882
Hi,

@adg ...

... you're definitely right with your statement ... I've subscribed to this question about an hour ago just to see if there'd be a way to handle this in a different way I'd resumed already ... reading your answer made me think about the Visual Studio development environment ... how can the Visual Studio development environment tell me that I forgot a "TD" ending tag? ... Visual Studio 2005 will not even compile a web project if I forgot to end a table definition ...

... makes me think that there must be a way ...

... what I did was to have a closer look at a "well formed" web page (a large one) of my own ... and to tell ... a count on "<td" brought 14.777 results and a count on "</td>" brought only 15.034 results ... on the same (well formed!!!) page ...

... it's not that I wouldn't have an idea on getting deeper into a possible solution ... it's just that I'd just paste html code like the one shown into a (somehow intelligent) html editor ... just waiting that it shows where errors occur!


Best regards,
Raisor
 
LVL 10

Author Comment

by:rivusglobal
ID: 18009922
adg,

Thanks for your response. I will implement a stack-based solution to validate this code rather than using regular expressions.

Unfortunately, the HTML my code receives programmatically has errors, and given the amount of HTML that needs to be parsed with other regular expressions, I have to ensure the HTML is properly formed.

I thought someone might have some ninja RegEx replace solution, seeing how the stack implementation is more in-depth.

Here was my original, unsuccessful RegEx approach:

@"(<td.*?>.*?)[^</td>].*?<td", @"$1</td><td"
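
One likely reason this attempt fails: [^</td>] is a character class, so it matches a single character that is none of '<', '/', 't', 'd', '>' rather than "anything other than the string </td>". The Python sketch below (the helper name `apply_original` is made up for the demonstration) runs the same pattern on a well-formed row and contrasts it with a negative lookahead.

```python
import re

# The original attempt, using Python's replacement syntax (\1 instead of $1).
ORIGINAL = re.compile(r"(<td.*?>.*?)[^</td>].*?<td")

def apply_original(html):
    return ORIGINAL.sub(r"\1</td><td", html)

# [^</td>] consumes ONE character that is not '<', '/', 't', 'd' or '>'.
# On a well-formed row it swallows the cell content and the real closing tag:
print(apply_original("<td>a</td><td>b</td>"))      # <td></td><td>b</td>

# A negative lookahead tests for the whole string "</td>" at each position instead.
NOT_CLOSING = re.compile(r"(?:(?!</td>).)*")
print(NOT_CLOSING.match("abc</td>def").group(0))   # abc
```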

 
LVL 8

Expert Comment

by:adg080898
ID: 18009923
If you only want to validate your pages (not actually develop your own validator), I suggest you look at:

http://validator.w3.org/

It might be easier to make a C# program that generates a webpage that uses the file upload service and lets the above site do the validating.

They even have a CSS validator:

http://jigsaw.w3.org/css-validator/
 
LVL 8

Expert Comment

by:adg080898
ID: 18009931
I am just starting to learn C# (the language is easy to learn (when you already know C++), the problem is the enormous .net library!).

Perhaps there are some parsing libraries in .net you can use to drastically simplify your problem.

You might want to post a question about parsing in the C# area on EE:

http://www.experts-exchange.com/Programming/Programming_Languages/C_Sharp/


 
LVL 15

Expert Comment

by:Raisor
ID: 18009990
Hi,

The problem described is a logic one ... if you're the "creator" of the html code you'll place the table definition tags programmatically -> meaning, once implemented there'll never be a missing closing tag ... if you leverage foreign code [written by someone else] you'll come to a "downlevel-issue" as adg has very well explained!

A good plan would be to split your html file into parts:

1. Take off the header section
2. Take off the body section
3. Count the tables in the body section
4. Split off each table
5. Count each TR in each of the tables
6. Split off each TR in each of the tables
7. Count each TD in each of your TRs in each of your tables
8. Perform an "InString" search in each TD element for a closing ("</td>") element
9. When you've reached the missing one you'll have the count state
10. On the "count state basis" you'll be able to add a "</td>" string right before the next starting "<td>" element (meaning to add it to the end of the split, indexed chunk)

... a good thing would be to always use the "lower case" function when comparing ... as some elements might be written differently ... some lower case and some others upper case ...
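
Steps 7 to 10, together with the case-insensitive comparison, could be sketched roughly as follows in Python (3.7 or later, since it splits on a zero-width match). The function name `close_cells_by_splitting` is hypothetical, the input is assumed to be one row that has already been split off, and a cell containing a nested table would fool the "</td" search.

```python
import re

SPLIT_BEFORE_TD = re.compile(r"(?=<td\b)", re.IGNORECASE)  # zero-width: keeps "<td" in its chunk
OPEN_TD = re.compile(r"<td\b", re.IGNORECASE)
CLOSE_TD = re.compile(r"</td\b", re.IGNORECASE)
CLOSE_TR = re.compile(r"</tr\b", re.IGNORECASE)

def close_cells_by_splitting(row_html):
    """Split a row into one chunk per <td and close the chunks that lack </td>."""
    fixed = []
    for chunk in SPLIT_BEFORE_TD.split(row_html):
        if OPEN_TD.match(chunk) and not CLOSE_TD.search(chunk):
            # Add the closing tag at the end of the chunk, or just before
            # a </tr> if the chunk is the last cell of the row.
            end_of_row = CLOSE_TR.search(chunk)
            cut = end_of_row.start() if end_of_row else len(chunk)
            chunk = chunk[:cut] + "</td>" + chunk[cut:]
        fixed.append(chunk)
    return "".join(fixed)

print(close_cells_by_splitting("<tr><TD>a</TD><td>b<td>c</td></tr>"))
```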


Hope this helps!
Best regards,
Raisor

 
LVL 84

Accepted Solution

by:
ozo earned 450 total points
ID: 18010264
#!/usr/bin/perl
$_='
<tr class="odd">
...
<td class="tright">0</td>
<td class="tright">-</td>
</tr>
<tr class="odd">
<td class="left"><a href="/test/test.html?unique_id=1270">UNIQUE</a></td>
<td class="left"><a href="/test/test.html?unique_id=1271">UNIQUE</a></td>
<td class="left"><a href="/test/test.html?unique_id=1272">UNIQUE</a>    <------ Error is here
<td class="right">0</td><td class="right">0</td>
';
print "error:$1\n" if m#(<td\b[^>]*>((?!</td\s*>).)*(<td\b|$))#si;
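
For readers who don't use Perl, here is the same pattern ported to Python and spelled out with re.VERBOSE. The port and the helper name `first_unclosed_td` are illustrative additions, not part of the answer; the /s and /i modifiers become re.DOTALL and re.IGNORECASE.

```python
import re

# Same expression as the Perl one-liner above, one piece per line.
UNCLOSED_TD = re.compile(
    r"""
    (                     # group 1: the whole offending span
      <td\b[^>]*>         # an opening <td ...> tag
      ((?!</td\s*>).)*    # any characters, as long as no </td> starts at them
      (<td\b|$)           # followed by the next <td, or the end of the input
    )
    """,
    re.VERBOSE | re.DOTALL | re.IGNORECASE,
)

def first_unclosed_td(html):
    """Return the first <td ...> span that runs into another <td, or None."""
    m = UNCLOSED_TD.search(html)
    return m.group(1) if m else None

print(first_unclosed_td("<td>a</td><td>b<td>c</td>"))   # <td>b<td
```

The key piece is the lookahead in the middle: each character is consumed only if "</td>" does not begin there, so a match can only reach the next "<td" when no closing tag sits in between.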
 
LVL 10

Author Comment

by:rivusglobal
ID: 18012499
Thanks for everyone's time and suggestions.

ozo, you are a ninja. I modified your example to fit our needs and it's running exactly as anticipated.

Here's the resulting C# expression we used.

_tdFix = Regex.Replace(_tdFix, @"(<td\b[^>]*>((?!</td\s*>).)*)(<td\b|$)", @"$1</td><td", RegexOptions.IgnoreCase);
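
A possible variant of that replace, sketched in Python and not the expression actually used above. It makes three assumptions worth checking against your own input: the "." has to cross newlines (re.DOTALL here, RegexOptions.Singleline in .NET), the next "<td" is tested with a lookahead so nothing is appended at the end of the input, and the scan is lazy so several unclosed cells in a row are each repaired. The name `add_missing_td_close` is made up.

```python
import re

FIX_TD = re.compile(
    r"(<td\b[^>]*>(?:(?!</td\s*>).)*?)(?=<td\b|\Z)",
    re.IGNORECASE | re.DOTALL,
)

def add_missing_td_close(html):
    """Append </td> to every cell that reaches the next <td (or the end) unclosed."""
    return FIX_TD.sub(r"\1</td>", html)

print(add_missing_td_close("<td>a<td>b<td>c</td>"))   # <td>a</td><td>b</td><td>c</td>
```

Like the original, this is still a regex over HTML: an unclosed last cell of a row is closed only at the next "<td" or at the end of the input, so the "</td>" can land after the "</tr>", and nested tables are not handled.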

See you in the Perl TA.

Salute!


Question has a verified solution.
