Solved

Regex HTML Validation

Posted on 2006-11-24
Last Modified: 2008-01-09
I am using a regular-expression replace in C# to validate some HTML. The HTML has opening <td.*?> tags, but in some cases the closing </td> tag has been omitted.

My goal is to find occurrences of <td> tags (with or without attributes) that are missing their closing tag, and add the closing tag before the next opening <td.*?> tag.
This will be executed on multiple sets of HTML.

html snippet:
<tr class="odd">
...
<td class="tright">0</td>
<td class="tright">-</td>
</tr>
<tr class="odd">
<td class="left"><a href="/test/test.html?unique_id=1270">UNIQUE</a></td>
<td class="left"><a href="/test/test.html?unique_id=1271">UNIQUE</a></td>
<td class="left"><a href="/test/test.html?unique_id=1272">UNIQUE</a>    <------ Error is here
<td class="right">0</td><td class="right">0</td>

The expressions I have built have not worked. I won't include my expressions with this question yet, just in case it adds any additional complications.

Would any regex guru please offer me some advice?

Thanks in advance.

Question by:rivusglobal
9 Comments
 
LVL 8

Assisted Solution

by:adg080898
adg080898 earned 50 total points
This is a common question about regular expressions. What you ask is too complicated to be done reliably with a regex alone.

What would happen if one of the table cells itself contained a table? A regex would not (and could not) handle the nesting, and the result would be wrong.

For something this complicated, you should use a full-blown parser. It is possible to do what you ask, but it is not simple.

The only reliable way is to maintain a stack. Scan forward for ANY HTML tag; push an entry onto the stack for an opening tag and pop one for a closing tag. (You need to know which tags require closing tags, since some tags don't have them.) That way, whenever you encounter a new tag, you can examine the stack and check that the tag is appropriate at that point.

In your example, upon encountering a "<td", "</table", or "<tr" tag, you would check whether the top of the stack is "<td". If it is, you emit the missing "</td" tag first.
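
The stack idea above can be sketched in a few lines. This is a minimal illustration in Python rather than C#, not a full HTML parser: the function name `close_open_cells`, the `TAG` pattern, and the short `VOID` list of tags without closing tags are all assumptions made for the example, and `</tr` is treated as ending a cell in addition to the three tags named above.

```python
import re

# Any opening or closing tag: group 1 is "/" for a closing tag, group 2 is the tag name.
TAG = re.compile(r"<(/?)([a-zA-Z][a-zA-Z0-9]*)\b[^>]*>")

# Tags that never take a closing tag (an illustrative subset, not a complete list).
VOID = {"br", "hr", "img", "input", "meta", "link"}

def close_open_cells(html):
    """Emit a missing </td> before the next <td>, <tr>, </tr> or </table>."""
    stack, out, pos = [], [], 0
    for m in TAG.finditer(html):
        out.append(html[pos:m.start()])      # text between the previous tag and this one
        pos = m.end()
        closing = m.group(1) == "/"
        name = m.group(2).lower()
        ends_cell = (not closing and name in ("td", "tr")) or \
                    (closing and name in ("tr", "table"))
        if ends_cell and stack and stack[-1] == "td":
            out.append("</td>")              # the cell on top of the stack was never closed
            stack.pop()
        if closing:
            if stack and stack[-1] == name:
                stack.pop()
        elif name not in VOID and not m.group(0).endswith("/>"):
            stack.append(name)
        out.append(m.group(0))
    out.append(html[pos:])
    return "".join(out)

print(close_open_cells("<tr><td>a<td>b</td></tr>"))
```

Because a nested `<table>` and `<tr>` sit above the outer `td` on the stack, a table inside a cell is left alone, which is the case a single regex cannot handle.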
 
LVL 41

Expert Comment

by:HonorGod
Basically, you are asking "is the input 'valid HTML'?"
As adg stated, you cannot identify this with regular expressions.
You would need to parse the input, and you really need an HTML validator.
 
LVL 15

Expert Comment

by:Raisor
Hi,

@adg, you're definitely right. I subscribed to this question about an hour ago just to see whether there was a way to handle this other than the one I had already settled on. Reading your answer made me think about Visual Studio: how can the development environment tell me that I forgot a closing "TD" tag? Visual Studio 2005 will not even compile a web project if I forget to close a table definition.

That makes me think there must be a way.

So I took a closer look at a large "well formed" web page of my own. A count of "<td" gave 14.777 results and a count of "</td>" gave 15.034 results, on the same (well formed!) page, so the two counts did not even match.

It's not that I have no ideas for digging deeper into a possible solution. It's just that I would simply paste HTML like the snippet shown into a reasonably intelligent HTML editor and let it show where the errors occur.


Best regards,
Raisor
 
LVL 10

Author Comment

by:rivusglobal
adg,

Thanks for your response. I will implement a stack-based solution to validate this code rather than using regular expressions.

Unfortunately, the HTML my code receives programmatically has errors, and because so much of it is then parsed with other regular expressions, I have to ensure the HTML is properly formed first.

I thought someone might have some ninja RegEx replace solution, seeing as the stack implementation is more involved.

Here was my original, unsuccessful RegEx approach:

@"(<td.*?>.*?)[^</td>].*?<td", @"$1</td><td"
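
One likely reason the expression above fails: `[^</td>]` is a character class, so it matches one character that is not `<`, `/`, `t`, `d` or `>`; it does not mean "not followed by `</td>`". The Python sketch below shows the pattern damaging a well-formed row, and a negative lookahead expressing the intended test. Python writes the replacement as `\1` where .NET uses `$1`; the sample strings and the names `original` and `unclosed` are made up for the illustration.

```python
import re

# The original attempt, with .NET's "$1" written as Python's "\1".
original = re.compile(r"(<td.*?>.*?)[^</td>].*?<td")

# On well-formed input it still matches, and the cell content is lost.
well_formed = "<td>0</td><td>1</td>"
damaged = original.sub(r"\1</td><td", well_formed)
print(damaged)  # <td></td><td>1</td>

# "(?!</td\s*>)." consumes one character only if no closing tag starts there,
# so the match can never run past a </td>.
unclosed = re.compile(r"<td\b[^>]*>(?:(?!</td\s*>).)*?(?=<td\b)", re.I | re.S)

print(unclosed.search(well_formed))                 # None
print(unclosed.search("<td>A<td>0</td>").group(0))  # <td>A
```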
LVL 8

Expert Comment

by:adg080898
If you only want to validate your pages (not actually develop your own validator), I suggest you look at:

http://validator.w3.org/

It might be easier to make a C# program that generates a webpage that uses the file upload service and lets the above site do the validating.

They even have a CSS validator:

http://jigsaw.w3.org/css-validator/
 
LVL 8

Expert Comment

by:adg080898
I am just starting to learn C#. The language is easy to learn when you already know C++; the problem is the enormous .NET library!

Perhaps there are some parsing libraries in .NET you can use to drastically simplify your problem.

You might want to post a question about parsing in the C# area on EE:

http://www.experts-exchange.com/Programming/Programming_Languages/C_Sharp/


 
LVL 15

Expert Comment

by:Raisor
Hi,

The problem described is a logical one. If you are the "creator" of the HTML code, you place the table tags programmatically, meaning that once implemented there will never be a missing closing tag. If you leverage foreign code [written by someone else], you run into a "downlevel issue", as adg has explained very well.

A good plan would be to split your HTML file into parts:

1. Take off the header section
2. Take off the body section
3. Count the tables in the body section
4. Split off each table
5. Count each TR in each of the tables
6. Split off each TR in each of the tables
7. Count each TD in each of your TRs in each of your tables
8. Perform an "InString" search in each TD element for a closing ("</td>") element
9. When you reach the missing one, you have the count state
10. On the basis of that count state, you can add a "</td>" string right before the next starting "<td>" element (that is, append it to the end of the split, indexed chunk)

It would also be a good idea to always use the "lowercase" function when comparing, as some elements might be written in lower case and others in upper case.


Hope this helps!
Best regards,
Raisor
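
The core of the split-and-check plan above (steps 7 to 10) could look roughly like the following. It is a Python sketch under simplifying assumptions: the function name `close_cells_by_splitting` is hypothetical, the header/body/table/row splitting of steps 1 to 6 is skipped, and a final cell with no following `<td` is left untouched.

```python
import re

def close_cells_by_splitting(html):
    """Cut the text at every opening <td and close chunks that lack a </td>."""
    # Case-insensitive search, in the spirit of the "lowercase" advice.
    starts = [m.start() for m in re.finditer(r"<td\b", html, re.IGNORECASE)]
    if not starts:
        return html
    pieces = [html[:starts[0]]]               # everything before the first cell
    ends = starts[1:] + [len(html)]
    for begin, end in zip(starts, ends):
        chunk = html[begin:end]
        is_last = end == len(html)
        has_close = re.search(r"</td\s*>", chunk, re.IGNORECASE) is not None
        if not is_last and not has_close:
            chunk += "</td>"                  # lands right before the next <td
        pieces.append(chunk)
    return "".join(pieces)

print(close_cells_by_splitting("<TD>a<td>b</td><td>c</td>"))
```

As with the regex approaches, this treats every `<td` as a sibling, so a table nested inside a cell would be closed in the wrong place.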

 
LVL 84

Accepted Solution

by:ozo
ozo earned 450 total points
#!/usr/bin/perl
$_='
<tr class="odd">
...
<td class="tright">0</td>
<td class="tright">-</td>
</tr>
<tr class="odd">
<td class="left"><a href="/test/test.html?unique_id=1270">UNIQUE</a></td>
<td class="left"><a href="/test/test.html?unique_id=1271">UNIQUE</a></td>
<td class="left"><a href="/test/test.html?unique_id=1272">UNIQUE</a>    <------ Error is here
<td class="right">0</td><td class="right">0</td>
';
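# Report the first <td ...> that reaches the next <td (or the end of the input)
# without passing a </td>. /s lets "." match newlines; /i ignores case.
print "error:$1\n" if m#(<td\b[^>]*>((?!</td\s*>).)*(<td\b|$))#si;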
 
LVL 10

Author Comment

by:rivusglobal
Thanks for everyone's time and suggestions.

ozo, you are a ninja. I modified your example to fit our needs and it's running exactly as anticipated.

Here's the resulting C# expression we used.

_tdFix = Regex.Replace(_tdFix, @"(<td\b[^>]*>((?!</td\s*>).)*)(<td\b|$)", @"$1</td><td", RegexOptions.IgnoreCase);

See you in the Perl TA.

Salute!
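
The C# expression above consumes the next `<td` as part of the match and writes it back in the replacement. A possible variant, sketched here in Python (so `\1` instead of `$1`), uses a lookahead so the following tag is only inspected, not consumed. The names `UNCLOSED_TD` and `add_missing_td`, the extra `</tr` alternative, and the use of `re.DOTALL` (the counterpart of `RegexOptions.Singleline`, needed if an unclosed cell spans a line break) are assumptions for illustration, not part of the accepted answer.

```python
import re

# An opening <td ...>, then as few characters as possible, none of which starts
# a </td>, up to (but not including) the next <td, a </tr, or the end of input.
UNCLOSED_TD = re.compile(
    r"(<td\b[^>]*>(?:(?!</td\s*>).)*?)(?=<td\b|</tr\b|$)",
    re.IGNORECASE | re.DOTALL,
)

def add_missing_td(html):
    """Append </td> to every cell that is not closed before the next cell or row end."""
    return UNCLOSED_TD.sub(r"\1</td>", html)

print(add_missing_td("<td>a<td>b<td>c</td>"))
# <td>a</td><td>b</td><td>c</td>
```

With the lookahead, two unclosed cells in a row are each closed, and no stray `<td` is appended when the match ends at the end of the input. Like any regex approach, it would still insert a wrong `</td>` in front of a table nested inside a cell.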