Solved

Regex HTML Validation

Posted on 2006-11-24
Last Modified: 2008-01-09
I am using a regular-expression replace in C# to validate some HTML. The HTML has opening <td.*?> tags, but in some cases the closing </td> tag has been omitted.

My goal is to find occurrences of <td> tags (with or without attributes) and add the closing tag before the next opening <td.*?> tag.
This will be executed on multiple sets of HTML.

HTML snippet:
<tr class="odd">
...
<td class="tright">0</td>
<td class="tright">-</td>
</tr>
<tr class="odd">
<td class="left"><a href="/test/test.html?unique_id=1270">UNIQUE</a></td>
<td class="left"><a href="/test/test.html?unique_id=1271">UNIQUE</a></td>
<td class="left"><a href="/test/test.html?unique_id=1272">UNIQUE</a>    <------ Error is here
<td class="right">0</td><td class="right">0</td>

The expressions I have built have not worked. I won't include my expressions with this question yet, just in case it adds any additional complications.

Would any regex guru please offer me some advice?

Thanks in advance.

Question by:rivusglobal
9 Comments
 
LVL 8

Assisted Solution

by:adg080898
adg080898 earned 200 total points
ID: 18009706
This is a common question with regular expressions. What you ask is too complicated to be done reliably with regular expressions.

What would happen if one of the table cells itself included a table? A regex would not (and could not) handle it, and it would screw up.

For something this complicated, you should have a full-blown parser. It is possible to do what you ask, but it is not simple...

The only way to reliably handle it is to maintain a stack. Scan forward for ANY HTML tag, push an entry onto the stack on an opening tag, and pop from the stack on a closing tag. (You would need to know which tags require closing tags; some tags don't have them.) This way, whenever you encounter a new tag, you can examine the stack and make sure that the next tag is appropriate.

In your example, upon encountering a "<td", "</table", or "<tr" tag, you would make sure that the top of the stack is not "<td". If it is, you emit the required "</td>" tag.
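
Editor's sketch of the stack idea described above, in Python rather than C# so it can be run on its own. It is a minimal illustration, not a validator: the function name `close_open_cells` is hypothetical, only `td`, `tr`, and `table` are tracked, `</tr>` is added to the list of tags that end a cell (an assumption beyond the post), and attribute values containing `>` are not handled.

```python
import re

# Any opening or closing tag; group 1 is "/" for a closing tag, group 2 the tag name.
TAG = re.compile(r"<(/?)([a-zA-Z][a-zA-Z0-9]*)\b[^>]*>")


def close_open_cells(html: str) -> str:
    """Emit a missing </td> before the next <td>, <tr>, </tr> or </table>."""
    out = []    # pieces of the repaired document
    stack = []  # names of the table-related tags that are currently open
    pos = 0
    for m in TAG.finditer(html):
        closing = m.group(1) == "/"
        name = m.group(2).lower()
        out.append(html[pos:m.start()])  # text between the previous tag and this one
        pos = m.end()
        if name in ("td", "tr", "table"):
            # These tags cannot appear while a sibling cell is still open.
            ends_cell = (not closing and name in ("td", "tr")) or (
                closing and name in ("tr", "table")
            )
            if ends_cell and stack and stack[-1] == "td":
                out.append("</td>")
                stack.pop()
            if closing:
                if stack and stack[-1] == name:
                    stack.pop()
            else:
                stack.append(name)
        out.append(m.group(0))
    out.append(html[pos:])
    if stack and stack[-1] == "td":
        out.append("</td>")  # a cell left open at the end of the input
    return "".join(out)


if __name__ == "__main__":
    print(close_open_cells('<tr><td class="a">1<td>2</td></tr>'))
```

Because an opening `<table>` is pushed on top of the enclosing `td`, a table nested inside a cell is left alone, which is the case the post says a single regex cannot handle.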
 
LVL 41

Expert Comment

by:HonorGod
ID: 18009872
Basically, you are asking "is the input 'valid HTML'?"
As adg stated, you cannot identify this with regular expressions.
You would need to parse the input, and you really need an HTML validator.
 
LVL 15

Expert Comment

by:Ralf Klatt
ID: 18009882
Hi,

@adg ...

... you're definitely right with your statement ... I've subscribed to this question about an hour ago just to see if there'd be a way to handle this in a different way I'd resumed already ... reading your answer made me think about the Visual Studio development environment ... how can the Visual Studio development environment tell me that I forgot a "TD" ending tag? ... Visual Studio 2005 will not even compile a web project if I forgot to end a table definition ...

... makes me think that there must be a way ...

... what I did was to have a closer look at a "well formed" web page (a large one) of my own ... and to tell ... a count on "<td" brought 14.777 results and a count on "</td>" brought only 15.034 results ... on the same (well formed!!!) page ...

... it's not that I wouldn't have an idea on getting deeper into a possible solution ... it's just that I'd just paste html code like the one shown into a (somehow intelligent) html editor ... just waiting that it shows where errors occur!


Best regards,
Raisor
 
LVL 10

Author Comment

by:rivusglobal
ID: 18009922
adg,

Thanks for your response. I will implement a stack-based solution to validate this code rather than using regular expressions.

Unfortunately, the HTML my code receives programmatically has errors, and given the amount of HTML that needs to be parsed with other regular expressions, I have to ensure the HTML is properly formed.

I thought someone might have some ninja RegEx replace solution, seeing how the stack implementation is more in-depth.

Here was my original, unsuccessful RegEx approach:

@"(<td.*?>.*?)[^</td>].*?<td", @"$1</td><td"
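
Editor's note on why this pattern cannot work as intended: `[^</td>]` is a negated character class. It matches one character that is not `<`, `/`, `t`, `d`, or `>`; it does not mean "anything other than the string </td>". The Python sketch below illustrates the difference, and the same two constructs behave the same way in .NET and Perl. The names `CHAR_CLASS` and `NOT_CLOSING_TD` are illustrative only.

```python
import re

# One character that is none of: < / t d >
CHAR_CLASS = re.compile(r"[^</td>]")

# "Any run of characters that does not run into </td>", written with a negative lookahead
NOT_CLOSING_TD = re.compile(r"(?:(?!</td>).)*")

if __name__ == "__main__":
    # Every character of "</td>" is excluded by the class, so nothing matches.
    print(CHAR_CLASS.findall("</td>"))
    # The class also rejects the letters t and d in ordinary text.
    print(CHAR_CLASS.findall("data"))
    # The lookahead version stops right before the closing tag.
    print(NOT_CLOSING_TD.match("UNIQUE</a></td>").group(0))
```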
 
LVL 8

Expert Comment

by:adg080898
ID: 18009923
If you only want to validate your pages (not actually develop your own validator), I suggest you look at:

http://validator.w3.org/

It might be easier to make a C# program that generates a webpage that uses the file upload service and lets the above site do the validating.

They even have a CSS validator:

http://jigsaw.w3.org/css-validator/
 
LVL 8

Expert Comment

by:adg080898
ID: 18009931
I am just starting to learn C# (the language is easy to learn (when you already know C++), the problem is the enormous .net library!).

Perhaps there are some parsing libraries in .net you can use to drastically simplify your problem.

You might want to post a question about parsing in the C# area on EE:

http://www.experts-exchange.com/Programming/Programming_Languages/C_Sharp/


 
LVL 15

Expert Comment

by:Ralf Klatt
ID: 18009990
Hi,

The problem described is a logic one ... if you're the "creator" of the html code you'll place the table definition tags programmatically -> meaning, once implemented there'll never be a missing closing tag ... if you leverage foreign code [written by someone else] you'll come to a "downlevel-issue" as adg has very well explained!

A good plan would be to split your html file into parts:

1. Take off the header section
2. Take off the body section
3. Count the tables in the body section
4. Split off each table
5. Count each TR in each of the tables
6. Split off each TR in each of the tables
7. Count each TD in each of your TRs in each of your tables
8. Perform an "InString" search in each TD element for a closing ("</td>") element
9. When you've reached the missing one you'll have the count state
10. On the "count state basis" you'll be able to add a "</td>" string right before the next starting "<td>" element (meaning to add it to the end of the split, indexed chunk)

... a good thing would be to always use the "lowercase" function when comparing ... as some elements might be written differently ... some lowercase and some others uppercase ...
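
Editor's sketch of steps 7 to 10 above, in Python. It assumes the input is already a single row or table fragment, and the function name `close_cells_by_splitting` is hypothetical. The text is split just before every `<td`, each chunk is lowercased for the comparison, and `</td>` is appended to any cell chunk that lacks one. Two limitations: a cell left open right before `</tr>` gets its `</td>` after the `</tr>`, and nested tables are not handled.

```python
import re


def close_cells_by_splitting(html: str) -> str:
    """Append </td> to every <td ...> chunk that does not contain one."""
    # Zero-width split: each chunk after the first starts with "<td" in any case.
    # Splitting on an empty match requires Python 3.7 or later.
    chunks = re.split(r"(?=<td\b)", html, flags=re.IGNORECASE)
    fixed = []
    for chunk in chunks:
        lowered = chunk.lower()  # compare in lowercase, as the post suggests
        if lowered.startswith("<td") and "</td" not in lowered:
            body = chunk.rstrip()
            # Put the closing tag before any trailing whitespace or newline.
            chunk = body + "</td>" + chunk[len(body):]
        fixed.append(chunk)
    return "".join(fixed)


if __name__ == "__main__":
    print(close_cells_by_splitting("<TD>1\n<td>2</td>"))
```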


Hope this helps!
Best regards,
Raisor

 
LVL 84

Accepted Solution

by:ozo
ozo earned 1800 total points
ID: 18010264
#!/usr/bin/perl
$_='
<tr class="odd">
...
<td class="tright">0</td>
<td class="tright">-</td>
</tr>
<tr class="odd">
<td class="left"><a href="/test/test.html?unique_id=1270">UNIQUE</a></td>
<td class="left"><a href="/test/test.html?unique_id=1271">UNIQUE</a></td>
<td class="left"><a href="/test/test.html?unique_id=1272">UNIQUE</a>    <------ Error is here
<td class="right">0</td><td class="right">0</td>
';
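# Report the first <td> that reaches another <td (or the end of input) before any </td>.
# /s lets . match newlines; /i ignores case.
print "error:$1\n" if m#(<td\b[^>]*>((?!</td\s*>).)*(<td\b|$))#si;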
 
LVL 10

Author Comment

by:rivusglobal
ID: 18012499
Thanks for everyone's time and suggestions.

ozo, you are a ninja. I modified your example to fit our needs and it's running exactly as anticipated.

Here's the resulting C# expression we used.

_tdFix = Regex.Replace(_tdFix, @"(<td\b[^>]*>((?!</td\s*>).)*)(<td\b|$)", @"$1</td><td", RegexOptions.IgnoreCase);
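
Editor's sketch of a variant of this replace, in Python so it can be run standalone. It is not the expression used in the thread. The posted version consumes the following `<td` and uses a greedy quantifier, so as far as I can tell a run of several unclosed cells gets only one repair per pass, and the `$` branch appends a stray `<td` at the end of the input. The variant below uses a lazy quantifier and a lookahead instead. `re.DOTALL` plays the role of Perl's `/s`; in .NET, `.` does not match a newline unless `RegexOptions.Singleline` is set. The names `UNCLOSED_TD` and `add_missing_td` are hypothetical.

```python
import re

# A <td ...> whose content reaches the next <td (or the end of input)
# without passing a </td>. The lookahead leaves the next <td unconsumed,
# so it can start a match of its own.
UNCLOSED_TD = re.compile(
    r"(<td\b[^>]*>(?:(?!</td\s*>).)*?)(?=<td\b|$)",
    re.IGNORECASE | re.DOTALL,
)


def add_missing_td(html: str) -> str:
    """Insert </td> after every cell that is not closed before the next cell."""
    return UNCLOSED_TD.sub(r"\1</td>", html)


if __name__ == "__main__":
    print(add_missing_td("<td a>x<td b>y<td c>z</td>"))
```

Like the original, this still misplaces the tag when a table is nested inside a cell or when the open cell is followed by `</tr>`; the stack approach discussed earlier in the thread covers those cases.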

See you in the Perl TA.

Salute!

Question has a verified solution.

If you are experiencing a similar issue, please ask a related question

The SignAloud Glove is capable of translating American Sign Language signs into text and audio.
We are witnesses that everyone is saying that our children shouldn't "play" with a technology because it is dangerous. This article is going to prove that they are wrong.
Viewers will learn how to properly install Eclipse with the necessary JDK, and will take a look at an introductory Java program. Download Eclipse installation zip file: Extract files from zip file: Download and install JDK 8: Open Eclipse and …
In this fourth video of the Xpdf series, we discuss and demonstrate the PDFinfo utility, which retrieves the contents of a PDF's Info Dictionary, as well as some other information, including the page count. We show how to isolate the page count in a…

610 members asked questions and received personalized solutions in the past 7 days.

Join the community of 500,000 technology professionals and ask your questions.

Join & Ask a Question