Problem with regular expression to capture tag content

Hello All !

This may not be the best place for this subject, but anyway...
I need to be able to parse an XHTML page as XML.
So, first step: simply load the page into an XmlDocument. This worked fine until my page contained some special characters (accents, ...).
So, second step: before loading the page, I wanted to process it with a regular expression to add CDATA blocks around the content of each tag. To do that, I was using the following regular expression:
new Regex(@"(?<openTag><[^/][^>]*[^/]>)(?<content>\s*[^<>]+\s*)(?<endTag></[^>]*>)");
This worked fine until I received a page containing tags like these:
<tr>
<td align="right" colspan="2">
<a href="javascript:__doPostBack('ctl00$MasterHome$gvMonTableauProduction','Page$Prev')">
<< Previous
</a>
<select id="ctl00_MasterHome_gvMonTableauProduction_cboNumPages" onchange="__doPostBack('ctl00$MasterHome$gvMonTableauProduction','SelectedPage')">
<option value="0">Page 1</option>
<option selected="selected" value="1">Page 2</option>

</select>
</td>
</tr>

I can't manage to modify my regex correctly so that it matches the part with "<< Previous".
Do you have any idea how I can do this?

Thanks,
Jarod

c_myers

IMHO, you're heading down a long and endless rabbit hole trying to use regexes on XHTML/XML-type syntax. Even if it is well formed, there are so many oddities that can occur in XHTML that you'll probably never get it right (most of the major browsers can't parse some of the stuff that's out there).

I'd say you should find out why these "special characters" are a problem.

The XmlDocument should be able to handle extended and Unicode characters with no problem.

What does the erroneous XML look like? What error is the parser throwing?

Jarodtweiss

ASKER

Below is the XML that I cannot parse. I receive the error: "System.Xml.XmlException: Name cannot begin with the '' character, hexadecimal value 0x0D. Line 16, position 26"
It's because of "<< Précédent": the parser treats the '<' as the start of a tag.
If I add a CDATA block around it, it's OK. That's why I wanted to use a regex beforehand to add these blocks (I have already run my routine on this page, which is why there are already some CDATA blocks in it).

<?xml version="1.0" encoding="windows-1250"?>
<html>
<head>
<title><![CDATA[
DH2
]]></title>
<link href="../../Styles/DH2.css" rel="stylesheet" type="text/css" />
</head>
<body class="home">
<div>
<form name="aspnetForm" method="post" action="MonTableauProduction.aspx" id="aspnetForm">
<table width="80%">
<tr>
<td align="right" colspan="2">
<a href="javascript:__doPostBack('ctl00$MasterHome$gvMonTableauProduction','Page$Prev')">
<
< Précédent
</a>
<select name="ctl00_MasterHome_gvMonTableauProduction_cboNumPages" id="ctl00_MasterHome_gvMonTableauProduction_cboNumPages" onchange="__doPostBack('ctl00$MasterHome$gvMonTableauProduction','SelectedPage')">
<option value="0"><![CDATA[Page 1]]></option>
<option selected="selected" value="1"><![CDATA[Page 2]]></option>
</select>
</td>
</tr>
</table>
</form>
</div>
</body>
</html>

c_myers

Well, that's just invalid XML, no two ways about it. Where'd you get this XML/XHTML from?

I just have a REALLY bad feeling about a.) Trying to load XHTML as XML (almost never works, as you're seeing) and/or b.) Going into the document to try to "fix" it with regexes and adding CDATA sections to everything. I smell disaster here. You might be able to find a 90% solution, but it's only a matter of time before you get something that blows it up.

Do you have control over the source XHTML (where it comes from)? Because you shouldn't be putting unescaped XML special characters in attribute or element values.
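
To make the escaping point concrete, here is a minimal sketch in Python (not .NET, and not from the original thread). The tiny `raw` snippet and the `link_text` variable are made up for illustration. It only shows that an unescaped "<<" is rejected by an XML parser, while the escaped form parses and comes back as the original text:

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

# Hypothetical snippet: the link text contains unescaped '<' characters.
raw = '<a href="#"><< Previous</a>'

try:
    ET.fromstring(raw)
    raw_parses = True
except ET.ParseError:
    # The parser sees '<' and expects a tag name to follow.
    raw_parses = False

# Escape the text before placing it in the markup.
link_text = "<< Previous"
fixed = f'<a href="#">{escape(link_text)}</a>'
link = ET.fromstring(fixed)

print(raw_parses)  # False
print(fixed)       # <a href="#">&lt;&lt; Previous</a>
print(link.text)   # << Previous
```

The same idea applies on the .NET side: escape the text where the page is generated, rather than patching the output afterwards.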

Jarodtweiss

ASKER

I know this is invalid XML. This page was generated by the .NET framework, but I can certainly update it to replace the "<<" with "&lt;&lt;" (the XHTML validation is done on static pages but not on the content of user controls).
But I was wondering how I could fix my regex to add the CDATA blocks.

c_myers

I still maintain that this is but one of a million little things you're going to run into and, even after you deploy this app, things will keep popping up.

The point, in these examples, is that the content simply CANNOT be reliably parsed. You just HAPPEN to be lucky in that your content looks like "< blah".

But what if your content looked like "< blah >"? How would you know, from a regex, that "< blah >" wasn't simply an inner tag?

The correct solution to your question is to either a.) Don't try to parse XHTML as XML or b.) Fix the XHTML source so that it's proper XML.

If you insist, though, here's a regex that works for both of your examples. But I can come up with several examples that will break this regex quickly. So what's the point?

(?s:(?<openTag><(?<tagName>[^/>=\s]*)\s*[^/>]*(?![/])>)(?!\s*<\s*\w)(?<content>[\s]*.+?[\s]*)(?<endTag></\k<tagName>>))

NOTE: I would recommend adding the RegexOption.Compiled to your 'new Regex()' declaration since this is an expensive regex and, I assume, you'll be using it a lot.
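
For anyone who wants to try the pattern above outside .NET, here is a rough Python port. It is a sketch only, not part of the original answer. The `wrap_in_cdata` helper and the shortened `SAMPLE` markup are hypothetical. The only syntax changes are `(?<name>...)` to `(?P<name>...)` and `\k<tagName>` to `(?P=tagName)`:

```python
import re
import xml.etree.ElementTree as ET

# Python port of the .NET pattern; the logic is unchanged.
PATTERN = re.compile(
    r"(?s:(?P<openTag><(?P<tagName>[^/>=\s]*)\s*[^/>]*(?![/])>)"
    r"(?!\s*<\s*\w)"               # skip tags whose content starts with a child tag
    r"(?P<content>[\s]*.+?[\s]*)"  # lazily take everything up to the closing tag
    r"(?P<endTag></(?P=tagName)>))"
)

def wrap_in_cdata(markup: str) -> str:
    """Wrap the text content of each matched tag in a CDATA block.

    Caveat: content that itself contains ']]>' would still break the result.
    """
    return PATTERN.sub(
        lambda m: f"{m['openTag']}<![CDATA[{m['content']}]]>{m['endTag']}",
        markup,
    )

# Shortened, made-up version of the markup from the question.
SAMPLE = """<td>
<a href="javascript:go('Page$Prev')">
<< Previous
</a>
<select id="pages">
<option value="0">Page 1</option>
</select>
</td>"""

wrapped = wrap_in_cdata(SAMPLE)
root = ET.fromstring(wrapped)      # now parses as XML
print(root.find("a").text.strip())
```

On this sample the link text and the option text each end up inside a CDATA block, and the result loads as XML. As said above, it is easy to build inputs that defeat it (content such as "< blah >", for instance).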

Good luck if you continue down this path. I don't envy all the hairpulling you're setting yourself up for! :)

Jarodtweiss

ASKER

Thanks !
At least I have learned something about regex :-)

But I definitely agree with you, I will encounter some other problems in the future, that's why I fixed my page to avoid those characters.
However, an extra question ( you already have the points, but it's just to continue a little bit the discussion ;-) )

The reason I was doing this is that I'm working in TDD, so I have some tests that make HTTP requests to validate my pages. So I need to be able to parse my page to check for the presence of a given button, input, ...
At first I was working with regexes, but my lack of knowledge of them was blocking me. E.g. one thing I had to search for was:
I want the input field with the id attribute containing the word "part of my id" and the value "my input value"
This is very easy with XPath, but I didn't know how to do the same with a regex because I didn't know in which order the framework would generate the attributes.
So two questions :
- Can we specify a test like "Id = 'something' and value = 'something else', whatever the order" in a regex? (By the way, do you know a good tutorial / reference document for regex?)
- If you had to validate the content of an (X)HTML page, which solution would you use?

To complete the picture, I need to parse the page in order to:
- validate that the page holds the correct controls
- extract some control names and values, so that I can drive my ASP.NET web site through HTTP requests by providing the correct POST parameters
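
As an illustration of the "whatever the order" check, here is one possible approach, sketched in Python rather than .NET. The page fragment, the ids and the `INPUT_TAG` name are all hypothetical. The regex uses one lookahead per attribute, so attribute order inside the tag does not matter. The second half does the same check on a parsed document, where order never comes into play:

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical page fragment: same attributes, different order in each tag.
PAGE = """<form>
<input value="my input value" type="text" id="ctl00_part_of_my_id_txt" />
<input id="ctl00_other_txt" type="text" value="my input value" />
</form>"""

# Regex route: each lookahead scans the tag independently for one attribute.
# Caveats: assumes double-quoted values and no '>' inside attribute values;
# '\bid=' would also match an attribute such as data-id.
INPUT_TAG = re.compile(
    r'<input\b'
    r'(?=[^>]*\bid="[^"]*part_of_my_id[^"]*")'
    r'(?=[^>]*\bvalue="my input value")'
    r'[^>]*>'
)
regex_hits = INPUT_TAG.findall(PAGE)

# Parser route: attributes are exposed as a mapping, so order is irrelevant.
root = ET.fromstring(PAGE)
parsed_hits = [
    el for el in root.iter("input")
    if "part_of_my_id" in el.get("id", "") and el.get("value") == "my input value"
]

print(len(regex_hits), len(parsed_hits))
```

Both routes find only the first input here. The parser route only works when the page is well-formed XML, which is the constraint discussed earlier in the thread.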

ASKER CERTIFIED SOLUTION

c_myers

Jarodtweiss

ASKER

OK, thanks for your help!