  • Status: Solved
  • Priority: Medium
  • Security: Public
  • Views: 436

Problem with regular expression to capture tag content

Hello all!

This may not be the best place for this subject, but anyway...
I need to parse an XHTML page as XML.
So, first step: simply load the page into an XmlDocument. This worked fine until my page contained some special characters (accents, ...).
So, second step: before loading the page, I wanted to preprocess it with a regular expression that wraps the content of each tag in a CDATA block. To do that, I was using the following regular expression:
  new Regex(@"(?<openTag><[^/][^>]*[^/]>)(?<content>\s*[^<>]+\s*)(?<endTag></[^>]*>)");
This worked fine until I received a page containing tags like these:
<tr>
  <td align="right" colspan="2">
     <a href="javascript:__doPostBack('ctl00$MasterHome$gvMonTableauProduction','Page$Prev')">
        << Previous
     </a>
     <select id="ctl00_MasterHome_gvMonTableauProduction_cboNumPages" onchange="__doPostBack('ctl00$MasterHome$gvMonTableauProduction','SelectedPage')">
        <option value="0">Page 1</option>
        <option selected="selected" value="1">Page 2</option>

     </select>
  </td>
</tr>

I can't manage to modify my regex correctly so that it matches the part with "<< Previous".
Do you have any idea how I can do this?

Thanks,
Jarod
Asked by: Jarodtweiss
1 Solution
 
c_myersCommented:
IMHO, you're heading down a long and endless rabbit hole trying to use regexes on XHTML/XML-type syntax. Even if it is well formed, there are so many oddities that can occur in XHTML that you'll probably never get it right (most of the major browsers can't parse some of the stuff that's out there).

I'd say you should find out why these "special characters" are a problem.

XmlDocument should be able to handle extended and Unicode characters with no problem.

What does the erroneous XML look like? What error is the parser throwing?
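
To illustrate the point about accented characters, here is a minimal sketch showing XmlDocument loading text that contains them. The class name UnicodeLoadDemo and the sample markup are made up for illustration. When accents do cause an error, a likely cause (an assumption, not something confirmed in this thread) is a mismatch between the declared encoding and the actual bytes.

```csharp
using System;
using System.Xml;

// Hypothetical demo class, not part of the original question.
public static class UnicodeLoadDemo
{
    public static string ReadLinkText()
    {
        var doc = new XmlDocument();

        // Accented characters are legal XML character data as-is;
        // no CDATA block is needed for them.
        doc.LoadXml("<td><a href=\"#\">Précédent</a></td>");

        return doc.SelectSingleNode("/td/a").InnerText;
    }
}
```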
 
JarodtweissAuthor Commented:
Below is the XML that I cannot parse. I receive the error: "System.Xml.XmlException: Name cannot begin with the '' character, hexadecimal value 0x0D. Line 16, position 26"
It's because of "<< Précédent": the parser treats the '<' as the start of a tag.
If I add a CDATA block around it, it's OK. That's why I wanted to use a regex beforehand to add these blocks (I have already run my routine on this sample, which is why it already contains some CDATA blocks).


<?xml version="1.0" encoding="windows-1250"?>
<html>
   <head>
      <title><![CDATA[
         DH2
      ]]></title>
      <link href="../../Styles/DH2.css" rel="stylesheet" type="text/css" />
   </head>
   <body class="home">
      <div>
         <form name="aspnetForm" method="post" action="MonTableauProduction.aspx" id="aspnetForm">
            <table width="80%">
               <tr>
                  <td align="right" colspan="2">
                     <a href="javascript:__doPostBack('ctl00$MasterHome$gvMonTableauProduction','Page$Prev')">
                        <
                        < Précédent
                     </a>
                     <select name="ctl00_MasterHome_gvMonTableauProduction_cboNumPages" id="ctl00_MasterHome_gvMonTableauProduction_cboNumPages" onchange="__doPostBack('ctl00$MasterHome$gvMonTableauProduction','SelectedPage')">
                        <option value="0"><![CDATA[Page 1]]></option>
                        <option selected="selected" value="1"><![CDATA[Page 2]]></option>
                     </select>
                  </td>
               </tr>
            </table>
         </form>
      </div>
   </body>
</html>
 
c_myersCommented:
Well, that's just invalid XML, no two ways about it. Where'd you get this XML/XHTML from?

I just have a REALLY bad feeling about a.) Trying to load XHTML as XML (almost never works, as you're seeing) and/or b.) Going into the document to try to "fix" it with regexes and adding CDATA sections to everything. I smell disaster here. You might be able to find a 90% solution, but it's only a matter of time before you get something that blows it up.

Do you have control over the source XHTML (where it comes from)? Because you shouldn't be putting unescaped XML special characters in attribute or element values.
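
As a rough sketch of what escaping at the source buys you: once the special characters are escaped before the text is written into the markup, XmlDocument loads it and gives the original text back. The class EscapeDemo and its method are hypothetical; SecurityElement.Escape is used here only as one readily available way to escape <, >, &, and quotes, not necessarily what the page itself should call.

```csharp
using System.Security;
using System.Xml;

// Hypothetical demo class, not part of the original thread.
public static class EscapeDemo
{
    public static string RoundTrip(string linkText)
    {
        // "<< Précédent" becomes "&lt;&lt; Précédent" before it reaches the markup.
        string xml = "<a>" + SecurityElement.Escape(linkText) + "</a>";

        var doc = new XmlDocument();
        doc.LoadXml(xml); // would throw XmlException without the escaping

        // The parser turns the entities back into the original characters.
        return doc.DocumentElement.InnerText;
    }
}
```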
JarodtweissAuthor Commented:
I know this is invalid XML. This page was generated by the .NET framework, but I can certainly update it to replace the "<<" with "&lt;&lt;" (XHTML validation is done on the static pages but not on the content of user controls).
But I was wondering how I could fix my regex to add the CDATA blocks.
 
c_myersCommented:
I still maintain that this is but one of a million little things you're going to run into and, even after you deploy this app, things will keep popping up.

The point, in these examples, is that the content simply CANNOT be reliably parsed. You just HAPPEN to be lucky in that your content looks like "< blah".

But what if your content looked like "< blah >"? How would you know, from a regex, that "< blah >" wasn't simply an inner tag?

The correct solution to your question is to either a.) Don't try to parse XHTML as XML or b.) Fix the XHTML source so that it's proper XML.

If you insist, though, here's a regex that works for both of your examples. But I can come up with several examples that will break this regex quickly. So what's the point?

(?s:(?<openTag><(?<tagName>[^/>=\s]*)\s*[^/>]*(?![/])>)(?!\s*<\s*\w)(?<content>[\s]*.+?[\s]*)(?<endTag></\k<tagName>>))

NOTE: I would recommend adding RegexOptions.Compiled to your 'new Regex()' declaration since this is an expensive regex and, I assume, you'll be using it a lot.
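
For reference, a minimal sketch of how the regex above might be wired up with RegexOptions.Compiled to add the CDATA blocks. The class CdataWrapper, the method Wrap, and the replacement string are assumptions for illustration; only the pattern itself comes from this post. It carries all the caveats already mentioned: for example, an attribute value containing '/' will stop the open tag from matching, and content that is already wrapped in CDATA may get wrapped again.

```csharp
using System.Text.RegularExpressions;

// Hypothetical helper, not part of the original post.
public static class CdataWrapper
{
    // Built once and reused, since compiling the regex is the expensive part.
    private static readonly Regex LeafContent = new Regex(
        @"(?s:(?<openTag><(?<tagName>[^/>=\s]*)\s*[^/>]*(?![/])>)(?!\s*<\s*\w)(?<content>[\s]*.+?[\s]*)(?<endTag></\k<tagName>>))",
        RegexOptions.Compiled);

    public static string Wrap(string markup)
    {
        // Re-emit each match with its content inside a CDATA section.
        return LeafContent.Replace(markup, "${openTag}<![CDATA[${content}]]>${endTag}");
    }
}
```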

Good luck if you continue down this path. I don't envy all the hairpulling you're setting yourself up for! :)
 
JarodtweissAuthor Commented:
Thanks !
At least, I have learned something with regex :-)

But I definitely agree with you, I will encounter some other problems in the future, that's why I fixed my page to avoid those characters.
However, an extra question ( you already have the points, but it's just to continue a little bit the discussion ;-) )

The reason I was doing this is that I'm working in TDD, so I have tests that issue HTTP requests to validate my pages. I therefore need to parse each page to check for the presence of a given button, input, ...
At first I was working with regexes, but my lack of knowledge of them was blocking me. E.g., one thing I had to search for was:
  I want the input field whose id attribute contains "part of my id" and whose value is "my input value"
This is very easy with XPath, but I didn't know how to do the same with a regex because I didn't know in which order the framework would generate the attributes.
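
For readers following along, the XPath version of that search might look roughly like the sketch below. The class InputFinder and its method are hypothetical, the quoting is naive (it assumes neither argument contains an apostrophe), and a page that declares the XHTML namespace would additionally need an XmlNamespaceManager and prefixed element names.

```csharp
using System.Xml;

// Hypothetical helper, not part of the original post.
public static class InputFinder
{
    public static XmlNode FindInput(XmlDocument page, string idPart, string value)
    {
        // Attribute order in the markup is irrelevant to XPath predicates.
        // Naive quoting: assumes idPart and value contain no apostrophes.
        string xpath = "//input[contains(@id, '" + idPart + "') and @value = '" + value + "']";

        // Returns null when no such input exists on the page.
        return page.SelectSingleNode(xpath);
    }
}
```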
So two questions :
- Can we specify a test like "Id = 'something' and value = 'something else' whatever the order" in a regex ? (btw do you know some good tutorial / reference document for regex)
- If you had to validate the content of an (X)HTML page, which solution would you use ?

To be complete, I need to parse the page in order to:
- validate that the page holds the correct controls
- extract some control names and values so I can drive my ASP.NET web site through HTTP requests with the correct POST parameters
 
c_myersCommented:
- You should make sure that you're Server.HtmlEncode()'ing that bit with the "<< Previous" so that it shows up as "&lt;&lt; Previous".

- As far as testing content, you should check out NUnitASP. It does a lot of that kind of stuff for you. If that doesn't work, then you're probably doing what you have to, as ugly as it is, lol.

- As far as having parameterized regexes, which is what I think you're asking, you'll have to generate a new regex every time. You can use a generic format string and then String.Format it, maybe?

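// This is just a made-up pattern, ignore most of it.
// Literal regex braces are doubled ({{3}}) so that String.Format leaves them
// alone; only {0} and {1} are treated as placeholders.
string regexPat = @"(?<blah>\w{{3}}{0}\s{{4}}{1})";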

Then, later, you'd use it like:

Regex r = new Regex( String.Format( regexPat, "id", "value" ) );
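
If the question was really about matching attributes "whatever the order", one possible approach (a sketch under assumptions, not a tested recipe for every page) is to put each attribute test in its own lookahead and fill in the values with String.Format as above. The class InputMatcher and the template are hypothetical; the pattern assumes double-quoted attribute values and would also accept an attribute whose name merely ends in "-id", since \b matches after a hyphen.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of the original post.
public static class InputMatcher
{
    // {0} = fragment of the id, {1} = exact value. Each lookahead scans the
    // whole tag from the same position, so attribute order does not matter.
    private const string PatternTemplate =
        @"<input\b(?=[^>]*\bid=""[^""]*{0}[^""]*"")(?=[^>]*\bvalue=""{1}"")[^>]*>";

    public static bool HasInput(string html, string idPart, string value)
    {
        // Regex.Escape keeps characters such as '$' or '.' in the arguments
        // from being read as regex syntax.
        string pattern = String.Format(
            PatternTemplate, Regex.Escape(idPart), Regex.Escape(value));

        return Regex.IsMatch(html, pattern);
    }
}
```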

- Validating content, you mean like whether it's valid XHTML according to the DOCTYPE/DTD? HtmlTidy does it, there's also some web services out there that do it, if I recall correctly.

To answer your conclusion:

- "validating that page holds the correct control" - NUnitASP does this better, because it also validates that the control hierarchy was created in the correct order, etc.
- driving your page through POST: Hrm, NUnitASP might do this, but I would look hard for a framework to do this because this is NOT a trivial undertaking and is best left to someone dedicated to doing that particular task.
 
JarodtweissAuthor Commented:
Ok thanks for ur help !