Problem with regular expression to capture tag content

Posted on 2006-04-01
Last Modified: 2013-11-19
Hello All !

This may not be the best place for this subject, but anyway...
I need to be able to parse an XHTML page as XML.
So, first step: simply load the page into an XmlDocument. This worked fine until my page contained some special characters (accents, ...).
So, second step: before loading the page, I wanted to process it with a regular expression to add CDATA blocks around the content of each tag. To do that, I was using the following regular expression:
  new Regex(@"(?<openTag><[^/][^>]*[^/]>)(?<content>\s*[^<>]+\s*)(?<endTag></[^>]*>)");
This worked fine until I received a page containing tags like these:
  <td align="right" colspan="2">
     <a href="javascript:__doPostBack('ctl00$MasterHome$gvMonTableauProduction','Page$Prev')">
        << Previous
     <select id="ctl00_MasterHome_gvMonTableauProduction_cboNumPages" onchange="__doPostBack('ctl00$MasterHome$gvMonTableauProduction','SelectedPage')">
        <option value="0">Page 1</option>
        <option selected="selected" value="1">Page 2</option>


I can't manage to modify my regex correctly so that it matches the part with "<< Previous".
Do you have any idea how I can do this?
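
For reference, the CDATA-wrapping step described in the question might look roughly like the sketch below. The pattern is the one quoted above; the class name `CDataWrapper`, the method `Wrap`, and the replacement string are assumptions, since the question does not show how the regex was applied.

```csharp
using System.Text.RegularExpressions;

public static class CDataWrapper
{
    // The pattern from the question, unchanged.
    private static readonly Regex TagContent = new Regex(
        @"(?<openTag><[^/][^>]*[^/]>)(?<content>\s*[^<>]+\s*)(?<endTag></[^>]*>)");

    // Wraps the text found between an opening tag and a closing tag in a CDATA section.
    // The replacement string is an assumption; the question does not show it.
    public static string Wrap(string xhtml)
    {
        return TagContent.Replace(xhtml, "${openTag}<![CDATA[${content}]]>${endTag}");
    }
}
```

With this sketch, `Wrap("<option value=\"0\">Page 1</option>")` wraps `Page 1` in a CDATA section, while `Wrap("<a href=\"#\"><< Previous</a>")` returns its input unchanged, because the `content` group uses `[^<>]+` and therefore cannot match text that contains a literal `<`.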

Question by:Jarodtweiss
    LVL 4

    Expert Comment

    IMHO, you're heading down a long and endless rabbit hole trying to use regexes on XHTML/XML-type syntax. Even if it is well formed, there are so many oddities that can occur in XHTML that you'll probably never get it right (most of the major browsers can't parse some of the stuff that's out there).

    I'd say you should find out why these "special characters" are a problem.

    The XmlDocument should be able to handle extended and Unicode characters with no problem.

    What does the erroneous XML look like? What error is the parser throwing?
    LVL 4

    Author Comment

    Below is the XML that I cannot parse. I receive the error: "System.Xml.XmlException: Name cannot begin with the '' character, hexadecimal value 0x0D. Line 16, position 26"
    It's because of "<< Précédent": the parser treats the '<' as the start of an opening tag.
    If I add a CDATA block around it, it's OK. That's why I wanted to use a regex beforehand to add these blocks (I have already run my routine on this sample, which is why it already contains some CDATA blocks).

    <?xml version="1.0" encoding="windows-1250"?>
          <link href="../../Styles/DH2.css" rel="stylesheet" type="text/css" />
       <body class="home">
             <form name="aspnetForm" method="post" action="MonTableauProduction.aspx" id="aspnetForm">
                <table width="80%">
                      <td align="right" colspan="2">
                         <a href="javascript:__doPostBack('ctl00$MasterHome$gvMonTableauProduction','Page$Prev')">
                            < Précédent
                         <select name="ctl00_MasterHome_gvMonTableauProduction_cboNumPages" id="ctl00_MasterHome_gvMonTableauProduction_cboNumPages" onchange="__doPostBack('ctl00$MasterHome$gvMonTableauProduction','SelectedPage')">
                            <option value="0"><![CDATA[Page 1]]></option>
                            <option selected="selected" value="1"><![CDATA[Page 2]]></option>
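
The behaviour described in this comment can be reproduced in isolation. The sketch below is only an illustration: the class `XmlLoadCheck` and the method `TryGetRootText` are hypothetical names, and the one-element inputs are reduced stand-ins for the page above.

```csharp
using System.Xml;

public static class XmlLoadCheck
{
    // Returns the text content of the root element,
    // or null if the markup is not well-formed XML.
    public static string TryGetRootText(string markup)
    {
        var doc = new XmlDocument();
        try
        {
            doc.LoadXml(markup);
            return doc.DocumentElement.InnerText;
        }
        catch (XmlException)
        {
            // An unescaped '<' inside text content ends up here.
            return null;
        }
    }
}
```

Under these assumptions, `<a><< Précédent</a>` is rejected, while `<a>&lt;&lt; Précédent</a>` and `<a><![CDATA[<< Précédent]]></a>` both load and yield the text `<< Précédent`. The accented characters are not the problem; the unescaped `<` is.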
    LVL 4

    Expert Comment

    Well, that's just invalid XML, no two ways about it. Where'd you get this XML/XHTML from?

    I just have a REALLY bad feeling about a.) Trying to load XHTML as XML (almost never works, as you're seeing) and/or b.) Going into the document to try to "fix" it with regexes and adding CDATA sections to everything. I smell disaster here. You might be able to find a 90% solution, but it's only a matter of time before you get something that blows it up.

    Do you have control over the source XHTML (where it comes from)? Because you shouldn't be putting unescaped XML special characters in attribute or element values.
    LVL 4

    Author Comment

    I know this is invalid XML. This page was generated by the .NET framework, but I can certainly update it to replace the "<<" with "&lt;&lt;" (the XHTML validation is done on static pages but not on the content of user controls).
    But I was still wondering how I could fix my regex to add the CDATA blocks.
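
Escaping the pager text at the source, as suggested here, can be sketched as follows. This assumes `System.Net.WebUtility` is available (.NET Framework 4.0 or later); the class `PagerText` and the method `Encode` are hypothetical names for illustration.

```csharp
using System.Net;

public static class PagerText
{
    // HTML-encodes link text before it is written into the page,
    // so that '<', '>' and '&' become entities instead of raw markup.
    public static string Encode(string text)
    {
        return WebUtility.HtmlEncode(text);
    }
}
```

For example, `PagerText.Encode("<< Previous")` yields `&lt;&lt; Previous`, which an XML parser accepts as ordinary text.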
    LVL 4

    Expert Comment

    I still maintain that this is but one of a million little things you're going to run into and, even after you deploy this app, things will keep popping up.

    The point, in these examples, is that the content simply CANNOT be reliably parsed. You just HAPPEN to be lucky in that your content looks like "< blah"

    but what if your content looked like "< blah >". How would you know, from a regex, that "< blah >" wasn't simply an inner tag?
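
To make the ambiguity concrete, here is a small sketch that runs the pattern from the question against such content and reports what it takes to be the opening tag. The class `AmbiguityDemo` and the method `FirstOpenTag` are hypothetical names.

```csharp
using System.Text.RegularExpressions;

public static class AmbiguityDemo
{
    // The pattern from the question, unchanged.
    private static readonly Regex TagContent = new Regex(
        @"(?<openTag><[^/][^>]*[^/]>)(?<content>\s*[^<>]+\s*)(?<endTag></[^>]*>)");

    // Returns what the pattern considers the opening tag of the first match,
    // or null if nothing matches.
    public static string FirstOpenTag(string xhtml)
    {
        Match m = TagContent.Match(xhtml);
        return m.Success ? m.Groups["openTag"].Value : null;
    }
}
```

For the input `<td>< blah >text</td>`, the pattern reports `< blah >` as the opening tag: the literal text is indistinguishable from markup, so any CDATA section added on that basis ends up in the wrong place and the result is still not well-formed XML.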

    The correct solution to your question is to either a.) Don't try to parse XHTML as XML or b.) Fix the XHTML source so that it's proper XML.

    If you insist, though, here's a regex that works for both of your examples. But I can come up with several examples that will break this regex quickly. So what's the point?


    NOTE: I would recommend adding RegexOptions.Compiled to your 'new Regex()' declaration since this is an expensive regex and, I assume, you'll be using it a lot.
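
A minimal sketch of that recommendation is shown below. The pattern from the original question is used as a stand-in, and the class name `PageRegexes` is hypothetical.

```csharp
using System.Text.RegularExpressions;

public static class PageRegexes
{
    // Built once and reused. RegexOptions.Compiled makes construction slower
    // but matching faster, which suits a pattern that is applied many times.
    public static readonly Regex TagContent = new Regex(
        @"(?<openTag><[^/][^>]*[^/]>)(?<content>\s*[^<>]+\s*)(?<endTag></[^>]*>)",
        RegexOptions.Compiled);
}
```

Keeping the instance in a static readonly field matters as much as the option itself: compiling the same pattern on every call would cost more than it saves.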

    Good luck if you continue down this path. I don't envy all the hairpulling you're setting yourself up for! :)
    LVL 4

    Author Comment

    Thanks !
    At least, I have learned something with regex :-)

    But I definitely agree with you, I will encounter some other problems in the future, that's why I fixed my page to avoid those characters.
    However, an extra question ( you already have the points, but it's just to continue a little bit the discussion ;-) )

    The reason I was doing this is that I'm working in TDD, so I have some tests that issue HTTP requests to validate my pages. So I need to be able to parse my page to check for the presence of a given button, input, ...
    At first I was working with regexes, but my limited knowledge of them was blocking me. E.g. one thing I had to search for was:
      I want the input field with the id attribute containing the word "part of my id" and the value "my input value"
    This is very easy with XPath, but I didn't know how to do the same with a regex because I didn't know in which order the framework generates the attributes.
    So two questions :
    - Can we specify a test like "Id = 'something' and value = 'something else' whatever the order" in a regex ? (btw do you know some good tutorial / reference document for regex)
    - If you had to validate the content of an (X)HTML page, which solution would you use ?

    To sum up, I need to parse the page in order to:
    - validate that the page holds the correct controls
    - extract some control names and values so that I can drive my ASP.NET web site through HTTP requests by providing the correct POST parameters
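
The XPath lookup mentioned in this comment could be sketched as follows. The class `ControlFinder` and the method `FindInput` are hypothetical names, and the sketch assumes the page has already been loaded into an XmlDocument, declares no default XML namespace (otherwise an XmlNamespaceManager is needed), and that the searched values contain no single quotes.

```csharp
using System.Xml;

public static class ControlFinder
{
    // Finds the first <input> whose id contains idPart and whose value equals
    // the expected value. Attribute order in the markup does not matter.
    public static XmlNode FindInput(XmlDocument page, string idPart, string value)
    {
        // Assumption: idPart and value contain no single quotes.
        string xpath = string.Format(
            "//input[contains(@id, '{0}') and @value = '{1}']", idPart, value);
        return page.SelectSingleNode(xpath);
    }
}
```

`FindInput` returns null when no matching control exists, which maps directly onto a test assertion about the presence of a control.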
    LVL 4

    Accepted Solution

    - You should make sure that you're Server.HtmlEncode()'ing that bit with the << Before so that it shows up as &lt;&lt; Before.

    - As far as testing content, you should check out NUnitASP. It does a lot of that kind of stuff for you. If that doesn't work, then you're probably doing what you have to, as ugly as it is, lol.

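    - As far as having parameterized regexes, which is what I think you're asking, you'll have to generate a new regex every time. You can use a generic format string and then String.Format it, maybe?

    // This is just a made-up pattern, ignore most of it.
    // Literal braces (the regex quantifiers) are doubled so that String.Format
    // does not read them as placeholders.
    string regexPat = @"(?<blah>\w{{3}}{0}\s{{4}}{1})";

    Then, later, you'd use it like:

    // Regex.Escape keeps any special characters in the inserted values literal.
    Regex r = new Regex( String.Format( regexPat, Regex.Escape("id"), Regex.Escape("value") ) );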

    - Validating content, you mean like whether it's valid XHTML according to the DOCTYPE/DTD? HtmlTidy does it, there's also some web services out there that do it, if I recall correctly.
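
If the question was instead about matching two attributes regardless of their order, lookahead assertions are one possible approach. The sketch below is an illustration only: the class `InputMatcher` and the method `Build` are hypothetical, and it assumes double-quoted attribute values that contain no `>` character.

```csharp
using System.Text.RegularExpressions;

public static class InputMatcher
{
    // Builds a regex matching an <input> tag whose id contains idPart and whose
    // value equals the given value, in either attribute order.
    public static Regex Build(string idPart, string value)
    {
        // Each (?=...) lookahead scans the tag from just after "<input" without
        // consuming anything, so the two conditions are independent of each other.
        string pattern = string.Format(
            @"<input\b(?=[^>]*\sid=""[^""]*{0}[^""]*"")(?=[^>]*\svalue=""{1}"")[^>]*>",
            Regex.Escape(idPart), Regex.Escape(value));
        return new Regex(pattern, RegexOptions.IgnoreCase);
    }
}
```

A call such as `InputMatcher.Build("part of my id", "my input value")` then matches both `<input id="..." value="...">` and `<input value="..." id="...">`. It remains a text-level check, with the same fragility discussed earlier in the thread.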

    To answer your conclusion:

    - "validating that page holds the correct control" - NUnitASP does this better, because it also validates that the control hierarchy was created in the correct order, etc.
    - driving your page through POST: Hrm, NUnitASP might do this, but I would look hard for a framework to do this because this is NOT a trivial undertaking and is best left to someone dedicated to doing that particular task.
    LVL 4

    Author Comment

    OK, thanks for your help!
