Solved

Help With Regex Pattern

Posted on 2006-10-31
Last Modified: 2012-05-05
Experts, I need some help with a Regex pattern for a project. I am parsing an HTML page, and throughout the page there are specific markers denoting dynamic content.

<html>
.....
.....
.....
<!--wc_start-->
<!--wc_end-->
.....
more Html, etc
.....
.....

I need a regex pattern that will match all of the content outside of the markers. I also need the content inside the markers, but I need to somehow separate the content outside the markers from the content inside the markers. To summarize, if the following were my HTML document:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1" />
<title>Untitled Document</title>
</head>

<body>
<!--wc_start-->
The first piece of content
<!--wc_end-->
<div id="CoolContent">Hello There</div>
<!--wc_start-->
The second piece of content
<!--wc_end-->
</body>
</html>

Then my match results would equal:

ContentInsideMarker[0] = "The first piece of content"
ContentInsideMarker[1] = "The second piece of content"

ContentOutsideTheMarker[0] = "<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"><html xmlns="http://www.w3.org/1999/xhtml"><head><meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1" /><title>Untitled Document</title></head><body>";

ContentOutsideTheMarker[1] = "<div id="CoolContent">Hello There</div>";

ContentOutsideTheMarker[2] = "</body></html>";

I hope this makes sense. Let me know if you need more details. I don't know enough about regex to write the pattern myself. I would assume you would use a MatchCollection, but again I'm not the expert :)

Thanks so much for the correct pattern.

~C
Question by:clickclickbang
7 Comments
 
LVL 15

Expert Comment

by:ozymandias
ID: 17845901
Do you have to use regex for this?
It's not necessarily the best tool.
Is all your HTML XHTML?
You could probably parse this better using XML and an XSL transform.
 
LVL 1

Author Comment

by:clickclickbang
ID: 17846411
The content will not always be xhtml and may contain improper use of HTML. If there is another way other than Regex, I am open to an example. :)
 
LVL 15

Accepted Solution

by:
ozymandias earned 200 total points
ID: 17846580
OK. Without using any regex or XML, and assuming that the <!--wc_start--> and <!--wc_end--> comments are always paired, always correctly placed, and each on a line by itself, the following works fine:

using System;

namespace Scraps
{
      public enum ContentType
      {
            Static,
            Dynamic
      }
      
      class Class1
      {

            private bool inStatic = true;
            private string[] lines;

            private string html = @"
                  <!DOCTYPE html PUBLIC '-//W3C//DTD XHTML 1.0 Transitional//EN' 'http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd'>
                  <html xmlns='http://www.w3.org/1999/xhtml'>
                  <head>
                  <meta http-equiv='Content-Type' content='text/html; charset=iso-8859-1' />
                  <title>Untitled Document</title>
                  </head>
                  <body>
                  <!--wc_start-->
                  The first piece of content
                  <!--wc_end-->
                  <div id='CoolContent'>Hello There</div>
                  <!--wc_start-->
                  The second piece of content
                  <!--wc_end-->
                  </body>
                  </html>";

            public Class1(){
                  lines = html.Split('\n');
            }

            public void Output(ContentType type){
                  foreach (string line in lines){
                        if (line.Trim().StartsWith("<!--wc_start-->")){
                              this.inStatic = false;
                              continue;
                        }else if (line.Trim().StartsWith("<!--wc_end-->")){
                              this.inStatic = true;
                              continue;
                        }else if ((type == ContentType.Dynamic && !this.inStatic) || (type == ContentType.Static && this.inStatic)){
                              Console.WriteLine(line);
                        }
                  }
            }
            
            [STAThread]
            static void Main(string[] args)
            {
                  Class1 c1 = new Class1();
                  c1.Output(ContentType.Static);
                  Console.WriteLine("\n");
                  c1.Output(ContentType.Dynamic);
                  Console.ReadLine();
            }
      }
}
 
LVL 15

Expert Comment

by:ozymandias
ID: 17846601
Basically, it just loops through the lines of HTML, and when it encounters either of the marker comments it switches mode.
The rest of the time it just outputs the lines, depending on which mode it's in and which type of content it's been asked for.
 
LVL 7

Assisted Solution

by:mjmarlow
mjmarlow earned 100 total points
ID: 17848380
public class Extract
    {
        const string START_TAG = "<!--wc_start-->";
        const string END_TAG = "<!--wc_end-->";

        // Returns the text found between each START_TAG/END_TAG pair.
        public System.Collections.Specialized.StringCollection  GetComments(string html)
        {
            System.Collections.Specialized.StringCollection col = new System.Collections.Specialized.StringCollection();
            int pStart = 0;
            int pEnd   = 0;
            while ((pStart = html.IndexOf(START_TAG, pEnd)) > -1 )
            {
                // Find closing tag
                pEnd = html.IndexOf(END_TAG, pStart);

                // Stop on an unpaired start tag; otherwise Substring and the next IndexOf would throw
                if (pEnd < 0) break;

                // collect the comment
                col.Add(html.Substring(pStart+START_TAG.Length,pEnd-pStart-START_TAG.Length));

            }
            return col;
        }
    }
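
The same IndexOf scanning can also collect the content outside the markers, and since it works on character positions it does not need the comments to be on lines by themselves. The following is only a rough sketch under some assumptions: the class name MarkerSplitter, the Split method and the AddIfNotEmpty helper are made up for illustration, it assumes .NET 2.0 or later for List<string>, and it treats an unpaired start marker as ordinary outside content.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper, not part of the original answer.
public static class MarkerSplitter
{
    const string StartTag = "<!--wc_start-->";
    const string EndTag = "<!--wc_end-->";

    // Adds the text between each marker pair to 'inside' and all other text to 'outside'.
    public static void Split(string html, List<string> inside, List<string> outside)
    {
        int pos = 0;
        while (true)
        {
            int start = html.IndexOf(StartTag, pos, StringComparison.Ordinal);
            int end = start < 0
                ? -1
                : html.IndexOf(EndTag, start + StartTag.Length, StringComparison.Ordinal);

            if (end < 0)
            {
                // No further complete pair: the remainder counts as outside content
                // (assumption: an unpaired start marker is left in place as plain text).
                AddIfNotEmpty(outside, html.Substring(pos));
                return;
            }

            // Text before the start marker is outside content.
            AddIfNotEmpty(outside, html.Substring(pos, start - pos));

            // Text between the two markers is inside content.
            int contentStart = start + StartTag.Length;
            AddIfNotEmpty(inside, html.Substring(contentStart, end - contentStart));

            // Continue scanning after the end marker.
            pos = end + EndTag.Length;
        }
    }

    // Trims surrounding whitespace and skips segments that end up empty.
    static void AddIfNotEmpty(List<string> list, string text)
    {
        string trimmed = text.Trim();
        if (trimmed.Length > 0) list.Add(trimmed);
    }
}
```

To use it, create two empty List<string> instances, pass them to MarkerSplitter.Split together with the page text, and read the inside and outside segments back in document order.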
 
LVL 63

Assisted Solution

by:Fernando Soto
Fernando Soto earned 200 total points
ID: 17849476
Hi clickclickbang ;

Try this out.

      // Requires: using System.Collections; using System.IO; using System.Text.RegularExpressions;
      private static string pattern = @"(?:(?<1>.*?)<!--wc_start-->(?<2>.*?)<!--wc_end-->)|(?:<!--wc_start-->(?<2>.*?)<!--wc_end-->(?<1>.*?))|(?<1>.*)";
      private Regex re = new Regex(pattern, RegexOptions.Singleline | RegexOptions.Compiled);
      private ArrayList ContentInsideMarker = new ArrayList();
      private ArrayList ContentOutsideTheMarker = new ArrayList();

      private void button1_Click(object sender, System.EventArgs e)
      {
            StreamReader sr = new StreamReader(@"C:\Temp\Parsedata.txt");
            string input = sr.ReadToEnd();
            sr.Close();
            MatchCollection mc = re.Matches(input);
            foreach( Match m in mc)
            {
                  // Group 1 holds content outside the markers, group 2 content inside them
                  if(m.Groups[1].Value != "" )
                  {
                        ContentOutsideTheMarker.Add(m.Groups[1].Value.Trim());
                  }
                  if(m.Groups[2].Value != "" )
                  {
                        ContentInsideMarker.Add(m.Groups[2].Value.Trim());
                  }                        
            }
      }
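
As a possible alternative, a simpler pattern can match only the marker pairs and take the outside content from the gaps between matches. This is a sketch rather than a tested drop-in: the class name RegexMarkerSplitter and its Split and Add members are hypothetical, it assumes .NET 2.0 or later for List<string>, and it assumes the markers are never nested.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper for illustration only.
public static class RegexMarkerSplitter
{
    // Singleline lets '.' match newlines, so marker content may span several lines.
    // The lazy '.*?' stops at the first end marker after each start marker.
    static readonly Regex Pair = new Regex(
        "<!--wc_start-->(.*?)<!--wc_end-->", RegexOptions.Singleline);

    // Adds the text inside each marker pair to 'inside' and the text between pairs to 'outside'.
    public static void Split(string input, List<string> inside, List<string> outside)
    {
        int last = 0; // index just past the previous match
        foreach (Match m in Pair.Matches(input))
        {
            // Everything between the previous pair and this one is outside content.
            Add(outside, input.Substring(last, m.Index - last));
            // Group 1 is the text between the two markers.
            Add(inside, m.Groups[1].Value);
            last = m.Index + m.Length;
        }
        // Whatever follows the last pair is outside content as well.
        Add(outside, input.Substring(last));
    }

    // Trims surrounding whitespace and skips segments that end up empty.
    static void Add(List<string> list, string text)
    {
        string trimmed = text.Trim();
        if (trimmed.Length > 0) list.Add(trimmed);
    }
}
```

Because the pattern only has to recognise a marker pair, there is a single capture group to inspect, and the position of the markers relative to line breaks does not matter.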


Fernando
 
LVL 1

Author Comment

by:clickclickbang
ID: 17866110
Thank you all for your posts. I am running into a problem. First, I'm not in control of HOW the content I'm parsing is created and I've already been told that it will not always be on a new line. Second, the content is a web page with all the fun stuff that goes in a web page.  So far the examples not using Regex work as long as the comments are on new lines. The Regex works with the content below, except that it puts all of the content into the ContentOutsideTheMarker array.

Since the original post on this question showed an example using the comments on separate lines, I felt I needed to continue this question in a new post. I'm going to close out this post, and the continued question is:
http://www.experts-exchange.com/Programming/Programming_Languages/C_Sharp/Q_22047861.html

Thanks for all your help!

~ C

Question has a verified solution.