Solved

Help With Regex Pattern

Posted on 2006-10-31
7
402 Views
Last Modified: 2012-05-05
Experts, I need some help with a Regex pattern for a project. I am parsing an HTML page and throughout the page there is specific markers denoting dynamic content.

<html>
.....
.....
.....
<!--wc_start-->
<!--wc_end-->
.....
more Html, etc
.....
.....

I need a regex pattern that will match all of the content outside of the markers. I also need the content inside the markers, but I need to somehow seperate the content outside the markers from the content inside the markers. To summerize, if the following was my HTML document:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1" />
<title>Untitled Document</title>
</head>

<body>
<!--wc_start-->
The first piece of content
<!--wc_end-->
<div id="CoolContent">Hello There</div>
<!--wc_start-->
The second piece of content
<!--wc_end-->
</body>
</html>

Then my match results would equal:

ContentInsideMarker[0] = "The first piece of content"
ContentInsideMarker[1] = "The second piece of content"

ContentOutsideTheMarker[0] = "<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"><html xmlns="http://www.w3.org/1999/xhtml"><head><meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1" /><title>Untitled Document</title></head><body>";

ContentOutsideTheMarker[1] = "<div id="CoolContent">Hello There</div>";

ContentOutsideTheMarker[2] = "</body></html>";

I hope this makes sense. Let me know if you need more details. I don't know enough about regex to write the pattern myself. I would assume you would use a matchcollection, but again I'm not the expert :)

Thanks so much for the correct pattern.

~C
0
Comment
Question by:clickclickbang
7 Comments
 
LVL 15

Expert Comment

by:ozymandias
Comment Utility
Do you have to use regex for this ?
It's not necessarily the best tool ?
Is all your html xhtml ?
You could probably parse this better using xml and and an xsl transform.
0
 
LVL 1

Author Comment

by:clickclickbang
Comment Utility
The content will not always be xhtml and may contain improper use of HTML. If there is another way other than Regex, I am open to an example. :)
0
 
LVL 15

Accepted Solution

by:
ozymandias earned 200 total points
Comment Utility
OK. Not using any regex or xml and assuming that the <!--wc-start--> and <!--wc_end--> comments are always paired and always correctly placed and on a line by themselves then the following works fine :

using System;

namespace Scraps
{
      public enum ContentType
      {
            Static,
            Dynamic
      }
      
      class Class1
      {

        private bool inStatic = true;
            private string[] lines;

            private string html = @"
                  <!DOCTYPE html PUBLIC '-//W3C//DTD XHTML 1.0 Transitional//EN' 'http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd'>
                  <html xmlns='http://www.w3.org/1999/xhtml'>
                  <head>
                  <meta http-equiv='Content-Type' content='text/html; charset=iso-8859-1' />
                  <title>Untitled Document</title>
                  </head>
                  <body>
                  <!--wc_start-->
                  The first piece of content
                  <!--wc_end-->
                  <div id='CoolContent'>Hello There</div>
                  <!--wc_start-->
                  The second piece of content
                  <!--wc_end-->
                  </body>
                  </html>";

            public Class1(){
                  lines = html.Split('\n');
            }

            public void Output(ContentType type){
                  foreach (string line in lines){
                        if (line.Trim().StartsWith("<!--wc_start-->")){
                              this.inStatic = false;
                              continue;
                        }else if (line.Trim().StartsWith("<!--wc_end-->")){
                              this.inStatic = true;
                              continue;
                        }else if ((type == ContentType.Dynamic && !this.inStatic) || (type == ContentType.Static && this.inStatic)){
                              Console.WriteLine(line);
                        }
                  }
            }
            
            [STAThread]
            static void Main(string[] args)
            {
                  Class1 c1 = new Class1();
                  c1.Output(ContentType.Static);
                  Console.WriteLine("\n");
                  c1.Output(ContentType.Dynamic);
                  Console.ReadLine();
            }
      }
}
0
How to run any project with ease

Manage projects of all sizes how you want. Great for personal to-do lists, project milestones, team priorities and launch plans.
- Combine task lists, docs, spreadsheets, and chat in one
- View and edit from mobile/offline
- Cut down on emails

 
LVL 15

Expert Comment

by:ozymandias
Comment Utility
Basically it just loops through the lines of html and when it encounters either of the respective comments it switches mode.
The rest of the time it just outputs the lines depending on what mode its in and which type of content it's been asked for.
0
 
LVL 7

Assisted Solution

by:mjmarlow
mjmarlow earned 100 total points
Comment Utility
public class Extract
    {
        const string START_TAG = "<!--wc_start-->";
        const string END_TAG = "<!--wc_end-->";
        public System.Collections.Specialized.StringCollection  GetComments(string html)
        {
            System.Collections.Specialized.StringCollection col = new System.Collections.Specialized.StringCollection();
            int pStart = 0;
            int pEnd   = 0;
            while ((pStart = html.IndexOf(START_TAG, pEnd)) > -1 )
            {
                // Find closing tag
                pEnd = html.IndexOf(END_TAG, pStart);

                // collect the comment
                col.Add(html.Substring(pStart+START_TAG.Length,pEnd-pStart-START_TAG.Length));

            }
            return col;
        }
    }
0
 
LVL 62

Assisted Solution

by:Fernando Soto
Fernando Soto earned 200 total points
Comment Utility
Hi clickclickbang ;

Try this out.

      private static string pattern = @"(?:(?<1>.*?)<!--wc_start-->(?<2>.*?)<!--wc_end-->)|(?:<!--wc_start-->(?<2>.*?)<!--wc_end-->(?<1>.*?))|(?<1>.*)";
      private Regex re = new Regex(pattern, RegexOptions.Singleline | RegexOptions.Compiled);
      private ArrayList ContentInsideMarker = new ArrayList();
      private ArrayList ContentOutsideTheMarker = new ArrayList();

      private void button1_Click(object sender, System.EventArgs e)
      {
            StreamReader sr = new StreamReader(@"C:\Temp\Parsedata.txt");
            string input = sr.ReadToEnd();
            sr.Close();
            MatchCollection mc = re.Matches(input);
            foreach( Match m in mc)
            {
                  if(m.Groups[1].Value != "" )
                  {
                        ContentOutsideTheMarker.Add(m.Groups[1].Value.Trim);
                  }
                  if(m.Groups[2].Value != "" )
                  {
                        ContentInsideMarker.Add(m.Groups[2].Value.Trim);
                  }                        
            }
      }


Fernando
0
 
LVL 1

Author Comment

by:clickclickbang
Comment Utility
Thank you all for your posts. I am running into a problem. First, I'm not in control of HOW the content I'm parsing is created and I've already been told that it will not always be on a new line. Second, the content is a web page with all the fun stuff that goes in a web page.  So far the examples not using Regex work as long as the comments are on new lines. The Regex works with the content below, except that it puts all of the content into the ContentOutsideTheMarker array.

Since the original post on this question showed an example using the comments on seperate lines, I felt I needed to continue this question on a new post. I'm going to close out this post and the continued question is:
http://www.experts-exchange.com/Programming/Programming_Languages/C_Sharp/Q_22047861.html

Thanks for all your help!

~ C

0

Featured Post

How to run any project with ease

Manage projects of all sizes how you want. Great for personal to-do lists, project milestones, team priorities and launch plans.
- Combine task lists, docs, spreadsheets, and chat in one
- View and edit from mobile/offline
- Cut down on emails

Join & Write a Comment

Suggested Solutions

Bit flags and bit flag manipulation is perhaps one of the most underrated strategies in programming, likely because most programmers developing in high-level languages rely too much on the high-level features, and forget about the low-level ones. Th…
It was really hard time for me to get the understanding of Delegates in C#. I went through many websites and articles but I found them very clumsy. After going through those sites, I noted down the points in a easy way so here I am sharing that unde…
Illustrator's Shape Builder tool will let you combine shapes visually and interactively. This video shows the Mac version, but the tool works the same way in Windows. To follow along with this video, you can draw your own shapes or download the file…
In this tutorial you'll learn about bandwidth monitoring with flows and packet sniffing with our network monitoring solution PRTG Network Monitor (https://www.paessler.com/prtg). If you're interested in additional methods for monitoring bandwidt…

771 members asked questions and received personalized solutions in the past 7 days.

Join the community of 500,000 technology professionals and ask your questions.

Join & Ask a Question

Need Help in Real-Time?

Connect with top rated Experts

12 Experts available now in Live!

Get 1:1 Help Now