Solved

Help With Regex Pattern

Posted on 2006-10-31
7
408 Views
Last Modified: 2012-05-05
Experts, I need some help with a Regex pattern for a project. I am parsing an HTML page and throughout the page there is specific markers denoting dynamic content.

<html>
.....
.....
.....
<!--wc_start-->
<!--wc_end-->
.....
more Html, etc
.....
.....

I need a regex pattern that will match all of the content outside of the markers. I also need the content inside the markers, but I need to somehow seperate the content outside the markers from the content inside the markers. To summerize, if the following was my HTML document:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1" />
<title>Untitled Document</title>
</head>

<body>
<!--wc_start-->
The first piece of content
<!--wc_end-->
<div id="CoolContent">Hello There</div>
<!--wc_start-->
The second piece of content
<!--wc_end-->
</body>
</html>

Then my match results would equal:

ContentInsideMarker[0] = "The first piece of content"
ContentInsideMarker[1] = "The second piece of content"

ContentOutsideTheMarker[0] = "<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"><html xmlns="http://www.w3.org/1999/xhtml"><head><meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1" /><title>Untitled Document</title></head><body>";

ContentOutsideTheMarker[1] = "<div id="CoolContent">Hello There</div>";

ContentOutsideTheMarker[2] = "</body></html>";

I hope this makes sense. Let me know if you need more details. I don't know enough about regex to write the pattern myself. I would assume you would use a matchcollection, but again I'm not the expert :)

Thanks so much for the correct pattern.

~C
0
Comment
Question by:clickclickbang
[X]
Welcome to Experts Exchange

Add your voice to the tech community where 5M+ people just like you are talking about what matters.

  • Help others & share knowledge
  • Earn cash & points
  • Learn & ask questions
7 Comments
 
LVL 15

Expert Comment

by:ozymandias
ID: 17845901
Do you have to use regex for this ?
It's not necessarily the best tool ?
Is all your html xhtml ?
You could probably parse this better using xml and and an xsl transform.
0
 
LVL 1

Author Comment

by:clickclickbang
ID: 17846411
The content will not always be xhtml and may contain improper use of HTML. If there is another way other than Regex, I am open to an example. :)
0
 
LVL 15

Accepted Solution

by:
ozymandias earned 200 total points
ID: 17846580
OK. Not using any regex or xml and assuming that the <!--wc-start--> and <!--wc_end--> comments are always paired and always correctly placed and on a line by themselves then the following works fine :

using System;

namespace Scraps
{
      public enum ContentType
      {
            Static,
            Dynamic
      }
      
      class Class1
      {

        private bool inStatic = true;
            private string[] lines;

            private string html = @"
                  <!DOCTYPE html PUBLIC '-//W3C//DTD XHTML 1.0 Transitional//EN' 'http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd'>
                  <html xmlns='http://www.w3.org/1999/xhtml'>
                  <head>
                  <meta http-equiv='Content-Type' content='text/html; charset=iso-8859-1' />
                  <title>Untitled Document</title>
                  </head>
                  <body>
                  <!--wc_start-->
                  The first piece of content
                  <!--wc_end-->
                  <div id='CoolContent'>Hello There</div>
                  <!--wc_start-->
                  The second piece of content
                  <!--wc_end-->
                  </body>
                  </html>";

            public Class1(){
                  lines = html.Split('\n');
            }

            public void Output(ContentType type){
                  foreach (string line in lines){
                        if (line.Trim().StartsWith("<!--wc_start-->")){
                              this.inStatic = false;
                              continue;
                        }else if (line.Trim().StartsWith("<!--wc_end-->")){
                              this.inStatic = true;
                              continue;
                        }else if ((type == ContentType.Dynamic && !this.inStatic) || (type == ContentType.Static && this.inStatic)){
                              Console.WriteLine(line);
                        }
                  }
            }
            
            [STAThread]
            static void Main(string[] args)
            {
                  Class1 c1 = new Class1();
                  c1.Output(ContentType.Static);
                  Console.WriteLine("\n");
                  c1.Output(ContentType.Dynamic);
                  Console.ReadLine();
            }
      }
}
0
Independent Software Vendors: We Want Your Opinion

We value your feedback.

Take our survey and automatically be enter to win anyone of the following:
Yeti Cooler, Amazon eGift Card, and Movie eGift Card!

 
LVL 15

Expert Comment

by:ozymandias
ID: 17846601
Basically it just loops through the lines of html and when it encounters either of the respective comments it switches mode.
The rest of the time it just outputs the lines depending on what mode its in and which type of content it's been asked for.
0
 
LVL 7

Assisted Solution

by:mjmarlow
mjmarlow earned 100 total points
ID: 17848380
public class Extract
    {
        const string START_TAG = "<!--wc_start-->";
        const string END_TAG = "<!--wc_end-->";
        public System.Collections.Specialized.StringCollection  GetComments(string html)
        {
            System.Collections.Specialized.StringCollection col = new System.Collections.Specialized.StringCollection();
            int pStart = 0;
            int pEnd   = 0;
            while ((pStart = html.IndexOf(START_TAG, pEnd)) > -1 )
            {
                // Find closing tag
                pEnd = html.IndexOf(END_TAG, pStart);

                // collect the comment
                col.Add(html.Substring(pStart+START_TAG.Length,pEnd-pStart-START_TAG.Length));

            }
            return col;
        }
    }
0
 
LVL 63

Assisted Solution

by:Fernando Soto
Fernando Soto earned 200 total points
ID: 17849476
Hi clickclickbang ;

Try this out.

      private static string pattern = @"(?:(?<1>.*?)<!--wc_start-->(?<2>.*?)<!--wc_end-->)|(?:<!--wc_start-->(?<2>.*?)<!--wc_end-->(?<1>.*?))|(?<1>.*)";
      private Regex re = new Regex(pattern, RegexOptions.Singleline | RegexOptions.Compiled);
      private ArrayList ContentInsideMarker = new ArrayList();
      private ArrayList ContentOutsideTheMarker = new ArrayList();

      private void button1_Click(object sender, System.EventArgs e)
      {
            StreamReader sr = new StreamReader(@"C:\Temp\Parsedata.txt");
            string input = sr.ReadToEnd();
            sr.Close();
            MatchCollection mc = re.Matches(input);
            foreach( Match m in mc)
            {
                  if(m.Groups[1].Value != "" )
                  {
                        ContentOutsideTheMarker.Add(m.Groups[1].Value.Trim);
                  }
                  if(m.Groups[2].Value != "" )
                  {
                        ContentInsideMarker.Add(m.Groups[2].Value.Trim);
                  }                        
            }
      }


Fernando
0
 
LVL 1

Author Comment

by:clickclickbang
ID: 17866110
Thank you all for your posts. I am running into a problem. First, I'm not in control of HOW the content I'm parsing is created and I've already been told that it will not always be on a new line. Second, the content is a web page with all the fun stuff that goes in a web page.  So far the examples not using Regex work as long as the comments are on new lines. The Regex works with the content below, except that it puts all of the content into the ContentOutsideTheMarker array.

Since the original post on this question showed an example using the comments on seperate lines, I felt I needed to continue this question on a new post. I'm going to close out this post and the continued question is:
http://www.experts-exchange.com/Programming/Programming_Languages/C_Sharp/Q_22047861.html

Thanks for all your help!

~ C

0

Featured Post

PeopleSoft Has Never Been Easier

PeopleSoft Adoption Made Smooth & Simple!

On-The-Job Training Is made Intuitive & Easy With WalkMe's On-Screen Guidance Tool.  Claim Your Free WalkMe Account Now

Question has a verified solution.

If you are experiencing a similar issue, please ask a related question

Introduction Hi all and welcome to my first article on Experts Exchange. A while ago, someone asked me if i could do some tutorials on object oriented programming. I decided to do them on C#. Now you may ask me, why's that? Well, one of the re…
This article is for Object-Oriented Programming (OOP) beginners. An Interface contains declarations of events, indexers, methods and/or properties. Any class which implements the Interface should provide the concrete implementation for each Inter…
NetCrunch network monitor is a highly extensive platform for network monitoring and alert generation. In this video you'll see a live demo of NetCrunch with most notable features explained in a walk-through manner. You'll also get to know the philos…
Monitoring a network: why having a policy is the best policy? Michael Kulchisky, MCSE, MCSA, MCP, VTSP, VSP, CCSP outlines the enormous benefits of having a policy-based approach when monitoring medium and large networks. Software utilized in this v…

705 members asked questions and received personalized solutions in the past 7 days.

Join the community of 500,000 technology professionals and ask your questions.

Join & Ask a Question