Solved

Help With Regex Pattern

Posted on 2006-10-31
Last Modified: 2012-05-05
Experts, I need some help with a Regex pattern for a project. I am parsing an HTML page, and throughout the page there are specific markers denoting dynamic content.

<html>
.....
.....
.....
<!--wc_start-->
<!--wc_end-->
.....
more Html, etc
.....
.....

I need a regex pattern that will match all of the content outside of the markers. I also need the content inside the markers, but I need to somehow separate the content outside the markers from the content inside the markers. To summarize, if the following was my HTML document:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1" />
<title>Untitled Document</title>
</head>

<body>
<!--wc_start-->
The first piece of content
<!--wc_end-->
<div id="CoolContent">Hello There</div>
<!--wc_start-->
The second piece of content
<!--wc_end-->
</body>
</html>

Then my match results would equal:

ContentInsideMarker[0] = "The first piece of content"
ContentInsideMarker[1] = "The second piece of content"

ContentOutsideTheMarker[0] = "<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"><html xmlns="http://www.w3.org/1999/xhtml"><head><meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1" /><title>Untitled Document</title></head><body>";

ContentOutsideTheMarker[1] = "<div id="CoolContent">Hello There</div>";

ContentOutsideTheMarker[2] = "</body></html>";

I hope this makes sense. Let me know if you need more details. I don't know enough about regex to write the pattern myself. I would assume you would use a MatchCollection, but again I'm not the expert :)

Thanks so much for the correct pattern.

~C
Question by:clickclickbang
7 Comments
 
LVL 15

Expert Comment

by:ozymandias
ID: 17845901
Do you have to use regex for this?
It's not necessarily the best tool.
Is all your HTML XHTML?
You could probably parse this better using XML and an XSL transform.
 
LVL 1

Author Comment

by:clickclickbang
ID: 17846411
The content will not always be xhtml and may contain improper use of HTML. If there is another way other than Regex, I am open to an example. :)
 
LVL 15

Accepted Solution

by:
ozymandias earned 800 total points
ID: 17846580
OK. Without using any regex or XML, and assuming that the <!--wc_start--> and <!--wc_end--> comments are always paired, always correctly placed, and on a line by themselves, the following works fine:

using System;

namespace Scraps
{
    public enum ContentType
    {
        Static,
        Dynamic
    }

    class Class1
    {
        // True while the current line is outside a wc_start/wc_end pair
        private bool inStatic = true;
        private string[] lines;

        private string html = @"
            <!DOCTYPE html PUBLIC '-//W3C//DTD XHTML 1.0 Transitional//EN' 'http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd'>
            <html xmlns='http://www.w3.org/1999/xhtml'>
            <head>
            <meta http-equiv='Content-Type' content='text/html; charset=iso-8859-1' />
            <title>Untitled Document</title>
            </head>
            <body>
            <!--wc_start-->
            The first piece of content
            <!--wc_end-->
            <div id='CoolContent'>Hello There</div>
            <!--wc_start-->
            The second piece of content
            <!--wc_end-->
            </body>
            </html>";

        public Class1(){
            lines = html.Split('\n');
        }

        // Prints only the lines of the requested type; the marker lines themselves are skipped
        public void Output(ContentType type){
            this.inStatic = true; // each pass starts outside the markers
            foreach (string line in lines){
                string trimmed = line.Trim();
                if (trimmed.StartsWith("<!--wc_start-->")){
                    this.inStatic = false;
                    continue;
                }else if (trimmed.StartsWith("<!--wc_end-->")){
                    this.inStatic = true;
                    continue;
                }else if ((type == ContentType.Dynamic && !this.inStatic) || (type == ContentType.Static && this.inStatic)){
                    Console.WriteLine(line);
                }
            }
        }

        [STAThread]
        static void Main(string[] args)
        {
            Class1 c1 = new Class1();
            c1.Output(ContentType.Static);
            Console.WriteLine("\n");
            c1.Output(ContentType.Dynamic);
            Console.ReadLine();
        }
    }
}
 
LVL 15

Expert Comment

by:ozymandias
ID: 17846601
Basically it just loops through the lines of HTML, and when it encounters either of the marker comments it switches mode.
The rest of the time it just outputs the lines depending on what mode it's in and which type of content it's been asked for.
 
LVL 7

Assisted Solution

by:mjmarlow
mjmarlow earned 400 total points
ID: 17848380
public class Extract
{
    const string START_TAG = "<!--wc_start-->";
    const string END_TAG = "<!--wc_end-->";

    // Returns the text between each START_TAG/END_TAG pair, in document order
    public System.Collections.Specialized.StringCollection GetComments(string html)
    {
        System.Collections.Specialized.StringCollection col = new System.Collections.Specialized.StringCollection();
        int pStart = 0;
        int pEnd   = 0;
        while ((pStart = html.IndexOf(START_TAG, pEnd)) > -1)
        {
            // Find closing tag
            pEnd = html.IndexOf(END_TAG, pStart);

            // Stop on an unpaired start tag; Substring would otherwise throw
            if (pEnd < 0)
            {
                break;
            }

            // Collect the content between the tags
            col.Add(html.Substring(pStart + START_TAG.Length, pEnd - pStart - START_TAG.Length));
        }
        return col;
    }
}
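
The same IndexOf scan can be extended to keep the content outside the markers as well. This is only a minimal sketch, assuming the markers are always paired and never nested; the class name MarkerSplitter and its Split method are made up for illustration and are not part of the code above. Because it works on character positions rather than lines, it does not care whether the markers sit on their own lines.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper: name and signature are illustrative only.
public static class MarkerSplitter
{
    const string StartTag = "<!--wc_start-->";
    const string EndTag = "<!--wc_end-->";

    // Appends the trimmed segments outside and inside each marker pair
    // to the two lists, in document order.
    public static void Split(string html, List<string> outside, List<string> inside)
    {
        int pos = 0; // first character not yet assigned to a segment
        while (true)
        {
            int start = html.IndexOf(StartTag, pos, StringComparison.Ordinal);
            if (start < 0) break;

            int contentStart = start + StartTag.Length;
            int end = html.IndexOf(EndTag, contentStart, StringComparison.Ordinal);
            if (end < 0) break; // unpaired start marker: the rest is kept as outside content

            outside.Add(html.Substring(pos, start - pos).Trim());
            inside.Add(html.Substring(contentStart, end - contentStart).Trim());
            pos = end + EndTag.Length;
        }

        // Whatever follows the last complete pair is outside content
        outside.Add(html.Substring(pos).Trim());
    }
}
```

Call it with two empty lists, e.g. MarkerSplitter.Split(html, outsideList, insideList). A document that starts or ends with a marker pair produces an empty string as the first or last outside entry, so filter those out if they are not wanted.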
 
LVL 63

Assisted Solution

by:Fernando Soto
Fernando Soto earned 800 total points
ID: 17849476
Hi clickclickbang ;

Try this out.

    // Requires: using System.Collections; using System.IO; using System.Text.RegularExpressions;
    private static string pattern = @"(?:(?<1>.*?)<!--wc_start-->(?<2>.*?)<!--wc_end-->)|(?:<!--wc_start-->(?<2>.*?)<!--wc_end-->(?<1>.*?))|(?<1>.*)";
    private Regex re = new Regex(pattern, RegexOptions.Singleline | RegexOptions.Compiled);
    private ArrayList ContentInsideMarker = new ArrayList();
    private ArrayList ContentOutsideTheMarker = new ArrayList();

    private void button1_Click(object sender, System.EventArgs e)
    {
        // Read the whole file so the pattern can match across line breaks
        string input;
        using (StreamReader sr = new StreamReader(@"C:\Temp\Parsedata.txt"))
        {
            input = sr.ReadToEnd();
        }

        MatchCollection mc = re.Matches(input);
        foreach (Match m in mc)
        {
            // Group 1 holds content outside the markers, group 2 content inside them
            if (m.Groups[1].Value != "")
            {
                ContentOutsideTheMarker.Add(m.Groups[1].Value.Trim());
            }
            if (m.Groups[2].Value != "")
            {
                ContentInsideMarker.Add(m.Groups[2].Value.Trim());
            }
        }
    }


Fernando
 
LVL 1

Author Comment

by:clickclickbang
ID: 17866110
Thank you all for your posts. I am running into a problem. First, I'm not in control of HOW the content I'm parsing is created, and I've already been told that the markers will not always be on a new line. Second, the content is a web page with all the fun stuff that goes in a web page. So far the examples not using Regex work as long as the comments are on new lines. The Regex works with the content below, except that it puts all of the content into the ContentOutsideTheMarker array.

Since the original post on this question showed an example using the comments on separate lines, I felt I needed to continue this question in a new post. I'm going to close out this post, and the continued question is:
http://www.experts-exchange.com/Programming/Programming_Languages/C_Sharp/Q_22047861.html

Thanks for all your help!

~ C
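
For markers that are not on their own lines, one option that avoids line-by-line processing is Regex.Split with a capture group: in .NET, text captured by a group in the split pattern is included in the returned array, so the result alternates between outside and inside content. The following is a minimal sketch, assuming paired, non-nested markers; MarkerRegex and its Split method are hypothetical names, and this has not been checked against the real pages from this thread.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper: name and signature are illustrative only.
public static class MarkerRegex
{
    // Singleline lets '.' match newlines, so a marker pair may span lines.
    // The capture group makes Regex.Split keep the text between the markers.
    static readonly Regex Marker = new Regex(
        @"<!--wc_start-->(.*?)<!--wc_end-->", RegexOptions.Singleline);

    public static void Split(string html, List<string> outside, List<string> inside)
    {
        string[] parts = Marker.Split(html);
        for (int i = 0; i < parts.Length; i++)
        {
            if (i % 2 == 0)
            {
                // Even positions: text between marker pairs
                outside.Add(parts[i].Trim());
            }
            else
            {
                // Odd positions: the captured text inside a marker pair
                inside.Add(parts[i].Trim());
            }
        }
    }
}
```

The pattern matches the marker text exactly, so markers written with different casing or extra spaces would not be found and everything would land in the outside list.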
