Solved

Help With Regex Pattern

Posted on 2006-10-31
Last Modified: 2012-05-05
Experts, I need some help with a Regex pattern for a project. I am parsing an HTML page, and throughout the page there are specific markers denoting dynamic content.

<html>
.....
.....
.....
<!--wc_start-->
<!--wc_end-->
.....
more Html, etc
.....
.....

I need a regex pattern that will match all of the content outside of the markers. I also need the content inside the markers, but I need to somehow separate the content outside the markers from the content inside them. To summarize, if the following was my HTML document:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1" />
<title>Untitled Document</title>
</head>

<body>
<!--wc_start-->
The first piece of content
<!--wc_end-->
<div id="CoolContent">Hello There</div>
<!--wc_start-->
The second piece of content
<!--wc_end-->
</body>
</html>

Then my match results would equal:

ContentInsideMarker[0] = "The first piece of content"
ContentInsideMarker[1] = "The second piece of content"

ContentOutsideTheMarker[0] = "<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"><html xmlns="http://www.w3.org/1999/xhtml"><head><meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1" /><title>Untitled Document</title></head><body>";

ContentOutsideTheMarker[1] = "<div id="CoolContent">Hello There</div>";

ContentOutsideTheMarker[2] = "</body></html>";

I hope this makes sense. Let me know if you need more details. I don't know enough about regex to write the pattern myself. I would assume you would use a MatchCollection, but again I'm not the expert :)

Thanks so much for the correct pattern.

~C
Question by:clickclickbang
7 Comments
 
LVL 15

Expert Comment

by:ozymandias
ID: 17845901
Do you have to use regex for this?
It's not necessarily the best tool.
Is all your HTML XHTML?
You could probably parse this better using XML and an XSL transform.
 
LVL 1

Author Comment

by:clickclickbang
ID: 17846411
The content will not always be xhtml and may contain improper use of HTML. If there is another way other than Regex, I am open to an example. :)
 
LVL 15

Accepted Solution

by:
ozymandias earned 200 total points
ID: 17846580
OK. Not using any regex or XML, and assuming that the <!--wc_start--> and <!--wc_end--> comments are always paired, always correctly placed, and each on a line by themselves, the following works fine:

using System;

namespace Scraps
{
      public enum ContentType
      {
            Static,
            Dynamic
      }
      
      class Class1
      {

            private bool inStatic = true;
            private string[] lines;

            private string html = @"
                  <!DOCTYPE html PUBLIC '-//W3C//DTD XHTML 1.0 Transitional//EN' 'http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd'>
                  <html xmlns='http://www.w3.org/1999/xhtml'>
                  <head>
                  <meta http-equiv='Content-Type' content='text/html; charset=iso-8859-1' />
                  <title>Untitled Document</title>
                  </head>
                  <body>
                  <!--wc_start-->
                  The first piece of content
                  <!--wc_end-->
                  <div id='CoolContent'>Hello There</div>
                  <!--wc_start-->
                  The second piece of content
                  <!--wc_end-->
                  </body>
                  </html>";

            public Class1(){
                  lines = html.Split('\n');
            }

            public void Output(ContentType type){
                  foreach (string line in lines){
                        if (line.Trim().StartsWith("<!--wc_start-->")){
                              this.inStatic = false;
                              continue;
                        }else if (line.Trim().StartsWith("<!--wc_end-->")){
                              this.inStatic = true;
                              continue;
                        }else if ((type == ContentType.Dynamic && !this.inStatic) || (type == ContentType.Static && this.inStatic)){
                              Console.WriteLine(line);
                        }
                  }
            }
            
            [STAThread]
            static void Main(string[] args)
            {
                  Class1 c1 = new Class1();
                  c1.Output(ContentType.Static);
                  Console.WriteLine("\n");
                  c1.Output(ContentType.Dynamic);
                  Console.ReadLine();
            }
      }
}
LVL 15

Expert Comment

by:ozymandias
ID: 17846601
Basically it just loops through the lines of HTML, and when it encounters either of the two marker comments it switches mode.
The rest of the time it outputs each line depending on which mode it's in and which type of content it's been asked for.
 
LVL 7

Assisted Solution

by:mjmarlow
mjmarlow earned 100 total points
ID: 17848380
public class Extract
    {
        const string START_TAG = "<!--wc_start-->";
        const string END_TAG = "<!--wc_end-->";

        // Returns the text found between each START_TAG/END_TAG pair.
        public System.Collections.Specialized.StringCollection  GetComments(string html)
        {
            System.Collections.Specialized.StringCollection col = new System.Collections.Specialized.StringCollection();
            int pStart = 0;
            int pEnd   = 0;
            while ((pStart = html.IndexOf(START_TAG, pEnd)) > -1 )
            {
                // Find closing tag
                pEnd = html.IndexOf(END_TAG, pStart);

                // Stop on an unpaired start tag; otherwise Substring would throw
                if (pEnd < 0) break;

                // collect the comment
                col.Add(html.Substring(pStart+START_TAG.Length,pEnd-pStart-START_TAG.Length));

            }
            return col;
        }
    }
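
The IndexOf approach above collects only the text between the markers. As a rough sketch along the same lines, the scan can also gather the content outside the markers, which the original question asked for. The class name MarkerSplitter, the method Split, and the helper AddIfNotBlank are hypothetical names chosen for illustration, and the sketch assumes the markers are correctly paired and never nested:

```csharp
using System;
using System.Collections.Generic;

public static class MarkerSplitter
{
    const string StartTag = "<!--wc_start-->";
    const string EndTag = "<!--wc_end-->";

    // Walks the string once, adding the text between each marker pair to
    // 'inside' and all other non-blank text to 'outside'.
    public static void Split(string html, List<string> inside, List<string> outside)
    {
        int pos = 0;
        while (pos < html.Length)
        {
            int start = html.IndexOf(StartTag, pos, StringComparison.Ordinal);
            if (start < 0) break;

            int contentStart = start + StartTag.Length;
            int end = html.IndexOf(EndTag, contentStart, StringComparison.Ordinal);
            // Unpaired start marker: stop here, so the remainder
            // (marker included) falls through to 'outside' below.
            if (end < 0) break;

            AddIfNotBlank(outside, html.Substring(pos, start - pos));
            AddIfNotBlank(inside, html.Substring(contentStart, end - contentStart));
            pos = end + EndTag.Length;
        }

        // Whatever follows the last complete marker pair is outside content.
        if (pos < html.Length) AddIfNotBlank(outside, html.Substring(pos));
    }

    // Trims the text and skips it if nothing but whitespace remains.
    static void AddIfNotBlank(List<string> list, string text)
    {
        string trimmed = text.Trim();
        if (trimmed.Length > 0) list.Add(trimmed);
    }
}
```

Because it works on character positions rather than lines, this version does not care whether the markers sit on their own lines.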
 
LVL 62

Assisted Solution

by:Fernando Soto
Fernando Soto earned 200 total points
ID: 17849476
Hi clickclickbang ;

Try this out.

      // Requires: using System.Collections; using System.IO; using System.Text.RegularExpressions;
      private static string pattern = @"(?:(?<1>.*?)<!--wc_start-->(?<2>.*?)<!--wc_end-->)|(?:<!--wc_start-->(?<2>.*?)<!--wc_end-->(?<1>.*?))|(?<1>.*)";
      private Regex re = new Regex(pattern, RegexOptions.Singleline | RegexOptions.Compiled);
      private ArrayList ContentInsideMarker = new ArrayList();
      private ArrayList ContentOutsideTheMarker = new ArrayList();

      private void button1_Click(object sender, System.EventArgs e)
      {
            StreamReader sr = new StreamReader(@"C:\Temp\Parsedata.txt");
            string input = sr.ReadToEnd();
            sr.Close();
            MatchCollection mc = re.Matches(input);
            foreach( Match m in mc)
            {
                  // Group 1 holds content outside the markers, group 2 content inside them
                  if(m.Groups[1].Value != "" )
                  {
                        ContentOutsideTheMarker.Add(m.Groups[1].Value.Trim());
                  }
                  if(m.Groups[2].Value != "" )
                  {
                        ContentInsideMarker.Add(m.Groups[2].Value.Trim());
                  }                        
            }
      }


Fernando
 
LVL 1

Author Comment

by:clickclickbang
ID: 17866110
Thank you all for your posts. I am running into a problem. First, I'm not in control of how the content I'm parsing is created, and I've already been told that the markers will not always be on a new line. Second, the content is a web page with all the fun stuff that goes in a web page. So far the examples not using Regex work as long as the comments are on new lines. The Regex works with the content below, except that it puts all of the content into the ContentOutsideTheMarker array.

Since the original post on this question showed an example using the comments on separate lines, I felt I needed to continue this question in a new post. I'm going to close out this post, and the continued question is:
http://www.experts-exchange.com/Programming/Programming_Languages/C_Sharp/Q_22047861.html

Thanks for all your help!

~ C
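
For readers landing on this thread, here is a minimal sketch of one way to handle markers that are not on their own lines. It is not taken from the follow-up question. It relies on Regex.Split including the text of a capturing group in its result, so the returned array alternates between outside content (even indexes) and inside content (odd indexes). The class name MarkerRegex and the method name Split are hypothetical, and the sketch assumes the markers are correctly paired and not nested:

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class MarkerRegex
{
    // Singleline lets '.' match newlines, so marked content may span lines.
    static readonly Regex Marked = new Regex(
        "<!--wc_start-->(.*?)<!--wc_end-->", RegexOptions.Singleline);

    public static void Split(string html, List<string> inside, List<string> outside)
    {
        // With one capturing group, the result alternates:
        // outside, inside, outside, inside, ..., outside.
        string[] parts = Marked.Split(html);
        for (int i = 0; i < parts.Length; i++)
        {
            string text = parts[i].Trim();
            if (text.Length == 0) continue;   // skip blank segments

            if (i % 2 == 1) inside.Add(text);
            else outside.Add(text);
        }
    }
}
```

Calling MarkerRegex.Split(input, insideList, outsideList) on a page fills the two lists in document order. Blank segments are dropped, so an empty marker pair contributes nothing to either list.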
