Solved

regex html div

Posted on 2006-10-23
Last Modified: 2009-12-16
I have read an HTML file into one long string.

I need a regex that gets me everything between <div id="change"> and its matching closing </div>.

Other DIVs may also be nested inside this one.
Question by:cophi
3 Comments
 
LVL 6

Expert Comment

by:VovinE
ID: 17790188
This is not possible with a single plain regular expression, because the pattern would have to keep count of how deeply the div tags are nested.
If there were no closing divs inside your div, the regular expression would look something like this:

"<div id=\"change\">[\\s\\S]*?</div>"

If you need to cope with nested closing divs as well, then a lone regular expression is not the way to go :)
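
To illustrate the point above, here is a minimal sketch of the non-nested case. It assumes a variant of the pattern that uses `[\s\S]` to accept any character (including newlines); the sample strings and the class name are made up for the demonstration. The second input shows why the lazy match breaks as soon as a div is nested inside the target.

```csharp
using System;
using System.Text.RegularExpressions;

class NonNestedDivDemo
{
    static void Main()
    {
        // Lazy match: stops at the FIRST closing </div> after the opening tag.
        Regex simple = new Regex("<div id=\"change\">([\\s\\S]*?)</div>", RegexOptions.IgnoreCase);

        // Hypothetical input without nested divs: the whole content is captured.
        string flat = "<p>intro</p><div id=\"change\">Hello <b>world</b></div><p>outro</p>";
        Console.WriteLine(simple.Match(flat).Groups[1].Value);   // prints: Hello <b>world</b>

        // Hypothetical input with a nested div: the match ends too early.
        string nested = "<div id=\"change\">a<div>b</div>c</div>";
        Console.WriteLine(simple.Match(nested).Groups[1].Value); // prints: a<div>b
    }
}
```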
 
LVL 22

Expert Comment

by:_TAD_
ID: 17790435

HTML, if properly coded (that is, well-formed XHTML), is really just XML.

Try reading your HTML file as if it were XML data and then use XPath navigation to find the div tag with the proper attribute.
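
A minimal sketch of that idea, using `XmlDocument` and `SelectSingleNode` from `System.Xml`. It only works under the assumption that the input is well-formed XML: ordinary HTML with unclosed tags or unquoted attributes makes `LoadXml` throw an `XmlException`, and a document that declares the XHTML namespace would additionally need an `XmlNamespaceManager`. The sample markup and class name are invented for the example.

```csharp
using System;
using System.Xml;

class XPathDivDemo
{
    static void Main()
    {
        // Hypothetical well-formed input with a div nested inside the target div.
        string xhtml = "<html><body><div id=\"change\">outer <div>inner</div> text</div></body></html>";

        XmlDocument doc = new XmlDocument();
        doc.LoadXml(xhtml); // throws XmlException if the markup is not well-formed

        // Find the first div, anywhere in the document, whose id attribute is "change".
        XmlNode div = doc.SelectSingleNode("//div[@id='change']");

        // InnerXml is the markup between the opening and closing tags, nested divs included.
        Console.WriteLine(div != null ? div.InnerXml : "(not found)");
    }
}
```

Because the XML parser tracks the element tree itself, nesting needs no special handling here.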
 
LVL 6

Accepted Solution

by:
VovinE earned 500 total points
ID: 17826740
You can, however, use a regular expression as the tokenizer for a small parser that does this for you.

Here is a sample parser which does what you want:

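    using System.Collections.Generic;
    using System.Text.RegularExpressions;

    public class DivExtractor
    {
        // Group 1: an opening <div ...> tag (group 2 holds its attributes).
        // Group 3: a closing </div> tag.
        private readonly Regex regex;

        private List<int> extractions; // alternating start/end offsets of extracted content
        private int level;             // nesting depth inside the target div (0 = outside)

        public DivExtractor()
        {
            regex = new Regex("(<div\\b([^>]*)>)|(</div\\s*>)", RegexOptions.IgnoreCase);
        }

        // Returns the content of the first <div id="change">, or "" if there is none.
        public string GetDiv(string content)
        {
            string[] r = GetDivs(content);
            return r.Length > 0 ? r[0] : "";
        }

        // Returns the content of every top-level <div id="change"> in the input.
        public string[] GetDivs(string content)
        {
            extractions = new List<int>();
            level = 0;

            // Visit every opening and closing div tag in document order.
            foreach (Match m in regex.Matches(content))
                DivMatched(m);

            // Offsets come in (start, end) pairs; an unclosed div leaves a
            // trailing start offset, which is ignored.
            List<string> results = new List<string>();
            int i = 0;
            while (i < extractions.Count - 1)
            {
                int start = extractions[i++];
                int len = extractions[i++] - start;
                results.Add(content.Substring(start, len));
            }
            return results.ToArray();
        }

        private void DivMatched(Match m)
        {
            if (m.Groups[1].Success)
            {
                if (level > 0)
                    level++; // nested div inside the target div
                else if (m.Groups[2].Value.Contains("id=\"change\""))
                {
                    extractions.Add(m.Index + m.Length); // store starting extraction position
                    level++;
                }
            }
            else if (m.Groups[3].Success && level > 0)
            {
                level--;
                if (level == 0)
                {
                    extractions.Add(m.Index); // store closing index position
                }
            }
        }
    }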


Usage is very simple:

new DivExtractor().GetDiv(html)  // returns the content of the first <div id="change"> (without the div tags), or "" if none is found
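
As an aside, the same depth counting can be expressed inside one pattern by using balancing groups, an extension specific to the .NET regex engine and not part of classic regular expressions. The following is only a sketch of that alternative technique, not the parser above: the group names `depth` and `content`, the sample input, and the class name are made up, and the opening tag is assumed to carry nothing but the `id="change"` attribute.

```csharp
using System;
using System.Text.RegularExpressions;

class BalancedDivDemo
{
    static void Main()
    {
        // Hypothetical input: the target div contains one nested div.
        string html = "<div id=\"change\">a<div>b</div>c</div><div>other</div>";

        Regex balanced = new Regex(@"
            <div\s+id=""change""\s*>          # opening tag of the target div
            (?<content>
              (?>
                  <div\b[^>]*> (?<depth>)     # nested opening tag: push onto the depth stack
                | </div\s*>    (?<-depth>)    # nested closing tag: pop (fails if the stack is empty)
                | (?!</?div\b) [\s\S]         # any character that does not start a div tag
              )*
              (?(depth)(?!))                  # fail if a nested div is still open
            )
            </div\s*>                         # closing tag that matches the target div
            ", RegexOptions.IgnoreCase | RegexOptions.IgnorePatternWhitespace);

        Match m = balanced.Match(html);
        Console.WriteLine(m.Success ? m.Groups["content"].Value : "(no match)"); // prints: a<div>b</div>c
    }
}
```

The explicit parser is easier to debug and extend, while the balancing-group pattern keeps everything in a single expression at the cost of readability.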
