Complex regex

Posted on 2005-04-26
Last Modified: 2010-04-16
I have quite a complex problem.

Our site content is generated dynamically from a database. We also have a glossary page listing terms used on the site. I would like to parse the dynamic content, check whether any glossary terms appear in it, and if so link each one to its glossary entry.

e.g.
Database content: "This is some text on our site"
Glossary term: "our site"
The result would be "This is some text on <a href="/glossary.aspx?term=our site">our site</a>"

Now I can do this quite easily by looping through the glossary terms (DataTable) and then simply checking for a match. If there is a match I simply replace the original text with that of the link.

THE PROBLEM with this method is that it only works for plain text, not for HTML, which is what I need. For example, if the term is already inside a link, this would create a link within a link. There are also certain tags whose contents I would not want linked at all, for example if the term was already contained within an <h1>.

Therefore I need a way of doing this that excludes terms already contained within tags that I specify (e.g. <a>, <h1> etc.).
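
To make the problem concrete, here is a minimal sketch of the plain-text approach described above. The class and method names are hypothetical, and the Uri.EscapeDataString call is an addition for illustration (the example in the question puts the raw term in the URL).

```csharp
using System;

public static class NaiveGlossaryLinker
{
    // Wraps every occurrence of `term` in a glossary link.
    // It has no awareness of HTML, so it also rewrites text inside existing tags.
    public static string Link(string content, string term)
    {
        string link = "<a href=\"/glossary.aspx?term=" + Uri.EscapeDataString(term)
                    + "\">" + term + "</a>";
        return content.Replace(term, link);
    }
}
```

Calling Link("This is some text on our site", "our site") gives the desired link, but calling it on "<a href=\"/\">our site</a>" wraps the term a second time and produces a link nested inside a link.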

This is where I get lost......

I will give many many points for a solution...

Many thanks
Question by:jonnyboy69
19 Comments
 
LVL 96

Expert Comment

by:Bob Learned
ID: 13866703
With complex expressions, I like to break the problem down into smaller parts:

(1) Handle anchor
(2) Handle header tags <h1>, <h2>...

Have you seen this before?  
Regular Expression Library:
http://www.regexlib.com/

Anchor:
<a\s+href=(.*?)[\s>]

Bob
 

Author Comment

by:jonnyboy69
ID: 13866900
Thanks. I kind of know that's what I have to do; I just can't work out how to do it or how to use regular expressions.

I'm beginning to formulate an idea. What I need is to:

1. Loop through the text and find all matches
2. Test each match to see if it is already contained within an HTML tag. If it is in a tag ignore it, if not wrap it in tags.

* A problem here would be if a match was contained partially in a tag
e.g.
Match = "some text"
String = "<a href="/">This is some text to match</a>"

In this case a search for "some text" would find it, but a check for an <a> tag containing exactly the match would say it is not within an <a> tag. It is, just not as the tag's entire content?

Thanks
 
LVL 96

Expert Comment

by:Bob Learned
ID: 13867644
Idea:

(1) Start with regular expression to extract tags.  Get all the matches, including start index (Match.Index) and value (Match.Value).  

   Example:  ^<a(.+)</a> will search for anchor tags.

(2) Search through all text matches for the sequence (i.e. some text).  
(3) Check the index of each match, and if it falls within the range of an HTML tag, process it through a different routine.

Bob
 

Author Comment

by:jonnyboy69
ID: 13875874
Clues on how to actually do this in code would be greatly appreciated and more points will be given :)
Thanks
 
LVL 96

Expert Comment

by:Bob Learned
ID: 13876096
Untested code to build a list of Anchors:

    public ArrayList FindAnchors(string input, string search)
    {

      string patternAnchors = "^<a(.+)</a>";

      Regex parser = new Regex(patternAnchors);

      ArrayList anchors = new ArrayList();

      foreach (Match current in Matches(input))
        anchors.Add(current);

      return anchors;

    }

This needs to be verified before continuing with the next steps.

Bob
 

Author Comment

by:jonnyboy69
ID: 13883913
Can't seem to get it to work.

1) Modified your function slightly (parser.Matches); not sure if it's right. What was the "search" parameter for?

public ArrayList FindAnchors(string input, string search)
{
      string patternAnchors = "^<a(.+)</a>";

      Regex parser = new Regex(patternAnchors);

      ArrayList anchors = new ArrayList();

      foreach(Match current in parser.Matches(input))
            anchors.Add(current);

      return anchors;

}

2. Tested using this:

string sInput = "<p><a href=\"http://www.adobe.com/products/acrobat/readstep.html\"></a></p><div align=\"right\"><a href=\"#top\">Back to top</a></div><a>Hello</a></div>";
ArrayList arr_Matches = FindAnchors(sInput, "");
Trace.Warn("Matches = " + arr_Matches.Count.ToString());

This traces "Matches = 0"

Thanks
 
LVL 96

Expert Comment

by:Bob Learned
ID: 13884985
Try this (search parameter was a left over from a previous attempt):

public ArrayList FindAnchors(string input)
{
     string patternAnchors = "<a(.+)</a>";

     Regex parser = new Regex(patternAnchors);

     ArrayList anchors = new ArrayList();

     foreach(Match current in parser.Matches(input))
          anchors.Add(current);

     return anchors;

}

The '^' metacharacter was throwing this off; the pattern only worked in my single test case. '^' anchors the match to the start of the input, so the string would have to begin with the <a> tag, which yours doesn't. Let me know if there are any more problems or confusion.
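
As a small illustration of the '^' behaviour described above, the following sketch counts matches for a pattern with and without the anchor. The class name and the sample input are assumptions made for this example.

```csharp
using System.Text.RegularExpressions;

public static class AnchorStartDemo
{
    // Returns how many times `pattern` matches in `input`.
    public static int Count(string pattern, string input)
    {
        return Regex.Matches(input, pattern).Count;
    }
}
```

For the input "<p><a href=\"#top\">Back to top</a></p>", the pattern "^<a(.+)</a>" finds nothing because the string starts with <p>, while "<a(.+)</a>" finds the anchor. The anchored pattern only succeeds on input such as "<a>Hello</a>" that begins with the tag.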

Bob
 

Author Comment

by:jonnyboy69
ID: 13885090
Done that and it now returns 1 match which is wrong?

I traced out the value when it is being added to the array and this was:
<a href="http://www.adobe.com/products/acrobat/readstep.html"></a></p><div align="right"><a href="#top">Back to top</a></div><a>Hello</a>

I figured the pattern was not right so I changed it to: string patternAnchors = "<a((.|\n)*?)</a>"; (which is what someone else mentioned to me).

This returns 3 matches and they are all correct :)
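
The difference between the two patterns is greedy versus lazy matching: (.+) grabs as much as it can, so it runs from the first <a to the last </a>, whereas *? stops at the nearest </a>. A sketch of an equivalent lazy pattern follows. It is a variation rather than the pattern used in this thread: RegexOptions.Singleline stands in for (.|\n), and the \b is an added guard so that "<a" does not also match tags such as <abbr>.

```csharp
using System.Text.RegularExpressions;

public static class AnchorFinder
{
    // Lazy ".*?" stops at the nearest </a>; Singleline lets '.' match newlines;
    // \b requires the tag name to end right after the "a".
    private static readonly Regex AnchorPattern =
        new Regex(@"<a\b.*?</a>", RegexOptions.Singleline | RegexOptions.IgnoreCase);

    // Returns one Match per anchor element, each with its Index and Value.
    public static MatchCollection FindAnchors(string input)
    {
        return AnchorPattern.Matches(input);
    }
}
```

Run against the sInput test string from the earlier post, this should return the same three anchors, each with the Match.Index and Match.Value needed for the next step.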

So now we need to "2) Search through all text matches for the sequence (i.e. some text)."

Thanks
 
LVL 96

Expert Comment

by:Bob Learned
ID: 13885419
Nice catch on that one.  I am still trying to get all this regular expression stuff straight in my head.  It can be very confusing still, after all this time.

Step #2:

    public void SearchHTMLForMatches(
               string textHTML, string searchText, ArrayList anchorList,
               out ArrayList matchesText, out ArrayList matchesAnchor)

    {

      // Initialize return values.
      matchesText = new ArrayList();
      matchesAnchor = new ArrayList();

      // Look for the search string with the HTML.
      int index = textHTML.IndexOf(searchText);

      if (index >-1 )
      {

        // Start by assuming that the search text is not within an anchor.
        bool withinAnchor = false;

        // Search through each anchor, and validate.
        foreach (Match current in anchorList)
        {

          // The start and end position
          int start = current.Index;
          int end = start + current.Value.Length;

          // The search text starts inside this anchor (end is exclusive).
          if (index >= start && index < end)
          {

            // Add this match to the return list.
            matchesAnchor.Add(current);

            // Indicate that the text was found in an anchor.
            withinAnchor = true;

          }

        }

        // If the text was found within the HTML, but not within an anchor
        // then add it to the text match list.
        if (!withinAnchor)
          matchesText.Add(index);

      }

    }

Pass in the HTML, the search text, and the list of anchors found from Step #1.

Return values are two ArrayList elements indicating if the text was found in the HTML and not within an anchor (matchesText), and if it was found within any anchor tags (matchesAnchor).

Bob
 

Author Comment

by:jonnyboy69
ID: 13885556
Sorry, being dim here; I don't know how to use out modifiers in C#. Where am I going wrong here?

// Matches
string sHtml = "<a href=\"\">Some text</a><a>Back to top</a><a>Hello</a>";
string sMatch = "Some text";

// AnchorList
ArrayList arr_AnchorList = FindAnchors(sHtml, "");
Trace.Warn("Matches = " + arr_AnchorList.Count.ToString());

ArrayList matchesText;
ArrayList matchesAnchor;

// Get matches
SearchHTMLForMatches(sHtml, sMatch, arr_AnchorList, matchesText, matchesAnchor);
 
LVL 96

Expert Comment

by:Bob Learned
ID: 13885816
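There are two keywords for passing values back through a method's parameters: ref and out. The out keyword indicates that the incoming argument doesn't need to be initialized, and that the method will assign it. The keyword has to appear at the call site as well as in the method signature, so your last line needs out in front of matchesText and matchesAnchor.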
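
A minimal, self-contained sketch of the out pattern, using a hypothetical Split method that is not part of this thread's code:

```csharp
public static class OutDemo
{
    // The method must assign every out parameter before it returns.
    public static void Split(string input, out string head, out string tail)
    {
        int space = input.IndexOf(' ');
        head = space < 0 ? input : input.Substring(0, space);
        tail = space < 0 ? "" : input.Substring(space + 1);
    }

    // The caller declares the variables without initializing them
    // and repeats the out keyword on each argument.
    public static string Run()
    {
        string head;
        string tail;
        Split("Some text", out head, out tail);
        return head + "|" + tail;
    }
}
```

Leaving out the keyword at the call site, as in Split("Some text", head, tail), is a compile-time error.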

What is happening now?  

With this HTML text, I would expect you to get matchesText.Count = 0 and matchesAnchor.Count = 1.

Bob
 

Author Comment

by:jonnyboy69
ID: 13886030
You would be exactly correct...

matchesText = 0
matchesAnchor = 1
 
LVL 96

Expert Comment

by:Bob Learned
ID: 13886220
So, where are we at now?  Do you need step #3 (how to process anchors)?  Or are we done?

Bob
 

Author Comment

by:jonnyboy69
ID: 13887186
Yes, sorry, step 3 was the one I understood least, believe it or not ;) You can have 1000 points for it. I'll set up another question for you, as I don't think you can increase to 1000?
 
LVL 96

Expert Comment

by:Bob Learned
ID: 13887264
I don't need 1000 points, and the limit is 500, which is fine.

Let's see if I understand what you need for Step #3.

Step #3, attempt #1:

    public string ReplaceAnchorText(string textHTML, string search, string replace, int position)
    {

      // Use a StringBuilder, since strings are immutable in C#.
      System.Text.StringBuilder sb = new System.Text.StringBuilder(textHTML);

      // Remove the search text.
      sb.Remove(position, search.Length);

      // Insert the new text at that position.
      sb.Insert(position, replace);

      // Return a string value.
      return sb.ToString();

    }

Bob
 

Author Comment

by:jonnyboy69
ID: 13887353
Not sure about this.

What I now need to do is replace ALL instances in the text which ARE NOT in anchors. Any contained in anchors should be ignored.

Thanks
 
LVL 96

Expert Comment

by:Bob Learned
ID: 13887387
Right ;)  I knew that I didn't get it.  Let's rename that:

public string ReplaceRegularText(string textHTML, string search, string replace, int position)
    {

      // Use a StringBuilder, since strings are immutable in C#.
      System.Text.StringBuilder sb = new System.Text.StringBuilder(textHTML);

      // Remove the search text.
      sb.Remove(position, search.Length);

      // Insert the new text at that position.
      sb.Insert(position, replace);

      // Return a string value.
      return sb.ToString();

    }

Usage:

foreach (int position in matchesText)
  sHtml = ReplaceRegularText(sHtml, "some text", "replace text", position);
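
One thing to watch with a loop like the one above: when the replacement is longer or shorter than the search text, each replacement shifts every later position, so positions recorded before the first edit no longer line up. A possible way around this, sketched here with hypothetical names, is to apply the replacements from the last position to the first, so the earlier offsets stay valid.

```csharp
using System.Collections;
using System.Text;

public static class PositionalReplacer
{
    // Replaces `length` characters at each recorded position with `replacement`,
    // working from the highest position down so earlier offsets are not shifted.
    public static string ReplaceAt(string text, ArrayList positions, int length, string replacement)
    {
        // Copy and sort so the caller's list is left untouched.
        ArrayList sorted = new ArrayList(positions);
        sorted.Sort();

        StringBuilder sb = new StringBuilder(text);
        for (int i = sorted.Count - 1; i >= 0; i--)
        {
            int position = (int)sorted[i];
            sb.Remove(position, length);
            sb.Insert(position, replacement);
        }
        return sb.ToString();
    }
}
```

With the text "ab ab ab", positions 0, 3 and 6, a length of 2 and the replacement "xyz", this gives "xyz xyz xyz".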

Bob
 

Author Comment

by:jonnyboy69
ID: 13887796
OK, looks like there is a problem with finding text matches. I have modified the HTML so there is now a text match as well as an anchor match. I then run it all like so:

// Matches
string sHtml = "<a href=\"\">Some text</a><a>Back to top</a><a>Hello</a>Some text";
string sMatch = "Some text";

// AnchorList
ArrayList arr_AnchorList = FindAnchors(sHtml, "");
Trace.Warn("Matches = " + arr_AnchorList.Count.ToString());

ArrayList matchesText;
ArrayList matchesAnchor;

// Get matches
SearchHTMLForMatches(sHtml, sMatch, arr_AnchorList, out matchesText, out matchesAnchor);

Trace.Warn("matchesText = " + matchesText.Count.ToString());
Trace.Warn("matchesAnchor = " + matchesAnchor.Count.ToString());

foreach (int position in matchesText)
      sHtml = ReplaceRegularText(sHtml, "some text", "<a href=\"\">Sometext</a>", position);

Trace.Warn("sHtml after sorting = " + sHtml);

This returns in the trace:
matchesText = 0
matchesAnchor = 1
sHtml after sorting = <a href="">Some text</a><a>Back to top</a><a>Hello</a>Some text

As you can see, it's not matching the plain-text occurrence for some reason?

 
LVL 96

Accepted Solution

by:
Bob Learned earned 2000 total points
ID: 13889085
The problem is that IndexOf always returns the first occurrence of the search text, which here is inside the anchor. What we need is to keep searching from a position just past each occurrence found. IndexOf has an overload that takes a start index. Try this function instead, and tell me what the counts are:

    public void SearchHTMLForMatches(
      string textHTML, string searchText, ArrayList anchorList,
      out ArrayList matchesText, out ArrayList matchesAnchor)

    {

      // Initialize return values.
      matchesText = new ArrayList();
      matchesAnchor = new ArrayList();

      // The start position for the next search.
      int searchPos = 0;

      // The position of the most recent match; -1 ends the loop.
      int index = 0;

      // Keep searching for matches until none can be found.
      while (index != -1)
      {

        // Look for the search string within the HTML, starting at searchPos.
        index = textHTML.IndexOf(searchText, searchPos);

        if (index >-1 )
        {

          // Move the starting position to just after the text found.
          searchPos = index + searchText.Length;

          // Start by assuming that the search text is not within an anchor.
          bool withinAnchor = false;

          // Search through each anchor, and validate.
          foreach (Match current in anchorList)
          {

            // The start and end position
            int start = current.Index;
            int end = start + current.Value.Length;

            // The search text starts inside this anchor. The end position is
            // exclusive, so text that begins right after </a> is not inside it.
            if (index >= start && index < end)
            {

              // Add this match to the return list.
              matchesAnchor.Add(current);

              // Indicate that the text was found in an anchor.
              withinAnchor = true;

            }

          }

          // If the text was found within the HTML, but not within an anchor
          // then add it to the text match list.
          if (!withinAnchor)
            matchesText.Add(index);

        }

      }

    }


Bob
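
For comparison, the steps in this thread (find the anchors, find the term, skip occurrences inside anchors, replace the rest) can also be folded into a single Regex.Replace call with a MatchEvaluator. This is a different technique from the index-based code above, offered as a sketch only. The class name, method name and protectedTags parameter are hypothetical, the URL format is taken from the question, and Uri.EscapeDataString is an added assumption. The pattern consumes whole protected elements and all other tag markup first, so the term alternative can only match in ordinary text.

```csharp
using System;
using System.Text.RegularExpressions;

public static class GlossaryLinker
{
    // Links `term` in `html`, leaving untouched any tag markup and the full
    // contents of the elements named in `protectedTags` (for example "a", "h1").
    public static string LinkTerm(string html, string term, params string[] protectedTags)
    {
        string protectedPart = "";
        if (protectedTags.Length > 0)
        {
            string[] escaped = new string[protectedTags.Length];
            for (int i = 0; i < protectedTags.Length; i++)
                escaped[i] = Regex.Escape(protectedTags[i]);

            // A whole protected element, from its opening tag to the matching close.
            protectedPart = @"<(?<tag>" + string.Join("|", escaped)
                          + @")\b[^>]*>.*?</\k<tag>\s*>|";
        }

        string pattern = protectedPart
            + @"<[^>]*>|"                               // any other tag, attributes included
            + @"(?<term>" + Regex.Escape(term) + ")";   // the term in ordinary text

        string href = "/glossary.aspx?term=" + Uri.EscapeDataString(term);

        return Regex.Replace(
            html,
            pattern,
            m => m.Groups["term"].Success
                ? "<a href=\"" + href + "\">" + m.Value + "</a>"   // wrap the term
                : m.Value,                                         // copy markup through unchanged
            RegexOptions.IgnoreCase | RegexOptions.Singleline);
    }
}
```

Calling LinkTerm(sHtml, "Some text", "a", "h1") on the last test string leaves the three anchors alone and links only the trailing "Some text". To cover a whole glossary, the method could be called once per term; links added by earlier calls are <a> elements, so later calls skip them as long as "a" is in the protected list. Regular expressions do not handle every HTML construct (comments, scripts, nested elements of the same name), so a real HTML parser may be the safer choice for untrusted or irregular markup.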