Solved

Regular Expressions

Posted on 2004-03-31
Last Modified: 2010-04-15
Hi
Can anyone help with this regular expression, which I found while searching through the PAQs?

Regex.Replace(HtmlData, @"\s+|\s*((<script>((?!</script>).)+</script>|<([^>]|""[^""]*""|'[^']*')*>)\s*)+", " ").Trim();

When I try to use it I receive the following error.

Description: An error occurred during the compilation of a resource required to service this request. Please review the following specific error details and modify your source code appropriately.

Compiler Error Message: CS1010: Newline in constant

Source Error:

 

Line 29: Session["InsertCatID"] = Request.Form["Category"];
Line 30: Session["InsertCountryID"] = Request.Form["Country"];
Line 31: Session["InsertCityID"] = Request.Form["City"];
Line 32: Session["PenaltyID"] = Request.Form["County"];
Line 33: Session["InsertTitle"] = strTitle;
 

Source File: C:\**.aspx    Line: 31
Line 31 has nothing to do with this bit of code.

I'm trying to use the code below:

StringBuilder strTextBuilder=new StringBuilder();
    foreach (Match match in Regex.Replace(HtmlData, @"\s+|\s*((<script>((?!</script>).)+</script>|<([^>]|""[^""]*""|'[^']*')*>)\s*)+", " ").Trim(); RegexOptions.IgnoreCase|RegexOptions.Singleline))
        strTextBuilder.Append(match.Value); // use match.Groups["content"].Value to get rid of the tag
    strText=strTextBuilder.ToString();

What I need to achieve is to remove all text between < and >, as well as text between <script>code here</script> or similar, and then remove excess white space so that the text is presented neatly.
It's a big job and I have no idea how to do it, even after reading Mastering Regular Expressions.

Any help would be appreciated
George
Question by:Tourist_Search
 
LVL 12

Accepted Solution

by:
dfiala13 earned 500 total points
ID: 10724484
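Your compilation problem has more than one cause.

Regex.Replace returns a string, not a collection of matches, so you cannot iterate over its result with foreach (Match ...).

You also have a stray ; embedded in the middle of your arguments to Regex.Replace.

The code below finds the matches for your regex and calls Replacer for each one, so you can see what was matched and decide what to replace the text with.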

private string SomeFunction(string HtmlData){

    MatchEvaluator ev = new MatchEvaluator(Replacer);
 
    string strText = Regex.Replace(HtmlData, @"\s+|\s*((<script>((?!</script>).)+</script>|<([^>]|""[^""]*""|'[^']*')*>)\s*)+", ev, RegexOptions.IgnoreCase|RegexOptions.Singleline);

    return strText;
                                    
}

private string Replacer(Match m)
{

    // return an empty string every time we find a match,
    // or do something fancier
    return "";

}
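
To see the evaluator doing more than returning an empty string, here is a minimal, self-contained sketch. The class name EvaluatorSketch, the method Strip, and the simplified pattern are all assumptions made for illustration; the pattern is deliberately much simpler than the one in the question.

```csharp
using System;
using System.Text.RegularExpressions;

public static class EvaluatorSketch
{
    // Called once per match; whatever it returns replaces the matched text.
    private static string Replacer(Match m)
    {
        // Hypothetical rule for illustration: tags vanish,
        // runs of white space collapse to a single space.
        return m.Value[0] == '<' ? "" : " ";
    }

    public static string Strip(string html)
    {
        // Simplified pattern (an assumption): any tag, or a run of white space.
        return Regex.Replace(html, @"<[^>]*>|\s+", new MatchEvaluator(Replacer)).Trim();
    }
}
```

Calling EvaluatorSketch.Strip("<b>Hello</b>   world") should give back "Hello world". The point is only that the decision about each match lives in one ordinary method, which is easy to step through in a debugger.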

Now, in terms of the regular expression itself: it is better to walk before you fly, so you can a) understand exactly what is going on and b) get a working base that can be expanded, instead of trying a big-bang approach where either it all works or nothing works.

So start with the first part of your task, emptying out the script guts. Simplify your regular expression to
<script>((?!</script>).)+</script>
and see what matches.  Change it until you get all your scripts back.  (Hint: this will probably only find scripts where the script tag has no attributes).
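
A quick way to check the hint is to count matches for the bare pattern and for a variant that tolerates attributes. The sketch below assumes it is compiled as an ordinary .cs file; ScriptPatternSketch, Bare, WithAttributes and Count are names invented for the example, and the attribute-tolerant pattern is one possible variant, not the only one.

```csharp
using System;
using System.Text.RegularExpressions;

public static class ScriptPatternSketch
{
    private const RegexOptions Opts = RegexOptions.IgnoreCase | RegexOptions.Singleline;

    // The simplified pattern from the comment above: only matches <script> with no attributes.
    public const string Bare = @"<script>((?!</script>).)+</script>";

    // Assumed variant: \b[^>]* lets the opening tag carry attributes such as type or src.
    public const string WithAttributes = @"<script\b[^>]*>((?!</script>).)*</script>";

    // Returns how many script blocks the given pattern finds in the input.
    public static int Count(string pattern, string html)
    {
        return Regex.Matches(html, pattern, Opts).Count;
    }
}
```

For an input holding one plain <script> block and one <script type="text/javascript"> block, Count with Bare finds only the first, while Count with WithAttributes finds both.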

Then you can add in the next requirement. There is no shame in passing through the document multiple times if your requirements are complicated enough; use the result of the previous trimming to feed the next one. Regex is powerful, but it's a pain to read and debug. Once you get the discrete steps working, you can start combining them where appropriate.
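
The multiple-pass idea could look roughly like the sketch below: one pass per requirement, each feeding the next. MultiPassSketch and ToPlainText are hypothetical names, and the three patterns are assumptions chosen for readability; regular expressions are not a full HTML parser, so a > inside an attribute value, for example, would confuse the second pass.

```csharp
using System;
using System.Text.RegularExpressions;

public static class MultiPassSketch
{
    private const RegexOptions Opts = RegexOptions.IgnoreCase | RegexOptions.Singleline;

    public static string ToPlainText(string html)
    {
        // Pass 1: drop script blocks together with their contents.
        string text = Regex.Replace(html, @"<script\b[^>]*>.*?</script\s*>", " ", Opts);

        // Pass 2: replace every remaining tag with a space so neighbouring words stay apart.
        text = Regex.Replace(text, @"<[^>]*>", " ", Opts);

        // Pass 3: collapse runs of white space and trim the ends.
        return Regex.Replace(text, @"\s+", " ").Trim();
    }
}
```

Because each pass is a single line, any one of them can be commented out or swapped while the others keep working, which is the main reason to prefer this over one combined expression.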
 

Author Comment

by:Tourist_Search
ID: 10725326
Thanks dfiala13

I'll have to do more reading, this is well over my head,

I have this regex that does work:

foreach (Match match in Regex.Matches(htmlData, @">(?:(?<c>[^<]+))", RegexOptions.IgnoreCase|RegexOptions.Singleline))
        strTextBuilder.Append(match.Value); // use match.Groups["c"].Value to get rid of the tag
    strText=strTextBuilder.ToString();

Only problem is, it does not remove the text between the <script>code here tags</script>

George
 
LVL 12

Expert Comment

by:dfiala13
ID: 10725609
By "this regex that does work", do you mean that it finds matches? It does seem to find the end of any tag (>).
But at the moment you are just writing the value of each match to your StringBuilder, so nothing is being removed from the original string; the string you return contains only the matched text and nothing else.

I have to run out for a bit but will look at this later and see if I can get you on the right track.

In the meantime try to implement the code snippet I sent you with your regex, it removes the matches from the original string.
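
To make the difference between match.Value and the named group concrete, here is a small sketch. MatchesSketch and Collect are hypothetical names; the pattern is the one from the question with the redundant non-capturing group dropped, and the group is named c, so it is read with Groups["c"].

```csharp
using System;
using System.Text;
using System.Text.RegularExpressions;

public static class MatchesSketch
{
    // Appends either the whole match (which includes the leading '>')
    // or only the named group c (the text after the tag).
    public static string Collect(string html, bool useGroup)
    {
        StringBuilder sb = new StringBuilder();
        foreach (Match m in Regex.Matches(html, @">(?<c>[^<]+)"))
            sb.Append(useGroup ? m.Groups["c"].Value : m.Value);
        return sb.ToString();
    }
}
```

For the input <span class="colour">Coloured text here</span>, Collect with useGroup set to false should return ">Coloured text here", and with true "Coloured text here". Note that any text sitting before the very first tag is never matched, because the pattern needs a preceding >.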
 

Author Comment

by:Tourist_Search
ID: 10725764
Hi dfiala13

If for example I have <span class="colour">Coloured text here</span>

It removes the <span class="colour"> and </span> to leave the "Coloured text here", which is what I'm after.

But it will not remove any code between <script>this still remains</script>
Or <script> {
this too remains
}
</script>

I have tried the snippet you sent but still receive an error. Deep subject.

George
 
LVL 12

Expert Comment

by:dfiala13
ID: 10726883
OK,

That's because you are asking it to do two different things:

Remove the <script> tags in their entirety, including contents.
For other tags, remove just the tags and leave the contents.

Also, when I run your current regular expression I get dangling closing angle brackets (>) throughout the output.

So I changed your code a bit, and this leaves only the guts of the HTML tags.

foreach (Match match in Regex.Matches(HtmlData, @">(?:(?<c>[^<]+))", RegexOptions.IgnoreCase|RegexOptions.Singleline))
    strTextBuilder.Append(match.Value.Substring(1).Trim());

Now, I'll see if I can get rid of the scripts.

Given that the code above removes your HTML tags, it makes sense to remove the scripts first.

HtmlData = Regex.Replace(HtmlData, @"<script.*?</script>", ev, RegexOptions.IgnoreCase|RegexOptions.Singleline);

This calls the Replacer method, which right now returns an empty string no matter what you send it, but you could add more logic.

Or you could also do this...

HtmlData = Regex.Replace(HtmlData, @"<script.*?</script>", "", RegexOptions.IgnoreCase|RegexOptions.Singleline);

So the final method looks like this (I left the call to the Replacer method in)...

private string CleanPage(string HtmlData)
{
    StringBuilder strTextBuilder = new StringBuilder();
    MatchEvaluator ev = new MatchEvaluator(Replacer);

    // remove script tags, including their contents
    HtmlData = Regex.Replace(HtmlData, @"<script.*?</script>", ev, RegexOptions.IgnoreCase|RegexOptions.Singleline);

    // remove the remaining tags, leaving their contents
    foreach (Match match in Regex.Matches(HtmlData, @">(?:(?<c>[^<]+))", RegexOptions.IgnoreCase|RegexOptions.Singleline))
        strTextBuilder.Append(match.Value.Substring(1).Trim());

    // return our cleaned string
    return strTextBuilder.ToString();
}

private string Replacer(Match m)
{
    // do something fancier, or for now just return an empty string
    return "";
}
 

Author Comment

by:Tourist_Search
ID: 10730578
Hi dfiala13

When I run the following script it works OK, but once I add all of it I keep getting the error newline in constant

foreach (Match match in Regex.Matches(HtmlData, @">(?:(?<c>[^<]+))", RegexOptions.IgnoreCase|RegexOptions.Singleline))
                    strTextBuilder.Append(match.Value.Substring(1).Trim());

George
 
LVL 12

Expert Comment

by:dfiala13
ID: 10731789
Where do you get the error?

Try cutting and pasting the code back into place.

Failing that, post all your code and the specific error message.
 

Author Comment

by:Tourist_Search
ID: 10732175
Hi dfiala13

Script below:

<script runat="server">
string strTitle, strDescription, strText, strKeyWords;
void Page_Load(Object Src, EventArgs E)
{
    string htmlData=Request.Form["HtmlData"];
    strTitle=Regex.Match(htmlData, @"(?<=<title>).*?(?=</title>)", RegexOptions.IgnoreCase|RegexOptions.ExplicitCapture).Value; // Find title and keep text then remove tags
    strDescription=Regex.Match(htmlData, @"(?<=<META\s+name=""description""\s+content="").*?(?=""\s*?>)|(?<=<META\s+name=""description""\s+content="").*?(?=""\s*?/>)|(?<=content="").*?(?=name=""Description""/>)|(?<=content="").*?(?=name=""Description"">)", RegexOptions.IgnoreCase|RegexOptions.ExplicitCapture).Value; // Find Description then keep text
      strKeyWords=Regex.Match(htmlData, @"(?<=<META\s+name=""keywords""\s+content="").*?(?=""\s*?>)|(?<=<META\s+name=""keywords""\s+content="").*?(?=""\s*?/>)|(?<=content="").*?(?=name=""keywords""/>)|(?<=content="").*?(?=name=""keywords"">)", RegexOptions.IgnoreCase|RegexOptions.ExplicitCapture).Value; // Find Keywords then keep text
string CleanPage(string htmlData)
{
     StringBuilder strTextBuilder=new StringBuilder();
     MatchEvaluator ev = new MatchEvaluator(Replacer);

     //remove script tags including contents
     htmlData = Regex.Replace(htmlData, @"<script.*?</script>", ev, RegexOptions.IgnoreCase|RegexOptions.Singleline);

     //remove reminaing tags, leaving contents
     foreach (Match match in Regex.Matches(htmlData, @">(?:(?<c>[^<]+))", RegexOptions.IgnoreCase|RegexOptions.Singleline))
     strTextBuilder.Append(match.Value.Substring(1).Trim());
     
                //return our cleaned string
                return strTextBuilder.ToString();
}

private string Replacer(Match m)
{
              //do soemthing fancier or fo rnow just return an empty string
     return "";
}
{
      
Session["AddressID"] = Request.Form["InsertAddress"];
Session["PropertyID"] = Request.Form["PropertyID"];
Session["URLID"]= Request.Form["URL"];
Session["CatID"] = Request.Form["Category"];
Session["CountryID"] = Request.Form["Country"];
Session["CityID"] = Request.Form["City"];
Session["TitleID"] = strTitle;
Session["MetaDescriptionID"] = strDescription;
Session["BodyTextID"] = Request.Form["InsertAddress"] + " " + strText;
Session["KeywordsID"] = strKeyWords;

Error message is:

Compiler Error Message: CS1010: Newline in constant

Source Error:

 

Line 26:                 //return our cleaned string
Line 27:                 return strTextBuilder.ToString();
Line 28: }
Line 29:
Line 30: private string Replacer(Match m)
 

Source File: C:\**.aspx    Line: 28

}
</script>

Thanks
George
 
LVL 12

Expert Comment

by:dfiala13
ID: 10732237
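Well, you have methods nested inside your Page_Load event handler. Methods can call other methods, but here they cannot be declared inside one another; each one has to sit at the same level as Page_Load.

Let's see if we can clean this up...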

<script runat="server">
string strTitle, strDescription, strText, strKeyWords;

void Page_Load(Object Src, EventArgs E)
{
Session["AddressID"] = Request.Form["InsertAddress"];
Session["PropertyID"] = Request.Form["PropertyID"];
Session["URLID"]= Request.Form["URL"];
Session["CatID"] = Request.Form["Category"];
Session["CountryID"] = Request.Form["Country"];
Session["CityID"] = Request.Form["City"];
Session["TitleID"] = strTitle;
Session["MetaDescriptionID"] = strDescription;
Session["BodyTextID"] = Request.Form["InsertAddress"] + " " + strText;
Session["KeywordsID"] = strKeyWords;

    string htmlData=Request.Form["HtmlData"];
    strTitle=Regex.Match(htmlData, @"(?<=<title>).*?(?=</title>)", RegexOptions.IgnoreCase|RegexOptions.ExplicitCapture).Value; // Find title and keep text then remove tags
    strDescription=Regex.Match(htmlData, @"(?<=<META\s+name=""description""\s+content="").*?(?=""\s*?>)|(?<=<META\s+name=""description""\s+content="").*?(?=""\s*?/>)|(?<=content="").*?(?=name=""Description""/>)|(?<=content="").*?(?=name=""Description"">)", RegexOptions.IgnoreCase|RegexOptions.ExplicitCapture).Value; // Find Description then keep text
     strKeyWords=Regex.Match(htmlData, @"(?<=<META\s+name=""keywords""\s+content="").*?(?=""\s*?>)|(?<=<META\s+name=""keywords""\s+content="").*?(?=""\s*?/>)|(?<=content="").*?(?=name=""keywords""/>)|(?<=content="").*?(?=name=""keywords"">)", RegexOptions.IgnoreCase|RegexOptions.ExplicitCapture).Value; // Find Keywords then keep text

  htmlData = CleanPage(htmlData);

}

string CleanPage(string htmlData)
{
    StringBuilder strTextBuilder = new StringBuilder();
    MatchEvaluator ev = new MatchEvaluator(Replacer);

    // remove script tags, including their contents
    htmlData = Regex.Replace(htmlData, @"<script.*?</script>", ev, RegexOptions.IgnoreCase|RegexOptions.Singleline);

    // remove the remaining tags, leaving their contents
    foreach (Match match in Regex.Matches(htmlData, @">(?:(?<c>[^<]+))", RegexOptions.IgnoreCase|RegexOptions.Singleline))
        strTextBuilder.Append(match.Value.Substring(1).Trim());

    // return our cleaned string
    return strTextBuilder.ToString();
}

private string Replacer(Match m)
{
    // do something fancier, or for now just return an empty string
    return "";
}
 

Author Comment

by:Tourist_Search
ID: 10732308
Hi

Still receive same error message.

Compiler Error Message: CS1010: Newline in constant

Source Error:

 

Line 43:                 //return our cleaned string
Line 44:                 return strTextBuilder.ToString();
Line 45: }
Line 46:
Line 47: private string Replacer(Match m)
 

Source File: C:\***.aspx    Line: 45
 
LVL 12

Expert Comment

by:dfiala13
ID: 10732395
Delete lines 45 and 46 and add the } back in.

You can also try putting the code into a new file.

If that fails, start with a blank file and add things back method by method and line by line until it breaks.

Something in your code is malformed, which is hard to diagnose via EE. The two methods I sent you work, so there is something else in your script that is off.
 

Author Comment

by:Tourist_Search
ID: 10732764
Hi dfiala13

This section of code works.

<script runat="server">
string strTitle, strDescription, strText, strKeyWords;

void Page_Load(Object Src, EventArgs E)
{
Session["InsertAddress"] = Request.Form["InsertAddress"];
Session["InsertPropertyID"] = Request.Form["PropertyID"];
Session["InsertURL"]= Request.Form["URL"];
Session["InsertCatID"] = Request.Form["Category"];
Session["InsertCountryID"] = Request.Form["Country"];
Session["InsertCityID"] = Request.Form["City"];
Session["PenaltyID"] = Request.Form["County"];
Session["InsertTitle"] = strTitle;
Session["InsertMetaDescription"] = strDescription;
Session["InsertBodyText"] = Request.Form["InsertAddress"] + " " + strText;
Session["InsertKeywords"] = strKeyWords;
string htmlData=Request.Form["HtmlData"];
strTitle=Regex.Match(htmlData, @"(?<=<title>).*?(?=</title>)", RegexOptions.IgnoreCase|RegexOptions.ExplicitCapture).Value; // Find title and keep text then remove tags
strDescription=Regex.Match(htmlData, @"(?<=<META\s+name=""description""\s+content="").*?(?=""\s*?>)|(?<=<META\s+name=""description""\s+content="").*?(?=""\s*?/>)|(?<=content="").*?(?=name=""Description""/>)|(?<=content="").*?(?=name=""Description"">)", RegexOptions.IgnoreCase|RegexOptions.ExplicitCapture).Value; // Find Description then keep text
strKeyWords=Regex.Match(htmlData, @"(?<=<META\s+name=""keywords""\s+content="").*?(?=""\s*?>)|(?<=<META\s+name=""keywords""\s+content="").*?(?=""\s*?/>)|(?<=content="").*?(?=name=""keywords""/>)|(?<=content="").*?(?=name=""keywords"">)", RegexOptions.IgnoreCase|RegexOptions.ExplicitCapture).Value; // Find Keywords then keep text



StringBuilder strTextBuilder=new StringBuilder();
    foreach (Match match in Regex.Matches(htmlData, @">(?:(?<c>[^<]+))", RegexOptions.IgnoreCase|RegexOptions.Singleline))
                    strTextBuilder.Append(match.Value.Substring(1).Trim());
                    strText=strTextBuilder.ToString();
                              

// I can add the above and works OK removes the >
// But when try to  put it all into place the error occurs
 
}
</script>


But when I try to add this section the error occurs.

htmlData = CleanPage(htmlData);

}

string CleanPage(string htmlData)
{

     MatchEvaluator ev = new MatchEvaluator(Replacer);

     //remove script tags including contents
     htmlData = Regex.Replace(htmlData, @"<script.*?</script>", ev, RegexOptions.IgnoreCase|RegexOptions.Singleline);

     //remove reminaing tags, leaving contents
     foreach (Match match in Regex.Matches(htmlData, @">(?:(?<c>[^<]+))", RegexOptions.IgnoreCase|RegexOptions.Singleline))
     strTextBuilder.Append(match.Value.Substring(1).Trim());
     
                //return our cleaned string
                return strTextBuilder.ToString();
}

private string Replacer(Match m)
{
              //do soemthing fancier or fo rnow just return an empty string
     return "";
}

Error message:

Compiler Error Message: CS1010: Newline in constant

Source Error:

 

Line 44:                 //return our cleaned string
Line 45:                 return strTextBuilder.ToString();
Line 46: }
Line 47:
Line 48: private string Replacer(Match m)
 

Source File: C:\****.aspx    Line: 46
 
LVL 12

Expert Comment

by:dfiala13
ID: 10733021
The long and the short of it is that you can't run this code inside a <script runat="server"> block.

The   htmlData = Regex.Replace(htmlData, @"<script.*?</script>", ev, RegexOptions.IgnoreCase|RegexOptions.Singleline);

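line blows up because of the </script> inside the pattern. The page parser treats it as the end of the server-side script block; it doesn't care that it sits inside a string literal. The C# compiler is then handed a string that is cut off mid-line, hence "Newline in constant".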

Move this code into code-behind and you'll be all set.

string strTitle, strDescription, strText, strKeyWords;

void Page_Load(Object Src, EventArgs E)
{

string htmlData=Request.Form["HtmlData"];
strTitle=Regex.Match(htmlData, @"(?<=<title>).*?(?=</title>)", RegexOptions.IgnoreCase|RegexOptions.ExplicitCapture).Value; // Find title and keep text then remove tags
strDescription=Regex.Match(htmlData, @"(?<=<META\s+name=""description""\s+content="").*?(?=""\s*?>)|(?<=<META\s+name=""description""\s+content="").*?(?=""\s*?/>)|(?<=content="").*?(?=name=""Description""/>)|(?<=content="").*?(?=name=""Description"">)", RegexOptions.IgnoreCase|RegexOptions.ExplicitCapture).Value; // Find Description then keep text
strKeyWords=Regex.Match(htmlData, @"(?<=<META\s+name=""keywords""\s+content="").*?(?=""\s*?>)|(?<=<META\s+name=""keywords""\s+content="").*?(?=""\s*?/>)|(?<=content="").*?(?=name=""keywords""/>)|(?<=content="").*?(?=name=""keywords"">)", RegexOptions.IgnoreCase|RegexOptions.ExplicitCapture).Value; // Find Keywords then keep text

// Clean the page before the session values are stored, so strText is populated
strText = CleanPage(htmlData);

Session["InsertAddress"] = Request.Form["InsertAddress"];
Session["InsertPropertyID"] = Request.Form["PropertyID"];
Session["InsertURL"]= Request.Form["URL"];
Session["InsertCatID"] = Request.Form["Category"];
Session["InsertCountryID"] = Request.Form["Country"];
Session["InsertCityID"] = Request.Form["City"];
Session["PenaltyID"] = Request.Form["County"];
Session["InsertTitle"] = strTitle;
Session["InsertMetaDescription"] = strDescription;
Session["InsertBodyText"] = Request.Form["InsertAddress"] + " " + strText;
Session["InsertKeywords"] = strKeyWords;
}

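private string CleanPage(string htmlData)
{
    // the builder has to be declared here; it is not a field of the page
    StringBuilder strTextBuilder = new StringBuilder();

    // remove script blocks, including their contents
    htmlData = Regex.Replace(htmlData, @"<script.*?</script>", "", RegexOptions.IgnoreCase|RegexOptions.Singleline);

    // keep only the text that follows each tag
    foreach (Match m in Regex.Matches(htmlData, @">(?:(?<c>[^<]+))", RegexOptions.IgnoreCase|RegexOptions.Singleline))
        strTextBuilder.Append(m.Value.Substring(1).Trim());

    return strTextBuilder.ToString();
}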
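
If the code has to stay in the .aspx page, a commonly used workaround is to make sure the closing script tag never appears verbatim in the source, for instance by building the pattern from two string pieces. The sketch below assumes the page parser looks for that closing tag textually, as described above; InlineSafeSketch and RemoveScripts are names made up for the example.

```csharp
using System;
using System.Text.RegularExpressions;

public static class InlineSafeSketch
{
    // The closing tag is assembled from two pieces, so the complete tag text
    // never occurs in the page source (not even in this comment).
    private static readonly string ScriptBlock = @"<script\b[^>]*>.*?</scr" + @"ipt>";

    // Removes every script block, including its contents.
    public static string RemoveScripts(string html)
    {
        return Regex.Replace(html, ScriptBlock, "",
            RegexOptions.IgnoreCase | RegexOptions.Singleline);
    }
}
```

At run time the two pieces are joined into an ordinary pattern, so the regex engine sees exactly the same expression as before; only the source text changes.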
 

Author Comment

by:Tourist_Search
ID: 10753509
Hi dfiala13

Still received the same error, but I took note of what you said and have spent the weekend reading the book.

I have come up with a solution to the problem, which is below.

strCleanHTML = Regex.Replace(strCleanHTML, @"(?i)<script([^>])*>(\w|\W)*</script([^>])*>", " ");

That seems to have done the trick for the script.
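
One thing worth checking with this pattern: (\w|\W)* is greedy, so on a page with more than one script block it can run from the first opening tag to the last closing tag and take the text in between with it. The sketch below compares it with a lazy variant; GreedySketch, Greedy, Lazy and Clean are hypothetical names, and the lazy pattern is an assumed alternative, not something tested against the original pages.

```csharp
using System;
using System.Text.RegularExpressions;

public static class GreedySketch
{
    // The pattern from the post above.
    public const string Greedy = @"(?i)<script([^>])*>(\w|\W)*</script([^>])*>";

    // Assumed alternative: (?s) lets . match newlines, and .*? stops at the
    // first closing tag instead of the last one.
    public const string Lazy = @"(?is)<script\b[^>]*>.*?</script[^>]*>";

    // Replaces every match of the given pattern with a single space.
    public static string Clean(string html, string pattern)
    {
        return Regex.Replace(html, pattern, " ");
    }
}
```

With the input <script>a()</script>Keep me<script>b()</script>, the greedy pattern should swallow the whole string and leave a single space, while the lazy one leaves " Keep me ". Neither pattern contains the closing tag followed directly by >, which is presumably why this version compiles inside the page.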

Thanks for pointing me in the right direction.

George