Solved

Regular Expressions

Posted on 2004-03-31
Last Modified: 2010-04-15
Hi
Can anyone help with this regular expression, which I found while searching through the PAQs?

Regex.Replace(HtmlData, @"\s+|\s*((<script>((?!</script>).)+</script>|<([^>]|""[^""]*""|'[^']*')*>)\s*)+", " ").Trim();

When I try to use it I receive the following error.

Description: An error occurred during the compilation of a resource required to service this request. Please review the following specific error details and modify your source code appropriately.

Compiler Error Message: CS1010: Newline in constant

Source Error:

 

Line 29: Session["InsertCatID"] = Request.Form["Category"];
Line 30: Session["InsertCountryID"] = Request.Form["Country"];
Line 31: Session["InsertCityID"] = Request.Form["City"];
Line 32: Session["PenaltyID"] = Request.Form["County"];
Line 33: Session["InsertTitle"] = strTitle;
 

Source File: C:\**.aspx    Line: 31
Line 31 has nothing to do with this bit of code.

I'm trying to use the code below:

StringBuilder strTextBuilder=new StringBuilder();
    foreach (Match match in Regex.Replace(HtmlData, @"\s+|\s*((<script>((?!</script>).)+</script>|<([^>]|""[^""]*""|'[^']*')*>)\s*)+", " ").Trim(); RegexOptions.IgnoreCase|RegexOptions.Singleline))
        strTextBuilder.Append(match.Value); // use match.Groups["content"].Value to get rid of the tag
    strText=strTextBuilder.ToString();

What I need to achieve is to remove all text between < and >, as well as text between <script>code here</script> or similar, and then remove excess white space so that the text is presented neatly.
It's a big job and I have no idea how to do it, even after reading Mastering Regular Expressions.

Any help would be appreciated
George
Question by:Tourist_Search
14 Comments
 
LVL 12

Accepted Solution

by:
dfiala13 earned 2000 total points
ID: 10724484
Your compilation problem is multifold.

Regex.Replace returns a string, not a match

You have a ; embedded in your arguments for Regex.Replace.

This will find matches for your Regex and call Replacer, which lets you see what you get and decide what to replace the text with:

private string SomeFunction(string HtmlData){

    MatchEvaluator ev = new MatchEvaluator(Replacer);
 
    string strText = Regex.Replace(HtmlData, @"\s+|\s*((<script>((?!</script>).)+</script>|<([^>]|""[^""]*""|'[^']*')*>)\s*)+", ev, RegexOptions.IgnoreCase|RegexOptions.Singleline);

    return strText;
                                    
}

private string Replacer(Match m)
{

    //return an empty string every time we find a match
    //or do something fancier
    return "";

}

Now in terms of the Regex expression.  It is better to walk before you fly, so you can a) understand exactly what is going on and b) get a working base that can be expanded, instead of trying to go with a big-bang approach where either it all works or nothing works.

So start with first part of your task, emptying out the script guts. Simplify your regular expression to
<script>((?!</script>).)+</script>
and see what matches.  Change it until you get all your scripts back.  (Hint: this will probably only find scripts where the script tag has no attributes).

Then you can add in the next requirement, and there is no shame in having to pass through the document multiple times if your requirements are complicated enough. Use the results of the previous trimming to feed your next one. Regex is powerful, but it's a pain to read and debug; once you get the discrete steps working you can start combining them where appropriate.
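
The hint about attributes can be checked in isolation. The following is a minimal console-side sketch, not part of the original answer: the class name, the method name, and the widened pattern are assumptions made for illustration. It is meant for a standalone .cs file, not for an inline page script block.

```csharp
using System;
using System.Text.RegularExpressions;

public static class ScriptPatternDemo
{
    // Pattern from the post: only matches an opening tag written exactly as <script>.
    public const string BarePattern = @"<script>((?!</script>).)+</script>";

    // Hypothetical widened pattern: \b[^>]* allows attributes in the opening tag,
    // and * instead of + also accepts an empty script body.
    public const string AttributePattern = @"<script\b[^>]*>((?!</script>).)*</script>";

    // Counts how many script blocks a given pattern finds in the HTML.
    public static int CountScripts(string html, string pattern)
    {
        return Regex.Matches(html, pattern,
            RegexOptions.IgnoreCase | RegexOptions.Singleline).Count;
    }
}
```

Running both patterns over a page that mixes a bare script tag with one carrying a type attribute shows the difference: the bare pattern misses the second block, the widened one finds both.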
 

Author Comment

by:Tourist_Search
ID: 10725326
Thanks dfiala13

I'll have to do more reading; this is well over my head.

I have this regex that does work:

foreach (Match match in Regex.Matches(htmlData, @">(?:(?<c>[^<]+))", RegexOptions.IgnoreCase|RegexOptions.Singleline))
        strTextBuilder.Append(match.Value); // use match.Groups["c"].Value to get rid of the tag
    strText=strTextBuilder.ToString();

Only problem is, it does not remove the text between the <script>code here tags</script>

George
 
LVL 12

Expert Comment

by:dfiala13
ID: 10725609
By "this regex that does work", do you mean that it finds matches? It does seem to find the end of any tag (>).
But at the moment you are just writing the value of each match to your string builder, so it is not removing anything; in fact, the string returned will have only the matches in it and none of the stuff you want.

I have to run out for a bit but will look at this later and see if I can get you on the right track.

In the meantime try to implement the code snippet I sent you with your regex, it removes the matches from the original string.
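
The difference between collecting matches and removing them can be sketched as follows. This is an illustrative sketch only: the class and method names are hypothetical, and the simple `<[^>]*>` tag pattern is an assumption that will mishandle a `>` inside an attribute value.

```csharp
using System;
using System.Text;
using System.Text.RegularExpressions;

public static class MatchesVersusReplace
{
    // The pattern from the thread: a '>' followed by text up to the next '<'.
    private const string TextAfterTag = @">(?:(?<c>[^<]+))";

    // Appends each whole match, so the leading '>' of every match is kept.
    public static string CollectMatches(string html)
    {
        StringBuilder sb = new StringBuilder();
        foreach (Match m in Regex.Matches(html, TextAfterTag, RegexOptions.Singleline))
            sb.Append(m.Value);
        return sb.ToString();
    }

    // Appends only the named group "c", which excludes the '>'.
    public static string CollectGroupText(string html)
    {
        StringBuilder sb = new StringBuilder();
        foreach (Match m in Regex.Matches(html, TextAfterTag, RegexOptions.Singleline))
            sb.Append(m.Groups["c"].Value);
        return sb.ToString();
    }

    // Regex.Replace returns the input with every match replaced (here, removed).
    public static string RemoveTags(string html)
    {
        return Regex.Replace(html, @"<[^>]*>", "");
    }
}
```

For the span example from this thread, CollectMatches keeps a dangling `>` in front of the text, while CollectGroupText and RemoveTags both leave just the text.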

 

Author Comment

by:Tourist_Search
ID: 10725764
Hi dfiala13

If for example I have <span class="colour">Coloured text here</span>

It removes the <span class="colour"> and </span> to leave the: "Coloured text here" which is what I'm after.

But it will not remove any code between <script>this still remains</script>
Or <script> {
this too remains
}
</script>

I have tried the snippet you sent but still receive an error. Deep subject.

George
 
LVL 12

Expert Comment

by:dfiala13
ID: 10726883
OK,

That's because you are asking it to do two different things:

remove the entirety of the <script> tags including contents,
For other tags, just remove the tags and leave the contents

And running your current Regex expression I get dangling end angle brackets > throughout the source

So I changed your code a bit, and this leaves only the guts of the HTML tags.

                  foreach (Match match in Regex.Matches(HtmlData, @">(?:(?<c>[^<]+))", RegexOptions.IgnoreCase|RegexOptions.Singleline))
                        strTextBuilder.Append(match.Value.Substring(1).Trim());

Now, I'll see if I can get rid of the scripts.

Given that the code above removes your HTML tags, it makes sense to kill the scripts first.

HtmlData = Regex.Replace(HtmlData, @"<script.*?</script>", ev, RegexOptions.IgnoreCase|RegexOptions.Singleline);

this calls the Replacer method, which right now returns an empty string no matter what you send it, but you could add more logic.

or you could also do this...

HtmlData = Regex.Replace(HtmlData, @"<script.*?</script>", "", RegexOptions.IgnoreCase|RegexOptions.Singleline);

So the final method looks like this (I left the call to the Replacer method in)...

private string CleanPage(string HtmlData)
{
      StringBuilder strTextBuilder=new StringBuilder();
      MatchEvaluator ev = new MatchEvaluator(Replacer);

      //remove script tags including contents
      HtmlData = Regex.Replace(HtmlData, @"<script.*?</script>", ev, RegexOptions.IgnoreCase|RegexOptions.Singleline);

      //remove remaining tags, leaving contents
      foreach (Match match in Regex.Matches(HtmlData, @">(?:(?<c>[^<]+))", RegexOptions.IgnoreCase|RegexOptions.Singleline))
            strTextBuilder.Append(match.Value.Substring(1).Trim());

      //return our cleaned string
      return strTextBuilder.ToString();
}

private string Replacer(Match m)
{
      //do something fancier or for now just return an empty string
      return "";
}
 

Author Comment

by:Tourist_Search
ID: 10730578
Hi dfiala13

When I run the following script it works OK, but once I add all of it I keep getting the error newline in constant

foreach (Match match in Regex.Matches(HtmlData, @">(?:(?<c>[^<]+))", RegexOptions.IgnoreCase|RegexOptions.Singleline))
                    strTextBuilder.Append(match.Value.Substring(1).Trim());

George
 
LVL 12

Expert Comment

by:dfiala13
ID: 10731789
Where do you get the error?

Try cutting and pasting the code back into place.

Failing that, post all your code and the specific error message.
 

Author Comment

by:Tourist_Search
ID: 10732175
Hi dfiala13

Script below:

<script runat="server">
string strTitle, strDescription, strText, strKeyWords;
void Page_Load(Object Src, EventArgs E)
{
    string htmlData=Request.Form["HtmlData"];
    strTitle=Regex.Match(htmlData, @"(?<=<title>).*?(?=</title>)", RegexOptions.IgnoreCase|RegexOptions.ExplicitCapture).Value; // Find title and keep text then remove tags
    strDescription=Regex.Match(htmlData, @"(?<=<META\s+name=""description""\s+content="").*?(?=""\s*?>)|(?<=<META\s+name=""description""\s+content="").*?(?=""\s*?/>)|(?<=content="").*?(?=name=""Description""/>)|(?<=content="").*?(?=name=""Description"">)", RegexOptions.IgnoreCase|RegexOptions.ExplicitCapture).Value; // Find Description then keep text
      strKeyWords=Regex.Match(htmlData, @"(?<=<META\s+name=""keywords""\s+content="").*?(?=""\s*?>)|(?<=<META\s+name=""keywords""\s+content="").*?(?=""\s*?/>)|(?<=content="").*?(?=name=""keywords""/>)|(?<=content="").*?(?=name=""keywords"">)", RegexOptions.IgnoreCase|RegexOptions.ExplicitCapture).Value; // Find Keywords then keep text
string CleanPage(string htmlData)
{
     StringBuilder strTextBuilder=new StringBuilder();
     MatchEvaluator ev = new MatchEvaluator(Replacer);

     //remove script tags including contents
     htmlData = Regex.Replace(htmlData, @"<script.*?</script>", ev, RegexOptions.IgnoreCase|RegexOptions.Singleline);

     //remove remaining tags, leaving contents
     foreach (Match match in Regex.Matches(htmlData, @">(?:(?<c>[^<]+))", RegexOptions.IgnoreCase|RegexOptions.Singleline))
     strTextBuilder.Append(match.Value.Substring(1).Trim());
     
                //return our cleaned string
                return strTextBuilder.ToString();
}

private string Replacer(Match m)
{
     //do something fancier or for now just return an empty string
     return "";
}
{
      
Session["AddressID"] = Request.Form["InsertAddress"];
Session["PropertyID"] = Request.Form["PropertyID"];
Session["URLID"]= Request.Form["URL"];
Session["CatID"] = Request.Form["Category"];
Session["CountryID"] = Request.Form["Country"];
Session["CityID"] = Request.Form["City"];
Session["TitleID"] = strTitle;
Session["MetaDescriptionID"] = strDescription;
Session["BodyTextID"] = Request.Form["InsertAddress"] + " " + strText;
Session["KeywordsID"] = strKeyWords;

Error message is:

Compiler Error Message: CS1010: Newline in constant

Source Error:

 

Line 26:                 //return our cleaned string
Line 27:                 return strTextBuilder.ToString();
Line 28: }
Line 29:
Line 30: private string Replacer(Match m)
 

Source File: C:\**.aspx    Line: 28

}
</script>

Thanks
George
 
LVL 12

Expert Comment

by:dfiala13
ID: 10732237
Well, you have methods nested inside your Page_Load event handler method. Methods call other methods; they are not declared inside them.

let's see if we can clean this up...

<script runat="server">
string strTitle, strDescription, strText, strKeyWords;

void Page_Load(Object Src, EventArgs E)
{
Session["AddressID"] = Request.Form["InsertAddress"];
Session["PropertyID"] = Request.Form["PropertyID"];
Session["URLID"]= Request.Form["URL"];
Session["CatID"] = Request.Form["Category"];
Session["CountryID"] = Request.Form["Country"];
Session["CityID"] = Request.Form["City"];
Session["TitleID"] = strTitle;
Session["MetaDescriptionID"] = strDescription;
Session["BodyTextID"] = Request.Form["InsertAddress"] + " " + strText;
Session["KeywordsID"] = strKeyWords;

    string htmlData=Request.Form["HtmlData"];
    strTitle=Regex.Match(htmlData, @"(?<=<title>).*?(?=</title>)", RegexOptions.IgnoreCase|RegexOptions.ExplicitCapture).Value; // Find title and keep text then remove tags
    strDescription=Regex.Match(htmlData, @"(?<=<META\s+name=""description""\s+content="").*?(?=""\s*?>)|(?<=<META\s+name=""description""\s+content="").*?(?=""\s*?/>)|(?<=content="").*?(?=name=""Description""/>)|(?<=content="").*?(?=name=""Description"">)", RegexOptions.IgnoreCase|RegexOptions.ExplicitCapture).Value; // Find Description then keep text
     strKeyWords=Regex.Match(htmlData, @"(?<=<META\s+name=""keywords""\s+content="").*?(?=""\s*?>)|(?<=<META\s+name=""keywords""\s+content="").*?(?=""\s*?/>)|(?<=content="").*?(?=name=""keywords""/>)|(?<=content="").*?(?=name=""keywords"">)", RegexOptions.IgnoreCase|RegexOptions.ExplicitCapture).Value; // Find Keywords then keep text

  htmlData = CleanPage(htmlData);

}

string CleanPage(string htmlData)
{
     StringBuilder strTextBuilder=new StringBuilder();
     MatchEvaluator ev = new MatchEvaluator(Replacer);

     //remove script tags including contents
     htmlData = Regex.Replace(htmlData, @"<script.*?</script>", ev, RegexOptions.IgnoreCase|RegexOptions.Singleline);

     //remove remaining tags, leaving contents
     foreach (Match match in Regex.Matches(htmlData, @">(?:(?<c>[^<]+))", RegexOptions.IgnoreCase|RegexOptions.Singleline))
          strTextBuilder.Append(match.Value.Substring(1).Trim());

     //return our cleaned string
     return strTextBuilder.ToString();
}

private string Replacer(Match m)
{
     //do something fancier or for now just return an empty string
     return "";
}
 

Author Comment

by:Tourist_Search
ID: 10732308
Hi

Still receive same error message.

Compiler Error Message: CS1010: Newline in constant

Source Error:

 

Line 43:                 //return our cleaned string
Line 44:                 return strTextBuilder.ToString();
Line 45: }
Line 46:
Line 47: private string Replacer(Match m)
 

Source File: C:\***.aspx    Line: 45
 
LVL 12

Expert Comment

by:dfiala13
ID: 10732395
delete lines 45 and 46 and add back in the }

You can also try putting the code into a new file

If that fails, start with a blank file and start adding things method by method and line by line until it breaks

You have something malformed in your code, which is hard to diagnose via EE. The two methods I sent you work, so there is something else in your script that is off.
 

Author Comment

by:Tourist_Search
ID: 10732764
Hi dfiala13

This section of code works.

<script runat="server">
string strTitle, strDescription, strText, strKeyWords;

void Page_Load(Object Src, EventArgs E)
{
Session["InsertAddress"] = Request.Form["InsertAddress"];
Session["InsertPropertyID"] = Request.Form["PropertyID"];
Session["InsertURL"]= Request.Form["URL"];
Session["InsertCatID"] = Request.Form["Category"];
Session["InsertCountryID"] = Request.Form["Country"];
Session["InsertCityID"] = Request.Form["City"];
Session["PenaltyID"] = Request.Form["County"];
Session["InsertTitle"] = strTitle;
Session["InsertMetaDescription"] = strDescription;
Session["InsertBodyText"] = Request.Form["InsertAddress"] + " " + strText;
Session["InsertKeywords"] = strKeyWords;
string htmlData=Request.Form["HtmlData"];
strTitle=Regex.Match(htmlData, @"(?<=<title>).*?(?=</title>)", RegexOptions.IgnoreCase|RegexOptions.ExplicitCapture).Value; // Find title and keep text then remove tags
strDescription=Regex.Match(htmlData, @"(?<=<META\s+name=""description""\s+content="").*?(?=""\s*?>)|(?<=<META\s+name=""description""\s+content="").*?(?=""\s*?/>)|(?<=content="").*?(?=name=""Description""/>)|(?<=content="").*?(?=name=""Description"">)", RegexOptions.IgnoreCase|RegexOptions.ExplicitCapture).Value; // Find Description then keep text
strKeyWords=Regex.Match(htmlData, @"(?<=<META\s+name=""keywords""\s+content="").*?(?=""\s*?>)|(?<=<META\s+name=""keywords""\s+content="").*?(?=""\s*?/>)|(?<=content="").*?(?=name=""keywords""/>)|(?<=content="").*?(?=name=""keywords"">)", RegexOptions.IgnoreCase|RegexOptions.ExplicitCapture).Value; // Find Keywords then keep text



StringBuilder strTextBuilder=new StringBuilder();
    foreach (Match match in Regex.Matches(htmlData, @">(?:(?<c>[^<]+))", RegexOptions.IgnoreCase|RegexOptions.Singleline))
                    strTextBuilder.Append(match.Value.Substring(1).Trim());
                    strText=strTextBuilder.ToString();
                              

// I can add the above and works OK removes the >
// But when I try to put it all into place the error occurs
 
}
</script>


But when I try to add this section the error occurs.

htmlData = CleanPage(htmlData);

}

string CleanPage(string htmlData)
{

     MatchEvaluator ev = new MatchEvaluator(Replacer);

     //remove script tags including contents
     htmlData = Regex.Replace(htmlData, @"<script.*?</script>", ev, RegexOptions.IgnoreCase|RegexOptions.Singleline);

     //remove remaining tags, leaving contents
     foreach (Match match in Regex.Matches(htmlData, @">(?:(?<c>[^<]+))", RegexOptions.IgnoreCase|RegexOptions.Singleline))
     strTextBuilder.Append(match.Value.Substring(1).Trim());
     
                //return our cleaned string
                return strTextBuilder.ToString();
}

private string Replacer(Match m)
{
     //do something fancier or for now just return an empty string
     return "";
}

Error message:

Compiler Error Message: CS1010: Newline in constant

Source Error:

 

Line 44:                 //return our cleaned string
Line 45:                 return strTextBuilder.ToString();
Line 46: }
Line 47:
Line 48: private string Replacer(Match m)
 

Source File: C:\****.aspx    Line: 46
 
LVL 12

Expert Comment

by:dfiala13
ID: 10733021
The long and the short of it is that you can't run this script-side.

The   htmlData = Regex.Replace(htmlData, @"<script.*?</script>", ev, RegexOptions.IgnoreCase|RegexOptions.Singleline);

line blows up because of the </script> tag.  The script compiler doesn't care that it is in a literal.

Move this code into code-behind and you'll be all set.

string strTitle, strDescription, strText, strKeyWords;

void Page_Load(Object Src, EventArgs E)
{

Session["InsertAddress"] = Request.Form["InsertAddress"];
Session["InsertPropertyID"] = Request.Form["PropertyID"];
Session["InsertURL"]= Request.Form["URL"];
Session["InsertCatID"] = Request.Form["Category"];
Session["InsertCountryID"] = Request.Form["Country"];
Session["InsertCityID"] = Request.Form["City"];
Session["PenaltyID"] = Request.Form["County"];
Session["InsertTitle"] = strTitle;
Session["InsertMetaDescription"] = strDescription;
Session["InsertBodyText"] = Request.Form["InsertAddress"] + " " + strText;
Session["InsertKeywords"] = strKeyWords;
string htmlData=Request.Form["HtmlData"];
strTitle=Regex.Match(htmlData, @"(?<=<title>).*?(?=</title>)", RegexOptions.IgnoreCase|RegexOptions.ExplicitCapture).Value; // Find title and keep text then remove tags
strDescription=Regex.Match(htmlData, @"(?<=<META\s+name=""description""\s+content="").*?(?=""\s*?>)|(?<=<META\s+name=""description""\s+content="").*?(?=""\s*?/>)|(?<=content="").*?(?=name=""Description""/>)|(?<=content="").*?(?=name=""Description"">)", RegexOptions.IgnoreCase|RegexOptions.ExplicitCapture).Value; // Find Description then keep text
strKeyWords=Regex.Match(htmlData, @"(?<=<META\s+name=""keywords""\s+content="").*?(?=""\s*?>)|(?<=<META\s+name=""keywords""\s+content="").*?(?=""\s*?/>)|(?<=content="").*?(?=name=""keywords""/>)|(?<=content="").*?(?=name=""keywords"">)", RegexOptions.IgnoreCase|RegexOptions.ExplicitCapture).Value; // Find Keywords then keep text
               
htmlData = CleanPage(htmlData);
}

private string CleanPage(string htmlData)
{
    StringBuilder strTextBuilder = new StringBuilder();

    //remove script tags including contents
    htmlData = Regex.Replace(htmlData, @"<script.*?</script>", "", RegexOptions.IgnoreCase|RegexOptions.Singleline);

    //remove remaining tags, leaving contents
    foreach (Match m in Regex.Matches(htmlData, @">(?:(?<c>[^<]+))", RegexOptions.IgnoreCase|RegexOptions.Singleline))
        strTextBuilder.Append(m.Value.Substring(1).Trim());

    return strTextBuilder.ToString();
}
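
If the code has to stay inline in the page, and if the diagnosis above is right that the closing script tag inside the literal is what trips things up, one possible workaround is to keep that character sequence out of the source by building the pattern from two pieces. This is an untested assumption as far as this thread goes, and the class and member names below are hypothetical.

```csharp
using System;
using System.Text.RegularExpressions;

public static class InlineSafePattern
{
    // Built by concatenation so that the source text never contains the
    // complete closing-tag sequence. At run time the string is the same
    // lazy script-block pattern used in the answer above.
    public static readonly string ScriptBlock = @"<script.*?</" + "script>";

    // Removes every script block, including its contents.
    // Singleline lets '.' cross line breaks; IgnoreCase covers <SCRIPT>.
    public static string RemoveScripts(string html)
    {
        return Regex.Replace(html, ScriptBlock, "",
            RegexOptions.IgnoreCase | RegexOptions.Singleline);
    }
}
```

The regex engine sees exactly the same pattern either way; only the source text differs, which is the point of the split.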
 

Author Comment

by:Tourist_Search
ID: 10753509
Hi dfiala13

Still received the same error, but I took note of what you said and have spent the weekend reading the book.

I have come up with a solution to the problem, which is below.

strCleanHTML = Regex.Replace(strCleanHTML, @"(?i)<script([^>])*>(\w|\W)*</script([^>])*>", " ");

That seems to have done the trick for the script.

Thanks for pointing me in the right direction.

George
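
One thing worth checking with the final pattern above is a page containing more than one script block. The `(\w|\W)*` part is greedy, so it runs from the first opening script tag to the last closing one and takes any text in between with it. The sketch below compares it with a lazy variant; the class and method names and the lazy pattern are assumptions added for illustration, not something posted in the thread.

```csharp
using System;
using System.Text.RegularExpressions;

public static class GreedyVersusLazy
{
    // Pattern as posted in the thread (greedy body).
    public const string Greedy = @"(?i)<script([^>])*>(\w|\W)*</script([^>])*>";

    // Hypothetical lazy variant: *? stops at the first closing script tag.
    public const string Lazy = @"(?i)<script([^>])*>(\w|\W)*?</script([^>])*>";

    // Replaces each match with a single space, as in the thread.
    public static string Strip(string html, string pattern)
    {
        return Regex.Replace(html, pattern, " ");
    }
}
```

With the input `A<script>x</script>B<script>y</script>C`, the greedy pattern yields `A C` (the `B` between the two blocks is lost), while the lazy variant yields `A B C`.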