
Regular Expressions

Hi
Can anyone help with this regular expression, which I found while searching through the PAQs?

Regex.Replace(HtmlData, @"\s+|\s*((<script>((?!</script>).)+</script>|<([^>]|""[^""]*""|'[^']*')*>)\s*)+", " ").Trim();

When I try to use it I receive the following error.

Description: An error occurred during the compilation of a resource required to service this request. Please review the following specific error details and modify your source code appropriately.

Compiler Error Message: CS1010: Newline in constant

Source Error:

Line 29: Session["InsertCatID"] = Request.Form["Category"];
Line 30: Session["InsertCountryID"] = Request.Form["Country"];
Line 31: Session["InsertCityID"] = Request.Form["City"];
Line 32: Session["PenaltyID"] = Request.Form["County"];
Line 33: Session["InsertTitle"] = strTitle;

Source File: C:\**.aspx Line: 31
Line 31 has nothing to do with this bit of code.

I'm trying to use the code below:

StringBuilder strTextBuilder=new StringBuilder();
foreach (Match match in Regex.Replace(HtmlData, @"\s+|\s*((<script>((?!</script>).)+</script>|<([^>]|""[^""]*""|'[^']*')*>)\s*)+", " ").Trim(); RegexOptions.IgnoreCase|RegexOptions.Singleline))
strTextBuilder.Append(match.Value); // use match.Groups["content"].Value to get rid of the tag
strText=strTextBuilder.ToString();

What I need to achieve is to remove all text between < and >, and also all text between <script>code here</script> or similar, then remove excess white space so that the text is presented neatly.
It's a big job and I have no idea how to do it, even after reading Mastering Regular Expressions.

Any help would be appreciated.
George

ASKER CERTIFIED SOLUTION

dfiala13


Tourist_Search

ASKER

Thanks dfiala13

I'll have to do more reading; this is well over my head.

I have this regex that does work:

foreach (Match match in Regex.Matches(htmlData, @">(?:(?<c>[^<]+))", RegexOptions.IgnoreCase|RegexOptions.Singleline))
strTextBuilder.Append(match.Value); // use match.Groups["c"].Value to get rid of the tag
strText=strTextBuilder.ToString();

The only problem is that it does not remove the text between the <script> and </script> tags.

George

dfiala13

By "this regex that does work", do you mean that it finds matches? It does seem to find the end of any tag (>).
But at the moment you are just writing the value of each match to your StringBuilder, so it is not removing anything. In fact, the string you return will only have the matches in it and none of the stuff you want.

I have to run out for a bit but will look at this later and see if I can get you on the right track.

In the meantime, try to implement the code snippet I sent you with your regex. It removes the matches from the original string.
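
To illustrate the difference between collecting matches and removing them, here is a minimal sketch. The class name, method names, and the simple tag pattern are made up for the example and are not taken from the code in this thread.

```csharp
using System;
using System.Text;
using System.Text.RegularExpressions;

public static class MatchesVersusReplace
{
    // Appends every match to a StringBuilder: the result is ONLY the matched text.
    public static string CollectMatches(string input, string pattern)
    {
        StringBuilder sb = new StringBuilder();
        foreach (Match m in Regex.Matches(input, pattern))
            sb.Append(m.Value);
        return sb.ToString();
    }

    // Replaces every match with an empty string: the result is the input minus the matches.
    public static string RemoveMatches(string input, string pattern)
    {
        return Regex.Replace(input, pattern, "");
    }

    public static void Main()
    {
        string html = "<b>Hello</b> <i>world</i>";
        string tagPattern = @"<[^>]*>"; // hypothetical, deliberately simple tag pattern

        Console.WriteLine(CollectMatches(html, tagPattern)); // prints <b></b><i></i>
        Console.WriteLine(RemoveMatches(html, tagPattern));  // prints Hello world
    }
}
```

The same pattern gives opposite results depending on whether you loop over Regex.Matches or call Regex.Replace.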

Tourist_Search

ASKER

Hi dfiala13

If for example I have <span class="colour">Coloured text here</span>

It removes the <span class="colour"> and </span> to leave "Coloured text here", which is what I'm after.

But it will not remove any code between <script>this still remains</script>
Or <script> {
this too remains
}
</script>

I have tried the snippet you sent but still receive an error. Deep subject.

George

dfiala13

OK,

That's because you are asking it to do two different things:

For <script> tags, remove the whole element, including its contents.
For other tags, just remove the tags and leave the contents.

Also, running your current regex I get dangling closing angle brackets (>) throughout the output.

So I changed your code a bit, and this leaves only the text between the HTML tags.

                  foreach (Match match in Regex.Matches(HtmlData, @">(?:(?<c>[^<]+))", RegexOptions.IgnoreCase|RegexOptions.Singleline))
                        strTextBuilder.Append(match.Value.Substring(1).Trim());

Now, I'll see if I can get rid of the scripts.

Given that the code above removes your HTML tags, it makes sense to kill the scripts first.

HtmlData = Regex.Replace(HtmlData, @"<script.*?</script>", ev, RegexOptions.IgnoreCase|RegexOptions.Singleline);

This calls the Replacer method (through the MatchEvaluator delegate ev), which right now returns an empty string no matter what you send it, but you could add more logic.

Or you could simply do this:

HtmlData = Regex.Replace(HtmlData, @"<script.*?</script>", "", RegexOptions.IgnoreCase|RegexOptions.Singleline);

So the final method looks like this (I left the call to the Replacer method in):

private string CleanPage(string HtmlData)
{
      StringBuilder strTextBuilder = new StringBuilder();
      MatchEvaluator ev = new MatchEvaluator(Replacer);

      // remove script tags including contents
      HtmlData = Regex.Replace(HtmlData, @"<script.*?</script>", ev, RegexOptions.IgnoreCase|RegexOptions.Singleline);

      // remove remaining tags, leaving contents
      foreach (Match match in Regex.Matches(HtmlData, @">(?:(?<c>[^<]+))", RegexOptions.IgnoreCase|RegexOptions.Singleline))
            strTextBuilder.Append(match.Value.Substring(1).Trim());

      // return our cleaned string
      return strTextBuilder.ToString();
}

private string Replacer(Match m)
{
      // do something fancier, or for now just return an empty string
      return "";
}
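
As an idea of what "more logic" in the evaluator could look like, here is a rough sketch that handles scripts and ordinary tags in a single pass. The class name, the combined pattern, and the choice to replace ordinary tags with a space are my own assumptions for illustration, and the sketch assumes the code lives in a compiled class file.

```csharp
using System;
using System.Text.RegularExpressions;

public static class TagReplacer
{
    // Hypothetical combined pattern: a whole script element, or any single tag.
    // The script alternative is listed first so it wins at a <script ...> position.
    private const string Pattern = @"<script.*?</script>|<[^>]*>";

    // Decides the replacement per match: scripts vanish, other tags become one space
    // so that words on either side of a tag do not run together.
    private static string Replacer(Match m)
    {
        bool isScript = m.Value.StartsWith("<script", StringComparison.OrdinalIgnoreCase);
        return isScript ? "" : " ";
    }

    public static string Clean(string html)
    {
        MatchEvaluator ev = new MatchEvaluator(Replacer);
        return Regex.Replace(html, Pattern, ev,
            RegexOptions.IgnoreCase | RegexOptions.Singleline);
    }

    public static void Main()
    {
        string html = "<p>One</p><script>var x = 1;</script><p>Two</p>";
        Console.WriteLine("[" + Clean(html) + "]"); // prints [ One  Two ]
    }
}
```

The leftover spaces would still need collapsing and trimming afterwards.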

Tourist_Search

ASKER

Hi dfiala13

When I run the following script it works OK, but once I add all of it I keep getting the error newline in constant

foreach (Match match in Regex.Matches(HtmlData, @">(?:(?<c>[^<]+))", RegexOptions.IgnoreCase|RegexOptions.Singleline))
strTextBuilder.Append(match.Value.Substring(1).Trim());

George

dfiala13

Where do you get the error?

Try cutting and pasting the code back into place.

Failing that, post all your code and the specific error message.

Tourist_Search

ASKER

Hi dfiala13

Script below:

<script runat="server">
string strTitle, strDescription, strText, strKeyWords;
void Page_Load(Object Src, EventArgs E)
{
string htmlData=Request.Form["HtmlData"];
strTitle=Regex.Match(htmlData, @"(?<=<title>).*?(?=</title>)", RegexOptions.IgnoreCase|RegexOptions.ExplicitCapture).Value; // Find title and keep text then remove tags
strDescription=Regex.Match(htmlData, @"(?<=<META\s+name=""description""\s+content="").*?(?=""\s*?>)|(?<=<META\s+name=""description""\s+content="").*?(?=""\s*?/>)|(?<=content="").*?(?=name=""Description""/>)|(?<=content="").*?(?=name=""Description"">)", RegexOptions.IgnoreCase|RegexOptions.ExplicitCapture).Value; // Find Description then keep text
strKeyWords=Regex.Match(htmlData, @"(?<=<META\s+name=""keywords""\s+content="").*?(?=""\s*?>)|(?<=<META\s+name=""keywords""\s+content="").*?(?=""\s*?/>)|(?<=content="").*?(?=name=""keywords""/>)|(?<=content="").*?(?=name=""keywords"">)", RegexOptions.IgnoreCase|RegexOptions.ExplicitCapture).Value; // Find Keywords then keep text
string CleanPage(string htmlData)
{
StringBuilder strTextBuilder=new StringBuilder();
MatchEvaluator ev = new MatchEvaluator(Replacer);

//remove script tags including contents
htmlData = Regex.Replace(htmlData, @"<script.*?</script>", ev, RegexOptions.IgnoreCase|RegexOptions.Singleline);

//remove remaining tags, leaving contents
foreach (Match match in Regex.Matches(htmlData, @">(?:(?<c>[^<]+))", RegexOptions.IgnoreCase|RegexOptions.Singleline))
strTextBuilder.Append(match.Value.Substring(1).Trim());

//return our cleaned string
return strTextBuilder.ToString();
}

private string Replacer(Match m)
{
//do something fancier, or for now just return an empty string
return "";
}
{

Session["AddressID"] = Request.Form["InsertAddress"];
Session["PropertyID"] = Request.Form["PropertyID"];
Session["URLID"]= Request.Form["URL"];
Session["CatID"] = Request.Form["Category"];
Session["CountryID"] = Request.Form["Country"];
Session["CityID"] = Request.Form["City"];
Session["TitleID"] = strTitle;
Session["MetaDescriptionID"] = strDescription;
Session["BodyTextID"] = Request.Form["InsertAddress"] + " " + strText;
Session["KeywordsID"] = strKeyWords;

Error message is:

Compiler Error Message: CS1010: Newline in constant

Source Error:

Line 26: //return our cleaned string
Line 27: return strTextBuilder.ToString();
Line 28: }
Line 29:
Line 30: private string Replacer(Match m)

Source File: C:\**.aspx Line: 28

}
</script>

Thanks
George

dfiala13

Well, you have methods nested inside your Page_Load event handler method. Methods call other methods; they can't be declared inside one another.

Let's see if we can clean this up...

<script runat="server">
string strTitle, strDescription, strText, strKeyWords;

void Page_Load(Object Src, EventArgs E)
{
Session["AddressID"] = Request.Form["InsertAddress"];
Session["PropertyID"] = Request.Form["PropertyID"];
Session["URLID"]= Request.Form["URL"];
Session["CatID"] = Request.Form["Category"];
Session["CountryID"] = Request.Form["Country"];
Session["CityID"] = Request.Form["City"];
Session["TitleID"] = strTitle;
Session["MetaDescriptionID"] = strDescription;
Session["BodyTextID"] = Request.Form["InsertAddress"] + " " + strText;
Session["KeywordsID"] = strKeyWords;

string htmlData=Request.Form["HtmlData"];
strTitle=Regex.Match(htmlData, @"(?<=<title>).*?(?=</title>)", RegexOptions.IgnoreCase|RegexOptions.ExplicitCapture).Value; // Find title and keep text then remove tags
strDescription=Regex.Match(htmlData, @"(?<=<META\s+name=""description""\s+content="").*?(?=""\s*?>)|(?<=<META\s+name=""description""\s+content="").*?(?=""\s*?/>)|(?<=content="").*?(?=name=""Description""/>)|(?<=content="").*?(?=name=""Description"">)", RegexOptions.IgnoreCase|RegexOptions.ExplicitCapture).Value; // Find Description then keep text
strKeyWords=Regex.Match(htmlData, @"(?<=<META\s+name=""keywords""\s+content="").*?(?=""\s*?>)|(?<=<META\s+name=""keywords""\s+content="").*?(?=""\s*?/>)|(?<=content="").*?(?=name=""keywords""/>)|(?<=content="").*?(?=name=""keywords"">)", RegexOptions.IgnoreCase|RegexOptions.ExplicitCapture).Value; // Find Keywords then keep text

htmlData = CleanPage(htmlData);

}

string CleanPage(string htmlData)
{
StringBuilder strTextBuilder=new StringBuilder();
MatchEvaluator ev = new MatchEvaluator(Replacer);

//remove script tags including contents
htmlData = Regex.Replace(htmlData, @"<script.*?</script>", ev, RegexOptions.IgnoreCase|RegexOptions.Singleline);

//remove remaining tags, leaving contents
foreach (Match match in Regex.Matches(htmlData, @">(?:(?<c>[^<]+))", RegexOptions.IgnoreCase|RegexOptions.Singleline))
strTextBuilder.Append(match.Value.Substring(1).Trim());

//return our cleaned string
return strTextBuilder.ToString();
}

private string Replacer(Match m)
{
//do something fancier, or for now just return an empty string
return "";
}

Tourist_Search

ASKER

Hi

Still receive same error message.

Compiler Error Message: CS1010: Newline in constant

Source Error:

Line 43: //return our cleaned string
Line 44: return strTextBuilder.ToString();
Line 45: }
Line 46:
Line 47: private string Replacer(Match m)

Source File: C:\***.aspx Line: 45

dfiala13

Delete lines 45 and 46, then add the } back in.

You can also try putting the code into a new file.

If that fails, start with a blank file and add things method by method and line by line until it breaks.

Something in your code is malformed, which is hard to diagnose via EE. The two methods I sent you work, so there is something else in your script that is off.

Tourist_Search

ASKER

Hi dfiala13

This section of code works.

<script runat="server">
string strTitle, strDescription, strText, strKeyWords;

void Page_Load(Object Src, EventArgs E)
{
Session["InsertAddress"] = Request.Form["InsertAddress"];
Session["InsertPropertyID"] = Request.Form["PropertyID"];
Session["InsertURL"]= Request.Form["URL"];
Session["InsertCatID"] = Request.Form["Category"];
Session["InsertCountryID"] = Request.Form["Country"];
Session["InsertCityID"] = Request.Form["City"];
Session["PenaltyID"] = Request.Form["County"];
Session["InsertTitle"] = strTitle;
Session["InsertMetaDescription"] = strDescription;
Session["InsertBodyText"] = Request.Form["InsertAddress"] + " " + strText;
Session["InsertKeywords"] = strKeyWords;
string htmlData=Request.Form["HtmlData"];
strTitle=Regex.Match(htmlData, @"(?<=<title>).*?(?=</title>)", RegexOptions.IgnoreCase|RegexOptions.ExplicitCapture).Value; // Find title and keep text then remove tags
strDescription=Regex.Match(htmlData, @"(?<=<META\s+name=""description""\s+content="").*?(?=""\s*?>)|(?<=<META\s+name=""description""\s+content="").*?(?=""\s*?/>)|(?<=content="").*?(?=name=""Description""/>)|(?<=content="").*?(?=name=""Description"">)", RegexOptions.IgnoreCase|RegexOptions.ExplicitCapture).Value; // Find Description then keep text
strKeyWords=Regex.Match(htmlData, @"(?<=<META\s+name=""keywords""\s+content="").*?(?=""\s*?>)|(?<=<META\s+name=""keywords""\s+content="").*?(?=""\s*?/>)|(?<=content="").*?(?=name=""keywords""/>)|(?<=content="").*?(?=name=""keywords"">)", RegexOptions.IgnoreCase|RegexOptions.ExplicitCapture).Value; // Find Keywords then keep text

StringBuilder strTextBuilder=new StringBuilder();
foreach (Match match in Regex.Matches(htmlData, @">(?:(?<c>[^<]+))", RegexOptions.IgnoreCase|RegexOptions.Singleline))
strTextBuilder.Append(match.Value.Substring(1).Trim());
strText=strTextBuilder.ToString();

// I can add the above and works OK removes the >
// But when try to put it all into place the error occurs

}
</script>

But when I try to add this section the error occurs.

htmlData = CleanPage(htmlData);

}

string CleanPage(string htmlData)
{

MatchEvaluator ev = new MatchEvaluator(Replacer);

//remove script tags including contents
htmlData = Regex.Replace(htmlData, @"<script.*?</script>", ev, RegexOptions.IgnoreCase|RegexOptions.Singleline);

//remove remaining tags, leaving contents
foreach (Match match in Regex.Matches(htmlData, @">(?:(?<c>[^<]+))", RegexOptions.IgnoreCase|RegexOptions.Singleline))
strTextBuilder.Append(match.Value.Substring(1).Trim());

//return our cleaned string
return strTextBuilder.ToString();
}

private string Replacer(Match m)
{
//do something fancier, or for now just return an empty string
return "";
}

Error message:

Compiler Error Message: CS1010: Newline in constant

Source Error:

Line 44: //return our cleaned string
Line 45: return strTextBuilder.ToString();
Line 46: }
Line 47:
Line 48: private string Replacer(Match m)

Source File: C:\****.aspx Line: 46

dfiala13

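The long and the short of it is that you can't run this from an inline <script runat="server"> block.

The line

htmlData = Regex.Replace(htmlData, @"<script.*?</script>", ev, RegexOptions.IgnoreCase|RegexOptions.Singleline);

blows up because of the </script> tag in the pattern. The page parser doesn't care that it is inside a string literal: it treats it as the end of the script block, so the compiler is handed a string that is never closed, hence "Newline in constant".

Move this code into code-behind and you'll be all set.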

string strTitle, strDescription, strText, strKeyWords;

void Page_Load(Object Src, EventArgs E)
{

Session["InsertAddress"] = Request.Form["InsertAddress"];
Session["InsertPropertyID"] = Request.Form["PropertyID"];
Session["InsertURL"]= Request.Form["URL"];
Session["InsertCatID"] = Request.Form["Category"];
Session["InsertCountryID"] = Request.Form["Country"];
Session["InsertCityID"] = Request.Form["City"];
Session["PenaltyID"] = Request.Form["County"];
Session["InsertTitle"] = strTitle;
Session["InsertMetaDescription"] = strDescription;
Session["InsertBodyText"] = Request.Form["InsertAddress"] + " " + strText;
Session["InsertKeywords"] = strKeyWords;
string htmlData=Request.Form["HtmlData"];
strTitle=Regex.Match(htmlData, @"(?<=<title>).*?(?=</title>)", RegexOptions.IgnoreCase|RegexOptions.ExplicitCapture).Value; // Find title and keep text then remove tags
strDescription=Regex.Match(htmlData, @"(?<=<META\s+name=""description""\s+content="").*?(?=""\s*?>)|(?<=<META\s+name=""description""\s+content="").*?(?=""\s*?/>)|(?<=content="").*?(?=name=""Description""/>)|(?<=content="").*?(?=name=""Description"">)", RegexOptions.IgnoreCase|RegexOptions.ExplicitCapture).Value; // Find Description then keep text
strKeyWords=Regex.Match(htmlData, @"(?<=<META\s+name=""keywords""\s+content="").*?(?=""\s*?>)|(?<=<META\s+name=""keywords""\s+content="").*?(?=""\s*?/>)|(?<=content="").*?(?=name=""keywords""/>)|(?<=content="").*?(?=name=""keywords"">)", RegexOptions.IgnoreCase|RegexOptions.ExplicitCapture).Value; // Find Keywords then keep text

htmlData = CleanPage(htmlData);
}

private string CleanPage(string htmlData)
{
StringBuilder strTextBuilder = new StringBuilder();

// remove script tags including contents
htmlData = Regex.Replace(htmlData, @"<script.*?</script>", "", RegexOptions.IgnoreCase|RegexOptions.Singleline);

// remove remaining tags, leaving contents
foreach (Match m in Regex.Matches(htmlData, @">(?:(?<c>[^<]+))", RegexOptions.IgnoreCase|RegexOptions.Singleline))
strTextBuilder.Append(m.Value.Substring(1).Trim());

return strTextBuilder.ToString();
}
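
If moving to code-behind is not an option, one workaround that is often suggested for the </script> problem is to build the pattern from two pieces so the closing tag never appears verbatim in the page source. The sketch below shows the idea; the class and member names are made up for illustration, and I have not verified it against every ASP.NET version, so treat it as an assumption to test rather than a guaranteed fix.

```csharp
using System;
using System.Text.RegularExpressions;

public static class ScriptStripper
{
    // The closing tag is split across two literals, so the page source never
    // contains it as one piece. The concatenated pattern is unchanged.
    public static readonly string Pattern = @"<script.*?</scr" + @"ipt>";

    // Removes every script element, including its contents.
    public static string StripScripts(string html)
    {
        return Regex.Replace(html, Pattern, "",
            RegexOptions.IgnoreCase | RegexOptions.Singleline);
    }

    public static void Main()
    {
        string html = "before<script type=\"text/javascript\">\nalert(1);\n</script>after";
        Console.WriteLine(StripScripts(html)); // prints beforeafter
    }
}
```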

Tourist_Search

ASKER

Hi dfiala13

I still received the same error, but I took note of what you said and have spent the weekend reading the book.

I have come up with a solution to the problem, which is below.

strCleanHTML = Regex.Replace(strCleanHTML, @"(?i)<script([^>])*>(\w|\W)*</script([^>])*>", " ");

That seems to have done the trick for the script.
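
For anyone finding this later, here is a rough sketch of how the pattern above could be combined with tag removal and whitespace clean-up to get the neat text I was originally after. The class name, the simple tag pattern, and the whitespace step are my own additions for illustration. I also made the middle of the script pattern lazy (*? instead of *), on the assumption that a page may contain several script blocks: the greedy version would also remove any text between the first and the last block.

```csharp
using System;
using System.Text.RegularExpressions;

public static class HtmlToText
{
    public static string Clean(string html)
    {
        // Lazy variant of the script pattern: stops at the nearest closing tag.
        string noScripts = Regex.Replace(
            html, @"(?i)<script([^>])*>(\w|\W)*?</script([^>])*>", " ");

        // Simple tag pattern (an assumption; it does not handle a > inside a quoted attribute).
        string noTags = Regex.Replace(noScripts, @"<[^>]*>", " ");

        // Collapse runs of white space into single spaces and trim the ends.
        return Regex.Replace(noTags, @"\s+", " ").Trim();
    }

    public static void Main()
    {
        string html = "<p>One</p><script>a();</script>"
                    + "<span class=\"colour\">Two</span><script>\nb();\n</script> Three";
        Console.WriteLine(Clean(html)); // prints One Two Three
    }
}
```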

Thanks for pointing me in the right direction.

George