Tourist_Search
asked on
Regular Expressions
Hi
Can anyone help with this regular expression, which I found while searching through the PAQs?
Regex.Replace(HtmlData, @"\s+|\s*((<script>((?!</script>).)+</script>|<([^>]|""[^""]*""|'[^']*')*>)\s*)+", " ").Trim();
When I try to use it I receive the following error.
Description: An error occurred during the compilation of a resource required to service this request. Please review the following specific error details and modify your source code appropriately.
Compiler Error Message: CS1010: Newline in constant
Source Error:
Line 29: Session["InsertCatID"] = Request.Form["Category"];
Line 30: Session["InsertCountryID"] = Request.Form["Country"];
Line 31: Session["InsertCityID"] = Request.Form["City"];
Line 32: Session["PenaltyID"] = Request.Form["County"];
Line 33: Session["InsertTitle"] = strTitle;
Source File: C:\**.aspx Line: 31
Line 31 has nothing to do with this bit of code.
I'm trying to use the code below:
StringBuilder strTextBuilder=new StringBuilder();
foreach (Match match in Regex.Replace(HtmlData, @"\s+|\s*((<script>((?!</script>).)+</script>|<([^>]|""[^""]*""|'[^']*')*>)\s*)+", " ").Trim(); RegexOptions.IgnoreCase|RegexOptions.Singleline))
strTextBuilder.Append(match.Value); // use match.Groups["content"].Value to get rid of the tag
strText=strTextBuilder.ToString();
What I need to achieve is to remove all text between < and >, and also the text between <script>code here</script> or similar, then remove excess white space so that the text is presented neatly.
Big job, and I have no idea how to do it, even after reading Mastering Regular Expressions.
Any help would be appreciated
George
By "this regex that does work", do you mean it finds matches? It does seem to find the end of any tag (>).
But at the moment you are just writing the value of what matched to your string builder, so it is not removing anything. In fact, the string you return will contain only the matches and none of the text you want.
I have to run out for a bit but will look at this later and see if I can get you on the right track.
In the meantime try to implement the code snippet I sent you with your regex, it removes the matches from the original string.
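To make the distinction above concrete, here is a minimal sketch. The tag pattern and the class name MatchVersusReplace are assumptions made up for this illustration, not the regex from the question: appending match.Value collects what matched, while Regex.Replace returns the input with the matches removed.

```csharp
using System;
using System.Text;
using System.Text.RegularExpressions;

static class MatchVersusReplace
{
    // Hypothetical, deliberately simple tag pattern used only for this illustration.
    const string TagPattern = @"<[^>]+>";

    // Appending match.Value keeps the text that matched (here: the tags themselves).
    public static string CollectMatches(string html)
    {
        StringBuilder sb = new StringBuilder();
        foreach (Match match in Regex.Matches(html, TagPattern))
            sb.Append(match.Value);
        return sb.ToString();
    }

    // Regex.Replace returns the input with every match replaced by the empty string.
    public static string RemoveMatches(string html)
    {
        return Regex.Replace(html, TagPattern, "");
    }

    static void Main()
    {
        string html = "<span class=\"colour\">Coloured text here</span>";
        Console.WriteLine(CollectMatches(html)); // the two tags, no text
        Console.WriteLine(RemoveMatches(html));  // the text, no tags
    }
}
```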
ASKER
Hi dfiala13
If for example I have <span class="colour">Coloured text here</span>
It removes the <span class="colour"> and </span> to leave "Coloured text here", which is what I'm after.
But it will not remove any code between <script>this still remains</script>
Or <script> {
this too remains
}
</script>
I have tried the snippet you sent but still receive an error. Deep subject.
George
OK,
That's because you are asking it to do two different things:
For <script> tags, remove the whole element, including its contents.
For other tags, remove just the tags and leave the contents.
Also, running your current regex I get dangling end angle brackets (>) throughout the source.
So I changed your code a bit, and this leaves only the text between the HTML tags.
foreach (Match match in Regex.Matches(HtmlData, @">(?:(?<c>[^<]+))", RegexOptions.IgnoreCase|RegexOptions.Singleline))
strTextBuilder.Append(match.Value.Substring(1).Trim());
Now, I'll see if I can get rid of the scripts.
Given that the code above removes your html tags, makes sense to kill the scripts first.
HtmlData = Regex.Replace(HtmlData, @"<script.*?</script>", ev, RegexOptions.IgnoreCase|RegexOptions.Singleline);
This calls the Replacer method (through the MatchEvaluator ev), which right now returns an empty string no matter what you send it, but you could add more logic.
Or you could also do this...
HtmlData = Regex.Replace(HtmlData, @"<script.*?</script>", "", RegexOptions.IgnoreCase|RegexOptions.Singleline);
So the final method looks like this (I left the call to the Replacer method in)...
private string CleanPage(string HtmlData)
{
StringBuilder strTextBuilder=new StringBuilder();
MatchEvaluator ev = new MatchEvaluator(Replacer);
//remove script tags including contents
HtmlData = Regex.Replace(HtmlData, @"<script.*?</script>", ev, RegexOptions.IgnoreCase|RegexOptions.Singleline);
//remove remaining tags, leaving contents
foreach (Match match in Regex.Matches(HtmlData, @">(?:(?<c>[^<]+))", RegexOptions.IgnoreCase|RegexOptions.Singleline))
strTextBuilder.Append(match.Value.Substring(1).Trim());
//return our cleaned string
return strTextBuilder.ToString();
}
private string Replacer(Match m)
{
//do something fancier, or for now just return an empty string
return "";
}
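As a hedged alternative to collecting the text after each > character, the same goal can be sketched with three replacements: drop script blocks, turn every remaining tag into a space, then collapse whitespace. The class name HtmlTextSketch and the simple <[^>]+> tag pattern are assumptions for illustration. The pattern does not handle a > inside an attribute value, and the sketch is meant for a compiled class rather than an inline page script.

```csharp
using System;
using System.Text.RegularExpressions;

static class HtmlTextSketch
{
    public static string CleanPage(string htmlData)
    {
        RegexOptions opts = RegexOptions.IgnoreCase | RegexOptions.Singleline;

        // Remove script blocks together with their contents.
        string text = Regex.Replace(htmlData, @"<script.*?</script>", " ", opts);

        // Replace each remaining tag with a space so neighbouring words stay separated.
        text = Regex.Replace(text, @"<[^>]+>", " ", opts);

        // Collapse runs of whitespace into single spaces and trim both ends.
        return Regex.Replace(text, @"\s+", " ").Trim();
    }

    static void Main()
    {
        string html = "Intro <b>bold</b><script>var x = 1;</script> <p>done</p>";
        Console.WriteLine(CleanPage(html));
    }
}
```

Unlike the > based loop, this also keeps text that appears before the first tag, and it puts a space where a tag used to separate two words.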
ASKER
Hi dfiala13
When I run the following script it works OK, but once I add all of it I keep getting the error newline in constant
foreach (Match match in Regex.Matches(HtmlData, @">(?:(?<c>[^<]+))", RegexOptions.IgnoreCase|RegexOptions.Singleline))
strTextBuilder.Append(match.Value.Substring(1).Trim());
George
Where do you get the error?
Try cutting and pasting the code back into place.
Failing that, post all your code and the specific error message.
ASKER
Hi dfiala13
Script below:
<script runat="server">
string strTitle, strDescription, strText, strKeyWords;
void Page_Load(Object Src, EventArgs E)
{
string htmlData=Request.Form["HtmlData"];
strTitle=Regex.Match(htmlData, @"(?<=<title>).*?(?=</title>)", RegexOptions.IgnoreCase|RegexOptions.ExplicitCapture).Value; // Find title and keep text then remove tags
strDescription=Regex.Match(htmlData, @"(?<=<META\s+name=""description""\s+content="").*?(?=""\s*?>)|(?<=<META\s+name=""description""\s+content="").*?(?=""\s*?/>)|(?<=content="").*?(?=name=""Description""/>)|(?<=content="").*?(?=name=""Description"">)", RegexOptions.IgnoreCase|RegexOptions.ExplicitCapture).Value; // Find Description then keep text
strKeyWords=Regex.Match(htmlData, @"(?<=<META\s+name=""keywords""\s+content="").*?(?=""\s*?>)|(?<=<META\s+name=""keywords""\s+content="").*?(?=""\s*?/>)|(?<=content="").*?(?=name=""keywords""/>)|(?<=content="").*?(?=name=""keywords"">)", RegexOptions.IgnoreCase|RegexOptions.ExplicitCapture).Value; // Find Keywords then keep text
string CleanPage(string htmlData)
{
StringBuilder strTextBuilder=new StringBuilder();
MatchEvaluator ev = new MatchEvaluator(Replacer);
//remove script tags including contents
htmlData = Regex.Replace(htmlData, @"<script.*?</script>", ev, RegexOptions.IgnoreCase|RegexOptions.Singleline);
//remove remaining tags, leaving contents
foreach (Match match in Regex.Matches(htmlData, @">(?:(?<c>[^<]+))", RegexOptions.IgnoreCase|RegexOptions.Singleline))
strTextBuilder.Append(match.Value.Substring(1).Trim());
//return our cleaned string
return strTextBuilder.ToString();
}
private string Replacer(Match m)
{
//do something fancier, or for now just return an empty string
return "";
}
{
Session["AddressID"] = Request.Form["InsertAddress"];
Session["PropertyID"] = Request.Form["PropertyID"];
Session["URLID"]= Request.Form["URL"];
Session["CatID"] = Request.Form["Category"];
Session["CountryID"] = Request.Form["Country"];
Session["CityID"] = Request.Form["City"];
Session["TitleID"] = strTitle;
Session["MetaDescriptionID"] = strDescription;
Session["BodyTextID"] = Request.Form["InsertAddress"] + " " + strText;
Session["KeywordsID"] = strKeyWords;
Error message is:
Compiler Error Message: CS1010: Newline in constant
Source Error:
Line 26: //return our cleaned string
Line 27: return strTextBuilder.ToString();
Line 28: }
Line 29:
Line 30: private string Replacer(Match m)
Source File: C:\**.aspx Line: 28
}
</script>
Thanks
George
Well, you have methods nested inside your Page_Load event handler. Methods can call other methods, but they should not be declared inside one another like this.
Let's see if we can clean this up...
<script runat="server">
string strTitle, strDescription, strText, strKeyWords;
void Page_Load(Object Src, EventArgs E)
{
Session["AddressID"] = Request.Form["InsertAddress"];
Session["PropertyID"] = Request.Form["PropertyID"];
Session["URLID"]= Request.Form["URL"];
Session["CatID"] = Request.Form["Category"];
Session["CountryID"] = Request.Form["Country"];
Session["CityID"] = Request.Form["City"];
Session["TitleID"] = strTitle;
Session["MetaDescriptionID"] = strDescription;
Session["BodyTextID"] = Request.Form["InsertAddress"] + " " + strText;
Session["KeywordsID"] = strKeyWords;
string htmlData=Request.Form["HtmlData"];
strTitle=Regex.Match(htmlData, @"(?<=<title>).*?(?=</title>)", RegexOptions.IgnoreCase|RegexOptions.ExplicitCapture).Value; // Find title and keep text then remove tags
strDescription=Regex.Match(htmlData, @"(?<=<META\s+name=""description""\s+content="").*?(?=""\s*?>)|(?<=<META\s+name=""description""\s+content="").*?(?=""\s*?/>)|(?<=content="").*?(?=name=""Description""/>)|(?<=content="").*?(?=name=""Description"">)", RegexOptions.IgnoreCase|RegexOptions.ExplicitCapture).Value; // Find Description then keep text
strKeyWords=Regex.Match(htmlData, @"(?<=<META\s+name=""keywords""\s+content="").*?(?=""\s*?>)|(?<=<META\s+name=""keywords""\s+content="").*?(?=""\s*?/>)|(?<=content="").*?(?=name=""keywords""/>)|(?<=content="").*?(?=name=""keywords"">)", RegexOptions.IgnoreCase|RegexOptions.ExplicitCapture).Value; // Find Keywords then keep text
htmlData = CleanPage(htmlData);
}
string CleanPage(string htmlData)
{
StringBuilder strTextBuilder=new StringBuilder();
MatchEvaluator ev = new MatchEvaluator(Replacer);
//remove script tags including contents
htmlData = Regex.Replace(htmlData, @"<script.*?</script>", ev, RegexOptions.IgnoreCase|RegexOptions.Singleline);
//remove remaining tags, leaving contents
foreach (Match match in Regex.Matches(htmlData, @">(?:(?<c>[^<]+))", RegexOptions.IgnoreCase|RegexOptions.Singleline))
strTextBuilder.Append(match.Value.Substring(1).Trim());
//return our cleaned string
return strTextBuilder.ToString();
}
private string Replacer(Match m)
{
//do something fancier, or for now just return an empty string
return "";
}
ASKER
Hi
Still receive same error message.
Compiler Error Message: CS1010: Newline in constant
Source Error:
Line 43: //return our cleaned string
Line 44: return strTextBuilder.ToString();
Line 45: }
Line 46:
Line 47: private string Replacer(Match m)
Source File: C:\***.aspx Line: 45
Delete lines 45 and 46 and add the } back in.
You can also try putting the code into a new file.
If that fails, start with a blank file and add things back method by method and line by line until it breaks.
You have something malformed in your code, which is hard to diagnose via EE. The two methods I sent you work, so there is something else in your script that is off.
ASKER
Hi dfiala13
This section of code works.
<script runat="server">
string strTitle, strDescription, strText, strKeyWords;
void Page_Load(Object Src, EventArgs E)
{
Session["InsertAddress"] = Request.Form["InsertAddress"];
Session["InsertPropertyID"] = Request.Form["PropertyID"];
Session["InsertURL"]= Request.Form["URL"];
Session["InsertCatID"] = Request.Form["Category"];
Session["InsertCountryID"] = Request.Form["Country"];
Session["InsertCityID"] = Request.Form["City"];
Session["PenaltyID"] = Request.Form["County"];
Session["InsertTitle"] = strTitle;
Session["InsertMetaDescription"] = strDescription;
Session["InsertBodyText"] = Request.Form["InsertAddress"] + " " + strText;
Session["InsertKeywords"] = strKeyWords;
string htmlData=Request.Form["HtmlData"];
strTitle=Regex.Match(htmlData, @"(?<=<title>).*?(?=</title>)", RegexOptions.IgnoreCase|RegexOptions.ExplicitCapture).Value; // Find title and keep text then remove tags
strDescription=Regex.Match(htmlData, @"(?<=<META\s+name=""description""\s+content="").*?(?=""\s*?>)|(?<=<META\s+name=""description""\s+content="").*?(?=""\s*?/>)|(?<=content="").*?(?=name=""Description""/>)|(?<=content="").*?(?=name=""Description"">)", RegexOptions.IgnoreCase|RegexOptions.ExplicitCapture).Value; // Find Description then keep text
strKeyWords=Regex.Match(htmlData, @"(?<=<META\s+name=""keywords""\s+content="").*?(?=""\s*?>)|(?<=<META\s+name=""keywords""\s+content="").*?(?=""\s*?/>)|(?<=content="").*?(?=name=""keywords""/>)|(?<=content="").*?(?=name=""keywords"">)", RegexOptions.IgnoreCase|RegexOptions.ExplicitCapture).Value; // Find Keywords then keep text
StringBuilder strTextBuilder=new StringBuilder();
foreach (Match match in Regex.Matches(htmlData, @">(?:(?<c>[^<]+))", RegexOptions.IgnoreCase|RegexOptions.Singleline))
strTextBuilder.Append(match.Value.Substring(1).Trim());
strText=strTextBuilder.ToString();
// I can add the above and works OK removes the >
// But when try to put it all into place the error occurs
}
</script>
But when I try to add this section the error occurs.
htmlData = CleanPage(htmlData);
}
string CleanPage(string htmlData)
{
MatchEvaluator ev = new MatchEvaluator(Replacer);
//remove script tags including contents
htmlData = Regex.Replace(htmlData, @"<script.*?</script>", ev, RegexOptions.IgnoreCase|RegexOptions.Singleline);
//remove remaining tags, leaving contents
foreach (Match match in Regex.Matches(htmlData, @">(?:(?<c>[^<]+))", RegexOptions.IgnoreCase|RegexOptions.Singleline))
strTextBuilder.Append(match.Value.Substring(1).Trim());
//return our cleaned string
return strTextBuilder.ToString();
}
private string Replacer(Match m)
{
//do something fancier, or for now just return an empty string
return "";
}
Error message:
Compiler Error Message: CS1010: Newline in constant
Source Error:
Line 44: //return our cleaned string
Line 45: return strTextBuilder.ToString();
Line 46: }
Line 47:
Line 48: private string Replacer(Match m)
Source File: C:\****.aspx Line: 46
The long and the short of it is that you can't run this in an inline server-side script block.
The htmlData = Regex.Replace(htmlData, @"<script.*?</script>", ev, RegexOptions.IgnoreCase|RegexOptions.Singleline);
line blows up because of the </script> tag. The page parser treats it as the end of the script block and doesn't care that it is inside a string literal, so the literal is cut short and you get "Newline in constant".
Move this code into code-behind and you'll be all set.
string strTitle, strDescription, strText, strKeyWords;
void Page_Load(Object Src, EventArgs E)
{
Session["InsertAddress"] = Request.Form["InsertAddress"];
Session["InsertPropertyID"] = Request.Form["PropertyID"];
Session["InsertURL"]= Request.Form["URL"];
Session["InsertCatID"] = Request.Form["Category"];
Session["InsertCountryID"] = Request.Form["Country"];
Session["InsertCityID"] = Request.Form["City"];
Session["PenaltyID"] = Request.Form["County"];
Session["InsertTitle"] = strTitle;
Session["InsertMetaDescription"] = strDescription;
Session["InsertBodyText"] = Request.Form["InsertAddress"] + " " + strText;
Session["InsertKeywords"] = strKeyWords;
string htmlData=Request.Form["HtmlData"];
strTitle=Regex.Match(htmlData, @"(?<=<title>).*?(?=</title>)", RegexOptions.IgnoreCase|RegexOptions.ExplicitCapture).Value; // Find title and keep text then remove tags
strDescription=Regex.Match(htmlData, @"(?<=<META\s+name=""description""\s+content="").*?(?=""\s*?>)|(?<=<META\s+name=""description""\s+content="").*?(?=""\s*?/>)|(?<=content="").*?(?=name=""Description""/>)|(?<=content="").*?(?=name=""Description"">)", RegexOptions.IgnoreCase|RegexOptions.ExplicitCapture).Value; // Find Description then keep text
strKeyWords=Regex.Match(htmlData, @"(?<=<META\s+name=""keywords""\s+content="").*?(?=""\s*?>)|(?<=<META\s+name=""keywords""\s+content="").*?(?=""\s*?/>)|(?<=content="").*?(?=name=""keywords""/>)|(?<=content="").*?(?=name=""keywords"">)", RegexOptions.IgnoreCase|RegexOptions.ExplicitCapture).Value; // Find Keywords then keep text
htmlData = CleanPage(htmlData);
}
private string CleanPage(string htmlData)
{
StringBuilder strTextBuilder = new StringBuilder();
htmlData = Regex.Replace(htmlData, @"<script.*?</script>", "", RegexOptions.IgnoreCase|RegexOptions.Singleline);
foreach (Match m in Regex.Matches(htmlData, @">(?:(?<c>[^<]+))", RegexOptions.IgnoreCase|RegexOptions.Singleline))
strTextBuilder.Append(m.Value.Substring(1).Trim());
return strTextBuilder.ToString();
}
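If the code has to stay in an inline block, one workaround that may help is to make sure the closing script tag never appears as literal text in the source. The sketch below assumes the page parser only looks for that literal text, and the class name InlineSafePattern is made up for the example. It builds the pattern from two string pieces, so the regex itself is unchanged at run time.

```csharp
using System;
using System.Text.RegularExpressions;

static class InlineSafePattern
{
    // The closing tag is assembled from two pieces so that the source text
    // never contains the literal closing script tag.
    static readonly string ScriptBlock = "<script.*?</scr" + "ipt>";

    // Removes every script element, including its contents.
    public static string RemoveScripts(string htmlData)
    {
        return Regex.Replace(htmlData, ScriptBlock, "",
            RegexOptions.IgnoreCase | RegexOptions.Singleline);
    }

    static void Main()
    {
        // The sample input is split the same way, for the same reason.
        string html = "before<script>\nvar x = 1;\n</scr" + "ipt>after";
        Console.WriteLine(RemoveScripts(html));
    }
}
```

Moving the code into code-behind, as suggested above, avoids the problem altogether and is probably the cleaner fix.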
ASKER
Hi dfiala13
I still received the same error, but I took note of what you said and have spent the weekend reading the book.
I have come up with a solution to the problem, which is below.
strCleanHTML = Regex.Replace(strCleanHTML, @"(?i)<script([^>])*>(\w|\W)*</script([^>])*>", " ");
That seems to have done the trick for the script.
Thanks for pointing me in the right direction.
George
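One thing worth checking with the pattern above: (\w|\W)* is greedy, so on a page with more than one script block it runs from the first opening tag to the last closing tag and takes the text in between with it. The sketch below compares it with a lazy variant. The lazy pattern and the class name GreedyVersusLazy are assumptions for illustration, and because the lazy pattern contains a literal closing script tag it belongs in code-behind.

```csharp
using System;
using System.Text.RegularExpressions;

static class GreedyVersusLazy
{
    // The pattern from the post above: (\w|\W)* matches as much as possible.
    const string Greedy = @"(?i)<script([^>])*>(\w|\W)*</script([^>])*>";

    // Hypothetical lazy variant: .*? stops at the first closing script tag.
    const string Lazy = @"<script[^>]*>.*?</script[^>]*>";

    public static string StripGreedy(string html)
    {
        return Regex.Replace(html, Greedy, " ");
    }

    public static string StripLazy(string html)
    {
        return Regex.Replace(html, Lazy, " ",
            RegexOptions.IgnoreCase | RegexOptions.Singleline);
    }

    static void Main()
    {
        string html = "A<script>one</script>B<script>two</script>C";
        Console.WriteLine(StripGreedy(html)); // the B between the scripts is lost
        Console.WriteLine(StripLazy(html));   // the B between the scripts is kept
    }
}
```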
ASKER
I'll have to do more reading, this is well over my head,
I have this regex that does work:
foreach (Match match in Regex.Matches(htmlData, @">(?:(?<c>[^<]+))", RegexOptions.IgnoreCase|Re
strTextBuilder.Append(matc
strText=strTextBuilder.ToS
Only problem is, it does not remove the text between the <script>code here tags</script>
George