garethtnash
asked on
MS SQL Replace Function - Remove HTML Attributes - Regex
Hello Experts,
I'm trying to write an MS SQL 2008 version of a VBScript function that cleans a string, the reason for doing it in SQL is because I want to clean the string on Insert Update, and data can be Inserted / Updated via a number of inputs. What the Function should do is -
Strip any HTML attributes, but keep the element - <p style="style1>text</p> becomes <p>text</p>
This is the section of the code below --
If the input has already been HTML encoded or double encoded, unencode it -
This section deals with this -
Remove tabs, double spaces etc -
Replace line breaks with < /br>
Once the string is clean, encode special characters -
My VBScript version of this, which is probably a little mixed up is -
How do I build this as a function in MS SQL 2008.
Appreciate your help with this.
Thank you
I'm trying to write an MS SQL 2008 version of a VBScript function that cleans a string, the reason for doing it in SQL is because I want to clean the string on Insert Update, and data can be Inserted / Updated via a number of inputs. What the Function should do is -
Strip any HTML attributes, but keep the element - <p style="style1>text</p> becomes <p>text</p>
This is the section of the code below --
.Pattern = "(<[a-z][^ >/]*)(?:[^>/]|/(?!>))*" ' keep all HTML Entities but remove attributes
strInput = .Replace(strInput, "$1")
If the input has already been HTML encoded or double encoded, unencode it -
This section deals with this -
strInput = Replace(strInput, "&amp;", "&") ' removes all ... (tab etc)
strInput = Replace(strInput, "&", "&") ' removes all ... (tab etc)
strInput = Replace(strInput, " ", " ") ' removes all ... (tab etc)
strInput = Replace(strInput, "£", "£")
strInput = Replace(strInput, "£", "£")
strInput = Replace(strInput, "'", "'") ' removes all ... (tab etc)
strInput = Replace(strInput, """, """") ' removes all ... (tab etc)
Remove tabs, double spaces etc -
strInput = Replace(strInput, " ", " ") ' removes all ... (tab etc)
strInput = Replace(strInput, vbTab, "") ' removes all ... (tab etc)
strInput = Replace(strInput, CHAR(9)," ") ' replace tabs with a space
Replace line breaks with < /br>
.Pattern = "\r\n|\r|\n"
strInput = .Replace(strInput, "<br />") ' replace any carriage returns with either <br />
Once the string is clean, encode special characters -
strInput = Replace("&", "&");
strInput = Replace("£", "£");
strInput = Replace("""", """);
strInput = Replace("'", "'");
My VBScript version of this, which is probably a little mixed up is -
<%
Function removeformatting(strInput)
Set reg = New Regexp
With reg
.Global = True
.IgnoreCase = True
.Pattern = "(<[a-z][^ >/]*)(?:[^>/]|/(?!>))*" ' keep all HTML Entities but remove attributes
strInput = .Replace(strInput, "$1")
.Pattern = "\r\n|\r|\n"
strInput = .Replace(strInput, "<br />") ' replace any carriage returns with either <br />
End With
strInput = Replace(strInput, "&amp;", "&") ' removes all ... (tab etc)
strInput = Replace(strInput, "&", "&") ' removes all ... (tab etc)
strInput = Replace(strInput, " ", " ") ' removes all ... (tab etc)
strInput = Replace(strInput, "£", "£")
strInput = Replace(strInput, "£", "£")
strInput = Replace(strInput, "'", "'") ' removes all ... (tab etc)
strInput = Replace(strInput, """, """") ' removes all ... (tab etc)
.Pattern = "\r\n|\r|\n"
strInput = Replace(strInput, " ")
strInput = Replace("&", "&");
strInput = Replace("£", "£");
strInput = Replace("""", """);
strInput = Replace("'", "'");
strInput = Replace(strInput, " ", " ") ' removes all ... (tab etc)
strInput = Replace(strInput, vbTab, "") ' removes all ... (tab etc)
strInput = Replace(strInput, CHAR(9)," ") ' replace tabs with a space
removeformatting = strInput
End Function
%>
How do I build this as a function in MS SQL 2008.
Appreciate your help with this.
Thank you
ASKER
Thanks Exoduster,
What about the regex part?
.Pattern = "(<[a-z][^ >/]*)(?:[^>/]|/(?!>))*" ' keep all HTML Entities but remove attributes
strInput = .Replace(strInput, "$1")
Thanks :)
What about the regex part?
.Pattern = "(<[a-z][^ >/]*)(?:[^>/]|/(?!>))*" ' keep all HTML Entities but remove attributes
strInput = .Replace(strInput, "$1")
Thanks :)
For regular expressions you need use CLR.
This should help you: http://www.codeproject.com/Articles/42764/Regular-Expressions-in-MS-SQL-Server-2005-2008
This should help you: http://www.codeproject.com/Articles/42764/Regular-Expressions-in-MS-SQL-Server-2005-2008
ASKER
Umm, my column is an Nvarchar(max) and last time I tried using a CLR, i got --
Which really confused me, I googled it and came up with nothing..., I can see it has something to do with the field size?
Any suggestions?
Thank you
A .NET Framework error occurred during execution of user-defined routine or aggregate "CleanHTML":
System.Data.SqlServer.TruncationExce ption: Trying to convert return value or output parameter of size 8090 bytes to a T-SQL type with a smaller size limit of 8000 bytes.
System.Data.SqlServer.TruncationExce ption:
at System.Data.SqlServer.Internal.CXVar iantBase.S tringToWST R(String pstrValue, Int64 cbMaxLength, Int32 iOffset, EPadding ePad)
at System.Data.SqlServer.Internal.CXVar iantBase.S qlStringTo WSTR(SqlSt ring strValue, Int64 cbMaxLength, Int32 iOffset, EPadding ePad)
Which really confused me, I googled it and came up with nothing..., I can see it has something to do with the field size?
Any suggestions?
Thank you
I'm not so familiar with CLRs... Probably i would try to cut string to strings of type VARCHAR(8000)...
If the parser doesn't recognize nVarChar(max), you can cast it as a nText value which should be handled correctly.
Hi garethtnash,
How about this:
BTW, if you are more comfortable with VB.NET, you could write CLR functions in that language too.
How about this:
using System;
using System.Data.Sql;
using System.Data.SqlTypes;
using System.Collections.Generic;
using Microsoft.SqlServer.Server;
using System.Text;
using System.Text.RegularExpressions;
public partial class UserDefinedFunctions
{
[Microsoft.SqlServer.Server.SqlFunction]
[return: SqlFacet(MaxSize = -1)]
public static SqlString CleanHTML(SqlString s)
{
string result = s.ToString();
//keep all HTML Entities but remove attributes
string pattern = @"(<[a-z][^ >/]*)(?:[^>/]|/(?!>))*";
Regex rgx = new Regex(pattern);
result = rgx.Replace(result, "$1");
//If the input has already been HTML encoded or double encoded, unencode it -
result = result.Replace("&amp;", "&");// removes all ... (tab etc)
result = result.Replace("&", "&");// removes all ... (tab etc)
result = result.Replace(" ", " ");// removes all ... (tab etc)
result = result.Replace("£", "£");
result = result.Replace("£", "£");
result = result.Replace("'", "'");// removes all ... (tab etc)
result = result.Replace(""", @"""");// removes all ... (tab etc)
//Remove tabs, double spaces etc -
result = result.Replace("\t", " ");// replace tabs with a space
//Replace line breaks with < /br>
pattern = "\r\n|\r|\n";
rgx = new Regex(pattern);
result = rgx.Replace(result, "<br />");// replace any carriage returns with either <br />
//Once the string is clean, encode special characters
result = result.Replace("&", "&");
result = result.Replace("£", "£");
result = result.Replace(@"""", """);
result = result.Replace("'", "'");
return result;
}
}
BTW, if you are more comfortable with VB.NET, you could write CLR functions in that language too.
ASKER
Thats great, thanks Rimvis, there is a tiny issue though..
When I run this --
It removes the existing Function of --
despite both having different names, why is that, can I not have 2 or more UDFs?
Thanks.
By the way, how was your break?
Thank you
When I run this --
using System;
using System.Data.Sql;
using System.Data.SqlTypes;
using System.Collections.Generic;
using Microsoft.SqlServer.Server;
using System.Text;
using System.Text.RegularExpressions;
public partial class UserDefinedFunctions
{
[Microsoft.SqlServer.Server.SqlFunction]
[return: SqlFacet(MaxSize = -1)]
public static SqlString CleanHTMLAtt(SqlString s)
{
string result = s.ToString();
//keep all HTML Entities but remove attributes
string pattern = @"(<[a-z][^ >/]*)(?:[^>/]|/(?!>))*";
Regex rgx = new Regex(pattern);
result = rgx.Replace(result, "$1");
//If the input has already been HTML encoded or double encoded, unencode it -
result = result.Replace("&amp;", "&");// removes all ... (tab etc)
result = result.Replace("&", "&");// removes all ... (tab etc)
result = result.Replace(" ", " ");// removes all ... (tab etc)
result = result.Replace("£", "£");
result = result.Replace("£", "£");
result = result.Replace("'", "'");// removes all ... (tab etc)
result = result.Replace(""", @"""");// removes all ... (tab etc)
//Remove tabs, double spaces etc -
result = result.Replace("\t", " ");// replace tabs with a space
result = result.Replace(" ", " ");// replace tabs with a space
//Replace line breaks with < /br>
pattern = "\r\n|\r|\n";
rgx = new Regex(pattern);
result = rgx.Replace(result, "<br />");// replace any carriage returns with either <br />
//Once the string is clean, encode special characters
result = result.Replace("&", "&");
result = result.Replace("£", "£");
result = result.Replace(@"""", """);
result = result.Replace("'", "'");
result = result.Replace("&bull;", "•");
return result;
}
}
It removes the existing Function of --
using System;
using System.Data.Sql;
using System.Data.SqlTypes;
using System.Collections.Generic;
using Microsoft.SqlServer.Server;
using System.Text;
using System.Text.RegularExpressions;
public partial class UserDefinedFunctions
{
[Microsoft.SqlServer.Server.SqlFunction]
[return: SqlFacet(MaxSize = -1)]
public static SqlString CleanHTML(SqlString s)
{
if (s.IsNull) return String.Empty;
string s1 = s.ToString().Trim();
if (s1.Length == 0) return String.Empty;
StringBuilder tmpS = new StringBuilder(s1.Length);
//striping out the "control characters"
if (!Char.IsControl(s1[0])) tmpS.Append(s1[0]);
for (int i = 1; i <= s1.Length - 1; i++)
{
if (Char.IsControl(s1[i]))
{
if (s1[i - 1] != ' ') tmpS.Append(' ');
}
else
{
tmpS.Append(s1[i]);
}
}
string result = tmpS.ToString();
//finding the HTML tags and replacing them with an empty string
string pattern = @"<[^>]*?>|<[^>]*>";
Regex rgx = new Regex(pattern);
result = rgx.Replace(result, String.Empty);
result = result.Replace(@"&amp;", @"&");
result = result.Replace(@"&", @"&");
result = result.Replace(@"£", @"£");
result = result.Replace(@"£", @"£");
result = result.Replace(@""", @"""");
result = result.Replace(@"'", @"'");
result = result.Replace(@"&", @"&");
result = result.Replace(@"£", @"£");
result = result.Replace(@"""", @""");
result = result.Replace(@"'", @"'");
//more replaces here
return result;
}
}
despite both having different names, why is that, can I not have 2 or more UDFs?
Thanks.
By the way, how was your break?
Thank you
ASKER CERTIFIED SOLUTION
membership
This solution is only available to members.
To access this solution, you must be a member of Experts Exchange.
ASKER
Thank you Rimvis :) Superb
There is also REPLACE function is T-SQL.
Just look at books online and string functions.
About tabs: you can also use CHAR(9).
Good luck!