MS SQL Replace Function - Remove HTML Attributes - Regex

Hello Experts,

I'm trying to write an MS SQL 2008 version of a VBScript function that cleans a string, the reason for doing it in SQL is because I want to clean the string on Insert Update, and data can be Inserted / Updated via a number of inputs. What the Function should do is -

Strip any HTML attributes, but keep the element - <p style="style1>text</p> becomes <p>text</p>

This is the section of the code below --

.Pattern = "(<[a-z][^ >/]*)(?:[^>/]|/(?!>))*"	 ' keep all HTML Entities but remove attributes 
    strInput = .Replace(strInput, "$1")    

Open in new window


If the input has already been HTML encoded or double encoded, unencode it -
This section deals with this -

    strInput = Replace(strInput, "&amp;amp;", "&amp;")	         ' removes all ... (tab etc)
    strInput = Replace(strInput, "&amp;", "&")	         ' removes all ... (tab etc)
    strInput = Replace(strInput, "&nbsp;", " ")	         ' removes all ... (tab etc)
    strInput = Replace(strInput, "&pound;", "£")
	strInput = Replace(strInput, "&#163;", "£")
    strInput = Replace(strInput, "&apos;", "'")	         ' removes all ... (tab etc)
    strInput = Replace(strInput, "&quot;", """")	         ' removes all ... (tab etc)

Open in new window


Remove tabs, double spaces etc -

    strInput = Replace(strInput, "  ", " ")	         ' removes all ... (tab etc)
	strInput = Replace(strInput, vbTab, "")	         ' removes all ... (tab etc)
	strInput = Replace(strInput, CHAR(9)," ")		' replace tabs with a space 

Open in new window


Replace line breaks with < /br>

    .Pattern = "\r\n|\r|\n"
    strInput = .Replace(strInput, "<br />")          ' replace any carriage returns with either <br />

Open in new window


Once the string is clean, encode special characters -

    strInput = Replace("&", "&amp;");
    strInput = Replace("£", "&pound;");
    strInput = Replace("""", "&quot;");
    strInput = Replace("'", "&apos;");

Open in new window


My VBScript version of this, which is probably a little mixed up is -

<%
Function removeformatting(strInput)
    Set reg = New Regexp
    With reg
    .Global = True
    .IgnoreCase = True
    .Pattern = "(<[a-z][^ >/]*)(?:[^>/]|/(?!>))*"	 ' keep all HTML Entities but remove attributes 
    strInput = .Replace(strInput, "$1")        
    .Pattern = "\r\n|\r|\n"
    strInput = .Replace(strInput, "<br />")          ' replace any carriage returns with either <br />
    End With
    strInput = Replace(strInput, "&amp;amp;", "&amp;")	         ' removes all ... (tab etc)
    strInput = Replace(strInput, "&amp;", "&")	         ' removes all ... (tab etc)
    strInput = Replace(strInput, "&nbsp;", " ")	         ' removes all ... (tab etc)
    strInput = Replace(strInput, "&pound;", "£")
	strInput = Replace(strInput, "&#163;", "£")
    strInput = Replace(strInput, "&apos;", "'")	         ' removes all ... (tab etc)
    strInput = Replace(strInput, "&quot;", """")	         ' removes all ... (tab etc)
	.Pattern = "\r\n|\r|\n"
    strInput = Replace(strInput, " ")
    strInput = Replace("&", "&amp;");
    strInput = Replace("£", "&pound;");
    strInput = Replace("""", "&quot;");
    strInput = Replace("'", "&apos;");
    strInput = Replace(strInput, "  ", " ")	         ' removes all ... (tab etc)
	strInput = Replace(strInput, vbTab, "")	         ' removes all ... (tab etc)
	strInput = Replace(strInput, CHAR(9)," ")		' replace tabs with a space 

    removeformatting = strInput
End Function
%>

Open in new window


How do I build this as a function in MS SQL 2008.

Appreciate your help with this.

Thank you
garethtnashAsked:
Who is Participating?
 
RimvisCommented:
You can have as many functions as you want. You have to put them into same class, like this:
using System;
using System.Data.Sql;
using System.Data.SqlTypes;
using System.Collections.Generic;
using Microsoft.SqlServer.Server;
using System.Text;
using System.Text.RegularExpressions;
public partial class UserDefinedFunctions
{
    [Microsoft.SqlServer.Server.SqlFunction]
    [return: SqlFacet(MaxSize = -1)]
    public static SqlString CleanHTML(SqlString s)
    {
        if (s.IsNull) return String.Empty;
        string s1 = s.ToString().Trim();
        if (s1.Length == 0) return String.Empty;
        StringBuilder tmpS = new StringBuilder(s1.Length);
        //striping out the "control characters"
        if (!Char.IsControl(s1[0])) tmpS.Append(s1[0]);
        for (int i = 1; i <= s1.Length - 1; i++)
        {
            if (Char.IsControl(s1[i]))
            {
                if (s1[i - 1] != ' ') tmpS.Append(' ');
            }
            else
            {
                tmpS.Append(s1[i]);
            }
        }
        string result = tmpS.ToString();
        //finding the HTML tags and replacing them with an empty string
        string pattern = @"<[^>]*?>|<[^>]*>";

        Regex rgx = new Regex(pattern);
        result = rgx.Replace(result, String.Empty);
        result = result.Replace(@"&amp;amp;", @"&amp;");
        result = result.Replace(@"&amp;", @"&");
        result = result.Replace(@"&pound;", @"£");
        result = result.Replace(@"&#163;", @"£");
        result = result.Replace(@"&quot;", @"""");
        result = result.Replace(@"&apos;", @"'");
        result = result.Replace(@"&", @"&amp;");
        result = result.Replace(@"£", @"&pound;");
        result = result.Replace(@"""", @"&quot;");
        result = result.Replace(@"'", @"&apos;");
        //more replaces here
        return result;
    }

    [Microsoft.SqlServer.Server.SqlFunction]
    [return: SqlFacet(MaxSize = -1)]
    public static SqlString CleanHTMLAtt(SqlString s)
    {
        string result = s.ToString();

        //keep all HTML Entities but remove attributes 
        string pattern = @"(<[a-z][^ >/]*)(?:[^>/]|/(?!>))*";
        Regex rgx = new Regex(pattern);
        result = rgx.Replace(result, "$1");

        //If the input has already been HTML encoded or double encoded, unencode it -
        result = result.Replace("&amp;amp;", "&amp;");// removes all ... (tab etc)
        result = result.Replace("&amp;", "&");// removes all ... (tab etc)
        result = result.Replace("&nbsp;", " ");// removes all ... (tab etc)
        result = result.Replace("&pound;", "£");
        result = result.Replace("&#163;", "£");
        result = result.Replace("&apos;", "'");// removes all ... (tab etc)
        result = result.Replace("&quot;", @"""");// removes all ... (tab etc)

        //Remove tabs, double spaces etc -
        result = result.Replace("\t", " ");// replace tabs with a space 
        result = result.Replace("  ", " ");// replace tabs with a space 

        //Replace line breaks with < /br>
        pattern = "\r\n|\r|\n";
        rgx = new Regex(pattern);
        result = rgx.Replace(result, "<br />");// replace any carriage returns with either <br />

        //Once the string is clean, encode special characters
        result = result.Replace("&", "&amp;");
        result = result.Replace("£", "&pound;");
        result = result.Replace(@"""", "&quot;");
        result = result.Replace("'", "&apos;");
        result = result.Replace("&amp;bull;", "&bull;");

        return result;
    }
}

Open in new window

0
 
exodusterCommented:
Hi.
There is also REPLACE function is T-SQL.
Just look at books online and string functions.
About tabs: you can also use CHAR(9).

Good luck!
0
 
garethtnashAuthor Commented:
Thanks Exoduster,

What about the regex part?

.Pattern = "(<[a-z][^ >/]*)(?:[^>/]|/(?!>))*"       ' keep all HTML Entities but remove attributes
    strInput = .Replace(strInput, "$1")        

Thanks :)
0
Ultimate Tool Kit for Technology Solution Provider

Broken down into practical pointers and step-by-step instructions, the IT Service Excellence Tool Kit delivers expert advice for technology solution providers. Get your free copy now.

 
exodusterCommented:
For regular expressions you need use CLR.
This should help you: http://www.codeproject.com/Articles/42764/Regular-Expressions-in-MS-SQL-Server-2005-2008
0
 
garethtnashAuthor Commented:
Umm, my column is an Nvarchar(max) and last time I tried using a CLR, i got --

A .NET Framework error occurred during execution of user-defined routine or aggregate "CleanHTML":
System.Data.SqlServer.TruncationException: Trying to convert return value or output parameter of size 8090 bytes to a T-SQL type with a smaller size limit of 8000 bytes.
System.Data.SqlServer.TruncationException:
   at System.Data.SqlServer.Internal.CXVariantBase.StringToWSTR(String pstrValue, Int64 cbMaxLength, Int32 iOffset, EPadding ePad)
   at System.Data.SqlServer.Internal.CXVariantBase.SqlStringToWSTR(SqlString strValue, Int64 cbMaxLength, Int32 iOffset, EPadding ePad)

Which really confused me, I googled it and came up with nothing..., I can see it has something to do with the field size?

Any suggestions?

Thank you
0
 
exodusterCommented:
I'm not so familiar with CLRs... Probably i would try to cut string to strings of type VARCHAR(8000)...
0
 
Brendt HessSenior DBACommented:
If the parser doesn't recognize nVarChar(max), you can cast it as a nText value which should be handled correctly.
0
 
RimvisCommented:
Hi garethtnash,

How about this:

using System;
using System.Data.Sql;
using System.Data.SqlTypes;
using System.Collections.Generic;
using Microsoft.SqlServer.Server;
using System.Text;
using System.Text.RegularExpressions;
public partial class UserDefinedFunctions
{

    [Microsoft.SqlServer.Server.SqlFunction]
    [return: SqlFacet(MaxSize = -1)]
    public static SqlString CleanHTML(SqlString s)
    {
        string result = s.ToString();

        //keep all HTML Entities but remove attributes 
        string pattern = @"(<[a-z][^ >/]*)(?:[^>/]|/(?!>))*";
        Regex rgx = new Regex(pattern);
        result = rgx.Replace(result, "$1");

        //If the input has already been HTML encoded or double encoded, unencode it -
        result = result.Replace("&amp;amp;", "&amp;");// removes all ... (tab etc)
        result = result.Replace("&amp;", "&");// removes all ... (tab etc)
        result = result.Replace("&nbsp;", " ");// removes all ... (tab etc)
        result = result.Replace("&pound;", "£");
        result = result.Replace("&#163;", "£");
        result = result.Replace("&apos;", "'");// removes all ... (tab etc)
        result = result.Replace("&quot;", @"""");// removes all ... (tab etc)

        //Remove tabs, double spaces etc -
        result = result.Replace("\t", " ");// replace tabs with a space 

        //Replace line breaks with < /br>
        pattern = "\r\n|\r|\n";
        rgx = new Regex(pattern);
        result = rgx.Replace(result, "<br />");// replace any carriage returns with either <br />

        //Once the string is clean, encode special characters
        result = result.Replace("&", "&amp;");
        result = result.Replace("£", "&pound;");
        result = result.Replace(@"""", "&quot;");
        result = result.Replace("'", "&apos;");

        return result;
    }

}

Open in new window


BTW, if you are more comfortable with VB.NET, you could write CLR functions in that language too.
0
 
garethtnashAuthor Commented:
Thats great, thanks Rimvis, there is a tiny issue though..

When I run this --

using System;
using System.Data.Sql;
using System.Data.SqlTypes;
using System.Collections.Generic;
using Microsoft.SqlServer.Server;
using System.Text;
using System.Text.RegularExpressions;
public partial class UserDefinedFunctions
{

    [Microsoft.SqlServer.Server.SqlFunction]
    [return: SqlFacet(MaxSize = -1)]
    public static SqlString CleanHTMLAtt(SqlString s)
    {
        string result = s.ToString();

        //keep all HTML Entities but remove attributes 
        string pattern = @"(<[a-z][^ >/]*)(?:[^>/]|/(?!>))*";
        Regex rgx = new Regex(pattern);
        result = rgx.Replace(result, "$1");

        //If the input has already been HTML encoded or double encoded, unencode it -
        result = result.Replace("&amp;amp;", "&amp;");// removes all ... (tab etc)
        result = result.Replace("&amp;", "&");// removes all ... (tab etc)
        result = result.Replace("&nbsp;", " ");// removes all ... (tab etc)
        result = result.Replace("&pound;", "£");
        result = result.Replace("&#163;", "£");
        result = result.Replace("&apos;", "'");// removes all ... (tab etc)
        result = result.Replace("&quot;", @"""");// removes all ... (tab etc)

        //Remove tabs, double spaces etc -
        result = result.Replace("\t", " ");// replace tabs with a space 
        result = result.Replace("  ", " ");// replace tabs with a space 

        //Replace line breaks with < /br>
        pattern = "\r\n|\r|\n";
        rgx = new Regex(pattern);
        result = rgx.Replace(result, "<br />");// replace any carriage returns with either <br />

        //Once the string is clean, encode special characters
        result = result.Replace("&", "&amp;");
        result = result.Replace("£", "&pound;");
        result = result.Replace(@"""", "&quot;");
        result = result.Replace("'", "&apos;");
        result = result.Replace("&amp;bull;", "&bull;");

        return result;
    }

}

Open in new window


It removes the existing Function of --

using System;
using System.Data.Sql;
using System.Data.SqlTypes;
using System.Collections.Generic;
using Microsoft.SqlServer.Server;
using System.Text;
using System.Text.RegularExpressions;
public partial class UserDefinedFunctions
{
    [Microsoft.SqlServer.Server.SqlFunction]
    [return: SqlFacet(MaxSize = -1)]
    public static SqlString CleanHTML(SqlString s)
    {
        if (s.IsNull) return String.Empty;
        string s1 = s.ToString().Trim();
        if (s1.Length == 0) return String.Empty;
        StringBuilder tmpS = new StringBuilder(s1.Length);
        //striping out the "control characters"
        if (!Char.IsControl(s1[0])) tmpS.Append(s1[0]);
        for (int i = 1; i <= s1.Length - 1; i++)
        {
            if (Char.IsControl(s1[i]))
            {
                if (s1[i - 1] != ' ') tmpS.Append(' ');
            }
            else
            {
                tmpS.Append(s1[i]);
            }
        }
        string result = tmpS.ToString();
        //finding the HTML tags and replacing them with an empty string
        string pattern = @"<[^>]*?>|<[^>]*>";

        Regex rgx = new Regex(pattern);
        result = rgx.Replace(result, String.Empty);
        result = result.Replace(@"&amp;amp;", @"&amp;");
        result = result.Replace(@"&amp;", @"&");
        result = result.Replace(@"&pound;", @"£");
        result = result.Replace(@"&#163;", @"£");
        result = result.Replace(@"&quot;", @"""");
        result = result.Replace(@"&apos;", @"'");
        result = result.Replace(@"&", @"&amp;");
        result = result.Replace(@"£", @"&pound;");
        result = result.Replace(@"""", @"&quot;");
        result = result.Replace(@"'", @"&apos;");
        //more replaces here
        return result;
    }
}

Open in new window


despite both having different names, why is that, can I not have 2 or more UDFs?

Thanks.

By the way, how was your break?

Thank you
0
 
garethtnashAuthor Commented:
Thank you Rimvis :) Superb
0
Question has a verified solution.

Are you are experiencing a similar issue? Get a personalized answer when you ask a related question.

Have a better answer? Share it in a comment.

All Courses

From novice to tech pro — start learning today.