Link to home
Start Free TrialLog in
Avatar of garethtnash
garethtnashFlag for United Kingdom of Great Britain and Northern Ireland

asked on

MS SQL Replace Function - Remove HTML Attributes - Regex

Hello Experts,

I'm trying to write an MS SQL 2008 version of a VBScript function that cleans a string, the reason for doing it in SQL is because I want to clean the string on Insert Update, and data can be Inserted / Updated via a number of inputs. What the Function should do is -

Strip any HTML attributes, but keep the element - <p style="style1>text</p> becomes <p>text</p>

This is the section of the code below --

.Pattern = "(<[a-z][^ >/]*)(?:[^>/]|/(?!>))*"	 ' keep all HTML Entities but remove attributes 
    strInput = .Replace(strInput, "$1")    

Open in new window


If the input has already been HTML encoded or double encoded, unencode it -
This section deals with this -

    strInput = Replace(strInput, "&amp;amp;", "&amp;")	         ' removes all ... (tab etc)
    strInput = Replace(strInput, "&amp;", "&")	         ' removes all ... (tab etc)
    strInput = Replace(strInput, "&nbsp;", " ")	         ' removes all ... (tab etc)
    strInput = Replace(strInput, "&pound;", "£")
	strInput = Replace(strInput, "&#163;", "£")
    strInput = Replace(strInput, "&apos;", "'")	         ' removes all ... (tab etc)
    strInput = Replace(strInput, "&quot;", """")	         ' removes all ... (tab etc)

Open in new window


Remove tabs, double spaces etc -

    strInput = Replace(strInput, "  ", " ")	         ' removes all ... (tab etc)
	strInput = Replace(strInput, vbTab, "")	         ' removes all ... (tab etc)
	strInput = Replace(strInput, CHAR(9)," ")		' replace tabs with a space 

Open in new window


Replace line breaks with < /br>

    .Pattern = "\r\n|\r|\n"
    strInput = .Replace(strInput, "<br />")          ' replace any carriage returns with either <br />

Open in new window


Once the string is clean, encode special characters -

    strInput = Replace("&", "&amp;");
    strInput = Replace("£", "&pound;");
    strInput = Replace("""", "&quot;");
    strInput = Replace("'", "&apos;");

Open in new window


My VBScript version of this, which is probably a little mixed up is -

<%
Function removeformatting(strInput)
    Set reg = New Regexp
    With reg
    .Global = True
    .IgnoreCase = True
    .Pattern = "(<[a-z][^ >/]*)(?:[^>/]|/(?!>))*"	 ' keep all HTML Entities but remove attributes 
    strInput = .Replace(strInput, "$1")        
    .Pattern = "\r\n|\r|\n"
    strInput = .Replace(strInput, "<br />")          ' replace any carriage returns with either <br />
    End With
    strInput = Replace(strInput, "&amp;amp;", "&amp;")	         ' removes all ... (tab etc)
    strInput = Replace(strInput, "&amp;", "&")	         ' removes all ... (tab etc)
    strInput = Replace(strInput, "&nbsp;", " ")	         ' removes all ... (tab etc)
    strInput = Replace(strInput, "&pound;", "£")
	strInput = Replace(strInput, "&#163;", "£")
    strInput = Replace(strInput, "&apos;", "'")	         ' removes all ... (tab etc)
    strInput = Replace(strInput, "&quot;", """")	         ' removes all ... (tab etc)
	.Pattern = "\r\n|\r|\n"
    strInput = Replace(strInput, " ")
    strInput = Replace("&", "&amp;");
    strInput = Replace("£", "&pound;");
    strInput = Replace("""", "&quot;");
    strInput = Replace("'", "&apos;");
    strInput = Replace(strInput, "  ", " ")	         ' removes all ... (tab etc)
	strInput = Replace(strInput, vbTab, "")	         ' removes all ... (tab etc)
	strInput = Replace(strInput, CHAR(9)," ")		' replace tabs with a space 

    removeformatting = strInput
End Function
%>

Open in new window


How do I build this as a function in MS SQL 2008.

Appreciate your help with this.

Thank you
Avatar of exoduster
exoduster

Hi.
There is also REPLACE function is T-SQL.
Just look at books online and string functions.
About tabs: you can also use CHAR(9).

Good luck!
Avatar of garethtnash

ASKER

Thanks Exoduster,

What about the regex part?

.Pattern = "(<[a-z][^ >/]*)(?:[^>/]|/(?!>))*"       ' keep all HTML Entities but remove attributes
    strInput = .Replace(strInput, "$1")        

Thanks :)
For regular expressions you need use CLR.
This should help you: http://www.codeproject.com/Articles/42764/Regular-Expressions-in-MS-SQL-Server-2005-2008
Umm, my column is an Nvarchar(max) and last time I tried using a CLR, i got --

A .NET Framework error occurred during execution of user-defined routine or aggregate "CleanHTML":
System.Data.SqlServer.TruncationException: Trying to convert return value or output parameter of size 8090 bytes to a T-SQL type with a smaller size limit of 8000 bytes.
System.Data.SqlServer.TruncationException:
   at System.Data.SqlServer.Internal.CXVariantBase.StringToWSTR(String pstrValue, Int64 cbMaxLength, Int32 iOffset, EPadding ePad)
   at System.Data.SqlServer.Internal.CXVariantBase.SqlStringToWSTR(SqlString strValue, Int64 cbMaxLength, Int32 iOffset, EPadding ePad)

Which really confused me, I googled it and came up with nothing..., I can see it has something to do with the field size?

Any suggestions?

Thank you
I'm not so familiar with CLRs... Probably i would try to cut string to strings of type VARCHAR(8000)...
Avatar of Brendt Hess
If the parser doesn't recognize nVarChar(max), you can cast it as a nText value which should be handled correctly.
Hi garethtnash,

How about this:

using System;
using System.Data.Sql;
using System.Data.SqlTypes;
using System.Collections.Generic;
using Microsoft.SqlServer.Server;
using System.Text;
using System.Text.RegularExpressions;
public partial class UserDefinedFunctions
{

    [Microsoft.SqlServer.Server.SqlFunction]
    [return: SqlFacet(MaxSize = -1)]
    public static SqlString CleanHTML(SqlString s)
    {
        string result = s.ToString();

        //keep all HTML Entities but remove attributes 
        string pattern = @"(<[a-z][^ >/]*)(?:[^>/]|/(?!>))*";
        Regex rgx = new Regex(pattern);
        result = rgx.Replace(result, "$1");

        //If the input has already been HTML encoded or double encoded, unencode it -
        result = result.Replace("&amp;amp;", "&amp;");// removes all ... (tab etc)
        result = result.Replace("&amp;", "&");// removes all ... (tab etc)
        result = result.Replace("&nbsp;", " ");// removes all ... (tab etc)
        result = result.Replace("&pound;", "£");
        result = result.Replace("&#163;", "£");
        result = result.Replace("&apos;", "'");// removes all ... (tab etc)
        result = result.Replace("&quot;", @"""");// removes all ... (tab etc)

        //Remove tabs, double spaces etc -
        result = result.Replace("\t", " ");// replace tabs with a space 

        //Replace line breaks with < /br>
        pattern = "\r\n|\r|\n";
        rgx = new Regex(pattern);
        result = rgx.Replace(result, "<br />");// replace any carriage returns with either <br />

        //Once the string is clean, encode special characters
        result = result.Replace("&", "&amp;");
        result = result.Replace("£", "&pound;");
        result = result.Replace(@"""", "&quot;");
        result = result.Replace("'", "&apos;");

        return result;
    }

}

Open in new window


BTW, if you are more comfortable with VB.NET, you could write CLR functions in that language too.
Thats great, thanks Rimvis, there is a tiny issue though..

When I run this --

using System;
using System.Data.Sql;
using System.Data.SqlTypes;
using System.Collections.Generic;
using Microsoft.SqlServer.Server;
using System.Text;
using System.Text.RegularExpressions;
public partial class UserDefinedFunctions
{

    [Microsoft.SqlServer.Server.SqlFunction]
    [return: SqlFacet(MaxSize = -1)]
    public static SqlString CleanHTMLAtt(SqlString s)
    {
        string result = s.ToString();

        //keep all HTML Entities but remove attributes 
        string pattern = @"(<[a-z][^ >/]*)(?:[^>/]|/(?!>))*";
        Regex rgx = new Regex(pattern);
        result = rgx.Replace(result, "$1");

        //If the input has already been HTML encoded or double encoded, unencode it -
        result = result.Replace("&amp;amp;", "&amp;");// removes all ... (tab etc)
        result = result.Replace("&amp;", "&");// removes all ... (tab etc)
        result = result.Replace("&nbsp;", " ");// removes all ... (tab etc)
        result = result.Replace("&pound;", "£");
        result = result.Replace("&#163;", "£");
        result = result.Replace("&apos;", "'");// removes all ... (tab etc)
        result = result.Replace("&quot;", @"""");// removes all ... (tab etc)

        //Remove tabs, double spaces etc -
        result = result.Replace("\t", " ");// replace tabs with a space 
        result = result.Replace("  ", " ");// replace tabs with a space 

        //Replace line breaks with < /br>
        pattern = "\r\n|\r|\n";
        rgx = new Regex(pattern);
        result = rgx.Replace(result, "<br />");// replace any carriage returns with either <br />

        //Once the string is clean, encode special characters
        result = result.Replace("&", "&amp;");
        result = result.Replace("£", "&pound;");
        result = result.Replace(@"""", "&quot;");
        result = result.Replace("'", "&apos;");
        result = result.Replace("&amp;bull;", "&bull;");

        return result;
    }

}

Open in new window


It removes the existing Function of --

using System;
using System.Data.Sql;
using System.Data.SqlTypes;
using System.Collections.Generic;
using Microsoft.SqlServer.Server;
using System.Text;
using System.Text.RegularExpressions;
public partial class UserDefinedFunctions
{
    [Microsoft.SqlServer.Server.SqlFunction]
    [return: SqlFacet(MaxSize = -1)]
    public static SqlString CleanHTML(SqlString s)
    {
        if (s.IsNull) return String.Empty;
        string s1 = s.ToString().Trim();
        if (s1.Length == 0) return String.Empty;
        StringBuilder tmpS = new StringBuilder(s1.Length);
        //striping out the "control characters"
        if (!Char.IsControl(s1[0])) tmpS.Append(s1[0]);
        for (int i = 1; i <= s1.Length - 1; i++)
        {
            if (Char.IsControl(s1[i]))
            {
                if (s1[i - 1] != ' ') tmpS.Append(' ');
            }
            else
            {
                tmpS.Append(s1[i]);
            }
        }
        string result = tmpS.ToString();
        //finding the HTML tags and replacing them with an empty string
        string pattern = @"<[^>]*?>|<[^>]*>";

        Regex rgx = new Regex(pattern);
        result = rgx.Replace(result, String.Empty);
        result = result.Replace(@"&amp;amp;", @"&amp;");
        result = result.Replace(@"&amp;", @"&");
        result = result.Replace(@"&pound;", @"£");
        result = result.Replace(@"&#163;", @"£");
        result = result.Replace(@"&quot;", @"""");
        result = result.Replace(@"&apos;", @"'");
        result = result.Replace(@"&", @"&amp;");
        result = result.Replace(@"£", @"&pound;");
        result = result.Replace(@"""", @"&quot;");
        result = result.Replace(@"'", @"&apos;");
        //more replaces here
        return result;
    }
}

Open in new window


despite both having different names, why is that, can I not have 2 or more UDFs?

Thanks.

By the way, how was your break?

Thank you
ASKER CERTIFIED SOLUTION
Avatar of Rimvis
Rimvis
Flag of Lithuania image

Link to home
membership
This solution is only available to members.
To access this solution, you must be a member of Experts Exchange.
Start Free Trial
Thank you Rimvis :) Superb