[Last Call] Learn how to a build a cloud-first strategyRegister Now

x
?
Solved

MS SQL Replace Function - Remove HTML Attributes - Regex

Posted on 2012-09-07
10
Medium Priority
?
2,298 Views
Last Modified: 2012-09-16
Hello Experts,

I'm trying to write an MS SQL 2008 version of a VBScript function that cleans a string, the reason for doing it in SQL is because I want to clean the string on Insert Update, and data can be Inserted / Updated via a number of inputs. What the Function should do is -

Strip any HTML attributes, but keep the element - <p style="style1>text</p> becomes <p>text</p>

This is the section of the code below --

.Pattern = "(<[a-z][^ >/]*)(?:[^>/]|/(?!>))*"	 ' keep all HTML Entities but remove attributes 
    strInput = .Replace(strInput, "$1")    

Open in new window


If the input has already been HTML encoded or double encoded, unencode it -
This section deals with this -

    strInput = Replace(strInput, "&amp;amp;", "&amp;")	         ' removes all ... (tab etc)
    strInput = Replace(strInput, "&amp;", "&")	         ' removes all ... (tab etc)
    strInput = Replace(strInput, "&nbsp;", " ")	         ' removes all ... (tab etc)
    strInput = Replace(strInput, "&pound;", "£")
	strInput = Replace(strInput, "&#163;", "£")
    strInput = Replace(strInput, "&apos;", "'")	         ' removes all ... (tab etc)
    strInput = Replace(strInput, "&quot;", """")	         ' removes all ... (tab etc)

Open in new window


Remove tabs, double spaces etc -

    strInput = Replace(strInput, "  ", " ")	         ' removes all ... (tab etc)
	strInput = Replace(strInput, vbTab, "")	         ' removes all ... (tab etc)
	strInput = Replace(strInput, CHAR(9)," ")		' replace tabs with a space 

Open in new window


Replace line breaks with < /br>

    .Pattern = "\r\n|\r|\n"
    strInput = .Replace(strInput, "<br />")          ' replace any carriage returns with either <br />

Open in new window


Once the string is clean, encode special characters -

    strInput = Replace("&", "&amp;");
    strInput = Replace("£", "&pound;");
    strInput = Replace("""", "&quot;");
    strInput = Replace("'", "&apos;");

Open in new window


My VBScript version of this, which is probably a little mixed up is -

<%
Function removeformatting(strInput)
    Set reg = New Regexp
    With reg
    .Global = True
    .IgnoreCase = True
    .Pattern = "(<[a-z][^ >/]*)(?:[^>/]|/(?!>))*"	 ' keep all HTML Entities but remove attributes 
    strInput = .Replace(strInput, "$1")        
    .Pattern = "\r\n|\r|\n"
    strInput = .Replace(strInput, "<br />")          ' replace any carriage returns with either <br />
    End With
    strInput = Replace(strInput, "&amp;amp;", "&amp;")	         ' removes all ... (tab etc)
    strInput = Replace(strInput, "&amp;", "&")	         ' removes all ... (tab etc)
    strInput = Replace(strInput, "&nbsp;", " ")	         ' removes all ... (tab etc)
    strInput = Replace(strInput, "&pound;", "£")
	strInput = Replace(strInput, "&#163;", "£")
    strInput = Replace(strInput, "&apos;", "'")	         ' removes all ... (tab etc)
    strInput = Replace(strInput, "&quot;", """")	         ' removes all ... (tab etc)
	.Pattern = "\r\n|\r|\n"
    strInput = Replace(strInput, " ")
    strInput = Replace("&", "&amp;");
    strInput = Replace("£", "&pound;");
    strInput = Replace("""", "&quot;");
    strInput = Replace("'", "&apos;");
    strInput = Replace(strInput, "  ", " ")	         ' removes all ... (tab etc)
	strInput = Replace(strInput, vbTab, "")	         ' removes all ... (tab etc)
	strInput = Replace(strInput, CHAR(9)," ")		' replace tabs with a space 

    removeformatting = strInput
End Function
%>

Open in new window


How do I build this as a function in MS SQL 2008.

Appreciate your help with this.

Thank you
0
Comment
Question by:garethtnash
  • 4
  • 3
  • 2
  • +1
10 Comments
 
LVL 2

Expert Comment

by:exoduster
ID: 38376337
Hi.
There is also REPLACE function is T-SQL.
Just look at books online and string functions.
About tabs: you can also use CHAR(9).

Good luck!
0
 

Author Comment

by:garethtnash
ID: 38376350
Thanks Exoduster,

What about the regex part?

.Pattern = "(<[a-z][^ >/]*)(?:[^>/]|/(?!>))*"       ' keep all HTML Entities but remove attributes
    strInput = .Replace(strInput, "$1")        

Thanks :)
0
 
LVL 2

Expert Comment

by:exoduster
ID: 38376378
For regular expressions you need use CLR.
This should help you: http://www.codeproject.com/Articles/42764/Regular-Expressions-in-MS-SQL-Server-2005-2008
0
Concerto Cloud for Software Providers & ISVs

Can Concerto Cloud Services help you focus on evolving your application offerings, while delivering the best cloud experience to your customers? From DevOps to revenue models and customer support, the answer is yes!

Learn how Concerto can help you.

 

Author Comment

by:garethtnash
ID: 38376429
Umm, my column is an Nvarchar(max) and last time I tried using a CLR, i got --

A .NET Framework error occurred during execution of user-defined routine or aggregate "CleanHTML":
System.Data.SqlServer.TruncationException: Trying to convert return value or output parameter of size 8090 bytes to a T-SQL type with a smaller size limit of 8000 bytes.
System.Data.SqlServer.TruncationException:
   at System.Data.SqlServer.Internal.CXVariantBase.StringToWSTR(String pstrValue, Int64 cbMaxLength, Int32 iOffset, EPadding ePad)
   at System.Data.SqlServer.Internal.CXVariantBase.SqlStringToWSTR(SqlString strValue, Int64 cbMaxLength, Int32 iOffset, EPadding ePad)

Which really confused me, I googled it and came up with nothing..., I can see it has something to do with the field size?

Any suggestions?

Thank you
0
 
LVL 2

Expert Comment

by:exoduster
ID: 38376690
I'm not so familiar with CLRs... Probably i would try to cut string to strings of type VARCHAR(8000)...
0
 
LVL 32

Expert Comment

by:Brendt Hess
ID: 38378519
If the parser doesn't recognize nVarChar(max), you can cast it as a nText value which should be handled correctly.
0
 
LVL 19

Expert Comment

by:Rimvis
ID: 38392734
Hi garethtnash,

How about this:

using System;
using System.Data.Sql;
using System.Data.SqlTypes;
using System.Collections.Generic;
using Microsoft.SqlServer.Server;
using System.Text;
using System.Text.RegularExpressions;
public partial class UserDefinedFunctions
{

    [Microsoft.SqlServer.Server.SqlFunction]
    [return: SqlFacet(MaxSize = -1)]
    public static SqlString CleanHTML(SqlString s)
    {
        string result = s.ToString();

        //keep all HTML Entities but remove attributes 
        string pattern = @"(<[a-z][^ >/]*)(?:[^>/]|/(?!>))*";
        Regex rgx = new Regex(pattern);
        result = rgx.Replace(result, "$1");

        //If the input has already been HTML encoded or double encoded, unencode it -
        result = result.Replace("&amp;amp;", "&amp;");// removes all ... (tab etc)
        result = result.Replace("&amp;", "&");// removes all ... (tab etc)
        result = result.Replace("&nbsp;", " ");// removes all ... (tab etc)
        result = result.Replace("&pound;", "£");
        result = result.Replace("&#163;", "£");
        result = result.Replace("&apos;", "'");// removes all ... (tab etc)
        result = result.Replace("&quot;", @"""");// removes all ... (tab etc)

        //Remove tabs, double spaces etc -
        result = result.Replace("\t", " ");// replace tabs with a space 

        //Replace line breaks with < /br>
        pattern = "\r\n|\r|\n";
        rgx = new Regex(pattern);
        result = rgx.Replace(result, "<br />");// replace any carriage returns with either <br />

        //Once the string is clean, encode special characters
        result = result.Replace("&", "&amp;");
        result = result.Replace("£", "&pound;");
        result = result.Replace(@"""", "&quot;");
        result = result.Replace("'", "&apos;");

        return result;
    }

}

Open in new window


BTW, if you are more comfortable with VB.NET, you could write CLR functions in that language too.
0
 

Author Comment

by:garethtnash
ID: 38396059
Thats great, thanks Rimvis, there is a tiny issue though..

When I run this --

using System;
using System.Data.Sql;
using System.Data.SqlTypes;
using System.Collections.Generic;
using Microsoft.SqlServer.Server;
using System.Text;
using System.Text.RegularExpressions;
public partial class UserDefinedFunctions
{

    [Microsoft.SqlServer.Server.SqlFunction]
    [return: SqlFacet(MaxSize = -1)]
    public static SqlString CleanHTMLAtt(SqlString s)
    {
        string result = s.ToString();

        //keep all HTML Entities but remove attributes 
        string pattern = @"(<[a-z][^ >/]*)(?:[^>/]|/(?!>))*";
        Regex rgx = new Regex(pattern);
        result = rgx.Replace(result, "$1");

        //If the input has already been HTML encoded or double encoded, unencode it -
        result = result.Replace("&amp;amp;", "&amp;");// removes all ... (tab etc)
        result = result.Replace("&amp;", "&");// removes all ... (tab etc)
        result = result.Replace("&nbsp;", " ");// removes all ... (tab etc)
        result = result.Replace("&pound;", "£");
        result = result.Replace("&#163;", "£");
        result = result.Replace("&apos;", "'");// removes all ... (tab etc)
        result = result.Replace("&quot;", @"""");// removes all ... (tab etc)

        //Remove tabs, double spaces etc -
        result = result.Replace("\t", " ");// replace tabs with a space 
        result = result.Replace("  ", " ");// replace tabs with a space 

        //Replace line breaks with < /br>
        pattern = "\r\n|\r|\n";
        rgx = new Regex(pattern);
        result = rgx.Replace(result, "<br />");// replace any carriage returns with either <br />

        //Once the string is clean, encode special characters
        result = result.Replace("&", "&amp;");
        result = result.Replace("£", "&pound;");
        result = result.Replace(@"""", "&quot;");
        result = result.Replace("'", "&apos;");
        result = result.Replace("&amp;bull;", "&bull;");

        return result;
    }

}

Open in new window


It removes the existing Function of --

using System;
using System.Data.Sql;
using System.Data.SqlTypes;
using System.Collections.Generic;
using Microsoft.SqlServer.Server;
using System.Text;
using System.Text.RegularExpressions;
public partial class UserDefinedFunctions
{
    [Microsoft.SqlServer.Server.SqlFunction]
    [return: SqlFacet(MaxSize = -1)]
    public static SqlString CleanHTML(SqlString s)
    {
        if (s.IsNull) return String.Empty;
        string s1 = s.ToString().Trim();
        if (s1.Length == 0) return String.Empty;
        StringBuilder tmpS = new StringBuilder(s1.Length);
        //striping out the "control characters"
        if (!Char.IsControl(s1[0])) tmpS.Append(s1[0]);
        for (int i = 1; i <= s1.Length - 1; i++)
        {
            if (Char.IsControl(s1[i]))
            {
                if (s1[i - 1] != ' ') tmpS.Append(' ');
            }
            else
            {
                tmpS.Append(s1[i]);
            }
        }
        string result = tmpS.ToString();
        //finding the HTML tags and replacing them with an empty string
        string pattern = @"<[^>]*?>|<[^>]*>";

        Regex rgx = new Regex(pattern);
        result = rgx.Replace(result, String.Empty);
        result = result.Replace(@"&amp;amp;", @"&amp;");
        result = result.Replace(@"&amp;", @"&");
        result = result.Replace(@"&pound;", @"£");
        result = result.Replace(@"&#163;", @"£");
        result = result.Replace(@"&quot;", @"""");
        result = result.Replace(@"&apos;", @"'");
        result = result.Replace(@"&", @"&amp;");
        result = result.Replace(@"£", @"&pound;");
        result = result.Replace(@"""", @"&quot;");
        result = result.Replace(@"'", @"&apos;");
        //more replaces here
        return result;
    }
}

Open in new window


despite both having different names, why is that, can I not have 2 or more UDFs?

Thanks.

By the way, how was your break?

Thank you
0
 
LVL 19

Accepted Solution

by:
Rimvis earned 2000 total points
ID: 38396738
You can have as many functions as you want. You have to put them into same class, like this:
using System;
using System.Data.Sql;
using System.Data.SqlTypes;
using System.Collections.Generic;
using Microsoft.SqlServer.Server;
using System.Text;
using System.Text.RegularExpressions;
public partial class UserDefinedFunctions
{
    [Microsoft.SqlServer.Server.SqlFunction]
    [return: SqlFacet(MaxSize = -1)]
    public static SqlString CleanHTML(SqlString s)
    {
        if (s.IsNull) return String.Empty;
        string s1 = s.ToString().Trim();
        if (s1.Length == 0) return String.Empty;
        StringBuilder tmpS = new StringBuilder(s1.Length);
        //striping out the "control characters"
        if (!Char.IsControl(s1[0])) tmpS.Append(s1[0]);
        for (int i = 1; i <= s1.Length - 1; i++)
        {
            if (Char.IsControl(s1[i]))
            {
                if (s1[i - 1] != ' ') tmpS.Append(' ');
            }
            else
            {
                tmpS.Append(s1[i]);
            }
        }
        string result = tmpS.ToString();
        //finding the HTML tags and replacing them with an empty string
        string pattern = @"<[^>]*?>|<[^>]*>";

        Regex rgx = new Regex(pattern);
        result = rgx.Replace(result, String.Empty);
        result = result.Replace(@"&amp;amp;", @"&amp;");
        result = result.Replace(@"&amp;", @"&");
        result = result.Replace(@"&pound;", @"£");
        result = result.Replace(@"&#163;", @"£");
        result = result.Replace(@"&quot;", @"""");
        result = result.Replace(@"&apos;", @"'");
        result = result.Replace(@"&", @"&amp;");
        result = result.Replace(@"£", @"&pound;");
        result = result.Replace(@"""", @"&quot;");
        result = result.Replace(@"'", @"&apos;");
        //more replaces here
        return result;
    }

    [Microsoft.SqlServer.Server.SqlFunction]
    [return: SqlFacet(MaxSize = -1)]
    public static SqlString CleanHTMLAtt(SqlString s)
    {
        string result = s.ToString();

        //keep all HTML Entities but remove attributes 
        string pattern = @"(<[a-z][^ >/]*)(?:[^>/]|/(?!>))*";
        Regex rgx = new Regex(pattern);
        result = rgx.Replace(result, "$1");

        //If the input has already been HTML encoded or double encoded, unencode it -
        result = result.Replace("&amp;amp;", "&amp;");// removes all ... (tab etc)
        result = result.Replace("&amp;", "&");// removes all ... (tab etc)
        result = result.Replace("&nbsp;", " ");// removes all ... (tab etc)
        result = result.Replace("&pound;", "£");
        result = result.Replace("&#163;", "£");
        result = result.Replace("&apos;", "'");// removes all ... (tab etc)
        result = result.Replace("&quot;", @"""");// removes all ... (tab etc)

        //Remove tabs, double spaces etc -
        result = result.Replace("\t", " ");// replace tabs with a space 
        result = result.Replace("  ", " ");// replace tabs with a space 

        //Replace line breaks with < /br>
        pattern = "\r\n|\r|\n";
        rgx = new Regex(pattern);
        result = rgx.Replace(result, "<br />");// replace any carriage returns with either <br />

        //Once the string is clean, encode special characters
        result = result.Replace("&", "&amp;");
        result = result.Replace("£", "&pound;");
        result = result.Replace(@"""", "&quot;");
        result = result.Replace("'", "&apos;");
        result = result.Replace("&amp;bull;", "&bull;");

        return result;
    }
}

Open in new window

0
 

Author Closing Comment

by:garethtnash
ID: 38403192
Thank you Rimvis :) Superb
0

Featured Post

Technology Partners: We Want Your Opinion!

We value your feedback.

Take our survey and automatically be enter to win anyone of the following:
Yeti Cooler, Amazon eGift Card, and Movie eGift Card!

Question has a verified solution.

If you are experiencing a similar issue, please ask a related question

In the first part of this tutorial we will cover the prerequisites for installing SQL Server vNext on Linux.
What if you have to shut down the entire Citrix infrastructure for hardware maintenance, software upgrades or "the unknown"? I developed this plan for "the unknown" and hope that it helps you as well. This article explains how to properly shut down …
Via a live example combined with referencing Books Online, show some of the information that can be extracted from the Catalog Views in SQL Server.
Using examples as well as descriptions, and references to Books Online, show the documentation available for datatypes, explain the available data types and show how data can be passed into and out of variables.
Suggested Courses

831 members asked questions and received personalized solutions in the past 7 days.

Join the community of 500,000 technology professionals and ask your questions.

Join & Ask a Question