[Webinar] Learn how to a build a cloud-first strategyRegister Now

x
  • Status: Solved
  • Priority: Medium
  • Security: Public
  • Views: 10102
  • Last Modified:

MS SQL Replace Special Characters & Remove All HTML Attributes

Hello,

I'm having difficulty with an ASP VBScript generated XML feed, as the data in the MS SQL database contains XML Special Characters as well as HTML attributes.

I've tried stripping these out using VBScript, but because of the amount of data, it is causing the buffer limit to be exceeded.

So what I'm hoping to do, as I'm sure that it will cost less overhead, is put together an MS SQL function that will do it.

What I'm hoping to do, is -

remove all <p> <b> <h1> html tags, including style attributes.
replace special characters, including characters like £,€ etc

And out put clean data.

Assuming that my select statement reads -

SELECT S.[ID]
      ,[SupplierSurveyID]
      ,[BrandID]
      ,B.Name
      ,B.description
      ,B.Logo
FROM [dbo].[SupplierSurveySupplier] S
inner join [dbo].[Brand] B on S.BrandID = B.ID
WHERE [SupplierSurveyID] = 1
order by B.Name 

Open in new window


And it is B.Name & B.description that I want to apply the replace / remove function too, how would I do this?

Also, given the number of records, would there be any benefit in using a view, outputting the cleaned data to a view and then using a select statement to select the data from the view?

Thank you
0
garethtnash
Asked:
garethtnash
  • 15
  • 11
  • 2
  • +2
3 Solutions
 
Habib PourfardCommented:
best way for removing html tags is to use a user defined function:

CREATE FUNCTION [dbo].[udf_StripHTML]
(@HTMLText VARCHAR(MAX))
RETURNS VARCHAR(MAX)
AS
BEGIN
DECLARE @Start INT
DECLARE @End INT
DECLARE @Length INT
SET @Start = CHARINDEX('<',@HTMLText)
SET @End = CHARINDEX('>',@HTMLText,CHARINDEX('<',@HTMLText))
SET @Length = (@End - @Start) + 1
WHILE @Start > 0
AND @End > 0
AND @Length > 0
BEGIN
SET @HTMLText = STUFF(@HTMLText,@Start,@Length,'')
SET @Start = CHARINDEX('<',@HTMLText)
SET @End = CHARINDEX('>',@HTMLText,CHARINDEX('<',@HTMLText))
SET @Length = (@End - @Start) + 1
END
RETURN LTRIM(RTRIM(@HTMLText))
END
GO

Open in new window


Test above function like this :
SELECT dbo.udf_StripHTML('<b>UDF at SQLAuthority.com </b><br><br><a href="http://www.SQLAuthority.com">SQLAuthority.com</a>')

Open in new window


To remove special characters you can write:

REPLACE(REPLACE(REPLACE(B.description,'$',''),'£',''),'€','')

Open in new window

0
 
garethtnashAuthor Commented:
Thanks, how do I add this UDF to my select statement? Also, any point in using a View?

Also, I found this UDF is it any good -

http://lazycoders.blogspot.co.uk/2007/06/stripping-html-from-text-in-sql-server.html

Thanks Pourfard
0
 
RimvisCommented:
Hi garethtnash,

This might be helpful:

SQL SERVER – 2005 – UDF – User Defined Function to Strip HTML – Parse HTML – No Regular Expression
http://blog.sqlauthority.com/2007/06/16/sql-server-udf-user-defined-function-to-strip-html-parse-html-no-regular-expression/1

As for the view, I don't think you will get improvement in performance, unless it is an indexed view. And I would not create indexed view on the table where data is changing constantly. Not sure what is your situation.
0
Technology Partners: We Want Your Opinion!

We value your feedback.

Take our survey and automatically be enter to win anyone of the following:
Yeti Cooler, Amazon eGift Card, and Movie eGift Card!

 
garethtnashAuthor Commented:
Thanks Both,

Is there a less expensive way of doing this? It seems to have increased the query execution time from seconds to minutes...

Also is there a way to include the

REPLACE(B.description,'£','&pound;') as part of the UDF dbo.udf_StripHTML?

Thank you so much
0
 
Habib PourfardCommented:
I have included the replace in UDF:
CREATE FUNCTION [dbo].[StripHTML] ( @HTMLText VARCHAR(MAX) )
RETURNS VARCHAR(MAX)
AS BEGIN
    DECLARE @Start INT
    DECLARE @End INT
    DECLARE @Length INT
    SET @Start = CHARINDEX('<', @HTMLText)
    SET @End = CHARINDEX('>', @HTMLText, CHARINDEX('<', @HTMLText))
    SET @Length = ( @End - @Start ) + 1
    WHILE @Start > 0 AND @End > 0 AND @Length > 0
        BEGIN
            SET @HTMLText = STUFF(@HTMLText, @Start, @Length, '')
            SET @Start = CHARINDEX('<', @HTMLText)
            SET @End = CHARINDEX('>', @HTMLText, CHARINDEX('<', @HTMLText))
            SET @Length = ( @End - @Start ) + 1
        END
    RETURN REPLACE(REPLACE(REPLACE(LTRIM(RTRIM(@HTMLText)), '$', ''), '£', ''), '€', '')
   END

Open in new window


and your query:
   
   SELECT   S.[ID]
           ,[SupplierSurveyID]
           ,[BrandID]
           ,B.Name
           ,[dbo].StripHTML(B.description)
           ,B.Logo
   FROM     [dbo].[SupplierSurveySupplier] S INNER JOIN [dbo].[Brand] B
                ON S.BrandID = B.ID
   WHERE    [SupplierSurveyID] = 1
   ORDER BY B.Name 

Open in new window


To improve the performance you can keep a text only version of your html column in database.
0
 
DcpKingCommented:
The stripping code is fine - except that if you use it at runtime you'll use it every time you look at your data! That way you're using cpu time, and the amount will depend on the popularity of your pages!

The other way is to do the transformation once only - into a table you use for presentation - and (after you've processed what you already have) alter your input code so that whenever new data is added to your current table a "cleaned" version is added to your presentation table. This minimises cpu usage at the expense of space your choice!

hth

Mike
0
 
RimvisCommented:
Hi garethtnash,

To improve performance, you can try to use CLR function instead of UDF:

Remove HTML tags from strings using the SQL Server CLR
http://www.mssqltips.com/sqlservertip/2418/remove-html-tags-from-strings-using-the-sql-server-clr/
0
 
Anthony PerkinsCommented:
As indicated previously you should not use T-SQL for this task.  T-SQL is a dog when it comes to string manipulation and I can guarantee you there will always be some HTML tag you overlooked.  Instead as suggested use something like CLR and specifically .NET RegEx.
0
 
garethtnashAuthor Commented:
Hi All,

Thank you so much for your help and suggestions;

Rimvis & Acperkins, I hear you.  
But ASP.Net C# is analien language to me...

I've read a little, but can't see how I put my select statement into the CLR Code -

using System;
    using System.Data.Sql;
    using System.Data.SqlTypes;
    using System.Collections.Generic;
    using Microsoft.SqlServer.Server;
    using System.Text;
    using System.Text.RegularExpressions;
    public partial class UserDefinedFunctions
    {
        [Microsoft.SqlServer.Server.SqlFunction]
        public static SqlString CleanHTML(SqlString s){
            if (s.IsNull) return String.Empty;
            string s1 = s.ToString().Trim();
            if (s1.Length == 0) return String.Empty;
            StringBuilder tmpS = new StringBuilder(s1.Length);
            //striping out the "control characters"
            if (!Char.IsControl(s1[0])) tmpS.Append(s1[0]);
            for (int i = 1; i <= s1.Length - 1; i++)
            {
                if (Char.IsControl(s1[i]))
                {
                    if (s1[i - 1] != ' ') tmpS.Append(' ');
                }
                else
                {
                    tmpS.Append(s1[i]);
                }
            }
            string result = tmpS.ToString();
            //finding the HTML tags and replacing them with an empty string
            string pattern = @"<[^>]*?>|<[^>]*>";
            Regex rgx = new Regex(pattern);
            return rgx.Replace(result, String.Empty);
        }
    }

Open in new window


I suppose my other query is how do I then write this to an XML format?

Like http://www.mssqltips.com/sqlservertip/2418/remove-html-tags-from-strings-using-the-sql-server-clr/ suggests, i have some data (25000+ records) that I need to export to a custom XML feed...

How do I do this using CLR?

Anything that reduces overhead and time is great..

Thank you
0
 
RimvisCommented:
>> how I put my select statement into the CLR Code
Could you just use CLR function provided in example? First of all, you need to compile it and deploy to SQL Server. Do you have Visual Studio available?
0
 
garethtnashAuthor Commented:
yes
0
 
RimvisCommented:
Great. Now you need to compile a CLR function. I think it's described in the article I gave you. But feel free to ask if you have any questions.


Solution

In one of my previous tips I've detailed the steps you need to build and deploy a CLR user defined function from Visual Studio 2010. Below are the steps that need to be followed.

First, make sure that the CLR integration is enabled. This can be accomplished by using the server facets in SQL Server 2008, the Surface Area Configuration tool in SQL Server 2005 or run the following code:

sp_configure 'clr enabled', 1  
GO  
RECONFIGURE  
GO

Next, follow these steps:

    Open Visual Studio 2010
    Click on "New Project"
    Choose the Database ---> SQL Server ---> Visual C# SQL CLR Database Project template.
    Make sure that the project targets .NET 2 / .NET 3 / .NET 3.5.
    Set up a connection to your database, test the connection and click OK
    Right click on the project and add a user defined function as explained in the next section
0
 
garethtnashAuthor Commented:
Ok, thanks, I'll no doubt be back for help a little later, thank you :)
0
 
garethtnashAuthor Commented:
HI,

It turns out I've got Visual Studio 2008, is that pretty much the same?

I've done -

File > New > Project > Visual C# > DataBase > SQL Server Project
Set up a connection to the database and clicked OK

I can't see where to do this -
'Right click on the project and add a user defined function as explained in the next section'

But if I click on Project > Add User Defined Function - I get the option

This loads a new window with -

using System;
using System.Data;
using System.Data.SqlClient;
using System.Data.SqlTypes;
using Microsoft.SqlServer.Server;

public partial class UserDefinedFunctions
{
    [Microsoft.SqlServer.Server.SqlFunction]
    public static SqlString Function1()
    {
        // Put your code here
        return new SqlString("Hello");
    }
};

Open in new window


Sorry this is completely new to me, where do I put my select statement?

SELECT A.JBAID, A.JBATitle, A.JBALocation, A.JBACategory, REPLACE ([JBAPayRate],'£', '&pound')AS [JBAPayRate], A.JBADescription, A.JBAEmplymentType, A.JBAFeaturedJob, CONVERT(CHAR(11),A.JBADatePosted,106) AS JBADatePosted, C.JBCLName, S.JBSURL, S.JBSURLShort, S.JBSRegion, CASE WHEN A.JBADatePosted >= DATEADD(d,-7,GETDATE()) AND A.JBAFeaturedJob = 'Y' THEN '1' ELSE '' END AS sponsored FROM dbo.JBAdvert A inner join dbo.JBClient C on A.JBAClientID = C.JBCLID inner join dbo.JBSite S on A.JBASiteID = S.JBSSiteID order by JBSRegion, JBAID desc

Open in new window


And then where do I put the function to remove HTML etc from 'A.JBATitle' AND 'A.JBADescription' ?

Thank you all
0
 
garethtnashAuthor Commented:
A quick update All,

When I try the non CLR way using -

Create PROCEDURE [dbo].[FeedOutPutSP]
AS
BEGIN
SET NOCOUNT ON
SELECT
A.JBAID,
dbo.udf_StripHTML(A.JBATitle) JBATitle, 
A.JBALocation, 
A.JBACategory, 
REPLACE(REPLACE([JBAPayRate],'£', '&pound'),'c&#163;', '&pound') JBAPayRate, 
dbo.udf_StripHTML(A.JBADescription) JBADescription, 
A.JBAEmplymentType, 
A.JBAFeaturedJob, 
CONVERT(CHAR(11),A.JBADatePosted,106) AS JBADatePosted, 
C.JBCLName, 
S.JBSURL, 
S.JBSURLShort, 
S.JBSRegion, 
CASE WHEN A.JBADatePosted >= DATEADD(d,-7,GETDATE()) AND A.JBAFeaturedJob = 'Y' THEN '1' ELSE '' END AS sponsored
FROM dbo.JBAdvert A 
inner join dbo.JBClient C on A.JBAClientID = C.JBCLID 
inner join dbo.JBSite S on A.JBASiteID = S.JBSSiteID
END

GO

Open in new window


As my Stored Procedure,

And

CREATE FUNCTION [dbo].[udf_StripHTML]
 (@HTMLText varchar(MAX))
 RETURNS varchar(MAX)AS
 BEGIN
 DECLARE @Start  int
 DECLARE @End    int
 DECLARE @Length int
 -- Replace the HTML entity &amp; with the '&' character (this needs to be done first, as-- '&' might be double encoded as '&amp;amp;')
 SET @Start = CHARINDEX('&amp;', @HTMLText)
 SET @End = @Start + 4
 SET @Length = (@End - @Start) + 1
 WHILE (@Start > 0 AND @End > 0 AND @Length > 0) 
 BEGIN
 SET @HTMLText = STUFF(@HTMLText, @Start, @Length, '&')
 SET @Start = CHARINDEX('&amp;', @HTMLText)
 SET @End = @Start + 4
 SET @Length = (@End - @Start) + 1
 END
 -- Replace the HTML entity &lt; with the '<' character
 SET @Start = CHARINDEX('&lt;', @HTMLText)
 SET @End = @Start + 3
 SET @Length = (@End - @Start) + 1
 WHILE (@Start > 0 AND @End > 0 AND @Length > 0) 
 BEGIN
 SET @HTMLText = STUFF(@HTMLText, @Start, @Length, '<')
 SET @Start = CHARINDEX('&lt;', @HTMLText)
 SET @End = @Start + 3
 SET @Length = (@End - @Start) + 1
 END
 -- Replace the HTML entity &gt; with the '>' character
 SET @Start = CHARINDEX('&gt;', @HTMLText)
 SET @End = @Start + 3
 SET @Length = (@End - @Start) + 1
 WHILE (@Start > 0 AND @End > 0 AND @Length > 0) 
 BEGIN
 SET @HTMLText = STUFF(@HTMLText, @Start, @Length, '>')
 SET @Start = CHARINDEX('&gt;', @HTMLText)
 SET @End = @Start + 3
 SET @Length = (@End - @Start) + 1
 END
 -- Replace the HTML entity &amp; with the '&' character
 SET @Start = CHARINDEX('&amp;amp;', @HTMLText)
 SET @End = @Start + 4
 SET @Length = (@End - @Start) + 1
 WHILE (@Start > 0 AND @End > 0 AND @Length > 0) 
 BEGIN
 SET @HTMLText = STUFF(@HTMLText, @Start, @Length, '&')
 SET @Start = CHARINDEX('&amp;amp;', @HTMLText)
 SET @End = @Start + 4
 SET @Length = (@End - @Start) + 1
 END
 -- Replace the HTML entity &nbsp; with the ' ' character
 SET @Start = CHARINDEX('&nbsp;', @HTMLText)
 SET @End = @Start + 5
 SET @Length = (@End - @Start) + 1
 WHILE (@Start > 0 AND @End > 0 AND @Length > 0) 
 BEGIN
 SET @HTMLText = STUFF(@HTMLText, @Start, @Length, ' ')
 SET @Start = CHARINDEX('&nbsp;', @HTMLText)
 SET @End = @Start + 5
 SET @Length = (@End - @Start) + 1
 END
 -- Replace any <br> tags with a newline
 SET @Start = CHARINDEX('<br>', @HTMLText)
 SET @End = @Start + 3
 SET @Length = (@End - @Start) + 1
 WHILE (@Start > 0 AND @End > 0 AND @Length > 0) 
 BEGIN
 SET @HTMLText = STUFF(@HTMLText, @Start, @Length, CHAR(13) + CHAR(10))
 SET @Start = CHARINDEX('<br>', @HTMLText)
 SET @End = @Start + 3SET @Length = (@End - @Start) + 1
 END
 -- Replace any <br/> tags with a newline
 SET @Start = CHARINDEX('<br/>', @HTMLText)
 SET @End = @Start + 4
 SET @Length = (@End - @Start) + 1
 WHILE (@Start > 0 AND @End > 0 AND @Length > 0) 
 BEGIN
 SET @HTMLText = STUFF(@HTMLText, @Start, @Length, 'CHAR(13) + CHAR(10)')
 SET @Start = CHARINDEX('<br/>', @HTMLText)
 SET @End = @Start + 4
 SET @Length = (@End - @Start) + 1
 END
 -- Replace any <br /> tags with a newline
 SET @Start = CHARINDEX('<br />', @HTMLText)
 SET @End = @Start + 5
 SET @Length = (@End - @Start) + 1
 WHILE (@Start > 0 AND @End > 0 AND @Length > 0) 
 BEGIN
 SET @HTMLText = STUFF(@HTMLText, @Start, @Length, 'CHAR(13) + CHAR(10)')
 SET @Start = CHARINDEX('<br />', @HTMLText)
 SET @End = @Start + 5
 SET @Length = (@End - @Start) + 1
 END
 -- Remove anything between <whatever> tags
 SET @Start = CHARINDEX('<', @HTMLText)
 SET @End = CHARINDEX('>', @HTMLText, CHARINDEX('<', @HTMLText))
 SET @Length = (@End - @Start) + 1
 WHILE (@Start > 0 AND @End > 0 AND @Length > 0) 
 BEGIN
 SET @HTMLText = STUFF(@HTMLText, @Start, @Length, '')
 SET @Start = CHARINDEX('<', @HTMLText)
 SET @End = CHARINDEX('>', @HTMLText, CHARINDEX('<', @HTMLText))
 SET @Length = (@End - @Start) + 1
 END
 RETURN REPLACE(REPLACE(LTRIM(RTRIM(@HTMLText)), '&#163;', '&pound'), '£', '&pound')
 RETURN LTRIM(RTRIM(@HTMLText))
 END

GO

Open in new window


As my UDF

And then -

<%@LANGUAGE="VBSCRIPT"%><?xml version="1.0" encoding="ISO-8859-1"?>
<!--#include virtual="/Connections/recruta2.asp" -->
<%
Response.Buffer = true
Response.ContentType = "text/xml"

set CMDFeed = Server.CreateObject("ADODB.Command")
CMDFeed.ActiveConnection = MM_recruta2_STRING
CMDFeed.CommandText = "dbo.FeedOutPutSP"
CMDFeed.CommandType = 4
CMDFeed.CommandTimeout = 0
CMDFeed.Prepared = true
CMDFeed.Parameters.Append CMDIndeedFeed.CreateParameter("@RETURN_VALUE", 3, 4)
set searchresults = CMDFeed.Execute
searchresults_numRows = 0

arrSR = searchresults.GetRows() 'Get the results into a 2-dimensional array
			
searchresults.Close() 'Clean Up
Set searchresults = Nothing 'Clean Up
Set CMDFeed = Nothing 'Clean Up

With response

.write("<source>" & VbCrLf)
.write("<publisher></publisher>" & VbCrLf)
.write("<publisherurl></publisherurl>" & VbCrLf)

End With 

If IsArray(arrSR) Then
Dim i
For i = 0 To UBound(arrSR, 2)

With response

.write("<job>" & VbCrLf)
.write("<title><![CDATA[" & (arrSR(1,i))& " - " & (arrSR(2,i))& "]]></title>" & VbCrLf)
.write("<date><![CDATA[" & (arrSR(8,i))& "]]></date>" & VbCrLf)
.write("<referencenumber><![CDATA[" & (arrSR(0,i))& "]]></referencenumber>" & VbCrLf)
.write("<url><![CDATA[" & (arrSR(10,i))& "/detail.asp?ID=" & (arrSR(0,i))& "]]></url>" & VbCrLf)
.write("<company><![CDATA[" & (arrSR(9,i))& "]]></company>" & VbCrLf)
.write("<city><![CDATA[" & (arrSR(2,i))& "]]></city>" & VbCrLf)
.write("<country>UK</country>" & VbCrLf)
.write("<description><![CDATA[" & (arrSR(5,i))& "]]></description>" & VbCrLf)
.write("<salary><![CDATA[" & (arrSR(4,i))& "]]></salary>" & VbCrLf)
.write("<jobtype><![CDATA[" & (arrSR(6,i))& "]]></jobtype>" & VbCrLf)
.write("<category><![CDATA[" & (arrSR(3,i))& "]]></category>" & VbCrLf)
.write("<sponsored><![CDATA[" & (arrSR(13,i))& "]]></sponsored>"& VbCrLf)
.write("</job>" & VbCrLf)
End With 
response.Flush
next
End if 

With response
.write("</source>" & VbCrLf)
End With
%>

Open in new window


I eventually get a

The maximum amount of time for a script to execute was exceeded. You can change this limit by specifying a new value for the property Server.ScriptTimeout or by changing the value in the IIS administration tools.

Error Message

:(

Help ! Please (I need to reduce the resources this script uses and time to load) - thank you
0
 
RimvisCommented:
Hi garethtnash,

>>This loads a new window with -

You should replace it with this code:
using System;
    using System.Data.Sql;
    using System.Data.SqlTypes;
    using System.Collections.Generic;
    using Microsoft.SqlServer.Server;
    using System.Text;
    using System.Text.RegularExpressions;
    public partial class UserDefinedFunctions
    {
        [Microsoft.SqlServer.Server.SqlFunction]
        public static SqlString CleanHTML(SqlString s){
            if (s.IsNull) return String.Empty;
            string s1 = s.ToString().Trim();
            if (s1.Length == 0) return String.Empty;
            StringBuilder tmpS = new StringBuilder(s1.Length);
            //striping out the "control characters"
            if (!Char.IsControl(s1[0])) tmpS.Append(s1[0]);
            for (int i = 1; i <= s1.Length - 1; i++)
            {
                if (Char.IsControl(s1[i]))
                {
                    if (s1[i - 1] != ' ') tmpS.Append(' ');
                }
                else
                {
                    tmpS.Append(s1[i]);
                }
            }
            string result = tmpS.ToString();
            //finding the HTML tags and replacing them with an empty string
            string pattern = @"<[^>]*?>|<[^>]*>";
            Regex rgx = new Regex(pattern);
            return rgx.Replace(result, String.Empty);
        }
    }

Open in new window


Then you should build and deploy it. I don't have a VS2008 at hand, so I cannot describe it step by step, sorry. But this might be handy:
http://msdn.microsoft.com/en-us/library/dahcx0ww%28v=vs.80%29.aspx

After you deploy your function to SQL Server, you use it in your SELECT statement like this:

SELECT A.JBAID, dbo.CleanHTML(A.JBATitle) AS JBATitle ...
0
 
garethtnashAuthor Commented:
HI Rimvis,

Thank you, I'll do that,

Sorry, but one finsal question, will this replace special characters also - so

£ for &pound;
&#163; for &pound;
& for &amp;
&&amp; for &amp;
&amp;&amp; for &amp;

If not how do i add this to the CLR?

Thank you so much for your help.

Also would it be quicker to use a CLR to run the stored procedure that selects the data?

Thank you
0
 
RimvisCommented:
Hi garethtnash,

Do you want to replace "£" with "&pound;"? Or other way around?
0
 
garethtnashAuthor Commented:
Hi Rimvis,

replace
£ with &pound;

so £ becomes &pound;

if you can show me how with a couple of the examples, I'll do the other replacements.

Thank you Rimvis
0
 
RimvisCommented:
Hi garethtnash,


replace
return rgx.Replace(result, String.Empty);

Open in new window

with this
            result = rgx.Replace(result, String.Empty);
            result = result.Replace(@"£", @"&pound;");
            //more replaces here
            return result;

Open in new window

0
 
garethtnashAuthor Commented:
That's fantastic thank you Rimvis,

Just couple of final questions, please. Here is my (or should i say - your) code -

using System;
    using System.Data.Sql;
    using System.Data.SqlTypes;
    using System.Collections.Generic;
    using Microsoft.SqlServer.Server;
    using System.Text;
    using System.Text.RegularExpressions;
    public partial class UserDefinedFunctions
    {
        [Microsoft.SqlServer.Server.SqlFunction]
        public static SqlString CleanHTML(SqlString s){
            if (s.IsNull) return String.Empty;
            string s1 = s.ToString().Trim();
            if (s1.Length == 0) return String.Empty;
            StringBuilder tmpS = new StringBuilder(s1.Length);
            //striping out the "control characters"
            if (!Char.IsControl(s1[0])) tmpS.Append(s1[0]);
            for (int i = 1; i <= s1.Length - 1; i++)
            {
                if (Char.IsControl(s1[i]))
                {
                    if (s1[i - 1] != ' ') tmpS.Append(' ');
                }
                else
                {
                    tmpS.Append(s1[i]);
                }
            }
            string result = tmpS.ToString();
            //finding the HTML tags and replacing them with an empty string
            string pattern = @"<[^>]*?>|<[^>]*>";
            Regex rgx = new Regex(pattern);
            result = rgx.Replace(result, String.Empty);
            result = result.Replace(@"&amp", @"&");
            result = result.Replace(@"&amp;amp;", @"&");
            result = result.Replace(@"&", @"&amp;");
            result = result.Replace(@"£", @"&pound");
            result = result.Replace(@"&#163;", @"&pound");
            result = result.Replace(@"""", @"&quot;");
            result = result.Replace(@"'", @"&apos;");
            //more replaces here
            return result;
        }
    }

Open in new window


Does this look correct to you? In particular this line -

result = result.Replace(@"&amp", @"&");
            result = result.Replace(@"&amp;amp;", @"&");
            result = result.Replace(@"&", @"&amp;");

is firstly I want to make sure there are no html encoded or double encoded ampersands, before encoding the ampersand, I suppose I really should do the same with &pound;, &quot; etc before calling the line -

result = result.Replace(@"&", @"&amp;"); ?

So it would look like  -

            result = rgx.Replace(result, String.Empty);
            result = result.Replace(@"&amp", @"&");
            result = result.Replace(@"&amp;amp;", @"&");
            result = result.Replace(@"&pound", @"£");
            result = result.Replace(@"&#163;", @"£");
            result = result.Replace(@"&quot;", @"""");
            result = result.Replace(@"&apos;", @"'");
            result = result.Replace(@"&", @"&amp;");
            result = result.Replace(@"£", @"&pound");
            result = result.Replace(@"&#163;", @"&pound");
            result = result.Replace(@"""", @"&quot;");
            result = result.Replace(@"'", @"&apos;");

Open in new window


Is that overly complicated, or does it make sense to you? I appreciate your thougts and guidance on this.

Again, thanks for your help.

result = result.Replace(@"""", @"&quot;"); ?

I'm trying to replace all occurances of " with &quot;.

The reason for the three lines
0
 
RimvisCommented:
Hi,

The syntax is OK. But I'm not sure about the logic.
If you do this:
   result.Replace(@"&amp", @"&"),
the next line
   result.Replace(@"&amp;amp;", @"&");
will do nothing, because there won't be any "&amp" left. You should start with the longest strings.
I'm also not sure why you are replacing HTML encoded quotes to normal, and then back to HTML. It does nothing in result.
0
 
garethtnashAuthor Commented:
HI Rimvis,

Thank you.

My logic (or lack of) is, it is highly likely that some of the records contain data where special charaters have been replaced by HTML entity names or numbers (sometimes twice) meaning that it is highly likely that the data returned may contain -

&amp;amp;
&amp;
&pound;
&amp;pound;
&amp,quote;

etc

So what i was hoping to do, before replacing special charaters with HTML entities, was clean the data, so my thought process is --

'Clean all instances of &amp;amp
-- Replace &amp;amp; with &amp;
'Clean all instances of &amp;
-- Replace &amp; with &
Clean all instances of &pound;
-- Replace &pound; with £
Clean all instances of &quote;
-- Replace &quote; with "
Clean all instances of &apos;
-- Replace &apos; with '

Now data is clean, replace special charaters starting with &, to ensure that there is no instance of double encoding (&amp;pound) --

Replace & with &amp;
Replace £ with &pound;
Replace " with &quote;
Replace ' with &apos'

I can see that my changes should be --

result = rgx.Replace(result, String.Empty);
            result = result.Replace(@"&amp;amp;", @"&amp;"); 
            result = result.Replace(@"&amp", @"&");
            result = result.Replace(@"&pound", @"£");
            result = result.Replace(@"&#163;", @"£");
            result = result.Replace(@"&quot;", @"""");
            result = result.Replace(@"&apos;", @"'");
            result = result.Replace(@"&", @"&amp;");
            result = result.Replace(@"£", @"&pound");
            result = result.Replace(@"&#163;", @"&pound");
            result = result.Replace(@"""", @"&quot;");
            result = result.Replace(@"'", @"&apos;");

Open in new window


Does my logic make sense? Is the best way?

Thank you for all your support.
0
 
RimvisCommented:
Yes, now it does make sense to me :)
Except there are a couple of things:

1) I think you are missing semicolons here:
result = result.Replace(@"&amp", @"&");
result = result.Replace(@"&pound", @"£");

Open in new window


Shouldn't it be like this?
result = result.Replace(@"&amp;", @"&");
result = result.Replace(@"&pound;", @"£");

Open in new window



2) I don't think this one is necessary:
result = result.Replace(@"&#163;", @"&pound");

Open in new window


You are already processed it here:
result = result.Replace(@"&#163;", @"£");

Open in new window

0
 
garethtnashAuthor Commented:
Thank you,

Built it, deployed it and tested it, but got a very strange error - strange to me --

Msg 6522, Level 16, State 2, Line 1
A .NET Framework error occurred during execution of user-defined routine or aggregate "CleanHTML":
System.Data.SqlServer.TruncationException: Trying to convert return value or output parameter of size 8090 bytes to a T-SQL type with a smaller size limit of 8000 bytes.
System.Data.SqlServer.TruncationException:
   at System.Data.SqlServer.Internal.CXVariantBase.StringToWSTR(String pstrValue, Int64 cbMaxLength, Int32 iOffset, EPadding ePad)
   at System.Data.SqlServer.Internal.CXVariantBase.SqlStringToWSTR(SqlString strValue, Int64 cbMaxLength, Int32 iOffset, EPadding ePad)

Any thoughts?

Thank you
0
 
garethtnashAuthor Commented:
Thank you Experts, you have been Excellent as ever.<br /><br />Rimvis, if you get any thoughts regarding the error i've recieved, please can you let me know - <br /><br />the field I'm using this on is an Nvarchar(Max) column<br /><br />Thank you All :)
0
 
RimvisCommented:
Hi  garethtnash,

Thanks for the points. Sorry for the delay, I wasn't at the computer for some time.

Try to add this attribute:

        [Microsoft.SqlServer.Server.SqlFunction]
        [return: SqlFacet(MaxSize = -1)]
        public static SqlString CleanHTML(SqlString s){
0
 
garethtnashAuthor Commented:
Hi Rimvis,

No need to thank me for the points, you were worth all 500 plus more, but i needed to spread the points as Pourfards solution waas great and worked also, and DCPKing raised an intestesting solution, but ultimately your commitment to helping me with this has been fantastic, so thank you.

I've applied your change to the CLR and it has worked nicely, thank you.

Although it is still taking 70 seconds to return 25000 records. As i said DCPKing put an interesting perspective on the query, which I've read as run the function at INSERT / UPDATE time instead of at the point of trying to select * the records...

This has made me think about two things, ideally I need the description with all the HTML editing but without style, class, id, title attributes and with proper encoding for the data shown on the site. I also need the data as returned by the CLR function you've given me (with no HTML) for the feed. I could create a description2 column on the table (feeddescription) and insert the description as run through the CLR function into this at insert update, for the feed.

And run a CLR function similar to this one, that keeps the HTML entities but removes the HTML attributes as well as doing all the replace functions that we wrote earlier for the main description column again run at insert /update - for the main data shown on the site....

My only issue here is I'm not sure how to write the regex to remove HTML attributes whilst keeping the HTML entities in MS SQL..

In VBScript, I had something like --

.Pattern = "(<[a-z][^ >/]*)(?:[^>/]|/(?!>))*"	 ' keep all HTML Entities but remove attributes 
    strInput = .Replace(strInput, "$1")

Open in new window


AND

    .Pattern = "\r\n|\r|\n"
    strInput = .Replace(strInput, "<br />")          ' replace any carriage returns with either <br />

Open in new window


AND

    strInput = Replace(strInput, "  ", " ")	         ' removes all ... (tab etc)
	strInput = Replace(strInput, vbTab, "")	         ' removes all ... (tab etc)
	strInput = Replace(strInput, CHAR(9)," ")		' replace tabs with a space 

Open in new window


So remove HTML attributes, replace line breaks with < br/> and remove tabs;

I wonder whether I can be really cheeky,.. and ask you if you would mind having a look at the regex needed for the second CLR?

There are 500 points to be awarded, as I have raised another question here >>

Sorry, i just think do the clean during INSERT / UPLOAD is probably a lot more efficient, but for the description shown on the site I need to keep the html formatting, just without allowing publishers to dictate style etc..

Again, I can't thank you enough, you have been absolutely fantastic.

THank you :)
0
 
RimvisCommented:
Hi,
Actually, I'm on holidays right now. I will be back at PC in a couple of days. If you still will be needing help by then, I might take a look.
0
 
DcpKingCommented:
Thanks for the points - although I wasn't really expecting them, as it was fairly clear that the others were ahead of me in figuring out a way of actually identifying the strings to remove.
0
 
garethtnashAuthor Commented:
Hi Rimvis, hope you are having a good break, thanks for your feedback.
0

Featured Post

How to Use the Help Bell

Need to boost the visibility of your question for solutions? Use the Experts Exchange Help Bell to confirm priority levels and contact subject-matter experts for question attention.  Check out this how-to article for more information.

  • 15
  • 11
  • 2
  • +2
Tackle projects and never again get stuck behind a technical roadblock.
Join Now