Link to home
Start Free TrialLog in
Avatar of computergiants
computergiants

asked on

SQL statement to convert HTML characters to ASCII for eventual output to XML file

Hello Experts,

Anyone have a ready-made sql statement to take data from a column that is populated with HTML formatted text and convert it into ASCII characters?

I could build my own but I don't want to miss any characters. 500 pts!!

Thanks,
Sean
Avatar of computergiants
computergiants

ASKER

This is for SQL Server 2000
Avatar of Aneesh
That proc strips the tags out. I wish to leave the formatting in but translate it to ASCII. This data will eventually be used to create an XML file for a product feed to be displayed on a resellers website (not my own). Therefore it must keep the HTML tags but convert any XML non-conforming characters to ASCII.
I haven't done this so far, hope some oneelse will help you
I am thinking of something stupid, like

Update       HSNMivaProducts
Set      Product_Desc = replace(Product_Desc, '<b>', '&lt;b&gt;')

I just don't want to have to put in all of the possible HTML tags manually. I was hoping someone else had it done. If no one else does, I will post it here later today.
Take a look at this:

CREATE PROCEDURE spShowHTMLNoTags @HTML text AS

     DECLARE @HTMLLen int,
          @HTMLPos int,
          @HTMLChar char,
          @HTMLStr varchar(7168),
          @InTag bit
     
     SELECT @HTMLLen = DATALENGTH(@HTML),
          @HTMLPos = 1,
          @InTag = 0,
          @HTMLStr = ''

     CREATE TABLE #HTMLNoTags (HTMLNoTags text)

     WHILE @HTMLPos < @HTMLLen
     BEGIN
          SET @HTMLChar = SUBSTRING(@HTML, @HTMLPos, 1)

          IF @HTMLChar = '>' AND @InTag = 1 SELECT @InTag = 0
          ELSE
          BEGIN
               IF @HTMLChar = '<' SELECT @InTag = 1
               IF @InTag = 0
                    SELECT @HTMLStr = @HTMLStr + @HTMLChar
          END

          SET @HTMLPos = @HTMLPos + 1
     END
     INSERT INTO #HTMLNoTags VALUES (@HTMLStr)

     SELECT * FROM #HTMLNoTags
     DROP TABLE #HTMLNoTags

RETURN
GO



CREATE TABLE Pages (HTML text)
INSERT INTO Pages VALUES (<<<insert some html here>>>)
DECLARE @ht varchar(7168)
SELECT @ht = HTML FROM Pages
EXEC spShowHTMLNoTags @ht


computergiants:

You can ignore the previous messages and take a look at this one first:

Why do you want to convert to ASCII ?
"&lt" is not ASCII characters for "<"

If you just want to display the data as XML just replace the special characters in XML which are:
         **  & - &amp;
         **  < - &lt;
         **  > - &gt;
         **  " - &quot;
         **  ' - &#39;
         **  ' - &apos;

So you dont have to change '<b>' to '&lt;b&gt;'
You simply search and replace all '<' with '&lt;' and replace all '>' with '&gt;' and everything in between should stay the same.



Try the following which will replace the <,> and ". You can add more replace statements to take care of other special characters:

Case When Product_Desc <> '' THEN REPLACE(REPLACE(REPLACE(Product_Desc, '<' , '&lt;'), '>' , '&gt;'), '"' , '&quot;') Else Product_Desc End
Anyone who told you that by replacing specific characters in an HTML page would convert it to a well-formed Xml document was seriously delusional.  It is a much more complex task than that and best asked in a more appropriate Topic Area such as:
https://www.experts-exchange.com/Web/Web_Languages/HTML/
https://www.experts-exchange.com/Web/Web_Languages/XML/
acperkins -

What's so complicated about it? I have done this and it works fine.
You make the HTML as a Freindly XML and you have it in a XML node or a  "<![CDATA["  and it works perfect.

All he wants to do is to be able to have the HTML code in XML so he can retrive it and display it as HTML, and I don't see why you think I'm delusional.
>>You make the HTML as a Freindly XML and you have it in a XML node or a  "<![CDATA["  and it works perfect.<<
Excellent point.

>>and I don't see why you think I'm delusional.<<
You are getting paranoid, I was not even thinking of you.  But it reminds me of the joke about the teacher who walks into a class and asks:  "Who was the idiot who wrote on the blackboard?".  From the back of the class, a student raises his hand:  "Sir, why are you calling me an idiot?" :)
I'm really not paranoide and I don't take this personal at all. As a matter of fact I have learned SQL Server from EE and mostly from your posts at EE in situations like this. So I really appriciate any critsisem from the top experts like you. I always try to improve myself and find better ways of doing things, and thats why I was just looking for the best answer out there....
Thank you Apirnia for your detailed code and numerous tips. It is very much appreciated. I have taken that approach with the replace statements to search for all instances of non-XML compliant characters and replacing them with ASCII or sometimes HTML symbols.

I knew about CDATA before posting this, as I have done that for other website databases, but unfortunately this reseller does not allow CDATA tags. Great tip though.

Acperkins: if you have any specific ideas about how to solve this problem any tips would be helpful. Right now I have a very long and unsophisticated set of sql statements.

To be specific, the requirements stress that any & be converted to &amp;  so any html characters such as ™ must be converted to &amp;#153;     NOT &#153;

So what I have done is to FIRST find and replace all &amp; with &.

Next I find all those funky characters and replace them with their ascii or html codes.

Finally I replace the & with &amp; which will achieve the desired result. Not an elegant solution but I think it will work. I have to turn in the XML file later today so we will see. Any futher tips or refinement of this code is appreciated.

My only real question is how to handle the # sign. Since this data already has some ascii codes in it, if I do a replace on # and substitute &#35; will this mess up those tags?

Also, how do I replace the single quote character '  since it is a qualifier in SQL Server? I was thinking about doing a find/replace in a text editor prior to importing it into the db, but I don't like that solution. I can't automate it with DTS.

Update       HSNMivaProducts
Set      Product_Desc = replace(Product_Desc, '&amp;', '&')

Update       HSNMivaProducts
Set      Product_Desc = replace(Product_Desc, '&#13;&#10;', '')

Update       HSNMivaProducts
Set      Product_Desc = replace(Product_Desc, ''', '&#39;')

Update       HSNMivaProducts
Set      Product_Desc = replace(Product_Desc, '<', '&lt;')

Update       HSNMivaProducts
Set      Product_Desc = replace(Product_Desc, '>', '&gt;')

Update       HSNMivaProducts
Set      Product_Desc = replace(Product_Desc, '"', '&quot;')

Update       HSNMivaProducts
Set      Product_Desc = replace(Product_Desc, '\r\n\r\n\r\n', '')

Update       HSNMivaProducts
Set      Product_Desc = replace(Product_Desc, '\r\n\r\n\r', '')

Update       HSNMivaProducts
Set      Product_Desc = replace(Product_Desc, '\r\n\r\n', '')

Update       HSNMivaProducts
Set      Product_Desc = replace(Product_Desc, '\r\n\r', '')

Update       HSNMivaProducts
Set      Product_Desc = replace(Product_Desc, '\r\n', '')

Update       HSNMivaProducts
Set      Product_Desc = replace(Product_Desc, '!', '&#33;')

Update       HSNMivaProducts
Set      Product_Desc = replace(Product_Desc, '%', '&#37;')

Update       HSNMivaProducts
Set      Product_Desc = replace(Product_Desc, '*', '&#42;')

Update       HSNMivaProducts
Set      Product_Desc = replace(Product_Desc, '+', '&#43;')

Update       HSNMivaProducts
Set      Product_Desc = replace(Product_Desc, ',', '&#44;')

Update       HSNMivaProducts
Set      Product_Desc = replace(Product_Desc, '@', '&#64;')

Update       HSNMivaProducts
Set      Product_Desc = replace(Product_Desc, '[', '&#91;')

Update       HSNMivaProducts
Set      Product_Desc = replace(Product_Desc, '\', '&#92;')

Update       HSNMivaProducts
Set      Product_Desc = replace(Product_Desc, ']', '&#93;')

Update       HSNMivaProducts
Set      Product_Desc = replace(Product_Desc, '^', '&#94;')

Update       HSNMivaProducts
Set      Product_Desc = replace(Product_Desc, '_', '&#95;')

Update       HSNMivaProducts
Set      Product_Desc = replace(Product_Desc, '`', '&#96;')

Update       HSNMivaProducts
Set      Product_Desc = replace(Product_Desc, '{', '&#123;')

Update       HSNMivaProducts
Set      Product_Desc = replace(Product_Desc, '|', '&#124;')

Update       HSNMivaProducts
Set      Product_Desc = replace(Product_Desc, '}', '&#125;')

Update       HSNMivaProducts
Set      Product_Desc = replace(Product_Desc, '~', '&#126;')

Update       HSNMivaProducts
Set      Product_Desc = replace(Product_Desc, '€', '&#128;')

Update       HSNMivaProducts
Set      Product_Desc = replace(Product_Desc, '‚', '&#130;')

Update       HSNMivaProducts
Set      Product_Desc = replace(Product_Desc, 'ƒ', '&#131;')

Update       HSNMivaProducts
Set      Product_Desc = replace(Product_Desc, '„', '&#132;')

Update       HSNMivaProducts
Set      Product_Desc = replace(Product_Desc, '…', '&#133;')

Update       HSNMivaProducts
Set      Product_Desc = replace(Product_Desc, '†', '&#134;')

Update       HSNMivaProducts
Set      Product_Desc = replace(Product_Desc, '‡', '&#135;')

Update       HSNMivaProducts
Set      Product_Desc = replace(Product_Desc, 'ˆ', '&#136;')

Update       HSNMivaProducts
Set      Product_Desc = replace(Product_Desc, '‰', '&#137;')

Update       HSNMivaProducts
Set      Product_Desc = replace(Product_Desc, 'Š', '&#138;')

Update       HSNMivaProducts
Set      Product_Desc = replace(Product_Desc, '‹', '&#139;')

Update       HSNMivaProducts
Set      Product_Desc = replace(Product_Desc, 'Œ', '&#140;')

Update       HSNMivaProducts
Set      Product_Desc = replace(Product_Desc, 'Ž', '&#142;')

Update       HSNMivaProducts
Set      Product_Desc = replace(Product_Desc, '‘', '&#145;')

Update       HSNMivaProducts
Set      Product_Desc = replace(Product_Desc, '’', '&#146;')

Update       HSNMivaProducts
Set      Product_Desc = replace(Product_Desc, '“', '&#147;')

Update       HSNMivaProducts
Set      Product_Desc = replace(Product_Desc, '”', '&#148;')

Update       HSNMivaProducts
Set      Product_Desc = replace(Product_Desc, '•', '&#149;')

Update       HSNMivaProducts
Set      Product_Desc = replace(Product_Desc, '–', '&#150;')

Update       HSNMivaProducts
Set      Product_Desc = replace(Product_Desc, '—', '&#151;')

Update       HSNMivaProducts
Set      Product_Desc = replace(Product_Desc, '˜', '&#152;')

Update       HSNMivaProducts
Set      Product_Desc = replace(Product_Desc, '™', '&#153;')

Update       HSNMivaProducts
Set      Product_Desc = replace(Product_Desc, 'š', '&#154;')

Update       HSNMivaProducts
Set      Product_Desc = replace(Product_Desc, '›', '&#155;')

Update       HSNMivaProducts
Set      Product_Desc = replace(Product_Desc, 'œ', '&#156;')

Update       HSNMivaProducts
Set      Product_Desc = replace(Product_Desc, 'ž', '&#157;')

Update       HSNMivaProducts
Set      Product_Desc = replace(Product_Desc, 'Ÿ', '&#158;')

Update       HSNMivaProducts
Set      Product_Desc = replace(Product_Desc, '¡', '&#161;')

Update       HSNMivaProducts
Set      Product_Desc = replace(Product_Desc, '¢', '&#162;')

Update       HSNMivaProducts
Set      Product_Desc = replace(Product_Desc, '£', '&#163;')

Update       HSNMivaProducts
Set      Product_Desc = replace(Product_Desc, '¤', '&#164;')

Update       HSNMivaProducts
Set      Product_Desc = replace(Product_Desc, '¥', '&#165;')

Update       HSNMivaProducts
Set      Product_Desc = replace(Product_Desc, '¦', '&#166;')

Update       HSNMivaProducts
Set      Product_Desc = replace(Product_Desc, '§', '&#167;')

Update       HSNMivaProducts
Set      Product_Desc = replace(Product_Desc, '¨', '&#168;')

Update       HSNMivaProducts
Set      Product_Desc = replace(Product_Desc, '©', '&#169;')

Update       HSNMivaProducts
Set      Product_Desc = replace(Product_Desc, 'ª', '&#170;')

Update       HSNMivaProducts
Set      Product_Desc = replace(Product_Desc, '«', '&#171;')

Update       HSNMivaProducts
Set      Product_Desc = replace(Product_Desc, '¬', '&#172;')

Update       HSNMivaProducts
Set      Product_Desc = replace(Product_Desc, '­', '&#173;')

Update       HSNMivaProducts
Set      Product_Desc = replace(Product_Desc, '®', '&#174;')

Update       HSNMivaProducts
Set      Product_Desc = replace(Product_Desc, '¯', '&#175;')

Update       HSNMivaProducts
Set      Product_Desc = replace(Product_Desc, '°', '&#176;')

Update       HSNMivaProducts
Set      Product_Desc = replace(Product_Desc, '±', '&#177;')

Update       HSNMivaProducts
Set      Product_Desc = replace(Product_Desc, '²', '&#178;')

Update       HSNMivaProducts
Set      Product_Desc = replace(Product_Desc, '³', '&#179;')

Update       HSNMivaProducts
Set      Product_Desc = replace(Product_Desc, '´', '&#180;')

Update       HSNMivaProducts
Set      Product_Desc = replace(Product_Desc, 'µ', '&#181;')

Update       HSNMivaProducts
Set      Product_Desc = replace(Product_Desc, '¶', '&#182;')

Update       HSNMivaProducts
Set      Product_Desc = replace(Product_Desc, '·', '&#183;')

Update       HSNMivaProducts
Set      Product_Desc = replace(Product_Desc, '¸', '&#184;')

Update       HSNMivaProducts
Set      Product_Desc = replace(Product_Desc, '¹', '&#185;')

Update       HSNMivaProducts
Set      Product_Desc = replace(Product_Desc, 'º', '&#186;')

Update       HSNMivaProducts
Set      Product_Desc = replace(Product_Desc, '»', '&#187;')

Update       HSNMivaProducts
Set      Product_Desc = replace(Product_Desc, '¼', '&#188;')

Update       HSNMivaProducts
Set      Product_Desc = replace(Product_Desc, '½', '&#189;')

Update       HSNMivaProducts
Set      Product_Desc = replace(Product_Desc, '¾', '&#190;')

Update       HSNMivaProducts
Set      Product_Desc = replace(Product_Desc, '¿', '&#191;')

Update       HSNMivaProducts
Set      Product_Desc = replace(Product_Desc, 'À', '&#192;')

Update       HSNMivaProducts
Set      Product_Desc = replace(Product_Desc, 'Á', '&#193;')

Update       HSNMivaProducts
Set      Product_Desc = replace(Product_Desc, 'Â', '&#194;')

Update       HSNMivaProducts
Set      Product_Desc = replace(Product_Desc, 'Ã', '&#195;')

Update       HSNMivaProducts
Set      Product_Desc = replace(Product_Desc, 'Ä', '&#196;')

Update       HSNMivaProducts
Set      Product_Desc = replace(Product_Desc, 'Å', '&#197;')

Update       HSNMivaProducts
Set      Product_Desc = replace(Product_Desc, 'Æ', '&#198;')

Update       HSNMivaProducts
Set      Product_Desc = replace(Product_Desc, 'Ç', '&#199;')

Update       HSNMivaProducts
Set      Product_Desc = replace(Product_Desc, 'È', '&#200;')

Update       HSNMivaProducts
Set      Product_Desc = replace(Product_Desc, 'É', '&#201;')

Update       HSNMivaProducts
Set      Product_Desc = replace(Product_Desc, 'Ê', '&#202;')

Update       HSNMivaProducts
Set      Product_Desc = replace(Product_Desc, 'Ë', '&#203;')

Update       HSNMivaProducts
Set      Product_Desc = replace(Product_Desc, 'Ì', '&#204;')

Update       HSNMivaProducts
Set      Product_Desc = replace(Product_Desc, 'Í', '&#205;')

Update       HSNMivaProducts
Set      Product_Desc = replace(Product_Desc, 'Î', '&#206;')

Update       HSNMivaProducts
Set      Product_Desc = replace(Product_Desc, 'Ï', '&#207;')

Update       HSNMivaProducts
Set      Product_Desc = replace(Product_Desc, 'Ð', '&#208;')

Update       HSNMivaProducts
Set      Product_Desc = replace(Product_Desc, 'Ñ', '&#209;')

Update       HSNMivaProducts
Set      Product_Desc = replace(Product_Desc, 'Ò', '&#210;')

Update       HSNMivaProducts
Set      Product_Desc = replace(Product_Desc, 'Ó', '&#211;')

Update       HSNMivaProducts
Set      Product_Desc = replace(Product_Desc, 'Ô', '&#212;')

Update       HSNMivaProducts
Set      Product_Desc = replace(Product_Desc, 'Õ', '&#213;')

Update       HSNMivaProducts
Set      Product_Desc = replace(Product_Desc, 'Ö', '&#214;')

Update       HSNMivaProducts
Set      Product_Desc = replace(Product_Desc, '×', '&#215;')

Update       HSNMivaProducts
Set      Product_Desc = replace(Product_Desc, 'Ø', '&#216;')

Update       HSNMivaProducts
Set      Product_Desc = replace(Product_Desc, 'Ù', '&#217;')

Update       HSNMivaProducts
Set      Product_Desc = replace(Product_Desc, 'Ú', '&#218;')

Update       HSNMivaProducts
Set      Product_Desc = replace(Product_Desc, 'Û', '&#219;')

Update       HSNMivaProducts
Set      Product_Desc = replace(Product_Desc, 'Ü', '&#220;')

Update       HSNMivaProducts
Set      Product_Desc = replace(Product_Desc, 'Ý', '&#221;')

Update       HSNMivaProducts
Set      Product_Desc = replace(Product_Desc, 'Þ', '&#222;')

Update       HSNMivaProducts
Set      Product_Desc = replace(Product_Desc, 'ß', '&#223;')

Update       HSNMivaProducts
Set      Product_Desc = replace(Product_Desc, 'à', '&#224;')

Update       HSNMivaProducts
Set      Product_Desc = replace(Product_Desc, 'á', '&#225;')

Update       HSNMivaProducts
Set      Product_Desc = replace(Product_Desc, 'â', '&#226;')

Update       HSNMivaProducts
Set      Product_Desc = replace(Product_Desc, 'ã', '&#227;')

Update       HSNMivaProducts
Set      Product_Desc = replace(Product_Desc, 'ä', '&#228;')

Update       HSNMivaProducts
Set      Product_Desc = replace(Product_Desc, 'å', '&#229;')

Update       HSNMivaProducts
Set      Product_Desc = replace(Product_Desc, 'æ', '&#230;')

Update       HSNMivaProducts
Set      Product_Desc = replace(Product_Desc, 'ç', '&#231;')

Update       HSNMivaProducts
Set      Product_Desc = replace(Product_Desc, 'è', '&#232;')

Update       HSNMivaProducts
Set      Product_Desc = replace(Product_Desc, 'é', '&#233;')

Update       HSNMivaProducts
Set      Product_Desc = replace(Product_Desc, 'ê', '&#234;')

Update       HSNMivaProducts
Set      Product_Desc = replace(Product_Desc, 'ë', '&#235;')

Update       HSNMivaProducts
Set      Product_Desc = replace(Product_Desc, 'ì', '&#236;')

Update       HSNMivaProducts
Set      Product_Desc = replace(Product_Desc, 'í', '&#237;')

Update       HSNMivaProducts
Set      Product_Desc = replace(Product_Desc, 'î', '&#238;')

Update       HSNMivaProducts
Set      Product_Desc = replace(Product_Desc, '', '&#239;')

Update       HSNMivaProducts
Set      Product_Desc = replace(Product_Desc, 'ï', '&#239;')

Update       HSNMivaProducts
Set      Product_Desc = replace(Product_Desc, 'ð', '&#240;')

Update       HSNMivaProducts
Set      Product_Desc = replace(Product_Desc, 'ñ', '&#241;')

Update       HSNMivaProducts
Set      Product_Desc = replace(Product_Desc, 'ò', '&#242;')

Update       HSNMivaProducts
Set      Product_Desc = replace(Product_Desc, 'ó', '&#243;')

Update       HSNMivaProducts
Set      Product_Desc = replace(Product_Desc, 'ô', '&#244;')

Update       HSNMivaProducts
Set      Product_Desc = replace(Product_Desc, 'õ', '&#245;')

Update       HSNMivaProducts
Set      Product_Desc = replace(Product_Desc, 'ö', '&#246;')

Update       HSNMivaProducts
Set      Product_Desc = replace(Product_Desc, '÷', '&#247;')

Update       HSNMivaProducts
Set      Product_Desc = replace(Product_Desc, 'ø', '&#248;')

Update       HSNMivaProducts
Set      Product_Desc = replace(Product_Desc, 'ù', '&#249;')

Update       HSNMivaProducts
Set      Product_Desc = replace(Product_Desc, 'ú', '&#250;')

Update       HSNMivaProducts
Set      Product_Desc = replace(Product_Desc, 'û', '&#251;')

Update       HSNMivaProducts
Set      Product_Desc = replace(Product_Desc, 'ü', '&#252;')

Update       HSNMivaProducts
Set      Product_Desc = replace(Product_Desc, 'ý', '&#253;')

Update       HSNMivaProducts
Set      Product_Desc = replace(Product_Desc, 'þ', '&#254;')

Update       HSNMivaProducts
Set      Product_Desc = replace(Product_Desc, 'ÿ', '&#255;')

Update       HSNMivaProducts
Set      Product_Desc = replace(Product_Desc, '&', '&amp;')

You can remove the single quotes like this:
replace(sValue, '''', '')


=> above are all single quotes and no double quote( ' + ' + ' + ' , '+')
ASKER CERTIFIED SOLUTION
Avatar of Anthony Perkins
Anthony Perkins
Flag of United States of America image

Link to home
membership
This solution is only available to members.
To access this solution, you must be a member of Experts Exchange.
Start Free Trial
Thanks for tightening up the code! I am implementing this today!