Go Premium for a chance to win a PS4. Enter to Win


T-SQL code for Stripping HTML Tags from Text Field during T-SQL Conversion

Posted on 2004-04-03
Medium Priority
Last Modified: 2007-12-19
I have two tables (defined below). One is called old_KB and the other is KB. I want to convert some of the data over from old_KB to KB, but I want to manipulate some of the data during the conversion. I have a text field called "HTML" inside the old_KB table. I want that data to be stored to the "HTML" field of the KB table and I want a version of the "HTML" field with all of the HTML tags eliminated, moved to the "search_text" field of the KB table. I need the Transact SQL code for this conversion. I DO NOT want to type cast to a varchar. I want the TEXT fields kept intact. I know that I need to use text pointers, but I'm not sure how to use them. Anyways, I need the Transact SQL Code to do this as soon as you can. Thanks.

The data conversion has the following map:

old_KB.id -> KB.id
old_KB.Description -> KB.Name
old_KB.HTML (manipulated w/ HTML tags removed) -> KB.search_text
KB.Notes = NULL

Defined Tables:

CREATE TABLE [dbo].[old_KB] (
      [ID] [int] NOT NULL ,
      [Prob] [text] COLLATE SQL_Latin1_General_CP1_CI_AS NULL ,
      [Solution] [text] COLLATE SQL_Latin1_General_CP1_CI_AS NULL ,
      [Description] [varchar] (100) COLLATE SQL_Latin1_General_CP1_CI_AS NULL ,
      [HTML] [text] COLLATE SQL_Latin1_General_CP1_CI_AS NULL ,
      [ModifiedBy] [varchar] (100) COLLATE SQL_Latin1_General_CP1_CI_AS NULL ,
      [CreatedBy] [varchar] (100) COLLATE SQL_Latin1_General_CP1_CI_AS NULL ,
      [ModifyDate] [datetime] NULL ,
      [CreateDate] [datetime] NULL ,
      [Keywords] [nvarchar] (4000) COLLATE SQL_Latin1_General_CP1_CI_AS NULL

      [ID] [int] NULL ,
      [Name] [varchar] (2000) COLLATE SQL_Latin1_General_CP1_CI_AS NULL ,
      [HTML] [text] COLLATE SQL_Latin1_General_CP1_CI_AS NULL ,
      [search_text] [text] COLLATE SQL_Latin1_General_CP1_CI_AS NULL ,
      [Notes] [text] COLLATE SQL_Latin1_General_CP1_CI_AS NULL

Question by:yodasan000

Expert Comment

ID: 10751402

how about a regular expression? If you create a function to replace strings using regular expressions, you can do a simple mapping using a "select into" statement

for instance, udf called "dbo.regexreplace (input, replacepattern, replacestring, global, ignorecase) returns replace'd string

and you write something along the lines of
select replace(html, '<.*?>','') ..... from old_kb into kb

easy as pie - if you have access to user defined functions. And it's a nice and handy function to have for other text-substitutions where the replace-function is inadequate.

-- Declarartion of function
-- found on SQLServers Forum. Credit where credit is due.
CREATE FUNCTION [dbo].regexreplace
@input varchar(8000),
@regex varchar(300),
@replacestring varchar(1000),
@ignorecase bit,
@global bit
RETURNS varchar(5000) AS  

      -- output hr
      declare @hr  int
      declare @objRegEx int
      execute  @hr = sp_OACreate 'VBScript.RegExp', @objRegEx out
      if @hr <> 0  begin
            return 'create error'

      execute @hr = sp_OASetProperty @objRegEx, 'pattern',@regex
      if @hr <> 0 begin
            execute sp_OADestroy @objRegEx
            return 'pattern error'

      execute @hr = sp_OASetProperty @objRegEx,'Global',@global
      if @hr <> 0 begin
            execute sp_OADestroy @objRegEx
            return 'Global error'

      execute @hr = sp_OASetProperty @objRegEx,'Ignorecase',@ignorecase
            if @hr <> 0 begin
            execute sp_OADestroy @objRegEx
            return 'ignorecase error'
      declare @result varchar(5000)
      execute @hr = sp_OAMethod @objRegEx, 'Replace',@result out,@input, @replacestring
      execute sp_OADestroy @objRegEx

      if @hr <> 0 begin
            return 'method call failed'
      return @result


To use this, you need access to use sp_OACreate/property/method - ask your dba - then simply select dbo.regexreplace (html,'<.*?>' ,'',1,1) which does a global replace on any '<..................>' found in the html ( I guess you could refine this a bit)

declare @str varchar(100)
set @str = '<html><body onload="alert(''hi'');">Hello world</body></html>'

declare @regex varchar(100)
set @regex = '<.*?>'
set @replacewith = '' -- ie. nothing
declare @output varchar(100)
select dbo.regexreplace(@str,@regex,@replacewith,1,1)

Hello world

good luck
LVL 34

Expert Comment

ID: 10752490
like the question asker already stated, you can't use simple replace functions on TEXT types....
LVL 13

Accepted Solution

danblake earned 2000 total points
ID: 10753108
I DO NOT want to type cast to a varchar. I want the TEXT fields kept intact.

Do you have any problem in bulk extracting the file with the TEXT file HTML & your PK: ID, bulk inputing the file and running a perl or similar process on the file ?
It keeps the file in TEXT format (we can easily bulk insert/extract the data by using bcp).

(It will get very messy performing this with pure T-SQL -- its not very good at this kind of thing...)

I noticed that you are a perl programmer... this thing is better done in perl than T-sql (sed is available as a MS Dos based version similar to perl and should allow you to strip the < ...varlength.... > items from the file.  Alternativly run on a <*> <*> <***> replace on the file of incrementing lengths to allow stripping of the file.)

Sed is available from: http://www.student.northpark.edu/pemente/sed/

We could probably come up with a T-SQL equivelent, but this is likely to be slow.  It will need to read in a character at a time until < is encountered, then carry on reading a character till a > is encountered without encountering another < inbetween or whatever our rules are going to be (Ensuring that a properly formatted string is available).  This is just an idea...

To do this with T-SQL txt pointers refer to the following functions:
output = substring (txtexpression, start_pos, end_pos)  unfortunatly for you it returns this data as a varchar !

OR we can do this with the following T-SQL PATINDEX function:
SELECT PATINDEX('%<%>%', txtexpression)  This will return the start position of the string, that you can search for the > expression of the html.

such as using:
select patindex ('%>%',txtexpression)
from <table_name>
where PATINDEX('%<%>%', txtexpression) < patindex ('%>%',txtexpression)
this will grab the last point of the string hopefully ensuring you can splice the start -> end positions from the HTML reference file.

You can then Return specific text data (Txt Ptr stuff):
This example locates the text column (pr_info) associated with pub_id 0736 in the pub_info table of the pubs database. It first declares the local variable @val. The text pointer (a long binary string) is then put into @val and supplied as a parameter to the READTEXT statement, which returns 10 bytes starting at the fifth byte (offset of 4).

USE pubs
DECLARE @val varbinary(16)
SELECT @val = TEXTPTR(pr_info)
FROM pub_info
WHERE pub_id = '0736'
READTEXT pub_info.pr_info @val 4 10

Personally I would grab all the start->end blocks of the html function into a temporary table using PATINDEX and chop the longest length strings (agregating the results from the temp table to remove multiple HTML tags placed end-to-end) from the text field for every row.  Its very messy, a lot easier to perform in perl or a similar grep engine.


Expert Comment

ID: 10755525

...my bad about the typecasting. I usually aviod BLOB-fields at any cost, so I wasn't aware of the implications regarding functions and the like...

but given that you don't have 8000 bytes or more of html, the function still works- and you can put it back in a TEXT-field. (modify returnval to varchar(8000)). otherwise you could use the TEXTPTR to split the text into suitable 8k parts (modify return to varchar(8000))

Now, I don't know if this screws up some internal format of the string due to the implicit cast to varchar?

-- test sql script

create table #ee
( i int,
thtml text) -- html in text-field, not varchar

insert into #ee
values (1,'<html><body>Hello World</body></html>')

insert into #ee
select 2,dbo.regexreplace(thtml,'<.*?>','',1,1) from #ee

select * from #ee
drop table #ee

1      <html><body>Hello World</body></html>
2      Hello World


Author Comment

ID: 10762099
Just as a note ... this is a corporate database and I only have access to transact-sql code. I will try to use your guy's pointers and try to write some code for it. If I get anywhere with it then I'll give the points to whoever's tips I used to get it to work. Seems like danblake has some good tips, but I can't give points until I'm sure I've got a solution. If anyone can code the actual T-SQL code ... that would be awesome. But, as I mentioned again, I can only use t-sql code and I don't want to typecast to varchar because most of the data is over 8000 bytes. Thanks.

Featured Post

Important Lessons on Recovering from Petya

In their most recent webinar, Skyport Systems explores ways to isolate and protect critical databases to keep the core of your company safe from harm.

Question has a verified solution.

If you are experiencing a similar issue, please ask a related question

What if you have to shut down the entire Citrix infrastructure for hardware maintenance, software upgrades or "the unknown"? I developed this plan for "the unknown" and hope that it helps you as well. This article explains how to properly shut down …
When trying to connect from SSMS v17.x to a SQL Server Integration Services 2016 instance or previous version, you get the error “Connecting to the Integration Services service on the computer failed with the following error: 'The specified service …
This video shows, step by step, how to configure Oracle Heterogeneous Services via the Generic Gateway Agent in order to make a connection from an Oracle session and access a remote SQL Server database table.
Viewers will learn how to use the SELECT statement in SQL and will be exposed to the many uses the SELECT statement has.

876 members asked questions and received personalized solutions in the past 7 days.

Join the community of 500,000 technology professionals and ask your questions.

Join & Ask a Question