T-SQL code for Stripping HTML Tags from Text Field during T-SQL Conversion

I have two tables (defined below). One is called old_KB and the other is KB. I want to convert some of the data over from old_KB to KB, but I want to manipulate some of the data during the conversion. I have a text field called "HTML" inside the old_KB table. I want that data to be stored to the "HTML" field of the KB table and I want a version of the "HTML" field with all of the HTML tags eliminated, moved to the "search_text" field of the KB table. I need the Transact SQL code for this conversion. I DO NOT want to type cast to a varchar. I want the TEXT fields kept intact. I know that I need to use text pointers, but I'm not sure how to use them. Anyways, I need the Transact SQL Code to do this as soon as you can. Thanks.

The data conversion has the following map:

old_KB.id -> KB.id
old_KB.Description -> KB.Name
old_KB.HTML -> KB.HTML
old_KB.HTML (manipulated w/ HTML tags removed) -> KB.search_text
KB.Notes = NULL


Defined Tables:

CREATE TABLE [dbo].[old_KB] (
      [ID] [int] NOT NULL ,
      [Prob] [text] COLLATE SQL_Latin1_General_CP1_CI_AS NULL ,
      [Solution] [text] COLLATE SQL_Latin1_General_CP1_CI_AS NULL ,
      [Description] [varchar] (100) COLLATE SQL_Latin1_General_CP1_CI_AS NULL ,
      [HTML] [text] COLLATE SQL_Latin1_General_CP1_CI_AS NULL ,
      [ModifiedBy] [varchar] (100) COLLATE SQL_Latin1_General_CP1_CI_AS NULL ,
      [CreatedBy] [varchar] (100) COLLATE SQL_Latin1_General_CP1_CI_AS NULL ,
      [ModifyDate] [datetime] NULL ,
      [CreateDate] [datetime] NULL ,
      [Keywords] [nvarchar] (4000) COLLATE SQL_Latin1_General_CP1_CI_AS NULL
) ON [PRIMARY] TEXTIMAGE_ON [PRIMARY]


CREATE TABLE [dbo].[KB] (
      [ID] [int] NULL ,
      [Name] [varchar] (2000) COLLATE SQL_Latin1_General_CP1_CI_AS NULL ,
      [HTML] [text] COLLATE SQL_Latin1_General_CP1_CI_AS NULL ,
      [search_text] [text] COLLATE SQL_Latin1_General_CP1_CI_AS NULL ,
      [Notes] [text] COLLATE SQL_Latin1_General_CP1_CI_AS NULL
) ON [PRIMARY] TEXTIMAGE_ON [PRIMARY]

yodasan000Asked:
Who is Participating?

[Webinar] Streamline your web hosting managementRegister Today

x
 
danblakeConnect With a Mentor Commented:
I DO NOT want to type cast to a varchar. I want the TEXT fields kept intact.

Do you have any problem in bulk extracting the file with the TEXT file HTML & your PK: ID, bulk inputing the file and running a perl or similar process on the file ?
It keeps the file in TEXT format (we can easily bulk insert/extract the data by using bcp).

(It will get very messy performing this with pure T-SQL -- its not very good at this kind of thing...)

I noticed that you are a perl programmer... this thing is better done in perl than T-sql (sed is available as a MS Dos based version similar to perl and should allow you to strip the < ...varlength.... > items from the file.  Alternativly run on a <*> <*> <***> replace on the file of incrementing lengths to allow stripping of the file.)

Sed is available from: http://www.student.northpark.edu/pemente/sed/

We could probably come up with a T-SQL equivelent, but this is likely to be slow.  It will need to read in a character at a time until < is encountered, then carry on reading a character till a > is encountered without encountering another < inbetween or whatever our rules are going to be (Ensuring that a properly formatted string is available).  This is just an idea...

To do this with T-SQL txt pointers refer to the following functions:
output = substring (txtexpression, start_pos, end_pos)  unfortunatly for you it returns this data as a varchar !

OR we can do this with the following T-SQL PATINDEX function:
SELECT PATINDEX('%<%>%', txtexpression)  This will return the start position of the string, that you can search for the > expression of the html.

such as using:
select patindex ('%>%',txtexpression)
from <table_name>
where PATINDEX('%<%>%', txtexpression) < patindex ('%>%',txtexpression)
this will grab the last point of the string hopefully ensuring you can splice the start -> end positions from the HTML reference file.

Using:
http://msdn.microsoft.com/library/default.asp?url=/library/en-us/tsqlref/ts_ta-tz_3zaq.asp
You can then Return specific text data (Txt Ptr stuff):
This example locates the text column (pr_info) associated with pub_id 0736 in the pub_info table of the pubs database. It first declares the local variable @val. The text pointer (a long binary string) is then put into @val and supplied as a parameter to the READTEXT statement, which returns 10 bytes starting at the fifth byte (offset of 4).

USE pubs
GO
DECLARE @val varbinary(16)
SELECT @val = TEXTPTR(pr_info)
FROM pub_info
WHERE pub_id = '0736'
READTEXT pub_info.pr_info @val 4 10
GO


Personally I would grab all the start->end blocks of the html function into a temporary table using PATINDEX and chop the longest length strings (agregating the results from the temp table to remove multiple HTML tags placed end-to-end) from the text field for every row.  Its very messy, a lot easier to perform in perl or a similar grep engine.




0
 
mikkelpCommented:
Hi,

how about a regular expression? If you create a function to replace strings using regular expressions, you can do a simple mapping using a "select into" statement

for instance, udf called "dbo.regexreplace (input, replacepattern, replacestring, global, ignorecase) returns replace'd string

and you write something along the lines of
select replace(html, '<.*?>','') ..... from old_kb into kb

easy as pie - if you have access to user defined functions. And it's a nice and handy function to have for other text-substitutions where the replace-function is inadequate.




-- Declarartion of function
-- found on SQLServers Forum. Credit where credit is due.
CREATE FUNCTION [dbo].regexreplace
(
@input varchar(8000),
@regex varchar(300),
@replacestring varchar(1000),
@ignorecase bit,
@global bit
)  
RETURNS varchar(5000) AS  


BEGIN
      -- output hr
      declare @hr  int
      declare @objRegEx int
      execute  @hr = sp_OACreate 'VBScript.RegExp', @objRegEx out
      if @hr <> 0  begin
            return 'create error'
      end

      execute @hr = sp_OASetProperty @objRegEx, 'pattern',@regex
      if @hr <> 0 begin
            execute sp_OADestroy @objRegEx
            return 'pattern error'
      end

      execute @hr = sp_OASetProperty @objRegEx,'Global',@global
      if @hr <> 0 begin
            execute sp_OADestroy @objRegEx
            return 'Global error'
      end

      execute @hr = sp_OASetProperty @objRegEx,'Ignorecase',@ignorecase
            if @hr <> 0 begin
            execute sp_OADestroy @objRegEx
            return 'ignorecase error'
      end
      declare @result varchar(5000)
      execute @hr = sp_OAMethod @objRegEx, 'Replace',@result out,@input, @replacestring
      execute sp_OADestroy @objRegEx

      if @hr <> 0 begin
            return 'method call failed'
      end            
      
      return @result

END

To use this, you need access to use sp_OACreate/property/method - ask your dba - then simply select dbo.regexreplace (html,'<.*?>' ,'',1,1) which does a global replace on any '<..................>' found in the html ( I guess you could refine this a bit)

Example
declare @str varchar(100)
set @str = '<html><body onload="alert(''hi'');">Hello world</body></html>'

declare @regex varchar(100)
set @regex = '<.*?>'
set @replacewith = '' -- ie. nothing
declare @output varchar(100)
select dbo.regexreplace(@str,@regex,@replacewith,1,1)

---outputs
Hello world

good luck
mikkelp
0
 
arbertCommented:
like the question asker already stated, you can't use simple replace functions on TEXT types....
0
 
mikkelpCommented:
Doh

...my bad about the typecasting. I usually aviod BLOB-fields at any cost, so I wasn't aware of the implications regarding functions and the like...

but given that you don't have 8000 bytes or more of html, the function still works- and you can put it back in a TEXT-field. (modify returnval to varchar(8000)). otherwise you could use the TEXTPTR to split the text into suitable 8k parts (modify return to varchar(8000))

Now, I don't know if this screws up some internal format of the string due to the implicit cast to varchar?

-- test sql script

create table #ee
( i int,
thtml text) -- html in text-field, not varchar

insert into #ee
values (1,'<html><body>Hello World</body></html>')

insert into #ee
select 2,dbo.regexreplace(thtml,'<.*?>','',1,1) from #ee

select * from #ee
drop table #ee

returns
1      <html><body>Hello World</body></html>
2      Hello World

0
 
yodasan000Author Commented:
Just as a note ... this is a corporate database and I only have access to transact-sql code. I will try to use your guy's pointers and try to write some code for it. If I get anywhere with it then I'll give the points to whoever's tips I used to get it to work. Seems like danblake has some good tips, but I can't give points until I'm sure I've got a solution. If anyone can code the actual T-SQL code ... that would be awesome. But, as I mentioned again, I can only use t-sql code and I don't want to typecast to varchar because most of the data is over 8000 bytes. Thanks.
0
All Courses

From novice to tech pro — start learning today.