Solved

T-SQL code for Stripping HTML Tags from Text Field during T-SQL Conversion

Posted on 2004-04-03
5
1,637 Views
Last Modified: 2007-12-19
I have two tables (defined below). One is called old_KB and the other is KB. I want to convert some of the data over from old_KB to KB, but I want to manipulate some of the data during the conversion. I have a text field called "HTML" inside the old_KB table. I want that data to be stored to the "HTML" field of the KB table and I want a version of the "HTML" field with all of the HTML tags eliminated, moved to the "search_text" field of the KB table. I need the Transact SQL code for this conversion. I DO NOT want to type cast to a varchar. I want the TEXT fields kept intact. I know that I need to use text pointers, but I'm not sure how to use them. Anyways, I need the Transact SQL Code to do this as soon as you can. Thanks.

The data conversion has the following map:

old_KB.id -> KB.id
old_KB.Description -> KB.Name
old_KB.HTML -> KB.HTML
old_KB.HTML (manipulated w/ HTML tags removed) -> KB.search_text
KB.Notes = NULL


Defined Tables:

CREATE TABLE [dbo].[old_KB] (
      [ID] [int] NOT NULL ,
      [Prob] [text] COLLATE SQL_Latin1_General_CP1_CI_AS NULL ,
      [Solution] [text] COLLATE SQL_Latin1_General_CP1_CI_AS NULL ,
      [Description] [varchar] (100) COLLATE SQL_Latin1_General_CP1_CI_AS NULL ,
      [HTML] [text] COLLATE SQL_Latin1_General_CP1_CI_AS NULL ,
      [ModifiedBy] [varchar] (100) COLLATE SQL_Latin1_General_CP1_CI_AS NULL ,
      [CreatedBy] [varchar] (100) COLLATE SQL_Latin1_General_CP1_CI_AS NULL ,
      [ModifyDate] [datetime] NULL ,
      [CreateDate] [datetime] NULL ,
      [Keywords] [nvarchar] (4000) COLLATE SQL_Latin1_General_CP1_CI_AS NULL
) ON [PRIMARY] TEXTIMAGE_ON [PRIMARY]


CREATE TABLE [dbo].[KB] (
      [ID] [int] NULL ,
      [Name] [varchar] (2000) COLLATE SQL_Latin1_General_CP1_CI_AS NULL ,
      [HTML] [text] COLLATE SQL_Latin1_General_CP1_CI_AS NULL ,
      [search_text] [text] COLLATE SQL_Latin1_General_CP1_CI_AS NULL ,
      [Notes] [text] COLLATE SQL_Latin1_General_CP1_CI_AS NULL
) ON [PRIMARY] TEXTIMAGE_ON [PRIMARY]

0
Comment
Question by:yodasan000
5 Comments
 
LVL 4

Expert Comment

by:mikkelp
Comment Utility
Hi,

how about a regular expression? If you create a function to replace strings using regular expressions, you can do a simple mapping using a "select into" statement

for instance, udf called "dbo.regexreplace (input, replacepattern, replacestring, global, ignorecase) returns replace'd string

and you write something along the lines of
select replace(html, '<.*?>','') ..... from old_kb into kb

easy as pie - if you have access to user defined functions. And it's a nice and handy function to have for other text-substitutions where the replace-function is inadequate.




-- Declarartion of function
-- found on SQLServers Forum. Credit where credit is due.
CREATE FUNCTION [dbo].regexreplace
(
@input varchar(8000),
@regex varchar(300),
@replacestring varchar(1000),
@ignorecase bit,
@global bit
)  
RETURNS varchar(5000) AS  


BEGIN
      -- output hr
      declare @hr  int
      declare @objRegEx int
      execute  @hr = sp_OACreate 'VBScript.RegExp', @objRegEx out
      if @hr <> 0  begin
            return 'create error'
      end

      execute @hr = sp_OASetProperty @objRegEx, 'pattern',@regex
      if @hr <> 0 begin
            execute sp_OADestroy @objRegEx
            return 'pattern error'
      end

      execute @hr = sp_OASetProperty @objRegEx,'Global',@global
      if @hr <> 0 begin
            execute sp_OADestroy @objRegEx
            return 'Global error'
      end

      execute @hr = sp_OASetProperty @objRegEx,'Ignorecase',@ignorecase
            if @hr <> 0 begin
            execute sp_OADestroy @objRegEx
            return 'ignorecase error'
      end
      declare @result varchar(5000)
      execute @hr = sp_OAMethod @objRegEx, 'Replace',@result out,@input, @replacestring
      execute sp_OADestroy @objRegEx

      if @hr <> 0 begin
            return 'method call failed'
      end            
      
      return @result

END

To use this, you need access to use sp_OACreate/property/method - ask your dba - then simply select dbo.regexreplace (html,'<.*?>' ,'',1,1) which does a global replace on any '<..................>' found in the html ( I guess you could refine this a bit)

Example
declare @str varchar(100)
set @str = '<html><body onload="alert(''hi'');">Hello world</body></html>'

declare @regex varchar(100)
set @regex = '<.*?>'
set @replacewith = '' -- ie. nothing
declare @output varchar(100)
select dbo.regexreplace(@str,@regex,@replacewith,1,1)

---outputs
Hello world

good luck
mikkelp
0
 
LVL 34

Expert Comment

by:arbert
Comment Utility
like the question asker already stated, you can't use simple replace functions on TEXT types....
0
 
LVL 13

Accepted Solution

by:
danblake earned 500 total points
Comment Utility
I DO NOT want to type cast to a varchar. I want the TEXT fields kept intact.

Do you have any problem in bulk extracting the file with the TEXT file HTML & your PK: ID, bulk inputing the file and running a perl or similar process on the file ?
It keeps the file in TEXT format (we can easily bulk insert/extract the data by using bcp).

(It will get very messy performing this with pure T-SQL -- its not very good at this kind of thing...)

I noticed that you are a perl programmer... this thing is better done in perl than T-sql (sed is available as a MS Dos based version similar to perl and should allow you to strip the < ...varlength.... > items from the file.  Alternativly run on a <*> <*> <***> replace on the file of incrementing lengths to allow stripping of the file.)

Sed is available from: http://www.student.northpark.edu/pemente/sed/

We could probably come up with a T-SQL equivelent, but this is likely to be slow.  It will need to read in a character at a time until < is encountered, then carry on reading a character till a > is encountered without encountering another < inbetween or whatever our rules are going to be (Ensuring that a properly formatted string is available).  This is just an idea...

To do this with T-SQL txt pointers refer to the following functions:
output = substring (txtexpression, start_pos, end_pos)  unfortunatly for you it returns this data as a varchar !

OR we can do this with the following T-SQL PATINDEX function:
SELECT PATINDEX('%<%>%', txtexpression)  This will return the start position of the string, that you can search for the > expression of the html.

such as using:
select patindex ('%>%',txtexpression)
from <table_name>
where PATINDEX('%<%>%', txtexpression) < patindex ('%>%',txtexpression)
this will grab the last point of the string hopefully ensuring you can splice the start -> end positions from the HTML reference file.

Using:
http://msdn.microsoft.com/library/default.asp?url=/library/en-us/tsqlref/ts_ta-tz_3zaq.asp
You can then Return specific text data (Txt Ptr stuff):
This example locates the text column (pr_info) associated with pub_id 0736 in the pub_info table of the pubs database. It first declares the local variable @val. The text pointer (a long binary string) is then put into @val and supplied as a parameter to the READTEXT statement, which returns 10 bytes starting at the fifth byte (offset of 4).

USE pubs
GO
DECLARE @val varbinary(16)
SELECT @val = TEXTPTR(pr_info)
FROM pub_info
WHERE pub_id = '0736'
READTEXT pub_info.pr_info @val 4 10
GO


Personally I would grab all the start->end blocks of the html function into a temporary table using PATINDEX and chop the longest length strings (agregating the results from the temp table to remove multiple HTML tags placed end-to-end) from the text field for every row.  Its very messy, a lot easier to perform in perl or a similar grep engine.




0
 
LVL 4

Expert Comment

by:mikkelp
Comment Utility
Doh

...my bad about the typecasting. I usually aviod BLOB-fields at any cost, so I wasn't aware of the implications regarding functions and the like...

but given that you don't have 8000 bytes or more of html, the function still works- and you can put it back in a TEXT-field. (modify returnval to varchar(8000)). otherwise you could use the TEXTPTR to split the text into suitable 8k parts (modify return to varchar(8000))

Now, I don't know if this screws up some internal format of the string due to the implicit cast to varchar?

-- test sql script

create table #ee
( i int,
thtml text) -- html in text-field, not varchar

insert into #ee
values (1,'<html><body>Hello World</body></html>')

insert into #ee
select 2,dbo.regexreplace(thtml,'<.*?>','',1,1) from #ee

select * from #ee
drop table #ee

returns
1      <html><body>Hello World</body></html>
2      Hello World

0
 

Author Comment

by:yodasan000
Comment Utility
Just as a note ... this is a corporate database and I only have access to transact-sql code. I will try to use your guy's pointers and try to write some code for it. If I get anywhere with it then I'll give the points to whoever's tips I used to get it to work. Seems like danblake has some good tips, but I can't give points until I'm sure I've got a solution. If anyone can code the actual T-SQL code ... that would be awesome. But, as I mentioned again, I can only use t-sql code and I don't want to typecast to varchar because most of the data is over 8000 bytes. Thanks.
0

Featured Post

Highfive + Dolby Voice = No More Audio Complaints!

Poor audio quality is one of the top reasons people don’t use video conferencing. Get the crispest, clearest audio powered by Dolby Voice in every meeting. Highfive and Dolby Voice deliver the best video conferencing and audio experience for every meeting and every room.

Join & Write a Comment

Suggested Solutions

Introduction In my previous article (http://www.experts-exchange.com/Microsoft/Development/MS-SQL-Server/SSIS/A_9150-Loading-XML-Using-SSIS.html) I showed you how the XML Source component can be used to load XML files into a SQL Server database, us…
Slowly Changing Dimension Transformation component in data task flow is very useful for us to manage and control how data changes in SSIS.
Familiarize people with the process of utilizing SQL Server functions from within Microsoft Access. Microsoft Access is a very powerful client/server development tool. One of the SQL Server objects that you can interact with from within Microsoft Ac…
This video shows, step by step, how to configure Oracle Heterogeneous Services via the Generic Gateway Agent in order to make a connection from an Oracle session and access a remote SQL Server database table.

772 members asked questions and received personalized solutions in the past 7 days.

Join the community of 500,000 technology professionals and ask your questions.

Join & Ask a Question

Need Help in Real-Time?

Connect with top rated Experts

14 Experts available now in Live!

Get 1:1 Help Now