Solved

Remove html tags in MYSQL data - inconsistent results with function

Posted on 2010-09-11
12
3,649 Views
Last Modified: 2012-05-10
I am trying to remove html tags using a function someone helped me with last week, and it works great most of the time, but I am seeing some unexpected results.  Can anyone help me troubleshoot this, please?

This is the text I am starting with:
<div class="summary">
  <h2>AutoCAD MEP 2010 Tutorial Series</h2><font size="4"><font size="4"><em>
  <p><font size="4"><em>HDR University Accredited (16.75 Professional Development Hours)</em></font></p></em></font></font><font size="4"><font size="4">

this is the result I am getting:
 
  AD MEP 2010 Tutorial Seriesm>
  ty Accredited (16.75 Professional Development Hours)>

Please note there is a blank row to start with, and the first five letters of 'AutoCAD' were removed, and at the end of the first line of text, it should have stopped with the word 'Series'.  I don't know if this has anything to do with line breaks, carriage returns, or anything like that.

Here is an example of one that mostly works.  This is the original text:
<h3 style="text-align: center;"><a name="Modeling"></a><img width="243" vspace="0" hspace="0" height="64" border="0" src="http://www.hdr.cadlearning.com/file.php/18/model.gif" alt="Modeling" title="Modeling" /><br /></h3>
<h3> Getting Organized </h3>

This is the result:

Getting Organized

Please note the first blank row.

This is the function that I am using:
DROP FUNCTION IF EXISTS `udf_StripHTML`;

CREATE FUNCTION `udf_StripHTML`(_HTMLText VARCHAR(8000)) RETURNS varchar(8000)
    DETERMINISTIC
BEGIN
      set @HTMLText=_HTMLText;
      
      while locate('<',@HTMLText)>0 do
      begin
            set
                  @pos_lt=locate('<',@HTMLText),
                  @pos_gt=locate('>',@HTMLText);
            set
                  @HTMLText=replace(@HTMLText,substring(@HTMLText,@pos_lt,@pos_gt),'');
      end;
      end while;
      RETURN @HTMLText;
END;


I appreciate any assistance you can provide.  Thank you in advance.
0
Comment
Question by:ginsburg7
  • 6
  • 4
  • 2
12 Comments
 
LVL 59

Expert Comment

by:Kevin Cross
Comment Utility
Is the HTML XHTML compliant?  I ask as the above samples have unclosed tags like the  and the extraneous  tags at the end.  If the real data is always well formatted, then you could use this trick:


DROP FUNCTION IF EXISTS `udf_StripHTML`;

CREATE FUNCTION `udf_StripHTML`(_HTMLText VARCHAR(8000)) 
RETURNS varchar(8000) DETERMINISTIC
BEGIN
      set @HTMLText=extractvalue(_HTMLText, '//child::text()');
      RETURN @HTMLText;
END; 

Open in new window

0
 
LVL 59

Expert Comment

by:Kevin Cross
Comment Utility
If it is not, I think I see the issue.  You are using substring which takes the form of substring(text, startposition, length) but are passing two index values.  I will try to correct that and repost.
0
 
LVL 59

Accepted Solution

by:
Kevin Cross earned 500 total points
Comment Utility
Try like this!
DROP FUNCTION IF EXISTS `udf_StripHTML`;

CREATE FUNCTION `udf_StripHTML`(_HTMLText VARCHAR(8000)) 
RETURNS varchar(8000) DETERMINISTIC
BEGIN
      set @HTMLText=_HTMLText;
      
      while locate('<',@HTMLText)>0 do 
            set 
                  @pos_lt=locate('<',@HTMLText),
                  @len=locate('>',@HTMLText,@pos_lt)-@pos_lt+1;
            set 
                  @HTMLText=replace(@HTMLText,substring(@HTMLText,@pos_lt,@len),'');
      end while;
      
      RETURN @HTMLText;
END; 

Open in new window

0
 

Author Comment

by:ginsburg7
Comment Utility
wow, mwvisa1, that looks pretty good.  Thank you.  I will test it and get back to you.  I think I need a loop to remove line breaks and carriage returns first, then apply your function.  Do you have a function that can do that, too?  I appreciate it.
0
 
LVL 59

Expert Comment

by:Kevin Cross
Comment Utility
You can just use the same function and use replace on the initial text.  '\n' = line feed; '\r' = carriage return.
0
 

Author Comment

by:ginsburg7
Comment Utility
this is very close.  the problem I am seeing is it turns this:

<div class="summary">   <h2>Revit MEP 2010 Tutorial Series</h2>  

into this:

   Revit MEP 2010 Tutorial Series  

note the leading spaces.  Can you help me get rid of them, please?

Thank you very much, mwvisa1.
0
Highfive Gives IT Their Time Back

Highfive is so simple that setting up every meeting room takes just minutes and every employee will be able to start or join a call from any room with ease. Never be called into a meeting just to get it started again. This is how video conferencing should work!

 
LVL 26

Expert Comment

by:Zberteoc
Comment Utility
Use:

RETURN TRIM(@HTMLText)
0
 
LVL 59

Expert Comment

by:Kevin Cross
Comment Utility
Thanks, Zberteoc! I thought I had put the TRIM() function in there.  I think I had taken it out playing with showing the REPLACE() function also but guess I showed neither. *laughing*

If you have a word and then empty space before or after a tag, you will get double spaces inside the text where TRIM() will not work, so you can additionally use REPLACE(TRIM(@HTMLText), '  ', ' ')

You can possibly put this inside the WHILE loop and the initial "set @HTMLText=_HTMLText;" to get this done a couple of times in the process.

Hope that helps!
0
 

Author Comment

by:ginsburg7
Comment Utility
It does indeed help!  Thank you.  

I now have one new issue that I hope you know the answer to -
I am expecting to see this:
Revit MEP 2010 Tutorial Series

but and getting this odd character:
Revit MEP 2010 Tutorial Series

I tried using the Replace function, but it doesn't seem to recognize that character.  I've done some research, but cannot find a solution.  Do you have one, please?
0
 
LVL 59

Expert Comment

by:Kevin Cross
Comment Utility
Can you paste the original HTML content into a code snippet -- maybe there is something in the original text.  Also, please post the function code you are now using.  Can use both to test on our end.
0
 

Author Comment

by:ginsburg7
Comment Utility
My ultimate goal is to get only the name of the "chapter" out of this text, which in this case would be "Revit MEP 2010 Tutorial Series" (no quotes).

Here is the original text:

<div class="summary">
  <h2>Revit MEP 2010 Tutorial Series</h2>
  <p><font size="4"><em>HDR University Accredited (15.75 Professional Development Hours)</em></font></p><font size="4">
  <p><font size="4">In order to receive credit for this course you must score at least 80% on all of the Chapter Quizzes and on the Final Exam.</font></p>
  <p><font size="4" face="Trebuchet MS,Verdana,Arial,Helvetica,sans-serif"><em>It is strongly recommended that you watch all of the videos.  Partial credit will not be awarded.</em></font></p>
  <p><em></em></p>
  <p><font size="4">Shortly after completion the PDH credit will be recorded in your <strong>Professional Training History </strong>(PTH) in Connects. </font></p>
  <p></p><hr />
  <p></p>
  <p><font size="4">The <strong><i>Course Overview </i></strong>pdf contains more specific information. <br />The <strong><i>Course Outline with Lesson Times </i></strong>pdf will break down the elapsed time for each video, chapter and the total course time.</font> </p>
  <p></p>
  <p><font size="4">Start by downloading all of the Datasets and unzipping the files to your local drive. These may be referenced over the course of the tutorials.</font></p>
  <p><font size="4">You can start and stop the course at any time. If you are returning to a course and want to see where you left off, select <strong><i>Quizzes </i></strong>from the <strong><i>Activities </i></strong>block at the top left of this page.</font></p>
  <p><font size="4">Good luck!</font></p></font></div>

This may be quite convoluted, but this is what I am doing.  I run a view that has this function first to remove line feeds:
BEGIN
      set @HTMLText=_HTMLText;
     
      while locate(char(10),@HTMLText)>0 do
            set
                  @pos_lt=locate(char(10),@HTMLText),
                  @len=locate(char(10),@HTMLText,@pos_lt)-@pos_lt+1;
            set
                  @HTMLText=replace(@HTMLText,substring(@HTMLText,@pos_lt,@len),'');
      end while;
     
      RETURN @HTMLText;
END

Then I run the second view that calls the first view.  It has this function to remove carriage returns:
BEGIN
      set @HTMLText=_HTMLText;
     
      while locate(char(13),@HTMLText)>0 do
            set
                  @pos_lt=locate(char(13),@HTMLText),
                  @len=locate(char(13),@HTMLText,@pos_lt)-@pos_lt+1;
            set
                  @HTMLText=replace(@HTMLText,substring(@HTMLText,@pos_lt,@len),'');
      end while;
     
      RETURN @HTMLText;
END

then I run the third view that calls the second view, and it has this function to remove paragraph tags:
BEGIN
      set @HTMLText=_HTMLText;
     
      while locate(char(13)+char(10),@HTMLText)>0 do
            set
                  @pos_lt=locate(char(13)+char(10),@HTMLText),
                  @len=locate(char(13)+char(10),@HTMLText,@pos_lt)-@pos_lt+1;
            set
                  @HTMLText=replace(@HTMLText,substring(@HTMLText,@pos_lt,@len),'');
      end while;
     
      RETURN @HTMLText;
END

then I run a fourth view that calls the third view that has this function to strip remaining html tags:
      set @HTMLText=_HTMLText;
     
      while locate('<',@HTMLText)>0 do
            set
                  @pos_lt=locate('<',@HTMLText),
                  @len=locate('>',@HTMLText,@pos_lt)-@pos_lt+1;
            set
                  @HTMLText=replace(trim(replace(@HTMLText,substring(@HTMLText,@pos_lt,@len),'')),' ' ,'');
      end while;
     
      RETURN @HTMLText;
END

Here is the fouth view.  It concatenates one field from the data table and the derived 'summary' text field, which is fine.  The problem develops when this fourth view is linked to another table on the 'id' field.

select `RJM04 - strip paragraphs`.`id` AS `id`,
`RJM04 - strip paragraphs`.`course` AS `course`,
`RJM04 - strip paragraphs`.`section` AS `section`,
`RJM04 - strip paragraphs`.`SummaryOriginal` AS `SummaryOriginal`,
`udf_StripHTML`(`RJM04 - strip paragraphs`.`NewSummary3`) AS `summary`,
concat(`RJM04 - strip paragraphs`.`section`,_utf8' ',`udf_StripHTML`(`RJM04 - strip paragraphs`.`NewSummary3`)) AS `chapter`
from `RJM04 - strip paragraphs`

I suspect that is going to be hard to follow.  I think I can live with the results, as is, but if you have a simple suggestion, I'm all ears.

Thank you once again.
0
 
LVL 26

Expert Comment

by:Zberteoc
Comment Utility
Ok, I got it.

I built this function in the first place but I had few omissions I notice. ;o)

I modified it to remove the returns from the whole text and I trim the string with every iteration. You don't need to apply the function in successive steps but only once to the string/column you need:

SELECT udf_StripHTML(your_col) FROM your_table;

here is the code you should replace in the function. I only post the statements from the functions BEGIN END block:


BEGIN

	set @HTMLText=trim(replace(_HTMLText,'\n',''));

	

	while locate('<',@HTMLText)>0 do 

		begin

			set 

				@pos_lt=locate('<',@HTMLText),

				@pos_gt=locate('>',@HTMLText),

				@HTMLText=trim(replace(@HTMLText,substring(@HTMLText,@pos_lt,@pos_gt-@pos_lt+1),''));

		end;

	end while;

	RETURN @HTMLText;

END

Open in new window

0

Featured Post

Top 6 Sources for Identifying Threat Actor TTPs

Understanding your enemy is essential. These six sources will help you identify the most popular threat actor tactics, techniques, and procedures (TTPs).

Join & Write a Comment

Introduction In this article, I will by showing a nice little trick for MySQL similar to that of my previous EE Article for SQLite (http://www.sqlite.org/), A SQLite Tidbit: Quick Numbers Table Generation (http://www.experts-exchange.com/A_3570.htm…
All XML, All the Time; More Fun MySQL Tidbits – Dynamically Generate XML via Stored Procedure in MySQL Extensible Markup Language (XML) and database systems, a marriage we are seeing more and more of.  So the topics of parsing and manipulating XM…
Polish reports in Access so they look terrific. Take yourself to another level. Equations, Back Color, Alternate Back Color. Write easy VBA Code. Tighten space to use less pages. Launch report from a menu, considering criteria only when it is filled…
You have products, that come in variants and want to set different prices for them? Watch this micro tutorial that describes how to configure prices for Magento super attributes. Assigning simple products to configurable: We assigned simple products…

744 members asked questions and received personalized solutions in the past 7 days.

Join the community of 500,000 technology professionals and ask your questions.

Join & Ask a Question

Need Help in Real-Time?

Connect with top rated Experts

16 Experts available now in Live!

Get 1:1 Help Now