Remove html tags in MYSQL data - inconsistent results with function

I am trying to remove html tags using a function someone helped me with last week, and it works great most of the time, but I am seeing some unexpected results.  Can anyone help me troubleshoot this, please?

This is the text I am starting with:
<div class="summary">
  <h2>AutoCAD MEP 2010 Tutorial Series</h2><font size="4"><font size="4"><em>
  <p><font size="4"><em>HDR University Accredited (16.75 Professional Development Hours)</em></font></p></em></font></font><font size="4"><font size="4">

this is the result I am getting:
 
  AD MEP 2010 Tutorial Seriesm>
  ty Accredited (16.75 Professional Development Hours)>

Please note there is a blank row to start with, and the first five letters of 'AutoCAD' were removed, and at the end of the first line of text, it should have stopped with the word 'Series'.  I don't know if this has anything to do with line breaks, carriage returns, or anything like that.

Here is an example of one that mostly works.  This is the original text:
<h3 style="text-align: center;"><a name="Modeling"></a><img width="243" vspace="0" hspace="0" height="64" border="0" src="http://www.hdr.cadlearning.com/file.php/18/model.gif" alt="Modeling" title="Modeling" /><br /></h3>
<h3> Getting Organized </h3>

This is the result:

Getting Organized

Please note the first blank row.

This is the function that I am using:
DROP FUNCTION IF EXISTS `udf_StripHTML`;

CREATE FUNCTION `udf_StripHTML`(_HTMLText VARCHAR(8000)) RETURNS varchar(8000)
    DETERMINISTIC
BEGIN
      set @HTMLText=_HTMLText;
      
      while locate('<',@HTMLText)>0 do
      begin
            set
                  @pos_lt=locate('<',@HTMLText),
                  @pos_gt=locate('>',@HTMLText);
            set
                  @HTMLText=replace(@HTMLText,substring(@HTMLText,@pos_lt,@pos_gt),'');
      end;
      end while;
      RETURN @HTMLText;
END;


I appreciate any assistance you can provide.  Thank you in advance.
ginsburg7Asked:
Who is Participating?
 
Kevin CrossConnect With a Mentor Chief Technology OfficerCommented:
Try like this!
DROP FUNCTION IF EXISTS `udf_StripHTML`;

CREATE FUNCTION `udf_StripHTML`(_HTMLText VARCHAR(8000)) 
RETURNS varchar(8000) DETERMINISTIC
BEGIN
      set @HTMLText=_HTMLText;
      
      while locate('<',@HTMLText)>0 do 
            set 
                  @pos_lt=locate('<',@HTMLText),
                  @len=locate('>',@HTMLText,@pos_lt)-@pos_lt+1;
            set 
                  @HTMLText=replace(@HTMLText,substring(@HTMLText,@pos_lt,@len),'');
      end while;
      
      RETURN @HTMLText;
END; 

Open in new window

0
 
Kevin CrossChief Technology OfficerCommented:
Is the HTML XHTML compliant?  I ask as the above samples have unclosed tags like the  and the extraneous  tags at the end.  If the real data is always well formatted, then you could use this trick:


DROP FUNCTION IF EXISTS `udf_StripHTML`;

CREATE FUNCTION `udf_StripHTML`(_HTMLText VARCHAR(8000)) 
RETURNS varchar(8000) DETERMINISTIC
BEGIN
      set @HTMLText=extractvalue(_HTMLText, '//child::text()');
      RETURN @HTMLText;
END; 

Open in new window

0
 
Kevin CrossChief Technology OfficerCommented:
If it is not, I think I see the issue.  You are using substring which takes the form of substring(text, startposition, length) but are passing two index values.  I will try to correct that and repost.
0
Get your problem seen by more experts

Be seen. Boost your question’s priority for more expert views and faster solutions

 
ginsburg7Author Commented:
wow, mwvisa1, that looks pretty good.  Thank you.  I will test it and get back to you.  I think I need a loop to remove line breaks and carriage returns first, then apply your function.  Do you have a function that can do that, too?  I appreciate it.
0
 
Kevin CrossChief Technology OfficerCommented:
You can just use the same function and use replace on the initial text.  '\n' = line feed; '\r' = carriage return.
0
 
ginsburg7Author Commented:
this is very close.  the problem I am seeing is it turns this:

<div class="summary">   <h2>Revit MEP 2010 Tutorial Series</h2>  

into this:

   Revit MEP 2010 Tutorial Series  

note the leading spaces.  Can you help me get rid of them, please?

Thank you very much, mwvisa1.
0
 
ZberteocCommented:
Use:

RETURN TRIM(@HTMLText)
0
 
Kevin CrossChief Technology OfficerCommented:
Thanks, Zberteoc! I thought I had put the TRIM() function in there.  I think I had taken it out playing with showing the REPLACE() function also but guess I showed neither. *laughing*

If you have a word and then empty space before or after a tag, you will get double spaces inside the text where TRIM() will not work, so you can additionally use REPLACE(TRIM(@HTMLText), '  ', ' ')

You can possibly put this inside the WHILE loop and the initial "set @HTMLText=_HTMLText;" to get this done a couple of times in the process.

Hope that helps!
0
 
ginsburg7Author Commented:
It does indeed help!  Thank you.  

I now have one new issue that I hope you know the answer to -
I am expecting to see this:
Revit MEP 2010 Tutorial Series

but and getting this odd character:
Revit MEP 2010 Tutorial Series

I tried using the Replace function, but it doesn't seem to recognize that character.  I've done some research, but cannot find a solution.  Do you have one, please?
0
 
Kevin CrossChief Technology OfficerCommented:
Can you paste the original HTML content into a code snippet -- maybe there is something in the original text.  Also, please post the function code you are now using.  Can use both to test on our end.
0
 
ginsburg7Author Commented:
My ultimate goal is to get only the name of the "chapter" out of this text, which in this case would be "Revit MEP 2010 Tutorial Series" (no quotes).

Here is the original text:

<div class="summary">
  <h2>Revit MEP 2010 Tutorial Series</h2>
  <p><font size="4"><em>HDR University Accredited (15.75 Professional Development Hours)</em></font></p><font size="4">
  <p><font size="4">In order to receive credit for this course you must score at least 80% on all of the Chapter Quizzes and on the Final Exam.</font></p>
  <p><font size="4" face="Trebuchet MS,Verdana,Arial,Helvetica,sans-serif"><em>It is strongly recommended that you watch all of the videos.  Partial credit will not be awarded.</em></font></p>
  <p><em></em></p>
  <p><font size="4">Shortly after completion the PDH credit will be recorded in your <strong>Professional Training History </strong>(PTH) in Connects. </font></p>
  <p></p><hr />
  <p></p>
  <p><font size="4">The <strong><i>Course Overview </i></strong>pdf contains more specific information. <br />The <strong><i>Course Outline with Lesson Times </i></strong>pdf will break down the elapsed time for each video, chapter and the total course time.</font> </p>
  <p></p>
  <p><font size="4">Start by downloading all of the Datasets and unzipping the files to your local drive. These may be referenced over the course of the tutorials.</font></p>
  <p><font size="4">You can start and stop the course at any time. If you are returning to a course and want to see where you left off, select <strong><i>Quizzes </i></strong>from the <strong><i>Activities </i></strong>block at the top left of this page.</font></p>
  <p><font size="4">Good luck!</font></p></font></div>

This may be quite convoluted, but this is what I am doing.  I run a view that has this function first to remove line feeds:
BEGIN
      set @HTMLText=_HTMLText;
     
      while locate(char(10),@HTMLText)>0 do
            set
                  @pos_lt=locate(char(10),@HTMLText),
                  @len=locate(char(10),@HTMLText,@pos_lt)-@pos_lt+1;
            set
                  @HTMLText=replace(@HTMLText,substring(@HTMLText,@pos_lt,@len),'');
      end while;
     
      RETURN @HTMLText;
END

Then I run the second view that calls the first view.  It has this function to remove carriage returns:
BEGIN
      set @HTMLText=_HTMLText;
     
      while locate(char(13),@HTMLText)>0 do
            set
                  @pos_lt=locate(char(13),@HTMLText),
                  @len=locate(char(13),@HTMLText,@pos_lt)-@pos_lt+1;
            set
                  @HTMLText=replace(@HTMLText,substring(@HTMLText,@pos_lt,@len),'');
      end while;
     
      RETURN @HTMLText;
END

then I run the third view that calls the second view, and it has this function to remove paragraph tags:
BEGIN
      set @HTMLText=_HTMLText;
     
      while locate(char(13)+char(10),@HTMLText)>0 do
            set
                  @pos_lt=locate(char(13)+char(10),@HTMLText),
                  @len=locate(char(13)+char(10),@HTMLText,@pos_lt)-@pos_lt+1;
            set
                  @HTMLText=replace(@HTMLText,substring(@HTMLText,@pos_lt,@len),'');
      end while;
     
      RETURN @HTMLText;
END

then I run a fourth view that calls the third view that has this function to strip remaining html tags:
      set @HTMLText=_HTMLText;
     
      while locate('<',@HTMLText)>0 do
            set
                  @pos_lt=locate('<',@HTMLText),
                  @len=locate('>',@HTMLText,@pos_lt)-@pos_lt+1;
            set
                  @HTMLText=replace(trim(replace(@HTMLText,substring(@HTMLText,@pos_lt,@len),'')),' ' ,'');
      end while;
     
      RETURN @HTMLText;
END

Here is the fouth view.  It concatenates one field from the data table and the derived 'summary' text field, which is fine.  The problem develops when this fourth view is linked to another table on the 'id' field.

select `RJM04 - strip paragraphs`.`id` AS `id`,
`RJM04 - strip paragraphs`.`course` AS `course`,
`RJM04 - strip paragraphs`.`section` AS `section`,
`RJM04 - strip paragraphs`.`SummaryOriginal` AS `SummaryOriginal`,
`udf_StripHTML`(`RJM04 - strip paragraphs`.`NewSummary3`) AS `summary`,
concat(`RJM04 - strip paragraphs`.`section`,_utf8' ',`udf_StripHTML`(`RJM04 - strip paragraphs`.`NewSummary3`)) AS `chapter`
from `RJM04 - strip paragraphs`

I suspect that is going to be hard to follow.  I think I can live with the results, as is, but if you have a simple suggestion, I'm all ears.

Thank you once again.
0
 
ZberteocCommented:
Ok, I got it.

I built this function in the first place but I had few omissions I notice. ;o)

I modified it to remove the returns from the whole text and I trim the string with every iteration. You don't need to apply the function in successive steps but only once to the string/column you need:

SELECT udf_StripHTML(your_col) FROM your_table;

here is the code you should replace in the function. I only post the statements from the functions BEGIN END block:

BEGIN
	set @HTMLText=trim(replace(_HTMLText,'\n',''));
	
	while locate('<',@HTMLText)>0 do 
		begin
			set 
				@pos_lt=locate('<',@HTMLText),
				@pos_gt=locate('>',@HTMLText),
				@HTMLText=trim(replace(@HTMLText,substring(@HTMLText,@pos_lt,@pos_gt-@pos_lt+1),''));
		end;
	end while;
	RETURN @HTMLText;
END

Open in new window

0
Question has a verified solution.

Are you are experiencing a similar issue? Get a personalized answer when you ask a related question.

Have a better answer? Share it in a comment.

All Courses

From novice to tech pro — start learning today.