Solved

Remove html tags in MYSQL data - inconsistent results with function

Posted on 2010-09-11
12
4,027 Views
Last Modified: 2012-05-10
I am trying to remove html tags using a function someone helped me with last week, and it works great most of the time, but I am seeing some unexpected results.  Can anyone help me troubleshoot this, please?

This is the text I am starting with:
<div class="summary">
  <h2>AutoCAD MEP 2010 Tutorial Series</h2><font size="4"><font size="4"><em>
  <p><font size="4"><em>HDR University Accredited (16.75 Professional Development Hours)</em></font></p></em></font></font><font size="4"><font size="4">

this is the result I am getting:
 
  AD MEP 2010 Tutorial Seriesm>
  ty Accredited (16.75 Professional Development Hours)>

Please note there is a blank row to start with, and the first five letters of 'AutoCAD' were removed, and at the end of the first line of text, it should have stopped with the word 'Series'.  I don't know if this has anything to do with line breaks, carriage returns, or anything like that.

Here is an example of one that mostly works.  This is the original text:
<h3 style="text-align: center;"><a name="Modeling"></a><img width="243" vspace="0" hspace="0" height="64" border="0" src="http://www.hdr.cadlearning.com/file.php/18/model.gif" alt="Modeling" title="Modeling" /><br /></h3>
<h3> Getting Organized </h3>

This is the result:

Getting Organized

Please note the first blank row.

This is the function that I am using:
DROP FUNCTION IF EXISTS `udf_StripHTML`;

CREATE FUNCTION `udf_StripHTML`(_HTMLText VARCHAR(8000)) RETURNS varchar(8000)
    DETERMINISTIC
BEGIN
      set @HTMLText=_HTMLText;
      
      while locate('<',@HTMLText)>0 do
      begin
            set
                  @pos_lt=locate('<',@HTMLText),
                  @pos_gt=locate('>',@HTMLText);
            set
                  @HTMLText=replace(@HTMLText,substring(@HTMLText,@pos_lt,@pos_gt),'');
      end;
      end while;
      RETURN @HTMLText;
END;


I appreciate any assistance you can provide.  Thank you in advance.
0
Comment
Question by:ginsburg7
[X]
Welcome to Experts Exchange

Add your voice to the tech community where 5M+ people just like you are talking about what matters.

  • Help others & share knowledge
  • Earn cash & points
  • Learn & ask questions
  • 6
  • 4
  • 2
12 Comments
 
LVL 60

Expert Comment

by:Kevin Cross
ID: 33655019
Is the HTML XHTML compliant?  I ask as the above samples have unclosed tags like the  and the extraneous  tags at the end.  If the real data is always well formatted, then you could use this trick:


DROP FUNCTION IF EXISTS `udf_StripHTML`;

CREATE FUNCTION `udf_StripHTML`(_HTMLText VARCHAR(8000)) 
RETURNS varchar(8000) DETERMINISTIC
BEGIN
      set @HTMLText=extractvalue(_HTMLText, '//child::text()');
      RETURN @HTMLText;
END; 

Open in new window

0
 
LVL 60

Expert Comment

by:Kevin Cross
ID: 33655047
If it is not, I think I see the issue.  You are using substring which takes the form of substring(text, startposition, length) but are passing two index values.  I will try to correct that and repost.
0
 
LVL 60

Accepted Solution

by:
Kevin Cross earned 500 total points
ID: 33655067
Try like this!
DROP FUNCTION IF EXISTS `udf_StripHTML`;

CREATE FUNCTION `udf_StripHTML`(_HTMLText VARCHAR(8000)) 
RETURNS varchar(8000) DETERMINISTIC
BEGIN
      set @HTMLText=_HTMLText;
      
      while locate('<',@HTMLText)>0 do 
            set 
                  @pos_lt=locate('<',@HTMLText),
                  @len=locate('>',@HTMLText,@pos_lt)-@pos_lt+1;
            set 
                  @HTMLText=replace(@HTMLText,substring(@HTMLText,@pos_lt,@len),'');
      end while;
      
      RETURN @HTMLText;
END; 

Open in new window

0
Windows Server 2016: All you need to know

Learn about Hyper-V features that increase functionality and usability of Microsoft Windows Server 2016. Also, throughout this eBook, you’ll find some basic PowerShell examples that will help you leverage the scripts in your environments!

 

Author Comment

by:ginsburg7
ID: 33655434
wow, mwvisa1, that looks pretty good.  Thank you.  I will test it and get back to you.  I think I need a loop to remove line breaks and carriage returns first, then apply your function.  Do you have a function that can do that, too?  I appreciate it.
0
 
LVL 60

Expert Comment

by:Kevin Cross
ID: 33655465
You can just use the same function and use replace on the initial text.  '\n' = line feed; '\r' = carriage return.
0
 

Author Comment

by:ginsburg7
ID: 33659513
this is very close.  the problem I am seeing is it turns this:

<div class="summary">   <h2>Revit MEP 2010 Tutorial Series</h2>  

into this:

   Revit MEP 2010 Tutorial Series  

note the leading spaces.  Can you help me get rid of them, please?

Thank you very much, mwvisa1.
0
 
LVL 27

Expert Comment

by:Zberteoc
ID: 33661116
Use:

RETURN TRIM(@HTMLText)
0
 
LVL 60

Expert Comment

by:Kevin Cross
ID: 33661524
Thanks, Zberteoc! I thought I had put the TRIM() function in there.  I think I had taken it out playing with showing the REPLACE() function also but guess I showed neither. *laughing*

If you have a word and then empty space before or after a tag, you will get double spaces inside the text where TRIM() will not work, so you can additionally use REPLACE(TRIM(@HTMLText), '  ', ' ')

You can possibly put this inside the WHILE loop and the initial "set @HTMLText=_HTMLText;" to get this done a couple of times in the process.

Hope that helps!
0
 

Author Comment

by:ginsburg7
ID: 33663630
It does indeed help!  Thank you.  

I now have one new issue that I hope you know the answer to -
I am expecting to see this:
Revit MEP 2010 Tutorial Series

but and getting this odd character:
Revit MEP 2010 Tutorial Series

I tried using the Replace function, but it doesn't seem to recognize that character.  I've done some research, but cannot find a solution.  Do you have one, please?
0
 
LVL 60

Expert Comment

by:Kevin Cross
ID: 33663645
Can you paste the original HTML content into a code snippet -- maybe there is something in the original text.  Also, please post the function code you are now using.  Can use both to test on our end.
0
 

Author Comment

by:ginsburg7
ID: 33664413
My ultimate goal is to get only the name of the "chapter" out of this text, which in this case would be "Revit MEP 2010 Tutorial Series" (no quotes).

Here is the original text:

<div class="summary">
  <h2>Revit MEP 2010 Tutorial Series</h2>
  <p><font size="4"><em>HDR University Accredited (15.75 Professional Development Hours)</em></font></p><font size="4">
  <p><font size="4">In order to receive credit for this course you must score at least 80% on all of the Chapter Quizzes and on the Final Exam.</font></p>
  <p><font size="4" face="Trebuchet MS,Verdana,Arial,Helvetica,sans-serif"><em>It is strongly recommended that you watch all of the videos.  Partial credit will not be awarded.</em></font></p>
  <p><em></em></p>
  <p><font size="4">Shortly after completion the PDH credit will be recorded in your <strong>Professional Training History </strong>(PTH) in Connects. </font></p>
  <p></p><hr />
  <p></p>
  <p><font size="4">The <strong><i>Course Overview </i></strong>pdf contains more specific information. <br />The <strong><i>Course Outline with Lesson Times </i></strong>pdf will break down the elapsed time for each video, chapter and the total course time.</font> </p>
  <p></p>
  <p><font size="4">Start by downloading all of the Datasets and unzipping the files to your local drive. These may be referenced over the course of the tutorials.</font></p>
  <p><font size="4">You can start and stop the course at any time. If you are returning to a course and want to see where you left off, select <strong><i>Quizzes </i></strong>from the <strong><i>Activities </i></strong>block at the top left of this page.</font></p>
  <p><font size="4">Good luck!</font></p></font></div>

This may be quite convoluted, but this is what I am doing.  I run a view that has this function first to remove line feeds:
BEGIN
      set @HTMLText=_HTMLText;
     
      while locate(char(10),@HTMLText)>0 do
            set
                  @pos_lt=locate(char(10),@HTMLText),
                  @len=locate(char(10),@HTMLText,@pos_lt)-@pos_lt+1;
            set
                  @HTMLText=replace(@HTMLText,substring(@HTMLText,@pos_lt,@len),'');
      end while;
     
      RETURN @HTMLText;
END

Then I run the second view that calls the first view.  It has this function to remove carriage returns:
BEGIN
      set @HTMLText=_HTMLText;
     
      while locate(char(13),@HTMLText)>0 do
            set
                  @pos_lt=locate(char(13),@HTMLText),
                  @len=locate(char(13),@HTMLText,@pos_lt)-@pos_lt+1;
            set
                  @HTMLText=replace(@HTMLText,substring(@HTMLText,@pos_lt,@len),'');
      end while;
     
      RETURN @HTMLText;
END

then I run the third view that calls the second view, and it has this function to remove paragraph tags:
BEGIN
      set @HTMLText=_HTMLText;
     
      while locate(char(13)+char(10),@HTMLText)>0 do
            set
                  @pos_lt=locate(char(13)+char(10),@HTMLText),
                  @len=locate(char(13)+char(10),@HTMLText,@pos_lt)-@pos_lt+1;
            set
                  @HTMLText=replace(@HTMLText,substring(@HTMLText,@pos_lt,@len),'');
      end while;
     
      RETURN @HTMLText;
END

then I run a fourth view that calls the third view that has this function to strip remaining html tags:
      set @HTMLText=_HTMLText;
     
      while locate('<',@HTMLText)>0 do
            set
                  @pos_lt=locate('<',@HTMLText),
                  @len=locate('>',@HTMLText,@pos_lt)-@pos_lt+1;
            set
                  @HTMLText=replace(trim(replace(@HTMLText,substring(@HTMLText,@pos_lt,@len),'')),' ' ,'');
      end while;
     
      RETURN @HTMLText;
END

Here is the fouth view.  It concatenates one field from the data table and the derived 'summary' text field, which is fine.  The problem develops when this fourth view is linked to another table on the 'id' field.

select `RJM04 - strip paragraphs`.`id` AS `id`,
`RJM04 - strip paragraphs`.`course` AS `course`,
`RJM04 - strip paragraphs`.`section` AS `section`,
`RJM04 - strip paragraphs`.`SummaryOriginal` AS `SummaryOriginal`,
`udf_StripHTML`(`RJM04 - strip paragraphs`.`NewSummary3`) AS `summary`,
concat(`RJM04 - strip paragraphs`.`section`,_utf8' ',`udf_StripHTML`(`RJM04 - strip paragraphs`.`NewSummary3`)) AS `chapter`
from `RJM04 - strip paragraphs`

I suspect that is going to be hard to follow.  I think I can live with the results, as is, but if you have a simple suggestion, I'm all ears.

Thank you once again.
0
 
LVL 27

Expert Comment

by:Zberteoc
ID: 33664577
Ok, I got it.

I built this function in the first place but I had few omissions I notice. ;o)

I modified it to remove the returns from the whole text and I trim the string with every iteration. You don't need to apply the function in successive steps but only once to the string/column you need:

SELECT udf_StripHTML(your_col) FROM your_table;

here is the code you should replace in the function. I only post the statements from the functions BEGIN END block:

BEGIN
	set @HTMLText=trim(replace(_HTMLText,'\n',''));
	
	while locate('<',@HTMLText)>0 do 
		begin
			set 
				@pos_lt=locate('<',@HTMLText),
				@pos_gt=locate('>',@HTMLText),
				@HTMLText=trim(replace(@HTMLText,substring(@HTMLText,@pos_lt,@pos_gt-@pos_lt+1),''));
		end;
	end while;
	RETURN @HTMLText;
END

Open in new window

0

Featured Post

Will your db performance match your db growth?

In Percona’s white paper “Performance at Scale: Keeping Your Database on Its Toes,” we take a high-level approach to what you need to think about when planning for database scalability.

Question has a verified solution.

If you are experiencing a similar issue, please ask a related question

All XML, All the Time; More Fun MySQL Tidbits – Dynamically Generate XML via Stored Procedure in MySQL Extensible Markup Language (XML) and database systems, a marriage we are seeing more and more of.  So the topics of parsing and manipulating XM…
Foreword This is an old article.  Instead of using the MySQL extension that was used in the original code examples, please choose one of the currently supported database extensions instead.  More information is available here: MySQLi / PDO (http://…
In this video, viewers will be given step by step instructions on adjusting mouse, pointer and cursor visibility in Microsoft Windows 10. The video seeks to educate those who are struggling with the new Windows 10 Graphical User Interface. Change Cu…
In this video we outline the Physical Segments view of NetCrunch network monitor. By following this brief how-to video, you will be able to learn how NetCrunch visualizes your network, how granular is the information collected, as well as where to f…

691 members asked questions and received personalized solutions in the past 7 days.

Join the community of 500,000 technology professionals and ask your questions.

Join & Ask a Question