Solved

Remove html tags in MYSQL data - inconsistent results with function

Posted on 2010-09-11
12
3,942 Views
Last Modified: 2012-05-10
I am trying to remove html tags using a function someone helped me with last week, and it works great most of the time, but I am seeing some unexpected results.  Can anyone help me troubleshoot this, please?

This is the text I am starting with:
<div class="summary">
  <h2>AutoCAD MEP 2010 Tutorial Series</h2><font size="4"><font size="4"><em>
  <p><font size="4"><em>HDR University Accredited (16.75 Professional Development Hours)</em></font></p></em></font></font><font size="4"><font size="4">

this is the result I am getting:
 
  AD MEP 2010 Tutorial Seriesm>
  ty Accredited (16.75 Professional Development Hours)>

Please note there is a blank row to start with, and the first five letters of 'AutoCAD' were removed, and at the end of the first line of text, it should have stopped with the word 'Series'.  I don't know if this has anything to do with line breaks, carriage returns, or anything like that.

Here is an example of one that mostly works.  This is the original text:
<h3 style="text-align: center;"><a name="Modeling"></a><img width="243" vspace="0" hspace="0" height="64" border="0" src="http://www.hdr.cadlearning.com/file.php/18/model.gif" alt="Modeling" title="Modeling" /><br /></h3>
<h3> Getting Organized </h3>

This is the result:

Getting Organized

Please note the first blank row.

This is the function that I am using:
DROP FUNCTION IF EXISTS `udf_StripHTML`;

CREATE FUNCTION `udf_StripHTML`(_HTMLText VARCHAR(8000)) RETURNS varchar(8000)
    DETERMINISTIC
BEGIN
      set @HTMLText=_HTMLText;
      
      while locate('<',@HTMLText)>0 do
      begin
            set
                  @pos_lt=locate('<',@HTMLText),
                  @pos_gt=locate('>',@HTMLText);
            set
                  @HTMLText=replace(@HTMLText,substring(@HTMLText,@pos_lt,@pos_gt),'');
      end;
      end while;
      RETURN @HTMLText;
END;


I appreciate any assistance you can provide.  Thank you in advance.
0
Comment
Question by:ginsburg7
[X]
Welcome to Experts Exchange

Add your voice to the tech community where 5M+ people just like you are talking about what matters.

  • Help others & share knowledge
  • Earn cash & points
  • Learn & ask questions
  • 6
  • 4
  • 2
12 Comments
 
LVL 60

Expert Comment

by:Kevin Cross
ID: 33655019
Is the HTML XHTML compliant?  I ask as the above samples have unclosed tags like the  and the extraneous  tags at the end.  If the real data is always well formatted, then you could use this trick:


DROP FUNCTION IF EXISTS `udf_StripHTML`;

CREATE FUNCTION `udf_StripHTML`(_HTMLText VARCHAR(8000)) 
RETURNS varchar(8000) DETERMINISTIC
BEGIN
      set @HTMLText=extractvalue(_HTMLText, '//child::text()');
      RETURN @HTMLText;
END; 

Open in new window

0
 
LVL 60

Expert Comment

by:Kevin Cross
ID: 33655047
If it is not, I think I see the issue.  You are using substring which takes the form of substring(text, startposition, length) but are passing two index values.  I will try to correct that and repost.
0
 
LVL 60

Accepted Solution

by:
Kevin Cross earned 500 total points
ID: 33655067
Try like this!
DROP FUNCTION IF EXISTS `udf_StripHTML`;

CREATE FUNCTION `udf_StripHTML`(_HTMLText VARCHAR(8000)) 
RETURNS varchar(8000) DETERMINISTIC
BEGIN
      set @HTMLText=_HTMLText;
      
      while locate('<',@HTMLText)>0 do 
            set 
                  @pos_lt=locate('<',@HTMLText),
                  @len=locate('>',@HTMLText,@pos_lt)-@pos_lt+1;
            set 
                  @HTMLText=replace(@HTMLText,substring(@HTMLText,@pos_lt,@len),'');
      end while;
      
      RETURN @HTMLText;
END; 

Open in new window

0
Complete VMware vSphere® ESX(i) & Hyper-V Backup

Capture your entire system, including the host, with patented disk imaging integrated with VMware VADP / Microsoft VSS and RCT. RTOs is as low as 15 seconds with Acronis Active Restore™. You can enjoy unlimited P2V/V2V migrations from any source (even from a different hypervisor)

 

Author Comment

by:ginsburg7
ID: 33655434
wow, mwvisa1, that looks pretty good.  Thank you.  I will test it and get back to you.  I think I need a loop to remove line breaks and carriage returns first, then apply your function.  Do you have a function that can do that, too?  I appreciate it.
0
 
LVL 60

Expert Comment

by:Kevin Cross
ID: 33655465
You can just use the same function and use replace on the initial text.  '\n' = line feed; '\r' = carriage return.
0
 

Author Comment

by:ginsburg7
ID: 33659513
this is very close.  the problem I am seeing is it turns this:

<div class="summary">   <h2>Revit MEP 2010 Tutorial Series</h2>  

into this:

   Revit MEP 2010 Tutorial Series  

note the leading spaces.  Can you help me get rid of them, please?

Thank you very much, mwvisa1.
0
 
LVL 27

Expert Comment

by:Zberteoc
ID: 33661116
Use:

RETURN TRIM(@HTMLText)
0
 
LVL 60

Expert Comment

by:Kevin Cross
ID: 33661524
Thanks, Zberteoc! I thought I had put the TRIM() function in there.  I think I had taken it out playing with showing the REPLACE() function also but guess I showed neither. *laughing*

If you have a word and then empty space before or after a tag, you will get double spaces inside the text where TRIM() will not work, so you can additionally use REPLACE(TRIM(@HTMLText), '  ', ' ')

You can possibly put this inside the WHILE loop and the initial "set @HTMLText=_HTMLText;" to get this done a couple of times in the process.

Hope that helps!
0
 

Author Comment

by:ginsburg7
ID: 33663630
It does indeed help!  Thank you.  

I now have one new issue that I hope you know the answer to -
I am expecting to see this:
Revit MEP 2010 Tutorial Series

but and getting this odd character:
Revit MEP 2010 Tutorial Series

I tried using the Replace function, but it doesn't seem to recognize that character.  I've done some research, but cannot find a solution.  Do you have one, please?
0
 
LVL 60

Expert Comment

by:Kevin Cross
ID: 33663645
Can you paste the original HTML content into a code snippet -- maybe there is something in the original text.  Also, please post the function code you are now using.  Can use both to test on our end.
0
 

Author Comment

by:ginsburg7
ID: 33664413
My ultimate goal is to get only the name of the "chapter" out of this text, which in this case would be "Revit MEP 2010 Tutorial Series" (no quotes).

Here is the original text:

<div class="summary">
  <h2>Revit MEP 2010 Tutorial Series</h2>
  <p><font size="4"><em>HDR University Accredited (15.75 Professional Development Hours)</em></font></p><font size="4">
  <p><font size="4">In order to receive credit for this course you must score at least 80% on all of the Chapter Quizzes and on the Final Exam.</font></p>
  <p><font size="4" face="Trebuchet MS,Verdana,Arial,Helvetica,sans-serif"><em>It is strongly recommended that you watch all of the videos.  Partial credit will not be awarded.</em></font></p>
  <p><em></em></p>
  <p><font size="4">Shortly after completion the PDH credit will be recorded in your <strong>Professional Training History </strong>(PTH) in Connects. </font></p>
  <p></p><hr />
  <p></p>
  <p><font size="4">The <strong><i>Course Overview </i></strong>pdf contains more specific information. <br />The <strong><i>Course Outline with Lesson Times </i></strong>pdf will break down the elapsed time for each video, chapter and the total course time.</font> </p>
  <p></p>
  <p><font size="4">Start by downloading all of the Datasets and unzipping the files to your local drive. These may be referenced over the course of the tutorials.</font></p>
  <p><font size="4">You can start and stop the course at any time. If you are returning to a course and want to see where you left off, select <strong><i>Quizzes </i></strong>from the <strong><i>Activities </i></strong>block at the top left of this page.</font></p>
  <p><font size="4">Good luck!</font></p></font></div>

This may be quite convoluted, but this is what I am doing.  I run a view that has this function first to remove line feeds:
BEGIN
      set @HTMLText=_HTMLText;
     
      while locate(char(10),@HTMLText)>0 do
            set
                  @pos_lt=locate(char(10),@HTMLText),
                  @len=locate(char(10),@HTMLText,@pos_lt)-@pos_lt+1;
            set
                  @HTMLText=replace(@HTMLText,substring(@HTMLText,@pos_lt,@len),'');
      end while;
     
      RETURN @HTMLText;
END

Then I run the second view that calls the first view.  It has this function to remove carriage returns:
BEGIN
      set @HTMLText=_HTMLText;
     
      while locate(char(13),@HTMLText)>0 do
            set
                  @pos_lt=locate(char(13),@HTMLText),
                  @len=locate(char(13),@HTMLText,@pos_lt)-@pos_lt+1;
            set
                  @HTMLText=replace(@HTMLText,substring(@HTMLText,@pos_lt,@len),'');
      end while;
     
      RETURN @HTMLText;
END

then I run the third view that calls the second view, and it has this function to remove paragraph tags:
BEGIN
      set @HTMLText=_HTMLText;
     
      while locate(char(13)+char(10),@HTMLText)>0 do
            set
                  @pos_lt=locate(char(13)+char(10),@HTMLText),
                  @len=locate(char(13)+char(10),@HTMLText,@pos_lt)-@pos_lt+1;
            set
                  @HTMLText=replace(@HTMLText,substring(@HTMLText,@pos_lt,@len),'');
      end while;
     
      RETURN @HTMLText;
END

then I run a fourth view that calls the third view that has this function to strip remaining html tags:
      set @HTMLText=_HTMLText;
     
      while locate('<',@HTMLText)>0 do
            set
                  @pos_lt=locate('<',@HTMLText),
                  @len=locate('>',@HTMLText,@pos_lt)-@pos_lt+1;
            set
                  @HTMLText=replace(trim(replace(@HTMLText,substring(@HTMLText,@pos_lt,@len),'')),' ' ,'');
      end while;
     
      RETURN @HTMLText;
END

Here is the fouth view.  It concatenates one field from the data table and the derived 'summary' text field, which is fine.  The problem develops when this fourth view is linked to another table on the 'id' field.

select `RJM04 - strip paragraphs`.`id` AS `id`,
`RJM04 - strip paragraphs`.`course` AS `course`,
`RJM04 - strip paragraphs`.`section` AS `section`,
`RJM04 - strip paragraphs`.`SummaryOriginal` AS `SummaryOriginal`,
`udf_StripHTML`(`RJM04 - strip paragraphs`.`NewSummary3`) AS `summary`,
concat(`RJM04 - strip paragraphs`.`section`,_utf8' ',`udf_StripHTML`(`RJM04 - strip paragraphs`.`NewSummary3`)) AS `chapter`
from `RJM04 - strip paragraphs`

I suspect that is going to be hard to follow.  I think I can live with the results, as is, but if you have a simple suggestion, I'm all ears.

Thank you once again.
0
 
LVL 27

Expert Comment

by:Zberteoc
ID: 33664577
Ok, I got it.

I built this function in the first place but I had few omissions I notice. ;o)

I modified it to remove the returns from the whole text and I trim the string with every iteration. You don't need to apply the function in successive steps but only once to the string/column you need:

SELECT udf_StripHTML(your_col) FROM your_table;

here is the code you should replace in the function. I only post the statements from the functions BEGIN END block:

BEGIN
	set @HTMLText=trim(replace(_HTMLText,'\n',''));
	
	while locate('<',@HTMLText)>0 do 
		begin
			set 
				@pos_lt=locate('<',@HTMLText),
				@pos_gt=locate('>',@HTMLText),
				@HTMLText=trim(replace(@HTMLText,substring(@HTMLText,@pos_lt,@pos_gt-@pos_lt+1),''));
		end;
	end while;
	RETURN @HTMLText;
END

Open in new window

0

Featured Post

Comprehensive Backup Solutions for Microsoft

Acronis protects the complete Microsoft technology stack: Windows Server, Windows PC, laptop and Surface data; Microsoft business applications; Microsoft Hyper-V; Azure VMs; Microsoft Windows Server 2016; Microsoft Exchange 2016 and SQL Server 2016.

Question has a verified solution.

If you are experiencing a similar issue, please ask a related question

Suggested Solutions

Title # Comments Views Activity
possible to record changes (trigger I think) msql 11 47
Very Large data in MYSQL 7 105
Uploading a CSV Data Import via PHP & MySql 3 68
Error in sql query statment. 21 69
Introduction In this article, I will by showing a nice little trick for MySQL similar to that of my previous EE Article for SQLite (http://www.sqlite.org/), A SQLite Tidbit: Quick Numbers Table Generation (http://www.experts-exchange.com/A_3570.htm…
Introduction This article is intended for those who are new to PHP error handling (https://www.experts-exchange.com/articles/11769/And-by-the-way-I-am-New-to-PHP.html).  It addresses one of the most common problems that plague beginning PHP develop…
I've attached the XLSM Excel spreadsheet I used in the video and also text files containing the macros used below. https://filedb.experts-exchange.com/incoming/2017/03_w12/1151775/Permutations.txt https://filedb.experts-exchange.com/incoming/201…
Attackers love to prey on accounts that have privileges. Reducing privileged accounts and protecting privileged accounts therefore is paramount. Users, groups, and service accounts need to be protected to help protect the entire Active Directory …

738 members asked questions and received personalized solutions in the past 7 days.

Join the community of 500,000 technology professionals and ask your questions.

Join & Ask a Question