Welcome to Experts Exchange

Add your voice to the tech community where 5M+ people, just like you, are talking about what matters.

  • Help others & share knowledge
  • Earn cash & points
  • Learn & ask questions
Solved

Remove html tags in MYSQL data - inconsistent results with function

Posted on 2010-09-11
12
3,833 Views
Last Modified: 2012-05-10
I am trying to remove html tags using a function someone helped me with last week, and it works great most of the time, but I am seeing some unexpected results.  Can anyone help me troubleshoot this, please?

This is the text I am starting with:
<div class="summary">
  <h2>AutoCAD MEP 2010 Tutorial Series</h2><font size="4"><font size="4"><em>
  <p><font size="4"><em>HDR University Accredited (16.75 Professional Development Hours)</em></font></p></em></font></font><font size="4"><font size="4">

this is the result I am getting:
 
  AD MEP 2010 Tutorial Seriesm>
  ty Accredited (16.75 Professional Development Hours)>

Please note there is a blank row to start with, and the first five letters of 'AutoCAD' were removed, and at the end of the first line of text, it should have stopped with the word 'Series'.  I don't know if this has anything to do with line breaks, carriage returns, or anything like that.

Here is an example of one that mostly works.  This is the original text:
<h3 style="text-align: center;"><a name="Modeling"></a><img width="243" vspace="0" hspace="0" height="64" border="0" src="http://www.hdr.cadlearning.com/file.php/18/model.gif" alt="Modeling" title="Modeling" /><br /></h3>
<h3> Getting Organized </h3>

This is the result:

Getting Organized

Please note the first blank row.

This is the function that I am using:
DROP FUNCTION IF EXISTS `udf_StripHTML`;

CREATE FUNCTION `udf_StripHTML`(_HTMLText VARCHAR(8000)) RETURNS varchar(8000)
    DETERMINISTIC
BEGIN
      set @HTMLText=_HTMLText;
      
      while locate('<',@HTMLText)>0 do
      begin
            set
                  @pos_lt=locate('<',@HTMLText),
                  @pos_gt=locate('>',@HTMLText);
            set
                  @HTMLText=replace(@HTMLText,substring(@HTMLText,@pos_lt,@pos_gt),'');
      end;
      end while;
      RETURN @HTMLText;
END;


I appreciate any assistance you can provide.  Thank you in advance.
0
Comment
Question by:ginsburg7
  • 6
  • 4
  • 2
12 Comments
 
LVL 59

Expert Comment

by:Kevin Cross
ID: 33655019
Is the HTML XHTML compliant?  I ask as the above samples have unclosed tags like the  and the extraneous  tags at the end.  If the real data is always well formatted, then you could use this trick:


DROP FUNCTION IF EXISTS `udf_StripHTML`;

CREATE FUNCTION `udf_StripHTML`(_HTMLText VARCHAR(8000)) 
RETURNS varchar(8000) DETERMINISTIC
BEGIN
      set @HTMLText=extractvalue(_HTMLText, '//child::text()');
      RETURN @HTMLText;
END; 

Open in new window

0
 
LVL 59

Expert Comment

by:Kevin Cross
ID: 33655047
If it is not, I think I see the issue.  You are using substring which takes the form of substring(text, startposition, length) but are passing two index values.  I will try to correct that and repost.
0
 
LVL 59

Accepted Solution

by:
Kevin Cross earned 500 total points
ID: 33655067
Try like this!
DROP FUNCTION IF EXISTS `udf_StripHTML`;

CREATE FUNCTION `udf_StripHTML`(_HTMLText VARCHAR(8000)) 
RETURNS varchar(8000) DETERMINISTIC
BEGIN
      set @HTMLText=_HTMLText;
      
      while locate('<',@HTMLText)>0 do 
            set 
                  @pos_lt=locate('<',@HTMLText),
                  @len=locate('>',@HTMLText,@pos_lt)-@pos_lt+1;
            set 
                  @HTMLText=replace(@HTMLText,substring(@HTMLText,@pos_lt,@len),'');
      end while;
      
      RETURN @HTMLText;
END; 

Open in new window

0
PRTG Network Monitor: Intuitive Network Monitoring

Network Monitoring is essential to ensure that computer systems and network devices are running. Use PRTG to monitor LANs, servers, websites, applications and devices, bandwidth, virtual environments, remote systems, IoT, and many more. PRTG is easy to set up & use.

 

Author Comment

by:ginsburg7
ID: 33655434
wow, mwvisa1, that looks pretty good.  Thank you.  I will test it and get back to you.  I think I need a loop to remove line breaks and carriage returns first, then apply your function.  Do you have a function that can do that, too?  I appreciate it.
0
 
LVL 59

Expert Comment

by:Kevin Cross
ID: 33655465
You can just use the same function and use replace on the initial text.  '\n' = line feed; '\r' = carriage return.
0
 

Author Comment

by:ginsburg7
ID: 33659513
this is very close.  the problem I am seeing is it turns this:

<div class="summary">   <h2>Revit MEP 2010 Tutorial Series</h2>  

into this:

   Revit MEP 2010 Tutorial Series  

note the leading spaces.  Can you help me get rid of them, please?

Thank you very much, mwvisa1.
0
 
LVL 26

Expert Comment

by:Zberteoc
ID: 33661116
Use:

RETURN TRIM(@HTMLText)
0
 
LVL 59

Expert Comment

by:Kevin Cross
ID: 33661524
Thanks, Zberteoc! I thought I had put the TRIM() function in there.  I think I had taken it out playing with showing the REPLACE() function also but guess I showed neither. *laughing*

If you have a word and then empty space before or after a tag, you will get double spaces inside the text where TRIM() will not work, so you can additionally use REPLACE(TRIM(@HTMLText), '  ', ' ')

You can possibly put this inside the WHILE loop and the initial "set @HTMLText=_HTMLText;" to get this done a couple of times in the process.

Hope that helps!
0
 

Author Comment

by:ginsburg7
ID: 33663630
It does indeed help!  Thank you.  

I now have one new issue that I hope you know the answer to -
I am expecting to see this:
Revit MEP 2010 Tutorial Series

but and getting this odd character:
Revit MEP 2010 Tutorial Series

I tried using the Replace function, but it doesn't seem to recognize that character.  I've done some research, but cannot find a solution.  Do you have one, please?
0
 
LVL 59

Expert Comment

by:Kevin Cross
ID: 33663645
Can you paste the original HTML content into a code snippet -- maybe there is something in the original text.  Also, please post the function code you are now using.  Can use both to test on our end.
0
 

Author Comment

by:ginsburg7
ID: 33664413
My ultimate goal is to get only the name of the "chapter" out of this text, which in this case would be "Revit MEP 2010 Tutorial Series" (no quotes).

Here is the original text:

<div class="summary">
  <h2>Revit MEP 2010 Tutorial Series</h2>
  <p><font size="4"><em>HDR University Accredited (15.75 Professional Development Hours)</em></font></p><font size="4">
  <p><font size="4">In order to receive credit for this course you must score at least 80% on all of the Chapter Quizzes and on the Final Exam.</font></p>
  <p><font size="4" face="Trebuchet MS,Verdana,Arial,Helvetica,sans-serif"><em>It is strongly recommended that you watch all of the videos.  Partial credit will not be awarded.</em></font></p>
  <p><em></em></p>
  <p><font size="4">Shortly after completion the PDH credit will be recorded in your <strong>Professional Training History </strong>(PTH) in Connects. </font></p>
  <p></p><hr />
  <p></p>
  <p><font size="4">The <strong><i>Course Overview </i></strong>pdf contains more specific information. <br />The <strong><i>Course Outline with Lesson Times </i></strong>pdf will break down the elapsed time for each video, chapter and the total course time.</font> </p>
  <p></p>
  <p><font size="4">Start by downloading all of the Datasets and unzipping the files to your local drive. These may be referenced over the course of the tutorials.</font></p>
  <p><font size="4">You can start and stop the course at any time. If you are returning to a course and want to see where you left off, select <strong><i>Quizzes </i></strong>from the <strong><i>Activities </i></strong>block at the top left of this page.</font></p>
  <p><font size="4">Good luck!</font></p></font></div>

This may be quite convoluted, but this is what I am doing.  I run a view that has this function first to remove line feeds:
BEGIN
      set @HTMLText=_HTMLText;
     
      while locate(char(10),@HTMLText)>0 do
            set
                  @pos_lt=locate(char(10),@HTMLText),
                  @len=locate(char(10),@HTMLText,@pos_lt)-@pos_lt+1;
            set
                  @HTMLText=replace(@HTMLText,substring(@HTMLText,@pos_lt,@len),'');
      end while;
     
      RETURN @HTMLText;
END

Then I run the second view that calls the first view.  It has this function to remove carriage returns:
BEGIN
      set @HTMLText=_HTMLText;
     
      while locate(char(13),@HTMLText)>0 do
            set
                  @pos_lt=locate(char(13),@HTMLText),
                  @len=locate(char(13),@HTMLText,@pos_lt)-@pos_lt+1;
            set
                  @HTMLText=replace(@HTMLText,substring(@HTMLText,@pos_lt,@len),'');
      end while;
     
      RETURN @HTMLText;
END

then I run the third view that calls the second view, and it has this function to remove paragraph tags:
BEGIN
      set @HTMLText=_HTMLText;
     
      while locate(char(13)+char(10),@HTMLText)>0 do
            set
                  @pos_lt=locate(char(13)+char(10),@HTMLText),
                  @len=locate(char(13)+char(10),@HTMLText,@pos_lt)-@pos_lt+1;
            set
                  @HTMLText=replace(@HTMLText,substring(@HTMLText,@pos_lt,@len),'');
      end while;
     
      RETURN @HTMLText;
END

then I run a fourth view that calls the third view that has this function to strip remaining html tags:
      set @HTMLText=_HTMLText;
     
      while locate('<',@HTMLText)>0 do
            set
                  @pos_lt=locate('<',@HTMLText),
                  @len=locate('>',@HTMLText,@pos_lt)-@pos_lt+1;
            set
                  @HTMLText=replace(trim(replace(@HTMLText,substring(@HTMLText,@pos_lt,@len),'')),' ' ,'');
      end while;
     
      RETURN @HTMLText;
END

Here is the fouth view.  It concatenates one field from the data table and the derived 'summary' text field, which is fine.  The problem develops when this fourth view is linked to another table on the 'id' field.

select `RJM04 - strip paragraphs`.`id` AS `id`,
`RJM04 - strip paragraphs`.`course` AS `course`,
`RJM04 - strip paragraphs`.`section` AS `section`,
`RJM04 - strip paragraphs`.`SummaryOriginal` AS `SummaryOriginal`,
`udf_StripHTML`(`RJM04 - strip paragraphs`.`NewSummary3`) AS `summary`,
concat(`RJM04 - strip paragraphs`.`section`,_utf8' ',`udf_StripHTML`(`RJM04 - strip paragraphs`.`NewSummary3`)) AS `chapter`
from `RJM04 - strip paragraphs`

I suspect that is going to be hard to follow.  I think I can live with the results, as is, but if you have a simple suggestion, I'm all ears.

Thank you once again.
0
 
LVL 26

Expert Comment

by:Zberteoc
ID: 33664577
Ok, I got it.

I built this function in the first place but I had few omissions I notice. ;o)

I modified it to remove the returns from the whole text and I trim the string with every iteration. You don't need to apply the function in successive steps but only once to the string/column you need:

SELECT udf_StripHTML(your_col) FROM your_table;

here is the code you should replace in the function. I only post the statements from the functions BEGIN END block:

BEGIN
	set @HTMLText=trim(replace(_HTMLText,'\n',''));
	
	while locate('<',@HTMLText)>0 do 
		begin
			set 
				@pos_lt=locate('<',@HTMLText),
				@pos_gt=locate('>',@HTMLText),
				@HTMLText=trim(replace(@HTMLText,substring(@HTMLText,@pos_lt,@pos_gt-@pos_lt+1),''));
		end;
	end while;
	RETURN @HTMLText;
END

Open in new window

0

Featured Post

PRTG Network Monitor: Intuitive Network Monitoring

Network Monitoring is essential to ensure that computer systems and network devices are running. Use PRTG to monitor LANs, servers, websites, applications and devices, bandwidth, virtual environments, remote systems, IoT, and many more. PRTG is easy to set up & use.

Question has a verified solution.

If you are experiencing a similar issue, please ask a related question

Suggested Solutions

Title # Comments Views Activity
SQL inner join confusion 15 55
Insert values are dynamic 11 51
sort in mysql based off of query param 4 26
Why does Opencart Use Product tables use the MyISAM storage Engine 4 23
Foreword In the years since this article was written, numerous hacking attacks have targeted password-protected web sites.  The storage of client passwords has become a subject of much discussion, some of it useful and some of it misguided.  Of cou…
Foreword This is an old article.  Instead of using the MySQL extension that was used in the original code examples, please choose one of the currently supported database extensions instead.  More information is available here: MySQLi / PDO (http://…
Microsoft Active Directory, the widely used IT infrastructure, is known for its high risk of credential theft. The best way to test your Active Directory’s vulnerabilities to pass-the-ticket, pass-the-hash, privilege escalation, and malware attacks …
This video shows how to quickly and easily add an email signature for all users on Exchange 2016. The resulting signature is applied on a server level by Exchange Online. The email signature template has been downloaded from: www.mail-signatures…

791 members asked questions and received personalized solutions in the past 7 days.

Join the community of 500,000 technology professionals and ask your questions.

Join & Ask a Question