Solved

Clean up HTML Using preg_replace

Posted on 2007-10-18
Last Modified: 2013-12-13
Hi,

I need to clean up an HTML string ($text) for inclusion in a CSV file, so all HTML, tabs and white space (and commas?) need to be removed or escaped in such a way that we can build a valid CSV file.

This is my attempt using preg_replace, but I can't get it working.

It would be great if someone could take a look at it and/or advise a better solution.

Many Thanks

-s-

/////////////////////////////////////////////////////////////////////////

$search = array ("'<script[^>]*?>.*?</script>'si",  // Strip out javascript
                 "'<[\/\!]*?[^<>]*?>'si",           // Strip out HTML tags
                 "'([\r\n])[\s]+'",                 // Strip out white space
                 "'&(quot|#34);'i",                 // Replace HTML entities
                 "'&(amp|#38);'i",
                 "'&(lt|#60);'i",
                 "'&(gt|#62);'i",
                 "'&(nbsp|#160);'i",
                 "'&(iexcl|#161);'i",
                 "'&(cent|#162);'i",
                 "'&(pound|#163);'i",
                 "'&(copy|#169);'i",
                 "'&#(\d+);'e",
                 "\n");                    // evaluate as php

$replace = array ("","","\\1","\"","&","<",">"," ",chr(161),chr(162),chr(163),chr(169),"chr(\\1)","");


$result = preg_replace($search,$replace,$text);
Question by:socross
6 Comments
 
LVL 49

Expert Comment

by:Roonaan
ID: 20099249
Hello socross,

You could use a whitelist approach:
$allowedTokens = array(' ', '.', '?', '!' , '$');
$preg = '/[^\w'.preg_quote(implode('', $allowedTokens), '/').']/';
$text = preg_replace($preg, '', $text);
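As a quick illustration of what this whitelist keeps and drops, here is a small sketch run against a made-up sample string (the sample text and the $clean variable are assumptions, not part of the original question). Note that only the angle brackets and slashes of a tag are removed, so tag names survive as plain text unless the tags are stripped beforehand.

```php
<?php
// Characters allowed in addition to \w (letters, digits, underscore).
$allowedTokens = array(' ', '.', '?', '!', '$');

// Build a negated character class: anything NOT on the whitelist is removed.
$preg = '/[^\w' . preg_quote(implode('', $allowedTokens), '/') . ']/';

// Hypothetical sample input containing a comma, a tab and some markup.
$text = 'Price: $5, <b>really</b>?' . "\t" . 'Yes!';
$clean = preg_replace($preg, '', $text);

echo $clean, "\n"; // prints: Price $5 breallyb?Yes!
```

The comma and tab are gone, which suits a CSV field, but the leftover "b" from the <b> tag shows why markup should be removed before this step.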

Regards,

Roonaan
 
LVL 1

Author Comment

by:socross
ID: 20099280
What about &amp; and &nbsp;?

These need to be replaced with '&' and ' '.

It's just a matter of getting it into a format that CSV will like, whilst keeping its basic formatting.

-s-
 
LVL 49

Accepted Solution

by:
Roonaan earned 2000 total points
ID: 20099309
You could use unhtmlentities and/or unhtmlspecialchars:

function unhtmlentities($string)
{
    // Invert PHP's entity table so that entities map back to characters.
    $trans_tbl = get_html_translation_table(HTML_ENTITIES);
    $trans_tbl = array_flip($trans_tbl);
    return strtr($string, $trans_tbl);
}

function unhtmlspecialchars($string)
{
    $string = str_replace('&#039;', '\'', $string);
    $string = str_replace('&quot;', '"', $string);
    $string = str_replace('&lt;', '<', $string);
    $string = str_replace('&gt;', '>', $string);
    $string = str_replace('&uuml;', 'ü', $string);
    $string = str_replace('&Uuml;', 'Ü', $string);
    $string = str_replace('&auml;', 'ä', $string);
    $string = str_replace('&Auml;', 'Ä', $string);
    $string = str_replace('&ouml;', 'ö', $string);
    $string = str_replace('&Ouml;', 'Ö', $string);
    // Decode &amp; last so that input such as "&amp;lt;" is not decoded twice.
    $string = str_replace('&amp;', '&', $string);
    return $string;
}
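
On PHP 5 and later, the built-in html_entity_decode() should cover the same ground as the two helpers above. A minimal sketch, assuming the data is UTF-8 (the sample string and the $decoded variable are invented for illustration):

```php
<?php
// Hypothetical sample input with named and numeric entities.
$text = 'Fish &amp; chips&nbsp;&pound;5 &quot;to go&quot; &#039;now&#039;';

// ENT_QUOTES decodes both &quot; and &#039;.
// The 'UTF-8' charset is an assumption about the data.
$decoded = html_entity_decode($text, ENT_QUOTES, 'UTF-8');

// &nbsp; decodes to U+00A0 (a non-breaking space), not a plain space,
// so map it to an ordinary space explicitly.
$decoded = str_replace("\xC2\xA0", ' ', $decoded);

echo $decoded, "\n";
```

The explicit str_replace() for the non-breaking space matters for the original requirement that &nbsp; become ' '.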

-r-

 
LVL 1

Author Comment

by:socross
ID: 20099363
We get an "unexpected T_STRING" parse error when we call the unhtmlentities function.

This is the line it gets stuck on:

$trans_tbl =get_html_translation_table (HTML_ENTITIES );

-s-
 
LVL 29

Expert Comment

by:fibo
ID: 20099398
Have you had problems with strip_tags (http://www.php.net/strip_tags)?
The page mentions several risks, but if your html is well-formed this might be a solution.
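
A rough sketch of how strip_tags() could fit into the CSV clean-up, assuming well-formed HTML and UTF-8 input. The helper name html_to_csv_field and the sample data are made up for illustration. fputcsv() handles the quoting of commas and double quotes, so they do not have to be removed by hand.

```php
<?php
// Hypothetical helper: reduce an HTML fragment to one line of plain text.
function html_to_csv_field($html)
{
    // Drop <script> blocks first: strip_tags() removes the tags themselves
    // but would keep the JavaScript text between them.
    $text = preg_replace('#<script\b[^>]*>.*?</script>#is', '', $html);
    $text = strip_tags($text);
    $text = html_entity_decode($text, ENT_QUOTES, 'UTF-8');
    // Collapse tabs, newlines, non-breaking spaces and runs of spaces
    // into a single space (the /u flag assumes valid UTF-8).
    $text = preg_replace('/[\s\x{00A0}]+/u', ' ', $text);
    return trim($text);
}

$html  = "<p>Fish &amp; chips,\n\t<b>&pound;5</b>&nbsp;</p><script>alert(1)</script>";
$field = html_to_csv_field($html);

// Write one row to an in-memory stream to show the quoting fputcsv() applies.
// Delimiter, enclosure and escape character are passed explicitly.
$fh = fopen('php://memory', 'w+');
fputcsv($fh, array(1, $field), ',', '"', '\\');
rewind($fh);
$csv = stream_get_contents($fh);
fclose($fh);

echo $csv;
```

Because the field contains a comma, fputcsv() wraps it in double quotes, so the comma can stay in the text without breaking the row.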
 
LVL 1

Author Comment

by:socross
ID: 20268668
We think the problem was with our ISP's PHP 4 configuration. The htmlentities code works really well on our PHP 5 server, though, and will come in very handy!

Thanks

-s-