Clean up HTML Using preg_replace

Hi,

I need to clean up an HTML string ($text) for inclusion in a CSV file, so all HTML, tabs, and white space (commas too?) need to be removed or escaped in such a way that we can build a valid CSV file.

This is my attempt using preg_replace, but I can't get it working.

It would be great if someone could take a look at it and/or advise a better solution.

Many Thanks

-s-

/////////////////////////////////////////////////////////////////////////

$search = array ("'<script[^>]*?>.*?</script>'si",  // Strip out javascript
                 "'<[\/\!]*?[^<>]*?>'si",           // Strip out HTML tags
                 "'([\r\n])[\s]+'",                 // Strip out white space
                 "'&(quot|#34);'i",                 // Replace HTML entities
                 "'&(amp|#38);'i",
                 "'&(lt|#60);'i",
                 "'&(gt|#62);'i",
                 "'&(nbsp|#160);'i",
                 "'&(iexcl|#161);'i",
                 "'&(cent|#162);'i",
                 "'&(pound|#163);'i",
                 "'&(copy|#169);'i",
                 "'&#(\d+);'e",
                 "\n");                    // evaluate as php

$replace = array ("","","\\1","\"","&","<",">"," ",chr(161),chr(162),chr(163),chr(169),"chr(\\1)","");


$result = preg_replace($search,$replace,$text);
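Two things in the attempt above look likely to stop it from working, as far as can be judged from the code alone: the last pattern, "\n", is not a delimited regular expression, and the e modifier on '&#(\d+);' was deprecated in PHP 5.5 and removed in PHP 7. Below is a minimal sketch of the same steps using preg_replace_callback instead. It assumes PHP 7 or later and UTF-8 text; the function name clean_html_regex is made up for illustration.

```php
<?php
// Hypothetical helper; mirrors the steps of the attempt above without the /e modifier.
function clean_html_regex(string $text): string
{
    // Strip <script> blocks first, then any remaining tags.
    $text = preg_replace('~<script[^>]*>.*?</script>~si', '', $text);
    $text = preg_replace('~<[^<>]*>~', '', $text);

    // Decode numeric entities via a callback (stands in for "chr(\\1)" with /e).
    $text = preg_replace_callback(
        '~&#\d+;~',
        function (array $m): string {
            return html_entity_decode($m[0], ENT_QUOTES, 'UTF-8');
        },
        $text
    );

    // Decode the common named entities in a single pass.
    $text = strtr($text, array(
        '&quot;' => '"',
        '&amp;'  => '&',
        '&lt;'   => '<',
        '&gt;'   => '>',
        '&nbsp;' => ' ',
    ));

    // Collapse tabs, newlines and repeated spaces into one space.
    return trim(preg_replace('~\s+~', ' ', $text));
}

// prints: Fish & chips, £5
echo clean_html_regex("<p>Fish &amp; chips,\n\t&#163;5</p><script>alert(1)</script>"), "\n";
```

Numeric entities are decoded before the named ones so that text such as &amp;#163; is only decoded once.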
socross Asked:
Roonaan Commented:
You could use unhtmlentities and/or unhtmlspecialchars:

function unhtmlentities ($string) {
   $trans_tbl =get_html_translation_table (HTML_ENTITIES );
   $trans_tbl =array_flip ($trans_tbl );
   return strtr ($string ,$trans_tbl );
}

function unhtmlspecialchars( $string )
{
  $string = str_replace ( '&amp;', '&', $string );
  $string = str_replace ( '&#039;', '\'', $string );
  $string = str_replace ( '&quot;', '"', $string );
  $string = str_replace ( '&lt;', '<', $string );
  $string = str_replace ( '&gt;', '>', $string );
  $string = str_replace ( '&uuml;', 'ü', $string );
  $string = str_replace ( '&Uuml;', 'Ü', $string );
  $string = str_replace ( '&auml;', 'ä', $string );
  $string = str_replace ( '&Auml;', 'Ä', $string );
  $string = str_replace ( '&ouml;', 'ö', $string );
  $string = str_replace ( '&Ouml;', 'Ö', $string );
  return $string;
}

-r-
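
On PHP 4.3 and later, the built-in html_entity_decode should cover what both helpers above do by hand. A minimal sketch, assuming UTF-8 text; the sample string is made up for illustration.

```php
<?php
// Made-up sample input containing several named entities.
$encoded = 'M&uuml;ller &amp; S&ouml;hne, &quot;caf&eacute;&quot;';

// ENT_QUOTES decodes both double- and single-quote entities;
// the third argument names the character set of the result.
$decoded = html_entity_decode($encoded, ENT_QUOTES, 'UTF-8');

// prints: Müller & Söhne, "café"
echo $decoded, "\n";
```

One thing to watch for: &nbsp; decodes to a non-breaking space (U+00A0) rather than a plain space, so it may need a separate str_replace afterwards.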
 
Roonaan Commented:
Hello socross,

You could use a whitelist approach:
$allowedTokens = array(' ', '.', '?', '!' , '$');
$preg = '/[^\w'.preg_quote(implode('', $allowedTokens), '/').']/';
$text = preg_replace($preg, '', $text);

Regards,

Roonaan
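
To see what the whitelist keeps and drops, here is a small run of the same pattern on a made-up string. Note that it removes the angle brackets, commas and tabs but keeps the tag names themselves, because \w matches their letters, so it probably works best after the tags have already been stripped.

```php
<?php
$allowedTokens = array(' ', '.', '?', '!', '$');
$preg = '/[^\w' . preg_quote(implode('', $allowedTokens), '/') . ']/';

// Made-up sample: a tag, a comma, a tab, a colon and a price.
$text = '<b>Hi, there!</b>' . "\t" . 'Cost: $5.00';
$filtered = preg_replace($preg, '', $text);

// prints: bHi there!bCost $5.00
echo $filtered, "\n";
```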
 
socross (Author) Commented:
What about &amp; and &nbsp;?

These need to be replaced with '&' and ' '.

It's just a matter of getting it into a format that CSV will like, whilst keeping its basic formatting.

-s-
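
One way to get this kind of text into a CSV-friendly form is sketched below, under a few assumptions: the input is UTF-8, PHP 7 or later is available, and the names html_to_csv_field and csv_line are made up for illustration. Commas and quotes do not have to be removed by hand, because fputcsv quotes any field that contains them.

```php
<?php
// Hypothetical helper: reduce an HTML fragment to a single line of plain text.
function html_to_csv_field(string $html): string
{
    $text = strip_tags($html);
    $text = html_entity_decode($text, ENT_QUOTES, 'UTF-8');
    // &nbsp; decodes to U+00A0 (bytes C2 A0 in UTF-8); turn it into a plain space.
    $text = str_replace("\xC2\xA0", ' ', $text);
    // Collapse tabs, newlines and repeated spaces into one space.
    $text = preg_replace('/\s+/', ' ', $text);
    return trim($text);
}

// Hypothetical helper: format one row as a CSV line using an in-memory stream.
function csv_line(array $fields): string
{
    $fh = fopen('php://temp', 'r+');
    fputcsv($fh, $fields, ',', '"', '\\');
    rewind($fh);
    $line = stream_get_contents($fh);
    fclose($fh);
    return $line;
}

$field = html_to_csv_field("<p>Fish&nbsp;&amp; chips,\n\t\"fresh\"</p>");

// prints: 1,"Fish & chips, ""fresh"""
echo csv_line(array(1, $field));
```

When writing a real file, the same fputcsv call can target a handle from fopen('file.csv', 'w') instead of php://temp.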
 
socross (Author) Commented:
We get an "unexpected T_STRING" parse error when we call the unhtmlentities function.

This is the line it gets stuck on:

$trans_tbl =get_html_translation_table (HTML_ENTITIES );

-s-
 
Bernard S. (CTO) Commented:
Have you had problems with strip_tags (http://www.php.net/strip_tags)?
The page mentions several risks, but if your HTML is well-formed this might be a solution.
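
One caveat worth checking before relying on strip_tags: it removes the tags but leaves the text that sat between them, including the body of a script or style element. A short illustration with a made-up string, which removes those blocks first:

```php
<?php
// Made-up sample with a style block, normal markup and a script block.
$html = '<style>p { color: red; }</style><p>Hello <b>world</b></p><script>alert(1);</script>';

// strip_tags alone keeps the CSS and JavaScript source as plain text.
$naive = strip_tags($html);

// Removing whole script/style elements first avoids that.
$withoutBlocks = preg_replace('~<(script|style)\b[^>]*>.*?</\1>~is', '', $html);
$clean = strip_tags($withoutBlocks);

// prints: p { color: red; }Hello worldalert(1);
echo $naive, "\n";
// prints: Hello world
echo $clean, "\n";
```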
 
socross (Author) Commented:
We think the problem was with our ISP's PHP 4 configuration. The htmlentities code works really well on our PHP 5 server, though, and will come in very handy!

Thanks

-s-
Question has a verified solution.
