socross

Clean up HTML Using preg_replace

Hi

I need to clean up an HTML string ($text) for inclusion in a CSV file, so all HTML, tabs and white space (and commas?) need to be removed or commented out in such a way that we can build a valid CSV file.

This is my attempt using preg_replace, but I can't get it working.

It would be great if someone could take a look at it and/or advise a better solution.

Many Thanks

-s-

/////////////////////////////////////////////////////////////////////////

$search = array ("'<script[^>]*?>.*?</script>'si",  // Strip out javascript
                 "'<[\/\!]*?[^<>]*?>'si",           // Strip out HTML tags
                 "'([\r\n])[\s]+'",                 // Strip out white space
                 "'&(quot|#34);'i",                 // Replace HTML entities
                 "'&(amp|#38);'i",
                 "'&(lt|#60);'i",
                 "'&(gt|#62);'i",
                 "'&(nbsp|#160);'i",
                 "'&(iexcl|#161);'i",
                 "'&(cent|#162);'i",
                 "'&(pound|#163);'i",
                 "'&(copy|#169);'i",
                 "'&#(\d+);'e",
                 "\n");                    // evaluate as php

$replace = array ("","","\\1","\"","&","<",">"," ",chr(161),chr(162),chr(163),chr(169),"chr(\\1)","");


$result = preg_replace($search,$replace,$text);

Roonaan

Hello socross,

You could use a whitelist approach:
$allowedTokens = array(' ', '.', '?', '!' , '$');
$preg = '/[^\w'.preg_quote(implode('', $allowedTokens), '/').']/';
$text = preg_replace($preg, '', $text);
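
A small sketch of what this whitelist does to a sample string. The sample input and the $raw variable are hypothetical, chosen only for illustration. Note that the pattern removes the angle brackets and the ampersand but keeps the letters inside tags and entities, so tags and entities probably need handling before the whitelist is applied.

```php
<?php
// Hypothetical sample input, not taken from the original question.
$text = '<b>Hello, world!</b> &amp; $5.00';

$allowedTokens = array(' ', '.', '?', '!', '$');
// Builds the character class [^\w \.\?\!\$]: anything that is not a word
// character or one of the allowed tokens gets removed.
$preg = '/[^\w' . preg_quote(implode('', $allowedTokens), '/') . ']/';

$raw = preg_replace($preg, '', $text);
// $raw is now 'bHello world!b amp $5.00': the "b" of <b> and the "amp"
// of &amp; survive because they are ordinary word characters.
```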

Regards,

Roonaan

socross (ASKER)

What about &amp; and &nbsp;?

These need to be replaced with '&' and ' '.

It's just a matter of getting it into a format that CSV will like, whilst keeping its basic formatting.

-s-

Roonaan (ASKER CERTIFIED SOLUTION)

[This solution is only available to members of Experts Exchange.]

socross (ASKER)

We get an "unexpected T_STRING" error when we call the unhtmlentities function.

This is the line it gets stuck on:

$trans_tbl = get_html_translation_table(HTML_ENTITIES);

-s-
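
The certified answer is not visible in this thread, so the following is only a minimal sketch of what an unhtmlentities() helper built around that line might look like, assuming it follows the common pattern of flipping the translation table and passing it to strtr(). On PHP 4.3.0 and later the built-in html_entity_decode() does the same job.

```php
<?php
// Assumed shape of the helper; the real answer's code is not shown in the thread.
function unhtmlentities($string)
{
    // Table of character => entity, e.g. '&' => '&amp;'.
    $trans_tbl = get_html_translation_table(HTML_ENTITIES);
    // Flip it to entity => character so strtr() can decode.
    $trans_tbl = array_flip($trans_tbl);
    return strtr($string, $trans_tbl);
}

// Hypothetical sample input.
$decoded = unhtmlentities('Fish &amp; chips &lt;fresh&gt;');
// $decoded is 'Fish & chips <fresh>'
```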

Have you had problems with strip_tags (http://www.php.net/strip_tags)?
The page mentions several risks, but if your HTML is well-formed this might be a solution.
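
A rough sketch of how strip_tags could be combined with entity decoding and whitespace cleanup for CSV output. The helper name html_to_csv_text and the sample row are hypothetical, and the input is assumed to be well-formed UTF-8 HTML. Writing the row with fputcsv() means commas can stay in the text, because fields that contain them are quoted automatically.

```php
<?php
// Hypothetical helper: reduce an HTML fragment to one line of plain text.
function html_to_csv_text($html)
{
    $text = strip_tags($html);                              // drop tags
    $text = html_entity_decode($text, ENT_QUOTES, 'UTF-8'); // &amp; -> &, &nbsp; -> U+00A0
    $text = str_replace("\xC2\xA0", ' ', $text);            // non-breaking space -> plain space
    return trim(preg_replace('/\s+/', ' ', $text));         // collapse tabs, newlines, space runs
}

// Write one sample row to an in-memory stream so the result can be inspected.
$out = fopen('php://memory', 'w+');
$row = array(1, html_to_csv_text("<p>Fish&nbsp;&amp; chips,\n\t<b>5 each</b></p>"));
fputcsv($out, $row, ',', '"', '\\');
rewind($out);
$line = stream_get_contents($out);
fclose($out);
// $line is: 1,"Fish & chips, 5 each" followed by a newline
```

Decoding entities after stripping tags, rather than before, avoids turning &lt;b&gt; in the text into a real tag that strip_tags would then remove.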

socross (ASKER)

We think the problem was with our ISP's configuration of PHP 4. The htmlentities code works really well on our PHP 5 server, though, and will come in very handy!

Thanks

-s-