Clean up HTML Using preg_replace

Posted on 2007-10-18
Last Modified: 2013-12-13

I need to clean up an HTML string ($text) for inclusion in a CSV file, so it needs all HTML, tabs, and white space (commas too?) removed or escaped in such a way that we can build a valid CSV file.

This is my attempt using preg_replace, but I can't get it working.

It would be great if someone could take a look at it and/or advise a better solution.

Many Thanks



$search = array ("'<script[^>]*?>.*?</script>'si",  // Strip out javascript
                 "'<[\/\!]*?[^<>]*?>'si",           // Strip out HTML tags
                 "'([\r\n])[\s]+'",                 // Strip out white space
                 "'&(quot|#34);'i",                 // Replace HTML entities
                 "\n");                    // evaluate as php

$replace = array ("","","\\1","\"","&","<",">"," ",chr(161),chr(162),chr(163),chr(169),"chr(\\1)","");

$result = preg_replace($search,$replace,$text);
Question by: socross
    LVL 49

    Expert Comment

    Hello socross,

    You could use a whitelist approach:
    $allowedTokens = array(' ', '.', '?', '!' , '$');
    $preg = '/[^\w'.preg_quote(implode('', $allowedTokens), '/').']/';
    $text = preg_replace($preg, '', $text);


    LVL 1

    Author Comment

    What about &amp; and &nbsp;?

    These need to be replaced with '&' and ' '.

    It's just a matter of getting it into a format that CSV will like, whilst keeping its basic formatting.
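
For the CSV side of this, one option is to let PHP's built-in fputcsv do the quoting, so that commas and double quotes in the text do not have to be stripped at all. A minimal sketch, assuming the cleaned text ends up as one field of a row; the helper name csv_line and the sample values are made up for illustration:

```php
<?php
// Hypothetical helper: format one row as a CSV line using fputcsv,
// which quotes fields that contain commas, quotes, spaces or line breaks.
function csv_line(array $fields)
{
    $handle = fopen('php://memory', 'r+');
    fputcsv($handle, $fields, ',', '"', '\\');
    rewind($handle);
    $line = stream_get_contents($handle);
    fclose($handle);
    // fputcsv appends a line ending; strip it so the caller decides.
    return rtrim($line, "\r\n");
}

echo csv_line(array(1, 'Tom & Jerry, "the" cat and mouse')), "\n";
```

With this approach only tags, entities and stray line breaks need cleaning beforehand; embedded quotes are doubled and the field is wrapped in quotes by fputcsv itself.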

    LVL 49

    Accepted Solution

    You could use unhtmlentities and/or unhtmlspecialchars:

    function unhtmlentities ($string) {
       // Flip the entity table so it maps entities back to characters
       $trans_tbl = get_html_translation_table (HTML_ENTITIES);
       $trans_tbl = array_flip ($trans_tbl);
       return strtr ($string, $trans_tbl);
    }

    function unhtmlspecialchars ($string) {
      $string = str_replace ('&#039;', '\'', $string);
      $string = str_replace ('&quot;', '"', $string);
      $string = str_replace ('&lt;', '<', $string);
      $string = str_replace ('&gt;', '>', $string);
      $string = str_replace ('&uuml;', 'ü', $string);
      $string = str_replace ('&Uuml;', 'Ü', $string);
      $string = str_replace ('&auml;', 'ä', $string);
      $string = str_replace ('&Auml;', 'Ä', $string);
      $string = str_replace ('&ouml;', 'ö', $string);
      $string = str_replace ('&Ouml;', 'Ö', $string);
      // Decode &amp; last so that e.g. "&amp;lt;" is not decoded twice
      $string = str_replace ('&amp;', '&', $string);
      return $string;
    }
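
On PHP 5 and later, the built-in html_entity_decode covers much the same ground as the functions above. A minimal sketch, assuming UTF-8 input; the function name decode_for_csv is hypothetical. Note that &nbsp; decodes to a non-breaking space (U+00A0) rather than a plain space, so it is swapped afterwards:

```php
<?php
// Hypothetical helper: decode HTML entities, then turn the
// non-breaking space produced by &nbsp; into an ordinary space.
function decode_for_csv($string)
{
    $decoded = html_entity_decode($string, ENT_QUOTES, 'UTF-8');
    // "\xC2\xA0" is the UTF-8 byte sequence for U+00A0.
    return str_replace("\xC2\xA0", ' ', $decoded);
}

echo decode_for_csv('Tom &amp;&nbsp;Jerry &quot;live&quot;'), "\n";
```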

    LVL 1

    Author Comment

    We get an "unexpected T_STRING" parse error when we call the unhtmlentities function.

    This is the line it gets stuck on:

    $trans_tbl =get_html_translation_table (HTML_ENTITIES );

    LVL 29

    Expert Comment

    Have you had problems with strip_tags?
    The page mentions several risks, but if your HTML is well-formed this might be a solution.
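
A minimal sketch of a strip_tags-based cleanup, combined with entity decoding and whitespace collapsing, might look like the following. The function name html_to_csv_text is made up, and UTF-8 input is assumed. Script and style blocks are removed first because strip_tags drops the tags themselves but keeps the text between them:

```php
<?php
// Hypothetical helper: reduce an HTML fragment to one line of plain text.
function html_to_csv_text($html)
{
    // Drop <script>/<style> blocks including their contents.
    $text = preg_replace('#<(script|style)\b[^>]*>.*?</\1>#is', '', $html);
    // Put a space before each tag so adjacent words do not run together.
    $text = strip_tags(str_replace('<', ' <', $text));
    $text = html_entity_decode($text, ENT_QUOTES, 'UTF-8');
    // Collapse tabs, line breaks, non-breaking spaces and repeated spaces.
    $text = preg_replace('/[\s\x{A0}]+/u', ' ', $text);
    return trim($text);
}

echo html_to_csv_text("<p>Tom &amp;&nbsp;Jerry</p>\n<script>alert(1);</script>\t<b>Hi</b>"), "\n";
```

The result is a single line with no tags, tabs or line breaks, which can then be passed to fputcsv as one field.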
    LVL 1

    Author Comment

    We think the problem was with our ISP's configuration of PHP 4. The htmlentities code works really well on our PHP 5 server, though, and will come in very handy!


