regular expression replacements

Posted on 2005-04-24
Last Modified: 2008-01-09
I have an HTML document in a string, and I want to remove some tags from it. There are two basic cases:
1. Remove a simple tag. Examples:
   example 1.1 <sometag1 attribute="value"> needs to be removed.
   example 1.2 </sometag1> needs to be removed (if it exists).
   example 1.3 <sometag /> needs to be removed.
2. Remove tags and everything between them. Example:
   example 2.1 <sometag2>blah dont worry there are no sometag2s here blah</sometag2> needs to be removed entirely.

In this case all instances of sometag1 and sometag2 can be removed, although it would be better to have a solution that removes only those that are between the HEAD tags.
Question by: alberthendriks
    LVL 49

    Expert Comment

    To remove only elements between <head> and </head>, use strpos to find both tags and take the slice between them with substr. Run the replacements for the <sometag>s on that slice, then put the modified slice back in place of the original:


    $head1 = strpos(strtolower($text, '<head'));
    $head2 = strpos(strtolower($text, '</head'));

    $slice = substr($text, $head1, $head2 - $head1);

    $slice = preg_replace('/<sometag [^>]+>/i','',$slice);
    $slice = preg_replace('/<[^>]+ sometag>/i','',$slice);

    $new = substr($text, 0, $head1).$slice.substr($text, $head2);
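
A minimal sketch of the slicing step as a reusable function, assuming the document contains a single head section. The function name `replace_in_head` and its callback parameter are hypothetical and do not come from the post above. It uses `stripos` in place of `strpos(strtolower(...))` and returns the text unchanged when either tag is missing.

```php
<?php
// Hypothetical helper: apply $edit only to the part of $text
// between "<head" and "</head" (case-insensitive).
function replace_in_head(string $text, callable $edit): string
{
    $start = stripos($text, '<head');
    // Look for the closing tag only after the opening one.
    $end = ($start === false) ? false : stripos($text, '</head', $start);
    if ($start === false || $end === false) {
        return $text; // no head section found: leave the input untouched
    }
    $slice = substr($text, $start, $end - $start);
    return substr($text, 0, $start) . $edit($slice) . substr($text, $end);
}

$html = '<html><head><title>t</title><sometag1 a="b"></head><body><sometag1 a="b">x</body></html>';
$result = replace_in_head($html, function (string $slice): string {
    // Remove opening <sometag1 ...> tags inside the head only.
    return preg_replace('/<sometag1\b[^>]*>/i', '', $slice);
});
echo $result, "\n";
```

The `sometag1` in the body is left alone because the callback only ever sees the head slice. Note that `'<head'` would also match a `<header` tag if one appeared before the real head.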

    LVL 49

    Expert Comment

    Apart from a typo in the $head1 and $head2 lines above (the closing parenthesis of strtolower() was misplaced), the approach works. Here is a complete example:

    $text = '
    <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "">
    <html>
    <head>
      <meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
      <title>PHP: regular expression replacements</title>
      <link href="/images/ee.ico" rel="shortcut icon">
      <link href="/scripts/ee.6.css" rel="stylesheet" type="text/css">
      <link href="/scripts/eeExpert.css" rel="stylesheet" type="text/css">
    <script src="/scripts/eeSubs.1.js" type="text/javascript"></script>
    <meta name="description" content="I have a html document in a string, and I want to remove some tags from it. There are two basic cases 1. remove a simple tag. examples: example 1.1 <sometag1 attribute= value > needs to be removed....">
    </head>
    </html>';


    $head1 = strpos(strtolower($text), '<head');
    $head2 = strpos(strtolower($text), '</head');

    $slice = substr($text, $head1, $head2 - $head1);

    $slice = preg_replace('/<link [^>]+>/i','',$slice);
    $slice = preg_replace('/<[^>]+ link>/i','',$slice);

    $new = substr($text, 0, $head1).$slice.substr($text, $head2);

    echo '<pre>'.htmlspecialchars($text).'</pre>';
    echo '<hr/>';
    echo '<pre>'.htmlspecialchars($new).'</pre>';

    LVL 3

    Expert Comment

    Use strip_tags.

    List the tags you still need to handle in its second argument (the allowed tags); strip_tags leaves those in place and strips all other tags.

    Then use regular expressions on what remains.

    Hope this helps.
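
For reference, a short sketch of how `strip_tags` behaves with an allow-list; the sample string is made up for illustration. The second argument names the tags to keep, and stripped tags lose their markup but not the text between them, so on its own this does not cover case 2 of the question.

```php
<?php
$sample = '<p>Hello <b>world</b> <sometag2>inner text</sometag2></p>';
// The second argument lists the tags to KEEP; every other tag is stripped,
// but the text inside stripped tags stays in the output.
$kept = strip_tags($sample, '<p><b>');
echo $kept, "\n";
```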
    LVL 2

    Author Comment

    Roonaan, what does the second $slice replacement do?
    $slice = preg_replace('/<[^>]+ link>/i','',$slice);

    Also, I don't see how <link>bla</link> is removed entirely (the second case in my description). Maybe you misinterpreted my question: the remark at the end applies to both cases.
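
On the first question, a quick check suggests that the pattern as written does not match a closing tag, because it requires a space before `link>`. The sketch below compares it with a pattern written specifically for closing tags; the variable names are only for illustration.

```php
<?php
$closing = '</link>';
// Pattern from the earlier post: needs "<", some characters, a space, then "link>".
$matchesOld = preg_match('/<[^>]+ link>/i', $closing);
// Pattern aimed at closing tags: "</link", optional whitespace, ">".
$matchesNew = preg_match('/<\/link\s*>/i', $closing);
var_dump($matchesOld, $matchesNew);
```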
    LVL 49

    Accepted Solution

    You are correct. My code only removes the tags, not the tag contents.

    The second $slice line was meant to remove </link> tags. However, if you also need to remove the tag contents, it is better to use the following two preg_replace statements instead of the ones I wrote earlier:

    $slice = preg_replace('/<link(.*)\/link>/i','',$slice);  //<link ...>......</link> tags with content
    $slice = preg_replace('/<link [^>]+>/i','',$slice);    //<link  ...> (all remaining tags or <link  />)
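
One caveat with the first pattern: `(.*)` is greedy and does not cross line breaks by default. Two paired tags on the same line are therefore removed together with everything between them, and a pair that spans several lines is not matched at all. Below is a minimal sketch of a variant that uses a lazy quantifier and the `s` modifier. The function name `remove_tag` and its parameters are hypothetical, and it assumes the tag is never nested inside itself and is not also used in self-closing form when `$withContent` is true.

```php
<?php
// Hypothetical helper covering both cases from the question for one tag name.
function remove_tag(string $html, string $tag, bool $withContent): string
{
    $t = preg_quote($tag, '/');
    if ($withContent) {
        // Case 2: opening tag, shortest possible content (may span lines), closing tag.
        $html = preg_replace('/<' . $t . '\b[^>]*>.*?<\/' . $t . '\s*>/is', '', $html);
    }
    // Case 1: any remaining opening, self-closing, or closing tag.
    return preg_replace('/<\/?' . $t . '\b[^>]*>/i', '', $html);
}

$head = "<head><sometag2>a\nb</sometag2><title>keep</title><sometag2>c</sometag2>"
      . "<sometag1 x=\"y\"></sometag1><sometag1 /></head>";
$clean = remove_tag(remove_tag($head, 'sometag2', true), 'sometag1', false);
echo $clean, "\n";
```

The `\b` after the tag name keeps the pattern from also matching longer names that merely start with the same letters.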

    LVL 14

    Expert Comment

    No comment has been added to this question in more than 21 days, so it is now classified as abandoned.
    I will leave the following recommendation for this question in the Cleanup topic area:
    Accept: Roonaan {http:#13854116}

    Any objections should be posted here in the next 4 days. After that time, the question will be closed.

    EE Cleanup Volunteer
