alberthendriks

regular expression replacements

I have an HTML document in a string, and I want to remove some tags from it. There are two basic cases:
1. Remove a simple tag. Examples:
   example 1.1  <sometag1 attribute="value"> needs to be removed.
   example 1.2 </sometag1> needs to be removed (if it exists)
   example 1.3 <sometag /> needs to be removed
2. Remove tags and everything between them. Example:
   example 2.1 <sometag2>blah dont worry there are no sometag2s here blah</sometag2> needs to be removed entirely.

In this case all instances of sometag1 and sometag2 can be removed, although it would be better to have a solution that removes only those that are between the HEAD tags.
Roonaan

To remove only elements between <head> and </head>, use strpos to find both tags and take that slice with substr. Run the <sometag> replacements on the slice, then put the modified slice back in place of the original one:

<?php

// strtolower() wraps only $text, so the search is case-insensitive
$head1 = strpos(strtolower($text), '<head');
$head2 = strpos(strtolower($text), '</head');

$slice = substr($text, $head1, $head2 - $head1);

$slice = preg_replace('/<sometag [^>]+>/i','',$slice);
$slice = preg_replace('/<[^>]+ sometag>/i','',$slice);

$new = substr($text, 0, $head1).$slice.substr($text, $head2);
?>

-r-
Note that in the $head1 = and $head2 = lines, strtolower() must wrap only $text: strpos(strtolower($text), '<head'). I also wrote you a complete example:

<?php
$text = '
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd"> 
<html>
<head>
  <meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
  <title>PHP: regular expression replacements</title>
  <link href="/images/ee.ico" rel="shortcut icon">
  <link href="/scripts/ee.6.css" rel="stylesheet" type="text/css">
  <link href="/scripts/eeExpert.css" rel="stylesheet" type="text/css">
<script src="/scripts/eeSubs.1.js" type="text/javascript"></script>
<meta name="description" content="I have a html document in a string, and I want to remove some tags from it. There are two basic cases 1. remove a simple tag. examples: example 1.1 <sometag1 attribute= value > needs to be removed....">
</head>
<body>

</body>
</html>
';

$head1 = strpos(strtolower($text), '<head');
$head2 = strpos(strtolower($text), '</head');

$slice = substr($text, $head1, $head2 - $head1);

$slice = preg_replace('/<link [^>]+>/i','',$slice);
$slice = preg_replace('/<[^>]+ link>/i','',$slice);

$new = substr($text, 0, $head1).$slice.substr($text, $head2);

echo '<pre>'.htmlspecialchars($text).'</pre>';
echo '<hr/>';
echo '<pre>'.htmlspecialchars($new).'</pre>';
?>

-r-
designbai

Use strip_tags.

Pass the tags you want to keep as the second argument to strip_tags; it strips every other tag.

Then use a regular expression for the rest.

Hope this helps.
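
For illustration, a small sketch of how strip_tags behaves with a list of allowed tags. The sample string and variable names are made up for this example and are not from the question. The text inside a stripped tag is left behind, so the second case in the question (removing a tag together with its content) would still need a regular expression.

```php
<?php
// Hypothetical sample input, not from the original document.
$head = '<title>Demo</title><sometag2>hidden text</sometag2><meta name="a" content="b">';

// The second argument lists the tags to keep; all other tags are stripped,
// but the text between the stripped tags stays in place.
$kept = strip_tags($head, '<title><meta>');

echo htmlspecialchars($kept), "\n";
```

Because the list names the tags to keep rather than the tags to remove, this only fits when every tag that should survive is known in advance.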
alberthendriks

ASKER

Roonaan, what does the 2nd $slice replacement do?
$slice = preg_replace('/<[^>]+ link>/i','',$slice);

Also, I don't see how <link>bla</link> would be removed entirely (the 2nd case in my description). Maybe you misinterpreted my question: the remark at the end applies to both cases.
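
For reference, the pattern /<[^>]+ link>/i only matches a tag whose text ends in " link>" (a space followed by the tag name), so it does not match </link>. Below is a minimal sketch of one way to cover both cases from the question inside the head section only. It is not the accepted solution from this thread; the function name clean_head and its parameters are hypothetical, chosen for this illustration.

```php
<?php
// Hypothetical helper, not from the thread.
// $unwrap: tag names whose tags are removed, content kept (case 1).
// $drop:   tag names removed together with everything between them (case 2).
function clean_head(string $html, array $unwrap, array $drop): string
{
    $result = preg_replace_callback(
        // First <head ...>...</head> block; \b stops <header> from matching.
        '~<head\b[^>]*>.*?</head\s*>~is',
        function (array $m) use ($unwrap, $drop): string {
            $head = $m[0];
            foreach ($drop as $tag) {
                $t = preg_quote($tag, '~');
                // Case 2: opening tag, non-greedy content, closing tag.
                $head = preg_replace('~<' . $t . '\b[^>]*>.*?</' . $t . '\s*>~is', '', $head);
            }
            foreach ($unwrap as $tag) {
                $t = preg_quote($tag, '~');
                // Case 1: <tag ...>, </tag> and <tag /> in one pattern.
                $head = preg_replace('~</?' . $t . '\b[^>]*>~i', '', $head);
            }
            return $head;
        },
        $html,
        1 // only touch the first <head> block
    );

    // preg_replace_callback returns null on a PCRE error; fall back to the input.
    return $result ?? $html;
}

// Usage with made-up input: links are removed and scripts are dropped in <head> only.
$html = '<html><head><title>T</title><link href="a.css" rel="stylesheet">'
      . '<script src="x.js">var a;</script></head>'
      . '<body><link href="b.css"><script>keep()</script></body></html>';
$out = clean_head($html, ['link'], ['script']);
echo htmlspecialchars($out), "\n";
```

As with any regex approach, this cannot tell a real tag from tag-like text inside an attribute value. The meta description in Roonaan's sample $text contains <sometag1 attribute= value >, and that text would be removed as well if sometag1 were in the list.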
ASKER CERTIFIED SOLUTION
Roonaan

This solution is only available to members.
No comment has been added to this question in more than 21 days, so it is now classified as abandoned.
I will leave the following recommendation for this question in the Cleanup topic area:
Accept: Roonaan {http:#13854116}

Any objections should be posted here in the next 4 days. After that time, the question will be closed.

Huji
EE Cleanup Volunteer