
regular expression replacements

I have an HTML document in a string, and I want to remove some tags from it. There are two basic cases:
1. Remove a simple tag. Examples:
example 1.1 <sometag1 attribute="value"> needs to be removed.
example 1.2 </sometag1> needs to be removed (if it exists)
example 1.3 <sometag /> needs to be removed
2. Remove tags and everything between them. Example:
example 2.1 <sometag2>blah dont worry there are no sometag2s here blah</sometag2> needs to be removed entirely.

In this case all instances of sometag1 and sometag2 can be removed, although it would be better to have a solution that removes only those between the HEAD tags.

Roonaan

To remove only elements between <head> and </head>, use strpos to find both tags and cut out that slice with substr. Run the <sometag> replacements on the slice, then put the new slice back in place of the original:

<?php

$head1 = strpos(strtolower($text, '<head'));
$head2 = strpos(strtolower($text, '</head'));

$slice = substr($text, $head1, $head2 - $head1);

$slice = preg_replace('/<sometag [^>]+>/i','',$slice);
$slice = preg_replace('/<[^>]+ sometag>/i','',$slice);

$new = substr($text, 0, $head1).$slice.substr($text, $head2);
?>

-r-

Roonaan

Apart from a typo in the $head1 and $head2 lines above (the closing parenthesis of strtolower() was misplaced), here is a complete example:

<?php
$text = '
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd">
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
<title>PHP: regular expression replacements</title>
<link href="/images/ee.ico" rel="shortcut icon">
<link href="/scripts/ee.6.css" rel="stylesheet" type="text/css">
<link href="/scripts/eeExpert.css" rel="stylesheet" type="text/css">
<script src="/scripts/eeSubs.1.js" type="text/javascript"></script>
<meta name="description" content="I have a html document in a string, and I want to remove some tags from it. There are two basic cases 1. remove a simple tag. examples: example 1.1 <sometag1 attribute= value > needs to be removed....">
</head>
<body>

</body>
</html>
';

$head1 = strpos(strtolower($text), '<head');
$head2 = strpos(strtolower($text), '</head');

$slice = substr($text, $head1, $head2 - $head1);

$slice = preg_replace('/<link [^>]+>/i','',$slice);
$slice = preg_replace('/<[^>]+ link>/i','',$slice);

$new = substr($text, 0, $head1).$slice.substr($text, $head2);

echo '<pre>'.htmlspecialchars($text).'</pre>';
echo '<hr/>';
echo '<pre>'.htmlspecialchars($new).'</pre>';
?>

-r-
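
The slicing step above assumes that both <head and </head are present. As a minimal sketch (assuming PHP 7.1 or later; the helper name head_slice_bounds is hypothetical and not from the posts above), the same lookup can be written with stripos, which is case-insensitive, plus a guard for a missing tag:

```php
<?php
// Hypothetical helper: returns [start, length] of the region that runs from
// '<head' up to (but not including) '</head', or null if either is missing.
function head_slice_bounds(string $text): ?array
{
    // stripos is case-insensitive, so strtolower() is not needed.
    $start = stripos($text, '<head');
    if ($start === false) {
        return null;
    }
    // Search for the closing tag only after the opening one.
    $end = stripos($text, '</head', $start);
    if ($end === false) {
        return null;
    }
    return [$start, $end - $start];
}

$text = '<html><HEAD><title>x</title></HEAD><body></body></html>';
$bounds = head_slice_bounds($text);
if ($bounds !== null) {
    $slice = substr($text, $bounds[0], $bounds[1]);
    echo $slice;
}
```

Note that '<head' would also match '<header', so this sketch relies on the real head tag appearing first in the document.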

designbai

Use strip_tags.

Pass the tags you need to keep as the allowable-tags argument of strip_tags; it strips every other tag.

Then use a regular expression for whatever remains.

Hope this helps.
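
A small sketch of what strip_tags does with the tags from the question may help here. The sample string is made up for illustration. The point it shows is that strip_tags removes the tags themselves but leaves the text between them, so it covers case 1 but not case 2:

```php
<?php
// Made-up sample input using the tag names from the question.
$head = '<title>Demo</title><sometag2>blah</sometag2><sometag1 attribute="value">';

// The second argument lists the tags to KEEP. Every other tag is stripped,
// but the text between stripped tags stays in the result.
$kept = strip_tags($head, '<title>');

echo $kept; // <title>Demo</title>blah
```

The leftover "blah" is why a regular expression is still needed for tags that must be removed together with their content.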

alberthendriks

ASKER

Roonaan, what does the second preg_replace on $slice do?
$slice = preg_replace('/<[^>]+ link>/i','',$slice);

Also, I don't see how <link>bla</link> would be removed entirely (the second case in my description). Maybe you misinterpreted my question: the remark at the end applies to both cases.
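
For reference, the pattern quoted above only matches a tag whose text ends in a space followed by link>, so it does not match </link>. The following is an illustrative sketch of how both cases could be handled inside the head only. It is not code from any of the posters, it assumes PHP 7 or later, and the function name strip_head_tags and its parameters are hypothetical:

```php
<?php
// Hypothetical helper: strips tags only inside <head>...</head>.
// $unwrap: tag names whose opening, closing and self-closing tags are dropped (case 1).
// $purge:  tag names that are removed together with their content (case 2).
function strip_head_tags(string $html, array $unwrap, array $purge): string
{
    $result = preg_replace_callback(
        '~<head\b[^>]*>.*?</head\s*>~is',
        function (array $m) use ($unwrap, $purge): string {
            $head = $m[0];
            foreach ($purge as $tag) {
                $t = preg_quote($tag, '~');
                // Case 2: opening tag, shortest possible content, closing tag.
                $head = preg_replace("~<{$t}\\b[^>]*>.*?</{$t}\\s*>~is", '', $head) ?? $head;
            }
            foreach ($unwrap as $tag) {
                $t = preg_quote($tag, '~');
                // Case 1: matches <tag ...>, </tag> and <tag />.
                $head = preg_replace("~</?{$t}\\b[^>]*>~i", '', $head) ?? $head;
            }
            return $head;
        },
        $html
    );
    // preg_replace_callback returns null on a regex error; fall back to the input.
    return $result ?? $html;
}

$in = '<html><head><title>t</title><link href="a.css" rel="stylesheet">'
    . '<script src="x.js">var a;</script></head>'
    . '<body><link href="b.css"><script>keep()</script></body></html>';

echo htmlspecialchars(strip_head_tags($in, ['link'], ['script']));
```

Regular expressions do not really parse HTML. In particular, [^>]* stops at the first >, so a > inside an attribute value (as in the meta description of the example document above) will confuse these patterns.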

ASKER CERTIFIED SOLUTION

Roonaan

huji

No comment has been added to this question in more than 21 days, so it is now classified as abandoned.
I will leave the following recommendation for this question in the Cleanup topic area:
Accept: Roonaan {http:#13854116}

Any objections should be posted here in the next 4 days. After that time, the question will be closed.

Huji
EE Cleanup Volunteer