regular expression replacements
Asked by alberthendriks
I have an HTML document in a string, and I want to remove some tags from it. There are two basic cases:
1. Remove a simple tag. Examples:
example 1.1 <sometag1 attribute="value"> needs to be removed.
example 1.2 </sometag1> needs to be removed (if it exists).
example 1.3 <sometag /> needs to be removed.
2. Remove a tag and everything between its opening and closing tags. Example:
example 2.1 <sometag2>blah dont worry there are no sometag2s here blah</sometag2> needs to be removed entirely.
In this case all instances of sometag1 and sometag2 can be removed, although it would be better to have a solution that removes only those that are between the HEAD tags.
Apart from fixing a typo in the $head1 = and $head2 = lines, I have also written you a full example:
<?php
// Sample document to test against.
$text = '
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd">
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
<title>PHP: regular expression replacements</title>
<link href="/images/ee.ico" rel="shortcut icon">
<link href="/scripts/ee.6.css" rel="stylesheet" type="text/css">
<link href="/scripts/eeExpert.css" rel="stylesheet" type="text/css">
<script src="/scripts/eeSubs.1.js" type="text/javascript"></script>
<meta name="description" content="I have a html document in a string, and I want to remove some tags from it. There are two basic cases 1. remove a simple tag. examples: example 1.1 <sometag1 attribute= value > needs to be removed....">
</head>
<body>
</body>
</html>
';
// Find where the head section starts and ends (case-insensitive).
$head1 = strpos(strtolower($text), '<head');
$head2 = strpos(strtolower($text), '</head');
// Work only on the part between <head and </head.
$slice = substr($text, $head1, $head2 - $head1);
$slice = preg_replace('/<link [^>]+>/i','',$slice);
$slice = preg_replace('/<[^>]+ link>/i','',$slice);
// Put the cleaned head section back between the untouched parts.
$new = substr($text, 0, $head1).$slice.substr($text, $head2);
// Show the original and the result for comparison.
echo '<pre>'.htmlspecialchars($text).'</pre>';
echo '<hr/>';
echo '<pre>'.htmlspecialchars($new).'</pre>';
?>
-r-
Use strip_tags.
List the tags you still need to work on as the allowed tags in strip_tags; it leaves those in place and strips all other tags.
Then use a regular expression to handle the tags that remain.
Hope this helps.
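As a rough illustration of the strip_tags suggestion above, here is a minimal sketch. The sample string and the choice of allowed tags are made up for the example; only the tag name sometag1 comes from the question.

```php
<?php
// Hypothetical sample input; only the tag name sometag1 is taken from the question.
$html = '<p>Keep <b>this</b> and <sometag1 attribute="value">that</sometag1>.</p>';

// The second argument lists the tags to KEEP. Every other tag is stripped,
// but the text between the stripped tags is left in place.
$clean = strip_tags($html, '<p><b>');

echo $clean, "\n";
```

Because strip_tags keeps the text between the tags it strips, it only covers the first case from the question. It does not remove a tag together with its content (the second case), and it works on the whole string rather than on the HEAD section alone.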
ASKER
Roonaan, what does the 2nd slice do?
$slice = preg_replace('/<[^>]+ link>/i','',$slice);
Also, I don't see how <link>bla</link> would be removed entirely (the 2nd case in my description). Maybe you misinterpreted my question: the remark at the end applies to both cases.
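For reference, the pattern /<[^>]+ link>/i matches a "<", one or more characters other than ">", a space, and then "link>". It therefore does not match a plain closing tag such as </link>. The sketch below is one possible way to cover both cases from the question inside the HEAD section only. It is not the accepted solution from this thread: removeTagsInHead() is a hypothetical helper, the sample input is made up, and PHP 7 or later is assumed for the type declarations.

```php
<?php
// Sketch only: removeTagsInHead() is a hypothetical helper, not code from the thread.
function removeTagsInHead(string $html, array $simpleTags, array $blockTags): string
{
    // Locate the head section case-insensitively.
    $start = stripos($html, '<head');
    $end   = stripos($html, '</head');
    if ($start === false || $end === false || $end < $start) {
        return $html; // no usable head section, leave the input alone
    }

    $slice = substr($html, $start, $end - $start);

    // Case 2: remove the element together with everything between its tags.
    foreach ($blockTags as $tag) {
        $t = preg_quote($tag, '/');
        $slice = preg_replace('/<' . $t . '\b[^>]*>.*?<\/' . $t . '\s*>/is', '', $slice);
    }

    // Case 1: remove opening, closing and self-closing tags, keep any content.
    foreach ($simpleTags as $tag) {
        $t = preg_quote($tag, '/');
        $slice = preg_replace('/<\/?' . $t . '\b[^>]*>/i', '', $slice);
    }

    // Reassemble: text before the head, cleaned head, text from </head onward.
    return substr($html, 0, $start) . $slice . substr($html, $end);
}

// Made-up input: sometag1 and sometag2 appear in the head, sometag1 also in the body.
$html = '<html><head><sometag1 attribute="value"><title>T</title></sometag1>'
      . '<sometag2>blah</sometag2><sometag1 /></head>'
      . '<body><sometag1>stays</sometag1></body></html>';

echo removeTagsInHead($html, ['sometag1'], ['sometag2']), "\n";
```

The non-greedy .*? relies on the question's guarantee that sometag2 elements are not nested. Regular expressions are fragile on arbitrary HTML, so for less predictable input a real parser such as DOMDocument may be the safer choice.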
No comment has been added to this question in more than 21 days, so it is now classified as abandoned.
I will leave the following recommendation for this question in the Cleanup topic area:
Accept: Roonaan {http:#13854116}
Any objections should be posted here in the next 4 days. After that time, the question will be closed.
Huji
EE Cleanup Volunteer
<?php
$head1 = strpos(strtolower($text), '<head');
$head2 = strpos(strtolower($text), '</head');
$slice = substr($text, $head1, $head2 - $head1);
$slice = preg_replace('/<sometag [^>]+>/i','',$slice);
$slice = preg_replace('/<[^>]+ sometag>/i','',$slice);
$new = substr($text, 0, $head1).$slice.substr($text, $head2);
?>
-r-