• Status: Solved
  • Priority: Medium
  • Security: Public

Removing HTML tags

Hi! I need to remove all HTML tags from an HTML document except for the <body> tag. I've decided to use the tag stripper function available in PHP. It removes the tags, but I have to replace the text located between the tags with a whitespace character. Can anyone help me come up with a regular expression that would do this? The only text I want to retain is the text inside the <body> tags. =)
Asked by: inamorata
2 Solutions
 
Tintin commented:
If you want to do stuff with HTML properly, then use an HTML parser.

If you can 100% guarantee the format of the HTML, then you can get away with using a regex.

Suggest you look at:

http://search.cpan.org/author/OVID/HTML-TokeParser-Simple-1.4/Simple.pm
http://search.cpan.org/author/GAAS/HTML-Parser-3.27/Parser.pm
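
As a rough illustration of the parser route, here is a minimal sketch using HTML::TokeParser, which ships with the HTML-Parser distribution linked above. It assumes that module is installed (it is not part of core Perl), and it takes one reading of the question: keep only the text inside <body> and put a single space where each tag was. The helper name body_text is made up for this example.

```perl
use strict;
use warnings;

# HTML::TokeParser comes from CPAN, not core Perl, so load it at run time
# and report clearly if it is missing.
my $have_parser = eval { require HTML::TokeParser; 1 };

# body_text is a hypothetical helper: it returns the text found between
# <body> and </body>, with every tag replaced by a single space.
sub body_text {
    my ($html) = @_;
    my $p = HTML::TokeParser->new(\$html) or die "cannot parse document";
    my ($in_body, $text) = (0, '');
    while (my $token = $p->get_token) {
        my ($type, $tag) = @$token;    # $tag is only meaningful for S and E tokens
        if    ($type eq 'S' && $tag eq 'body') { $in_body = 1 }
        elsif ($type eq 'E' && $tag eq 'body') { last }
        elsif (!$in_body)                      { next }
        elsif ($type eq 'T')                   { $text .= $token->[1] }  # raw text
        elsif ($type eq 'S' || $type eq 'E')   { $text .= ' ' }          # tag -> space
    }
    $text =~ s/\s+/ /g;     # collapse runs of whitespace
    $text =~ s/^ | $//g;    # trim the ends
    return $text;
}

if ($have_parser) {
    my $html = '<html><head><title>t</title></head>'
             . '<body class="x"><p>Hello <b>big</b> world</p></body></html>';
    print body_text($html), "\n";
}
else {
    warn "HTML::TokeParser is not installed\n";
}
```

Comments and declarations inside the body are dropped because only text tokens are kept. The text is returned as it appears in the source, so entities such as &amp; are not decoded in this sketch.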
 
wilcoxon commented:
Something like this should work to pare the file down to only the data between the body tags.

open my $in, '<', $file or die "could not open $file: $!";
my $data = do { local $/; <$in> };   # read the whole file into one string
close $in;

$data =~ s/^.*<body[^>]*>//si; # remove up to the body tag
$data =~ s%</body>.*$%%si; # remove /body tag to EOF

Unless you can guarantee the format of the file within the body tags, it is VERY hard to remove html tags and leave the plain text using regexes.
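
To make that warning concrete, here is a small sketch of two inputs that trip up a simple tag-stripping regex. The function name strip_tags_naive and both sample strings are invented for this illustration. The pattern is the common "anything between < and >" form, not any particular library's implementation.

```perl
use strict;
use warnings;

# A naive tag stripper: replace anything that looks like <...> with a space.
sub strip_tags_naive {
    my ($html) = @_;
    $html =~ s/<[^>]*>/ /g;
    return $html;
}

# Case 1: a '>' inside a quoted attribute value ends the match too early,
# so the rest of the tag is left behind as if it were text.
print strip_tags_naive('<img alt="a > b" src="x.png">text'), "\n";

# Case 2: markup inside a comment is treated as real tags, so the commented-out
# word "hidden" and the comment terminator both leak into the output.
print strip_tags_naive('<!-- <b>hidden</b> -->shown'), "\n";
```

Both lines of output contain leftover markup fragments. Well-formed, predictable input avoids these cases, which is the guarantee mentioned above.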
 
ironlady commented:
Hi,

One thing you can do is read the whole contents of the file into a scalar variable, as shown below, and then remove everything before and after the body tags. The first substitution removes everything up to and including the opening body tag, and the second removes the closing body tag and everything after it, so you are left with only the text inside the body.

undef $/;                   # slurp mode: read the whole file at once
$_ = <PAGE>;
s|^.*?<body\b.*?>||si;      # /s lets . match newlines
s|^(.*)</body\b.*|$1|si;

If you still want to remove the HTML tags within the body, you can use this command:

s|<.*?>||gs;

Then you can print the result with

print $_;

or just

print;
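
Putting these steps together, a self-contained sketch might look like the following. It follows one reading of the original question, namely that each removed tag should leave a single space behind, and it assumes well-formed input of the kind discussed earlier in the thread. The helper name body_to_text and the sample page are made up for the example.

```perl
use strict;
use warnings;

# body_to_text is a hypothetical helper combining the substitutions above.
sub body_to_text {
    my ($html) = @_;
    $html =~ s/^.*?<body\b[^>]*>//si;   # drop everything up to and including <body ...>
    $html =~ s/<\/body\b.*$//si;        # drop </body> and everything after it
    $html =~ s/<[^>]*>/ /g;             # replace each remaining tag with one space
    $html =~ s/\s+/ /g;                 # collapse runs of whitespace
    $html =~ s/^ | $//g;                # trim the ends
    return $html;
}

my $page = <<'HTML';
<html>
<head><title>t</title></head>
<body bgcolor="white">
<h1>Title</h1>
<p>Some <i>text</i>.</p>
</body>
</html>
HTML

print body_to_text($page), "\n";
```

For this sample the result is "Title Some text ." with a space before the full stop, because the closing </i> tag was replaced by a space. That is the trade-off of replacing tags with whitespace instead of deleting them.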

if any questions please post as comment

thanks,
tie
 
jmcg (Owner) commented:
Nothing has happened on this question in over 8 months. It's time for cleanup!

My recommendation, which I will post in the Cleanup topic area, is to
split points between Tintin [40 pts] and wilcoxon [30 pts]

Please post any comments here within the next seven days. Moderators check comments here before acting on the recommendation. Experts: silence will likely be taken as assent.

PLEASE DO NOT ACCEPT THIS COMMENT AS AN ANSWER!

jmcg
EE Cleanup Volunteer