Solved

removing html tags

Posted on 2003-02-21
Last Modified: 2010-03-05
Hi! I need to remove all HTML tags from an HTML document except for the <body> tag. I've decided to use the tag stripper function available in PHP. It removes the tags, but I have to replace the text located between the tags with a whitespace character. Can anyone help me come up with a regular expression which would do this? The only text I want to retain is the text inside the <body> tags. =)
Question by:inamorata
5 Comments
 

Accepted Solution

by:Tintin
Tintin earned 160 total points
ID: 7997917
If you want to do stuff with HTML properly, then use an HTML parser.

If you can 100% guarantee the format of the HTML, then you can get away with using a regex.

Suggest you look at:

http://search.cpan.org/author/OVID/HTML-TokeParser-Simple-1.4/Simple.pm
http://search.cpan.org/author/GAAS/HTML-Parser-3.27/Parser.pm
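For illustration, here is a minimal sketch of the parser approach. It uses HTML::TokeParser, which as far as I know ships in the same HTML-Parser distribution as the second link; that choice is an assumption, since the links above only name HTML::TokeParser::Simple and HTML::Parser. The sub name body_text and the sample document are made up for the example. Every tag or comment inside the body is replaced with a single space.

```perl
use strict;
use warnings;
use HTML::TokeParser;   # from the HTML-Parser distribution on CPAN (not core Perl)

# Hypothetical helper: return the text inside <body>, with every tag
# or comment replaced by a single space.
sub body_text {
    my ($html) = @_;
    my $parser = HTML::TokeParser->new(\$html)
        or die "could not create parser";

    # Skip ahead to the opening <body> tag; give up if there is none.
    $parser->get_tag('body') or return '';

    my $text = '';
    while (my $token = $parser->get_token) {
        my $type = $token->[0];
        last if $type eq 'E' && $token->[1] eq 'body';   # reached </body>
        $text .= $type eq 'T' ? $token->[1] : ' ';       # keep text, blank out the rest
    }
    return $text;
}

my $html = '<html><head><title>t</title></head>'
         . '<body bgcolor="white"><p>Hello <b>world</b></p></body></html>';
print body_text($html), "\n";
```

Note that in this sketch entities such as &amp; are left as they appear in the source, and the contents of <script> or <style> elements would come through as text, so a real script would probably want to handle both.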

Assisted Solution

by:wilcoxon
wilcoxon earned 140 total points
ID: 8009814
Something like this should work to pare the file down to only the data between the body tags.

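# Parenthesise open() so that "or die" applies to the open, not to $file.
open(my $in, '<', $file) or die "could not open $file: $!";
my $data = join '', <$in>; # read the whole file into one string
close $in;

$data =~ s/^.*?<body\b[^>]*>//si; # remove everything up to and including the body tag
$data =~ s%</body\s*>.*$%%si; # remove the /body tag through to EOF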

Unless you can guarantee the format of the file within the body tags, it is VERY hard to remove html tags and leave the plain text using regexes.
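To make both points concrete, here is a small self-contained sketch that runs on an in-memory string instead of a file. The sub name body_only and the sample strings are invented for the example. The first part trims a page down to its body; the second shows one way a naive tag-stripping regex goes wrong, namely a '>' inside an attribute value.

```perl
use strict;
use warnings;

# Hypothetical helper: keep only what sits between <body ...> and </body>.
sub body_only {
    my ($data) = @_;
    $data =~ s/^.*?<body\b[^>]*>//si;   # drop everything through the opening tag
    $data =~ s{</body\s*>.*$}{}si;      # drop the closing tag through end of input
    return $data;
}

my $page = "<html><head><title>x</title></head>\n"
         . "<body class=\"main\">\n<p>kept</p>\n</body>\n</html>\n";
my $body = body_only($page);
print $body;

# Why tag stripping inside the body is fragile: a '>' inside an attribute
# value ends the match early and leaves debris behind.
my $tricky = '<img alt="a > b">caption';
(my $stripped = $tricky) =~ s/<[^>]*>//g;
print "$stripped\n";   # prints: b">caption (with a leading space)
```

Comments, scripts, and unclosed tags cause similar trouble, which is why the parser route is usually the safer one for anything beyond tightly controlled input.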
Expert Comment

by:ironlady
ID: 8046100
hi,

One thing you might do is copy the whole contents of the file into a scalar variable, as given below, then remove everything which is before and after the body tags.
The first substitution removes everything before the body tag and the second removes everything after it, so you are left with only the text inside the body.

{ local $/; $_ = <PAGE>; } # slurp the whole file into $_
s|^.*?<body\b.*?>||is; # /s lets . match newlines, so multi-line files work
s|^(.*)</body\b.*|$1|is;

If you still want to remove the HTML tags within the body, you could use this command:

s|<.*?>||gs; # /s so that tags spanning several lines are matched too

Then you can use the print command,
like
print $_;
or just print;
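Since the original question asked for whitespace where the tags used to be, here is a hedged variation on the substitution above that puts a single space in place of each tag and then tidies the result. The sub name tags_to_spaces and the sample input are made up for the example, and it still assumes well-behaved markup with no '>' inside attribute values.

```perl
use strict;
use warnings;

# Hypothetical helper: swap each tag for one space, then tidy the result.
sub tags_to_spaces {
    my ($text) = @_;
    $text =~ s/<[^>]*>/ /g;     # [^>] also matches newlines, so no /s is needed
    $text =~ s/\s+/ /g;         # collapse runs of whitespace into one space
    $text =~ s/^ | $//g;        # trim one leading and one trailing space
    return $text;
}

print tags_to_spaces("<p>one</p><p>two\n<br\n/>three</p>"), "\n";   # prints: one two three
```

Replacing with a space instead of an empty string keeps words from adjacent elements from running together, for example "one" and "two" above.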

If you have any questions, please post them as a comment.

thanks,
tie

Expert Comment

by:jmcg
ID: 9692022
Nothing has happened on this question in over 8 months. It's time for cleanup!

My recommendation, which I will post in the Cleanup topic area, is to
split points between Tintin [40 pts] and wilcoxon [30 pts]

Please post any comments here within the next seven days. Moderators check comments here before acting on the recommendation. Experts: silence will likely be taken as assent.

PLEASE DO NOT ACCEPT THIS COMMENT AS AN ANSWER!

jmcg
EE Cleanup Volunteer