Solved

Extract contents of body tag and body from html code

Posted on 2004-08-23
Last Modified: 2013-12-12
I have the html code of the webpages as a variable.

I need to extract the body tag from this html code - so I need something like "<body background="images/1.gif" bgcolor="#0000FF">" returning.

I have been playing around with this:

$html="<html><head><title>Untitled</title></head><BODY bgcolor=\"#0000FF\">this is the page content</BODY></html>";
preg_match("/(<body.*?>.*?<\/body>)/i",$html, $matches);
$body = $matches[0];
preg_match("/(<body.*?>)/i",$body, $matches);
$body_tag = $matches[0];

But I don't think it's right. It needs to be case-insensitive, as some people use <BODY .......>

I then need to extract what is between the body tags, so in the above example it would return "this is the page content".

I have been playing with this - but as you can see it's not very good!

$content = str_replace($body_tag, "", $body);
$content = str_replace("</body>", "", $content);
$content = str_replace("</BODY>", "", $content);
$content = str_replace("</html>", "", $content);
$content = str_replace("</HTML>", "", $content);

Thanks for your help.



Question by:Nottingham
24 Comments
 
LVL 1

Expert Comment

by:oreomike
ID: 11868896
Could you try grouping the letters of body:
"/(<[Bb][Oo][Dd][Yy].*?>.*?<\/[Bb][Oo][Dd][Yy]>)/i" as your match string?

Or try the strloc functions.

PHP: http://us3.php.net/manual/en/function.preg-match.php
0
 
LVL 1

Expert Comment

by:oreomike
ID: 11868915
I meant stripos, as the following:

$bodystart = stripos($html,"<body");
$bodyend = strripos($html,"</body>");

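then a substr($html,$bodystart,$bodyend-$bodystart+strlen("</body>"));

The third argument of substr is a length, not an end position, hence the subtraction; adding strlen("</body>") keeps the closing tag in the result.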

Just a suggestion.
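
The position-based idea above can be sketched roughly as follows. This is only an illustration: the helper name extract_body_by_position is hypothetical (it does not come from the thread), and it assumes the opening tag contains no ">" inside an attribute value and that the last "</body>" in the string is the one that closes the body.

```php
<?php
// Hypothetical helper (not from the thread): returns [opening tag, content],
// or null when no complete <body ...> ... </body> pair is found.
function extract_body_by_position(string $html): ?array
{
    $start = stripos($html, '<body');       // case-insensitive, so <BODY and <BoDy match too
    if ($start === false) {
        return null;
    }
    $tagEnd = strpos($html, '>', $start);   // the ">" that ends the opening tag
    $close  = strripos($html, '</body>');   // the last closing tag in the string
    if ($tagEnd === false || $close === false || $close < $tagEnd) {
        return null;
    }
    // substr() takes a start offset and a LENGTH, so lengths are computed by subtraction.
    $tag     = substr($html, $start, $tagEnd - $start + 1);
    $content = substr($html, $tagEnd + 1, $close - $tagEnd - 1);
    return [$tag, $content];
}

$html = "<html><head><title>Untitled</title></head><BODY bgcolor=\"#0000FF\">this is the page content</BODY></html>";
[$body_tag, $body_content] = extract_body_by_position($html);
echo $body_tag, "\n";
echo $body_content, "\n";
```

Checking every position against false with === matters here, because a match at offset 0 is a valid result that a loose comparison would treat as "not found".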
0
 
LVL 2

Expert Comment

by:platineo
ID: 11869466
Try this simple way...

<?php
$html="<html><head><title>Untitled</title></head><BODY bgcolor=\"#0000FF\">this is the page content</BODY></html>";

$starttag="<body"; // assign the tags, you must leave it as "<body" because the tag might have other things in it!
$endtag="</body>"; // and the end tag!

$pos1=stripos($html,$starttag); //search the position (haystack first, then needle; stripos ignores case)
$pos2=stripos($html,$endtag); //search the other one!

if ($pos1!==false&&$pos2!==false){ //just in case ;) stripos returns false when nothing is found
$bodytag=substr($html,$pos1,$pos2+strlen($endtag)-$pos1);
//substr takes a start and a length: we start at $pos1 and run up to $pos2+length of the closing tag coz we want to include that too!
}
?>

If you think a "<" or ">" inside the content could be mistaken for part of a tag, you'd better write them as "&lt;" and "&gt;" in your content, coz that will keep the search consistent! Hope that helps!
0

 

Author Comment

by:Nottingham
ID: 11869606
Not sure if my original question was clear enough.

I need returned:

1. the opening body tag with its attributes, e.g. "<body background="images/1.gif" bgcolor="#0000FF">"

2. the content between the two body tags, e.g. if $html="<html><head><title>Untitled</title></head><BODY bgcolor=\"#0000FF\">this is the page content</BODY></html>"; then I need "this is the page content".

so the code I'm looking for is:

$html="<html><head><title>Untitled</title></head><BoDy bgcolor=\"#0000FF\">this is the page content</BodY></html>";

$body_tag=something_clever();

$body_content=something_smart();
0
 
LVL 9

Expert Comment

by:AlanJDM
ID: 11869653
You had it so close... regex needed a tweak...

$html="<html><head><title>Untitled</title></head><BODY bgcolor=\"#0000FF\">this is the page content</BODY></html>";
preg_match("/(<body.*>)(.*?)(<\/body>)/i",$html, $matches);

$body_tag = $matches[1];
$body = $matches[2];


Alan
0
 
LVL 9

Expert Comment

by:AlanJDM
ID: 11869666
Also, for what it's worth... $matches[3] will contain the closing body tag if needed.


Alan
0
 
LVL 1

Expert Comment

by:oreomike
ID: 11869774
Or:

$html="<html><head><title>Untitled</title></head><BODY bgcolor=\"#0000FF\">this is the page content</BODY></html>";
preg_match("/(<[Bb][Oo][Dd][Yy].*>)(.*?)(<\/[Bb][Oo][Dd][Yy]>)/i",$html, $matches);

$body_tag = $matches[1];
$body = $matches[2];
0
 
LVL 9

Expert Comment

by:AlanJDM
ID: 11869790
Why would you do that? There is no need to check for the upper and lower case chars because the "i" in the regex stands for "case insensitive".


Alan
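
As a small illustration of that point, the sketch below runs the same opening-tag pattern without the modifier, with "i", and with the bracketed character classes. The sample string and variable names are made up for the example, and "[^>]*" is used here instead of ".*" only to keep the match inside the opening tag.

```php
<?php
// Sample markup with mixed-case tags (made up for this example).
$html = '<html><BoDy bgcolor="#0000FF">this is the page content</BodY></html>';

// Without "i" the literal "body" must be lowercase, so nothing matches.
$without_i = preg_match('/<body[^>]*>/', $html);

// With "i" any mix of upper and lower case matches.
$with_i = preg_match('/<body[^>]*>/i', $html, $m);

// The character classes do the same job without "i", but are harder to read.
$classes = preg_match('/<[Bb][Oo][Dd][Yy][^>]*>/', $html, $m2);

var_dump($without_i, $with_i, $classes);
echo $m[0], "\n";
```

Using both the character classes and "i" together is harmless but redundant.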
0
 
LVL 1

Expert Comment

by:oreomike
ID: 11869794
You have to do that for people like me who don't know what the 'i' at the end of the string stands for.  ;-)
0
 

Author Comment

by:Nottingham
ID: 11869795
I was going to ask the same question.
0
 

Author Comment

by:Nottingham
ID: 11869843
I worked something out - my html code is pulled from a db. It must have line breaks in it or similar. The code does not work on it - for example, it won't work on this:

$html='<HTML><HEAD><TITLE>Homepage</TITLE><BODY background=bg.gif>
this is
content
</BODY></HTML>';

preg_match("/(<body.*>)(.*?)(<\/body>)/i",$html, $matches);

$body_tag = $matches[1];
$body = $matches[2];

echo $body_tag;
echo $body;

Is there a function I need to perform on the html_code first to remove all the line breaks?
0
 
LVL 9

Expert Comment

by:AlanJDM
ID: 11869894
The code I posted does work with line breaks, I just tested it. I even cut and pasted your example and ran it and it worked fine, so I don't think line breaks are your problem. However, if you think they are, you can try nl2br(), although note that it inserts <br /> tags before the newlines rather than removing them...

$html='<HTML><HEAD><TITLE>Homepage</TITLE><BODY background=bg.gif>
this is
content
</BODY></HTML>';

$html = nl2br($html);

preg_match("/(<body.*>)(.*?)(<\/body>)/i",$html, $matches);

$body_tag = $matches[1];
$body = $matches[2];


echo $body_tag;
echo $body;


Alan
0
 

Author Comment

by:Nottingham
ID: 11869909
Alan,

I'll test it again.

I'm sure it didn't work - but I also thought it should have done.

Cheers

Dan
0
 

Author Comment

by:Nottingham
ID: 11869940
It worked when I put the variable assignment all on one line - but again it did not when there were line breaks.

I don't actually want <br>'s in the html code where there is a newline - can I just remove them?
0
 
LVL 9

Expert Comment

by:AlanJDM
ID: 11869966
try this.....

$html='<HTML><HEAD><TITLE>Homepage</TITLE><BODY background=bg.gif>
this is
content
</BODY></HTML>';

$html = str_replace("\n","",$html);

preg_match("/(<body.*>)(.*?)(<\/body>)/i",$html, $matches);

$body_tag = $matches[1];
$body = $matches[2];

echo $body_tag;
echo $body;


Alan
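
One caveat about stripping newlines this way: str_replace("\n","",$html) leaves carriage returns in place, so text stored with Windows ("\r\n") or old Mac ("\r") line endings would still contain "\r". A minimal sketch that removes all three styles, using a made-up sample string:

```php
<?php
// Sample input with mixed line endings (an assumption for illustration).
$html = "<BODY background=bg.gif>\r\nthis is\ncontent\r</BODY>";

// str_replace() accepts an array of search strings; "\r\n" is listed first
// so that the pair is removed as a unit before the single characters.
$flat = str_replace(["\r\n", "\r", "\n"], '', $html);

echo $flat, "\n";
```

Note that removing line breaks outright joins the words on either side of them; replacing with a single space instead of an empty string avoids that if the text content matters.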
0
 

Author Comment

by:Nottingham
ID: 11870198
ok

I thought it worked but then I tried this:

<?
$html='
<HTML>
<HEAD>
<TITLE>HOMEPAGE</TITLE>
</HEAD>
<BODY background=bg.gif><b>this is content</b></BODY>
</HTML>';

$html = str_replace("\n","",$html);

preg_match("/(<body.*>)(.*?)(<\/body>)/i",$html, $matches);

$body_tag = $matches[1];
$body = $matches[2];

echo $body_tag;
//echo $body;
?>

Returned in the browser is:

<BODY background=bg.gif><b>this is content</b>

So it's not right?
0
 
LVL 9

Accepted Solution

by:
AlanJDM earned 800 total points
ID: 11870432
try this...

<?php

$html='
<HTML>
<HEAD>
<TITLE>Homepage</TITLE>
<BODY background=bg.gif><b>this is content</b></BODY>
</HTML>';

$html = str_replace("\n","",$html);

preg_match("/(<body.*>)(.*)(<\/body>)/iU",$html, $matches);

$body_tag = $matches[1];
$body = $matches[2];

echo $body_tag;
echo $body;

?>
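
For readers wondering why the "U" modifier changes the result, here is a hedged side-by-side sketch. The sample string is the test page from this thread with the newlines already removed; the third pattern, using "[^>]*", is an alternative that was not proposed in the thread.

```php
<?php
$html = '<HTML><HEAD><TITLE>Homepage</TITLE></HEAD><BODY background=bg.gif><b>this is content</b></BODY></HTML>';

// Greedy ".*" in the first group runs on to the LAST ">" that still lets
// the rest of the pattern match, so it swallows the content as well.
preg_match('/(<body.*>)(.*?)(<\/body>)/i', $html, $greedy);

// "U" inverts greediness: plain quantifiers become lazy, so the first
// group now stops at the first ">" and the second group gets the content.
preg_match('/(<body.*>)(.*)(<\/body>)/iU', $html, $ungreedy);

// Alternative without "U": "[^>]*" can never cross the ">" that ends the tag.
preg_match('/(<body[^>]*>)(.*?)(<\/body>)/i', $html, $negated);

echo $greedy[1], "\n";
echo $ungreedy[1], "\n";
echo $ungreedy[2], "\n";
```

The first echo reproduces the over-long "body tag" reported earlier in the thread; the other two show the tag and the content separated as intended.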
0
 

Author Comment

by:Nottingham
ID: 11870494
Alan,

Works like a treat!

Thanks for your help!!

Really appreciate the speed element as well!
0
 
LVL 3

Expert Comment

by:Boris Aranovich
ID: 11870533
How's that?

$html =<<<HTMLEOT
<html>
      <head><title>asdasda</title></head>
<body bgcolor="white" link="blue">
<p>Hello world!</p>
<div align="center"><strong>I AM WORLD!!!</strong></div>
</body>
</html>
HTMLEOT;

preg_match('/(<body.*>)(.*)<\\/body>/Usi', $html, $matches);
$body_tag = $matches[1];
$body = $matches[2];

echo "<u><b>Body tag:</b></u><br />\n<pre>";
echo htmlspecialchars($body_tag);
echo "</pre><hr />\n<u><b>Content:</b></u><br />\n<pre>";
echo htmlspecialchars($body);
echo '</pre>';
0
 
LVL 3

Expert Comment

by:Boris Aranovich
ID: 11870556
You already accepted Alan's answer, which is good :)
But also try the piece of code I wrote. The only real difference is that I added the 's' modifier to the regex, which makes the dot match newlines too, so you can have the code on multiple lines and you don't need to convert \n and/or \r to anything...
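
A compact way to see the effect of 's' is to run the same pattern with and without it on multi-line input. This is a sketch only: the helper name extract_body is hypothetical, and the pattern uses "[^>]*" for the opening tag rather than the 'U' modifier from the code above.

```php
<?php
// Hypothetical helper (not from the thread) built on the "s" modifier.
function extract_body(string $html): ?array
{
    // i: case-insensitive; s: "." also matches newline characters.
    if (preg_match('/(<body[^>]*>)(.*?)<\/body>/is', $html, $m) !== 1) {
        return null;
    }
    return ['tag' => $m[1], 'content' => $m[2]];
}

$html = "<html>\n<head><title>Homepage</title></head>\n<BODY background=bg.gif>\nthis is\ncontent\n</BODY>\n</html>";

$result = extract_body($html);
echo $result['tag'], "\n";
echo trim($result['content']), "\n";

// Without "s" the dot stops at each newline, so the multi-line body does not match.
$without_s = preg_match('/(<body[^>]*>)(.*?)<\/body>/i', $html);
var_dump($without_s);
```

The content is returned with its original line breaks intact, which is why no str_replace() or nl2br() step is needed beforehand.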
0
 

Author Comment

by:Nottingham
ID: 11870598
Nomaed,

Thanks very much for adding that.

I thought it must be possible!

Sorry you came in so late - I'll give you some points as well if you want!?

Cheers

Dan

0
 
LVL 3

Expert Comment

by:Boris Aranovich
ID: 11870631
Nah, that's ok :) He deserved them, he was quicker on the trigger, and his answer is correct :)
I get my points from the satisfaction of success :p
Glad to help anyhow :)
0
 
LVL 9

Expert Comment

by:AlanJDM
ID: 11870669
Nomaed, You also helped me because I was unaware of the 's' modifier. Thanks.

Alan
0
 
LVL 3

Expert Comment

by:Boris Aranovich
ID: 11870685
np :)
0


Question has a verified solution.
