Extract the body tag and its contents from HTML code

I have the HTML code of a web page in a variable.

I need to extract the body tag from this HTML code, so I need something like "<body background="images/1.gif" bgcolor="#0000FF">" returned.

I have been playing around with this:

$html="<html><head><title>Untitled</title></head><BODY bgcolor=\"#0000FF\">this is the page content</BODY></html>";
preg_match("/(<body.*?>.*?<\/body>)/i",$html, $matches);
$body = $matches[0];
preg_match("/(<body.*?>)/i",$body, $matches);
$body_tag = $matches[0];

But I don't think it's right. It needs to be case-insensitive, as some people use <BODY .......>

I then need to extract what is between the body tags, so in the above example it would return "this is the page content".

I have been playing with this - but as you can see it's not very good!

$content = str_replace($body_tag, "", $body);
$content = str_replace("</body>", "", $content);
$content = str_replace("</BODY>", "", $content);
$content = str_replace("</html>", "", $content);
$content = str_replace("</HTML>", "", $content);

Thanks for your help.



Asked by: Nottingham
 
AlanJDM commented:
try this...

<?php

$html='
<HTML>
<HEAD>
<TITLE>Homepage</TITLE>
<BODY background=bg.gif><b>this is content</b></BODY>
</HTML>';

$html = str_replace("\n","",$html);

preg_match("/(<body.*>)(.*)(<\/body>)/iU",$html, $matches);

$body_tag = $matches[1];
$body = $matches[2];

echo $body_tag;
echo $body;

?>

oreomike commented:
Could you try grouping the letters of body:
"/(<[Bb][Oo][Dd][Yy].*?>.*?<\/[Bb][Oo][Dd][Yy]>)/i" as your match string?

Or try the strloc functions.

PHP: http://us3.php.net/manual/en/function.preg-match.php

oreomike commented:
I meant stripos, as the following:

$bodystart = stripos($html,"<body");
$bodyend = strripos($html,"</body>");

then a substr($html, $bodystart, $bodyend - $bodystart + strlen("</body>")); (the third argument of substr is a length, not an end position)

I may be off by one there with the numbering.

Just a suggestion.
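
A minimal sketch of the stripos/substr idea above, returning the opening tag and the content separately. The helper name extract_body_by_offsets() is hypothetical, and the sketch assumes there is no ">" inside an attribute value of the body tag.

```php
<?php
// Hypothetical helper: locate the body by string offsets instead of a regex.
function extract_body_by_offsets($html)
{
    $start = stripos($html, '<body');     // first opening tag, any letter case
    $end   = strripos($html, '</body>');  // last closing tag, any letter case
    if ($start === false || $end === false || $end < $start) {
        return null;                      // no usable body element found
    }
    $tagEnd = strpos($html, '>', $start); // the ">" that closes the opening tag
    if ($tagEnd === false || $tagEnd > $end) {
        return null;
    }
    return [
        // substr() takes a start and a LENGTH, hence the subtractions.
        'tag'     => substr($html, $start, $tagEnd - $start + 1),
        'content' => substr($html, $tagEnd + 1, $end - $tagEnd - 1),
    ];
}

$html = "<html><head><title>Untitled</title></head><BODY bgcolor=\"#0000FF\">this is the page content</BODY></html>";
$parts = extract_body_by_offsets($html);
echo $parts['tag'], "\n";     // <BODY bgcolor="#0000FF">
echo $parts['content'], "\n"; // this is the page content
```

Because no regex is involved, line breaks inside the body make no difference to this approach.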
 
platineo commented:
Try it the simple way...

<?php
$html="<html><head><title>Untitled</title></head><BODY bgcolor=\"#0000FF\">this is the page content</BODY></html>";

$starttag="<body"; // assign the tags, you must leave it as "<body" because the tag might have other things in it!
$endtag="</body>"; // and the end tag!

$pos1=stripos($html,$starttag); //search the position (case-insensitive; the haystack comes first)
$pos2=stripos($html,$endtag); //search the other one!

if ($pos1!==false&&$pos2!==false){ //just in case ;) stripos returns false when nothing is found
$bodytag=substr($html,$pos1,$pos2+strlen($endtag)-$pos1);
//the body element runs from the starting position $pos1 to the ending position $pos2 plus the length of the closing tag, because we want to include that too!
}
?>

If you think "<" and ">" characters in your content could be mistaken for part of a tag, you'd better use "&lt;" and "&gt;" in your content, because that keeps the search consistent! Hope that helped!

Nottingham (author) commented:
Not sure if my original question was clear enough.

I need returned:

1. the opening body tag with its attributes, e.g. "<body background="images/1.gif" bgcolor="#0000FF">"

2. the content between the two body tags, e.g. if $html="<html><head><title>Untitled</title></head><BODY bgcolor=\"#0000FF\">this is the page content</BODY></html>"; then I need "this is the page content".

So the code I'm looking for is:

$html="<html><head><title>Untitled</title></head><BoDy bgcolor=\"#0000FF\">this is the page content</BodY></html>";

$body_tag=something_clever();

$body_content=something_smart();

AlanJDM commented:
You had it so close... regex needed a tweak...

$html="<html><head><title>Untitled</title></head><BODY bgcolor=\"#0000FF\">this is the page content</BODY></html>";
preg_match("/(<body.*>)(.*?)(<\/body>)/i",$html, $matches);

$body_tag = $matches[1];
$body = $matches[2];


Alan

AlanJDM commented:
Also, for what it's worth... $matches[3] will contain the closing body tag if needed.


Alan

oreomike commented:
Or:

$html="<html><head><title>Untitled</title></head><BODY bgcolor=\"#0000FF\">this is the page content</BODY></html>";
preg_match("/(<[Bb][Oo][Dd][Yy].*>)(.*?)(<\/[Bb][Oo][Dd][Yy]>)/i",$html, $matches);

$body_tag = $matches[1];
$body = $matches[2];

AlanJDM commented:
Why would you do that? There is no need to check for the upper and lower case chars because the "i" in the regex is for "case insensitive".


Alan
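
A small sketch of what the "i" modifier does, using the mixed-case sample from the question. The variable names are only for illustration.

```php
<?php
$html = "<html><head><title>Untitled</title></head><BoDy bgcolor=\"#0000FF\">this is the page content</BodY></html>";

// Same pattern twice; only the trailing "i" differs.
$withFlag    = preg_match('/<body.*?>/i', $html, $m1); // "i": letter case is ignored
$withoutFlag = preg_match('/<body.*?>/', $html, $m2);  // no "i": only lowercase "<body" matches

echo $withFlag, ' ', $m1[0], "\n"; // 1 <BoDy bgcolor="#0000FF">
echo $withoutFlag, "\n";           // 0
```

With the modifier in place, the [Bb][Oo][Dd][Yy] character classes add nothing.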

oreomike commented:
You have to do that for people like me who don't know what the 'i' at the end of the string stands for.  ;-)

Nottingham (author) commented:
I was going to ask the same question.

Nottingham (author) commented:
I worked something out: my HTML code is pulled from a db. It must have line breaks in it or similar. The code does not work on it; for example, it won't work on this:

$html='<HTML><HEAD><TITLE>Homepage</TITLE><BODY background=bg.gif>
this is
content
</BODY></HTML>';

preg_match("/(<body.*>)(.*?)(<\/body>)/i",$html, $matches);

$body_tag = $matches[1];
$body = $matches[2];

echo $body_tag;
echo $body;

Is there a function I need to run on the HTML code first to remove all the line breaks?

AlanJDM commented:
The code I posted does work with line breaks; I just tested it. I even cut and pasted your example and ran it and it worked fine, so I don't think line breaks are your problem. However, if you think they are, you can try running the HTML through nl2br() first (note that nl2br() inserts a <br /> before each newline rather than removing the newline)...

$html='<HTML><HEAD><TITLE>Homepage</TITLE><BODY background=bg.gif>
this is
content
</BODY></HTML>';

$html = nl2br($html);

preg_match("/(<body.*>)(.*?)(<\/body>)/i",$html, $matches);

$body_tag = $matches[1];
$body = $matches[2];


echo $body_tag;
echo $body;


Alan

Nottingham (author) commented:
Alan,

I'll test it again.

I'm sure it didn't work, but I also thought it should have done.

Cheers

Dan

Nottingham (author) commented:
It worked when I put the variable assignment all on one line, but again it did not when there were line breaks.

I don't actually want <br>s in the HTML code where there is a newline. Can I just remove them?

AlanJDM commented:
try this.....

$html='<HTML><HEAD><TITLE>Homepage</TITLE><BODY background=bg.gif>
this is
content
</BODY></HTML>';

$html = str_replace("\n","",$html);

preg_match("/(<body.*>)(.*?)(<\/body>)/i",$html, $matches);

$body_tag = $matches[1];
$body = $matches[2];

echo $body_tag;
echo $body;


Alan
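
A hedged variation on the str_replace() call above, assuming the HTML pulled from the database might use Windows line endings ("\r\n"). That assumption is not confirmed anywhere in this thread; the sample string below is made up for illustration.

```php
<?php
// Made-up sample with "\r\n" line endings.
$html = "<BODY background=bg.gif>\r\nthis is\r\ncontent\r\n</BODY>";

$onlyLf = str_replace("\n", "", $html);              // leaves the "\r" characters behind
$both   = str_replace(array("\r", "\n"), "", $html); // removes both kinds

var_dump(strpos($onlyLf, "\r") !== false); // bool(true)
echo $both, "\n";                          // <BODY background=bg.gif>this iscontent</BODY>
```

Note that stripping line breaks joins the words on either side ("this iscontent"); replacing them with a space instead may be preferable if the text is shown to readers.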

Nottingham (author) commented:
ok

I thought it worked but then I tried this:

<?
$html='
<HTML>
<HEAD>
<TITLE>HOMEPAGE</TITLE>
</HEAD>
<BODY background=bg.gif><b>this is content</b></BODY>
</HTML>';

$html = str_replace("\n","",$html);

preg_match("/(<body.*>)(.*?)(<\/body>)/i",$html, $matches);

$body_tag = $matches[1];
$body = $matches[2];

echo $body_tag;
//echo $body;
?>

Returned in the browser is:

<BODY background=bg.gif><b>this is content</b>

So it's not right?
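
The output above is consistent with the first group being greedy: in (<body.*>) the ".*" runs on to the last ">" that still leaves a "</body>" to match, which here is the ">" of "</b>". A sketch of one possible fix, swapping ".*" for "[^>]*" so the first group cannot cross a ">" (this assumes no ">" appears inside an attribute value):

```php
<?php
$html = '<HTML><HEAD><TITLE>HOMEPAGE</TITLE></HEAD><BODY background=bg.gif><b>this is content</b></BODY></HTML>';

// Greedy first group, as in the pattern above.
preg_match('/(<body.*>)(.*?)(<\/body>)/i', $html, $greedy);

// [^>]* stops at the first ">", i.e. at the end of the opening tag.
preg_match('/(<body[^>]*>)(.*?)(<\/body>)/i', $html, $fixed);

echo $greedy[1], "\n"; // <BODY background=bg.gif><b>this is content</b>
echo $fixed[1], "\n";  // <BODY background=bg.gif>
echo $fixed[2], "\n";  // <b>this is content</b>
```

Making the first quantifier lazy, as in (<body.*?>), or using the "U" modifier on the whole pattern should have the same effect on this input.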

Nottingham (author) commented:
Alan,

Works like a treat!

Thanks for your help!!

Really appreciate the speed element as well!

Boris Aranovich (Senior Software Engineer) commented:
How's that?

$html =<<<HTMLEOT
<html>
      <head><title>asdasda</title></head>
<body bgcolor="white" link="blue">
<p>Hello world!</p>
<div align="center"><strong>I AM WORLD!!!</strong></div>
</body>
</html>
HTMLEOT;

preg_match('/(<body.*>)(.*)<\\/body>/Usi', $html, $matches);
$body_tag = $matches[1];
$body = $matches[2];

echo "<u><b>Body tag:</b></u><br />\n<pre>";
echo htmlspecialchars($body_tag);
echo "</pre><hr />\n<u><b>Content:</b></u><br />\n<pre>";
echo htmlspecialchars($body);
echo '</pre>';

Boris Aranovich (Senior Software Engineer) commented:
You already accepted Alan's answer, which is good :)
But also try the piece of code I wrote. The only real difference is that I added the 's' modifier to the regex, which makes '.' match newlines as well, so the HTML can span multiple lines and you don't need to convert \n and/or \r to anything...
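
A short sketch of the difference the 's' modifier makes on multi-line input, without stripping any line breaks first. The sample string and variable names are illustrative only.

```php
<?php
$html = "<HTML><HEAD><TITLE>Homepage</TITLE></HEAD><BODY background=bg.gif>\nthis is\ncontent\n</BODY></HTML>";

// Without "s", "." stops at a newline, so the content group cannot reach "</body>".
$plain  = preg_match('/(<body.*?>)(.*?)<\/body>/i', $html, $m1);

// With "s", "." also matches newlines.
$dotAll = preg_match('/(<body.*?>)(.*?)<\/body>/is', $html, $m2);

echo $plain, "\n";        // 0
echo $dotAll, "\n";       // 1
echo $m2[1], "\n";        // <BODY background=bg.gif>
echo trim($m2[2]), "\n";  // the three content lines, line breaks preserved
```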

Nottingham (author) commented:
Nomaed,

Thanks very much for adding that.

I thought it must be possible!

Sorry you came in so late. I'll give you some points as well if you want!?

Cheers

Dan


Boris Aranovich (Senior Software Engineer) commented:
Nah, that's ok :) He deserved them, he was quicker on the trigger, and his answer is correct :)
I get my points from the satisfaction of success :p
Glad to help anyhow :)

AlanJDM commented:
Nomaed, You also helped me because I was unaware of the 's' modifier. Thanks.

Alan

Boris Aranovich (Senior Software Engineer) commented:
np :)
Question has a verified solution.
