Extract the body tag and the body contents from HTML code

I have the HTML code of a web page in a variable.

I need to extract the body tag from this HTML code - so I need something like "<body background="images/1.gif" bgcolor="#0000FF">" returned.

I have been playing around with this:

$html="<html><head><title>Untitled</title></head><BODY bgcolor=\"#0000FF\">this is the page content</BODY></html>";
preg_match("/(<body.*?>.*?<\/body>)/i",$html, $matches);
$body = $matches[0];
preg_match("/(<body.*?>)/i",$body, $matches);
$body_tag = $matches[0];

But I don't think it's right. It needs to be case insensitive, as some people use <BODY .......>

I then need to extract what is between the body tags, so in the above example it would return "this is the page content".

I have been playing with this - but as you can see it's not very good!

$content = str_replace($body_tag, "", $body);
$content = str_replace("</body>", "", $content);
$content = str_replace("</BODY>", "", $content);
$content = str_replace("</html>", "", $content);
$content = str_replace("</HTML>", "", $content);

Thanks for your help.



oreomike commented:
Could you try to group the letters of body:
"/(<[Bb][Oo][Dd][Yy].*?>.*?<\/[Bb][Oo][Dd][Yy]>)/i" as your match string?

Or try the strloc functions.

PHP: http://us3.php.net/manual/en/function.preg-match.php

oreomike commented:
I meant stripos, as the following:

$bodystart = stripos($html,"<body");
$bodyend = strripos($html,"</body>");

then a substr($html,$bodystart,$bodyend-$bodystart+strlen("</body>")); since the third argument of substr is a length, not an end position.

I may be off by one there with the numbering.

Just a suggestion.
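
The stripos/strripos idea above can be sketched roughly as follows. This is only an illustration: the function name extract_body_element is made up for the example, and the type hints assume PHP 7.1 or later.

```php
<?php
// Sketch: pull out the whole <body ...>...</body> element with string functions.
// extract_body_element is a hypothetical helper name, not from the thread.
function extract_body_element(string $html): ?string
{
    $endTag = '</body>';
    $start = stripos($html, '<body');   // first opening tag, case-insensitive
    $end   = strripos($html, $endTag);  // last closing tag, case-insensitive

    if ($start === false || $end === false || $end < $start) {
        return null; // no usable body element found
    }

    // substr() takes a length, so convert the end offset into one
    // and add the closing tag so it is included in the result.
    return substr($html, $start, $end - $start + strlen($endTag));
}

$html = "<html><head><title>Untitled</title></head><BODY bgcolor=\"#0000FF\">this is the page content</BODY></html>";
echo extract_body_element($html), "\n";
// <BODY bgcolor="#0000FF">this is the page content</BODY>
```

Note that this returns the element as a whole; the opening tag and the inner content would still have to be separated afterwards.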

platineo commented:
Try this simple way...

<?php
$html="<html><head><title>Untitled</title></head><BODY bgcolor=\"#0000FF\">this is the page content</BODY></html>";

$starttag="<body"; // assign the tags, you must leave it as "<body" because the tag might have other things in it!
$endtag="</body>"; // and the end tag!

$pos1=stripos($html,$starttag); //search the position (haystack comes first; stripos ignores case)
$pos2=stripos($html,$endtag); //search the other one!

if ($pos1!==false&&$pos2!==false){ //just in case ;) stripos returns false when nothing is found
$bodytag=substr($html,$pos1,$pos2+strlen($endtag)-$pos1);
//substr takes a length, so the body runs from the starting position $pos1 up to the ending position $pos2+length of the closing tag, coz we want to include that too!
}
?>

If you think the content itself might contain "<" and ">" characters that could be mistaken for part of a tag, then you'd better use "&lt;" and "&gt;" in your content, coz that will keep the search consistent! Hope that helps!
 
Nottingham (author) commented:
Not sure if my original question was clear enough.

I need returned:

1. the opening body tag with its attributes, e.g. "<body background="images/1.gif" bgcolor="#0000FF">"

2. the content between the two body tags, e.g. if $html="<html><head><title>Untitled</title></head><BODY bgcolor=\"#0000FF\">this is the page content</BODY></html>"; then I need "this is the page content".

So the code I'm looking for is:

$html="<html><head><title>Untitled</title></head><BoDy bgcolor=\"#0000FF\">this is the page content</BodY></html>";

$body_tag=something_clever();

$body_content=something_smart();

AlanJDM commented:
You had it so close... regex needed a tweak...

$html="<html><head><title>Untitled</title></head><BODY bgcolor=\"#0000FF\">this is the page content</BODY></html>";
preg_match("/(<body.*>)(.*?)(<\/body>)/i",$html, $matches);

$body_tag = $matches[1];
$body = $matches[2];


Alan

AlanJDM commented:
Also, for what it's worth... $matches[3] will contain the closing body tag if needed.


Alan

oreomike commented:
Or:

$html="<html><head><title>Untitled</title></head><BODY bgcolor=\"#0000FF\">this is the page content</BODY></html>";
preg_match("/(<[Bb][Oo][Dd][Yy].*>)(.*?)(<\/[Bb][Oo][Dd][Yy]>)/i",$html, $matches);

$body_tag = $matches[1];
$body = $matches[2];

AlanJDM commented:
Why would you do that? There is no need to check for the upper and lower case chars because the "i" in the regex stands for "case insensitive".
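
A small sketch of that point, assuming a few made-up sample tags: the character-class spelling and the plain word with the "i" modifier should accept exactly the same input.

```php
<?php
// Sample tags invented for this illustration.
$samples = ['<body>', '<BODY>', '<BoDy bgcolor="#0000FF">'];
$results = [];

foreach ($samples as $tag) {
    // Spelling out every letter in both cases...
    $withClasses = preg_match('/<[Bb][Oo][Dd][Yy][^>]*>/', $tag);
    // ...matches the same tags as the plain word plus the "i" modifier.
    $withModifier = preg_match('/<body[^>]*>/i', $tag);

    $results[$tag] = [$withClasses, $withModifier];
    echo $tag, ': classes=', $withClasses, ', modifier=', $withModifier, "\n";
}
```

Combining both, as in the pattern posted above, is harmless but redundant.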


Alan

oreomike commented:
You have to do that for people like me who don't know what the 'i' at the end of the string stands for.  ;-)

Nottingham (author) commented:
I was going to ask the same question.

Nottingham (author) commented:
I worked something out - my HTML code is pulled from a db. It must have line breaks in it or similar. The code does not work on it - for example, it won't work on this:

$html='<HTML><HEAD><TITLE>Homepage</TITLE><BODY background=bg.gif>
this is
content
</BODY></HTML>';

preg_match("/(<body.*>)(.*?)(<\/body>)/i",$html, $matches);

$body_tag = $matches[1];
$body = $matches[2];

echo $body_tag;
echo $body;

Is there a function I need to perform on the html_code first to remove all the line breaks?

AlanJDM commented:
The code I posted does work with line breaks, I just tested it. I even cut and pasted your example and ran it and it worked fine, so I don't think line breaks are your problem. However, if you think they are, you can get rid of them using nl2br()...

$html='<HTML><HEAD><TITLE>Homepage</TITLE><BODY background=bg.gif>
this is
content
</BODY></HTML>';

$html = nl2br($html);

preg_match("/(<body.*>)(.*?)(<\/body>)/i",$html, $matches);

$body_tag = $matches[1];
$body = $matches[2];


echo $body_tag;
echo $body;


Alan

Nottingham (author) commented:
Alan,

I'll test it again.

I'm sure it didn't work - but I also thought it should have done.

Cheers

Dan

Nottingham (author) commented:
It worked when I put the variable assignment all on one line - but again it did not work when there were line breaks.

I don't actually want <br>'s in the HTML code where there is a newline - can I just remove them?
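
For reference, nl2br() inserts a <br /> before each line break and leaves the break itself in place, so it does not remove anything. A minimal sketch of the difference, using a made-up string with both "\r\n" and "\n" breaks and str_replace() to strip them:

```php
<?php
// Sample input invented for this illustration.
$html = "<BODY background=bg.gif>\r\nthis is\ncontent\n</BODY>";

// nl2br() adds "<br />" in front of every line break; the breaks stay.
$withBr = nl2br($html);

// str_replace() with an array removes both "\r" and "\n" and adds nothing.
$stripped = str_replace(["\r", "\n"], '', $html);

var_dump(strpos($withBr, "\n") !== false); // bool(true): newlines are still there
echo $stripped, "\n";
```

Stripping the breaks joins the neighbouring words ("this iscontent"), so replacing them with a space may be preferable if the text is going to be displayed.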

AlanJDM commented:
try this.....

$html='<HTML><HEAD><TITLE>Homepage</TITLE><BODY background=bg.gif>
this is
content
</BODY></HTML>';

$html = str_replace("\n","",$html);

preg_match("/(<body.*>)(.*?)(<\/body>)/i",$html, $matches);

$body_tag = $matches[1];
$body = $matches[2];

echo $body_tag;
echo $body;


Alan

Nottingham (author) commented:
ok

I thought it worked but then I tried this:

<?
$html='
<HTML>
<HEAD>
<TITLE>HOMEPAGE</TITLE>
</HEAD>
<BODY background=bg.gif><b>this is content</b></BODY>
</HTML>';

$html = str_replace("\n","",$html);

preg_match("/(<body.*>)(.*?)(<\/body>)/i",$html, $matches);

$body_tag = $matches[1];
$body = $matches[2];

echo $body_tag;
//echo $body;
?>

Returned in the browser is:

<BODY background=bg.gif><b>this is content</b>

So it's not right?
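
What seems to be happening here: the ".*" inside the first group is greedy, so it runs past the end of the opening tag and only gives back as much as it must for the rest of the pattern to match. One possible way to see it, and one possible fix (restricting the tag to "[^>]*", which assumes no attribute value contains a ">"):

```php
<?php
$html = '<HTML><HEAD><TITLE>HOMEPAGE</TITLE></HEAD><BODY background=bg.gif><b>this is content</b></BODY></HTML>';

// Greedy ".*" runs as far as it can, so group 1 swallows the content too.
preg_match('/(<body.*>)(.*?)(<\/body>)/i', $html, $greedy);

// "[^>]*" cannot cross a ">", so group 1 stops at the end of the opening tag.
preg_match('/(<body[^>]*>)(.*?)(<\/body>)/i', $html, $bounded);

echo $greedy[1], "\n";   // <BODY background=bg.gif><b>this is content</b>
echo $bounded[1], "\n";  // <BODY background=bg.gif>
echo $bounded[2], "\n";  // <b>this is content</b>
```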

AlanJDM commented:
try this...

<?php

$html='
<HTML>
<HEAD>
<TITLE>Homepage</TITLE>
<BODY background=bg.gif><b>this is content</b></BODY>
</HTML>';

$html = str_replace("\n","",$html);

preg_match("/(<body.*>)(.*)(<\/body>)/iU",$html, $matches);

$body_tag = $matches[1];
$body = $matches[2];

echo $body_tag;
echo $body;

?>

Nottingham (author) commented:
Alan,

Works like a treat!

Thanks for your help!!

Really appreciate the speed as well!

Boris Aranovich (Senior Software Engineer) commented:
How's that?

$html =<<<HTMLEOT
<html>
      <head><title>asdasda</title></head>
<body bgcolor="white" link="blue">
<p>Hello world!</p>
<div align="center"><strong>I AM WORLD!!!</strong></div>
</body>
</html>
HTMLEOT;

preg_match('/(<body.*>)(.*)<\\/body>/Usi', $html, $matches);
$body_tag = $matches[1];
$body = $matches[2];

echo "<u><b>Body tag:</b></u><br />\n<pre>";
echo htmlspecialchars($body_tag);
echo "</pre><hr />\n<u><b>Content:</b></u><br />\n<pre>";
echo htmlspecialchars($body);
echo '</pre>';

Boris Aranovich (Senior Software Engineer) commented:
You already accepted Alan's answer, which is good :)
But also try the piece of code I wrote. The only real difference is that I added the 's' modifier to the regex, which makes "." match newlines too, so you can have the code on multiple lines and you don't need to convert \n and/or \r to anything...
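
A short sketch of what the 's' modifier changes, using a made-up multi-line string. It uses a lazy ".*?" rather than the 'U' modifier, which is a choice made for this illustration and not the pattern posted above:

```php
<?php
// Sample input invented for this illustration.
$html = "<HTML><BODY background=bg.gif>\nthis is\ncontent\n</BODY></HTML>";

$pattern = '/(<body[^>]*>)(.*?)<\/body>/i';

$withoutS = preg_match($pattern, $html, $m1);        // "." stops at "\n"
$withS    = preg_match($pattern . 's', $html, $m2);  // "." also matches "\n"

var_dump($withoutS);      // int(0)
var_dump($withS);         // int(1)
echo $m2[1], "\n";        // <BODY background=bg.gif>
echo trim($m2[2]), "\n";  // the two content lines, without the surrounding breaks
```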

Nottingham (author) commented:
Nomaed,

Thanks very much for adding that.

I thought it must be possible!

Sorry you came in so late - I'll give you some points as well if you want!?

Cheers

Dan


Boris Aranovich (Senior Software Engineer) commented:
Nah, that's ok :) He deserved them, he was quicker on the trigger, and his answer is correct :)
I get my points from the satisfaction of success :p
Glad to help anyhow :)

AlanJDM commented:
Nomaed, You also helped me because I was unaware of the 's' modifier. Thanks.

Alan

Boris Aranovich (Senior Software Engineer) commented:
np :)