Solved

Extract contents of body tag and body from html code

Posted on 2004-08-23
Last Modified: 2013-12-12
I have the html code of the webpages as a variable.

I need to extract the body tag from this html code, so I need something like "<body background="images/1.gif" bgcolor="#0000FF">" returned.

I have been playing around with this:

$html="<html><head><title>Untitled</title></head><BODY bgcolor=\"#0000FF\">this is the page content</BODY></html>";
preg_match("/(<body.*?>.*?<\/body>)/i",$html, $matches);
$body = $matches[0];
preg_match("/(<body.*?>)/i",$body, $matches);
$body_tag = $matches[0];

But I don't think it's right. It needs to be case-insensitive, as some people use <BODY .......>

I then need to extract what is between the body tags, so in the above example it would return "this is the page content".

I have been playing with this - but as you can see it's not very good!

$content = str_replace($body_tag, "", $body);
$content = str_replace("</body>", "", $content);
$content = str_replace("</BODY>", "", $content);
$content = str_replace("</html>", "", $content);
$content = str_replace("</HTML>", "", $content);

Thanks for your help.



Question by:Nottingham
24 Comments
 
LVL 1

Expert Comment

by:oreomike
ID: 11868896
Could you try spelling out both cases for each letter of body:
"/(<[Bb][Oo][Dd][Yy].*?>.*?<\/[Bb][Oo][Dd][Yy]>)/i" as your match string?

Or try the strloc functions.

PHP: http://us3.php.net/manual/en/function.preg-match.php
 
LVL 1

Expert Comment

by:oreomike
ID: 11868915
I meant stripos, as the following:

$bodystart = stripos($html,"<body");
$bodyend = strripos($html,"</body>");

then a substr($html,$bodystart,$bodyend-$bodystart+strlen("</body>")); since the third argument of substr is a length, not an end position.

I may be off by one there with the numbering.

Just a suggestion.
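
For what it's worth, here is a rough sketch of the string-function route that returns both pieces the question asks for. It assumes PHP 5 or later (for stripos and strripos) and that no attribute value inside the body tag contains a literal ">". The variable names $tagend and $body_content are just picked for this example.

```php
<?php
// Sample page from the question; the upper-case tags are deliberate.
$html = "<html><head><title>Untitled</title></head><BODY bgcolor=\"#0000FF\">this is the page content</BODY></html>";

$bodystart = stripos($html, "<body");    // first "<body", any case
$bodyend   = strripos($html, "</body>"); // last "</body>", any case

$body_tag = null;
$body_content = null;

if ($bodystart !== false && $bodyend !== false && $bodyend > $bodystart) {
    // The opening tag ends at the first ">" after "<body".
    // Assumption: no attribute value contains a literal ">".
    $tagend = strpos($html, ">", $bodystart);
    if ($tagend !== false && $tagend < $bodyend) {
        // substr() takes (string, start, length), so positions become lengths here.
        $body_tag     = substr($html, $bodystart, $tagend - $bodystart + 1);
        $body_content = substr($html, $tagend + 1, $bodyend - $tagend - 1);
    }
}

echo $body_tag, "\n";     // <BODY bgcolor="#0000FF">
echo $body_content, "\n"; // this is the page content
```

Both variables stay null when no body element is found, so the caller can test for that before using them.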
 
LVL 2

Expert Comment

by:platineo
ID: 11869466
Try it the simple way...

<?php
$html="<html><head><title>Untitled</title></head><BODY bgcolor=\"#0000FF\">this is the page content</BODY></html>";

$starttag="<body"; // assign the tags, you must leave it as "<body" because the tag might have other things in it!
$endtag="</body>"; // and the end tag!

$pos1=stripos($html,$starttag); //search the position, ignoring case; the string to search in goes first
$pos2=stripos($html,$endtag); //search the other one!

if ($pos1!==false&&$pos2!==false){ //just in case ;) stripos returns false when nothing is found
$bodytag=substr($html,$pos1,$pos2+strlen($endtag)-$pos1);
//the third argument of substr is a length: it runs from the starting position $pos1 to the ending position $pos2 plus the length of the closing tag, because we want to include that too!
}
?>

If you think the content could cause a conflict, such as a "<" or ">" in the text being taken as part of a tag, then you'd better use "&lt;" and "&gt;" in your content, because that will help keep the search consistent! Hope that helps!

Author Comment

by:Nottingham
ID: 11869606
Not sure if my original question was clear enough.

I need returned:

1. the opening body tag with its attributes, e.g. "<body background="images/1.gif" bgcolor="#0000FF">"

2. the content between the two body tags, e.g. if $html="<html><head><title>Untitled</title></head><BODY bgcolor=\"#0000FF\">this is the page content</BODY></html>"; then I need "this is the page content".

so the code I'm looking for is:

$html="<html><head><title>Untitled</title></head><BoDy bgcolor=\"#0000FF\">this is the page content</BodY></html>";

$body_tag=something_clever();

$body_content=something_smart();
 
LVL 9

Expert Comment

by:AlanJDM
ID: 11869653
You had it so close... regex needed a tweak...

$html="<html><head><title>Untitled</title></head><BODY bgcolor=\"#0000FF\">this is the page content</BODY></html>";
preg_match("/(<body.*>)(.*?)(<\/body>)/i",$html, $matches);

$body_tag = $matches[1];
$body = $matches[2];


Alan
 
LVL 9

Expert Comment

by:AlanJDM
ID: 11869666
Also, for what it's worth... $matches[3] will contain the closing body tag if needed.


Alan
 
LVL 1

Expert Comment

by:oreomike
ID: 11869774
Or:

$html="<html><head><title>Untitled</title></head><BODY bgcolor=\"#0000FF\">this is the page content</BODY></html>";
preg_match("/(<[Bb][Oo][Dd][Yy].*>)(.*?)(<\/[Bb][Oo][Dd][Yy]>)/i",$html, $matches);

$body_tag = $matches[1];
$body = $matches[2];
 
LVL 9

Expert Comment

by:AlanJDM
ID: 11869790
Why would you do that? There is no need to check for the upper and lower case chars because the "i" at the end of the regex stands for "case insensitive".


Alan
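
A quick way to see the point about the "i" modifier is to run the same tags through both styles of pattern. This is only a small sketch; the sample tags and the variable names are made up for the comparison.

```php
<?php
// Without the i modifier, a lower-case pattern misses an upper-case tag.
$no_flag = preg_match("/<body.*?>/", "<BODY>");
echo "no modifier: ", $no_flag, "\n"; // 0

$tags = array("<body>", "<BODY>", "<BoDy bgcolor=\"#0000FF\">");

foreach ($tags as $tag) {
    // Style 1: plain "body" plus the i modifier.
    $with_flag = preg_match("/<body.*?>/i", $tag);
    // Style 2: explicit upper/lower character classes, no modifier.
    $with_classes = preg_match("/<[Bb][Oo][Dd][Yy].*?>/", $tag);
    // Both styles report 1 (a match) for every tag in the list.
    echo $tag, " => ", $with_flag, " / ", $with_classes, "\n";
}
```

Both styles match the same tags, so the character classes only add noise once "i" is in place.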
 
LVL 1

Expert Comment

by:oreomike
ID: 11869794
You have to do that for people like me who don't know what the 'i' at the end of the string stands for.  ;-)
 

Author Comment

by:Nottingham
ID: 11869795
I was going to ask the same question.
 

Author Comment

by:Nottingham
ID: 11869843
I worked something out - my html code is pulled from a db. It must have line breaks in it or similar. The code does not work on it, just as it won't work on this:

$html='<HTML><HEAD><TITLE>Homepage</TITLE><BODY background=bg.gif>
this is
content
</BODY></HTML>';

preg_match("/(<body.*>)(.*?)(<\/body>)/i",$html, $matches);

$body_tag = $matches[1];
$body = $matches[2];

echo $body_tag;
echo $body;

Is there a function I need to perform on the html_code first to remove all the line breaks?
 
LVL 9

Expert Comment

by:AlanJDM
ID: 11869894
The code I posted does work with line breaks, I just tested it. I even cut and pasted your example and ran it and it worked fine, so I don't think line breaks are your problem. However, if you think they are, you can try nl2br(), which inserts a <br /> before each one...

$html='<HTML><HEAD><TITLE>Homepage</TITLE><BODY background=bg.gif>
this is
content
</BODY></HTML>';

$html = nl2br($html);

preg_match("/(<body.*>)(.*?)(<\/body>)/i",$html, $matches);

$body_tag = $matches[1];
$body = $matches[2];


echo $body_tag;
echo $body;


Alan
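
One caveat on nl2br(): as far as I know it leaves the newline in the string and only puts a "<br />" in front of it, so on its own it will not help a pattern that stumbles over "\n". A small sketch of the difference, with made-up sample strings:

```php
<?php
$text = "this is\ncontent";

// nl2br() inserts "<br />" in front of each newline; the "\n" itself stays.
$with_br = nl2br($text);
var_dump($with_br === "this is<br />\ncontent"); // bool(true)

// To actually drop line breaks, strip both "\r" and "\n"
// (text coming from a database may use Windows-style "\r\n").
$stripped = str_replace(array("\r", "\n"), "", "this is\r\ncontent");
var_dump($stripped); // string(14) "this iscontent"
```

Note that stripping the breaks glues the neighbouring words together; replacing with a space instead of an empty string avoids that.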
 

Author Comment

by:Nottingham
ID: 11869909
Alan,

I'll test it again.

I'm sure it didn't work - but I also thought it should have done.

Cheers

Dan
 

Author Comment

by:Nottingham
ID: 11869940
It worked when I made the variable assign line all on one line - but again did not when there were line breaks.

I don't actually want <br>'s in the html code where there is a newline - can I just remove them?
 
LVL 9

Expert Comment

by:AlanJDM
ID: 11869966
try this.....

$html='<HTML><HEAD><TITLE>Homepage</TITLE><BODY background=bg.gif>
this is
content
</BODY></HTML>';

$html = str_replace("\n","",$html);

preg_match("/(<body.*>)(.*?)(<\/body>)/i",$html, $matches);

$body_tag = $matches[1];
$body = $matches[2];

echo $body_tag;
echo $body;


Alan
 

Author Comment

by:Nottingham
ID: 11870198
ok

I thought it worked but then I tried this:

<?
$html='
<HTML>
<HEAD>
<TITLE>HOMEPAGE</TITLE>
</HEAD>
<BODY background=bg.gif><b>this is content</b></BODY>
</HTML>';

$html = str_replace("\n","",$html);

preg_match("/(<body.*>)(.*?)(<\/body>)/i",$html, $matches);

$body_tag = $matches[1];
$body = $matches[2];

echo $body_tag;
//echo $body;
?>

Returned in the browser is:

<BODY background=bg.gif><b>this is content</b>

So it's not right?
 
LVL 9

Accepted Solution

by:
AlanJDM earned 200 total points
ID: 11870432
try this...

<?php

$html='
<HTML>
<HEAD>
<TITLE>Homepage</TITLE>
<BODY background=bg.gif><b>this is content</b></BODY>
</HTML>';

$html = str_replace("\n","",$html);

preg_match("/(<body.*>)(.*)(<\/body>)/iU",$html, $matches);

$body_tag = $matches[1];
$body = $matches[2];

echo $body_tag;
echo $body;

?>
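
Why the U helps: in the earlier pattern the ".*" inside the first group is greedy, so it runs on to the last ">" that still lets the rest of the pattern match, and swallows the <b>...</b> along the way. The U modifier makes quantifiers lazy by default. Below is a side-by-side sketch on a one-line version of the page. The third pattern with "[^>]*" is an alternative not taken from this thread, and it assumes no attribute value contains a ">".

```php
<?php
$html = '<HTML><HEAD><TITLE>Homepage</TITLE></HEAD><BODY background=bg.gif><b>this is content</b></BODY></HTML>';

// Greedy ".*" in group 1 grabs as much as it can while still allowing a match.
preg_match("/(<body.*>)(.*?)(<\/body>)/i", $html, $greedy);

// U modifier: quantifiers are lazy by default, so group 1 stops at the first ">".
preg_match("/(<body.*>)(.*)(<\/body>)/iU", $html, $ungreedy);

// Alternative (assumption, not from the thread): "[^>]*" can never cross a ">".
preg_match("/(<body[^>]*>)(.*?)(<\/body>)/i", $html, $negated);

echo $greedy[1], "\n";   // <BODY background=bg.gif><b>this is content</b>
echo $ungreedy[1], "\n"; // <BODY background=bg.gif>
echo $ungreedy[2], "\n"; // <b>this is content</b>
echo $negated[1], "\n";  // <BODY background=bg.gif>
```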
 

Author Comment

by:Nottingham
ID: 11870494
Alan,

Works like a treat!

Thanks for your help!!

Really appreciate the speed element as well!
 
LVL 3

Expert Comment

by:Boris Aranovich
ID: 11870533
How's that?

$html =<<<HTMLEOT
<html>
      <head><title>asdasda</title></head>
<body bgcolor="white" link="blue">
<p>Hello world!</p>
<div align="center"><strong>I AM WORLD!!!</strong></div>
</body>
</html>
HTMLEOT;

preg_match('/(<body.*>)(.*)<\\/body>/Usi', $html, $matches);
$body_tag = $matches[1];
$body = $matches[2];

echo "<u><b>Body tag:</b></u><br />\n<pre>";
echo htmlspecialchars($body_tag);
echo "</pre><hr />\n<u><b>Content:</b></u><br />\n<pre>";
echo htmlspecialchars($body);
echo '</pre>';
 
LVL 3

Expert Comment

by:Boris Aranovich
ID: 11870556
You already accepted Alan's answer, which is good :)
But also try the piece of code I wrote. The only real difference is that I added the 's' modifier to the regex, which makes the dot match newlines too, so you can have the code on multiple lines and you don't need to convert \n and/or \r to anything...
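
To make the effect of 's' visible, here is a small sketch that wraps the same pattern in a helper and compares it with a run without the modifier. The function name extract_body is made up for this example, and the sample page is a multi-line variant of the one posted above.

```php
<?php
// Hypothetical helper (the name is invented for this sketch).
// Returns array(opening tag, inner content), or null when no body element is found.
function extract_body($html)
{
    // U: lazy quantifiers, s: "." also matches newlines, i: case-insensitive.
    if (preg_match('/(<body.*>)(.*)<\/body>/Usi', $html, $matches)) {
        return array($matches[1], $matches[2]);
    }
    return null;
}

$html = "<HTML><HEAD><TITLE>Homepage</TITLE></HEAD><BODY background=bg.gif>\nthis is\ncontent\n</BODY></HTML>";

// Without s, "." stops at "\n", so the multi-line content cannot be matched.
$without_s = preg_match('/(<body.*>)(.*)<\/body>/Ui', $html);
echo "matches without s: ", $without_s, "\n"; // 0

list($body_tag, $body_content) = extract_body($html);
echo $body_tag, "\n"; // <BODY background=bg.gif>
echo $body_content;   // the three content lines, newlines kept
```

Since the newlines are kept, the content comes back exactly as stored, with no need to run str_replace on the page first.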
 

Author Comment

by:Nottingham
ID: 11870598
Nomaed,

Thanks very much for adding that.

I thought it must be possible!

Sorry you came in so late - I'll give you some points as well if you want!?

Cheers

Dan

 
LVL 3

Expert Comment

by:Boris Aranovich
ID: 11870631
Nah, that's ok :) He deserved them, he was quicker on the trigger, and his answer is correct :)
I get my points from the satisfaction of success :p
Glad to help anyhow :)
 
LVL 9

Expert Comment

by:AlanJDM
ID: 11870669
Nomaed, You also helped me because I was unaware of the 's' modifier. Thanks.

Alan
 
LVL 3

Expert Comment

by:Boris Aranovich
ID: 11870685
np :)
