Solved

Extract the body tag and the body contents from HTML code

Posted on 2004-08-23
Last Modified: 2013-12-12
I have the HTML code of a web page in a variable.

I need to extract the body tag from this HTML code, so I need something like "<body background="images/1.gif" bgcolor="#0000FF">" returned.

I have been playing around with this:

$html="<html><head><title>Untitled</title></head><BODY bgcolor=\"#0000FF\">this is the page content</BODY></html>";
preg_match("/(<body.*?>.*?<\/body>)/i",$html, $matches);
$body = $matches[0];
preg_match("/(<body.*?>)/i",$body, $matches);
$body_tag = $matches[0];

But I don't think it's right. It needs to be case insensitive as some people use <BODY .......>

I then need to extract what is between the body tags, so in the above example it would return "this is the page content".

I have been playing with this - but as you can see it's not very good!

$content = str_replace($body_tag, "", $body);
$content = str_replace("</body>", "", $content);
$content = str_replace("</BODY>", "", $content);
$content = str_replace("</html>", "", $content);
$content = str_replace("</HTML>", "", $content);

Thanks for your help.



Question by:Nottingham
24 Comments
 
LVL 1

Expert Comment

by:oreomike
Could you try grouping the letters of body:
"/(<[Bb][Oo][Dd][Yy].*?>.*?<\/[Bb][Oo][Dd][Yy]>)/i" as your match string?

Or try the strloc functions.

PHP: http://us3.php.net/manual/en/function.preg-match.php
 
LVL 1

Expert Comment

by:oreomike
I meant stripos, as the following:

$bodystart = stripos($html,"<body");
$bodyend = strripos($html,"</body>");

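then a substr($html,$bodystart,$bodyend-$bodystart+strlen("</body>"));

The third argument of substr() is a length, not an end position, hence the subtraction; the strlen() part keeps the closing tag in the result.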

Just a suggestion.
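
For anyone following along, here is a minimal sketch of the stripos()/strripos() idea above. The helper name extract_body_span() is made up for this example, and it assumes PHP 5 or later, where both functions are available.

```php
<?php
// Sketch only: extract_body_span() is a hypothetical helper name, not from the thread.
function extract_body_span($html)
{
    $end_tag = "</body>";
    $start = stripos($html, "<body");    // first opening tag, any case
    $end   = strripos($html, $end_tag);  // last closing tag, any case
    if ($start === false || $end === false || $end < $start) {
        return null; // no usable body element found
    }
    // substr() wants a length, so convert the end offset into one.
    return substr($html, $start, $end - $start + strlen($end_tag));
}

$html = "<html><head><title>Untitled</title></head><BODY bgcolor=\"#0000FF\">this is the page content</BODY></html>";
$span = extract_body_span($html);
echo $span, "\n";
```

This returns the whole span from the opening tag through the closing tag, so separating the tag from the content would still need a further step.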
 
LVL 2

Expert Comment

by:platineo
Try this, the simple way...

<?php
$html="<html><head><title>Untitled</title></head><BODY bgcolor=\"#0000FF\">this is the page content</BODY></html>";

$starttag="<body"; // assign the tags, you must leave it as "<body" because the tag might have other things in it!
$endtag="</body>"; // and the end tag!

$pos1=stripos($html,$starttag); //search the position (haystack first, then needle; stripos ignores case)
$pos2=stripos($html,$endtag); //search the other one!

if ($pos1!==false&&$pos2!==false){ //just in case ;) (a failed search returns false, and a hit can be at position 0)
$bodytag=substr($html,$pos1,$pos2+strlen($endtag)-$pos1);
//the body tag is determined by the starting position $pos1 and a length that runs to $pos2+length of the closing tag, because we want to include that too!
}
?>

If you think the content itself could cause a conflict, i.e. a "<" or ">" in the content could be taken as part of a tag, then you'd better use "&lt;" and "&gt;" in your content, because that keeps the search consistent. Hope that helps!
 

Author Comment

by:Nottingham
Not sure if my original question was clear enough.

I need returned:

1. the opening body tag with its attributes, e.g. "<body background="images/1.gif" bgcolor="#0000FF">"

2. the content between the two body tags, e.g. if $html="<html><head><title>Untitled</title></head><BODY bgcolor=\"#0000FF\">this is the page content</BODY></html>"; then I need "this is the page content".

so the code I'm looking for is:

$html="<html><head><title>Untitled</title></head><BoDy bgcolor=\"#0000FF\">this is the page content</BodY></html>";

$body_tag=something_clever();

$body_content=something_smart();
 
LVL 9

Expert Comment

by:AlanJDM
You had it so close... regex needed a tweak...

$html="<html><head><title>Untitled</title></head><BODY bgcolor=\"#0000FF\">this is the page content</BODY></html>";
preg_match("/(<body.*>)(.*?)(<\/body>)/i",$html, $matches);

$body_tag = $matches[1];
$body = $matches[2];


Alan
 
LVL 9

Expert Comment

by:AlanJDM
Also, for what it's worth, $matches[3] will contain the closing body tag if needed.


Alan
 
LVL 1

Expert Comment

by:oreomike
Or:

$html="<html><head><title>Untitled</title></head><BODY bgcolor=\"#0000FF\">this is the page content</BODY></html>";
preg_match("/(<[Bb][Oo][Dd][Yy].*>)(.*?)(<\/[Bb][Oo][Dd][Yy]>)/i",$html, $matches);

$body_tag = $matches[1];
$body = $matches[2];
 
LVL 9

Expert Comment

by:AlanJDM
Why would you do that? There is no need to check for the upper and lower case chars because the "i" in the regex stands for "case insensitive".


Alan
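
To illustrate the point about the "i" modifier, here is a small sketch comparing the three variants. The sample tags are made up for this example.

```php
<?php
// Illustration only: the sample tags below are made up for this sketch.
$tags = array("<body>", "<BODY>", "<BoDy bgcolor=\"#0000FF\">");

$results = array();
foreach ($tags as $tag) {
    $with_flag    = preg_match("/<body.*?>/i", $tag);             // "i" ignores case
    $with_classes = preg_match("/<[Bb][Oo][Dd][Yy].*?>/", $tag);  // spelled out by hand
    $without      = preg_match("/<body.*?>/", $tag);              // case-sensitive
    $results[$tag] = array($with_flag, $with_classes, $without);
    echo $tag, " => ", $with_flag, " ", $with_classes, " ", $without, "\n";
}
```

The first two columns should agree for every tag, which is why combining the character classes with "i" adds nothing.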
 
LVL 1

Expert Comment

by:oreomike
You have to do that for people like me who don't know what the 'i' at the end of the string stands for.  ;-)
 

Author Comment

by:Nottingham
I was going to ask the same question.
 

Author Comment

by:Nottingham
I worked something out: my HTML code is pulled from a database, so it must have line breaks in it or similar. The code does not work on that. For example, it won't work on this:

$html='<HTML><HEAD><TITLE>Homepage</TITLE><BODY background=bg.gif>
this is
content
</BODY></HTML>';

preg_match("/(<body.*>)(.*?)(<\/body>)/i",$html, $matches);

$body_tag = $matches[1];
$body = $matches[2];

echo $body_tag;
echo $body;

Is there a function I need to perform on the html_code first to remove all the line breaks?
 
LVL 9

Expert Comment

by:AlanJDM
The code I posted does work with line breaks; I just tested it. I even cut and pasted your example and ran it, and it worked fine, so I don't think line breaks are your problem. However, if you think they are, you can try running the code through nl2br() first (note that it puts a <br /> tag in front of each line break rather than removing it)...

$html='<HTML><HEAD><TITLE>Homepage</TITLE><BODY background=bg.gif>
this is
content
</BODY></HTML>';

$html = nl2br($html);

preg_match("/(<body.*>)(.*?)(<\/body>)/i",$html, $matches);

$body_tag = $matches[1];
$body = $matches[2];


echo $body_tag;
echo $body;


Alan

 

Author Comment

by:Nottingham
Alan,

I'll test it again.

I'm sure it didn't work, but I also thought it should have.

Cheers

Dan
 

Author Comment

by:Nottingham
It worked when I put the variable assignment all on one line, but again it did not when there were line breaks.

I don't actually want <br>'s in the HTML code where there is a newline. Can I just remove the line breaks?
 
LVL 9

Expert Comment

by:AlanJDM
try this.....

$html='<HTML><HEAD><TITLE>Homepage</TITLE><BODY background=bg.gif>
this is
content
</BODY></HTML>';

$html = str_replace("\n","",$html);

preg_match("/(<body.*>)(.*?)(<\/body>)/i",$html, $matches);

$body_tag = $matches[1];
$body = $matches[2];

echo $body_tag;
echo $body;


Alan
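
A side note on stripping line breaks: data pulled from a database may contain "\r\n" rather than a bare "\n". That is only an assumption about the data here, but str_replace() accepts an array of search strings, so all three common line endings can be removed in one call. The sketch also shows what nl2br() does to the same input.

```php
<?php
// Sketch only: the "\r\n" line endings here are an assumption about the stored data.
$html = "<BODY background=bg.gif>\r\nthis is\r\ncontent\r\n</BODY>";

// Remove Windows (\r\n), old Mac (\r) and Unix (\n) line endings in one call.
$flat = str_replace(array("\r\n", "\r", "\n"), "", $html);

// For comparison: nl2br() keeps every line break and adds a <br /> in front of it.
$with_br = nl2br($html);

echo $flat, "\n";
```

Note that removing the breaks outright joins "this is" and "content" without a space; replacing them with " " instead of "" may be closer to what is wanted.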
 

Author Comment

by:Nottingham
ok

I thought it worked but then I tried this:

<?
$html='
<HTML>
<HEAD>
<TITLE>HOMEPAGE</TITLE>
</HEAD>
<BODY background=bg.gif><b>this is content</b></BODY>
</HTML>';

$html = str_replace("\n","",$html);

preg_match("/(<body.*>)(.*?)(<\/body>)/i",$html, $matches);

$body_tag = $matches[1];
$body = $matches[2];

echo $body_tag;
//echo $body;
?>

Returned in the browser is:

<BODY background=bg.gif><b>this is content</b>

So it's not right?
 
LVL 9

Accepted Solution

by:AlanJDM (earned 200 total points)
try this...

<?php

$html='
<HTML>
<HEAD>
<TITLE>Homepage</TITLE>
<BODY background=bg.gif><b>this is content</b></BODY>
</HTML>';

$html = str_replace("\n","",$html);

preg_match("/(<body.*>)(.*)(<\/body>)/iU",$html, $matches);

$body_tag = $matches[1];
$body = $matches[2];

echo $body_tag;
echo $body;

?>
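
For readers wondering why the "U" modifier makes the difference, here is a small comparison on a one-line version of the page from this thread. The third pattern, using [^>]*, is not from the thread; it is one possible alternative that gives the same result on this input without "U".

```php
<?php
// Illustration only: one-line input modelled on the page in this thread.
$html = "<HTML><HEAD><TITLE>Homepage</TITLE></HEAD><BODY background=bg.gif><b>this is content</b></BODY></HTML>";

// Greedy first group: ".*>" runs on to the last ">" that still lets the rest match.
preg_match("/(<body.*>)(.*?)(<\/body>)/i", $html, $greedy);

// "U" flips the default, so ".*" becomes lazy (and ".*?" would become greedy).
preg_match("/(<body.*>)(.*)(<\/body>)/iU", $html, $ungreedy);

// Alternative without "U": keep ">" out of the tag itself.
preg_match("/(<body[^>]*>)(.*?)(<\/body>)/i", $html, $explicit);

echo $greedy[1], "\n";
echo $ungreedy[1], " | ", $ungreedy[2], "\n";
```

With the greedy pattern the first group swallows the `<b>...</b>` content as well, which matches the output reported earlier. The [^>]* variant would still be confused by a ">" inside a quoted attribute value, so it is a sketch rather than a general HTML parser.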
 

Author Comment

by:Nottingham
Alan,

Works like a treat!

Thanks for your help!!

Really appreciate the speed as well!
 
LVL 3

Expert Comment

by:Boris Aranovich
How's that?

$html =<<<HTMLEOT
<html>
      <head><title>asdasda</title></head>
<body bgcolor="white" link="blue">
<p>Hello world!</p>
<div align="center"><strong>I AM WORLD!!!</strong></div>
</body>
</html>
HTMLEOT;

preg_match('/(<body.*>)(.*)<\\/body>/Usi', $html, $matches);
$body_tag = $matches[1];
$body = $matches[2];

echo "<u><b>Body tag:</b></u><br />\n<pre>";
echo htmlspecialchars($body_tag);
echo "</pre><hr />\n<u><b>Content:</b></u><br />\n<pre>";
echo htmlspecialchars($body);
echo '</pre>';
 
LVL 3

Expert Comment

by:Boris Aranovich
You already accepted Alan's answer, which is good :)
But also try the piece of code I wrote. The only real difference is that I added the 's' modifier to the regex, which lets the dot match newlines, so you can have the code on multiple lines and you don't need to convert \n and/or \r to anything...
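
A quick way to see the effect of the 's' modifier is to run the same pattern with and without it on a multi-line string. The small page below is made up for this sketch.

```php
<?php
// Illustration only: a small multi-line page made up for this sketch.
$html = "<html>\n<body bgcolor=\"white\">\n<p>Hello world!</p>\n</body>\n</html>";

// Without "s" the dot stops at "\n", so the pattern cannot reach </body>.
$without_s = preg_match('/(<body.*>)(.*)<\/body>/Ui', $html, $m1);

// With "s" the dot also matches "\n" and the same pattern succeeds.
$with_s = preg_match('/(<body.*>)(.*)<\/body>/Usi', $html, $m2);

echo $without_s, " ", $with_s, "\n";
echo trim($m2[2]), "\n";
```

The captured content keeps its leading and trailing newlines, so trim() is used here only to tidy the output.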
 

Author Comment

by:Nottingham
Nomaed,

Thanks very much for adding that.

I thought it must be possible!

Sorry you came in so late. I'll give you some points as well if you want!?

Cheers

Dan

 
LVL 3

Expert Comment

by:Boris Aranovich
Nah, that's ok :) He deserved them, he was quicker on the trigger, and his answer is correct :)
I get my points from the satisfaction of success :p
Glad to help anyhow :)
 
LVL 9

Expert Comment

by:AlanJDM
Nomaed, You also helped me because I was unaware of the 's' modifier. Thanks.

Alan
 
LVL 3

Expert Comment

by:Boris Aranovich
np :)