• Status: Solved

Stripping and cleaning HTML and more - Delete text until you hit

I'm trying to extract the content of an HTML page, which means I can ignore the words in the <div area and in <style type. Is there a way I can clean out the area between those tags, and between other tags that are not relevant to the content of the page?
I'm trying to do it in PHP.

Nura111 Asked:
1 Solution
 
a1j Commented:
lynx --dump http://site/uri
 
Nura111 (Author) Commented:
Sorry, could you please explain what that is?
 
a1j Commented:
lynx is a Linux utility (a text-mode web browser). This command will basically convert HTML into text. You can call it from PHP, grab the output, and be done with it.
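
Calling lynx from PHP, as described above, might look roughly like the sketch below. It assumes the lynx binary is installed and on the PATH of the web server user. The function names build_lynx_command() and html_url_to_text() are made up for illustration, and the flags are written in lynx's single-dash form (-dump, plus -nolist to leave out the list of links that lynx otherwise appends).

```php
<?php
// Sketch only: render a page to plain text by shelling out to lynx.
// Assumes lynx is installed; the helper names are hypothetical.

/**
 * Build the shell command for a given URL.
 * escapeshellarg() quotes the URL so it cannot inject extra shell syntax.
 */
function build_lynx_command(string $url): string
{
    // -dump writes the rendered page to stdout; -nolist omits the link list.
    return 'lynx -dump -nolist ' . escapeshellarg($url);
}

/**
 * Run lynx and return its output, or null if nothing came back
 * (for example because lynx is missing or the request failed).
 */
function html_url_to_text(string $url): ?string
{
    $output = shell_exec(build_lynx_command($url));
    return is_string($output) ? $output : null;
}

// Usage (needs network access and lynx):
// echo html_url_to_text('http://site/uri');
```

Because this starts an external process per page, it is probably better suited to occasional jobs than to something that runs on every request.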
Mohamed Abowarda (Software Engineer) Commented:
 
Ali Kayahan (Full Stack Developer) Commented:
Here is an easy-to-use HTML parser: http://simplehtmldom.sourceforge.net/ After including the class, you can get the plain text with:

echo file_get_html('http://www.google.com/')->plaintext;

 
Ray Paseur Commented:
Probably some combination of strpos() and strip_tags() will work.
http://us.php.net/manual/en/function.strpos.php
http://php.net/manual/en/function.strip-tags.php

Sometimes regular expressions are useful.
http://php.net/manual/en/function.preg-match.php

If you want to post an example of the URL or the HTML you want to process, perhaps we can show you a more concrete solution. ~Ray
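
As a rough illustration of combining a regular expression with strip_tags(): strip_tags() removes the tags but keeps whatever text sat between them, so the contents of a <style> or <script> block would survive unless they are removed first. The sketch below uses preg_replace() (a sibling of the preg_match() function linked above) to drop those blocks whole. The name strip_non_content() is hypothetical, and a regex like this is only a reasonable fit for simple, non-nested blocks.

```php
<?php
// Sketch only: remove <style>/<script> blocks (contents included),
// then strip the remaining tags. The function name is hypothetical.

function strip_non_content(string $html): string
{
    // Flags: i = case-insensitive, s = "." also matches newlines.
    // The lazy .*? stops at the first closing tag that matches the opener (\1).
    $cleaned = preg_replace('#<(style|script)\b[^>]*>.*?</\1\s*>#is', '', $html);

    // preg_replace() returns null on a regex engine error; fall back to the input.
    if ($cleaned === null) {
        $cleaned = $html;
    }

    // strip_tags() removes the remaining tags but keeps the text between them.
    return strip_tags($cleaned);
}

$sample = '<div><style type="text/css">.style2 {color:#FF0000;}</style>'
        . '<p>Hello <b>world</b></p></div>';

echo strip_non_content($sample); // prints "Hello world"
```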
 
Nura111 (Author) Commented:
Do I have to use file_get_html on a URL? I want to use it on a string. I tried to do that, but it's not working.
 
Nura111 (Author) Commented:
str_replace will not help me in that case, because I don't want to replace only what I find, but also what's between.
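
One way to remove what lies between two markers, and not only the markers themselves, is to locate them with strpos() and splice the string with substr(). The sketch below is a minimal version under that assumption: remove_between() is a hypothetical helper name, and it handles only the first occurrence of the pair.

```php
<?php
// Sketch only: delete everything from $start up to and including $end.
// remove_between() is a hypothetical helper; it handles the first match only.

function remove_between(string $text, string $start, string $end): string
{
    $from = strpos($text, $start);
    if ($from === false) {
        return $text; // start marker not found: nothing to remove
    }

    // Search for the end marker only after the start marker.
    $to = strpos($text, $end, $from + strlen($start));
    if ($to === false) {
        return $text; // no closing marker: leave the text untouched
    }

    // Keep what precedes the start marker and what follows the end marker.
    return substr($text, 0, $from) . substr($text, $to + strlen($end));
}

echo remove_between('a<style>x</style>b', '<style', '</style>'); // prints "ab"
```

Calling it in a loop until the string stops changing would cover repeated blocks.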
 
gwkg Commented:
"Do I have to use file_get_html on a URL? I want to use it on a string. I tried to do that, but it's not working."
Try $html = str_get_html('<html><body>Hello!</body></html>');
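
Note that str_get_html() belongs to the Simple HTML DOM library linked earlier, so it is only defined once that library's file has been included. For comparison, here is a sketch that parses an HTML string with PHP's bundled DOM extension instead. This is a swapped-in technique, not the library's API, and html_string_to_text() is a hypothetical name. It also assumes plain ASCII or otherwise correctly declared input, since loadHTML() guesses the encoding when none is given.

```php
<?php
// Sketch only: extract text from an HTML string with the bundled DOM extension.
// This is NOT Simple HTML DOM; html_string_to_text() is a hypothetical helper.

function html_string_to_text(string $html): string
{
    if (trim($html) === '') {
        return ''; // loadHTML() rejects empty input
    }

    $doc = new DOMDocument();

    // Real-world markup is rarely valid; collect parser warnings instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // Remove elements whose text is not page content.
    foreach (['style', 'script'] as $tag) {
        // Copy the live node list to an array first, because removing nodes changes it.
        foreach (iterator_to_array($doc->getElementsByTagName($tag)) as $node) {
            $node->parentNode->removeChild($node);
        }
    }

    $body = $doc->getElementsByTagName('body')->item(0);
    return $body === null ? '' : trim($body->textContent);
}

echo html_string_to_text(
    '<html><body><style>p {color:red;}</style><p>Hello!</p></body></html>'
);
```

Compared with strip_tags(), a parser-based approach makes it easier to drop whole elements by name or to keep only one part of the page.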
 
Nura111 (Author) Commented:
What is the difference between using that and strip_tags?
 
Ray Paseur Commented:
Let me try this one again... If you want to post an example of the URL or the HTML you want to process, perhaps we can show you a more concrete solution. ~Ray
 
Nura111 (Author) Commented:
It's just a regular format, nothing special (there is an example below the question).
I'm new to PHP and I didn't realize that strip_tags removes the content. From my understanding I can use it, right?
I'm just trying to understand whether I'll be missing something if I use strip_tags and not str_get_html. Is there a difference?
Also, when I try to use str_get_html I get an error message that it's an undefined function. Should I include something to use it?

For example:



<div id="article_text">

 

<style type="text/css">
<!--

.style2 {color:#FF0000;}
.addthis_toolbox.addthis_pill_combo a {float:left;}.addthis_toolbox.addthis_pill_combo a.addthis_button_tweet,.addthis_toolbox.addthis_pill_combo a.addthis_counter {}.addthis_button_compact .at15t_compact {margin-right:4px;float:left;}
-->
</style>


<div class="addthis_toolbox addthis_pill_combo" style="margin-top:3px;margin-bottom:5px;float:right;padding:5px;">
<a rel="nofollow" class="addthis_button_facebook_like"></a><a rel="nofollow"  href="http://twitter.com/share" class="twitter-share-button">Tweet</a></div>

<div style="padding-bottom:5px;">

     <b>Posted By -  &nbsp;<a rel="nofollow"  href="mailto://www.11alive.com/rss/news11alive@yahoo.com?subject=viewer%20question%20about%20an%20article&body=Link:http://www.11alive.com/rss/rss_story.aspx?storyid=181673">11Alive&nbsp;Staff</a></b><br>
   
<br>Last Updated On: &nbsp;3/9/2011 4:21:45 PM</div>
<div style="padding-bottom:10px;">
<p><p>ATLANTA -- New testimony is expected Wednesday in the trial of the man accused of murdering a popular Grant Park bartender. </p>
<p>Jonathan Redding is <a rel="nofollow"  href="http://www.11alive.com/rss/rss_story.aspx?storyid=130020"><strong>accused of fatally shooting</strong> </a>John Henderson at the old <a rel="nofollow"  href="http://atlanta.metromix.com/bars-and-clubs/pub/closed-standard-food-and-cabbagetown/338905/content"><strong>Standard bar</strong></a> on Memorial Drive in January 2009. He was 17 years old at the time and a suspected member of 30 Deep, a gang notorious for murders and smash-and-grab robberies.</p>
<p>On the <a rel="nofollow"  href="http://www.11alive.com/rss/rss_story.aspx?storyid=125388"><strong>night of the shooting</strong></a>, police said four to five armed men broke through the bar's glass door as Henderson and a female bartender were getting ready to close up. Police said even after the men were given money, Henderson was shot in the head and legs; he later died at Grady Hospital. The female bartender was able to hide when the robbers weren't looking. </p>
<p>The crime sparked a major community effort to find the suspects.</p>
<p>The trial was pushed back last month after a <a rel="nofollow"  href="http://www.11alive.com/rss/rss_story.aspx?storyid=176753"><strong>key witness was shot</strong> </a>before jury selection. The witness lost a leg in the attack, but may still testify. </p>

</div>
</div>
<div style="clear:both;height:10px;"></div>
<div class="sidebar-photo">
                  <div id="articleimages_imageRotator_wrapper">
      <div id="articleimages_imageRotator_Div" style="overflow:hidden;height:310px;width:320px;"><div id="articleimages_imageRotator_FrameContainer" style="width:320px;"><div id="articleimages_imageRotator_frame0" style="overflow:hidden;">
             <img src="http://www.11alive.com/genthumb/genthumb.ashx?e=3&h=240&w=320&i=/assetpool/images/090508012059_ReddingResizr.jpg"/>
                   <p>Jonathan Redding <i>(Courtesy Atlanta Police)</i></p>
      </div></div></div>
</div>

            </div>
 
Ray Paseur Commented:
One of the good ways to learn how PHP works is to set up some tests and see what data comes out from using different functions.  Here is an example of strip_tags() against the data above.

http://www.laprbass.com/RAY_temp_nura111.php

Outputs something like this (you can tighten this up with trim() and certain regular expressions to remove excess whitespace):
 







Tweet



     Posted By -  &nbsp;11Alive&nbsp;Staff
   
Last Updated On: &nbsp;3/9/2011 4:21:45 PM

ATLANTA -- New testimony is expected Wednesday in the trial of the man accused of murdering a popular Grant Park bartender.
Jonathan Redding is accused of fatally shooting John Henderson at the old Standard bar on Memorial Drive in January 2009. He was 17 years old at the time and a suspected member of 30 Deep, a gang notorious for murders and smash-and-grab robberies.
On the night of the shooting, police said four to five armed men broke through the bar's glass door as Henderson and a female bartender were getting ready to close up. Police said even after the men were given money, Henderson was shot in the head and legs; he later died at Grady Hospital. The female bartender was able to hide when the robbers weren't looking.
The crime sparked a major community effort to find the suspects.
The trial was pushed back last month after a key witness was shot before jury selection. The witness lost a leg in the attack, but may still testify.





                 
     
             
                   Jonathan Redding (Courtesy Atlanta Police)
     


           
<?php // RAY_temp_nura111.php
error_reporting(E_ALL);
echo "<pre>";

$htm = <<<HTM
<div id="article_text">

 

<style type="text/css">
<!--

.style2 {color:#FF0000;}
.addthis_toolbox.addthis_pill_combo a {float:left;}.addthis_toolbox.addthis_pill_combo a.addthis_button_tweet,.addthis_toolbox.addthis_pill_combo a.addthis_counter {}.addthis_button_compact .at15t_compact {margin-right:4px;float:left;}
-->
</style>


<div class="addthis_toolbox addthis_pill_combo" style="margin-top:3px;margin-bottom:5px;float:right;padding:5px;">
<a rel="nofollow" class="addthis_button_facebook_like"></a><a rel="nofollow"  href="http://twitter.com/share" class="twitter-share-button">Tweet</a></div>

<div style="padding-bottom:5px;">

     <b>Posted By -  &nbsp;<a rel="nofollow"  href="mailto://www.11alive.com/rss/news11alive@yahoo.com?subject=viewer%20question%20about%20an%20article&body=Link:http://www.11alive.com/rss/rss_story.aspx?storyid=181673">11Alive&nbsp;Staff</a></b><br>
   
<br>Last Updated On: &nbsp;3/9/2011 4:21:45 PM</div>
<div style="padding-bottom:10px;">
<p><p>ATLANTA -- New testimony is expected Wednesday in the trial of the man accused of murdering a popular Grant Park bartender. </p>
<p>Jonathan Redding is <a rel="nofollow"  href="http://www.11alive.com/rss/rss_story.aspx?storyid=130020"><strong>accused of fatally shooting</strong> </a>John Henderson at the old <a rel="nofollow"  href="http://atlanta.metromix.com/bars-and-clubs/pub/closed-standard-food-and-cabbagetown/338905/content"><strong>Standard bar</strong></a> on Memorial Drive in January 2009. He was 17 years old at the time and a suspected member of 30 Deep, a gang notorious for murders and smash-and-grab robberies.</p>
<p>On the <a rel="nofollow"  href="http://www.11alive.com/rss/rss_story.aspx?storyid=125388"><strong>night of the shooting</strong></a>, police said four to five armed men broke through the bar's glass door as Henderson and a female bartender were getting ready to close up. Police said even after the men were given money, Henderson was shot in the head and legs; he later died at Grady Hospital. The female bartender was able to hide when the robbers weren't looking. </p>
<p>The crime sparked a major community effort to find the suspects.</p>
<p>The trial was pushed back last month after a <a rel="nofollow"  href="http://www.11alive.com/rss/rss_story.aspx?storyid=176753"><strong>key witness was shot</strong> </a>before jury selection. The witness lost a leg in the attack, but may still testify. </p>

</div>
</div>
<div style="clear:both;height:10px;"></div>
<div class="sidebar-photo">
                  <div id="articleimages_imageRotator_wrapper">
      <div id="articleimages_imageRotator_Div" style="overflow:hidden;height:310px;width:320px;"><div id="articleimages_imageRotator_FrameContainer" style="width:320px;"><div id="articleimages_imageRotator_frame0" style="overflow:hidden;">
             <img src="http://www.11alive.com/genthumb/genthumb.ashx?e=3&h=240&w=320&i=/assetpool/images/090508012059_ReddingResizr.jpg"/>
                   <p>Jonathan Redding <i>(Courtesy Atlanta Police)</i></p>
      </div></div></div>
</div>

            </div>
HTM;

// MAN PAGE: http://us3.php.net/manual/en/function.strip-tags.php
$new = strip_tags($htm);

// SHOW THE WORK PRODUCT
echo htmlentities($new);
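
The whitespace cleanup mentioned above (trim() plus a few regular expressions) could be sketched as follows. tidy_text() is a hypothetical helper, not part of the script above. It also decodes entities such as &nbsp;, which strip_tags() leaves in place.

```php
<?php
// Sketch only: tidy the text that strip_tags() leaves behind.
// tidy_text() is a hypothetical helper name.

function tidy_text(string $text): string
{
    // Turn entities such as &nbsp; and &amp; back into characters.
    $text = html_entity_decode($text, ENT_QUOTES | ENT_HTML5, 'UTF-8');

    // &nbsp; decodes to U+00A0 (non-breaking space); make it an ordinary space.
    $text = str_replace("\u{00A0}", ' ', $text);

    $lines = [];
    foreach (preg_split('/\R/', $text) as $line) {
        // Collapse runs of spaces and tabs, then trim the ends of the line.
        $line = trim(preg_replace('/[ \t]+/', ' ', $line) ?? $line);
        if ($line !== '') {
            $lines[] = $line; // keep only non-empty lines
        }
    }

    return implode("\n", $lines);
}

echo tidy_text("  Posted By -  &nbsp;11Alive&nbsp;Staff\n\n\n   \nLast Updated On: &nbsp;3/9/2011  ");
```

With the script above, this would be applied as tidy_text(strip_tags($htm)) before the result is passed to htmlentities().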

 
Nura111 (Author) Commented:
Thank you. Yes, I know that is the output; I tried it on the text (only I got it without the <br>).
I read the warning on php.net that it can sometimes cause problems if the HTML document you are working on isn't tagged properly, and that there can be other issues.
