Solved

Fetching a webpage....

Posted on 2004-12-01
Last Modified: 2006-11-17
Hi guys,
I'm trying to fetch a webpage, then chop it up and set some variables based on it.

I'm trying to fetch:
http://www.dmoz.org/Regional/Europe/United_Kingdom/Health/National_Health_Service/

Then I want to set some variables ($url, $title, $description) based on the links in the main part of the page, and loop through and do stuff with them.

Can anyone help?

Alicia
0
Comment
Question by:fox_statton
17 Comments
 
LVL 18

Expert Comment

by:arantius
ID: 12720903
0
 
LVL 9

Expert Comment

by:keteracel
ID: 12721517
I wish this was worth more than 125 points for the work that went into it... but here goes:

<?php
    function explodeHref($searchString) {
      $fp = fsockopen("www.dmoz.org", 80, $errno, $errstr, 30);
      $file = "";
      
      if (!$fp) {
         echo "$errstr ($errno)<br />\n";
      } else {
         $out = "GET /$searchString HTTP/1.1\r\n";
         $out .= "Host: www.dmoz.org\r\n";
         $out .= "User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.2.1) Gecko/20030225\r\n";
         $out .= "Connection: Close\r\n\r\n";
        
         fwrite($fp, $out);
        
         while (!feof($fp)) {
               $file .= fgets($fp, 128);
         }
            fclose($fp);
      }
      
      $sections = explode("<li>", $file);
      $results = array();
      $i = 0;

      foreach ($sections as $section) {
            if (substr($section, 0, 8) == "<a href=") {
                  $section = preg_replace("/<b>|<i>|<br>|<\/b>|<\/i>/", " ", substr($section, 9));

                  //echo $section . "<BR>";

                  $entry = array();
                  $entry["url"]         = substr($section, 0, strpos($section, "\""));

                  $section = substr($section, strpos($section, "\"") + 2);
                  //echo $section . "<BR>";

                  $entry["title"]       = substr($section, 0, strpos($section, "<"));

                  $section = substr(strstr($section, "</a>"), 4);

                  if (strpos($section, "<") > 0) {
                        $entry["description"] = substr($section, 0, strpos($section, "<"));
                  }
                  else {
                        $entry["description"] = $section;
                  }
                  $results[$i] = $entry;
                  $i++;
            }
      }
      return $results;
    }

    $arr = explodeHref("Regional/Europe/United_Kingdom/Health/National_Health_Service/");

//BELOW IS AN EXAMPLE OF HOW TO USE THE COLLECTED DATA

?>
<table style="border: none; background: black;">
      <tr style="color: white; font-weight: bold;">
            <td>URL</td>
            <td>Title</td>
            <td>Description</td>
      </tr>
<?php
      $i = 0;

      foreach($arr as $record) {
            if ($i % 2 == 0) {
                  echo "<tr style=\"background: white; color: black;\">";
            }
            else {
                  echo "<tr style=\"background: #ddeeff; color: black;\">";
            }
            echo "<td>" . $record['url'] . "</td><td>" . $record['title'] . "</td><td>" . $record['description'] . "</td></tr>";
            $i++;
      }            

?>
</table>


you can see an example of this script working at http://www.keteracel.com/test/dmoz.php

regards,

www.keteracel.com
0
 

Author Comment

by:fox_statton
ID: 12723872
Hi Keteracel,
Thanks for this, it's great.

Just one thing: I don't want it to bring the categories in, just the actual links on the page. Is there any way to do this?

Thanks,

Alicia
0
 
LVL 9

Expert Comment

by:keteracel
ID: 12724154
we can do a check when outputting the tables to see if the description is long enough... say it should be at least 10 characters:

 $arr = explodeHref("Regional/Europe/United_Kingdom/Health/National_Health_Service/");

//BELOW IS AN EXAMPLE OF HOW TO USE THE COLLECTED DATA

?>
<table style="border: none; background: black;">
     <tr style="color: white; font-weight: bold;">
          <td>URL</td>
          <td>Title</td>
          <td>Description</td>
     </tr>
<?php
     $i = 0;

     foreach($arr as $record) {
          if (strlen($record['description']) < 10) continue;

          if ($i % 2 == 0) {
               echo "<tr style=\"background: white; color: black;\">";
          }
          else {
               echo "<tr style=\"background: #ddeeff; color: black;\">";
          }
          echo "<td>" . $record['url'] . "</td><td>" . $record['title'] . "</td><td>" . $record['description'] . "</td></tr>";
          $i++;
     }          

?>
</table>
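If you would rather drop the short entries before the display loop instead of inside it, something like the following might do. It is only a sketch: filterByDescription() is a made-up helper name, the two records are invented, and it assumes (as the check above does) that category entries can be told apart by a short description.

```php
<?php
// Keep only records whose description has at least $minLength characters.
// filterByDescription() is a hypothetical helper, not part of the script above.
function filterByDescription(array $records, int $minLength): array
{
    $kept = array_filter($records, function (array $record) use ($minLength) {
        return strlen($record['description']) >= $minLength;
    });
    // array_filter() preserves keys, so reindex from 0 for the display loop.
    return array_values($kept);
}

// Invented records in the same shape explodeHref() returns.
$arr = [
    ['url' => 'http://example.org/trusts/', 'title' => 'Trusts', 'description' => '(12)'],
    ['url' => 'http://example.org/', 'title' => 'Example Clinic', 'description' => '- A made-up clinic used for illustration.'],
];
$links = filterByDescription($arr, 10);
```

The display loop can then run over $links without the strlen() check.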
0
 
LVL 9

Expert Comment

by:keteracel
ID: 12724227
ok, I also had to remove the &nbsp; entities from the markup, as each one takes up 6 characters, and increase the minimum to 15 to make sure the sections are caught (plus a few other small changes):

<?php
    function explodeHref($site, $searchString) {
      $fp = fsockopen($site, 80, $errno, $errstr, 30);
      $file = "";

      if (!$fp) {
         echo "$errstr ($errno)<br />\n";
      } else {
         $out = "GET /$searchString HTTP/1.1\r\n";
         $out .= "Host: $site\r\n";
         $out .= "User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.2.1) Gecko/20030225\r\n";
         $out .= "Connection: Close\r\n\r\n";

         fwrite($fp, $out);

         while (!feof($fp)) {
               $file .= fgets($fp, 128);
         }
            fclose($fp);
      }

      $sections = explode("<li>", $file);
      $results = array();
      $i = 0;

      foreach ($sections as $section) {
            if (substr($section, 0, 8) == "<a href=") {
                  $section = preg_replace("/&nbsp;|<b>|<i>|<br>|<\/b>|<\/i>/", " ", substr($section, 9));

                  //echo $section . "<BR>";

                  $entry = array();
                  $entry["url"]         = trim(substr($section, 0, strpos($section, "\"")));

                  $section = substr($section, strpos($section, "\"") + 2);
                  //echo $section . "<BR>";

                  $entry["title"]       = trim(substr($section, 0, strpos($section, "<")));

                  $section = substr(strstr($section, "</a>"), 4);

                  if (strpos($section, "<") > 0) {
                        $entry["description"] = trim(substr($section, 0, strpos($section, "<")));
                  }
                  else {
                        $entry["description"] = trim($section);
                  }
                  $results[$i] = $entry;
                  $i++;
            }
      }
      return $results;
    }

    $arr = explodeHref("www.dmoz.org", "Regional/Europe/United_Kingdom/Health/National_Health_Service/");

?>
<table style="border: none; background: black;">
      <tr style="color: white; font-weight: bold;">
            <td>URL</td>
            <td>Title</td>
            <td>Description</td>
      </tr>
<?php
      $i = 0;

      foreach($arr as $record) {
            if (strlen($record['description']) < 15) continue;
            if ($i % 2 == 0) {
                  echo "<tr style=\"background: white; color: black;\">";
            }
            else {
                  echo "<tr style=\"background: #ddeeff; color: black;\">";
            }
            echo "<td>" . $record['url'] . "</td><td>" . $record['title'] . "</td><td>" . $record['description'] . "</td></tr>";
            $i++;
      }

?>
</table>
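To see what the chopping inside explodeHref() does, here is a rough sketch that runs the same steps on a single piece of the exploded page. parseListItem() is a made-up name and the sample list item is invented; the real dmoz markup may differ in detail.

```php
<?php
// Mirrors the chopping steps of explodeHref() for one piece produced by
// explode("<li>", ...). Returns null when the piece is not a link.
// parseListItem() is a hypothetical helper used only for illustration.
function parseListItem(string $section): ?array
{
    if (substr($section, 0, 8) != "<a href=") {
        return null;
    }
    // Drop the leading `<a href="` (9 characters) and blank out inline markup.
    $section = preg_replace('~&nbsp;|</?b>|</?i>|<br>~', ' ', substr($section, 9));

    $entry = array();
    // Everything up to the closing quote is the URL.
    $quote = strpos($section, '"');
    $entry['url'] = trim(substr($section, 0, $quote));

    // Skip the `">` that ends the opening tag; the title runs to the next `<`.
    $section = substr($section, $quote + 2);
    $entry['title'] = trim(substr($section, 0, strpos($section, '<')));

    // Whatever follows `</a>` is the description, cut at the next tag if any.
    $section = substr(strstr($section, '</a>'), 4);
    $next = strpos($section, '<');
    $entry['description'] = trim($next > 0 ? substr($section, 0, $next) : $section);

    return $entry;
}

// Invented list item, shaped like one piece of the exploded page.
$entry = parseListItem("<a href=\"http://example.org/\">Example Clinic</a> - A made-up clinic.\n");
print_r($entry);
```

Every step relies on the piece starting with `<a href="` directly after the `<li>`, which is why the approach is tied so closely to one page layout.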
0
 

Author Comment

by:fox_statton
ID: 12724236
Hi, I made those changes, but the categories are still being displayed, any ideas why?
0
 

Author Comment

by:fox_statton
ID: 12724370
Hi,
thanks, the second version you posted seems to be working!

Just so I can understand how the script works: if I wanted to change it to take results from, say, business.com,

which areas would I change? The explode on <li> would still apply, wouldn't it?
0
 
LVL 9

Expert Comment

by:keteracel
ID: 12724574
no, unfortunately it wouldn't.. this solution is VERY specific to the page you asked for! The following solution is much more general:

<?php
      function loadFromWeb($site, $page) {
            $fp = fsockopen($site, 80, $errno, $errstr, 30);
            $file = "";

            if (!$fp) {
                     echo "$errstr ($errno)<br />\n";
            } else {
                     $out = "GET /$page HTTP/1.1\r\n";
                     $out .= "Host: $site\r\n";
                     $out .= "User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.2.1) Gecko/20030225\r\n";
                     $out .= "Connection: Close\r\n\r\n";

                     fwrite($fp, $out);

                     while (!feof($fp)) {
                           $file .= fgets($fp, 128);
                     }
                        fclose($fp);
            }
            return $file;
      }

    function getResults($site, $filename, $regexp) {
            $file = loadFromWeb($site, $filename);
            $results = array();
            
            preg_match_all($regexp, $file, $results);
            
            return $results;
    }

      function displayResults($results, $urlloc, $titleloc, $descloc, $mindesc) {
?>
<table style="border: none; background: black;">
      <tr style="color: white; font-weight: bold;">
            <td>URL</td>
            <td>Title</td>
            <td>Description</td>
      </tr>
<?php
            $i = 0;
            $j = 0;
            
            foreach ($results[1] as $dontcare) {
                  
                  if (strlen($results[$descloc][$j]) < $mindesc) {
                        $j++;
                        continue;
                  }
                  
                  if ($i % 2 == 0) {
                        echo "<tr style=\"background: white; color: black;\">";
                  }
                  else {
                        echo "<tr style=\"background: #ddeeff; color: black;\">";
                  }
                  echo "<td>" . $results[$urlloc][$j] . "</td><td>" . $results[$titleloc][$j] . "</td><td>" . $results[$descloc][$j] . "</td></tr>";
                  $i++;
                  $j++;
            }
?>
</table>
<?php
      }

$arr = getResults("www.dmoz.org", "Regional/Europe/United_Kingdom/Health/National_Health_Service/", "|<li><a href=\"(.*?)\">(.*?)</a>(.*?)[\\r\\n]|");
displayResults($arr, 1, 2, 3, 15);
?>
0
 
LVL 9

Expert Comment

by:keteracel
ID: 12724631
In order to change it for use on another page you would have to change the regular expression and possibly the url, title and description locations.

getResults(String website, String webpage, String regularexpression) gets the details into an array. The regular expression will be specific to the particular webpage you wish to extract details from. To figure out what the regexp should be, look at the source of the page you want to get the results from and look for commonalities between results. The sections in brackets in my regexp are capture groups: each one marks a part of the matched text that you wish to pull out separately. In this case the first bracketed section is the url, the second is the title and the third is the description. If the order is different on another page, simply change the numbers passed to the display function.

I'll show an example of getting info from another page shortly...
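To make the bracket numbering concrete, here is a small sketch that runs the dmoz regexp over a made-up two-line fragment instead of a fetched page. The fragment and its URLs are invented for illustration.

```php
<?php
// Invented fragment standing in for the fetched page.
$html = "<li><a href=\"http://a.example/\">Site A</a> - First description\n"
      . "<li><a href=\"http://b.example/\">Site B</a> - Second description\n";

// Same regexp as passed to getResults() for dmoz.
$regexp = "|<li><a href=\"(.*?)\">(.*?)</a>(.*?)[\\r\\n]|";
preg_match_all($regexp, $html, $results);

// With the default ordering, $results[0] holds the whole matches and
// $results[1], [2], [3] hold the first, second and third bracketed sections.
foreach ($results[1] as $j => $url) {
    echo $url, " | ", $results[2][$j], " | ", trim($results[3][$j]), "\n";
}
```

The numbers handed to displayResults() are simply these indexes into $results.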
0
 
LVL 9

Expert Comment

by:keteracel
ID: 12725702
ok, I've had to make a change to the code, as some sites (like business.com) use chunked transfer encoding when responding to HTTP/1.1 requests. To fix this I've changed the line which reads:

$out = "GET /$page HTTP/1.1\r\n";

to:

$out = "GET /$page HTTP/1.0\r\n";

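This stops chunk-size information being placed in $file, since chunked transfer encoding is an HTTP/1.1 feature that a server should not use when replying to an HTTP/1.0 request.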

then, to display results from business.com use the following function calls:

$arr = getResults("www.business.com", "search/rslt_default.asp?vt=all&query=nhs&search=Search&type=web", "|<li.*?href=\"(.*?)\">(.*?)</a>.*?</font><br>(.*?)<br>|");
displayResults($arr, 1, 2, 3, 15);
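For anyone who needs to stay on HTTP/1.1, the chunk information mentioned above could in principle be stripped by hand instead. The following is a rough sketch only, not part of the script: decodeChunked() is a made-up name, and it assumes the response headers have already been cut off so that $body starts at the first chunk-size line.

```php
<?php
// Decode a chunked body: each chunk is "<hex size>\r\n<data>\r\n",
// and a chunk of size 0 ends the body.
// decodeChunked() is a hypothetical helper used only for illustration.
function decodeChunked(string $body): string
{
    $decoded = '';
    $offset = 0;
    $length = strlen($body);

    while ($offset < $length) {
        $lineEnd = strpos($body, "\r\n", $offset);
        if ($lineEnd === false) {
            break; // truncated input: return what has been decoded so far
        }
        // The size line may carry ";extension" text after the hex number.
        $sizeLine = substr($body, $offset, $lineEnd - $offset);
        $size = (int) hexdec(trim(explode(';', $sizeLine)[0]));
        if ($size === 0) {
            break; // last chunk
        }
        $decoded .= substr($body, $lineEnd + 2, $size);
        // Move past the chunk data and the CRLF that follows it.
        $offset = $lineEnd + 2 + $size + 2;
    }
    return $decoded;
}

// Invented two-chunk body.
$sample = "4\r\nWiki\r\n5\r\npedia\r\n0\r\n\r\n";
$decodedSample = decodeChunked($sample);
```

Switching the request to HTTP/1.0 as described above is the simpler route; this only shows what the chunk information looks like.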
0
 
LVL 9

Expert Comment

by:keteracel
ID: 12725842
For some sites, like google, it may be necessary to add more processing after running the regexp. This is because google has several different types of result. For instance, when a foreign page is part of the results, google puts a TRANSLATE link after the title. Unfortunately, as there is a <font size=-1> tag in front of both this TRANSLATE link and the description, you cannot get the description from a single regexp whilst still being able to extract normal lines.... so....

use the following function calls:

$arr = getResults("www.google.com", "search?hl=en&q=train+spotting&btnG=Google+Search", "/<!--m--><a href=(.*?)>(.*?)<\\/a>.*?<font size=-1>(.*?)<br><font/s");
displayResults($arr, 1, 2, 3, 0);

but then in the displayResults() function you'll need to replace the bit that goes:

...<td>" . $results[$descloc][$j] . "</td>...

with:

...<td>" . preg_replace("/- \\[.*?<font size=-1>/", "", $results[$descloc][$j]) . "</td>...

Unfortunately, as each search engine is different, you'll have to make quite specific functions for each page. The general regexp functions I have given you will work if the search engine responds with results that ALL take the same form. If they take different forms the regexp method is a lot more difficult and a workaround is easier (as I did in the code above!)
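To show what the preg_replace() workaround above does to a captured description, here is a small sketch. stripTranslateLink() is a made-up wrapper name and both sample strings are invented; they only imitate the general shape described above, not google's exact markup.

```php
<?php
// Remove a leading "- [ ...translate link... ]" run, up to and including the
// <font size=-1> tag that sits in front of the real description.
// stripTranslateLink() is a hypothetical wrapper around the call shown above.
function stripTranslateLink(string $description): string
{
    return preg_replace("/- \\[.*?<font size=-1>/", "", $description);
}

// Invented captured descriptions: one with a translate link, one without.
$foreign = "- [ <a href=/translate?u=x>Translate this page</a> ]<br><font size=-1>Fan site for train spotters.";
$normal = "Timetables and photos.";

echo stripTranslateLink($foreign), "\n";
echo stripTranslateLink($normal), "\n";
```

Descriptions without the translate link pass through unchanged, which is what lets one display function handle both kinds of result.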

regards,

www.keteracel.com
0
 
LVL 2

Expert Comment

by:alexmay
ID: 12725995
Can't believe you keep asking keteracel so many follow up questions and only put the points up to 250.
This much work is easily worth 500!
0
 
LVL 9

Expert Comment

by:keteracel
ID: 12726038
I got 500 for doing a simple one of these for google some time ago... But I'm in a good mood :-D
0
 

Author Comment

by:fox_statton
ID: 12735213
Hi,
thanks for this, I really do appreciate your help.

At the moment I'm just trying to understand how the regular expression part works.

The two main places I'm looking at are Yahoo and Business.com.

Using your business.com example I tried to get
http://www.business.com/directory/accounting/budgeting_and_forecasting/

but what it retrieved were two links that didn't appear to be on the page.

Also, I was trying it with:
http://dir.yahoo.com/Business_and_Economy/Business_to_Business/Education/Professional_Development/

But it didn't seem to work. If I understand correctly, shouldn't we be splitting on the <li> tags and getting their contents on all pages?
0
 

Author Comment

by:fox_statton
ID: 12735489
Hi,
something weird I noticed is that the original script won't parse pages like:

http://dmoz.org/Business/Marketing_and_Advertising/Public_Relations/Copywriters/

Is this because there are too many directories?
0
 
LVL 9

Expert Comment

by:keteracel
ID: 12753670
Please note that I did say that these solutions are and MUST be very specific. Business.com seems to have gone well out of their way to make the ripping of their search results difficult. The following will get the info from the http://www.business.com/directory/accounting/budgeting_and_forecasting/ page:

$arr = getResults("www.business.com", "directory/accounting/budgeting_and_forecasting/", "/<li class=snavh>\\s+(<b>|)<a href=\"(.*?)\".*?>(.*?)<\\/a>(<\\/b>|)\\s*<br>(.*?)(<font|<a)/s");
displayResults($arr, 2, 3, 5, 15);

As you can see, the regular expression has had to be modified quite significantly!
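The reason the numbers passed to displayResults() become 2, 3 and 5 is that every pair of brackets counts as a group, including the ones that only cope with the optional <b> tags. The sketch below runs the regexp on an invented fragment (the real business.com markup may differ), and also shows a variant using non-capturing groups, written (?:...), which is a suggestion rather than something from the script above.

```php
<?php
// Invented fragment shaped like the markup the regexp above targets.
$html = "<li class=snavh>\n<b><a href=\"http://example.com/\" class=x>Example Budgeting</a></b>"
      . "<br>Budgeting tools for small firms.<font size=1>";

// The pattern from above: six bracketed groups, three of them only structural.
$capturing = '/<li class=snavh>\s+(<b>|)<a href="(.*?)".*?>(.*?)<\/a>(<\/b>|)\s*<br>(.*?)(<font|<a)/s';
preg_match_all($capturing, $html, $a);

// Same pattern with the structural groups made non-capturing via (?:...),
// so url, title and description land in groups 1, 2 and 3 again.
$nonCapturing = '/<li class=snavh>\s+(?:<b>|)<a href="(.*?)".*?>(.*?)<\/a>(?:<\/b>|)\s*<br>(.*?)(?:<font|<a)/s';
preg_match_all($nonCapturing, $html, $b);

echo $a[2][0], " | ", $a[3][0], " | ", $a[5][0], "\n";
echo $b[1][0], " | ", $b[2][0], " | ", $b[3][0], "\n";
```

With the non-capturing form the call would go back to displayResults($arr, 1, 2, 3, 15).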

Yahoo's is quite easy...

$arr = getResults("dir.yahoo.com", "Business_and_Economy/Business_to_Business/Education/Professional_Development/", "/<li><a href=\"(.*?)\">(.*?)<\\/a>\\s*-(.*?)<\\/li>/s");
displayResults($arr, 1, 2, 3, 0);

It may be possible to create a much more general solution that will work on any page that gives results in <li> tags. But since I have already spent more time on this question than on any other in the past, if you want me to create a general solution you'll have to create a new question as a follow-up to this question.

www.keteracel.com
0
 
LVL 9

Accepted Solution

by:
keteracel earned 1600 total points
ID: 12784101
ok, I've written a much more generic script that should get results for ANY page that lists results in <li> tags. It is also extensible so you can make it get results from pages which do not list results in <li> tags, like google!

If you want it, start a new question...
0
