Fetching a webpage

Hi guys,
I'm trying to fetch a webpage, then chop it up and set some variables based on it.

I'm trying to fetch:
http://www.dmoz.org/Regional/Europe/United_Kingdom/Health/National_Health_Service/

And then set some variables ($url, $title, $description) based on the links in the main part of the page, and then loop through them and do stuff with them.

Can anyone help?

Alicia
fox_stattonAsked:
arantiusCommented:
0
keteracelCommented:
I wish this was worth more than 125 points for the work that went into it... but here goes:

<?php
    function explodeHref($searchString) {
      $fp = fsockopen("www.dmoz.org", 80, $errno, $errstr, 30);
      $file = "";
      
      if (!$fp) {
         echo "$errstr ($errno)<br />\n";
      } else {
         $out = "GET /$searchString HTTP/1.1\r\n";
         $out .= "Host: www.dmoz.org\r\n";
         $out .= "User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.2.1) Gecko/20030225\r\n";
         $out .= "Connection: Close\r\n\r\n";
        
         fwrite($fp, $out);
        
         while (!feof($fp)) {
               $file .= fgets($fp, 128);
         }
            fclose($fp);
      }
      
      $sections = explode("<li>", $file);
      $results = array();
      $i = 0;

      foreach ($sections as $section) {
            if (substr($section, 0, 8) == "<a href=") {
                  $section = preg_replace("/<b>|<i>|<br>|<\/b>|<\/i>/", " ", substr($section, 9));

                  //echo $section . "<BR>";

                  $entry = array();
                  $entry["url"]         = substr($section, 0, strpos($section, "\""));

                  $section = substr($section, strpos($section, "\"") + 2);
                  //echo $section . "<BR>";

                  $entry["title"]       = substr($section, 0, strpos($section, "<"));

                  $section = substr(strstr($section, "</a>"), 4);

                  if (strpos($section, "<") > 0) {
                        $entry["description"] = substr($section, 0, strpos($section, "<"));
                  }
                  else {
                        $entry["description"] = $section;
                  }
                  $results[$i] = $entry;
                  $i++;
            }
      }
      return $results;
    }

    $arr = explodeHref("Regional/Europe/United_Kingdom/Health/National_Health_Service/");

//BELOW IS AN EXAMPLE OF HOW TO USE THE COLLECTED DATA

?>
<table style="border: none; background: black;">
      <tr style="color: white; font-weight: bold;">
            <td>URL</td>
            <td>Title</td>
            <td>Description</td>
      </tr>
<?php
      $i = 0;

      foreach($arr as $record) {
            if ($i % 2 == 0) {
                  echo "<tr style=\"background: white; color: black;\">";
            }
            else {
                  echo "<tr style=\"background: #ddeeff; color: black;\">";
            }
            echo "<td>" . $record['url'] . "</td><td>" . $record['title'] . "</td><td>" . $record['description'] . "</td></tr>";
            $i++;
      }            

?>
</table>


you can see an example of this script working at http://www.keteracel.com/test/dmoz.php

regards,

www.keteracel.com
0
fox_stattonAuthor Commented:
Hi Keteracel,
Thanks for this, it's great.

Just one thing: I don't want it to bring the categories in, just the actual links on the page. Is there any way to do this?

Thanks,

Alicia
0
keteracelCommented:
we can do a check when outputting the table and skip any entry whose description is too short (the category entries have little or no description)... say it should be at least 10 characters:

 $arr = explodeHref("Regional/Europe/United_Kingdom/Health/National_Health_Service/");

//BELOW IS AN EXAMPLE OF HOW TO USE THE COLLECTED DATA

?>
<table style="border: none; background: black;">
     <tr style="color: white; font-weight: bold;">
          <td>URL</td>
          <td>Title</td>
          <td>Description</td>
     </tr>
<?php
     $i = 0;

     foreach($arr as $record) {
          if (strlen($record['description']) < 10) continue;

          if ($i % 2 == 0) {
               echo "<tr style=\"background: white; color: black;\">";
          }
          else {
               echo "<tr style=\"background: #ddeeff; color: black;\">";
          }
          echo "<td>" . $record['url'] . "</td><td>" . $record['title'] . "</td><td>" . $record['description'] . "</td></tr>";
          $i++;
     }          

?>
</table>
0
keteracelCommented:
ok, I also had to remove the &nbsp; entities from the markup, as each one takes up 6 characters, and increase the minimum to 15 to make sure the category sections are caught (+ a few other small changes):

<?php
    function explodeHref($site, $searchString) {
      $fp = fsockopen($site, 80, $errno, $errstr, 30);
      $file = "";

      if (!$fp) {
         echo "$errstr ($errno)<br />\n";
      } else {
         $out = "GET /$searchString HTTP/1.1\r\n";
         $out .= "Host: $site\r\n";
         $out .= "User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.2.1) Gecko/20030225\r\n";
         $out .= "Connection: Close\r\n\r\n";

         fwrite($fp, $out);

         while (!feof($fp)) {
               $file .= fgets($fp, 128);
         }
            fclose($fp);
      }

      $sections = explode("<li>", $file);
      $results = array();
      $i = 0;

      foreach ($sections as $section) {
            if (substr($section, 0, 8) == "<a href=") {
                  $section = preg_replace("/&nbsp;|<b>|<i>|<br>|<\/b>|<\/i>/", " ", substr($section, 9));

                  //echo $section . "<BR>";

                  $entry = array();
                  $entry["url"]         = trim(substr($section, 0, strpos($section, "\"")));

                  $section = substr($section, strpos($section, "\"") + 2);
                  //echo $section . "<BR>";

                  $entry["title"]       = trim(substr($section, 0, strpos($section, "<")));

                  $section = substr(strstr($section, "</a>"), 4);

                  if (strpos($section, "<") > 0) {
                        $entry["description"] = trim(substr($section, 0, strpos($section, "<")));
                  }
                  else {
                        $entry["description"] = trim($section);
                  }
                  $results[$i] = $entry;
                  $i++;
            }
      }
      return $results;
    }

    $arr = explodeHref("www.dmoz.org", "Regional/Europe/United_Kingdom/Health/National_Health_Service/");

?>
<table style="border: none; background: black;">
      <tr style="color: white; font-weight: bold;">
            <td>URL</td>
            <td>Title</td>
            <td>Description</td>
      </tr>
<?php
      $i = 0;

      foreach($arr as $record) {
            if (strlen($record['description']) < 15) continue;
            if ($i % 2 == 0) {
                  echo "<tr style=\"background: white; color: black;\">";
            }
            else {
                  echo "<tr style=\"background: #ddeeff; color: black;\">";
            }
            echo "<td>" . $record['url'] . "</td><td>" . $record['title'] . "</td><td>" . $record['description'] . "</td></tr>";
            $i++;
      }

?>
</table>
0
fox_stattonAuthor Commented:
Hi, I made those changes, but the categories are still being displayed, any ideas why?
0
fox_stattonAuthor Commented:
Hi,
thanks, the second version you posted seems to be working!

Just so I can understand how the script works: if I wanted to change it to take results from, say, business.com,

what areas would I change? The explode on <li> would still apply, wouldn't it?
0
keteracelCommented:
no, unfortunately it wouldn't.. this solution is VERY specific to the page you asked for! The following solution is much more general:

<?php
      function loadFromWeb($site, $page) {
            $fp = fsockopen($site, 80, $errno, $errstr, 30);
            $file = "";

            if (!$fp) {
                     echo "$errstr ($errno)<br />\n";
            } else {
                     $out = "GET /$page HTTP/1.1\r\n";
                     $out .= "Host: $site\r\n";
                     $out .= "User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.2.1) Gecko/20030225\r\n";
                     $out .= "Connection: Close\r\n\r\n";

                     fwrite($fp, $out);

                     while (!feof($fp)) {
                           $file .= fgets($fp, 128);
                     }
                        fclose($fp);
            }
            return $file;
      }

    function getResults($site, $filename, $regexp) {
            $file = loadFromWeb($site, $filename);
            $results = array();
            
            preg_match_all($regexp, $file, $results);
            
            return $results;
    }

      function displayResults($results, $urlloc, $titleloc, $descloc, $mindesc) {
?>
<table style="border: none; background: black;">
      <tr style="color: white; font-weight: bold;">
            <td>URL</td>
            <td>Title</td>
            <td>Description</td>
      </tr>
<?php
            $i = 0;
            $j = 0;
            
            foreach ($results[1] as $dontcare) {
                  
                  if (strlen($results[$descloc][$j]) < $mindesc) {
                        $j++;
                        continue;
                  }
                  
                  if ($i % 2 == 0) {
                        echo "<tr style=\"background: white; color: black;\">";
                  }
                  else {
                        echo "<tr style=\"background: #ddeeff; color: black;\">";
                  }
                  echo "<td>" . $results[$urlloc][$j] . "</td><td>" . $results[$titleloc][$j] . "</td><td>" . $results[$descloc][$j] . "</td></tr>";
                  $i++;
                  $j++;
            }
?>
</table>
<?php
      }

$arr = getResults("www.dmoz.org", "Regional/Europe/United_Kingdom/Health/National_Health_Service/", "|<li><a href=\"(.*?)\">(.*?)</a>(.*?)[\\r\\n]|");
displayResults($arr, 1, 2, 3, 15);
?>
0
keteracelCommented:
in order to change it for use on another page you would have to change the regular expression and possibly the url, title and description locations....

getResults(String website, String webpage, String regularexpression) fetches the page and returns the regexp matches as an array. The regular expression will be specific to the particular webpage you wish to extract details from. To figure out what the regexp should be, look at the source of the page you want to get the results from and look for commonalities between results. The parenthesised sections in my regexp are the parts of each match that you wish to pull out separately. In this case the first parenthesised section is the url, the second is the title and the third is the description. If the order is different, simply change the numbers passed to the display function.
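
To make the group numbering concrete, here is a minimal sketch that runs the same dmoz pattern against a small in-memory string instead of a fetched page. The sample markup and the example.org URLs are invented for illustration; only the regexp comes from the code above.

```php
<?php
// Invented sample markup, shaped like the dmoz listing lines discussed above.
$html = "<ul>\n"
      . "<li><a href=\"http://example.org/a\">Site A</a> - First sample description.\n"
      . "<li><a href=\"http://example.org/b\">Site B</a> - Second sample description.\n"
      . "</ul>\n";

// Same pattern as the dmoz call: group 1 = url, group 2 = title, group 3 = description.
$regexp = "|<li><a href=\"(.*?)\">(.*?)</a>(.*?)[\\r\\n]|";

preg_match_all($regexp, $html, $results);

// $results[0] holds the full matches; $results[1], [2] and [3] each hold
// one array per parenthesised group, indexed by match number.
foreach ($results[1] as $j => $url) {
    echo $url . " | " . $results[2][$j] . " | " . trim($results[3][$j]) . "\n";
}
```

This is why displayResults() takes 1, 2, 3 for this page: those numbers are simply indexes into the array that preg_match_all fills in.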

I'll show an example of getting info from another page shortly...
0
keteracelCommented:
ok, I've had to make a change to the code as some sites (like business.com) use chunked transfer encoding when responding to HTTP/1.1 requests. To fix this I've changed the line which reads:

$out = "GET /$page HTTP/1.1\r\n";

to:

$out = "GET /$page HTTP/1.0\r\n";

this stops chunk information being placed in $file.
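
If you would rather stay on HTTP/1.1, the alternative is to strip the chunk markers yourself. The following is only a rough sketch of that idea: decodeChunked() is a hypothetical helper, not part of the script above, and it assumes you pass it just the response body (everything after the first blank line of the response), with no trailers to handle.

```php
<?php
// Hypothetical helper (not in the original script): removes chunked
// transfer encoding from a response body and returns the plain content.
function decodeChunked($body) {
    $decoded = "";
    $pos = 0;
    $len = strlen($body);

    while ($pos < $len) {
        // Each chunk starts with its size in hex, terminated by CRLF.
        $lineEnd = strpos($body, "\r\n", $pos);
        if ($lineEnd === false) {
            break; // malformed input: stop with what we have
        }
        $sizeLine = substr($body, $pos, $lineEnd - $pos);

        // Drop any optional chunk extension after ';'.
        $semi = strpos($sizeLine, ";");
        if ($semi !== false) {
            $sizeLine = substr($sizeLine, 0, $semi);
        }

        $size = hexdec(trim($sizeLine));
        if ($size == 0) {
            break; // a zero-length chunk marks the end of the body
        }

        $decoded .= substr($body, $lineEnd + 2, $size);

        // Skip the chunk data and the CRLF that follows it.
        $pos = $lineEnd + 2 + $size + 2;
    }
    return $decoded;
}

// Sample body: two chunks ("Hello, " and "world!") followed by the terminator.
$raw = "7\r\nHello, \r\n6\r\nworld!\r\n0\r\n\r\n";
echo decodeChunked($raw) . "\n";
```

Requesting with HTTP/1.0 is still the simpler route for this script, since a server should not send a chunked response to an HTTP/1.0 client.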

then, to display results from business.com use the following function calls:

$arr = getResults("www.business.com", "search/rslt_default.asp?vt=all&query=nhs&search=Search&type=web", "|<li.*?href=\"(.*?)\">(.*?)</a>.*?</font><br>(.*?)<br>|");
displayResults($arr, 1, 2, 3, 15);
0
keteracelCommented:
for some sites, like google, it may be necessary to add more processing after running the regexp. This is because google has several different types of results. For instance, when a foreign page is part of the results, google has a TRANSLATE link after the title. Unfortunately, as there is a <font size=-1> tag in front of both this TRANSLATE link and the description, you cannot get the description from a single regexp whilst still being able to extract normal lines.... so....

use the following function calls:

$arr = getResults("www.google.com", "search?hl=en&q=train+spotting&btnG=Google+Search", "/<!--m--><a href=(.*?)>(.*?)<\\/a>.*?<font size=-1>(.*?)<br><font/s");
displayResults($arr, 1, 2, 3, 0);

but then in the displayResults() function you'll need to replace the bit that goes:

...<td>" . $results[$descloc][$j] . "</td>...

with:

...<td>" . preg_replace("/- \\[.*?<font size=-1>/", "", $results[$descloc][$j]) . "</td>...

Unfortunately, as each search engine is different, you'll have to make quite specific functions for each page. The general regexp functions I have given you will work if the search engine responds with results that ALL take the same form. If they take different forms the regexp method is a lot more difficult and a workaround is easier (as I did in the code above!)
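
As a small illustration of that workaround, the sketch below applies the same preg_replace() to two made-up description strings, one plain and one carrying a TRANSLATE link. The function name cleanDescription() and both sample strings are assumptions for the example; real google markup may differ.

```php
<?php
// Hypothetical helper name; it only wraps the preg_replace call shown above.
function cleanDescription($desc) {
    // Remove everything from "- [" up to and including the next <font size=-1>.
    return preg_replace("/- \\[.*?<font size=-1>/", "", $desc);
}

// Invented sample strings shaped like the two result types described above.
$normal     = "A plain description of the page.";
$translated = "- [ <a href=/translate?u=x>Translate this page</a> ]<br><font size=-1>Une description de la page.";

echo cleanDescription($normal) . "\n";     // left unchanged, nothing to strip
echo cleanDescription($translated) . "\n"; // TRANSLATE link removed
```

Normal results pass through untouched, so the same call can be used for every row.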

regards,

www.keteracel.com
0
alexmayCommented:
Can't believe you keep asking keteracel so many follow up questions and only put the points up to 250.
This much work is easily worth 500!
0
keteracelCommented:
I got 500 for doing a simple one of these for google some time ago... But I'm in a good mood :-D
0
fox_stattonAuthor Commented:
Hi,
thanks for this, I really do appreciate your help.

At the moment I'm just trying to understand how the regular expression part works.

The two main places I'm looking at are Yahoo and Business.com.

Using your business.com example I tried to get
http://www.business.com/directory/accounting/budgeting_and_forecasting/

but what it retrieved were two links that didn't appear to be on the page.

Also, I was trying it with:
http://dir.yahoo.com/Business_and_Economy/Business_to_Business/Education/Professional_Development/

But it didn't seem to work. If I understand correctly, shouldn't we be splitting on the <li> tags and getting their contents on all pages?
0
fox_stattonAuthor Commented:
Hi,
something weird I noticed is that the original script won't parse pages like:

http://dmoz.org/Business/Marketing_and_Advertising/Public_Relations/Copywriters/

Is this because there are too many directories?
0
keteracelCommented:
Please note that I did say that these solutions are and MUST be very specific. Business.com seems to have gone well out of their way to make the ripping of their search results difficult. The following will get the info from the http://www.business.com/directory/accounting/budgeting_and_forecasting/ page:

$arr = getResults("www.business.com", "directory/accounting/budgeting_and_forecasting/", "/<li class=snavh>\\s+(<b>|)<a href=\"(.*?)\".*?>(.*?)<\\/a>(<\\/b>|)\\s*<br>(.*?)(<font|<a)/s");
displayResults($arr, 2, 3, 5, 15);

As you can see, the regular expression has had to be modified quite significantly!

Yahoo's is quite easy...

$arr = getResults("dir.yahoo.com", "Business_and_Economy/Business_to_Business/Education/Professional_Development/", "/<li><a href=\"(.*?)\">(.*?)<\\/a>\\s*-(.*?)<\\/li>/s");
displayResults($arr, 1, 2, 3, 0);

It may be possible to create a much more general solution that will work on any page that gives results in <li> tags. But since I have already spent more time on this question than on any other in the past, if you want me to create a general solution you'll have to create a new question as a follow-up to this question.

www.keteracel.com
0
keteracelCommented:
ok, I've written a much more generic script that should get results for ANY page that lists results in <li> tags. It is also extensible so you can make it get results from pages which do not list results in <li> tags, like google!

If you want it, start a new question...
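
The generic script itself was not posted in this thread. Purely as an illustration of the idea, a sketch along the following lines could pull the first link out of every <li> on a page. The function name extractListLinks(), the sample markup and the assumption that each result is an <li> whose first link is the result link are all made up for this example.

```php
<?php
// Hypothetical function, NOT the script mentioned above.
// Assumes each result is an <li> whose first link is the result link.
function extractListLinks($html) {
    $results = array();

    // Split on opening <li> tags (with or without attributes), ignoring case.
    $items = preg_split("/<li\\b[^>]*>/i", $html);
    array_shift($items); // discard everything before the first <li>

    foreach ($items as $item) {
        // Cut the item at its closing </li>, or at the end of the list.
        $parts = preg_split("/<\\/li>|<\\/ul>|<\\/ol>/i", $item);
        $item = $parts[0];

        // The href value may be double-quoted, single-quoted or unquoted.
        if (!preg_match("/<a\\b[^>]*href=[\"']?([^\"'\\s>]+)[\"']?[^>]*>(.*?)<\\/a>(.*)/is", $item, $m)) {
            continue; // no link in this item
        }

        $results[] = array(
            "url"         => $m[1],
            "title"       => trim(strip_tags($m[2])),
            "description" => trim(strip_tags($m[3])),
        );
    }
    return $results;
}

// Invented sample markup mixing a few of the styles seen on the pages above.
$html = "<ul>\n"
      . "<li><a href=\"http://example.org/a\">Site <b>A</b></a> - First description.</li>\n"
      . "<li class=x><A HREF=http://example.org/b>Site B</A><br>Second description.\n"
      . "<li>No link here</li>\n"
      . "</ul>";

foreach (extractListLinks($html) as $entry) {
    echo $entry["url"] . " | " . $entry["title"] . " | " . $entry["description"] . "\n";
}
```

Pages that wrap extra markup around the description, or that do not use <li> at all, would still need their own handling, as with the google example earlier.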
0
