  • Status: Solved
  • Priority: Medium
  • Security: Public
  • Views: 230

Help parsing HTML into an array

Hi,

Can someone help me with getting search results from Google into an array?

I have no idea how regex works, but I think it will do the trick.

A sample URL would be
http://www.google.com/search?hl=en&q=car
I just need the unpaid listings in an array:
title, link, description, and URL.

Thanks
Asked by: cgibuys
1 Solution
 
Roonaan commented:
You can start with:

  $html = file_get_contents('http://www.google.com/search?hl=en&q=car');
 
  if(preg_match_all('#<a class=l(.*)</a>#iU', $html, $matches)) {
    $links = $matches[0];
  }

This will not get you the descriptions, though.
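
To pull the link and the title out of each match, capture groups can be added to the same kind of pattern. The following is a minimal sketch, assuming the result anchors look like <a class=l href="...">. The sample markup and the extractLinks() helper are made up for illustration; Google's real HTML differs and changes often.

```php
<?php
// Hypothetical sample markup, not real Google output.
$html = '<div><a class=l href="http://example.com/cars">Cars <b>for</b> sale</a></div>'
      . '<div><a class=l href="http://example.org/car-news">Car news</a></div>';

// Hypothetical helper: returns one array per matched anchor.
function extractLinks(string $html): array
{
    $results = [];
    // Group 1 captures the href value, group 2 the anchor's inner HTML.
    $pattern = '#<a class=l[^>]*href="([^"]*)"[^>]*>(.*?)</a>#is';
    if (preg_match_all($pattern, $html, $matches, PREG_SET_ORDER)) {
        foreach ($matches as $m) {
            $results[] = [
                'url'   => html_entity_decode($m[1]),
                'title' => trim(strip_tags($m[2])), // drop <b> and similar tags
            ];
        }
    }
    return $results;
}

print_r(extractLinks($html));
```

PREG_SET_ORDER groups the captures per match, which makes it easier to build one array entry per result.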

-r-
 
dr_dedo commented:
You can use the Google APIs, you know.
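
As one possible reading of this suggestion, here is a rough sketch built around Google's Custom Search JSON API. It assumes you have an API key and a search engine ID (both placeholders here) and that each entry in the response's "items" list carries "title", "link", "snippet" and "displayLink" fields. The sample JSON is hand-written for illustration, not a captured response, and the helper names are hypothetical.

```php
<?php
// Build the request URL; $apiKey and $engineId are placeholders you must supply.
function buildSearchUrl(string $apiKey, string $engineId, string $query): string
{
    return 'https://www.googleapis.com/customsearch/v1?' . http_build_query([
        'key' => $apiKey,
        'cx'  => $engineId,
        'q'   => $query,
    ]);
}

// Map the decoded response onto the four fields asked for in the question.
function parseSearchResponse(string $json): array
{
    $data = json_decode($json, true);
    $results = [];
    foreach ($data['items'] ?? [] as $item) {
        $results[] = [
            'title'       => $item['title'] ?? '',
            'link'        => $item['link'] ?? '',
            'description' => $item['snippet'] ?? '',
            'url'         => $item['displayLink'] ?? '',
        ];
    }
    return $results;
}

// Hand-written sample response (assumption, for offline testing only).
$sample = '{"items":[{"title":"Car - Example","link":"http://example.com/car",'
        . '"snippet":"All about cars.","displayLink":"example.com"}]}';
print_r(parseSearchResponse($sample));
```

In real use the JSON would come from something like file_get_contents(buildSearchUrl(...)), which avoids scraping the HTML altogether.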
 
ClickCentric commented:
I'm assuming, as the search engines do, that you plan on checking for a robots.txt file and honoring it.
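
For anyone who wants to do that check in code, below is a deliberately simplified sketch. It only looks at groups that start with "User-agent: *" and treats each Disallow value as a plain path prefix; Allow lines, wildcards and several User-agent lines sharing one group are not handled. The isPathDisallowed() helper and the sample rules are hypothetical.

```php
<?php
// Simplified robots.txt check: "User-agent: *" groups and Disallow prefixes only.
function isPathDisallowed(string $robotsTxt, string $path): bool
{
    $applies = false; // true while we are inside a "User-agent: *" group
    foreach (preg_split('/\r\n|\r|\n/', $robotsTxt) as $line) {
        $line = trim(preg_replace('/#.*/', '', $line)); // strip comments
        if ($line === '' || strpos($line, ':') === false) {
            continue;
        }
        [$field, $value] = array_map('trim', explode(':', $line, 2));
        $field = strtolower($field);
        if ($field === 'user-agent') {
            $applies = ($value === '*');
        } elseif ($applies && $field === 'disallow' && $value !== '') {
            if (strpos($path, $value) === 0) {
                return true; // path starts with a disallowed prefix
            }
        }
    }
    return false;
}

// Hand-written sample rules (assumption), not a copy of any real robots.txt.
$sample = "User-agent: *\nDisallow: /search\nDisallow: /private/\n";
var_dump(isPathDisallowed($sample, '/search?hl=en&q=car'));
var_dump(isPathDisallowed($sample, '/about'));
```

With the sample rules, the first call reports the search path as disallowed and the second does not.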
 
Roonaan commented:
Why would you need to honour robots.txt when you get your results from Google itself? You might assume that pages excluded by a robots.txt file will not be present in Google's results.

-r-
 
keteracel commented:
I've answered this question before... hang on I'll dig it up...
 
keteracel commented:
    // Fetch $page from $site over plain HTTP. The returned string includes the response headers.
    function getRemoteFile($site, $page) {
        $fp = fsockopen($site, 80, $errno, $errstr, 30);
        $file = "";

        if (!$fp) {
            echo "$errstr ($errno)<br />\n";
            exit;
        }

        $out  = "GET $page HTTP/1.0\r\n";
        $out .= "Host: $site\r\n";
        $out .= "User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.2.1) Gecko/20030225\r\n";
        $out .= "Connection: Close\r\n\r\n";

        fwrite($fp, $out);

        while (!feof($fp)) {
            $file .= fgets($fp, 128);
        }
        fclose($fp);

        return $file;
    }
   
    function extractSearchResults($startSTR, $afterStart, $endSTR, $searchSTRStart, $searchSTREnd, $searchSTR, $site, $minimumSize = 0) {
        $file = getRemoteFile($site, $searchSTRStart.$searchSTR.$searchSTREnd);
        return extractSearchResultsWF($startSTR, $afterStart, $endSTR, $file, $minimumSize);
    }
   
    function extractSearchResultsWF($startSTR, $afterStart, $endSTR, $file, $minimumSize = 0) {
   
      $splitByStart = explode($startSTR, $file);
      $results = array();
      $first = true;
      $j = 0;
      
      foreach($splitByStart as $part) {
          if ($first) {
              $first = false;
              continue;
          }
          
          if ($afterStart != "") {
              $cont = false;

            for ($i = 0; $i < strlen($afterStart); $i++) {
                // Keep the part only if its first character is one of the allowed characters.
                if ($part !== "" && $part[0] == $afterStart[$i]) {
                    $cont = true;
                    break;
                }
            }

            if ($cont == false) continue;
          }
          
          if ($endSTR != "") {
              // Compare with !== false so an end marker at position 0 is still detected.
              $endPos = strpos($part, $endSTR);
              if ($endPos !== false) {
                  $part = substr($part, 0, $endPos);
              }
          }
      
          if (!preg_match("/href=(\"|)(.*?)(\"|)>(.*?)<\\/a>(.+)/s", $part, $res)) {
               continue;
          }
          
          $descContents = preg_split("/<(\\/|)[tfhadpul\\-!].*?>|<img.*?>|<inp.*?>/", $res[5]);
          $biggestSTR = "";

          foreach($descContents as $desc) {
                $desc = trim($desc);
             
                if (strlen($desc) > strlen($biggestSTR)) {
                $biggestSTR = $desc;
            }
          }

          if (strlen($biggestSTR) < $minimumSize) continue;
      
          // split() was removed in PHP 7; explode() does the same job for a literal delimiter.
          $sp = explode("\"", trim($res[2]));
          $results[$j]["url"]   = $sp[0];
          $results[$j]["title"] = trim($res[4]);
          $results[$j]["desc"]  = $biggestSTR;
          
          $j++;
      }
      return $results;
    }

    // Full "<?php" open tags are used because short tags depend on the short_open_tag setting.
    function displayExtractedResults($results) {
        $i = 1;

        ?>
        <table style="background: #336699;"><tr style="background: #000000; color: #ffffff; font-weight: bold;"><td>#</td><td>URL</td><td>Title</td><td>Description</td></tr>
        <?php
        foreach($results as $result) {
            ?>
            <tr style="background: <?php if($i % 2 == 0) echo "#ddeeff;"; else echo "#ffffff;"; ?>">
            <?php
            echo "<td>$i</td><td>" . $result["url"] . "</td><td>" . $result["title"] . "</td><td>" . $result["desc"] . "</td></tr>";
            $i++;
        }
        ?>
        </table>
        <?php
    }

    function extractGoogleSearchResults($searchSTR) {
          return extractSearchResults("<!--m-->", "", "<!--n-->", "/search?hl=en&lr=&q=", "&btnG=Search", $searchSTR, "www.google.com");
    }
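
The core idea in the code above, splitting the page on a start marker and trimming each piece at an end marker, can be tried offline without fetching anything. The following is a small sketch under that assumption; chunksBetween() and the sample string are hypothetical, and the <!--m--> / <!--n--> markers are simply the ones passed in extractGoogleSearchResults(), which Google may no longer emit.

```php
<?php
// Return the text found between each $start marker and the following $end marker.
function chunksBetween(string $html, string $start, string $end): array
{
    $parts = explode($start, $html);
    array_shift($parts); // text before the first marker is not a result
    $chunks = [];
    foreach ($parts as $part) {
        $pos = strpos($part, $end);
        if ($pos !== false) {
            $part = substr($part, 0, $pos); // cut everything from the end marker on
        }
        $chunks[] = trim($part);
    }
    return $chunks;
}

// Hand-written sample input (assumption), standing in for a fetched page.
$sample = 'header<!--m--> first result <!--n-->ad<!--m-->second result<!--n-->footer';
print_r(chunksBetween($sample, '<!--m-->', '<!--n-->'));
```

Each chunk can then be passed through a regex like the href pattern above to pick out the URL, title and description.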
 
keteracel commented:
So, try the following to test the code:

displayExtractedResults(extractGoogleSearchResults("keteracel+web+design"));
 
keteracel commented:
Or, of course, you can just use:

print_r(extractGoogleSearchResults("keteracel+web+design"));