Solved

Open a webpage using CURL and find all hyperlinks listed in the page

Posted on 2012-09-16
Last Modified: 2012-09-23
Hi,

I want to open 10 websites listed in a database using PHP cURL and find all of the hyperlinks on each page.

Then I want to store them in a MySQL database. Before storing them, however, I want to check that each hyperlink is not matched by an entry in my do-not-use.txt text file.

The content of the do-not-use.txt file is:

http://www.amazon.com/*
http://www.google.com/*
http://www.yahoo.com/*

The asterisk is a wildcard that represents anything after the site name, meaning that no hyperlinks from Amazon, Google or Yahoo should be entered in the database.

How would I proceed with this?
Question by:nainil

Accepted Solution

by:
Ray Paseur earned 1600 total points
ID: 38404883
If this activity has any economic value, you might consider hiring a professional developer.  It's not a "hard" task, but there are many moving parts that must be tested and debugged independently, so it will take a while to get it right.  "Time is money."

I can show you how to use CURL to read the HTML that the server returns for a web page. However, if the web page is not all static HTML, your script will be at risk of missing some of the information. Consider what happens when JavaScript is used to load elements into the DOM: those elements may not be present in the data string you get with CURL, because CURL does not run scripts.

After you have read the page, you will want to use a regular expression to match all of the strings that start with http and end at a single or double quote mark. That will probably find the absolute URLs that are linked from the page. However, if you are reading a web page that comes from any of the do-not-use sites, you will need to consider relative links, since they do not start with http and carry no host name of their own. You may also have to consider the <base href> tag.
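As a rough sketch of that regular expression step, something like the following might serve as a starting point. The function name extract_urls() and the sample HTML string are made up for illustration, and the pattern is only one plausible way to say "starts with http, stops at a quote mark"; it has not been tested against real-world pages.

```php
<?php
// SKETCH ONLY: extract_urls() IS A HYPOTHETICAL HELPER, NOT PART OF ANY LIBRARY
function extract_urls($htm)
{
    // MATCH http:// OR https:// UP TO THE NEXT QUOTE MARK, WHITESPACE OR ANGLE BRACKET
    $rgx = '#https?://[^"\'\s<>]+#i';
    preg_match_all($rgx, $htm, $mat);

    // REMOVE DUPLICATES AND RE-INDEX THE ARRAY
    return array_values(array_unique($mat[0]));
}

// MADE-UP TEST DATA: TWO ABSOLUTE LINKS AND ONE RELATIVE LINK
$htm  = '<a href="http://example.com/a">A</a> ';
$htm .= '<a href=\'https://example.org/b?x=1\'>B</a> ';
$htm .= '<a href="/relative">C</a>';

$urls = extract_urls($htm);
print_r($urls);
```

Note that the relative link is not matched at all. Picking those up would mean resolving them against the URL of the page being read (or against its <base href>), which this sketch does not attempt.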

You will want to read the do-not-use file into an array, and match the URLs against the contents of the array. Some data normalization and extraction will be needed, for example, to make all of the test cases upper case, eliminate blanks, etc. You will need to account for the differences in subdomains (google.com vs www.google.com) and things like that.
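One possible way to sketch that matching step is shown here. The helper names normalize_host(), blocked_hosts() and is_blocked() are invented for this example. It also makes two assumptions that are not in the question: that comparing host names is enough to honor the trailing wildcard, and that any subdomain of a listed site should be blocked too. It normalizes to lower case rather than upper case; either works as long as both sides are treated the same way.

```php
<?php
// SKETCH ONLY: ALL THREE FUNCTIONS ARE HYPOTHETICAL HELPERS

// REDUCE A HOST NAME TO A COMPARABLE FORM: LOWER CASE, NO BLANKS, NO LEADING www.
function normalize_host($host)
{
    $host = strtolower(trim($host));
    return (strpos($host, 'www.') === 0) ? substr($host, 4) : $host;
}

// TURN LINES LIKE http://www.amazon.com/* INTO A LIST OF NORMALIZED HOSTS
function blocked_hosts(array $lines)
{
    $hosts = array();
    foreach ($lines as $line)
    {
        $host = parse_url(trim($line), PHP_URL_HOST);
        if ($host) $hosts[] = normalize_host($host);
    }
    return array_unique($hosts);
}

// TRUE IF THE URL'S HOST IS A BLOCKED HOST OR A SUBDOMAIN OF ONE (ASSUMPTION)
function is_blocked($url, array $hosts)
{
    $host = parse_url(trim($url), PHP_URL_HOST);
    if (!$host) return FALSE;
    $host = normalize_host($host);
    foreach ($hosts as $blocked)
    {
        if ($host === $blocked) return TRUE;
        if (substr($host, -strlen('.' . $blocked)) === '.' . $blocked) return TRUE;
    }
    return FALSE;
}

// IN REAL USE: $lines = file('do-not-use.txt', FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES);
$lines = array('http://www.amazon.com/*', 'http://www.google.com/*', 'http://www.yahoo.com/*');
$hosts = blocked_hosts($lines);

var_dump(is_blocked('http://mail.google.com/inbox', $hosts)); // bool(true)
var_dump(is_blocked('http://example.com/page', $hosts));      // bool(false)
```

Only the URLs for which is_blocked() returns FALSE would then go on to the database insert.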

Here's a CURL script that may help you get started.
<?php // RAY_curl_get_example.php
error_reporting(E_ALL);


// DEMONSTRATE THE BASICS OF CURL
// SOMETHING LIKE RAY_curl_get_example.php?url=http://twitter.com


// YOU COULD HAVE SOMETHING LIKE THIS
$url = isset($_GET["url"]) ? $_GET["url"] : 'http://twitter.com';

// BUT SINCE IT IS ON MY SERVER, I HAVE HARD-CODED THIS
$url = 'http://twitter.com';

// TRY THE REMOTE WEB SERVICE
$htm = my_curl($url);

// SHOW THE WORK PRODUCT OR BARK OUT ERROR MESSAGES
echo "<pre>";
echo PHP_EOL . '<strong>' . $url . '</strong>';
echo PHP_EOL . htmlentities($htm);
echo PHP_EOL;


// A FUNCTION TO RUN A CURL-GET CLIENT CALL TO A FOREIGN SERVER
function my_curl
( $url
, $timeout=3
, $error_report=TRUE
)
{
    $curl = curl_init();

    // HEADERS AND OPTIONS APPEAR TO BE A FIREFOX BROWSER REFERRED BY GOOGLE
    $header = array();
    $header[] = "Accept: text/xml,application/xml,application/xhtml+xml,text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5";
    $header[] = "Cache-Control: max-age=0";
    $header[] = "Connection: keep-alive";
    $header[] = "Keep-Alive: 300";
    $header[] = "Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7";
    $header[] = "Accept-Language: en-us,en;q=0.5";
    $header[] = "Pragma: "; // BROWSERS USUALLY LEAVE THIS BLANK

    // SET THE CURL OPTIONS - SEE http://php.net/manual/en/function.curl-setopt.php
    curl_setopt( $curl, CURLOPT_URL,            $url  );
    // ANCIENT HISTORY - SUPERSEDED BY THE USER AGENT SET ON THE NEXT LINE
    // curl_setopt( $curl, CURLOPT_USERAGENT,      'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.1.6) Gecko/20091201 Firefox/3.5.6'  );
    curl_setopt( $curl, CURLOPT_USERAGENT,      'Mozilla/5.0 (Windows NT 6.1; rv:13.0) Gecko/20100101 Firefox/13.0.1'  );
    curl_setopt( $curl, CURLOPT_HTTPHEADER,     $header  );
    curl_setopt( $curl, CURLOPT_REFERER,        'http://www.google.com'  );
    curl_setopt( $curl, CURLOPT_ENCODING,       'gzip,deflate'  );
    curl_setopt( $curl, CURLOPT_AUTOREFERER,    TRUE  );
    curl_setopt( $curl, CURLOPT_RETURNTRANSFER, TRUE  );
    curl_setopt( $curl, CURLOPT_FOLLOWLOCATION, TRUE  );
    curl_setopt( $curl, CURLOPT_TIMEOUT,        $timeout  );

    // RUN THE CURL REQUEST AND GET THE RESULTS
    $htm = curl_exec($curl);

    // ON FAILURE HANDLE ERROR MESSAGE
    if ($htm === FALSE)
    {
        if ($error_report)
        {
            $err = curl_errno($curl);
            $inf = curl_getinfo($curl);
            echo "CURL FAIL: $url TIMEOUT=$timeout, CURL_ERRNO=$err";
            var_dump($inf);
        }
        curl_close($curl);
        return FALSE;
    }

    // ON SUCCESS RETURN XML / HTML STRING
    curl_close($curl);
    return $htm;
}


Best of luck with your project, ~Ray

Author Closing Comment

by:nainil
ID: 38426481
Thanks Ray. I also used: simplehtmldom_1_5 to get what I needed.