PHP script for massive sitemap ping

I'd like to have a script that would submit http://www.google.com/webmasters/sitemaps/ping?sitemap=http://www.mysite.com/agsitemap/ for a very large number of websites.
This submission should happen automatically when a website is released online. I would also need to receive the ping results in a separate database with a 24h timeline.

Is this possible?
Richard Coffre, e-commerce Product Manager, Asked:
Ray Paseur Commented:
It looks to me like you might want to set something up this way...  it would be almost 100% automatic.

Create a data base table that holds a list of all of the 20,000+ web sites, and an indication of whether each site has been submitted to the Google Sitemaps application.

Select from the table where the submission has not yet occurred.  Perhaps with a limit of 500, just to avoid script timeouts.

Iterate over the selected rows, substituting each of your sitemap URLs into the argument in the Google URL.

With the prepared URL, use CURL's "get" method to connect to Google.  You can set the timeout to 2 seconds or less, and you can probably ignore any CURL errors.  

Wait one second between calls to avoid looking like you are attempting a DoS attack against the Google Sitemap application.

Update the selected row to show that your sitemap URL has been submitted to Google.

When you get to the end of the iterator and all 500 of your sitemaps have been submitted, use CURL to restart your script, and it will get the next 500 records, repeating the process until all 20,000+ rows have been updated with the "submit-done" status.  When your query to select the unsubmitted sitemaps returns no rows, you're all finished.
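
The steps above can be sketched roughly as follows. This is only a minimal sketch under assumptions: the table name `sitemaps`, its columns `id`, `sitemap_url` and `submitted`, and the function names `build_ping_url()` and `submit_batch()` are all hypothetical, and the ping endpoint is simply the one quoted in the question. The pinger is passed in as a callable so that any GET helper can be plugged in.

```php
<?php
// Hypothetical schema: sitemaps(id, sitemap_url, submitted). Names are illustrative only.

// Build the ping URL; the sitemap URL is URL-encoded because it is a query argument.
function build_ping_url($sitemap_url)
{
    return 'http://www.google.com/webmasters/sitemaps/ping?sitemap=' . urlencode($sitemap_url);
}

// Submit up to $limit unsubmitted sitemaps. $ping is any callable that accepts a URL.
// Returns the number of rows processed; 0 means nothing is left to submit.
function submit_batch(PDO $pdo, callable $ping, $limit = 500, $delay_seconds = 1)
{
    $select = $pdo->prepare(
        'SELECT id, sitemap_url FROM sitemaps WHERE submitted = 0 ORDER BY id LIMIT :lim'
    );
    $select->bindValue(':lim', (int) $limit, PDO::PARAM_INT);
    $select->execute();
    $rows = $select->fetchAll(PDO::FETCH_ASSOC);

    $update = $pdo->prepare('UPDATE sitemaps SET submitted = 1 WHERE id = :id');
    foreach ($rows as $row) {
        // The result is ignored here, following the suggestion that CURL errors can be ignored.
        $ping(build_ping_url($row['sitemap_url']));
        $update->execute(array(':id' => $row['id']));
        // Pause between calls so the requests do not arrive in a burst.
        sleep($delay_seconds);
    }
    return count($rows);
}
```

A caller could then repeat `submit_batch($pdo, $some_get_helper)` until it returns 0, either in a loop or by restarting the script as described above.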

Here is an example of how you can use CURL to present a GET request to a web site.

best regards, ~Ray
<?php // RAY_curl_example.php
error_reporting(E_ALL);

function my_curl($url, $timeout=2, $error_report=FALSE)
{
    $curl = curl_init();

    // HEADERS FROM FIREFOX - APPEARS TO BE A BROWSER REFERRED BY GOOGLE
    $header = array();
    $header[] = "Accept: text/xml,application/xml,application/xhtml+xml,text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5";
    $header[] = "Cache-Control: max-age=0";
    $header[] = "Connection: keep-alive";
    $header[] = "Keep-Alive: 300";
    $header[] = "Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7";
    $header[] = "Accept-Language: en-us,en;q=0.5";
    $header[] = "Pragma: "; // browsers keep this blank.

    // SET THE CURL OPTIONS - SEE http://php.net/manual/en/function.curl-setopt.php
    curl_setopt($curl, CURLOPT_URL,            $url);
    curl_setopt($curl, CURLOPT_USERAGENT,      'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.1.6) Gecko/20091201 Firefox/3.5.6');
    curl_setopt($curl, CURLOPT_HTTPHEADER,     $header);
    curl_setopt($curl, CURLOPT_REFERER,        'http://www.google.com');
    curl_setopt($curl, CURLOPT_ENCODING,       'gzip,deflate');
    curl_setopt($curl, CURLOPT_AUTOREFERER,    TRUE);
    curl_setopt($curl, CURLOPT_RETURNTRANSFER, TRUE);
    curl_setopt($curl, CURLOPT_TIMEOUT,        $timeout);

    // RUN THE CURL REQUEST AND GET THE RESULTS
    $htm = curl_exec($curl);
    $err = curl_errno($curl);
    $inf = curl_getinfo($curl);
    curl_close($curl);

    // ON FAILURE
    if (!$htm)
    {
        // PROCESS ERRORS HERE
        if ($error_report)
        {
            echo "CURL FAIL: $url TIMEOUT=$timeout, CURL_ERRNO=$err";
            var_dump($inf);
        }
        return FALSE;
    }

    // ON SUCCESS
    return $htm;
}




// USAGE EXAMPLE
$url = "http://twitter.com/Ray_Paseur";
$htm = my_curl($url);
if (!$htm) die("NO $url");


// SHOW WHAT WE GOT
echo "<pre>";
echo htmlentities($htm);

karoldvl Commented:
Doing it solely in PHP is probably not the best idea. Due to the amount of processing needed, you won't be able to contain it in one execution of the script.

Either split the processing into chunks (e.g. one ping per script execution) and call them in batch from the backend (shell/cron job), or move most of the processing to the background (a daemon; there are plenty of choices: C++/Perl/Python/...). This is too abstract at the moment; too much depends on the specifics.

What ping rate do you expect to generate (pings per second)?
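
If the chunks are started from a cron job, one thing to watch is that a slow run does not overlap with the next one. A possible guard, sketched here with a lock file, might look like this. The function name `run_exclusively()` and the lock-file approach are assumptions for illustration, not something the thread prescribes.

```php
<?php
// Run $job only if no other run currently holds the lock file.
// Returns TRUE if the job ran, FALSE if the lock was busy or could not be opened.
function run_exclusively($lock_path, callable $job)
{
    // Mode 'c' opens the file for writing and creates it if missing, without truncating.
    $handle = fopen($lock_path, 'c');
    if ($handle === FALSE) {
        return FALSE;
    }
    // LOCK_NB makes flock() return immediately instead of waiting for the lock.
    if (!flock($handle, LOCK_EX | LOCK_NB)) {
        fclose($handle);
        return FALSE;
    }
    try {
        $job();
    } finally {
        // Always release the lock, even if the job throws.
        flock($handle, LOCK_UN);
        fclose($handle);
    }
    return TRUE;
}
```

The job passed in would be one chunk of pings; a cron entry would simply invoke the script that wraps its work in this call.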
Ray Paseur Commented:
What is "a very large number?"
Richard Coffre, e-commerce Product Manager, Author Commented:
Thank you for your feedback:

@ karoldvl: <My idea was to replace the www.mysite.com part of the URL mentioned above and submit it somehow. Today I have to press Enter to launch this URL. Considering the large number of websites we release per month, I'd like it to be done automatically. This way, with one condition, I could ping all our DN sitemaps. And of course, ideally I'd like to count the number/percentage of HTTP 200 answers. This script has to run only on the day the website is released/published, and the submissions should be limited to 3; more is not necessary.>

@ Ray_Paseur: <Today we're talking about over 20,000 websites and the growth is exponential.>
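
For counting the number and percentage of HTTP 200 answers, something along these lines might work. It is a sketch only: `ping_status()` and `summarize_status_codes()` are hypothetical names, and how the collected codes are stored (for example one row per ping with a timestamp, to get the 24h timeline) is left open.

```php
<?php
// Send a GET request and return the HTTP status code, or 0 if no response was received.
function ping_status($url, $timeout = 2)
{
    $curl = curl_init($url);
    curl_setopt($curl, CURLOPT_RETURNTRANSFER, TRUE);
    curl_setopt($curl, CURLOPT_TIMEOUT, $timeout);
    curl_exec($curl);
    // CURLINFO_HTTP_CODE is 0 when the request failed before a response arrived.
    return (int) curl_getinfo($curl, CURLINFO_HTTP_CODE);
}

// Summarize a list of status codes: total pings, number of 200s, percentage of 200s.
function summarize_status_codes(array $codes)
{
    $total = count($codes);
    $ok = count(array_keys($codes, 200, true));
    $percent = $total > 0 ? round(100 * $ok / $total, 1) : 0.0;
    return array('total' => $total, 'ok' => $ok, 'percent_ok' => $percent);
}
```

The codes returned by `ping_status()` for each sitemap can be collected into an array and passed to `summarize_status_codes()` at the end of a run.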
Ray Paseur Commented:
Thanks for the points - it's an interesting question!  Best, ~Ray