Fastest way to check whether multiple websites are working

infodigger
The main purpose of my project is to check about 100,000 websites to see if they exist (not just a domain check, an actual website check). Currently I run file_get_contents() for every URL and add the number of characters returned to the database. If the character count is over 0, the website exists.

However, this way takes far too long (more than 2 days) and the results are not very good (I have to run it 3-4 times to get better results, since many websites do not respond quickly).

Do you have any ideas to improve this? For example, I think getting just the first 4-5 characters of the response could work as well. Another option could be to launch many instances of the script at the same time.

Thanks for all the help!

Commented:
As a very quick solution you could stop adding to the database and run the returned string from file_get_contents() through the strlen() function.

e.g.

// file_get_contents() returns FALSE on failure, which strlen() treats as length 0
$string = file_get_contents("http://www.google.com");
if (strlen($string) != 0)
{
    echo "website exists";
}

Stopping the insert to the database will cut down on some of the time.
-- CTM

Commented:
You could use get_headers("http://www.google.com").
(http://de3.php.net/manual/en/function.get-headers.php)
You get an array with the HTTP headers if the website is there, or FALSE if the site is down, without transferring the site content. That should be much faster than file_get_contents().
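
For example, a minimal check along these lines (www.example.com is just a placeholder here) might be enough:

// get_headers() returns an array of response headers, or FALSE on failure
$headers = @get_headers("http://www.example.com");
if ($headers === FALSE)
{
    echo "no response - site appears to be down";
}
else
{
    // $headers[0] is the status line, e.g. "HTTP/1.1 200 OK"
    echo "site responded: " . $headers[0];
}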
Most Valuable Expert 2011
Top Expert 2016

Commented:
get_headers() will be MUCH faster than file_get_contents() since it will not need to get all the contents!  In addition, file_get_contents() adds execution time to your script while it waits for the foreign site to prepare and render the HTML.  So it can cause timeouts if the foreign site hangs up.

There may be a good cURL solution, too.  You can set fine-grained connection and transfer timeouts with cURL, which is much harder to control with file_get_contents().
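
For what it's worth, a rough sketch of such a cURL check, using a HEAD request and explicit timeouts (the 5-second values below are just placeholders, not a recommendation), might look like this:

// Sketch: check one URL with cURL via a HEAD request and explicit timeouts
$url = "http://www.example.com"; // placeholder URL

$ch = curl_init($url);
curl_setopt($ch, CURLOPT_NOBODY, TRUE);          // HEAD request - skip the body
curl_setopt($ch, CURLOPT_RETURNTRANSFER, TRUE);  // do not echo the response
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, TRUE);  // follow redirects
curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 5);     // give up connecting after 5 seconds
curl_setopt($ch, CURLOPT_TIMEOUT, 5);            // give up entirely after 5 seconds

$ok   = curl_exec($ch);
$code = curl_getinfo($ch, CURLINFO_HTTP_CODE);
curl_close($ch);

if ($ok !== FALSE && $code >= 200 && $code < 400)
{
    echo "website exists (HTTP $code)";
}
else
{
    echo "no usable response";
}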

I will experiment with this a little bit and post back here when I have some results.  Best regards, ~Ray

Commented:
There's probably not too much you can do about speeding up each individual website query. As d4011 suggested, using get_headers() will reduce the amount you are downloading, but your biggest speed problem is connection time, and there's just no way to really speed that up (unless you're on a modem, in which case upgrading to broadband would drastically help).

Eliminating DB queries won't be much of a help either - 100,000 inserts is usually nothing major for a database unless you have a REALLY slow database or poorly-designed table structures.

However, you COULD divide the list up into several sub-lists (100 lists of 1,000 each). Put each list into its own file in one directory called "unprocessed", then create another directory called "working" and one called "processed".

Have your script pull ONE list from the directory, move the file into the "working" directory, and begin to process those 1,000 sites. At the end of the script, move the file from "working" into "processed" and end the script - DO NOT pull more than one list at a time.

Now just run the script, wait about 10 seconds, and run another instance of the script. Do this about 10 times total and you should have 10 instances of your script collectively working through 10,000 sites, 10 requests at a time. Your network card is most likely capable of handling all the different simultaneous traffic, so it should cut your time down to almost 1/10th of the total.
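
A bare-bones worker along those lines, assuming the three directories described above and list files with one URL per line (the .txt naming is just an assumption), might look something like this:

<?php // worker sketch - claim one list, check its URLs, retire the list
$lists = glob('unprocessed/*.txt');
if (empty($lists)) exit("nothing left to process\n");

// Claim exactly one list by moving it into "working"
$name    = basename($lists[0]);
$working = 'working/' . $name;
if (!@rename($lists[0], $working)) exit("another instance grabbed it first\n");

// Check each URL in the claimed list
$urls = file($working, FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES);
foreach ($urls as $url)
{
    $headers = @get_headers($url);
    $alive   = ($headers !== FALSE);
    // record $alive for $url here (database insert, log file, etc.)
}

// Retire the finished list and stop - do NOT claim another one
rename($working, 'processed/' . $name);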

Commented:
Does a website count as "existing" if it is just a parked domain? A lot of websites that go down get replaced by parked domains, especially if the site was popular. You may want to consider ways to take that into account in your program. I'm not sure of the best way to detect it, though, since most methods would involve getting the entire HTML contents of the site, which is what is slowing you down in the first place! Perhaps some details in the nameserver or headers might tip you off.

Commented:
Ray makes another good point about get_headers() - there may be some time involved in waiting for the rendering of a page, so get_headers() might help eliminate some of that time. If you're looking for the fastest method of processing all the sites, then my approach would probably give you the fastest times. It's simple division of labor.
Most Valuable Expert 2011
Top Expert 2016
Commented:
Here are the stats from my test.

I CHECKED 500 URLS WITH get_headers()
ELAPSED TIME: 22.0503

I CHECKED 500 URLS WITH file_get_contents()
ELAPSED TIME: 33.2863

After tinkering around a little, I think you'll find the following things.

1. The speed of the foreign servers will vary greatly, as will the speed of internet connections.
2. Extrapolating from 500 URLS (efficient code, BTW, since I am testing my own web site) to 100,000 URLS means about 200 runs of roughly 22 seconds each, or around 4,400 seconds of elapsed time.  Since an hour is 3,600 seconds, you're looking at about 1 hour 15 minutes minimum to do this with get_headers(), and at least half again longer with file_get_contents().

I would recommend taking the list of 100,000 URLS, and splitting it up into 100 lists of 1,000 URLS each, then starting 100 separate scripts to run the tests concurrently.  You could be done in just a few minutes!

Best regards, ~Ray
<?php // RAY_get_headers.php
error_reporting(E_ALL);
echo "<pre>\n";
 
// COMPARE THE TIME TO USE file_get_contents() AND get_headers() ON MANY URLS
 
// GENERATE AN ARRAY OF URLS - 500 NUMBERS FOR GET STRINGS
$nums = range(1,500);
foreach ($nums as $num)
{
    $urls[] = 'http://www.laprbass.com/angler.php?a=' . (string)$num;
}
 
 
 
// SET START TIME FOR GET HEADERS
// MAN PAGE: http://us3.php.net/manual/en/function.get-headers.php
$start_get_headers = microtime(TRUE);
 
// ITERATOR
$not_found = 0;
foreach ($urls as $url)
{
    // get_headers() returns FALSE if the request fails entirely
    $headers = get_headers($url);
    if ($headers === FALSE) { $not_found++; continue; }
    if (strpos($headers[0], '404') === FALSE) continue;
    $not_found++;
}
 
// SET END TIME
$finis_get_headers = microtime(TRUE);
 
// COMPUTE AND REPORT
$lapse_get_headers = number_format( ($finis_get_headers - $start_get_headers), 4);
$kount = count($urls);
echo "\nI CHECKED $kount URLS WITH get_headers()\nGOT $not_found 404\nELAPSED TIME: $lapse_get_headers\n";
 
 
 
// SET START TIME FOR GET CONTENTS
// MAN PAGE: http://us3.php.net/manual/en/function.file-get-contents.php
$start_get_contents = microtime(TRUE);
 
// ITERATOR
$not_found = 0;
foreach ($urls as $url)
{
    // NOTE: file_get_contents() returns FALSE (with a warning) when the server
    // sends a 404, so a real 404 never reaches this strpos() test - see the
    // follow-up comment below
    $response = file_get_contents($url);
    if (strpos($response, '404') === FALSE) continue;
    $not_found++;
}
 
// SET END TIME
$finis_get_contents = microtime(TRUE);
 
// COMPUTE AND REPORT
$lapse_get_contents = number_format( ($finis_get_contents - $start_get_contents), 4);
$kount = count($urls);
echo "\nI CHECKED $kount URLS WITH file_get_contents()\nGOT $not_found 404\nELAPSED TIME: $lapse_get_contents\n";

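Building on the idea of splitting the 100,000 URLS into 100 list files, a rough launcher sketch (assuming a Unix-like shell and a hypothetical checker script named check_list.php that takes a list file as its argument) could be as simple as this:

<?php // launcher sketch - starts one background worker per list file
$lists = glob('lists/*.txt'); // assumed location of the 100 list files

foreach ($lists as $list)
{
    // Redirecting output and appending & lets exec() return immediately,
    // so all the workers run concurrently
    exec('php check_list.php ' . escapeshellarg($list) . ' > /dev/null 2>&1 &');
}

echo "Launched " . count($lists) . " workers\n";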

Most Valuable Expert 2011
Top Expert 2016

Commented:
@Frosty555: Good point; however, a parked domain will still return headers and contents just like a regular web site.  Some heuristics would be needed to figure this out.

Best to all, ~Ray
Commented:
@Ray: Regarding the file_get_contents() loop in your script: likely just a copy-and-paste mistake, but file_get_contents() on URLs that send back a 404 header will not return the page content (PHP generates a warning containing "HTTP/1.0 404 Not Found").
So you can leave out the strpos() (which will also save a bit of time on huge responses) and just check against FALSE.
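
In other words, inside that loop, something as simple as this should do (the @ just suppresses the warning PHP raises on the 404):

// file_get_contents() returns FALSE on a 404 (and on other failures),
// so the return value alone tells you whether the page came back
$response = @file_get_contents($url);
if ($response === FALSE)
{
    $not_found++;
}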
Most Valuable Expert 2011
Top Expert 2016

Commented:
Yeah, I know - I was just being lazy when I copied and pasted!

Author

Commented:
Thank you all for your suggestions! I have managed to run my project in about 1/20th of the original time, which is really great!

I used get_headers() and 10 parallel processes, and it works very fast.
Most Valuable Expert 2011
Top Expert 2016

Commented:
Thanks for the points!  This is a great question,  ~Ray
