infodigger

asked on

Fastest way of checking multiple websites if they work

The main purpose of my project is to check whether about 100,000 websites exist (not just a domain check, an actual website check). Currently I run file_get_contents() on every URL and add the number of characters returned to the database, so if the character count is over 0, the website exists.

However, this approach takes way too long (more than two days) and the results are not very good: I have to run it 3-4 times to get better results, since many websites do not respond quickly.

Do you have any ideas for improving this? For example, I think fetching just the first 4-5 characters of the response could work. Another option could be to launch many instances of the script at the same time.

Thanks for all the help!
christophermccann

As a very quick solution, you could stop adding to the database and just run the string returned by file_get_contents() through strlen().

e.g.

$string = file_get_contents("http://www.google.com");

// file_get_contents() returns FALSE on failure, and strlen(FALSE) is 0,
// so a non-zero length means the site returned something
if (strlen($string) != 0) {
    echo "website exists";
}

Stopping the insert to the database will cut down on some of the time.
-- CTM
You could use get_headers("http://www.google.com")
(http://de3.php.net/manual/en/function.get-headers.php).
It returns an array with the HTTP headers if the website is there, and FALSE if the site is down, without transferring the site content. That should be much faster than file_get_contents().
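Something along these lines should do it (untested sketch; the 5-second timeout and the HEAD-only request are just example settings, not something get_headers() requires):

<?php
// get_headers() honours default_socket_timeout, so a slow host is abandoned
// after 5 seconds instead of stalling the whole run
ini_set('default_socket_timeout', 5);

// request headers only (HEAD) so no page body is transferred
stream_context_set_default(array('http' => array('method' => 'HEAD')));

$headers = @get_headers("http://www.google.com");

if ($headers !== false) {
    echo "website exists: " . $headers[0]; // e.g. "HTTP/1.1 200 OK"
} else {
    echo "no response";
}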
get_headers() will be MUCH faster than file_get_contents() since it will not need to get all the contents!  In addition, file_get_contents() adds execution time to your script while it waits for the foreign site to prepare and render the HTML.  So it can cause timeouts if the foreign site hangs up.

There may be a good CURL solution, too.  You can set timeouts with CURL, unlike with file_get_contents().
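For example, something like this (untested sketch; the batch size and the 5-second timeouts are arbitrary choices) would fire HEAD requests for a whole batch of URLs at once with curl_multi and give up quickly on dead hosts:

<?php
// Check a batch of URLs in parallel with curl_multi, headers only, hard timeouts.
function check_batch(array $urls, $timeout = 5)
{
    $multi   = curl_multi_init();
    $handles = array();

    foreach ($urls as $url) {
        $ch = curl_init($url);
        curl_setopt_array($ch, array(
            CURLOPT_NOBODY         => true,      // HEAD request: headers only
            CURLOPT_RETURNTRANSFER => true,
            CURLOPT_FOLLOWLOCATION => true,
            CURLOPT_CONNECTTIMEOUT => $timeout,  // give up on dead hosts quickly
            CURLOPT_TIMEOUT        => $timeout,
        ));
        curl_multi_add_handle($multi, $ch);
        $handles[$url] = $ch;
    }

    // drive all transfers until every handle has finished or timed out
    do {
        $status = curl_multi_exec($multi, $running);
        if ($running) {
            curl_multi_select($multi);
        }
    } while ($running && $status == CURLM_OK);

    $results = array();
    foreach ($handles as $url => $ch) {
        $code = curl_getinfo($ch, CURLINFO_HTTP_CODE);
        $results[$url] = ($code > 0);            // 0 means no HTTP response at all
        curl_multi_remove_handle($multi, $ch);
        curl_close($ch);
    }
    curl_multi_close($multi);

    return $results;
}

print_r(check_batch(array("http://www.google.com", "http://example.invalid")));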

I will experiment with this a little bit and post back here when I have some results.  Best regards, ~Ray
SOLUTION
gr8gonzo

This solution is only available to members of Experts Exchange.
SOLUTION

This solution is only available to members of Experts Exchange.
Ray makes another good point about get_headers() - there may be some time involved in waiting for the rendering of a page, so get_headers() might help eliminate some of that time. If you're looking for the fastest method of processing all the sites, then my approach would probably give you the fastest times. It's simple division of labor.
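To illustrate the division-of-labor idea in general terms (this is only a sketch, not the member-only solution above; the file names, the worker count, and the check_chunk.php helper are placeholders): split the URL list into chunks and launch one background worker per chunk.

<?php
// Split urls.txt into N chunks and launch one worker process per chunk.
// File names, worker count and check_chunk.php are placeholders.
$workers = 10;
$urls    = file('urls.txt', FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES);
$chunks  = array_chunk($urls, (int) ceil(count($urls) / $workers));

foreach ($chunks as $i => $chunk) {
    $chunkFile = "chunk_$i.txt";
    file_put_contents($chunkFile, implode("\n", $chunk));

    // launch the worker in the background (Linux shell syntax) so every
    // chunk is processed at the same time
    exec(sprintf('php check_chunk.php %s > result_%d.txt 2>&1 &',
        escapeshellarg($chunkFile), $i));
}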
ASKER CERTIFIED SOLUTION

This solution is only available to members of Experts Exchange.
@Frosty555: Good point; however, a parked domain will still return headers and contents just like a regular website.  Some heuristics would be needed to figure this out.

Best to all, ~Ray
SOLUTION

This solution is only available to members of Experts Exchange.
Yeah, I know - I was just being lazy when I was c&p!
infodigger

ASKER

Thank you all for your suggestions! I have managed to run my project in about 1/20th of the time, which is really great!

I use get_headers() and 10 parallel processes, and it works very fast.
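For anyone who finds this later, each worker is roughly along these lines (a simplified sketch, not my exact script; the chunk-file argument and the output format are just for illustration):

<?php
// Worker: reads one chunk file of URLs (passed on the command line) and
// prints 1 if the site answered, 0 if not.
ini_set('default_socket_timeout', 5);   // don't hang on slow hosts
stream_context_set_default(array('http' => array('method' => 'HEAD')));

$urls = file($argv[1], FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES);

foreach ($urls as $url) {
    $ok = (@get_headers($url) !== false);
    echo $url . "\t" . ($ok ? 1 : 0) . "\n";
}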
Thanks for the points!  This is a great question,  ~Ray