infodigger
asked on
Fastest way of checking multiple websites if they work
The main purpose of my project is to check about 100,000 websites and determine whether they exist (a website check, not just a domain check). Currently I run file_get_contents() for every URL and store the number of characters returned in the database. So if the character count is over 0, the website exists.
However, this takes way too long (more than 2 days) and the results are not very reliable (I have to run it 3-4 times to get better results, since many websites do not respond quickly).
Do you have any ideas for improving this? For example, I think fetching just the first 4-5 characters of the response could work as well. Another option could be to launch many instances of the script at the same time.
Thanks for all the help!
You could use get_headers("http://www.google.com").
(http://de3.php.net/manual/en/function.get-headers.php)
You get an array with the HTTP headers if the website is there, or FALSE if the site is down - without transferring the site content. That should be much faster than file_get_contents().
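A minimal sketch of this approach (the URL and the 5-second timeout are just example values to tune; the HEAD method avoids transferring any body at all):

```php
<?php
// Treat a site as "up" if get_headers() returns an array of headers.
// get_headers() honors the default stream context, so we can set a
// timeout and use a HEAD request there.
stream_context_set_default([
    'http' => [
        'method'  => 'HEAD', // headers only, no body transfer
        'timeout' => 5,      // seconds; an assumption to tune
    ],
]);

function siteExists(string $url): bool
{
    $headers = @get_headers($url); // @ suppresses warnings on failure
    return $headers !== false;     // FALSE means no HTTP response
}

var_dump(siteExists("http://www.google.com"));
```

Note that a site can respond with 404 or 500 and still "exist" by this test; if you need to distinguish those cases, inspect the status line in the returned array.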
get_headers() will be MUCH faster than file_get_contents() since it will not need to get all the contents! In addition, file_get_contents() adds execution time to your script while it waits for the foreign site to prepare and render the HTML. So it can cause timeouts if the foreign site hangs up.
There may be a good cURL solution, too. cURL gives you much finer timeout control than file_get_contents().
I will experiment with this a little bit and post back here when I have some results. Best regards, ~Ray
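One way a cURL solution could look is curl_multi, which checks many URLs concurrently in a single process. This is only a sketch; the timeout value and the use of HEAD requests are assumptions to adjust for your crawl:

```php
<?php
// Check a batch of URLs in parallel with curl_multi.
// Returns [url => bool], where true means the server answered at all.
function checkUrls(array $urls, int $timeout = 5): array
{
    $multi   = curl_multi_init();
    $handles = [];
    foreach ($urls as $url) {
        $ch = curl_init($url);
        curl_setopt_array($ch, [
            CURLOPT_NOBODY         => true,  // HEAD request: no body
            CURLOPT_RETURNTRANSFER => true,
            CURLOPT_CONNECTTIMEOUT => $timeout,
            CURLOPT_TIMEOUT        => $timeout,
            CURLOPT_FOLLOWLOCATION => true,
        ]);
        curl_multi_add_handle($multi, $ch);
        $handles[$url] = $ch;
    }

    // Drive all transfers until every handle has finished.
    do {
        $status = curl_multi_exec($multi, $running);
        if ($running) {
            curl_multi_select($multi, 1.0); // wait for activity
        }
    } while ($running && $status === CURLM_OK);

    $results = [];
    foreach ($handles as $url => $ch) {
        $code = curl_getinfo($ch, CURLINFO_HTTP_CODE);
        $results[$url] = ($code > 0); // 0 means no response at all
        curl_multi_remove_handle($multi, $ch);
        curl_close($ch);
    }
    curl_multi_close($multi);
    return $results;
}
```

For 100,000 URLs you would feed this function batches of a few hundred at a time rather than all handles at once, to keep memory and open-socket counts reasonable.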
Ray makes another good point about get_headers() - there may be some time spent waiting for a page to render, and get_headers() helps eliminate some of that. If you're looking for the fastest way to process all the sites, though, my approach would probably give you the best times. It's simple division of labor.
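The division-of-labor idea can be sketched as splitting the URL list into one chunk per worker process. The worker count and the `check.php` script name are assumptions for illustration:

```php
<?php
// Split a URL list into N roughly equal chunks, one per worker.
function splitWork(array $urls, int $workers): array
{
    $chunkSize = (int) ceil(count($urls) / $workers);
    return array_chunk($urls, $chunkSize);
}

// Each chunk could then be written to its own file and handed to a
// separate PHP process, e.g.:
//   foreach (splitWork($urls, 10) as $i => $chunk) {
//       file_put_contents("chunk$i.txt", implode("\n", $chunk));
//       exec("php check.php chunk$i.txt > /dev/null 2>&1 &");
//   }
```

Ten workers each checking a tenth of the list turns one long sequential run into ten shorter parallel ones, which is where most of the speedup comes from.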
@Frosty555: Good point, however a parked domain will still return headers and contents just like a regular web site. Some heuristics would be needed to figure this out.
Best to all, ~Ray
Yeah, I know - I was just being lazy when I was copying and pasting!
ASKER
Thank you all for your suggestions! I have managed to run my project in about 1/20th of the time, which is really great!
I use get_headers() and 10 parallel processes, and it works very fast.
Thanks for the points! This is a great question, ~Ray
e.g.
$string = @file_get_contents("http://www.google.com");
if ($string !== false)   // FALSE means the request failed
{
    echo "website exists";
}
Skipping the per-URL insert into the database will also cut down on some of the time.
-- CTM
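If the results do need to go into the database, buffering them and inserting in batches gets most of that time back. A minimal sketch using PDO; the `site_checks` table and its columns are assumptions for illustration:

```php
<?php
// Insert many (url, up) results in one multi-row INSERT instead of
// one statement per site.
function flushBatch(PDO $pdo, array $rows): void
{
    if (!$rows) {
        return;
    }
    // One "(?, ?)" placeholder pair per row.
    $placeholders = implode(',', array_fill(0, count($rows), '(?, ?)'));
    $stmt = $pdo->prepare(
        "INSERT INTO site_checks (url, up) VALUES $placeholders"
    );
    $params = [];
    foreach ($rows as [$url, $up]) {
        $params[] = $url;
        $params[] = (int) $up;
    }
    $stmt->execute($params);
}
```

Collecting a few hundred results before each flush keeps the database round-trips from dominating the run, at the cost of losing at most one unflushed batch if a worker dies.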