Fastest way to check whether multiple websites are working

infodigger
The main purpose of my project is to check about 100,000 websites to see if they exist (not just a domain check, an actual website check). Currently I run file_get_contents() for every URL and add the number of characters returned to the database. If the character count is over 0, the website exists.

However, this way takes far too long (more than 2 days) and the results are not very good (I have to run it 3-4 times to get better results, since many websites do not respond quickly).

Do you have any ideas to improve this? For example, I think getting just the first 4-5 characters of the response could work as well. Another option could be to launch many instances of the script at the same time.

Thanks for all the help!

Commented:
As a very quick solution you could stop adding to the database and run the returned string from file_get_contents() through the strlen() function.

e.g.

// file_get_contents() returns FALSE on failure, which strlen() treats as length 0
$string = file_get_contents("http://www.google.com");
if (strlen($string) != 0)
{
    echo "website exists";
}

Stopping the insert to the database will cut down on some of the time.
-- CTM

Commented:
You could use get_headers("http://www.google.com").
(http://de3.php.net/manual/en/function.get-headers.php)
You get an array with the HTTP headers if the website is there, or FALSE if the site is down, without transferring the site content. That should be much faster than file_get_contents().
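
For example, a minimal check along these lines (www.example.com is just a placeholder here) might be enough:

// get_headers() returns an array of response headers, or FALSE on failure
$headers = @get_headers("http://www.example.com");
if ($headers === FALSE)
{
    echo "no response - site appears to be down";
}
else
{
    // $headers[0] is the status line, e.g. "HTTP/1.1 200 OK"
    echo "site responded: " . $headers[0];
}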
Most Valuable Expert 2011
Top Expert 2016

Commented:
get_headers() will be MUCH faster than file_get_contents() since it will not need to get all the contents!  In addition, file_get_contents() adds execution time to your script while it waits for the foreign site to prepare and render the HTML.  So it can cause timeouts if the foreign site hangs up.

There may be a good cURL solution, too.  You can set fine-grained connection and transfer timeouts with cURL, which is much harder to control with file_get_contents().
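
For what it's worth, a rough sketch of such a cURL check, using a HEAD request and explicit timeouts (the 5-second values below are just placeholders, not a recommendation), might look like this:

// Sketch: check one URL with cURL via a HEAD request and explicit timeouts
$url = "http://www.example.com"; // placeholder URL

$ch = curl_init($url);
curl_setopt($ch, CURLOPT_NOBODY, TRUE);          // HEAD request - skip the body
curl_setopt($ch, CURLOPT_RETURNTRANSFER, TRUE);  // do not echo the response
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, TRUE);  // follow redirects
curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 5);     // give up connecting after 5 seconds
curl_setopt($ch, CURLOPT_TIMEOUT, 5);            // give up entirely after 5 seconds

$ok   = curl_exec($ch);
$code = curl_getinfo($ch, CURLINFO_HTTP_CODE);
curl_close($ch);

if ($ok !== FALSE && $code >= 200 && $code < 400)
{
    echo "website exists (HTTP $code)";
}
else
{
    echo "no usable response";
}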

I will experiment with this a little bit and post back here when I have some results.  Best regards, ~Ray

Commented:
There's probably not too much you can do about speeding up each individual website query. As d4011 suggested, using get_headers() will reduce the amount you are downloading, but your biggest speed problem is connection time, and there's just no way to really speed that up (unless you're on a modem, in which case upgrading to broadband would drastically help).

Eliminating DB queries won't be much of a help either - 100,000 inserts is usually nothing major for a database unless you have a REALLY slow database or poorly-designed table structures.

However, you COULD divide the list up into several sub-lists (100 lists of 1,000 each). Put each list into its own file in one directory called "unprocessed", then create another directory called "working" and one called "processed".

Have your script pull ONE list from the directory, move the file into the "working" directory, and begin to process those 1,000 sites. At the end of the script, move the file from "working" into "processed" and end the script - DO NOT pull more than one list at a time.

Now just run the script, wait about 10 seconds, and run another instance of the script. Do this about 10 times total and you should have 10 instances of your script collectively working through 10,000 sites, 10 requests at a time. Your network card is most likely capable of handling all the different simultaneous traffic, so it should cut your time down to almost 1/10th of the total.
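
A bare-bones worker along those lines, assuming the three directories described above and list files with one URL per line (the .txt naming is just an assumption), might look something like this:

<?php // worker sketch - claim one list, check its URLs, retire the list
$lists = glob('unprocessed/*.txt');
if (empty($lists)) exit("nothing left to process\n");

// Claim exactly one list by moving it into "working"
$name    = basename($lists[0]);
$working = 'working/' . $name;
if (!@rename($lists[0], $working)) exit("another instance grabbed it first\n");

// Check each URL in the claimed list
$urls = file($working, FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES);
foreach ($urls as $url)
{
    $headers = @get_headers($url);
    $alive   = ($headers !== FALSE);
    // record $alive for $url here (database insert, log file, etc.)
}

// Retire the finished list and stop - do NOT claim another one
rename($working, 'processed/' . $name);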

Commented:
Does a website count as "existing" if it is just a parked domain? A lot of websites that go down get replaced by parked domains, especially if the site was popular. You may want to consider ways to take that into account in your program. I'm not sure of the best way to detect it, though, since most methods would involve getting the entire HTML contents of the site, which is what is slowing you down in the first place! Perhaps some details in the nameserver or headers might tip you off.

Commented:
Ray makes another good point about get_headers() - there may be some time involved in waiting for the rendering of a page, so get_headers() might help eliminate some of that time. If you're looking for the fastest method of processing all the sites, then my approach would probably give you the fastest times. It's simple division of labor.
Most Valuable Expert 2011
Top Expert 2016
Commented:
Here are the stats from my test.

I CHECKED 500 URLS WITH get_headers()
ELAPSED TIME: 22.0503

I CHECKED 500 URLS WITH file_get_contents()
ELAPSED TIME: 33.2863

After tinkering around a little, I think you'll find the following things.

1. The speed of the foreign servers will vary greatly, as will the speed of internet connections.
2. Extrapolating from 500 URLS (efficient code, BTW, since I am testing my own web site) to 100,000 URLS means about 200 runs of roughly 22 seconds each, or around 4,400 seconds of elapsed time.  Since an hour is 3,600 seconds, you're looking at about 1 hour 15 minutes minimum to do this with get_headers(), and at least half again longer with file_get_contents().

I would recommend taking the list of 100,000 URLS, and splitting it up into 100 lists of 1,000 URLS each, then starting 100 separate scripts to run the tests concurrently.  You could be done in just a few minutes!

Best regards, ~Ray
<?php // RAY_get_headers.php
error_reporting(E_ALL);
echo "<pre>\n";
 
// COMPARE THE TIME TO USE file_get_contents() AND get_headers() ON MANY URLS
 
// GENERATE AN ARRAY OF URLS - 500 NUMBERS FOR GET STRINGS
$nums = range(1,500);
foreach ($nums as $num)
{
    $urls[] = 'http://www.laprbass.com/angler.php?a=' . (string)$num;
}
 
 
 
// SET START TIME FOR GET HEADERS
// MAN PAGE: http://us3.php.net/manual/en/function.get-headers.php
$start_get_headers = microtime(TRUE);
 
// ITERATOR
$not_found = 0;
foreach ($urls as $url)
{
    // get_headers() returns FALSE if the request fails entirely
    $headers = get_headers($url);
    if ($headers === FALSE) { $not_found++; continue; }
    if (strpos($headers[0], '404') === FALSE) continue;
    $not_found++;
}
 
// SET END TIME
$finis_get_headers = microtime(TRUE);
 
// COMPUTE AND REPORT
$lapse_get_headers = number_format( ($finis_get_headers - $start_get_headers), 4);
$kount = count($urls);
echo "\nI CHECKED $kount URLS WITH get_headers()\nGOT $not_found 404\nELAPSED TIME: $lapse_get_headers\n";
 
 
 
// SET START TIME FOR GET CONTENTS
// MAN PAGE: http://us3.php.net/manual/en/function.file-get-contents.php
$start_get_contents = microtime(TRUE);
 
// ITERATOR
$not_found = 0;
foreach ($urls as $url)
{
    // NOTE: file_get_contents() returns FALSE (with a warning) when the server
    // sends a 404, so a real 404 never reaches this strpos() test - see the
    // follow-up comment below
    $response = file_get_contents($url);
    if (strpos($response, '404') === FALSE) continue;
    $not_found++;
}
 
// SET END TIME
$finis_get_contents = microtime(TRUE);
 
// COMPUTE AND REPORT
$lapse_get_contents = number_format( ($finis_get_contents - $start_get_contents), 4);
$kount = count($urls);
echo "\nI CHECKED $kount URLS WITH file_get_contents()\nGOT $not_found 404\nELAPSED TIME: $lapse_get_contents\n";

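Building on the idea of splitting the 100,000 URLS into 100 list files, a rough launcher sketch (assuming a Unix-like shell and a hypothetical checker script named check_list.php that takes a list file as its argument) could be as simple as this:

<?php // launcher sketch - starts one background worker per list file
$lists = glob('lists/*.txt'); // assumed location of the 100 list files

foreach ($lists as $list)
{
    // Redirecting output and appending & lets exec() return immediately,
    // so all the workers run concurrently
    exec('php check_list.php ' . escapeshellarg($list) . ' > /dev/null 2>&1 &');
}

echo "Launched " . count($lists) . " workers\n";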

Most Valuable Expert 2011
Top Expert 2016

Commented:
@Frosty555: Good point; however, a parked domain will still return headers and contents just like a regular web site.  Some heuristics would be needed to figure this out.

Best to all, ~Ray
Commented:
@Ray: Regarding the file_get_contents() loop in your script: likely just a copy-and-paste mistake, but file_get_contents() on URLs that send back a 404 header will not return the page content (PHP generates a warning containing "HTTP/1.0 404 Not Found").
So you can leave out the strpos() (which will also save a bit of time on huge responses) and just check against FALSE.
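
In other words, inside that loop, something as simple as this should do (the @ just suppresses the warning PHP raises on the 404):

// file_get_contents() returns FALSE on a 404 (and on other failures),
// so the return value alone tells you whether the page came back
$response = @file_get_contents($url);
if ($response === FALSE)
{
    $not_found++;
}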
Most Valuable Expert 2011
Top Expert 2016

Commented:
Yeah, I know - I was just being lazy when I copied and pasted!

Author

Commented:
Thank you all for your suggestions! I have managed to run my project in about 1/20th of the original time, which is really great!

I used get_headers() and 10 parallel processes, and it works very fast.
Most Valuable Expert 2011
Top Expert 2016

Commented:
Thanks for the points!  This is a great question,  ~Ray
