Getting page title from 20,000 pages in an hour
Posted on 2009-06-29
If I had a lot of URLs (15k-20k) that I wanted to get the page titles for, like google.com, experts-exchange.com, etc., how fast could they be processed?
What I assume I'd do, and correct me if I'm wrong, is read each external document with fgets (or something to the same effect) only until I reach the closing title tag, so I don't have to pull the whole page. Then I'd extract the title from what I've read and move on to the next URL in the queue.
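To make the idea concrete, here is a rough sketch of the "stop reading once the title is found" approach. It is written in Python purely for illustration. The names `title_from_chunks` and `fetch_title`, the 64 KB read cap, the 5-second timeout, and the UTF-8 decoding are all assumptions of mine, not anything a particular library prescribes.

```python
import re
import urllib.request

# Matches the first <title>...</title> pair, case-insensitively, across newlines.
TITLE_RE = re.compile(rb"<title\b[^>]*>(.*?)</title\s*>", re.IGNORECASE | re.DOTALL)


def title_from_chunks(chunks, max_bytes=65536):
    """Accumulate byte chunks and return the title as soon as </title> is seen.

    Returns None if no complete title shows up within max_bytes (assumed cap).
    """
    buf = b""
    for chunk in chunks:
        buf += chunk
        match = TITLE_RE.search(buf)
        if match:
            # Assumes UTF-8; undecodable bytes are replaced rather than raising.
            text = match.group(1).decode("utf-8", errors="replace")
            return " ".join(text.split())  # collapse runs of whitespace
        if len(buf) >= max_bytes:
            break  # give up instead of downloading the whole page
    return None


def fetch_title(url, timeout=5, chunk_size=1024):
    """Open the URL and read it in small pieces until the title is found."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        # iter(callable, sentinel) keeps calling resp.read until it returns b"".
        return title_from_chunks(iter(lambda: resp.read(chunk_size), b""))
```

Something like `fetch_title("http://example.com/")` would then read only as much of the page as is needed. A regular expression is a shortcut here and will miss unusual markup.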
If this takes 3 seconds on the average site, the most I could process with one script running for an hour would be about 1,200 (20 per minute). So if I needed to get more than that (10k, 15k, 20k), how many of these processes could I run simultaneously before I overload my server or notice a big lag in regular content being served?
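Working the numbers the other way gives a lower bound on how many simultaneous fetches would be needed. This is a back-of-the-envelope sketch, again in Python for illustration. It assumes the 3-second average holds under load, and `workers_needed` and `run_pool` are hypothetical helper names.

```python
import math
from concurrent.futures import ThreadPoolExecutor


def workers_needed(total_urls, seconds_per_url, deadline_seconds):
    """Minimum number of parallel workers to finish within the deadline,
    assuming every URL takes seconds_per_url on average."""
    return math.ceil(total_urls * seconds_per_url / deadline_seconds)


def run_pool(urls, fetch, workers):
    """Apply fetch to every URL using a fixed-size thread pool.

    Threads suit this job because each worker spends most of its time
    waiting on the network rather than using the CPU.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(fetch, urls))  # results come back in input order


# 20,000 URLs at 3 seconds each within one hour:
# 20,000 * 3 / 3,600 = 16.67, rounded up to 17 workers.
print(workers_needed(20000, 3, 3600))
```

By the same arithmetic, 10k URLs would need at least 9 workers and 15k at least 13. In practice slow or dead sites would push the average above 3 seconds, so a per-request timeout matters as much as the worker count.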
I understand this all depends on how awesome your hardware and your connection are, etc., but for the sake of example, let's say this is being done on some sort of average 1and1, Rackspace, or MT dedicated business server. Nothing too fancy, but it's also not a dinosaur.
Also, is there any other way besides this that I'm not thinking about?