Downloading 1,000,000 web pages with Perl
Posted on 2001-08-06
That is the task at hand. We have a VERY large list of pages to download, and four dual-processor Linux boxes at our disposal to help download them all.
Currently, we're using LWP::UserAgent to fetch the pages, and Sys::AlarmCall (a wrapper module around SIGALRM) to monitor each fetch in case it fails to time out properly.
However, this doesn't always work: sometimes the SIGALRM never fires and the page request runs for a very long time, and occasionally it even seems to halt the process entirely.
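For concreteness, here is a minimal sketch of the kind of guard we're attempting, written with a plain eval/alarm wrapper rather than Sys::AlarmCall (fetch_with_timeout, the URL, and the 30-second timeout are just illustrative, not our production code):

    #!/usr/bin/perl
    use strict;
    use warnings;
    use LWP::UserAgent;
    use HTTP::Request;

    # Hypothetical helper: fetch one URL, giving up after $timeout seconds
    # even if LWP's own timeout never fires (e.g. during a stalled read).
    sub fetch_with_timeout {
        my ($url, $timeout) = @_;
        my $ua = LWP::UserAgent->new(timeout => $timeout);
        my $response;
        eval {
            local $SIG{ALRM} = sub { die "fetch timed out\n" };
            alarm($timeout);
            $response = $ua->request(HTTP::Request->new(GET => $url));
            alarm(0);    # cancel the pending alarm on a normal return
        };
        alarm(0);        # make sure no alarm is left armed either way
        if ($@) {
            die $@ unless $@ eq "fetch timed out\n";    # re-throw real errors
            return undef;                               # fetch timed out
        }
        return $response->is_success ? $response->content : undef;
    }

    # Example: give each page at most 30 seconds.
    my $html = fetch_with_timeout('http://www.example.com/', 30);

The second alarm(0) is there so a timeout raised between the inner alarm(0) and the end of the eval can't leave a stale alarm armed. Even with a pattern like this, we still see fetches that outlive their alarm.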
Does anyone have experience with such a project? What tools/strategies did you employ? How did you handle requests that timed out?