Solved

Downloading 1,000,000 webpages with Perl

Posted on 2001-08-06
Last Modified: 2010-03-05
That is the task at hand. We have a VERY large list of pages to download, and several servers at our disposal to help download them all: four dual-processor Linux boxes.

Currently, we're using LWP::UserAgent to fetch the pages, and Sys::AlarmCall (a wrapper module around SIGALRM) to monitor each fetch in case it fails to time out properly.

However, this doesn't always seem to work. Sometimes the SIGALRM fails and the page request continues for a very long time. Sometimes it even seems to cause the process to halt.
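
One way to avoid depending on SIGALRM inside the fetching process is to run each fetch in a child process and let the parent kill it once a deadline passes. The following is only a minimal sketch of that idea under stated assumptions, not something taken from this thread: run_with_deadline and the sample jobs are hypothetical names, and in real use the job would wrap an LWP::UserAgent request and save the page to disk.

```perl
use strict;
use warnings;
use POSIX qw(WNOHANG);
use Time::HiRes qw(sleep time);

# Hypothetical helper: run $job in a child process and kill it if it
# is still running after $limit seconds.
# Returns the child's exit code, or undef if the child had to be killed.
sub run_with_deadline {
    my ($limit, $job) = @_;

    my $pid = fork();
    die "fork failed: $!" unless defined $pid;

    if ($pid == 0) {
        # Child: do the work, report success (0) or failure (1) via the
        # exit status. _exit skips END blocks inherited from the parent.
        POSIX::_exit($job->() ? 0 : 1);
    }

    # Parent: poll for the child instead of relying on a signal handler.
    my $deadline = time() + $limit;
    while (time() < $deadline) {
        my $done = waitpid($pid, WNOHANG);
        return $? >> 8 if $done == $pid;
        sleep 0.05;
    }

    kill 'KILL', $pid;     # deadline passed: stop the stuck fetch
    waitpid($pid, 0);      # reap the child so no zombie is left behind
    return;
}

# Stand-in jobs; a real job would fetch one URL and return true on success.
my $fast = run_with_deadline(2, sub { 1 });             # finishes in time
my $slow = run_with_deadline(1, sub { sleep 30; 1 });   # exceeds the deadline

print defined $fast ? "fast: exit $fast\n" : "fast: killed\n";
print defined $slow ? "slow: exit $slow\n" : "slow: killed\n";
```

Because the parent only polls and sends SIGKILL, a fetch that hangs inside a blocking call cannot keep the worker stuck, and several children can run at once, each with its own deadline. The cost is one fork per fetch, so for a list this large it may be worth giving each child a batch of URLs rather than a single one.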

Does anyone have experience with such a project?  What tools/strategies did you employ?  How did you handle requests that timed out?
Question by:brgordon
7 Comments
 
LVL 51

Expert Comment

by:ahoffmann
Is this a duplicate question? Please delete it.
 
LVL 16

Expert Comment

by:maneshr
brgordon,

Did you get a solution you were looking for?

What solution, if any, did you use?

Your response in finalizing this question is appreciated.

Thanks,
 

Author Comment

by:brgordon
maneshr,

I did receive a solution in Perl; however, my final solution was to switch to Java (better thread handling).
Perl's SIGALRM is not reliable enough, and only one alarm timer is allowed per process.

If you have any other questions, let me know.

cheers,
Brett

 

Author Comment

by:brgordon
maneshr,

Sorry, somehow this question got posted twice. ahoffmann has already answered it.

Brett
 

Author Comment

by:brgordon
This question was already posted, but somehow got posted twice. I have already accepted an answer from ahoffmann concerning the question. Please delete this copy.

Thanks,
Brett
 
LVL 16

Expert Comment

by:maneshr
brgordon,

"..Please delete this copy...."

You can delete the question yourself. If you do not know how to delete it, then please post your request, with the URL of this question, to "Community Support" (http://www.experts-exchange.com/jsp/qList.jsp?ta=commspt).
 
LVL 1

Accepted Solution

by:
Moondancer earned 0 total points
I refunded 300 points to you for this question and closed it today.  Sorry for the delay.
Moondancer - EE Moderator