Solved

LWP::Parallel::UserAgent chunk size

Posted on 2006-07-05
Last Modified: 2012-06-21
I want to use LWP::Parallel::UserAgent to issue HTTP requests and have the responses processed by a callback function. I want to process each response's data all at once, so I've set the chunk size to be very large (100 MB). I was thinking that this would make each chunk be the whole page. However, that didn't work. Each chunk comes out at about 1460 bytes anyway.

What could I be doing wrong? Is this a common problem?

Thanks,
Jeff

-------------------------------

use strict;
use LWP::Parallel::UserAgent;


my $ua = LWP::Parallel::UserAgent->new();
$ua->max_hosts(5); # sets maximum number of locations accessed in parallel
$ua->max_req(5); # sets maximum number of parallel requests per host

while(<>) #read in urls
{
        chomp;
        my $request = HTTP::Request->new(GET => $_);
        print $request;
        $ua->register($request, \&uaCallback, 100000000);
}

$ua->wait ();

sub uaCallback
{
        my($data, $response, $protocol) = @_;
        print "BASE: ", $response->base(), "\n";
        print "LENGTH: ", length($data), "\n";
}

----------------------------------

Output:

echo "http://arctic.fws.gov/permglos.htm" | perl oneDeepCrawl.pl
HTTP::Request=HASH(0x8279c64)BASE: http://arctic.fws.gov/permglos.htm
LENGTH: 1159
BASE: http://arctic.fws.gov/permglos.htm
LENGTH: 1448
BASE: http://arctic.fws.gov/permglos.htm
LENGTH: 1448
BASE: http://arctic.fws.gov/permglos.htm
LENGTH: 1484
BASE: http://arctic.fws.gov/permglos.htm
LENGTH: 1448
BASE: http://arctic.fws.gov/permglos.htm
LENGTH: 1448
BASE: http://arctic.fws.gov/permglos.htm
LENGTH: 1484
BASE: http://arctic.fws.gov/permglos.htm
LENGTH: 2896
BASE: http://arctic.fws.gov/permglos.htm
LENGTH: 4380
BASE: http://arctic.fws.gov/permglos.htm
LENGTH: 1200
BASE: http://arctic.fws.gov/permglos.htm
LENGTH: 432



-------------------------------

Thanks again.
Question by:BerkeleyJeff
LVL 20

Accepted Solution

by:
jmcg earned 500 total points
ID: 17055688
Why parallel if you're only interested in handling completed responses?

Alter your callback to accumulate the incoming data in the response object:

   if( length $data) { $response->add_content( $data); }

You will have to figure out the best way to determine when you've seen the last of the data so you can process it as a single chunk. Perhaps check whether the Content-Length header (when the server sends one) still exceeds what has arrived so far:

    $response->content_length > length( $response->content)
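
A minimal sketch of the accumulate-then-process idea, not tested against LWP::Parallel itself. It assumes HTTP::Request and HTTP::Response are installed (any LWP installation provides them). The names accumulate_chunk and process_whole_page are made up for illustration, and the loop at the bottom only stands in for the network by feeding the callback chunks of the sizes shown in the question's output:

```perl
use strict;
use warnings;
use HTTP::Request;
use HTTP::Response;

# Callback with the same argument order as uaCallback in the question.
# It only stores the incoming chunk; no processing happens here.
sub accumulate_chunk {
    my ($data, $response, $protocol) = @_;
    $response->add_content($data) if length $data;
    # Return a true value, as the original callback did implicitly
    # (its last statement was a print), so reading continues.
    return 1;
}

# Hypothetical helper: runs once per response, after every chunk has arrived.
sub process_whole_page {
    my ($response) = @_;
    printf "%s: %d bytes in one piece\n",
        $response->request->uri, length $response->content;
}

# Stand-in for the network (an assumption for this demo only):
# build an empty response and feed it a few chunks by hand.
my $response = HTTP::Response->new(200, 'OK');
$response->request(HTTP::Request->new(GET => 'http://arctic.fws.gov/permglos.htm'));
accumulate_chunk('x' x $_, $response, undef) for 1159, 1448, 1448, 1484;

process_whole_page($response);
```

With the real module, the same callback would be passed to $ua->register($request, \&accumulate_chunk), and the processing step would run once $ua->wait() has returned. As far as I recall from the module's documentation, wait() returns a hash reference of entries whose response method gives the HTTP::Response, but check that against your installed version.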



===============

See this article by Randal Schwartz -- while it's old, it should not need much updating to re-use his methods:

http://www.stonehenge.com/merlyn/WebTechniques/col27.html


Author Comment

by:BerkeleyJeff
ID: 17063227
JMCG,

Thanks for your response.

Perhaps I'm misunderstanding the purpose of ParallelUA. My goal was to download a large number of different pages simultaneously. Is this not what ParallelUA is for? Is ParallelUA for downloading a single page using multiple connections?
LVL 20

Expert Comment

by:jmcg
ID: 17063362
Yep, parallel will help on getting through a longer list of pages. You only get one connection working per URL.

If you can get what you need from the responses without downloading them all the way to the end, it also can speed things up (by returning C_ENDCON from your callback).