LWP::Parallel::UserAgent chunk size

Posted on 2006-07-05
Last Modified: 2012-06-21
I want to use LWP::Parallel::UserAgent to issue HTTP requests and have the responses processed by a callback function. I want to process each response's data all at once, so I set the chunk size to be very large (100 MB), expecting each chunk to be the whole page. That didn't work: each chunk is still only about 1460 bytes.

What could I be doing wrong? Is this a common problem?

Thanks,
Jeff

-------------------------------

use strict;
use LWP::Parallel::UserAgent;


my $ua = LWP::Parallel::UserAgent->new();
$ua->max_hosts(5); # sets maximum number of locations accessed in parallel
$ua->max_req(5); # sets maximum number of parallel requests per host

while(<>) #read in urls
{
        chomp;
        my $request = HTTP::Request->new(GET => $_);
        print $request;
        $ua->register($request, \&uaCallback, 100000000);
}

$ua->wait ();

sub uaCallback
{
        my($data, $response, $protocol) = @_;
        print "BASE: ", $response->base(), "\n";
        print "LENGTH: ", length($data), "\n";
}

----------------------------------

Output:

echo "http://arctic.fws.gov/permglos.htm" | perl oneDeepCrawl.pl
HTTP::Request=HASH(0x8279c64)BASE: http://arctic.fws.gov/permglos.htm
LENGTH: 1159
BASE: http://arctic.fws.gov/permglos.htm
LENGTH: 1448
BASE: http://arctic.fws.gov/permglos.htm
LENGTH: 1448
BASE: http://arctic.fws.gov/permglos.htm
LENGTH: 1484
BASE: http://arctic.fws.gov/permglos.htm
LENGTH: 1448
BASE: http://arctic.fws.gov/permglos.htm
LENGTH: 1448
BASE: http://arctic.fws.gov/permglos.htm
LENGTH: 1484
BASE: http://arctic.fws.gov/permglos.htm
LENGTH: 2896
BASE: http://arctic.fws.gov/permglos.htm
LENGTH: 4380
BASE: http://arctic.fws.gov/permglos.htm
LENGTH: 1200
BASE: http://arctic.fws.gov/permglos.htm
LENGTH: 432



-------------------------------

Thanks again.
Question by:BerkeleyJeff

Accepted Solution

by:
jmcg earned 2000 total points
ID: 17055688
Why parallel if you're only interested in handling completed responses?

Alter your callback to accumulate the incoming data in the response object:

   if( length $data) { $response->add_content( $data); }

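You will have to figure out the best way to determine when you've seen the last of the data so you can process it as a single chunk. Perhaps check whether

    $response->content_length > length( $response->content)

(content_length is only defined when the server sends a Content-Length header, so this test cannot be relied on for every response.)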
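A minimal sketch of the accumulation idea, under a few assumptions: the size passed to register() is only a hint, so the callback just appends whatever arrives to the response object, and the complete bodies are read once wait() returns. The names accumulate_chunk and report_lengths are made up for illustration, and the sketch assumes the response object seen by the callback is the same one returned through the entries from wait(). The network part only runs when URLs are given on the command line.

```perl
use strict;
use warnings;

# Callback (hypothetical name): append each incoming chunk to the response.
sub accumulate_chunk {
    my ($data, $response, $protocol) = @_;
    $response->add_content($data) if length $data;
    # Return the number of bytes handled; negative constants such as
    # C_ENDCON are reserved for ending connections early.
    return length $data;
}

# Helper (hypothetical name): once wait() has returned, every response
# should hold its whole body, so it can be processed in one piece.
sub report_lengths {
    my ($entries) = @_;
    my %length_for;
    for my $key (keys %$entries) {
        my $response = $entries->{$key}->response;
        $length_for{ $response->base } = length $response->content;
    }
    return \%length_for;
}

# Only touch the network when URLs are passed as arguments.
if (@ARGV) {
    require LWP::Parallel::UserAgent;
    require HTTP::Request;

    my $ua = LWP::Parallel::UserAgent->new();
    $ua->max_hosts(5);    # maximum number of hosts accessed in parallel
    $ua->max_req(5);      # maximum number of parallel requests per host

    for my $url (@ARGV) {
        # register() returns an error response on failure, undef on success
        my $error = $ua->register(HTTP::Request->new(GET => $url),
                                  \&accumulate_chunk);
        warn "Could not register $url\n" if $error;
    }

    my $lengths = report_lengths($ua->wait());
    print "$_: $lengths->{$_} bytes\n" for sort keys %$lengths;
}
```

If no per-chunk work is needed at all, the module's documentation suggests that leaving the callback out of register() stores the content in the response object for you, which would make the callback above unnecessary.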



===============

See this article by Randal Schwartz -- while it's old, it should not need much updating to re-use his methods:

http://www.stonehenge.com/merlyn/WebTechniques/col27.html


Author Comment

by:BerkeleyJeff
ID: 17063227
JMCG,

Thanks for your response.

Perhaps I'm misunderstanding the purpose of ParallelUA. My goal was to download a large number of different pages simultaneously. Is this not what ParallelUA is for? Is ParallelUA for downloading a single page using multiple connections?

Expert Comment

by:jmcg
ID: 17063362
Yep, parallel will help you get through a longer list of pages. You only get one connection working per URL.

If you can get what you need from the responses without downloading them all the way to the end, you can also speed things up by returning C_ENDCON from your callback.
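A rough sketch of stopping early, with several assumptions: the goal here (grabbing each page's title) is only an example, make_early_stop_callback is a made-up helper, and the constant is assumed to be reachable as LWP::Parallel::UserAgent::C_ENDCON() (the documented route is importing it with qw(:CALLBACK)). The stop code is passed in as a parameter so the callback logic does not depend on how the constant is obtained.

```perl
use strict;
use warnings;

# Hypothetical factory: builds a callback that buffers data per URL and
# returns $stop_code as soon as $pattern matches the buffered data.
sub make_early_stop_callback {
    my ($pattern, $stop_code, $found) = @_;
    my %buffer;
    return sub {
        my ($data, $response, $protocol) = @_;
        my $url = $response->base;
        $buffer{$url} .= $data;
        if ($buffer{$url} =~ $pattern) {
            $found->{$url} = $1;    # keep the first captured group
            return $stop_code;      # ask the user agent to end this connection
        }
        return length $data;        # otherwise keep reading
    };
}

# Only touch the network when URLs are passed as arguments.
if (@ARGV) {
    require LWP::Parallel::UserAgent;
    require HTTP::Request;

    my %title_for;
    my $callback = make_early_stop_callback(
        qr{<title>\s*(.*?)\s*</title>}is,
        LWP::Parallel::UserAgent::C_ENDCON(),    # assumed location of the constant
        \%title_for,
    );

    my $ua = LWP::Parallel::UserAgent->new();
    $ua->register(HTTP::Request->new(GET => $_), $callback) for @ARGV;
    $ua->wait();

    print "$_: $title_for{$_}\n" for sort keys %title_for;
}
```

Buffering per URL matters because the pattern can be split across two chunks, as the chunk lengths in the question's output suggest.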