Solved

Lower level control over LWP (speed limiting, aborting)

Posted on 2009-07-16
1,140 Views
Last Modified: 2012-06-22
There are two things I've been wanting to know how to do for a while now.  I'm not even sure they're possible the way I want them done.

Constraints:

Must use the LWP (libwww-perl) module.

Desired abilities:

1) I'd like to be able to throttle the speed of downloads performed via LWP.  Basically, this would be the equivalent of cURL's or GetRight's "speed limit" functionality.  My thought was to -possibly- do this by putting a fairly granular select(undef,undef,undef,$timer) inside a handler assigned to the request, which would get called on each chunk.  A couple of problems with that:

a) I don't know if even small sleeps will be acceptable.  I do know for a fact that if you have a raw socket and don't read often enough as data streams in, you suffer data dropout.  I'm not sure if HTTP is actually designed in such a way that it's synchronous to the point it won't overrun the socket's buffer.

b) I'm unsure of how to figure out what timeout to put on the select() if I put it in there.  It'd have to be based on how big the chunk size is, and how many chunks it would take to achieve the correct throughput over time.  It'd also have to account for actual time elapsed and decide whether or not to actually sleep a little bit.  It may even need to be dynamic (say, if the buffer isn't coming back with the entire requested size).  The exact algorithm eludes me.
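For concreteness, the per-chunk arithmetic described in (b) can be sketched as below. The chunk size, target rate, and elapsed time are purely illustrative numbers, not anything LWP prescribes:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Illustrative numbers only: 4096-byte chunks, 16 KB/s target rate.
my $chunk_bytes = 4096;
my $target_rate = 16 * 1024;          # bytes/second

# At the target rate, each chunk "should" take this long:
my $ideal_per_chunk = $chunk_bytes / $target_rate;   # 0.25 s

# Suppose the chunk actually arrived in $elapsed seconds; sleep the
# difference, but never a negative amount (that would mean the link
# is already slower than the cap).
my $elapsed   = 0.05;
my $sleep_for = $ideal_per_chunk - $elapsed;
$sleep_for = 0 if $sleep_for < 0;

printf("sleep %.2f seconds\n", $sleep_for);   # sleep 0.20 seconds
```

A dynamic version would measure $elapsed per chunk (e.g. with Time::HiRes) and shrink or skip the sleep whenever a chunk comes back short or slow.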

2) I'd like to know how to abort an LWP request without terminating the entire program or otherwise doing something that would mess about with program functionality in an unwanted fashion.  In the instance I'm considering, I'm doing downloads inside a Perl/Tk application.  I have an abort function, but it finishes the current file and then stops before requesting another.  I want to take it to the next level and have it do a proper abort in mid-download.  The docs for the handlers (specifically the data handler) seemed to mention something about croaking to abort, but my reading on croak() said that it's equivalent to die(), which is not what I want.  I simply want to abort the download, yet retain the instance of the running program and return control to the main Tk event loop in this case, or in other programs, simply continue onwards.

It's not like I haven't researched either of these points at all.  It's that I've not found anything clear, concise, and definitive.  Everything I've found on aborting is hypothetical and sketchy.  My own ideas on speed throttling are hypothetical, and I've not been able to really find a reference at all to doing it within the LWP framework.

If you can provide working code examples (skeletons are fine), so much the better.

Thanks in advance!
Question by:Fairlight2cx

8 Comments
LVL 39

Expert Comment

by:Adam314
ID: 24874553
For bandwidth limiting, I can think of a few possible methods, though I've not used any of them:
1) Get the file in a series of small chunks, using a normal LWP request for each chunk.  Each chunk would be retrieved at full speed, but you could sleep between chunks to hit an overall bandwidth target.
2) I think squid (a proxy server) has bandwidth-limiting options.  Have LWP connect through that proxy.
3) If you are on a Linux system, use traffic control:
    http://lartc.org/howto/
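For method 1, the byte range for the Nth chunk is simple arithmetic; a sketch, with the chunk size and index made up for illustration:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical chunk size and chunk index.
my $chunk = 1024;
my $n     = 3;

# HTTP Range headers are inclusive on both ends.
my $start = $n * $chunk;
my $end   = $start + $chunk - 1;
my $range = "bytes=$start-$end";

print "$range\n";   # bytes=3072-4095
# This value would then go into the request, e.g.:
#   $ua->get($url, 'Range' => $range);
```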


For aborting: Are you running the download in a separate thread?  If not, and the download is in an event procedure, Tk won't even process your cancel button click until the download is complete.
 
LVL 7

Author Comment

by:Fairlight2cx
ID: 24875772
For bandwidth limiting, that's pretty much what I originally said in terms of chunks and sleeping between them.  I said it was the exact algorithm that eludes me.  For the sake of ease, assume 1024-byte chunks.  Can you give me code that would dynamically hold the download at a rate tunable via a variable?

Neither squid nor assuming Linux is an option.  The constraint clearly said this must be done within LWP.  If you put a package out there, you have no control over the running environment, nor over the environment it connects to if it's meant to reach multiple endpoints not under your control.  (Think about download managers.)

As for aborting--separate thread?  You're kidding, right?  Perl/Tk isn't thread-safe, from everything I've read over the years, nor is it likely to become so: the maintainer died, and I don't think anyone's taken over the module for anything but bug fixes.  And no, I'm not running the download out of a child process and using IPC, nor do I plan to change to that architecture.  Of course the download isn't going to naturally handle events; that's why I use a handler on the request() which calls update() on the entire main TopLevel window, recursively (read: program-wide), which -does- let the Tk event handler process UI events.  How else have I had a progress bar working for over half a year during downloads, along with an abort button and bindings that at least trap the fact that the user wants to abort?  Handling the events is not the problem.  Knowing what to do once I've received this particular event, in order to abort the download as one would in other software, -is- the problem.

So far, I haven't heard anything I didn't already know that actually falls within the constraints I set forth.

Next...

Next...
 
LVL 39

Assisted Solution

by:Adam314
Adam314 earned 200 total points
ID: 24881655
For (1) I meant, but didn't say clearly, using the Range header of the HTTP protocol.  With it, you can tell the server you want only a given byte range of the data.

As to threading - Perl/Tk isn't thread-safe.  You can still use threads, though, if all of your Tk calls are made from a single thread.  That would require a change to what you have, which it doesn't sound like you want.

Here is some sample code to control the bandwidth.
#!/usr/bin/perl
use strict;
use warnings;
use Time::HiRes qw(usleep gettimeofday tv_interval);
use LWP;

my $URL = 'http://SomeServer.com/Some/file.txt';
my $DesiredRate = 100;   # bytes/second
my $ChunkSize = 100;

my $ua = LWP::UserAgent->new;
my $res = $ua->head($URL);
my $Len = $res->header('Content-Length');

open(my $out, '>', 'output.txt') or die "Output: $!\n";

my $ChunkStart = 0;
my $Downloaded = 0;
while($Downloaded < $Len) {
	my $t0 = [gettimeofday];
	$res = $ua->get($URL, 'Range' => "bytes=$ChunkStart-" . min($ChunkStart + $ChunkSize - 1, $Len));
	my $ElapsedSeconds = tv_interval($t0);
	my $Content = $res->content;
	print $out $Content;
	my $ContentLength = length($Content);
	my $DesiredSeconds = $ContentLength/$DesiredRate;
	my $SleepTimeSeconds = $DesiredSeconds - $ElapsedSeconds;
	usleep($SleepTimeSeconds * 1_000_000) if $SleepTimeSeconds > 0;   # never sleep a negative amount
	$ChunkStart += $ContentLength;   # advance to the next range
	$Downloaded += $ContentLength;
}
close($out);

sub min {
	return $_[0] if $_[0] < $_[1];
	return $_[1];
}

 
LVL 7

Author Comment

by:Fairlight2cx
ID: 24885007
Unacceptable.  I'm sorry, but there is no way I could sanction a solution that could well be construed as a denial-of-service attack on a server.

You're talking about taking far more resources on the server than necessary to handle this: in all likelihood the daemon would launch more children as you exceeded any sane max-requests-per-child setting, connections would be stuck in TIME_WAIT for a while, and a server like Apache would be spawning multiple children, probably per second at any usable threshold, with all the RAM use that incurs.  Then there's the extra bandwidth of that many sets of headers going out, not to mention holding -that- many sockets open on one system.

I mean, this is a great solution--if your goal is to get client IPs banned from servers, or worse.  I admin systems for a living, and believe me, I'd null route a source address that did this within a quarter of an hour, plus notify the remote's provider and get their account shut down, since null routing doesn't actually protect your pipe from all the SYN attempts.

That is arguably one of the most horrendous "solutions" I've ever seen.  Obviously you weren't thinking of the big picture here.

Part of the code (the time/rate algorithm) is arguably useful, but not in this context; it'd have to be adapted.  That gives me a bit of a baseline for a quarter of my request, but the framework is all wrong.

Okay, since it apparently wasn't obvious, I'm going to have to explicitly state it:

ADDITIONAL CONSTRAINT:

The rate limiting MUST take place within -one- request under LWP.
 
LVL 7

Author Comment

by:Fairlight2cx
ID: 24889029
Okay, I've done some testing of my own, and determined:

1) If you croak() inside a handler for request(), the request aborts and the program continues.  You simply need to use Carp to get this functionality.

2) I have -most- of the sleep functionality done.  It -is- possible to sleep/usleep in the middle of the handler for request().  The part I'm currently having problems with is this: why does the request stop dead (yet the program continue) if I call tv_interval() inside the handler?  I've narrowed it down to that one statement causing request() to cease.

If I can get that issue worked out, I think I have viable solutions.
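The abort behavior in (1) can be demonstrated without any network involved: LWP runs the content handler inside an eval, so a croak() (or die) in the handler ends the request while the surrounding program keeps running.  A standalone simulation of that wrapping (the handler name and message are just placeholders):

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Carp;

my $abort = 0;

# Stand-in for the content handler passed to request().
sub dhandler {
    my ($chunk) = @_;
    croak("transfer aborted") if $abort;
    return;
}

# Simulate what LWP does: the handler runs inside an eval, so its
# croak/die ends the "request" but not the program.
$abort = 1;
eval { dhandler("some data") };
my $survived = ($@ =~ /transfer aborted/) ? 1 : 0;
print "request aborted, program continues: $survived\n";
```

In a real LWP request, the partial response comes back with an X-Died header recording the handler's death message.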

Thanks to Adam314 for the time code you posted.  Your overall multi-segment methodology was still an awful idea, but the time algorithm is part of what I was looking for.

Any ideas on the tv_interval issue?

I'm attaching my proof of concept snippet.  This doesn't have the croak example (that's just plain trivial), but rather deals entirely with the throttling.  The rate argument is assumed to be in KB for the purposes of the proof of concept.
#!/usr/bin/perl

use strict;

use Carp;
use LWP;
use Time::HiRes;

my $url = shift;
die("No URL given.\n") unless defined(${url});

my $maxrate = shift;
$maxrate = 0 unless defined(${maxrate});
my $tempxbytes = 0;
my ($elapsed,$desired,$stime);

my $agent = new LWP::UserAgent;
my $request = HTTP::Request->new(GET => ${url});
my $response = undef;
my $totalbytes = 0;
my $starttime = 0;
for (my $x = 0;${x} < 10;$x++) {
     print("MAIN LOOP ${x}\n");
     $starttime = Time::HiRes::gettimeofday if ${x} == 5;
     $response = ${agent}->request(${request},\&dhandler,1024) if ${x} == 5;
}
print("Total Bytes Received: ${totalbytes}\n");
print((${totalbytes}/(time - ${starttime})) / 1024," KB/s\n");
exit;

sub dhandler {
     my $chunk = shift;
     my $len = length(${chunk});
     $totalbytes += ${len};
     $tempxbytes += ${len};
     print("Received ${len} bytes.  [${totalbytes} total]\n");
     if (${maxrate}) {
          print("TEMPX: ",${tempxbytes} / 1024,"\nMAX: ${maxrate}\n");
          if ((${tempxbytes} / 1024) > ${maxrate}) {
               print("Rate exceeded - THROTTLE.\n");
               $elapsed = Time::HiRes::tv_interval(${starttime});
               print("elapsed: ${elapsed}\n");
               $desired = ${tempxbytes} / ${maxrate};
               $stime = ${desired} - ${elapsed};
               print("stime: ${stime}\n");
               Time::HiRes::usleep(${stime} * 1000000);
               $starttime = Time::HiRes::gettimeofday;
               $tempxbytes = 0;
          }
     }
     return;
}

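On the tv_interval() issue: Time::HiRes::tv_interval() expects an array reference of the form produced by [gettimeofday] in list context.  Assigning gettimeofday in scalar context (as the snippet above does) yields a plain float, and tv_interval() dies trying to dereference it as an array.  Inside an LWP handler, that die silently aborts the request while the program continues, which would match the symptom described.  A standalone check of both forms:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Time::HiRes qw(gettimeofday tv_interval);

# Correct: gettimeofday in LIST context, stored as an array reference.
my $t0 = [gettimeofday];
select(undef, undef, undef, 0.01);          # sleep ~10 ms
my $elapsed = tv_interval($t0);             # works: arrayref in, seconds out

# Incorrect: scalar context yields a float; tv_interval() dies trying
# to treat it as an array reference.
my $bad_t0 = gettimeofday;
my $died = 0;
eval { tv_interval($bad_t0) };
$died = 1 if $@;

print "elapsed non-negative: ", ($elapsed >= 0 ? 1 : 0), "\n";
print "scalar form died: $died\n";
```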
 
LVL 7

Accepted Solution

by:
Fairlight2cx earned 0 total points
ID: 24889405
I've got a fully working proof of concept with both transfer abort -and- rate limiting, all within a single LWP request.  This, of course, can be extended to any LWP request as long as the requisite variables and handler are used.  It should fold just fine into any Tk framework, assuming one uses the SIGINT handler's functionality with a binding/button/whatever.

I'm giving partial credit to Adam314, since his time algorithm did help a lot, even if the framework was wayyyyy off.  In the end, I answered 3/4 of my own questions (including things like, "Will sleeps even be allowable without data dropout?" which were never addressed by Adam314, and which I tested rigorously).

For the curious, here's the proof of concept:

#!/usr/bin/perl

use strict;

use Carp;
use LWP;
use Time::HiRes;

$SIG{'INT'} = \&xfer_int;
my $midstream_abort = 0;

my $url = shift;
die("No URL given.\n") unless defined(${url});

my $maxrate = shift;
$maxrate = 0 unless defined(${maxrate});
my $tempxbytes = 0;
my ($elapsed,$desired,$stime);

my $agent = new LWP::UserAgent;
my $request = HTTP::Request->new(GET => ${url});
my $response = undef;
my $totalbytes = 0;
my $starttime = 0;
my $tstarttime = 0;
for (my $x = 0;${x} < 10;$x++) {
     print("MAIN LOOP ${x}\n");
     if (${x} == 5) {
          $starttime = [Time::HiRes::gettimeofday];
          $tstarttime = time;
          $response = ${agent}->request(${request},\&dhandler,1024);
          print("Total Bytes Received: ${totalbytes}\n");
          print((${totalbytes}/((time - ${tstarttime}) || 1)) / 1024," KB/s\n");   # || 1 guards sub-second runs
     }
}
exit;

sub dhandler {
     my $chunk = shift;
     my $len = length(${chunk});
     if (${midstream_abort}) {
          $midstream_abort = 0;
          croak;
     }
     $totalbytes += ${len};
     $tempxbytes += ${len};
     if (${maxrate}) {
          if ((${tempxbytes} / 1024) > ${maxrate}) {
               $desired = ${tempxbytes} / ${maxrate};
               $elapsed = Time::HiRes::tv_interval(${starttime});
               $stime = ${desired} - ${elapsed};
               Time::HiRes::usleep(${stime} * 1000) if ${stime} > 0;   # usleep() dies on negative values
               $starttime = [Time::HiRes::gettimeofday];
               $tempxbytes = 0;
          }
     }
     return;
}

sub xfer_int {
     $midstream_abort = 1;
}

 
LVL 39

Expert Comment

by:Adam314
ID: 24892041
The code I gave has only one request active at a time.  There is the overhead of the additional headers, though.  I tried it several times and found no problems using it.

I believe the code you gave will throttle the reading of the input buffer and the writing to the file--not the actual traffic over the network.

Anyway, if you are happy with what you have, I'll consider this finished.
 
LVL 7

Author Comment

by:Fairlight2cx
ID: 24892703
It's not possible to just ignore a socket buffer accumulating more than ~64KB of data without losing some.  Throttling this to 10K would surely lose data if that were happening, yet it doesn't--even on localhost, where it would certainly show up.

Out of paranoia, I tried this on WinXP and watched the TCP graph and levels with Iarsn TaskInfo in both throttled and non-throttled modes.  It confirmed that it actually stems the net traffic itself--albeit not in a smooth line, given the bursty way it processes the data--but it is throttling the network traffic, not just the buffer.

HTTP must be synchronous/handshaking, at least in chunked mode.  There's no other way the data wouldn't be lost.

By the way... you might look at your solution on a unix box and monitor the child count of Apache, and the results of netstat, while it works.  You'll find the issues there, not on the client side.  Your method has the same inherent issues that cause admins to ban people for wardialing FTP sites without delays.