Fairlight2cx (United States of America) asked:

Lower level control over LWP (speed limiting, aborting)

There are two things I've been wanting to know how to do for a while now.  I'm not even sure they're possible the way I want them done.

Constraints:

Must use the LWP (libwww perl) module.

Desired abilities:

1) I'd like to be able to throttle the speed of downloads being performed via LWP.  Basically, this would be equivalent to cURL's or GetRight's "speed limit" functionality.  I had thoughts on being able to -possibly- do this by putting a fairly granular select(undef,undef,undef,$timer) inside a handler assigned to the request, which would get called on each chunk.  A couple of problems with that:

a) I don't know if even small sleeps will be acceptable.  I do know for a fact that if you have a raw socket and don't read often enough as data streams in, you suffer data dropout.  I'm not sure if HTTP is actually designed in such a way that it's synchronous to the point it won't overrun the socket's buffer.

b) I'm unsure of how to figure out what timeout to put on the select() if I put it in there.  It'd have to be based on how big the chunk size is, and how many chunks it would take to achieve the correct throughput over time.  It'd also have to account for actual time elapsed and decide whether or not to actually sleep a little bit.  It may even need to be dynamic (say, if the buffer size isn't coming back with the entire requested size).  The exact algorithm eludes me.

2) I'd like to know how to abort an LWP request without terminating the entire program or otherwise doing something that would mess about with program functionality in an unwanted fashion.  In the instance I'm considering, I'm doing downloads inside a Perl/Tk application.  I have an abort function, but it finishes the current file and then stops before requesting another.  I want to take it to the next level and have it do a proper abort in mid-download.  The docs for the handlers (specifically the data handler) seemed to mention something about croaking to abort, but my reading on croak() said that it's equivalent to die(), which is not what I want.  I simply want to abort the download, yet retain the instance of the running program and return control to the main Tk event loop in this case, or in other programs, simply continue onwards.

It's not like I haven't researched either of these points at all.  It's that I've not found anything clear, concise, and definitive.  Everything I've found on aborting is hypothetical and sketchy.  My own ideas on speed throttling are hypothetical, and I've not been able to really find a reference at all to doing it within the LWP framework.

If you can provide working code examples (skeletons are fine), so much the better.

Thanks in advance!
Adam314:

For bandwidth limiting, I can think of a few possible methods, though I've not used any of them:
1) Get files in a bunch of small chunks, using normal LWP to get each chunk.  Each chunk would be retrieved at full speed, but then you could sleep in between chunks to get an overall bandwidth usage.  (See the sketch after this list.)
2) I think squid (a proxy server) has some bandwidth limiting options.  Have your LWP connect through that proxy.
3) If you are on a linux system, use traffic control:
    http://lartc.org/howto/
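
To illustrate method 1, here's a rough sketch using HTTP Range requests through a single LWP::UserAgent.  The URL, chunk size, and rate are made up for illustration, and it assumes the server honors Range requests (responding with 206 Partial Content):

#!/usr/bin/perl
use strict;
use warnings;
use LWP::UserAgent;
use Time::HiRes qw(sleep);   # fractional-second sleep

my $url       = 'http://example.com/bigfile';   # hypothetical URL
my $chunksize = 8192;        # bytes fetched per request
my $rate      = 10 * 1024;   # target bytes/sec overall
my $offset    = 0;
my $agent     = LWP::UserAgent->new;

while (1) {
    my $req = HTTP::Request->new(GET => $url);
    $req->header(Range => sprintf("bytes=%d-%d", $offset, $offset + $chunksize - 1));
    my $res = $agent->request($req);
    last unless $res->code == 206;   # server must honor Range requests
    my $got = length($res->content);
    last if $got == 0;
    # ...append $res->content to the output file here...
    $offset += $got;
    sleep($got / $rate);             # pace requests to hit the target rate
    last if $got < $chunksize;       # short chunk => end of file
}

You could tune $chunksize and the sleep to trade smoothness against per-request overhead.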


For aborting: Are you running the download in a separate thread?  If not, and the download is in an event procedure, Tk won't even process your cancel button click until the download is complete.
Fairlight2cx (Asker):

For bandwidth limiting, that's pretty much what I originally said in terms of the chunks and sleeping between them.  I said it was the exact algorithm that eludes me.  Assume, for the sake of ease, 1024-byte chunks.  Can you give me code that would dynamically keep the download at a (tunable via a variable) rate?

Neither squid nor assuming Linux is an option.  The constraint clearly said this must be done within LWP.  If you put a package out there, you have no control over the running environment, nor over the environment it will connect to if it's meant to reach multiple endpoints not under your control.  (Think about download managers.)

As for aborting: separate thread?  You're kidding, right?  Perl/Tk isn't thread-safe, from everything I've read over the years.  Nor is it likely to become so, as the maintainer died and I don't think anyone's actually taking over the module for anything but bug fixes.  And no, I'm not running the download out of a child process and using IPC, nor do I plan on changing to that architecture.

Of course the download isn't going to naturally handle events; that's why I use a handler on the request(), during which it does update() on the entire main TopLevel window, recursively (read: program-wide), which -does- let the Tk event handler process UI events.  How else have I had a progress bar working for over half a year during downloads, along with an abort button and bindings that at least trap the fact that the user wants to abort?  Handling the events is not a problem.  Knowing what to do once I've received that event, in order to abort the download as one would in other software, -is- the problem.
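
Schematically, the arrangement looks something like this (a bare sketch; the URL, widget names, and chunk size are illustrative, not my actual code):

#!/usr/bin/perl
use strict;
use warnings;
use Tk;
use LWP::UserAgent;

my $abort = 0;
my $mw = MainWindow->new;
$mw->Button(-text => 'Abort', -command => sub { $abort = 1 })->pack;

my $agent    = LWP::UserAgent->new;
my $received = 0;

my $response = $agent->request(
    HTTP::Request->new(GET => 'http://example.com/bigfile'),
    sub {
        my ($chunk, $res) = @_;
        $received += length($chunk);
        # ...update the progress bar from $received here...
        $mw->update();   # recursively process pending Tk UI events
        # ($abort gets set by the button; what to do HERE to kill the
        #  download mid-stream is the open question.)
    },
    4096
);
MainLoop;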

So far, I haven't heard anything I didn't already know that actually falls within the constraints I set forth.

Next...
SOLUTION
Adam314

[solution text available to members only]

Fairlight2cx (Asker):
Unacceptable.  I'm sorry, but there is no way I could ever sanction a solution that could very well be construed as a denial-of-service attack on any server.

You're talking about taking far more resources on the server than necessary to handle this.  In all likelihood you'd exceed any sane max-requests-per-child setting, connections would pile up in TIME_WAIT for a while, and a server like Apache would be spawning new children, probably several per second at any usable threshold, along with all the RAM that incurs.  Then there's the extra bandwidth of that many sets of headers going out.  Not to mention holding -that- many sockets open on one system...

I mean, this is a great solution--if your goal is to get client IP#s banned from servers, or worse.  I admin systems for a living, and believe me, I'd null route a source address that did this within a quarter of an hour, plus notify the remote's provider and get their account shut down, since null routing doesn't actually protect your pipe from all the SYN attempts.

That is arguably one of the most horrendous "solutions" I've ever seen.  Obviously you weren't thinking of the big picture, here.

Part of the code (the time/rate algorithm) is arguably useful, but not in this context.  It'd have to be adapted.  That gives me a bit of a baseline to a quarter of my request.  But the framework is all wrong.

Okay, since it apparently wasn't obvious, I'm going to have to explicitly state it:

ADDITIONAL CONSTRAINT:

The rate limiting MUST take place within -one- request under LWP.
Okay, I've done some testing of my own, and determined:

1) If you croak; inside a handler for request(), the request will abort, the program will continue.  You simply need to use Carp to get this functionality.
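
A minimal self-contained demonstration of that finding (the URL and the byte-count trigger are stand-ins; in the real program the flag would be set by the Tk abort binding):

#!/usr/bin/perl
use strict;
use warnings;
use Carp;
use LWP::UserAgent;

my $abort = 0;
my $bytes = 0;
my $agent = LWP::UserAgent->new;

my $response = $agent->request(
    HTTP::Request->new(GET => 'http://example.com/bigfile'),
    sub {
        my ($chunk, $res) = @_;
        $bytes += length($chunk);
        $abort = 1 if $bytes >= 100_000;   # stand-in for a user abort event
        croak("user abort") if $abort;     # request() traps this and stops
    },
    1024
);

# LWP reports the handler's death in the X-Died response header;
# execution simply continues here.
print "Aborted mid-download.\n" if $response->header('X-Died');
print "Program still running.\n";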

2) I have -most- of the sleep functionality done.  It -is- possible to sleep/usleep in the middle of the handler for request().  The part I'm currently having problems with is this: why does the request stop dead (yet the program continues) if I call tv_interval() inside the handler?  I've narrowed it to that one statement causing a cessation of request().

If I can get that issue worked out, I think I have viable solutions.

Thanks to Adam314 for the time code you posted.  Your overall multi-segment methodology was still an awful idea.  But the time algorithm is part of what I was looking for.

Any ideas on the tv_interval issue?

I'm attaching my proof of concept snippet.  This doesn't have the croak example (that's just plain trivial), but rather deals entirely with the throttling.  The rate argument is assumed to be in KB/s for the purposes of the proof of concept.
#!/usr/bin/perl

use strict;
use warnings;

use Carp;
use LWP;
use Time::HiRes;

my $url = shift;
die("No URL given.\n") unless defined(${url});

# Target download rate in KB/s; 0 means unthrottled.
my $maxrate = shift;
$maxrate = 0 unless defined(${maxrate});
my $tempxbytes = 0;
my ($elapsed,$desired,$stime);

my $agent = LWP::UserAgent->new;
my $request = HTTP::Request->new(GET => ${url});
my $response = undef;
my $totalbytes = 0;
# tv_interval() requires an arrayref of [seconds, microseconds], i.e.
# gettimeofday() captured in list context.  Handing it the float that
# gettimeofday() returns in scalar context makes tv_interval() die, and
# a die inside the handler silently aborts request().
my $starttime = undef;     # throttle-window start
my $overallstart = undef;  # separate start time for the final rate report
for (my $x = 0;${x} < 10;$x++) {
     print("MAIN LOOP ${x}\n");
     if (${x} == 5) {
          $starttime = [Time::HiRes::gettimeofday()];
          $overallstart = [Time::HiRes::gettimeofday()];
          $response = ${agent}->request(${request},\&dhandler,1024);
     }
}
print("Total Bytes Received: ${totalbytes}\n");
print((${totalbytes} / Time::HiRes::tv_interval(${overallstart})) / 1024," KB/s\n");
exit;

sub dhandler {
     my $chunk = shift;
     my $len = length(${chunk});
     $totalbytes += ${len};
     $tempxbytes += ${len};
     print("Received ${len} bytes.  [${totalbytes} total]\n");
     if (${maxrate}) {
          print("TEMPX: ",${tempxbytes} / 1024,"\nMAX: ${maxrate}\n");
          if ((${tempxbytes} / 1024) > ${maxrate}) {
               print("Rate exceeded - THROTTLE.\n");
               $elapsed = Time::HiRes::tv_interval(${starttime});
               print("elapsed: ${elapsed}\n");
               # How long this many KB should have taken at the target
               # rate, minus the time it actually took, is the sleep.
               $desired = (${tempxbytes} / 1024) / ${maxrate};
               $stime = ${desired} - ${elapsed};
               print("stime: ${stime}\n");
               Time::HiRes::usleep(${stime} * 1000000) if ${stime} > 0;
               $starttime = [Time::HiRes::gettimeofday()];
               $tempxbytes = 0;
          }
     }
     return;
}


ASKER CERTIFIED SOLUTION

[solution text available to members only]

Adam314:
The code I gave has only 1 request active at a time.  There is the overhead of the additional headers though.  I tried it several times, and found no problems with using it.

I believe the code you gave will throttle the reading of the input buffer and writing to the file - not the actual traffic over the network.

Anyways, if you are happy with what you have, I'll consider this finished.
Fairlight2cx (Asker):

Not possible to just ignore a socket buffer accumulating more than ~64KB of data without losing some of it.  Throttling this to 10KB/s should surely lose data, yet it doesn't, even on localhost, where it would certainly show up.

Out of paranoia, I tried this on WinXP and watched the TCP graph and levels with Iarsn Taskinfo in both throttled and non-throttled modes.  It confirmed that it actually stems the net traffic itself...albeit not in a smooth line, given the bursty way it processes the data--but it is throttling the network traffic, not just the buffer.

Something must be synchronous/handshaking here; presumably it's TCP's own flow control (the receive window closing as the buffer fills) that stems the sender, at least in chunked mode.  No other way you wouldn't lose data.

By the way...you might look at your solution on a unix box and monitor the child count of apache, and the results of netstat while it works.  You'll find the issues there, not at the client side.  Your method has the same inherent issues that cause admins to ban people for wardialing FTP aites without delays.