Solved

Need a script to automate a file download from a site which requires a login

Posted on 2008-06-11
Last Modified: 2008-07-03
I am trying to automate the download of a weekly .csv file from a site. The URL is "https://www.sitename.com/directory/filename.csv". When I follow the link, a Windows prompt asks for a username and password. Any ideas?
Question by:speede1
27 Comments
 
LVL 39

Expert Comment

by:Adam314
ID: 21763175
If it is basic authentication, you could use the credentials method from LWP::UserAgent:
http://search.cpan.org/~gaas/libwww-perl-5.812/lib/LWP/UserAgent.pm
0
 

Author Comment

by:speede1
ID: 21764185
I am a newbie, Adam. I am trying to download a file from a site via a hyperlink. When I click on the hyperlink, the site prompts me to enter my username and password in a Windows popup.
0
 
LVL 39

Expert Comment

by:Adam314
ID: 21764313
Try this code.  You will have to install WWW::Mechanize, if it isn't already installed.
On Windows: at a prompt, type: ppm install WWW-Mechanize
On anything else: at a prompt as root: perl -MCPAN -e 'install WWW::Mechanize'

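#!/usr/bin/perl
use strict;
use warnings;
use WWW::Mechanize;

# NOTE: put your username and password here
my $username = 'your_username';
my $password = 'your_password';

# autocheck => 0 so that the script can report the HTTP status itself
my $mech = WWW::Mechanize->new( autocheck => 0 );
$mech->credentials( $username, $password );

$mech->get('https://www.sitename.com/directory/filename.csv');
die "Unsuccessful: status=" . $mech->status . "\n" unless $mech->success;

# binmode keeps Windows from altering line endings in the saved file
open(my $out, '>', 'filename.txt') or die "Output file: $!\n";
binmode($out);
print $out $mech->content;
close($out);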
    


 

Author Comment

by:speede1
ID: 21764569
Getting unsuccessful - 401
0
 
LVL 39

Expert Comment

by:Adam314
ID: 21766108
Status 401 means not authorized.  Did you enter the correct username and password?

Try this, it will display the content no matter what the status.

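#!/usr/bin/perl
use strict;
use warnings;
use WWW::Mechanize;

# NOTE: put your username and password here
my $username = 'your_username';
my $password = 'your_password';

# autocheck => 0 so that a failed request does not abort the script
my $mech = WWW::Mechanize->new( autocheck => 0 );
$mech->credentials( $username, $password );

$mech->get('https://www.sitename.com/directory/filename.csv');
print "status=" . $mech->status . "\n";
print "content=" . $mech->content . "\n";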

0
 

Author Comment

by:speede1
ID: 21770621
Tried that and the following is what has been received:

status=401
content=<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
<html><head>
<title>401 Authorization Required</title>
</head><body>
<h1>Authorization Required</h1>
<p>This server could not verify that you
are authorized to access the document
requested.  Either you supplied the wrong
credentials (e.g., bad password), or your
browser doesn't understand how to supply
the credentials required.</p>
<hr>
<address>Apache/2.2.6 (Debian) mod_ssl/2.2.6 OpenSSL/0.9.8g mod_apreq2-20051231/
2.6.0 mod_perl/2.0.3 Perl/v5.8.8 Server at perlhttp Port 80</address>
</body></html>

I have tested the user name and password by logging in through the browser.
0
 
LVL 39

Expert Comment

by:Adam314
ID: 21771629
I'm guessing the method of authentication used by the server is different than what the credentials method provides.

Could you use Firefox with LiveHTTPHeaders when you go to the website, to see what headers are sent and received?
0
 

Author Comment

by:speede1
ID: 21771899
https://www.world-check.com/portal/Downloads/world-check.csv

GET /portal/Downloads/world-check.csv HTTP/1.1
Host: www.world-check.com
User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.14) Gecko/20080404 Firefox/2.0.0.14
Accept: text/xml,application/xml,application/xhtml+xml,text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5
Accept-Language: en-us,en;q=0.5
Accept-Encoding: gzip,deflate
Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7
Keep-Alive: 300
Connection: keep-alive

HTTP/1.x 401 Authorization Required
Server: nginx/0.5.35
Date: Thu, 12 Jun 2008 17:45:29 GMT
Content-Type: text/html; charset=iso-8859-1
Connection: close
WWW-Authenticate: Basic realm="WorldCheck auth"
Content-Length: 556
0
 
LVL 39

Expert Comment

by:Adam314
ID: 21804109
Are you able to get the file with firefox?  Those headers look like you got an error message.  Can you post the headers for when it is successful?
0
 

Author Comment

by:speede1
ID: 21804182
https://www.world-check.com/portal/Downloads/world-check.csv

GET /portal/Downloads/world-check.csv HTTP/1.1
Host: www.world-check.com
User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.14) Gecko/20080404 Firefox/2.0.0.14
Accept: text/xml,application/xml,application/xhtml+xml,text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5
Accept-Language: en-us,en;q=0.5
Accept-Encoding: gzip,deflate
Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7
Keep-Alive: 300
Connection: keep-alive
Cookie: test_cookie=1

HTTP/1.x 401 Authorization Required
Server: nginx/0.5.35
Date: Tue, 17 Jun 2008 15:30:06 GMT
Content-Type: text/html; charset=iso-8859-1
Connection: close
WWW-Authenticate: Basic realm="WorldCheck auth"
Content-Length: 556
----------------------------------------------------------
https://www.world-check.com/portal/Downloads/world-check.csv

GET /portal/Downloads/world-check.csv HTTP/1.1
Host: www.world-check.com
User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.14) Gecko/20080404 Firefox/2.0.0.14
Accept: text/xml,application/xml,application/xhtml+xml,text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5
Accept-Language: en-us,en;q=0.5
Accept-Encoding: gzip,deflate
Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7
Keep-Alive: 300
Connection: keep-alive
Cookie: test_cookie=1
Authorization: Basic xxxxxxxxxxxxxxxx

HTTP/1.x 200 OK
Server: nginx/0.5.35
Date: Tue, 17 Jun 2008 15:30:19 GMT
Content-Type: text/csv
Content-Length: 507135096
Last-Modified: Tue, 17 Jun 2008 15:29:42 GMT
Connection: close
Accept-Ranges: bytes
0
 
LVL 39

Accepted Solution

by:
Adam314 earned 350 total points
ID: 21804400

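#!/usr/bin/perl
use strict;
use warnings;
use WWW::Mechanize;

# autocheck => 0 lets the script print the status even when the request fails
my $mech = WWW::Mechanize->new( autocheck => 0 );

# Replace the x's with the exact token sent in the successful browser request
$mech->add_header("Authorization" => "Basic xxxxxxxxxxxxxxxx");
$mech->get('https://www.sitename.com/directory/filename.csv');

print "status=" . $mech->status . "\n";
print "content=" . $mech->content . "\n";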
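
For reference, the token after "Basic" in the Authorization header is not encrypted: it is the Base64 encoding of "username:password". Here is a minimal sketch of how the header value could be built instead of being copied from the browser. The helper name basic_auth_value and the credentials shown are placeholders for illustration, not anything from the site in question.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use MIME::Base64 qw(encode_base64 decode_base64);

# Build the value of an HTTP Basic "Authorization" header.
# The second argument to encode_base64 ('') suppresses the trailing newline.
sub basic_auth_value {
    my ($user, $pass) = @_;
    return 'Basic ' . encode_base64("$user:$pass", '');
}

# Placeholder credentials, for illustration only.
print basic_auth_value('Aladdin', 'open sesame'), "\n";

# Decoding shows that the token is trivially reversible,
# so it should be treated like the password itself.
print decode_base64('QWxhZGRpbjpvcGVuIHNlc2FtZQ=='), "\n";
```

The result can be passed to $mech->add_header("Authorization" => ...) in the script above. Because the token decodes straight back to the password, it is best not to post a real one in a forum.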

0
 

Author Comment

by:speede1
ID: 21805228
status=401
content=<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
<html><head>
<title>401 Authorization Required</title>
</head><body>
<h1>Authorization Required</h1>
<p>This server could not verify that you
are authorized to access the document
requested.  Either you supplied the wrong
credentials (e.g., bad password), or your
browser doesn't understand how to supply
the credentials required.</p>
<hr>
<address>Apache/2.2.6 (Debian) mod_ssl/2.2.6 OpenSSL/0.9.8g mod_apreq2-20051231/
2.6.0 mod_perl/2.0.3 Perl/v5.8.8 Server at perlhttp Port 80</address>
</body></html>
0
 
LVL 39

Expert Comment

by:Adam314
ID: 21805666
Did you replace the xxxxx in the script with the password?  Was the "Basic xxxxxxx" in the headers actually what was posted, or was there a password there?
0
 

Author Comment

by:speede1
ID: 21805796
It was a string of letters, which was probably an encoded version of the username and password.
0
 
LVL 39

Expert Comment

by:Adam314
ID: 21806073
Did you try using that same string in the perl script?
0
 

Author Comment

by:speede1
ID: 21806517
Yes, which is when I got the 401 message
0
 
LVL 39

Expert Comment

by:Adam314
ID: 21806602
I'm not sure then..... maybe you could use wget or curl
http://www.gnu.org/software/wget/
http://curl.haxx.se/

I'm not sure if either will work with authentication though.
0
 

Author Comment

by:speede1
ID: 21816079
Is there any code for wget?
0
 
LVL 39

Expert Comment

by:Adam314
ID: 21816193
The manual for wget is here:
    http://www.gnu.org/software/wget/manual/

There are many examples, with one having a username/password towards the bottom of this page:
    http://www.gnu.org/software/wget/manual/html_node/Advanced-Usage.html#Advanced-Usage


To call it from perl, you would use:
    my $returncode=system("wget ....");  #then check $returncode for success/failure
Or:
    my $output=`wget .....`;     #then check $output and $? for success/failure
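
As a rough sketch of the system() approach, the wget arguments can be passed as a list so that no shell quoting is needed. The helper name wget_command and all values below are placeholders. The sketch assumes a wget that supports --user and --password (documented in the wget manual linked above), and note that a password given on the command line may be visible to other users on the same machine.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Build the argument list for wget. Every value passed in is a placeholder.
sub wget_command {
    my (%opt) = @_;
    return (
        'wget',
        '--user='     . $opt{user},
        '--password=' . $opt{password},
        '-O', $opt{output},     # file to save the download to
        '-o', $opt{logfile},    # wget writes its progress messages here
        $opt{url},
    );
}

my @cmd = wget_command(
    user     => 'your_username',
    password => 'your_password',
    output   => 'filename.csv',
    logfile  => 'wget.log',
    url      => 'https://www.sitename.com/directory/filename.csv',
);
print "Would run: @cmd\n";

# To actually start the download, pass the list to system().
# The list form bypasses the shell, so special characters in the
# password need no extra quoting:
#   system(@cmd) == 0 or die "wget failed: $?\n";
```

With -o, the progress output goes to wget.log rather than the console, so the log can be checked while the download is still running.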
0
 

Author Comment

by:speede1
ID: 21816326
I used the following and I think it is working:


#!/usr/bin/perl
use WWW::Mechanize;
 
my $mech = WWW::Mechanize->new();
 
$mech->get('https://user:password@www.sitename.com/directory/filename.csv');
 
print "status=" . $mech->status . "\n";
print "content=" . $mech->content . "\n";



I added the user:password string to the address. How can I now direct the download to a specific directory on my machine, and also write a log file?
0
 

Author Comment

by:speede1
ID: 21816405
I need to pipe this to a file, because when I ran it from a command prompt the data appeared in the DOS window.
0
 
LVL 39

Assisted Solution

by:Adam314
Adam314 earned 350 total points
ID: 21816461
You could have the script write it to a file.  There are several methods:
1)
    #save the csv file to $filename
    $mech->save_content($filename);

2)
    open(my $out, '>', $filename) or die "Could not create output: $!\n";
    binmode($out);    #avoid line-ending translation on Windows
    print $out $mech->content;
    close($out);

Or you could do the redirect from the command prompt:
    perl yourscript.pl > some/file.csv
    yourscript.pl > some/file.csv
0
 

Author Comment

by:speede1
ID: 21817514
The final code is:

#!/usr/bin/perl
use strict;
use warnings;
use WWW::Mechanize;

my $mech = WWW::Mechanize->new();

$mech->get('https://xxxx:password@www.sitename/file.csv');

# save the csv file (the filename must be a quoted string)
$mech->save_content('test.csv');

print "status=" . $mech->status . "\n";
print "content=" . $mech->content . "\n";
0
 

Author Comment

by:speede1
ID: 21832294
Adam,

The get command actually opens the file in the DOS window. Is there a command which just downloads the file without opening it? I tried using the script with a 400MB file and it doesn't work. I would like to modify the script to do a straight download.
0
 
LVL 39

Expert Comment

by:Adam314
ID: 21832400
It's not the get command that opens the dos window, it is perl itself.  If you don't want the window, you can use wperl instead of perl.  Several ways to do this:

In the command for the shortcut you are clicking, change it to:
    c:\perl\bin\wperl.exe "c:\path\to\your\script.pl"
You will not need double-quotes (like shown) if the path does not contain spaces.

Or you could associate the extension .wpl with wperl.exe, and rename the script from something.pl to something.wpl.
0
 

Author Comment

by:speede1
ID: 21833150
OK, Adam, I have changed the script to use the wperl command. Is there any way to implement a status window or log file to see what's going on with the download? I am trying to download a 400MB file. It takes an hour if I do it manually, but when I ran the script yesterday the DOS window was open for about 7 hours, and when it finally closed there wasn't any file.
0
 
LVL 39

Expert Comment

by:Adam314
ID: 21833344
The saved file will be in the current directory of the script.  If you want it in a particular directory, add this to the script:
    chdir('/path/to/where/you/want/to/save');

You might be able to use the LWP::Simple module to save the file
    use LWP::Simple;
    getstore('https://xxxx:password@www.sitename.com/file.csv', '/path/to/file.csv') ;
Then check the size of /path/to/file.csv as the script is running.  I'm not sure if this function will write data as it is downloaded, or write it all when it's finished.
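
For a file this size, one option is to avoid holding the whole body in memory (which $mech->content does) and write each chunk to disk as it arrives. Below is a minimal sketch using the ':content_cb' callback of LWP::UserAgent's get method, with a simple progress log. The function name download_with_log, the 10 MB logging interval, and the command-line arguments are assumptions made for illustration, and it has not been tried against the site in this question.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use LWP::UserAgent;
use MIME::Base64 qw(encode_base64);

# Download $url to $path, writing each chunk to disk as it arrives and
# appending a progress line to $log roughly every $step bytes.
# $user and $pass are optional and are sent as HTTP Basic credentials.
sub download_with_log {
    my ($url, $path, $log, $step, $user, $pass) = @_;
    $step ||= 10 * 1024 * 1024;    # assumed default: log every 10 MB

    open(my $out, '>', $path) or die "Cannot write $path: $!\n";
    binmode($out);
    open(my $logfh, '>>', $log) or die "Cannot write $log: $!\n";

    my @auth = defined $user
        ? (Authorization => 'Basic ' . encode_base64("$user:$pass", ''))
        : ();

    my ($received, $next_mark) = (0, $step);
    my $ua = LWP::UserAgent->new;
    my $response = $ua->get(
        $url, @auth,
        ':content_cb' => sub {
            my ($chunk) = @_;
            print {$out} $chunk or die "Write failed: $!\n";
            $received += length $chunk;
            if ($received >= $next_mark) {
                print {$logfh} scalar(localtime), " received $received bytes\n";
                $next_mark += $step;
            }
        },
    );
    close($out) or die "Cannot close $path: $!\n";

    print {$logfh} scalar(localtime), ' finished: ', $response->status_line,
        " ($received bytes)\n";
    close($logfh);

    # LWP sets an X-Died header if the callback died part-way through.
    return $response->is_success && !$response->header('X-Died');
}

# Hypothetical command-line use: script.pl URL OUTPUT_FILE LOG_FILE
if (@ARGV == 3) {
    my ($url, $path, $logfile) = @ARGV;
    download_with_log($url, $path, $logfile)
        or die "Download failed, see $logfile\n";
}
```

The log file can be opened while the script is running to see how far the download has progressed, and its last line records the final HTTP status.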
0
