Solved

Logon to Web Site and Download File Programmatically using Python, urllib2 module

Posted on 2014-08-05
Last Modified: 2014-08-08
I wish to log on to a web site and programmatically download data.
I am looking for assistance with Python and the urllib2 module.

At this page:
  https://www.nifc.blm.gov/cgi/WfmiHome.cgi
there is a Logon button which issues a GET (not a POST) to:
    https://www.nifc.blm.gov/cgi/WfmiHome.cgi/Page/Logon
which redirects to: https://www.nifc.blm.gov/cgi/WfmiHome.cgi/Page/DoiMonitor

I'm hoping to understand how to navigate through the web pages to eventually log on with my account.
I've tried many variations of the code below without success.
I need help understanding how to use urllib2 to navigate to the page with the "I Agree" button and then to the actual page where the username and password are entered.
import urllib
import urllib2
import cookielib

#cookie storage
cj = cookielib.CookieJar()
opener = urllib2.build_opener(
    urllib2.HTTPCookieProcessor(cj),
    urllib2.HTTPRedirectHandler
    )

#### First page
url = 'https://www.nifc.blm.gov/cgi/WfmiHome.cgi'

request = urllib2.Request(url)

response = urllib2.urlopen(request)

html = response.read()

# Print to screen
print html 



Appreciate help ...  Thanks !
Question by:DoveTails
6 Comments
 
LVL 16

Expert Comment

by:Walter Ritzel
ID: 40243609
You would need 3 things:
1) write code similar to yours to submit the logon information to the logon page;
2) make sure that a session is being maintained;
3) call the download URL, using the code you have shown.

It is not a question of navigating through pages, but of generating the session you need with the logon page and then calling the download URL directly.
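
A minimal sketch of points 2 and 3, assuming the site tracks the session with cookies (nothing in this thread confirms that). The detail that matters is that every request goes through the opener that owns the cookie jar: the module-level `urlopen()` uses a default opener that knows nothing about the jar. `make_session` and `fetch` are hypothetical helper names, and the imports fall back to the Python 3 module names where `urllib2` is unavailable.

```python
try:  # Python 2, as used in the question
    import urllib2 as urlrequest
    import cookielib as cookiejar
except ImportError:  # Python 3 names for the same modules
    import urllib.request as urlrequest
    import http.cookiejar as cookiejar


def make_session():
    """Return (opener, jar); cookies set by any response are kept in jar."""
    jar = cookiejar.CookieJar()
    # build_opener() already adds the default redirect handler,
    # so only the cookie processor has to be passed explicitly.
    opener = urlrequest.build_opener(urlrequest.HTTPCookieProcessor(jar))
    return opener, jar


def fetch(opener, url, data=None):
    """Open url through the shared opener so stored cookies are sent back.

    data=None issues a GET; url-encoded bytes issue a POST.
    """
    response = opener.open(url, data)
    try:
        return response.read()
    finally:
        response.close()
```

With this in place, the logon request and the later download request would both be made with `fetch(opener, ...)` on the same `opener`, so whatever cookies the logon step sets are sent along with the download.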
 
LVL 5

Author Comment

by:DoveTails
ID: 40244505
Thanks for the response Walter.
My thinking behind navigating through the first few pages, with their "Logon" and "I Agree" buttons, is to acquire the necessary cookies.  If I attempt to navigate directly to the authentication page, I am redirected back to the main home page with the "Logon" button (basically back to page 1).

I'm assuming that pressing "I Agree" in a standard browser sets a cookie which lasts for that session.

Hopefully the code I have written to POST the logon information will work, but I cannot programmatically get to the logon page, and my guess is that this is because I do not yet have the "I Agree" cookie.

Any thoughts ?
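
One way to test the cookie assumption is to open the pages in the same order a browser would and look at what ends up in the cookie jar after each step. This is only a sketch: `build_opener_with_jar`, `cookie_names` and `walk` are hypothetical helper names, the two URLs are the ones quoted in the question, and whether the site sets any cookie at these steps is not known.

```python
try:  # Python 2, as used in the question
    import urllib2 as urlrequest
    import cookielib as cookiejar
except ImportError:  # Python 3 names for the same modules
    import urllib.request as urlrequest
    import http.cookiejar as cookiejar

# Pages from the question, in the order a browser would visit them.
PAGES = [
    'https://www.nifc.blm.gov/cgi/WfmiHome.cgi',
    'https://www.nifc.blm.gov/cgi/WfmiHome.cgi/Page/Logon',
]


def build_opener_with_jar():
    """Return (opener, jar) sharing one cookie jar across all requests."""
    jar = cookiejar.CookieJar()
    opener = urlrequest.build_opener(urlrequest.HTTPCookieProcessor(jar))
    return opener, jar


def cookie_names(jar):
    """Names of all cookies currently stored in the jar, sorted."""
    return sorted(cookie.name for cookie in jar)


def walk(opener, jar, urls):
    """Open each URL in order through the same opener.

    Returns one (final_url, cookie_names) pair per step, where final_url
    is the address reached after any redirects.
    """
    steps = []
    for url in urls:
        response = opener.open(url)
        response.read()
        steps.append((response.geturl(), cookie_names(jar)))
        response.close()
    return steps
```

Calling `walk(opener, jar, PAGES)` and printing the result shows where each request actually landed and which cookies had been collected by then. If the list of cookie names stays empty, the "I Agree" state is probably carried some other way, for example in a form field or in the URL.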
 
LVL 16

Accepted Solution

by:
Walter Ritzel earned 668 total points
ID: 40244565
In this case, instead of using urllib2, you may be interested in using mechanize.

http://www.pythonforbeginners.com/mechanize/browsing-in-python-with-mechanize/

It uses urllib2 under the covers, but maybe the piece of functionality you are looking for is already implemented there.
 
LVL 17

Assisted Solution

by:gelonida
gelonida earned 668 total points
ID: 40245858
Another library that I can strongly recommend for any access to web servers that are NOT using JavaScript is requests:

https://pypi.python.org/pypi/requests
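
A rough sketch of how the logon flow might look with requests. The function takes any object with `get()` and `post()` methods; in practice that would be a `requests.Session()`, which keeps cookies between calls and follows redirects on GET by default. Everything site-specific here is an assumption: `logon` is a hypothetical helper, `auth_url` stands for whatever address the logon form submits to, and the keys in `credentials` must match the form's real field names, none of which are given in this thread. The "I Agree" step is left out because it is not known whether it is a GET or a POST.

```python
def logon(session, home_url, logon_url, auth_url, credentials):
    """Visit the pages in browser order, then POST the credentials.

    session     -- e.g. requests.Session(); must keep cookies between calls
    credentials -- dict of form fields (field names are site-specific)
    """
    session.get(home_url)    # landing page; may set initial cookies
    session.get(logon_url)   # the Logon button's GET; redirects are followed
    reply = session.post(auth_url, data=credentials)
    reply.raise_for_status()  # raise on 4xx/5xx instead of failing silently
    return reply


# Hypothetical usage (requires the third-party requests package):
#   import requests
#   session = requests.Session()
#   logon(session, home, logon_page, form_action, {'user': '...', 'pass': '...'})
#   data = session.get(download_url).content
```

After a successful logon the same `session` object is reused for the download, so the cookies obtained along the way are sent automatically.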


For any web page containing loads of JavaScript it might be necessary to use a real web browser and automate it.

You could use Selenium, which allows you to automate a browser:
https://pypi.python.org/pypi/selenium

Please reply if you're interested in either requests or selenium.


I personally gave up on using urllib in my code. I think its only advantage is that it is a standard Python module, but coding with it just looks clumsy to me.
 
LVL 41

Assisted Solution

by:noci
noci earned 664 total points
ID: 40247422
Maybe, probably, you are accomplishing the same thing as what curl (libcurl) already does.
Check http://curl.haxx.se for more info on curl.

It supports HTTP, HTTPS, Telnet, FTP, FTPS, etc., and using the command-line interface you can script access to web servers, including logins, but excluding JavaScript execution (no AJAX). It can handle cookies that way too. The library version is more flexible, but still offers no AJAX unless you have a JS interpreter built into your software.
 
LVL 5

Author Closing Comment

by:DoveTails
ID: 40249109
Thank you.  More options than I expected.
Appreciate your input !