Logon to Web Site and Download File Programmatically using Python, urllib2 module

I wish to log on to a web site and programmatically download data.
I am looking for assistance with Python and the urllib2 module.

At this page:
  https://www.nifc.blm.gov/cgi/WfmiHome.cgi
there is a Logon button which uses a GET (not a POST) request to:
    https://www.nifc.blm.gov/cgi/WfmiHome.cgi/Page/Logon
which redirects to: https://www.nifc.blm.gov/cgi/WfmiHome.cgi/Page/DoiMonitor

I'm hoping to understand how to navigate through the web pages to eventually log on with my account.
I've tried many variations of the code below without success.
I need help understanding how to use urllib2 to navigate to the page with the "I Agree" button and then to the actual page where the username and password are entered.

import urllib
import urllib2
import cookielib

#cookie storage
cj = cookielib.CookieJar()
opener = urllib2.build_opener(
    urllib2.HTTPCookieProcessor(cj),
    urllib2.HTTPRedirectHandler
    )

#### First page
url = 'https://www.nifc.blm.gov/cgi/WfmiHome.cgi'

request = urllib2.Request(url)

# Open through the cookie-aware opener; urllib2.urlopen() would bypass it
response = opener.open(request)

html = response.read()

# Print to screen
print html 

Appreciate the help. Thanks!
Asked by: DoveTails

Walter Ritzel (Senior Software Engineer) Commented:
You would need 3 things:
1) write code similar to yours to submit the credentials to the logon page;
2) make sure that a session is being maintained;
3) call the download URL, using the code you have shown.

It is not a question of navigating through pages, but of establishing the session you need via the logon page and then calling the download URL directly.
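
The three steps above can be sketched with a single cookie-aware opener. This is only a rough outline, written with the Python 3 names urllib.request and http.cookiejar (the successors of urllib2 and cookielib). The form field names "username" and "password", the helper names, and the logon_url and download_url arguments are placeholders, not values taken from the actual site.

```python
import http.cookiejar
import urllib.parse
import urllib.request


def make_session_opener():
    """Return (opener, jar); every request made through opener shares jar."""
    jar = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(
        urllib.request.HTTPCookieProcessor(jar)
    )
    return opener, jar


def encode_form(fields):
    """URL-encode a dict of form fields into a bytes body for a POST."""
    return urllib.parse.urlencode(fields).encode("ascii")


def logon_and_download(opener, logon_url, download_url, username, password):
    """Log on, then fetch download_url with the same opener."""
    # Step 1: submit the logon form (field names are hypothetical).
    body = encode_form({"username": username, "password": password})
    opener.open(logon_url, data=body).read()
    # Step 2: any session cookie is now held by the opener's cookie jar.
    # Step 3: request the download URL through the same opener.
    return opener.open(download_url).read()
```

Passing data to opener.open() turns the request into a POST. The key point is that both requests go through the same opener, so whatever cookies the logon response sets are sent along with the download request.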
DoveTails (Author) Commented:
Thanks for the response, Walter.
My thinking behind navigating through the first few pages, with their "Logon" and "I Agree" buttons, is to acquire the necessary cookies.  If I attempt to navigate directly to the authentication page, I am sent back to the main home page with the "Logon" button (basically back to page 1).

I'm assuming by pressing "I Agree" in a standard browser a cookie is set which lasts for that session.

Hopefully the code I have written to POST the logon information will work, but I cannot programmatically get to the logon page. My guess is that this is because I do not yet have the "I Agree" cookie.

Any thoughts?
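
One way to test the cookie hypothesis is to list what the cookie jar holds after each request. A minimal sketch, using the Python 3 module names (urllib.request, http.cookiejar); the helper names are made up for illustration, and whether the site actually sets a cookie on "I Agree" is not confirmed here.

```python
import http.cookiejar
import urllib.request


def cookie_names(jar):
    """Return the sorted names of all cookies currently in the jar."""
    return sorted(cookie.name for cookie in jar)


def fetch_and_report(opener, jar, url):
    """Open url through the opener, then report final URL and cookies."""
    response = opener.open(url)
    html = response.read()
    # geturl() shows where any redirects finally landed.
    print("landed on:", response.geturl())
    print("cookies so far:", cookie_names(jar))
    return html


jar = http.cookiejar.CookieJar()
opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))
# Example call (performs a real request, so it is left commented out):
# fetch_and_report(opener, jar, "https://www.nifc.blm.gov/cgi/WfmiHome.cgi")
```

If the list of cookie names does not change after the "I Agree" step, the site is probably tracking the agreement some other way (for example a hidden form field), and the cookie theory would need revisiting.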
Walter Ritzel (Senior Software Engineer) Commented:
In this case, instead of using urllib2, you may be interested in using mechanize.

http://www.pythonforbeginners.com/mechanize/browsing-in-python-with-mechanize/

It uses urllib under the covers, but maybe the piece of functionality you are looking for is already implemented there.
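
As a rough sketch of what that could look like with mechanize: the function below takes an already-created mechanize.Browser() and walks through Logon, "I Agree", and the credential form. It assumes that each step is the first HTML form on its page and that the fields are named "username" and "password"; both are guesses that would need checking against the real pages. The function names are hypothetical.

```python
def walk_to_logon(browser, home_url, username, password):
    """Drive a mechanize-style browser through the logon sequence.

    Assumptions (not verified against the site): "I Agree" and the
    credentials are each the first form on their page, and the fields
    are called "username" and "password".
    """
    # Cookies and redirects are handled by the browser object itself.
    browser.open(home_url + "/Page/Logon")

    # Submit the "I Agree" form.
    browser.select_form(nr=0)
    browser.submit()

    # Fill in and submit the credential form.
    browser.select_form(nr=0)
    browser["username"] = username
    browser["password"] = password
    response = browser.submit()
    return response.read()


def make_browser():
    """Create a mechanize browser; the import is local so the sketch
    above can be read and tested without mechanize installed."""
    import mechanize  # third-party: pip install mechanize

    browser = mechanize.Browser()
    browser.set_handle_robots(False)  # do not fetch or obey robots.txt
    return browser
```

Typical use would be walk_to_logon(make_browser(), 'https://www.nifc.blm.gov/cgi/WfmiHome.cgi', user, password), after adjusting the form selection to match what the pages really contain.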
gelonida Commented:
Another library that I can strongly recommend for any access to web servers that do NOT rely on JavaScript is
requests:

https://pypi.python.org/pypi/requests
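
For comparison, a hedged sketch of the same kind of flow with requests. A requests.Session keeps cookies across calls and follows redirects by default. The form field names, the function names, and the idea that the "I Agree" page is reachable with a plain GET are assumptions for illustration only.

```python
def download_with_session(session, agree_url, logon_url, download_url,
                          username, password):
    """Use one session for every step so cookies carry over."""
    # Visit the agreement page first so any cookie it sets is stored.
    session.get(agree_url).raise_for_status()

    # POST the credentials (field names are hypothetical).
    reply = session.post(
        logon_url, data={"username": username, "password": password}
    )
    reply.raise_for_status()

    # Fetch the file with the now-authenticated session.
    download = session.get(download_url)
    download.raise_for_status()
    return download.content


def new_session():
    """Create a requests session; the import is local so the function
    above can be exercised without requests installed."""
    import requests  # third-party: pip install requests

    return requests.Session()
```

raise_for_status() makes a failed step stop the script with an HTTP error instead of silently saving an error page as the download.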


For any web page containing loads of JavaScript, it might be necessary to use a real web browser and automate it.

You could use Selenium, which allows you to automate a browser:
https://pypi.python.org/pypi/selenium

Please reply if you're interested in either requests or selenium.


I personally gave up on using urllib in my code. I think its only advantage is that it is a standard Python module, but coding with it just looks clumsy to me.
noci (Software Engineer) Commented:
You are probably trying to accomplish the same thing that curl (and its library, libcurl) already does.
Check http://curl.haxx.se for more info on curl.

It supports HTTP, HTTPS, Telnet, FTP, FTPS, etc., and with the command line interface you can script access to web servers, including logins, but excluding JavaScript execution (no AJAX). It can handle cookies that way too. The library version is more flexible, but there is still no AJAX unless you have a JS interpreter built into your software.
DoveTails (Author) Commented:
Thank you.  More options than I expected.
Appreciate your input!