Solved

programmatically downloading a web file

Posted on 2013-11-13
427 Views
Last Modified: 2013-11-13
Hi,

I wasn't sure which zones were the most appropriate, so if you think there's a better zone, please let me know.

I have a file I would like to download on a regular basis.  The file name stays the same, but I don't know the exact URL because I press a 'Submit' button which then brings up a JavaScript popup.  I just need to change the 'report type' and 'date'.  Here's the site:

http://www.theocc.com/webapps/series-download

Is there a way I can automate grabbing such a file?  I'd prefer to do it in Python, but I'm not sure how I go about getting the actual URL.

I've not done any type of web crawling like this before, so if there are libraries and/or other resources, please let me know that too. I'm willing to use Java or VB if it's more convenient (though I kind of doubt it is :).

Ideas?

Thanks!
Question by:ugeb
7 Comments
 
LVL 14

Accepted Solution

by:
jb1dev earned 500 total points
ID: 39646809
You can do this with Python's scrapy: http://scrapy.org/

I tried just using the urllib/requests modules, but getting that exactly right (handling exactly what the server expects in terms of HTTP headers, cookies, etc.) can be a pain, especially if you do not have access to server-side logs.

Fortunately, the scrapy framework handles all of this for you.

Here's what I did. To figure out the form parameters I needed (the POST data to send), I installed the Firefox extension HttpFox:

https://addons.mozilla.org/En-us/firefox/addon/httpfox/

I then ran HttpFox while I downloaded the file manually, which gave me access to the POST data.

E.g.

seriesAddDeleteSearchDTO.exchanges      02
seriesAddDeleteSearchDTO.exchanges      25
seriesAddDeleteSearchDTO.exchanges      19
seriesAddDeleteSearchDTO.exchanges      04
seriesAddDeleteSearchDTO.exchanges      01
seriesAddDeleteSearchDTO.exchanges      12
seriesAddDeleteSearchDTO.exchanges      26
seriesAddDeleteSearchDTO.exchanges      22
seriesAddDeleteSearchDTO.exchanges      11
seriesAddDeleteSearchDTO.exchanges      18
seriesAddDeleteSearchDTO.exchanges      08
seriesAddDeleteSearchDTO.exchanges      03
seriesAddDeleteSearchDTO.exchanges      20
seriesAddDeleteSearchDTO.exchanges      13
seriesAddDeleteSearchDTO.exchanges      07
seriesAddDeleteSearchDTO.exchanges      27
seriesAddDeleteSearchDTO.exchanges      39
seriesAddDeleteSearchDTO.dowloadType      B
seriesAddDeleteSearchDTO.dates      11/12/2013


So there are many seriesAddDeleteSearchDTO.exchanges values (presumably representing the checkbox selections), and there are also the seriesAddDeleteSearchDTO.dowloadType and seriesAddDeleteSearchDTO.dates parameters.

We will use that to construct form post data.
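(For reference, if you wanted to try posting this directly with the requests library instead of scrapy, a rough, untested sketch is below. It assumes the form posts back to the same page URL, which you would want to confirm from the HttpFox capture, and as I said above, getting the headers/cookies exactly right this way can be a pain.)

#!/usr/bin/python
# Untested sketch using the requests library instead of scrapy.
# Assumes the form posts back to the page URL; check the actual
# action URL and any required headers/cookies in the HttpFox capture.
import requests

url = 'http://www.theocc.com/webapps/series-download'
exchanges = ['02','25','19','04','01','12','26','22','11','18',
             '08','03','20','13','07','27','39']
data = [('seriesAddDeleteSearchDTO.exchanges', e) for e in exchanges]
data += [('seriesAddDeleteSearchDTO.dowloadType', 'B'),
         ('seriesAddDeleteSearchDTO.dates', '11/12/2013')]

session = requests.Session()
session.get(url)                     # pick up any cookies the page sets
resp = session.post(url, data=data)  # repeated keys become repeated form values
print(resp.status_code)
print(resp.text[:500])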

Here is the scrapy code I put together. Sorry if it's not perfect, I am new to Python and scrapy.

#!/usr/bin/python

from scrapy.spider import BaseSpider
from scrapy.http import FormRequest
from scrapy import log

class LoginSpider(BaseSpider):
    name = 'myspider'
    start_urls = ['http://www.theocc.com/webapps/series-download']

    def parse(self, response):
        # The download form on the page is named "commandForm".
        # FormRequest.from_response fills in the form's other fields for us;
        # we only override the parameters captured with HttpFox.
        return [FormRequest.from_response(
                    response,
                    formname='commandForm',
                    formdata={
                        'seriesAddDeleteSearchDTO.exchanges': [ '02','25','19','04','01','12','26','22','11','18','08','03','20','13','07','27','39' ],
                        'seriesAddDeleteSearchDTO.dowloadType': 'B',
                        'seriesAddDeleteSearchDTO.dates': '11/12/2013' },
                    callback=self.after_download)]

    def after_download(self, response):
        # Check that the request succeeded before going on
        # (left over from the LoginSpider example this was adapted from).
        if "authentication failed" in response.body:
            self.log("Login failed", level=log.ERROR)
            return
        self.log('*** Response is ' + response.body)



Notice that the line
self.log('*** Response is ' + response.body)
is where the response body gets printed.
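If you would rather save the download to disk instead of just logging it, you could replace that line with something like the following (the output filename here is just an example):

# Write the response body to a local file instead of logging it.
# 'series_download.txt' is just an example name; use whatever you like.
with open('series_download.txt', 'wb') as f:
    f.write(response.body)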

I then run this with
$ scrapy runspider ./scrapytest.py

(Here scrapytest.py is the file containing the Python code I posted in the block above.)

If you do not have scrapy installed, there are many ways to install it.
http://doc.scrapy.org/en/latest/intro/install.html
If you are on a Linux distro that packages it in a regular system software repository, I would install it from there. Otherwise "pip" probably works well.
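For example, if you already have pip set up:

$ pip install scrapy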
 
LVL 11

Author Closing Comment

by:ugeb
ID: 39646824
This is awesome!  While I haven't tried it out yet, I'm going to and you deserve all the points just for this.  If I have more questions I'll ask another question.

Thanks!
 
LVL 11

Expert Comment

by:Gregory Miller
ID: 39646833
I have looked through the source pretty closely. They are using jQuery to do the processing, and as best I can tell they are building the file dynamically and then sending it to the browser on the fly as well. That means a random filename is probably created temporarily when the file content is built, the content is sent to the browser with a header giving the generic name download.txt, and then I am certain they just purge the temp file they created, if it was ever created at all. It could have been produced in RAM and dished up hot, no file required.

I took the source, created a local copy of the page, and referenced all the .js src files from their site. Every time I hit the submit button, it looked locally for the download.txt file, which tells me there is some server-side script running that basically cannot be gotten to.

Having said all that, the conclusion is there is probably no way to make this work the way you want.
 
LVL 11

Expert Comment

by:Gregory Miller
ID: 39646838
spent too long and someone figured this out... kudos
 
LVL 11

Author Comment

by:ugeb
ID: 39646840
@Technodweeb,

Sorry you spent too long.  Thanks for trying, though!  I'll have more questions ...
 
LVL 11

Expert Comment

by:Gregory Miller
ID: 39646852
No problem, I have learned a new trick on this question and the HttpFox utility is very cool. I have already used it on a few sites to play. Very cool indeed...
 
LVL 14

Expert Comment

by:jb1dev
ID: 39646870
@Technodweeb,

Yeah, I spent way too long on this as well (before I tried scrapy, anyway).

I've done a lot of web scraping using different technologies, so I know how fickle server-side request handling can be. I'm also new to Python. So I tried its standard modules like urllib/httplib/requests, but nothing was working. Finally I figured there had to be a library/framework for this, and I came across scrapy. There were some others too, but scrapy was the first I tried, and it worked pretty much right out of the box.

Keep in mind that might be a sloppy extension of BaseSpider; it was adapted from a LoginSpider example I came across, hence the class name and authentication check.
