Solved

programmatically downloading a web file

Posted on 2013-11-13
422 Views
Last Modified: 2013-11-13
Hi,

I wasn't sure which zones were the most appropriate, so if you think there's a better zone, please let me know.

I have a file I would like to download on a regular basis.  The file name stays the same, but I don't know the exact URL: I press a 'submit' button, which then brings up a JavaScript popup.  I just need to change the 'report type' and 'date'.  Here's the site:

http://www.theocc.com/webapps/series-download

Is there a way I can automate grabbing such a file?  I'd prefer to do it in Python, but I'm not sure how I go about getting the actual URL.

I've not done any type of web crawling like this before, so if there are libraries and/or other resources, please let me know that too. I'm willing to use Java or VB if it's more convenient (though I kind of doubt it is :).

Ideas?

Thanks!
Question by:ugeb
7 Comments
 
LVL 14

Accepted Solution

by:
jb1dev earned 500 total points
ID: 39646809
You can do this with Python's scrapy: http://scrapy.org/

I tried just using the urllib/requests modules, but getting that exactly right (handling exactly what the server expects in terms of HTTP headers, cookies, etc.) can be a pain, especially if you do not have access to server-side logs.

Fortunately, the scrapy framework handles all of this for you.

Here's what I did. To figure out the form parameters I needed (the POST data to send), I installed the Firefox extension HttpFox:

https://addons.mozilla.org/En-us/firefox/addon/httpfox/

I then ran HttpFox while I downloaded the file manually, which gave me access to the POST data.

E.g.

seriesAddDeleteSearchDTO.exchanges      02
seriesAddDeleteSearchDTO.exchanges      25
seriesAddDeleteSearchDTO.exchanges      19
seriesAddDeleteSearchDTO.exchanges      04
seriesAddDeleteSearchDTO.exchanges      01
seriesAddDeleteSearchDTO.exchanges      12
seriesAddDeleteSearchDTO.exchanges      26
seriesAddDeleteSearchDTO.exchanges      22
seriesAddDeleteSearchDTO.exchanges      11
seriesAddDeleteSearchDTO.exchanges      18
seriesAddDeleteSearchDTO.exchanges      08
seriesAddDeleteSearchDTO.exchanges      03
seriesAddDeleteSearchDTO.exchanges      20
seriesAddDeleteSearchDTO.exchanges      13
seriesAddDeleteSearchDTO.exchanges      07
seriesAddDeleteSearchDTO.exchanges      27
seriesAddDeleteSearchDTO.exchanges      39
seriesAddDeleteSearchDTO.dowloadType      B
seriesAddDeleteSearchDTO.dates      11/12/2013


So there are many seriesAddDeleteSearchDTO.exchanges values (presumably representing the checkbox selections), and there are also the seriesAddDeleteSearchDTO.dowloadType and seriesAddDeleteSearchDTO.dates parameters (the 'dowloadType' spelling is the site's own).

We will use that to construct form post data.

Here is the scrapy code I put together. Sorry if it's not perfect; I am new to Python and scrapy.

#!/usr/bin/python

from scrapy import log
from scrapy.spider import BaseSpider
from scrapy.http import FormRequest

# Adapted from a login-spider example, hence the class name and the
# authentication check in after_download.
class LoginSpider(BaseSpider):
    name = 'myspider'
    start_urls = ['http://www.theocc.com/webapps/series-download']

    def parse(self, response):
        # the form on the page is named "commandForm"
        return [FormRequest.from_response(
                    response,
                    formname='commandForm',
                    formdata={
                        # one value per exchange checkbox, as captured with HttpFox
                        'seriesAddDeleteSearchDTO.exchanges': [
                            '02', '25', '19', '04', '01', '12', '26', '22', '11',
                            '18', '08', '03', '20', '13', '07', '27', '39'],
                        # 'dowloadType' is the site's own (misspelled) field name
                        'seriesAddDeleteSearchDTO.dowloadType': 'B',
                        'seriesAddDeleteSearchDTO.dates': '11/12/2013'},
                    callback=self.after_download)]

    def after_download(self, response):
        # leftover from the login example: bail out if the server complains
        if "authentication failed" in response.body:
            self.log("Login failed", level=log.ERROR)
            return
        self.log('*** Response is ' + response.body)



Notice that the line
self.log('*** Response is ' + response.body)
is where it prints the response.
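
Since the goal is to actually download the file, a minimal tweak would be to write the body to disk in after_download instead of only logging it. This is just a sketch: series_download.txt is an arbitrary name I picked, and I'm assuming the response body is the text payload you want.

    def after_download(self, response):
        # save the downloaded payload to disk instead of only logging it
        with open('series_download.txt', 'wb') as fh:
            fh.write(response.body)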

I then run this with
$ scrapy runspider ./scrapytest.py

(^^^ here scrapytest.py is the file containing the Python code I posted in the block above.)

If you do not have scrapy installed, there are several ways to install it:
http://doc.scrapy.org/en/latest/intro/install.html
If you are on a Linux distro that packages scrapy in its regular system software repository, I would install it from there. Otherwise "pip" works well.
 
LVL 11

Author Closing Comment

by:ugeb
ID: 39646824
This is awesome!  While I haven't tried it out yet, I'm going to, and you deserve all the points just for this.  If I have more questions I'll post another question.

Thanks!
 
LVL 11

Expert Comment

by:Technodweeb
ID: 39646833
I have looked through the source pretty closely. They are using jQuery to do the processing, and as best I can tell they are building the file dynamically and sending it to the browser on the fly. That means there is probably a randomly named temporary file created while the content is built, the content is sent to the browser with a header carrying the generic name download.txt, and then I am fairly certain they just purge the temp file they created, if it was ever created at all. It could have been produced in RAM and dished up hot, no file required.

I took the source, created a local copy of the page, and referenced all the .js src files from their site. Every time I hit the submit button, it looked locally for the download.txt file, which tells me there is some server-side script running that basically cannot be gotten to from outside.

Having said all that, the conclusion is there is probably no way to make this work the way you want.
 
LVL 11

Expert Comment

by:Technodweeb
ID: 39646838
I spent too long on this and someone else figured it out... kudos.
 
LVL 11

Author Comment

by:ugeb
ID: 39646840
@Technodweeb,

Sorry you spent too long.  Thanks for trying, though!  I'll have more questions ...
 
LVL 11

Expert Comment

by:Technodweeb
ID: 39646852
No problem. I learned a new trick from this question, and the HttpFox utility is very cool. I have already used it to play around on a few sites. Very cool indeed...
 
LVL 14

Expert Comment

by:jb1dev
ID: 39646870
@Technodweeb,

Yeah, I spent way too long on this as well (before I tried scrapy, anyway).

I've done a lot of web scraping with different technologies, so I know how fickle server-side request handling can be. I'm also new to Python, so I tried the usual modules like urllib/httplib/requests, but nothing was working. Finally I figured there had to be a library/framework for this and came across scrapy. There were some others too, but scrapy was the first I tried, and it worked pretty much right out of the box.

Keep in mind that this might be a sloppy extension of BaseSpider; it was adapted from a LoginSpider example I came across, hence the class name and the authentication check.
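
For reference, the kind of plain-requests attempt I was making looks roughly like the sketch below. The field names come from the HttpFox capture above, but the assumption that the form posts back to the same URL, and the series_download.txt output name, are mine. I never got this approach working reliably against their server, so treat it as illustrative only.

#!/usr/bin/python
# Illustrative only: a plain-requests attempt along the lines of what I tried
# before switching to scrapy. Field names are from the HttpFox capture; the
# post URL and output filename are assumptions.
import requests

URL = 'http://www.theocc.com/webapps/series-download'
EXCHANGES = ['02', '25', '19', '04', '01', '12', '26', '22', '11',
             '18', '08', '03', '20', '13', '07', '27', '39']

# a list of tuples lets us repeat the exchanges field for every checkbox
data = [('seriesAddDeleteSearchDTO.exchanges', ex) for ex in EXCHANGES]
data += [('seriesAddDeleteSearchDTO.dowloadType', 'B'),
         ('seriesAddDeleteSearchDTO.dates', '11/12/2013')]

with requests.Session() as session:
    session.get(URL)                       # pick up any cookies the page sets
    resp = session.post(URL, data=data)    # assumes the form posts to the same URL
    with open('series_download.txt', 'wb') as fh:
        fh.write(resp.content)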
