
Solved

programmatically downloading a web file

Posted on 2013-11-13
7
Medium Priority
431 Views
Last Modified: 2013-11-13
Hi,

I wasn't sure which zones were the most appropriate, so if you think there's a better zone, please let me know.

I have a file I would like to download on a regular basis. The file name stays the same, but I don't know the exact URL because I press a 'submit' button, which then brings up a JavaScript download popup. I just need to change the 'report type' and 'date'. Here's the site:

http://www.theocc.com/webapps/series-download

Is there a way I can automate grabbing such a file?  I'd prefer to do it in Python, but I'm not sure how I go about getting the actual URL.

I've not done any type of web crawling like this before, so if there are libraries and/or other resources, please let me know that too. I'm willing to use Java or VB if it's more convenient (though I kind of doubt it is :).

Ideas?

Thanks!
Question by:ugeb
7 Comments
 
LVL 14

Accepted Solution

by:
jb1dev earned 2000 total points
ID: 39646809
You can do this with Python's Scrapy: http://scrapy.org/

I tried just using the urllib/requests modules, but getting that exactly right can be a pain (handling exactly what the server expects in terms of HTTP headers, cookies, etc., especially if you do not have access to server-side logs).

Fortunately, the Scrapy framework handles all of this for you.

Here's what I did. To figure out the form parameters I needed (the POST data to send), I installed the Firefox extension HttpFox:

https://addons.mozilla.org/En-us/firefox/addon/httpfox/

I then ran HttpFox while I downloaded the file manually, which gave me access to the POST data.

E.g.

seriesAddDeleteSearchDTO.exchanges      02
seriesAddDeleteSearchDTO.exchanges      25
seriesAddDeleteSearchDTO.exchanges      19
seriesAddDeleteSearchDTO.exchanges      04
seriesAddDeleteSearchDTO.exchanges      01
seriesAddDeleteSearchDTO.exchanges      12
seriesAddDeleteSearchDTO.exchanges      26
seriesAddDeleteSearchDTO.exchanges      22
seriesAddDeleteSearchDTO.exchanges      11
seriesAddDeleteSearchDTO.exchanges      18
seriesAddDeleteSearchDTO.exchanges      08
seriesAddDeleteSearchDTO.exchanges      03
seriesAddDeleteSearchDTO.exchanges      20
seriesAddDeleteSearchDTO.exchanges      13
seriesAddDeleteSearchDTO.exchanges      07
seriesAddDeleteSearchDTO.exchanges      27
seriesAddDeleteSearchDTO.exchanges      39
seriesAddDeleteSearchDTO.dowloadType      B
seriesAddDeleteSearchDTO.dates      11/12/2013


So there are many seriesAddDeleteSearchDTO.exchanges values (presumably representing the checkbox selections), plus the seriesAddDeleteSearchDTO.dowloadType and seriesAddDeleteSearchDTO.dates parameters (note that 'dowloadType' is the site's own spelling).
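
For contrast, a bare POST of those same fields with Python's requests module looks roughly like this (a hypothetical sketch; as I said above, this kind of direct request kept failing for me because it misses whatever cookies and hidden state the page sets):

import requests

# The captured form fields; requests accepts a list of (name, value)
# tuples, so the repeated "exchanges" key is preserved.
payload = [('seriesAddDeleteSearchDTO.exchanges', code)
           for code in ['02', '25', '19', '04', '01', '12', '26', '22',
                        '11', '18', '08', '03', '20', '13', '07', '27', '39']]
payload += [('seriesAddDeleteSearchDTO.dowloadType', 'B'),  # the site's own spelling
            ('seriesAddDeleteSearchDTO.dates', '11/12/2013')]

# A bare POST like this misses session cookies and hidden fields,
# which is why it was unreliable for me.
resp = requests.post('http://www.theocc.com/webapps/series-download', data=payload)
print(resp.status_code, resp.headers.get('Content-Type'))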

We will use those fields to construct the form POST data.

Here is the Scrapy code I put together. Sorry if it's not perfect; I am new to Python and Scrapy.

#!/usr/bin/python

from scrapy import log
from scrapy.spider import BaseSpider
from scrapy.http import FormRequest

class LoginSpider(BaseSpider):
    name = 'myspider'
    start_urls = ['http://www.theocc.com/webapps/series-download']

    def parse(self, response):
        # The download form on the page is named "commandForm"
        return [FormRequest.from_response(
                    response,
                    formname='commandForm',
                    formdata={
                        'seriesAddDeleteSearchDTO.exchanges': ['02', '25', '19', '04', '01', '12', '26', '22', '11', '18', '08', '03', '20', '13', '07', '27', '39'],
                        'seriesAddDeleteSearchDTO.dowloadType': 'B',
                        'seriesAddDeleteSearchDTO.dates': '11/12/2013'},
                    callback=self.after_download)]

    def after_download(self, response):
        # Check the response before going on (adapted from a login-spider
        # example, hence the "authentication failed" check)
        if "authentication failed" in response.body:
            self.log("Login failed", level=log.ERROR)
            return
        self.log('*** Response is ' + response.body)



Notice that the line
self.log('*** Response is ' + response.body)
is where the response is printed.
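
If you want the downloaded contents written to disk instead of just logged, a minimal tweak would be something like this (a sketch, replacing the after_download method above; the output filename is an arbitrary choice):

    def after_download(self, response):
        if "authentication failed" in response.body:
            self.log("Login failed", level=log.ERROR)
            return
        # Write the downloaded series data to a local file
        # (the filename here is arbitrary)
        with open('series_download.txt', 'wb') as f:
            f.write(response.body)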

I then run this with
$ scrapy runspider ./scrapytest.py

(^^^ here scrapytest.py is the file containing the Python code I posted in the block above.)

If you do not have Scrapy installed, there are many ways to install it:
http://doc.scrapy.org/en/latest/intro/install.html
If you are on a Linux distro that packages it in a regular system software repository, I would install it from there. Otherwise "pip" probably works well.
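
For example:

$ pip install scrapy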
 
LVL 11

Author Closing Comment

by:ugeb
ID: 39646824
This is awesome! While I haven't tried it out yet, I'm going to, and you deserve all the points just for this. If I have more questions, I'll ask another question.

Thanks!
 
LVL 12

Expert Comment

by:Gregory Miller
ID: 39646833
I have looked through the source pretty closely. They are using jQuery to do the processing, and as best I can tell they are building the file dynamically and sending it to the browser on the fly. This means a temporary file with a random name is probably created when the file content is built; the content is sent to the browser with a header that carries the generic name download.txt, and then I am certain they just purge the temp file they created, if it was ever created at all. It could have been produced in RAM and served directly, no file required.
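
For reference, that generic name would come from the Content-Disposition header in the server's response. A quick hypothetical check with Python's requests, reusing a couple of the POST fields captured above (a sketch only, not something I ran):

import requests

# Hypothetical sketch: POST the form fields and inspect the
# Content-Disposition header, which is where "download.txt" comes from.
resp = requests.post('http://www.theocc.com/webapps/series-download',
                     data={'seriesAddDeleteSearchDTO.dowloadType': 'B',
                           'seriesAddDeleteSearchDTO.dates': '11/12/2013'})

# Expect something like: attachment; filename="download.txt"
print(resp.headers.get('Content-Disposition'))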

I took the source and created a local copy of the page, referencing all the .js src files from their site. Every time I hit the submit button, it looked locally for the download.txt file, which tells me there is some server-side script running that basically cannot be gotten to.

Having said all that, my conclusion is that there is probably no way to make this work the way you want.
 
LVL 12

Expert Comment

by:Gregory Miller
ID: 39646838
I spent too long and someone figured this out... kudos.
 
LVL 11

Author Comment

by:ugeb
ID: 39646840
@Technodweeb,

Sorry you spent too long.  Thanks for trying, though!  I'll have more questions ...
 
LVL 12

Expert Comment

by:Gregory Miller
ID: 39646852
No problem. I have learned a new trick on this question, and the HttpFox utility is very cool. I have already used it on a few sites to play around. Very cool indeed...
 
LVL 14

Expert Comment

by:jb1dev
ID: 39646870
@Technodweeb,

Yeah, I spent way too long on this as well (before I tried Scrapy, anyway).

I've done a lot of web scraping using different technologies, so I know how fickle server-side request handling can be. I'm also new to Python, so I tried the standard modules like urllib/httplib/requests first, but nothing was working. Finally I figured there had to be a library/framework for this, and I came across Scrapy. There were some others too, but Scrapy was the first I tried, and it worked pretty much right out of the box.

Keep in mind that this might be a sloppy extension of BaseSpider; it was adapted from a LoginSpider example I came across. Hence the class name and authentication check.
