Solved

download html page from web, using python - like a browser, bypass reblaze, explanation in body

Posted on 2015-02-24
506 Views
Last Modified: 2015-02-25
I'm trying to download html of a given web page - using python 2.7, on ubuntu.
I succeeded doing so for most of the web pages I saw, using several methods, such as using urllib3.

BUT, I failed to download the html of:
http://www.mako.co.il/food-recipes/recipes_column-fish-seafood/Recipe-9e6645ebcd35b41006.htm

If I open the page in my browser first, then my code can download the page for a few minutes afterwards.
After a few minutes, I can no longer download the HTML page, and I start to get:

HEADERS:
HTTPHeaderDict({'content-length': '616', 'expires': 'Tue, 24 Feb 2015 21:12:17 GMT', 'pagespeed': 'off', 'server': 'Reblaze Secure Web Gateway', 'connection': 'keep-alive', 'x-ua-compatible': 'IE=EmulateIE8', 'cache-control': 'private, no-cache, no-store, no-transform', 'date': 'Tue, 24 Feb 2015 21:12:17 GMT', 'x-cdn': 'Akamai', 'p3p': 'CP="IDC DSP COR ADM DEVi TAIi PSA PSD IVAi IVDi CONi HIS OUR IND CNT"', 'content-type': 'text/html; charset=utf-8'})

HTML:
<html><head><meta charset="utf-8"></head><body><script src="//d1a702rd0dylue.cloudfront.net/js/iealml-03/3600.js"></script><script>window.rbzns = {}; rbzns.hosts="www.mako.co.il mako.co.il"; rbzns.ctrbg="dVa9rce47U+iuusPxpSoG2zKw2PX1p1wpNsKpeo92FVY8m3Rww27b3eDes1IrdG2XG0sBBFooJqpNad4cFnt/fwvNznkniELGLpI0nurISYw1/qvHNtj+vAKZVCEcPcWbuWz2cEkppGJoNkMl3LNK2hv5QHSCYPLt78wQnMRLmk=";rbzns.rbzreqid="rbz-mako-reblazer0531343232313035323932bdaed4e40029eed1"; winsocks(true);</script></body></html>
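That challenge response is distinctive, so it can be recognized programmatically before deciding how to proceed. A minimal sketch (the helper name is my own, not part of any library) that keys off the 3600.js script tag and the Reblaze server header shown above:

```python
def is_reblaze_challenge(html, headers=None):
    # The challenge page always pulls in 3600.js and builds a
    # window.rbzns object; the server header also names Reblaze.
    if headers and 'Reblaze' in headers.get('server', ''):
        return True
    return '3600.js' in html or 'rbzns' in html

challenge = ('<html><head><meta charset="utf-8"></head><body>'
             '<script src="//d1a702rd0dylue.cloudfront.net/js/iealml-03/3600.js"></script>'
             '<script>window.rbzns = {};</script></body></html>')

print(is_reblaze_challenge(challenge))                                   # True
print(is_reblaze_challenge('<html><body>recipe content</body></html>'))  # False
```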


Here is my code:

headers = {'user-agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.99 Safari/537.36',
           'connection': 'keep-alive',
           'accept-encoding': 'gzip, deflate, sdch',
           'accept-language': 'en-US,en;q=0.8,he;q=0.6,he-IL;q=0.4',
           'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8'}
http = urllib3.PoolManager(10, headers=headers)
r = http.request('GET', url)

How can I always download the page using only python code?

Thanks!
Question by:omer d
5 Comments
 
LVL 83

Expert Comment

by:Dave Baldwin
ID: 40629465
You can't.  The HTML you posted contains two JavaScript blocks that you can't run with Python, and it redirects from the 'cloud' provider to the actual web content.

'Akamai' is a CDN (Content Delivery Network) that delivers web content for its clients.
 

Author Comment

by:omer d
ID: 40629485
And can I do it in code using another language?
Because I can access the page using a regular browser..
 
LVL 83

Expert Comment

by:Dave Baldwin
ID: 40629499
JavaScript is built into all browsers.  I don't know of any other programming language that will also run that JavaScript.
 
LVL 25

Accepted Solution

by:
clockwatcher earned 500 total points
ID: 40630309
There's a spidermonkey interface (https://pypi.python.org/pypi/python-spidermonkey) available for Python that lets you fire up a JavaScript interpreter and pass objects to it, so you actually can do what you're after. It's slightly complicated by the fact that the JavaScript they're using is obfuscated, and by the fact that you don't have a DOM. But their script isn't so bad that it can't be fairly easily hacked to do what you're after: retrieve the cookie it needs to access the page.

The following is working for me to pull the ingredients:
from bs4 import BeautifulSoup
import requests
import re
import spidermonkey

class RecipeGetter(object):
    def __init__(self, url):
        self.url = url
        s = requests.Session()
        r = s.get(self.url)
        self.html = r.text
        cookies = dict()
        # If we got the Reblaze challenge page instead of the recipe,
        # run its javascript to build the cookies and request again.
        if '3600.js' in self.html:
            cookies['rbzreqid'], cookies['rbzid'] = self.getRbzid(self.html)
            r = s.get(self.url, cookies=cookies)
            self.html = r.text
        self.soup = BeautifulSoup(self.html)

    def getRbzid(self, page):
        # The request id is embedded directly in the challenge page.
        rbzreqid = re.search(r'(rbz-mako-reblazer.*?)"', page).group(1)
        soup = BeautifulSoup(page)
        script = soup.find_all('script')[1].text
        # Force the obfuscated script to hang its state off our fake window.
        script = re.sub(r'(window\.)?rbzns', 'window.rbzns', script)
        script = re.sub('winsocks', 'window.winsocks', script)
        rt = spidermonkey.Runtime()
        cx = rt.new_context()
        # Minimal stand-in for the browser objects the script touches.
        window = {"document": {
                        "documentElement": {
                            "scrollLeft": ""
                         },
                   },
                   "screen": {
                         "width": 1920,
                         "height": 1080,
                         "availHeight": 1000,
                         "availWidth": 1000
                   },
                   "navigator": {
                          "userAgent": ""
                   }
                 }
        cx.add_global("window", window)
        jscript = open('3601.js', 'r').read()
        jscript = jscript + script
        cx.execute(jscript)
        # The hacked 3601.js stores the cookie string it built in window.retval.
        cookie = window['retval']
        match = re.search('rbzid=(.*?);', cookie)
        if match:
            return rbzreqid, match.group(1)
        return '', ''

    def getIngredients(self):
        for ingredient in self.soup.find_all('li', itemprop='ingredient'):
            yield ingredient.span.text


def main():
    url = 'http://www.mako.co.il/food-recipes/recipes_column-fish-seafood/Recipe-9e6645ebcd35b41006.htm'
    r = RecipeGetter(url)
    for ingredient in r.getIngredients():
        print ingredient

if __name__ == '__main__':
    main()



Also attached is a hacked-up version of the script that they're using to build the cookie your request will need.  The Python script above expects it to be named 3601.js and to live in the same directory as the Python script.

Anyway, it's working for me.  Installing python-spidermonkey is a little more complex than your standard pip install, but if you're on Linux it's not that tough.  Windows would be a bit tougher; you'd probably need to go with Cygwin and Cygwin's Python.
3601.js
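To show just the extraction step in isolation (no spidermonkey needed), the two regexes the script relies on can be exercised against strings shaped like the challenge page and the cookie it produces. The sample values below are shortened stand-ins, not real tokens:

```python
import re

# Shortened stand-in for the challenge page's inline script (not a real id).
page = 'rbzns.rbzreqid="rbz-mako-reblazer0531abc"; winsocks(true);'
rbzreqid = re.search(r'(rbz-mako-reblazer.*?)"', page).group(1)
print(rbzreqid)  # rbz-mako-reblazer0531abc

# Shortened stand-in for the cookie string the javascript builds.
cookie = 'rbzid=dVa9rce47U+abc==; path=/; domain=.mako.co.il'
rbzid = re.search('rbzid=(.*?);', cookie).group(1)
print(rbzid)  # dVa9rce47U+abc==
```

The non-greedy `.*?` in both patterns stops at the first delimiter (the closing quote and the first semicolon, respectively), which is what keeps the captures from swallowing the rest of the line.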
 

Author Closing Comment

by:omer d
ID: 40630977
WOW, what a great answer!!!!!! thank you :)
