omer d


Download an HTML page from the web using Python, like a browser (bypass Reblaze); explanation in body

I'm trying to download the HTML of a given web page using Python 2.7 on Ubuntu.
I have succeeded with most of the web pages I tried, using several methods, such as urllib3.

BUT, I fail to download the HTML of:
http://www.mako.co.il/food-recipes/recipes_column-fish-seafood/Recipe-9e6645ebcd35b41006.htm

If I open the page in my browser, I can then download it with my code for a few minutes.
After a few minutes, I can no longer download the HTML page, and I start getting:

HEADERS:
HTTPHeaderDict({'content-length': '616', 'expires': 'Tue, 24 Feb 2015 21:12:17 GMT', 'pagespeed': 'off', 'server': 'Reblaze Secure Web Gateway', 'connection': 'keep-alive', 'x-ua-compatible': 'IE=EmulateIE8', 'cache-control': 'private, no-cache, no-store, no-transform', 'date': 'Tue, 24 Feb 2015 21:12:17 GMT', 'x-cdn': 'Akamai', 'p3p': 'CP="IDC DSP COR ADM DEVi TAIi PSA PSD IVAi IVDi CONi HIS OUR IND CNT"', 'content-type': 'text/html; charset=utf-8'})

HTML:
<html><head><meta charset="utf-8"></head><body><script src="//d1a702rd0dylue.cloudfront.net/js/iealml-03/3600.js"></script><script>window.rbzns = {}; rbzns.hosts="www.mako.co.il mako.co.il"; rbzns.ctrbg="dVa9rce47U+iuusPxpSoG2zKw2PX1p1wpNsKpeo92FVY8m3Rww27b3eDes1IrdG2XG0sBBFooJqpNad4cFnt/fwvNznkniELGLpI0nurISYw1/qvHNtj+vAKZVCEcPcWbuWz2cEkppGJoNkMl3LNK2hv5QHSCYPLt78wQnMRLmk=";rbzns.rbzreqid="rbz-mako-reblazer0531343232313035323932bdaed4e40029eed1"; winsocks(true);</script></body></html>


Here is my code:

import urllib3

url = 'http://www.mako.co.il/food-recipes/recipes_column-fish-seafood/Recipe-9e6645ebcd35b41006.htm'
# Browser-like request headers (Chrome user agent)
user_agent = {
    'user-agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.99 Safari/537.36',
    'connection': 'keep-alive',
    'accept-encoding': 'gzip, deflate, sdch',
    'accept-language': 'en-US,en;q=0.8,he;q=0.6,he-IL;q=0.4',
    'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
}
http = urllib3.PoolManager(10, headers=user_agent)
r = http.request('GET', url)

How can I always download the page using only Python code?

Thanks!
Dave Baldwin

You can't. The HTML you posted contains two JavaScript scripts, which you can't run with Python; they redirect from the 'cloud' provider to the actual web content.

'Akamai' is a CDN (Content Delivery Network) that delivers web content for its clients.
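
To make the point above concrete, here is a minimal sketch of how a script could at least recognize that it received the JavaScript challenge page instead of the real content. It only detects the situation; it does not get past it. The function names `is_reblaze_challenge` and `challenge_scripts` are made up for illustration, the check on the `server` header and the `rbzns` marker is an assumption based solely on the response posted in the question, and `SAMPLE_BODY` is a trimmed copy of that posted HTML.

```python
import re

# Trimmed copy of the challenge HTML posted in the question (long tokens omitted).
SAMPLE_BODY = (
    '<html><head><meta charset="utf-8"></head><body>'
    '<script src="//d1a702rd0dylue.cloudfront.net/js/iealml-03/3600.js"></script>'
    '<script>window.rbzns = {}; winsocks(true);</script>'
    '</body></html>'
)


def is_reblaze_challenge(headers, body):
    """Guess whether a response is the JavaScript challenge page.

    Assumption: the challenge can be recognized by the 'server' header and
    the 'rbzns' marker seen in the question; a real deployment may differ.
    """
    server = ''
    for name, value in headers.items():
        # Header names are compared case-insensitively.
        if name.lower() == 'server':
            server = value
    return 'reblaze' in server.lower() and 'rbzns' in body


def challenge_scripts(body):
    """Return the src URLs of the external scripts the page asks a browser to run."""
    return re.findall(r'<script[^>]*\bsrc="([^"]+)"', body)
```

With the urllib3 code from the question, this could be called as `is_reblaze_challenge(r.headers, r.data.decode('utf-8'))`. When it returns true, the body holds only the scripts that a browser would execute before being sent on to the real page.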
omer d

ASKER

And can I do it in code using another language?
Because I can access the page using a regular browser.
JavaScript is built into all browsers. I don't know of any other programming language that will also run that JavaScript.
ASKER CERTIFIED SOLUTION
clockwatcher

This solution is only available to members.

ASKER

WOW, what a great answer!!!!!! thank you :)