omer d
asked on
Download an HTML page from the web using Python, like a browser (bypassing Reblaze); explanation in body
I'm trying to download the HTML of a given web page using Python 2.7 on Ubuntu.
I have succeeded with most of the pages I tried, using several methods, including urllib3.
However, I fail to download the HTML of:
http://www.mako.co.il/food-recipes/recipes_column-fish-seafood/Recipe-9e6645ebcd35b41006.htm
If I open the page in my browser, my code can then download the page for a few minutes.
After a few minutes, I can no longer download the HTML, and I start getting:
HEADERS:
HTTPHeaderDict({'content-length': '616', 'expires': 'Tue, 24 Feb 2015 21:12:17 GMT', 'pagespeed': 'off', 'server': 'Reblaze Secure Web Gateway', 'connection': 'keep-alive', 'x-ua-compatible': 'IE=EmulateIE8', 'cache-control': 'private, no-cache, no-store, no-transform', 'date': 'Tue, 24 Feb 2015 21:12:17 GMT', 'x-cdn': 'Akamai', 'p3p': 'CP="IDC DSP COR ADM DEVi TAIi PSA PSD IVAi IVDi CONi HIS OUR IND CNT"', 'content-type': 'text/html; charset=utf-8'})
HTML:
<html><head><meta charset="utf-8"></head><body><script src="//d1a702rd0dylue.cloudfront.net/js/iealml-03/3600.js"></script><script>window.rbzns = {}; rbzns.hosts="www.mako.co.il mako.co.il"; rbzns.ctrbg="dVa9rce47U+iuusPxpSoG2zKw2PX1p1wpNsKpeo92FVY8m3Rww27b3eDes1IrdG2XG0sBBFooJqpNad4cFnt/fwvNznkniELGLpI0nurISYw1/qvHNtj+vAKZVCEcPcWbuWz2cEkppGJoNkMl3LNK2hv5QHSCYPLt78wQnMRLmk=";rbzns.rbzreqid="rbz-mako-reblazer0531343232313035323932bdaed4e40029eed1"; winsocks(true);</script></body></html>
Here is my code:
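import urllib3

# Page that returns the Reblaze challenge instead of the recipe HTML
url = 'http://www.mako.co.il/food-recipes/recipes_column-fish-seafood/Recipe-9e6645ebcd35b41006.htm'

# Request headers copied from a desktop Chrome session
user_agent = {'user-agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.99 Safari/537.36',
              'connection': 'keep-alive',
              'accept-encoding': 'gzip, deflate, sdch',
              'accept-language': 'en-US,en;q=0.8,he;q=0.6,he-IL;q=0.4',
              'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8'}
http = urllib3.PoolManager(10, headers=user_agent)
r = http.request('GET', url)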
How can I always download the page using only python code?
Thanks!
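The blocked response shown above can at least be recognised in code, so a script can tell the challenge page apart from the real recipe HTML. This is a minimal sketch, assuming the two signals visible in the output (a "server" header mentioning Reblaze and the rbzns object in the body) are stable markers. The helper name looks_like_reblaze_challenge is made up for illustration and is not part of urllib3.

```python
def looks_like_reblaze_challenge(headers, body):
    """Return True if a response resembles the challenge page shown above.

    Assumption: the 'server' header and the 'rbzns' script object seen in
    the question are reliable markers; Reblaze may change either one.
    """
    # urllib3 returns the body as bytes (r.data); accept text as well.
    if isinstance(body, bytes):
        body = body.decode("utf-8", "replace")
    server = ""
    for name, value in headers.items():
        if name.lower() == "server":
            server = value
    return "reblaze" in server.lower() and "rbzns" in body


# Abbreviated copies of the headers and HTML from the question.
sample_headers = {"server": "Reblaze Secure Web Gateway", "x-cdn": "Akamai"}
sample_body = "<html><body><script>window.rbzns = {}; winsocks(true);</script></body></html>"

print(looks_like_reblaze_challenge(sample_headers, sample_body))                 # True
print(looks_like_reblaze_challenge({"server": "nginx"}, "<html>recipe</html>"))  # False
```

With the code from the question, the call would be looks_like_reblaze_challenge(r.headers, r.data).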
ASKER
Can I do it in code using another language?
I ask because I can access the page with a regular browser.
JavaScript is built into all browsers. I don't know of any other programming language that will also run that JavaScript.
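One way to get a JavaScript engine from Python is to drive a real browser. The sketch below assumes the third-party Selenium package and a local Firefox install, neither of which is mentioned in this thread. It also assumes the challenge has finished once the rbzns marker disappears from the page source. Whether Reblaze lets an automated browser through is not confirmed here, and both function names are hypothetical.

```python
import time


def fetch_rendered_html(driver, url, marker="rbzns", timeout=30.0, poll=0.5):
    """Load url in a browser and wait until the challenge marker is gone.

    driver is any object with a get(url) method and a page_source
    attribute, such as a Selenium WebDriver.
    """
    driver.get(url)
    deadline = time.time() + timeout
    html = driver.page_source
    # Assumption: the marker vanishes once the browser has run the script
    # and been redirected to the real page.
    while marker in html and time.time() < deadline:
        time.sleep(poll)
        html = driver.page_source
    return html  # may still be the challenge page if the timeout expired


def download_with_firefox(url):
    """Hypothetical convenience wrapper; requires 'pip install selenium'."""
    from selenium import webdriver  # imported lazily: third-party dependency
    driver = webdriver.Firefox()
    try:
        return fetch_rendered_html(driver, url)
    finally:
        driver.quit()
```

Passing the driver in as a parameter keeps the waiting logic independent of Selenium, so it can be exercised with a stand-in object before a real browser is involved.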
ASKER CERTIFIED SOLUTION
This solution is only available to members.
ASKER
WOW, what a great answer!!!!!! thank you :)
Akamai is a CDN (content delivery network) that delivers web content on behalf of its clients.