Web scraping using Python

Hi,

I'm using Python's Browser() to download HTML pages.
It works for most sites,
but it doesn't work for: http://www.hashulchan.co.il/?CategoryID=541&ArticleID=13120
Instead of the page, I'm getting:

<html style="height:100%"><head><META NAME="ROBOTS" CONTENT="NOINDEX, NOFOLLOW"><meta name="format-detection" content="telephone=no"><meta name="viewport" content="initial-scale=1.0"><meta http-equiv="X-UA-Compatible" content="IE=edge,chrome=1"></head><body style="margin:0px;height:100%"><iframe src="/_Incapsula_Resource?CWUDNSAI=9&xinfo=0-273981-0 0NNN RT(1427217058600 4) q(0 -1 -1 -1) r(0 -1) B12(4,315,0)&incident_id=253000020000650957-2911433792487456&edet=12&cinfo=04000000" frameborder=0 width="100%" height="100%" marginheight="0px" marginwidth="0px">Request unsuccessful. Incapsula incident ID: 253000020000650957-2911433792487456</iframe></body></html>

How can I download the page? Is this some kind of protection?

Thanks.
omer d asked:

Walter Ritzel, Senior Software Engineer, commented:
Try adding the line below to your code. Be warned that doing so violates web crawling etiquette: the site you are trying to access most likely has a robots.txt file that does not allow robots to download, crawl, or index its content.

import mechanize
br = mechanize.Browser()
br.set_handle_robots(False)  #add this line to your code
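
Before switching robots handling off, it may be worth checking what a robots.txt file actually forbids. The following is a minimal sketch, assuming Python 3 and its standard urllib.robotparser module (the thread's own code targets Python 2, where the module is named robotparser). The helper name is_allowed, the user agent "my-crawler", and the sample robots.txt content are hypothetical; they are not taken from the site in question.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_txt, user_agent, url):
    """Return True if the given robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes the file's lines directly, so no network request is made here
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Hypothetical robots.txt content, for illustration only
SAMPLE_ROBOTS = """\
User-agent: *
Disallow: /private/
"""

print(is_allowed(SAMPLE_ROBOTS, "my-crawler", "http://example.com/private/page.html"))
print(is_allowed(SAMPLE_ROBOTS, "my-crawler", "http://example.com/public/page.html"))
```

For a live site, the robots.txt text would first have to be downloaded from the site root; that step is left out so the sketch runs offline.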

omer d (Author) commented:
Hi,

Thanks, I have no intention of scraping the whole site or harming it...

I'm using:
import cookielib  # Python 2 standard library
from mechanize import Browser

browser = Browser()
# keep cookies between requests
cookiejar = cookielib.LWPCookieJar()
browser.set_cookiejar(cookiejar)
# present a desktop Chrome user agent
browser.addheaders = [('User-agent', 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.99 Safari/537.36')]
browser.set_handle_gzip(True)
browser.set_handle_refresh(False)
browser.set_handle_redirect(True)
browser.set_handle_equiv(False)
# ignore robots.txt, as suggested above
browser.set_handle_robots(False)
html = browser.open(self._url).get_data()

and yet sometimes I get the result posted above, and sometimes:

<html>
<head>
<META NAME="robots" CONTENT="noindex,nofollow">
<script>
(function() {  function getSessionCookies() {   cookieArray = new Array();   var cName = /^\s?incap_ses_/;   var c = document.cookie.split(";");   for (var i = 0; i < c.length; i++) {    key = c[i].substr(0, c[i].indexOf("="));    value = c[i].substr(c[i].indexOf("=") + 1, c[i].length);    if (cName.test(key)) {     cookieArray[cookieArray.length] = value    }   }   return cookieArray  }  function setIncapCookie(vArray) {   try {    cookies = getSessionCookies();    digests = new Array(cookies.length);    for (var i = 0; i < cookies.length; i++) {     digests[i] = simpleDigest((vArray) + cookies[i])    }    res = vArray + ",digest=" + (digests.join())   } catch (e) {    res = vArray + ",digest=" + (encodeURIComponent(e.toString()))   }   createCookie("___utmvc", res, 20)  }  function simpleDigest(mystr) {   var res = 0;   for (var i = 0; i < mystr.length; i++) {    res += mystr.charCodeAt(i)   }   return res  }  function createCookie(name, value, seconds) {   if (seconds) {    var date = new Date();    date.setTime(date.getTime() + (seconds * 1000));    var expires = "; expires=" + date.toGMTString()   } else {    var expires = ""   }   document.cookie = name + "=" + value + expires + "; path=/"  }  function test(o) {   var res = "";   var vArray = new Array();   for (var j = 0; j < o.length; j++) {    var test = o[j][0]    switch (o[j][1]) {    case "exists_boolean":     try { 	 if(typeof(eval(test)) != "undefined"){ 		vArray[vArray.length] = encodeURIComponent(test + "=true") 	 } 	 else{ 		vArray[vArray.length] = encodeURIComponent(test + "=false") 	 }     } catch (e) {      vArray[vArray.length] = encodeURIComponent(test + "=false")     }     break;    case "exists":     try {      vArray[vArray.length] = encodeURIComponent(test + "=" + typeof(eval(test)))     } catch (e) {      vArray[vArray.length] = encodeURIComponent(test + "=" + e)     }     break;    case "value":     try {      vArray[vArray.length] = encodeURIComponent(test + "=" + eval(test).toString()) 
    } catch (e) {      vArray[vArray.length] = encodeURIComponent(test + "=" + e)     }     break;     case "plugins":     try{         p=navigator.plugins         pres=""         for (a in p){pres+=(p[a]['description']+" ").substring(0,20)}         vArray[vArray.length] = encodeURIComponent("plugins=" + pres);         }     catch(e){         vArray[vArray.length] = encodeURIComponent("plugins=" +e);         } 	break;      case "plugin":     try {      a = navigator.plugins;      for (i in a) {       f = a[i]["filename"].split(".");       if (f.length == 2) {        vArray[vArray.length] = encodeURIComponent("plugin=" + f[1]);        break       }      }     } catch (e) {      vArray[vArray.length] = encodeURIComponent("plugin=" + e)     }     break    }   }   vArray = vArray.join();   return vArray  }  var o = [   ["navigator", "exists_boolean"],   ["navigator.vendor", "value"],   ["opera", "exists_boolean"],   ["ActiveXObject", "exists_boolean"],   ["navigator.appName", "value"],   ["platform", "plugin"],   ["webkitURL", "exists_boolean"],   ["navigator.plugins.length==0", "value"],   ["_phantom", "exists_boolean"] ];  try {   setIncapCookie(test(o));   document.createElement("img").src = "/_Incapsula_Resource?SWKMTFSR=1&e=" + Math.random()  } catch (e) {   img = document.createElement("img");   img.src = "/_Incapsula_Resource?SWKMTFSR=1&e=" + e  } })();
</script>
<script>
(function() { 
var z="";var b="7472797B766172207868723B76617220743D6E6577204461746528292E67657454696D6528293B766172207374617475733D227374617274223B7661722074696D696E673D6E65772041727261792833293B77696E646F772E6F6E756E6C6F61643D66756E6374696F6E28297B74696D696E675B325D3D22723A222B286E6577204461746528292E67657454696D6528292D74293B646F63756D656E742E637265617465456C656D656E742822696D6722292E7372633D222F5F496E63617073756C615F5265736F757263653F4553324C555243543D363726743D373826643D222B656E636F6465555249436F6D706F6E656E74287374617475732B222028222B74696D696E672E6A6F696E28292B222922297D3B69662877696E646F772E584D4C4874747052657175657374297B7868723D6E657720584D4C48747470526571756573747D656C73657B7868723D6E657720416374697665584F626A65637428224D6963726F736F66742E584D4C4854545022297D7868722E6F6E726561647973746174656368616E67653D66756E6374696F6E28297B737769746368287868722E72656164795374617465297B6361736520303A7374617475733D6E6577204461746528292E67657454696D6528292D742B223A2072657175657374206E6F7420696E697469616C697A656420223B627265616B3B6361736520313A7374617475733D6E6577204461746528292E67657454696D6528292D742B223A2073657276657220636F6E6E656374696F6E2065737461626C6973686564223B627265616B3B6361736520323A7374617475733D6E6577204461746528292E67657454696D6528292D742B223A2072657175657374207265636569766564223B627265616B3B6361736520333A7374617475733D6E6577204461746528292E67657454696D6528292D742B223A2070726F63657373696E672072657175657374223B627265616B3B6361736520343A7374617475733D22636F6D706C657465223B74696D696E675B315D3D22633A222B286E6577204461746528292E67657454696D6528292D74293B6966287868722E7374617475733D3D323030297B706172656E742E6C6F636174696F6E2E72656C6F616428297D627265616B7D7D3B74696D696E675B305D3D22733A222B286E6577204461746528292E67657454696D6528292D74293B7868722E6F70656E2822474554222C222F5F496E63617073756C615F5265736F757263653F535748414E45444C3D393039353730333932353036393232363034352C31323434343635363831333133323430363131352C383839343634353737373637353638323136382C343335303830222C66616C7365293B7868
722E73656E64286E756C6C297D63617463682863297B7374617475732B3D6E6577204461746528292E67657454696D6528292D742B2220696E6361705F6578633A20222B633B646F63756D656E742E637265617465456C656D656E742822696D6722292E7372633D222F5F496E63617073756C615F5265736F757263653F4553324C555243543D363726743D373826643D222B656E636F6465555249436F6D706F6E656E74287374617475732B222028222B74696D696E672E6A6F696E28292B222922297D3B";for (var i=0;i<b.length;i+=2){z=z+parseInt(b.substring(i, i+2), 16)+",";}z = z.substring(0,z.length-1); eval(eval('String.fromCharCode('+z+')'));})();
</script></head>
<body>
<iframe style="display:none;visibility:hidden;" src="//content.incapsula.com/jsTest.html" id="gaIframe"></iframe>
</body></html>
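
The second script in the response above keeps its code as a string of two-digit hex character codes in the variable b and evaluates it after decoding. To read what that code does, the string can be decoded offline. This is a minimal sketch, assuming Python 3; decode_hex_payload is a hypothetical helper name, and the sample input is only the first 12 bytes of the string shown above.

```python
def decode_hex_payload(hex_string):
    """Decode a string of two-digit hex character codes into text."""
    # Drop any whitespace or line breaks introduced when the page was copied
    compact = "".join(hex_string.split())
    # latin-1 maps each byte to the character with the same code,
    # which mirrors what String.fromCharCode does in the script
    return bytes.fromhex(compact).decode("latin-1")


# First 12 bytes of the hex string from the response above
prefix = "7472797B766172207868723B"
print(decode_hex_payload(prefix))  # prints: try{var xhr;
```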

Dave Baldwin, Fixer of Problems, commented:
You're seeing that because the actual content is in the 'iframe'. An 'iframe' is a way to include the content of another page in the current page. However, the browser, or the crawler in this case, has to make a separate request for the page in the iframe.
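
To illustrate the separate request, here is a minimal sketch of how a crawler could find the iframe targets in a page and turn them into absolute URLs to fetch. It assumes Python 3 and only the standard library; IframeCollector, iframe_urls, and the example.com URLs are hypothetical names used for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class IframeCollector(HTMLParser):
    """Collect the src attribute of every iframe tag in a document."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lower-cases tag names before calling this method
        if tag == "iframe":
            src = dict(attrs).get("src")
            if src:
                self.sources.append(src)


def iframe_urls(page_url, html):
    """Return the absolute URLs of all iframes found in html."""
    collector = IframeCollector()
    collector.feed(html)
    # Relative src values are resolved against the URL of the containing page
    return [urljoin(page_url, src) for src in collector.sources]


html = '<html><body><iframe src="/frames/content.html" width="100%"></iframe></body></html>'
print(iframe_urls("http://example.com/articles/1", html))
```

Each returned URL would then need its own request. Whether that request returns useful content depends on what the iframe points at.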

omer d (Author) commented:
Hi Dave,

Thank you for your answer, but I don't think that's the case.
The iframe src is not a real page URL...

it looks like I need to handle the robot issue...
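
Both responses posted earlier in this thread reference "/_Incapsula_Resource", and the first one also contains the text "Incapsula incident ID". A crawler could at least recognize such a response instead of treating it as the article. This is a minimal sketch, assuming Python 3; looks_like_incapsula_block is a hypothetical helper, the marker list is a heuristic taken only from the two responses above, and it detects the block page without getting around it.

```python
# Strings that appear in the two blocked responses posted in this thread
INCAPSULA_MARKERS = ("_Incapsula_Resource", "Incapsula incident ID")


def looks_like_incapsula_block(html):
    """Heuristically report whether html is an Incapsula challenge or block page."""
    return any(marker in html for marker in INCAPSULA_MARKERS)


blocked = '<iframe src="/_Incapsula_Resource?CWUDNSAI=9">Request unsuccessful.</iframe>'
normal = "<html><body><h1>Article title</h1></body></html>"
print(looks_like_incapsula_block(blocked))
print(looks_like_incapsula_block(normal))
```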
Walter Ritzel, Senior Software Engineer, commented:
Hi omer_d,
Were you able to solve the issue?
omer d (Author) commented:
Hi Walter,
No... :/
Suhas ., Senior QA Manager, commented:
No comment has been added to this question in more than 21 days, so it is now classified as abandoned.

I have recommended this question be closed as follows:

Split:
-- Walter Ritzel (https:#a40685422)
-- omer d (https:#a40685489)


If you feel this question should be closed differently, post an objection and the moderators will review all objections and close it as they feel fit. If no one objects, this question will be closed automatically the way described above.

suhasbharadwaj
Experts-Exchange Cleanup Volunteer