web scraping using python

Hi,

I'm using python Browser() to download html pages,
it's working for most of the sites,
it doesn't work for: http://www.hashulchan.co.il/?CategoryID=541&ArticleID=13120
I'm getting:

<html style="height:100%"><head><META NAME="ROBOTS" CONTENT="NOINDEX, NOFOLLOW"><meta name="format-detection" content="telephone=no"><meta name="viewport" content="initial-scale=1.0"><meta http-equiv="X-UA-Compatible" content="IE=edge,chrome=1"></head><body style="margin:0px;height:100%"><iframe src="/_Incapsula_Resource?CWUDNSAI=9&xinfo=0-273981-0 0NNN RT(1427217058600 4) q(0 -1 -1 -1) r(0 -1) B12(4,315,0)&incident_id=253000020000650957-2911433792487456&edet=12&cinfo=04000000" frameborder=0 width="100%" height="100%" marginheight="0px" marginwidth="0px">Request unsuccessful. Incapsula incident ID: 253000020000650957-2911433792487456</iframe></body></html>

How can I download the page, is it some kind of protection?

Thanks.
omer dAsked:
Who is Participating?
 
Walter RitzelSenior Software EngineerCommented:
Try to add the line below to your code. Let me warn you that if you do that, you'll be harming the web crawling etiquette. The url you are trying to access should have a robots.txt file that does not allow access to robots to download or crawl or index their content.

import mechanize
br = mechanize.Browser()
br.set_handle_robots(False)  #add this line to your code

Open in new window

0
 
omer dAuthor Commented:
Hi,

Thanks, I've no intention to scrape the all site or to harm it...

I'm using:
browser = Browser()
cookiejar = cookielib.LWPCookieJar()
browser.set_cookiejar(cookiejar)
browser.addheaders = [('User-agent', 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.99 Safari/537.36')]
browser.set_handle_gzip(True)
browser.set_handle_refresh(False)
browser.set_handle_redirect(True)
browser.set_handle_equiv(False)
browser.set_handle_robots(False)
html = browser.open(self._url).get_data()

Open in new window


and yet I'm getting sometime the posted result, and sometime:

<html>
<head>
<META NAME="robots" CONTENT="noindex,nofollow">
<script>
(function() {  function getSessionCookies() {   cookieArray = new Array();   var cName = /^\s?incap_ses_/;   var c = document.cookie.split(";");   for (var i = 0; i < c.length; i++) {    key = c[i].substr(0, c[i].indexOf("="));    value = c[i].substr(c[i].indexOf("=") + 1, c[i].length);    if (cName.test(key)) {     cookieArray[cookieArray.length] = value    }   }   return cookieArray  }  function setIncapCookie(vArray) {   try {    cookies = getSessionCookies();    digests = new Array(cookies.length);    for (var i = 0; i < cookies.length; i++) {     digests[i] = simpleDigest((vArray) + cookies[i])    }    res = vArray + ",digest=" + (digests.join())   } catch (e) {    res = vArray + ",digest=" + (encodeURIComponent(e.toString()))   }   createCookie("___utmvc", res, 20)  }  function simpleDigest(mystr) {   var res = 0;   for (var i = 0; i < mystr.length; i++) {    res += mystr.charCodeAt(i)   }   return res  }  function createCookie(name, value, seconds) {   if (seconds) {    var date = new Date();    date.setTime(date.getTime() + (seconds * 1000));    var expires = "; expires=" + date.toGMTString()   } else {    var expires = ""   }   document.cookie = name + "=" + value + expires + "; path=/"  }  function test(o) {   var res = "";   var vArray = new Array();   for (var j = 0; j < o.length; j++) {    var test = o[j][0]    switch (o[j][1]) {    case "exists_boolean":     try { 	 if(typeof(eval(test)) != "undefined"){ 		vArray[vArray.length] = encodeURIComponent(test + "=true") 	 } 	 else{ 		vArray[vArray.length] = encodeURIComponent(test + "=false") 	 }     } catch (e) {      vArray[vArray.length] = encodeURIComponent(test + "=false")     }     break;    case "exists":     try {      vArray[vArray.length] = encodeURIComponent(test + "=" + typeof(eval(test)))     } catch (e) {      vArray[vArray.length] = encodeURIComponent(test + "=" + e)     }     break;    case "value":     try {      vArray[vArray.length] = encodeURIComponent(test + "=" + eval(test).toString())     } catch (e) {      vArray[vArray.length] = encodeURIComponent(test + "=" + e)     }     break;     case "plugins":     try{         p=navigator.plugins         pres=""         for (a in p){pres+=(p[a]['description']+" ").substring(0,20)}         vArray[vArray.length] = encodeURIComponent("plugins=" + pres);         }     catch(e){         vArray[vArray.length] = encodeURIComponent("plugins=" +e);         } 	break;      case "plugin":     try {      a = navigator.plugins;      for (i in a) {       f = a[i]["filename"].split(".");       if (f.length == 2) {        vArray[vArray.length] = encodeURIComponent("plugin=" + f[1]);        break       }      }     } catch (e) {      vArray[vArray.length] = encodeURIComponent("plugin=" + e)     }     break    }   }   vArray = vArray.join();   return vArray  }  var o = [   ["navigator", "exists_boolean"],   ["navigator.vendor", "value"],   ["opera", "exists_boolean"],   ["ActiveXObject", "exists_boolean"],   ["navigator.appName", "value"],   ["platform", "plugin"],   ["webkitURL", "exists_boolean"],   ["navigator.plugins.length==0", "value"],   ["_phantom", "exists_boolean"] ];  try {   setIncapCookie(test(o));   document.createElement("img").src = "/_Incapsula_Resource?SWKMTFSR=1&e=" + Math.random()  } catch (e) {   img = document.createElement("img");   img.src = "/_Incapsula_Resource?SWKMTFSR=1&e=" + e  } })();
</script>
<script>
(function() { 
var z="";var b="7472797B766172207868723B76617220743D6E6577204461746528292E67657454696D6528293B766172207374617475733D227374617274223B7661722074696D696E673D6E65772041727261792833293B77696E646F772E6F6E756E6C6F61643D66756E6374696F6E28297B74696D696E675B325D3D22723A222B286E6577204461746528292E67657454696D6528292D74293B646F63756D656E742E637265617465456C656D656E742822696D6722292E7372633D222F5F496E63617073756C615F5265736F757263653F4553324C555243543D363726743D373826643D222B656E636F6465555249436F6D706F6E656E74287374617475732B222028222B74696D696E672E6A6F696E28292B222922297D3B69662877696E646F772E584D4C4874747052657175657374297B7868723D6E657720584D4C48747470526571756573747D656C73657B7868723D6E657720416374697665584F626A65637428224D6963726F736F66742E584D4C4854545022297D7868722E6F6E726561647973746174656368616E67653D66756E6374696F6E28297B737769746368287868722E72656164795374617465297B6361736520303A7374617475733D6E6577204461746528292E67657454696D6528292D742B223A2072657175657374206E6F7420696E697469616C697A656420223B627265616B3B6361736520313A7374617475733D6E6577204461746528292E67657454696D6528292D742B223A2073657276657220636F6E6E656374696F6E2065737461626C6973686564223B627265616B3B6361736520323A7374617475733D6E6577204461746528292E67657454696D6528292D742B223A2072657175657374207265636569766564223B627265616B3B6361736520333A7374617475733D6E6577204461746528292E67657454696D6528292D742B223A2070726F63657373696E672072657175657374223B627265616B3B6361736520343A7374617475733D22636F6D706C657465223B74696D696E675B315D3D22633A222B286E6577204461746528292E67657454696D6528292D74293B6966287868722E7374617475733D3D323030297B706172656E742E6C6F636174696F6E2E72656C6F616428297D627265616B7D7D3B74696D696E675B305D3D22733A222B286E6577204461746528292E67657454696D6528292D74293B7868722E6F70656E2822474554222C222F5F496E63617073756C615F5265736F757263653F535748414E45444C3D393039353730333932353036393232363034352C31323434343635363831333133323430363131352C383839343634353737373637353638323136382C343335303830222C66616C7365293B7868722E73656E64286E756C6C297D63617463682863297B7374617475732B3D6E6577204461746528292E67657454696D6528292D742B2220696E6361705F6578633A20222B633B646F63756D656E742E637265617465456C656D656E742822696D6722292E7372633D222F5F496E63617073756C615F5265736F757263653F4553324C555243543D363726743D373826643D222B656E636F6465555249436F6D706F6E656E74287374617475732B222028222B74696D696E672E6A6F696E28292B222922297D3B";for (var i=0;i<b.length;i+=2){z=z+parseInt(b.substring(i, i+2), 16)+",";}z = z.substring(0,z.length-1); eval(eval('String.fromCharCode('+z+')'));})();
</script></head>
<body>
<iframe style="display:none;visibility:hidden;" src="//content.incapsula.com/jsTest.html" id="gaIframe"></iframe>
</body></html>

Open in new window

0
 
Dave BaldwinFixer of ProblemsCommented:
You're seeing that because the actual content is in the 'iframe'.  An 'iframe' is a method to include the content of another page in the current page.  However, the browser or the crawler in this case, has to make a separate request for the page in the iframe.
0
Cloud Class® Course: Microsoft Office 2010

This course will introduce you to the interfaces and features of Microsoft Office 2010 Word, Excel, PowerPoint, Outlook, and Access. You will learn about the features that are shared between all products in the Office suite, as well as the new features that are product specific.

 
omer dAuthor Commented:
Hi Dave,

thank you for you answer, but I don't think it's the case..
the iframe src is not a real page url...

it looks like I need to handle the robot issue...
0
 
Walter RitzelSenior Software EngineerCommented:
Hi omer_d,
where you able to solve the issue?
0
 
omer dAuthor Commented:
Hi Walter,
No... :/
0
 
Suhas .QA ManagerCommented:
No comment has been added to this question in more than 21 days, so it is now classified as abandoned.

I have recommended this question be closed as follows:

Split:
-- Walter Ritzel (https:#a40685422)
-- omer d (https:#a40685489)


If you feel this question should be closed differently, post an objection and the moderators will review all objections and close it as they feel fit. If no one objects, this question will be closed automatically the way described above.

suhasbharadwaj
Experts-Exchange Cleanup Volunteer
0
Question has a verified solution.

Are you are experiencing a similar issue? Get a personalized answer when you ask a related question.

Have a better answer? Share it in a comment.

All Courses

From novice to tech pro — start learning today.