Solved

Cannot retrieve XPath using Scrapy

Posted on 2013-07-01
755 Views
Last Modified: 2013-10-08
Hello, I am trying to get the XPath for the title and link of the cells with class listCell. I believe I am doing it right because I get no errors, but when I write the output to a CSV file I get nothing in it. I also tested my spider on other websites such as Amazon and it worked fine, but it does not work for this website. Please help!!

	
def parse(self, response):
    self.log("\n\n\n We got data! \n\n\n")
    hxs = HtmlXPathSelector(response)
    sites = hxs.select('//form[@id=\'listForm\']/table/tbody/tr')
    items = []
    for site in sites:
        item = CarrierItem()
        item['title'] = site.select('.//td[@class\'listCell\']/a/text()').extract()
        item['link'] = site.select('.//td[@class\'listCell\']/a/@href').extract()
        items.append(item)
    return items




Here is my HTML. Could it be that it is not working because there is JavaScript in the HTML?

	
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
<title> Carrier IQ DIS 2.4 :: All Devices</title>
<script type="text/javascript" src="/dis/js/main.js">
<script type="text/javascript" src="/dis/js/validate.js">
<link rel="stylesheet" type="text/css" href="/dis/css/portal.css">
<link rel="stylesheet" type="text/css" href="/dis/css/style.css">
<script type="text/javascript">

    ....

<form id="listForm" name="listForm" method="POST" action="">
    <table>
    <thead>
    <tbody>
    <tr>
    <td class="crt">1</td>
    <td class="listCell" align="center">
    <a href="/dis/packages.jsp?view=list&show=perdevice&device_gid=3651746C4173775343535452414567746D75643855673D3D53564A6151624D41716D534C68395A6337634E2F62413D3D&hwdid=probe0&mdn=6505550000&subscrbid=6505550000&maxlength=100">6505550000</a>
    </td>
    <td class="listCell" align="center">
    <a href="/dis/packages.jsp?view=list&show=perdevice&device_gid=3651746C4173775343535452414567746D75643855673D3D53564A6151624D41716D534C68395A6337634E2F62413D3D&hwdid=probe0&subscrbid=6505550000&mdn=6505550000&maxlength=100">probe0</a>
    </td>
    <td class="listCell" align="center">
    <td class="listCell" align="center">
    <td class="cell" align="center">2013-07-01 13:39:38.820</td>
    <td class="cell" align="left">1 - SMS_PullRequest_CS</td>
    <td class="listCell" align="right">
    <td class="listCell" align="center">
    <td class="listCell" align="center">
    </tr>
    </tbody>
    </table>
    </form>


Output:

     
C:\Users\ye831c\Documents\Big Data\Scrapy\carrier>scrapy crawl dis -o iqDis.csv -t csv
2013-07-01 10:50:18-0500 [scrapy] INFO: Scrapy 0.16.5 started (bot: carrier)
2013-07-01 10:50:18-0500 [scrapy] DEBUG: Enabled extensions: FeedExporter, LogStats, TelnetConsole, CloseSpider, WebService, CoreStats, SpiderState
2013-07-01 10:50:19-0500 [scrapy] DEBUG: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, RedirectMiddleware, CookiesMiddleware, HttpCompressionMiddleware, ChunkedTransferMiddleware, DownloaderStats
2013-07-01 10:50:19-0500 [scrapy] DEBUG: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2013-07-01 10:50:19-0500 [scrapy] DEBUG: Enabled item pipelines:
2013-07-01 10:50:19-0500 [dis] INFO: Spider opened
2013-07-01 10:50:19-0500 [dis] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2013-07-01 10:50:19-0500 [scrapy] DEBUG: Telnet console listening on 0.0.0.0:6023
2013-07-01 10:50:19-0500 [scrapy] DEBUG: Web service listening on 0.0.0.0:6080
2013-07-01 10:50:19-0500 [dis] DEBUG: Crawled (200) <GET https://qvpweb01.ciq.labs.att.com:8080/dis/login.jsp> (referer: None)
2013-07-01 10:50:19-0500 [dis] DEBUG: Redirecting (302) to <GET https://qvpweb01.ciq.labs.att.com:8080/dis/> from <POST https://qvpweb01.ciq.labs.att.com:8080/dis/login>
2013-07-01 10:50:20-0500 [dis] DEBUG: Crawled (200) <GET https://qvpweb01.ciq.labs.att.com:8080/dis/> (referer: https://qvpweb01.ciq.labs.att.com:8080/dis/login.jsp)
2013-07-01 10:50:20-0500 [dis] DEBUG:

    Successfully logged in. Let's start crawling!

2013-07-01 10:50:21-0500 [dis] DEBUG: Crawled (200) <GET https://qvpweb01.ciq.labs.att.com:8080/dis/> (referer: https://qvpweb01.ciq.labs.att.com:8080/dis/)
2013-07-01 10:50:21-0500 [dis] DEBUG:

     We got data!

2013-07-01 10:50:21-0500 [dis] INFO: Closing spider (finished)
2013-07-01 10:50:21-0500 [dis] INFO: Dumping Scrapy stats:
    {'downloader/request_bytes': 1382,
     'downloader/request_count': 4,
     'downloader/request_method_count/GET': 3,
     'downloader/request_method_count/POST': 1,
     'downloader/response_bytes': 147888,
     'downloader/response_count': 4,
     'downloader/response_status_count/200': 3,
     'downloader/response_status_count/302': 1,
     'finish_reason': 'finished',
     'finish_time': datetime.datetime(2013, 7, 1, 15, 50, 21, 221000),
     'log_count/DEBUG': 12,
     'log_count/INFO': 4,
     'request_depth_max': 2,
     'response_received_count': 3,
     'scheduler/dequeued': 4,
     'scheduler/dequeued/memory': 4,
     'scheduler/enqueued': 4,
     'scheduler/enqueued/memory': 4,
     'start_time': datetime.datetime(2013, 7, 1, 15, 50, 19, 42000)}
2013-07-01 10:50:21-0500 [dis] INFO: Spider closed (finished)


Question by:yescobar2012
7 Comments
 
Accepted Solution by: clockwatcher (LVL 25, earned 500 total points)
ID: 39316379
Your XPath expressions look suspect to me:

  site.select('.//td[@class\'listCell\']/a/text()').extract()
  site.select('.//td[@class\'listCell\']/a/@href').extract()

Looks like they're missing an '=' and should be:

   site.select(".//td[@class='listCell']/a/text()").extract()
   site.select(".//td[@class='listCell']/a/@href").extract()
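In case it helps to verify the fix outside a full crawl, here is a minimal, self-contained sketch of the same two expressions using lxml (the library Scrapy's selectors are built on). The table below is a simplified stand-in for the posted HTML, with the long hrefs shortened, so it assumes lxml is installed:

```python
from lxml import html

# Simplified stand-in for the posted page: same structure, hrefs shortened.
PAGE = """
<html><body>
<form id="listForm" name="listForm" method="POST" action="">
<table>
<tbody>
<tr>
<td class="crt">1</td>
<td class="listCell" align="center"><a href="/dis/packages.jsp?view=list">6505550000</a></td>
<td class="listCell" align="center"><a href="/dis/packages.jsp?view=list">probe0</a></td>
</tr>
</tbody>
</table>
</form>
</body></html>
"""

doc = html.fromstring(PAGE)
rows = doc.xpath("//form[@id='listForm']/table/tbody/tr")

# The corrected expressions, with the '=' in place.
titles = [t for row in rows for t in row.xpath(".//td[@class='listCell']/a/text()")]
links = [h for row in rows for h in row.xpath(".//td[@class='listCell']/a/@href")]

print(titles)  # ['6505550000', 'probe0']
```

If this prints the expected values against the real page source too, the spider's `item['title']` and `item['link']` lines only need the same one-character fix.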


Here's a Scrapy shell session that works for me with your HTML file:
mark@wheeze:~/Desktop/ee/scrapy/testing> ../bin/scrapy shell http://localhost:8000/doc.html
2013-07-10 19:13:56-0700 [scrapy] INFO: Scrapy 0.16.5 started (bot: testing)
2013-07-10 19:13:56-0700 [scrapy] DEBUG: Enabled extensions: TelnetConsole, CloseSpider, WebService, CoreStats, SpiderState
2013-07-10 19:13:56-0700 [scrapy] DEBUG: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, RedirectMiddleware, CookiesMiddleware, HttpCompressionMiddleware, ChunkedTransferMiddleware, DownloaderStats
2013-07-10 19:13:56-0700 [scrapy] DEBUG: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2013-07-10 19:13:56-0700 [scrapy] DEBUG: Enabled item pipelines: 
2013-07-10 19:13:56-0700 [scrapy] DEBUG: Telnet console listening on 0.0.0.0:6023
2013-07-10 19:13:56-0700 [scrapy] DEBUG: Web service listening on 0.0.0.0:6080
2013-07-10 19:13:56-0700 [default] INFO: Spider opened
localhost - - [10/Jul/2013 19:13:56] "GET /doc.html HTTP/1.0" 200 -
2013-07-10 19:13:56-0700 [default] DEBUG: Crawled (200) <GET http://localhost:8000/doc.html> (referer: None)
[s] Available Scrapy objects:
[s]   hxs        <HtmlXPathSelector xpath=None data=u'<html><head><meta http-equiv="Content-Ty'>
[s]   item       {}
[s]   request    <GET http://localhost:8000/doc.html>
[s]   response   <200 http://localhost:8000/doc.html>
[s]   settings   <CrawlerSettings module=<module 'testing.settings' from '/home/markh/Desktop/ee/scrapy/testing/testing/settings.pyc'>>
[s]   spider     <BaseSpider 'default' at 0x225ddd0>
[s] Useful shortcuts:
[s]   shelp()           Shell help (print this help)
[s]   fetch(req_or_url) Fetch request (or URL) and update local objects
[s]   view(response)    View response in a browser

>>> sites = hxs.select('//form[@id=\'listForm\']/table/tbody/tr')
>>> for site in sites:
...     print site.select(".//td[@class='listCell']/a/text()").extract()
... 
[u'6505550000', u'probe0']


 
Expert Comment by: Gary (LVL 58)
ID: 39549867
I've requested that this question be deleted for the following reason:

Not enough information to confirm an answer.
 
Expert Comment by: clockwatcher (LVL 25)
ID: 39549868
The XPath expression in his post would have resulted in exactly what he was experiencing: an empty file, since it wouldn't have matched anything.

My fixed XPath worked with the HTML he provided. I believe my post provided a solution to the problem.
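As a side note on why the feed came out empty: a quick check with lxml (the XPath engine behind Scrapy's selectors) suggests the predicate with the missing '=' is not merely unselective but invalid XPath, which lxml rejects outright. Either way, no items are extracted. A small sketch comparing the two, assuming lxml is installed:

```python
from lxml import etree

row = etree.fromstring('<tr><td class="listCell"><a href="#">probe0</a></td></tr>')

# Corrected predicate: matches the anchor text as expected.
good = row.xpath(".//td[@class='listCell']/a/text()")

# Malformed predicate (missing '='): lxml raises XPathEvalError
# instead of returning an empty result.
rejected = False
try:
    row.xpath(".//td[@class'listCell']/a/text()")
except etree.XPathEvalError:
    rejected = True

print(good, rejected)  # ['probe0'] True
```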
