?
Solved

cannot retrieve xpath using scrapy

Posted on 2013-07-01
7
Medium Priority
?
823 Views
Last Modified: 2013-10-08
Hello I am trying to get the xpath for title and text for class listCell. I believe I am doing it right because i get no errors but when i display it in a csv file i do not get nothing in the output file. I also tested my scrapy in other websites such as amazon and it worked fine but not working for this website. Please help!!

	
   def parse(self, response):
		self.log("\n\n\n We got data! \n\n\n")
		hxs = HtmlXPathSelector(response)
		sites = hxs.select('//form[@id=\'listForm\']/table/tbody/tr')
		items = []
		for site in sites:
		    item = CarrierItem()
		    item['title'] = site.select('.//td[@class\'listCell\']/a/text()').extract()
		    item['link'] = site.select('.//td[@class\'listCell\']/a/@href').extract()
		    items.append(item)
		return items

Open in new window



here is my html. Could it be possible it is not working because it has javascript in the html?

	
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
<title> Carrier IQ DIS 2.4 :: All Devices</title>
<script type="text/javascript" src="/dis/js/main.js">
<script type="text/javascript" src="/dis/js/validate.js">
<link rel="stylesheet" type="text/css" href="/dis/css/portal.css">
<link rel="stylesheet" type="text/css" href="/dis/css/style.css">
<script type="text/javascript">

    ....

<form id="listForm" name="listForm" method="POST" action="">
&#9;<table>
&#9;<thead>
&#9;<tbody>
&#9;<tr>
&#9;<td class="crt">1</td>
&#9;<td class="listCell" align="center">
&#9;<a href="/dis/packages.jsp?view=list&show=perdevice&device_gid=3651746C4173775343535452414567746D75643855673D3D53564A6151624D41716D534C68395A6337634E2F62413D3D&hwdid=probe0&mdn=6505550000&subscrbid=6505550000&maxlength=100">6505550000</a>
&#9;</td>
&#9;<td class="listCell" align="center">
&#9;<a href="/dis/packages.jsp?view=list&show=perdevice&device_gid=3651746C4173775343535452414567746D75643855673D3D53564A6151624D41716D534C68395A6337634E2F62413D3D&hwdid=probe0&subscrbid=6505550000&mdn=6505550000&maxlength=100">probe0</a>
&#9;</td>
&#9;<td class="listCell" align="center">
&#9;<td class="listCell" align="center">
&#9;<td class="cell" align="center">2013-07-01 13:39:38.820</td>
&#9;<td class="cell" align="left">1 - SMS_PullRequest_CS</td>
&#9;<td class="listCell" align="right">
&#9;<td class="listCell" align="center">
&#9;<td class="listCell" align="center">
&#9;</tr>
&#9;</tbody>
&#9;</table>
&#9;</form>

Open in new window

output

     
  C:\Users\ye831c\Documents\Big Data\Scrapy\carrier>scrapy crawl dis -o iqDis.csv
&#9;-t csv
&#9;2013-07-01 10:50:18-0500 [scrapy] INFO: Scrapy 0.16.5 started (bot: carrier)
&#9;2013-07-01 10:50:18-0500 [scrapy] DEBUG: Enabled extensions: FeedExporter, LogSt
&#9;ats, TelnetConsole, CloseSpider, WebService, CoreStats, SpiderState
&#9;2013-07-01 10:50:19-0500 [scrapy] DEBUG: Enabled downloader middlewares: HttpAut
&#9;hMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, De
&#9;faultHeadersMiddleware, RedirectMiddleware, CookiesMiddleware, HttpCompressionMi
&#9;ddleware, ChunkedTransferMiddleware, DownloaderStats
&#9;2013-07-01 10:50:19-0500 [scrapy] DEBUG: Enabled spider middlewares: HttpErrorMi
&#9;ddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddle
&#9;ware
&#9;2013-07-01 10:50:19-0500 [scrapy] DEBUG: Enabled item pipelines:
&#9;2013-07-01 10:50:19-0500 [dis] INFO: Spider opened
&#9;2013-07-01 10:50:19-0500 [dis] INFO: Crawled 0 pages (at 0 pages/min), scraped 0
&#9; items (at 0 items/min)
&#9;2013-07-01 10:50:19-0500 [scrapy] DEBUG: Telnet console listening on 0.0.0.0:602
&#9;3
&#9;2013-07-01 10:50:19-0500 [scrapy] DEBUG: Web service listening on 0.0.0.0:6080
&#9;2013-07-01 10:50:19-0500 [dis] DEBUG: Crawled (200) <GET https://qvpweb01.ciq.la
&#9;bs.att.com:8080/dis/login.jsp> (referer: None)
&#9;2013-07-01 10:50:19-0500 [dis] DEBUG: Redirecting (302) to <GET https://qvpweb01
&#9;.ciq.labs.att.com:8080/dis/> from <POST https://qvpweb01.ciq.labs.att.com:8080/d
&#9;is/login>
&#9;2013-07-01 10:50:20-0500 [dis] DEBUG: Crawled (200) <GET https://qvpweb01.ciq.la
&#9;bs.att.com:8080/dis/> (referer: https://qvpweb01.ciq.labs.att.com:8080/dis/login
&#9;.jsp)
&#9;2013-07-01 10:50:20-0500 [dis] DEBUG:


&#9;&#9;Successfully logged in. Let's start crawling!



&#9;2013-07-01 10:50:21-0500 [dis] DEBUG: Crawled (200) <GET https://qvpweb01.ciq.la
&#9;bs.att.com:8080/dis/> (referer: https://qvpweb01.ciq.labs.att.com:8080/dis/)
&#9;2013-07-01 10:50:21-0500 [dis] DEBUG:


&#9;&#9; We got data!



&#9;2013-07-01 10:50:21-0500 [dis] INFO: Closing spider (finished)
&#9;2013-07-01 10:50:21-0500 [dis] INFO: Dumping Scrapy stats:
&#9;&#9;{'downloader/request_bytes': 1382,
&#9;&#9; 'downloader/request_count': 4,
&#9;&#9; 'downloader/request_method_count/GET': 3,
&#9;&#9; 'downloader/request_method_count/POST': 1,
&#9;&#9; 'downloader/response_bytes': 147888,
&#9;&#9; 'downloader/response_count': 4,
&#9;&#9; 'downloader/response_status_count/200': 3,
&#9;&#9; 'downloader/response_status_count/302': 1,
&#9;&#9; 'finish_reason': 'finished',
&#9;&#9; 'finish_time': datetime.datetime(2013, 7, 1, 15, 50, 21, 221000),
&#9;&#9; 'log_count/DEBUG': 12,
&#9;&#9; 'log_count/INFO': 4,
&#9;&#9; 'request_depth_max': 2,
&#9;&#9; 'response_received_count': 3,
&#9;&#9; 'scheduler/dequeued': 4,
&#9;&#9; 'scheduler/dequeued/memory': 4,
&#9;&#9; 'scheduler/enqueued': 4,
&#9;&#9; 'scheduler/enqueued/memory': 4,
&#9;&#9; 'start_time': datetime.datetime(2013, 7, 1, 15, 50, 19, 42000)}
&#9;2013-07-01 10:50:21-0500 [dis] INFO: Spider closed (finished)

&#9;

Open in new window

0
Comment
Question by:yescobar2012
[X]
Welcome to Experts Exchange

Add your voice to the tech community where 5M+ people just like you are talking about what matters.

  • Help others & share knowledge
  • Earn cash & points
  • Learn & ask questions
  • 2
7 Comments
 
LVL 25

Accepted Solution

by:
clockwatcher earned 2000 total points
ID: 39316379
Your Xpath expressions look suspect to me:

  site.select('.//td[@class\'listCell\']/a/text()').extract()
  site.select('.//td[@class\'listCell\']/a/@href').extract()

Looks like they're missing an '=' and should be:

   site.select(".//td[@class='listCell']/a/text()").extract()
   site.select(".//td[@class='listCell']/a/@href").extract()


Here's a scrapy shell session that is working for me with your html file:
mark@wheeze:~/Desktop/ee/scrapy/testing> ../bin/scrapy shell http://localhost:8000/doc.html
2013-07-10 19:13:56-0700 [scrapy] INFO: Scrapy 0.16.5 started (bot: testing)
2013-07-10 19:13:56-0700 [scrapy] DEBUG: Enabled extensions: TelnetConsole, CloseSpider, WebService, CoreStats, SpiderState
2013-07-10 19:13:56-0700 [scrapy] DEBUG: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, RedirectMiddleware, CookiesMiddleware, HttpCompressionMiddleware, ChunkedTransferMiddleware, DownloaderStats
2013-07-10 19:13:56-0700 [scrapy] DEBUG: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2013-07-10 19:13:56-0700 [scrapy] DEBUG: Enabled item pipelines: 
2013-07-10 19:13:56-0700 [scrapy] DEBUG: Telnet console listening on 0.0.0.0:6023
2013-07-10 19:13:56-0700 [scrapy] DEBUG: Web service listening on 0.0.0.0:6080
2013-07-10 19:13:56-0700 [default] INFO: Spider opened
localhost - - [10/Jul/2013 19:13:56] "GET /doc.html HTTP/1.0" 200 -
2013-07-10 19:13:56-0700 [default] DEBUG: Crawled (200) <GET http://localhost:8000/doc.html> (referer: None)
[s] Available Scrapy objects:
[s]   hxs        <HtmlXPathSelector xpath=None data=u'<html><head><meta http-equiv="Content-Ty'>
[s]   item       {}
[s]   request    <GET http://localhost:8000/doc.html>
[s]   response   <200 http://localhost:8000/doc.html>
[s]   settings   <CrawlerSettings module=<module 'testing.settings' from '/home/markh/Desktop/ee/scrapy/testing/testing/settings.pyc'>>
[s]   spider     <BaseSpider 'default' at 0x225ddd0>
[s] Useful shortcuts:
[s]   shelp()           Shell help (print this help)
[s]   fetch(req_or_url) Fetch request (or URL) and update local objects
[s]   view(response)    View response in a browser

>>> sites = hxs.select('//form[@id=\'listForm\']/table/tbody/tr')
>>> for site in sites:
...     print site.select(".//td[@class='listCell']/a/text()").extract()
... 
[u'6505550000', u'probe0']

Open in new window

0
 
LVL 58

Expert Comment

by:Gary
ID: 39549867
I've requested that this question be deleted for the following reason:

Not enough information to confirm an answer.
0
 
LVL 25

Expert Comment

by:clockwatcher
ID: 39549868
The xpath expression in his post would have resulted in what he was experiencing -- an empty file since it wouldn't have matched anything.  

My fixed xpath worked with the html he provided.  I believe my post provided a solution to the problem.
0

Featured Post

Independent Software Vendors: We Want Your Opinion

We value your feedback.

Take our survey and automatically be enter to win anyone of the following:
Yeti Cooler, Amazon eGift Card, and Movie eGift Card!

Question has a verified solution.

If you are experiencing a similar issue, please ask a related question

When it comes to write a Context Sensitive Help (an online help that is obtained from a specific point in state of software to provide help with that state) ,  first we need to make the file that contains all topics, which are given exclusive IDs. …
Originally, this post was published on Monitis Blog, you can check it here . It goes without saying that technology has transformed society and the very nature of how we live, work, and communicate in ways that would’ve been incomprehensible 5 ye…
The viewer will learn how to dynamically set the form action using jQuery.
The is a quite short video tutorial. In this video, I'm going to show you how to create self-host WordPress blog with free hosting service.
Suggested Courses

765 members asked questions and received personalized solutions in the past 7 days.

Join the community of 500,000 technology professionals and ask your questions.

Join & Ask a Question