yescobar2012
Cannot retrieve XPath results using Scrapy
Hello, I am trying to get the title and text of the cells with class listCell using XPath. I believe I am doing it right because I get no errors, but when I export the results to a CSV file, the output file is empty. I also tested my Scrapy spider on other websites such as Amazon and it worked fine, but it is not working for this website. Please help!
	
Here is my parse method:
	
	
def parse(self, response):
    self.log("\n\n\n We got data! \n\n\n")
    hxs = HtmlXPathSelector(response)
    sites = hxs.select('//form[@id=\'listForm\']/table/tbody/tr')
    items = []
    for site in sites:
        item = CarrierItem()
        item['title'] = site.select('.//td[@class\'listCell\']/a/text()').extract()
        item['link'] = site.select('.//td[@class\'listCell\']/a/@href').extract()
        items.append(item)
    return items
Here is my HTML. Could it be failing because the page contains JavaScript?
	
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
<title> Carrier IQ DIS 2.4 :: All Devices</title>
<script type="text/javascript" src="/dis/js/main.js">
<script type="text/javascript" src="/dis/js/validate.js">
<link rel="stylesheet" type="text/css" href="/dis/css/portal.css">
<link rel="stylesheet" type="text/css" href="/dis/css/style.css">
<script type="text/javascript">
....
<form id="listForm" name="listForm" method="POST" action="">
	<table>
	<thead>
	<tbody>
	<tr>
	<td class="crt">1</td>
	<td class="listCell" align="center">
	<a href="/dis/packages.jsp?view=list&show=perdevice&device_gid=3651746C4173775343535452414567746D75643855673D3D53564A6151624D41716D534C68395A6337634E2F62413D3D&hwdid=probe0&mdn=6505550000&subscrbid=6505550000&maxlength=100">6505550000</a>
	</td>
	<td class="listCell" align="center">
	<a href="/dis/packages.jsp?view=list&show=perdevice&device_gid=3651746C4173775343535452414567746D75643855673D3D53564A6151624D41716D534C68395A6337634E2F62413D3D&hwdid=probe0&subscrbid=6505550000&mdn=6505550000&maxlength=100">probe0</a>
	</td>
	<td class="listCell" align="center">
	<td class="listCell" align="center">
	<td class="cell" align="center">2013-07-01 13:39:38.820</td>
	<td class="cell" align="left">1 - SMS_PullRequest_CS</td>
	<td class="listCell" align="right">
	<td class="listCell" align="center">
	<td class="listCell" align="center">
	</tr>
	</tbody>
	</table>
	</form>
Output:
	C:\Users\ye831c\Documents\Big Data\Scrapy\carrier>scrapy crawl dis -o iqDis.csv -t csv
	2013-07-01 10:50:18-0500 [scrapy] INFO: Scrapy 0.16.5 started (bot: carrier)
	2013-07-01 10:50:18-0500 [scrapy] DEBUG: Enabled extensions: FeedExporter, LogStats, TelnetConsole, CloseSpider, WebService, CoreStats, SpiderState
	2013-07-01 10:50:19-0500 [scrapy] DEBUG: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, RedirectMiddleware, CookiesMiddleware, HttpCompressionMiddleware, ChunkedTransferMiddleware, DownloaderStats
	2013-07-01 10:50:19-0500 [scrapy] DEBUG: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
	2013-07-01 10:50:19-0500 [scrapy] DEBUG: Enabled item pipelines:
	2013-07-01 10:50:19-0500 [dis] INFO: Spider opened
	2013-07-01 10:50:19-0500 [dis] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
	2013-07-01 10:50:19-0500 [scrapy] DEBUG: Telnet console listening on 0.0.0.0:6023
	2013-07-01 10:50:19-0500 [scrapy] DEBUG: Web service listening on 0.0.0.0:6080
	2013-07-01 10:50:19-0500 [dis] DEBUG: Crawled (200) <GET https://qvpweb01.ciq.labs.att.com:8080/dis/login.jsp> (referer: None)
	2013-07-01 10:50:19-0500 [dis] DEBUG: Redirecting (302) to <GET https://qvpweb01.ciq.labs.att.com:8080/dis/> from <POST https://qvpweb01.ciq.labs.att.com:8080/dis/login>
	2013-07-01 10:50:20-0500 [dis] DEBUG: Crawled (200) <GET https://qvpweb01.ciq.labs.att.com:8080/dis/> (referer: https://qvpweb01.ciq.labs.att.com:8080/dis/login.jsp)
	2013-07-01 10:50:20-0500 [dis] DEBUG:
		Successfully logged in. Let's start crawling!
	2013-07-01 10:50:21-0500 [dis] DEBUG: Crawled (200) <GET https://qvpweb01.ciq.labs.att.com:8080/dis/> (referer: https://qvpweb01.ciq.labs.att.com:8080/dis/)
	2013-07-01 10:50:21-0500 [dis] DEBUG:
		 We got data!
	2013-07-01 10:50:21-0500 [dis] INFO: Closing spider (finished)
	2013-07-01 10:50:21-0500 [dis] INFO: Dumping Scrapy stats:
		{'downloader/request_bytes': 1382,
		 'downloader/request_count': 4,
		 'downloader/request_method_count/GET': 3,
		 'downloader/request_method_count/POST': 1,
		 'downloader/response_bytes': 147888,
		 'downloader/response_count': 4,
		 'downloader/response_status_count/200': 3,
		 'downloader/response_status_count/302': 1,
		 'finish_reason': 'finished',
		 'finish_time': datetime.datetime(2013, 7, 1, 15, 50, 21, 221000),
		 'log_count/DEBUG': 12,
		 'log_count/INFO': 4,
		 'request_depth_max': 2,
		 'response_received_count': 3,
		 'scheduler/dequeued': 4,
		 'scheduler/dequeued/memory': 4,
		 'scheduler/enqueued': 4,
		 'scheduler/enqueued/memory': 4,
		 'start_time': datetime.datetime(2013, 7, 1, 15, 50, 19, 42000)}
	2013-07-01 10:50:21-0500 [dis] INFO: Spider closed (finished)
	
The XPath expression in his post would have produced exactly what he was experiencing: an empty file, since it wouldn't have matched anything.
My fixed XPath worked with the HTML he provided. I believe my post provided a solution to the problem.
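The fixed expression itself is not shown in this thread, so the following is only a sketch of what a working version might look like, not the accepted solution. It assumes two likely causes: the predicate `@class\'listCell\'` in the posted code is missing its `=`, and the `<tbody>` element may have been added by the browser's inspector rather than being present in the raw page Scrapy downloads. To stay runnable without Scrapy, the sketch uses Python's standard `xml.etree.ElementTree` on a trimmed, XML-clean copy of the posted row; the shortened `href` values and the `extract_rows` helper are hypothetical and for illustration only.

```python
import xml.etree.ElementTree as ET

# Trimmed, XML-clean copy of the row from the question.
# The href values are shortened placeholders, not the real query strings.
SAMPLE = """
<form id="listForm" name="listForm">
  <table>
    <tbody>
      <tr>
        <td class="crt">1</td>
        <td class="listCell" align="center"><a href="/dis/packages.jsp?hwdid=probe0">6505550000</a></td>
        <td class="listCell" align="center"><a href="/dis/packages.jsp?mdn=6505550000">probe0</a></td>
        <td class="cell" align="center">2013-07-01 13:39:38.820</td>
      </tr>
    </tbody>
  </table>
</form>
"""


def extract_rows(markup):
    """Return one dict per table row with the link texts and hrefs."""
    root = ET.fromstring(markup)
    rows = []
    # './/tr' finds the rows whether or not a <tbody> wrapper is present.
    for tr in root.findall(".//tr"):
        # Note the '=' between @class and the quoted value.
        anchors = tr.findall("./td[@class='listCell']/a")
        rows.append({
            "title": [a.text for a in anchors],
            "link": [a.get("href") for a in anchors],
        })
    return rows


rows = extract_rows(SAMPLE)
print(rows)
```

In the spider from the question, the corresponding selector strings would presumably be `//form[@id='listForm']//tr` for the rows, and `.//td[@class='listCell']/a/text()` and `.//td[@class='listCell']/a/@href` for the two fields. Whether the raw page also needs JavaScript to render the table cannot be told from the thread; comparing `response.body` with what the browser shows would answer that.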
Not enough information to confirm an answer.