Solved

Cannot retrieve XPath using Scrapy

Posted on 2013-07-01
Last Modified: 2013-10-08
Hello, I am trying to get, via XPath, the title text and link from the cells with class listCell. I believe I am doing it right because I get no errors, but when I export to a CSV file I get nothing in the output. I also tested my Scrapy spider on other websites, such as Amazon, and it worked fine, but it is not working for this website. Please help!!

	
   def parse(self, response):
       self.log("\n\n\n We got data! \n\n\n")
       hxs = HtmlXPathSelector(response)
       sites = hxs.select('//form[@id=\'listForm\']/table/tbody/tr')
       items = []
       for site in sites:
           item = CarrierItem()
           item['title'] = site.select('.//td[@class\'listCell\']/a/text()').extract()
           item['link'] = site.select('.//td[@class\'listCell\']/a/@href').extract()
           items.append(item)
       return items




Here is my HTML. Could it be that it is not working because there is JavaScript in the HTML?

	
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
<title> Carrier IQ DIS 2.4 :: All Devices</title>
<script type="text/javascript" src="/dis/js/main.js">
<script type="text/javascript" src="/dis/js/validate.js">
<link rel="stylesheet" type="text/css" href="/dis/css/portal.css">
<link rel="stylesheet" type="text/css" href="/dis/css/style.css">
<script type="text/javascript">

    ....

<form id="listForm" name="listForm" method="POST" action="">
    <table>
    <thead>
    <tbody>
    <tr>
    <td class="crt">1</td>
    <td class="listCell" align="center">
    <a href="/dis/packages.jsp?view=list&show=perdevice&device_gid=3651746C4173775343535452414567746D75643855673D3D53564A6151624D41716D534C68395A6337634E2F62413D3D&hwdid=probe0&mdn=6505550000&subscrbid=6505550000&maxlength=100">6505550000</a>
    </td>
    <td class="listCell" align="center">
    <a href="/dis/packages.jsp?view=list&show=perdevice&device_gid=3651746C4173775343535452414567746D75643855673D3D53564A6151624D41716D534C68395A6337634E2F62413D3D&hwdid=probe0&subscrbid=6505550000&mdn=6505550000&maxlength=100">probe0</a>
    </td>
    <td class="listCell" align="center">
    <td class="listCell" align="center">
    <td class="cell" align="center">2013-07-01 13:39:38.820</td>
    <td class="cell" align="left">1 - SMS_PullRequest_CS</td>
    <td class="listCell" align="right">
    <td class="listCell" align="center">
    <td class="listCell" align="center">
    </tr>
    </tbody>
    </table>
    </form>

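On the JavaScript question: a quick way to tell is to look at exactly what Scrapy downloaded rather than what a browser renders. A minimal sketch, assuming Scrapy's open_in_browser debugging helper (from scrapy.utils.response) and the same spider as above:

   from scrapy.utils.response import open_in_browser

   def parse(self, response):
       # Dump the exact HTML the spider received into a local browser.
       # If the listCell rows are visible here, JavaScript is not the problem
       # and the XPath expressions are the place to look.
       open_in_browser(response)

Alternatively, the scrapy shell's view(response) shortcut (listed in the shell session further down) does the same thing interactively.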

Output:

     
C:\Users\ye831c\Documents\Big Data\Scrapy\carrier>scrapy crawl dis -o iqDis.csv -t csv
2013-07-01 10:50:18-0500 [scrapy] INFO: Scrapy 0.16.5 started (bot: carrier)
2013-07-01 10:50:18-0500 [scrapy] DEBUG: Enabled extensions: FeedExporter, LogStats, TelnetConsole, CloseSpider, WebService, CoreStats, SpiderState
2013-07-01 10:50:19-0500 [scrapy] DEBUG: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, RedirectMiddleware, CookiesMiddleware, HttpCompressionMiddleware, ChunkedTransferMiddleware, DownloaderStats
2013-07-01 10:50:19-0500 [scrapy] DEBUG: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2013-07-01 10:50:19-0500 [scrapy] DEBUG: Enabled item pipelines:
2013-07-01 10:50:19-0500 [dis] INFO: Spider opened
2013-07-01 10:50:19-0500 [dis] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2013-07-01 10:50:19-0500 [scrapy] DEBUG: Telnet console listening on 0.0.0.0:6023
2013-07-01 10:50:19-0500 [scrapy] DEBUG: Web service listening on 0.0.0.0:6080
2013-07-01 10:50:19-0500 [dis] DEBUG: Crawled (200) <GET https://qvpweb01.ciq.labs.att.com:8080/dis/login.jsp> (referer: None)
2013-07-01 10:50:19-0500 [dis] DEBUG: Redirecting (302) to <GET https://qvpweb01.ciq.labs.att.com:8080/dis/> from <POST https://qvpweb01.ciq.labs.att.com:8080/dis/login>
2013-07-01 10:50:20-0500 [dis] DEBUG: Crawled (200) <GET https://qvpweb01.ciq.labs.att.com:8080/dis/> (referer: https://qvpweb01.ciq.labs.att.com:8080/dis/login.jsp)
2013-07-01 10:50:20-0500 [dis] DEBUG:

        Successfully logged in. Let's start crawling!

2013-07-01 10:50:21-0500 [dis] DEBUG: Crawled (200) <GET https://qvpweb01.ciq.labs.att.com:8080/dis/> (referer: https://qvpweb01.ciq.labs.att.com:8080/dis/)
2013-07-01 10:50:21-0500 [dis] DEBUG:

         We got data!

2013-07-01 10:50:21-0500 [dis] INFO: Closing spider (finished)
2013-07-01 10:50:21-0500 [dis] INFO: Dumping Scrapy stats:
    {'downloader/request_bytes': 1382,
     'downloader/request_count': 4,
     'downloader/request_method_count/GET': 3,
     'downloader/request_method_count/POST': 1,
     'downloader/response_bytes': 147888,
     'downloader/response_count': 4,
     'downloader/response_status_count/200': 3,
     'downloader/response_status_count/302': 1,
     'finish_reason': 'finished',
     'finish_time': datetime.datetime(2013, 7, 1, 15, 50, 21, 221000),
     'log_count/DEBUG': 12,
     'log_count/INFO': 4,
     'request_depth_max': 2,
     'response_received_count': 3,
     'scheduler/dequeued': 4,
     'scheduler/dequeued/memory': 4,
     'scheduler/enqueued': 4,
     'scheduler/enqueued/memory': 4,
     'start_time': datetime.datetime(2013, 7, 1, 15, 50, 19, 42000)}
2013-07-01 10:50:21-0500 [dis] INFO: Spider closed (finished)

&#9;


Question by: yescobar2012
 
Accepted Solution

by: clockwatcher
Your XPath expressions look suspect to me:

  site.select('.//td[@class\'listCell\']/a/text()').extract()
  site.select('.//td[@class\'listCell\']/a/@href').extract()

Looks like they're missing an '=' and should be:

   site.select(".//td[@class='listCell']/a/text()").extract()
   site.select(".//td[@class='listCell']/a/@href").extract()
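
As an aside (not something the posted HTML needs, since each cell carries exactly one class name): if a class attribute ever holds several class names, an exact @class='listCell' comparison won't match, and a contains() test is the usual looser alternative. A sketch only:

   site.select(".//td[contains(@class, 'listCell')]/a/text()").extract()
   site.select(".//td[contains(@class, 'listCell')]/a/@href").extract()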


Here's a Scrapy shell session that works for me with your HTML file:
mark@wheeze:~/Desktop/ee/scrapy/testing> ../bin/scrapy shell http://localhost:8000/doc.html
2013-07-10 19:13:56-0700 [scrapy] INFO: Scrapy 0.16.5 started (bot: testing)
2013-07-10 19:13:56-0700 [scrapy] DEBUG: Enabled extensions: TelnetConsole, CloseSpider, WebService, CoreStats, SpiderState
2013-07-10 19:13:56-0700 [scrapy] DEBUG: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, RedirectMiddleware, CookiesMiddleware, HttpCompressionMiddleware, ChunkedTransferMiddleware, DownloaderStats
2013-07-10 19:13:56-0700 [scrapy] DEBUG: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2013-07-10 19:13:56-0700 [scrapy] DEBUG: Enabled item pipelines: 
2013-07-10 19:13:56-0700 [scrapy] DEBUG: Telnet console listening on 0.0.0.0:6023
2013-07-10 19:13:56-0700 [scrapy] DEBUG: Web service listening on 0.0.0.0:6080
2013-07-10 19:13:56-0700 [default] INFO: Spider opened
localhost - - [10/Jul/2013 19:13:56] "GET /doc.html HTTP/1.0" 200 -
2013-07-10 19:13:56-0700 [default] DEBUG: Crawled (200) <GET http://localhost:8000/doc.html> (referer: None)
[s] Available Scrapy objects:
[s]   hxs        <HtmlXPathSelector xpath=None data=u'<html><head><meta http-equiv="Content-Ty'>
[s]   item       {}
[s]   request    <GET http://localhost:8000/doc.html>
[s]   response   <200 http://localhost:8000/doc.html>
[s]   settings   <CrawlerSettings module=<module 'testing.settings' from '/home/markh/Desktop/ee/scrapy/testing/testing/settings.pyc'>>
[s]   spider     <BaseSpider 'default' at 0x225ddd0>
[s] Useful shortcuts:
[s]   shelp()           Shell help (print this help)
[s]   fetch(req_or_url) Fetch request (or URL) and update local objects
[s]   view(response)    View response in a browser

>>> sites = hxs.select('//form[@id=\'listForm\']/table/tbody/tr')
>>> for site in sites:
...     print site.select(".//td[@class='listCell']/a/text()").extract()
... 
[u'6505550000', u'probe0']

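Folding the corrected expressions back into the parse() method from the question, it would look roughly like this (same CarrierItem and HtmlXPathSelector as in the original spider; a sketch, not tested against the live site):

   def parse(self, response):
       self.log("\n\n\n We got data! \n\n\n")
       hxs = HtmlXPathSelector(response)
       sites = hxs.select("//form[@id='listForm']/table/tbody/tr")
       items = []
       for site in sites:
           item = CarrierItem()
           # Note the '=' between @class and the string literal
           item['title'] = site.select(".//td[@class='listCell']/a/text()").extract()
           item['link'] = site.select(".//td[@class='listCell']/a/@href").extract()
           items.append(item)
       return items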

Expert Comment

by: Gary
I've requested that this question be deleted for the following reason:

Not enough information to confirm an answer.
Expert Comment

by: clockwatcher
The XPath expression in his post would have resulted in exactly what he was experiencing: an empty file, since it wouldn't have matched anything.

My fixed XPath worked with the HTML he provided. I believe my post provided a solution to the problem.