Solved

cannot retrieve xpath using scrapy

Posted on 2013-07-01
7
790 Views
Last Modified: 2013-10-08
Hello I am trying to get the xpath for title and text for class listCell. I believe I am doing it right because i get no errors but when i display it in a csv file i do not get nothing in the output file. I also tested my scrapy in other websites such as amazon and it worked fine but not working for this website. Please help!!

	
   def parse(self, response):
		self.log("\n\n\n We got data! \n\n\n")
		hxs = HtmlXPathSelector(response)
		sites = hxs.select('//form[@id=\'listForm\']/table/tbody/tr')
		items = []
		for site in sites:
		    item = CarrierItem()
		    item['title'] = site.select('.//td[@class\'listCell\']/a/text()').extract()
		    item['link'] = site.select('.//td[@class\'listCell\']/a/@href').extract()
		    items.append(item)
		return items

Open in new window



here is my html. Could it be possible it is not working because it has javascript in the html?

	
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
<title> Carrier IQ DIS 2.4 :: All Devices</title>
<script type="text/javascript" src="/dis/js/main.js">
<script type="text/javascript" src="/dis/js/validate.js">
<link rel="stylesheet" type="text/css" href="/dis/css/portal.css">
<link rel="stylesheet" type="text/css" href="/dis/css/style.css">
<script type="text/javascript">

    ....

<form id="listForm" name="listForm" method="POST" action="">
&#9;<table>
&#9;<thead>
&#9;<tbody>
&#9;<tr>
&#9;<td class="crt">1</td>
&#9;<td class="listCell" align="center">
&#9;<a href="/dis/packages.jsp?view=list&show=perdevice&device_gid=3651746C4173775343535452414567746D75643855673D3D53564A6151624D41716D534C68395A6337634E2F62413D3D&hwdid=probe0&mdn=6505550000&subscrbid=6505550000&maxlength=100">6505550000</a>
&#9;</td>
&#9;<td class="listCell" align="center">
&#9;<a href="/dis/packages.jsp?view=list&show=perdevice&device_gid=3651746C4173775343535452414567746D75643855673D3D53564A6151624D41716D534C68395A6337634E2F62413D3D&hwdid=probe0&subscrbid=6505550000&mdn=6505550000&maxlength=100">probe0</a>
&#9;</td>
&#9;<td class="listCell" align="center">
&#9;<td class="listCell" align="center">
&#9;<td class="cell" align="center">2013-07-01 13:39:38.820</td>
&#9;<td class="cell" align="left">1 - SMS_PullRequest_CS</td>
&#9;<td class="listCell" align="right">
&#9;<td class="listCell" align="center">
&#9;<td class="listCell" align="center">
&#9;</tr>
&#9;</tbody>
&#9;</table>
&#9;</form>

Open in new window

output

     
  C:\Users\ye831c\Documents\Big Data\Scrapy\carrier>scrapy crawl dis -o iqDis.csv
&#9;-t csv
&#9;2013-07-01 10:50:18-0500 [scrapy] INFO: Scrapy 0.16.5 started (bot: carrier)
&#9;2013-07-01 10:50:18-0500 [scrapy] DEBUG: Enabled extensions: FeedExporter, LogSt
&#9;ats, TelnetConsole, CloseSpider, WebService, CoreStats, SpiderState
&#9;2013-07-01 10:50:19-0500 [scrapy] DEBUG: Enabled downloader middlewares: HttpAut
&#9;hMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, De
&#9;faultHeadersMiddleware, RedirectMiddleware, CookiesMiddleware, HttpCompressionMi
&#9;ddleware, ChunkedTransferMiddleware, DownloaderStats
&#9;2013-07-01 10:50:19-0500 [scrapy] DEBUG: Enabled spider middlewares: HttpErrorMi
&#9;ddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddle
&#9;ware
&#9;2013-07-01 10:50:19-0500 [scrapy] DEBUG: Enabled item pipelines:
&#9;2013-07-01 10:50:19-0500 [dis] INFO: Spider opened
&#9;2013-07-01 10:50:19-0500 [dis] INFO: Crawled 0 pages (at 0 pages/min), scraped 0
&#9; items (at 0 items/min)
&#9;2013-07-01 10:50:19-0500 [scrapy] DEBUG: Telnet console listening on 0.0.0.0:602
&#9;3
&#9;2013-07-01 10:50:19-0500 [scrapy] DEBUG: Web service listening on 0.0.0.0:6080
&#9;2013-07-01 10:50:19-0500 [dis] DEBUG: Crawled (200) <GET https://qvpweb01.ciq.la
&#9;bs.att.com:8080/dis/login.jsp> (referer: None)
&#9;2013-07-01 10:50:19-0500 [dis] DEBUG: Redirecting (302) to <GET https://qvpweb01
&#9;.ciq.labs.att.com:8080/dis/> from <POST https://qvpweb01.ciq.labs.att.com:8080/d
&#9;is/login>
&#9;2013-07-01 10:50:20-0500 [dis] DEBUG: Crawled (200) <GET https://qvpweb01.ciq.la
&#9;bs.att.com:8080/dis/> (referer: https://qvpweb01.ciq.labs.att.com:8080/dis/login
&#9;.jsp)
&#9;2013-07-01 10:50:20-0500 [dis] DEBUG:


&#9;&#9;Successfully logged in. Let's start crawling!



&#9;2013-07-01 10:50:21-0500 [dis] DEBUG: Crawled (200) <GET https://qvpweb01.ciq.la
&#9;bs.att.com:8080/dis/> (referer: https://qvpweb01.ciq.labs.att.com:8080/dis/)
&#9;2013-07-01 10:50:21-0500 [dis] DEBUG:


&#9;&#9; We got data!



&#9;2013-07-01 10:50:21-0500 [dis] INFO: Closing spider (finished)
&#9;2013-07-01 10:50:21-0500 [dis] INFO: Dumping Scrapy stats:
&#9;&#9;{'downloader/request_bytes': 1382,
&#9;&#9; 'downloader/request_count': 4,
&#9;&#9; 'downloader/request_method_count/GET': 3,
&#9;&#9; 'downloader/request_method_count/POST': 1,
&#9;&#9; 'downloader/response_bytes': 147888,
&#9;&#9; 'downloader/response_count': 4,
&#9;&#9; 'downloader/response_status_count/200': 3,
&#9;&#9; 'downloader/response_status_count/302': 1,
&#9;&#9; 'finish_reason': 'finished',
&#9;&#9; 'finish_time': datetime.datetime(2013, 7, 1, 15, 50, 21, 221000),
&#9;&#9; 'log_count/DEBUG': 12,
&#9;&#9; 'log_count/INFO': 4,
&#9;&#9; 'request_depth_max': 2,
&#9;&#9; 'response_received_count': 3,
&#9;&#9; 'scheduler/dequeued': 4,
&#9;&#9; 'scheduler/dequeued/memory': 4,
&#9;&#9; 'scheduler/enqueued': 4,
&#9;&#9; 'scheduler/enqueued/memory': 4,
&#9;&#9; 'start_time': datetime.datetime(2013, 7, 1, 15, 50, 19, 42000)}
&#9;2013-07-01 10:50:21-0500 [dis] INFO: Spider closed (finished)

&#9;

Open in new window

0
Comment
Question by:yescobar2012
[X]
Welcome to Experts Exchange

Add your voice to the tech community where 5M+ people just like you are talking about what matters.

  • Help others & share knowledge
  • Earn cash & points
  • Learn & ask questions
  • 2
7 Comments
 
LVL 25

Accepted Solution

by:
clockwatcher earned 500 total points
ID: 39316379
Your Xpath expressions look suspect to me:

  site.select('.//td[@class\'listCell\']/a/text()').extract()
  site.select('.//td[@class\'listCell\']/a/@href').extract()

Looks like they're missing an '=' and should be:

   site.select(".//td[@class='listCell']/a/text()").extract()
   site.select(".//td[@class='listCell']/a/@href").extract()


Here's a scrapy shell session that is working for me with your html file:
mark@wheeze:~/Desktop/ee/scrapy/testing> ../bin/scrapy shell http://localhost:8000/doc.html
2013-07-10 19:13:56-0700 [scrapy] INFO: Scrapy 0.16.5 started (bot: testing)
2013-07-10 19:13:56-0700 [scrapy] DEBUG: Enabled extensions: TelnetConsole, CloseSpider, WebService, CoreStats, SpiderState
2013-07-10 19:13:56-0700 [scrapy] DEBUG: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, RedirectMiddleware, CookiesMiddleware, HttpCompressionMiddleware, ChunkedTransferMiddleware, DownloaderStats
2013-07-10 19:13:56-0700 [scrapy] DEBUG: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2013-07-10 19:13:56-0700 [scrapy] DEBUG: Enabled item pipelines: 
2013-07-10 19:13:56-0700 [scrapy] DEBUG: Telnet console listening on 0.0.0.0:6023
2013-07-10 19:13:56-0700 [scrapy] DEBUG: Web service listening on 0.0.0.0:6080
2013-07-10 19:13:56-0700 [default] INFO: Spider opened
localhost - - [10/Jul/2013 19:13:56] "GET /doc.html HTTP/1.0" 200 -
2013-07-10 19:13:56-0700 [default] DEBUG: Crawled (200) <GET http://localhost:8000/doc.html> (referer: None)
[s] Available Scrapy objects:
[s]   hxs        <HtmlXPathSelector xpath=None data=u'<html><head><meta http-equiv="Content-Ty'>
[s]   item       {}
[s]   request    <GET http://localhost:8000/doc.html>
[s]   response   <200 http://localhost:8000/doc.html>
[s]   settings   <CrawlerSettings module=<module 'testing.settings' from '/home/markh/Desktop/ee/scrapy/testing/testing/settings.pyc'>>
[s]   spider     <BaseSpider 'default' at 0x225ddd0>
[s] Useful shortcuts:
[s]   shelp()           Shell help (print this help)
[s]   fetch(req_or_url) Fetch request (or URL) and update local objects
[s]   view(response)    View response in a browser

>>> sites = hxs.select('//form[@id=\'listForm\']/table/tbody/tr')
>>> for site in sites:
...     print site.select(".//td[@class='listCell']/a/text()").extract()
... 
[u'6505550000', u'probe0']

Open in new window

0
 
LVL 58

Expert Comment

by:Gary
ID: 39549867
I've requested that this question be deleted for the following reason:

Not enough information to confirm an answer.
0
 
LVL 25

Expert Comment

by:clockwatcher
ID: 39549868
The xpath expression in his post would have resulted in what he was experiencing -- an empty file since it wouldn't have matched anything.  

My fixed xpath worked with the html he provided.  I believe my post provided a solution to the problem.
0

Featured Post

Revamp Your Training Process

Drastically shorten your training time with WalkMe's advanced online training solution that Guides your trainees to action.

Question has a verified solution.

If you are experiencing a similar issue, please ask a related question

The article shows the basic steps of integrating an HTML theme template into an ASP.NET MVC project
FAQ pages provide a simple way for you to supply and for customers to find answers to the most common questions about your company. Here are six reasons why your company website should have a FAQ page
The viewer will learn the benefit of using external CSS files and the relationship between class and ID selectors. Create your external css file by saving it as style.css then set up your style tags: (CODE) Reference the nav tag and set your prop…
Learn how to create flexible layouts using relative units in CSS.  New relative units added in CSS3 include vw(viewports width), vh(viewports height), vmin(minimum of viewports height and width), and vmax (maximum of viewports height and width).

689 members asked questions and received personalized solutions in the past 7 days.

Join the community of 500,000 technology professionals and ask your questions.

Join & Ask a Question