Solved

cannot retrieve xpath using scrapy

Posted on 2013-07-01
734 Views
Last Modified: 2013-10-08
Hello, I am trying to get the XPath for the title and link text of the cells with class listCell. I believe I am doing it right because I get no errors, but when I export to a CSV file I get nothing in the output file. I also tested my spider on other websites such as Amazon and it worked fine, but it is not working for this website. Please help!

	
def parse(self, response):
    self.log("\n\n\n We got data! \n\n\n")
    hxs = HtmlXPathSelector(response)
    sites = hxs.select('//form[@id=\'listForm\']/table/tbody/tr')
    items = []
    for site in sites:
        item = CarrierItem()
        item['title'] = site.select('.//td[@class\'listCell\']/a/text()').extract()
        item['link'] = site.select('.//td[@class\'listCell\']/a/@href').extract()
        items.append(item)
    return items




Here is my HTML. Could it be that it is not working because there is JavaScript in the HTML?

	
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
<title> Carrier IQ DIS 2.4 :: All Devices</title>
<script type="text/javascript" src="/dis/js/main.js">
<script type="text/javascript" src="/dis/js/validate.js">
<link rel="stylesheet" type="text/css" href="/dis/css/portal.css">
<link rel="stylesheet" type="text/css" href="/dis/css/style.css">
<script type="text/javascript">

    ....

<form id="listForm" name="listForm" method="POST" action="">
  <table>
  <thead>
  <tbody>
  <tr>
  <td class="crt">1</td>
  <td class="listCell" align="center">
  <a href="/dis/packages.jsp?view=list&show=perdevice&device_gid=3651746C4173775343535452414567746D75643855673D3D53564A6151624D41716D534C68395A6337634E2F62413D3D&hwdid=probe0&mdn=6505550000&subscrbid=6505550000&maxlength=100">6505550000</a>
  </td>
  <td class="listCell" align="center">
  <a href="/dis/packages.jsp?view=list&show=perdevice&device_gid=3651746C4173775343535452414567746D75643855673D3D53564A6151624D41716D534C68395A6337634E2F62413D3D&hwdid=probe0&subscrbid=6505550000&mdn=6505550000&maxlength=100">probe0</a>
  </td>
  <td class="listCell" align="center">
  <td class="listCell" align="center">
  <td class="cell" align="center">2013-07-01 13:39:38.820</td>
  <td class="cell" align="left">1 - SMS_PullRequest_CS</td>
  <td class="listCell" align="right">
  <td class="listCell" align="center">
  <td class="listCell" align="center">
  </tr>
  </tbody>
  </table>
  </form>


Output:

C:\Users\ye831c\Documents\Big Data\Scrapy\carrier>scrapy crawl dis -o iqDis.csv -t csv
2013-07-01 10:50:18-0500 [scrapy] INFO: Scrapy 0.16.5 started (bot: carrier)
2013-07-01 10:50:18-0500 [scrapy] DEBUG: Enabled extensions: FeedExporter, LogStats, TelnetConsole, CloseSpider, WebService, CoreStats, SpiderState
2013-07-01 10:50:19-0500 [scrapy] DEBUG: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, RedirectMiddleware, CookiesMiddleware, HttpCompressionMiddleware, ChunkedTransferMiddleware, DownloaderStats
2013-07-01 10:50:19-0500 [scrapy] DEBUG: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2013-07-01 10:50:19-0500 [scrapy] DEBUG: Enabled item pipelines:
2013-07-01 10:50:19-0500 [dis] INFO: Spider opened
2013-07-01 10:50:19-0500 [dis] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2013-07-01 10:50:19-0500 [scrapy] DEBUG: Telnet console listening on 0.0.0.0:6023
2013-07-01 10:50:19-0500 [scrapy] DEBUG: Web service listening on 0.0.0.0:6080
2013-07-01 10:50:19-0500 [dis] DEBUG: Crawled (200) <GET https://qvpweb01.ciq.labs.att.com:8080/dis/login.jsp> (referer: None)
2013-07-01 10:50:19-0500 [dis] DEBUG: Redirecting (302) to <GET https://qvpweb01.ciq.labs.att.com:8080/dis/> from <POST https://qvpweb01.ciq.labs.att.com:8080/dis/login>
2013-07-01 10:50:20-0500 [dis] DEBUG: Crawled (200) <GET https://qvpweb01.ciq.labs.att.com:8080/dis/> (referer: https://qvpweb01.ciq.labs.att.com:8080/dis/login.jsp)
2013-07-01 10:50:20-0500 [dis] DEBUG: Successfully logged in. Let's start crawling!
2013-07-01 10:50:21-0500 [dis] DEBUG: Crawled (200) <GET https://qvpweb01.ciq.labs.att.com:8080/dis/> (referer: https://qvpweb01.ciq.labs.att.com:8080/dis/)
2013-07-01 10:50:21-0500 [dis] DEBUG: We got data!
2013-07-01 10:50:21-0500 [dis] INFO: Closing spider (finished)
2013-07-01 10:50:21-0500 [dis] INFO: Dumping Scrapy stats:
    {'downloader/request_bytes': 1382,
     'downloader/request_count': 4,
     'downloader/request_method_count/GET': 3,
     'downloader/request_method_count/POST': 1,
     'downloader/response_bytes': 147888,
     'downloader/response_count': 4,
     'downloader/response_status_count/200': 3,
     'downloader/response_status_count/302': 1,
     'finish_reason': 'finished',
     'finish_time': datetime.datetime(2013, 7, 1, 15, 50, 21, 221000),
     'log_count/DEBUG': 12,
     'log_count/INFO': 4,
     'request_depth_max': 2,
     'response_received_count': 3,
     'scheduler/dequeued': 4,
     'scheduler/dequeued/memory': 4,
     'scheduler/enqueued': 4,
     'scheduler/enqueued/memory': 4,
     'start_time': datetime.datetime(2013, 7, 1, 15, 50, 19, 42000)}
2013-07-01 10:50:21-0500 [dis] INFO: Spider closed (finished)

Question by:yescobar2012
7 Comments
 
LVL 25

Accepted Solution

by:
clockwatcher earned 500 total points
ID: 39316379
Your XPath expressions look suspect to me:

  site.select('.//td[@class\'listCell\']/a/text()').extract()
  site.select('.//td[@class\'listCell\']/a/@href').extract()

Looks like they're missing an '=' and should be:

   site.select(".//td[@class='listCell']/a/text()").extract()
   site.select(".//td[@class='listCell']/a/@href").extract()
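To see the difference concretely, here is a quick sanity check outside Scrapy using plain lxml (the same kind of XPath evaluation the selectors wrap), run against a trimmed stand-in for the page's markup. The hrefs below are shortened placeholders, not the real URLs:

```python
from lxml import html

# Trimmed stand-in for the table from the question (placeholder hrefs).
doc = html.fromstring("""
<form id="listForm"><table><tbody><tr>
  <td class="listCell" align="center"><a href="/dis/packages.jsp?hwdid=probe0">6505550000</a></td>
  <td class="listCell" align="center"><a href="/dis/packages.jsp?mdn=6505550000">probe0</a></td>
</tr></tbody></table></form>
""")

row = doc.xpath("//form[@id='listForm']/table/tbody/tr")[0]

# The predicate from the question: the missing '=' makes this invalid
# XPath, so lxml refuses to evaluate it at all.
try:
    row.xpath(".//td[@class'listCell']/a/text()")
except Exception as e:
    print("invalid XPath:", type(e).__name__)

# The corrected predicate matches both cells.
print(row.xpath(".//td[@class='listCell']/a/text()"))
# ['6505550000', 'probe0']
```

Scrapy 0.16 may swallow that evaluation error rather than raise it, which would explain getting an empty file with no visible error.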


Here's a scrapy shell session that works for me with your HTML file:
mark@wheeze:~/Desktop/ee/scrapy/testing> ../bin/scrapy shell http://localhost:8000/doc.html
2013-07-10 19:13:56-0700 [scrapy] INFO: Scrapy 0.16.5 started (bot: testing)
2013-07-10 19:13:56-0700 [scrapy] DEBUG: Enabled extensions: TelnetConsole, CloseSpider, WebService, CoreStats, SpiderState
2013-07-10 19:13:56-0700 [scrapy] DEBUG: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, RedirectMiddleware, CookiesMiddleware, HttpCompressionMiddleware, ChunkedTransferMiddleware, DownloaderStats
2013-07-10 19:13:56-0700 [scrapy] DEBUG: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2013-07-10 19:13:56-0700 [scrapy] DEBUG: Enabled item pipelines: 
2013-07-10 19:13:56-0700 [scrapy] DEBUG: Telnet console listening on 0.0.0.0:6023
2013-07-10 19:13:56-0700 [scrapy] DEBUG: Web service listening on 0.0.0.0:6080
2013-07-10 19:13:56-0700 [default] INFO: Spider opened
localhost - - [10/Jul/2013 19:13:56] "GET /doc.html HTTP/1.0" 200 -
2013-07-10 19:13:56-0700 [default] DEBUG: Crawled (200) <GET http://localhost:8000/doc.html> (referer: None)
[s] Available Scrapy objects:
[s]   hxs        <HtmlXPathSelector xpath=None data=u'<html><head><meta http-equiv="Content-Ty'>
[s]   item       {}
[s]   request    <GET http://localhost:8000/doc.html>
[s]   response   <200 http://localhost:8000/doc.html>
[s]   settings   <CrawlerSettings module=<module 'testing.settings' from '/home/markh/Desktop/ee/scrapy/testing/testing/settings.pyc'>>
[s]   spider     <BaseSpider 'default' at 0x225ddd0>
[s] Useful shortcuts:
[s]   shelp()           Shell help (print this help)
[s]   fetch(req_or_url) Fetch request (or URL) and update local objects
[s]   view(response)    View response in a browser

>>> sites = hxs.select('//form[@id=\'listForm\']/table/tbody/tr')
>>> for site in sites:
...     print site.select(".//td[@class='listCell']/a/text()").extract()
... 
[u'6505550000', u'probe0']


 
LVL 58

Expert Comment

by:Gary
ID: 39549867
I've requested that this question be deleted for the following reason:

Not enough information to confirm an answer.
 
LVL 25

Expert Comment

by:clockwatcher
ID: 39549868
The xpath expression in his post would have resulted in what he was experiencing -- an empty file since it wouldn't have matched anything.  

My fixed xpath worked with the html he provided.  I believe my post provided a solution to the problem.
