Solved

scrapy: spider not generating item_* signals

Posted on 2014-02-27
1,054 Views
Last Modified: 2014-02-27
Python 2.7.6.2 on Windows 7 (WinPython-32bit-2.7.6.2 binary distribution), Scrapy 0.22.0, Eclipse 4.2.1, and Twisted 13.2.0 (win32-py2.7) as the reactor

I'm learning Scrapy. I have it doing everything EXCEPT properly calling pipelines.process_item(). It IS calling pipelines.open_spider() and pipelines.close_spider() OK.

I THINK this is because the spider is not generating any item signals (neither item_passed, item_dropped, nor item_scraped).

I added some code to try to capture these signals, but none of the three item signals above ever fires.

The code DOES capture other signals (engine_started, spider_closed, etc.).

Setting item['doesnotexist'] ALSO raises an error, so the spider does appear to be using the items file and my user-defined item class AuctionDOTcomItems.

Really at a loss. I would greatly appreciate any help either...

A) Getting pipelines.process_item() to work normally, OR...

B) Being able to manually catch the signal that an item has been set, so I can pass control to my own version of pipelines.process_item() (a sketch of what I mean follows below).

Thanks!!
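
To illustrate option B, here is a minimal sketch of the kind of hook I'm after (the spider and handler names are illustrative; as I understand Scrapy 0.22, the item_scraped signal carries item, response and spider, and pydispatch only passes the arguments a handler actually accepts):

from scrapy.xlib.pydispatch import dispatcher
from scrapy.spider import Spider
from scrapy import signals

class MySpider(Spider):
    name = "example"

    def __init__(self):
        # Connect my own handler to the item_scraped signal.
        dispatcher.connect(self.handle_item, signals.item_scraped)

    def handle_item(self, item, response, spider):
        # Fires once for every item the engine receives from a callback;
        # this is where I could hand off to my own process_item() replacement.
        print "item scraped:", dict(item)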

----------
reactor:
from twisted.internet import reactor
from scrapy.crawler import Crawler
from scrapy import log, signals
from scrapy.utils.project import get_project_settings

class SpiderRun:
    def __init__(self, spider):
        settings = get_project_settings()
        mySettings = {'ITEM_PIPELINES': {'estatescraper.pipelines.EstatescraperXLSwriter':300}} 
        settings.overrides.update(mySettings)

        crawler = Crawler(settings)
        crawler.signals.connect(reactor.stop, signal=signals.spider_closed)
        crawler.configure()
        crawler.crawl(spider)
        crawler.start()
#         log.start()
        reactor.run() # the script will block here until the spider_closed signal is sent
        self.cleanup()

    def cleanup(self):
        print "SpiderRun done" #333
        pass

if __name__ == "__main__":
    from estatescraper import AuctionDOTcom
    spider = AuctionDOTcom()
    r = SpiderRun(spider)



----------------
spider:
from scrapy.xlib.pydispatch import dispatcher
from scrapy.http import Request
from scrapy.selector import Selector
from scrapy import signals
from scrapy.spider import Spider

from auctiondotcomurls import AuctionDOTcomURLs
from auctiondotcomitems import AuctionDOTcomItems
from auctiondotcomgetitems import AuctionDOTcomGetItems

import urlparse
import time 

import sys

class AuctionDOTcom(Spider):
    def __init__(self,
                 limit = 50, 
                 miles = 250,
                 zip = None, 
                 asset_types = "",
                 auction_types = "", 
                 property_types = ""):
        self.name = "auction.com"
        self.allowed_domains = ["auction.com"]
        self.start_urls = AuctionDOTcomURLs(limit, miles, zip, asset_types, 
                                            auction_types, property_types)

        dispatcher.connect(self.testsignal, signals.item_scraped) 

#     def _item_passed(self, item):
#         print "item = ", item #333  

    def testsignal(self):
        print "in csvwrite" #333

    def parse(self, response):
        sel = Selector(response)
        listings =  sel.xpath('//div[@class="contentDetail searchResult"]')
        for listing in listings:
            item = AuctionDOTcomItems()

            item['propertyID'] = ''.join(set(listing.xpath('./@property-id').extract()))
            print "item['propertyID'] = ", item['propertyID'] #333
#             item = AuctionDOTcomGetItems(listing)

#         ################
#         # DEMONSTRATION ONLY
#             print "######################################"            
#             for i in item:
#                 print i + ": " + str(item[i])

        next = set(sel.xpath('//a[contains(text(),"Next")]//@href').extract())

        for i in next:
            yield Request("http://%s/%s" % (urlparse.urlparse(response.url).hostname, i), callback=self.parse)


if __name__ == "__main__":
    from estatescraper import SpiderRun
    from estatescraper import AuctionDOTcom
    spider = AuctionDOTcom()
    r = SpiderRun(spider)



--------------------
pipelines:

import csv
from csv import DictWriter

# class TutorialPipeline(object):
#     def process_item(self, item, spider):
#         return item

class EstatescraperXLSwriter(object):
    def __init__(self):
        print "Ive started the __init__ in the pipeline" #333

        self.brandCategoryCsv = csv.writer(open('test.csv', 'wb'),
                                           delimiter=',',
                                           quoting=csv.QUOTE_MINIMAL)
        self.brandCategoryCsv.writerow(['Property ID', 'Asset Type'])

    def open_spider(self, spider):
        print "Hit open_spider in EstatescraperXLSwriter" #333

    def process_item(self, item, spider):
        print "attempting to run process_item" #333
        self.brandCategoryCsv.writerow([item['propertyID'],
                                        item['assetType']])
        return item

    def close_spider(self, spider):
        print "Hit close_spider in EstatescraperXLSwriter" #333
        pass


if __name__ == "__main__":

    o = EstatescraperXLSwriter()



--------------------
items:

from scrapy.item import Item, Field

class AuctionDOTcomItems(Item):
    """"""
    propertyID      = Field()  # <uniqueID>ABCD1234</uniqueID>
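
Note: the EstatescraperXLSwriter pipeline above also writes item['assetType'], which is not declared here, so once process_item() does run, that lookup will raise a KeyError. A sketch of the declaration the pipeline appears to expect (the extra field name is taken from the pipeline code):

from scrapy.item import Item, Field

class AuctionDOTcomItems(Item):
    """Item scraped from an auction.com listing."""
    propertyID = Field()  # <uniqueID>ABCD1234</uniqueID>
    assetType  = Field()  # written out by EstatescraperXLSwriter.process_item()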



------------------
output:
Ive started the __init__ in the pipeline
Hit open_spider in EstatescraperXLSwriter
item['propertyID'] =  1590613
item['propertyID'] =  1466738
(...)
item['propertyID'] =  1639764
Hit close_spider in EstatescraperXLSwriter
SpiderRun done



---------------
logged output:
Ive started the __init__ in the pipeline
Hit open_spider in EstatescraperXLSwriter
2014-02-27 17:44:12+0100 [auction.com] INFO: Closing spider (finished)
2014-02-27 17:44:12+0100 [auction.com] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 240,
 'downloader/request_count': 1,
 'downloader/request_method_count/GET': 1,
 'downloader/response_bytes': 40640,
 'downloader/response_count': 1,
 'downloader/response_status_count/200': 1,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2014, 2, 27, 16, 44, 12, 238000),
 'response_received_count': 1,
 'scheduler/dequeued': 1,
 'scheduler/dequeued/memory': 1,
 'scheduler/enqueued': 1,
 'scheduler/enqueued/memory': 1,
 'start_time': datetime.datetime(2014, 2, 27, 16, 44, 9, 203000)}
2014-02-27 17:44:12+0100 [auction.com] INFO: Spider closed (finished)

Tags: signals, scrapy, spider


Question by: Mike R.
1 Comment
 
Accepted Solution
by: Mike R. (earned 0 total points)
ID: 39893045
STOOPIDLY SIMPLE!

I needed a yield statement.

        for listing in listings:
            item = AuctionDOTcomItems()

            item['propertyID'] = ''.join(set(listing.xpath('./@property-id').extract()))
            print "item['propertyID'] = ", item['propertyID'] #333
            yield item
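
In other words, parse() is a generator callback: Scrapy only routes an item through ITEM_PIPELINES (and fires item_scraped) when the callback yields or returns it; an item that is merely constructed goes nowhere. For completeness, a sketch of the full corrected parse(), with the Next-page handling unchanged:

    def parse(self, response):
        sel = Selector(response)
        listings = sel.xpath('//div[@class="contentDetail searchResult"]')
        for listing in listings:
            item = AuctionDOTcomItems()
            item['propertyID'] = ''.join(set(listing.xpath('./@property-id').extract()))
            yield item  # hands the item to the engine: pipelines run, item_scraped fires

        next = set(sel.xpath('//a[contains(text(),"Next")]//@href').extract())
        for i in next:
            yield Request("http://%s/%s" % (urlparse.urlparse(response.url).hostname, i),
                          callback=self.parse)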
