scrapy: spider not generating item_ signals

Python 2.7.6.2 on Windows 7 using binary WinPython-32bit-2.7.6.2, Scrapy 0.22.0, Eclipse 4.2.1 and a Twisted-13.2.0.win32-py2.7 reactor

I'm learning scrapy. I have it doing everything EXCEPT properly calling the pipelines.process_item(). It IS CALLING pipelines.open_spider() and pipelines.close_spider() OK.

I THINK this is because the spider is not generating any "item" signals (not item_passed, item_dropped or item_scraped).

I added some code to try capture these signals, and I'm getting nothing when I try to capture any of the 3 above item signals.

The code DOES capture other signals (like engine_started, or spider_closed, etc).

It ALSO errors if I try to set an item['doesnotexist'] variable, so it appears to be using the items file and my user defined items class "AuctionDOTcomItems".

Really at a loss. I would greatly appreciate any help either...

A) Getting the pipelines.process_item() to work normally OR...

B) Being able to manually catch the signal that an item has been set so I can pass control to my own version of pipelines.process_item().

Thanks!!

----------
reactor:
from twisted.internet import reactor
from scrapy.crawler import Crawler
from scrapy import log, signals
from scrapy.utils.project import get_project_settings

class SpiderRun:
    def __init__(self, spider):
        settings = get_project_settings()
        mySettings = {'ITEM_PIPELINES': {'estatescraper.pipelines.EstatescraperXLSwriter':300}} 
        settings.overrides.update(mySettings)

        crawler = Crawler(settings)
        crawler.signals.connect(reactor.stop, signal=signals.spider_closed)
        crawler.configure()
        crawler.crawl(spider)
        crawler.start()
#         log.start()
        reactor.run() # the script will block here until the spider_closed signal was sent
        self.cleanup()

    def cleanup(self):
        print "SpiderRun done" #333
        pass

if __name__ == "__main__":
    from estatescraper import AuctionDOTcom
    spider = AuctionDOTcom()
    r = SpiderRun(spider)

Open in new window


----------------
spider:
from scrapy.xlib.pydispatch import dispatcher
from scrapy.http import Request
from scrapy.selector import Selector
from scrapy import signals
from scrapy.spider import Spider

from auctiondotcomurls import AuctionDOTcomURLs
from auctiondotcomitems import AuctionDOTcomItems
from auctiondotcomgetitems import AuctionDOTcomGetItems

import urlparse
import time 

import sys

class AuctionDOTcom(Spider):
    def __init__(self,
                 limit = 50, 
                 miles = 250,
                 zip = None, 
                 asset_types = "",
                 auction_types = "", 
                 property_types = ""):
        self.name = "auction.com"
        self.allowed_domains = ["auction.com"]
        self.start_urls = AuctionDOTcomURLs(limit, miles, zip, asset_types, 
                                            auction_types, property_types)

        dispatcher.connect(self.testsignal, signals.item_scraped) 

#     def _item_passed(self, item):
#         print "item = ", item #333  

    def testsignal(self):
        print "in csvwrite" #333

    def parse(self, response):
        sel = Selector(response)
        listings =  sel.xpath('//div[@class="contentDetail searchResult"]')
        for listing in listings:
            item = AuctionDOTcomItems()

            item['propertyID'] = ''.join(set(listing.xpath('./@property-id').extract()))
            print "item['propertyID'] = ", item['propertyID'] #333
#             item = AuctionDOTcomGetItems(listing)

#         ################
#         # DEMONSTRATTION ONLY
#             print "######################################"            
#             for i in item:
#                 print i + ": " + str(item[i])

        next = set(sel.xpath('//a[contains(text(),"Next")]//@href').extract())

        for i in next:
            yield Request("http://%s/%s" % (urlparse.urlparse(response.url).hostname, i), callback=self.parse)


if __name__ == "__main__":
    from estatescraper import SpiderRun
    from estatescraper import AuctionDOTcom
    spider = AuctionDOTcom()
    r = SpiderRun(spider)

Open in new window


--------------------
pipelines:

import csv
from csv import DictWriter

# class TutorialPipeline(object):
#     def process_item(self, item, spider):
#         return item

class EstatescraperXLSwriter(object):
    def __init__(self):
        print "Ive started the __init__ in the pipeline" #333

        self.brandCategoryCsv = csv.writer(open('test.csv', 'wb'),
        delimiter=',', 
        quoting=csv.QUOTE_MINIMAL)
        self.brandCategoryCsv.writerow(['Property ID', 'Asset Type'])

    def open_spider(self, spider):
        print "Hit open_spider in EstatescraperXLSwriter" #333

    def process_item(self, item, spider):
        print "attempting to run process_item" #333
        self.brandCategoryCsv.writerow([item['propertyID'],
                                        item['assetType']])
        return item

    def close_spider(self, spider):
        print "Hit close_spider in EstatescraperXLSwriter" #333
        pass


if __name__ == "__main__":

    o = EstatescraperXLSwriter()

Open in new window


--------------------
items:

from scrapy.item import Item, Field

class AuctionDOTcomItems(Item):
    """"""
    propertyID      = Field()  # <uniqueID>ABCD1234</uniqueID>

Open in new window


------------------
output:
Ive started the __init__ in the pipeline
Hit open_spider in EstatescraperXLSwriter
item['propertyID'] =  1590613
item['propertyID'] =  1466738
(...)
item['propertyID'] =  1639764
Hit close_spider in EstatescraperXLSwriter
SpiderRun done

Open in new window


---------------
logged output:
Ive started the __init__ in the pipeline
Hit open_spider in EstatescraperXLSwriter
2014-02-27 17:44:12+0100 [auction.com] INFO: Closing spider (finished)
2014-02-27 17:44:12+0100 [auction.com] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 240,
 'downloader/request_count': 1,
 'downloader/request_method_count/GET': 1,
 'downloader/response_bytes': 40640,
 'downloader/response_count': 1,
 'downloader/response_status_count/200': 1,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2014, 2, 27, 16, 44, 12, 238000),
 'response_received_count': 1,
 'scheduler/dequeued': 1,
 'scheduler/dequeued/memory': 1,
 'scheduler/enqueued': 1,
 'scheduler/enqueued/memory': 1,
 'start_time': datetime.datetime(2014, 2, 27, 16, 44, 9, 203000)}
2014-02-27 17:44:12+0100 [auction.com] INFO: Spider closed (finished)
signals scrapy spider

Open in new window

LVL 3
Mike R.Asked:
Who is Participating?

[Product update] Infrastructure Analysis Tool is now available with Business Accounts.Learn More

x
I wear a lot of hats...

"The solutions and answers provided on Experts Exchange have been extremely helpful to me over the last few years. I wear a lot of hats - Developer, Database Administrator, Help Desk, etc., so I know a lot of things but not a lot about one thing. Experts Exchange gives me answers from people who do know a lot about one thing, in a easy to use platform." -Todd S.

Mike R.Author Commented:
STOOPIDLY SIMPLE!

I needed a yield statement.

        for listing in listings:
            item = AuctionDOTcomItems()

            item['propertyID'] = ''.join(set(listing.xpath('./@property-id').extract()))
            print "item['propertyID'] = ", item['propertyID'] #333
            yield item
0

Experts Exchange Solution brought to you by

Your issues matter to us.

Facing a tech roadblock? Get the help and guidance you need from experienced professionals who care. Ask your question anytime, anywhere, with no hassle.

Start your 7-day free trial
It's more than this solution.Get answers and train to solve all your tech problems - anytime, anywhere.Try it for free Edge Out The Competitionfor your dream job with proven skills and certifications.Get started today Stand Outas the employee with proven skills.Start learning today for free Move Your Career Forwardwith certification training in the latest technologies.Start your trial today
Python

From novice to tech pro — start learning today.