Solved

scrapy: spider not generating item_ signals

Posted on 2014-02-27
1,030 Views
Last Modified: 2014-02-27
Python 2.7.6.2 on Windows 7 using binary WinPython-32bit-2.7.6.2, Scrapy 0.22.0, Eclipse 4.2.1 and a Twisted-13.2.0.win32-py2.7 reactor

I'm learning Scrapy. I have it doing everything EXCEPT properly calling pipelines.process_item(). It IS calling pipelines.open_spider() and pipelines.close_spider() OK.

I THINK this is because the spider is not generating any "item" signals (not item_passed, item_dropped, or item_scraped).

I added some code to try to capture these signals, and I get nothing for any of the three item signals above.

The code DOES capture other signals (like engine_started or spider_closed).

It ALSO errors if I try to set an item['doesnotexist'] field, so it appears to be using the items file and my user-defined item class "AuctionDOTcomItems".

I'm really at a loss here, and would greatly appreciate any help with either...

A) Getting the pipelines.process_item() to work normally OR...

B) Being able to manually catch the signal that an item has been set so I can pass control to my own version of pipelines.process_item().

Thanks!!
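For context on option B, here is a toy, Scrapy-free sketch of the dispatch pattern the spider attempts with `dispatcher.connect(self.testsignal, signals.item_scraped)`. The `Dispatcher` class and the `"item_scraped"` string below are illustrative stand-ins, NOT Scrapy's actual API: handlers register for a named signal and get called with keyword arguments when it fires — and it only fires for items the engine actually receives.

```python
# Toy stand-in for the signal pattern the spider uses.
# Dispatcher and "item_scraped" are illustrations, NOT Scrapy's API.

class Dispatcher(object):
    def __init__(self):
        self.receivers = {}  # signal name -> list of handler callables

    def connect(self, handler, signal):
        self.receivers.setdefault(signal, []).append(handler)

    def send(self, signal, **kwargs):
        # Fire the signal: call every registered handler with keyword args.
        for handler in self.receivers.get(signal, []):
            handler(**kwargs)

dispatcher = Dispatcher()
caught = []

def on_item_scraped(item, spider):
    # Scrapy's real item_scraped handlers also receive the item as a kwarg.
    caught.append(item)

dispatcher.connect(on_item_scraped, "item_scraped")

# The signal only fires for items the engine actually receives from parse():
dispatcher.send("item_scraped", item={"propertyID": "1590613"}, spider=None)
print(caught)
```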

----------
reactor:
from twisted.internet import reactor
from scrapy.crawler import Crawler
from scrapy import log, signals
from scrapy.utils.project import get_project_settings

class SpiderRun:
    def __init__(self, spider):
        settings = get_project_settings()
        mySettings = {'ITEM_PIPELINES': {'estatescraper.pipelines.EstatescraperXLSwriter':300}} 
        settings.overrides.update(mySettings)

        crawler = Crawler(settings)
        crawler.signals.connect(reactor.stop, signal=signals.spider_closed)
        crawler.configure()
        crawler.crawl(spider)
        crawler.start()
#         log.start()
        reactor.run() # the script will block here until the spider_closed signal is sent
        self.cleanup()

    def cleanup(self):
        print "SpiderRun done" #333
        pass

if __name__ == "__main__":
    from estatescraper import AuctionDOTcom
    spider = AuctionDOTcom()
    r = SpiderRun(spider)



----------------
spider:
from scrapy.xlib.pydispatch import dispatcher
from scrapy.http import Request
from scrapy.selector import Selector
from scrapy import signals
from scrapy.spider import Spider

from auctiondotcomurls import AuctionDOTcomURLs
from auctiondotcomitems import AuctionDOTcomItems
from auctiondotcomgetitems import AuctionDOTcomGetItems

import urlparse
import time 

import sys

class AuctionDOTcom(Spider):
    def __init__(self,
                 limit = 50, 
                 miles = 250,
                 zip = None, 
                 asset_types = "",
                 auction_types = "", 
                 property_types = ""):
        self.name = "auction.com"
        self.allowed_domains = ["auction.com"]
        self.start_urls = AuctionDOTcomURLs(limit, miles, zip, asset_types, 
                                            auction_types, property_types)

        dispatcher.connect(self.testsignal, signals.item_scraped) 

#     def _item_passed(self, item):
#         print "item = ", item #333  

    def testsignal(self):
        print "in csvwrite" #333

    def parse(self, response):
        sel = Selector(response)
        listings =  sel.xpath('//div[@class="contentDetail searchResult"]')
        for listing in listings:
            item = AuctionDOTcomItems()

            item['propertyID'] = ''.join(set(listing.xpath('./@property-id').extract()))
            print "item['propertyID'] = ", item['propertyID'] #333
#             item = AuctionDOTcomGetItems(listing)

#         ################
#         # DEMONSTRATION ONLY
#             print "######################################"            
#             for i in item:
#                 print i + ": " + str(item[i])

        next = set(sel.xpath('//a[contains(text(),"Next")]//@href').extract())

        for i in next:
            yield Request("http://%s/%s" % (urlparse.urlparse(response.url).hostname, i), callback=self.parse)


if __name__ == "__main__":
    from estatescraper import SpiderRun
    from estatescraper import AuctionDOTcom
    spider = AuctionDOTcom()
    r = SpiderRun(spider)



--------------------
pipelines:

import csv
from csv import DictWriter

# class TutorialPipeline(object):
#     def process_item(self, item, spider):
#         return item

class EstatescraperXLSwriter(object):
    def __init__(self):
        print "Ive started the __init__ in the pipeline" #333

        self.brandCategoryCsv = csv.writer(open('test.csv', 'wb'),
                                           delimiter=',',
                                           quoting=csv.QUOTE_MINIMAL)
        self.brandCategoryCsv.writerow(['Property ID', 'Asset Type'])

    def open_spider(self, spider):
        print "Hit open_spider in EstatescraperXLSwriter" #333

    def process_item(self, item, spider):
        print "attempting to run process_item" #333
        self.brandCategoryCsv.writerow([item['propertyID'],
                                        item['assetType']])
        return item

    def close_spider(self, spider):
        print "Hit close_spider in EstatescraperXLSwriter" #333
        pass


if __name__ == "__main__":

    o = EstatescraperXLSwriter()

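A side note on the pipeline above: it opens test.csv in __init__ but never closes it, so rows can sit in the buffer. A variant that keeps the file handle and closes it in close_spider() is sketched below (Python 3 file mode shown; on the asker's Python 2.7 use open('test.csv', 'wb') and drop newline='' as in the original — item['assetType'] comes from the pipeline code, not from the posted items class):

```python
import csv

class EstatescraperXLSwriter(object):
    """Same pipeline as above, but holding the file handle so
    close_spider() can flush and close it when the crawl finishes."""

    def __init__(self):
        # Python 3 mode shown; on Python 2.7 use open('test.csv', 'wb').
        self.outfile = open('test.csv', 'w', newline='')
        self.writer = csv.writer(self.outfile, delimiter=',',
                                 quoting=csv.QUOTE_MINIMAL)
        self.writer.writerow(['Property ID', 'Asset Type'])

    def process_item(self, item, spider):
        self.writer.writerow([item['propertyID'], item['assetType']])
        return item

    def close_spider(self, spider):
        self.outfile.close()  # flush buffered rows to disk

# Quick standalone check (a plain dict stands in for the Scrapy item):
p = EstatescraperXLSwriter()
p.process_item({'propertyID': '1590613', 'assetType': 'Bank Owned'}, None)
p.close_spider(None)
```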


--------------------
items:

from scrapy.item import Item, Field

class AuctionDOTcomItems(Item):
    """"""
    propertyID      = Field()  # <uniqueID>ABCD1234</uniqueID>



------------------
output:
Ive started the __init__ in the pipeline
Hit open_spider in EstatescraperXLSwriter
item['propertyID'] =  1590613
item['propertyID'] =  1466738
(...)
item['propertyID'] =  1639764
Hit close_spider in EstatescraperXLSwriter
SpiderRun done



---------------
logged output:
Ive started the __init__ in the pipeline
Hit open_spider in EstatescraperXLSwriter
2014-02-27 17:44:12+0100 [auction.com] INFO: Closing spider (finished)
2014-02-27 17:44:12+0100 [auction.com] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 240,
 'downloader/request_count': 1,
 'downloader/request_method_count/GET': 1,
 'downloader/response_bytes': 40640,
 'downloader/response_count': 1,
 'downloader/response_status_count/200': 1,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2014, 2, 27, 16, 44, 12, 238000),
 'response_received_count': 1,
 'scheduler/dequeued': 1,
 'scheduler/dequeued/memory': 1,
 'scheduler/enqueued': 1,
 'scheduler/enqueued/memory': 1,
 'start_time': datetime.datetime(2014, 2, 27, 16, 44, 9, 203000)}
2014-02-27 17:44:12+0100 [auction.com] INFO: Spider closed (finished)


Question by: Mike R.
1 Comment

Accepted Solution
by: Mike R. (earned 0 total points)
ID: 39893045
STOOPIDLY SIMPLE!

I needed a yield statement.

        for listing in listings:
            item = AuctionDOTcomItems()

            item['propertyID'] = ''.join(set(listing.xpath('./@property-id').extract()))
            print "item['propertyID'] = ", item['propertyID'] #333
            yield item
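That one line is the whole story: Scrapy's engine iterates over whatever the parse() callback produces, so an item that is only assigned to a local variable never reaches process_item() or the item signals. A toy, Scrapy-free model of the difference:

```python
def parse_without_yield(listings):
    # Items are created but never handed back -- the caller sees nothing.
    for listing in listings:
        item = {'propertyID': listing}

def parse_with_yield(listings):
    # A generator: each yielded item is handed back to the caller,
    # which is how Scrapy routes items into the pipelines.
    for listing in listings:
        item = {'propertyID': listing}
        yield item

listings = ['1590613', '1466738']
print(parse_without_yield(listings))      # a plain function returning None
print(list(parse_with_yield(listings)))   # the caller receives both items
```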