Solved

scrapy: spider not generating item_ signals

Posted on 2014-02-27
1,078 Views
Last Modified: 2014-02-27
Python 2.7.6 on Windows 7 (WinPython-32bit-2.7.6.2 binary), Scrapy 0.22.0, Eclipse 4.2.1, and the Twisted-13.2.0.win32-py2.7 reactor

I'm learning Scrapy. I have it doing everything EXCEPT properly calling pipelines.process_item(). It IS calling pipelines.open_spider() and pipelines.close_spider() fine.

I THINK this is because the spider is not generating any "item" signals (item_passed, item_dropped, or item_scraped).

I added some code to try to capture these signals, and I'm getting nothing for any of the three item signals above.

The code DOES capture other signals (engine_started, spider_closed, and so on).

It ALSO errors if I try to set an item['doesnotexist'] field, so it appears to be using the items file and my user-defined Item class "AuctionDOTcomItems".
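
(That KeyError is expected, as far as I can tell: a Scrapy Item only accepts fields that are declared on the class, so assigning to an undeclared key fails. A minimal interactive sketch, assuming the one-field AuctionDOTcomItems class shown further down:)

>>> from scrapy.item import Item, Field
>>> class AuctionDOTcomItems(Item):
...     propertyID = Field()
...
>>> item = AuctionDOTcomItems()
>>> item['propertyID'] = '1590613'    # declared field: accepted
>>> item['doesnotexist'] = 'x'        # undeclared field: rejected
KeyError: 'AuctionDOTcomItems does not support field: doesnotexist'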

I'm really at a loss, and would greatly appreciate help with either...

A) Getting pipelines.process_item() to work normally, OR...

B) Manually catching the signal that an item has been scraped so I can pass control to my own version of pipelines.process_item() (a rough sketch of what I mean follows below).
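
(For reference, option B would presumably look something like the following — a minimal sketch assuming Scrapy 0.22, where the item_scraped signal is documented to send item, response, and spider. MySpider and handle_item are placeholder names:)

from scrapy.xlib.pydispatch import dispatcher
from scrapy import signals
from scrapy.spider import Spider

class MySpider(Spider):
    name = "example"

    def __init__(self):
        # run handle_item once for every item the engine scrapes
        dispatcher.connect(self.handle_item, signals.item_scraped)

    def handle_item(self, item, response, spider):
        # item_scraped delivers the item plus the response and spider it came from
        print "caught item:", item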

Thanks!!

----------
reactor:
from twisted.internet import reactor
from scrapy.crawler import Crawler
from scrapy import log, signals
from scrapy.utils.project import get_project_settings

class SpiderRun:
    def __init__(self, spider):
        settings = get_project_settings()
        mySettings = {'ITEM_PIPELINES': {'estatescraper.pipelines.EstatescraperXLSwriter':300}} 
        settings.overrides.update(mySettings)

        crawler = Crawler(settings)
        crawler.signals.connect(reactor.stop, signal=signals.spider_closed)
        crawler.configure()
        crawler.crawl(spider)
        crawler.start()
#         log.start()
        reactor.run() # the script will block here until the spider_closed signal is sent
        self.cleanup()

    def cleanup(self):
        print "SpiderRun done" #333
        pass

if __name__ == "__main__":
    from estatescraper import AuctionDOTcom
    spider = AuctionDOTcom()
    r = SpiderRun(spider)



----------------
spider:
from scrapy.xlib.pydispatch import dispatcher
from scrapy.http import Request
from scrapy.selector import Selector
from scrapy import signals
from scrapy.spider import Spider

from auctiondotcomurls import AuctionDOTcomURLs
from auctiondotcomitems import AuctionDOTcomItems
from auctiondotcomgetitems import AuctionDOTcomGetItems

import urlparse
import time 

import sys

class AuctionDOTcom(Spider):
    def __init__(self,
                 limit = 50, 
                 miles = 250,
                 zip = None, 
                 asset_types = "",
                 auction_types = "", 
                 property_types = ""):
        self.name = "auction.com"
        self.allowed_domains = ["auction.com"]
        self.start_urls = AuctionDOTcomURLs(limit, miles, zip, asset_types, 
                                            auction_types, property_types)

        dispatcher.connect(self.testsignal, signals.item_scraped) 

#     def _item_passed(self, item):
#         print "item = ", item #333  

    def testsignal(self):
        print "in csvwrite" #333

    def parse(self, response):
        sel = Selector(response)
        listings = sel.xpath('//div[@class="contentDetail searchResult"]')
        for listing in listings:
            item = AuctionDOTcomItems()

            item['propertyID'] = ''.join(set(listing.xpath('./@property-id').extract()))
            print "item['propertyID'] = ", item['propertyID'] #333
#             item = AuctionDOTcomGetItems(listing)

#         ################
#         # DEMONSTRATION ONLY
#             print "######################################"            
#             for i in item:
#                 print i + ": " + str(item[i])

        next = set(sel.xpath('//a[contains(text(),"Next")]//@href').extract())

        for i in next:
            yield Request("http://%s/%s" % (urlparse.urlparse(response.url).hostname, i), callback=self.parse)


if __name__ == "__main__":
    from estatescraper import SpiderRun
    from estatescraper import AuctionDOTcom
    spider = AuctionDOTcom()
    r = SpiderRun(spider)



--------------------
pipelines:

import csv
from csv import DictWriter

# class TutorialPipeline(object):
#     def process_item(self, item, spider):
#         return item

class EstatescraperXLSwriter(object):
    def __init__(self):
        print "Ive started the __init__ in the pipeline" #333

        self.brandCategoryCsv = csv.writer(open('test.csv', 'wb'),
                                           delimiter=',',
                                           quoting=csv.QUOTE_MINIMAL)
        self.brandCategoryCsv.writerow(['Property ID', 'Asset Type'])

    def open_spider(self, spider):
        print "Hit open_spider in EstatescraperXLSwriter" #333

    def process_item(self, item, spider):
        print "attempting to run process_item" #333
        self.brandCategoryCsv.writerow([item['propertyID'],
                                        item['assetType']])
        return item

    def close_spider(self, spider):
        print "Hit close_spider in EstatescraperXLSwriter" #333
        pass


if __name__ == "__main__":

    o = EstatescraperXLSwriter()



--------------------
items:

from scrapy.item import Item, Field

class AuctionDOTcomItems(Item):
    """"""
    propertyID      = Field()  # <uniqueID>ABCD1234</uniqueID>



------------------
output:
I've started the __init__ in the pipeline
Hit open_spider in EstatescraperXLSwriter
item['propertyID'] =  1590613
item['propertyID'] =  1466738
(...)
item['propertyID'] =  1639764
Hit close_spider in EstatescraperXLSwriter
SpiderRun done



---------------
logged output:
I've started the __init__ in the pipeline
Hit open_spider in EstatescraperXLSwriter
2014-02-27 17:44:12+0100 [auction.com] INFO: Closing spider (finished)
2014-02-27 17:44:12+0100 [auction.com] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 240,
 'downloader/request_count': 1,
 'downloader/request_method_count/GET': 1,
 'downloader/response_bytes': 40640,
 'downloader/response_count': 1,
 'downloader/response_status_count/200': 1,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2014, 2, 27, 16, 44, 12, 238000),
 'response_received_count': 1,
 'scheduler/dequeued': 1,
 'scheduler/dequeued/memory': 1,
 'scheduler/enqueued': 1,
 'scheduler/enqueued/memory': 1,
 'start_time': datetime.datetime(2014, 2, 27, 16, 44, 9, 203000)}
2014-02-27 17:44:12+0100 [auction.com] INFO: Spider closed (finished)


Question by: Mike R.
1 Comment

Accepted Solution by: Mike R. (ID: 39893045)
STOOPIDLY SIMPLE!

I needed a yield statement.

        for listing in listings:
            item = AuctionDOTcomItems()

            item['propertyID'] = ''.join(set(listing.xpath('./@property-id').extract()))
            print "item['propertyID'] = ", item['propertyID'] #333
            yield item
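
(Why this works: Scrapy consumes parse() as a generator, and only what parse() yields is handed to the engine. Yielded Items are what fire item_scraped and get routed through ITEM_PIPELINES to process_item(); an item that is only assigned to a local variable never enters the pipeline at all. Roughly, the full corrected parse() from the spider above:)

    def parse(self, response):
        sel = Selector(response)
        listings = sel.xpath('//div[@class="contentDetail searchResult"]')
        for listing in listings:
            item = AuctionDOTcomItems()
            item['propertyID'] = ''.join(set(listing.xpath('./@property-id').extract()))
            yield item  # hands the item to the engine -> pipelines + item_scraped

        # pagination, unchanged from the question's spider
        for href in set(sel.xpath('//a[contains(text(),"Next")]//@href').extract()):
            yield Request("http://%s/%s" % (urlparse.urlparse(response.url).hostname, href),
                          callback=self.parse)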