Solved

scrapy: spider not generating item_ signals

Posted on 2014-02-27
Last Modified: 2014-02-27
Python 2.7.6.2 on Windows 7 using binary WinPython-32bit-2.7.6.2, Scrapy 0.22.0, Eclipse 4.2.1 and a Twisted-13.2.0.win32-py2.7 reactor

I'm learning Scrapy. Everything works EXCEPT that the pipeline's process_item() is never called. The pipeline's open_spider() and close_spider() ARE being called OK.

I THINK this is because the spider is not generating any "item" signals (none of item_passed, item_dropped, or item_scraped).

I added some code to try to capture these signals, but I get nothing for any of the 3 item signals above.

The code DOES capture other signals (like engine_started, or spider_closed, etc).

It ALSO errors if I try to set an item['doesnotexist'] variable, so it appears to be using the items file and my user-defined item class "AuctionDOTcomItems".

Really at a loss. I would greatly appreciate any help with either...

A) Getting the pipelines.process_item() to work normally OR...

B) Being able to manually catch the signal that an item has been set so I can pass control to my own version of pipelines.process_item().

Thanks!!

----------
reactor:
from twisted.internet import reactor
from scrapy.crawler import Crawler
from scrapy import log, signals
from scrapy.utils.project import get_project_settings

class SpiderRun:
    def __init__(self, spider):
        settings = get_project_settings()
        mySettings = {'ITEM_PIPELINES': {'estatescraper.pipelines.EstatescraperXLSwriter':300}} 
        settings.overrides.update(mySettings)

        crawler = Crawler(settings)
        crawler.signals.connect(reactor.stop, signal=signals.spider_closed)
        crawler.configure()
        crawler.crawl(spider)
        crawler.start()
#         log.start()
        reactor.run() # the script will block here until the spider_closed signal is sent
        self.cleanup()

    def cleanup(self):
        print "SpiderRun done" #333
        pass

if __name__ == "__main__":
    from estatescraper import AuctionDOTcom
    spider = AuctionDOTcom()
    r = SpiderRun(spider)



----------------
spider:
from scrapy.xlib.pydispatch import dispatcher
from scrapy.http import Request
from scrapy.selector import Selector
from scrapy import signals
from scrapy.spider import Spider

from auctiondotcomurls import AuctionDOTcomURLs
from auctiondotcomitems import AuctionDOTcomItems
from auctiondotcomgetitems import AuctionDOTcomGetItems

import urlparse
import time 

import sys

class AuctionDOTcom(Spider):
    def __init__(self,
                 limit = 50, 
                 miles = 250,
                 zip = None, 
                 asset_types = "",
                 auction_types = "", 
                 property_types = ""):
        self.name = "auction.com"
        self.allowed_domains = ["auction.com"]
        self.start_urls = AuctionDOTcomURLs(limit, miles, zip, asset_types, 
                                            auction_types, property_types)

        dispatcher.connect(self.testsignal, signals.item_scraped) 

#     def _item_passed(self, item):
#         print "item = ", item #333  

    def testsignal(self):
        print "in csvwrite" #333

    def parse(self, response):
        sel = Selector(response)
        listings =  sel.xpath('//div[@class="contentDetail searchResult"]')
        for listing in listings:
            item = AuctionDOTcomItems()

            item['propertyID'] = ''.join(set(listing.xpath('./@property-id').extract()))
            print "item['propertyID'] = ", item['propertyID'] #333
#             item = AuctionDOTcomGetItems(listing)

#         ################
#         # DEMONSTRATION ONLY
#             print "######################################"            
#             for i in item:
#                 print i + ": " + str(item[i])

        next = set(sel.xpath('//a[contains(text(),"Next")]//@href').extract())

        for i in next:
            yield Request("http://%s/%s" % (urlparse.urlparse(response.url).hostname, i), callback=self.parse)


if __name__ == "__main__":
    from estatescraper import SpiderRun
    from estatescraper import AuctionDOTcom
    spider = AuctionDOTcom()
    r = SpiderRun(spider)



--------------------
pipelines:

import csv
from csv import DictWriter

# class TutorialPipeline(object):
#     def process_item(self, item, spider):
#         return item

class EstatescraperXLSwriter(object):
    def __init__(self):
        print "Ive started the __init__ in the pipeline" #333

        self.brandCategoryCsv = csv.writer(open('test.csv', 'wb'),
        delimiter=',', 
        quoting=csv.QUOTE_MINIMAL)
        self.brandCategoryCsv.writerow(['Property ID', 'Asset Type'])

    def open_spider(self, spider):
        print "Hit open_spider in EstatescraperXLSwriter" #333

    def process_item(self, item, spider):
        print "attempting to run process_item" #333
        self.brandCategoryCsv.writerow([item['propertyID'],
                                        item['assetType']])
        return item

    def close_spider(self, spider):
        print "Hit close_spider in EstatescraperXLSwriter" #333
        pass


if __name__ == "__main__":

    o = EstatescraperXLSwriter()



--------------------
items:

from scrapy.item import Item, Field

class AuctionDOTcomItems(Item):
    """"""
    propertyID      = Field()  # <uniqueID>ABCD1234</uniqueID>
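
Note on the two files above: process_item() reads item['assetType'], but this item class declares only propertyID and the spider never sets assetType. A Scrapy Item raises KeyError when an unset field is read with item[...], so once items do reach the pipeline that line would fail. One way around it is to declare assetType = Field() here and read fields with .get() and a default. The sketch below is only an illustration under assumptions: PropertyCsvPipeline and its stream argument are hypothetical names (Scrapy itself creates pipelines without arguments, so a real pipeline would open its file in open_spider()), a plain dict stands in for the item, and it is written for Python 3 rather than the Python 2.7 used in the question.

```python
import csv
import io


class PropertyCsvPipeline(object):
    """Hypothetical variant of EstatescraperXLSwriter that tolerates unset fields."""

    def __init__(self, stream):
        # Assumption: the caller passes in any writable text stream.
        self.writer = csv.writer(stream, delimiter=',',
                                 quoting=csv.QUOTE_MINIMAL)
        self.writer.writerow(['Property ID', 'Asset Type'])

    def process_item(self, item, spider):
        # .get() returns the default instead of raising KeyError
        # when a field has not been set on the item.
        self.writer.writerow([item.get('propertyID', ''),
                              item.get('assetType', '')])
        return item


# A plain dict stands in for an AuctionDOTcomItems instance.
buf = io.StringIO()
pipeline = PropertyCsvPipeline(buf)
pipeline.process_item({'propertyID': '1590613'}, spider=None)

rows = buf.getvalue().splitlines()
print(rows)
```

With the item above, the asset type column is simply left empty instead of the pipeline raising an error.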



------------------
output:
Ive started the __init__ in the pipeline
Hit open_spider in EstatescraperXLSwriter
item['propertyID'] =  1590613
item['propertyID'] =  1466738
(...)
item['propertyID'] =  1639764
Hit close_spider in EstatescraperXLSwriter
SpiderRun done



---------------
logged output:
Ive started the __init__ in the pipeline
Hit open_spider in EstatescraperXLSwriter
2014-02-27 17:44:12+0100 [auction.com] INFO: Closing spider (finished)
2014-02-27 17:44:12+0100 [auction.com] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 240,
 'downloader/request_count': 1,
 'downloader/request_method_count/GET': 1,
 'downloader/response_bytes': 40640,
 'downloader/response_count': 1,
 'downloader/response_status_count/200': 1,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2014, 2, 27, 16, 44, 12, 238000),
 'response_received_count': 1,
 'scheduler/dequeued': 1,
 'scheduler/dequeued/memory': 1,
 'scheduler/enqueued': 1,
 'scheduler/enqueued/memory': 1,
 'start_time': datetime.datetime(2014, 2, 27, 16, 44, 9, 203000)}
2014-02-27 17:44:12+0100 [auction.com] INFO: Spider closed (finished)

Question by:Mike R.
Accepted Solution

by: Mike R.
STOOPIDLY SIMPLE!

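I needed a yield statement. parse() was building each item but never yielding it, so Scrapy never received any items to send to the pipeline or to fire item_scraped for.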

        for listing in listings:
            item = AuctionDOTcomItems()

            item['propertyID'] = ''.join(set(listing.xpath('./@property-id').extract()))
            print "item['propertyID'] = ", item['propertyID'] #333
            yield item
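
Why the missing yield went unnoticed: parse() was already a generator because of the yield Request(...) at the end, so it ran without errors and the print statements fired, but each item was just a local variable that got thrown away. The following is a minimal plain-Python sketch of that behaviour, not Scrapy code: dicts stand in for AuctionDOTcomItems, strings stand in for Request objects, and the function names are made up for the illustration.

```python
# Plain-Python stand-in for the spider callback; no Scrapy required.

def parse_without_yield(property_ids, next_links):
    for pid in property_ids:
        item = {'propertyID': pid}  # built, then silently discarded
    for link in next_links:
        yield 'REQUEST ' + link     # only requests reach the caller


def parse_with_yield(property_ids, next_links):
    for pid in property_ids:
        item = {'propertyID': pid}
        yield item                  # handed back to the caller
    for link in next_links:
        yield 'REQUEST ' + link


ids = ['1590613', '1466738']
links = ['/page2']

# The caller (Scrapy's engine in the real case) only sees what is yielded.
broken = list(parse_without_yield(ids, links))
fixed = list(parse_with_yield(ids, links))

print(broken)
print(fixed)
```

In the first version the caller only ever sees the request, which matches the output in the question: the property IDs were printed, but nothing item-related happened afterwards.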