Issue with getting Scrapy to properly organize the scraped data in XML form

I've made progress with this Scrapy script, but I cannot get the full lyrics to appear inside the title and genre grouping of the XML file. The spider has two callback methods: parse, which reads the main page to grab the title, composer, shortened lyrics, and genre, and all_lyrics, which follows the title link to get the full lyrics. I've included an XML file and a screenshot to show that the full lyrics are not grouped with the correct song but appear as a completely separate entry. The full lyrics should be under lyrics. parse runs first and calls out to all_lyrics to get the complete lyrics and chords. The issue is getting all_lyrics to return or yield its results back to parse so that all the data ends up under the correct XML grouping.

-------start of code--------------

import scrapy
import re
from ..items import HopamItem, HopamItem_lyrics

class hopamspider(scrapy.Spider):
    name = 'hopam_or'
    page_number = 10
    start_urls = ['https://hopamviet.vn/chord/']
    custom_settings = {'DUPEFILTER_CLASS': 'scrapy.dupefilters.BaseDupeFilter'}

    def all_lyrics(self, response):
        global items
        items = HopamItem()
        items['full_lyrics'] = response.xpath("//div[@id='lyric']/text()").extract() #no yield here!!!
        yield items


    def parse(self, response):
        items = HopamItem()
        x = 0
        xy = 0
        all_hopam = response.xpath("//div[@class='col-md-12']")
        all_hopamlyrics = response.xpath("//em[contains(text(),'[')]") #complete lyrics and chords

        while x < 1: #modify this number as the max according to the number of song titles on that web page
            songlink = all_hopam.xpath("//h5/a/@href").extract()[x]

            items['titles'] = all_hopam.xpath("//h5/a/text()").extract()[x]
            items['genre'] = all_hopam.xpath("//span[@class='float-right text-muted small']/text()").extract()[x]
            items['writer'] = all_hopam.xpath("//h5/small/a[1]/text()").extract()[x]
            items['lyrics'] = all_hopamlyrics.extract()[x]
            # I think the problem begins here: how do I get the all_lyrics function to pass
            # back all the lyrics into items['full_lyrics']?
            items['full_lyrics'] = yield response.follow(url=songlink, callback=self.all_lyrics)

            x += 1
            xy += 1
            yield items

        next_page = 'https://hopamviet.vn/chord/latest/' + str(hopamspider.page_number) + '/'
        x=0

        if hopamspider.page_number < 20:
            hopamspider.page_number += 10
            yield response.follow(next_page, callback = self.parse)
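
The hand-off described above (parse collects part of a song, all_lyrics supplies the rest) can be illustrated without a network. The sketch below is a plain-Python simulation, not Scrapy itself: FakeRequest, PAGES, and crawl are hypothetical stand-ins invented for illustration. It assumes the engine simply iterates over what a callback yields, so a yield expression never evaluates to the callback's result. The partial item therefore travels with the follow-up request, and only the second callback yields it. In real Scrapy the counterpart would likely be response.follow(songlink, callback=self.all_lyrics, cb_kwargs={'item': item}).

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class FakeRequest:
    """Hypothetical stand-in for a Scrapy request: URL, callback, extra kwargs."""
    url: str
    callback: Callable
    cb_kwargs: dict = field(default_factory=dict)


# Hypothetical page contents keyed by URL, so the sketch runs offline.
PAGES = {
    "/chord/": [("Song A", "/song/a"), ("Song B", "/song/b")],
    "/song/a": "full lyrics of A",
    "/song/b": "full lyrics of B",
}


def parse(url, body):
    for title, link in body:
        # Build a fresh item per song instead of reusing one shared object.
        item = {"titles": title}
        # Do not yield the item yet; send it along with the follow-up request.
        yield FakeRequest(link, all_lyrics, cb_kwargs={"item": item})


def all_lyrics(url, body, item):
    # Complete the same item that parse started, then emit it once.
    item["full_lyrics"] = body
    yield item


def crawl(start_url):
    """Tiny engine: run callbacks, follow requests, collect finished items."""
    queue = [FakeRequest(start_url, parse)]
    items = []
    while queue:
        req = queue.pop(0)
        for result in req.callback(req.url, PAGES[req.url], **req.cb_kwargs):
            if isinstance(result, FakeRequest):
                queue.append(result)
            else:
                items.append(result)
    return items


items = crawl("/chord/")
for finished in items:
    print(finished)
```

Each printed item holds both the title and its full lyrics, so an exporter would write them under one grouping rather than as separate entries.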

test33.xml
Capture.JPG
