Following pagination links with Scrapy does not work

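I am trying to scrape this website, https://www.pararius.com/english, to collect rental listings. I want to scrape every page of the site.

I have looked at similar questions about Scrapy pagination on Stack Overflow, but none of them seems to match my problem.

Everything in my code works except the part where I try to follow the "next_page" link. I wrote another spider for a book website using exactly the same approach, and it works fine. Here, I cannot get the next_page link joined onto the start URL so that Scrapy crawls the next page automatically.

Here is my code: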

import scrapy

from time import sleep

class ParariusScraper(scrapy.Spider):
    name = 'pararius'
    start_urls = ['https://www.pararius.com/apartments/amsterdam/']
    def parse(self, response):
        base_url = 'https://www.pararius.com/apartments/amsterdam'
        for section in response.css('section.listing-search-item'):
            yield {
                'Title': section.css('h2.listing-search-item__title > a::text').get().strip(),
                'Location': section.css('div.listing-search-item__sub-title::text').get().strip(),
                'Price': section.css('div.listing-search-item__price::text').get().strip(),
                'Size': section.css('li.illustrated-features__item::text').get().strip(),
                'Link':f"{base_url}{section.css('h2.listing-search-item__title a').attrib['href']}"
            }
            sleep(1)
            next_page = response.css('li.pagination__item a').attrib['href'].split('/')[-1]
            print(next_page)
            if next_page:
                yield response.follow(next_page, self.parse)

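When I run this code, the strange thing is that it only scrapes the results from page 2. It does not even scrape the first page, which is the start URL shown in my code.

I would like to know how to fix this so that my code works as expected. Thanks in advance for your help.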

Reply from a uj5u.com user:

With the following code, pagination works without raising any exceptions:

import scrapy

class ParariusScraper(scrapy.Spider):
    name = 'pararius'
    start_urls = ['https://www.pararius.com/apartments/amsterdam/']
    def parse(self, response):
        for section in response.css('section.listing-search-item'):
            yield {
                'Title': section.css('h2.listing-search-item__title > a::text').get().strip(),
                'Location': section.css('div.listing-search-item__sub-title::text').get().strip(),
                'Price': section.css('div.listing-search-item__price::text').get().strip(),
                'Size': section.css('li.illustrated-features__item::text').get().strip(),
                # Resolve the href against the current page URL instead of concatenating strings
                'Link': response.urljoin(section.css('h2.listing-search-item__title a').attrib['href'])
            }
        next_page = response.css('a:contains(Next)::attr(href)').get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)
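
The main structural change from the question's code is that the next-page lookup now sits after the for loop rather than inside it. Here is a minimal sketch of why that matters, in plain Python with no Scrapy involved. The function names, the tuple format, and the sample listings are all made up for illustration; they only mimic the shape of a parse() generator.

```python
def parse_inside(items, next_href):
    """Mimics the question's layout: the follow-up request is yielded inside the loop."""
    for item in items:
        yield ("item", item)
        if next_href:
            yield ("request", next_href)


def parse_outside(items, next_href):
    """Mimics the answer's layout: the follow-up request is yielded once, after the loop."""
    for item in items:
        yield ("item", item)
    if next_href:
        yield ("request", next_href)


def count_requests(results):
    """Count how many follow-up requests a parse-like generator produced."""
    return sum(1 for kind, _ in results if kind == "request")


listings = ["flat A", "flat B", "flat C"]  # hypothetical listings on one page
inside_count = count_requests(parse_inside(listings, "page-2"))
outside_count = count_requests(parse_outside(listings, "page-2"))
print(inside_count)   # 3: one request per listing
print(outside_count)  # 1: one request per page
```

As far as I know, Scrapy's default duplicate filter drops repeated requests for the same URL, so the extra requests are mostly wasted work rather than duplicated output. Note also that with the in-loop layout a page with no matching listings never requests the next page at all.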

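Reply from a uj5u.com user:

I managed to get it working with the example below. The CSS selector for the next page was wrong, and letting Scrapy resolve relative links with response.urljoin() (or response.follow(), which does the same resolution) is much easier than doing all the parsing yourself. You also need to move the next-page request outside the for loop; otherwise you send the same request on every iteration of the loop.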

import scrapy

class ParariusScraper(scrapy.Spider):
    name = 'pararius'
    start_urls = ['https://www.pararius.com/apartments/amsterdam/']
    def parse(self, response):
        for section in response.css('section.listing-search-item'):
            yield {
                'Title': section.css('h2.listing-search-item__title > a::text').get().strip(),
                'Location': section.css('div.listing-search-item__sub-title::text').get().strip(),
                'Price': section.css('div.listing-search-item__price::text').get().strip(),
                'Size': section.css('li.illustrated-features__item::text').get().strip(),
                # Resolve the href against the current page URL instead of concatenating strings
                'Link': response.urljoin(section.css('h2.listing-search-item__title a').attrib['href'])
            }
        next_page = response.css('.pagination__link.pagination__link--next')
        if next_page:
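            # response.follow() resolves the relative href; parse is the default callback, named here for clarity
            yield response.follow(next_page.attrib['href'], callback=self.parse)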
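
To see why resolving links is safer than gluing strings together, here is a small sketch using urllib.parse.urljoin from the standard library, which, as far as I know, is what response.urljoin() builds on. The href values below are assumptions about what the site might return, not values copied from it.

```python
from urllib.parse import urljoin

page_url = "https://www.pararius.com/apartments/amsterdam/"

# Assumed href shapes, for illustration only
root_relative = "/apartments/amsterdam/page-2"  # starts at the site root
bare_segment = "page-2"                         # relative to the current page

# Naive concatenation repeats the path when the href is root-relative
concatenated = page_url + root_relative
print(concatenated)
# https://www.pararius.com/apartments/amsterdam//apartments/amsterdam/page-2

# urljoin resolves both shapes to the same absolute URL
from_root = urljoin(page_url, root_relative)
from_segment = urljoin(page_url, bare_segment)
print(from_root)     # https://www.pararius.com/apartments/amsterdam/page-2
print(from_segment)  # https://www.pararius.com/apartments/amsterdam/page-2

# Without the trailing slash on the base, a bare segment replaces the last path part
no_slash = urljoin("https://www.pararius.com/apartments/amsterdam", bare_segment)
print(no_slash)      # https://www.pararius.com/apartments/page-2
```

The last case shows why splitting the href and keeping only the final segment, as the question's code does, is fragile: the result depends on whether the base URL ends with a slash.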

