Failed to retrieve product listing pages from several categories


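From this webpage I am trying to get the kind of links under which the different products are located. There are 6 categories with More info buttons, and when I traverse them recursively I usually reach the target pages. This is one such product listing page that I wish to get.

Please note that some of these pages have both a product list and More info buttons, which is why I failed to capture the product listing pages accurately.

The current spider looks like this (it fails to grab lots of product listing pages):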

import scrapy

class NorgrenSpider(scrapy.Spider):
    name = 'norgren'
    start_urls = ['https://www.norgren.com/de/en/list']

    def start_requests(self):
        for start_url in self.start_urls:
            yield scrapy.Request(start_url, callback=self.parse)

    def parse(self, response):
        link_list = []
        for item in response.css(".match-height a.more-info::attr(href)").getall():
            if not "/detail/" in item:
                inner_page_link = response.urljoin(item)
                link_list.append(inner_page_link)
                yield {"target_url":inner_page_link}

        for new_link in link_list:
            yield scrapy.Request(new_link, callback=self.parse)

Expected output (a random sample):

https://www.norgren.com/de/en/list/directional-control-valves/in-line-and-manifold-valves
https://www.norgren.com/de/en/list/pressure-switches/electro-mechanical-pressure-switches
https://www.norgren.com/de/en/list/pressure-switches/electronic-pressure-switches
https://www.norgren.com/de/en/list/directional-control-valves/sub-base-valves
https://www.norgren.com/de/en/list/directional-control-valves/non-return-valves
https://www.norgren.com/de/en/list/directional-control-valves/valve-islands
https://www.norgren.com/de/en/list/air-preparation/combination-units-frl

How can I get all the product listing pages from the six categories?

A uj5u.com user replied:

import scrapy


class NorgrenSpider(scrapy.Spider):
    name = 'norgren'
    start_urls = ['https://www.norgren.com/de/en/list']

    def start_requests(self):
        for start_url in self.start_urls:
            yield scrapy.Request(start_url)

    def parse(self, response):
        # check if there are items in the page
        if response.xpath('//div[contains(@class, "item-list")]//div[@]/div[@]/a/@href'):
            yield scrapy.Request(url=response.url, callback=self.get_links, dont_filter=True)

        # follow "more info" buttons
        for url in response.xpath('//a[text()="More info"]/@href').getall():
            yield response.follow(url)

    def get_links(self, response):
        yield {"target_url": response.url}

        next_page = response.xpath('//a[@]/@href').get()
        if next_page:
            yield response.follow(url=next_page, callback=self.get_links)
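
The traversal idea behind the spider above (check whether a page lists products, then keep following the More info links) can be sketched without Scrapy. This is only a minimal illustration under stated assumptions: the miniature `SITE` map and the helper name `find_listing_pages` are hypothetical, and the `/detail/` substring test from the question stands in for the XPath check used in the answer.

```python
from collections import deque


def find_listing_pages(start, site):
    """Breadth-first walk over `site`, a dict mapping URL -> list of linked URLs.

    A page counts as a product listing page when it links to at least one
    "/detail/" URL; every other link is treated as a sub-category to visit.
    """
    seen = {start}          # plays the role of Scrapy's duplicate filter
    queue = deque([start])
    listing_pages = []
    while queue:
        url = queue.popleft()
        links = site.get(url, [])
        # A page can be a listing page AND still contain category links.
        if any("/detail/" in link for link in links):
            listing_pages.append(url)
        for link in links:
            if "/detail/" not in link and link not in seen:
                seen.add(link)
                queue.append(link)
    return listing_pages


# Hypothetical miniature site: "/list/a" is both a listing and a category page.
SITE = {
    "/list": ["/list/a", "/list/b"],
    "/list/a": ["/detail/p1", "/list/a/x"],
    "/list/a/x": ["/detail/p2"],
    "/list/b": ["/list/b/y"],
    "/list/b/y": ["/detail/p3"],
}

print(find_listing_pages("/list", SITE))
```

The point of the sketch is that the two checks are independent: a mixed page such as `/list/a` is recorded as a listing page and its remaining category links are still followed.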

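A uj5u.com user replied:

Maybe just filter for pages that have at least one detail link? Here is an example of how to determine whether a page matches the criteria you are searching for: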

import scrapy


class NorgrenSpider(scrapy.Spider):
    name = 'norgren'
    start_urls = ['https://www.norgren.com/de/en/list']

    def start_requests(self):
        for start_url in self.start_urls:
            yield scrapy.Request(start_url, callback=self.parse)

    def parse(self, response):
        link_list = []

        more_info_items = response.css(
            ".match-height a.more-info::attr(href)").getall()

        detail_items = [item for item in more_info_items if '/detail/' in item]
        if len(detail_items) > 0:
            print(f'This is a link you are searching for: {response.url}')

        for item in more_info_items:
            if not "/detail/" in item:
                inner_page_link = response.urljoin(item)
                link_list.append(inner_page_link)
                yield {"target_url": inner_page_link}

        for new_link in link_list:
            yield scrapy.Request(new_link, callback=self.parse)

I only print the link to the console, but you can figure out how to record it wherever you need.
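
One possible way to record the page instead of printing it is to separate the decision from the spider, as in the rough sketch below. The helper name `classify_page` and the sample detail URL are hypothetical; the only thing taken from the answer is the assumption that a product link contains `/detail/`.

```python
from urllib.parse import urljoin


def classify_page(page_url, hrefs):
    """Split the "More info" hrefs found on one page.

    Returns (is_listing_page, category_links):
    - is_listing_page is True when at least one href points to a product
      detail page (contains "/detail/").
    - category_links are the absolute URLs of the remaining hrefs, i.e. the
      sub-categories that still need to be crawled.
    """
    is_listing_page = any("/detail/" in href for href in hrefs)
    category_links = [
        urljoin(page_url, href) for href in hrefs if "/detail/" not in href
    ]
    return is_listing_page, category_links


# Hypothetical hrefs, as they might be returned by
# response.css(".match-height a.more-info::attr(href)").getall()
is_listing, links = classify_page(
    "https://www.norgren.com/de/en/list/air-preparation",
    [
        "/de/en/detail/some-product",  # made-up product link
        "/de/en/list/air-preparation/combination-units-frl",
    ],
)
print(is_listing, links)
```

Inside `parse`, the spider could then yield `{"target_url": response.url}` when `is_listing_page` is true and yield `scrapy.Request(link, callback=self.parse)` for each entry in `category_links`, so that only pages that actually list products end up in the output.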
