How to paginate with Scrapy and visit all links on each page

I have the following spider, in which I try to combine pagination with rules to visit the links on each page.

import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class Paging(CrawlSpider):
    name = "paging"
    start_urls = ['https://ausschreibungen-deutschland.de/1/']

    # Visit all 10 links (it recognizes all 10 sublinks starting with a number and a _)
    rules = (
        Rule(LinkExtractor(allow=r"/[0-9]+_"), callback='parse', follow=True),
    )

    def parse(self, response):
        
        # just get all the text 
        all_text = response.xpath("//text()").getall()

        yield {
            "text": " ".join(all_text),
            "url": response.url
        }
        
        # visit next page 
        # next_page_url = response.xpath('//a[@]').extract_first()

        # if next_page_url is not None:
            # yield scrapy.Request(response.urljoin(next_page_url))

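I would like to implement the following behavior:

Start at page 1, https://ausschreibungen-deutschland.de/1/, visit all 10 links and get the text. (Already implemented.)

Go to page 2, https://ausschreibungen-deutschland.de/2/, visit all 10 links and get the text.

Go to page 3, https://ausschreibungen-deutschland.de/3/, visit all 10 links and get the text.

Go to page 4...

How would I combine these two concepts?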

Answer from a uj5u.com user:

I handled the pagination in start_urls; you can widen or narrow the page range as needed.

import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class Paging(CrawlSpider):
    name = "paging"
    start_urls = ['https://ausschreibungen-deutschland.de/' + str(x) + '/' for x in range(1, 11)]

    # Visit all 10 links (it recognizes all 10 sublinks starting with a number and a _)
    rules = (
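        # The Scrapy docs advise against naming a CrawlSpider rule callback "parse",
        # so the callback is called parse_item here.
        Rule(LinkExtractor(allow=r"/[0-9]+_"), callback='parse_item', follow=False),
    )

    def parse_item(self, response):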
        
        # just get all the text 
        #all_text = response.xpath("//text()").getall()

        yield {
            #"text": " ".join(all_text),
            # NOTE: the attribute test inside [@] was lost in the source; restore it before running.
            'title': response.xpath('//*[@]/h2//text()').get(),
            "url": response.url
        }
