Scrapy: crawling URLs in order and duplicated output

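This spider more or less works and gives me a response, but I have a few problems. The first is the order in which pages are crawled. I want it to start at page 1 and continue up to the range I set, but at the moment it seems to run in random order and also repeats pages. The second is the output: it is either all duplicated, contains null values, or is out of order. I don't know whether the problem is in the rules or in the spider itself.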

import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

            
class QuotesSpider(CrawlSpider):
    name = "catspider"
    start_urls = []
    for i in range(1,10):
        if i % 2 == 1:
            start_urls.append('https://www.worldcat.org/title/rose-in-bloom/oclc/' + str(i) + '&referer=brief_results')
            

    rules = (
        Rule(LinkExtractor(allow='title')),
        Rule(LinkExtractor(allow='oclc'), callback='parse_item')
    )


    def parse_item(self, response):
        yield {
            'title': response.css('h1.title::text').get(),
            'author': response.css('td[id="bib-author-cell"] a::text').getall(),
            'publisher': response.css('td[id="bib-publisher-cell"]::text').get(),
            'format': response.css('span[id="editionFormatType"] span::text').get(),
            'isbn': response.css('tr[id="details-standardno"] td::text').get(),
            'oclc':  response.css('tr[id="details-oclcno"] td::text').get()
        }

Extra info: from someone who has more experience with Scrapy, which is better and why, XPath or CSS selectors?

Thanks for any information.

Reply from a uj5u.com user:

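You can paginate directly in start_urls with a for loop over range; this kind of pagination is roughly twice as fast as the other approaches. If every item contains a link, using XPath in the rule is one of the best ways to follow it.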
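Listing the page URLs in start_urls fixes which pages are requested, not the order in which their items come back: Scrapy issues requests concurrently, so items generally arrive in completion order. One way to get ordered, duplicate-free output is to clean the items after the crawl. The following is a minimal sketch, assuming the items are dicts shaped like the parse_item output and that the 'oclc' field is a numeric string; dedupe_and_sort is a hypothetical helper, not part of Scrapy.

```python
def dedupe_and_sort(items, key="oclc"):
    """Drop items whose key is missing or repeated, then sort numerically."""
    seen = set()
    unique = []
    for item in items:
        value = item.get(key)
        if value is None:
            continue  # skip records where the selector matched nothing
        value = value.strip()
        if not value.isdigit() or value in seen:
            continue  # skip malformed or already-seen identifiers
        seen.add(value)
        unique.append({**item, key: value})
    return sorted(unique, key=lambda it: int(it[key]))


# Made-up sample data standing in for scraped items.
scraped = [
    {"title": "C", "oclc": " 30 "},
    {"title": "A", "oclc": "4"},
    {"title": "C again", "oclc": "30"},
    {"title": None, "oclc": None},
]
cleaned = dedupe_and_sort(scraped)
```

The first occurrence of each identifier wins, so later duplicates and empty records are discarded before sorting.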

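Extra info: from someone who has more experience with Scrapy, which is better and why, XPath or CSS selectors?

Regarding your "Extra info" comment: XPath and CSS locators are both good, but XPath is a little richer because it can easily move up and down the HTML tree. You can also mix the two and apply XPath and CSS together. Here is a working example.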

import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor 
from scrapy.crawler import CrawlerProcess   
        
class QuotesSpider(CrawlSpider):
    name = "catspider"
    # str(i) + '1' yields start=11, 21, ..., 101 for i in 1..10
    start_urls = ['https://www.worldcat.org/search?q=oclc&fq=&dblist=638&start=' + str(i) + '1&qt=page_number_link' for i in range(1,11)]

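    # NOTE: the attribute test inside [@...] is missing in the source text;
    # supply the predicate that matches the result links before running.
    rules = (Rule(LinkExtractor(restrict_xpaths='//*[@]/a'), callback='parse_item', follow=True),)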

    def parse_item(self, response):
        yield {
            'title' : response.css('h1.title::text').get(),
            'author' : response.css('td[id="bib-author-cell"] a::text').getall(),
            'publisher' : response.css('td[id="bib-publisher-cell"]::text').get(),
            'format' : response.css('span[id="editionFormatType"] span::text').get(),
            'isbn' : response.css('tr[id="details-standardno"] td::text').get(),
            'oclc' :  response.css('tr[id="details-oclcno"] td::text').get()
            }

process = CrawlerProcess()
process.crawl(QuotesSpider)
process.start()

