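I have something like the code below. I know that in this example you could navigate straight to the "yourself" tag page, but in my application I have to visit page 1 to get the link to page 2, then page 2 to get the link to page 3, and so on (i.e. the URLs don't follow a particular pattern).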
import scrapy


class SampleSpider(scrapy.Spider):
    name = "sample"
    start_urls = [
        "https://quotes.toscrape.com/",
    ]

    def parse(self, response):
        # Step 1: on the start page, find links whose href contains "inspirational"
        links = response.css(
            'a[href*=inspirational]::attr(href)'
        ).extract()
        for link in links:
            yield response.follow(link, self.parse_inspirational)

    def parse_inspirational(self, response):
        # Step 2: on that page, find links whose href contains "life"
        links = response.css('a[href*=life]::attr(href)').extract()
        for link in links:
            yield response.follow(link, self.parse_life)

    def parse_life(self, response):
        # Step 3: on that page, find links whose href contains "yourself"
        links = response.css('a[href*=yourself]::attr(href)').extract()
        for link in links:
            yield response.follow(link, self.parse_yourself)

    def parse_yourself(self, response):
        # Final page: print the quote texts
        for resp in response.css('span[itemprop="text"]::text').extract():
            print(resp)
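Because the same pattern of following a link and then looking for a new CSS pattern is repeated three times, I'd like to write a single function that iterates over a list of CSS strings and yields the responses recursively. This is what I came up with, but it doesn't work. I expected it to print the same output as the original, long version of the code: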
    def parse_recurse(self, response, css_str=None):
        links = response.css(css_str.pop(0)).extract()
        for link in links:
            yield response.follow(link, callback=self.parse_recurse, cb_kwargs={"css_str": css_str})

    def parse(self, response):
        css = ['a[href*=inspirational]::attr(href)',
               'a[href*=life]::attr(href)',
               'a[href*=yourself]::attr(href)']
        response = self.parse_recurse(response, css_str=css)
        for resp in response.css('span[itemprop="text"]::text').extract():
            print(resp)
Answer from a uj5u.com user:
You can't do response = self.parse_recurse(...), because parse_recurse only yields requests; it does not return a response.
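This can be checked without Scrapy at all: any Python function that contains yield returns a generator object when it is called. The sketch below uses a hypothetical stand-in named parse_recurse_like, with plain strings in place of requests, only to illustrate why the value assigned to response in the question has no css method.

```python
import inspect


def parse_recurse_like(links):
    """Hypothetical stand-in for a Scrapy callback; yields strings instead of requests."""
    for link in links:
        yield f"request for {link}"


result = parse_recurse_like(["/tag/life/", "/tag/yourself/"])

is_generator = inspect.isgenerator(result)  # calling the function only builds a generator
has_css = hasattr(result, "css")            # a generator has no .css() method
produced = list(result)                     # iterating is what actually runs the body

print(is_generator, has_css)
print(produced)
```

So the line response.css(...) in the question's parse would fail with an AttributeError, and the requests themselves would never be scheduled, because nothing yields them back to Scrapy.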
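Normally a callback yields a request, Scrapy catches it and passes it to the engine, which later sends the request to the server, gets the response back from the server, and executes the request's callback with that response.
See the documentation for details: Architecture overview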
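The flow described above can be sketched as a toy loop in plain Python. This is not how Scrapy is implemented; Request, Response, FAKE_SITE, parse_page and run_engine are all hypothetical names invented for this illustration. It only shows the idea that a callback yields requests and something else (the "engine") fetches them and calls the callback later.

```python
from collections import deque

# Hypothetical in-memory "site": URL -> (links on the page, text on the page)
FAKE_SITE = {
    "/": (["/a"], ""),
    "/a": (["/b"], ""),
    "/b": ([], "final text"),
}


class Request:
    """Minimal stand-in for a request: a URL plus the callback to run later."""

    def __init__(self, url, callback):
        self.url = url
        self.callback = callback


class Response:
    """Minimal stand-in for a response: the links and text stored for a URL."""

    def __init__(self, url):
        self.url = url
        self.links, self.text = FAKE_SITE[url]


collected = []


def parse_page(response):
    """Callback: record text on pages without links, otherwise yield new requests."""
    if not response.links:
        collected.append(response.text)
    for link in response.links:
        yield Request(link, parse_page)


def run_engine(start_url, callback):
    """Toy engine: take a request, 'download' it, run its callback, queue what it yields."""
    queue = deque([Request(start_url, callback)])
    while queue:
        request = queue.popleft()
        response = Response(request.url)
        for new_request in request.callback(response):
            queue.append(new_request)


run_engine("/", parse_page)
print(collected)
```

The callback never sees the response for a request it yields; that response is delivered later, to whichever callback was attached to the request.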
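You would have to use start_requests to run parse_recurse with the full list of selectors (called road in the code below), and parse_recurse should check whether any selectors remain after the current one. If some remain, it yields requests with parse_recurse as the callback and the shorter list (so the recursion continues). If none remain, it yields requests with parse as the callback, which extracts the text.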
import scrapy


class SampleSpider(scrapy.Spider):
    name = "sample"
    start_urls = ["https://quotes.toscrape.com/"]

    # Selectors to follow, in order; each one is applied to the page reached by the previous one
    road = [
        'a[href*=inspirational]::attr(href)',
        'a[href*=life]::attr(href)',
        'a[href*=yourself]::attr(href)',
    ]

    def start_requests(self):
        """Run starting URL with full road."""
        for url in self.start_urls:
            yield scrapy.Request(url, callback=self.parse_recurse, cb_kwargs={"road": self.road})

    def parse_recurse(self, response, road):
        """If road is not empty then send to parse_recurse with smaller road.
        If road is empty then send to parse."""
        first = road[0]
        rest = road[1:]  # a new list, so the original road is never modified
        links = response.css(first).extract()
        if rest:
            # repeat recursion
            for link in links:
                yield response.follow(link, callback=self.parse_recurse, cb_kwargs={"road": rest})
        else:
            # exit recursion
            for link in links:
                yield response.follow(link, callback=self.parse)

    def parse(self, response):
        # Final page: print the quote texts
        for resp in response.css('span[itemprop="text"]::text').extract():
            print(resp)
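# --- run without project and save in `output.csv` ---
# Note: the feed only receives items that the spider yields; parse() above
# only prints, so the quotes appear on the console rather than in the file.
from scrapy.crawler import CrawlerProcess

c = CrawlerProcess({
    'USER_AGENT': 'Mozilla/5.0',
    # save in file CSV, JSON or XML
    'FEEDS': {'output.csv': {'format': 'csv'}},  # new in 2.1
})
c.crawl(SampleSpider)
c.start()  # blocks until the crawl has finished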
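One detail of the spider above is worth spelling out: it passes road[1:] to each new request, whereas the attempt in the question called css_str.pop(0). As far as I can tell, the list placed in cb_kwargs is the same object for every request created in that loop, so a pop in one callback would shorten the list that its sibling callbacks receive. The sketch below shows the difference in plain Python; callback_with_pop and callback_with_slice are hypothetical helpers, and the selector strings are placeholders.

```python
def callback_with_pop(selectors):
    """Mutates the caller's list, like css_str.pop(0) in the question."""
    return selectors.pop(0)


def callback_with_slice(selectors):
    """Leaves the caller's list untouched, like road[0] and road[1:] in the answer."""
    first, rest = selectors[0], selectors[1:]
    return first, rest


# Two sibling pages at the same depth should both use "sel_1".
shared = ["sel_1", "sel_2", "sel_3"]
popped = [callback_with_pop(shared), callback_with_pop(shared)]

road = ["sel_1", "sel_2", "sel_3"]
sliced = [callback_with_slice(road)[0], callback_with_slice(road)[0]]

print(popped, shared)  # the second sibling got the wrong selector
print(sliced, road)    # both siblings got the same selector; road is intact
```

With pop, a third sibling at the same depth would already be working with the last selector, and a fourth would raise an IndexError on the empty list. Slicing avoids this because every level gets its own shorter copy.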
