I have the following code and want to follow the pagination to the next page of the site:
import scrapy


class ZoosSpider(scrapy.Spider):
    name = 'zoos'
    allowed_domains = ['https://www.tripadvisor.co.uk']
    start_urls = ['https://www.tripadvisor.co.uk/Attractions-g186216-Activities-c48-a_allAttractions.true-United_Kingdom.html']

    def parse(self, response):
        # Each card section on the listing page holds one attraction.
        tmpSEC = response.xpath("//section[@data-automation='AppPresentation_SingleFlexCardSection']")
        for elem in tmpSEC:
            yield {
                "link": response.urljoin(elem.xpath(".//a/@href").get())
            }

        # Follow the "Next page" link, if there is one.
        nextPage = response.xpath("//a[@aria-label='Next page']/@href").get()
        if nextPage is not None:
            nextPage = response.urljoin(nextPage)
            yield scrapy.Request(nextPage, callback=self.parse)
But when I run this code, only the first page is scraped, and I get this log message:
2021-11-17 12:52:20 [scrapy.spidermiddlewares.offsite] DEBUG: Filtered offsite request to 'www.tripadvisor.co.uk': <GET https://www.tripadvisor.co.uk/ClientLink?value=NVB5Xy9BdHRyYWN0aW9ucy1nMTg2MjE2LUFjdGl2aXRpZXMtYzQ4LWFfYWxsQXR0cmFjdGlvbnMudHJ1ZS1vYTMwLVVuaXRlZF9LaW5nZG9tLmh0bWxfQ3Yx>
I only get all the results when I remove this line:
allowed_domains = ['https://www.tripadvisor.co.uk']
Why is that? Doesn't the link to the next page belong to the allowed domain?
Reply from a uj5u.com user:
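allowed_domains is optional for a spider, and leaving it out avoids this kind of mistake. So you can either remove allowed_domains entirely, or keep it and drop the https:// prefix: according to the Scrapy documentation, the entries in allowed_domains are domain names such as www.tripadvisor.co.uk, not URLs. The https:// part is why your next-page request is filtered as offsite.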
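To see why the prefix matters, here is a minimal sketch of a host-based offsite check. It is a simplified approximation for illustration only; host_allowed is a hypothetical helper, and Scrapy's actual offsite middleware is implemented differently. The idea is that the request's hostname is compared against each entry, so an entry that still contains https:// can never match.

```python
from urllib.parse import urlparse


def host_allowed(url, allowed_domains):
    """Simplified offsite check (hypothetical helper, not Scrapy's own code)."""
    host = urlparse(url).hostname or ""
    for domain in allowed_domains:
        # A host passes if it equals the entry or is a subdomain of it.
        if host == domain or host.endswith("." + domain):
            return True
    return False


# The start URL from the question, used here as a sample request URL.
url = "https://www.tripadvisor.co.uk/Attractions-g186216-Activities-c48-a_allAttractions.true-United_Kingdom.html"

print(host_allowed(url, ["https://www.tripadvisor.co.uk"]))  # entry is a URL
print(host_allowed(url, ["www.tripadvisor.co.uk"]))          # entry is a bare domain
print(host_allowed(url, ["tripadvisor.co.uk"]))              # parent domain
```

Under these assumptions, only the first call reports the request as offsite; the hostname is always www.tripadvisor.co.uk and never starts with a scheme.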
The correct way is:
allowed_domains = ['www.tripadvisor.co.uk']
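If you would rather not type the domain by hand, one option is to derive it from the start URLs with the standard library. This is only a sketch; to_domain is a hypothetical helper name, not part of Scrapy.

```python
from urllib.parse import urlparse


def to_domain(entry):
    """Return the bare hostname of a URL; leave plain domain names unchanged."""
    if "://" in entry:
        return urlparse(entry).hostname
    return entry


start_urls = [
    "https://www.tripadvisor.co.uk/Attractions-g186216-Activities-c48-a_allAttractions.true-United_Kingdom.html"
]

# Build allowed_domains from the start URLs, without scheme or path.
allowed_domains = sorted({to_domain(u) for u in start_urls})
print(allowed_domains)
```

The resulting list can be assigned to the spider's allowed_domains attribute in place of the hard-coded value.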
