I am trying to scrape the following site with Scrapy and am experimenting in the scrapy shell -
This is the basic spider:
import scrapy


class ZoosSpider(scrapy.Spider):
    name = 'zoos'
    # allowed_domains expects bare domain names, not full URLs
    allowed_domains = ['tripadvisor.co.uk']
    start_urls = ['https://www.tripadvisor.co.uk/Attractions-g186216-Activities-c48-a_allAttractions.true-United_Kingdom.html/']

    def parse(self, response):
        tmpSEC = response.xpath("//section[@data-automation='AppPresentation_SingleFlexCardSection']")
        for elem in tmpSEC:
            pass
This XPath gives me all the relevant sections (len(tmpSEC) returns 30, which looks right to me):
tmpSEC = response.xpath("//section[@data-automation='AppPresentation_SingleFlexCardSection']")
Now I want to extract the first href from the first section and tried this XPath, but I only get "/" as the result:
>>> tmpSEC[0].xpath("//a/@href").get()
'/'
And also:
>>> tmpSEC[0].xpath("(//a)[1]/@href").get()
'/'
It only works as expected with the CSS selector:
>>> tmpSEC[0].css("a::attr(href)").get()
'/Attraction_Review-g186332-d216481-Reviews-Blackpool_Zoo-Blackpool_Lancashire_England.html'
Why does this work only with the CSS selector and not with the XPath selector?
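Answer from a uj5u.com user:
Here is a working solution using XPath. An expression that starts with // is evaluated from the document root even when it is called on a nested selector, so //a returns the first link on the whole page (the "/" home link) rather than the first link inside the section. You need to add a leading dot (.) to make the expression relative to the section, as shown below: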
import scrapy


class ZoosSpider(scrapy.Spider):
    name = 'zoos'
    start_urls = ['https://www.tripadvisor.co.uk/Attractions-g186216-Activities-c48-a_allAttractions.true-United_Kingdom.html/']

    def parse(self, response):
        tmpSEC = response.xpath(
            "//section[@data-automation='AppPresentation_SingleFlexCardSection']")
        # for elem in tmpSEC:
        yield {
            # the leading dot makes the XPath relative to tmpSEC[0]
            'link': tmpSEC[0].xpath(".//a/@href").get()
        }
Output:
{'link': '/Attraction_Review-g186332-d216481-Reviews-Blackpool_Zoo-Blackpool_Lancashire_England.html'}
