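I am crawling with Scrapy using a link extractor. I am using what I believe is the correct XPath expression in the link extractor, but I don't understand why the spider runs indefinitely and prints some kind of source code instead of the restaurants' names and addresses. I suspect there is a mistake in my restrict_xpaths expression, but I can't figure out what it is.
Code: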
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

class TripadSpider(CrawlSpider):
    name = 'tripad'
    allowed_domains = ['www.tripadvisor.in']
    start_urls = ['https://www.tripadvisor.in/Restaurants-g304551-New_Delhi_National_Capital_Territory_of_Delhi.html']
    rules = (
        Rule(LinkExtractor(restrict_xpaths='//div[@]//a'), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        yield {
            'title': response.xpath('//h1[@]/text()').get(),
            'Address': response.xpath('(//a[@])[2]').get()
        }
Answer from a uj5u.com community member:
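The spider is crawling; try changing your user agent (the USER_AGENT setting). You also forgot to add /text() to the Address XPath. Without it, .get() returns the whole <a> element as markup, which is the "source code" you are seeing in place of the address.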
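As a minimal sketch of the user agent change, the USER_AGENT setting can be given a browser-like value in the project's settings.py. The specific string below is only an illustrative assumption, not something taken from the answer; any current browser user agent could be substituted.

```python
# settings.py (project-level Scrapy settings)
# The exact string is an assumption for illustration; any current browser
# user agent string could be used instead.
USER_AGENT = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
    "AppleWebKit/537.36 (KHTML, like Gecko) "
    "Chrome/96.0.4664.110 Safari/537.36"
)
```

The same value can also be set for a single spider through its custom_settings class attribute, for example custom_settings = {'USER_AGENT': '...'}.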
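Revised code:
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

class TripadSpider(CrawlSpider):
    name = 'tripad'
    # The bare domain also allows requests to its subdomains, such as www.
    allowed_domains = ['tripadvisor.in']
    start_urls = ['https://www.tripadvisor.in/Restaurants-g304551-New_Delhi_National_Capital_Territory_of_Delhi.html']
    rules = (
        # Restaurant links: parse each page; with a callback set, follow defaults to False.
        Rule(LinkExtractor(restrict_xpaths='//div[@]//a'), callback='parse_item'),
        # No callback, so this rule only follows the "next" link to the next listing page.
        Rule(LinkExtractor(restrict_xpaths='//a[contains(@class, "next")]')),  # pagination
    )

    def parse_item(self, response):
        yield {
            'title': response.xpath('//h1[@]/text()').get(),
            # /text() selects the link's text node instead of the whole <a> element.
            'Address': response.xpath('(//a[@])[2]/text()').get()
        }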
Output:
{'title': 'Mosaic', 'Address': 'Sector 10 Lobby Level Crowne Plaza Twin District Centre, Rohini, New Delhi 110085 India'}
{'title': 'Spring', 'Address': 'Plot 4, Dwarka City Centre Radisson Blu, Sector 13, New Delhi 110075 India'}
{'title': 'Dilli 32', 'Address': 'Maharaja Surajmal Road The Leela Ambience Convention Hotel, Near Yamuna Sports Complex, Vivek Vihar, New Delhi 110002 India'}
{'title': 'Viva - All Day Dining', 'Address': 'Hospitality District Asset Area 12 Gurgoan sector 28, New Delhi 110037 India'}
...
...
...
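To illustrate why the missing /text() produced markup rather than an address, here is a small sketch using only the standard library's xml.etree.ElementTree as a stand-in for Scrapy's selectors (an assumption made so the example runs without Scrapy installed). The HTML snippet and its href value are hypothetical, not taken from the real page.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup standing in for an address link on a restaurant page.
snippet = '<div><a href="#MAPVIEW">Sector 13, New Delhi 110075 India</a></div>'

root = ET.fromstring(snippet)
link = root.find('.//a')

# Serialising the element gives its full markup, comparable to calling
# .get() on an XPath that selects the <a> element itself.
as_markup = ET.tostring(link, encoding='unicode')

# Reading the text node gives only the visible string, comparable to
# appending /text() to the XPath.
as_text = link.text

print(as_markup)
print(as_text)
```

In Scrapy itself the equivalent distinction is between response.xpath('...').get() on an element and response.xpath('.../text()').get() on its text node.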
