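Starting from this webpage, I am trying to collect the links to the pages where the individual products are listed. There are 6 categories with a More info button, and when I traverse them recursively I usually reach the target pages. This is one such product listing page that I would like to get.
Note that some of these pages contain both a product list and More info buttons, which is why I have failed to capture the product listing pages accurately.
My current spider looks like this (it fails to scrape many of the product listing pages):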
import scrapy


class NorgrenSpider(scrapy.Spider):
    name = 'norgren'
    start_urls = ['https://www.norgren.com/de/en/list']

    def start_requests(self):
        for start_url in self.start_urls:
            yield scrapy.Request(start_url, callback=self.parse)

    def parse(self, response):
        link_list = []
        for item in response.css(".match-height a.more-info::attr(href)").getall():
            if not "/detail/" in item:
                inner_page_link = response.urljoin(item)
                link_list.append(inner_page_link)
                yield {"target_url":inner_page_link}

        for new_link in link_list:
            yield scrapy.Request(new_link, callback=self.parse)
Expected output (a random sample):
https://www.norgren.com/de/en/list/directional-control-valves/in-line-and-manifold-valves
https://www.norgren.com/de/en/list/pressure-switches/electro-mechanical-pressure-switches
https://www.norgren.com/de/en/list/pressure-switches/electronic-pressure-switches
https://www.norgren.com/de/en/list/directional-control-valves/sub-base-valves
https://www.norgren.com/de/en/list/directional-control-valves/non-return-valves
https://www.norgren.com/de/en/list/directional-control-valves/valve-islands
https://www.norgren.com/de/en/list/air-preparation/combination-units-frl
How can I get all the product listing pages from the six categories?
Answer from a uj5u.com user:
import scrapy


class NorgrenSpider(scrapy.Spider):
    name = 'norgren'
    start_urls = ['https://www.norgren.com/de/en/list']

    def start_requests(self):
        for start_url in self.start_urls:
            yield scrapy.Request(start_url)

    def parse(self, response):
        # check if there are items in the page
        if response.xpath('//div[contains(@class, "item-list")]//div[@]/div[@]/a/@href'):
            yield scrapy.Request(url=response.url, callback=self.get_links, dont_filter=True)

        # follow "more info" buttons
        for url in response.xpath('//a[text()="More info"]/@href').getall():
            yield response.follow(url)

    def get_links(self, response):
        yield {"target_url": response.url}

        next_page = response.xpath('//a[@]/@href').get()
        if next_page:
            yield response.follow(url=next_page, callback=self.get_links)
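
The traversal idea in this spider (record a page when it shows products, and keep following the More info links either way) can be tried out without Scrapy or network access. The following is only a rough sketch under stated assumptions: `SITE` and its paths are made up for illustration, `crawl` is a hypothetical helper, and the `/detail/` marker for product links is borrowed from the question's spider rather than from the XPath above.

```python
from collections import deque

# Hypothetical in-memory "site": page URL -> hrefs of its "More info" buttons.
# The paths are invented for illustration; "/detail/" marks a product page,
# following the convention used in the question's spider.
SITE = {
    "/list": ["/list/valves", "/list/switches"],
    "/list/valves": ["/list/valves/in-line", "/detail/v-100"],  # mixed page
    "/list/valves/in-line": ["/detail/v-200", "/detail/v-201"],
    "/list/switches": ["/detail/s-1"],
}


def crawl(start, site):
    """Return every page that links to at least one product, in visit order."""
    listing_pages = []
    seen = {start}
    queue = deque([start])
    while queue:
        page = queue.popleft()
        hrefs = site.get(page, [])
        # Record the page if it shows products...
        if any("/detail/" in href for href in hrefs):
            listing_pages.append(page)
        # ...and still follow its category links, so mixed pages are not dead ends.
        for href in hrefs:
            if "/detail/" not in href and href not in seen:
                seen.add(href)
                queue.append(href)
    return listing_pages


print(crawl("/list", SITE))
```

The mixed page `/list/valves` is both recorded and followed, which is the case the question says it was missing. The `seen` set plays the role that Scrapy's built-in duplicate filter plays in a real spider.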
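Answer from a uj5u.com user:
Maybe keep only the pages that have at least one detail link? Here is an example of how to determine whether a page matches the criteria you are searching for: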
import scrapy


class NorgrenSpider(scrapy.Spider):
    name = 'norgren'
    start_urls = ['https://www.norgren.com/de/en/list']

    def start_requests(self):
        for start_url in self.start_urls:
            yield scrapy.Request(start_url, callback=self.parse)

    def parse(self, response):
        link_list = []
        more_info_items = response.css(
            ".match-height a.more-info::attr(href)").getall()
        detail_items = [item for item in more_info_items if '/detail/' in item]
        if len(detail_items) > 0:
            print(f'This is a link you are searching for: {response.url}')
        for item in more_info_items:
            if not "/detail/" in item:
                inner_page_link = response.urljoin(item)
                link_list.append(inner_page_link)
                yield {"target_url": inner_page_link}
        for new_link in link_list:
            yield scrapy.Request(new_link, callback=self.parse)
I only printed the link to the console, but you can work out how to record it wherever you need it.
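
One possible way to turn the printed message into a recorded result is to separate the decision from the spider: given a page URL and its More info hrefs, return an item for the page itself when it links to products, plus the category links still worth following. This is a minimal sketch in plain Python, not the answerer's code; `classify_page` is a hypothetical helper and the example hrefs are made up.

```python
from urllib.parse import urljoin


def classify_page(page_url, hrefs):
    """Split a page's "More info" hrefs into an item to record and links to follow.

    Returns (item, follow): item is {"target_url": page_url} when the page
    links to at least one product detail page, otherwise None; follow is the
    list of absolute category URLs still worth crawling.
    """
    has_products = any("/detail/" in href for href in hrefs)
    item = {"target_url": page_url} if has_products else None
    follow = [urljoin(page_url, href) for href in hrefs if "/detail/" not in href]
    return item, follow


# Hypothetical hrefs for a page that mixes products and sub-categories.
page = "https://www.norgren.com/de/en/list/pressure-switches"
item, follow = classify_page(
    page,
    [
        "/de/en/detail/example-product",
        "/de/en/list/pressure-switches/electronic-pressure-switches",
    ],
)
print(item)
print(follow)
```

Inside a `parse` callback, yielding `item` when it is not `None` (instead of printing) would hand it to Scrapy's feed export, so running the spider with an output option such as `-o links.json` should write the listing pages to a file.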
Please credit the source when reposting. Link to this article: https://www.uj5u.com/caozuo/407911.html
