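I'm new to Scrapy and I'm trying to scrape opensports.com.ar. I need some data from all of the products, so my idea is to get all the brands (if I get all the brands, I get all the products). Each brand URL has several pages (24 articles per page), so I need to work out the total number of pages for each brand and then request the links from page 1 to the total number of pages. I'm facing one (or more!) problems with the hrefs... Here is the script: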
import scrapy
from scrapy import Request
from scrapy.crawler import CrawlerProcess
from datetime import datetime
import datetime

#start_url: https://www.opensports.com.ar/marcas.html

class SolodeportesSpider(scrapy.Spider):
    name = 'solodeportes'
    start_urls = ['https://www.opensports.com.ar/marcas.html']
    custom_settings = {'FEED_URI':'opensports_' f'{datetime.datetime.today().strftime("%d-%m-%Y-%H%M%S")}.csv', 'FEED_FORMAT': 'csv', }

    #get links of dif. brands
    def parse(self, response):
        marcas= response.css('#maincontent > div.category-view > div > div.brands-page > table > tbody td a::attr(href)').getall()
        for marca in marcas:
            yield Request(marca, self.parse_paginator)

    #get total number of pages of the brand And request all pages from 1 to total number of products
    def parse_paginator(self,response):
        total_products = int(int(response.css('#toolbar-amount > span:nth-child(3)::text').get() / 24) + 1)
        for count in range(1, total_products):
            yield Request(url=f'https://www.opensports.com.ar/{response.url}?p={count}',
                          callback=self.parse_listings)

    #Links list to click to get the articles detail
    def parse_listings(self, response):
        all_listings = response.css('a.product-item-link::attr(class)').getall()
        for url in all_listings:
            yield Request(url, self.detail_page)

    #url--Article-- Needed data
    def detail_page(self, response):
        yield {
            'Nombre_Articulo' :response.css('h1.page-title span::text').get(),
            'Precio_Articulo' : response.css('span.price::text').get(),
            'Sku_Articulo' : response.css('td[data-th="SKU"]::text').get() ,
            'Tipo_Producto': response.css('td[data-th="Disciplina"]::text').get() ,
            'Item_url': response.url
        }

process = CrawlerProcess()
process.crawl(SolodeportesSpider)
process.start()
I get this error message:
c:/Users/User/Desktop/Personal/DABRA/Scraper_opensports/opensports/opens_sp_copia_solod.py
2022-01-16 03:45:05 [scrapy.utils.log] INFO: Scrapy 2.5.1 started (bot: scrapybot)
2022-01-16 03:45:05 [scrapy.utils.log] INFO: Versions: lxml 4.7.1.0, libxml2 2.9.12, cssselect 1.1.0, parsel 1.6.0, w3lib 1.22.0, Twisted 21.7.0, Python 3.10.1 (tags/v3.10.1:2cd268a, Dec 6 2021, 19:10:37) [MSC v.1929 64 bit (AMD64)], pyOpenSSL 21.0.0 (OpenSSL 1.1.1m 14 Dec 2021), cryptography 36.0.1, Platform Windows-10-10.0.19042-SP0
2022-01-16 03:45:05 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.selectreactor.SelectReactor
2022-01-16 03:45:05 [scrapy.crawler] INFO: Overridden settings: {}
2022-01-16 03:45:05 [scrapy.extensions.telnet] INFO: Telnet Password: b362a63ff2281937
2022-01-16 03:45:05 [py.warnings] WARNING: C:\Users\User\Desktop\Personal\DABRA\Scraper_opensports\venv\lib\site-packages\scrapy\extensions\feedexport.py:247: ScrapyDeprecationWarning: The `FEED_URI` and `FEED_FORMAT` settings have been deprecated in favor of the `FEEDS` setting. Please see the `FEEDS` setting docs for more details
  exporter = cls(crawler)
2022-01-16 03:45:05 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.feedexport.FeedExporter',
 'scrapy.extensions.logstats.LogStats']
2022-01-16 03:45:05 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2022-01-16 03:45:05 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2022-01-16 03:45:05 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2022-01-16 03:45:05 [scrapy.core.engine] INFO: Spider opened
2022-01-16 03:45:05 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2022-01-16 03:45:05 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2022-01-16 03:45:07 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.opensports.com.ar/marcas.html> (referer: None)
2022-01-16 03:45:07 [scrapy.core.scraper] ERROR: Spider error processing <GET https://www.opensports.com.ar/marcas.html> (referer: None)
Traceback (most recent call last):
  File "C:\Users\User\Desktop\Personal\DABRA\Scraper_opensports\venv\lib\site-packages\scrapy\utils\defer.py", line 120, in iter_errback
    yield next(it)
  File "C:\Users\User\Desktop\Personal\DABRA\Scraper_opensports\venv\lib\site-packages\scrapy\utils\python.py", line 353, in __next__
    return next(self.data)
  File "C:\Users\User\Desktop\Personal\DABRA\Scraper_opensports\venv\lib\site-packages\scrapy\utils\python.py", line 353, in __next__
    return next(self.data)
  File "C:\Users\User\Desktop\Personal\DABRA\Scraper_opensports\venv\lib\site-packages\scrapy\core\spidermw.py", line 56, in _evaluate_iterable
    for r in iterable:
  File "C:\Users\User\Desktop\Personal\DABRA\Scraper_opensports\venv\lib\site-packages\scrapy\spidermiddlewares\offsite.py", line 29, in process_spider_output
    for x in result:
  File "C:\Users\User\Desktop\Personal\DABRA\Scraper_opensports\venv\lib\site-packages\scrapy\core\spidermw.py", line 56, in _evaluate_iterable
    for r in iterable:
  File "C:\Users\User\Desktop\Personal\DABRA\Scraper_opensports\venv\lib\site-packages\scrapy\spidermiddlewares\referer.py", line 342, in <genexpr>
    return (_set_referer(r) for r in result or ())
  File "C:\Users\User\Desktop\Personal\DABRA\Scraper_opensports\venv\lib\site-packages\scrapy\core\spidermw.py", line 56, in _evaluate_iterable
    for r in iterable:
  File "C:\Users\User\Desktop\Personal\DABRA\Scraper_opensports\venv\lib\site-packages\scrapy\spidermiddlewares\urllength.py", line 40, in <genexpr>
    return (r for r in result or () if _filter(r))
  File "C:\Users\User\Desktop\Personal\DABRA\Scraper_opensports\venv\lib\site-packages\scrapy\core\spidermw.py", line 56, in _evaluate_iterable
    for r in iterable:
  File "C:\Users\User\Desktop\Personal\DABRA\Scraper_opensports\venv\lib\site-packages\scrapy\spidermiddlewares\depth.py", line 58, in <genexpr>
    return (r for r in result or () if _filter(r))
  File "C:\Users\User\Desktop\Personal\DABRA\Scraper_opensports\venv\lib\site-packages\scrapy\core\spidermw.py", line 56, in _evaluate_iterable
    for r in iterable:
  File "c:\Users\User\Desktop\Personal\DABRA\Scraper_opensports\opensports\opens_sp_copia_solod.py", line 16, in parse
    yield Request(marca, self.parse_paginator)
  File "C:\Users\User\Desktop\Personal\DABRA\Scraper_opensports\venv\lib\site-packages\scrapy\http\request\__init__.py", line 25, in __init__
    self._set_url(url)
  File "C:\Users\User\Desktop\Personal\DABRA\Scraper_opensports\venv\lib\site-packages\scrapy\http\request\__init__.py", line 73, in _set_url
    raise ValueError(f'Missing scheme in request url: {self._url}')
ValueError: Missing scheme in request url: /marca/adidas.html
2022-01-16 03:45:07 [scrapy.core.engine] INFO: Closing spider (finished)
2022-01-16 03:45:07 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 232,
 'downloader/request_count': 1,
 'downloader/request_method_count/GET': 1,
 'downloader/response_bytes': 22711,
 'downloader/response_count': 1,
 'downloader/response_status_count/200': 1,
 'elapsed_time_seconds': 1.748282,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2022, 1, 16, 6, 45, 7, 151772),
 'httpcompression/response_bytes': 116063,
 'httpcompression/response_count': 1,
 'log_count/DEBUG': 1,
 'log_count/ERROR': 1,
 'log_count/INFO': 10,
 'log_count/WARNING': 1,
 'response_received_count': 1,
 'scheduler/dequeued': 1,
 'scheduler/dequeued/memory': 1,
 'scheduler/enqueued': 1,
 'scheduler/enqueued/memory': 1,
 'spider_exceptions/ValueError': 1,
 'start_time': datetime.datetime(2022, 1, 16, 6, 45, 5, 403490)}
To start with, I have a problem with my f-string URL... I don't know how to build the URL, because from:
marcas= response.css('#maincontent > div.category-view > div > div.brands-page > table > tbody td a::attr(href)').getall()
I get URLs of this kind (I don't know whether that is fine or whether I need the https:// part):
'/marca/adidas.html'
I know this is wrong, but I couldn't find a way to fix it... Can anyone give me a hand?
Thanks in advance!
Answer from a uj5u.com user:
For the relative URLs you can use response.follow, or, if you stay with Request, just prepend the base URL.
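As a rough illustration of what "prepend the base URL" amounts to, here is a minimal sketch using only the standard library. The helper name `absolute_url` is made up for this example; the two URLs are the ones from the question. In a spider, `response.urljoin(marca)` resolves the href against the response URL in a similar way, and `response.follow` accepts the relative href directly.

```python
from urllib.parse import urljoin

# Page the links were extracted from (the question's start URL).
page_url = "https://www.opensports.com.ar/marcas.html"

# Relative href as returned by the CSS selector in the question.
relative_href = "/marca/adidas.html"


def absolute_url(base: str, href: str) -> str:
    """Resolve href against base (hypothetical helper for illustration)."""
    return urljoin(base, href)


brand_url = absolute_url(page_url, relative_href)
print(brand_url)  # https://www.opensports.com.ar/marca/adidas.html
```

Joining instead of concatenating strings also leaves an href that is already absolute unchanged, so the same call works for both kinds of link.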
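You also have a few other errors:
1. The pagination does not always work.
2. In the function parse_listings you select the class attribute instead of href.
3. For some reason I get a 500 status for some of the URLs.
I have fixed errors #1 and #2; you will need to figure out how to fix error #3.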
import datetime

import scrapy
from scrapy import Request
from scrapy.crawler import CrawlerProcess

# start_url: https://www.opensports.com.ar/marcas.html


class SolodeportesSpider(scrapy.Spider):
    name = 'solodeportes'
    start_urls = ['https://www.opensports.com.ar/marcas.html']
    custom_settings = {
        'FEED_URI': 'opensports_' f'{datetime.datetime.today().strftime("%d-%m-%Y-%H%M%S")}.csv',
        'FEED_FORMAT': 'csv',
    }

    # Get the links of the different brands
    def parse(self, response):
        marcas = response.css('#maincontent > div.category-view > div > div.brands-page > table > tbody td a::attr(href)').getall()
        for marca in marcas:
            # response.follow resolves relative hrefs such as /marca/adidas.html
            yield response.follow(url=marca, callback=self.parse_paginator)

    # Parse the listings of the current page, then follow the "next" link while there is one
    def parse_paginator(self, response):
        # Same URL again, so dont_filter=True keeps the duplicate filter from dropping it
        yield scrapy.Request(url=response.url, callback=self.parse_listings, dont_filter=True)
        next_page = response.xpath('//a[contains(@class, "next")]/@href').get()
        if next_page:
            yield scrapy.Request(url=next_page, callback=self.parse_paginator)

    # List of links to follow to get the article details
    def parse_listings(self, response):
        all_listings = response.css('a.product-item-link::attr(href)').getall()
        for url in all_listings:
            yield Request(url, self.detail_page)

    # Article page: the data we need
    def detail_page(self, response):
        yield {
            'Nombre_Articulo': response.css('h1.page-title span::text').get(),
            'Precio_Articulo': response.css('span.price::text').get(),
            'Sku_Articulo': response.css('td[data-th="SKU"]::text').get(),
            'Tipo_Producto': response.css('td[data-th="Disciplina"]::text').get(),
            'Item_url': response.url
        }
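If you would rather keep the page-count approach from the question instead of following the "next" link, the arithmetic could look roughly like the sketch below. It assumes 24 products per page and the `?p=` query parameter, both taken from the question and not verified against the site; `page_count` and `page_urls` are hypothetical helper names. Note that the text returned by the selector has to be converted with `int()` before dividing, and that `response.url` is already absolute, so it must not be prefixed with the domain again.

```python
import math
from typing import List

PRODUCTS_PER_PAGE = 24  # page size stated in the question (assumption)


def page_count(total_products: int, per_page: int = PRODUCTS_PER_PAGE) -> int:
    """Number of listing pages needed to show total_products items."""
    return math.ceil(total_products / per_page)


def page_urls(brand_url: str, total_products: int) -> List[str]:
    """Build one URL per listing page, numbered from 1 to the last page."""
    last_page = page_count(total_products)
    # range() excludes its upper bound, hence last_page + 1
    return [f"{brand_url}?p={page}" for page in range(1, last_page + 1)]


urls = page_urls("https://www.opensports.com.ar/marca/adidas.html", 50)
for url in urls:
    print(url)
```

With 50 products this yields pages 1 to 3. The loop in the question, `range(1, int(n / 24) + 1)`, would stop at page 2 here, because it leaves out the last page whenever that page is only partly full.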
