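I am trying to use Scrapy to send requests based on the brand numbers in the URL, extract the IDs from the API response (which also provides the information for the next page), and then iterate over the following pages to collect the product IDs.
I can send the request, parse the product data and pass it on, but I am not sure how to define the function that gets the cursor for the next page.
Here is my code: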
class DepopItem(scrapy.Item):
    brands = Field(output_processor=TakeFirst())
    ID = Field(output_processor=TakeFirst())
    brand = Field(output_processor=TakeFirst())

class DepopSpider(scrapy.Spider):
    name = 'depop'
    start_urls = ['https://webapi.depop.com/api/v2/search/filters/aggregates/?brands=1596&itemsPerPage=24&country=gb&currency=GBP&sort=relevance']
    brands = [1596]
    custom_settings = {
        'USER_AGENT': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36'
    }

    def start_requests(self, cursor=''):
        for brand in self.brands:
            for item in self.create_product_request(brand):
                yield item
            yield scrapy.FormRequest(
                url='https://webapi.depop.com/api/v2/search/products/',
                method='GET',
                formdata={
                    'brands': str(brand),
                    'cursor': cursor,
                    'itemsPerPage': '24',
                    'country': 'gb',
                    'currency': 'GBP',
                    'sort': 'relevance'
                },
                cb_kwargs={'brand': brand}
            )

    def parse(self, response, brand):
        # load stuff
        for item in response.json().get('products'):
            loader = ItemLoader(DepopItem())
            loader.add_value('brand', brand)
            loader.add_value('ID', item.get('id'))
            yield loader.load_item()
        cursor = response.json()['meta'].get('cursor')
        if cursor:
            for item in self.create_product_request(brand, cursor):
                yield item

    def create_product_request(self, response):
        test = response.json()['meta'].get('cursor')
        yield test
I get the following error:
AttributeError: 'int' object has no attribute 'json'
Expected output:
{"brand": 1596, "ID": 273027529}
{"brand": 1596, "ID": 274115361}
{"brand": 1596, "ID": 270641301}
{"brand": 1596, "ID": 274505678}
{"brand": 1596, "ID": 262857014}
{"brand": 1596, "ID": 270088589}
{"brand": 1596, "ID": 208498028}
{"brand": 1596, "ID": 270426792}
{"brand": 1596, "ID": 274483351}
{"brand": 1596, "ID": 274109923}
{"brand": 1596, "ID": 273424157}
..
..
..
Answer:
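start_requests runs before any request has been made, so there is no response to read a cursor from at that point. The AttributeError comes from calling create_product_request(brand) with the integer brand ID, which then calls .json() on it.
You can handle the pagination recursively in the callback instead.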
import scrapy
from scrapy.loader import ItemLoader
from scrapy import Field
from scrapy.loader.processors import TakeFirst

class DepopItem(scrapy.Item):
    brands = Field(output_processor=TakeFirst())
    ID = Field(output_processor=TakeFirst())
    brand = Field(output_processor=TakeFirst())

class DepopSpider(scrapy.Spider):
    name = 'depop'
    start_urls = ['https://webapi.depop.com/api/v2/search/products/']
    brands = [1596]
    custom_settings = {
        'USER_AGENT': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36'
    }

    def parse(self, response):
        json_data = response.json()
        # pagination
        cursor = json_data['meta']['cursor']
        if json_data['meta']['hasMore']:
            yield scrapy.FormRequest(
                url='https://webapi.depop.com/api/v2/search/products/',
                method='GET',
                formdata={'cursor': cursor}
            )
        for brand in self.brands:
            yield scrapy.FormRequest(
                url='https://webapi.depop.com/api/v2/search/products/',
                method='GET',
                formdata={
                    'brands': str(brand),
                    'cursor': cursor,
                    'itemsPerPage': '24',
                    'country': 'gb',
                    'currency': 'GBP',
                    'sort': 'relevance'
                },
                cb_kwargs={'brand': brand},
                callback=self.parse_brand
            )

    def parse_brand(self, response, brand):
        # load stuff
        for item in response.json().get('products'):
            loader = ItemLoader(DepopItem())
            loader.add_value('brand', brand)
            loader.add_value('ID', item.get('id'))
            yield loader.load_item()
Output:
{'ID': 245137362, 'brand': 1596}
{'ID': 244263081, 'brand': 1596}
{'ID': 242128472, 'brand': 1596}
{'ID': 239929000, 'brand': 1596}
...
...
...
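If it helps to see the cursor-following idea on its own, here is a minimal sketch outside of Scrapy. It assumes the response carries `meta.cursor` and `meta.hasMore` as in the spider code, and that a cursor obtained from a brand-filtered page can be sent back with the same query parameters; I have not verified the latter against the API. The names `build_products_url`, `next_cursor`, `iter_product_ids` and the `fetch_json` parameter are hypothetical helpers, not part of Scrapy or Depop.

```python
from urllib.parse import urlencode

API_URL = 'https://webapi.depop.com/api/v2/search/products/'


def build_products_url(brand, cursor=''):
    """Build the search URL for one brand and one page (hypothetical helper)."""
    params = {
        'brands': str(brand),
        'itemsPerPage': '24',
        'country': 'gb',
        'currency': 'GBP',
        'sort': 'relevance',
    }
    if cursor:
        # Only send the cursor once the API has handed one back.
        params['cursor'] = cursor
    return API_URL + '?' + urlencode(params)


def next_cursor(payload):
    """Return the cursor for the next page, or None when there is no next page."""
    meta = payload.get('meta', {})
    if meta.get('hasMore') and meta.get('cursor'):
        return meta['cursor']
    return None


def iter_product_ids(fetch_json, brand):
    """Yield {'brand', 'ID'} dicts page by page, following the cursor.

    fetch_json is any callable that takes a URL and returns the decoded JSON
    (an assumption of this sketch, so the loop can be tried without a network).
    """
    cursor = ''
    while True:
        payload = fetch_json(build_products_url(brand, cursor))
        for product in payload.get('products', []):
            yield {'brand': brand, 'ID': product.get('id')}
        cursor = next_cursor(payload)
        if cursor is None:
            break
```

In a spider the loop would become the callback itself: after yielding the items of one page, the callback would yield a new request for `build_products_url(brand, cursor)` with `cb_kwargs={'brand': brand}` and the same callback, for as long as `next_cursor(response.json())` returns a value. That way the brand stays attached to every page.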
By the way, use rotating proxies or something similar, because I got blocked for 10 minutes for "too many requests".
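Besides proxies, slowing the crawl down may also reduce the chance of being blocked. The dict below is only a sketch using standard Scrapy settings; the name `POLITE_SETTINGS` is made up and every value is an illustrative guess, not something tested against this API.

```python
# Illustrative values only (assumptions); tune them for the target site.
POLITE_SETTINGS = {
    # Wait between consecutive requests to the same site (seconds).
    'DOWNLOAD_DELAY': 2,
    # Send one request at a time to the same domain.
    'CONCURRENT_REQUESTS_PER_DOMAIN': 1,
    # Let Scrapy adapt the delay to the observed response latency.
    'AUTOTHROTTLE_ENABLED': True,
    'AUTOTHROTTLE_START_DELAY': 2,
    'AUTOTHROTTLE_MAX_DELAY': 60,
    'AUTOTHROTTLE_TARGET_CONCURRENCY': 1.0,
}
```

These keys could be merged into the spider's existing `custom_settings` dict next to `USER_AGENT`.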
