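I'm trying to scrape the product links from the following category:
https://www.acihellas.gr/gaming-pontikia#/
It has 4 pages of products, but for some reason I only ever get the first one. This is my code: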
import time

import requests
from bs4 import BeautifulSoup

# The function header was not shown in the post; this name and signature are placeholders.
def get_product_links(url2get, page_number):
    urllist = []
    headers = {'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.143 Safari/537.36'}
    filterprods = '/#/pageSize=21&viewMode=grid&orderBy=10&pageNumber='
    for itm in range(1, page_number):
        print("Page", itm)
        # Build the URL for this page by appending the paging fragment
        urlget = str(url2get + filterprods + str(itm))
        time.sleep(2)
        ses = requests.Session()
        r = ses.get(urlget, headers=headers)
        if r.status_code == 200:
            Myhtml = r.text
            soup = BeautifulSoup(Myhtml, 'lxml')
            productlist = soup.find_all('div', attrs={'class': 'item-box'})
            for p_item in productlist:
                a = p_item.find('a')
                if a:
                    producttitle = a['title']
                    productlink = a['href']
                    url_item = 'https://acihellas.gr' + productlink
                    print(url_item)
                    urllist.append(url_item)
                    time.sleep(2)
                else:
                    pass
            ses.close()
        else:
            print(r.status_code)
    return urllist
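The URL is built correctly, but ses.get(url) doesn't return the page I asked for, so I wondered whether I should close the session after each request.
When I inspected the page for a link to the next page, there isn't one. That's why I build the URL with the filterprods variable.
How can we solve this?
Thank you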
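Reply from a uj5u.com user:
You didn't provide the website URL in your code. You can use the site's API to collect the products. Here is some starter code; I'll leave the parsing to you :)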
import requests

url = 'https://www.acihellas.gr/getFilteredProducts'

# Request pages 1 to 4 of category 828 from the product-filter endpoint
for pagenum in range(1, 5):
    payload = {
        "categoryId": "828",
        "manufacturerId": "0",
        "vendorId": "0",
        "priceRangeFilterModel7Spikes": "null",
        "specificationFiltersModel7Spikes": {
            "CategoryId": "828",
            "ManufacturerId": "0",
            "VendorId": "0",
            "SpecificationFilterGroups": [{
                "Id": 998,
                "FilterItems": [
                    {"Id": "25188", "FilterItemState": "Unchecked"},
                    {"Id": "18572", "FilterItemState": "Unchecked"},
                    {"Id": "7361", "FilterItemState": "Unchecked"},
                    {"Id": "7362", "FilterItemState": "Unchecked"},
                    {"Id": "7368", "FilterItemState": "Unchecked"},
                    {"Id": "18060", "FilterItemState": "Unchecked"},
                    {"Id": "19024", "FilterItemState": "Unchecked"},
                    {"Id": "24876", "FilterItemState": "Unchecked"},
                    {"Id": "28037", "FilterItemState": "Unchecked"},
                    {"Id": "23321", "FilterItemState": "Unchecked"}
                ]
            }, {
                "Id": 990,
                "FilterItems": [
                    {"Id": "7336", "FilterItemState": "Unchecked"}
                ]
            }, {
                "Id": 995,
                "FilterItems": [
                    {"Id": "7350", "FilterItemState": "Unchecked"},
                    {"Id": "7348", "FilterItemState": "Unchecked"},
                    {"Id": "7349", "FilterItemState": "Unchecked"}
                ]
            }]
        },
        "pageNumber": str(pagenum),
        "orderby": "10",
        "viewmode": "grid",
        "pagesize": "21",
        "queryString": "#/pageSize=21&viewMode=grid&orderBy=10&pageNumber=" + str(pagenum),
        "shouldNotStartFromFirstPage": "true",
        "keyword": "",
        "searchCategoryId": "0",
        "searchManufacturerId": "0",
        "searchVendorId": "0",
        "priceFrom": "",
        "priceTo": "",
        "includeSubcategories": "False",
        "searchInProductDescriptions": "False",
        "advancedSearch": "False",
        "isOnSearchPage": "False",
        "inStockFilterModel": {
            "CategoryId": "828",
            "ManufacturerId": "0",
            "VendorId": "0",
            "Id": "1",
            "FilterItemState": "Unchecked"
        }
    }
    # Send the payload as JSON and print the raw response for this page
    res = requests.post(url, json=payload)
    print(res.text)
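The starter code leaves the parsing step open. The sketch below is one possible way to do it, assuming the endpoint returns HTML with the same markup the question scrapes (a div with class item-box containing an a tag with the product link); that assumption has not been verified against the live site. ItemLinkParser and extract_links are illustrative names, and the sketch uses the standard library's html.parser instead of BeautifulSoup so that it runs without extra dependencies.

```python
from html.parser import HTMLParser


class ItemLinkParser(HTMLParser):
    """Collect the first <a href> that follows each <div class="item-box">."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._awaiting_link = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if tag == "div" and "item-box" in classes:
            # A new product box starts; the next link belongs to it.
            self._awaiting_link = True
        elif tag == "a" and self._awaiting_link and attrs.get("href"):
            self.links.append(attrs["href"])
            self._awaiting_link = False


def extract_links(html):
    """Return the product hrefs found in one page of response HTML."""
    parser = ItemLinkParser()
    parser.feed(html)
    return parser.links
```

Inside the loop, `extract_links(res.text)` would replace `print(res.text)`; the returned hrefs may still be relative.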
Note: the links are relative, so you need to prefix them with the site URL: https://www.acihellas.gr/
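A small sketch of that prefixing step, using urljoin from the standard library so that hrefs with or without a leading slash both resolve correctly. The product paths in the example call are made up for illustration, and absolute_links is a hypothetical helper name.

```python
from urllib.parse import urljoin

BASE_URL = "https://www.acihellas.gr/"


def absolute_links(hrefs, base=BASE_URL):
    """Resolve relative hrefs against the site root; absolute URLs pass through unchanged."""
    return [urljoin(base, href) for href in hrefs]


# Hypothetical relative links, as they might appear in the response HTML
print(absolute_links(["/some-product", "another-product"]))
```

Compared with plain string concatenation, urljoin avoids doubled or missing slashes between the host and the path.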
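Edit:
To answer the question of whether there is a way to change only the category in the payload: it looks like there is. I removed the entire part of the payload that is basically the product filters, and it still works: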
# Used inside the same `for pagenum in range(1, 5)` loop as before
payload = {
    "categoryId": "828",
    "manufacturerId": "0",
    "vendorId": "0",
    "priceRangeFilterModel7Spikes": "null",
    "pageNumber": str(pagenum),
    "orderby": "10",
    "viewmode": "grid",
    "pagesize": "21",
    "queryString": "#/pageSize=21&viewMode=grid&orderBy=10&pageNumber=" + str(pagenum),
    "shouldNotStartFromFirstPage": "true",
    "keyword": "",
    "searchCategoryId": "0",
    "searchManufacturerId": "0",
    "searchVendorId": "0",
    "priceFrom": "",
    "priceTo": "",
    "includeSubcategories": "False",
    "searchInProductDescriptions": "False",
    "advancedSearch": "False",
    "isOnSearchPage": "False",
    "inStockFilterModel": {
        "CategoryId": "828",
        "ManufacturerId": "0",
        "VendorId": "0",
        "Id": "1",
        "FilterItemState": "Unchecked"
    }
}
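If only the category and the page number change between requests, the reduced payload can be produced by a small function. This is a sketch under assumptions: build_payload is a hypothetical helper, the field names and values are copied from the payload in this answer, and only category 828 is reported to work here, so other category IDs are untested.

```python
def build_payload(category_id, pagenum, page_size=21, order_by=10):
    """Build the reduced request payload for one page of one category."""
    category_id = str(category_id)
    return {
        "categoryId": category_id,
        "manufacturerId": "0",
        "vendorId": "0",
        "priceRangeFilterModel7Spikes": "null",
        "pageNumber": str(pagenum),
        "orderby": str(order_by),
        "viewmode": "grid",
        "pagesize": str(page_size),
        "queryString": f"#/pageSize={page_size}&viewMode=grid"
                       f"&orderBy={order_by}&pageNumber={pagenum}",
        "shouldNotStartFromFirstPage": "true",
        "keyword": "",
        "searchCategoryId": "0",
        "searchManufacturerId": "0",
        "searchVendorId": "0",
        "priceFrom": "",
        "priceTo": "",
        "includeSubcategories": "False",
        "searchInProductDescriptions": "False",
        "advancedSearch": "False",
        "isOnSearchPage": "False",
        "inStockFilterModel": {
            "CategoryId": category_id,
            "ManufacturerId": "0",
            "VendorId": "0",
            "Id": "1",
            "FilterItemState": "Unchecked",
        },
    }
```

In the request loop this would be used as `requests.post(url, json=build_payload(828, pagenum))`.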
Reply from a uj5u.com user:
I can't test this properly because the domain is blocked in my country, but maybe you can try the following:
import requests
import bs4 as bs

url_base = 'http://www.acihellas.gr/gaming-pontikia#/pageSize=21&viewMode=grid&orderBy=10&pageNumber={page}'
total_pages = 4
products = {}

for page in range(1, total_pages + 1):
    url = url_base.format(page=page)
    print(f'Scraping page {page}...')
    res = requests.get(url)
    soup = bs.BeautifulSoup(res.text, 'lxml')
    items = soup.find_all('div', {'class': 'details'})
    # Add items to the dictionary, keyed by product name
    for item in items:
        name = item.find("h2").text
        print(name)
        link = item.find("a")['href']
        products[name] = {'name': name, 'url': link}

print(products)
The products dictionary should then contain the name and URL of each item.
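One thing to check before relying on a URL of this shape: everything after # is a fragment, and HTTP clients, requests included, do not send the fragment to the server. If the site uses it only for client-side paging in JavaScript, every request in the loop fetches the same first page, which would match what the question describes. The sketch below uses only urllib.parse to strip the fragment from the four page URLs; request_target is an illustrative name.

```python
from urllib.parse import urldefrag

url_base = ("http://www.acihellas.gr/gaming-pontikia"
            "#/pageSize=21&viewMode=grid&orderBy=10&pageNumber={page}")


def request_target(url):
    """Return the URL without its fragment, i.e. the part a server can see."""
    return urldefrag(url).url


# All four page URLs collapse to one request target once the fragment is removed.
targets = {request_target(url_base.format(page=p)) for p in range(1, 5)}
print(targets)  # {'http://www.acihellas.gr/gaming-pontikia'}
```

Because the page number lives only in the fragment, changing it does not change what is requested; a request that carries the page number in its body or query string is needed instead.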
