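Hi, I'm trying to scrape product names and units from this website by running each word in a list through the site's search function.
I tried scrolling, but the page pauses to load more items after every scroll down. How should I handle this? When scraping many pages, what is the best way to handle scrolling? I tried headless Chrome, but that did not work, so the code below uses ChromeDriverManager to open a browser window and scroll. The website is https://www.sayurbox.com/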
import pandas as pd
from datetime import datetime
from urllib.parse import quote
from bs4 import BeautifulSoup
import time
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager

BASE_URL = "https://www.sayurbox.com"

def get_soup(url):
    # Open the page in a visible Chrome window, scroll once, and parse the HTML
    driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))
    driver.maximize_window()
    driver.get(url)
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(3)
    soup = BeautifulSoup(driver.page_source, 'html.parser')
    driver.quit()
    return soup

# scraping components
search_terms = ['ayam', 'sabun', 'sayur', 'common']
for item in search_terms:
    items_encoded = quote(str(item))  # URL-encode spaces and other special characters
    url = f"{BASE_URL}/products/s/{items_encoded}"
    print(f"{url} start scraping")
    soup = get_soup(url)
    # handling for items not found
    found = soup.find_all("span", {"class": "NotFoundMessage__container__title"})
    if found and found[0].text == "Produk tidak ditemukan.":
        print('url not found')
        continue
    # if found, continue scraping
    # get product title
    product_titles = soup.find_all('span', {"class": "ProductItem__container__name"})
    product = [p.text for p in product_titles]
    # get unit
    units = soup.find_all('span', {"class": "Product__container__priceWrapper__packDesc"})
    unit = [u.text for u in units]
    # write into dataframe (product and unit must have the same length)
    data = {'product': product,
            'unit': unit,
            'date': datetime.now().date()}
    df = pd.DataFrame(data)
The code above scrolls only once, but there are still more items below the first scroll.
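One common way to keep scrolling until nothing new loads is to compare the page height before and after each scroll and stop once it no longer grows. The following is a minimal sketch of that idea, not a tested solution for this particular site. The helper name `scroll_to_bottom` and its `pause` and `max_rounds` parameters are hypothetical; `driver` is assumed to be a Selenium WebDriver (anything with an `execute_script` method).

```python
import time


def scroll_to_bottom(driver, pause=2.0, max_rounds=50):
    """Scroll down until the page height stops growing.

    `driver` is assumed to be a Selenium WebDriver. Returns the number
    of scrolls performed. `max_rounds` is a safety cap (an assumption,
    not something the site requires).
    """
    last_height = driver.execute_script("return document.body.scrollHeight")
    rounds = 0
    while rounds < max_rounds:
        # Jump to the current bottom of the page to trigger lazy loading
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        rounds += 1
        # Give the page time to fetch and render the next batch of items
        time.sleep(pause)
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            # Height did not change, so no more items were appended
            break
        last_height = new_height
    return rounds
```

In the code above, this would be called right after `driver.get(url)` in place of the single `execute_script` scroll. A fixed pause is fragile on slow connections, so the pause length may need tuning.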
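Answer from a uj5u.com community member:
Do you need to use Selenium? The data is fetched through a POST request, and you can get more results simply by changing the page parameter. That is essentially what happens when you scroll. Then change the 'value' parameter to go through your list of search terms.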
import requests
import pandas as pd
url = 'https://api.sayurbox.io/graphql'
headers = {
'authorization': 'eyJhbGciOiJSUzI1NiIsImtpZCI6ImY4NDY2MjEyMTQxMjQ4NzUxOWJiZjhlYWQ4ZGZiYjM3ODYwMjk5ZDciLCJ0eXAiOiJKV1QifQ.eyJhbm9ueW1vdXMiOnRydWUsImF1ZCI6InNheXVyYm94LWF1ZGllbmNlIiwiYXV0aF90aW1lIjoxNjUwNTUxMDYxLCJleHAiOjE2NTMxNDMwNjEsImlhdCI6MTY1MDU1MTA2MSwiaXNzIjoiaHR0cHM6Ly93d3cuc2F5dXJib3guY29tIiwibWV0YWRhdGEiOnsiZGV2aWNlX2luZm8iOm51bGx9LCJuYW1lIjpudWxsLCJwaWN0dXJlIjpudWxsLCJwcm92aWRlcl9pZCI6ImFub255bW91cyIsInNpZCI6IjFjNDE1ODFiLWQzMjItNDFhZi1hOWE5LWE4YTQ4OTZkODMxZiIsInN1YiI6InFSWXF2OFV2bEFucVR3NlE1NGhfbHdTNFBvTk8iLCJ1c2VyX2lkIjoicVJZcXY4VXZsQW5xVHc2UTU0aF9sd1M0UG9OTyJ9.MSmOz0mAe3UjhH9KSRp-fCk65tkTUPlxiJrRHweDEY2vqBSnUP43TO8ug3P38x8igxC4qguCOlwCTCPfUEWFhr3X8ePY7u7I7D22tV1LOF7Tm6T8PuLzHbmlBTgPK9C_GJpXwLAKnD2A535r-9DttYGt4QytIeWua8NKyW_riURfWGnhZBBMjEPeVPJBqGn1jMtZoh_iUeRb-kWccJ8IhBDQr0T1Op6IDMJuw3x6uf1Ks_SVqEVA0ZGIM1GVwuyZ87JYT4kqITNgi6yNy69jVH6gDFqBkTwJ7ZNWj8NCQsaRfh03bZROZzY9MeCtL6if_8D9newYZagyZu5mKTJNzg',
'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/100.0.4896.88 Safari/537.36'}
rows = []
for page in range(1, 10):
    print(page)
    payload = {
        'operationName': "getCatalogVariant",
        'query': "query getCatalogVariant($deliveryDate: String!, $deliveryArea: String!, $deliveryCode: String, $limit: Int!, $page: Int!, $type: CatalogType, $value: String) {\n catalogVariantList(deliveryDate: $deliveryDate, deliveryArea: $deliveryArea, deliveryCode: $deliveryCode, limit: $limit, page: $page, type: $type, value: $value) {\n limit\n page\n size\n hasNextPage\n category {\n displayName\n }\n list {\n key\n availability\n categories\n farmers {\n image\n name\n }\n image {\n md\n sm\n lg\n }\n isDiscount\n discount\n labelDesc\n labelName\n maxQty\n name\n displayName\n nextAvailableDates\n packDesc\n packNote\n price\n priceFormatted\n actualPrice\n actualPriceFormatted\n shortDesc\n stockAvailable\n type\n emptyMessageHtml\n promoMessageHtml\n }\n }\n}\n",
        'variables': {
            'deliveryArea': "Jabodetabek",
            'deliveryCode': "JK01",
            'deliveryDate': "Friday, 22 April 2022",
            'limit': 12,
            'page': page,
            'type': "SEARCH",
            'value': "ayam"}}
    jsonData = requests.post(url, headers=headers, json=payload).json()
    items = jsonData['data']['catalogVariantList']['list']
    rows += items  # accumulate every page instead of overwriting the previous one
df = pd.DataFrame(rows)
Output:
print(df)
key ... promoMessageHtml
0 Sreeya Sayap Ayam Frozen 500 gram ... None
1 SunOne Kulit Ayam 1 kg ... None
2 Bundling Ayam & Pisau 1 pack ... Promo!! maksimal 5
3 SunOne Hati Ayam 1 kg ... Hanya tersedia 1
4 Wellfed Daging Ayam Giling 250 gram ... None
.. ... ... ...
103 Frozchick Ayam Bumbu Kecap 400 gram ... Hanya tersedia 5
104 Sasa Larasa Bumbu Ungkep Ayam Kalasan 33 gram ... Promo!! maksimal 5
105 Bundling Indomie Kuah Ayam Bawang 69 gram 5 pcs ... Promo!! maksimal 7
106 Bundling MPASI Dada Ayam 1 pack ... Promo!! maksimal 10
107 Berkah Chicken Paha Bawah Probiotik Organik 55... ... Promo!! maksimal 10
[108 rows x 24 columns]
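The query above also requests a `hasNextPage` field, so instead of a fixed `range(1, 10)` the loop could stop as soon as the API reports no further pages, and it could be repeated for each search term by changing 'value'. The following is a minimal sketch under those assumptions. The helper `fetch_all_pages`, its `post_json` callable, and the `max_pages` cap are hypothetical names introduced for illustration; whether the API always returns `hasNextPage` as expected has not been verified here.

```python
import copy


def fetch_all_pages(post_json, base_payload, term, max_pages=50):
    """Collect all items for one search term by following hasNextPage.

    post_json:    callable that sends a payload dict and returns the
                  decoded JSON response (hypothetical wrapper).
    base_payload: a payload dict shaped like the one in the answer.
    term:         the search word placed in variables['value'].
    max_pages:    safety cap on the number of requests (an assumption).
    """
    rows = []
    for page in range(1, max_pages + 1):
        # Copy so the caller's payload is never modified
        payload = copy.deepcopy(base_payload)
        payload['variables']['page'] = page
        payload['variables']['value'] = term
        result = post_json(payload)['data']['catalogVariantList']
        rows.extend(result['list'])
        if not result['hasNextPage']:
            # The API reports that this was the last page
            break
    return rows
```

With the `url`, `headers`, and `payload` from the answer, `post_json` could be `lambda p: requests.post(url, headers=headers, json=p).json()`, and calling `fetch_all_pages(post_json, payload, item)` for each `item` in the question's word list would give one list of rows per search term.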
