Hi guys, I'm trying to use Selenium WebDriver to scrape some information about Zalando shoes and save the price, title, date and time in separate variables. This is my code:
from selenium import webdriver
from selenium.webdriver.common.by import By
import csv
DRIVER_PATH = r'C:\chromedriver.exe'
driver = webdriver.Chrome(executable_path=DRIVER_PATH)
driver.get('https://www.zalando.es/release-calendar/zapatillas-mujer/')
# Get the data of product 1 (changing the index in /div/div[1]/div to another number returns the data of another shoe)
product_1 = driver.find_element(By.XPATH, '//*[@id="release-calendar"]/div/div[1]/div')
element_text = product_1.text
print(element_text)
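When I print element_text with the code above, I get all the product information in a single string, shown below. I want to save each piece in its own variable, so I tried something (keep reading).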
109,95 € Nike Sportswear WMNS DUNK LOW CZ 10 de noviembre de 2022, 8:15 Recordármelo
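So, starting from this small piece of working code, I added the code below to split the data and save the different kinds of data in separate variables, but I ran into a problem: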
from selenium import webdriver
from selenium.webdriver.common.by import By
import csv
DRIVER_PATH = r'C:\chromedriver.exe'
driver = webdriver.Chrome(executable_path=DRIVER_PATH)
driver.get('https://www.zalando.es/release-calendar/zapatillas-mujer/')
#Select product 1
product_1 = driver.find_element(By.XPATH, '//*[@id="release-calendar"]/div/div[1]/div')
element_text = product_1.text
#Split the data
element_text_split = element_text.split()
# Price 1 --> result: 109,95
price_1 = element_text_split[0]
print(price_1)
# Title 1 --> result: € (not the product name)
title_1 = element_text_split[1]
print(title_1)
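The result of these 2 prints is: "109,95" and "€"
I expected element_text_split[1] to be Nike Sportswear, but it is the € symbol, because I split the data on the spaces between the words.
This is a big problem if I want to get the name of the shoe, because the names don't all contain the same number of spaces, for example: Nike Dunk Low Cz or Air Jordan One Mid 1
How can I solve this? Thanks
Reply from a uj5u.com user:
I think you may be looking for something like this?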
# Needed libs
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
# We create the driver (Selenium 4 style: the driver path is passed through a Service object)
DRIVER_PATH = r'C:\chromedriver.exe'
driver = webdriver.Chrome(service=Service(DRIVER_PATH))
# We maximize the window
driver.maximize_window()
# We navigate to the url
url='https://www.zalando.es/release-calendar/zapatillas-mujer/'
driver.get(url)
# We save a list of elements that are products (search for that xpath in the page and you will see what kind of element it is)
products = WebDriverWait(driver, 20).until(EC.presence_of_all_elements_located((By.XPATH, "//div[@id='release-calendar']//div[contains(@data-cid,'cid')]")))
# We loop over that list and, for each product, take the price, the brand, the model and the date.
# The data-cid values start at cid1, so we use i + 1.
for i, product in enumerate(products):
    price = WebDriverWait(driver, 20).until(EC.presence_of_element_located((By.XPATH, f"//div[@data-cid='cid{i + 1}']/div[2]"))).text
    brand = WebDriverWait(driver, 20).until(EC.presence_of_element_located((By.XPATH, f"//div[@data-cid='cid{i + 1}']/div[3]"))).text
    model = WebDriverWait(driver, 20).until(EC.presence_of_element_located((By.XPATH, f"//div[@data-cid='cid{i + 1}']/div[4]"))).text
    date = WebDriverWait(driver, 20).until(EC.presence_of_element_located((By.XPATH, f"//div[@data-cid='cid{i + 1}']/div[5]"))).text
    url = WebDriverWait(driver, 20).until(EC.presence_of_element_located((By.XPATH, f"//div[@data-cid='cid{i + 1}']//a"))).get_attribute("href")
    image = WebDriverWait(driver, 20).until(EC.presence_of_element_located((By.XPATH, f"//div[@data-cid='cid{i + 1}']//img"))).get_attribute("src")
    print(f"""{price}
{brand}
{model}
{date}
{url}
{image}
""")
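Reply from a uj5u.com user:
One idea is to look at the element_text variable for many different products and decide on a different way to split the text: the split method can take a smaller string as the separator on which to split the longer string.
If that doesn't work, you can also loop over the element_text_split variable (it is just a list of strings) and break that list apart by looking for certain smaller strings or by using a regex.
For example, to find the price, you could look for digits, a decimal separator (a comma on this site, as in 109,95), then more digits. I assume the product name comes before or after it.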
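The regex idea can be sketched as follows. This is only a minimal sketch, assuming every product's text follows the same order as the sample line in the question (price, €, name, Spanish date, time); the pattern, the names SAMPLE and PATTERN, and the function parse_product are all hypothetical and not part of any answer in this thread.

```python
import re

# Sample text copied from the question; the real page text may be laid out differently.
SAMPLE = "109,95 € Nike Sportswear WMNS DUNK LOW CZ 10 de noviembre de 2022, 8:15 Recordármelo"

# Assumed layout: price, "€", name, date as "D de MES de YYYY", then "H:MM".
PATTERN = re.compile(
    r"(?P<price>\d+(?:\.\d{3})*,\d{2})\s*€\s+"   # e.g. 109,95 or 1.295,00
    r"(?P<name>.+?)\s+"                           # everything up to the date
    r"(?P<date>\d{1,2} de \w+ de \d{4}),\s*"      # e.g. 10 de noviembre de 2022
    r"(?P<time>\d{1,2}:\d{2})"                    # e.g. 8:15
)

def parse_product(text):
    """Return a dict with price, name, date and time, or None if the text does not match."""
    # Collapse newlines and repeated spaces so one pattern covers both layouts.
    flat = " ".join(text.split())
    match = PATTERN.search(flat)
    return match.groupdict() if match else None

info = parse_product(SAMPLE)
print(info)
```

Because the name is captured as "everything between the € sign and the date", it does not matter how many words it contains. The pattern cannot tell the brand from the model, though, so for the sample it returns "Nike Sportswear WMNS DUNK LOW CZ" as a single name.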
Reply from a uj5u.com user:
You can use selenium together with bs4 to get the required data in a robust way:
from selenium import webdriver
import time
from bs4 import BeautifulSoup
from selenium.webdriver.chrome.service import Service
import pandas as pd
webdriver_service = Service("./chromedriver") #Your chromedriver path
driver = webdriver.Chrome(service=webdriver_service)
d = []
driver.get('https://www.zalando.es/release-calendar/zapatillas-mujer/')
driver.maximize_window()
time.sleep(5)
soup = BeautifulSoup(driver.page_source,"html.parser")
price= [x.get_text(strip=True) for x in soup.select('.Wqd6Qu div')]
#print(price)
title= [x.get_text(strip=True) for x in soup.select('.Wqd6Qu div div div')]
#print(title)
date = [x.get_text(strip=True).split(',')[0] for x in soup.select('.Wqd6Qu div div div div')]
#print(date)
hour = [x.get_text(strip=True).split(',')[1] for x in soup.select('.Wqd6Qu div div div div')]
#print(hour)
cols = ['title', 'price', 'date', 'hour']
df = pd.DataFrame(data=list(zip(title,price,date,hour)), columns=cols)
print(df)
Output:
                   title     price                     date   hour
0       WMNS DUNK LOW CZ  109,95 €  10 de noviembre de 2022  14:15
1    HYPERTURF ADVENTURE  139,95 €  11 de noviembre de 2022  14:00
2       W AIR MAX 95 ESS  189,95 €  11 de noviembre de 2022  14:00
3           CITY CLASSIC  119,95 €  11 de noviembre de 2022  14:00
4           CITY CLASSIC  119,95 €  11 de noviembre de 2022  14:00
5         WMNS AIR 1 MID  129,95 €  11 de noviembre de 2022  14:15
6   DUNK LOW NEXT NATURE  109,95 €  11 de noviembre de 2022  14:15
7            CROSS WOMEN  295,00 €  14 de noviembre de 2022  14:00
8            CROSS WOMEN  295,00 €  14 de noviembre de 2022  14:00
9            CROSS WOMEN  295,00 €  14 de noviembre de 2022  14:00
10           W DUNK HIGH  119,95 €  14 de noviembre de 2022  14:15
11                 MT410   99,95 €  16 de noviembre de 2022  14:00
12                 MT410   99,95 €  16 de noviembre de 2022  14:00
13                 MT410   99,95 €  16 de noviembre de 2022  14:00
14                 MT410   99,95 €  16 de noviembre de 2022  14:00
15                 MT410   94,95 €  16 de noviembre de 2022  14:00
16                 WL574  109,95 €  18 de noviembre de 2022  14:00
17                 WS327  119,95 €  18 de noviembre de 2022  14:00
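The values in this output are still plain strings. If you want to sort or compare them, one option is to convert them to typed values. Below is a minimal sketch, assuming the price always looks like "109,95 €" and the date like "10 de noviembre de 2022"; the helpers parse_price and parse_release and the MONTHS table are hypothetical names introduced here, not part of the answer above.

```python
from datetime import datetime
from decimal import Decimal

# Spanish month names mapped to month numbers, written out by hand so no locale is needed.
MONTHS = {
    "enero": 1, "febrero": 2, "marzo": 3, "abril": 4,
    "mayo": 5, "junio": 6, "julio": 7, "agosto": 8,
    "septiembre": 9, "octubre": 10, "noviembre": 11, "diciembre": 12,
}

def parse_price(text):
    """Convert a string such as '109,95 €' to Decimal('109.95')."""
    number = text.replace("€", "").strip()
    # Spanish formatting: '.' groups thousands and ',' marks the decimals.
    return Decimal(number.replace(".", "").replace(",", "."))

def parse_release(date_text, hour_text):
    """Convert '10 de noviembre de 2022' and '14:15' to a datetime."""
    day, month_name, year = date_text.strip().split(" de ")
    hour, minute = hour_text.strip().split(":")
    return datetime(int(year), MONTHS[month_name.lower()], int(day), int(hour), int(minute))

print(parse_price("109,95 €"))                             # 109.95
print(parse_release("10 de noviembre de 2022", " 14:15"))  # 2022-11-10 14:15:00
```

With a DataFrame like the one above, these could be applied per column, for example with df["price"].map(parse_price). Decimal is used instead of float so that prices keep their exact two decimals.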
