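I'm trying to get all the div tags with the class "someClass" from a website.
The site loads new div elements as you scroll down, so I used Keys.PAGE_DOWN. It works and the page scrolls, but the data is still incomplete.
So I used: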
elem = driver.find_element(By.TAG_NAME, "body")
no_of_pagedowns = 23
while no_of_pagedowns:
    elem.send_keys(Keys.PAGE_DOWN)
    time.sleep(0.7)
    no_of_pagedowns -= 1
It scrolls until the whole HTML page has loaded, but when I write the data to a file, only 20 div tags are written instead of 100...
Full code:
import time
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager

driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))
url = 'https://divar.ir/s/tehran/buy-apartment/parand?price=200000000-450000000&non-negotiable=true&has-photo=true&q=خانه پرند'
driver.get(url)

# Scroll down so the page loads more listings
elem = driver.find_element(By.TAG_NAME, "body")
no_of_pagedowns = 23
while no_of_pagedowns:
    elem.send_keys(Keys.PAGE_DOWN)
    time.sleep(0.3)
    no_of_pagedowns -= 1

# Collect the cards that are in the DOM once scrolling has finished
datas = driver.find_elements(By.CLASS_NAME, 'kt-post-card__body')

# Write one numbered entry per card (UTF-8, since the listings are in Persian)
with open('data.txt', 'w', encoding='utf-8') as f:
    for counter, data in enumerate(datas, start=1):
        f.write(f'{counter}--> {data.text}\n')
driver.quit()
A uj5u.com community member replied:
To select only 20 <div> tags instead of hundreds, you can use list slicing with either of the following locator strategies:
Using CSS_SELECTOR:
elements = driver.find_elements(By.CSS_SELECTOR, "div.kt-post-card__body")[:20]
Using XPATH:
elements = driver.find_elements(By.XPATH, "//div[@class='kt-post-card__body']")[:20]
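Ideally, you should induce WebDriverWait for visibility_of_all_elements_located(), using either of the following locator strategies:
Using CSS_SELECTOR:
elements = WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR, "div.kt-post-card__body")))[:20]
Using XPATH:
elements = WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.XPATH, "//div[@class='kt-post-card__body']")))[:20]
Note: these lines need the following imports:
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC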
Update
To select all the <div>s, drop the list slice and use either of the following locator strategies:
Using CSS_SELECTOR:
elements = driver.find_elements(By.CSS_SELECTOR, "div.kt-post-card__body")
Using XPATH:
elements = driver.find_elements(By.XPATH, "//div[@class='kt-post-card__body']")
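If a single find_elements call after scrolling still returns only about 20 cards, one possible cause (an assumption, not something confirmed in this thread) is that the page keeps only the currently visible cards in the DOM and drops the earlier ones. In that case it may help to read the cards after every scroll step and merge the results. Below is a minimal sketch of that idea. The helper name collect_while_scrolling and its two callables are hypothetical, not part of Selenium, and it de-duplicates by text, so two listings with identical text would collapse into one.

```python
import time
from typing import Callable, Dict, Iterable, List


def collect_while_scrolling(
    read_visible: Callable[[], Iterable[str]],
    scroll_once: Callable[[], None],
    rounds: int = 23,
    pause: float = 0.7,
) -> List[str]:
    """Read the visible card texts after every scroll step, in first-seen order."""
    seen: Dict[str, None] = {}  # a dict keeps insertion order and drops duplicates

    # Read once before scrolling so the first screen is not missed
    for text in read_visible():
        seen.setdefault(text, None)

    for _ in range(rounds):
        scroll_once()
        time.sleep(pause)  # give the page time to load the next batch
        for text in read_visible():
            seen.setdefault(text, None)

    return list(seen)


# Possible wiring with the question's own Selenium objects (not executed here):
#   read_visible = lambda: [e.text for e in
#                           driver.find_elements(By.CLASS_NAME, 'kt-post-card__body')]
#   scroll_once  = lambda: elem.send_keys(Keys.PAGE_DOWN)
#   texts = collect_while_scrolling(read_visible, scroll_once)
```

Keeping the helper free of Selenium imports makes it easy to try with stand-in functions first. With a real driver, reading .text may raise StaleElementReferenceException if the page replaces a card between the lookup and the read, so a try/except around that read could be needed.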
A uj5u.com community member replied:
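I checked the site and found that it fetches its data as JSON through an API that uses a cursor. The cursor is built from a time expression and a variable named last-post-date. When you first open the site, this value is given in the JSON as lastPostDate. To get the data from the site quickly, you can use this request: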
https://divar.ir/s/tehran/buy-apartment/parand?price=200000000-450000000&non-negotiable=true&has-photo=true&q=خ% D8
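The cursor idea described above can be sketched as a generic loop. This is only an illustration under stated assumptions: the thread does not give the API endpoint or the shape of the JSON, so the function iter_posts, the fetch_page callable and the "posts" key are all hypothetical. Only the lastPostDate field name comes from the answer.

```python
from typing import Any, Callable, Dict, Iterator, Optional


def iter_posts(
    fetch_page: Callable[[Optional[Any]], Dict[str, Any]],
    max_pages: int = 50,
) -> Iterator[Any]:
    """Yield posts page by page, feeding each page's cursor into the next request."""
    cursor: Optional[Any] = None  # the first request carries no cursor
    for _ in range(max_pages):
        page = fetch_page(cursor)

        # "posts" is a hypothetical key; the real response may name it differently
        posts = page.get("posts", [])
        if not posts:
            break
        yield from posts

        # lastPostDate is the cursor field named in the answer
        next_cursor = page.get("lastPostDate")
        if next_cursor is None or next_cursor == cursor:
            break  # no further pages, or the cursor stopped advancing
        cursor = next_cursor
```

Here fetch_page would be whatever function sends the request with the given cursor and returns the decoded JSON. The max_pages limit is just a safety stop so the loop cannot run forever if the cursor never ends.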
