Cannot loop through links when scraping a website with Python and Selenium

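I want to crawl a website, but I run into a problem when looping through the pages. I want to build a system that collects all the links, then clicks each link and collects data (in this case, the date). I wrote the code below, but I keep getting this error: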

StaleElementReferenceException: Message: stale element reference: element is not attached to the page document
  (Session info: chrome=98.0.4758.109)

I tried increasing the sleep intervals, but the result is the same. The error occurs on the second iteration (after the first link).

from selenium import webdriver
from selenium.webdriver.common.keys import Keys
import requests
import time

# url for crawling
url = "https://bstger.weblaw.ch/?size=n_60_n"
    
# path to selenium
path = 'path to selenium'
driver = webdriver.Chrome(path)
driver.get(url)
time.sleep(4)    
    
# click on search button
buttonClickSearch = driver.find_element_by_xpath('//*[@id="root"]/div/div/div[2]/div[1]/div/div[3]/form/div/input').click()
time.sleep(3)    
    
# get all links
all_links = driver.find_elements_by_tag_name('li.sui-result div.sui-result__header a')
print(all_links)
print()

# loop through the links and crawl them
for link in all_links:
    
    # click on link
    print(link)
    time.sleep(4)
    click = link.click() # I GET THE ERROR HERE ON SECOND ITERATION
    time.sleep(4)
        
    # get date
    date = driver.find_element_by_tag_name('div.filter-data button.wlclight13').text
    day = date.split('.')[0]
    month = date.split('.')[1]
    year = date.split('.')[2]
    date = year + "-" + month + "-" + day
    print(date)
    print()
    
    # click on back button
    back_button = driver.find_element_by_xpath('//*[@id="root"]/div/section[1]/div[1]/div[1]/a').click()
    time.sleep(4)
    #scroll
    driver.execute_script("window.scrollTo(0, 200)")

A uj5u.com user replied:

Instead of clicking the elements, collect their href values and use them with driver.get() to navigate.

# get the href values

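# plain href strings cannot go stale; find_elements(By.CSS_SELECTOR, ...) replaces the find_elements_by_* helpers removed in newer Selenium 4 releases
all_links = [link.get_attribute('href') for link in driver.find_elements(By.CSS_SELECTOR, 'li.sui-result > .sui-result__header > a')]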
print(all_links) 

for link in all_links:
    
    driver.get(link)
    driver.refresh()
        
    # get date
    date = WebDriverWait(driver, 10).until(EC.visibility_of_element_located((By.CSS_SELECTOR, "div.filter-data button.wlclight13"))).text
    day = date.split('.')[0]
    month = date.split('.')[1]
    year = date.split('.')[2]
    date = year + "-" + month + "-" + day
    print(date)
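
The three split calls above can be condensed with the standard library. The following is a minimal sketch, assuming the button text is always a date in DD.MM.YYYY form (which is what the splitting code implies). The helper name to_iso_date is hypothetical and not part of the original answer.

```python
from datetime import datetime


def to_iso_date(text):
    """Convert 'DD.MM.YYYY' (the assumed page format) to 'YYYY-MM-DD'."""
    # strptime raises ValueError on unexpected text instead of quietly
    # building a malformed date, which makes scraping problems easier to spot.
    return datetime.strptime(text.strip(), "%d.%m.%Y").strftime("%Y-%m-%d")


print(to_iso_date("14.02.2022"))  # 2022-02-14
```

Inside the loop this would be used as date = to_iso_date(date_text), where date_text is the .text value read from the button.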

If you want to keep using your own code, you need to re-locate the elements on every iteration, as shown below.

all_links = driver.find_elements_by_tag_name('li.sui-result div.sui-result__header a')
print(all_links)
print()

# loop through the links by index and crawl them
for link in range(len(all_links)):
    # re-locate the links on every iteration; the old references go stale once the page changes
    all_links = driver.find_elements_by_tag_name('li.sui-result div.sui-result__header a')
    # click on link
    print(all_links[link])
    time.sleep(4)
    all_links[link].click() 
    time.sleep(4)
        
    # get date
    date = driver.find_element_by_tag_name('div.filter-data button.wlclight13').text
    day = date.split('.')[0]
    month = date.split('.')[1]
    year = date.split('.')[2]
    date = year + "-" + month + "-" + day
    print(date)
    print()
    
    # click on back button
    back_button = driver.find_element_by_xpath('//*[@id="root"]/div/section[1]/div[1]/div[1]/a').click()
    time.sleep(4)
    #scroll
    driver.execute_script("window.scrollTo(0, 200)")
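
The re-locate pattern can also be shown without a browser. The sketch below is only an illustration: click_each_fresh and FakePage are hypothetical names, and FakePage merely imitates a page whose element references become invalid after every navigation. With Selenium, locate would be something like lambda: driver.find_elements(By.CSS_SELECTOR, "...") and visit would click, scrape, and go back.

```python
def click_each_fresh(locate, visit):
    """Call visit(element) for every element, re-locating before each use.

    locate: zero-argument callable returning the current list of elements.
    visit:  callable that uses one element (click, scrape, navigate back).
    """
    total = len(locate())
    for index in range(total):
        elements = locate()  # fresh references for the current page state
        if index >= len(elements):
            break  # the page now lists fewer results than before
        visit(elements[index])


class FakePage:
    """Stand-in for a browser page: every visit invalidates old elements."""

    def __init__(self, names):
        self.names = names
        self.generation = 0

    def locate(self):
        # Each element remembers the page generation it was found in.
        return [(self.generation, name) for name in self.names]

    def visit(self, element):
        generation, name = element
        if generation != self.generation:
            raise RuntimeError("stale element: " + name)
        self.generation += 1  # navigating away re-renders the list
        return name


page = FakePage(["a", "b", "c"])
visited = []
click_each_fresh(page.locate, lambda element: visited.append(page.visit(element)))
print(visited)  # ['a', 'b', 'c']
```

Iterating directly over a single page.locate() result would fail on the second element, which mirrors the error in the question.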

Update: navigating to the URL alone does not re-render the page. Add driver.refresh() so that the date is displayed.

all_links =[link.get_attribute('href') for link in driver.find_elements_by_css_selector('li.sui-result >.sui-result__header> a')]
print(all_links) 

for link in all_links:
    
    driver.get(link)
    driver.refresh()
        
    # get date
    date = WebDriverWait(driver, 10).until(EC.visibility_of_element_located((By.CSS_SELECTOR, "div.filter-data button.wlclight13"))).text
    day = date.split('.')[0]
    month = date.split('.')[1]
    year = date.split('.')[2]
    date = year + "-" + month + "-" + day
    print(date)

You need the following imports.

from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By

A uj5u.com user replied:

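As mentioned, clicking the back button is unreliable. Instead, you can use the "next" button to navigate to the other links.

It is also better to use explicit waits.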

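from selenium.webdriver.common.action_chains import ActionChains
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait
import time

# driver is the webdriver.Chrome instance created in the question's code
driver.get("https://bstger.weblaw.ch/?size=n_60_n")

wait = WebDriverWait(driver, 30)
actions = ActionChains(driver)

# wait until the search button is clickable, then click it
buttonClickSearch = wait.until(EC.element_to_be_clickable((By.XPATH, "//input[@aria-label='search button']")))
actions.move_to_element(buttonClickSearch).click().perform()  # perform() is needed to run the queued actions

time.sleep(5)
all_links = driver.find_elements(By.XPATH, "//div[@class='sui-result__header']/a")
all_links[0].click()  # click on the first link

for i in range(20):
    ...
    # next_button avoids shadowing the built-in next()
    next_button = wait.until(EC.element_to_be_clickable((By.XPATH, "//button[contains(@class,'next')]")))
    next_button.click()  # move on to the next result, 20 times in total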
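
To illustrate what an explicit wait does conceptually, here is a minimal polling sketch. It is not Selenium's implementation: wait_until and its parameters are hypothetical names chosen for this example. WebDriverWait follows the same poll-until-truthy idea, and by default it additionally ignores NoSuchElementException while polling.

```python
import time


def wait_until(condition, timeout=10.0, poll=0.5):
    """Poll condition() until it returns a truthy value or timeout elapses."""
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError("condition not met within %.1f s" % timeout)
        time.sleep(poll)


attempts = []


def ready_on_third_try():
    # Simulates an element that only appears on the third check.
    attempts.append(1)
    return "ready" if len(attempts) >= 3 else None


print(wait_until(ready_on_third_try, timeout=2.0, poll=0.01))  # ready
```

Unlike a fixed time.sleep(4), such a wait returns as soon as the condition holds and fails loudly when it never does.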

Please credit the source when reposting. Link to this article: https://www.uj5u.com/net/436012.html
