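I am trying to scrape hydrogen sulfide (H2S) data from this website using Python and Selenium. What I have been struggling with so far is how to get the data shown in each tooltip (site ID, site name, date, value, unit, and so on). As you can see, there are seven monitoring sites, A through G, each with its own data. I have done a lot of research but am still stuck. I put together the code below to scrape the data for a specific date, but it runs into errors. Please see my code below.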
from selenium import webdriver
from webdriver_manager.microsoft import EdgeChromiumDriverManager
from selenium.webdriver import ActionChains
from selenium.webdriver.common.by import By
from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
driver = webdriver.Edge(EdgeChromiumDriverManager(log_level=20).install())
driver.maximize_window()
driver.get("https://marathonlosangelesrefineryfencelinemonitoring.com/index.html")
# Navigate to monitors
button = driver.find_element_by_xpath("//div[@class='nav-link-text']")
button.click()
# Navigate to dropdown button
dropdown = driver.find_element_by_xpath("//i[@class='arrow-down parameter-arrow']")
dropdown.click()
# Select Hydrogen Sulfide and click
h2s = driver.find_element_by_xpath("//ul[@class='dropdown-menu' and @role='menu' and @aria-labelledby='ParameterDropdown']//li[12]")
h2s.click()
res = []
test = driver.find_elements_by_xpath("//div[@class='leaflet-pane leaflet-marker-pane']//div[contains(@class, 'leaflet-marker-icon')]")
for ele in test:
    hover = ActionChains(driver).move_to_element(ele)
    hover.perform()
    try:
        site_id = driver.find_element_by_css_selector(".LAR-tooltip-site-id > p")
        site_name = driver.find_element_by_css_selector(".LAR-tooltip-site-name")
        date = driver.find_element_by_css_selector(".LAR-tooltip-localtime")
        value = driver.find_element_by_css_selector(".LAR-tooltip-data-value")
        unit = driver.find_element_by_css_selector(".LAR-tooltip-data-unit")
        para_mdl = driver.find_element_by_css_selector(".tooltip-parameter-mdl")
        res.append((site_id.text, site_name.text, date.text, value.text, unit.text, para_mdl.text))
    except:
        pass
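I would really appreciate it if someone could help me solve this. I would also like to use the code above to scrape data over a time window (say, from August 1, 2021 to January 1, 2022), so any feedback is very welcome.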
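For the time window mentioned above, the list of days to visit can be built separately from the browser automation. The following is only a sketch: the helper names `daily_dates`, `months_back`, and `month_label` are hypothetical, and the "Aug 2021" label format is an assumption about how the site's date picker titles each month.

```python
from datetime import date, timedelta

# Fixed English abbreviations, so the label does not depend on the system locale.
MONTH_ABBR = ("Jan", "Feb", "Mar", "Apr", "May", "Jun",
              "Jul", "Aug", "Sep", "Oct", "Nov", "Dec")


def daily_dates(start, end):
    """Yield every date from start to end, both inclusive."""
    current = start
    while current <= end:
        yield current
        current += timedelta(days=1)


def months_back(shown, target):
    """How many 'previous month' clicks lead from the month of `shown` to the month of `target`."""
    return (shown.year - target.year) * 12 + (shown.month - target.month)


def month_label(day):
    """Format a date as a month label such as 'Aug 2021' (assumed picker format)."""
    return f"{MONTH_ABBR[day.month - 1]} {day.year}"


if __name__ == "__main__":
    window = list(daily_dates(date(2021, 8, 1), date(2022, 1, 1)))
    print(len(window), "days from", window[0], "to", window[-1])
    print(month_label(window[0]), "is", months_back(date(2022, 1, 1), window[0]), "months before Jan 2022")
```

Each date from `daily_dates` would then be passed to whatever function selects a day in the picker and reads the tooltips.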
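A uj5u.com user replied:
It looks like all your code needs is a few WebDriverWaits. If I remember correctly, React-based websites are somewhat harder to automate because there is a lot of asynchronous behavior and because of the virtual DOM. I have refactored your code to use WebDriverWait where needed (I also collapsed some steps into single lines, though you can keep them separate if you prefer the readability). Here is the code: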
from selenium import webdriver
from webdriver_manager.microsoft import EdgeChromiumDriverManager
from selenium.webdriver import ActionChains
from selenium.webdriver.common.by import By
from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import time
driver = webdriver.Edge(EdgeChromiumDriverManager(log_level=20).install())
driver.maximize_window()
driver.get("https://marathonlosangelesrefineryfencelinemonitoring.com/index.html")
# Navigate to monitors
WebDriverWait(driver, 10).until(EC.element_to_be_clickable((By.XPATH, "//div[@class='nav-link-text']"))).click()
# Navigate to dropdown button
WebDriverWait(driver, 10).until(EC.element_to_be_clickable((By.XPATH, "//i[@class='arrow-down parameter-arrow']"))).click()
# Select Hydrogen Sulfide and click
WebDriverWait(driver, 10).until(EC.element_to_be_clickable((By.XPATH, "//ul[@class='dropdown-menu' and @role='menu' and @aria-labelledby='ParameterDropdown']//li[12]"))).click()
# Wait until the map markers are visible
WebDriverWait(driver, 10).until(EC.visibility_of_all_elements_located((By.XPATH, "//div[@class='leaflet-pane leaflet-marker-pane']//div[contains(@class, 'leaflet-marker-icon')]")))
driver.find_element_by_css_selector(".arrow-down.date-arrow").click()
req_month = 'Aug'
req_year = '2021'
req_timeline = req_month + " " + req_year
print(f"Timeline Selected is: {req_timeline}")
# Step back one month at a time until the picker shows the requested month
for i in range(11):
    month = driver.find_element(By.XPATH, "//th[@class='month']").text
    if month == req_timeline:
        break
    else:
        driver.find_element(By.XPATH, "//th[@class='prev available']").click()
# Pick day 1 and apply the selection
driver.find_element(By.XPATH, "//*[@class='table-condensed']//td[text()='1']").click()
driver.find_element(By.XPATH, "//*[text()='Apply']").click()
time.sleep(8)
res = []
test = driver.find_elements_by_xpath("//div[@class='leaflet-pane leaflet-marker-pane']//div[contains(@class, 'leaflet-marker-icon')]")
for ele in test:
    # Hover over the marker so its tooltip is rendered
    hover = ActionChains(driver).move_to_element(ele)
    hover.perform()
    time.sleep(1)
    try:
        site_id = driver.find_element_by_css_selector(".LAR-tooltip-site-id > p")
        site_name = driver.find_element_by_css_selector(".LAR-tooltip-site-name")
        date = driver.find_element_by_css_selector(".LAR-tooltip-localtime")
        value = driver.find_element_by_css_selector(".LAR-tooltip-data-value")
        unit = driver.find_element_by_css_selector(".LAR-tooltip-data-unit")
        para_mdl = driver.find_element_by_css_selector(".tooltip-parameter-mdl")
        res.append((site_id.text, site_name.text, date.text, value.text, unit.text, para_mdl.text))
    except:
        pass
print(res)
The result is as follows:
Timeline Selected is: Aug 2021
[('F', 'Point Monitor', '7:55 AM', '1.80', 'ppb', 'MDL: 0.40 ppb'), ('B', 'Point Monitor', '7:55 AM', '1.20', 'ppb', 'MDL: 0.40 ppb'), ('E', 'Point Monitor', '7:55 AM', '1.10', 'ppb', 'MDL: 0.40 ppb'), ('A', 'Point Monitor', '7:55 AM', '0.40', 'ppb', 'MDL: 0.40 ppb')]
Process finished with exit code 0
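The tuples printed above carry everything as strings, and the tooltip only shows a time of day, not the date. If the results are going to be collected over many dates, it may help to turn each tuple into a typed record and tag it with the date that was selected. This is a minimal sketch: `to_record` and `to_csv` are hypothetical helpers, and the parsing assumes the "MDL: 0.40 ppb" layout seen in the output above.

```python
import csv
import io

FIELDNAMES = ["date", "site_id", "site_name", "local_time", "value", "unit", "mdl"]


def to_record(row, day):
    """Convert one scraped tuple into a dict; `day` is the date chosen in the picker."""
    site_id, site_name, local_time, value, unit, mdl_text = row
    # "MDL: 0.40 ppb" -> 0.40 (assumes the "MDL: <number> <unit>" layout)
    mdl = float(mdl_text.split(":", 1)[1].split()[0])
    return {
        "date": day,
        "site_id": site_id,
        "site_name": site_name,
        "local_time": local_time,
        "value": float(value),
        "unit": unit,
        "mdl": mdl,
    }


def to_csv(records):
    """Serialize the records to CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDNAMES, lineterminator="\n")
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


if __name__ == "__main__":
    rows = [
        ('F', 'Point Monitor', '7:55 AM', '1.80', 'ppb', 'MDL: 0.40 ppb'),
        ('B', 'Point Monitor', '7:55 AM', '1.20', 'ppb', 'MDL: 0.40 ppb'),
    ]
    print(to_csv([to_record(row, "2021-08-01") for row in rows]))
```

The date string is passed in by the caller because it is known from the picker selection rather than from the tooltip text.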
As you can see, even with WebDriverWait in place, a few hard stops with time.sleep are still needed; otherwise the run becomes flaky.
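The difference between a hard sleep and a wait is that a wait polls a condition and returns as soon as it holds. The idea can be sketched in plain Python as follows. `wait_until` is a hypothetical helper, not part of Selenium, and the simulated `tooltip_text` condition merely stands in for a real element lookup.

```python
import time


def wait_until(condition, timeout=10.0, poll=0.5):
    """Call condition() until it returns a truthy value, or raise TimeoutError."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            result = condition()
        except Exception:
            # Treat errors such as "element not found" as "not ready yet"
            result = None
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} s")
        time.sleep(poll)


if __name__ == "__main__":
    started = time.monotonic()

    def tooltip_text():
        # Simulated condition: the "tooltip" appears 0.3 s after the hover
        return "F" if time.monotonic() - started > 0.3 else None

    print(wait_until(tooltip_text, timeout=2, poll=0.05))
```

With Selenium itself, the same effect would normally come from something like `WebDriverWait(driver, 10).until(EC.visibility_of_element_located((By.CSS_SELECTOR, ".LAR-tooltip-site-id > p")))` after each hover. Whether that is enough to replace the one-second sleep on this particular site is untested here.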
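A uj5u.com user replied:
@ThaiNguyen, I am adding another answer so that the earlier one is preserved. I tried a somewhat crude approach and, after many attempts, got it to work, but take it with a grain of salt, because I only iterated over three dates in August. The refactored code is pasted below, but before you look at it, let me explain the problems I ran into so you are aware of them. To let the DOM settle after each action I had to add a lot of sleeps (as you know, time.sleep is very unreliable with asynchronous pages), and even with those waits I saw the code fail with stale element errors; increasing the sleep times took care of them, at least for now. The other thing, which I consider a big problem: even though this code fetched the results successfully, I cannot guarantee that it will do so for every date in August, let alone for all the months you need, because the code behaves very flakily against the rendered DOM. I don't want to blame the code at this point (my knowledge of Selenium is limited), but if I remember correctly the DOM is heavily asynchronous. So what I am saying is that you cannot expect this code to get everything in one go. Instead, you may have to spend time refactoring and improving it, or fetch the data in chunks by running it several times, a few dates of each month at a time, which is very frustrating given how fragile it is.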
from selenium import webdriver
from webdriver_manager.microsoft import EdgeChromiumDriverManager
from selenium.webdriver import ActionChains
from selenium.webdriver.common.by import By
from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import time
driver = webdriver.Edge(EdgeChromiumDriverManager(log_level=20).install())
driver.maximize_window()
def h2s_selection():
    driver.get("https://marathonlosangelesrefineryfencelinemonitoring.com/index.html")
    # Navigate to monitors
    WebDriverWait(driver, 10).until(EC.element_to_be_clickable((By.XPATH, "//div[@class='nav-link-text']"))).click()
    # Navigate to dropdown button
    WebDriverWait(driver, 10).until(EC.element_to_be_clickable((By.XPATH, "//i[@class='arrow-down parameter-arrow']"))).click()
    # Select Hydrogen Sulfide and click
    WebDriverWait(driver, 10).until(EC.element_to_be_clickable((By.XPATH, "//ul[@class='dropdown-menu' and @role='menu' and @aria-labelledby='ParameterDropdown']//li[12]"))).click()
    # Wait until the map markers are visible
    WebDriverWait(driver, 10).until(EC.visibility_of_all_elements_located((By.XPATH, "//div[@class='leaflet-pane leaflet-marker-pane']//div[contains(@class, 'leaflet-marker-icon')]")))
def aug_date():
    # Open the date picker
    driver.find_element_by_css_selector(".arrow-down.date-arrow").click()
    req_month = 'Aug'
    req_year = '2021'
    req_timeline = req_month + " " + req_year
    print(f"Timeline Selected is: {req_timeline}")
    # Step back one month at a time until the picker shows the requested month
    for i in range(11):
        month = driver.find_element(By.XPATH, "//th[@class='month']").text
        if month == req_timeline:
            break
        else:
            driver.find_element(By.XPATH, "//th[@class='prev available']").click()
    dt = ['1', '2', '3']
    for i in dt:
        time.sleep(5)
        each_date = driver.find_element(By.XPATH, "//*[@class='table-condensed']//td[text()=" + i + ']')
        print(f"Date is {each_date.text}")
        each_date.click()
        driver.find_element(By.XPATH, "//*[text()='Apply']").click()
        time.sleep(10)
        tooltips()
        time.sleep(5)
        # Reopen the date picker for the next date
        driver.find_element_by_css_selector(".arrow-down.date-arrow").click()
def tooltips():
    # time.sleep(8)
    res = []
    test = driver.find_elements_by_xpath("//div[@class='leaflet-pane leaflet-marker-pane']//div[contains(@class, 'leaflet-marker-icon')]")
    for ele in test:
        hover = ActionChains(driver).move_to_element(ele)
        hover.perform()
        time.sleep(1)
        try:
            site_id = driver.find_element_by_css_selector(".LAR-tooltip-site-id > p")
            site_name = driver.find_element_by_css_selector(".LAR-tooltip-site-name")
            date = driver.find_element_by_css_selector(".LAR-tooltip-localtime")
            value = driver.find_element_by_css_selector(".LAR-tooltip-data-value")
            unit = driver.find_element_by_css_selector(".LAR-tooltip-data-unit")
            para_mdl = driver.find_element_by_css_selector(".tooltip-parameter-mdl")
            res.append((site_id.text, site_name.text, date.text, value.text, unit.text, para_mdl.text))
        except:
            pass
    print(res)
if __name__ == "__main__":
    h2s_selection()
    aug_date()
Output:
Timeline Selected is: Aug 2021
Date is 1
[('F', 'Point Monitor', '10:55 AM', '0.90', 'ppb', 'MDL: 0.40 ppb'), ('B', 'Point Monitor', '10:55 AM', '1.20', 'ppb', 'MDL: 0.40 ppb'), ('E', 'Point Monitor', '10:55 AM', '1.30', 'ppb', 'MDL: 0.40 ppb'), ('A', 'Point Monitor', '10:55 AM', '0.60', 'ppb', 'MDL: 0.40 ppb')]
Date is 2
[('B', 'Point Monitor', '10:25 PM', '1.70', 'ppb', 'MDL: 0.40 ppb'), ('E', 'Point Monitor', '10:25 PM', '1.90', 'ppb', 'MDL: 0.40 ppb')]
Date is 3
[('F', 'Point Monitor', '9:55 AM', '1.20', 'ppb', 'MDL: 0.40 ppb'), ('B', 'Point Monitor', '9:55 AM', '1.20', 'ppb', 'MDL: 0.40 ppb'), ('E', 'Point Monitor', '9:55 AM', '1.90', 'ppb', 'MDL: 0.40 ppb'), ('A', 'Point Monitor', '9:55 AM', '0.50', 'ppb', 'MDL: 0.40 ppb')]
Process finished with exit code 0
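In the output above, date 2 returned only sites B and E, while the other dates returned four sites. Because the bare `except` silently skips any marker whose tooltip could not be read, it can be useful to check which sites are missing and to retry that date a few times. The sketch below is only one way to do that. `missing_sites` and `scrape_until_complete` are hypothetical helpers, and the A to G site list is taken from the question. A site may also simply have no H2S reading, so a missing entry is not necessarily a failure.

```python
EXPECTED_SITES = ("A", "B", "C", "D", "E", "F", "G")


def missing_sites(rows, expected=EXPECTED_SITES):
    """Return the expected site IDs that do not appear as the first field of any row."""
    seen = {row[0] for row in rows}
    return [site for site in expected if site not in seen]


def scrape_until_complete(scrape_once, expected=EXPECTED_SITES, attempts=3):
    """Call scrape_once() up to `attempts` times and merge the rows by site ID.

    scrape_once is any zero-argument callable returning a list of tuples,
    for example a variant of tooltips() that returns res instead of printing it.
    The first row seen for a site is kept, so timestamps may differ between sites.
    """
    by_site = {}
    for _ in range(attempts):
        for row in scrape_once():
            by_site.setdefault(row[0], row)
        if not missing_sites(by_site.values(), expected):
            break
    return [by_site[site] for site in expected if site in by_site]


if __name__ == "__main__":
    day2 = [
        ('B', 'Point Monitor', '10:25 PM', '1.70', 'ppb', 'MDL: 0.40 ppb'),
        ('E', 'Point Monitor', '10:25 PM', '1.90', 'ppb', 'MDL: 0.40 ppb'),
    ]
    print("Missing on date 2:", missing_sites(day2))
```

If sites are still missing after the last attempt, the date can be logged and rerun later, which fits the chunked approach described in this answer.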
