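I am working on a project that scrapes articles from an archive site. For example, below are an archive URL and the original URL it points to. I only have the archive URL, and I want to use Selenium to extract the original URL.
Archive URL: https://archive.is/xXAoL
Original URL: https://beforeitsnews.com/eu/2021/08/breaking-germany-halts-all-covid-19-vaccines-says-they-are-unsafe-and-no-longer-recommended-2676130.html?fbclid=IwAR3JPcxNHlZ5eQHLyO2teh6_xcrerisBrCNeleOZz7qmxI7_pDJDBlEAIjU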
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
url = "https://archive.is/xXAoL"
driver = webdriver.Chrome('./chromedriver')
driver.get(url)
Any suggestions on how to get the original URL?
Approach 1
One thing that might work is the canonical link, which is
https://archive.is/2021.09.07-145059/https://beforeitsnews.com/eu/2021/08/breaking-germany-halts-all-covid-19-vaccines-says-they-are-unsafe-and-no-longer-recommended-2676130.html?fbclid=IwAR3JPcxNHlZ5eQHLyO2teh6_xcrerisBrCNeleOZz7qmxI7_pDJDBlEAIjU
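I could strip everything before the second https. However, that approach is not working for me, so I am looking for another method that does not rely on the meta tag.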
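For reference, the string-stripping step described above could look roughly like the sketch below. It covers only the string handling and assumes the canonical link has already been read from the page and has the shape shown above (archive host, a timestamp segment, then the original URL). The function name `original_from_canonical` is made up for illustration.

```python
import re
from typing import Optional


def original_from_canonical(canonical: str) -> Optional[str]:
    """Return everything from the second 'http(s)://' onward, or None if there is no second one.

    Hypothetical helper: assumes the original URL is embedded verbatim
    after the archive prefix, as in the canonical link quoted above.
    """
    # Search from index 1 so the scheme at the very start of the string is skipped.
    match = re.search(r"https?://", canonical[1:])
    if match is None:
        return None
    # match.start() is relative to the sliced string, so shift it by 1.
    return canonical[match.start() + 1:]
```

If the canonical link is missing or has a different shape, this returns None or the wrong substring, which may be why the approach is unreliable in practice.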
Answer from a uj5u.com user:
To extract the original URL, you need to induce WebDriverWait for visibility_of_element_located(), and you can use either of the following Locator Strategies:
Using CSS_SELECTOR:
driver.get('https://archive.is/xXAoL')
print(WebDriverWait(driver, 20).until(EC.visibility_of_element_located((By.CSS_SELECTOR, "input[name='q'][value]"))).get_attribute("value"))
Using XPATH:
driver.get('https://archive.is/xXAoL')
print(WebDriverWait(driver, 20).until(EC.visibility_of_element_located((By.XPATH, "//input[@name='q'][@value]"))).get_attribute("value"))
Console output:
https://beforeitsnews.com/eu/2021/08/breaking-germany-halts-all-covid-19-vaccines-says-they-are-unsafe-and-no-longer-recommended-2676130.html?fbclid=IwAR3JPcxNHlZ5eQHLyO2teh6_xcrerisBrCNeleOZz7qmxI7_pDJDBlEAIjU
Note: you have to add the following imports:
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
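If the page HTML is already in hand (for example from `driver.page_source`), the same `input[name='q']` value can probably be read without a live element lookup. The sketch below uses only the standard library. It assumes the archive page keeps the original URL in the `value` attribute of an input named `q`, as the locators above imply. The class and function names are invented for illustration.

```python
from html.parser import HTMLParser
from typing import Optional


class QueryInputParser(HTMLParser):
    """Records the value attribute of the first <input name="q" value="..."> tag seen."""

    def __init__(self) -> None:
        super().__init__()
        self.value: Optional[str] = None

    def handle_starttag(self, tag, attrs):
        # Only look at <input> tags, and stop after the first match.
        if tag != "input" or self.value is not None:
            return
        attributes = dict(attrs)
        if attributes.get("name") == "q" and attributes.get("value") is not None:
            self.value = attributes["value"]


def original_url_from_html(html: str) -> Optional[str]:
    """Hypothetical helper: return the value of input[name='q'], or None if absent."""
    parser = QueryInputParser()
    parser.feed(html)
    return parser.value
```

A call such as `original_url_from_html(driver.page_source)` would then stand in for the `get_attribute("value")` lookup, under the same assumption about the page layout.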
