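I want to extract the full news article from links like this one: https://www.reuters.com/world/europe/navalny-allies-accuse-telegram-censorship-russian-election-2021-09-18/ The code below collects the links; now I want to get the article for each link. I can't work out an XPath that does this. The article body is split across multiple <p> tags, and I'm not sure how to handle that.
!pip install selenium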
!apt-get update
!apt install chromium-chromedriver
from selenium import webdriver
import time

chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument('--headless')
chrome_options.add_argument('--no-sandbox')
chrome_options.add_argument('--disable-dev-shm-usage')
driver = webdriver.Chrome('chromedriver', chrome_options=chrome_options)
driver.maximize_window()
driver.implicitly_wait(10)
driver.get("https://www.reuters.com/companies/AAPL.O")
links = []
i = 0
try:
    while True:
        # Re-query on every pass: scrolling loads more news items.
        news = driver.find_elements_by_xpath("//div[@class='item']")
        driver.execute_script("arguments[0].scrollIntoView(true);", news[i])
        # Stop once the items are a year old.
        if news[i].find_element_by_tag_name("time").get_attribute("innerText") == "a year ago":
            break
        links.append(news[i].find_element_by_tag_name("a").get_attribute("href"))
        i += 1
        time.sleep(.5)
except:
    pass

driver.quit()
#links
Answer:
Try this XPath:
//div[contains(@class,'Article__container')]/div/div[2]/p
#This would give all the paragraphs.
paragraphs = driver.find_elements_by_xpath("//div[contains(@class,'Article__container')]/div/div[2]/p")
for para in paragraphs:
    print(para.get_attribute("innerText"))
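To get one article per collected link, the same XPath can be applied after loading each URL and the <p> texts joined into a single string. The following is only a minimal sketch: the helper names extract_article and extract_articles are hypothetical, it assumes the XPath above still matches the Reuters article markup, and it uses the generic find_elements("xpath", ...) call (the string "xpath" is the value of By.XPATH) instead of find_elements_by_xpath.

```python
# XPath taken from the answer above; it may stop matching if the site changes.
ARTICLE_XPATH = "//div[contains(@class,'Article__container')]/div/div[2]/p"


def extract_article(driver, url, xpath=ARTICLE_XPATH):
    """Load `url` and return the article body as a single string.

    Hypothetical helper: `driver` is any Selenium WebDriver instance.
    """
    driver.get(url)
    # "xpath" is the string value of By.XPATH, so no extra import is needed.
    paragraphs = driver.find_elements("xpath", xpath)
    texts = (p.get_attribute("innerText") or "" for p in paragraphs)
    # Drop empty paragraphs and separate the rest with blank lines.
    return "\n\n".join(t.strip() for t in texts if t.strip())


def extract_articles(driver, links):
    """Map each link to its article text, visiting the links in order."""
    return {url: extract_article(driver, url) for url in links}
```

With the question's code, this would be called as articles = extract_articles(driver, links), and it has to run before driver.quit(), since the browser session is needed to load each page.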
