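I have seen a few posts about the same problem, but their scripts usually wait until one of the elements (a button) is clickable. This is the table I am trying to scrape: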
https://ropercenter.cornell.edu/presidential-approval/highslows
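On the first few attempts, my code returned every row but left out the two polling organization columns. Without any changes to the code, it now scrapes only the table header and the tbody tag (no table rows).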
from bs4 import BeautifulSoup
from selenium import webdriver
import pandas as pd

url = "https://ropercenter.cornell.edu/presidential-approval/highslows"
driver = webdriver.Firefox()
driver.get(url)
driver.implicitly_wait(12)
# Parse whatever HTML the browser holds at this moment
soup = BeautifulSoup(driver.page_source, 'lxml')
table = soup.find_all('table')
approvalData = pd.read_html(str(table[0]))
approvalData = pd.DataFrame(approvalData[0], columns=['President', 'Highest %', 'Polling Organization & Dates H', 'Lowest %', 'Polling Organization & Dates L'])
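Should I use an explicit wait? If so, which condition should I wait for, given that the dynamic table is not interactive?
Also, why does the output of my code change across repeated runs?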
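Reply from a uj5u.com user:
This may be a bit of a cheat, but it is a simpler solution that does solve your problem by another route: look at what the front end does (using the browser's developer tools) and you will find that it calls an API that returns JSON, so Selenium is not really needed. requests and pandas are enough.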
import requests
import pandas as pd
url = "https://ropercenter.cornell.edu/presidential-approval/api/presidents/highlow"
data = requests.get(url).json()
df = pd.json_normalize(data)  # pd.io.json.json_normalize is deprecated since pandas 1.0
>>> df
president.id president.active president.surname president.givenname president.shortname ... low.approve low.disapprove low.noOpinion low.sampleSize low.presidentName
0 e9c0d19b-dfe9-49cf-9939-d06a0f256e57 True Biden Joe None ... 33 53 13 1313.0 Joe Biden
1 bc9855d5-8e97-4448-b62e-1fb2865c79e6 True Trump Donald None ... 29 68 3 5360.0 Donald Trump
2 1c49881f-0f0c-4a53-9b2c-0dd6540f88e4 True Obama Barack None ... 37 57 5 1017.0 Barack Obama
3 ceda6415-5975-404d-8049-978758a7d1f8 True Bush George W. W. Bush ... 19 77 4 1100.0 George W. Bush
4 4f7344de-a7bd-4bc6-9147-87963ae51095 True Clinton Bill None ... 36 50 14 800.0 Bill Clinton
5 116721f1-f947-4c14-b0b5-d521ed5a4c8b True Bush George H.W. H.W. Bush ... 29 60 11 1001.0 George H.W. Bush
6 43720f8f-0b9f-43b0-8c0d-63da059e7a57 True Reagan Ronald None ... 35 56 9 1555.0 Ronald Reagan
7 7aa76fd3-e1bc-4e9a-b13c-463a64e0c864 True Carter Jimmy None ... 28 59 13 1542.0 Jimmy Carter
8 6255dd77-531d-46c6-bb26-627e2a4b3654 True Ford Gerald None ... 37 39 24 1519.0 Gerald Ford
9 f1a23b06-4200-41e6-b137-dd46260ac4d8 True Nixon Richard None ... 23 55 22 1589.0 Richard Nixon
10 772aabfd-289b-4f10-aaae-81a82dd3dbc6 True Johnson Lyndon B. None ... 35 52 13 1526.0 Lyndon B. Johnson
11 d849b5a8-f711-4ac9-9728-c3915e17bb6a True Kennedy John F. None ... 56 30 14 1550.0 John F. Kennedy
12 e22fd64a-cf20-4bc4-8db6-b4e71dc4483d True Eisenhower Dwight D. None ... 48 36 16 NaN Dwight D. Eisenhower
13 ab0bfa04-61da-49d1-8069-6992f6124f17 True Truman Harry S. None ... 22 65 13 NaN Harry S. Truman
14 11edf04f-9d8d-4678-976d-b9339b46705d True Roosevelt Franklin D. None ... 48 43 8 NaN Franklin D. Roosevelt
[15 rows x 41 columns]
>>> df.columns
Index(['president.id', 'president.active', 'president.surname',
'president.givenname', 'president.shortname', 'president.fullname',
'president.number', 'president.terms', 'president.ratings',
'president.termCount', 'president.ratingCount', 'high.id',
'high.active', 'high.organization.id', 'high.organization.active',
'high.organization.name', 'high.organization.ratingCount',
'high.pollingStart', 'high.pollingEnd', 'high.updated',
'high.president', 'high.approve', 'high.disapprove', 'high.noOpinion',
'high.sampleSize', 'high.presidentName', 'low.id', 'low.active',
'low.organization.id', 'low.organization.active',
'low.organization.name', 'low.organization.ratingCount',
'low.pollingStart', 'low.pollingEnd', 'low.updated', 'low.president',
'low.approve', 'low.disapprove', 'low.noOpinion', 'low.sampleSize',
'low.presidentName'],
dtype='object')
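To get from the 41 flattened columns back to something like the five-column table on the page, you can pick the relevant columns and rename them. The following is only a minimal sketch: the `sample` records are made up for illustration (hypothetical names, values, and date strings) and merely mirror the nesting implied by the column names above, and `highs_lows_table` is a helper name introduced here, not part of any library.

```python
import pandas as pd

# Hypothetical records shaped like the nested JSON implied by df.columns above.
# Names, numbers, and the date format are placeholders, not real API output.
sample = [
    {
        "president": {"fullname": "President A"},
        "high": {"approve": 70, "organization": {"name": "Org X"},
                 "pollingStart": "2001-01-01", "pollingEnd": "2001-01-03"},
        "low": {"approve": 30, "organization": {"name": "Org Y"},
                "pollingStart": "2003-05-01", "pollingEnd": "2003-05-04"},
    },
]

# Flattened column name -> label used in the on-page table
COLUMNS = {
    "president.fullname": "President",
    "high.approve": "Highest %",
    "high.organization.name": "Polling Organization (high)",
    "low.approve": "Lowest %",
    "low.organization.name": "Polling Organization (low)",
}


def highs_lows_table(records):
    """Flatten the nested records and keep only the renamed columns."""
    df = pd.json_normalize(records)
    return df[list(COLUMNS)].rename(columns=COLUMNS)


table = highs_lows_table(sample)
print(table)
```

With the real response you would pass `requests.get(url).json()` instead of `sample`, assuming the API still returns the fields listed above.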
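Reply from a uj5u.com user:
To extract the table contents from the website using only Selenium, GeckoDriver, and Firefox, you need to induce WebDriverWait for visibility_of_element_located() and then load the result into a Pandas DataFrame. You can use the following locator strategy: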
Code block:
from selenium import webdriver
from selenium.webdriver.firefox.options import Options
from selenium.webdriver.firefox.service import Service
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
import pandas as pd

options = Options()
options.add_argument('--disable-blink-features=AutomationControlled')
s = Service('C:\\BrowserDrivers\\geckodriver.exe')
driver = webdriver.Firefox(service=s, options=options)
driver.get('https://ropercenter.cornell.edu/presidential-approval/highslows')
# Wait up to 20 seconds for the table to become visible, then take its HTML
tabledata = WebDriverWait(driver, 20).until(EC.visibility_of_element_located((By.XPATH, "//table[@class='table table-striped']"))).get_attribute("outerHTML")
tabledf = pd.read_html(tabledata)
print(tabledf)
driver.quit()
Console output:
[                President Highest %  ...  Lowest %  Polling Organization & Dates.1
0                Joe Biden       63%  ...       33%  Quinnipiac UniversityJan 7th, 2022 - Jan 10th,...
1             Donald Trump       49%  ...       29%  PewJan 8th, 2021 - Jan 12th, 2021
2             Barack Obama       76%  ...       37%  Gallup OrganizationSep 8th, 2011 - Sep 11th, 2011
3           George W. Bush       92%  ...       19%  American Research GroupFeb 16th, 2008 - Feb 19...
4             Bill Clinton       73%  ...       36%  Yankelovich Partners / TIME / CNNMay 26th, 199...
5         George H.W. Bush       89%  ...       29%  Gallup OrganizationJul 31st, 1992 - Aug 2nd, 1992
6            Ronald Reagan       68%  ...       35%  Gallup OrganizationJan 28th, 1983 - Jan 31st, ...
7             Jimmy Carter       75%  ...       28%  Gallup OrganizationJun 29th, 1979 - Jul 2nd, 1979
8              Gerald Ford       71%  ...       37%  Gallup OrganizationJan 10th, 1975 - Jan 13th, ...
9            Richard Nixon       70%  ...       23%  Gallup OrganizationJan 4th, 1974 - Jan 7th, 1974
10       Lyndon B. Johnson       80%  ...       35%  Gallup OrganizationAug 7th, 1968 - Aug 12th, 1968
11         John F. Kennedy       83%  ...       56%  Gallup OrganizationSep 12th, 1963 - Sep 17th, ...
12    Dwight D. Eisenhower       78%  ...       48%  Gallup OrganizationMar 27th, 1958 - Apr 1st, 1958
13         Harry S. Truman       87%  ...       22%  Gallup OrganizationFeb 9th, 1952 - Feb 14th, 1952
14   Franklin D. Roosevelt       84%  ...       48%  Gallup OrganizationAug 18th, 1939 - Aug 24th, ...
[15 rows x 5 columns]]
