How do I scrape all the artist names from a table using Selenium and Python?

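I'm trying to scrape a website that lists the top 1000 artists and append their names to a list, so that I can later run lyric analysis by searching for each artist's name. The site has an option to show all 1000 artists at once, so I use Selenium to select that option. From there, I find the artist names and put them in a list of WebElements. I then iterate over that list, read each element's text, and append it to my own list. After a certain number of artists, the program keeps throwing StaleElementReferenceException, as shown below.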

[Screenshot of the error output; image not available]

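I've tried many of the suggested fixes, such as wait-until statements or try/except blocks, but they all ended with the program crashing. Most of the solutions I've seen deal with the exception being raised while clicking or otherwise interacting with a web element, but I don't change anything on the page after selecting the option, so I'm not sure where I'm going wrong. I'm fairly new to Selenium, so I'm also not sure whether this is the best way to get the artist names. Any help would be appreciated.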

My code:

from selenium import webdriver
from selenium.webdriver.support.ui import Select
from selenium.webdriver.common.by import By
from webdriver_manager.chrome import ChromeDriverManager

driver = webdriver.Chrome(ChromeDriverManager().install())
driver.get('https://chartmasters.org/most-streamed-artists-ever-on-spotify/')

try:
    # get the select tag
    select = Select(driver.find_element(By.CSS_SELECTOR, '#table_1_length > label > div > select'))
    # select by value (select All option to get all 1000 artists)
    select.select_by_value('-1')

    all_artists = []
    all_artists_references = driver.find_elements(By.CLASS_NAME, 'bolded.column-artist-name')

    for element in all_artists_references:
        print(element.text)
        all_artists.append(element.text)

    print(all_artists)

finally:
    driver.quit()

A uj5u.com user replied:

To extract and print all 1000 artist names, you need to induce WebDriverWait for visibility_of_all_elements_located() and collect the text with a list comprehension. You can use either of the following locator strategies:

Using CSS_SELECTOR:

print([my_elem.text for my_elem in WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR, "table#table_1 tbody tr[role='row'] td:nth-of-type(2)")))])

Using XPATH:

print([my_elem.text for my_elem in WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.XPATH, "//table[@id='table_1']//tbody//tr[@role='row']/td[2]")))])

Note: You have to add the following imports:

from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
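One plausible cause of the exception, not confirmed anywhere in this thread, is that the table is still being redrawn after the "All" option is selected, so references collected a moment earlier stop pointing at live nodes. If that is what happens, re-locating the elements and retrying is a common workaround. The sketch below shows only that retry pattern; `read_texts` and its parameters are hypothetical names introduced for illustration, not part of Selenium.

```python
import time


def read_texts(locate, retry_on, attempts=3, pause=0.5):
    """Return the .text of every element produced by locate().

    locate   -- zero-argument callable returning a fresh list of elements
    retry_on -- exception type that signals the references went stale
    attempts -- how many times to try in total (must be at least 1)
    pause    -- seconds to sleep between attempts
    """
    for attempt in range(1, attempts + 1):
        try:
            # Re-locate on every attempt so no old references are reused.
            return [element.text for element in locate()]
        except retry_on:
            if attempt == attempts:
                raise  # out of attempts: let the caller see the error
            time.sleep(pause)  # give the page time to finish redrawing
```

With Selenium this could be called as `read_texts(lambda: driver.find_elements(By.CSS_SELECTOR, "table#table_1 tbody tr[role='row'] td:nth-of-type(2)"), StaleElementReferenceException)`, where `StaleElementReferenceException` is imported from `selenium.common.exceptions`. The helper takes the exception type as an argument so it stays independent of Selenium itself.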

A uj5u.com user replied:

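The form payload needed to request this exact table is rather lengthy, but fetching the data directly from the source is more efficient.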

import requests
import pandas as pd

url = 'https://chartmasters.org/wp-admin/admin-ajax.php'
params = {
    'action': 'get_wdtable',
    'table_id': '1'}
data = {
'draw': '1',
'columns[0][data]': '0',
'columns[0][name]': 'rank',
'columns[0][searchable]': 'true',
'columns[0][orderable]': 'false',
'columns[0][search][value]': '',
'columns[0][search][regex]': 'false',
'columns[1][data]': '1',
'columns[1][name]': 'Artist Name',
'columns[1][searchable]': 'true',
'columns[1][orderable]': 'false',
'columns[1][search][value]': '',
'columns[1][search][regex]': 'false',
'columns[2][data]': '2',
'columns[2][name]': 'Lead Streams',
'columns[2][searchable]': 'true',
'columns[2][orderable]': 'true',
'columns[2][search][value]': '',
'columns[2][search][regex]': 'false',
'columns[3][data]': '3',
'columns[3][name]': 'Featured Streams',
'columns[3][searchable]': 'true',
'columns[3][orderable]': 'true',
'columns[3][search][value]': '',
'columns[3][search][regex]': 'false',
'columns[4][data]': '4',
'columns[4][name]': 'Tracks',
'columns[4][searchable]': 'true',
'columns[4][orderable]': 'true',
'columns[4][search][value]': '',
'columns[4][search][regex]': 'false',
'columns[5][data]': '5',
'columns[5][name]': '1b ',
'columns[5][searchable]': 'true',
'columns[5][orderable]': 'true',
'columns[5][search][value]': '',
'columns[5][search][regex]': 'false',
'columns[6][data]': '6',
'columns[6][name]': '100m ',
'columns[6][searchable]': 'true',
'columns[6][orderable]': 'true',
'columns[6][search][value]': '',
'columns[6][search][regex]': 'false',
'columns[7][data]': '7',
'columns[7][name]': '10m ',
'columns[7][searchable]': 'true',
'columns[7][orderable]': 'true',
'columns[7][search][value]': '',
'columns[7][search][regex]': 'false',
'columns[8][data]': '8',
'columns[8][name]': '1m ',
'columns[8][searchable]': 'true',
'columns[8][orderable]': 'true',
'columns[8][search][value]': '',
'columns[8][search][regex]': 'false',
'columns[9][data]': '9',
'columns[9][name]': 'Last Update',
'columns[9][searchable]': 'true',
'columns[9][orderable]': 'true',
'columns[9][search][value]': '',
'columns[9][search][regex]': 'false',
'order[0][column]': '2',
'order[0][dir]': 'desc',
'start': '0',
'length': '9999',
'search[value]': '',
'search[regex]': 'false',
# wdtNonce was copied from the page when this answer was written; it may need refreshing
'wdtNonce': '64ac23afe1'}


cols = []
for k, v in data.items():
    if 'name' in k:
        cols.append(v)

jsonData = requests.post(url, params=params, data=data).json()
df = pd.DataFrame(jsonData['data'], columns=cols)
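As an aside, the payload above repeats the same six fields for each of the ten columns, which matches the DataTables server-side request format. A minimal sketch of generating it from a list of column names follows. `build_datatables_payload` and its parameters are hypothetical names introduced here, and the defaults (columns 0 and 1 not orderable, ordering by column 2 descending) are simply copied from the hand-written dictionary.

```python
def build_datatables_payload(column_names, nonce, unorderable=(0, 1), length=9999):
    """Build the form fields for a DataTables-style server-side request."""
    payload = {"draw": "1"}
    for i, name in enumerate(column_names):
        prefix = f"columns[{i}]"
        payload[f"{prefix}[data]"] = str(i)
        payload[f"{prefix}[name]"] = name
        payload[f"{prefix}[searchable]"] = "true"
        # Columns listed in `unorderable` are sent with orderable=false.
        payload[f"{prefix}[orderable]"] = "false" if i in unorderable else "true"
        payload[f"{prefix}[search][value]"] = ""
        payload[f"{prefix}[search][regex]"] = "false"
    # Trailing fields: sort order, paging window, global search, and the nonce.
    payload.update({
        "order[0][column]": "2",
        "order[0][dir]": "desc",
        "start": "0",
        "length": str(length),
        "search[value]": "",
        "search[regex]": "false",
        "wdtNonce": nonce,
    })
    return payload


# Column names exactly as they appear in the dictionary above,
# including the trailing spaces in some of them.
column_names = ["rank", "Artist Name", "Lead Streams", "Featured Streams",
                "Tracks", "1b ", "100m ", "10m ", "1m ", "Last Update"]
data = build_datatables_payload(column_names, nonce="64ac23afe1")
```

The resulting `data` has the same keys and values as the literal dictionary, so it can be passed to the same `requests.post` call, and `column_names` can be used directly in place of the `cols` loop.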

Output:

print(df)
     rank    Artist Name    Lead Streams  ... 10m   1m  Last Update
0       1          Drake  45,625,377,884  ...  241  244    29.03.22
1       2     Ed Sheeran  34,724,649,138  ...  165  199    29.03.22
2       3      Bad Bunny  33,419,082,838  ...  134  140    29.03.22
3       4     The Weeknd  30,455,269,996  ...  143  161    29.03.22
4       5  Ariana Grande  30,021,891,319  ...  126  175    29.03.22
..    ...            ...             ...  ...  ...  ...         ...
995   996          HONNE   1,229,848,408  ...   29   85    18.12.21
996   997  Darius Rucker   1,229,826,891  ...   14   77    28.03.22
997   998       King Von   1,224,925,368  ...   34   68    14.03.22
998   999        JP Saxe   1,224,510,818  ...   13   30    24.03.22
999  1000        Showtek   1,223,338,892  ...   19   69    26.02.21

[1000 rows x 10 columns]
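Since the original goal was a plain list of artist names, one column can be pulled out of the response rows. The sketch below assumes each row in `jsonData['data']` is a list of values in column order, which is what the `pd.DataFrame(jsonData['data'], columns=cols)` call implies. `column_values` is a hypothetical helper name, and the sample rows are copied from the output above and cut down to three columns.

```python
def column_values(rows, columns, name):
    """Return the values of one named column from a list of row lists."""
    index = columns.index(name)  # raises ValueError if the column is unknown
    return [row[index] for row in rows]


# Sample rows taken from the printed output, truncated to three columns.
columns = ["rank", "Artist Name", "Lead Streams"]
rows = [
    ["1", "Drake", "45,625,377,884"],
    ["2", "Ed Sheeran", "34,724,649,138"],
    ["3", "Bad Bunny", "33,419,082,838"],
]
all_artists = column_values(rows, columns, "Artist Name")
print(all_artists)
```

Against the real response this would be `column_values(jsonData['data'], cols, 'Artist Name')`. With the DataFrame already built, `df['Artist Name'].tolist()` gives the same list.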
