We are scraping the main table on this page - https://www.metacritic.com/browse/albums/release-date/available/date?view=detailed - and have the following code to scrape it:
import requests
from bs4 import BeautifulSoup
# grab page and soup it
headers = { 'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/94.0.4606.71 Safari/537.36' }
metacritic_url = 'https://www.metacritic.com/browse/albums/release-date/available/date?view=detailed'
metacritic_page = requests.get(metacritic_url, headers=headers)
metacritic_soup = BeautifulSoup(metacritic_page.text, "html.parser")
# extract scores from page
all_trs = metacritic_soup.find_all('tr')
Every other tr element in all_trs is an empty tr with the class spacer.
all_trs[0] # not empty
all_trs[1] # empty tr
The type of all_trs is bs4.element.ResultSet. How can we filter all_trs to drop the tr elements that have the class spacer and keep all the other elements?
Answer from a uj5u.com user:
Filter while selecting
Just select the <tr> elements that do not have the class spacer:
metacritic_soup.select('tr:not(.spacer)')
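To try the selector without fetching the live page, here is a minimal sketch. The sample_html markup and the album names are made up for illustration and only assume that spacer rows look like <tr class="spacer">; the real page's markup may differ.

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for the real page (an assumption, not the actual HTML)
sample_html = """
<table>
  <tr><td>Album A</td></tr>
  <tr class="spacer"></tr>
  <tr><td>Album B</td></tr>
  <tr class="spacer"></tr>
</table>
"""

sample_soup = BeautifulSoup(sample_html, "html.parser")

# The CSS :not() pseudo-class skips every <tr> carrying the class "spacer"
content_rows = sample_soup.select("tr:not(.spacer)")

row_texts = [tr.get_text(strip=True) for tr in content_rows]
print(row_texts)
```

Because the selector looks at the class rather than the position, it should keep working even if the spacer rows stop alternating strictly with the content rows.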
Filter the ResultSet
If the class spacer is on every second <tr>, use list slicing; the step of 2 keeps every second element, starting from the first:
metacritic_soup.select('tr')[::2]
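Slicing depends on the rows alternating strictly. If that is not guaranteed, one alternative is to filter the existing ResultSet by class with a list comprehension. The sketch below uses made-up markup (sample_html is hypothetical) in which two content rows sit next to each other, to show where the two approaches diverge.

```python
from bs4 import BeautifulSoup

# Hypothetical markup where the rows do NOT alternate strictly (an assumption for illustration)
sample_html = """
<table>
  <tr><td>Album A</td></tr>
  <tr class="spacer"></tr>
  <tr><td>Album B</td></tr>
  <tr><td>Album C</td></tr>
  <tr class="spacer"></tr>
</table>
"""

sample_soup = BeautifulSoup(sample_html, "html.parser")
all_trs = sample_soup.find_all("tr")

# Keep only rows whose class list does not contain "spacer";
# rows with no class attribute fall back to an empty list
kept_rows = [tr for tr in all_trs if "spacer" not in tr.get("class", [])]
kept_texts = [tr.get_text(strip=True) for tr in kept_rows]

# Slicing picks positions 0, 2, 4, so here it drops Album C and keeps a spacer row
sliced_texts = [tr.get_text(strip=True) for tr in all_trs[::2]]

print(kept_texts)
print(sliced_texts)
```

Note that the list comprehension returns a plain list rather than a bs4.element.ResultSet, which is usually fine for iterating over the rows afterwards.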
