How do I extract href links from a table using BeautifulSoup?

I am trying to build a list of all the football teams and their links from any one of the several tables on the base URL: https://fbref.com/en/comps/10/stats/Championship-Stats

I will then use the links from the href attributes to scrape the data for each team. The href is embedded in a th tag, as shown below.

<th scope="row" class="left" data-stat="squad"><a href="/en/squads/293cb36b/Barnsley-Stats">Barnsley</a></th>

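The code below gives me a list of the th elements, each of which wraps the "a" tag I need:

import requests
from bs4 import BeautifulSoup

page = "https://fbref.com/en/comps/10/Championship-Stats"
pageTree = requests.get(page)
pageSoup = BeautifulSoup(pageTree.content, 'html.parser')
# every <th> whose class is "left"
Teams = pageSoup.find_all("th", {"class": "left"})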
Output (one entry for each th with class "left"):

<th class="left" data-stat="squad" scope="row"><a href="/en/squads/293cb36b/Barnsley-Stats">Barnsley</a></th>,

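I have tried following the guidance in an earlier Stack question (extracting the link after a th in BeautifulSoup). However, the code below, which is based on that thread, raises an error:
AttributeError: 'NoneType' object has no attribute 'find_parent'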

import requests
from bs4 import BeautifulSoup

def import_TeamList():
    BASE_URL = "https://fbref.com/en/comps/10/Championship-Stats"
    r = requests.get(BASE_URL)
    soup = BeautifulSoup(r.text, 'lxml')
    team_list = []
    # the AttributeError is raised on the next line: data-stat is set on
    # the <th>, not on the <a>, so soup.find(...) returns None
    team_tr = soup.find('a', {'data-stat': 'squad'}).find_parent('tr')
    for tr in team_tr.find_next_siblings('tr'):
        if tr.find('a').text != 'squad':
            break
        team_list.append(BASE_URL + tr.find('a')['href'])
    return team_list
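The error is consistent with the markup quoted earlier in the question: data-stat="squad" is an attribute of the th, so a search for an "a" tag carrying that attribute finds nothing. The sketch below reproduces this on a small inline sample and then collects the hrefs by matching the th first. The sample HTML is an assumption modeled on the snippet in the question; the header row and the second team (Example) are hypothetical, and the real page may be structured differently.

```python
from bs4 import BeautifulSoup

# Assumed sample markup, modeled on the <th> snippet quoted in the question.
# The header row and the "Example" team are hypothetical.
SAMPLE_HTML = """
<table>
  <tr><th scope="col" class="left" data-stat="squad">Squad</th></tr>
  <tr><th scope="row" class="left" data-stat="squad"><a href="/en/squads/293cb36b/Barnsley-Stats">Barnsley</a></th></tr>
  <tr><th scope="row" class="left" data-stat="squad"><a href="/en/squads/aaaaaaaa/Example-Stats">Example</a></th></tr>
</table>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# data-stat sits on the <th>, so no <a> matches and find() returns None.
# Calling .find_parent() on that None is what raises the AttributeError.
missing = soup.find("a", {"data-stat": "squad"})
print(missing)

# Match the <th> elements instead, then look for an <a> inside each one.
team_hrefs = []
for th in soup.find_all("th", {"data-stat": "squad"}):
    link = th.find("a")
    # Header cells have no link, so skip anything without an <a href=...>.
    if link is not None and link.has_attr("href"):
        team_hrefs.append(link["href"])

print(team_hrefs)
```

Checking for None before using the result of find() keeps a single unexpected cell from aborting the whole loop.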

A uj5u.com user replied:

Here is a version that uses CSS selectors, which I find simpler than most other approaches.

import requests
from bs4 import BeautifulSoup

url = 'https://fbref.com/en/comps/10/stats/Championship-Stats'
data = requests.get(url).text
# parse once and name the parser explicitly
soup = BeautifulSoup(data, 'html.parser')
# every <a> nested inside a <th>
links = soup.select('th a')
urls = [link['href'] for link in links]
print(urls)
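The hrefs in the question are site-relative (they start with /en/squads/), so they need to be resolved against the page URL before they can be fetched. Below is a minimal sketch of one way to do that with urllib.parse.urljoin, using a narrower selector and dropping repeats. The function name team_urls and the two-table sample are hypothetical, and the sketch assumes the same team link can appear in more than one table on the page.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup


def team_urls(html, page_url):
    """Return absolute, de-duplicated team URLs found in squad header cells."""
    soup = BeautifulSoup(html, "html.parser")
    # Only <a href=...> tags inside <th data-stat="squad"> cells.
    links = soup.select('th[data-stat="squad"] a[href]')
    # urljoin resolves a site-relative href against the page URL.
    absolute = [urljoin(page_url, a["href"]) for a in links]
    # dict.fromkeys drops repeats while keeping first-seen order.
    return list(dict.fromkeys(absolute))


# Hypothetical sample: the same team listed in two tables.
SAMPLE_HTML = """
<table><tr><th data-stat="squad"><a href="/en/squads/293cb36b/Barnsley-Stats">Barnsley</a></th></tr></table>
<table><tr><th data-stat="squad"><a href="/en/squads/293cb36b/Barnsley-Stats">Barnsley</a></th></tr></table>
"""

result = team_urls(SAMPLE_HTML, "https://fbref.com/en/comps/10/stats/Championship-Stats")
print(result)
```

Plain string concatenation of the page URL and the href would keep the /en/comps/10/... part of the path, which is why urljoin is used here instead.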

A uj5u.com user replied:

Is this what you are looking for?

import requests
from bs4 import BeautifulSoup as BS
from lxml import etree

with requests.Session() as session:
    r = session.get('https://fbref.com/en/comps/10/stats/Championship-Stats')
    r.raise_for_status()
    # let BeautifulSoup clean up the markup, then hand it to lxml for XPath
    dom = etree.HTML(str(BS(r.text, 'lxml')))
    # <a> elements that are direct children of <th class="left">
    for a in dom.xpath('//th[@class="left"]/a'):
        print(a.attrib['href'])
