How can I get table headers with Python and BeautifulSoup when they are not all text?

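Hi everyone,

I'm basically new to coding, so please go easy on me.

I'm trying to retrieve the header row of the table on the Transfermarkt page requested in the code below.

First I tried pandas, but I couldn't get my data, so I learned about Beautiful Soup and tried my luck with it.

The thing is, some of the headers are plain text, which I can easily get with: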

from bs4 import BeautifulSoup as bs
import requests

url = 'https://www.transfermarkt.co.uk/manchester-united-fc/leistungsdaten/verein/985/reldata/&2022/plus/1'

headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'}

response = requests.get(url, headers=headers)

response.content

soup = bs(response.content, 'html.parser')

soup.prettify().splitlines()

tabela_equipa = soup.find('table', {'class': 'items'} )

headers_tabela = [th.text.encode("utf-8") for th in tabela_equipa.select("tr th")]

print(headers_tabela)

Output: [b'#', b'player', b'Age', b'Nat.', b'In squad', b'\xc2\xa0', b'\xc2\xa0', b'\xc2\xa0', b'\xc2\xa0', b'\xc2\xa0', b'\xc2\xa0', b'\xc2\xa0', b'\xc2\xa0', b'PPG', b'\xc2\xa0']

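The problem is that most of these headers are icons, and the information I need is actually in the title attribute of a span. That's where I'm stuck: I can't find anywhere how to get all of this information to build my table headers, so that I can then scrape the rest of the table.

Does anyone know a way to do this? I tried for 4 days without success before posting here.

Then I tried to get all the spans with the following code: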

thead = soup.thead
Theaders = thead.find_all('span')
print(Theaders)

Output:

[<span class="icons_sprite icon-einsaetze-table-header sort-link-icon" title="Appearances"> </span>, <span class="icons_sprite icon-tor-table-header sort-link-icon" title="Goals"> </span>, <span class="icons_sprite icon-vorlage-table-header sort-link-icon" title="Assists"> </span>, <span class="icons_sprite icon-gelbekarte-table-header sort-link-icon" title="Yellow cards"> </span>, <span class="icons_sprite icon-gelbrotekarte-table-header sort-link-icon" title="Second yellow cards"> </span>, <span class="icons_sprite icon-rotekarte-table-header sort-link-icon" title="Red cards"> </span>, <span class="icons_sprite icon-einwechslungen-table-header sort-link-icon" title="Substitutions on"> </span>, <span class="icons_sprite icon-auswechslungen-table-header sort-link-icon" title="Substitutions off"> </span>, <span class="icons_sprite icon-minuten-table-header sort-link-icon" title="Minutes played"> </span>]

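That's close, I think, since I can see that all the information I need is there. But then I hit a wall: I can get one span title, but not all of them in a list:

thead = soup.thead
Theaders = thead.find('span')['title']
print(Theaders)

Output: Appearances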

thead = soup.thead
Theaders = thead.find_all('span')['title']
print(Theaders)

Output:

---> 23 Theaders = thead.find_all('span')['title']
     24 print(Theaders)

TypeError: list indices must be integers or slices, not str

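Even if that worked, I would still have the problem that the titles are not in the same order as in the original table.

Maybe I'm just being stupid, but any help would be greatly appreciated.

A uj5u.com user replied: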

from bs4 import BeautifulSoup
import requests

url = 'https://www.transfermarkt.co.uk/manchester-united-fc/leistungsdaten/verein/985/reldata/&2022/plus/1'
headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'}

r = requests.get(url, headers=headers)
soup = BeautifulSoup(r.content, 'lxml')

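# Header links carry the class "sort-link"
html_headers = soup.find_all('a', {'class': 'sort-link'})

headers_list = []
for link in html_headers:
    span = link.find('span')
    if span is None:
        # Plain-text header: use the visible text
        headers_list.append(link.get_text())
    else:
        # Icon header: the name is stored in the span's title attribute
        headers_list.append(span['title'])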

print(headers_list)
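
The loop above can be tried without a network request. The following is a minimal sketch that runs the same logic on a small, hypothetical header row modelled on the markup shown in the question; the `SAMPLE_HTML` string and the `sort_link_headers` helper are illustrative names, not part of the original answer, and the real page may be structured differently.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified header row (an assumption, not the real page source).
SAMPLE_HTML = """
<table class="items">
  <thead>
    <tr>
      <th><a class="sort-link" href="#">Player</a></th>
      <th><a class="sort-link" href="#">Age</a></th>
      <th><a class="sort-link" href="#"><span class="icons_sprite" title="Appearances">&nbsp;</span></a></th>
      <th><a class="sort-link" href="#"><span class="icons_sprite" title="Goals">&nbsp;</span></a></th>
      <th><a class="sort-link" href="#">PPG</a></th>
    </tr>
  </thead>
</table>
"""


def sort_link_headers(html):
    """Return one header name per sort link, in document order."""
    soup = BeautifulSoup(html, "html.parser")
    names = []
    for link in soup.find_all("a", {"class": "sort-link"}):
        span = link.find("span")
        if span is None:
            # Text header: take the visible text of the link.
            names.append(link.get_text(strip=True))
        else:
            # Icon header: take the title attribute of the span.
            names.append(span["title"])
    return names


print(sort_link_headers(SAMPLE_HTML))
```

Because `find_all` returns matches in document order, text headers and icon headers come out interleaved in the same order as the columns.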

A uj5u.com user replied:

You can get the missing ones with

>>> [span.get_attribute_list("title")[0] for span in thead.find_all('span')]
['Appearances',
 'Goals',
 'Assists',
 'Yellow cards',
 'Second yellow cards',
 'Red cards',
 'Substitutions on',
 'Substitutions off',
 'Minutes played']

and then manually merge these results with what you already had in headers_tabela.

However, for consistency, I'd suggest getting them one by one:
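
One way to do that manual merge is to walk the text headers and fill each blank cell with the next icon title. This is only a sketch: `merge_headers` is a hypothetical helper, the sample values are modelled on the outputs in the question (decoded to `str` rather than bytes), and it assumes both lists are in document order.

```python
def merge_headers(text_headers, icon_titles):
    """Fill blank header cells with icon titles, keeping column order."""
    titles = iter(icon_titles)
    merged = []
    for cell in text_headers:
        # Icon-only cells contain just a non-breaking space.
        name = cell.replace("\xa0", " ").strip()
        if not name:
            name = next(titles, None)
            if name is None:
                raise ValueError("more blank headers than icon titles")
        merged.append(name)
    if next(titles, None) is not None:
        raise ValueError("more icon titles than blank headers")
    return merged


# Sample values modelled on the question's outputs (assumed, not re-scraped).
text_headers = ["#", "player", "Age", "Nat.", "In squad"] + ["\xa0"] * 8 + ["PPG", "\xa0"]
icon_titles = [
    "Appearances", "Goals", "Assists", "Yellow cards", "Second yellow cards",
    "Red cards", "Substitutions on", "Substitutions off", "Minutes played",
]

print(merge_headers(text_headers, icon_titles))
```

The two `ValueError` checks make a mismatch between the number of blank cells and the number of titles visible instead of silently shifting column names.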

import bs4
import pandas as pd

def anchor_span(th):
    # Icon header: the name is the title attribute of the span inside the link
    try:
        return th.a.span.get_attribute_list("title")[0]
    except Exception:
        return False

def th_text(th):
    # Last resort: the text of the header cell itself
    try:
        return th.text
    except Exception as exc:
        print(exc)
        return False

def anchor_text(th):
    # Text header: the link must contain nothing but its text
    try:
        ch = th.a.children
        assert len([*ch]) == 1
        return th.a.text
    except Exception:
        return False

def get_col_names(soup):
    colnames = []
    for th in soup.thead.tr.children:
        # Skip the whitespace strings between the <th> tags
        if isinstance(th, bs4.element.NavigableString):
            continue

        title = anchor_span(th)
        if not title:
            title = anchor_text(th)
        if not title:
            title = th_text(th)
        if not title:
            raise NotImplementedError

        colnames.append(title)
    return colnames

# response and soup come from the code in the question
table_MN = pd.read_html(response.content, flavor='html5lib')
data = table_MN[1].iloc[:,:15]
data.columns = get_col_names(soup)
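
The per-cell idea can also be written more compactly. The sketch below is an alternative, not the answer's own method: it looks for any descendant of a header cell that has a title attribute and otherwise falls back to the cell's visible text. `SAMPLE_HTML` and `column_name` are hypothetical names, and the markup is a simplified assumption about the page.

```python
from bs4 import BeautifulSoup

# Hypothetical header row: plain text, linked text, and icon-only cells.
SAMPLE_HTML = """
<table>
  <thead>
    <tr>
      <th>#</th>
      <th><a href="#">Player</a></th>
      <th><a href="#"><span title="Goals">&nbsp;</span></a></th>
      <th><a href="#"><span title="Minutes played">&nbsp;</span></a></th>
    </tr>
  </thead>
</table>
"""


def column_name(th):
    """Prefer the title of a nested element, else use the visible text."""
    titled = th.find(attrs={"title": True})
    if titled is not None:
        return titled["title"]
    return th.get_text(strip=True)


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
names = [column_name(th) for th in soup.thead.find_all("th")]
print(names)
```

Iterating over the `th` cells yields exactly one name per column, so the result lines up with the DataFrame columns as long as the table has a single header row.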
