How to use Beautiful Soup to find HTML tables under different tabs - 有解無憂

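I am trying to scrape asset and holdings information for a few funds from the FT website (for example https://markets.ft.com/data/funds/tearsheet/holdings?s=LU1076093779:EUR). I can select the first table without any problem, but the second table is really two tables under different tabs: "Sector" and "Region". When I try to select the second table with table2 = soup.find_all('table')[1], I get either the table under the "Sector" tab or the one under the "Region" tab, never both. Is there a way to select both tables?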

My code is as follows:

import requests
import pandas as pd
from bs4 import BeautifulSoup

List = ['LU0526609390:EUR', 'IE00BHBX0Z19:EUR', 'LU1076093779:EUR', 'LU1116896363:EUR']

df = pd.DataFrame(List, columns=['List'])
urls = 'https://markets.ft.com/data/funds/tearsheet/holdings?s=' + df['List']

dfs =[]
for url in urls:
    ISIN = url.split('=')[-1].replace(':', '_')
    ISIN = ISIN[:-4]
    r = requests.get(url).content
    soup = BeautifulSoup(r, 'html.parser')
    try:
        table1 = soup.find_all('table')[0]
        table2 = soup.find_all('table')[1]
    except Exception:
        continue   
    df1 = pd.read_html(str(table1), index_col=0)[0]
    df2 = pd.read_html(str(table2), index_col=0)[0]
    del df2['Category average']
    del df1['% Short']
    del df1['% Long']
    df1 = df1.rename(columns={'% Net assets': ISIN})
    df2 = df2.rename(columns={'% Net assets': ISIN})
    df = df1.append(df2)
    print(df)

The output I want for fund LU1076093779:

                      LU1076093779
Non-UK stock                 94.76%
Cash                          2.21%
UK stock                      3.03%
UK bond                       0.00%
Non-UK bond                   0.00%
Other                         0.00%
Financial Services           16.96%
Industrials                  14.22%
Consumer Cyclical            13.80%
Technology                   13.65%
Healthcare                   11.08%
Consumer Defensive            8.09%
Communication Services        7.20%
Basic Materials               6.04%
Utilities                     3.62%
Other                         3.10%
Americas                      1.37% 
United States                 0.79% 
Latin America                 0.58% 
Greater Asia                  0.00% 
Greater Europe                96.02%    
Eurozone                      91.79%    
United Kingdom                3.03% 
Europe - ex Euro              1.20%

But at the moment I only get the following:

                       LU1076093779
Non-UK stock                 94.76%
Cash                          2.21%
UK stock                      3.03%
UK bond                       0.00%
Non-UK bond                   0.00%
Other                         0.00%
Financial Services           16.96%
Industrials                  14.22%
Consumer Cyclical            13.80%
Technology                   13.65%
Healthcare                   11.08%
Consumer Defensive            8.09%
Communication Services        7.20%
Basic Materials               6.04%
Utilities                     3.62%
Other                         3.10%

Reply from a uj5u.com user:

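Just put the tables together after a little data manipulation. Note, though, that some of these links do not have all 3 tables. I also changed the way you iterate through the list; there is no need to create a DataFrame for it.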
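As a minimal sketch of iterating over the list directly: the URLs can be built with a list comprehension, and the ISIN used as the column label can be taken from the fund ID itself. `BASE_URL` is the address from the question; the helper names `build_url` and `isin_from_id` are made up for this example.

```python
BASE_URL = 'https://markets.ft.com/data/funds/tearsheet/holdings?s='

def build_url(fund_id):
    # Concatenate the base address and the fund ID (note the '+').
    return BASE_URL + fund_id

def isin_from_id(fund_id):
    # 'LU1076093779:EUR' -> 'LU1076093779' (drop the currency suffix).
    return fund_id.split(':')[0]

id_list = ['LU0526609390:EUR', 'IE00BHBX0Z19:EUR']
urls = [build_url(x) for x in id_list]

for fund_id, url in zip(id_list, urls):
    print(isin_from_id(fund_id), url)
```

Splitting on the colon gives the same label as the replace-and-slice approach when the currency code has three letters, and it does not depend on that length.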

Also be careful with your variables. I think you meant dfs = df1.append(df2). But I adjusted the code a bit.

import requests
import pandas as pd
from bs4 import BeautifulSoup

# Define all URLs required for scraping data from the FT website - if a new fund is added, simply add the appropriate fund ID to the list
id_list = ['LU1076093779:EUR','LU0526609390:EUR', 'IE00BHBX0Z19:EUR', 'LU1076093779:EUR', 'LU1116896363:EUR']
urls = ['https://markets.ft.com/data/funds/tearsheet/holdings?s=' + x for x in id_list]


for url in urls:
    print(url)
    ISIN = url.split('=')[-1].replace(':', '_')
    ISIN = ISIN[:-4]
    df_list = []
    r = requests.get(url).content
    soup = BeautifulSoup(r, 'html.parser')
    
    all_colspan = soup.find_all(attrs={'colspan':True})
    for colspan in all_colspan:
        colspan.attrs['colspan'] = colspan.attrs['colspan'].replace('%', '')
        
    dfs = pd.read_html(str(soup))
    for df in dfs:
        cols = df.columns
        drop_cols = ['Category average', '% Short', '% Long']
        if 'Type' in cols or 'Sector' in cols:
            df = df.rename(columns={'% Net assets': ISIN,
                                    'Sector':'Type'})
            df = df.drop([x for x in drop_cols if x in df.columns], axis=1)
            df_list.append(df)
            
    result = pd.concat(df_list)
    print(result)
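Since some fund pages may lack one or more of the three tables, the list passed to pd.concat can end up empty, and pd.concat raises a ValueError in that case. Below is a minimal sketch of the filtering step pulled out into a helper that tolerates this. The name `combine_holdings` is hypothetical, and the sample frames are hand-built stand-ins for what pd.read_html would return (the net-asset figures are copied from the question; the values in the dropped columns and the unrelated table are placeholders). It assumes pandas is installed.

```python
import pandas as pd

DROP_COLS = ['Category average', '% Short', '% Long']

def combine_holdings(tables, isin):
    """Stack the asset-type and sector/region tables into one frame."""
    kept = []
    for df in tables:
        # Keep only tables that have a 'Type' or 'Sector' column.
        if 'Type' in df.columns or 'Sector' in df.columns:
            df = df.rename(columns={'% Net assets': isin, 'Sector': 'Type'})
            df = df.drop(columns=[c for c in DROP_COLS if c in df.columns])
            kept.append(df)
    if not kept:
        # pd.concat raises ValueError on an empty list, so return an empty frame.
        return pd.DataFrame(columns=['Type', isin])
    return pd.concat(kept, ignore_index=True)

# Stand-in tables; placeholder values in the columns that get dropped.
assets = pd.DataFrame({'Type': ['Non-UK stock', 'Cash'],
                       '% Long': ['placeholder', 'placeholder'],
                       '% Net assets': ['94.76%', '2.21%']})
sectors = pd.DataFrame({'Sector': ['Financial Services'],
                        '% Net assets': ['16.96%'],
                        'Category average': ['placeholder']})
other = pd.DataFrame({'Company': ['Example holding']})  # hypothetical unrelated table

result = combine_holdings([assets, sectors, other], 'LU1076093779')
print(result)
```

Returning an empty frame rather than raising lets the loop over URLs continue when a page has no matching tables; whether that or an explicit skip is preferable depends on how the results are used afterwards.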

Tags: Python, html, pandas, web-scraping, beautifulsoup