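I am trying to scrape some asset and holdings information for a few funds from the FT website (for example https://markets.ft.com/data/funds/tearsheet/holdings?s=LU1076093779:EUR). I can select the first table without any problem, but the second table is made up of two tables under different tabs: "Sectors" and "Regions". When I try to select the second table with table2 = soup.find_all('table')[1], I get either the table under the "Sectors" tab or the one under the "Regions" tab. Is there a way to select both tables?
My code is as follows: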
    import requests
    import pandas as pd
    from bs4 import BeautifulSoup

    List = ['LU0526609390:EUR', 'IE00BHBX0Z19:EUR', 'LU1076093779:EUR', 'LU1116896363:EUR']
    df = pd.DataFrame(List, columns=['List'])
    urls = 'https://markets.ft.com/data/funds/tearsheet/holdings?s=' + df['List']
    dfs = []
    for url in urls:
        ISIN = url.split('=')[-1].replace(':', '_')
        ISIN = ISIN[:-4]
        r = requests.get(url).content
        soup = BeautifulSoup(r, 'html.parser')
        try:
            table1 = soup.find_all('table')[0]
            table2 = soup.find_all('table')[1]
        except Exception:
            continue
        df1 = pd.read_html(str(table1), index_col=0)[0]
        df2 = pd.read_html(str(table2), index_col=0)[0]
        del df2['Category average']
        del df1['% Short']
        del df1['% Long']
        df1 = df1.rename(columns={'% Net assets': ISIN})
        df2 = df2.rename(columns={'% Net assets': ISIN})
        df = df1.append(df2)
        print(df)
My desired output for the fund LU1076093779:
                           LU1076093779
    Non-UK stock                 94.76%
    Cash                          2.21%
    UK stock                      3.03%
    UK bond                       0.00%
    Non-UK bond                   0.00%
    Other                         0.00%
    Financial Services           16.96%
    Industrials                  14.22%
    Consumer Cyclical            13.80%
    Technology                   13.65%
    Healthcare                   11.08%
    Consumer Defensive            8.09%
    Communication Services        7.20%
    Basic Materials               6.04%
    Utilities                     3.62%
    Other                         3.10%
    Americas                      1.37%
    United States                 0.79%
    Latin America                 0.58%
    Greater Asia                  0.00%
    Greater Europe               96.02%
    Eurozone                     91.79%
    United Kingdom                3.03%
    Europe - ex Euro              1.20%
But at the moment I only get the following:
                           LU1076093779
    Non-UK stock                 94.76%
    Cash                          2.21%
    UK stock                      3.03%
    UK bond                       0.00%
    Non-UK bond                   0.00%
    Other                         0.00%
    Financial Services           16.96%
    Industrials                  14.22%
    Consumer Cyclical            13.80%
    Technology                   13.65%
    Healthcare                   11.08%
    Consumer Defensive            8.09%
    Communication Services        7.20%
    Basic Materials               6.04%
    Utilities                     3.62%
    Other                         3.10%
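A uj5u.com user replied:
Just put the tables together after a little data manipulation. Note, though, that some of these links do not have all 3 tables. I also changed the way you iterate through the list. There is no need to create a dataframe for that.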
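As a minimal sketch of that iteration change, the URLs can be built straight from a plain list, and the ISIN can be read back from the query string. The helper names `build_url` and `isin_from_url` are hypothetical, and the sketch assumes every fund ID has the form `ISIN:CURRENCY`. No network request is made here.

```python
from urllib.parse import parse_qs, urlparse

BASE_URL = 'https://markets.ft.com/data/funds/tearsheet/holdings?s='


def build_url(fund_id):
    """Return the holdings URL for a fund ID such as 'LU1076093779:EUR'."""
    return BASE_URL + fund_id


def isin_from_url(url):
    """Extract the ISIN (the part before ':') from the 's' query parameter."""
    fund_id = parse_qs(urlparse(url).query)['s'][0]
    return fund_id.split(':')[0]


id_list = ['LU0526609390:EUR', 'IE00BHBX0Z19:EUR']
urls = [build_url(fund_id) for fund_id in id_list]
for url in urls:
    print(isin_from_url(url), url)
```

Splitting on ':' avoids relying on the currency suffix being exactly four characters long, which is what slicing with `ISIN[:-4]` assumes.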
Also be careful with your variables. I think you meant dfs = df1.append(df2). But I adjusted the code a little.
    import requests
    import pandas as pd
    from bs4 import BeautifulSoup

    # Define all URLs required for scraping data from the FT website - to add a new fund, simply append its fund ID to the list
    id_list = ['LU1076093779:EUR','LU0526609390:EUR', 'IE00BHBX0Z19:EUR', 'LU1076093779:EUR', 'LU1116896363:EUR']
    urls = ['https://markets.ft.com/data/funds/tearsheet/holdings?s=' + x for x in id_list]

    for url in urls:
        print(url)
        # Turn e.g. 'LU1076093779:EUR' into 'LU1076093779'
        ISIN = url.split('=')[-1].replace(':', '_')
        ISIN = ISIN[:-4]
        df_list = []
        r = requests.get(url).content
        soup = BeautifulSoup(r, 'html.parser')
        # Strip '%' from every colspan attribute before handing the HTML to pandas
        all_colspan = soup.find_all(attrs={'colspan':True})
        for colspan in all_colspan:
            colspan.attrs['colspan'] = colspan.attrs['colspan'].replace('%', '')
        # Parse every table on the page, including those under hidden tabs
        dfs = pd.read_html(str(soup))
        for df in dfs:
            cols = df.columns
            drop_cols = ['Category average', '% Short', '% Long']
            # Keep only the allocation tables and give them a common layout
            if 'Type' in cols or 'Sector' in cols:
                df = df.rename(columns={'% Net assets': ISIN,
                                        'Sector':'Type'})
                df = df.drop([x for x in drop_cols if x in df.columns], axis=1)
                df_list.append(df)
        result = pd.concat(df_list)
        print(result)
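The combining step can also be sketched offline, with hand-made frames standing in for the tables that `pd.read_html` returns. The percentages are copied from the question's output; the values in the dropped columns are placeholders, and the function name `combine_tables` is hypothetical. The sketch uses `pd.concat`, since `DataFrame.append` is no longer available from pandas 2.0 onward, and it returns an empty frame when a page yields no matching table.

```python
import pandas as pd


def combine_tables(tables, isin):
    """Stack the allocation tables of one fund into a single frame.

    Hypothetical helper: `tables` is a list of DataFrames like the one
    returned by pd.read_html; tables without a 'Type' or 'Sector' column
    are skipped.
    """
    drop_cols = ['Category average', '% Short', '% Long']
    kept = []
    for df in tables:
        if 'Type' in df.columns or 'Sector' in df.columns:
            df = df.rename(columns={'% Net assets': isin, 'Sector': 'Type'})
            df = df.drop(columns=[c for c in drop_cols if c in df.columns])
            kept.append(df)
    if not kept:
        # pd.concat raises ValueError on an empty list, so return early.
        return pd.DataFrame(columns=['Type', isin])
    return pd.concat(kept, ignore_index=True)


# 'n/a' entries are placeholders for columns that get dropped anyway.
assets = pd.DataFrame({'Type': ['Non-UK stock', 'Cash'],
                       '% Long': ['n/a', 'n/a'],
                       '% Net assets': ['94.76%', '2.21%']})
sectors = pd.DataFrame({'Sector': ['Financial Services', 'Industrials'],
                        '% Net assets': ['16.96%', '14.22%'],
                        'Category average': ['n/a', 'n/a']})
unrelated = pd.DataFrame({'Name': ['x'], 'Value': [1]})

result = combine_tables([assets, sectors, unrelated], 'LU1076093779')
print(result)
```

Guarding the empty case matters here because, as noted above, not every fund page carries all three tables.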
