I am trying to scrape the "Profile and investment" table from https://markets.ft.com/data/funds/tearsheet/summary?s=LU0526609390:EUR using the following code:
import requests
import pandas as pd

# Define all URLs required for scraping data from the FT website - if a new fund is added, simply add the appropriate Fund ID to the list
List = ['LU0526609390:EUR', 'IE00BHBX0Z19:EUR', 'LU1076093779:EUR', 'LU1116896363:EUR']
df = pd.DataFrame(List, columns=['List'])
# Prepend the base URL to every Fund ID
urls = 'https://markets.ft.com/data/funds/tearsheet/summary?s=' + df['List']

for url in urls:
    r = requests.get(url).content
    # This is the call that raises the ValueError described below
    df = pd.read_html(r)[0]
    print(df)
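However, when I call pd.read_html, I get the following error: ValueError: invalid literal for int() with base 10: '100%', because the table has entries containing %. Is there a way to make pandas accept % values? I am new to Python and pandas, so any help would be much appreciated!
The output I need is a table in the following format: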
Fund_ID Fund_type Income_treatment Morningstar category ......
LU0526609390:EUR ... ... ....
IE00BHBX0Z19:EUR ... ... ....
LU1076093779:EUR ... ... ....
LU1116896363:EUR ... ... ....
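A uj5u.com user replied:
The problem is that the site sets the 'colspan' attribute to a % value instead of an integer. As AsishM mentioned in the comments, colspan should be an integer. Some browsers tolerate a value like "100%", but pandas expects the proper syntax: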
<td colspan="number">
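There are a few ways to work around this:
1. Use BeautifulSoup to fix those attributes.
2. Since the offending attribute is not in the table you actually want to parse, use BeautifulSoup to grab the first table and pass only that to pandas, so the problem never comes up.
3. Check whether the table has a specific attribute; if it does, pass that attribute as an argument to .read_html() so that it only picks up that particular table.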
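For reference, option 1 could look roughly like the sketch below. It is only an illustration: the helper name fix_span_attributes and the SAMPLE_HTML snippet are made up for this example and are not taken from the FT page, and dropping the invalid attribute (rather than replacing it with a number) is an assumption about what is acceptable for your data.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML used only for illustration; the real FT markup will differ.
SAMPLE_HTML = """
<table>
  <tr><td colspan="100%">banner</td></tr>
  <tr><td colspan="2">ok</td></tr>
</table>
"""


def fix_span_attributes(html: str) -> str:
    """Remove colspan/rowspan values that are not plain integers."""
    soup = BeautifulSoup(html, "html.parser")
    for cell in soup.find_all(["td", "th"]):
        for attr in ("colspan", "rowspan"):
            value = cell.get(attr)
            # Keep valid values such as "2"; drop values such as "100%".
            if value is not None and not str(value).strip().isdigit():
                del cell[attr]
    return str(soup)


cleaned = fix_span_attributes(SAMPLE_HTML)
print(cleaned)
```

The cleaned string can then be handed to pd.read_html (wrapped in io.StringIO on recent pandas versions) in place of the raw response body.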
I went with option 2 here:
from io import StringIO

import requests
import pandas as pd
from bs4 import BeautifulSoup

# Define all URLs required for scraping data from the FT website - if a new fund is added, simply add the appropriate Fund ID to the list
List = ['LU0526609390:EUR', 'IE00BHBX0Z19:EUR', 'LU1076093779:EUR', 'LU1116896363:EUR']
df = pd.DataFrame(List, columns=['List'])
urls = 'https://markets.ft.com/data/funds/tearsheet/summary?s=' + df['List']

frames = []
for url in urls:
    print(url)
    r = requests.get(url).content
    soup = BeautifulSoup(r, 'html.parser')
    # Take only the first table, so the table with the "%" colspan is never parsed
    table = soup.find('table')
    # index_col=0 uses the label column as the index; .T turns those labels into column headers
    df = pd.read_html(StringIO(str(table)), index_col=0)[0].T
    frames.append(df)

# DataFrame.append was removed in pandas 2.0, so combine the rows with pd.concat instead
results = pd.concat(frames, sort=False).reset_index(drop=True)
print(results)
Output:
print(results.to_string())
0 Fund type Income treatment Morningstar category IMA sector Launch date Price currency Domicile ISIN Manager & start date Investment style (bonds) Investment style (stocks)
0 SICAV Income Global Bond - EUR Hedged -- 06 Aug 2010 GBP Luxembourg LU0526609390 Jonathan Gregory01 Nov 2012Vivek Acharya09 Dec 2015Simon Foster01 Nov 2012 NaN NaN
1 Open Ended Investment Company Income EUR Diversified Bond -- 21 Feb 2014 EUR Ireland IE00BHBX0Z19 Lorenzo Pagani12 May 2017Konstantin Veit01 Jul 2019 Credit Quality: HighInterest-Rate Sensitivity: Mod NaN
2 SICAV Income Eurozone Large-Cap Equity -- 11 Jul 2014 GBP Luxembourg LU1076093779 NaN NaN Market Cap: LargeInvestment Style: Blend
3 SICAV Income EUR Flexible Bond -- 01 Dec 2014 EUR Luxembourg LU1116896363 NaN NaN NaN
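The output above does not include the Fund_ID column asked for in the question. One possible way to add it is to key each parsed frame by its fund ID and let pd.concat turn those keys into a column. The sketch below uses small hand-written stand-in frames (not live FT data), and the name parsed_frames is hypothetical.

```python
import pandas as pd

# Hypothetical stand-ins for the one-row frames produced per fund in the loop.
parsed_frames = {
    "LU0526609390:EUR": pd.DataFrame(
        [{"Fund type": "SICAV", "Domicile": "Luxembourg"}]
    ),
    "IE00BHBX0Z19:EUR": pd.DataFrame(
        [{"Fund type": "Open Ended Investment Company", "Domicile": "Ireland"}]
    ),
}

# Passing a dict makes its keys the outer index level, named "Fund_ID" here;
# reset_index then moves that level into a regular first column.
results = (
    pd.concat(parsed_frames, names=["Fund_ID"], sort=False)
    .reset_index(level="Fund_ID")
    .reset_index(drop=True)
)
print(results.to_string())
```

In the scraping loop this would mean storing each frame under its fund ID in a dict instead of accumulating the frames without a key.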
