我想從位于此處的 owid covid19 json 資料創建一個資料框。json 在資料列中有一組每日記錄,這與國家/地區索引一起是我試圖將其制成資料框的內容。
{"AFG":{"continent":"Asia","location":"Afghanistan","population":39835428.0,"population_density":54.422,"median_age":18.6,"aged_65_older":2.581,"aged_70_older":1.337,"gdp_per_capita":1803.987,"cardiovasc_death_rate":597.029,"diabetes_prevalence":9.59,"handwashing_facilities":37.746,"hospital_beds_per_thousand":0.5,"life_expectancy":64.83,"human_development_index":0.511,"data":[{"date":"2020-02-24","total_cases":5.0,"new_cases":5.0,"total_cases_per_million":0.126,"new_cases_per_million":0.126,"stringency_index":8.33},{"date":"2020-02-25","total_cases":5.0,"new_cases":0.0,"total_cases_per_million":0.126,"new_cases_per_million":0.0,"stringency_index":8.33},
到目前為止,我一直在將檔案直接加載到資料框中
df = pd.read_json('owid-covid-data.json', orient='index')
然后將陣列歸一化
data = pd.concat([pd.DataFrame(json_normalize(key)) for key in df['data']])
除了洗掉索引并因此沒有提供識別符號以連接回靜態值之外,它可以正常作業。
我還想象有一種比我使用的更有效的標準化方法。
非常感謝任何幫助!
uj5u.com熱心網友回復:
這不是最有效的方法,但它有效:
new_df = pd.DataFrame()
for index, row in df.iterrows():
tmp = pd.json_normalize(row['data'])
tmp['country_code'] = index
new_df = pd.concat([new_df, tmp])
編輯:
我找到了一種更有效的方法,即一次對所有 JSON 進行規范化:
country_codes = []
datas = []
for index, data in zip(df.index, df['data']):
datas.extend(data)
country_codes.extend(len(data)*[index])
new_df = pd.DataFrame(datas)
new_df['country_code'] = country_codes
改進從9.38 s ± 856 ms per loop到1.37 s ± 12 ms per loop
uj5u.com熱心網友回復:
df = pd.read_json("https://covid.ourworldindata.org/data/owid-covid-data.json", orient='index')
# explode records/ lists into new rows, convert to dict,
# use it to create a new DataFrame, and transpose it
data = pd.DataFrame(df['data'].explode().to_dict()).T
df = df.drop(columns='data').join(data)
表現
忽略資料讀取
>>> %%timeit
... data = pd.DataFrame(df['data'].explode().to_dict()).T
... df.drop(columns='data').join(data)
84.4 ms ± 3.03 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
編輯 - 免責宣告
上述解決方案實際上是錯誤的。to_dict轉換程序中丟失了很多記錄,因為有很多重復的國家代碼鍵(系列索引),字典的鍵必須是唯一的。為了解決這個問題,首先,我們需要重置索引以確保它是唯一的。只有在創建新的 DataFrame 之后,我們才添加原始索引。
data = df['data'].explode()
data_df = pd.DataFrame(data.reset_index(drop=True).to_dict()).T
data_df.index = data.index
df = df.drop(columns='data').join(data_df)
這比之前的解決方案慢得多,因為實際上有 127314 條記錄,而之前的解決方案只產生 233 條記錄(唯一的國家代碼)。即使我們忽略了join部分,就像布魯諾的解決方案一樣,它也比布魯諾的解決方案慢得多
>>> %%timeit
... data = df['data'].explode()
... new_df = pd.DataFrame(data.reset_index(drop=True).to_dict()).T
... new_df.index = data.index
17.6 s ± 972 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
# Bruno's solution
>>> %%timeit
... country_codes = []
... datas = []
... for index, data in zip(df.index, df['data']):
... datas.extend(data)
... country_codes.extend(len(data)*[index])
...
... new_df = pd.DataFrame(datas)
... new_df['country_code'] = country_codes
1.86 s ± 32.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
編輯 2 - 有更好的方法......
我找到了一個更好更簡單的解決方案。我肯定是過于復雜了。與Bruno的解決方案基本相同
data = df['data'].explode()
data_df = pd.DataFrame(data.tolist(), index=data.index)
df = df.drop(columns='data').join(data_df)
它和布魯諾的解決方案一樣快
>>> %%timeit
... data = df['data'].explode()
... pd.DataFrame(data.tolist(), index=data.index)
1.87 s ± 16.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
uj5u.com熱心網友回復:
你可以使用pd.json_normalize()which is must 更快(我在整個 JSON 檔案上試過這個):
%%timeit
pd.json_normalize(
data["AFG"],
record_path=["data"],
meta=[
"continent",
"location",
"population",
"population_density",
"median_age",
"aged_65_older",
"aged_70_older",
"gdp_per_capita",
"cardiovasc_death_rate",
"diabetes_prevalence",
"handwashing_facilities",
"hospital_beds_per_thousand",
"life_expectancy",
"human_development_index",
],
)
17.9 ms ± 1.68 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)
輸出:
date total_cases new_cases total_cases_per_million new_cases_per_million stringency_index new_cases_smoothed ... gdp_per_capita cardiovasc_death_rate diabetes_prevalence handwashing_facilities hospital_beds_per_thousand life_expectancy human_development_index
0 2020-02-24 5.0 5.0 0.126 0.126 8.33 NaN ... 1803.987 597.029 9.59 37.746 0.5 64.83 0.511
1 2020-02-25 5.0 0.0 0.126 0.000 8.33 NaN ... 1803.987 597.029 9.59 37.746 0.5 64.83 0.511
2 2020-02-26 5.0 0.0 0.126 0.000 8.33 NaN ... 1803.987 597.029 9.59 37.746 0.5 64.83 0.511
3 2020-02-27 5.0 0.0 0.126 0.000 8.33 NaN ... 1803.987 597.029 9.59 37.746 0.5 64.83 0.511
4 2020-02-28 5.0 0.0 0.126 0.000 8.33 NaN ... 1803.987 597.029 9.59 37.746 0.5 64.83 0.511
.. ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
610 2021-10-26 156071.0 31.0 3917.894 0.778 NaN 38.571 ... 1803.987 597.029 9.59 37.746 0.5 64.83 0.511
611 2021-10-27 156124.0 53.0 3919.225 1.330 NaN 37.857 ... 1803.987 597.029 9.59 37.746 0.5 64.83 0.511
612 2021-10-28 156166.0 42.0 3920.279 1.054 NaN 39.286 ... 1803.987 597.029 9.59 37.746 0.5 64.83 0.511
613 2021-10-29 156196.0 30.0 3921.032 0.753 NaN 37.857 ... 1803.987 597.029 9.59 37.746 0.5 64.83 0.511
614 2021-10-30 156210.0 14.0 3921.384 0.351 NaN 38.571 ... 1803.987 597.029 9.59 37.746 0.5 64.83 0.511
[615 rows x 38 columns]
但最快的解決方案就是下載csv格式的資料,因為他們在此處為您提供鏈接。
轉載請註明出處,本文鏈接:https://www.uj5u.com/shujuku/343224.html
上一篇:從維基百科下載某些影像會導致意外的UnidentifiedImageError
下一篇:將函式應用于csv檔案中的每一行
