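I have a sample JSON document that I want to flatten into a Pandas DataFrame. So far I have been applying a few methods I wrote myself, but I am wondering whether there is a better or shorter solution.
Sample JSON: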
{
"documentName": "test1.json",
"time": "2020-10-10T08:00:00Z",
"data": [
{
"name":"john",
"scores": [
{
"event":"one",
"score":10
},
{
"event":"two",
"score":10
},
{
"event":"three",
"score":10
}
]
},
{
"name":"mary",
"scores": [
{
"event":"one",
"score":10
},
{
"event":"two",
"score":5
}
]
},
{
"name":"hope",
"scores": [
]
}
]
}
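Desired output DataFrame:
| index | documentName | time | name | one | two | three |
|---|---|---|---|---|---|---|
| 0 | test1.json | 2020-10-10T08:00:00Z | john | 10 | 10 | 10 |
| 1 | test1.json | 2020-10-10T08:00:00Z | mary | 10 | 5 | null |
| 2 | test1.json | 2020-10-10T08:00:00Z | hope | null | null | null |
So the event names become columns and are filled in accordingly. There are 4 events (the sample above shows only three), but it would be a huge advantage if the number and names of the events could be detected dynamically rather than being fixed.
For now I am using the following approach: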
import pandas as pd

def object_to_columns(df_row, column):
    # Expand a dict-valued cell into "<column>-<key>" columns.
    if isinstance(df_row[column], dict):
        for key, value in df_row[column].items():
            column_name = "{}-{}".format(column.lower(), key.lower())
            df_row[column_name] = value
    return df_row

def list_of_objects_to_columns(df_row, column):
    # Turn each {"event": ..., "score": ...} item into a column named after the event.
    if isinstance(df_row[column], list):
        for item in df_row[column]:
            column_name = f"{item['event']}"
            df_row[column_name] = item['score']
    return df_row

with open("test1.json") as file:
    df = pd.read_json(file)
df = df.apply(object_to_columns, column="data", axis=1)
df = df.apply(list_of_objects_to_columns, column="data-scores", axis=1)
### CODE TO REMOVE UNUSED COLUMNS AND RENAMING ###
Any ideas for something better, cleaner, or faster?
Reply from a uj5u.com user:
A more direct way is to use json_normalize, but you lose the information about "hope":
import pandas as pd
import json

with open("data.json") as file:
    data = json.load(file)
# One row per score record, carrying the document and person fields along as metadata.
out = pd.json_normalize(data, ['data', 'scores'],
                        meta=['documentName', 'time', ['data', 'name']]) \
        .pivot(index=['documentName', 'time', 'data.name'],
               columns='event', values='score').reset_index()
Output:
>>> out
event documentName time data.name one three two
0 test1.json 2020-10-10T08:00:00Z john 10.0 10.0 10.0
1 test1.json 2020-10-10T08:00:00Z mary 10.0 NaN 5.0
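The "hope" row disappears because json_normalize emits one row per element found under the record path, so a person whose scores list is empty contributes no rows at all. A minimal sketch of that behaviour, using a trimmed-down document that is made up for illustration (only the key names come from the question):

```python
import pandas as pd

# Trimmed-down, illustrative document: one person with a score, one without.
doc = {
    "data": [
        {"name": "john", "scores": [{"event": "one", "score": 10}]},
        {"name": "hope", "scores": []},
    ]
}

# One output row per element of data[*].scores; the name is attached as metadata.
flat = pd.json_normalize(doc, ["data", "scores"], meta=[["data", "name"]])

# Only people with at least one score record survive.
names = flat["data.name"].tolist()
print(names)
```

Since "hope" has no score records, no row carries her name into the pivot.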
Update: another option that keeps the "hope" row:
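import numpy as np

with open("data.json") as file:
    data = json.load(file)
# One row per person, then one row per score; an empty scores list becomes a single NaN row.
out = pd.json_normalize(data, 'data', meta=['documentName', 'time']) \
        .explode('scores', ignore_index=True)
# Split each score dict into 'event' and 'score'; rows without scores stay NaN.
out[['event', 'score']] = out.pop('scores').dropna() \
        .agg(lambda x: pd.Series(x.values()))
# The NaN event produces a spurious NaN column in the pivot, which is dropped at the end.
out = out.pivot(index=['documentName', 'time', 'name'],
                columns='event', values='score') \
        .reset_index().drop(columns=np.nan)
Output: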
>>> out
event documentName time name one three two
0 test1.json 2020-10-10T08:00:00Z hope NaN NaN NaN
1 test1.json 2020-10-10T08:00:00Z john 10.0 10.0 10.0
2 test1.json 2020-10-10T08:00:00Z mary 10.0 NaN 5.0
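As a possible alternative, the reshaping can also be done in plain Python before pandas is involved, which keeps people without scores and picks up event names dynamically. This is only a sketch and not part of the answer above: the helper name scores_to_frame is hypothetical, and it assumes every entry under "data" has a "name" and a "scores" list shaped like the sample.

```python
import pandas as pd

def scores_to_frame(doc):
    """Hypothetical helper: build one row per person, one column per event."""
    rows = [
        {
            "documentName": doc["documentName"],
            "time": doc["time"],
            "name": person["name"],
            # Event names become keys, so new events turn into new columns.
            **{s["event"]: s["score"] for s in person["scores"]},
        }
        for person in doc["data"]
    ]
    # Missing keys are filled with NaN when the dicts are combined.
    return pd.DataFrame(rows)

# Sample document mirroring the question's JSON.
doc = {
    "documentName": "test1.json",
    "time": "2020-10-10T08:00:00Z",
    "data": [
        {"name": "john", "scores": [{"event": "one", "score": 10},
                                    {"event": "two", "score": 10},
                                    {"event": "three", "score": 10}]},
        {"name": "mary", "scores": [{"event": "one", "score": 10},
                                    {"event": "two", "score": 5}]},
        {"name": "hope", "scores": []},
    ],
}

out = scores_to_frame(doc)
print(out)
```

With this approach the rows stay in the original order (john, mary, hope) and the event columns appear in the order they are first seen. If nobody in the document has any scores, no event columns are created.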
