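I am reading data from MongoDB and writing it to S3, then querying it with Athena.
This is my collection. It contains an items field, which is an array. How can I explode it into separate columns when saving to S3?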
{"_id":{"$oid":"11111111"},
"receiptId":"rtrtrtrttrtrtrtr",
"paymentSystem":"CARD",
"lastFourDigit":"1111",
"cardType":"ghsl",
"paidOn":{"$numberLong":"1623078706000"},
"currency":"USD",
"totalAmountInCents":{"$numberInt":"0000"},
"items":[{"title":"Jun 21 - Jun 21,2022",
"description":"Starter",
"currency":"USD",
"amountInCents":{"$numberInt":"0000"},
"itemType":"SUBSCRIPTION_PLAN",
"id":{"$numberInt":"1"},
"frequency":"YEAR",
"periodStart":{"$numberLong":"1624288306000"},
"periodEnd":{"$numberLong":"1655824306000"}}],
"subscriptionPlanTitle":"Starter",
"subscriptionPlanFrequency":"YEAR",
"uuid":"1111111111",
"createTimestamp":{"$numberLong":"1624292188650"},
"updateTimestamp":{"$numberLong":"1624292188650"}}
The Python code I tried:
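from pandas import json_normalize

# Materialize the cursor returned by the query into a list of documents
mylist = list(collection.find(query))
# Flatten nested dicts into dotted column names (lists stay in one column)
df = json_normalize(mylist)
# Cast every value to str before writing to Parquet
df1 = df.applymap(str)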
I can save this to Parquet, but all the items end up in a single column. Is there a way to explode them dynamically?
The output schema could be:
_id object
id object
createTimestamp object
updateTimestamp object
deleteTimestamp object
receiptId object
paymentSystem object
lastFourDigit object
cardType object
paidOn object
currency object
totalAmountInCents object
items.title object
items.description object
items.currency object
items.amountInCents object
items.itemType object
items.id object
items.frequency object
items.periodstart object
items.periodend object
subscriptionPlanTitle object
subscriptionPlanFrequency object
uuid object
consumerEmail object
taxAmountInCents object
gifted object
Answer:
You can use json_normalize (here data is the single document shown above, as a dict):
out = pd.json_normalize(data, record_path=['items'], meta=list(data.keys() - {'items'}), record_prefix='items.')
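The call above takes a single dict. When you have a list of documents, as returned by collection.find, one possible adaptation is to collect the meta keys across all documents and pass errors="ignore" so that a key missing from some document becomes NaN instead of raising. This is only a sketch: the two receipts in docs are made-up, trimmed stand-ins for real documents, and it assumes every document has an items list.

```python
import pandas as pd

# Hypothetical, trimmed receipts used only for illustration
docs = [
    {"receiptId": "r1", "currency": "USD",
     "items": [{"title": "Starter", "id": 1}, {"title": "Add-on", "id": 2}]},
    {"receiptId": "r2", "currency": "USD", "taxAmountInCents": 5,
     "items": [{"title": "Pro", "id": 3}]},
]

# Every top-level key seen in any document, except the array being exploded
meta_keys = sorted({key for doc in docs for key in doc} - {"items"})

out = pd.json_normalize(
    docs,
    record_path="items",      # one output row per element of "items"
    meta=meta_keys,           # repeat these document-level fields on each row
    record_prefix="items.",   # avoids clashes such as currency vs items.currency
    errors="ignore",          # meta keys absent from a document become NaN
)
print(out)
```

Each receipt contributes one row per item, so the first receipt appears twice and the second once.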
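Another option is to build a DataFrame from data, explode the "items" column into a separate DataFrame, and then join the two. Passing index=x.index keeps each item aligned with its parent row when a document has more than one item:
df = pd.json_normalize(data)
out1 = df.join(df['items'].explode().pipe(lambda x: pd.DataFrame(x.tolist(), index=x.index)).add_prefix('items.')).drop(columns='items')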
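To see how the explode-then-join approach behaves with several documents and several items per document, here is a small sketch. The docs list is hypothetical sample data, and the code assumes every document has a non-empty items list (an empty list would explode to NaN and would need separate handling). The exploded Series keeps the parent row label for each item, so building the items frame with that index lets join line the rows up.

```python
import pandas as pd

# Hypothetical, trimmed receipts used only for illustration
docs = [
    {"receiptId": "r1",
     "items": [{"title": "Starter", "id": 1}, {"title": "Add-on", "id": 2}]},
    {"receiptId": "r2",
     "items": [{"title": "Pro", "id": 3}]},
]

df = pd.json_normalize(docs)

# One element per item; the index repeats the parent row label (0, 0, 1)
exploded = df["items"].explode()

# Turn each item dict into columns, keeping the parent row labels as index
items = pd.DataFrame(exploded.tolist(), index=exploded.index).add_prefix("items.")

# Join on the index: parent rows are repeated once per matching item
out1 = df.drop(columns="items").join(items)
print(out1)
```

Call out1.reset_index(drop=True) afterwards if you want a unique row index before writing the result out.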
Output:
items.title items.description items.currency items.itemType \
0 Jun 21 - Jun 21,2022 Starter USD SUBSCRIPTION_PLAN
items.frequency items.amountInCents.$numberInt items.id.$numberInt \
0 YEAR 0000 1
items.periodStart.$numberLong items.periodEnd.$numberLong cardType ... \
0 1624288306000 1655824306000 ghsl ...
uuid lastFourDigit _id currency \
0 1111111111 1111 {'$oid': '11111111'} USD
totalAmountInCents createTimestamp \
0 {'$numberInt': '0000'} {'$numberLong': '1624292188650'}
paidOn updateTimestamp \
0 {'$numberLong': '1623078706000'} {'$numberLong': '1624292188650'}
subscriptionPlanTitle paymentSystem
0 Starter CARD
[1 rows x 22 columns]
Note that some of the keys in the desired schema (for example, "taxAmountInCents") are not present in the sample document.
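The output still carries wrappers such as {'$numberLong': '...'} and column names like items.id.$numberInt, whereas the desired schema uses plain names such as paidOn and items.id. If your documents really are in the Extended JSON form shown in the question, one way to get there is to unwrap those values before normalizing. The unwrap helper below is a hypothetical function written for this sketch, not part of pandas or pymongo, and it only handles the three wrappers that appear in the sample ($oid, $numberInt, $numberLong).

```python
import pandas as pd

def unwrap(value):
    """Recursively replace single-key Extended JSON wrappers with plain values."""
    if isinstance(value, dict):
        if len(value) == 1:
            (key, inner), = value.items()
            if key in ("$numberInt", "$numberLong"):
                return int(inner)   # numeric wrappers hold the number as a string
            if key == "$oid":
                return inner        # keep the ObjectId as its hex string
        return {k: unwrap(v) for k, v in value.items()}
    if isinstance(value, list):
        return [unwrap(v) for v in value]
    return value

# Trimmed copy of the sample document from the question
doc = {
    "_id": {"$oid": "11111111"},
    "receiptId": "rtrtrtrttrtrtrtr",
    "paidOn": {"$numberLong": "1623078706000"},
    "items": [{"title": "Jun 21 - Jun 21,2022",
               "id": {"$numberInt": "1"},
               "periodStart": {"$numberLong": "1624288306000"}}],
}

clean = unwrap(doc)
out = pd.json_normalize(
    clean,
    record_path="items",
    meta=sorted(clean.keys() - {"items"}),
    record_prefix="items.",
)
print(out.columns.tolist())
```

After unwrapping, the item columns come out as items.title, items.id and items.periodStart, with no $-suffixed names.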
Tags: python-3.x, pandas, dataframe, json-normalize, pandas-explode
