I have an Excel file with a column of strings in a nested, JSON-like format, and I want to parse/expand it. When I run df.head(2), the DataFrame looks like this:
json_str
0 {"id":"lni001","pub_date":"20220301","doc_id":"7098727","unique_id":"64WP-UI-POLI","content":[{"c_id":"002","p_id":"P02","type":"org","source":"internet"},{"c_id":"003","p_id":"P03","type":"org","source":"internet"},{"c_id":"005","p_id":"K01","type":"people","source":"news"}]}
1 {"id":"lni002","pub_date":"20220301","doc_id":"7097889","unique_id":"64WP-UI-CFGT","content":[{"c_id":"012","p_id":"K21","type":"location","source":"internet"},{"c_id":"034","p_id":"P17","type":"people","source":"news"},{"c_id":"098","p_id":"K54","type":"people","source":"news"}]}
Each row has the following structure:
{
"id":"lni001",
"pub_date":"20220301",
"doc_id":"7098727",
"unique_id":"64WP-UI-POLI",
"content":[
{
"c_id":"002",
"p_id":"P02",
"type":"org",
"source":"internet"
},
{
"c_id":"003",
"p_id":"P03",
"type":"org",
"source":"internet"
},
{
"c_id":"005",
"p_id":"K01",
"type":"people",
"source":"news"
}
]
}
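The values in the column are of type str, as reported by type(df['json_str'].iloc[0]).
All rows share the same structure/format, but the number of nested entries in content may differ. The example above has 3 nested entries, while other rows may have 1, 2, 4, 5, or more. The expected result looks like this: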
id pub_date doc_id unique_id c_id p_id type source
lni001 20220301 7098727 64WP-UI-POLI 002 P02 org internet
lni001 20220301 7098727 64WP-UI-POLI 003 P03 org internet
lni001 20220301 7098727 64WP-UI-POLI 005 K01 people news
lni002 20220301 7097889 64WP-UI-CFGT 012 K21 location internet
lni002 20220301 7097889 64WP-UI-CFGT 034 P17 people news
lni002 20220301 7097889 64WP-UI-CFGT 098 K54 people news
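I tried converting the column to dictionaries and extracting the information, but it did not work well. Is there a better way to do this?
A uj5u.com user replied:
We can apply json.loads to each row and then use json_normalize: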
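import json
import pandas as pd
# Parse each JSON string into a Python dict
data = df['json_str'].apply(json.loads).tolist()
# Expand 'content' into one row per entry; carry the other top-level keys along as metadata
out = (pd.json_normalize(data, ['content'], list(data[0].keys()-{'content'}))
       [['id', 'pub_date', 'doc_id', 'unique_id', 'c_id', 'p_id', 'type', 'source']])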
Output:
id pub_date doc_id unique_id c_id p_id type source
0 lni001 20220301 7098727 64WP-UI-POLI 002 P02 org internet
1 lni001 20220301 7098727 64WP-UI-POLI 003 P03 org internet
2 lni001 20220301 7098727 64WP-UI-POLI 005 K01 people news
3 lni002 20220301 7097889 64WP-UI-CFGT 012 K21 location internet
4 lni002 20220301 7097889 64WP-UI-CFGT 034 P17 people news
5 lni002 20220301 7097889 64WP-UI-CFGT 098 K54 people news
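Here, data[0].keys()-{'content'} is the set of all keys in each dictionary except 'content'. Because a set has no guaranteed order, the final column selection is what puts the columns in the desired order.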
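To make the approach above easy to try without the Excel file, here is a minimal, self-contained sketch. The sample rows are modelled on the question (the second row is shortened to a single content entry to show that a varying number of entries works), and the helper name expand_json_column and its parameters are made up for illustration. It builds the metadata list in the original key order instead of relying on a set, which is one possible way to avoid hard-coding the column names.

```python
import json
import pandas as pd

# Sample rows modelled on the question; the second row has only one
# "content" entry to show that a varying number of entries is handled.
rows = [
    '{"id":"lni001","pub_date":"20220301","doc_id":"7098727","unique_id":"64WP-UI-POLI",'
    '"content":[{"c_id":"002","p_id":"P02","type":"org","source":"internet"},'
    '{"c_id":"003","p_id":"P03","type":"org","source":"internet"},'
    '{"c_id":"005","p_id":"K01","type":"people","source":"news"}]}',
    '{"id":"lni002","pub_date":"20220301","doc_id":"7097889","unique_id":"64WP-UI-CFGT",'
    '"content":[{"c_id":"012","p_id":"K21","type":"location","source":"internet"}]}',
]
df = pd.DataFrame({"json_str": rows})


def expand_json_column(frame, column="json_str", record_key="content"):
    """Hypothetical helper: one output row per entry in `record_key`."""
    records = frame[column].apply(json.loads).tolist()
    # Top-level keys other than the nested list, kept in their original order
    # (assumes every row has the same top-level keys as the first row).
    meta = [key for key in records[0] if key != record_key]
    flat = pd.json_normalize(records, record_path=record_key, meta=meta)
    # json_normalize places the nested fields first; move the metadata to the front.
    record_cols = [col for col in flat.columns if col not in meta]
    return flat[meta + record_cols]


out = expand_json_column(df)
print(out)
```

The result should have one row per content entry (four rows for this sample), with the top-level fields repeated on each row.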
A uj5u.com user replied:
Building on @enke's answer, you can first convert the strings into real Python dictionaries and then use pd.json_normalize:
import ast
import pandas as pd
data = df['YOUR COLUMN'].apply(ast.literal_eval).tolist()  # parse each string into a dict
new_df = pd.json_normalize(data, ['content'], list(data[0].keys()-{'content'}))
If you care about the column order, you can rearrange the columns:
new_df = new_df[['id', 'pub_date', 'doc_id', 'unique_id', 'c_id', 'p_id', 'type', 'source']]
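One caveat when choosing between json.loads and ast.literal_eval: the sample data in the question contains only strings, so both parsers work, but they may diverge on other input. The sketch below uses a made-up string (not from the question) containing the JSON literals true and null, which json.loads accepts and ast.literal_eval rejects because they are not Python literals.

```python
import ast
import json

# Hypothetical input, not taken from the question: it uses the JSON
# literals `true` and `null`, which have no meaning as Python literals.
sample = '{"id": "lni003", "content": [], "verified": true, "note": null}'

# json.loads maps JSON literals to their Python equivalents (True / None).
parsed = json.loads(sample)

# ast.literal_eval only understands Python literal syntax, so it raises
# ValueError on the bare names `true` and `null`.
try:
    ast.literal_eval(sample)
    literal_eval_ok = True
except ValueError:
    literal_eval_ok = False

print(parsed)
print("ast.literal_eval succeeded:", literal_eval_ok)
```

If the column holds strictly valid JSON, json.loads is probably the safer default; ast.literal_eval may be the better fit when the strings are Python-style dict representations, for example with single-quoted keys.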
