Using pandas.json_normalize to "unfold" a dictionary of lists of dictionaries

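I'm new to Python (and to coding in general), so I'll do my best to explain the challenge I'm trying to solve.

I'm working with a large dataset that was exported from a database as a CSV. However, one column in this CSV export contains (as far as I can tell) a nested list of dictionaries. I've searched extensively online for a solution, including on Stack Overflow, but haven't found a complete one. I think I understand conceptually what I'm trying to accomplish, but I'm not clear on the best method or data preparation steps to use.

Here is a sample of the data (reduced to the two columns I'm interested in):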

    {
       "app_ID": {
          "0": "1abe23574",
          "1": "4gbn21096"
       },
       "locations": {
          "0": "[{\"loc_id\": \"abc1\", \"lat\": \"12.3456\", \"long\": \"101.9876\"}, {\"loc_id\": \"abc2\", \"lat\": \"45.7890\", \"long\": \"102.6543\"}]",
          "1": "[]"
       }
    }

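Basically, each app_ID can have multiple locations tied to it, or it can have none, as shown above. I've tried following a few guides I found online that use pandas' json_normalize() function to "unfold" the list of dictionaries so that each dictionary gets its own row in a pandas DataFrame.

I'd like to end up with something like this: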

loc_id    lat      long       app_ID
abc1      12.3456  101.9876   1abe23574
abc2      45.7890  102.6543   1abe23574

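and so on...

I'm learning how to use the different parameters of json_normalize, such as "record_path" and "meta", but haven't been able to get it to work.

I tried loading the JSON file into a Jupyter Notebook with: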

import json
import pandas as pd

with open('location_json.json', 'r') as f:
    data = json.load(f)
df = pd.json_normalize(data, record_path=['locations'])

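But it only creates a DataFrame that is 1 row and many columns long, whereas I want multiple rows generated from the innermost dictionaries, each associated with its app_ID and loc_id fields.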
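A minimal sketch of where a single wide row can come from, assuming the file holds the column-oriented structure shown in the sample and dropping the record_path argument for simplicity: given one outer dictionary, json_normalize treats it as a single record and flattens the nested keys into dotted column names.

```python
import pandas as pd

# Sample data in the same shape as the export: two "columns",
# each keyed by row label, with locations stored as JSON strings.
data = {
    "app_ID": {"0": "1abe23574", "1": "4gbn21096"},
    "locations": {
        "0": '[{"loc_id": "abc1", "lat": "12.3456", "long": "101.9876"}, '
             '{"loc_id": "abc2", "lat": "45.7890", "long": "102.6543"}]',
        "1": "[]",
    },
}

# The outer dict is one record, so the result has one row;
# nested keys become columns such as "app_ID.0" and "locations.1".
wide = pd.json_normalize(data)
print(wide.shape)
print(sorted(wide.columns))
```

The location lists stay unparsed inside single cells here, which is why no per-location rows appear.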

Attempted solution:

I was able to get close to the DataFrame format I want using:

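import json
import pandas as pd
with open('location_json.json', 'r') as f:
    data = json.load(f)
# Assumes the value is a JSON string, as in the sample, so it is parsed first
df = pd.json_normalize(json.loads(data['locations']['0']))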

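But this would require some kind of iteration through the lists to build the DataFrame, and then I'd lose the connection to the app_ID field (as far as I understand how json_normalize works).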
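The iteration does not have to lose the app_ID link. A rough sketch, assuming each locations value is a JSON string as in the sample and that app_ID and locations share the same keys: normalize one list at a time, attach the matching app_ID as a new column, then concatenate the pieces.

```python
import json
import pandas as pd

data = {
    "app_ID": {"0": "1abe23574", "1": "4gbn21096"},
    "locations": {
        "0": '[{"loc_id": "abc1", "lat": "12.3456", "long": "101.9876"}, '
             '{"loc_id": "abc2", "lat": "45.7890", "long": "102.6543"}]',
        "1": "[]",
    },
}

frames = []
for key, raw in data["locations"].items():
    records = json.loads(raw)             # JSON string -> list of dicts
    if not records:                       # apps without locations add no rows
        continue
    part = pd.json_normalize(records)     # one row per location
    part["app_ID"] = data["app_ID"][key]  # carry the parent ID onto every row
    frames.append(part)

flat = pd.concat(frames, ignore_index=True)
print(flat)
```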

Am I on the right track trying to use json_normalize, or should I start over and try a different route? Any advice or guidance would be greatly appreciated.
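One possible route that stays with json_normalize, sketched under the assumption that each locations value is a JSON string as in the sample: reshape the data into one record per app, parse the strings, and then let record_path pick the list to unfold while meta names the parent field to repeat on each row.

```python
import json
import pandas as pd

data = {
    "app_ID": {"0": "1abe23574", "1": "4gbn21096"},
    "locations": {
        "0": '[{"loc_id": "abc1", "lat": "12.3456", "long": "101.9876"}, '
             '{"loc_id": "abc2", "lat": "45.7890", "long": "102.6543"}]',
        "1": "[]",
    },
}

# Column-oriented dict -> DataFrame with one row per app.
apps = pd.DataFrame(data)
# Turn each JSON string into a real list of dicts.
apps["locations"] = apps["locations"].apply(json.loads)

# One dict per app: {"app_ID": ..., "locations": [...]}.
records = apps.to_dict("records")

# record_path: the list to turn into rows; meta: parent fields to keep.
flat = pd.json_normalize(records, record_path="locations", meta=["app_ID"])
print(flat)
```

Apps with an empty location list contribute no rows in this sketch; if they should be kept, a left merge of the result back onto the app_ID column is one option.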

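A uj5u.com user replied:

I can't say that recommending the convtools library is a good idea, given that you're a beginner, because this library is almost like another Python on top of Python. It helps define data transformations dynamically (generating Python code under the hood).

But anyway, if I've understood the input data correctly, here is the code: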

import json
from convtools import conversion as c

data = {
    "app_ID": {"0": "1abe23574", "1": "4gbn21096"},
    "locations": {
        "0": """[ {"loc_id" : "abc1",  "lat" : "12.3456",  "long" : "101.9876" },
              {"loc_id" : "abc2",  "lat" : "45.7890",  "long" : "102.6543"} ]""",
        "1": "[ ]",
    },
}

# define it once and use multiple times
converter = (
    c.join(
        # converts "app_ID" data to iterable of dicts
        (
            c.item("app_ID")
            .call_method("items")
            .iter({"id": c.item(0), "app_id": c.item(1)})
        ),
        # converts "locations" data to iterable of dicts,
        # where each id like "0" is zipped to each location.
        # the result is iterable of dicts like {"id": "0", "loc": {"loc_id": ... }}
        (
            c.item("locations")
            .call_method("items")
            .iter(
                c.zip(id=c.repeat(c.item(0)), loc=c.item(1).pipe(json.loads))
            )
            .flatten()
        ),
        # join on "id"
        c.LEFT.item("id") == c.RIGHT.item("id"),
        how="full",
    )
    # process results, where 0 index is LEFT item, 1 index is the RIGHT one
    .iter(
        {
            "loc_id": c.item(1, "loc", "loc_id", default=None),
            "lat": c.item(1, "loc", "lat", default=None),
            "long": c.item(1, "loc", "long", default=None),
            "app_id": c.item(0, "app_id"),
        }
    )
    .as_type(list)
    .gen_converter()
)
result = converter(data)

assert result == [
    {'loc_id': 'abc1', 'lat': '12.3456', 'long': '101.9876', 'app_id': '1abe23574'},
    {'loc_id': 'abc2', 'lat': '45.7890', 'long': '102.6543', 'app_id': '1abe23574'},
    {'loc_id': None, 'lat': None, 'long': None, 'app_id': '4gbn21096'}
]
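For comparison, the same rows can be built without convtools. This is only a sketch in plain Python plus pandas, not the answer's method: it loops over the apps rather than doing a full join, so it assumes every locations key also appears under app_ID, and it emits one all-None row for an app with no locations.

```python
import json
import pandas as pd

data = {
    "app_ID": {"0": "1abe23574", "1": "4gbn21096"},
    "locations": {
        "0": """[ {"loc_id" : "abc1",  "lat" : "12.3456",  "long" : "101.9876" },
              {"loc_id" : "abc2",  "lat" : "45.7890",  "long" : "102.6543"} ]""",
        "1": "[ ]",
    },
}

rows = []
for key, app_id in data["app_ID"].items():
    # Parse the JSON string; a missing or empty list becomes one empty dict
    # so the app still produces a row.
    locs = json.loads(data["locations"].get(key, "[]")) or [{}]
    for loc in locs:
        rows.append(
            {
                "loc_id": loc.get("loc_id"),
                "lat": loc.get("lat"),
                "long": loc.get("long"),
                "app_id": app_id,
            }
        )

df = pd.DataFrame(rows)
print(df)
```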

