Merge all .csv files in a folder on a common field present in every file

So, I have a directory containing .csv files. For example:

a.csv

id,name
1,john
2,mary
3,alex

b.csv

id,birth
1,01.01.2001
2,05.06.1990

c.csv

id,death
2,01.02.2020
1,-

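The result should be a dictionary whose keys are the id (int) and whose values are dictionaries holding all the distinct values found across the files (a dictionary of dictionaries). Something like this: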

{
        1: {"id": 1, "name": "john", "birth": "01.01.2001", "death": "-"},
        2: {"id": 2, "name": "mary", "birth": "05.06.1990",
            "death": "01.02.2020"},
        3: {"id": 3, "name": "alex", "birth": None, "death": None},
}

So far, I have tried merging all the files into a single DataFrame:

from pathlib import Path
import os
import pandas as pd

files = Path(r'path').rglob('*.csv')

# read in all the csv files
all_csvs = [pd.read_csv(file) for file in files]

# lump into one table
all_csvs = pd.concat(all_csvs, axis=1)

But as a result I get a DataFrame in which "id" is repeated in three columns.

Any help would be greatly appreciated!

A uj5u.com user replied:

You want merge rather than concat. Since you need to merge multiple DataFrames, you can do the following:

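import os
from functools import reduce

import pandas as pd

# Read every .csv file in the current working directory
all_csvs = [pd.read_csv(file) for file in os.listdir() if file.endswith(".csv")]
# Outer-merge the frames pairwise on "id" so that no id is dropped
df = reduce(lambda left, right: pd.merge(left, right, how="outer", on="id"), all_csvs)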

>>> df

   id  name       birth       death
0   1  john  01.01.2001           -
1   2  mary  05.06.1990  01.02.2020
2   3  alex         NaN         NaN

# For dictionary output, replace NaN with None
my_dict = df.where(df.notnull(), None).set_index("id", drop=False).to_dict(orient="index")
>>> my_dict

{1: {'id': 1, 'name': 'john', 'birth': '01.01.2001', 'death': '-'},
 2: {'id': 2, 'name': 'mary', 'birth': '05.06.1990', 'death': '01.02.2020'},
 3: {'id': 3, 'name': 'alex', 'birth': None, 'death': None}}
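
To see the difference between concat and merge without touching the file system, here is a minimal sketch that reads in-memory copies of the question's three files through io.StringIO. The names csv_texts, side_by_side, id_columns and records are illustrative only, and the astype(object) cast is an extra precaution that is not part of the answer above: it is there on the assumption that some pandas versions will not store None in a non-object column.

```python
import io
from functools import reduce

import pandas as pd

# In-memory stand-ins for a.csv, b.csv and c.csv from the question.
csv_texts = [
    "id,name\n1,john\n2,mary\n3,alex\n",
    "id,birth\n1,01.01.2001\n2,05.06.1990\n",
    "id,death\n2,01.02.2020\n1,-\n",
]
frames = [pd.read_csv(io.StringIO(text)) for text in csv_texts]

# concat(axis=1) lines rows up by position and keeps every column,
# so "id" appears once per input file.
side_by_side = pd.concat(frames, axis=1)
id_columns = [col for col in side_by_side.columns if col == "id"]

# merge lines rows up by the value of "id" and keeps a single key column.
merged = reduce(
    lambda left, right: left.merge(right, on="id", how="outer"), frames
)

# Cast to object first so that None can be stored in every column.
records = (
    merged.astype(object)
    .where(merged.notna(), None)
    .set_index("id", drop=False)
    .to_dict(orient="index")
)
```

Here id_columns ends up with three entries, which is the duplication described in the question, while records holds one inner dictionary per id. Note that "-" is kept as an ordinary string, because it is not among the markers read_csv treats as missing by default.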

A uj5u.com user replied:

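If you like, you can even do this without pandas. First, create a defaultdict that will hold the data from all the csv files. Let the default element of this dict be a dictionary representing a "default" person, i.e. one in which every key has the value None.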

import collections

def default_person():
    return {'id': None, 'name': None, 'birth': None, 'death': None}
all_csvs = collections.defaultdict(default_person)

The keys of this dictionary will be the id field, and the values will be dictionaries containing all the information you want.
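
As a quick illustration of how the default factory behaves, the sketch below repeats default_person and touches two ids by hand. The variable name people and the sample values are assumptions made for the demonstration, not part of the data-loading code.

```python
import collections


def default_person():
    # Template record: every expected field starts out as None.
    return {"id": None, "name": None, "birth": None, "death": None}


people = collections.defaultdict(default_person)

# Looking up a missing key calls default_person() and stores the result,
# so update() only has to overwrite the fields that are actually known.
people[3].update({"id": 3, "name": "alex"})

# Each key receives its own fresh dictionary; records do not share state.
people[1]["name"] = "john"
```

After this, people[3] still has birth and death set to None, and changing people[1] leaves people[3] untouched.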

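Next, read each file with csv.DictReader. DictReader reads each row of a csv file as a dictionary whose keys come from the file's header. Then, for each row in each file, update the dictionary stored under that row's id in the defaultdict we just created: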

import csv
from pathlib import Path

files = Path(r'path').rglob('*.csv')

for file in files:
    with open(file, "r", newline="") as f_in:
        reader = csv.DictReader(f_in)
        for row_dict in reader:
            p_id = row_dict['id'] = int(row_dict['id']) # Convert `id` to integer
            all_csvs[p_id].update(row_dict)

Now, all_csvs looks like this:

defaultdict(<function __main__.default_person()>,
            {
             1: {'id': 1, 'name': 'john', 'birth': '01.01.2001', 'death': '-'},
             2: {'id': 2, 'name': 'mary', 'birth': '05.06.1990', 'death': '01.02.2020'},
             3: {'id': 3, 'name': 'alex', 'birth': None, 'death': None}
            })
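
If the column names are not known in advance, one possible variation is to collect them from the headers while reading and fill in the gaps afterwards, instead of hard-coding them in default_person. The following is only a sketch under that assumption: merge_csv_dir, its key parameter, and the temporary sample files are hypothetical names introduced for the demonstration.

```python
import csv
import tempfile
from pathlib import Path


def merge_csv_dir(folder, key="id"):
    """Merge every *.csv under `folder` into {key: record}, using None for gaps."""
    merged = {}
    all_fields = []  # column names in first-seen order
    for path in sorted(Path(folder).rglob("*.csv")):
        with open(path, newline="") as f_in:
            reader = csv.DictReader(f_in)
            for name in reader.fieldnames or []:
                if name not in all_fields:
                    all_fields.append(name)
            for row in reader:
                row[key] = int(row[key])  # use integer ids as dictionary keys
                merged.setdefault(row[key], {}).update(row)
    # Give every record the full set of columns, with None where no file had a value.
    return {
        k: {name: record.get(name) for name in all_fields}
        for k, record in merged.items()
    }


# Demo on throwaway copies of the question's sample files.
samples = {
    "a.csv": "id,name\n1,john\n2,mary\n3,alex\n",
    "b.csv": "id,birth\n1,01.01.2001\n2,05.06.1990\n",
    "c.csv": "id,death\n2,01.02.2020\n1,-\n",
}
with tempfile.TemporaryDirectory() as tmp:
    for filename, text in samples.items():
        Path(tmp, filename).write_text(text)
    result = merge_csv_dir(tmp)
```

The result is a plain dict rather than a defaultdict, so looking up an id that never appeared in any file raises KeyError instead of silently creating an empty record.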

Please credit the source when reposting. Link to this article: https://www.uj5u.com/gongcheng/326233.html
