How do I add each csv file's name as a column value when merging 1000+ files?

I am trying to merge more than 1000 csv files with the following code:

import glob
import shutil

path = r'path_to_files/'
all_files = glob.glob(path + "/*.csv")

with open('updated_thirteen_jan.csv','wb') as wfd:
    for f in all_files:
        with open(f,'rb') as fd:
            shutil.copyfileobj(fd, wfd)

I use the code above to avoid running out of RAM, and it works fine. However, I would like it to do what the following code does for me:

import glob
import os

import pandas as pd

path = r'path_to_files/'
all_files = glob.glob(path + "/*.csv")
fields = ['col1', 'col2', 'col3', 'col4', 'col5', 'col6', 'col7', 'col8']
li = []

first_one = True
for filename in all_files:

    if not first_one: # if it is not the first csv file then skip the header row (row 0) of that file
        skip_row = [0]
    else:
        skip_row = []

for filename in all_files:
    df = pd.read_csv(filename, index_col=None, skiprows = skip_row, engine='python', usecols=fields)
    df = df[(df['lang'] == 'en')]
    filename = os.path.basename(filename)
    df['file_name'] = filename


    li.append(df)

frame = pd.concat(li, axis=0, ignore_index=True)

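From this code, I want to keep the column selection (fields), the row skipping (skip_row), and the file_name column holding each file's name.

Is there any guidance on how to do this?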

A uj5u.com user replied:

If memory is the constraint, one pandas-based solution is to iterate over each file in chunks of rows:

import os

import pandas as pd

print(pd.__version__)
# works with this version: '1.3.4'

# gen sample files
all_files = [f"{_}.csv" for _ in range(3)]
for filename in all_files:
    df = pd.DataFrame(range(3))
    df.to_csv(filename, index=False)

# combine into one
mode = "w"
header = True
for filename in all_files:
    with pd.read_csv(
        filename,
        engine="python",
        iterator=True,
        chunksize=10_000,
    ) as reader:
        for df in reader:
            filename = os.path.basename(filename)
            df["file_name"] = filename
            df.to_csv("some_file.csv", index=False, mode=mode, header=header)
            mode = "a"
            header = False
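
The chunked loop above can be extended with the column selection and row filter from the question. The following is a minimal sketch, not a definitive implementation: the helper name merge_csvs and the demo files are hypothetical, and it assumes every file has a lang column that is included in fields.

```python
import os
import tempfile

import pandas as pd


def merge_csvs(all_files, out_file, fields, lang="en", chunksize=10_000):
    """Stream each CSV in chunks, keep `fields`, filter on lang, add file_name."""
    mode, header = "w", True  # first write overwrites stale output and emits the header
    for path in all_files:
        name = os.path.basename(path)
        with pd.read_csv(path, usecols=fields, chunksize=chunksize) as reader:
            for chunk in reader:
                # keep only matching rows; copy so adding a column is safe
                chunk = chunk[chunk["lang"] == lang].copy()
                chunk["file_name"] = name
                chunk.to_csv(out_file, index=False, mode=mode, header=header)
                mode, header = "a", False  # later writes append without a header


# Demo on two small throwaway files (hypothetical data).
with tempfile.TemporaryDirectory() as tmp:
    paths = []
    for i, langs in enumerate((["en", "fr"], ["de", "en"])):
        p = os.path.join(tmp, f"part{i}.csv")
        pd.DataFrame(
            {"col1": [i, i + 10], "lang": langs, "extra": ["x", "y"]}
        ).to_csv(p, index=False)
        paths.append(p)
    out = os.path.join(tmp, "merged.csv")
    merge_csvs(paths, out, fields=["col1", "lang"], chunksize=1)
    merged = pd.read_csv(out)
    print(merged)
```

Only one chunk is held in memory at a time, so chunksize can be tuned to the machine. Note that usecols returns columns in file order, so the merged file is only consistent if all inputs share the same column order.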

A uj5u.com user replied:

Another solution is to use dask:

# pip install dask
import dask.dataframe as dd

# dd.read_csv is mostly compatible with pd.read_csv options
# so can specify reading specific columns, etc.
ddf = dd.read_csv("some_path/*.csv")
ddf.to_csv('merged_file.csv', index=False, single_file=True)

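A uj5u.com user replied:

The good old csv module processes one row at a time, so memory will not be a problem. The following code concatenates the csv files while keeping only the first header, and adds a filename column filled with each file's name.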

import csv
import glob

path = r'path_to_files/'
all_files = glob.glob(path + "/*.csv")

with open('updated_thirteen_jan.csv','w', newline='') as wfd:
    wr = csv.writer(wfd)
    first = True
    for f in all_files:
        with open(f, newline='') as fd:
            rd = csv.reader(fd)
            # skip header line, except for the first file
            row = next(rd)
            if first:
                row.append('filename')
                wr.writerow(row)
                first = False
            for row in rd:
                row.append(f)
                wr.writerow(row)
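
The same row-by-row idea can also cover the column selection and the lang filter from the question. The following is a sketch under stated assumptions: the helper name merge_with_csv and the demo files are hypothetical, every input is assumed to have a header row containing the requested fields and a lang column, and the base name is stored instead of the full path.

```python
import csv
import os
import tempfile


def merge_with_csv(all_files, out_file, fields, lang="en"):
    """Row-by-row merge: keep `fields`, keep rows whose lang matches, add file_name."""
    with open(out_file, "w", newline="") as wfd:
        writer = csv.DictWriter(
            wfd,
            fieldnames=list(fields) + ["file_name"],
            extrasaction="ignore",  # silently drop columns not listed in fields
        )
        writer.writeheader()  # header is written once, up front
        for path in all_files:
            name = os.path.basename(path)
            with open(path, newline="") as fd:
                # DictReader consumes each file's own header line
                for row in csv.DictReader(fd):
                    if row.get("lang") != lang:
                        continue
                    row["file_name"] = name
                    writer.writerow(row)


# Demo on two small throwaway files (hypothetical data).
with tempfile.TemporaryDirectory() as tmp:
    sources = []
    demo_rows = ([["1", "en", "x"], ["2", "fr", "y"]], [["3", "en", "z"]])
    for i, rows in enumerate(demo_rows):
        p = os.path.join(tmp, f"part{i}.csv")
        with open(p, "w", newline="") as fh:
            w = csv.writer(fh)
            w.writerow(["col1", "lang", "extra"])
            w.writerows(rows)
        sources.append(p)
    out = os.path.join(tmp, "merged.csv")
    merge_with_csv(sources, out, fields=["col1", "lang"])
    with open(out, newline="") as fh:
        result = list(csv.reader(fh))
    print(result)
```

Because columns are looked up by name, this version does not depend on the inputs sharing the same column order.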

A uj5u.com user replied:

Read one file at a time into a pandas DataFrame, add the new column to it, and append it to the output file.

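import glob
import os
import pathlib

import pandas as pd

path = 'path_to_files/'
out_file = 'updated_thirteen_jan.csv'
all_files = glob.glob(path + '*.csv')
all_files = sorted([pathlib.Path(i) for i in all_files])

keep_cols = ['list', 'of', 'columns', 'to', 'keep']
skip_row = 2  # number of rows to skip

# mode='a' below appends, so remove any output left over from a previous run
if os.path.isfile(out_file):
    os.remove(out_file)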

for fn in all_files:
    temp = pd.read_csv(fn, usecols=keep_cols, skiprows=skip_row)
    temp['filename'] = fn.stem
    temp.to_csv(out_file, mode='a', index=False, header=not os.path.isfile(out_file))

If reading a whole csv into memory is not feasible, use chunksize. Adjust this value to your machine's capacity.

for fn in all_files:
    reader = pd.read_csv(fn, usecols=keep_cols, skiprows=skip_row, chunksize=5000)
    for idx, df in enumerate(reader):
        df['filename'] = fn.stem
        df.to_csv(out_file, mode='a', index=False, header=not os.path.isfile(out_file))

Please credit the source when reposting. Link to this article: https://www.uj5u.com/ruanti/409851.html
