Remove certain characters from lines while reading a file and save the result to a file

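I have a problem: a broken CSV file. The last column is free text, and unfortunately some users typed my delimiter ";" inside that free text, e.g. "This is a longer text and;ups that should not be". I now want to read the file line by line and replace every ";" after the second one with ",". I can already print which lines of the CSV file are broken. How can I read the file and make the replacement at the same time? Or should I save the line output and replace it afterwards?

Unfortunately, I don't know how to approach this kind of problem.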

import pandas as pd

with open("sample.csv", encoding="UTF-8") as file:
    for i, line in enumerate(file):
        x = line.split(";")
        if len(x) > 3:
            print(i, ": ", line)
            cleaned_x = ", ".join(x[2:])
            # Add cleaned_x to x
            new_line = x[0] + ";" + x[1] + ";" + cleaned_x
            print(new_line)

df = pd.read_csv("file.csv", encoding="utf-8", sep=";")

What I have

customerId;name;text
1;Josey;I want to go at 05pm
2;Mike;Check this out --> ?l
2;Frank;This is a longer text and;ups that should not be
2;Max;okay;
3;Josey;here is everythink good

What I want

customerId;name;text
1;Josey;I want to go at 05pm
2;Mike;Check this out --> ?l
2;Frank;This is a longer text and,ups that should not be
2;Max;okay,
3;Josey;here is everythink good

uj5u.com user reply:

You can save the lines in a list and create a new file.

new_sample = []
with open("sample.csv", encoding="UTF-8") as file:
    for i, line in enumerate(file):
        x = line.split(";")
        if len(x) > 3:
            print(i, ": ", line)
            # Rejoin everything after the second ";" with "," instead
            cleaned_x = ",".join(x[2:])
            new_line = x[0] + ";" + x[1] + ";" + cleaned_x
            print(new_line)
            new_sample.append(new_line)
        else:
            new_sample.append(line)

# Each stored line still ends with its newline, so write the lines as they are.
# (csv.writer.writerow would treat a string as a sequence of one-character fields.)
with open("new_sample.csv", "w", encoding="UTF-8") as new_file:
    new_file.writelines(new_sample)

uj5u.com user reply:

Define a custom function to read the CSV file, then create a new DataFrame from the resulting rows and cols:

import pandas as pd

def read_csv(path):
    with open(path) as file:
        for line in file:
            *v, t = line.strip().split(';', 2)
            yield [*v, t.replace(';', ',')]

cols, *rows = read_csv('sample.csv')
df = pd.DataFrame(rows, columns=cols)

print(df)
  customerId   name                                              text
0          1  Josey                              I want to go at 05pm
1          2   Mike                            Check this out --> ??l
2          2  Frank  This is a longer text and,ups that should not be
3          2    Max                                             okay,
4          3  Josey                           here is everythink good
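The step doing the work in this answer is the maxsplit argument of str.split. As a minimal sketch (the helper name fix_line and its parameters are hypothetical and not part of the answer), this is how it behaves on lines from the sample:

```python
def fix_line(line, sep=";", repl=",", n_cols=3):
    """Split on the first n_cols - 1 separators only; swap the rest for repl."""
    # maxsplit = n_cols - 1 leaves any further separators inside the last field
    *head, tail = line.rstrip("\n").split(sep, n_cols - 1)
    return [*head, tail.replace(sep, repl)]


print(fix_line("2;Frank;This is a longer text and;ups that should not be\n"))
print(fix_line("2;Max;okay;\n"))
print(fix_line("1;Josey;I want to go at 05pm\n"))
```

Lines that already have three fields pass through unchanged, so no separate check for broken lines is needed.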

uj5u.com user reply:

FYI, if you write the initial file with Python's csv library, it handles the embedded ";" correctly.

import csv

with open("test.csv", "w") as f:
    writer = csv.writer(f, delimiter=";")
    writer.writerow(["hello", "world", "hello;world"])

# test.csv contains hello;world;"hello;world"
# which will be read as three fields using csv.reader
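The round trip described in those comments can be checked without touching the disk. This is only a sketch: an in-memory io.StringIO buffer stands in for test.csv, which is an assumption made for illustration.

```python
import csv
import io

# An in-memory buffer stands in for test.csv so the sketch touches no files.
buffer = io.StringIO(newline="")
csv.writer(buffer, delimiter=";").writerow(["hello", "world", "hello;world"])
written = buffer.getvalue()

# Parse the same text again with the same delimiter.
rows = list(csv.reader(io.StringIO(written, newline=""), delimiter=";"))

print(repr(written))  # the quoted third field is visible here
print(rows)           # one row with three fields
```

Note that csv.writer ends each row with "\r\n" by default, which is why the csv documentation recommends opening real files with newline="".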

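Here is a way to solve your problem. I would write out a new file. It is possible to open the file in read/write mode, but that is more complicated: you have to read a line, move the position in the file, and write the new data while making sure you don't overwrite the bytes of the next line. It is much easier to write a new file and then rename it.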

import csv

with open("input.csv") as in_file, open("output.csv", "w") as out_file:

    reader = csv.reader(in_file, delimiter=";")
    writer = csv.writer(out_file, delimiter=";")

    for line in reader:  # line is a list containing the fields
        if len(line) > 3:
            line = line[:2] + [", ".join(line[2:])]
        writer.writerow(line)
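The "write a new file and then rename it" idea could be sketched roughly as follows. The function name repair_in_place and its parameters are hypothetical, and joining the overflow fields with "," (no space) is an assumption chosen to match the output requested in the question.

```python
import csv
import os
import tempfile


def repair_in_place(path, n_cols=3, delimiter=";"):
    """Rewrite path so every row has at most n_cols fields (hypothetical helper)."""
    directory = os.path.dirname(os.path.abspath(path))
    # Create the temporary file next to the original so the rename stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with open(fd, "w", newline="", encoding="utf-8") as out_file, \
                open(path, newline="", encoding="utf-8") as in_file:
            reader = csv.reader(in_file, delimiter=delimiter)
            writer = csv.writer(out_file, delimiter=delimiter)
            for row in reader:
                if len(row) > n_cols:
                    # Merge the overflow fields back into the last column.
                    row = row[:n_cols - 1] + [",".join(row[n_cols - 1:])]
                writer.writerow(row)
        os.replace(tmp_path, path)  # swap the repaired file in for the original
    except Exception:
        os.remove(tmp_path)  # do not leave a half-written temporary file behind
        raise
```

Calling repair_in_place("sample.csv") would then replace the file with its repaired version; the original is only overwritten after the new file has been written completely.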

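If you don't need to save the fixed file, you don't need to open "output.csv" or create the writer. Printing line after the correction shows the list of fields, e.g. ["hello", "world", "hello;world"].

If you want to print the string that would end up in the file, you need to wrap the fields that contain a semicolon in quotes.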

line = [f"\"{item}\"" if ";" in item else item for item in line]
print(";".join(line))
# hello;world;"hello;world"
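The hand-written quoting above does not escape quote characters that already occur inside a field. One alternative, sketched here with a hypothetical helper named to_csv_line, is to let csv.writer format the line into an in-memory buffer:

```python
import csv
import io


def to_csv_line(fields, delimiter=";"):
    """Return fields as one CSV-formatted line without the row terminator (hypothetical helper)."""
    buffer = io.StringIO(newline="")
    csv.writer(buffer, delimiter=delimiter).writerow(fields)
    # csv.writer appends "\r\n" by default; strip it so print() adds the only newline.
    return buffer.getvalue().rstrip("\r\n")


print(to_csv_line(["hello", "world", "hello;world"]))
# hello;world;"hello;world"
print(to_csv_line(["1", "Josey", 'she said "hi"; then left']))
# 1;Josey;"she said ""hi""; then left"
```

This way the printed string matches what the writer would put into the file, including doubled quotes.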

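uj5u.com user reply:

pandas can call a function to handle bad lines as they are encountered, via the on_bad_lines argument (the argument was added in version 1.3.0; passing a callable requires version 1.4.0 or later). From the documentation:

callable, function with signature (bad_line: list[str]) -> list[str] | None that will process a single bad line. bad_line is a list of strings split by the sep. If the function returns None, the bad line will be ignored. If the function returns a new list of strings with more elements than expected, a ParserWarning will be emitted while dropping extra elements. Only supported when engine="python".

So you can simply read the file: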

df = pd.read_csv('sample.csv', sep=';', engine='python', on_bad_lines=lambda x: x[:2] + [';'.join(x[2:])])

Then save it in whatever format you like. Or, to produce the output defined in the question:

df['text'] = df['text'].str.replace(';', ',')
df.to_csv('output.csv', sep=';', index=False)  # index=False keeps the row index out of the file

Please credit the source when reposting. Link to this article: https://www.uj5u.com/net/483057.html
