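I have a problem: I have a broken CSV file. The last column is free text, and unfortunately some users typed my delimiter, ;, inside that free text, for example: This is a longer text and;ups that should not be. I now want to read the file line by line and replace every ; after the second one with a ,. I already print which lines of the CSV file are broken. How can I read the file and replace the characters at the same time? Or should I save the printed lines and replace them afterwards?
Unfortunately, I don't know how to approach this kind of problem.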
import pandas as pd

with open("sample.csv", encoding="UTF-8") as file:
    for i, line in enumerate(file):
        x = line.split(";")
        if len(x) > 3:
            print(i, ": ", line)
            cleaned_x = ', '.join(x[2:])
            # Add cleaned_x to x
            new_line = x[0] + ";" + x[1] + ";" + cleaned_x
            print(new_line)

df = pd.read_csv("file.csv", encoding="utf-8", sep=";")
What I have:
customerId;name;text
1;Josey;I want to go at 05pm
2;Mike;Check this out --> ?l
2;Frank;This is a longer text and;ups that should not be
2;Max;okay;
3;Josey;here is everythink good
What I want:
customerId;name;text
1;Josey;I want to go at 05pm
2;Mike;Check this out --> ?l
2;Frank;This is a longer text and,ups that should not be
2;Max;okay,
3;Josey;here is everythink good
Reply from a uj5u.com user:
You can collect the lines in a list and then write them to a new file.
new_sample = []
with open("sample.csv", encoding="UTF-8") as file:
    for i, line in enumerate(file):
        x = line.split(";")
        if len(x) > 3:
            print(i, ": ", line)
            # Rejoin everything after the second ";" into one field
            cleaned_x = ', '.join(x[2:])
            new_line = x[0] + ";" + x[1] + ";" + cleaned_x
            print(new_line)
            new_sample.append(new_line)
        else:
            new_sample.append(line)

# Each stored line still ends with its newline, so write the lines back as they are
with open("new_sample.csv", "w", encoding="UTF-8") as new_file:
    new_file.writelines(new_sample)
Reply from a uj5u.com user:
Define a custom function that reads the CSV file, then build a new DataFrame from rows and cols:
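import pandas as pd

def read_csv(path):
    with open(path) as file:
        for line in file:
            # Split on the first two ";" only; t holds the whole free-text field
            *v, t = line.strip().split(';', 2)
            yield [*v, t.replace(';', ',')]

# The first yielded row is the header, the remaining rows are the data
cols, *rows = read_csv('sample.csv')
df = pd.DataFrame(rows, columns=cols)
print(df)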
   customerId   name                                              text
0           1  Josey                              I want to go at 05pm
1           2   Mike                            Check this out --> ??l
2           2  Frank  This is a longer text and,ups that should not be
3           2    Max                                             okay,
4           3  Josey                           here is everythink good
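To see why the split(';', 2) call in the function above is enough, here is a small sketch that applies the same split-and-replace step to two lines from the sample data. The helper name fix_line is made up for illustration and is not part of the answer.

```python
def fix_line(line):
    """Split on the first two ';' only, then turn any later ';' into ','."""
    # maxsplit=2 yields at most three parts; the last part keeps any extra ';'
    *head, text = line.strip().split(";", 2)
    return [*head, text.replace(";", ",")]


print(fix_line("2;Frank;This is a longer text and;ups that should not be\n"))
print(fix_line("2;Max;okay;\n"))
```

Because the split stops after the second delimiter, the number of stray ; characters in the free text does not matter.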
Reply from a uj5u.com user:
FYI, if you write the initial file with Python's csv library, it handles an embedded ; correctly:
import csv

# newline="" lets the csv module control the line endings itself
with open("test.csv", "w", newline="") as f:
    writer = csv.writer(f, delimiter=";")
    writer.writerow(["hello", "world", "hello;world"])
# test.csv contains hello;world;"hello;world"
# which will be read as three fields using csv.reader
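The round trip described in the comments above can be checked without touching the disk. This is only a sketch: it uses an in-memory io.StringIO buffer in place of test.csv, which is my substitution and not part of the answer.

```python
import csv
import io

# Write one row whose last field contains the delimiter
buffer = io.StringIO()
csv.writer(buffer, delimiter=";").writerow(["hello", "world", "hello;world"])
written = buffer.getvalue()

# Read it back: the quoted field comes out as a single value
rows = list(csv.reader(io.StringIO(written), delimiter=";"))
print(repr(written))
print(rows)
```

The writer quotes only the field that contains the delimiter, and the reader removes the quotes again.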
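Here is one way to solve your problem. I would write out a new file. It is possible to open the file in read/write mode, but that is more complicated: you have to read a line, move the position in the file, and write the new data while making sure you do not overwrite the bytes of the next line. Writing a new file and then renaming it is much easier.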
import csv

with open("input.csv") as in_file, open("output.csv", "w") as out_file:
    reader = csv.reader(in_file, delimiter=";")
    writer = csv.writer(out_file, delimiter=";")
    for line in reader:  # line is a list containing the fields
        if len(line) > 3:
            # Keep the first two fields and merge the rest into one
            line = line[:2] + [", ".join(line[2:])]
        writer.writerow(line)
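If you don't need to save the fixed file, you don't need to open "output.csv" or create the writer. Printing line after it has been corrected shows the list of fields, such as ["hello", "world", "hello;world"].
If you want to print the string as it would end up in the file, you need to wrap any field that contains a semicolon in quotes.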
line = [f"\"{item}\"" if ";" in item else item for item in line]
print(";".join(line))
# hello;world;"hello;world"
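The "write a new file, then rename it" step recommended in this answer could look roughly like the sketch below. The function name repair_in_place, the .tmp suffix, and the n_cols parameter are assumptions made for illustration; the merged text is joined with "," to match the output requested in the question.

```python
import csv
import os


def repair_in_place(path, n_cols=3):
    """Merge surplus ';'-separated fields into the last column, then swap files."""
    tmp_path = path + ".tmp"  # hypothetical temporary file next to the original
    with open(path, newline="") as in_file, open(tmp_path, "w", newline="") as out_file:
        reader = csv.reader(in_file, delimiter=";")
        writer = csv.writer(out_file, delimiter=";")
        for fields in reader:
            if len(fields) > n_cols:
                # Keep the leading columns, rejoin the rest with commas
                fields = fields[:n_cols - 1] + [",".join(fields[n_cols - 1:])]
            writer.writerow(fields)
    # Swap in the repaired file only once it has been written completely
    os.replace(tmp_path, path)
```

Calling repair_in_place("input.csv") would then leave a single repaired file under the original name. Since the original is replaced only after the new file is closed, a failure part-way through leaves the input untouched.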
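Reply from a uj5u.com user:
Pandas lets you pass a function that handles bad lines through the on_bad_lines parameter (the parameter was added in version 1.3.0; passing a callable requires version 1.4.0 or later). The documentation describes it as:
callable, function with signature (bad_line: list[str]) -> list[str] | None that will process a single bad line. bad_line is a list of strings split by the sep. If the function returns None, the bad line will be ignored. If the function returns a new list of strings with more elements than expected, a ParserWarning will be emitted while dropping extra elements. Only supported when engine="python".
So you can simply read the file like this: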
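import pandas as pd

df = pd.read_csv('sample.csv', sep=';', engine='python', on_bad_lines=lambda x: x[:2] + [';'.join(x[2:])])
Then save it in whatever format you like. Or, to produce the output defined in the question:
df['text'] = df['text'].str.replace(';', ',')
# index=False keeps the DataFrame index out of the file
df.to_csv('output.csv', sep=';', index=False)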
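The lambda above can also be written as a named function, which makes it easy to try on a single row before handing it to pandas. This is a sketch under assumptions: the name merge_extra_fields and the n_cols parameter are made up, and it joins with "," directly instead of rejoining with ";" and replacing afterwards.

```python
def merge_extra_fields(bad_line, n_cols=3):
    """Return a row of exactly n_cols fields, folding the surplus into the last one."""
    # bad_line is the row already split on the separator, e.g. ['2', 'Max', 'okay', '']
    return bad_line[:n_cols - 1] + [",".join(bad_line[n_cols - 1:])]


print(merge_extra_fields(["2", "Frank", "This is a longer text and", "ups that should not be"]))
print(merge_extra_fields(["2", "Max", "okay", ""]))
```

It could then be passed as on_bad_lines=merge_extra_fields together with engine='python'. Pandas calls it with the bad line only, so n_cols keeps its default value.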
Please credit the source when reposting. Article link: https://www.uj5u.com/net/483057.html
