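I'm writing a script that reads a TXT file in which every line is a log entry, and I need to split this log into separate files (one each for Hor, Sia and Lmu). With my test file (80 KB), reading every line and writing the new files works without problems, but when I apply it to the real file (177 MB, about 500k lines) it takes far too long. After more than an hour it was still only around line 80k.
The lines look like this: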
CRM|Hor|SiebelSeed
CRM|Sia|SiebelSeed
CRM|Lmu|LMU|
Is there any way I can make it run faster?
My code:
import re

comp_list = []
with open(path, "r", encoding="UTF-16") as file:
    for i, line in enumerate(file):
        if i > 2: # lines 1-2 are headers
            # The sample lines start with "CRM", so match case-insensitively.
            component = re.match(r"Crm\|([A-Za-z0-9_]+)", line, re.IGNORECASE).group(1)
            if component not in comp_list:
                comp_list.append(component)
                with open(f'HHR_Splitter/output/{component}.txt', 'w+', encoding="UTF-16") as new_file:
                    new_file.write('{}'.format(line))
            if component in comp_list:
                # The output file is re-opened for every single line.
                with open(f'HHR_Splitter/output/{component}.txt', 'a+', encoding="UTF-16") as existing_file:
                    existing_file.write('{}'.format(line))
            else:
                break
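Reply from a uj5u.com user:
The first thing I noticed is that you open the output file for every single line. You can open the files once and reuse them for all lines. The same goes for the regular expression: you can compile it once with re.compile() before the for loop.
Here is an example: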
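import re

def process_log(input_file, output_files):
    # Compile the pattern once, outside the loop.
    # The sample lines start with "CRM", so match case-insensitively.
    prog = re.compile(r"Crm\|([A-Za-z0-9_]+)", re.IGNORECASE)
    for i, line in enumerate(input_file):
        if i > 2:
            component = prog.match(line).group(1)
            output_files[component].write(line)

def open_outputs_files():
    # Open every output file once and keep it open; the caller closes them.
    # A component missing from this list would raise KeyError in process_log.
    output_files = {}
    components = ["Crm", "Hor", "Sia", "Lmu", "SiebelSeed"]
    for component in components:
        output_files[component] = open(f'HHR_Splitter/output/{component}.txt', 'w+', encoding="UTF-16")
    return output_files

with open(path, "r", encoding="UTF-16") as input_file:
    output_files = open_outputs_files()
    try:
        process_log(input_file, output_files)
    finally:
        for output_file in output_files.values():
            output_file.close()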
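The same idea (open each output file once, do as little work per line as possible) can also be sketched without a hard-coded component list. The version below is only an illustration under a few assumptions: the function name split_log and its parameters are hypothetical, the component is taken to be the second "|"-separated field as in the sample lines, and header_lines=3 mirrors the "i > 2" check from the question. It opens a file the first time a component is seen and uses contextlib.ExitStack so that every handle is closed at the end.

```python
from contextlib import ExitStack
from pathlib import Path


def split_log(input_path, output_dir, header_lines=3, encoding="utf-16"):
    """Write each log line to <output_dir>/<component>.txt and return line counts.

    Hypothetical helper: assumes lines look like "CRM|<component>|...".
    """
    out_dir = Path(output_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    handles = {}  # component -> open file handle
    counts = {}   # component -> number of lines written
    with ExitStack() as stack, open(input_path, "r", encoding=encoding) as src:
        for i, line in enumerate(src):
            if i < header_lines:
                continue  # skip the header lines
            fields = line.split("|")
            if len(fields) < 2 or not fields[1].strip():
                continue  # line does not carry a component field
            component = fields[1].strip()
            if component not in handles:
                # First occurrence: open the output file once and keep it open.
                handles[component] = stack.enter_context(
                    open(out_dir / f"{component}.txt", "w", encoding=encoding)
                )
                counts[component] = 0
            handles[component].write(line)
            counts[component] += 1
    return counts
```

It could be called as split_log(path, "HHR_Splitter/output"), and the returned dictionary gives the number of lines written per component. Keeping one handle per component open is fine for a handful of components such as Hor, Sia and Lmu; with thousands of distinct components the operating system's open-file limit would become a concern.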
