Reading a file faster in Python

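I am writing a script that reads a TXT file in which every line is a log entry, and I need to split this log into separate files (one each for Hor, Sia and Lmu). With my test file (80 KB), reading each line and writing it to the new files works without problems, but when I run it on the real file (177 MB, roughly 500k lines) it takes far too long. After more than an hour it had still only read about 80k lines.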

The lines look like this:

CRM|Hor|SiebelSeed

CRM|Sia|SiebelSeed

CRM|Lmu|LMU|

Is there any way I can make it run faster?

My code:

with open(path, "r", encoding="UTF-16") as file:
    for i, line in enumerate(file): 
    
            if i > 2: # lines 1-2 are headers
                component = re.match(r"Crm\|([A-Za-z0-9_]+)|]", line).group(1)
                
                if component not in comp_list:
                    comp_list.append(component)
                    
                    with open(f'HHR_Splitter/output/{component}.txt','w+', encoding="UTF-16") as new_file:
                        new_file.write('{}'.format(line))
                        
                        
                if component in comp_list:
                    
                    with open(f'HHR_Splitter/output/{component}.txt','a+', encoding="UTF-16") as existing_file: 
                        existing_file.write('{}'.format(line))

                else:
                    break

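Reply from a uj5u.com community member:

The first thing I noticed is that you open an output file for every single line. You can open each output file once and reuse it for all lines. The same applies to the regular expression: you can compile it once with re.compile() before the for loop instead of re-evaluating the pattern on every iteration.

Here is an example: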

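import re

def process_log(input_file, output_files):
    # Compile the pattern once, outside the loop.
    prog = re.compile(r"Crm\|([A-Za-z0-9_]+)|]")
    for i, line in enumerate(input_file):
        if i > 2:  # skip the header lines, as in the question
            component = prog.match(line).group(1)
            output_files[component].write(line)

def open_outputs_files():
    # Open every output file once and keep it open; the caller closes them.
    output_files = {}
    components = ["Crm", "Hor", "Sia", "Lmu", "SiebelSeed"]
    for component in components:
        output_files[component] = open(
            f'HHR_Splitter/output/{component}.txt', 'w+', encoding="UTF-16")
    return output_files

# path is the input log file from the question.
with open(path, "r", encoding="UTF-16") as input_file:
    output_files = open_outputs_files()
    try:
        process_log(input_file, output_files)
    finally:
        # Close all output files even if processing fails.
        for output_file in output_files.values():
            output_file.close()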
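
If the component names are not known in advance, the same idea (open each file once, compile the pattern once) can be combined with opening files lazily the first time a component shows up. The following is only a sketch under some assumptions: the function name split_log, its parameters, and the pattern COMPONENT_RE are hypothetical and not from the question. The pattern assumes the component is the second "|"-separated field, as in the sample lines, and the header count defaults to 2 based on the comment in the question's code.

```python
import re
from contextlib import ExitStack
from pathlib import Path

# Hypothetical pattern (an assumption, not the question's regex): captures the
# second "|"-separated field, e.g. "Hor" in "CRM|Hor|SiebelSeed".
COMPONENT_RE = re.compile(r"[^|]*\|([A-Za-z0-9_]+)\|")

def split_log(input_path, output_dir, encoding="UTF-16", header_lines=2):
    """Write each log line to <output_dir>/<component>.txt, opening each file once."""
    output_dir = Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)
    handles = {}  # component name -> open file object
    with ExitStack() as stack, open(input_path, "r", encoding=encoding) as src:
        for i, line in enumerate(src):
            if i < header_lines:
                continue  # skip the header lines
            match = COMPONENT_RE.match(line)
            if match is None:
                continue  # line does not look like a log entry
            component = match.group(1)
            out = handles.get(component)
            if out is None:
                # First time this component appears: open its file once and let
                # ExitStack close it when the with-block exits.
                out = stack.enter_context(
                    open(output_dir / f"{component}.txt", "w", encoding=encoding))
                handles[component] = out
            out.write(line)
    return sorted(handles)
```

Calling split_log(path, "HHR_Splitter/output") would then produce one file per component found in the log. Because each output file is opened a single time, the per-line cost is reduced to one regex match and one write, which is usually the main win for inputs of this size.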

Please credit the source when reposting. Link to this article: https://www.uj5u.com/qiye/422655.html
