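I have a large 120 GB file made up of strings, one per line. I want to loop over the file line by line and replace every German character ß with the character s. I have working code, but it is very slow, and in the future I will need to replace more German characters. So I have been trying to split the file into 6 pieces (for my 6-core CPU) and use multiprocessing to speed the code up, but with no luck.
Since the lines are not ordered, I don't care where the lines end up in the new file. Can anyone help me?
My working but slow code: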
import re
with open(r'C:\Projects\orders.txt', 'r') as f, open(r'C:\Projects\orders_new.txt', 'w') as nf:
    for l in f:
        l = re.sub("ß", "s", l)
        nf.write(l)
A uj5u.com user replied:
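For a multiprocessing solution to outperform the equivalent single-process solution, the worker function must be CPU-intensive enough that running it in parallel saves more time than the additional overhead that multiprocessing incurs.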
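One rough way to judge that trade-off is to time the replacement on an in-memory batch of lines before involving any processes. The following is only a sketch: the names `replace_batch` and `seconds_per_batch` and the sample line are hypothetical, and it assumes the character being replaced is ß.

```python
import time

def replace_batch(lines):
    # The same per-line work a worker process would do.
    return [line.replace("ß", "s") for line in lines]

def seconds_per_batch(batch_size, repeats=5):
    # Hypothetical sample data; real lines will differ in length.
    lines = ["Straße und Fluß\n"] * batch_size
    start = time.perf_counter()
    for _ in range(repeats):
        replace_batch(lines)
    return (time.perf_counter() - start) / repeats

if __name__ == "__main__":
    for size in (1_000, 100_000):
        print(size, seconds_per_batch(size))
```

If a batch takes only a tiny fraction of a second, the cost of sending it to another process and back may well outweigh the parallel speed-up.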
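To make the worker function sufficiently CPU-intensive, I would batch the lines to be converted into chunks, so that each call to the worker function does more CPU work. You can experiment with the CHUNK_SIZE value (read the comment preceding its definition). If you have enough memory, the larger the better.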
from multiprocessing import Pool, cpu_count

def get_chunks():
    # If you have N processors,
    # then we need memory to hold 2 * (N - 1) chunks (one processor
    # is reserved for the main process).
    # The size of a chunk is CHUNK_SIZE * average-line-length.
    # If the average line length were 100, then a chunk would require
    # approximately 100_000 bytes of memory.
    # So with 8 processors, for example, the 14 chunks would need
    # roughly 1.4 MB, and you would have more
    # than enough memory for this CHUNK_SIZE.
    CHUNK_SIZE = 1_000
    # Raw strings keep the backslashes in the Windows paths literal.
    with open(r'C:\Projects\orders.txt', 'r', encoding='utf-8') as f:
        chunk = []
        while True:
            line = f.readline()
            if line == '': # end of file
                break
            chunk.append(line)
            if len(chunk) == CHUNK_SIZE:
                yield chunk
                chunk = []
        if chunk:
            yield chunk

def worker(chunk):
    # This function must be sufficiently CPU-intensive
    # to justify multiprocessing.
    for idx in range(len(chunk)):
        chunk[idx] = chunk[idx].replace("ß", "s")
    return chunk

def main():
    # Leave one processor for the main process, but use at least one worker.
    with Pool(max(1, cpu_count() - 1)) as pool, \
         open(r'C:\Projects\orders_new.txt', 'w', encoding='utf-8') as nf:
        for chunk in pool.imap_unordered(worker, get_chunks()):
            nf.write(''.join(chunk))
            """
            Or to be more memory efficient, but slower:
            for line in chunk:
                nf.write(line)
            """

if __name__ == '__main__':
    main()
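The question mentions that more German characters will need replacing later. Here is a minimal sketch of one way the worker could be extended for that, assuming a single `str.translate` pass per line is acceptable. Only the ß to s mapping comes from the question; the umlaut entries and the names `GERMAN_TABLE` and `translate_chunk` are hypothetical and only for illustration.

```python
# Hypothetical mapping: only "ß" -> "s" comes from the question;
# the umlaut entries are placeholders for illustration.
GERMAN_TABLE = str.maketrans({
    "ß": "s",
    "ä": "a",
    "ö": "o",
    "ü": "u",
})

def translate_chunk(chunk):
    # Apply every replacement in the table in one pass per line.
    return [line.translate(GERMAN_TABLE) for line in chunk]

if __name__ == "__main__":
    print(translate_chunk(["Straße\n", "Grüße\n"]))
```

Because `translate_chunk` takes and returns a list of lines, it could be passed to `pool.imap_unordered` in place of `worker`. Adding a character then only means adding an entry to the table rather than another `replace` call.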
