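I have a large .txt file and I want to read it one line at a time (rather than loading it all into memory, to avoid running out of memory) and then extract every unique character that occurs in the file. The code below works on small files, but on large files (the kind I usually need to process) it is very slow: a 10 GB file takes about 1 hour. Can anyone suggest how to improve performance, for example by rearranging the operations, avoiding repeated work, or using faster functions?
Thanks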
def flatten(t):
    '''Flatten list of lists'''
    return [item for sublist in t for item in sublist]

input_file = r'C:\large_text_file.txt'
output_file = r'C:\char_set.txt'

# Parameters
case_sensitive = False
remove_crlf = True

# Extract all unique characters from file
charset = []
with open(input_file, 'r') as infile:
    for line in infile:
        if remove_crlf:
            charset.append(list(line.rstrip()))  # remove CRLF
        else:
            charset.append(list(line))
        charset = flatten(charset)  # flatten the list of lists
        if not(case_sensitive):
            charset = (map(lambda x: x.upper(), charset))  # convert to upper case
        charset = list(dict.fromkeys(charset))  # remove duplicates
        charset.sort(key=None, reverse=False)  # sort character set in ascending order
    infile.close()  # close the input file (redundant inside a with block)

# Output the character set
with open(output_file, 'w') as f:
    for char in charset:
        f.write(char)
Answer from a uj5u.com user:
You can simplify this considerably and make it linear:
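charset = set()  # use a real set!
with open(input_file, 'r') as infile:
    for line in infile:
        if remove_crlf:
            # strip only the line ending; a bare strip() would also drop
            # leading and trailing spaces and tabs
            line = line.rstrip('\r\n')
        if not case_sensitive:
            line = line.upper()
        charset.update(line)  # add each character of the current line
with open(output_file, 'w') as f:
    for char in sorted(charset):
        f.write(char)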
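If the file has very long lines (or no line breaks at all), reading line by line can still pull a lot of text into memory at once. As a minimal sketch of the same set-based idea, the version below reads fixed-size chunks instead. The helper name `unique_chars`, its parameters, and the default `chunk_size` of about one million characters are assumptions for illustration, not part of the original answer. Note that it drops `'\r'` and `'\n'` from the result at the end rather than stripping each line.

```python
def unique_chars(path, case_sensitive=False, remove_crlf=True,
                 chunk_size=1 << 20, encoding=None):
    """Return the sorted unique characters in the text file at `path`.

    Hypothetical helper: reads `chunk_size` characters at a time, so memory
    use stays bounded even if the file contains no line breaks.
    """
    charset = set()
    # newline='' disables newline translation, so '\r' is seen as written
    with open(path, 'r', encoding=encoding, newline='') as infile:
        while True:
            chunk = infile.read(chunk_size)
            if not chunk:           # empty string means end of file
                break
            if not case_sensitive:
                chunk = chunk.upper()
            charset.update(chunk)   # adds each character of the chunk
    if remove_crlf:
        charset -= {'\r', '\n'}     # drop line-break characters once, at the end
    return sorted(charset)
```

Usage would look like `''.join(unique_chars(input_file))`, written to `output_file` as before. Whether this is actually faster than the line-based loop depends on the file and the disk, so it is worth timing both on a sample.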
What makes the original quadratic is all of these lines:
charset = flatten(charset) #flatten the list of lists
charset = map(lambda x: x.upper(), charset)
charset = list(dict.fromkeys(charset))
You keep running these operations over the ever-growing list, rather than over just the current line.
Some documentation:
set.update
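As a small illustration of why `set.update` fits here: a `str` is an iterable of one-character strings, so passing a line to `update` adds each of its characters, and characters already in the set are ignored. The sample strings below are made up for the example.

```python
charset = set()
charset.update("hello")   # adds 'h', 'e', 'l', 'o'
charset.update("world")   # adds 'w', 'r', 'd'; 'o' and 'l' are already present
print(sorted(charset))    # ['d', 'e', 'h', 'l', 'o', 'r', 'w']
```

By contrast, `charset.add("hello")` would insert the whole string as a single element.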
