I have a 10 GB file with the following pattern:
Header,
header2,
header3,4
content
aaa, HO222222222222, AD, CE
bbb, HO222222222222, AS, AE
ccc, HO222222222222, AD, CE
ddd, HO222222222222, BD, CE
eee, HO222222222222, AD, CE
fff, HO222222222222, BD, CE
ggg, HO222222222222, AD, AE
hhh, HO222222222222, AD, CE
aaa, HO333333333333, AG, CE
bbb, HO333333333333, AT, AE
ccc, HO333333333333, AD, CT
ddd, HO333333333333, BD, CE
eee, HO333333333333, AD, CE
fff, HO333333333333, BD, CE
ggg, HO333333333333, AU, AE
hhh, HO333333333333, AD, CE
....
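The second column holds an ID. The whole file covers 4000 people, and each person has 50k records.
I can't analyze a file that large with the script I already have (10 GB; the script uses pandas and I don't have enough memory. I know I should refactor it, and I'm working on that), so I need to split the file into 4. But I can't split an ID across files: all of one person's records must end up in the same file.
So I wrote a script that splits the file into 4 based on the ID.
Here is the code: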
file1 = open('file.txt', 'r')
count = 0
list_of_ids= set()
while True:
    if len(list_of_ids) < 1050:
        a = "out1.csv"
    elif (len(list_of_ids)) >= 1049 and (len(list_of_ids)) < 2100:
        a = "out2.csv"
    elif (len(list_of_ids)) >= 2099 and (len(list_of_ids)) < 3200:
        a = "out3.csv"
    else:
        a = "out4.csv"
    line = file1.readline()
    if not line:
        break
    try:
        list_of_ids.add(line.split(',')[1])
        out = open(a, "a")
        out.write(line)
    except IndexError as e:
        print(e)
    count += 1
out.close()
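But it is too slow, and I need to speed it up. There are a lot of ifs, and I open the output file again for every line, but I don't know how to get better performance. Does anyone have any tips?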
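One way to avoid reopening an output file for every line is to open each output once and remember which file every ID was first assigned to. The following is only a minimal sketch under stated assumptions, not a tuned solution: the function name `split_by_id` and its parameters are hypothetical, record lines are assumed to have at least 4 comma-separated fields (as in the sample above), and shorter lines such as the headers are skipped.

```python
from contextlib import ExitStack


def split_by_id(src_path, out_prefix="out", n_files=4, ids_per_file=1000, min_fields=4):
    """Copy record lines into n_files outputs, keeping each ID in a single file."""
    bucket_of = {}  # ID -> index of the output file it was first assigned to
    with ExitStack() as stack:
        src = stack.enter_context(open(src_path, "r"))
        # Open every output exactly once, instead of once per input line.
        outs = [
            stack.enter_context(open(f"{out_prefix}{i + 1}.csv", "w"))
            for i in range(n_files)
        ]
        for line in src:
            fields = line.split(",")
            if len(fields) < min_fields:
                # Assumption: anything shorter than a full record is a header line.
                continue
            record_id = fields[1].strip()
            if record_id not in bucket_of:
                # New IDs fill the first file, then the second, and so on;
                # any overflow beyond n_files * ids_per_file goes to the last file.
                bucket_of[record_id] = min(len(bucket_of) // ids_per_file, n_files - 1)
            outs[bucket_of[record_id]].write(line)
    return bucket_of
```

Because the file is looked up per ID rather than from the running count of IDs, a person's records stay together even if they are not contiguous in the input. With 4000 IDs, `split_by_id('file.txt')` would place 1000 IDs in each of `out1.csv` to `out4.csv`.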
A uj5u.com user replied:
I think you want something more like this:
# this number is arbitrary, of course
ids_per_file = 1000
# use with, so the file always closes when you're done, or something happens
with open('20220317_EuroG_MD_v3_XT_POL_FinalReport.txt', 'r') as f:
    n = 0
    ids = set()
    # no output file is open until the first record is seen
    out_f = None
    try:
        # an easier way to loop over all the lines:
        for line in f:
            try:
                ids.add(line.split(',')[1])
            except IndexError:
                # you don't want to break, you just want to ignore the line and continue
                continue
            # when the number of ids exceeds the limit (or at the start), start a new file
            if not n or len(ids) > ids_per_file:
                # close the previous one, unless it's the first
                if n > 0:
                    out_f.close()
                # on to the next
                n += 1
                out_f = open(f'out{n}.csv', 'w')
                # reset ids, keeping the id that triggered the new file
                ids = {line.split(',')[1]}
            # write the line, if you get here, it's a record
            out_f.write(line)
    finally:
        # close the last file, if one was ever opened
        if out_f is not None:
            out_f.close()
Edit: there was in fact a bug that wrote the first line of a new identifier to the previous file; I think this version is better.
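Since the requirement is that no ID is split across files, it may be worth checking the output after a run. The following is a minimal sketch of such a check; the helper name `ids_in_multiple_files` is hypothetical, and it assumes, as in the sample data, that record lines have at least 4 comma-separated fields with the ID in the second one.

```python
from collections import defaultdict


def ids_in_multiple_files(paths, min_fields=4):
    """Return {ID: sorted list of files} for every ID found in more than one file."""
    files_of = defaultdict(set)  # ID -> set of files it appears in
    for path in paths:
        with open(path, "r") as f:
            for line in f:
                fields = line.split(",")
                if len(fields) < min_fields:
                    # Assumption: shorter lines are headers, not records.
                    continue
                files_of[fields[1].strip()].add(path)
    return {rid: sorted(found) for rid, found in files_of.items() if len(found) > 1}
```

For example, `ids_in_multiple_files(glob.glob('out*.csv'))` (after `import glob`) should return an empty dict if the split kept every person in one file. Only the set of IDs is held in memory, so with about 4000 IDs the check stays small even for a 10 GB input.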
