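I am trying to use mmap to read some lines from a csv file that is too large to fit in memory.
This is what my csv file looks like [I have put each row on its own line for readability]: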
'','InputText',101,102,103,104,105,106,107,108,109,110\n
0,'abcde efgh ijkl mnop',1,0,0,0,0,1,1,0,0,0\n
1,'qwerty uiop asdf',1,0,0,1,0,0,0,0,0,0\n
2,'zxcv',0,1,1,0,0,0,0,1,0,0\n
3,'qazxswedc vfrtgbnhy nhyummjikkig jhguopjfservcs fdtuugdsae dsawruoh',1,0,0,0,0,1,1,1,0,0\n
4,'plmnkoijb vhuygcxf tr r mhjease',1,0,0,0,0,0,0,0,0,1\n
This is what I have done so far:
# imports
import mmap
import os
# open the file buffer
fbuff = open("big_file.csv", mode="r", encoding="utf8")
# now read that file buffer to mmap
f1_mmap = mmap.mmap(fbuff.fileno(), length=os.path.getsize("big_file.csv"),
                    access=mmap.ACCESS_READ, offset=0)
After mapping the file with mmap.mmap(), this is how I tried to read a line, as described in the Python 3.7 documentation:
# according to python docs: https://docs.python.org/3.7/library/mmap.html#mmap.mmap.seek
# this mmap.mmap.seek need to be set to the byte position in the file
# and when I set it to 0th position(beginning of file) like below, readline() would print entire line till '\n'
f1_mmap.seek(0)
f1_mmap.readline()
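If I want to read line 102,457 of the file, I need to find the byte position where that line starts and pass it to mmap.mmap.seek(pos=<this-position>). How can I find the position of any given line in a text file?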
Answer:
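Here is how to build an index consisting of a list of the byte offsets at which each line of the file starts, and then how to use it to read arbitrary lines, as well as arbitrary rows, of the memory-mapped CSV file: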
import csv
from io import StringIO
import mmap
my_csv_dialect = dict(delimiter=',', quotechar="'")
filepath = 'big_file.csv'
# Build list of offsets where each line of file starts.
fbuff = open(filepath, mode='r', encoding='utf8')
f1_mmap = mmap.mmap(fbuff.fileno(), 0, access=mmap.ACCESS_READ)
print('Index:')
offsets = [0] # First line is always at offset 0.
for line_no, line in enumerate(iter(f1_mmap.readline, b'')):
    offsets.append(f1_mmap.tell())  # Append where *next* line would start.
    print(f'{line_no} ({offsets[line_no]:3d}) {line!r}')
print()
# Access arbitrary lines in the memory-mapped file.
print('Line access:')
for line_no in (3, 1, 5):
    f1_mmap.seek(offsets[line_no])
    line = f1_mmap.readline()
    print(f'{line_no}: {line!r}')
print()
# Access arbitrary rows of memory-mapped csv file.
print('CSV row access:')
for line_no in (3, 1, 5):
    f1_mmap.seek(offsets[line_no])
    line = f1_mmap.readline()
    b = StringIO(line.decode())
    r = csv.reader(b, **my_csv_dialect)
    values = next(r)
    print(f'{line_no}: {values}')
f1_mmap.close()
fbuff.close()
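As a variation on the loop above, the index could also be built by scanning for newline bytes with mmap.find() and stored in an array of unsigned 64-bit integers, which takes less memory than a Python list when the file has millions of lines. This is only a sketch: build_line_offsets and read_row are hypothetical helper names, and it assumes every line ends in b'\n' (which also covers '\r\n') and that no quoted field contains an embedded newline.

```python
import csv
import mmap
from array import array


def build_line_offsets(mm):
    """Return the byte offset at which each line of the mapped file starts."""
    offsets = array('Q', [0])      # unsigned 64-bit ints; line 0 starts at byte 0
    pos = mm.find(b'\n')
    while pos != -1:
        offsets.append(pos + 1)    # the next line starts right after the newline
        pos = mm.find(b'\n', pos + 1)
    if offsets[-1] >= len(mm):     # file ended with a newline: drop the EOF entry
        offsets.pop()
    return offsets


def read_row(mm, offsets, line_no, **fmtparams):
    """Parse the zero-based line `line_no` of the mapped file as one CSV row."""
    mm.seek(offsets[line_no])
    text = mm.readline().decode('utf-8')
    # csv.reader accepts any iterable of strings, so a one-item list is enough.
    return next(csv.reader([text], **fmtparams))
```

Because the index is zero-based, line 102,457 of the file (counting from 1) would be read with read_row(mm, offsets, 102456, delimiter=',', quotechar="'"), where mm is a mapping opened as in the code above.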
Printed output:
Index:
0 (  0) b"'','InputText',101,102,103,104,105,106,107,108,109,110\r\n"
1 ( 56) b"0,'abcde efgh ijkl mnop',1,0,0,0,0,1,1,0,0,0\r\n"
2 (102) b"1,'qwerty uiop asdf',1,0,0,1,0,0,0,0,0,0\r\n"
3 (144) b"2,'zxcv',0,1,1,0,0,0,0,1,0,0\r\n"
4 (174) b"3,'qazxswedc vfrtgbnhy nhyummjikkig jhguopjfservcs fdtuugdsae dsawruoh',1,0,0,0,0,1,1,1,0,0\r\n"
5 (267) b"4,'plmnkoijb vhuygcxf tr r mhjease',1,0,0,0,0,0,0,0,0,1\r\n"
Line access:
3: b"2,'zxcv',0,1,1,0,0,0,0,1,0,0\r\n"
1: b"0,'abcde efgh ijkl mnop',1,0,0,0,0,1,1,0,0,0\r\n"
5: b"4,'plmnkoijb vhuygcxf tr r mhjease',1,0,0,0,0,0,0,0,0,1\r\n"
CSV row access:
3: ['2', 'zxcv', '0', '1', '1', '0', '0', '0', '0', '1', '0', '0']
1: ['0', 'abcde efgh ijkl mnop', '1', '0', '0', '0', '0', '1', '1', '0', '0', '0']
5: ['4', 'plmnkoijb vhuygcxf tr r mhjease', '1', '0', '0', '0', '0', '0', '0', '0', '0', '1']
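Note that the offsets in the index are byte positions, so each one is the running total of the encoded lengths of the preceding lines, including the '\r\n' terminator this particular file uses. A small check of that, using the first three lines from the output above (the variable names are only illustrative):

```python
from itertools import accumulate

lines = [
    b"'','InputText',101,102,103,104,105,106,107,108,109,110\r\n",
    b"0,'abcde efgh ijkl mnop',1,0,0,0,0,1,1,0,0,0\r\n",
    b"1,'qwerty uiop asdf',1,0,0,1,0,0,0,0,0,0\r\n",
]

# Each line starts where the previous one ended, counted in bytes.
offsets = [0] + list(accumulate(len(line) for line in lines))
print(offsets)  # [0, 56, 102, 144]
```

A file saved with plain '\n' line endings, or one containing multi-byte UTF-8 characters, would therefore produce different offsets for the same visible text.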
