Possible scipy sparse array memory leak in Python

Edit 3: TL;DR My problem was that my matrix was not sparse enough, and I was also calculating the size of the sparse array incorrectly.

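I'm hoping someone can explain why this happens. I'm using Colab with 51 GB of RAM, and I need to load float32 data from an H5 file. I can load the test H5 file as a numpy array, with RAM usage at ~45 GB; I load it in batches (21 in total) and stack them. I then tried loading each batch into numpy, converting it to sparse, and hstacking the results. Memory explodes, and I get an OOM after about batch 12.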

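The code below simulates the problem; you can change the data size to test it on your own machine. I see a memory increase I can't explain at all, even though the variables look small when I check their sizes in memory. What is happening? What am I doing wrong?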

import os, psutil
import gc
gc.enable()
from scipy import sparse
import numpy as np
all_x = None
x = (1*(np.random.rand(97406, 2048)>0.39721115241072164)).astype('float32')
x2 = sparse.csr_matrix(x)
print('GB on Memory SPARSE ', x2.data.nbytes/ 10**9)
print('GB on Memory NUMPY ', x.nbytes/ 10**9)
print('sparse to dense mat ratio', x2.data.nbytes/ x.nbytes)
print('_____________________')
for k in range(8):
  if all_x is None:
    all_x = x2
  else:
    all_x = sparse.hstack([all_x, x2])
  print('GB on Memory ALL SPARSE ', all_x.data.nbytes/ 10**9)
  print('GB USED BEFORE GC', psutil.Process(os.getpid()).memory_info().rss/ 10**9) 
  gc.collect()
  print('GB USED AFTER GC', psutil.Process(os.getpid()).memory_info().rss/ 10**9) 
  print('_____________________')

GB on Memory SPARSE  0.481035332
GB on Memory NUMPY  0.797949952
sparse to dense mat ratio 0.6028389760464576
_____________________
GB on Memory ALL SPARSE  0.481035332
GB USED BEFORE GC 4.62065664
GB USED AFTER GC 4.6206976
_____________________
GB on Memory ALL SPARSE  0.962070664
GB USED BEFORE GC 8.473133056
GB USED AFTER GC 8.473133056
_____________________
GB on Memory ALL SPARSE  1.443105996
GB USED BEFORE GC 12.325183488
GB USED AFTER GC 12.325183488
_____________________
GB on Memory ALL SPARSE  1.924141328
GB USED BEFORE GC 17.140740096
GB USED AFTER GC 17.140740096
_____________________
GB on Memory ALL SPARSE  2.40517666
GB USED BEFORE GC 20.512710656
GB USED AFTER GC 20.512710656
_____________________
GB on Memory ALL SPARSE  2.886211992
GB USED BEFORE GC 22.920142848
GB USED AFTER GC 22.920142848
_____________________
GB on Memory ALL SPARSE  3.367247324
GB USED BEFORE GC 29.660889088
GB USED AFTER GC 29.660889088
_____________________
GB on Memory ALL SPARSE  3.848282656
GB USED BEFORE GC 33.99727104
GB USED AFTER GC 33.99727104
_____________________

Edit: I stacked a list of dense arrays with numpy hstack, and that works fine:

import os, psutil
import gc
gc.enable()
from scipy import sparse
import numpy as np
all_x = None
x = (1*(np.random.rand(97406, 2048)>0.39721115241072164)).astype('float32')
x2 = sparse.csr_matrix(x)
print('GB on Memory SPARSE ', x2.data.nbytes/ 10**9)
print('GB on Memory NUMPY ', x.nbytes/ 10**9)
print('sparse to dense mat ratio', x2.data.nbytes/ x.nbytes)
print('_____________________')

all_x = np.hstack([x]*21)

print('GB on Memory ALL SPARSE ', all_x.data.nbytes/ 10**9)
print('GB USED BEFORE GC', psutil.Process(os.getpid()).memory_info().rss/ 10**9) 
gc.collect()
print('GB USED AFTER GC', psutil.Process(os.getpid()).memory_info().rss/ 10**9) 
print('_____________________')

Output:

GB on Memory SPARSE  0.480956104
GB on Memory NUMPY  0.797949952
sparse to dense mat ratio 0.6027396866113227
_____________________
GB on Memory ALL SPARSE  16.756948992
GB USED BEFORE GC 38.169387008
GB USED AFTER GC 38.169411584
_____________________

But when I do the same with a sparse matrix I get an OOM. According to the byte counts, the sparse matrix should be smaller.

import os, psutil
import gc
gc.enable()
from scipy import sparse
import numpy as np
all_x = None
x = (1*(np.random.rand(97406, 2048)>0.39721115241072164)).astype('float32')
x2 = sparse.csr_matrix(x)
print('GB on Memory SPARSE ', x2.data.nbytes/ 10**9)
print('GB on Memory NUMPY ', x.nbytes/ 10**9)
print('sparse to dense mat ratio', x2.data.nbytes/ x.nbytes)
print('_____________________')

all_x = sparse.hstack([x2]*21)

print('GB on Memory ALL SPARSE ', all_x.data.nbytes/ 10**9)
print('GB USED BEFORE GC', psutil.Process(os.getpid()).memory_info().rss/ 10**9) 
gc.collect()
print('GB USED AFTER GC', psutil.Process(os.getpid()).memory_info().rss/ 10**9) 
print('_____________________')

Running the code above returns an OOM error.

Edit 2: It seems I was calculating the true size of the sparse matrix incorrectly. For a CSR matrix it can be calculated using

def bytes_in_sparse(a):
  # a CSR matrix stores three arrays: values, column indices, row pointers
  return a.data.nbytes + a.indptr.nbytes + a.indices.nbytes

The true comparison between the dense and sparse arrays is

GB on Memory SPARSE  0.962395268
GB on Memory NUMPY  0.797949952
sparse to dense mat ratio 1.2060847495357703
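
The ratio above follows from simple arithmetic: a CSR matrix keeps one value and one column index per stored element, plus one row pointer per row. A rough sketch of that estimate is below. It assumes float32 values and 4-byte indices (what scipy typically uses at these sizes), and `estimated_csr_nbytes` is a hypothetical helper introduced only for illustration.

```python
import numpy as np
from scipy import sparse

def csr_nbytes(a):
    """Bytes held by the three arrays backing a CSR matrix."""
    return a.data.nbytes + a.indices.nbytes + a.indptr.nbytes

def estimated_csr_nbytes(shape, density, value_bytes=4, index_bytes=4):
    """Estimate: (value + column index) per stored element,
    plus (rows + 1) row pointers. Index width is an assumption."""
    rows, cols = shape
    nnz = int(rows * cols * density)
    return nnz * (value_bytes + index_bytes) + (rows + 1) * index_bytes

shape = (974, 204)                      # small stand-in for the real data
dense_bytes = shape[0] * shape[1] * 4   # float32 dense array

for density in (0.1, 0.5, 0.6):
    ratio = estimated_csr_nbytes(shape, density) / dense_bytes
    print(f"density {density}: CSR/dense size ratio ~ {ratio:.3f}")

# Check the formula against an actual matrix built like the one in the question
rng = np.random.default_rng(0)
x = (rng.random(shape) > 0.4).astype("float32")
m = sparse.csr_matrix(x)
formula = (m.nnz * (m.data.itemsize + m.indices.itemsize)
           + (shape[0] + 1) * m.indptr.itemsize)
print(csr_nbytes(m), formula)
```

Under these assumptions the break-even point is a density of about 0.5: above that, CSR takes more memory than the dense float32 array, which matches the ~1.2 ratio at a density of ~0.6.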

Once I use sparse.hstack the two variables become different types of sparse matrices.

all_x, x2

outputs

(<97406x4096 sparse matrix of type '<class 'numpy.float32'>'
    with 240476696 stored elements in COOrdinate format>,
 <97406x2048 sparse matrix of type '<class 'numpy.float32'>'
    with 120238348 stored elements in Compressed Sparse Row format>)
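
Since the stacked result came back in COO format here, it may help to request the output format explicitly and compare footprints. The following is a small sketch under stated assumptions: it uses a reduced 974x204 matrix rather than the real data, and `coo_nbytes` / `csr_nbytes` are hypothetical helpers. `sparse.hstack` accepts a `format` argument; the format returned when it is omitted seems to vary between SciPy versions, so the sketch prints it rather than assuming it.

```python
import numpy as np
from scipy import sparse

def csr_nbytes(a):
    # CSR: values + column indices + row pointers
    return a.data.nbytes + a.indices.nbytes + a.indptr.nbytes

def coo_nbytes(a):
    # COO: values + one row index and one column index per stored element
    return a.data.nbytes + a.row.nbytes + a.col.nbytes

rng = np.random.default_rng(0)
x = (rng.random((974, 204)) > 0.4).astype("float32")
M = sparse.csr_matrix(x)

stacked_default = sparse.hstack([M, M])              # format left to scipy
stacked_csr = sparse.hstack([M, M], format="csr")    # CSR requested explicitly

print("default format:", stacked_default.format)
print("requested format:", stacked_csr.format)

as_coo = stacked_csr.tocoo()
print("CSR bytes:", csr_nbytes(stacked_csr))
print("COO bytes:", coo_nbytes(as_coo))
```

For the same stored elements, COO carries two index arrays of length nnz, so it is larger than CSR whenever nnz exceeds the number of rows. Any conversion between formats also allocates new arrays, which adds to peak memory while both copies are alive.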

Answer:

I'm using smaller dimensions so I don't hang my computer:

In [50]: x = (1 * (np.random.rand(974, 204) > 0.39721115241072164)).astype("float32")
In [51]: x.nbytes
Out[51]: 794784

The csr matrix and its approximate memory use:

In [52]: M = sparse.csr_matrix(x)
In [53]: M.data.nbytes + M.indices.nbytes + M.indptr.nbytes
Out[53]: 960308

hstack actually uses the coo format:

In [54]: Mo = M.tocoo()
In [55]: Mo.data.nbytes + Mo.row.nbytes + Mo.col.nbytes
Out[55]: 1434612

Joining 10 copies: nbytes increases by a factor of 10:

In [56]: xx = np.hstack([x]*10)
In [57]: xx.shape
Out[57]: (974, 2040)

The same with sparse:

In [58]: MM = sparse.hstack([M] * 10)
In [59]: MM.shape
Out[59]: (974, 2040)
In [60]: xx.nbytes
Out[60]: 7947840
In [61]: MM
Out[61]: 
<974x2040 sparse matrix of type '<class 'numpy.float32'>'
    with 1195510 stored elements in Compressed Sparse Row format>
In [62]: M
Out[62]: 
<974x204 sparse matrix of type '<class 'numpy.float32'>'
    with 119551 stored elements in Compressed Sparse Row format>
In [63]: MM.data.nbytes + MM.indices.nbytes + MM.indptr.nbytes
Out[63]: 9567980

The density of the sparse matrix:

In [65]: M.nnz / np.prod(M.shape)
Out[65]: 0.6016779401699078

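At this density there is no memory saving. If you want to save memory and computation time (especially for matrix multiplication), a density of 0.1 or smaller is a good working density. Timing the matrix product at the current density: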

In [66]: (x@x.T).shape
Out[66]: (974, 974)
In [67]: timeit(x@x.T).shape
10.1 ms ± 31.3 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [68]: (M@M.T).shape
Out[68]: (974, 974)
In [69]: timeit(M@M.T).shape
220 ms ± 91.8 μs per loop (mean ± std. dev. of 7 runs, 1 loop each)
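
For comparison, here is a minimal sketch of the same measurements at the suggested density of about 0.1. The threshold 0.9, the seed, and the repeat count are arbitrary choices for illustration, and no particular timings are claimed since they depend on the machine and library builds.

```python
import timeit
import numpy as np
from scipy import sparse

rng = np.random.default_rng(0)
# Keep roughly 10% of the entries as nonzero (threshold is an arbitrary choice)
x = (rng.random((974, 204)) > 0.9).astype("float32")
M = sparse.csr_matrix(x)

density = M.nnz / np.prod(M.shape)
sparse_bytes = M.data.nbytes + M.indices.nbytes + M.indptr.nbytes
print("density:", density)
print("sparse/dense size ratio:", sparse_bytes / x.nbytes)

# Both products should agree; the entries are small integers, exact in float32
dense_product = x @ x.T
sparse_product = (M @ M.T).toarray()
print("products match:", np.allclose(dense_product, sparse_product))

# Rough timings over 10 runs each
print("dense  seconds:", timeit.timeit(lambda: x @ x.T, number=10))
print("sparse seconds:", timeit.timeit(lambda: M @ M.T, number=10))
```

At this density the CSR arrays should take roughly a fifth of the dense array's memory, in contrast to the ~1.2 ratio seen at a density of 0.6.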
