Best way to convert an NxD time-series dataset into (N-T+1)xTxD?

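Unfortunately, I couldn't come up with a better title; I admit that my inability to explain this better may have hindered my search for an existing answer.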

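So, I have a time-series dataset with N1 rows and D columns. Recurrent neural networks require data in an N2xTxD format, so if the sequence length T is 2, the first element of the new N2xTxD dataset, ds2[0], will be the first 2 rows of the original dataset, ds[0:2, :]. The second element, ds2[1], will be ds[1:3, :], and so on, until ds2[N2-1] = ds[N1-2:N1, :], where N2 = N1 - T + 1.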
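To make the indexing concrete, here is a minimal sketch of the target layout built with a plain list comprehension. The names `naive_windows`, `ds` and `T` are illustrative only; the function copies every window, so it is meant as a correctness reference rather than a fast implementation.

```python
import numpy as np

def naive_windows(ds, T):
    """Reference implementation: stack every length-T window of the rows of ds."""
    n2 = ds.shape[0] - T + 1  # number of windows, N2 = N1 - T + 1
    return np.stack([ds[i:i + T] for i in range(n2)])

ds = np.arange(15, dtype=float).reshape(5, 3)  # N1 = 5, D = 3
ds2 = naive_windows(ds, T=2)
print(ds2.shape)  # (4, 2, 3)
```

Any faster variant can be checked against this reference with `np.array_equal`.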

My current approach uses these functions:

import numpy as np

#Shift Array arr's elements by num positions
def NpShift(arr, num, fill_value = np.nan):
    result = np.empty_like(arr)
    result[:num] = fill_value
    result[num:] = arr[:-num]
    return result


def TemporalTransformation(ds, T):
    tmp = ds
    ds = ds.reshape(-1, 1, ds.shape[1]) #By definition ds is NxD, so Nx1xD is -1x1xshape[1]
    
    for t in range(T):
        ds = np.concatenate((NpShift(tmp, t + 1)[:, np.newaxis, :], ds), axis = 1) #Adding the shifted matrices one by one
    ds = ds[T-1:, 1:, :] #The 1st T-1 elements contain the shifted values so they have to be discarded; same goes for the 1st element on axis=1
    
    return ds

You can test it with the following to check that the result is correct:

t = 2
xall = np.array([[1,1,1], [2,2,2], [3,3,3], [4,4,4], [5,5,5]], dtype = float)
print(f"ds shape:\n{xall.shape}")
print(f"ds:\n{xall}\n")
ds2 = TemporalTransformation(xall, t)
print("ds2 shape:\n", ds2.shape)
print(f"ds2:\n{ds2}")

Output:

ds shape:
(5, 3)
ds:
[[1. 1. 1.]
 [2. 2. 2.]
 [3. 3. 3.]
 [4. 4. 4.]
 [5. 5. 5.]]

ds2 shape:
 (4, 2, 3)
ds2:
[[[1. 1. 1.]
  [2. 2. 2.]]

 [[2. 2. 2.]
  [3. 3. 3.]]

 [[3. 3. 3.]
  [4. 4. 4.]]

 [[4. 4. 4.]
  [5. 5. 5.]]]

Now, that works perfectly and accomplishes what I want. However, for a large T (e.g. 700) on big datasets (hundreds of thousands of rows), the conversion takes a terrifying amount of time (30 minutes or so).

I can observe how this (currently single-threaded) piece of code slowly and steadily allocates more and more RAM as it creates the final (N-T+1)xTxD tensor (a 3-dimensional array).
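A rough back-of-the-envelope estimate shows why the allocation grows so much. The sizes below (300,000 rows, 10 features, float64) are assumed purely for illustration and are not taken from the question; `materialized_bytes` is a hypothetical helper name.

```python
def materialized_bytes(n_rows, T, n_features, itemsize=8):
    """Bytes needed to store every window explicitly as an (N-T+1) x T x D array."""
    return (n_rows - T + 1) * T * n_features * itemsize

# Hypothetical sizes: 300,000 rows, T = 700, 10 features, float64 (8 bytes each).
full = materialized_bytes(300_000, 700, 10)
original = 300_000 * 10 * 8
print(f"windowed copy: {full / 1e9:.1f} GB, original: {original / 1e6:.1f} MB")
# windowed copy: 16.8 GB, original: 24.0 MB
```

On top of the final size, each np.concatenate call in the loop copies the whole intermediate array again, so the total amount of data copied grows roughly quadratically in T.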

Is there a way to do it quicker, and possibly without allocating such a huge amount of memory? I mean, at its core, the values of ds2 are the same as ds1, so I would think a way to do it with pointers should exist (I just can't think of how).

Any possible solution should preferably work on both Windows and Linux. One last noteworthy thing is that, eventually, this N2xTxD numpy array will be read in batches (one iteration takes the first b rows, the next takes the following b rows, and so on), and each batch will become a PyTorch tensor.

Now, I am familiar with torch.utils.data.Dataset, and I have tried inheriting from it to make my own iterator:

import numpy as np
import torch
from torch.utils.data import Dataset
class TemporalTransformation_Dataset(Dataset):
    def __init__(self, data, T):
        self.data = data
        self.T = T

    def __getitem__(self, index):
        Xi = self.data[index : index + self.T]
        return Xi

    def __len__(self):
        return self.data.shape[0] - self.T + 1

t = 2
ds = torch.from_numpy(np.array([[1,1,1], [2,2,2], [3,3,3], [4,4,4], [5,5,5]]))
print(f"ds shape:\n{ds.shape}")
print(f"ds:\n{ds}\n")
ds2 = TemporalTransformation_Dataset(ds, t)
ds2_loader = torch.utils.data.DataLoader(dataset = ds2, batch_size = len(ds2), shuffle = False)
print("W/o Y:\n", next(iter(ds2_loader)))

However, training gets significantly slower compared to my numpy implementation. We're talking double the time or so, so it's no fun. That being said, a PyTorch solution that is comparably fast to my numpy one is also something I could use; I just don't see how to make it faster. It seems like this is a PyTorch issue.

Answer:

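"[...] at its core, the values of ds2 are the same as ds1, so I would think a way to do it with pointers should exist." Your intuition is correct. Here is one way to do it using NumPy's as_strided function. It creates a new view of the array without copying the underlying data: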

import numpy as np
from numpy.lib.stride_tricks import as_strided

def transformed_view(ds, T):
    ds = np.asarray(ds)
    if ds.ndim != 2:
        raise ValueError('ds must be a 2-d array.')
    shp = ds.shape
    if T < 1 or T > shp[0]:
        raise ValueError('Must have 1 <= T <= ds.shape[0]')

    strides = ds.strides
    return as_strided(ds, shape=(shp[0] - T + 1, T, shp[1]),
                      strides=(strides[0], strides[0], strides[1]))

For example:

In [49]: xall = np.array([[1,1,1], [2,2,2], [3,3,3], [4,4,4], [5,5,5]], dtype=float)

In [50]: xall
Out[50]: 
array([[1., 1., 1.],
       [2., 2., 2.],
       [3., 3., 3.],
       [4., 4., 4.],
       [5., 5., 5.]])

In [51]: transformed_view(xall, 2)
Out[51]: 
array([[[1., 1., 1.],
        [2., 2., 2.]],

       [[2., 2., 2.],
        [3., 3., 3.]],

       [[3., 3., 3.],
        [4., 4., 4.]],

       [[4., 4., 4.],
        [5., 5., 5.]]])
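
As a possible alternative, NumPy 1.20 and later provide sliding_window_view in the same stride_tricks module, which builds a read-only strided view with bounds checking. The sketch below assumes that NumPy version; `windowed_view` and `iter_batches` are hypothetical helper names, and the batch size is arbitrary. Each batch is copied only when it is requested, so at most one batch of windows is materialized at a time.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def windowed_view(ds, T):
    """Return a read-only (N-T+1, T, D) view of a 2-D array without copying."""
    ds = np.asarray(ds)
    # sliding_window_view puts the window axis last: shape (N-T+1, D, T).
    windows = sliding_window_view(ds, window_shape=T, axis=0)
    # Swap the last two axes to get (N-T+1, T, D); this is still a view.
    return windows.transpose(0, 2, 1)

def iter_batches(view, batch_size):
    """Yield independent, C-contiguous copies of consecutive batches of windows."""
    for start in range(0, view.shape[0], batch_size):
        # Only this batch is copied; it could then be passed to torch.from_numpy.
        yield view[start:start + batch_size].copy()

xall = np.array([[1, 1, 1], [2, 2, 2], [3, 3, 3], [4, 4, 4], [5, 5, 5]], dtype=float)
view = windowed_view(xall, 2)
print(view.shape)                     # (4, 2, 3)
print(np.shares_memory(view, xall))   # True
batches = list(iter_batches(view, batch_size=3))
print([b.shape for b in batches])     # [(3, 2, 3), (1, 2, 3)]
```

Because the windows overlap in memory, writing through a strided view changes several windows at once; keeping the view read-only and copying per batch, as sketched here, avoids accidental writes.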


Tags: python arrays numpy machine-learning
