Maximum values in a y numpy array corresponding to the unique values in an x array

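I am working with data that has two sets of numbers, one for the x values and another for the corresponding y coordinates. I need to decimate the data to reduce the number of x values by eliminating duplicates while returning the corresponding y values. I need to do this quickly on very large arrays, so the code must be efficient.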

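The numpy function 'unique' will eliminate duplicates in the x array. However, for each remaining x value, I need to return the maximum of all the y values that correspond to that x value. So, if these are two such example arrays: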

x = [   0,  16,  24,  28,  30,  31,  32,  32,  33,  33,   33,  33]
y = [1050, 110, 104, 107, 820, 101, 102, 649, 103, 101, 1020, 100]

what I need to end up with is:

x = [   0,  16,  24,  28,  30,  31,  32,   33]
y = [1050, 110, 104, 107, 820, 101, 649, 1020]

Thanks for any help.

Answer from a uj5u.com user:

After sorting, take the first index of each unique value and pass those indices as the argument to np.maximum.reduceat:

>>> import numpy as np
>>> x = np.asarray(x)
>>> y = np.asarray(y)
>>> perm = x.argsort()
>>> sort = x[perm]
>>> mask = np.concatenate([[True], sort[1:] != sort[:-1]])
>>> sort[mask]
array([ 0, 16, 24, 28, 30, 31, 32, 33])
>>> np.maximum.reduceat(y[perm], mask.nonzero()[0])
array([1050,  110,  104,  107,  820,  101,  649, 1020])
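To see why this works, it may help to spell out the intermediate steps. The following is a minimal sketch of the same technique wrapped in a function; the name `max_per_unique` is hypothetical, the stable sort is an added choice rather than something the answer requires, and the input is assumed to be non-empty.

```python
import numpy as np

def max_per_unique(x, y):
    """Return (sorted unique x, max y for each unique x). Assumes non-empty input."""
    x = np.asarray(x)
    y = np.asarray(y)
    perm = x.argsort(kind="stable")   # indices that sort x
    xs = x[perm]
    # True wherever a new run of equal x values begins
    starts = np.concatenate([[True], xs[1:] != xs[:-1]])
    idx = np.flatnonzero(starts)      # first index of each run
    # reduceat takes the max over y[perm][idx[i]:idx[i+1]] for each i
    # (the last slice runs to the end of the array)
    return xs[starts], np.maximum.reduceat(y[perm], idx)

x = [0, 16, 24, 28, 30, 31, 32, 32, 33, 33, 33, 33]
y = [1050, 110, 104, 107, 820, 101, 102, 649, 103, 101, 1020, 100]
ux, uy = max_per_unique(x, y)
print(ux.tolist())
print(uy.tolist())
```

Because equal x values are contiguous after sorting, each slice handed to `reduceat` covers exactly one group.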

A simple benchmark on large arrays of size 10 ** 6:

In [251]: def mechanic(x, y):
     ...:     x = np.asarray(x)
     ...:     y = np.asarray(y)
     ...:     perm = x.argsort()
     ...:     sort = x[perm]
     ...:     mask = np.concatenate([[True], sort[1:] != sort[:-1]])
     ...:     return sort[mask], np.maximum.reduceat(y[perm], mask.nonzero()[0])
     ...:

In [252]: def claudio(x, y):
     ...:     xout = []
     ...:     yout = []
     ...:     for g, v in groupby(sorted(zip(x, y)), lambda x: x[0]):
     ...:         xout += [g]
     ...:         yout += [max(v)[1]]
     ...:     return xout, yout
     ...:

In [253]: def joran_beasley(x, y):
     ...:     df = pd.DataFrame({'x': x, 'y': y})
     ...:     return (*df.groupby('x').agg({'x': 'first', 'y': 'max'}).values.T,)
     ...:

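In [254]: import numpy as np
     ...: import pandas as pd
     ...: from collections import defaultdict
     ...: from itertools import groupby
     ...: from operator import itemgetter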

In [255]: x, y = np.random.randint(0, 100, (2, 10 ** 6))

In [256]: %timeit mechanic(x, y)
65.6 ms ± 2.32 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [257]: %timeit claudio(x, y)
2.6 s ± 56.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [258]: %timeit joran_beasley(x, y)
36.3 ms ± 755 μs per loop (mean ± std. dev. of 7 runs, 10 loops each)

Starting from lists:

In [275]: x, y = np.random.randint(0, 100, (2, 10 ** 6)).tolist()

In [276]: %timeit joran_beasley(x, y)
404 ms ± 6.67 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [277]: %timeit mechanic(x, y)
193 ms ± 2.57 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [278]: %timeit claudio(x, y)
1.02 s ± 20.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

A small optimization of @Claudio's solution:

In [283]: def claudio(x, y):
     ...:     xout = []
     ...:     yout = []
     ...:     firstgetter = itemgetter(0)
     ...:     secondgetter = itemgetter(1)
     ...:     for g, v in groupby(sorted(zip(x, y), key=firstgetter), firstgetter):
     ...:         xout.append(g)
     ...:         yout.append(max(map(secondgetter, v)))
     ...:     return xout, yout
     ...:

In [284]: %timeit claudio(x, y)
495 ms ± 13.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

When lists are used as input, the solution using defaultdict wins:

In [291]: def defdict_solution(x, y):
     ...:     defdict = defaultdict(list)
     ...:     for k, v in zip(x, y):
     ...:         defdict[k].append(v)
     ...:     lst = [(k, max(v)) for k, v in defdict.items()]
     ...:     lst.sort(key=itemgetter(0))
     ...:     return [k for k, v in lst], [v for k, v in lst]
     ...:

In [292]: %timeit defdict_solution(x, y)
73.9 ms ± 723 μs per loop (mean ± std. dev. of 7 runs, 10 loops each)
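A possible variation on the same idea keeps only a running maximum per key in a plain dict instead of collecting every value in a list. This is a sketch only: the name `running_max_solution` is hypothetical, and it was not part of the benchmark above, so no timing is claimed for it.

```python
def running_max_solution(x, y):
    """Return (sorted unique x, max y per x) using a plain dict."""
    best = {}
    for k, v in zip(x, y):
        # Store v if k is new or v beats the current maximum for k
        if k not in best or v > best[k]:
            best[k] = v
    keys = sorted(best)
    return keys, [best[k] for k in keys]

x = [0, 16, 24, 28, 30, 31, 32, 32, 33, 33, 33, 33]
y = [1050, 110, 104, 107, 820, 101, 102, 649, 103, 101, 1020, 100]
kx, ky = running_max_solution(x, y)
print(kx)
print(ky)
```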

Answer from a uj5u.com user:

Turn it into a DataFrame, then groupby and aggregate :)

import pandas

x = [   0,  16,  24,  28,  30,  31,  32,  32,  33,  33,   33,  33]
y = [1050, 110, 104, 107, 820, 101, 102, 649, 103, 101, 1020, 100]

df = pandas.DataFrame({'x':x,'y':y})
print(df.groupby('x').agg({'x':'first','y':'max'}))
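If the result is needed as two arrays rather than a printed frame, one way to get there is to select the y column before aggregating and read the unique x values off the index. This is a sketch under the assumption that plain numpy arrays are the desired output; the variable names `max_y`, `x_out` and `y_out` are illustrative.

```python
import pandas as pd

x = [0, 16, 24, 28, 30, 31, 32, 32, 33, 33, 33, 33]
y = [1050, 110, 104, 107, 820, 101, 102, 649, 103, 101, 1020, 100]

df = pd.DataFrame({"x": x, "y": y})
# groupby sorts the group keys by default, so the resulting index
# holds the unique x values in ascending order
max_y = df.groupby("x")["y"].max()
x_out = max_y.index.to_numpy()
y_out = max_y.to_numpy()
print(x_out.tolist())
print(y_out.tolist())
```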

Answer from a uj5u.com user:

An approach that works directly with lists, since lists are what the question gives as input and output data:

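from itertools import groupby

# xin, yin: the input lists x and y from the question
xout = []
yout = []
# Sort the (x, y) pairs, then group consecutive pairs that share the same x
for g, v in groupby(sorted(zip(xin, yin)), lambda pair: pair[0]):
    xout += [g]
    yout += [max(v)[1]]  # the largest (x, y) pair in a group carries the largest y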


Tags: Python, arrays, numpy, unique
