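I am working with data made up of two sets of numbers, one for the x values and one for the corresponding y coordinates. I need to decimate the data: reduce the number of x values by eliminating duplicates while still returning a corresponding y value. I need to do this quickly on very large arrays, so the code must be efficient.
The numpy function 'unique' will eliminate the duplicates in the x array. However, for each remaining x value, I need to return the maximum of all the y values that correspond to it. So, if these are two such sample arrays: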
x = [ 0, 16, 24, 28, 30, 31, 32, 32, 33, 33, 33, 33]
y = [1050, 110, 104, 107, 820, 101, 102, 649, 103, 101, 1020, 100]
what I ultimately need is:
x = [ 0, 16, 24, 28, 30, 31, 32, 33]
y = [1050, 110, 104, 107, 820, 101, 649, 1020]
Thanks for any help.
Reply from a uj5u.com user:
After sorting, take the first index of each unique value and pass those indices to np.maximum.reduceat:
>>> import numpy as np
>>> x = np.asarray(x)
>>> y = np.asarray(y)
>>> perm = x.argsort()
>>> sort = x[perm]
>>> mask = np.concatenate([[True], sort[1:] != sort[:-1]])
>>> sort[mask]
array([ 0, 16, 24, 28, 30, 31, 32, 33])
>>> np.maximum.reduceat(y[perm], mask.nonzero()[0])
array([1050, 110, 104, 107, 820, 101, 649, 1020])
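For anyone who wants the steps above as a reusable function, here is a minimal sketch. The name max_y_per_x is my own, not from the thread, and it assumes non-empty one-dimensional inputs of equal length. np.maximum.reduceat(ys, starts) takes the maximum of each slice ys[starts[i]:starts[i+1]], which is why the start index of every run of equal x values is all it needs.

```python
import numpy as np

def max_y_per_x(x, y):
    """Return (sorted unique x, maximum y for each unique x).

    Hypothetical helper; assumes non-empty 1-D inputs of equal length.
    """
    x = np.asarray(x)
    y = np.asarray(y)
    order = np.argsort(x, kind="stable")  # permutation that sorts x
    xs = x[order]
    ys = y[order]
    # True wherever a new run of equal x values begins
    is_start = np.concatenate(([True], xs[1:] != xs[:-1]))
    starts = np.flatnonzero(is_start)
    # maximum of ys over each run ys[starts[i]:starts[i+1]]
    return xs[starts], np.maximum.reduceat(ys, starts)

x = [0, 16, 24, 28, 30, 31, 32, 32, 33, 33, 33, 33]
y = [1050, 110, 104, 107, 820, 101, 102, 649, 103, 101, 1020, 100]
ux, uy = max_y_per_x(x, y)
print(ux.tolist())
print(uy.tolist())
```

The input does not have to be sorted beforehand, since the argsort step handles that.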
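A simple benchmark on large arrays of size 10 ** 6 (the session assumes that numpy as np, itertools.groupby, operator.itemgetter and collections.defaultdict have already been imported):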
In [251]: def mechanic(x, y):
     ...:     x = np.asarray(x)
     ...:     y = np.asarray(y)
     ...:     perm = x.argsort()
     ...:     sort = x[perm]
     ...:     mask = np.concatenate([[True], sort[1:] != sort[:-1]])
     ...:     return sort[mask], np.maximum.reduceat(y[perm], mask.nonzero()[0])
     ...:
In [252]: def claudio(x, y):
     ...:     xout = []
     ...:     yout = []
     ...:     for g, v in groupby(sorted(zip(x, y)), lambda x: x[0]):
     ...:         xout.append(g)
     ...:         yout.append(max(v)[1])
     ...:     return xout, yout
     ...:
In [253]: def joran_beasley(x, y):
     ...:     df = pd.DataFrame({'x': x, 'y': y})
     ...:     return (*df.groupby('x').agg({'x': 'first', 'y': 'max'}).values.T,)
     ...:
In [254]: import pandas as pd
In [255]: x, y = np.random.randint(0, 100, (2, 10 ** 6))
In [256]: %timeit mechanic(x, y)
65.6 ms ± 2.32 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [257]: %timeit claudio(x, y)
2.6 s ± 56.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [258]: %timeit joran_beasley(x, y)
36.3 ms ± 755 μs per loop (mean ± std. dev. of 7 runs, 10 loops each)
Starting from lists:
In [275]: x, y = np.random.randint(0, 100, (2, 10 ** 6)).tolist()
In [276]: %timeit joran_beasley(x, y)
404 ms ± 6.67 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [277]: %timeit mechanic(x, y)
193 ms ± 2.57 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [278]: %timeit claudio(x, y)
1.02 s ± 20.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
A small optimization of @Claudio's solution:
In [283]: def claudio(x, y):
     ...:     xout = []
     ...:     yout = []
     ...:     firstgetter = itemgetter(0)
     ...:     secondgetter = itemgetter(1)
     ...:     for g, v in groupby(sorted(zip(x, y), key=firstgetter), firstgetter):
     ...:         xout.append(g)
     ...:         yout.append(max(map(secondgetter, v)))
     ...:     return xout, yout
     ...:
In [284]: %timeit claudio(x, y)
495 ms ± 13.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
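The %timeit magic only exists inside IPython. If you want to reproduce this kind of measurement from a plain script, something like the following should work with the standard timeit module. This is a rough sketch: groupby_max is my own name for the groupby approach, the input size is reduced to 10 ** 5 to keep the run short, and the numbers you get will not match the ones quoted in this thread.

```python
import random
import timeit
from itertools import groupby
from operator import itemgetter

def groupby_max(x, y):
    """Sort (x, y) pairs by x, then take the max y of each run of equal x."""
    first = itemgetter(0)
    second = itemgetter(1)
    xout, yout = [], []
    for key, pairs in groupby(sorted(zip(x, y), key=first), first):
        xout.append(key)
        yout.append(max(map(second, pairs)))
    return xout, yout

random.seed(0)
n = 10 ** 5  # assumption: smaller than the thread's 10 ** 6, to keep this quick
x = [random.randrange(100) for _ in range(n)]
y = [random.randrange(100) for _ in range(n)]

calls = 3
runs = timeit.repeat(lambda: groupby_max(x, y), number=calls, repeat=5)
best = min(runs) / calls  # seconds per call in the fastest repeat
print(f"best of 5: {best * 1e3:.1f} ms per call")
```

Taking the minimum over the repeats is the usual advice for timeit, since the slower runs mostly reflect interference from other processes.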
With lists as input, the solution using defaultdict wins:
In [291]: def defdict_solution(x, y):
     ...:     defdict = defaultdict(list)
     ...:     for k, v in zip(x, y):
     ...:         defdict[k].append(v)
     ...:     lst = [(k, max(v)) for k, v in defdict.items()]
     ...:     lst.sort(key=itemgetter(0))
     ...:     return [k for k, v in lst], [v for k, v in lst]
     ...:
In [292]: %timeit defdict_solution(x, y)
73.9 ms ± 723 μs per loop (mean ± std. dev. of 7 runs, 10 loops each)
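A possible variation on the dictionary idea, which I have not benchmarked: instead of collecting every y value in a list and calling max at the end, keep only the running maximum for each x in a plain dict. The function name dict_running_max is hypothetical, and whether this is any faster than the defaultdict version will depend on your data, so measure before relying on it.

```python
def dict_running_max(x, y):
    """Return (sorted unique x, max y per x) for plain Python sequences."""
    best = {}
    for k, v in zip(x, y):
        # store v if k is new or v beats the maximum seen so far for k
        if k not in best or v > best[k]:
            best[k] = v
    keys = sorted(best)
    return keys, [best[k] for k in keys]

x = [0, 16, 24, 28, 30, 31, 32, 32, 33, 33, 33, 33]
y = [1050, 110, 104, 107, 820, 101, 102, 649, 103, 101, 1020, 100]
xout, yout = dict_running_max(x, y)
print(xout)
print(yout)
```

This keeps memory proportional to the number of distinct x values rather than to the length of the input.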
Reply from a uj5u.com user:
Turn it into a DataFrame, then group by x and aggregate :)
import pandas
x = [ 0, 16, 24, 28, 30, 31, 32, 32, 33, 33, 33, 33]
y = [1050, 110, 104, 107, 820, 101, 102, 649, 103, 101, 1020, 100]
df = pandas.DataFrame({'x':x,'y':y})
print(df.groupby('x').agg({'x':'first','y':'max'}))
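If you need the result back as two arrays rather than a printed frame, one way (a sketch, assuming pandas is installed) is to select the y column after grouping and take its max. The variable names ymax, x_out and y_out are just illustrative. groupby sorts the group keys by default, so the index of the result holds the unique x values in ascending order.

```python
import pandas as pd

x = [0, 16, 24, 28, 30, 31, 32, 32, 33, 33, 33, 33]
y = [1050, 110, 104, 107, 820, 101, 102, 649, 103, 101, 1020, 100]

df = pd.DataFrame({"x": x, "y": y})
# Series indexed by the unique x values, holding the max y of each group
ymax = df.groupby("x")["y"].max()
x_out = ymax.index.to_numpy()
y_out = ymax.to_numpy()
print(x_out.tolist())
print(y_out.tolist())
```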
Reply from a uj5u.com user:
An approach that works on plain lists, matching the input and output data as given in the question:
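from itertools import groupby
xin = [ 0, 16, 24, 28, 30, 31, 32, 32, 33, 33, 33, 33]
yin = [1050, 110, 104, 107, 820, 101, 102, 649, 103, 101, 1020, 100]
xout = []
yout = []
# sorted() orders the (x, y) pairs by x, so groupby sees each x as one run
for g, v in groupby(sorted(zip(xin, yin)), lambda pair: pair[0]):
    xout.append(g)
    # within a run all x are equal, so max() of the pairs picks the largest y
    yout.append(max(v)[1])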
