Conditional count of values greater than the current row

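This may be a bit of a beginner question, but my mind is really stuck.

I have a data frame with some values in a column named x, split into two groups.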

   x     group
1  1.7   a
2  0     b
3  2.3   b
4  2.7   b
5  8.6   a
6  5.4   b
7  4.2   a
8  5.7   b

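My goal is to count, for each row, how many rows in the other group have a value greater than the current row's value. To make it clearer: for the first row (group a), I am looking for how many rows of group b are greater than 1.7 (the answer is 4). The final result should look like this: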

   x     group   result
1  1.7   a       4
2  0     b       3
3  2.3   b       2
4  2.7   b       2
5  8.6   a       0
6  5.4   b       1
7  4.2   a       2
8  5.7   b       1

My data frame has quite a few rows, so ideally I would also like a relatively fast solution.
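
For anyone who wants to reproduce this, here is a minimal sketch that builds the frame above and computes the expected column with a naive double scan. The helper name `count_greater_in_other_group` is only illustrative, and the approach is O(n^2), so it is meant as a reference for checking faster solutions on small inputs, not as the solution itself.

```python
import pandas as pd

# Example frame from the question; the index starts at 1 to match the tables shown.
df = pd.DataFrame(
    {"x": [1.7, 0, 2.3, 2.7, 8.6, 5.4, 4.2, 5.7],
     "group": ["a", "b", "b", "b", "a", "b", "a", "b"]},
    index=range(1, 9),
)

def count_greater_in_other_group(frame):
    """Brute-force reference: for each row, scan all rows of the other group."""
    counts = []
    for x, g in zip(frame["x"], frame["group"]):
        other = frame.loc[frame["group"] != g, "x"]
        counts.append(int((other > x).sum()))
    return pd.Series(counts, index=frame.index, name="result")

expected = count_greater_in_other_group(df)
print(expected.tolist())  # [4, 3, 2, 2, 0, 1, 2, 1]
```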

uj5u.com user replied:

Use np.searchsorted:

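import numpy as np

df['result'] = 0

a = df.loc[df['group'] == 'a', 'x']
b = df.loc[df['group'] == 'b', 'x']

# side='right' counts strictly greater values, so equal values are not included
df.loc[a.index, 'result'] = len(b) - np.searchsorted(np.sort(b), a, side='right')
df.loc[b.index, 'result'] = len(a) - np.searchsorted(np.sort(a), b, side='right')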

Output:

>>> df
     x group  result
1  1.7     a       4
2  0.0     b       3
3  2.3     b       2
4  2.7     b       2
5  8.6     a       0
6  5.4     b       1
7  4.2     a       2
8  5.7     b       1
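
One detail worth checking when the data contains repeated values is the `side` argument of np.searchsorted. The sketch below uses the group b values from the question plus a query of 2.7 that ties with one of them, to show that `side='left'` yields a "greater than or equal" count while `side='right'` yields a strict "greater than" count. The variable names are only illustrative.

```python
import numpy as np

b_sorted = np.sort(np.array([0.0, 2.3, 2.7, 5.4, 5.7]))
queries = np.array([1.7, 2.7, 8.6])

# side='left': insertion index is before any equal values,
# so len - index counts values >= query.
ge = len(b_sorted) - np.searchsorted(b_sorted, queries, side="left")

# side='right': insertion index is after any equal values,
# so len - index counts values > query.
gt = len(b_sorted) - np.searchsorted(b_sorted, queries, side="right")

print(ge)  # [4 3 0]
print(gt)  # [4 2 0]
```

The two only differ for the tied query (2.7), which is why the example data in the question gives the same answer either way.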

Performance on 130K records:

>>> %%timeit
    a = df.loc[df['group'] == 'a', 'x']
    b = df.loc[df['group'] == 'b', 'x']
    len(b) - np.searchsorted(np.sort(b), a)
    len(a) - np.searchsorted(np.sort(a), b)

31.8 ms ± 319 μs per loop (mean ± std. dev. of 7 runs, 10 loops each)

Setup:

N = 130000
df = pd.DataFrame({'x': np.random.randint(1, 1000, N),
                   'group': np.random.choice(['a', 'b'], N, p=(0.7, 0.3))})

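uj5u.com user replied:

Here is one way. It is based on ranking the values of x in descending order within each group, and then using merge_asof to merge df with itself, after swapping the group names so that rows of a are merged with the ranked values of b and vice versa.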

# needed for the merge_asof
df = df.sort_values('x')

res = (
    pd.merge_asof(
        df.reset_index(), # to keep original index order
        df.assign(
            # swap the labels so each row is matched against the other group
            group = df['group'].map({'a':'b', 'b':'a'}),
            # descending rank within the group = number of values >= this one
            # (method='max' keeps that true when values are tied)
            result = df.groupby('group')['x'].rank(ascending=False, method='max')),
        by='group', # match on the (swapped) group label
        on='x', direction='forward', # next larger x in the other group gives the rank
        allow_exact_matches=False) # strictly greater: skip equal x values
      # complete the result column
      .fillna({'result':0})
      .astype({'result':int})
      # for cosmetic
      .set_index('index')
      .rename_axis(None)
      .sort_index()
)
print(res)
#      x group  result
# 1  1.7     a       4
# 2  0.0     b       3
# 3  2.3     b       2
# 4  2.7     b       2
# 5  8.6     a       0
# 6  5.4     b       1
# 7  4.2     a       2
# 8  5.7     b       1
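
To see why the forward merge yields the count, here is a reduced sketch on the question's values with the group handling stripped out: `left` holds the group a values and `right` the group b values (both names are just for illustration). Each left row picks up the descending rank of the next value at or above it in `right`, which is the number of `right` values from that point upward.

```python
import pandas as pd

left = pd.DataFrame({"x": [1.7, 4.2, 8.6]})             # group a, sorted on x
right = pd.DataFrame({"x": [0.0, 2.3, 2.7, 5.4, 5.7]})  # group b, sorted on x

# Descending rank: how many values in `right` are >= this one.
right["result"] = right["x"].rank(ascending=False, method="max")

# direction='forward' matches each left x with the nearest right x that is
# not smaller; rows with no such match get NaN.
out = pd.merge_asof(left, right, on="x", direction="forward")
out["result"] = out["result"].fillna(0).astype(int)
print(out)
```

The row for 8.6 has nothing above it in `right`, so it comes back as NaN from the merge and is filled with 0.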

uj5u.com user replied:

You can sort the values and use a mask to take the cumsum of the other group:

df2 = df.sort_values(by='x', ascending=False)
m = df2['group'].eq('a')
df['result'] = m.cumsum().mask(m).fillna((~m).cumsum().where(m)).astype(int)

Output:

     x group  result
1  1.7     a       4
2  0.0     b       3
3  2.3     b       2
4  2.7     b       2
5  8.6     a       0
6  5.4     b       1
7  4.2     a       2
8  5.7     b       1
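
The one-liner is fairly dense, so here is the same computation broken into named intermediate columns. This is only a sketch for reading along; the names `a_above`, `b_above` and `steps` are not part of the answer's code.

```python
import pandas as pd

df = pd.DataFrame(
    {"x": [1.7, 0, 2.3, 2.7, 8.6, 5.4, 4.2, 5.7],
     "group": list("abbbabab")},
    index=range(1, 9),
)

df2 = df.sort_values(by="x", ascending=False)
m = df2["group"].eq("a")

a_above = m.cumsum()     # 'a' rows seen so far, walking from the largest x down
b_above = (~m).cumsum()  # 'b' rows seen so far

steps = pd.DataFrame({
    "x": df2["x"],
    "group": df2["group"],
    "for_b_rows": a_above.mask(m),   # keep the 'a' count only on 'b' rows
    "for_a_rows": b_above.where(m),  # keep the 'b' count only on 'a' rows
})
# Each row has exactly one of the two columns filled, so fillna combines them.
steps["result"] = steps["for_b_rows"].fillna(steps["for_a_rows"]).astype(int)
print(steps)
```

Because the running count of the other group never includes the current row itself, the inclusive cumsum already equals the number of other-group rows sorted above it.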

uj5u.com user replied:

This should be pretty efficient: just one sort on x, followed by a couple of cumsums.

df2 = df.sort_values('x', ascending=False).reset_index()
# the running counts must be taken on the sorted frame (df2), not on df
df2['acount'] = (df2['group'] == 'a').cumsum()
df2['bcount'] = (df2['group'] == 'b').cumsum()
df2

At this point df2 looks like this:

    index   x   group   acount  bcount
0   5       8.6 a       1       0
1   8       5.7 b       1       1
2   6       5.4 b       1       2
3   7       4.2 a       2       2
4   4       2.7 b       2       3
5   3       2.3 b       2       4
6   1       1.7 a       3       4
7   2       0.0 b       3       5

Now restore the index and pick acount or bcount depending on the group (each row takes the count of the other group):

df2 = df2.set_index('index').sort_index()
df2['result'] = np.where(df2['group'] == 'a', df2['bcount'], df2['acount'])
df2[['x', 'group', 'result']]

Final result:

    x   group   result
index           
1   1.7 a       4
2   0.0 b       3
3   2.3 b       2
4   2.7 b       2
5   8.6 a       0
6   5.4 b       1
7   4.2 a       2
8   5.7 b       1

Performance (on the same 130000-row test as @Corralien, obviously not on the same hardware):

65.4 ms ± 957 μs per loop (mean ± std. dev. of 7 runs, 10 loops each)

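uj5u.com user replied:

Not very different from Corralien's solution, but you can use broadcasting to compare every element of group 'a' against every element of group 'b' and count how many satisfy the condition. Then join the results back onto the frame.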

import pandas as pd
import numpy as np

a = df.loc[df['group'] == 'a', 'x']
b = df.loc[df['group'] == 'b', 'x']

result = pd.concat([
            pd.Series(np.sum(a.to_numpy() < b.to_numpy()[:, None], axis=0), index=a.index),
            pd.Series(np.sum(b.to_numpy() < a.to_numpy()[:, None], axis=0), index=b.index)])

df['result'] = result

     x group  result
1  1.7     a       4
2  0.0     b       3
3  2.3     b       2
4  2.7     b       2
5  8.6     a       0
6  5.4     b       1
7  4.2     a       2
8  5.7     b       1
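
Note that the broadcast comparison builds a temporary boolean matrix with len(a) * len(b) entries, which can get large for frames with many rows. As a rough sketch of one way around that (the function `count_greater_chunked` and its `chunk_size` parameter are my own illustration, not an existing pandas or NumPy API), the same comparison can be done a block of rows at a time:

```python
import numpy as np

def count_greater_chunked(values, others, chunk_size=1024):
    """For each element of `values`, count elements of `others` that are strictly greater.

    Only `chunk_size` values are compared at once, so the temporary boolean
    matrix has at most chunk_size * len(others) entries.
    """
    values = np.asarray(values)
    others = np.asarray(others)
    out = np.empty(len(values), dtype=np.int64)
    for start in range(0, len(values), chunk_size):
        block = values[start:start + chunk_size]
        # block[:, None] has shape (k, 1) and others has shape (m,),
        # so the comparison broadcasts to a (k, m) matrix.
        out[start:start + chunk_size] = (others > block[:, None]).sum(axis=1)
    return out

a = np.array([1.7, 8.6, 4.2])
b = np.array([0.0, 2.3, 2.7, 5.4, 5.7])
print(count_greater_chunked(a, b, chunk_size=2))  # [4 0 2]
print(count_greater_chunked(b, a, chunk_size=2))  # [3 2 2 1 1]
```

With the Series from the code above, this could be called as count_greater_chunked(a.to_numpy(), b.to_numpy()) and wrapped in pd.Series(..., index=a.index) as before.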

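uj5u.com user replied:

A quick solution to write is to use pandas' DataFrame.apply method. Note that it rescans the whole frame for every row, so it will be slow on large frames.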

df['result'] = df.apply(lambda row: df[(df['group'] != row['group']) & (df['x'] > row['x'])].x.count(), axis=1)

Please credit the source when reposting. Link to this article: https://www.uj5u.com/ruanti/409088.html
