Python: grouping (clustering/sorting) arrays based on ranking information

I have a dataframe that looks like this:

      A         B          C          D
0    5         4           3         2
1    4         5           3         2
2    3         5           2         1
3    4         2           5         1
4    4         5           2         1
5    4         3           5         1
...

I converted the dataframe to a 2D array like this:

[[5 4 3 2]
 [4 5 3 2]
 [3 5 2 1]
 [4 2 5 1]
 [4 5 2 1]
 [4 3 5 1]
 ...]

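The scores 1-5 in each row are the scores that people gave to items A, B, C, D. I want to identify the people who share the same ranking, for example the people who think A > B > C > D. I want to regroup the rows according to this ranking information: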

2DArray1: [[5 4 3 2]]
2DArray2: [[4 5 3 2]
           [3 5 2 1]
           [4 5 2 1]]
2DArray3: [[4 2 5 1]
           [4 3 5 1]]

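For example, 2DArray2 holds the people who think B > A > C > D, and 2DArray3 holds the people who think C > A > B > D. I tried different sorting functions in numpy but could not find a suitable one. How should I do this?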

Answer:

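Numpy does not have a groupby function, because groupby would return a list of lists of different sizes, whereas numpy mostly deals only with "rectangular" arrays.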

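One workaround is to sort the rows so that similar rows are adjacent, then generate an array holding the index at which each group starts.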
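
A minimal sketch of that workaround, assuming the scores are held in a numpy array named a as in the question. The intermediate names perms, order and starts are illustrative only, and ties are broken by column order here:

```python
import numpy as np

a = np.array([[5, 4, 3, 2],
              [4, 5, 3, 2],
              [3, 5, 2, 1],
              [4, 2, 5, 1],
              [4, 5, 2, 1],
              [4, 3, 5, 1]])

# Ranking of each row: column indices ordered from highest to lowest score.
# kind="stable" makes tied scores keep their left-to-right column order.
perms = np.argsort(-a, axis=1, kind="stable")

# Sort the rows so that identical rankings become adjacent.
# np.lexsort treats the LAST key as the primary one, hence the reversal.
order = np.lexsort(perms.T[::-1])
sorted_perms = perms[order]

# A new group starts wherever the ranking differs from the previous row.
changed = np.any(sorted_perms[1:] != sorted_perms[:-1], axis=1)
starts = np.flatnonzero(changed) + 1

# Split the reordered rows at the group boundaries.
groups = np.split(a[order], starts)
for g in groups:
    print(g)
```

np.split returns a plain Python list of arrays, which is how the groups can have different sizes even though each individual array stays rectangular.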

Since that takes a bit of work, here is a simpler solution without numpy:

Indexing directly by the permutation

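For each row, we compute the corresponding permutation of 'ABCD'. Then we add the row to a dictionary of lists of rows, where the dictionary key is that permutation.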

from collections import defaultdict

a = [[5, 4, 3, 2], [4, 5, 3, 2], [3, 5, 2, 1], [4, 2, 5, 1], [4, 5, 2, 1], [4, 3, 5, 1]]

groups = defaultdict(list)
for row in a:
    groups[tuple(sorted(range(len(row)), key=lambda i: row[i], reverse=True))].append(row)

print(groups)

Output:

defaultdict(<class 'list'>, {
    (0, 1, 2, 3): [[5, 4, 3, 2]],
    (1, 0, 2, 3): [[4, 5, 3, 2], [3, 5, 2, 1], [4, 5, 2, 1]],
    (2, 0, 1, 3): [[4, 2, 5, 1], [4, 3, 5, 1]]
})

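Note that with this solution, if a user gives the same score to two different items, the result may not be what you expect, because sorted does not preserve ties; instead, it breaks them by order of appearance (here, that means a tie between two items is broken alphabetically).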
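
To illustrate the caveat, here is a small sketch. order_key reproduces the tuple key used above; rank_key is a hypothetical alternative (not part of the answer) that assigns dense ranks so that tied items share a rank and therefore land in their own group:

```python
def order_key(row):
    # Key used above: item indices from highest to lowest score.
    # sorted() is stable, so tied items keep their left-to-right order.
    return tuple(sorted(range(len(row)), key=lambda i: row[i], reverse=True))

def rank_key(row):
    # Hypothetical alternative: dense rank of each item (1 = highest score);
    # items with equal scores receive the same rank.
    distinct = sorted(set(row), reverse=True)
    rank_of = {score: r for r, score in enumerate(distinct, start=1)}
    return tuple(rank_of[score] for score in row)

tied = [4, 4, 5, 1]    # A and B receive the same score
strict = [4, 3, 5, 1]  # C > A > B > D

# Both rows get the same order_key, so the tie silently becomes A > B.
print(order_key(tied), order_key(strict))

# rank_key keeps the tie visible, so the two rows get different keys.
print(rank_key(tied), rank_key(strict))
```

Whether tied rows should be merged with a strict ranking or kept apart depends on the application; the sketch only shows how the two keys differ.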

Indexing by the index of the permutation

The permutations of 'ABCD' can be listed in lexicographic order: 'ABCD' comes first, then 'ABDC' second, then 'ACBD' third, and so on.
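
As a quick check of that ordering, the standard library's itertools.permutations yields permutations in lexicographic order when its input is sorted, so the sequence can be listed directly (the variable name perms is just for illustration):

```python
from itertools import permutations

# All 24 permutations of 'ABCD', in lexicographic order.
perms = [''.join(p) for p in permutations('ABCD')]

print(perms[:3])            # the first three permutations
print(perms.index('BACD'))  # position of 'BACD' in the sequence
```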

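It turns out there is an algorithm that computes the index at which a given permutation appears in that sequence! This algorithm is implemented in the Python module more_itertools: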

more_itertools.permutation_index

So we can replace our tuple key tuple(sorted(range(len(row)), key=lambda i: row[i], reverse=True)) with a simple numeric key, permutation_index(row, sorted(row, reverse=True)).

from collections import defaultdict
from more_itertools import permutation_index

a = [[5, 4, 3, 2], [4, 5, 3, 2], [3, 5, 2, 1], [4, 2, 5, 1], [4, 5, 2, 1], [4, 3, 5, 1]]

groups = defaultdict(list)
for row in a:
    groups[permutation_index(row, sorted(row, reverse=True))].append(row)

print(groups)

Output:

defaultdict(<class 'list'>, {
    0: [[5, 4, 3, 2]],
    6: [[4, 5, 3, 2], [3, 5, 2, 1], [4, 5, 2, 1]],
    8: [[4, 2, 5, 1], [4, 3, 5, 1]]
})
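
For readers without more_itertools, the following is a rough standard-library sketch of how such an index can be computed. The function name perm_index is hypothetical, and this is not necessarily how the library implements it. Note that the key used above indexes the arrangement of the scores within the row, not the item string: rows ranked C > A > B > D get key 8, whereas 'CABD' sits at position 12 among the permutations of 'ABCD'. With no tied scores, rows with the same ranking still share a key, which is all the grouping needs.

```python
from itertools import permutations

def perm_index(element, iterable):
    # Position of `element` among the permutations of `iterable`,
    # in the order itertools.permutations would produce them.
    pool = list(iterable)
    index = 0
    for x in element:
        r = pool.index(x)                 # rank of x among the remaining values
        index = index * len(pool) + r     # mixed-radix (factorial base) digit
        del pool[r]
    return index

row = [4, 2, 5, 1]
key = perm_index(row, sorted(row, reverse=True))

# Brute-force check against the full list of permutations.
brute = list(permutations(sorted(row, reverse=True))).index(tuple(row))
print(key, brute)

# Index of the item string itself, for comparison.
print(perm_index('CABD', 'ABCD'))
```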

Combining permutation_index and pandas

Since the output of permutation_index is a simple number, we can easily include it as a new column in a numpy array or a pandas dataframe:

import pandas as pd
from more_itertools import permutation_index

df = pd.DataFrame({'A': [5,4,3,4,4,4], 'B': [4,5,5,2,5,3], 'C': [3,2,2,5,2,5], 'D': [2,2,1,1,1,1]})

df['perm_idx'] = df.apply(lambda row: permutation_index(row, sorted(row, reverse=True)), axis=1)

print(df)

   A  B  C  D  perm_idx
0  5  4  3  2         0
1  4  5  2  2         6
2  3  5  2  1         6
3  4  2  5  1         8
4  4  5  2  1         6
5  4  3  5  1         8

for idx, sub_df in df.groupby('perm_idx'):
    print(idx)
    print(sub_df)

0
   A  B  C  D  perm_idx
0  5  4  3  2         0
6
   A  B  C  D  perm_idx
1  4  5  2  2         6
2  3  5  2  1         6
4  4  5  2  1         6
8
   A  B  C  D  perm_idx
3  4  2  5  1         8
5  4  3  5  1         8

Answer:

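You can

(i) transpose df and convert it to a dictionary,

(ii) sort each dictionary by value and take the keys,

(iii) join the sorted keys for each "person" and assign the result to df['ranks'],

(iv) collect each person's scores into a list and assign it to df['pref'],

(v) groupby(['ranks']) and aggregate pref into lists: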

import pandas as pd

df = pd.DataFrame({'A': {0: 5, 1: 4, 2: 3, 3: 4, 4: 4, 5: 4},
                   'B': {0: 4, 1: 5, 2: 5, 3: 2, 4: 5, 5: 3},
                   'C': {0: 3, 1: 3, 2: 2, 3: 5, 4: 2, 5: 5},
                   'D': {0: 2, 1: 2, 2: 1, 3: 1, 4: 1, 5: 1}})

df['ranks'] = pd.Series({k : ''.join(list(zip(*sorted(v.items(), key=lambda d:d[1], 
                                                      reverse=True)))[0]) 
                         for k,v in df.T.to_dict().items()})
df['pref'] = df.loc[:,'A':'D'].values.tolist()
out = df[['ranks','pref']].groupby('ranks').agg(list).to_dict()['pref']

Output:

{'ABCD': [[5, 4, 3, 2]],
 'BACD': [[4, 5, 3, 2], [3, 5, 2, 1], [4, 5, 2, 1]],
 'CABD': [[4, 2, 5, 1], [4, 3, 5, 1]]}
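
The steps above can also be traced in plain Python, which may make the string keys easier to follow. This is only a sketch of the same idea without pandas; the names items, rows and out are illustrative:

```python
from collections import defaultdict

items = 'ABCD'
rows = [[5, 4, 3, 2], [4, 5, 3, 2], [3, 5, 2, 1],
        [4, 2, 5, 1], [4, 5, 2, 1], [4, 3, 5, 1]]

out = defaultdict(list)
for row in rows:
    # Steps (i)-(ii): pair each item with its score, sort by score, highest first.
    ordered = sorted(zip(items, row), key=lambda pair: pair[1], reverse=True)
    # Step (iii): join the item names into a key such as 'BACD'.
    ranks = ''.join(name for name, _ in ordered)
    # Steps (iv)-(v): collect the row's scores under that key.
    out[ranks].append(row)

print(dict(out))
```

As with the first answer, sorted is stable, so tied scores are resolved in column order.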
