Counting how often two unique values appear together in the same group, regardless of order

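Consider a Pandas DataFrame with 2 columns: image_id and name.

Each row represents one person (name) who appears in an image (image_id).
Each image can contain 1 or more people.
Each name appears at most once in an image.
The order within a pair does not matter, e.g. Bob & Mary = Mary & Bob.

How can I count how many times each pair of people appears in the same image across the whole dataset?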

import pandas as pd

data = [[1, 'Mary'], [1, 'Bob'], [1, 'Susan'],
        [2, 'Bob'], [2, 'Joe'],
        [3, 'Isaac'],
        [4, 'Mary'], [4, 'Susan'],
        [5, 'Mary'], [5, 'Susan'], [5, 'Bob'], [5, 'Joe']]

df = pd.DataFrame(data, columns=['image_id', 'name'])

# Now what?

Expected DataFrame (the order of the rows, and of the names within a pair, does not matter):

 name1    name2   count
  Mary    Susan       3
   Bob    Susan       2
  Mary      Bob       2
   Bob      Joe       2
  Mary      Joe       1
 Susan      Joe       1

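Alternative solution:

A symmetric grid would also work, where both the rows and the columns are names and each cell holds the number of times those two people appear in the same image. Whichever is easier.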

Answer:

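We can use crosstab to build a frequency table, then take the inner product of that table with itself to count how many times each pair of people appears in the same image.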

import numpy as np

s = pd.crosstab(df['image_id'], df['name'])
c = s.T @ s
c = c.mask(np.triu(c, 1) == 0).stack()\
     .rename_axis(['name1', 'name2']).reset_index(name='count')

  name1  name2  count
0   Bob    Joe    2.0
1   Bob   Mary    2.0
2   Bob  Susan    2.0
3   Joe   Mary    1.0
4   Joe  Susan    1.0
5  Mary  Susan    3.0
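
The counts above print as floats (2.0, 3.0) because mask fills the discarded cells with NaN. Below is a minimal sketch of one way to keep integer counts, assuming NumPy's triu_indices is used to read the strict upper triangle directly instead of masking. The names rows, cols and pairs are illustrative and do not come from the original answer.

```python
import numpy as np
import pandas as pd

data = [[1, 'Mary'], [1, 'Bob'], [1, 'Susan'],
        [2, 'Bob'], [2, 'Joe'],
        [3, 'Isaac'],
        [4, 'Mary'], [4, 'Susan'],
        [5, 'Mary'], [5, 'Susan'], [5, 'Bob'], [5, 'Joe']]
df = pd.DataFrame(data, columns=['image_id', 'name'])

# Same first two steps as the answer: frequency table, then inner product.
s = pd.crosstab(df['image_id'], df['name'])
c = s.T @ s

# Positions of the strict upper triangle (k=1 skips the diagonal).
rows, cols = np.triu_indices(len(c), k=1)

# Build the pair table from plain arrays, so no NaN is ever introduced.
pairs = pd.DataFrame({
    'name1': c.index.to_numpy()[rows],
    'name2': c.columns.to_numpy()[cols],
    'count': c.to_numpy()[rows, cols],
})

# Drop pairs that never share an image.
pairs = pairs[pairs['count'] > 0].reset_index(drop=True)
print(pairs)
```

Since the values are never replaced by NaN, the count column keeps the integer dtype produced by crosstab.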

Edit by OP:

Below is a step-by-step explanation of the code above:

# Compute a frequency table of names that appear in each image.
s = pd.crosstab(df['image_id'], df['name'])

name      Bob  Isaac  Joe  Mary  Susan
image_id                              
1           1      0    0     1      1
2           1      0    1     0      0
3           0      1    0     0      0
4           0      0    0     1      1
5           1      0    1     1      1

# Inner product counts the occurrences of each pair.
# The diagonal counts the number of times a name appeared in any image.
c = s.T @ s

name   Bob  Isaac  Joe  Mary  Susan
name                               
Bob      3      0    2     2      2
Isaac    0      1    0     0      0
Joe      2      0    2     1      1
Mary     2      0    1     3      3
Susan    2      0    1     3      3

# Keep the non-zero elements in the upper triangle, since matrix is symmetric.
c = c.mask(np.triu(c, 1) == 0)

name   Bob  Isaac  Joe  Mary  Susan
name                               
Bob    NaN    NaN  2.0   2.0    2.0
Isaac  NaN    NaN  NaN   NaN    NaN
Joe    NaN    NaN  NaN   1.0    1.0
Mary   NaN    NaN  NaN   NaN    3.0
Susan  NaN    NaN  NaN   NaN    NaN

# Group all counts in a single column.
# Each row represents a unique pair of names.
c = c.stack()

name  name 
Bob   Joe      2.0
      Mary     2.0
      Susan    2.0
Joe   Mary     1.0
      Susan    1.0
Mary  Susan    3.0

# Expand the MultiIndex into separate columns.
c = c.rename_axis(['name1', 'name2']).reset_index(name='count')

  name1  name2  count
0   Bob    Joe    2.0
1   Bob   Mary    2.0
2   Bob  Susan    2.0
3   Joe   Mary    1.0
4   Joe  Susan    1.0
5  Mary  Susan    3.0

For more details, see crosstab, @ (matrix multiplication), T (transpose), triu, mask, and stack.
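
The question also asked about a symmetric grid as an alternative output. The intermediate matrix s.T @ s is already that grid, except that its diagonal holds how many images each person appears in. Here is a small sketch, assuming the diagonal should read 0 because a person is not paired with themselves. The variable names grid and values are illustrative.

```python
import numpy as np
import pandas as pd

data = [[1, 'Mary'], [1, 'Bob'], [1, 'Susan'],
        [2, 'Bob'], [2, 'Joe'],
        [3, 'Isaac'],
        [4, 'Mary'], [4, 'Susan'],
        [5, 'Mary'], [5, 'Susan'], [5, 'Bob'], [5, 'Joe']]
df = pd.DataFrame(data, columns=['image_id', 'name'])

s = pd.crosstab(df['image_id'], df['name'])
co_occurrence = s.T @ s

# Work on a copy of the underlying array, then zero the diagonal in place.
values = co_occurrence.to_numpy().copy()
np.fill_diagonal(values, 0)

# Rebuild a labelled, symmetric name-by-name grid.
grid = pd.DataFrame(values,
                    index=co_occurrence.index,
                    columns=co_occurrence.columns)
print(grid)
print(grid.loc['Mary', 'Susan'])  # same value as grid.loc['Susan', 'Mary']
```

People who never share an image with anyone, such as Isaac here, end up with an all-zero row and column.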

Answer:

I know an answer has already been posted and accepted, but I would still like to share my code. This is how I got the expected output the "HARD WAY".

import itertools
import pandas as pd

data = [[1, 'Mary'], [1, 'Bob'], [1, 'Susan'],
        [2, 'Bob'], [2, 'Joe'],
        [3, 'Isaac'],
        [4, 'Mary'], [4, 'Susan'],
        [5, 'Mary'], [5, 'Susan'], [5, 'Bob'], [5, 'Joe']]

df = pd.DataFrame(data, columns=['image_id', 'name'])

# Group the df by 'image_id' and get the value of name in the form of list
groups = df.groupby(['image_id'])['name'].apply(list).reset_index()

output = {}

# Loop through the groups dataframe
for index, row in groups.iterrows():
    # Sort the list of names in ascending order
    row['name'].sort()
    # Get the all possible combination of list in pair of twos
    temp = list(itertools.combinations(row['name'], 2))
    # Loop through it and maintain the output dictionary with its occurrence
    # Default set occurrence value to 1 when initialize
    # Increment it when we found more occurrence of it
    for val in temp:
        if val not in output:
            output[val] = 1
        else:
            output[val] += 1

temp_output = []

# Reformat the output dictionary into rows so we can build a pandas DataFrame from it
for key, val in output.items():
    temp = [key[0], key[1], val]
    temp_output.append(temp)

df = pd.DataFrame(temp_output, columns=['name1', 'name2', 'count'])

print(df.sort_values(by=['count'], ascending=False))

This is the output I get:

  name1  name2  count
2  Mary  Susan      3
0   Bob   Mary      2
1   Bob  Susan      2
3   Bob    Joe      2
4   Joe   Mary      1
5   Joe  Susan      1

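This is not the "PYTHONIC" way, but it is how I solve most problems. It is not elegant, but it gets the job done.

Note: the comments in the code explain how it works, but if anyone has doubts, questions, or suggestions, please let me know.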
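
The same loop-based idea can be written more compactly. The sketch below swaps the hand-maintained dictionary for collections.Counter from the standard library. This is a substitution, not the code of the answer above, and the names pair_counts and result are illustrative.

```python
from collections import Counter
from itertools import combinations

import pandas as pd

data = [[1, 'Mary'], [1, 'Bob'], [1, 'Susan'],
        [2, 'Bob'], [2, 'Joe'],
        [3, 'Isaac'],
        [4, 'Mary'], [4, 'Susan'],
        [5, 'Mary'], [5, 'Susan'], [5, 'Bob'], [5, 'Joe']]
df = pd.DataFrame(data, columns=['image_id', 'name'])

pair_counts = Counter()
for _, names in df.groupby('image_id')['name']:
    # Sorting first makes (Bob, Mary) and (Mary, Bob) map to the same key.
    pair_counts.update(combinations(sorted(names), 2))

# most_common() yields the pairs ordered by descending count.
result = pd.DataFrame(
    [(a, b, n) for (a, b), n in pair_counts.most_common()],
    columns=['name1', 'name2', 'count'],
)
print(result)
```

Images with a single person, such as image 3, produce no combinations and therefore add nothing to the counter.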

