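Consider a Pandas DataFrame with 2 columns: image_id and name
- Each row represents one person (name) who appears in an image (image_id)
- Each image can contain 1 or more people
- Each name appears at most once per image
- The order within a pair doesn't matter, e.g. Bob & Mary = Mary & Bob
How can I count the number of times two people appear in the same image across the whole dataset?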
import pandas as pd
data = [[1, 'Mary'], [1, 'Bob'], [1, 'Susan'],
[2, 'Bob'], [2, 'Joe'],
[3, 'Isaac'],
[4, 'Mary'], [4, 'Susan'],
[5, 'Mary'], [5, 'Susan'], [5, 'Bob'], [5, 'Joe']]
df = pd.DataFrame(data, columns=['image_id', 'name'])
# Now what?
Expected DataFrame (the order of rows or names doesn't matter):
name1 name2 count
Mary Susan 3
Bob Susan 2
Mary Bob 2
Bob Joe 2
Mary Joe 1
Susan Joe 1
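Alternative solution:
A symmetric grid would also work, where both the rows and the columns are names and each cell holds the number of times those two people appear in the same image. Whichever is easier.
Reply from a uj5u.com user:
We can use crosstab to compute a frequency table, then take the inner product of that table with itself to count the number of times each pair of people appears in the same image.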
import numpy as np
s = pd.crosstab(df['image_id'], df['name'])
c = s.T @ s
c = c.mask(np.triu(c, 1) == 0).stack()\
.rename_axis(['name1', 'name2']).reset_index(name='count')
name1 name2 count
0 Bob Joe 2.0
1 Bob Mary 2.0
2 Bob Susan 2.0
3 Joe Mary 1.0
4 Joe Susan 1.0
5 Mary Susan 3.0
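The counts above come out as floats (2.0, 3.0) because mask fills the discarded cells with NaN. A minimal sketch of one way to get integer counts back, assuming the same df as in the question: the variable name pairs and the explicit dropna() are my own additions, not part of the answer. The axes are renamed before stacking so the two index levels do not share the same name.

```python
import numpy as np
import pandas as pd

data = [[1, 'Mary'], [1, 'Bob'], [1, 'Susan'],
        [2, 'Bob'], [2, 'Joe'],
        [3, 'Isaac'],
        [4, 'Mary'], [4, 'Susan'],
        [5, 'Mary'], [5, 'Susan'], [5, 'Bob'], [5, 'Joe']]
df = pd.DataFrame(data, columns=['image_id', 'name'])

s = pd.crosstab(df['image_id'], df['name'])
c = (s.T @ s).rename_axis(index='name1', columns='name2')

pairs = (
    c.mask(np.triu(c, 1) == 0)  # NaN everywhere except the non-zero upper triangle
     .stack()
     .dropna()                  # harmless if stack() already dropped the NaN cells
     .astype(int)               # safe now that no NaN is left
     .reset_index(name='count')
)
print(pairs)
```

The extra dropna() is only there as a precaution in case stack() keeps NaN cells in the installed pandas version.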
Edit by OP:
Here is a detailed explanation of the code above:
# Compute a frequency table of names that appear in each image.
s = pd.crosstab(df['image_id'], df['name'])
name Bob Isaac Joe Mary Susan
image_id
1 1 0 0 1 1
2 1 0 1 0 0
3 0 1 0 0 0
4 0 0 0 1 1
5 1 0 1 1 1
# Inner product counts the occurrences of each pair.
# The diagonal counts the number of times a name appeared in any image.
c = s.T @ s
name Bob Isaac Joe Mary Susan
name
Bob 3 0 2 2 2
Isaac 0 1 0 0 0
Joe 2 0 2 1 1
Mary 2 0 1 3 3
Susan 2 0 1 3 3
# Keep the non-zero elements in the upper triangle, since matrix is symmetric.
c = c.mask(np.triu(c, 1) == 0)
name Bob Isaac Joe Mary Susan
name
Bob NaN NaN 2.0 2.0 2.0
Isaac NaN NaN NaN NaN NaN
Joe NaN NaN NaN 1.0 1.0
Mary NaN NaN NaN NaN 3.0
Susan NaN NaN NaN NaN NaN
# Group all counts in a single column.
# Each row represents a unique pair of names.
c = c.stack()
name name
Bob Joe 2.0
Mary 2.0
Susan 2.0
Joe Mary 1.0
Susan 1.0
Mary Susan 3.0
# Expand the MultiIndex into separate columns.
c = c.rename_axis(['name1', 'name2']).reset_index(name='count')
name1 name2 count
0 Bob Joe 2.0
1 Bob Mary 2.0
2 Bob Susan 2.0
3 Joe Mary 1.0
4 Joe Susan 1.0
5 Mary Susan 3.0
For more details, see crosstab, @ (matrix multiplication), T (transpose), triu, mask, and stack.
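For comparison, the same pair counts can also be reached without the matrix product, by joining the table to itself on image_id. This is a different technique from the crosstab answer, shown only as a rough sketch; the names joined and counts are hypothetical.

```python
import pandas as pd

data = [[1, 'Mary'], [1, 'Bob'], [1, 'Susan'],
        [2, 'Bob'], [2, 'Joe'],
        [3, 'Isaac'],
        [4, 'Mary'], [4, 'Susan'],
        [5, 'Mary'], [5, 'Susan'], [5, 'Bob'], [5, 'Joe']]
df = pd.DataFrame(data, columns=['image_id', 'name'])

# Self-join: every combination of two names within the same image.
joined = df.merge(df, on='image_id', suffixes=('1', '2'))

# Keep each unordered pair once and drop self-pairs (name1 == name2).
joined = joined[joined['name1'] < joined['name2']]

# Count how many images each pair shares.
counts = (
    joined.groupby(['name1', 'name2'])
          .size()
          .reset_index(name='count')
)
print(counts)
```

The self-join creates one row for every ordered pair of names in each image, so it may get heavy when images contain many people; the crosstab version is probably the better fit in that case.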
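Reply from a uj5u.com user:
I know an answer has already been posted and accepted, but I'd still like to share my code. This is how I got the expected output the "HARD WAY".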
import itertools
import pandas as pd
data = [[1, 'Mary'], [1, 'Bob'], [1, 'Susan'],
[2, 'Bob'], [2, 'Joe'],
[3, 'Isaac'],
[4, 'Mary'], [4, 'Susan'],
[5, 'Mary'], [5, 'Susan'], [5, 'Bob'], [5, 'Joe']]
df = pd.DataFrame(data, columns=['image_id', 'name'])
# Group the df by 'image_id' and collect the names of each image into a list
groups = df.groupby(['image_id'])['name'].apply(list).reset_index()
output = {}
# Loop through the groups dataframe
for index, row in groups.iterrows():
    # Sort the list of names in ascending order so each pair has one canonical order
    row['name'].sort()
    # Get all possible combinations of two names from the list
    temp = list(itertools.combinations(row['name'], 2))
    # Loop through the pairs and track their occurrences in the output dictionary:
    # set the count to 1 the first time a pair is seen,
    # and increment it every time the pair is seen again
    for i, val in enumerate(temp):
        if val not in output:
            output[val] = 1
        else:
            output[val] += 1
temp_output = []
# Reformat the output dictionary into rows so it can be loaded into a pandas DataFrame
for key, val in output.items():
    temp = [key[0], key[1], val]
    temp_output.append(temp)
df = pd.DataFrame(temp_output, columns=['name1', 'name2', 'count'])
print(df.sort_values(by=['count'], ascending=False))
This is the output I get:
name1 name2 count
2 Mary Susan 3
0 Bob Mary 2
1 Bob Susan 2
3 Bob Joe 2
4 Joe Mary 1
5 Joe Susan 1
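This is not the "PYTHONIC" way, but it's how I solve most problems. It isn't pretty, but it gets my work done.
Note: the comments in the code explain how it works, but if any of you have doubts, questions, or suggestions, please let me know.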
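The same loop can be written more compactly with collections.Counter, which handles the "initialise to 1, then increment" bookkeeping itself. This is only a sketch of that idea, not the answerer's code; pair_counts and result are hypothetical names.

```python
from collections import Counter
from itertools import combinations

import pandas as pd

data = [[1, 'Mary'], [1, 'Bob'], [1, 'Susan'],
        [2, 'Bob'], [2, 'Joe'],
        [3, 'Isaac'],
        [4, 'Mary'], [4, 'Susan'],
        [5, 'Mary'], [5, 'Susan'], [5, 'Bob'], [5, 'Joe']]
df = pd.DataFrame(data, columns=['image_id', 'name'])

pair_counts = Counter()
for _, names in df.groupby('image_id')['name']:
    # Sorting first gives every pair one canonical order (Bob, Mary), never (Mary, Bob).
    pair_counts.update(combinations(sorted(names), 2))

# most_common() yields the pairs from highest to lowest count.
result = pd.DataFrame(
    [(a, b, n) for (a, b), n in pair_counts.most_common()],
    columns=['name1', 'name2', 'count'],
)
print(result)
```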
