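Given the following table, in which one column stores a unique identifier (the user_id column) and four columns are binary (col1 through col4):
import pandas as pd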
df = pd.DataFrame.from_dict({
'id': ['a', 'b', 'c', 'd', 'e']
,'col1': [1,1,0,1,0]
,'col2': [0,1,1,1,1]
,'col3': [0,0,1,0,0]
,'col4': [0,0,1,1,1]
})
| user_id | col1 | col2 | col3 | col4 |
|---|---|---|---|---|
| a | 1 | 0 | 0 | 0 |
| b | 1 | 1 | 0 | 0 |
| c | 0 | 1 | 1 | 1 |
| d | 1 | 1 | 0 | 1 |
| e | 0 | 1 | 0 | 1 |
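How can I create an output table showing how many user_ids have each co-occurring pair of binary columns (both columns equal to 1)?
| co-occurrence pair | count of user_id |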
|---|---|
| col1-col2 | 2 |
| col1-col3 | 0 |
| col1-col4 | 1 |
| col2-col3 | 1 |
| col2-col4 | 3 |
| col3-col4 | 1 |
uj5u.com user reply:
You can try something like this:
from itertools import combinations
import pandas as pd
df = pd.DataFrame.from_dict({
'id': ['a', 'b', 'c', 'd', 'e']
,'col1': [1,1,0,1,0]
,'col2': [0,1,1,1,1]
,'col3': [0,0,1,0,0]
,'col4': [0,0,1,1,1]
})
dfi = df.set_index('id')
# For each pair of columns, count the rows in which both values are 1
pd.Series({f'{a}-{b}': dfi[[a, b]].all(axis=1).sum() for a, b in combinations(dfi.columns, 2)})
Output:
col1-col2 2
col1-col3 0
col1-col4 1
col2-col3 1
col2-col4 3
col3-col4 1
dtype: int64
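The same counts can also be read off a co-occurrence matrix. This is a different technique from the answer above, shown here only as a minimal sketch: it assumes every non-id column holds just 0 and 1, and the names `cooc` and `pair_counts` are made up for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'id': ['a', 'b', 'c', 'd', 'e'],
    'col1': [1, 1, 0, 1, 0],
    'col2': [0, 1, 1, 1, 1],
    'col3': [0, 0, 1, 0, 0],
    'col4': [0, 0, 1, 1, 1],
})
dfi = df.set_index('id')

# Entry (x, y) of this matrix counts the rows where columns x and y are both 1
# (assumption: the columns contain only 0 and 1).
cooc = dfi.T @ dfi

# Keep the strict upper triangle so that each unordered pair appears once.
rows, cols = np.triu_indices(len(cooc), k=1)
pair_counts = pd.Series(
    cooc.to_numpy()[rows, cols],
    index=[f"{cooc.index[r]}-{cooc.columns[c]}" for r, c in zip(rows, cols)],
)
print(pair_counts)
```

The diagonal of `cooc` holds the per-column totals, which is why only the entries above it are kept.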
uj5u.com user reply:
# Column 0 is the id column, so the pair indices start at 1
df_output = pd.DataFrame({"co-occurrence pair": [f"{df.columns[i]}-{df.columns[j]}" for i in range(1, len(df.columns)) for j in range(i + 1, len(df.columns))],
"count of user_id": [sum(df.iloc[:, i] & df.iloc[:, j]) for i in range(1, len(df.columns)) for j in range(i + 1, len(df.columns))]})
co-occurrence pair count of user_id
0 col1-col2 2
1 col1-col3 0
2 col1-col4 1
3 col2-col3 1
4 col2-col4 3
5 col3-col4 1
A more readable version:
cols = []
counts = []
# Column 0 is the id column, so the pair indices start at 1
for i in range(1, len(df.columns)):
    for j in range(i + 1, len(df.columns)):
        cols.append(f"{df.columns[i]}-{df.columns[j]}")
        # Bitwise AND is 1 only where both columns are 1
        counts.append(sum(df.iloc[:, i] & df.iloc[:, j]))
df_output = pd.DataFrame({"co-occurrence pair": cols, "count of user_id": counts})
uj5u.com user reply:
You can use the itertools API for this.
Select the required columns (everything except id):
import itertools
columns = list(itertools.filterfalse(lambda c: c=="id", df.columns))
>> ['col1', 'col2', 'col3', 'col4']
Create the combinations of column pairs:
combinations = list(itertools.combinations(columns, 2))
>> [('col1', 'col2'),
>> ('col1', 'col3'),
>> ('col1', 'col4'),
>> ('col2', 'col3'),
>> ('col2', 'col4'),
>> ('col3', 'col4')]
Filter the rows where both binary columns are 1 and count the ids:
result_data = [(f"{x}-{y}", df[(df[x]==1) & (df[y]==1)]["id"].count()) for x,y in combinations]
result_df = pd.DataFrame(result_data, columns=["co-occurrence pair", "count of user_id"])
>> [('col1-col2', 2),
>> ('col1-col3', 0),
>> ('col1-col4', 1),
>> ('col2-col3', 1),
>> ('col2-col4', 3),
>> ('col3-col4', 1)]
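The three steps above can be folded into one helper. This is only a sketch: the function name `cooccurrence_counts` and its `id_col` parameter are hypothetical, and it assumes every column other than the id column is binary.

```python
from itertools import combinations

import pandas as pd


def cooccurrence_counts(frame: pd.DataFrame, id_col: str = "id") -> pd.DataFrame:
    """Count, for each pair of binary columns, the rows where both equal 1."""
    # Assumption: every column except id_col is a 0/1 column.
    binary_cols = [c for c in frame.columns if c != id_col]
    records = [
        (f"{x}-{y}", int(((frame[x] == 1) & (frame[y] == 1)).sum()))
        for x, y in combinations(binary_cols, 2)
    ]
    return pd.DataFrame(records, columns=["co-occurrence pair", "count of user_id"])


df = pd.DataFrame({
    'id': ['a', 'b', 'c', 'd', 'e'],
    'col1': [1, 1, 0, 1, 0],
    'col2': [0, 1, 1, 1, 1],
    'col3': [0, 0, 1, 0, 0],
    'col4': [0, 0, 1, 1, 1],
})
result = cooccurrence_counts(df)
print(result)
```

Selecting columns by name instead of by position means the id column does not have to come first.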
