Counting co-occurring pairs across multiple columns? - 有解無憂

Given the table below, with one column storing a unique identifier (user_id, the id column in the code) and four binary columns (col1 to col4):

    import pandas as pd
    df = pd.DataFrame.from_dict({
        'id': ['a', 'b', 'c', 'd', 'e']
        ,'col1': [1,1,0,1,0]
        ,'col2': [0,1,1,1,1]
        ,'col3': [0,0,1,0,0]
        ,'col4': [0,0,1,1,1]
    })

user_id	col1	col2	col3	col4
a	1	0	0	0
b	1	1	0	0
c	0	1	1	1
d	1	1	0	1
e	0	1	0	1

How can I create an output table showing how many user_ids have each co-occurring pair of binary columns, i.e. both columns equal to 1?

co-occurrence pair	count of user_id
col1-col2	2
col1-col3	0
col1-col4	1
col2-col3	1
col2-col4	3
col3-col4	1

Reply from a uj5u.com user:

You could try something like this:

from itertools import combinations
import pandas as pd

df = pd.DataFrame.from_dict({
        'id': ['a', 'b', 'c', 'd', 'e']
        ,'col1': [1,1,0,1,0]
        ,'col2': [0,1,1,1,1]
        ,'col3': [0,0,1,0,0]
        ,'col4': [0,0,1,1,1]
    })

dfi = df.set_index('id')

# For each pair of columns, count the rows where both values are non-zero.
pd.Series({f'{z[0]}-{z[1]}': dfi[list(z)].all(axis=1).sum() for z in combinations(dfi.columns, 2)})

Output:

col1-col2    2
col1-col3    0
col1-col4    1
col2-col3    1
col2-col4    3
col3-col4    1
dtype: int64
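
As an alternative to looping over pairs, the same counts can be read off a matrix product. This is a different technique from the answer above, shown only as a sketch: for a 0/1 table X, entry (i, j) of X.T @ X is the number of rows where columns i and j are both 1. The names cooc and pair_counts are illustrative and do not come from the original post.

```python
from itertools import combinations

import pandas as pd

df = pd.DataFrame({
    'id': ['a', 'b', 'c', 'd', 'e'],
    'col1': [1, 1, 0, 1, 0],
    'col2': [0, 1, 1, 1, 1],
    'col3': [0, 0, 1, 0, 0],
    'col4': [0, 0, 1, 1, 1],
})

dfi = df.set_index('id')

# Square co-occurrence matrix: rows and columns are both col1..col4.
# Assumes every value is exactly 0 or 1.
cooc = dfi.T.dot(dfi)

# Read off only the pairs with i < j, in the same order as combinations().
pair_counts = pd.Series({
    f'{a}-{b}': int(cooc.loc[a, b]) for a, b in combinations(dfi.columns, 2)
})
print(pair_counts)
```

The diagonal of cooc holds the per-column totals, which is why only the off-diagonal pairs are kept here.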

Reply from a uj5u.com user:

# Column 0 is 'id', so the pair indices start at 1.
df_output = pd.DataFrame({"co-occurrence pair": [f"{df.columns[i]}-{df.columns[j]}" for i in range(1, len(df.columns)) for j in range(i + 1, len(df.columns))],
              "count of user_id": [sum(df.iloc[:, i] & df.iloc[:, j]) for i in range(1, len(df.columns)) for j in range(i + 1, len(df.columns))]})

  co-occurrence pair  count of user_id
0          col1-col2                 2
1          col1-col3                 0
2          col1-col4                 1
3          col2-col3                 1
4          col2-col4                 3
5          col3-col4                 1

A more readable version:

cols = []
counts = []
for i in range(1, len(df.columns)):
    for j in range(i + 1, len(df.columns)):
        cols.append(f"{df.columns[i]}-{df.columns[j]}")
        counts.append(sum(df.iloc[:, i] & df.iloc[:, j]))
df_output = pd.DataFrame({"co-occurrence pair": cols, "count of user_id": counts})
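
The loops above start at index 1, which relies on id being the first column. A minimal sketch that selects the value columns by name instead, so column order does not matter, might look like this. The helper count_pairs and its id_col parameter are hypothetical, not part of the original answer.

```python
from itertools import combinations

import pandas as pd


def count_pairs(df, id_col='id'):
    """Hypothetical helper: count rows where both columns of each pair equal 1."""
    value_cols = [c for c in df.columns if c != id_col]
    rows = []
    for a, b in combinations(value_cols, 2):
        both = (df[a] == 1) & (df[b] == 1)  # boolean mask, True where both are 1
        rows.append((f"{a}-{b}", int(both.sum())))
    return pd.DataFrame(rows, columns=["co-occurrence pair", "count of user_id"])


# Same data as the question, with 'id' deliberately placed last.
df = pd.DataFrame({
    'col1': [1, 1, 0, 1, 0],
    'col2': [0, 1, 1, 1, 1],
    'col3': [0, 0, 1, 0, 0],
    'col4': [0, 0, 1, 1, 1],
    'id': ['a', 'b', 'c', 'd', 'e'],
})

df_output = count_pairs(df)
print(df_output)
```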

Reply from a uj5u.com user:

You can use the itertools API for this.

Filter out the columns you need:

import itertools

columns = list(itertools.filterfalse(lambda c: c=="id", df.columns))
>> ['col1', 'col2', 'col3', 'col4']

Create the combinations of column pairs:

combinations = list(itertools.combinations(columns, 2))
>> [('col1', 'col2'),
>>  ('col1', 'col3'),
>>  ('col1', 'col4'),
>>  ('col2', 'col3'),
>>  ('col2', 'col4'),
>>  ('col3', 'col4')]

Filter the rows on each pair of binary columns and count the ids:

result_data = [(f"{x}-{y}", df[(df[x]==1) & (df[y]==1)]["id"].count()) for x,y in combinations]

result_df = pd.DataFrame(result_data, columns=["co-occurrence pair", "count of user_id"])

>> [('col1-col2', 2),
>>  ('col1-col3', 0),
>>  ('col1-col4', 1),
>>  ('col2-col3', 1),
>>  ('col2-col4', 3),
>>  ('col3-col4', 1)]
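
One detail of the filtering step: Series.count() skips missing values, so a row with a missing id would not be counted. With the question's data there are no missing ids and the result is unaffected. The sketch below uses made-up data (df_missing is hypothetical) to show the difference between counting ids and summing the boolean mask.

```python
import pandas as pd

# Hypothetical data: same layout as the question's table, but one id is missing.
df_missing = pd.DataFrame({
    'id': ['a', None, 'c'],
    'col1': [1, 1, 0],
    'col2': [1, 1, 1],
})

# True for rows where both columns equal 1 (rows 0 and 1 here).
both = (df_missing['col1'] == 1) & (df_missing['col2'] == 1)

count_via_filter = df_missing[both]['id'].count()  # ignores the row whose id is missing
count_via_mask = both.sum()                        # counts every matching row

print(count_via_filter, count_via_mask)
```

Which of the two is appropriate depends on whether rows without an id should count as users.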

Please credit the source when reposting. Link to this article: https://www.uj5u.com/ruanti/517790.html

Tags: Python, pandas, algorithm