How to intersect 3 dataframes of the same dimensions and output a dataframe of the values common to at least 2 of them

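I have three dataframes of the same dimensions. I want to find the events (1s and -1s) that are common to at least 2 of the 3 dataframes. I want an output dataframe of the same dimensions whose elements appear in at least two of the dataframes.

I wrote the example below to make the problem clear. In the example, I want to find which positions hold a 1 or a -1 in at least two of the dataframes.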

import pandas as pd

A= {'a': [0, '.', 0, -1],'b': [0, '.', 1, 0], 'c':[1,'.', 0, 1]   }
A = pd.DataFrame(data=A)

    a  b  c
0   0  0  1
1   .  .  .
2   0  1  0
3  -1  0  1

B= {'a': [0, '.', 0, -1],'b': [1, '.', 1, 0], 'c':[1,'.', 0, -1]   }
B = pd.DataFrame(data=B)

B
    a  b   c
0   0  1   1
1   .  .   .
2   0  1   0
3  -1  0  -1

C = {'a': [0, '.', 0, 0],'b': [1, '.', 1, 0], 'c':[0,'.', 0, 0]   }
C = pd.DataFrame(data=C)

C
   a  b  c
0  0  1  0
1  .  .  .
2  0  1  0
3  0  0  0

The desired output would be:

    a  b  c
0   0  1  1
1   .  .  .
2   0  1  0
3  -1  0  0

I have tried several things, but none of them worked.

I would appreciate any help.

Thank you very much!

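A uj5u.com user replied:

You can count, for each value, how many of the dataframes hold it at each position, and use numpy.select to map any number of choices. The '.' is handled separately here, but it could also be added as a value to check.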

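import numpy as np

dfs = [A, B, C]
vals = [1, -1]

# One boolean mask per value: True where at least 2 dataframes hold that value
masks = [sum(x.eq(val).astype(int) for x in dfs).ge(2)
         for val in vals]

# Pick the matching value (0 where no mask matches), then restore the '.' cells
pd.DataFrame(np.select(masks, vals),
             columns=A.columns, index=A.index).mask(A.eq('.'), '.')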

Output:

    a  b  c
0   0  1  1
1   .  .  .
2   0  1  0
3  -1  0  0
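
To see why the masks work, it can help to look at the intermediate counts before the .ge(2) threshold is applied. The following is a minimal sketch using the sample frames from the question; count_matches is a hypothetical helper name introduced only for illustration.

```python
import pandas as pd

# Sample frames from the question; '.' marks a row without data.
A = pd.DataFrame({'a': [0, '.', 0, -1], 'b': [0, '.', 1, 0], 'c': [1, '.', 0, 1]})
B = pd.DataFrame({'a': [0, '.', 0, -1], 'b': [1, '.', 1, 0], 'c': [1, '.', 0, -1]})
C = pd.DataFrame({'a': [0, '.', 0, 0], 'b': [1, '.', 1, 0], 'c': [0, '.', 0, 0]})


def count_matches(dfs, val):
    """Count, per cell, how many frames hold `val` (hypothetical helper)."""
    return sum(df.eq(val).astype(int) for df in dfs)


ones = count_matches([A, B, C], 1)         # how many frames hold 1 in each cell
minus_ones = count_matches([A, B, C], -1)  # how many frames hold -1 in each cell

print(ones)
print(minus_ones)

# A cell qualifies when its count reaches the threshold of 2.
print(ones.ge(2))
```

Only the cells whose count is 2 or more survive the threshold, which is what each entry of masks encodes in the answer above.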

With the dot handled the same way as 1/-1:

dfs = [A,B,C]
vals = ['.', 1, -1]
masks = [sum(x.eq(val).astype(int) for x in dfs).ge(2)
         for val in vals]
pd.DataFrame(np.select(masks, vals),
             columns=A.columns, index=A.index)

A uj5u.com user replied:

I imagine there are better solutions, but you could use:

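from collections import Counter

final_df = pd.DataFrame(columns=["a", "b", "c"])

for i in range(len(A)):
    temp = []
    for j in A.columns:
        # Tally the three values found at row i, column j, most common first
        count = Counter([A[j][i], B[j][i], C[j][i]]).most_common()
        if count[0][1] > 1:
            # Held by at least two frames: keep it
            temp.append(count[0][0])
        else:
            # No two frames agree: fall back to 0
            temp.append(0)
    # Append the finished row; each one-row frame keeps its own index label 0
    final_df = pd.concat([final_df, pd.DataFrame([temp], columns=["a", "b", "c"])])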

Output:

    a   b   c
0   0   1   1
0   .   .   .
0   0   1   0
0   -1  0   0
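
The per-cell decision in the loop above can be isolated into a small function, which makes the rule easy to check on its own. This is a sketch only; majority_or_zero is a hypothetical name, and it assumes three input values per cell, as in the question.

```python
from collections import Counter


def majority_or_zero(values):
    """Return the most common value if it occurs more than once, else 0."""
    value, count = Counter(values).most_common(1)[0]
    return value if count > 1 else 0


print(majority_or_zero([1, 1, 0]))        # two frames agree on 1
print(majority_or_zero([1, -1, 0]))       # no two frames agree
print(majority_or_zero(['.', '.', '.']))  # the placeholder row is preserved
```

The repeated 0 in the index of the output above comes from concatenating one-row frames that each carry the index label 0; passing ignore_index=True to pd.concat would renumber the rows.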

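A uj5u.com user replied:

Solving the problem first

Concatenate the dataframes
Group by the inner index level
Aggregate with value counts, keeping the most frequent element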

df = (pd.concat([A, B, C], axis=0, keys=['A', 'B', 'C'])
        .groupby(level=1)
        .agg(lambda x: (x.value_counts().iloc[0] >= 2) * x.value_counts().index[0])
)

Output

    a  b   c
0   0  1   1
1   .  .   .
2   0  1   0
3  -1  0   0
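
To make the three steps more concrete, the sketch below inspects the concatenated frame and the value counts for a single cell (row 0, column 'c'). It rebuilds the sample frames from the question; the variable names stacked, cell and vc are chosen here for illustration.

```python
import pandas as pd

# Sample frames from the question; '.' marks a row without data.
A = pd.DataFrame({'a': [0, '.', 0, -1], 'b': [0, '.', 1, 0], 'c': [1, '.', 0, 1]})
B = pd.DataFrame({'a': [0, '.', 0, -1], 'b': [1, '.', 1, 0], 'c': [1, '.', 0, -1]})
C = pd.DataFrame({'a': [0, '.', 0, 0], 'b': [1, '.', 1, 0], 'c': [0, '.', 0, 0]})

# Stack the frames vertically; `keys` adds an outer index level naming the source frame.
stacked = pd.concat([A, B, C], axis=0, keys=['A', 'B', 'C'])
print(stacked.shape)          # 12 rows (3 frames x 4 rows), 3 columns
print(stacked.index.nlevels)  # 2 levels: source key, original row label

# Select original row 0 from every frame, then look at column 'c'.
cell = stacked.xs(0, level=1)['c']
vc = cell.value_counts()      # sorted by frequency, most common first
print(vc)

# The aggregation keeps the top value only when its count is at least 2.
print(vc.index[0], vc.iloc[0])
```

Grouping by level=1 performs this same selection for every original row label, and agg then applies the rule to each column of each group.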

Cleaning up the code

The line .agg(lambda x: (x.value_counts().iloc[0] >= 2) * x.value_counts().index[0]):

is not pretty

raises a warning under pandas 1.1.5:

DeprecationWarning: In future, it will be an error for 'np.bool_' scalars to be interpreted as an index
This is separate from the ipykernel package so we can avoid doing imports until

By replacing the lambda with a named function, we can improve on both points:

def series_rule(s):
    vc = s.value_counts()
    return  vc.index[0] if (vc.iloc[0] >= 2) else 0  
    
df = (pd.concat([A, B, C], axis=0, keys=['A', 'B', 'C'])
        .groupby(level=1)
        .agg(series_rule)
)

Generalizing

We can now define a reusable function:

def Rachael_rule(pandas_serie):
    vc = pandas_serie.value_counts()
    return  vc.index[0] if (vc.iloc[0] >= 2) else 0  

def df_list_apply(rule, df_list):
    """
    Applies a rule to a list of dataframes of the same shape
    
    Parameters:
    - rule : a function taking a pandas.Series argument and returning a value
    - df_list : a list of pandas.DataFrame

    Return:
      the dataframe obtained by applying the rule along the dimension of the list
    """    
    return (pd.concat(df_list, axis=0, keys=list(range(len(df_list))))
              .groupby(level=1)
              .agg(rule)
           )

Then:

>>> df_list_apply(Rachael_rule, [A, B, C])
[Out]
    a  b   c
0   0  1   1
1   .  .   .
2   0  1   0
3  -1  0   0
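
Because the rule is passed in as an argument, the same helper can be reused with a different rule. As an illustration only, the sketch below plugs in a stricter, hypothetical unanimous_rule that keeps a value only when all frames agree on it; this rule is not part of the original answer.

```python
import pandas as pd

# Sample frames from the question; '.' marks a row without data.
A = pd.DataFrame({'a': [0, '.', 0, -1], 'b': [0, '.', 1, 0], 'c': [1, '.', 0, 1]})
B = pd.DataFrame({'a': [0, '.', 0, -1], 'b': [1, '.', 1, 0], 'c': [1, '.', 0, -1]})
C = pd.DataFrame({'a': [0, '.', 0, 0], 'b': [1, '.', 1, 0], 'c': [0, '.', 0, 0]})


def df_list_apply(rule, df_list):
    """Apply `rule` cell by cell across a list of same-shaped dataframes."""
    return (pd.concat(df_list, axis=0, keys=list(range(len(df_list))))
              .groupby(level=1)
              .agg(rule))


def unanimous_rule(s):
    """Hypothetical rule: keep a value only when every frame holds it, else 0."""
    return s.iloc[0] if s.nunique() == 1 else 0


result = df_list_apply(unanimous_rule, [A, B, C])
print(result)
```

With the sample data, only the '.' row and the 1 at row 2, column 'b' are shared by all three frames, so every other cell falls back to 0.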

Please credit the source when reposting. Link to this article: https://www.uj5u.com/ruanti/420952.html
