Python / Pandas: join records when at least one column matches

I have the following dataframe:

id          phone       email
10352897        
10352897    10225967    
10352897                [email protected]
10352897    10225967    [email protected]
            10225967    
            10225967    [email protected]
                        [email protected]
23578910        
23578910    38256789    
23578910                [email protected]
23578910    38256789    [email protected]
            38256789    
            38256789    [email protected]
                        [email protected]
            65287930    [email protected]
            65287930
                        [email protected]
            65287930
            70203065
            70203065
            70203065
                        [email protected]
                        [email protected]
                        [email protected]

Not every field is always filled in, but the records are linked to one another through at least one column.
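To run the answers against this data, the frame can be rebuilt roughly as follows. This is only a sketch: the e-mail addresses are redacted in the post, so `E1` to `E4` are hypothetical placeholders, and which rows share an address is an assumption inferred from the four distinct e-mails that appear in the answers' outputs. Blank cells are stored as empty strings here; answers that expect NaN would need the empty strings replaced first.

```python
import pandas as pd

# Hypothetical placeholder addresses (the real ones are redacted in the post).
E1, E2, E3, E4 = "a@example.com", "b@example.com", "c@example.com", "d@example.com"

# One tuple per row of the table above: (id, phone, email), '' for a blank cell.
rows = [
    ("10352897", "", ""),
    ("10352897", "10225967", ""),
    ("10352897", "", E1),
    ("10352897", "10225967", E1),
    ("", "10225967", ""),
    ("", "10225967", E1),
    ("", "", E1),
    ("23578910", "", ""),
    ("23578910", "38256789", ""),
    ("23578910", "", E2),
    ("23578910", "38256789", E2),
    ("", "38256789", ""),
    ("", "38256789", E2),
    ("", "", E2),
    ("", "65287930", E3),
    ("", "65287930", ""),
    ("", "", E3),
    ("", "65287930", ""),
    ("", "70203065", ""),
    ("", "70203065", ""),
    ("", "70203065", ""),
    ("", "", E4),
    ("", "", E4),
    ("", "", E4),
]
df = pd.DataFrame(rows, columns=["id", "phone", "email"])
```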


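I want records to be joined whenever they agree on at least one of the three columns, preferring filled fields over empty ones. For this example I would like to end up with the following output: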

id          phone       email
10352897    10225967    [email protected]
23578910    38256789    [email protected]
            65287930    [email protected]
            70203065
                        [email protected]

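How would you do this?

Reply from a uj5u.com user:

This is a very specific requirement, and I am not aware of any built-in pandas function that does what you need, so I tried to define exactly what you want to do and rebuild it from scratch.

The best approach I could think of is to read the dataframe from the top and look at the value in each column until we reach a row whose value in some column differs from the value previously seen in that column. At that point, all the values collected so far go into one row of a new dataframe.

That looks like this (assuming your original dataframe is named df and empty cells are blank strings, ''):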

import pandas as pd

new_rows = []
# Accumulator for the merged row, keyed by the columns of the original df
new_row = {col: '' for col in df.columns}
# Iterate over rows
for _, row in df.iterrows():
    # Iterate over columns in row
    for col in row.keys():
        # Skip blank cells
        if not row[col]:
            continue
        # If this column is already filled with a different value,
        # the current group is finished: store it and start a new one
        if new_row[col] and new_row[col] != row[col]:
            new_rows.append(new_row)
            new_row = {c: '' for c in df.columns}
        # Record the value in the current merged row
        new_row[col] = row[col]
# Add the last row
new_rows.append(new_row)
new_df = pd.DataFrame(new_rows)
print(new_df)

However, 70203065 and [email protected] end up in the same row of the new dataframe:

        id     phone            email
0  10352897  10225967   [email protected]
1  23578910  38256789  [email protected]
2            65287930  [email protected]
3            70203065  [email protected]

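You may need to think about what logic should put 70203065 and [email protected] on separate rows, but hopefully this gets you started in the right direction.

Reply from a uj5u.com user:

If the goal is simply to get the desired output, one approach is to build a dataframe holding the unique values of each column of the original dataframe df. This can be done with pandas.DataFrame.apply and a custom lambda function, as follows: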
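One possible rule that keeps 70203065 and the last e-mail apart is to treat two records as linked only when they share a non-blank value in some column, and to merge each group of transitively linked records. The sketch below does this with a small union-find. `merge_linked` is a hypothetical helper, not a pandas function; it assumes blanks are empty strings and works on plain dicts so the logic stays visible. If a group holds two different values for the same column, the first one seen wins.

```python
def merge_linked(records, columns=("id", "phone", "email")):
    """Merge records that share a non-blank value in any column (transitively)."""
    parent = list(range(len(records)))

    def find(i):
        # Walk up to the group's representative, shortening the path on the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    first_seen = {}  # (column, value) -> index of the first record holding it
    for i, rec in enumerate(records):
        for col in columns:
            value = rec.get(col, "")
            if value == "":
                continue
            j = first_seen.setdefault((col, value), i)
            a, b = find(i), find(j)
            if a != b:
                # Union: the smaller index stays the representative.
                parent[max(a, b)] = min(a, b)

    merged = {}  # representative index -> combined record
    for i, rec in enumerate(records):
        if all(rec.get(col, "") == "" for col in columns):
            continue  # a fully blank record carries no information
        row = merged.setdefault(find(i), {col: "" for col in columns})
        for col in columns:
            if row[col] == "" and rec.get(col, "") != "":
                row[col] = rec[col]  # filled values take priority over blanks
    return list(merged.values())


# Small demonstration with placeholder e-mail addresses.
records = [
    {"id": "10352897", "phone": "", "email": ""},
    {"id": "10352897", "phone": "10225967", "email": ""},
    {"id": "", "phone": "10225967", "email": "a@example.com"},
    {"id": "", "phone": "70203065", "email": ""},
    {"id": "", "phone": "", "email": "d@example.com"},
]
result = merge_linked(records)
for row in result:
    print(row)
```

With a dataframe whose blanks are '', something like `pd.DataFrame(merge_linked(df.to_dict("records")))` should bridge between the two representations. The first three demo records collapse into one row, while the phone-only and e-mail-only records stay separate because nothing links them.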

df_new = df.apply(lambda x: pd.Series(x.unique()[~pd.isnull(x.unique())]))

[Out]:

           id       phone            email
0  10352897.0  10225967.0   [email protected]
1  23578910.0  38256789.0  [email protected]
2         NaN  65287930.0  [email protected]
3         NaN  70203065.0  [email protected]

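There is room here to add a validation step that checks, for a given unique value, whether the remaining columns match df. In this particular case the first three rows are already correct, so we will skip that and simply duplicate the last row: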

df_new = pd.concat([df_new, df_new.iloc[-1:]], ignore_index=True)

and then adjust the respective values to get the desired output:

df_new.iloc[-2,2] = np.nan
df_new.iloc[-1,1] = np.nan

[Out]:

           id       phone            email
0  10352897.0  10225967.0   [email protected]
1  23578910.0  38256789.0  [email protected]
2         NaN  65287930.0  [email protected]
3         NaN  70203065.0              NaN
4         NaN         NaN  [email protected]

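Notes:

The last part is not the most elegant and needs some manual work (duplicating a row and editing cell values by hand), but it works for the OP's specific case.
This uses .apply(); you may want to read up on it.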
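For readers who would rather avoid .apply() here, the unique-values step can also be written with a dict comprehension. This is a sketch on a small made-up frame (the e-mail addresses are placeholders), assuming blanks are NaN as in this answer.

```python
import numpy as np
import pandas as pd

# Small stand-in for the original frame, with NaN for blank cells.
df = pd.DataFrame({
    "id": [10352897, 10352897, np.nan, 23578910, np.nan],
    "phone": [np.nan, 10225967, 10225967, 38256789, 70203065],
    "email": [np.nan, "a@example.com", "a@example.com", "b@example.com", np.nan],
})

# One Series of distinct non-null values per column. The DataFrame constructor
# aligns them on their 0..n-1 index and pads the shorter ones with NaN.
df_new = pd.DataFrame(
    {col: pd.Series(df[col].dropna().unique()) for col in df.columns}
)
print(df_new)
```

As with the .apply() version, the values in one output row are only lined up by position, so they are not guaranteed to belong to the same record.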

Reply from a uj5u.com user:

Here is one way to do it.

Note: the third part of your data does not follow the pattern you describe.

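import numpy as np

# Keep a row when none of its three values equals the next row's value
# (NaN never compares equal, so two consecutive blanks also count as different)
df['keep'] = (df['id'].ne(df['id'].shift(-1)) &
              df['phone'].ne(df['phone'].shift(-1)) &
              df['email'].ne(df['email'].shift(-1)))
# Number the groups: each kept row closes a group, bfill labels the rows before it
df['chng'] = df['keep'].map({False: np.nan, True: 1})
df['chng'] = df['chng'].cumsum().bfill()
# Within each group, carry filled values forward onto the kept (last) row
df = df.groupby('chng', as_index=True).ffill()
out = df.loc[df['keep']].fillna('')[['id', 'phone', 'email']]
out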

            id       phone  email
6   10352897.0  10225967.0  [email protected]
13  23578910.0  38256789.0  [email protected]
15              65287930.0  [email protected]
16                          [email protected]
17              65287930.0
20              70203065.0
23                          [email protected]

Please credit the source when reposting. Link to this article: https://www.uj5u.com/qukuanlian/524015.html

Tags: Python, pandas, numpy
