Pandas: a better way to compare two DataFrames and find entries that exist in only one of them

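I have a DataFrame/spreadsheet with columns for employee information (name, work site) and a column for total hours worked. My main goal is to find employees who exist in one file but not in the other.

DataFrame ORIGINAL: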

 Name      Site    ....other columns
 Anne      A
 Bob       B
 Charlie   A

DataFrame NEW:

 Name      Site    ....other columns
 Anne      A
 Bob       B
 Doug      B

DataFrame NEW is very similar to ORIGINAL, but there are a few differences. These are the details I want to surface:

Charlie/A only in ORIGINAL
Doug/B only in NEW
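For anyone who wants to reproduce the problem, the two frames above can be sketched as follows. Only the Name and Site columns are built here; the "other columns" elided in the tables are left out rather than guessed.

```python
import pandas as pd

# Sample frames mirroring the tables above (key columns only;
# the elided "other columns" are intentionally omitted).
ORIGINAL = pd.DataFrame({
    "Name": ["Anne", "Bob", "Charlie"],
    "Site": ["A", "B", "A"],
})
NEW = pd.DataFrame({
    "Name": ["Anne", "Bob", "Doug"],
    "Site": ["A", "B", "B"],
})

print(ORIGINAL)
print(NEW)
```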

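I found this solution, which works, but I have to run it twice: once to find the records that are in one DataFrame but not the other, and then again the other way around.

Here is my code: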

import pandas as pd

COLS = ['Name','Site'] # Columns to group by to find a 'unique' record

# Records in New, not in Original
df_right = ORIGINAL.merge(NEW.drop_duplicates(), on=COLS, how='right', indicator=True)
df_right = df_right[df_right._merge != 'both'] # Filter out records that exist in both.

# Records in Original, not in New
df_left = ORIGINAL.merge(NEW.drop_duplicates(), on=COLS, how='left', indicator=True)
df_left = df_left[df_left._merge != 'both']

df = pd.concat([df_left,df_right])
# df now contains Name/Site records that exist in one DataFrame but not the other

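Is there a better way to perform this check than running it twice and concatenating the results?

A uj5u.com user replied:

You can actually convert the DataFrames to Indexes, and then simply use isin to check whether an entire row is present in the other DataFrame: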

cols = ['Name', 'Site']
originalI = pd.Index(ORIGINAL[cols])
newI = pd.Index(NEW[cols])

out = pd.concat([
    ORIGINAL[~originalI.isin(newI)].assign(from_df='ORIGINAL'),
    NEW[~newI.isin(originalI)].assign(from_df='NEW'),    
])

Output:

>>> out
      Name Site   from_df
2  Charlie    A  ORIGINAL
2     Doug    B       NEW
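A variation on the same idea, offered as a sketch rather than as the answerer's method: pd.MultiIndex.from_frame builds the row keys through a constructor documented for DataFrame input, which may be more predictable than passing a two-dimensional frame to pd.Index (I am not sure that works for every dtype and pandas version). The sample frames and the names original_keys and new_keys are assumptions for illustration.

```python
import pandas as pd

# Sample frames from the question (key columns only).
ORIGINAL = pd.DataFrame({"Name": ["Anne", "Bob", "Charlie"], "Site": ["A", "B", "A"]})
NEW = pd.DataFrame({"Name": ["Anne", "Bob", "Doug"], "Site": ["A", "B", "B"]})

cols = ["Name", "Site"]

# One (Name, Site) tuple per row, in row order.
original_keys = pd.MultiIndex.from_frame(ORIGINAL[cols])
new_keys = pd.MultiIndex.from_frame(NEW[cols])

# isin returns a boolean array aligned with the rows of each frame.
out = pd.concat([
    ORIGINAL[~original_keys.isin(new_keys)].assign(from_df="ORIGINAL"),
    NEW[~new_keys.isin(original_keys)].assign(from_df="NEW"),
])
print(out)
```

Because the masks are applied to the full frames, any additional columns in ORIGINAL and NEW are carried through to the result.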

A uj5u.com user replied:

It looks like passing 'outer' as the how argument is the solution:

z = pd.merge(ORIGINAL, NEW, on=cols, how = 'outer', indicator=True)
z = z[z._merge != 'both'] # Filter out records from both

The output looks like this (after keeping only the columns I care about):

  Name       Site   _merge
  Charlie    A     left_only
  Doug       B     right_only
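The outer merge can be wrapped in a small helper, sketched below under a few assumptions: the function name only_in_one and its parameters are hypothetical, only the key columns are compared, and each side is de-duplicated first so repeated rows cannot multiply matches. The left_only/right_only labels come from the indicator column that merge adds.

```python
import pandas as pd

def only_in_one(left, right, on, left_name="ORIGINAL", right_name="NEW"):
    """Return the key rows that appear in exactly one of the two frames."""
    merged = left[on].drop_duplicates().merge(
        right[on].drop_duplicates(), on=on, how="outer", indicator=True
    )
    # Keep only rows that did not match on both sides.
    diff = merged[merged["_merge"] != "both"].copy()
    # _merge is categorical; convert to str before relabelling.
    labels = {"left_only": left_name, "right_only": right_name}
    diff["_merge"] = diff["_merge"].astype(str).map(labels)
    return diff.rename(columns={"_merge": "from_df"}).reset_index(drop=True)

# Sample frames from the question (key columns only).
ORIGINAL = pd.DataFrame({"Name": ["Anne", "Bob", "Charlie"], "Site": ["A", "B", "A"]})
NEW = pd.DataFrame({"Name": ["Anne", "Bob", "Doug"], "Site": ["A", "B", "B"]})

result = only_in_one(ORIGINAL, NEW, on=["Name", "Site"])
print(result)
```

This does the comparison in a single merge, which is what the question asked for, at the cost of dropping the non-key columns from the result.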

A uj5u.com user replied:

If you only want the "Name"/"Site" pairs, then I think this will work:

out = (pd.concat((ORIGINAL[['Name','Site']].assign(From='ORIGINAL'), 
                  NEW[['Name','Site']].assign(From='NEW')))
       .drop_duplicates(subset=['Name', 'Site'], keep=False))

Output:

      Name Site      From
2  Charlie    A  ORIGINAL
2     Doug    B       NEW
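One caveat with the keep=False approach: if a Name/Site pair is repeated within a single frame, both copies are discarded even though the pair exists in only one frame. The sketch below illustrates this with hypothetical data (Charlie listed twice in ORIGINAL) and shows that de-duplicating each frame before concatenating avoids it.

```python
import pandas as pd

# Hypothetical data: Charlie appears twice in ORIGINAL and not at all in NEW.
ORIGINAL = pd.DataFrame({"Name": ["Anne", "Charlie", "Charlie"], "Site": ["A", "A", "A"]})
NEW = pd.DataFrame({"Name": ["Anne"], "Site": ["A"]})
keys = ["Name", "Site"]

# Without per-frame de-duplication, keep=False drops both Charlie rows.
naive = pd.concat([
    ORIGINAL[keys].assign(From="ORIGINAL"),
    NEW[keys].assign(From="NEW"),
]).drop_duplicates(subset=keys, keep=False)

# De-duplicating each frame first leaves a single Charlie row to report.
safe = pd.concat([
    ORIGINAL[keys].drop_duplicates().assign(From="ORIGINAL"),
    NEW[keys].drop_duplicates().assign(From="NEW"),
]).drop_duplicates(subset=keys, keep=False)

print(naive)
print(safe)
```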
