標記兩個資料幀之間不匹配的記錄-有解無憂

我有一個基準資料框：

my_id    parent_id    attribute_1    attribute_2     attribute_3       attribute_4
  ABC          DEF             A-          378.8          Accept             False
  ABS          DES             A-          388.8          Accept             False
  ABB          DEG             A           908.8          Decline             True
  ABB          DEG             B-          378.8          Accept             False
  APP          DRE             C-          370.8          Accept              True

和一個資料框：

my_id    parent_id    Attribute_1     attribute2           attr_3        attribute_5
  ABC          DEF             A-          478.8          Decline              StRing
  ABS          DES             A-          388.8          Accept               String
  ABB          DEG             A           908.8          Accept               StrIng
  ABB          DEG             C-          378.8          Accept               String
  APP          DRE             C-          370.8          Accept               STring

如您所見，attribute_1、attribute_2 或attribute_3 中不時出現一些錯誤（列的名稱不同，但它們應該包含相同的內容）。

當我檢查每一行的這三個屬性是否與基準測驗中的完全相同時，如何標記錯誤記錄？我期望類似這樣的輸出：

faulty_rows = 

    my_id    parent_id    Attribute_1     attribute2           attr_3       faulty_attr 
      ABC          DEF             A-          478.8          Decline       [attribute2, attr_3]                  
      ABB          DEG             A           908.8          Accept        [attr_3]      
      ABB          DEG             C-          378.8          Accept        [Attribute_1]

我所做的是重命名列并始終逐列連接，這讓我知道出了什么問題，但我想同時檢查整行并標記錯誤所在。那可能嗎？無論哪種方式，PySpark 或 Pandas 解決方案都很好，我對邏輯很感興趣。

uj5u.com熱心網友回復：

它在 pyspark 中滿口。請參閱下面的代碼和邏輯

df1 =df1.withColumn('index', row_number().over(Window.partitionBy().orderBy('my_id','parent_id')))#Create index. This can be avaoide if you have a nuique key
df =(df.withColumn('index', row_number().over(Window.partitionBy().orderBy('my_id','parent_id')))#Create index. This can be avaoide if you have a nuique key
#Rename columns to make them similar with df1
     .withColumnRenamed('attribute_1','Attribute_1').withColumnRenamed('attribute_2','attribute2')
.withColumnRenamed('attribute_3','attr_3').withColumnRenamed('attribute_3','attr_3'))


s =(df1.drop('attribute_5').unionByName(df.drop('attribute_4')).orderBy('my_id','parent_id','index')#Union the two dfs and sort bt index
    .withColumn('change',array('attribute_1','attribute2', 'attr_3'))#Create an array of columns being invetsigated for change
    .withColumn('cols',split(lit('attribute_1,attribute2, attr_3'),'\,'))#Introduce list of column names being invetsigated
    .withColumn('change1',last('change').over(Window.partitionBy('index').orderBy('my_id','parent_id')))#For every index,put changes side by side
    .where(col('change')!=col('change1'))#Filter where changes are not the same
    .withColumn('change2', expr("transform(change,(c,i)->change[i]!=change1[i])"))#create boolean of chnages
    .withColumn('faulty_attr',expr('filter(cols,(x,j)->(change2[j]))'))#Leverage arrray functions to filter columns that didnt change as expected
    .drop('index','change','cols','change1' ,'change2')#drop unwated columns
   )
s.show(truncate=False)


 ----- --------- ----------- ---------- ------- --------------------- 
|my_id|parent_id|Attribute_1|attribute2|attr_3 |faulty_attr          |
 ----- --------- ----------- ---------- ------- --------------------- 
|ABB  |DEG      |A          |908.8     |Accept |[ attr_3]            |
|ABB  |DEG      |C-         |378.8     |Accept |[attribute_1]        |
|ABC  |DEF      |A-         |478.8     |Decline|[attribute2,  attr_3]|
 ----- --------- ----------- ---------- ------- ---------------------

uj5u.com熱心網友回復：

一種方法可能正在使用pandas.DataFrame.compare，它要求：

兩個資料幀上相同的列名和相同的形狀。
資料幀之間的行順序是一致的——這可以通過預處理資料幀（例如，使用排序）來實作。

編碼：

df1.iloc[:,0:5].compare(df2.iloc[:,0:5])

輸出：

  attribute_1       attribute_2        attribute_3         
         self other        self  other        self    other
0         NaN   NaN       378.8  478.8      Accept  Decline
2         NaN   NaN         NaN    NaN     Decline   Accept
3          B-    C-         NaN    NaN         NaN      NaN

uj5u.com熱心網友回復：

假設A指的是您的第一個字典，以及B您的第二個：

選項 1：使用 Dataframe 比較

df1 = pd.DataFrame.from_dict(A)
df2 = pd.DataFrame.from_dict(B)

diffs = df1.where(df1.values==df2.values).isnull()
df1['faulty_attr'] = diffs.dot(diffs.columns)

###resulting output:
###  my_id parent_id attribute_1  attribute_2 attribute_3  attribute_4             faulty_attr
###0   ABC       DEF          A-        378.8      Accept        False  attribute_2attribute_3
###1   ABS       DES          A-        388.8      Accept        False
###2   ABB       DEG           A        908.8     Decline         True             attribute_3
###3   ABB       DEG          B-        378.8      Accept        False             attribute_1
###4   APP       DRE          C-        370.8      Accept         True

選項 2：使用DeepDiff

from deepdiff import DeepDiff

print(DeepDiff(A, B, ignore_order=False).pretty())

###resulting output:
###Value of root['attribute_1'][3] changed from "B-" to "C-".
###Value of root['attribute_2'][0] changed from 378.8 to 478.8.
###Value of root['attribute_3'][0] changed from "Accept" to "Decline".
###Value of root['attribute_3'][2] changed from "Decline" to "Accept"

選項 3：使用DataFrame合并

df1 = pd.DataFrame.from_dict(A)
df2 = pd.DataFrame.from_dict(B)

comparison_df = df1.merge(df2, indicator=True, how='outer')
print(comparison_df[comparison_df['_merge'] == 'right_only'])

###resulting output:
###  my_id parent_id attribute_1  attribute_2 attribute_3  attribute_4      _merge
###5   ABC       DEF          A-        478.8     Decline        False  right_only
###6   ABB       DEG           A        908.8      Accept         True  right_only
###7   ABB       DEG          C-        378.8      Accept        False  right_only

轉載請註明出處，本文鏈接：https://www.uj5u.com/caozuo/515060.html

標籤：Python熊猫数据框加入pyspark

上一篇：只用一個查詢回傳多對多關系的所有內容

下一篇：SQL從連接表中獲取不同的行