我有一個基準資料框:
my_id parent_id attribute_1 attribute_2 attribute_3 attribute_4
ABC DEF A- 378.8 Accept False
ABS DES A- 388.8 Accept False
ABB DEG A 908.8 Decline True
ABB DEG B- 378.8 Accept False
APP DRE C- 370.8 Accept True
和一個資料框:
my_id parent_id Attribute_1 attribute2 attr_3 attribute_5
ABC DEF A- 478.8 Decline StRing
ABS DES A- 388.8 Accept String
ABB DEG A 908.8 Accept StrIng
ABB DEG C- 378.8 Accept String
APP DRE C- 370.8 Accept STring
如您所見,attribute_1、attribute_2 或attribute_3 中不時出現一些錯誤(列的名稱不同,但它們應該包含相同的內容)。
當我檢查每一行的這三個屬性是否與基準測驗中的完全相同時,如何標記錯誤記錄?我期望類似這樣的輸出:
faulty_rows =
my_id parent_id Attribute_1 attribute2 attr_3 faulty_attr
ABC DEF A- 478.8 Decline [attribute2, attr_3]
ABB DEG A 908.8 Accept [attr_3]
ABB DEG C- 378.8 Accept [Attribute_1]
我所做的是重命名列并始終逐列連接,這讓我知道出了什么問題,但我想同時檢查整行并標記錯誤所在。那可能嗎?無論哪種方式,PySpark 或 Pandas 解決方案都很好,我對邏輯很感興趣。
uj5u.com熱心網友回復:
它在 pyspark 中滿口。請參閱下面的代碼和邏輯
df1 =df1.withColumn('index', row_number().over(Window.partitionBy().orderBy('my_id','parent_id')))#Create index. This can be avaoide if you have a nuique key
df =(df.withColumn('index', row_number().over(Window.partitionBy().orderBy('my_id','parent_id')))#Create index. This can be avaoide if you have a nuique key
#Rename columns to make them similar with df1
.withColumnRenamed('attribute_1','Attribute_1').withColumnRenamed('attribute_2','attribute2')
.withColumnRenamed('attribute_3','attr_3').withColumnRenamed('attribute_3','attr_3'))
s =(df1.drop('attribute_5').unionByName(df.drop('attribute_4')).orderBy('my_id','parent_id','index')#Union the two dfs and sort bt index
.withColumn('change',array('attribute_1','attribute2', 'attr_3'))#Create an array of columns being invetsigated for change
.withColumn('cols',split(lit('attribute_1,attribute2, attr_3'),'\,'))#Introduce list of column names being invetsigated
.withColumn('change1',last('change').over(Window.partitionBy('index').orderBy('my_id','parent_id')))#For every index,put changes side by side
.where(col('change')!=col('change1'))#Filter where changes are not the same
.withColumn('change2', expr("transform(change,(c,i)->change[i]!=change1[i])"))#create boolean of chnages
.withColumn('faulty_attr',expr('filter(cols,(x,j)->(change2[j]))'))#Leverage arrray functions to filter columns that didnt change as expected
.drop('index','change','cols','change1' ,'change2')#drop unwated columns
)
s.show(truncate=False)
----- --------- ----------- ---------- ------- ---------------------
|my_id|parent_id|Attribute_1|attribute2|attr_3 |faulty_attr |
----- --------- ----------- ---------- ------- ---------------------
|ABB |DEG |A |908.8 |Accept |[ attr_3] |
|ABB |DEG |C- |378.8 |Accept |[attribute_1] |
|ABC |DEF |A- |478.8 |Decline|[attribute2, attr_3]|
----- --------- ----------- ---------- ------- ---------------------
uj5u.com熱心網友回復:
一種方法可能正在使用pandas.DataFrame.compare,它要求:
兩個資料幀上相同的列名和相同的形狀。
資料幀之間的行順序是一致的——這可以通過預處理資料幀(例如,使用排序)來實作。
編碼:
df1.iloc[:,0:5].compare(df2.iloc[:,0:5])
輸出:
attribute_1 attribute_2 attribute_3
self other self other self other
0 NaN NaN 378.8 478.8 Accept Decline
2 NaN NaN NaN NaN Decline Accept
3 B- C- NaN NaN NaN NaN
uj5u.com熱心網友回復:
假設A指的是您的第一個字典,以及B您的第二個:
選項 1:使用 Dataframe 比較
df1 = pd.DataFrame.from_dict(A)
df2 = pd.DataFrame.from_dict(B)
diffs = df1.where(df1.values==df2.values).isnull()
df1['faulty_attr'] = diffs.dot(diffs.columns)
###resulting output:
### my_id parent_id attribute_1 attribute_2 attribute_3 attribute_4 faulty_attr
###0 ABC DEF A- 378.8 Accept False attribute_2attribute_3
###1 ABS DES A- 388.8 Accept False
###2 ABB DEG A 908.8 Decline True attribute_3
###3 ABB DEG B- 378.8 Accept False attribute_1
###4 APP DRE C- 370.8 Accept True
選項 2:使用DeepDiff
from deepdiff import DeepDiff
print(DeepDiff(A, B, ignore_order=False).pretty())
###resulting output:
###Value of root['attribute_1'][3] changed from "B-" to "C-".
###Value of root['attribute_2'][0] changed from 378.8 to 478.8.
###Value of root['attribute_3'][0] changed from "Accept" to "Decline".
###Value of root['attribute_3'][2] changed from "Decline" to "Accept"
選項 3:使用DataFrame合并
df1 = pd.DataFrame.from_dict(A)
df2 = pd.DataFrame.from_dict(B)
comparison_df = df1.merge(df2, indicator=True, how='outer')
print(comparison_df[comparison_df['_merge'] == 'right_only'])
###resulting output:
### my_id parent_id attribute_1 attribute_2 attribute_3 attribute_4 _merge
###5 ABC DEF A- 478.8 Decline False right_only
###6 ABB DEG A 908.8 Accept True right_only
###7 ABB DEG C- 378.8 Accept False right_only
轉載請註明出處,本文鏈接:https://www.uj5u.com/caozuo/515060.html
下一篇:SQL從連接表中獲取不同的行
