I have this DataFrame with four columns:
df1 = spark.createDataFrame([
('c', 'd', 3.0, 4),
('c', 'd', 7.3, 8),
('c', 'd', 7.3, 2),
('c', 'd', 7.3, 8),
('e', 'f', 6.0, 3),
('e', 'f', 6.0, 8),
('e', 'f', 6.0, 3),
('c', 'j', 4.2, 3),
('c', 'j', 4.3, 9),
], ['a', 'b', 'c', 'd'])
df1.show()
+---+---+---+---+
|  a|  b|  c|  d|
+---+---+---+---+
|  c|  d|3.0|  4|
|  c|  d|7.3|  8|
|  c|  d|7.3|  2|
|  c|  d|7.3|  8|
|  e|  f|6.0|  3|
|  e|  f|6.0|  8|
|  e|  f|6.0|  3|
|  c|  j|4.2|  3|
|  c|  j|4.3|  9|
+---+---+---+---+
I also have another DataFrame, df2, with the same schema as df1:
df2 = spark.createDataFrame([
('c', 'd', 3.0, 4),
('c', 'd', 3.3, 5),
('c', 'd', 7.3, 2),
('c', 'd', 7.3, 7),
('e', 'f', 6.0, 3),
('c', 'j', 4.2, 1),
('c', 'j', 4.3, 9),
], ['a', 'b', 'c', 'd'])
df2.show()
+---+---+---+---+
|  a|  b|  c|  d|
+---+---+---+---+
|  c|  d|3.0|  4|
|  c|  d|3.3|  5|
|  c|  d|7.3|  2|
|  c|  d|7.3|  7|
|  e|  f|6.0|  3|
|  c|  j|4.2|  1|
|  c|  j|4.3|  9|
+---+---+---+---+
I want to compare the two DataFrames on the combination (a, b, d), so that I get the distinct rows that exist in df2 but have no match in df1:
df3
+---+---+---+---+
|  a|  b|  c|  d|
+---+---+---+---+
|  c|  d|3.3|  5|
|  c|  d|7.3|  7|
|  c|  j|4.2|  1|
+---+---+---+---+
Answer:
I think what you want is:
df2.subtract(df1.intersect(df2)).show()
That is, the rows of df2 that are not in both df1 and df2.
+---+---+---+---+
|  a|  b|  c|  d|
+---+---+---+---+
|  c|  j|4.2|  1|
|  c|  d|3.3|  5|
|  c|  d|7.3|  7|
+---+---+---+---+
I also agree with @pltc that you may have made a mistake in your output table.
