假設我們有兩個表 -A和B,它們的列鍵是country, name, age。我想從Aand創建一個聯合表B,稱為result,這樣result表中將是所有唯一的國家和名稱對,并優先配對 from A。
例如,假設這是表A:
val column = Seq("country","name", "age")
val A = Seq(("New York","Smith","10"),
("Washington","Rose","5"),
("Mexico","David","1"))
val aDF = A.toDF(column:_*)
aDF.show(false)
---------- ----- ---
|country |name |age|
---------- ----- ---
|New York |Smith|10 |
|Washington|Rose |5 |
|Mexico |David|1 |
---------- ----- ---
這是表B:
val B = Seq(("New York","Smith","5"),
("Florida","Smith","5"),
("Washington","Jef","5"),
("Russia","Boris","12"))
val bDF = B.toDF(column:_*)
bDF.show(false)
---------- ----- ---
|country |name |age|
---------- ----- ---
|New York |Smith|5 |. // Should not be included in the result table
|Florida |Smith|5 |
|Washington|Jef |5 |
|Russia |Boris|12 |
---------- ----- ---
結果表將是 -
---------- ----- ---
|country |name |age|
---------- ----- ---
|New York |Smith|10 | // <New York, Smith> contained in A and B - we take from A
|Washington|Rose |5 |. // Contained in A
|Mexico |David|1 |. // Contained in A
|Florida |Smith|5 |. // Contained in B
|Washington|Jef |5 |. // Contained in B
|Russia |Boris|12 |. // Contained in B
---------- ----- ---
我怎么能用火花做到這一點?
uj5u.com熱心網友回復:
您可以添加一個新列source具有文本值1和2分別在dataframesaDF和bDF工會操作之前。然后,使用此列進行自定義排序,以使用row_numberWindow 函式消除重復行:
import org.apache.spark.sql.expressions.Window
val w = Window.partitionBy("country", "name").orderBy("source")
val result = aDF.withColumn("source", lit(1))
.union(bDF.withColumn("source", lit(2)))
.withColumn("rn", row_number().over(w))
.filter("rn = 1")
.drop("rn", "source")
result.show
// ---------- ----- ---
//| country| name|age|
// ---------- ----- ---
//| Florida|Smith| 5|
//|Washington| Rose| 5|
//| Mexico|David| 1|
//|Washington| Jef| 5|
//| Russia|Boris| 12|
//| New York|Smith| 10|
// ---------- ----- ---
uj5u.com熱心網友回復:
完全聯接可能比union window:
aDF.join(bDF, Seq("country", "name"), "full")
.withColumn("age", coalesce(aDF("age"), bDF("age")))
.drop(aDF("age")).drop(bDF("age"))
.show
轉載請註明出處,本文鏈接:https://www.uj5u.com/qianduan/405129.html
標籤:
