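I am joining two datasets in which some columns share the same name. I would like the output to be a tuple of two case classes, each representing its own dataset.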
val joined = dataset1.as("ds1")
  .join(dataset2.as("ds2"), dataset1("key") === dataset2("key"), "inner")
  // select doesn't work because some columns have the same name
  .select("ds1.*", "ds2.*")
  // skipping select and going straight to .as fails for the same reason
  .as[(caseclass1, caseclass2)]
What code is needed to tell Spark to map ds1.* to caseclass1 and ds2.* to caseclass2?
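Answer from a uj5u.com community member:
You can use the struct function here to pack each side's columns into its own nested column, as follows: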
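import org.apache.spark.sql.functions.struct
import spark.implicits._ // assumes a SparkSession named `spark` is in scope

// create a wrapper case class whose field names match the struct aliases below
case class Outer(caseclass1: Caseclass1, caseclass2: Caseclass2)
// join, then select each side's columns as one struct so duplicate names no longer clash
val joined = dataset1.as("ds1")
  .join(dataset2.as("ds2"), dataset1("key") === dataset2("key"), "inner")
  .select(struct("ds1.*").as("caseclass1"), struct("ds2.*").as("caseclass2"))
  .as[Outer]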
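Since the question asks for a tuple rather than a wrapper class, here is a minimal end-to-end sketch of two ways that might get there. The first reuses the struct approach above but aliases the structs as `_1` and `_2` so the result can be read as a tuple. The second uses `Dataset.joinWith`, which returns a dataset of pairs directly. The case class fields (`key`, `value`), the sample rows, and the names `JoinToCaseClasses`, `structJoin` and `joinWithPairs` are assumptions made up for illustration; they are not from the original post.

```scala
import org.apache.spark.sql.{Dataset, SparkSession}
import org.apache.spark.sql.functions.struct

// Hypothetical case classes: both sides share the column names `key` and `value`.
case class Caseclass1(key: Int, value: String)
case class Caseclass2(key: Int, value: String)

object JoinToCaseClasses {

  // Struct approach: wrap each side's columns in a struct, then read the two
  // structs as a tuple. The aliases `_1` and `_2` match the tuple's field names.
  def structJoin(spark: SparkSession,
                 ds1: Dataset[Caseclass1],
                 ds2: Dataset[Caseclass2]): Dataset[(Caseclass1, Caseclass2)] = {
    import spark.implicits._
    ds1.as("ds1")
      .join(ds2.as("ds2"), $"ds1.key" === $"ds2.key", "inner")
      .select(struct("ds1.*").as("_1"), struct("ds2.*").as("_2"))
      .as[(Caseclass1, Caseclass2)]
  }

  // joinWith approach: keeps both sides as whole objects, so no select is needed.
  def joinWithPairs(spark: SparkSession,
                    ds1: Dataset[Caseclass1],
                    ds2: Dataset[Caseclass2]): Dataset[(Caseclass1, Caseclass2)] = {
    import spark.implicits._
    ds1.as("ds1").joinWith(ds2.as("ds2"), $"ds1.key" === $"ds2.key", "inner")
  }

  def main(args: Array[String]): Unit = {
    // Local session for demonstration only.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("join-to-case-classes")
      .getOrCreate()
    import spark.implicits._

    // Made-up sample data: only key 1 appears on both sides.
    val ds1 = Seq(Caseclass1(1, "a"), Caseclass1(2, "b")).toDS()
    val ds2 = Seq(Caseclass2(1, "x"), Caseclass2(3, "z")).toDS()

    structJoin(spark, ds1, ds2).collect().foreach(println)
    joinWithPairs(spark, ds1, ds2).collect().foreach(println)

    spark.stop()
  }
}
```

With this sample data both functions should yield the single pair for key 1. The `joinWith` variant is shorter, while the struct variant keeps the ordinary `join` and so leaves room for extra column expressions in the same `select`.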
