I am trying to create a new column based on information in another table.
df1
Loc Time Wage
1 192 1
3 192 2
1 193 3
5 193 3
7 193 5
2 194 7
df2
Loc City
1 NYC
2 Miami
3 LA
4 Chicago
5 Houston
6 SF
7 DC
Desired output:
Loc Time Wage City
1 192 1 NYC
3 192 2 LA
1 193 3 NYC
5 193 3 Houston
7 193 5 DC
2 194 7 Miami
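The real DataFrames differ a lot in row count, but this is the general shape. I think this could be done with .map, but I have not found much documentation on it online. join does not seem like a good fit for this case.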
Answer from a uj5u.com user:
join is exactly what you need. Try the following in spark-shell:
import org.apache.spark.sql.SparkSession
// In spark-shell a session named `spark` already exists; getOrCreate() returns that same session.
val spark = SparkSession.builder().appName("my_app").getOrCreate()
import spark.implicits._
val col1 = Seq("loc", "time", "wage")
val data1 = Seq((1, 192, 1), (3, 192, 2), (1, 193, 3), (5, 193, 3), (7, 193, 5), (2, 194, 7))
val col2 = Seq("loc", "city")
val data2 = Seq((1, "NYC"), (2, "Miami"), (3, "LA"), (4, "Chicago"), (5, "Houston"), (6, "SF"), (7, "DC"))
val df1 = spark.sparkContext.parallelize(data1).toDF(col1: _*)
val df2 = spark.sparkContext.parallelize(data2).toDF(col2: _*)
val outputDf = df1.join(df2, Seq("loc")) // inner join on the shared column "loc"
outputDf.show()
This outputs the following (row order may vary):
+---+----+----+-------+
|loc|time|wage|   city|
+---+----+----+-------+
|  1| 192|   1|    NYC|
|  1| 193|   3|    NYC|
|  2| 194|   7|  Miami|
|  3| 192|   2|     LA|
|  5| 193|   3|Houston|
|  7| 193|   5|     DC|
+---+----+----+-------+
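One thing to watch with the join above: it is an inner join, so any row of df1 whose loc has no match in df2 would be dropped. Below is a minimal sketch of a variant that keeps such rows, assuming the city table is small enough to broadcast. The names wages, cities, and withCity, the sample row with loc 9, and the local[*] master are illustrative assumptions, not part of the original question.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.broadcast

// master("local[*]") is an assumption so the sketch can run outside spark-shell.
val spark = SparkSession.builder().appName("join_sketch").master("local[*]").getOrCreate()
import spark.implicits._

// loc 9 is a made-up value that has no entry in the lookup table.
val wages = Seq((1, 192, 1), (9, 195, 4)).toDF("loc", "time", "wage")
val cities = Seq((1, "NYC"), (2, "Miami")).toDF("loc", "city")

// "left" keeps every row of wages; rows without a match get a null city.
// broadcast() is only a hint asking Spark to ship the small table to each executor.
val withCity = wages.join(broadcast(cities), Seq("loc"), "left")
withCity.show()
```

Whether the broadcast hint helps depends on the actual size of the lookup table; dropping it leaves the join result unchanged.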
Tags: Scala
