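I have a 2-element tuple like this:

Tuple2("String1, String2", ArrayList("String3", "String4"))
=> the first element is a single string holding comma-separated string values
=> the second element is an ArrayList containing a list of strings

I would like a DataFrame like this: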

Col1        Col2        Col3
1           String1     String3
2           String1     String4
3           String2     String3
4           String2     String4

Answer:

TL;DR

import org.apache.spark.sql.functions.{col, explode, monotonically_increasing_id, split}

df
    // `split` "String1, String2" into separate values, then create a row per value using `explode`
    .withColumn("Col2", explode(split(col("_1"), ", ")))
    // create a row per value in the list: "String3", "String4"
    .withColumn("Col3", explode(col("_2")))
    // now that we have our 4 rows, add a new column with an incrementing number
    .withColumn("Col1", monotonically_increasing_id() + 1)
    // only keep the columns we care about
    .select("Col1", "Col2", "Col3")
    .show(false)

Full answer

Starting from your example:

val tuple2 = Tuple2("String1, String2", List("String3", "String4"))

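and converting it to a DataFrame:

// needed for `toDF`; spark-shell imports this automatically (assumes a SparkSession named `spark`)
import spark.implicits._
val df = List(tuple2).toDF("_1", "_2")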

df.show(false)

This gives:

+----------------+------------------+
|_1              |_2                |
+----------------+------------------+
|String1, String2|[String3, String4]|
+----------------+------------------+

Now we are ready for the transformation:

import org.apache.spark.sql.functions.{col, explode, monotonically_increasing_id, split}

df
    // `split` "String1, String2" into separate values, then create a row per value using `explode`
    .withColumn("Col2", explode(split(col("_1"), ", ")))
    // create a row per value in the list: "String3", "String4"
    .withColumn("Col3", explode(col("_2")))
    // now that we have our 4 rows, add a new column with an incrementing number
    .withColumn("Col1", monotonically_increasing_id() + 1)
    // only keep the columns we care about
    .select("Col1", "Col2", "Col3")
    .show(false)

This gives:

+----+-------+-------+
|Col1|Col2   |Col3   |
+----+-------+-------+
|1   |String1|String3|
|2   |String1|String4|
|3   |String2|String3|
|4   |String2|String4|
+----+-------+-------+

Extra reading for more detail

It is worth noting that the order of operations is key.

First we explode "String1" and "String2" into their own rows:

df
    .withColumn("Col2", explode(split(col("_1"), ", ")))
    .select("Col2")
    .show(false)

Gives:

+-------+
|Col2   |
+-------+
|String1|
|String2|
+-------+

We have gone from the original 1 row to 2 rows.

Then we explode "String3" and "String4":

df
    .withColumn("Col2", explode(split(col("_1"), ", ")))
    .withColumn("Col3", explode(col("_2")))
    .select("Col2", "Col3")
    .show(false)

Gives:

+-------+-------+
|Col2   |Col3   |
+-------+-------+
|String1|String3|
|String1|String4|
|String2|String3|
|String2|String4|
+-------+-------+
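The two chained `explode` calls above amount to a Cartesian product of the two value lists. As a rough illustration only, here is the same idea in plain Scala collections, without Spark. The names `first`, `second`, `pairs` and `rows` are made up for this sketch and do not come from the original answer.

```scala
// Plain Scala collections only; no Spark needed. All names here are illustrative.
val first: List[String]  = "String1, String2".split(", ").toList // mirrors split(col("_1"), ", ")
val second: List[String] = List("String3", "String4")            // mirrors col("_2")

// Each generator plays the role of one `explode`:
// every value of `first` is paired with every value of `second`.
val pairs: List[(String, String)] =
  for {
    col2 <- first
    col3 <- second
  } yield (col2, col3)

// Number the rows only after all of them exist, starting at 1.
val rows: List[(Int, String, String)] =
  pairs.zipWithIndex.map { case ((col2, col3), i) => (i + 1, col2, col3) }

rows.foreach(println)
```

Numbering happens last here for the same reason as in the Spark version: an index assigned before the pairing step would simply be copied onto every generated row.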

Finally we add the incrementing count. If we had done this earlier, the same value would have been copied to multiple rows.

For example:

df
    // here we add `Col1` to a Dataset of only one row! So we only have the value `1`
    .withColumn("Col1", monotonically_increasing_id() + 1)
    // here we explode row 1, copying the value of `Col1`
    .withColumn("Col2", explode(split(col("_1"), ", ")))
    .withColumn("Col3", explode(col("_2")))
    .select("Col1", "Col2", "Col3")
    .show(false)

Gives:

+----+-------+-------+
|Col1|Col2   |Col3   |
+----+-------+-------+
|1   |String1|String3|
|1   |String1|String4|
|1   |String2|String3|
|1   |String2|String4|
+----+-------+-------+
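One caveat worth adding: `monotonically_increasing_id` only guarantees unique, increasing values, not consecutive ones, because the partition ID is encoded in the upper bits of the result. The 1, 2, 3, 4 shown earlier works out because this tiny example sits in a single partition. If strictly consecutive numbers are needed on larger data, one possible alternative is `row_number` over a window. The sketch below is an assumption-laden variant, not part of the original answer: it creates its own local SparkSession, and it assumes that numbering the rows in (`Col2`, `Col3`) order is acceptable.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, explode, row_number, split}

// Assumption: a local session for experimenting; in spark-shell `spark` already exists.
val spark = SparkSession.builder().master("local[*]").appName("tuple-to-df").getOrCreate()
import spark.implicits._

val df = List(("String1, String2", List("String3", "String4"))).toDF("_1", "_2")

// Assumption: rows are numbered in (Col2, Col3) order.
// A window without partitionBy moves every row into one partition, so this suits small data.
val byValues = Window.orderBy(col("Col2"), col("Col3"))

val numbered = df
  .withColumn("Col2", explode(split(col("_1"), ", ")))
  .withColumn("Col3", explode(col("_2")))
  // row_number starts at 1 and is consecutive within the window
  .withColumn("Col1", row_number().over(byValues))
  .select("Col1", "Col2", "Col3")

numbered.show(false)
```

The trade-off is that the global window forces a sort through a single partition, so `monotonically_increasing_id` remains the cheaper choice when unique IDs are enough.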
