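I have a 2-element tuple like this:
Tuple2("String1, String2", ArrayList("String3", "String4"))
=> the first element is a single string of comma-separated values
=> the second element is an ArrayList of strings
I want a DataFrame like this: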
Col1 Col2 Col3
1 String1 String3
2 String1 String4
3 String2 String3
4 String2 String4
Answer:
TL;DR
import org.apache.spark.sql.functions.{col, explode, monotonically_increasing_id, split}
df
  // `split` "String1, String2" into separate values, then create a row per value using `explode`
  .withColumn("Col2", explode(split(col("_1"), ", ")))
  // create a row per value in the list: "String3", "String4"
  .withColumn("Col3", explode(col("_2")))
  // now that we have our 4 rows, add a new column with an incrementing number (the id starts at 0, hence `+ 1`)
  .withColumn("Col1", monotonically_increasing_id() + 1)
  // only keep the columns we care about
  .select("Col1", "Col2", "Col3")
  .show(false)
Full answer
Starting from your example:
val tuple2 = Tuple2("String1, String2", List("String3", "String4"))
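and converting it to a DataFrame:
// `toDF` needs the implicits of an existing SparkSession, assumed here to be named `spark` (spark-shell imports them automatically)
import spark.implicits._
val df = List(tuple2).toDF("_1", "_2")
df.show(false)
This gives: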
+----------------+------------------+
|_1              |_2                |
+----------------+------------------+
|String1, String2|[String3, String4]|
+----------------+------------------+
Now we are ready for the transformation:
import org.apache.spark.sql.functions.{col, explode, monotonically_increasing_id, split}
df
  // `split` "String1, String2" into separate values, then create a row per value using `explode`
  .withColumn("Col2", explode(split(col("_1"), ", ")))
  // create a row per value in the list: "String3", "String4"
  .withColumn("Col3", explode(col("_2")))
  // now that we have our 4 rows, add a new column with an incrementing number (the id starts at 0, hence `+ 1`)
  .withColumn("Col1", monotonically_increasing_id() + 1)
  // only keep the columns we care about
  .select("Col1", "Col2", "Col3")
  .show(false)
This gives:
+----+-------+-------+
|Col1|Col2   |Col3   |
+----+-------+-------+
|1   |String1|String3|
|2   |String1|String4|
|3   |String2|String3|
|4   |String2|String4|
+----+-------+-------+
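For intuition about what the two `explode` calls do, here is a minimal sketch of the same logic using plain Scala collections, with no Spark involved. The object name `ExplodeSketch` and the helper `crossRows` are made up for illustration and are not part of any Spark API. Each generator in the for-comprehension plays the role of one `explode`, so together they produce every combination, and the numbering is applied only after both expansions.

```scala
// Plain-Scala analogue of the Spark pipeline above (illustrative only, no Spark required).
object ExplodeSketch {
  // Hypothetical helper: returns one (index, left, right) triple per combination.
  def crossRows(csv: String, values: List[String]): List[(Int, String, String)] = {
    val pairs = for {
      left  <- csv.split(", ").toList // comparable to explode(split(col("_1"), ", "))
      right <- values                 // comparable to explode(col("_2"))
    } yield (left, right)
    // Number the rows only after both expansions, starting at 1.
    pairs.zipWithIndex.map { case ((l, r), i) => (i + 1, l, r) }
  }

  def main(args: Array[String]): Unit =
    crossRows("String1, String2", List("String3", "String4")).foreach(println)
}
```

Running `main` prints the four triples (1,String1,String3), (2,String1,String4), (3,String2,String3) and (4,String2,String4), which matches the table above.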
Further reading for more detail
It is worth noting that the order of operations is key:
- First we explode "String1" and "String2" into their own rows:
df
  .withColumn("Col2", explode(split(col("_1"), ", ")))
  .select("Col2")
  .show(false)
This gives:
+-------+
|Col2   |
+-------+
|String1|
|String2|
+-------+
We have gone from the original 1 row to 2 rows.
- Then we explode "String3" and "String4", which pairs each of them with every row from the previous step:
df
  .withColumn("Col2", explode(split(col("_1"), ", ")))
  .withColumn("Col3", explode(col("_2")))
  .select("Col2", "Col3")
  .show(false)
This gives:
+-------+-------+
|Col2   |Col3   |
+-------+-------+
|String1|String3|
|String1|String4|
|String2|String3|
|String2|String4|
+-------+-------+
- Finally we add the incrementing count. If we did this earlier, the same number would be copied to multiple rows.
For example:
df
  // here we add `Col1` to a Dataset of only one row! So we only have the value `1`
  .withColumn("Col1", monotonically_increasing_id() + 1)
  // here we explode row 1, copying the value of `Col1`
  .withColumn("Col2", explode(split(col("_1"), ", ")))
  .withColumn("Col3", explode(col("_2")))
  .select("Col1", "Col2", "Col3")
  .show(false)
This gives:
+----+-------+-------+
|Col1|Col2   |Col3   |
+----+-------+-------+
|1   |String1|String3|
|1   |String1|String4|
|1   |String2|String3|
|1   |String2|String4|
+----+-------+-------+
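One caveat on the counter: `monotonically_increasing_id()` is documented as monotonically increasing and unique, but not consecutive. The values come out as 0, 1, 2, 3 here because this tiny DataFrame sits in a single partition. If gap-free numbering matters on larger data, one possible alternative is `row_number()` over a window. The sketch below is not part of the original answer. It assumes Spark is on the classpath and a local master, and the object name `RowNumberSketch`, the helper `numbered`, and the choice to order by `Col2`, `Col3` are all made up for illustration.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, explode, row_number, split}

// Illustrative sketch only: numbers rows with row_number() instead of monotonically_increasing_id().
object RowNumberSketch {
  // Hypothetical helper: expects columns `_1` (comma-separated string) and `_2` (array of strings).
  def numbered(df: DataFrame): DataFrame = {
    // Assumption: ordering by Col2 then Col3 is the row order we want.
    // A window without partitionBy pulls all rows into one partition, so this suits small data.
    val w = Window.orderBy("Col2", "Col3")
    df
      .withColumn("Col2", explode(split(col("_1"), ", ")))
      .withColumn("Col3", explode(col("_2")))
      // row_number() starts at 1, so no `+ 1` is needed
      .withColumn("Col1", row_number().over(w))
      .select("Col1", "Col2", "Col3")
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("row-number-sketch").getOrCreate()
    import spark.implicits._

    val df = List(("String1, String2", List("String3", "String4"))).toDF("_1", "_2")
    numbered(df).show(false)

    spark.stop()
  }
}
```

The trade-off is that the unpartitioned window forces a single-partition sort, which is fine for a handful of rows but can become a bottleneck on large datasets.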
