Spark Scala: extracting XML from a column

Suppose df has the following structure:

root
 |-- id: decimal(38,0) (nullable = true)
 |-- text: string (nullable = true)

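Here text contains strings of roughly XML-like records. I can then apply the following steps to extract the necessary entries into a flat table:

First, wrap the text in a root node, because the original has none. (Question 1: is this step necessary, or can it be omitted?)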

val df2 = df.withColumn("text", concat(lit("<root>"),$"text",lit("</root>")))
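
To make the wrapping step concrete, here is a minimal sketch on made-up data. The sample text (two sibling row elements holding key and value) is only an assumption about what the real column looks like, and the names WrapRootSketch and wrap are hypothetical.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, concat, lit}

object WrapRootSketch {
  // Surround the XML fragment in `text` with a single <root> element.
  def wrap(df: DataFrame): DataFrame =
    df.withColumn("text", concat(lit("<root>"), col("text"), lit("</root>")))

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("wrap-root-sketch").getOrCreate()
    import spark.implicits._

    // Hypothetical input: sibling <row> elements with no enclosing root.
    val df = Seq(
      (1L, "<row><key>a</key><value>1</value></row><row><key>b</key><value>2</value></row>")
    ).toDF("id", "text")

    wrap(df).show(truncate = false)
    spark.stop()
  }
}
```

Without the wrapper, each string holds several top-level elements, which is not a well-formed XML document on its own.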

Next, parse the XML:

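import com.databricks.spark.xml._  // spark-xml: schema_of_xml and the .xml(Dataset[String]) reader
// Infer the schema from the wrapped text (df2) so that it matches the data parsed below
val payloadSchema = schema_of_xml(df2.select("text").as[String])
val df3 = spark.read.option("rootTag","root").option("rowTag","row").schema(payloadSchema).xml(df2.select("text").as[String])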

This produces df3:

root
 |-- row: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- key: string (nullable = true)
 |    |    |-- value: string (nullable = true)

Finally, I explode the array:

val df4 = df3.withColumn("exploded_cols", explode($"row"))

which gives:

root
 |-- row: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- key: string (nullable = true)
 |    |    |-- value: string (nullable = true)
 |-- exploded_cols: struct (nullable = true)
 |    |-- key: string (nullable = true)
 |    |-- value: string (nullable = true)

The table I am aiming for is:

val df5 = df4.select("exploded_cols.*")

with the schema:

root
 |-- key: string (nullable = true)
 |-- value: string (nullable = true)

Main question: I would like the final table to also contain the id: decimal(38,0) (nullable = true) entry alongside the exploded key and value columns, for example:

root
 |-- id: decimal(38,0) (nullable = true)
 |-- key: string (nullable = true)
 |-- value: string (nullable = true)

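However, I am not sure how to call spark.read.option without selecting df2.select("text").as[String] on its own (see df3), which is the step that drops id. Is it possible to simplify this script?

This should be straightforward, so I am not sure a reproducible example is needed. Also, I come from an R background, so I lack the Scala basics, but I am trying to learn.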

Answer:

Use the from_xml function from the spark-xml library.

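import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.types.StructType
import com.databricks.spark.xml.functions.from_xml

val df: DataFrame = ???      // Read source data
val schema: StructType = ??? // Define schema of XML text
// from_xml takes a Column rather than a column name; $"..." needs spark.implicits._
df.withColumn("xmlData", from_xml($"xmlColName", schema))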
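
Applied to the question, this might look like the sketch below. It assumes spark-xml is on the classpath, and that text holds sibling row elements with key and value children; the sample data, the hand-written schema, and the names FromXmlKeepId and flatten are all hypothetical. Because from_xml works column by column, id stays in the DataFrame and no separate spark.read call on a Dataset[String] is needed. The root wrapper is kept here on the assumption that each string must parse as a single XML document.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, concat, explode, lit}
import org.apache.spark.sql.types.{ArrayType, StringType, StructField, StructType}
import com.databricks.spark.xml.functions.from_xml

object FromXmlKeepId {
  // Assumed layout of the wrapped text: <root> containing repeated <row> elements.
  val rowType: StructType = StructType(Seq(
    StructField("key", StringType),
    StructField("value", StringType)))
  val payloadSchema: StructType = StructType(Seq(StructField("row", ArrayType(rowType))))

  // Parse `text` in place, then explode the rows while carrying `id` along.
  def flatten(df: DataFrame): DataFrame =
    df.withColumn("text", concat(lit("<root>"), col("text"), lit("</root>")))
      .withColumn("parsed", from_xml(col("text"), payloadSchema))
      .select(col("id"), explode(col("parsed.row")).as("kv"))
      .select(col("id"), col("kv.key"), col("kv.value"))

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("from-xml-sketch").getOrCreate()
    import spark.implicits._

    // Hypothetical sample data standing in for the real df.
    val df = Seq(
      (BigDecimal(1), "<row><key>a</key><value>1</value></row><row><key>b</key><value>2</value></row>")
    ).toDF("id", "text").withColumn("id", col("id").cast("decimal(38,0)"))

    val result = flatten(df)
    result.printSchema()
    result.show()
    spark.stop()
  }
}
```

If the schema is not known up front, schema_of_xml from the question could presumably be used on the wrapped text to infer payloadSchema instead of writing it by hand.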
