Extracting an array from a string containing a list of JSON objects with Spark

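A column in my DataFrame contains a JSON list, but its type is String. I need to run explode on this column, so I first have to convert it into a list. I could not find many references for this use case.

Sample data: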

columnName: "[{"name":"a","info":{"age":"1","grade":"b"},"other":7},{"random":"x"}, {...}]"

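That is what the data looks like. The fields are not fixed (index 0 may contain a JSON object with some fields, while index 1 contains an object with different fields). The list can also contain more nested JSON or some extra fields. I am currently using this:

"""explode(split(regexp_replace(regexp_replace(colName, '(\\\},)','}},'), '(\\\[|\\\])',''), "},")) as colName"""

I am just replacing "}," with "}},", then removing "[" and "]", and then calling split on "},". This approach does not work because the JSON is nested.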
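
To make the failure concrete, here is a minimal plain-Scala sketch (no Spark needed) that applies the same three steps to the sample value. The names `raw` and `pieces` are only for illustration, and Java regex syntax is assumed to behave like Spark's `regexp_replace` and `split` for these simple patterns.

```scala
// Plain Scala, no Spark required: replay the replace-and-split steps
// on the sample value to see where the cuts land.
val raw = """[{"name":"a","info":{"age":"1","grade":"b"},"other":7},{"random":"x"}]"""

val pieces: Array[String] = raw
  .replaceAll("""\},""", "}},")   // turn every "}," into "}},"
  .replaceAll("""\[|\]""", "")    // drop "[" and "]"
  .split("""\},""")               // split on "},"

pieces.foreach(println)
// {"name":"a","info":{"age":"1","grade":"b"}
// "other":7}
// {"random":"x"}
```

The closing brace of the nested `info` object is also followed by a comma, so the first element is cut in two and neither half is valid JSON.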

How can I extract the array from the string?

Answer:

You can try it like this:

// Initial DataFrame

df.show(false)

+----------------------------------------------------------------------+
|columnName                                                            |
+----------------------------------------------------------------------+
|[{"name":"a","info":{"age":"1","grade":"b"},"other":7},{"random":"x"}]|
+----------------------------------------------------------------------+

df.printSchema()

root
 |-- columnName: string (nullable = true)
 
// toArray is a user defined function that parses a string holding an array of JSON objects
// and returns each object as its own JSON string

import org.apache.spark.sql.functions.{col, explode, from_json, udf}
import org.json.JSONArray

val toArray = udf { (data: String) =>
    val jsonArray = new JSONArray(data)
    var arr: Array[String] = Array()
    val objects = (0 until jsonArray.length).map(x => jsonArray.getJSONObject(x))
    // Append each JSON object, serialized back to a string
    objects.foreach { elem =>
        arr :+= elem.toString
    }
    arr
}

// Using the udf and exploding the resultant array

val df1 = df.withColumn("columnName",explode(toArray(col("columnName"))))

df1.show(false)

+-----------------------------------------------------+
|columnName                                           |
+-----------------------------------------------------+
|{"other":7,"name":"a","info":{"grade":"b","age":"1"}}|
|{"random":"x"}                                       |
+-----------------------------------------------------+

df1.printSchema()

root
 |-- columnName: string (nullable = true)
 
// Parsing the json string by obtaining the schema dynamically

val schema = spark.read.json(df1.select("columnName").rdd.map(x => x(0).toString)).schema
val df2 = df1.withColumn("columnName",from_json(col("columnName"),schema))

df2.show(false)

+---------------+
|columnName     |
+---------------+
|[[1, b], a, 7,]|
|[,,, x]        |
+---------------+

df2.printSchema()

root
 |-- columnName: struct (nullable = true)
 |    |-- info: struct (nullable = true)
 |    |    |-- age: string (nullable = true)
 |    |    |-- grade: string (nullable = true)
 |    |-- name: string (nullable = true)
 |    |-- other: long (nullable = true)
 |    |-- random: string (nullable = true)
 
// Extracting all the fields from the json

df2.select(col("columnName.*")).show(false)

+------+----+-----+------+
|info  |name|other|random|
+------+----+-----+------+
|[1, b]|a   |7    |null  |
|null  |null|null |x     |
+------+----+-----+------+
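
As a possible alternative to the toArray UDF, recent Spark versions (2.4 or later, as far as I know) appear to accept an array of strings as the target type of `from_json`, which keeps every array element as a raw JSON string. The sketch below is an assumption about your environment, not something from the original answer: the local `SparkSession` and the names `df` and `exploded` are made up for the example.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, explode, from_json}
import org.apache.spark.sql.types.{ArrayType, StringType}

// Local session for illustration only; in spark-shell `spark` already exists.
val spark = SparkSession.builder()
  .master("local[1]")
  .appName("json-array-sketch")
  .getOrCreate()
import spark.implicits._

// Hypothetical one-row DataFrame mirroring the sample data.
val df = Seq(
  """[{"name":"a","info":{"age":"1","grade":"b"},"other":7},{"random":"x"}]"""
).toDF("columnName")

// Parse the string as an array whose elements stay raw JSON strings,
// then explode it into one row per element.
val exploded = df.withColumn(
  "columnName",
  explode(from_json(col("columnName"), ArrayType(StringType)))
)

exploded.show(false)
```

If this works on your version, the rest of the steps (inferring the schema and calling `from_json` again per row) should apply unchanged, since the column is still a string per element.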

Edit:

If you are able to use the get_json_object function, you can try it this way:

// Get the list of columns dynamically

val columns = spark.read.json(df1.select("columnName").rdd.map(x => x(0).toString)).columns

// define an empty array of Column type and get_json_object function to extract the columns

import org.apache.spark.sql.Column
import org.apache.spark.sql.functions.{col, get_json_object}

var extract_columns: Array[Column] = Array()
columns.foreach { column =>
    extract_columns :+= get_json_object(col("columnName"), "$." + column).as(column)
}

df1.select(extract_columns: _*).show(false)

+-----------------------+----+-----+------+
|info                   |name|other|random|
+-----------------------+----+-----+------+
|{"grade":"b","age":"1"}|a   |7    |null  |
|null                   |null|null |x     |
+-----------------------+----+-----+------+

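Note that the info column is not a struct type here; it is still a JSON string. You may have to follow a similar approach to extract the columns of the nested JSON.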
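
For the nested case, one option is to extend the JSON path passed to `get_json_object`. The following is a rough sketch under assumptions: the local `SparkSession`, the stand-in DataFrame `jsonRows` (playing the role of df1) and the aliases `info_age` and `info_grade` are invented for the example.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, get_json_object}

// Local session for illustration only; in spark-shell `spark` already exists.
val spark = SparkSession.builder()
  .master("local[1]")
  .appName("nested-json-sketch")
  .getOrCreate()
import spark.implicits._

// Stand-in for df1: one JSON object (as a string) per row.
val jsonRows = Seq(
  """{"other":7,"name":"a","info":{"grade":"b","age":"1"}}""",
  """{"random":"x"}"""
).toDF("columnName")

// Nested fields are addressed by extending the JSON path;
// rows where the path does not exist yield null.
val nested = jsonRows.select(
  get_json_object(col("columnName"), "$.info.age").as("info_age"),
  get_json_object(col("columnName"), "$.info.grade").as("info_grade")
)

nested.show(false)
```

Keep in mind that `get_json_object` returns strings, so numeric fields would still need a cast if you want typed columns.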

