Spark DataFrame from JSON: swapping columns and rows

I have a given JSON, read from HDFS, with thousands of records that look like this:

    {
      "01": {
        "created": "2020-12-28 02-15-01", 
        "entity_id": "s.m_free", 
        "old_state_id": null, 
        "state": "1498.7"
      }, 
      "02": {
        "created": "2020-12-28 02-15-31", 
        "entity_id": "s.m_free", 
        "old_state_id": 58100, 
        "state": "1498.9"
      }, 
...}

Unfortunately, the DataFrame comes out with thousands of columns and only 4 rows, like this:

              |                 01 |                   02|..................| 
created       |2020-12-28 02-15-01 |  2020-12-28 02-15-31|..................|
entity_id     |           s.m_free |             s.m_free|..................|
old_state_id  |               null |                58100|..................|
state         |             1498.7 |               1498.9|..................|

I need it to have 4 columns and thousands of rows (one per record):

       |             created| entity_id| old_state_id|  state|
01     | 2020-12-28 02-15-01|  s.m_free|         null| 1498.7|
02     | 2020-12-28 02-15-31|  s.m_free|        58100| 1498.9|

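I found an option for PySpark that uses Pandas to change the orientation of the DataFrame, but since I have to do the task in Scala, I could not find a similar option.

Is there also a way to give a name to the first column (the record keys 01, 02, and so on)? It appears to be the key of the values in the JSON file.

I would be glad if you could help me.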

A uj5u.com user replied:

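This part simulates the generation of the original DataFrame.
As in this example, make sure you also use option("primitivesAsString",true) in your real scenario.
This works around a type mismatch caused by Spark falling back to string as the default type for null.
For example, without option("primitivesAsString",true), old_state_id would be inferred as long for "old_state_id": 58100, but as string for "old_state_id": null.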
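The mismatch described above can be checked directly. The following is a minimal sketch, assuming a local SparkSession (in spark-shell, `spark` already exists); the helper `oldStateIdType` and the shortened sample record are made up for illustration and are not part of the original answer.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.types.{DataType, StructType}

// Local session for illustration only (assumption); skip this in spark-shell.
val spark = SparkSession.builder().master("local[1]").appName("infer-demo").getOrCreate()
import spark.implicits._

// Shortened sample: only the field whose type differs between records.
val sample = """{"01": {"old_state_id": null}, "02": {"old_state_id": 58100}}"""

// Hypothetical helper: data type of old_state_id inside the struct column `key`.
def oldStateIdType(df: DataFrame, key: String): DataType =
  df.schema(key).dataType.asInstanceOf[StructType]("old_state_id").dataType

val inferred  = spark.read.json(Seq(sample).toDS)
val asStrings = spark.read.option("primitivesAsString", true).json(Seq(sample).toDS)

// Default inference: the two structs end up with different field types.
println(s"default:            01=${oldStateIdType(inferred, "01")}, 02=${oldStateIdType(inferred, "02")}")
// With primitivesAsString: both structs share the same field type.
println(s"primitivesAsString: 01=${oldStateIdType(asStrings, "01")}, 02=${oldStateIdType(asStrings, "02")}")
```

The structs need identical types because the transformation further down puts every column into the same output column.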

import spark.implicits._

val json_str = """
{
    "01": {
      "created": "2020-12-28 02-15-01", 
      "entity_id": "s.m_free", 
      "old_state_id": null, 
      "state": "1498.7"
    }, 
    "02": {
      "created": "2020-12-28 02-15-31", 
      "entity_id": "s.m_free", 
      "old_state_id": 58100, 
      "state": "1498.9"
    }
}"""

val df = spark.read.option("primitivesAsString",true).json(Seq(json_str).toDS)

df.printSchema()

root
 |-- 01: struct (nullable = true)
 |    |-- created: string (nullable = true)
 |    |-- entity_id: string (nullable = true)
 |    |-- old_state_id: string (nullable = true)
 |    |-- state: string (nullable = true)
 |-- 02: struct (nullable = true)
 |    |-- created: string (nullable = true)
 |    |-- entity_id: string (nullable = true)
 |    |-- old_state_id: string (nullable = true)
 |    |-- state: string (nullable = true)

df.show(false)

+---------------------------------------------+----------------------------------------------+
|01                                           |02                                            |
+---------------------------------------------+----------------------------------------------+
|{2020-12-28 02-15-01, s.m_free, null, 1498.7}|{2020-12-28 02-15-31, s.m_free, 58100, 1498.9}|
+---------------------------------------------+----------------------------------------------+
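The example above reads the JSON from an in-memory string. When reading from a file, as in the question, Spark expects one JSON document per line by default, so a pretty-printed file like the one shown would probably need the `multiLine` option as well. Below is a rough sketch under that assumption; the temporary file stands in for the real HDFS file, whose path is not given in the question.

```scala
import java.nio.charset.StandardCharsets
import java.nio.file.Files
import org.apache.spark.sql.SparkSession

// Local session for illustration only (assumption); skip this in spark-shell.
val spark = SparkSession.builder().master("local[1]").appName("multiline-demo").getOrCreate()

// Stand-in for the real input: a JSON object spread over several lines.
val tmp = Files.createTempFile("records", ".json")
Files.write(tmp, """{
  "01": {"created": "2020-12-28 02-15-01", "old_state_id": null},
  "02": {"created": "2020-12-28 02-15-31", "old_state_id": 58100}
}""".getBytes(StandardCharsets.UTF_8))

val dfFromFile = spark.read
  .option("multiLine", true)          // the whole file is a single JSON document
  .option("primitivesAsString", true) // same option as in the answer
  .json(tmp.toUri.toString)           // on a cluster this would be the file's hdfs:// URI instead

dfFromFile.printSchema()
```

If the file already stores the whole object on a single line, `multiLine` is not required.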

This is the data transformation part, based on stack.

df.createOrReplaceTempView("t")
// Number of (name, value) pairs passed to stack: one per top-level column (2 in this example).
val cols_num = df.columns.length
// Pair each column name (as a string literal) with the backtick-quoted column: "'01',`01`,'02',`02`"
val cols_names_and_vals = df.columns.map(c => s"'$c',`$c`").mkString(",")
// Resulting query: select id,val.* from (select stack(2,'01',`01`,'02',`02`) as (id,val) from t)
val sql_query = s"select id,val.* from (select stack($cols_num,$cols_names_and_vals) as (id,val) from t)"
val df_unpivot = spark.sql(sql_query)

df_unpivot.printSchema()

root
 |-- id: string (nullable = true)
 |-- created: string (nullable = true)
 |-- entity_id: string (nullable = true)
 |-- old_state_id: string (nullable = true)
 |-- state: string (nullable = true)
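Because of `primitivesAsString`, every column in this schema is a string. If typed columns are needed afterwards, they can be cast once the data is in the long layout. The sketch below uses a small hand-built stand-in for `df_unpivot`; the target types (timestamp, long, double) are assumptions based on the sample values, not something the question specifies.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, to_timestamp}

// Local session for illustration only (assumption); skip this in spark-shell.
val spark = SparkSession.builder().master("local[1]").appName("cast-demo").getOrCreate()
import spark.implicits._

// Stand-in for df_unpivot with the all-string schema shown above.
val unpivoted = Seq[(String, String, String, Option[String], String)](
  ("01", "2020-12-28 02-15-01", "s.m_free", None, "1498.7"),
  ("02", "2020-12-28 02-15-31", "s.m_free", Some("58100"), "1498.9")
).toDF("id", "created", "entity_id", "old_state_id", "state")

val typed = unpivoted
  // The timestamps use '-' between the time fields, so the pattern is given explicitly.
  .withColumn("created", to_timestamp(col("created"), "yyyy-MM-dd HH-mm-ss"))
  .withColumn("old_state_id", col("old_state_id").cast("long"))
  .withColumn("state", col("state").cast("double"))

typed.printSchema()
```

Null values stay null after the cast, so the record without an `old_state_id` is preserved.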

df_unpivot.show(truncate = false)

+---+-------------------+---------+------------+------+
|id |created            |entity_id|old_state_id|state |
+---+-------------------+---------+------------+------+
|01 |2020-12-28 02-15-01|s.m_free |null        |1498.7|
|02 |2020-12-28 02-15-31|s.m_free |58100       |1498.9|
+---+-------------------+---------+------------+------+
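The same reshaping can also be written with the DataFrame API instead of a SQL string. This is an alternative sketch, not the method from the answer above: it wraps each column in an (id, val) struct, collects the structs into an array, and explodes that array into rows. The names `wide`, `pairs`, and `unpivotedDf` are made up for illustration, and a local SparkSession is assumed.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{array, col, explode, lit, struct}

// Local session for illustration only (assumption); skip this in spark-shell.
val spark = SparkSession.builder().master("local[1]").appName("unpivot-demo").getOrCreate()
import spark.implicits._

val json = """{
  "01": {"created": "2020-12-28 02-15-01", "entity_id": "s.m_free", "old_state_id": null, "state": "1498.7"},
  "02": {"created": "2020-12-28 02-15-31", "entity_id": "s.m_free", "old_state_id": 58100, "state": "1498.9"}
}"""
val wide = spark.read.option("primitivesAsString", true).json(Seq(json).toDS)

// One (id, val) struct per original column; backticks guard unusual column names.
val pairs = wide.columns.map(c => struct(lit(c).as("id"), col(s"`$c`").as("val")))

val unpivotedDf = wide
  .select(explode(array(pairs: _*)).as("kv"))             // one row per original column
  .select(col("kv.id").as("id"), col("kv.val").as("val")) // pull the pair out of the struct
  .select("id", "val.*")                                  // expand the record's fields into columns

unpivotedDf.show(truncate = false)
```

As with `stack`, this relies on every record struct having the same fields and types, which is what `primitivesAsString` helps ensure here.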

A uj5u.com user replied:

I did not even expect such a good explanation. Thank you very much, it really helped me and made me understand how it works.
