Reading a multiline JSON file in Spark produces a single row

I have a JSON file that looks like this:

{
  "249": "\"Other\"",
  "63": "\"Billing\"",
  "67": "\"Handset\"",
  "72": "\"Your plan\"",
  "71": "\"Customer services\"",
  "69": "\"Network coverage\"",
  "68": "\"International roaming\"",
  "770": "\"Purchases\"",
  "70": "\"Expectations not being met\"",
  "65": "\"Fraud\""
}

I am reading this file with spark.read and the multiline option:

val df = sqlContext.read.option("multiline","true").json("file:///category_names.json")

The resulting DataFrame is:

+---------------------------+---------+-------+---------+-----------------------+------------------+----------------------------+-------------------+-----------+-----------+
|249                        |63       |65     |67       |68                     |69                |70                          |71                 |72         |770        |
+---------------------------+---------+-------+---------+-----------------------+------------------+----------------------------+-------------------+-----------+-----------+
|"Other (none of the above)"|"Billing"|"Fraud"|"Handset"|"International roaming"|"Network coverage"|"Expectations not being met"|"Customer services"|"Your plan"|"Purchases"|
+---------------------------+---------+-------+---------+-----------------------+------------------+----------------------------+-------------------+-----------+-----------+

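I want to join this DataFrame with another DataFrame in which these column names are the primary key. I would like the output in the following format: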

CategroryID CategoryName
249           "Other"
63            "Billing"

Is there a Spark way to do this? I could pivot the DataFrame, but I am looking for a better approach.

Reply from a uj5u.com user:

Use the stack function to unpivot the DataFrame. You can generate the stack expression dynamically from the list of column names:

val stackExpr = s"stack(${df.columns.size}, " + df.columns
  .flatMap(c => Seq(c, s"`$c`"))
  .mkString(",") + ") as (CategroryID, CategoryName)"

//stackExpr: String = stack(10, 249,`249`,63,`63`,65,`65`,67,`67`,68,`68`,69,`69`,70,`70`,71,`71`,72,`72`,770,`770`) as (CategroryID, CategoryName)

val df1 = df.selectExpr(stackExpr)

df1.show()

//+-----------+--------------------+
//|CategroryID|        CategoryName|
//+-----------+--------------------+
//|        249|             "Other"|
//|         63|           "Billing"|
//|         65|             "Fraud"|
//|         67|           "Handset"|
//|         68|"International ro...|
//|         69|  "Network coverage"|
//|         70|"Expectations not...|
//|         71| "Customer services"|
//|         72|         "Your plan"|
//|        770|         "Purchases"|
//+-----------+--------------------+
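
One caveat about the expression above: the column names are emitted as bare tokens, which only parses here because they happen to be numeric (so the ID column comes out as an integer). Below is a minimal sketch, not part of the original answer, that quotes each name as a SQL string literal instead, so it should also cope with non-numeric column names. The object and helper names (StackExprSketch, sqlString, sqlIdent, buildStackExpr) are made up for illustration, and the escaping assumes Spark's default string-literal parsing.

```scala
object StackExprSketch {
  // Quote a column name as a SQL string literal (this becomes the ID value).
  // Assumption: backslash escapes are enabled, which is Spark's default.
  def sqlString(name: String): String =
    "'" + name.replace("\\", "\\\\").replace("'", "\\'") + "'"

  // Quote a column name as a backticked identifier (this is the column reference).
  def sqlIdent(name: String): String =
    "`" + name.replace("`", "``") + "`"

  // Build "stack(n, 'c1', `c1`, 'c2', `c2`, ...) as (idName, valueName)".
  def buildStackExpr(columns: Seq[String], idName: String, valueName: String): String = {
    val pairs = columns.map(c => sqlString(c) + ", " + sqlIdent(c)).mkString(", ")
    "stack(" + columns.size + ", " + pairs + ") as (" + idName + ", " + valueName + ")"
  }

  def main(args: Array[String]): Unit =
    println(buildStackExpr(Seq("249", "63"), "CategroryID", "CategoryName"))
}
```

With a DataFrame at hand this would be used as df.selectExpr(StackExprSketch.buildStackExpr(df.columns.toSeq, "CategroryID", "CategoryName")). Note that the ID column is then a string rather than an integer, which matters for the join key type.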

Another approach is to build a map column from each row and then explode it:

import org.apache.spark.sql.functions.{col, explode, lit, map}

val mapExpr = map(df.columns.flatMap(c => Seq(lit(c), col(c))):_*)
val df1 = df.select(explode(mapExpr).as(Seq("CategroryID", "CategoryName")))
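
To connect this back to the join mentioned in the question, here is a rough sketch under stated assumptions. The second DataFrame (tickets, with columns ticket_id and category_id) is invented for illustration, as are the object name JoinSketch and the helper unpivot; the wide DataFrame is a small stand-in for the one read from the JSON file. With the map approach the IDs are strings, so the join key on the other side is assumed to be a string as well.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{explode, lit, map}

object JoinSketch {
  // Turn a wide one-row DataFrame into (CategroryID, CategoryName) rows
  // by building a map of column name -> value and exploding it.
  def unpivot(wide: DataFrame): DataFrame = {
    val keyValues = wide.columns.flatMap(c => Seq(lit(c), wide(c)))
    wide.select(explode(map(keyValues: _*)).as(Seq("CategroryID", "CategoryName")))
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("join-sketch").getOrCreate()
    import spark.implicits._

    // Stand-in for the DataFrame read from category_names.json.
    val wide = Seq(("\"Other\"", "\"Billing\"")).toDF("249", "63")
    val categories = unpivot(wide)

    // Hypothetical DataFrame whose category_id refers to the former column names.
    val tickets = Seq((1, "63"), (2, "249")).toDF("ticket_id", "category_id")

    val joined = tickets.join(categories, tickets("category_id") === categories("CategroryID"), "left")
    joined.show(false)

    spark.stop()
  }
}
```

If the other DataFrame stores the key as an integer, casting one side before the join (for example with cast("int")) would be needed.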

Reply from a uj5u.com user:

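For academic purposes, and to refresh my skills every now and then, here is an alternative. You will need to rename the columns and so on, because I simply used my own data, assumed the DataFrame already exists, and did not deal with the JSON: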

import org.apache.spark.sql.functions._

val df   = sqlContext.createDataFrame(Seq(("xxx", "yyy", "zzz"))).toDF("v1", "v2", "v3")
val cols = df.columns
val df2  = df.withColumn("arrayColNames", array(cols.map(lit):_*))
             .withColumn("arrayColVals",  array(cols.map(df(_)):_*))

val df3 = df2.withColumn("arrayNamesVals", arrays_zip(col("arrayColNames"), col("arrayColVals")))
val df4 = df3.withColumn("aNV", explode(col("arrayNamesVals")))
val df5 = df4.select(col("aNV.*"))
df5.show(false)

This returns:

+-------------+------------+
|arrayColNames|arrayColVals|
+-------------+------------+
|v1           |xxx         |
|v2           |yyy         |
|v3           |zzz         |
+-------------+------------+

Please credit the source when reposting. Link to this article: https://www.uj5u.com/ruanti/420953.html
