I have a JSON file like the following:
{
"249": "\"Other\"",
"63": "\"Billing\"",
"67": "\"Handset\"",
"72": "\"Your plan\"",
"71": "\"Customer services\"",
"69": "\"Network coverage\"",
"68": "\"International roaming\"",
"770": "\"Purchases\"",
"70": "\"Expectations not being met\"",
"65": "\"Fraud\""
}
I am reading this file with spark.read, using the multiline option:
val df = sqlContext.read.option("multiline","true").json("file:///category_names.json")
The resulting DataFrame is:
+---------------------------+---------+-------+---------+-----------------------+------------------+----------------------------+-------------------+-----------+-----------+
|249                        |63       |65     |67       |68                     |69                |70                          |71                 |72         |770        |
+---------------------------+---------+-------+---------+-----------------------+------------------+----------------------------+-------------------+-----------+-----------+
|"Other (none of the above)"|"Billing"|"Fraud"|"Handset"|"International roaming"|"Network coverage"|"Expectations not being met"|"Customer services"|"Your plan"|"Purchases"|
+---------------------------+---------+-------+---------+-----------------------+------------------+----------------------------+-------------------+-----------+-----------+
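I want to join this DataFrame with another DataFrame in which these column names are the primary key. I want the output in the following format: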
CategroryID CategoryName
249 "Other"
63 "Billing"
Is there a Spark way to do this? I could pivot the DataFrame, but I am looking for a better approach.
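A uj5u.com user replied:
Use the stack function to unpivot the DataFrame. You can generate the stack expression dynamically from the list of column names: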
// For each column, emit the pair (name as a literal, backtick-quoted column reference)
val stackExpr = s"stack(${df.columns.size}," + df.columns
  .flatMap(c => Seq(c, s"`$c`"))
  .mkString(",") + ") as (CategroryID, CategoryName)"
//stackExpr: String = stack(10,249,`249`,63,`63`,65,`65`,67,`67`,68,`68`,69,`69`,70,`70`,71,`71`,72,`72`,770,`770`) as (CategroryID, CategoryName)
val df1 = df.selectExpr(stackExpr)
df1.show()
//+-----------+--------------------+
//|CategroryID|        CategoryName|
//+-----------+--------------------+
//|        249|             "Other"|
//|         63|           "Billing"|
//|         65|             "Fraud"|
//|         67|           "Handset"|
//|         68|"International ro...|
//|         69|  "Network coverage"|
//|         70|"Expectations not...|
//|         71| "Customer services"|
//|         72|         "Your plan"|
//|        770|         "Purchases"|
//+-----------+--------------------+
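To connect this back to the join mentioned in the question, here is a minimal sketch. It assumes a hypothetical second DataFrame, `tickets`, keyed by `CategroryID`, and uses a two-column stand-in (`wide`) for the frame read from the JSON file; neither name comes from the original post. Because the IDs are written into the stack expression as bare numbers, `CategroryID` should come out as an integer column, which lines up with an integer key on the other side.

```scala
import org.apache.spark.sql.SparkSession

// Local session for illustration only; in spark-shell this returns the existing session.
val spark = SparkSession.builder().master("local[1]").appName("stack-join-sketch").getOrCreate()

// Hypothetical stand-in for the wide frame read from category_names.json (two columns only).
val wide = spark.createDataFrame(Seq(("\"Other\"", "\"Billing\""))).toDF("249", "63")

// Same dynamic stack expression as in the answer above.
val stackExpr = s"stack(${wide.columns.length}," +
  wide.columns.flatMap(c => Seq(c, s"`$c`")).mkString(",") +
  ") as (CategroryID, CategoryName)"
val categories = wide.selectExpr(stackExpr)

// Hypothetical second frame whose rows reference a category by id.
val tickets = spark.createDataFrame(Seq((1, 63), (2, 249), (3, 63))).toDF("TicketID", "CategroryID")

// Left join so that tickets with an unknown category id are kept.
val joined = tickets.join(categories, Seq("CategroryID"), "left")
joined.orderBy("TicketID").show(false)
```

A left join is used here only so that unmatched rows stay visible; an inner join would work the same way if every id is known to exist.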
Another approach is to create a map column from each row and then explode it:
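import org.apache.spark.sql.functions.{col, explode, lit, map}
// Build one map(columnName -> value) per row, then explode it into key/value rows
val mapExpr = map(df.columns.flatMap(c => Seq(lit(c), col(c))): _*)
val df1 = df.select(explode(mapExpr).as(Seq("CategroryID", "CategoryName")))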
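One difference worth checking with the map approach: the keys are built with lit(c), so `CategroryID` should come out as a string column rather than an integer one. The sketch below assumes a hypothetical two-column stand-in named `wide` and assumes the key on the other side of the join is numeric, which the question does not state.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, explode, lit, map}

// Local session for illustration only; in spark-shell this returns the existing session.
val spark = SparkSession.builder().master("local[1]").appName("map-explode-sketch").getOrCreate()

// Hypothetical stand-in for the wide frame from the question (two columns only).
val wide = spark.createDataFrame(Seq(("\"Other\"", "\"Billing\""))).toDF("249", "63")

// Build map(name -> value) per row; backticks keep the numeric column names unambiguous.
val mapExpr = map(wide.columns.flatMap(c => Seq(lit(c), col(s"`$c`"))): _*)
val longDf = wide.select(explode(mapExpr).as(Seq("CategroryID", "CategoryName")))

// The exploded keys are strings; cast them if the join key on the other side is numeric.
val typed = longDf.withColumn("CategroryID", col("CategroryID").cast("int"))
typed.printSchema()
```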
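A uj5u.com user replied:
For academic purposes, and to keep my own skills fresh, here is an alternative. You will need to rename the columns and so on, since I used my own names, assumed the DataFrame already exists, and did not deal with the JSON: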
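import org.apache.spark.sql.functions._
// Sample DataFrame standing in for the real one
val df = sqlContext.createDataFrame(Seq(("xxx", "yyy", "zzz"))).toDF("v1", "v2", "v3")
val cols = df.columns
// One array holding the column names and one holding the matching values
val df2 = df.withColumn("arrayColNames", array(cols.map(lit): _*))
  .withColumn("arrayColVals", array(cols.map(df(_)): _*))
// Zip the two arrays into a single array of (name, value) structs
val df3 = df2.withColumn("arrayNamesVals", arrays_zip(col("arrayColNames"), col("arrayColVals")))
// One row per struct, then flatten the struct fields into top-level columns
val df4 = df3.withColumn("aNV", explode(col("arrayNamesVals")))
val df5 = df4.select(col("aNV.*"))
df5.show(false)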
This returns:
+-------------+------------+
|arrayColNames|arrayColVals|
+-------------+------------+
|v1           |xxx         |
|v2           |yyy         |
|v3           |zzz         |
+-------------+------------+
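The renaming that this answer leaves to the reader could look roughly like the following. It is a sketch only: `wide` is a hypothetical two-column stand-in for the frame from the question, and the final toDF call simply renames the two output columns by position to the names the question asks for.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{array, arrays_zip, col, explode, lit}

// Local session for illustration only; in spark-shell this returns the existing session.
val spark = SparkSession.builder().master("local[1]").appName("arrays-zip-sketch").getOrCreate()

// Hypothetical stand-in for the wide frame from the question (two columns only).
val wide = spark.createDataFrame(Seq(("\"Other\"", "\"Billing\""))).toDF("249", "63")
val names = wide.columns

val longDf = wide
  .withColumn("arrayColNames", array(names.map(lit): _*))
  .withColumn("arrayColVals", array(names.map(c => col(s"`$c`")): _*))
  // Zip names with values and produce one row per (name, value) struct.
  .withColumn("aNV", explode(arrays_zip(col("arrayColNames"), col("arrayColVals"))))
  .select(col("aNV.*"))
  // Positional rename to the column names used in the question.
  .toDF("CategroryID", "CategoryName")

longDf.show(false)
```

As with the map approach, the IDs here are strings, so a cast may be needed before joining on a numeric key.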
