I'm new to both Scala and PySpark, and I have to convert this piece of Scala code to PySpark. Can someone help me understand the Scala syntax so that I can convert it?
val df = spark.read.parquet(s"$basePath/dod_m/")
  .select(df2.map(x => col(x._1).as(x._2)).toList :_*)
Answer:
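Most likely, df2 here is a plain Scala collection.
If it were a DataFrame, df2.map(x => col(x._1).as(x._2)) would fail with error: value _1 is not a member of org.apache.spark.sql.Row. The map function on a DataFrame lets you process Row objects, not tuples.
If it were a Dataset, of (String, String) for example, df2.map(x => col(x._1).as(x._2)) would fail with error: Unable to find encoder for type org.apache.spark.sql.Column. If you defined such an encoder, you would then get error: value toList is not a member of org.apache.spark.sql.Dataset[org.apache.spark.sql.Column], which is fairly clear.
An RDD does not have a toList method either.
So let's assume df2 is a collection of (String, String) tuples, such as a Seq. In that case df2.map(x => col(x._1).as(x._2)).toList is about renaming columns: the old name is the first element of each tuple and the new name is the second.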
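To make the tuple handling concrete, here is a minimal sketch in plain Python with no Spark involved. The helper name rename_exprs is hypothetical, and the strings it returns only imitate the way Spark prints aliased columns. The point is that Scala's x._1 and x._2 correspond to x[0] and x[1] in Python.

```python
# Plain-Python sketch (no Spark needed) of what the map over df2 does.
# rename_exprs is a hypothetical helper, not part of any Spark API.

def rename_exprs(pairs):
    """Turn (old_name, new_name) tuples into 'old AS new' strings."""
    # Scala's x._1 / x._2 are the first / second tuple elements,
    # i.e. x[0] / x[1] in Python.
    return [f"{x[0]} AS {x[1]}" for x in pairs]


df2 = [("a", "b"), ("c", "d")]
print(rename_exprs(df2))  # ['a AS b', 'c AS d']
```

In the real code, col(x._1).as(x._2) builds a Column object rather than a string, but the pairing of old and new names is the same.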
An example in Scala:
val df2 = Seq(("a", "b"), ("c", "d"))
val df = Seq((1, 2), (4, 5)).toDF("a", "c")
// running this in a shell, we see that it is about renaming columns
df2.map(x => col(x._1).as(x._2)).toList
//res2: List[org.apache.spark.sql.Column] = List(a AS b, c AS d)
Let's try it:
df.show
+---+---+
|  a|  c|
+---+---+
|  1|  2|
|  4|  5|
+---+---+
df.select(df2.map(x => col(x._1).as(x._2)).toList :_*).show
+---+---+
|  b|  d|
+---+---+
|  1|  2|
|  4|  5|
+---+---+
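In Python:
import pyspark.sql.functions as f
# df2 holds (old_name, new_name) pairs, as in the Scala example
df2 = [("a", "b"), ("c", "d")]
df = spark.createDataFrame([(1, 2), (4, 5)], ['a', 'c'])
# x[0] / x[1] play the role of Scala's x._1 / x._2
df.select([f.col(x[0]).alias(x[1]) for x in df2]).show()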
+---+---+
|  b|  d|
+---+---+
|  1|  2|
|  4|  5|
+---+---+
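One piece of Scala syntax not spelled out above is the trailing :_* in the original snippet. It tells the compiler to pass the list to select as individual varargs arguments instead of as a single list. Python's closest equivalent is * unpacking. The sketch below uses a hypothetical stand-in function, fake_select, rather than the real DataFrame.select, purely to show the difference in call shape.

```python
# fake_select is a hypothetical stand-in for a varargs method such as select;
# it simply returns the tuple of arguments it received.
def fake_select(*cols):
    return cols


exprs = ["a AS b", "c AS d"]

# Scala: select(exprs: _*)   ->   Python: fake_select(*exprs)
unpacked = fake_select(*exprs)  # two separate arguments
as_one = fake_select(exprs)     # one argument that happens to be a list

print(len(unpacked))  # 2
print(len(as_one))    # 1
```

With a real PySpark DataFrame, select accepts either form, so df.select(*cols) and df.select(cols) both work, which is why the Python version above can pass the list directly.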
