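There are several good answers about managing duplicate columns that come out of joined DataFrames, e.g. (How to avoid duplicate columns after join?), but what if I am simply handed a DataFrame that already has duplicate columns and I have to deal with it? I have no control over the process that produces it.
What I have: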
val data = Seq((1,2),(3,4)).toDF("a","a")
data.show
+---+---+
|  a|  a|
+---+---+
|  1|  2|
|  3|  4|
+---+---+
What I want:
+---+---+
|  a|a_2|
+---+---+
|  1|  2|
|  3|  4|
+---+---+
withColumnRenamed("a","a_2") doesn't work, for the obvious reason that it matches columns by name and both columns are named a.
Answer from a uj5u.com user:
The simplest way I have found to do this is:
val data = Seq((1,2),(3,4)).toDF("a","a")
val deduped = data.toDF("a","a_2")
deduped.show
+---+---+
|  a|a_2|
+---+---+
|  1|  2|
|  3|  4|
+---+---+
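Calling toDF with the full list of names gets tedious once the DataFrame has more than a few columns. A minimal sketch of renaming a single column by position instead, working only on the list of names so that it needs no Spark session. The object RenameByPosition and the helper renameAt are hypothetical names introduced here for illustration, not part of Spark:

```scala
object RenameByPosition {
  // Hypothetical helper: returns a copy of `names` with the entry at
  // `index` replaced by `newName`. Every other name is left untouched,
  // so duplicates elsewhere in the list are not affected.
  // Throws IndexOutOfBoundsException if `index` is out of range.
  def renameAt(names: Seq[String], index: Int, newName: String): Seq[String] =
    names.updated(index, newName)
}
```

Under these assumptions it would be applied as data.toDF(RenameByPosition.renameAt(data.columns.toSeq, 1, "a_2"): _*), which assigns the new names positionally just like the explicit toDF("a","a_2") call.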
For a more general solution:
val data = Seq(
  (1,2,3,4,5,6,7,8),
  (9,0,1,2,3,4,5,6)
).toDF("a","b","c","a","d","b","e","b")
data.show
+---+---+---+---+---+---+---+---+
|  a|  b|  c|  a|  d|  b|  e|  b|
+---+---+---+---+---+---+---+---+
|  1|  2|  3|  4|  5|  6|  7|  8|
|  9|  0|  1|  2|  3|  4|  5|  6|
+---+---+---+---+---+---+---+---+
import org.apache.spark.sql.DataFrame
import scala.annotation.tailrec

def dedupeColumnNames(df: DataFrame): DataFrame = {
  // Walks the names from last to first; `columns` holds the names not yet processed.
  @tailrec
  def dedupe(fixed_columns: List[String], columns: List[String]): List[String] = {
    if (columns.isEmpty) fixed_columns
    else {
      // How many times the current name occurs at or before its original position.
      val count = columns.count(_ == columns.head)
      if (count == 1) dedupe(columns.head :: fixed_columns, columns.tail)
      else dedupe(s"${columns.head}_${count}" :: fixed_columns, columns.tail)
    }
  }

  // Reverse so that the head is the last column; prepending restores the original order.
  val new_columns = dedupe(List.empty[String], df.columns.reverse.toList).toArray
  df.toDF(new_columns: _*)
}

data
  .transform(dedupeColumnNames)
  .show
+---+---+---+---+---+---+---+---+
|  a|  b|  c|a_2|  d|b_2|  e|b_3|
+---+---+---+---+---+---+---+---+
|  1|  2|  3|  4|  5|  6|  7|  8|
|  9|  0|  1|  2|  3|  4|  5|  6|
+---+---+---+---+---+---+---+---+
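The same renaming rule can also be written as a single left-to-right pass that keeps a running count per name. The following is a sketch of that alternative, not the approach used in the answer above. It works only on the list of names, so it can be checked without a Spark session. DedupeSketch and dedupeNames are hypothetical names introduced for illustration:

```scala
object DedupeSketch {
  // Keeps the first occurrence of each name as-is and suffixes the
  // 2nd, 3rd, ... occurrence with _2, _3, ...
  def dedupeNames(names: Seq[String]): Seq[String] = {
    // `seen` maps each name to how many times it has appeared so far;
    // `acc` collects the fixed names in their original order.
    val (_, renamed) =
      names.foldLeft((Map.empty[String, Int], Vector.empty[String])) {
        case ((seen, acc), name) =>
          val n = seen.getOrElse(name, 0) + 1
          val fixed = if (n == 1) name else s"${name}_$n"
          (seen.updated(name, n), acc :+ fixed)
      }
    renamed
  }
}
```

Under these assumptions it would be applied as data.toDF(DedupeSketch.dedupeNames(data.columns.toSeq): _*). Note that neither this sketch nor the recursive version checks whether a generated name such as a_2 already exists as a column, so an input like a, a_2, a would still end up with a duplicate.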
