請教大家個問題:
現在有個DF是這樣的 :
case class Person(id:Long,name:String,age:Integer,job:String,rn:Long)
val df = spark.createDataFrame(List(Person(1,"Jason",34,null,1),
Person(1,"Jason1",null,"Dev",2),
Person(1,null,28,"DBA",3),
Person(2,"Tom",20,null,1),
Person(2,"Tom1",null,"Cooker",2)));
現在的輸出是這樣的:
+---+------+----+------+---+
| id| name| age| job| rn|
+---+------+----+------+---+
| 1| Jason| 34| null| 1|
| 1|Jason1|null| Dev| 2|
| 1| null| 28| DBA| 3|
| 2| Tom| 20| null| 1|
| 2| Tom1|null|Cooker| 2|
+---+------+----+------+---+
需求是 按照ID分組去重, 按照 RN作為 排序, 合并資料,下一個欄位是NULL的話,則保留上個欄位的值
期待輸出結果是:
+---+------+----+------+---+
| id| name| age| job| rn|
+---+------+----+------+---+
| 1|Jason1|28| DBA| 3|
| 2| Tom1|20|Cooker| 3|
+---+------+----+------+---+
幫忙看看 怎么這個DF怎么轉換 ? 謝謝 !
uj5u.com熱心網友回復:
用 Dataset你也可以用Windws Function. 下面是用 Spark Dataset GroupByKey + reduceGroups
scala> spark.version
res18: String = 2.4.3
scala> val ds = Seq(Person(1,"Jason",34,null,1),Person(1,"Jason1",null,"Dev",2),Person(1,null,28,"DBA",3),Person(2,"Tom",20,null,1),Person(2,"Tom1",null,"Cooker",2)).toDS
ds: org.apache.spark.sql.Dataset[Person] = [id: bigint, name: string ... 3 more fields]
scala> ds.show(false)
+---+------+----+------+---+
|id |name |age |job |rn |
+---+------+----+------+---+
|1 |Jason |34 |null |1 |
|1 |Jason1|null|Dev |2 |
|1 |null |28 |DBA |3 |
|2 |Tom |20 |null |1 |
|2 |Tom1 |null|Cooker|2 |
+---+------+----+------+---+
scala> ds.groupByKey(p=>p.id).reduceGroups((p1,p2) => if (p1.rn <= p2.rn) Person( id = p2.id, name = if (p2.name == null) p1.name else p2.name, age = if (p2.age == null) p1.age else p2.age, job = if (p2.job == null) p1.job else p2.job, p2.rn) else Person( id = p1.id, name = if (p1.name == null) p2.name else p1.name, age = if (p1.age == null) p2.age else p1.age, job = if (p1.job == null) p2.job else p1.job, p1.rn)).map(_._2).show(false)
+---+------+---+------+---+
|id |name |age|job |rn |
+---+------+---+------+---+
|1 |Jason1|28 |DBA |3 |
|2 |Tom1 |20 |Cooker|2 |
+---+------+---+------+---+
uj5u.com熱心網友回復:
厲害了! 大神!轉載請註明出處,本文鏈接:https://www.uj5u.com/qita/15667.html
標籤:Spark
上一篇:cloudsim仿真粒子群演算法
