I am looking for a way to use group by together with aggregate functions in PySpark. My DataFrame looks like this:
df = sc.parallelize([
('23-09-2020', 'CRICKET'),
('25-11-2020', 'CRICKET'),
('13-09-2021', 'FOOTBALL'),
('20-11-2021', 'BASKETBALL'),
('12-12-2021', 'FOOTBALL')]).toDF(['DATE', 'SPORTS_INTERESTED'])
I want to group by the SPORTS_INTERESTED column and select the minimum date from the DATE column. Below is the query I am using:
import pyspark.sql.functions as F
df = df.groupby('SPORTS_INTERESTED').agg(F.count('SPORTS_INTERESTED').alias('FIRST_COUNT'), F.min('DATE').alias('MIN_OF_DATE_COLUMN')).filter(F.col('FIRST_COUNT') > 1)
But when I run the query above, it returns the MAX date instead of the MIN date in the output, and I don't know why.
Desired output:
+-----------------+------------------+
|SPORTS_INTERESTED|MIN_OF_DATE_COLUMN|
+-----------------+------------------+
|CRICKET          |23-09-2020        |
|FOOTBALL         |13-09-2021        |
+-----------------+------------------+
Output I am getting:
+-----------------+------------------+
|SPORTS_INTERESTED|MIN_OF_DATE_COLUMN|
+-----------------+------------------+
|CRICKET          |25-11-2020        |
|FOOTBALL         |12-12-2021        |
+-----------------+------------------+
Both columns are of string data type.
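The string type is the likely culprit: strings are ordered character by character, so dd-MM-yyyy values sort by day first rather than chronologically. A minimal sketch of that effect, using the two FOOTBALL dates from the question and plain Python's standard datetime module (not PySpark) purely for illustration:

```python
from datetime import datetime

# The two FOOTBALL dates from the question, stored as dd-MM-yyyy strings.
football = ['13-09-2021', '12-12-2021']

# Strings compare character by character, so the day field decides first.
string_min = min(football)

# Parsing into real dates makes the comparison chronological.
parsed_min = min(datetime.strptime(d, '%d-%m-%Y') for d in football)

print(string_min)                       # 12-12-2021
print(parsed_min.strftime('%d-%m-%Y'))  # 13-09-2021
```

The string minimum picks 12 December because "12" sorts before "13", even though 13 September is the earlier date.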
Answer:
First, convert the strings to dates, then apply min:
import pyspark.sql.functions as F
df = spark.createDataFrame(data=[
('23-09-2020', 'CRICKET'),
('25-11-2020', 'CRICKET'),
('13-09-2021', 'FOOTBALL'),
('20-11-2021', 'BASKETBALL'),
('12-12-2021', 'FOOTBALL')
], schema=['DATE', 'SPORTS_INTERESTED'])
df = df.withColumn("DATE", F.to_date("DATE", format="dd-MM-yyyy"))
df = df.groupBy("SPORTS_INTERESTED").agg(F.min("DATE").alias("MIN_OF_DATE"))
df.show(truncate=False)
[Out]:
+-----------------+-----------+
|SPORTS_INTERESTED|MIN_OF_DATE|
+-----------------+-----------+
|BASKETBALL       |2021-11-20 |
|FOOTBALL         |2021-09-13 |
|CRICKET          |2020-09-23 |
+-----------------+-----------+
