This is my PySpark DataFrame:
+--------------------------------------------------+----------+
|date                                              |date_count|
+--------------------------------------------------+----------+
|[20210629, 20210629]                              |495       |
|[20210619, 20210619, 20210619]                    |1781      |
|[20210611]                                        |3675263   |
|[20210611, 20210611, 20210611, 20210611, 20210611]|3         |
+--------------------------------------------------+----------+
As a hint, it comes from a pivot-style aggregation like this:
from pyspark.sql.functions import max as pyspark_max, min as pyspark_min, sum as pyspark_sum, avg, count
# header/inferschema/delimiter are CSV reader options and have no effect on Parquet, so they are dropped here
timeseries_monthly = spark.read.parquet("url...")
# gps.date is an array column, so each distinct array becomes its own group
date = timeseries_monthly.select(timeseries_monthly["gps.date"])
date.groupBy("date").agg(count("date").alias("date_count")).show(4, truncate=False)
This is my expected output:
+----------+----------+
|date      |date_count|
+----------+----------+
|20210629  |495       |
|20210619  |1781      |
|20210611  |3675263   |
|20210611  |3         |
+----------+----------+
Answer from a uj5u.com user:
Use the array_distinct() and array_join() functions in PySpark.
Example:
from pyspark.sql.functions import array_distinct, array_join, col
df.withColumn("date", array_join(array_distinct(col("date")), '')).show()
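
To make it clearer why an empty delimiter works here, the following is a rough plain-Python model of what the two functions do to a single row. It is not Spark code: the helper names (array_distinct_model, array_join_model, collapse) are made up for illustration, the dates are assumed to be strings, and the model assumes there are no nulls in the arrays.

```python
# Plain-Python model of the per-row behaviour; this is NOT Spark code.
# All helper names below are hypothetical and exist only for illustration.

def array_distinct_model(values):
    """Drop duplicates, keeping the first occurrence of each value."""
    return list(dict.fromkeys(values))


def array_join_model(values, delimiter):
    """Concatenate the elements into one string, separated by delimiter."""
    return delimiter.join(str(v) for v in values)


def collapse(values, delimiter=""):
    """Model of array_join(array_distinct(col), delimiter) for one row."""
    return array_join_model(array_distinct_model(values), delimiter)


# Rows shaped like the question's data (assumed here to hold strings).
rows = [
    ["20210629", "20210629"],
    ["20210619", "20210619", "20210619"],
    ["20210611"],
]
collapsed = [collapse(r) for r in rows]
print(collapsed)

# A row with two different dates: the empty delimiter glues them together.
mixed = collapse(["20210611", "20210612"])
print(mixed)
```

Because every array in the question holds one repeated date, the distinct step leaves a single element and the join simply returns it. If a row could ever hold more than one distinct date, the empty delimiter would concatenate them into one long string, so a visible delimiter such as ',' may be the safer choice in that case.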
Please credit the source when reposting. Article link: https://www.uj5u.com/qita/374911.html
