我想計算相同組內的持續時間subs_no,和year。
請注意,這與此略有不同,因為該列new_ts是 UNIX 時間戳,而不是字串。
這是我的代碼:
from pyspark.sql import functions as F, Window as W
df_subs_loc_movmnt_ts = df_subs_loc_movmnt.withColumn("new_ts",from_unixtime(unix_timestamp(col("ts"),"HH:mm:ss"),"HH:mm:ss"))
w = W.partitionBy('subs_no', 'year').orderBy('new_ts')
df_subs_loc_movmnt_duration = df_subs_loc_movmnt_ts.withColumn('duration', F.regexp_extract('new_ts' - F.min('new_ts').over(w),"HH:mm:ss", 0))
但df_subs_loc_movmnt_duration始終為空
-------- --------------- -------- --------------- ------------- ---- ----- --- -------- --------
| date_id| ts| subs_no| cgi| msisdn|year|month|day| new_ts|duration|
-------- --------------- -------- --------------- ------------- ---- ----- --- -------- --------
|20200801|17:00:10.000000|10100559|510-11-610104-9|3729882521647|2022| 6| 1|17:00:10| null|
|20200801|17:09:39.000000|10100559|510-11-610104-9|3729882521647|2022| 6| 1|17:09:39| null|
|20200801|06:44:55.000000|10100559|510-11-610034-6|3729882521647|2022| 6| 1|06:44:55| null|
|20200801|17:45:35.000000|10100559|510-11-610079-5|3729882521647|2022| 6| 1|17:45:35| null|
|20200801|17:48:05.000000|10100559|510-11-610660-4|3729882521647|2022| 6| 1|17:48:05| null|
|20200801|18:07:25.000000|10100559|510-11-610127-6|3729882521647|2022| 6| 1|18:07:25| null|
-------- --------------- -------- --------------- ------------- ---- ----- --- -------- --------
uj5u.com熱心網友回復:
不要from_unixtime在第一行使用,在下面使用它,因為此函式會將您的時間戳轉換為字串,但稍后您會進行需要時間戳/長格式(不是字串)的計算。此外,使用“HH:mm:ss”格式regex_extract時也不需要。from_unixtime
from pyspark.sql import functions as F, Window as W
df_subs_loc_movmnt = spark.createDataFrame(
[('17:00:10.000000', '10100559', 2022, '17:00:10'),
('17:09:39.000000', '10100559', 2022, '17:09:39'),
('06:44:55.000000', '10100559', 2022, '06:44:55'),
('17:45:35.000000', '10100559', 2022, '17:45:35'),
('17:48:05.000000', '10100559', 2022, '17:48:05'),
('18:07:25.000000', '10100559', 2022, '18:07:25')],
['ts', 'subs_no', 'year', 'new_ts'])
腳本:
df_subs_loc_movmnt_ts = df_subs_loc_movmnt.withColumn("new_ts", F.unix_timestamp(F.col("ts"), "HH:mm:ss"))
w = W.partitionBy('subs_no', 'year').orderBy('new_ts')
df_subs_loc_movmnt_duration = df_subs_loc_movmnt_ts.withColumn('duration', F.from_unixtime(F.col('new_ts') - F.min('new_ts').over(w), "HH:mm:ss"))
df_subs_loc_movmnt_duration.show()
# --------------- -------- ---- ------ --------
# | ts| subs_no|year|new_ts|duration|
# --------------- -------- ---- ------ --------
# |06:44:55.000000|10100559|2022| 24295|00:00:00|
# |17:00:10.000000|10100559|2022| 61210|10:15:15|
# |17:09:39.000000|10100559|2022| 61779|10:24:44|
# |17:45:35.000000|10100559|2022| 63935|11:00:40|
# |17:48:05.000000|10100559|2022| 64085|11:03:10|
# |18:07:25.000000|10100559|2022| 65245|11:22:30|
# --------------- -------- ---- ------ --------
轉載請註明出處,本文鏈接:https://www.uj5u.com/yidong/514669.html
上一篇:如何將日期舍入到最近的月份?
