I have a DataFrame:
from datetime import datetime
data = [
(1, datetime(2018, 7, 25, 17, 15, 6, 390000)),
(2, datetime(2018, 7, 25, 11, 12, 49, 317000))
]
df = spark.createDataFrame(data, ['ID', 'max_ts'])
# +---+-----------------------+
# |ID |max_ts                 |
# +---+-----------------------+
# |1  |2018-07-25 17:15:06.39 |
# |2  |2018-07-25 11:12:49.317|
# +---+-----------------------+
I want to add a column with the fractional-second part of each timestamp, in microseconds:
+---+-----------------------+------+
|ID |max_ts                 |ms    |
+---+-----------------------+------+
|1  |2018-07-25 17:15:06.39 |390000|
|2  |2018-07-25 11:12:49.317|317000|
+---+-----------------------+------+
In pandas, I can do it like this:
df_interfax['ms_created_at'] = df_interfax['max_ts'].dt.microsecond
But how can I do this in PySpark?
Answer:
One option (note that the resulting ms column is a double):
from pyspark.sql import functions as F
df = df.withColumn('ms', F.expr("date_part('s', max_ts) % 1 * pow(10, 6)"))
df.show(truncate=0)
# +---+-----------------------+--------+
# |ID |max_ts                 |ms      |
# +---+-----------------------+--------+
# |1  |2018-07-25 17:15:06.39 |390000.0|
# |2  |2018-07-25 11:12:49.317|317000.0|
# +---+-----------------------+--------+
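To make the arithmetic in that expression easier to follow, here is a rough plain-Python sketch of the same idea, with no Spark involved. The helper name fractional_micros is made up for illustration. It assumes date_part('s', ...) yields the seconds together with their fraction (for example 6.39), which Decimal stands in for here.

```python
from datetime import datetime
from decimal import Decimal


def fractional_micros(ts: datetime) -> int:
    """Hypothetical helper mirroring `date_part('s', ts) % 1 * 10^6`."""
    # Seconds including the fraction, kept exact with Decimal (e.g. 6.39).
    seconds = Decimal(ts.second) + Decimal(ts.microsecond) / Decimal(10**6)
    # `% 1` keeps only the fractional part; scaling by 10^6 gives microseconds.
    return int(seconds % 1 * 10**6)


print(fractional_micros(datetime(2018, 7, 25, 17, 15, 6, 390000)))   # 390000
print(fractional_micros(datetime(2018, 7, 25, 11, 12, 49, 317000)))  # 317000
```

In the Spark version, pow returns a double, which is why the column prints as 390000.0. Wrapping the whole expression in cast(... as int) should give an integer column instead, though that variant is not tested here.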
Another option, which yields an integer column (unix_micros needs Spark 3.1 or later):
df = df.withColumn('ms', F.expr("unix_micros(max_ts) - unix_micros(date_trunc('second', max_ts))"))
df.show(truncate=0)
# +---+-----------------------+------+
# |ID |max_ts                 |ms    |
# +---+-----------------------+------+
# |1  |2018-07-25 17:15:06.39 |390000|
# |2  |2018-07-25 11:12:49.317|317000|
# +---+-----------------------+------+
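The subtraction trick can also be checked outside Spark. The following is only a sketch in plain Python: unix_micros and sub_second_micros here are hand-written stand-ins named after the idea, not the Spark functions, and naive timestamps are treated as UTC purely as an assumption (the zone offset cancels out in the subtraction anyway).

```python
from datetime import datetime, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def unix_micros(ts: datetime) -> int:
    """Stand-in: whole microseconds since the Unix epoch (naive ts read as UTC)."""
    delta = ts.replace(tzinfo=timezone.utc) - EPOCH
    # Integer arithmetic only, so no floating-point rounding creeps in.
    return (delta.days * 86400 + delta.seconds) * 10**6 + delta.microseconds


def sub_second_micros(ts: datetime) -> int:
    """Microseconds past the last whole second."""
    truncated = ts.replace(microsecond=0)  # plays the role of date_trunc('second', ts)
    return unix_micros(ts) - unix_micros(truncated)


micros = sub_second_micros(datetime(2018, 7, 25, 17, 15, 6, 390000))
print(micros)          # 390000
print(micros // 1000)  # 390 (whole milliseconds, if that is what you need)
```

If actual milliseconds are wanted rather than microseconds, integer-dividing the result by 1000 as in the last line is one way to get there.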
Tags: apache-spark, datetime, pyspark, apache-spark-sql, timestamp
