我有一個場景,我需要在日期條件下過濾日期列,同樣明智地我需要為整個月做這件事。問題是在為每個日期回圈時需要時間。我想一次性完成整個月。以下是代碼。
target_date = [1,2,3...30]
for i in target_date:
df = spark.sql(f'select * from table where x_date <={i} and y_date >={i}')
df = df.withColumn('load_date',f.lit(i))
df.write.partition('load_date').mode('append').parquet(output_path)
任何使這更快的方法
uj5u.com熱心網友回復:
也許您可以將寫入移動到回圈之外。就像是
target_date = [1,2,3...30]
df_final = []
for i in target_date:
df = spark.sql(f'select * from table where x_date <={i} and y_date >={i}')
df = df.withColumn('load_date',f.lit(i))
df_final = df_final.union(df)
df_final.write.partition('load_date').parquet(output_path)
uj5u.com熱心網友回復:
我相信你可以用這樣的交叉連接來解決它:
load_dates = spark.createDataFrame([[i,] for i in range(1,31)], ['load_date'])
load_dates.show()
---------
|load_date|
---------
| 1|
| 2|
| 3|
| ...|
| 30|
---------
df = spark.sql(f'select * from table')
df.join(
load_dates,
on=(F.col('x_date') <= F.col('load_date')) & (F.col('y_date') >= F.col('load_date')),
how='inner',
)
df.write.partitionBy('load_date').parquet(output_path)
uj5u.com熱心網友回復:
你應該能夠做到
- 在每一行中創建一個 load_dates 陣列
- 分解陣列,以便每個原始行都有唯一的 load_date
- 過濾以獲取您想要的 load_dates
例如
target_dates = [1,2,3...30]
df = spark.sql(f'select * from table')
# create an array of all load_dates in each row
df = df.withColumn("load_date", F.array([F.lit(i) for i in target_dates]))
# Explode the load_dates so that you get a new row for each load_date
df = df.withColumn("load_date", F.explode("load_date"))
# Filter only the load_dates you want to keep
df = df.filter("x_date <= load_date and y_date >=load_date")
df.write.partition('load_date').mode('append').parquet(output_path)
轉載請註明出處,本文鏈接:https://www.uj5u.com/gongcheng/375347.html
標籤:sql 阿帕奇火花 火花 apache-spark-sql
