I have two DataFrames like this:
df1 = spark.createDataFrame([(1, 11, 1999, 1999, None), (2, 22, 2000, 2000, 44), (3, 33, 2001, 2001,None)], ['id', 't', 'year','new_date','rev_t'])
df2 = spark.createDataFrame([(2, 44, 2022, 2022,None), (2, 55, 2001, 2001, 88)], ['id', 't', 'year','new_date','rev_t'])
df1.show()
df2.show()
+---+---+----+--------+-----+
| id|  t|year|new_date|rev_t|
+---+---+----+--------+-----+
|  1| 11|1999|    1999| null|
|  2| 22|2000|    2000|   44|
|  3| 33|2001|    2001| null|
+---+---+----+--------+-----+
+---+---+----+--------+-----+
| id|  t|year|new_date|rev_t|
+---+---+----+--------+-----+
|  2| 44|2022|    2022| null|
|  2| 55|2001|    2001|   88|
+---+---+----+--------+-----+
I want to join them so that, wherever df2.t == df1.rev_t, new_date in the resulting DataFrame is updated to df2.year. The result should look like this:
+---+---+----+--------+-----+
| id|  t|year|new_date|rev_t|
+---+---+----+--------+-----+
|  1| 11|1999|    1999| null|
|  2| 22|2000|    2022|   44|
|  2| 44|2022|    2022| null|
|  2| 55|2001|    2001|   88|
|  3| 33|2001|    2001| null|
+---+---+----+--------+-----+
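Answer from a uj5u.com community member:
To update a column in df1 from df2, use a left join together with the coalesce function on the column you want to update, in this case new_date.
Judging by your expected output, you also want to add the rows from df2, so union the join result with df2: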
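from pyspark.sql import functions as F

# Left-join df1 to df2 on df1.rev_t == df2.t, carrying over only df2's new_date.
result = (
    df1.join(df2.selectExpr("t as rev_t", "new_date as df2_new_date"), ["rev_t"], "left")
    # Prefer the matched value from df2; fall back to df1's own new_date.
    .withColumn("new_date", F.coalesce("df2_new_date", "new_date"))
    # Restore df1's column order (union is positional), then append df2's rows.
    .select(*df1.columns)
    .union(df2)
)
result.show()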
#+---+---+----+--------+-----+
#| id|  t|year|new_date|rev_t|
#+---+---+----+--------+-----+
#|  1| 11|1999|    1999| null|
#|  3| 33|2001|    2001| null|
#|  2| 22|2000|    2022|   44|
#|  2| 44|2022|    2022| null|
#|  2| 55|2001|    2001|   88|
#+---+---+----+--------+-----+
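To make the join, coalesce, and union steps easier to follow, here is a rough plain-Python model of the same logic that runs without Spark. It is only a sketch: the helper names `coalesce` and `update_and_union` are made up for illustration, and it assumes each `t` value appears at most once in df2 (a real Spark join would produce one output row per match). It takes the replacement value from df2's `year`, as the question words it; in the sample data `year` and `new_date` are identical in df2, so the result matches the answer above.

```python
# Plain-Python model of "left join + coalesce + union"; no Spark required.
# Row layout: (id, t, year, new_date, rev_t), same as the DataFrames above.
df1_rows = [(1, 11, 1999, 1999, None), (2, 22, 2000, 2000, 44), (3, 33, 2001, 2001, None)]
df2_rows = [(2, 44, 2022, 2022, None), (2, 55, 2001, 2001, 88)]


def coalesce(*values):
    """Return the first value that is not None, like SQL COALESCE."""
    for value in values:
        if value is not None:
            return value
    return None


def update_and_union(left, right):
    """Hypothetical helper: update left.new_date from right, then append right."""
    # Lookup table right.t -> right.year (assumes t is unique in right).
    year_by_t = {t: year for (_id, t, year, _new_date, _rev_t) in right}
    updated = []
    for (id_, t, year, new_date, rev_t) in left:
        # A null key never matches in a SQL equi-join, so skip the lookup for None.
        matched_year = year_by_t.get(rev_t) if rev_t is not None else None
        # Keep the original new_date when there is no match.
        updated.append((id_, t, year, coalesce(matched_year, new_date), rev_t))
    # Union: append the right-hand rows unchanged.
    return updated + list(right)


result_rows = update_and_union(df1_rows, df2_rows)
# Sort by (id, t) only for display; Spark itself guarantees no row order.
for row in sorted(result_rows, key=lambda r: (r[0], r[1])):
    print(row)
```

Only the row with rev_t = 44 finds a partner in df2, so only its new_date changes; rows with a null rev_t keep their original value because the coalesce falls back to it.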
