Fill column values based on a join in a PySpark DataFrame

I have a DataFrame created with the following code:

df = sc.parallelize([
    (123, 2345, 25, ""), (123, 2345, 29, "NY"), (123, 5422, 67, "NY"),
    (123, 9422, 67, "NY"), (123, 3581, 98, "NY"), (231, 4322, 77, ""),
    (231, 4322, 99, "Paris"), (231, 8342, 45, "Paris"),
]).toDF(["userid", "transactiontime", "zip", "location"])

+------+---------------+---+--------+
|userid|transactiontime|zip|location|
+------+---------------+---+--------+
|   123|           2345| 25|        |
|   123|           2345| 29|      NY|
|   123|           5422| 67|      NY|
|   123|           9422| 67|      NY|
|   123|           3581| 98|      NY|
|   231|           4322| 77|        |
|   231|           4322| 99|   Paris|
|   231|           8342| 45|   Paris|
+------+---------------+---+--------+

I want the output to look like this:

+------+---------------+---+--------+
|userid|transactiontime|zip|location|
+------+---------------+---+--------+
|   123|           2345| 25|      NY|
|   123|           2345| 29|      NY|
|   123|           5422| 67|      NY|
|   123|           9422| 67|      NY|
|   123|           3581| 98|      NY|
|   231|           4322| 77|   Paris|
|   231|           4322| 99|   Paris|
|   231|           8342| 45|   Paris|
+------+---------------+---+--------+

I want to join on userid and transactiontime and fill the location column with the non-empty value.

I tried a window function like this:

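from pyspark.sql import Window, functions as F
from pyspark.sql.functions import col

w1 = Window.partitionBy('userid', 'transactiontime').orderBy(col('zip'))

df_new = df.withColumn("newlocation", F.last('location').over(w1))
df_new.show()  # show() prints the table itself and returns None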

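But this doesn't work. I also tried a self-join, but that didn't work either. Any help?

Answer from a uj5u.com user:

The first and last window functions accept an optional argument, ignorenulls, which can help in this case. However, in your example you don't actually have null values but empty strings, which is not the same thing.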

w = Window.partitionBy('userid', 'transactiontime')

df_new = df \
    .withColumn("fixedLoc", F.when(F.col("location") == "", None).otherwise(F.col("location"))) \
    .withColumn("newLoc", F.first('fixedLoc', ignorenulls=True).over(w))

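In the solution above, a temporary column replaces the empty strings with nulls, and first is then applied to that new column with ignorenulls. Note that w has no orderBy, so each row sees its whole partition rather than only the rows up to the current one.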
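
To make the semantics concrete, here is a plain-Python sketch of what those two steps compute for the sample rows, with no Spark involved. The helper name fill_first_non_null and the tuple layout are assumptions made for illustration only; this mimics the result, not how Spark evaluates a window.

```python
# Sample rows from the question: (userid, transactiontime, zip, location)
rows = [
    (123, 2345, 25, ""), (123, 2345, 29, "NY"), (123, 5422, 67, "NY"),
    (123, 9422, 67, "NY"), (123, 3581, 98, "NY"), (231, 4322, 77, ""),
    (231, 4322, 99, "Paris"), (231, 8342, 45, "Paris"),
]

def fill_first_non_null(rows):
    """Hypothetical helper: treat "" as null, then take the first
    non-null location within each (userid, transactiontime) group."""
    first_seen = {}
    for userid, ttime, _zip, location in rows:
        fixed = location if location != "" else None  # empty string -> null
        key = (userid, ttime)
        if fixed is not None and key not in first_seen:
            first_seen[key] = fixed
    # Every row receives the value found for its group (None if there is none)
    return [(u, t, z, first_seen.get((u, t))) for u, t, z, _loc in rows]

filled = fill_first_non_null(rows)
for row in filled:
    print(row)
```

In this sample every group holds at most one distinct non-empty location, so which row counts as "first" does not affect the result.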

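As an alternative, you can use the max function, which ignores nulls and, because an empty string sorts before any non-empty string, prefers the non-empty value: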

df_new = df \
    .withColumn("newLoc", F.max('location').over(w))
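
The ordering argument behind max can be checked in plain Python, where an empty string also compares lower than any non-empty string. The helper max_ignoring_none below is a hypothetical stand-in for the aggregate, written only to illustrate the behaviour; it is not part of PySpark.

```python
def max_ignoring_none(values):
    """Hypothetical helper: skip None (null) values, then return the largest string."""
    non_null = [v for v in values if v is not None]
    return max(non_null) if non_null else None

# An empty string sorts before any non-empty string
print("" < "NY")

# One partition from the question: an empty location next to a real one
print(max_ignoring_none(["", "NY"]))

# Caveat: with two different non-empty values, the lexicographically
# largest one wins, not the first one encountered
print(max_ignoring_none(["NY", "Paris"]))
```

This shortcut is only safe when each partition contains at most one distinct non-empty location, as in the sample data.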


Tags: python, apache-spark, pyspark, apache-spark-sql
