在sparkDataFrame中創建兩列，一列用于累積值，另一列用于最大連續值-有解無憂

我有一個火花資料框如下：

from pyspark.sql import SparkSession
from pyspark.sql import Window

import pyspark.sql.functions as F

data = [{"Category": 'A', "ID": 1, "Value": 121.44, "Truth": True, "time": 1},
        {"Category": 'B', "ID": 2, "Value": 300.01, "Truth": False, "time": 2},
        {"Category": 'C', "ID": 3, "Value": 10.99, "Truth": None, "time": 3},
        {"Category": 'C', "ID": 4, "Value": 33.87, "Truth": True, "time": 4},
        {"Category": 'D', "ID": 4, "Value": 33.87, "Truth": True, "time": 5},
        {"Category": 'E', "ID": 4, "Value": 33.87, "Truth": True, "time": 6},
        {"Category": 'E', "ID": 4, "Value": 33.87, "Truth": True, "time": 7},
        {"Category": 'E', "ID": 4, "Value": 33.87, "Truth": True, "time": 8}
        ]

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(data)

w = Window.partitionBy(F.col("Category")).rowsBetween(Window.unboundedPreceding, Window.currentRow)
df.filter(df["ID"] == 4).withColumn("val_sum", F.sum(F.col("Value")).over(w)).withColumn("max_time", F.max(F.col("time")).over(w)).show()

我收到了以下輸出：

但是，我的預期輸出如下：

 -------- --- ----- ----- ---- ------------------ -------- 
|Category| ID|Truth|Value|time|           val_sum|max_time|
 -------- --- ----- ----- ---- ------------------ -------- 
|       C|  4| true|33.87|   4|             33.87|       4|
|       D|  4| true|33.87|   5|             33.87|       5|
|       E|  4| true|33.87|   6|             33.87|       6|
|       E|  4| true|33.87|   7|             67.74|       7|
|       E|  4| true|33.87|   8|101.60999999999999|       8|
 -------- --- ----- ----- ---- ------------------ --------

任何人都可以幫我解決這個問題嗎？

 -------- --- ----- ----- ---- ------------------ -------- 
|Category| ID|Truth|Value|time|           val_sum|max_time|
 -------- --- ----- ----- ---- ------------------ -------- 
|       C|  4| true|33.87|   4|             33.87|       4|
|       D|  4| true|33.87|   5|             33.87|       5|
|       E|  4| true|33.87|   6|             33.87|       8|
|       E|  4| true|33.87|   7|             67.74|       8|
|       E|  4| true|33.87|   8|101.60999999999999|       8|
 -------- --- ----- ----- ---- ------------------ --------

如果不清楚，請告訴我，以便我提供更多資訊。

uj5u.com熱心網友回復：

為 max 函式定義另一個視窗規范，unboundedPreceding而unboundedFollowing不是currentRow

Example:

from pyspark.sql import SparkSession
from pyspark.sql import Window

import pyspark.sql.functions as F

data = [{"Category": 'A', "ID": 1, "Value": 121.44, "Truth": True, "time": 1},
        {"Category": 'B', "ID": 2, "Value": 300.01, "Truth": False, "time": 2},
        {"Category": 'C', "ID": 3, "Value": 10.99, "Truth": None, "time": 3},
        {"Category": 'C', "ID": 4, "Value": 33.87, "Truth": True, "time": 4},
        {"Category": 'D', "ID": 4, "Value": 33.87, "Truth": True, "time": 5},
        {"Category": 'E', "ID": 4, "Value": 33.87, "Truth": True, "time": 6},
        {"Category": 'E', "ID": 4, "Value": 33.87, "Truth": True, "time": 7},
        {"Category": 'E', "ID": 4, "Value": 33.87, "Truth": True, "time": 8}
        ]
spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(data)

w = Window.partitionBy(F.col("Category")).rowsBetween(Window.unboundedPreceding, Window.currentRow)
w1 = Window.partitionBy(F.col("Category")).rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing)
df.filter(df["ID"] == 4).withColumn("val_sum", F.sum(F.col("Value")).over(w)).withColumn("max_time", F.max(F.col("time")).over(w1)).show()

 -------- --- ----- ----- ---- ------------------ -------- 
|Category| ID|Truth|Value|time|           val_sum|max_time|
 -------- --- ----- ----- ---- ------------------ -------- 
|       E|  4| true|33.87|   6|             33.87|       8|
|       E|  4| true|33.87|   7|             67.74|       8|
|       E|  4| true|33.87|   8|101.60999999999999|       8|
|       D|  4| true|33.87|   5|             33.87|       5|
|       C|  4| true|33.87|   4|             33.87|       4|
 -------- --- ----- ----- ---- ------------------ --------

轉載請註明出處，本文鏈接：https://www.uj5u.com/shujuku/370956.html

標籤：Python 数据框阿帕奇火花火花

上一篇：在r中使用for回圈時出現“無法回收”錯誤

下一篇：熊貓資料框中的for回圈