I have a DataFrame and a list of borders:
test = spark.createDataFrame(
    [
        (1,),
        (2,),
        (234,),
        (0,),
        (6,),
        (7,),
        (35,),
        (46,),
        (8,),
    ],
    "Population int",
)
border_list = [0, 1.5, 7, 41, 235]
I want to add two new columns to the DataFrame, "LowerBorder" and "UpperBorder", computed from the "Population" column.
When I tried it with plain Python lists and functions, it worked:
lower = lambda x: max([i for i in border_list if x >= i])
upper = lambda x: min([i for i in border_list if x < i])
list_value = [1, 2, 234, 0, 6, 7, 35, 46, 8]
for i in list_value:
    print(lower(i), upper(i))
# Output:
# low high
0 1.5
1.5 7
41 235
0 1.5
1.5 7
7 41
7 41
41 235
7 41
However, when I tried to convert this to work on columns, it did not:
from pyspark.sql import functions as F
from pyspark.sql.types import FloatType
lower_border = F.udf(lambda x: max([i for i in border_list if x >= i]), FloatType())
upper_border = F.udf(lambda x: min([i for i in border_list if x < i]), FloatType())
test.withColumn("LowBorder", lower_border("Population")) \
.withColumn("UpBorder", upper_border("Population"))
display(test) # no changes in test Dataframe
If I try to add the columns through select, it does not work as expected either:
display(test.select(lower_border("Population").alias('low'), upper_border("Population").alias('high')))
# Output:
low high
-----------
null 1.5
1.5 null
null null
null 1.5
1.5 null
null null
null null
null null
null null
The expected output for the test DataFrame is:
Population | LowBorder | UpBorder
---------------------------------
1 0 1.5
2 1.5 7
234 41 235
0 0 1.5
6 1.5 7
7 7 41
35 7 41
46 41 235
8 7 41
Answer:
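You can build a sorted array column from border_list and then use filter to keep only the borders on the relevant side of Population. The last element that is <= Population is the lower border, and the first element that is > Population is the upper border. Note that F.filter with a Python lambda requires Spark 3.1 or later.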
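To make the steps easier to follow, here is a minimal plain-Python trace of the same idea (sort, filter, pick by position). The helper name borders and the use of None for a missing border are assumptions of this sketch, not part of the original answer.

```python
border_list = [0, 1.5, 7, 41, 235]

def borders(population, borders_list):
    """Return (lower, upper) using the sort / filter / pick-by-position steps."""
    arr = sorted(borders_list)                           # plays the role of F.array_sort
    at_or_below = [b for b in arr if b <= population]    # borders <= Population
    above = [b for b in arr if b > population]           # borders > Population
    # Last element of the first list, first element of the second.
    # None stands in for "no such border" (an assumption of this sketch).
    lower = at_or_below[-1] if at_or_below else None
    upper = above[0] if above else None
    return lower, upper

result = borders(35, border_list)
print(result)  # (7, 41)
```

For Population = 35 the borders at or below are [0, 1.5, 7] and the borders above are [41, 235], so the pair is (7, 41). The Spark version of the same steps: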
from pyspark.sql import functions as F
test = spark.createDataFrame([(1,), (2,), (234,), (0,), (6,), (7,), (35,), (46,), (8,)], "Population int")
border_list = [0, 1.5, 7, 41, 235]
arr = F.array_sort(F.array([F.lit(x) for x in border_list]))
test = test.select(
    'Population',
    F.element_at(F.filter(arr, lambda x: x <= F.col('Population')), -1).alias('LowBorder'),
    F.element_at(F.filter(arr, lambda x: x > F.col('Population')), 1).alias('UpBorder'),
)
test.show(truncate=0)
# +----------+---------+--------+
# |Population|LowBorder|UpBorder|
# +----------+---------+--------+
# |1         |0.0      |1.5     |
# |2         |1.5      |7.0     |
# |234       |41.0     |235.0   |
# |0         |0.0      |1.5     |
# |6         |1.5      |7.0     |
# |7         |7.0      |41.0    |
# |35        |7.0      |41.0    |
# |46        |41.0     |235.0   |
# |8         |7.0      |41.0    |
# +----------+---------+--------+
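As for why the UDF attempt in the question produced nulls: border_list mixes Python ints (0, 7, 41, 235) with a float (1.5), and a PySpark UDF declared with FloatType() yields null when the function returns a Python int instead of a float. The null pattern in the question's output matches exactly the rows where the chosen border is an int. Separately, withColumn returns a new DataFrame, so its result has to be assigned for the new columns to appear. The plain-Python sketch below only checks the return types and shows float-returning variants; the names lower_float and upper_float are illustrative, and wrapping them in F.udf is left as an untested suggestion.

```python
border_list = [0, 1.5, 7, 41, 235]

# The question's lambdas: the result keeps the type of the chosen list element.
lower = lambda x: max(i for i in border_list if x >= i)
upper = lambda x: min(i for i in border_list if x < i)

# Variants that always return a Python float, so the value would match
# a UDF declared with FloatType().
lower_float = lambda x: float(max(i for i in border_list if x >= i))
upper_float = lambda x: float(min(i for i in border_list if x < i))

populations = [1, 2, 234, 0, 6, 7, 35, 46, 8]

# Flag the results that are Python ints, i.e. the ones that came back as null.
int_results = [
    (p, isinstance(lower(p), int), isinstance(upper(p), int))
    for p in populations
]
for p, low_is_int, up_is_int in int_results:
    print(p, "low is int:", low_is_int, "high is int:", up_is_int)

# With the float variants every result is a float.
fixed = [(p, lower_float(p), upper_float(p)) for p in populations]
```

Under that reading, passing lower_float and upper_float to F.udf(..., FloatType()) and writing test = test.withColumn(...) should give the expected columns, although the array-based approach above stays inside Spark's built-in functions and avoids Python UDFs altogether.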
