I have a PySpark DataFrame like the one below:
import pyspark.sql.functions as f

df = spark.createDataFrame(
    [(123, 2897402, 43.25, 2),
     (124, 2897402, 49.11, 0),
     (125, 2897402, 43.25, 2),
     (126, 2897402, 48.75, 0)],
    ['model_id', 'lab_test_id', 'summary_measure_value', 'reading_precision'])
Expected output:
+--------+-----------+---------------------+-----------------+-------------+
|model_id|lab_test_id|summary_measure_value|reading_precision|reading_value|
+--------+-----------+---------------------+-----------------+-------------+
|     123|    2897402|                43.25|                2|        43.25|
|     124|    2897402|                49.11|                1|         49.1|
|     125|    2897402|                43.25|                2|        43.25|
|     126|    2897402|                48.75|                0|         49.0|
+--------+-----------+---------------------+-----------------+-------------+
I tried the following:
df1 = df.withColumn("reading_value", f.round(f.col("summary_measure_value"), f.col("reading_precision")))
I get a "Column is not iterable" error.
How can I round each value to the precision given in its own row?
A uj5u.com user replied:
You can do this with a UDF that calls Python's built-in round function, for example:
@f.udf("double")  # without a return type, the UDF would return strings
def udf_round(value, precision):
    try:
        precision = int(precision)
        value = float(value)
        # use Python's built-in round function to round the value
        return round(value, precision)
    except (TypeError, ValueError):
        # decide what to return when you encounter bad data;
        # in this example the original value is returned
        return value

df = df.withColumn("reading_value", udf_round(f.col("summary_measure_value"), f.col("reading_precision")))
df.show(truncate=False)
Output:
+--------+-----------+---------------------+-----------------+-------------+
|model_id|lab_test_id|summary_measure_value|reading_precision|reading_value|
+--------+-----------+---------------------+-----------------+-------------+
|123     |2897402    |43.25                |2                |43.25        |
|124     |2897402    |49.25                |0                |49.0         |
|125     |2897402    |43.25                |2                |43.25        |
|126     |2897402    |48.75                |0                |49.0         |
+--------+-----------+---------------------+-----------------+-------------+
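One caveat with the UDF approach: Python's built-in round resolves exact ties to the nearest even digit, whereas Spark's round is documented as using HALF_UP. The rows above contain no ties, so the difference does not show there. The plain-Python sketch below illustrates it; the helper round_half_up and the sample values are hypothetical and not part of the original answer.

```python
from decimal import Decimal, ROUND_HALF_UP

def round_half_up(value: float, precision: int) -> float:
    """Round value to `precision` decimal places, resolving ties upward (HALF_UP)."""
    # Decimal(1).scaleb(-2) is Decimal("0.01"), i.e. the target quantum
    quantum = Decimal(1).scaleb(-precision)
    return float(Decimal(str(value)).quantize(quantum, rounding=ROUND_HALF_UP))

# Values that sit exactly on a rounding boundary (all exactly representable as floats)
ties = [(48.5, 0), (2.5, 0), (0.125, 2)]

builtin_results = [round(v, p) for v, p in ties]          # ties go to the even digit
half_up_results = [round_half_up(v, p) for v, p in ties]  # ties go up

print(builtin_results)
print(half_up_results)
```

If the UDF has to match Spark's native rounding on such values, a HALF_UP helper like this could replace the built-in round call inside it.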
A uj5u.com user replied:
Unfortunately, the round function has the following signature:
def round(e: Column, scale: Int): Column
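So you can only round a column to a fixed precision that is known on the driver; the scale cannot come from another column.
To work around this you can use a UDF, but in PySpark UDFs are expensive: each row has to be passed to a Python worker, which is much slower than native Spark code.
Instead, you can build a custom rounding function out of round and pow, as follows: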
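# This is not a UDF, just an expression built from native Spark functions.
# It is named round_to_precision so it does not shadow Python's built-in round.
def round_to_precision(column, precision):
    # scale up by 10^precision, round to an integer, then scale back down
    factor = f.pow(f.lit(10), precision)
    return f.round(column * factor) / factor

df1 = df.withColumn("reading_value", round_to_precision(f.col("summary_measure_value"), f.col("reading_precision")))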
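To see why the scale, round, unscale construction above gives the desired result, here is a minimal plain-Python sketch of the same arithmetic applied to the four rows from the question. It runs without Spark; the helper names half_up and round_to_precision are illustrative assumptions, and half_up only approximates the HALF_UP behaviour of Spark's round.

```python
from decimal import Decimal, ROUND_HALF_UP

def half_up(x: float) -> float:
    """Round to the nearest integer, resolving ties upward (HALF_UP)."""
    return float(Decimal(str(x)).quantize(Decimal("1"), rounding=ROUND_HALF_UP))

def round_to_precision(value: float, precision: int) -> float:
    """Scale by 10**precision, round to an integer, then scale back."""
    factor = 10 ** precision
    return half_up(value * factor) / factor

# (summary_measure_value, reading_precision) pairs from the question's DataFrame
rows = [(43.25, 2), (49.11, 0), (43.25, 2), (48.75, 0)]
results = [round_to_precision(v, p) for v, p in rows]

print(results)  # [43.25, 49.0, 43.25, 49.0]
```

Because the precision enters only through ordinary multiplication and division, the Spark version can take it from a column and still run as native code.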
