UDF function with a Double data type in PySpark

I am trying to create a column using a UDF function in PySpark.

The code I tried looks like this:

# The function checks year and adds a multiplied value_column to the final column

def new_column(row, year):
    if year == "2020":
        return row * 0.856 
    elif year == "2019": 
        return row * 0.8566
    else:
        return row

final_udf = F.udf(lambda z: new_column(z), Double()) #How do I get - Double datatype here 
res = res.withColumn("final_value", final_udf(F.col('value_column'), F.col('year')))

How can I write Double() in final_udf? I see that we can use StringType() for strings. But what should I do to return a double type in the "final_value" column?

A uj5u.com user replied:

Input:

from pyspark.sql import functions as F, types as T
res = spark.createDataFrame([(1.0, '2020',), (1.0, '2019',), (1.0, '2018',)], ['value_column', 'year'])

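A udf is very inefficient on big data, because every row has to be serialized between the JVM and a Python worker.

You should first try to do this in native Spark: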

res = res.withColumn(
    'final_value',
    F.when(F.col('year') == "2020", F.col('value_column') * 0.856)
     .when(F.col('year') == "2019", F.col('value_column') * 0.8566)
     .otherwise(F.col('value_column'))
)
res.show()
# +------------+----+-----------+
# |value_column|year|final_value|
# +------------+----+-----------+
# |         1.0|2020|      0.856|
# |         1.0|2019|     0.8566|
# |         1.0|2018|        1.0|
# +------------+----+-----------+

If it is not possible in native Spark, go for pandas_udf:

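from pyspark.sql import functions as F, types as T
import pandas as pd

@F.pandas_udf(T.DoubleType())
def new_column(row: pd.Series, year: pd.Series) -> pd.Series:
    # Both arguments are whole pandas Series, so a plain if/elif on `year`
    # would raise an "ambiguous truth value" error. Build a per-row factor instead.
    factor = year.map({"2020": 0.856, "2019": 0.8566}).fillna(1.0)
    return row * factor

# Call the decorated function itself (there is no separate final_udf here)
res = res.withColumn("final_value", new_column('value_column', 'year'))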

res.show()
# +------------+----+-----------+
# |value_column|year|final_value|
# +------------+----+-----------+
# |         1.0|2020|      0.856|
# |         1.0|2019|     0.8566|
# |         1.0|2018|        1.0|
# +------------+----+-----------+
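
Since a pandas_udf receives whole columns as pandas Series rather than single values, the per-row branching has to be vectorized. The following is a minimal sketch of that logic in plain pandas, without Spark, so it can be checked locally. The name scale_by_year and the sample data are only illustrative and are not part of the original answer.

```python
import pandas as pd

def scale_by_year(value: pd.Series, year: pd.Series) -> pd.Series:
    # Look up a multiplier per row; years without an entry become NaN
    # and are then replaced by 1.0, which leaves the value unchanged.
    factor = year.map({"2020": 0.856, "2019": 0.8566}).fillna(1.0)
    return value * factor

values = pd.Series([1.0, 1.0, 1.0])
years = pd.Series(["2020", "2019", "2018"])
result = scale_by_year(values, years)
print(result.tolist())
```

The same function body should work inside a function decorated with @F.pandas_udf(T.DoubleType()), assuming value_column is numeric and year is a string column.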

Only as a last resort should you go for udf:

@F.udf(T.DoubleType())
def new_column(row, year):
    if year == "2020":
        return row * 0.856 
    elif year == "2019": 
        return row * 0.8566
    else:
        return row

res = res.withColumn("final_value", new_column('value_column', 'year'))

res.show()
# +------------+----+-----------+
# |value_column|year|final_value|
# +------------+----+-----------+
# |         1.0|2020|      0.856|
# |         1.0|2019|     0.8566|
# |         1.0|2018|        1.0|
# +------------+----+-----------+

A uj5u.com user replied:

Use the plain string "double", or import DoubleType from PySpark:

# like this
# (pass the two-argument function directly; a one-argument lambda would drop `year`)
final_udf = F.udf(new_column, "double")

# or this
import pyspark.sql.types as T
final_udf = F.udf(new_column, T.DoubleType())
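
One thing to watch with a declared double return type: as far as I know, a plain Python udf does not convert the returned value for you, so returning a Python int where a double is declared tends to produce null in the column. Below is a minimal sketch of the function that always returns a float (or None for missing input). The dictionary lookup and the None check are my own additions, not part of the original answers, and the Spark wiring is left as comments because it assumes an active Spark session and the res DataFrame from above.

```python
def new_column(value, year):
    """Scale value by a year-dependent factor, returning a float or None."""
    if value is None:
        # Keep missing values missing instead of raising a TypeError
        return None
    factors = {"2020": 0.856, "2019": 0.8566}
    # float() guarantees the Python type matches the declared "double"
    return float(value) * factors.get(year, 1.0)

# Assumed wiring, requires an active Spark session:
# from pyspark.sql import functions as F
# final_udf = F.udf(new_column, "double")
# res = res.withColumn("final_value", final_udf("value_column", "year"))

print(new_column(1, "2018"), new_column(2.0, "2020"), new_column(None, "2019"))
```

Passing new_column itself to F.udf, rather than wrapping it in a one-argument lambda, keeps both the value and the year available to the function.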
