I am trying to create a column in PySpark using a UDF.
The code I tried is shown below:
# The function checks year and adds a multiplied value_column to the final column
def new_column(row, year):
    if year == "2020":
        return row * 0.856
    elif year == "2019":
        return row * 0.8566
    else:
        return row
final_udf = F.udf(lambda z: new_column(z), Double())  # How do I get - Double datatype here
res = res.withColumn("final_value", final_udf(F.col('value_column'), F.col('year')))
How can I write Double() in final_udf? I see that we can use StringType() for strings, but what should I do to return a double type in the "final_value" column?
Reply from a uj5u.com user:
Input:
from pyspark.sql import functions as F, types as T
res = spark.createDataFrame([(1.0, '2020',), (1.0, '2019',), (1.0, '2018',)], ['value_column', 'year'])
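A udf is very inefficient on large data, because every row has to be passed through the Python interpreter.
You should first try to do this in native Spark: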
res = res.withColumn(
    'final_value',
    F.when(F.col('year') == "2020", F.col('value_column') * 0.856)
     .when(F.col('year') == "2019", F.col('value_column') * 0.8566)
     .otherwise(F.col('value_column'))
)
res.show()
# +------------+----+-----------+
# |value_column|year|final_value|
# +------------+----+-----------+
# |         1.0|2020|      0.856|
# |         1.0|2019|     0.8566|
# |         1.0|2018|        1.0|
# +------------+----+-----------+
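If the list of years grows, the chained conditions can be generated from a lookup table instead of being written by hand. The following is only a sketch: `FACTORS`, `build_case_expr` and `add_final_value` are hypothetical names that do not appear in the thread, and the column names are assumed to be trusted constants rather than user input. It builds a SQL `CASE` expression as a string and hands it to `F.expr`; the `D` suffix marks each multiplier as a double literal in Spark SQL.

```python
# Hypothetical lookup table: year -> multiplier (values taken from the question).
FACTORS = {"2020": 0.856, "2019": 0.8566}


def build_case_expr(value_col: str, year_col: str, factors: dict) -> str:
    """Build a SQL CASE expression mirroring the chained F.when calls."""
    branches = " ".join(
        # The trailing D makes the multiplier a double literal in Spark SQL.
        f"WHEN {year_col} = '{year}' THEN {value_col} * {factor}D"
        for year, factor in factors.items()
    )
    return f"CASE {branches} ELSE {value_col} END"


def add_final_value(df):
    """Attach final_value to a DataFrame that has value_column and year."""
    # Imported lazily so the string builder can be checked without Spark.
    from pyspark.sql import functions as F

    expr = build_case_expr("value_column", "year", FACTORS)
    return df.withColumn("final_value", F.expr(expr))
```

With the sample input, `add_final_value(res)` should be usable in place of the hand-written `F.when` chain.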
If it is not possible in native Spark, move on to pandas_udf:
from pyspark.sql import functions as F, types as T
import pandas as pd
@F.pandas_udf(T.DoubleType())
def new_column(row: pd.Series, year: pd.Series) -> pd.Series:
    # A pandas_udf receives whole batches as Series, so the year check must be
    # element-wise; years without an entry keep a factor of 1.0.
    factor = year.map({"2020": 0.856, "2019": 0.8566}).fillna(1.0)
    return row * factor
res = res.withColumn("final_value", new_column('value_column', 'year'))
res.show()
# +------------+----+-----------+
# |value_column|year|final_value|
# +------------+----+-----------+
# |         1.0|2020|      0.856|
# |         1.0|2019|     0.8566|
# |         1.0|2018|        1.0|
# +------------+----+-----------+
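One point to keep in mind with pandas_udf: the function is called with whole pd.Series batches, not single values, so a plain `if year == "2020":` cannot be evaluated on its arguments (pandas raises a ValueError about an ambiguous truth value). The per-row logic has to be expressed element-wise. A small pandas-only sketch of that logic, which can be tried without a Spark session; `scale_by_year` and `FACTORS` are illustrative names, not part of the thread:

```python
import pandas as pd

# Illustrative lookup table: year -> multiplier (values from the question).
FACTORS = {"2020": 0.856, "2019": 0.8566}


def scale_by_year(values: pd.Series, years: pd.Series) -> pd.Series:
    """Multiply each value by the factor for its year (1.0 if not listed)."""
    # map() yields NaN for years missing from FACTORS; fillna(1.0) then
    # plays the role of the `else: return row` branch.
    factors = years.map(FACTORS).fillna(1.0)
    return values * factors


values = pd.Series([1.0, 1.0, 1.0])
years = pd.Series(["2020", "2019", "2018"])
result = scale_by_year(values, years)
```

The body of `scale_by_year` is what would go inside a function decorated with `@F.pandas_udf(T.DoubleType())`.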
Only as a last resort should you fall back to a plain udf:
@F.udf(T.DoubleType())
def new_column(row, year):
    if year == "2020":
        return row * 0.856
    elif year == "2019":
        return row * 0.8566
    else:
        return row
res = res.withColumn("final_value", new_column('value_column', 'year'))
res.show()
# +------------+----+-----------+
# |value_column|year|final_value|
# +------------+----+-----------+
# |         1.0|2020|      0.856|
# |         1.0|2019|     0.8566|
# |         1.0|2018|        1.0|
# +------------+----+-----------+
Reply from a uj5u.com user:
Use the simple string "double", or import DoubleType from pyspark:
# like this (new_column takes row and year, so pass it directly
# rather than through a one-argument lambda)
final_udf = F.udf(new_column, "double")
# or this
import pyspark.sql.types as T
final_udf = F.udf(new_column, T.DoubleType())
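A related detail when declaring a double return type: with a classic (non-Arrow) Python udf, the function is expected to return a Python float. As far as I know, returning an int (for example when value_column holds integers and the else branch returns the value unchanged) gives null instead of being cast. The sketch below guards against that; `new_column_as_float`, `FACTORS` and `make_final_udf` are hypothetical names introduced here for illustration.

```python
from typing import Optional

# Illustrative lookup table: year -> multiplier (values from the question).
FACTORS = {"2020": 0.856, "2019": 0.8566}


def new_column_as_float(value: Optional[float], year: Optional[str]) -> Optional[float]:
    """Scale value by the factor for year, always returning a float (or None)."""
    if value is None:
        return None  # keep nulls as nulls
    # float() ensures an int input still matches the declared double type.
    return float(value) * FACTORS.get(year, 1.0)


def make_final_udf():
    """Wrap the function as a Spark udf that returns a double."""
    # Imported lazily so the pure-Python logic can be checked without Spark.
    from pyspark.sql import functions as F

    return F.udf(new_column_as_float, "double")
```

It could then be applied as `res.withColumn("final_value", make_final_udf()("value_column", "year"))`.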
