I'm trying to add a new column to an existing DataFrame in PySpark. My DataFrame looks like the one below. I based my attempt on this post: "PySpark: Replacing value in a column by searching a dictionary".
Fruit
Orange
Orange
Apple
Banana
Apple
Here is the code I'm trying:
from pyspark.sql import functions as F
from itertools import chain
simple_dict = {'Orange': 'OR, 'Apple': 'AP', 'Banana': 'BN'}
mapping_expr = F.create_map([F.lit(x) for x in F.chain(*simple_dict.items())])
def addCols(data):
    data = (data.withColumn('Fruit_code', mapping_expr[data['Fruit']]))
    return data
Expected output:
Fruit Fruit_code
Orange OR
Orange OR
Apple AP
Banana BN
Apple AP
I get the following error. I know it is caused by the function F, but I don't know how to fix it. Can anyone help?
  File "/myproject/datasets/derived/opportunity_won.py", line 8, in <module>
    mapping_expr = create_map([lit(x) for x in chain(*simple_dict.items())])
  File "/myproject/datasets/derived/opportunity_won.py", line 8, in <listcomp>
    mapping_expr = create_map([lit(x) for x in chain(*simple_dict.items())])
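Answer from a uj5u.com user:
I modified your code snippet so that it works. chain is imported from itertools, not from pyspark.sql.functions, so it is called as chain(...) rather than F.chain(...). The missing closing quote after 'OR in simple_dict is also restored.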
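from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from itertools import chain

# Reuse the active session or start one (assumption: the original snippet relied on an existing `spark`).
spark = SparkSession.builder.getOrCreate()

simple_dict = {'Orange': 'OR', 'Apple': 'AP', 'Banana': 'BN'}
# Flatten the dict into key1, value1, key2, value2, ... and build a map column from the literals.
mapping_expr = F.create_map([F.lit(x) for x in chain(*simple_dict.items())])

def addCols(data):
    # Look up each row's Fruit value in the map column.
    data = (data.withColumn('Fruit_code', mapping_expr[data['Fruit']]))
    return data

data = spark.createDataFrame([("Orange", ), ("Apple", ), ("Banana", ), ], ("Fruit", ))
new_data = addCols(data)
new_data.show()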
Output:
+------+----------+
| Fruit|Fruit_code|
+------+----------+
|Orange|        OR|
| Apple|        AP|
|Banana|        BN|
+------+----------+
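To see why the mapping expression is built this way, it may help to look at what chain(*simple_dict.items()) produces on its own. The sketch below uses only the standard library and needs no Spark session. The variable name flat is introduced here for illustration and is not part of the original post. F.create_map expects its inputs as alternating key and value columns, which is the order this flattening yields.

```python
from itertools import chain

simple_dict = {'Orange': 'OR', 'Apple': 'AP', 'Banana': 'BN'}

# dict.items() yields (key, value) pairs; unpacking them into chain()
# concatenates the pairs into one flat sequence: key1, value1, key2, value2, ...
flat = list(chain(*simple_dict.items()))  # `flat` is a hypothetical name for this sketch
print(flat)
```

On Python 3.7 or later, where dicts keep insertion order, this prints ['Orange', 'OR', 'Apple', 'AP', 'Banana', 'BN']. Wrapping each element in F.lit turns that list into the literal columns passed to F.create_map.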
