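I have a Spark DataFrame. One of its columns is of array type and holds arrays of text strings of varying length. I'm looking for a way to add a new column containing the array of unique leftmost 8 characters of those strings.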
df.printSchema()
root
 (...)
 |-- arr_agent: array (nullable = true)
 |    |-- element: string (containsNull = true)
Sample data in the "arr_agent" column:
["NRCANL2AXXX", "NRCANL2A"]
["UTRONL2U", "BKRBNL2AXXX", "BKRBNL2A"]
["NRCANL2A"]
["UTRONL2U", "REUWNL2A002", "BKRBNL2A", "REUWNL2A", "REUWNL2N"]
["UTRONL2U", "UTRONL2UXXX", "BKRBNL2A"]
["MQBFDEFFYYY", "MQBFDEFFZZZ", "MQBFDEFF" ]
What I need in the new column:
["NRCANL2A"]
["UTRONL2U", "BKRBNL2A"]
["NRCANL2A"]
["UTRONL2U", "BKRBNL2A", "REUWNL2A", "REUWNL2N"]
["UTRONL2U", "BKRBNL2A"]
["MQBFDEFF" ]
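I have tried defining a UDF that does this for me.
from pyspark.sql import functions as F
from pyspark.sql import types as T

def make_list_of_unique_prefixes(text_array, prefix_length=8):
    if text_array is None:  # the column is nullable
        return None
    # Return a list, not a set, to match the ArrayType declared for the UDF below.
    # dict.fromkeys drops duplicates while keeping first-seen order.
    return list(dict.fromkeys(t[:prefix_length] for t in text_array))

make_list_of_unique_prefixes_udf = F.udf(lambda x, y=8: make_list_of_unique_prefixes(x, y), T.ArrayType(T.StringType()))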
But then calling:
df.withColumn("arr_prefix8s", F.collect_set( make_list_of_unique_prefixes_udf(F.col("arr_agent") )))
raises an error:
AnalysisException: grouping expressions sequence is empty,
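Any hints would be greatly appreciated. Thanks.
Reply from a uj5u.com user:
You can solve this with the higher-order functions available since Spark 2.4: use transform with substring to take the prefix of each element, then apply array_distinct to the resulting array: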
from pyspark.sql import functions as F
n = 8
# Spark SQL's substring is 1-based, so start at position 1
out = df.withColumn("New", F.expr(f"array_distinct(transform(arr_agent, x -> substring(x, 1, {n})))"))
out.show(truncate=False)
+-----------------------------------------------------+----------------------------------------+
|arr_agent                                            |New                                     |
+-----------------------------------------------------+----------------------------------------+
|[NRCANL2AXXX, NRCANL2A]                              |[NRCANL2A]                              |
|[UTRONL2U, BKRBNL2AXXX, BKRBNL2A]                    |[UTRONL2U, BKRBNL2A]                    |
|[NRCANL2A]                                           |[NRCANL2A]                              |
|[UTRONL2U, REUWNL2A002, BKRBNL2A, REUWNL2A, REUWNL2N]|[UTRONL2U, REUWNL2A, BKRBNL2A, REUWNL2N]|
|[UTRONL2U, UTRONL2UXXX, BKRBNL2A]                    |[UTRONL2U, BKRBNL2A]                    |
|[MQBFDEFFYYY, MQBFDEFFZZZ, MQBFDEFF]                 |[MQBFDEFF]                              |
+-----------------------------------------------------+----------------------------------------+
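As a side note on the error in the question: it appears to come from F.collect_set, which is an aggregate function, so using it inside withColumn without a groupBy leads to the "grouping expressions sequence is empty" exception. If a UDF is still preferred over the built-in expression, a minimal sketch could look like the one below. The names unique_prefixes and add_prefix_column are hypothetical and not from the original posts. pyspark is imported inside the function only so that the plain-Python helper can be tried without a Spark session. Skipping null elements is an assumption about the desired behaviour.

```python
def unique_prefixes(text_array, prefix_length=8):
    """Return the distinct prefixes of the strings in text_array, in first-seen order."""
    if text_array is None:  # the column is nullable
        return None
    # Assumption: null elements (containsNull = true) are skipped rather than kept.
    return list(dict.fromkeys(t[:prefix_length] for t in text_array if t is not None))


def add_prefix_column(df, src="arr_agent", dst="arr_prefix8s", prefix_length=8):
    """Hypothetical helper: apply unique_prefixes row by row, with no aggregate."""
    from pyspark.sql import functions as F
    from pyspark.sql import types as T

    prefix_udf = F.udf(
        lambda arr: unique_prefixes(arr, prefix_length),
        T.ArrayType(T.StringType()),
    )
    # No collect_set here: the UDF already returns one array per row.
    return df.withColumn(dst, prefix_udf(F.col(src)))


# Plain-Python check on two rows of the sample data
print(unique_prefixes(["NRCANL2AXXX", "NRCANL2A"]))
print(unique_prefixes(["UTRONL2U", "REUWNL2A002", "BKRBNL2A", "REUWNL2A", "REUWNL2N"]))
```

With a DataFrame df as in the question, add_prefix_column(df).show(truncate=False) should give the same arrays as the array_distinct expression. The built-in expression is usually the better choice, since it avoids moving each row through the Python interpreter.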
