I have a df with two columns. One column is a string, the other is an array of integers.
root
 |-- col1: string (nullable = true)
 |-- col2: array (nullable = true)
 |    |-- element: integer (containsNull = true)
The dataframe looks like this:
+--------------------+------------+
|                col1|        col2|
+--------------------+------------+
|Barkley likes peo...|[22, 22, 25]|
+--------------------+------------+
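The array tells me where I need to split the sentences in col1.
If the value in col1 is "Barkley likes people. Barkley likes treats. Barkley likes everything.", the array tells me that characters 0 to 22 are the first sentence, 22 to 44 (22 + 22) are the second sentence, and the last sentence runs from 44 (22 + 22) to 69 (44 + 25).
I need to avoid sending anything to the driver node and keep the parallelism. So my question is: how can I create a udf that uses the integers in the array to split the sentences in col1? The output can use withColumn and return three new columns, or a map of the sentences. Can I do this without for loops, list comprehensions, collect(), or select()?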
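To make the offsets described above concrete, here is a minimal sketch in plain Python. It is not Spark and not a distributed solution; it only checks the arithmetic locally, and the variable names are my own.

```python
from itertools import accumulate

text = "Barkley likes people. Barkley likes treats. Barkley likes everything."
lengths = [22, 22, 25]

# Running totals give the end offset of each sentence: [22, 44, 69].
ends = list(accumulate(lengths))
# Each sentence starts where the previous one ended: [0, 22, 44].
starts = [0] + ends[:-1]

# Slice with zero-based, end-exclusive bounds, as in the description above.
sentences = [text[s:e] for s, e in zip(starts, ends)]

print(list(zip(starts, ends)))  # [(0, 22), (22, 44), (44, 69)]
print(sentences)
# ['Barkley likes people. ', 'Barkley likes treats. ', 'Barkley likes everything.']
```

The first two pieces keep their trailing space because the lengths in the array include it.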
Answer:
For Spark >= 2.4, we can use higher-order functions to work with arrays, and they cover this problem too. Assume df is the dataframe.
df = spark.createDataFrame([
("Barkley likes people. Barkley likes treats. Barkley likes everything.",[22, 22, 25]),
("A sentence. Another sentence.",[13, 18]),
("One sheep. Two sheep. Three sheep. Four sheep.",[11, 12, 13, 12])],
"col1:string, col2:array<int>")
df.show()
# +--------------------+----------------+
# |                col1|            col2|
# +--------------------+----------------+
# |Barkley likes peo...|    [22, 22, 25]|
# |A sentence. Anoth...|        [13, 18]|
# |One sheep. Two sh...|[11, 12, 13, 12]|
# +--------------------+----------------+
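To cut the sentences out of col1, use the substring function, which takes a start position and a length as arguments. col2 holds the length of each sentence in the string. As the question implies, the start position of the n-th sentence is the cumulative sum of the elements of col2 from index 0 to n-1. To compute it, use the higher-order functions transform and aggregate. After that, extract each sentence and use map_from_entries to build a map from each sentence's index to the sentence. Here is an example that does this.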
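import pyspark.sql.functions as F
df = (df
    # start[i] = sum of the first i lengths, i.e. the offset at which sentence i begins
    .withColumn("start", F.expr("transform(transform(col2, (v1, i) -> slice(col2, 1, i)), v2 -> aggregate(v2, 0, (a, b) -> a + b))"))
    # Cut each sentence out of col1. Note that substring positions are 1-based (0 is treated like 1),
    # so for 0-based offsets like those in the question you may want start[i] + 1 here.
    .withColumn("sentences", F.expr("transform(col2, (v, i) -> struct(i + 1 as index, substring(col1, start[i], col2[i]) as sentence))"))
    # Turn the array of (index, sentence) structs into a map.
    .selectExpr("col1", "map_from_entries(sentences) as sentences")
)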
df.show(truncate=False)
# +---------------------------------------------------------------------+------------------------------------------------------------------------------------------+
# |col1                                                                 |sentences                                                                                 |
# +---------------------------------------------------------------------+------------------------------------------------------------------------------------------+
# |Barkley likes people. Barkley likes treats. Barkley likes everything.|[1 -> Barkley likes people. , 2 ->  Barkley likes treats., 3 ->  Barkley likes everything]|
# |A sentence. Another sentence.                                        |[1 -> A sentence. A, 2 -> Another sentence.]                                              |
# |One sheep. Two sheep. Three sheep. Four sheep.                       |[1 -> One sheep. , 2 ->  Two sheep. , 3 -> Three sheep. , 4 -> Four sheep.]               |
# +---------------------------------------------------------------------+------------------------------------------------------------------------------------------+
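To see what the nested transform / slice / aggregate expression computes, and why the first row of the output above ends in "everything" without the final period, here is a rough plain-Python mirror of the Spark expressions. It is only a sketch under assumptions: the helper names (`starts_from_lengths`, `sql_substring`, `split_by_lengths`) and the `offset` parameter are hypothetical, and `sql_substring` is my approximation of how Spark's substring treats non-negative positions (1-based, with 0 behaving like 1).

```python
from functools import reduce


def starts_from_lengths(lengths):
    # Mirrors transform(col2, (v1, i) -> slice(col2, 1, i)):
    # for each index i, take the first i lengths.
    prefixes = [lengths[:i] for i in range(len(lengths))]
    # Mirrors transform(..., v2 -> aggregate(v2, 0, (a, b) -> a + b)):
    # sum each prefix, starting from 0.
    return [reduce(lambda a, b: a + b, prefix, 0) for prefix in prefixes]


def sql_substring(text, pos, length):
    # Approximation of Spark's substring for pos >= 0 only:
    # positions are 1-based, and a position of 0 behaves like 1.
    start = pos - 1 if pos > 0 else 0
    return text[start:start + length]


def split_by_lengths(text, lengths, offset=0):
    # offset=0 mirrors substring(col1, start[i], col2[i]);
    # offset=1 shifts the 0-based starts to 1-based positions.
    starts = starts_from_lengths(lengths)
    return {
        i + 1: sql_substring(text, starts[i] + offset, lengths[i])
        for i in range(len(lengths))
    }


text = "Barkley likes people. Barkley likes treats. Barkley likes everything."
lengths = [22, 22, 25]

print(starts_from_lengths(lengths))  # [0, 22, 44]
print(split_by_lengths(text, lengths))
# {1: 'Barkley likes people. ', 2: ' Barkley likes treats.', 3: ' Barkley likes everything'}
print(split_by_lengths(text, lengths, offset=1))
# {1: 'Barkley likes people. ', 2: 'Barkley likes treats. ', 3: 'Barkley likes everything.'}
```

With `offset=0` the second and third pieces start one character early, so they pick up a leading space and lose their last character. If the lengths in col2 follow the 0-based convention from the question, passing `start[i] + 1` to substring in the Spark expression should line the pieces up, as the `offset=1` call suggests.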
