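I have a DataFrame df containing a struct-array column properties (an array column whose elements are structs with the fields x and y), and I want to create a new array column by extracting the x values from the properties column.
A sample input DataFrame would look like this: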
import pyspark.sql.functions as F
from pyspark.sql.types import *
data = [
    (1, [{'x':11, 'y':'str1a'}, ]),
    (2, [{'x':21, 'y':'str2a'}, {'x':22, 'y':0.22, 'z':'str2b'}, ]),
]
my_schema = StructType([
    StructField('id', LongType()),
    StructField('properties', ArrayType(
        StructType([
            StructField('x', LongType()),
            StructField('y', StringType()),
        ])
    )),
])
# assumes an active SparkSession bound to the name `spark`
df = spark.createDataFrame(data, schema=my_schema)
df.show()
# +---+--------------------+
# | id|          properties|
# +---+--------------------+
# |  1|       [[11, str1a]]|
# |  2|[[21, str2a], [22...|
# +---+--------------------+
df.printSchema()
# root
#  |-- id: long (nullable = true)
#  |-- properties: array (nullable = true)
#  |    |-- element: struct (containsNull = true)
#  |    |    |-- x: long (nullable = true)
#  |    |    |-- y: string (nullable = true)
The desired output df_new, on the other hand, should look like this:
df_new.show()
# +---+--------------------+--------+
# | id|          properties|x_values|
# +---+--------------------+--------+
# |  1|       [[11, str1a]]|    [11]|
# |  2|[[21, str2a], [22...|[21, 22]|
# +---+--------------------+--------+
df_new.printSchema()
# root
#  |-- id: long (nullable = true)
#  |-- properties: array (nullable = true)
#  |    |-- element: struct (containsNull = true)
#  |    |    |-- x: long (nullable = true)
#  |    |    |-- y: string (nullable = true)
#  |-- x_values: array (nullable = true)
#  |    |-- element: long (containsNull = true)
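Does anyone know a solution for this kind of task?
Ideally, I am looking for a solution that works row by row without relying on F.explode. In my actual data I have not yet identified an equivalent of the id column, and after calling F.explode I am not sure how to merge the exploded values back together.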
Answer:
Try selecting properties.x, which extracts the x value from every struct in the properties array.
Example:
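from pyspark.sql.functions import col, expr

# selecting a struct field through an array column yields an array of that field
df.withColumn("x_values", col("properties.x")).show(10, False)
# or by using higher-order functions (the SQL transform function needs Spark 2.4+)
df.withColumn("x_values", expr("transform(properties, p -> p.x)")).show(10, False)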
# +---+-------------------------+--------+
# |id |properties               |x_values|
# +---+-------------------------+--------+
# |1  |[[11, str1a]]            |[11]    |
# |2  |[[21, str2a], [22, 0.22]]|[21, 22]|
# +---+-------------------------+--------+
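To make the per-row behaviour concrete without needing a Spark session, here is a plain-Python analogy of what transform(properties, p -> p.x) computes. It is only a sketch, not Spark code: rows are modelled as dicts, structs as dicts, and the helper name extract_field is hypothetical. It shows why no id column or explode/merge step is needed, since each row's array is mapped independently.

```python
# Plain-Python analogy (not Spark code): each row is a dict and the
# `properties` column is a list of dicts standing in for an array of structs.
rows = [
    {"id": 1, "properties": [{"x": 11, "y": "str1a"}]},
    {"id": 2, "properties": [{"x": 21, "y": "str2a"}, {"x": 22, "y": "0.22"}]},
]


def extract_field(row, array_col, field):
    """Hypothetical helper mirroring transform(array_col, p -> p.field) for one row."""
    structs = row.get(array_col)
    if structs is None:
        # a null array stays null instead of becoming an empty list
        return None
    # a null struct element yields a null value for the field
    return [None if s is None else s.get(field) for s in structs]


# Add the new "column" row by row; no join key is involved.
rows_new = [{**row, "x_values": extract_field(row, "properties", "x")} for row in rows]
x_values = [row["x_values"] for row in rows_new]
print(x_values)  # [[11], [21, 22]]
```

Because the mapping never leaves the row, the array lengths and element order are preserved, which is the property that an explode followed by a regrouping would have to recover with an explicit key.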
