我有以下DF:
---- ------ -------- ---- ---- ---- ---- ----------
| ID | Name | Vl1 | Vl2| Vl3| Vl4|Vl5 | Vl6 |
---- ------ -------- ---- ---- ---- ---- ----------
|1 |John | 1.5 |null|null|null| A|2022-01-01|
|1 |John | 1 |null|null|null| A|2022-01-01|
|1 |John | 3 |null|1 |null| A|2022-01-01|
|1 |John | 4 |null|1 |null| A|2022-01-01|
|2 |Ana | 2.5 |null|null|null| A|2022-01-01|
|2 |Ana | 0 |null|null|null| A|2022-01-01|
|2 |Ana | null|null|null|null| A|2022-01-01|
|2 |Ana | 2 |null|null|null| A|2022-01-01|
|2 |Ana | 2 |2 |null|null| A|2022-01-01|
|2 |Ana | 1 |null|null|null| A|2022-01-01|
|3 |Paul | 5 |null|null|null| A|2022-01-01|
|3 |Paul | null|2 |null|null| A|2022-01-01|
|3 |Paul | 2.5 |null|2 |null| A|2022-01-01|
|3 |Paul | null|null|3 |null| A|2022-01-01|
---- ------ -------- ---- ---- ---- ---- ----------
如何生成以下嵌套結構:
|-- Title: string (nullable = true)
|-- Company: string (nullable = true)
|-- Client: array (nullable = false)
| |-- element: struct (containsNull = false)
| | |-- Vl1: double (nullable = true)
| | |-- Vl2: double (nullable = true)
| | |-- Prch: array (nullable = false)
| | | |-- element: struct (containsNull = false)
| | | | |-- Vl3: string (nullable = true)
| | | | |-- Detail: array (nullable = true)
| | | | | |-- element: struct (containsNull = false)
| | | | | | |-- Vl4: date (nullable = true)
| | | | |-- Bs: array (nullable = true)
| | | | | |-- element: struct (containsNull = false)
| | | | | | |-- Vl5: string (nullable = true)
| | | | | | |-- Vl6: date (nullable = true)
到目前為止,我使用 withcolumn 來生成 columnnsTiltle和Company. 然后我用groupbygroup by clients ,然后我用collect_list生成嵌套結構的第一級,但是如何生成結構的其他級別(Prch,,, Detail)Bs。
要知道,我的 DF 有 10 億行。我想知道生成這種結構的最佳方法。我的火花環境有 4 個工人,每個工人有 5 個核心。
MVCE:
data = [
("1","John",1.5,None,None,None,"A", "2022-01-01"),
("1","John",1.0,None,None,None,"A", "2022-01-01"),
("1","John",3.0,None,1.0,None,"A", "2022-01-01"),
("1","John",4.0,None,1.0,None,"A", "2022-01-01"),
("2","Ana",2.5,None,None,None,"A", "2022-01-01"),
("2","Ana",0.0,None,None,None,"A", "2022-01-01"),
("2","Ana",None,None,None,None,"A", "2022-01-01"),
("2","Ana",2.0,None,None,None,"A", "2022-01-01"),
("2","Ana",2.0,2.0,None,None,"A", "2022-01-01"),
("2","Ana",1.0,None,None,None,"A", "2022-01-01"),
("3","Paul",5.0,None,None,None,"A", "2022-01-01"),
("3","Paul",None,2.0,None,None,"A", "2022-01-01"),
("3","Paul",2.5,None,2.0,None,"A", "2022-01-01"),
("3","Paul",None,None,3.0,None,"A", "2022-01-01")
]
schema = StructType([
StructField("Id", StringType(),True),
StructField("Name", StringType(),True),
StructField("Vl1", DoubleType(),True),
StructField("Vl2", DoubleType(), True),
StructField("Vl3", DoubleType(), True),
StructField("Vl4", DateType(), True),
StructField("Vl5", StringType(), True),
StructField("Vl6", StringType(), True)
])
df = spark.createDataFrame(data=data,schema=schema)
df.printSchema()
df.show()
uj5u.com熱心網友回復:
您必須按順序對它們進行分組,從最里面的每個嵌套級別向外移動。下面第一行是創建Vl3、Details、Bs,然后第二行是創建Prch列,第三行是創建Client列。(您不必為每個步驟創建多個資料框,這只是為了解釋。)
df1 = df.groupBy("Id", "Name", "Vl1", "Vl2", "Vl3").agg(fn.collect_list(fn.struct(fn.col("Vl4"))).alias("Detail"), fn.collect_list(fn.struct(fn.col("Vl5"), fn.col("Vl6"))).alias("Bs"))
df2 = df1.groupBy("Id", "Name", "Vl1", "Vl2").agg(fn.collect_list(fn.struct(fn.col("Vl3"), fn.col("Detail"), fn.col("Bs"))).alias("Prch"))
df3 = df2.groupBy("Id", "Name").agg(fn.collect_list(fn.struct(fn.col("Vl1"), fn.col("Vl2"), fn.col("Prch"))).alias("Client"))
這為您提供了此架構:
root
|-- Id: string (nullable = true)
|-- Name: string (nullable = true)
|-- Client: array (nullable = false)
| |-- element: struct (containsNull = false)
| | |-- Vl1: double (nullable = true)
| | |-- Vl2: double (nullable = true)
| | |-- Prch: array (nullable = false)
| | | |-- element: struct (containsNull = false)
| | | | |-- Vl3: double (nullable = true)
| | | | |-- Detail: array (nullable = false)
| | | | | |-- element: struct (containsNull = false)
| | | | | | |-- Vl4: date (nullable = true)
| | | | |-- Bs: array (nullable = false)
| | | | | |-- element: struct (containsNull = false)
| | | | | | |-- Vl5: string (nullable = true)
| | | | | | |-- Vl6: string (nullable = true)
轉載請註明出處,本文鏈接:https://www.uj5u.com/caozuo/491791.html
