I have a DataFrame that looks like this:
+----------------------------+
|Item                        |
+----------------------------+
|[[a,b,c], [d,e,f], [g,h,i]] |
+----------------------------+
How can I convert it into the following table?
a b c
d e f
g h i
I tried using the explode and withColumn functions, but I got every combination of the elements instead:
a b c
a e c
a h c
d b c
d e c
d h c
... (many other combinations)
uj5u.com user reply:
You only need to explode the first-level array; after that you can select the elements of each inner array as columns:
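import pyspark.sql.functions as F

# Assumes an active SparkSession bound to `spark` (as in the pyspark shell).
df = spark.createDataFrame(
    [([["a", "b", "c"], ["d", "e", "f"], ["g", "h", "i"]],)],
    ["Item"]
)

# Explode the outer array (one row per inner array), then turn each
# position of the inner array into its own column.
df.withColumn("Item", F.explode("Item")).select(
    *[F.col("Item")[i].alias(f"col_{i}") for i in range(3)]
).show()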
#+-----+-----+-----+
#|col_0|col_1|col_2|
#+-----+-----+-----+
#|    a|    b|    c|
#|    d|    e|    f|
#|    g|    h|    i|
#+-----+-----+-----+
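To make the two steps concrete, here is a minimal plain-Python sketch of what exploding the first level and then selecting by position does to a single Item cell. It is only a model on ordinary lists, not Spark code, and the helper names explode_first_level and select_positions are made up for illustration.

```python
from typing import Any, List, Tuple


def explode_first_level(cell: List[List[Any]]) -> List[List[Any]]:
    """Model of exploding one array-of-arrays cell: one row per inner array."""
    return [list(inner) for inner in cell]


def select_positions(rows: List[List[Any]], n_cols: int) -> List[Tuple[Any, ...]]:
    """Model of selecting Item[0], Item[1], ... as separate columns."""
    return [tuple(row[i] for i in range(n_cols)) for row in rows]


item = [["a", "b", "c"], ["d", "e", "f"], ["g", "h", "i"]]

rows = explode_first_level(item)   # three rows, each still a 3-element list
table = select_positions(rows, 3)  # three rows, three columns

for row in table:
    print(*row)
```

Because only the outer level is exploded, each inner array stays together as one row, so no cross-combination of elements can appear.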
uj5u.com user reply:
@blackbishop, building on your answer, here is a version that also handles inner arrays of different lengths:
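import pyspark.sql.functions as F

# Assumes an active SparkSession bound to `spark`.
df = spark.createDataFrame(
    [([["a", "b", "c"], ["d", "e", "f"], ["g", "h", "i", "j"]],)],
    ["data"]
)
df.show(20, False)  # truncate=False so the full nested array is printed

# One row per inner array.
df = df.withColumn("data1", F.explode("data"))
df.select("data1").show()

# Length of the longest inner array:
# collect() returns [Row(max(size(data1))=4)], so [0][0] is 4.
max_size = df.select(F.max(F.size("data1"))).collect()[0][0]

# One column per position; shorter arrays show null in the extra columns.
df.select(
    *[F.col("data1")[i].alias(f"col_{i}") for i in range(max_size)]
).show()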
+------------------------------------+
|data                                |
+------------------------------------+
|[[a, b, c], [d, e, f], [g, h, i, j]]|
+------------------------------------+

+------------+
|       data1|
+------------+
|   [a, b, c]|
|   [d, e, f]|
|[g, h, i, j]|
+------------+

+-----+-----+-----+-----+
|col_0|col_1|col_2|col_3|
+-----+-----+-----+-----+
|    a|    b|    c| null|
|    d|    e|    f| null|
|    g|    h|    i|    j|
+-----+-----+-----+-----+
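The padding behaviour in that last table can be sketched in plain Python as well. This is a model on ordinary lists under the assumption that an out-of-range position yields a missing value (None here, null in the output above); get_or_none and to_padded_rows are hypothetical helper names, not PySpark functions.

```python
from typing import Any, List, Optional, Tuple


def get_or_none(row: List[Any], i: int) -> Optional[Any]:
    """Return row[i], or None when position i is past the end of the row."""
    return row[i] if 0 <= i < len(row) else None


def to_padded_rows(cell: List[List[Any]]) -> List[Tuple[Any, ...]]:
    """One tuple per inner list, padded with None up to the longest length."""
    max_size = max((len(inner) for inner in cell), default=0)
    return [
        tuple(get_or_none(inner, i) for i in range(max_size))
        for inner in cell
    ]


data = [["a", "b", "c"], ["d", "e", "f"], ["g", "h", "i", "j"]]
padded = to_padded_rows(data)

for row in padded:
    print(row)
```

Computing the maximum length first is what lets the column list be built once for all rows; the cost is one extra pass over the data, which in the Spark version corresponds to the collect() call.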
