I'm trying to join two dataframes in pyspark, but with one table nested inside the other as an array column.
For example, given these tables:
from pyspark.sql import Row
df1 = spark.createDataFrame([
    Row(a = 1, b = 'C', c = 26, d = 'abc'),
    Row(a = 1, b = 'C', c = 27, d = 'def'),
    Row(a = 1, b = 'D', c = 51, d = 'ghi'),
    Row(a = 2, b = 'C', c = 40, d = 'abc'),
    Row(a = 2, b = 'D', c = 45, d = 'abc'),
    Row(a = 2, b = 'D', c = 38, d = 'def')
])
df2 = spark.createDataFrame([
    Row(a = 1, b = 'C', e = 2, f = 'cba'),
    Row(a = 1, b = 'D', e = 3, f = 'ihg'),
    Row(a = 2, b = 'C', e = 7, f = 'cba'),
    Row(a = 2, b = 'D', e = 9, f = 'cba')
])
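I want to join df1 to df2 on columns a and b, but df1.c and df1.d should end up in a single array-type column. In addition, all column names should be preserved. The resulting dataframe should be convertible to this json structure (example built from the first two rows of df1):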
{
    "a": 1,
    "b": "C",
    "e": 2,
    "f": "cba",
    "df1": [
        {
            "c": 26,
            "d": "abc"
        },
        {
            "c": 27,
            "d": "def"
        }
    ]
}
Any ideas on how to achieve this would be greatly appreciated!
Thanks,
Carolina
A uj5u.com user replied:
Based on your sample input data:
Aggregate df1
from pyspark.sql import functions as F
df1 = df1.groupBy("a", "b").agg(
    F.collect_list(F.struct(F.col("c"), F.col("d"))).alias("df1")
)
df1.show()
+---+---+--------------------+
|  a|  b|                 df1|
+---+---+--------------------+
|  1|  C|[[26, abc], [27, ...|
|  1|  D|         [[51, ghi]]|
|  2|  D|[[45, abc], [38, ...|
|  2|  C|         [[40, abc]]|
+---+---+--------------------+
Join with df2
df3 = df1.join(df2, on=["a", "b"])
df3.show()
+---+---+--------------------+---+---+
|  a|  b|                 df1|  e|  f|
+---+---+--------------------+---+---+
|  1|  C|[[26, abc], [27, ...|  2|cba|
|  1|  D|         [[51, ghi]]|  3|ihg|
|  2|  D|[[45, abc], [38, ...|  9|cba|
|  2|  C|         [[40, abc]]|  7|cba|
+---+---+--------------------+---+---+
