Suppose we have the following two tables:
+---------+--------+
|AUTHOR_ID|  NAME  |
+---------+--------+
|   102   | Camus  |
|   103   | Hugo   |
+---------+--------+

+---------+--------+-----------+
|AUTHOR_ID|BOOK_ID | BOOK_NAME |
+---------+--------+-----------+
|    1    | Camus  | Etranger  |
|    1    | Hugo   | Mesirable |
+---------+--------+-----------+
I want to join these two tables to get a DataFrame with the following schema:
root
 |-- AUTHORID: integer
 |-- NAME: string
 |-- BOOK_LIST: array
 |    |-- BOOK_ID: integer
 |    |-- BOOK_NAME: string
I am using pyspark. Thanks in advance.
Answer from a uj5u.com user:
A simple join followed by a group by should do the job:
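from pyspark.sql import functions as F

# Left join keeps every author; then collect each author's books into one array column
result = (
    df_authors.join(df_books, on=["AUTHOR_ID"], how="left")
    .groupBy("AUTHOR_ID", "NAME")
    .agg(F.collect_list(F.struct("BOOK_ID", "BOOK_NAME")).alias("BOOK_LIST"))
)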
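In the aggregation, collect_list is applied to a struct of the book columns, which produces an array of structs for each author.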
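To make the shape of the result concrete, here is a minimal plain-Python sketch of what the join, group by, and collect step computes. It does not use Spark, and the helper name nest_books is hypothetical. The sample rows are also assumptions: the book rows are given numeric BOOK_ID values and AUTHOR_ID keys that match the authors table, which the tables in the question do not show.

```python
from collections import defaultdict

# Hypothetical sample rows (assumed, not taken verbatim from the question).
authors = [(102, "Camus"), (103, "Hugo")]                # (AUTHOR_ID, NAME)
books = [(102, 1, "Etranger"), (103, 2, "Mesirable")]    # (AUTHOR_ID, BOOK_ID, BOOK_NAME)


def nest_books(authors, books):
    """Attach to each author a list of {BOOK_ID, BOOK_NAME} records."""
    by_author = defaultdict(list)
    for author_id, book_id, book_name in books:
        by_author[author_id].append({"BOOK_ID": book_id, "BOOK_NAME": book_name})
    # Every author is kept, as with a left join; authors without books get [].
    return [
        {"AUTHOR_ID": author_id, "NAME": name, "BOOK_LIST": by_author.get(author_id, [])}
        for author_id, name in authors
    ]


result = nest_books(authors, books)
for row in result:
    print(row)
```

One difference worth keeping in mind: in this sketch an author with no books ends up with an empty list. In Spark, a left join leaves the book columns null for such an author, and a struct built from null fields is itself non-null, so collect_list would most likely return an array holding one struct with null fields. If that is not wanted, an inner join or a filter on BOOK_ID before aggregating are possible options.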
