在本地運行 pyspark 時,我得到了按 BOOK_ID 排序的串列的正確結果,但是在部署 AWS Glue 作業時,書籍似乎沒有排序
root
|-- AUTHORID: integer
|-- NAME: string
|-- BOOK_LIST: array
| |-- BOOK_ID: integer
| |-- BOOK_NAME: string
from pyspark.sql import functions as F
result = (df_authors.join(df_books, on=["AUTHOR_ID"], how="left")
.orderBy(F.col("BOOK_ID").desc())
.groupBy("AUTHOR_ID", "NAME")
.agg(F.collect_list(F.struct("BOOK_ID", "BOOK_NAME")))
)
注意:我正在使用pyspark 3.2.1和Glue 2.0
請有任何建議
uj5u.com熱心網友回復:
假設
盡管我設法在支持的 Glue 3.0 上運行該作業spark 3.1,但orderBy仍然給出錯誤的結果

解釋是:膠水作業可能有許多允許并行的工人,因此 orderBy 不能給出正確的結果,相反我們只有一個工人
建議的解決方案
- 使用最少數量的工人(這不是一個相關的解決方案)
- 在每個資料幀之前應用.orderBy
join - 或使用
.coalesce(1)
result = (df_authors.join(df_books, on=["AUTHOR_ID"], how="left")
.coalesce(1)
.orderBy(F.col("BOOK_ID").desc())
.groupBy("AUTHOR_ID", "NAME")
.agg(F.collect_list(F.struct("BOOK_ID", "BOOK_NAME")))
)
這可以得到正確的結果,但在這種情況下,我們會失去性能
uj5u.com熱心網友回復:
我試圖簡化問題,與我合作:
讓我們創建一個資料框示例:
>>> df = spark.createDataFrame([
{"book_id": 1, "author_id": 1, "name": "David", "book_name": "Kill Bill"},
{"book_id": 2, "author_id": 2, "name": "Roman", "book_name": "Dying is Hard"},
{"book_id": 3, "author_id": 3, "name": "Moshe", "book_name": "Apache Kafka The Easy Way"},
{"book_id": 4, "author_id": 1, "name": "David", "book_name": "Pyspark Is Awesome"},
{"book_id": 5, "author_id": 2, "name": "Roman", "book_name": "Playing a Piano"},
{"book_id": 6, "author_id": 3, "name": "Moshe", "book_name": "Awesome Scala"}
])
現在,這樣做:
(
df
.groupBy("author_id", "name")
.agg(F.collect_list(F.struct("book_id", "book_name")).alias("data"), F.sum("book_id").alias("sorted_key"))
.orderBy(F.col("sorted_key").desc()).drop("sorted_key")
.show(10, False)
)
我得到的正是你所要求的:
--------- ----- ----------------------------------------------------
|author_id|name |collect_list(struct(book_id, book_name)) |
--------- ----- ----------------------------------------------------
|3 |Moshe|[{3, Apache Kafka The Easy Way}, {6, Awesome Scala}]|
|2 |Roman|[{2, Dying is Hard}, {5, Playing a Piano}] |
|1 |David|[{1, Kill Bill}, {4, Pyspark Is Awesome}] |
--------- ----- ----------------------------------------------------
轉載請註明出處,本文鏈接:https://www.uj5u.com/shujuku/441160.html
上一篇:PysparkErrorwithreturn_compile(pattern,flags).findall(string)-如何排除故障?
下一篇:無法創建新頁面IONIC6
