Suppose we have the following DataFrame schema:
root
 |-- AUTHOR_ID: integer (nullable = false)
 |-- NAME: string (nullable = true)
 |-- Books: array (nullable = false)
 |    |-- element: struct (containsNull = false)
 |    |    |-- BOOK_ID: integer (nullable = false)
 |    |    |-- Chapters: array (nullable = true)
 |    |    |    |-- element: struct (containsNull = true)
 |    |    |    |    |-- NAME: string (nullable = true)
 |    |    |    |    |-- NUMBER_PAGES: integer (nullable = true)
How can I find the authors who have at least one book with NUMBER_PAGES < 100?
Thanks.
A uj5u.com user replied:
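According to your data structure, the NUMBER_PAGES of a given BOOK_ID is the sum of the NUMBER_PAGES of its chapters.
You can use the aggregate function to compute the page count of each book, and then filter with the exists function: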
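from pyspark.sql import functions as F

# Keep authors with at least one book whose chapters total fewer than 100 pages.
df1 = df.filter(
    F.exists(
        "Books",
        # Sum NUMBER_PAGES over the book's chapters, starting from 0.
        lambda x: F.aggregate(
            x["Chapters"], F.lit(0), lambda a, b: a + b["NUMBER_PAGES"]
        ) < F.lit(100),
    )
)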
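To make the logic of that filter concrete, here is a plain-Python sketch of what the exists/aggregate combination computes. It does not use Spark at all; the sample rows and the helper names book_pages and has_short_book are hypothetical and only mirror the schema above.

```python
from functools import reduce

# Hypothetical sample rows shaped like the schema above (not from the original post).
authors = [
    {"AUTHOR_ID": 1, "NAME": "Alice", "Books": [
        {"BOOK_ID": 10, "Chapters": [
            {"NAME": "Intro", "NUMBER_PAGES": 40},
            {"NAME": "End", "NUMBER_PAGES": 30},
        ]},
    ]},
    {"AUTHOR_ID": 2, "NAME": "Bob", "Books": [
        {"BOOK_ID": 20, "Chapters": [
            {"NAME": "Part 1", "NUMBER_PAGES": 80},
            {"NAME": "Part 2", "NUMBER_PAGES": 90},
        ]},
    ]},
    {"AUTHOR_ID": 3, "NAME": "Carol", "Books": [
        {"BOOK_ID": 30, "Chapters": [{"NAME": "A", "NUMBER_PAGES": 200}]},
        {"BOOK_ID": 31, "Chapters": [{"NAME": "B", "NUMBER_PAGES": 50}]},
    ]},
]


def book_pages(book):
    """Counterpart of aggregate(): fold the chapter page counts into one total."""
    return reduce(lambda acc, ch: acc + ch["NUMBER_PAGES"], book["Chapters"], 0)


def has_short_book(author, limit=100):
    """Counterpart of exists(): true if any book totals fewer than `limit` pages."""
    return any(book_pages(book) < limit for book in author["Books"])


short_authors = [a["AUTHOR_ID"] for a in authors if has_short_book(a)]
print(short_authors)  # [1, 3]
```

Author 1 has a 70-page book, author 3 has a 50-page book next to a 200-page one, and author 2 only has a 170-page book, so only authors 1 and 3 pass.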
For Spark < 3.1, you need to call the aggregate and exists SQL functions through expr:
# Same filter written as a SQL expression, using higher-order functions via expr.
df1 = df.filter(
    F.expr("exists(Books, x -> aggregate(x.Chapters, 0, (a, b) -> a + b.NUMBER_PAGES) < 100)")
)
