I have the following DataFrame:
root
 |-- AUTHOR_ID: integer (nullable = false)
 |-- Books: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- NAME: string (nullable = true)
 |    |    |-- DATE: TimestampType (nullable = true)
How can I find the oldest published book for each author? I want to retrieve the date, like this:
{
"AUTHOR_ID": 1,
"FIRST_PUBLICATION": <Date>,
"Books": "[ ... ]"
}
uj5u.com helpful user replied:
There are many ways to do this; let's try a window function.
root
 |-- AUTHOR_ID: integer (nullable = false)
 |-- Books: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- NAME: string (nullable = true)
 |    |    |-- DATE: date (nullable = true)
+---------+--------------------------------+
|AUTHOR_ID|Books                           |
+---------+--------------------------------+
|21       |[{Stories of Mary, 2019-12-01}] |
|34       |[{Sorrows of Mary, 2019-09-01}] |
|34       |[{Sparrows of Mary, 2019-06-16}]|
|21       |[{Songs of Mary, 2017-03-14}]   |
+---------+--------------------------------+
Following your example:
from pyspark.sql import Window
from pyspark.sql import functions as F
win = Window.partitionBy('AUTHOR_ID').orderBy(F.asc('Books.DATE'))
df1 = (
    df.withColumn('rank', F.row_number().over(win))
    .where(F.col('rank') == 1).drop('rank')  # keep only the row with the oldest date per author
    .withColumn('value', F.to_json(F.struct(F.col('AUTHOR_ID'), F.col('Books.DATE').alias('FIRST_PUBLICATION'), 'Books')))  # build the JSON column
)
df1.select('value').show(truncate=False)
+-------------------------------------------------------------------------------------------------------------+
|value                                                                                                        |
+-------------------------------------------------------------------------------------------------------------+
|{"AUTHOR_ID":21,"FIRST_PUBLICATION":["2017-03-14"],"Books":[{"NAME":"Songs of Mary","DATE":"2017-03-14"}]}   |
|{"AUTHOR_ID":34,"FIRST_PUBLICATION":["2019-06-16"],"Books":[{"NAME":"Sparrows of Mary","DATE":"2019-06-16"}]}|
+-------------------------------------------------------------------------------------------------------------+
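To make the window logic concrete, here is a minimal plain-Python sketch (no Spark required) of what row_number() over a window partitioned by AUTHOR_ID and ordered by date ends up selecting. The rows list mirrors the sample table above, and oldest_per_author is a hypothetical helper name, not a Spark API.

```python
from datetime import date
from itertools import groupby
from operator import itemgetter

# Sample rows mirroring the table above: (AUTHOR_ID, NAME, DATE), one book per row.
rows = [
    (21, "Stories of Mary", date(2019, 12, 1)),
    (34, "Sorrows of Mary", date(2019, 9, 1)),
    (34, "Sparrows of Mary", date(2019, 6, 16)),
    (21, "Songs of Mary", date(2017, 3, 14)),
]


def oldest_per_author(rows):
    """Hypothetical helper: keep the earliest-dated row for each author."""
    # Sorting by (author, date) plays the role of partitionBy + orderBy.
    ordered = sorted(rows, key=itemgetter(0, 2))
    # The first row of each author group is the one row_number() would number 1.
    return [next(group) for _, group in groupby(ordered, key=itemgetter(0))]


result = oldest_per_author(rows)
for author_id, name, first_date in result:
    print(author_id, name, first_date)
```

Like the answer's code, this sketch assumes each row carries a single book. If a row's Books array held several books, it would probably need to be flattened first (for example with explode) before ranking.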
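uj5u.com helpful user replied:
Since Spark v3, using Spark higher-order functions is the best solution (the Python wrapper f.aggregate is available from PySpark 3.1):
from datetime import datetime
from pyspark.sql import Row
from pyspark.sql import functions as f
df = spark.createDataFrame([("1", [Row(NAME="xs", DATE=datetime.strptime('2022-04-06 00:00:00', '%Y-%m-%d %H:%M:%S')),
                                   Row(NAME="s", DATE=datetime.strptime('2022-04-05 00:00:00', '%Y-%m-%d %H:%M:%S')),]), ],
                           'struct<AUTHOR_ID:string,Books:array<struct<NAME:string,DATE:timestamp>>>')
df.show(truncate=False)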
+---------+-----------------------------------------------------+
|AUTHOR_ID|Books                                                |
+---------+-----------------------------------------------------+
|1        |[{xs, 2022-04-06 00:00:00}, {s, 2022-04-05 00:00:00}]|
+---------+-----------------------------------------------------+
df.printSchema()
root
 |-- AUTHOR_ID: string (nullable = true)
 |-- Books: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- NAME: string (nullable = true)
 |    |    |-- DATE: timestamp (nullable = true)
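We can get the earliest publication date among each author's books as follows:
df = df.withColumn('FIRST_PUBLICATION',
                   f.aggregate(
                       'Books',
                       # start from a far-future sentinel so that any real date is smaller
                       f.lit(datetime.strptime('2222-02-22 22:22:22', '%Y-%m-%d %H:%M:%S')),
                       lambda acc, b: f.least(acc, b['DATE'])
                   )
                   )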
Result:
# df.show()
+---------+--------------------+-------------------+
|AUTHOR_ID|               Books|  FIRST_PUBLICATION|
+---------+--------------------+-------------------+
|        1|[{xs, 2022-04-06 ...|2022-04-05 00:00:00|
+---------+--------------------+-------------------+
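The fold that aggregate performs can be modeled in plain Python with functools.reduce. This is only a sketch of the semantics, assuming least skips null values as Spark's documentation describes; least_skip_none and first_publication are hypothetical names introduced for illustration.

```python
from datetime import datetime
from functools import reduce

# Same far-future start value as in the answer above.
SENTINEL = datetime(2222, 2, 22, 22, 22, 22)


def least_skip_none(a, b):
    """Model of least(): the smaller of two values, ignoring None."""
    if a is None:
        return b
    if b is None:
        return a
    return min(a, b)


def first_publication(books):
    """Fold over the books, keeping the smallest DATE seen so far."""
    return reduce(lambda acc, b: least_skip_none(acc, b["DATE"]), books, SENTINEL)


books = [
    {"NAME": "xs", "DATE": datetime(2022, 4, 6)},
    {"NAME": "s", "DATE": datetime(2022, 4, 5)},
]
earliest = first_publication(books)   # smallest DATE in the list
empty_case = first_publication([])    # nothing to fold, so the start value comes back
print(earliest, empty_case)
```

One consequence of the start value: an author with an empty Books array ends up with the sentinel date rather than a null, so it may be worth replacing that value afterwards.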
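uj5u.com helpful user replied:
Since Spark 2.4, you can use the array_min function to retrieve the minimum element of an array. Apply this function to an array that contains only the dates. To build that array, use the getField method on the Books column.
Here is the complete code: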
from pyspark.sql import functions as F
df = df.withColumn('FIRST_PUBLICATION', F.array_min(F.col('Books').getField('DATE')))
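As a rough plain-Python model of that expression: getField('DATE') on the array of structs yields an array of dates, and array_min takes its minimum. The sketch assumes array_min skips null elements and returns null when nothing is left, as Spark's documentation describes; array_min_model and the sample books are hypothetical.

```python
from datetime import datetime


def array_min_model(values):
    """Model of array_min: minimum of the non-null elements, or None if there are none."""
    present = [v for v in values if v is not None]
    return min(present) if present else None


books = [
    {"NAME": "xs", "DATE": datetime(2022, 4, 6)},
    {"NAME": "s", "DATE": datetime(2022, 4, 5)},
    {"NAME": "undated", "DATE": None},
]
dates = [b["DATE"] for b in books]            # analogue of Books.getField('DATE')
first_publication = array_min_model(dates)    # analogue of array_min(...)
print(first_publication)
```

Under these assumptions, an author with no dated books gets a null FIRST_PUBLICATION, which is usually easier to detect downstream than a placeholder date.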
Please credit the source when reposting. Article link: https://www.uj5u.com/yidong/457308.html
