I have a table with the following structure:
place_id date_start date_end
2826088480774 2017-09-19 2017-09-20
1898375544837 2017-08-01 2017-08-03
1425929142277 2017-09-23 2017-10-03
1013612281863 2016-10-12 2016-10-14
1795296329731 2016-10-13 2016-10-13
695784701956 2017-09-11 2017-11-02
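I want to count how many events each place has in each month (every row is one event). If an event's dates span several months, it should be counted in every affected month.
place_id can repeat, so I wrote the following query: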
SELECT place_id,
       EXTRACT(MONTH FROM date_start) AS month,
       EXTRACT(YEAR FROM date_start) AS year,
       COUNT(*) AS events
FROM Table
GROUP BY place_id, year, month
ORDER BY month, year, events DESC
This gives me the following grouped table:
place_id month year events
2826088480774 8 2017 345
1898375544837 8 2017 343
1425929142277 8 2017 344
1013612281863 8 2017 355
1795296329731 8 2017 348
695784701956 8 2017 363
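The problem is that the data is grouped only by date_start, and I am not sure how to distribute each event across all the affected months from date_start to date_end.
A uj5u.com user replied:
You can use the sequence function to generate every date between date_start and date_end, explode the resulting array column, and then group and count as you already did: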
df.createOrReplaceTempView("EventsTable")
spark.sql("""
WITH events AS (
SELECT place_id,
explode(event_dates) as event_date
FROM (
SELECT place_id,
sequence(date_start, date_end, interval 1 day) as event_dates
FROM EventsTable
)
)
SELECT place_id,
month(event_date) as month,
year(event_date) as year,
count(*) as events
FROM events
GROUP BY 1, 2, 3
ORDER BY month, year, events desc
""").show()
// +-------------+-----+----+------+
// |     place_id|month|year|events|
// +-------------+-----+----+------+
// |1898375544837|    8|2017|     3|
// | 695784701956|    9|2017|    20|
// |1425929142277|    9|2017|     8|
// |2826088480774|    9|2017|     2|
// |1013612281863|   10|2016|     3|
// |1795296329731|   10|2016|     1|
// | 695784701956|   10|2017|    31|
// |1425929142277|   10|2017|     3|
// | 695784701956|   11|2017|     2|
// +-------------+-----+----+------+
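Note that because the query above explodes each event into one row per day, count(*) is the number of event days in each month (for example 31 for place 695784701956 in October 2017), not the number of events. If each event should count once per affected month, as the question asks, one option is to expand each event into months instead of days. In Spark SQL this would probably mean building the sequence over trunc(date_start, 'MM') and trunc(date_end, 'MM') with interval 1 month. The following is only a minimal plain-Scala sketch of that logic on the sample rows, without Spark. The names Event, affectedMonths and eventsPerMonth are hypothetical helpers introduced here for illustration and do not come from the original answer.

```scala
import java.time.{LocalDate, YearMonth}

object EventsPerMonth {
  // One row of the events table; field names mirror the columns in the question.
  final case class Event(placeId: Long, dateStart: LocalDate, dateEnd: LocalDate)

  // Hypothetical helper: every calendar month touched by [start, end], inclusive.
  // Returns an empty list if start falls in a later month than end.
  def affectedMonths(start: LocalDate, end: LocalDate): List[YearMonth] = {
    val first = YearMonth.from(start)
    val last = YearMonth.from(end)
    Iterator.iterate(first)(m => m.plusMonths(1)).takeWhile(m => !m.isAfter(last)).toList
  }

  // Hypothetical helper: number of events per (place_id, year, month),
  // where each event is counted exactly once in every month it touches.
  def eventsPerMonth(events: Seq[Event]): Map[(Long, Int, Int), Int] =
    events
      .flatMap(e => affectedMonths(e.dateStart, e.dateEnd).map(m => (e.placeId, m.getYear, m.getMonthValue)))
      .groupBy(identity)
      .map { case (key, rows) => key -> rows.size }

  // The six sample rows from the question.
  val sample: Seq[Event] = Seq(
    Event(2826088480774L, LocalDate.parse("2017-09-19"), LocalDate.parse("2017-09-20")),
    Event(1898375544837L, LocalDate.parse("2017-08-01"), LocalDate.parse("2017-08-03")),
    Event(1425929142277L, LocalDate.parse("2017-09-23"), LocalDate.parse("2017-10-03")),
    Event(1013612281863L, LocalDate.parse("2016-10-12"), LocalDate.parse("2016-10-14")),
    Event(1795296329731L, LocalDate.parse("2016-10-13"), LocalDate.parse("2016-10-13")),
    Event(695784701956L, LocalDate.parse("2017-09-11"), LocalDate.parse("2017-11-02"))
  )

  def main(args: Array[String]): Unit = {
    // Print one line per (place_id, year, month), ordered by year, month, place_id.
    eventsPerMonth(sample).toSeq
      .sortBy { case ((place, year, month), _) => (year, month, place) }
      .foreach { case ((place, year, month), n) => println(s"$place $month $year $n") }
  }
}
```

With the six sample rows every place_id is unique, so each (place_id, year, month) combination has a count of 1. The event for place 695784701956 appears under September, October and November 2017. On the full table, repeated place_id values would push the counts above 1.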
