Hive SQL經典面試題
最近發現一道大資料面試經常會問的SQL題目:統計連續登錄的三天及以上的用戶(或者類似的:連續3個月充值會員用戶、連續N天購買商品的用戶等),下面就來記錄一下解題思路,
要求輸出格式:
+---------+--------+-------------+-------------+--+
| uid | times | start_date | end_date |
+---------+--------+-------------+-------------+--+
首先建表:
create table test.user_login_info(
user_id string COMMENT '用戶ID'
,login_date date COMMENT '登錄日期'
)
row format delimited
fields terminated by ',';
原始資料:

匯入原始資料:
hdfs dfs -put /root/user_login_info.txt /user/hive/warehouse/test.db/user_login_info/
查看:
select * from user_login_info;
+--------------------------+-----------------------------+
| user_login_info.user_id | user_login_info.login_date |
+--------------------------+-----------------------------+
| user01 | 2018-02-28 |
| user01 | 2018-03-01 |
| user01 | 2018-03-02 |
| user01 | 2018-03-04 |
| user01 | 2018-03-05 |
| user01 | 2018-03-06 |
| user01 | 2018-03-07 |
| user02 | 2018-03-01 |
| user02 | 2018-03-02 |
| user02 | 2018-03-03 |
| user02 | 2018-03-06 |
+--------------------------+-----------------------------+
解題思路:
1.先把資料按照user_id分組,login_date升序
select
user_id
,login_date
,row_number() over(partition by user_id order by login_date asc) as rank
from user_login_info
;
結果:
+----------+-------------+-------+
| user_id | login_date | rank |
+----------+-------------+-------+
| user01 | 2018-02-28 | 1 |
| user01 | 2018-03-01 | 2 |
| user01 | 2018-03-02 | 3 |
| user01 | 2018-03-04 | 4 |
| user01 | 2018-03-05 | 5 |
| user01 | 2018-03-06 | 6 |
| user01 | 2018-03-07 | 7 |
| user02 | 2018-03-01 | 1 |
| user02 | 2018-03-02 | 2 |
| user02 | 2018-03-03 | 3 |
| user02 | 2018-03-06 | 4 |
+----------+-------------+-------+
2.用 login_date - rank 得到的差值日期如果是一樣的,則說明是連續登錄的
select
t1.user_id as user_id
,t1.login_date as login_date
,date_sub(t1.login_date , t1.rank) as date_diff
from
(
select
user_id
,login_date
,row_number() over(partition by user_id order by login_date asc) as rank
from user_login_info
)t1
;
結果:
+----------+-------------+-------------+
| user_id | login_date | date_diff |
+----------+-------------+-------------+
| user01 | 2018-02-28 | 2018-02-27 |
| user01 | 2018-03-01 | 2018-02-27 |
| user01 | 2018-03-02 | 2018-02-27 |
| user01 | 2018-03-04 | 2018-02-28 |
| user01 | 2018-03-05 | 2018-02-28 |
| user01 | 2018-03-06 | 2018-02-28 |
| user01 | 2018-03-07 | 2018-02-28 |
| user02 | 2018-03-01 | 2018-02-28 |
| user02 | 2018-03-02 | 2018-02-28 |
| user02 | 2018-03-03 | 2018-02-28 |
| user02 | 2018-03-06 | 2018-03-02 |
+----------+-------------+-------------+
3.根據 user_id 和 date_diff 分組,login_date 的最小時間即 start_date ,最大時間即 end_date,取分組后的count>=3的即為最終結果
select
t2.user_id as user_id
,count(*) as times
,min(t2.login_date) as start_date
,max(t2.login_date) as end_date
from
(
select
t1.user_id as user_id
,t1.login_date as login_date
,date_sub(t1.login_date , t1.rank) as date_diff
from
(
select
user_id
,login_date
,row_number() over(partition by user_id order by login_date asc) as rank
from user_login_info
)t1
)t2
group by t2.user_id,t2.date_diff
having count(*) >= 3
;
結果:
+----------+--------+-------------+-------------+
| user_id | times | start_date | end_date |
+----------+--------+-------------+-------------+
| user01 | 3 | 2018-02-28 | 2018-03-02 |
| user01 | 4 | 2018-03-04 | 2018-03-07 |
| user02 | 3 | 2018-03-01 | 2018-03-03 |
+----------+--------+-------------+-------------+
以上為一種解決方案,大佬們如果有更好的方案歡迎留言交流,
轉載請註明出處,本文鏈接:https://www.uj5u.com/qita/295118.html
標籤:其他
