每天分享一個sql,幫助大家找到sql的快樂
需求
有一張表,其中一個欄位是由時間、介面、ip和其他欄位組成的
求11月9號下午14點(14-15點),訪問/api/user/login介面的top10的ip地址
建表陳述句
create table log(
log_detail string
)
ROW format delimited FIELDS TERMINATED BY "\t"
;
資料
insert overwrite table log values
("2016-11-09 14:22:05 /api/user/login 110.23.5.33 a *"),
("2016-11-09 11:23:10 /api/user/detail 57.3.2.16 a *"),
("2016-11-09 14:59:40 /api/user/login 200.6.5.166 a *"),
("2016-11-09 14:22:05 /api/user/login 110.23.5.34 a *"),
("2016-11-09 14:22:05 /api/user/login 110.23.5.34 a *"),
("2016-11-09 14:22:05 /api/user/login 110.23.5.34 a *"),
("2016-11-09 11:23:10 /api/user/detail 57.3.2.16 a *"),
("2016-11-09 23:59:40 /api/user/login 200.6.5.166 a *"),
("2016-11-09 14:22:05 /api/user/login 110.23.5.34 a *"),
("2016-11-09 11:23:10 /api/user/detail 57.3.2.16 a *"),
("2016-11-09 23:59:40 /api/user/login 200.6.5.166 a *"),
("2016-11-09 14:22:05 /api/user/login 110.23.5.35 a *"),
("2016-11-09 14:23:10 /api/user/detail 57.3.2.16 a *"),
("2016-11-09 23:59:40 /api/user/login 200.6.5.166 a *"),
("2016-11-09 14:59:40 /api/user/login 200.6.5.166 a *"),
("2016-11-09 14:59:40 /api/user/login 200.6.5.166 a *");
實作
select
t2.interface,
t2.ip,
t2.ip_visit_count,
row_number() over( order by t2.ip_visit_count desc) as rk
from
(select
t1.interface,
t1.ip,
count(*) as ip_visit_count
from
(select
split(log_detail," ")[0] as rq,
split(log_detail," ")[1] as hour,
split(log_detail," ")[2] as interface,
split(log_detail," ")[3] as ip
from
log
)t1
where t1.interface = "/api/user/login"
and date_format(t1.rq,"yyyy-MM-dd")= "2016-11-09"
and substring(t1.hour,0,2) >= 14
and substring(t1.hour,0,2) <= 15
group by t1.interface,t1.ip
limit 10
)t2
;
結果
Total MapReduce CPU Time Spent: 7 seconds 110 msec
OK
t2.interface t2.ip t2.ip_visit_count rk
/api/user/login 110.23.5.34 4 1
/api/user/login 200.6.5.166 3 2
/api/user/login 110.23.5.33 1 3
/api/user/login 110.23.5.35 1 4
Time taken: 48.843 seconds, Fetched: 4 row(s)
分析
1、分析需求,需要時間、介面、ip三個欄位
2、觀察字串的格式,選擇合適的分割符切割
3、使用split函式切割字串,將一列轉換為三列
4、根據ip分組,統計每個ip的個數,這里group by時,需要group by t1.interface,t1.ip這樣寫
5、全域排序,需要開窗,開窗時不需要指定磁區鍵,即不需要partition by
6、如下擴展對觀察資料、切割、取數,需要注意,明確每一步操作的結果是什么
擴展
#split切割、陣列取數
select
split(log_detail," ")[0] as time,
split(log_detail," ")[1] as interface,
split(log_detail," ")[2] as ip
from
log
OK
#這里結果并不對,觀察資料,切割會將時間切成兩部分,所以如上實作部分sql是正確的
#結果
time interface ip
2016-11-09 14:22:05 /api/user/login
2016-11-09 11:23:10 /api/user/detail
2016-11-09 14:59:40 /api/user/login
2016-11-09 14:22:05 /api/user/login
2016-11-09 14:22:05 /api/user/login
2016-11-09 14:22:05 /api/user/login
2016-11-09 11:23:10 /api/user/detail
2016-11-09 23:59:40 /api/user/login
2016-11-09 14:22:05 /api/user/login
2016-11-09 11:23:10 /api/user/detail
2016-11-09 23:59:40 /api/user/login
2016-11-09 14:22:05 /api/user/login
2016-11-09 14:23:10 /api/user/detail
2016-11-09 23:59:40 /api/user/login
2016-11-09 14:59:40 /api/user/login
2016-11-09 14:59:40 /api/user/login
Time taken: 0.3 seconds, Fetched: 16 row(s)
where date_format(t1.time,"yyyy-MM-dd HH") >= concat(year(current_date()),"-","11-09 14")
and
date_format(t1.time,"yyyy-MM-dd HH") <= concat(year(current_date()),"-","11-09 15")
and
ip = "/api/user/login"
#關于排序
因為最后需要全域排序,topn序號,那么如何選擇排序時機
#測驗一(沒有達到預期的結果)
在group by時對ip_visit_count 降序排序,這時也是全域排序,但是在row_number()Over()開窗時,默認是升序
hive (default)> select
> t2.interface,
> t2.ip,
> t2.ip_visit_count,
> row_number() over() as rk
> from
> (select
> t1.interface,
> t1.ip,
> count(*) as ip_visit_count
> from
> (select
> split(log_detail," ")[0] as rq,
> split(log_detail," ")[1] as hour,
> split(log_detail," ")[2] as interface,
> split(log_detail," ")[3] as ip
> from
> log
> )t1
> where t1.interface = "/api/user/login"
> and date_format(t1.rq,"yyyy-MM-dd")= "2016-11-09"
> and substring(t1.hour,0,2) >= 14
> and substring(t1.hour,0,2) <= 15
> group by t1.interface,t1.ip
> order by ip_visit_count desc
> limit 10
> )t2
> ;
Total MapReduce CPU Time Spent: 7 seconds 100 msec
OK
t2.interface t2.ip t2.ip_visit_count rk
/api/user/login 110.23.5.33 1 1
/api/user/login 110.23.5.35 1 2
/api/user/login 200.6.5.166 3 3
/api/user/login 110.23.5.34 4 4
還有其他測驗情況,篇幅有限,就不展示了
總結一下:
1、第一次排序和最終排序欄位不沖突是可以的
2、如果第一次排序和最終排序實作目標一致,那么最后排序即可
知識點
1、split(字串,分割符):使用分割符切割字串,回傳一個陣列
2、row_number():row_number是基于over()開窗函式的一個不重復的序號,如上結果所示,即便結果相同,也會順延,序號自增
3、substring(字串,起始位置,步長):根據起始位置和步長切割字串
轉載請註明出處,本文鏈接:https://www.uj5u.com/qita/275754.html
標籤:其他
