Hive中的多維分析函式
眾所周知,Hive是一個OLAP資料庫,擁有著很強大的資料分析能力,
但是當我們一張表的維度資訊特別多,并且需要各種維度組合的統計分析結果時就很不方便,比如一張表有5個維度,所有的維度組合是2的5次方也就是32種,那也就意味著我們需要寫32個查詢的HQL解決我們的需求,同時不方便的點還存在于我們需要將32個HQL的查詢結果寫入到32張不同的表中,無形中給我們未來查詢我們所需要的資訊帶來了額外的麻煩,
本篇文章就是介紹Hive中非常好用的多維分析函式幫助我們解決以上兩個痛點問題,
1.環境說明
當前環境采用Hadoop3.1.3以及Hive3.1.2版本!
2.準備資料
[whalex@hadoop102data]$ vim test.txt
1001 John male sale beijing
1002 Tom female sale shanghai
1003 Bob male sale beijing
1004 Alex male product shanghai
3.建表并匯入資料
create table test(id string,name string,sex string,dept_name string,addr string) row format delimited fields terminated by '\t';
load data local inpath '/opt/whalex/data/test.txt' into table test;
4.Grouping Sets計算多種自定義維度組合問題
此時需要計算如下需求:
- 按照性別分組計算每種性別的人數
- 按照部門和性別分組計算每個部門每種性別的人數
- 按照地區和性別分組計算每個地區每種性別的人數
我們可以采用如下3條HQL陳述句計算并插入到不同的3個結果表中
select sex,count(*) ct from test group by sex;
select dept_name,sex,count(*) ct from test group by dept_name,sex;
select addr,sex,count(*) ct from test group by addr,sex;
其實在Hive中提供了Grouping Sets幫助我們解決上述問題,Grouping Sets說明如下圖所示:

所以以上3條HQL可以合并為如下一條HQL
select dept_name,addr,sex,count(*) ct from test group by dept_name,addr,sex grouping sets((sex),(dept_name,sex),(addr,sex)) order by dept_name,addr,sex;
+------------+-----------+---------+-----+
| dept_name | addr | sex | ct |
+------------+-----------+---------+-----+
| NULL | NULL | female | 1 |
| NULL | NULL | male | 3 |
| NULL | beijing | male | 2 |
| NULL | shanghai | female | 1 |
| NULL | shanghai | male | 1 |
| product | NULL | male | 1 |
| sale | NULL | female | 1 |
| sale | NULL | male | 2 |
+------------+-----------+---------+-----+
注:HQL中的order by只是為了讓輸出結果更好看一些,沒有其他意義,
5.With Cube解決計算所有維度組合問題
此時需要計算地區、部門以及性別三個維度所有組合的聚合資料
我們當然還可以使用Grouping Sets來解決問題,HQL如下
select dept_name,addr,sex,count(*) ct from test group by dept_name,addr,sex grouping sets((addr,dept_name,sex),(addr,dept_name),(addr,sex),(dept_name,sex),(addr),(dept_name),(sex),()) order by dept_name,addr,sex;
+------------+-----------+---------+-----+
| dept_name | addr | sex | ct |
+------------+-----------+---------+-----+
| NULL | NULL | NULL | 4 |
| NULL | NULL | female | 1 |
| NULL | NULL | male | 3 |
| NULL | beijing | NULL | 2 |
| NULL | beijing | male | 2 |
| NULL | shanghai | NULL | 2 |
| NULL | shanghai | female | 1 |
| NULL | shanghai | male | 1 |
| product | NULL | NULL | 1 |
| product | NULL | male | 1 |
| product | shanghai | NULL | 1 |
| product | shanghai | male | 1 |
| sale | NULL | NULL | 3 |
| sale | NULL | female | 1 |
| sale | NULL | male | 2 |
| sale | beijing | NULL | 2 |
| sale | beijing | male | 2 |
| sale | shanghai | NULL | 1 |
| sale | shanghai | female | 1 |
+------------+-----------+---------+-----+
此時我們也可以使用With Cube來解決問題
GROUP BY a, b, c WITH CUBE is equivalent to
GROUP BY a, b, c GROUPING SETS ( (a, b, c), (a, b), (b, c), (a, c), (a), (b), (c), ( )).
所以HQL可以改為:
select dept_name,addr,sex,count(*) ct from test group by dept_name,addr,sex with cube order by dept_name,addr,sex;
+------------+-----------+---------+-----+
| dept_name | addr | sex | ct |
+------------+-----------+---------+-----+
| NULL | NULL | NULL | 4 |
| NULL | NULL | female | 1 |
| NULL | NULL | male | 3 |
| NULL | beijing | NULL | 2 |
| NULL | beijing | male | 2 |
| NULL | shanghai | NULL | 2 |
| NULL | shanghai | female | 1 |
| NULL | shanghai | male | 1 |
| product | NULL | NULL | 1 |
| product | NULL | male | 1 |
| product | shanghai | NULL | 1 |
| product | shanghai | male | 1 |
| sale | NULL | NULL | 3 |
| sale | NULL | female | 1 |
| sale | NULL | male | 2 |
| sale | beijing | NULL | 2 |
| sale | beijing | male | 2 |
| sale | shanghai | NULL | 1 |
| sale | shanghai | female | 1 |
+------------+-----------+---------+-----+
6.With Rollup解決計算層級維度組合問題
資料:
1 2020 12 31 56
2 2020 12 31 89
3 2021 01 01 78
4 2021 01 01 67
5 2021 01 02 56
建表:
create table test1(order_id int,year string,month string,day string,order_amount int) row format delimited fields terminated by '\t';
現在需要按照時間統計銷售總額,注意按照時間聚合資料的時候,單獨的月份、單獨的天、月份和天以及年和天的維度聚合出來的指標是沒有意義的,也就是說按照時間維度聚合指標時只有“年月日”、“年月”、“年”以及不添加維度資訊計算總和時才是有意義的,
這種像年月日維度的可以稱為層級維度,同樣可以使用Grouping Sets來解決該需求,HQL如下:
select year,month,day,sum(order_amount) from test1 group by year,month,day grouping sets((year,month,day),(year,month),(year),()) order by year,month,day;
+-------+--------+-------+------+
| year | month | day | _c3 |
+-------+--------+-------+------+
| NULL | NULL | NULL | 346 |
| 2020 | NULL | NULL | 145 |
| 2020 | 12 | NULL | 145 |
| 2020 | 12 | 31 | 145 |
| 2021 | NULL | NULL | 201 |
| 2021 | 01 | NULL | 201 |
| 2021 | 01 | 01 | 145 |
| 2021 | 01 | 02 | 56 |
+-------+--------+-------+------+
此時我們還可以使用With Rollup來解決該需求
GROUP BY a, b, c, WITH ROLLUP is equivalent to GROUP BY a, b, c GROUPING SETS ( (a, b, c), (a, b), (a), ( )).
所以HQL可以改為:
select year,month,day,sum(order_amount) from test1 group by year,month,day with rollup order by year,month,day;
+-------+--------+-------+------+
| year | month | day | _c3 |
+-------+--------+-------+------+
| NULL | NULL | NULL | 346 |
| 2020 | NULL | NULL | 145 |
| 2020 | 12 | NULL | 145 |
| 2020 | 12 | 31 | 145 |
| 2021 | NULL | NULL | 201 |
| 2021 | 01 | NULL | 201 |
| 2021 | 01 | 01 | 145 |
| 2021 | 01 | 02 | 56 |
+-------+--------+-------+------+
7.總結
Hive中提供了grouping sets、with cube以及with rollup來解決多維分析的問題,當維度特別多時,我們根據具體情況可以使用對應的函式來解決問題,
我們可以將結果輸出到一張表中方便后續查詢,比如我們是a,b,c三個維度使用with cube計算的結果集,當我們只需要group by a的結果時,我們可以使用select a,ct from result_table where a is not null and b is null and c is null;這樣的HQL獲取需要的結果,避免了我們在開頭提到了多條HQL需要使用多張表來保存資料給最后查詢帶來麻煩的問題,
轉載請註明出處,本文鏈接:https://www.uj5u.com/qita/294666.html
標籤:其他
下一篇:鎖的原理和使用場景,樂觀鎖、悲觀鎖、公平鎖、非公平鎖,基于資料庫、Redis、Zookeeper實作分布式鎖的原理及代碼實作
