1.應用背景:
實際生產中,各種指標的報表統計,往往都會涉及到多維分析,比如,統計榷訓數,榷訓話次數,榷訓頭訪客數,日新,日用戶平均訪問時長,訪問深度……都需要從不同維度,各種角度去分析,如果上述維度分析需求,都逐個開發計算sql(逐個去group by聚合),作業繁冗!
那么,如何解決這個問題呢?
2.實作程序:
2.1 關鍵要點:
- 創建一個統一的目標維度分析聚合結果表,這個表應該包含所有的維度欄位
- 利用hive的高階聚合函式,在一個sql中,即可計算出所有可能的維度組合
2.2 Cube表模型:這種表,在業內通常被稱之為: cube (多維資料立方體)
| 省 | 市 | 區 | 手機型號 | 作業系統 | App版本 | 下載渠道 | 小時段 | 榷訓總數 |
| 江西 | \n | \n | \n | \n | \n | \n | \n | 1000 |
| 江蘇 | \n | \n | \n | \n | \n | \n | \n | 1500 |
| 河南 | \n | \n | \n | \n | \n | \n | \n | 1800 |
| …… |
|
|
|
|
|
|
|
|
| 江西 | 九江 | \n | \n | \n | \n | \n | \n | 800 |
| 江西 | 贛州 | \n | \n | \n | \n | \n | \n | 600 |
| 江西 | 南昌 | \n | \n | \n | \n | \n | \n | 450 |
| 江西 | …… | \n | \n | \n | \n | \n | \n | 550 |
| 江蘇 | 南通 | \n | \n | \n | \n | \n | \n | 660 |
| 江蘇 | 蘇州 | \n | \n | \n | \n | \n | \n | 540 |
| 江蘇 | 徐州 | \n | \n | \n | \n | \n | \n | 400 |
| 江蘇 | …… | \n | \n | \n | \n | \n | \n | 320 |
| \n | \n | \n | MI6 | \n | \n | \n | \n | 1500 |
| \n | \n | \n | MI8 | \n | \n | \n | \n | 2200 |
| \n | \n | \n | MATE10 | \n | \n | \n | \n | 1800 |
| \n | \n | \n | IPHONE6 | \n | \n | \n | \n | 1200 |
| \n | \n | \n | …… | \n | \n | \n | \n | …… |
|
|
|
|
|
|
|
|
|
|
2.3 如果我要從上面的表中,獲取到 各省份榷訓數,如何獲取?
SELECT
province,
dau_cnt
FROM cube
WHERE province is not null and coalesce(city,district,devicetype,osname,....) is null
2.4 什么是維度的基數?
上述表的行數很大
比如按(省、市、區、手機型號、app版本、下載渠道、小時段)維度組合計算榷訓數,結果行數有: 省維度的基數 * 市維度的基數 * 區維度的基數 * ……
基數: 就是某個維度欄位的去重值個數!
3. hive為解決以上問題所提供的高階函式:
3.1 With cube函式:將所有可能要參與維度的欄位都寫在group by 后面,with cube 函式就會自動的將所有各種維度都統計出來
INSERT INTO TABLE cube
SELECT
province,
city,
district,
device_type,
os_name,
app_version,
release_channel,
hour_segement,
count(distinct guid) as dau_cnt
FROM t_src
GROUP BY
province,
city,
district,
device_type,
os_name,
app_version,
release_channel,
house_segment
WITH CUBE
;
3.2 Grouping sets函式:這個可以由用戶自己決定需要哪些維度組合,將自己需要的維度組合都寫在Grouping sets函式后
INSERT INTO TABLE cube
SELECT
province,
city,
district,
device_type,
os_name,
app_version,
release_channel,
hour_segement,
count(distinct guid) as dau_cnt
FROM t_src
GROUP BY
province,
city,
district,
device_type,
os_name,
app_version,
release_channel,
house_segment
GROUPING SETS(
(),
(province),
(province,city),
(province,city,district),
(device_type)
)
;
3.3 With rollup函式:主要針對層級維度的組合處理,假如現在的組合是省+市+區三個維度進行組合,with rollup 就會自動的一級一級往上卷,變成省+市,最后是省
INSERT INTO TABLE cube
SELECT
province,
city,
district,
device_type,
os_name,
app_version,
release_channel,
hour_segement,
count(distinct guid) as dau_cnt
FROM t_src
GROUP BY province,city,district
WITH ROLLUP
;
轉載請註明出處,本文鏈接:https://www.uj5u.com/shujuku/238134.html
標籤:其他
