1.需求描述
統計硅谷影音視頻網站的常規指標,各種 TopN 指標:-- 統計視頻觀看數 Top10-- 統計視頻類別熱度 Top10-- 統計出視頻觀看數最高的 20 個視頻的所屬類別以及類別包含 Top20 視頻的個數-- 統計視頻觀看數 Top50 所關聯視頻的所屬類別排序-- 統計每個類別中的視頻熱度 Top10,以 Music 為例-- 統計每個類別視頻觀看數 Top10-- 統計上傳視頻最多的用戶 Top10 以及他們上傳的視頻觀看次數在前 20 的視頻
2.資料結構
|
欄位
|
備注
|
詳細描述
|
|
videoId
|
視頻唯一
id
(
String
)
|
11
位字串
|
|
uploader
|
視頻上傳者(
String
)
|
上傳視頻的用戶名
String
|
|
age
|
視頻年齡(
int
)
|
視頻在平臺上的整數天
|
|
category
|
視頻類別(
Array<String>
)
|
上傳視頻指定的視頻分類
|
|
length
|
視頻長度(
Int
)
|
整形數字標識的視頻長度
|
|
views
|
觀看次數(
Int
)
|
視頻被瀏覽的次數
|
|
rate
|
視頻評分(
Double
)
|
滿分
5
分
|
|
Ratings
|
流量(
Int
)
|
視頻的流量,整型數字
|
|
conments
|
評論數(
Int
)
|
一個視頻的整數評論數
|
|
relatedId
|
相關視頻
id
(
Array<String>
)
|
相關視頻的
id
,最多
20
個
|
|
欄位
|
備注
|
欄位型別
|
|
uploader
|
上傳者用戶名
|
string
|
|
videos
|
上傳視頻數
|
int
|
|
friends
|
朋友數量
|
int
|
3.準備作業
(資料和代碼在我的資源里,代碼是自己手打的,)
https://download.csdn.net/download/weixin_42759988/19418290?spm=1001.2014.3001.5501
通過觀察原始資料形式可以發現,視頻可以有多個所屬分類,每個所屬分類用&符號分割,且分割的兩邊有空格字符,同時相關視頻也是可以有多個元素,多個相關視頻又用“\t”進行分割,為了分析資料時方便對存在多個子元素的資料進行操作,我們首先進行資料重組清洗操作,即:將所有的類別用“&”分割,同時去掉兩邊空格,多個相關視頻id也使用“&”進行分割,
將資源代碼打成jar包上傳到hdfs或者自己的虛擬機里,相信此步驟大家都會,這里不再贅述,
使用運行jar包:
yarn jar /opt/module/jars/hive-gulivideo-1.0-SNAPSHOT.jar com.atguigu.mr.ETLDriver /user/root/guliVideo/video /guliout
說明:首先記得啟動yarn,運行此jar包,后面跟著hdfs的輸入輸出路徑,
4.準備表
1)需要準備的表
2)創建原始資料表:
create table gulivideo_ori(
videoId string,
uploader string,
age int,
category array<string>,
length int,
views int,
rate float,
ratings int,
comments int,
relatedId array<string>)
row format delimited fields terminated by "\t"
collection items terminated by "&"
stored as textfile;
create table gulivideo_user_ori(
uploader string,
videos int,
friends int)
row format delimited
fields terminated by "\t"
stored as textfile;
load data inpath '/guliout/part-r-00000' into table gulivideo_ori;
select * from gulivideo_ori limit 2;
load data inpath '/user/root/guliVideo/user/user.txt' into table gulivideo_user_ori;
select * from gulivideo_user_ori limit 2;
2)創建 orc 存盤格式帶 snappy 壓縮的表:
(1)gulivideo_orc
create table gulivideo_orc(
videoId string,
uploader string,
age int,
category array<string>,
length int,
views int,
rate float,
ratings int,
comments int,
relatedId array<string>)
stored as orc
tblproperties("orc.compress"="SNAPPY");
create table gulivideo_user_orc(
uploader string,
videos int,
friends int)
row format delimited
fields terminated by "\t"
stored as orc
tblproperties("orc.compress"="SNAPPY");
insert into table gulivideo_orc select * from gulivideo_ori;
insert into table gulivideo_user_orc select * from gulivideo_user_ori;
(4)查看資料:

1 統計視頻觀看數 Top10
select
videoId,views
from
gulivideo_orc
order by
views desc
limit 10;
結果:
dMH0bHeiRNg 42513417
0XxI-hvPRRA 20282464
1dmVU08zVpA 16087899
RB-wUgnyGv0 15712924
QjA5faZF1A8 15256922
-_CSo1gOd48 13199833
49IDp76kjPw 11970018
tYnn51C3X_w 11823701
pv5zWaTEVkI 11672017
D2kJZOfq7zk 11184051
2 統計視頻類別熱度 Top10
1.使用UDTF函式將類別列炸開
select
videoId,category_name
from
gulivideo_orc
lateral view explode(category) tmp_category as category_name; --t1
結果:
dMH0bHeiRNg 42513417
0XxI-hvPRRA 20282464
1dmVU08zVpA 16087899
RB-wUgnyGv0 15712924
QjA5faZF1A8 15256922
-_CSo1gOd48 13199833
49IDp76kjPw 11970018
tYnn51C3X_w 11823701
pv5zWaTEVkI 11672017
D2kJZOfq7zk 11184051
select category_name,count(*) category_count
from t1
group by category_name
order by category_count
desc limit 10;
3.最終SQL
select category_name,count(*) category_count
from (select videoId,category_name from gulivideo_orc
lateral view explode(category) tmp_category as category_name) t1
group by category_name
order by category_count
desc limit 10;
4.結果:
Music 179049
Entertainment 127674
Comedy 87818
Animation 73293
Film 73293
Sports 67329
Games 59817
Gadgets 59817
People 48890
Blogs 48890
3 統計出視頻觀看數最高的 20 個視頻的所屬類別以及類別包含
Top20 視頻的個數
(1)先找到觀看數最高的 20 個視頻所屬條目的所有資訊,降序排列
(2)把這 20 條資訊中的 category 分裂出來(列轉行)
(3)最后查詢視頻分類名稱和該分類下有多少個 Top20 的視頻
select
videoId,views,category
from
gulivideo_orc
order by
views desc
limit 20; --t1
結果:
dMH0bHeiRNg 42513417 ["Comedy"]
0XxI-hvPRRA 20282464 ["Comedy"]
1dmVU08zVpA 16087899 ["Entertainment"]
RB-wUgnyGv0 15712924 ["Entertainment"]
QjA5faZF1A8 15256922 ["Music"]
-_CSo1gOd48 13199833 ["People","Blogs"]
49IDp76kjPw 11970018 ["Comedy"]
tYnn51C3X_w 11823701 ["Music"]
pv5zWaTEVkI 11672017 ["Music"]
D2kJZOfq7zk 11184051 ["People","Blogs"]
vr3x_RRJdd4 10786529 ["Entertainment"]
lsO6D1rwrKc 10334975 ["Entertainment"]
5P6UU6m3cqk 10107491 ["Comedy"]
8bbTtPL1jRs 9579911 ["Music"]
_BuRwH59oAo 9566609 ["Comedy"]
aRNzWyD7C9o 8825788 ["UNA"]
UMf40daefsI 7533070 ["Music"]
ixsZy2425eY 7456875 ["Entertainment"]
MNxwAU_xAMk 7066676 ["Comedy"]
RUCZJVJ_M8o 6952767 ["Entertainment"]
select
videoId,
category_name
from
t1
lateral view explode(category) tmp_category as category_name;--t2
執行t2,結果:
dMH0bHeiRNg Comedy
0XxI-hvPRRA Comedy
1dmVU08zVpA Entertainment
RB-wUgnyGv0 Entertainment
QjA5faZF1A8 Music
-_CSo1gOd48 People
-_CSo1gOd48 Blogs
49IDp76kjPw Comedy
tYnn51C3X_w Music
pv5zWaTEVkI Music
D2kJZOfq7zk People
D2kJZOfq7zk Blogs
vr3x_RRJdd4 Entertainment
lsO6D1rwrKc Entertainment
5P6UU6m3cqk Comedy
8bbTtPL1jRs Music
_BuRwH59oAo Comedy
aRNzWyD7C9o UNA
UMf40daefsI Music
ixsZy2425eY Entertainment
MNxwAU_xAMk Comedy
RUCZJVJ_M8o Entertainment
select category_name,count(*) category_count
from t2
group by category_name
order by category_count desc;
4.最終SQL
select category_name,count(*) category_count
from
(
select
videoId,
category_name
from
(
select
videoId,views,category
from
gulivideo_orc
order by
views desc
limit 20
)t1
lateral view explode(category) tmp_category as category_name)t2
group by category_name
order by category_count desc;
結果:
Entertainment 6
Comedy 6
Music 5
People 2
Blogs 2
UNA 1
4 統計視頻觀看數 Top50 所關聯視頻的所屬類別排序
select
relatedId,views
from
gulivideo_orc
order by
views desc
limit 50;--t1
2.對t1表中的relatedId進行炸裂(列轉行)并去重
select related_id
from t1
lateral view explode(relatedId) tmp_related as related_id
group by related_id; --t2
3. 取出觀看數前50視頻關聯ID視頻的類別
select category
from t2
join gulivideo_orc orc
on t2.related_id=orc.videoId; --t3
4.對t3表中的category進行炸裂
select explode(category) category_name
from t3; --t4
5.分組(類別)求和(總數)
select category_name,count(*) category_count
from t4
group by category_name
order by category_count desc;
6.最終SQL:
select category_name,count(*) category_count
from (select explode(category) category_name
from (select category
from (select related_id
from (select
relatedId,views
from
gulivideo_orc
order by
views desc
limit 50)t1
lateral view explode(relatedId) tmp_related as related_id
group by related_id)t2
join gulivideo_orc orc
on t2.related_id=orc.videoId)t3)t4
group by category_name
order by category_count desc;
方法二:使用rank()函式:
SELECT
t6.category_name,
t6.video_sum,
rank() over(ORDER BY t6.video_sum DESC ) rk
FROM
(
SELECT
t5.category_name,
COUNT(t5.relatedid_id) video_sum
FROM
(
SELECT
t4.relatedid_id,
category_name
FROM
(
SELECT
t2.relatedid_id ,
t3.category
FROM
(
SELECT
relatedid_id
FROM
(
SELECT
videoId,
views,
relatedid
FROM
gulivideo_orc
ORDER BY
views
DESC
LIMIT 50
)t1
lateral VIEW explode(t1.relatedid) t1_tmp AS relatedid_id
)t2
JOIN
gulivideo_orc t3
ON
t2.relatedid_id = t3.videoId
) t4
lateral VIEW explode(t4.category) t4_tmp AS category_name
) t5
GROUP BY
t5.category_name
ORDER BY
video_sum
DESC
) t6;
結果:
Comedy 232
Entertainment 216
Music 195
Blogs 51
People 51
Film 47
Animation 47
News 22
Politics 22
Games 20
Gadgets 20
Sports 19
Howto 14
DIY 14
UNA 13
Places 12
Travel 12
Animals 11
Pets 11
Autos 4
Vehicles 4
5 統計每個類別中的視頻熱度 Top10,以 Music 為例
(1)要想統計 Music 類別中的視頻熱度 Top10,需要先找到 Music 類別,那么就需要將
category 展開,所以可以創建一張表用于存放 categoryId 展開的資料,
(2)向 category 展開的表中插入資料,
(3)統計對應類別(Music)中的視頻熱度,
SELECT
t1.videoId,
t1.views,
t1.category_name
FROM
(
SELECT
videoId,
views,
category_name
FROM gulivideo_orc
lateral VIEW explode(category) gulivideo_orc_tmp AS category_name
)t1
WHERE
t1.category_name = "Music"
ORDER BY
t1.views
DESC
LIMIT 10;
上面的方法雖然也可以解決問題,但是不建議使用,可以使用如下方法:
思路:
1)要想統計music類別中的視頻熱度Top10,需要先找到Music類別,那么就需要將category展開,
所以可以創建一張表用于存放categoryId展開的資料,
2)向category展開的表中插入資料
3)統計對應類別(Music)中的視頻熱度,
1.創建類別表:
create table guliVideo_category(
videoId String,
uploader string,
age Int,
categoryId string,
length int,
views int,
rate float,
ratings int,
comments int,
relatedId array<string>)
row format delimited fields terminated by "\t"
collection items terminated by "&"
stored as orc;
2.向類別表中插入資料:
insert into table guliVideo_category
select
videoId,
uploader,
age,
categoryId,
length,
views,
rate,
ratings,
comments,
relatedId
from gulivideo_orc
lateral view explode(category) tmp_category as categoryId;
3.統計Music類別的Top10(也可以統計其他)
select videoId,views
from guliVideo_category
where categoryId = "Music"
order by views desc
limit 10;
結果:
QjA5faZF1A8 15256922
tYnn51C3X_w 11823701
pv5zWaTEVkI 11672017
8bbTtPL1jRs 9579911
UMf40daefsI 7533070
-xEzGIuY7kw 6946033
d6C0bNDqf3Y 6935578
HSoVKUVOnfQ 6193057
3URfWTEPmtE 5581171
thtmaZnxk_0 5142238
6 統計每個類別視頻觀看數 Top10
select
categoryId,
videoId,
views,
rank() over(partition by categoryId order by views desc) rk
from guliVideo_category; --t1
2.過濾前10
select categoryId,videoId,views
from t1
where rk <=10;
3.最終SQL:
select categoryId,videoId,views
from (select
categoryId,
videoId,
views,
rank() over(partition by categoryId order by views desc) rk
from guliVideo_category)t1
where rk <=10;
7 統計上傳視頻最多的用戶 Top10 以及他們上傳的視頻觀看次數在前 20 的視頻
select uploader,videos
from gulivideo_user_orc
order by videos desc
limit 10; --t1
2.關聯 gulivideo_orc 表,求出這 10 個用戶上傳的所有的視頻,按照觀看數取前 20
最終SQL:
SELECT
t2.videoId,
t2.views
FROM
(
SELECT
uploader,
videos
FROM gulivideo_user_orc
ORDER BY
videos
DESC
LIMIT 10
) t1
JOIN gulivideo_orc t2
ON t1.uploader = t2.uploader
ORDER BY
t2.views DESC
LIMIT 20;
3.結果:
-IxHBW0YpZw 39059
BU-fT5XI_8I 29975
ADOcaBYbMl0 26270
yAqsULIDJFE 25511
vcm-t0TJXNg 25366
0KYGFawp14c 24659
j4DpuPvMLF4 22593
Msu4lZb2oeQ 18822
ZHZVj44rpjE 16304
foATQY3wovI 13576
-UnQ8rcBOQs 13450
crtNd46CDks 11639
D1leA0JKHhE 11553
NJu2oG1Wm98 11452
CapbXdyv4j4 10915
epr5erraEp4 10817
IyQoDgaLM7U 10597
tbZibBnusLQ 10402
_GnCHodc7mk 9422
hvEYlSlRitU 7123
參考資料:《尚硅谷大資料技術之Hive》
轉載請註明出處,本文鏈接:https://www.uj5u.com/qita/286240.html
標籤:其他
上一篇:sersync 實作實時資料同步
