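1: Self-introduction

2: Data skew

2.1. What is it?

2.2. Why does it happen?

2.3. Consequences

2.4. Categories

2.5 Additional notes on data skew categories

3: Coding problem: the smallest k numbers in an array

3.1 Two methods

3.2 Summary of the priority-queue-as-heap knowledge used here

3.3. Summary of custom comparators

4: Writing SQL

4.1 Two ways to solve it

4.2 Working through the problem in practice

5: Résumé and project questions

6: Summary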

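1: Self-introduction

Why did you choose this position?

What is your understanding of the big data technology stack?

Is it really because I love data and enjoy digging into big data technology? Hahahaha 😂

At the moment I know the Hadoop ecosystem fairly well, but not in enough depth. Going forward I will focus on Hive fundamentals and techniques, study HQL in depth, and get fully comfortable with Hadoop.

I also need to memorize the principles and concepts behind components such as MapReduce, Flume, and Kafka, and start learning the Spark stack as soon as possible.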

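2: Data skew

What is data skew?

Why does data skew occur?

Have you run into data skew in practice? How did you solve it?

2.1. What is it?

The job's progress stays at 99% for a long time, and the monitoring page shows that only a few reduce subtasks are still unfinished, because those few reducers received far more data than the rest.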

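2.2. Why does it happen?

Why does data skew occur? Put simply, if one key in a wordcount job has a very large amount of data, all of it goes to the same reducer and the job becomes skewed.

Which operations usually cause it? Typically count(distinct), group by, and join, which send too much data to a single reducer,

so that reducer takes a very long time to finish.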
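The wordcount case can be made concrete with a small sketch. It assigns each key to a reducer by hash, in the spirit of a default hash partitioner, and counts how many records each reducer receives. The class and method names (SkewDemo, recordsPerReducer) are hypothetical and exist only for this illustration.

```java
import java.util.List;

class SkewDemo {
    // Counts how many records each reducer receives when keys are hash-partitioned.
    static int[] recordsPerReducer(List<String> keys, int numReducers) {
        int[] counts = new int[numReducers];
        for (String key : keys) {
            // Mask the sign bit before the modulo so the bucket index is never negative
            int reducer = (key.hashCode() & Integer.MAX_VALUE) % numReducers;
            counts[reducer]++;
        }
        return counts;
    }
}
```

If 97 out of 100 records carry the same word, the reducer that owns that word receives at least 97 records no matter how many reducers are configured, because identical keys always hash to the same bucket.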

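1) group by

Note: group by generally performs better than distinct.
Scenario: the grouping dimension has too few distinct values, and one value has too many rows.
Consequence: the reducer that handles that value takes a very long time.
Solution: replace count(distinct) with a sum() ... group by computation.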

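2) count(distinct)

Scenario: one special value (for example an empty value) occurs far too often.
Consequence: the reducer that handles this special value is slow; there is only one reduce task.
Solution: when doing count(distinct), handle empty values separately. For example, filter out the rows whose value is empty and add 1 to the final result. If other calculations that need group by are involved, first process the records with empty values on their own, then union them with the other results.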

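3) Joining columns of different data types

Scenario: user_id in the users table is an int, while user_id in the logs table contains both string and int values. When the two tables are joined on user_id, the default hash partitions by the int id, so all records with a string id are sent to the same reducer.
Consequence: the reducer that handles those records is slow; in effect only one reduce task does the work.
Solution: convert the numeric type to string.
select * from users a
left outer join logs b
on cast(a.user_id as string) = b.user_id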

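2.3. Consequences

It slows down the whole job (every node that has already finished waits for the one node that is still running).

2.4. Categories

(Borrowing the data skew examples from Spark; this is not part of HQL, to be revisited later.)

2.5 Additional notes on data skew categories

1) Aggregation skew

(local aggregation + global aggregation)

2) Join skew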
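The "local aggregation + global aggregation" idea for aggregation skew can be sketched as a two-stage count. This is a minimal in-memory illustration, not Hive or Spark code: a salt suffix spreads one hot key over several partial groups, and a second pass strips the salt and merges the partial counts. The names TwoStageCount, localAggregate, and globalAggregate, as well as the "#" separator, are assumptions made for this sketch.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class TwoStageCount {
    // Stage 1 (local): append a salt so one hot key is split into `salts` partial groups.
    static Map<String, Long> localAggregate(List<String> keys, int salts) {
        Map<String, Long> partial = new HashMap<>();
        int i = 0;
        for (String key : keys) {
            // Round-robin salt for determinism; a random salt is the more common choice
            String salted = key + "#" + (i++ % salts);
            partial.merge(salted, 1L, Long::sum);
        }
        return partial;
    }

    // Stage 2 (global): strip the salt and merge the partial counts per original key.
    static Map<String, Long> globalAggregate(Map<String, Long> partial) {
        Map<String, Long> total = new HashMap<>();
        for (Map.Entry<String, Long> e : partial.entrySet()) {
            String key = e.getKey().substring(0, e.getKey().lastIndexOf('#'));
            total.merge(key, e.getValue(), Long::sum);
        }
        return total;
    }
}
```

In a real job each salted group would land on a different reducer, so the hot key's work is shared in stage 1 and stage 2 only has to merge a handful of partial results.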

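3: Coding problem: the smallest k numbers in an array

3.1 Two methods

Method 1: brute force. Call the library sort and return the first k elements. (This is certainly not the answer the interviewer is looking for.)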

import java.util.Arrays;

class Solution {
    public int[] smallestK(int[] arr, int k) {
        // Brute force: sort with the library routine, then copy out the first k elements
        int[] res = new int[k];
        Arrays.sort(arr);
        for (int i = 0; i < k; i++) {
            res[i] = arr[i];
        }
        return res;
    }
}

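Method 2: use a priority queue as a heap.

Keep k elements in a max-heap, then compare the heap top with each remaining element; whenever an element is smaller than the top, it replaces the top. This takes O(n log k) time instead of a full sort.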

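import java.util.PriorityQueue;

class Solution {
    public int[] smallestK(int[] arr, int k) {
        int[] res = new int[k];
        if (k == 0) { return res; }
        // Build a max-heap
        PriorityQueue<Integer> pqueue = new PriorityQueue<Integer>((a, b) -> b - a);
        // Seed the heap with the first k elements
        for (int i = 0; i < k; i++) {
            pqueue.offer(arr[i]);
        }
        // Replace the current maximum whenever a smaller element appears
        for (int i = k; i < arr.length; i++) {
            if (pqueue.peek() > arr[i]) {
                pqueue.poll();
                pqueue.offer(arr[i]);
            }
        }
        // Drain the heap (largest first) into the result
        for (int i = 0; i < k; i++) {
            res[i] = pqueue.poll();
        }
        return res;
    }
}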

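3.2 Summary of the priority-queue-as-heap knowledge used here:

1. Using Java's PriorityQueue

PriorityQueue<Integer> pqueue = new PriorityQueue<Integer>();

When no comparator is specified, the queue is a min-heap by default, with an initial capacity of 11.

Passing a custom comparator gives a max-heap: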

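// Requires java.util.Comparator and java.util.PriorityQueue
PriorityQueue<Integer> pqueue = new PriorityQueue<Integer>(new Comparator<Integer>() {
    @Override
    public int compare(Integer a, Integer b) {
        return b - a; // larger values come first
    }
});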

Or the short version:

PriorityQueue<Integer> pqueue = new PriorityQueue<Integer>((a,b)->b-a);
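One caveat about a comparator written as (a,b)->b-a: the subtraction can overflow when the two values are far apart (for example, 1 - Integer.MIN_VALUE wraps around to a negative number), which silently breaks the ordering. A sketch of an overflow-safe alternative is to let the standard library reverse the natural order. The wrapper class MaxHeapDemo and its method are hypothetical names used only here.

```java
import java.util.Comparator;
import java.util.PriorityQueue;

class MaxHeapDemo {
    // Builds a max-heap without any subtraction, so extreme values cannot overflow.
    static PriorityQueue<Integer> newMaxHeap() {
        return new PriorityQueue<>(Comparator.reverseOrder());
    }
}
```

For interview inputs with small values both versions behave the same; the difference only shows up near the int limits.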

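3.3. Summary of custom comparators

This is the same kind of usage as (a,b)->b-a, i.e. a custom comparator.

For example, in an earlier coding problem, "Minimum Number of Arrows to Burst Balloons", we first need to sort the data by its second dimension.

Given intervals of the form [left, right], we need to sort by the right endpoint.

We use Arrays.sort(points,(a,b)->a[1]>b[1]?1:-1)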
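A comparator of the form a[1]>b[1]?1:-1 never returns 0 for equal right endpoints, which does not satisfy the Comparator contract and can in principle make the sort misbehave. The sketch below uses Integer.compare instead and shows the sort in the context of the greedy balloon solution. It is one common way to solve that problem, written from memory of the problem statement rather than taken from the original post; BalloonArrows and minArrows are names chosen for this example.

```java
import java.util.Arrays;

class BalloonArrows {
    // Greedy: sort by right endpoint, shoot at the right end of the first uncovered balloon.
    static int minArrows(int[][] points) {
        if (points.length == 0) return 0;
        int[][] sorted = points.clone(); // shallow copy so the caller's order is untouched
        // Integer.compare returns 0 for ties and cannot overflow
        Arrays.sort(sorted, (a, b) -> Integer.compare(a[1], b[1]));
        int arrows = 1;
        int arrowPos = sorted[0][1];
        for (int[] p : sorted) {
            if (p[0] > arrowPos) { // this balloon starts after the last arrow position
                arrows++;
                arrowPos = p[1];
            }
        }
        return arrows;
    }
}
```

For the intervals [10,16], [2,8], [1,6], [7,12] the method returns 2: one arrow at x = 6 and one at x = 12.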

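4: Writing SQL

4.1 Two ways to solve it

Given the table zijie_ads below, for each calendar week, find the page referral source (resource) of the new users whose completion rate (play_rate) ranks in the top 5.

day (date)	id (int)	user_type (int)	play_rate (double)	resource (string)
2021-01-04	1	1	0.4	type_a
2021-09-22	2	0	0.4	type_b
......	......	......	......	......

Approach: the problem has two difficulties. 1) How do we find the top 5 by completion rate?

2) How do we restrict the ranking to each calendar week, i.e. how do we represent a calendar week?

For the first part, finding the top 5 by completion rate, see my

blog post https://blog.csdn.net/yezonghui/article/details/115283626, Problem 4: the employees with the top three salaries in each department.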

Method 1 here: MySQL (correlated subquery)

select h1.resource
from zijie_ads h1
where h1.user_type = 1
      and
(select count(distinct h2.play_rate)
 from zijie_ads h2
 where h2.user_type = 1  -- rank among new users only
   and h2.play_rate > h1.play_rate
) < 5

Method 2 here: dense_rank() over()

In other words, dense_rank() does not need a partition by clause, but it generally needs an order by to say which column to rank on.

select temp_table.resource
from
(
  select resource, dense_rank() over(order by play_rate desc) as orderrank
  from zijie_ads
  where user_type = 1
) temp_table
where orderrank<6;

[Note]: count(distinct ...) in Method 1 corresponds to dense_rank in Method 2;

without distinct, Method 1 corresponds to rank in Method 2.
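The correspondence between counting and ranking can be checked with a small sketch: counting the rows with a strictly higher value gives rank, while counting the distinct higher values gives dense_rank. The class RankDemo and its methods are illustrative names, not part of any SQL engine.

```java
import java.util.Arrays;

class RankDemo {
    // rank: 1 + number of rows with a strictly higher score (count without distinct)
    static int rank(double[] scores, double x) {
        int higher = 0;
        for (double s : scores) {
            if (s > x) higher++;
        }
        return higher + 1;
    }

    // denseRank: 1 + number of DISTINCT strictly higher scores (count(distinct ...))
    static int denseRank(double[] scores, double x) {
        return (int) Arrays.stream(scores).filter(s -> s > x).distinct().count() + 1;
    }
}
```

With the scores 0.9, 0.9, 0.8, 0.7, the value 0.8 has rank 3 but dense rank 2, because the two tied 0.9 rows count twice for rank and only once for dense rank.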

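The second part: how do we handle calendar weeks?

This is where Hive's date functions come in. For example, weekofyear() returns the calendar week number of a given date.

So how do we group by calendar week? As in the first part, there are two ways:

add the condition weekofyear(h1.day1) = weekofyear(h2.day1) to the correlated subquery,

or partition inside the window function: dense_rank() over(partition by weekofyear(day1) order by play_rate desc)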
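To get a feel for what a week number looks like around a year boundary, here is a sketch using java.time with ISO-8601 weeks (weeks start on Monday, and week 1 is the week containing the first Thursday). To my understanding Hive's weekofyear uses the same ISO-style numbering, but treat that as an assumption and check it against your Hive version. The class WeekKey and its methods are hypothetical helpers.

```java
import java.time.LocalDate;
import java.time.temporal.IsoFields;

class WeekKey {
    // ISO-8601 week number of a date given as yyyy-MM-dd.
    static int weekOfYear(String isoDate) {
        return LocalDate.parse(isoDate).get(IsoFields.WEEK_OF_WEEK_BASED_YEAR);
    }

    // Year-qualified key, so the same week number in different years never collides.
    static String yearWeek(String isoDate) {
        LocalDate d = LocalDate.parse(isoDate);
        return d.get(IsoFields.WEEK_BASED_YEAR) + "-W" + d.get(IsoFields.WEEK_OF_WEEK_BASED_YEAR);
    }
}
```

Under ISO rules 2021-01-02 falls in week 53 (of week-based year 2020), 2021-01-04 in week 1, and 2021-01-11 in week 2. A week number on its own repeats every year, so when the data spans more than one year it is safer to partition by a year-qualified key rather than by the week number alone.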

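Putting the two parts together, the final answers are as follows:

Method 1:

select h1.resource
from haokan_ads_test02 h1
where h1.user_type = 1
  and (select count(h2.play_rate)      -- no distinct, so this behaves like rank()
       from haokan_ads_test02 h2
       where h2.user_type = 1          -- compare against new users only
         and h2.play_rate > h1.play_rate
         and weekofyear(h1.day1) = weekofyear(h2.day1)
      ) < 3;                           -- fewer than 3 higher rates, i.e. the top 3

Method 2:

select temp_table.resource
from (select resource,
             dense_rank() over (partition by weekofyear(day1) order by play_rate desc) as ordrrank
      from haokan_ads_test02
      where user_type = 1) temp_table
where ordrrank < 3;                    -- ranks 1 and 2 only; use < 6 for the top 5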

4.2 Working through the problem in practice

1) Create the table

create table if not exists haokan_ads_test02
(
    user_id   int,
    user_type int,
    day1      date,
    play_rate double,
    resource  string
)
row format delimited fields terminated by ' '
lines terminated by '\n';

2) Prepare the data

1 1 2021-01-02 0.6 ads1
2 1 2021-01-08 0.9 ads2
3 0 2021-01-03 0.52 ads3
4 1 2021-01-07 0.62 ads4
5 1 2021-01-11 0.19 ads5
6 0 2021-01-02 0.18 ads6
7 1 2021-01-02 0.49 ads7
8 0 2021-01-03 0.39 ads8
9 0 2021-01-09 0.21 ads9
10 0 2021-01-03 0.39 ads10
11 0 2021-01-04 0.25 ads11
12 0 2021-01-03 0.35 ads12
13 0 2021-01-09 0.1 ads13

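3) Upload the data file from the local Linux machine to HDFS

On the Linux command line, change to the directory that contains the file

and run

hdfs dfs -put haokan_ads_test02.txt /user/hive/warehouse

4) Load the data into the table created above

(The data can be loaded either from the local file system or from HDFS. The statement below uses the local keyword, so it reads the local file; to load the copy uploaded in step 3 instead, drop local and give the HDFS path.)

load data local inpath '/home/atguigu/bin/haokan_ads_test02.txt'
overwrite into table haokan_ads_test02;

5) Run the select that the question asks for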
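As a cross-check for the select, the same logic can be replayed in plain Java: keep new users (user_type = 1), group by week, and keep the rows whose dense rank by play_rate is within the cutoff. This is only a sketch for verifying results by hand. It assumes ISO-8601 weeks and uses a year-qualified week key, which may differ from what your Hive version computes; the names WeeklyTopSources, Row, and topSources are made up for the example.

```java
import java.time.LocalDate;
import java.time.temporal.IsoFields;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

class WeeklyTopSources {
    static final class Row {
        final int userType;
        final LocalDate day;
        final double playRate;
        final String resource;

        Row(int userType, String day, double playRate, String resource) {
            this.userType = userType;
            this.day = LocalDate.parse(day);
            this.playRate = playRate;
            this.resource = resource;
        }
    }

    // Returns the resources of new users whose dense rank within their week is <= topN.
    static List<String> topSources(List<Row> rows, int topN) {
        // where user_type = 1, partition by (week-based year, week number)
        Map<String, List<Row>> byWeek = rows.stream()
            .filter(r -> r.userType == 1)
            .collect(Collectors.groupingBy(
                r -> r.day.get(IsoFields.WEEK_BASED_YEAR) + "-" + r.day.get(IsoFields.WEEK_OF_WEEK_BASED_YEAR),
                TreeMap::new,
                Collectors.toList()));
        List<String> result = new ArrayList<>();
        for (List<Row> week : byWeek.values()) {
            for (Row r : week) {
                // dense_rank = 1 + number of distinct higher play rates in the same week
                long denseRank = 1 + week.stream()
                    .mapToDouble(x -> x.playRate)
                    .filter(p -> p > r.playRate)
                    .distinct()
                    .count();
                if (denseRank <= topN) {
                    result.add(r.resource);
                }
            }
        }
        return result;
    }
}
```

Fed with the 13 sample rows from step 2, topN = 1 yields ads1, ads2, and ads5 (one per week), while topN = 2 returns all five new-user rows, since no week in the sample has more than two new users.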

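5: Résumé and project questions

Describe skills on the résumé modestly, preferring wording such as "familiar with" or "have a working grasp of". Anything that does go on the résumé must be something you know thoroughly, so rehearse it several times beforehand,

and prepare in advance for the questions the interviewer is likely to ask.

6: Summary

My hands-on Hive fundamentals are a bit weak; I have not yet built up fluency with window functions, date functions, or the way of thinking behind them.

Practice more and think more. When learning a topic, master it completely rather than settling for a vague, half-right understanding.

Memorize more, understand more.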


2021 ByteDance Big Data R&D Interview Retrospective
