How much do you know about Hive partitions?


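I. Theoretical background

1. Background of Hive partitions

A Hive SELECT query normally scans the entire table, which wastes a lot of time on unnecessary work. Often only the part of the table you care about needs to be scanned, so the partition concept was introduced at table-creation time.

2. What a Hive partition really is

Hive is an abstraction over data stored on HDFS. Each partition of a Hive table corresponds to a directory on HDFS; the partition column is not an actual field stored in the data files.

3. Why Hive partitions matter

Partitions assist queries: they narrow the range that has to be scanned, speed up data retrieval, and let you query data by a given specification or condition. They also make data management easier.

4. Common partitioning schemes

Data in Hive tables is usually partitioned along dimensions such as time, region, or category.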
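As a minimal sketch of these ideas, the hypothetical table below is partitioned by date and region (access_log, user_id, url, dt, and region are illustrative names that do not appear elsewhere in this article). A filter on the partition columns lets Hive read only the matching directories instead of the whole table.

```sql
-- Hypothetical table partitioned by time and region
create table if not exists access_log (
    user_id int
   ,url     string
)
partitioned by (dt string, region string);

-- Partition filter: only the directory dt=20170101/region=beijing is read
select count(*)
from access_log
where dt = '20170101'
  and region = 'beijing'
  and url like '%hive%';

-- No partition filter: the files of every partition have to be read
select count(*)
from access_log
where url like '%hive%';
```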

II. Single-partition operations

1. Creating a partitioned table

create table if not exists t1(
    id      int
   ,name    string
   ,hobby   array<string>
   ,add     map<string,string>
)
partitioned by (pt_d string)
row format delimited
fields terminated by ','
collection items terminated by '-'
map keys terminated by ':'
;

Note: a partition column must not have the same name as a column of the table.
If it does, the statement fails, as shown below:

create table t10(
    id      int
   ,name    string
   ,hobby   array<string>
   ,add     map<string,string>
)
partitioned by (id int)
row format delimited
fields terminated by ','
collection items terminated by '-'
map keys terminated by ':'
;

Error message: FAILED: SemanticException [Error 10035]: Column repeated in partitioning columns

2. Loading data

The file to be loaded contains the following:

1,xiaoming,book-TV-code,beijing:chaoyang-shagnhai:pudong
2,lilei,book-code,nanjing:jiangning-taiwan:taibei
3,lihua,music-book,heilongjiang:haerbin

Run load data:

load data local inpath '/home/hadoop/Desktop/data' overwrite into table t1 partition ( pt_d = '201701');

3. Viewing data and partitions

To view the data of a partition, use the partition column just like an ordinary column:

select * from t1 where pt_d = '201701';

Result:

1   xiaoming    ["book","TV","code"]    {"beijing":"chaoyang","shagnhai":"pudong"}  201701
2   lilei   ["book","code"] {"nanjing":"jiangning","taiwan":"taibei"}   201701
3   lihua   ["music","book"]    {"heilongjiang":"haerbin"}  201701

List the partitions:

show partitions t1;

4. Inserting into another partition

Create another copy of the data and load it into the partition pt_d = '000000':

load data local inpath '/home/hadoop/Desktop/data' overwrite into table t1 partition ( pt_d = '000000');

View the data:

select * from t1;

1   xiaoming    ["book","TV","code"]    {"beijing":"chaoyang","shagnhai":"pudong"}  000000
2   lilei   ["book","code"] {"nanjing":"jiangning","taiwan":"taibei"}   000000
3   lihua   ["music","book"]    {"heilongjiang":"haerbin"}  000000
1   xiaoming    ["book","TV","code"]    {"beijing":"chaoyang","shagnhai":"pudong"}  201701
2   lilei   ["book","code"] {"nanjing":"jiangning","taiwan":"taibei"}   201701
3   lihua   ["music","book"]    {"heilongjiang":"haerbin"}  201701

5. Inspecting the files on HDFS

Look at the files on HDFS:

http://namenode:50070/explorer.html#/user/hive/warehouse/test.db/t1

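As you can see, the files are stored separately per partition. Each additional partition is one more directory under the table directory.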
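To check this layout without the web UI, one option is the dfs command of the Hive CLI. This is only a sketch: the path is taken from the URL above and assumes the default warehouse location and a database named test, so adjust it for your cluster.

```sql
-- List the table directory; after the two loads above one would expect
-- the subdirectories pt_d=000000 and pt_d=201701
dfs -ls /user/hive/warehouse/test.db/t1;

-- Show the metadata of one partition, including its Location on HDFS
describe formatted t1 partition (pt_d = '201701');
```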

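Query the data of a given partition:

select * from t1 where pt_d = '000000';

Add a partition, which adds one partition directory:

alter table t1 add partition (pt_d = '333333');

Drop a partition (this deletes the corresponding partition files):

Note: for an external table, drop partition does not delete the files on HDFS, and running msck repair table table_name syncs the partitions found on HDFS back into the metastore.

alter table test1 drop partition (pt_d = '20170101');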

III. Multi-partition operations

1. Creating a partitioned table

create table t10(
    id      int
   ,name    string
   ,hobby   array<string>
   ,add     map<string,string>
)
partitioned by (pt_d string,sex string)
row format delimited
fields terminated by ','
collection items terminated by '-'
map keys terminated by ':'
;

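2. Loading data (every partition column must be specified)

load data local inpath '/home/hadoop/Desktop/data' overwrite into table t10 partition ( pt_d = '0');

If only one partition column is given, as in the statement above, it fails: FAILED: SemanticException [Error 10006]: Line 1:88 Partition not found ''0''

Specifying both partition columns works: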

load data local inpath '/home/hadoop/Desktop/data' overwrite into table t10 partition ( pt_d = '0',sex='male');
load data local inpath '/home/hadoop/Desktop/data' overwrite into table t10 partition ( pt_d = '0',sex='female');

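Looking at the files on HDFS, you can see that multiple partition columns are nested in the order they were declared. Think of it as the tree-like folder structure in Windows.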
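The same nesting is visible from Hive itself. A short sketch, assuming the two load statements above succeeded:

```sql
-- Each line has the form <first column>=<value>/<second column>=<value>;
-- after the two loads above one would expect
--   pt_d=0/sex=female
--   pt_d=0/sex=male
show partitions t10;

-- Restrict the listing to the sub-partitions of one parent partition
show partitions t10 partition (pt_d = '0');
```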

IV. Adding, dropping, repairing, and listing table partitions
1. Adding partitions
Here we create a partitioned external table:

create external table testljb (
    id int
) partitioned by (age int);

Adding partitions

Syntax from the official documentation:

ALTER TABLE table_name ADD [IF NOT EXISTS] PARTITION partition_spec [LOCATION 'location'][, PARTITION partition_spec [LOCATION 'location'], ...];

partition_spec:
  : (partition_column = partition_col_value, partition_column = partition_col_value, ...)

Examples

Add one partition at a time:

alter table testljb add partition (age=2);

Add several partitions at the same level (same partition column) in one statement:

alter table testljb add partition(age=3) partition(age=4);

Note: never write it like this:

alter table testljb add partition(age=5,age=6);

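If we run show partitions table_name, we find that only the age=6 partition was added.

A guess at the reason: this form is really the syntax for adding one partition to a table with several partition columns. Here the same column is written twice, but the table has no second age partition column, so only one of the two values ends up being added.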
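To add both values, follow the working form shown earlier and give each partition its own partition clause. A small sketch, with if not exists added (it is part of the official syntax quoted above) so that the statement can be re-run safely:

```sql
-- One partition clause per partition; a single clause such as
-- partition (age = 5, age = 6) describes only one partition
alter table testljb add if not exists
    partition (age = 5)
    partition (age = 6);

-- Both age=5 and age=6 should now be listed
show partitions testljb;
```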

Adding parent and child partitions:

For example, suppose a table has two partition columns, age and sex. To add the partition with age 1 and sex male, you can write:

alter table testljb add partition(age=1,sex='male');

2. Dropping partitions

Drop the partition age=1:

alter table testljb drop partition(age=1);

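Note: suppose table testljb has two partition columns, partitioned by (age int, sex string). As mentioned above, the order of multiple partition columns resembles the tree structure of Windows folders. Dropping an age partition (the first level) therefore also drops every sex partition underneath it.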
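The two-column table t10 from section III can be used to sketch the difference between a full and a partial partition spec. The statements below are alternatives, not a sequence, and if exists is added only so that they do not fail when the partition is already gone:

```sql
-- Full spec: drops only the single leaf partition pt_d=0/sex=male
alter table t10 drop if exists partition (pt_d = '0', sex = 'male');

-- Partial spec: drops pt_d=0 together with every sex sub-partition under it
alter table t10 drop if exists partition (pt_d = '0');
```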

3. Repairing partitions

Repairing partitions means re-syncing the partition information found on HDFS into the Hive metastore.

msck repair table table_name;
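A sketch of the external-table behavior noted earlier, using testljb. It assumes the age=2 partition was added as shown above, that its directory is still present on HDFS, and that default settings for external tables are in effect:

```sql
-- testljb is an external table, so dropping a partition removes only metadata
alter table testljb drop if exists partition (age = 2);
show partitions testljb;      -- age=2 is no longer listed

-- The directory age=2 is still on HDFS, so the repair should register it again
msck repair table testljb;
show partitions testljb;
```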

4. Listing partitions

show partitions table_name;


As usual, here is my personal public account: 魯邊社. Feel free to follow.


Please credit the source when reposting. Link to this article: https://www.uj5u.com/shujuku/500858.html
