文章目錄

1.HQL操作之DDL命令
- （1）資料庫操作
- （2）建表語法
- （3）內部表及外部表
- （4）磁區表
- （5）分桶表
- （6）修改表及洗掉表
5.HQL操作之資料操作
- （1）load裝載資料
- （2）insert插入資料
6.HQL操作之DQL命令
- （1）簡單查詢
- （2）簡單子句
- （3）group by分組子句
- （4）表連接
- （5）order by排序子句
- （6）sort by排序
- （7）distribute by和cluster by排序

1.HQL操作之DDL命令

Hive資料庫的層次如下：

hive level

可以看到，一個資料庫可以包括多張表，一張表可以分為多個磁區，同時一個磁區還可以分為多個分桶，

DDL，即資料定義語言（data definition language），主要的操作包括CREATE、ALTER、DROP等，主要用來定義、修改資料庫物件的結構或資料型別，

要查看最完整的DDL官網命令，可以查看https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL，

（1）資料庫操作

Hive有一個默認的資料庫default，在操作HQL時，如果不明確要使用哪個庫，則使用默認資料庫，

Hive資料庫命名規則如下：

資料庫名、表名均不區分大小寫；

名字不能使用數字開頭；

不能使用關鍵字，盡量不使用特殊符號；

操作資料庫的完整語法如下：

-- 創建資料庫
CREATE (DATABASE|SCHEMA) [IF NOT EXISTS] database_name
 [COMMENT database_comment]
 [LOCATION hdfs_path]
 [MANAGEDLOCATION hdfs_path]
 [WITH DBPROPERTIES (property_name=property_value, ...)];

-- 查看資料庫
-- 查看所有資料庫
show database;
-- 查看資料庫資訊
desc database database_name;
desc database extended database_name;
describe database extended database_name;

-- 使用資料庫
use database_name;

-- 洗掉資料庫
-- 洗掉一個空資料庫
drop database databasename;
-- 如果資料庫不為空，使用cascade強制洗掉
drop database databasename cascade;

使用示意如下：

hive (default)> create database mydb;
OK
Time taken: 0.18 seconds
hive (default)> show databases;
OK
database_name
default
mydb
test1
Time taken: 0.049 seconds, Fetched: 3 row(s)
hive (default)> dfs -ls /user/hive/warehouse;
Found 3 items
drwxrwxrwx   - root supergroup          0 2021-09-21 14:39 /user/hive/warehouse/mydb.db
drwxrwxrwx   - root supergroup          0 2021-09-21 14:09 /user/hive/warehouse/s1
drwxrwxrwx   - root supergroup          0 2021-09-20 18:52 /user/hive/warehouse/test1.db
hive (default)> create database if not exists mydb;
OK
Time taken: 0.078 seconds
hive (default)> create database if not exists mydb2
              > comment 'this is my db2'
              > location '/user/hive/mydb2.db';
OK
Time taken: 0.14 seconds
hive (default)> show databases;
OK
database_name
default
mydb
mydb2
test1
Time taken: 0.024 seconds, Fetched: 4 row(s)
hive (default)> use mydb;
OK
Time taken: 0.041 seconds
hive (mydb)> desc database mydb2;
OK
db_name comment location        owner_name      owner_type      parameters
mydb2   this is my db2  hdfs://node01:9000/user/hive/mydb2.db   root    USER    
Time taken: 0.029 seconds, Fetched: 1 row(s)
hive (mydb)> desc database extended mydb2;
OK
db_name comment location        owner_name      owner_type      parameters
mydb2   this is my db2  hdfs://node01:9000/user/hive/mydb2.db   root    USER    
Time taken: 0.025 seconds, Fetched: 1 row(s)
hive (mydb)> drop database test1;
OK
Time taken: 0.338 seconds
hive (mydb)> show databases;
OK
database_name
default
mydb
mydb2
Time taken: 0.021 seconds, Fetched: 3 row(s)
hive (mydb)> create table t1(id int);
OK
Time taken: 0.164 seconds
hive (mydb)> drop database mydb2 cascade;
OK
Time taken: 0.083 seconds
hive (mydb)> show databases;
OK
database_name
default
mydb
Time taken: 0.049 seconds, Fetched: 2 row(s)
hive (mydb)> show tables;
OK
tab_name
t1
Time taken: 0.032 seconds, Fetched: 1 row(s)
hive (mydb)>

可以看到，創建資料庫可以通過選項設定備注和自定義存放路徑；

在洗掉資料庫時，如果資料庫不為空，則不能直接洗掉，而要使用cascade指定強制洗掉，

（2）建表語法

Hive中創建表的語法如下：

-- as方式
create [external] table [IF NOT EXISTS] table_name
[(colName colType [comment 'comment'], ...)]
[comment table_comment]
[partition by (colName colType [comment col_comment], ...)]
[clustered BY (colName, colName, ...)
[sorted by (col_name [ASC|DESC], ...)] into num_buckets
buckets]
[row format row_format]
[stored as file_format]
[LOCATION hdfs_path]
[TBLPROPERTIES (property_name=property_value, ...)]
[AS select_statement];

-- like方式
CREATE [TEMPORARY] [EXTERNAL] TABLE [IF NOT EXISTS]
[db_name.]table_name
 LIKE existing_table_or_view_name
 [LOCATION hdfs_path];

其中，各部分的含義如下：

關鍵字	含義
CREATE TABLE	按給定名稱創建表，如果表已經存在則拋出例外，可使用if not exists規避
EXTERNAL	創建外部表，否則創建的是內部表（管理表）：洗掉內部表時，資料和表的定義同時被洗掉；洗掉外部表時，僅僅洗掉了表的定義，資料保留，在生產環境中，多使用外部表，
comment	表的注釋
partition by	對表中資料進行磁區，指定表的磁區欄位
clustered by	創建分桶表，指定分桶欄位
sorted by	對桶中的一個或多個列排序，較少使用
ROW FORMAT DELIMITED	存盤子句：建表時可指定SerDe；如果沒有指定ROW FORMAT或者ROW FORMAT DELIMITED，將會使用默認的 SerDe；建表時還需要為表指定列，在指定列的同時也會指定自定義的SerDe； Hive通過SerDe確定表的具體的列的資料
stored as SEQUENCEFILE	TEXTFILE
LOCATION	表在HDFS上的存放位置
TBLPROPERTIES	定義表的屬性
AS	后面可以接查詢陳述句，表示根據后面的查詢結果創建表
LIKE	`like 表名`，允許用戶復制現有的表結構，但是不復制資料

其中，存盤子句用于建表時指定SerDe，格式如下：

ROW FORMAT DELIMITED
-- 欄位分隔符
[FIELDS TERMINATED BY char]
-- 集合分隔符
[COLLECTION ITEMS TERMINATED BY char]
-- map的鍵值分隔符
[MAP KEYS TERMINATED BY char]
-- 行分隔符
[LINES TERMINATED BY char] | SERDE serde_name
[WITH SERDEPROPERTIES (property_name=property_value,
property_name=property_value, ...)]

SerDe是Serialize/Deserilize的簡稱，Hive使用Serde進行行物件的序列與反序列化；

如果沒有指定ROW FORMAT或者ROW FORMATDELIMITED，將會使用默認的SerDe；

建表時還需要為表指定列，在指定列的同時也會指定自定義的SerDe，Hive通過SerDe確定表的具體的列的資料，

（3）內部表及外部表

在創建表的時候，可指定表的型別，表有兩種型別，分別是內部表(管理表)、外部表：

默認情況下（不指定external關鍵字），創建內部表，如果要創建外部表，需要使用關鍵字external；

在洗掉內部表時，表的定義（元資料）和資料同時被洗掉；

在洗掉外部表時，僅洗掉表的定義，資料被保留；

在生產環境中，多使用外部表，

在測驗表之前，準備資料，vim /home/hadoop/data/t1.dat，輸入以下內容：

2;zhangsan;book,TV,code;beijing:chaoyang,shagnhai:pudong
3;lishi;book,code;nanjing:jiangning,taiwan:taibei
4;wangwu;music,book;heilongjiang:haerbin

先測驗內部表：

hive (default)> use mydb;
OK
Time taken: 0.034 seconds
hive (mydb)> show tables;
OK
tab_name
Time taken: 0.026 seconds
-- 創建內部表
hive (mydb)> create table t1(
           > id int,
           > name string,
           > hobby array<string>,
           > addr map<string, string>
           > )
           > row format delimited
           > fields terminated by ";"
           > collection items terminated by ","
           > map keys terminated by ":"
           > ;
OK
Time taken: 0.156 seconds
hive (mydb)> show tables;
OK
tab_name
t1
Time taken: 0.031 seconds, Fetched: 1 row(s)
-- 顯示表的定義，顯示的資訊較少
hive (mydb)> desc t1;
OK
col_name        data_type       comment
id                      int                                         
name                    string                                      
hobby                   array<string>                               
addr                    map<string,string>                          
Time taken: 0.064 seconds, Fetched: 4 row(s)
-- 顯示表的定義，顯示的資訊多，格式友好
hive (mydb)> desc formatted t1;
OK
col_name        data_type       comment
# col_name              data_type               comment             
                 
id                      int                                         
name                    string                                      
hobby                   array<string>                               
addr                    map<string,string>                          
                 
# Detailed Table Information             
Database:               mydb                     
Owner:                  root                     
CreateTime:             Wed Sep 22 17:02:03 CST 2021     
LastAccessTime:         UNKNOWN                  
Retention:              0                        
Location:               hdfs://node01:9000/user/hive/warehouse/mydb.db/t1        
Table Type:             MANAGED_TABLE            
Table Parameters:                
        COLUMN_STATS_ACCURATE   {\"BASIC_STATS\":\"true\"}
        numFiles                0                   
        numRows                 0                   
        rawDataSize             0                   
        totalSize               0                   
        transient_lastDdlTime   1632301323          
                 
# Storage Information            
SerDe Library:          org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe       
InputFormat:            org.apache.hadoop.mapred.TextInputFormat         
OutputFormat:           org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat       
Compressed:             No                       
Num Buckets:            -1                       
Bucket Columns:         []                       
Sort Columns:           []                       
Storage Desc Params:             
        colelction.delim        ,                   
        field.delim             ;                   
        mapkey.delim            :                   
        serialization.format    ;                   
Time taken: 0.089 seconds, Fetched: 36 row(s)
-- 加載資料
hive (mydb)> load data local inpath "/home/hadoop/data/t1.dat" into table t1;
Loading data to table mydb.t1
OK
Time taken: 0.461 seconds
-- 查詢資料
hive (mydb)> select * from t1;
OK
t1.id   t1.name t1.hobby        t1.addr
2       zhangsan        ["book","TV","code"]    {"beijing":"chaoyang","shagnhai":"pudong"}
3       lishi   ["book","code"] {"nanjing":"jiangning","taiwan":"taibei"}
4       wangwu  ["music","book"]        {"heilongjiang":"haerbin"}
Time taken: 0.169 seconds, Fetched: 3 row(s)
-- 查看資料檔案
hive (mydb)> dfs -ls /user/hive/warehouse/mydb.db/t1;
Found 1 items
-rwxrwxrwx   3 root supergroup        148 2021-09-22 17:02 /user/hive/warehouse/mydb.db/t1/t1.dat
hive (mydb)> dfs -cat /user/hive/warehouse/mydb.db/t1/t1.dat;
2;zhangsan;book,TV,code;beijing:chaoyang,shagnhai:pudong
3;lishi;book,code;nanjing:jiangning,taiwan:taibei
4;wangwu;music,book;heilongjiang:haerbin
-- 洗掉表，表和資料同時被洗掉
hive (mydb)> drop table t1;
OK
Time taken: 0.198 seconds
hive (mydb)> show tables;
OK
tab_name
Time taken: 0.025 seconds
-- 再次查詢資料檔案，已經被洗掉
hive (mydb)> dfs -ls /user/hive/warehouse/mydb.db/t1;
ls: `/user/hive/warehouse/mydb.db/t1': No such file or directory
Command -ls /user/hive/warehouse/mydb.db/t1 failed with exit code = 1
Query returned non-zero code: 1, cause: null
hive (mydb)> dfs -ls /user/hive/warehouse/mydb.db/;
hive (mydb)>

可以看到，在創建內部表時，型別是MANAGED_TABLE；

在洗掉內部表之后，不僅表的定義被洗掉，資料（表HDFS中對應的檔案）也被洗掉，

再測驗使用外部表：

hive (mydb)> create external table t2(
           > id int,
           > name string,
           > hobby array<string>,
           > addr map<string, string>
           > )
           > row format delimited
           > fields terminated by ";"
           > collection items terminated by ","
           > map keys terminated by ":"
           > ;
OK
Time taken: 0.19 seconds
hive (mydb)> show tables;
OK
tab_name
t2
Time taken: 0.044 seconds, Fetched: 1 row(s)
hive (mydb)> desc formatted t2;
OK
col_name        data_type       comment
# col_name              data_type               comment             
                 
id                      int                                         
name                    string                                      
hobby                   array<string>                               
addr                    map<string,string>                          
                 
# Detailed Table Information             
Database:               mydb                     
Owner:                  root                     
CreateTime:             Wed Sep 22 17:12:16 CST 2021     
LastAccessTime:         UNKNOWN                  
Retention:              0                        
Location:               hdfs://node01:9000/user/hive/warehouse/mydb.db/t2        
Table Type:             EXTERNAL_TABLE           
Table Parameters:                
        COLUMN_STATS_ACCURATE   {\"BASIC_STATS\":\"true\"}
        EXTERNAL                TRUE                
        numFiles                0                   
        numRows                 0                   
        rawDataSize             0                   
        totalSize               0                   
        transient_lastDdlTime   1632301936          
                 
# Storage Information            
SerDe Library:          org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe       
InputFormat:            org.apache.hadoop.mapred.TextInputFormat         
OutputFormat:           org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat       
Compressed:             No                       
Num Buckets:            -1                       
Bucket Columns:         []                       
Sort Columns:           []                       
Storage Desc Params:             
        colelction.delim        ,                   
        field.delim             ;                   
        mapkey.delim            :                   
        serialization.format    ;                   
Time taken: 0.075 seconds, Fetched: 37 row(s)
hive (mydb)> load data local inpath "/home/hadoop/data/t1.dat" into table t1;
FAILED: SemanticException [Error 10001]: Line 1:61 Table not found 't1'
hive (mydb)> load data local inpath "/home/hadoop/data/t1.dat" into table t2;
Loading data to table mydb.t2
OK
Time taken: 0.463 seconds
hive (mydb)> dfs -ls /user/hive/warehouse/mydb.db/t2;
Found 1 items
-rwxrwxrwx   3 root supergroup        148 2021-09-22 17:15 /user/hive/warehouse/mydb.db/t2/t1.dat
hive (mydb)> dfs -cat /user/hive/warehouse/mydb.db/t2/t1.dat;
2;zhangsan;book,TV,code;beijing:chaoyang,shagnhai:pudong
3;lishi;book,code;nanjing:jiangning,taiwan:taibei
4;wangwu;music,book;heilongjiang:haerbin
hive (mydb)> select * from t2;
OK
t2.id   t2.name t2.hobby        t2.addr
2       zhangsan        ["book","TV","code"]    {"beijing":"chaoyang","shagnhai":"pudong"}
3       lishi   ["book","code"] {"nanjing":"jiangning","taiwan":"taibei"}
4       wangwu  ["music","book"]        {"heilongjiang":"haerbin"}
Time taken: 0.192 seconds, Fetched: 3 row(s)
hive (mydb)> drop table t2;
OK
Time taken: 0.239 seconds
hive (mydb)> show tables;
OK
tab_name
Time taken: 0.022 seconds
hive (mydb)> dfs -ls /user/hive/warehouse/mydb.db/t2;
Found 1 items
-rwxrwxrwx   3 root supergroup        148 2021-09-22 17:15 /user/hive/warehouse/mydb.db/t2/t1.dat
hive (mydb)> dfs -cat /user/hive/warehouse/mydb.db/t2/t1.dat;
2;zhangsan;book,TV,code;beijing:chaoyang,shagnhai:pudong
3;lishi;book,code;nanjing:jiangning,taiwan:taibei
4;wangwu;music,book;heilongjiang:haerbin
hive (mydb)>

可以看到，創建了外部表后，查看表的詳細資訊時，表型別是EXTERNAL_TABLE；

同時在洗掉表之后，只是洗掉了表定義，并沒有洗掉表的資料，

內部表和外部表之間還可以進行轉換，

使用如下：

hive (mydb)> create table t1(
           > id int,
           > name string,
           > hobby array<string>,
           > addr map<string, string>
           > )
           > row format delimited
           > fields terminated by ";"
           > collection items terminated by ","
           > map keys terminated by ":"
           > ;
OK
Time taken: 0.111 seconds
hive (mydb)> desc formatted t1;
OK
col_name        data_type       comment
# col_name              data_type               comment             
                 
id                      int                                         
name                    string                                      
hobby                   array<string>                               
addr                    map<string,string>                          
                 
# Detailed Table Information             
Database:               mydb                     
Owner:                  root                     
CreateTime:             Wed Sep 22 17:25:55 CST 2021     
LastAccessTime:         UNKNOWN                  
Retention:              0                        
Location:               hdfs://node01:9000/user/hive/warehouse/mydb.db/t1        
Table Type:             MANAGED_TABLE            
Table Parameters:                
...                  
Time taken: 0.075 seconds, Fetched: 36 row(s)
hive (mydb)> alter table t1 set tblproperties("EXTERNAL"="TRUE");
OK
Time taken: 0.176 seconds
hive (mydb)> desc formatted t1;
OK
col_name        data_type       comment
# col_name              data_type               comment             
                 
id                      int                                         
name                    string                                      
hobby                   array<string>                               
addr                    map<string,string>                          
                 
# Detailed Table Information             
Database:               mydb                     
Owner:                  root                     
CreateTime:             Wed Sep 22 17:25:55 CST 2021     
LastAccessTime:         UNKNOWN                  
Retention:              0                        
Location:               hdfs://node01:9000/user/hive/warehouse/mydb.db/t1        
Table Type:             EXTERNAL_TABLE           
Table Parameters:                
        COLUMN_STATS_ACCURATE   {\"BASIC_STATS\":\"true\"}
        EXTERNAL                TRUE                
        last_modified_by        root                
        last_modified_time      1632302837          
        numFiles                0                   
        numRows                 0                   
        rawDataSize             0                   
        totalSize               0                   
        transient_lastDdlTime   1632302837          
                 
# Storage Information            
...                
Time taken: 0.062 seconds, Fetched: 39 row(s)
hive (mydb)> alter table t1 set tblproperties("EXTERNAL"="FALSE");
OK
Time taken: 0.111 seconds
hive (mydb)> desc formatted t1;
OK
col_name        data_type       comment
# col_name              data_type               comment             
                 
id                      int                                         
name                    string                                      
hobby                   array<string>                               
addr                    map<string,string>                          
                 
# Detailed Table Information             
Database:               mydb                     
Owner:                  root                     
CreateTime:             Wed Sep 22 17:25:55 CST 2021     
LastAccessTime:         UNKNOWN                  
Retention:              0                        
Location:               hdfs://node01:9000/user/hive/warehouse/mydb.db/t1        
Table Type:             MANAGED_TABLE            
Table Parameters:                
        COLUMN_STATS_ACCURATE   {\"BASIC_STATS\":\"true\"}
        EXTERNAL                FALSE               
        last_modified_by        root                
        last_modified_time      1632302857          
        numFiles                0                   
        numRows                 0                   
        rawDataSize             0                   
        totalSize               0                   
        transient_lastDdlTime   1632302857          
                 
# Storage Information            
...               
Time taken: 0.072 seconds, Fetched: 39 row(s)
hive (mydb)> drop table t1;
OK
Time taken: 0.114 seconds
hive (mydb)> show tables;
OK
tab_name
Time taken: 0.033 seconds
hive (mydb)> dfs -ls /user/hive/warehouse/mydb.db/;
Found 1 items
drwxrwxrwx   - root supergroup          0 2021-09-22 17:15 /user/hive/warehouse/mydb.db/t2
hive (mydb)>

可以看到，實作了兩種表型別之間的轉換，

綜上，想保留外部表時使用外部表，并且生產中多用外部表，

（4）磁區表

Hive在執行查詢時，一般會掃描整個表的資料，由于表的資料量大，全表掃描消耗時間長、效率低，

而有時候，查詢只需要掃描表中的一部分資料即可，Hive引入了磁區表的概念，將表的資料存盤在不同的子目錄中，每一個子目錄對應一個磁區，只查詢部分磁區資料時，可避免全表掃描，提高查詢效率，

在實際中，通常根據時間、地區等資訊進行磁區，

現在使用如下：

先進行磁區表創建與資料加載：

-- 創建表
hive (mydb)> create table t3(
           > id int,
           > name string,
           > hobby array<string>,
           > addr map<string, string>
           > )
           > partitioned by (dt string)
           > row format delimited
           > fields terminated by ";"
           > collection items terminated by ","
           > map keys terminated by ":"
           > ;
OK
Time taken: 0.702 seconds
hive (mydb)> desc formatted t3;
OK
col_name        data_type       comment
# col_name              data_type               comment             
                 
id                      int                                         
name                    string                                      
hobby                   array<string>                               
addr                    map<string,string>                          
                 
# Partition Information          
# col_name              data_type               comment             
                 
dt                      string                                      
                 
# Detailed Table Information             
Database:               mydb                     
Owner:                  root                     
CreateTime:             Thu Sep 23 02:46:32 CST 2021     
LastAccessTime:         UNKNOWN                  
Retention:              0                        
Location:               hdfs://node01:9000/user/hive/warehouse/mydb.db/t3        
Table Type:             MANAGED_TABLE            
Table Parameters:                
        COLUMN_STATS_ACCURATE   {\"BASIC_STATS\":\"true\"}
        numFiles                0                   
        numPartitions           0                   
        numRows                 0                   
        rawDataSize             0                   
        totalSize               0                   
        transient_lastDdlTime   1632336392          
                 
# Storage Information            
SerDe Library:          org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe       
InputFormat:            org.apache.hadoop.mapred.TextInputFormat         
OutputFormat:           org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat       
Compressed:             No                       
Num Buckets:            -1                       
Bucket Columns:         []                       
Sort Columns:           []                       
Storage Desc Params:             
        colelction.delim        ,                   
        field.delim             ;                   
        mapkey.delim            :                   
        serialization.format    ;                   
Time taken: 0.58 seconds, Fetched: 42 row(s)
-- 加載資料
hive (mydb)> load data local inpath "/home/hadoop/data/t1.dat" into table t3 partition(dt="2021-09-22");
Loading data to table mydb.t3 partition (dt=2021-09-22)
OK
Time taken: 1.841 seconds
hive (mydb)> load data local inpath "/home/hadoop/data/t1.dat" into table t3 partition(dt="2021-09-23");
Loading data to table mydb.t3 partition (dt=2021-09-23)
OK
Time taken: 0.63 seconds
hive (mydb)> select * from t3;
OK
t3.id   t3.name t3.hobby        t3.addr t3.dt
2       zhangsan        ["book","TV","code"]    {"beijing":"chaoyang","shagnhai":"pudong"}      2021-09-22
3       lishi   ["book","code"] {"nanjing":"jiangning","taiwan":"taibei"}       2021-09-22
4       wangwu  ["music","book"]        {"heilongjiang":"haerbin"}      2021-09-22
2       zhangsan        ["book","TV","code"]    {"beijing":"chaoyang","shagnhai":"pudong"}      2021-09-23
3       lishi   ["book","code"] {"nanjing":"jiangning","taiwan":"taibei"}       2021-09-23
4       wangwu  ["music","book"]        {"heilongjiang":"haerbin"}      2021-09-23
Time taken: 3.712 seconds, Fetched: 6 row(s)
hive (mydb)> dfs -ls /user/hive/warehouse/mydb.db/t3;
Found 2 items
drwxrwxrwx   - root supergroup          0 2021-09-23 02:47 /user/hive/warehouse/mydb.db/t3/dt=2021-09-22
drwxrwxrwx   - root supergroup          0 2021-09-23 02:47 /user/hive/warehouse/mydb.db/t3/dt=2021-09-23
hive (mydb)> dfs -ls /user/hive/warehouse/mydb.db/t3/dt=2021-09-22;
Found 1 items
-rwxrwxrwx   3 root supergroup        148 2021-09-23 02:47 /user/hive/warehouse/mydb.db/t3/dt=2021-09-22/t1.dat
hive (mydb)> dfs -ls /user/hive/warehouse/mydb.db/t3/dt=2021-09-23;
Found 1 items
-rwxrwxrwx   3 root supergroup        148 2021-09-23 02:47 /user/hive/warehouse/mydb.db/t3/dt=2021-09-23/t1.dat
hive (mydb)> show partitions t3;
OK
partition
dt=2021-09-22
dt=2021-09-23
Time taken: 0.132 seconds, Fetched: 2 row(s)
-- 新增磁區
hive (mydb)> alter table t3
           > add partition(dt="2021-09-24");
OK
Time taken: 0.274 seconds
hive (mydb)> alter table t3
           > add partition(dt="2021-09-25");
OK
Time taken: 0.186 seconds
hive (mydb)> show partitions t3;
OK
partition
dt=2021-09-22
dt=2021-09-23
dt=2021-09-24
dt=2021-09-25
Time taken: 0.107 seconds, Fetched: 4 row(s)

可以看到，如果表設定了磁區，會在表的詳細資訊中展示出來；

查詢資料時也顯示了每條記錄的磁區，但是磁區欄位不是表中已經存在的資料，可以將磁區欄位看成偽列，

再指定資料路徑新增磁區，如下：

-- 準備資料
hive (mydb)> dfs -cp /user/hive/warehouse/mydb.db/t3/dt=2021-09-22 /user/hive/warehouse/mydb.db/t3/dt=2021-09-26;
hive (mydb)> dfs -cp /user/hive/warehouse/mydb.db/t3/dt=2021-09-22 /user/hive/warehouse/mydb.db/t3/dt=2021-09-27;
hive (mydb)> dfs -ls /user/hive/warehouse/mydb.db/t3/;
Found 6 items
drwxrwxrwx   - root supergroup          0 2021-09-23 02:47 /user/hive/warehouse/mydb.db/t3/dt=2021-09-22
drwxrwxrwx   - root supergroup          0 2021-09-23 02:47 /user/hive/warehouse/mydb.db/t3/dt=2021-09-23
drwxrwxrwx   - root supergroup          0 2021-09-23 02:50 /user/hive/warehouse/mydb.db/t3/dt=2021-09-24
drwxrwxrwx   - root supergroup          0 2021-09-23 02:51 /user/hive/warehouse/mydb.db/t3/dt=2021-09-25
drwxr-xr-x   - root supergroup          0 2021-09-23 02:53 /user/hive/warehouse/mydb.db/t3/dt=2021-09-26
drwxr-xr-x   - root supergroup          0 2021-09-23 02:53 /user/hive/warehouse/mydb.db/t3/dt=2021-09-27
-- 指定路徑設定磁區
hive (mydb)> alter table t3
           > add partition(dt="2021-09-26") location '/user/hive/warehouse/mydb.db/t3/dt=2021-09-26';
OK
Time taken: 0.138 seconds
hive (mydb)> select * from t3;
OK
t3.id   t3.name t3.hobby        t3.addr t3.dt
2       zhangsan        ["book","TV","code"]    {"beijing":"chaoyang","shagnhai":"pudong"}      2021-09-22
3       lishi   ["book","code"] {"nanjing":"jiangning","taiwan":"taibei"}       2021-09-22
4       wangwu  ["music","book"]        {"heilongjiang":"haerbin"}      2021-09-22
2       zhangsan        ["book","TV","code"]    {"beijing":"chaoyang","shagnhai":"pudong"}      2021-09-23
3       lishi   ["book","code"] {"nanjing":"jiangning","taiwan":"taibei"}       2021-09-23
4       wangwu  ["music","book"]        {"heilongjiang":"haerbin"}      2021-09-23
2       zhangsan        ["book","TV","code"]    {"beijing":"chaoyang","shagnhai":"pudong"}      2021-09-26
3       lishi   ["book","code"] {"nanjing":"jiangning","taiwan":"taibei"}       2021-09-26
4       wangwu  ["music","book"]        {"heilongjiang":"haerbin"}      2021-09-26
Time taken: 0.323 seconds, Fetched: 9 row(s)
-- 修改磁區的HDFS路徑
hive (mydb)> alter table t3 partition(dt="2021-09-26") set location "/user/hive/warehouse/mydb.db/t3/dt=2021-09-27";
OK
Time taken: 0.269 seconds
hive (mydb)> select * from t3;
OK
t3.id   t3.name t3.hobby        t3.addr t3.dt
2       zhangsan        ["book","TV","code"]    {"beijing":"chaoyang","shagnhai":"pudong"}      2021-09-22
3       lishi   ["book","code"] {"nanjing":"jiangning","taiwan":"taibei"}       2021-09-22
4       wangwu  ["music","book"]        {"heilongjiang":"haerbin"}      2021-09-22
2       zhangsan        ["book","TV","code"]    {"beijing":"chaoyang","shagnhai":"pudong"}      2021-09-23
3       lishi   ["book","code"] {"nanjing":"jiangning","taiwan":"taibei"}       2021-09-23
4       wangwu  ["music","book"]        {"heilongjiang":"haerbin"}      2021-09-23
2       zhangsan        ["book","TV","code"]    {"beijing":"chaoyang","shagnhai":"pudong"}      2021-09-26
3       lishi   ["book","code"] {"nanjing":"jiangning","taiwan":"taibei"}       2021-09-26
4       wangwu  ["music","book"]        {"heilongjiang":"haerbin"}      2021-09-26
Time taken: 0.263 seconds, Fetched: 9 row(s)
hive (mydb)> show partitions t3;
OK
partition
dt=2021-09-22
dt=2021-09-23
dt=2021-09-24
dt=2021-09-25
dt=2021-09-26
Time taken: 0.133 seconds, Fetched: 5 row(s)
-- 洗掉磁區
hive (mydb)> alter table t3 drop partition(dt="2021-09-24");
Dropped the partition dt=2021-09-24
OK
Time taken: 0.57 seconds
hive (mydb)> alter table t3 drop partition(dt="2021-09-25"), partition(dt="2021-09-26");
Dropped the partition dt=2021-09-25
Dropped the partition dt=2021-09-26
OK
Time taken: 0.273 seconds
hive (mydb)> show partitions t3;
OK
partition
dt=2021-09-22
dt=2021-09-23
Time taken: 0.079 seconds, Fetched: 2 row(s)
hive (mydb)>

洗掉多個磁區時，用逗號隔開，

（5）分桶表

當單個的磁區或者表的資料量過大，磁區不能更細粒度的劃分資料，就需要使用分桶技術將資料劃分成更細的粒度，將資料按照指定的欄位進行分成多個桶中去，即將資料按照欄位進行劃分，資料按照欄位劃分到多個檔案當中去，

Hive中分桶的原理是分桶欄位.hashCode % 分桶個數，這與MapReduce中Shuffle時磁區的規則是類似的，即key.hashCode % reductTask，

使用如下：

-- 創建分桶表
hive (mydb)> create table course(
           > id int,
           > name string,
           > score int
           > )
           > clustered by (id) into 3 buckets
           > row format delimited fields terminated by "\t";
OK
Time taken: 0.105 seconds
-- 創建普通表
hive (mydb)> create table course_common(
           > id int,
           > name string,
           > score int
           > )
           > row format delimited fields terminated by "\t";
OK
Time taken: 0.133 seconds
-- 普通表加載資料
hive (mydb)> load data local inpath "/home/hadoop/data/course.dat" into table course_common;
Loading data to table mydb.course_common
OK
Time taken: 0.55 seconds
hive (mydb)> select * from course_common;
OK
course_common.id        course_common.name      course_common.score
1       java    90
1       c       78
1       python  91
1       hadoop  80
2       java    75
2       c       76
2       python  80
2       hadoop  93
3       java    98
3       c       74
3       python  89
3       hadoop  91
5       java    93
6       c       76
7       python  87
8       hadoop  88
Time taken: 0.244 seconds, Fetched: 16 row(s)
-- 給桶表加載資料
hive (mydb)> insert into table course select * from course_common;
WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
Query ID = root_20210923033603_53b34c32-c8b7-4d73-ac85-f8419ca26678
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 3
...
Stage-Stage-1: Map: 1  Reduce: 3   Cumulative CPU: 9.18 sec   HDFS Read: 16039 HDFS Write: 365 SUCCESS
Total MapReduce CPU Time Spent: 9 seconds 180 msec
OK
course_common.id        course_common.name      course_common.scorem
Time taken: 52.252 seconds
hive (mydb)> desc formatted course;
OK
col_name        data_type       comment
# col_name              data_type               comment             
                 
id                      int                                         
name                    string                                      
score                   int                                         
                 
# Detailed Table Information             
Database:               mydb                     
Owner:                  root                     
CreateTime:             Thu Sep 23 03:34:14 CST 2021     
LastAccessTime:         UNKNOWN                  
Retention:              0                        
Location:               hdfs://node01:9000/user/hive/warehouse/mydb.db/course    
Table Type:             MANAGED_TABLE            
Table Parameters:                
        COLUMN_STATS_ACCURATE   {\"BASIC_STATS\":\"true\"}
        numFiles                3                   
        numRows                 16                  
        rawDataSize             148                 
        totalSize               164                 
        transient_lastDdlTime   1632339416          
                 
# Storage Information            
SerDe Library:          org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe       
InputFormat:            org.apache.hadoop.mapred.TextInputFormat         
OutputFormat:           org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat       
Compressed:             No                       
Num Buckets:            3                        
Bucket Columns:         [id]                     
Sort Columns:           []                       
Storage Desc Params:             
        field.delim             \t                  
        serialization.format    \t                  
Time taken: 0.096 seconds, Fetched: 33 row(s)
hive (mydb)> dfs -ls /user/hive/warehouse/mydb.db/course;
Found 3 items
-rwxrwxrwx   3 root supergroup         48 2021-09-23 03:36 /user/hive/warehouse/mydb.db/course/000000_0
-rwxrwxrwx   3 root supergroup         53 2021-09-23 03:36 /user/hive/warehouse/mydb.db/course/000001_0
-rwxrwxrwx   3 root supergroup         63 2021-09-23 03:36 /user/hive/warehouse/mydb.db/course/000002_0
-- 觀察分桶資料
hive (mydb)> dfs -cat /user/hive/warehouse/mydb.db/course/000000_0;
3       hadoop  91
3       python  89
3       c       74
3       java    98
6       c       76
hive (mydb)> dfs -cat /user/hive/warehouse/mydb.db/course/000001_0;
7       python  87
1       hadoop  80
1       python  91
1       c       78
1       java    90
hive (mydb)> dfs -cat /user/hive/warehouse/mydb.db/course/000002_0;
8       hadoop  88
5       java    93
2       python  80
2       c       76
2       java    75
2       hadoop  93
hive (mydb)>

可以看到，在創建分桶表之后，也可以從表的詳細資訊獲取到分桶的資訊；

并且，不能直接向分桶表中添加資料，而需要使用insert ... select ...從普通表中匯入資料；

同時，分桶的規則是分桶欄位.hashCode % 分桶數，這里設定的分桶個數是3，所以對分桶欄位的哈希碼值進行對3求余、進入不同的桶；

從Hive 2.x開始，不需要設定引數hive.enforce.bucketing=true即可支持分桶，

（6）修改表及洗掉表

HIve修改和洗掉表的操作如下：

-- 修改表名，rename
hive (mydb)> alter table course_common rename to course1;
OK
Time taken: 0.163 seconds
-- 修改列名，change column
hive (mydb)> alter table course1
           > change column id cid int;
OK
Time taken: 0.139 seconds
-- 修改欄位型別，change column
hive (mydb)> alter table course1
           > change column cid cid string;
OK
Time taken: 0.128 seconds
-- string不能轉為int
hive (mydb)> alter table course1
           > change column cid cid int;
FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. Unable to alter table. The following columns have types incompatible with the existing columns in their respective positions :
cid
hive (mydb)> desc course1;
OK
col_name        data_type       comment
cid                     string                                      
name                    string                                      
score                   int                                         
Time taken: 0.16 seconds, Fetched: 3 row(s)
-- 增加欄位，add columns
hive (mydb)> alter table course1
           > add columns(common string);
OK
Time taken: 0.168 seconds
hive (mydb)> select * from course1;
OK
course1.cid     course1.name    course1.score  course1.common
1       java    90      NULL
1       c       78      NULL
1       python  91      NULL
1       hadoop  80      NULL
2       java    75      NULL
2       c       76      NULL
2       python  80      NULL
2       hadoop  93      NULL
3       java    98      NULL
3       c       74      NULL
3       python  89      NULL
3       hadoop  91      NULL
5       java    93      NULL
6       c       76      NULL
7       python  87      NULL
8       hadoop  88      NULL
Time taken: 1.531 seconds, Fetched: 16 row(s)
-- 洗掉欄位，replace columns
hive (mydb)> alter table course1
           > replace columns(
           > cid string, cname string, cscore int);
OK
Time taken: 0.297 seconds
hive (mydb)> desc course1;
OK
col_name        data_type       comment
cid                     string                                      
cname                   string                                      
cscore                  int                                         
Time taken: 0.133 seconds, Fetched: 3 row(s)
-- 洗掉表
hive (mydb)> drop table course1;
OK
Time taken: 0.157 seconds
hive (mydb)> show tables;
OK
tab_name
course
t3
Time taken: 0.071 seconds, Fetched: 2 row(s)
hive (mydb)>

需要注意，修改欄位資料型別時，要滿足資料型別轉換的要求，如int可以轉為string，但是string不能轉為int；

洗掉欄位使用replace columns，僅僅只是在元資料中洗掉了欄位，并沒有改動HDFS上的資料檔案，

可以對Hive DDL總結如下：

主要操作物件是資料庫和表 ；

表的分類：

表型別	特點
內部表	洗掉表時，同時洗掉元資料和表資料
外部表	洗掉表時，僅洗掉元資料，保留表中資料；生產環境多使用外部表
磁區表	按照磁區欄位將表中的資料放置在不同的目錄中，提高SQL查詢的性能
分桶表	按照分桶欄位，將表中資料分開；分桶規則是`分桶欄位.hashCode % 分桶數`

主要命令包括create、alter 、drop，

5.HQL操作之資料操作

（1）load裝載資料

資料匯入有4種方式：

裝載資料（load）
插入資料（insert）
創建表并插入資料（as select）
使用import匯入資料

裝載資料（load）的基本語法如下：

LOAD DATA [LOCAL] INPATH 'filepath'
[OVERWRITE] INTO TABLE tablename [PARTITION (partcol1=val1,
partcol2=val2 ...)]

其中，各部分的含義如下：

關鍵字	含義
LOCAL	LOAD DATA LOCAL …：從本地檔案系統加載資料到Hive表中，本地檔案會拷貝到Hive表指定的位置； LOAD DATA …：從HDFS加載資料到Hive表中，HDFS檔案移動到Hive表指定的位置
INPATH	加載資料的路徑
OVERWRITE	覆寫表中已有資料；否則表示追加資料
PARTITION	將資料加載到指定的磁區INPATH

現在進行測驗，先進行準備作業，準備資料檔案vim /home/hadoop/data/sourceA.txt，內容如下：

1,fish1,SZ
2,fish2,SH
3,fish3,HZ
4,fish4,QD
5,fish5,SR

再將其上傳到HDFS中，如下：

[root@node03 ~]$ hdfs dfs -mkdir -p /user/hadoop/data/
[root@node03 ~]$ hdfs dfs -put /home/hadoop/data/sourceA.txt /user/hadoop/data/
# Hive中創建資料檔案
[root@node03 ~]$ hdfs dfs -mkdir /user/hive/tabB;
[root@node03 ~]$ hdfs dfs -put /home/hadoop/data/sourceA.txt /user/hive/tabB

Hive中操作如下：

-- 創建表
hive (mydb)> CREATE TABLE tabA ( 
           > id int
           > ,name string
           > ,area string 
           > ) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
OK
Time taken: 0.085 seconds
-- 加載本地檔案到Hive
hive (mydb)> load data local inpath '/home/hadoop/data/sourceA.txt'
           > into table tabA;
Loading data to table mydb.taba
OK
Time taken: 0.31 seconds
-- 檢查本地檔案，仍然存在
hive (mydb)> select * from tabA;
OK
taba.id taba.name       taba.area
1       fish1   SZ
2       fish2   SH
3       fish3   HZ
4       fish4   QD
5       fish5   SR
Time taken: 0.259 seconds, Fetched: 5 row(s)
-- 加載HDFS檔案到Hive
hive (mydb)> load data inpath '/user/hadoop/data/sourceA.txt'
           > into table tabA;
Loading data to table mydb.taba
OK
Time taken: 0.449 seconds
-- 檢查HDFS檔案系統，檔案已經不存在
hive (mydb)> select * from tabA;
OK
taba.id taba.name       taba.area
1       fish1   SZ
2       fish2   SH
3       fish3   HZ
4       fish4   QD
5       fish5   SR
1       fish1   SZ
2       fish2   SH
3       fish3   HZ
4       fish4   QD
5       fish5   SR
Time taken: 0.25 seconds, Fetched: 10 row(s)
-- 加載資料覆寫表中已有資料
hive (mydb)> load data local inpath '/home/hadoop/data/sourceA.txt'
           > overwrite into table tabA;
Loading data to table mydb.taba
OK
Time taken: 0.362 seconds
hive (mydb)> select * from tabA;
OK
taba.id taba.name       taba.area
1       fish1   SZ
2       fish2   SH
3       fish3   HZ
4       fish4   QD
5       fish5   SR
Time taken: 0.148 seconds, Fetched: 5 row(s)
-- 創建表時加載資料
hive (mydb)> CREATE TABLE tabB ( 
           > id int,
           > name string,
           > area string 
           > ) ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
           > location '/user/hive/tabB';
OK
Time taken: 0.129 seconds
hive (mydb)> select * from tabB;
OK
tabb.id tabb.name       tabb.area
1       fish1   SZ
2       fish2   SH
3       fish3   HZ
4       fish4   QD
5       fish5   SR
Time taken: 0.182 seconds, Fetched: 5 row(s)
hive (mydb)>

再查看本地，如下：

[root@node03 ~]$ ls /home/hadoop/data/
course.dat  s1.dat  sourceA.txt  t1.dat
[root@node03 ~]$ hdfs dfs -ls /user/hadoop/data
[root@node03 ~]$

可以看到，從檔案中加載資料后，本地檔案還在，但是HDFS檔案已經不存在于原路徑下；

使用overwrite加載資料時，資料表中原來的資料都會被清空，

（2）insert插入資料

insert插入資料，使用如下：

-- 創建磁區表
hive (mydb)> create table tabC(
           > id int, name string, area string)
           > partitioned by(month string);
OK
Time taken: 0.524 seconds
hive (mydb)> desc tabC;
OK
col_name        data_type       comment
id                      int                                         
name                    string                                      
area                    string                                      
month                   string                                      
                 
# Partition Information          
# col_name              data_type               comment             
                 
month                   string                                      
Time taken: 0.733 seconds, Fetched: 9 row(s)
-- 插入單條資料
hive (mydb)> insert into tabC
           > partition(month="202109")
           > values(1, "Corley", "Beijing");
Automatically selecting local only mode for query
...
Total MapReduce CPU Time Spent: 0 msec
OK
_col0   _col1   _col2
Time taken: 6.384 seconds
hive (mydb)> select * from tabC;
OK
tabc.id tabc.name       tabc.area       tabc.month
1       Corley  Beijing 202109
Time taken: 0.263 seconds, Fetched: 1 row(s)
-- 插入多條資料
hive (mydb)> insert into tabC
           > partition(month="202109")
           > values(2, "Jack", "Tianjin"), (3, "Bob", "Shanghai");
Automatically selecting local only mode for query
...
Total MapReduce CPU Time Spent: 0 msec
OK
_col0   _col1   _col2
Time taken: 2.749 seconds
hive (mydb)> select * from tabC;
OK
tabc.id tabc.name       tabc.area       tabc.month
1       Corley  Beijing 202109
2       Jack    Tianjin 202109
3       Bob     Shanghai        202109
Time taken: 0.362 seconds, Fetched: 3 row(s)
-- 插入查詢的結果資料
hive (mydb)> insert into tabC
           > partition(month="202110")
           > select id, name, area from tabC;
Automatically selecting local only mode for query
...
Total MapReduce CPU Time Spent: 0 msec
OK
id      name    area
Time taken: 2.727 seconds
hive (mydb)> select * from tabC;
OK
tabc.id tabc.name       tabc.area       tabc.month
1       Corley  Beijing 202109
2       Jack    Tianjin 202109
3       Bob     Shanghai        202109
1       Corley  Beijing 202110
2       Jack    Tianjin 202110
3       Bob     Shanghai        202110
Time taken: 0.235 seconds, Fetched: 6 row(s)
-- 多表（多磁區）插入模式
hive (mydb)> from tabC
           > insert overwrite table tabC partition(month="202111")
           > select id, name ,area where month="202109"
           > insert overwrite table tabC partition(month="202112")
           > select id, name ,area where month="202109" or month="202110";
Automatically selecting local only mode for query
...
Total MapReduce CPU Time Spent: 0 msec
OK
id      name    area
Time taken: 3.771 seconds
hive (mydb)> select * from tabC;
OK
tabc.id tabc.name       tabc.area       tabc.month
1       Corley  Beijing 202109
2       Jack    Tianjin 202109
3       Bob     Shanghai        202109
1       Corley  Beijing 202110
2       Jack    Tianjin 202110
3       Bob     Shanghai        202110
1       Corley  Beijing 202111
2       Jack    Tianjin 202111
3       Bob     Shanghai        202111
1       Corley  Beijing 202112
2       Jack    Tianjin 202112
3       Bob     Shanghai        202112
1       Corley  Beijing 202112
2       Jack    Tianjin 202112
3       Bob     Shanghai        202112
Time taken: 0.244 seconds, Fetched: 15 row(s)
hive (mydb)>

insert插入資料有3種方式：

手動插入單挑或多條資料
使用查詢結果資料作為插入資料
多表（多磁區）插入模式

還可以使用as select在創建表時插入資料：

-- 根據查詢結果創建表
hive (mydb)> create table if not exists tabD
           > as select * from tabC;
WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
...
Total MapReduce CPU Time Spent: 1 seconds 590 msec
OK
tabc.id tabc.name       tabc.area       tabc.month
Time taken: 37.142 seconds
hive (mydb)> select * from tabD;
OK
tabd.id tabd.name       tabd.area       tabd.month
1       Corley  Beijing 202109
2       Jack    Tianjin 202109
3       Bob     Shanghai        202109
1       Corley  Beijing 202110
2       Jack    Tianjin 202110
3       Bob     Shanghai        202110
1       Corley  Beijing 202111
2       Jack    Tianjin 202111
3       Bob     Shanghai        202111
1       Corley  Beijing 202112
2       Jack    Tianjin 202112
3       Bob     Shanghai        202112
1       Corley  Beijing 202112
2       Jack    Tianjin 202112
3       Bob     Shanghai        202112
Time taken: 0.163 seconds, Fetched: 15 row(s)
hive (mydb)> desc tabD;
OK
col_name        data_type       comment
id                      int                                         
name                    string                                      
area                    string                                      
month                   string                                      
Time taken: 0.049 seconds, Fetched: 4 row(s)
hive (mydb)>

在創建表時使用查詢結果作為插入的資料時，沒有將磁區資訊復制過來，只是復制普通的欄位資料，所以表tabD資料中沒有磁區資訊，

先使用insert overwrite匯出資料：

-- 1.將查詢結果匯出到本地
hive (mydb)> insert overwrite local directory '/home/hadoop/data/tabC'
           > select * from tabC;
WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
...
Total MapReduce CPU Time Spent: 1 seconds 740 msec
OK
tabc.id tabc.name       tabc.area       tabc.month
Time taken: 33.941 seconds
-- 2.將查詢結果格式化輸出到本地
hive (mydb)> insert overwrite local directory '/home/hadoop/data/tabC'
           > row format delimited fields terminated by ' '
           > select * from tabC;
WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
...
Total MapReduce CPU Time Spent: 1 seconds 610 msec
OK
tabc.id tabc.name       tabc.area       tabc.month
Time taken: 28.068 seconds
-- 3.將查詢結果匯出到HDFS
hive (mydb)> insert overwrite directory '/user/hadoop/data/tabC'
           > row format delimited fields terminated by ' '
           > select * from tabC;
WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
...
Total MapReduce CPU Time Spent: 1 seconds 140 msec
OK
tabc.id tabc.name       tabc.area       tabc.month
Time taken: 20.725 seconds
hive (mydb)>

3種匯出方式對應的檔案系統查看如下：

[root@node03 ~]$ ll -ht /home/hadoop/data/
總用量 16K
drwxr-xr-x 2 root root  43 9月  24 22:25 tabC
-rw-r--r-- 1 root root  55 9月  24 17:18 sourceA.txt
-rw-r--r-- 1 root root 164 9月  23 03:32 course.dat
-rw-r--r-- 1 root root 148 9月  22 16:58 t1.dat
-rw-r--r-- 1 root root  84 9月  21 13:59 s1.dat
[root@node03 ~]$ cat -A /home/hadoop/data/tabC/000000_0 
1^ACorley^ABeijing^A202109$
2^AJack^ATianjin^A202109$
3^ABob^AShanghai^A202109$
1^ACorley^ABeijing^A202110$
2^AJack^ATianjin^A202110$
3^ABob^AShanghai^A202110$
1^ACorley^ABeijing^A202111$
2^AJack^ATianjin^A202111$
3^ABob^AShanghai^A202111$
1^ACorley^ABeijing^A202112$
2^AJack^ATianjin^A202112$
3^ABob^AShanghai^A202112$
1^ACorley^ABeijing^A202112$
2^AJack^ATianjin^A202112$
3^ABob^AShanghai^A202112$
[root@node03 ~]$ cat  /home/hadoop/data/tabC/000000_0 
1 Corley Beijing 202109
2 Jack Tianjin 202109
3 Bob Shanghai 202109
1 Corley Beijing 202110
2 Jack Tianjin 202110
3 Bob Shanghai 202110
1 Corley Beijing 202111
2 Jack Tianjin 202111
3 Bob Shanghai 202111
1 Corley Beijing 202112
2 Jack Tianjin 202112
3 Bob Shanghai 202112
1 Corley Beijing 202112
2 Jack Tianjin 202112
3 Bob Shanghai 202112
[root@node03 ~]$ hdfs dfs -ls /user/hadoop/data
Found 1 items
drwxr-xr-x   - root supergroup          0 2021-09-24 22:29 /user/hadoop/data/tabC
[root@node03 ~]$ hdfs dfs -cat /user/hadoop/data/tabC/000000_0
1 Corley Beijing 202109
2 Jack Tianjin 202109
3 Bob Shanghai 202109
1 Corley Beijing 202110
2 Jack Tianjin 202110
3 Bob Shanghai 202110
1 Corley Beijing 202111
2 Jack Tianjin 202111
3 Bob Shanghai 202111
1 Corley Beijing 202112
2 Jack Tianjin 202112
3 Bob Shanghai 202112
1 Corley Beijing 202112
2 Jack Tianjin 202112
3 Bob Shanghai 202112
[root@node03 ~]$

再使用DFS命令匯出資料到本地：

hive (mydb)> dfs -ls /user/hive/warehouse/mydb.db/tabc;
Found 4 items
drwxrwxrwx   - root supergroup          0 2021-09-24 21:53 /user/hive/warehouse/mydb.db/tabc/month=202109
drwxrwxrwx   - root supergroup          0 2021-09-24 21:54 /user/hive/warehouse/mydb.db/tabc/month=202110
drwxrwxrwx   - root supergroup          0 2021-09-24 21:59 /user/hive/warehouse/mydb.db/tabc/month=202111
drwxrwxrwx   - root supergroup          0 2021-09-24 21:59 /user/hive/warehouse/mydb.db/tabc/month=202112
hive (mydb)> dfs -get /user/hive/warehouse/mydb.db/tabc/month=202109 /home/hadoop/data/tabC;
hive (mydb)>

再查看本地，如下：

[root@node03 ~]$ ll -ht /home/hadoop/data/
總用量 16K
drwxr-xr-x 3 root root  63 9月  24 22:40 tabC
-rw-r--r-- 1 root root  55 9月  24 17:18 sourceA.txt
-rw-r--r-- 1 root root 164 9月  23 03:32 course.dat
-rw-r--r-- 1 root root 148 9月  22 16:58 t1.dat
-rw-r--r-- 1 root root  84 9月  21 13:59 s1.dat
[root@node03 ~]$ cat -A /home/hadoop/data/tabC/month\=202109/000000_0
1^ACorley^ABeijing$

這種方式的本質是進行資料檔案的拷貝，

也可以在本地執行hive命令匯出資料到本地，如下：

[root@node03 ~]$ hive -e "select * from mydb.tabC" > tabc.dat
which: no hbase in (/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/opt/software/java/jdk1.8.0_231/bin:/opt/software/hadoop-2.9.2/bin:/opt/software/hadoop-2.9.2/sbin:/opt/software/hive-2.3.7/bin:/root/bin)

Logging initialized using configuration in file:/opt/software/hive-2.3.7/conf/hive-log4j2.properties Async: true
OK
Time taken: 11.372 seconds, Fetched: 15 row(s)
[root@node03 ~]$ cat tabc.dat 
tabc.id tabc.name       tabc.area       tabc.month
1       Corley  Beijing 202109
2       Jack    Tianjin 202109
3       Bob     Shanghai        202109
1       Corley  Beijing 202110
2       Jack    Tianjin 202110
3       Bob     Shanghai        202110
1       Corley  Beijing 202111
2       Jack    Tianjin 202111
3       Bob     Shanghai        202111
1       Corley  Beijing 202112
2       Jack    Tianjin 202112
3       Bob     Shanghai        202112
1       Corley  Beijing 202112
2       Jack    Tianjin 202112
3       Bob     Shanghai        202112

本質是執行查詢并將查詢結果重定向到檔案，

Hive有專門的匯出命令expert，匯出資料到HDFS，如下：

hive (mydb)> export table tabC to '/user/hadoop/data/tabC2';
Copying data from file:/tmp/root/88c65aff-1880-43e6-a9c0-452b1b0f63a8/hive_2021-09-24_22-50-40_489_1035737842386809043-1/-local-10000/_metadata
Copying file: file:/tmp/root/88c65aff-1880-43e6-a9c0-452b1b0f63a8/hive_2021-09-24_22-50-40_489_1035737842386809043-1/-local-10000/_metadata
Copying data from hdfs://node01:9000/user/hive/warehouse/mydb.db/tabc/month=202109
Copying file: hdfs://node01:9000/user/hive/warehouse/mydb.db/tabc/month=202109/000000_0
Copying file: hdfs://node01:9000/user/hive/warehouse/mydb.db/tabc/month=202109/000000_0_copy_1
Copying data from hdfs://node01:9000/user/hive/warehouse/mydb.db/tabc/month=202110
Copying file: hdfs://node01:9000/user/hive/warehouse/mydb.db/tabc/month=202110/000000_0
Copying data from hdfs://node01:9000/user/hive/warehouse/mydb.db/tabc/month=202111
Copying file: hdfs://node01:9000/user/hive/warehouse/mydb.db/tabc/month=202111/000000_0
Copying data from hdfs://node01:9000/user/hive/warehouse/mydb.db/tabc/month=202112
Copying file: hdfs://node01:9000/user/hive/warehouse/mydb.db/tabc/month=202112/000000_0
OK
Time taken: 1.242 seconds

本地查看如下：

[root@node03 ~]$ hdfs dfs -ls /user/hadoop/data/tabC2
Found 5 items
-rwxr-xr-x   3 root supergroup       6086 2021-09-24 22:50 /user/hadoop/data/tabC2/_metadata
drwxr-xr-x   - root supergroup          0 2021-09-24 22:50 /user/hadoop/data/tabC2/month=202109
drwxr-xr-x   - root supergroup          0 2021-09-24 22:50 /user/hadoop/data/tabC2/month=202110
drwxr-xr-x   - root supergroup          0 2021-09-24 22:50 /user/hadoop/data/tabC2/month=202111
drwxr-xr-x   - root supergroup          0 2021-09-24 22:50 /user/hadoop/data/tabC2/month=202112

可以看到，使用export匯出資料時，不僅有資料，還有表的元資料資訊，

export匯出的資料，可以使用import命令匯入到Hive表中，如下：

hive (mydb)> create table tabE like tabC;
OK
Time taken: 0.429 seconds
hive (mydb)> desc tabE;
OK
col_name        data_type       comment
id                      int                                         
name                    string                                      
area                    string                                      
month                   string                                      
                 
# Partition Information          
# col_name              data_type               comment             
                 
month                   string                                      
Time taken: 0.444 seconds, Fetched: 9 row(s)
hive (mydb)> select * from tabE;
OK
tabe.id tabe.name       tabe.area       tabe.month
Time taken: 2.391 seconds
hive (mydb)> import table tabE from '/user/hadoop/data/tabC2';
Copying data from hdfs://node01:9000/user/hadoop/data/tabC2/month=202109
Copying file: hdfs://node01:9000/user/hadoop/data/tabC2/month=202109/000000_0
Copying file: hdfs://node01:9000/user/hadoop/data/tabC2/month=202109/000000_0_copy_1
Copying data from hdfs://node01:9000/user/hadoop/data/tabC2/month=202110
Copying file: hdfs://node01:9000/user/hadoop/data/tabC2/month=202110/000000_0
Copying data from hdfs://node01:9000/user/hadoop/data/tabC2/month=202111
Copying file: hdfs://node01:9000/user/hadoop/data/tabC2/month=202111/000000_0
Copying data from hdfs://node01:9000/user/hadoop/data/tabC2/month=202112
Copying file: hdfs://node01:9000/user/hadoop/data/tabC2/month=202112/000000_0
Loading data to table mydb.tabe partition (month=202109)
Loading data to table mydb.tabe partition (month=202110)
Loading data to table mydb.tabe partition (month=202111)
Loading data to table mydb.tabe partition (month=202112)
OK
Time taken: 4.861 seconds
hive (mydb)> select * from tabE;
OK
tabe.id tabe.name       tabe.area       tabe.month
1       Corley  Beijing 202109
2       Jack    Tianjin 202109
3       Bob     Shanghai        202109
1       Corley  Beijing 202110
2       Jack    Tianjin 202110
3       Bob     Shanghai        202110
1       Corley  Beijing 202111
2       Jack    Tianjin 202111
3       Bob     Shanghai        202111
1       Corley  Beijing 202112
2       Jack    Tianjin 202112
3       Bob     Shanghai        202112
1       Corley  Beijing 202112
2       Jack    Tianjin 202112
3       Bob     Shanghai        202112
Time taken: 0.232 seconds, Fetched: 15 row(s)
hive (mydb)>

可以總結，使用 like tname創建的表結構與原表一致，而使用create ... as select ...結構可能不一致，例如不會攜帶磁區資訊，

truncate可以用來截斷表，也就是清空資料，如下：

hive (mydb)> truncate table tabE;
OK
Time taken: 0.833 seconds
hive (mydb)> select * from tabE;
OK
tabe.id tabe.name       tabe.area       tabe.month
Time taken: 0.287 seconds
hive (mydb)> alter table tabE set tblproperties("EXTERNAL"="TRUE");
OK
Time taken: 0.167 seconds
hive (mydb)> truncate table tabE;
FAILED: SemanticException [Error 10146]: Cannot truncate non-managed table tabE.
hive (mydb)>

需要注意，truncate僅能操作內部表，操作外部表時會報錯，

總結如下：

資料匯入方式如下：

load data
insert
create table … as select …
import table

資料匯出方式如下：

insert overwrite … diretory …
hdfs dfs -get
hive -e “select …” > file
export table …

除此之外，Hive的資料匯入與匯出還可以使用其他工具，包括Sqoop、DataX等，

6.HQL操作之DQL命令

DQL即Data Query Language資料查詢語言，是HQL的重點，

書寫SQL陳述句時，注意事項如下：

SQL陳述句對大小寫不敏感
SQL陳述句可以寫一行（簡單SQL），也可以寫多行（復雜SQL）
關鍵字不能縮寫，也不能分行
各子句一般要分行
使用縮進格式，提高SQL陳述句的可讀性

（1）簡單查詢

先準備資料檔案，vim /home/hadoop/data/emp.dat，輸入內容如下：

7369,SMITH,CLERK,7902,2010-12-17,800,,20
7499,ALLEN,SALESMAN,7698,2011-02-20,1600,300,30
7521,WARD,SALESMAN,7698,2011-02-22,1250,500,30
7566,JONES,MANAGER,7839,2011-04-02,2975,,20
7654,MARTIN,SALESMAN,7698,2011-09-28,1250,1400,30
7698,BLAKE,MANAGER,7839,2011-05-01,2850,,30
7782,CLARK,MANAGER,7839,2011-06-09,2450,,10
7788,SCOTT,ANALYST,7566,2017-07-13,3000,,20
7839,KING,PRESIDENT,,2011-11-07,5000,,10
7844,TURNER,SALESMAN,7698,2011-09-08,1500,0,30
7876,ADAMS,CLERK,7788,2017-07-13,1100,,20
7900,JAMES,CLERK,7698,2011-12-03,950,,30
7902,FORD,ANALYST,7566,2011-12-03,3000,,20
7934,MILLER,CLERK,7782,2012-01-23,1300,,10

再創建表和匯入資料，如下：

hive (mydb)> CREATE TABLE emp(
           >     empno int,
           >     ename string, 
           >     job string, 
           >     mgr int, 
           >     hiredate DATE, 
           >     sal int, 
           >     comm int, 
           >     deptno int
           > )row format delimited fields terminated by ",";
OK
Time taken: 0.179 seconds
hive (mydb)> load data local inpath '/home/hadoop/data/emp.dat' into table emp;
Loading data to table mydb.emp
OK
Time taken: 0.712 seconds
hive (mydb)> select * from emp;
OK
emp.empno       emp.ename       emp.job emp.mgr emp.hiredate    emp.sal emp.comm        emp.deptno
7369    SMITH   CLERK   7902    2010-12-17      800     NULL    20
7499    ALLEN   SALESMAN        7698    2011-02-20      1600    300     30
7521    WARD    SALESMAN        7698    2011-02-22      1250    500     30
7566    JONES   MANAGER 7839    2011-04-02      2975    NULL    20
7654    MARTIN  SALESMAN        7698    2011-09-28      1250    1400    30
7698    BLAKE   MANAGER 7839    2011-05-01      2850    NULL    30
7782    CLARK   MANAGER 7839    2011-06-09      2450    NULL    10
7788    SCOTT   ANALYST 7566    2017-07-13      3000    NULL    20
7839    KING    PRESIDENT       NULL    2011-11-07      5000    NULL    10
7844    TURNER  SALESMAN        7698    2011-09-08      1500    0       30
7876    ADAMS   CLERK   7788    2017-07-13      1100    NULL    20
7900    JAMES   CLERK   7698    2011-12-03      950     NULL    30
7902    FORD    ANALYST 7566    2011-12-03      3000    NULL    20
7934    MILLER  CLERK   7782    2012-01-23      1300    NULL    10
Time taken: 0.329 seconds, Fetched: 14 row(s)
hive (mydb)>

再進行簡單查詢，如下：

-- 省略from子句的查詢
hive (mydb)> select 123 * 321;
OK
_c0
39483
Time taken: 0.139 seconds, Fetched: 1 row(s)
hive (mydb)> select current_date;
OK
_c0
2021-09-24
Time taken: 0.113 seconds, Fetched: 1 row(s)
-- 使用列別名
hive (mydb)> select 123 * 321 as pro;
OK
pro
39483
Time taken: 0.125 seconds, Fetched: 1 row(s)
hive (mydb)> select current_date curdate;
OK
curdate
2021-09-24
Time taken: 0.124 seconds, Fetched: 1 row(s)
-- 全表查詢
hive (mydb)> select * from emp;
OK
emp.empno       emp.ename       emp.job emp.mgr emp.hiredate    emp.sal emp.comm        emp.deptno
7369    SMITH   CLERK   7902    2010-12-17      800     NULL    20
7499    ALLEN   SALESMAN        7698    2011-02-20      1600    300     30
7521    WARD    SALESMAN        7698    2011-02-22      1250    500     30
7566    JONES   MANAGER 7839    2011-04-02      2975    NULL    20
7654    MARTIN  SALESMAN        7698    2011-09-28      1250    1400    30
7698    BLAKE   MANAGER 7839    2011-05-01      2850    NULL    30
7782    CLARK   MANAGER 7839    2011-06-09      2450    NULL    10
7788    SCOTT   ANALYST 7566    2017-07-13      3000    NULL    20
7839    KING    PRESIDENT       NULL    2011-11-07      5000    NULL    10
7844    TURNER  SALESMAN        7698    2011-09-08      1500    0       30
7876    ADAMS   CLERK   7788    2017-07-13      1100    NULL    20
7900    JAMES   CLERK   7698    2011-12-03      950     NULL    30
7902    FORD    ANALYST 7566    2011-12-03      3000    NULL    20
7934    MILLER  CLERK   7782    2012-01-23      1300    NULL    10
Time taken: 0.277 seconds, Fetched: 14 row(s)
-- 選擇特定列查詢
hive (mydb)> select ename, sal, comm from emp;
OK
ename   sal     comm
SMITH   800     NULL
ALLEN   1600    300
WARD    1250    500
JONES   2975    NULL
MARTIN  1250    1400
BLAKE   2850    NULL
CLARK   2450    NULL
SCOTT   3000    NULL
KING    5000    NULL
TURNER  1500    0
ADAMS   1100    NULL
JAMES   950     NULL
FORD    3000    NULL
MILLER  1300    NULL
Time taken: 0.172 seconds, Fetched: 14 row(s)
-- 使用函式
hive (mydb)> select count(*) from emp;
Automatically selecting local only mode for query
...
Total MapReduce CPU Time Spent: 0 msec
OK
_c0
14
Time taken: 6.238 seconds, Fetched: 1 row(s)
hive (mydb)> select count(1) from emp;
Automatically selecting local only mode for query
...
Total MapReduce CPU Time Spent: 0 msec
OK
_c0
14
Time taken: 1.823 seconds, Fetched: 1 row(s)
hive (mydb)> select count(empno) from emp;
Automatically selecting local only mode for query
...
Total MapReduce CPU Time Spent: 0 msec
OK
_c0
14
Time taken: 1.697 seconds, Fetched: 1 row(s)
hive (mydb)> select count(comm) from emp;
Automatically selecting local only mode for query
...
Total MapReduce CPU Time Spent: 0 msec
OK
_c0
4
Time taken: 2.079 seconds, Fetched: 1 row(s)
hive (mydb)> select sum(sal) from emp;
Automatically selecting local only mode for query
...
Total MapReduce CPU Time Spent: 0 msec
OK
_c0
29025
Time taken: 1.746 seconds, Fetched: 1 row(s)
hive (mydb)> select max(sal) from emp;
Automatically selecting local only mode for query
...
Total MapReduce CPU Time Spent: 0 msec
OK
_c0
5000
Time taken: 1.737 seconds, Fetched: 1 row(s)
hive (mydb)> select min(sal) from emp;
Automatically selecting local only mode for query
...
Total MapReduce CPU Time Spent: 0 msec
OK
_c0
800
Time taken: 1.789 seconds, Fetched: 1 row(s)
hive (mydb)> select avg(sal) from emp;
Automatically selecting local only mode for query
...
Total MapReduce CPU Time Spent: 0 msec
OK
_c0
2073.214285714286
Time taken: 1.629 seconds, Fetched: 1 row(s)
-- 使用limit子句限制回傳的行數
hive (mydb)> select * from emp limit 3;
OK
emp.empno       emp.ename       emp.job emp.mgr emp.hiredate    emp.sal emp.comm        emp.deptno
7369    SMITH   CLERK   7902    2010-12-17      800     NULL    20
7499    ALLEN   SALESMAN        7698    2011-02-20      1600    300     30
7521    WARD    SALESMAN        7698    2011-02-22      1250    500     30
Time taken: 0.338 seconds, Fetched: 3 row(s)
hive (mydb)>

需要注意，使用count函式時，如果傳入的是欄位，則不統計NULL，所以要統計資料的行數時，一般不傳入某個欄位，而是傳入*或1，

（2）簡單子句

WHERE子句緊隨FROM子句，使用WHERE子句，過濾不滿足條件的資料；

where 子句中不能使用列的別名，

where子句的簡單用法如下：

hive (mydb)> select ename name, sal from emp;
OK
name    sal
SMITH   800
ALLEN   1600
WARD    1250
JONES   2975
MARTIN  1250
BLAKE   2850
CLARK   2450
SCOTT   3000
KING    5000
TURNER  1500
ADAMS   1100
JAMES   950
FORD    3000
MILLER  1300
Time taken: 0.145 seconds, Fetched: 14 row(s)
hive (mydb)> select ename name, sal from emp where length(ename)=5;
OK
name    sal
SMITH   800
ALLEN   1600
JONES   2975
BLAKE   2850
CLARK   2450
SCOTT   3000
ADAMS   1100
JAMES   950
Time taken: 0.183 seconds, Fetched: 8 row(s)
hive (mydb)> select ename name, sal from emp where length(name)=5;
FAILED: SemanticException [Error 10004]: Line 1:45 Invalid table alias or column reference 'name': (possible column names are: empno, ename, job, mgr, hiredate, sal, comm, deptno)
hive (mydb)> select ename, sal from emp where sal > 2000;
OK
ename   sal
JONES   2975
BLAKE   2850
CLARK   2450
SCOTT   3000
KING    5000
FORD    3000
Time taken: 0.333 seconds, Fetched: 6 row(s)
hive (mydb)>

可以看到，where子句中不能使用欄位的別名作為查詢條件，

where子句中會涉及到較多的比較運算和邏輯運算，

常見的比較運算子如下：

比較運算子	含義
=、==、<=>	等于
<>、!=	不等于
<、<=、>、>=	大于等于、小于等于
is [not] null	如果A等于NULL，則回傳TRUE，反之回傳FALSE；使用NOT關鍵字結果相反
in (value1, value2, …)	匹配串列中的值
LIKE	簡單正則運算式，也稱通配符模式： ‘x%’ 表示必須以字母 ‘x’ 開頭； ’%x’表示必須以字母’x’結尾； ’%x%‘表示包含有字母’x’，可以位于字串任意位置；使用NOT關鍵字結果相反，其中，%代表匹配零個或多個字符(任意個字符)，_ 代表匹配一個字符
[NOT] BETWEEN … AND …	范圍的判斷，使用NOT關鍵字結果相反
RLIKE、REGEXP	基于Java的正則運算式，匹配回傳TRUE，反之回傳FALSE；匹配使用的是JDK中的正則運算式介面實作的，因為正則也依據其中的規則；例如，正則運算式必須和整個字串A相匹配，而不是只需與其字串匹配

更完整的比較運算子可參考官方檔案https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF，

使用比較運算子如下：

hive (mydb)> select sal, comm, sal+comm from emp;
OK
sal     comm    _c2
800     NULL    NULL
1600    300     1900
...
3000    NULL    NULL
1300    NULL    NULL
Time taken: 0.257 seconds, Fetched: 14 row(s)
hive (mydb)> select sal, comm, sal+comm from emp;
OK
sal     comm    _c2
800     NULL    NULL
1600    300     1900
1250    500     1750
...
3000    NULL    NULL
1300    NULL    NULL
Time taken: 0.17 seconds, Fetched: 14 row(s)
hive (mydb)> select * from emp where comm != NULL;
OK
emp.empno       emp.ename       emp.job emp.mgr emp.hiredate    emp.sal emp.comm        emp.deptno
Time taken: 0.246 seconds
hive (mydb)> select * from emp where comm is not NULL;
OK
emp.empno       emp.ename       emp.job emp.mgr emp.hiredate    emp.sal emp.comm        emp.deptno
7499    ALLEN   SALESMAN        7698    2011-02-20      1600    300     30
7521    WARD    SALESMAN        7698    2011-02-22      1250    500     30
7654    MARTIN  SALESMAN        7698    2011-09-28      1250    1400    30
7844    TURNER  SALESMAN        7698    2011-09-08      1500    0       30
Time taken: 0.193 seconds, Fetched: 4 row(s)
Time taken: 0.192 seconds, Fetched: 1 row(s)
hive (mydb)> select * from emp where deptno in (20, 30);
OK
emp.empno       emp.ename       emp.job emp.mgr emp.hiredate    emp.sal emp.comm        emp.deptno
7369    SMITH   CLERK   7902    2010-12-17      800     NULL    20
7499    ALLEN   SALESMAN        7698    2011-02-20      1600    300     30
...
7900    JAMES   CLERK   7698    2011-12-03      950     NULL    30
7902    FORD    ANALYST 7566    2011-12-03      3000    NULL    20
Time taken: 0.247 seconds, Fetched: 11 row(s)
hive (mydb)> select * from emp where ename like 'S%';
OK
emp.empno       emp.ename       emp.job emp.mgr emp.hiredate    emp.sal emp.comm        emp.deptno
7369    SMITH   CLERK   7902    2010-12-17      800     NULL    20
7788    SCOTT   ANALYST 7566    2017-07-13      3000    NULL    20
Time taken: 0.39 seconds, Fetched: 2 row(s)
hive (mydb)> select * from emp where ename like '%S';
OK
emp.empno       emp.ename       emp.job emp.mgr emp.hiredate    emp.sal emp.comm        emp.deptno
7566    JONES   MANAGER 7839    2011-04-02      2975    NULL    20
7876    ADAMS   CLERK   7788    2017-07-13      1100    NULL    20
7900    JAMES   CLERK   7698    2011-12-03      950     NULL    30
Time taken: 0.165 seconds, Fetched: 3 row(s)
hive (mydb)> select * from emp where ename like '%S%';
OK
emp.empno       emp.ename       emp.job emp.mgr emp.hiredate    emp.sal emp.comm        emp.deptno
7369    SMITH   CLERK   7902    2010-12-17      800     NULL    20
7566    JONES   MANAGER 7839    2011-04-02      2975    NULL    20
7788    SCOTT   ANALYST 7566    2017-07-13      3000    NULL    20
7876    ADAMS   CLERK   7788    2017-07-13      1100    NULL    20
7900    JAMES   CLERK   7698    2011-12-03      950     NULL    30
Time taken: 0.098 seconds, Fetched: 5 row(s)
hive (mydb)> select * from emp where sal between 1000 and 2000;
OK
emp.empno       emp.ename       emp.job emp.mgr emp.hiredate    emp.sal emp.comm        emp.deptno
7499    ALLEN   SALESMAN        7698    2011-02-20      1600    300     30
7521    WARD    SALESMAN        7698    2011-02-22      1250    500     30
7654    MARTIN  SALESMAN        7698    2011-09-28      1250    1400    30
7844    TURNER  SALESMAN        7698    2011-09-08      1500    0       30
7876    ADAMS   CLERK   7788    2017-07-13      1100    NULL    20
7934    MILLER  CLERK   7782    2012-01-23      1300    NULL    10
Time taken: 0.855 seconds, Fetched: 6 row(s)
hive (mydb)> select * from emp where ename like 'S%' or ename like '%S';
OK
emp.empno       emp.ename       emp.job emp.mgr emp.hiredate    emp.sal emp.comm        emp.deptno
7369    SMITH   CLERK   7902    2010-12-17      800     NULL    20
7566    JONES   MANAGER 7839    2011-04-02      2975    NULL    20
7788    SCOTT   ANALYST 7566    2017-07-13      3000    NULL    20
7876    ADAMS   CLERK   7788    2017-07-13      1100    NULL    20
7900    JAMES   CLERK   7698    2011-12-03      950     NULL    30
Time taken: 0.151 seconds, Fetched: 5 row(s)
hive (mydb)> select * from emp where ename rlike '^S.*|.*S$';
OK
emp.empno       emp.ename       emp.job emp.mgr emp.hiredate    emp.sal emp.comm        emp.deptno
7369    SMITH   CLERK   7902    2010-12-17      800     NULL    20
7566    JONES   MANAGER 7839    2011-04-02      2975    NULL    20
7788    SCOTT   ANALYST 7566    2017-07-13      3000    NULL    20
7876    ADAMS   CLERK   7788    2017-07-13      1100    NULL    20
7900    JAMES   CLERK   7698    2011-12-03      950     NULL    30
Time taken: 0.152 seconds, Fetched: 5 row(s)
hive (mydb)> select null=null;
OK
_c0
NULL
Time taken: 0.078 seconds, Fetched: 1 row(s)
hive (mydb)> select null==null;
OK
_c0
NULL
Time taken: 0.074 seconds, Fetched: 1 row(s)
hive (mydb)> select null<=>null;
OK
_c0
true
Time taken: 0.068 seconds, Fetched: 1 row(s)
hive (mydb)> select null is null;
OK
_c0
true
Time taken: 0.066 seconds, Fetched: 1 row(s)
hive (mydb)>

可以看到，通常情況下NULL參與運算，回傳值為NULL，并且判斷欄位（不）NULL時，不能使用=，而要使用is和is not，否則會得到例外的結果；

null<=>null和null is null的結果相同，都是true，

邏輯運算子包括and、or和not，

（3）group by分組子句

GROUP BY陳述句通常與聚組函式一起使用，按照一個或多個列對資料進行分組，對每個組進行聚合操作，

使用如下：

hive (mydb)> select avg(sal)
           >     from emp
           > group by deptno;
Automatically selecting local only mode for query
...
Total MapReduce CPU Time Spent: 0 msec
OK
_c0
2916.6666666666665
2175.0
1566.6666666666667
Time taken: 2.428 seconds, Fetched: 3 row(s)
-- 計算emp表每個部門的平均工資
hive (mydb)> select deptno, avg(sal)
           >     from emp
           > group by deptno;
Automatically selecting local only mode for query
...
Total MapReduce CPU Time Spent: 0 msec
OK
deptno  _c1
10      2916.6666666666665
20      2175.0
30      1566.6666666666667
Time taken: 1.921 seconds, Fetched: 3 row(s)
-- 計算emp每個部門中每個崗位的最高薪水
hive (mydb)> select deptno, job , max(sal)
           >     from emp
           > group by deptno, job;
Automatically selecting local only mode for query
...
Total MapReduce CPU Time Spent: 0 msec
OK
deptno  job     _c2
20      ANALYST 3000
10      CLERK   1300
20      CLERK   1100
30      CLERK   950
10      MANAGER 2450
20      MANAGER 2975
30      MANAGER 2850
10      PRESIDENT       5000
30      SALESMAN        1600
Time taken: 1.587 seconds, Fetched: 9 row(s)
hive (mydb)> select deptno, max(sal)
           >     from emp
           > group by deptno;
Automatically selecting local only mode for query
...
Total MapReduce CPU Time Spent: 0 msec
OK
deptno  _c1
10      5000
20      3000
30      2850
Time taken: 1.719 seconds, Fetched: 3 row(s)
-- 求每個部門的平均薪水大于2000的部門
hive (mydb)> select deptno, avg(sal) avgsal
           >     from emp
           > group by deptno
           > having avgsal > 2000;
Automatically selecting local only mode for query
...
Total MapReduce CPU Time Spent: 0 msec
OK
deptno  avgsal
10      2916.6666666666665
20      2175.0
Time taken: 1.881 seconds, Fetched: 2 row(s)
hive (mydb)> [root@node03 ~]$

現在對where和having進行總結：

where子句針對表中的資料發揮作用；having針對查詢結果（聚組以后的結果）發揮作用
where子句不能有分組函式；having子句可以有分組函式
having一般只用于group by分組統計之后

（4）表連接

Hive支持通常的SQL JOIN陳述句，默認情況下，僅支持等值連接，不支持非等值連接，

JOIN 陳述句中經常會使用表的別名，使用別名可以簡化SQL陳述句的撰寫，使用表名前綴可以提高SQL的決議效率，

連接查詢操作分為兩大類，內連接和外連接，而外連接可進一步細分為三種型別，如下：

內連接[inner] join
外連接outer join

????- 左外連接left [outer] join，左表的資料全部顯示

????- 右外連接right [outer] join，右表的資料全部顯示

????- 全外連接full [outer] join，兩張表的資料都顯示

圖示如下：
hive join

先準備資料，vim /home/hadoop/data/u1.txt，輸入如下：

1,a
2,b
3,c
4,d
5,e
6,f

vim /home/hadoop/data/u2.txt，輸入如下：

4,d
5,e
6,f
7,g
8,h
9,i

創建表并加載資料，如下：

hive (mydb)> create table if not exists u1(
           >     id int,
           >     name string)
           > row format delimited fields terminated by ',';
OK
Time taken: 0.823 seconds
hive (mydb)> create table if not exists u2(
           >     id int,
           >     name string)
           > row format delimited fields terminated by ',';
OK
Time taken: 0.143 seconds
hive (mydb)> load data local inpath '/home/hadoop/data/u1.txt' into table u1;
Loading data to table mydb.u1
OK
Time taken: 0.949 seconds
hive (mydb)> load data local inpath '/home/hadoop/data/u2.txt' into table u2;
Loading data to table mydb.u2
OK
Time taken: 0.773 seconds
hive (mydb)> select * from u1;
OK
u1.id   u1.name
1       a
2       b
3       c
4       d
5       e
6       f
Time taken: 1.587 seconds, Fetched: 6 row(s)
hive (mydb)> select * from u2;
OK
u2.id   u2.name
4       d
5       e
6       f
7       g
8       h
9       i
Time taken: 0.205 seconds, Fetched: 6 row(s)
hive (mydb)>

再測驗4種連接方式：

-- 內連接
hive (mydb)> select * from u1 join u2 on u1.id = u2.id;
Automatically selecting local only mode for query
...
2021-09-25 01:40:36     Uploaded 1 File to: file:/tmp/root/8c73cb74-4529-401a-a664-93c1ea29f8d0/hive_2021-09-25_01-40-19_146_1199023122301087997-1/-local-10004/HashTable-Stage-3/MapJoin-mapfile00--.hashtable (386 bytes)
2021-09-25 01:40:36     End of local task; Time Taken: 2.182 sec.
Execution completed successfully
MapredLocal task succeeded
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
Job running in-process (local Hadoop)
2021-09-25 01:40:40,275 Stage-3 map = 100%,  reduce = 0%
Ended Job = job_local250415828_0001
MapReduce Jobs Launched: 
Stage-Stage-3:  HDFS Read: 72 HDFS Write: 195 SUCCESS
Total MapReduce CPU Time Spent: 0 msec
OK
u1.id   u1.name u2.id   u2.name
4       d       4       d
5       e       5       e
6       f       6       f
Time taken: 21.206 seconds, Fetched: 3 row(s)
-- 左外連接
hive (mydb)> select * from u1 left join u2 on u1.id = u2.id;
Automatically selecting local only mode for query
...
2021-09-25 01:41:27     Uploaded 1 File to: file:/tmp/root/8c73cb74-4529-401a-a664-93c1ea29f8d0/hive_2021-09-25_01-41-11_852_942067958094048095-1/-local-10004/HashTable-Stage-3/MapJoin-mapfile11--.hashtable (386 bytes)
2021-09-25 01:41:27     End of local task; Time Taken: 1.788 sec.
Execution completed successfully
MapredLocal task succeeded
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
Job running in-process (local Hadoop)
2021-09-25 01:41:30,494 Stage-3 map = 100%,  reduce = 0%
Ended Job = job_local794342130_0002
MapReduce Jobs Launched: 
Stage-Stage-3:  HDFS Read: 243 HDFS Write: 408 SUCCESS
Total MapReduce CPU Time Spent: 0 msec
OK
u1.id   u1.name u2.id   u2.name
1       a       NULL    NULL
2       b       NULL    NULL
3       c       NULL    NULL
4       d       4       d
5       e       5       e
6       f       6       f
Time taken: 18.726 seconds, Fetched: 6 row(s)
-- 右外連接
hive (mydb)> select * from u1 right join u2 on u1.id = u2.id;
Automatically selecting local only mode for query
...
2021-09-25 01:41:55     Uploaded 1 File to: file:/tmp/root/8c73cb74-4529-401a-a664-93c1ea29f8d0/hive_2021-09-25_01-41-41_317_899584510973126689-1/-local-10004/HashTable-Stage-3/MapJoin-mapfile20--.hashtable (386 bytes)
2021-09-25 01:41:55     End of local task; Time Taken: 1.925 sec.
Execution completed successfully
MapredLocal task succeeded
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
Job running in-process (local Hadoop)
2021-09-25 01:41:58,864 Stage-3 map = 100%,  reduce = 0%
Ended Job = job_local644970176_0003
MapReduce Jobs Launched: 
Stage-Stage-3:  HDFS Read: 480 HDFS Write: 621 SUCCESS
Total MapReduce CPU Time Spent: 0 msec
OK
u1.id   u1.name u2.id   u2.name
4       d       4       d
5       e       5       e
6       f       6       f
NULL    NULL    7       g
NULL    NULL    8       h
NULL    NULL    9       i
Time taken: 17.577 seconds, Fetched: 6 row(s)
-- 全外連接
hive (mydb)> select * from u1 full join u2 on u1.id = u2.id;
Automatically selecting local only mode for query
...
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Job running in-process (local Hadoop)
2021-09-25 01:42:16,604 Stage-1 map = 100%,  reduce = 100%
Ended Job = job_local905853098_0004
MapReduce Jobs Launched: 
Stage-Stage-1:  HDFS Read: 2199 HDFS Write: 2142 SUCCESS
Total MapReduce CPU Time Spent: 0 msec
OK
u1.id   u1.name u2.id   u2.name
1       a       NULL    NULL
2       b       NULL    NULL
3       c       NULL    NULL
4       d       4       d
5       e       5       e
6       f       6       f
NULL    NULL    7       g
NULL    NULL    8       h
NULL    NULL    9       i
Time taken: 1.802 seconds, Fetched: 9 row(s)
hive (mydb)>

除此之外，還可以進行多表連接；
連接 n張表，至少需要 n-1 個連接條件，例如連接四張表至少需要三個連接條件，

舉例，多表連接查詢，查詢老師對應的課程，以及對應的分數，對應的學生，如下：

select *
    from techer t left join course c on t.t_id = c.t_id
        left join score s on s.c_id = c.c_id
        left join student stu on s.s_id = stu.s_id;

Hive總是按照從左到右的順序執行，Hive會對每對 JOIN 連接物件啟動一個MapReduce 任務，

上面的例子中會首先啟動一個MapReduce Job對表t和表c進行連接操作；然后再啟動一個MapReduce Job將第一個MapReduce Job的輸出和表s進行連接操作；然后再繼續啟動一個MapReduce Job將第二個MapReduce Job的輸出和表stu進行連接操作，所以總共會有3個MapReduce Job，

可以看到，連接條件會占用較多的連接資源，

Hive種也可以產生笛卡爾積，滿足以下條件將會產生笛卡爾積：

沒有連接條件
連接條件無效
所有表中的所有行互相連接

如果表A、B分別有M、N條資料，其笛卡爾積的結果將有 M*N 條資料，預設條件下HIve不支持笛卡爾積運算，需要設定引數hive.strict.checks.cartesian.product=false才能進行笛卡爾積運算，

使用如下：

hive (mydb)> select * from u1, u2;
FAILED: SemanticException Cartesian products are disabled for safety reasons. If you know what you are doing, please sethive.strict.checks.cartesian.product to false and that hive.mapred.mode is not set to 'strict' to proceed. Note that if you may get errors or incorrect results if you make a mistake while using some of the unsafe features.
hive (mydb)> set hive.strict.checks.cartesian.product;
hive.strict.checks.cartesian.product=true
hive (mydb)> set hive.strict.checks.cartesian.product=false;
hive (mydb)> select * from u1, u2;
Warning: Map Join MAPJOIN[9][bigTable=?] in task 'Stage-3:MAPRED' is a cross product
Automatically selecting local only mode for query
WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
Query ID = root_20210925015241_c38361bc-8bd1-4473-8e2c-ec9479516299
Total jobs = 1
2021-09-25 01:52:55     Starting to launch local task to process map join;      maximum memory = 518979584
2021-09-25 01:52:57     Dump the side-table for tag: 0 with group count: 1 into file: file:/tmp/root/8c73cb74-4529-401a-a664-93c1ea29f8d0/hive_2021-09-25_01-52-41_759_493077905194279454-1/-local-10004/HashTable-Stage-3/MapJoin-mapfile30--.hashtable
2021-09-25 01:52:57     Uploaded 1 File to: file:/tmp/root/8c73cb74-4529-401a-a664-93c1ea29f8d0/hive_2021-09-25_01-52-41_759_493077905194279454-1/-local-10004/HashTable-Stage-3/MapJoin-mapfile30--.hashtable (320 bytes)
2021-09-25 01:52:57     End of local task; Time Taken: 1.549 sec.
Execution completed successfully
MapredLocal task succeeded
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
Job running in-process (local Hadoop)
2021-09-25 01:53:00,570 Stage-3 map = 100%,  reduce = 0%
Ended Job = job_local1556109485_0005
MapReduce Jobs Launched: 
Stage-Stage-3:  HDFS Read: 1044 HDFS Write: 1707 SUCCESS
Total MapReduce CPU Time Spent: 0 msec
OK
u1.id   u1.name u2.id   u2.name
1       a       4       d
2       b       4       d
3       c       4       d
4       d       4       d
5       e       4       d
6       f       4       d
1       a       5       e
2       b       5       e
3       c       5       e
4       d       5       e
5       e       5       e
6       f       5       e
1       a       6       f
2       b       6       f
3       c       6       f
4       d       6       f
5       e       6       f
6       f       6       f
1       a       7       g
2       b       7       g
3       c       7       g
4       d       7       g
5       e       7       g
6       f       7       g
1       a       8       h
2       b       8       h
3       c       8       h
4       d       8       h
5       e       8       h
6       f       8       h
1       a       9       i
2       b       9       i
3       c       9       i
4       d       9       i
5       e       9       i
6       f       9       i
Time taken: 18.844 seconds, Fetched: 36 row(s)
hive (mydb)> select count(*) from u1, u2;
Warning: Map Join MAPJOIN[15][bigTable=?] in task 'Stage-2:MAPRED' is a cross product
Automatically selecting local only mode for query
WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
Query ID = root_20210925015310_41ce7307-4ea7-41f3-8c40-6c1927d3feb7
Total jobs = 1
2021-09-25 01:53:25     Starting to launch local task to process map join;      maximum memory = 518979584
2021-09-25 01:53:27     Dump the side-table for tag: 0 with group count: 1 into file: file:/tmp/root/8c73cb74-4529-401a-a664-93c1ea29f8d0/hive_2021-09-25_01-53-10_349_6581783439017176314-1/-local-10005/HashTable-Stage-2/MapJoin-mapfile40--.hashtable
2021-09-25 01:53:27     Uploaded 1 File to: file:/tmp/root/8c73cb74-4529-401a-a664-93c1ea29f8d0/hive_2021-09-25_01-53-10_349_6581783439017176314-1/-local-10005/HashTable-Stage-2/MapJoin-mapfile40--.hashtable (296 bytes)
2021-09-25 01:53:27     End of local task; Time Taken: 1.715 sec.
Execution completed successfully
MapredLocal task succeeded
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Job running in-process (local Hadoop)
2021-09-25 01:53:30,449 Stage-2 map = 100%,  reduce = 100%
Ended Job = job_local480461869_0006
MapReduce Jobs Launched: 
Stage-Stage-2:  HDFS Read: 3750 HDFS Write: 3516 SUCCESS
Total MapReduce CPU Time Spent: 0 msec
OK
_c0
36
Time taken: 20.145 seconds, Fetched: 1 row(s)
hive (mydb)>

（5）order by排序子句

Hive中的order by子句與MySQL中存在一定的區別，

order by子句用于對最終的結果進行排序，一般出現在select陳述句的結尾；
默認使用升序（ASC），可以使用DESC，跟在欄位名之后表示降序；
ORDER BY執行全域排序 ，只有一個reduce任務，

使用如下：

hive (mydb)> select * from emp order by deptno;
Automatically selecting local only mode for query
...
Total MapReduce CPU Time Spent: 0 msec
OK
emp.empno       emp.ename       emp.job emp.mgr emp.hiredate    emp.sal emp.comm        emp.deptno
7934    MILLER  CLERK   7782    2012-01-23      1300    NULL    10
7839    KING    PRESIDENT       NULL    2011-11-07      5000    NULL    10
7782    CLARK   MANAGER 7839    2011-06-09      2450    NULL    10
7876    ADAMS   CLERK   7788    2017-07-13      1100    NULL    20
7788    SCOTT   ANALYST 7566    2017-07-13      3000    NULL    20
7369    SMITH   CLERK   7902    2010-12-17      800     NULL    20
7566    JONES   MANAGER 7839    2011-04-02      2975    NULL    20
7902    FORD    ANALYST 7566    2011-12-03      3000    NULL    20
7844    TURNER  SALESMAN        7698    2011-09-08      1500    0       30
7499    ALLEN   SALESMAN        7698    2011-02-20      1600    300     30
7698    BLAKE   MANAGER 7839    2011-05-01      2850    NULL    30
7654    MARTIN  SALESMAN        7698    2011-09-28      1250    1400    30
7521    WARD    SALESMAN        7698    2011-02-22      1250    500     30
7900    JAMES   CLERK   7698    2011-12-03      950     NULL    30
Time taken: 7.211 seconds, Fetched: 14 row(s)
hive (mydb)> select empno, ename, job, mgr, sal+comm salsum, deptno
           > from emp
           > order by salsum desc;
Automatically selecting local only mode for query
...
Total MapReduce CPU Time Spent: 0 msec
OK
empno   ename   job     mgr     salsum  deptno
7654    MARTIN  SALESMAN        7698    2650    30
7499    ALLEN   SALESMAN        7698    1900    30
7521    WARD    SALESMAN        7698    1750    30
7844    TURNER  SALESMAN        7698    1500    30
7934    MILLER  CLERK   7782    NULL    10
7902    FORD    ANALYST 7566    NULL    20
7900    JAMES   CLERK   7698    NULL    30
7876    ADAMS   CLERK   7788    NULL    20
7839    KING    PRESIDENT       NULL    NULL    10
7788    SCOTT   ANALYST 7566    NULL    20
7782    CLARK   MANAGER 7839    NULL    10
7698    BLAKE   MANAGER 7839    NULL    30
7566    JONES   MANAGER 7839    NULL    20
7369    SMITH   CLERK   7902    NULL    20
Time taken: 2.068 seconds, Fetched: 14 row(s)
hive (mydb)> select empno, ename, job, mgr, sal+nvl(comm, 0) salsum, deptno
           > from emp
           > order by salsum desc;
Automatically selecting local only mode for query
...
Total MapReduce CPU Time Spent: 0 msec
OK
empno   ename   job     mgr     salsum  deptno
7839    KING    PRESIDENT       NULL    5000    10
7902    FORD    ANALYST 7566    3000    20
7788    SCOTT   ANALYST 7566    3000    20
7566    JONES   MANAGER 7839    2975    20
7698    BLAKE   MANAGER 7839    2850    30
7654    MARTIN  SALESMAN        7698    2650    30
7782    CLARK   MANAGER 7839    2450    10
7499    ALLEN   SALESMAN        7698    1900    30
7521    WARD    SALESMAN        7698    1750    30
7844    TURNER  SALESMAN        7698    1500    30
7934    MILLER  CLERK   7782    1300    10
7876    ADAMS   CLERK   7788    1100    20
7900    JAMES   CLERK   7698    950     30
7369    SMITH   CLERK   7902    800     20
Time taken: 1.739 seconds, Fetched: 14 row(s)
hive (mydb)> select empno, ename, job, mgr, sal+nvl(comm, 0) salsum, deptno
           > from emp
           > order by deptno, salsum desc;
Automatically selecting local only mode for query
...
Total MapReduce CPU Time Spent: 0 msec
OK
empno   ename   job     mgr     salsum  deptno
7839    KING    PRESIDENT       NULL    5000    10
7782    CLARK   MANAGER 7839    2450    10
7934    MILLER  CLERK   7782    1300    10
7788    SCOTT   ANALYST 7566    3000    20
7902    FORD    ANALYST 7566    3000    20
7566    JONES   MANAGER 7839    2975    20
7876    ADAMS   CLERK   7788    1100    20
7369    SMITH   CLERK   7902    800     20
7698    BLAKE   MANAGER 7839    2850    30
7654    MARTIN  SALESMAN        7698    2650    30
7499    ALLEN   SALESMAN        7698    1900    30
7521    WARD    SALESMAN        7698    1750    30
7844    TURNER  SALESMAN        7698    1500    30
7900    JAMES   CLERK   7698    950     30
Time taken: 1.849 seconds, Fetched: 14 row(s)
hive (mydb)> select empno, ename, job, mgr, sal+nvl(comm, 0) salsum
           > from emp
           > order by deptno, salsum desc;
FAILED: SemanticException [Error 10004]: Line 3:9 Invalid table alias or column reference 'deptno': (possible column names are: empno, ename, job, mgr, salsum)
hive (mydb)>

其中，nvl函式的作用是在傳入的欄位的值為空時，將欄位的值設定為第二個引數的值，一般在欄位參與運算時，會用到該函式；

同時需要保證，排序欄位要出現在select子句中，否則查詢陳述句無法執行，上面的查詢陳述句因為select子句中缺少deptno、而order by子句中存在deptno，所以不能正常執行，

（6）sort by排序

對于大規模資料而言order by效率低；
在很多業務場景，我們并不需要全域有序的資料、而只需要區域有序的資料即可，此時可以使用sort by；
sort by可以為每個reduce產生一個排序檔案，在reduce內部進行排序，得到區域有序的結果，

現在使用如下：

-- 設定reduce個數
hive (mydb)> set mapreduce.job.reduces;
mapreduce.job.reduces=-1
hive (mydb)> set mapreduce.job.reduces=2;
hive (mydb)> set mapreduce.job.reduces;
mapreduce.job.reduces=2
-- 按照工資降序查看員工資訊
hive (mydb)> select * from emp sort by sal desc;
WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
...
Total MapReduce CPU Time Spent: 7 seconds 860 msec
OK
emp.empno       emp.ename       emp.job emp.mgr emp.hiredate    emp.sal emp.comm        emp.deptno
7902    FORD    ANALYST 7566    2011-12-03      3000    NULL    20
7788    SCOTT   ANALYST 7566    2017-07-13      3000    NULL    20
7566    JONES   MANAGER 7839    2011-04-02      2975    NULL    20
7844    TURNER  SALESMAN        7698    2011-09-08      1500    0       30
7521    WARD    SALESMAN        7698    2011-02-22      1250    500     30
7654    MARTIN  SALESMAN        7698    2011-09-28      1250    1400    30
7876    ADAMS   CLERK   7788    2017-07-13      1100    NULL    20
7900    JAMES   CLERK   7698    2011-12-03      950     NULL    30
7369    SMITH   CLERK   7902    2010-12-17      800     NULL    20
7839    KING    PRESIDENT       NULL    2011-11-07      5000    NULL    10
7698    BLAKE   MANAGER 7839    2011-05-01      2850    NULL    30
7782    CLARK   MANAGER 7839    2011-06-09      2450    NULL    10
7499    ALLEN   SALESMAN        7698    2011-02-20      1600    300     30
7934    MILLER  CLERK   7782    2012-01-23      1300    NULL    10
Time taken: 42.303 seconds, Fetched: 14 row(s)
-- 將查詢結果匯入到檔案中（按照工資降序），生成兩個輸出檔案，每個檔案內部資料按工資降序排列
hive (mydb)> insert overwrite local directory '/home/hadoop/output/sortsal'
           > select * from emp sort by sal desc;
WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
...
Total MapReduce CPU Time Spent: 9 seconds 50 msec
OK
emp.empno       emp.ename       emp.job emp.mgr emp.hiredate    emp.sal emp.comm        emp.deptno
Time taken: 46.032 seconds
hive (mydb)>

運行后，查看本地，如下：

[root@node03 ~]$ ll /home/hadoop/output/sortsal/
總用量 8
-rw-r--r-- 1 root root 411 9月  25 15:20 000000_0
-rw-r--r-- 1 root root 230 9月  25 15:20 000001_0
[root@node03 ~]$ cat -A /home/hadoop/output/sortsal/000000_0
7902^AFORD^AANALYST^A7566^A2011-12-03^A3000^A\N^A20$
7788^ASCOTT^AANALYST^A7566^A2017-07-13^A3000^A\N^A20$
7566^AJONES^AMANAGER^A7839^A2011-04-02^A2975^A\N^A20$
7844^ATURNER^ASALESMAN^A7698^A2011-09-08^A1500^A0^A30$
7521^AWARD^ASALESMAN^A7698^A2011-02-22^A1250^A500^A30$
7654^AMARTIN^ASALESMAN^A7698^A2011-09-28^A1250^A1400^A30$
7876^AADAMS^ACLERK^A7788^A2017-07-13^A1100^A\N^A20$
7900^AJAMES^ACLERK^A7698^A2011-12-03^A950^A\N^A30$
7369^ASMITH^ACLERK^A7902^A2010-12-17^A800^A\N^A20$
[root@node03 ~]$ cat -A /home/hadoop/output/sortsal/000001_0
7839^AKING^APRESIDENT^A\N^A2011-11-07^A5000^A\N^A10$
7698^ABLAKE^AMANAGER^A7839^A2011-05-01^A2850^A\N^A30$
7782^ACLARK^AMANAGER^A7839^A2011-06-09^A2450^A\N^A10$
7499^AALLEN^ASALESMAN^A7698^A2011-02-20^A1600^A300^A30$
7934^AMILLER^ACLERK^A7782^A2012-01-23^A1300^A\N^A10$
[root@node03 ~]$

可以看到，reduce個數（mapreduce.job.reduces引數的值）默認為-1，此時Hive可以自行計算reduce的個數，當資料很小時就會只計算出一個reduce，所以要想有多個reduce，需要手動設定；

此時有多個reduce，不能再啟用本地模式，而是使用多個MR Job；

在查詢的結果中，無論是列印出來到控制臺，還是輸出到檔案，在區域都是有序的，

（7）distribute by和cluster by排序

distribute by用于磁區排序；
distribute by 將特定的行發送到特定的reducer中，便于后繼的聚合與排序操作；
distribute by 類似于MR中的磁區操作，可以結合sort by操作，使磁區資料有序，結合使用時distribute by 要寫在sort by之前，
使用如下：

-- 啟動2個reduce task
hive (mydb)> set mapreduce.job.reduces=2;
-- 先按deptno磁區，在磁區內按sal+comm排序，將結果輸出到檔案，觀察輸出結果
hive (mydb)> insert overwrite local directory '/home/hadoop/output/distBy'
           > select empno, ename, deptno, job, sal+nvl(comm, 0) salsum
           > from emp
           > distribute by deptno
           > sort by salsum desc;
WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
...
Total MapReduce CPU Time Spent: 6 seconds 310 msec
OK
empno   ename   deptno  job     salsum
Time taken: 39.482 seconds
hive (mydb)> select distinct deptno from emp;
WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
...
Total MapReduce CPU Time Spent: 7 seconds 730 msec
OK
deptno
10
20
30
Time taken: 38.173 seconds, Fetched: 3 row(s)
-- 啟動3個reduce task，將資料分到3個區中
hive (mydb)> set mapreduce.job.reduces=3;
hive (mydb)> insert overwrite local directory '/home/hadoop/output/distby'
           > select empno, ename, deptno, job, sal+nvl(comm, 0) salsum
           > from emp
           > distribute by deptno
           > sort by salsum desc;
WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
...
Total MapReduce CPU Time Spent: 10 seconds 890 msec
OK
empno   ename   deptno  job     salsum
Time taken: 36.433 seconds
hive (mydb)>

查看本地，如下：

# 2個reduce task
[root@node03 ~]$ ll /home/hadoop/output/distBy/
總用量 4
-rw-r--r-- 1 root root 374 9月  25 15:34 000000_0
-rw-r--r-- 1 root root   0 9月  25 15:34 000001_0
[root@node03 ~]$ cat -A /home/hadoop/output/distBy/000000_0 
7839^AKING^A10^APRESIDENT^A5000$
7902^AFORD^A20^AANALYST^A3000$
7788^ASCOTT^A20^AANALYST^A3000$
7566^AJONES^A20^AMANAGER^A2975$
7698^ABLAKE^A30^AMANAGER^A2850$
7654^AMARTIN^A30^ASALESMAN^A2650$
7782^ACLARK^A10^AMANAGER^A2450$
7499^AALLEN^A30^ASALESMAN^A1900$
7521^AWARD^A30^ASALESMAN^A1750$
7844^ATURNER^A30^ASALESMAN^A1500$
7934^AMILLER^A10^ACLERK^A1300$
7876^AADAMS^A20^ACLERK^A1100$
7900^AJAMES^A30^ACLERK^A950$
7369^ASMITH^A20^ACLERK^A800$
[root@node03 ~]$ cat -A /home/hadoop/output/distBy/000001_0 
# 3個reduce task
[root@node03 ~]$ ll /home/hadoop/output/distby/
總用量 12
-rw-r--r-- 1 root root 164 9月  25 15:42 000000_0
-rw-r--r-- 1 root root  81 9月  25 15:42 000001_0
-rw-r--r-- 1 root root 129 9月  25 15:42 000002_0
[root@node03 ~]$ cat -A /home/hadoop/output/distby/000000_0 
7698^ABLAKE^A30^AMANAGER^A2850$
7654^AMARTIN^A30^ASALESMAN^A2650$
7499^AALLEN^A30^ASALESMAN^A1900$
7521^AWARD^A30^ASALESMAN^A1750$
7844^ATURNER^A30^ASALESMAN^A1500$
7900^AJAMES^A30^ACLERK^A950$
[root@node03 ~]$ cat -A /home/hadoop/output/distby/000001_0 
7839^AKING^A10^APRESIDENT^A5000$
7782^ACLARK^A10^AMANAGER^A2450$
7934^AMILLER^A10^ACLERK^A1300$
[root@node03 ~]$ cat -A /home/hadoop/output/distby/000002_0 
7788^ASCOTT^A20^AANALYST^A3000$
7902^AFORD^A20^AANALYST^A3000$
7566^AJONES^A20^AMANAGER^A2975$
7876^AADAMS^A20^ACLERK^A1100$
7369^ASMITH^A20^ACLERK^A800$
[root@node03 ~]$

可以看到，因為磁區規則是磁區欄位.hashCode % 磁區數，并且磁區欄位deptno的值包括10、20、30，磁區數為2，計算得到的磁區編號都是0，所以設定reduce為2時最后得到的資料都在第一個磁區檔案000000_0中，設定reduce為3時，查詢結果會分布到不同的檔案中，

當distribute by與sort by是同一個欄位時，可使用cluster by簡化語法；
cluster by只能是升序，不能指定排序規則，

使用如下：

hive (mydb)> select * from emp distribute by deptno sort by deptno;
WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
...
Total MapReduce CPU Time Spent: 10 seconds 130 msec
OK
emp.empno       emp.ename       emp.job emp.mgr emp.hiredate    emp.sal emp.comm        emp.deptno
7654    MARTIN  SALESMAN        7698    2011-09-28      1250    1400    30
7900    JAMES   CLERK   7698    2011-12-03      950     NULL    30
7698    BLAKE   MANAGER 7839    2011-05-01      2850    NULL    30
7521    WARD    SALESMAN        7698    2011-02-22      1250    500     30
7844    TURNER  SALESMAN        7698    2011-09-08      1500    0       30
7499    ALLEN   SALESMAN        7698    2011-02-20      1600    300     30
7934    MILLER  CLERK   7782    2012-01-23      1300    NULL    10
7839    KING    PRESIDENT       NULL    2011-11-07      5000    NULL    10
7782    CLARK   MANAGER 7839    2011-06-09      2450    NULL    10
7788    SCOTT   ANALYST 7566    2017-07-13      3000    NULL    20
7566    JONES   MANAGER 7839    2011-04-02      2975    NULL    20
7876    ADAMS   CLERK   7788    2017-07-13      1100    NULL    20
7902    FORD    ANALYST 7566    2011-12-03      3000    NULL    20
7369    SMITH   CLERK   7902    2010-12-17      800     NULL    20
Time taken: 43.018 seconds, Fetched: 14 row(s)
hive (mydb)> select * from emp cluster by deptno;
WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
...
Total MapReduce CPU Time Spent: 8 seconds 650 msec
OK
emp.empno       emp.ename       emp.job emp.mgr emp.hiredate    emp.sal emp.comm        emp.deptno
7654    MARTIN  SALESMAN        7698    2011-09-28      1250    1400    30
7900    JAMES   CLERK   7698    2011-12-03      950     NULL    30
7698    BLAKE   MANAGER 7839    2011-05-01      2850    NULL    30
7521    WARD    SALESMAN        7698    2011-02-22      1250    500     30
7844    TURNER  SALESMAN        7698    2011-09-08      1500    0       30
7499    ALLEN   SALESMAN        7698    2011-02-20      1600    300     30
7934    MILLER  CLERK   7782    2012-01-23      1300    NULL    10
7839    KING    PRESIDENT       NULL    2011-11-07      5000    NULL    10
7782    CLARK   MANAGER 7839    2011-06-09      2450    NULL    10
7788    SCOTT   ANALYST 7566    2017-07-13      3000    NULL    20
7566    JONES   MANAGER 7839    2011-04-02      2975    NULL    20
7876    ADAMS   CLERK   7788    2017-07-13      1100    NULL    20
7902    FORD    ANALYST 7566    2011-12-03      3000    NULL    20
7369    SMITH   CLERK   7902    2010-12-17      800     NULL    20
Time taken: 36.315 seconds, Fetched: 14 row(s)
hive (mydb)>

可以看到，兩種方式效果相同，但是這里沒有實際的意義，

現在對排序總結如下：

order by：執行全域排序，效率低，生產環境中慎用
sort by：使資料區域有序（在reduce內部有序）
distribute by：按照指定的條件將資料分組，常與sort by聯用，使資料區域有序
cluster by：當distribute by與sort by是同一個欄位時，可使用cluster by簡化語法

轉載請註明出處，本文鏈接：https://www.uj5u.com/qita/357063.html

標籤：其他

上一篇：基于Tableau探索分析世界銀行提供的關于科學技術的資料

下一篇：Hadoop 保姆級教程 - 從安裝到實踐

大資料開發基礎入門與專案實戰（三）Hadoop核心及生態圈技術堆疊之4.Hive DDL、DQL和資料操作

文章目錄

1.HQL操作之DDL命令

（1）資料庫操作

（2）建表語法

（3）內部表及外部表

（4）磁區表

（5）分桶表

（6）修改表及洗掉表

5.HQL操作之資料操作

（1）load裝載資料

（2）insert插入資料

6.HQL操作之DQL命令

（1）簡單查詢

（2）簡單子句

（3）group by分組子句

（4）表連接

（5）order by排序子句

（6）sort by排序

（7）distribute by和cluster by排序