Hive Study Notes

I. Official Website and Documentation

Hive official website address

Hive official site

Documentation address

Documentation link

II. Common Hive Interactive Commands

(1) "-e": execute a SQL statement without entering the Hive interactive shell

bin/hive -e "select id from student;"

(2) "-f": execute the SQL statements in a script file

bin/hive -f /opt/module/hive/datas/hivef.sql

(3) Exit the Hive shell

hive(default)>exit; 
hive(default)>quit;

(4) View the HDFS file system from within the Hive CLI

hive(default)>dfs -ls /;

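III. Hive Data Types

(1) Primitive data types

Hive data type	Java data type	Length / description	Example
TINYINT	byte	1-byte signed integer	20
SMALLINT	short	2-byte signed integer	20
INT	int	4-byte signed integer	20
BIGINT	long	8-byte signed integer	20
BOOLEAN	boolean	Boolean type, true or false	TRUE FALSE
FLOAT	float	Single-precision floating point	3.14159
DOUBLE	double	Double-precision floating point	3.14159
STRING	string	Character sequence; a character set can be specified; single or double quotes may be used	'now is the time' "for all good men"
TIMESTAMP		Time type
BINARY		Byte array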

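(2) Collection data types

Data type	Description	Syntax example
STRUCT	Similar to a struct in C: elements are accessed with dot notation. For example, if a column has type STRUCT{first STRING, last STRING}, the first element is referenced as column_name.first.	struct(), e.g. struct<street:string, city:string>
MAP	A collection of key-value pairs whose values are accessed with array notation. For example, if a column has type MAP with the key->value pairs 'first'->'John' and 'last'->'Doe', the last element is retrieved with column_name['last'].	map(), e.g. map<string, int>
ARRAY	A collection of variables of the same type and name, called the elements of the array. Each element has an index, starting from zero. For example, for the array value ['John', 'Doe'], the second element is referenced as array_name[1].	Array(), e.g. array<string>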
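
To make the three access forms in the table concrete, here is a minimal HiveQL sketch. The table person_info, its column names, the delimiters, and the sample key and name values are all assumptions chosen for illustration; none of them appear elsewhere in these notes.

```sql
-- Hypothetical table combining all three collection types.
create table if not exists person_info(
  name     string,
  friends  array<string>,
  children map<string, int>,
  address  struct<street:string, city:string>
)
row format delimited
  fields terminated by ','            -- separates top-level columns
  collection items terminated by '_'  -- separates array items, map entries, struct members
  map keys terminated by ':'          -- separates a map key from its value
  lines terminated by '\n';

-- ARRAY: zero-based index; MAP: lookup by key; STRUCT: dot notation.
select friends[1], children['first_child'], address.city
from person_info
where name = 'some_name';
```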

IV. DDL (Data Definition)

(1) Create a database

// The default storage path for a database on HDFS is /user/hive/warehouse/*.db
CREATE DATABASE [IF NOT EXISTS] database_name [COMMENT database_comment]
[LOCATION hdfs_path]
[WITH DBPROPERTIES (property_name=property_value, ...)];

e.g., create a database and specify where it is stored on HDFS:
create database db_hive2 location '/db_hive2.db';

(2) Query databases

hive> show databases;
// Filter the databases shown
hive> show databases like 'db_hive*';
// Show database details
hive> desc database db_hive;
// Show extended database details (extended)
hive> desc database extended db_hive;

(3) Alter a database

// The ALTER DATABASE command sets key-value pairs in a database's DBPROPERTIES to describe the database.
hive (default)> alter database db_hive
set dbproperties('createtime'='20220130');

(4) Drop a database

// Drop an empty database
hive> drop database db_hive2;
// If the database may not exist, use if exists to check first
hive> drop database if exists db_hive2;
// If the database is not empty, use cascade to force the drop
hive> drop database db_hive cascade;

(5) Create a table

CREATE [EXTERNAL] TABLE [IF NOT EXISTS] table_name
[(col_name data_type [COMMENT col_comment], ...)]
[COMMENT table_comment]
[PARTITIONED BY (col_name data_type [COMMENT col_comment], ...)]
[CLUSTERED BY (col_name, col_name, ...)
[SORTED BY (col_name [ASC|DESC], ...)] INTO num_buckets BUCKETS]
[ROW FORMAT row_format]
[STORED AS file_format]
[LOCATION hdfs_path]
[TBLPROPERTIES (property_name=property_value, ...)]
[AS select_statement]

e.g.:
create table if not exists student(
id int, name string
)
row format delimited fields terminated by '\t'
stored as textfile
location '/user/hive/warehouse/student';

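Field explanations
(1) CREATE TABLE creates a table with the given name. If a table with the same name already exists, an exception is thrown; the IF NOT EXISTS option suppresses that exception.
(2) EXTERNAL creates an external table; a path to the actual data (LOCATION) can be specified when the table is created. When a table is dropped, a managed (internal) table's metadata and data are deleted together, whereas an external table loses only its metadata and its data is kept.
(3) COMMENT: adds comments to the table and its columns.
(4) PARTITIONED BY creates a partitioned table.
(5) CLUSTERED BY creates a bucketed table.
(6) SORTED BY is rarely used; it additionally sorts one or more columns within each bucket.
(7) ROW FORMAT
DELIMITED [FIELDS TERMINATED BY char] [COLLECTION ITEMS TERMINATED BY char] [MAP KEYS TERMINATED BY char] [LINES TERMINATED BY char]|SERDE serde_name [WITH SERDEPROPERTIES (property_name=property_value, property_name=property_value, ...)]
When creating a table, the user can supply a custom SerDe or use the built-in one. If neither ROW FORMAT nor ROW FORMAT DELIMITED is specified, the built-in SerDe is used. The user also specifies the table's columns at creation time, and may specify a custom SerDe along with them; Hive uses the SerDe to determine the data of the table's columns.
SerDe is short for Serialize/Deserialize; Hive uses a SerDe to serialize and deserialize row objects.
(8) STORED AS specifies the storage file format. Common formats: SEQUENCEFILE (binary sequence file), TEXTFILE (text), RCFILE (columnar storage format file).
If the data is plain text, use STORED AS TEXTFILE; if the data needs to be compressed, use STORED AS SEQUENCEFILE.
(9) LOCATION: specifies where the table is stored on HDFS.
(10) AS: followed by a query statement; creates the table from the query result.
(11) LIKE lets the user copy the structure of an existing table without copying its data.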
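
The EXTERNAL, AS and LIKE options in items (2), (10) and (11) can be sketched as follows. This reuses the student table from the example above; the new table names student_ext, student_copy and student_like, and the path /student_ext, are hypothetical.

```sql
-- (2) EXTERNAL: dropping this table removes only the metadata;
--     the files under /student_ext (an assumed path) stay on HDFS.
create external table if not exists student_ext(id int, name string)
row format delimited fields terminated by '\t'
location '/student_ext';

-- (10) AS: create a table from a query result (structure and data).
create table if not exists student_copy as
select id, name from student;

-- (11) LIKE: copy only the table structure, no data.
create table if not exists student_like like student;
```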

(6) Alter a table

// Rename a table
ALTER TABLE table_name RENAME TO new_table_name;
// Add a single partition
hive (default)> alter table dept_partition add partition(day='20200404');
// Add multiple partitions
hive (default)> alter table dept_partition add partition(day='20200405') partition(day='20200406');
// Drop a single partition
hive (default)> alter table dept_partition drop partition (day='20200406');
// Drop multiple partitions at once
hive (default)> alter table dept_partition drop partition (day='20200404'), partition(day='20200405');

(7) Drop a table

hive (default)> drop table dept;

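V. DML (Data Manipulation)

(1) Importing data with load data

hive> load data [local] inpath 'path to the data' [overwrite] into table
student [partition (partcol1=val1,…)];

(1) load data: load data
(2) local: load from the local file system into the Hive table; without it, load from HDFS into the Hive table
(3) inpath: the path of the data to load
(4) overwrite: overwrite the data already in the table; without it, append
(5) into table: which table to load into
(6) student: the specific table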
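
As a concrete instance of the options above, the sketch below loads the same student table twice. It assumes the file /opt/module/datas/student.txt exists locally and that a copy was placed at /student.txt on HDFS; both paths are illustrative.

```sql
-- With local and without overwrite: read a local file and append its rows.
load data local inpath '/opt/module/datas/student.txt' into table student;

-- Without local and with overwrite: take the file from HDFS and replace
-- the table's existing data. Loading from HDFS moves the source file
-- into the table's directory rather than copying it.
load data inpath '/student.txt' overwrite into table student;
```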

(2) Inserting data

// Basic insert mode (from the query result of a single table)
hive (default)> insert overwrite table student_par
select id, name from student where month='201709';
// Multi-table (multi-partition) insert mode (from the query results of multiple tables); student is the source table
hive (default)> from student
insert overwrite table student partition(month='201707') select id, name where month='201709'
insert overwrite table student partition(month='201706') select id, name where month='201709';

(3) Specifying the data path with Location when creating a table

1. Upload the data to HDFS
hive (default)> dfs -mkdir /student;
hive (default)> dfs -put /opt/module/datas/student.txt /student;

2. Create the table and specify its location on HDFS
hive (default)> create external table if not exists student5( id int, name string)
row format delimited fields terminated by '\t' location '/student';

3. Query the data
hive (default)> select * from student5;

(4) Import data into a specified Hive table

hive (default)> import table student2
from '/user/hive/warehouse/export/student';

(5) Exporting data

1. Insert export: write the query result to the local file system
hive (default)> insert overwrite local directory '/opt/module/hive/data/export/student'
select * from student;

2. Write the query result to the local file system with formatting
hive(default)>insert overwrite local directory '/opt/module/hive/data/export/student1'
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
select * from student;

3. Write the query result to HDFS (without local)
hive (default)> insert overwrite directory '/user/atguigu/student2' ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
select * from student;

4. Export to the local file system with a Hadoop command
hive (default)> dfs -get /user/hive/warehouse/student/student.txt
/opt/module/data/export/student3.txt;

5. Export with a Hive shell command
bin/hive -e 'select * from default.student;' > /opt/module/hive/data/export/student4.txt;

6. Export to HDFS with Export
hive (default)> export table default.student to

(6) Clearing the data in a table (Truncate)

// Note: Truncate works only on managed tables; it cannot delete the data in an external table
hive (default)> truncate table student;

VI. Queries

Query syntax from the official documentation

SELECT [ALL | DISTINCT] select_expr, select_expr, ...
  FROM table_reference
  [WHERE where_condition]
  [GROUP BY col_list]
  [ORDER BY col_list]
  [CLUSTER BY col_list
    | [DISTRIBUTE BY col_list] [SORT BY col_list]
  ]
 [LIMIT [offset,] rows]
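
A few queries that exercise the clauses in this grammar, as a rough sketch against the student(id, name) table used earlier in these notes. The filter value and the reducer count are arbitrary choices for illustration.

```sql
-- WHERE, ORDER BY and LIMIT: a global sort, then the first 10 rows.
select id, name
from student
where id > 1000
order by id desc
limit 10;

-- DISTRIBUTE BY decides which reducer receives each row;
-- SORT BY orders rows within each reducer only.
set mapreduce.job.reduces=3;
select id, name
from student
distribute by id
sort by name;

-- CLUSTER BY id is shorthand for DISTRIBUTE BY id SORT BY id.
select id, name
from student
cluster by id;
```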

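VII. Partitioned Tables and Bucketed Tables

(1) Partitioned tables

1. Syntax for creating a partitioned table
hive (default)> create table dept_partition( deptno int, dname string, loc string)
partitioned by (day string)
row format delimited fields terminated by '\t';
Note: a partition column cannot be a column that already exists in the table; think of it as a pseudo-column of the table.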

2. Load data into the partitioned table. Note: when loading data into a partitioned table, the partition must be specified.
hive (default)> load data local inpath
'/opt/module/hive/datas/dept_20200401.log' into table dept_partition partition(day='20200401');
hive (default)> load data local inpath
'/opt/module/hive/datas/dept_20200403.log' into table dept_partition
partition(day='20200403');

3. Query data in the partitioned table (single-partition query)
hive (default)> select * from dept_partition where day='20200401';

4. Add partitions
hive (default)> alter table dept_partition add partition(day='20200404');
hive (default)> alter table dept_partition add partition(day='20200405') partition(day='20200406');

5. Drop partitions
hive (default)> alter table dept_partition drop partition (day='20200406');
hive (default)> alter table dept_partition drop partition (day='20200404'), partition(day='20200405');

6. List the partitions of a partitioned table
hive> show partitions dept_partition;

7. View the structure of a partitioned table
hive> desc formatted dept_partition;

(2) Two-level partitions

// Create a two-level partitioned table
hive (default)> create table dept_partition2( deptno int, dname string, loc string)
partitioned by (day string, hour string)
row format delimited fields terminated by '\t';

// Load data into the two-level partitioned table
hive (default)> load data local inpath '/opt/module/hive/datas/dept_20200401.log' into table dept_partition2 partition(day='20200401', hour='12');

// Query the partition's data
hive (default)> select * from dept_partition2 where day='20200401' and hour='12';

(3) Three ways to associate a partitioned table with data uploaded directly into a partition directory

Method 1: upload the data, then repair
Upload the data
hive (default)> dfs -mkdir -p
/user/hive/warehouse/mydb.db/dept_partition2/day=20200401/hour=13;
hive (default)> dfs -put /opt/module/datas/dept_20200401.log
/user/hive/warehouse/mydb.db/dept_partition2/day=20200401/hour=13;
Query the data (the newly uploaded data is not found)
hive (default)> select * from dept_partition2 where day='20200401' and hour='13';
Run the repair command
hive> msck repair table dept_partition2;


Method 2: upload the data, then add the partition
Upload the data
hive (default)> dfs -mkdir -p
/user/hive/warehouse/mydb.db/dept_partition2/day=20200401/hour=14;
hive (default)> dfs -put /opt/module/hive/datas/dept_20200401.log
/user/hive/warehouse/mydb.db/dept_partition2/day=20200401/hour=14;
Add the partition
hive (default)> alter table dept_partition2 add partition(day='20200401',hour='14');
Query the data
hive (default)> select * from dept_partition2 where day='20200401' and hour='14';


Method 3: create the directory, then load the data into the partition
Create the directory
hive (default)> dfs -mkdir -p
/user/hive/warehouse/mydb.db/dept_partition2/day=20200401/hour=15;
Load the data
hive (default)> load data local inpath '/opt/module/hive/datas/dept_20200401.log' into table dept_partition2 partition(day='20200401',hour='15');
Query the data
hive (default)> select * from dept_partition2 where day='20200401' and hour='15';

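(4) Bucketed tables

// Create a bucketed table
create table stu_buck(id int, name string) clustered by(id)
into 4 buckets
row format delimited fields terminated by '\t';

// View the table structure
hive (default)> desc formatted stu_buck;
Num Buckets:	4

// Import data into the bucketed table using load
hive (default)> load data inpath '/student.txt' into table stu_buck;

Points to note when working with bucketed tables:
(1) Set the number of reducers to -1 so that the job decides how many reducers to use, or set it to a value greater than or equal to the number of buckets in the table.
(2) Load data into the bucketed table from HDFS, to avoid problems with local files not being found.
(3) Do not use local mode.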
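
Note (1) can be sketched as below. This populates stu_buck with an insert rather than the load shown above, and assumes the student(id, name) table from section IV holds the source rows.

```sql
-- Let the job choose the number of reducers (note 1).
set mapreduce.job.reduces=-1;

-- Populate the bucketed table from a query; rows are assigned to the
-- 4 buckets according to the clustered-by column id.
insert into table stu_buck
select id, name from student;

-- Check the result.
select * from stu_buck;
```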

VIII. Functions

(1) List the built-in functions
hive> show functions;
(2) Show how a built-in function is used
hive> desc function upper;
(3) Show detailed usage of a built-in function
hive> desc function extended upper;
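
Once a function has been looked up, it is used like any other expression in a query. A minimal sketch, again assuming the student(id, name) table from earlier; the alias name_upper is arbitrary.

```sql
-- Apply the built-in upper() function to each row's name.
select id, upper(name) as name_upper
from student;
```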

Please credit the source when reposting. Link to this article: https://www.uj5u.com/qita/423404.html
