文章目錄

一、Hive安裝
- 1、解壓環境
- 2、環境變數配置
- 3、組態檔資訊
- - 1.打開編輯檔案
  - 2.輸入以下內容
- 4、拷貝mysql驅動
- 5、更新guava包和hadoop一致
- 6、mysql授權
- 7、初始化
- 8、hive啟動模式
- 9、Hadoop的core-site.xml配置
二、Hive
- 1、Hive的檔案結構
- 2、MySQL上Hive的元資料
- 3、hadoop檔案授權
- 4、Hive的驅動器：Driver
- 5、抽象語法樹（AST）
- 6、動態磁區
三、Hive的資料型別
- 1、Hive基本資料型別
- 2、非常規資料型別
- - 1.陣列型別
  - 2.圖型別
  - 3.結構體型別
  - 4.集合型別
四、Hive的DDL
- 1、資料庫的基本操作
- - 1.創建資料庫
  - 2.顯示資料庫資訊
  - 3.顯示資料庫詳細資訊
  - 4.切換資料庫
  - 5.修改資料庫
  - 6.洗掉資料庫
- 2、創建表
- - 1.建表陳述句
  - 1.建表陳述句
  - 2.創建簡單表
  - 3.創建外部表
  - 4.創建含有特殊資料結構的內部表
  - 5.從已有表復制新表
  - 6.復制表結構
  - 7.指定路徑創建資料表
  - 8.創建磁區表
  - 9.創建分桶表
- 3、查看表
- - 1.查看資料表
  - 2.查看資料表的詳細資訊
  - 3.查看表磁區
  - 4.查看詳細建表陳述句
- 4、修改表
- - 1.重命名表
  - 2.修改欄位定義
  - 3.修改磁區資訊
  - 4.修改磁區
  - 5.洗掉表磁區
  - 6.更新列
  - 7.增加和替換列
- 5、洗掉表
- - 1.洗掉表
  - 2.清空表
五、Hive的DML
- 1、資料匯入方式
- 2、資料匯入
- - 1.從本地磁盤或者HDFS匯入資料
  - 2.將資料掛載在資料表中
  - 3.通過insert插入
  - 4.建表時候用select as將資料掛載
  - 5.建表的時候用load進行資料掛載
  - 6.向分桶表插入資料
- 2、資料匯出
- - 1.使用insert將資料匯出
  - 2.帶格式匯出
  - 3.bash命令匯出
  - 4.將資料表匯出到HDFS
  - 5.從匯出結果匯入到Hive
- 3、資料洗掉
六、Hive的DQL
- 1、基本查詢陳述句
- - 1.全表查詢
  - 2.查詢某些列
  - 3.給查詢的特征列起別名
  - 4.查詢進行運算
  - 5.函式
- 2、條件過濾
- - 1.資料量限制
  - 2.where條件篩選
  - 3.like欄位查詢
  - 4.與或非判斷
- 3、分組
- - 1.group by 分組
  - 2.分組過濾
- 4、連接
- - 1.內關聯（JOIN）
  - 2.左外關聯（LEFT [OUTER] JOIN）
  - 3.右外關聯（RIGHT [OUTER] JOIN）
  - 4.全外關聯（FULL [OUTER] JOIN）
  - 5.LEFT SEMI JOIN
  - 6.笛卡爾積關聯（CROSS JOIN）
- 5、排序
- - 1.單欄位排序
  - 2.多欄位排序
  - 3.區域排序
  - 4.指定區域排序的磁區欄位
  - 5.cluster
- 6、分桶
- - 1.創建分桶表
  - 2.插入資料
七、hive函式
- 1、日期函式
- 2、數學函式
- 3、字符函式
- 4、聚合函式
- 5、集合函式
- 6、條件函式
- 7、表生成函式
- 8、型別轉換函式
- 9、視窗函式
- - 1.row_number() over()
  - 2.sum() over()
  - 3.lag/lead() over()
  - 4.hive的自增列
八、with語法
九、多維分組聚合
- 1、grouping sets函式
- - 1.單欄位
  - 2.全欄位排序，按照某個欄位聚合
  - 3.多維度聚合
  - 4.全維度聚合
- 2、roll up函式
- 3、with cube函式
十、格式化創建動態表
- 1、CSV檔案
- 2、json檔案
十一、增量表、全量表和拉鏈表
- 1、增量表
- 2、全量表
- 3、拉鏈表
十二、數倉分層
- 1、源資料層（ODS）
- 2、明細粒度事實層（DWD）
- 3、資料中間層（DWM）
- 4、公共匯總粒度事實層（DWS）
- 5、公共維度層（DIM）
十三、數倉模型
- 1、星型模型
- 2、雪花模型
- 3、星座模型

一、Hive安裝

1、解壓環境

#切換到指定檔案夾
cd /opt/download
#解壓
tar -zxvf apache-hive-3.1.2-bin.tar.gz -C /opt/software/
#重命名
mv /opt/softwareapache-hive-3.1.2-bin/ /opt/softwarehive312
#切換到hive檔案夾中
cd /opt/software/hive312

2、環境變數配置

#編輯檔案并輸入配置資訊
vim /etc/profile.d/my.sh
#-------------------------------------
# hive
export HIVE_HOME=/opt/software/hive312
export PATH=$PATH:$HIVE_HOME/bin
#-------------------------------------
#生效配置資訊
source /etc/profile

3、組態檔資訊

1.打開編輯檔案

#檔案重命名
mv conf/hive-default.xml.template  conf/hive-default.xml
#創建并編輯hive-site.xml資訊
vim conf/hive-site.xml

2.輸入以下內容

#-----------------------------------------
<configuration>
    <!--hdfs倉庫路徑-->
    <property>
        <name>hive.metastore.warehouse.dir</name>
        <value>/hive312/warehouse</value>
    </property>
    <!--metastore(元)資料庫型別-->
    <property>
        <name>hive.metastore.db.type</name>
        <value>mysql</value>
        <description>Expects one of [derby, oracle, mysql, mssql, postgres].</description>
    </property>
    <!--連接mysql字串-->
    <property>
        <name>javax.jdo.option.ConnectionURL</name>mysql
        <value>jdbc:mysql://192.168.71.128:3306/hive312?createDatabaseIfNotExist=true</value>
    </property>
    <!--mysql連接驅動-->
    <property>
        <name>javax.jdo.option.ConnectionDriverName</name>
        <value>com.mysql.jdbc.Driver</value>
    </property>
    <!--mysql連接賬號-->
    <property>
        <name>javax.jdo.option.ConnectionUserName</name>
        <value>root</value>
    </property>
    <!--mysql本地連接密碼-->
    <property>
        <name>javax.jdo.option.ConnectionPassword</name>
        <value>kb16</value>
    </property>
    <!--關閉schema驗證-->
    <property>
        <name>hive.metastore.schema.verification</name>
        <value>false</value>
    </property>
    <!--提示當前庫名-->
    <property>
        <name>hive.cli.print.current.db</name>
        <value>true</value>
        <description>Whether to include the current database in the Hive prompt.</description>
    </property>
    <!--查詢輸出顯示列名-->
    <property>
        <name>hive.cli.print.header</name>
        <value>true</value>
        <description>Whether to print the names of the columns in query output.</description>
    </property>
    <!--server2對外開放的埠號-->
    <property>
        <name>hive.server2.thrift.port</name>
        <value>10000</value>
    </property>
    <property>
        <name>hive.server2.thrift.bind.host</name>
        <value>localhost</value>
    </property>
</configuration>
#-----------------------------------------

4、拷貝mysql驅動

#將JDBC的JAR包拷貝到hive的lib檔案夾中，這樣hive就可以操作mysql資料庫
cp /opt/download/mysql-connector-java-5.1.47.jar lib/

5、更新guava包和hadoop一致

#列出hive的guava的jar包
ls lib/|grep guava
#--------------
# guava-19.0.jar
#---------------
#洗掉當前hive的guava包
rm -f lib/guava-19.0.jar
#查看hadoop使用guava包的版本
find /opt/software/hadoop313/ -name guava*
#----------------------------------------------------------------
/opt/software/hadoop313/share/hadoop/common/lib/guava-27.0-jre.jar
/opt/software/hadoop313/share/hadoop/hdfs/lib/guava-27.0-jre.jar
#----------------------------------------------------------------
#將hadoop的guava包拷貝到本地中
cp /opt/software/hadoop313/share/hadoop/hdfs/lib/guava-27.0-jre.jar lib/

6、mysql授權

#mysql資料庫對外授權
grant all on *.* to root@master01 identified by 'kb16';
#重繪權限
flush privileges;

7、初始化

#hive初始化，并在mysql中建立一個資料庫，該資料庫用于存盤元資料
schematool -dbType mysql -initSchema

8、hive啟動模式

#首先啟動元資料服務
nohup hive --service metastore 1>/dev/null 2>&1 &
#1、方法一 hive客戶端
hive
#2、方法二 基于metastore和hiveserver2的beeline
#啟動hiveserver2服務
nohup hive --service hiveserver2 1>/dev/null 2>&1 &
#登錄的時候，一定重啟MySQL服務
beeline -u jdbc:hive2://localhost:10000 -n uername
 
beeline -u jdbc:hive2://master01:10000 -n root
beeline -u jdbc:hive2://single01:10000 -n root
!connect jdbc:hive2://localhost:10000/default

9、Hadoop的core-site.xml配置

<!---hadoop對外開發的用戶和用戶組-->
<property>
	<name>hadoop.proxyuser.root.hosts</name>
	<value>*</value>
</property>
<property>
	<name>hadoop.proxyuser.root.groups</name>
	<value>*</value>
</property>

二、Hive

Hive是一套根據客戶需求，集合各種大資料組件工具，對客戶資料進行管理、處理，治理方案

1、Hive的檔案結構

1、bin： 主要存放hive運行的可執行檔案

2、lib： 主要存放hive運行的jar包

wget --no-check-certificate --no-cookies --header "Cookies: oraclelicense=accept-securebackup-cookies" https://archive.apache.org/dist/hadoop/core/hadoop-3.1.2/hadoop-3.1.2.tar.gz

2、MySQL上Hive的元資料

show tables;
+-------------------------------+
| Tables_in_hive312             |
+-------------------------------+
| AUX_TABLE                     |
| BUCKETING_COLS                |
| CDS                           |
| COLUMNS_V2                    |
| COMPACTION_QUEUE              |
| COMPLETED_COMPACTIONS         |
| COMPLETED_TXN_COMPONENTS      |
| CTLGS                         |
| DATABASE_PARAMS               |
| DBS                           |
| DB_PRIVS                      |
| DELEGATION_TOKENS             |
| FUNCS                         |
| FUNC_RU                       |
| GLOBAL_PRIVS                  |
| HIVE_LOCKS                    |
| IDXS                          |
| INDEX_PARAMS                  |
| I_SCHEMA                      |
| KEY_CONSTRAINTS               |
| MASTER_KEYS                   |
| MATERIALIZATION_REBUILD_LOCKS |
| METASTORE_DB_PROPERTIES       |
| MIN_HISTORY_LEVEL             |
| MV_CREATION_METADATA          |
| MV_TABLES_USED                |
| NEXT_COMPACTION_QUEUE_ID      |
| NEXT_LOCK_ID                  |
| NEXT_TXN_ID                   |
| NEXT_WRITE_ID                 |
| NOTIFICATION_LOG              |
| NOTIFICATION_SEQUENCE         |
| NUCLEUS_TABLES                |
| PARTITIONS                    |
| PARTITION_EVENTS              |
| PARTITION_KEYS                |
| PARTITION_KEY_VALS            |
| PARTITION_PARAMS              |
| PART_COL_PRIVS                |
| PART_COL_STATS                |
| PART_PRIVS                    |
| REPL_TXN_MAP                  |
| ROLES                         |
| ROLE_MAP                      |
| RUNTIME_STATS                 |
| SCHEMA_VERSION                |
| SDS                           |
| SD_PARAMS                     |
| SEQUENCE_TABLE                |
| SERDES                        |
| SERDE_PARAMS                  |
| SKEWED_COL_NAMES              |
| SKEWED_COL_VALUE_LOC_MAP      |
| SKEWED_STRING_LIST            |
| SKEWED_STRING_LIST_VALUES     |
| SKEWED_VALUES                 |
| SORT_COLS                     |
| TABLE_PARAMS                  |
| TAB_COL_STATS                 |
| TBLS                          |
| TBL_COL_PRIVS                 |
| TBL_PRIVS                     |
| TXNS                          |
| TXN_COMPONENTS                |
| TXN_TO_WRITE_ID               |
| TYPES                         |
| TYPE_FIELDS                   |
| VERSION                       |
| WM_MAPPING                    |
| WM_POOL                       |
| WM_POOL_TO_TRIGGER            |
| WM_RESOURCEPLAN               |
| WM_TRIGGER                    |
| WRITE_SET                     |
+-------------------------------+

3、hadoop檔案授權

#修改hdfs的hive檔案的所有者和用戶組
hdfs dfs -chown -R root:supergroup /hive312
#修改快取文夾件的權限
hdfs dfs -chmod -R 777 /tmp

4、Hive的驅動器：Driver

（1）決議器（SQL Parser）：將SQL字串轉換成抽象語法樹AST，這一步一般都用第三方工具庫完
成，比如antlr；對AST進行語法分析，比如表是否存在、欄位是否存在、SQL語意是否有誤，
（2）編譯器（Physical Plan）：將AST編譯生成邏輯執行計劃，
（3）優化器（Query Optimizer）：對邏輯執行計劃進行優化，
（4）執行器（Execution）：把邏輯執行計劃轉換成可以運行的物理計劃，對于Hive來說，就是MR/Spark，

5、抽象語法樹（AST）

Antlr定義SQL的語法規則，完成SQL詞法，語法決議，將SQL轉化為抽象語法樹AST Tree
遍歷AST Tree，抽象出查詢的基本組成單元QueryBlock
遍歷QueryBlock，翻譯為執行操作樹OperatorTree
邏輯層優化器進行OperatorTree變換，合并不必要的ReduceSinkOperator，減少shuffle資料量
遍歷OperatorTree，翻譯為MapReduce任務
物理層優化器進行MapReduce任務的變換，生成最終的執行計劃

6、動態磁區

#查看hive的動態磁區的狀態
set hive.exec.dynamic.partition;
#修改hive的動態磁區狀態
set hive.exec.dynamic.partition.mode=nonstrict;
#向磁區表中插入資料
insert overwrite into table kb16.user_movie_rating_par partition(dt) select userid,moviedid,rating,`timestamp`,data_format(from_unixtime(`timestamp`),'yyyy-MM') dt from user_moive_rating; order by `timestamp` desc limit 10;

三、Hive的資料型別

1、Hive基本資料型別

欄位	資料型別	和Java對應的資料型別
短短整型	TINYINT	byte
短整型	SMALLINT	short
整型	INT	int
長整型	BIGINT	long
布爾型	BOOLEAN	boolean
浮點型	FLOAT	float
雙精度	DOUBLE	double
雙精度	DOUBLE PRECISION	double
字符	STRING	String
位元	BINARY	bit
時間戳	TIMESTAMP	date
大資料	DECMAL	BigDecimal
大資料	DECIMAL(precision，scala)	BigDecimal
時間	DATE	date
不定長字符	VARCHAR	String
字符型別	CHAR	String

2、非常規資料型別

1.陣列型別

#Hive陣列型別定義，data_type為上面的基本資料型別
ARRAY<data_type>
#陣列型別定義
stuhobby array<string>

2.圖型別

#圖型別定義,primitivez_type代表鍵型別，data_type為值型別
MAP<primitivez_type,data_type>
#定義圖型別欄位
stuscore map<string,int>

3.結構體型別

#定義自定義結構體欄位
STRUCT<col_name:data_type[COMMENT col_comment],...>
#定義自定義的欄位
stuaddress struct<province:string,city:string,district:string>)

4.集合型別

UNIONTYPE<data_type,data_type,...>

四、Hive的DDL

1、資料庫的基本操作

1.創建資料庫

CREATE DATABASE [IF NOT EXISTS] database_name 
#關于資料塊的描述
[COMMENT database_comment]
#指定資料庫在HDFS上存盤位置
[LOCATION hdfs_path] 
#指定資料塊屬性
[WITH DBPROPERTIES (property_name=value, ...)];

2.顯示資料庫資訊

#顯示資料庫的詳細資訊
desc database db_hive;

3.顯示資料庫詳細資訊

#描述資料庫詳細資訊
desc database extended db_hive;

4.切換資料庫

#切換資料庫
use db_hive;

5.修改資料庫

#修改資料庫屬性
alter database db_hive set dbproperties('欄位'='value');

6.洗掉資料庫

#if exits判斷是否存在，cascade可以強制洗掉
drop database[ if exits] db_hive[cascade];

2、創建表

1.建表陳述句

#EXTERNAL表示創建外部表，TEMPORARY表示創建內部表，創建時建議庫名.表名
CREATE [TEMPORARY] [EXTERNAL] TABLE [IF NOT EXISTS] [db_name.]table_name  
  [(col_name data_type [column_constraint_specification] [COMMENT col_comment], ... [constraint_specification])]
  [COMMENT table_comment]
  #按照什么欄位進行磁區
  [PARTITIONED BY (col_name data_type [COMMENT col_comment], ...)]
  #CLUSTERED BY按照什么欄位進行分桶，SORTED BY按照什么欄位進行排序
  [CLUSTERED BY (col_name, col_name, ...) [SORTED BY (col_name [ASC|DESC], ...)] INTO num_buckets BUCKETS]
  [SKEWED BY (col_name, col_name, ...)                  -
     ON ((col_value, col_value, ...), (col_value, col_value, ...), ...)
     [STORED AS DIRECTORIES]
  [
   [ROW FORMAT row_format] 
   [STORED AS file_format]
     | STORED BY 'storage.handler.class.name' [WITH SERDEPROPERTIES (...)]  -- (Note: Available in Hive 0.6.0 and later)
  ]
  [LOCATION hdfs_path]
  [TBLPROPERTIES (property_name=property_value, ...)]   -- (Note: Available in Hive 0.6.0 and later)
  [AS select_statement];   -- (Note: Available in Hive 0.5.0 and later; not supported for external tables)
 

 
data_type
  : primitive_type
  | array_type
  | map_type
  | struct_type
  | union_type  -- (Note: Available in Hive 0.7.0 and later)
 
primitive_type
  : TINYINT
  | SMALLINT
  | INT
  | BIGINT
  | BOOLEAN
  | FLOAT
  | DOUBLE
  | DOUBLE PRECISION -- (Note: Available in Hive 2.2.0 and later)
  | STRING
  | BINARY      -- (Note: Available in Hive 0.8.0 and later)
  | TIMESTAMP   -- (Note: Available in Hive 0.8.0 and later)
  | DECIMAL     -- (Note: Available in Hive 0.11.0 and later)
  | DECIMAL(precision, scale)  -- (Note: Available in Hive 0.13.0 and later)
  | DATE        -- (Note: Available in Hive 0.12.0 and later)
  | VARCHAR     -- (Note: Available in Hive 0.12.0 and later)
  | CHAR        -- (Note: Available in Hive 0.13.0 and later)
 
array_type
  : ARRAY < data_type >
 
map_type
  : MAP < primitive_type, data_type >
 
struct_type
  : STRUCT < col_name : data_type [COMMENT col_comment], ...>
 
union_type
   : UNIONTYPE < data_type, data_type, ... >  -- (Note: Available in Hive 0.7.0 and later)
 
row_format
  : DELIMITED [FIELDS TERMINATED BY char [ESCAPED BY char]] [COLLECTION ITEMS TERMINATED BY char]
        [MAP KEYS TERMINATED BY char] [LINES TERMINATED BY char]
        [NULL DEFINED AS char]   -- (Note: Available in Hive 0.13 and later)
  | SERDE serde_name [WITH SERDEPROPERTIES (property_name=property_value, property_name=property_value, ...)]
 
file_format:
  : SEQUENCEFILE #序列化檔案
  | TEXTFILE    -- (Default, depending on hive.default.fileformat configuration)
  | RCFILE      -- (Note: Available in Hive 0.6.0 and later)
  | ORC         -- (Note: Available in Hive 0.11.0 and later)
  | PARQUET     -- (Note: Available in Hive 0.13.0 and later)
  | AVRO        -- (Note: Available in Hive 0.14.0 and later)
  | JSONFILE    -- (Note: Available in Hive 4.0.0 and later)
  | INPUTFORMAT input_format_classname OUTPUTFORMAT output_format_classname
 
column_constraint_specification:
  : [ PRIMARY KEY|UNIQUE|NOT NULL|DEFAULT [default_value]|CHECK  [check_expression] ENABLE|DISABLE NOVALIDATE RELY/NORELY ]
 
default_value:
  : [ LITERAL|CURRENT_USER()|CURRENT_DATE()|CURRENT_TIMESTAMP()|NULL ] 
 
constraint_specification:
  : [, PRIMARY KEY (col_name, ...) DISABLE NOVALIDATE RELY/NORELY ]
    [, PRIMARY KEY (col_name, ...) DISABLE NOVALIDATE RELY/NORELY ]
    [, CONSTRAINT constraint_name FOREIGN KEY (col_name, ...) REFERENCES table_name(col_name, ...) DISABLE NOVALIDATE 
    [, CONSTRAINT constraint_name UNIQUE (col_name, ...) DISABLE NOVALIDATE RELY/NORELY ]
    [, CONSTRAINT constraint_name CHECK [check_expression] ENABLE|DISABLE NOVALIDATE RELY/NORELY ]

1.建表陳述句

create [temporary][external] table [if not exits] [dbname.]tabname
[(
	colname data_type [comment col_comment],
    ...,
    [constraint_specification]
)]
[comment table_comment]
#創建磁區表
[partitioned by (extrenal_colname data_type [comment col_comment],...)]
#創建分桶表
[clustered by (colname,...)[sorted by (colname ASC|DESC,...)] into num_buckets buckets]
[skewed by (colname,colname,...) on (colvalue,...),(colvalue,...),...][stored as directories]
[rowformat row_format]
[stored as file_format| stored by 'storge.handler.classname'[with serdeproperrties]]
[LOCATION hdfs_path]
[TBL PROPERTIES (property_name=value, ...)]
[AS select_statement]

**temporary：**建立內部表，**external：**建立外部表
if not exits： 判斷表名是否已經存在，如果存在就拋出例外
comment: 為表和列添加注釋
partitioned by： 創建磁區表
clustered by： 創建分桶表
sorted by： 對桶中的一個或者多個列進行排序
rowformat：
stored as： 指定檔案存盤型別，常用的存盤檔案型別：SEQUENCEFILE（二進制序列檔案）、 TEXTFILE（文本）、 RCFILE（列式存盤檔案）
location： 指定表在HDFS上存盤位置
AS： 根據查詢結果創建表
like： 允許用戶復制表結構，資料不能復制

2.創建簡單表

#創建簡單表
create table kb16.student(stuname string,stuage int,stugender string);
#插入資料
insert into kb16.student(stuname,stuage,stugender)values('tom',18,'f'),('jack',16,'m'),('jhon',18,'f');

3.創建外部表

#創建外部表
create external table kb16.student_ext(
stuname string,
stuage int,
stugender string,
#定義陣列欄位
stuhobby array<string>,
#定義圖欄位
stuscore map<string,int>,
#定義結構體欄位
stuaddress struct<province:string,city:string,district:string>
)
row format delimited
#欄位用,進行分割
fields terminated by ','
#陣列用|進行分割
collection items terminated by '|'
#圖
map keys terminated by ':'
#每行按斬訓行進行分割
lines terminated by '\n'
#按照普通文本進行存盤
stored as textfile
#指定檔案存盤路徑
location '/test/hive/student_ext';

4.創建含有特殊資料結構的內部表

select
stuname,stuage,stugender,
stuhobby[0] hobby1,stuhobby[1] hobby2,stuhobby[2] hobby3,
stuscore['java'] javascore,stuscore['mysql'] mysqlscore,
stuaddress.province province,stuaddress.city city,stuaddress.district district
from kb16.student_ext;
where array_contains(student,'eat')
#資料------------------------------------------------
henry,10,f,sing|dance|read,java:88|mysql:67,ah|hf|fx
pola,16,m,sing|eat|read,java:76|mysql:85,ah|la|sc
ariel,8,m,caton|pizzle|read,java:90|mysql:80,ah|hf|fx
#----------------------------------------------------

5.從已有表復制新表

#將一個表的查詢結果創建為一個新表
create table kb16.student_cpoy 
as
select * from kb16.student_ext;

6.復制表結構

#復制現有表結構
create table kb16,student_like like kb16.student_ext;

7.指定路徑創建資料表

#用一個掛載在HDFS上檔案創建新的資料表7
create external table kb16.user_movie_rating(
userid bigint,
movieid bigint,
rating decimal(2,1),
`timestamp` bigint
)
row format delimited
fields terminated by ','
location '/test/kb16/hive/moive_rating/'
#跳過檔案第一行
tblproperties("skip.header.line,count"="1");

8.創建磁區表

#創建磁區表，向磁區表插入資料時，需要增加磁區欄位
create external table kb16.user_movie_rating_par(
userid bigint,
movieid bigint,
rating decimal(2,1),
`timestamp` bigint
)
partitioned by (dt string)
row format delimited
fields terminated by ',';

9.創建分桶表

#創建磁區分桶表
create external table user_movie_ratin_par_bucket(
userid bigint,
movieid bigint,
rating decimal(2,1),
`timestamp` bigint,
)
partition by (years int)
clustered by (`timestamp`) sorted by (`timestamp` ASC) into 5 buckets
row format delimited fields terminated by ',';

3、查看表

1.查看資料表

#查看當前資料庫中有哪些資料表
show tables;
#查看非當前資料的資料表
show tables in HIVE_DATABASE;
#查看資料庫以xxx開頭的表
show tables like 'xxx*';

2.查看資料表的詳細資訊

#查看表資訊
desc table 表名;
#查看表資訊
desc extened 表名;
#查看表的詳細資訊
desc formatted 表名;

3.查看表磁區

#查看磁區資訊
show partitions 表名;

4.查看詳細建表陳述句

#查看詳細建表陳述句
show create table 表名;

4、修改表

1.重命名表

#語法結構
alter table table_name rename to new_table_name;
#例程
alter table student rename to stu;

2.修改欄位定義

#增加一個欄位
alter table 表名 add columns (欄位名  欄位型別);
#修改一個欄位的定義
alter table 表名 change name 欄位名 欄位型別;
#替換所有欄位
alter table 表名 replace columns (欄位1 欄位型別1,欄位2 欄位型別2 ...)

3.修改磁區資訊

#添加磁區
alter table 表名 add partition(欄位="值");
#添加多個磁區
alter table 表名 add partition(磁區欄位="值1") partition(磁區欄位="值2");
#動態磁區
load data local inpath "檔案的HDFS路徑" into table 表名 partition(欄位="值");

4.修改磁區

#添加磁區的時候，直接指定當前磁區的資料存盤目錄
alter table 表名 add if not exists partition(欄位="值") location 'HDFS地址' partition(欄位="值") location 'HDFS地址';
#修改已經指定好的磁區的資料存盤目錄
alter table student_ptn partition (city='beijing') set location '/student_ptn_beijing';

5.洗掉表磁區

#洗掉表磁區
alter table 表名 drop partition (欄位="值");

6.更新列

#語法結構
alter table table_name change [column] col_old_name col_new_name
column_type [comment col_comment] [first|after column_name]

7.增加和替換列

#語法結構
alter table table_name add|replace columns(col_name data_type[comment col_comment],...)

ADD是代表新增一欄位，欄位位置在所有列后面 (partition列前 )
REPLACE則是表示替換中所有欄位，

5、洗掉表

1.洗掉表

#語法結構
drop table table_name;
#例程
drop table stu;

2.清空表

#清空表
truncate table 表名;

五、Hive的DML

1、資料匯入方式

資料匯入方式可以分為以下幾種：

通過hive，使用insert方式插入
使用HDFS的put命令，將檔案直接寫入到hive指定表檔案夾下的方式
使用hive提供的load命令，將資料匯入

2、資料匯入

1.從本地磁盤或者HDFS匯入資料

#從本地磁盤或者HDFS匯入資料
load data [local] inpath '/opt/module/datas/student.txt' [overwrite] into table student [partition (partcol1=val1,…)];
#例程
load data local inpath '/opt/module/datas/student.txt' overwrite into table student;

2.將資料掛載在資料表中

#將資料掛載在資料表中
load data inpath '/xxx/student.txt' overwrite into table student;

3.通過insert插入

#通過insert指令將資料掛載
insert into table student select id, name from stu_par where class="01";

4.建表時候用select as將資料掛載

#在創建表的時候將資料掛載，該掛載方式有局限性，不建議使用
#將select查詢的結果創建一個資料表
create TAB_NAME as select xxx;

5.建表的時候用load進行資料掛載

#建表的時候用load進行資料掛載
#資料在資料表創建前已經上傳至HDFS上
create external table student2
(id int, name string)
row format delimited 
fields terminated by '\t'
location '/xxx';

6.向分桶表插入資料

insert overwrite table user_movie_ratin_par_bucket partition(years)
select *,pmod(cast(date_format(from_unixtime(U.`timestamp`),'yyyy') as int),5) years
from user_movie_rating limit 10;

2、資料匯出

1.使用insert將資料匯出

#使用insert將資料匯出
#該方法將資料表的查詢結果匯出為檔案
insert overwrite local directory '/opt/module/datas/export/student'
select * from student;

2.帶格式匯出

#將hive表的資料進行格式化匯出
insert overwrite local directory '/opt/module/datas/export/student1'
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
select * from student;

3.bash命令匯出

#bash命令匯出
#利用hive -e執行hive的查詢陳述句，利用重定向方式將資料匯出
hive -e "select * from student" >/root/student.txt

4.將資料表匯出到HDFS

#整張表export到HDFS
export table student to '/export/student';

5.從匯出結果匯入到Hive

#從匯出結果匯入到Hive
import table student3 from '/export/student';

3、資料洗掉

#只刪表資料，不刪表本身
truncate table student;
#徹底洗掉資料
drop table student;

六、Hive的DQL

SELECT [ALL | DISTINCT] select_expr, select_expr, ...
FROM table_reference
[WHERE where_condition]
[GROUP BY col_list]
[ORDER BY order_condition]
[DISTRIBUTE BY distribute_condition [SORT BY sort_condition] ]
[LIMIT number]

1、基本查詢陳述句

1.全表查詢

#全表查詢
select * from emp;

2.查詢某些列

#查詢某些列
select empno, ename from emp;

3.給查詢的特征列起別名

#起別名
select ename as name from emp;
#as可以省略
select ename name from emp;

4.查詢進行運算

#運算子
select ename, sal + 10 from emp;

5.函式

#UDF函式
select substring(ename, 1, 1) from emp;
#UDAF函式
select count(*) from emp;

2、條件過濾

1.資料量限制

#limit，取前幾行
select * from emp limit 5;

2.where條件篩選

#查詢工資大于1000的人
select * from emp where sal > 1000;

3.like欄位查詢

#通配符字串匹配 % _ ，以A開頭的員工
select * from emp where ename like "A%";
/*正則入門
一般字符匹配自己
^ 匹配一行開頭 ^R 以R開頭
$ 匹配一行結束 R$ 以R結尾
. 匹配任意字符 ^.$ 一行只有一個字符
* 前一個子式匹配零次或多次
[] 匹配一個范圍內的任意字符
\ 轉義
*/

4.與或非判斷

#與(and)或非(or)
select * from emp where empno = 30 and sal > 1000;

3、分組

1.group by 分組

#計算emp表每個部門的平均工資
select deptno, avg(sal) aa from emp group by deptno;

2.分組過濾

#分組過濾
#計算部門平均工資大于2000的部門
select deptno, avg(sal) aa from emp group by deptno having aa>2000;

4、連接

1.內關聯（JOIN）

只回傳能關聯上的結果，

#內連接
SELECT column1,column2,...
FROM table_a a
join table_b b
ON (a.column = b.column);

2.左外關聯（LEFT [OUTER] JOIN）

以LEFT [OUTER] JOIN關鍵字前面的表作為主表，和其他表進行關聯，回傳記錄和主表的記錄數一致，關聯不上的欄位置為NULL，

#左外關聯
SELECT column1,column2,...
FROM table_a a
left join table_b b
ON (a.column = b.column);

3.右外關聯（RIGHT [OUTER] JOIN）

和左外關聯相反，以RIGTH [OUTER] JOIN關鍵詞后面的表作為主表，和前面的表做關聯，回傳記錄數和主表一致，關聯不上的欄位為NULL，

#右外關聯
SELECT column1,column2,...
FROM table_a a
right join table_b b
ON (a.column = b.column);

4.全外關聯（FULL [OUTER] JOIN）

以兩個表的記錄為基準，回傳兩個表的記錄去重之和，關聯不上的欄位為NULL，注意：FULL JOIN時候，Hive不會使用MapJoin來優化，

#右外關聯
SELECT column1,column2,...
FROM table_a a
full outer join table_b b
ON (a.column = b.column);

5.LEFT SEMI JOIN

以LEFT SEMI JOIN關鍵字前面的表為主表，回傳主表的KEY也在副表中的記錄，

#LEFT SEMI JOIN
SELECT column1,column2,...
FROM table_a a
left semi join table_b b
ON (a.column = b.column);

6.笛卡爾積關聯（CROSS JOIN）

回傳兩個表的笛卡爾積結果，不需要指定關聯鍵，

#笛卡爾積關聯
SELECT column1,column2,...
FROM table_a a
cross join table_b b;

5、排序

asc: 為升序排序
desc: 為降序排序

1.單欄位排序

#單欄位降序排序
SELECT column1,column2,...
FROM table_a
order by column_od desc;

2.多欄位排序

#單欄位降序排序
SELECT column1,column2,...
FROM table_a
order by column_od1 asc,column_od2 desc;

3.區域排序

#Hive區域排序
SELECT column1,column2,...
FROM table_a
sort by column_od desc;

4.指定區域排序的磁區欄位

#指定區域排序的磁區欄位
select column1,column2,... 
from table_a
distribute by column_
sort by column_od desc;

5.cluster

#如果磁區和排序的欄位一樣，我們可以用cluster by代替
select * from emp distribute by empno sort by empno;
select * from emp cluster by empno;

6、分桶

1.創建分桶表

#創建分桶表
create table stu_buck(id int, name string)
clustered by(id) 
into 4 buckets
row format delimited fields terminated by '\t';

2.插入資料

#向分桶表中插入資料
load data local inpath '/opt/module/datas/student.txt' into table stu_buck;

七、hive函式

1、日期函式

回傳值	語法結構	描述
string	from_unix(bigint time,string time_format)	將時間戳進行格式化輸出
bigint	unix_timestamp()	獲得當前時區的UNIX時間戳
bigint	unix_timestamp(string date)	將格式為yyyy-MM-dd HH:mm:ss的時間字串轉換成時間戳，如果轉化失敗，則回傳0，
bigint	unix_timestamp(string date, string pattern)	將指定時間字串格式字串轉換成Unix時間戳，如果轉化失敗，則回傳0，
string	to_date(string timestamp)	回傳日期時間欄位中的日期部分
int	year(string date)	回傳日期中的年
int	month (string date)	回傳日期中的月份
int	day (string date)	回傳日期中的天
int	hour (string date)	回傳日期中的小時
int	minute (string date)	回傳日期中的分鐘
int	second (string date)	回傳日期中的秒
int	weekofyear (string date)	回傳時間字串位于一年中的第幾個周內
int	datediff(string enddate, string startdate)	回傳結束日期減去開始日期的天數
string	date_add(string startdate, int days)	回傳開始日期startdate增加days天后的日期
string	date_sub (string startdate, int days)	回傳開始日期startdate減少days天后的日期
timestamp	from_utc_timestamp(timestamp, string timezone)	如果給定的時間戳并非UTC，則將其轉化成指定的時區下時間戳
timestamp	to_utc_timestamp(timestamp, string timezone)	如果給定的時間戳指定的時區下時間戳，則將其轉化成UTC下的時間戳
date	current_date()	回傳當前時間日期
timestamp	current_timestamp()	回傳當前時間戳
string	add_months(string start_date, int num_months)	回傳當前時間下再增加num_months個月的日期
string	last_day(string date)	回傳這個月的最后一天的日期，忽略時分秒部分（HH:mm:ss）
string	next_day(string start_date, string day_of_week)	回傳當前時間的下一個星期X所對應的日期
string	trunc(string date, string format)	回傳時間的最開始年份或月份
double	months_between(date1, date2)	回傳date1與date2之間相差的月份，如date1>date2，則回傳正，如果date1<date2,則回傳負，否則回傳0.0
string	date_format(date/timestamp/string ts, string fmt)	按指定格式回傳時間date
int	dayofweek(date)	回傳日期那天的周幾
int	quarter(date/timestamp/string)	回傳當前時間屬性哪個季度

2、數學函式

回傳值	語法結構	描述
double	round(double a)	回傳double型別的整數值部分（遵循四舍五入）
double	round(double a, int d)	回傳指定精度d的double型別
bigint	floor(double a)	回傳等于或者小于該double變數的最大的整數
bigint	ceil(double a)	回傳等于或者大于該double變數的最小的整數
bigint	ceiling(double a)	回傳等于或者大于該double變數的最小的整數
double	rand(),rand(int seed)	回傳一個double型0到1范圍內的亂數，如果指定種子seed，則會等到一個穩定的亂數序列
double	exp(double a)	回傳自然對數e的a次方,a可為小數
double	ln(double a)	回傳a的自然對數，a可為小數
double	log10(double a)	回傳以10為底的a的對數，a可為小數
double	log2(double a)	回傳以2為底的a的對數，a可為小數
double	log(double base, double a)	回傳以base為底的a的對數，base與a都是double型別
double	pow(double a, double p), power(double a, double p)	回傳a的p次冪
double	sqrt(double a)	回傳a的平方根
string	bin(BIGINT a)	回傳a的二進制代碼表示,，a為BIGINT型別
string	hex(BIGINT a),hex(string a)	如果變數是int型別，那么回傳a的十六進制表示；
string	unhex(string a)	回傳該十六進制字串所代碼的字串,hex的逆方法
string	conv(BIGINT num, int from_base, int to_base),conv(STRING num, int from_base, int to_base)	將bigint/string數值num從from_base進制轉化到to_base進制
double or int	abs(double a),abs(int a)	回傳數值a的絕對值
int or double	pmod(int a, int b),pmod(double a, double b)	回傳正的a除以b的余數
double	sin(double a)	回傳a的正弦值
double	asin(double a)	回傳a的反正弦值
double	cos(double a)	回傳a的余弦值
double	acos(double a)	回傳a的反余弦值
double	tan(double a)	回傳a的正切值
double	atan(double a)	回傳a的反正切值
double	degrees(double a)	回傳a的角度值
double	radians(double a)	回傳a的弧度值
int or double	positive(int a), positive(double a)	回傳a的正數
int or double	negative(int a), negative(double a)	回傳a的負數
double	sign(double a)	如果a是正數則回傳1.0，是負數則回傳-1.0，否則回傳0.0
double	e()	數學常數e
double	pi()	圓周率π
bigint	factorial(int a)	求a的階乘
double	cbrt(double a)	求a的立方根
int bigint	shiftleft(BIGINT a, int b)	按位左移
int bigint	shiftright(BIGINT a, int b)	按位右移
int bigint	shiftrightunsigned(BIGINT a, int b)	無符號按位右移（<<<）
T	greatest(T v1, T v2, …)	求最大值
T	least(T v1, T v2, …)	求最小值
double	bround(double a)	銀行家舍入法（1-4：舍，6-9：進，5->前位數是偶：舍，5->前位數是奇：進）
double	bround(double a，int d)	銀行家舍入法,保留d位小數

3、字符函式

回傳值	語法結構	功能描述
int	ascii(string str)	回傳字串str第一個字符的ascii碼
string	base64(binary bin)	將二進制bin轉換成64位的字串
string	concat(string A, string B…)	回傳輸入字串連接后的結果，支持任意個輸入字串
array<struct<string,double>>	context_ngrams(array<array>, array, int K, int pf)	與ngram類似，但context_ngram()允許你預算指定背景關系(陣列)來去查找子序列
string	concat_ws(string SEP, string A, string B…) \| concat_ws(string SEP, array)	回傳輸入字串連接后的結果，SEP表示各個字串間的分隔符
string	decode(binary bin, string charset)	使用指定的字符集charset將二進制值bin解碼成字串，支持的字符集有：‘US-ASCII’, ‘ISO-8859-1’, ‘UTF-8’, ‘UTF-16BE’, ‘UTF-16LE’, ‘UTF-16’，如果任意輸入引數為NULL都將回傳NULL
binary	encode(string src, string charset)	使用指定的字符集charset將字串編碼成二進制值，支持的字符集有：‘US-ASCII’, ‘ISO-8859-1’, ‘UTF-8’, ‘UTF-16BE’, ‘UTF-16LE’, ‘UTF-16’，如果任一輸入引數為NULL都將回傳NULL
int	find_in_set(string str, string strList)	回傳str在strlist第一次出現的位置，strlist是用逗號分割的字串，如果沒有找該str字符，則回傳0，如果任一引數為NULL將回傳NULL
string	format_number(number x, int d)	將數值X轉換成"#,###,###.##"格式字串，并保留d位小數，如果d為0，將進行四舍五入且不保留小數
string	get_json_object(string json_string, string path)	決議json的字串json_string,回傳path指定的內容，如果輸入的json字串無效，那么回傳NULL，注意此路徑上JSON字串只能由數字字母下劃線組成且不能有大寫字母和特殊字符，且key不能由數字開頭，這是由于Hive對列名的限制
boolean	in_file(string str, string filename)	如果檔案名為filename的檔案中有一行資料與字串str匹配成功就回傳true
int	instr(string str, string substr)	查找字串str中子字串substr出現的位置，如果查找失敗將回傳0，如果任一引數為Null將回傳null，注意位置為從1開始的
int	length(string A)	回傳字串A的長度
int	locate(string substr, string str[, int pos])	查找字串str中的pos位置后字串substr第一次出現的位置
string	lower(string A) lcase(string A)	回傳字串A的小寫格式
string	lpad(string str, int len, string pad)	將str進行用pad進行左補足到len位,從左邊開始對字串str使用字串pad填充，最終len長度為止，如果字串str本身長度比len大的話，將去掉多余的部分
string	ltrim(string A)	去掉字串A前面的空格
array<struct<string,double>>	ngrams(array, int N, int K, int pf)	回傳出現次數TOP K的的子序列,n表示子序列的長度
string	parse_url(string urlString, string partToExtract [, stringkeyToExtract])	回傳從URL中抽取指定部分的內容，引數url是URL字串，而引數partToExtract是要抽取的部分，這個引數包含(HOST, PATH, QUERY, REF, PROTOCOL, AUTHORITY, FILE, and USERINFO
string	printf(String format, Obj… args)	按照printf風格格式輸出字串
string	regexp_extract(string subject, string pattern, int index)	將字串subject按照pattern正則運算式的規則拆分，回傳index指定的字符，注意些預定義字符的使用，如第二個引數如果使用’\s’將被匹配到s,’\s’才是匹配空格
string	regexp_replace(string A, string B, string C)	按照Java正則運算式PATTERN將字串INTIAL_STRING中符合條件的部分成REPLACEMENT所指定的字串，如里REPLACEMENT這空的話，抽符合正則的部分將被去掉
string	repeat(string str, int n)	回傳重復n次后的str字串
string	reverse(string A)	回傳字串A的反轉結果
string	rpad(string str, int len, string pad)	從右邊開始對字串str使用字串pad填充，最終len長度為止，如果字串str本身長度比len大的話，將去掉多余的部分
string	rtrim(string A)	去除字串右邊的空格
array	sentences(string str, string lang, string locale)	字串str將被轉換成單詞陣列，如：sentences(‘Hello there! How are you?’) =( (“Hello”, “there”), (“How”, “are”, “you”) )
string	space(int n)	回傳n個空格
array	split(string str, string pat)	按照pat字串分割str，會回傳分割后的字串陣列
map<string,string>	str_to_map(text[, delimiter1, delimiter2])	將字串str按照指定分隔符轉換成Map，第一個引數是需要轉換字串，第二個引數是鍵值對之間的分隔符，默認為逗號;第三個引數是鍵值之間的分隔符，默認為"="
string	substr(string A, int start),substring(string A, int start)	回傳字串A從start位置到結尾的字串
string	substr(string A, int start, int len),substring(string A, int start, int len)	回傳字串A從start位置開始，長度為len的字串
string	substring_index(string A, string delim, int count)	截取第count分隔符之前的字串，如count為正則從左邊開始截取，如果為負則從右邊開始截取
string	translate(string\|char\|varchar input, string\|char\|varchar from,string\|char\|varchar to)	將input出現在from中的字串替換成to中的字串
string	trim(string A)	去除字串兩邊的空格
binary	unbase64(string str)	將64位的字串轉換二進制值
string	upper(string A) ucase(string A)	將字串A中的字母轉換成大寫字母
string	initcap(string A)	將字串A轉換第一個字母大寫其余字母的字串
int	levenshtein(string A, string B)	計算兩個字串之間的差異大小
string	soundex(string A)	將普通字串轉換成soundex字串

4、聚合函式

回傳值	語法結構	功能描述
bigint	*count(), count(expr), count(DISTINCT expr[, expr…])**	count(*)統計檢索出的行的個數，包括NULL值的行；count(expr)回傳指定欄位的非空值的個數；
double	sum(col), sum(DISTINCT col)	sum(col)統計結果集中col的相加的結果；sum(DISTINCT col)統計結果中col不同值相加的結果
double	avg(col), avg(DISTINCT col)	avg(col)統計結果集中col的平均值；avg(DISTINCT col)統計結果中col不同值相加的平均值
double	min(col)	統計結果集中col欄位的最小值
double	max(col)	統計結果集中col欄位的最大值
double	variance(col), var_pop(col)	統計結果集中col非空集合的總體變數（忽略null），（求指定列數值的方差）
double	var_samp (col)	統計結果集中col非空集合的樣本變數（忽略null）（求指定列數值的樣本方差）
double	stddev_pop(col)	該函式計算總體標準偏離，并回傳總體變數的平方根，其回傳值與VAR_POP函式的平方根相同（求指定列數值的標準偏差）
double	stddev_samp (col)	該函式計算樣本標準偏離，（求指定列數值的樣本標準偏差）
double	covar_pop(col1, col2)	求指定列數值的協方差
double	covar_samp(col1, col2)	求指定列數值的樣本協方差
double	corr(col1, col2)	回傳兩列數值的相關系數
double	percentile(BIGINT col, p)	求準確的第pth個百分位數，p必須介于0和1之間，但是col欄位目前只支持整數，不支持浮點數型別
array	percentile(BIGINT col, array(p1 [, p2]…))	功能和上述類似，之后后面可以輸入多個百分位數，回傳型別也為array，其中為對應的百分位數
double	percentile_approx(DOUBLE col, p [, B])	求近似的第pth個百分位數，p必須介于0和1之間，回傳型別為double，但是col欄位支持浮點型別，引數B控制記憶體消耗的近似精度，B越大，結果的準確度越高，默認為10,000，當col欄位中的distinct值的個數小于B時，結果為準確的百分位數
array	percentile_approx(DOUBLE col, array(p1 [, p2]…) [, B])	功能和上述類似，之后后面可以輸入多個百分位數，回傳型別也為array，其中為對應的百分位數，
array<struct {‘x’,‘y’}>	histogram_numeric(col, b)	以b為基準計算col的直方圖資訊

5、集合函式

回傳值	語法結構	功能描述
int	size(Map<K.V>)	回傳map型別的長度
int	size(Array)	求陣列的長度
array	map_keys(Map<K.V>)	回傳map中的所有key
array	map_keys(Map<K.V>)	回傳map中的所有value
boolean	array_contains(Array, value)	如該陣列Array包含value回傳true，，否則回傳false
array	sort_array(Array)	按自然順序對陣列進行排序并回傳

6、條件函式

回傳值	語法結構	功能描述
T	if(boolean testCondition, T valueTrue, T valueFalseOrNull)	當條件testCondition為TRUE時，回傳valueTrue；否則回傳valueFalseOrNull（valueTrue，valueFalseOrNull為泛型）
T	nvl(T value, T default_value)	如果value值為NULL就回傳default_value,否則回傳value
T	COALESCE(T v1, T v2,…)	回傳引數中的第一個非空值；如果所有值都為NULL，那么回傳NULL
T	CASE a WHEN b THEN c [WHEN d THEN e] [ELSE f] END*	如果a等于b，那么回傳c；如果a等于d，那么回傳e；否則回傳f
T	CASE WHEN a THEN b [WHEN c THEN d] [ELSE e] END*	如果a為TRUE,則回傳b；如果c為TRUE，則回傳d；否則回傳e
boolean	isnull( a )	如果a為null就回傳true，否則回傳false
boolean	isnotnull ( a )	如果a為非null就回傳true，否則回傳false

7、表生成函式

回傳值	語法結構	功能描述
Array Type	explode(array a)	對于a中的每個元素，將生成一行且包含該元素
N rows	explode(ARRAY)	每行對應陣列中的一個元素
N rows	explode(MAP)	每行對應每個map鍵-值，其中一個欄位是map的鍵，另一個欄位是map的值
N rows	posexplode(ARRAY)	與explode類似，不同的是還回傳各元素在陣列中的位置
N rows	stack(INT n, v_1, v_2, …, v_k)	把M列轉換成N行，每行有M/N個欄位，其中n必須是個常數
tuple	json_tuple(jsonStr, k1, k2, …)	從一個JSON字串中獲取多個鍵并作為一個元組回傳，與get_json_object不同的是此函式能一次獲取多個鍵值
tuple	parse_url_tuple(url, p1, p2, …)	回傳從URL中抽取指定N部分的內容，引數url是URL字串，而引數p1,p2,…是要抽取的部分，這個引數包含HOST, PATH, QUERY, REF, PROTOCOL, AUTHORITY, FILE, USERINFO, QUERY:
tuple	inline(ARRAY<STRUCT[,STRUCT]>)	將結構體陣列提取出來并插入到表中

8、型別轉換函式

回傳值	語法結構	功能描述
binary	binary(string\|binary)	將輸入的值轉換成二進制
Expected “=” to follow “type”	cast(expr as )	將expr轉換成type型別如：cast(“1” as BIGINT) 將字串1轉換成了BIGINT型別，如果轉換失敗將回傳NULL

9、視窗函式

橫向擴展表，控制粒度

order by 進行全域檢索

#視窗函式
func over(partition by field1,...)
#粒度全表
over()
#所有磁區資料
over(partition by field1,...)
#從當前磁區的首行到當前行
over(partition by field1,... order by field_a,... rows between ... and ...)

sort by 進行區域排序，只保證部分有序，效率高

func over(distribute by field1,… sort by fielda,…rows between … and …)

當前行：current row
當前行的前多少行： preceding
當前行的后多少行：following
無邊界：unbounded

unbounded preceding unbound follwing

unbounded preceding … current row

1.row_number() over()

這個方法主要進行開窗增加自增列

2.sum() over()

通過開窗進行sum計算

3.lag/lead() over()

4.hive的自增列

insert into table User_Attribute select (row_number() over())+1000 as id,customid from tbl_custom;

八、with語法

with…as…需要定義一個SQLK片段，會將這個片段產生的結果集保存在記憶體中，后續的SQL均可以訪問這個結果集和，作用與視圖或臨時表類似，一個SQL查詢陳述句中只允許出現一個with陳述句，該語法主要用于子查詢，

with t1 as (
		select *
		from user_info
	), 
	t2 as (
		select *
		from goods_list
	)
select *
from t1, t2;

九、多維分組聚合

多維分組聚合函式有：grouping sets、roll up、with cube

1、grouping sets函式

GROUPING SETS子句允許開發者自行組合GROUP BY子句中出現的欄位作為分組欄位，其實作效果等同于按照不同欄位分組的SQL陳述句進行UNION操作，

SELECT    a.product_id
        , a.channel_id
        , a.promotion_id
        , SUM(a.sale_amount) AS sale_amount
FROM    dwd.dwd_sales a
GROUP BY a.product_id
        , a.channel_id
        , a.promotion_id
GROUPING SETS ((a.product_id, a.channel_id),(a.channel_id, a.promotion_id));

1.單欄位

#grouping sets陳述句
select 
	device_id,
	os_id,app_id,
	count(user_id) 
from test_xinyan_reg 
group by device_id,os_id,app_id #全欄位group by
grouping sets((device_id));
#等價hive陳述句
SELECT 
	device_id,
	null,
	null,
	count(user_id) 
FROM test_xinyan_reg 
group by device_id;

2.全欄位排序，按照某個欄位聚合

#grouping sets陳述句
select 
	device_id,
	os_id,app_id,
	count(user_id) from test_xinyan_reg 
group by 
	device_id,
	os_id,
	app_id 
grouping sets((device_id,os_id))
#等價的hive陳述句
SELECT 
	device_id,
	os_id,null,
	count(user_id) 
FROM test_xinyan_reg 
group by device_id,os_id;

3.多維度聚合

#grouping sets陳述句
select 
	device_id,
	os_id,app_id,
	count(user_id) 
from test_xinyan_reg 
group by 
	device_id,
	os_id,
	app_id 
grouping sets((device_id,os_id),(device_id));
#等價的hive陳述句
SELECT 
	device_id,
	os_id,
	null,
	count(user_id) 
FROM test_xinyan_reg 
group by device_id,os_id 
UNION ALL 
SELECT 
	device_id,
	null,
	null,
	count(user_id) 
FROM test_xinyan_reg 
group by device_id;

4.全維度聚合

#grouping sets陳述句
select
	device_id,
	os_id,app_id,
	count(user_id) 
from test_xinyan_reg 
group by
	device_id,
	os_id,
	app_id 
grouping sets((device_id),(os_id),(device_id,os_id),());
#等價的hive陳述句
SELECT 
	device_id,
	null,
	null,
	count(user_id) 
FROM test_xinyan_reg 
group by device_id 
UNION ALL 
SELECT 
	null,
	os_id,
	null,
	count(user_id) 
FROM test_xinyan_reg 
group by 
	os_id 
UNION ALL 
SELECT 
	device_id,
	os_id,
	null,
	count(user_id)
FROM test_xinyan_reg 
group by
	device_id,
	os_id 
UNION ALL 
SELECT 
	null,
	null,
	null,
	count(user_id) 
FROM test_xinyan_reg

2、roll up函式

rollup可以實作從右到做遞減多級的統計，顯示統計某一層次結構的聚合，

#roll up陳述句
select device_id,os_id,app_id,client_version,from_id,count(user_id)
from test_xinyan_reg
group by device_id,os_id,app_id,client_version,from_id with rollup;
#等價的hive陳述句
select device_id,os_id,app_id,client_version,from_id,count(user_id)
from test_xinyan_reg
group by device_id,os_id,app_id,client_version,from_id 
grouping sets ((device_id,os_id,app_id,client_version,from_id),(device_id,os_id,app_id,client_version),(device_id,os_id,app_id),(device_id,os_id),(device_id),());

3、with cube函式

cube簡稱資料魔方，可以實作hive多個任意維度的查詢，cube(a,b,c)則首先會對(a,b,c)進行group by，然后依次是(a,b),(a,c),(a),(b,c),(b),?,最后在對全表進行group by，他會統計所選列中值的所有組合的聚合
cube即為grouping sets的簡化程序函式

#with cube陳述句
select device_id,os_id,app_id,client_version,from_id,count(user_id)
from test_xinyan_reg
group by device_id,os_id,app_id,client_version,from_id with cube;
#等價的hive陳述句
SELECT device_id,null,null,null,null ,count(user_id) FROM test_xinyan_reg group by device_id
UNION ALL
SELECT null,os_id,null,null,null ,count(user_id) FROM test_xinyan_reg group by os_id
UNION ALL
SELECT device_id,os_id,null,null,null ,count(user_id) FROM test_xinyan_reg group by device_id,os_id
UNION ALL
SELECT null,null,app_id,null,null ,count(user_id) FROM test_xinyan_reg group by app_id
UNION ALL
SELECT device_id,null,app_id,null,null ,count(user_id) FROM test_xinyan_reg group by device_id,app_id
UNION ALL
SELECT null,os_id,app_id,null,null ,count(user_id) FROM test_xinyan_reg group by os_id,app_id
UNION ALL
SELECT device_id,os_id,app_id,null,null ,count(user_id) FROM test_xinyan_reg group by device_id,os_id,app_id
UNION ALL
SELECT null,null,null,client_version,null ,count(user_id) FROM test_xinyan_reg group by client_version
UNION ALL
SELECT device_id,null,null,client_version,null ,count(user_id) FROM test_xinyan_reg group by device_id,client_version
UNION ALL
SELECT null,os_id,null,client_version,null ,count(user_id) FROM test_xinyan_reg group by os_id,client_version
UNION ALL
SELECT device_id,os_id,null,client_version,null ,count(user_id) FROM test_xinyan_reg group by device_id,os_id,client_version
UNION ALL
SELECT null,null,app_id,client_version,null ,count(user_id) FROM test_xinyan_reg group by app_id,client_version
UNION ALL
SELECT device_id,null,app_id,client_version,null ,count(user_id) FROM test_xinyan_reg group by device_id,app_id,client_version
UNION ALL
SELECT null,os_id,app_id,client_version,null ,count(user_id) FROM test_xinyan_reg group by os_id,app_id,client_version
UNION ALL
SELECT device_id,os_id,app_id,client_version,null ,count(user_id) FROM test_xinyan_reg group by device_id,os_id,app_id,client_version
UNION ALL
SELECT null,null,null,null,from_id ,count(user_id) FROM test_xinyan_reg group by from_id
UNION ALL
SELECT device_id,null,null,null,from_id ,count(user_id) FROM test_xinyan_reg group by device_id,from_id
UNION ALL
SELECT null,os_id,null,null,from_id ,count(user_id) FROM test_xinyan_reg group by os_id,from_id
UNION ALL
SELECT device_id,os_id,null,null,from_id ,count(user_id) FROM test_xinyan_reg group by device_id,os_id,from_id
UNION ALL
SELECT null,null,app_id,null,from_id ,count(user_id) FROM test_xinyan_reg group by app_id,from_id
UNION ALL
SELECT device_id,null,app_id,null,from_id ,count(user_id) FROM test_xinyan_reg group by device_id,app_id,from_id
UNION ALL
SELECT null,os_id,app_id,null,from_id ,count(user_id) FROM test_xinyan_reg group by os_id,app_id,from_id
UNION ALL
SELECT device_id,os_id,app_id,null,from_id ,count(user_id) FROM test_xinyan_reg group by device_id,os_id,app_id,from_id
UNION ALL
SELECT null,null,null,client_version,from_id ,count(user_id) FROM test_xinyan_reg group by client_version,from_id
UNION ALL
SELECT device_id,null,null,client_version,from_id ,count(user_id) FROM test_xinyan_reg group by device_id,client_version,from_id
UNION ALL
SELECT null,os_id,null,client_version,from_id ,count(user_id) FROM test_xinyan_reg group by os_id,client_version,from_id
UNION ALL
SELECT device_id,os_id,null,client_version,from_id ,count(user_id) FROM test_xinyan_reg group by device_id,os_id,client_version,from_id
UNION ALL
SELECT null,null,app_id,client_version,from_id ,count(user_id) FROM test_xinyan_reg group by app_id,client_version,from_id
UNION ALL
SELECT device_id,null,app_id,client_version,from_id ,count(user_id) FROM test_xinyan_reg group by device_id,app_id,client_version,from_id
UNION ALL
SELECT null,os_id,app_id,client_version,from_id ,count(user_id) FROM test_xinyan_reg group by os_id,app_id,client_version,from_id
UNION ALL
SELECT device_id,os_id,app_id,client_version,from_id ,count(user_id) FROM test_xinyan_reg group by device_id,os_id,app_id,client_version,from_id
UNION ALL
SELECT null,null,null,null,null ,count(user_id) FROM test_xinyan_reg

十、格式化創建動態表

1、CSV檔案

該創建方式的忽略欄位中包含的分割字符

ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
	WITH SERDEPROPERTIES (
    "separatorChar" = ",",
    "quoteChar" = "\"",
	"escapeChar" = "\\"
	)
STORED AS TEXTFILE;

2、json檔案

ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
    WITH SERDEPROPERTIES (
       "separatorChar" = "\t",
       "quoteChar"     = "'",
       "escapeChar"    = "\\"
    )  
STORED AS TEXTFILE;

十一、增量表、全量表和拉鏈表

1、增量表

記錄更新周期內新增資料，即在原表中資料的基礎上新增本周期內產生的新資料，

2、全量表

記錄更新周期內的全量資料，無論資料是否有變化都需要記錄

3、拉鏈表

拉鏈表是針對資料倉庫設計中表存盤資料的方式而定義，所謂拉鏈急事記錄歷史，記錄一個事物從開始，一直到當前狀態的所有變化的資訊，

應用場景： 大量的歷史資料+新增的資料+有限時間范圍內（截止拉取資料的時間）的少量的更新資料

十二、數倉分層

1、源資料層（ODS）

原始資料層，存放原始資料，直接加載原始日志、資料，資料保持原貌不作處理

2、明細粒度事實層（DWD）

以業務程序作為建模驅動，基于每個具體的業務程序特點，構建最細粒度的明細層事實表，可以結合企業的資料使用特點，將明細事實表的某些重要維度屬性欄位做適當冗余，即寬表化處理，

3、資料中間層（DWM）

在DWD層的資料基礎上，對資料做輕度的聚合操作，生成一系列的中間表，提升公共指標的復用性，減少重復加工，直觀來講，就是對通用的核心維度進行聚合操作，算出相應的統計指標

4、公共匯總粒度事實層（DWS）

以分析的主題物件作為建模驅動，基于上層的應用和產品的指標需求，構建公共粒度的匯總指標事實表，以寬表化手段物理化模型，構建命名規范、口徑一致的統計指標，為上層提供公共指標，建立匯總寬表、明細事實表，

5、公共維度層（DIM）

基于維度建模理念，建立整個企業的一致性維度，降低資料計算口徑和演算法不統一風險，此表也被稱為邏輯維度表，維度和維度邏輯表通常一一對應，

十三、數倉模型

1、星型模型

由事實表和多個維表組成，事實表中存放大量關于企業的事實資料，元祖個數通常很大，而且非規范化程度很高

優點：

**讀取速度快：**針對各個維做了大量預處理，如按照維度進行預先的統計、分組合排序等
**多種資料源，**減少異構資料帶來的分析復雜性
標準性，新員工可快速掌握，資料工程師和分析師比較了解，可促進協作
可擴展性，添加的事實表可以重用先有維度向事實表添加更多外鍵，實作事實表添加新維度

2、雪花模型

星型模型的擴展，將星型模型的維表進一步層次化，原來的各個維表可能被擴展為小的事實表，形成一些區域的層次區域

特點：

通過定義多重父類維表來描述某些特殊維表定義特殊的統計資訊
最大限度的減少資料存盤量
把較小的維度表聯合在一起改善查詢性能

3、星座模型

星型模型的擴展延伸，多張事實表共享維度表，只有一些大型公司使用

轉載請註明出處，本文鏈接：https://www.uj5u.com/qita/438090.html

標籤：其他

上一篇：【LeetCode-SQL每日一練】—184. 部門工資最高的員工

下一篇：Mysql 視窗函式

大資料組件之Hive（Hive學習一篇就夠了）

文章目錄

一、Hive安裝

1、解壓環境

2、環境變數配置

3、組態檔資訊

1.打開編輯檔案

2.輸入以下內容

4、拷貝mysql驅動

5、更新guava包和hadoop一致

6、mysql授權

7、初始化

8、hive啟動模式

9、Hadoop的core-site.xml配置

二、Hive

1、Hive的檔案結構

2、MySQL上Hive的元資料

3、hadoop檔案授權

4、Hive的驅動器：Driver

5、抽象語法樹（AST）

6、動態磁區

三、Hive的資料型別

1、Hive基本資料型別

2、非常規資料型別

1.陣列型別

2.圖型別

3.結構體型別

4.集合型別

四、Hive的DDL

1、 資料庫的基本操作

1.創建資料庫

2.顯示資料庫資訊

3.顯示資料庫詳細資訊

4.切換資料庫

5.修改資料庫

6.洗掉資料庫

2、創建表

1.建表陳述句

1.建表陳述句

2.創建簡單表

3.創建外部表

4.創建含有特殊資料結構的內部表

5.從已有表復制新表

6.復制表結構

7.指定路徑創建資料表

8.創建磁區表

9.創建分桶表

3、查看表

1.查看資料表

2.查看資料表的詳細資訊

3.查看表磁區

4.查看詳細建表陳述句

4、修改表

1.重命名表

2.修改欄位定義

3.修改磁區資訊

4.修改磁區

5.洗掉表磁區

6.更新列

7.增加和替換列

5、洗掉表

1.洗掉表

2.清空表

五、Hive的DML

1、資料匯入方式

2、資料匯入

1.從本地磁盤或者HDFS匯入資料

2.將資料掛載在資料表中

3.通過insert插入

4.建表時候用select as將資料掛載

5.建表的時候用load進行資料掛載

6.向分桶表插入資料

2、資料匯出

1.使用insert將資料匯出

2.帶格式匯出

3.bash命令匯出

4.將資料表匯出到HDFS

5.從匯出結果匯入到Hive

3、資料洗掉

六、Hive的DQL

1、資料庫的基本操作