Hive SQL Tuning Series: Hive Strict Mode and How to Use It Sensibly

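Overview
1. Strict mode
- 1.1 Setting the parameter
- 1.2 Checking the parameter
- 1.3 What strict mode restricts and the corresponding parameters
2. Hands-on examples
- 2.1 Queries on partitioned tables must specify a partition
- 2.2 order by must specify limit
- 2.3 Restricting Cartesian products
3. Combining the settings
- 3.1 Parameters
- 3.2 Examples of combined use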

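Overview

Within the same cluster environment, there are two ways to tune Hive: parameter tuning and SQL tuning.

This post covers Hive strict mode.

Two days ago I was optimizing a SQL script left behind by a predecessor and found the strict-mode parameters used like this, which is seriously wrong: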

set hive.strict.checks.cartesian.product=flase;
set hive.mapred.mode=nonstrict;

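I also noticed that, regardless of how large or small the SQL was, a pile of parameters was simply pasted in front of it, like this: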

set hive.exec.parallel=true;
set hive.exec.parallel.thread.number=16;
set hive.merge.mapfiles = true; 
set hive.merge.mapredfiles = true; 
set hive.merge.size.per.task=256000000;
set hive.merge.smallfiles.avgsize = 256000000;
set mapred.max.split.size=1024000000;
set mapred.min.split.size.per.node=1024000000;
set mapred.min.split.size.per.rack=1024000000; 
set hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;
set hive.join.emit.interval = 2000;
set hive.mapjoin.size.key = 20000;
set hive.mapjoin.cache.numrows = 20000;
set hive.exec.reducers.bytes.per.reducer=2000000000;
set hive.exec.reducers.max=999;
set hive.map.aggr=true;
set hive.groupby.mapaggr.checkinterval=100000;
set hive.auto.convert.join = true;
set hive.exec.dynamic.partition.mode = nonstrict;
set hive.exec.dynamic.partition = true;
set hive.cli.print.header=true;
set hive.resultset.use.unique.column.names=false;
set mapreduce.reduce.memory.mb=4096;
set mapreduce.reduce.java.opts=-Xmx4096m;
set mapred.max.split.size=1024000000;
set mapred.min.split.size.per.node=1024000000;
set mapred.min.split.size.per.rack=1024000000;

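This does tune something, but the tuning has no clear target, and to some extent it actually consumes more compute resources.

So I decided to start a series of articles, the Hive SQL tuning series, on how to use parameters sensibly for SQL optimization and which parameters fit which situations.

This post starts with how to use the strict-mode parameters.

The main text follows.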

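1. Strict mode

Hive's strict mode is a set of safety restrictions that stop users from submitting harmful SQL that consumes so many resources that the runtime environment collapses.

Most of us have at some point submitted SQL that ran for a very long time and starved the cluster of resources, so the motivation should be easy to understand.

It was also mentioned in the earlier post "Hive Dynamic Partitions Explained".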

1.1 Setting the parameter

-- strict turns strict mode on, nonstrict turns it off
set hive.mapred.mode=strict;

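1.2 Checking the parameter

Use Hive's set command to view a specific parameter:

-- check the Hive mode from the command line; the result below means strict mode is not enabled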
hive> set hive.mapred.mode;
hive.mapred.mode is undefined

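1.3 What strict mode restricts and the corresponding parameters

With strict mode enabled, Hive blocks the following three kinds of queries:

a. queries on a partitioned table whose where clause does not filter on a partition column;

b. order by queries that have no limit clause;

c. Cartesian-product joins, that is, join queries with neither an on condition nor a where condition.

Each of these three cases also has its own parameter that controls it separately.

Partitioned-table queries must specify a partition

-- enable the check (default: false)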
set hive.strict.checks.no.partition.filter=true;

order by must specify limit

-- enable the check (default: false)
set hive.strict.checks.orderby.no.limit=true;

Restricting Cartesian products

-- enable the check (default: false)
set hive.strict.checks.cartesian.product=true;
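
Before changing any of these settings, it can help to see what the current session actually has. The sketch below simply reads each value with a bare set statement, the same form used in section 1.2. It assumes the property names used in this article are available on your Hive version; older releases may expose different names.

```sql
-- A bare "set <name>;" only prints the current value; it does not change anything.
set hive.mapred.mode;
set hive.strict.checks.no.partition.filter;
set hive.strict.checks.orderby.no.limit;
set hive.strict.checks.cartesian.product;
```

Running this at the top of a script makes it explicit which checks are in force before any of them is overridden.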

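2. Hands-on examples

2.1 Queries on partitioned tables must specify a partition

Why a partition must be specified: if the table has a large number of partitions and no restriction is applied, a query can end up reading far more data than expected.

-- test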
create table `lubian` (
`id` string comment 'id',
`name` string comment '姓名'
)
comment 'lubian' 
PARTITIONED BY (ymd string)
row format delimited fields terminated by '\t' lines terminated by '\n' 
stored as orc;

set hive.strict.checks.no.partition.filter=true;
select * from lubian limit 111;

Result

FAILED: SemanticException [Error 10056]:
    Queries against partitioned tables without a partition filter are disabled for safety reasons.
    If you know what you are doing, please set hive.strict.checks.no.partition.filter to false
    and make sure that hive.mapred.mode is not set to 'strict' to proceed.
    Note that you may get errors or incorrect results if you make a mistake while using some of the unsafe features.
    No partition predicate for Alias "lubian" Table "lubian"

select * from lubian where ymd='11' limit 111;
Time taken: 0.77 seconds
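
As a further illustration of queries that should pass this check, here is a minimal sketch against the lubian table defined above. The partition values are hypothetical, and the assumption (worth confirming on your Hive version) is that any predicate on the partition column ymd, including a range, counts as a partition filter.

```sql
set hive.strict.checks.no.partition.filter=true;

-- List the existing partitions first, so a real value can be used in the filter.
show partitions lubian;

-- Equality filter on the partition column ymd (the value is a made-up example).
select id, name from lubian where ymd = '20220101' limit 111;

-- A range on the partition column also restricts which partitions are read.
select id, name from lubian where ymd >= '20220101' and ymd < '20220201';
```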

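2.2 order by must specify limit

The main reason order by must specify limit: order by is a global sort, so all of the data is handled by a single reduce task. The limit keeps that single reducer from running too long and blocking the job.

-- test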
set hive.strict.checks.orderby.no.limit=true;
select * from lubian order by name;

Result

FAILED: SemanticException 1:36
    Order by-s without limit are disabled for safety reasons.
    If you know what you are doing, please set hive.strict.checks.orderby.no.limit to false
    and make sure that hive.mapred.mode is not set to 'strict' to proceed.
    Note that you may get errors or incorrect results if you make a mistake while using some of the unsafe features..
    Error encountered near token 'name'
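
For contrast, the sketch below shows two ways to write the query so that it gets past this check. The first bounds the global sort with limit. The second is a swapped-in alternative, not something from the failing query above: distribute by plus sort by, which sorts rows within each reducer rather than globally, so it only fits when a total order is not required. The partition value is a hypothetical example, added so the statements also pass the partition check.

```sql
set hive.strict.checks.orderby.no.limit=true;

-- Global sort, but bounded: order by together with limit passes the check.
select id, name
from lubian
where ymd = '20220101'
order by name
limit 100;

-- Alternative without a total order: rows with the same name go to the same
-- reducer and are sorted within that reducer only.
select id, name
from lubian
where ymd = '20220101'
distribute by name
sort by name;
```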

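2.3 Restricting Cartesian products

Why Cartesian products are restricted: a Cartesian product can make the data volume explode. For example, joining two tables of 1,000 rows each produces 1,000,000 rows, which is quadratic growth. In addition, when a Cartesian product is triggered, the join runs in a single reduce task.

-- test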
set hive.strict.checks.cartesian.product=true;
select t1.*,t2.* from lubian as t1
inner join lubian as t2;

Result

FAILED: SemanticException Cartesian products are disabled for safety reasons.
    If you know what you are doing, please set hive.strict.checks.cartesian.product to false
    and make sure that hive.mapred.mode is not set to 'strict' to proceed.
    Note that you may get errors or incorrect results
    if you make a mistake while using some of the unsafe features.
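
The usual fix is to give the join an explicit key. A minimal sketch on the same lubian table, assuming id is a sensible join key for illustration; the partition value is hypothetical and is there only so the statement also passes the partition check:

```sql
set hive.strict.checks.cartesian.product=true;

-- With an on condition, each row is matched by key instead of being paired
-- with every row of the other side.
select t1.id, t1.name, t2.name as other_name
from lubian as t1
inner join lubian as t2
  on t1.id = t2.id
where t1.ymd = '20220101'
  and t2.ymd = '20220101';
```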

3. Combining the settings

3.1 Parameters

The strict-mode parameters are set as follows:

set hive.mapred.mode=strict;
set hive.strict.checks.no.partition.filter=true;
set hive.strict.checks.orderby.no.limit=true;
set hive.strict.checks.cartesian.product=true;

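Setting set hive.mapred.mode=strict; turns on all three strict checks at once. Alternatively, each check can be enabled individually through its own parameter.

3.2 Examples of combined use

The settings can also be combined, but the following combination is problematic: both lines switch checks off, so the first one is redundant, and flase is a misspelling of false.

-- turn off the Cartesian product check
set hive.strict.checks.cartesian.product=flase;
-- turn off strict mode
set hive.mapred.mode=nonstrict;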

What was probably intended is to leave strict mode off by default while still enforcing one of the checks, as follows:

set hive.mapred.mode=nonstrict;
set hive.strict.checks.cartesian.product=true;

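Or strict mode is on by default and one of the checks should be skipped. The error messages in section 2 say that hive.mapred.mode must also not be 'strict' for a check to be lifted, so verify this combination on your Hive version: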

set hive.mapred.mode=strict;
set hive.strict.checks.cartesian.product=false;
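
How the global switch and the individual checks interact may depend on the Hive version, so it is worth testing a combination before relying on it. The sketch below is one way to do that with explain, which compiles the statement without running the job. The partition value is hypothetical, and the outcome is deliberately not stated here.

```sql
-- Combination under test: global strict mode on, one individual check off.
set hive.mapred.mode=strict;
set hive.strict.checks.cartesian.product=false;

-- A join with no on condition. If this fails with the "Cartesian products are
-- disabled" error, the global mode is overriding the individual setting on
-- this Hive version; if a plan is printed, the individual setting was honored.
explain
select t1.id, t2.id
from (select id from lubian where ymd = '20220101') t1
inner join (select id from lubian where ymd = '20220101') t2;
```

The same probe can be repeated with hive.mapred.mode=nonstrict and the individual check set to true to test the opposite combination.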

That is all for this post.


Please credit the source when reposting. Link to this article: https://www.uj5u.com/shujuku/503500.html
