前言: 本文是對這篇博客MySQL 8.0 Histograms的翻譯,翻譯如有不當的地方,敬請諒解,請尊重原創和翻譯勞動成果,轉載的時候請注明出處,謝謝!
英文原文地址:https://lefred.be/content/mysql-8-0-histograms/
翻譯原文地址:https://www.cnblogs.com/kerrycode/p/11817026.html
在MySQL 8.0之前,MySQL缺失了其它關系資料庫中一個眾所周知的功能:優化器的直方圖
優化器團隊(Optimizer Team)在越來越多的MySQL DBA的呼聲中實作了這個功能,
直方圖定義
但什么是直方圖呢?我們來看維基百科的定義吧,直方圖是數值資料分布的準確表示, 對于RDBMS來說,直方圖是特定列內資料分布的近似值,因此在MySQL中,直方圖能夠幫助優化器找到最有效的執行計劃,
直方圖例子
為了說明直方圖是如何影響優化器作業的,我會用dbt3生成的資料來演示,
我們準備了一個簡單查詢:
SELECT * FROM orders
JOIN customer ON o_custkey = c_custkey
WHERE o_orderdate < '1993-01-01'
AND c_mktsegment = "AUTOMOBILE"\G
讓我們看一下傳統的執行計劃的EXPLAIN輸出,以及可視化方式(VISUAL one):
mysql> EXPLAIN SELECT * FROM orders
JOIN customer ON o_custkey = c_custkey
WHERE o_orderdate < '1993-01-01' AND c_mktsegment = "AUTOMOBILE"\G
*************************** 1. row ***************************id: 1
select_type: SIMPLE
table: customer partitions: NULL type: ALLpossible_keys: PRIMARYkey: NULL
key_len: NULLref: NULL
rows: 149050filtered: 10.00
Extra: Using where
*************************** 2. row ***************************id: 1
select_type: SIMPLE
table: orders partitions: NULL type: refpossible_keys: i_o_custkey,i_o_orderdate
key: i_o_custkeykey_len: 5
ref: dbt3.customer.c_custkey rows: 14filtered: 30.62
Extra: Using where
2 rows in set, 1 warning (0.28 sec)
我們看到MySQL首先對customer表做了一個全表掃描,并且它的選擇估計記錄(過濾)是10%;

接下來讓我們運行這個查詢(我使用了COUNT(*)),然后我們來看看有多少行記錄
mysql> SELECT count(*) FROM orders
JOIN customer ON o_custkey = c_custkey
WHERE o_orderdate < '1993-01-01' AND c_mktsegment = "AUTOMOBILE"\G
*************************** 1. row ***************************count(*): 451271 row in set (49.98 sec)
創建直方圖
現在,我將在表customer上的欄位c_mktsegment上創建一個直方圖
mysql> ANALYZE TABLE customer UPDATE HISTOGRAM ON c_mktsegment WITH 1024 BUCKETS;
+---------------+-----------+----------+---------------------------------------------------------+
| Table | Op | Msg_type | Msg_text |+---------------+-----------+----------+---------------------------------------------------------+
| dbt3.customer | histogram | status | Histogram statistics created for column 'c_mktsegment'. |
+---------------+-----------+----------+---------------------------------------------------------+
接下來,我們來驗證查詢的執行計劃:
mysql> EXPLAIN SELECT * FROM orders
JOIN customer ON o_custkey = c_custkey
WHERE o_orderdate < '1993-01-01' AND c_mktsegment = "AUTOMOBILE"\G
*************************** 1. row ***************************id: 1
select_type: SIMPLE
table: orders partitions: NULL type: ALLpossible_keys: i_o_custkey,i_o_orderdate
key: NULL
key_len: NULLref: NULL
rows: 1494230filtered: 30.62
Extra: Using where
*************************** 2. row ***************************id: 1
select_type: SIMPLE
table: customer partitions: NULLtype: eq_ref
possible_keys: PRIMARYkey: PRIMARY
key_len: 4
ref: dbt3.orders.o_custkey rows: 1filtered: 19.84
Extra: Using where
2 rows in set, 1 warning (1.06 sec)
現在,使用直方圖后,我們可以看到customer表的“吸引力”降低了,因為order表按條件過濾的行的百分比(30.62)幾乎是customer表按條件過濾行的百分比的兩倍(19.84%),這將導致低order表進行查找,
注意:這段感覺沒有翻譯恰當,英文原文如下,如果感覺翻譯比較生硬,參考原文
Now with the histogram we can see that it becomes less attractive to start with customer table since almost twice as many rows (19.84%) will cause look-ups into the order table.

優化器選擇對order表進行全表掃描(full sacn),此時執行計劃的代價看起來似憾訓高一些,,讓我們看一下SQL的執行時間:
mysql> SELECT count(*) FROM orders
JOIN customer ON o_custkey = c_custkey
WHERE o_orderdate < '1993-01-01' AND c_mktsegment = "AUTOMOBILE"\G
*************************** 1. row ***************************count(*): 451271 row in set (6.35 sec)
SQL陳述句的執行時間更短,明顯比之前要快了
查看資料的分布
直方圖資料存貯在Information_Schema.column_statistics表中,這個表的定義如下
+-------------+-------------+------+-----+---------+-------+
| Field | Type | Null | Key | Default | Extra |
+-------------+-------------+------+-----+---------+-------+
| SCHEMA_NAME | varchar(64) | NO | | NULL | |
| TABLE_NAME | varchar(64) | NO | | NULL | |
| COLUMN_NAME | varchar(64) | NO | | NULL | |
| HISTOGRAM | json | NO | | NULL | |
+-------------+-------------+------+-----+---------+-------+
它的一條記錄類似下面這樣:
SELECT SCHEMA_NAME, TABLE_NAME, COLUMN_NAME, JSON_PRETTY(HISTOGRAM) FROM information_schema.column_statistics WHERE COLUMN_NAME = 'c_mktsegment'\G
*************************** 1. row ***************************SCHEMA_NAME: dbt3
TABLE_NAME: customer
COLUMN_NAME: c_mktsegment
JSON_PRETTY(HISTOGRAM): {"buckets": [
[
"base64:type254:QVVUT01PQklMRQ==",
0.19837010534684954
],
[
"base64:type254:QlVJTERJTkc=",
0.3983104750546611
],
[
"base64:type254:RlVSTklUVVJF",
0.5978433710991851
],
[
"base64:type254:SE9VU0VIT0xE",
0.799801232359372
],
[
"base64:type254:TUFDSElORVJZ",
1.0
]
],
"data-type": "string","null-values": 0.0,
"collation-id": 255, "last-updated": "2018-03-02 20:21:48.271523","sampling-rate": 0.6709158000670916,
"histogram-type": "singleton",
"number-of-buckets-specified": 1024}
而且可以查看分布
SELECT FROM_BASE64(SUBSTRING_INDEX(v, ':', -1)) value, concat(round(c*100,1),'%') cumulfreq,
CONCAT(round((c - LAG(c, 1, 0) over()) * 100,1), '%') freq
FROM information_schema.column_statistics, JSON_TABLE(histogram->'$.buckets',
'$[*]' COLUMNS(v VARCHAR(60) PATH '$[0]', c double PATH '$[1]')) hist
WHERE schema_name = 'dbt3' and table_name = 'customer' and column_name = 'c_mktsegment';
+------------+-----------+-------+
| value | cumulfreq | freq |+------------+-----------+-------+
| AUTOMOBILE | 19.8% | 19.8% |
| BUILDING | 39.9% | 20.1% |
| FURNITURE | 59.9% | 19.9% |
| HOUSEHOLD | 79.9% | 20.1% |
| MACHINERY | 100.0% | 20.1% |
+------------+-----------+-------+
你也可以用下面語法洗掉直方圖資訊,
mysql> ANALYZE TABLE customer DROP HISTOGRAM on c_mktsegment;
+---------------+-----------+----------+---------------------------------------------------------+
| Table | Op | Msg_type | Msg_text |+---------------+-----------+----------+---------------------------------------------------------+
| dbt3.customer | histogram | status | Histogram statistics removed for column 'c_mktsegment'. |
+---------------+-----------+----------+---------------------------------------------------------+
1 row in set (0.00 sec)
Buckets
你會注意到,當我們創建一個直方圖時,我們需要指定buckets的數量,事實上,資料被分成包含特定值以及他們基數(cardinality)的一組Buckets,如果在上一個例子中檢查直方圖的型別,你會發現它是等寬直方圖(singleton)
"histogram-type": "singleton",
這種型別的直方圖最好的,因為基數是針對單個特定值, 如果這次我僅使用2個存盤桶(buckets)來重新創建直方圖(請記住,在c_mktsegment列中有4個不同的值):
mysql> ANALYZE TABLE customer UPDATE HISTOGRAM ON c_mktsegment WITH 2 BUCKETS;
+---------------+-----------+----------+---------------------------------------------------------+
| Table | Op | Msg_type | Msg_text |+---------------+-----------+----------+---------------------------------------------------------+
| dbt3.customer | histogram | status | Histogram statistics created for column 'c_mktsegment'. |
+---------------+-----------+----------+---------------------------------------------------------+
如果我檢查直方圖的型別:
mysql> SELECT SCHEMA_NAME, TABLE_NAME, COLUMN_NAME, JSON_PRETTY(HISTOGRAM)
FROM information_schema.column_statistics WHERE COLUMN_NAME = 'c_mktsegment'\G
*************************** 1. row ***************************SCHEMA_NAME: dbt3
TABLE_NAME: customer
COLUMN_NAME: c_mktsegment
JSON_PRETTY(HISTOGRAM): {"buckets": [
[
"base64:type254:QVVUT01PQklMRQ==",
"base64:type254:RlVSTklUVVJF",
0.5996992690844636,
3
],
[
"base64:type254:SE9VU0VIT0xE",
"base64:type254:TUFDSElORVJZ",
1.0,
2
]
],
"data-type": "string","null-values": 0.0,
"collation-id": 255, "last-updated": "2018-03-02 20:42:26.165898","sampling-rate": 0.6709158000670916,
"histogram-type": "equi-height",
"number-of-buckets-specified": 2}
現在的直方圖型別是等高直方圖,這意味著將連續范圍的值分組到存盤桶中,以使落入每個存盤桶的資料項的數量相同,
結論:
直方圖對那些不是索引中第一列的列非常有用,這些列用于JOIN、IN子查詢(IN-subqueries)或ORDER BY…LIMIT的查詢的WHERE條件下使用,
另外, 可以考慮嘗試使用足夠的存盤通來獲取等寬直方圖,
轉載請註明出處,本文鏈接:https://www.uj5u.com/shujuku/111038.html
標籤:MySQL
上一篇:MySQL第三課
下一篇:MySQL的統計資訊學習總結
