MySQL Hash Join前世今生-有解無憂

GreatSQL社區原創內容未經授權不得隨意使用，轉載請聯系小編并注明來源，
GreatSQL是MySQL的國產分支版本，使用上與MySQL一致，
作者：nw

MySQL Hash Join前世今生

因作業需要，對MySQL Hash Join的內部實作做了一些探索和實踐，對這個由8.0.18開始引入的連接演算法有了一定的了解，寫文總結與各位大佬分享，歡迎大家指教，因篇幅較長，這份總結分成若干部分，我們今天先一起來看一下MySQL Hash join的變遷史，

爬了一下 MySQL worklog[1]，并結合原始碼及各版本的實際使用，個人認為比較重要的worklogs為如下幾個，其它的變更一般圍繞這些worklogs做的小調整或bugfixes，這里我們就不一一列舉，

1. WL#2241: Hash join (變更版本：8.0.18)

主要內容：

新增執行類HashJoinIterator，實作hash join演算法原型 (支持on-disk hash)
HASH JOIN 僅支持INNER JOIN，并對使用hash join做了限制：關聯表連接條件必須至少包含一條等值條件(equi-join condition, 如t1.a = t2.a)，且join條件中的列不包含索引

注：這里我認為官網的 Release Notes[2] 描述是不太準確的，以如下例子為例，雖然該查詢包含了非等值連接條件(non-equi-join condition, 如 t1.a <> t2.a ,t1.a = 1, t2.a > 1 等），但實際查詢中還是使用hash join的, 因為查詢陳述句在決議執行程序中，可能會經歷陳述句重寫、sql優化等步驟，與表象上會有所不同，建議借助explain工具，查看實際上查詢陳述句選擇的join演算法，

-- 版本：8.0.18
-- 在創建iterator時，t1.a > 1 會被當成表的過濾條件，而非inner join的join條件
-- 此查詢仍使用了hash join（join 條件為空）
EXPLAIN FORMAT=tree SELECT * FROM t1 JOIN t2 ON t1.i > 1;
-> Inner hash join  (cost=1.17 rows=3)
    -> Table scan on t2  (cost=0.34 rows=2)
    -> Hash
        -> Filter: (t1.i > 1)  (cost=0.65 rows=1)
            -> Table scan on t1  (cost=0.65 rows=4)

(盡量)使用HASH JOIN演算法替換BNL(Blocked Nested-Loop)連接演算法

2. WL#13377: Add support for hash outer, anti and semi join( 變更版本：8.0.20)

主要內容:

HASH INNER JOIN改進

INNER JOIN中的non-equi-join conditions, 會被附為hash join的過濾條件:

-- 版本：8.0.20
EXPLAIN FORMAT=tree SELECT * FROM t1 JOIN t2 ON t1.i <> t2.i;
-> Filter: (t1.i <> t2.i)  (cost=1.10 rows=2)
    -> Inner hash join (no condition)  (cost=1.10 rows=2)
        -> Table scan on t2  (cost=0.18 rows=2)
        -> Hash
            -> Table scan on t1  (cost=0.45 rows=2)

HAH JOIN支持outer join/anti join/semi join

-- 版本：8.0.20
-- Left outer join
EXPLAIN FORMAT=tree SELECT * FROM t1 LEFT JOIN t2 ON t1.i=t2.i;
-> Left hash join (t2.i = t1.i)  (cost=0.88 rows=4)                                                                                                                                                                          
    -> Table scan on t1  (cost=0.45 rows=2)                                                                                                                                                                                    
    -> Hash                                                                                                                                                                                                                    
        -> Table scan on t2  (cost=0.23 rows=2)

-- Right outer join（注：MySQL在parser階段會將所有的right join改寫為left join
--                      所以我們這里看到的explain為Left hash join
EXPLAIN FORMAT=tree SELECT * FROM t1 RIGHT JOIN t2 ON t1.i=t2.i;
-> Left hash join (t1.i = t2.i)  (cost=0.88 rows=4)                                                                                                                                                                          
    -> Table scan on t2  (cost=0.45 rows=2)                                                                                                                                                                                    
    -> Hash                                                                                                                                                                                                                    
        -> Table scan on t1  (cost=0.23 rows=2)

-- Semijoin
EXPLAIN FORMAT=tree SELECT * FROM t1 WHERE (t1.i) IN (SELECT t2.i FROM t2);
-> Hash semijoin (t2.i = t1.i)  (cost=0.83 rows=2)
    -> Table scan on t1  (cost=0.45 rows=2)
    -> Hash
        -> Table scan on t2  (cost=0.18 rows=2)

-- Antijoin
EXPLAIN FORMAT=tree 
SELECT * FROM t2 WHERE NOT EXISTS (SELECT * FROM t1 WHERE t1.i = t2.i);
-> Hash antijoin (t1.i = t2.i)  (cost=1.10 rows=4)                                                                                                                                                                           
    -> Table scan on t2  (cost=0.45 rows=2)                                                                                                                                                                                    
    -> Hash                                                                                                                                                                                                                    
        -> Table scan on t1  (cost=0.45 rows=2)

所有使用BNL的連接，都將被轉為使用HASH JOIN
non-equi-join conditions 處理

Hash join iterator引入"extra" condition的概念，部分 non-equi-join conditions會被當成Hash join iterator的extra condition, 在建hash table時，join key的計算不依賴這些條件，但會在hash查找到匹配項后，作為附加的過濾條件：

-- 版本: 8.0.20
-- 注： t1.i > 1 會被放到hash join的 extra conditions中
EXPLAIN FORMAT=tree SELECT * FROM  t1 LEFT JOIN t2 ON t1.i=t2.i AND t1.i > 1;
-> Left hash join (t2.i = t1.i), extra conditions: (t1.i > 1)  (cost=0.88 rows=4)
    -> Table scan on t1  (cost=0.45 rows=2)
    -> Hash
        -> Table scan on t2  (cost=0.23 rows=2)

相關原始碼：

// 代碼版本：8.0.20 HEAD: commit 91a17cedb1ee880fe7915fb14cfd74c04e8d6588
// 檔案名：sql/hash_join_iterator.cc
int HashJoinIterator::ReadNextJoinedRowFromHashTable() {
  int res;
  bool passes_extra_conditions = false;
  do { // 所有匹配行都需要多做一個extra condition的判定，因為有可能存在不同行的記錄
       // 映射在同一個join key的情況，因此需要通過遍歷逐條讀取出符合extra conditions
       // 的資料
    res = ReadJoinedRow();  // 讀取通過join key查找已經得到的匹配行(單行記錄）
    DBUG_ASSERT(res == 0 || res == -1);
    
    if (res == 0) {
      passes_extra_conditions = JoinedRowPassesExtraConditions();
      if (thd()->is_error()) {
        // Evaluation of extra conditions raised an error, so abort the join.
        return 1;
      }

      if (!passes_extra_conditions) {
        ++m_hash_map_iterator;
      }
    }
  } while (res == 0 && !passes_extra_conditions);
}

3. WL#13459: Optimize hash table in hash join (變更版本：8.0.23）

主要內容：

優化hash join table的創建方法

這里MySQL所說的“優化”，實際上會更激進一點，這個版本中，MySQL直接使用了一個基于 robin hood hashing[3] 實作的 開源hash table[4] ，更換了原先的hash join table實作( from mem_root_unordered_multimap to robin_hood::unordered_flat_map)
優化記憶體管理和使用，降低了 on-disk hash 的頻率，提高了記憶體有效使用率等（其他的改進內容及提升的效果可以參考MySQL 8.0.23的release notes[5] 以及 worklog #13459[6] 的Low Level Design頁面）

本篇我們對MySQL hash join的3個重要變更做了簡要的總結，算是對它的前世今生有了印象，謝謝各位閱讀；之后讓我們會結合具體的sql查詢樣例，去跟蹤一下對應的代碼執行流程，不日更新，敬請期待，

參考檔案

[1] MySQL worklog: https://dev.mysql.com/worklog/

[2] MySQL 8.0.18 release notes#optimizer: https://dev.mysql.com/doc/relnotes/mysql/8.0/en/news-8-0-18.html#mysqld-8-0-18-optimizer

[3] robin hood hashing 演算法介紹: https://programming.guide/robin-hood-hashing.html

[4] robin hood hasing 開源實作: https://github.com/martinus/robin-hood-hashing

[5] MySQL 8.0.23 release notes#optimizer: https://dev.mysql.com/doc/relnotes/mysql/8.0/en/news-8-0-23.html#mysqld-8-0-23-optimizer

[6] MySQL worklog #13459: https://dev.mysql.com/worklog/task/?id=13459

Enjoy GreatSQL ??

關于 GreatSQL

GreatSQL是由萬里資料庫維護的MySQL分支，專注于提升MGR可靠性及性能，支持InnoDB并行查詢特性，是適用于金融級應用的MySQL分支版本，

相關鏈接： GreatSQL社區 Gitee GitHub Bilibili

GreatSQL社區：

捉蟲活動詳情：https://greatsql.cn/thread-97-1-1.html

社區博客有獎征稿詳情：https://greatsql.cn/thread-100-1-1.html

6440

技術交流群：

微信：掃碼添加GreatSQL社區助手微信好友，發送驗證資訊加群，

)

轉載請註明出處，本文鏈接：https://www.uj5u.com/shujuku/537750.html

標籤：MySQL

上一篇：分頁查詢總結

下一篇：Oracle中ALTER TABLE的五種用法（一）