不要再問我 in，exists 走不走索引了-有解無憂

微信搜『煙雨星空』，獲取最新好文，

前言

最近，有一個業務需求，給我一份資料 A ，把它在資料庫 B 中存在，而又比 A 多出的部分算出來，由于資料比較雜亂，我這里簡化模型，

然后就會發現，我去，這不就是 not in ，not exists 嘛，

那么問題來了，in, not in , exists , not exists 它們有什么區別，效率如何？

曾經從網上聽說，in 和 exists 不會走索引，那么事實真的是這樣嗎？

帶著疑問，我們研究下去，

注意： 在說這個問題時，不說明 MySQL 版本的都是耍流氓，我這里用的是 5.7.18 ，

用法講解

為了方便，我們創建兩張表 t1 和 t2 ，并分別加入一些資料，（id為主鍵，name為普通索引）

-- t1
DROP TABLE IF EXISTS `t1`;
CREATE TABLE `t1` (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `name` varchar(255) DEFAULT NULL,
  `address` varchar(255) DEFAULT NULL,
  PRIMARY KEY (`id`),
  KEY `idx_t1_name` (`name`(191)) USING BTREE
) ENGINE=InnoDB AUTO_INCREMENT=1009 DEFAULT CHARSET=utf8mb4;

INSERT INTO `t1` VALUES ('1001', '張三', '北京'), ('1002', '李四', '天津'), ('1003', '王五', '北京'), ('1004', '趙六', '河北'), ('1005', '杰克', '河南'), ('1006', '湯姆', '河南'), ('1007', '貝爾', '上海'), ('1008', '孫琪', '北京');

-- t2
DROP TABLE IF EXISTS `t2`;
CREATE TABLE `t2`  (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `name` varchar(255) CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci NULL DEFAULT NULL,
  `address` varchar(255) CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci NULL DEFAULT NULL,
  PRIMARY KEY (`id`) USING BTREE,
  INDEX `idx_t2_name`(`name`(191)) USING BTREE
) ENGINE = InnoDB AUTO_INCREMENT = 1014 CHARACTER SET = utf8mb4 COLLATE = utf8mb4_general_ci ROW_FORMAT = Dynamic;

INSERT INTO `t2` VALUES (1001, '張三', '北京');
INSERT INTO `t2` VALUES (1004, '趙六', '河北');
INSERT INTO `t2` VALUES (1005, '杰克', '河南');
INSERT INTO `t2` VALUES (1007, '貝爾', '上海');
INSERT INTO `t2` VALUES (1008, '孫琪', '北京');
INSERT INTO `t2` VALUES (1009, '曹操', '魏國');
INSERT INTO `t2` VALUES (1010, '劉備', '蜀國');
INSERT INTO `t2` VALUES (1011, '孫權', '吳國');
INSERT INTO `t2` VALUES (1012, '諸葛亮', '蜀國');
INSERT INTO `t2` VALUES (1013, '典韋', '魏國');

那么，對于當前的問題，就很簡單了，用 not in 或者 not exists 都可以把 t1 表中比 t2 表多出的那部分資料給挑出來，（當然，t2 比 t1 多出來的那部分不算）

這里假設用 name 來匹配資料，

select * from t1 where name not in (select name from t2);
或者用
select * from t1 where not exists (select name from t2 where t1.name=t2.name);

得到的結果都是一樣的，

但是，需要注意的是，not in 和 not exists 還是有不同點的，

在使用 not in 的時候，需要保證子查詢的匹配欄位是非空的，如，此表 t2 中的 name 需要有非空限制，如若不然，就會導致 not in 回傳的整個結果集為空，

例如，我在 t2 表中加入一條 name 為空的資料，

INSERT INTO `t2` VALUES (1014, NULL, '魏國');

則此時，not in 結果就會回傳空，

另外需要明白的是， exists 回傳的結果是一個 boolean 值 true 或者 false ，而不是某個結果集，因為它不關心回傳的具體資料是什么，只是外層查詢需要拿這個布林值做判斷，

區別是，用 exists 時，若子查詢查到了資料，則回傳真，用 not exists 時，若子查詢沒有查到資料，則回傳真，

由于 exists 子查詢不關心具體回傳的資料是什么，因此，以上的陳述句完全可以修改為如下，

-- 子查詢中 name 可以修改為其他任意的欄位，如此處改為 1 ，
select * from t1 where not exists (select 1 from t2 where t1.name=t2.name);

從執行效率來說，1 > column > * ，因此推薦用 select 1，（準確的說應該是常量值）

in, exists 執行流程

1、對于 in 查詢來說，會先執行子查詢，如上邊的 t2 表，然后把查詢得到的結果和外表 t1 做笛卡爾積，再通過條件進行篩選（這里的條件就是指 name 是否相等），把每個符合條件的資料都加入到結果集中，

sql 如下，

select * from t1 where name in (select name from t2);

偽代碼如下：

for(x in A){
    for(y in B){
    	if(condition is true) {result.add();}
    }
}

這里的 condition 其實就是對比兩張表中的 name 是否相同，

2、對于 exists 來說，是先查詢遍歷外表 t1 ，然后每次遍歷時，再檢查在內表是否符合匹配條件，即檢查是否存在 name 相等的資料，

sql 如下，

select * from t1 where name exists (select 1 from t2);

偽代碼如下：

for(x in A){
	if(exists condition is true){result.add();}
}

對應于此例，就是從 id 為 1001 開始遍歷 t1 表，然后遍歷時檢查 t2 中是否有相等的 name ，

如 id=1001時，張三存在于 t2 表中，則回傳 true，把 t1 中張三的這條記錄加入到結果集，繼續下次回圈， id=1002 時，李四不在 t2 表中，則回傳 false，不做任何操作，繼續下次回圈，直到遍歷完整個 t1 表，

是否走索引？

針對網上說的 in 和 exists 不走索引，那么究竟是否如此呢？

我們在 MySQL 5.7.18 中驗證一下，（注意版本號哦）

單表查詢

首先，驗證單表的最簡單的情況，我們就以 t1 表為例，id為主鍵， name 為普通索引，

分別執行以下陳述句，

explain select * from t1 where id in (1001,1002,1003,1004);
explain select * from t1 where id in (1001,1002,1003,1004,1005);
explain select * from t1 where name in ('張三','李四');
explain select * from t1 where name in ('張三','李四','王五');

為什么我要分別查不同的 id 個數呢？看截圖，

會驚奇的發現，當 id 是四個值時，還走主鍵索引，而當 id 是五個值時，就不走索引了，這就很耐人尋味了，

再看 name 的情況，

同樣的當值多了之后，就不走索引了，

所以，我猜測這個跟匹配欄位的長度有關，按照漢字是三個位元組來計算，且程式設計中喜歡用2的n次冪的尿性，這里大概就是以 16 個位元組為分界點，

然而，我又以同樣的資料，去我的服務器上查詢（版本號 5.7.22），發現四個id值時，就不走索引了，因此，估算這里的臨界值為 12 個位元組，

不管怎樣，這說明了，在 MySQL 中應該對 in 查詢的位元組長度是有限制的，（沒有官方確切說法，所以，僅供參考）

多表涉及子查詢

我們主要是去看當前的這個例子中的兩表查詢時， in 和 exists 是否走索引，

一、分別執行以下陳述句，主鍵索引（id）和普通索引（name），在 in , not in 下是否走索引，

explain select * from t1 where id in (select id from t2); --1
explain select * from t1 where name in (select name from t2); --2
explain select * from t1 where id not in (select id from t2); --3
explain select * from t1 where name not in (select name from t2); --4

結果截圖如下，

1、t1 走索引，t2 走索引，

2、t1 不走索引，t2不走索引，（此種情況，實測若把name改為唯一索引，則t1也會走索引）

3、t1 不走索引，t2走索引，

4、t1不走索引，t2不走索引，

我滴天，這結果看起來亂七八糟的，好像走不走索引，完全看心情，

但是，我們發現只有第一種情況，即用主鍵索引欄位匹配，且用 in 的情況下，兩張表才都走索引，

這個到底是不是規律呢？有待考察，且往下看，

二、接下來測驗，主鍵索引和普通索引在 exists 和 not exists 下的情況，sql如下，

explain select * from t1 where exists (select 1 from t2 where t1.id=t2.id);
explain select * from t1 where exists (select 1 from t2 where t1.name=t2.name);
explain select * from t1 where not exists (select 1 from t2 where t1.id=t2.id);
explain select * from t1 where not exists (select 1 from t2 where t1.name=t2.name);

這個結果就非常有規律了，且看，

有沒有發現， t1 表哪種情況都不會走索引，而 t2 表是有索引的情況下就會走索引，為什么會出現這種情況？

其實，上一小節說到了 exists 的執行流程，就已經說明問題了，

它是以外層表為驅動表，無論如何都會回圈遍歷的，所以會全表掃描，而內層表通過走索引，可以快速判斷當前記錄是否匹配，

效率如何？

針對網上說的 exists 一定比 in 的執行效率高，我們做一個測驗，

分別在 t1，t2 中插入 100W，200W 條資料，

我這里，用的是自定義函式來回圈插入，陳述句參考如下，（沒有把表名抽離成變數，因為我沒有找到方法，尷尬）

-- 傳入需要插入資料的id開始值和資料量大小，函式回傳結果為最終插入的條數，此值正常應該等于資料量大小，
-- id自增，回圈往 t1 表添加資料，這里為了方便，id、name取同一個變數，address就為北京，
delimiter // 
drop function if exists insert_datas1//
create function insert_datas1(in_start int(11),in_len int(11)) returns int(11)
begin  
  declare cur_len int(11) default 0;
  declare cur_id int(11);
  set cur_id = in_start;
	
  while cur_len < in_len do
     insert into t1 values(cur_id,cur_id,'北京');
  set cur_len = cur_len + 1;
  set cur_id = cur_id + 1;
  end while; 
  return cur_len;
end  
//
delimiter ;
-- 同樣的，往 t2 表插入資料
delimiter // 
drop function if exists insert_datas2//
create function insert_datas2(in_start int(11),in_len int(11)) returns int(11)
begin  
  declare cur_len int(11) default 0;
  declare cur_id int(11);
  set cur_id = in_start;
	
  while cur_len < in_len do
     insert into t2 values(cur_id,cur_id,'北京');
  set cur_len = cur_len + 1;
  set cur_id = cur_id + 1;
  end while; 
  return cur_len;
end  
//
delimiter ;

在此之前，先清空表里的資料，然后執行函式，

select insert_datas1(1,1000000);

對 t2 做同樣的處理，不過為了兩張表資料有交叉，就從 70W 開始，然后插入 200W 資料，

select insert_datas2(700000,2000000);

在家里的電腦，實際執行時間，分別為 36s 和 74s，

不知為何，家里的電腦還沒有在 Docker 虛擬機中跑的腳本快，，害，就這樣湊合著用吧，

等我有了新歡錢，就把它換掉，哼哼，

同樣的，把上邊的執行計劃都執行一遍，進行對比，我這里就不貼圖了，

in 和 exists 孰快孰慢

為了方便，主要拿以下這兩個 sql 來對比分析，

select * from t1 where id in (select id from t2);
select * from t1 where exists (select 1 from t2 where t1.id=t2.id);

執行結果顯示，兩個 sql 分別執行 1.3s 和 3.4s ，

注意此時，t1 表資料量為 100W， t2 表資料量為 200W ，

按照網上對 in 和 exists 區別的通俗說法，

如果查詢的兩個表大小相當，那么用in和exists差別不大；如果兩個表中一個較小一個較大，則子查詢表大的用exists，子查詢表小的用in；

對應于此處就是：

當 t1 為小表， t2 為大表時，應該用 exists ，這樣效率高，
當 t1 為大表，t2 為小表時，應該用 in，這樣效率較高，

而我用實際資料測驗，就把第一種說法給推翻了，因為很明顯，t1 是小表，但是 in 比 exists 的執行速度還快，

為了繼續測驗它這個觀點，我把兩個表的內表外表關系調換一下，讓 t2 大表作為外表，來對比查詢，

select * from t2 where id in (select id from t1);
select * from t2 where exists (select 1 from t1 where t1.id=t2.id);

執行結果顯示，兩個 sql 分別執行 1.8s 和 10.0s ，

是不是很有意思，可以發現，

對于 in 來說，大表小表調換了內外層關系，執行時間并無太大區別，一個是 1.3s，一個是 1.8s，
對于 exists 來說，大小表調換了內外層關系，執行時間天壤之別，一個是 3.4s ，一個是 10.0s，足足慢了兩倍，

一、以查詢優化器維度對比，

為了探究這個結果的原因，我去查看它們分別在查詢優化器中優化后的 sql ，

以 select * from t1 where id in (select id from t2); 為例，順序執行以下兩個陳述句，

-- 此為 5.7 寫法，如果是 5.6版本，需要用 explain extended ...
explain select * from t1 where id in (select id from t2);
-- 本意為顯示警告資訊，但是和 explain 一塊兒使用，就會顯示出優化后的sql，需要注意使用順序，
show warnings;

在結果 Message 里邊就會顯示我們要的陳述句，

-- message 優化后的sql
select `test`.`t1`.`id` AS `id`,`test`.`t1`.`name` AS `name`,`test`.`t1`.`address` AS `address` from `test`.`t2` join `test`.`t1` where (`test`.`t2`.`id` = `test`.`t1`.`id`)

可以發現，這里它把 in 轉換為了 join 來執行，

這里沒有用 on，而用了 where，是因為當只有 join 時，后邊的 on 可以用 where 來代替，即 join on 等價于 join where ，

PS： 這里我們也可以發現，select * 最侄訓被轉化為具體的欄位，知道為什么我們不建議用 select * 了吧，

同樣的，以 t2 大表為外表的查詢情況，也查看優化后的陳述句，

explain select * from t2 where id in (select id from t1);
show warnings;

我們會發現，它也會轉化為 join 的，

select `test`.`t2`.`id` AS `id`,`test`.`t2`.`name` AS `name`,`test`.`t2`.`address` AS `address` from `test`.`t1` join `test`.`t2` where (`test`.`t2`.`id` = `test`.`t1`.`id`)

這里不再貼 exists 的轉化 sql ，其實它沒有什么大的變化，

二、以執行計劃維度對比，

我們再以執行計劃維度來對比他們的區別，

explain select * from t1 where id in (select id from t2);
explain select * from t2 where id in (select id from t1);
explain select * from t1 where exists (select 1 from t2 where t1.id=t2.id);
explain select * from t2 where exists (select 1 from t1 where t1.id=t2.id);

執行結果分別為，

可以發現，對于 in 來說，大表 t2 做外表還是內表，都會走索引的，小表 t1 做內表時也會走索引，看它們的 rows 一列也可以看出來，前兩張圖結果一樣，

對于 exists 來說，當小表 t1 做外表時，t1 全表掃描，rows 近 100W；當大表 t2 做外表時， t2 全表掃描，rows 近 200W ，這也是為什么 t2 做外表時，執行效率非常低的原因，

因為對于 exists 來說，外表總會執行全表掃描的，當然表資料越少越好了，

最終結論： 外層大表內層小表，用in，外層小表內層大表，in和exists效率差不多（甚至 in 比 exists 還快，而并不是網上說的 exists 比 in 效率高），

not in 和 not exists 孰快孰慢

此外，實測對比 not in 和 not exists ，

explain select * from t1 where id not in (select id from t2);
explain select * from t1 where not exists (select 1 from t2 where t1.id=t2.id);
explain select * from t1 where name not in (select name from t2);
explain select * from t1 where not exists (select 1 from t2 where t1.name=t2.name);

explain select * from t2 where id not in (select id from t1);
explain select * from t2 where not exists (select 1 from t1 where t1.id=t2.id);
explain select * from t2 where name not in (select name from t1);
explain select * from t2 where not exists (select 1 from t1 where t1.name=t2.name);

小表做外表的情況下，對于主鍵來說， not exists 比 not in 快，對于普通索引來說， not in 和 not exists 差不了多少，甚至 not in 會稍快，

大表做外表的情況下，對于主鍵來說， not in 比 not exists 快，對于普通索引來說， not in 和 not exists 差不了多少，甚至 not in 會稍快，

感興趣的同學，可自行嘗試，以上邊的兩個維度（查詢優化器和執行計劃）分別來對比一下，

join 的嵌套回圈 (Nested-Loop Join)

為了理解為什么這里的 in 會轉換為 join ，我感覺有必要了解一下 join 的三種嵌套回圈連接，

1、簡單嵌套回圈連接，Simple Nested-Loop Join ，簡稱 SNLJ

join 即是 inner join ，內連接，它是一個笛卡爾積，即利用雙層回圈遍歷兩張表，

我們知道，一般在 sql 中都會以小表作為驅動表，所以，對于 A，B 兩張表，若A的結果集較少，則把它放在外層回圈，作為驅動表，自然，B 就在內層回圈，作為被驅動表，

簡單嵌套回圈，就是最簡單的一種情況，沒有做任何優化，

因此，復雜度也是最高的，O(mn)，偽代碼如下，

for(id1 in A){
    for(id2 in B){
        if(id1==id2){
            result.add();
        }
    }
}

2、索引嵌套回圈連接，Index Nested-Loop Join ，簡稱 INLJ

看名字也能看出來了，這是通過索引進行匹配的，外層表直接和內層表的索引進行匹配，這樣就不需要遍歷整個內層表了，利用索引，減少了外層表和內層表的匹配次數，

所以，此種情況要求內層表的列要有索引，

偽代碼如下，

for(id1 in A){
    if(id1 matched B.id){
        result.add();
    }
}

3、塊索引嵌套連接，Block Nested-Loop Join ，簡稱 BNLJ

塊索引嵌套連接，是通過快取外層表的資料到 join buffer 中，然后 buffer 中的資料批量和內層表資料進行匹配，從而減少內層回圈的次數，

以外層回圈100次為例，正常情況下需要在內層回圈讀取外層資料100次，如果以每10條資料存入快取buffer中，并傳遞給內層回圈，則內層回圈只需要讀取10次(100/10)就可以了，這樣就降低了內層回圈的讀取次數，

MySQL 官方檔案也有相關說明，可以參考：https://dev.mysql.com/doc/refman/5.7/en/nested-loop-joins.html#block-nested-loop-join-algorithm

所以，這里轉化為 join，可以用到索引嵌套回圈連接，從而提高了執行效率，

宣告： 以上是以我的測驗資料為準，測出來的結果，實際真實資料和測驗結果很有可能會不太一樣，如果有不同意見，歡迎留言討論，

轉載請註明出處，本文鏈接：https://www.uj5u.com/houduan/33555.html

標籤：Java

上一篇：go：流程控制

下一篇：Redis秒殺實戰-微信搶紅包-秒殺庫存，附案例原始碼（Jmeter壓測）