ElasticSearch如何更新集群的狀態-有解無憂

ElasticSearch如何更新集群的狀態

最近發生了很多事情，甚至對自己的技術能力和學習方式產生了懷疑，所以有一段時間沒更新文章了，估計以后更新的頻率會越來越少，希望有更多的沉淀而不是簡單地分享，讓我有感悟的是，最近看到一篇關于ES集群狀態更新的文章Elasticsearch Distributed Consistency Principles Analysis (2) - Meta，和 “提交給執行緒池的Runnable任務是以怎樣的順序執行的？”這個問題，因此，結合ES6.3.2原始碼，分析一下ES的Master節點是如何更新集群狀態的，

分布式系統的集群狀態一般是指各種元資料資訊，通俗地講，在ES中創建了一個Index，這個Index的Mapping結構資訊、Index由幾個分片組成，這些分片分布在哪些節點上，這樣的資訊就組成了集群的狀態，當Client創建一個新索引、或者洗掉一個索、或者進行快照備份、或者集群又進行了一次Master選舉，這些都會導致集群狀態的變化，概括一下就是：發生了某個事件，導致集群狀態發生了變化，產生了新集群狀態后，如何將新的狀態應用到各個節點上去，并且保證一致性，

在ES中，各個模塊發生一些事件，會導致集群狀態變化，并由org.elasticsearch.cluster.service.ClusterService#submitStateUpdateTask(java.lang.String, T)提交集群狀態變化更新任務，當任務執行完成時，就產生了新的集群狀態，然后通過"二階段提交協議"將新的集群狀態應用到各個節點上，這里可大概了解一下有哪些模塊的操作會提交一個更新任務，比如：

MetaDataDeleteIndexService#deleteIndices 洗掉索引
org.elasticsearch.snapshots.SnapshotsService#createSnapshot 創建快照
org.elasticsearch.cluster.metadata.MetaDataIndexTemplateService#putTemplate 創建索引模板

因此各個Service(比如：MetaDataIndexTemplateService)都持有org.elasticsearch.cluster.service.ClusterService實體參考，通過ClusterService#submitStateUpdateTask方法提交更新集群狀態的任務，

既然創建新索引、洗掉索引、修改索引模板、創建快照等都會觸發集群狀態更新，那么如何保證這些更新操作是"安全"的？比如操作A是洗掉索引，操作B是對索引做快照備份，操作A、B的順序不當，就會引發錯誤！比如，索引都已經洗掉了，那還怎么做快照？因此，為了防止這種并發操作對集群狀態更新的影響，org.elasticsearch.cluster.service.MasterService中采用單執行緒執行方式提交更新集群狀態的任務，狀態更新任務由org.elasticsearch.cluster.service.MasterService.Batcher.UpdateTask表示，它本質上是一個具有優先級特征的Runnable任務：

//PrioritizedRunnable 實作了Comparable介面，compareTo方法比較任務的優先級
public abstract class PrioritizedRunnable implements Runnable, Comparable<PrioritizedRunnable> {
	
    private final Priority priority;//Runnable任務優先級
    private final long creationDate;
    private final LongSupplier relativeTimeProvider;
    
     @Override
    public int compareTo(PrioritizedRunnable pr) {
        return priority.compareTo(pr.priority);
    }
}

而單執行緒的執行方式，則是通過org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor執行緒池實作的，看org.elasticsearch.common.util.concurrent.EsExecutors#newSinglePrioritizing執行緒池的創建：

public static PrioritizedEsThreadPoolExecutor newSinglePrioritizing(String name, ThreadFactory threadFactory, ThreadContext contextHolder, ScheduledExecutorService timer) {
    //core pool size == max pool size ==1，說明該執行緒池里面只有一個作業執行緒
        return new PrioritizedEsThreadPoolExecutor(name, 1, 1, 0L, TimeUnit.MILLISECONDS, threadFactory, contextHolder, timer);
    }

而執行緒池的任務佇列則是采用：PriorityBlockingQueue（底層是個陣列，資料結構是：堆 Heap），通過compareTo方法比較Priority，從而決定任務的排隊順序，

//PrioritizedEsThreadPoolExecutor#PrioritizedEsThreadPoolExecutor
PrioritizedEsThreadPoolExecutor(String name, int corePoolSize, int maximumPoolSize, long keepAliveTime, TimeUnit unit,ThreadFactory threadFactory, ThreadContext contextHolder, ScheduledExecutorService timer) {
        super(name, corePoolSize, maximumPoolSize, keepAliveTime, unit, new PriorityBlockingQueue<>(), threadFactory, contextHolder);
        this.timer = timer;
    }

這里想提一下這種只采用一個執行緒執行任務狀態更新的思路，它與Redis采用單執行緒執行Client的操作命令是一致的，各個Redis Client向Redis Server發起操作請求，Redis Server最終是以一個執行緒來"順序地"執行各個命令，單執行緒執行方式，避免了資料并發操作導致的不一致性，并且不需要執行緒同步，畢竟同步一般是通過加鎖來實作的，而加鎖會影響程式性能，

在這里，我想插一個問題：JDK執行緒池執行任務的順序是怎樣的？通過java.util.concurrent.ThreadPoolExecutor#execute方法先提交到執行緒池中的任務，一定會優先執行嗎？這個問題經常被人問到，哈哈，但是，真正地理解，卻不容易，因為它涉及到執行緒池引數，core pool size、max pool size 、任務佇列的長度以及任務到來的時機，其實JDK原始碼中的注釋已經講得很清楚了：

        /*
         * Proceed in 3 steps:
         *
         * 1. If fewer than corePoolSize threads are running, try to
         * start a new thread with the given command as its first
         * task.  The call to addWorker atomically checks runState and
         * workerCount, and so prevents false alarms that would add
         * threads when it shouldn't, by returning false.
         *
         * 2. If a task can be successfully queued, then we still need
         * to double-check whether we should have added a thread
         * (because existing ones died since last checking) or that
         * the pool shut down since entry into this method. So we
         * recheck state and if necessary roll back the enqueuing if
         * stopped, or start a new thread if there are none.
         *
         * 3. If we cannot queue task, then we try to add a new
         * thread.  If it fails, we know we are shut down or saturated
         * and so reject the task.
         */

任務提交到執行緒池，如果執行緒池的活躍執行緒數量小于 core pool size，那么直接創建新執行緒執行任務，這種情況下任務是不會入佇列的，
當執行緒池中的活躍執行緒數量已經達到core pool size時，繼續提交任務，這時的任務就會入佇列排隊，
當任務佇列已經滿了時，同時又有新任務提交過來，如果執行緒池的活躍執行緒數小于 max pool size，那么會創建新的執行緒，執行這些剛提交過來的任務，此時的任務也不會入佇列排隊，（注意：這里新創建的執行緒并不是從任務佇列中取任務，而是直接執行剛剛提交過來的任務，而那些前面已經提交了的在任務佇列中排隊的任務反而不能優先執行，換句話說：任務的執行順序并不是嚴格按提交順序來執行的）

代碼驗證一下如下，會發現：如果 cool pore size 不等于 max pool size，那么后提交的任務，反而可能先開始執行，因為，先提交的任務在佇列中排隊，而后提交的任務直接被新創建的執行緒執行了，省去了排隊程序，（這里為了方便看結果，每個任務的所需要的執行時間都是相同的，即1s鐘）

import com.google.common.util.concurrent.ThreadFactoryBuilder;
import java.util.concurrent.*;

/**
 * @author psj
 * @date 2019/11/14
 */
public class ThreadPoolTest {
    public static void main(String[] args) throws InterruptedException{

        ThreadFactory threadFactory = new ThreadFactoryBuilder().setNameFormat("test-%d").build();
        BlockingQueue<Runnable> workQueue = new LinkedBlockingQueue<>(4);
        ThreadPoolExecutor executorSevice = new ThreadPoolExecutor(1, 4, 0, TimeUnit.HOURS,
                workQueue, threadFactory, new ThreadPoolExecutor.DiscardPolicy());

        for (int i = 1; i <=8; i++) {
            MyRunnable task = new MyRunnable(i, workQueue);
            executorSevice.execute(task);
            sleepMills(200);
            System.out.println("submit: " + i  + ", queue size:" + workQueue.size() + ", active count:" + executorSevice.getActiveCount());
        }
        Thread.currentThread().join();
    }


    public static class MyRunnable implements Runnable {
        private int sequence;
        private BlockingQueue taskQueue;
        public MyRunnable(int sequence, BlockingQueue taskQueue) {
            this.sequence = sequence;
            this.taskQueue = taskQueue;
        }
        @Override
        public void run() {
            //模擬任務需要1秒鐘才能執行完成
            sleepMills(1000);
            System.out.println("task :" + sequence + " finished, current queue size:" + taskQueue.size());
        }
    }

    public static void sleepMills(int mills) {
        try {
            TimeUnit.MILLISECONDS.sleep(mills);
        } catch (InterruptedException e) {

        }
    }
}

當提交第7個任務時，此時任務佇列size為4，已經滿了，因此，繼續創建新執行緒(因為此時活躍執行緒數小于max pool size)執行第7個任務，也可以發現：第7個任務比那些在佇列中排隊的任務(比如第2、3、4個任務)要早執行完成，這是因為第7個任務沒有入佇列排隊，而是直接創建新執行緒執行它，

submit: 1, queue size:0, active count:1
submit: 2, queue size:1, active count:1
submit: 3, queue size:2, active count:1
submit: 4, queue size:3, active count:1
task :1 finished, current queue size:4
submit: 5, queue size:3, active count:1
submit: 6, queue size:4, active count:1
submit: 7, queue size:4, active count:2
submit: 8, queue size:4, active count:3
task :2 finished, current queue size:4
task :7 finished, current queue size:3
task :8 finished, current queue size:2
task :3 finished, current queue size:1
task :4 finished, current queue size:0
task :5 finished, current queue size:0
task :6 finished, current queue size:0

那么，有什么辦法，能夠保證先提交的任務，一定先執行嗎？還是有的：那就是將執行緒池的核心執行緒數core pool size設定成 max pool size 一樣大，
但是在現實應用中，有些任務很復雜，有些任務很簡單，因此 “每個任務所需要的執行完成的時間完全相等幾乎是不可能的”，當執行緒池的 core pool size等于max pool size 能保證：先提交的任務，先被執行緒池執行，但是先提交的任務，不一定先執行完成，這是要注意的，
為什么當core pool size 等于 max pool size時，先提交的任務，就一定會先執行呢？
這是因為：任務提交過來，先通過 addWorker 創建新執行緒執行任務，由于core ppol size 等于 max pool size，那么addWorker會首先創建到 max pool size個執行緒數，再有任務提交過來，就會入佇列排隊，當某個執行緒上的任務執行完成時，這個執行緒是在一個while 回圈里面去取任務佇列里面排隊的任務，原始碼如下：java.util.concurrent.ThreadPoolExecutor#runWorker

try {       
            //如果執行緒剛才執行完了一個task，該task==null, 這時 while 回圈 中 getTask()方法執行，從任務佇列里面取任務執行
            while (task != null || (task = getTask()) != null) {
            //省略其他代碼...
                try {
                    beforeExecute(wt, task);
                    Throwable thrown = null;
                    try {
                        task.run();// 這里會執行我們實作的run方法
                    } catch (RuntimeException x) {
                        thrown = x; throw x;
                    } catch (Error x) {
                        thrown = x; throw x;
                    } catch (Throwable x) {
                        thrown = x; throw new Error(x);
                    } finally {
                        afterExecute(task, thrown);
                    }
                } finally {
                    task = null;//任務執行完成置成null，這樣while回圈的條件就會呼叫 getTask()從任務佇列里面取新任務執行
                    w.completedTasks++;
                    w.unlock();
                }
            }
            completedAbruptly = false;
        } finally {
            processWorkerExit(w, completedAbruptly);
        }

OK，分析完了執行緒池執行任務的順序，再看看ES的PrioritizedEsThreadPoolExecutor執行緒池的引數：將 core pool size 和 max pool size 都設定成1，避免了這種"插隊"的現象，即能夠保證：先提交的任務，一定是先執行完成的，各個模塊觸發的集群狀態更新最終在org.elasticsearch.cluster.service.MasterService#submitStateUpdateTasks方法中構造UpdateTask物件實體，并通過submitTasks方法提交任務執行，額外需要注意的是：集群狀態更新任務可以以批量執行方式提交，具體看org.elasticsearch.cluster.service.TaskBatcher的實作吧，

        try {
            List<Batcher.UpdateTask> safeTasks = tasks.entrySet().stream()
                .map(e -> taskBatcher.new UpdateTask(config.priority(), source, e.getKey(), safe(e.getValue()), executor))
                .collect(Collectors.toList());
            taskBatcher.submitTasks(safeTasks, config.timeout());
        } catch (EsRejectedExecutionException e) {
            // ignore cases where we are shutting down..., there is really nothing interesting
            // to be done here...
            if (!lifecycle.stoppedOrClosed()) {
                throw e;
            }
        }

最后來分析一下 org.elasticsearch.cluster.service.ClusterService類，在ES節點啟動的時候，在Node#start()方法中會啟動ClusterService，當其它各個模塊執行一些操作觸發集群狀態改變時，就是通過ClusterService來提交集群狀態更新任務，而ClusterService其實就是封裝了 MasterService和ClusterApplierService，MasterService提供任務提交介面，內部維護一個執行緒池處理更新任務，而ClusterApplierService則負責通知各個模塊應用新生成的集群狀態，

總結

只有單個執行緒的執行緒池執行任務，能夠保證任務處理的順序性，又不需要通過加鎖實作資料同步，這種思路值得借鑒
如果想保證：先提交的任務要先執行，那么 max pool size 必須等于 core pool size
如果 max pool size 不等于 core pool size，那么先提交的任務，可能在任務佇列中排隊，當任務佇列滿了時，后提交過來的任務直接通過 addWorker新建一個執行緒執行，從而使得后提交的任務先執行了（但并不是后提交的任務先執行完成，因為每個任務的復雜度不一樣）
如果想保證：先提交的任務先執行完成，那么 max pool size 必須等于 core pool size 并且等于1，這就是ES的PrioritizedEsThreadPoolExecutor執行緒池所采用的方式，它能保證ES集群的任務更新狀態是有序的，

轉載請註明出處，本文鏈接：https://www.uj5u.com/qita/3827.html

標籤：其他

上一篇：GitHub 發布了官方 App，還打算冰封你的代碼一千年

下一篇：Elasticsearch系列---常見搜索方式與聚合分析