Resolving Yellow and Red Status in an Elasticsearch Cluster

1. Cluster Health

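Shard health: each shard is in one of three states: green, yellow, or red.
- Red: at least one primary shard is unassigned, so part of the data is unavailable and the cluster cannot work normally.
- Yellow: a warning state. All primary shards are assigned and usable, but at least one replica shard is not.
- Green: healthy. All primary shards and replica shards are assigned.
Index health: the status of its worst shard.
Cluster health: the status of its worst index.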
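
The roll-up rule above (index health is the worst shard, cluster health is the worst index) can be sketched in a few lines of Python. This is only an illustration of the rule, not Elasticsearch's own code; the index names and shard states are made up.

```python
# Severity order: a higher number means a worse status.
SEVERITY = {"green": 0, "yellow": 1, "red": 2}


def shard_health(primary_assigned, replicas_assigned, replicas_total):
    """Health of one shard: a primary together with its replicas."""
    if not primary_assigned:
        return "red"
    if replicas_assigned < replicas_total:
        return "yellow"
    return "green"


def worst(statuses):
    """Return the worst status in a collection; an empty collection is green."""
    return max(statuses, key=SEVERITY.__getitem__, default="green")


# Hypothetical cluster: index name -> health of each of its shards.
indices = {
    "logs-a": ["green", "green"],
    "logs-b": ["green", "yellow"],
    "logs-c": ["green", "red", "yellow"],
}

index_health = {name: worst(shards) for name, shards in indices.items()}
cluster_health = worst(index_health.values())

print(index_health)
print(cluster_health)
```

A single red shard is therefore enough to turn the whole cluster red, which is why the APIs in the next section drill down from cluster to index to shard.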

2. Health-related APIs

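Description	API
Cluster status (check the node count)	GET _cluster/health
Health of all indices (find the problem indices)	GET _cluster/health?level=indices
Health of a single index (inspect a specific index)	GET _cluster/health/my_index
Shard-level health	GET _cluster/health?level=shards
Explain why the first unassigned shard found is unassigned	GET _cluster/allocation/explain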
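
When these calls are scripted rather than typed into Kibana, the health endpoints differ only in an optional index name and an optional `level` parameter. A minimal sketch of a URL builder follows; the helper name `health_url` is hypothetical and `http://IP:9200` is a placeholder address.

```python
from urllib.parse import quote, urlencode


def health_url(base, index=None, level=None):
    """Build a _cluster/health URL; index and level are both optional."""
    url = base.rstrip("/") + "/_cluster/health"
    if index:
        url += "/" + quote(index, safe="")
    if level:
        url += "?" + urlencode({"level": level})
    return url


print(health_url("http://IP:9200"))
print(health_url("http://IP:9200", level="indices"))
print(health_url("http://IP:9200", index="my_index"))
print(health_url("http://IP:9200", level="shards"))
```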

Example 1: Get the cluster health summary

# View in a browser
http://IP:9200/_cat/health

# Output when there is a problem
1635313779 05:49:39 kubernetes-logging red 15 10 2128 1064 0 0 32 0 - 98.5%

# Output when healthy
1635328870 10:01:10 kubernetes-logging green 15 10 2160 1080 2 0 0 0 - 100.0%

View in Kibana

GET _cat/health
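
The `_cat/health` line is positional, so it is easier to read once the values are labelled. The sketch below assumes the default column order of Elasticsearch 7.x (newer versions may add columns); the function name `parse_cat_health` is hypothetical.

```python
# Default `_cat/health` columns, assuming the Elasticsearch 7.x layout.
COLUMNS = [
    "epoch", "timestamp", "cluster", "status", "node.total", "node.data",
    "shards", "pri", "relo", "init", "unassign", "pending_tasks",
    "max_task_wait_time", "active_shards_percent",
]
INT_COLUMNS = {
    "epoch", "node.total", "node.data", "shards", "pri",
    "relo", "init", "unassign", "pending_tasks",
}


def parse_cat_health(line):
    """Turn one whitespace-separated `_cat/health` line into a dict."""
    values = line.split()
    if len(values) != len(COLUMNS):
        raise ValueError(f"expected {len(COLUMNS)} columns, got {len(values)}")
    row = dict(zip(COLUMNS, values))
    for name in INT_COLUMNS:
        row[name] = int(row[name])
    return row


red = parse_cat_health(
    "1635313779 05:49:39 kubernetes-logging red 15 10 2128 1064 0 0 32 0 - 98.5%"
)
print(red["status"], red["node.total"], red["unassign"])
```

Read this way, the problem output above shows 15 nodes (10 of them data nodes), 2128 active shards of which 1064 are primaries, and 32 unassigned shards.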

Example 2: Cluster status (check the node count)

# View in a browser
http://IP:9200/_cluster/health
# Output
{"cluster_name":"kubernetes-logging","status":"red","timed_out":false,"number_of_nodes":15,
"number_of_data_nodes":10,"active_primary_shards":1064,"active_shards":2128,"relocating_shards":0,
"initializing_shards":0,"unassigned_shards":32,"delayed_unassigned_shards":0,"number_of_pending_tasks":0,
"number_of_in_flight_fetch":0,"task_max_waiting_in_queue_millis":0,"active_shards_percent_as_number":98.51851851851852}

View in Kibana

GET _cluster/health
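
As a sanity check on the JSON above, the percentage can be recomputed from the shard counters. The formula used here (active shards divided by active plus initializing plus unassigned) is an assumption that matches this sample, not a quote from the Elasticsearch source.

```python
import json

# The health response from Example 2, reduced to the fields used here.
health = json.loads("""
{"cluster_name": "kubernetes-logging", "status": "red",
 "number_of_nodes": 15, "number_of_data_nodes": 10,
 "active_primary_shards": 1064, "active_shards": 2128,
 "initializing_shards": 0, "unassigned_shards": 32,
 "active_shards_percent_as_number": 98.51851851851852}
""")


def active_percent(h):
    """Share of shard copies that are active (assumed formula, see above)."""
    total = h["active_shards"] + h["initializing_shards"] + h["unassigned_shards"]
    return 100.0 if total == 0 else 100.0 * h["active_shards"] / total


print(health["status"], health["unassigned_shards"])
print(round(active_percent(health), 2))
```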

Example 3: Health of all indices

# View in a browser
http://IP:9200/_cluster/health?level=indices
# Output omitted

View in Kibana

GET _cluster/health?level=indices

Example 4: Health of a single index (inspect a specific index)

http://IP:9200/_cluster/health/dev-tool-deployment-service
# Output
{"cluster_name":"kubernetes-logging","status":"red","timed_out":false,"number_of_nodes":15,
"number_of_data_nodes":10,"active_primary_shards":2,"active_shards":4,"relocating_shards":0,
"initializing_shards":0,"unassigned_shards":6,"delayed_unassigned_shards":0,"number_of_pending_tasks":0,
"number_of_in_flight_fetch":0,"task_max_waiting_in_queue_millis":0,"active_shards_percent_as_number":98.52534562211981}

View in Kibana

GET _cluster/health/my_index

3. Cluster Health and Troubleshooting

3.1 Start an Elasticsearch cluster

cat docker-compose.yaml
version: '2.2'
services:
  cerebro:
    image: lmenezes/cerebro:0.8.3
    container_name: hwc_cerebro
    ports:
      - "9000:9000"
    command:
      - -Dhosts.0.host=http://elasticsearch:9200
    networks:
      - hwc_es7net
  kibana:
    image: docker.elastic.co/kibana/kibana:7.1.0
    container_name: hwc_kibana7
    environment:
      #- I18N_LOCALE=zh-CN
      - XPACK_GRAPH_ENABLED=true
      - TIMELION_ENABLED=true
      - XPACK_MONITORING_COLLECTION_ENABLED="true"
    ports:
      - "5601:5601"
    networks:
      - hwc_es7net
  elasticsearch:
    image: docker.elastic.co/elasticsearch/elasticsearch:7.1.0
    container_name: es7_hot
    environment:
      - cluster.name=geektime-hwc
      - node.name=es7_hot
      - node.attr.box_type=hot
      - bootstrap.memory_lock=true
      - "ES_JAVA_OPTS=-Xms512m -Xmx512m"
      - discovery.seed_hosts=es7_hot,es7_warm,es7_cold
      - cluster.initial_master_nodes=es7_hot,es7_warm,es7_cold
    ulimits:
      memlock:
        soft: -1
        hard: -1
    volumes:
      - hwc_es7data_hot:/usr/share/elasticsearch/data
    ports:
      - 9200:9200
    networks:
      - hwc_es7net
  elasticsearch2:
    image: docker.elastic.co/elasticsearch/elasticsearch:7.1.0
    container_name: es7_warm
    environment:
      - cluster.name=geektime-hwc
      - node.name=es7_warm
      - node.attr.box_type=warm
      - bootstrap.memory_lock=true
      - "ES_JAVA_OPTS=-Xms512m -Xmx512m"
      - discovery.seed_hosts=es7_hot,es7_warm,es7_cold
      - cluster.initial_master_nodes=es7_hot,es7_warm,es7_cold
    ulimits:
      memlock:
        soft: -1
        hard: -1
    volumes:
      - hwc_es7data_warm:/usr/share/elasticsearch/data
    networks:
      - hwc_es7net
  elasticsearch3:
    image: docker.elastic.co/elasticsearch/elasticsearch:7.1.0
    container_name: es7_cold
    environment:
      - cluster.name=geektime-hwc
      - node.name=es7_cold
      - node.attr.box_type=cold
      - bootstrap.memory_lock=true
      - "ES_JAVA_OPTS=-Xms512m -Xmx512m"
      - discovery.seed_hosts=es7_hot,es7_warm,es7_cold
      - cluster.initial_master_nodes=es7_hot,es7_warm,es7_cold
    ulimits:
      memlock:
        soft: -1
        hard: -1
    volumes:
      - hwc_es7data_cold:/usr/share/elasticsearch/data
    networks:
      - hwc_es7net


volumes:
  hwc_es7data_hot:
    driver: local
  hwc_es7data_warm:
    driver: local
  hwc_es7data_cold:
    driver: local

networks:
  hwc_es7net:
    driver: bridge

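Case 1

Symptom: The cluster turns Red.
Analysis: The Allocation Explain API shows that the new index's shards cannot be allocated, because no node carries the required box_type attribute.
Solution: Delete the index and the cluster returns to Green. Recreate the index with the correct routing box_type; the index is created successfully and the cluster stays Green.

# Create the index with hot misspelled as hott, then check the status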
DELETE mytest
PUT mytest
{
  "settings":{
    "number_of_shards":3,
    "number_of_replicas":0,
    "index.routing.allocation.require.box_type":"hott"
  }
}


# Check the cluster status: are any nodes missing, and how many shards are unassigned?
GET /_cluster/health/

# Check at the index level to find the red index
GET /_cluster/health?level=indices


# Check the shards of each index
GET _cluster/health?level=shards

# Explain why the cluster turned Red
GET /_cluster/allocation/explain

GET /_cat/shards/mytest

GET _cat/nodeattrs


# Correct hott to hot, recreate the index, and check the status
DELETE mytest
GET /_cluster/health/

PUT mytest
{
  "settings":{
    "number_of_shards":3,
    "number_of_replicas":0,
    "index.routing.allocation.require.box_type":"hot"
  }
}

GET /_cluster/health/
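
Why the misspelled value leaves every primary unassigned can be shown with a simplified model of the `index.routing.allocation.require.*` filter. This is a sketch, not Elasticsearch's allocation code: it does exact matching only and ignores wildcards, comma-separated values, and every other allocation rule. The node names and attributes come from the compose file above.

```python
# Node attributes as defined by node.attr.box_type in the compose file.
NODES = {
    "es7_hot": {"box_type": "hot"},
    "es7_warm": {"box_type": "warm"},
    "es7_cold": {"box_type": "cold"},
}


def eligible_nodes(nodes, require):
    """Names of nodes whose attributes match every required key/value pair."""
    return sorted(
        name
        for name, attrs in nodes.items()
        if all(attrs.get(key) == value for key, value in require.items())
    )


def primary_health(nodes, require):
    """With zero replicas the index is green if any node is eligible, else red."""
    return "green" if eligible_nodes(nodes, require) else "red"


print(eligible_nodes(NODES, {"box_type": "hott"}), primary_health(NODES, {"box_type": "hott"}))
print(eligible_nodes(NODES, {"box_type": "hot"}), primary_health(NODES, {"box_type": "hot"}))
```

With `hott` the list of eligible nodes is empty, so none of the three primaries can be placed, which matches the Red status seen in this case.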

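Case 2: Use Explain to inspect allocation on the hot node

Symptom: The cluster turns Yellow.
Analysis: The Allocation Explain API shows that a replica cannot be allocated to the same node as its primary.
Solution: Set the index's replica count to 0, or add nodes.

# Incorrect settings: one replica, but only one hot node to hold it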
DELETE mytest
PUT mytest
{
  "settings":{
    "number_of_shards":2,
    "number_of_replicas":1,
    "index.routing.allocation.require.box_type":"hot"
  }
}

GET _cluster/health
GET _cat/shards/mytest
GET /_cluster/allocation/explain

# Set the replica count to 0, then check again
PUT mytest/_settings
{
    "number_of_replicas": 0
}
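
The Yellow status in this case follows from one rule: a replica is never placed on the node that already holds another copy of the same shard. A minimal model of that rule, assuming eligible nodes are the only constraint (disk watermarks and other allocation rules are ignored), looks like this:

```python
def index_health(eligible_node_count, number_of_replicas):
    """Health of an index when each copy of a shard needs its own node."""
    copies_wanted = 1 + number_of_replicas  # the primary plus its replicas
    copies_placed = min(copies_wanted, eligible_node_count)
    if copies_placed == 0:
        return "red"      # not even the primary fits anywhere
    if copies_placed < copies_wanted:
        return "yellow"   # primary placed, some replicas left unassigned
    return "green"


print(index_health(eligible_node_count=1, number_of_replicas=1))
print(index_health(eligible_node_count=1, number_of_replicas=0))
print(index_health(eligible_node_count=2, number_of_replicas=1))
```

Both fixes named above show up in the model: dropping the replica count to 0 or adding a second eligible node turns the result green.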

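4. Some reasons a shard is unassigned

INDEX_CREATED: The shard is unassigned because its index was just created. The cluster is briefly Red until all of the index's primary shards are allocated, which does not necessarily indicate a problem.
CLUSTER_RECOVERED: The shard is unassigned while the cluster recovers after a full restart.
INDEX_REOPENED: The shard is unassigned because a previously closed index was opened.
DANGLING_INDEX_IMPORTED: An index was deleted while a node was away from the cluster. When that node rejoins, the copy it still holds is imported as a dangling index.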
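
These reason codes appear in the `unassigned.reason` column of `GET _cat/shards?h=index,shard,prirep,state,unassigned.reason`. The sketch below tallies them from such output; the sample text is invented for illustration and the function name `unassigned_reasons` is hypothetical.

```python
from collections import Counter

# Invented sample in the column order index, shard, prirep, state, unassigned.reason.
# Assigned shards have an empty reason, so their rows have only four fields.
SAMPLE = """\
mytest 0 p UNASSIGNED INDEX_CREATED
mytest 1 p UNASSIGNED INDEX_CREATED
mytest 2 p STARTED
logs   0 r UNASSIGNED NODE_LEFT
"""


def unassigned_reasons(text):
    """Count unassigned shards per reason code."""
    counts = Counter()
    for line in text.splitlines():
        fields = line.split()
        if len(fields) >= 5 and fields[3] == "UNASSIGNED":
            counts[fields[4]] += 1
    return counts


print(unassigned_reasons(SAMPLE))
```

A tally dominated by a single reason usually points at one root cause, which narrows down which of the fixes in the next section applies.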

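5. Common problems and fixes

If the cluster turns Red, check whether any node is offline. If so, restarting the offline node usually resolves the problem.
For problems caused by configuration, fix the configuration (for example, a wrong box_type or a wrong replica count).
For problems caused by disk space limits or shard allocation rules (Shard Filtering), adjust the rules or add nodes.
If a node rejoining the cluster imports a dangling index and the cluster turns Red, the dangling index can simply be deleted.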

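6. Summary of cluster Red & Yellow problems

Red and Yellow are common problems in cluster operations.
Besides cluster failures, routine operations such as creating an index or adding replicas also make the cluster briefly Red or Yellow, so monitoring and alerting should apply a delay.
Check the node count and use the APIs that Elasticsearch provides to find the real cause.
Shards can also be moved or allocated explicitly with the reroute API.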

# Move a shard of an index from one node to another to help resolve a Red or Yellow cluster
POST _cluster/reroute
{
    "commands": [
        {
            "move": {
                "index": "index_name",
                "shard": 0,
                "from_node": "node_name_1",
                "to_node": "node_name_2"
            }
        }
    ]
}

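# Allocate an unassigned replica of shard 0 to a node and return an explanation of the decision
# On Elasticsearch 7.x the command is allocate_replica (primaries need allocate_stale_primary or allocate_empty_primary)
POST _cluster/reroute?explain
{
    "commands": [
        {
            "allocate_replica": {
                "index": "index_name",
                "shard": 0,
                "node": "nodename"
            }
        }
    ]
}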
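
The delay recommended for monitoring and alerting in the summary above can be sketched as a small debounce: an alert fires only once the cluster has been continuously non-green for a set time. The class name `HealthAlert` and the 300-second delay are illustrative assumptions, not values from Elasticsearch or any monitoring tool.

```python
class HealthAlert:
    """Fire only when the status has been non-green for at least `delay` seconds."""

    def __init__(self, delay):
        self.delay = delay
        self.unhealthy_since = None  # time of the first non-green sample in a row

    def observe(self, status, now):
        """Record one health sample; return True if an alert should fire."""
        if status == "green":
            self.unhealthy_since = None
            return False
        if self.unhealthy_since is None:
            self.unhealthy_since = now
        return now - self.unhealthy_since >= self.delay


alert = HealthAlert(delay=300)
# (seconds, status) samples: a brief red blip, then a yellow that persists.
samples = [(0, "green"), (60, "red"), (120, "green"), (180, "yellow"), (480, "yellow")]
fired = [alert.observe(status, t) for t, status in samples]
print(fired)
```

The short Red at 60 seconds, typical of index creation, never fires, while the Yellow that lasts from 180 to 480 seconds does.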
