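1. Cluster health
- Shard health: a shard's status is one of three colors: green, yellow, or red.
  - Red: at least one primary shard is unassigned, so the cluster cannot work normally.
  - Yellow: a warning state. All primary shards are assigned and usable, but at least one replica shard is not.
  - Green: a healthy state. All primary and replica shards are assigned and working.
- Index health: the status of the index's worst shard.
- Cluster health: the status of the cluster's worst index.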
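The "worst status wins" roll-up described above can be sketched in a few lines of Python. This is only an illustration of the rule, not how Elasticsearch implements it; the index names and per-shard statuses below are made up for the example.

```python
# Severity order: a higher number is worse.
SEVERITY = {"green": 0, "yellow": 1, "red": 2}

def worst(statuses):
    """Return the worst status in an iterable (green if it is empty)."""
    return max(statuses, key=SEVERITY.__getitem__, default="green")

# Hypothetical indices, each with a list of per-shard statuses.
indices = {
    "logs-a": ["green", "green"],
    "logs-b": ["green", "yellow"],
    "logs-c": ["red", "green"],
}

# Index health is the worst shard; cluster health is the worst index.
index_health = {name: worst(shards) for name, shards in indices.items()}
cluster_health = worst(index_health.values())

print(index_health)
print(cluster_health)
```

A single red shard is therefore enough to turn the whole cluster red, which is why the later sections drill down from cluster to index to shard.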
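2. Health-related APIs
| Description | API |
|---|---|
| Cluster status (check the node count) | GET _cluster/health |
| Health of every index (find the problem indices) | GET _cluster/health?level=indices |
| Health of a single index (inspect one specific index) | GET _cluster/health/my_index |
| Shard-level health | GET _cluster/health?level=shards |
| Explain why the first unassigned shard found is unassigned | GET _cluster/allocation/explain |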
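For scripts that call these endpoints outside Kibana, a small helper can build the health URLs and fetch them with the standard library. This is a minimal sketch: the function names and the `http://localhost:9200` base address are assumptions for illustration, and `fetch_health` only works against a reachable, unauthenticated cluster.

```python
import json
from urllib.parse import quote, urlencode
from urllib.request import urlopen

def health_url(base, index=None, level=None):
    """Build a _cluster/health URL; index and level are both optional."""
    url = base.rstrip("/") + "/_cluster/health"
    if index:
        url += "/" + quote(index, safe="")
    if level:
        url += "?" + urlencode({"level": level})
    return url

def fetch_health(base, index=None, level=None, timeout=10):
    """Fetch and decode the health document (requires a running cluster)."""
    with urlopen(health_url(base, index, level), timeout=timeout) as resp:
        return json.load(resp)

# Only the URLs are printed here, so the example runs without a cluster.
print(health_url("http://localhost:9200"))
print(health_url("http://localhost:9200", level="indices"))
print(health_url("http://localhost:9200", index="my_index"))
```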
Example 1: Get the cluster health with the cat API
# View in a browser
http://IP:9200/_cat/health
# Unhealthy result
1635313779 05:49:39 kubernetes-logging red 15 10 2128 1064 0 0 32 0 - 98.5%
# Healthy result
1635328870 10:01:10 kubernetes-logging green 15 10 2160 1080 2 0 0 0 - 100.0%
View in Kibana
GET _cat/health
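The cat output is positional, so it is easy to misread. The sketch below labels each field, assuming the default `_cat/health` column order (the same names that `GET _cat/health?v` prints as a header row). If you pass a custom `h=` column list, the names here no longer apply.

```python
# Default _cat/health columns, in order (assumed; verify with ?v on your version).
CAT_HEALTH_COLUMNS = [
    "epoch", "timestamp", "cluster", "status",
    "node.total", "node.data", "shards", "pri", "relo", "init",
    "unassign", "pending_tasks", "max_task_wait_time",
    "active_shards_percent",
]

def parse_cat_health(line):
    """Map one whitespace-separated _cat/health line to named columns."""
    values = line.split()
    if len(values) != len(CAT_HEALTH_COLUMNS):
        raise ValueError(f"expected {len(CAT_HEALTH_COLUMNS)} fields, got {len(values)}")
    return dict(zip(CAT_HEALTH_COLUMNS, values))

# The unhealthy sample line from Example 1.
row = parse_cat_health(
    "1635313779 05:49:39 kubernetes-logging red 15 10 2128 1064 0 0 32 0 - 98.5%"
)
print(row["status"], row["node.total"], row["unassign"])
```

Read this way, the red sample shows 15 nodes (10 of them data nodes), 2128 active shards of which 1064 are primaries, and 32 unassigned shards.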
Example 2: Cluster status (check the node count)
# View in a browser
http://IP:9200/_cluster/health
# Result
{"cluster_name":"kubernetes-logging","status":"red","timed_out":false,"number_of_nodes":15,
"number_of_data_nodes":10,"active_primary_shards":1064,"active_shards":2128,"relocating_shards":0,
"initializing_shards":0,"unassigned_shards":32,"delayed_unassigned_shards":0,"number_of_pending_tasks":0,
"number_of_in_flight_fetch":0,"task_max_waiting_in_queue_millis":0,"active_shards_percent_as_number":98.51851851851852}
View in Kibana
GET _cluster/health
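To relate the fields in this response to each other, the sketch below recomputes the active-shard percentage from the shard counters in the sample. The formula (active shards divided by active plus initializing plus unassigned) is an assumption that happens to reproduce the sample value, not a statement of how Elasticsearch computes it internally.

```python
import json

# A subset of the fields from the sample response in Example 2.
health = json.loads("""
{"status": "red", "number_of_nodes": 15, "number_of_data_nodes": 10,
 "active_shards": 2128, "initializing_shards": 0, "unassigned_shards": 32,
 "active_shards_percent_as_number": 98.51851851851852}
""")

def active_percent(h):
    """Share of shard copies that are active, as a percentage (assumed formula)."""
    total = h["active_shards"] + h["initializing_shards"] + h["unassigned_shards"]
    return 100.0 * h["active_shards"] / total if total else 100.0

computed = active_percent(health)
print(f"{computed:.2f}")
print(f"{health['active_shards_percent_as_number']:.2f}")
```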
Example 3: Health of every index
# View in a browser
http://IP:9200/_cluster/health?level=indices
# Result omitted
View in Kibana
GET _cluster/health?level=indices
Example 4: Health of a single index (inspect one specific index)
http://IP:9200/_cluster/health/dev-tool-deployment-service
# Result
{"cluster_name":"kubernetes-logging","status":"red","timed_out":false,"number_of_nodes":15,
"number_of_data_nodes":10,"active_primary_shards":2,"active_shards":4,"relocating_shards":0,
"initializing_shards":0,"unassigned_shards":6,"delayed_unassigned_shards":0,"number_of_pending_tasks":0,
"number_of_in_flight_fetch":0,"task_max_waiting_in_queue_millis":0,"active_shards_percent_as_number":98.52534562211981}
View in Kibana
GET _cluster/health/my_index
3. Cluster health and troubleshooting
3.1 Start an Elasticsearch cluster
cat docker-compose.yaml
version: '2.2'
services:
  cerebro:
    image: lmenezes/cerebro:0.8.3
    container_name: hwc_cerebro
    ports:
      - "9000:9000"
    command:
      - -Dhosts.0.host=http://elasticsearch:9200
    networks:
      - hwc_es7net
  kibana:
    image: docker.elastic.co/kibana/kibana:7.1.0
    container_name: hwc_kibana7
    environment:
      #- I18N_LOCALE=zh-CN
      - XPACK_GRAPH_ENABLED=true
      - TIMELION_ENABLED=true
      - XPACK_MONITORING_COLLECTION_ENABLED="true"
    ports:
      - "5601:5601"
    networks:
      - hwc_es7net
  elasticsearch:
    image: docker.elastic.co/elasticsearch/elasticsearch:7.1.0
    container_name: es7_hot
    environment:
      - cluster.name=geektime-hwc
      - node.name=es7_hot
      - node.attr.box_type=hot
      - bootstrap.memory_lock=true
      - "ES_JAVA_OPTS=-Xms512m -Xmx512m"
      - discovery.seed_hosts=es7_hot,es7_warm,es7_cold
      - cluster.initial_master_nodes=es7_hot,es7_warm,es7_cold
    ulimits:
      memlock:
        soft: -1
        hard: -1
    volumes:
      - hwc_es7data_hot:/usr/share/elasticsearch/data
    ports:
      - 9200:9200
    networks:
      - hwc_es7net
  elasticsearch2:
    image: docker.elastic.co/elasticsearch/elasticsearch:7.1.0
    container_name: es7_warm
    environment:
      - cluster.name=geektime-hwc
      - node.name=es7_warm
      - node.attr.box_type=warm
      - bootstrap.memory_lock=true
      - "ES_JAVA_OPTS=-Xms512m -Xmx512m"
      - discovery.seed_hosts=es7_hot,es7_warm,es7_cold
      - cluster.initial_master_nodes=es7_hot,es7_warm,es7_cold
    ulimits:
      memlock:
        soft: -1
        hard: -1
    volumes:
      - hwc_es7data_warm:/usr/share/elasticsearch/data
    networks:
      - hwc_es7net
  elasticsearch3:
    image: docker.elastic.co/elasticsearch/elasticsearch:7.1.0
    container_name: es7_cold
    environment:
      - cluster.name=geektime-hwc
      - node.name=es7_cold
      - node.attr.box_type=cold
      - bootstrap.memory_lock=true
      - "ES_JAVA_OPTS=-Xms512m -Xmx512m"
      - discovery.seed_hosts=es7_hot,es7_warm,es7_cold
      - cluster.initial_master_nodes=es7_hot,es7_warm,es7_cold
    ulimits:
      memlock:
        soft: -1
        hard: -1
    volumes:
      - hwc_es7data_cold:/usr/share/elasticsearch/data
    networks:
      - hwc_es7net
volumes:
  hwc_es7data_hot:
    driver: local
  hwc_es7data_warm:
    driver: local
  hwc_es7data_cold:
    driver: local
networks:
  hwc_es7net:
    driver: bridge
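Case 1
- Symptom: the cluster turns red.
- Analysis: the Allocation Explain API shows that the new index's shards cannot be allocated, because no node is tagged with the required box_type.
- Fix: delete the index and the cluster turns green again. Recreate the index with the correct routing box_type; the index is created successfully and the cluster stays green.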
# Create the index with hot misspelled as hott, then check the status
DELETE mytest
PUT mytest
{
  "settings":{
    "number_of_shards":3,
    "number_of_replicas":0,
    "index.routing.allocation.require.box_type":"hott"
  }
}
# Check the cluster status: are any nodes missing, and how many shards are unassigned?
GET /_cluster/health/
# Drill down to index level to find the red index
GET /_cluster/health?level=indices
# Drill down to the shards of each index
GET _cluster/health?level=shards
# Explain why the cluster turned red
GET /_cluster/allocation/explain
GET /_cat/shards/mytest
GET _cat/nodeattrs
# Change hott to the correct hot, recreate the index, and check the status
DELETE mytest
GET /_cluster/health/
PUT mytest
{
  "settings":{
    "number_of_shards":3,
    "number_of_replicas":0,
    "index.routing.allocation.require.box_type":"hot"
  }
}
GET /_cluster/health/
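Why the misspelled value leaves every shard unassigned can be illustrated with a simplified model of the `require` filter. This sketch matches attributes by exact string equality only and ignores everything else the real allocator considers; the node names and attributes are taken from the docker-compose file above, and the function name is hypothetical.

```python
def eligible_nodes(node_attrs, required):
    """Return the nodes whose attributes match every required key/value pair."""
    return sorted(
        name
        for name, attrs in node_attrs.items()
        if all(attrs.get(key) == value for key, value in required.items())
    )

# node.attr.box_type values from the three-node cluster defined earlier.
nodes = {
    "es7_hot": {"box_type": "hot"},
    "es7_warm": {"box_type": "warm"},
    "es7_cold": {"box_type": "cold"},
}

print(eligible_nodes(nodes, {"box_type": "hott"}))  # no node matches the typo
print(eligible_nodes(nodes, {"box_type": "hot"}))
```

With `hott` the candidate list is empty, so no primary can be placed and the index is red. With `hot` exactly one node qualifies, which is enough for three primaries and zero replicas.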
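Case 2: Use Explain to inspect allocation on the hot node
- Symptom: the cluster turns yellow.
- Analysis: the Allocation Explain API shows that a replica cannot be allocated to the same node as its primary.
- Fix: set the index's replica count to 0, or add more nodes.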
# Problematic settings: one replica, but only one node tagged hot
DELETE mytest
PUT mytest
{
  "settings":{
    "number_of_shards":2,
    "number_of_replicas":1,
    "index.routing.allocation.require.box_type":"hot"
  }
}
GET _cluster/health
GET _cat/shards/mytest
GET /_cluster/allocation/explain
# Correct the setting, then check again
PUT mytest/_settings
{
  "number_of_replicas": 0
}
GET _cluster/health
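The yellow state in this case follows from a simple counting argument: a primary and its replicas must sit on different nodes, and the filter leaves only one eligible node. The sketch below encodes just that rule; it is a rough model that ignores disk watermarks and other allocation deciders, and the function name is hypothetical.

```python
def predicted_health(eligible_node_count, replicas):
    """Predict index health from the number of nodes allowed to hold its shards.

    Each shard needs 1 + replicas copies, all on distinct nodes.
    """
    if eligible_node_count == 0:
        return "red"     # not even the primaries can be placed
    if eligible_node_count < 1 + replicas:
        return "yellow"  # primaries are placed, some replicas cannot be
    return "green"

print(predicted_health(1, 1))  # one hot node, one replica
print(predicted_health(1, 0))  # after setting number_of_replicas to 0
print(predicted_health(2, 1))  # after adding a second hot node
```

Both fixes from the case, dropping the replica count or adding a node, move the index out of the yellow branch.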
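4. Some reasons a shard is unassigned
- INDEX_CREATED: the index was just created. Until all of its shards have been allocated the cluster is briefly Red, which does not necessarily indicate a problem.
- CLUSTER_RECOVERED: occurs while the cluster is restarting.
- INDEX_REOPENED: a previously closed index was opened.
- DANGLING_INDEX_IMPORTED: an index was deleted while a node was out of the cluster; when that node rejoins, its leftover copy causes the dangling index problem.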
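To see which of these reasons applies on a live cluster, the cat shards API can print the reason per shard, for example with `GET _cat/shards?h=index,shard,prirep,state,unassigned.reason`. The sketch below tallies reasons from that text output. The sample lines are hypothetical and only illustrate the assumed five-column layout; started shards have an empty reason column.

```python
from collections import Counter

def unassigned_reasons(cat_shards_text):
    """Count unassigned shards per reason from _cat/shards text output.

    Assumes the columns index, shard, prirep, state, unassigned.reason.
    """
    counts = Counter()
    for line in cat_shards_text.splitlines():
        fields = line.split()
        if len(fields) >= 5 and fields[3] == "UNASSIGNED":
            counts[fields[4]] += 1
    return counts

# Hypothetical output for illustration only.
sample = """\
mytest 0 p UNASSIGNED INDEX_CREATED
mytest 1 p UNASSIGNED INDEX_CREATED
other  0 p STARTED
"""

counts = unassigned_reasons(sample)
print(dict(counts))
```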
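5. Common problems and fixes
- If the cluster turns red, check whether a node has gone offline. If so, restarting the offline node usually fixes the problem.
- For problems caused by configuration, fix the configuration (for example a wrong box_type or a wrong replica count).
- For problems caused by disk space limits or shard allocation rules (Shard Filtering), adjust the rules or add nodes.
- If a node rejoining the cluster turns it red because of a dangling index, you can delete the dangling index directly.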
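6. Summary of cluster Red & Yellow problems
- Red & Yellow are common problems in cluster operations.
- Besides cluster failures, operations such as creating an index or adding replicas also turn the cluster Red or Yellow briefly, so monitoring and alerting should allow for some delay.
- Check the node count and use the APIs that ES provides to find the real cause.
- You can explicitly Move or Reallocate shards: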
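# Move one shard of an index from one node to another to resolve a red or yellow cluster
POST _cluster/reroute
{
  "commands": [
    {
      "move": {
        "index": "index_name",
        "shard": 0,
        "from_node": "node_name_1",
        "to_node": "node_name_2"
      }
    }
  ]
}
# Since Elasticsearch 5.0 the plain allocate command has been split into allocate_replica,
# allocate_stale_primary and allocate_empty_primary; allocate_replica is used here
POST _cluster/reroute?explain
{
  "commands": [
    {
      "allocate_replica": {
        "index": "index_name",
        "shard": 0,
        "node": "nodename"
      }
    }
  ]
}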
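The advice to build some delay into alerting can be sketched as a grace period: only alert when the cluster has been continuously non-green for longer than a threshold. This is a minimal illustration with made-up sample data; the function name and the 300-second threshold are assumptions, and a real monitor would poll `_cluster/health` on a schedule.

```python
def should_alert(samples, grace_seconds):
    """Decide whether to alert from (unix_time, status) samples in time order.

    Alerts only if the latest sample is non-green and the cluster has been
    non-green continuously for at least grace_seconds.
    """
    if not samples or samples[-1][1] == "green":
        return False
    latest_time = samples[-1][0]
    unhealthy_since = latest_time
    # Walk backwards to find where the current non-green streak began.
    for timestamp, status in reversed(samples):
        if status == "green":
            break
        unhealthy_since = timestamp
    return latest_time - unhealthy_since >= grace_seconds

# A brief yellow blip (e.g. after creating an index) versus a lasting red state.
brief = [(0, "green"), (30, "yellow"), (60, "green")]
stuck = [(0, "green"), (30, "red"), (60, "red"), (330, "red")]

print(should_alert(brief, 300))
print(should_alert(stuck, 300))
```

The brief blip never alerts, while the red state that persists for five minutes does.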
