1.實體重啟背景
收到監控組同學反饋,連接某一個MongoDB實體的應用耗時例外,并且出現了超時,查看資料庫監控平臺,發現此實體服務器的IO例外飆升,而查看副本集狀態(rs.status()),主從是壞掉的,從節點不可達,
登入從節點,查看mongodb服務狀態,是stop的,
查看服務器的log,發現出現了OOM,Mongodb被關閉了,需要手動重啟,
Jan 17 12:02:48 qqorderdb02 kernel: Out of memory: Kill process 83717 (mongod) score 919 or sacrifice child Jan 17 12:02:48 qqorderdb02 kernel: Killed process 83717 (mongod), UID 1001, total-vm:21256876kB, anon-rss:15529572kB, file-rss:0kB, shmem-rss:0kB Jan 17 12:42:42 qqorderdb02 systemd[1]: mongodbqq.service: main process exited, code=killed, status=9/KILL Jan 17 12:42:42 qqorderdb02 systemd[1]: Unit mongodbqq.service entered failed state. Jan 17 12:42:42 qqorderdb02 systemd[1]: mongodbqq.service failed.
事后分析,主節點的記憶體比從節點的記憶體大,創建索引,主節點正常執行了,而從節點出席那了OOM(12:02),KIll后,服務自啟動也失敗了(12:42),
2.重啟服務
重啟,查看mongodblog,我們會看到redo未提交的創建索引的事務
2019-01-17T19:38:11.529+0800 I CONTROL [initandlisten] ** WARNING: /sys/kernel/mm/transparent_hugepage/enabled is 'always'. 2019-01-17T19:38:11.529+0800 I CONTROL [initandlisten] ** We suggest setting it to 'never' 2019-01-17T19:38:11.529+0800 I CONTROL [initandlisten] 2019-01-17T19:38:11.529+0800 I CONTROL [initandlisten] ** WARNING: /sys/kernel/mm/transparent_hugepage/defrag is 'always'. 2019-01-17T19:38:11.529+0800 I CONTROL [initandlisten] ** We suggest setting it to 'never' 2019-01-17T19:38:11.529+0800 I CONTROL [initandlisten] 2019-01-17T19:38:11.529+0800 I CONTROL [initandlisten] ** WARNING: Running wiredTiger with the --nojournal option in a replica set 2019-01-17T19:38:11.529+0800 I CONTROL [initandlisten] ** is deprecated and subject to be removed in a future version. 2019-01-17T19:38:11.529+0800 I CONTROL [initandlisten] 2019-01-17T19:38:11.592+0800 I INDEX [initandlisten] found 1 index(es) that wasn't finished before shutdown 2019-01-17T19:38:11.595+0800 I FTDC [initandlisten] Initializing full-time diagnostic data capture with directory '/var/mongodbqq/db/diagnostic.data' 2019-01-17T19:38:11.596+0800 I INDEX [initandlisten] found 1 interrupted index build(s) on qqorderdb.weixinordersn 2019-01-17T19:38:11.596+0800 I INDEX [initandlisten] note: restart the server with --noIndexBuildRetry to skip index rebuilds
但執行一會后,重啟失敗,退出,查看server log,報錯資訊如下:
Jan 17 19:41:10 qqorderdb02 systemd[1]: mongodbqq.service stop-final-sigterm timed out. Killing. Jan 17 19:41:10 qqorderdb02 systemd[1]: Failed to start mongodbqq_service. Jan 17 19:41:10 qqorderdb02 systemd[1]: Unit mongodbqq.service entered failed state. Jan 17 19:41:10 qqorderdb02 systemd[1]: mongodbqq.service failed.
查看mongodb的log,最新的資訊如下:
2019-01-17T19:41:00.001+0800 I - [initandlisten] Index Build: 55387600/192576426 28% 2019-01-17T19:41:03.002+0800 I - [initandlisten] Index Build: 57463100/192576426 29% 2019-01-17T19:41:06.002+0800 I - [initandlisten] Index Build: 59385700/192576426 30% 2019-01-17T19:41:09.001+0800 I - [initandlisten] Index Build: 61549000/192576426 31%
通過查看Server log 和 Mongodb log ,我們可以判斷:啟動時需要重建關閉時未完成的index,但是重建這個大集合(本案例為weixinordersn,5億資料量,102Gsize)的索引耗時較長,超過了啟動服務允許的時間,服務超時后會被killed,
3.解決方案
設定systemd的service超時時間,在mongodb服務中,指明TimeoutSec引數,
TimeoutSec:定義 Systemd 停止當前服務之前等待的秒數,單位是秒,設定為0是不限制.
例如mongodbtest.service的撰寫,增加 TimeoutSec=1800
[Unit] Description=mongodbtest After=network.target remote-fs.target nss-lookup.target [Service] User=mongouser Group=mongouser # (open files) LimitNOFILE=64000 Type=forking ExecStart=/data/mongodb/mongobin404/bin/mongod --config /data/mongodb/mongobin404/bin/mongodbtest.conf ExecReload=/bin/kill -s HUP $MAINPID ExecStop=/data/mongodb/mongobin404/bin/mongod --shutdown --config /data/mongodb/mongobin404/bin/mongodbtest.conf PrivateTmp=true TimeoutSec=1800 [Install] WantedBy=multi-user.target
4.性能下降的分析
結合 應用超時 和 資料庫監控的IO飆升的時間關聯,性能下降主要分別是 創建索引 和 主從斷掉導致oplog.rs 插入、查詢、更新變慢導致,

5.參考文獻
1.https://cloudblue.freshdesk.com/support/solutions/articles/44001881778
2.https://www.cnblogs.com/f-society/p/13177614.html
本文著作權歸作者所有,未經作者同意不得轉載,謝謝配合!!!
轉載請註明出處,本文鏈接:https://www.uj5u.com/shujuku/308488.html
標籤:NoSQL
上一篇:MongoDB學習筆記三 - MongooseAPI操作資料
下一篇:Redis單節點安裝與使用
