Came in this morning and found the log-center ES cluster down again: 2 nodes were offline and the cluster status was red.

1. Recovery steps

Logged in to each node and checked the elasticsearch process; on two of the nodes the process had died.
Restart the ES process on those nodes:

[ ~]$  cd /espath
[ ~]$ ./elasticsearch -d

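Use curl against one of the reachable ES nodes to check the cluster status.

The node count has recovered and the unassigned shards are slowly being reallocated. At this point there is nothing to do but wait: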

[ ~]$ curl -XGET "http://10.86.18.xxx:9200/_cluster/health?pretty"
{
  "cluster_name" : "elk6.4",
  "status" : "red",
  "timed_out" : false,
  "number_of_nodes" : 5,
  "number_of_data_nodes" : 5,
  "active_primary_shards" : 21928,
  "active_shards" : 21967,
  "relocating_shards" : 0,
  "initializing_shards" : 8,
  "unassigned_shards" : 10362,
  "delayed_unassigned_shards" : 0,
  "number_of_pending_tasks" : 96,
  "number_of_in_flight_fetch" : 0,
  "task_max_waiting_in_queue_millis" : 1986814,
  "active_shards_percent_as_number" : 67.93147168877755
}
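
Since recovery here is mostly a matter of waiting, the check above can be put in a loop. What follows is only a minimal sketch: the function names health_field and wait_for_recovery, the 60-second default interval, and the choice to stop as soon as the status is no longer red are my own assumptions, not part of the original procedure. The parsing relies on the one-field-per-line layout that ?pretty produces and is not a general JSON parser.

```shell
#!/usr/bin/env bash

# Print the value of one top-level field from pretty-printed
# _cluster/health JSON read on stdin (hypothetical helper).
health_field() {
  local field="$1"
  awk -v f="\"${field}\"" '$1 == f { v = $3; gsub(/[",]/, "", v); print v }'
}

# Poll the health endpoint until the cluster is no longer red.
# $1 = base URL of a reachable node, $2 = seconds between polls (assumed default: 60).
wait_for_recovery() {
  local url="$1" interval="${2:-60}" health status unassigned
  while :; do
    health="$(curl -s "${url}/_cluster/health?pretty")"
    status="$(printf '%s\n' "$health" | health_field status)"
    unassigned="$(printf '%s\n' "$health" | health_field unassigned_shards)"
    echo "$(date '+%F %T') status=${status} unassigned_shards=${unassigned}"
    # Stop once a status was actually returned and it is not red.
    [ -n "$status" ] && [ "$status" != "red" ] && break
    sleep "$interval"
  done
}
```

Usage would look like: wait_for_recovery "http://10.86.18.xxx:9200" 120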

2. Root cause analysis

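Checking per-node storage with curl showed that disk space was sufficient on every node, but the shard count had shot up to about 6,500 per node, far beyond what a node can handle. As a result, the process ran out of memory (OOM) when the cluster's scheduled job bulk-created indices: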

[ ~]$ curl -XGET "http://10.86.18.xxx:9200/_cat/allocation?v"
shards disk.indices disk.used disk.avail disk.total disk.percent host         ip           node
  6489        1.1tb     1.2tb     13.3tb     14.6tb            8 10.86.x 10.86.x 10.86.x
  6489        1.2tb     1.6tb      5.7tb      7.3tb           21 10.86.x 10.86.x 10.86.x
  1573      268.9gb     1.3tb        6tb      7.3tb           17 10.86.x 10.86.x 10.86.x
  6490        1.2tb     1.4tb      5.8tb      7.3tb           20 10.86.x 10.86.x 10.86.x
  1491      272.3gb     1.5tb      5.7tb      7.3tb           21 10.86.x 10.86.x 10.86.x
  9805                                                                                     UNASSIGNED
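
To catch this earlier, the same _cat/allocation output can be filtered for nodes above a shard limit. This is a rough sketch: the function name overloaded_nodes and the limit passed to it are assumptions of mine. Elastic's sizing guidance from that era suggested roughly 20 shards per GB of heap, but the heap size of these nodes is not given here, so the limit is left as a parameter.

```shell
#!/usr/bin/env bash

# Read `_cat/allocation?v` output on stdin and print "node shards" for every
# node whose shard count exceeds the given limit (hypothetical helper).
# The header line and the UNASSIGNED row are skipped.
overloaded_nodes() {
  local limit="$1"
  awk -v limit="$limit" \
    'NR > 1 && $NF != "UNASSIGNED" && $1 + 0 > limit { print $NF, $1 }'
}
```

For example: curl -s "http://10.86.18.xxx:9200/_cat/allocation?v" | overloaded_nodes 3000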

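Indices in this cluster are deleted by curator, which is run on a schedule from crontab. The curator log showed that it had not run successfully since September 1.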

Digging further into the cron log revealed the root cause: the password of the appxxx user that runs curator had expired.

Sep  2 22:00:01 appgsvr03 crond[166761]: (appxxx) PAM ERROR (Authentication token is no longer valid; new one required)
Sep  2 22:00:01 appgsvr03 crond[166761]: (appxxx) FAILED to authorize user with PAM (Authentication token is no longer valid; new one required)
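
Entries like these are easy to miss by eye, so here is a small sketch that lists the users whose cron jobs were rejected by PAM. The function name cron_auth_failures is hypothetical, and the default path /var/log/cron is an assumption (it is the usual crond log location on RHEL/CentOS-style systems); adjust it to wherever your cron log lives.

```shell
#!/usr/bin/env bash

# List the distinct users whose cron jobs were rejected by PAM.
# $1 = cron log file (assumed default: /var/log/cron).
cron_auth_failures() {
  local logfile="${1:-/var/log/cron}"
  # Keep only PAM rejection lines, then extract the user name from the
  # first parenthesised token, e.g. "(appxxx)".
  grep -E 'PAM ERROR|FAILED to authorize user with PAM' "$logfile" \
    | sed -n 's/^[^(]*(\([^)]*\)).*/\1/p' \
    | sort -u
}
```

For any user it reports, sudo chage -l <user> shows the password expiry settings and should confirm whether expiry is the cause.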

3. Solution

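Through an oversight, the job had been configured with crontab -e directly under the appxxx user. The crontab entry has now been moved back so that it runs as root: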

sudo vi /etc/crontab

# For details see man 4 crontabs
# Example of job definition:
# .---------------- minute (0 - 59)
# |  .------------- hour (0 - 23)
# |  |  .---------- day of month (1 - 31)
# |  |  |  .------- month (1 - 12) OR jan,feb,mar,apr ...
# |  |  |  |  .---- day of week (0 - 6) (Sunday=0 or 7) OR sun,mon,tue,wed,thu,fri,sat
# |  |  |  |  |
# *  *  *  *  * user-name  command to be executed

SHELL=/bin/bash
PATH=/sbin:/bin:/usr/sbin:/usr/bin
MAILTO=root
00 22 * * * root curator --config /curator/curator.yml /curator/action.yml
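
The cleanup job failed silently for days before the cluster fell over, so a staleness check on the curator log may be worth adding as a safety net. This is only a sketch under assumptions: the function name log_is_stale, the log path /curator/curator.log, and the 2-day threshold are all hypothetical, and a recently modified log is only a proxy for a successful run.

```shell
#!/usr/bin/env bash

# Succeed (exit 0) if the log file is missing or was last modified more
# than the given number of days ago (hypothetical helper).
# $1 = log file, $2 = maximum age in days.
log_is_stale() {
  local logfile="$1" max_days="$2"
  # A missing log counts as stale.
  [ -e "$logfile" ] || return 0
  # find prints the path only when its age in whole days exceeds max_days.
  [ -n "$(find "$logfile" -mtime +"$max_days" -print)" ]
}
```

For example: log_is_stale /curator/curator.log 2 && echo "curator log has not been updated for more than 2 days"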