Disabling CONFIG in Sentinel Mode Causes Failures

Scenario

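Our company has recently been drawing up a Redis standard, and one of its rules is to disable or rename the CONFIG command. When we deployed Redis Sentinel, we carelessly renamed CONFIG there as well. While reading the relevant documentation today, I found that Sentinel uses the CONFIG command when managing nodes. If so, renaming CONFIG should prevent failover from working.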
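The post does not show the exact directive that was used. Assuming it was the standard rename-command line in redis.conf, a quick check for an affected node could look like the sketch below. The function name config_is_renamed is hypothetical, and the path in the usage note is only inferred from the volume mounts used later.

```shell
#!/bin/sh
# Return 0 if the given redis.conf renames or disables CONFIG, 1 otherwise.
# Matches active (uncommented) lines such as:
#   rename-command CONFIG ""
#   rename-command CONFIG some-other-name
config_is_renamed() {
    grep -Eiq '^[[:space:]]*rename-command[[:space:]]+config[[:space:]]' "$1"
}
```

For example: config_is_renamed /data/6380/redis.conf && echo "CONFIG is renamed on this node".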

Test

Deploy Redis locally and check whether failover still works.

# Start redis (docker options such as -v must come before the image name)
docker run --name redis-6380 -d -v /data/6380:/data redis:6.2.4 redis-server redis.conf
docker run --name redis-6381 -d -v /data/6381:/data redis:6.2.4 redis-server redis.conf
docker run --name redis-6382 -d -v /data/6382:/data redis:6.2.4 redis-server redis.conf

# Start sentinel
docker run --name sentinel-6390 -d -v /data/6390:/data redis:6.2.4 redis-sentinel sentinel.conf
docker run --name sentinel-6391 -d -v /data/6391:/data redis:6.2.4 redis-sentinel sentinel.conf
docker run --name sentinel-6392 -d -v /data/6392:/data redis:6.2.4 redis-sentinel sentinel.conf

Service status

  • After deployment, the service status is normal
localhost:6380> info replication
# Replication
role:master
connected_slaves:2
slave0:ip=localhost,port=6381,state=online,offset=87732583,lag=0
slave1:ip=localhost,port=6382,state=online,offset=87732859,lag=0
master_failover_state:no-failover
master_replid:37de4d22e710641dc652a5bbdfce56e2a1a2adbc
master_replid2:34e7cea8f2c107360f54ffa42830f835160e9179
master_repl_offset:87732859
second_repl_offset:66487
repl_backlog_active:1
repl_backlog_size:4194304
repl_backlog_first_byte_offset:83538556
repl_backlog_histlen:4194304
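Output like the above can also be checked from a script instead of by eye. The following is a minimal sketch that counts the replicas the master reports as online. The function name count_online_replicas is hypothetical, and the field layout is assumed to match the output shown here.

```shell
#!/bin/sh
# Count replicas reported as state=online in `INFO replication` output
# read from stdin. Carriage returns are stripped because redis-cli
# passes through the CRLF line endings of the INFO reply.
count_online_replicas() {
    tr -d '\r' | grep -Ec '^slave[0-9]+:.*state=online' || true
}
```

Typical use would be redis-cli -p 6380 info replication | count_online_replicas, which should print 2 for the healthy deployment shown above.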

  • Manually shut down the master node

Sentinel selects a new node, but the cluster ends up unavailable (the master is reported as odown):

sentinel_masters:1
sentinel_tilt:0
sentinel_running_scripts:0
sentinel_scripts_queue_length:0
sentinel_simulate_failure_flags:0
master0:name=mymuster,status=odown,address=localhost:6381,slaves=2,sentinels=3

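The Sentinel log reports a failover timeout: the selected slave never confirms its promotion, and the attempt is aborted three minutes later.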

1:X 29 Jul 2021 08:04:02.564 # +new-epoch 6
1:X 29 Jul 2021 08:04:02.564 # +try-failover master mymuster localhost 6380
1:X 29 Jul 2021 08:04:02.565 # +vote-for-leader 51d50beb72d5c0be16ed7024a14e4cdca42eea40 6
1:X 29 Jul 2021 08:04:02.570 # 1492f59d09fd16db8d8f4b389a1a3b54c719f61f voted for 51d50beb72d5c0be16ed7024a14e4cdca42eea40 6
1:X 29 Jul 2021 08:04:02.575 # 8e77458d184c98048bf7591822f8549010b744b1 voted for 51d50beb72d5c0be16ed7024a14e4cdca42eea40 6
1:X 29 Jul 2021 08:04:02.624 # +elected-leader master mymuster localhost 6380
1:X 29 Jul 2021 08:04:02.624 # +failover-state-select-slave master mymuster localhost 6380
1:X 29 Jul 2021 08:04:02.695 # +selected-slave slave localhost:6381 localhost 6381 @ mymuster localhost 6380
1:X 29 Jul 2021 08:04:02.695 * +failover-state-send-slaveof-noone slave localhost:6381 localhost 6381 @ mymuster localhost 6380
1:X 29 Jul 2021 08:04:02.767 * +failover-state-wait-promotion slave localhost:6381 localhost 6381 @ mymuster localhost 6380
1:X 29 Jul 2021 08:07:02.850 # -failover-abort-slave-timeout master mymuster localhost 6380
1:X 29 Jul 2021 08:07:02.916 # Next failover delay: I will not start a failover before Thu Jul 29 08:10:03 2021
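To spot this condition without reading the log by hand, one option is to scan for the -failover-abort-* events. This is a sketch only: the function name aborted_failovers is hypothetical, and the log format is assumed to be the one shown above.

```shell
#!/bin/sh
# Print the name of every master whose failover was aborted, one per line,
# based on Sentinel log text read from stdin. The master name is the word
# that follows "master" on a -failover-abort-* line.
aborted_failovers() {
    awk '/ -failover-abort-/ {
        for (i = 1; i < NF; i++)
            if ($i == "master") { print $(i + 1); break }
    }' | sort -u
}
```

For example: docker logs sentinel-6390 2>&1 | aborted_failovers.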

Both slaves output the following log:

1:S 29 Jul 2021 08:07:41.866 # Error condition on socket for SYNC: Connection refused
1:S 29 Jul 2021 08:07:42.874 # Error condition on socket for SYNC: Connection refused
1:S 29 Jul 2021 08:07:43.882 # Error condition on socket for SYNC: Connection refused
1:S 29 Jul 2021 08:07:44.890 # Error condition on socket for SYNC: Connection refused
1:S 29 Jul 2021 08:07:45.897 # Error condition on socket for SYNC: Connection refused
1:S 29 Jul 2021 08:07:46.902 # Error condition on socket for SYNC: Connection refused
1:S 29 Jul 2021 08:07:47.908 # Error condition on socket for SYNC: Connection refused
1:S 29 Jul 2021 08:07:48.913 # Error condition on socket for SYNC: Connection refused

The test confirms that Sentinel cannot complete a failover once CONFIG has been renamed.

Fix

Production is running normally for now, but if the master node fails, Redis will become unavailable, so the problem needs to be fixed.

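  • Edit redis.conf and remove the setting that renames CONFIG
  • Run BGSAVE on one slave and use LASTSAVE to check when it has finished (a fresh dump lets the node start up quickly)
  • Restart the slave once its BGSAVE has completed
  • Connect to the master and use info replication to check that the slave has finished syncing; then repeat the same procedure on the other slave
  • Once all slaves have been handled, run BGSAVE on the master as well
  • Then connect to Sentinel and run sentinel failover mymuster to trigger a manual failover
  • After the failover completes, restart the original master node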
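The BGSAVE and LASTSAVE step above can be scripted. The following is a minimal sketch, not a tested runbook: bgsave_and_wait and the REDIS_CLI override are hypothetical names, and the node is assumed to be reachable on a local port as in the test setup.

```shell
#!/bin/sh
# Command used to talk to Redis; can be overridden, e.g. with a stub.
REDIS_CLI="${REDIS_CLI:-redis-cli}"

# Trigger BGSAVE on the node at port $1 and wait until LASTSAVE changes,
# which indicates that a new dump has been written.
# $2 is the maximum number of one-second polls (default 60).
# Returns 0 on success, 1 if BGSAVE fails or the wait times out.
# Note: plain sh has no local variables, so these names are global.
bgsave_and_wait() {
    port="$1"
    tries="${2:-60}"
    before=$($REDIS_CLI -p "$port" LASTSAVE)
    $REDIS_CLI -p "$port" BGSAVE >/dev/null || return 1
    while [ "$tries" -gt 0 ]; do
        [ "$($REDIS_CLI -p "$port" LASTSAVE)" != "$before" ] && return 0
        sleep 1
        tries=$((tries - 1))
    done
    return 1
}
```

For one slave this might be used as bgsave_and_wait 6381 && docker restart redis-6381, using the container names from the test setup. After every node has been handled, the manual failover would be redis-cli -p 6390 sentinel failover mymuster, where port 6390 for Sentinel is an assumption inferred from the container name.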

Following the steps above resolves the problem.
