文章目录
常见报错关键字
缓冲区过小导致常见报错
Connection with master lost.
I/O error trying to sync with MASTER: connection lost
什么是Redis缓冲区
为避免Client、Server的请求发送和处理速度不匹配,Server给每个连接的Client都设个输入、输出缓冲区,称为客户端输入、输出缓冲区。
输入缓冲区先暂存Client发来的命令,Redis主线程再从中读命令并处理
当Redis主线程处理完数据,会把结果写入输出缓冲区,再通过输出缓冲区返给客户端
可以通过client list
查看对应的连接数
127.0.0.1:6379> client list
id=896 addr=127.0.0.1:42352 laddr=127.0.0.1:6379 fd=345 name= age=7 idle=0 flags=N db=0 sub=0 psub=0 ssub=0 multi=-1 qbuf=0 qbuf-free=20474 argv-mem=0 multi-mem=0 rbs=1024 rbp=5 obl=5 oll=0 omem=0 tot-mem=22272 events=r cmd=set user=default redir=-1 resp=2
id=897 addr=127.0.0.1:42354 laddr=127.0.0.1:6379 fd=346 name= age=7 idle=0 flags=N db=0 sub=0 psub=0 ssub=0 multi=-1 qbuf=0 qbuf-free=20474 argv-mem=0 multi-mem=0 rbs=1024 rbp=5 obl=5 oll=0 omem=0 tot-mem=22272 events=r cmd=set user=default redir=-1 resp=2
参数解释
- cmd 客户端最新执行的命令。这个例子中执行的是CLIENT命令。
- qbuf 输入缓冲区已经使用的大小。这个例子中的CLIENT命令已使用了26字节大小的缓冲区。
- qbuf-free 输入缓冲区尚未使用的大小。这个例子中的CLIENT命令还可以使用32742字节的缓冲区。qbuf和qbuf-free的总和就是,Redis服务器端当前为已连接的这个客户端分配的缓冲区总大小。这个例子中总共分配了 0 + 20474 = 20474字节,也就是20KB的缓冲区。
通常Redis Server不止服务一个Client,当多个C连接占用的内存总量,超过maxmemory配置项(如4G),触发Redis数据淘汰。一旦数据被淘汰,再要访问这部分数据,就需要去后端DB读,降低业务应用访问性能。如使用多个客户端,导致Redis内存占用过大,也会导致内存溢出(out-of-memory),进而引起Redis崩溃,给业务应用造成严重影响。
Redis并没有提供参数调节客户端输入缓冲区大小。如要避免输入缓冲区溢出,只能从数据命令的发送和处理速度入手,即避免客户端写入bigkey,以及避免Redis主线程阻塞。
如果想实时排查Redis各个命令操作,可以使用MONITOR
MONITOR占用的是缓冲区的资源,当执行多个MONITOR也会引起Redis缓冲区异常
MONITOR 命令用于实时打印出 Redis 服务器接收到的命令,调试用 (MONITOR会影响Redis性能,流量大的情况下不建议执行)
127.0.0.1:6379> MONITOR
OK
1732604362.782596 [0 127.0.0.1:38750] "auth" "(redacted)"
1732604367.838141 [0 127.0.0.1:38750] "set" "a" "b"
1732604393.800393 [0 127.0.0.1:38750] "DEL" "a"
错误模拟
- Redis版本7.0.2
- Redis主从架构
首先进行压测Redis,当连接数超过buffer缓冲区就可以复现报错
[root@redis01 redis-7.0.8]# ./src/redis-benchmark -h 127.0.0.1 -p 6379 -a 123123 -n 10000000 -r 100000 -c 512 -t set -d 1024 --threads 32
Redis 从库错误日志 Connection with master lost.
这里可以看到主从复制已经出现超时的情况
228669:S 26 Nov 2024 14:00:25.903 # Connection with master lost.
228669:S 26 Nov 2024 14:00:25.903 * Caching the disconnected master state.
228669:S 26 Nov 2024 14:00:25.903 * Reconnecting to MASTER 192.168.31.70:6379
228669:S 26 Nov 2024 14:00:25.903 * MASTER <-> REPLICA sync started
228669:S 26 Nov 2024 14:00:25.904 * Non blocking connect for SYNC fired the event.
228669:S 26 Nov 2024 14:00:25.940 * Master replied to PING, replication can continue...
228669:S 26 Nov 2024 14:00:25.951 * Trying a partial resynchronization (request 82a4447de6a2668795037238256ae70ab2b916a5:236872550).
228669:S 26 Nov 2024 14:00:25.964 * Full resync from master: 82a4447de6a2668795037238256ae70ab2b916a5:511625874
228669:S 26 Nov 2024 14:00:26.918 * MASTER <-> REPLICA sync: receiving 103527419 bytes from master to disk
这里整理了一下主从复制三种工作机制
- 当Master、Slave连接良好时,Master通过向Slave发送命令流来保持Slave更新,以复制由于在Master对数据集产生的影响:客户端写入、Key过期或逐出、任何其他更改主数据集的操作。
- 当Master、Slave由于网络问题或由于主从心跳检测超时等原因导致连接中断时,Slave将重新连接并尝试继续增量同步,尝试获取在断开连接期间丢失的命令流部分。
- 当偏移量之后的数据已挤出挤压缓冲区或初次连接,则副本将要求全量同步。Master将创建其数据的快照并发送到Slave,然后在数据集更改时继续发送命令流。
参考https://blog.csdn.net/qq_41025464/article/details/131236112
接下来我们看从库断开日志
主从复制详解 https://blog.csdn.net/qq_34556414/article/details/78666411
这里可以看到从偏移量以后开始重连同步Full resync from master: 82a4447de6a2668795037238256ae70ab2b916a5:1057029748
228669:S 26 Nov 2024 14:00:45.188 # I/O error trying to sync with MASTER: connection lost
228669:S 26 Nov 2024 14:00:45.192 * Reconnecting to MASTER 192.168.31.70:6379 after failure
228669:S 26 Nov 2024 14:00:45.192 * MASTER <-> REPLICA sync started
228669:S 26 Nov 2024 14:00:45.192 * Non blocking connect for SYNC fired the event.
228669:S 26 Nov 2024 14:00:45.215 * Master replied to PING, replication can continue...
228669:S 26 Nov 2024 14:00:45.234 * Trying a partial resynchronization (request 82a4447de6a2668795037238256ae70ab2b916a5:236872550).
228669:S 26 Nov 2024 14:00:45.242 * Full resync from master: 82a4447de6a2668795037238256ae70ab2b916a5:1057029748
从节点接收主节点的响应数据保存运行ID和偏移量offset, 执行到当前步骤时从节点打印如下日志:
228669:S 26 Nov 2024 14:00:46.225 * MASTER <-> REPLICA sync: receiving 104394983 bytes from master to disk
228669:S 26 Nov 2024 14:00:52.046 * 10000 changes in 60 seconds. Saving...
228669:S 26 Nov 2024 14:00:52.050 * Background saving started by pid 232880
232880:C 26 Nov 2024 14:00:53.153 * DB saved on disk
232880:C 26 Nov 2024 14:00:53.159 * Fork CoW for RDB: current 0 MB, peak 0 MB, average 0 MB
228669:S 26 Nov 2024 14:00:53.260 * Background saving terminated with success
此时从服务器已经无法连接到主服务器上
228669:S 26 Nov 2024 14:00:55.460 # I/O error trying to sync with MASTER: connection lost
228669:S 26 Nov 2024 14:00:55.463 * Reconnecting to MASTER 192.168.31.70:6379 after failure
228669:S 26 Nov 2024 14:00:55.464 * MASTER <-> REPLICA sync started
228669:S 26 Nov 2024 14:00:55.464 * Non blocking connect for SYNC fired the event.
228669:S 26 Nov 2024 14:00:55.520 * Master replied to PING, replication can continue...
228669:S 26 Nov 2024 14:00:55.534 * Trying a partial resynchronization (request 82a4447de6a2668795037238256ae70ab2b916a5:236872550).
228669:S 26 Nov 2024 14:00:55.545 * Full resync from master: 82a4447de6a2668795037238256ae70ab2b916a5:1326841109
228669:S 26 Nov 2024 14:00:56.432 * MASTER <-> REPLICA sync: receiving 104400203 bytes from master to disk
228669:S 26 Nov 2024 14:01:14.790 # I/O error trying to sync with MASTER: connection lost
228669:S 26 Nov 2024 14:01:14.790 * Reconnecting to MASTER 192.168.31.70:6379 after failure
228669:S 26 Nov 2024 14:01:14.790 * MASTER <-> REPLICA sync started
228669:S 26 Nov 2024 14:01:14.790 * Non blocking connect for SYNC fired the event.
228669:S 26 Nov 2024 14:01:14.819 * Master replied to PING, replication can continue...
228669:S 26 Nov 2024 14:01:14.829 * Trying a partial resynchronization (request 82a4447de6a2668795037238256ae70ab2b916a5:236872550).
传输文件这一步操作非常耗时, 速度取决于主从节点之间网络带宽, 通过细致分析Full resync和MASTER<->SLAVE这两行日志的时间差, 可以算出RDB文件从创建到传输完毕消耗的总时间。
228669:S 26 Nov 2024 14:01:14.904 * Full resync from master: 82a4447de6a2668795037238256ae70ab2b916a5:1883888517
228669:S 26 Nov 2024 14:01:15.814 * MASTER <-> REPLICA sync: receiving 104400203 bytes from master to disk 主节点发送RDB文件给从节点, 从节点把接收的RDB文件保存在本地并直接作为从节点的数据文件, 接收完RDB后从节点打印相关日志, 可以在日志中查看主节点发送的数据量:
主库错误日志
21776:M 26 Nov 2024 14:05:30.730 * Starting automatic rewriting of AOF on 103% growth
21776:M 26 Nov 2024 14:05:30.735 * Creating AOF incr file appendonly.aof.60.incr.aof on background rewrite
21776:M 26 Nov 2024 14:05:30.743 * Background append only file rewriting started by pid 22159
22159:C 26 Nov 2024 14:05:31.590 * Successfully created the temporary AOF base file temp-rewriteaof-bg-22159.aof
22159:C 26 Nov 2024 14:05:31.599 * Fork CoW for AOF rewrite: current 63 MB, peak 63 MB, average 37 MB
对于从节点开始接收RDB快照到接收完成期间, 主节点仍然响应读写命令, 因此主节点会把这期间写命令数据保存在复制客户端缓冲区内, 当从节点加载完RDB文件后, 主节点再把缓冲区内的数据发送给从节点, 保证主从之间数据一致性
21776:M 26 Nov 2024 14:05:31.654 # Client id=546 addr=192.168.31.72:46262 laddr=192.168.31.70:6379 fd=9 name= age=9 idle=9 flags=S db=0 sub=0 psub=0 ssub=0 multi=-1 qbuf=0 qbuf-free=0 argv-mem=0 multi-mem=0 rbs=1024 rbp=0 obl=0 oll=13092 omem=268438368 tot-mem=268440168 events=rw cmd=psync user=default redir=-1 resp=2 scheduled to be closed ASAP for overcoming of output buffer limits.:
21776:M 26 Nov 2024 14:05:31.655 * Background AOF rewrite terminated with success
21776:M 26 Nov 2024 14:05:31.655 * Successfully renamed the temporary AOF base file temp-rewriteaof-bg-22159.aof into appendonly.aof.60.base.rdb
21776:M 26 Nov 2024 14:05:31.656 * Removing the history file appendonly.aof.59.incr.aof in the background
21776:M 26 Nov 2024 14:05:31.656 * Removing the history file appendonly.aof.59.base.rdb in the background
21776:M 26 Nov 2024 14:05:31.657 * Background AOF rewrite finished successfully
21776:M 26 Nov 2024 14:05:31.662 # Connection with replica 192.168.31.72:6379 lost.
21776:M 26 Nov 2024 14:05:31.733 * Replica 192.168.31.72:6379 asks for synchronization
21776:M 26 Nov 2024 14:05:31.733 * Unable to partial resync with replica 192.168.31.72:6379 for lack of backlog (Replica request was: 236872550).
主节点执行bgsave保存RDB文件到本地,主节点bgsave相关日志如下
21776:M 26 Nov 2024 14:05:31.733 * Starting BGSAVE for SYNC with target: disk
21776:M 26 Nov 2024 14:05:31.741 * Background saving started by pid 22160
22160:C 26 Nov 2024 14:05:32.625 * DB saved on disk
22160:C 26 Nov 2024 14:05:32.638 * Fork CoW for RDB: current 67 MB, peak 67 MB, average 37 MB
21776:M 26 Nov 2024 14:05:32.732 * Background saving terminated with success
21776:M 26 Nov 2024 14:05:35.683 * Starting automatic rewriting of AOF on 102% growth
21776:M 26 Nov 2024 14:05:35.688 * Creating AOF incr file appendonly.aof.61.incr.aof on background rewrite
21776:M 26 Nov 2024 14:05:35.696 * Background append only file rewriting started by pid 22161
22161:C 26 Nov 2024 14:05:36.642 * Successfully created the temporary AOF base file temp-rewriteaof-bg-22161.aof
22161:C 26 Nov 2024 14:05:36.655 * Fork CoW for AOF rewrite: current 73 MB, peak 73 MB, average 42 MB
21776:M 26 Nov 2024 14:05:36.772 * Background AOF rewrite terminated with success
21776:M 26 Nov 2024 14:05:36.772 * Successfully renamed the temporary AOF base file temp-rewriteaof-bg-22161.aof into appendonly.aof.61.base.rdb
21776:M 26 Nov 2024 14:05:36.774 * Removing the history file appendonly.aof.60.incr.aof in the background
21776:M 26 Nov 2024 14:05:36.774 * Removing the history file appendonly.aof.60.base.rdb in the background
21776:M 26 Nov 2024 14:05:36.786 * Background AOF rewrite finished successfully
21776:M 26 Nov 2024 14:05:40.274 * Starting automatic rewriting of AOF on 101% growth
21776:M 26 Nov 2024 14:05:40.279 * Creating AOF incr file appendonly.aof.62.incr.aof on background rewrite
21776:M 26 Nov 2024 14:05:40.287 * Background append only file rewriting started by pid 22162
21776:M 26 Nov 2024 14:05:40.949 # Client id=547 addr=192.168.31.72:42884 laddr=192.168.31.70:6379 fd=7 name= age=9 idle=9 flags=S db=0 sub=0 psub=0 ssub=0 multi=-1 qbuf=0 qbuf-free=0 argv-mem=0 multi-mem=0 rbs=1024 rbp=0 obl=0 oll=13092 omem=268438368 tot-mem=268440168 events=rw cmd=psync user=default redir=-1 resp=2 scheduled to be closed ASAP for overcoming of output buffer limits.
21776:M 26 Nov 2024 14:05:40.958 # Connection with replica 192.168.31.72:6379 lost.
22162:C 26 Nov 2024 14:05:41.102 * Successfully created the temporary AOF base file temp-rewriteaof-bg-22162.aof
22162:C 26 Nov 2024 14:05:41.111 * Fork CoW for AOF rewrite: current 112 MB, peak 112 MB, average 63 MB
21776:M 26 Nov 2024 14:05:41.170 * Replica 192.168.31.72:6379 asks for synchronization
21776:M 26 Nov 2024 14:05:41.171 * Unable to partial resync with replica 192.168.31.72:6379 for lack of backlog (Replica request was: 236872550).
21776:M 26 Nov 2024 14:05:41.171 * No BGSAVE in progress, but another BG operation is active. BGSAVE for replication delayed
21776:M 26 Nov 2024 14:05:41.173 * Background AOF rewrite terminated with success
21776:M 26 Nov 2024 14:05:41.173 * Successfully renamed the temporary AOF base file temp-rewriteaof-bg-22162.aof into appendonly.aof.62.base.rdb
21776:M 26 Nov 2024 14:05:41.174 * Removing the history file appendonly.aof.61.incr.aof in the background
21776:M 26 Nov 2024 14:05:41.174 * Removing the history file appendonly.aof.61.base.rdb in the background
21776:M 26 Nov 2024 14:05:41.175 * Background AOF rewrite finished successfully
21776:M 26 Nov 2024 14:05:41.175 * Starting BGSAVE for SYNC with target: disk
21776:M 26 Nov 2024 14:05:41.184 * Background saving started by pid 22163
22163:C 26 Nov 2024 14:05:41.995 * DB saved on disk
22163:C 26 Nov 2024 14:05:42.008 * Fork CoW for RDB: current 67 MB, peak 67 MB, average 35 MB
21776:M 26 Nov 2024 14:05:42.062 * Background saving terminated with success
关键错误 Unable to partial resync with replica 192.168.31.72:6379 for lack of backlog (Replica request was: 236872550).
主从连接状态
127.0.0.1:6379> info replication
# Replication
role:master
connected_slaves:1
slave0:ip=192.168.31.72,port=6379,state=send_bulk,offset=0,lag=0
master_failover_state:no-failover
master_replid:82a4447de6a2668795037238256ae70ab2b916a5
master_replid2:0000000000000000000000000000000000000000
master_repl_offset:5518345775
second_repl_offset:-1
repl_backlog_active:1
repl_backlog_size:1048576
repl_backlog_first_byte_offset:5414985241
repl_backlog_histlen:103360535
解决办法
调整前参数
# 普通client buffer限制
client-output-buffer-limit normal 0 0 0
# slave client buffer限制
client-output-buffer-limit slave 256mb 64mb 60
# pubsub client buffer限制
client-output-buffer-limit pubsub 32mb 8mb 60
调整后
# 普通client buffer限制
client-output-buffer-limit normal 0 0 0
# slave client buffer限制
client-output-buffer-limit slave 0 0 0
默认缓冲区大小限制:当缓冲区使用超过256mb,master会尽快杀掉它;当缓冲区使用大于64mb,且小于256mb的soft limit值时,并持续时间达60秒,也会被Master尽快杀掉。
master的client output buffer持续增长,直到触发默认配置的阈值限制client-output-buffer-limit slave 256mb 64mb 60
,那么master则会把这个slave连接强制断开,这就会导致复制中断。
之后slave重新发送复制请求,但是以上原因可能依旧存在,经过一段时间后又产生上述问题,主从连接再次被断开,周而复始,主从频繁断开链接,无法正常复制数据。
解决方案是,适当调大client-output-buffer-limit
的阈值,并且解决slave写入慢的情况,保证master发给slave的数据可以很快得处理完成,这样才能避免频繁断开复制的问题
通常不建议client-output-buffer-limit slave设置为0 0 0 还需要实际情况进行调整,不宜过大
同步压测结果均正常
在调整参数时,请及时关注主机内存!