一文详解长连接黑洞重现和分析
阿里妹导读
背景
为什么数据库crash重启恢复后,业务还长时间不能恢复?
我依赖的业务做了高可用切换,但是我的业务长时间报错。
我依赖的服务下掉了一个节点,为什么我的业务长时间报错?
客户做变配,升级云服务节点规格,为什么会导致客户业务长时间报错?
重现
LVS+MySQL高可用切换
net.ipv4.tcp_keepalive_intvl = 30
net.ipv4.tcp_keepalive_probes = 5
net.ipv4.tcp_keepalive_time = 10
net.ipv4.tcp_retries1 = 3
net.ipv4.tcp_retries2 = 15 //主要是这个参数,默认以及alios 几乎都是15
#ipvsadm -L --timeout
Timeout (tcp tcpfin udp): 900 120 300
#ipvsadm -L -n
IP Virtual Server version 1.2.1 (size=4096)
Prot LocalAddress:Port Scheduler Flags
RemoteAddress:Port Forward Weight ActiveConn InActConn
TCP 127.0.0.1:3001 rr
127.0.0.1:3307 Masq 1 0 0
Sysbench启动压力模拟用户访问,在31秒的时候模拟管控检测到3307的Master无法访问,所以管控执行切主把3306的Slave提升为新的Master,同时到LVS摘掉 3307,挂上3306,此时管控端着冰可乐、翘着二郎腿,得意地说,你就看吧我们管控牛不牛、我们的高可用牛不牛,这一套行云流水3秒钟不到全搞定。
#cat del3307.sh
ipvsadm -d -t 127.0.0.1:3001 -r 127.0.0.1:3307 ; ipvsadm -a -t 127.0.0.1:3001 -r 127.0.0.1:3306 -m
此时Sysbench运行状态,在第32秒如期跌0:
#/usr/local/bin/sysbench --debug=on --mysql-user='root' --mysql-password='123' --mysql-db='test' --mysql-host='127.0.0.1' --mysql-port='3001' --tables='16' --table-size='10000' --range-size='5' --db-ps-mode='disable' --skip-trx='on' --mysql-ignore-errors='all' --time='11080' --report-interval='1' --histogram='on' --threads=1 oltp_read_write run
sysbench 1.1.0 (using bundled LuaJIT 2.1.0-beta3)
Running the test with following options:
Number of threads: 1
Report intermediate results every 1 second(s)
Debug mode enabled.
Initializing random number generator from current time
Initializing worker threads...
DEBUG: Worker thread (#0) started
DEBUG: Reporting thread started
DEBUG: Worker thread (#0) initialized
Threads started!
[ 1s ] thds: 1 tps: 51.89 qps: 947.00 (r/w/o: 739.44/207.56/0.00) lat (ms,95%): 35.59 err/s 0.00 reconn/s: 0.00
[ 2s ] thds: 1 tps: 60.03 qps: 1084.54 (r/w/o: 841.42/243.12/0.00) lat (ms,95%): 22.28 err/s 0.00 reconn/s: 0.00
…………
[ 29s ] thds: 1 tps: 68.00 qps: 1223.01 (r/w/o: 952.00/271.00/0.00) lat (ms,95%): 16.12 err/s 0.00 reconn/s: 0.00
[ 30s ] thds: 1 tps: 66.00 qps: 1188.00 (r/w/o: 924.00/264.00/0.00) lat (ms,95%): 16.71 err/s 0.00 reconn/s: 0.00
[ 31s ] thds: 1 tps: 67.00 qps: 1203.96 (r/w/o: 937.97/265.99/0.00) lat (ms,95%): 17.95 err/s 0.00 reconn/s: 0.00
[ 32s ] thds: 1 tps: 22.99 qps: 416.85 (r/w/o: 321.88/94.96/0.00) lat (ms,95%): 15.55 err/s 0.00 reconn/s: 0.00
[ 33s ] thds: 1 tps: 0.00 qps: 0.00 (r/w/o: 0.00/0.00/0.00) lat (ms,95%): 0.00 err/s 0.00 reconn/s: 0.00
[ 34s ] thds: 1 tps: 0.00 qps: 0.00 (r/w/o: 0.00/0.00/0.00) lat (ms,95%): 0.00 err/s 0.00 reconn/s: 0.00
[ 35s ] thds: 1 tps: 0.00 qps: 0.00 (r/w/o: 0.00/0.00/0.00) lat (ms,95%): 0.00 err/s 0.00 reconn/s: 0.00
5分钟后故障报告大批量涌进来,客户:怎么回事,我们的业务挂掉10分钟了,报错都是访问MySQL超时,赶紧给我看看,从监控确实看到10分钟后客户业务还没恢复:
[1 tps: 0.00 qps: 0.00 (r/w/o: 0.00/0.00/0.00) lat (ms,95%): 0.00 err/s 0.00 reconn/s: 0.00 ] thds:
[1 tps: 0.00 qps: 0.00 (r/w/o: 0.00/0.00/0.00) lat (ms,95%): 0.00 err/s 0.00 reconn/s: 0.00 ] thds:
[1 tps: 0.00 qps: 0.00 (r/w/o: 0.00/0.00/0.00) lat (ms,95%): 0.00 err/s 0.00 reconn/s: 0.00 ] thds:
[1 tps: 0.00 qps: 0.00 (r/w/o: 0.00/0.00/0.00) lat (ms,95%): 0.00 err/s 0.00 reconn/s: 0.00 ] thds:
这时oncall都被从被窝里拎了起来,不知谁说了一句赶紧恢复吧,先试试把应用重启,5秒钟后应用重启完毕,业务恢复,大家开心地笑了,又成功防御住一次故障升级,还是重启大法好!
#ipvsadm -L -n --stats -t 127.0.0.1:3001
Prot LocalAddress:Port Conns InPkts OutPkts InBytes OutBytes
-> RemoteAddress:Port
TCP 127.0.0.1:3001 2 660294 661999 78202968 184940K
-> 127.0.0.1:3306 0 0 0 0 0
#ipvsadm -Lcn | head -10
IPVS connection entries
pro expire state source virtual destination
TCP 13:11 ESTABLISHED 127.0.0.1:33864 127.0.0.1:3001 127.0.0.1:3307
#netstat -anto |grep -E "Recv|33864|3001|33077"
Proto Recv-Q Send-Q Local Address Foreign Address State Timer
tcp 0 248 127.0.0.1:33864 127.0.0.1:3001 ESTABLISHED probe (33.48/0/8)
tcp6 0 11 127.0.0.1:3307 127.0.0.1:33864 ESTABLISHED on (49.03/13/0)
直到900多秒后OS重试了15次发现都失败,于是向业务/Sysbench返回连接异常,触发业务/Sysbench释放异常连接重建新连接,新连接指向了新的Master 3306,业务恢复正常。
[1 tps: 0.00 qps: 0.00 (r/w/o: 0.00/0.00/0.00) lat (ms,95%): 0.00 err/s 0.00 reconn/s: 0.00 ] thds:
DEBUG: Ignoring error 2013 Lost connection to MySQL server during query,
DEBUG: Reconnecting
DEBUG: Reconnected
[1 tps: 53.00 qps: 950.97 (r/w/o: 741.98/208.99/0.00) lat (ms,95%): 30.26 err/s 0.00 reconn/s: 1.00 ] thds:
[1 tps: 64.00 qps: 1154.03 (r/w/o: 896.02/258.01/0.00) lat (ms,95%): 22.69 err/s 0.00 reconn/s: 0.00 ] thds:
[1 tps: 66.00 qps: 1184.93 (r/w/o: 923.94/260.98/0.00) lat (ms,95%): 25.28 err/s 0.00 reconn/s: 0.00 ] thds:
到这里重现了故障中经常碰到的业务需要900多秒才能慢慢恢复,这个问题也就是TCP长连接流量黑洞。
[1 tps: 66.00 qps: 1191.00 (r/w/o: 924.00/267.00/0.00) lat (ms,95%): 17.63 err/s 0.00 reconn/s: 0.00 ] thds:
[1 tps: 63.00 qps: 1123.01 (r/w/o: 874.00/249.00/0.00) lat (ms,95%): 17.63 err/s 0.00 reconn/s: 0.00 ] thds:
[1 tps: 0.00 qps: 0.00 (r/w/o: 0.00/0.00/0.00) lat (ms,95%): 0.00 err/s 0.00 reconn/s: 0.00 ] thds:
[1 tps: 0.00 qps: 0.00 (r/w/o: 0.00/0.00/0.00) lat (ms,95%): 0.00 err/s 0.00 reconn/s: 0.00 ] thds:
[1 tps: 0.00 qps: 0.00 (r/w/o: 0.00/0.00/0.00) lat (ms,95%): 0.00 err/s 0.00 reconn/s: 0.00 ] thds:
[1 tps: 0.00 qps: 0.00 (r/w/o: 0.00/0.00/0.00) lat (ms,95%): 0.00 err/s 0.00 reconn/s: 0.00 ] thds:
[1 tps: 0.00 qps: 0.00 (r/w/o: 0.00/0.00/0.00) lat (ms,95%): 0.00 err/s 0.00 reconn/s: 0.00 ] thds:
[1 tps: 0.00 qps: 0.00 (r/w/o: 0.00/0.00/0.00) lat (ms,95%): 0.00 err/s 0.00 reconn/s: 0.00 ] thds:
[1 tps: 0.00 qps: 0.00 (r/w/o: 0.00/0.00/0.00) lat (ms,95%): 0.00 err/s 0.00 reconn/s: 0.00 ] thds:
[1 tps: 0.00 qps: 0.00 (r/w/o: 0.00/0.00/0.00) lat (ms,95%): 0.00 err/s 0.00 reconn/s: 0.00 ] thds:
[1 tps: 0.00 qps: 0.00 (r/w/o: 0.00/0.00/0.00) lat (ms,95%): 0.00 err/s 0.00 reconn/s: 0.00 ] thds:
[1 tps: 0.00 qps: 0.00 (r/w/o: 0.00/0.00/0.00) lat (ms,95%): 0.00 err/s 0.00 reconn/s: 0.00 ] thds:
[1 tps: 0.00 qps: 0.00 (r/w/o: 0.00/0.00/0.00) lat (ms,95%): 0.00 err/s 0.00 reconn/s: 0.00 ] thds:
[1 tps: 0.00 qps: 0.00 (r/w/o: 0.00/0.00/0.00) lat (ms,95%): 0.00 err/s 0.00 reconn/s: 0.00 ] thds:
[1 tps: 0.00 qps: 0.00 (r/w/o: 0.00/0.00/0.00) lat (ms,95%): 0.00 err/s 0.00 reconn/s: 0.00 ] thds:
[1 tps: 0.00 qps: 0.00 (r/w/o: 0.00/0.00/0.00) lat (ms,95%): 0.00 err/s 0.00 reconn/s: 0.00 ] thds:
[1 tps: 0.00 qps: 0.00 (r/w/o: 0.00/0.00/0.00) lat (ms,95%): 0.00 err/s 0.00 reconn/s: 0.00 ] thds:
[1 tps: 0.00 qps: 0.00 (r/w/o: 0.00/0.00/0.00) lat (ms,95%): 0.00 err/s 0.00 reconn/s: 0.00 ] thds:
[1 tps: 0.00 qps: 0.00 (r/w/o: 0.00/0.00/0.00) lat (ms,95%): 0.00 err/s 0.00 reconn/s: 0.00 ] thds:
[1 tps: 0.00 qps: 0.00 (r/w/o: 0.00/0.00/0.00) lat (ms,95%): 0.00 err/s 0.00 reconn/s: 0.00 ] thds:
[1 tps: 0.00 qps: 0.00 (r/w/o: 0.00/0.00/0.00) lat (ms,95%): 0.00 err/s 0.00 reconn/s: 0.00 ] thds:
DEBUG: Ignoring error 2013 Lost connection to MySQL server during query,
DEBUG: Reconnecting
DEBUG: Reconnected
[1 tps: 26.00 qps: 457.01 (r/w/o: 357.01/100.00/0.00) lat (ms,95%): 16.41 err/s 0.00 reconn/s: 1.00 ] thds:
[1 tps: 60.00 qps: 1086.94 (r/w/o: 846.96/239.99/0.00) lat (ms,95%): 26.68 err/s 0.00 reconn/s: 0.00 ] thds:
[1 tps: 63.00 qps: 1134.02 (r/w/o: 882.01/252.00/0.00) lat (ms,95%): 23.10 err/s 0.00 reconn/s: 0.00 ] thds:
LVS+Nginx上重现
TCP长连接流量黑洞原理总结
怎么解决
业务方
SocketTimeout
RFC 5482 TCP_USER_TIMEOUT
int timeout = 30000; // 30 seconds
setsockopt(sock, IPPROTO_TCP, TCP_USER_TIMEOUT, (char *)&timeout, sizeof(timeout));
在这行代码中,sock是已经established的TCP socket,我们将该socket的 TCP_USER_TIMEOUT设置为30000毫秒,也就是30秒。如果设置成功,这个TCP连接在发送数据包后30秒内如果没有收到ACK确认,将开始进行TCP连接的释放流程。
ConnectTimeout
连接池
OS兜底
net.ipv4.tcp_retries2 = 8
net.ipv4.tcp_syn_retries = 4
keepalive
LVS
其它
神奇的900秒
#ipvsadm -L --timeout
Timeout (tcp tcpfin udp): 900 120 300
为什么这个问题这几年才明显暴露
工程师们混沌了几十年; 之前因为出现频率低重启业务就糊弄过去了; 对新连接不存在这个问题; 有些连接池配置了Check机制(Check机制一般几秒钟超时fail); 微服务多了; 云上LVS普及了; k8s service大行其道;
我用的7层是不是就没有这个问题了?
极端情况
总结
如果你的业务支持设置SocketTimeout那么请一定要设置,但不一定适合分析类就是需要长时间返回的请求;
最好的方式是设置OS层面的TCP_USER_TIMEOUT参数,只要长时间没有ack就报错返回,但JDK不支持直接设置;
如果用了ALB/SLB就一定要配置connection_drain_timeout这个参数;
OS镜像层面也可以将tcp_retries2设置为5-10次做一个兜底;
对你的超时时间做到可控、可预期;
相关故障和资料:
tcp_retries1 - INTEGER
This value influences the time, after which TCP decides, that
something is wrong due to unacknowledged RTO retransmissions,
and reports this suspicion to the network layer.
See tcp_retries2 for more details.
RFC 1122 recommends at least 3 retransmissions, which is the
default.
tcp_retries2 - INTEGER
This value influences the timeout of an alive TCP connection,
when RTO retransmissions remain unacknowledged.
Given a value of N, a hypothetical TCP connection following
exponential backoff with an initial RTO of TCP_RTO_MIN would
retransmit N times before killing the connection at the (N+1)th RTO.
The default value of 15 yields a hypothetical timeout of 924.6
seconds and is a lower bound for the effective timeout.
TCP will effectively time out at the first RTO which exceeds the
hypothetical timeout.
RFC 1122 recommends at least 100 seconds for the timeout,
which corresponds to a value of at least 8.
参考链接:
[1]https://datatracker.ietf.org/doc/html/rfc5482
[2]https://github.com/torvalds/linux/commit/dca43c75e7e545694a9dd6288553f55c53e2a3a3
[3]https://github.com/tomasol/netty/commit/3010366d957d7b8106e353f99e15ccdb7d391d8f#diff-a998f73b7303461ca171432d10832884c6e7b0955d9f5634b9a8302b42a4706c
[4]https://github.com/redis/lettuce/pull/2499
[5]https://help.aliyun.com/document_detail/181399.html
[6]https://github.com/alibaba/druid/releases/tag/1.2.22
微信扫码关注该文公众号作者