Blocking on Catalog HTTP requests. #1154

Closed
darron opened this issue Aug 4, 2015 · 39 comments

@darron
Contributor

darron commented Aug 4, 2015

830 nodes in AWS - 5 server nodes spread across 3 AZs.

Symptoms:

Couldn't read the catalog (nodes, services or specific services) from the HTTP API - but I could read from some other endpoints no problem:

[prod]darron@i-4d107ee5:~$ curl "$CONSUL_SERVER/v1/agent/services"
{"consul":{"ID":"consul","Service":"consul","Tags":[],"Address":"","Port":8300},"datadog":{"ID":"datadog","Service":"datadog","Tags":["encrypted-storage","consul-server","az-us-east-1e"],"Address":"","Port":0}}
[prod]darron@i-4d107ee5:~$ curl "$CONSUL_SERVER/v1/agent/self"
{"Config":{"Bootstrap":false,"BootstrapExpect":0,"Server":true,"Datacenter":"us-east-1-prod","DataDir":"/mnt/consul","DNSRecursor":"","DNSRecursors":["8.8.8.8"],"DNSConfig":{"NodeTTL":0,"ServiceTTL":{"*":10000000000,"package-proxy":60000000000},"AllowStale":true,"EnableTruncate":false,"MaxStale":5000000000,"OnlyPassing":false},"Domain":"consul.","LogLevel":"debug","NodeName":"i-4d107ee5","ClientAddr":"127.0.0.1","BindAddr":"0.0.0.0","AdvertiseAddr":"10.237.194.239","AdvertiseAddrWan":"10.237.194.239","Ports":{"DNS":8600,"HTTP":8500,"HTTPS":-1,"RPC":8400,"SerfLan":8301,"SerfWan":8302,"Server":8300},"Addresses":{"DNS":"","HTTP":"","HTTPS":"","RPC":""},"LeaveOnTerm":false,"SkipLeaveOnInt":false,"StatsiteAddr":"","StatsitePrefix":"consul","StatsdAddr":"127.0.0.1:8125","Protocol":2,"EnableDebug":false,"VerifyIncoming":false,"VerifyOutgoing":false,"VerifyServerHostname":false,"CAFile":"","CertFile":"","KeyFile":"","ServerName":"","StartJoin":["10.154.251.119","10.167.160.227","10.230.66.62","10.237.194.239","10.99.179.149"],"StartJoinWan":[],"RetryJoin":[],"RetryMaxAttempts":0,"RetryIntervalRaw":"","RetryJoinWan":[],"RetryMaxAttemptsWan":0,"RetryIntervalWanRaw":"","UiDir":"/mnt/consul/ui","PidFile":"","EnableSyslog":false,"SyslogFacility":"LOCAL0","RejoinAfterLeave":false,"CheckUpdateInterval":300000000000,"ACLDatacenter":"us-east-1-prod","ACLTTL":30000000000,"ACLTTLRaw":"","ACLDefaultPolicy":"allow","ACLDownPolicy":"extend-cache","Watches":[{}],"DisableRemoteExec":false,"DisableUpdateCheck":false,"DisableAnonymousSignature":false,"HTTPAPIResponseHeaders":null,"AtlasInfrastructure":"","AtlasJoin":false,"Revision":"9a9cc9341bb487651a0399e3fc5e1e8a42e62dd9+CHANGES","Version":"0.5.2","VersionPrerelease":"","UnixSockets":{"Usr":"","Grp":"","Perms":""},"SessionTTLMin":0,"SessionTTLMinRaw":""},"Member":{"Name":"i-4d107ee5","Addr":"10.237.194.239","Port":8301,"Tags":{"build":"0.5.2:9a9cc934","dc":"us-east-1-prod","port":"8300","role":"consul","vsn":"2","vsn_max":"2","vsn_min":"1"},"Status":1,"ProtocolMin":1,"ProtocolMax":2,"ProtocolCur":2,"DelegateMin":2,"DelegateMax":4,"DelegateCur":4}}
[prod]darron@i-4d107ee5:~$ curl "$CONSUL_SERVER/v1/catalog/services"
^C
[prod]darron@i-4d107ee5:~$ curl "$CONSUL_SERVER/v1/catalog/nodes"
^C
[prod]darron@i-4d107ee5:~$ curl "$CONSUL_SERVER/v1/catalog/datacenters"
["us-east-1-prod"]

The cluster was quite calm during this time period:

http://shared.froese.org/2015/fgfti-12-20.jpg

No leadership transitions at all - nothing to note.

Looking through the logs - it looked like the node actually blocked for 30 minutes and only unblocked after a connection timed out:

2015/08/04 17:43:08 [ERR] yamux: Failed to write header: write tcp 10.99.179.149:8300: connection timed out
2015/08/04 17:43:08 [ERR] http: Request /v1/catalog/services?token=<hidden>, error: rpc error: failed to start stream: session shutdown
2015/08/04 17:43:08 [DEBUG] http: Request /v1/catalog/services?token=<hidden> (26m39.38531829s)
2015/08/04 17:43:08 [ERR] http: Request /v1/catalog/services, error: rpc error: failed to start stream: session shutdown
2015/08/04 17:43:08 [DEBUG] http: Request /v1/catalog/services (22m22.342346091s)
2015/08/04 17:43:08 [ERR] http: Request /v1/catalog/services?token=<hidden>, error: rpc error: failed to start stream: session shutdown
2015/08/04 17:43:08 [DEBUG] http: Request /v1/catalog/services?token=<hidden> (24m4.874033511s)
2015/08/04 17:43:08 [ERR] http: Request /v1/catalog/services, error: rpc error: failed to start stream: session shutdown
2015/08/04 17:43:08 [DEBUG] http: Request /v1/catalog/services (21m19.359961147s)
2015/08/04 17:43:08 [ERR] http: Request /v1/catalog/services, error: rpc error: failed to start stream: session shutdown
2015/08/04 17:43:08 [DEBUG] http: Request /v1/catalog/services (21m29.150600718s)
2015/08/04 17:43:08 [ERR] http: Request /v1/catalog/services, error: rpc error: failed to start stream: session shutdown
2015/08/04 17:43:08 [DEBUG] http: Request /v1/catalog/services (20m3.012623844s)
2015/08/04 17:43:08 [ERR] http: Request /v1/catalog/services?token=<hidden>, error: rpc error: failed to start stream: session shutdown
2015/08/04 17:43:08 [DEBUG] http: Request /v1/catalog/services?token=<hidden> (28m42.310562609s)
2015/08/04 17:43:08 [ERR] http: Request /v1/catalog/services?token=<hidden>, error: rpc error: failed to start stream: session shutdown
2015/08/04 17:43:08 [DEBUG] http: Request /v1/catalog/services?token=<hidden> (26m42.342176182s)
2015/08/04 17:43:08 [ERR] http: Request /v1/catalog/services?token=<hidden>, error: rpc error: failed to start stream: session shutdown
2015/08/04 17:43:08 [DEBUG] http: Request /v1/catalog/services?token=<hidden> (26m37.069896892s)
2015/08/04 17:43:08 [ERR] http: Request /v1/catalog/services?token=<hidden>, error: rpc error: failed to start stream: session shutdown
2015/08/04 17:43:08 [DEBUG] http: Request /v1/catalog/services?token=<hidden> (26m16.177630996s)
2015/08/04 17:43:08 [ERR] http: Request /v1/catalog/services?token=<hidden>, error: rpc error: failed to start stream: session shutdown
2015/08/04 17:43:08 [DEBUG] http: Request /v1/catalog/services?token=<hidden> (27m22.315814218s)
2015/08/04 17:43:08 [ERR] http: Request /v1/catalog/services?token=<hidden>, error: rpc error: failed to start stream: session shutdown
2015/08/04 17:43:08 [DEBUG] http: Request /v1/catalog/services?token=<hidden> (26m25.775714773s)
2015/08/04 17:43:08 [ERR] http: Request /v1/catalog/services?token=<hidden>, error: rpc error: failed to start stream: session shutdown
2015/08/04 17:43:08 [DEBUG] http: Request /v1/catalog/services?token=<hidden> (29m3.656387875s)
2015/08/04 17:43:08 [ERR] http: Request /v1/catalog/services?token=<hidden>, error: rpc error: failed to start stream: session shutdown
2015/08/04 17:43:08 [DEBUG] http: Request /v1/catalog/services?token=<hidden> (30m24.660808717s)
2015/08/04 17:43:08 [ERR] http: Request /v1/catalog/services?token=<hidden>, error: rpc error: failed to start stream: session shutdown
2015/08/04 17:43:08 [DEBUG] http: Request /v1/catalog/services?token=<hidden> (28m14.89759416s)
2015/08/04 17:43:08 [DEBUG] http: Request /v1/catalog/services?token=<hidden> (29m25.053353314s)
2015/08/04 17:43:08 [ERR] http: Request /v1/catalog/services, error: rpc error: failed to start stream: session shutdown

Full debug logs are posted here - look from just before 17:00 until 17:12 and then from there until 17:43:

https://gist.github.com/darron/d7337cafb1a7b8640abd

It looks like it blocked on a failed connection to 10.99.179.149, and once that timed out it finally surfaced the errors from the failed queries and allowed subsequent queries to work.

If you look at the volume of the logs, there isn't the usual chatter while the server is 'blocked' from responding.

That server is very close:

[prod]root@i-4d107ee5:/var/log/consul# ping 10.99.179.149
PING 10.99.179.149 (10.99.179.149) 56(84) bytes of data.
64 bytes from 10.99.179.149: icmp_seq=1 ttl=41 time=0.610 ms
64 bytes from 10.99.179.149: icmp_seq=2 ttl=41 time=0.554 ms
64 bytes from 10.99.179.149: icmp_seq=3 ttl=41 time=0.556 ms
64 bytes from 10.99.179.149: icmp_seq=4 ttl=41 time=0.531 ms
64 bytes from 10.99.179.149: icmp_seq=5 ttl=41 time=0.505 ms
64 bytes from 10.99.179.149: icmp_seq=6 ttl=41 time=0.489 ms
64 bytes from 10.99.179.149: icmp_seq=7 ttl=41 time=0.569 ms
64 bytes from 10.99.179.149: icmp_seq=8 ttl=41 time=0.576 ms
64 bytes from 10.99.179.149: icmp_seq=9 ttl=41 time=0.545 ms
64 bytes from 10.99.179.149: icmp_seq=10 ttl=41 time=0.543 ms
64 bytes from 10.99.179.149: icmp_seq=11 ttl=41 time=0.534 ms
^C
--- 10.99.179.149 ping statistics ---
11 packets transmitted, 11 received, 0% packet loss, time 9996ms
rtt min/avg/max/mdev = 0.489/0.546/0.610/0.039 ms

I didn't check during the event - but there's a very low probability they were completely disconnected for 30 minutes.

This was a really nice long outage on a machine that didn't impact production - but we're starting to see many, many of these sorts of transient blocks across our cluster.

Is this something that's known?

I looked through the other issues and didn't see anything that looked like this.

Are there any other details I could grab the next time this happens? It's happening a fair amount in our cluster, but this was the longest outage we had seen.

I tried to strace - but there was so much output from all the threads it wasn't very useful.

@armon
Member

armon commented Aug 5, 2015

@darron Hmm this is odd. I haven't seen that before. @slackpad is looking into this.

@armon
Member

armon commented Aug 6, 2015

@darron What version of Consul are you guys running? Two things are odd here. A blocking query is capped to 10m on the server side, but the fact that no query is going through and "failed to start stream" is being logged means we are blocking trying to open a new stream to the server (a single TCP connection, multiplexed per request).

We apply back pressure on new streams to prevent overwhelming the servers, and it seems like the clients are waiting for a new stream to be available. This is causing the blocking query to go for a very long time.

Not sure yet why clients are blocking. Do you see anything odd on the servers around those times? Specifically the leader is likely to be the culprit.

@armon
Member

armon commented Aug 6, 2015

@darron Could you also check the value of /proc/sys/net/ipv4/tcp_keepalive_time?

@darron
Contributor Author

darron commented Aug 6, 2015

We're running 0.5.2 everywhere.

Those logs were from a server node - but it wasn't the leader at that time.

[prod]root@i-4d107ee5:~# cat /proc/sys/net/ipv4/tcp_keepalive_time
7200

I will look at the Leader logs from that time - will post them next.
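
That 7200 is the stock Linux default, which means the kernel won't send its first keepalive probe on an idle connection (with SO_KEEPALIVE enabled) for two hours, so a silently dead TCP peer can sit undetected for a long time. Purely for illustration - these particular values aren't a recommendation from this thread - the knobs look like this:

sysctl -w net.ipv4.tcp_keepalive_time=120     # first probe after 120s of idle
sysctl -w net.ipv4.tcp_keepalive_intvl=30     # then probe every 30s
sysctl -w net.ipv4.tcp_keepalive_probes=5     # declare the peer dead after 5 failed probes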

@darron
Contributor Author

darron commented Aug 6, 2015

Just to clarify - I was doing all of the querying for the catalog ON a server node - that node was NOT the leader. All the logs were from the same server node.

Leader logs from that time are posted here: http://shared.froese.org/2015/consul-leader-logs.txt.zip

@armon
Member

armon commented Aug 6, 2015

The leader seems to be processing things in a timely manner, and there are no serious errors coming from yamux. Something is suspicious about the "connection timed out" and "failed to start stream" errors.

Current thinking is that there is a client-side stall when creating new multiplexed connections for some reason, which is why the queries are taking so long. It's not yet clear why, as the server side doesn't appear to be having any issues.

@darron
Contributor Author

darron commented Aug 6, 2015

We're seeing this sort of thing intermittently for short periods of time across the cluster - a failure to query something that requires the catalog.

It has seemed to get more frequent lately - when we hit around 800 nodes.

We are using the service catalog to list and ssh to various groups of nodes - which people use many times during the day - but the interruptions are usually very short - a few seconds - and then it's all OK again.

We're also seeing the same sort of pauses when we query the KV store intermittently - when I can see something like rpc error in the logs - it's usually pretty easy to diagnose - but this one is pretty baffling.

@slackpad
Contributor

slackpad commented Aug 6, 2015

Thanks for the additional log info. We are tracing through the yamux code and its related code in Consul's pool.go looking for potential deadlocks that could cause this. Will update this as we find more.

@slackpad
Contributor

slackpad commented Aug 6, 2015

Still need to verify, but looking at the code there's a potential issue related to the behavior of keepalive()'s use of Ping() inside yamux. It looks like if the TCP connection gets disrupted after it sends a ping message (and we pile up too many outgoing sessions in flight to attempt new sessions), then we will never try to write more data to the TCP stream, even from the keepalive goroutine, since it's blocking on the ping response message to measure the RTT. Also, an error writing the ping message won't shut down the session. This all seems like it could lead to a blocked client that has to wait for the kernel to decide the TCP connection is dead, which seems to match the symptoms in your client logs.

One other thought: I think to get into this condition you'd have to be sending a bunch of things from a client (to fill up synCh and get it to wait before trying to open new sessions) and then disrupt the TCP connection. Since those sessions won't get ACKed and the keepalive doesn't try to close the session if it can't write, nothing will detect from the client side that the connection is dead.

@slackpad
Contributor

slackpad commented Aug 6, 2015

I've been able to reproduce a similar client-side hang by setting up a simple 2-node cluster and having a process make curl calls to the catalog endpoint in a loop. While that's running, I run sudo iptables -A OUTPUT -p tcp --dport 8300 -j DROP on the same client machine. Once TCP traffic is disrupted and the synCh channel fills up, new HTTP requests with curl will hang for minutes until I kill them or run sudo iptables -F OUTPUT; they don't time out or start failing because OpenStream() in yamux will block while it has too many SYNs in flight.
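
Roughly, that repro boils down to the following (addresses and ports are the defaults; this is a sketch of the steps described above, not an exact script):

# Hammer the catalog endpoint on the local agent in a loop:
while true; do curl -s http://127.0.0.1:8500/v1/catalog/services > /dev/null; done &

# Blackhole the agent's outbound RPC traffic to the servers (port 8300):
sudo iptables -A OUTPUT -p tcp --dport 8300 -j DROP

# New catalog requests now hang; removing the rule un-wedges them:
sudo iptables -F OUTPUT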

In the case of bad TCP connectivity we can get into this state, though in your case Raft was happy and there weren't a lot of other indications of network problems. Given this, we'll dig deeper on the server side and see if there's some way to get into a stall and stop processing the incoming stream. If that happened it would cause these exact symptoms on the client.

@slackpad
Contributor

slackpad commented Aug 7, 2015

A thorough review of the receive side in the server hasn't found any smoking guns, so my current working hypothesis is that you are seeing TCP connectivity issues and this known client-side failure mode is causing things to hang. I think the next thing on my side would be some stress testing to try to get it to repro without messing with the TCP stream.

@darron it would be super useful to get some tcpdump for Consul RPC traffic from the agent to the server during one of these events (preferably showing things working properly and then into the failure) - would something like that be possible? Also, I think we could make this a lot better behaved with write timeouts on the client side - would it be possible to run a slightly patched agent in your environment?

@darron
Contributor Author

darron commented Aug 7, 2015

Hmm - I won't be able to install a patched agent tomorrow - but I may be able to get some tcpdump data and send it over.

Will see what I can get.

@slackpad
Contributor

slackpad commented Aug 7, 2015

Thanks! If there's a choice of where to gather it, the server side is probably better. If we see the client hanging and don't see the traffic on the server we will know it's probably the problem I've been able to repro. If we see it coming in to the server then we will need to dig some more. If that's too much load, we can still get useful info from the client, looking at responses from the server.

@darron
Contributor Author

darron commented Aug 7, 2015

I'll try and get it in both locations (client and leader node) when I can see it happening again - blocking on read queries.

I noticed tonight it corrected after I bounced the client node - will keep tcpdump going through the bounce if that's helpful at all.

Would it be of any use to get dumps from the other server nodes as well? Or just the leader?

@slackpad
Contributor

slackpad commented Aug 7, 2015

Ideally we'd have the client <-> server it picked <-> leader so we can piece together the whole chain. I think the hard thing would be seeing a few successful queries followed by the hang - there might be a ring buffer mode to tcpdump that would let you keep going and then kill it once you see the hang so you could observe the start of the event. Definitely understand if this is hard to get, though, so I'll work with whatever you can provide!
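
tcpdump does have a ring-buffer mode, so one way to capture right up to the hang without filling the disk is something like this (interface name and sizes are just examples):

# Keep the last ~1 GB of Consul server RPC traffic in 10 rotating ~100 MB files;
# leave it running and stop it once the hang is observed.
sudo tcpdump -i eth0 -s 0 -w /var/tmp/consul-rpc.pcap -C 100 -W 10 'tcp port 8300'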

@darron
Contributor Author

darron commented Aug 7, 2015

Had a quick look to see about dropped packets - these are all the servers. All of them look pretty similar:

DEBUG [2015-08-07 20:43:59 +0000][i-5fd89ff7]             RX packets:1501056815 errors:0 dropped:0 overruns:0 frame:0
DEBUG [2015-08-07 20:43:59 +0000][i-5fd89ff7]             TX packets:1402597692 errors:0 dropped:6417 overruns:0 carrier:0
DEBUG [2015-08-07 20:43:59 +0000][i-4d107ee5]             RX packets:1445179124 errors:0 dropped:0 overruns:0 frame:0
DEBUG [2015-08-07 20:43:59 +0000][i-4d107ee5]             TX packets:1526096898 errors:0 dropped:4991 overruns:0 carrier:0
DEBUG [2015-08-07 20:43:59 +0000][i-6d1f93c1]             RX packets:1267607279 errors:0 dropped:0 overruns:0 frame:0
DEBUG [2015-08-07 20:43:59 +0000][i-6d1f93c1]             TX packets:1228674138 errors:0 dropped:4891 overruns:0 carrier:0
DEBUG [2015-08-07 20:43:59 +0000][i-9e0b344d]             RX packets:1712000326 errors:0 dropped:0 overruns:0 frame:0
DEBUG [2015-08-07 20:43:59 +0000][i-9e0b344d]             TX packets:1669753494 errors:0 dropped:6106 overruns:0 carrier:0
DEBUG [2015-08-07 20:43:59 +0000][i-798685aa]             RX packets:1675884488 errors:0 dropped:0 overruns:0 frame:0
DEBUG [2015-08-07 20:43:59 +0000][i-798685aa]             TX packets:1657588499 errors:0 dropped:7576 overruns:0 carrier:0

From the client node with the most consistent problems:

RX packets:287530432 errors:0 dropped:0 overruns:0 frame:0
TX packets:394703605 errors:0 dropped:0 overruns:0 carrier:0

Still looking at other items and will be getting the tcpdumps when I've replaced the one odd server node:

http://shared.froese.org/2015/g9mfe-14-50.jpg

@darron
Contributor Author

darron commented Aug 10, 2015

OK - replaced that "bad" node and hoped it was better - but I have since found a node that has been broken for a while - another server node.

Grabbed a pcap and logs and sending them via email.

@slackpad
Contributor

Sounds good - thanks for gathering up this data. If you email support@hashicorp.com I'll have access to it, or you can CC me (james at hashicorp.com).

@darron
Contributor Author

darron commented Aug 10, 2015

One question - the leader node has approximately 1500 connections from that node:

consul    21826 12788       root  436u     IPv6           12975902      0t0        TCP ip-10-99-179-149.ec2.internal:8300->ip-10-230-66-62.ec2.internal:60884 (ESTABLISHED)
consul    21826 12788       root  439u     IPv6           12998231      0t0        TCP ip-10-99-179-149.ec2.internal:8300->ip-10-230-66-62.ec2.internal:36362 (ESTABLISHED)
consul    21826 12788       root  443u     IPv6           12604821      0t0        TCP ip-10-99-179-149.ec2.internal:8300->ip-10-230-66-62.ec2.internal:57392 (ESTABLISHED)
consul    21826 12788       root  482u     IPv6           12197349      0t0        TCP ip-10-99-179-149.ec2.internal:8300->ip-10-230-66-62.ec2.internal:38264 (ESTABLISHED)
consul    21826 12788       root  490u     IPv6           12574229      0t0        TCP ip-10-99-179-149.ec2.internal:8300->ip-10-230-66-62.ec2.internal:52213 (ESTABLISHED)
consul    21826 12789       root   31u     IPv6           11828421      0t0        TCP ip-10-99-179-149.ec2.internal:8300->ip-10-230-66-62.ec2.internal:41615 (ESTABLISHED)
consul    21826 12789       root   54u     IPv6           11588799      0t0        TCP ip-10-99-179-149.ec2.internal:8300->ip-10-230-66-62.ec2.internal:55525 (ESTABLISHED)
consul    21826 12789       root   55u     IPv6           11971532      0t0        TCP ip-10-99-179-149.ec2.internal:8300->ip-10-230-66-62.ec2.internal:51587 (ESTABLISHED)
consul    21826 12789       root  101u     IPv6           11586535      0t0        TCP ip-10-99-179-149.ec2.internal:8300->ip-10-230-66-62.ec2.internal:55479 (ESTABLISHED)
consul    21826 12789       root  109u     IPv6           11817542      0t0        TCP ip-10-99-179-149.ec2.internal:8300->ip-10-230-66-62.ec2.internal:40625 (ESTABLISHED)
consul    21826 12789       root  168u     IPv6           11870052      0t0        TCP ip-10-99-179-149.ec2.internal:8300->ip-10-230-66-62.ec2.internal:45734 (ESTABLISHED)
consul    21826 12789       root  196u     IPv6           12939976      0t0        TCP ip-10-99-179-149.ec2.internal:8300->ip-10-230-66-62.ec2.internal:55478 (ESTABLISHED)
consul    21826 12789       root  234u     IPv6           11599674      0t0        TCP ip-10-99-179-149.ec2.internal:8300->ip-10-230-66-62.ec2.internal:56084 (ESTABLISHED)
consul    21826 12789       root  238u     IPv6           12584337      0t0        TCP ip-10-99-179-149.ec2.internal:8300->ip-10-230-66-62.ec2.internal:54285 (ESTABLISHED)
consul    21826 12789       root  348u     IPv6           12896099      0t0        TCP ip-10-99-179-149.ec2.internal:8300->ip-10-230-66-62.ec2.internal:47237 (ESTABLISHED)
consul    21826 12789       root  363u     IPv6           11980649      0t0        TCP ip-10-99-179-149.ec2.internal:8300->ip-10-230-66-62.ec2.internal:52613 (ESTABLISHED)
consul    21826 12789       root  371u     IPv6           12561975      0t0        TCP ip-10-99-179-149.ec2.internal:8300->ip-10-230-66-62.ec2.internal:51324 (ESTABLISHED)
consul    21826 12789       root  380u     IPv6           12886575      0t0        TCP ip-10-99-179-149.ec2.internal:8300->ip-10-230-66-62.ec2.internal:45731 (ESTABLISHED)
consul    21826 12789       root  384u     IPv6           11999693      0t0        TCP ip-10-99-179-149.ec2.internal:8300->ip-10-230-66-62.ec2.internal:55626 (ESTABLISHED)
consul    21826 12789       root  394u     IPv6           12951487      0t0        TCP ip-10-99-179-149.ec2.internal:8300->ip-10-230-66-62.ec2.internal:57427 (ESTABLISHED)
consul    21826 12789       root  396u     IPv6           12669285      0t0        TCP ip-10-99-179-149.ec2.internal:8300->ip-10-230-66-62.ec2.internal:60813 (ESTABLISHED)
consul    21826 12789       root  402u     IPv6           12678029      0t0        TCP ip-10-99-179-149.ec2.internal:8300->ip-10-230-66-62.ec2.internal:33793 (ESTABLISHED)

Is that normal?

Another server node that IS working - it can query the catalog and KV store - has a similar number of connections.

@slackpad
Contributor

That doesn't seem right - with connection pooling it should maintain a single connection and multiplex different logical connections over that via yamux streams. I'll take a look at this (and the data you sent) - this may be an important clue.

@darron
Contributor Author

darron commented Aug 10, 2015

Around 870 nodes in the cluster - the leader has around 23k TCP connections as reported by lsof - though that might not be accurate, as I look at it with other tools.

@darron
Contributor Author

darron commented Aug 10, 2015

It may be an artifact of how lsof reports things - there appear to be 29 unique ports on the client side, and each one looks like it is connected 51 times, which appears to correspond to the number of threads.

Sorry if this added noise - just looked really odd.
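
One way to sanity-check that - lsof prints a row per task (thread), so every socket shows up once per thread - is to count at the socket level with ss, which reports each connection once. A sketch, assuming the RPC port is the default 8300:

# Count established RPC connections on the leader, one line per actual socket:
sudo ss -tn state established '( sport = :8300 )' | tail -n +2 | wc -l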

@darron
Contributor Author

darron commented Aug 10, 2015

Did some more looking tonight at the node that's messed up.

I wrote a little script that queries the local catalog and then logs the result to a file:

https://gist.github.com/darron/fc8f01aec8c3c3fc2223
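
Roughly, a check of that shape looks like the sketch below (the log path here is hypothetical; the real script is in the gist above):

#!/bin/bash
# Poll the local catalog, give up after 59 seconds, and log a timestamped result.
LOG=/var/log/consul-catalog-check.log
while true; do
  if curl -sf --max-time 59 "http://127.0.0.1:8500/v1/catalog/services" > /dev/null; then
    echo "$(date -u '+%Y-%m-%d %H:%M:%S+00:00') OK" >> "$LOG"
  else
    echo "$(date -u '+%Y-%m-%d %H:%M:%S+00:00') FAIL" >> "$LOG"
  fi
  sleep 5
done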

It times out after 59 seconds - when it doesn't time out it runs every 5 seconds. Here's the results from today:

https://gist.github.com/darron/f770119f438c9f9a818a

It was failing for many hours today (even before I started logging), but it started working again at 2015-08-10 01:56:28+00:00:

https://gist.github.com/darron/f770119f438c9f9a818a#file-gistfile1-txt-L78

Consul logs on that node from around that time:

https://gist.github.com/darron/630c1571f819363ffc1d

There's nothing obvious in the leader Consul logs.

It stopped working again: 2015-08-10 04:32:42+00:00:

https://gist.github.com/darron/f770119f438c9f9a818a#file-gistfile1-txt-L1929

Here's the Consul logs from that time:

https://gist.github.com/darron/6d43d1fdce2b8e586a8a#file-gistfile1-txt-L17

Consul logs from the leader show the same "push/pull sync":

https://gist.github.com/darron/a66074cd61a61faf24d5#file-gistfile1-txt-L9

Will send links to full copies of logs via email.

@darron
Contributor Author

darron commented Aug 10, 2015

On top of the 2 nodes I've been sending over debugging info about, this can be seen randomly on other nodes as well - I have another server doing the same thing right now - luckily it's not the master. Watching #1165

@darron
Contributor Author

darron commented Aug 10, 2015

@slackpad Do you need any more tcpdumps? Either:

  1. Working to non-working.
  2. Non-working to working.

Let me know - I have both types of nodes right now that I can grab data from. All server nodes.

@slackpad
Contributor

@darron I think I'm pretty good with what you've sent so far. Your last round of pcap files led to #1165, which I don't think will actively harm anything, but it will create a small flurry of network traffic and probably some extra load on the servers when it happens (it spins up 3 goroutines per connection, etc.).

If it's easy to get a two-sided pcap of a good to bad transition that would be useful, but I don't want to take too much of your time trying to get that if it's a hassle.

@slackpad
Contributor

Also, if you get the working to not-working transition can you please snag both sides of the conversation on the failing server as well - I think that'll paint the complete picture.

@darron
Contributor Author

darron commented Aug 10, 2015

Yes - trying for both sides - capturing a lot and then occasionally clearing it when it hasn't failed yet.

@darron
Contributor Author

darron commented Aug 10, 2015

Got one - packaging up the pcaps - this one is a little bigger than the last one - was running for a while.

@slackpad
Contributor

Awesome - last one I had to pre-filter because wireshark was crashing on me :-)

@darron
Contributor Author

darron commented Aug 10, 2015

Yeah - it's hard to catch a semi-random event when you're tcpdumping. You'll need to pre-filter this as well. :-)

@bastichelaar

We're seeing the same behaviour in our cluster. One of the nodes doesn't respond anymore; calls to /agent are fine, but calls to /catalog just hang forever. Our cluster is much smaller, but we're querying the API quite a bit. We're running the Prometheus Consul exporter, which polls the API quite often, and our own gateway, which uses Consul as well.

I turned on the debug level, but the only thing I could see is that the health checks via HTTP are not being processed anymore.

Let me know if I can help debug this! @darron, could it be that you're also using the Prometheus Consul exporter?

@slackpad
Contributor

Hi @bastichelaar thanks for offering to help debug this (and sorry for the late reply)! Here's what we know so far:

  1. Through @darron's data drops we realized that yamux (the library Consul uses to multiplex RPC requests over a single TCP connection) had the potential to stall on the client or server side. There was coupling between the send and receive loops due to the way it can try to send window updates to apply back pressure or send replies to ping requests during reads. We cleaned these up and added a fallback timeout here - Fixes read/write loop and ping deadlocks yamux#9.
  2. We also realized that heavy RPC traffic during TCP connects would create a ton of new connections and kill off all but one; we fixed that here - Fixes #1165 by having threads wait for any outstanding connect to finish. #1170. This wasn't directly involved in the stalls, but it wasn't helping anything and created a bunch of useless network traffic at a critical time.

I merged those fixes, along with some minimal cherry picks to get the build working, to a branch based on the 0.5.2 tag, which is here - https://github.com/hashicorp/consul/tree/v0.5.2-deadlock-patches. I ended up rolling back both memberlist (6025015f2dc659ca2c735112d37e753bda6e329d) and serf (668982d8f90f5eff4a766583c1286393c1d27f68) to versions near the 0.5.2 release date since there are protocol version bumps that you definitely don't want in there, so a build in this configuration is very close to 0.5.2, but includes the above fixes.

@darron is doing some testing with this new binary and it is behaving better (shorter stall periods), but alas it's hitting the new timeouts we added so there's still something we need to figure out.

If you'd like to try out this binary and/or send some logs from this happening in your configuration you can email support@hashicorp.com and CC me at james at hashicorp.com and I'll give you the build. We will keep this issue posted with what we find and with the fix once we figure it out.

@darron
Contributor Author

darron commented Aug 20, 2015

OK - to summarize where this particular issue is at: we've now had a 24-hour period where no nodes have gone deaf, been unable to communicate with the master, and needed to be bounced. We haven't seen that for a long time, given we're at 900 fairly busy nodes - busy with services, DNS lookups and KV traffic.

A key understanding for me was recognizing that:

  1. Server nodes were all of a sudden unable to talk to the leader.
  2. As a result, all of the client nodes that were talking to them were effectively cut off as well.

Bouncing the server would resolve it for all the nodes - but that's an unsustainable strategy.

There have been 3 things that changed this situation and have hopefully helped to solve it:

  1. The Yamux patch for deadlocks - we're running that on the server nodes.
  2. The Consul patch for connection throttling - we're running that on the server nodes.
  3. Making sure the server nodes were immune to the "Rides the Rocket" Xen/EC2 networking bug. For future reference - this is what did it for us:

/sbin/ethtool -K eth0 tso off gso off sg off
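
A quick way to confirm the offloads actually ended up off (a sketch, assuming eth0 as in the command above; exact names vary a little by driver):

/sbin/ethtool -k eth0 | egrep 'tcp-segmentation-offload|generic-segmentation-offload|scatter-gather'

Note that ethtool settings don't survive a reboot, so the command needs to be reapplied at boot (e.g. from an interface post-up hook).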

1 and 2 helped a ton - but what seems to have made it much more consistent is 3.

We had thought we had fixed the Xen/EC2 bug, but the fix turned out to be incomplete on the Consul server nodes, and that only came to light through some excellent "reading-the-pcap-leaves" by @slackpad.

Basically, the leader was sending things to the server nodes that they never received, and that put them into a strange state that took a while to recover from.

There's another patch to yamux, but we're going to give this current state a week and then think about changing out the Consul binary cluster-wide.

I'll update this next week after I'm done traveling and we have more data to back up the current conclusion.

@bastichelaar

Same here: we also applied the "Rides the Rocket" Xen/EC2 fix and built a new version of Consul using the deadlock-fixes branch (https://github.com/hashicorp/consul/tree/v0.5.2-deadlock-patches). The cluster is stable now, and the /v1/catalog API endpoint doesn't become unresponsive anymore. We're still monitoring the cluster, but it looks good! I'll keep this issue updated if we encounter any issues.

Thanks @darron and @slackpad for your help!

@darron
Contributor Author

darron commented Aug 28, 2015

@slackpad It looks like those patches have pretty much solved this problem for us.

In the last 8 days we have had a single instance of things going deaf - but it was short and corrected itself:

http://shared.froese.org/2015/0hrr9-17-16.jpg

I wonder if we need to try the other Yamux patch - but pretty happy with where we are at the moment.

Worth a 0.5.3 release?

@slackpad
Contributor

@darron thanks for the update. If things are performing well you probably don't need that last patch, as it was kind of a catch-all for any TCP errors; you've got a high enough request load that one of your streams is likely going to try to send a header and time out (the patch will time out even if it's not trying to write a header). As for 0.5.3, I think at this point we should focus on 0.6, which will have the performance fixes too. Are you OK running the patch release for a little while longer?

@darron
Contributor Author

darron commented Aug 28, 2015

Yeah - it should be OK. I will build a package and install across my cluster.

Thanks for all your help @slackpad - really appreciate it.

darron closed this as completed Aug 28, 2015
@slackpad
Contributor

@darron my pleasure - I totally appreciate all your efforts repro-ing this and mechanizing and supplying the debug data - you put in some major effort to track this down!
