Blocking on Catalog HTTP requests. #1154
Comments
@darron What version of Consul are you guys running? Two things are odd here. A blocking query is capped to 10m on the server side, but the fact that no query is going through and "failed to start stream" is being logged means we are blocking trying to open a new stream to the server (single TCP stream, multiplexed per-request). We apply back pressure on new streams to prevent overwhelming the servers, and it seems like the clients are waiting for a new stream to be available. This is causing the blocking query to go for a very long time. Not sure yet why clients are blocking. Do you see anything odd on the servers around those times? Specifically the leader is likely to be the culprit. |
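For context on the blocking-query mechanics mentioned above, a blocking read against the catalog is an HTTP GET carrying `index` and `wait` parameters, with the next index to block on returned in the `X-Consul-Index` header. A minimal sketch, assuming a local agent on 127.0.0.1:8500 (the endpoint and timeout values here are illustrative, not from this issue):

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"time"
)

func main() {
	// Client-side timeout a bit above the server-side 10m cap, so a healthy
	// blocking query can complete but a wedged one eventually errors out.
	client := &http.Client{Timeout: 11 * time.Minute}

	index := "1" // last X-Consul-Index we saw; "1" is just a starting point
	for {
		url := fmt.Sprintf("http://127.0.0.1:8500/v1/catalog/nodes?index=%s&wait=10m", index)
		resp, err := client.Get(url)
		if err != nil {
			fmt.Println("query failed:", err)
			time.Sleep(5 * time.Second)
			continue
		}
		body, _ := io.ReadAll(resp.Body)
		resp.Body.Close()

		// The index to block on next time comes back in this header.
		index = resp.Header.Get("X-Consul-Index")
		fmt.Printf("got %d bytes, index now %s\n", len(body), index)
	}
}
```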
@darron Could you also check the value of /proc/sys/net/ipv4/tcp_keepalive_time |
We're running 0.5.2 everywhere. Those logs were from a server node - but it wasn't the leader at that time.
I will look at the Leader logs from that time - will post them next. |
Just to clarify - I was doing all of the querying for the catalog ON a server node - that node was NOT the leader. All the logs were from the same server node. Leader logs from that time are posted here: http://shared.froese.org/2015/consul-leader-logs.txt.zip |
The leader seems to be processing things in a timely manner, and no serious errors coming from yamux. Something is suspicious about the "connection timed out" and "failed to start stream" errors. Current thinking is there is a client-side stall creating new multiplex connections for some reason, which is why the queries are taking so long. Not yet clear why, as the server side doesn't appear to be having any issues. |
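To illustrate the suspected client-side stall when creating new multiplexed streams, here is a standalone sketch using the yamux library directly (not Consul's actual pool.go code); it guards the stream open with a timeout so a wedged connection surfaces as an error instead of a silent hang:

```go
package connpool

import (
	"fmt"
	"net"
	"time"

	"github.com/hashicorp/yamux"
)

// openStreamWithTimeout tries to open a new multiplexed stream on an existing
// yamux session, giving up after the supplied timeout instead of blocking
// forever while waiting for a stream slot to become available.
func openStreamWithTimeout(session *yamux.Session, timeout time.Duration) (net.Conn, error) {
	type result struct {
		conn net.Conn
		err  error
	}
	ch := make(chan result, 1)
	go func() {
		c, err := session.Open() // this is the call that can stall
		ch <- result{c, err}
	}()
	select {
	case r := <-ch:
		return r.conn, r.err
	case <-time.After(timeout):
		return nil, fmt.Errorf("timed out opening stream after %s", timeout)
	}
}
```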
We're seeing this sort of thing intermittently for short periods across the cluster - a failure to query anything that requires the catalog. It seems to have gotten more frequent lately, since we hit around 800 nodes. We use the service catalog to list and ssh to various groups of nodes - something people do many times during the day - but the interruptions are usually very short, a few seconds, and then it's all OK again. We're also seeing the same sort of intermittent pauses when we query the KV store - when that happens I see something similar in the logs. |
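To make these intermittent pauses visible rather than just hanging, a small check with a hard client-side timeout can help; this is a minimal sketch, where the KV key, agent address, and timeout are made up for illustration:

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"time"
)

func main() {
	// A short timeout turns a multi-second stall into a visible error.
	client := &http.Client{Timeout: 5 * time.Second}

	// "deploy/current" is a made-up key, purely for illustration.
	resp, err := client.Get("http://127.0.0.1:8500/v1/kv/deploy/current")
	if err != nil {
		fmt.Println("KV read stalled or failed:", err)
		return
	}
	defer resp.Body.Close()

	body, _ := io.ReadAll(resp.Body)
	fmt.Printf("status %s, %d bytes\n", resp.Status, len(body))
}
```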
Thanks for the additional log info. We are tracing through the yamux code and its related code in Consul's pool.go looking for potential deadlocks that could cause this. Will update this as we find more. |
Still need to verify, but looking at the code there's a potential issue related to the behavior of the client-side stream setup in yamux. One other thought: I think to get into this condition you'd have to be sending a bunch of things from a client (enough to fill up the streams available on the shared connection). |
I've been able to reproduce a similar client-side hang by setting up a simple 2-node cluster and having a process make repeated queries against it. In the case of bad TCP connectivity we can get into this state, though in your case Raft was happy and there weren't a lot of other indications of network problems. Given this, we'll dig deeper on the server side and see if there's some way to get into a stall and stop processing the incoming stream. If that happened it would cause these exact symptoms on the client. |
A thorough review of the receive side in the server hasn't found any smoking guns, so my current working hypothesis is that you are seeing TCP connectivity issues and this known mode with the client is causing things to hang. I think the next thing on my side would be some stress testing to try to get it to repro without messing with the TCP stream. @darron it would be super useful to get some tcpdump captures from when this is happening. |
Hmm - I won't be able to install a patched agent tomorrow - but I may be able to get some tcpdump captures. Will see what I can get. |
Thanks! If there's a choice of where to gather it, the server side is probably better. If we see the client hanging and don't see the traffic on the server we will know it's probably the problem I've been able to repro. If we see it coming in to the server then we will need to dig some more. If that's too much load, we can still get useful info from the client, looking at responses from the server. |
I'll try and get it in both locations (client and leader node) when I can see it happening again - blocking on read queries. I noticed tonight it corrected after I bounced the client node - will keep watching for it. Would it be of any use to get dumps from the other server nodes as well? Or just the leader? |
Ideally we'd have the client <-> server it picked <-> leader so we can piece together the whole chain. I think the hard thing would be seeing a few successful queries followed by the hang - there might be a ring buffer mode in tcpdump that would keep the capture size manageable. |
Had a quick look to see about dropped packets - these are all the servers. All of them look pretty similar:
From the client node with the most consistent problems:
Still looking at other items and will be getting the tcpdumps when I've replaced the one odd server node: |
OK - replaced that "bad" node and hoped it was better - but I have since found a node that has been broken for a while - another server node. Grabbed a pcap and logs and sending them via email. |
Sounds good - thanks for gathering up this data. If you email support@hashicorp.com I'll have access to it, or you can CC me (james at hashicorp.com). |
One question - the leader node has approximately 1500 connections from that node:
Is that normal? Another server node that IS working - can query the catalog and KV store - has a similar amount of connections. |
That doesn't seem right - with connection pooling it should maintain a single connection and multiplex different logical connections over that via yamux streams. I'll take a look at this (and the data you sent) - this may be an important clue. |
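For reference, the pooling model being described - one real TCP connection with many logical streams multiplexed over it - looks roughly like this standalone yamux sketch. The address and port are made-up placeholders, and this is not Consul's actual pool code:

```go
package main

import (
	"fmt"
	"net"

	"github.com/hashicorp/yamux"
)

func main() {
	// One real TCP connection to a (hypothetical) server...
	conn, err := net.Dial("tcp", "10.0.0.1:8300")
	if err != nil {
		fmt.Println("dial failed:", err)
		return
	}

	session, err := yamux.Client(conn, yamux.DefaultConfig())
	if err != nil {
		fmt.Println("yamux setup failed:", err)
		return
	}

	// ...multiplexing many logical streams over it, one per in-flight request.
	for i := 0; i < 3; i++ {
		stream, err := session.Open()
		if err != nil {
			fmt.Println("failed to start stream:", err)
			return
		}
		fmt.Println("opened logical stream", i)
		stream.Close()
	}
	session.Close()
}
```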
Around 870 nodes in the cluster - the leader has around 23k TCP connections in total. |
It may be an artifact of how I was counting them. Sorry if this added noise - it just looked really odd. |
Did some more looking tonight at the node that's messed up.
I wrote a little script that queries the local catalog and then logs the result to a file: https://gist.github.com/darron/fc8f01aec8c3c3fc2223
It times out after 59 seconds - when it doesn't time out it runs every 5 seconds. Here's the results from today: https://gist.github.com/darron/f770119f438c9f9a818a
It was failing for many hours today (even before I started logging), but it started working again at 2015-08-10 01:56:28+00:00: https://gist.github.com/darron/f770119f438c9f9a818a#file-gistfile1-txt-L78
Consul logs on that node from around that time: https://gist.github.com/darron/630c1571f819363ffc1d
There's nothing obvious in the leader Consul logs.
It stopped working again: 2015-08-10 04:32:42+00:00: https://gist.github.com/darron/f770119f438c9f9a818a#file-gistfile1-txt-L1929
Here's the Consul logs from that time: https://gist.github.com/darron/6d43d1fdce2b8e586a8a#file-gistfile1-txt-L17
Consul logs from the leader show the same "push/pull sync": https://gist.github.com/darron/a66074cd61a61faf24d5#file-gistfile1-txt-L9
Will send links to full copies of logs via email. |
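For anyone following along, the linked gist is the real script; a rough Go equivalent of that kind of probe, using the 59-second timeout and 5-second interval described above (the log file name is made up), would be:

```go
package main

import (
	"fmt"
	"net/http"
	"os"
	"time"
)

func main() {
	// 59s timeout per the description above: a healthy query returns quickly,
	// a wedged one gets recorded as a timeout.
	client := &http.Client{Timeout: 59 * time.Second}

	logFile, err := os.OpenFile("catalog-probe.log", os.O_APPEND|os.O_CREATE|os.O_WRONLY, 0644)
	if err != nil {
		panic(err)
	}
	defer logFile.Close()

	for {
		start := time.Now()
		resp, err := client.Get("http://127.0.0.1:8500/v1/catalog/nodes")
		if err != nil {
			fmt.Fprintf(logFile, "%s FAIL after %s: %v\n", start.Format(time.RFC3339), time.Since(start), err)
		} else {
			resp.Body.Close()
			fmt.Fprintf(logFile, "%s OK %s in %s\n", start.Format(time.RFC3339), resp.Status, time.Since(start))
		}
		time.Sleep(5 * time.Second)
	}
}
```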
On top of the 2 nodes I've been sending over debugging info about, this can be seen randomly on other nodes as well - I have another server doing the same thing right now; luckily it's not the master. Watching #1165 |
@slackpad Do you need any more tcpdumps of either kind?
Let me know - I have both types of nodes right now that I can grab data from. All server nodes. |
@darron I think I'm pretty good with what you've sent so far. Your last round of pcap files led to #1165, which I don't think will actively harm anything but will create a small flurry of network traffic and probably some extra load on the servers when it happens (spins up 3 goroutines per connection, etc.). If it's easy to get a two-sided pcap of a good to bad transition that would be useful, but I don't want to take too much of your time trying to get that if it's a hassle. |
Also, if you get the working to not-working transition can you please snag both sides of the conversation on the failing server as well - I think that'll paint the complete picture. |
Yes - trying for both sides - capturing a lot and then occasionally clearing it when it hasn't failed yet. |
Got one - packaging up the pcaps - this one is a little bigger than the last one - was running for a while. |
Awesome - last one I had to pre-filter because wireshark was crashing on me :-) |
Yeah - it's hard to catch a semi-random event when you're tcpdumping. You'll need to pre-filter this as well. :-) |
We're seeing the same behaviour in our cluster. One of the nodes doesn't respond anymore: calls to /agent are fine, but calls to /catalog just hang forever. Our cluster is much smaller, but we're querying the API quite a bit. We're running the Prometheus Consul exporter, which polls the API quite often, and our own gateway, which uses Consul as well. I turned on debug-level logging, but the only thing I could see is that the HTTP health checks are no longer being processed. Let me know if I can help debug this! @darron, could it be that you're also using the Prometheus Consul exporter? |
Hi @bastichelaar thanks for offering to help debug this (and sorry for the late reply)! Here's what we know so far:
I merged those fixes, along with some minimal cherry picks to get the build working, to a branch based on the 0.5.2 tag, which is here - https://github.com/hashicorp/consul/tree/v0.5.2-deadlock-patches. I ended up rolling back both memberlist (6025015f2dc659ca2c735112d37e753bda6e329d) and serf (668982d8f90f5eff4a766583c1286393c1d27f68) to versions near the 0.5.2 release date since there are protocol version bumps that you definitely don't want in there, so a build in this configuration is very close to 0.5.2, but includes the above fixes. @darron is doing some testing with this new binary and it is behaving better (shorter stall periods), but alas it's hitting the new timeouts we added so there's still something we need to figure out. If you'd like to try out this binary and/or send some logs from this happening in your configuration you can email support@hashicorp.com and CC me at james at hashicorp.com and I'll give you the build. We will keep this issue posted with what we find and with the fix once we figure it out. |
OK - to summarize where this particular issue is at: we've now had a 24 hour period where no nodes have gone deaf (unable to communicate with the master) and needed to be bounced. We haven't seen that for a long time, given we're at 900 fairly busy nodes - busy with services, DNS lookups and KV traffic. A key understanding for me was recognizing that:
Bouncing the server would resolve it for all the nodes - but that's an unsustainable strategy. There have been 3 things that have changed this status and helped to hopefully solve it:
Items 1 and 2 helped a ton - but what seems to have made it much more consistent is item 3, the Xen/EC2 fix. We had thought we had already fixed the Xen/EC2 bug, but the fix was incomplete on the Consul server nodes, and that only came to light through some excellent "reading-the-pcap-leaves" by @slackpad. Basically, the leader was sending things to the server nodes that they never received, which put them into a strange state that took a while to recover from. There's another patch to yamux, but we're going to give the current state a week and then think about rolling the new Consul binary out cluster-wide. I'll update this next week after I'm done traveling and we have more data to back up the current conclusion. |
Same here: we also applied the "Rides the Rocket" Xen/EC2 fix and we built a new version of Consul using the deadlock fixes branch (https://github.com/hashicorp/consul/tree/v0.5.2-deadlock-patches). The cluster is stable now, and the /v1/catalog API endpoint doesn't become unresponsive anymore. We're still monitoring the cluster, but it looks good! I'll keep this issue updated if we encounter any issues. |
@slackpad It looks like those patches have pretty much solved this problem for us. In the last 8 days we have had a single instance of things going deaf - but it was short and corrected itself: http://shared.froese.org/2015/0hrr9-17-16.jpg I wonder if we need to try the other Yamux patch - but pretty happy with where we are at the moment. Worth a 0.5.3 release? |
@darron thanks for the update. If things are performing well you probably don't need that last patch, as that was kind of a catch-all for any TCP errors; you've got a high enough request load that one of your streams is likely going to try to send a header and time out (the patch will time out even if it's not trying to write a header). As for 0.5.3, I think at this point we should focus on 0.6, which will have the performance fixes, too. Are you ok running the patch release for a little while longer? |
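For readers following the yamux discussion, the "time out trying to send a header" behavior comes down to putting a write deadline on the underlying connection. This is a generic sketch of that pattern, not the actual yamux patch:

```go
package netutil

import (
	"fmt"
	"net"
	"time"
)

// writeWithDeadline writes a small frame to a connection but refuses to block
// indefinitely if the peer has stopped draining the TCP stream.
func writeWithDeadline(conn net.Conn, payload []byte, timeout time.Duration) error {
	if err := conn.SetWriteDeadline(time.Now().Add(timeout)); err != nil {
		return err
	}
	_, err := conn.Write(payload)
	// Clear the deadline so later writes aren't affected.
	conn.SetWriteDeadline(time.Time{})
	if err != nil {
		return fmt.Errorf("write timed out or failed: %w", err)
	}
	return nil
}
```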
Yeah - it should be OK. I will build a package and install across my cluster. Thanks for all your help @slackpad - really appreciate it. |
@darron my pleasure - I totally appreciate all your efforts repro-ing this and mechanizing and supplying the debug data - you put in some major effort to track this down! |
830 nodes in AWS - 5 server nodes spread across 3 AZs.
Symptoms:
Couldn't read the catalog (nodes, services, or specific services) from the HTTP API - but I could read from some other endpoints with no problem.
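For illustration only (these are not the original commands from the report, just a sketch of the same kind of check against the standard HTTP API paths), the contrast between agent-local endpoints and catalog endpoints can be probed with a short client timeout:

```go
package main

import (
	"fmt"
	"net/http"
	"time"
)

func main() {
	// Agent-local endpoints kept answering while catalog endpoints hung,
	// so a short client timeout makes the difference obvious.
	client := &http.Client{Timeout: 10 * time.Second}

	endpoints := []string{
		"/v1/agent/self",       // served locally by the agent
		"/v1/catalog/nodes",    // requires a round trip to the servers
		"/v1/catalog/services", // same
	}
	for _, path := range endpoints {
		start := time.Now()
		resp, err := client.Get("http://127.0.0.1:8500" + path)
		if err != nil {
			fmt.Printf("%-22s error after %s: %v\n", path, time.Since(start), err)
			continue
		}
		resp.Body.Close()
		fmt.Printf("%-22s %s in %s\n", path, resp.Status, time.Since(start))
	}
}
```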
The cluster was quite calm during this time period:
http://shared.froese.org/2015/fgfti-12-20.jpg
No leadership transitions at all - nothing to note.
Looking through the logs - it looked like the node actually blocked for 30 minutes and only unblocked after a connection timed out:
Full debug logs are posted here - look from just before 17:00 until 17:12 and then from there until 17:43: https://gist.github.com/darron/d7337cafb1a7b8640abd
It looks like it blocked on a failed connection to 10.99.179.149, and then once it timed out it finally showed the errors from the failed queries - and allowed subsequent queries to work. If you look at the quantity of the logs, when the server is 'blocked' from responding there isn't the usual chatter.
That server is very close:
I didn't check during the event - but there's a very low probability they were completely disconnected for 30 minutes.
This was a really nice long outage on a machine that didn't impact production - but we're starting to see many, many of these sorts of transient blocks across our cluster.
Is this something that's known?
I looked through the other issues and didn't see anything that looked like this.
Are there any other details I could grab the next time this happens - it's happening a fair amount in our cluster - but this was the longest outage we had seen.
I tried to strace - but there was so much output from all the threads it wasn't very useful.