Need to guard creation of new TCP connections in pool.go/acquire() #1165
Comments
Could this also be why we sometimes see nodes that respond to a watch for some KV data (a prefix watch), download some of the content, and then are unable to query the KV store and are "locked out"? They don't get all of the data, and KV queries are subsequently deaf on those nodes. A restart of the Consul agent on those same nodes usually allows the KV queries to succeed. NOTE: about 100 JSON fragments underneath the prefix (not even 1MB of data in total), and usually only 1 fragment changes at a time.
Only one of these connections "wins" and is actually used to send queries, so nobody gets a dead one or one that gets immediately closed. I suppose the huge amount of network activity related to this could cause packet loss and/or leader slowness, and then the subsequent client-side lockup that we were able to produce on #1154 when there are too many streams outstanding in yamux and it starts to block.
Talked about this with @armon - we probably want to lock per-address rather than one big lock for the pool. Otherwise, a slow connect to a remote DC could hold up a connection to a local server.
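For illustration only, here is a minimal sketch of what per-address locking could look like. This is not Consul's actual pool.go; the `connPool`, `getLock`, and `acquire` names and signatures are invented for the example. The idea is simply that a slow dial only blocks other callers targeting the same address.

```go
package pool

import (
	"net"
	"sync"
	"time"
)

// connPool is a hypothetical pool that serializes connection creation
// per destination address instead of behind one global lock.
type connPool struct {
	mu    sync.Mutex             // protects conns and locks maps
	conns map[string]net.Conn    // established connections, keyed by address
	locks map[string]*sync.Mutex // one creation lock per address
}

// getLock returns the per-address creation lock, creating it on first use.
func (p *connPool) getLock(addr string) *sync.Mutex {
	p.mu.Lock()
	defer p.mu.Unlock()
	if l, ok := p.locks[addr]; ok {
		return l
	}
	l := new(sync.Mutex)
	p.locks[addr] = l
	return l
}

// acquire returns a pooled connection, dialing at most once at a time per
// address; concurrent callers for the same address wait on its lock, while
// callers for other addresses proceed independently.
func (p *connPool) acquire(addr string) (net.Conn, error) {
	// Fast path: reuse an existing connection.
	p.mu.Lock()
	if c, ok := p.conns[addr]; ok {
		p.mu.Unlock()
		return c, nil
	}
	p.mu.Unlock()

	// Slow path: serialize creation for this address only.
	l := p.getLock(addr)
	l.Lock()
	defer l.Unlock()

	// Re-check after taking the lock: another caller may have connected.
	p.mu.Lock()
	if c, ok := p.conns[addr]; ok {
		p.mu.Unlock()
		return c, nil
	}
	p.mu.Unlock()

	c, err := net.DialTimeout("tcp", addr, 10*time.Second)
	if err != nil {
		return nil, err
	}
	p.mu.Lock()
	p.conns[addr] = c
	p.mu.Unlock()
	return c, nil
}
```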
Fixes #1165 by having threads wait for any outstanding connect to finish.
While debugging #1154 we noticed that under heavy client request loads, there can be hundreds of calls to pool.go/acquire(). The underlying getNewConn() function will add one of these to the pool in a safe way, causing the others to get canceled, but this is hugely wasteful. In the case of #1154 there were close to 700 useless TCP connections spun up and killed in one instance.