Need to guard creation of new TCP connections in pool.go/acquire() #1165

Closed
slackpad opened this issue Aug 10, 2015 · 3 comments
Comments

@slackpad (Contributor)

While debugging #1154 we noticed that under heavy client request loads there can be hundreds of concurrent calls into pool.go/acquire(). The underlying getNewConn() function adds only one of the resulting connections to the pool in a safe way, causing the others to get canceled, but this is hugely wasteful. In the case of #1154, close to 700 useless TCP connections were spun up and killed in one instance.
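
To make the shape of the race concrete, here is a minimal Go sketch of the unguarded pattern, assuming simplified stand-in names (ConnPool, Conn, getNewConn, NewConnPool) and folding the lose-the-race cleanup into acquire() for brevity; it is not the actual pool.go code:

```go
package pool

import (
	"net"
	"sync"
)

// Conn and ConnPool are simplified stand-ins for the structures in pool.go,
// used only to illustrate the race; this is not the actual Consul code.
type Conn struct {
	conn net.Conn
}

func (c *Conn) Close() error { return c.conn.Close() }

type ConnPool struct {
	sync.Mutex
	pool map[string]*Conn
}

func NewConnPool() *ConnPool {
	return &ConnPool{pool: make(map[string]*Conn)}
}

// getNewConn dials a brand-new TCP connection to addr.
func (p *ConnPool) getNewConn(addr string) (*Conn, error) {
	nc, err := net.Dial("tcp", addr)
	if err != nil {
		return nil, err
	}
	return &Conn{conn: nc}, nil
}

// acquire returns a pooled connection for addr, dialing a new one on a miss.
// Nothing serializes the dial itself, so under heavy load many goroutines can
// pass the pool check for the same address and each spin up a TCP connection;
// only one is kept and the rest are closed.
func (p *ConnPool) acquire(addr string) (*Conn, error) {
	p.Lock()
	if c, ok := p.pool[addr]; ok {
		p.Unlock()
		return c, nil
	}
	p.Unlock()

	c, err := p.getNewConn(addr) // hundreds of concurrent dials can start here
	if err != nil {
		return nil, err
	}

	p.Lock()
	defer p.Unlock()
	if existing, ok := p.pool[addr]; ok {
		c.Close() // lost the race: this connection is immediately thrown away
		return existing, nil
	}
	p.pool[addr] = c
	return c, nil
}
```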

@darron (Contributor) commented Aug 10, 2015

Could this also be why we sometimes see nodes that respond to a watch for some KV data (a prefix watch), download some of the content, and then are unable to query the KV store, ending up "locked out"?

They don't get all of the data, and subsequent KV queries on those nodes go unanswered. Restarting the Consul agent on those same nodes usually allows the KV queries to succeed again.

NOTE: there are about 100 JSON fragments underneath the prefix (not even 1 MB of data in total), and usually only one fragment changes at a time.

@slackpad (Contributor, Author)

Only one of these connections "wins" and is actually used to send queries, so nobody ends up with a dead connection or one that gets immediately closed. I suspect, though, that the huge amount of network activity this generates could cause packet loss and/or leader slowness, and then the subsequent client-side lockup we were able to reproduce in #1154 once too many yamux streams are outstanding and it starts to block.

@slackpad (Contributor, Author)

Talked about this with @armon - we probably want to lock per-address rather than use one big lock for the whole pool. Otherwise, a slow connect to a remote DC could hold up acquiring a connection to a local server.
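
A minimal sketch of that per-address guard, in the same simplified terms as the sketch above (the limit map and the lead/waiter channel here are illustrative assumptions, not the exact mechanism in the eventual fix): the first goroutine to miss the pool for an address becomes the lead and dials, later arrivals for that same address wait on a channel and then retry, and different addresses never block each other.

```go
package pool

import (
	"net"
	"sync"
)

type Conn struct{ conn net.Conn }

type ConnPool struct {
	sync.Mutex
	pool  map[string]*Conn
	limit map[string]chan struct{} // at most one outstanding connect per address
}

func NewConnPool() *ConnPool {
	return &ConnPool{
		pool:  make(map[string]*Conn),
		limit: make(map[string]chan struct{}),
	}
}

func (p *ConnPool) getNewConn(addr string) (*Conn, error) {
	nc, err := net.Dial("tcp", addr)
	if err != nil {
		return nil, err
	}
	return &Conn{conn: nc}, nil
}

// acquire lets the first goroutine that misses the pool for an address become
// the "lead" and dial; everyone else arriving for that address waits until the
// lead finishes instead of dialing its own TCP connection. Because the guard
// is keyed by address, a slow connect to a remote DC does not hold up
// acquiring a connection to a local server.
func (p *ConnPool) acquire(addr string) (*Conn, error) {
	for {
		p.Lock()
		if c := p.pool[addr]; c != nil {
			p.Unlock()
			return c, nil
		}

		wait, inflight := p.limit[addr]
		if !inflight {
			// No outstanding connect for this address: become the lead.
			done := make(chan struct{})
			p.limit[addr] = done
			p.Unlock()

			c, err := p.getNewConn(addr)

			p.Lock()
			delete(p.limit, addr)
			if err == nil {
				p.pool[addr] = c
			}
			p.Unlock()
			close(done) // wake up any goroutines waiting on this address
			return c, err
		}

		// Someone else is already dialing this address: wait for it to finish,
		// then retry the pool lookup (if the lead's dial failed, one of the
		// waiters becomes the new lead on the next pass).
		p.Unlock()
		<-wait
	}
}
```

This is the same "threads wait for any outstanding connect to finish" behavior described by the commit referenced below, just sketched with made-up field names.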

slackpad added a commit that referenced this issue Aug 13, 2015
Fixes #1165 by having threads wait for any outstanding connect to finish.