Skip to content
This repository has been archived by the owner on Sep 30, 2024. It is now read-only.

Race-condition in Consul KV updates during failover #1273

Closed
timvaillancourt opened this issue Nov 19, 2020 · 9 comments
Closed

Race-condition in Consul KV updates during failover #1273

timvaillancourt opened this issue Nov 19, 2020 · 9 comments

Comments

@timvaillancourt
Copy link
Contributor

timvaillancourt commented Nov 19, 2020

👋 recently we encountered a key-value-update race condition under Orchestrator release v3.1.4 using a setup that stores key-values for the current MySQL master in Consul

Our setup has Orchestrator deployed in Raft mode with KV updates sent to Consul on topology changes. Our MySQL load balancer reads from the same Consul server(s) and updates HaProxy backend configs (via github.com/hashicorp/consul-template) when KVs are changed in Consul, causing a restart of HaProxy - which is used to direct our MySQL traffic. This setup is outlined by @shlomi-noach in this blog post: https://github.blog/2018-06-20-mysql-high-availability-at-github/

Currently go/kv/consul.go uses 4 x separate Consul "put" calls to update 4 x different keys in Consul without using an atomic Consul transaction

In the problem scenario:

  1. A cluster under Orchestrator was issued a graceful-master-takeover to a new Primary
  2. Orchestrator updated some but not all of the 4 x Consul KVs
  3. Consul-template on our load balancer sees a change with the partially-updated KVs, triggering a restart of HaProxy
    1. All HaProxy process saw a Consul KV change and restarted during the KV changes, however not all got a consistent config
  4. HaProxy ran a config built from the partially-updated Consul KVs, causing some queries to route to the demoted Primary
  5. Orchestrator completed updating all 4 x Consul keys
    1. For whatever reason, consul-template didn't trigger subsequent reloads - a theory is it was "busy" when the final updates came in

The timing of this scenario is under 1-3 seconds. We have many HaProxy instances in each site and this race-condition happens on only a portion of the proxies, somewhat intermittently, perhaps 1 in every 10-20 failovers. This situation is remedied by manually triggering a reload of the Load Balancer once the Consul KVs are consistently updated by Orchestrator

I would like to propose the solution of adding a new, optional KVStore interface that uses a Consul transaction to update the KVs atomically in a single operation. The reason for this is this removes the race condition in the update of Consul KVs. It also is more efficient to perform a single HTTP call to Consul

The drawback of this solution is the Consul client used in Orchestrator (https://github.com/armon/consul-api) is too out-of-date to support Consul Transactions. This client library is deprecated and has seen no updates in 6-7 years, recommending that users switch to the official Consul client library:

DEPRECATED Please use consul api package instead. Godocs for that package are here.

Moving towards using the updated, official Consul client would allow Transactions to be used, resolving this race condition. This would also help Orchestrator move away from a stale, deprecated library. However, this would require the official Consul client to be integrated, something I'm happy to tackle myself in some PR(s)

My hypothetical approach to that PR would be:

  1. Add the official Consul client to vendor/
  2. Create a new KVStore interface that uses Consul Transactions for KV changes
    1. Default to current, no-transaction code
    2. A new interface allows users to use the "old", no-transaction code if desired
      1. Very-old versions of Consul don't support this API, so opting into this support has the lowest impact
    3. Add option to use Consul Transaction-based KVStore interface for Consul communications
  3. In a subsequent PR/version, update the non-transaction code to also use the official Consul client
  4. Remove deprecated github.com/armon/consul-api from the vendor/ dir
  5. Profit 🎉

cc @shlomi-noach for thoughts on this approach

@shlomi-noach
Copy link
Collaborator

Hi @timvaillancourt 👋 thanks for the elaborate analysis. I can see the need for a transaction. OK, I'm open to this -- at the time I tried using the "real" consul client and that required an unreasonable amount of dependencies. If you can make this work, please go for it. I have recollection of someone already suggesting the same in the past, but my search yields nothing.

Thank you.

@timvaillancourt
Copy link
Contributor Author

I'm open to this -- at the time I tried using the "real" consul client and that required an unreasonable amount of dependencies

@shlomi-noach thanks for reviewing this! Agreed, there is still a shocking number of dependencies for the official client and a lot of them don't seem to be actually needed 👎. This creates a big, ugly diff to vendor/

I tried finding an up-to-date, but unofficial library with less dependencies but I wasn't able to find one. Another option is creating an Consul client from scratch in this repo. I guess the drawback there is maintaining it. Any preference?

@shlomi-noach
Copy link
Collaborator

Another option is creating an Consul client from scratch in this repo. I guess the drawback there is maintaining it.

Let;s avoid creating a new client, and see if we can perhaps strip down some of the functionaility + dependencies of the original?

For reference: #320

@timvaillancourt
Copy link
Contributor Author

timvaillancourt commented Nov 19, 2020

Another option is creating an Consul client from scratch in this repo. I guess the drawback there is maintaining it.

Let;s avoid creating a new client, and see if we can perhaps strip down some of the functionaility + dependencies of the original?

@shlomi-noach that sounds good 👍. From my testing think the dependencies govendor is adding aren't 100% required. I suspect it's reading imports for _test.go or unused files, etc

@shlomi-noach
Copy link
Collaborator

OK cool, let's try to exclude the original _test.go files, and see what dependencies we can strip out

@timvaillancourt
Copy link
Contributor Author

OK cool, let's try to exclude the original _test.go files, and see what dependencies we can strip out

@shlomi-noach a draft PR is ready here: #1276

I was able to reduce the deps from 1500+ files to 500, but it's still a huge list of dependencies. I haven't deleted _test.go files in case it confuses govendor, just full repos that the code will still compile without. I'm curious if you see any places to drop more deps?

@tomkrouper
Copy link
Collaborator

@shlomi-noach, kindly ping

@shlomi-noach
Copy link
Collaborator

Whoops! Dropped this. Looking into.

@shlomi-noach
Copy link
Collaborator

closed by #1276

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants