
connect: tame thundering herd of CSRs on CA rotation #5228

Merged: 3 commits into master from sign-rate-limit on Jan 22, 2019

Conversation

banks (Member) commented Jan 16, 2019

This PR does a few things and has been significantly more subtle than I expected!

Approach

We can't perfectly pace clients requesting CSRs because we don't know, even on the servers, how many certificates are currently blocked waiting to be signed. We could come up with a mechanism to provide that feedback later as an optimisation, but it's not super simple. Using Serf cluster size (the initial idea) is not really a good enough proxy, since you might have a huge cluster with only a handful of Connect-enabled services, or you may have a medium-sized cluster with hundreds of Connect certificates per node.

Instead, this PR adds two knobs to the server CA config:

  • csr_max_per_second: a rate limit for signing. When a CSR RPC comes in and this is non-zero, we only service it if a slot becomes free in the token bucket within 500ms. This short wait smooths the overall rate considerably without becoming an extra burden (as polling for a slot for much longer might).
    • We default to 50 on the basis that this is a decent number for small to medium clusters (combined with the 30 seconds of smearing, it won't generally be hit until you have 1500 service instances), but my laptop can process 100/second while using only ~30% of one core. We should document in our "performance" docs that it needs to be increased or disabled for large clusters.
  • csr_max_concurrent: a concurrency limit for signing. When a CSR RPC comes in and this is non-zero, we block waiting on a semaphore of this size for up to 500ms.
    • This is off by default because on small deployments with only one core it doesn't provide anything very useful.
    • This exists because for reasonable production deployments it's not ideal to have to keep updating the rate limit as the cluster grows. By using this to limit CSR signing to only 1 or 2 concurrent calls, you can effectively ensure no more than 1 or 2 cores are consumed by CSR requests, so regular requests are not impacted, while still processing the rotation as fast as those cores can go. Our production tuning guide should recommend this option for any server with multiple cores. A rough sketch of how both knobs could gate the signing RPC follows this list.
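
Here is a minimal sketch of how the two knobs could gate the signing RPC. This is not the actual implementation in this PR; the type and function names, the use of golang.org/x/time/rate, and the channel-based semaphore are illustrative assumptions, and it assumes both knobs are set to non-zero values.

package csrlimit

import (
	"context"
	"errors"
	"time"

	"golang.org/x/time/rate"
)

// ErrRateLimited tells the client to back off and retry later.
var ErrRateLimited = errors.New("CSR rate limit reached, try again later")

type csrLimiter struct {
	lim *rate.Limiter // token bucket for csr_max_per_second
	sem chan struct{} // counting semaphore for csr_max_concurrent
}

func newCSRLimiter(maxPerSecond float64, maxConcurrent int) *csrLimiter {
	return &csrLimiter{
		lim: rate.NewLimiter(rate.Limit(maxPerSecond), 1),
		sem: make(chan struct{}, maxConcurrent),
	}
}

// signCSR waits up to 500ms for a rate-limit token and a concurrency
// slot; if either is unavailable in time it returns a retryable error
// rather than queueing the request on the server.
func (c *csrLimiter) signCSR(ctx context.Context, sign func() error) error {
	waitCtx, cancel := context.WithTimeout(ctx, 500*time.Millisecond)
	defer cancel()

	if err := c.lim.Wait(waitCtx); err != nil {
		return ErrRateLimited
	}
	select {
	case c.sem <- struct{}{}:
		defer func() { <-c.sem }()
	case <-waitCtx.Done():
		return ErrRateLimited
	}
	return sign()
}

The RPC handler would call signCSR with the actual signing work as the callback; a rate-limit error is what triggers the client-side backoff described below.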

But rate limiting the CSRs on its own is not enough. The rate limiting takes place in the RPC handler, and by that point we've already paid the cost of the connection, the network IO transmitting the CSR body, the msgpack decoding, the goroutine overhead, etc., so having 10k+ CSR requests come in together after a rotation is still likely to significantly affect server performance.

For this reason, we keep the initial window over which we randomly spread requests after a rotation is noticed at 30 seconds. We then use a random-windowed backoff strategy, which simulation showed to be the best balance of reducing load (i.e. few requests hit the rate limit and waste network and CPU time) while still rotating in close-to-optimal time given the number of CSRs to process and the rate limit set. The simulation can be seen here: https://github.com/banks/sim-rate-limit-backoff/blob/master/README.md.

For now the 30 second window is not configurable because it seems reasonable even for large deployments. I'd rather make it smarter and based on the actual work to be done in future, rather than give the operator more knobs that are hard to reason about.
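
Purely as an illustration of the shape of that strategy (the growth of the retry window here is an assumption for the sketch, not the exact scheme; see the simulation repo above for the real analysis):

import (
	"math/rand"
	"time"
)

// rotationDelay sketches the client-side smearing: the first attempt
// after noticing a rotation is spread uniformly over the 30 second
// window, and attempts that hit the server's rate limit back off to a
// random point inside a progressively larger window.
func rotationDelay(attempt int) time.Duration {
	const initialWindow = 30 * time.Second
	// Window grows with each rate-limited retry (the growth factor is
	// an assumption for illustration).
	window := initialWindow * time.Duration(attempt+1)
	return time.Duration(rand.Int63n(int64(window)))
}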

Implementation

There are lots of subtleties in the client-side cache implementation and tests! #5091 helped pave the way for this implementation, but a few edge cases still needed to be dealt with.

The server side changes are more straightforward. A new Semaphore package was added, mostly because it made testing it in isolation much easier than trying to test the concurrency properties in amongst the actual RPC endpoint complications (a generic sketch of that kind of semaphore is below). I don't anticipate it being very reusable unless we use the same pattern on other RPC endpoints.
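
For illustration, a context-aware counting semaphore along these lines is easy to exercise in isolation (a generic sketch, not the actual API of the package added in this PR):

package semaphore

import "context"

// Semaphore limits the number of concurrent holders to its capacity.
type Semaphore struct {
	slots chan struct{}
}

func New(capacity int) *Semaphore {
	return &Semaphore{slots: make(chan struct{}, capacity)}
}

// Acquire blocks until a slot is free or the context is done (e.g. the
// 500ms timeout used in the RPC handler).
func (s *Semaphore) Acquire(ctx context.Context) error {
	select {
	case s.slots <- struct{}{}:
		return nil
	case <-ctx.Done():
		return ctx.Err()
	}
}

// Release frees a previously acquired slot.
func (s *Semaphore) Release() {
	<-s.slots
}

Tests can fill all capacity slots and then assert that the next Acquire blocks until a Release or a context cancellation, without involving any RPC machinery.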

In general, rate limiting and handling bursts of traffic at the servers is a problem deferred for another time. There is internal research going into the best approaches for limiting concurrency and server load to minimise service times and prevent priority inversion, where Raft and Serf heart-beating activity starts falling behind due to tons of client RPCs competing for scheduler time. When we do implement alternative admission control policies, they could help with this, but it's somewhat orthogonal since what we are mitigating here is a self-inflicted DoS.

Other Changes

  • Fixed an issue where CA config was only being merged for the consul provider, which is wrong because we have "common" config options whose defaults should apply the same way to all providers.
  • Fixed an issue where the API seemed too strict when decoding config: the CA provider contains opaque config that is not defined in a struct, yet decoding was using ErrorUnused. I think relaxing that is reasonable and it didn't break any tests, but maybe @kyhavlov can remember more context from 82a4b3c#diff-1114c027aafaece078e5770396c31503R37

Open Questions

Do the defaults make sense?

I noted the rationale for 50/second as the default: low enough not to take down your cluster (unless it's already super close to the edge or on impossibly tiny servers for its size) but large enough to have no impact on rotation time up to 1500 service instances (it still works beyond that, rotations just start taking longer).

Having written docs for it where I explained that if you have more than 4 CPU cores you should probably default to limiting to just 1 or 2 cores instead, I realise that we could just automate that and have the default behaviour depend on how many CPU cores we detect at runtime.

I.e. when processing the config we could detect that no value was set (we'd have to distinguish that from the default by not putting the 50/s limit in as a default in the normal way) and then use logic like this:

numCPUs := runtime.NumCPU()
if !limitOrConcurrencySetExplicitly {
	if numCPUs > 4 {
		// 1 concurrent signing up to 8 cores, 2 up to 16 cores, etc.
		cfg.ConnectCACSRMaxConcurrency = (numCPUs + 7) / 8
	} else {
		// On the basis that 100/s uses about 40% of one core on my mac,
		// this is probably reasonable.
		cfg.ConnectCACSRMaxPerSecond = numCPUs * 50
	}
}

The biggest downsides I see:

  • Documenting that behaviour is more complicated, although I already explain most of the thought process, so just automating it is probably not much worse.
  • Even if we don't set defaults the normal way and instead do this during config post-parsing, I don't think we have a good way to detect that the user explicitly disabled both settings (e.g. set them to 0, since they are ints). We could add a convention that a negative value disables a limit, or set -1 as the default and document that the automatic logic only applies when both are -1. Or is it OK to just tell users to set them impractically high to "disable" the limit, e.g. disable the rate limit and set max concurrency to a million? (One option, sketched below, is to use pointer-typed config fields so that nil means "not set".)
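
For example, one common Go pattern (a hypothetical sketch, not something this PR does; the field and function names are made up) is to decode into pointer-typed fields so nil can be told apart from an explicit zero:

// Hypothetical intermediate config: pointers distinguish "not set"
// (nil) from an explicit 0 ("disabled").
type rawCAConfig struct {
	CSRMaxPerSecond  *int `json:"csr_max_per_second"`
	CSRMaxConcurrent *int `json:"csr_max_concurrent"`
}

func applyCSRDefaults(raw rawCAConfig, numCPUs int) (maxPerSecond, maxConcurrent int) {
	if raw.CSRMaxPerSecond == nil && raw.CSRMaxConcurrent == nil {
		// Neither knob set explicitly: apply the CPU-based heuristic above.
		if numCPUs > 4 {
			return 0, (numCPUs + 7) / 8
		}
		return numCPUs * 50, 0
	}
	// Respect explicit values, including an explicit 0 meaning "disabled".
	if raw.CSRMaxPerSecond != nil {
		maxPerSecond = *raw.CSRMaxPerSecond
	}
	if raw.CSRMaxConcurrent != nil {
		maxConcurrent = *raw.CSRMaxConcurrent
	}
	return maxPerSecond, maxConcurrent
}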

TODO

  • Documentation updates for the new CA configs
  • Documentation updates for the Production Tuning guide related to CSR rate limit.
  • Maybe change defaults? (Nope, see discussion.)

…s; handle CA rotations gracefully with jitter and backoff-on-rate-limit in client
reply.CertPEM = leaf
reply.ValidAfter = time.Now().Add(-1 * time.Hour)
reply.ValidBefore = time.Now().Add(11 * time.Hour)
reply.CreateIndex = atomic.AddUint64(&idx, 1)
reply.CreateIndex = cIdx
banks (Member Author) commented on the diff:

These changes are necessary because we used to keep signing with the first CA key, but that breaks blocking behaviour now that we detect the CA is different from the expected one in the final call. Now we have to model the change more accurately.

// next blocking query will _immediately_ see the new root which means it
// needs to correctly notice that it is not the same one that generated the
// current cert and start the rotation. This is good, just not obvious that
// the behavior is actually well tested here when it is.
banks (Member Author) commented on the diff:

This is inaccurate now. I'll remove it.

// The retrying state is in State so we need to still update that in the
// entry even if we don't have an actual result yet (e.g. hit a rate limit
// on first request for a leaf certificate).
newEntry.State = result.State
banks (Member Author) commented on the diff:

I think this change is fine, but IIRC I ended up not actually relying on it by simplifying the initial-fetch-hits-rate-limit case to just return the error to the client. I'll leave this though, as it's reasonable behaviour that could be useful in other cases.

@banks banks requested a review from a team as a code owner January 17, 2019 12:39
@mkeeler mkeeler requested a review from a team January 17, 2019 14:10
mkeeler (Member) commented Jan 17, 2019

@banks Haven't reviewed the code yet, but I think not having the config magic to figure out how many cores you are running on is best. Too much magic will lead to confusion. Additionally having a certain number of cores is becoming a worse indication for compute power, especially when running in a cloud or container. In both of those environments it's possible, and probably likely, that you see 4 cores but are artificially restricted by either the hypervisor or the container runtime from taking full advantage of those resources. Therefore core count is not a reliable way of determining how much compute power a given Consul instance has available to it. Even on private/non-containerized infrastructure it is likely that many services have to share the CPU resources of a single piece of hardware. As we cannot account for all the variables in any sane way, I think picking defaults that should work for most small/medium sized clusters, and then letting those with large clusters tune it for their usage, is best.

banks (Member Author) commented Jan 17, 2019

Additionally having a certain number of cores is becoming a worse indication for compute power

It's not being used as a measure of compute power though, only as a way to limit the ability to consume all available compute power. Even if you are running 16 tiny ARM cores, it's still better to say "go as fast as you can provided you don't use more than 4 of my cores doing the rotation". That might be slower than one core on a Xeon server, but it really doesn't matter: this is about avoiding CPU saturation, not raw speed!

I don't necessarily disagree overall, especially on the confusion/magic side, but does it change your point of view if you consider that this is just resource limitation, not measuring power? Even if the 4 cores are shared or hyperthreads or whatever, the fact that we limit ourselves to consuming less than all of them is the important part (and better than making operators play the tuning game with an absolute rate, IMO).

mkeeler (Member) commented Jan 17, 2019

@banks Take the Docker scenario: your machine may have 16 cores but your container gets pinned to exactly 2. So limiting to 2 cores still allows CSR signing to use 100% of the container's CPU resources. The existence of a core does not mean that the core is available to you, so you can't use core count to determine how many concurrent CSRs would consume all CPU resources.

banks (Member Author) commented Jan 17, 2019

Yeah that's a very good point. Fair enough, nothing to change then!

mkeeler (Member) left a comment:

This looks good, and it makes sense why it would help. At some point we may need to address the issue on the clients and find a way to prevent them from all making requests right away. This will help alleviate problems related to CPU resources on the leader, but it will not mitigate any potential issues caused by the clients actually invoking the RPCs (network bandwidth, aggregate CPU resources across the cluster, and CPU resources on the leader and other servers spent on RPC decoding and forwarding).

The only other comments I have are very minor.

agent/cache-types/connect_ca_leaf.go Outdated Show resolved Hide resolved
agent/config/builder.go Show resolved Hide resolved
website/source/docs/agent/options.html.md Show resolved Hide resolved
website/source/docs/guides/performance.html.md Outdated Show resolved Hide resolved
@mkeeler mkeeler added this to the 1.4.1 milestone Jan 18, 2019
@banks banks merged commit ef9f27c into master Jan 22, 2019
@banks banks deleted the sign-rate-limit branch January 22, 2019 17:19