connect: tame thundering herd of CSRs on CA rotation #5228
Conversation
…s; handle CA rotations gracefully with jitter and backoff-on-rate-limit in client
  reply.CertPEM = leaf
  reply.ValidAfter = time.Now().Add(-1 * time.Hour)
  reply.ValidBefore = time.Now().Add(11 * time.Hour)
- reply.CreateIndex = atomic.AddUint64(&idx, 1)
+ reply.CreateIndex = cIdx
These changes are necessary because we used to keep signing with the first CA key, but that breaks blocking behaviour now that we detect the CA is different from the expected one in the final call. Now we have to model the change more accurately.
// next blocking query will _immediately_ see the new root which means it
// needs to correctly notice that it is not the same one that generated the
// current cert and start the rotation. This is good, just not obvious that
// the behavior is actually well tested here when it is.
This is inaccurate now. I'll remove it.
// The retrying state is in State so we need to still update that in the
// entry even if we don't have an actual result yet (e.g. hit a rate limit
// on first request for a leaf certificate).
newEntry.State = result.State
I think this change is fine, but IIRC I ended up not actually relying on it by simplifying the initial-fetch-hits-rate-limit case to just return the error to the client. I'll leave this though, as it's reasonable behaviour that could be useful in other cases.
@banks Haven't reviewed the code yet, but I think not having the config magic to figure out how many cores you are running on is best. Too much magic will lead to confusion. Additionally, having a certain number of cores is becoming a worse indicator of compute power, especially when running in a cloud or container. In both of those environments it's possible, and probably likely, that you'll see 4 cores but then be artificially restricted by either the hypervisor or container runtime from taking full advantage of those resources. Therefore core count is not a reliable way of determining how much compute power a given Consul instance has available to it. Even on private/non-containerized infrastructure it is likely that many services have to share the CPU resources of a single piece of hardware. As we cannot account for all the variables in any sane way, I think picking defaults that should work for most small/medium sized clusters and then letting those with large clusters tune them for their usage is best.
It's not being used as a measure of compute power though - only as a way to limit the ability to consume all available compute power. Even if you are running 16 tiny ARM cores, it's still better to say "go as fast as you can provided you don't use more than 4 of my cores up doing the rotation". It might be slower than one core on a Xeon server, but that really doesn't matter - it's about avoiding CPU saturation, not raw speed! I don't necessarily disagree overall, especially on the confusion/magic side, but does it change your point of view if you consider that this is just resource limitation, not measuring power? Even if the 4 cores are shared, or hyperthreads, or..., the fact that we limit to consuming less than all of them is the important part (and better than trying to make operators play the tuning game of an absolute rate IMO).
@banks Take the Docker scenario. Your machine may have 16 cores but your container gets pinned to exactly 2. So limiting to 2 cores still allows CSR signing to use 100% of the container's CPU resources. The existence of a core does not mean that the core is available to you, and therefore you can't use core count to determine how many concurrent CSRs could consume all CPU resources.
Yeah that's a very good point. Fair enough, nothing to change then!
This looks good and it makes sense why it would help. At some point we may need to address the issue on the clients and find a way to prevent them from all making requests right away. This will help alleviate problems related to CPU resources on the leader, but it will not mitigate any potential issues caused by the clients actually invoking the RPCs (network bandwidth, aggregate CPU resources across the cluster, and CPU resources on the leader and other servers related to RPC decoding and forwarding).
The only other comments I have are very minor.
This PR does a few things and has been significantly more subtle than I expected!
Approach
We can't perfectly pace clients requesting CSRs because we don't know, even on the servers, how many certificates are actively being blocked on at any given time. We could come up with a mechanism to provide that feedback later as an optimisation, but it's not super simple. Using Serf cluster size (the initial idea) is not really a good enough proxy, since you might have a huge cluster with only a handful of connect-enabled services, or a medium-sized cluster with 100s of connect certificates per node.
Instead, this PR adds two knobs to the server CA config:
- csr_max_per_second: a rate limit for signing. When a CSR RPC comes in and this is non-zero, we only service it if a slot becomes free in the token bucket within 500ms. This massively smooths the overall rate without becoming an extra burden (as long-polling for a slot might). The default is 50, on the basis that this is a decent number for small to medium clusters (when combined with the 30 seconds of smearing, it won't generally be hit until you have 1500 service instances), but my laptop can process 100/second while using only ~30% of one core. We should document that it needs to be increased or disabled for large clusters in our "performance" docs.
- csr_max_concurrent: a concurrency limit for signing. When a CSR RPC comes in and this is non-zero, we block waiting on a semaphore of this size for up to 500ms.
But rate limiting the CSRs on its own is not enough - the rate limiting takes place in the RPC handler, but by that point we've already paid the cost of the connection, the network IO transmitting the CSR body, the msgpack decoding overhead, the goroutine overhead etc., so having 10k+ CSR requests come in together after a rotation is still likely to significantly affect server performance.
For this reason, we keep the initial window over which we randomly spread requests after a rotation is noticed at 30 seconds. We then use a random-windowed backoff strategy, which simulation showed to be the best balance of reducing load (i.e. few requests hit the rate limit and waste network and CPU time) while still rotating in close-to-optimal time given the number of CSRs to process and the rate limit set. The simulation can be seen here: https://github.com/banks/sim-rate-limit-backoff/blob/master/README.md.
For now the 30 second window is not configurable because it seems reasonable even for large deployments. I'd rather make it smarter and based on the actual work to be done in future, rather than give the operator more knobs that are hard to reason about.
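To make the client behaviour concrete, here is a minimal sketch of the kind of random-windowed backoff described above. It assumes a sentinel rate-limit error from the server; the names, the 10-attempt cap and the window-doubling are illustrative only, not the PR's actual strategy.

```go
package main

import (
	"errors"
	"fmt"
	"math/rand"
	"time"
)

// errRateLimited stands in for the "rate limit exceeded" error the server
// returns when the CSR rate limit is hit.
var errRateLimited = errors.New("rate limit exceeded")

// signWithBackoff smears the first attempt over an initial window and, on
// each rate-limit response, widens the window before picking another random
// delay, so a herd of clients spreads itself out rather than retrying in
// lock-step.
func signWithBackoff(signCSR func() error) error {
	window := 30 * time.Second // initial smear window after a rotation is noticed
	for attempt := 1; attempt <= 10; attempt++ {
		// Wait for a uniformly random point inside the current window.
		time.Sleep(time.Duration(rand.Int63n(int64(window))))

		err := signCSR()
		if err == nil {
			return nil
		}
		if !errors.Is(err, errRateLimited) {
			return err // other errors go straight back to the caller
		}
		window *= 2 // rate limited: widen the window so retries spread out further
	}
	return errRateLimited
}

func main() {
	err := signWithBackoff(func() error { return nil })
	fmt.Println("rotation result:", err)
}
```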
Implementation
There are lots of subtleties in the client side cache implementation and tests! #5091 helped pave the way for this implementation but there were still a few edge cases to deal with.
The server side changes are more straightforward. A new Semaphore package was added, mostly because testing it in isolation was much easier than trying to test the concurrency properties amongst the actual RPC endpoint complications. I don't anticipate it being super re-usable unless we use the same pattern on other RPC endpoints.
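For illustration, here is a rough sketch of how the two server-side gates can wrap the signing work, using golang.org/x/time/rate for the token bucket and a buffered channel as the semaphore. This is not the new Semaphore package's API or the real RPC endpoint, and the real code may apply only one of the two limits; both are shown to illustrate each mechanism.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"

	"golang.org/x/time/rate"
)

var errRateLimited = errors.New("rate limit exceeded")

// csrGate combines the two mechanisms: a token bucket (csr_max_per_second)
// and a bounded semaphore (csr_max_concurrent), each waited on for at most
// 500ms before the request is rejected as rate limited.
type csrGate struct {
	limiter *rate.Limiter
	slots   chan struct{}
}

func newCSRGate(maxPerSecond float64, maxConcurrent int) *csrGate {
	return &csrGate{
		limiter: rate.NewLimiter(rate.Limit(maxPerSecond), 1),
		slots:   make(chan struct{}, maxConcurrent),
	}
}

func (g *csrGate) sign(signCSR func() error) error {
	ctx, cancel := context.WithTimeout(context.Background(), 500*time.Millisecond)
	defer cancel()

	// Wait up to 500ms for a slot in the token bucket.
	if err := g.limiter.Wait(ctx); err != nil {
		return errRateLimited
	}
	// Wait up to 500ms for one of the concurrency slots.
	select {
	case g.slots <- struct{}{}:
		defer func() { <-g.slots }()
	case <-ctx.Done():
		return errRateLimited
	}
	return signCSR()
}

func main() {
	g := newCSRGate(50, 4)
	err := g.sign(func() error {
		time.Sleep(10 * time.Millisecond) // stand-in for the actual signing work
		return nil
	})
	fmt.Println("sign result:", err)
}
```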
In general, rate limiting and handling bursts of traffic at the servers is a problem deferred for another time as there is internal research going into the best approaches for limiting concurrency and server load to minimise service times and prevent priority inversion where raft and serf heart-beating activity starts falling behind due to tons of client RPCs competing for scheduler time. When we do implement alternative admission control policies it could help with this but it's somewhat orthogonal since this is a self-inflicted DOS we are mitigating here.
Other Changes
- Defaults were previously only handled for the consul provider, which is wrong because we have "common" configs which have defaults that should apply the same way to all providers.
- As part of this I changed the handling of ErrorUnused. I think that's reasonable and didn't break any tests, but maybe @kyhavlov can remember more context from 82a4b3c#diff-1114c027aafaece078e5770396c31503R37.
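For context, ErrorUnused is a field on mapstructure's DecoderConfig that makes decoding fail when the input map contains keys the target struct doesn't use. A minimal sketch of that behaviour; the struct and keys here are hypothetical, not Consul's actual types:

```go
package main

import (
	"fmt"

	"github.com/mitchellh/mapstructure"
)

// CommonCAConfig is a hypothetical stand-in for a shared CA provider config,
// used only to show what ErrorUnused does.
type CommonCAConfig struct {
	CSRMaxPerSecond  float32 `mapstructure:"csr_max_per_second"`
	CSRMaxConcurrent int     `mapstructure:"csr_max_concurrent"`
}

func main() {
	raw := map[string]interface{}{
		"csr_max_per_second": 50,
		"private_key":        "...", // a key only some providers understand
	}

	var cfg CommonCAConfig
	dec, err := mapstructure.NewDecoder(&mapstructure.DecoderConfig{
		Result:      &cfg,
		ErrorUnused: true, // error if any input key is not used by the struct
	})
	if err != nil {
		panic(err)
	}
	// Fails because "private_key" is not a field of CommonCAConfig; without
	// ErrorUnused the unknown key would simply be ignored.
	if err := dec.Decode(raw); err != nil {
		fmt.Println("decode failed:", err)
	}
}
```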
Open Questions
Do the defaults make sense?
I noted the rationale for 50/second as the default - low enough not to take down your cluster (unless it's super close to the edge already, or on impossibly tiny servers for its size) but large enough to have no impact on rotation time up to 1500 service instances (it still works for more, rotations just start taking longer).
Having written docs for it where I explained that if you have more than 4 CPU cores you should probably default to limiting to just 1 or 2 cores instead, I realise that we could just automate that and have the default behaviour depend on how many CPU cores we detect at runtime.
I.e. when processing the Config we could detect that no value was set (we'd have to distinguish it from the default by not putting the 50/s limit in as a default in the normal way) and then use this logic:
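Something along these lines, as a hedged sketch of the idea rather than code from this PR; the function name, the 4-core threshold and the divisor are illustrative:

```go
package main

import (
	"fmt"
	"runtime"
)

// applyCSRLimitDefaults sketches the proposed "smart default": if the
// operator set neither knob, use a core-based concurrency limit on machines
// with more than 4 cores, otherwise fall back to the 50/second rate limit.
func applyCSRLimitDefaults(maxPerSecond *float64, maxConcurrent *int) {
	if *maxPerSecond != 0 || *maxConcurrent != 0 {
		return // operator configured something explicitly; leave it alone
	}
	if cores := runtime.NumCPU(); cores > 4 {
		*maxConcurrent = cores / 4 // cap signing to a fraction of the cores
	} else {
		*maxPerSecond = 50
	}
}

func main() {
	var perSecond float64
	var concurrent int
	applyCSRLimitDefaults(&perSecond, &concurrent)
	fmt.Printf("csr_max_per_second=%v csr_max_concurrent=%v\n", perSecond, concurrent)
}
```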
The biggest downsides I see:
- Do we use -1 as a default and document that if both are -1 then we do this default logic? Or is it OK just to tell users to set them impractically high to "disable" the limit, e.g. disable the rate limit and set max concurrency to a million?
TODO