Skip to content

Commit

Permalink
Fix config naming and add tests for new CA configs
Browse files Browse the repository at this point in the history
  • Loading branch information
banks committed Jan 18, 2019
1 parent fea202d commit 411b239
Show file tree
Hide file tree
Showing 4 changed files with 24 additions and 16 deletions.
4 changes: 2 additions & 2 deletions agent/cache-types/connect_ca_leaf.go
Original file line number Diff line number Diff line change
Expand Up @@ -21,13 +21,13 @@ const ConnectCALeafName = "connect-ca-leaf"

// caChangeJitterWindow is the time over which we spread each round of retries
// when attempting to get a new certificate following a root rotation. It's
// selected to be a trade-off between not making rotation unneccesarily slow on
// selected to be a trade-off between not making rotation unnecessarily slow on
// a tiny cluster while not hammering the servers on a huge cluster
// unnecessarily hard. Servers rate limit to protect themselves from the
// expensive crypto work, but in practice have 10k+ RPCs all in the same second
// will cause a major disruption even on large servers due to downloading the
// payloads, parsing msgpack etc. Instead we pick a window that for now is fixed
// but later might be either user configurable (not nice since it because
// but later might be either user configurable (not nice since it would become
// another hard-to-tune value) or set dynamically by the server based on it's
// knowledge of how many certs need to be rotated. Currently the server doesn't
// know that so we pick something that is reasonable. We err on the side of
Expand Down
16 changes: 12 additions & 4 deletions agent/config/runtime_test.go
Original file line number Diff line number Diff line change
Expand Up @@ -3004,8 +3004,10 @@ func TestFullConfig(t *testing.T) {
"connect": {
"ca_provider": "consul",
"ca_config": {
"RotationPeriod": "90h",
"LeafCertTTL": "1h"
"rotation_period": "90h",
"leaf_cert_ttl": "1h",
"csr_max_per_second": 100,
"csr_max_concurrent": 2
},
"enabled": true,
"proxy_defaults": {
Expand Down Expand Up @@ -3552,6 +3554,10 @@ func TestFullConfig(t *testing.T) {
ca_config {
rotation_period = "90h"
leaf_cert_ttl = "1h"
# hack float since json parses numbers as float and we have to
# assert against the same thing
csr_max_per_second = 100.0
csr_max_concurrent = 2.0
}
enabled = true
proxy_defaults {
Expand Down Expand Up @@ -4204,8 +4210,10 @@ func TestFullConfig(t *testing.T) {
ConnectSidecarMaxPort: 9999,
ConnectCAProvider: "consul",
ConnectCAConfig: map[string]interface{}{
"RotationPeriod": "90h",
"LeafCertTTL": "1h",
"RotationPeriod": "90h",
"LeafCertTTL": "1h",
"CSRMaxPerSecond": float64(100),
"CSRMaxConcurrent": float64(2),
},
ConnectProxyAllowManagedRoot: false,
ConnectProxyAllowManagedAPIRegistration: false,
Expand Down
6 changes: 3 additions & 3 deletions website/source/docs/agent/options.html.md
Original file line number Diff line number Diff line change
Expand Up @@ -878,11 +878,11 @@ default will automatically work with some tooling.
service instances before the time it takes to rotate is impacted. For
larger deployments we recommend increasing this based on the expected
number of server instances and server resources, or use
`csr_max_concurrency` instead if servers have more than one core.
`csr_max_concurrent` instead if servers have more than one core.
Setting this to zero disables rate limiting. Added in 1.4.1.
* <a name="ca_csr_max_concurrency"></a><a
href="#ca_csr_max_concurrency">`csr_max_concurrency`</a> Sets a limit
* <a name="ca_csr_max_concurrent"></a><a
href="#ca_csr_max_concurrent">`csr_max_concurrent`</a> Sets a limit
on how many Certificate Signing Requests will be processed
concurrently. Defaults to 0 (disabled). This is useful when you have
more than one or two cores available to the server. For example on an
Expand Down
14 changes: 7 additions & 7 deletions website/source/docs/guides/performance.html.md
Original file line number Diff line number Diff line change
Expand Up @@ -180,17 +180,17 @@ cluster. Typically these operations are fast on modern hardware, however when
the CA is changed or it's key rotated, the leader will face an influx of
requests for new certificates for every service instance running.

While the client agents smear these over 30 seconds to avoid an immediate
thundering herd, they don't have enough information to tune that smearing based
on the number of certificates in use in the cluster so picking longer smearing
results in artificially slow rotations for small clusters.
While the client agents distribute these randomly over 30 seconds to avoid an
immediate thundering herd, they don't have enough information to tune that
period based on the number of certificates in use in the cluster so picking
longer smearing results in artificially slow rotations for small clusters.

Smearing requests over 30s is sufficient to bring RPC load to a reasonable level
in all but the very largest clusters, but the extra CPU load from cryptographic
operations could impact the server's normal work. To limit that, Consul since
1.4.1 exposes two ways to limit the impact Certificate signing has on the leader
[`csr_max_per_second`](/docs/agent/options.html#ca_csr_max_per_second) and
[`csr_max_concurrency`](/docs/agent/options.html#csr_max_concurrency).
[`csr_max_concurrent`](/docs/agent/options.html#ca_csr_max_concurrent).

By default we set a limit of 50 per second which is reasonable on modest
hardware but may be too low and impact rotation times if more than 1500 service
Expand All @@ -204,12 +204,12 @@ process?

For larger production deployments, we generally recommend multiple CPU cores for
servers to handle the normal workload. With four or more cores available, it's
simpler to limit signing CPU impact with `csr_max_concurrency` rather than tune
simpler to limit signing CPU impact with `csr_max_concurrent` rather than tune
the rate limit. This effectively sets how many CPU cores can be monopolized by
certificate signing work (although it doesn't pin that work to specific cores).
In this case `csr_max_per_second` should be disabled (set to `0`).

For example if you have an 8 core server, setting `csr_max_concurrency` to `1`
For example if you have an 8 core server, setting `csr_max_concurrent` to `1`
would allow you to process CSRs as fast as a single core can (which is likely
sufficient for the very large clusters), without consuming all available
CPU cores and impacting normal server work or stability.

0 comments on commit 411b239

Please sign in to comment.