Commit

restructure HA section to talk about availability concerns + rollout issue

Signed-off-by: clux <sszynrae@gmail.com>
clux committed Apr 23, 2024
1 parent 73c51fe commit 7cca11d
Showing 3 changed files with 23 additions and 22 deletions.
33 changes: 14 additions & 19 deletions docs/controllers/availability.md
@@ -12,6 +12,7 @@ Despite the common goals often set forth for application deployments, most `kube

This is due to a couple of properties:

- Controllers are queue consumers that do not require 100% uptime to meet a 100% SLO
- Rust images are often very small and will reschedule quickly
- watch streams re-initialise quickly with the current state on boot
- [[reconciler#idempotency]] means multiple repeat reconciliations are not problematic
@@ -21,12 +22,12 @@ These properties combined create a low-overhead system that is normally quick t

That said, this setup can struggle under strong consistency requirements. Ask yourself:

- How fast do you expect your reconciler to react?
- Is `30s` P95 downtime from reschedules acceptable?
- How quickly do you expect your reconciler to **respond** to changes on average?
- Is a `30s` P95 downtime from reschedules acceptable?

## Reactivity
## Responsiveness

If you want to improve __average reactivity__, then traditional [[scaling]] and [[optimization]] strategies can help:
If you want to improve __average responsiveness__, then traditional [[scaling]] and [[optimization]] strategies can help:

- Configure controller concurrency to avoid waiting for a reconciler slot (see the sketch below)
- Optimize the reconciler, avoid duplicated work
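
To make the concurrency bullet concrete, here is a minimal sketch of capping parallel reconciliations on a `Controller` while still draining the queue on shutdown. It assumes a recent `kube` release where `Controller::with_config` and `controller::Config::concurrency` are available, plus `tokio`, `futures`, `anyhow` and `thiserror` as dependencies; the `ConfigMap` resource and the value `16` are arbitrary placeholders, not recommendations.

```rust
use std::{sync::Arc, time::Duration};

use futures::StreamExt;
use k8s_openapi::api::core::v1::ConfigMap;
use kube::{
    runtime::controller::{Action, Config, Controller},
    Api, Client,
};

#[derive(thiserror::Error, Debug)]
#[error("reconcile failed")]
struct Error;

// Placeholder reconciler: a real one does idempotent work here.
async fn reconcile(_obj: Arc<ConfigMap>, _ctx: Arc<()>) -> Result<Action, Error> {
    Ok(Action::requeue(Duration::from_secs(300)))
}

fn error_policy(_obj: Arc<ConfigMap>, _err: &Error, _ctx: Arc<()>) -> Action {
    Action::requeue(Duration::from_secs(5))
}

#[tokio::main]
async fn main() -> anyhow::Result<()> {
    let client = Client::try_default().await?;
    let cms: Api<ConfigMap> = Api::default_namespaced(client);

    Controller::new(cms, Default::default())
        // Allow up to 16 objects to be reconciled in parallel so one slow
        // object does not hold up the rest of the queue (16 is an example).
        .with_config(Config::default().concurrency(16))
        // Drain in-flight reconciliations before exiting on SIGTERM/SIGINT.
        .shutdown_on_signal()
        .run(reconcile, error_policy, Arc::new(()))
        .for_each(|res| async move {
            if let Err(e) = res {
                eprintln!("reconcile error: {e:?}");
            }
        })
        .await;
    Ok(())
}
```

Raising concurrency mostly helps when individual reconciliations spend their time waiting on external calls; it does nothing for a reconciler that is uniformly slow per object.
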
@@ -38,35 +39,29 @@ You can plot heatmaps of reconciliation times in grafana using standard [[observ

## High Availability

At a certain point, the slowdown caused by pod reschedules is going to dominate the latency metrics. Thus, having more than one replica (and having HA) is a requirement for further reducing tail latencies.
Scaling a controller beyond one replica for HA is different from scaling a regular load-balanced, traffic-receiving application.

Unfortunately, scaling a controller is more complicated than adding another replica because all Kubernetes watches are effectively unsynchronised, competing consumers that are unaware of each other.
A controller is effectively a consumer of Kubernetes watch events, and these are themselves unsynchronised event streams whose watchers are unaware of each other. Adding another pod - without some form of external locking - will result in duplicated work.

To avoid this, most controllers lean into the eventual consistency model and run with a single replica, accepting higher tail latencies due to reschedules. However, once the performance requirements are strict enough, these pod reschedules will dominate the tail of your latency metrics, making scaling necessary.

!!! warning "Scaling Replicas"

    It is not recommended to set `replicas: 2` for an [[application]] running a normal `Controller` without leaders/shards, as this will cause both controller pods to reconcile the same objects, creating duplicate work and potential race conditions.

To safely operate with more than one pod, you must have __leadership of your domain__ and wait for such leadership to be acquired before commencing.
To safely operate with more than one pod, you must have __leadership of your domain__ and wait for such leadership to be __acquired__ before commencing. This is the concept of leader election.

## Leader Election

Leader election (via [Kubernetes//Leases](https://kubernetes.io/docs/concepts/architecture/leases/)) allows having control over resources managed in-Kubernetes via Leases as distributed locking mechanisms.
Leader election gives you control over resources managed in Kubernetes by using [Leases](https://kubernetes.io/docs/concepts/architecture/leases/) as distributed locking mechanisms.

The common solution to downtime-based problems is to use the `leader-with-lease` pattern: have another controller replica in "standby mode", ready to take over immediately without stepping on the toes of the other controller pod. We can do this by creating a `Lease`, and gating on the validity of the lease before doing the real work in the reconciler.

The natural expiration of `leases` means that you are required to periodically update them while your main pod (the leader) is active. When your pod is to be replaced, you can initiate a step down (and expire the lease), say after draining your work queue after receiving a `SIGTERM`. If your pod crashes, then the lease will expire naturally (albeit likely more slowly).

<!-- this feels distracting to the main point maybe
### Defacto Leadership
When running the default 1 replica controller, you have implicitly created a `leader for life`. You never have other contenders for "defacto leadership" except for the short upgrade window:
!!! note "Unsynchronised Rollout Surges"

!!! warning "Rollout Safety for Single Replicas"
    A 1 replica controller deployment without leader election might create short periods of duplicate work and racy writes during rollouts because of how [rolling updates surge](https://docs.rs/k8s-openapi/latest/k8s_openapi/api/apps/v1/struct.RollingUpdateDeployment.html) by default.

Even with 1 replica, you might see racy writes during controller upgrades without locking/leases. A `StatefulSet` with one replica could also give you a downtime-based rolling upgrade that implicitly avoids racy writes, but it could also [require manual rollbacks](https://kubernetes.io/docs/concepts/workloads/controllers/statefulset/#forced-rollback).
Other forms of defacto leadership can come in the form of [shards](./scaling.md#Sharding), but these are generally created for [[scaling]] concerns, and suffer from the same problems during rollout and controller teardown.
-->
The natural expiration of `leases` means that you are required to periodically update them while your main pod (the leader) is active. When your pod is about to be replaced, you can initiate a step down (and expire the lease), ideally after receiving a `SIGTERM` after [draining your active work queue](https://docs.rs/kube/latest/kube/runtime/struct.Controller.html#method.shutdown_on_signal). If your pod crashes, then a replacement pod must wait for the scheduled lease expiry.
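
As an illustration of the lease-gating idea, the sketch below checks and (re)claims a `coordination.k8s.io/v1` Lease before leader-only work is allowed to proceed. It is deliberately simplified: the lease name, namespace, field manager and `15s` duration are arbitrary assumptions, and it omits the renewal loop, conflict handling and step-down logic that a real deployment needs; the third party crates below provide those pieces. It assumes `kube`, `k8s-openapi`, `chrono`, `serde_json` and `anyhow` as dependencies.

```rust
use chrono::Utc;
use k8s_openapi::{
    api::coordination::v1::Lease,
    apimachinery::pkg::apis::meta::v1::MicroTime,
};
use kube::{
    api::{Api, Patch, PatchParams},
    Client,
};
use serde_json::json;

const LEASE_NAME: &str = "my-controller-leader"; // arbitrary example name
const LEASE_DURATION_SECS: i64 = 15;

/// Try to acquire or renew the lease for `holder`; returns true if we are now the leader.
/// Simplified sketch: no renewal loop, no resourceVersion conflict handling, no step-down.
async fn try_acquire(client: Client, holder: &str) -> anyhow::Result<bool> {
    let leases: Api<Lease> = Api::namespaced(client, "default"); // assumes "default" namespace
    let now = Utc::now();

    if let Some(current) = leases.get_opt(LEASE_NAME).await? {
        let spec = current.spec.unwrap_or_default();
        let held_by_other = spec.holder_identity.as_deref() != Some(holder);
        let still_valid = spec
            .renew_time
            .map(|t| (now - t.0).num_seconds() < LEASE_DURATION_SECS)
            .unwrap_or(false);
        if held_by_other && still_valid {
            return Ok(false); // another pod holds a valid lease; stay in standby
        }
    }

    // Take over (or refresh) the lease with a server-side apply patch.
    let patch = json!({
        "apiVersion": "coordination.k8s.io/v1",
        "kind": "Lease",
        "metadata": { "name": LEASE_NAME },
        "spec": {
            "holderIdentity": holder,
            "leaseDurationSeconds": LEASE_DURATION_SECS,
            "renewTime": MicroTime(now),
        }
    });
    let pp = PatchParams::apply("my-controller").force(); // arbitrary field manager name
    leases.patch(LEASE_NAME, &pp, &Patch::Apply(&patch)).await?;
    Ok(true)
}
```

A replica would call something like `try_acquire` periodically (well within the lease duration) and only run its reconciler, or even start its watchers, while it holds the lease.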

### Third Party Crates

8 changes: 5 additions & 3 deletions docs/controllers/scaling.md
@@ -5,9 +5,10 @@ This chapter is about strategies for scaling controllers and the tradeoffs these
## Motivating Questions

- Why is the reconciler lagging? Are there too many resources being reconciled?
* How do you find out?
- What happens when your controller starts managing resource sets so large that it starts significantly impacting your CPU or memory use?
- Do you give your controller more resources?
- Do you add more pods? How can you do this safely?
* Do you give your controller more resources?
* Do you add more pods? How can you do this safely?

Scaling an efficient Rust application that spends most of its time waiting for network changes might not seem like a complicated affair, and indeed, you can scale a controller in many ways and achieve good outcomes. But in terms of costs, not all solutions are created equal; are you avoiding improving your algorithm, or are you throwing more expensive machines at the problem?

@@ -22,9 +23,10 @@ We recommend trying the following scaling strategies in order:
In other words, try to improve your algorithm first, and once you've reached a reasonable limit of what you can achieve with that approach, allocate more resources to the problem.

### Controller Optimizations
Ensure you look at common controller [[optimization]] to:
Ensure you look at common controller [[optimization]] to get the most out of your resources:

* minimize network intensive operations
* avoid caching large manifests unnecessarily, and prune unneeded data (see the pruning sketch below)
* cache/memoize expensive work
* checkpoint progress on `.status` objects to avoid repeating work
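
To illustrate the pruning point, here is a small sketch that strips data the controller never reads before objects land in a reflector store. It assumes a recent `kube` runtime where `watcher::Event::modify` and the reflector/watcher helpers are available, plus `tokio`, `futures` and `anyhow`; `Pod` and the specific fields pruned are placeholders for whatever your controller actually caches.

```rust
use futures::TryStreamExt;
use k8s_openapi::api::core::v1::Pod;
use kube::{
    runtime::{reflector, watcher, WatchStreamExt},
    Api, Client, ResourceExt,
};

#[tokio::main]
async fn main() -> anyhow::Result<()> {
    let client = Client::try_default().await?;
    let pods: Api<Pod> = Api::default_namespaced(client);

    // In a real controller the reader half would be shared with reconcilers.
    let (_reader, writer) = reflector::store::<Pod>();

    // Drop data we never use before it is cached, to keep per-object memory low.
    let pruned = watcher(pods, watcher::Config::default()).map_ok(|ev| {
        ev.modify(|pod| {
            pod.managed_fields_mut().clear();
            pod.annotations_mut()
                .remove("kubectl.kubernetes.io/last-applied-configuration");
            pod.spec = None; // placeholder: assumes only metadata + status are needed
        })
    });

    // Drive the pruned stream into the store; normally this would feed a Controller.
    reflector(writer, pruned)
        .applied_objects()
        .try_for_each(|_| async { Ok(()) })
        .await?;
    Ok(())
}
```

If your controller builds its internal stores from streams, the same transformation can be applied there; the goal is simply to avoid caching payload you never read.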

4 changes: 4 additions & 0 deletions includes/abbreviations.md
@@ -12,3 +12,7 @@
*[HA]: High Availability
*[PVC]: Persistent Volume Claim
*[RPS]: Requests Per Second
*[SLI]: Service Level Indicator
*[SLA]: Service Level Agreement
*[SLO]: Service Level Objective
*[P95]: 95th Percentile
