# Troubleshooting

Problems with [Api] commands failing are often RBAC or naming related, and can be identified either in error codes or logs. Some common problems and solutions are explored herein.

See [[observability#adding-logs]] for how to set up a tracing subscriber so that `kube` logs get printed.
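
As a minimal sketch (assuming the `tracing-subscriber` crate with its `env-filter` feature), a subscriber that honours `RUST_LOG` can be installed like this:

```rust
// Minimal sketch: install a fmt subscriber that prints logs according to RUST_LOG
// (assumes tracing-subscriber with the "env-filter" feature enabled).
use tracing_subscriber::EnvFilter;

fn main() {
    tracing_subscriber::fmt()
        .with_env_filter(EnvFilter::from_default_env()) // e.g. RUST_LOG=kube=debug
        .init();
    // ... rest of the application
}
```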

## Access

Access issues are a result of misconfigured access or lacking [[manifests#RBAC]], and will bubble up as a [kube::Error::Api](https://docs.rs/kube/latest/kube/enum.Error.html#variant.Api) where the underlying error code will contain a `403`.

In print, they look something like this:

```sh
ErrorResponse { status: "Failure", message: "documents.kube.rs \"samuel\" is forbidden: User \"system:serviceaccount:default:doc-controller\" cannot patch resource \"documents\" in API group \"kube.rs\" in the namespace \"default\"", reason: "Forbidden", code: 403 }
```
And they should be visible directly provided you are actually printing your error objects somewhere (rather than discarding them).
If you turn up logging to `RUST_LOG=kube=debug` you should also see most errors internally.
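
A quick way to verify such permissions out-of-band is `kubectl auth can-i` with impersonation; a sketch using the identity and resource from the error above:

```sh
# can the controller's service account patch documents? (names taken from the 403 above)
kubectl auth can-i patch documents.kube.rs \
    --as=system:serviceaccount:default:doc-controller -n default
```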
### Watcher Errors

A [watcher] will expose [watcher::Error] as the error part of its `Stream` items. If these errors are discarded, it might lead to a continuously failing and retrying program.

!!! warning "Watcher errors are soft errors"

    A watcher will retry on all failures (including 403s, network failures) because the watcher can recover if external circumstances improve (for instance by an admin tweaking a `Role` object, or the network improving). These errors are therefore often __optimistically ignored__, but they should never be __silently ignored__.

When matching on items from the stream and printing the errors, they can look something like:
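
```sh
# illustrative output: the same ErrorResponse payload as in the kube_runtime log line below
Api(ErrorResponse { status: "Failure", message: "documents.kube.rs is forbidden: User \"system:serviceaccount:default:doc-controller\" cannot watch resource \"documents\" in API group \"kube.rs\" at the cluster scope", reason: "Forbidden", code: 403 })
```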

If you are not printing the watcher errors yourself, you can get them via logs from `kube_runtime`:

```sh
WARN kube_runtime::watcher: watcher error 403: Api(ErrorResponse { status: "Failure", message: "documents.kube.rs is forbidden: User \"system:serviceaccount:default:doc-controller\" cannot watch resource \"documents\" in API group \"kube.rs\" at the cluster scope", reason: "Forbidden", code: 403 })
```

#### Watcher Error Handling

The __easiest__ error handling setup is to tear down the application on any errors by (say) passing stream errors through a `try_for_each` (ala [pod_watcher](https://github.com/kube-rs/kube/blob/5813ad043e00e7b34de5e22a3fd983419ece2493/examples/pod_watcher.rs#L26-L33)) or a `try_next` loop (ala [event_watcher](https://github.com/kube-rs/kube/blob/5813ad043e00e7b34de5e22a3fd983419ece2493/examples/event_watcher.rs#L39-L43)).
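
A minimal sketch of this pattern (using a namespaced `Api<Pod>` purely for illustration; the linked examples follow the same shape):

```rust
// Minimal sketch of the tear-down approach: any watcher error terminates the
// stream and bubbles up through `?`, crashing the app (and triggering a restart
// when deployed in-cluster).
use futures::TryStreamExt;
use k8s_openapi::api::core::v1::Pod;
use kube::{
    runtime::{watcher, WatchStreamExt},
    Api, Client, ResourceExt,
};

#[tokio::main]
async fn main() -> anyhow::Result<()> {
    let client = Client::try_default().await?;
    let pods: Api<Pod> = Api::default_namespaced(client);
    watcher(pods, watcher::Config::default())
        .applied_objects()
        .try_for_each(|p| async move {
            println!("saw: {}", p.name_any());
            Ok(())
        })
        .await?;
    Ok(())
}
```
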
!!! note "Crashing in-cluster"

    If you are deployed in-cluster, don't be afraid to exit early on errors you don't expect. Exits are easier to handle than a badly running app in a confusing state. By crashing, you get [retry with backoff](https://kubernetes.io/docs/concepts/workloads/pods/pod-lifecycle/#restart-policy) for free, plus you often get alerts such as [KubePodCrashLooping](https://runbooks.prometheus-operator.dev/runbooks/kubernetes/kubepodcrashlooping/) triggering (without instrumentation needed).

While easy, early exits are not the best solution;

- __Locally__, having a CLI abruptly exit is a bad user experience.
- __In-cluster__, frequent restarts of a large app with many spurious non-fatal conditions can mask underlying problems.

For controllers with multiple watchers, [[observability#Adding Metrics]] is customary, so that you can alert on percentage error rates over a time span.
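
For instance, a hypothetical sketch using the `prometheus` crate (the metric name is illustrative; `docs` and `conf` are the handles used in the snippet below), counting errors instead of discarding them:

```rust
// Hypothetical sketch (inside an async fn returning anyhow::Result<()>):
// count watcher errors in a prometheus counter so persistent error rates can be alerted on.
use futures::StreamExt;
use prometheus::{register_int_counter, IntCounter};

let watch_errors: IntCounter =
    register_int_counter!("doc_controller_watch_errors_total", "watcher errors seen")?;

let mut stream = std::pin::pin!(watcher(docs, conf).applied_objects());
while let Some(item) = stream.next().await {
    match item {
        Ok(_doc) => { /* handle the applied object */ }
        Err(e) => {
            watch_errors.inc(); // record the soft error instead of discarding it
            tracing::warn!("watcher error: {e}");
        }
    }
}
```
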
It is also common to check for **blocker errors** up-front before starting an infinite watch stream;

1. did you install the crd before trying to start a watcher? do a naked list first as a sanity check:

```rust
// assumption: `docs` is the Api<Document> handle and `conf` the watcher config from context
if let Err(e) = docs.list(&ListParams::default().limit(1)).await {
    error!("CRD is not queryable; {e:?}. Is the CRD installed?");
    info!("Installation: cargo run --bin crdgen | kubectl apply -f -");
    std::process::exit(1);
}
watcher(docs, conf).try_for_each(|_| future::ready(Ok(()))).await?;
```
This is a particularly common error case since CRD installation is often managed out-of-band with the application and thus often neglected.
### Request Inspection