# Troubleshooting

Problems with [Api] commands failing are often RBAC or naming related, and can be identified either in error codes or logs. Some common problems and solutions are explored herein.

See [[observability#adding-logs]] for how to set up a tracing subscriber so that `kube` logs get printed.
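
As a minimal sketch (assuming the `tracing-subscriber` crate with its `env-filter` feature), a subscriber that honours `RUST_LOG` can be installed like this:

```rust
// Minimal sketch: install a fmt subscriber that prints logs according to RUST_LOG
// (assumes tracing-subscriber with the "env-filter" feature enabled).
use tracing_subscriber::EnvFilter;

fn main() {
    tracing_subscriber::fmt()
        .with_env_filter(EnvFilter::from_default_env()) // e.g. RUST_LOG=kube=debug
        .init();
    // ... rest of the application
}
```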

## Access

Access issues are a result of misconfigured access or lacking [[manifests#RBAC]], and will bubble up as a [kube::Error::Api](https://docs.rs/kube/latest/kube/enum.Error.html#variant.Api) where the underlying error code will contain a `403`.

In print, they look something like this:

```sh
ErrorResponse { status: "Failure", message: "documents.kube.rs \"samuel\" is forbidden: User \"system:serviceaccount:default:doc-controller\" cannot patch resource \"documents\" in API group \"kube.rs\" in the namespace \"default\"", reason: "Forbidden", code: 403 }
```
And they should be visible directly provided you are actually printing your error objects somewhere (rather than discarding them).
If you turn up logging to `RUST_LOG=kube=debug` you should also see most errors internally.
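
A quick way to verify such permissions out-of-band is `kubectl auth can-i` with impersonation; a sketch using the identity and resource from the error above:

```sh
# can the controller's service account patch documents? (names taken from the 403 above)
kubectl auth can-i patch documents.kube.rs \
    --as=system:serviceaccount:default:doc-controller -n default
```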
### Watcher Errors

A [watcher] will expose [watcher::Error] as the error part of its `Stream` items. If these errors are discarded, it might lead to a continuously failing and retrying program.

!!! warning "Watcher errors are soft errors"

    A watcher will retry on all failures (including 403s, network failures) because the watcher can recover if external circumstances improve (for instance by an admin tweaking a `Role` object, or the network improving). These errors are therefore often __optimistically ignored__, but they should never be __silently ignored__.

When matching on items from the stream and printing the errors, they can look something like:
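
```sh
# illustrative output: the same ErrorResponse payload as in the kube_runtime log line below
Api(ErrorResponse { status: "Failure", message: "documents.kube.rs is forbidden: User \"system:serviceaccount:default:doc-controller\" cannot watch resource \"documents\" in API group \"kube.rs\" at the cluster scope", reason: "Forbidden", code: 403 })
```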

If you are not printing the watcher errors yourself, you can get them via logs from `kube_runtime`:

```sh
WARN kube_runtime::watcher: watcher error 403: Api(ErrorResponse { status: "Failure", message: "documents.kube.rs is forbidden: User \"system:serviceaccount:default:doc-controller\" cannot watch resource \"documents\" in API group \"kube.rs\" at the cluster scope", reason: "Forbidden", code: 403 })
```

#### Watcher Error Handling

The __easiest__ error handling setup is to tear down the application on any errors by (say) passing stream errors through a `try_for_each` (ala [pod_watcher](https://github.com/kube-rs/kube/blob/5813ad043e00e7b34de5e22a3fd983419ece2493/examples/pod_watcher.rs#L26-L33)) or a `try_next` loop (ala [event_watcher](https://github.com/kube-rs/kube/blob/5813ad043e00e7b34de5e22a3fd983419ece2493/examples/event_watcher.rs#L39-L43)).
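
A minimal sketch of this pattern (using a namespaced `Api<Pod>` purely for illustration; the linked examples follow the same shape):

```rust
// Minimal sketch of the tear-down approach: any watcher error terminates the
// stream and bubbles up through `?`, crashing the app (and triggering a restart
// when deployed in-cluster).
use futures::TryStreamExt;
use k8s_openapi::api::core::v1::Pod;
use kube::{
    runtime::{watcher, WatchStreamExt},
    Api, Client, ResourceExt,
};

#[tokio::main]
async fn main() -> anyhow::Result<()> {
    let client = Client::try_default().await?;
    let pods: Api<Pod> = Api::default_namespaced(client);
    watcher(pods, watcher::Config::default())
        .applied_objects()
        .try_for_each(|p| async move {
            println!("saw: {}", p.name_any());
            Ok(())
        })
        .await?;
    Ok(())
}
```
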
!!! note "Crashing in-cluster"

    If you are deployed in-cluster, don't be afraid to exit early on errors you don't expect. Exits are easier to handle than a badly running app in a confusing state. By crashing, you get [retry with backoff](https://kubernetes.io/docs/concepts/workloads/pods/pod-lifecycle/#restart-policy) for free, plus you often get alerts such as [KubePodCrashLooping](https://runbooks.prometheus-operator.dev/runbooks/kubernetes/kubepodcrashlooping/) triggering (without instrumentation needed).

While easy, early exits are not the best solution;

- __Locally__, having a CLI abruptly exit is a bad user experience.
- __In-cluster__, frequent restarts of a large app with many spurious non-fatal conditions can mask underlying problems.

For controllers with multiple watchers, [[observability#Adding Metrics]] is customary, so that you can alert on percentage error rates over a time span.
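
For instance, a hypothetical sketch using the `prometheus` crate (the metric name is illustrative; `docs` and `conf` are the handles used in the snippet below), counting errors instead of discarding them:

```rust
// Hypothetical sketch (inside an async fn returning anyhow::Result<()>):
// count watcher errors in a prometheus counter so persistent error rates can be alerted on.
use futures::StreamExt;
use prometheus::{register_int_counter, IntCounter};

let watch_errors: IntCounter =
    register_int_counter!("doc_controller_watch_errors_total", "watcher errors seen")?;

let mut stream = std::pin::pin!(watcher(docs, conf).applied_objects());
while let Some(item) = stream.next().await {
    match item {
        Ok(_doc) => { /* handle the applied object */ }
        Err(e) => {
            watch_errors.inc(); // record the soft error instead of discarding it
            tracing::warn!("watcher error: {e}");
        }
    }
}
```
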
It is also common to check for **blocker errors** up-front before starting an infinite watch stream;

1. did you install the crd before trying to start a watcher? do a naked list first as a sanity check:

```rust
// assumption: `docs` is the Api<Document> handle and `conf` the watcher config from context
if let Err(e) = docs.list(&ListParams::default().limit(1)).await {
    error!("CRD is not queryable; {e:?}. Is the CRD installed?");
    info!("Installation: cargo run --bin crdgen | kubectl apply -f -");
    std::process::exit(1);
}
watcher(docs, conf).try_for_each(|_| future::ready(Ok(()))).await?;
```
This is a particularly common error case since CRD installation is often managed out-of-band with the application and thus often neglected.
### Request Inspection