docs(README): add ToC and a drawing
Signed-off-by: Clément Nussbaumer <clement.nussbaumer@postfinance.ch>
clementnuss committed Apr 5, 2024
1 parent 52767fb commit 99b52d8
Showing 2 changed files with 43 additions and 9 deletions.
52 changes: 43 additions & 9 deletions README.md
@@ -1,11 +1,38 @@
<!-- START doctoc generated TOC please keep comment here to allow auto update -->
<!-- DON'T EDIT THIS SECTION, INSTEAD RE-RUN doctoc TO UPDATE -->

- [Kubenurse](#kubenurse)
- [Deployment](#deployment)
- [Helm deployment](#helm-deployment)
- [Configuration settings](#configuration-settings)
- [Configuration](#configuration)
- [http Endpoints](#http-endpoints)
- [Health Checks](#health-checks)
- [API Server Direct](#api-server-direct)
- [API Server DNS](#api-server-dns)
- [Me Ingress](#me-ingress)
- [Me Service](#me-service)
- [Neighbourhood](#neighbourhood)
- [Neighbourhood filtering](#neighbourhood-filtering)
- [Neighbourhood incoming checks metric](#neighbourhood-incoming-checks-metric)
- [Metrics](#metrics)

<!-- END doctoc generated TOC please keep comment here to allow auto update -->

[![CI](https://github.com/postfinance/kubenurse/actions/workflows/release.yml/badge.svg)](https://github.com/postfinance/kubenurse/actions/workflows/release.yml)
[![Coverage Status](https://coveralls.io/repos/github/postfinance/kubenurse/badge.svg?branch=master)](https://coveralls.io/github/postfinance/kubenurse?branch=master)
![GitHub release (latest SemVer)](https://img.shields.io/github/v/release/postfinance/kubenurse)

# Kubenurse

kubenurse is a little service that monitors all network connections in a Kubernetes
cluster. Kubenurse measures request durations, records errors and exports those metrics in Prometheus format.
Kubenurse is a little service that monitors all network connections in a
Kubernetes cluster. Kubenurse measures request durations, records errors and
exports those metrics in Prometheus format.

Here's an overview of the checks performed by kubenurse, which are exposed as
labels for the various duration/error prometheus metrics.

![kubenurse request types](./doc/kubenurse.png)

## Deployment

@@ -72,7 +99,8 @@ Default tolerations:
```
After everything is set up and Prometheus scrapes the kubenurses, you can build
dashboards [as this example](./doc/grafana-kubenurse.json) that show network latencies and errors or use the metrics for alarming.
dashboards [as this example](./doc/grafana-kubenurse.json) that show network
latencies and errors or use the metrics for alarming.
![Grafana ingress view](doc/grafana_ingress.png "Grafana ingress view")
![Grafana path view](doc/grafana_path.png "Grafana path view")
@@ -157,8 +185,7 @@ The `/alive` endpoint returns a JSON like this with status code 200 if everythin

## Health Checks

Every five seconds and on every access of `/alive`, the checks described below are run.
Check results are cached for 3 seconds in order to prevent excessive network traffic.
Every five seconds, the checks described below are run.

A little illustration of what communication occurs, is here:

@@ -261,16 +288,23 @@ To bypass the node filtering feature, you simply need to set the

All performed checks expose metrics which can be used to monitor/alert:

- SDN network latencies and errors
- kubelet-to-kubelet network latencies and errors
- node-to-node network latencies and errors
- pod-to-apiserver communication
- Ingress roundtrip latencies and errors
- Service roundtrip latencies and errors (kube-proxy)
- Service roundtrip latencies and errors (kube-proxy / your CNI)
- Major kube-apiserver issues
- kube-dns (or CoreDNS) errors
- External DNS resolution errors (ingress URL resolution)

At `/metrics` you will find these:
At `/metrics` you will find the following metrics:

- `kubenurse_errors_total`: Kubenurse error counter partitioned by error type
- `kubenurse_request_duration`: a histogram for Kubenurse request duration partitioned by error type
- `kubenurse_httpclient_request_duration_seconds`: a latency histogram of request latencies from the kubenurse http client.
- `kubenurse_httpclient_trace_requests_total`: a latency histogram for the http
client _trace_ metric instrumentation, with detailed statistics for e.g.
`dns_start`, `got_conn` events, and more. the details can be seen in the
[`httptrace.go`](https://github.com/postfinance/kubenurse/blob/52767fbb280b65c06ac926dac49dd874e9ec4aee/internal/servicecheck/httptrace.go#L73)
file
- `kubenurse_neighbourhood_incoming_checks`: a gauge which reports how many
unique neighbours have queried the current pod in the last minute
Binary file added doc/kubenurse.png
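
As an illustration of how the metrics listed in the README diff above might be consumed, here is a minimal Go sketch that sums `kubenurse_errors_total` samples per error type from a Prometheus text-format scrape. The sample payload, the label name `type`, and the helper names `labelValue` and `sumByLabel` are assumptions made for this example; they are not taken from the kubenurse source. A real consumer would more likely query Prometheus or use an existing exposition-format parser.

```go
package main

import (
	"bufio"
	"fmt"
	"strconv"
	"strings"
)

// sample is a hypothetical scrape of /metrics in the Prometheus text format.
// The label name "type" and all values are assumptions for illustration.
const sample = `# TYPE kubenurse_errors_total counter
kubenurse_errors_total{type="me_ingress"} 3
kubenurse_errors_total{type="api_server_dns"} 1
kubenurse_neighbourhood_incoming_checks 10
`

// labelValue returns the value of one label from a label set such as
// `type="me_ingress",foo="bar"`, or "" if the label is absent.
// It splits naively on commas, so label values containing commas are not handled.
func labelValue(labels, name string) string {
	for _, pair := range strings.Split(labels, ",") {
		k, v, ok := strings.Cut(strings.TrimSpace(pair), "=")
		if ok && k == name {
			return strings.Trim(v, `"`)
		}
	}
	return ""
}

// sumByLabel adds up all labelled samples of one metric, grouped by the
// value of the given label. Comment lines, other metrics, and samples
// without a label set are skipped.
func sumByLabel(text, metric, label string) map[string]float64 {
	out := make(map[string]float64)
	sc := bufio.NewScanner(strings.NewReader(text))
	for sc.Scan() {
		line := strings.TrimSpace(sc.Text())
		if !strings.HasPrefix(line, metric+"{") {
			continue
		}
		end := strings.LastIndex(line, "}")
		if end < 0 {
			continue
		}
		labels := line[len(metric)+1 : end]
		fields := strings.Fields(line[end+1:])
		if len(fields) == 0 {
			continue
		}
		v, err := strconv.ParseFloat(fields[0], 64)
		if err != nil {
			continue
		}
		out[labelValue(labels, label)] += v
	}
	return out
}

func main() {
	for errType, n := range sumByLabel(sample, "kubenurse_errors_total", "type") {
		fmt.Printf("%s: %v errors\n", errType, n)
	}
}
```

Grouping by label value keeps the sketch independent of which check types a given cluster actually reports.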