This repository has been archived by the owner on Nov 1, 2022. It is now read-only.

docs: improve monitoring documentation #384

Merged
merged 3 commits into from
Apr 20, 2020
52 changes: 50 additions & 2 deletions docs/references/monitoring.md
@@ -4,10 +4,58 @@ The Helm Operator exposes a metrics endpoint at `/metrics` on the configured
[`--listen`](operator.md#general-flags) address (defaults to `:3030`) with data
in Prometheus format.
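
As a minimal sketch of how a Prometheus server could scrape this endpoint, the fragment below uses a static target. The job name and the `helm-operator.flux.svc` host are hypothetical placeholders, not values from this documentation; only the `/metrics` path and the default `:3030` port come from the text above.

```yaml
scrape_configs:
  # Hypothetical job name; pick whatever fits your setup.
  - job_name: helm-operator
    # Path documented above; this is also the Prometheus default.
    metrics_path: /metrics
    static_configs:
      - targets:
          # Assumed in-cluster address; 3030 is the default --listen port.
          - helm-operator.flux.svc:3030
```

If `--listen` is set to a different address, the port in the target needs to change accordingly.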

The following metrics are exposed:
## Metrics

| Metric | Description |
|--------|-------------|
| `release_duration_seconds` | Release synchronization duration in seconds. |
| `release_duration_seconds` | Release synchronization duration in seconds. This duration includes one or more phase durations (see `release_phase_duration_seconds`). |
| `release_phase_duration_seconds` | Release phase synchronization duration in seconds. |
| `release_phase_info` | Integer (possibly negative) identifying the current phase of a release. Negative values are failed phases; `0` means the phase is unknown. See [release phases](#release-phases). |
| `release_queue_length_count` | Count of release jobs waiting in the queue to be processed. |


### Release phases

The following table lists the values the `release_phase_info` metric exposes
and the phase each value represents:

| Value | Phase |
|-------|-------|
| `-4` | `ChartFetchFailed` |
| `-3` | `Failed` |
| `-2` | `RollbackFailed` |
| `-1` | `RolledBack` |
| `0` | `Unknown` |
| `1` | `RollingBack` |
| `2` | `Installing` |
| `3` | `Upgrading` |
| `4` | `ChartFetched` |
| `5` | `Succeeded` |
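
Because the phases between `Unknown` and `Succeeded` occupy the values `1` through `4` in the table above, a range comparison can select releases that have not yet reached a final phase. The rule below is only a sketch: the alert name and the `15m` window are assumptions, and whether a release sitting in one of these phases for that long indicates a problem depends on your charts.

```yaml
# Hypothetical alert name and duration, chosen for illustration.
alert: HelmReleaseStuckInProgress
# Matches phase values 1 (RollingBack) through 4 (ChartFetched).
expr: flux_helm_operator_release_phase_info > 0 and flux_helm_operator_release_phase_info < 5
for: 15m
```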

## Prometheus alert rules examples

The following are example Prometheus alert rules that can be built on
the exposed metrics. We are open to
[pull requests](https://github.com/fluxcd/helm-operator/pulls) adding more rules.

### Low queue throughput

```yaml
alert: HelmOperatorLowThroughput
expr: flux_helm_operator_release_queue_length_count > 0
for: 30m
```

### Automatic rollback of `HelmRelease`

```yaml
alert: HelmReleaseRolledBack
expr: flux_helm_operator_release_phase_info == -1
```

### `HelmRelease` subject to an error

```yaml
alert: HelmReleaseError
expr: flux_helm_operator_release_phase_info < -1
```
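
The rules above are shown as bare fragments. One way they might be assembled into a rule file that Prometheus can load is sketched below; the group name, the `severity` labels, and the annotation texts are assumptions added for illustration and are not part of this documentation.

```yaml
groups:
  # Hypothetical group name.
  - name: helm-operator
    rules:
      - alert: HelmOperatorLowThroughput
        expr: flux_helm_operator_release_queue_length_count > 0
        for: 30m
        labels:
          severity: warning  # assumed label
        annotations:
          summary: Release jobs have been waiting in the queue for 30 minutes.
      - alert: HelmReleaseRolledBack
        expr: flux_helm_operator_release_phase_info == -1
        labels:
          severity: warning  # assumed label
        annotations:
          summary: A HelmRelease was rolled back automatically.
      - alert: HelmReleaseError
        expr: flux_helm_operator_release_phase_info < -1
        labels:
          severity: critical  # assumed label
        annotations:
          summary: A HelmRelease is in a failed phase.
```

A file like this can be validated with `promtool check rules <file>` before it is referenced from the `rule_files` section of the Prometheus configuration.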