Clarify type and meaning of stacks_* metrics (#402)
* Clarify type and meaning of stacks_* metrics

The stacks_failing metric is created as a GaugeVec in the Go code, which represents a set of time series distinguished by labels (in this case, "namespace" and "name"). But each of these time series is of type `gauge`, so the documentation is misleading in referring to them as `gaugevec` (which is not a kind of metric).

I've simplified the verbiage a little, in passing.

Addresses #399.

* Reset stacks_failing gauge when stack deleted

The stacks_failing metric is a set of gauges, each labelled with the
namespace and name of a Stack object. The controller sets a gauge to `1`
when its Stack object is given a state of "failed", and `0` for
"succeeded". A query aggregating over the labels will get the count of
failed stacks.

However: once a Stack is deleted, the gauge remains with the last value
-- and if it was failing, it will still be included in the count. So,
this commit resets the gauge to `0` when a Stack is deleted (if it had a
state at all).
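
The stale-count problem described above can be sketched with a small stand-alone model. This is not the operator's code: `stackKey` and `failingGauges` are hypothetical names, and a plain map stands in for the labelled gauges so that the example needs only the standard library.

```go
package main

import "fmt"

// stackKey stands in for the "namespace" and "name" label pair.
type stackKey struct {
	Namespace string
	Name      string
}

// failingGauges is a hypothetical stand-in for the set of labelled gauges:
// one value per Stack, 1 for "failed" and 0 for "succeeded".
type failingGauges map[stackKey]float64

// sum mimics a query that aggregates over the labels to count failed stacks.
func (g failingGauges) sum() float64 {
	total := 0.0
	for _, v := range g {
		total += v
	}
	return total
}

func main() {
	gauges := failingGauges{
		{Namespace: "default", Name: "app"}: 1, // last state: failed
		{Namespace: "default", Name: "db"}:  0, // last state: succeeded
	}

	// If the Stack "app" is deleted and its gauge is left alone,
	// the aggregate still counts it as failing.
	fmt.Println("count with stale gauge:", gauges.sum())

	// Resetting the gauge on deletion removes it from the count.
	gauges[stackKey{Namespace: "default", Name: "app"}] = 0
	fmt.Println("count after reset:", gauges.sum())
}
```

In this model the reset leaves the entry in place with a value of zero, which mirrors the approach taken in this commit (set the gauge to `0`) rather than removing the time series.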

Signed-off-by: Michael Bridgen <mbridgen@pulumi.com>
squaremo authored Jan 24, 2023
1 parent 9076fdd commit ce3262f
Showing 3 changed files with 12 additions and 2 deletions.
2 changes: 2 additions & 0 deletions CHANGELOG.md
@@ -17,6 +17,8 @@ CHANGELOG
- When a Stack uses a Flux source, but the source has no artifact to download, park the Stack until
the source has been updated, rather than retrying
[#359](https://github.com/pulumi/pulumi-kubernetes-operator/pull/359)
+- Correct the stacks_failing metric in the case of a stack being deleted after failing
+  [#402](https://github.com/pulumi/pulumi-kubernetes-operator/pull/402)

## 1.10.1 (2022-10-25)

4 changes: 2 additions & 2 deletions docs/metrics.md
@@ -20,8 +20,8 @@ Once the above are created, Prometheus will update its target scraping rules to

The current implementation explicitly emits the following metrics:

-1. `stacks_active` - `gauge` that tracks the number of currently registered stacks managed by the system
-2. `stacks_failing` - `gaugevec` that provides information about stacks currently failing (`stack.status.lastUpdate.state` is `failed`)
+1. `stacks_active` - a `gauge` time series that reports the number of currently registered stacks managed by the system
+2. `stacks_failing` - a set of `gauge` time series, labelled by namespace, that gives the number of stacks currently failing (`stack.status.lastUpdate.state` is `failed`)

In addition, we find tracking the following metrics emitted by the controller-runtime would be useful to track:

8 changes: 8 additions & 0 deletions pkg/controller/stack/metrics.go
@@ -66,4 +66,12 @@ func updateStackCallback(oldObj, newObj interface{}) {

func deleteStackCallback(oldObj interface{}) {
	numStacks.Dec()
+	oldStack, ok := oldObj.(*pulumiv1.Stack)
+	if !ok {
+		return
+	}
+	// assume that if there was a status recorded, this gauge exists
+	if oldStack.Status.LastUpdate != nil {
+		numStacksFailing.With(prometheus.Labels{"namespace": oldStack.Namespace, "name": oldStack.Name}).Set(0)
+	}
}
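
The two guards in `deleteStackCallback` can be illustrated in isolation. The sketch below is an assumption-laden model, not the operator's code: `Stack`, `StackStatus`, and `LastUpdate` are pared-down stand-ins for the `pulumiv1` types, and `shouldResetGauge` is a hypothetical helper that returns whether the callback would go on to reset the gauge.

```go
package main

import "fmt"

// Hypothetical, pared-down stand-ins for the pulumiv1 types.
type LastUpdate struct{ State string }

type StackStatus struct{ LastUpdate *LastUpdate }

type Stack struct {
	Namespace, Name string
	Status          StackStatus
}

// shouldResetGauge mirrors the two checks in the diff: the object must be a
// *Stack, and it must have a recorded last update (otherwise it is assumed
// that no gauge was ever created for it).
func shouldResetGauge(obj interface{}) bool {
	stack, ok := obj.(*Stack)
	if !ok {
		return false // unexpected type: nothing to reset
	}
	return stack.Status.LastUpdate != nil
}

func main() {
	failed := &Stack{
		Namespace: "default",
		Name:      "app",
		Status:    StackStatus{LastUpdate: &LastUpdate{State: "failed"}},
	}
	neverRun := &Stack{Namespace: "default", Name: "new"}

	fmt.Println(shouldResetGauge(failed))
	fmt.Println(shouldResetGauge(neverRun))
	fmt.Println(shouldResetGauge("not a stack"))
}
```

Note that in the diff `numStacks.Dec()` runs before either check, so the active-stack count is decremented even when the gauge reset is skipped.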
