add MimirContinuousTestFailingOnWrites and MimirContinuousTestFailingOnReads alerts
QuantumEnigmaa committed Sep 10, 2024
1 parent 0666a68 commit 9d39c58
Showing 2 changed files with 35 additions and 0 deletions.
1 change: 1 addition & 0 deletions CHANGELOG.md
@@ -10,6 +10,7 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
### Added

- Add `MimirHPAReachedMaxReplicas` alert, to detect when Mimir's HPAs have reached maximum capacity.
- Add `MimirContinuousTestFailingOnWrites` and `MimirContinuousTestFailingOnReads` alerts.

### Changed

@@ -182,4 +182,38 @@ spec:
severity: page
team: atlas
topic: observability
- alert: MimirContinuousTestFailingOnWrites
annotations:
description: 'Mimir continuous-test detected errors in the write path.'
opsrecipe: mimir/
# Query is based on the following upstream mixin alerting rule : https://github.com/grafana/mimir/blob/main/operations/mimir-mixin-compiled/alerts.yaml#L1097
expr: sum by(cluster_id, installation, namespace, pipeline, provider, test) (rate(mimir_continuous_test_writes_failed_total[5m])) > 0
for: 1h
labels:
area: platform
cancel_if_cluster_status_creating: "true"
cancel_if_cluster_status_deleting: "true"
cancel_if_cluster_status_updating: "true"
cancel_if_outside_working_hours: "true"
# TODO dashboard:
severity: page
team: atlas
topic: observability
- alert: MimirContinuousTestFailingOnReads
annotations:
description: 'Mimir continuous-test detected errors in the read path.'
opsrecipe: mimir/
# Query is based on the following upstream mixin alerting rule : https://github.com/grafana/mimir/blob/main/operations/mimir-mixin-compiled/alerts.yaml#L1097
expr: sum by(cluster_id, installation, namespace, pipeline, provider, test) (rate(mimir_continuous_test_queries_failed_total[5m])) > 0
for: 1h
labels:
area: platform
cancel_if_cluster_status_creating: "true"
cancel_if_cluster_status_deleting: "true"
cancel_if_cluster_status_updating: "true"
cancel_if_outside_working_hours: "true"
# TODO dashboard:
severity: page
team: atlas
topic: observability
{{- end }}
