Commit

Merge pull request #1661 from DSD-DBS/kube-alerts
feat: Add alerts for system administrators in Grafana
MoritzWeber0 authored Jul 23, 2024
2 parents 9cb9f7b + c0f6490 commit 448e14b
Showing 28 changed files with 311 additions and 56 deletions.
2 changes: 1 addition & 1 deletion Makefile
@@ -82,7 +82,7 @@ deploy-without-build: helm-deploy rollout open
helm-deploy:
@k3d cluster list $(CLUSTER_NAME) >/dev/null || $(MAKE) create-cluster
@kubectl create namespace $(SESSION_NAMESPACE) 2> /dev/null || true
@[[ ! $$(helm dependency list ./helm | grep missing) ]] || helm dependency update ./helm;
@[[ ! $$(helm dependency list ./helm | sed '1d' | sed '/^$$/d' | grep -wv ok) ]] || helm dependency update ./helm;
@echo "Start helm upgrade..."
HELM_PACKAGE_DIR=$$(mktemp -d)
helm package --app-version=$$(git rev-parse --abbrev-ref HEAD) --version=$$(git describe --tags) -d "$$HELM_PACKAGE_DIR" helm
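
The changed check in `helm-deploy` can be tried outside of Make. The following is a minimal sketch, assuming `helm dependency list` prints a header row, one row per dependency with the status in the last column, and a trailing blank line; the sample rows and the `sample_output` and `needs_update` helper names are hypothetical. It suggests why filtering for anything that is not `ok` is broader than grepping for `missing`: a status such as `wrong version` is caught as well.

```shell
#!/usr/bin/env bash
# Hypothetical stand-in for `helm dependency list ./helm`:
# header row, two dependency rows, trailing blank line.
sample_output() {
  printf 'NAME\tVERSION\tREPOSITORY\tSTATUS\n'
  printf 'loki\t5.30.0\thttps://grafana.github.io/helm-charts/\tok\n'
  printf 'kube-state-metrics\t5.21.0\thttps://prometheus-community.github.io/helm-charts\t%s\n' "$1"
  printf '\n'
}

# Same filter as in the Makefile, with `$$` unescaped to `$`:
# drop the header, drop empty lines, keep rows without the word "ok".
needs_update() {
  sed '1d' | sed '/^$/d' | grep -wv ok
}

# A non-empty result means at least one dependency needs `helm dependency update`.
stale=$(sample_output "wrong version" | needs_update || true)
fresh=$(sample_output "ok" | needs_update || true)

echo "stale: ${stale:-<none>}"
echo "fresh: ${fresh:-<none>}"
```

The `-w` flag matters here: without it, the `ok` inside `loki` would count as a match and hide that row from the result regardless of its status.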
7 changes: 6 additions & 1 deletion ci-templates/gitlab/k8s-deploy.yml
@@ -3,6 +3,7 @@

variables:
GRAFANA_HELM_CHART: https://grafana.github.io/helm-charts/
PROMETHEUS_HELM_CHARTS: https://prometheus-community.github.io/helm-charts
PRIVATE_GPG_PATH: /secrets/private.gpg
TARGET:
value: 'staging'
@@ -43,6 +44,9 @@ variables:
- git checkout ${REVISION}
# prettier-ignore
- sed -i "s#https://grafana.github.io/helm-charts/#${GRAFANA_HELM_CHART}#g" ./helm/Chart.yaml
- sed -i
"s#https://prometheus-community.github.io/helm-charts#${PROMETHEUS_HELM_CHARTS}#g"
./helm/Chart.yaml

.kubernetes: &kubernetes
- NAMESPACE=$(cat ../plain.k8s.json | jq -r ".namespace")
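
The repository rewrite added in this hunk can be reproduced locally. This is a sketch under assumptions: the mirror URL is made up, the `Chart.yaml` content is reduced to the one dependency, and the result is captured in a variable instead of being written back with `sed -i` so that the snippet behaves the same with GNU and BSD `sed`.

```shell
#!/usr/bin/env bash
# Hypothetical mirror; in the pipeline the value comes from the CI variable.
PROMETHEUS_HELM_CHARTS="https://mirror.example.com/prometheus-community"

# Reduced stand-in for ./helm/Chart.yaml.
chart=$(mktemp)
cat > "$chart" <<'EOF'
dependencies:
  - name: kube-state-metrics
    version: 5.21.0
    repository: https://prometheus-community.github.io/helm-charts
EOF

# `#` is used as the sed delimiter so the slashes in the URLs need no escaping.
rewritten=$(sed "s#https://prometheus-community.github.io/helm-charts#${PROMETHEUS_HELM_CHARTS}#g" "$chart")

printf '%s\n' "$rewritten"
rm -f "$chart"
```

One caveat of this approach: the replacement value must not contain `#` or `&`, because `sed` treats the former as the delimiter and the latter as "the matched text".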
@@ -61,9 +65,9 @@ variables:

.helm-deploy: &helm-deploy
- RELEASE=$(cat ../plain.k8s.json | jq -r ".release")
- cp -r ../config/* helm/config
# prettier-ignore
- DOCKER_TAG=$(echo $REVISION | sed 's/[^a-zA-Z0-9.]/-/g')-$CI_COMMIT_REF_SLUG
- helm repo add grafana-helm-remote ${GRAFANA_HELM_CHART}
- helm dependency update ./helm
- HELM_PACKAGE_DIR=$(mktemp -d)
- >
@@ -78,6 +82,7 @@ variables:
--set docker.tag=${DOCKER_TAG} \
-f ../${TARGET}/general.values.yaml \
-f ../plain.values.yaml "$HELM_PACKAGE_DIR"/collab-manager-*.tgz
- kubectl rollout restart deployment ${RELEASE}-backend
- kubectl rollout restart deployment ${RELEASE}-frontend
- kubectl rollout restart deployment ${RELEASE}-docs
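
The `DOCKER_TAG` line in the `.helm-deploy` template above derives an image tag from the revision. A small sketch of what that expression does, with hypothetical values for `REVISION` and `CI_COMMIT_REF_SLUG` (in the pipeline both are provided by the CI environment):

```shell
#!/usr/bin/env bash
# Hypothetical inputs for illustration only.
REVISION="feat/kube-alerts"
CI_COMMIT_REF_SLUG="main"

# Replace every character that is not a letter, digit, or dot with a hyphen,
# then append the ref slug.
DOCKER_TAG=$(echo "$REVISION" | sed 's/[^a-zA-Z0-9.]/-/g')-$CI_COMMIT_REF_SLUG

echo "$DOCKER_TAG"
```

The substitution is needed because branch names may contain characters such as `/` that are not allowed in image tags.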
2 changes: 1 addition & 1 deletion docs/docs/admin/cli.md
@@ -35,7 +35,7 @@ located in a module:
python -m capellacollab.cli --help
```

This gives you the help information. The CLI tool currently has one subcommand:
This gives you the help information. The CLI tool currently has a subcommand:
`ws`, short for workspace.

```
30 changes: 30 additions & 0 deletions docs/docs/admin/monitoring/alerting.md
@@ -0,0 +1,30 @@
<!--
~ SPDX-FileCopyrightText: Copyright DB InfraGO AG and contributors
~ SPDX-License-Identifier: Apache-2.0
-->

# Alerts in unexpected situations

If something doesn't work as expected, it's important that system
administrators receive a notification.

We use the Grafana Alertmanager to send alerts for some pre-defined error
cases. If you're missing an alert rule, let us know via
[GitHub issues](https://github.com/DSD-DBS/capella-collab-manager/issues) or
open a PR and add it to the list of pre-defined rules.

## Configure alerting

By default, firing alerts can only be viewed in the Grafana UI. You can
configure additional contact points depending on your needs.

A list of supported contact points is available in the
[official Grafana documentation](https://grafana.com/docs/grafana/latest/alerting/configure-notifications/manage-contact-points/).
The list includes chat services like Microsoft Teams but also email and webhook
notifications.

!!! info "Configure SMTP server for email alerting"

For email alerting, you need to configure an SMTP server in the
`values.yaml` in the Helm chart. Have a look at the `alerting.email`
configuration.
Binary file added docs/docs/admin/monitoring/contact_points.md.png
20 changes: 20 additions & 0 deletions docs/docs/admin/monitoring/dashboards.md
@@ -0,0 +1,20 @@
<!--
~ SPDX-FileCopyrightText: Copyright DB InfraGO AG and contributors
~ SPDX-License-Identifier: Apache-2.0
-->

# Grafana Dashboards

We provide a few pre-configured Grafana dashboards to monitor the sessions and
TeamForCapella licenses.

The Grafana dashboards are available to administrators and can be accessed via
the "Grafana" link in the main menu. Select Dashboards to see a list of
available dashboards:

![Dashboard in the main Grafana menu](./dashboards.png){:style="width:300px"}

You can add additional dashboards depending on your needs. If you think the
dashboard could be helpful for others, please add the dashboard to the
[list of pre-defined dashboards](https://github.com/DSD-DBS/capella-collab-manager/tree/main/helm/config/grafana)
and [open a PR](https://github.com/DSD-DBS/capella-collab-manager/pulls).
Binary file added docs/docs/admin/monitoring/dashboards.png
14 changes: 14 additions & 0 deletions docs/docs/admin/monitoring/frontend.md
@@ -0,0 +1,14 @@
<!--
~ SPDX-FileCopyrightText: Copyright DB InfraGO AG and contributors
~ SPDX-License-Identifier: Apache-2.0
-->

# Pipeline and Model Modifier Monitoring

Metrics connected to projects and registered models are available in a custom
dashboard in the frontend.

In the dashboard, you can get a general overview of the status of pipelines and
model modifiers for registered models.

You can find it by navigating to `Menu` > `Settings` > `Monitoring`.
11 changes: 0 additions & 11 deletions docs/docs/admin/settings/monitoring.md

This file was deleted.

6 changes: 5 additions & 1 deletion docs/mkdocs.yml
@@ -76,6 +76,10 @@ nav:
- Installation: admin/installation.md
- Uninstallation: admin/uninstallation.md
- Getting started: admin/getting_started/index.md
- Monitoring:
- Alerting: admin/monitoring/alerting.md
- Dashboards: admin/monitoring/dashboards.md
- Pipelines & Model Modifiers: admin/monitoring/frontend.md
- Integrations:
- Git: admin/settings/model-sources/git.md
- TeamForCapella:
@@ -90,7 +94,6 @@ nav:
- General: admin/settings/tools/index.md
- Configuration: admin/tools/configuration.md
- Alerts: admin/alerts/create.md
- Monitoring: admin/settings/monitoring.md
- CI templates:
- Gitlab CI/CD:
- Image builder: admin/ci-templates/gitlab/image-builder.md
@@ -147,6 +150,7 @@ plugins:
'user/tools/capella/teamforcapella/connect/connect-to-t4c.md': 'user/tools/capella/teamforcapella/connect/index.md'
'user/tools/capella/teamforcapella/import/import-from-t4c.md': 'user/tools/capella/teamforcapella/import/index.md'
'user/tools/capella/teamforcapella/export/export-to-t4c.md': 'user/tools/capella/teamforcapella/export/index.md'
'admin/settings/monitoring.md': 'admin/monitoring/frontend.md'

markdown_extensions:
- meta
2 changes: 2 additions & 0 deletions helm/.gitignore
@@ -7,3 +7,5 @@
.venv
initdb.sql
p_options.yaml
config/certs/*
!config/certs/.gitkeep
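
The two patterns added to `helm/.gitignore` ignore everything inside `config/certs` while keeping the `.gitkeep` placeholder tracked, so the empty directory still exists in a fresh checkout. A minimal sketch in a throwaway repository, assuming `git` is installed; the repository root stands in for the `helm/` directory and `server.pem` is a hypothetical certificate file:

```shell
#!/usr/bin/env bash
# Throwaway repository; its root plays the role of the helm/ directory.
repo=$(mktemp -d)
cd "$repo" || exit 1
git init -q .

mkdir -p config/certs
printf 'config/certs/*\n!config/certs/.gitkeep\n' > .gitignore
: > config/certs/.gitkeep
: > config/certs/server.pem   # hypothetical certificate, should stay untracked

# Stage everything that is not ignored and list what ended up in the index.
git add -A
tracked=$(git ls-files)

printf '%s\n' "$tracked"
```

The negation only works because the first pattern ignores the directory's contents (`config/certs/*`) rather than the directory itself; Git does not descend into an ignored directory, so a file inside it cannot be re-included.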
7 changes: 5 additions & 2 deletions helm/Chart.lock
@@ -2,5 +2,8 @@ dependencies:
- name: loki
repository: https://grafana.github.io/helm-charts/
version: 5.30.0
digest: sha256:afdc744b3c8e3b9f5c6caf9858de7b6c70e5f97c178c9728828d5fcd713dc20d
generated: "2023-10-16T13:58:40.468168+02:00"
- name: kube-state-metrics
repository: https://prometheus-community.github.io/helm-charts
version: 5.21.0
digest: sha256:78da5915214a0d59b1d90c8e269f98b904028ecf7e167600c817717568cc85fd
generated: "2024-07-19T11:32:45.833191+02:00"
7 changes: 7 additions & 0 deletions helm/Chart.yaml
@@ -8,9 +8,16 @@ home: https://github.com/DSD-DBS/capella-collab-manager
type: application
version: 0.0.0 # The version is automatically updated by the release process.
appVersion: 0.0.0 # The appVersion is automatically updated by the release process.
maintainers:
- name: Systems Engineering Toolchain team of Digitale Schiene Deutschland
email: set@deutschebahn.com
dependencies:
- name: loki
alias: loki
condition: loki.enabled
version: 5.30.0
repository: https://grafana.github.io/helm-charts/
- name: kube-state-metrics
version: 5.21.0
condition: kube-state-metrics.enabled
repository: https://prometheus-community.github.io/helm-charts
Empty file added helm/config/certs/.gitkeep
Empty file.
76 changes: 76 additions & 0 deletions helm/config/grafana/alerting/rules.yaml
@@ -0,0 +1,76 @@
# SPDX-FileCopyrightText: Copyright DB InfraGO AG and contributors
# SPDX-License-Identifier: Apache-2.0

apiVersion: 1
groups:
  - orgId: 1
    name: Deployment
    folder: Alerts
    interval: 1m
    rules:
      - uid: a32cdf13-990a-438e-a451-11f9185e97b2
        title: Session container unhealthy
        condition: A
        data:
          - refId: A
            relativeTimeRange:
              from: 3600
              to: 0
            datasourceUid: prometheus_ccm
            model:
              datasource:
                type: prometheus
                uid: prometheus_ccm
              editorMode: code
              expr:
                sum by(namespace, pod, phase,
                annotation_capellacollab_session_id)
                (kube_pod_status_phase{phase=~"Pending|Unknown|Failed"} * on
                (uid) group_left kube_pod_labels{label_workload="session"} * on
                (uid) group_left (annotation_capellacollab_session_id)
                kube_pod_annotations) > 0
              instant: true
              intervalMs: 1000
              legendFormat: '{{deployment}}'
              maxDataPoints: 43200
              range: false
              refId: A
        noDataState: OK
        execErrState: Error
        for: 10m
        annotations:
          description: A session container is in an unexpected state.
          runbook_url: ''
          summary:
            Pod {{ $labels.pod }} in namespace {{ $labels.namespace }} is not
            ready for over 10 minutes.
        labels:
          '': ''
        isPaused: false
      - uid: d7609318-1289-443a-908c-bada900079cc
        title: Job has failed
        condition: A
        data:
          - refId: A
            relativeTimeRange:
              from: 86400
              to: 0
            datasourceUid: prometheus_ccm
            model:
              datasource:
                type: prometheus
                uid: prometheus_ccm
              editorMode: builder
              expr: kube_job_status_failed > 0
              hide: false
              instant: true
              intervalMs: 1000
              maxDataPoints: 43200
              range: false
              refId: A
        noDataState: OK
        execErrState: Error
        for: 5m
        annotations:
          summary: A job has failed
        isPaused: false
@@ -30,7 +30,7 @@
{
"datasource": {
"type": "prometheus",
"uid": "PBFA97CFB590B2093"
"uid": "prometheus_ccm"
},
"fieldConfig": {
"defaults": {
@@ -141,7 +141,7 @@
{
"datasource": {
"type": "prometheus",
"uid": "PBFA97CFB590B2093"
"uid": "prometheus_ccm"
},
"editorMode": "code",
"expr": "sum(count(up{tool_version_id=~\"$tool_version_id\",connection_method_id=~\"$connection_method_id\", session_type=~\"$session_type\", job=\"sessions\"})) OR on() vector(0)",
@@ -153,7 +153,7 @@
{
"datasource": {
"type": "prometheus",
"uid": "PBFA97CFB590B2093"
"uid": "prometheus_ccm"
},
"editorMode": "code",
"expr": "(count(up{tool_version_id=~\"$tool_version_id\",connection_method_id=~\"$connection_method_id\", session_type=~\"$session_type\", job=\"sessions\"} == 0)) OR vector(0)",
@@ -170,7 +170,7 @@
{
"datasource": {
"type": "prometheus",
"uid": "PBFA97CFB590B2093"
"uid": "prometheus_ccm"
},
"fieldConfig": {
"defaults": {
@@ -277,7 +277,7 @@
{
"datasource": {
"type": "prometheus",
"uid": "PBFA97CFB590B2093"
"uid": "prometheus_ccm"
},
"editorMode": "code",
"expr": "count by (tool_id, tool_name) (up{tool_version_id=~\"$tool_version_id\",connection_method_id=~\"$connection_method_id\", session_type=~\"$session_type\", job=\"sessions\"})",
@@ -293,7 +293,7 @@
{
"datasource": {
"type": "prometheus",
"uid": "PBFA97CFB590B2093"
"uid": "prometheus_ccm"
},
"description": "",
"fieldConfig": {
@@ -343,7 +343,7 @@
{
"datasource": {
"type": "prometheus",
"uid": "PBFA97CFB590B2093"
"uid": "prometheus_ccm"
},
"editorMode": "code",
"expr": "sum(count(up{tool_version_id=~\"$tool_version_id\",connection_method_id=~\"$connection_method_id\", session_type=~\"$session_type\", job=\"sessions\"})) OR on() vector(0)",
@@ -358,7 +358,7 @@
{
"datasource": {
"type": "prometheus",
"uid": "PBFA97CFB590B2093"
"uid": "prometheus_ccm"
},
"description": "How long sessions were idle. When a session reaches the top of the graph, it is terminated automatically.",
"fieldConfig": {
@@ -442,7 +442,7 @@
{
"datasource": {
"type": "prometheus",
"uid": "PBFA97CFB590B2093"
"uid": "prometheus_ccm"
},
"editorMode": "code",
"expr": "idletime_minutes{tool_version_id=~\"$tool_version_id\",connection_method_id=~\"$connection_method_id\", session_type=~\"$session_type\"}",
@@ -467,7 +467,7 @@
},
"datasource": {
"type": "prometheus",
"uid": "PBFA97CFB590B2093"
"uid": "prometheus_ccm"
},
"definition": "up{job=\"sessions\"}",
"hide": 0,
@@ -492,7 +492,7 @@
},
"datasource": {
"type": "prometheus",
"uid": "PBFA97CFB590B2093"
"uid": "prometheus_ccm"
},
"definition": "up{job=\"sessions\", tool_version_id=~\"$tool_version_id\"}",
"description": "",
@@ -518,7 +518,7 @@
},
"datasource": {
"type": "prometheus",
"uid": "PBFA97CFB590B2093"
"uid": "prometheus_ccm"
},
"definition": "label_values(up{job=\"sessions\", tool_version_id=~\"$tool_version_id\"},session_type)",
"hide": 0,
2 changes: 2 additions & 0 deletions helm/config/grafana/dashboards/active-sessions.json.license
@@ -0,0 +1,2 @@
SPDX-FileCopyrightText: Copyright DB InfraGO AG and contributors
SPDX-License-Identifier: Apache-2.0