Commit

Merge branch 'main' into start-reviewing-phoenix-alerts
QuentinBisson authored Jun 9, 2024
2 parents c1999e2 + 394ae9f commit 1c3560f
Showing 17 changed files with 54 additions and 39 deletions.
2 changes: 1 addition & 1 deletion .circleci/config.yml
@@ -1,6 +1,6 @@
version: 2.1
orbs:
architect: giantswarm/architect@5.2.0
architect: giantswarm/architect@5.2.1

workflows:
package-and-push-chart-on-tag:
14 changes: 13 additions & 1 deletion CHANGELOG.md
@@ -7,19 +7,31 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0

## [Unreleased]

### Fixed

- Fixed usage of yq, and jq in check-opsrecipes.sh
- Fetch jq with make install-tools

### Added

- Added a new alerting rule to `falco.rules.yml` to fire an alert for XZ-backdoor.
- Add `CiliumAPITooSlow`.

### Changed

- Review phoenix alerts towards Mimir.
- Split the phoenix job alert into 2:
- a new file named job.aws.rules that contains the aws specific alerts
- move the rest of job.rules into the shared alerts because it is provider independent
- Move the management cluster certificate alerts into the shared alerts because it is provider independent
- Review and fix phoenix alerts towards Mimir and multi-provider MCs.
- Moves cluster-autoscaler and vpa alerts to turtles.

### Fixed

- Fix and improve the ops-recipe test script.
- Fix cabbage alerts for multi-provider wcs.
- Fix shield alert area labels.
- Fix `cert-exporter` alerting.

### Removed

8 changes: 0 additions & 8 deletions helm/prometheus-rules/templates/_helpers.tpl
@@ -45,14 +45,6 @@ phoenix
{{- end -}}
{{- end -}}

{{- define "isCertExporterInstalled" -}}
{{- if has .Values.managementCluster.provider.kind (list "cloud-director" "vsphere" "capa") -}}
false
{{- else -}}
true
{{- end -}}
{{- end -}}

{{- define "isBastionBeingMonitored" -}}
{{ not (eq .Values.managementCluster.provider.flavor "capi") }}
{{- end -}}
@@ -1,5 +1,5 @@
## TODO Remove with vintage
{{- if eq .Values.managementCluster.provider.flavor "vintage" }}
## TODO Remove when all vintage installations are gone
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
@@ -1,5 +1,5 @@
## TODO Remove with vintage
{{- if eq .Values.managementCluster.provider.flavor "vintage" }}
## TODO Remove when all vintage installations are gone
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
@@ -1,5 +1,5 @@
## TODO Remove with vintage
{{- if eq .Values.managementCluster.provider.flavor "vintage" }}
## TODO Remove when all vintage installations are gone
# newer clusters don't use docker anymore
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
@@ -1,6 +1,6 @@
## TODO Remove with vintage
# This rule applies to vintage aws management clusters
{{- if eq .Values.managementCluster.provider.flavor "vintage" }}
## TODO Remove when all vintage installations are gone
# This rule applies to vintage aws management clusters
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
@@ -1,6 +1,6 @@
## TODO Remove with vintage
# This rule applies to vintage aws clusters
{{- if eq .Values.managementCluster.provider.flavor "vintage" }}
## TODO Remove when all vintage installations are gone
# This rule applies to vintage aws clusters
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
@@ -1,5 +1,5 @@
## TODO Remove with vintage
{{- if eq .Values.managementCluster.provider.flavor "vintage" }}
## TODO Remove when all vintage installations are gone
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
@@ -1,5 +1,5 @@
# This rule applies to all capi management clusters
{{- if eq .Values.managementCluster.provider.flavor "capi" }}
# This rule applies to all capi management clusters
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
@@ -21,7 +21,7 @@ spec:
opsrecipe: falco-alert/
expr: increase(falco_events{priority=~"0|1|2|3"}[10m] ) > 0
labels:
area: kaas
area: platform
cancel_if_cluster_status_creating: "true"
cancel_if_cluster_status_deleting: "true"
cancel_if_cluster_status_updating: "true"
@@ -37,7 +37,7 @@ spec:
opsrecipe: falco-alert/
expr: increase(falco_events{priority=~"4|5"}[10m] ) > 0
labels:
area: kaas
area: platform
cancel_if_cluster_status_creating: "true"
cancel_if_cluster_status_deleting: "true"
cancel_if_cluster_status_updating: "true"
@@ -52,7 +52,7 @@ spec:
opsrecipe: falco-alert/
expr: increase(falco_events{priority="6"}[10m] ) > 0
labels:
area: kaas
area: platform
cancel_if_cluster_status_creating: "true"
cancel_if_cluster_status_deleting: "true"
cancel_if_cluster_status_updating: "true"
@@ -67,7 +67,7 @@ spec:
opsrecipe: falco-alert/
expr: falco_events{rule="Backdoored library loaded into SSHD (CVE-2024-3094)"} > 0
labels:
area: kaas
area: platform
cancel_if_cluster_status_creating: "false"
cancel_if_cluster_status_deleting: "false"
cancel_if_cluster_status_updating: "false"
@@ -17,7 +17,7 @@ spec:
expr: sum(kube_validatingwebhookconfiguration_info{validatingwebhookconfiguration=~"kyverno-.*"}) by (cluster_id, installation, pipeline, provider) > 0 and sum(kube_deployment_status_replicas{deployment=~"kyverno|kyverno-admission-controller"}) by (cluster_id, installation, pipeline, provider) == 0
for: 15m
labels:
area: managedservices
area: platform
cancel_if_cluster_status_creating: "true"
cancel_if_cluster_status_deleting: "true"
cancel_if_cluster_status_updating: "true"
@@ -34,7 +34,7 @@ spec:
expr: aggregation:kyverno_resource_counts{kind=~"(generate|update)requests.kyverno.io"} > 5000
for: 15m
labels:
area: managedservices
area: platform
cancel_if_cluster_status_creating: "true"
cancel_if_cluster_status_deleting: "true"
cancel_if_cluster_status_updating: "true"
@@ -51,7 +51,7 @@ spec:
expr: sum(kube_deployment_spec_replicas{deployment=~"kyverno|kyverno-kyverno-plugin|kyverno-policy-reporter"}) by (cluster_id, installation, pipeline, provider) == 0
for: 4h
labels:
area: managedservices
area: platform
cancel_if_cluster_status_creating: "true"
cancel_if_cluster_status_deleting: "true"
cancel_if_cluster_status_updating: "true"
@@ -66,7 +66,7 @@ spec:
expr: sum(kube_deployment_spec_replicas{deployment="kyverno"}) by (cluster_id, installation, pipeline, provider) != 0 and sum(kube_deployment_spec_replicas{deployment="kyverno"}) by (cluster_id, installation, pipeline, provider) < 3
for: 1h
labels:
area: managedservices
area: platform
cancel_if_cluster_status_creating: "true"
cancel_if_cluster_status_deleting: "true"
cancel_if_cluster_status_updating: "true"
@@ -23,7 +23,7 @@ spec:
area: kaas
cancel_if_outside_working_hours: "true"
severity: page
team: team: {{ include "providerTeam" . }}
team: {{ include "providerTeam" . }}
topic: security
- alert: ManagementClusterCertificateWillExpireInLessThanOneMonth
annotations:
@@ -1,4 +1,3 @@
{{- if eq (include "isCertExporterInstalled" .) "true" }}
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
@@ -31,7 +30,7 @@ spec:
annotations:
description: '{{`Certificate metrics are missing for cluster {{ $labels.cluster_id }}.`}}'
opsrecipe: absent-metrics
expr: max(up{cluster_id!="", cluster_type="workload_cluster"}) by (cluster_id) unless on (cluster_id) count (cert_exporter_not_after{cluster_type="workload_cluster"}) by (cluster_id) > 0
expr: max(up{cluster_id!="", cluster_type="workload_cluster"}) by (cluster_id, installation, pipeline, provider) unless on (cluster_id) count (cert_exporter_not_after{cluster_type="workload_cluster"}) by (cluster_id, installation, pipeline, provider) > 0
for: 30m
labels:
area: kaas
@@ -42,4 +41,3 @@ spec:
severity: page
team: {{ include "providerTeam" . }}
topic: security
{{- end -}}
12 changes: 8 additions & 4 deletions test/hack/bin/check-opsrecipes.sh
@@ -59,7 +59,8 @@ listOpsRecipes () {

# find all ops-recipes ".md" files, and keep only the opsrecipe name (may contain a path, like "rolling-nodes/rolling-nodes")
find "$privateOpsrecipesParentDirectory"/content/docs/support-and-ops/ops-recipes -type f -name \*.md \
| sed -n 's_'"$privateOpsrecipesParentDirectory"'/content/docs/support-and-ops/ops-recipes/\(.*\).md_\1_p'
| sed -n 's_'"$privateOpsrecipesParentDirectory"'/content/docs/support-and-ops/ops-recipes/\(.*\).md_\1_p' \
| sed 's/\/_index//g' # Removes the _index.md files and keep the directory name
rm -rf "$privateOpsrecipesParentDirectory"

# Add extra opsrecipes
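
The hunk above adds a second `sed` stage so that `_index.md` files are reported under their directory name. A minimal sketch of that two-stage transformation follows. The helper name `recipe_names` and the `/tmp/recipes` paths are hypothetical, chosen only for illustration, and the first stage escapes the dot in `.md`, which the script itself does not.

```shell
#!/usr/bin/env bash

# Hypothetical helper mirroring the two sed stages in listOpsRecipes.
# $1 is the directory prefix to strip. It is assumed to contain no "_",
# because "_" is the sed delimiter in the first stage.
recipe_names() {
  local root="$1"
  sed -n 's_'"$root"'/\(.*\)\.md_\1_p' \
    | sed 's/\/_index//g'   # "dir/_index" becomes "dir"
}

# Illustrative paths only; nothing is read from disk.
printf '%s\n' \
  "/tmp/recipes/rolling-nodes/_index.md" \
  "/tmp/recipes/falco-alert.md" \
  | recipe_names "/tmp/recipes"
```

With these two sample paths the sketch prints `rolling-nodes` and `falco-alert`, one per line.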
@@ -69,7 +70,6 @@ listOpsRecipes () {
echo "deployment-not-satisfied-china"
}


main() {
local -a runInCi=false
for arg in "$@"; do
@@ -86,6 +86,10 @@ main() {
local -a E_unexistingrecipe=()
local returncode=0

local -r GIT_WORKDIR="$(git rev-parse --show-toplevel)"
local -r YQ=test/hack/bin/yq
local -r JQ=test/hack/bin/jq

# Investigation section
########################

@@ -144,10 +148,10 @@ main() {
fi

# parse rules yaml files, and for each rule found output alertname, opsrecipe, and severity, space-separated, on one line.
done < <(yq -o json "$rulesFile" | jq -j '.spec.groups[].rules[] | .alert, " ", .annotations.opsrecipe, " ", .labels.severity, "\n"')
done < <("$GIT_WORKDIR/$YQ" -o json "$rulesFile" | "$GIT_WORKDIR/$JQ" -j '.spec.groups[]?.rules[] | .alert, " ", .annotations.opsrecipe, " ", .labels.severity, "\n"')

checkedRules+=("$rulesFile")
done < <(find $RULES_FILES -type f -print0)
done < <(find "${RULES_FILES[@]}" -type f -print0)


# Output section - let's write down our findings
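
The last hunk replaces `find $RULES_FILES` with `find "${RULES_FILES[@]}"`. Assuming `RULES_FILES` is a Bash array, which the new expansion suggests but this diff does not show, the sketch below illustrates why the difference matters. The file names and the `count_args` helper are hypothetical.

```shell
#!/usr/bin/env bash

# Hypothetical rule locations, used only to illustrate the expansion.
RULES_FILES=("rules/a.yml" "rules/b with space.yml" "rules/c.yml")

# Print how many arguments a command would receive.
count_args() { echo "$#"; }

count_args $RULES_FILES          # bare reference: first element only
count_args "${RULES_FILES[@]}"   # one argument per array element
```

This prints `1` and then `3`: referencing an array without a subscript yields only its first element, while the quoted `[@]` form passes every element as its own argument, including elements that contain spaces.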
9 changes: 9 additions & 0 deletions test/hack/bin/fetch-tools.sh
@@ -6,6 +6,7 @@ ARCHITECT_VERSION="6.8.0"
PROMETHEUS_VERSION="2.41.0"
HELM_VERSION="3.9.0"
YQ_VERSION="4.26.1"
JQ_VERSION="1.7.1"
PINT_VERSION="0.58.1"

GIT_WORKDIR=$(git rev-parse --show-toplevel)
@@ -19,6 +20,8 @@ Linux*)
export ARCHITECT_SOURCE="https://github.com/giantswarm/architect/releases/download/v${ARCHITECT_VERSION}/architect-v${ARCHITECT_VERSION}-linux-amd64.tar.gz"
export YQ_SOURCE="https://github.com/mikefarah/yq/releases/download/v${YQ_VERSION}/yq_linux_amd64.tar.gz"
export YQ_BIN_FILE="yq_linux_amd64"
export JQ_SOURCE="https://github.com/jqlang/jq/releases/download/jq-${JQ_VERSION}/jq-linux-amd64"
export JQ_BIN_FILE="jq"
export PINT_SOURCE="https://github.com/cloudflare/pint/releases/download/v${PINT_VERSION}/pint-${PINT_VERSION}-linux-amd64.tar.gz"
export PINT_BIN_FILE="pint-linux-amd64"
;;
@@ -29,6 +32,8 @@ Darwin*)
export ARCHITECT_SOURCE="https://github.com/giantswarm/architect/releases/download/v${ARCHITECT_VERSION}/architect-v${ARCHITECT_VERSION}-darwin-amd64.tar.gz"
export YQ_SOURCE="https://github.com/mikefarah/yq/releases/download/v${YQ_VERSION}/yq_darwin_amd64.tar.gz"
export YQ_BIN_FILE="yq_darwin_amd64"
export JQ_SOURCE="https://github.com/jqlang/jq/releases/download/jq-${JQ_VERSION}/jq-macos-amd64"
export JQ_BIN_FILE="jq"
export PINT_SOURCE="https://github.com/cloudflare/pint/releases/download/v${PINT_VERSION}/pint-${PINT_VERSION}-darwin-amd64.tar.gz"
export PINT_BIN_FILE="pint-darwin-amd64"
TAR_CMD="gtar"
@@ -107,6 +112,10 @@ main() {
"${GIT_WORKDIR}/test/hack/bin/yq-${YQ_VERSION}.tar.gz" \
"$YQ_SOURCE" \
"*/yq_*"
download \
"${JQ_SOURCE}" \
"${GIT_WORKDIR}/test/hack/bin/${JQ_BIN_FILE}"
chmod +x "${GIT_WORKDIR}/test/hack/bin/${JQ_BIN_FILE}"
if [[ ! -f "${GIT_WORKDIR}/test/hack/bin/yq" ]]; then
ln -s "${GIT_WORKDIR}/test/hack/bin/${YQ_BIN_FILE}" "${GIT_WORKDIR}/test/hack/bin/yq"
fi
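
The `fetch-tools.sh` hunks above select a jq release asset per operating system and download it straight to the bin directory, followed by `chmod +x`. A minimal sketch of the URL selection alone, using the URL pattern from the diff, is shown below. The function name `jq_source_url` is hypothetical, and the script itself sets `JQ_SOURCE` inside its existing `case` statement rather than through a helper.

```shell
#!/usr/bin/env bash

# Hypothetical helper: map a `uname -s` value and a jq version to the
# release asset URL, following the pattern used in fetch-tools.sh.
jq_source_url() {
  local os="$1" version="$2" asset
  case "$os" in
    Linux*)  asset="jq-linux-amd64" ;;
    Darwin*) asset="jq-macos-amd64" ;;
    *)       echo "unsupported OS: $os" >&2; return 1 ;;
  esac
  echo "https://github.com/jqlang/jq/releases/download/jq-${version}/${asset}"
}

# No network access happens here; the URLs are only printed.
jq_source_url "Linux" "1.7.1"
jq_source_url "Darwin" "1.7.1"
```

Keeping the selection in one function makes it easy to check the generated URLs without downloading anything.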
@@ -29,7 +29,7 @@ tests:
eval_time: 15m
exp_alerts:
- exp_labels:
area: managedservices
area: platform
cluster_id: gremlin
installation: gremlin
pipeline: testing
@@ -49,7 +49,7 @@ tests:
eval_time: 45m
exp_alerts:
- exp_labels:
area: managedservices
area: platform
cluster_id: gremlin
installation: gremlin
pipeline: testing
@@ -70,7 +70,7 @@ tests:
eval_time: 240m
exp_alerts:
- exp_labels:
area: managedservices
area: platform
cluster_id: gremlin
installation: gremlin
pipeline: testing
@@ -90,7 +90,7 @@ tests:
eval_time: 310m
exp_alerts:
- exp_labels:
area: managedservices
area: platform
cluster_id: gremlin
installation: gremlin
pipeline: testing