Add alerts to notify vertical or horizontal scaling #2866

aruniiird · 2024-10-22T08:35:55Z

High CPU usage alerts are now split into TWO categories:
First category: CPU usage is high because of a high MDS request rate. At this point we need to scale horizontally by adding more mds pods.
Second category: only CPU usage is high. At this point we need to scale vertically by adding more resources (CPU, memory) to the pods.

aruniiird · 2024-10-22T08:38:48Z

Converting this PR to draft, as we have to add/update the runbook links in the https://github.com/openshift/runbooks repo.

aruniiird · 2024-10-22T13:50:51Z

Created a PR: openshift/runbooks#217, to add the new files to the runbooks repo

umangachapagain · 2024-11-14T05:47:27Z

metrics/deploy/prometheus-ocs-rules.yaml

+        runbook_url: https://github.com/openshift/runbooks/blob/master/alerts/openshift-container-storage-operator/CephMdsCpuUsageHighNeedsHorizontalScaling.md
+        severity_level: warning
+      expr: |
+        (label_replace(pod:container_cpu_usage:sum{pod=~"rook-ceph-mds.*"}/ on(pod, namespace) kube_pod_resource_limit{resource='cpu',pod=~"rook-ceph-mds.*"}, "ceph_daemon", "mds.$1", "pod", "rook-ceph-mds-(.*)-(.*)") + on (ceph_daemon, namespace) group_left(managedBy) (0 * (ceph_mds_metadata ==1)) > 0.67) and on (ceph_daemon, namespace, managedBy) (rate(ceph_mds_request[1h]) > 1000)


Can you add a bit of explanation about the expression?

Why are we comparing with 0.67? 1000?

Why does >1000 mean HorizontalScaling and <1000 mean VerticalScaling for the same expression?

A high request rate leading to high CPU usage can be helped by offloading the work to multiple MDS daemons, while a low request rate with high CPU usage might be due to a lack of resources. That's what I understand from the alert.

@umangachapagain, @weirdwiz, we have made a change: the MDSCPUUsage alert expression stays the same (except for a minor cosmetic change), but the description and runbook_url link now change according to the mds request load.

A high request rate leading to high CPU usage can be helped by offloading the work to multiple MDS daemons, while a low request rate with high CPU usage might be due to a lack of resources. That's what I understand from the alert.

@weirdwiz, yes, you are absolutely right.

@umangachapagain ,

Why are we comparing with 0.67? 1000?

I'm not really sure about the 0.67, as it was already there in the existing MDSCPUUsageHigh alert, which we never changed. The logical conclusion I draw is that using more than roughly 70% of the CPU for the past 6 hours is considered a sign of high CPU usage.

Why does >1000 mean HorizontalScaling and <1000 mean VerticalScaling for the same expression?

Keeping 67% as our CPU threshold, we arrived at the number 1000 during testing: when approximately 1000 or more requests per second were hitting the MDS, we saw CPU usage rise gradually and reach the CPU threshold within a 1-hour window.
That means if the rate of mds requests stays at about 1000 reqs/sec for an hour, CPU usage crosses the 67% threshold.

PS: please see the new changes. We are not modifying the expression much; instead, the description and runbook_url text change according to a rate query over the past 6 hours: rate(ceph_mds_request[6h]).
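A minimal Python sketch of the decision discussed in this thread may make the two thresholds easier to follow. It is only an illustration: the function name and the returned strings are hypothetical, the real rule expresses this through the alert annotations rather than Python, and treating a rate of exactly 1000 as the vertical case is an assumption.

```python
from typing import Optional

# Thresholds quoted in this thread; everything else here is illustrative.
CPU_USAGE_THRESHOLD = 0.67            # CPU usage divided by CPU limit
MDS_REQUEST_RATE_THRESHOLD = 1000.0   # MDS requests per second


def scaling_advice(cpu_usage_ratio: float, mds_request_rate: float) -> Optional[str]:
    """Return the suggested scaling direction, or None if CPU usage is not high.

    cpu_usage_ratio: pod CPU usage as a fraction of its CPU limit.
    mds_request_rate: average MDS requests per second over the rate window.
    """
    if cpu_usage_ratio <= CPU_USAGE_THRESHOLD:
        # The CPU condition of the alert is not met, so nothing fires.
        return None
    if mds_request_rate > MDS_REQUEST_RATE_THRESHOLD:
        # High CPU together with a high request rate: add more mds pods.
        return "horizontal"
    # High CPU without a high request rate: give the pods more resources.
    return "vertical"
```

For example, `scaling_advice(0.8, 1500.0)` suggests horizontal scaling, while `scaling_advice(0.8, 200.0)` suggests vertical scaling.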

umangachapagain · 2024-11-14T05:49:17Z

metrics/deploy/prometheus-ocs-rules.yaml

      annotations:
        description: |-
          Ceph metadata server pod ({{ $labels.pod }}) has high cpu usage.
          Please consider increasing the CPU request for the {{ $labels.pod }} pod as described in the runbook.
+          This may help to process more requests and thus evict more items from cache.


We should either remove this statement or word it with more assurance. "This may help" is not a good response to an alert IMO.

I think incidental effects should be consolidated in the runbooks.

weirdwiz · 2024-11-14T08:31:16Z

metrics/deploy/prometheus-ocs-rules.yaml

+        runbook_url: https://github.com/openshift/runbooks/blob/master/alerts/openshift-container-storage-operator/CephMdsCpuUsageHighNeedsHorizontalScaling.md
+        severity_level: warning
+      expr: |
+        (label_replace(pod:container_cpu_usage:sum{pod=~"rook-ceph-mds.*"}/ on(pod, namespace) kube_pod_resource_limit{resource='cpu',pod=~"rook-ceph-mds.*"}, "ceph_daemon", "mds.$1", "pod", "rook-ceph-mds-(.*)-(.*)") + on (ceph_daemon, namespace) group_left(managedBy) (0 * (ceph_mds_metadata ==1)) > 0.67) and on (ceph_daemon, namespace, managedBy) (rate(ceph_mds_request[1h]) > 1000)


We should think about the delta used for calculating the rate: are we accounting for short bursts of high requests followed by silence, or are we looking at the scenario where the request rate is consistently high?

If it's the latter, the delta will work appropriately.

Here we are looking for a consistently high request rate.
The delta is now raised to 6 hours (the alert's waiting period). We have moved the mds-request rate query into the annotations, so that when the alert fires (that is, after 6 hours) the rate selects the appropriate description and runbook_url link. As you mentioned (the higher the delta, the lower the jitter/error rate), a 6h delta should avoid unnecessary variance.

That makes sense, so we're considering the full time span only at the moment the link is displayed to the user, rather than flip-flopping. Pretty neat.
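To illustrate why a longer window damps short bursts, here is a rough Python sketch. It computes a plain average per-second increase of a counter over a trailing window; this is a simplification and not Prometheus's `rate()`, which also handles counter resets and extrapolation. The helper names and the synthetic data (a 200 req/s baseline with a 10-minute burst at 5000 req/s) are hypothetical.

```python
from typing import List, Tuple

Sample = Tuple[float, float]  # (timestamp in seconds, counter value)


def average_rate(samples: List[Sample], window_seconds: float) -> float:
    """Average per-second counter increase over the trailing window."""
    if not samples:
        raise ValueError("no samples given")
    end_ts = samples[-1][0]
    in_window = [s for s in samples if s[0] >= end_ts - window_seconds]
    if len(in_window) < 2:
        raise ValueError("need at least two samples inside the window")
    (t0, v0), (t1, v1) = in_window[0], in_window[-1]
    return (v1 - v0) / (t1 - t0)


def synthetic_counter(total_seconds: int = 6 * 3600, step: int = 60,
                      baseline: float = 200.0, burst: float = 5000.0,
                      burst_seconds: int = 600) -> List[Sample]:
    """Counter scraped every `step` seconds, with a burst at the very end."""
    samples: List[Sample] = [(0.0, 0.0)]
    value = 0.0
    for t in range(step, total_seconds + step, step):
        # Intervals ending inside the final `burst_seconds` run at the burst rate.
        per_second = burst if t > total_seconds - burst_seconds else baseline
        value += per_second * step
        samples.append((float(t), value))
    return samples


samples = synthetic_counter()
short_window_rate = average_rate(samples, 600)       # last 10 minutes
long_window_rate = average_rate(samples, 6 * 3600)   # full 6 hours
```

With this synthetic data the 10-minute window sees 5000 req/s, while the 6-hour window averages about 333 req/s, so a short burst alone would not push the 6h rate above 1000.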

Now CPU usage high alerts are categorized into TWO different scenarios.
First scenario: we have high CPU usage due to a high rate of incoming mds requests. Solution: at this point we need to scale horizontally.
Second scenario: we have only high CPU usage. Solution: at this point we need to add more resources to the existing mds pods, thus scaling vertically.

Signed-off-by: Arun Kumar Mohan <amohan@redhat.com>

aruniiird · 2024-11-14T14:00:53Z

Screenshots of sample alerts showing how the description and runbook_url link are displayed:

Vertical scaling example

Horizontal scaling example

weirdwiz

LGTM

openshift-ci · 2024-11-15T06:32:19Z

@weirdwiz: changing LGTM is restricted to collaborators

In response to this:

LGTM

openshift-ci · 2024-11-15T11:31:31Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: aruniiird, umangachapagain, weirdwiz

Needs approval from an approver in each of these files:

~~OWNERS~~ [umangachapagain]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

aruniiird marked this pull request as draft October 22, 2024 08:36

openshift-ci bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Oct 22, 2024

aruniiird force-pushed the add-new-CPU-usage-alerts-for-vertical-and-horizontal-scaling branch from 75b3bc3 to 4afc8a5 Compare November 11, 2024 19:26

aruniiird marked this pull request as ready for review November 13, 2024 08:45

openshift-ci bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Nov 13, 2024

agarwal-mudit requested review from umangachapagain and iamniting November 13, 2024 13:51

umangachapagain reviewed Nov 14, 2024

weirdwiz reviewed Nov 14, 2024

aruniiird force-pushed the add-new-CPU-usage-alerts-for-vertical-and-horizontal-scaling branch from 4afc8a5 to 418a011 Compare November 14, 2024 09:38

agarwal-mudit requested review from weirdwiz and umangachapagain November 14, 2024 12:45

aruniiird force-pushed the add-new-CPU-usage-alerts-for-vertical-and-horizontal-scaling branch from 418a011 to 377ceb8 Compare November 14, 2024 13:57

weirdwiz approved these changes Nov 15, 2024

umangachapagain approved these changes Nov 15, 2024

openshift-ci bot assigned umangachapagain Nov 15, 2024

openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Nov 15, 2024

openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Nov 15, 2024

openshift-merge-bot bot merged commit 07cbcfa into red-hat-storage:main Nov 15, 2024
11 checks passed
