MGDAPI-5085 - Add alerting for MCG on GCP #3075

Merged

@adam-cattermole (Member) commented Jan 23, 2023

Issue link

MGDAPI-5085

What

Add alerts for various MCG resources - requires rebase once #3052 is merged.

Verification steps

Provision a GCP CCS cluster. If you do not have an access key, request one from me. Place it in the root of the delorean repo and run:

OCM_CLUSTER_NAME=acatterm-ccs OCM_CLUSTER_LIFESPAN=24 COMPUTE_NODES_COUNT=4 BYOC=true CLOUD_PROVIDER=gcp make -f make/ocm.mk ocm/cluster.json
make -f make/ocm.mk ocm/cluster/create

Once provisioned we can deploy RHOAM from this branch:

LOCAL=false make cluster/prepare/local
LOCAL=false USE_CLUSTER_STORAGE=true make code/run

Navigate to alerting in the redhat-rhoam-observability namespace:
Networking -> Routes -> Prometheus (location) -> Alerts

Verify that the alerts are listed and are not firing.

  • RHOAMMCGOperatorMetricsServiceEndpointDown
  • RHOAMMCGOperatorRhmiRegistryCsServiceEndpointDown
  • NooBaaCorePod
  • NooBaaDBPod
  • NooBaaDefaultBackingStorePod
  • NooBaaEndpointPod
  • NooBaaS3Endpoint
  • NooBaaBucketCapacityOver85Percent
  • NooBaaBucketCapacityOver95Percent
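
Checking the list above by eye is easy to get wrong, so the comparison could be scripted. The following is a minimal sketch, not part of this PR: `missing_alerts` is a hypothetical helper that reads alert names on stdin (one per line) and prints any expected alert that is absent. How the names are collected is left open; they could be copied from the Alerts page or extracted from the Prometheus `/api/v1/rules` response.

```shell
#!/usr/bin/env bash
# Sketch only: missing_alerts is a hypothetical helper, not part of the PR.
# Reads alert names (one per line) on stdin, prints each expected alert
# that is not present, and returns non-zero if any are missing.
missing_alerts() {
  local listed
  local name
  local missing=0
  listed="$(cat)"
  for name in \
    RHOAMMCGOperatorMetricsServiceEndpointDown \
    RHOAMMCGOperatorRhmiRegistryCsServiceEndpointDown \
    NooBaaCorePod \
    NooBaaDBPod \
    NooBaaDefaultBackingStorePod \
    NooBaaEndpointPod \
    NooBaaS3Endpoint \
    NooBaaBucketCapacityOver85Percent \
    NooBaaBucketCapacityOver95Percent
  do
    # -F: literal match, -x: whole line, -q: quiet
    if ! printf '%s\n' "$listed" | grep -Fxq -- "$name"; then
      echo "$name"
      missing=1
    fi
  done
  return "$missing"
}
```

For example, `missing_alerts < alert-names.txt` prints nothing and returns 0 when all nine alerts are listed (`alert-names.txt` is a placeholder file name).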

Scale down:

  • deployments/noobaa-operator
  • deployments/noobaa-endpoint
  • statefulsets/noobaa-core
  • statefulsets/noobaa-db-pg

Delete:

  • pods/noobaa-default-backing-store-noobaa-pod-*
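
The scale-down and delete steps above could be scripted roughly as follows. This is a sketch under stated assumptions, not part of this PR: the `MCG_NAMESPACE` variable and both function names are hypothetical, and the namespace the NooBaa workloads run in is not stated here, so it has to be supplied by whoever runs it.

```shell
#!/usr/bin/env bash
# Sketch only: MCG_NAMESPACE and both function names are hypothetical.
# Set MCG_NAMESPACE to the namespace where the NooBaa workloads run.

# Scale the four NooBaa workloads named in the steps above to $1 replicas.
scale_noobaa() {
  local replicas="$1"
  local ns="${MCG_NAMESPACE:?set MCG_NAMESPACE first}"
  local target
  for target in \
    deployments/noobaa-operator \
    deployments/noobaa-endpoint \
    statefulsets/noobaa-core \
    statefulsets/noobaa-db-pg
  do
    oc scale "$target" --replicas="$replicas" -n "$ns"
  done
}

# Delete every pod whose name starts with the backing-store prefix.
# `oc delete` does not expand the `*` wildcard, so list and filter first.
delete_backing_store_pods() {
  local ns="${MCG_NAMESPACE:?set MCG_NAMESPACE first}"
  local pod
  oc get pods -n "$ns" -o name |
    grep '^pod/noobaa-default-backing-store-noobaa-pod-' |
    while read -r pod; do
      oc delete "$pod" -n "$ns"
    done
}
```

Running `scale_noobaa 0` followed by `delete_backing_store_pods` covers both steps. The same helper can scale the workloads back up later, for example `scale_noobaa 1`, although a replica count of 1 is an assumption; use whatever counts the workloads had before.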

After a few minutes, the following alerts should be firing:

  • RHOAMMCGOperatorMetricsServiceEndpointDown
  • NooBaaCorePod
  • NooBaaDBPod
  • NooBaaDefaultBackingStorePod
  • NooBaaEndpointPod
  • NooBaaS3Endpoint

Scaling all of the deployments and statefulsets back up again should cause the alerts to stop firing.

@codecov

codecov bot commented Jan 24, 2023

Codecov Report

Merging #3075 (900793b) into mgdapi-3425-gcp (654e6ff) will increase coverage by 0.00%.
The diff coverage is 100.00%.

@@               Coverage Diff                @@
##           mgdapi-3425-gcp    #3075   +/-   ##
================================================
  Coverage            72.56%   72.57%           
================================================
  Files                  104      104           
  Lines                29241    29247    +6     
================================================
+ Hits                 21219    21225    +6     
  Misses                7281     7281           
  Partials               741      741           

Impacted Files                          Coverage Δ
pkg/products/mcg/reconciler.go          72.95% <100.00%> (+0.68%) ⬆️
pkg/products/threescale/reconciler.go   61.66% <100.00%> (ø)

@adam-cattermole adam-cattermole force-pushed the MGDAPI-5085 branch 3 times, most recently from 01d3a57 to ea94686 Compare January 25, 2023 13:34
@adam-cattermole adam-cattermole changed the title [WIP] MGDAPI-5085 - Add alerting for MCG on GCP MGDAPI-5085 - Add alerting for MCG on GCP Jan 26, 2023
@adam-cattermole
Member Author

e2e flaky - clusters failed to provision
/test rhoam-e2e multitenant-rhoam-e2e

Contributor

@cecobask cecobask left a comment

I followed the verification instructions and the alerts worked as expected.

  • NooBaa and MCG alerts were created and did not fire on fresh RHOAM install
  • After scaling down the required deployments and statefulsets and deleting a pod, the alerts below started firing
  • Finally, scaling up the deployments and statefulsets resolved the previously firing alerts

Screenshot 2023-01-27 at 14 52 28

pkg/products/mcg/prometheusRules.go (outdated, resolved)
pkg/products/mcg/prometheusRules.go (outdated, resolved)
@adam-cattermole
Member Author

/test rhoam-e2e

@adam-cattermole
Member Author

e2e failed to provision cluster...
/test rhoam-e2e

@cecobask
Contributor

Thank you for addressing the changes I suggested, @adam-cattermole!
We could create a JIRA about completely getting rid of kube_endpoint_address_available and kube_endpoint_address_not_ready metrics in favour of kube_endpoint_address.

I'm happy to approve the pull request now 👍🏻
/lgtm

@KevFan
Contributor

KevFan commented Jan 31, 2023

/approve

@openshift-ci
Contributor

openshift-ci bot commented Jan 31, 2023

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: KevFan

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@adam-cattermole
Member Author

/test multitenant-rhoam-e2e

@openshift-merge-robot openshift-merge-robot merged commit 327a725 into integr8ly:mgdapi-3425-gcp Jan 31, 2023