
Suppress CDIDefaultStorageClassDegraded on SNO #3310

Merged: 2 commits from arnongilboa:suppress_alerts_on_sno into kubevirt:main on Jun 16, 2024

Conversation

@arnongilboa (Collaborator) commented Jun 9, 2024

What this PR does / why we need it:
On single-node OpenShift, even if none of the default/virt default storage classes supports ReadWriteMany (but supports smart clone), we will not fire the CDIDefaultStorageClassDegraded alert. We added a `degraded` label to `kubevirt_cdi_storageprofile_info` to simplify the alert expression.

Which issue(s) this PR fixes:
jira-ticket: https://issues.redhat.com/browse/CNV-40665

Special notes for your reviewer:

Release note:

Suppress CDIDefaultStorageClassDegraded alert on SNO

On single-node OpenShift, even if none of the default/virt default
storage classes supports `ReadWriteMany` (but supports smart clone),
we will not fire the `CDIDefaultStorageClassDegraded` alert.
We added a `degraded` label to `kubevirt_cdi_storageprofile_info` to
simplify the alert expression.

Signed-off-by: Arnon Gilboa <agilboa@redhat.com>
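
To make "simplify the alert expression" concrete, here is a minimal sketch assuming prometheus-operator rule types; the `default`/`degraded` label selectors are assumptions based on the description above, not the verified CDI metric schema:

package alerts

import (
	promv1 "github.com/prometheus-operator/prometheus-operator/pkg/apis/monitoring/v1"
	"k8s.io/apimachinery/pkg/util/intstr"
)

// Sketch only: once the storage profile info metric carries a "degraded"
// label, the alert can select degraded default profiles directly instead of
// combining separate access-mode and clone-strategy conditions.
// The label names here are assumptions, not the actual CDI source.
func cdiDefaultStorageClassDegradedRule() promv1.Rule {
	return promv1.Rule{
		Alert: "CDIDefaultStorageClassDegraded",
		Expr: intstr.FromString(
			`sum(kubevirt_cdi_storageprofile_info{default="true",degraded="true"}) > 0`,
		),
	}
}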
kubevirt-bot added the release-note and dco-signoff: yes labels (Jun 9, 2024)
@coveralls

Coverage Status

coverage: 59.075% (+0.05%) from 59.023%
when pulling 3d99cfb on arnongilboa:suppress_alerts_on_sno
into f3d0060 on kubevirt:main.

Comment on lines +307 to +310
} else {
isSNO = clusterInfra.Status.ControlPlaneTopology == ocpconfigv1.SingleReplicaTopologyMode &&
clusterInfra.Status.InfrastructureTopology == ocpconfigv1.SingleReplicaTopologyMode
}
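
For readers outside the diff, a self-contained sketch of the topology check quoted above, assuming the OpenShift config API and a controller-runtime client (not the exact CDI source):

package util

import (
	"context"

	ocpconfigv1 "github.com/openshift/api/config/v1"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// isSNO reports whether the cluster declares single-replica topology for
// both the control plane and the infrastructure, i.e. single-node OpenShift.
// This needs only a get on the Infrastructure CR, no node list/watch RBAC.
func isSNO(ctx context.Context, c client.Client) (bool, error) {
	clusterInfra := &ocpconfigv1.Infrastructure{}
	// The cluster-scoped Infrastructure resource is always named "cluster".
	if err := c.Get(ctx, client.ObjectKey{Name: "cluster"}, clusterInfra); err != nil {
		return false, err
	}
	return clusterInfra.Status.ControlPlaneTopology == ocpconfigv1.SingleReplicaTopologyMode &&
		clusterInfra.Status.InfrastructureTopology == ocpconfigv1.SingleReplicaTopologyMode, nil
}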
Collaborator

So with LVMS this check was quite tricky to rely on, since you could be doing "SNO" with multiple workers, but your images (and thus VMs) would end up on the same node, so for our purposes it wasn't "SNO".

Do we have the same concern here? I want to make sure we don't take away the alert from cases where it's actually helpful.

Collaborator

cc @awels wdyt

@arnongilboa (Collaborator, Author)

Discussing it with @aglitke, we decided for now to suppress only in the single-node case, so AFAIU in the mentioned LVMS case we will still alert if there is no RWX.

Collaborator

The problem, I think, is that this API does not actually guarantee a single node.
So someone doing a 3-node "SNO" setup, trying to properly set up live migration, will not receive the alert.
We may or may not be okay with that, but it is something to consider.

@arnongilboa (Collaborator, Author)

The doc says that's the way to detect a single node, and AFAIK the alternative for detecting SNO requires adding cluster RBAC for node "list" and "watch", which we currently prefer not to add.
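
For context, the rejected alternative would look roughly like this sketch, assuming client-go (a hypothetical helper, not CDI code); the Nodes().List call is what requires the cluster-scoped "list" permission being discussed:

package util

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// isSingleNodeByCount is a hypothetical helper: it counts cluster nodes
// directly, which requires cluster RBAC for "list" on nodes (and "watch"
// if backed by an informer cache), the permission CDI prefers not to add.
func isSingleNodeByCount(ctx context.Context, client kubernetes.Interface) (bool, error) {
	nodes, err := client.CoreV1().Nodes().List(ctx, metav1.ListOptions{})
	if err != nil {
		return false, err
	}
	return len(nodes.Items) == 1, nil
}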

@arnongilboa (Collaborator, Author)

BTW, why is it so bad to add cluster RBAC for node "list" and "watch"? Sure, CDI is a completely different animal, but HCO has this RBAC.

@awels (Member)

When adding RBAC we always have to consider why we are adding it, and honestly an alert doesn't seem like the right thing to add RBAC for.

@arnongilboa (Collaborator, Author)

@awels understood. And what about @akalenyu's argument here?

Member

I think this was @mhenriks' point: it is not easy to figure out whether we are truly in a single-node environment. There are all kinds of edge cases where we won't alert with this and probably should. I do still think it is an improvement over always alerting in the wrong case.

Member

That is another reason why adding the node RBAC is not a good idea. You could have multiple nodes where some are not marked as worker nodes, leaving only one node that can actually run workloads. Getting that right is very tricky. I think this is 'good enough'; people with those edge cases should know what they are doing.

kubevirt-bot added the lgtm label (Jun 13, 2024)
Comment on lines 495 to 496
Entry("Unknown provisioner", "unknown-provisioner", false, 1),
Entry("Unknown provisioner", "unknown-provisioner", true, 0),
Collaborator

The same entry name will be confusing when one of these fails.

@arnongilboa (Collaborator, Author)

Sure, these entries are not needed and can be merged. Fixed.
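
For illustration, a hypothetical sketch of the naming alternative (the actual fix merged the redundant entries instead); distinct descriptions make it obvious which case failed:

DescribeTable("storage profile metric for an unknown provisioner",
	func(provisioner string, isSNO bool, expectedDegraded int) {
		// ... test body elided ...
	},
	// Hypothetical entry names; parameters mirror the quoted entries above.
	Entry("fires degraded on a multi-node cluster", "unknown-provisioner", false, 1),
	Entry("is suppressed on SNO", "unknown-provisioner", true, 0),
)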

@akalenyu (Collaborator)

/approve
/hold
Unhold once you're happy with #3310 (comment).

kubevirt-bot added the do-not-merge/hold label (Jun 16, 2024)
@kubevirt-bot (Contributor)

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: akalenyu

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

kubevirt-bot added the approved label (Jun 16, 2024)
Signed-off-by: Arnon Gilboa <agilboa@redhat.com>
kubevirt-bot removed the lgtm label (Jun 16, 2024)
@arnongilboa (Collaborator, Author)

/unhold

kubevirt-bot removed the do-not-merge/hold label (Jun 16, 2024)
@coveralls

Coverage Status

coverage: 59.056% (+0.03%) from 59.023%
when pulling 45a28d6 on arnongilboa:suppress_alerts_on_sno
into f3d0060 on kubevirt:main.

@akalenyu (Collaborator)

/lgtm

kubevirt-bot added the lgtm label (Jun 16, 2024)
@akalenyu (Collaborator)

> @arnongilboa: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:
>
> Test name: pull-containerized-data-importer-e2e-nfs | Commit: 45a28d6 | Details: link | Required: true | Rerun command: /test pull-containerized-data-importer-e2e-nfs

@arnongilboa I think this is a real bug; the "Error" condition reason will not appear anymore after #3194

@arnongilboa (Collaborator, Author)

/retest

@arnongilboa (Collaborator, Author)

> @arnongilboa: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:
>
> Test name: pull-containerized-data-importer-e2e-nfs | Commit: 45a28d6 | Details: link | Required: true | Rerun command: /test pull-containerized-data-importer-e2e-nfs
>
> @arnongilboa I think this is a real bug; the "Error" condition reason will not appear anymore after #3194

Shouldn't it fail on all PRs? It has nothing to do with this one and should be fixed separately.

@akalenyu (Collaborator)

> @arnongilboa: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:
>
> Test name: pull-containerized-data-importer-e2e-nfs | Commit: 45a28d6 | Details: link | Required: true | Rerun command: /test pull-containerized-data-importer-e2e-nfs
>
> @arnongilboa I think this is a real bug; the "Error" condition reason will not appear anymore after #3194
>
> Shouldn't it fail on all PRs? It has nothing to do with this one and should be fixed separately.

Maybe there's something wrong with the test? It's not related to this PR and should have started failing since that change.

kubevirt-bot merged commit 11d91e0 into kubevirt:main (Jun 16, 2024)
19 checks passed
@arnongilboa (Collaborator, Author)

/cherrypick release-v1.59

@kubevirt-bot (Contributor)

@arnongilboa: new pull request created: #3327

In response to this:

> /cherrypick release-v1.59

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@arnongilboa (Collaborator, Author)

/cherrypick release-v1.58

@kubevirt-bot (Contributor)

@arnongilboa: #3310 failed to apply on top of branch "release-v1.58":

Applying: Suppress CDIDefaultStorageClassDegraded on SNO
Using index info to reconstruct a base tree...
M	doc/metrics.md
M	pkg/controller/storageprofile-controller.go
M	pkg/controller/storageprofile-controller_test.go
A	pkg/monitoring/metrics/cdi-controller/storageprofile.go
A	pkg/monitoring/rules/alerts/operator.go
M	pkg/operator/resources/cluster/controller.go
Falling back to patching base and 3-way merge...
Auto-merging pkg/operator/resources/cluster/controller.go
CONFLICT (modify/delete): pkg/monitoring/rules/alerts/operator.go deleted in HEAD and modified in Suppress CDIDefaultStorageClassDegraded on SNO. Version Suppress CDIDefaultStorageClassDegraded on SNO of pkg/monitoring/rules/alerts/operator.go left in tree.
CONFLICT (modify/delete): pkg/monitoring/metrics/cdi-controller/storageprofile.go deleted in HEAD and modified in Suppress CDIDefaultStorageClassDegraded on SNO. Version Suppress CDIDefaultStorageClassDegraded on SNO of pkg/monitoring/metrics/cdi-controller/storageprofile.go left in tree.
Auto-merging pkg/controller/storageprofile-controller_test.go
CONFLICT (content): Merge conflict in pkg/controller/storageprofile-controller_test.go
Auto-merging pkg/controller/storageprofile-controller.go
CONFLICT (content): Merge conflict in pkg/controller/storageprofile-controller.go
Auto-merging doc/metrics.md
CONFLICT (content): Merge conflict in doc/metrics.md
error: Failed to merge in the changes.
hint: Use 'git am --show-current-patch=diff' to see the failed patch
Patch failed at 0001 Suppress CDIDefaultStorageClassDegraded on SNO
When you have resolved this problem, run "git am --continue".
If you prefer to skip this patch, run "git am --skip" instead.
To restore the original branch and stop patching, run "git am --abort".

In response to this:

> /cherrypick release-v1.58

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
