
Suppress CDIDefaultStorageClassDegraded on SNO #3310

Merged: 2 commits from arnongilboa:suppress_alerts_on_sno into kubevirt:main on Jun 16, 2024

Conversation

@arnongilboa (Collaborator) commented Jun 9, 2024

What this PR does / why we need it:
On single-node OpenShift, even if none of the default/virt default storage classes supports ReadWriteMany (but supports smart clone), we will not fire the CDIDefaultStorageClassDegraded alert. We added a `degraded` label to `kubevirt_cdi_storageprofile_info` to simplify the alert expression.

Which issue(s) this PR fixes:
jira-ticket: https://issues.redhat.com/browse/CNV-40665

Special notes for your reviewer:

Release note:

Suppress CDIDefaultStorageClassDegraded alert on SNO

On single-node OpenShift, even if none of the default/virt default
storage classes supports `ReadWriteMany` (but supports smart clone),
we will not fire the `CDIDefaultStorageClassDegraded` alert.
We added a `degraded` label to `kubevirt_cdi_storageprofile_info` to
simplify the alert expression.

Signed-off-by: Arnon Gilboa <agilboa@redhat.com>
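
To make "simplify the alert expression" concrete, here is a minimal sketch assuming prometheus-operator rule types; the `default`/`degraded` label selectors are assumptions based on the description above, not the verified CDI metric schema:

package alerts

import (
	promv1 "github.com/prometheus-operator/prometheus-operator/pkg/apis/monitoring/v1"
	"k8s.io/apimachinery/pkg/util/intstr"
)

// Sketch only: once the storage profile info metric carries a "degraded"
// label, the alert can select degraded default profiles directly instead of
// combining separate access-mode and clone-strategy conditions.
// The label names here are assumptions, not the actual CDI source.
func cdiDefaultStorageClassDegradedRule() promv1.Rule {
	return promv1.Rule{
		Alert: "CDIDefaultStorageClassDegraded",
		Expr: intstr.FromString(
			`sum(kubevirt_cdi_storageprofile_info{default="true",degraded="true"}) > 0`,
		),
	}
}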
kubevirt-bot added the release-note and dco-signoff: yes labels (Jun 9, 2024)
@coveralls

Coverage Status

coverage: 59.075% (+0.05%) from 59.023%
when pulling 3d99cfb on arnongilboa:suppress_alerts_on_sno
into f3d0060 on kubevirt:main.

Comment on lines +307 to +310
} else {
isSNO = clusterInfra.Status.ControlPlaneTopology == ocpconfigv1.SingleReplicaTopologyMode &&
clusterInfra.Status.InfrastructureTopology == ocpconfigv1.SingleReplicaTopologyMode
}
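
For readers outside the diff, a self-contained sketch of the topology check quoted above, assuming the OpenShift config API and a controller-runtime client (not the exact CDI source):

package util

import (
	"context"

	ocpconfigv1 "github.com/openshift/api/config/v1"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// isSNO reports whether the cluster declares single-replica topology for
// both the control plane and the infrastructure, i.e. single-node OpenShift.
// This needs only a get on the Infrastructure CR, no node list/watch RBAC.
func isSNO(ctx context.Context, c client.Client) (bool, error) {
	clusterInfra := &ocpconfigv1.Infrastructure{}
	// The cluster-scoped Infrastructure resource is always named "cluster".
	if err := c.Get(ctx, client.ObjectKey{Name: "cluster"}, clusterInfra); err != nil {
		return false, err
	}
	return clusterInfra.Status.ControlPlaneTopology == ocpconfigv1.SingleReplicaTopologyMode &&
		clusterInfra.Status.InfrastructureTopology == ocpconfigv1.SingleReplicaTopologyMode, nil
}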
Collaborator

So with LVMS this check was quite tricky to rely on, since you could be doing "SNO" with multiple workers, but your images (and thus VMs) would end up on the same node, so for our purposes it wasn't "SNO".

Do we have the same concern here? I want to make sure we don't take away the alert from cases where it's actually helpful.

Collaborator

cc @awels wdyt

@arnongilboa (Collaborator, Author)

Discussing it with @aglitke, we decided for now to suppress only in the single-node case, so AFAIU in the mentioned LVMS case we will still alert if there is no RWX.

Collaborator

The problem, I think, is that this API does not actually guarantee a single node.
So someone doing a 3-node "SNO" setup, trying to properly set up live migration, will not receive the alert.
We may or may not be okay with that, but it is something to consider.

@arnongilboa (Collaborator, Author)

The doc says that's the way to detect a single node, and AFAIK the alternative for detecting SNO requires adding cluster RBAC for node "list" and "watch", which we currently prefer not to add.
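
For context, the rejected alternative would look roughly like this sketch, assuming client-go (a hypothetical helper, not CDI code); the Nodes().List call is what requires the cluster-scoped "list" permission being discussed:

package util

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// isSingleNodeByCount is a hypothetical helper: it counts cluster nodes
// directly, which requires cluster RBAC for "list" on nodes (and "watch"
// if backed by an informer cache), the permission CDI prefers not to add.
func isSingleNodeByCount(ctx context.Context, client kubernetes.Interface) (bool, error) {
	nodes, err := client.CoreV1().Nodes().List(ctx, metav1.ListOptions{})
	if err != nil {
		return false, err
	}
	return len(nodes.Items) == 1, nil
}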

@arnongilboa (Collaborator, Author)

BTW, why is it so bad to add cluster RBAC for node "list" and "watch"? Sure, CDI is a completely different animal, but HCO has this RBAC.

@awels (Member)

When adding RBAC we always have to consider why we are adding it, and honestly an alert doesn't seem like the right thing to add RBAC for.

@arnongilboa (Collaborator, Author)

@awels understood. And what about @akalenyu's argument here?

Member

I think this was @mhenriks' point: it is not easy to figure out whether we are truly in a single-node environment. There are all kinds of edge cases where we won't alert with this and probably should. I do still think it is an improvement over always alerting in the wrong case.

Member

That is another reason why adding the node RBAC is not a good idea. You could have multiple nodes where some are not marked as worker nodes, leaving only one node that can actually run workloads. Getting that right is very tricky. I think this is 'good enough'; people with those edge cases should know what they are doing.

kubevirt-bot added the lgtm label (Jun 13, 2024)
Comment on lines 495 to 496
Entry("Unknown provisioner", "unknown-provisioner", false, 1),
Entry("Unknown provisioner", "unknown-provisioner", true, 0),
Collaborator

The same entry name will be confusing when one of these fails.

@arnongilboa (Collaborator, Author)

Sure, these entries are not needed and can be merged. Fixed.
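
For illustration, a hypothetical sketch of the naming alternative (the actual fix merged the redundant entries instead); distinct descriptions make it obvious which case failed:

DescribeTable("storage profile metric for an unknown provisioner",
	func(provisioner string, isSNO bool, expectedDegraded int) {
		// ... test body elided ...
	},
	// Hypothetical entry names; parameters mirror the quoted entries above.
	Entry("fires degraded on a multi-node cluster", "unknown-provisioner", false, 1),
	Entry("is suppressed on SNO", "unknown-provisioner", true, 0),
)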

@akalenyu (Collaborator)

/approve
/hold
Unhold once you're happy with #3310 (comment).

kubevirt-bot added the do-not-merge/hold label (Jun 16, 2024)
@kubevirt-bot (Contributor)

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: akalenyu

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

kubevirt-bot added the approved label (Jun 16, 2024)
Signed-off-by: Arnon Gilboa <agilboa@redhat.com>
kubevirt-bot removed the lgtm label (Jun 16, 2024)
@arnongilboa (Collaborator, Author)

/unhold

kubevirt-bot removed the do-not-merge/hold label (Jun 16, 2024)
@coveralls

Coverage Status

coverage: 59.056% (+0.03%) from 59.023%
when pulling 45a28d6 on arnongilboa:suppress_alerts_on_sno
into f3d0060 on kubevirt:main.

@akalenyu (Collaborator)

/lgtm

kubevirt-bot added the lgtm label (Jun 16, 2024)
@akalenyu (Collaborator)

> @arnongilboa: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:
>
> Test name: pull-containerized-data-importer-e2e-nfs | Commit: 45a28d6 | Details: link | Required: true | Rerun command: /test pull-containerized-data-importer-e2e-nfs

@arnongilboa I think this is a real bug; the "Error" condition reason will not appear anymore after #3194

@arnongilboa (Collaborator, Author)

/retest

@arnongilboa (Collaborator, Author)

> @arnongilboa: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:
>
> Test name: pull-containerized-data-importer-e2e-nfs | Commit: 45a28d6 | Details: link | Required: true | Rerun command: /test pull-containerized-data-importer-e2e-nfs
>
> @arnongilboa I think this is a real bug; the "Error" condition reason will not appear anymore after #3194

Shouldn't it fail on all PRs? It has nothing to do with this one and should be fixed separately.

@akalenyu (Collaborator)

> @arnongilboa: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:
>
> Test name: pull-containerized-data-importer-e2e-nfs | Commit: 45a28d6 | Details: link | Required: true | Rerun command: /test pull-containerized-data-importer-e2e-nfs
>
> @arnongilboa I think this is a real bug; the "Error" condition reason will not appear anymore after #3194
>
> Shouldn't it fail on all PRs? It has nothing to do with this one and should be fixed separately.

Maybe there's something wrong with the test? It's not related to this PR and should have started failing since that change.

kubevirt-bot merged commit 11d91e0 into kubevirt:main (Jun 16, 2024)
19 checks passed
@arnongilboa (Collaborator, Author)

/cherrypick release-v1.59

@kubevirt-bot (Contributor)

@arnongilboa: new pull request created: #3327

In response to this:

> /cherrypick release-v1.59

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@arnongilboa (Collaborator, Author)

/cherrypick release-v1.58

@kubevirt-bot (Contributor)

@arnongilboa: #3310 failed to apply on top of branch "release-v1.58":

Applying: Suppress CDIDefaultStorageClassDegraded on SNO
Using index info to reconstruct a base tree...
M	doc/metrics.md
M	pkg/controller/storageprofile-controller.go
M	pkg/controller/storageprofile-controller_test.go
A	pkg/monitoring/metrics/cdi-controller/storageprofile.go
A	pkg/monitoring/rules/alerts/operator.go
M	pkg/operator/resources/cluster/controller.go
Falling back to patching base and 3-way merge...
Auto-merging pkg/operator/resources/cluster/controller.go
CONFLICT (modify/delete): pkg/monitoring/rules/alerts/operator.go deleted in HEAD and modified in Suppress CDIDefaultStorageClassDegraded on SNO. Version Suppress CDIDefaultStorageClassDegraded on SNO of pkg/monitoring/rules/alerts/operator.go left in tree.
CONFLICT (modify/delete): pkg/monitoring/metrics/cdi-controller/storageprofile.go deleted in HEAD and modified in Suppress CDIDefaultStorageClassDegraded on SNO. Version Suppress CDIDefaultStorageClassDegraded on SNO of pkg/monitoring/metrics/cdi-controller/storageprofile.go left in tree.
Auto-merging pkg/controller/storageprofile-controller_test.go
CONFLICT (content): Merge conflict in pkg/controller/storageprofile-controller_test.go
Auto-merging pkg/controller/storageprofile-controller.go
CONFLICT (content): Merge conflict in pkg/controller/storageprofile-controller.go
Auto-merging doc/metrics.md
CONFLICT (content): Merge conflict in doc/metrics.md
error: Failed to merge in the changes.
hint: Use 'git am --show-current-patch=diff' to see the failed patch
Patch failed at 0001 Suppress CDIDefaultStorageClassDegraded on SNO
When you have resolved this problem, run "git am --continue".
If you prefer to skip this patch, run "git am --skip" instead.
To restore the original branch and stop patching, run "git am --abort".

In response to this:

> /cherrypick release-v1.58

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
