-
Notifications
You must be signed in to change notification settings - Fork 264
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Suppress CDIDefaultStorageClassDegraded on SNO #3310
Suppress CDIDefaultStorageClassDegraded on SNO #3310
Conversation
On single-node OpenShift, even if none of the default/virt default storage classes supports `ReadWriteMany` (but supports smart clone), we will not fire the `CDIDefaultStorageClassDegraded` alert. We added `degraded` label to `kubevirt_cdi_storageprofile_info` to simplify the alert expression. Signed-off-by: Arnon Gilboa <agilboa@redhat.com>
} else { | ||
isSNO = clusterInfra.Status.ControlPlaneTopology == ocpconfigv1.SingleReplicaTopologyMode && | ||
clusterInfra.Status.InfrastructureTopology == ocpconfigv1.SingleReplicaTopologyMode | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So with LVMS this check was quite tricky to rely on, since you could be doing "SNO" with multiple workers but your images (and thus VMs) would end up on same node, so for our purposes it wasn't "SNO".
Do we have the same concern here? I want to make sure we don't take away the alert from cases where it's actually helpful
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
cc @awels wdyt
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Discussing it with @aglitke , we currently decide to suppress only in the case of a single node, so afaiu in the mentioned LVMS case we will still alert if no RWX.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The problem is (I think) that this API does not actually guarantee 1 single node.
So someone doing 3-node SNO setup, trying to properly set up live migration, will not receive the alert
We may or may not be okay with that - but something to consider
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The doc says that's the way to detect a single-node, and afaik the other alternative for detecting SNO requires adding cluster rbac for node “list” and “watch”, which we currently prefer not to add.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
btw why is it so bad to add cluster rbac for node “list” and “watch”? Sure CDI is a completely different animal, but HCO has this rbac.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When adding RBAC we always have to consider why are we adding RBAC. And honestly some alert doesn't seem like the right thing to add RBAC for.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this was @mhenriks point, that it is not easy to get right to figure out if we are truly in a single node environment or not. There are all kinds of edge cases where we won't alert with this and probably should. I do still think it is an improvement over always alerting in the wrong case.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That is another reason why adding the nodes is not a good idea. You could have multiple nodes, and some of them are not marked as worker nodes, and you now have only one node that can run workloads. Getting that stuff right is very tricky. I think this is 'good enough'. People with those edge cases should know what they are doing.
Entry("Unknown provisioner", "unknown-provisioner", false, 1), | ||
Entry("Unknown provisioner", "unknown-provisioner", true, 0), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same entry name will be confusing when one of these fails
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, these entries are not needed and can be merged. Fixed.
/approve |
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: akalenyu The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
Signed-off-by: Arnon Gilboa <agilboa@redhat.com>
/unhold |
/lgtm |
@arnongilboa I think this is a real bug, the "Error" condition reason will not appear anymore after #3194 |
/retest |
Shouldn't it fail on all PRs? it has nothing to do with this one and should be fixed separately. |
Maybe there's something wrong with the test? It's not related to this PR and should have started failing since that change |
/cherrypick release-v1.59 |
@arnongilboa: new pull request created: #3327 In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
/cherrypick release-v1.58 |
@arnongilboa: #3310 failed to apply on top of branch "release-v1.58":
In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
What this PR does / why we need it:
On single-node OpenShift, even if none of the default/virt default storage classes supports
ReadWriteMany
(but supports smart clone), we will not fire theCDIDefaultStorageClassDegraded
alert. We addeddegraded
label tokubevirt_cdi_storageprofile_info
to simplify the alert expression.Which issue(s) this PR fixes:
jira-ticket: https://issues.redhat.com/browse/CNV-40665
Special notes for your reviewer:
Release note: