Design for data mover node selection #7383
Conversation
Codecov Report: All modified and coverable lines are covered by tests ✅

```
@@            Coverage Diff             @@
##             main    #7383      +/-   ##
==========================================
- Coverage   61.75%   61.62%   -0.13%
==========================================
  Files         262      263       +1
  Lines       28433    28681     +248
==========================================
+ Hits        17558    17675     +117
- Misses       9643     9758     +115
- Partials     1232     1248      +16
```

☔ View full report in Codecov by Sentry.
As mentioned in the [Volume Snapshot Data Movement Design][2], the exposer decides where to launch the VGDP instances. At present, for volume snapshot data movement backups, the exposer creates backupPods, and the VGDP instances are initiated on the nodes where the backupPods are scheduled. The loadAffinity is therefore translated (from `metav1.LabelSelector` to `corev1.Affinity`) and set on the backupPods.

It is possible that node-agent pods, as a daemonset, don't run on every worker node; users can arrange this by specifying `nodeSelector` or `nodeAffinity` in the node-agent daemonset spec. On the other hand, at present, VGDP instances must be assigned to nodes where node-agent pods are running. Therefore, if there is any node selection for node-agent pods, users must take it into account in the loadAffinity configuration, so as to guarantee that VGDP instances are always assigned to nodes where node-agent pods are available. This is left to users; we don't inherit any node selection configuration from the node-agent daemonset, because the daemonset scheduler works differently from the plain pod scheduler, and simply inheriting all the configurations may cause unexpected backupPod scheduling results.
Otherwise, if a backupPod is scheduled to a node where the node-agent pod is absent, the corresponding DataUpload CR will stay in the `Accepted` phase until the prepare timeout (30 minutes by default).
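For illustration, a minimal sketch of what that translation could look like; the helper name `toAffinity` and the exact mapping are assumptions for this example, not the actual exposer code:

```go
package exposer

import (
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// toAffinity is a hypothetical helper showing the general shape of the
// translation: a loadAffinity entry given as a metav1.LabelSelector becomes
// a required node affinity on the backupPod spec.
func toAffinity(selector *metav1.LabelSelector) *corev1.Affinity {
	if selector == nil {
		return nil
	}

	term := corev1.NodeSelectorTerm{}

	// Plain matchLabels entries become In-operator match expressions.
	for key, value := range selector.MatchLabels {
		term.MatchExpressions = append(term.MatchExpressions, corev1.NodeSelectorRequirement{
			Key:      key,
			Operator: corev1.NodeSelectorOpIn,
			Values:   []string{value},
		})
	}

	// matchExpressions carry over; the operator value sets overlap for
	// In/NotIn/Exists/DoesNotExist, so a direct conversion suffices here.
	for _, expr := range selector.MatchExpressions {
		term.MatchExpressions = append(term.MatchExpressions, corev1.NodeSelectorRequirement{
			Key:      expr.Key,
			Operator: corev1.NodeSelectorOperator(expr.Operator),
			Values:   expr.Values,
		})
	}

	return &corev1.Affinity{
		NodeAffinity: &corev1.NodeAffinity{
			RequiredDuringSchedulingIgnoredDuringExecution: &corev1.NodeSelector{
				NodeSelectorTerms: []corev1.NodeSelectorTerm{term},
			},
		},
	}
}
```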
Let me clarify, this is a possible situation in v1.13, right?
IMO it's possible to take the node selector in the node-agent spec into account when scheduling the backupPod?
Yes, it is possible, because users could modify node-agent's yaml and add any configurations affecting scheduling, e.g., topologies, affinities/anti-affinities, node selectors, etc.
Node-agent's node selection is done by the Kubernetes scheduler, and the only way to control its behavior is to modify node-agent's yaml; on the other hand, the data mover node-selection configuration lives in a configMap that is detected after node-agent starts.
Therefore, if we reflected the node-selection configuration into node-agent scheduling, we would have to dynamically edit node-agent's yaml after node-agent starts, which causes node-agent to restart one more time.
Moreover, users may be surprised by that extra restart, because they may not realize that their node-selection configuration affects node-agent scheduling.
Moreover, it is possible that the node-agent pod simply cannot run on a specific node at all; in that case, changing the node-agent spec still doesn't make things work.
Therefore, users must understand the relationship between node-agent scheduling and node selection in either case. So we'd better have users realize this from the beginning; it is then easy for them to write the two configurations (node-agent spec and node-selection configuration) correctly.
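As a purely illustrative example of those "two correct configurations" agreeing (the label key and value below are arbitrary, not anything Velero prescribes):

```go
package config

import metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"

// Suppose the node-agent daemonset spec pins its pods to labeled nodes:
//
//   spec.template.spec.nodeSelector: {"example.io/node-agent": "true"}
//
// Then the loadAffinity in the node-selection configMap should select the
// same nodes (or a subset), so every node a backupPod can be scheduled to
// also runs a node-agent pod.
var loadAffinity = &metav1.LabelSelector{
	MatchLabels: map[string]string{
		"example.io/node-agent": "true",
	},
}
```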
OK let's make sure this is covered in documentation.
At present, as part of the expose operations, the exposer creates a volume, represented by backupPVC, from the snapshot. The backupPVC uses the same storageClass as the source volume. If the `volumeBindingMode` in the storageClass is `Immediate`, the volume is immediately allocated from the underlying storage without waiting for the backupPod. On the other hand, the loadAffinity is set to the backupPod's affinity. If the backupPod is scheduled to a node where the snapshot volume is not accessible, e.g., because of storage topologies, the backupPod won't get into the Running state and, consequently, the data movement won't complete.
I think we can explicitly document that when the storageClass has the `volumeBindingMode` set to `Immediate`, the user SHOULD NOT set a node selector for the data mover.
We can document this, but probably we just need to tell users to be careful when setting node selection for `Immediate` volumes, as not all volumes have constraints like topologies. In environments with no constraints, node selection works well with `Immediate` volumes.
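To make that caution actionable, here is a hedged sketch of how one could check a storage class's binding mode before configuring node selection for its volumes; the function `warnIfImmediate` is hypothetical and the client wiring is assumed:

```go
package check

import (
	"context"
	"fmt"

	storagev1 "k8s.io/api/storage/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// warnIfImmediate is illustrative only. With Immediate binding, the backupPVC's
// volume is allocated before the backupPod is scheduled, so a loadAffinity that
// steers the backupPod away from the volume's topology can leave it Pending.
func warnIfImmediate(ctx context.Context, client kubernetes.Interface, scName string) error {
	sc, err := client.StorageV1().StorageClasses().Get(ctx, scName, metav1.GetOptions{})
	if err != nil {
		return err
	}

	if sc.VolumeBindingMode != nil && *sc.VolumeBindingMode == storagev1.VolumeBindingImmediate {
		fmt.Printf("storage class %q uses Immediate binding; make sure the data mover "+
			"node selection is compatible with the volume's topology constraints\n", scName)
	}
	return nil
}
```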
There is a common solution for both problems:
- We have existing logic that periodically enqueues the DataUpload CRs which are in the `Accepted` phase for timeout and cancel checks
- We add a new check to this existing logic to detect whether the corresponding backupPods are in an unrecoverable status, as sketched below
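A rough sketch of how that additional check could slot into the periodic requeue path; `checkAcceptedDataUpload` and the specific unrecoverable conditions shown are assumptions for illustration, not the exact controller code:

```go
package controller

import (
	"time"

	corev1 "k8s.io/api/core/v1"
)

// checkAcceptedDataUpload sketches the periodic check run against DataUploads
// in the Accepted phase: keep the existing prepare-timeout behavior, and
// additionally cancel early when the backupPod can no longer reach Running.
func checkAcceptedDataUpload(pod *corev1.Pod, acceptedAt time.Time, prepareTimeout time.Duration) (cancel bool, reason string) {
	// Existing behavior: give up after the prepare timeout (30 minutes by default).
	if time.Since(acceptedAt) > prepareTimeout {
		return true, "prepare timeout exceeded"
	}

	// New behavior: detect unrecoverable backupPod states without waiting
	// for the full timeout. The conditions below are examples, not the
	// exhaustive list used by the real controller.
	if pod.Status.Phase == corev1.PodFailed {
		return true, "backupPod failed"
	}
	for _, cs := range pod.Status.ContainerStatuses {
		if w := cs.State.Waiting; w != nil &&
			(w.Reason == "ImagePullBackOff" || w.Reason == "ErrImageNeverPull") {
			return true, "backupPod image cannot be pulled: " + w.Reason
		}
	}
	return false, ""
}
```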
Could we give more information about the definition of the unrecoverable state?
It's better to let the user know in which conditions the DataUpload cancellation or failure is expected.
This unrecoverable status check is existing logic; here we add a further check into it.
At present, this check is not included in the doc; we can add it and include all the checks for backupPod/restorePod's unrecoverable status.
There are several comments requesting documentation. We will create a separate PR to add a document for node selection and cover all those requests there.
Add the design for node selection for data mover backup