Design for node-agent concurrency #6950
Conversation
Codecov Report
@@            Coverage Diff             @@
##             main    #6950      +/-   ##
==========================================
+ Coverage   60.67%   61.11%   +0.43%
==========================================
  Files         250      252       +2
  Lines       26515    26916     +401
==========================================
+ Hits        16088    16449     +361
- Misses       9283     9302      +19
- Partials     1144     1165      +21
### Global data path manager
As for the code implementation, the data path manager maintains the total number of running VGDP instances and ensures the limit is not exceeded. At present, there is one data path manager instance per controller; as a result, the concurrent numbers are calculated separately for each controller, which does not limit the concurrency across different requesters.
Therefore, we need to create one global data path manager instance server-wide and pass it to the different controllers. The instance will be created at node-agent server startup.
A concurrent number is required to initiate a data path manager; the number comes from either the per-node concurrent number or the global concurrent number.
How do we control the limit of the running VGDP instances? Is this already implemented by Velero or Kopia?
Yes, it has been implemented, check the dataPath manager code
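For readers following along, here is a minimal, hedged sketch of what such a manager could look like. The error name `ConcurrentLimitExceed` matches the one referenced elsewhere in this thread; the type and method names, and the whole body, are illustrative assumptions, not the actual Velero implementation.

```go
package datapath

import (
	"errors"
	"sync"
)

// ConcurrentLimitExceed is returned when no free concurrency slot is available.
var ConcurrentLimitExceed = errors.New("concurrent number exceeds limit")

// Manager tracks running VGDP instances against a fixed concurrency limit.
type Manager struct {
	mu          sync.Mutex
	concurrency int
	running     map[string]struct{}
}

func NewManager(concurrency int) *Manager {
	return &Manager{concurrency: concurrency, running: map[string]struct{}{}}
}

// Acquire registers a new VGDP instance, or fails if the limit is reached.
// The limit only gates new creations; already running instances are untouched.
func (m *Manager) Acquire(name string) error {
	m.mu.Lock()
	defer m.mu.Unlock()
	if len(m.running) >= m.concurrency {
		return ConcurrentLimitExceed
	}
	m.running[name] = struct{}{}
	return nil
}

// Release frees the slot when the VGDP instance completes.
func (m *Manager) Release(name string) {
	m.mu.Lock()
	defer m.mu.Unlock()
	delete(m.running, name)
}
```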
Is it possible to also update the PVB, PVR, DataUpload, and DataDownload status to reflect the reason why they are still not running?
fsBackup, err := r.dataPathMgr.CreateFileSystemBR(pvb.Name, pVBRRequestor, ctx, r.Client, pvb.Namespace, callbacks, log)
if err != nil {
    if err == datapath.ConcurrentLimitExceed {
        // Concurrency limit reached: requeue the request and retry later.
        return ctrl.Result{Requeue: true, RequeueAfter: time.Minute}, nil
    } else {
        return r.errorOut(ctx, &pvb, err, "error to create data path", log)
    }
}
OK, will do this in the code change PR.
## Goals
- Define the behaviors of concurrent VGDP instances in node-agent
If we support configuring the concurrency in the backup/restore level in the future, will this feature be impacted?
It is not impacted. The two controls could work together: backups/restores are probably not all on the same node, and within the same node, the VGDP concurrency mechanism takes priority over the backup/restore-level concurrency, since the former reflects resource constraints; if there are not enough resources, the backup/restore-level configurations cannot be fulfilled.
I have a couple of questions on the design of this API
design/node-agent-concurrency.md
Outdated
## Non-Goals
- VGDP instances from different nodes always run concurrently, since in the most common cases the resources are isolated per node. Special cases where some resources are shared across nodes are not supported at present.
- In practice, restores run in prioritized scenarios, e.g., disaster recovery. However, the current design doesn't consider this difference: a VGDP instance for a restore is blocked if it reaches the concurrency limit, even if the instances blocking it are for backups. If users run into problems here, they should consider stopping the backups first.
- Sometimes users want to block VGDP instances from running on a specific node; this is not considered in the current design. To achieve this, more modules need to be considered; just setting the VGDP instances' concurrent number to 0 won't bring the ideal result.
what would be the result then? and why would it not be ideal
I added the details in the design.
design/node-agent-concurrency.md
Outdated
### Per-node concurrent number
We allow users to specify a different concurrent number per node; for example, users can allow 3 concurrent instances on Node-1, 2 instances on Node-2 and 1 instance on Node-3. This is for the below considerations:
- The resources may differ among nodes. Users could then specify a smaller concurrent number for nodes with fewer resources and a larger number for the ones with more resources.
- Help users to isolate critical environments. Users may run some critical workloads on specific nodes; since VGDP instances may consume significant resources, users may want to run them only on the other nodes, or run fewer instances on the nodes with critical workloads.
Users may only want to run them in the other nodes
How does this work if I can't set the number to zero and we do not consider turning off the node agent for specific nodes?
Feels like there is a mismatch between the goals/non-goals and the solution; can you help me understand?
If the number is 0 for either the global number or the per-node number, it will be ignored.
There were indeed some struggles over supporting the 0 number (blocking mode), but I finally gave it up for now, so there was some mismatch in the doc. I've removed this sentence.
In the future, if we see cases that really require this, let's think about a solution covering the entire workflow. As mentioned in the doc, doing it from the VGDP instance level only doesn't work.
design/node-agent-concurrency.md
Outdated
The range of the per-node concurrent number is the same as the global concurrent number.
The per-node concurrent number is preferred over the global concurrent number, so it will overwrite the global concurrent number for that node.
The per-node concurrent number is implemented in a separate configMap named ```node-agent-configs```. This configMap is not created by Velero; users should create it manually on demand. The configMap should be created before the node-agent server starts; otherwise, the node-agent server should be restarted for it to take effect.
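To illustrate the startup path described above, the following is a minimal sketch of reading the ```node-agent-configs``` configMap once when the node-agent server starts. It assumes a client-go clientset, that the configMap lives in the Velero install namespace, and that the `data-path-concurrency` key holds JSON matching the `DataPathConcurrency` struct from this design; the helper name and error handling are hypothetical.

```go
import (
	"context"
	"encoding/json"

	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// loadDataPathConcurrency reads the node-agent-configs configMap once at
// node-agent server startup. A missing configMap simply means "no per-node
// configuration"; a malformed one is reported as an error.
func loadDataPathConcurrency(ctx context.Context, kubeClient kubernetes.Interface, namespace string) (*DataPathConcurrency, error) {
	cm, err := kubeClient.CoreV1().ConfigMaps(namespace).Get(ctx, "node-agent-configs", metav1.GetOptions{})
	if apierrors.IsNotFound(err) {
		return nil, nil // no per-node configuration provided
	}
	if err != nil {
		return nil, err
	}

	raw, ok := cm.Data["data-path-concurrency"]
	if !ok {
		return nil, nil
	}

	// DataPathConcurrency is the struct defined by this design.
	concurrency := &DataPathConcurrency{}
	if err := json.Unmarshal([]byte(raw), concurrency); err != nil {
		return nil, err
	}
	return concurrency, nil
}
```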
Can we watch for this particular config map instead and make the changes necessary when they are made?
- From my point of view, this number is not frequently changed, so we don't need to watch it; we can retrieve it once when the server starts.
- There are some more details to cover. If we allow it, users may change the number in the config at any time, which would create chaos, e.g., at one time there are 3 concurrencies and at another time there are 2.
- It makes the implementation complex: the dataPathManager is a singleton instance, and if the number is not set only once when initiating it, we will need a watch mechanism and also a contention-resolution mechanism to update the number.
Therefore, unless there is a case that strongly requires this, I would prefer not to add this complexity.
Having behavior that matches kube-like behavior (constant reconciliation) is complexity that is worth adding.
design/node-agent-concurrency.md
Outdated
The ```node-agent-configs``` configMap may be used for other purposes of configuring node-agent in the future; at present, there is only one kind of data, ```data-path-concurrency```. Its data structure is as below:
```go
type DataPathConcurrency struct {
```
Was it considered to make this a CRD singleton cluster scoped CR?
Not sure if I get your question; let me try to answer it. By the design, this is a singleton for a single Velero installation. If there is more than one installation in a cluster, I didn't consider that in the design, but technically, yes, all the Velero installations will follow the same configMap.
So either this should be a cluster-scoped resource as a CRD (not a configMap), or it should be a namespace-scoped configMap tied to the Velero node-agent configuration. I think it will be confusing otherwise. Magic configMaps that have special meaning used to be needed, but I think we have options to make it more declarative and therefore more explicit.
As far as I understand, this per-node number is a static configuration, so I don't see the benefit of making it a CRD.
If you are talking about dynamically watching it, as mentioned here, I don't think it is necessary to dynamically watch a configuration that is not frequently changed.
It may not be frequently changed, BUT when you change it, what is the process now, and how does it affect running backups?
So neither the global number change nor the per-node number change takes effect unless the node-agent server restarts, and the node-agent server is only recommended to be restarted when there are no running backups/restores.
I am worried about the brittleness of this with a backup and restore system.
- For the already running backups/restores, I do believe this is reasonable behavior; we cannot kill a data path thread (which may, for example, have already backed up tons of data) whenever the number is changed by users.
- For the backups/restores yet to be created/submitted, we could use the new number without restarting the node-agent server, but I don't think it is a must-have, especially in the first version.
For 2, I am open to a vote on this. It is just effort/complexity vs. user friendliness.
I don't think I am following 1, unless you are spelling out the agreement on letting running threads continue running and reducing the number over time.
I believe that if you implemented the system described in 1 and 2 it would be more resilient and have a better UX (especially when you need to bump these resources up/down for a $REASON).
the agreement on letting running threads continue running and reducing the number over time
Yes, this is what I meant; the limit should only apply to newly created VGDP instances.
design/node-agent-concurrency.md
Outdated
```go
type DataPathConcurrency struct {
	// ConfigRules specifies the configs by node identified by label selector
	ConfigRules map[string]int `json:"configRules"`
}
```
Instead of a string, we should use a LabelSelector, IMO.
Also, something to consider: if multiple labels are matched, which one is selected? I suggest the safe thing is the lesser of the entries.
if multiple labels are matched, which one is selected? I suggest the safe thing is the lesser of the entries
Yes, that is what we are going to do. See this statement in the design: "If one node falls into more than one elements, e.g., if node1 also has the label beta.kubernetes.io/instance-type=Standard_B4ms, the smallest number (3) will be used."
Instead of a string, we should use a LabelSelector, IMO.
See this comment. The string is in the pattern of a LabelSelector, so the code will convert the string to a LabelSelector. This is for two considerations:
- A LabelSelector cannot be the key of a map. To use LabelSelector, the structure would be more complex.
- A string is convenient for users to write the config; they don't need to write "{}" or the assignment to "matchLabels".
I don't know if this makes sense. There is a well-defined struct for label selectors, and if you don't use that, it won't be very clear.
I would suggest using a CRD for this and the well-known label selector struct. I would probably just do a simple list. Then you can order that list by the size of the threads; you loop through the selectors, and once one is found, it will be the lowest option.
I don't think you gain anything from having a map here: to see if a node is selected you have to look at the key, so you will have to loop over the map anyway. I think that a list would be a better choice.
Let me try with a list of label selectors. I had considered this before, but it eventually involves two data structures, so I dropped it.
Done, changed to LabelSelector
If one node matches more than one element, e.g., if ```node1``` also has the label ```beta.kubernetes.io/instance-type=Standard_B4ms```, the smallest number (3) will be used.
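To make the matching rule concrete, below is a minimal, hedged sketch of resolving a node's concurrent number from a list of label-selector rules, taking the smallest matching value and falling back to the global number if no rule matches. The rule type, field names, and function name are illustrative assumptions, not necessarily the final structures in the design.

```go
import (
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/labels"
)

// RuledConfig pairs a node label selector with a concurrent number
// (illustrative shape only).
type RuledConfig struct {
	NodeSelector metav1.LabelSelector `json:"nodeSelector"`
	Number       int                  `json:"number"`
}

// effectiveConcurrency returns the concurrent number for the given node:
// the smallest number among all matching per-node rules, or the global
// number if no rule matches. Numbers <= 0 are ignored, per the discussion above.
func effectiveConcurrency(node *corev1.Node, rules []RuledConfig, globalNum int) int {
	result := globalNum
	matched := false

	for _, rule := range rules {
		if rule.Number <= 0 {
			continue // 0 (blocking mode) is not supported and is ignored
		}
		selector, err := metav1.LabelSelectorAsSelector(&rule.NodeSelector)
		if err != nil {
			continue // skip malformed selectors
		}
		if !selector.Matches(labels.Set(node.Labels)) {
			continue
		}
		if !matched || rule.Number < result {
			result = rule.Number
		}
		matched = true
	}
	return result
}
```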
### Global data path manager
As for the code implementation, the data path manager maintains the total number of running VGDP instances and ensures the limit is not exceeded. At present, there is one data path manager instance per controller; as a result, the concurrent numbers are calculated separately for each controller, which does not limit the concurrency across different requesters.
Can you help me understand?
This is saying that the node agent controller will be responsible for the number of concurrent DataUploads/DataDownloads it can handle and ensure this is true? Can you help me understand why this is not the thread of the workers for the controller and needs something in a global manager?
This is saying that the node agent controller will be responsible for the number of concurrent
No, this is the current/existing behavior. We need to change the control to be node-agent-server wide, i.e., into a global manager as you mentioned.
I don't understand. Can you please help me?
Is this because we have two things (DataUploads and DataDownloads) with workers, and you want to split them?
If this is the case, then isn't there a concern that a node agent will take ownership, because its worker is free just spinning because there are no available "dataManager" threads? Then other node agents that are in the system, that may have free "dataManager" threads, will be sitting idle?
Did I miss us handling this in the design?
I might have misled you. The thing is, in the existing code there is one dataPathMgr per controller; so in order to count the overall VGDP instances, no matter whether they are from a DataUpload, DataDownload, PodVolumeBackup or PodVolumeRestore, we need to make dataPathMgr a singleton. See the existing code.
OK, I do think that, because a node takes ownership of something and then determines whether it can do work on it, we may want to consider how to make sure this work can be spread out more adequately.
design/node-agent-concurrency.md
Outdated
## Solution

### Global concurrent number
We add a node-agent server parameter ```data-path-concurrent-num``` to accept a concurrent number that will be applied to all nodes if the per-node number is not specified.
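As a rough illustration of what such a server parameter could look like: the flag name ```data-path-concurrent-num``` comes from this design, while the default value, variable names, and use of pflag are assumptions.

```go
import "github.com/spf13/pflag"

// dataPathConcurrentNum holds the node-agent server-wide default concurrent
// number for VGDP instances; a per-node configuration can override it.
var dataPathConcurrentNum int

func bindFlags(flags *pflag.FlagSet) {
	// Assumed default of 1, i.e., keep today's serialized behavior unless
	// the user raises the limit.
	flags.IntVar(&dataPathConcurrentNum, "data-path-concurrent-num", 1,
		"Concurrent number for VGDP instances on each node, unless overridden per node")
}
```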
If this param is for the node-agent server, isn't it always per-node, because different nodes run different node-agent server instances?
It is for the node-agent server; see this comment.
design/node-agent-concurrency.md
Outdated
The range of the per-node concurrent number is the same as the global concurrent number.
The per-node concurrent number is preferred over the global concurrent number, so it will overwrite the global concurrent number for that node.
The per-node concurrent number is implemented in a separate configMap named ```node-agent-configs```. This configMap is not created by Velero; users should create it manually on demand. The configMap should be created before the node-agent server starts; otherwise, the node-agent server should be restarted for it to take effect.
If the configmap is an alternative to the param for node-agent server, I'd vote for node-agent server.
It is not an alternative. The thing is, we want a number that applies to all nodes, and we also want some nodes to have their own particular number.
### Global data path manager
As for the code implementation, the data path manager maintains the total number of running VGDP instances and ensures the limit is not exceeded. At present, there is one data path manager instance per controller; as a result, the concurrent numbers are calculated separately for each controller, which does not limit the concurrency across different requesters.
Therefore, we need to create one global data path manager instance server-wide and pass it to the different controllers. The instance will be created at node-agent server startup.
By server-wide, do you mean the node-agent server or the whole Velero setup? I assume it's the node-agent server, and if so, I don't think it's really global.
It is node-agent server. Maybe we can give it a better name, what about "Ordinary Number"?
- Help users to isolate critical environments. Users may run some critical workloads on specific nodes; since VGDP instances may consume significant resources, users may want to run fewer instances on the nodes with critical workloads.

The range of the per-node concurrent number is the same as the global concurrent number.
The per-node concurrent number is preferred over the global concurrent number, so it will overwrite the global concurrent number for that node.
The term "Global concurrent number" is a little confusing to me. IMO this concurrent number is to control the concurrency within one node-agent server, and if this is true, the number is per-node. And the "Global concurrent number" is just a shortcut to help users set the concurrency for every node?
Yes, the "Global concurrent number" applies to every node by default, and if a per-node number is also set for a node, the per-node number overwrites it on that node.
For example, say the global number is 3 and there are 3 nodes: node-1, node-2, node-3; and users also configured a per-node number for node-2, say 5. Then the concurrent number for each node is:
node-1: 3, node-2: 5, node-3: 3
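As a small, purely illustrative check of that example (the helper name and shape are assumptions, not part of the design):

```go
// concurrencyPerNode applies the per-node overrides on top of the global
// default; nodes without an override keep the global number.
func concurrencyPerNode(nodes []string, globalNum int, perNode map[string]int) map[string]int {
	result := map[string]int{}
	for _, n := range nodes {
		result[n] = globalNum
		if v, ok := perNode[n]; ok && v > 0 {
			result[n] = v
		}
	}
	return result
}

// concurrencyPerNode([]string{"node-1", "node-2", "node-3"}, 3, map[string]int{"node-2": 5})
// => map[node-1:3 node-2:5 node-3:3]
```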
I would like to see an alternative explored, where
I agree it is a common requirement. However, as I mentioned, we cannot simply achieve this just by defining the concurrency of VGDP.

I agree this is a reasonable requirement. But again, we cannot accomplish this from the VGDP level only. Things differ across backup types. E.g., for both fs restore and data mover restore, pods are first created and then VGDP instances are launched; the thing is, in some cases a VGDP instance must run on the same node where the pod is scheduled, otherwise the restored volume may not be mountable by the pod if there are topology/affinity/taint settings. So, likewise, these requirements cannot be fulfilled by the current design, because it only targets the number of VGDP instances.

I think that if we agree these are reasonable requirements, then the design should also solve them. I wonder if we could take some time in a community meeting, or set up a one-off call, to discuss and brainstorm, if that would be helpful?

Either way is fine with me. We can discuss it in the next community meeting. Meanwhile, I am not sure if we are on the same page about everything (regardless of our opinions); if not, we can run a sync-up call first.

@shawn-hurley For these two questions, I think we are still not on the same page, because I thought you regarded them as something we can do, but I don't think so: considering all types, PVB/PVR and data mover backup/restore, these are things we cannot achieve right now, and for a long time in the future. For the details, we need to consider them end to end:
Why can't we support "we handle a node not having workers running for one of DataDownloads OR DataUploads or both": unless we enhance the above pod-schedule-related behavior, we have no way to support it. For sure, it will be hard work.
Why can't we support "being able to break up if a node does DataDownloads or DataRestores":
Done with some further changes to address the comments.
Please continue the review.
Thank you for explaining where we were not on the same page. I think that helped me a lot to understand the concerns. I think now we have to go up the stack for a second and talk about the feature and what it means for users. Is it worth breaking up what users can expect based on which backup type they choose? This would allow us to do the "more correct" thing for DataUploads and continue to support PVB/PVR requests and DataDownloads that have to be set for a specific node. I think it is reasonable that the main need here is to control nodes that are uploading data. If you are restoring, something went wrong, and you probably want most/all available resources to do that; from a UX perspective, that seems reasonable. What are the downsides of this?
This is what I was planning to do in a separate PR/design (another issue #7036 is opened for it), since its implementation/usability deviates from the current PR:
Concerns around naming now that I finally understand the issues, thank you @Lyndon-Li for being patient while I caught up.
I hope the comments make sense in that light, let me know if I am still misunderstanding!
The data structure for ```data-path-concurrency``` is as below:
```go
type DataPathConcurrency struct {
```
I worry that this, in conjunction with the changes that you are describing for the data upload change, may be confusing.
I think that, at least now, I understand the DataPath name and what it means, but I don't think it is clear to the end users what this is managing.
Can we call this external concurrency or node load concurrency or something? It feels like DataPath is very much an internal concept that does convey specific meaning.
Thoughts?
Yes, DataPath is more of a design/architecture-level concept.
We have discussed this a little in the community meeting, but haven't got a final answer. Some of the previous nominations: "dataPathConcurrency", "dataMoverConcurrency".
Personally, I think "loadConcurrency" (we don't need to mention "node" because it is under node-agent-configs) is close to what we are doing here and is not ambiguous. Waiting for others' opinions.
For this naming question, we decided not to block the merge of this design. We can keep discussing it, and if we want to change the name after the design has been merged, we will come back and make a slight change to this design.
So this discussion can continue until the FC of 1.13.
@Lyndon-Li "loadConcurrency" works for me.
loadConcurrency works for me as well. Docs on this guy will be necessary, but I now better understand what this is doing. Thank you for taking the time, and I agree if inspiration strikes and folks come up with a better name, awesome!
design/node-agent-concurrency.md
Outdated
## Solution

We introduce a configMap named ```node-agent-configs``` for users to specify node-agent related configurations. This configMap is not created by Velero; users should create it manually on demand.
Which namespace would this config live in? How would two node-agent daemon sets, if we have multiple installs, know which one to look at?
Good question, I have modified the design to include it: the configMap is namespace-specific and only works for the node-agent in that specific namespace.
It looks good to me. I worry about the UX once you have three and try to understand what is going on in a node across the three pods, but I think we can solve it if/when we hit that problem.
Signed-off-by: Lyndon-Li <lyonghui@vmware.com>
/lgtm
For issue #6663, add the design for node-agent concurrency.