There are always some nodes in NotReady state, reason: OutOfSync #1506
Comments
/assign @Thor-wl
Could you help check which node is in the NotReady state?
@wpeng102, I got the root cause. Volcano sets a node to NotReady with reason OutOfSync when:

    if !ni.Used.LessEqual(NewResource(node.Status.Allocatable)) {
        ni.State = NodeState{
            Phase:  NotReady,
            Reason: "OutOfSync",
        }
        return
    }

However, we have some dynamic resources whose capacity changes over time, e.g. these pod resources:

    resources:
      limits:
        caster.io/colocation-cpu: 6k
        caster.io/colocation-memory: "9830"
        ephemeral-storage: 18102Mi
      requests:
        caster.io/colocation-cpu: 6k
        caster.io/colocation-memory: "9830"
        ephemeral-storage: 9051Mi

and the node's allocatable from its YAML:

    allocatable:
      caster.io/colocation-cpu: "0"
      caster.io/colocation-memory: "32768"
      cpu: "47"
      ephemeral-storage: 937234648Ki
      hugepages-1Gi: "0"
      hugepages-2Mi: "0"
      memory: 130444672Ki
      pods: "110"
A related issue that was already fixed: #294.
It's interesting, so you have used resource overcommit and colocation in your cluster?
@wpeng102, our cluster has multiple schedulers. Other schedulers (cooperating with a device plugin) handle the extended resources above, e.g. "caster.io/colocation-cpu".
FYI, we removed the following logic in our forked branch:

    if !ni.Used.LessEqual(NewResource(node.Status.Allocatable)) {
        ni.State = NodeState{
            Phase:  NotReady,
            Reason: "OutOfSync",
        }
        return
    }

and changed Resource.Sub to the following:

    // Sub subtracts two Resource objects.
    func (r *Resource) Sub(rr *Resource) *Resource {
        // assert.Assertf(rr.LessEqual(r), "resource is not sufficient to do operation: <%v> sub <%v>", r, rr)
        r.MilliCPU -= rr.MilliCPU
        if r.MilliCPU < 0 {
            r.MilliCPU = 0
        }
        r.Memory -= rr.Memory
        if r.Memory < 0 {
            r.Memory = 0
        }
        for rrName, rrQuant := range rr.ScalarResources {
            if r.ScalarResources == nil {
                return r
            }
            r.ScalarResources[rrName] -= rrQuant
            if r.ScalarResources[rrName] < 0 {
                r.ScalarResources[rrName] = 0
            }
        }
        return r
    }
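
For comparison, a minimal sketch (again with a simplified struct rather than Volcano's Resource type; the field names and quantities are illustrative) of the clamp-at-zero subtraction semantics in the forked Sub above, showing that subtracting more of an extended resource than the node currently tracks floors at zero instead of failing an assertion:

    // Simplified sketch of clamp-at-zero subtraction, mirroring the forked Sub above.
    package main

    import "fmt"

    type resource struct {
        milliCPU int64
        scalar   map[string]int64
    }

    // sub subtracts rr from r in place, clamping every quantity at zero.
    func (r *resource) sub(rr *resource) *resource {
        r.milliCPU -= rr.milliCPU
        if r.milliCPU < 0 {
            r.milliCPU = 0
        }
        for name, quantity := range rr.scalar {
            if r.scalar == nil {
                return r
            }
            r.scalar[name] -= quantity
            if r.scalar[name] < 0 {
                r.scalar[name] = 0
            }
        }
        return r
    }

    func main() {
        node := &resource{milliCPU: 47000, scalar: map[string]int64{"caster.io/colocation-cpu": 0}}
        pod := &resource{milliCPU: 500, scalar: map[string]int64{"caster.io/colocation-cpu": 6000}}

        // With the assertion removed, subtracting a request larger than the node's
        // current allocatable no longer aborts; the scalar simply floors at zero.
        node.sub(pod)
        fmt.Printf("milliCPU=%d colocation-cpu=%d\n", node.milliCPU, node.scalar["caster.io/colocation-cpu"])
        // Output: milliCPU=46500 colocation-cpu=0
    }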
What happened:
"Failed to set node info, phase: NotReady, reason: OutOfSync"
I0528 09:55:13.751676 1 cache.go:761] There are <0> Jobs, <1> Queues and <607> Nodes in total for scheduling.
I0528 09:55:14.994064 1 cache.go:761] There are <0> Jobs, <1> Queues and <606> Nodes in total for scheduling.
It should be mentioned that I did not submit any job at that time.
What you expected to happen:
Our Volcano scheduler is only responsible for some of the nodes, so I changed some code.
In theory the scheduler should hold 644 nodes, but I waited for about an hour and it never reached that number in the scheduling cycle.
I did not submit any job during the waiting time.
How to reproduce it (as minimally and precisely as possible):
Anything else we need to know?:
Environment:
Kubernetes version (use kubectl version):
Client Version: version.Info{Major:"1", Minor:"12+", GitVersion:"v1.12.10-BCS.patch.v3", GitCommit:"47a5f679b379dc6962fc0967069f4deb445093d2", GitTreeState:"clean", BuildDate:"2020-10-28T02:58:43Z", GoVersion:"go1.12.13", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"12+", GitVersion:"v1.12.10-BCS.patch.v3", GitCommit:"47a5f679b379dc6962fc0967069f4deb445093d2", GitTreeState:"clean", BuildDate:"2020-10-28T02:58:43Z", GoVersion:"go1.12.13", Compiler:"gc", Platform:"linux/amd64"}
Kernel (e.g. uname -a):