
There are always some nodes in NotReady state, reason: OutOfSync #1506

Closed
umialpha opened this issue May 31, 2021 · 7 comments
Labels: kind/bug

@umialpha

What happened:

  1. The Volcano scheduler continually logs the warning
    "Failed to set node info, phase: NotReady, reason: OutOfSync"
  2. In every scheduling cycle some nodes cannot be snapshotted, and the total node count sometimes even decreases, e.g.
    I0528 09:55:13.751676 1 cache.go:761] There are <0> Jobs, <1> Queues and <607> Nodes in total for scheduling.
    I0528 09:55:14.994064 1 cache.go:761] There are <0> Jobs, <1> Queues and <606> Nodes in total for scheduling.
    It should be mentioned that I did not submit any jobs at that time.

What you expected to happen:
Our Volcano scheduler is only responsible for a subset of the nodes, so I changed the node informer registration as follows:

// Only watch nodes that this scheduler instance is responsible for.
sc.nodeInformer.Informer().AddEventHandlerWithResyncPeriod(
		cache.FilteringResourceEventHandler{
			FilterFunc: func(obj interface{}) bool {
				switch v := obj.(type) {
				case *v1.Node:
					return responsibleForNode(v)
				default:
					return false
				}
			},
			Handler: cache.ResourceEventHandlerFuncs{
				AddFunc:    sc.AddNode,
				UpdateFunc: sc.UpdateNode,
				DeleteFunc: sc.DeleteNode,
			},
		},
		0,
	)
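
responsibleForNode is our own helper and not part of Volcano. A minimal sketch of the idea (it reuses the v1 = k8s.io/api/core/v1 import from the snippet above; the label key is illustrative, not our real one):

// responsibleForNode reports whether this scheduler instance should manage the node.
// Sketch only: the real check may use a different criterion.
func responsibleForNode(node *v1.Node) bool {
	// "example.io/scheduler-group" is an assumed label key, for illustration only.
	return node.Labels["example.io/scheduler-group"] == "volcano"
}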

In theory the scheduler should hold 644 nodes. I waited for about an hour, but it never reached that number in the scheduling cycle. I did not submit any jobs while waiting.

How to reproduce it (as minimally and precisely as possible):

Anything else we need to know?:

Environment:

  • Volcano Version: v1.2.0
  • Kubernetes version (use kubectl version): Client Version: version.Info{Major:"1", Minor:"12+", GitVersion:"v1.12.10-BCS.patch.v3", GitCommit:"47a5f679b379dc6962fc0967069f4deb445093d2", GitTreeState:"clean", BuildDate:"2020-10-28T02:58:43Z", GoVersion:"go1.12.13", Compiler:"gc", Platform:"linux/amd64"}
    Server Version: version.Info{Major:"1", Minor:"12+", GitVersion:"v1.12.10-BCS.patch.v3", GitCommit:"47a5f679b379dc6962fc0967069f4deb445093d2", GitTreeState:"clean", BuildDate:"2020-10-28T02:58:43Z", GoVersion:"go1.12.13", Compiler:"gc", Platform:"linux/amd64"}
  • Cloud provider or hardware configuration:
  • OS (e.g. from /etc/os-release):
  • Kernel (e.g. uname -a):
  • Install tools:
  • Others:
umialpha added the kind/bug label on May 31, 2021
@Thor-wl
Contributor

Thor-wl commented May 31, 2021

/assign @Thor-wl

@wpeng102
Member

Could you help check which node is in the OutOfSync state? And please attach the describe node output for it.

@umialpha
Author

@wpeng102, I found the root cause. Volcano sets a node to NotReady with reason OutOfSync when:

if !ni.Used.LessEqual(NewResource(node.Status.Allocatable)) {
		ni.State = NodeState{
			Phase:  NotReady,
			Reason: "OutOfSync",
		}
		return
	}

However, we have some dynamic resources whose capacity changes over time, e.g.
pod yaml:

   resources:
      limits:
        caster.io/colocation-cpu: 6k
        caster.io/colocation-memory: "9830"
        ephemeral-storage: 18102Mi
      requests:
        caster.io/colocation-cpu: 6k
        caster.io/colocation-memory: "9830"
        ephemeral-storage: 9051Mi

node yaml:

allocatable:
    caster.io/colocation-cpu: "0"
    caster.io/colocation-memory: "32768"
    cpu: "47"
    ephemeral-storage: 937234648Ki
    hugepages-1Gi: "0"
    hugepages-2Mi: "0"
    memory: 130444672Ki
    pods: "110"
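
At this moment the pod above still accounts for 6k of caster.io/colocation-cpu while the node reports an allocatable of 0, so Used.LessEqual(Allocatable) fails and the node is flagged NotReady/OutOfSync. A minimal standalone sketch of that comparison (plain Go maps instead of Volcano's api.Resource, for illustration only):

package main

import "fmt"

// lessEqual mimics the per-dimension check that decides whether a node's
// used resources still fit inside its allocatable resources.
func lessEqual(used, allocatable map[string]int64) bool {
	for name, quant := range used {
		if quant > allocatable[name] {
			return false
		}
	}
	return true
}

func main() {
	// Quantities taken from the pod/node yaml above, simplified to raw numbers.
	used := map[string]int64{"caster.io/colocation-cpu": 6000}     // 6k requested by the running pod
	allocatable := map[string]int64{"caster.io/colocation-cpu": 0} // currently reported by the node
	fmt.Println(lessEqual(used, allocatable))                      // false -> node marked NotReady/OutOfSync
}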

@umialpha
Author

A related, already-fixed issue: #294.

@wpeng102
Member

@umialpha

Interesting. So you are using resource overcommit and colocation in your cluster?

@umialpha
Author

@wpeng102, our cluster has multiple schedulers. The other schedulers (cooperating with device plugins) handle the extended resources above, i.e. "caster.io/colocation-cpu".

@umialpha
Author

umialpha commented Jun 8, 2021

FYI, we removed the following check in our forked branch:

if !ni.Used.LessEqual(NewResource(node.Status.Allocatable)) {
		ni.State = NodeState{
			Phase:  NotReady,
			Reason: "OutOfSync",
		}
		return
	}

and changed Sub so that the sufficiency assertion is dropped and negative results are clamped to zero:

// Sub subtracts two Resource objects, clamping each dimension at zero
// instead of asserting that rr fits inside r.
func (r *Resource) Sub(rr *Resource) *Resource {
	// assert.Assertf(rr.LessEqual(r), "resource is not sufficient to do operation: <%v> sub <%v>", r, rr)

	r.MilliCPU -= rr.MilliCPU
	if r.MilliCPU < 0 {
		r.MilliCPU = 0
	}
	r.Memory -= rr.Memory
	if r.Memory < 0 {
		r.Memory = 0
	}

	for rrName, rrQuant := range rr.ScalarResources {
		if r.ScalarResources == nil {
			return r
		}
		r.ScalarResources[rrName] -= rrQuant
		if r.ScalarResources[rrName] < 0 {
			r.ScalarResources[rrName] = 0
		}
	}

	return r
}
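
As far as these two changes go, the effect is that when the allocatable amount of an extended resource temporarily drops below what running pods already consume, the subtraction simply bottoms out at zero instead of producing negative values, and without the OutOfSync check the node keeps being counted in the scheduling snapshot.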
