
There are always some nodes in NotReady state, reason: OutOfSync #1506

Closed
umialpha opened this issue May 31, 2021 · 7 comments
Labels: kind/bug

@umialpha

What happened:

  1. The Volcano scheduler continually logs the warning
    "Failed to set node info, phase: NotReady, reason: OutOfSync"
  2. In every scheduling cycle some nodes cannot be snapshotted, and the total node count sometimes even decreases, e.g.
    I0528 09:55:13.751676 1 cache.go:761] There are <0> Jobs, <1> Queues and <607> Nodes in total for scheduling.
    I0528 09:55:14.994064 1 cache.go:761] There are <0> Jobs, <1> Queues and <606> Nodes in total for scheduling.
    It should be mentioned that I did not submit any jobs at that time.

What you expected to happen:
Our Volcano scheduler is only responsible for a subset of the nodes, so I changed the node informer registration as follows:

// Only watch nodes that this scheduler instance is responsible for.
sc.nodeInformer.Informer().AddEventHandlerWithResyncPeriod(
		cache.FilteringResourceEventHandler{
			FilterFunc: func(obj interface{}) bool {
				switch v := obj.(type) {
				case *v1.Node:
					return responsibleForNode(v)
				default:
					return false
				}
			},
			Handler: cache.ResourceEventHandlerFuncs{
				AddFunc:    sc.AddNode,
				UpdateFunc: sc.UpdateNode,
				DeleteFunc: sc.DeleteNode,
			},
		},
		0,
	)
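
responsibleForNode is our own helper and not part of Volcano. A minimal sketch of the idea (it reuses the v1 = k8s.io/api/core/v1 import from the snippet above; the label key is illustrative, not our real one):

// responsibleForNode reports whether this scheduler instance should manage the node.
// Sketch only: the real check may use a different criterion.
func responsibleForNode(node *v1.Node) bool {
	// "example.io/scheduler-group" is an assumed label key, for illustration only.
	return node.Labels["example.io/scheduler-group"] == "volcano"
}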

In theory the scheduler should hold 644 nodes. I waited for about an hour, but it never reached that number in the scheduling cycle. I did not submit any jobs while waiting.

How to reproduce it (as minimally and precisely as possible):

Anything else we need to know?:

Environment:

  • Volcano Version: v1.2.0
  • Kubernetes version (use kubectl version): Client Version: version.Info{Major:"1", Minor:"12+", GitVersion:"v1.12.10-BCS.patch.v3", GitCommit:"47a5f679b379dc6962fc0967069f4deb445093d2", GitTreeState:"clean", BuildDate:"2020-10-28T02:58:43Z", GoVersion:"go1.12.13", Compiler:"gc", Platform:"linux/amd64"}
    Server Version: version.Info{Major:"1", Minor:"12+", GitVersion:"v1.12.10-BCS.patch.v3", GitCommit:"47a5f679b379dc6962fc0967069f4deb445093d2", GitTreeState:"clean", BuildDate:"2020-10-28T02:58:43Z", GoVersion:"go1.12.13", Compiler:"gc", Platform:"linux/amd64"}
  • Cloud provider or hardware configuration:
  • OS (e.g. from /etc/os-release):
  • Kernel (e.g. uname -a):
  • Install tools:
  • Others:
umialpha added the kind/bug label on May 31, 2021
@Thor-wl
Contributor

Thor-wl commented May 31, 2021

/assign @Thor-wl

@wpeng102
Member

Could you help check which node is in the OutOfSync state? And please attach the describe node output for it.

@umialpha
Author

@wpeng102, I found the root cause. Volcano sets a node to NotReady with reason OutOfSync when:

if !ni.Used.LessEqual(NewResource(node.Status.Allocatable)) {
		ni.State = NodeState{
			Phase:  NotReady,
			Reason: "OutOfSync",
		}
		return
	}

However, we have some dynamic resources whose capacity changes over time, e.g.
pod yaml:

   resources:
      limits:
        caster.io/colocation-cpu: 6k
        caster.io/colocation-memory: "9830"
        ephemeral-storage: 18102Mi
      requests:
        caster.io/colocation-cpu: 6k
        caster.io/colocation-memory: "9830"
        ephemeral-storage: 9051Mi

node yaml:

allocatable:
    caster.io/colocation-cpu: "0"
    caster.io/colocation-memory: "32768"
    cpu: "47"
    ephemeral-storage: 937234648Ki
    hugepages-1Gi: "0"
    hugepages-2Mi: "0"
    memory: 130444672Ki
    pods: "110"
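
At this moment the pod above still accounts for 6k of caster.io/colocation-cpu while the node reports an allocatable of 0, so Used.LessEqual(Allocatable) fails and the node is flagged NotReady/OutOfSync. A minimal standalone sketch of that comparison (plain Go maps instead of Volcano's api.Resource, for illustration only):

package main

import "fmt"

// lessEqual mimics the per-dimension check that decides whether a node's
// used resources still fit inside its allocatable resources.
func lessEqual(used, allocatable map[string]int64) bool {
	for name, quant := range used {
		if quant > allocatable[name] {
			return false
		}
	}
	return true
}

func main() {
	// Quantities taken from the pod/node yaml above, simplified to raw numbers.
	used := map[string]int64{"caster.io/colocation-cpu": 6000}     // 6k requested by the running pod
	allocatable := map[string]int64{"caster.io/colocation-cpu": 0} // currently reported by the node
	fmt.Println(lessEqual(used, allocatable))                      // false -> node marked NotReady/OutOfSync
}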

@umialpha
Author

A related, already-fixed issue: #294.

@wpeng102
Member

@umialpha

Interesting. So you are using resource overcommit and colocation in your cluster?

@umialpha
Author

@wpeng102, our cluster has multiple schedulers. The other schedulers (cooperating with device plugins) handle the extended resources above, i.e. "caster.io/colocation-cpu".

@umialpha
Author

umialpha commented Jun 8, 2021

FYI, we removed the following check in our forked branch:

if !ni.Used.LessEqual(NewResource(node.Status.Allocatable)) {
		ni.State = NodeState{
			Phase:  NotReady,
			Reason: "OutOfSync",
		}
		return
	}

and changed Sub so that the sufficiency assertion is dropped and negative results are clamped to zero:

// Sub subtracts two Resource objects, clamping each dimension at zero
// instead of asserting that rr fits inside r.
func (r *Resource) Sub(rr *Resource) *Resource {
	// assert.Assertf(rr.LessEqual(r), "resource is not sufficient to do operation: <%v> sub <%v>", r, rr)

	r.MilliCPU -= rr.MilliCPU
	if r.MilliCPU < 0 {
		r.MilliCPU = 0
	}
	r.Memory -= rr.Memory
	if r.Memory < 0 {
		r.Memory = 0
	}

	for rrName, rrQuant := range rr.ScalarResources {
		if r.ScalarResources == nil {
			return r
		}
		r.ScalarResources[rrName] -= rrQuant
		if r.ScalarResources[rrName] < 0 {
			r.ScalarResources[rrName] = 0
		}
	}

	return r
}
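
As far as these two changes go, the effect is that when the allocatable amount of an extended resource temporarily drops below what running pods already consume, the subtraction simply bottoms out at zero instead of producing negative values, and without the OutOfSync check the node keeps being counted in the scheduling snapshot.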
