volcano-scheduler start failed #628

Closed
jianxingzhe opened this issue Dec 19, 2019 · 19 comments · Fixed by #654
Labels
area/scheduling kind/bug Categorizes issue or PR as related to a bug. priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release.

@jianxingzhe

Is this a BUG REPORT or FEATURE REQUEST?:

Uncomment only one, leave it on its own line:

/kind bug

What happened:
When deploying Volcano with installer/volcano-development.yaml, the volcano-scheduler fails to start. The logs are as follows:

I1219 09:41:11.334295       1 session.go:135] Open Session b0d84829-2243-11ea-9587-a67222a5a7fa with <1> Job and <1> Queues
I1219 09:41:11.334540       1 enqueue.go:55] Enter Enqueue ...
I1219 09:41:11.334549       1 enqueue.go:70] Added Queue <default> for Job <nzk/nzkcluster>
I1219 09:41:11.334566       1 panic.go:679] Leaving Enqueue ...
I1219 09:41:11.334601       1 session.go:154] Close Session b0d84829-2243-11ea-9587-a67222a5a7fa
E1219 09:41:11.334655       1 runtime.go:69] Observed a panic: "invalid memory address or nil pointer dereference" (runtime error: invalid memory address or nil pointer dereference)
/home/travis/gopath/src/volcano.sh/volcano/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:76
/home/travis/gopath/src/volcano.sh/volcano/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:65
/home/travis/gopath/src/volcano.sh/volcano/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:51
/home/travis/.gimme/versions/go1.13.5.linux.amd64/src/runtime/panic.go:679
/home/travis/.gimme/versions/go1.13.5.linux.amd64/src/runtime/panic.go:199
/home/travis/.gimme/versions/go1.13.5.linux.amd64/src/runtime/signal_unix.go:394
/home/travis/gopath/src/volcano.sh/volcano/pkg/scheduler/actions/enqueue/enqueue.go:78
/home/travis/gopath/src/volcano.sh/volcano/pkg/scheduler/scheduler.go:84
/home/travis/gopath/src/volcano.sh/volcano/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:133
/home/travis/gopath/src/volcano.sh/volcano/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:134
/home/travis/gopath/src/volcano.sh/volcano/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:88
/home/travis/.gimme/versions/go1.13.5.linux.amd64/src/runtime/asm_amd64.s:1357
panic: runtime error: invalid memory address or nil pointer dereference [recovered]
        panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x138 pc=0x11c583a]

goroutine 201 [running]:
volcano.sh/volcano/vendor/k8s.io/apimachinery/pkg/util/runtime.HandleCrash(0x0, 0x0, 0x0)
        /home/travis/gopath/src/volcano.sh/volcano/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:58 +0x105
panic(0x132f580, 0x217a760)
        /home/travis/.gimme/versions/go1.13.5.linux.amd64/src/runtime/panic.go:679 +0x1b2
volcano.sh/volcano/pkg/scheduler/actions/enqueue.(*enqueueAction).Execute(0xc00016c098, 0xc000b26140)
        /home/travis/gopath/src/volcano.sh/volcano/pkg/scheduler/actions/enqueue/enqueue.go:78 +0x32a
volcano.sh/volcano/pkg/scheduler.(*Scheduler).runOnce(0xc00055aa80)
        /home/travis/gopath/src/volcano.sh/volcano/pkg/scheduler/scheduler.go:84 +0x294
volcano.sh/volcano/vendor/k8s.io/apimachinery/pkg/util/wait.JitterUntil.func1(0xc0003c13a0)
        /home/travis/gopath/src/volcano.sh/volcano/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:133 +0x5e
volcano.sh/volcano/vendor/k8s.io/apimachinery/pkg/util/wait.JitterUntil(0xc0003c13a0, 0x3b9aca00, 0x0, 0x1, 0x0)
        /home/travis/gopath/src/volcano.sh/volcano/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:134 +0xf8
volcano.sh/volcano/vendor/k8s.io/apimachinery/pkg/util/wait.Until(0xc0003c13a0, 0x3b9aca00, 0x0)
        /home/travis/gopath/src/volcano.sh/volcano/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:88 +0x4d
created by volcano.sh/volcano/pkg/scheduler.(*Scheduler).Run
        /home/travis/gopath/src/volcano.sh/volcano/pkg/scheduler/scheduler.go:68 +0xd4

root@paas-operator-0:~# kubectl get Job -n nzk
No resources found.

What you expected to happen:

How to reproduce it (as minimally and precisely as possible):

Anything else we need to know?:

Environment:

  • Volcano Version:
    volcanosh/vc-scheduler:latest
  • Kubernetes version (use kubectl version):
Client Version: version.Info{Major:"1", Minor:"13", GitVersion:"v1.13.12", GitCommit:"a8b52209ee172232b6db7a6e0ce2adc77458829f", GitTreeState:"clean", BuildDate:"2019-10-15T12:12:15Z", GoVersion:"go1.11.13", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"13", GitVersion:"v1.13.12", GitCommit:"a8b52209ee172232b6db7a6e0ce2adc77458829f", GitTreeState:"clean", BuildDate:"2019-10-15T12:04:30Z", GoVersion:"go1.11.13", Compiler:"gc", Platform:"linux/amd64"}
  • Cloud provider or hardware configuration:
  • OS (e.g. from /etc/os-release):
PRETTY_NAME="Debian GNU/Linux 9 (stretch)"
NAME="Debian GNU/Linux"
VERSION_ID="9"
VERSION="9 (stretch)"
ID=debian
HOME_URL="https://www.debian.org/"
SUPPORT_URL="https://www.debian.org/support"
BUG_REPORT_URL="https://bugs.debian.org/"
  • Kernel (e.g. uname -a):
  • Install tools:
  • Others:
@hzxuzhonghu
Collaborator

It seems that when Enqueue action executes, the job.PodGroup is still not populated. This is a bug.
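
To illustrate the failure mode described here, a minimal sketch follows. The `PodGroup` and `JobInfo` types and the `minMember` helper are simplified, hypothetical stand-ins and not the actual Volcano code. The sketch only shows how reading a field through a nil `PodGroup` pointer would panic, and how a guard avoids it.

```go
package main

import "fmt"

// PodGroup and JobInfo are simplified, hypothetical stand-ins for the
// scheduler's types; only the fields needed for the illustration are kept.
type PodGroup struct {
	MinMember int32
}

type JobInfo struct {
	Namespace string
	Name      string
	// PodGroup is nil when no PodGroup has been attached to the job.
	PodGroup *PodGroup
}

// minMember reads a PodGroup field defensively. Without the nil check,
// job.PodGroup.MinMember would panic with a nil pointer dereference.
func minMember(job *JobInfo) (int32, bool) {
	if job == nil || job.PodGroup == nil {
		return 0, false
	}
	return job.PodGroup.MinMember, true
}

func main() {
	jobs := []*JobInfo{
		// A job without a PodGroup, like the one named in the log above.
		{Namespace: "nzk", Name: "nzkcluster"},
		// A hypothetical job that does have a PodGroup.
		{Namespace: "demo", Name: "with-group", PodGroup: &PodGroup{MinMember: 3}},
	}
	for _, job := range jobs {
		if n, ok := minMember(job); ok {
			fmt.Printf("%s/%s: minMember=%d\n", job.Namespace, job.Name, n)
		} else {
			fmt.Printf("%s/%s: no PodGroup, skipping\n", job.Namespace, job.Name)
		}
	}
}
```

A guard like this only hides the symptom; why a job without a PodGroup reaches the Enqueue action in the first place is a separate question.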

/kind bug

/area scheduler

@volcano-sh-bot volcano-sh-bot added the kind/bug Categorizes issue or PR as related to a bug. label Dec 19, 2019
@volcano-sh-bot
Contributor

@hzxuzhonghu: The label(s) area/scheduler cannot be applied. These labels are supported: ``

In response to this:

It seems that when Enqueue action executes, the job.PodGroup is still not populated. This is a bug.

/kind bug

/area scheduler

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k82cn
Member

k82cn commented Dec 19, 2019

/area scheduling
/priority important-soon

@volcano-sh-bot volcano-sh-bot added area/scheduling priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. labels Dec 19, 2019
@k82cn k82cn added this to the v0.4 milestone Dec 20, 2019
@k82cn
Member

k82cn commented Dec 24, 2019

@jianxingzhe, what kind of workload did you submit to Volcano? Could you share the YAML file?

@k82cn
Member

k82cn commented Dec 24, 2019

It seems that when Enqueue action executes, the job.PodGroup is still not populated. This is a bug.

@hzxuzhonghu, in Snapshot in cache.go we do not return a job without a PodGroup. Is there any place where we set it back to nil?

@hzxuzhonghu
Collaborator

in Snapshot of cache.go, we did not return job without PodGroup

Can you link the lines?

@hzxuzhonghu
Collaborator

BTW, I do not think we should leave a critical bug until 0.4, which is 3 months away.

@jianxingzhe
Author

@jianxingzhe, what kind of workload did you submit to Volcano? Could you share the YAML file?

I submitted nothing to Volcano. When I deploy Volcano in my cluster, the scheduler panics.

@hzxuzhonghu
Collaborator

What's this <nzk/nzkcluster>?

@jianxingzhe
Author

jianxingzhe commented Dec 27, 2019

What's this <nzk/nzkcluster>?

This is my CRD resource, which already existed in my cluster before I deployed Volcano.

root@paas-operator-0:~# kubectl get nzkcluster -n nzk
NAME         READY   NODEPORT   AGE
nzkcluster   0       31685      16d

root@paas-operator-0:~# kubectl get Job -n nzk
No resources found.

@hzxuzhonghu
Collaborator

I thought nzk/nzkcluster would be a pod with its scheduler name set to volcano; otherwise it should not be watched by Volcano.
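
A rough sketch of the expectation stated here: a scheduler only considers pods whose scheduler name matches its own. The `Pod` type and the `responsibleFor` helper are hypothetical simplifications and not Volcano's actual code; the pod names and scheduler names are sample values.

```go
package main

import "fmt"

// Pod is a hypothetical, simplified stand-in for a Kubernetes pod;
// only the fields needed for the illustration are kept.
type Pod struct {
	Namespace     string
	Name          string
	SchedulerName string
}

// responsibleFor reports whether a scheduler with the given name should
// consider the pod, based solely on the pod's scheduler name.
func responsibleFor(pod Pod, schedulerName string) bool {
	return pod.SchedulerName == schedulerName
}

func main() {
	pods := []Pod{
		{Namespace: "nzk", Name: "nzkcluster-0", SchedulerName: "default-scheduler"},
		{Namespace: "demo", Name: "batch-0", SchedulerName: "volcano"},
	}
	for _, p := range pods {
		fmt.Printf("%s/%s handled by volcano: %v\n",
			p.Namespace, p.Name, responsibleFor(p, "volcano"))
	}
}
```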

@hzxuzhonghu
Collaborator

Try this: kubectl get vcjob -n nzk

@jianxingzhe
Author

Try this: kubectl get vcjob -n nzk

I'm sure there is no pod with the scheduler name set to volcano. When I delete the nzk namespace, the Volcano scheduler starts successfully.

root@paas-operator-0:~# kubectl get pod -n nzk
NAME                            READY   STATUS             RESTARTS   AGE
nzk-operator-6d68d5756b-6bfd8   1/1     Running            0          18d
nzk-operator-6d68d5756b-7mgns   0/1     Evicted            0          18d
nzk-operator-6d68d5756b-7qmzp   0/1     CrashLoopBackOff   3323       18d
nzk-operator-6d68d5756b-lc6wz   0/1     Evicted            0          17d
nzk-operator-6d68d5756b-t6bqs   0/1     CrashLoopBackOff   2780       17d
nzkcluster-0                    0/1     CrashLoopBackOff   2787       17d
root@paas-operator-0:~# kubectl get pod -n nzk nzkcluster-0 -o yaml | grep scheduler
  schedulerName: default-scheduler
root@paas-operator-0:~# kubectl get nzkcluster -n nzk -o yaml | grep scheduler
root@paas-operator-0:~# kubectl get vcjob -n nzk
No resources found.

@jianxingzhe
Author

In another cluster I hit the same error. I added some logging to the Volcano scheduler:

        for _, value := range sc.Jobs {
                // If no scheduling spec, does not handle it.
                if value.PodGroup == nil && value.PDB == nil {
                        klog.V(4).Infof("The scheduling spec of Job <%v:%s/%s> is nil, ignore it.",
                                value.UID, value.Namespace, value.Name)

                        continue
                }

                if _, found := snapshot.Queues[value.Queue]; !found {
                        klog.V(3).Infof("The Queue <%v> of Job <%v/%v> does not exist, ignore it.",
                                value.Queue, value.Namespace, value.Name)
                        continue
                }

                klog.V(3).Infof("add Job: %v", value) // print the job info

                wg.Add(1)
                go cloneJob(value)
        }

I found that the Volcano scheduler adds some jobs automatically, and these jobs cannot be found in my cluster. I cannot find the code where these jobs are added to the scheduler cache. @hzxuzhonghu

I1229 04:59:21.663117       1 shared_informer.go:123] caches populated
I1229 04:59:21.663152       1 scheduler.go:72] Start scheduling ...
I1229 04:59:21.663940       1 cache.go:781] add Job: Job (d60d44fe-29f7-11ea-8524-246e9627db94): namespace default (default), name nzk-chaos, minAvailable 0, podGroup <nil>
I1229 04:59:21.664005       1 cache.go:781] add Job: Job (cbdb9912-1c94-11ea-bf89-246e9627db94): namespace panther (default), name demo1-es-default, minAvailable 1, podGroup <nil>
I1229 04:59:21.664026       1 cache.go:781] add Job: Job (1730e308-2791-11ea-8524-246e9627db94): namespace hw-elasticsearch-new9 (default), name panther-sample-02-es-default, minAvailable 7, podGroup <nil>
I1229 04:59:21.664043       1 cache.go:781] add Job: Job (87592197-2855-11ea-8524-246e9627db94): namespace nes-elasticsearch (default), name panther-sample-es-default, minAvailable 15, podGroup <nil>
I1229 04:59:21.664058       1 cache.go:781] add Job: Job (e8d6d77c-209c-11ea-8524-246e9627db94): namespace default (default), name example, minAvailable 0, podGroup <nil>
I1229 04:59:21.664075       1 cache.go:781] add Job: Job (f5c5ce78-1662-11ea-bf89-246e9627db94): namespace zookeeper (default), name example, minAvailable 0, podGroup <nil>
I1229 04:59:21.664099       1 cache.go:788] There are <6> Jobs, <1> Queues and <4> Nodes in total for scheduling.
I1229 04:59:21.664123       1 session.go:135] Open Session fa06dd0a-29f7-11ea-bf37-ce2c7f6a1947 with <6> Job and <1> Queues
I1229 04:59:21.664411       1 proportion.go:67] The total resource is <cpu 180000.00, memory 1343110746112.00, hugepages-1Gi 0.00, hugepages-2Mi 8589934592000.00>
I1229 04:59:21.664451       1 proportion.go:71] Considering Job <panther/demo1-es-default>.
I1229 04:59:21.664467       1 proportion.go:85] Added Queue <default> attributes.
I1229 04:59:21.664479       1 proportion.go:71] Considering Job <hw-elasticsearch-new9/panther-sample-02-es-default>.
I1229 04:59:21.664489       1 proportion.go:71] Considering Job <nes-elasticsearch/panther-sample-es-default>.
I1229 04:59:21.664497       1 proportion.go:71] Considering Job <default/example>.
I1229 04:59:21.664505       1 proportion.go:71] Considering Job <zookeeper/example>.
I1229 04:59:21.664514       1 proportion.go:71] Considering Job <default/nzk-chaos>.
I1229 04:59:21.664529       1 proportion.go:127] Considering Queue <default>: weight <1>, total weight <1>.
I1229 04:59:21.664550       1 proportion.go:144] The attributes of queue <default> in proportion: deserved <cpu 180000.00, memory 1343110746112.00, hugepages-2Mi 8589934592000.00, hugepages-1Gi 0.00>, allocate <cpu 0.00, memory 0.00>, request <cpu 0.00, memory 0.00>, share <0.00>
I1229 04:59:21.664698       1 proportion.go:154] Exiting when remaining is empty:  <cpu 0.00, memory 0.00, hugepages-2Mi 0.00, hugepages-1Gi 0.00>
I1229 04:59:21.664895       1 binpack.go:158] Enter binpack plugin ...
I1229 04:59:21.664907       1 binpack.go:177] resources [] record in weight but not found on any node
I1229 04:59:21.664920       1 binpack.go:161] Leaving binpack plugin. binpack.weight[1], binpack.cpu[1], binpack.memory[1], no extend resources. ...
I1229 04:59:21.664947       1 enqueue.go:55] Enter Enqueue ...
I1229 04:59:21.664960       1 enqueue.go:70] Added Queue <default> for Job <default/example>
I1229 04:59:21.664989       1 panic.go:522] Leaving Enqueue ...
I1229 04:59:21.665066       1 panic.go:522] End scheduling ...
E1229 04:59:21.665228       1 runtime.go:69] Observed a panic: "invalid memory address or nil pointer dereference" (runtime error: invalid memory address or nil pointer dereference)

root@test3:~# kubectl get vcjob -n default
No resources found.
root@test3:~# kubectl get vcjob -n panther
No resources found.
root@test3:~# kubectl get vcjob -n hw-elasticsearch-new9
No resources found.
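
One possible reading of the quoted loop and the log lines above: the filter skips a job only when both PodGroup and PDB are nil, so every logged job with podGroup <nil> must have had a non-nil PDB to get through. The sketch below reproduces just that condition with hypothetical, simplified types. The stricter variant is an assumption for comparison, not the project's actual fix.

```go
package main

import "fmt"

// PodGroup, PDB and JobInfo are hypothetical, simplified stand-ins for
// the scheduler cache types.
type PodGroup struct{}

type PDB struct {
	MinAvailable int32
}

type JobInfo struct {
	Name     string
	PodGroup *PodGroup
	PDB      *PDB
}

// skippedByQuotedFilter mirrors the condition in the quoted loop:
// a job is ignored only when BOTH PodGroup and PDB are nil.
func skippedByQuotedFilter(j *JobInfo) bool {
	return j.PodGroup == nil && j.PDB == nil
}

// skippedByStricterFilter is an assumed alternative for comparison:
// ignore every job that has no PodGroup, whether or not a PDB is set.
func skippedByStricterFilter(j *JobInfo) bool {
	return j.PodGroup == nil
}

func main() {
	jobs := []*JobInfo{
		{Name: "pdb-only", PDB: &PDB{MinAvailable: 1}},
		{Name: "neither"},
		{Name: "podgroup", PodGroup: &PodGroup{}},
	}
	for _, j := range jobs {
		fmt.Printf("%s: quoted filter skips=%v, stricter filter skips=%v\n",
			j.Name, skippedByQuotedFilter(j), skippedByStricterFilter(j))
	}
}
```

The two filters differ only for the pdb-only job: the quoted condition lets it through with a nil PodGroup, which any later code that dereferences PodGroup would then trip over.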

@hzxuzhonghu
Collaborator

How did you deploy volcano? And which version do you use?

@jianxingzhe
Author

jianxingzhe commented Dec 30, 2019

How did you deploy volcano? And which version do you use?

@hzxuzhonghu I deployed Volcano with this YAML file:
https://github.com/volcano-sh/volcano/blob/master/installer/volcano-development.yaml

@hzxuzhonghu
Collaborator

On your own k8s?

@jianxingzhe
Author

On your own k8s?

Yes, on our dev k8s cluster.

Client Version: version.Info{Major:"1", Minor:"13", GitVersion:"v1.13.12", GitCommit:"a8b52209ee172232b6db7a6e0ce2adc77458829f", GitTreeState:"clean", BuildDate:"2019-10-15T12:12:15Z", GoVersion:"go1.11.13", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"13", GitVersion:"v1.13.12", GitCommit:"a8b52209ee172232b6db7a6e0ce2adc77458829f", GitTreeState:"clean", BuildDate:"2019-10-15T12:04:30Z", GoVersion:"go1.11.13", Compiler:"gc", Platform:"linux/amd64"}

@hzxuzhonghu
Collaborator

I1229 04:59:21.664005       1 cache.go:781] add Job: Job (cbdb9912-1c94-11ea-bf89-246e9627db94): namespace panther (default), name demo1-es-default, minAvailable 1, podGroup <nil>
I1229 04:59:21.664026       1 cache.go:781] add Job: Job (1730e308-2791-11ea-8524-246e9627db94): namespace hw-elasticsearch-new9 (default), name panther-sample-02-es-default, minAvailable 7, podGroup <nil>
I1229 04:59:21.664043       1 cache.go:781] add Job: Job (87592197-2855-11ea-8524-246e9627db94): namespace nes-elasticsearch (default), name panther-sample-es-default, minAvailable 15, podGroup <nil>
I1229 04:59:21.664058       1 cache.go:781] add Job: Job (e8d6d77c-209c-11ea-8524-246e9627db94): namespace default (default), name example, minAvailable 0, podGroup <nil>
I1229 04:59:21.664075       1 cache.go:781] add Job: Job (f5c5ce78-1662-11ea-bf89-246e9627db94): namespace zookeeper (default), name example, minAvailable 0, podGroup <nil>
I1229 04:59:21.664099       1 cache.go:788] The

I am curious what these jobs are.
