VCjob stuck in crashLoopBackOff while exact same regular job simply finishes #1075

kbroos2 · 2020-10-01T13:15:50Z

Is this a BUG REPORT or FEATURE REQUEST?:

/kind bug

What happened:
While trying out the different features of volcano.sh, I launched some very simple VCJobs. Although the containers complete successfully, the pods end up in a CrashLoopBackOff state afterwards. When I launch the exact same workload as a regular Kubernetes Job, the pod completes and ends up in a "Completed" state.

What you expected to happen:
Pod enters completed state --> vcjob in completed state --> removal from queue

How to reproduce it (as minimally and precisely as possible):
The test queue exists and has sufficient resources available.
The vcjob:

kind: Job
metadata:
  name: job-5-default
  namespace: default
spec:
  minAvailable: 1
  schedulerName: volcano
  queue: test
  policies:
    - event: PodEvicted
      action: RestartJob
  tasks:
    - replicas: 1
      name: pi
      policies:
      - event: TaskCompleted
        action: CompleteJob
      template:
        spec:
          containers:
            - image: perl
              name: pi
              command: ["perl", "-Mbignum=bpi", "-wle", "print bpi(2000)"]
              resources:
                requests:
                  cpu: 0.4
                limits:
                  cpu: 0.4

Result of kubectl get vcjob / get q / get pods:

k get vcjob
NAME            AGE
job-4-default   4h8m
job-5-default   152m
❯ k get q
NAME       AGE
awesomeo   5h12m
default    2d11h
test       2d10h
❯ k describe q test
Name:         test
Namespace:
Labels:       <none>
Annotations:  kubectl.kubernetes.io/last-applied-configuration:
                {"apiVersion":"scheduling.volcano.sh/v1beta1","kind":"Queue","metadata":{"annotations":{},"name":"test"},"spec":{"capability":{"cpu":2},"r...
API Version:  scheduling.volcano.sh/v1beta1
Kind:         Queue
Metadata:
  Creation Timestamp:  2020-09-28T13:04:07Z
  Generation:          1
  Resource Version:    157483
  Self Link:           /apis/scheduling.volcano.sh/v1beta1/queues/test
  UID:                 6488d6c1-9a0c-4c4d-900a-080aa78ec811
Spec:
  Capability:
    Cpu:        2
  Reclaimable:  false
  Weight:       1
Status:
  Running:  1
  State:    Open
Events:     <none>
❯ k get pods
NAME                              READY   STATUS             RESTARTS   AGE
job-4-default-nginx-0             0/1     CrashLoopBackOff   29         4h8m
job-5-default-pi-0                0/1     CrashLoopBackOff   32         153m
keycloak-1587565613-0             1/1     Terminating        0          161d
pi-rs68x                          0/1     Completed          0          161m
the-deployment-7fd7749979-67dvz   1/1     Running            1          2d13h
the-deployment-7fd7749979-84jsb   1/1     Terminating        1          161d
the-deployment-7fd7749979-fdnq7   1/1     Running            1          2d13h
the-deployment-7fd7749979-plsxx   1/1     Terminating        1          161d
the-deployment-7fd7749979-rggsr   1/1     Running            1          2d13h
the-deployment-7fd7749979-t5ctb   1/1     Terminating        1          161d

Result of kubectl logs job-5-default-pi-0:

3.1415926535897932384626433832....

Result of kubectl describe pod:

Name:               job-5-default-pi-0
Namespace:          default
Priority:           0
PriorityClassName:  <none>
Node:               minikube/192.168.99.117
Start Time:         Wed, 30 Sep 2020 22:42:42 +0200
Labels:             volcano.sh/job-name=job-5-default
                    volcano.sh/job-namespace=default
Annotations:        scheduling.k8s.io/group-name: job-5-default
                    volcano.sh/job-name: job-5-default
                    volcano.sh/job-version: 0
                    volcano.sh/task-spec: pi
Status:             Running
IP:                 172.17.0.10
Controlled By:      Job/job-5-default
Containers:
  pi:
    Container ID:  docker://4a64097d1bc50d3dacaf7a1f0a14b7f6b337c2fe920dd19f2e9f45b1094951d0
    Image:         perl
    Image ID:      docker-pullable://perl@sha256:a107c1e913309e35792b9e29caf29b15e44f64eab263e1f00beaa0dc7ea26fdb
    Port:          <none>
    Host Port:     <none>
    Command:
      perl
      -Mbignum=bpi
      -wle
      print bpi(2000)
    State:          Waiting
      Reason:       CrashLoopBackOff
    Last State:     Terminated
      Reason:       Completed
      Exit Code:    0
      Started:      Thu, 01 Oct 2020 01:16:01 +0200
      Finished:     Thu, 01 Oct 2020 01:16:15 +0200
    Ready:          False
    Restart Count:  33
    Limits:
      cpu:  400m
    Requests:
      cpu:        400m
    Environment:  <none>
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-g59z7 (ro)
Conditions:
  Type              Status
  Initialized       True
  Ready             False
  ContainersReady   False
  PodScheduled      True
Volumes:
  default-token-g59z7:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  default-token-g59z7
    Optional:    false
QoS Class:       Burstable
Node-Selectors:  <none>
Tolerations:     node.kubernetes.io/not-ready:NoExecute for 300s
                 node.kubernetes.io/unreachable:NoExecute for 300s
Events:
  Type     Reason   Age                  From               Message
  ----     ------   ----                 ----               -------
  Normal   Pulled   14h (x25 over 16h)   kubelet, minikube  Successfully pulled image "perl"
  Warning  BackOff  13h (x650 over 16h)  kubelet, minikube  Back-off restarting failed container

Anything else we need to know?:
Running the same workload as a regular Kubernetes Job finishes normally.

job.yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: pi
spec:
  template:
    spec:
      containers:
      - name: pi
        image: perl
        command: ["perl",  "-Mbignum=bpi", "-wle", "print bpi(2000)"]
      restartPolicy: Never
  backoffLimit: 4

It simply finishes and ends in a Completed state:

Name:               pi-rs68x
Namespace:          default
Priority:           0
PriorityClassName:  <none>
Node:               minikube/192.168.99.117
Start Time:         Wed, 30 Sep 2020 22:33:30 +0200
Labels:             controller-uid=9dee8904-af18-4a66-afe7-f38db713ede5
                    job-name=pi
Annotations:        <none>
Status:             Succeeded
IP:                 172.17.0.12
Controlled By:      Job/pi
Containers:
  pi:
    Container ID:  docker://c72857afd9801129a61b524f75492f5ba8effc8db954729288037a7062a63a32
    Image:         perl
    Image ID:      docker-pullable://perl@sha256:a107c1e913309e35792b9e29caf29b15e44f64eab263e1f00beaa0dc7ea26fdb
    Port:          <none>
    Host Port:     <none>
    Command:
      perl
      -Mbignum=bpi
      -wle
      print bpi(2000)
    State:          Terminated
      Reason:       Completed
      Exit Code:    0
      Started:      Wed, 30 Sep 2020 22:33:36 +0200
      Finished:     Wed, 30 Sep 2020 22:33:41 +0200
    Ready:          False
    Restart Count:  0
    Environment:    <none>
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-g59z7 (ro)
Conditions:
  Type              Status
  Initialized       True
  Ready             False
  ContainersReady   False
  PodScheduled      True
Volumes:
  default-token-g59z7:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  default-token-g59z7
    Optional:    false
QoS Class:       BestEffort
Node-Selectors:  <none>
Tolerations:     node.kubernetes.io/not-ready:NoExecute for 300s
                 node.kubernetes.io/unreachable:NoExecute for 300s
Events:          <none>

I also tried this with the default nginx example from the volcano.sh site (sleep 10m) and with a custom Python script in a Python container. All yield the same result.

Environment:

Volcano Version: latest from github master, chartversion 0.1
Kubernetes version (use kubectl version): 1.17.3
Cloud provider or hardware configuration: minikube on virtualbox
OS (e.g. from /etc/os-release):
NAME=Buildroot
VERSION=2019.02.9
ID=buildroot
VERSION_ID=2019.02.9
PRETTY_NAME="Buildroot 2019.02.9"
Kernel (e.g. uname -a): Linux minikube 4.19.94 #1 SMP Fri Mar 6 11:41:28 PST 2020 x86_64 GNU/Linux
Install tools: volcano.sh basic documentation - used helm install
Others:


k82cn · 2020-10-03T01:26:48Z

please also put restartPolicy: Never into VCJob's pod template, similar to k8s job :)
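
For reference, a minimal sketch of the reproduction manifest with that field added. The `apiVersion` line is an assumption (the manifest in the report omits it); everything else is copied from the report. A pod template without `restartPolicy` defaults to `Always`, so the kubelet restarts the container even after it exits with code 0, and the growing restart delay shows up as CrashLoopBackOff.

```yaml
apiVersion: batch.volcano.sh/v1alpha1   # assumption: not shown in the original report
kind: Job
metadata:
  name: job-5-default
  namespace: default
spec:
  minAvailable: 1
  schedulerName: volcano
  queue: test
  policies:
    - event: PodEvicted
      action: RestartJob
  tasks:
    - replicas: 1
      name: pi
      policies:
        - event: TaskCompleted
          action: CompleteJob
      template:
        spec:
          # Added line: stop the kubelet from restarting the container
          # once it has exited, mirroring the batch/v1 Job example.
          restartPolicy: Never
          containers:
            - image: perl
              name: pi
              command: ["perl", "-Mbignum=bpi", "-wle", "print bpi(2000)"]
              resources:
                requests:
                  cpu: 0.4
                limits:
                  cpu: 0.4
```

With this in place the pod should be able to reach a terminal phase instead of being restarted, which is what the `TaskCompleted` policy needs in order to complete the job.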

kbroos2 · 2020-10-03T10:34:03Z

Thanks! That works indeed. :) When it's pointed out, it is perfectly obvious now.

Just a heads up: the documentation on volcano.sh is therefore not entirely correct:
step 2 at https://volcano.sh/en/docs/tutorials/ does not specify this, so the tutorial gets stuck in this endless CrashLoopBackOff cycle.
Other places on the volcano.sh website still seem to be correct (e.g. https://volcano.sh/en/docs/vcjob/)

Thanks Again :)

Best regards!

k82cn · 2020-10-07T01:48:40Z

> Just a heads up: the documentation on volcano.sh is therefore not entirely correct:
> step 2 at https://volcano.sh/en/docs/tutorials/ does not specify this, so the tutorial gets stuck in this endless CrashLoopBackOff cycle.

Thanks very much for your suggestion!

@Thor-wl please help to update related document to close this issue :)

Thor-wl · 2020-10-09T01:34:09Z

> Thanks! That works indeed. :) When it's pointed out, it is perfectly obvious now.
>
> Just a heads up: the documentation on volcano.sh is therefore not entirely correct:
> step 2 at https://volcano.sh/en/docs/tutorials/ does not specify this, so the tutorial gets stuck in this endless CrashLoopBackOff cycle.
> Other places on the volcano.sh website still seem to be correct (e.g. https://volcano.sh/en/docs/vcjob/)
>
> Thanks Again :)
>
> Best regards!

Thanks for your report! I will correct it.

Thor-wl · 2020-10-10T06:00:55Z

> > Just a heads up: the documentation on volcano.sh is therefore not entirely correct:
> > step 2 at https://volcano.sh/en/docs/tutorials/ does not specify this, so the tutorial gets stuck in this endless CrashLoopBackOff cycle.
>
> Thanks very much for your suggestion!
>
> @Thor-wl please help to update related document to close this issue :)

Related PR: volcano-sh/website#100

k82cn · 2020-10-14T00:38:41Z

fixed by volcano-sh/website#100 :)

volcano-sh-bot added the kind/bug Categorizes issue or PR as related to a bug. label Oct 1, 2020

kbroos2 closed this as completed Oct 3, 2020

k82cn reopened this Oct 7, 2020

k82cn closed this as completed Oct 14, 2020
