
SLA plugin doesn't work on batch/v1 Job objects; sla-waiting-time from volcano-scheduler.conf is ignored #1901

Closed
adamnovak opened this issue Dec 13, 2021 · 29 comments

@adamnovak

What happened:

As mentioned in #1869 I am using Volcano to schedule Kubernetes Job objects, to try and prevent smaller jobs submitted later from immediately filling any available space and starving larger jobs submitted earlier.

My cluster has a 96-core node with hostname "k1.kube".

I installed Volcano from the Helm chart in tag v1.4.0, using this values.yaml:

basic:
  image_tag_version: "v1.4.0"
  controller_image_name: "volcanosh/vc-controller-manager"
  scheduler_image_name: "volcanosh/vc-scheduler"
  admission_image_name: "volcanosh/vc-webhook-manager"
  admission_secret_name: "volcano-admission-secret"
  admission_config_file: "config/volcano-admission.conf"
  scheduler_config_file: "config/volcano-scheduler.conf"

  image_pull_secret: ""
  admission_port: 8443
  crd_version: "v1"
custom:
  metrics_enable: "false"

I then overrode the scheduler configmap with the following and restarted the scheduler pod:

apiVersion: v1
data:
  volcano-scheduler.conf: |
    actions: "enqueue, allocate, backfill"
    tiers:
    - plugins:
      - name: priority
      - name: gang
      - name: conformance
      - name: sla
        arguments:
          # Stop letting little jobs pass big jobs after the big jobs have been
          # waiting this long
          sla-waiting-time: 5m
    - plugins:
      - name: overcommit
      - name: drf
      - name: predicates
      - name: proportion
      - name: nodeorder
        arguments:
          # Maybe this will try to fill already full nodes first?
          leastrequested.weight: 0
          mostrequested.weight: 2
          nodeaffinity.weight: 3
          podaffinity.weight: 3
          balancedresource.weight: 1
          tainttoleration.weight: 1
          imagelocality.weight: 1
      - name: binpack
kind: ConfigMap
metadata:
  annotations:
    meta.helm.sh/release-name: volcano
    meta.helm.sh/release-namespace: volcano-system
  labels:
    app.kubernetes.io/managed-by: Helm
  name: volcano-scheduler-configmap
  namespace: volcano-system

So I should be using a global SLA of 5 minutes.
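For reference, applying the override and bouncing the scheduler looks roughly like this (the filename is just whatever the YAML above is saved as; the pod lookup is the same one used in the reproduction steps below):

kubectl apply -f volcano-scheduler-configmap.yml
kubectl -n volcano-system delete pod "$(kubectl get pod -n volcano-system | grep volcano-scheduler | cut -f1 -d' ')"
# Sanity-check the conf text the scheduler will read when it comes back up:
kubectl -n volcano-system get configmap volcano-scheduler-configmap -o jsonpath='{.data.volcano-scheduler\.conf}'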

Then, I prepared a test: fill up the node with some jobs, then queue a big job, then queue a bunch of smaller jobs after it:

# Clean up
kubectl delete job -l app=volcanotest

# Make 10 10 core jobs that will block out our test job for at least 2 minutes
# Make sure they don't all finish at once.
rm -f jobs_before.yml
for NUM in {1..10} ; do
cat >>jobs_before.yml <<EOF
apiVersion: batch/v1
kind: Job
metadata:
  name: presleep${NUM}
  labels:
    app: volcanotest
spec:
  template:
    spec:
      schedulerName: volcano
      nodeSelector:
        kubernetes.io/hostname: k1.kube
      containers:
      - name: main
        image: ubuntu:20.04
        command: ["sleep",  "$(( $RANDOM % 20 + 120 ))"]
        resources:
          limits:
            memory: 300M
            cpu: 10000m
            ephemeral-storage: 1G
          requests:
            memory: 300M
            cpu: 10000m
            ephemeral-storage: 1G
      restartPolicy: Never
  backoffLimit: 4
  ttlSecondsAfterFinished: 1000
---
EOF
done

# And 200 10 core jobs that, if they all pass it, will keep it blocked out for 20 minutes
# We expect it to really be blocked for more like 5-10 minutes if the SLA plugin is working.
rm -f jobs_after.yml
for NUM in {1..200} ; do
cat >>jobs_after.yml <<EOF
apiVersion: batch/v1
kind: Job
metadata:
  name: postsleep${NUM}
  labels:
    app: volcanotest
spec:
  template:
    spec:
      schedulerName: volcano
      nodeSelector:
        kubernetes.io/hostname: k1.kube
      containers:
      - name: main
        image: ubuntu:20.04
        command: ["sleep",  "$(( $RANDOM % 20 + 60 ))"]
        resources:
          limits:
            memory: 300M
            cpu: 10000m
            ephemeral-storage: 1G
          requests:
            memory: 300M
            cpu: 10000m
            ephemeral-storage: 1G
      restartPolicy: Never
  backoffLimit: 4
  ttlSecondsAfterFinished: 1000
---
EOF
done

# And the test job itself between them.
rm -f job_middle.yml
cat >job_middle.yml <<EOF
apiVersion: batch/v1
kind: Job
metadata:
  name: middle
  labels:
    app: volcanotest
spec:
  template:
    spec:
      schedulerName: volcano
      nodeSelector:
        kubernetes.io/hostname: k1.kube
      containers:
      - name: main
        image: ubuntu:20.04
        command: ["sleep", "1"]
        resources:
          limits:
            memory: 300M
            cpu: 50000m
            ephemeral-storage: 1G
          requests:
            memory: 300M
            cpu: 50000m
            ephemeral-storage: 1G
      restartPolicy: Never
  backoffLimit: 4
  ttlSecondsAfterFinished: 1000
EOF

kubectl apply -f jobs_before.yml
sleep 10
kubectl apply -f job_middle.yml
sleep 10
CREATION_TIME="$(kubectl get job middle -o jsonpath='{.metadata.creationTimestamp}')"
kubectl apply -f jobs_after.yml
# Wait for it to finish
COMPLETION_TIME=""
while [[ -z "${COMPLETION_TIME}" ]] ; do
    sleep 10
    COMPLETION_TIME="$(kubectl get job middle -o jsonpath='{.status.completionTime}')"
done
echo "Test large job was created at ${CREATION_TIME} and completed at ${COMPLETION_TIME}"

I observed jobs from jobs_after.yml being scheduled even after the job from job_middle.yml had had its pod pending for 10 minutes, which is double the global SLA time that should have been enforced.

What you expected to happen:

There shouldn't be much more than 5 minutes between the creation and completion times for the large middle job. Once the job pod from job_middle.yml has been pending for 5 minutes, no more job pods from jobs_after.yml should be scheduled by Volcano until the job from job_middle.yml has been scheduled.

How to reproduce it (as minimally and precisely as possible):
Use the Volcano helm chart, the above configmap override, kubectl -n volcano-system delete pod "$(kubectl get pod -n volcano-system | grep volcano-scheduler | cut -f1 -d' ')" to bounce the scheduler pod after reconfiguring it, and the above Bash code to generate test jobs. Adjust the hostname label selectors and job sizes as needed to fill the test cluster node you are using.

Anything else we need to know?:

Is the SLA plugin maybe not smart enough to clear out space for a job to meet the SLA from a node that matches its selectors?
Are other plugins in the config maybe scheduling stuff that the SLA plugin has decided shouldn't be scheduled yet?

The scheduler pod logs don't seem to include the string "sla", but they log a lot every second for every pod that's waiting, so I might not be able to see the startup logs or every single line ever logged.

The jobs are definitely getting PodGroups created for them. Here's the PodGroup description for the middle job when it should have been run according to the SLA but has not yet been:
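(Captured by describing the PodGroup object directly, i.e. roughly:)

kubectl -n vg describe podgroup podgroup-31600c19-2282-47f1-934b-94026d88db1e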

Name:         podgroup-31600c19-2282-47f1-934b-94026d88db1e
Namespace:    vg
Labels:       <none>
Annotations:  <none>
API Version:  scheduling.volcano.sh/v1beta1
Kind:         PodGroup
Metadata:
  Creation Timestamp:  2021-12-13T22:06:25Z
  Generation:          2
  Managed Fields:
    API Version:  scheduling.volcano.sh/v1beta1
    Fields Type:  FieldsV1
    fieldsV1:
      f:metadata:
        f:ownerReferences:
      f:spec:
        .:
        f:minMember:
        f:minResources:
          .:
          f:cpu:
          f:ephemeral-storage:
          f:memory:
        f:priorityClassName:
      f:status:
    Manager:      vc-controller-manager
    Operation:    Update
    Time:         2021-12-13T22:06:25Z
    API Version:  scheduling.volcano.sh/v1beta1
    Fields Type:  FieldsV1
    fieldsV1:
      f:status:
        f:conditions:
        f:phase:
    Manager:    vc-scheduler
    Operation:  Update
    Time:       2021-12-13T22:06:26Z
  Owner References:
    API Version:           batch/v1
    Block Owner Deletion:  true
    Controller:            true
    Kind:                  Job
    Name:                  middle
    UID:                   31600c19-2282-47f1-934b-94026d88db1e
  Resource Version:        122332555
  Self Link:               /apis/scheduling.volcano.sh/v1beta1/namespaces/vg/podgroups/podgroup-31600c19-2282-47f1-934b-94026d88db1e
  UID:                     8bee9cca-40d5-47b5-90e7-ebb1bc70059a
Spec:
  Min Member:  1
  Min Resources:
    Cpu:                  50
    Ephemeral - Storage:  1G
    Memory:               300M
  Priority Class Name:    medium-priority
  Queue:                  default
Status:
  Conditions:
    Last Transition Time:  2021-12-13T22:06:26Z
    Message:               1/1 tasks in gang unschedulable: pod group is not ready, 1 Pending, 1 minAvailable; Pending: 1 Unschedulable
    Reason:                NotEnoughResources
    Status:                True
    Transition ID:         86f1b151-92dd-4893-bcd3-c2573b3029fc
    Type:                  Unschedulable
  Phase:                   Inqueue
Events:
  Type     Reason         Age                   From     Message
  ----     ------         ----                  ----     -------
  Warning  Unschedulable  64s (x1174 over 21m)  volcano  1/1 tasks in gang unschedulable: pod group is not ready, 1 Pending, 1 minAvailable; Pending: 1 Unschedulable

Environment:

  • Volcano Version: v1.4.0
  • Kubernetes version (use kubectl version):
Server Version: version.Info{Major:"1", Minor:"19", GitVersion:"v1.19.0", GitCommit:"e19964183377d0ec2052d1f1fa930c4d7575bd50", GitTreeState:"clean", BuildDate:"2020-08-26T14:23:04Z", GoVersion:"go1.15", Compiler:"gc", Platform:"linux/amd64"}
  • Cloud provider or hardware configuration: Nodes are hosted on AWS instances.
  • OS (e.g. from /etc/os-release):
NAME="CentOS Linux"
VERSION="7 (Core)"
ID="centos"
ID_LIKE="rhel fedora"
VERSION_ID="7"
PRETTY_NAME="CentOS Linux 7 (Core)"
ANSI_COLOR="0;31"
CPE_NAME="cpe:/o:centos:centos:7"
HOME_URL="https://www.centos.org/"
BUG_REPORT_URL="https://bugs.centos.org/"

CENTOS_MANTISBT_PROJECT="CentOS-7"
CENTOS_MANTISBT_PROJECT_VERSION="7"
REDHAT_SUPPORT_PRODUCT="centos"
REDHAT_SUPPORT_PRODUCT_VERSION="7"
  • Kernel (e.g. uname -a):
Linux master.kube 5.8.7-1.el7.elrepo.x86_64 #1 SMP Fri Sep 4 13:11:18 EDT 2020 x86_64 x86_64 x86_64 GNU/Linux
  • Install tools:
helm version
version.BuildInfo{Version:"v3.7.2", GitCommit:"663a896f4a815053445eec4153677ddc24a0a361", GitTreeState:"clean", GoVersion:"go1.16.10"}
  • Others:
@adamnovak adamnovak added the kind/bug Categorizes issue or PR as related to a bug. label Dec 13, 2021
@adamnovak
Author

My config (based off the default one for the Helm chart) has an additional tier of plugins compared to what the docs for the SLA plugin suggest:

 actions: "enqueue, allocate, backfill"
  tiers:
  - plugins:
    - name: priority
    - name: gang
    - name: sla
      arguments:
        sla-waiting-time: 1h2m3s

Does the sla plugin not work if there's that other tier after it? What configurations are and are not allowed for the plugin?

@k82cn
Member

k82cn commented Dec 14, 2021

/cc @jiangkaihua @Thor-wl

@adamnovak
Author

I tried using the shorter single-tier config given above, but with sla-waiting-time set to 5m, and that didn't seem to help. I still saw later jobs passing my test job after the SLA time had elapsed.

@william-wang
Member

@adamnovak thanks for your report, let us investigate.

@Thor-wl
Contributor

Thor-wl commented Dec 22, 2021

/cc @jiangkaihua

@adamnovak
Author

Is the reproduction I provided enough to debug this, or do I need to try and turn this into a Volcano unit test?

@william-wang
Member

william-wang commented Jan 4, 2022

The info you provided is enough. We have found some clues for this issue.

  1. Firstly, you need to use volcano job instead of k8s job in the sla testing.
  2. There is a policy conflict between the gang and sla plugins. jiangkaihua is still debugging and investigating the root cause.

@jiangkaihua please post updates here whenever there is progress. Thanks.

@jiangkaihua
Contributor

@adamnovak Thank you for your question. I will give a solution first, then discuss the issue in detail.

Briefly speaking, this issue was caused by a global SLA waiting time setting combined with a short interval between the large and small jobs. So the simplest workaround is: use a Volcano job to submit jobs, so that you can use annotations to set an SLA waiting time for each individual job.

In my test, I only set an individual SLA waiting time of 5 minutes in an annotation on the middle job, like:

apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  name: middle
  annotations:
    sla-waiting-time: 5m
spec:
  schedulerName: volcano
  minAvailable: 1
  plugins:
    env: []
    svc: []
  tasks:
    - replicas: 1
      name: "test"
      template:
        spec:
          containers:
            - image: alpine
              command: ["/bin/sh", "-c", "sleep 1"]
              imagePullPolicy: IfNotPresent
              name: test
              # the CPU resource was cut down to 1/10, since I do not own a 96-core node.
              # But the proportion was kept as before: 9.6-core idle CPU in total, 10 1-core pre-jobs, and 200 1-core post-jobs.
              resources:
                limits:
                  memory: 300M
                  cpu: 5000m
                  ephemeral-storage: 1G
                requests:
                  memory: 300M
                  cpu: 5000m
                  ephemeral-storage: 1G
          restartPolicy: OnFailure

And set no global SLA waiting time in the sla plugin, like:

  volcano-scheduler.conf: |
    actions: "enqueue, allocate, backfill"
    tiers:
    - plugins:
      - name: priority
        # gang plugin was not invoked, the reason will be discussed later.
        # - name: gang
      - name: conformance
      - name: sla
        arguments:
          # Stop letting little jobs pass big jobs after the big jobs have been
          # waiting this long
          # sla-waiting-time: 5m
    - plugins:
      - name: overcommit
      - name: drf
      - name: predicates
      - name: proportion
      # I omitted the arguments of the nodeorder plugin, since I only have one node in the cluster
      - name: nodeorder
      - name: binpack
  1. I applied the jobs in this order:

    kubectl apply -f job_block.yml
    sleep 15
    kubectl apply -f jobs_before.yml
    sleep 10
    kubectl apply -f job_middle.yml
    sleep 10
    kubectl apply -f jobs_after.yml
    • and the block job is used to make sure that the idle CPU on the node is 9.6 cores:
    Allocatable:
    cpu:                16
    ephemeral-storage:  189997840889
    hugepages-1Gi:      0
    hugepages-2Mi:      0
    memory:             32835244Ki
    pods:               110
    
    Allocated resources:
    (Total limits may be over 100 percent, i.e., overcommitted.)
    Resource           Requests         Limits
    --------           --------         ------
    cpu                6400m (40%)      5550m (34%)
    memory             551658240 (1%)   656515840 (1%)
    ephemeral-storage  1104857600 (0%)  1G (0%)
    hugepages-1Gi      0 (0%)           0 (0%)
    hugepages-2Mi      0 (0%)           0 (0%)
  2. In the first 2 minutes, the pre-jobs were running, occupying 9 cores. At this point, Volcano's enqueue action prevented other pods from entering the Pending phase, which is the effect of the registered overcommit plugin. So no other pods were created.

  3. After some pre-jobs completed, CPU was released one core at a time. Since CPU was still insufficient for the middle job, it still could not create its pod, so the post-jobs got the resources and created their pods.

  4. After 5 minutes, the middle job reached its SLA, so the sla plugin forced it into the Inqueue phase and it was able to create its pod. But there were no idle resources in the cluster, so the pod stayed Pending.

  5. As post-jobs completed, the remaining post-jobs were blocked, and the idle CPU was reserved for the middle job.

  6. When there were 5 idle cores, job middle started.

  7. After job middle completed, post-jobs started.

And the Volcano job middle looked like this:

root@ecs-jiangkaihua00525528:~/jiangkaihua/yaml/volcano/vj# kubectl describe vj middle
Name:         middle
Namespace:    default
Annotations:  sla-waiting-time: 5m
API Version:  batch.volcano.sh/v1alpha1
Kind:         Job
Metadata:
  Creation Timestamp:  2022-01-04T14:33:30Z
...
Status:
  Conditions:
    Last Transition Time:  2022-01-04T14:33:30Z
    Status:                Pending
    Last Transition Time:  2022-01-04T14:39:24Z
    Status:                Running
    Last Transition Time:  2022-01-04T14:39:25Z
    Status:                Completed
  Min Available:           1
  Running Duration:        5m56.82281603s
  State:
    Last Transition Time:  2022-01-04T14:39:25Z
    Phase:                 Completed
  Succeeded:               1
  Task Status Count:
    Test:
      Phase:
        Succeeded:  1

So in my opinion, using a Volcano job is the simplest way for you to solve this.

@k82cn
Member

k82cn commented Jan 4, 2022

  1. Firstly, you need to use volcano job instead of k8s job in the sla testing.

we should support all kinds of objects :)

@william-wang
Member

  1. Firstly, you need to use volcano job instead of k8s job in the sla testing.

we should support all kinds of objects :)

Yes. Currently the SLA feature is supported for Volcano jobs first, and we will support it for k8s Jobs as well.

@k82cn
Member

k82cn commented Jan 5, 2022

  1. Firstly, you need to use volcano job instead of k8s job in the sla testing.

we should support all kinds of objects :)

Yes. Currently the SLA feature is supported for Volcano jobs first, and we will support it for k8s Jobs as well.

That's great!!!

@adamnovak
Author

@jiangkaihua I've tried a test procedure very similar to yours, and I haven't been able to get my middle job to run when its SLA expires, even when using batch.volcano.sh/v1alpha1 Job AKA vcjob.

Can you point to anything here that you think might be responsible for my not observing a pause in the flow of post-jobs having their pods created, while my middle job with an SLA has a pending pod?

Here's the script I am using:
test-volcano.sh.txt

And here's the configmap I applied:

apiVersion: v1
data:
  volcano-scheduler.conf: |
    actions: "enqueue, allocate, backfill"
    tiers:
    - plugins:
      - name: priority
      # Do not use gang scheduling, it breaks SLA.
      # See <https://github.com/volcano-sh/volcano/issues/1901#issuecomment-1004892955>
      #- name: gang
      - name: conformance
      - name: sla
        arguments: {}
        # Do not provide a time here; if everything has the same SLA, SLA won't
        # really be enforced because everything will be allowed to go Pending
    - plugins:
      - name: overcommit
      - name: drf
      - name: predicates
      - name: proportion
      - name: nodeorder
        arguments:
          # Maybe this will try to fill already full nodes first?
          leastrequested.weight: 0
          mostrequested.weight: 2
          nodeaffinity.weight: 3
          podaffinity.weight: 3
          balancedresource.weight: 1
          tainttoleration.weight: 1
          imagelocality.weight: 1
      - name: binpack
kind: ConfigMap
metadata:
  annotations:
    meta.helm.sh/release-name: volcano
    meta.helm.sh/release-namespace: volcano-system
  labels:
    app.kubernetes.io/managed-by: Helm
  name: volcano-scheduler-configmap
  namespace: volcano-system

And after applying that I restarted the volcano-scheduler pod.

The first test I ran was with everything having a nodeSelector on the pods to send everything to just one node, and with a 5m SLA on all the pods. (I think I still had it in the scheduler config at this point.) This didn't work, so I concluded the problem was that I need to set an SLA on just the big job pod, like you did. Otherwise, if all the pods are past their SLA, there's no reason for the little jobs to wait for the big job, right?

So I changed it to just put the SLA on the one big job, like you did. However, that didn't work either, and it looked like I was getting pods created for about 3 times as many Volcano jobs as could actually fit on the designated node of my 3 node cluster.

So then I figured that maybe Volcano doesn't really understand node selectors at the phase where it is deciding which vcjobs ought to have pods created for them. I scaled up my test to cover the whole cluster, and removed the node selectors.

That also didn't work. I could see that vcjobs were being limited from creating pods, and that the pool of pods being made was a better match for the free space on the cluster (the small jobs weren't being made to wait for more than about a minute). After my SLA elapsed, I saw my middle job get a pod created, which means Volcano actually is consulting and acting on the SLA. But I didn't observe any reduction in the flow of pods created for the later-submitted jobs, and they continued to have their pods schedule and run and starve out the pending pod for my middle job.

Volcano isn't the only scheduler running on the cluster, but the busiest node had about 13/96 cores allocated to other workloads, so I don't think Volcano was somehow reserving space but forgetting about the space allocated to preexisting pods from the default scheduler. Even if that was the case, I still should have seen your step 5 (post-jobs not getting pending pods until the middle job has run), right? Or does Volcano decide where it is hoping to place the SLA job in advance and allocate other resources to later jobs even when the SLA job is still pending?

Other things that could have caused trouble:

  1. My cluster can auto-scale, and when it saw the pending pods for this workload, it scaled up and added a few more nodes. That made Volcano let more vcjobs create their pods to run in the extra space, which made a few more pending pods, which made the cluster scale up a bit more. Unfortunately, the post-jobs still grabbed all this free space, and the middle job didn't schedule until the post-jobs started to run out. To me it looked like the scaling/Volcano interaction was working properly; is there anything about this that could be causing my problem? I think I can add node selectors to make sure that the jobs will be willing to run on all existing nodes but no new nodes, if that is likely to help.
  2. I have schedulerName: volcano on some of the pod specs (for the middle and post jobs) in my script, in addition to having it on the vcjob itself. Is asking for the Volcano scheduler twice going to cause a problem? Or, alternatively, do I need it on all my vcjob pods if Volcano is a non-default scheduler?

@adamnovak
Author

adamnovak commented Feb 28, 2022

I tried affinity-ing my jobs onto only my pre-existing nodes, and also removing the schedulerName from the pods and having it only on the jobs. I still don't see the expected behavior; my later jobs still don't seem to be stopping for the job with the SLA.

@jiangkaihua
Contributor

jiangkaihua commented Mar 1, 2022

@adamnovak Thank you for your question. In my opinion, the most likely cause is the auto-scale feature. You could stop auto-scaling and try again.

Let's talk about the sla plugin in detail. First, the enqueue action prevents excessive pods from being created, based on the cluster's resources; this feature is realized by the overcommit plugin. By default, the overcommit plugin should ensure that the resources needed by pods in the cluster (in Pending, Pipelined, and Running status) are no more than 120% of cluster capacity. So the excess pods from the postsleep jobs would be held back by the enqueue action and never created, and then the sla plugin could lock the resources released by the presleep jobs.

But in this scenario the cluster scaled up automatically, so cluster capacity kept growing, and the overcommit plugin and the enqueue action failed to prevent the postsleep pods from being created. That is the first problem.
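As a rough illustration of that budget (a sketch only; the numbers assume the single 96-core node from the original report and the default overcommit factor of 1.2, while the real calculation is over the whole cluster's allocatable resources):

# Illustration only: the approximate CPU budget that the enqueue action
# enforces through the overcommit plugin. Jobs whose requests would push the
# total for created pods (Running + Inqueue) past this budget are held back,
# so their pods are never created.
CAPACITY_MILLICPU=96000     # one 96-core node
OVERCOMMIT_FACTOR=1.2       # overcommit plugin default
awk -v c="$CAPACITY_MILLICPU" -v f="$OVERCOMMIT_FACTOR" \
    'BEGIN { printf "Inqueue budget: %dm CPU\n", c * f }'   # 115200m = 115.2 cores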

Meanwhile, the second problem is that the middle job has only ONE 50-core pod. The sla plugin reserves resources by taking effect in the allocate action: when one or more of a job's pods are allocated to nodes successfully but the job is too large to be fully allocated, the sla plugin caches that schedule and locks those resources for the scheduling period, so that other jobs cannot occupy the locked resources within the period. In the next scheduling period, the locked resources are released and re-allocated to the large jobs chosen by the sla plugin.

And here is the catch: if there is only 1 pod in the large job, the resources are never locked, because the allocate action fails to schedule that single pod and so no schedule is cached. The postsleep jobs then occupy those resources, since their pods have already been created and are in Pending status.

By the way, nodeSelector indeed does nothing in the enqueue action; that is a defect we should think about. Thank you! And schedulerName is inoperative in a vcjob, so you can just ignore it. : )

@adamnovak
Author

That explanation definitely helps me understand a bit more about what I am seeing.

I applied some node selectors to prevent my workload from triggering scale-up (because the autoscaler can tell the new nodes wouldn't be suitable for the pods), and that didn't seem to help.

But your description of how sla works explains that: it doesn't kick in until at least one pod finds space to run. So if all the pods in the job are large and would need some planning from the scheduler to free up enough space on any one node, none of them schedule, even with the sla plugin active and working. As soon as some space is freed, the scheduler slots a small pod into it, and never makes room for the first big pod.

Unfortunately, I think that means the sla plugin might not be designed for my use case after all; I need something that can ensure that an unending supply of small pods can't indefinitely prevent big pods in the same queue from scheduling. That might be achievable with the sla plugin if I added a tiny pod to each big job, which could then be scheduled in order to trigger the sla plugin to reserve resources.

However, it seems like there must be a better way to do this. Can I force Volcano to schedule all jobs in the order that they are submitted, for example? Can I configure overcommit to allow no more than one pending job at a time? Or can I adjust the sla plugin configuration so it operates before the first pod in a job schedules?

@jiangkaihua
Contributor

jiangkaihua commented Mar 2, 2022

@adamnovak I think adding a tiny pod to the big job may not be a correct way to solve this: even if the tiny pod got its resources, the other large pods still could not get enough resources to be scheduled, so nothing would be locked and the idle resources would still be allocated to the following pending pods.

Modifying the overcommit plugin may not help either: after scale-up the cluster capacity really did grow, so the overcommit plugin was bound to permit more pods to be created, and the allocate action then had no choice but to allocate the idle resources to the tiny pods, since it is designed to allocate as many pods as possible in each scheduling period.

So in my opinion, just stopping scale-up should do it. The overcommit plugin would calculate cluster capacity correctly, the enqueue action would block the follow-up jobs by preventing the postsleep jobs from creating pods, no pod would take resources away from the large pod, and the middle job would be able to run.

@adamnovak
Author

I tried it with some node selectors that should have blocked scale up:

# Where should we run?
#NODE_SELECTOR='nodeSelector: {"kubernetes.io/hostname": "k1.kube"}'
NODE_SELECTOR='affinity: {"nodeAffinity": {"requiredDuringSchedulingIgnoredDuringExecution": {"nodeSelectorTerms": [{"matchExpressions": [{"key": "kubernetes.io/hostname", "operator": "In", "values": ["k1.kube", "k2.kube", "k3.kube"]}]}]}}}'

I still saw the small pods starving my large pod. If I remember correctly, my node selectors worked and the cluster did not change in size during that run.

I could try and take the autoscaler offline completely for the test, but I don't think I'll see a different result. Moreover, we eventually do want our cluster to autoscale, up to a limit, when there's lots of work to be done, so if I do get it working with the autoscaler off, my next step would be to figure out how to make it work with the autoscaler on.

My model of what is happening here is:

  • My cluster has 3 nodes of 96 cores (plus the negligible leader node), for 288 total cores.
  • My small jobs take 10 cores and my large job takes 50 cores.
  • The overcommit plugin is running at its default factor of 1.20, so it is willing to allow inqueue and running jobs up to 345.6 cores.
  • A few cores on some nodes are used by existing pods, and the last 6 cores of each node aren't actually able to fit my 10-core filler jobs, so ~3 of my postsleep jobs that might otherwise be running will stay pending.
  • Overcommit sees the 57.6 overcommit-able cores, and thinks it can have my big job and (maybe?) one other postsleep job inqueue.
  • So I end up with my big job and 3 or 4 postsleep jobs pending.
  • And then unless 5 jobs all simultaneously finish on the same node, there's never room for my big middle job, because the pending postsleep jobs get slotted into the available space as soon as it opens, and more postsleep jobs are allowed inqueue to replace the ones that finished.

If this is the case, making the filler jobs smaller might help, because edge/node boundary effects might be partly responsible for the problem.

I'll do another run and look at the more verbose log.

@adamnovak
Author

I did a couple more runs. In the first I actually didn't manage to starve the job and it completed promptly, but in the second it was starved for about 20 minutes before I gave up.

The set of pods looks more or less like what I described above:

Wed Mar  2 10:34:31 PST 2022
NAME                                                           READY   STATUS      RESTARTS   AGE
buildkit-amd64-86dd9b949-r5skj                                 1/1     Running     0          174d
gitlab-vgteam-kubernetes-runner-gitlab-runner-c58d75c4-5xgbf   1/1     Running     0          55d
jas-construct-pt-hprc-mc38-f1-gc38-k32-0302-1446-dbqcv         1/1     Running     0          4h48m
middle-middle-0                                                0/1     Pending     0          16m
postsleep249-postsleep249-0                                    1/1     Running     0          75s
postsleep250-postsleep250-0                                    1/1     Running     0          75s
postsleep251-postsleep251-0                                    1/1     Running     0          72s
postsleep252-postsleep252-0                                    1/1     Running     0          70s
postsleep253-postsleep253-0                                    1/1     Running     0          66s
postsleep254-postsleep254-0                                    1/1     Running     0          64s
postsleep255-postsleep255-0                                    1/1     Running     0          62s
postsleep256-postsleep256-0                                    1/1     Running     0          56s
postsleep257-postsleep257-0                                    1/1     Running     0          53s
postsleep258-postsleep258-0                                    1/1     Running     0          54s
postsleep259-postsleep259-0                                    1/1     Running     0          52s
postsleep260-postsleep260-0                                    1/1     Running     0          51s
postsleep261-postsleep261-0                                    1/1     Running     0          48s
postsleep262-postsleep262-0                                    1/1     Running     0          46s
postsleep263-postsleep263-0                                    1/1     Running     0          44s
postsleep264-postsleep264-0                                    1/1     Running     0          39s
postsleep265-postsleep265-0                                    1/1     Running     0          31s
postsleep266-postsleep266-0                                    1/1     Running     0          30s
postsleep267-postsleep267-0                                    1/1     Running     0          29s
postsleep268-postsleep268-0                                    1/1     Running     0          25s
postsleep269-postsleep269-0                                    1/1     Running     0          8s
postsleep270-postsleep270-0                                    0/1     Pending     0          4s
postsleep271-postsleep271-0                                    0/1     Pending     0          3s
postsleep272-postsleep272-0                                    0/1     Pending     0          3s

jas-construct-pt-hprc-mc38-f1-gc38-k32-0302-1446-dbqcv is another workload on the default scheduler taking 24 cores, and about 20 cores are used by other pods or pods in other namespaces.

96 cores/node * 3 nodes = 288 cores

288 cores - 44 cores used = 244 cores free

244 * 1.2 = 292.8 cores to queue for

292.8 cores to queue for - 50 cores for pending middle job = 242.8 cores for running or pending postsleep jobs

242.8 / 10 = ~24 running or pending postsleep jobs

And as you can see above, I have about 24 running or pending postsleep jobs. Three of them end up pending because they can't actually fit into the space left on any single node, and any job that finishes is immediately replaced by one of those three.

@jiangkaihua
Contributor

@adamnovak Thank you for your model and calculation; I understand your scenario now. I think the key issue is that the default overcommit factor was too large in your case, so there were always postsleep jobs getting through enqueue and creating pods that then sat Pending.

How about trying an overcommit factor of 1.0? Then no new pod would be created until all currently pending pods got scheduled. The large middle job would still be able to create its pod when its SLA arrived, since the sla plugin sits in the first tier and the overcommit plugin in the second tier of the scheduler configuration. A sample config would look like this:

apiVersion: v1
data:
  volcano-scheduler.conf: |
    actions: "enqueue, allocate, backfill"
    tiers:
    - plugins:
      - name: priority
      # Do not use gang scheduling, it breaks SLA.
      # See <https://github.com/volcano-sh/volcano/issues/1901#issuecomment-1004892955>
      #- name: gang
      - name: conformance
      - name: sla
        arguments: {}
        # Do not provide a time here; if everything has the same SLA, SLA won't
        # really be enforced because everything will be allowed to go Pending
    - plugins:
      - name: overcommit
        # replace default-factor
        arguments:
          overcommit-factor: 1.0
      - name: drf
      - name: predicates
      - name: proportion
      - name: nodeorder
        arguments:
          # Maybe this will try to fill already full nodes first?
          leastrequested.weight: 0
          mostrequested.weight: 2
          nodeaffinity.weight: 3
          podaffinity.weight: 3
          balancedresource.weight: 1
          tainttoleration.weight: 1
          imagelocality.weight: 1
      - name: binpack
kind: ConfigMap
metadata:
  annotations:
    meta.helm.sh/release-name: volcano
    meta.helm.sh/release-namespace: volcano-system
  labels:
    app.kubernetes.io/managed-by: Helm
  name: volcano-scheduler-configmap
  namespace: volcano-system

@tFable

tFable commented Mar 4, 2022

Hi all, I have been following this issue closely, as SLAs are an important aspect of our scheduling strategy.

I wanted to ask specifically about this comment from the scheduler configmap that @jiangkaihua posted here:

# gang plugin was not invoked, the reason will be discussed later.
# - name: gang

@jiangkaihua, could you please elaborate on why the gang plugin is disabled here? Does it have to be disabled for SLA to work? Gang scheduling is paramount to us (and, I imagine, to most people that use Volcano) so that the pods in a single job are treated as a unit, so I wonder what the use case is for turning gang scheduling off?

Thank you and @adamnovak so much for the detailed thread on this issue.
t

@tFable

tFable commented Mar 7, 2022

Hey all,

Sorry to dogpile on this, but I have actually run into a similar issue. In my use case, I was trying to get jobs to the front of the queue by using vcjob.spec.priorityClassName; however, smaller jobs always seem to get ahead of the large, high-priority job. I opened an issue on this (#2052)

I am now starting to think that my scenario is much more similar to this issue than I thought, even though I don't have cluster auto-scaler.

The first thing I'm trying to do is set overcommit-factor: 1.0; however, even when I set this, I still see pods in the Pending state when the cluster is completely full with running pods.

Is there a way I can 100% confirm that the overcommit ratio is 1.0? I have tried setting it to a negative amount, and I do see the error message that resets it to 1.2. I have also set it to a very large amount and have indeed seen many more pods in the Pending state. So I know the overcommit-factor is being read by the code, but I'm not sure how to verify that when I set it to 1.0 there is zero overcommitment. From my observations, overcommit-factor: 1.0 is not taking effect and the default is being used instead.

I have tried to look at logs but the overcommit-factor does not seem to be printed.
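(One partial check, assuming the default Helm names: dump the conf text from the configmap the scheduler mounts, and make sure the scheduler pod was restarted after the last edit. This only confirms the configmap contents, not that the running scheduler has reloaded them.)

kubectl -n volcano-system get configmap volcano-scheduler-configmap \
  -o jsonpath='{.data.volcano-scheduler\.conf}'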

Again, sorry to pile on this but I do think this might be related to what you're seeing here.

@jiangkaihua
Contributor

jiangkaihua commented Mar 7, 2022

@tFable Thank you for your question. Let's talk about gang first. The reason I suggested disabling gang is that gang can cause trouble when sla locks resources. As you can see, sla locks resources by caching the schedule in the allocate action:

if !ssn.JobPipelined(job) {
    stmt.Discard()
}

Here the sla, gang, and tdm plugins all register ssn.jobPipelinedFns[], so these three plugins decide whether the schedule is cached in the current period.
Then let us see what happens in JobPipelined():
// JobPipelined invoke pipelined function of the plugins
// Check if job has get enough resource to run
func (ssn *Session) JobPipelined(obj interface{}) bool {
    var hasFound bool
    for _, tier := range ssn.Tiers {
        for _, plugin := range tier.Plugins {
            if !isEnabled(plugin.EnabledJobPipelined) {
                continue
            }
            jrf, found := ssn.jobPipelinedFns[plugin.Name]
            if !found {
                continue
            }
            res := jrf(obj)
            if res < 0 {
                return false
            }
            if res > 0 {
                hasFound = true
            }
        }
        // if plugin exists that votes permit, meanwhile other plugin votes abstention,
        // permit job to be pipelined, do not check next tier
        if hasFound {
            return true
        }
    }
    return true
}

As you can see, JobPipelined() goes through each tier in the scheduler configuration. Within a tier, if any registered plugin returns util.Reject (-1), JobPipelined() returns false and the cached schedule is discarded, so no resources are locked. If no plugin in the tier returns util.Reject and at least one returns util.Permit (+1), JobPipelined() returns true and the cached schedule is kept, so the resources are locked. If no plugin is registered in the tier, or all of its registered plugins return util.Abstain (0), evaluation moves on to the next tier.

The behavior of JobPipelined() can be confusing to new users, so I suggested just disabling gang if you do not need it. (With gang in the same tier as sla, a Reject vote from gang can override sla's Permit and discard the cached schedule, which defeats the resource locking.) If you want to use gang, you can put it in a lower tier, like this:

apiVersion: v1
data:
  volcano-scheduler.conf: |
    actions: "enqueue, allocate, backfill"
    tiers:
    - plugins:
      - name: priority
      - name: conformance
      - name: sla
    - plugins:
      - name: overcommit
        # replace default-factor
        arguments:
          overcommit-factor: 1.0
      - name: gang
      - name: drf
      - name: predicates
      - name: proportion
      - name: nodeorder
      - name: binpack
kind: ConfigMap
metadata:
  annotations:
    meta.helm.sh/release-name: volcano
    meta.helm.sh/release-namespace: volcano-system
  labels:
    app.kubernetes.io/managed-by: Helm
  name: volcano-scheduler-configmap
  namespace: volcano-system

@tFable

tFable commented Mar 7, 2022

Hey @jiangkaihua , thanks so much for your message!

I'm trying to digest it a bit (sorry, I'm pretty new to Volcano but I think I'm learning fast 😄 )

So is it the case that gang should NOT be used with the sla plugin?

Also, importantly, does disabling the gang plugin also disable Volcano's gang scheduling capability (i.e. "don't start any pods of this group unless ALL pods can be started")?

Thanks again!

@jiangkaihua
Contributor

jiangkaihua commented Mar 7, 2022

Hey @jiangkaihua , thanks so much for your message!

I'm trying to digest it a bit (sorry, I'm pretty new to Volcano but I think I'm learning fast 😄 )

So is it the case that gang should NOT be used with the sla plugin?

Also, importantly, does disabling the gang plugin also disable Volcano's gang scheduling capability (i.e. "don't start any pods of this group unless ALL pods can be started")?

Thanks again!

Yes, disabling the gang plugin does turn off Volcano's gang scheduling capability. You may try the configuration mentioned above; it should help. Thank you for your reply. : )

By the way, the overcommit plugin should work correctly once overcommit-factor is set properly. Which version of Volcano are you using in your scenario?

@tFable

tFable commented Mar 7, 2022

Thanks so much @jiangkaihua! It seems I replied prematurely, before seeing your recommendation to put gang in a lower tier. For some reason your comment above didn't fully load for me at first.

I'll give the above a shot. Thanks so much!

Any clues on how I can troubleshoot why I see pending pods even when I have overcommit-factor: 1.0?

What I mean is this:

  • Cluster is empty and overcommit-factor: 1.0
  • job1, regular priority, which takes up the entire cluster, is submitted and starts running
  • job2, regular priority, which takes up a small number of pods, is submitted
  • job2's pods show as Pending right away when doing kubectl get pods

With overcommit-factor: 1.0, shouldn't it be the case that job2 pods will not show up when doing kubectl get pods?

Thank you!

@tFable

tFable commented Mar 7, 2022

Also, I'm installing Volcano from this file, with the only difference being that I change the Docker images to pin to 1.5.0, because of the issue described in this ticket (#2053).

@tFable

tFable commented Mar 7, 2022

But, to be fair to @adamnovak and to not taint this issue with my replies, perhaps it's best if we take the discussion about overcommit-factor: 1.0 to the other issue I opened (#2052)? 😃

@stale

stale bot commented Jun 11, 2022

Hello 👋 Looks like there was no activity on this issue for last 90 days.
Do you mind updating us on the status? Is this still reproducible or needed? If yes, just comment on this PR or push a commit. Thanks! 🤗
If there will be no activity for 60 days, this issue will be closed (we can always reopen an issue if we need!).

@stale stale bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jun 11, 2022
@stale

stale bot commented Aug 10, 2022

Closing for now as there was no activity for last 60 days after marked as stale, let us know if you need this to be reopened! 🤗

@stale stale bot closed this as completed Aug 10, 2022