SLA plugin doesn't work on batch/v1 Job objects; sla-waiting-time from volcano-scheduler.conf is ignored #1901
Comments
My config (based off the default one for the Helm chart) has an additional tier of plugins compared to what the docs for the SLA plugin suggest:
Does the
/cc @jiangkaihua @Thor-wl
I tried using the shorter single-tier config given above, but with
@adamnovak thanks for your report, let us have an investigation.
/cc @jiangkaihua
Is the reproduction I provided enough to debug this, or do I need to try and turn this into a Volcano unit test?
The info you provided is enough. We have found some clues for this issue.
@jiangkaihua please provide the comments here whenever there is progress. Thanks.
@adamnovak Thank you for your question. I will give the solution first, then discuss the issue in detail. Briefly speaking, this issue was caused by a global SLA waiting time setting combined with a short interval between the large and small jobs. So the simplest workaround is: use Volcano Jobs to submit your workload, so that you can use annotations to set an SLA waiting time for each individual job. In my test, I only set an individual SLA waiting time of 5 minutes in the annotation for the job:
apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
name: middle
annotations:
sla-waiting-time: 5m
spec:
schedulerName: volcano
minAvailable: 1
plugins:
env: []
svc: []
tasks:
- replicas: 1
name: "test"
template:
spec:
containers:
- image: alpine
command: ["/bin/sh", "-c", "sleep 1"]
imagePullPolicy: IfNotPresent
name: test
# the CPU resource was cut down to 1/10, since I do not own a 96-core node.
# But the proportion was kept as before: 9.6-core idle CPU in total, 10 1-core pre-jobs, and 200 1-core post-jobs.
resources:
limits:
memory: 300M
cpu: 5000m
ephemeral-storage: 1G
requests:
memory: 300M
cpu: 5000m
ephemeral-storage: 1G
restartPolicy: OnFailure
And no global SLA waiting time settings in volcano-scheduler.conf:
actions: "enqueue, allocate, backfill"
tiers:
- plugins:
- name: priority
# gang plugin was not invoked, the reason will be discussed later.
# - name: gang
- name: conformance
- name: sla
arguments:
# Stop letting little jobs pass big jobs after the big jobs have been
# waiting this long
# sla-waiting-time: 5m
- plugins:
- name: overcommit
- name: drf
- name: predicates
- name: proportion
# I neglected the arguments of nodeorder plugin, since I only own one node in cluster
- name: nodeorder
- name: binpack
And the volcano job:
root@ecs-jiangkaihua00525528:~/jiangkaihua/yaml/volcano/vj# kubectl describe vj middle
Name: middle
Namespace: default
Annotations: sla-waiting-time: 5m
API Version: batch.volcano.sh/v1alpha1
Kind: Job
Metadata:
Creation Timestamp: 2022-01-04T14:33:30Z
...
Status:
Conditions:
Last Transition Time: 2022-01-04T14:33:30Z
Status: Pending
Last Transition Time: 2022-01-04T14:39:24Z
Status: Running
Last Transition Time: 2022-01-04T14:39:25Z
Status: Completed
Min Available: 1
Running Duration: 5m56.82281603s
State:
Last Transition Time: 2022-01-04T14:39:25Z
Phase: Completed
Succeeded: 1
Task Status Count:
Test:
Phase:
Succeeded: 1
So in my opinion, using a Volcano Job is the simplest way for you to solve this.
we should support all kinds of objects :)
Yes. Currently the SLA feature is supported for Volcano Jobs first, and we will support it for k8s Jobs as well.
That's great!!!
@jiangkaihua I've tried a test procedure very similar to yours, and I haven't been able to get my middle job to run when its SLA expires, even when using
Can you point to anything here that you think might be responsible for my not observing a pause in the flow of post-jobs having their pods created, while my middle job with an SLA has a pending pod?
Here's the script I am using:
And here's the configmap I applied:
And after applying that I restarted the
The first test I ran was with everything having a
So I changed it to just put the SLA on the one big job, like you did. However, that didn't work either, and it looked like I was getting pods created for about 3 times as many Volcano jobs as could actually fit on the designated node of my 3-node cluster.
So then I figured, maybe Volcano doesn't really understand NodeSelectors at the phase at which it is deciding what
That also didn't work. I could see that Volcano isn't the only scheduler running on the cluster, but the busiest node had about 13/96 cores allocated to other workloads, so I don't think Volcano was somehow reserving space but forgetting about the space allocated to preexisting pods from the default scheduler. Even if that was the case, I still should have seen your step 5 (post-jobs not getting pending pods until the middle job has run), right? Or does Volcano decide where it is hoping to place the SLA job in advance and allocate other resources to later jobs even when the SLA job is still pending?
Other things that could have caused trouble:
I tried affinity-ing my jobs onto only my pre-existing nodes, and also removing the
@adamnovak Thank you for your question. In my opinion, the most likely reason is the auto-scale feature. You may stop auto-scale and try again.
Let's talk about the
But in this scenario, the cluster scaled up automatically, causing the cluster capacity to keep growing, so
Meanwhile, the second question is that the
And here is the question: if there was only 1 pod in the large job, the resources would not be locked since
By the way,
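To illustrate the scale-up interaction being described, here is a toy Go sketch (my own simplification with assumed numbers, not Volcano's actual enqueue code) of why a cluster whose capacity keeps growing never stops admitting small jobs ahead of the waiting big job:

```go
package main

import "fmt"

// admittedSmallJobs returns how many more small jobs would fit under an
// overcommit-style budget of capacity * factor, minus what is already counted.
func admittedSmallJobs(capacityCores, usedCores, factor, smallJobCores float64) int {
	budget := capacityCores*factor - usedCores
	if budget < 0 {
		return 0
	}
	return int(budget / smallJobCores)
}

func main() {
	const factor = 1.2 // assumed default overcommit-factor
	used := 50.0       // the pending 50-core big job, already counted in the queue
	// The cluster grows by one 96-core node per step as the autoscaler reacts.
	for _, capacity := range []float64{288, 384, 480} {
		n := admittedSmallJobs(capacity, used, factor, 10)
		fmt.Printf("capacity=%v cores -> room for %d more 10-core small jobs\n", capacity, n)
	}
	// Each scale-up adds enough headroom that small jobs keep being enqueued,
	// so there is never a cycle where the big job is the only candidate left.
}
```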
That explanation definitely helps me understand a bit more about what I am seeing. I applied some node selectors to prevent my workload from triggering scale-up (because the autoscaler can tell the new nodes wouldn't be suitable for the pods), and that didn't seem to help.
But your description of how
Unfortunately, I think that means that the
However, it seems like there must be a better way to do this. Can I force Volcano to schedule all jobs in the order that they are submitted, for example? Can I configure
@adamnovak I think adding a tiny pod to the big job may not be the correct way to solve this: even if the tiny pod got its resources, the other large pods still could not get enough resources to be scheduled, so there would be no lock and the idle resources would still be allocated to the following pending pods.
Modify
So in my opinion, just stopping scale-up would take effect.
I tried it with some node selectors that should have blocked scale up:
I still saw the small pods starving my large pod. If I remember correctly, my node selectors worked and the cluster did not change in size during that run. I could try and take the autoscaler offline completely for the test, but I don't think I'll see a different result. Moreover, we eventually do want our cluster to autoscale, up to a limit, when there's lots of work to be done, so if I do get it working with the autoscaler off, my next step would be to figure out how to make it work with the autoscaler on. My model of what is happening here is:
If this is the case, making the filler jobs smaller might help, because edge/node boundary effects might be partly responsible for the problem. I'll do another run and look at the more verbose log.
I ran a couple more runs. In the first one I actually didn't manage to starve the job and it completed promptly, but in the second one I starved it for about 20 minutes before I gave up. The set of pods looks more or less like what I described above:
jas-construct-pt-hprc-mc38-f1-gc38-k32-0302-1446-dbqcv is another workload on the default scheduler taking 24 cores, and about 20 cores are used by other pods or pods in other namespaces.
96 cores/node * 3 nodes = 288 cores
288 cores - 44 cores used = 244 cores free
244 * 1.2 = 292.8 cores to queue for
292.8 cores to queue for - 50 cores for pending middle job = 242.8 cores for running or pending postsleep jobs
242.8 / 10 = ~24 running or pending postsleep jobs
And you see above I have about 24 running or pending postsleep jobs. 3 of them end up pending because they can't actually fit into any space on any single node, and then any job that finishes is immediately replaced by one of the 3.
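For reference, the same back-of-the-envelope calculation as a small Go sketch (my own model of the overcommit budget, with the numbers above hard-coded; not Volcano code):

```go
package main

import "fmt"

func main() {
	// Inputs taken from the numbers above.
	const (
		coresPerNode     = 96.0
		nodes            = 3.0
		coresUsedByOther = 44.0 // other-scheduler workload plus other namespaces
		overcommitFactor = 1.2  // apparent default overcommit-factor
		middleJobCores   = 50.0 // the pending middle job
		postsleepCores   = 10.0 // each postsleep job
	)

	total := coresPerNode * nodes          // 288 cores
	free := total - coresUsedByOther       // 244 cores
	queueBudget := free * overcommitFactor // 292.8 cores worth of jobs allowed to queue
	forPostsleep := queueBudget - middleJobCores

	fmt.Printf("total=%.1f free=%.1f budget=%.1f postsleep jobs=%.0f\n",
		total, free, queueBudget, forPostsleep/postsleepCores)
	// Prints roughly: total=288.0 free=244.0 budget=292.8 postsleep jobs=24
}
```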
@adamnovak Thank you for your model and calculation, I get your scenario now. I think the key issue is that
How about try setting
apiVersion: v1
data:
volcano-scheduler.conf: |
actions: "enqueue, allocate, backfill"
tiers:
- plugins:
- name: priority
# Do not use gang scheduling, it breaks SLA.
# See <https://github.com/volcano-sh/volcano/issues/1901#issuecomment-1004892955>
#- name: gang
- name: conformance
- name: sla
arguments: {}
# Do not provide a time here; if everything has the same SLA, SLA won't
# really be enforced because everything will be allowed to go Pending
- plugins:
- name: overcommit
# replace default-factor
arguments:
overcommit-factor: 1.0
- name: drf
- name: predicates
- name: proportion
- name: nodeorder
arguments:
# Maybe this will try to fill already full nodes first?
leastrequested.weight: 0
mostrequested.weight: 2
nodeaffinity.weight: 3
podaffinity.weight: 3
balancedresource.weight: 1
tainttoleration.weight: 1
imagelocality.weight: 1
- name: binpack
kind: ConfigMap
metadata:
annotations:
meta.helm.sh/release-name: volcano
meta.helm.sh/release-namespace: volcano-system
labels:
app.kubernetes.io/managed-by: Helm
name: volcano-scheduler-configmap
namespace: volcano-system
Hi all, I have been following this issue closely; SLAs are an important aspect of our scheduling strategy. I wanted to ask specifically about this comment from the scheduler configmap that @jiangkaihua posted here:
@jiangkaihua, could you please elaborate on why the gang plugin is disabled here? Does it have to be disabled for SLA to work? Gang scheduling is paramount to us (and, I imagine, for most people that use Volcano) to ensure that pods in a single job are treated as a unit, so I wonder what the use case is where someone would want gang scheduling off. Thank you and @adamnovak so much for the detailed thread on this issue.
Hey all, sorry to dogpile on this but I have actually run into a similar issue. In my use case, I was trying to get jobs to the front of the queue by using
I am now starting to think that my scenario is much more similar to this issue than I thought, even though I don't have the cluster auto-scaler.
The first thing that I'm trying to do is to set
Is there a way I can 100% confirm that the overcommit ratio is
I have tried to look at logs but the overcommit-factor does not seem to be printed. Again, sorry to pile on this but I do think this might be related to what you're seeing here.
@tFable Thank you for your question. Let's talk about:
volcano/pkg/scheduler/actions/allocate/allocate.go, lines 273 to 275 in 9698fbb
Here the sla, gang, and tdm plugins all registered ssn.jobPipelinedFns[], so these 3 plugins decide whether the schedule is cached in the current period. Then let us see what happens in JobPipelined():
volcano/pkg/scheduler/framework/session_plugins.go, lines 304 to 334 in 9698fbb
As you can see, JobPipelined() goes through each tier in the scheduler configuration. Within a tier, if any registered plugin returns util.Reject (-1), JobPipelined() returns FALSE and discards the schedule cache, so no resources are locked; if no plugin returns util.Reject (-1) and at least one plugin returns util.Permit (+1), JobPipelined() returns TRUE and keeps the schedule cache, so resources are locked; and if no plugin is registered in the tier, or all registered plugins return util.Abstain (0), it moves on to the next tier.
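To make that tier-by-tier logic concrete, here is a minimal Go sketch of the behaviour described above (a simplified model of JobPipelined(), not the actual session_plugins.go code; the default return when every tier abstains is an assumption):

```go
package main

import "fmt"

// Vote values mirroring util.Permit / util.Abstain / util.Reject.
const (
	Permit  = 1
	Abstain = 0
	Reject  = -1
)

// pipelineVote stands in for one registered plugin (e.g. sla, gang, tdm)
// voting on whether a job's pipelined schedule should be kept.
type pipelineVote func(job string) int

// jobPipelined walks the tiers as described above: a Reject anywhere in a
// tier discards the cached schedule; otherwise a Permit in that tier keeps
// it; a tier where everything abstains defers to the next tier.
func jobPipelined(job string, tiers [][]pipelineVote) bool {
	for _, tier := range tiers {
		hasPermit := false
		for _, vote := range tier {
			switch vote(job) {
			case Reject:
				return false // discard the schedule cache; no resources locked
			case Permit:
				hasPermit = true
			}
		}
		if hasPermit {
			return true // keep the schedule cache; resources are locked
		}
	}
	return true // assumed default when every tier abstains
}

func main() {
	slaExpired := func(string) int { return Permit }   // SLA waiting time reached
	gangNotReady := func(string) int { return Reject } // fewer than minAvailable pods ready

	// sla in a higher tier than gang: the SLA verdict wins and resources lock.
	fmt.Println(jobPipelined("middle", [][]pipelineVote{{slaExpired}, {gangNotReady}})) // true
	// sla and gang in the same tier: gang's Reject discards the lock.
	fmt.Println(jobPipelined("middle", [][]pipelineVote{{slaExpired, gangNotReady}})) // false
}
```

This is also why the configuration below keeps sla in a higher tier than gang, so an expired SLA can lock resources even while gang would still reject.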
The principle of
apiVersion: v1
data:
volcano-scheduler.conf: |
actions: "enqueue, allocate, backfill"
tiers:
- plugins:
- name: priority
- name: conformance
- name: sla
- plugins:
- name: overcommit
# replace default-factor
arguments:
overcommit-factor: 1.0
- name: gang
- name: drf
- name: predicates
- name: proportion
- name: nodeorder
- name: binpack
kind: ConfigMap
metadata:
annotations:
meta.helm.sh/release-name: volcano
meta.helm.sh/release-namespace: volcano-system
labels:
app.kubernetes.io/managed-by: Helm
name: volcano-scheduler-configmap
namespace: volcano-system
Hey @jiangkaihua, thanks so much for your message! I'm trying to digest it a bit (sorry, I'm pretty new to Volcano but I think I'm learning fast 😄).
So is it the case that
Also, importantly, does disabling the
Thanks again!
Yes, disabling
By the way,
Thanks so much @jiangkaihua! It seems that I prematurely replied before seeing your recommendation to put
I'll give the above a shot. Thanks so much!
Any clues on how I can troubleshoot why I see
What I mean is this:
With
Thank you!
But, to be fair to @adamnovak and to not taint this issue with my replies, perhaps it is best if we take that discussion about
Hello 👋 Looks like there was no activity on this issue for last 90 days.
Closing for now as there was no activity for last 60 days after marked as stale, let us know if you need this to be reopened! 🤗
What happened:
As mentioned in #1869 I am using Volcano to schedule Kubernetes Job objects, to try and prevent smaller jobs submitted later from immediately filling any available space and starving larger jobs submitted earlier.
My cluster has a 96-core node with hostname "k1.kube".
I installed Volcano from the Helm chart at tag v1.4.0, using this `values.yaml`:
And then overriding the scheduler configmap with this and restarting the scheduler pod:
So I should be using a global SLA of 5 minutes.
Then, I prepared a test: fill up the node with some jobs, then queue a big job, then queue a bunch of smaller jobs after it:
I observed jobs from `jobs_after.yml` being scheduled even when the job from `job_middle.yml` had had its pod pending for 10 minutes, which is double the global SLA time that should be being enforced.
What you expected to happen:
There shouldn't be much more than 5 minutes between the creation and completion times for the large middle job. When the job pod from `job_middle.yml` has been pending for 5 minutes, no more job pods from `jobs_after.yml` should be scheduled by Volcano until `job_middle.yml` has been scheduled.
How to reproduce it (as minimally and precisely as possible):
Use the Volcano Helm chart, the above configmap override, `kubectl -n volcano-system delete pod "$(kubectl get pod -n volcano-system | grep volcano-scheduler | cut -f1 -d' ')"` to bounce the scheduler pod after reconfiguring it, and the above Bash code to generate test jobs. Adjust the hostname label selectors and job sizes as needed to fill the test cluster node you are using.
Anything else we need to know?:
Is the SLA plugin maybe not smart enough to clear out space for a job to meet the SLA from a node that matches its selectors?
Are other plugins in the config maybe scheduling stuff that the SLA plugin has decided shouldn't be scheduled yet?
The scheduler pod logs don't seem to include the string "sla", but they log a bunch for every pod that's waiting every second, so I might not be able to see the startup logs or every single line ever logged.
The jobs are definitely getting PodGroups created for them. Here's the PodGroup description for the middle job when it should have been run according to the SLA but has not yet been:
Environment:
`kubectl version`:
`uname -a`: