[Core] Replace ray job submit for 3x/8.5x faster job scheduling for cluster/managed jobs #4318
Conversation
Thanks @Michaelvll!
# TODO(zhwu): This number should be tuned based on heuristics.
_PENDING_SUBMIT_GRACE_PERIOD = 60
_INIT_SUBMIT_GRACE_PERIOD = 60
Do we still need a grace period of 60s?
Yes, it is required, as we have two steps in our job submission:
- Add the job with the INIT state and retrieve the job id (this only reserves the job id).
- Actually add the commands for the job to the table and set it to the PENDING state.
These are two separate SSH connections, so the second step may be significantly delayed if there are any network issues.
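For illustration only, here is a minimal sketch of how such a grace period could be applied when deciding whether a stuck submission should be treated as failed; the helper name and the job-table fields are hypothetical, not the actual SkyPilot code:

```python
import time

# Grace periods (seconds) before a job stuck in a pre-run state is considered
# abandoned; the values mirror the constants in this diff.
_PENDING_SUBMIT_GRACE_PERIOD = 60
_INIT_SUBMIT_GRACE_PERIOD = 60


def _is_submission_stale(state: str, submitted_at: float) -> bool:
    """Return True if the job has sat in a pre-run state past its grace period.

    `state` and `submitted_at` are hypothetical fields read from the job
    table; the real code may track timestamps differently.
    """
    elapsed = time.time() - submitted_at
    if state == 'INIT':
        # Step 1 reserved the job id, but step 2 (adding the commands over a
        # second SSH connection) has not landed yet.
        return elapsed > _INIT_SUBMIT_GRACE_PERIOD
    if state == 'PENDING':
        # Commands are recorded, but the driver has not started yet.
        return elapsed > _PENDING_SUBMIT_GRACE_PERIOD
    return False
```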
Updated the comment for this.
An issue found:
The problem comes from that
To reproduce:
See the
After investigating the comment here (#4318 (comment)), it seems that all the job driver processes we run are under our control, and sending SIGTERM to the process group is enough, as the driver processes will correctly clean up the underlying tasks. For example, the job process in the process group above
The job driver process starts the actual user jobs as a ray task under
Hence, we don't need to start a daemon to forcefully kill the process group during cancellation, which significantly reduces the time.
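As a rough sketch of the technique described above (not necessarily the exact cancellation path in this PR), sending SIGTERM to the driver's process group could look like this; `driver_pid` is a placeholder for the PID recorded in the job table:

```python
import os
import signal


def terminate_driver_group(driver_pid: int) -> None:
    """Send SIGTERM to the driver's process group.

    Assumes the driver and its children share one process group; the driver
    is expected to catch SIGTERM and clean up its underlying Ray tasks
    itself, so no separate daemon is needed to force-kill stragglers.
    """
    try:
        pgid = os.getpgid(driver_pid)
        os.killpg(pgid, signal.SIGTERM)
    except ProcessLookupError:
        # The driver already exited; nothing to clean up.
        pass
```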
Our current process tree for the driver processes is not ideal, as everything is chained in a single tree, and canceling a job will split the tree into two; e.g. for the tree above, if we cancel 97, the process tree becomes
This works at the moment, but we should move to a more elegant solution using
Thanks for investigating @Michaelvll. I just confirmed correctness of
With 5650d26, we are now able to avoid the chain of processes. : )
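For context, a common way to keep each driver out of a single chained tree is to start it in its own session (and hence its own process group); this is only a generic sketch and may not match what 5650d26 actually does:

```python
import subprocess
from typing import List


def launch_detached(cmd: List[str], log_path: str) -> subprocess.Popen:
    """Start a job driver in its own session so it is not chained under the
    submitting process, and so killpg() on its group does not touch sibling
    jobs."""
    with open(log_path, 'w') as log_file:
        return subprocess.Popen(
            cmd,
            stdout=log_file,
            stderr=subprocess.STDOUT,
            start_new_session=True,  # setsid(): new session + process group
        )
```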
Fixes #4295 and Fixes #4293
`ray job` has introduced a significant delay in our job submission and additional memory consumption. Although Ray jobs may provide some safeguard for abnormally failed jobs, they do not provide much value for our job management when the status is handled carefully in our own job table. In this PR, we replace `ray job submit` with `subprocess` and add a new state `FAILED_DRIVER` for jobs, to distinguish user program failures from job driver failures (such as OOM).
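As a hedged sketch of the idea (not the exact code in this PR), the driver can be launched directly with `subprocess` instead of going through `ray job submit`, with the driver's PID recorded for later liveness checks; the function and argument names below are illustrative:

```python
import subprocess


def submit_job_driver(driver_cmd: str, log_path: str) -> int:
    """Start the job driver as a plain subprocess instead of `ray job submit`.

    Returns the driver's PID, which can be recorded in the job table so a
    later liveness check can mark the job FAILED_DRIVER if the driver dies
    (e.g. killed by the OOM killer) before reaching a terminal state.
    """
    with open(log_path, 'w') as log_file:
        proc = subprocess.Popen(
            driver_cmd,
            shell=True,
            stdout=log_file,
            stderr=subprocess.STDOUT,
        )
    return proc.pid
```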
Scheduling Speed for unmanaged jobs
The job scheduling is much faster: 60 seconds for 23 jobs (#4310) -> 25 seconds for 32 jobs (after reducing the CPUs per job, it can be 60 seconds for 67 jobs), i.e. 0.38 jobs/s -> 1.1 jobs/s (~3x faster).
job queue with this PR
Scheduling speed for managed jobs
Fixes #4294: scheduling can now keep up with job submission, with managed job scheduling speed going from 29s to 3.4s, i.e. 8.5x faster.
Memory consumption
32 jobs running in parallel and many jobs PENDING and being submitted
The memory consumption issue relates to #4334
Master: 8.0G
This PR: 6.2G
More jobs
Master: 58GB / 264 jobs (0.21GB/job)
This PR: 42.3GB / 264 jobs (0.16GB/job)
Correctness/Robustness
The job can get into the following situations:
- The driver sets the job status correctly: this PR has no effect on the job status.
- The driver process is not running while the job is not in a terminal state: this PR sets the job to the FAILED_DRIVER state. (Current master sets it to FAILED.)
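A minimal sketch of the kind of liveness check implied above; `psutil`, the string state names, and the helper are illustrative, assuming the job table records the driver's PID:

```python
import psutil

_TERMINAL_STATES = {'SUCCEEDED', 'FAILED', 'FAILED_DRIVER', 'CANCELLED'}


def reconcile_job(job_state: str, driver_pid: int) -> str:
    """Return the corrected job state given the driver process's liveness.

    If the driver process no longer exists (or is a zombie) while the job has
    not reached a terminal state, the job is marked FAILED_DRIVER rather than
    FAILED, so user-program failures stay distinguishable from driver-side
    failures such as OOM kills.
    """
    if job_state in _TERMINAL_STATES:
        return job_state
    try:
        proc = psutil.Process(driver_pid)
        if proc.status() == psutil.STATUS_ZOMBIE:
            return 'FAILED_DRIVER'
    except psutil.NoSuchProcess:
        return 'FAILED_DRIVER'
    return job_state
```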
Tested (run the relevant ones):
- `bash format.sh`
- `sky jobs launch` on a small jobs controller to manually trigger OOM and see if the jobs queue can handle it correctly.
- `pytest tests/test_smoke.py --aws` (except three tests in [UX] Improve Formatting of Post Job Creation Logs #4198 (comment), and `test_sky_bench` for `subprocess.CalledProcessError: Command '['aws', 's3', 'rm', '--recursive', 's3://sky-bench-c174-gcpuser/t-sky-bench-0c']' returned non-zero exit status 1.`)
- `pytest tests/test_smoke.py::test_fill_in_the_name`
- `conda deactivate; bash -i tests/backward_compatibility_tests.sh 1`
- ``sky launch -c test-queue --cloud aws --cpus 2 "echo hi"; for i in `seq 1 7`; do sky exec test-queue "echo hi; sleep 1000" -d; done``
- `sky exec test-queue "echo hi; sleep 1000" -d` should fail for the runtime version
- `sky queue; sky logs test-queue 2` should correctly run
- `sky launch -c test-queue echo hi`
- `sky cancel test-queue 2`; the old pending job scheduled correctly
- `sky cancel test-queue 3 4 5`; the new pending job scheduled correctly