[Core] Replace ray job submit for 3x/8.5x faster job scheduling for cluster/managed jobs #4318
Conversation
Thanks @Michaelvll!
# TODO(zhwu): This number should be tuned based on heuristics.
_PENDING_SUBMIT_GRACE_PERIOD = 60
_INIT_SUBMIT_GRACE_PERIOD = 60
Do we still need a grace period of 60s?
Yes, it is required, as we have two steps in our job submission:
- Add the job with the INIT state and retrieve the job id (this only reserves the job id).
- Actually add the commands for the job to the table and set it to the PENDING state.
These are two separate SSH connections, so the second step may be significantly delayed if there are any network issues.
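For illustration only, here is a minimal sketch of how such a grace period could be applied when deciding whether a stuck submission should be treated as failed; the helper name and the job-table fields are hypothetical, not the actual SkyPilot code:

```python
import time

# Grace periods (seconds) before a job stuck in a pre-run state is considered
# abandoned; the values mirror the constants in this diff.
_PENDING_SUBMIT_GRACE_PERIOD = 60
_INIT_SUBMIT_GRACE_PERIOD = 60


def _is_submission_stale(state: str, submitted_at: float) -> bool:
    """Return True if the job has sat in a pre-run state past its grace period.

    `state` and `submitted_at` are hypothetical fields read from the job
    table; the real code may track timestamps differently.
    """
    elapsed = time.time() - submitted_at
    if state == 'INIT':
        # Step 1 reserved the job id, but step 2 (adding the commands over a
        # second SSH connection) has not landed yet.
        return elapsed > _INIT_SUBMIT_GRACE_PERIOD
    if state == 'PENDING':
        # Commands are recorded, but the driver has not started yet.
        return elapsed > _PENDING_SUBMIT_GRACE_PERIOD
    return False
```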
Updated the comment for this.
An issue found:
The problem comes from that
To reproduce:
See the
After investigating the comment here (#4318 (comment)), it seems that all the job driver processes we run are under our control, and sending SIGTERM to the process group is enough, as the driver processes will correctly clean up the underlying tasks. For example, the job process in the process group above
The job driver process starts the actual user jobs as a ray task under
Hence, we don't need to start a daemon to forcefully kill the process group during cancellation, which significantly reduces the time.
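As a rough sketch of the technique described above (not necessarily the exact cancellation path in this PR), sending SIGTERM to the driver's process group could look like this; `driver_pid` is a placeholder for the PID recorded in the job table:

```python
import os
import signal


def terminate_driver_group(driver_pid: int) -> None:
    """Send SIGTERM to the driver's process group.

    Assumes the driver and its children share one process group; the driver
    is expected to catch SIGTERM and clean up its underlying Ray tasks
    itself, so no separate daemon is needed to force-kill stragglers.
    """
    try:
        pgid = os.getpgid(driver_pid)
        os.killpg(pgid, signal.SIGTERM)
    except ProcessLookupError:
        # The driver already exited; nothing to clean up.
        pass
```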
Our current process tree for the driver processes is not ideal, as everything is chained in a single tree, and canceling a job will split the tree into two; e.g. for the tree above, if we cancel 97, the process tree becomes
This works at the moment, but we should move to a more elegant solution using
Thanks for investigating @Michaelvll. I just confirmed correctness of
With 5650d26, we are now able to avoid the chain of processes. : )
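For context, a common way to keep each driver out of a single chained tree is to start it in its own session (and hence its own process group); this is only a generic sketch and may not match what 5650d26 actually does:

```python
import subprocess
from typing import List


def launch_detached(cmd: List[str], log_path: str) -> subprocess.Popen:
    """Start a job driver in its own session so it is not chained under the
    submitting process, and so killpg() on its group does not touch sibling
    jobs."""
    with open(log_path, 'w') as log_file:
        return subprocess.Popen(
            cmd,
            stdout=log_file,
            stderr=subprocess.STDOUT,
            start_new_session=True,  # setsid(): new session + process group
        )
```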
Fixes #4295 and Fixes #4293
`ray job` has introduced a significant delay in our job submission and additional memory consumption. Although Ray jobs may provide some safeguard for abnormally failed jobs, they do not provide much value for our job management when the status is handled carefully in our own job table. In this PR, we replace `ray job submit` with `subprocess` and add a new state `FAILED_DRIVER` for jobs, to distinguish user program failures from job driver failures (such as OOM).
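As a hedged sketch of the idea (not the exact code in this PR), the driver can be launched directly with `subprocess` instead of going through `ray job submit`, with the driver's PID recorded for later liveness checks; the function and argument names below are illustrative:

```python
import subprocess


def submit_job_driver(driver_cmd: str, log_path: str) -> int:
    """Start the job driver as a plain subprocess instead of `ray job submit`.

    Returns the driver's PID, which can be recorded in the job table so a
    later liveness check can mark the job FAILED_DRIVER if the driver dies
    (e.g. killed by the OOM killer) before reaching a terminal state.
    """
    with open(log_path, 'w') as log_file:
        proc = subprocess.Popen(
            driver_cmd,
            shell=True,
            stdout=log_file,
            stderr=subprocess.STDOUT,
        )
    return proc.pid
```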
Scheduling Speed for unmanaged jobs
The job scheduling is much faster: 60 seconds for 23 jobs (#4310) -> 25 seconds for 32 jobs (after reducing the CPUs per job, it can be 60 seconds for 67 jobs), i.e. 0.38 jobs/s -> 1.1 jobs/s (~3x faster).
job queue with this PR
Scheduling speed for managed jobs
Fixes #4294: scheduling can now keep up with job submission, with managed job scheduling speed going from 29s to 3.4s, i.e. 8.5x faster.
Memory consumption
32 jobs running in parallel and many jobs PENDING and being submitted
The memory consumption issue relates to #4334
Master: 8.0G
This PR: 6.2G
More jobs
Master: 58GB / 264 jobs (0.21GB/job)
This PR: 42.3GB / 264 jobs (0.16GB/job)
Correctness/Robustness
The job can get into the following situations:
- The driver sets the job status correctly: this PR has no effect on the job status.
- The driver process is not running while the job is not in a terminal state: this PR sets the job to the FAILED_DRIVER state. (Current master sets it to FAILED.)
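A minimal sketch of the kind of liveness check implied above; `psutil`, the string state names, and the helper are illustrative, assuming the job table records the driver's PID:

```python
import psutil

_TERMINAL_STATES = {'SUCCEEDED', 'FAILED', 'FAILED_DRIVER', 'CANCELLED'}


def reconcile_job(job_state: str, driver_pid: int) -> str:
    """Return the corrected job state given the driver process's liveness.

    If the driver process no longer exists (or is a zombie) while the job has
    not reached a terminal state, the job is marked FAILED_DRIVER rather than
    FAILED, so user-program failures stay distinguishable from driver-side
    failures such as OOM kills.
    """
    if job_state in _TERMINAL_STATES:
        return job_state
    try:
        proc = psutil.Process(driver_pid)
        if proc.status() == psutil.STATUS_ZOMBIE:
            return 'FAILED_DRIVER'
    except psutil.NoSuchProcess:
        return 'FAILED_DRIVER'
    return job_state
```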
Tested (run the relevant ones):
- `bash format.sh`
- `sky jobs launch` on a small jobs controller to manually trigger OOM and see if the jobs queue can handle it correctly.
- `pytest tests/test_smoke.py --aws` (except three tests in [UX] Improve Formatting of Post Job Creation Logs #4198 (comment), and `test_sky_bench` for `subprocess.CalledProcessError: Command '['aws', 's3', 'rm', '--recursive', 's3://sky-bench-c174-gcpuser/t-sky-bench-0c']' returned non-zero exit status 1.`)
- `pytest tests/test_smoke.py::test_fill_in_the_name`
- `conda deactivate; bash -i tests/backward_compatibility_tests.sh 1`
- ``sky launch -c test-queue --cloud aws --cpus 2 "echo hi"; for i in `seq 1 7`; do sky exec test-queue "echo hi; sleep 1000" -d; done``
- `sky exec test-queue "echo hi; sleep 1000" -d` should fail for the runtime version
- `sky queue; sky logs test-queue 2` should correctly run
- `sky launch -c test-queue echo hi`
- `sky cancel test-queue 2`; the old pending job scheduled correctly
- `sky cancel test-queue 3 4 5`; the new pending job scheduled correctly