[fast] if cluster is INIT, force refresh before deciding to provision (…

…#4328) * [fast] if cluster is INIT, force refresh before deciding to provision If a cluster is mid-initialization, its status will be INIT and autostop/down will not be set yet. In this case, the cluster refresh won't actually grab the cluster status lock and hard refresh the status. So, check_cluster_available will immeidately decide that the cluster is INIT and throw. This could cause a bug where many parallel launches of `sky launch --fast` that are staggered can all decide that the cluster is INIT, and all decide that they need to launch the cluster. Since cluster initialization is locked with the cluster status lock, each invocation will sychronously re-launch the cluster. Now, if we see that the cluster is INIT, we force a refresh. This will acquire the cluster status lock, which will block until any ongoing provisioning completes and the cluster is UP. If the cluster is otherwise INIT (e.g. ray cluster has been stopped abnormally) then provisioning should proceed as normal. This does not fix the race where the cluster does not exist or is STOPPED, and many simultaneously started `sky launch --fast` invocations try to create or restart the cluster. However, once the first batch complete their launches, all future invocations should correctly see the cluster as UP, not INIT - even if they are started while the first batch is still provisioning the cluster. Fixing the STOPPED or non-existent case is a bit more difficult and will probably require moving this detection logic inside the provisioning code, so that it holds the cluster status lock continuously from the status check until the cluster is UP. * update comment
skypilot-org · Nov 15, 2024 · a2278cb · a2278cb
1 parent fa798d7
commit a2278cb
Showing 1 changed file with 37 additions and 22 deletions.
diff --git a/sky/execution.py b/sky/execution.py
@@ -11,10 +11,10 @@
 from sky import admin_policy
 from sky import backends
 from sky import clouds
-from sky import exceptions
 from sky import global_user_state
 from sky import optimizer
 from sky import sky_logging
+from sky import status_lib
 from sky.backends import backend_utils
 from sky.usage import usage_lib
 from sky.utils import admin_policy_utils
@@ -463,28 +463,43 @@ def launch(
     stages = None
     # Check if cluster exists and we are doing fast provisioning
     if fast and cluster_name is not None:
-        maybe_handle = global_user_state.get_handle_from_cluster_name(
-            cluster_name)
-        if maybe_handle is not None:
-            try:
-                # This will throw if the cluster is not available
-                backend_utils.check_cluster_available(
+        cluster_status, maybe_handle = (
+            backend_utils.refresh_cluster_status_handle(cluster_name))
+        if cluster_status == status_lib.ClusterStatus.INIT:
+            # If the cluster is INIT, it may be provisioning. We want to prevent
+            # concurrent calls from queueing up many sequential reprovision
+            # attempts. Since provisioning will hold the cluster status lock, we
+            # wait to hold that lock by force refreshing the status. This will
+            # block until the cluster finishes provisioning, then correctly see
+            # that it is UP.
+            # TODO(cooperc): If multiple processes launched in parallel see that
+            # the cluster is STOPPED or does not exist, they will still all try
+            # to provision it, since we do not hold the lock continuously from
+            # the status check until the provision call. Fixing this requires a
+            # bigger refactor.
+            cluster_status, maybe_handle = (
+                backend_utils.refresh_cluster_status_handle(
                     cluster_name,
-                    operation='executing tasks',
-                    check_cloud_vm_ray_backend=False,
-                    dryrun=dryrun)
-                handle = maybe_handle
-                # Get all stages
-                stages = [
-                    Stage.SYNC_WORKDIR,
-                    Stage.SYNC_FILE_MOUNTS,
-                    Stage.PRE_EXEC,
-                    Stage.EXEC,
-                    Stage.DOWN,
-                ]
-            except exceptions.ClusterNotUpError:
-                # Proceed with normal provisioning
-                pass
+                    force_refresh_statuses=[
+                        # If the cluster is INIT, we want to try to grab the
+                        # status lock, which should block until provisioning is
+                        # finished.
+                        status_lib.ClusterStatus.INIT,
+                    ],
+                    # Wait indefinitely to obtain the lock, so that we don't
+                    # have multiple processes launching the same cluster at
+                    # once.
+                    cluster_status_lock_timeout=-1,
+                ))
+        if cluster_status == status_lib.ClusterStatus.UP:
+            handle = maybe_handle
+            stages = [
+                Stage.SYNC_WORKDIR,
+                Stage.SYNC_FILE_MOUNTS,
+                Stage.PRE_EXEC,
+                Stage.EXEC,
+                Stage.DOWN,
+            ]
 
     return _execute(
         entrypoint=entrypoint,