Skip to content

Commit

Permalink
[fast] if cluster is INIT, force refresh before deciding to provision (
Browse files Browse the repository at this point in the history
…#4328)

* [fast] if cluster is INIT, force refresh before deciding to provision

If a cluster is mid-initialization, its status will be INIT and autostop/down
will not be set yet. In this case, the cluster refresh won't actually grab the
cluster status lock and hard refresh the status. So, check_cluster_available
will immeidately decide that the cluster is INIT and throw.

This could cause a bug where many parallel launches of `sky launch --fast` that
are staggered can all decide that the cluster is INIT, and all decide that they
need to launch the cluster. Since cluster initialization is locked with the
cluster status lock, each invocation will sychronously re-launch the cluster.

Now, if we see that the cluster is INIT, we force a refresh. This will acquire
the cluster status lock, which will block until any ongoing provisioning
completes and the cluster is UP. If the cluster is otherwise INIT (e.g. ray
cluster has been stopped abnormally) then provisioning should proceed as normal.

This does not fix the race where the cluster does not exist or is STOPPED, and
many simultaneously started `sky launch --fast` invocations try to create or
restart the cluster. However, once the first batch complete their launches, all
future invocations should correctly see the cluster as UP, not INIT - even if
they are started while the first batch is still provisioning the cluster. Fixing
the STOPPED or non-existent case is a bit more difficult and will probably
require moving this detection logic inside the provisioning code, so that it
holds the cluster status lock continuously from the status check until the
cluster is UP.

* update comment
  • Loading branch information
cg505 authored Nov 15, 2024
1 parent fa798d7 commit a2278cb
Showing 1 changed file with 37 additions and 22 deletions.
59 changes: 37 additions & 22 deletions sky/execution.py
Original file line number Diff line number Diff line change
Expand Up @@ -11,10 +11,10 @@
from sky import admin_policy
from sky import backends
from sky import clouds
from sky import exceptions
from sky import global_user_state
from sky import optimizer
from sky import sky_logging
from sky import status_lib
from sky.backends import backend_utils
from sky.usage import usage_lib
from sky.utils import admin_policy_utils
Expand Down Expand Up @@ -463,28 +463,43 @@ def launch(
stages = None
# Check if cluster exists and we are doing fast provisioning
if fast and cluster_name is not None:
maybe_handle = global_user_state.get_handle_from_cluster_name(
cluster_name)
if maybe_handle is not None:
try:
# This will throw if the cluster is not available
backend_utils.check_cluster_available(
cluster_status, maybe_handle = (
backend_utils.refresh_cluster_status_handle(cluster_name))
if cluster_status == status_lib.ClusterStatus.INIT:
# If the cluster is INIT, it may be provisioning. We want to prevent
# concurrent calls from queueing up many sequential reprovision
# attempts. Since provisioning will hold the cluster status lock, we
# wait to hold that lock by force refreshing the status. This will
# block until the cluster finishes provisioning, then correctly see
# that it is UP.
# TODO(cooperc): If multiple processes launched in parallel see that
# the cluster is STOPPED or does not exist, they will still all try
# to provision it, since we do not hold the lock continuously from
# the status check until the provision call. Fixing this requires a
# bigger refactor.
cluster_status, maybe_handle = (
backend_utils.refresh_cluster_status_handle(
cluster_name,
operation='executing tasks',
check_cloud_vm_ray_backend=False,
dryrun=dryrun)
handle = maybe_handle
# Get all stages
stages = [
Stage.SYNC_WORKDIR,
Stage.SYNC_FILE_MOUNTS,
Stage.PRE_EXEC,
Stage.EXEC,
Stage.DOWN,
]
except exceptions.ClusterNotUpError:
# Proceed with normal provisioning
pass
force_refresh_statuses=[
# If the cluster is INIT, we want to try to grab the
# status lock, which should block until provisioning is
# finished.
status_lib.ClusterStatus.INIT,
],
# Wait indefinitely to obtain the lock, so that we don't
# have multiple processes launching the same cluster at
# once.
cluster_status_lock_timeout=-1,
))
if cluster_status == status_lib.ClusterStatus.UP:
handle = maybe_handle
stages = [
Stage.SYNC_WORKDIR,
Stage.SYNC_FILE_MOUNTS,
Stage.PRE_EXEC,
Stage.EXEC,
Stage.DOWN,
]

return _execute(
entrypoint=entrypoint,
Expand Down

0 comments on commit a2278cb

Please sign in to comment.