Commit
[fast] if cluster is INIT, force refresh before deciding to provision (#4328)

* [fast] if cluster is INIT, force refresh before deciding to provision

  If a cluster is mid-initialization, its status will be INIT and autostop/down will not be set yet. In this case, the cluster refresh won't actually grab the cluster status lock and hard refresh the status. So, check_cluster_available will immediately decide that the cluster is INIT and throw.

  This could cause a bug where many staggered parallel launches of `sky launch --fast` all decide that the cluster is INIT, and all decide that they need to launch the cluster. Since cluster initialization is locked with the cluster status lock, each invocation will synchronously re-launch the cluster, one after another.

  Now, if we see that the cluster is INIT, we force a refresh. This will acquire the cluster status lock, which will block until any ongoing provisioning completes and the cluster is UP. If the cluster is otherwise INIT (e.g. ray cluster has been stopped abnormally) then provisioning should proceed as normal.

  This does not fix the race where the cluster does not exist or is STOPPED, and many simultaneously started `sky launch --fast` invocations try to create or restart the cluster. However, once the first batch completes its launches, all future invocations should correctly see the cluster as UP, not INIT - even if they are started while the first batch is still provisioning the cluster.

  Fixing the STOPPED or non-existent case is a bit more difficult and will probably require moving this detection logic inside the provisioning code, so that it holds the cluster status lock continuously from the status check until the cluster is UP.

* update comment
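The locking behavior described in this message can be illustrated with a small toy model. This is only a sketch and not SkyPilot's actual code: `ToyCluster`, `refresh_status`, `provision`, and `fast_launch` are hypothetical names, and a `threading.Lock` stands in for the cluster status lock. It assumes that a non-forced refresh returns the recorded status without locking, while a forced refresh waits for the lock.

```python
import threading
import time
from enum import Enum


class Status(Enum):
    INIT = "INIT"
    UP = "UP"


class ToyCluster:
    """Hypothetical stand-in for a cluster record guarded by a status lock."""

    def __init__(self):
        self.status_lock = threading.Lock()
        self.status = Status.INIT
        self.provision_count = 0
        # Set once provisioning holds the lock (used only to order the demo).
        self.provisioning_started = threading.Event()

    def refresh_status(self, force):
        if not force:
            # Cheap path: report the recorded status without taking the lock.
            return self.status
        # Forced path: wait for the lock, so any ongoing provisioning
        # finishes before the status is read.
        with self.status_lock:
            return self.status

    def provision(self):
        # Provisioning holds the status lock for its whole duration.
        with self.status_lock:
            self.provisioning_started.set()
            self.provision_count += 1
            time.sleep(0.05)  # simulate slow provisioning work
            self.status = Status.UP


def fast_launch(cluster):
    """Decide whether to provision, forcing a refresh when status is INIT."""
    status = cluster.refresh_status(force=False)
    if status is Status.INIT:
        # INIT may mean "someone else is provisioning right now".
        status = cluster.refresh_status(force=True)
    if status is Status.UP:
        return "skipped"
    # Still INIT after the forced refresh (e.g. abnormal stop): provision.
    cluster.provision()
    return "provisioned"
```

In this model, a `fast_launch` call that starts while another thread is inside `provision` blocks on the forced refresh and then skips provisioning. A cluster left in INIT with nobody holding the lock is provisioned as normal. As in the commit itself, callers that all pass the forced refresh before anyone starts provisioning can still race.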