This repository has been archived by the owner on Jun 6, 2024. It is now read-only.
[Launcher]: Revise the definition of Framework running state #2135
Solve Issues:
#2022
#2051
#2099
New Definition:
This makes the Launcher APIs reflect the real Framework running state instead of just the raw AM running state, since we always want to hide the AM concept from end users on a best-effort basis.
The definition works for both Incremental and Gang Scheduling: even for Gang Scheduling, “Exists running Task” means the Framework has already satisfied Gang Allocation and all of its Tasks have already been launched.
Why not implement in RestServer?
To revise the state, the RestServer List operation would also need to read all TaskStatuses, which is too heavy.
Why implement in LauncherWebServer instead of LauncherService?
It is hard to make sure the FrameworkStatus is consistent with the TaskStatuses outside the WebServer.
However, this makes the exposed FrameworkState inconsistent with the backend. That is fine, because the revised states, i.e. APPLICATION_RUNNING and APPLICATION_WAITING, are generally interchangeable even in the backend.
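For illustration, the revision described above could be sketched roughly as follows. This is not the code in this PR: the reduced enums, the `TASK_*` and `OTHER` values, and the `FrameworkStateReviser` class are hypothetical names, and the rule itself (a raw APPLICATION_RUNNING state with no running Task is exposed as APPLICATION_WAITING) is an assumption inferred from the "Exists running Task" wording above.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical, reduced enums for illustration only; the real Launcher
// defines more states than are shown here.
enum FrameworkState { APPLICATION_WAITING, APPLICATION_RUNNING, OTHER }

enum TaskState { TASK_WAITING, TASK_RUNNING, TASK_COMPLETED }

final class FrameworkStateReviser {
  private FrameworkStateReviser() {}

  // Assumed rule: only a raw APPLICATION_RUNNING state is revised. It is
  // exposed as APPLICATION_RUNNING when at least one Task is running, and
  // as APPLICATION_WAITING otherwise. Every other raw state passes through.
  static FrameworkState revise(FrameworkState rawState, List<TaskState> taskStates) {
    if (rawState != FrameworkState.APPLICATION_RUNNING) {
      return rawState;
    }
    boolean existsRunningTask =
        taskStates.stream().anyMatch(s -> s == TaskState.TASK_RUNNING);
    return existsRunningTask
        ? FrameworkState.APPLICATION_RUNNING
        : FrameworkState.APPLICATION_WAITING;
  }
}

class FrameworkStateRevisionSketch {
  public static void main(String[] args) {
    // AM is running but no Task has started yet (e.g. during Gang Allocation).
    System.out.println(FrameworkStateReviser.revise(
        FrameworkState.APPLICATION_RUNNING,
        Arrays.asList(TaskState.TASK_WAITING, TaskState.TASK_WAITING)));

    // AM is running and at least one Task is running.
    System.out.println(FrameworkStateReviser.revise(
        FrameworkState.APPLICATION_RUNNING,
        Arrays.asList(TaskState.TASK_RUNNING, TaskState.TASK_WAITING)));

    // No Tasks at all: treated the same as no running Task.
    System.out.println(FrameworkStateReviser.revise(
        FrameworkState.APPLICATION_RUNNING,
        Collections.<TaskState>emptyList()));
  }
}
```

Because the revision is a pure function of the raw state and the TaskStatuses read in the same request, it can sit at the web-server boundary without changing anything stored in the backend.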
How about K8S Launcher?
K8S Launcher treats the whole Framework spec and status as a single CRD object, so it is easy to keep it consistent even in the backend.
It will add a new state, FrameworkAttemptPreparing, to indicate that there is no running Task (such as during the Gang Allocation phase) even if the FrameworkAttempt object has already been created.
Long Term Plan: Generalize Scheduling Policy