-
Notifications
You must be signed in to change notification settings - Fork 509
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SkyServe] deprecate old serve autoscaler policy #2868
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks great! Maybe we could migrate more implementation into our on-demand autoscaler, such as:
- Change the
AutoscalerDecision
to the latest version in the spot policy; - Change the type of return value for
evaluate_scaling
(i.e.,List[AutoscalerDecision]
); - Using the same interface (
target_qps
instead of{upper,lower}_threshold
) as spot policy;- Probably also migrate the
_get_desired_num_replicas
function.
- Probably also migrate the
The latest bullet point needs more discussion; cc @Michaelvll @concretevitamin for a look.
@cblmemo we probably want to think about how much to migrate from the spot policy PR. The current PR is kinda a minimal version of adding counter-based policy |
Seems like AUTOSCALER_COOLDOWN_SECONDS is deprecated too |
Yeah, we should discuss more. Agreed that we should migrate the counter-based autoscaling policy to master ASAP |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we are doing minimal migration then this version of the code looks great to me 🫡
https://github.com/skypilot-org/skypilot/blob/serve-spot/sky/serve/constants.py#L41 |
I mean all variable/constants related to cooldown should be able to remove, including this constant and |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oops, I missed one thing. We should not let the user wait for 5 minutes for the bootstrap case and we need some special judgment for this case.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It looks great to me! Left some nits, and please run some manual tests and include them in the PR description.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is still a constant to delete IIRC, other than that it looks great to me! After some tests this PR should be ready to go 🫡
@cblmemo added a manual test. commented out the constant |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sorry for not making it clear.. I mean test to make sure it doesn't contain any unexpected errors. Probably we could eliminate it in the PR since it is very similar to our http server, and the interface will change soon (to target qps per replica).
examples/serve/autoscaler.yaml
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ditto
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
these are some manual tests for the autoscaling behavior.
Are there other scripts where we can test the skyserve autoscaler?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I usually just use watch -n curl http://...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For the yamls that are mainly for testing, we can probably have them in ./tests/test_yamls
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
moved to test
* deprecate old autoscaler policy * remove redundant variable * remove redundant line * log upscale counter and downscale coutner * fast autoscaling when num_replicas below min or above max * address tian comments * remove constant * added test * update script * moved to tests
After discussing with @cblmemo
This PR replaces the old Sky Serve autoscaler policy with the new counter-based version (as in the spot policy)
Tested (run the relevant ones):
bash format.sh
sky serve up tests/test_yamls/test_serve_autoscaler.yaml
followed bypython3 tests/test_serve_autoscaler.py
pytest tests/test_smoke.py