[Serve] Make controller regions/ choose from replica resources #4053

Merged
merged 14 commits into from
Oct 24, 2024

Contributor

@euclidgame euclidgame commented Oct 9, 2024

Fixes #3364.

Tested (run the relevant ones):

  • Code formatting: bash format.sh
  • Any manual or new tests for this PR (please specify below)
    • Unit test for the case where all resources are specified with clouds, regions and zones.
    • Unit test for the case where all resources are specified with clouds and regions, only part of them have zones.
    • Unit test for mixed cases: some resources have clouds, regions and zones; some don't have regions or zones.
  • All smoke tests: pytest tests/test_smoke.py
  • Relevant individual smoke tests: pytest tests/test_smoke.py::test_fill_in_the_name
  • Backward compatibility tests: conda deactivate; bash -i tests/backward_compatibility_tests.sh

@euclidgame euclidgame marked this pull request as ready for review October 9, 2024 20:44
Collaborator

@cblmemo cblmemo left a comment

Thanks for adding this feature, @euclidgame! This is awesome. The PR looks mostly good to me. I left some comments for discussion ;)

Comment on lines 525 to 527
for cloud_name, regions in requested_clouds_with_region_zone.items()
for region, zones in regions.items() for zone in zones
for cloud in [clouds.CLOUD_REGISTRY.from_str(cloud_name)]
Collaborator

This is a little confusing. Let's use an explicit for loop instead?
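
A minimal sketch of what the explicit-loop form could look like. The diff excerpt shows only the `for` clauses, so the element being built (a `(cloud, region, zone)` tuple) is an assumption, and `lookup_cloud` is a hypothetical stand-in for `clouds.CLOUD_REGISTRY.from_str` so that the snippet runs without SkyPilot installed.

```python
from typing import Dict, List, Set, Tuple


def lookup_cloud(cloud_name: str) -> str:
    """Hypothetical stand-in for clouds.CLOUD_REGISTRY.from_str()."""
    return cloud_name.upper()


def expand_combinations(
    requested: Dict[str, Dict[str, Set[str]]]
) -> List[Tuple[str, str, str]]:
    """Expand a cloud -> region -> zones mapping into flat tuples."""
    combinations = []
    for cloud_name, regions in requested.items():
        # Resolve the cloud once per cloud instead of once per zone.
        cloud = lookup_cloud(cloud_name)
        for region, zones in regions.items():
            # Sort the zones so the output order is deterministic.
            for zone in sorted(zones):
                combinations.append((cloud, region, zone))
    return combinations


requested = {
    'aws': {'us-east-1': {'us-east-1a', 'us-east-1b'}},
    'gcp': {'us-central1': {'us-central1-a'}},
}
result = expand_combinations(requested)
```

Compared with the nested comprehension, the single-element `for cloud in [...]` trick disappears, and each nesting level is visible at a glance.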

Comment on lines 491 to 493
requested_clouds_with_region_zone[cloud_name] = {
'_allow_any_region': {'_allow_any_zone'}
}
Collaborator

Is it possible to leave this blank when any region is allowed? The same applies to zones.

Contributor Author

@euclidgame euclidgame Oct 12, 2024

Yes, it is possible, but an empty set would then be ambiguous: it could mean that nothing has been added yet or that any region is allowed. We would need special handling to add new regions only the first time. Using a placeholder makes this easier. Maybe I can use None instead of the specific strings?
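
To illustrate the trade-off discussed here, the following is a rough sketch of how a `None` placeholder could mark "any region" or "any zone" while merging replica resources. It is not the code from this PR: the `collect` function, the `Request` tuple shape, and the merge rules are all assumptions made for the example.

```python
from typing import Dict, Iterable, Optional, Set, Tuple

# (cloud, region, zone); None means "not specified" (hypothetical shape).
Request = Tuple[str, Optional[str], Optional[str]]
RegionZones = Dict[Optional[str], Set[Optional[str]]]


def collect(requests: Iterable[Request]) -> Dict[str, RegionZones]:
    """Merge requests into cloud -> region -> zones, using None as 'any'."""
    collected: Dict[str, RegionZones] = {}
    for cloud, region, zone in requests:
        regions = collected.setdefault(cloud, {})
        if None in regions:
            # Any region is already allowed for this cloud.
            continue
        if region is None:
            # The placeholder replaces every specific region seen so far.
            collected[cloud] = {None: {None}}
            continue
        zones = regions.setdefault(region, set())
        if None in zones:
            # Any zone is already allowed in this region.
            continue
        if zone is None:
            regions[region] = {None}
        else:
            zones.add(zone)
    return collected


merged = collect([
    ('aws', 'us-east-1', 'us-east-1a'),
    ('aws', 'us-east-1', None),
    ('aws', 'us-west-2', 'us-west-2a'),
    ('gcp', None, None),
    ('gcp', 'us-central1', 'us-central1-a'),
])
```

With an explicit placeholder, an empty container only ever means "nothing collected yet", so no first-time flag is needed to tell the two states apart.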

Collaborator

@cblmemo cblmemo left a comment

Thanks for adding this, @euclidgame! Looks good to me. I left some comments for discussion ;)

Collaborator

@cblmemo cblmemo left a comment

Thanks for fixing this, @euclidgame! I left some comments for discussion ;)

Comment on lines 94 to 108
# 2. All resources has cloud specified. Some of them
# could NOT host controllers. Return a set, only
# containing those could host controllers.
# 2. Some resources cannot host controllers.
Collaborator

Could we revert this?

Comment on lines 130 to 142
# 3. Some resources does not have cloud specified.
# Return the default resources.
# 3. Some resources do not have cloud specified.
Collaborator

Same here, could we revert this?

@euclidgame
Contributor Author

Thanks for the suggestions @cblmemo ! All fixed, please review :-)

Collaborator

@cblmemo cblmemo left a comment

Thanks @euclidgame for the prompt fix! Left some nits and after that it should be ready to go!

@euclidgame
Contributor Author

@cblmemo Thanks for the suggestions! I have fixed them, please review again.

@cblmemo
Collaborator

cblmemo commented Oct 24, 2024

Thanks, @euclidgame! This mostly looks good to me. I ran the unit tests but unfortunately got an error. Could you help fix it?

pytest tests/unit_tests/test_controller_utils.py                                       
D 10-24 16:40:14 skypilot_config.py:228] Using config path: /home/txia/.sky/config.yaml
D 10-24 16:40:14 skypilot_config.py:233] Config loaded:
D 10-24 16:40:14 skypilot_config.py:233] {'serve': {'controller': {'resources': {'cloud': 'aws', 'cpus': 4}}}}
D 10-24 16:40:14 skypilot_config.py:245] Config syntax check passed.
bringing up nodes...
....F.
==================================================================================================== FAILURES =====================================================================================================
_____________________________________________________________ test_get_controller_resources_with_task_resources[serve-default_controller_resources1] ______________________________________________________________
[gw4] linux -- Python 3.9.17 /home/txia/miniconda3/envs/sky-serve/bin/python
tests/unit_tests/test_controller_utils.py:104: in test_get_controller_resources_with_task_resources
    _check_controller_resources(controller_resources, expected_combinations,
tests/unit_tests/test_controller_utils.py:82: in _check_controller_resources
    assert config == default_controller_resources, config
E   AssertionError: {'cpus': '4', 'disk_size': 200}
E   assert {'cpus': '4',...sk_size': 200} == {'cpus': '4+'...sk_size': 200}
E     Omitting 1 identical items, use -vv to show
E     Differing items:
E     {'cpus': '4'} != {'cpus': '4+'}
E     Use -v to get more diff
============================================================================================= short test summary info =============================================================================================
FAILED tests/unit_tests/test_controller_utils.py::test_get_controller_resources_with_task_resources[serve-default_controller_resources1] - AssertionError: {'cpus': '4', 'disk_size': 200}
1 failed, 5 passed, 2 warnings in 5.47s

Never mind, this also fails on the master branch. I filed #4172 to track it. Thanks for contributing, @euclidgame. Merging now!

@cblmemo cblmemo added this pull request to the merge queue Oct 24, 2024
Merged via the queue into skypilot-org:master with commit e832dde Oct 24, 2024
20 checks passed
@cblmemo cblmemo deleted the prioritize-regions branch October 24, 2024 23:53

Successfully merging this pull request may close these issues.

[Core][Controller] Respect region/zone settings in controller resources when creating controller
2 participants