Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[AWS SSO] Use service account to access EC2 and S3 on head and worker nodes #1489

Merged
merged 55 commits into from
Dec 28, 2022
Merged
Show file tree
Hide file tree
Changes from 46 commits
Commits
Show all changes
55 commits
Select commit Hold shift + click to select a range
1181cc1
Create SkyPilot service account with neccessary permission
Michaelvll Dec 5, 2022
996eba2
legacy key name
Michaelvll Dec 5, 2022
470557c
align quote
Michaelvll Dec 5, 2022
abc224b
lint
Michaelvll Dec 5, 2022
d90b851
Hints for not working with other clouds
Michaelvll Dec 5, 2022
e18b91e
Add back credential file upload, since VMs on other clouds need it to…
Michaelvll Dec 5, 2022
92bcb32
Add get instance profile and pin awscli
Michaelvll Dec 6, 2022
3fe931e
rename policy
Michaelvll Dec 6, 2022
6e28d9a
Fix hints
Michaelvll Dec 6, 2022
7b7447d
align
Michaelvll Dec 6, 2022
1673bd1
Not upload AWScredentials when SSO
Michaelvll Dec 6, 2022
8c326bf
format
Michaelvll Dec 6, 2022
ba4f041
mark costly smoke tests
Michaelvll Dec 6, 2022
7e3987c
Use ubuntu-20.04 for test
Michaelvll Dec 7, 2022
4efebf1
format
Michaelvll Dec 7, 2022
1719e2e
rename tests
Michaelvll Dec 7, 2022
fe23509
fix
Michaelvll Dec 7, 2022
f1fd9ce
support option for run_smoke_tests.sh
Michaelvll Dec 7, 2022
d6c5e91
Merge branch 'master' of github.com:concretevitamin/sky-experiments i…
Michaelvll Dec 7, 2022
21d1a23
rename variable
Michaelvll Dec 7, 2022
355088e
Use version instead of inplace change for aws config for backward com…
Michaelvll Dec 12, 2022
705b86e
fix
Michaelvll Dec 12, 2022
1430d74
Add comments
Michaelvll Dec 12, 2022
6ea8b4b
Merge branch 'master' of github.com:concretevitamin/sky-experiments i…
Michaelvll Dec 16, 2022
1f18f48
Fix undefined
Michaelvll Dec 17, 2022
db819f7
Merge branch 'master' of github.com:concretevitamin/sky-experiments i…
Michaelvll Dec 20, 2022
f6b0d9b
Fix sky check
Michaelvll Dec 20, 2022
396ccd9
Use a single file for aws config
Michaelvll Dec 20, 2022
a8cc448
address comments
Michaelvll Dec 20, 2022
fa02be3
fix
Michaelvll Dec 20, 2022
e5b8cb0
fix
Michaelvll Dec 20, 2022
41730cc
revert smoke tests
Michaelvll Dec 20, 2022
b7d422e
Add mypy check for sky/clouds/*.py
Michaelvll Dec 20, 2022
6df83ff
fix mypy list
Michaelvll Dec 20, 2022
78aada9
Fix identity check
Michaelvll Dec 20, 2022
2eac8b6
lint
Michaelvll Dec 20, 2022
c86b69d
remove spaces
Michaelvll Dec 20, 2022
f74f604
Merge branch 'fix-identity-check' of github.com:concretevitamin/sky-e…
Michaelvll Dec 20, 2022
24e22c9
Merge branch 'master' of github.com:concretevitamin/sky-experiments i…
Michaelvll Dec 26, 2022
a1df005
address comments
Michaelvll Dec 27, 2022
62378bd
Update sky/clouds/aws.py
Michaelvll Dec 27, 2022
80911dc
Update sky/clouds/aws.py
Michaelvll Dec 27, 2022
ba54fe7
address comments
Michaelvll Dec 27, 2022
6723190
zone to be None
Michaelvll Dec 27, 2022
bfc5f56
add comment
Michaelvll Dec 27, 2022
d150cd9
apply changes to help str
Michaelvll Dec 27, 2022
e99853a
Enable fail over when the sso is expired
Michaelvll Dec 27, 2022
1b85317
Update sky/skylet/providers/aws/config.py
Michaelvll Dec 27, 2022
75b3424
address comments
Michaelvll Dec 27, 2022
0e9819d
avoid unbounded get_project_id exceptions
Michaelvll Dec 27, 2022
6546bb1
format
Michaelvll Dec 27, 2022
1c2c549
print out status refresh result
Michaelvll Dec 27, 2022
3ef278b
retry for azure identity fetching
Michaelvll Dec 27, 2022
3df29ea
wait longer for autostop
Michaelvll Dec 27, 2022
47ac24e
more retry for azure
Michaelvll Dec 27, 2022
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
4 changes: 3 additions & 1 deletion docs/source/getting-started/installation.rst
Original file line number Diff line number Diff line change
Expand Up @@ -58,6 +58,8 @@ To get the **AWS access key** required by :code:`aws configure`, please go to th
$ # Configure your AWS credentials
$ aws configure

Note: If you are using AWS IAM Identity Center (AWS SSO), you will need :code:`pip install awscli>=1.27.10`. See `here <https://docs.aws.amazon.com/cli/latest/userguide/cli-configure-sso.html>`_ for instructions on how to configure AWS SSO.

**GCP**

.. code-block:: console
Expand Down Expand Up @@ -181,4 +183,4 @@ If you experience any issues after installation, you can use the :code:`--uninst
$ sky --uninstall-shell-completion auto
$ # sky --uninstall-shell-completion zsh
$ # sky --uninstall-shell-completion bash
$ # sky --uninstall-shell-completion fish
$ # sky --uninstall-shell-completion fish
2 changes: 2 additions & 0 deletions examples/using_file_mounts.yaml
Original file line number Diff line number Diff line change
Expand Up @@ -70,6 +70,8 @@ file_mounts:

/s3-data-test: s3://fah-public-data-covid19-cryptic-pockets/human/il6/PROJ14534/RUN999/CLONE0/results0
/s3-data-file: s3://fah-public-data-covid19-cryptic-pockets/human/il6/PROJ14534/RUN999/CLONE0/results0/frame0.xtc
# Test access to private bucket
# /my-bucket: s3://sky-detectron2-outputs
# /test-my-gcs: gs://cloud-storage-test-zhwu-2

# If a source path points to a "directory", its contents will be recursively
Expand Down
6 changes: 3 additions & 3 deletions sky/backends/backend_utils.py
Original file line number Diff line number Diff line change
Expand Up @@ -1829,11 +1829,11 @@ def _update_cluster_status(

Raises:
exceptions.ClusterOwnerIdentityMismatchError: if the current user is not the
same as the user who created the cluster.
same as the user who created the cluster.
exceptions.CloudUserIdentityError: if we fail to get the current user
identity.
identity.
exceptions.ClusterStatusFetchingError: the cluster status cannot be
fetched from the cloud provider.
fetched from the cloud provider.
"""
if not acquire_per_cluster_status_lock:
return _update_cluster_status_no_lock(cluster_name)
Expand Down
177 changes: 123 additions & 54 deletions sky/clouds/aws.py
Original file line number Diff line number Diff line change
Expand Up @@ -19,22 +19,26 @@
# renaming to avoid shadowing variables
from sky import resources as resources_lib

# Minimum set of files under ~/.aws that grant AWS access.
# This local file (under ~/.aws/) will be uploaded to remote nodes (any
# cloud), if all of the following conditions hold:
# - the current user identity is not using AWS SSO
# - this file exists
# It has the following purposes:
# - make all nodes (any cloud) able to access private S3 buckets
# - make some remote nodes able to launch new nodes on AWS (i.e., makes
# AWS head node able to launch AWS workers, or any-cloud spot controller
# able to launch spot clusters on AWS).
#
# If we detect the current user identity is AWS SSO, we will not upload this
# file to any remote nodes (any cloud). Instead, a SkyPilot IAM role is
# assigned to both AWS head and workers.
# TODO(skypilot): This also means we leave open a bug for AWS SSO users that
# use multiple clouds. The non-AWS nodes will have neither the credential
# file nor the ability to understand AWS IAM.
_CREDENTIAL_FILES = [
'credentials',
]


def _run_output(cmd):
proc = subprocess.run(cmd,
shell=True,
check=True,
stderr=subprocess.PIPE,
stdout=subprocess.PIPE)
return proc.stdout.decode('ascii')


# TODO(zhwu): Move the default AMI size to the catalog instead.
DEFAULT_AMI_GB = 45


Expand All @@ -45,21 +49,28 @@ class AWS(clouds.Cloud):
_REPR = 'AWS'
_regions: List[clouds.Region] = []

_INDENT_PREFIX = ' '
_STATIC_CREDENTIAL_HELP_STR = (
'Run the following commands:'
'\n $ pip install boto3'
'\n $ aws configure'
'\n For more info: '
f'\n{_INDENT_PREFIX} $ pip install boto3'
f'\n{_INDENT_PREFIX} $ aws configure'
f'\n{_INDENT_PREFIX}For more info: '
'https://docs.aws.amazon.com/cli/latest/userguide/cli-configure-quickstart.html' # pylint: disable=line-too-long
)

_SSO_CREDENTIAL_HELP_STR = (
'Run the following commands (must use aws v2 CLI):'
'\n $ aws configure sso'
'\n $ aws sso login --profile <profile_name>'
'\n For more info: '
'https://docs.aws.amazon.com/cli/latest/userguide/cli-configure-sso.html' # pylint: disable=line-too-long
)
@classmethod
def _sso_credentials_help_str(cls, expired: bool = False) -> str:
help_str = 'Run the following commands (must use aws v2 CLI):'
if not expired:
help_str += f'\n{cls._INDENT_PREFIX} $ aws configure sso'
help_str += (
f'\n{cls._INDENT_PREFIX} $ aws sso login --profile <profile_name>'
f'\n{cls._INDENT_PREFIX}For more info: '
'https://docs.aws.amazon.com/cli/latest/userguide/cli-configure-sso.html' # pylint: disable=line-too-long
)
return help_str

_MAX_AWSCLI_MAJOR_VERSION = 1

#### Regions/Zones ####

Expand Down Expand Up @@ -100,7 +111,7 @@ def region_zones_provision_loop(
*,
instance_type: Optional[str] = None,
accelerators: Optional[Dict[str, int]] = None,
use_spot: bool,
use_spot: bool = False,
) -> Iterator[Tuple[clouds.Region, List[clouds.Zone]]]:
# AWS provisioner can handle batched requests, so yield all zones under
# each region.
Expand Down Expand Up @@ -139,28 +150,28 @@ def get_default_ami(cls, region_name: str, instance_type: str) -> str:
@classmethod
def _get_image_id(
cls,
image_id: Optional[Dict[str, str]],
image_id: Optional[Dict[Optional[str], str]],
region_name: str,
) -> str:
) -> Optional[str]:
if image_id is None:
return None
if None in image_id:
image_id = image_id[None]
image_id_str = image_id[None]
else:
assert region_name in image_id, image_id
image_id = image_id[region_name]
if image_id.startswith('skypilot:'):
image_id = service_catalog.get_image_id_from_tag(image_id,
region_name,
clouds='aws')
if image_id is None:
image_id_str = image_id[region_name]
if image_id_str.startswith('skypilot:'):
image_id_str = service_catalog.get_image_id_from_tag(image_id_str,
region_name,
clouds='aws')
if image_id_str is None:
# Raise ResourcesUnavailableError to make sure the failover
# in CloudVMRayBackend will be correctly triggered.
# TODO(zhwu): This is a information leakage to the cloud
# implementor, we need to find a better way to handle this.
raise exceptions.ResourcesUnavailableError(
f'No image found for region {region_name}')
return image_id
return image_id_str

def get_image_size(self, image_id: str, region: Optional[str]) -> float:
if image_id.startswith('skypilot:'):
Expand Down Expand Up @@ -260,7 +271,7 @@ def get_vcpus_from_instance_type(
def make_deploy_resources_variables(
self, resources: 'resources_lib.Resources',
region: Optional['clouds.Region'],
zones: Optional[List['clouds.Zone']]) -> Dict[str, str]:
zones: Optional[List['clouds.Zone']]) -> Dict[str, Optional[str]]:
if region is None:
assert zones is None, (
'Set either both or neither for: region, zones.')
Expand All @@ -271,7 +282,7 @@ def make_deploy_resources_variables(
'Set either both or neither for: region, zones.')

region_name = region.name
zones = [zone.name for zone in zones]
zone_names = [zone.name for zone in zones]

r = resources
# r.accelerators is cleared but .instance_type encodes the info.
Expand All @@ -290,13 +301,13 @@ def make_deploy_resources_variables(
'custom_resources': custom_resources,
'use_spot': r.use_spot,
'region': region_name,
'zones': ','.join(zones),
'zones': ','.join(zone_names),
'image_id': image_id,
}

def get_feasible_launchable_resources(self,
resources: 'resources_lib.Resources'):
fuzzy_candidate_list = []
fuzzy_candidate_list: List[str] = []
if resources.instance_type is not None:
assert resources.is_launchable(), resources
# Treat Resources(AWS, p3.2x, V100) as Resources(AWS, p3.2x).
Expand Down Expand Up @@ -342,24 +353,20 @@ def check_credentials(self) -> Tuple[bool, Optional[str]]:
except ImportError:
raise ImportError('Fail to import dependencies for AWS.'
'Try pip install "skypilot[aws]"') from None
# This file is required because it will be synced to remote VMs for
# `aws` to access private storage buckets.
# `aws configure list` does not guarantee this file exists.
if not os.path.isfile(os.path.expanduser('~/.aws/credentials')):
return (False, '~/.aws/credentials does not exist. ' +
self._STATIC_CREDENTIAL_HELP_STR)

# Checks if the AWS CLI is installed properly
try:
_run_output('aws configure list')
except subprocess.CalledProcessError:
proc = subprocess.run('aws --version',
shell=True,
check=False,
stdout=subprocess.PIPE,
stderr=subprocess.PIPE)
if proc.returncode != 0:
return False, (
'AWS CLI is not installed properly.'
# TODO(zhwu): Change the installation hint to from PyPI.
' Run the following commands in the SkyPilot codebase:'
'\n $ pip install .[aws]'
'\n Credentials may also need to be set. ' +
self._STATIC_CREDENTIAL_HELP_STR)
'AWS CLI is not installed properly. '
'Run the following commands:'
f'\n{self._INDENT_PREFIX} $ pip install skypilot[aws]'
f'{self._INDENT_PREFIX}Credentials may also need to be set. '
f'{self._STATIC_CREDENTIAL_HELP_STR}')

# Checks if AWS credentials 1) exist and 2) are valid.
# https://stackoverflow.com/questions/53548737/verify-aws-credentials-with-boto3
Expand All @@ -368,9 +375,34 @@ def check_credentials(self) -> Tuple[bool, Optional[str]]:
except exceptions.CloudUserIdentityError as e:
return False, str(e)

hints = None
if self._is_current_identity_sso():
hints = (
'AWS SSO is set. '
'It will work if you use AWS only, but will cause problems if you want to '
Michaelvll marked this conversation as resolved.
Show resolved Hide resolved
'use multiple clouds. To set up static credentials, try: aws configure'
)
else:
# This file is required because it is required by the VMs launched on
# other clouds to access private s3 buckets and resources like EC2.
# `get_current_user_identity` does not guarantee this file exists.
if not os.path.isfile(os.path.expanduser('~/.aws/credentials')):
return (False, '~/.aws/credentials does not exist. ' +
self._STATIC_CREDENTIAL_HELP_STR)

# Fetch the AWS availability zones mapping from ID to name.
from sky.clouds.service_catalog import aws_catalog # pylint: disable=import-outside-toplevel,unused-import
return True, None
return True, hints

def _is_current_identity_sso(self) -> bool:
proc = subprocess.run('aws configure list',
shell=True,
check=False,
stdout=subprocess.PIPE,
stderr=subprocess.PIPE)
if proc.returncode != 0:
return False
return 'sso' in proc.stdout.decode().split()

def get_current_user_identity(self) -> Optional[str]:
"""Returns the identity of the user on this cloud."""
Expand All @@ -397,14 +429,33 @@ def get_current_user_identity(self) -> Optional[str]:
'Failed to access AWS services with credentials. '
'Make sure that the access and secret keys are correct.'
f' {self._STATIC_CREDENTIAL_HELP_STR}') from None
except aws.botocore_exceptions().InvalidConfigError as e:
import awscli
from packaging import version
awscli_version = version.parse(awscli.__version__)
if (awscli_version < version.parse('1.27.10') and
'configured to use SSO' in str(e)):
with ux_utils.print_exception_no_traceback():
raise exceptions.CloudUserIdentityError(
'awscli is too old to use SSO. Run the following command to upgrade:'
f'\n{self._INDENT_PREFIX} $ pip install awscli>=1.27.10'
f'\n{self._INDENT_PREFIX}You may need to log into SSO again after '
f'upgrading. {self._sso_credentials_help_str()}'
) from None
with ux_utils.print_exception_no_traceback():
raise exceptions.CloudUserIdentityError(
f'Invalid AWS configuration.\n'
f' Reason: [{common_utils.class_fullname(e.__class__)}] {e}.'
) from None
except aws.botocore_exceptions().TokenRetrievalError:
# This is raised when the access token is expired, which mainly
# happens when the user is using temporary credentials or SSO
# login.
with ux_utils.print_exception_no_traceback():
raise exceptions.CloudUserIdentityError(
'AWS access token is expired.'
f' {self._SSO_CREDENTIAL_HELP_STR}') from None
f' {self._sso_credentials_help_str(expired=True)}'
) from None
except Exception as e: # pylint: disable=broad-except
with ux_utils.print_exception_no_traceback():
raise exceptions.CloudUserIdentityError(
Expand All @@ -414,9 +465,27 @@ def get_current_user_identity(self) -> Optional[str]:
return user_id

def get_credential_file_mounts(self) -> Dict[str, str]:
# TODO(skypilot): ~/.aws/credentials is required for users using multiple clouds.
# If this file does not exist, users can launch on AWS via AWS SSO and assign
# IAM role to the cluster.
# However, if users launch clusters in a non-AWS cloud, those clusters do not
# understand AWS IAM role so will not be able to access private AWS EC2 resources
# and S3 buckets.

# The file should not be uploaded if the user is using SSO, as the credential
# file can be from a different account, and will make autopstop/autodown/spot
# controller misbehave.

# TODO(zhwu/zongheng): We can also avoid uploading the credential file for the
# cluster launched on AWS even if the user is using static credentials. We need
# to define a mechanism to find out the cloud provider of the cluster to be
# launched in this function.
if self._is_current_identity_sso():
return {}
return {
Michaelvll marked this conversation as resolved.
Show resolved Hide resolved
f'~/.aws/{filename}': f'~/.aws/{filename}'
for filename in _CREDENTIAL_FILES
if os.path.exists(os.path.expanduser(f'~/.aws/{filename}'))
}

def instance_type_exists(self, instance_type):
Expand Down
16 changes: 6 additions & 10 deletions sky/clouds/azure.py
Original file line number Diff line number Diff line change
Expand Up @@ -141,7 +141,7 @@ def region_zones_provision_loop(
*,
instance_type: Optional[str] = None,
accelerators: Optional[Dict[str, int]] = None,
use_spot: bool,
use_spot: bool = False,
) -> Iterator[Tuple[clouds.Region, List[clouds.Zone]]]:
del accelerators # unused

Expand Down Expand Up @@ -180,15 +180,13 @@ def get_zone_shell_cmd(cls) -> Optional[str]:
def make_deploy_resources_variables(
self, resources: 'resources.Resources',
region: Optional['clouds.Region'],
zones: Optional[List['clouds.Zone']]) -> Dict[str, str]:
zones: Optional[List['clouds.Zone']]) -> Dict[str, Optional[str]]:
if region is None:
assert zones is None, (
'Set either both or neither for: region, zones.')
region = self._get_default_region()

region_name = region.name
# Azure does not support specific zones.
zones = []

r = resources
assert not r.use_spot, \
Expand All @@ -208,7 +206,8 @@ def make_deploy_resources_variables(
'custom_resources': custom_resources,
'use_spot': r.use_spot,
'region': region_name,
'zones': zones,
# Azure does not support specific zones.
'zones': None,
**image_config
}

Expand Down Expand Up @@ -275,11 +274,8 @@ def check_credentials(self) -> Tuple[bool, Optional[str]]:
except subprocess.CalledProcessError:
return False, (
# TODO(zhwu): Change the installation hint to from PyPI.
'Azure CLI returned error. Run the following commands in the SkyPilot codebase:'
'\n $ pip install skypilot[azure] # if installed from '
'PyPI'
'\n Or:'
'\n $ pip install .[azure] # if installed from source'
'Azure CLI returned error. Run the following commands:'
'\n $ pip install skypilot[azure]'
'\n Credentials may also need to be set.' + help_str)
# If Azure is properly logged in, this will return the account email
# address + subscription ID.
Expand Down
Loading