[AWS Batch] Update resource limits #1364

Merged 2 commits on May 29, 2024
Changes from all commits
1 change: 1 addition & 0 deletions CHANGELOG.md
@@ -12,6 +12,7 @@
- [Future] Exposed 'wait_dur_sec' and 'retries' in future.wait() and future.get_result() methods
- [Localhost] Upgraded localhost backend v2 and set it as the default localhost backend
- [Localhost] Set monitoring_interval to 0.1 in the localhost storage backend
- [AWS Batch] Updated CPU and Memory resource limits

### Fixed
- [AWS Lambda] Fixed wrong AWS Lambda delete runtime_name match semantics
20 changes: 9 additions & 11 deletions docs/source/compute_config/aws_batch.md
@@ -70,7 +70,6 @@ In summary, you can use one of the following settings:
aws_batch:
region : <REGION_NAME>
execution_role: <EXECUTION_ROLE_ARN>
instance_role: <INSTANCE_ROLE_ARN>
job_role: <JOB_ROLE_ARN>
subnets:
- <SUBNET_ID_1>
@@ -81,7 +80,7 @@ In summary, you can use one of the following settings:
- ...
```

2. Provide the credentials in the `aws` section of the Lithops config file:
2. Provide the credentials in the `aws` section of the Lithops config file. In this case you can omit setting the `job_role`:
```yaml
lithops:
backend: aws_batch
@@ -93,8 +92,7 @@ In summary, you can use one of the following settings:

aws_batch:
execution_role: <EXECUTION_ROLE_ARN>
instance_role: <INSTANCE_ROLE_ARN>
job_role: <JOB_ROLE_ARN>
job_role: <JOB_ROLE_ARN> # Not mandatory if the credentials are in the aws section
subnets:
- <SUBNET_ID_1>
- <SUBNET_ID_2>
@@ -121,17 +119,17 @@ In summary, you can use one of the following settings:
|Group|Key|Default|Mandatory|Additional info|
|---|---|---|---|---|
| aws_batch | execution_role | | yes | ARN of the execution role used to execute AWS Batch tasks on ECS for Fargate environments |
| aws_batch | job_role | | yes | ARN of the job role used to execute AWS Batch tasks on ECS for Fargate environments|
| aws_batch | instance_role | | yes | ARN of the execution role used to execute AWS Batch tasks on ECS for EC2 environments |
| aws_batch | job_role | | yes | ARN of the job role used to execute AWS Batch tasks on ECS for Fargate environments. Not mandatory if the credentials are in the `aws` section of the configuration|
| aws_batch | security_groups | | yes | List of Security groups to attach for ECS task containers. By default, you can use a security group that accepts all outbound traffic but blocks all inbound traffic. |
| aws_batch | subnets | | yes | List of subnets from a VPC in which to deploy the ECS task containers. Note that if you are using a **private subnet**, you can set `assign_public_ip` to `false` but make sure containers can reach other AWS services like ECR, Secrets service, etc., by, for example, using a NAT gateway. If you are using a **public subnet** you must set `assign_public_ip` to `true` |
| aws_batch | instance_role | | no | ARN of the execution role used to execute AWS Batch tasks on ECS for EC2 environments. Mandatory if using the **EC2** or **SPOT** `env_type` |
| aws_batch | region | | no | Region name (like `us-east-1`) where to deploy the ECS cluster. Lithops will use the region set under the `aws` section if it is not set here |
| aws_batch | assign_public_ip | `true` | no | Assign public IPs to ECS task containers. Set to `true` if the tasks are being deployed in a public subnet. Set to `false` when deploying on a private subnet. |
| aws_batch | runtime | `default_runtime-v3X` | no | Runtime name |
| aws_batch | runtime_timeout | 180 | no | Runtime timeout |
| aws_batch | runtime_memory | 1024 | no | Runtime memory |
| aws_batch | worker_processes | 1 | no | Worker processes |
| aws_batch | container_vcpus | 0.5 | no | Number of vCPUs assigned to each task container. It can be different from `worker_processes`. Use it to run a task that uses multiple processes within a container. |
| aws_batch | runtime | | no | Container runtime name in ECR. If not provided Lithops will automatically build a default runtime |
| aws_batch | runtime_timeout | 180 | no | Runtime timeout managed by the cloud provider. |
| aws_batch | runtime_memory | 1024 | no | Runtime memory assigned to each task container. |
| aws_batch | runtime_cpu | 0.5 | no | Number of vCPUs assigned to each task container. It can be different from `worker_processes`. |
| aws_batch | worker_processes | 1 | no | Number of parallel Lithops processes in a worker. This is used to parallelize function activations within the worker. |
| aws_batch | service_role | `None` | no | Service role for AWS Batch. Leave empty to use a service-linked execution role. More info [here](https://docs.aws.amazon.com/batch/latest/userguide/using-service-linked-roles.html) |
| aws_batch | env_max_cpus | 10 | no | Maximum total CPUs of the compute environment |
| aws_batch | env_type | FARGATE_SPOT | no | Compute environment type, one of: `["EC2", "SPOT", "FARGATE", "FARGATE_SPOT"]` |
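As a quick sanity check against the table above, the mandatory keys can be verified locally before deploying. A minimal sketch (not part of Lithops; the helper name and sample ARN/IDs are hypothetical):

```python
# Mandatory aws_batch keys from the table above (job_role is also
# required unless the credentials are given in the `aws` section).
REQUIRED_KEYS = ('execution_role', 'security_groups', 'subnets')

def missing_aws_batch_keys(config):
    """Return the mandatory keys missing from the aws_batch section."""
    section = config.get('aws_batch', {})
    return [key for key in REQUIRED_KEYS if key not in section]

sample = {
    'aws_batch': {
        'execution_role': 'arn:aws:iam::123456789012:role/example-exec-role',
        'security_groups': ['sg-0123456789abcdef0'],
    }
}
print(missing_aws_batch_keys(sample))  # ['subnets']
```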
2 changes: 1 addition & 1 deletion lithops/serverless/backends/aws_batch/aws_batch.py
@@ -280,7 +280,7 @@ def _create_job_def(self, runtime_name, runtime_memory):
'resourceRequirements': [
{
'type': 'VCPU',
'value': str(self.aws_batch_config['container_vcpus'])
'value': str(self.aws_batch_config['runtime_cpu'])
},
{
'type': 'MEMORY',
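AWS Batch job definitions take CPU and memory as string-valued `resourceRequirements` entries, which is why the diff above wraps `runtime_cpu` in `str()`. A minimal sketch of how that payload fragment is shaped (the helper name is illustrative, not the Lithops API):

```python
def build_resource_requirements(runtime_cpu, runtime_memory):
    """Build the resourceRequirements list used in an AWS Batch job
    definition; both values must be serialized as strings."""
    return [
        {'type': 'VCPU', 'value': str(runtime_cpu)},
        {'type': 'MEMORY', 'value': str(runtime_memory)},
    ]

print(build_resource_requirements(0.5, 1024))
# [{'type': 'VCPU', 'value': '0.5'}, {'type': 'MEMORY', 'value': '1024'}]
```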
77 changes: 43 additions & 34 deletions lithops/serverless/backends/aws_batch/config.py
@@ -22,24 +22,29 @@
ENV_TYPES = {'EC2', 'SPOT', 'FARGATE', 'FARGATE_SPOT'}
RUNTIME_ZIP = 'lithops_aws_batch.zip'

AVAILABLE_MEM_FARGATE = [512] + [1024 * i for i in range(1, 31)]
AVAILABLE_CPU_FARGATE = [0.25, 0.5, 1, 2, 4]
# https://docs.aws.amazon.com/batch/latest/APIReference/API_ResourceRequirement.html
AVAILABLE_CPU_MEM_FARGATE = {
0.25: [512, 1024, 2048],
0.5: [1024, 2048, 3072, 4096],
1: [2048, 3072, 4096, 5120, 6144, 7168, 8192],
2: [4096, 5120, 6144, 7168, 8192, 9216, 10240, 11264, 12288, 13312, 14336, 15360, 16384],
4: [8192 + 1024 * i for i in range(23)], # Starts at 8192, increments by 1024 up to 30720
8: [16384 + 4096 * i for i in range(12)], # Starts at 16384, increments by 4096 up to 61440
16: [32768 + 8192 * i for i in range(12)] # Starts at 32768, increments by 8192 up to 122880
}

DEFAULT_CONFIG_KEYS = {
'runtime_timeout': 180, # Default: 180 seconds => 3 minutes
'runtime_memory': 1024, # Default memory: 1GB
'runtime_cpu': 0.5,
'worker_processes': 1,
'container_vcpus': 0.5,
'env_max_cpus': 10,
'env_type': 'FARGATE_SPOT',
'assign_public_ip': True,
'subnets': []
}

RUNTIME_TIMEOUT_MAX = 7200 # Max. timeout: 7200s == 2h
RUNTIME_MEMORY_MAX = 30720 # Max. memory: 30720 MB

REQ_PARAMS = ('execution_role', 'instance_role', 'security_groups')
REQ_PARAMS = ('execution_role', 'security_groups')

DOCKERFILE_DEFAULT = """
RUN apt-get update && apt-get install -y \
@@ -58,7 +63,8 @@
numpy \
cloudpickle \
ps-mem \
tblib
tblib \
psutil

# Copy Lithops proxy and lib to the container image.
ENV APP_HOME /lithops
@@ -73,7 +79,7 @@

def load_config(config_data):

if not config_data['aws_batch']:
if 'aws_batch' not in config_data or not config_data['aws_batch']:
raise Exception("'aws_batch' section is mandatory in the configuration")

if 'aws' not in config_data:
@@ -92,38 +98,41 @@ def load_config(config_data):
if key not in config_data['aws_batch']:
config_data['aws_batch'][key] = DEFAULT_CONFIG_KEYS[key]

if config_data['aws_batch']['runtime_memory'] > RUNTIME_MEMORY_MAX:
logger.warning("Memory set to {} exceeds the maximum "
"amount of {}".format(config_data['aws_batch']['runtime_memory'], RUNTIME_MEMORY_MAX))
config_data['aws_batch']['runtime_memory'] = RUNTIME_MEMORY_MAX

if config_data['aws_batch']['runtime_timeout'] > RUNTIME_TIMEOUT_MAX:
logger.warning("Timeout set to {} exceeds the maximum "
"amount of {}".format(config_data['aws_batch']['runtime_timeout'], RUNTIME_TIMEOUT_MAX))
config_data['aws_batch']['runtime_timeout'] = RUNTIME_TIMEOUT_MAX

config_data['aws_batch']['max_workers'] = config_data['aws_batch']['env_max_cpus'] // config_data['aws_batch']['container_vcpus']

if config_data['aws_batch']['env_type'] not in ENV_TYPES:
raise Exception(
'AWS Batch env type must be one of {} (is {})'.format(ENV_TYPES, config_data['aws_batch']['env_type']))

if config_data['aws_batch']['env_type'] in {'FARGATE, FARGATE_SPOT'}:
if config_data['aws_batch']['container_vcpus'] not in AVAILABLE_CPU_FARGATE:
raise Exception('{} container vcpus is not available for {} environment (choose one of {})'.format(
config_data['aws_batch']['runtime_memory'], config_data['aws_batch']['env_type'],
AVAILABLE_CPU_FARGATE
))
if config_data['aws_batch']['runtime_memory'] not in AVAILABLE_MEM_FARGATE:
raise Exception('{} runtime memory is not available for {} environment (choose one of {})'.format(
config_data['aws_batch']['runtime_memory'], config_data['aws_batch']['env_type'],
AVAILABLE_MEM_FARGATE
))
f"AWS Batch env type must be one of {ENV_TYPES} "
f"(is {config_data['aws_batch']['env_type']})"
)

# container_vcpus is deprecated. To be removed in a future release
if 'container_vcpus' in config_data['aws_batch']:
config_data['aws_batch']['runtime_cpu'] = config_data['aws_batch']['container_vcpus']

if config_data['aws_batch']['env_type'] in {'FARGATE', 'FARGATE_SPOT'}:
runtime_memory = config_data['aws_batch']['runtime_memory']
runtime_cpu = config_data['aws_batch']['runtime_cpu']
env_type = config_data['aws_batch']['env_type']
cpu_keys = list(AVAILABLE_CPU_MEM_FARGATE.keys())
if runtime_cpu not in cpu_keys:
raise Exception(
f"'{runtime_cpu}' runtime cpu is not available for the {env_type} environment "
f"(choose one of {', '.join(map(str, cpu_keys))})"
)
mem_keys = AVAILABLE_CPU_MEM_FARGATE[runtime_cpu]
if config_data['aws_batch']['runtime_memory'] not in mem_keys:
raise Exception(
f"'{runtime_memory}' runtime memory is not valid for {runtime_cpu} "
f"vCPU and the {env_type} environment (for {runtime_cpu}vCPU "
f"choose one of {', '.join(map(str, mem_keys))})"
)

if config_data['aws_batch']['env_type'] in {'EC2', 'SPOT'}:
if 'instance_role' not in config_data['aws_batch']:
raise Exception("'instance_role' mandatory for EC2 or SPOT environments")

config_data['aws_batch']['max_workers'] = config_data['aws_batch']['env_max_cpus'] \
// config_data['aws_batch']['runtime_cpu']

assert isinstance(config_data['aws_batch']['assign_public_ip'], bool)

if 'region_name' in config_data['aws_batch']:
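The new CPU/memory pairing check can be exercised on its own. A sketch using a subset of the `AVAILABLE_CPU_MEM_FARGATE` mapping from this PR (the function name is illustrative, not the Lithops API):

```python
# Subset of the vCPU -> valid memory (MB) mapping introduced above.
AVAILABLE_CPU_MEM_FARGATE = {
    0.25: [512, 1024, 2048],
    0.5: [1024, 2048, 3072, 4096],
    1: [2048, 3072, 4096, 5120, 6144, 7168, 8192],
}

def validate_fargate_resources(runtime_cpu, runtime_memory):
    """Raise ValueError if the vCPU size or the vCPU/memory
    pairing is not allowed on Fargate."""
    if runtime_cpu not in AVAILABLE_CPU_MEM_FARGATE:
        raise ValueError(f"{runtime_cpu} vCPU is not a valid Fargate size")
    if runtime_memory not in AVAILABLE_CPU_MEM_FARGATE[runtime_cpu]:
        raise ValueError(f"{runtime_memory} MB is not valid for {runtime_cpu} vCPU")

validate_fargate_resources(0.5, 2048)  # passes silently
```

This mirrors why the PR replaced the two independent CPU and memory lists with a single mapping: on Fargate the valid memory sizes depend on the chosen vCPU count, so the two values must be validated together.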