Skip to content

Commit

Permalink
feat: make image builder timeout configurable (#992)
Browse files Browse the repository at this point in the history
* chore: change default value for `IMAGE_BUILD_TIMEOUT`

Signed-off-by: Guilhem Barthés <guilhem.barthes@owkin.com>

* feat: add `builder.timeout` to change `IMAGE_BUILD_TIMEOUT`

Signed-off-by: Guilhem Barthés <guilhem.barthes@owkin.com>

* feat: add test for `substrap.lock_local.lock_resource`

Signed-off-by: Guilhem Barthés <guilhem.barthes@owkin.com>

* chore: remove unused `IMAGE_BUILD_TIMEOUT`

Signed-off-by: Guilhem Barthés <guilhem.barthes@owkin.com>

* fix: replace `Exception` by `LockError` as `Exception` does not accept custom named arguments

Signed-off-by: Guilhem Barthés <guilhem.barthes@owkin.com>

* chore: update chart changelog

Signed-off-by: Guilhem Barthés <guilhem.barthes@owkin.com>

* doc: update chart changelog

Signed-off-by: Guilhem Barthés <guilhem.barthes@owkin.com>

* doc: change fragment

Signed-off-by: Guilhem Barthés <guilhem.barthes@owkin.com>

---------

Signed-off-by: Guilhem Barthés <guilhem.barthes@owkin.com>
  • Loading branch information
guilhem-barthes committed Sep 17, 2024
1 parent 288005b commit 513842f
Show file tree
Hide file tree
Showing 11 changed files with 49 additions and 4 deletions.
2 changes: 1 addition & 1 deletion backend/backend/settings/deps/image_build.py
Original file line number Diff line number Diff line change
@@ -1,7 +1,7 @@
import os

# How long we wait before throwing errors, in seconds
IMAGE_BUILD_TIMEOUT = 3 * 60 * 60 # 3 hours
IMAGE_BUILD_TIMEOUT = int(os.getenv("IMAGE_BUILD_TIMEOUT", 60 * 60)) # 1 hour
# Delay before two check
IMAGE_BUILD_CHECK_DELAY = 5
BUILDER_KANIKO_STARTUP_MAX_ATTEMPTS = int(os.getenv("BUILDER_KANIKO_STARTUP_MAX_ATTEMPTS", 60))
Expand Down
1 change: 0 additions & 1 deletion backend/substrapp/compute_tasks/image_builder.py
Original file line number Diff line number Diff line change
Expand Up @@ -12,7 +12,6 @@

logger = structlog.get_logger(__name__)

IMAGE_BUILD_TIMEOUT = settings.IMAGE_BUILD_TIMEOUT
IMAGE_BUILD_CHECK_DELAY = settings.IMAGE_BUILD_CHECK_DELAY
REGISTRY = settings.REGISTRY
REGISTRY_SCHEME = settings.REGISTRY_SCHEME
Expand Down
10 changes: 10 additions & 0 deletions backend/substrapp/exceptions.py
Original file line number Diff line number Diff line change
Expand Up @@ -32,3 +32,13 @@ class IntegrityError(Exception):

class ServerMediasNoSubdirError(Exception):
"""A supplied servermedias path didn't contain the expected subdir"""


class LockError(Exception):
"""Cannot get a lock file to write on"""

lock_file: str

def __init__(self, *args, lock_file: str, **kwargs) -> None:
self.lock_file = lock_file
super().__init__(args, kwargs)
4 changes: 3 additions & 1 deletion backend/substrapp/lock_local.py
Original file line number Diff line number Diff line change
Expand Up @@ -7,6 +7,8 @@

import structlog

from substrapp.exceptions import LockError

logger = structlog.get_logger(__name__)

LOCK_FILE_FOLDER = "/tmp" # nosec
Expand Down Expand Up @@ -47,7 +49,7 @@ def lock_resource(
did_wait = True
logger.debug("Lock: Waiting for lock to be released", lock_file=lock_file)
if time.time() - start > timeout:
raise Exception(f"Failed to acquire lock after {timeout} seconds", lock_file=lock_file)
raise LockError(f"Failed to acquire lock after {timeout} seconds", lock_file=lock_file)
time.sleep(delay)

logger.debug("Lock: Acquired", lock_file=lock_file)
Expand Down
20 changes: 20 additions & 0 deletions backend/substrapp/tests/test_lock_local.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,20 @@
from pathlib import Path

import pytest

from substrapp import lock_local
from substrapp.exceptions import LockError


# Test that `lock_local.lock_resource() returns the correct excception
def test_lock_local_timeout():
resource_type = "test"
unique_identifier = "1"
# Create fake lock
slug = f"{resource_type}_{unique_identifier}"
lock_path = Path(lock_local._get_lock_file_path(slug))
lock_path.write_text("unique_id")

with pytest.raises(LockError):
with lock_local.lock_resource(resource_type, unique_identifier, timeout=1):
pass
1 change: 1 addition & 0 deletions changes/992.added
Original file line number Diff line number Diff line change
@@ -0,0 +1 @@
Add configurable builder timeout through the env var `IMAGE_BUILD_TIMEOUT`
6 changes: 6 additions & 0 deletions charts/substra-backend/CHANGELOG.md
Original file line number Diff line number Diff line change
@@ -1,6 +1,12 @@
# Changelog

<!-- towncrier release notes start -->
## [26.11.0] - 2024-09-17

# Added

New configurable value `.Values.builder.timeout` configuring the duration before a building task would fail.

## [26.10.2] - 2024-09-17

# Fixed
Expand Down
2 changes: 1 addition & 1 deletion charts/substra-backend/Chart.yaml
Original file line number Diff line number Diff line change
@@ -1,7 +1,7 @@
apiVersion: v2
name: substra-backend
home: https://github.com/Substra
version: 26.10.2
version: 26.11.0
appVersion: 0.48.0
kubeVersion: '>= 1.19.0-0'
description: Main package for Substra
Expand Down
1 change: 1 addition & 0 deletions charts/substra-backend/README.md
Original file line number Diff line number Diff line change
Expand Up @@ -236,6 +236,7 @@ See [UPGRADE.md](https://github.com/Substra/substra-backend/blob/main/charts/sub
| `builder.enabled` | Enable builder service | `true` |
| `builder.replicaCount` | Replica count for the builder service | `1` |
| `builder.concurrency` | Maximum amount of tasks to process in parallel | `1` |
| `builder.timeout` | Number of seconds to wait before failing a build | `3600` |
| `builder.kanikoStartup.maxAttempts` | Number of checks done before considering a kaniko pod has failed to spawn | `60` |
| `builder.kanikoStartup.checkDelay` | Time, in seconds, to wait when a container is in pending state before checking again its status | `2` |
| `builder.image.registry` | Substra backend server image registry | `ghcr.io` |
Expand Down
2 changes: 2 additions & 0 deletions charts/substra-backend/templates/statefulset-builder.yaml
Original file line number Diff line number Diff line change
Expand Up @@ -199,6 +199,8 @@ spec:
value: {{ toYaml .Values.builder.kanikoStartup.maxAttempts | quote }}
- name: BUILDER_KANIKO_STARTUP_PENDING_STATE_WAIT_SECONDS
value: {{ toYaml .Values.builder.kanikoStartup.checkDelay | quote }}
- name: IMAGE_BUILD_TIMEOUT
value: {{ toYaml .Values.builder.timeout | quote }}
ports:
- name: http
containerPort: 8000
Expand Down
4 changes: 4 additions & 0 deletions charts/substra-backend/values.yaml
Original file line number Diff line number Diff line change
Expand Up @@ -583,6 +583,10 @@ builder:
##
concurrency: 1

## @param builder.timeout Number of seconds to wait before failing a build
##
timeout: 3600

## @param builder.kanikoStartup.maxAttempts Number of checks done before considering a kaniko pod has failed to spawn
## @param builder.kanikoStartup.checkDelay Time, in seconds, to wait when a container is in pending state before checking again its status
##
Expand Down

0 comments on commit 513842f

Please sign in to comment.