feat(api,ui,sdk): Make CPU limits configurable #381
Conversation
Codecov Report: All modified and coverable lines are covered by tests ✅

Additional details and impacted files:

@@            Coverage Diff             @@
##             main     #381      +/-   ##
==========================================
+ Coverage   62.19%   68.06%   +5.87%
==========================================
  Files         124      149      +25
  Lines        9755    11809    +2054
==========================================
+ Hits         6067     8038    +1971
- Misses       2954     3033      +79
- Partials      734      738       +4

Flags with carried forward coverage won't be shown.
Thanks, LGTM. +1 for moving the platform default out and making it part of the service builder.
Context
Similar to caraml-dev/merlin#586, this PR aims to make CPU limits configurable for the end user.
At present, users are not able to configure the CPU limits of the pods in which Turing routers/enrichers/ensemblers (docker and pyfunc) are deployed - the limits are instead determined automatically at the platform level (Turing API server). Depending on how the API server has been configured, the limit is either derived from the CPU request via a platform-configured scaling factor, or left unset (when that scaling factor is 0).
This PR introduces a new workflow that allows users to override the platform-level CPU limits (described above) on a per-component basis. The workflow is available via the UI, the SDK and, by extension, direct calls to the Turing API server's endpoints.
UI: a new form group for specifying CPU limits (CPULimitsFormGroup.js, listed in the modifications below).
SDK: a new CPU limit field on the resource request class.
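For illustration, a minimal sketch of the SDK workflow, assuming the limit is exposed as a cpu_limit argument on ResourceRequest (the import path follows the file touched by this PR; the remaining argument names are assumptions and should be checked against the class itself):

```python
# Sketch only: `cpu_limit` is the new field introduced by this PR; the other
# argument names are assumed and should be verified against
# sdk/turing/router/config/resource_request.py.
from turing.router.config.resource_request import ResourceRequest

resource_request = ResourceRequest(
    min_replica=1,
    max_replica=3,
    cpu_request="500m",
    cpu_limit="1",          # when set, overrides the platform-level CPU limit
    memory_request="512Mi",
)
```

When cpu_limit is omitted, the platform-level behaviour described in the Context section continues to apply.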
In addition, this PR adds a new configuration, DefaultEnvVarsWithoutCPULimits, which is a list of env vars that automatically get added to all Turing routers/enrichers/ensemblers (docker and pyfunc) when CPU limits are not set. This allows the Turing API server's operators to set env vars platform-wide that can potentially improve these deployments' performance, e.g. env vars involving concurrency.

Modifications
- api/turing/cluster/knative_service.go - Removal of platform-level fields from the KnativeService struct
- api/turing/cluster/servicebuilder/service_builder.go - Addition of platform-level configs to clusterSvcBuilder, and new helper methods to set default env vars when CPU limits are not explicitly set and when the CPU limit scaling factor is set to 0
- api/turing/config/config.go - Addition of the new field DefaultEnvVarsWithoutCPULimits
- sdk/turing/router/config/resource_request.py - Addition of a new CPU limit field to the resource request class
- ui/src/router/components/form/components/CPULimitsFormGroup.js - Addition of a new form group to allow CPU limits to be specified on the UI