Skip to content

Commit

Permalink
feat(api): Refactor API server to introduce support for Docker image …
Browse files Browse the repository at this point in the history
…registries and s3 bucket support (#605)

## Note 🚨
~~This PR should not be merged without the changes in
caraml-dev/mlp#116 being merged, published and
imported as dependencies first. This PR is currently only using a branch
of a fork (the source of the quoted PR) of the MLP repository.~~ The
dependent PR has been merged.

# Description
In order to provide support for using image registries that use [Docker
registry
credentials](https://docs.docker.com/reference/cli/docker/login/#configure-the-credential-store)
as well as [AWS S3-based blob storage
services](https://docs.aws.amazon.com/code-library/latest/ug/go_2_s3_code_examples.html),
this PR refactors the API server to support these additional image
registry and bob storage options. More concretely, these are the
following changes made:

- Set up the workflow needed to allow platform maintainers to configure
the image registry that the API server will push images to (Docker or
Google Cloud/Artifact Registry) as well as the blob storage service that
it should read model artifacts from/write files to (S3-based
store/Google Cloud Storage)
- Allow the API server to access a configured Docker registry to check
if an image is available
- Allow the API server to check and hash model dependencies in a
configured S3-based store
- Allow Kaniko jobs spun up by the API server to use load model
artifacts from a configured S3-based store when building model images
- Allow Kaniko jobs spun up by the API server to push images to the
configured Docker registry

# Modifications
- `api/cmd/api/setup.go` - Make the initialisation of the image builder
set up the artifact service type and docker registry correctly depending
on the one set up
- `api/config/config.go` - Introduce new configs for platform
maintainers to specify the `KanikoPushRegistryType` and the
`KanikoDockerCredentialSecretName`
- `api/pkg/imagebuilder/imagebuilder.go` - Make changes to the image
builder to configure the Kaniko job spec correctly depending on the
selected registry type and blob storage type
- `python/batch-predictor/docker/app.Dockerfile` - Add steps to the
batch predictor docker image to authenticate and pull model artifacts
correctly depending on the configured blob storage type
- `python/batch-predictor/docker/base.Dockerfile` - Add steps to the
base batch predictor image to install the AWS CLI
- `python/pyfunc-server/docker/Dockerfile` - Add steps to the pyfunc
server docker image to authenticate and pull model artifacts correctly
depending on the configured blob storage type
- `python/pyfunc-server/docker/base.Dockerfile` - Add steps to the base
pyfunc server image to install the AWS CLI

# Tests
<!-- Besides the existing / updated automated tests, what specific
scenarios should be tested? Consider the backward compatibility of the
changes, whether corner cases are covered, etc. Please describe the
tests and check the ones that have been completed. Eg:
- [x] Deploying new and existing standard models
- [ ] Deploying PyFunc models
-->

# Checklist
- [x] Added PR label
- [x] Added unit test, integration, and/or e2e tests
- [x] Tested locally
- [ ] Updated documentation
- [ ] Update Swagger spec if the PR introduce API changes
- [ ] Regenerated Golang and Python client if the PR introduces API
changes

# Release Notes
<!--
Does this PR introduce a user-facing change?
If no, just write "NONE" in the release-note block below.
If yes, a release note is required. Enter your extended release note in
the block below.
If the PR requires additional action from users switching to the new
release, include the string "action required".

For more information about release notes, see kubernetes' guide here:
http://git.k8s.io/community/contributors/guide/release-notes.md
-->

```release-note
NONE
```
  • Loading branch information
deadlycoconuts authored Nov 8, 2024
1 parent 7f7d42c commit 875b77f
Show file tree
Hide file tree
Showing 18 changed files with 786 additions and 376 deletions.
5 changes: 0 additions & 5 deletions .github/workflows/codecov-config/codecov.yml

This file was deleted.

14 changes: 0 additions & 14 deletions .github/workflows/merlin.yml
Original file line number Diff line number Diff line change
Expand Up @@ -137,13 +137,6 @@ jobs:
E2E_USE_GOOGLE_OAUTH: false
working-directory: ./python/sdk
run: make unit-test
- uses: codecov/codecov-action@v4
with:
flags: sdk-test-${{ matrix.python-version }}
name: sdk-test-${{ matrix.python-version }}
token: ${{ secrets.CODECOV_TOKEN }}
working-directory: ./python/sdk
codecov_yml_path: ../../.github/workflows/codecov-config/codecov.yml

lint-api:
runs-on: ubuntu-latest
Expand Down Expand Up @@ -188,13 +181,6 @@ jobs:
POSTGRES_USER: ${{ secrets.DB_USERNAME }}
POSTGRES_PASSWORD: ${{ secrets.DB_PASSWORD }}
run: make it-test-api-ci
- uses: codecov/codecov-action@v4
with:
flags: api-test
name: api-test
token: ${{ secrets.CODECOV_TOKEN }}
working-directory: ./api
codecov_yml_path: ../.github/workflows/codecov-config/codecov.yml

test-observation-publisher:
runs-on: ubuntu-latest
Expand Down
91 changes: 50 additions & 41 deletions api/cmd/api/setup.go
Original file line number Diff line number Diff line change
Expand Up @@ -6,7 +6,6 @@ import (
"net/http"
"time"

gcs "cloud.google.com/go/storage"
"github.com/GoogleCloudPlatform/spark-on-k8s-operator/pkg/client/clientset/versioned"
"github.com/caraml-dev/merlin/webhook"
"github.com/caraml-dev/mlp/api/pkg/artifact"
Expand Down Expand Up @@ -105,55 +104,65 @@ func initImageBuilder(cfg *config.Config) (webserviceBuilder imagebuilder.ImageB
}

var artifactService artifact.Service
if cfg.ImageBuilderConfig.ArtifactServiceType == "gcs" {
gcsClient, err := gcs.NewClient(context.Background())
if err != nil {
log.Panicf("%s,failed initializing gcs for mlflow delete package", err.Error())
}
artifactService = artifact.NewGcsArtifactClient(gcsClient)
} else if cfg.ImageBuilderConfig.ArtifactServiceType == "nop" {
if cfg.MlflowConfig.ArtifactServiceType == "gcs" {
artifactService, err = artifact.NewGcsArtifactClient()
} else if cfg.MlflowConfig.ArtifactServiceType == "s3" {
artifactService, err = artifact.NewS3ArtifactClient()
} else if cfg.MlflowConfig.ArtifactServiceType == "nop" {
artifactService = artifact.NewNopArtifactClient()
} else {
log.Panicf("invalid artifact service type %s", cfg.ImageBuilderConfig.ArtifactServiceType)
log.Panicf("invalid artifact service type %s", cfg.MlflowConfig.ArtifactServiceType)
}
if err != nil {
log.Panicf("%s,failed initializing mlflow artifact service", err.Error())
}

if cfg.ImageBuilderConfig.KanikoPushRegistryType != "gcr" &&
cfg.ImageBuilderConfig.KanikoPushRegistryType != "docker" {
log.Panicf("invalid kaniko push registry type %s", cfg.ImageBuilderConfig.KanikoPushRegistryType)
}

webServiceConfig := imagebuilder.Config{
BaseImage: cfg.ImageBuilderConfig.BaseImage,
BuildNamespace: cfg.ImageBuilderConfig.BuildNamespace,
DockerRegistry: cfg.ImageBuilderConfig.DockerRegistry,
BuildTimeoutDuration: timeout,
KanikoImage: cfg.ImageBuilderConfig.KanikoImage,
KanikoServiceAccount: cfg.ImageBuilderConfig.KanikoServiceAccount,
KanikoAdditionalArgs: cfg.ImageBuilderConfig.KanikoAdditionalArgs,
DefaultResources: cfg.ImageBuilderConfig.DefaultResources,
Tolerations: cfg.ImageBuilderConfig.Tolerations,
NodeSelectors: cfg.ImageBuilderConfig.NodeSelectors,
MaximumRetry: cfg.ImageBuilderConfig.MaximumRetry,
SafeToEvict: cfg.ImageBuilderConfig.SafeToEvict,
ClusterName: cfg.ImageBuilderConfig.ClusterName,
GcpProject: cfg.ImageBuilderConfig.GcpProject,
Environment: cfg.Environment,
SupportedPythonVersions: cfg.ImageBuilderConfig.SupportedPythonVersions,
BaseImage: cfg.ImageBuilderConfig.BaseImage,
BuildNamespace: cfg.ImageBuilderConfig.BuildNamespace,
DockerRegistry: cfg.ImageBuilderConfig.DockerRegistry,
BuildTimeoutDuration: timeout,
KanikoImage: cfg.ImageBuilderConfig.KanikoImage,
KanikoPushRegistryType: cfg.ImageBuilderConfig.KanikoPushRegistryType,
KanikoDockerCredentialSecretName: cfg.ImageBuilderConfig.KanikoDockerCredentialSecretName,
KanikoServiceAccount: cfg.ImageBuilderConfig.KanikoServiceAccount,
KanikoAdditionalArgs: cfg.ImageBuilderConfig.KanikoAdditionalArgs,
DefaultResources: cfg.ImageBuilderConfig.DefaultResources,
Tolerations: cfg.ImageBuilderConfig.Tolerations,
NodeSelectors: cfg.ImageBuilderConfig.NodeSelectors,
MaximumRetry: cfg.ImageBuilderConfig.MaximumRetry,
SafeToEvict: cfg.ImageBuilderConfig.SafeToEvict,
ClusterName: cfg.ImageBuilderConfig.ClusterName,
GcpProject: cfg.ImageBuilderConfig.GcpProject,
Environment: cfg.Environment,
SupportedPythonVersions: cfg.ImageBuilderConfig.SupportedPythonVersions,
}
webserviceBuilder = imagebuilder.NewModelServiceImageBuilder(kubeClient, webServiceConfig, artifactService)

predJobConfig := imagebuilder.Config{
BaseImage: cfg.ImageBuilderConfig.PredictionJobBaseImage,
BuildNamespace: cfg.ImageBuilderConfig.BuildNamespace,
DockerRegistry: cfg.ImageBuilderConfig.DockerRegistry,
BuildTimeoutDuration: timeout,
KanikoImage: cfg.ImageBuilderConfig.KanikoImage,
KanikoServiceAccount: cfg.ImageBuilderConfig.KanikoServiceAccount,
KanikoAdditionalArgs: cfg.ImageBuilderConfig.KanikoAdditionalArgs,
DefaultResources: cfg.ImageBuilderConfig.DefaultResources,
Tolerations: cfg.ImageBuilderConfig.Tolerations,
NodeSelectors: cfg.ImageBuilderConfig.NodeSelectors,
MaximumRetry: cfg.ImageBuilderConfig.MaximumRetry,
SafeToEvict: cfg.ImageBuilderConfig.SafeToEvict,
ClusterName: cfg.ImageBuilderConfig.ClusterName,
GcpProject: cfg.ImageBuilderConfig.GcpProject,
Environment: cfg.Environment,
SupportedPythonVersions: cfg.ImageBuilderConfig.SupportedPythonVersions,
BaseImage: cfg.ImageBuilderConfig.PredictionJobBaseImage,
BuildNamespace: cfg.ImageBuilderConfig.BuildNamespace,
DockerRegistry: cfg.ImageBuilderConfig.DockerRegistry,
BuildTimeoutDuration: timeout,
KanikoImage: cfg.ImageBuilderConfig.KanikoImage,
KanikoPushRegistryType: cfg.ImageBuilderConfig.KanikoPushRegistryType,
KanikoDockerCredentialSecretName: cfg.ImageBuilderConfig.KanikoDockerCredentialSecretName,
KanikoServiceAccount: cfg.ImageBuilderConfig.KanikoServiceAccount,
KanikoAdditionalArgs: cfg.ImageBuilderConfig.KanikoAdditionalArgs,
DefaultResources: cfg.ImageBuilderConfig.DefaultResources,
Tolerations: cfg.ImageBuilderConfig.Tolerations,
NodeSelectors: cfg.ImageBuilderConfig.NodeSelectors,
MaximumRetry: cfg.ImageBuilderConfig.MaximumRetry,
SafeToEvict: cfg.ImageBuilderConfig.SafeToEvict,
ClusterName: cfg.ImageBuilderConfig.ClusterName,
GcpProject: cfg.ImageBuilderConfig.GcpProject,
Environment: cfg.Environment,
SupportedPythonVersions: cfg.ImageBuilderConfig.SupportedPythonVersions,
}
predJobBuilder = imagebuilder.NewPredictionJobImageBuilder(kubeClient, predJobConfig, artifactService)

Expand Down
36 changes: 21 additions & 15 deletions api/config/config.go
Original file line number Diff line number Diff line change
Expand Up @@ -104,7 +104,7 @@ type BaseImageConfig struct {
ImageName string `validate:"required" json:"imageName"`
// Dockerfile Path within the build context
DockerfilePath string `validate:"required" json:"dockerfilePath"`
// GCS URL Containing build context
// GCS/S3 URL Containing build context
BuildContextURI string `validate:"required" json:"buildContextURI"`
// Path to sub folder which is intended to build instead of using root folder
BuildContextSubPath string `json:"buildContextSubPath"`
Expand Down Expand Up @@ -209,18 +209,19 @@ type ClusterConfig struct {
}

type ImageBuilderConfig struct {
ClusterName string `validate:"required"`
GcpProject string
ArtifactServiceType string
BaseImage BaseImageConfig `validate:"required"`
PredictionJobBaseImage BaseImageConfig `validate:"required"`
BuildNamespace string `validate:"required" default:"mlp"`
DockerRegistry string `validate:"required"`
BuildTimeout string `validate:"required" default:"10m"`
KanikoImage string `validate:"required" default:"gcr.io/kaniko-project/executor:v1.6.0"`
KanikoServiceAccount string
KanikoAdditionalArgs []string
DefaultResources ResourceRequestsLimits `validate:"required"`
ClusterName string `validate:"required"`
GcpProject string
BaseImage BaseImageConfig `validate:"required"`
PredictionJobBaseImage BaseImageConfig `validate:"required"`
BuildNamespace string `validate:"required" default:"mlp"`
DockerRegistry string `validate:"required"`
BuildTimeout string `validate:"required" default:"10m"`
KanikoImage string `validate:"required" default:"gcr.io/kaniko-project/executor:v1.6.0"`
KanikoServiceAccount string
KanikoPushRegistryType string `validate:"required,oneof=docker gcr" default:"docker"`
KanikoDockerCredentialSecretName string
KanikoAdditionalArgs []string
DefaultResources ResourceRequestsLimits `validate:"required"`
// How long to keep the image building job resource in the Kubernetes cluster. Default: 2 days (48 hours).
Retention time.Duration `validate:"required" default:"48h"`
Tolerations Tolerations
Expand Down Expand Up @@ -453,8 +454,13 @@ type JaegerConfig struct {
}

type MlflowConfig struct {
TrackingURL string `validate:"required"`
ArtifactServiceType string `validate:"required"`
TrackingURL string `validate:"required_if=ArtifactServiceType gcs ArtifactServiceType s3"`
// Note that the Kaniko image builder needs to be configured correctly to have the necessary credentials to download
// the artifacts from the blob storage tool depending on the artifact service type selected (gcs/s3). For gcs, the
// credentials can be provided via a k8s service account or a secret but for s3, the credentials can be provided via
// additional arguments in the config KanikoAdditionalArgs e.g.
// --build-arg=[AWS_ACCESS_KEY_ID/AWS_SECRET_ACCESS_KEY/AWS_DEFAULT_REGION/AWS_ENDPOINT_URL]=xxx
ArtifactServiceType string `validate:"required,oneof=nop gcs s3"`
}

func (cfg *Config) Validate() error {
Expand Down
18 changes: 9 additions & 9 deletions api/config/config_test.go
Original file line number Diff line number Diff line change
Expand Up @@ -387,9 +387,8 @@ func TestLoad(t *testing.T) {
EnvironmentConfigs: []*EnvironmentConfig{},
},
ImageBuilderConfig: ImageBuilderConfig{
ClusterName: "test-cluster",
GcpProject: "test-project",
ArtifactServiceType: "gcs",
ClusterName: "test-cluster",
GcpProject: "test-project",
BaseImage: BaseImageConfig{
ImageName: "ghcr.io/caraml-dev/merlin/merlin-pyfunc-base:0.0.0",
DockerfilePath: "pyfunc-server/docker/Dockerfile",
Expand All @@ -403,12 +402,13 @@ func TestLoad(t *testing.T) {
BuildContextSubPath: "python",
MainAppPath: "/home/spark/merlin-spark-app/main.py",
},
BuildNamespace: "caraml",
DockerRegistry: "test-docker.pkg.dev/test/caraml-registry",
BuildTimeout: "30m",
KanikoImage: "gcr.io/kaniko-project/executor:v1.21.0",
KanikoServiceAccount: "kaniko-merlin",
KanikoAdditionalArgs: []string{"--test=true", "--no-logs=false"},
BuildNamespace: "caraml",
DockerRegistry: "test-docker.pkg.dev/test/caraml-registry",
BuildTimeout: "30m",
KanikoImage: "gcr.io/kaniko-project/executor:v1.21.0",
KanikoServiceAccount: "kaniko-merlin",
KanikoPushRegistryType: "docker",
KanikoAdditionalArgs: []string{"--test=true", "--no-logs=false"},
DefaultResources: ResourceRequestsLimits{
Requests: Resource{
CPU: "1",
Expand Down
23 changes: 21 additions & 2 deletions api/go.mod
Original file line number Diff line number Diff line change
Expand Up @@ -4,15 +4,14 @@ go 1.22

require (
cloud.google.com/go/bigtable v1.11.0
cloud.google.com/go/storage v1.39.0
github.com/GoogleCloudPlatform/spark-on-k8s-operator v0.0.0-20221025152940-c261df66a006
github.com/afex/hystrix-go v0.0.0-20180502004556-fa1af6a1f4f5
github.com/antihax/optional v1.0.0
github.com/antonmedv/expr v1.12.5
github.com/bboughton/gcp-helpers v0.1.0
github.com/buger/jsonparser v1.1.1
github.com/caraml-dev/merlin-pyspark-app v0.0.3
github.com/caraml-dev/mlp v1.13.2-rc2
github.com/caraml-dev/mlp v1.13.2
github.com/caraml-dev/protopath v0.1.0
github.com/caraml-dev/universal-prediction-interface v1.0.0
github.com/cenkalti/backoff/v4 v4.2.1
Expand Down Expand Up @@ -246,11 +245,31 @@ require (
)

require (
cloud.google.com/go/storage v1.39.0 // indirect
github.com/avast/retry-go/v4 v4.6.0 // indirect
github.com/aws/aws-sdk-go-v2 v1.30.6-0.20240906182417-827d25db0048 // indirect
github.com/aws/aws-sdk-go-v2/aws/protocol/eventstream v1.6.4 // indirect
github.com/aws/aws-sdk-go-v2/config v1.17.8 // indirect
github.com/aws/aws-sdk-go-v2/credentials v1.12.21 // indirect
github.com/aws/aws-sdk-go-v2/feature/ec2/imds v1.12.17 // indirect
github.com/aws/aws-sdk-go-v2/internal/configsources v1.3.17 // indirect
github.com/aws/aws-sdk-go-v2/internal/endpoints/v2 v2.6.17 // indirect
github.com/aws/aws-sdk-go-v2/internal/ini v1.3.24 // indirect
github.com/aws/aws-sdk-go-v2/internal/v4a v1.3.17 // indirect
github.com/aws/aws-sdk-go-v2/service/internal/accept-encoding v1.11.4 // indirect
github.com/aws/aws-sdk-go-v2/service/internal/checksum v1.3.19 // indirect
github.com/aws/aws-sdk-go-v2/service/internal/presigned-url v1.11.19 // indirect
github.com/aws/aws-sdk-go-v2/service/internal/s3shared v1.17.17 // indirect
github.com/aws/aws-sdk-go-v2/service/s3 v1.61.2 // indirect
github.com/aws/aws-sdk-go-v2/service/sso v1.11.23 // indirect
github.com/aws/aws-sdk-go-v2/service/ssooidc v1.13.6 // indirect
github.com/aws/aws-sdk-go-v2/service/sts v1.16.19 // indirect
github.com/aws/smithy-go v1.20.4 // indirect
golang.org/x/time v0.5.0 // indirect
k8s.io/klog/v2 v2.120.1 // indirect
k8s.io/kube-openapi v0.0.0-20231113174909-778a5567bc1e // indirect
sigs.k8s.io/structured-merge-diff/v4 v4.4.1 // indirect

)

replace (
Expand Down
Loading

0 comments on commit 875b77f

Please sign in to comment.