
Feat: Add BentoML Yatai Stack contrib #2350

Merged: 1 commit merged into kubeflow:master from feat/integrate-bentoml on Feb 1, 2023

Conversation

yetone
Contributor

@yetone yetone commented Dec 22, 2022

Design Proposal

Why BentoML


  • BentoML is an open-source framework for developing, serving, and deploying ML services.
    • Building
      • Unifies ML frameworks with out-of-the-box implementation of popular frameworks
      • Exposes gRPC and OpenAPI for serving
      • Provides Python SDK for development
    • Deployment
      • Any environment, batch inference, streaming, or online serving
      • Any cloud platform or on-prem
      • Full observability support through Grafana
      • Yatai - BentoML's deployment platform

User Stories

Goal: From simple Python module to distributed Kubernetes deployment.

Consider the following common ML service, which involves custom pre- and post-processing logic and inference across multiple models.


Developing on Kubeflow Notebook

  • Create a service using the saved models.
%%writefile service.py
import asyncio
import bentoml

fraud_detection = bentoml.pytorch.get("fraud_detection:latest").to_runner()
risk_assessment_1 = bentoml.sklearn.get("risk_assessment_1:latest").to_runner()
risk_assessment_2 = bentoml.sklearn.get("risk_assessment_2:latest").to_runner()
risk_assessment_3 = bentoml.sklearn.get("risk_assessment_3:latest").to_runner()

svc = bentoml.Service(
    name="credit_application",
    runners=[fraud_detection, risk_assessment_1, risk_assessment_2, risk_assessment_3]
)

@svc.api(input=bentoml.io.JSON(), output=bentoml.io.JSON())
async def apply(input_data: dict) -> dict:
    # fetch_features, CONFIDENCE_THRESHOLD, REJECTION, and process_assessments
    # are application-specific helpers assumed to be defined elsewhere.
    features = await fetch_features(input_data["user_id"])
    detection = await fraud_detection.async_run(input_data, features)
    if detection["confidence"] < CONFIDENCE_THRESHOLD:
        return REJECTION
    assessments = await asyncio.gather(
        risk_assessment_1.async_run(input_data["application"], features),
        risk_assessment_2.async_run(input_data["application"], features),
        risk_assessment_3.async_run(input_data["application"], features),
    )
    return process_assessments(assessments)

  • Serve and test the service.
!bentoml serve service.py:svc --reload

2022-11-07T06:50:53+0000 [INFO] [cli] Prometheus metrics for HTTP BentoServer from "service.py:svc" can be accessed at http://localhost:3000/metrics.
2022-11-07T06:50:53+0000 [INFO] [cli] Starting development HTTP BentoServer from "service.py:svc" listening on http://0.0.0.0:3000/ (Press CTRL+C to quit)


  • Build the bento.
!bentoml build

Building BentoML service "credit_application:wly5lqc6ncpzwcvj" from build context "."
Successfully built Bento(tag="credit_application:wly5lqc6ncpzwcvj").
  • Export bento to blob storage.
!bentoml export credit_application:wly5lqc6ncpzwcvj s3://your_bento_bucket/credit_application.wly5lqc6ncpzwcvj.bento

Deploying to Kubernetes


Users can deploy bentos to the Kubernetes cluster in one of three ways.

Kubernetes Python Client

Users can deploy bentos from a Kubeflow Notebook with the Kubernetes Python client.
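
For illustration, a minimal sketch of such a deployment with the official kubernetes Python client follows. It is not taken from this PR: the CRD group/version ("serving.yatai.ai/v2alpha1") and the spec fields are assumptions, so verify them against the CRDs actually installed in your cluster (kubectl get crds) before relying on them.

# Hedged sketch: create a BentoDeployment custom resource from a notebook.
from kubernetes import client, config

config.load_incluster_config()  # inside a Kubeflow Notebook pod; use config.load_kube_config() elsewhere
api = client.CustomObjectsApi()

bento_deployment = {
    "apiVersion": "serving.yatai.ai/v2alpha1",  # assumed group/version
    "kind": "BentoDeployment",
    "metadata": {"name": "credit-application", "namespace": "kubeflow"},
    "spec": {"bento": "credit-application"},  # hypothetical spec: references a Bento CR by name
}

api.create_namespaced_custom_object(
    group="serving.yatai.ai",   # assumed
    version="v2alpha1",         # assumed
    namespace="kubeflow",
    plural="bentodeployments",
    body=bento_deployment,
)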

kubectl

BentoML offers two options to deploy bentos directly to the Kubernetes cluster through kubectl and the BentoRequest, Bento, and BentoDeployment CRDs.

The first option relies on yatai-image-builder to build the OCI image. Users need to create a BentoRequest CR and a BentoDeployment CR to deploy a bento. In the BentoDeployment CR, the name of the bento should be set to the name of the BentoRequest CR. If the Bento CR is not found, yatai-deployment will look for a BentoRequest CR of the same name and wait for it to generate the Bento CR. This option builds the OCI image by spawning a pod that runs the Kaniko build tool. However, Kaniko requires root user access. If root user access is not available, consider the second option below.
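
As a hedged sketch of this pair of CRs (not the PR's example.yaml): the apiVersion values and spec field names below are assumptions based on the Yatai documentation, so verify them against the installed CRDs. The downloadUrl reuses the S3 path from the export step above.

apiVersion: resources.yatai.ai/v1alpha1   # assumed group/version
kind: BentoRequest
metadata:
  name: credit-application
  namespace: kubeflow
spec:
  bentoTag: credit_application:wly5lqc6ncpzwcvj   # assumed field; the tag built above
  downloadUrl: s3://your_bento_bucket/credit_application.wly5lqc6ncpzwcvj.bento   # assumed field
---
apiVersion: serving.yatai.ai/v2alpha1     # assumed group/version
kind: BentoDeployment
metadata:
  name: credit-application
  namespace: kubeflow
spec:
  bento: credit-application   # must match the BentoRequest name, per the text above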

The second option relies on users to provide a URI to a pre-built OCI image of the bento. Users need to manually create a Bento CR with the image field set to the pre-built OCI image URI, then create a BentoDeployment CR that references the Bento CR.
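
Again as a hedged sketch, under the same assumptions about apiVersion values and field names; the image URI is hypothetical:

apiVersion: resources.yatai.ai/v1alpha1   # assumed group/version
kind: Bento
metadata:
  name: credit-application
  namespace: kubeflow
spec:
  image: registry.example.com/bentos/credit_application:wly5lqc6ncpzwcvj   # hypothetical pre-built OCI image
---
apiVersion: serving.yatai.ai/v2alpha1     # assumed group/version
kind: BentoDeployment
metadata:
  name: credit-application
  namespace: kubeflow
spec:
  bento: credit-application   # references the Bento CR above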

Kubeflow Pipeline Component

This option will be available in Kubeflow release 1.8.

Verification

The following installation and testing steps demonstrate how to install Yatai components and deploy bentos through kubectl with BentoRequest and BentoDeployment CRDs.

Installation

Install with kustomize command:

kustomize build bentoml-yatai-stack/default | kubectl apply -n kubeflow --server-side -f -

Test

Create Bento CR and BentoDeployment CR:

kubectl apply -f example.yaml

Verify that the bento deployment is running:

kubectl -n kubeflow get deploy -l yatai.ai/bento-deployment=test-yatai

The output of the above command should look like this:

NAME                  READY   UP-TO-DATE   AVAILABLE   AGE
test-yatai            1/1     1            1           6m12s
test-yatai-runner-0   1/1     1            1           16m

Verify that the bento service is created:

kubectl -n kubeflow get service -l yatai.ai/bento-deployment=test-yatai

The output of the above command should look like this:

NAME                                                 TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)             AGE
test-yatai                                           ClusterIP   10.96.150.42    <none>        3000/TCP,3001/TCP   7m59s
test-yatai-runner-32c50ece701351fb576189d54bd58724   ClusterIP   10.96.193.242   <none>        3000/TCP,3001/TCP   7m39s

Port-forward the bento service:

kubectl -n kubeflow port-forward svc/test-yatai 3000:3000

Finally, you can test the bento service with the curl command:

curl -X 'POST' http://localhost:3000/classify -d '[[0,1,2,3]]'

The output should be:

[2]

@google-cla

google-cla bot commented Dec 22, 2022

Thanks for your pull request! It looks like this may be your first contribution to a Google open source project. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

View this failed invocation of the CLA check for more information.

For the most up to date status, view the checks section at the bottom of the pull request.

@@ -0,0 +1,34 @@
apiVersion: apps/v1
Contributor

@yetone could you please add an explanation of why we need a DaemonSet? Question from Julius.

@juliusvonkohout
Member

juliusvonkohout commented Jan 15, 2023

  1. I recommend the newer naming: not Docker but OCI.
  2. Why do you need oci-private-registry-proxy? And why do you need a DaemonSet?
  3. Why do you request access to all secrets with the roles yatai-deployment-in-yatai-system and yatai-deployment-with-bento-deployment instead of a fixed list? Are you in general sure that your roles are the minimum needed to run BentoML?
  4. I am missing an architectural overview in general. Please have a look at https://docs.google.com/document/d/1SP-COL6O-ETMQny7zXbrbEL4zZBG3REErWjfcvFZVEw/edit?usp=sharing&resourcekey=0-IaUGVemd2XX6MGTgifDbpg. I have only found https://docs.bentoml.org/projects/yatai/en/latest/concepts/architecture.html so far.
  5. What is the service communication matrix? I want to create proper NetworkPolicies for BentoML.
  6. The security violations with root containers for builds with Kaniko should be documented in the README as well, along with a possible solution.

@yetone
Contributor Author

yetone commented Jan 15, 2023

@juliusvonkohout

Thank you for your review. Here are some of my answers; I am not sure whether they fully address your questions.

  1. It's a good suggestion, I'll change it.
  2. oci-private-registry-proxy circumvents the problem of TLS being enforced when nodes pull OCI images: Docker does not require HTTPS for 127.0.0.1:5000 by default, which is why a DaemonSet is used. But I can parameterize the OCI registry settings instead of providing them out of the box, and let users set the OCI registry configuration themselves, so there is no need to install oci-registry and this proxy.
  3. Does Kubernetes RBAC have the ability to specify the name of a resource? I was under the impression that RBAC could only specify the type of a resource. Yes, I put a lot of effort into minimizing this RBAC, so I'm sure this is the minimum RBAC range required by BentoML.
  4. My PR is modeled after this PR from Seldon. Does it actually require us to describe our architecture diagram? I don't see that in the proposals.
  5. The components do not have any network access to each other; they only communicate through the Bento CR.
  6. I can also switch the image build engine to buildah, so at least I can use OpenShift's anyuid SCC to build the image.

@juliusvonkohout
Member

2. oci-private-registry-proxy circumvents the problem of TLS being enforced when nodes pull OCI images: Docker does not require HTTPS for 127.0.0.1:5000 by default, which is why a DaemonSet is used. But I can parameterize the OCI registry settings instead of providing them out of the box, and let users set the OCI registry configuration themselves, so there is no need to install oci-registry and this proxy.

I am confused about that. There is no Docker in modern Kubernetes, only containerd, CRI-O, etc. They handle image pulling for you. Maybe I need a bit more detail.

"Does Kubernetes RBAC have the ability to specify the name of a resource?" Yes it does. It is somewhere in the official Kubernetes documentation. It only allows a static positive list, no wildcards or regex stuff.

"The components do not have any network access to each other; they only communicate through the Bento CR." So that means in the NetworkPolicy I only need to open the ports of the BentoML webhooks in the Kubeflow namespace to the Kubernetes apiserver?

"I can also switch the image build engine to buildah, so at least I can use OpenShift's anyuid SCC to build the image." So are you using even more dangerous stuff than root so far? I know that at least buildah/podman "only" need root but no additional CAP_SYS_* capabilities or privileged containers.

@yetone
Contributor Author

yetone commented Jan 15, 2023

  1. oci-private-registry-proxy circumvents the problem of TLS being enforced when nodes pull OCI images: Docker does not require HTTPS for 127.0.0.1:5000 by default, which is why a DaemonSet is used. But I can parameterize the OCI registry settings instead of providing them out of the box, and let users set the OCI registry configuration themselves, so there is no need to install oci-registry and this proxy.

I am confused about that. There is no Docker in modern Kubernetes, only containerd, CRI-O, etc. They handle image pulling for you. Maybe I need a bit more detail.

To be honest, oci-private-registry-proxy exists so that things work out of the box, because most users run Kubernetes environments (minikube, kind, k3s) backed by Docker. I can delete it, not install an OCI registry automatically, and let users specify their own existing OCI registry, but that will make the BentoML Yatai stack installation a bit more troublesome because of the pre-dependencies.

"Does k8s' rbac have the ability to specify the name of a resource?" Yes it does. It is somewhere in the official kubernetes documentation. In only allows a static positive list, no wildcards or regex stuff.

I found it in the documentation, thank you. I will add the resource names.
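
For illustration, a minimal Role restricted by resourceNames looks like this (standard Kubernetes RBAC, not code from this PR; the role and secret names are hypothetical). Note that resourceNames takes a static list of exact names and cannot restrict create, list, or watch requests, since the object name is not known to the authorizer for those:

apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: yatai-deployment-read-secrets   # hypothetical name
  namespace: kubeflow
rules:
- apiGroups: [""]
  resources: ["secrets"]
  resourceNames: ["yatai-registry-credentials"]   # hypothetical: a fixed positive list, no wildcards
  verbs: ["get"]   # get is name-scoped; list/watch/create ignore resourceNames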

"The components do not have any network access to each other, they only communicate through Bento CR" So that means in the networkpolicy i only need to enable the ports of the webhooks if bentoml in the Kubeflow namespace to the kubernetes apiserver?

Yes, just open the conversion webhook port of yatai-deployment to the Kubernetes apiserver and that's it!
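
As a hedged sketch of such a policy (not from this PR): the pod label and the port are assumptions (9443 is the controller-runtime webhook default). The API server cannot be selected by a podSelector, so the rule simply restricts ingress to the webhook port:

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: yatai-deployment-allow-webhook   # hypothetical name
  namespace: kubeflow
spec:
  podSelector:
    matchLabels:
      app.kubernetes.io/name: yatai-deployment   # assumed label
  policyTypes:
  - Ingress
  ingress:
  - ports:
    - protocol: TCP
      port: 9443   # assumed webhook port (controller-runtime default)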

"I can also switch the image build engine to buildah, so at least I can use openshift's anyuid SCC to build the image" So are you using even more dangerous stuff than root so far? I know that at least buildah/podman "only" need root but no additional SYS_CAP_* or priviledged containers.

We didn't use any dangerous functionality, and it feels like Kaniko will work under the anyuid SCC. I'll try it out.

Finally, thank you for your reply!

@juliusvonkohout
Member

juliusvonkohout commented Jan 15, 2023

Also minikube, k3s, etc. will have to drop Docker at some point in the future. I think even GKE dropped it with Kubernetes 1.19. So if it is only relevant for ancient installations, I would drop it, since we require Kubernetes 1.24+ anyway in Kubeflow 1.7. Do you have detailed documentation of the DaemonSet? I think even Docker Desktop uses containerd nowadays.

@juliusvonkohout
Member

juliusvonkohout commented Jan 15, 2023

And just so you know, we are not fixed on a specific build tool. Podman and Kaniko are fine; they should just have rootless builds on their roadmap.

@yetone
Contributor Author

yetone commented Jan 15, 2023

@juliusvonkohout Yes, you are right. I will delete the docker-registry and the proxy.

@ssheng
Contributor

ssheng commented Jan 17, 2023

(Design proposal reposted; identical to the PR description above.)

@yetone yetone force-pushed the feat/integrate-bentoml branch from 7f4cd86 to 23b7971 on January 17, 2023 at 10:44
@juliusvonkohout
Member

This looks very good. There is a typo (fech instead of fetch), and maybe use the original SVG of the diagram instead of the PNG. I am actually in favor of just adding this documentation, including the images, to the PR, maybe in the README.md.

Furthermore, we got the root-container security exception for Kubeflow 1.7, but you also need to document it in the README.md, along with the plan to solve it for Kubeflow 1.8+ as discussed in the release meeting yesterday. Also, documenting how to use a prebuilt image from somewhere else, without having to build it with root rights, would be good.

@yetone yetone force-pushed the feat/integrate-bentoml branch from 23b7971 to 17ede3a on January 17, 2023 at 11:00
@yetone
Contributor Author

yetone commented Jan 17, 2023

@juliusvonkohout Hi, I have updated this PR as per your valuable advice, including removing everything related to the docker-registry and minimizing the roles as much as possible.

@kimwnasptd
Member

@yetone @juliusvonkohout IMHO the state of this PR is amazing! The README is one of the best we have in the whole repo. And kudos to @juliusvonkohout for driving the review and exposing the security concerns.

I'm more than happy to merge this now, to make sure it's included in the release branch.

As @juliusvonkohout pointed out in a Manifests WG meeting, we'll also need to update the links for some images, since they reference data from a user repo. We've seen this cause problems multiple times in kubeflow/examples (kubeflow/examples#1006, kubeflow/examples#1005, kubeflow/examples#1004, kubeflow/examples#1003), so @yetone let's have a subsequent PR where we:

  1. Push the required images to a dir in this repo
  2. Use relevant links

But, again, let's merge this for now and we can fix these nits afterwards.

Again, amazing work!

/lgtm
/approve

@EricBentoML

Thanks, @kimwnasptd.
We will follow your suggestion to update the images after our engineer returns from PTO.
For this PR, it seems like it didn't get merged due to the "do-not-merge/invalid-owners-file" label.
@DomFleischmann any suggestions on how to move forward?
Please let us know if you need anything. Thx.
cc @ssheng, @yetone

@juliusvonkohout
Member

The following users are mentioned in OWNERS file(s) but are untrusted for the following reasons. One way to make the user trusted is to add them as members of the kubeflow org. You can then trigger verification by writing /verify-owners in a comment.

* yetone
  
  * User is not a member of the org. User is not a collaborator. Satisfy at least one of these conditions to make the user trusted.

* yubozhao
  
  * User is not a member of the org. User is not a collaborator. Satisfy at least one of these conditions to make the user trusted.

Either you become a member or you add me and Kimonas to the OWNERS file instead. The membership must be approved by Google, which might take some time.

@@ -0,0 +1,3 @@
approvers:
- yetone
- yubozhao
Contributor

Suggested change
- yubozhao
- yubozhao
- juliusvonkohout

Contributor

@yetone could you please approve this suggestion? cc: @juliusvonkohout, @EricBentoML

Member

Yes, if you keep yubozhao he must be a member of Kubeflow as well. If you do not have any members in your company, use me and @kimwnasptd.


Hello Julius.

I am part of the release 1.7 team and I should already be a kubeflow member.

@yubozhao

/lgtm
/approve

@ssheng
Copy link
Contributor

ssheng commented Jan 30, 2023

/verify-owners

@yetone
Contributor Author

yetone commented Feb 1, 2023

/verify-owners

@yubozhao

yubozhao commented Feb 1, 2023

/lgtm
/approve

@google-oss-prow google-oss-prow bot added the lgtm label Feb 1, 2023
@google-oss-prow

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: kimwnasptd, yetone, yubozhao

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@google-oss-prow google-oss-prow bot merged commit 6b6954f into kubeflow:master Feb 1, 2023
kevin85421 pushed a commit to juliusvonkohout/manifests that referenced this pull request Feb 28, 2023