Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

add HPA feature/unit/fvt/docs/script #342

Merged
merged 7 commits into from
Apr 27, 2023
Merged

add HPA feature/unit/fvt/docs/script #342

merged 7 commits into from
Apr 27, 2023

Conversation

Jooho
Copy link
Contributor

@Jooho Jooho commented Mar 8, 2023

Motivation

This PR is about #329

Modifications

- Added autoscaler controller 
  - added unit tests
- Added hpa controller 
- Added servingruntime webhook
  - added unit tests
  - added self-signed-ca.sh
- Added HPA Autoscaler fvt test
- Update Makefile
  - add deploy-release-dev-mode-fvt
- Update install.sh
  - add --modelmesh-serving-image
  - add --enable-self-signed-ca option
- Update Manifests
  - certmanager
  - webhook
  - rbac
    - networkpolicy-webhook.yaml
    - role.yaml
      - added horizontalpodautoscalers
  - dependencies
    - fvt.yaml
      - added --data-dir (this is from quickstart.yaml)
- Update Docs      
  - docs/install/install-script.md
  - docs/install/developer.md
  - docs/install/quickstart.md
  - docs/production-use/scaling.md

Result

With this PR, ServingRuntime/ClusterServingRuntime can use HPA autoscaling feature.
Adding an annotation serving.kserve.io/autoscalerClass: hpa into ServingRuntime/ClusterServingRuntime, the feature becomes enabled.

After the HPA feature is enabled, you can tune HPA in detail by using the annotation below.

annotations:
    serving.kserve.io/autoscalerClass: hpa
    serving.kserve.io/targetUtilizationPercentage: "75"
    serving.kserve.io/metrics: "cpu"
    autoscaling.knative.dev/min-scale: "2"
    autoscaling.knative.dev/max-scale: "3"

Test results

  • Unit => Success
  • FVT Tests
    -  ENABLE_SELF_SIGNED_CA + ClusterServingRuntime(*Success*)
      -  Tested all fvt tests
    -  ENABLE_SELF_SIGNED_CA + ServingRuntime(*Success*)
      -  Only tested fvt/hpa because other fvt tests failed
    - Cert Manager + ClusterServingRuntime (*Success*)
      -  Tested all fvt tests
    - Cert Manager + ServingRuntime(*Success*)
        -  Only tested fvt/hpa because other fvt tests failed
    

FVT Test steps

1-1.ENABLE_SELF_SIGNED_CA + ClusterServingRuntime

kind create cluster 

kubectl create ns modelmesh-serving
kubectl config set-context --current --namespace modelmesh-serving
NAMESPACE=modelmesh-serving MODELMESH_SERVING_IMAGE=quay.io/jooholee/modelmesh-controller:hpa-webhook make deploy-release-dev-mode-fvt

make fvt

1-2.ENABLE_SELF_SIGNED_CA + ServingRuntime

kind create cluster 

kubectl create ns modelmesh-serving
kubectl config set-context --current --namespace modelmesh-serving
NAMESPACE=modelmesh-serving MODELMESH_SERVING_IMAGE=quay.io/jooholee/modelmesh-controller:hpa-webhook NAMESPACE_SCOPE_MODE=true make deploy-release-dev-mode-fvt

NAMESPACESCOPEMODE=true ginkgo -v -p -progress --fail-fast fvt/hpa --timeout=50m

2-1.Cert-Manager + ClusterServingRuntime

CERT_MANAGER_VERSION="v1.11.0"

kind create cluster 

# Install cert manager
echo "Installing cert manager ..."
kubectl create namespace cert-manager
sleep 2
kubectl apply -f https://github.com/cert-manager/cert-manager/releases/download/${CERT_MANAGER_VERSION}/cert-manager.yaml
echo "Waiting for cert manager started ..."
kubectl wait --for=condition=ready pod -l 'app in (cert-manager,webhook)' --timeout=180s -n cert-manager

# Deploy modelmesh-serving
kubectl create ns modelmesh-serving
kubectl config set-context --current --namespace modelmesh-serving
NAMESPACE=modelmesh-serving MODELMESH_SERVING_IMAGE=quay.io/jooholee/modelmesh-controller:hpa-webhook  make deploy-release-dev-mode-fvt

make fvt

2-2.Cert-Manager + ServingRuntime

CERT_MANAGER_VERSION="v1.11.0"

kind create cluster 

# Install cert manager
echo "Installing cert manager ..."
kubectl create namespace cert-manager
sleep 2
kubectl apply -f https://github.com/cert-manager/cert-manager/releases/download/${CERT_MANAGER_VERSION}/cert-manager.yaml
echo "Waiting for cert manager started ..."
kubectl wait --for=condition=ready pod -l 'app in (cert-manager,webhook)' --timeout=180s -n cert-manager

# Deploy modelmesh-serving
kubectl create ns modelmesh-serving
kubectl config set-context --current --namespace modelmesh-serving
NAMESPACE=modelmesh-serving MODELMESH_SERVING_IMAGE=quay.io/jooholee/modelmesh-controller:hpa-webhook NAMESPACE_SCOPE_MODE=true make deploy-release-dev-mode-fvt

NAMESPACESCOPEMODE=true ginkgo -v -p -progress --fail-fast fvt/hpa --timeout=50m

Resolves #329

@Jooho
Copy link
Contributor Author

Jooho commented Mar 8, 2023

@njhill This PR requires a certificate for the webhook. For testing purposes, install.sh includes an enable_self_signed_ca option to generate a self-signed certificate for the webhook, but it is disabled by default. However, the existing FVT CI does not have this option, so a failure was expected.
For your information, I added the FVT result with this PR in the description and it returns it succeeded.

The option enable_self_signed_ca can be enabled by default but I believe it's probably better to disable it.
What do you think?

@Jooho
Copy link
Contributor Author

Jooho commented Mar 9, 2023

@njhill I made --fvt option include enable-self-signed-ca option too. So I believe the original fvt ci would be passed.

@Jooho
Copy link
Contributor Author

Jooho commented Mar 9, 2023

FVT test failed because of an insufficient resources.

  0/1 nodes are available: 1 Insufficient cpu.

@Jooho Jooho force-pushed the kserve-hpa branch 2 times, most recently from 870fba0 to 5eff85a Compare March 9, 2023 14:06
@Jooho Jooho force-pushed the kserve-hpa branch 18 times, most recently from 0706eb8 to acca297 Compare March 10, 2023 03:31
tjohnson31415
tjohnson31415 previously approved these changes Apr 25, 2023
Copy link
Contributor

@tjohnson31415 tjohnson31415 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With the final set of tweaks in and the FVTs passing, this PR LGTM!

Thanks @Jooho for contributing this and promptly responding to the many rounds of review!

@tjohnson31415
Copy link
Contributor

/lgtm
/approve

@kserve-oss-bot
Copy link
Collaborator

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: Jooho, tjohnson31415

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Copy link
Member

@ckadner ckadner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see the merge is blocked due to branch protection rules. I will change that shortly. In the meantime I have one last question.

config/manager/manager.yaml Outdated Show resolved Hide resolved
@ckadner ckadner dismissed tjohnson31415’s stale review April 25, 2023 23:31

we should remove the namespace from the base manifests

Signed-off-by: jooho <jlee@redhat.com>
Signed-off-by: jooho <jlee@redhat.com>
Copy link
Member

@ckadner ckadner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm

@ckadner ckadner merged commit dd0229f into kserve:main Apr 27, 2023
kserve-oss-bot pushed a commit that referenced this pull request May 3, 2023
#### Motivation
I experienced a few errors when running the install script using the `--enable-self-signed-ca` flag from the recently merged #342. 

#### Modifications
- Use `sed -i.bak` to work around different versions of it, originally resulting in `sed: -i: No such file or directory` for me.
- Fix `--dry-run` error which requires a set value [(none, server, or client)](https://stackoverflow.com/questions/69531858/what-is-the-different-dry-run-opportunities)
- Fix typos

#### Result
Fixes errors associated with the `--enable-self-signed-ca` flag

Signed-off-by: Rafael Vasquez <raf.vasquez@ibm.com>
@ckadner ckadner mentioned this pull request May 4, 2023
5 tasks
Jooho added a commit to Jooho/modelmesh-serving that referenced this pull request May 5, 2023
@danielrubin1989
Copy link

Hi @ckadner ,
I tried to install the v0.11.0-rc1 in EKS v1.26 and got this errors regards the HPA from the controller:
{"level":"error","ts":"2023-07-19T14:04:52Z","msg":"Reconciler error","controller":"ServingRuntimeReconciler","controllerGroup":"serving.kserve.io","controllerKind":"ServingRuntime","Ser vingRuntime":{"name":"triton-2.x","namespace":"prod"},"namespace":"prod","name":"triton-2.x","reconcileID":"3a1af383-32c6-4717-814a-afa2962667c4","error":"HPA reconcile error: no matches for kind \"HorizontalPodAutoscaler\" in version \"autoscaling/v2beta2\"","stacktrace":"sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/root/go/ pkg/mod/sigs.k8s.io/controller-runtime@v0.14.6/pkg/internal/controller/controller.go:329\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/roo t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.14.6/pkg/internal/controller/controller.go:274\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/root /go/pkg/mod/sigs.k8s.io/controller-runtime@v0.14.6/pkg/internal/controller/controller.go:235"}

"autoscaling/v2beta2" was marked as deprecated in v1.23 and removed in Kubernetes 1.26.
better using "autoscaling/v2" , What do you think?

@rafvasq
Copy link
Member

rafvasq commented Jul 19, 2023

@danielrubin1989, I believe this update is related to what you're experiencing which should resolve it.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved enhancement New feature or request lgtm
Projects
None yet
9 participants