This repository has been archived by the owner on Jan 31, 2024. It is now read-only.

Add model-monitoring stack and update model-mesh manifests to latest #732

Conversation

VedantMahabaleshwarkar
Contributor

@VedantMahabaleshwarkar VedantMahabaleshwarkar commented Feb 15, 2023

Model-Mesh updates + add model monitoring stack

Description

  • Sync model-mesh manifests from opendatahub-io/model-serving/manifests to opendatahub-io/odh-manifests/model-mesh
  • Add model-monitoring manifests from downstream manifests to ODH manifests
  • Delete old modelmesh basic tests that are broken
  • Add the following tests to tests/basictests/modelmesh.sh
    • Resource creation
      • Verify etcd deployment succeeded
      • Verify etcd secrets were created successfully
      • Verify odh-prometheus-operator deployment was successful
      • Verify 3 prometheus pods came up successfully
      • Verify odh-model-monitoring route was created successfully
    • Setup model testing namespace
      • Deleted old sklearn model yaml files and deployment since they were broken
      • Added new model creation files from model-serving quickstart
    • Inference Testing
      • Verify curl call to model is successful
    • Model Monitoring
      • Verify accurate number of models deployed from openshift-monitoring (old test kept mostly as it was)
      • Verify the new model-monitoring prometheus stack works by running a prometheus query against the odh-model-monitoring route and verifying the result
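
The last check in the list above (query Prometheus through the odh-model-monitoring route and verify the result) could be sketched roughly as follows. This is not the code in tests/basictests/modelmesh.sh; it only illustrates the shape of the check using the standard Prometheus HTTP API instant-query endpoint (/api/v1/query). The route host and the PromQL expression are placeholders, and the HTTP request itself, along with any authentication the route may require, is left out.

```python
import json
from urllib.parse import urlencode


def build_query_url(route_host: str, promql: str) -> str:
    """Return the Prometheus instant-query URL for the given route host."""
    return f"https://{route_host}/api/v1/query?{urlencode({'query': promql})}"


def result_count(response_body: str) -> int:
    """Return the number of series in a successful instant-query response."""
    payload = json.loads(response_body)
    if payload.get("status") != "success":
        raise ValueError(f"query failed: {payload.get('error', 'unknown error')}")
    return len(payload["data"]["result"])


# Hypothetical route host and PromQL expression, for illustration only.
url = build_query_url(
    "odh-model-monitoring-opendatahub.apps.example.com",
    'up{job="example"}',
)

# A hand-written response in the Prometheus instant-query format.
sample = (
    '{"status": "success", "data": {"resultType": "vector", '
    '"result": [{"metric": {}, "value": [0, "1"]}]}}'
)

print(url)
print(result_count(sample))
```

A test would then assert that the count matches the number of series it expects for the chosen query.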

How Has This Been Tested?

  • Install a fresh cluster
  • Install ODH operator
  • Create the namespace where you want to install the ODH components, e.g. opendatahub
  • Use the following kfdef to install the odh-common components + model-mesh + model-monitoring
  • Edit the kfdef to replace the monitoring-namespace and deployment-namespace values with the namespace you are installing the ODH components in (if it is not opendatahub)
apiVersion: kfdef.apps.kubeflow.org/v1
kind: KfDef
metadata:
  name: odh-modelmesh
spec:
  applications:
    - kustomizeConfig:
        repoRef:
          name: manifests
          path: odh-common
      name: odh-common
    - kustomizeConfig:
        parameters:
        - name: monitoring-namespace
          value: opendatahub
        overlays:
          - odh-model-controller
        repoRef:
          name: manifests
          path: model-mesh
      name: model-mesh
    - kustomizeConfig:
        parameters:
        - name: deployment-namespace
          value: opendatahub
        repoRef:
          name: manifests
          path: modelmesh-monitoring
      name: modelmesh-monitoring
  repos:
    - name: manifests
      uri: https://api.github.com/repos/VedantMahabaleshwarkar/odh-manifests/tarball/modelmesh
  version: master 
  • Verify all deployments are successful with all containers coming up successfully
  • Run the modelmesh test suite as follows
    • modify tests/makefile
      • GIT_ORG=VedantMahabaleshwarkar
      • GIT_BRANCH=modelmesh
    • in odh-manifests/tests
      • make build
      • make run SKIP_INSTALL=true SKIP_KFDEF_INSTALL=true TESTS_REGEX=modelmesh (we already installed the operator and installed a kfdef earlier in the test instructions).
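
The namespace edit from the steps above (replacing the monitoring-namespace and deployment-namespace values in the kfdef) can be sketched as below. This is only an illustration, assuming the kfdef has already been loaded into a Python dict. The helper name set_namespace and the namespace "my-odh" are hypothetical, and the dict is a trimmed-down copy of the kfdef in this description.

```python
import copy

# Parameter names taken from the kfdef in this description.
NAMESPACE_PARAMS = {"monitoring-namespace", "deployment-namespace"}


def set_namespace(kfdef: dict, namespace: str) -> dict:
    """Return a copy of kfdef with both namespace parameters set to namespace."""
    updated = copy.deepcopy(kfdef)
    for app in updated["spec"]["applications"]:
        # odh-common has no parameters, so default to an empty list.
        for param in app["kustomizeConfig"].get("parameters", []):
            if param["name"] in NAMESPACE_PARAMS:
                param["value"] = namespace
    return updated


# Trimmed-down version of the kfdef, already loaded into a dict.
kfdef = {
    "spec": {
        "applications": [
            {
                "name": "odh-common",
                "kustomizeConfig": {
                    "repoRef": {"name": "manifests", "path": "odh-common"}
                },
            },
            {
                "name": "model-mesh",
                "kustomizeConfig": {
                    "parameters": [
                        {"name": "monitoring-namespace", "value": "opendatahub"}
                    ]
                },
            },
            {
                "name": "modelmesh-monitoring",
                "kustomizeConfig": {
                    "parameters": [
                        {"name": "deployment-namespace", "value": "opendatahub"}
                    ]
                },
            },
        ]
    }
}

patched = set_namespace(kfdef, "my-odh")  # "my-odh" is a placeholder namespace
```

The original dict is left untouched, so the same kfdef can be rendered for several namespaces.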

Note:

The overall test suite fails with the error "/root/peak/util: line 23: orig_project: unbound variable". Even so, verify that all modelmesh tests passed.

Merge criteria:

  • The commits are squashed in a cohesive manner and have meaningful messages.
  • Testing instructions have been added in the PR body (for PRs involving changes that are not immediately obvious).
  • The developer has manually tested the changes and verified that the changes work

@anishasthana
Member

/hold until opendatahub-io/modelmesh-serving#69 is merged in

@openshift-ci openshift-ci bot added the do-not-merge/hold Hold off on merging (provide reason in comment) label Feb 15, 2023
@LaVLaS
Contributor

LaVLaS commented Feb 15, 2023

/hold

Putting a hold on this until after we tag odh-manifests for the v1.4.2 release today

@LaVLaS
Contributor

LaVLaS commented Feb 16, 2023

@VedantMahabaleshwarkar Since this is adding new functionality, it's going to need some type of smoke test update to verify it's actually working

@VedantMahabaleshwarkar
Contributor Author

@VedantMahabaleshwarkar Since this is adding new functionality, it's going to need some type of smoke test update to verify it's actually working

@LaVLaS Added tests to the PR as well

@VannTen

VannTen commented Feb 20, 2023

* Sync model-mesh manifests from opendatahub-io/model-serving/manifests to opendatahub-io/odh-manifests/model-mesh

* Add model-monitoring manifests from downstream manifests to ODH manifests

* Delete old modelmesh basic tests that are broken

* Add following tests to `tests/basictests/modelmesh.sh`

Is there some circular dependency between those items?
It looks like this could be broken into separate PRs for easier review.


@VannTen VannTen left a comment


@LaVLaS

I took a quick look.
Some stuff looks a bit alarming to me, see below.

@LaVLaS LaVLaS added this to the ODH 1.5.0 Release milestone Feb 20, 2023
@VedantMahabaleshwarkar
Contributor Author

* Sync model-mesh manifests from opendatahub-io/model-serving/manifests to opendatahub-io/odh-manifests/model-mesh

* Add model-monitoring manifests from downstream manifests to ODH manifests

* Delete old modelmesh basic tests that are broken

* Add following tests to `tests/basictests/modelmesh.sh`

Is there some circular dependency between those items? It looks like this could be broken into separate PRs for easier review.

@VannTen I wouldn't call it a circular dependency. Ideally the changes to tests/basictests/modelmesh.sh come after the new manifests are merged. But since we are adding new functionality, @LaVLaS wanted to include tests along with this so we can verify the PR works as intended.

@VannTen

VannTen commented Feb 21, 2023 via email

@VedantMahabaleshwarkar
Contributor Author

@VannTen I wouldn't call it a circular dependency. Ideally the changes to tests/basictests/modelmesh.sh come after the new manifests are merged. But since we are adding new functionality, @LaVLaS wanted to include tests along with this so we can verify the PR works as intended.
The sync part (the first bullet) could at least be split out, don't you think? I have no objection to adding the tests as part of the PR, but could you at least use separate commits? As is, it's very hard to read and review.

@VannTen I separated the test suite changes out into another commit

@VannTen

VannTen commented Feb 23, 2023

/lgtm
(as far as I can tell ^)

@openshift-ci openshift-ci bot added the lgtm label Feb 23, 2023
Member

@anishasthana anishasthana left a comment


/lgtm
/unhold

@openshift-ci openshift-ci bot removed the do-not-merge/hold Hold off on merging (provide reason in comment) label Feb 23, 2023
@anishasthana
Member

@VannTen can you dismiss your old review or create a new one?

@LaVLaS
Contributor

LaVLaS commented Feb 23, 2023

/retest

Review thread on tests/setup/odh-core.yaml (outdated, resolved)
@anishasthana
Member

/retest

@VedantMahabaleshwarkar
Contributor Author

/retest

@openshift-ci openshift-ci bot added the lgtm label Mar 7, 2023
@openshift-ci

openshift-ci bot commented Mar 7, 2023

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: anishasthana, LaVLaS

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:
  • OWNERS [LaVLaS,anishasthana]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@LaVLaS
Contributor

LaVLaS commented Mar 7, 2023

/lgtm

@openshift-merge-robot openshift-merge-robot merged commit 1302606 into opendatahub-io:master Mar 7, 2023
@shalberd
Contributor

shalberd commented Mar 20, 2023


With OpenShift Monitoring for User-Defined Projects, the federation of cluster-level metrics (pod container restarts, OOM, and the other metrics exported by kube-state-metrics) happens automatically at the namespace level. The only metrics that are not accessible are the node-level (node exporter) ones. This means I get metrics such as kube_pod_container_restarts without an explicit federation ServiceMonitor.

Could you make the federation ServiceMonitor, Prometheus, and the Prometheus operator optional via an overlay? That way, all the metrics gathering would still be in place, while users who have monitoring for user-defined projects enabled could skip the Prometheus part and the cluster-metrics federation part.

The monitoring of the metrics from odh-model-controller ServiceMonitor works with Monitoring for User Defined Projects, too, by the way.

That is, the section with the custom monitoring implementation for model mesh could be removed from odh-core, as it is achieved with Monitoring for User Defined Projects.

@LaVLaS
Contributor

LaVLaS commented Mar 20, 2023

Please move this discussion to an open issue and not a PR that has already merged

@shalberd
Contributor
