Model Registry proposal (ref KF community meeting 20240102) #682

Open: tarilabs wants to merge 3 commits into base: master

tarilabs (Member) commented Jan 3, 2024

Following feedback received during the KF community meeting held on 20240102,
I'm raising the Model Registry proposal Google Doc previously shared with the community (link)
as Markdown, in the form of a pull request (this PR).

See also


[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: tarilabs
Once this PR has been reviewed and has the lgtm label, please assign james-jwu for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment


andreyvelich (Member) left a comment

Thank you for creating this @tarilabs!
cc WG for the review on this proposal.

@kubeflow/wg-pipeline-leads
@kubeflow/wg-training-leads
@kubeflow/wg-notebooks-leads
@kubeflow/wg-manifests-leads
@kubeflow/wg-deployment-leads
@yuzisun @jbottum

rchincha commented Mar 19, 2024

Hi All, not sure if folks here are aware of the work going on over at Open Containers Initiative (OCI) wrt standardizing registry interfaces and general-purpose "artifacts".

https://github.com/opencontainers/distribution-spec

OCI is a sibling org under Linux Foundation and its specs closely interoperate with Kubernetes.

https://opencontainers.org/

Just bringing this to your attention that it is now possible to colocate arbitrary artifacts (content-addressable) along with relationships and provenance in an OCI dist-spec v1.1.0 conformant image registry, and not just container images.

  • workloads - container images
  • data (model data in this case) - "artifacts"
  • data lineage - "referrers", captures relationship between artifacts
  • authN/provenance - signatures etc, each artifact can be signed independently
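
To make "content-addressable" concrete, here is a minimal sketch in Python of how an OCI content descriptor identifies a blob by the SHA-256 digest of its bytes. The media type `application/vnd.example.model.v1` and the payload are made up for illustration and do not come from this thread.

```python
import hashlib
import json


def descriptor(content: bytes, media_type: str) -> dict:
    """Build an OCI-style content descriptor: media type, digest, and size."""
    return {
        "mediaType": media_type,
        "digest": "sha256:" + hashlib.sha256(content).hexdigest(),
        "size": len(content),
    }


# Hypothetical payload and media type, for illustration only.
model_bytes = b"pretend these are serialized model weights"
desc = descriptor(model_bytes, "application/vnd.example.model.v1")
print(json.dumps(desc, indent=2))
```

Because the digest is derived from the bytes themselves, two registries holding the same blob agree on its address, and a signature over the digest covers the content.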

The main motivation in this forum would be:

  1. Image registries are already a required part of the Kubernetes ecosystem and app lifecycle
  2. Many problems are likely already solved
  3. So if possible, why not re-use instead of implementing and maintaining something entirely new
  4. Excellent community tooling already available

A registry you can quickly spin up and play with: https://zotregistry.dev/
Azure, AWS and others should hopefully announce support soon.

Thoughts?

Full disclosure: I am an OCI TOB member and zot author.

rchincha commented Mar 19, 2024

Talk is cheap, here is an example (using only a subset of OCI dist-spec v1.1.0 features)
project-zot/zot#2332

Also, an upcoming blog post clarifying most things.
https://github.com/opencontainers/opencontainers.org/blob/395bc5f98777a72082bfe300a167b563af234ef0/content/posts/blog/2024-03-13-image-and-distribution-1-1.md

rareddy commented Mar 20, 2024

@rchincha thank you for reaching out

see this about oci reference in the proposal https://github.com/kubeflow/community/pull/682/files#diff-aaf54745ecb36016135c83a5a41a03025574ecb492aec56ef6d2c7c902abfe17R180

Can I recommend you open another PR in the model registry and can we collaborate on a proposal on how you see the integration working? We have verified that we can store and retrieve models without any issues, but we have not explored how, or if, we should spread the metadata and query it back (whether that is needed at all is another question, as it can be stored in the DB), nor how we can influence consumption of the model for inference directly from an OCI repo, as is done in projects like KServe.

rchincha commented

> Can I recommend you open another PR in the model registry and can we collaborate on a proposal

Would love to. Also folks over at CNCF artifacts, ORAS and OCI would certainly be interested.

Initial grok'ing of kserve project indicates that there could be a couple of ways to do this:

  1. an "initContainer" approach that pulls the required artifacts and lays them out so they can be consumed

  2. A CSI approach like so:
    https://github.com/converged-computing/oras-csi
    https://kserve.github.io/website/0.8/modelserving/storage/pvc/pvc/#create-pv-and-pvc
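
A rough sketch of the first ("initContainer") option, written as a Python dict that mirrors a Pod manifest. Everything named here is a placeholder: the artifact reference, both image names, and the mount path are hypothetical, and the init image is assumed to have the `oras` CLI as its entrypoint. The idea is only that the init container pulls the artifact into a shared `emptyDir` volume, which the serving container then mounts read-only.

```python
import json

# All names below (images, artifact reference, paths) are hypothetical placeholders.
MODEL_REF = "registry.example.com/models/mnist:v1"
MODEL_DIR = "/mnt/models"

pod = {
    "apiVersion": "v1",
    "kind": "Pod",
    "metadata": {"name": "model-server"},
    "spec": {
        # Scratch volume shared between the init container and the server.
        "volumes": [{"name": "model-store", "emptyDir": {}}],
        "initContainers": [
            {
                "name": "fetch-model",
                # Assumed: an image whose entrypoint is the oras CLI.
                "image": "registry.example.com/oras-cli:latest",
                "args": ["pull", MODEL_REF, "-o", MODEL_DIR],
                "volumeMounts": [{"name": "model-store", "mountPath": MODEL_DIR}],
            }
        ],
        "containers": [
            {
                "name": "server",
                "image": "registry.example.com/serving-runtime:latest",
                "volumeMounts": [
                    {"name": "model-store", "mountPath": MODEL_DIR, "readOnly": True}
                ],
            }
        ],
    },
}

print(json.dumps(pod, indent=2))
```

The CSI option would replace the init container and `emptyDir` with a volume provided by the driver, so the Pod spec would not need to know how the artifact is fetched.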

rchincha added a commit to rchincha/model-registry that referenced this pull request Mar 20, 2024
Partially addresses kubeflow/community#682

Signed-off-by: Ramkumar Chinchani <rchincha@cisco.com>

rchincha commented

kubeflow/model-registry#48
^ fyi, thanks.

rchincha added a commit to rchincha/model-registry that referenced this pull request Mar 21, 2024
Partially addresses kubeflow/community#682

Signed-off-by: Ramkumar Chinchani <rchincha@cisco.com>

rareddy commented Mar 21, 2024

> Also folks over at CNCF artifacts, ORAS and OCI would certainly be interested.

@rchincha we are collaborating with the ORAS maintainers and the model-car initiative inventors; let's see if we can bring their attention to this effort for storage. We have already put some work into KServe Storage Containers, which will be another way of providing models for inferencing.

A couple of requests for the proposal:

  • we need to be able to support multiple storage backends, as S3 is currently the predominantly preferred method in the AI communities.
  • the OCI plugin must be based on the OCI-Dist level, so that users can have their choice of Zot, Harbor, Quay, etc.
  • it must be deployable in Kube, as a lot of users want to be able to deploy all infra on their own cloud and not necessarily always connect to an external SaaS offering.

I did look at the ArtifactHub project in CNCF a couple of months ago, which looked very interesting in terms of how they use OCI and metadata scraping, but I did not draw any conclusions about whether or how that could be folded into the mix to bridge the metadata portion. That could be very interesting IMO. Is this the CNCF project you mentioned above?

rchincha commented Mar 21, 2024

> > Also folks over at CNCF artifacts, ORAS and OCI would certainly be interested.
>
> @rchincha we are collaborating with the ORAS maintainers and the model-car initiative inventors; let's see if we can bring their attention to this effort for storage. We have already put some work into KServe Storage Containers, which will be another way of providing models for inferencing.

wrt kserve, maybe this as a contract? kserve/kserve#3539

> A couple of requests for the proposal:
>
> * we need to be able to support multiple storage backends, as S3 is currently the predominantly preferred method in the AI communities.

This is best left to the registry implementations, which may or may not choose to support an S3 backend (speaking only for zot, it does support S3), but the proposal should make it clear that this is an additional requirement for compatibility with Kubeflow.

> * the OCI plugin must be based on the OCI-Dist level, so that users can have their choice of Zot, Harbor, Quay, etc.

The OCI plugin must be registry-agnostic of course and this calls out the role that OCI dist-spec v1.1.0 plays as a contract.

> * it must be deployable in Kube, as a lot of users want to be able to deploy all infra on their own cloud and not necessarily always connect to an external SaaS offering.

Another additional requirement, and comes with the territory.

> I did look at the ArtifactHub project in CNCF a couple of months ago, which looked very interesting in terms of how they use OCI and metadata scraping, but I did not draw any conclusions about whether or how that could be folded into the mix to bridge the metadata portion. That could be very interesting IMO. Is this the CNCF project you mentioned above?

As I understand it, ArtifactHub predates OCI dist-spec v1.1.0 but there may be interest to standardize on this dist-spec.

> metadata scraping

OCI dist-spec v1.1.0 has explicit provisions for this. But can you kindly point to some concrete examples?

Will update kubeflow/model-registry#48

tarilabs (Member, Author) commented

> > metadata scraping
>
> OCI dist-spec v1.1.0 has explicit provisions for this. But can you kindly point to some concrete examples?

Personally very curious for examples on this topic! :) That is very interesting in the context of potentially indexing and querying a Manifest of metadata (a "model registry" use case) by means of an OCI Artifact.

rchincha added a commit to rchincha/kserve that referenced this pull request Mar 21, 2024
Partially addresses kubeflow/community#682

Signed-off-by: Ramkumar Chinchani <rchincha@cisco.com>

rchincha commented Mar 21, 2024

> > > metadata scraping
> >
> > OCI dist-spec v1.1.0 has explicit provisions for this. But can you kindly point to some concrete examples?
>
> Personally very curious for examples on this topic! :) That is very interesting in the context of potentially indexing and querying a Manifest of metadata (a "model registry" use case) by means of an OCI Artifact.

https://github.com/opencontainers/opencontainers.org/blob/395bc5f98777a72082bfe300a167b563af234ef0/content/posts/blog/2024-03-13-image-and-distribution-1-1.md#describing-associations

^ this is how the OCI community has addressed this. Note that the original use case was container images and associated metadata such as SBOMs etc.

So in this case ...

  1. upload model data (of a particular media-type)
  2. upload model metadata (of a particular media-type and subject:=1. above)
  3. download 1.
  4. download "artifacts referring to 1." and optionally "of a particular media-type"

tarilabs (Member, Author) commented Mar 21, 2024

Thanks @rchincha , is there a way to avoid having to download the associated metadata, only to query for it locally, and do that "on the OCI registry" server end?

Example
Here I have 3 different ML models stored as OCI artifacts: https://quay.io/repository/mmortari/mnist?tab=tags

I know some metadata for each of those. I'm looking for a solution, if possible, that doesn't require me to download the associated metadata Manifest of each artifact locally in order to query that metadata.
As a concrete example, if each of the models defines accuracy=0.987 or the like, I want to query which ML artifacts in the mmortari/mnist repo above have max(accuracy).

Hope the example conveys the question I'm curious about.
Edit: that is why @rareddy was referring to an analogue of ArtifactHub, as from a capability and usage point of view it seems a very similar use case, in a way.

rchincha commented Mar 22, 2024

@tarilabs

> As a concrete example, if each of the models defines accuracy=0.987 or the like, I want to query which ML artifacts in the mmortari/mnist repo above have max(accuracy).

In the OCI dist-spec world, one way would be to list all tags in a repository, get their manifests and compare annotations (== accuracy=0.987) - no need to download actual data.
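
A minimal sketch of that approach in Python: list the tags, read each manifest's annotations, and keep the best value. The `accuracy` annotation key, the repository name, and the in-memory `fake_registry` are assumptions made for illustration. Only the two endpoint paths (`/v2/<name>/tags/list` and `/v2/<name>/manifests/<reference>`) come from the distribution spec. `get_json` is a caller-supplied function, so a real HTTP client could be plugged in.

```python
from typing import Callable, Optional, Tuple


def tags_path(repo: str) -> str:
    return f"/v2/{repo}/tags/list"


def manifest_path(repo: str, reference: str) -> str:
    return f"/v2/{repo}/manifests/{reference}"


def best_tag(
    repo: str,
    get_json: Callable[[str], dict],
    key: str = "accuracy",  # hypothetical annotation key
) -> Optional[Tuple[str, float]]:
    """Return (tag, value) for the tag whose manifest annotation `key` is largest."""
    best = None
    for tag in get_json(tags_path(repo)).get("tags") or []:
        annotations = get_json(manifest_path(repo, tag)).get("annotations") or {}
        try:
            value = float(annotations[key])  # annotation values are strings
        except (KeyError, ValueError):
            continue  # skip manifests without a usable annotation
        if best is None or value > best[1]:
            best = (tag, value)
    return best


# In-memory stand-in for a registry, for illustration only.
fake_registry = {
    "/v2/example/mnist/tags/list": {"name": "example/mnist", "tags": ["v1", "v2", "v3"]},
    "/v2/example/mnist/manifests/v1": {"annotations": {"accuracy": "0.950"}},
    "/v2/example/mnist/manifests/v2": {"annotations": {"accuracy": "0.987"}},
    "/v2/example/mnist/manifests/v3": {},
}

print(best_tag("example/mnist", fake_registry.__getitem__))  # ('v2', 0.987)
```

Note that this still runs the comparison on the client: it costs one request per tag, but only manifests are transferred, never the model blobs.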

I was more concerned about the following:
https://github.com/MarquezProject/marquez
https://github.com/google/ml-metadata

tarilabs (Member, Author) commented

Thanks @rchincha, reassuring to hear it doesn't need to download the actual data. I will be looking for a chance to understand from you in more detail how OCI dist-spec works for this use case in practice.

We have Model Registry biweekly meetings: https://www.kubeflow.org/docs/about/community/#kubeflow-community-calendars

Do you think you'll be able to join one, so we could discuss it live in more detail?
Thanks!

rchincha commented

https://kccnceu2024.sched.com/event/1YeLi
^ This idea is spreading around I suppose ... @kubecon EU 2024

Your next meeting is Apr 1. Will try to make that.

rchincha added a commit to rchincha/kserve that referenced this pull request Mar 22, 2024
Partially addresses kubeflow/community#682

Signed-off-by: Ramkumar Chinchani <rchincha@cisco.com>
rchincha added a commit to rchincha/kserve that referenced this pull request May 1, 2024
Partially addresses kubeflow/community#682

OCI image and distribution specs v1.1.0 have added support for pushing
and pulling arbitrary artifacts to a conformant registry, and not just
container images.

Since a registry is already needed to deploy inference workloads as
containers, and since it would be desirable to avoid another piece of
infrastructure just to store inference data, an OCI conformant registry
could become the ideal store to combine and colocate both use cases.

This plugin adds that support.

References:

https://opencontainers.org/posts/blog/2024-03-13-image-and-distribution-1-1/

Signed-off-by: Ramkumar Chinchani <rchincha@cisco.com>
rchincha added a commit to rchincha/kserve that referenced this pull request May 24, 2024
Partially addresses kubeflow/community#682

OCI image and distribution specs v1.1.0 have added support for pushing
and pulling arbitrary artifacts to a conformant registry, and not just
container images.

Since a registry is already needed to deploy inference workloads as
containers, and since it would be desirable to avoid another piece of
infrastructure just to store inference data, an OCI conformant registry
could become the ideal store to combine and colocate both use cases.

This plugin adds that support.

Uses the oras-go library.

References:

https://opencontainers.org/posts/blog/2024-03-13-image-and-distribution-1-1/

Signed-off-by: Ramkumar Chinchani <rchincha@cisco.com>

rchincha commented

kubernetes/enhancements#4642
some overlapping work/groups ...

tarilabs (Member, Author) commented

> kubernetes/enhancements#4642 some overlapping work/groups ...

iiuc this would allow "materializing" OCI artifacts as a mounted volume in a container, effectively making the "files" inside an OCI artifact available for inference, say in a running container of a model server.
is this a fair summary?

rchincha commented

> > kubernetes/enhancements#4642 some overlapping work/groups ...
>
> iiuc this would allow "materializing" OCI artifacts as a mounted volume in a container, effectively making the "files" inside an OCI artifact available for inference, say in a running container of a model server. is this a fair summary?

Still a preliminary KEP, but would seem so.

rhuss commented May 28, 2024

> > kubernetes/enhancements#4642 some overlapping work/groups ...
>
> iiuc this would allow "materializing" OCI artifacts as a mounted volume in a container, effectively making the "files" inside an OCI artifact available for inference, say in a running container of a model server. is this a fair summary?

For reference, in KServe a workaround for directly accessing files within an OCI image is implemented and available via a sidecar approach ("modelcar") that leverages root filesystem access via the /proc filesystem when shareProcessNamespace: true is set on the Pod. You can find details in the KServe documentation and in the Design Document. It actually implements the desired behavior with current means, but of course it is more or less just a workaround for the missing OCI volume type (as discussed already a long time ago in kubernetes/kubernetes#831)
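
A small Python sketch of the underlying mechanism, not of KServe's actual implementation. On Linux, `/proc/<pid>/root` exposes the root filesystem of process `<pid>`, and with a shared process namespace a container can see the processes of its sibling containers in the Pod. The PID `42` and the `/models` directory are hypothetical, and how the sidecar's PID is discovered is left out.

```python
from pathlib import PurePosixPath

# Pod spec fragment: containers in the Pod share one process namespace.
pod_spec_fragment = {"shareProcessNamespace": True}


def path_in_sidecar(pid: int, path: str) -> PurePosixPath:
    """Path to `path` inside the root filesystem of process `pid`, as seen via /proc."""
    return PurePosixPath("/proc", str(pid), "root", path.lstrip("/"))


# Hypothetical sidecar PID and model directory.
print(path_in_sidecar(42, "/models"))  # /proc/42/root/models
```

A serving container could read the model files from such a path, or symlink its expected model directory to it, without the files ever being copied out of the sidecar's image.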

tarilabs (Member, Author) commented

> For reference, in KServe a workaround for directly accessing files within an OCI image is implemented and available via a sidecar approach ("modelcar") that leverages root filesystem access via the /proc filesystem when shareProcessNamespace: true is set on the Pod. You can find details in the KServe documentation and in the Design Document. It actually implements the desired behavior with current means, but of course it is more or less just a workaround for the missing OCI volume type (as discussed already a long time ago in kubernetes/kubernetes#831)

thank you @rhuss, to me it is about providing user choice; given an OCI Artifact with an ML model asset, one:

  • could build "around" a runnable container image to serve it, say a linux + serving runtime + the ML model asset from the OCI artifact (this is possible today)
  • could build "around" a ModelCar, to serve it on KServe (this is possible today thanks to your contrib in KServe)
  • could eventually just mount it in a serving runtime running on k8s leveraging KEP-4639 (in the future)

wdyt?

rchincha commented

tarilabs (Member, Author) commented

Thank you @rchincha , we indeed noted that blog post as well :)

Fyi, we have it in our live roadmap as a proposal for integration as a preferred storage solution for the ML model, to complement the current Model Registry.

Orthogonal research work in this area is captured here.

5 participants