# v0.14 release blog #431

**Open** · wants to merge 2 commits into `main`
149 changes: 149 additions & 0 deletions docs/blog/articles/2024-12-13-KServe-0.14-release.md
@@ -0,0 +1,149 @@
# Announcing: KServe v0.14

We are excited to announce KServe v0.14. In this release, we introduce a new Python inference client and a new model cache feature, promote OCI storage for models to a stable feature, and add support for deploying models directly from Hugging Face.

Below is a summary of the key changes.

## Introducing Inference client for Python

The KServe Python SDK now includes both [REST](https://github.com/kserve/kserve/blob/v0.14.0/python/kserve/kserve/inference_client.py#L388) and [gRPC](https://github.com/kserve/kserve/blob/v0.14.0/python/kserve/kserve/inference_client.py#L61) inference clients. The new inference clients are released as **alpha** features.

In line with the features documented in issue [#3270](https://github.com/kserve/kserve/issues/3270), both clients share the following characteristics:

* They are asynchronous
* They support HTTP/2 (via the [httpx](https://www.python-httpx.org/) library)
* They support the Open Inference Protocol v1 and v2

As usual, version 0.14.0 of the KServe Python SDK is [published to PyPI](https://pypi.org/project/kserve/0.14.0/) and can be installed via `pip install kserve`.
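
As a rough illustration of how the asynchronous REST client can be used, here is a minimal sketch; the class, constructor, and `infer()` parameter names are assumptions that should be verified against the `inference_client.py` module linked above, and the endpoint URL and model name are placeholders:

```python
import asyncio

from kserve import InferInput, InferRequest
from kserve.inference_client import InferenceRESTClient, RESTConfig


async def main():
    # Assumed configuration knob selecting the Open Inference Protocol v2.
    client = InferenceRESTClient(config=RESTConfig(protocol="v2"))

    # Build a single FP32 tensor input for an illustrative sklearn iris model.
    infer_input = InferInput(
        name="input-0", shape=[1, 4], datatype="FP32",
        data=[6.8, 2.8, 4.8, 1.4],
    )
    request = InferRequest(model_name="sklearn-iris", infer_inputs=[infer_input])

    # Placeholder base URL; point this at your InferenceService endpoint.
    response = await client.infer(
        "http://sklearn-iris.example.com", request, model_name="sklearn-iris"
    )
    print(response)


asyncio.run(main())
```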

<!--
Related tickets:
* Initial implementation [#3270](https://github.com/kserve/kserve/issues/3270)
* FP16 support [#3643](https://github.com/kserve/kserve/issues/3643)
-->

## Support for OCI storage for models (modelcars) becomes stable

In KServe version 0.12, support for using OCI containers for model storage was introduced as an experimental feature. It allows users to package models as OCI images and to publish them through OCI-compatible registries.

The feature is implemented by running the OCI model container as a sidecar in the InferenceService pod, which motivated the name modelcars. The model files are made available to the model server by configuring [process namespace sharing](https://kubernetes.io/docs/tasks/configure-pod-container/share-process-namespace/) in the pod.

One small but important detail remained unsolved and motivated the experimental status: since the modelcar runs as one of the main containers of the pod, there was no guarantee that it would start quickly. The model server could become unstable if it started before the modelcar, and because the model image was not prefetched, this was considered a likely scenario.

This instability has been mitigated by configuring the OCI model as an init container in addition to the sidecar. Running it as an init container ensures that the model image is fetched before the main containers start, so the modelcar sidecar can start quickly.
With this change, modelcars are a stable feature as of KServe 0.14.
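
For reference, an InferenceService consuming a model packaged as an OCI image keeps the usual spec and simply points `storageUri` at the image; a minimal sketch, where the image reference and model format are illustrative:

```yaml
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: sklearn-from-oci
spec:
  predictor:
    model:
      modelFormat:
        name: sklearn
      # Illustrative reference; any OCI-compatible registry can host the image.
      storageUri: oci://registry.example.com/models/sklearn-iris:v1
```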

### Future plan

Modelcars is one implementation option for supporting OCI images for model storage. Other alternatives are discussed in [issue #4083](https://github.com/kserve/kserve/issues/4083).

Using volume mounts based on OCI artifacts would be the optimal implementation, but this has only recently become possible as a [native alpha feature in Kubernetes 1.31](https://kubernetes.io/blog/2024/08/16/kubernetes-1-31-image-volume-source/). KServe can now evolve to use this new Kubernetes feature.
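
For context, the Kubernetes image volume source lets a pod mount an OCI artifact directly as a read-only volume. The sketch below shows the upstream Kubernetes API on its own (it requires the `ImageVolume` feature gate in Kubernetes 1.31; the image references are illustrative, and this is not how KServe currently wires it up):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: image-volume-example
spec:
  containers:
    - name: model-server
      image: registry.example.com/serving/runtime:latest  # illustrative runtime image
      volumeMounts:
        - name: model
          mountPath: /mnt/models
          readOnly: true
  volumes:
    - name: model
      image:
        reference: registry.example.com/models/sklearn-iris:v1  # illustrative model image
        pullPolicy: IfNotPresent
```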

## Introducing model cache

With models increasing in size, especially LLMs, pulling a model from storage each time a pod is created can result in unmanageable start-up times. Although OCI storage also benefits from image caching, that caching is not very flexible, since its management is delegated to the cluster.

The model cache was proposed as another alternative to enhance KServe usability with large models, and is released in KServe v0.14 as an **alpha** feature. It relies on a PersistentVolume for storing models and gives you control over which models to keep in the cache. The feature was designed mainly to use the node filesystem as storage. Read the [design document](https://docs.google.com/document/d/1nao8Ws3tonO2zNAzdmXTYa0hECZNoP2SV_z9Zg0PzLA/edit) for the details.

The model cache is currently disabled by default. To enable it, set the `localmodel.enabled` field in the `inferenceservice-config` ConfigMap.
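
As a hedged sketch of what enabling it can look like, assuming the setting lives under a `localModel`-style data key of the ConfigMap (the exact key name, JSON layout, and namespace should be verified against your installation):

```yaml
# Sketch only: edit with `kubectl edit configmap inferenceservice-config -n kserve`.
# The data key name and JSON fields below are assumptions.
data:
  localModel: |
    {
      "enabled": true
    }
```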

You start by creating a node group as follows:

```yaml
apiVersion: serving.kserve.io/v1alpha1
kind: LocalModelNodeGroup
metadata:
  name: nodegroup1
spec:
  persistentVolumeSpec:
    accessModes:
      - ReadWriteOnce
    volumeMode: Filesystem
    capacity:
      storage: 2Gi
    hostPath:
      path: /models
      type: ""
    persistentVolumeReclaimPolicy: Delete
    storageClassName: standard
  persistentVolumeClaimSpec:
    accessModes:
      - ReadWriteOnce
    resources:
      requests:
        storage: 2Gi
    storageClassName: standard
    volumeMode: Filesystem
    volumeName: kserve

```

Then, you can specify a model to store and cache with the following resource:

```yaml
apiVersion: serving.kserve.io/v1alpha1
kind: ClusterLocalModel
metadata:
  name: iris
spec:
  modelSize: 1Gi
  nodeGroup: nodegroup1
  sourceModelUri: gs://kfserving-examples/models/sklearn/1.0/model
```
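
Note that the InferenceService itself does not change when the cache is in use: you keep the same `storageUri`, and the model is mounted from the cache instead of being downloaded. A minimal sketch, where the service name and model format are illustrative:

```yaml
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: sklearn-iris
spec:
  predictor:
    model:
      modelFormat:
        name: sklearn
      # Same URI as the ClusterLocalModel above; the model is served from the cache.
      storageUri: gs://kfserving-examples/models/sklearn/1.0/model
```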

> **Review comment:** maybe add the isvc with the example explaining how to use this

> **Contributor Author:** AFAIK, the InferenceService doesn't change and you use it normally (i.e., you would still use `gs://kfserving-examples/models/sklearn/1.0/model` for `storageUri`). The difference you would notice is that the model will be fetched/mounted from the cache instead of being downloaded.

> **Contributor Author:** I think I can add a brief note about what I just wrote.


<!--
Related tickets:
* Cluster local model controller: [#3860](https://github.com/kserve/kserve/pull/3860)
* Cluster Local Model CR [#3839](https://github.com/kserve/kserve/pull/3839)
* Add NodeDownloadPending status to ClusterLocalModel [#3955](https://github.com/kserve/kserve/pull/3955)
-->

## Support for Hugging Face hub in storage initializer

The KServe storage initializer has been enhanced to support downloading models directly from Hugging Face. To use it, the new `hf://` scheme is now supported in the `storageUri` field of InferenceServices, as the following partial YAML shows:

```yaml
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: huggingface-llama3
spec:
  predictor:
    model:
      storageUri: hf://meta-llama/meta-llama-3-8b-instruct
```

Both public and private Hugging Face repositories are supported. The credentials can be provided by the usual mechanism of binding Secrets to ServiceAccounts, or by binding the credentials Secret as environment variables in the InferenceService.
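
For a private repository, one possible way to pass the credentials is to expose them from a Secret as an environment variable on the model; a sketch, where the Secret name and repository are illustrative and `HF_TOKEN` is the variable conventionally read by the Hugging Face libraries:

```yaml
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: huggingface-private
spec:
  predictor:
    model:
      storageUri: hf://my-org/my-private-model  # illustrative private repository
      env:
        - name: HF_TOKEN                        # conventional Hugging Face token variable
          valueFrom:
            secretKeyRef:
              name: hf-secret                   # illustrative Secret name
              key: HF_TOKEN
```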

Read the [documentation](../../../modelserving/storage/huggingface/hf/) for more details.

<!--
Related tickets:
* Implement Huggingface model download in storage initializer [#3584](https://github.com/kserve/kserve/pull/3584)
-->

## Other Changes

This release also includes several enhancements and changes:

### What's New?

> **Member:** @sivanantha321 @andyi2it also good to add the binary extension support and response header support? #419

> **Contributor Author:** I somehow thought that the binary extension was an enhancement of the inference client. So, to better understand: should I add it under the inference client heading, or is it good here as a bullet under What's New?

> **Member:** The binary extension is not part of the inference client effort; it implements the binary extension of the Open Inference Protocol, along with FP16 support.

* New flag for automounting the service account token [#3979](https://github.com/kserve/kserve/pull/3979)
* TLS support for inference loggers [#3837](https://github.com/kserve/kserve/issues/3837)
* Allow PVC storage to be mounted in ReadWrite mode via an annotation [#3687](https://github.com/kserve/kserve/issues/3687)

### What's Changed?
* Added `hostIPC` field to the `ServingRuntime` CRD, to support more than one GPU in Serverless mode [#3791](https://github.com/kserve/kserve/issues/3791)

> **Member:** I think it is good to add a section for LLM runtime support to include the changes that are part of the 0.14 release:
>
> - vLLM 0.6.x support
> - add health endpoint for vLLM backend
> - support shared memory volume for vLLM backend
> - support chat completion template file
> - support trust_remote_code for vLLM and HF backend

> **Contributor Author:** Do you mean a dedicated section with LLM-related enhancements, or should those be listed here under What's Changed?

> **Member:** Yes, it is worth calling it out separately. cc @sivanantha321 to check whether the list of changes is correct.

> **Member:** Ray is now an optional dependency, and the way it is implemented has changed. It is worth mentioning this as a breaking change. kserve/kserve#3834

* Support for Python 3.12 is added, while support for Python 3.8 is removed [#3645](https://github.com/kserve/kserve/pull/3645)

For complete details on the new features and updates, visit our [official release notes](https://github.com/kserve/kserve/releases/tag/v0.14.0).

## Join the community

- Visit our [Website](https://kserve.github.io/website/) or [GitHub](https://github.com/kserve)
- Join the Slack ([#kserve](https://github.com/kserve/community?tab=readme-ov-file#questions-and-issues))
- Attend our community meeting by subscribing to the [KServe calendar](https://zoom-lfx.platform.linuxfoundation.org/meetings/kserve?view=month).
- View our [community GitHub repository](https://github.com/kserve/community) to learn how to contribute. We are excited to work with you to make KServe better and promote its adoption!

Thanks to all the contributors who made commits for the 0.14 release!

The KServe Project
5 changes: 1 addition & 4 deletions docs/modelserving/storage/oci.md
@@ -13,13 +13,10 @@ Modelcars represents a step forward in efficient model serving, particularly ben

## Enabling Modelcars

-Modelcars is an experimental feature in KServe and is not enabled by default.
+The Modelcars feature in KServe is not enabled by default.
To take advantage of this new model serving method, it needs to be activated in the KServe configuration.
Follow the steps below to enable Modelcars in your environment.

-!!! note
-    Modelcars are currently in an experimental phase. Enable this feature in a test environment first to ensure it meets your requirements before using it in a production setting.

Modelcars can be enabled by modifying the `storageInitializer` configuration in the `inferenceservice-config` ConfigMap.
This can be done manually using `kubectl edit` or by executing the script provided below, with the current namespace set to the namespace where the `kserve-controller-manager` is installed (this depends on how KServe is installed).

1 change: 1 addition & 0 deletions mkdocs.yml
@@ -115,6 +115,7 @@ nav:
    - Debugging guide: developer/debug.md
  - Blog:
    - Releases:
      - KServe 0.14 Release: blog/articles/2024-12-13-KServe-0.14-release.md
      - KServe 0.13 Release: blog/articles/2024-05-15-KServe-0.13-release.md
      - KServe 0.11 Release: blog/articles/2023-10-08-KServe-0.11-release.md
      - KServe 0.10 Release: blog/articles/2023-02-05-KServe-0.10-release.md