-
Notifications
You must be signed in to change notification settings - Fork 126
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
v0.14 release blog #431
base: main
Are you sure you want to change the base?
v0.14 release blog #431
Changes from all commits
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,149 @@ | ||
# Announcing: KServe v0.14 | ||
|
||
We are excited to announce KServe v0.14. In this release we are introducing a new Python client designed for KServe, and a new model cache feature; we are promoting OCI storage for models as a stable feature; and we added support for deploying models directly from Hugging Face. | ||
|
||
Below are a summary of the key changes. | ||
|
||
## Introducing Inference client for Python | ||
|
||
The KServe Python SDK now includes both [REST](https://github.com/kserve/kserve/blob/v0.14.0/python/kserve/kserve/inference_client.py#L388) and [GRPC](https://github.com/kserve/kserve/blob/v0.14.0/python/kserve/kserve/inference_client.py#L61) inference clients. The new Inference clients of the SDK were delivered as **alpha** features. | ||
|
||
Inline with the features documented in issue [#3270](https://github.com/kserve/kserve/issues/3270), both clients have the following characteristics: | ||
|
||
* The clients are asynchronous | ||
* Support for HTTP/2 (via [httpx](https://www.python-httpx.org/) library) | ||
* Support Open Inference Protocol v1 and v2 | ||
|
||
As usual, the version 0.14.0 of the KServe Python SDK is [published to PyPI](https://pypi.org/project/kserve/0.14.0/) and available to install via `pip install`. | ||
|
||
<!-- | ||
Related tickets: | ||
* Initial implementation [#3270](https://github.com/kserve/kserve/issues/3270) | ||
* FP16 support [#3643](https://github.com/kserve/kserve/issues/3643) | ||
--> | ||
|
||
## Support for OCI storage for models (modelcars) becomes stable | ||
|
||
In KServe version 0.12, support for using OCI containers for model storage was introduced as an experimental feature. This allows users to store models in containers in OCI format, and allows the usage of OCI-compatible registries for publishing the models. | ||
|
||
This feature was implemented by configuring the OCI model container as a sidecar in the InferenceService pod, which was the motivation for naming the feature as modelcars. The model files are made available to the model server by configuring [process namespace sharing](https://kubernetes.io/docs/tasks/configure-pod-container/share-process-namespace/) in the pod. | ||
|
||
There was one small but important detail that was unsolved and motivated the experimental status: since the modelcar is part of the main containers of the pod, there was no certainty that the modelcar would start quickly. The model server would be unstable if it starts first than the modelcar, and since there was no prefetching of the model image, this was thought as a likely condition. | ||
|
||
The unstable situation has been mitigated by configuring the OCI model as an init container in addition to also configuring it as a sidecar. The configuration as an init container ensures that the model is fetched before the main containers are started. The prefetching allows the modelcar to start quickly. | ||
The stabilization is available since KServe version 0.14, where modelcars are now a stable feature. | ||
|
||
### Future plan | ||
|
||
Modelcars is one implementation option for supporting OCI images for model storage. There are other alternatives commented in [issue #4083](https://github.com/kserve/kserve/issues/4083). | ||
|
||
Using volume mounts based on OCI artifacts is the optimal implementation, but this is only [recently possible since Kubernetes 1.31](https://kubernetes.io/blog/2024/08/16/kubernetes-1-31-image-volume-source/) as a native alpha feature. KServe can now evolve to use this new Kubernetes feature. | ||
|
||
## Introducing model cache | ||
|
||
With models increasing in size, specially true for LLM models, pulling from storage each time a pod is created can result in unmanageable start-up times. Although OCI storage also has the benefit of model caching, the capabilities are not flexible since the management is delegated to the cluster. | ||
|
||
The Model Cache was proposed as another alternative to enhance KServe usability with big models, released in KServe v0.14 as an **alpha** feature. It relies on a PV for storing models and it provides control about which models to store in the cache. The feature was designed to mainly to use node Filesystem as storage. Read the [design document for the details](https://docs.google.com/document/d/1nao8Ws3tonO2zNAzdmXTYa0hECZNoP2SV_z9Zg0PzLA/edit). | ||
|
||
The model cache is currently disabled by default. To enable, you need to modify the `localmodel.enabled` field on the `inferenceservice-config` ConfigMap. | ||
|
||
You start by creating a node group as follows: | ||
|
||
```yaml | ||
apiVersion: serving.kserve.io/v1alpha1 | ||
kind: LocalModelNodeGroup | ||
metadata: | ||
name: nodegroup1 | ||
spec: | ||
persistentVolumeSpec: | ||
accessModes: | ||
- ReadWriteOnce | ||
volumeMode: Filesystem | ||
capacity: | ||
storage: 2Gi | ||
hostPath: | ||
path: /models | ||
type: "" | ||
persistentVolumeReclaimPolicy: Delete | ||
storageClassName: standard | ||
persistentVolumeClaimSpec: | ||
accessModes: | ||
- ReadWriteOnce | ||
resources: | ||
requests: | ||
storage: 2Gi | ||
storageClassName: standard | ||
volumeMode: Filesystem | ||
volumeName: kserve | ||
|
||
``` | ||
|
||
Then, you can specify to store an cache a model with the following resource: | ||
|
||
```yaml | ||
apiVersion: serving.kserve.io/v1alpha1 | ||
kind: ClusterLocalModel | ||
metadata: | ||
name: iris | ||
spec: | ||
modelSize: 1Gi | ||
nodeGroup: nodegroup1 | ||
sourceModelUri: gs://kfserving-examples/models/sklearn/1.0/model | ||
``` | ||
|
||
<!-- | ||
Related tickets: | ||
* Cluster local model controller: [#3860](https://github.com/kserve/kserve/pull/3860) | ||
* Cluster Local Model CR [#3839](https://github.com/kserve/kserve/pull/3839) | ||
* Add NodeDownloadPending status to ClusterLocalModel [#3955](https://github.com/kserve/kserve/pull/3955) | ||
--> | ||
|
||
## Support for Hugging Face hub in storage initializer | ||
|
||
The KServe storage initializer has been enhanced to support downloading models directly from Hugging Face. For this, the new schema `hf://` is now supported in the `storageUri` field of InferenceServices. The following YAML partial shows this: | ||
|
||
```yaml | ||
apiVersion: serving.kserve.io/v1beta1 | ||
kind: InferenceService | ||
metadata: | ||
name: huggingface-llama3 | ||
spec: | ||
predictor: | ||
model: | ||
storageUri: hf://meta-llama/meta-llama-3-8b-instruct | ||
``` | ||
|
||
Both public and private Hugging Face repositories are supported. The credentials can be provided by the usual mechanism of binding Secrets to ServiceAccounts, or by binding the credentials Secret as environment variables in the InferenceService. | ||
|
||
Read the [documentation](../../../modelserving/storage/huggingface/hf/) for more details. | ||
|
||
<!-- | ||
Related tickets: | ||
* Implement Huggingface model download in storage initializer [#3584](https://github.com/kserve/kserve/pull/3584) | ||
--> | ||
|
||
## Other Changes | ||
|
||
This release also includes several enhancements and changes: | ||
|
||
### What's New? | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. @sivanantha321 @andyi2it also good to the add the binary extension support and response header support? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I somehow thought that the binary extension was an enhancement of the Inference Client. So, to better understand... should I add it under the inference client heading? or it is good here as a bullet under There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. binary extension is not part of the inference client effort, it is implementing the binary extension as part of the open inference protocol along with FP16 support. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Maybe you can link the documentation as well. https://kserve.github.io/website/latest/modelserving/data_plane/binary_tensor_data_extension/ |
||
* New flag for automount serviceaccount token by [#3979](https://github.com/kserve/kserve/pull/3979) | ||
* TLS support for inference loggers [#3837](https://github.com/kserve/kserve/issues/3837) | ||
* Allow PVC storage to be mounted in ReadWrite mode via an annotation [#3687](https://github.com/kserve/kserve/issues/3687) | ||
|
||
### What's Changed? | ||
* Added `hostIPC` field to `ServingRuntime` CRD, for supporting more than one GPU in Serverless mode [#3791](https://github.com/kserve/kserve/issues/3791) | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I think it is good to add a section for LLM runtime support to include the changes as part of the 0.14 release
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Do you mean a dedicated section with LLM-related enhancements? or should those be listed here under There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. yes worth calling it out separately, cc @sivanantha321 to check if the list of changes are correct There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Ray is an optional dependency now and the way it is implemented is changed. It is worth mentioning this as a breaking change. kserve/kserve#3834 |
||
* Support for Python 3.12 is added, while support Python 3.8 is removed [#3645](https://github.com/kserve/kserve/pull/3645) | ||
|
||
For complete details on the new features and updates, visit our [official release notes](https://github.com/kserve/kserve/releases/tag/v0.14.0). | ||
|
||
## Join the community | ||
|
||
- Visit our [Website](https://kserve.github.io/website/) or [GitHub](https://github.com/kserve) | ||
- Join the Slack ([#kserve](https://github.com/kserve/community?tab=readme-ov-file#questions-and-issues)) | ||
- Attend our community meeting by subscribing to the [KServe calendar](https://zoom-lfx.platform.linuxfoundation.org/meetings/kserve?view=month). | ||
- View our [community github repository](https://github.com/kserve/community) to learn how to make contributions. We are excited to work with you to make KServe better and promote its adoption! | ||
|
||
Thanks for all the contributors who have made the commits to 0.14 release! | ||
|
||
The KServe Project |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe add the isvc with the example explaining how to use this
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
AFAIK, the InferenceService doesn't change and you use it normally (i.e. you still would use
gs://kfserving-examples/models/sklearn/1.0/model
forstorageUri
) .The difference you would notice is that the model will be fetched/mounted from the cache, instead of downloading it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think I can add a brief note about what I just wrote.