[Feature] rolling upgrade design and implementation for Kuberay #527

wilsonwang371 · 2022-09-01T22:24:23Z

Search before asking

I had searched in the issues and found no similar feature requirement.

Description

Right now we don't support Ray cluster rolling upgrade. This is a valid requirement for customers that has a large number of nodes in their Ray cluster deployment.

Use case

support rolling upgrade of Ray clusters which can be beneficial to users with large Ray cluster.

Related issues

No response

Are you willing to submit a PR?

Yes I am willing to submit a PR!

DmitriGekhtman · 2022-09-07T04:30:36Z

This would be great to have. Let's figure out a design...
Would we be aiming for something similar to rollouts for Deployments?

wilsonwang371 · 2022-09-14T04:20:08Z

Note: we need to review #231 and make a new design.

scarlet25151 · 2022-09-27T18:19:40Z

Would we be aiming for something similar to rollouts for Deployments?

@DmitriGekhtman I think we can separate the update for two role:

for head node, we can just delete old one and pull up a new one, here we need to consider how it interacts with the HA mechanism.
for worker node, yes we can use similar rolling update logic in the deployments, however there may be some difference like deployment do not support scale old version replicas to 0 so it keep a MaxUnavailable or MaxSurge but in ray we can have only head node with no worker, so we need to define some new behaviors

here me and Wilson will come up with detail design and we may have several round discussion.

DmitriGekhtman · 2022-09-27T19:28:34Z

I like the strategy of splitting the discussion (and potentially even implementation) into updates for head and updates for worker.

cc @brucez-anyscale for the head node HA aspect.
Stating the question again: What should happen when you change the configuration for a RayCluster's head pod?

here me and Wilson will come up with detail design and we may have several round discussion.

That's great! I'm looking forward to discussing the design of this functionality -- I think it's very important.

brucez-anyscale · 2022-09-30T04:52:47Z

Right now. RayService does the whole cluster level upgrading, so RayService works itself for now.
RayCluster rolling upgrade: I think the head node or worker node should be backward compatible, so they can join back the Ray cluster.

scarlet25151 · 2022-10-06T22:43:56Z

@wilsonwang371 Here I think we need to find the exact use case that user can benefit from the feature.

First is the user behavior, following the previous discussion, we can make the assumption that in this story:

the users would like to upgrade the image of worker.
the users would like to upgrade the image of head.
the user would like to upgrade both head and workers.

In all of those cases, we need to enable the mechanism that the ray package in images is compatible.

Here are some scenarios that I can think about:

there is no actor or task running on the raycluster

In this case, we would not need the feature since the recreate strategy would be enough, the only modification is to enable the worker upgrade in the reconcile.

there is some jobs running on the raycluster, some remaining actors inside the old one.

here the situation is a little bit tricky since we need to support mechanisms in ray that migrate actors from old raycluster to the new one.

there is a ray service running on the raycluster, just as @brucez-anyscale said the whole cluster would upgrade

This case is the most possible to have the rolling upgrade feature. Since for now we may recreate a brand new raycluster by rayservice controller. we may support the rolling upgrade in raycluster controller to ease the ray service upgrade.

Indeed we need support standard update semantic for raycluster, at least in recreate strategy. However, for now, consider those cases, would the feature raycluster rolling upgrade bring any further significant benefit to the user? WDYT @DmitriGekhtman

DmitriGekhtman · 2022-10-07T00:01:13Z

Let's first consider the most basic use-case that we were going for with the --forced-cluster-upgrade flag.

When a user updates a RayCluster CR and applies it, they expect changes to pod configs to be reflected in the actual pod configuration, even if the change is potentially disruptive to the Ray workload. If you update a workerGroupSpec,
workers with outdated configuration should be eliminated and workers with update configuration should be created.
Same thing for the HeadGroupSpec.

The ability to do (destructive) updates is available with the Ray Autoscaler's VM node providers and with the legacy python-based Ray operator. The implementation for this uses hashes of last-applied node configuration. We could potentially do the same thing here.

If Ray versions mismatch, things won't work out, no matter what, because Ray does not have cross-version compatibility. If workloads are running, they may be interrupted. These are complex, higher-order concerns, but we can start by just registering pod config updates.

grzesuav · 2023-03-17T15:22:00Z

Question - why to create pod's directly and not Deployment - which would handle this ? (side note - not familiar with ray in particular, just an operator of a kubernetes cluster where ray is deployed)

jhasm · 2023-04-18T19:22:34Z

I am curious if there has been any update on this feature or do we have any plans?

If we are worried that we do not have a strong use case to focus on this, I can help. Not have rolling upgrade is a real pain for us. I am talking from the perspective of a ML platform that supports all ML teams within a company.

We plan to have several ray clusters, standing and ephemeral. Think one ray cluster for model-dev (interactive), automated training, batch serving and real-time serving per project a group or project in one ML team.
For standing clusters, not having rolling upgrade is like going back by a few years in infrastructure, for us. Every service we have has rolling upgrade, and we do not allow downtimes in the production services.
For real-time serving (ray-serve) this is a blocker. The serving needs to be available 24x7, there is no acceptable downtime outside of SLA.
Since the project specific python dependencies are going to be baked into the image running on the workers, we will need to update this for every change. This does happen frequently for us, and having a scheduled downtime to do this out of the norm for our infrastructure.
Since Kuberay is at v0.5.0 we expect to keep up with the rapid version upgrades. And this will require us to delete and recreate all our ray clusters.
Deleting a resource and recreating is not a standard CI/CD operation for us. This requires custom steps or manual support in our case. Deleting a resource manually is reserved for emergency situations, but the lack of rolling upgrade requires us to use it frequently.

I am happy to discuss more on this, or help any way I can.

peterghaddad · 2023-04-20T03:39:12Z

@jhasm I don't want to speak for others, but believe serve will be critical to ensure 100% uptime during upgrades of Ray Cluster versions. The way a model is served shouldn't hinder the upgrade i.e serve cli, sdk, etc. I had some thoughts I wanted to share.

There may be opportunities to enable cluster version rolling upgrades using Ray's GCS external Redis.

A potential starting point may be to detect when the Ray Cluster version changes. If the version changes and the cluster name is currently deployed, then launch a new Ray cluster. Once jobs are transferred, have kuberay rewrite the service to point to the new cluster. I believe the more complex portion is transferring the jobs and actors to the new cluster.

kevin85421 · 2023-04-20T19:43:41Z

Good point: https://ray-distributed.slack.com/archives/C02GFQ82JPM/p1682018043124949?thread_ts=1681846159.725999&cid=C02GFQ82JPM

Keep the head service and serve service with the same name.

qizzzh · 2024-02-01T04:58:58Z

Any update on this? Lack of rolling-update is like a no-go for many production serving workloads.

DmitriGekhtman · 2024-02-01T06:44:16Z

The RayService custom resource is intended to support the upgrade semantics of the sort people in this thread are looking for.

An individual Ray Cluster should be thought of as a massive pod -- there is not a coherent way to conduct a rolling upgrade of a single Ray cluster (though actually some large enterprises have actually managed to achieve this)

tl;dr solutions for upgrades require multiple Ray clusters

In my experience, doing anything "production-grade" with Ray requires multiple Ray clusters and external orchestration.

kevin85421 · 2024-02-03T08:32:59Z

@qizzzh, I just saw your message. As @DmitriGekhtman mentioned, upgrading Ray involves more than one RayCluster. For RayService, we plan to support incremental upgrades, meaning that we won't need a new, large RayCluster for a zero-downtime upgrade. Instead, we will gradually increase the size of the new RayCluster and decrease the size of the old one. If you want to chat more, feel free to reach out to me on Slack.

Ray doesn't natively support rolling upgrade. It is impossible for KubeRay to achieve that (in the single RayCluster). This issue should move to Ray instead of KubeRay. Close this issue. I will open new issues to track incremental upgrade when I start to work on it.

zzb54321 · 2024-08-07T04:36:13Z

Hi @kevin85421 , is there any progress on this? or any tracking issue created? so we can check whether the incremental upgrade effort has been started or no. Thanks a lot!

andrewsykim · 2024-08-07T16:34:11Z

@zzb54321 there have been some discussions but no work started. I am willing to start a one-pager proposal on this effort. @kevin85421 any objections?

kevin85421 · 2024-08-07T23:05:37Z

@zzb54321 instead of an incremental upgrade, the community recently prefers an N+1 upgrade for now. See #2274 for more details.

kevin85421 · 2024-08-07T23:05:50Z

I am willing to start a one-pager proposal on this effort.

sounds good!

JoshKarpel · 2024-08-08T02:20:29Z

@kevin85421 what's an N+1 upgrade?

kevin85421 · 2024-08-08T02:50:37Z

what's an N+1 upgrade?

RayService manages multiple (N) small RayCluster CRs simultaneously. When we need to upgrade the RayService CR, it creates a new small RayCluster CR and then tears down an old RayCluster CR.

You can think of it like a K8s Deployment, where each Pod in the Deployment is a 1-node RayCluster. Then, use the K8s rolling upgrade mechanism to upgrade the K8s Deployment.

JoshKarpel · 2024-08-08T03:00:24Z

Gotcha! That makes a lot of sense. I'll follow #2274

Do you think it would be possible to set e.g. an environment variable to be different in each small cluster automatically? We've been thinking about sharding our current one large cluster into multiple smaller clusters to handle increasing scale (probably roughly what is being referred to here) - it would be nice if we could do that via this mechanism so that we didn't have to manage that ourselves!

kevin85421 · 2024-08-08T03:45:45Z

@JoshKarpel Would you mind explaining why you need to have different environment variables for different small RayCluster CRs? For the short term, I plan to make RayService more similar to a K8s Deployment (where each Pod has the same spec) instead of a K8s StatefulSet. That is, I prefer to make all RayCluster CRs that belong to the same RayService CR have the same spec. If we make it stateful, I think the complexity will increase a lot.

JoshKarpel · 2024-08-08T03:53:54Z

Oh, sorry, yes, I should have said why!

Our goal here would be to shard a set of dynamically-created Serve applications (reconciled with our ML model store) across multiple clusters. Right now, we deploy the Serve applications from inside the cluster itself, so each cluster would need to know which shard it should be (e.g., to then do consistent hashing on the metadata that defines the Serve apps, so it knows which apps to create in itself).

We don't deploy the apps through the RayService CR because we don't want KubeRay to consider them when determining the health of the cluster (see ray-project/ray#44226).

That said - short term, your plan totally makes sense, and I agree that it will be much simpler! Once we have that maybe we can work on extending it to add some stateful-ness. By then maybe we'll have played with it in our setup and have something we could upstream.

zzb54321 · 2024-08-08T07:06:13Z

A few question about N+1 upgrade. Assume the RayService CR defines multiple applications. In the context of N+1 upgrade, there will be multiple small RayCluster CRs.

Does it mean the applications will be sharded and distributed to these small clusters? In terms of sharding, for a certain application, will it be sharded to multiple clusters? or an application has to fit into one cluster?
This has different implications. In the latter case, it's actually still a blue/green upgrade from an application's point of view. If a RayService CR has one giant application, the upgrade still needs almost double resource.
If only one application is updated, when certain small cluster are upgraded, the other applications that co-locate in the same cluster will be affected and also updated. So basically we can't update one specific application, without touching others. Is that right?

Thanks.

Basasuya · 2024-08-14T11:45:49Z

@zzb54321 I think this plan would make one application fit into one cluster. it is not blue/green upgrade? Reference to k8s deployment, it like rolling upgrade to gradually upgrade all instances.
Applications and ray clusters are one-to-one correspondence in my opinion. So there are no other applications that co-locate in the same cluster will be affected.

wilsonwang371 added the enhancement New feature or request label Sep 1, 2022

wilsonwang371 self-assigned this Sep 1, 2022

wilsonwang371 mentioned this issue Sep 14, 2022

[Bug] --forced-cluster-upgrade Causes termination loop for ray head node #558

Closed

2 tasks

wilsonwang371 added the P1 Issue that should be fixed within a few weeks label Sep 14, 2022

DmitriGekhtman added the rfc label Sep 14, 2022

DmitriGekhtman mentioned this issue Oct 25, 2022

[RayService] Ray cluster keeps restarting when autoscaling with RayServices #649

Closed

2 tasks

DmitriGekhtman mentioned this issue Nov 19, 2022

[autoscaler][kubernetes][operator] Add rolling upgrade functionality to K8s operator ray-project/ray#13675

Closed

peterghaddad mentioned this issue Mar 7, 2023

Reconcile Ray Workers when VolumeMounts change #945

Closed

4 tasks

kevin85421 mentioned this issue Apr 26, 2023

KubeRay v0.6.0 roadmap #1052

Closed

nvtkaszpir mentioned this issue May 26, 2023

[Bug] Allow zero replica for Helm #965

Closed

2 tasks

kevin85421 closed this as completed Feb 3, 2024

andrewsykim mentioned this issue Aug 21, 2024

[RFC] Introduce new API-RayCluster Fleet and ReplicaSet in KubeRay #2323

Open

2 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[Feature] rolling upgrade design and implementation for Kuberay #527

[Feature] rolling upgrade design and implementation for Kuberay #527

wilsonwang371 commented Sep 1, 2022

DmitriGekhtman commented Sep 7, 2022

wilsonwang371 commented Sep 14, 2022

scarlet25151 commented Sep 27, 2022 •

edited

Loading

DmitriGekhtman commented Sep 27, 2022

brucez-anyscale commented Sep 30, 2022

scarlet25151 commented Oct 6, 2022

DmitriGekhtman commented Oct 7, 2022

grzesuav commented Mar 17, 2023 •

edited

Loading

jhasm commented Apr 18, 2023

peterghaddad commented Apr 20, 2023

kevin85421 commented Apr 20, 2023 •

edited

Loading

qizzzh commented Feb 1, 2024

DmitriGekhtman commented Feb 1, 2024 •

edited

Loading

kevin85421 commented Feb 3, 2024

zzb54321 commented Aug 7, 2024

andrewsykim commented Aug 7, 2024

kevin85421 commented Aug 7, 2024

kevin85421 commented Aug 7, 2024

JoshKarpel commented Aug 8, 2024

kevin85421 commented Aug 8, 2024

JoshKarpel commented Aug 8, 2024

kevin85421 commented Aug 8, 2024

JoshKarpel commented Aug 8, 2024

zzb54321 commented Aug 8, 2024

Basasuya commented Aug 14, 2024

[Feature] rolling upgrade design and implementation for Kuberay #527

[Feature] rolling upgrade design and implementation for Kuberay #527

Comments

wilsonwang371 commented Sep 1, 2022

Search before asking

Description

Use case

Related issues

Are you willing to submit a PR?

DmitriGekhtman commented Sep 7, 2022

wilsonwang371 commented Sep 14, 2022

scarlet25151 commented Sep 27, 2022 • edited Loading

DmitriGekhtman commented Sep 27, 2022

brucez-anyscale commented Sep 30, 2022

scarlet25151 commented Oct 6, 2022

DmitriGekhtman commented Oct 7, 2022

grzesuav commented Mar 17, 2023 • edited Loading

jhasm commented Apr 18, 2023

peterghaddad commented Apr 20, 2023

kevin85421 commented Apr 20, 2023 • edited Loading

qizzzh commented Feb 1, 2024

DmitriGekhtman commented Feb 1, 2024 • edited Loading

kevin85421 commented Feb 3, 2024

zzb54321 commented Aug 7, 2024

andrewsykim commented Aug 7, 2024

kevin85421 commented Aug 7, 2024

kevin85421 commented Aug 7, 2024

JoshKarpel commented Aug 8, 2024

kevin85421 commented Aug 8, 2024

JoshKarpel commented Aug 8, 2024

kevin85421 commented Aug 8, 2024

JoshKarpel commented Aug 8, 2024

zzb54321 commented Aug 8, 2024

Basasuya commented Aug 14, 2024

scarlet25151 commented Sep 27, 2022 •

edited

Loading

grzesuav commented Mar 17, 2023 •

edited

Loading

kevin85421 commented Apr 20, 2023 •

edited

Loading

DmitriGekhtman commented Feb 1, 2024 •

edited

Loading