
Rollout main logic #432

Closed
wants to merge 2 commits into from

Conversation

ryanzhang-oss
Contributor

Description of your changes

The main rollout logic.

Fixes #

I have:

  • Run make reviewable to ensure this PR is ready for review.

How has this code been tested

Special notes for your reviewer

@ryanzhang-oss marked this pull request as draft on July 17, 2023 03:06
@ryanzhang-oss changed the title from "Rollout feature" to "Rollout main logic" on Jul 17, 2023
}
// The binding needs update if it's not pointing to the latest resource snapshot
if binding.Spec.ResourceSnapshotName != latestResourceSnapshotName {
removeCandidates = append(removeCandidates, &binding)
Contributor

Why is it not an updateCandidate?

Contributor Author

Sorry for the naming confusion. removeCandidates (to be renamed) contains candidates that, if we pick them, will cause one instance to become unavailable. Here, if we update the resources, we are making the app unavailable on that cluster (in our model, which may not be true in the real case; we are just playing it safe).

}

// SetupWithManager sets up the controller with the Manager.
func (r *Watcher) SetupWithManager(mgr ctrl.Manager) error {
Contributor

Instead of using two watchers + a queue, there is another option.

We ask the rollout controller to watch binding & snapshot and implement the event handler func.

Similar to https://github.com/Azure/fleet-networking/blob/main/pkg/controllers/multiclusterservice/controller.go#L428

The reconcile request is the CRP object.

I feel it's better.

Contributor Author
@ryanzhang-oss Jul 17, 2023

Yeah, I thought about that too. The reason I opted for two watchers is that there is some error-handling logic; if we just use the event handler, errors will not be retried and we could end up not handling the event. Another thing I am (at least for now) not sure about is that I currently also filter out some events.

However, I looked at the code again. It seems that all the failure cases are unexpected. I will give this another try.
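
For reference, a minimal sketch of the suggested single-controller setup, assuming controller-runtime v0.15+ and that bindings and snapshots carry the owning CRP name in a label (the `Reconciler` receiver, label key, and import path are placeholders, not taken from the PR):

```go
import (
	"context"

	"k8s.io/apimachinery/pkg/types"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"
	"sigs.k8s.io/controller-runtime/pkg/handler"
	"sigs.k8s.io/controller-runtime/pkg/reconcile"

	// Placement API package used elsewhere in this PR; import path assumed.
	fleetv1beta1 "go.goms.io/fleet/apis/placement/v1beta1"
)

// crpTrackingLabel is a placeholder for whatever label carries the parent CRP name.
const crpTrackingLabel = "kubernetes-fleet.io/parent-CRP"

// SetupWithManager wires the rollout controller to watch bindings and snapshots
// directly and to enqueue the owning ClusterResourcePlacement as the reconcile request.
func (r *Reconciler) SetupWithManager(mgr ctrl.Manager) error {
	mapToCRP := handler.EnqueueRequestsFromMapFunc(func(ctx context.Context, obj client.Object) []reconcile.Request {
		crpName, ok := obj.GetLabels()[crpTrackingLabel]
		if !ok {
			return nil // not owned by a CRP, nothing to enqueue
		}
		// CRPs are cluster scoped, so only the name is set.
		return []reconcile.Request{{NamespacedName: types.NamespacedName{Name: crpName}}}
	})

	return ctrl.NewControllerManagedBy(mgr).
		For(&fleetv1beta1.ClusterResourcePlacement{}).
		Watches(&fleetv1beta1.ClusterResourceBinding{}, mapToCRP).
		Watches(&fleetv1beta1.ClusterResourceSnapshot{}, mapToCRP).
		Complete(r)
}
```

With this shape, errors returned from Reconcile are retried with backoff by the workqueue, which covers the retry concern, and event filtering can still be layered on with builder.WithPredicates.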

for i := 0; i < maxReadyNumber-upperBoundReadyNumber; i++ {
// we don't differentiate if binding needs to be bound to new cluster or its resource snapshot needs to be updated
// TODO:
tobeUpgradedBinding = append(tobeUpgradedBinding, updateCandidates[i])
Contributor

Are we going to differentiate between binding to a new cluster and updating the resource snapshot in the first phase?

Based on the maxSurge definition, we only need to append the newly bound ones.

Contributor Author

The update candidates only include selected bindings.

lowerBoundAvailable := len(readyBindings) - len(canBeUnavailableBindings)
for i := 0; i < lowerBoundAvailable-minAvailableNumber; i++ {
tobeRemovedBinding = append(tobeRemovedBinding, removeCandidates[i])
}
Contributor

If we still have some capacity (lowerBoundAvailable - minAvailable - len(tobeRemovedBinding)), we can update the resourceSnapshot on the bound bindings.

Contributor Author

The update ones are already included in removeCandidates.

// SetupWithManager sets up the controller with the Manager.
func (r *Watcher) SetupWithManager(mgr ctrl.Manager) error {
return ctrl.NewControllerManagedBy(mgr).
For(&fleetv1beta1.ClusterResourceSnapshot{}, builder.WithPredicates(predicate.GenerationChangedPredicate{})).
Contributor

Hi Ryan! This will lead the controller to ignore metadata changes; if the CRP controller adds a label/annotation, the event will not be picked up.

Contributor

The resource snapshot spec is immutable (ideally), so this predicate will filter out every change I guess.

Contributor Author

Yeah, is there any problem with filtering out metadata/status changes?
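
If metadata changes do need to trigger reconciles, a minimal sketch (using controller-runtime's built-in predicates) would be to OR the generation predicate with the label and annotation predicates, which still filters out status-only updates:

```go
import (
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/builder"
	"sigs.k8s.io/controller-runtime/pkg/predicate"

	// Placement API package used elsewhere in this PR; import path assumed.
	fleetv1beta1 "go.goms.io/fleet/apis/placement/v1beta1"
)

// SetupWithManager reconciles on spec (generation), label, or annotation changes,
// while still skipping status-only updates.
func (r *Watcher) SetupWithManager(mgr ctrl.Manager) error {
	specOrMetaChanged := predicate.Or(
		predicate.GenerationChangedPredicate{},
		predicate.LabelChangedPredicate{},
		predicate.AnnotationChangedPredicate{},
	)
	return ctrl.NewControllerManagedBy(mgr).
		For(&fleetv1beta1.ClusterResourceSnapshot{}, builder.WithPredicates(specOrMetaChanged)).
		Complete(r)
}
```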

return false
},
// skipping update events as the specs are immutable.
UpdateFunc: func(evt event.UpdateEvent) bool {
Contributor

Hi Ryan! The spec is surely immutable, but the metadata is still subject to change; we probably shouldn't ignore these events.

Contributor Author

which metadata change should we handle?

klog.ErrorS(err, "we have encountered a fatal error that can't be retried")
return ctrl.Result{}, controller.NewUnexpectedBehaviorError(err)
}
klog.V(2).InfoS("Start to rollout the bindings", "memberCluster", crpName)
Contributor
@michaelawyu Jul 17, 2023

Hi Ryan! There is a typo here; it should be clusterResourcePlacement rather than memberCluster I assume?
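
The fixed log line would simply swap the key:

```go
klog.V(2).InfoS("Start to rollout the bindings", "clusterResourcePlacement", crpName)
```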

if crp.Spec.Strategy.RollingUpdate.UnavailablePeriodSeconds != nil {
unavailablePeriod = time.Duration(*crp.Spec.Strategy.RollingUpdate.UnavailablePeriodSeconds)
} else {
unavailablePeriod = 60 * time.Second
Contributor

Hi Ryan! It might be better to make this default value a top-level constant IMO.

Contributor Author
@ryanzhang-oss Jul 17, 2023

This is the CRD default value; will etcd/the API server honor the CRD schema?
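
For what it's worth, the API server does apply defaults declared in a CRD's structural schema, so both can coexist: a +kubebuilder:default marker on the field plus a code-side constant as a nil guard. A rough sketch (the helper name and the *int parameter type are illustrative, not taken from the PR):

```go
import "time"

// defaultUnavailablePeriodSeconds mirrors the CRD schema default
// (e.g. a `+kubebuilder:default=60` marker on UnavailablePeriodSeconds).
const defaultUnavailablePeriodSeconds = 60

// unavailablePeriodFrom converts the field to a time.Duration, falling back to
// the constant when the field is nil (e.g. objects created before the default
// was added to the schema).
func unavailablePeriodFrom(unavailablePeriodSeconds *int) time.Duration {
	seconds := defaultUnavailablePeriodSeconds
	if unavailablePeriodSeconds != nil {
		seconds = *unavailablePeriodSeconds
	}
	return time.Duration(seconds) * time.Second
}
```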

if err := r.Client.List(ctx, bindingList, crpLabelMatcher); err != nil {
klog.ErrorS(err, "Failed to list all the bindings associated with the clusterResourcePlacement",
"clusterResourcePlacement", klog.KObj(crp))
return nil, nil, controller.NewAPIServerError(false, err)
Contributor

Hi Ryan! Just to be certain, is this an uncached client?

Contributor Author

no, this is cached.

// Since we can't predict the number of bindings that can be unavailable after they are applied, we don't take them into account
lowerBoundAvailable := len(readyBindings) - len(canBeUnavailableBindings)
for i := 0; i < lowerBoundAvailable-minAvailableNumber; i++ {
tobeRemovedBinding = append(tobeRemovedBinding, removeCandidates[i])
Contributor

Hi Ryan! Should we check to avoid an index overflow error here?

Contributor Author

good catch, yes
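
A minimal sketch of the bounds check, with the variable names taken from the excerpt above:

```go
// Never index past the end of removeCandidates.
removeCount := lowerBoundAvailable - minAvailableNumber
if removeCount > len(removeCandidates) {
	removeCount = len(removeCandidates)
}
for i := 0; i < removeCount; i++ {
	tobeRemovedBinding = append(tobeRemovedBinding, removeCandidates[i])
}
```

The same guard applies to the maxReadyNumber-upperBoundReadyNumber loop over updateCandidates discussed below.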

for i := 0; i < maxReadyNumber-upperBoundReadyNumber; i++ {
// we don't differentiate if binding needs to be bound to new cluster or its resource snapshot needs to be updated
// TODO:
tobeUpgradedBinding = append(tobeUpgradedBinding, updateCandidates[i])
Contributor

I am afraid an index overflow can also happen here.

// TODO:
tobeUpgradedBinding = append(tobeUpgradedBinding, updateCandidates[i])
}
return tobeRemovedBinding, tobeUpgradedBinding, nil
Contributor

Hi Ryan! Is there any risk of the workflow getting stuck here?

Contributor
@michaelawyu Jul 17, 2023

Say, for example, we have a CRP with the following setting:

  • numOfClusters = 10
  • maxSurge = 0
  • maxUnavailable = 4 (minAvailable = 6)

And suppose currently in the system there are 5 scheduled bindings (0-5 of them are ready), 5 bound bindings, and 7 unscheduled bindings (none of them is ready).

Contributor

Now, len(canBeReady) = 5 bound + 7 unscheduled = 12, which is > maxReady (10). So no upgrade will happen.

And len(ready) = 5, which is < minAvailable, so no removal will happen either.

Contributor Author

Now, len(canBeReady) = 5 bound + 7 unscheduled = 12, which is > maxReady (10). So no upgrade will happen.

And len(ready) = 5, which is < minAvailable, so no removal will happen either.

In this case, the unscheduled bindings are either being deleted or not; either way, their status will change. If they are being deleted, they will disappear eventually, so we will have upper bound and lower bound = 5, and we will start to flip the "scheduled" bindings.

If they are not being deleted, then they will become ready eventually since the last applied time doesn't change. Then we will have upper bound = 12 and lower bound = 12, in which case we will start to delete some of the unselected ones.

When there is a mix, for example 3 of them being deleted while 4 are just unselected, it will eventually become 4 ready unselected bindings. In this case, the upper bound = 9 and the lower bound = 9. We can delete 3 unselected and move one selected to bound.

I think the bottom line is that non-"deleting" bindings will become ready eventually while "deleting" bindings will disappear eventually, so we will always be able to move forward (if the system behaves). We can get stuck only if some of the agents/clusters are not working, which is unavoidable. I think a Deployment gets stuck too if some pods never become ready.

@ryanzhang-oss force-pushed the rollout-feature branch 2 times, most recently from cf5841a to df674aa on July 20, 2023 06:19
binding.Spec.ResourceSnapshotName = latestResourceSnapshotName
binding.Status.LastResourceUpdateTime = metav1.Now()
errs.Go(func() error {
if err := r.Client.Update(cctx, binding); err != nil {
Contributor

We only update the spec here; we need another call to update the status.
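
A minimal sketch of the two calls, assuming the binding CRD has the status subresource enabled (error wrapping and conflict retries omitted):

```go
binding.Spec.ResourceSnapshotName = latestResourceSnapshotName
errs.Go(func() error {
	// Persist the spec change first; this call does not touch the status subresource.
	if err := r.Client.Update(cctx, binding); err != nil {
		return err
	}
	// Persist the status change through the status writer in a second call.
	binding.Status.LastResourceUpdateTime = metav1.Now()
	if err := r.Client.Status().Update(cctx, binding); err != nil {
		return err
	}
	return nil
})
```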

@ryanzhang-oss deleted the rollout-feature branch on February 12, 2024 19:04