
Rollout main logic #432

Closed
wants to merge 2 commits into from

Conversation

ryanzhang-oss
Contributor

Description of your changes

The main rollout logic.

Fixes #

I have:

  • Run make reviewable to ensure this PR is ready for review.

How has this code been tested

Special notes for your reviewer

@ryanzhang-oss marked this pull request as draft on July 17, 2023 03:06
@ryanzhang-oss changed the title from "Rollout feature" to "Rollout main logic" on Jul 17, 2023
}
// The binding needs update if it's not pointing to the latest resource snapshot
if binding.Spec.ResourceSnapshotName != latestResourceSnapshotName {
removeCandidates = append(removeCandidates, &binding)
Contributor

Why is it not an updateCandidate?

Contributor Author

Sorry for the naming confusion. removeCandidates (to be renamed) contains candidates that, if we pick them, will cause one instance to become unavailable. Here, if we update the resources, we are making the app unavailable on that cluster (in our model, which may not be true in the real case; we are just playing it safe).

}

// SetupWithManager sets up the controller with the Manager.
func (r *Watcher) SetupWithManager(mgr ctrl.Manager) error {
Contributor

Instead of using two watchers + a queue, there is another option.

We ask the rollout controller to watch binding & snapshot and implement the event handler func.

Similar to https://github.com/Azure/fleet-networking/blob/main/pkg/controllers/multiclusterservice/controller.go#L428

The reconcile request is the CRP object.

I feel it's better.

Contributor Author
@ryanzhang-oss Jul 17, 2023

Yeah, I thought about that too. The reason I opted for two watchers is that there is some error-handling logic; if we just use the event handler, errors will not be retried and we could end up not handling the event. Another thing I am (at least for now) not sure about is that I currently also filter out some events.

However, I looked at the code again. It seems that all the failure cases are unexpected. I will give this another try.
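
For reference, a minimal sketch of the suggested single-controller setup, assuming controller-runtime v0.15+ and that bindings and snapshots carry the owning CRP name in a label (the `Reconciler` receiver, label key, and import path are placeholders, not taken from the PR):

```go
import (
	"context"

	"k8s.io/apimachinery/pkg/types"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"
	"sigs.k8s.io/controller-runtime/pkg/handler"
	"sigs.k8s.io/controller-runtime/pkg/reconcile"

	// Placement API package used elsewhere in this PR; import path assumed.
	fleetv1beta1 "go.goms.io/fleet/apis/placement/v1beta1"
)

// crpTrackingLabel is a placeholder for whatever label carries the parent CRP name.
const crpTrackingLabel = "kubernetes-fleet.io/parent-CRP"

// SetupWithManager wires the rollout controller to watch bindings and snapshots
// directly and to enqueue the owning ClusterResourcePlacement as the reconcile request.
func (r *Reconciler) SetupWithManager(mgr ctrl.Manager) error {
	mapToCRP := handler.EnqueueRequestsFromMapFunc(func(ctx context.Context, obj client.Object) []reconcile.Request {
		crpName, ok := obj.GetLabels()[crpTrackingLabel]
		if !ok {
			return nil // not owned by a CRP, nothing to enqueue
		}
		// CRPs are cluster scoped, so only the name is set.
		return []reconcile.Request{{NamespacedName: types.NamespacedName{Name: crpName}}}
	})

	return ctrl.NewControllerManagedBy(mgr).
		For(&fleetv1beta1.ClusterResourcePlacement{}).
		Watches(&fleetv1beta1.ClusterResourceBinding{}, mapToCRP).
		Watches(&fleetv1beta1.ClusterResourceSnapshot{}, mapToCRP).
		Complete(r)
}
```

With this shape, errors returned from Reconcile are retried with backoff by the workqueue, which covers the retry concern, and event filtering can still be layered on with builder.WithPredicates.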

for i := 0; i < maxReadyNumber-upperBoundReadyNumber; i++ {
// we don't differentiate if binding needs to be bound to new cluster or its resource snapshot needs to be updated
// TODO:
tobeUpgradedBinding = append(tobeUpgradedBinding, updateCandidates[i])
Contributor

Are we going to differentiate between binding to a new cluster and updating the resource snapshot in the first phase?

Based on the maxSurge definition, we only need to append the newly bound ones.

Contributor Author

The update candidates only include selected bindings.

lowerBoundAvailable := len(readyBindings) - len(canBeUnavailableBindings)
for i := 0; i < lowerBoundAvailable-minAvailableNumber; i++ {
tobeRemovedBinding = append(tobeRemovedBinding, removeCandidates[i])
}
Contributor

If we still have some capacity (lowerBoundAvailable - minAvailable - len(tobeRemovedBinding)), we can update the resourceSnapshot on the bound bindings.

Contributor Author

The update ones are already included in removeCandidates.

// SetupWithManager sets up the controller with the Manager.
func (r *Watcher) SetupWithManager(mgr ctrl.Manager) error {
return ctrl.NewControllerManagedBy(mgr).
For(&fleetv1beta1.ClusterResourceSnapshot{}, builder.WithPredicates(predicate.GenerationChangedPredicate{})).
Contributor

Hi Ryan! This will lead the controller to ignore metadata changes; if the CRP controller adds a label/annotation, the event will not be picked up.

Contributor

The resource snapshot spec is immutable (ideally), so this predicate will filter out every change I guess.

Contributor Author

Yeah, is there any problem with filtering out metadata/status changes?
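
If metadata changes do need to trigger reconciles, a minimal sketch (using controller-runtime's built-in predicates) would be to OR the generation predicate with the label and annotation predicates, which still filters out status-only updates:

```go
import (
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/builder"
	"sigs.k8s.io/controller-runtime/pkg/predicate"

	// Placement API package used elsewhere in this PR; import path assumed.
	fleetv1beta1 "go.goms.io/fleet/apis/placement/v1beta1"
)

// SetupWithManager reconciles on spec (generation), label, or annotation changes,
// while still skipping status-only updates.
func (r *Watcher) SetupWithManager(mgr ctrl.Manager) error {
	specOrMetaChanged := predicate.Or(
		predicate.GenerationChangedPredicate{},
		predicate.LabelChangedPredicate{},
		predicate.AnnotationChangedPredicate{},
	)
	return ctrl.NewControllerManagedBy(mgr).
		For(&fleetv1beta1.ClusterResourceSnapshot{}, builder.WithPredicates(specOrMetaChanged)).
		Complete(r)
}
```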

return false
},
// skipping update events as the specs are immutable.
UpdateFunc: func(evt event.UpdateEvent) bool {
Contributor

Hi Ryan! The spec is surely immutable, but the metadata is still subject to change; we probably shouldn't ignore these events.

Contributor Author

which metadata change should we handle?

klog.ErrorS(err, "we have encountered a fatal error that can't be retried")
return ctrl.Result{}, controller.NewUnexpectedBehaviorError(err)
}
klog.V(2).InfoS("Start to rollout the bindings", "memberCluster", crpName)
Contributor
@michaelawyu Jul 17, 2023

Hi Ryan! There is a typo here; it should be clusterResourcePlacement rather than memberCluster I assume?
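
The fixed log line would simply swap the key:

```go
klog.V(2).InfoS("Start to rollout the bindings", "clusterResourcePlacement", crpName)
```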

if crp.Spec.Strategy.RollingUpdate.UnavailablePeriodSeconds != nil {
unavailablePeriod = time.Duration(*crp.Spec.Strategy.RollingUpdate.UnavailablePeriodSeconds)
} else {
unavailablePeriod = 60 * time.Second
Contributor

Hi Ryan! It might be better to make this default value a top-level constant IMO.

Contributor Author
@ryanzhang-oss Jul 17, 2023

This is the CRD default value; will etcd/the API server honor the CRD schema?
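
For what it's worth, the API server does apply defaults declared in a CRD's structural schema, so both can coexist: a +kubebuilder:default marker on the field plus a code-side constant as a nil guard. A rough sketch (the helper name and the *int parameter type are illustrative, not taken from the PR):

```go
import "time"

// defaultUnavailablePeriodSeconds mirrors the CRD schema default
// (e.g. a `+kubebuilder:default=60` marker on UnavailablePeriodSeconds).
const defaultUnavailablePeriodSeconds = 60

// unavailablePeriodFrom converts the field to a time.Duration, falling back to
// the constant when the field is nil (e.g. objects created before the default
// was added to the schema).
func unavailablePeriodFrom(unavailablePeriodSeconds *int) time.Duration {
	seconds := defaultUnavailablePeriodSeconds
	if unavailablePeriodSeconds != nil {
		seconds = *unavailablePeriodSeconds
	}
	return time.Duration(seconds) * time.Second
}
```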

if err := r.Client.List(ctx, bindingList, crpLabelMatcher); err != nil {
klog.ErrorS(err, "Failed to list all the bindings associated with the clusterResourcePlacement",
"clusterResourcePlacement", klog.KObj(crp))
return nil, nil, controller.NewAPIServerError(false, err)
Contributor

Hi Ryan! Just to be certain, is this an uncached client?

Contributor Author

no, this is cached.

// Since we can't predict the number of bindings that can be unavailable after they are applied, we don't take them into account
lowerBoundAvailable := len(readyBindings) - len(canBeUnavailableBindings)
for i := 0; i < lowerBoundAvailable-minAvailableNumber; i++ {
tobeRemovedBinding = append(tobeRemovedBinding, removeCandidates[i])
Contributor

Hi Ryan! Should we check to avoid an index overflow error here?

Contributor Author

good catch, yes
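
A minimal sketch of the bounds check, with the variable names taken from the excerpt above:

```go
// Never index past the end of removeCandidates.
removeCount := lowerBoundAvailable - minAvailableNumber
if removeCount > len(removeCandidates) {
	removeCount = len(removeCandidates)
}
for i := 0; i < removeCount; i++ {
	tobeRemovedBinding = append(tobeRemovedBinding, removeCandidates[i])
}
```

The same guard applies to the maxReadyNumber-upperBoundReadyNumber loop over updateCandidates discussed below.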

for i := 0; i < maxReadyNumber-upperBoundReadyNumber; i++ {
// we don't differentiate if binding needs to be bound to new cluster or its resource snapshot needs to be updated
// TODO:
tobeUpgradedBinding = append(tobeUpgradedBinding, updateCandidates[i])
Contributor

I am afraid an index overflow can also happen here.

// TODO:
tobeUpgradedBinding = append(tobeUpgradedBinding, updateCandidates[i])
}
return tobeRemovedBinding, tobeUpgradedBinding, nil
Contributor

Hi Ryan! Is there any risk of the workflow getting stuck here?

Contributor
@michaelawyu Jul 17, 2023

Say, for example, we have a CRP with the following setting:

  • numOfClusters = 10
  • maxSurge = 0
  • maxUnavailable = 4 (minAvailable = 6)

And suppose currently in the system there are 5 scheduled bindings (0-5 of them are ready), 5 bound bindings, and 7 unscheduled bindings (none of them is ready).

Contributor

Now, len(canBeReady) = 5 bound + 7 unscheduled = 12, which is > maxReady (10). So no upgrade will happen.

And len(ready) = 5, which is < minAvailable, so no removal will happen either.

Contributor Author

Now, len(canBeReady) = 5 bound + 7 unscheduled = 12, which is > maxReady (10). So no upgrade will happen.

And len(ready) = 5, which is < minAvailable, so no removal will happen either.

In this case, the unscheduled bindings are either being deleted or not; either way, their status will change. If they are being deleted, they will disappear eventually, so we will have upper bound and lower bound = 5, and we will start to flip the "scheduled" bindings.

If they are not being deleted, then they will become ready eventually since the last applied time doesn't change. Then we will have upper bound = 12 and lower bound = 12, in which case we will start to delete some of the unselected ones.

When there is a mix, for example 3 of them being deleted while 4 are just unselected, it will eventually become 4 ready unselected bindings. In this case, the upper bound = 9 and the lower bound = 9. We can delete 3 unselected and move one selected to bound.

I think the bottom line is that non-"deleting" bindings will become ready eventually while "deleting" bindings will disappear eventually, so we will always be able to move forward (if the system behaves). We can get stuck only if some of the agents/clusters are not working, which is unavoidable. I think a Deployment gets stuck too if some pods never become ready.

@ryanzhang-oss force-pushed the rollout-feature branch 2 times, most recently from cf5841a to df674aa on July 20, 2023 06:19
binding.Spec.ResourceSnapshotName = latestResourceSnapshotName
binding.Status.LastResourceUpdateTime = metav1.Now()
errs.Go(func() error {
if err := r.Client.Update(cctx, binding); err != nil {
Contributor

We only update the spec here; we need another call to update the status.
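
A minimal sketch of the two calls, assuming the binding CRD has the status subresource enabled (error wrapping and conflict retries omitted):

```go
binding.Spec.ResourceSnapshotName = latestResourceSnapshotName
errs.Go(func() error {
	// Persist the spec change first; this call does not touch the status subresource.
	if err := r.Client.Update(cctx, binding); err != nil {
		return err
	}
	// Persist the status change through the status writer in a second call.
	binding.Status.LastResourceUpdateTime = metav1.Now()
	if err := r.Client.Status().Update(cctx, binding); err != nil {
		return err
	}
	return nil
})
```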

@ryanzhang-oss deleted the rollout-feature branch on February 12, 2024 19:04