Rollout main logic #432
Conversation
}
// The binding needs update if it's not pointing to the latest resource snapshot
if binding.Spec.ResourceSnapshotName != latestResourceSnapshotName {
    removeCandidates = append(removeCandidates, &binding)
why is it not an updateCandidate?
sorry for the naming confusion. removeCandidates (to be renamed) contains candidates that, if picked, will cause one instance to become unavailable. Here, if we update the resources, we make the app unavailable on that cluster (in our model, which may not be true in the real case; we are just playing it safe).
}

// SetupWithManager sets up the controller with the Manager.
func (r *Watcher) SetupWithManager(mgr ctrl.Manager) error {
instead of using two watchers + a queue, there is another option: we ask the rollout controller to watch bindings & snapshots and implement the event handler func, similar to https://github.com/Azure/fleet-networking/blob/main/pkg/controllers/multiclusterservice/controller.go#L428. The reconcile request is the CRP object. I feel it's better.
yeah, I thought about that too. The reason I opted for two watchers is that there is some error-handling logic; if we just use the event handler, errors will not be retried and we could end up not handling the event. Another thing I am (at least for now) not sure about is that I currently also filter out some events.
However, I looked at the code again, and it seems that all the failure cases are unexpected. I will give this another try.
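The suggested single-controller approach can be sketched with simplified stand-in types. This is only an illustration of the map-func idea; a real implementation would use handler.EnqueueRequestsFromMapFunc and reconcile.Request from controller-runtime, and the label key and names below are hypothetical:

```go
package main

import "fmt"

// Request is a simplified stand-in for controller-runtime's reconcile.Request;
// the real handler would return []reconcile.Request from a map function passed
// to handler.EnqueueRequestsFromMapFunc.
type Request struct {
	Name string
}

// crpLabelKey is a hypothetical label key linking bindings/snapshots to their CRP.
const crpLabelKey = "kubernetes-fleet.io/parent-CRP"

// requestsForObject maps a watched object's labels to a reconcile request for
// the owning ClusterResourcePlacement, mirroring the suggested event handler.
func requestsForObject(labels map[string]string) []Request {
	crpName, ok := labels[crpLabelKey]
	if !ok {
		// The object is not owned by any CRP; enqueue nothing.
		return nil
	}
	return []Request{{Name: crpName}}
}

func main() {
	fmt.Println(requestsForObject(map[string]string{crpLabelKey: "test-crp"}))
	fmt.Println(requestsForObject(map[string]string{"other": "x"}))
}
```

With this shape, both the binding and snapshot watches funnel into one work queue keyed by CRP, so failed reconciles are retried by the controller's own backoff.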
for i := 0; i < maxReadyNumber-upperBoundReadyNumber; i++ {
    // we don't differentiate if binding needs to be bound to new cluster or its resource snapshot needs to be updated
    // TODO:
    tobeUpgradedBinding = append(tobeUpgradedBinding, updateCandidates[i])
are we going to differentiate between newly bound bindings and snapshot updates in the first phase?
based on the maxSurge definition, we only need to append the newly bound ones.
the update candidates only include selected bindings
lowerBoundAvailable := len(readyBindings) - len(canBeUnavailableBindings)
for i := 0; i < lowerBoundAvailable-minAvailableNumber; i++ {
    tobeRemovedBinding = append(tobeRemovedBinding, removeCandidates[i])
}
if we still have some capacity (lowerBoundAvailable - minAvailableNumber - len(tobeRemovedBinding)), we can update the resourceSnapshot on the bound bindings.
the update ones are already included in removeCandidates
// SetupWithManager sets up the controller with the Manager.
func (r *Watcher) SetupWithManager(mgr ctrl.Manager) error {
    return ctrl.NewControllerManagedBy(mgr).
        For(&fleetv1beta1.ClusterResourceSnapshot{}, builder.WithPredicates(predicate.GenerationChangedPredicate{})).
Hi Ryan! This will lead the controller to ignore metadata changes; if the CRP controller adds a label/annotation, the event will not be picked up.
The resource snapshot spec is immutable (ideally), so this predicate will filter out every change I guess.
yeah, is there any problem with filtering out all metadata/status changes?
    return false
},
// skipping update events as the specs are immutable.
UpdateFunc: func(evt event.UpdateEvent) bool {
Hi Ryan! The spec is surely immutable, but the metadata is still subject to change; we probably shouldn't ignore these events.
which metadata change should we handle?
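One way to split the difference is an update filter that keeps events when either the generation or labels/annotations change, instead of filtering out all metadata changes. This is a minimal sketch using plain structs as stand-ins for the event objects; a real version would be a predicate.Funcs over event.UpdateEvent, and all names here are illustrative:

```go
package main

import (
	"fmt"
	"reflect"
)

// objMeta is a simplified stand-in for the object metadata a controller-runtime
// update predicate inspects.
type objMeta struct {
	Generation  int64
	Labels      map[string]string
	Annotations map[string]string
}

// shouldReconcileUpdate keeps update events when the spec generation changed
// OR when labels/annotations changed, rather than ignoring every metadata-only
// change as GenerationChangedPredicate alone would.
func shouldReconcileUpdate(oldObj, newObj objMeta) bool {
	if oldObj.Generation != newObj.Generation {
		return true
	}
	return !reflect.DeepEqual(oldObj.Labels, newObj.Labels) ||
		!reflect.DeepEqual(oldObj.Annotations, newObj.Annotations)
}

func main() {
	base := objMeta{Generation: 1, Labels: map[string]string{"a": "1"}}
	relabeled := objMeta{Generation: 1, Labels: map[string]string{"a": "2"}}
	fmt.Println(shouldReconcileUpdate(base, base), shouldReconcileUpdate(base, relabeled))
}
```

Status-only updates still get filtered out, while a label added by the CRP controller would be picked up.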
klog.ErrorS(err, "we have encountered a fatal error that can't be retried")
    return ctrl.Result{}, controller.NewUnexpectedBehaviorError(err)
}
klog.V(2).InfoS("Start to rollout the bindings", "memberCluster", crpName)
Hi Ryan! There is a typo here; it should be clusterResourcePlacement rather than memberCluster, I assume?
if crp.Spec.Strategy.RollingUpdate.UnavailablePeriodSeconds != nil {
    unavailablePeriod = time.Duration(*crp.Spec.Strategy.RollingUpdate.UnavailablePeriodSeconds) * time.Second
} else {
    unavailablePeriod = 60 * time.Second
Hi Ryan! It might be better to make this default value a top-level constant IMO.
this is the CRD default value; will etcd/the API server honor the CRD schema default?
if err := r.Client.List(ctx, bindingList, crpLabelMatcher); err != nil {
    klog.ErrorS(err, "Failed to list all the bindings associated with the clusterResourcePlacement",
        "clusterResourcePlacement", klog.KObj(crp))
    return nil, nil, controller.NewAPIServerError(false, err)
Hi Ryan! Just to be certain, is this an uncached client?
no, this is cached.
// Since we can't predict the number of bindings that can be unavailable after they are applied, we don't take them into account
lowerBoundAvailable := len(readyBindings) - len(canBeUnavailableBindings)
for i := 0; i < lowerBoundAvailable-minAvailableNumber; i++ {
    tobeRemovedBinding = append(tobeRemovedBinding, removeCandidates[i])
Hi Ryan! Should we check to avoid an index overflow error here?
good catch, yes
for i := 0; i < maxReadyNumber-upperBoundReadyNumber; i++ {
    // we don't differentiate if binding needs to be bound to new cluster or its resource snapshot needs to be updated
    // TODO:
    tobeUpgradedBinding = append(tobeUpgradedBinding, updateCandidates[i])
An index overflow can also happen here, I am afraid.
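A minimal sketch of the guarded loop: the iteration budget is clamped to the candidate count so the index can never run past the end of the slice. The same guard applies to both the removal and upgrade loops; the names mirror the snippets above, but the function itself is hypothetical:

```go
package main

import "fmt"

// pickUpgradeCandidates selects up to maxReadyNumber-upperBoundReadyNumber
// candidates, clamping the budget to len(updateCandidates) to avoid an
// out-of-range index.
func pickUpgradeCandidates(updateCandidates []string, maxReadyNumber, upperBoundReadyNumber int) []string {
	budget := maxReadyNumber - upperBoundReadyNumber
	if budget > len(updateCandidates) {
		budget = len(updateCandidates)
	}
	var tobeUpgraded []string
	// A negative budget simply means the loop body never runs.
	for i := 0; i < budget; i++ {
		tobeUpgraded = append(tobeUpgraded, updateCandidates[i])
	}
	return tobeUpgraded
}

func main() {
	// Budget (10-7 = 3) exceeds the candidate count (2); without the clamp
	// this loop would panic with an index out of range.
	fmt.Println(pickUpgradeCandidates([]string{"b1", "b2"}, 10, 7))
}
```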
    // TODO:
    tobeUpgradedBinding = append(tobeUpgradedBinding, updateCandidates[i])
}
return tobeRemovedBinding, tobeUpgradedBinding, nil
Hi Ryan! Is there any risk of the workflow getting stuck here?
Say, for example, we have a CRP with the following setting:
- numOfClusters = 10
- maxSurge = 0
- maxUnavailable = 4 (minAvailable = 6)
And suppose currently in the system there are 5 scheduled bindings (0-5 of them are ready), 5 bound bindings, and 7 unscheduled bindings (none of them is ready).
Now, len(canBeReady) = 5 bound + 7 unscheduled = 12, which is > maxReady (10). So no upgrade will happen.
And len(ready) = 5, which is < minAvailable, so no removal will happen either.
In this case, the unscheduled bindings are either deleting or not; either way, their status will change. If they are deleting, they will disappear eventually, so we will have upper bound and lower bound = 5 and will start to flip the "scheduled" bindings.
If they are not deleting, they will become ready eventually, as the last Applied time doesn't change. Then we will have upper bound = 12 and lower bound = 12, in which case we will start to delete some of those unselected.
When there is a mix, for example 3 of them deleting while 4 are just unselected, it will eventually become 4 ready unselected bindings. In this case, the upper bound = 9 and the lower bound = 9; we can delete 3 unselected and move one selected to bound.
I think the bottom line is that non-"deleting" bindings will become ready eventually while "deleting" bindings will disappear eventually, so we will always be able to move forward (if the system behaves). We can get stuck only if some of the agents/clusters are not working, which is unavoidable; a Deployment gets stuck too if some pods never become ready.
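The arithmetic in the scenario above can be checked with a small sketch; all names here are illustrative, and "blocked" means the corresponding loop would run zero iterations:

```go
package main

import "fmt"

// blocked reports whether, at a given instant, the upgrade loop and the
// removal loop would each make zero progress: upgrades are blocked when the
// upper bound (bindings that can become ready) already meets maxReady, and
// removals are blocked when the lower bound (ready minus can-be-unavailable)
// does not exceed minAvailable.
func blocked(ready, canBeReady, canBeUnavailable, maxReady, minAvailable int) (upgradesBlocked, removalsBlocked bool) {
	upperBound := canBeReady
	lowerBound := ready - canBeUnavailable
	return upperBound >= maxReady, lowerBound <= minAvailable
}

func main() {
	// Scenario from the discussion: numOfClusters = 10, maxSurge = 0
	// (maxReady = 10), maxUnavailable = 4 (minAvailable = 6);
	// 5 bound + 7 unscheduled bindings can become ready, 5 are ready.
	up, rm := blocked(5, 5+7, 0, 10, 6)
	fmt.Println("upgrades blocked:", up, "removals blocked:", rm)
}
```

Once the unscheduled bindings resolve to upper bound = lower bound = 9, both conditions clear and the rollout can proceed, matching the argument above.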
binding.Spec.ResourceSnapshotName = latestResourceSnapshotName
binding.Status.LastResourceUpdateTime = metav1.Now()
errs.Go(func() error {
    if err := r.Client.Update(cctx, binding); err != nil {
we only update the spec here; the status change needs a separate call to update the status subresource.
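The spec/status split can be illustrated with a stub client; in the real controller these would be r.Client.Update and r.Client.Status().Update, and everything else below is a stand-in:

```go
package main

import "fmt"

// Binding is a simplified stand-in for the fleet binding object.
type Binding struct {
	Spec   struct{ ResourceSnapshotName string }
	Status struct{ LastResourceUpdateTime string }
}

// statusClient mimics the split Kubernetes API: Update persists only the spec,
// while StatusUpdate persists only the status subresource.
type statusClient struct{ specCalls, statusCalls int }

func (c *statusClient) Update(b *Binding) error       { c.specCalls++; return nil }
func (c *statusClient) StatusUpdate(b *Binding) error { c.statusCalls++; return nil }

func main() {
	c := &statusClient{}
	b := &Binding{}
	b.Spec.ResourceSnapshotName = "snapshot-1"
	// First call: persist the spec change.
	_ = c.Update(b)
	// Second call: persist the status change; without it,
	// LastResourceUpdateTime would be silently dropped.
	b.Status.LastResourceUpdateTime = "now"
	_ = c.StatusUpdate(b)
	fmt.Println(c.specCalls, c.statusCalls)
}
```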
Description of your changes
the main rollout logic
Fixes #
I have:
- run make reviewable to ensure this PR is ready for review.
How has this code been tested
Special notes for your reviewer