-
Notifications
You must be signed in to change notification settings - Fork 1.1k
Provide optional rollback support for HelmReleases #2006
Conversation
a5334b6
to
5babee8
Compare
@squaremo @2opremio although this PR is still a draft, please take a look whenever you have time. The reason it is still is a draft is described in #1960 (comment), but due to Helm not providing us with tools to observe successful roll-outs, I see no easy way to work around the issue described in this comment. (Given the feature is opt-in, maybe we can live with it for a while?) |
@hiddeco Won't the rollback mechanism change based on how #1960 (comment) is resolved? If it can differ, then I think it's better to review it once we have a design.
Uhm, I am not sure. An infinite loop is scary |
2786e0f
to
60946e7
Compare
@2opremio @squaremo took me some time to get the mechanics sorted, but here it is. The operator now keeps track of what generation of a resource it has seen, this is done by updating the In case a rollback status condition is set, there are three options:
I am still on the fence about two scenarios:
|
60946e7
to
b7a003b
Compare
Can't we store the |
We could store a SHA hash of the merged values in a status (someone on Slack suggested something like this too). But to be able to make a comparison and detect a change, this would require us to resolve and merge all values before we schedule an upgrade (in |
fhr.Spec.Rollback.Recreate, fhr.Spec.Rollback.DisableHooks, fhr.Spec.Rollback.Wait) | ||
if err != nil { | ||
chs.logger.Log("warning", "unable to rollback chart release", "resource", fhr.ResourceID().String(), "release", name, "err", err) | ||
chs.setCondition(fhr, fluxv1beta1.HelmReleaseRolledBack, v1.ConditionFalse, ReasonRollbackFailed, err.Error()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe, only maybe, create variables for the reason and message and call chs.setCondition
only once?
Code looks good, but I don't know enough about the Helm Operator to rubber-stamp this. I am missing tests though (which I guess you planned to add later) |
It may be an idea to pick up some of the The operator is not that complex and quite small as a whole, so you should be able to fit it into your head rapidly. (I was able to do this and I am not the fastest person on earth in Golang code bases at the moment). |
b7a003b
to
3c9234c
Compare
Thanks @hiddeco , I will take a look at those once I am done with the |
1b9234b
to
bc2098c
Compare
@2opremio @squaremo it does now detect changes to external values files, how it does this is up for debate. I personally am not very happy with it, i.e. using the |
bc2098c
to
e7e95b7
Compare
e7e95b7
to
9e62648
Compare
9e62648
to
c841ff5
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I could really do with a state machine diagram to see what's going on!
8822296
to
469ef18
Compare
I've been testing this branch and came across a problem with StatefulSets. If an upgrade involves recreating the pods in a StatefulSet then Kubernetes will, as normal, start replacing the pods starting at the last ordinal, e.g.
If that pod fails to come up (e.g. the image is missing) the release is rolled back but the problem pod is left as-is, i.e.
The problem is addressed by including Is there a way to address this? Could the |
@adrian given that StatefulSets are known to be problematic to rollback (as mentioned in the issue linked in this PR), and the fact that we mimic I am however interested in any solution you may come across which solves the problem for Oh, and almost forgot -- awesome you are testing this :-) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
FWIW I've been testing the operator on this branch for the past few days and it's working fine. I've tested,
- upgrades with
rollback.enabled: false
- upgrades with
rollback.enabled: true
androllback.recreate: false
- upgrades with
rollback.enabled: true
androllback.recreate: true
In relation to my question earlier about dealing with failed pods in StatefulSets...
One option we've implemented (not in an operator) is to delete all broken StatefulSet pods after issuing the helm rollback command. Maybe that's something that the operator could do? |
Awesome, going to rebase and tidy things up and bother @squaremo with reviewing again so it lands in
I am not very keen on implementing 'hacks' like this in the operator, the reason for this is that it may work for you but breaks things for other people. Instead, and this is probably not to your likening, I would just make note of that it is likely to give issues with some |
469ef18
to
7b558bc
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code looks fine to me, the design (in the state diagram) looks right, and we have some empirical evidence that it works as designed. Let's release it into the wild!
To be able to calculate a checksum for the values from external sources we need to be able to retrieve the chart path for a HelmRelease outside of the ReconcileReleaseDef method, as the path is required for chart file references.
Before this change the actual rollback status was not consulted in case the status of the chart fetch was successful, which could lead to the wrong state being reported back.
7b558bc
to
d386c35
Compare
@hiddeco this is awesome, thanks a lot for implementing this feature! |
@semyonslepov thanks for your enthusiasm and interest in this feature, I merged a PR yesterday which documents the added fields to the Last, if you do not want to wait on the next Helm operator release, there is a prerelease build available which also adds experimental support for running multiple workers processing your releases ( |
Fixes #1960