-
Notifications
You must be signed in to change notification settings - Fork 177
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: autorecover from stuck situations #354
base: main
Are you sure you want to change the base?
Conversation
Hey, I just want to say that you are awesome for driving this HIP! 💪 I think there is a dire need for this. I've seen many people loose trust in Helm because of things getting stuck irrecoverably, it would be so nice to redeem Helm. Also, this will be of great value for all the tools that depend on Helm (like Rancher). I'll read your document and put down some comments. On a first read, it's just formatting things or typos I'm guessing. I might contribute something better once I've digested your suggestion. Tell me if I can help! |
I think it is really clever reusing the |
After implementation, the --timout parameter will be stored in the helm release object (secret) in k8s and | ||
have an indirect impact on possible parallel processes. | ||
|
||
`helm ls -a` shows two new columns, regular `helm ls` does NOT show those: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe that would not be fully backwards compatible. Other programs that depend on Helm may parse the output from helm ls -a
and expect the number of columns to be unchanged.
Not sure how bad that is, though. Every change is a breaking change to someone, and I think this seems like a reasonable change. And parsing the default (table) output is not robust anyway, if they do parse the output they should probably use -o json
or -o yaml
instead.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am very open to change the behaviour regarding -a
, since -a
traditionally means all releases (in all states), rather than the colums displayed.
Maybe rather tie it to the existing flag --debug
or add a new flag? I actually would prefer that, maybe call it --concurrency
? It's really just that most users will never need to see it...
@lindhe : What do you think?
Sidenote: There is also the --short
flag, but that false per default, so we cannot use that and having a long flag could yield to funny situations, e.g. specifying short and long....
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm impartial to changing it or not. If this feature can get added with the modification to -a
, I think that's a good change. But if that causes any roadblocks I do not think it's worth it.
I think adding a new flag sounds a bit convoluted. If it cannot be part of -a
by default , then it might make sense to tie into --debug
like you suggest. But I would like to get some input from Helm maintainers before going too far with the specifics of that implementation. If they agree that it is OK to introduce a new column by default, that's probably the cleanest solution.
I think this HIP has an additional benefit that should probably be mentioned in the document: it should make multi-user scenarios more robust in general, not just fix the situations where Helm gets stuck. What happens today if two clients run similar commands at the same time? If there is no mechanism to handle that, I believe this HIP would be a great addition to handle those scenarios gracefully. |
I think you are on the right track with your implementation, trying to keep this lock inside the Helm release object. That way, we do not introduce new objects to the Helm ecosystem, which is a great thing. That said, I just want to mention that there is a Lease object in Kubernetes. If we were looking for alternatives, that might be a sensible thing to use. |
There is already a mechanism to detect concurrency, since a certain helm3 version. It is based on the "pending" states in the release object. While implemented with good intentions this has the consequence that one cannot recover anymore in bad cases. So I would not mention this in addition. Nevertheless, if you feel that additional benefits could be mentioned, feel free suggest an edit again, I will happily merge it if it increases the chances of the HIP. One more thing to discuss that I realized during the implementation: If I would now introduce using the k8s server time it could yield to weird timestamps deviations between other helm timestamp fields vs the ones used for locking. While I am still convinced that server time is better, this would make most sense only if this is refactored within the entire codebase, which I guess is out of scope and would require an independent HIP, what do you think? |
kubernetes has distributed locking primitives I think? Couldn't one of those be used? |
@kfox1111 I assume you mean the Lease object as suggest by @lindhe. Yes, that could be used, it is just that the release state is then distributed:
Since I consider the locked-state as an attribute of the release state, the question is whether is is worth to store it somewhere else... Nevertheless, it could increase the backwards compatibility if we see it as something that is honored if it is there and else not. |
Sounds like a bummer. I agree 1000% that server time is superior. I would go so far as to say that it is the only sensible thing to use. But I also agree that it probably belongs to another PR (unfortunately). |
original HIP draft - storing lock in k8s secret
possible change of storing the lock as native k8s lockIf agreement is reached that such a solution is better overall, I can invest the effort of rewriting the HIP and the draft code. benefits
drawbacks
Based on this, @kfox1111 do you still think the k8s lock (Lease) should be used? I think it might be ok because I estimate that most people will run the helm client with a user with namespace admin permissions. So I'm only asking for a final ok from a maintainer for the Lease solution before starting further work, I personally think the benefits of the Lease slightly outweigh the alternative solution. |
Signed-off-by: Gernot Feichter <gernot.feichter@bearingpoint.com>
Co-authored-by: Andreas Lindhé <7773090+lindhe@users.noreply.github.com> Signed-off-by: gerrnot <gernot.feichter@bearingpoint.com>
Co-authored-by: Andreas Lindhé <7773090+lindhe@users.noreply.github.com> Signed-off-by: gerrnot <gernot.feichter@bearingpoint.com>
Co-authored-by: Andreas Lindhé <7773090+lindhe@users.noreply.github.com> Signed-off-by: gerrnot <gernot.feichter@bearingpoint.com>
Signed-off-by: Gernot Feichter <gernot.feichter@bearingpoint.com>
bbaa513
to
7b94ce8
Compare
Since on the helm mailing list there was no veto and only one positive feedback from a ranger representative (see references) among the many likes in various related bug threads, I am submitting this new HIP...