app.planned_units does not filter dying units #807
Comments
ops/framework just conveys the planned units stat from Juju, so without digging in deep here (yet) I suspect that's where this will probably need to be addressed.
planned_units was deliberately written to be naive, so that we could piggyback off of bugfixes in Juju core without creating another potentially out-of-sync cache of information about the model. It might make sense to drop dying units from the count, but we would need to be cautious about introducing a new set of error conditions.
@jameinel do you have any thoughts on this? I thought that dying units would drop off of goal state at some point ...
I suggest something as simple as:
But testing seems tricky at first glance, since currently the model tests use the test harness that provides a simulated value for
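As a rough illustration of the suggested fix (this is a hypothetical sketch, not the actual ops implementation — the function name `count_planned_units` and the sample document are my own), the idea is to count only the units in the `goal-state` document whose status is not `dying`:

```python
import json

def count_planned_units(goal_state_doc: dict) -> int:
    """Count units in a goal-state document, excluding units marked 'dying'."""
    units = goal_state_doc.get("units", {})
    return sum(1 for info in units.values() if info.get("status") != "dying")

# Hypothetical goal-state output while scaling an application from 3 units to 1:
doc = json.loads(
    '{"units": {'
    '"app/0": {"status": "active", "since": "2022-07-19"},'
    '"app/1": {"status": "dying", "since": "2022-07-19"},'
    '"app/2": {"status": "dying", "since": "2022-07-19"}}}'
)
```

With that document, `count_planned_units(doc)` reports 1 rather than the raw count of 3, which is what a charm scaling down would want to see.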
I'm pretty sure they do drop out eventually. But they could easily not go
away until they're fully gone, which breaks Dmitrii's assumptions around
how scaling down should work.
…On Tue, Jul 19, 2022 at 5:09 PM Pen Gale ***@***.***> wrote:
@jameinel <https://github.com/jameinel> do you have any thoughts on this?
I thought that dying units would drop off of goal state at some point ...
Just as an exercise, here are some possible scenarios:
In this case, knowing whether a unit is dying helps a lot. We notice that total units minus dying units is zero, which means that the Application, or even the entire Model, is probably being removed. We can skip code that attempts to be cautious about workload integrity, and just get to "gone" as quickly as possible.
Excluding dying units helps here, too. While it's rare that a workload application can be moved from "HA" to "not HA" without tearing it down and replacing it entirely, there are going to be cases where we can manage a clean transition from many to one. And this transition is going to be most successful when we can track dying units.
This is an edge case, but it's the source of a valid bug. Let's say that, for whatever reason, the start and end states of a cloud operation are to have x units of an application running, but those units are not the same units that I started with. If I prune dying units from my count at the wrong time, I might end up seeing a planned unit count of zero, and invoke "teardown" codepaths on workloads that I should be more cautious with. Scenario #3 is one of the sources of my caution here. While the controller is not immune from this edge case, my planned unit count is more likely to be off at the unit level, where the model is more likely to be out of date. (I'm beginning to wish that I hadn't been quite so quick to agree to expose this aspect of goal state as "planned" units in the first place. It does introduce a specific way of reasoning about the underlying data, and I'm not sure that the charm is the right place to locate that reasoning. Hmmmm ...)
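The scenarios above boil down to a small piece of decision logic. As a sketch only (the function name `classify_scale_event` and its return labels are mine, invented for illustration):

```python
def classify_scale_event(total_units: int, dying_units: int) -> str:
    """Classify what a change in unit counts means for the workload."""
    remaining = total_units - dying_units
    if remaining == 0:
        # Scenario 1: the whole application (or model) is going away; skip
        # cautious workload-integrity code and tear down quickly.
        return "full-removal"
    if dying_units > 0:
        # Scenario 2: scaling down (e.g. HA to a single unit); remove the
        # departing members cleanly while tracking the dying units.
        return "scale-down"
    return "steady-state-or-scale-up"
```

Scenario 3 is exactly the case where `dying_units` is sampled at the wrong moment, making `remaining` drop to zero transiently even though replacement units are planned.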
What if we added an option?
Apologies for not following up on this sooner. I got pulled into troubleshooting the dashboard charm, and I've let a lot of my sidequests drop ...
I wrote a library to address this, along with a charm to test some of my assumptions: https://github.com/pengale/charm-brass-tacks/blob/main/lib/charms/brass_tacks/v0/planned_units_plus.py
It turns out that some of my assumptions about teardown were wrong. I've had a hard time coming up with working tests that demonstrate concretely that the lib solves the problem. I will try to loop back to this in the coming weeks. Folks are welcome to grab the code and test the library in the meantime, though!
Q: is this already addressed in #936?
Yes indeed, thanks @taurus-forever. Closing. |
I am writing a charm for a stateful application which has clustering built in. For the complete application removal case, I need to avoid removing members from the actual cluster while handling `*-departed` events on a peer relation, as opposed to regular member removal when the number of units is reduced. The code of interest would be something like this:

The need for that is two-fold: during application removal, the `relation-departed` hook is unable to connect to Pebble to exec a member removal command (see https://paste.ubuntu.com/p/P854xysQkw/). I can use `container.can_connect` to do the check, but besides that I need to understand that the target state is actually 0 units.

The problem with the above `> 0` check is that `app.planned_units` currently does not filter out dying units:

operator/ops/model.py Lines 2346 to 2358 in 4ffc125

so during application removal a unit still sees `app.planned_units == 1` (see below). As a result, the leader unit will think that `planned_units > 0` and will attempt to do member removal, which will not be possible as the container will be gone by then.

It is possible to check whether units in the `goal-state` output are dying (the outputs below show an example of that) and reduce the reported planned unit count by the number of dying units. I suggest this is done in the framework until Juju gets other means of reporting this.
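The decision logic the charm needs in its `*-relation-departed` handler could be sketched as follows. This is a minimal, hypothetical helper under my own assumptions — `should_remove_member` and its parameters are invented names, not part of the ops API; in a real charm, `planned_units` would come from `app.planned_units` and `can_connect` from `container.can_connect()`:

```python
def should_remove_member(planned_units: int, dying_units: int,
                         can_connect: bool) -> bool:
    """Decide whether the leader should remove a departed member
    from the workload's cluster."""
    if planned_units - dying_units == 0:
        # The whole application is being removed: the container is (or soon
        # will be) gone, so a Pebble exec would fail; skip member removal.
        return False
    # Regular scale-down: only attempt removal while the workload
    # container is still reachable.
    return can_connect
```

With `planned_units` not filtering dying units, the first branch is never taken during full application removal, which is exactly the bug described above.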