app.planned_units does not filter dying units #807

Closed
dshcherb opened this issue Jul 19, 2022 · 10 comments
Labels: bug (Something isn't working), feature (New feature or request)

@dshcherb (Contributor) commented Jul 19, 2022

I am writing a charm for a stateful application that has clustering built in. When the complete application is being removed, I need to avoid removing members from the actual cluster while handling -departed events on the peer relation, in contrast to a regular scale-down, where departing units should be removed from the cluster.

The code of interest would be something like this:

    def _on_departed(self, event):
        if self.model.unit.is_leader() and self.model.app.planned_units() > 0:
            self.remove_member(event.unit.name.replace('/', '-'))
        # Otherwise just do nothing and let the units be removed without removing them from the actual cluster.

The need for that is two-fold:

  1. The cluster may not be in a healthy enough state to process member removals properly, since all of its units are being removed in no particular order; we do not care about that, however, because the whole cluster is going away anyway.
  2. It looks like the actual container where Pebble runs gets removed asynchronously, so the charm code in relation-departed is unable to connect to Pebble to exec a member-removal command (see https://paste.ubuntu.com/p/P854xysQkw/). I can use container.can_connect to guard against that, but besides that I also need to know that the target state is actually 0 units (a rough sketch of the guard I have in mind follows this list).
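
A rough sketch of the guard I have in mind is below; the container name 'etcd' and the remove_member helper are placeholders specific to my charm, and the planned_units check only works as intended if dying units are excluded from the count:

    def _on_departed(self, event):
        # Only the leader coordinates cluster membership changes.
        if not self.model.unit.is_leader():
            return
        container = self.model.unit.get_container('etcd')
        # Skip member removal when the whole application is going away or when
        # Pebble is already unreachable because the pod is being torn down.
        if self.model.app.planned_units() == 0 or not container.can_connect():
            return
        self.remove_member(event.unit.name.replace('/', '-'))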

The problem with the above > 0 check is that app.planned_units currently does not filter out dying units (operator/ops/model.py, lines 2346 to 2358 at 4ffc125):

def planned_units(self) -> int:
    """Count of "planned" units that will run this application.

    Includes the current unit in the count.
    """
    # The goal-state tool will return the information that we need. Goal state as a general
    # concept is being deprecated, however, in favor of approaches such as the one that we use
    # here.
    app_state = self._run('goal-state', return_output=True, use_json=True)
    app_state = typing.cast(Dict[str, List[str]], app_state)
    # Planned units can be zero. We don't need to do error checking here.
    return len(app_state.get('units', []))
Having a single unit that is still alive but dying therefore results in app.planned_units == 1 (see the output below). As a result, the leader unit thinks that planned_units > 0 and attempts a member removal, which is no longer possible because the container is gone by then.

It is possible to check whether units in the goal-state output are dying (the outputs below show an example of that) and reduce the reported planned unit count by the number of dying units. I suggest this is done in the framework until Juju gets other means of reporting this; a sketch of doing the same filtering on the charm side follows the command output below.

$ juju status
Model    Controller          Cloud/Region        Version  SLA          Timestamp
skydive  microk8s-localhost  microk8s/localhost  2.9.32   unsupported  15:06:13Z

App            Version  Status  Scale  Charm          Channel  Rev  Address        Exposed  Message
etcd-operator           error     1/3  etcd-operator             0  10.152.183.11  no       hook failed: "cluster-relation-departed"

Unit              Workload  Agent  Address      Ports  Message
etcd-operator/1*  error     idle   10.1.116.89         hook failed: "cluster-relation-departed"

$ juju show-application etcd-operator
etcd-operator:
  charm: etcd-operator
  series: focal
  constraints:
    arch: amd64
  principal: true
  exposed: false
  remote: false
  life: dying
  endpoint-bindings:
    "": alpha
    cluster: alpha
    db: alpha

$ juju show-unit etcd-operator/1
etcd-operator/1:
  opened-ports: []
  charm: local:focal/etcd-operator-0
  leader: true
  life: dying
  relation-info:
  - relation-id: 0
    endpoint: cluster
    related-endpoint: cluster
    application-data:
      cluster-token: 9320de9e-e661-413a-b452-cae99114e281
      peer-cluster-urls: '{"etcd-operator-0": "https://etcd-operator-0.etcd-operator-endpoints.skydive.svc.cluster.local:2380",
        "etcd-operator-2": "https://etcd-operator-2.etcd-operator-endpoints.skydive.svc.cluster.local:2380",
        "etcd-operator-1": "https://etcd-operator-1.etcd-operator-endpoints.skydive.svc.cluster.local:2380"}'
    local-unit:
      in-scope: true
      data:
        egress-subnets: 10.152.183.11/32
        ingress-address: 10.152.183.11
        peer-cluster-url: https://etcd-operator-1.etcd-operator-endpoints.skydive.svc.cluster.local:2380
        peer-listen-url: http://etcd-operator-1.etcd-operator-endpoints.skydive.svc.cluster.local:2379
        private-address: 10.152.183.11
  provider-id: etcd-operator-1
  address: 10.1.116.89

$ juju run --unit etcd-operator/1 'goal-state'
units:
  etcd-operator/1:
    status: dying
    since: 2022-07-19 15:02:43Z
relations: {}
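
Until the framework filters this, a charm can do the same filtering itself by calling the goal-state hook tool directly, which is what app.planned_units does under the hood. A minimal sketch, assuming it runs in a hook context where the goal-state tool is on the PATH (the function name is illustrative):

    import json
    import subprocess

    def planned_units_excluding_dying() -> int:
        """Count goal-state units, ignoring those whose status is 'dying'."""
        raw = subprocess.check_output(['goal-state', '--format=json'])
        units = json.loads(raw).get('units') or {}
        return sum(1 for details in units.values() if details.get('status') != 'dying')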


@rwcarlsen (Contributor) commented:

ops/framework just conveys the planned units stat from Juju, so without digging in deep here (yet) I suspect that's where this will probably need to be addressed.

@pengale (Contributor) commented Jul 19, 2022

planned_units was deliberately written to be naive, so that we could piggyback off of bugfixes in Juju core, without creating another potentially out of sync cache of information about the model. It might make sense to drop dying units from the count, but we would need to be cautious about introducing a new set of error conditions.

@pengale (Contributor) commented Jul 19, 2022

@jameinel do you have any thoughts on this? I thought that dying units would drop off of goal state at some point ...

@dshcherb (Contributor, Author) commented:

> planned_units was deliberately written to be naive, so that we could piggyback off of bugfixes in Juju core, without creating another potentially out of sync cache of information about the model. It might make sense to drop dying units from the count, but we would need to be cautious about introducing a new set of error conditions.

I suggest something as simple as:

-        # Planned units can be zero. We don't need to do error checking here.
-        return len(app_state.get('units', []))
+        # Some units may be dying in which case they are not planned to be present anymore.
+        # It is also possible to have zero planned units if all of them are gone already.
+        units = app_state.get('units', {})
+        return len([name for name, details in units.items() if details['status'] != 'dying'])

But testing seems tricky at first glance, since the model tests currently use the test Harness, which provides a simulated value for planned_units that does not depend on the concept of unit status ("active", "dying" or otherwise).
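
One way to keep this testable without changing the Harness would be to factor the filtering into a small pure helper and unit-test it against a goal-state-shaped dict directly; a rough sketch (the helper and test names are illustrative, not existing ops APIs):

    from typing import Any, Dict

    def count_non_dying_units(app_state: Dict[str, Any]) -> int:
        """Count goal-state units whose status is not 'dying'."""
        units = app_state.get('units') or {}
        return sum(1 for details in units.values() if details.get('status') != 'dying')

    def test_dying_units_are_excluded():
        app_state = {'units': {
            'etcd-operator/0': {'status': 'active', 'since': '2022-07-19 15:02:43Z'},
            'etcd-operator/1': {'status': 'dying', 'since': '2022-07-19 15:02:43Z'},
        }}
        assert count_non_dying_units(app_state) == 1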

@jameinel (Member) commented Jul 22, 2022 via email

@pengale (Contributor) commented Jul 24, 2022

Just as an exercise, here are some possible scenarios:

  1. A human operator is tearing down the cloud.

In this case, knowing whether a unit is dying helps a lot. We notice that total units minus dying units is zero, which means that the Application or even the entire Model is probably being removed. We can skip code that attempts to be cautious about workload integrity, and just get to "gone" as quickly as possible.

  2. A human operator is scaling from multiple units to one unit.

Excluding dying units helps here, too. While it's rare that a workload application can be moved from "HA" to "not HA" without tearing it down and replacing it entirely, there are going to be cases where we can manage a clean transition from many to one. And this transition is going to be most successful when we can track dying units.

  3. The units in an application are being replaced.

This is an edge case. But it's the source of a valid bug. Let's say that, for whatever reason, the start and end states of a cloud operation are to have x units of an application running, but those units are not the same units that I started with. If I prune dying units from my count at the wrong time, I might end up seeing a planned unit count of zero, and invoke "teardown" codepaths on workloads that I should be more cautious with.

Scenario #3 is one of the sources of my caution here. While the controller is not immune from this edge case, my planned unit count is more likely to be off at the unit level, where the model is more likely to be out of date.

(I'm beginning to wish that I hadn't been quite so quick to agree to expose this aspect of goal state as "planned" units in the first place. It does introduce a specific way of reasoning about the underlying data, and I'm not sure that the charm is the right place to locate that reasoning. Hmmmm ...)

@pengale added the bug and feature labels on Sep 2, 2022
@rwcarlsen (Contributor) commented:

What if we added an option exclude_dying=True? Then we could be sure to not break any existing charms but still give the option to charmers. The real difficulty here is that regardless of what we do, there will be times when this method returns an unexpected result :/
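
For illustration only, a sketch of how that could look against the planned_units implementation quoted earlier in the issue, defaulting to the current behaviour so existing charms are unaffected (not an agreed API):

    def planned_units(self, exclude_dying: bool = False) -> int:
        """Count of "planned" units that will run this application.

        Includes the current unit in the count. With exclude_dying=True,
        units whose goal-state status is 'dying' are not counted.
        """
        app_state = self._run('goal-state', return_output=True, use_json=True)
        units = app_state.get('units') or {}
        if not exclude_dying:
            # Preserve the existing behaviour for current callers.
            return len(units)
        return sum(1 for details in units.values() if details.get('status') != 'dying')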

@pengale (Contributor) commented Sep 9, 2022

Apologies for not following up on this sooner. I got pulled into troubleshooting the dashboard charm, and I've let a lot of my sidequests drop ...

I wrote a library to address this, along with a charm to test some of my assumptions: https://github.com/pengale/charm-brass-tacks/blob/main/lib/charms/brass_tacks/v0/planned_units_plus.py

It turns out that some of my assumptions about teardown were wrong. I've had a hard time coming up with working tests that demonstrate concretely that the lib solves the problem.

I will try to loop back to this in the coming weeks. Folks are welcome to grab the code and test the library in the meantime, though!

@taurus-forever commented:

Q: is this already addressed in #936?

@benhoyt (Collaborator) commented Sep 12, 2023

Yes indeed, thanks @taurus-forever. Closing.

@benhoyt closed this as completed on Sep 12, 2023