[DPE-5125] Remove peer relation lock #392
This PR separates the lock-related changes from DPE-4575.
The main goal is to ensure we are not unnecessarily requesting the lock for a single unit.
Removing the peer lock
There are two situations that could justify the use of the peer lock:
Use Case 1: scaling up from 1 unit while that single unit is powered off
Use Case 2: a cluster with X eligible managers, where a subset is already powered down
In Use Case 2, there are two sub-cases: (i) a minimal quorum of nodes is still powered up, so the remaining nodes can pick up the .charm-node-lock one by one and the peer relation is not needed here; or (ii) nodes were powered down below the minimal quorum, in which case the cluster is essentially stopped and unresponsive.
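For illustration only, here is a minimal sketch of how a unit could take the .charm-node-lock one by one through a create-only document write. The index name comes from this PR's description; the host, document id, and helper names are assumptions, not the charm's actual implementation.

```python
import requests

LOCK_INDEX = ".charm-node-lock"  # index name referenced above; everything else is assumed


def try_acquire_lock(host: str, unit_name: str) -> bool:
    """Attempt to create the lock document; only one unit can hold it at a time."""
    # op_type=create makes the write fail with 409 if the document already
    # exists, so whichever unit creates it first effectively holds the lock.
    resp = requests.put(
        f"{host}/{LOCK_INDEX}/_doc/1",
        params={"op_type": "create"},
        json={"holder": unit_name},
    )
    return resp.status_code == 201


def release_lock(host: str) -> None:
    """Delete the lock document so the next unit can take its turn."""
    requests.delete(f"{host}/{LOCK_INDEX}/_doc/1")
```

Units that fail to create the document simply retry later, which is what lets a powered-up quorum restart nodes one by one without involving the peer relation.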
For case (ii), recovering units will only matter once the minimal quorum of voting seats has been achieved again. Before that, the cluster will not be able to decide shard allocation. On the other hand, if we have to bring several nodes up at once, we may also end up with a lot of recovery traffic happening at once. OpenSearch does provide protection against this flooding, via the
cluster.routing.allocation.node_concurrent_{incoming,outgoing}_recoveries or node_initial_primaries_recoveries
settings. These set a limit on how many concurrent shard recovery tasks may happen at once. Therefore, a case of (ii) will eventually become a case of (i), and can then start to use the charm lock in OpenSearch for restarting. This PR will not address how to decide when we are in case (ii) and what to do then; that is going to be the subject of further investigation.
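As a hedged illustration of those limits (not part of this PR), the snippet below raises them through the cluster settings API; the endpoint, credentials, and chosen values are placeholders, not recommendations.

```python
import requests

# Hypothetical example: raise the concurrent-recovery limits mentioned above
# via the _cluster/settings API. Host, TLS handling, and the values are
# placeholders for illustration only.
settings = {
    "persistent": {
        "cluster.routing.allocation.node_concurrent_incoming_recoveries": 4,
        "cluster.routing.allocation.node_concurrent_outgoing_recoveries": 4,
        "cluster.routing.allocation.node_initial_primaries_recoveries": 8,
    }
}

resp = requests.put(
    "https://localhost:9200/_cluster/settings",
    json=settings,
    verify=False,  # placeholder; a real deployment would verify TLS and authenticate
)
resp.raise_for_status()
print(resp.json())
```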