
[DPE-5125] Remove peer relation lock #392

Draft · phvalguima wants to merge 2 commits into main
Conversation

@phvalguima (Contributor) commented Aug 8, 2024

This PR separates the lock-related changes from DPE-4575.

The main goal is to ensure we do not unnecessarily request the lock when only a single unit exists.
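As a rough sketch of that intent (the `node_lock` helper and its `acquired` property are illustrative names, not necessarily the charm's real API; `planned_units()` is the standard ops call):

```python
# Minimal sketch, assuming hypothetical helper names on the charm object.
def can_restart(charm) -> bool:
    """Return True if this unit may restart now."""
    # A single planned unit has no peers to coordinate with, so requesting
    # the lock is unnecessary and can only get in the way.
    if charm.app.planned_units() == 1:
        return True
    # Otherwise take the OpenSearch-backed lock (the .charm-node-lock
    # document), with no fallback to the peer relation.
    return charm.node_lock.acquired
```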

Removing the peer lock

There are two situations that could justify the use of the peer lock:

Use Case 1: scaling up from 1 unit while that single unit is powered off

  1. A single opensearch unit (opensearch/0) is working as intended
  2. A restart is triggered on opensearch/0 -> it marks the .charm-node-lock as taken
  3. While opensearch/0 is down, opensearch/{1-X} are added
  4. Given opensearch/0 is the leader and the elected cluster manager, units {1-X} cannot proceed with their own startup (see the sketch after this list)
  5. opensearch/0 comes back and clears its own .charm-node-lock
  6. Units {1-X} can then acquire the lock and start up
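The stall above comes from where the lock lives: it is a document inside OpenSearch itself, so acquiring it needs a live, quorate cluster. A minimal sketch of the acquisition attempt, assuming a plain HTTP call and an illustrative index/document id (not the charm's exact code):

```python
import requests  # assumption: direct HTTP client instead of the charm's own helpers

LOCK_INDEX = ".charm-node-lock"  # lock held as a single document in OpenSearch

def try_acquire_lock(host: str, unit_name: str) -> bool:
    """Try to create the lock document; succeeds only if nobody else holds it."""
    try:
        resp = requests.put(
            f"https://{host}:9200/{LOCK_INDEX}/_create/0",
            json={"unit": unit_name},
            timeout=5,
            verify=False,  # illustrative only; real code should verify TLS
        )
    except requests.ConnectionError:
        # opensearch/0, the only elected manager, is down: the request cannot
        # be served, so units {1-X} keep retrying and never start.
        return False
    # 201 -> lock document created; 409 -> another unit already holds the lock.
    return resp.status_code == 201
```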

Use Case 2: a cluster with X eligible managers, where a subset is already powered down

In this case, we may have two situations: (i) a minimal quorum of nodes is still powered up, so the remaining nodes can pick the .charm-node-lock one by one and the peer relation is not needed; or (ii) nodes have been powered down below the minimal quorum, in which case the cluster is essentially stopped and unresponsive.

For case (ii), recovering units only matters once the minimal quorum of voting seats has been reached again. Before that, the cluster cannot decide on shard allocation. On the other hand, if we have to bring several nodes up at once, we may also end up with a lot of recovery traffic happening at the same time. OpenSearch provides protection against this flooding via the cluster.routing.allocation.node_concurrent_{incoming,outgoing}_recoveries and cluster.routing.allocation.node_initial_primaries_recoveries settings, which limit how many concurrent shard recovery tasks may run at once.
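For reference, those throttles can be adjusted through the cluster settings API; a hedged example follows (the values are illustrative, not a recommendation from this PR):

```python
import requests  # assumption: a direct call; the charm manages settings its own way

# Cap concurrent shard recoveries so that powering several nodes back up at
# once does not flood the cluster with recovery traffic.
settings = {
    "persistent": {
        "cluster.routing.allocation.node_concurrent_incoming_recoveries": 2,
        "cluster.routing.allocation.node_concurrent_outgoing_recoveries": 2,
        "cluster.routing.allocation.node_initial_primaries_recoveries": 4,
    }
}

resp = requests.put(
    "https://localhost:9200/_cluster/settings",
    json=settings,
    timeout=5,
    verify=False,  # illustrative only
)
resp.raise_for_status()
```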

Therefore, case (ii) will eventually become case (i), and units can then start using the charm lock stored in OpenSearch for restarting. This PR does not address how to decide when we are in case (ii) and what to do then; that is left for further investigation.

@phvalguima phvalguima changed the title [DPE-4575] spinoff - Lock changes [DPE-4575] Remove peer relation lock Aug 8, 2024
@phvalguima phvalguima changed the title [DPE-4575] Remove peer relation lock [DPE-5125] Remove peer relation lock Aug 9, 2024