
Cluster can be downscaled while shards are not migrated #3867

Closed
barkbay opened this issue Oct 22, 2020 · 3 comments · Fixed by #3883
Labels
>bug Something isn't working v1.3.0

Comments

@barkbay (Contributor) commented Oct 22, 2020

When a Pod is deleted/restarting:

  • the shards on the node enter the UNASSIGNED state
  • the node field is set to an empty string
{
		"index": "logs-191998",
		"shard": "4",
		"state": "UNASSIGNED",
		"node": "",
		"prirep": "p"
}

The empty string prevents ECK from detecting that shards might actually still be assigned to a node that is being downscaled (because shard.NodeName is empty):

// NodeHasShard returns true if the given ES Pod is holding at least one shard (primary or replica).
func NodeHasShard(ctx context.Context, shardLister esclient.ShardLister, podName string) (bool, error) {
	[...]
	// filter shards affected by node removal
	for _, shard := range shards {
		if shard.NodeName == podName { // <--- Here
			return true, nil
		}
	}
	return false, nil
}
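
To make the failure mode concrete, here is a minimal, self-contained sketch (not ECK code: the Shard type and the Pod name below are illustrative stand-ins) showing how the name-based comparison loses track of an UNASSIGNED shard:

package main

import "fmt"

// Shard mirrors the fields used in the snippet above; the real type lives in
// ECK's esclient package, so this is only an illustrative stand-in.
type Shard struct {
	Index    string
	Shard    string
	State    string
	NodeName string
}

// nodeHasShard reproduces the name-based comparison from NodeHasShard above.
func nodeHasShard(shards []Shard, podName string) bool {
	for _, shard := range shards {
		if shard.NodeName == podName {
			return true
		}
	}
	return false
}

func main() {
	// Shard data taken from the first example above: the Pod restarted, so the
	// shard is UNASSIGNED and its node field is empty.
	shards := []Shard{
		{Index: "logs-191998", Shard: "4", State: "UNASSIGNED", NodeName: ""},
	}

	// "elasticsearch-sample-es-data-1" is a made-up Pod name. The check prints
	// false, so ECK would conclude the node holds no data and may delete it.
	fmt.Println(nodeHasShard(shards, "elasticsearch-sample-es-data-1"))
}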

If Pods are deleted (by the controller itself while upgrading as described in #3861) or evicted (K8S node maintenance), ECK might then downscale and delete a set of nodes while they are still hosting some shards.

Note that it is not possible to know which node name a shard was assigned to; only the node's UID is stored in the shard metadata:

{
	"shard": "0",
	"prirep": "p",
	"state": "UNASSIGNED",
	"unassigned.reason": "NODE_LEFT",
	"node": null,
	"id": null,
	"ud": "node_left [fMeKGb_PSi-DIGR_qPTgQQ]"
}
@barkbay barkbay added the >bug Something isn't working label Oct 22, 2020
@barkbay barkbay changed the title Cluster might be downscaled while shards are not migrated Cluster can be downscaled while shards are not migrated Oct 22, 2020
@anyasabo (Contributor) commented:

Is the right fix for this to add another predicate to not scale down if there are unassigned shards? It seems like it might be too conservative, but I'm not sure. Maybe it just makes sense to filter for unassigned.reason == NODE_LEFT? But I don't know if there are other states where a shard is unassigned and removing a node might cause data loss.

@sebgl (Contributor) commented Oct 26, 2020

I think I'm +1 with your suggestion @anyasabo: don't downscale if there is any UNASSIGNED shard.

To be less conservative we could filter on NODE_LEFT, keep a node name -> node ID mapping in memory, and parse the node ID from the unassigned details, but I feel like that's not needed. It also feels a bit unreliable since there's no schema guarantee on that API.
I cannot see a valid case where users would want to keep some shards unassigned with NODE_LEFT. I think it could basically happen for two reasons:

  • the node really did leave the cluster, on purpose. In that case it has gone through our downscale mechanism, so it should not have any shards anymore. In other words: this case should normally never happen?
  • the node is temporarily down for some reason, for example because it has been restarted/upgraded or because it failed. In that case I think the way we move forward in other parts of the reconciliation has always been: the user needs to fix the manifest first so the node comes back into the cluster, then we can move on with the regular node reconciliation.

I am not sure about other unassigned reasons. I suspect for some of them (eg. REINITIALIZED) our predicate on the .node field already ensures we don't remove the wrong node, but I'm not sure.

I think it's best for us to stay on the safe side and not downscale if there's any unassigned shard. Basically: fix your cluster first (through an upscale, an upgrade, or index API calls, e.g. changing the replica count), then we can safely remove some nodes. Similar to how we only move on with a rolling upgrade if the cluster health is green.

We can still optimize later if we find valid reasons to have unassigned shards?
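
A rough sketch of that conservative predicate, reusing the illustrative Shard type from the sketch above (the function name is hypothetical, not ECK's actual API):

// hasUnassignedShards reports whether any shard is UNASSIGNED, whatever the
// reason. If it returns true, the downscale would simply be postponed.
func hasUnassignedShards(shards []Shard) bool {
	for _, shard := range shards {
		if shard.State == "UNASSIGNED" {
			return true
		}
	}
	return false
}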

@sebgl (Contributor) commented Oct 28, 2020

#3883 takes a slightly simpler approach to detect the above:

  • don't allow any downscale if some shards have no node assigned (.node == "")
  • don't allow downscaling a node if some shards are assigned to it (.node == <node>)

I think this works regardless of the state or unassigned.reason.
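
Roughly, those two checks could look like this (again using the illustrative Shard type from the first sketch; the function names are hypothetical and not the ones used in the PR):

// anyShardWithoutNode blocks every downscale while at least one shard reports
// no node at all (.node == "").
func anyShardWithoutNode(shards []Shard) bool {
	for _, shard := range shards {
		if shard.NodeName == "" {
			return true
		}
	}
	return false
}

// nodeStillHoldsShards blocks the downscale of a specific node while any shard
// is still explicitly assigned to it (.node == <node>).
func nodeStillHoldsShards(shards []Shard, nodeName string) bool {
	for _, shard := range shards {
		if shard.NodeName == nodeName {
			return true
		}
	}
	return false
}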
