Do not cache Elasticsearch settings in annotations #2864

Closed
barkbay opened this issue Apr 11, 2020 · 7 comments · Fixed by #2891
Labels
>bug Something isn't working v1.1.0



barkbay commented Apr 11, 2020

attemptDownscale may delete some nodes which are not excluded.

For example, let's consider the following situation where 2 StatefulSets, each with an initial replicas: 2, are downscaled to replicas: 1:

Initial state

  • sset1:
    • sset1-0
    • sset1-1
  • sset2:
    • sset2-0
    • sset2-1

Iteration 1

  • actualStatefulSets=[sset1,sset2]
  • sset1-1 is scheduled to be deleted:
    • allocationSetter.ExcludeFromShardAllocation(ctx, ["sset1-1"]) is successful
    • updateAllocationExcludeAnnotation(es, ["sset1-1"]) is successful
  • sset1-1 is not deleted yet because it is holding some data

Iteration 2

  • actualStatefulSets=[sset2,sset1] - note the different order: there is no guarantee regarding the order of the StatefulSets retrieved from the K8s API
  • sset2-1 is scheduled to be deleted:
    • allocationSetter.ExcludeFromShardAllocation(ctx, ["sset2-1"]) is successful
    • updateAllocationExcludeAnnotation(es, ["sset2-1"]) fails (e.g.: Operation cannot be fulfilled on elasticsearches.elasticsearch.k8s.elastic.co "test-mutation-2nd-master-set-dtvx": the object has been modified;...)
  • at this point it is possible that sset1-1 no longer holds any data

At the end of this iteration there is an inconsistency between the excluded nodes in ES and the annotation set on the Elasticsearch object: sset1-1 is no longer excluded even though the annotation on the Elasticsearch object says the opposite 💣 💣💣

Iteration 3

  • actualStatefulSets=[sset1,sset2] - note that the order is now the same as in iteration 1
  • sset1-1 is scheduled to be deleted:
    • allocationSetter.ExcludeFromShardAllocation(ctx, ["sset1-1"]) is NOT called because of the stale annotation:
	if exclusions == allocationExcludeFromAnnotation(es) {
		return nil
	}
  • calculatePerformableDownscale does not detect any shards on sset1-1 because it has been excluded in the past.
  • sset1-1 is deleted ... but since it is not excluded anymore there is no guarantee that we have not deleted some data with it 💥

Could be the root cause of #2788 in which 2 StatefulSets are downscaled/deleted.


barkbay commented Apr 14, 2020

The pattern outlined above may actually happen as soon as an annotation is used to store a copy of the Elasticsearch state:

Iteration n

  • value1 should be used in an API call
  • Call Es API with value1
  • Save value1 in an annotation

Iteration n+1

  • A new value, value2, should be used in an API call
  • Call Es API with value2
  • Annotation can't be updated with the new value

Iteration n+2

  • value1 from previous iteration n should be used again in an API call
  • value1 is in the annotation: API is not called
  • value2 is actually still in use

There's also another case where we should clean up and not rely on these annotations: when the API call to Elasticsearch fails, in case of a timeout for instance (maybe the call succeeded ... maybe not)

This may happen when ECK deals with:

  • Allocation settings
  • Voting config exclusions
  • Zen1 minimum master nodes
  • Remote clusters

I'm not sure there's an easy way to deal with these "inconsistencies". We could stop using annotations and cache these values in memory.
The drawback is that the state would be lost on restart, which would generate some API calls to initialize the cache during ECK startup. I think it might be acceptable now that we are using concurrent reconciliations.

Happy to implement a PR to not use annotations anymore, let me know your thoughts.


barkbay commented Apr 14, 2020

Also, I think we should sort the list of StatefulSets returned by RetrieveActualStatefulSets. It would not solve the problem, but it would give us stable downscales and a better understanding of the global algorithm.


barkbay commented Apr 15, 2020

We have decided to remove any kind of cache between ECK and the Elasticsearch API.
For most of the API calls it is just about doing an HTTP PUT which will replace the existing settings.
It should be noted that this is not the case for remote clusters: remote clusters are managed with a map, and deleting a remote cluster requires explicitly setting its map entry (the seeds) to nil.

This means that in the case of the remote clusters we have to first retrieve the list of the existing remote clusters to know if some of them have to be deleted. But a side effect is that ECK will delete any unknown remote cluster.

@barkbay barkbay changed the title Non excluded nodes may be deleted, leading to data loss Do not cache Elasticsearch settings in annotations Apr 15, 2020

pebrc commented Apr 15, 2020

But a side effect is that ECK will delete any unknown remote cluster.

That seems very undesirable, especially given that for remote cluster setups that connect to a cluster outside of K8s, manual configuration is the only option, and we would destroy such setups.

A few things come to mind:

  • keeping the current annotation system for remote clusters (can we maybe mitigate the bug by setting the annotation first and always comparing with the current settings?)
  • using a naming convention for the remote cluster settings we introduce, -eck or similar


barkbay commented Apr 15, 2020

If the annotation is set first, there is no guarantee that the clusters will be removed.
I would be 👍 to prefix the name with something like eck-. (even if I don't like hiding some information in a name)


pebrc commented Apr 15, 2020

The disadvantage of the prefix/suffix is that the remote cluster name would not be the one defined by the user in the spec.

I was thinking of using the annotation as a history of remote clusters ECK has created. Something like:

| Spec | Annotation | ES |
|------|------------|----|
| A | | (none because we errored out) |
| A | A | A |
| B | A | A (error on annotation update) |
| B | A,B | A (error on ES request) |
| B | A,B | B (success on subsequent iteration) |
| B | B | B (clean up history on next iteration) |

On each reconciliation we would GET _cluster/settings, compare, and use the annotation only as a history mechanism to know which remote clusters were created by us at some point.


barkbay commented Apr 15, 2020

If there's no error, the workflow of the annotation is [A] -> [A,B] -> [B], which means a last reconcile loop is needed to make the annotation consistent (this is the "clean up history" step in your proposal).

However it seems to solve our problem and sounds cleaner than the naming convention. I'll work on a draft.
