Do not cache Elasticsearch settings in annotations #2864

Closed
barkbay opened this issue Apr 11, 2020 · 7 comments · Fixed by #2891
Labels
>bug Something isn't working v1.1.0



barkbay commented Apr 11, 2020

attemptDownscale may delete some nodes which are not excluded.

For example, let's consider the following situation where 2 StatefulSets, each with an initial replicas: 2, are downscaled to replicas: 1:

Initial state

  • sset1:
    • sset1-0
    • sset1-1
  • sset2:
    • sset2-0
    • sset2-1

Iteration 1

  • actualStatefulSets=[sset1,sset2]
  • sset1-1 is scheduled to be deleted:
    • allocationSetter.ExcludeFromShardAllocation(ctx, ["sset1-1"]) is successful
    • updateAllocationExcludeAnnotation(es, ["sset1-1"]) is successful
  • sset1-1 is not deleted yet because it is holding some data

Iteration 2

  • actualStatefulSets=[sset2,sset1] - note the different order: there is no guarantee regarding the order of the StatefulSets retrieved from the K8s API
  • sset2-1 is scheduled to be deleted:
    • allocationSetter.ExcludeFromShardAllocation(ctx, ["sset2-1"]) is successful
    • updateAllocationExcludeAnnotation(es, ["sset2-1"]) fails (e.g.: Operation cannot be fulfilled on elasticsearches.elasticsearch.k8s.elastic.co "test-mutation-2nd-master-set-dtvx": the object has been modified;...)
  • at this point it is possible that sset1-1 no longer holds any data

At the end of this iteration there is an inconsistency between the excluded nodes in ES and the annotation set on the Elasticsearch object: sset1-1 is no longer excluded even though the annotation on the Elasticsearch object says the opposite 💣 💣💣

Iteration 3

  • actualStatefulSets=[sset1,sset2] - note that the order is now the same as in iteration 1
  • sset1-1 is scheduled to be deleted:
    • allocationSetter.ExcludeFromShardAllocation(ctx, ["sset1-1"]) is NOT called because of the stale annotation:
	if exclusions == allocationExcludeFromAnnotation(es) {
		return nil
	}
  • calculatePerformableDownscale does not detect any shards on sset1-1 because it has been excluded in the past.
  • sset1-1 is deleted ... but since it is not excluded anymore there is no guarantee that we have not deleted some data with it 💥

Could be the root cause of #2788 in which 2 StatefulSets are downscaled/deleted.


barkbay commented Apr 14, 2020

The pattern outlined above may actually happen as soon as an annotation is used to store a copy of the Elasticsearch state:

Iteration n

  • value1 should be used in an API call
  • Call Es API with value1
  • Save value1 in an annotation

Iteration n+1

  • A new value, value2, should be used in an API call
  • Call Es API with value2
  • Annotation can't be updated with the new value

Iteration n+2

  • value1 from previous iteration n should be used again in an API call
  • value1 is in the annotation: API is not called
  • value2 is actually still in use

There's also another case where we should clean up and not rely on these annotations: when the API call to Elasticsearch fails, in case of a timeout for instance (maybe the call succeeded ... maybe not)

This may happen when ECK deals with:

  • Allocation settings
  • Voting config exclusions
  • Zen1 minimum master nodes
  • Remote clusters

I'm not sure there's an easy way to deal with these "inconsistencies". We could stop using annotations and cache these values in memory.
The drawback is that the state would be lost on restart, which would generate some API calls to initialize the cache during ECK startup. I think it might be acceptable now that we are using concurrent reconciliations.

Happy to implement a PR to not use annotations anymore, let me know your thoughts.


barkbay commented Apr 14, 2020

Also, I think we should sort the list of StatefulSets returned by RetrieveActualStatefulSets. It would not solve the problem, but it would give us stable downscales and a better understanding of the global algorithm.


barkbay commented Apr 15, 2020

We have decided to remove any kind of cache between ECK and the Elasticsearch API.
For most of the API calls it is just about doing an HTTP PUT which will replace the existing settings.
It should be noted that this is not the case for remote clusters: remote clusters are managed with a map, and deleting a remote cluster requires explicitly setting its map entry (the seeds) to nil.

This means that in the case of the remote clusters we have to first retrieve the list of the existing remote clusters to know if some of them have to be deleted. But a side effect is that ECK will delete any unknown remote cluster.

@barkbay barkbay changed the title Non excluded nodes may be deleted, leading to data loss Do not cache Elasticsearch settings in annotations Apr 15, 2020

pebrc commented Apr 15, 2020

But a side effect is that ECK will delete any unknown remote cluster.

That seems very undesirable, especially given that for remote cluster setups that connect to a cluster outside of K8s, manual configuration is the only option, and we would destroy such setups.

A few things come to mind:

  • keeping the current annotation system for remote clusters (can we maybe mitigate the bug by setting the annotation first and always comparing with the current settings?)
  • using a naming convention for the remote cluster settings we introduce, -eck or similar


barkbay commented Apr 15, 2020

If the annotation is set first, there is no guarantee that the clusters will be removed.
I would be 👍 to prefix the name with something like eck-. (even if I don't like hiding some information in a name)


pebrc commented Apr 15, 2020

The disadvantage of the prefix/suffix is that the remote cluster name would not be the one defined by the user in the spec.

I was thinking of using the annotation as a history of remote clusters ECK has created. Something like:

| Spec | Annotation | ES |
|------|------------|----|
| A | | (none because we errored out) |
| A | A | A |
| B | A | A (error on annotation update) |
| B | A,B | A (error on ES request) |
| B | A,B | B (success on subsequent iteration) |
| B | B | B (clean up history on next iteration) |

On each reconciliation we would GET _cluster/settings, compare, and use the annotation only as a history mechanism to know which remote clusters were created by us at some point.


barkbay commented Apr 15, 2020

If there's no error, the workflow of the annotation is [A] -> [A,B] -> [B], which means a last reconcile loop is needed to make the annotation consistent (this is the "clean up history" step in your proposal).

However it seems to solve our problem and sounds cleaner than the naming convention. I'll work on a draft.
