Refactor downscales and add unit tests #1506

Merged: 10 commits into elastic:master on Aug 9, 2019

Conversation

@sebgl (Contributor) commented Aug 7, 2019

This refactors the StatefulSets downscale code into multiple smaller functions that are easier to test. It also adds unit tests for most of those functions, so that the whole downscale codebase is now covered by unit tests.

Relates to #1287.
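
(For illustration, a minimal sketch of the kind of table-driven unit test this split enables. The ssetDownscale struct and test below are simplified stand-ins written for this summary, not the PR's actual types; only the isReplicaDecrease name is borrowed from the reviewed code.)

package downscale_example

import "testing"

// ssetDownscale is a simplified stand-in capturing the initial and target
// replica counts of a StatefulSet downscale.
type ssetDownscale struct {
	initialReplicas int32
	targetReplicas  int32
}

// isReplicaDecrease returns true if the downscale actually removes replicas.
func (d ssetDownscale) isReplicaDecrease() bool {
	return d.targetReplicas < d.initialReplicas
}

func TestIsReplicaDecrease(t *testing.T) {
	tests := []struct {
		name string
		d    ssetDownscale
		want bool
	}{
		{name: "decrease", d: ssetDownscale{initialReplicas: 5, targetReplicas: 3}, want: true},
		{name: "no change", d: ssetDownscale{initialReplicas: 3, targetReplicas: 3}, want: false},
		{name: "increase", d: ssetDownscale{initialReplicas: 3, targetReplicas: 5}, want: false},
	}
	for _, tt := range tests {
		t.Run(tt.name, func(t *testing.T) {
			if got := tt.d.isReplicaDecrease(); got != tt.want {
				t.Errorf("isReplicaDecrease() = %v, want %v", got, tt.want)
			}
		})
	}
}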

@sebgl added the >test (Related to unit/integration/e2e tests) and >refactoring labels on Aug 7, 2019
@sebgl (Contributor, Author) commented Aug 7, 2019

This refactoring does not limit the number of master nodes we can remove at once.
I'll reintroduce it as part of #1281.

Edit: that limit wasn't part of the original code either. I will definitely introduce it with #1281.

@david-kow (Contributor) left a review:

Looks good, a few comments.

case downscale.isReplicaDecrease():
	// adjust the theoretical downscale to one we can safely perform
	performable := calculatePerformableDownscale(ctx, downscale, allLeavingNodes)
	if !performable.isReplicaDecrease() {

Inline review comment on this hunk:

Nice.

@sebgl (Contributor, Author) commented Aug 8, 2019

Jenkins test this please

1 similar comment
@david-kow (Contributor) commented:

Jenkins test this please

@barkbay (Contributor) commented Aug 8, 2019

Not sure if it is related to this PR, but I think I managed to hit a race condition:

I quickly changed the number of expected Pods from 5 -> 3 -> 5.
The Pods have not been deleted:

NAME                                               READY   AGE   CONTAINERS      IMAGES
statefulset.apps/elasticsearch-sample-es-default   5/5     33m   elasticsearch   docker.elastic.co/elasticsearch/elasticsearch:7.2.0
NAME                                    READY   STATUS    RESTARTS   AGE   IP          NODE                                                 NOMINATED NODE   READINESS GATES
pod/elasticsearch-sample-es-default-0   1/1     Running   0          27m   10.60.1.3   gke-michael-dev-cluster-default-pool-e885d7bd-0f29   <none>           <none>
pod/elasticsearch-sample-es-default-1   1/1     Running   0          27m   10.60.0.9   gke-michael-dev-cluster-default-pool-e4e8e5d9-tt13   <none>           <none>
pod/elasticsearch-sample-es-default-2   1/1     Running   0          27m   10.60.2.7   gke-michael-dev-cluster-default-pool-88a37176-xfx3   <none>           <none>
pod/elasticsearch-sample-es-default-3   1/1     Running   0          27m   10.60.0.8   gke-michael-dev-cluster-default-pool-e4e8e5d9-tt13   <none>           <none>
pod/elasticsearch-sample-es-default-4   1/1     Running   0          27m   10.60.1.2   gke-michael-dev-cluster-default-pool-e885d7bd-0f29   <none>           <none>
pod/kibana-sample-kb-d48d76cdd-8s74l    1/1     Running   0          19m   10.60.1.4   gke-michael-dev-cluster-default-pool-e885d7bd-0f29   <none>           <none>

But the last two nodes are still excluded from shard allocation:

{
  "persistent" : { },
  "transient" : {
    "cluster" : {
      "routing" : {
        "allocation" : {
          "exclude" : {
            "_name" : "elasticsearch-sample-es-default-4,elasticsearch-sample-es-default-3"
          }
        }
      }
    }
  }
}

@sebgl (Contributor, Author) commented Aug 8, 2019

@barkbay good catch!
This is because we return early, but should not:

// scheduleDataMigrations requests Elasticsearch to migrate data away from leavingNodes.
func scheduleDataMigrations(esClient esclient.Client, leavingNodes []string) error {
	if len(leavingNodes) == 0 {
		return nil
	}
	log.V(1).Info("Migrating data away from nodes", "nodes", leavingNodes)
	return migration.MigrateData(esClient, leavingNodes)
}

I'll fix this.
Thanks!
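
(For reference, one possible shape of the fix, sketched as an assumption rather than the actual merged change: drop the early return so the exclusion list is also pushed when it is empty, assuming migration.MigrateData called with an empty list resets the allocation exclude filter.)

// Sketch only: always delegate to MigrateData, even with an empty list, so that
// a previously-set allocation exclude filter gets cleared once no nodes are leaving.
func scheduleDataMigrations(esClient esclient.Client, leavingNodes []string) error {
	log.V(1).Info("Updating data migration exclusions", "nodes", leavingNodes)
	return migration.MigrateData(esClient, leavingNodes)
}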

@sebgl (Contributor, Author) commented Aug 8, 2019

The last commit should correctly reset shard allocation excludes, similar to how it was done before. I created #1522 so we avoid doing it at every single reconciliation.

I think there is a potential race where we'd clear allocation excludes before the Pod is completely deleted (if it's still being terminated, for example). I'd like to handle that in another issue: #1523.
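
(For context, a minimal sketch of what resetting the excludes amounts to at the Elasticsearch API level: removing the cluster.routing.allocation.exclude._name transient setting shown in the output above by setting it to null. The operator does this through its own esclient; the standalone snippet below uses a placeholder URL and omits authentication/TLS.)

package main

import (
	"fmt"
	"net/http"
	"strings"
)

func main() {
	// Setting the transient value to null removes the exclude filter from the
	// cluster settings, which is what resetting shard allocation excludes means here.
	body := `{"transient":{"cluster.routing.allocation.exclude._name":null}}`
	req, err := http.NewRequest(http.MethodPut, "http://localhost:9200/_cluster/settings", strings.NewReader(body))
	if err != nil {
		panic(err)
	}
	req.Header.Set("Content-Type", "application/json")
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()
	fmt.Println("cluster settings update:", resp.Status)
}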

@barkbay (Contributor) left a review:

LGTM

@anyasabo (Contributor) commented Aug 9, 2019

LGTM, just one question about the fakes

@david-kow (Contributor) left a review:

LGTM

@sebgl merged commit c34da67 into elastic:master on Aug 9, 2019
Labels: >refactoring, >test (Related to unit/integration/e2e tests)
Participants: 5