
[Autoscaling] Add Elasticsearch autoscaling controller #4173

Merged (21 commits) on Feb 11, 2021

Conversation

@barkbay (Contributor) commented on Jan 29, 2021

This PR introduces the Elasticsearch autoscaling controller.

This new controller manages the autoscaling policies, periodically polls the _autoscaling/capacity API, and automatically adjusts the resources of the node sets.
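
For reference, you can query that API directly to see what the controller consumes. A minimal sketch, using the same ${ES} and ${PASSWORD} placeholders as the Rally command in the Testing section below:

# show required and current capacity per autoscaling policy, as reported by the deciders
curl -sk -u "elastic:${PASSWORD}" "https://${ES}:9200/_autoscaling/capacity?pretty"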

Some parts of the code are not optimized yet; this PR is already large, so I tried to keep the implementation easy to understand.
I also tried to add comments where needed to help the reader/reviewer.

The reconcileInternal function in pkg/controller/autoscaling/elasticsearch/driver.go might be a good starting point for a review.

The following improvements will be added in dedicated PRs:

  • Add a warning in case of overlapping roles.
  • Manage limits using ratios; only resource requests are supported for now.
  • Add a stabilization period.
  • Make the poll period a setting; the current default is 60s.

Testing

This PR relies on the ES autoscaling client and requires activating a trial (or a valid license).
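
For example, one way to enable a trial with ECK is to create an enterprise trial license secret. This is only a sketch: it assumes the operator runs in the default elastic-system namespace, and the secret name is arbitrary (the operator matches on the label and EULA annotation); see the ECK licensing docs for the authoritative steps.

# create a trial license secret the operator can pick up (name is arbitrary)
kubectl create secret generic eck-trial-license -n elastic-system
kubectl label secret eck-trial-license -n elastic-system license.k8s.elastic.co/type=enterprise_trial
kubectl annotate secret eck-trial-license -n elastic-system elastic.co/eula=accepted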

Deciders are documented here.

Data tier

Unfortunately I don't have strong advice for testing the storage deciders.
You can try to use Rally, but note that it will only exercise the reactive decider:

esrally race --pipeline=benchmark-only --track=eventdata --track-repository=eventdata --challenge=index-logs-fixed-daily-volume \
             --track-params="bulk_size:100,number_of_replicas:1,bulk_indexing_clients:1,daily_logging_volume:1GB" \
             --target-hosts=https://${ES}:9200 \
             --client-options="use_ssl:true,verify_certs:false,basic_auth_user:'elastic',basic_auth_password:$PASSWORD"

You can also deploy some agents to create data streams.
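
If you prefer not to deploy agents, you can also create a data stream by hand and index into it repeatedly to generate volume. An illustrative sketch using standard Elasticsearch APIs; the template and data stream names are arbitrary:

# index template enabling a data stream for the (arbitrary) pattern below
curl -sk -u "elastic:${PASSWORD}" -H 'Content-Type: application/json' \
     -XPUT "https://${ES}:9200/_index_template/logs-autoscaling-test" \
     -d '{"index_patterns":["logs-autoscaling-test*"],"data_stream":{}}'

# index documents (repeat/loop to generate enough volume for the storage deciders)
curl -sk -u "elastic:${PASSWORD}" -H 'Content-Type: application/json' \
     -XPOST "https://${ES}:9200/logs-autoscaling-test/_doc" \
     -d '{"@timestamp":"2021-02-10T00:00:00Z","message":"autoscaling test"}'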

ML

You can use the following API call to create an ML job (it requires the flights sample data). Then run POST _ml/data_frame/analytics/mljob/_start to start the job.
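
The original call is not reproduced above; a minimal illustrative equivalent (assuming the kibana_sample_data_flights sample data is loaded, and reusing the mljob name from the start command) would be:

# illustrative outlier detection job on the flights sample data
curl -sk -u "elastic:${PASSWORD}" -H 'Content-Type: application/json' \
     -XPUT "https://${ES}:9200/_ml/data_frame/analytics/mljob" \
     -d '{"source":{"index":"kibana_sample_data_flights"},"dest":{"index":"mljob-dest"},"analysis":{"outlier_detection":{}}}'

# then start it
curl -sk -u "elastic:${PASSWORD}" -XPOST "https://${ES}:9200/_ml/data_frame/analytics/mljob/_start"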

Logs and status

The expected resources for the node sets, along with some useful messages, are printed in the logs and also stored in the elasticsearch.alpha.elastic.co/autoscaling-status annotation. You can use the following command to retrieve it:

kubectl get -n <namespace> elasticsearch.elasticsearch.k8s.elastic.co/mycluster -o jsonpath='{.metadata.annotations.elasticsearch\.alpha\.elastic\.co/autoscaling-status}' | jq
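
To follow the log side, a sketch assuming a default ECK install, where the operator runs as the elastic-operator StatefulSet in the elastic-system namespace:

# tail the operator logs and filter for autoscaling messages
kubectl logs -n elastic-system statefulset/elastic-operator -f | grep -i autoscaling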

@barkbay added the >feature, v1.5.0, and autoscaling labels on Jan 29, 2021
@barkbay force-pushed the autoscaling-pr/controller branch from 172cc8a to 0e99651 on January 29, 2021 at 15:22
@pebrc (Collaborator) left a comment:

I am still working through this, but I thought I'd share a first set of comments. All minor.

Review comments (outdated, resolved) on pkg/controller/autoscaling/elasticsearch/controller.go, pkg/controller/autoscaling/elasticsearch/status/actual.go, and pkg/controller/autoscaling/elasticsearch/driver.go.
@charith-elastic (Contributor) left a comment:

It's a lot to digest and I won't claim to have understood everything. But, overall LGTM 👍🏼

Review comments (outdated, resolved) on pkg/controller/autoscaling/elasticsearch/controller.go and pkg/controller/autoscaling/elasticsearch/driver.go.
@pebrc (Collaborator) left a comment:

I think I am 2/3 through ...

Review comments (outdated, resolved) on pkg/controller/elasticsearch/driver/autoscaling.go, including this snippet:
requiredCPUCapacityAsMilli := cpuRange.Min.MilliValue() + requiredAdditionalCPUCapacity

// Round up CPU to the next full core (1000 millicores)
requiredCPUCapacityAsMilli = roundUp(requiredCPUCapacityAsMilli, 1000)
@pebrc (Collaborator):

Should we allow CPU capacity in smaller increments than one core?

@barkbay (Contributor, author):

I don't have a strong opinion. I have always preferred rounded values because it makes it easier to reason about the performance model (e.g. how do you size thread pools in the JVM with a fractional number of cores?). But I agree that the user might want more accurate resource requests.
@elastic/cloud-k8s I'm curious if there are other opinions.

Reply:

For Elasticsearch or Kibana we would round up CPU capacity to an integer, but that might be too much for lighter components, for example Beats.
Also, if you just want to run the whole ECK stack on a laptop, a less granular option to define resource limits would be preferable, just to enable local testing.

@pebrc (Collaborator) left a comment:

Impressive work! I think I went through everything once, but I haven't tested it as this depends on the API PR, IIUC.

@barkbay (author) commented on Feb 10, 2021

Thanks all for the time you spent on this PR; I know it is not an easy one 🙇

Quick update:

@pebrc (Collaborator) left a comment:

LGTM! I only ran a few tests overnight, and storage autoscaling worked as expected. I had hoped to do some more testing, but I believe we can merge this as is and iterate if necessary.
