
Scale testing ECK #357

Closed
nkvoll opened this issue Feb 7, 2019 · 6 comments
Labels: >test (Related to unit/integration/e2e tests), v1.0.0

Comments

nkvoll (Member) commented Feb 7, 2019

Things we're looking to answer:

  • How many clusters can the operator+k8s handle?
  • What are the main bottlenecks, and how can we work past them?
  • Speed/performance/availability/stability of PVs
  • Network performance
pebrc added this to the Beta milestone Feb 8, 2019
pebrc removed this from the Beta milestone May 10, 2019
sebgl added the >test (Related to unit/integration/e2e tests) label Jul 17, 2019
pebrc (Collaborator) commented Oct 18, 2019

While ideally the tests described here would be fully automated and repeatable, we can run a first iteration of scale testing for the 1.0 release that does not attempt to automate everything right away, if that helps reduce the effort involved.

otrosien commented Nov 14, 2019

For big clusters you typically start running into timeouts on the API server for operations like trying to list all pods. (We've run into this in our own implementation.)
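One common mitigation is to page through large list calls rather than fetching everything in one request. A rough client-go sketch (the kubeconfig path and page size are illustrative, and the List signature shown is the pre-context-argument variant from client-go releases of this era; newer releases also take a context.Context):

package main

import (
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Build a client from the local kubeconfig (path is illustrative).
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	clientset, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		panic(err)
	}

	// Page through all pods 500 at a time instead of listing them in one call,
	// which is the kind of request that tends to time out on very large clusters.
	total := 0
	opts := metav1.ListOptions{Limit: 500}
	for {
		pods, err := clientset.CoreV1().Pods("").List(opts)
		if err != nil {
			panic(err)
		}
		total += len(pods.Items)
		if pods.Continue == "" {
			break
		}
		opts.Continue = pods.Continue
	}
	fmt.Println("total pods:", total)
}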

pebrc changed the title from "Scale testing high level meta issue" to "Scale testing ECK" Nov 19, 2019
pebrc added the v1.0.0 label Nov 20, 2019
charith-elastic (Contributor) commented:

As this is not a typical load-testing use case, I was not aware of any pre-existing tools or techniques for testing a Kubernetes operator, so I had to run some experiments to understand the nature of the problem and the associated issues.

Unsurprisingly, Kubernetes itself is the main factor in any attempt to understand the behaviour of the operator under load. As the operator delegates most of its work to built-in Kubernetes controllers such as the StatefulSet controller, it is effectively bound by the capacity of those systems. Another limiting factor is the quotas imposed by the cloud provider, which prevent the creation of resources such as persistent volumes beyond a certain limit.

During experimental testing, resource usage of the operator appeared to be roughly proportional to the number of resources being managed, regardless of cluster size (there was no significant difference between single-node and 30-node clusters). The main bottleneck seemed to be the number of secrets created, which appears to trigger a volume mounting error in the kubelet: MountVolume.SetUp failed for volume "elastic-internal-http-certificates" : couldn't propagate object cache: timed out waiting for the condition (kubernetes/kubernetes#70044). No significant load on the API server was observed apart from an increase in goroutine count. Response times for API calls remained fairly constant throughout.

Given these initial findings, it appears that the behaviour of the operator depends mainly on external factors rather than intrinsic ones. The question, then, is what we want to achieve from the scale testing effort. Do we want to establish a baseline such as "the operator requires X MiB of RAM for each managed resource", or make a sweeping statement like "the operator can manage N Elasticsearch clusters"? The former is measurable and fairly environment-agnostic, while the latter is subjective and depends on many external factors (Kubernetes version, available resources and their saturation, cluster topology, overhead from service meshes and network overlays, other operators and applications running in the cluster, etc.).
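If we go for the baseline-style measurement, one way it could be computed is sketched below: scrape the operator's Prometheus metrics endpoint and divide resident memory by the number of managed resources. The endpoint address, the assumption that process_resident_memory_bytes is exposed, and the hard-coded resource count are all illustrative.

package main

import (
	"fmt"
	"net/http"

	"github.com/prometheus/common/expfmt"
)

func main() {
	// Hypothetical address of the operator's Prometheus metrics endpoint.
	const metricsURL = "http://localhost:8080/metrics"
	// Number of managed resources (e.g. Elasticsearch + Kibana objects),
	// supplied by hand or from a separate API query.
	const managedResources = 436.0

	resp, err := http.Get(metricsURL)
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	// Parse the Prometheus text exposition format.
	var parser expfmt.TextParser
	families, err := parser.TextToMetricFamilies(resp.Body)
	if err != nil {
		panic(err)
	}

	// process_resident_memory_bytes is the standard Go process collector metric;
	// assumed here to be exposed by the operator's metrics registry.
	mf, ok := families["process_resident_memory_bytes"]
	if !ok || len(mf.GetMetric()) == 0 {
		panic("metric not found")
	}
	rssBytes := mf.GetMetric()[0].GetGauge().GetValue()
	fmt.Printf("~%.2f MiB of RAM per managed resource\n", rssBytes/(1024*1024)/managedResources)
}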

Any ideas or suggestions are welcome.

charith-elastic (Contributor) commented:

Kubernetes version: 1.15.4-gke.18
ECK commit: b559ec387bc30c1be3011b00722ef98cc5a683b7

As was to be expected, the operator is bound mainly by external factors such as resource limits, provider quotas and the limitations of Kubernetes itself. I managed to spawn 218 Elasticsearch clusters (1 master + 5 data nodes each) and 218 Kibana instances before the time to attach persistent volumes became a bottleneck and clusters were taking too long to reach green status. At this point, the number of objects in the cluster was as follows:

  • 436 Statefulsets
  • 218 Deployments
  • 1526 Pods
  • 872 Services
  • 3924 Secrets
  • 436 Configmaps

Resource usage of the operator was as follows:
[image: operator resource usage graphs]

It should be noted that the RAM usage only increases when new resources are created. When no new create operations are being performed, memory usage remains fairly constant even when the underlying resources are being modified regularly. Some spikes can be observed when a large number of pods disappear (e.g. when a node dies), but they return to normal quickly.

CPU and network usage remain fairly flat and do not seem to grow in proportion to the number of resources being managed. RAM usage is proportional to the number of resources being managed (268 MiB for 436 resources = ~0.61 MiB per resource). This proportion did not seem to vary with the number of Elasticsearch nodes created per resource and can be considered reasonably stable.

The number of requests to the API server remained fairly low at around 100 RPS, with 99th percentile latency below 150 ms. The API server was using 1.32 GiB of RAM, and its CPU usage was under 0.29.

Based on these observations, I think it is fair to conclude that the capacity of the operator is mostly bound by the resources allocated to it and the limitations of the underlying Kubernetes installation. Since the reconciliation logic is relatively simple and requires neither locks nor complex state management, it should be able to handle a very large number of resources effortlessly if all other external factors are controlled. One potential bottleneck that could manifest itself at a sufficiently large scale is access to the Elasticsearch API of each managed cluster. However, it is unlikely that this would be an immediate concern in the short term.
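To illustrate why this model scales, the following is a minimal reconciler skeleton in the controller-runtime style; it is not ECK's actual code, and the Reconcile signature shown is the pre-context variant used by controller-runtime releases of this era (newer versions also take a context.Context). The key property is that all state is re-read from the API server on every call, so there are no locks and no per-resource state held between reconciliations.

package example

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	"sigs.k8s.io/controller-runtime/pkg/client"
	"sigs.k8s.io/controller-runtime/pkg/reconcile"
)

// StatelessReconciler holds only a client; there is no per-resource state,
// which is why memory usage tracks the number of managed objects rather than
// the amount of reconciliation work in flight.
type StatelessReconciler struct {
	Client client.Client
}

func (r *StatelessReconciler) Reconcile(req reconcile.Request) (reconcile.Result, error) {
	ctx := context.Background()

	// Re-read the observed state on every reconciliation; a ConfigMap is used
	// here as a stand-in for whatever resource the controller manages.
	var observed corev1.ConfigMap
	if err := r.Client.Get(ctx, req.NamespacedName, &observed); err != nil {
		if apierrors.IsNotFound(err) {
			// Nothing to do and nothing to clean up locally.
			return reconcile.Result{}, nil
		}
		return reconcile.Result{}, err
	}

	// Compare desired vs. observed state and issue idempotent create/update
	// calls here; any failure simply causes the request to be requeued.
	return reconcile.Result{}, nil
}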

charith-elastic (Contributor) commented:

I left the operator running over the weekend, managing 50 Elasticsearch clusters and 50 Kibana instances. Chaoskube was configured to randomly kill a pod every 10 minutes. Additionally, as the pods were scheduled onto a node pool consisting of GCP preemptible nodes, Kubernetes nodes were automatically cycled every 24 hours as well.

All 100 resources were in a green state after the weekend. There was a sudden jump in used memory from 83 MiB to 112 MiB (presumably due to multiple nodes getting recycled), but over more than 48 hours the total used memory only increased by 2 MiB.
[image: operator memory usage over the weekend]

Average heap usage over this period was 51 MiB.
[image: heap usage graph]

From looking at the heap profile, it appears that most of the heap allocations can be attributed to the TLS connection establishment by the operator's Elasticsearch client and the Kubernetes API client. For large deployments, we may want to reconsider this approach and investigate the feasibility of long-lived connections and techniques like TLS session resumption to reduce the overhead of communication.
[images: heap profiles]
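A minimal sketch of the kind of mitigation mentioned above, using only the standard library (pool sizes, timeouts and cache size are illustrative, and this is not how the operator currently builds its clients): keep idle connections to each Elasticsearch endpoint warm between observation ticks and cache TLS sessions so reconnects can resume instead of performing a full handshake.

package example

import (
	"crypto/tls"
	"net/http"
	"time"
)

// NewReusableHTTPClient returns an HTTP client tuned for repeated polling of
// the same TLS endpoints: connections are kept alive between requests and TLS
// sessions are cached so reconnects can be resumed cheaply.
func NewReusableHTTPClient() *http.Client {
	transport := &http.Transport{
		MaxIdleConns:        100,
		MaxIdleConnsPerHost: 3,                // roughly one per Elasticsearch node polled per cluster
		IdleConnTimeout:     90 * time.Second, // long enough to span 10s observation intervals
		TLSClientConfig: &tls.Config{
			// Cache session state so a dropped connection can be resumed
			// without a full TLS handshake (and its allocations).
			ClientSessionCache: tls.NewLRUClientSessionCache(256),
		},
	}
	return &http.Client{
		Transport: transport,
		Timeout:   30 * time.Second,
	}
}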

charith-elastic (Contributor) commented:

I managed to run 514 Elasticsearch clusters with the operator before the time to detect cluster health became too long. From a quick look at the observer code, it appears that for each cluster, three goroutines are spawned every 10 seconds to retrieve the cluster info, cluster health and licence (a simplified sketch of this polling pattern follows the trace summary below). Since these are network calls, at a sufficiently large scale the overhead compounds and becomes more obvious. Unfortunately, the client is not instrumented, so there are no metrics to illustrate this point. Profile data as well as the following trace summary are available for anyone interested in digging deeper.

Goroutines:
net.(*Resolver).goLookupIPCNAMEOrder.func3.1 N=1544
net/http.(*persistConn).addTLS.func2 N=259
net/http.(*persistConn).readLoop N=775
net/http.(*Transport).dialConnFor N=259
net/http.(*persistConn).writeLoop N=774
github.com/elastic/cloud-on-k8s/pkg/controller/elasticsearch/observer.RetrieveState.func3 N=259
github.com/elastic/cloud-on-k8s/pkg/controller/elasticsearch/observer.RetrieveState.func1 N=257
github.com/elastic/cloud-on-k8s/pkg/controller/elasticsearch/observer.RetrieveState.func2 N=258
internal/singleflight.(*Group).doCall N=258
github.com/elastic/cloud-on-k8s/pkg/controller/elasticsearch/observer.(*Observer).runPeriodically N=259
runtime.timerproc N=2
k8s.io/apimachinery/pkg/watch.(*StreamWatcher).receive N=2
github.com/elastic/cloud-on-k8s/pkg/controller/elasticsearch/observer.(*Manager).notifyListeners.func1 N=258
net.(*netFD).connect.func2 N=258
golang.org/x/net/http2.(*ClientConn).readLoop N=1
runtime/trace.Start.func1 N=1
k8s.io/apimachinery/pkg/util/wait.(*Group).Start.func1 N=2
k8s.io/client-go/tools/cache.(*sharedIndexInformer).Run N=2
k8s.io/client-go/util/workqueue.(*Type).updateUnfinishedWorkLoop N=8
github.com/google/gops/agent.listen N=1
k8s.io/client-go/util/workqueue.(*delayingType).waitingLoop N=8
k8s.io/klog.(*loggingT).flushDaemon N=1
N=1965
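
For reference, here is a simplified sketch of the polling pattern described above; the function names and the fixed 10-second interval come from the description in this comment rather than from the actual observer implementation. Each managed cluster runs a loop like this, so N clusters fan out 3N concurrent network calls every period.

package example

import (
	"context"
	"sync"
	"time"
)

// observeCluster polls one Elasticsearch cluster: every tick it fans out one
// goroutine each for cluster info, cluster health and licence. With hundreds
// of managed clusters, these per-cluster loops add up to thousands of
// goroutines and network calls per observation period.
func observeCluster(ctx context.Context, fetchInfo, fetchHealth, fetchLicense func(context.Context)) {
	ticker := time.NewTicker(10 * time.Second)
	defer ticker.Stop()

	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			var wg sync.WaitGroup
			for _, fetch := range []func(context.Context){fetchInfo, fetchHealth, fetchLicense} {
				wg.Add(1)
				go func(f func(context.Context)) {
					defer wg.Done()
					f(ctx) // network call to the cluster's Elasticsearch API
				}(fetch)
			}
			wg.Wait()
		}
	}
}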

In order to support very large deployments, we may want to reconsider the observation strategy and investigate options such as healthcheck sidecars that can reduce the workload of the operator.

Apart from the noticeable delay in detecting cluster health, everything else appeared to be normal during the test. The operator managed the 514 clusters for over 12 hours; memory usage during this period was 438 MiB, with about 240 MiB allocated on the heap and 3291 goroutines active. The API server was using 1.52 GiB of memory and had 9517 goroutines.

The 514 clusters amounted to:

  • 514 Statefulsets
  • 1542 Pods
  • 1542 Persistent Volumes
  • 5140 Secrets
  • 1028 Configmaps
  • 1028 Services
