
Scale testing ECK #357

Closed
nkvoll opened this issue Feb 7, 2019 · 6 comments
Labels: >test (Related to unit/integration/e2e tests), v1.0.0

Comments

nkvoll (Member) commented Feb 7, 2019

Things we're looking to answer:

  • How many clusters can the operator+k8s handle?
  • What are the main bottlenecks, and how can we work past them?
  • Speed/performance/availability/stability of PVs
  • Network performance
pebrc added this to the Beta milestone Feb 8, 2019
pebrc removed this from the Beta milestone May 10, 2019
sebgl added the >test (Related to unit/integration/e2e tests) label Jul 17, 2019
pebrc (Collaborator) commented Oct 18, 2019

While ideally the tests described here would be fully automated and repeatable, we can run a first iteration of scale testing for the 1.0 release that does not attempt to automate everything right away, if that helps reduce the effort involved.

otrosien commented Nov 14, 2019

For big clusters you typically start running into timeouts on the API server for operations like trying to list all pods. (We've run into this in our own implementation.)
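One common mitigation is to page through large list calls rather than fetching everything in one request. A rough client-go sketch (the kubeconfig path and page size are illustrative, and the List signature shown is the pre-context-argument variant from client-go releases of this era; newer releases also take a context.Context):

package main

import (
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Build a client from the local kubeconfig (path is illustrative).
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	clientset, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		panic(err)
	}

	// Page through all pods 500 at a time instead of listing them in one call,
	// which is the kind of request that tends to time out on very large clusters.
	total := 0
	opts := metav1.ListOptions{Limit: 500}
	for {
		pods, err := clientset.CoreV1().Pods("").List(opts)
		if err != nil {
			panic(err)
		}
		total += len(pods.Items)
		if pods.Continue == "" {
			break
		}
		opts.Continue = pods.Continue
	}
	fmt.Println("total pods:", total)
}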

pebrc changed the title from "Scale testing high level meta issue" to "Scale testing ECK" Nov 19, 2019
pebrc added the v1.0.0 label Nov 20, 2019
charith-elastic (Contributor) commented:

As this is not a typical load-testing use case, I was not aware of any pre-existing tools or techniques for testing a Kubernetes operator, so I had to run some experiments to understand the nature of the problem and the associated issues.

Unsurprisingly, Kubernetes itself is the main factor in any attempt to understand the behaviour of the operator under load. As the operator delegates most of its work to built-in Kubernetes controllers such as the StatefulSet controller, it is effectively bound by the capacity of those systems. Another limiting factor is the quotas imposed by the cloud provider, which prevent the creation of resources such as persistent volumes beyond a certain limit.

During experimental testing, resource usage of the operator appeared to be roughly proportional to the number of resources being managed, regardless of cluster size (there was no significant difference between single-node and 30-node clusters). The main bottleneck seemed to be the number of secrets created, which appears to trigger a volume mounting error in the kubelet: MountVolume.SetUp failed for volume "elastic-internal-http-certificates" : couldn't propagate object cache: timed out waiting for the condition (kubernetes/kubernetes#70044). No significant load on the API server was observed apart from an increase in goroutine count. Response times for API calls remained fairly constant throughout.

Given these initial findings, it appears that the behaviour of the operator depends mainly on external factors rather than intrinsic ones. The question, then, is what we want to achieve from the scale testing effort. Do we want to establish a baseline such as "the operator requires X MiB of RAM for each managed resource", or make a sweeping statement like "the operator can manage N Elasticsearch clusters"? The former is measurable and fairly environment-agnostic, while the latter is subjective and depends on many external factors (Kubernetes version, available resources and their saturation, cluster topology, overhead from service meshes and network overlays, other operators and applications running in the cluster, etc.).
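If we go for the baseline-style measurement, one way it could be computed is sketched below: scrape the operator's Prometheus metrics endpoint and divide resident memory by the number of managed resources. The endpoint address, the assumption that process_resident_memory_bytes is exposed, and the hard-coded resource count are all illustrative.

package main

import (
	"fmt"
	"net/http"

	"github.com/prometheus/common/expfmt"
)

func main() {
	// Hypothetical address of the operator's Prometheus metrics endpoint.
	const metricsURL = "http://localhost:8080/metrics"
	// Number of managed resources (e.g. Elasticsearch + Kibana objects),
	// supplied by hand or from a separate API query.
	const managedResources = 436.0

	resp, err := http.Get(metricsURL)
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	// Parse the Prometheus text exposition format.
	var parser expfmt.TextParser
	families, err := parser.TextToMetricFamilies(resp.Body)
	if err != nil {
		panic(err)
	}

	// process_resident_memory_bytes is the standard Go process collector metric;
	// assumed here to be exposed by the operator's metrics registry.
	mf, ok := families["process_resident_memory_bytes"]
	if !ok || len(mf.GetMetric()) == 0 {
		panic("metric not found")
	}
	rssBytes := mf.GetMetric()[0].GetGauge().GetValue()
	fmt.Printf("~%.2f MiB of RAM per managed resource\n", rssBytes/(1024*1024)/managedResources)
}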

Any ideas or suggestions are welcome.

charith-elastic (Contributor) commented:

Kubernetes version: 1.15.4-gke.18
ECK commit: b559ec387bc30c1be3011b00722ef98cc5a683b7

As was to be expected, the operator is bound mainly by external factors such as resource limits, provider quotas and the limitations of Kubernetes itself. I managed to spawn 218 Elasticsearch clusters (1 master + 5 data nodes each) and 218 Kibana instances before the time to attach persistent volumes became a bottleneck and clusters were taking too long to reach green status. At this point, the number of objects in the cluster was as follows:

  • 436 Statefulsets
  • 218 Deployments
  • 1526 Pods
  • 872 Services
  • 3924 Secrets
  • 436 Configmaps

Resource usage of the operator was as follows:
[image: operator resource usage graphs]

It should be noted that the RAM usage only increases when new resources are created. When no new create operations are being performed, memory usage remains fairly constant even when the underlying resources are being modified regularly. Some spikes can be observed when a large number of pods disappear (e.g. when a node dies), but they return to normal quickly.

CPU and network usage remain fairly flat and do not seem to grow in proportion to the number of resources being managed. RAM usage is proportional to the number of resources being managed (268 MiB for 436 resources = ~0.61 MiB per resource). This proportion did not seem to vary with the number of Elasticsearch nodes created per resource and can be considered reasonably stable.

The number of requests to the API server remained fairly low at around 100 RPS, with 99th percentile latency below 150 ms. The API server was using 1.32 GiB of RAM, and its CPU usage was under 0.29.

Based on these observations, I think it is fair to conclude that the capacity of the operator is mostly bound by the resources allocated to it and the limitations of the underlying Kubernetes installation. Since the reconciliation logic is relatively simple and requires neither locks nor complex state management, it should be able to handle a very large number of resources effortlessly if all other external factors are controlled. One potential bottleneck that could manifest itself at a sufficiently large scale is access to the Elasticsearch API of each managed cluster. However, it is unlikely that this would be an immediate concern in the short term.
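To illustrate why this model scales, the following is a minimal reconciler skeleton in the controller-runtime style; it is not ECK's actual code, and the Reconcile signature shown is the pre-context variant used by controller-runtime releases of this era (newer versions also take a context.Context). The key property is that all state is re-read from the API server on every call, so there are no locks and no per-resource state held between reconciliations.

package example

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	"sigs.k8s.io/controller-runtime/pkg/client"
	"sigs.k8s.io/controller-runtime/pkg/reconcile"
)

// StatelessReconciler holds only a client; there is no per-resource state,
// which is why memory usage tracks the number of managed objects rather than
// the amount of reconciliation work in flight.
type StatelessReconciler struct {
	Client client.Client
}

func (r *StatelessReconciler) Reconcile(req reconcile.Request) (reconcile.Result, error) {
	ctx := context.Background()

	// Re-read the observed state on every reconciliation; a ConfigMap is used
	// here as a stand-in for whatever resource the controller manages.
	var observed corev1.ConfigMap
	if err := r.Client.Get(ctx, req.NamespacedName, &observed); err != nil {
		if apierrors.IsNotFound(err) {
			// Nothing to do and nothing to clean up locally.
			return reconcile.Result{}, nil
		}
		return reconcile.Result{}, err
	}

	// Compare desired vs. observed state and issue idempotent create/update
	// calls here; any failure simply causes the request to be requeued.
	return reconcile.Result{}, nil
}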

charith-elastic (Contributor) commented:

I left the operator running over the weekend, managing 50 Elasticsearch clusters and 50 Kibana instances. Chaoskube was configured to randomly kill a pod every 10 minutes. Additionally, as the pods were scheduled onto a node pool consisting of GCP preemptible nodes, Kubernetes nodes were automatically cycled every 24 hours as well.

All 100 resources were in a green state after the weekend. There was a sudden jump in used memory from 83 MiB to 112 MiB (presumably due to multiple nodes getting recycled), but over more than 48 hours the total used memory only increased by 2 MiB.
[image: operator memory usage over the weekend]

Average heap usage over this period was 51 MiB.
[image: heap usage graph]

From looking at the heap profile, it appears that most of the heap allocations can be attributed to the TLS connection establishment by the operator's Elasticsearch client and the Kubernetes API client. For large deployments, we may want to reconsider this approach and investigate the feasibility of long-lived connections and techniques like TLS session resumption to reduce the overhead of communication.
[images: heap profiles]
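A minimal sketch of the kind of mitigation mentioned above, using only the standard library (pool sizes, timeouts and cache size are illustrative, and this is not how the operator currently builds its clients): keep idle connections to each Elasticsearch endpoint warm between observation ticks and cache TLS sessions so reconnects can resume instead of performing a full handshake.

package example

import (
	"crypto/tls"
	"net/http"
	"time"
)

// NewReusableHTTPClient returns an HTTP client tuned for repeated polling of
// the same TLS endpoints: connections are kept alive between requests and TLS
// sessions are cached so reconnects can be resumed cheaply.
func NewReusableHTTPClient() *http.Client {
	transport := &http.Transport{
		MaxIdleConns:        100,
		MaxIdleConnsPerHost: 3,                // roughly one per Elasticsearch node polled per cluster
		IdleConnTimeout:     90 * time.Second, // long enough to span 10s observation intervals
		TLSClientConfig: &tls.Config{
			// Cache session state so a dropped connection can be resumed
			// without a full TLS handshake (and its allocations).
			ClientSessionCache: tls.NewLRUClientSessionCache(256),
		},
	}
	return &http.Client{
		Transport: transport,
		Timeout:   30 * time.Second,
	}
}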

charith-elastic (Contributor) commented:

I managed to run 514 Elasticsearch clusters with the operator before the time to detect cluster health became too long. From a quick look at the observer code, it appears that for each cluster, three goroutines are spawned every 10 seconds to retrieve the cluster info, cluster health and licence (a simplified sketch of this polling pattern follows the trace summary below). Since these are network calls, at a sufficiently large scale the overhead compounds and becomes more obvious. Unfortunately, the client is not instrumented, so there are no metrics to illustrate this point. Profile data as well as the following trace summary are available for anyone interested in digging deeper.

Goroutines:
net.(*Resolver).goLookupIPCNAMEOrder.func3.1 N=1544
net/http.(*persistConn).addTLS.func2 N=259
net/http.(*persistConn).readLoop N=775
net/http.(*Transport).dialConnFor N=259
net/http.(*persistConn).writeLoop N=774
github.com/elastic/cloud-on-k8s/pkg/controller/elasticsearch/observer.RetrieveState.func3 N=259
github.com/elastic/cloud-on-k8s/pkg/controller/elasticsearch/observer.RetrieveState.func1 N=257
github.com/elastic/cloud-on-k8s/pkg/controller/elasticsearch/observer.RetrieveState.func2 N=258
internal/singleflight.(*Group).doCall N=258
github.com/elastic/cloud-on-k8s/pkg/controller/elasticsearch/observer.(*Observer).runPeriodically N=259
runtime.timerproc N=2
k8s.io/apimachinery/pkg/watch.(*StreamWatcher).receive N=2
github.com/elastic/cloud-on-k8s/pkg/controller/elasticsearch/observer.(*Manager).notifyListeners.func1 N=258
net.(*netFD).connect.func2 N=258
golang.org/x/net/http2.(*ClientConn).readLoop N=1
runtime/trace.Start.func1 N=1
k8s.io/apimachinery/pkg/util/wait.(*Group).Start.func1 N=2
k8s.io/client-go/tools/cache.(*sharedIndexInformer).Run N=2
k8s.io/client-go/util/workqueue.(*Type).updateUnfinishedWorkLoop N=8
github.com/google/gops/agent.listen N=1
k8s.io/client-go/util/workqueue.(*delayingType).waitingLoop N=8
k8s.io/klog.(*loggingT).flushDaemon N=1
N=1965
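
For reference, here is a simplified sketch of the polling pattern described above; the function names and the fixed 10-second interval come from the description in this comment rather than from the actual observer implementation. Each managed cluster runs a loop like this, so N clusters fan out 3N concurrent network calls every period.

package example

import (
	"context"
	"sync"
	"time"
)

// observeCluster polls one Elasticsearch cluster: every tick it fans out one
// goroutine each for cluster info, cluster health and licence. With hundreds
// of managed clusters, these per-cluster loops add up to thousands of
// goroutines and network calls per observation period.
func observeCluster(ctx context.Context, fetchInfo, fetchHealth, fetchLicense func(context.Context)) {
	ticker := time.NewTicker(10 * time.Second)
	defer ticker.Stop()

	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			var wg sync.WaitGroup
			for _, fetch := range []func(context.Context){fetchInfo, fetchHealth, fetchLicense} {
				wg.Add(1)
				go func(f func(context.Context)) {
					defer wg.Done()
					f(ctx) // network call to the cluster's Elasticsearch API
				}(fetch)
			}
			wg.Wait()
		}
	}
}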

In order to support very large deployments, we may want to reconsider the observation strategy and investigate options such as healthcheck sidecars that can reduce the workload of the operator.

Apart from the noticeable delay in detecting cluster health, everything else appeared to be normal during the test. The operator managed the 514 clusters for over 12 hours; memory usage during this period was 438 MiB, with about 240 MiB allocated on the heap and 3291 goroutines active. The API server was using 1.52 GiB of memory and had 9517 goroutines.

The 514 clusters amounted to:

  • 514 Statefulsets
  • 1542 Pods
  • 1542 Persistent Volumes
  • 5140 Secrets
  • 1028 Configmaps
  • 1028 Services
