Scale testing ECK #357
While ideally the tests described here would be fully automated and repeatable, we can run a first iteration of scale testing for the 1.0 release that does not attempt to automate everything right away, if that helps reduce the effort involved.
For big clusters you typically start running into timeouts on the API server for operations like listing all pods. (We've hit this in our implementation ;)
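For reference, here's a minimal sketch of how such list calls can be chunked with the API's limit/continue pagination so a single huge LIST doesn't time out. This is only an illustration, not the operator's actual code; the page size of 500 is an arbitrary assumption.

```go
// Sketch: paginated pod listing with client-go to avoid one
// timeout-prone LIST of every pod in a large cluster.
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

func listAllPods(ctx context.Context, client kubernetes.Interface) (int, error) {
	total := 0
	opts := metav1.ListOptions{Limit: 500} // page size is an arbitrary choice
	for {
		page, err := client.CoreV1().Pods("").List(ctx, opts)
		if err != nil {
			return total, err
		}
		total += len(page.Items)
		if page.Continue == "" {
			return total, nil
		}
		opts.Continue = page.Continue // fetch the next chunk
	}
}

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)
	n, err := listAllPods(context.Background(), client)
	fmt.Println(n, err)
}
```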
As this is not a typical load-testing use case, I was not aware of any pre-existing tools and techniques for testing a Kubernetes operator, so I had to do some experiments to understand the nature of the problem and the associated issues. Unsurprisingly, Kubernetes itself is the main factor in any attempt to understand the behaviour of the operator under load. As the operator delegates most of its work to built-in Kubernetes controllers like the StatefulSet controller, it is effectively bound by the capacity of those systems. Another limiting factor is the quotas imposed by the cloud provider, which prevent the creation of resources such as persistent volumes beyond a certain limit.

During experimental testing, resource usage of the operator appeared to be roughly proportional to the number of resources being managed, regardless of cluster size (no significant difference between single-node clusters and 30-node clusters). The main bottleneck seemed to be the number of secrets created, which seems to trigger a volume mounting error in the kubelet.

Given these initial findings, it appears that the behaviour of the operator depends mainly on external factors rather than intrinsic ones. The question, then, is what we want to achieve from the scale testing effort. Do we want to establish a baseline such as "the operator requires X MB of RAM for each managed resource", or make a sweeping statement like "the operator can manage N Elasticsearch clusters"? The former is measurable and fairly environment-agnostic, while the latter is subjective and depends on many external factors (Kubernetes version, available resources and their saturation, cluster topology, overhead from service meshes and network overlays, other operators and applications running in the cluster, etc.). Any ideas or suggestions are welcome.
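To give an idea of what the per-resource baseline could look like in practice, here is a rough sketch (not part of any actual test harness) that counts managed Elasticsearch resources via the dynamic client and prints the operator's reported resident memory. The group/version in the GVR and the metrics URL are assumptions and would need to be adjusted to the deployed CRD version and operator service.

```go
// Sketch: relate operator memory to the number of managed resources.
// The Elasticsearch group/version and the metrics endpoint are assumptions.
package main

import (
	"bufio"
	"context"
	"fmt"
	"net/http"
	"strings"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/client-go/dynamic"
	"k8s.io/client-go/rest"
)

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	dyn := dynamic.NewForConfigOrDie(cfg)

	// Hypothetical GVR; adjust to the CRD version actually deployed.
	gvr := schema.GroupVersionResource{
		Group:    "elasticsearch.k8s.elastic.co",
		Version:  "v1",
		Resource: "elasticsearches",
	}
	list, err := dyn.Resource(gvr).Namespace(metav1.NamespaceAll).List(context.Background(), metav1.ListOptions{})
	if err != nil {
		panic(err)
	}
	managed := len(list.Items)

	// Hypothetical metrics URL for the operator pod.
	resp, err := http.Get("http://elastic-operator-metrics:8080/metrics")
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	scanner := bufio.NewScanner(resp.Body)
	for scanner.Scan() {
		line := scanner.Text()
		// process_resident_memory_bytes comes from the standard Prometheus process collector.
		if strings.HasPrefix(line, "process_resident_memory_bytes") {
			fmt.Printf("%d managed resources, memory: %s\n", managed, line)
		}
	}
}
```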
I left the operator running over the weekend, managing 50 Elasticsearch clusters and 50 Kibana instances. Chaoskube was configured to randomly kill a pod every 10 minutes. Additionally, as the pods were scheduled onto a node pool consisting of GCP preemptible nodes, Kubernetes nodes were automatically cycled every 24 hours as well. All 100 resources were in green state after the weekend.

There was a sudden jump in used memory from 83 MiB to 112 MiB -- presumably due to multiple nodes getting recycled -- but for over 48 hours the total used memory only increased by 2 MiB. Average heap usage over this period was 51 MiB.

Looking at the heap profile, it appears that most of the heap allocations can be attributed to TLS connection establishment by the operator's Elasticsearch client and the Kubernetes API client. For large deployments, we may want to reconsider this approach and investigate the feasibility of long-lived connections and techniques like TLS session resumption to reduce the overhead of communication.
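As a rough illustration of that last point, here is a sketch of an HTTP client configured for connection reuse and TLS session resumption using Go's standard library. The timeouts, pool sizes, and cache size are arbitrary assumptions, not tuned values.

```go
// Sketch: an HTTP client tuned for connection reuse and TLS session
// resumption to cut down repeated handshake overhead at scale.
package main

import (
	"crypto/tls"
	"net/http"
	"time"
)

func newESClient() *http.Client {
	transport := &http.Transport{
		// Keep idle connections around so subsequent health checks reuse them.
		MaxIdleConns:        100,
		MaxIdleConnsPerHost: 10,
		IdleConnTimeout:     90 * time.Second,
		TLSClientConfig: &tls.Config{
			// A session cache enables TLS session resumption on reconnect,
			// avoiding a full handshake every time.
			ClientSessionCache: tls.NewLRUClientSessionCache(256),
		},
	}
	return &http.Client{
		Transport: transport,
		Timeout:   30 * time.Second,
	}
}

func main() {
	client := newESClient()
	_ = client // per-cluster health/info requests would be issued through this client
}
```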
Managed to run 514 Elasticsearch clusters with the operator before the time to detect the health of clusters became too long. From a quick look at the observer code, it appears that for each cluster, 3 goroutines are spawned every 10 seconds to retrieve the cluster info, cluster health and licence. Since these are network calls, at a sufficiently large scale the overhead seems to compound and become more obvious. Unfortunately, the client is not instrumented, so there are no metrics to illustrate this point. Profile data as well as the following trace summary are available for anyone interested in digging deeper.
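For context, this is a hedged sketch of one way the per-cluster polling could be bounded: a fixed-size worker pool instead of several goroutines per cluster on every tick. It is not the observer's actual implementation; the pool size, interval, and `fetchHealth` helper are hypothetical.

```go
// Sketch: observe many clusters with a fixed-size worker pool rather than
// spawning multiple goroutines per cluster on every tick.
package main

import (
	"context"
	"time"
)

// fetchHealth is a hypothetical stand-in for the per-cluster info/health/licence calls.
func fetchHealth(ctx context.Context, cluster string) {
	// ... issue the HTTP requests for this cluster ...
}

func observe(ctx context.Context, clusters []string, workers int, interval time.Duration) {
	work := make(chan string)

	// A fixed number of workers bounds the concurrent network calls.
	for i := 0; i < workers; i++ {
		go func() {
			for cluster := range work {
				fetchHealth(ctx, cluster)
			}
		}()
	}

	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			close(work)
			return
		case <-ticker.C:
			// If the workers fall behind, this send blocks until one is free,
			// naturally applying back-pressure instead of piling up goroutines.
			for _, c := range clusters {
				work <- c
			}
		}
	}
}

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), time.Minute)
	defer cancel()
	observe(ctx, []string{"cluster-a", "cluster-b"}, 8, 10*time.Second)
}
```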
In order to support very large deployments, we may want to reconsider the observation strategy and investigate options such as healthcheck sidecars that can reduce the workload of the operator.

Apart from the noticeable delay in detecting the health of clusters, everything else appeared to be normal during the test. The operator had been managing the 514 clusters for over 12 hours; memory usage during this period was 438 MiB, with about 240 MiB allocated on the heap, and 3291 goroutines were active. The API server was using 1.52 GiB of memory and had 9517 goroutines. 514 clusters amounts to:
Things we're looking to answer: