Handle Extremely High Throughput by holding back requests to etcd until the throughput decreases. #16837

Sharpz7 · 2023-10-27T08:08:57Z

What would you like to be added?

Find the original k8s ticket here: kubernetes/kubernetes#120781.

Essentially, what the title says. If someone wants to try and enter an incredibly high number of key-value pairs all at once, there should be a way to hold them back.

As I said in my last comment in the original ticket in k8s, I am not convinced this belongs here. But, it is something I am very, very interested in, and would be happy to pivot to whatever is needed and work on this personally.

Thanks!

Why is this needed?

For people dealing with extremely high-throughput batch work (i.e 1000's jobs / second, lasting 1-2 mins each), etcd starts to become a real problem.

Links to back up this point:

https://etcd.io/docs/v3.5/op-guide/performance/
https://github.com/armadaproject/armada: A scheduling solution partially designed around this problem.

In the original ticket (kubernetes/kubernetes#120781) it was agreed:

Using a non-etcd backend is not the desired end state (etcd is great!)
It should be in-tree for etcd

serathius · 2023-10-27T08:29:25Z

Essentially, what the title says. If someone wants to try and enter an incredibly high number of key-value pairs all at once, there should be a way to hold them back.

Yes, there is, at least in Kubernetes. https://kubernetes.io/docs/concepts/cluster-administration/flow-control/
For single tenant case (etcd used by single Kubernetes cluster), there is no motivation to have a APF on etcd side. Kubernetes is better aware of what kind of requests it is sending to etcd.

Essentially, what the title says. If someone wants to try and enter an incredibly high number of key-value pairs all at once, there should be a way to hold them back.
...
For people dealing with extremely high-throughput batch work (i.e 1000's jobs / second, lasting 1-2 mins each), etcd starts to become a real problem.

What you are describing is issue with write throughput? Is that correct?

I haven't heard of real world cases where writes could topple etcd. Write throughput depends mostly on disk performance and is not as resource intensive on memory or CPU (would need a test to confirm). Also etcd limits number of pending proposals so at some point no new proposals should be accepted.

Are you sure there is no other accompanying requests other than just writes that could be cause of the problem? For example cost of writes, scales with number of watchers. If you have many watches established this would make more sense. I think we need more concrete data points then just saying etcd becomes a problem. What exact traffic goes into etcd, performance metrics and profiles to be able to answer what is the problem in your case.

Overall I think this is a scalability problem, in such cases there is no single problem that would allow us to scale to 1000 qps. Fixing one bottleneck will just surfice another issue. Solution defining the exact scenario we want to improve, picking the success metric, and progressive improvements towards the goal. Such as #16467

serathius · 2023-10-27T09:21:57Z

I remembered one case where testing high throughput affected etcd, however it only caused high memory usage due to increased number of allocations required by PrevKey watch option used by Kubernetes #16839.

Issue was easily mitigated by changing GC to be more aggresive.

Sharpz7 · 2023-10-30T05:12:53Z

Really appreciate you getting back to me. This gives me lots of research and ideas to throw about with my team. If we still have questions / issues, I will try and do it more formally like you suggested.

Thanks!

Sharpz7 · 2023-11-29T22:33:55Z

Hey @serathius, so talking with people more, it seems that we suffer from very high pod-churn with the pods having very large manifests (20-50 kB), so we are constantly having to defragment etcd. OpenShift have an operator that handles this most of the time https://docs.openshift.com/container-platform/4.14/scalability_and_performance/recommended-performance-scale-practices/recommended-etcd-practices.html#manual-defrag-etcd-data_recommended-etcd-practices but they still need some manual work.

Is there a way of doing this natively with an operator that etcd have? Or is that something that could be created? Thanks again.

jmhbnz · 2023-11-30T02:04:26Z

Hey @Sharpz7 - One of the etcd maintainers @ahrtr has put together an etcd defrag helper utility https://github.com/ahrtr/etcd-defrag.

This can be run via a kubernetes cronjob and have rules applied to ensure defrag is only run if actually required.

It might be a helpful approach, however please bear in mind this is not an official etcd subproject at this point.

Sharpz7 · 2023-12-04T02:57:51Z

Appreciate you getting back to me - this is really cool! Thanks

stale · 2024-03-17T12:57:50Z

This issue has been automatically marked as stale because it has not had recent activity. It will be closed after 21 days if no further activity occurs. Thank you for your contributions.

Sharpz7 added the type/feature label Oct 27, 2023

jmhbnz added the area/performance label Oct 31, 2023

stale bot added the stale label Mar 17, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Handle Extremely High Throughput by holding back requests to etcd until the throughput decreases. #16837

Handle Extremely High Throughput by holding back requests to etcd until the throughput decreases. #16837

Sharpz7 commented Oct 27, 2023

serathius commented Oct 27, 2023

serathius commented Oct 27, 2023 •

edited

Loading

Sharpz7 commented Oct 30, 2023

Sharpz7 commented Nov 29, 2023

jmhbnz commented Nov 30, 2023

Sharpz7 commented Dec 4, 2023

stale bot commented Mar 17, 2024

Handle Extremely High Throughput by holding back requests to etcd until the throughput decreases. #16837

Handle Extremely High Throughput by holding back requests to etcd until the throughput decreases. #16837

Comments

Sharpz7 commented Oct 27, 2023

What would you like to be added?

Why is this needed?

serathius commented Oct 27, 2023

serathius commented Oct 27, 2023 • edited Loading

Sharpz7 commented Oct 30, 2023

Sharpz7 commented Nov 29, 2023

jmhbnz commented Nov 30, 2023

Sharpz7 commented Dec 4, 2023

stale bot commented Mar 17, 2024

serathius commented Oct 27, 2023 •

edited

Loading