KEP-3158: Optional maximum etcd page size on every list requests #3252
Conversation
Welcome @chaochn47!
Hi @chaochn47. Thanks for your PR. I'm waiting for a Kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with `/ok-to-test`. Once the patch is verified, the new status will be reflected by the `ok-to-test` label. I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
(force-pushed from 7deee45 to f9cea87)
/cc @wojtek-t @anguslees @shyamjvs @mborsz @ptabor @serathius PTAL, thx!
/ok-to-test
(force-pushed from 66ac306 to 3b09687)
# The following PRR answers are required at alpha release
# List the feature gate name and the components for which it must be enabled
feature-gates:
- name: MaximumListLimitOnEtcd
A clearer name would be MaximumPageSizeLimitForEtcdLists, imo. Let's see if others have better suggestions.
That looks good to me.
**Note:** When your KEP is complete, all of these comment blocks should be removed.

To get started with this template:
Can we remove these comment blocks to make reviewing easier, given you've already filled in the details?
-->
Performing API queries that return all of the objects of a given resource type (GET /api/v1/pods, GET /api/v1/secrets) without pagination can lead to significant variations in peak memory use on etcd and APIServer.

This proposal covers an approach that splitting full resource list to protect etcd from out of memory crashing
"an approach where the apiserver makes multiple paginated list calls to etcd instead of a single large unpaginated call to reduce the peak memory usage of etcd, thereby reducing its chances of Out-of-memory crashes"
^ would be clearer
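To make the suggested wording concrete, here is a minimal sketch of what paginated list calls from the apiserver to etcd look like at the etcd client level. This is not the actual kube-apiserver storage code; the endpoint, prefix, and page size are placeholder values, and error handling is reduced to the essentials.

```go
package main

import (
	"context"
	"fmt"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

// listPrefixPaged reads every key under prefix in pages of at most pageSize
// entries instead of one unpaginated range request, pinning all pages to the
// revision of the first response so the result is a consistent snapshot.
func listPrefixPaged(ctx context.Context, cli *clientv3.Client, prefix string, pageSize int64) (int, error) {
	start := prefix
	end := clientv3.GetPrefixRangeEnd(prefix) // e.g. "/registry/pods/" -> "/registry/pods0"
	total := 0
	var rev int64
	for {
		opts := []clientv3.OpOption{
			clientv3.WithRange(end),
			clientv3.WithLimit(pageSize),
			clientv3.WithSort(clientv3.SortByKey, clientv3.SortAscend),
		}
		if rev != 0 {
			opts = append(opts, clientv3.WithRev(rev))
		}
		resp, err := cli.Get(ctx, start, opts...)
		if err != nil {
			return total, err
		}
		if rev == 0 {
			rev = resp.Header.Revision
		}
		total += len(resp.Kvs)
		if !resp.More || len(resp.Kvs) == 0 {
			return total, nil // last page reached
		}
		// Continue immediately after the last key returned in this page.
		start = string(resp.Kvs[len(resp.Kvs)-1].Key) + "\x00"
	}
}

func main() {
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"http://127.0.0.1:2379"}, // placeholder endpoint
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		panic(err)
	}
	defer cli.Close()

	n, err := listPrefixPaged(context.Background(), cli, "/registry/pods/", 500)
	fmt.Println(n, err)
}
```

The peak memory on etcd is then bounded by roughly pageSize objects per response instead of the full keyspace, which is the whole point of the proposal.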
[documentation style guide]: https://github.com/kubernetes/community/blob/master/contributors/guide/style-guide.md
-->
Performing API queries that return all of the objects of a given resource type (GET /api/v1/pods, GET /api/v1/secrets) without pagination can lead to significant variations in peak memory use on etcd and APIServer.
Apiserver memory usage is also mentioned here, but this proposal primarily helps optimize etcd memory only. Worth making this clear and adding "reducing apiserver memory usage for unpaginated list calls" as a non-goal.
One thing to note though is this may indirectly also help apiserver in setups where etcd is colocated with apiserver on the same instance. So reducing etcd mem usage can buy more memory for apiserver.
> One thing to note though is this may indirectly also help apiserver in setups where etcd is colocated with apiserver on the same instance. So reducing etcd mem usage can buy more memory for apiserver.

That's a really good point for other cloud providers to consider when using this feature.
know that this has succeeded?
-->

- APIServer is able to split large etcd range queries into smaller chunks
IMO this is more of a "how" we're trying to achieve the goal listed in the next line rather than "what". It doesn't sound like a goal to me in that sense.
- Feature gate name:
- Components depending on the feature gate:
Add feature gate name here and also kube-apiserver as the component.
- Components depending on the feature gate:
- [ ] Other
- Describe the mechanism:
- Will enabling / disabling the feature require downtime of the control
I believe this should be "Yes, this requires a restart of the apiserver instance. However, there shouldn't be downtime in HA setups where at least one replica is kept active at any given time during the update."
+cc @liggitt - is there a proper wording for this already?
and make progress.
-->

- Throttle large range requests on etcd server
Can we generalize this as "Implementing any priority or fairness mechanisms for list calls made to etcd"?
-->

- Throttle large range requests on etcd server
- Streaming list response from etcd to APIServer in chunks
"Streaming" and "chunks" seem to be a bit contradicting. We're actually proposing the latter in this KEP, but not the former. Let's make it clear.
- Throttle large range requests on etcd server
- Streaming list response from etcd to APIServer in chunks
- Fine-tune with [priority and fairness](https://github.com/kubernetes/enhancements/tree/master/keps/sig-api-machinery/1040-priority-and-fairness) or `max-requests-in-flight` to prevent etcd blowing up
Suggest -> "Reduce list call load on etcd using priority & fairness settings on the apiserver"
Sounds good. I will update it with "Reduce list call load on etcd using priority & fairness setting on apiserver".
(force-pushed from 3b09687 to ebe3587)
Previous comments are addressed and updated accordingly. PTAL @shyamjvs, thanks! There are no additional comments from other reviewers. Is this KEP implementable, or are there major concerns about this approach? Please let me know, thanks.
A few tweak suggestions
[experience reports]: https://github.com/golang/go/wiki/ExperienceReports
-->

Protect etcdfrom out of memory crashing and prevent its cascading unpredicable failures. APIServer and other downstream components recovering like re-creating gRPC connections to etcd, rebuilding all caches and resuming internal reconciler can be expensive. Lower etcd RAM consumption at peak time to allow auto scaler to increase memory budget in time.
Suggested change:
Protect etcd from out of memory crashing and prevent its cascading unpredictable failures. kube-apiserver and other downstream components recovering like re-creating gRPC connections to etcd, rebuilding all caches and resuming internal reconciler can be expensive. Lower etcd RAM consumption at peak time to allow autoscalers to increase their memory budget in time.
In Kubernetes, etcd only has a single client: APIServer. It is feasible to control how requests are sent from APIServer to etcd.
Suggested change:
In Kubernetes, etcd only has a single client: kube-apiserver. It is feasible to control how requests are sent from kube-apiserver to etcd.
### Desired outcome
By default, this proposal is a no-op.
If this feature is opted-in and `max-list-etcd-limit` flag on APIServer is set to `x` where `x` >= 500
- KubeAPIServer splits requests to etcd into multiple pages. The max etcd page size is `x`. If user provided the limit `y` is smaller or equal to `x`, those requests are intact.
Suggested change:
- kube-apiserver splits requests to etcd into multiple pages. The max etcd page size is `x`. If user provided the limit `y` is smaller or equal to `x`, those requests are intact.
If this feature is opted-in and `max-list-etcd-limit` flag on APIServer is set to `x` where `x` >= 500
Suggested change:
If the relevant feature gate is enabled and the `max-list-etcd-limit` command line argument to kube-apiserver is set to `x`, where `x >= 500`, then:
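As an illustration of the rule described in this section, the following sketch shows the intended interaction between a client-supplied limit `y` and the configured cap `x`. The helper name and flag handling are invented for this example and are not taken from the KEP or from kube-apiserver code.

```go
package main

import "fmt"

// effectiveEtcdPageLimit is a hypothetical helper: a client-supplied limit y
// that is already <= the configured cap x passes through unchanged, while an
// unlimited request (y == 0) or a larger limit is paged with at most x
// entries per etcd request.
func effectiveEtcdPageLimit(clientLimit, maxListEtcdLimit int64) int64 {
	if maxListEtcdLimit <= 0 {
		return clientLimit // feature disabled: behave exactly as today
	}
	if clientLimit > 0 && clientLimit <= maxListEtcdLimit {
		return clientLimit // small paginated requests are left intact
	}
	return maxListEtcdLimit // unlimited or oversized requests are capped per page
}

func main() {
	fmt.Println(effectiveEtcdPageLimit(0, 500))    // 500: an unlimited list is paged
	fmt.Println(effectiveEtcdPageLimit(100, 500))  // 100: the client limit is kept intact
	fmt.Println(effectiveEtcdPageLimit(5000, 500)) // 500: an oversized limit is capped
}
```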
#### Story 1
A user deployed an application as daemonset that queries pods belonging to the node it is running. It translates to range query etcd from `/registry/pods/` to `/registry/pods0` and each such query payload is close to 2GiB.
nit, you could mark this as JSON.
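For readers wondering why "all pods" turns into the key range `/registry/pods/` to `/registry/pods0`: the range end is simply the prefix with its last byte incremented (`/` is 0x2f, `0` is 0x30), and the etcd client computes it for you. A tiny illustration, not part of the KEP:

```go
package main

import (
	"fmt"

	clientv3 "go.etcd.io/etcd/client/v3"
)

func main() {
	// GetPrefixRangeEnd increments the last byte of the prefix, turning a
	// prefix read into a half-open key range [prefix, rangeEnd).
	fmt.Println(clientv3.GetPrefixRangeEnd("/registry/pods/")) // "/registry/pods0"
}
```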
(force-pushed from ebe3587 to 04817e8)
(force-pushed from 04817e8 to cecca3b)
I think this effort would be better spent encouraging clients to paginate. Fixing apiserver <-> etcd is at best half of the problem. I think this proposal is assuming that if etcd didn't fall over everything would be fine, but actually if etcd doesn't fall over first, then apiserver probably will, since it still buffers. Encouraging clients to paginate looks like:
Those restrictions might include a maximum page size for the Kubernetes API (not etcd), I think.
APF limits concurrency. We can't put a maximum like that in without breaking clients. (and it would not be an APF setting, either)
Perhaps a different limit under the broad umbrella of “fairness”.
Fairness refers to requests of a given source in a given priority level having a chance of service inversely proportional to the number of request sources rather than the total number of requests. We can't put a hard cap on unpaginated requests, but we can shuffle them to their own PL and make its max concurrency 1. That will slow them down enough to let apiserver (and etcd) survive, and encourage the end user to update to a version that paginates. Before we can do that we need to make it maximally easy to get on versions of clients that paginate list requests. This proposal won't help with any of that.
Thanks @lavalamp for the review!! I have some follow-up questions.
I doubt this is possible in practice. The APIListChunking feature was added a few years ago. From the perspective of a Kubernetes platform provider like GKE or EKS, there is a need for some protection mechanism on the platform itself against naive clients. This proposal provides a way to paginate the requests internally in the apiserver to protect the platform (control plane) and save end users from their mistakes.
Is APF configuration based on whether requests are paginated or not? If yes, that would be very useful. This setting has to be on by default and cannot be reactive, applied only after etcd has already run out of memory because of such a traffic storm.
Please see this for an approach that can work: #3667. Yes, clients are going to have to change. We can't rescue apiserver memory from unpaginated list calls. Paginating the call between apiserver and etcd won't help with this at all -- the {apiserver, etcd} system will still fail, just in a different place.
Today it doesn't distinguish, IIRC. But ask in the slack channel (https://kubernetes.slack.com/archives/C01G8RY7XD3). We likely have to make APF aware somehow, so that we can throttle the most expensive requests the most. But it's not fair to do that until we have some actually working alternative for clients to switch to.
@chaochn47 If the cluster has a few very large objects, the maximum limit doesn't necessarily avoid peak memory use. Also, do you intend to break native clients that don't support pagination?
I am intrigued to know how this will be implemented, but I suspect there would be some issues. For example, to know that a request is expensive, the apiserver may have to query etcd first, by which point the damage to etcd is already done while other expensive requests keep coming through the apiserver.
No, APF will eventually estimate based on other requests in the priority level. And a surprisingly expensive request will occupy seats for a long time, reducing throughput until it finishes.
Agree with @lavalamp about leveraging APF long-term for limiting the etcd list throughput. One issue though (which we haven't discussed above) is that a change was made a while ago (kubernetes/kubernetes#108569) which increases the apiserver -> etcd list page size exponentially until enough objects matching a field selector are found. This is done to satisfy the client-specified "limit" and can end up listing pretty much everything from etcd if it's a rare selector, e.g. "list pods on my node" by kubelet. A mechanism to cap etcd page size can work in tandem with that change to bound the "max damage" a given request can make to etcd. I'm not sure how this could be handled through APF alone (especially given the variability of label/field selectors). Thoughts?
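To make that interaction concrete: under kubernetes/kubernetes#108569 the page size the apiserver asks etcd for can keep doubling when a rare selector filters out most results, and a configured cap bounds how far that growth can go. The sketch below mirrors the idea in spirit only; it is not the real apiserver code, and `maxEtcdPageSize` is a hypothetical cap.

```go
package main

import "fmt"

// nextPageSize doubles the requested etcd page size, as the exponential-growth
// strategy does when a rare selector keeps returning too few matches, but never
// exceeds the (hypothetical) configured cap.
func nextPageSize(current, maxEtcdPageSize int64) int64 {
	next := current * 2
	if maxEtcdPageSize > 0 && next > maxEtcdPageSize {
		next = maxEtcdPageSize // never ask etcd for more than the cap per page
	}
	return next
}

func main() {
	// Without a cap the sequence is 500, 1000, 2000, 4000, 8000, ...
	// With a cap of 2000, the per-request damage to etcd stays bounded.
	size := int64(500)
	for i := 0; i < 5; i++ {
		fmt.Println(size) // prints 500, 1000, 2000, 2000, 2000
		size = nextPageSize(size, 2000)
	}
}
```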
I feel limiting by count is unable to cap the "max damage" precisely, so I have been working on a proposal to allow limiting by response size: etcd-io/etcd#14810.
I think just a max page size should be sufficient to solve that. There is also no reason for it to grow exponentially.
@lavalamp: is the max page size in bytes or in 'entries'? I see benefits of a max-page-bytes-size: taking into consideration the (not guaranteed) upper-bound limit of 10k entries established in kubernetes/kubernetes#108569 and a not-that-big object of 30KB, we get 300MB in a single request (and a theoretical ability to read the full etcd state (8GB) in a single request when each entry is 1MB). 8-32MB per gRPC response sounds like a better balance between throughput and memory consumption. And ultimately a streaming range (like etcd-io/etcd#12343) would move the problem to the gRPC/HTTP2 flow-control layer.
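For context on what a byte-based bound looks like today: etcd-io/etcd#14810 proposes a server-side limit, whereas one byte-level guardrail that already exists sits on the client side of the gRPC call and rejects oversized responses rather than splitting them. A minimal sketch, assuming a locally reachable etcd endpoint (the endpoint and the 32MB figure are placeholders):

```go
package main

import (
	"fmt"

	clientv3 "go.etcd.io/etcd/client/v3"
)

func main() {
	// The etcd v3 Go client can refuse any single gRPC response larger than a
	// configured number of bytes. Note this does not paginate or truncate the
	// response (an oversized Get simply fails), so it is a guardrail, not a
	// substitute for the byte-based paging discussed above.
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:          []string{"http://127.0.0.1:2379"},
		MaxCallRecvMsgSize: 32 * 1024 * 1024, // ~32MB per response
	})
	if err != nil {
		panic(err)
	}
	defer cli.Close()
	fmt.Println("etcd client configured with a 32MB receive cap")
}
```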
If we really wanted to, we could compute a target number of entries for all pages but the first based on the average object size seen so far. But I think that's not worth the effort either. The only thing that actually solves the problem is getting people to use the watch with catch-up entries: #3157. Effort spent on this problem that's not on that, or on getting clients to adopt that and pagination, has a poor effort-to-reward ratio IMO.
The Kubernetes project currently lacks enough contributors to adequately respond to all PRs. Please send feedback to sig-contributor-experience at kubernetes/community. /lifecycle stale
The Kubernetes project currently lacks enough active contributors to adequately respond to all PRs. Please send feedback to sig-contributor-experience at kubernetes/community. /lifecycle rotten
/remove-lifecycle rotten
The Kubernetes project currently lacks enough contributors to adequately respond to all PRs. Please send feedback to sig-contributor-experience at kubernetes/community. /lifecycle stale
The Kubernetes project currently lacks enough active contributors to adequately respond to all PRs. Please send feedback to sig-contributor-experience at kubernetes/community. /lifecycle rotten
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs. Please send feedback to sig-contributor-experience at kubernetes/community. /close
@k8s-triage-robot: Closed this PR.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.