
KEP-3158: Optional maximum etcd page size on every list requests #3252

Conversation

@chaochn47 chaochn47 commented Mar 22, 2022

  • One-line PR description: KEP-3158: Optional maximum etcd page size on every list requests
  • Other comments: N/A

@k8s-ci-robot k8s-ci-robot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Mar 22, 2022
@k8s-ci-robot
Contributor

Welcome @chaochn47!

It looks like this is your first PR to kubernetes/enhancements 🎉. Please refer to our pull request process documentation to help your PR have a smooth ride to approval.

You will be prompted by a bot to use commands during the review process. Do not be afraid to follow the prompts! It is okay to experiment. Here is the bot commands documentation.

You can also check if kubernetes/enhancements has its own contribution guidelines.

You may want to refer to our testing guide if you run into trouble with your tests not passing.

If you are having difficulty getting your pull request seen, please follow the recommended escalation practices. Also, for tips and tricks in the contribution process you may want to read the Kubernetes contributor cheat sheet. We want to make sure your contribution gets all the attention it needs!

Thank you, and welcome to Kubernetes. 😃

@k8s-ci-robot
Contributor

Hi @chaochn47. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. labels Mar 22, 2022
@chaochn47 chaochn47 force-pushed the optional_maximum_limit_on_list_requests_to_etcd branch 2 times, most recently from 7deee45 to f9cea87 on April 5, 2022 19:24
@chaochn47 chaochn47 marked this pull request as ready for review April 5, 2022 19:27
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Apr 5, 2022
@k8s-ci-robot k8s-ci-robot requested a review from fedebongio April 5, 2022 19:27
@chaochn47
Author

/cc @wojtek-t @anguslees @shyamjvs @mborsz @ptabor @serathius PTAL, thx!

@MadhavJivrajani
Contributor

/ok-to-test

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Apr 6, 2022
@chaochn47 chaochn47 force-pushed the optional_maximum_limit_on_list_requests_to_etcd branch 2 times, most recently from 66ac306 to 3b09687 on April 6, 2022 17:13
# The following PRR answers are required at alpha release
# List the feature gate name and the components for which it must be enabled
feature-gates:
- name: MaximumListLimitOnEtcd
Member

A clearer name would be MaximumPageSizeLimitForEtcdLists imo. Let's see if others have better suggestions.

Author

That looks good to me.

Comment on lines 2 to 4
**Note:** When your KEP is complete, all of these comment blocks should be removed.

To get started with this template:
Member

Can we remove these comment blocks to make reviewing easier, given you've already filled in the details?

-->
Performing API queries that return all of the objects of a given resource type (GET /api/v1/pods, GET /api/v1/secrets) without pagination can lead to significant variations in peak memory use on etcd and APIServer.

This proposal covers an approach that splitting full resource list to protect etcd from out of memory crashing
Member

"an approach where the apiserver makes multiple paginated list calls to etcd instead of a single large unpaginated call to reduce the peak memory usage of etcd, thereby reducing its chances of Out-of-memory crashes"

^ would be clearer
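
For a concrete picture of the approach under discussion, here is a minimal sketch, assuming the etcd clientv3 API: the apiserver reads a key prefix in bounded pages instead of issuing one unpaginated range call. The function name, the page-size parameter, and the continuation convention are illustrative assumptions, not the KEP's actual implementation.

```go
package etcdpaging

import (
	"context"

	"go.etcd.io/etcd/api/v3/mvccpb"
	clientv3 "go.etcd.io/etcd/client/v3"
)

// listPrefixPaginated returns all key-values under prefix, fetching at most
// pageSize entries per etcd range request instead of one unbounded read.
func listPrefixPaginated(ctx context.Context, kv clientv3.KV, prefix string, pageSize int64) ([]*mvccpb.KeyValue, error) {
	var out []*mvccpb.KeyValue
	startKey := prefix
	rangeEnd := clientv3.GetPrefixRangeEnd(prefix) // e.g. "/registry/pods/" -> "/registry/pods0"
	for {
		resp, err := kv.Get(ctx, startKey,
			clientv3.WithRange(rangeEnd),
			clientv3.WithLimit(pageSize), // cap the size of each page returned by etcd
		)
		if err != nil {
			return nil, err
		}
		out = append(out, resp.Kvs...)
		if !resp.More || len(resp.Kvs) == 0 {
			return out, nil // last page reached
		}
		// Continue just after the last key of this page.
		startKey = string(resp.Kvs[len(resp.Kvs)-1].Key) + "\x00"
	}
}
```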


[documentation style guide]: https://github.com/kubernetes/community/blob/master/contributors/guide/style-guide.md
-->
Performing API queries that return all of the objects of a given resource type (GET /api/v1/pods, GET /api/v1/secrets) without pagination can lead to significant variations in peak memory use on etcd and APIServer.
Member

Apiserver memory usage is also mentioned here, but this proposal is primarily helping optimize etcd memory only. Worth making this clear and adding "reducing apiserver memory usage for unpaginated list calls" as a non-goal.

Member

One thing to note though is this may indirectly also help apiserver in setups where etcd is colocated with apiserver on the same instance. So reducing etcd mem usage can buy more memory for apiserver.

Author

One thing to note though is this may indirectly also help apiserver in setups where etcd is colocated with apiserver on the same instance. So reducing etcd mem usage can buy more memory for apiserver.

That's a really good point for other cloud providers to consider when deciding to use this feature.

know that this has succeeded?
-->

- APIServer is able to split large etcd range queries into smaller chunks
Member

IMO this is more of a "how" we're trying to achieve the goal listed in the next line rather than "what". It doesn't sound like a goal to me in that sense.

Comment on lines 585 to 586
- Feature gate name:
- Components depending on the feature gate:
Member

Add feature gate name here and also kube-apiserver as the component.

- Components depending on the feature gate:
- [ ] Other
- Describe the mechanism:
- Will enabling / disabling the feature require downtime of the control
Member

I believe this should be "Yes, this requires a restart of the apiserver instance. However, there shouldn't be downtime in HA setups where at least one replica is kept active at any given time during the update."

+cc @liggitt - is there a proper wording for this already?

and make progress.
-->

- Throttle large range requests on etcd server
Member

Can we generalize this as "Implementing any priority or fairness mechanisms for list calls made to etcd"?

-->

- Throttle large range requests on etcd server
- Streaming list response from etcd to APIServer in chunks
Member

"Streaming" and "chunks" seem to be a bit contradicting. We're actually proposing the latter in this KEP, but not the former. Let's make it clear.


- Throttle large range requests on etcd server
- Streaming list response from etcd to APIServer in chunks
- Fine-tune with [priority and fairness](https://github.com/kubernetes/enhancements/tree/master/keps/sig-api-machinery/1040-priority-and-fairness) or `max-requests-in-flight` to prevent etcd blowing up
Member

Suggest -> "Reduce list call load on etcd using priority & fairness settings on the apiserver"

Author

Sounds good. I will update it to "Reduce list call load on etcd using priority & fairness settings on the apiserver".

@chaochn47 chaochn47 force-pushed the optional_maximum_limit_on_list_requests_to_etcd branch from 3b09687 to ebe3587 on April 25, 2022 22:34
@chaochn47
Author

Previous comments are addressed and updated accordingly. PTAL @shyamjvs, thanks!

There are no additional comments from other reviewers. Is this KEP implementable, or are there major concerns about this approach? Please let me know, thanks.

Contributor

@sftim sftim left a comment

A few tweak suggestions

[experience reports]: https://github.com/golang/go/wiki/ExperienceReports
-->

Protect etcdfrom out of memory crashing and prevent its cascading unpredicable failures. APIServer and other downstream components recovering like re-creating gRPC connections to etcd, rebuilding all caches and resuming internal reconciler can be expensive. Lower etcd RAM consumption at peak time to allow auto scaler to increase memory budget in time.
Contributor

Suggested change
Protect etcdfrom out of memory crashing and prevent its cascading unpredicable failures. APIServer and other downstream components recovering like re-creating gRPC connections to etcd, rebuilding all caches and resuming internal reconciler can be expensive. Lower etcd RAM consumption at peak time to allow auto scaler to increase memory budget in time.
Protect etcd from out of memory crashing and prevent its cascading unpredictable failures. kube-apiserver and other downstream components recovering like re-creating gRPC connections to etcd, rebuilding all caches and resuming internal reconciler can be expensive. Lower etcd RAM consumption at peak time to allow autoscalers to increase their memory budget in time.


Protect etcdfrom out of memory crashing and prevent its cascading unpredicable failures. APIServer and other downstream components recovering like re-creating gRPC connections to etcd, rebuilding all caches and resuming internal reconciler can be expensive. Lower etcd RAM consumption at peak time to allow auto scaler to increase memory budget in time.

In Kubernetes, etcd only has a single client: APIServer. It is feasible to control how requests are sent from APIServer to etcd.
Contributor

Suggested change
In Kubernetes, etcd only has a single client: APIServer. It is feasible to control how requests are sent from APIServer to etcd.
In Kubernetes, etcd only has a single client: kube-apiserver. It is feasible to control how requests are sent from kube-apiserver to etcd.

### Desired outcome
By default, this proposal is a no-op.
If this feature is opted-in and `max-list-etcd-limit` flag on APIServer is set to `x` where `x` >= 500
- KubeAPIServer splits requests to etcd into multiple pages. The max etcd page size is `x`. If user provided the limit `y` is smaller or equal to `x`, those requests are intact.
Contributor

Suggested change
- KubeAPIServer splits requests to etcd into multiple pages. The max etcd page size is `x`. If user provided the limit `y` is smaller or equal to `x`, those requests are intact.
- kube-apiserver splits requests to etcd into multiple pages. The max etcd page size is `x`. If user provided the limit `y` is smaller or equal to `x`, those requests are intact.
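
For illustration only, here is a hedged sketch of the clamping rule this bullet describes, with `maxEtcdPageSize` standing in for the value of the proposed `max-list-etcd-limit` flag; the function name and the "0 means unlimited" convention are assumptions, not text from the KEP.

```go
package etcdpaging

// effectivePageLimit returns the per-page limit the apiserver would send to
// etcd. A client-supplied limit of 0 means "unlimited".
func effectivePageLimit(clientLimit, maxEtcdPageSize int64) int64 {
	if maxEtcdPageSize <= 0 {
		return clientLimit // feature disabled: pass the client's limit through
	}
	if clientLimit > 0 && clientLimit <= maxEtcdPageSize {
		return clientLimit // already within the cap: request stays intact
	}
	return maxEtcdPageSize // unlimited or oversized: split into capped pages
}
```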


### Desired outcome
By default, this proposal is a no-op.
If this feature is opted-in and `max-list-etcd-limit` flag on APIServer is set to `x` where `x` >= 500
Contributor

Suggested change
If this feature is opted-in and `max-list-etcd-limit` flag on APIServer is set to `x` where `x` >= 500
If the relevant feature gate is enabled and the `max-list-etcd-limit` command line argument to kube-apiserver is set to `x`, where `x >= 500`, then:


#### Story 1
A user deployed an application as a DaemonSet that queries the pods belonging to the node it is running on. This translates to an etcd range query from `/registry/pods/` to `/registry/pods0`, and each such query payload is close to 2 GiB.
```
Contributor

nit, you could mark this as JSON.
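
As a hypothetical illustration of the range read behind Story 1, assuming the etcd clientv3 API: a "pods on my node" list still scans the entire `/registry/pods/` keyspace in etcd, because field selectors are evaluated in the apiserver rather than in etcd. The function name is an assumption for this sketch.

```go
package etcdpaging

import (
	"context"

	clientv3 "go.etcd.io/etcd/client/v3"
)

// listAllPods issues the single unpaginated range read described in Story 1.
func listAllPods(ctx context.Context, kv clientv3.KV) (*clientv3.GetResponse, error) {
	return kv.Get(ctx,
		"/registry/pods/",
		clientv3.WithRange("/registry/pods0"), // exclusive end of the pods prefix
	)
}
```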

@chaochn47 chaochn47 force-pushed the optional_maximum_limit_on_list_requests_to_etcd branch from ebe3587 to 04817e8 on April 26, 2022 17:48
@chaochn47 chaochn47 force-pushed the optional_maximum_limit_on_list_requests_to_etcd branch from 04817e8 to cecca3b on April 26, 2022 17:52
@k8s-ci-robot k8s-ci-robot removed the lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. label Oct 10, 2022
@lavalamp
Member

I think this effort would be better spent encouraging clients to paginate. Fixing apiserver <-> etcd is at best half of the problem. I think this proposal is assuming that if etcd didn't fall over everything would be fine, but actually if etcd doesn't fall over first, then apiserver probably will, since it still buffers.

Encouraging clients to paginate looks like:

  • Fixing in-tree tools / libraries to all paginate (a client-go pagination sketch follows below)
  • Limiting non-paginated requests via APF
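
As a hedged illustration of the first bullet (fixing clients to paginate), assuming client-go and an example page size of 500; the function name and the visit callback are assumptions, not code from any in-tree tool.

```go
package clientpaging

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// forEachPod walks all pods in pages via the Continue token instead of one
// unpaginated LIST call.
func forEachPod(ctx context.Context, cs kubernetes.Interface, visit func(name string)) error {
	opts := metav1.ListOptions{Limit: 500} // example page size
	for {
		pods, err := cs.CoreV1().Pods(metav1.NamespaceAll).List(ctx, opts)
		if err != nil {
			return err
		}
		for i := range pods.Items {
			visit(pods.Items[i].Name)
		}
		if pods.Continue == "" {
			return nil // no more pages
		}
		opts.Continue = pods.Continue // resume from where the last page ended
	}
}
```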

@sftim
Contributor

sftim commented Nov 29, 2022

limiting non-paginated requests via APF

Those restrictions might include a maximum page size for the Kubernetes API (not etcd), I think.

@lavalamp
Member

Those restrictions might include a maximum page size for the Kubernetes API (not etcd), I think.

APF limits concurrency. We can't put a maximum like that in without breaking clients. (and it would not be an APF setting, either)

@sftim
Contributor

sftim commented Nov 29, 2022

Perhaps a different limit under the broad umbrella of “fairness”.

@lavalamp
Member

Fairness refers to requests of a given source in a given priority level having a chance of service inversely proportional to the number of request sources rather than the total number of requests.

We can't put a hard cap on unpaginated requests, but we can shuffle them to their own PL and make its max concurrency 1. That will slow them down enough to let apiserver (and etcd) survive, and encourage the end user to update to a version that paginates.

Before we can do that we need to make it maximally easy to get on versions of clients that paginate list requests.

This proposal won't help with any of that.

@chaochn47
Author

chaochn47 commented Nov 30, 2022

Thanks @lavalamp for the review!! I have some follow up questions.

Fixing in-tree tools / libraries to all paginate

I doubt this is possible in practice. The APIListChunking feature was added a few years ago.

From the perspective of a Kubernetes platform provider like GKE or EKS, there is a need for some protection mechanism on the platform itself against naive clients. This proposal provides a way to paginate the requests internally in the apiserver to protect the platform (control plane) and save end users from their mistakes.

Limiting non-paginated requests via APF

Can APF configuration distinguish between paginated and non-paginated requests? If yes, that would be very useful.

This setting has to be on by default; it cannot be applied reactively once etcd has already run out of memory because of such a traffic storm.

@lavalamp
Member

Please see this for an approach that can work: #3667

Yes, clients are going to have to change. We can't rescue apiserver memory from unpaginated list calls. Paginating the call between apiserver and etcd won't help with this at all -- the {apiserver, etcd} system will still fail, just in a different place.

Is APF configuration based on requests paginated or not?

Today it doesn't distinguish, IIRC. But ask in the slack channel: https://kubernetes.slack.com/archives/C01G8RY7XD3

We likely have to make APF aware somehow, so that we can throttle the most expensive requests the most. But it's not fair to do that until we have some actually working alternative for clients to switch to.

@linxiulei

@chaochn47 If the cluster has a few very large objects, the maximum limit doesn't necessarily avoid peak memory use. Also, for native clients that don't support pagination, do you intend to break them?

We likely have to make APF aware somehow, so that we can throttle the most expensive requests the most. But it's not fair to do that until we have some actually working alternative for clients to switch to.

I am intrigued to know how this will be implemented, but I suspect there would be some issues. For example, to know that a request is expensive, the apiserver may have to query etcd first; by then the damage to etcd is already done, while other expensive requests keep coming through the apiserver.

@lavalamp
Member

For example, to know that a request is expensive, the apiserver may have to query etcd first; by then the damage to etcd is already done, while other expensive requests keep coming through the apiserver.

No, APF will eventually estimate based on other requests in the priority level. And a surprisingly expensive request will occupy seats for a long time, reducing throughput until it finishes.

@shyamjvs
Member

Agree with @lavalamp about leveraging APF long-term for limiting the etcd list throughput.

One issue though (which we haven't discussed above) is that a change was made a while ago (kubernetes/kubernetes#108569) which increases the apiserver -> etcd list page size exponentially until enough objects matching a field selector are found. This is done to satisfy the client-specified "limit" and can end up listing pretty much everything from etcd if it's a rare selector, e.g. "list pods on my node" by the kubelet.

A mechanism to cap etcd page size can work in tandem with that change to bound "max damage" a given request can make to etcd. I'm not sure how this could be handled through APF alone (especially given the variability of label/field selectors). Thoughts?
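
As a sketch of how the two could work in tandem (assumed names, not the apiserver's actual code): the page limit still grows exponentially while a rare selector matches little, but never beyond the configured cap, bounding the "max damage" of any single etcd read.

```go
package etcdpaging

// nextPageLimit grows the etcd page size when the previous page matched too
// few objects, capped at maxEtcdPageSize (0 means no cap).
func nextPageLimit(current, maxEtcdPageSize int64) int64 {
	next := current * 2 // exponential growth, as in kubernetes/kubernetes#108569
	if maxEtcdPageSize > 0 && next > maxEtcdPageSize {
		next = maxEtcdPageSize // bound the size of any single etcd request
	}
	return next
}
```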

@linxiulei

I feel that limiting by count cannot cap the "max damage" precisely, so I have been working on a proposal to allow limiting by response size: etcd-io/etcd#14810

@lavalamp
Member

lavalamp commented Jan 3, 2023

I think just a max page size should be sufficient to solve that. There is also no reason for it to grow exponentially.

@ptabor

ptabor commented Jan 19, 2023

I think just a max page size should be sufficient to solve that. There is also no reason for it to grow exponentially.

@lavalamp: is the max page size in bytes or in "entries"?

I see benefits of a max-page-bytes-size:

Taking into consideration the (not guaranteed) upper-bound limit of 10k entries established in kubernetes/kubernetes#108569, and a not-that-big object size of 30 KB, we get 300 MB in a single request (and the theoretical ability to read the full etcd state (8 GB) in a single request when each entry is 1 MB).

8 to ~32 MB per gRPC response sounds like a better balance between throughput and memory consumption.

And ultimately a streaming range (like etcd-io/etcd#12343) would move the problem to the gRPC/HTTP2 flow-control layer.

@lavalamp
Member

If we really wanted to, we could compute a target number of entries for all pages but the first based on the average object size seen so far. But I think that's not worth the effort either.

The only thing that actually solves the problem is getting people to use the watch w/ catch up entries: #3157

Effort spent on this problem that isn't spent on that, or on getting clients to adopt it along with pagination, has a poor effort-to-reward ratio IMO.
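
For completeness, a sketch of the sizing idea mentioned above (computing later page sizes from the average object size seen so far), which lavalamp considers not worth the effort. All names, the byte budget, and the fallback page size are assumptions, not an agreed design.

```go
package etcdpaging

// adaptivePageLimit derives an entry count for the next page from a byte
// budget and the average object size observed on earlier pages.
func adaptivePageLimit(bytesSeen, entriesSeen, targetPageBytes int64) int64 {
	if entriesSeen == 0 {
		return 500 // first page: no observations yet, fall back to a default
	}
	avg := bytesSeen / entriesSeen
	if avg <= 0 {
		avg = 1
	}
	limit := targetPageBytes / avg
	if limit < 1 {
		limit = 1 // always fetch at least one entry, even for huge objects
	}
	return limit
}
```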

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all PRs.

This bot triages PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

  • Mark this PR as fresh with /remove-lifecycle stale
  • Close this PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Apr 19, 2023
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all PRs.

This bot triages PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

  • Mark this PR as fresh with /remove-lifecycle rotten
  • Close this PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels May 19, 2023
@chaochn47
Author

/remove-lifecycle rotten

@k8s-ci-robot k8s-ci-robot removed the lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. label May 24, 2023
@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all PRs.

This bot triages PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

  • Mark this PR as fresh with /remove-lifecycle stale
  • Close this PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jan 19, 2024
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all PRs.

This bot triages PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

  • Mark this PR as fresh with /remove-lifecycle rotten
  • Close this PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Feb 18, 2024
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

  • Reopen this PR with /reopen
  • Mark this PR as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close

@k8s-ci-robot
Contributor

@k8s-triage-robot: Closed this PR.

In response to this:

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

  • Reopen this PR with /reopen
  • Mark this PR as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
