
Vector making excessive API requests to Kubernetes API server on 429/500 HTTP responses #16798

Closed
jeremy-mi-rh opened this issue Mar 14, 2023 · 11 comments
Labels
source: kubernetes_logs (anything `kubernetes_logs` source related)
type: bug (a code related bug)

Comments

@jeremy-mi-rh

jeremy-mi-rh commented Mar 14, 2023

A note for the community

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment

Problem

This splits issue #16753 out into its own issue.

Context

We use Vector to deliver kubernetes_logs to our Kafka cluster, from which the logs are later processed and ingested into Humio. Vector is deployed as a DaemonSet in our Kubernetes clusters (each with >1000 nodes running).

We recently had an outage in one of our Kubernetes clusters (~1100 nodes running). A failure in the etcd leader node triggered a cascading failure in which pods made roughly 1000x more API calls to our API server, eventually bringing the Kubernetes control plane down entirely.

During remediation, we identified Vector as one of the components hammering the API server. Shutting down Vector along with a few other DaemonSets eventually reduced the traffic on the control plane components, which allowed the etcd nodes to recover.

Issue: Need a more aggressive backoff strategy?

The issue we found is that Vector made far more requests when it received unsuccessful responses from the Kube API server. Some increase is expected, since it needs to retry, but in some cases we saw roughly 1000x more requests.

[Screenshot: API requests per minute over time, 2023-03-09]

Before 17:45 the traffic was fairly steady, at roughly 1–300 requests per minute. Once the etcd server started having issues, Vector began retrying very aggressively, reaching as many as 200,000 requests per minute. Is there a way to configure the backoff strategy in this case? Or should Vector retry less aggressively by default?

Also attached is the same graph filtered on 429 response codes:

[Screenshot: API requests per minute filtered on 429 responses, 2023-03-09]
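
To illustrate the backoff behavior we are asking about: a capped exponential backoff between retries would keep the retry rate bounded even while the API server is struggling. Below is a minimal sketch of the idea; it is a hypothetical helper, not Vector's actual code, and the `retry_with_backoff` name, the constants, and the tokio dependency are assumptions for illustration only.

```rust
// A minimal sketch of a capped exponential backoff around a retryable call.
// NOT Vector's actual implementation; the constants and the generic `call`
// closure are placeholders, and tokio is assumed as the async runtime.
use std::time::Duration;

const BASE_DELAY: Duration = Duration::from_millis(500);
const MAX_DELAY: Duration = Duration::from_secs(30);

async fn retry_with_backoff<F, Fut, T, E>(mut call: F) -> T
where
    F: FnMut() -> Fut,
    Fut: std::future::Future<Output = Result<T, E>>,
{
    let mut attempt: u32 = 0;
    loop {
        match call().await {
            Ok(value) => return value,
            Err(_) => {
                // Double the delay after every failure, capped at MAX_DELAY,
                // so a struggling API server is not flooded with retries.
                let delay = BASE_DELAY
                    .checked_mul(2u32.saturating_pow(attempt))
                    .map(|d| d.min(MAX_DELAY))
                    .unwrap_or(MAX_DELAY);
                tokio::time::sleep(delay).await;
                attempt = attempt.saturating_add(1);
            }
        }
    }
}
```

With something like this, a watcher that keeps failing settles at roughly two requests per minute per node (given the assumed 30-second cap) instead of retrying in a tight loop.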

Configuration

No response

Version

vector 0.27.0 (x86_64-unknown-linux-gnu 5623d1e 2023-01-18)

References

#7943
#16753

@nabokihms
Contributor

@jeremy-mi-rh Hello! Just to share: to prevent overloading the Kubernetes API on restarts, we added a FlowSchema with a PriorityLevelConfiguration just for Vector pods, and it works like a charm! It helped us a lot with overloading.

The code snippet can be found here:
https://github.com/deckhouse/deckhouse/blob/7da80720b8cba25fa6646ce6e826f86bbad1d3fe/modules/460-log-shipper/templates/flowcontrol.yaml#L4-L42

The other option was using resourceVersion=0 for requests, which is unavailable in kube-rs (which is why we decided to go with the FlowSchema variant).

@jeremy-mi-rh
Author

Thanks for sharing! @nabokihms

Internally we are looking to implement/enforce APF in our clusters as well. As of now we don't have any, so it will take some time to get there. It's great to hear that flow control helps with this use case; that definitely motivates us to adopt it.

Other than enforcing APF, would it be possible to let users configure the backoff strategy from Vector's configuration?

@nabokihms
Contributor

In Kubernetes client-go, throttling is embedded in the client itself, but I don't think kube-rs has anything comparable at the moment.

In my humble opinion, users should not need to worry about the backoff policy. It should work out of the box, and if it does not, it should be fixed on the Vector side, not through configuration.
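
To illustrate what I mean by throttling being embedded in the client: client-go essentially puts a QPS + burst token bucket in front of every outgoing request. The following is only a rough sketch of that idea, not an actual client-go or kube-rs API, and the numbers are hypothetical.

```rust
use std::time::{Duration, Instant};

/// A QPS + burst token bucket, in the spirit of client-go's client-side
/// rate limiter. For illustration only; not a real client-go or kube-rs type.
struct TokenBucket {
    capacity: f64,       // maximum burst size
    tokens: f64,         // available tokens (goes negative while callers queue)
    refill_per_sec: f64, // steady-state QPS
    last_refill: Instant,
}

impl TokenBucket {
    fn new(qps: f64, burst: f64) -> Self {
        Self {
            capacity: burst,
            tokens: burst,
            refill_per_sec: qps,
            last_refill: Instant::now(),
        }
    }

    /// Reserve one request and return how long the caller should wait
    /// before actually sending it.
    fn reserve(&mut self) -> Duration {
        // Refill tokens based on elapsed time, capped at the burst size.
        let now = Instant::now();
        let elapsed = now.duration_since(self.last_refill).as_secs_f64();
        self.tokens = (self.tokens + elapsed * self.refill_per_sec).min(self.capacity);
        self.last_refill = now;

        // Take one token; if the bucket goes below zero, the caller must
        // wait for the deficit to be refilled.
        self.tokens -= 1.0;
        if self.tokens >= 0.0 {
            Duration::ZERO
        } else {
            Duration::from_secs_f64(-self.tokens / self.refill_per_sec)
        }
    }
}

fn main() {
    // Hypothetical numbers: 5 requests/second steady state, bursts of up to 10.
    let mut limiter = TokenBucket::new(5.0, 10.0);
    for i in 0..15 {
        let wait = limiter.reserve();
        std::thread::sleep(wait);
        // A real client would issue the API request here.
        println!("request {i} sent after waiting {wait:?}");
    }
}
```

client-go exposes this knob as the QPS and Burst settings on its rest.Config; the sketch above is only meant to show the shape of the mechanism.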

@nabokihms
Contributor

nabokihms commented Apr 5, 2023

Added default backoff to initial list request for all watchers: #17009

It turned out that, in case of an error, Vector immediately retried the list request. The problem with 200,000 requests per minute should be resolved now.

@jeremy-mi-rh
Author

Wow, thanks for the quick fix! This will be very helpful to all Vector users of the Kubernetes sources!

Once the release is out, we will test it in our cluster and share the results with the community!

@neuronull
Contributor

👋 Hi all, just wanted to double-check before closing this manually: does #17009 fully address the issue described here?
Thanks

@nabokihms
Contributor

@neuronull I'd like to keep this open for now until the new Vector version is released and all interested parties have had a chance to test it (if you don't mind).

@neuronull
Contributor

Ah, I missed that bit but see it now. Sounds good 👍 and thanks for your work on this!

@skygrammas

Hi all, we're hitting a similar issue that took out a production cluster (>1000 nodes). I'm curious which version of Vector this change is intended to be released in.

@spencergilbert
Contributor

@skygrammas this looks to have been released in 0.29

@jszwedko
Member

Closing, since #17009 has been released for a few versions now. Thanks again, @nabokihms!
