
Vector making inefficient api GETs in large K8s clusters #7943

Closed
Tracked by #10016
uruddarraju opened this issue Jun 19, 2021 · 8 comments · Fixed by #9974
Labels: `platform: kubernetes`, `source: kubernetes_logs`, `type: bug`

Comments

@uruddarraju

uruddarraju commented Jun 19, 2021

Vector Version

vector version: `Vector/0.14.0 (x86_64-unknown-linux-gnu 5f3a319 2021-06-03)`

Context

While investigating some API latency regressions in our production Kubernetes clusters (>2k nodes, tens of thousands of pods), we found Vector making the following watch/list calls, which don't appear to set resourceVersion.

Most kube clients deployed across a cluster set their resourceVersion to 0 when they can tolerate weak consistency, asking the API server to serve list results from its cache. This avoids an etcd range request, which is strongly consistent and significantly more costly. This behavior is also built into client-go for all Golang clients via the informer factory, which implements the Go reflector analogous to the Rust one you have here: https://github.com/kubernetes/client-go/blob/master/tools/cache/reflector.go#L286-L299
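For illustration only (this is not Vector's actual client code), here is a minimal Rust sketch of the request-level difference; the `build_pod_watch_url` helper and its parameters are hypothetical:

```rust
/// Hypothetical helper illustrating the request-level difference.
/// Passing resource_version = Some("0") asks the apiserver to serve the
/// list/watch from its watch cache; omitting the parameter (None) forces
/// a strongly consistent range read against etcd.
fn build_pod_watch_url(node_name: &str, resource_version: Option<&str>) -> String {
    let mut url = format!(
        "/api/v1/pods?watch=true&allowWatchBookmarks=true\
         &fieldSelector=spec.nodeName%3D{node}\
         &labelSelector=vector.dev%2Fexclude%21%3Dtrue&timeoutSeconds=290",
        node = node_name
    );
    if let Some(rv) = resource_version {
        url.push_str(&format!("&resourceVersion={}", rv));
    }
    url
}

fn main() {
    // No resourceVersion: the apiserver performs a quorum read against etcd,
    // matching the "rv=" log line below.
    println!("{}", build_pod_watch_url("hello-world-node", None));
    // resourceVersion=0: the apiserver can answer from its watch cache.
    println!("{}", build_pod_watch_url("hello-world-node", Some("0")));
}
```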

Expected Behavior

We expect all LIST/GET/WATCH calls from Vector to hit the apiserver cache rather than etcd directly, thereby reducing load on etcd.

Actual Behavior

Some WATCH calls miss the apiserver cache and hit etcd directly.

Kubernetes API Server logs

I0619 21:39:44.679278       1 wrap.go:47] GET /api/v1/pods?&allowWatchBookmarks=true&fieldSelector=spec.nodeName%3Dhello-world-node&labelSelector=vector.dev%2Fexclude%21%3Dtrue&timeoutSeconds=290&watch=true: (4m50.000539436s) 200 [Vector/0.14.0 (x86_64-unknown-linux-gnu 5f3a319 2021-06-03) hello-world-node:37716]
I0619 21:39:45.682526       1 get.go:250] Starting watch for /api/v1/pods, rv=1725125265 labels=vector.dev/exclude!=true fields=spec.nodeName=hello-world-node timeout=4m50s
I0619 21:39:45.682685       1 wrap.go:47] GET /api/v1/pods?&allowWatchBookmarks=true&fieldSelector=spec.nodeName%3Dhello-world-node&labelSelector=vector.dev%2Fexclude%21%3Dtrue&resourceVersion=1725125265&timeoutSeconds=290&watch=true: (691.548µs) 200 [Vector/0.14.0 (x86_64-unknown-linux-gnu 5f3a319 2021-06-03) hello-world-node:37716]
I0619 21:39:46.684515       1 get.go:250] Starting watch for /api/v1/pods, rv= labels=vector.dev/exclude!=true fields=spec.nodeName=hello-world-node timeout=4m50s

I have never contributed to a Rust project and don't think I'm the best person to make this change, but I believe changing the default resource_version from None to 0 for the first call above would help a great deal here: https://github.com/timberio/vector/blob/ab697bae210151ac3eb0be1de3e435f34dec6407/src/kubernetes/reflector.rs#L163
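As a rough sketch of the idea only (the actual reflector code linked above differs; the `WatchState` type below is hypothetical), the proposal amounts to defaulting the stored resource version to `"0"` rather than `None`:

```rust
/// Hypothetical reflector state illustrating the proposed default; the real
/// field and type names in src/kubernetes/reflector.rs differ.
struct WatchState {
    /// Resource version to start the initial LIST/WATCH from. Some("0")
    /// lets the apiserver answer from its cache; None forces a quorum
    /// read from etcd.
    resource_version: Option<String>,
}

impl Default for WatchState {
    fn default() -> Self {
        Self {
            // Proposed change: default to "0" instead of None.
            resource_version: Some("0".to_string()),
        }
    }
}
```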

@uruddarraju uruddarraju added the type: bug A code related bug. label Jun 19, 2021
@spencergilbert spencergilbert self-assigned this Jun 21, 2021
@spencergilbert
Contributor

@uruddarraju Thanks! I'll dive into our code and the dependencies we use to try to improve the behavior there.

@jpdstan

jpdstan commented Jun 21, 2021

+1

@jszwedko jszwedko added platform: kubernetes Anything `kubernetes` platform related source: kubernetes_logs Anything `kubernetes_logs` source related labels Jun 21, 2021
@spencergilbert
Contributor

Hey @uruddarraju, is there a particular env/config setup you used to get those logs? I'm trying to test the change locally but I'm having trouble reproducing. Thanks!

@MOZGIII
Contributor

MOZGIII commented Sep 17, 2021

Oh, nice feedback, I was wondering about the behavior in large-cluster cases. I'd say we should use the cache.

Changing to Some(0) here should do the job.

@jpdstan

jpdstan commented Sep 23, 2021

@spencergilbert Wdyt^?

@spencergilbert
Contributor

👍 Sounds good, I'll try to find the time to put that PR together unless someone else would like to submit one.

@ZhiminXiang

We experienced a similar issue with Vector 0.22.1. Per our investigation with the GKE team, they mentioned that resource_version was probably not set to 0.

I saw that PR #11714 adopted kube-rs, which dropped the change from #9974. I am curious whether any change is needed to set resource_version again, or whether it is already handled in kube-rs. Apologies if this is a naive question; I am not familiar with Rust.

@spencergilbert
Contributor

The resource version should be handled by the underlying library. If you're still seeing this issue with a Vector version that uses kube-rs, could you open a new bug report with details?
