Vector making inefficient API GETs in large K8s clusters #7943
Comments
@uruddarraju Thanks! I'll take a dive into our code there and the dependencies we use to try and improve the behavior there.
+1
Hey @uruddarraju is there a particular env/config setup you used to get those logs? I'm trying to test the change locally but I'm having trouble reproducing. Thanks!
Oh, nice feedback, I was wondering about the behavior in large-cluster cases. I'd say we should use the cache by changing vector/src/kubernetes/resource_version.rs (line 13 in ab697ba) to `0`.
@spencergilbert Wdyt^?
👍 Sounds good, I'll try to find the time to open that PR unless someone else would like to submit one.
We experienced a similar issue with Vector 0.22.1. Per our investigation with the GKE team, they mentioned that the resource_version was probably not set to 0. I saw that PR #11714 adopted kube-rs, which dropped the change from #9974. I am curious whether any change is needed to set the resource_version again, or whether it is already handled in kube-rs. Apologies if this is a naive question; I am not familiar with Rust.
The resource version should just be handled by the underlying library now - if you're still seeing this (or a related) issue with a version that uses kube-rs, please open a new issue.
Vector Version
Context
While investigating some API latency regressions in our production Kubernetes clusters, which have >2k nodes and tens of thousands of pods, we found Vector making the following watch/list calls that don't appear to set resourceVersion.
Most kube clients deployed across a cluster set their resourceVersion to 0 when they can tolerate weak consistency, asking the API server to return list results from its cache and thereby avoiding an etcd range request, which is strongly consistent and significantly more costly. This is also built into client-go for all Golang clients via the informer factory, which implements the Golang reflector equivalent to what you have here for Rust: https://github.com/kubernetes/client-go/blob/master/tools/cache/reflector.go#L286-L299
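To make the difference concrete, here is a minimal sketch (not Vector's code; the server address, namespace, and helper function are placeholders) of how the `resourceVersion` query parameter changes a pod LIST request. Leaving it unset forces a quorum read from etcd, while `resourceVersion=0` lets the API server answer from its watch cache.

```rust
// Sketch only: how the `resourceVersion` query parameter changes the
// consistency (and cost) of a LIST call against the Kubernetes API server.
fn pod_list_url(api_server: &str, namespace: &str, resource_version: Option<&str>) -> String {
    let mut url = format!("{api_server}/api/v1/namespaces/{namespace}/pods?limit=500");
    if let Some(rv) = resource_version {
        // resourceVersion=0 means "any version is acceptable", so the API
        // server can serve the list from its watch cache instead of issuing
        // a quorum (strongly consistent) range request to etcd.
        url.push_str(&format!("&resourceVersion={rv}"));
    }
    url
}

fn main() {
    // Strongly consistent: triggers an etcd range request.
    println!("{}", pod_list_url("https://kube-apiserver.example:6443", "default", None));
    // Weakly consistent: served from the API server cache.
    println!("{}", pod_list_url("https://kube-apiserver.example:6443", "default", Some("0")));
}
```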
Expected Behavior
We expect all LIST/GET/WATCH calls from Vector to hit the apiserver cache rather than etcd directly, thereby reducing the load on etcd.
Actual Behavior
Some WATCH calls miss the apiserver cache and hit etcd directly.
Kubernetes API Server logs
I have never been a contributor to a Rust project and don't think I'd be the best one to make this change, but I think changing the default `resource_version` from `None` to `0` for the first call above would help to a large extent here: https://github.com/timberio/vector/blob/ab697bae210151ac3eb0be1de3e435f34dec6407/src/kubernetes/reflector.rs#L163