
Vector making inefficient api GETs in large K8s clusters #7943

Closed
Tracked by #10016
uruddarraju opened this issue Jun 19, 2021 · 8 comments · Fixed by #9974
Labels: `platform: kubernetes`, `source: kubernetes_logs`, `type: bug`

Comments

@uruddarraju

uruddarraju commented Jun 19, 2021

Vector Version

vector version: `Vector/0.14.0 (x86_64-unknown-linux-gnu 5f3a319 2021-06-03)`

Context

While investigating some API latency regressions in our production Kubernetes clusters (>2k nodes, tens of thousands of pods), we found Vector making the following watch/list calls, which don't appear to set resourceVersion.

Most kube clients deployed across a cluster set their resourceVersion to 0 when they can tolerate weak consistency, asking the API server to serve list results from its cache. This avoids an etcd range request, which is strongly consistent and significantly more costly. This behavior is also built into client-go for all Golang clients via the informer factory, which implements the Go reflector analogous to the Rust one you have here: https://github.com/kubernetes/client-go/blob/master/tools/cache/reflector.go#L286-L299
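For illustration only (this is not Vector's actual client code), here is a minimal Rust sketch of the request-level difference; the `build_pod_watch_url` helper and its parameters are hypothetical:

```rust
/// Hypothetical helper illustrating the request-level difference.
/// Passing resource_version = Some("0") asks the apiserver to serve the
/// list/watch from its watch cache; omitting the parameter (None) forces
/// a strongly consistent range read against etcd.
fn build_pod_watch_url(node_name: &str, resource_version: Option<&str>) -> String {
    let mut url = format!(
        "/api/v1/pods?watch=true&allowWatchBookmarks=true\
         &fieldSelector=spec.nodeName%3D{node}\
         &labelSelector=vector.dev%2Fexclude%21%3Dtrue&timeoutSeconds=290",
        node = node_name
    );
    if let Some(rv) = resource_version {
        url.push_str(&format!("&resourceVersion={}", rv));
    }
    url
}

fn main() {
    // No resourceVersion: the apiserver performs a quorum read against etcd,
    // matching the "rv=" log line below.
    println!("{}", build_pod_watch_url("hello-world-node", None));
    // resourceVersion=0: the apiserver can answer from its watch cache.
    println!("{}", build_pod_watch_url("hello-world-node", Some("0")));
}
```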

Expected Behavior

We expect all LIST/GET/WATCH calls from Vector to hit the apiserver cache rather than etcd directly, thereby reducing load on etcd.

Actual Behavior

Some WATCH calls miss the apiserver cache and hit etcd directly.

Kubernetes API Server logs

I0619 21:39:44.679278       1 wrap.go:47] GET /api/v1/pods?&allowWatchBookmarks=true&fieldSelector=spec.nodeName%3Dhello-world-node&labelSelector=vector.dev%2Fexclude%21%3Dtrue&timeoutSeconds=290&watch=true: (4m50.000539436s) 200 [Vector/0.14.0 (x86_64-unknown-linux-gnu 5f3a319 2021-06-03) hello-world-node:37716]
I0619 21:39:45.682526       1 get.go:250] Starting watch for /api/v1/pods, rv=1725125265 labels=vector.dev/exclude!=true fields=spec.nodeName=hello-world-node timeout=4m50s
I0619 21:39:45.682685       1 wrap.go:47] GET /api/v1/pods?&allowWatchBookmarks=true&fieldSelector=spec.nodeName%3Dhello-world-node&labelSelector=vector.dev%2Fexclude%21%3Dtrue&resourceVersion=1725125265&timeoutSeconds=290&watch=true: (691.548µs) 200 [Vector/0.14.0 (x86_64-unknown-linux-gnu 5f3a319 2021-06-03) hello-world-node:37716]
I0619 21:39:46.684515       1 get.go:250] Starting watch for /api/v1/pods, rv= labels=vector.dev/exclude!=true fields=spec.nodeName=hello-world-node timeout=4m50s

I have never contributed to a Rust project and don't think I'm the best person to make this change, but I believe changing the default resource_version from None to 0 for the first call above would help a great deal here: https://github.com/timberio/vector/blob/ab697bae210151ac3eb0be1de3e435f34dec6407/src/kubernetes/reflector.rs#L163
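As a rough sketch of the idea only (the actual reflector code linked above differs; the `WatchState` type below is hypothetical), the proposal amounts to defaulting the stored resource version to `"0"` rather than `None`:

```rust
/// Hypothetical reflector state illustrating the proposed default; the real
/// field and type names in src/kubernetes/reflector.rs differ.
struct WatchState {
    /// Resource version to start the initial LIST/WATCH from. Some("0")
    /// lets the apiserver answer from its cache; None forces a quorum
    /// read from etcd.
    resource_version: Option<String>,
}

impl Default for WatchState {
    fn default() -> Self {
        Self {
            // Proposed change: default to "0" instead of None.
            resource_version: Some("0".to_string()),
        }
    }
}
```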

@uruddarraju uruddarraju added the type: bug A code related bug. label Jun 19, 2021
@spencergilbert spencergilbert self-assigned this Jun 21, 2021
@spencergilbert
Contributor

@uruddarraju Thanks! I'll dive into our code and the dependencies we use to try to improve the behavior there.

@jpdstan

jpdstan commented Jun 21, 2021

+1

@jszwedko jszwedko added platform: kubernetes Anything `kubernetes` platform related source: kubernetes_logs Anything `kubernetes_logs` source related labels Jun 21, 2021
@spencergilbert
Contributor

Hey @uruddarraju, is there a particular env/config setup you used to get those logs? I'm trying to test the change locally but I'm having trouble reproducing. Thanks!

@MOZGIII
Contributor

MOZGIII commented Sep 17, 2021

Oh, nice feedback, I was wondering about the behavior in large-cluster cases. I'd say we should use the cache.

Changing to Some(0) here should do the job.

@jpdstan

jpdstan commented Sep 23, 2021

@spencergilbert Wdyt^?

@spencergilbert
Contributor

👍 Sounds good, I'll try to find the time to put that PR together unless someone else would like to submit one.

@ZhiminXiang

We experienced a similar issue with Vector 0.22.1. Per our investigation with the GKE team, they mentioned that resource_version was probably not set to 0.

I saw that PR #11714 adopted kube-rs, which dropped the change from #9974. I am curious whether any change is needed to set resource_version again, or whether it is already handled in kube-rs. Apologies if this is a naive question; I am not familiar with Rust.

@spencergilbert
Contributor

The resource version should be handled by the underlying library. If you're still seeing this issue with a Vector version that uses kube-rs, could you open a new bug report with details?
