Set resourceVersion=0 for GET/LIST calls by default #4670

marseel · 2022-12-13T12:07:01Z

Is your enhancement related to a problem? Please describe

Yes,
During investigation of one of clusters I've noticed that https://github.com/strimzi/strimzi-kafka-operator is putting significant load on etcd (few hundreds of MB/s just from LIST calls). Strimzi is using fabric8 as k8s client. By default, LIST calls coming from Fabric8 do not contain resourceVersion specified.

From SIG Scalability perspective it would make sense to change default behaviour and set resourceVersion to "0". This will allow requests to be processed by Kubernetes API Server without interaction with etcd, which will significantly improve reliability of Kubernetes Control Plane.

Describe the solution you'd like

Change default behaviour of GET/LIST calls in Fabric8 to use resourceVersion=0.

Describe alternatives you've considered

No response

Additional context

I will followup with strimzi maintainers to solve this issue by overriding resourceVersion=0 and also using list and watch pattern, but I believe it would be beneficial to change this default behaviour for all users of Fabric8.

shawkins · 2022-12-13T12:33:22Z

For LIST this request primarily would be targeted at informers - direct usage can specify ListOptions. The relevant line from the go client: https://sourcegraph.com/github.com/kubernetes/client-go/-/blob/tools/cache/reflector.go?L644&subtree=true - we can do the same for the initial listing. I'll have to double check their other logic that references resourceVersion=0 and we'd probably have to document that the cache may be a little more inconsistent than expected to start until after the initial events come from the subsequent watch.

Just to understand better, how often is strimzi starting informers?

For GET I know we can't default to that behavior - I also don't see that in the kubectl client. A get method / operation from a user perspective seems to mean latest. Can you elaborate on how you see this being used? As far as I can tell you'd have to do something like: var resource = op.anyVersion().get(); ... do something with resource, try an update for example and if it fails optimistic locking then try again, but force using the latest.

marseel · 2022-12-13T12:46:41Z

Just to understand better, how often is strimzi starting informers?

If I understand correctly (judging based on API Server logs), Strimzi is not starting informer unfortunately, but it makes repeatable LIST calls, which further makes situation worse. This is why I opened issue in Strimzi as well to improve it and hopefully switch to LIST and WATCH pattern: strimzi/strimzi-kafka-operator#7793

For GET I know we can't default to that behavior

Skipping GET is fine, I was mostly concerned with LIST calls. This was just "random" idea.

shawkins · 2022-12-13T12:56:07Z

If I understand correctly (judging based on API Server logs), Strimzi is not starting informer

My understanding is that strimzi does use informers. See these methods: https://github.com/strimzi/strimzi-kafka-operator/blob/32acf4662a873d47a9a4478d3d6781c9a3b3b86d/operator-common/src/main/java/io/strimzi/operator/common/operator/resource/AbstractResourceOperator.java#L513

but it makes repeatable LIST calls, which further makes situation worse

It would be good to understand if they are frequently starting new informers, or if there are additional LIST calls - for the latter you can make the case that strimzi directly requests a list with resourceVersion 0, but of course it really matter how that is being used and may be too broad of a change for them.

An informer once started should not make frequent list calls without a resource version - if it restarts the watch, it will attempt to start at the last know resource version. Only if it sees a resource version that is too old - which is only likely if the server does not support bookmarks - would it need to fully re-list.

marseel · 2022-12-13T13:24:13Z

I've added comment in Strimzi bug. I believe these calls come from different part of source code in Stirmzi as they contain quite specific labelSelector so I doubt this is informer.

For Fabric8 I was wondering if it would make sense to specify resourceVersion=0 for LIST calls even when they are not part of informer as default behaviour, but I agree this would be incompatible with client-go. But you mentioned that informer in Fabric8 is not using resourceVersion=0 so I think it would make sense to change it :)

shawkins · 2022-12-13T13:28:40Z

I believe these calls come from different part of source code in Stirmzi as they contain quite specific labelSelector so I doubt this is informer

Informers can be off of any resource or list context including where there is a label selector.

For Fabric8 I was wondering if it would make sense to specify resourceVersion=0 for LIST calls even when they are not part of informer as default behaviour, but I agree this would be incompatible with client-go

I don't think so for the same reasoning as get operations - you naturally expect the latest. You can also choose to specify ListOptions if you are the one making the list call.

marseel · 2022-12-13T13:43:45Z

Informers can be off of any resource or list context including where there is a label selector.

Sorry for not being precise. I can see some WATCH requests in API Server logs coming from Strimzi, but they have different labelSelector and are cluster scope, not namespace scope like LIST I mentioned before that I was most worried about. Example Watch (all watches coming from Strimzi for pods are in this format):

/api/v1/pods?labelSelector=strimzi.io%2Fkind%3DKafka&resourceVersion=2659627868&allowWatchBookmarks=true&watch=true
latency="41m11.996246111s" userAgent="strimzi-cluster-operator/0.31.1" apf_pl="workload-low" apf_fs="service-accounts" apf_init_latency="443.37µs" hijacked=true

as side note, I am wondering why Fabric8 is not specifying default timeout for watch requests? Client-go specify random timeout between (5, 10) minutes: https://github.com/kubernetes/kubernetes/blob/4106b10d9c3abe0e90153376ce7cb26b0fb2d1d5/staging/src/k8s.io/client-go/tools/cache/reflector.go#L147

shawkins · 2022-12-13T13:58:26Z

I am wondering why Fabric8 is not specifying default timeout for watch requests?

Its never been requested as a feature. The understanding is that api server will enforce its default timeout (30 minutes I believe), with some jitter of its own.

…r listing

marseel changed the title ~~Set resourceVersion=0 for GET/LIST calls~~ Set resourceVersion=0 for GET/LIST calls by default Dec 13, 2022

marseel mentioned this issue Dec 13, 2022

Strimzi makes repeatable LIST requests causing instsability of Kubernetes Control Plane strimzi/strimzi-kafka-operator#7793

Closed

shawkins mentioned this issue Dec 14, 2022

Set a timeout client side in informer watches #4675

Closed

shawkins added a commit to shawkins/kubernetes-client that referenced this issue Dec 14, 2022

fix fabric8io#4670: using a 0 resourceVersion for the initial informe…

f91988a

…r listing

shawkins mentioned this issue Dec 14, 2022

fix #4670: using a 0 resourceVersion for the initial informer listing #4676

Merged

11 tasks

shawkins self-assigned this Dec 14, 2022

manusa added this to the 6.4.0 milestone Dec 20, 2022

manusa closed this as completed in b6879c4 Dec 20, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Set resourceVersion=0 for GET/LIST calls by default #4670

Set resourceVersion=0 for GET/LIST calls by default #4670

marseel commented Dec 13, 2022

shawkins commented Dec 13, 2022 •

edited

Loading

marseel commented Dec 13, 2022 •

edited

Loading

shawkins commented Dec 13, 2022 •

edited

Loading

marseel commented Dec 13, 2022

shawkins commented Dec 13, 2022

marseel commented Dec 13, 2022

shawkins commented Dec 13, 2022 •

edited

Loading

Set resourceVersion=0 for GET/LIST calls by default #4670

Set resourceVersion=0 for GET/LIST calls by default #4670

Comments

marseel commented Dec 13, 2022

Is your enhancement related to a problem? Please describe

Describe the solution you'd like

Describe alternatives you've considered

Additional context

shawkins commented Dec 13, 2022 • edited Loading

marseel commented Dec 13, 2022 • edited Loading

shawkins commented Dec 13, 2022 • edited Loading

marseel commented Dec 13, 2022

shawkins commented Dec 13, 2022

marseel commented Dec 13, 2022

shawkins commented Dec 13, 2022 • edited Loading

shawkins commented Dec 13, 2022 •

edited

Loading

marseel commented Dec 13, 2022 •

edited

Loading

shawkins commented Dec 13, 2022 •

edited

Loading

shawkins commented Dec 13, 2022 •

edited

Loading