-
Notifications
You must be signed in to change notification settings - Fork 1.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Set resourceVersion=0 for GET/LIST calls by default #4670
Comments
For LIST this request primarily would be targeted at informers - direct usage can specify ListOptions. The relevant line from the go client: https://sourcegraph.com/github.com/kubernetes/client-go/-/blob/tools/cache/reflector.go?L644&subtree=true - we can do the same for the initial listing. I'll have to double check their other logic that references resourceVersion=0 and we'd probably have to document that the cache may be a little more inconsistent than expected to start until after the initial events come from the subsequent watch. Just to understand better, how often is strimzi starting informers? For GET I know we can't default to that behavior - I also don't see that in the kubectl client. A get method / operation from a user perspective seems to mean latest. Can you elaborate on how you see this being used? As far as I can tell you'd have to do something like: var resource = op.anyVersion().get(); ... do something with resource, try an update for example and if it fails optimistic locking then try again, but force using the latest. |
If I understand correctly (judging based on API Server logs), Strimzi is not starting informer unfortunately, but it makes repeatable LIST calls, which further makes situation worse. This is why I opened issue in Strimzi as well to improve it and hopefully switch to LIST and WATCH pattern: strimzi/strimzi-kafka-operator#7793
Skipping GET is fine, I was mostly concerned with LIST calls. This was just "random" idea. |
My understanding is that strimzi does use informers. See these methods: https://github.com/strimzi/strimzi-kafka-operator/blob/32acf4662a873d47a9a4478d3d6781c9a3b3b86d/operator-common/src/main/java/io/strimzi/operator/common/operator/resource/AbstractResourceOperator.java#L513
It would be good to understand if they are frequently starting new informers, or if there are additional LIST calls - for the latter you can make the case that strimzi directly requests a list with resourceVersion 0, but of course it really matter how that is being used and may be too broad of a change for them. An informer once started should not make frequent list calls without a resource version - if it restarts the watch, it will attempt to start at the last know resource version. Only if it sees a resource version that is too old - which is only likely if the server does not support bookmarks - would it need to fully re-list. |
I've added comment in Strimzi bug. I believe these calls come from different part of source code in Stirmzi as they contain quite specific labelSelector so I doubt this is informer. For Fabric8 I was wondering if it would make sense to specify resourceVersion=0 for LIST calls even when they are not part of informer as default behaviour, but I agree this would be incompatible with client-go. But you mentioned that informer in Fabric8 is not using resourceVersion=0 so I think it would make sense to change it :) |
Informers can be off of any resource or list context including where there is a label selector.
I don't think so for the same reasoning as get operations - you naturally expect the latest. You can also choose to specify ListOptions if you are the one making the list call. |
Sorry for not being precise. I can see some WATCH requests in API Server logs coming from Strimzi, but they have different labelSelector and are cluster scope, not namespace scope like LIST I mentioned before that I was most worried about. Example Watch (all watches coming from Strimzi for pods are in this format):
as side note, I am wondering why Fabric8 is not specifying default timeout for watch requests? Client-go specify random timeout between (5, 10) minutes: https://github.com/kubernetes/kubernetes/blob/4106b10d9c3abe0e90153376ce7cb26b0fb2d1d5/staging/src/k8s.io/client-go/tools/cache/reflector.go#L147 |
Its never been requested as a feature. The understanding is that api server will enforce its default timeout (30 minutes I believe), with some jitter of its own. |
Is your enhancement related to a problem? Please describe
Yes,
During investigation of one of clusters I've noticed that https://github.com/strimzi/strimzi-kafka-operator is putting significant load on etcd (few hundreds of MB/s just from LIST calls). Strimzi is using fabric8 as k8s client. By default, LIST calls coming from Fabric8 do not contain resourceVersion specified.
From SIG Scalability perspective it would make sense to change default behaviour and set resourceVersion to "0". This will allow requests to be processed by Kubernetes API Server without interaction with etcd, which will significantly improve reliability of Kubernetes Control Plane.
Describe the solution you'd like
Change default behaviour of GET/LIST calls in Fabric8 to use resourceVersion=0.
Describe alternatives you've considered
No response
Additional context
I will followup with strimzi maintainers to solve this issue by overriding resourceVersion=0 and also using list and watch pattern, but I believe it would be beneficial to change this default behaviour for all users of Fabric8.
The text was updated successfully, but these errors were encountered: