
Add byobject filter on nodes #2888

Merged
merged 7 commits into from
Oct 23, 2024

Conversation

GnatorX
Contributor

@GnatorX GnatorX commented Apr 18, 2024

What type of PR is this?

improvement

Which issue does this PR fix?:

#2887

What does this PR do / Why do we need it?:
This PR adds a filter so that the VPC CNI watches only the Node object of the node it is managing, instead of every node in the cluster.

Testing done on this change:

Tested by @dl-stripe on our cluster of 3-4k nodes.

Will this PR introduce any new dependencies?:

No
Will this break upgrades or downgrades? Has updating a running cluster been tested?:
No.

Does this change require updates to the CNI daemonset config files to work?:

No

Does this PR introduce any user-facing change?:

No

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@GnatorX GnatorX closed this Apr 29, 2024
@GnatorX GnatorX reopened this Apr 29, 2024
@orsenthil
Member

Closing per the discussion here - #2887

@dl-stripe

A couple more datapoints in favor of this change. We have a few clusters with large node counts and a high rate of node churn. The number of Node events pushed to the daemonset was causing excessive bandwidth usage. Adding the filter drastically reduced the number of events distributed to the system, and the number of bytes pushed fell accordingly (sharing a screenshot without absolute numbers; happy to share in private, but you can see the relative drop in bytes processed).

sum by (cluster, kind) (rate(apiserver_watch_events_total{kind="Node"}[5m]))

[Screenshots: relative drop in Node watch events and in bytes processed]

@jayanthvn jayanthvn reopened this Oct 11, 2024
@jayanthvn jayanthvn requested a review from orsenthil October 11, 2024 01:50
@jayanthvn
Contributor

@GnatorX - The PR is still in draft mode. Please feel free to move it to review. Also, can you run make format?

@GnatorX GnatorX marked this pull request as ready for review October 11, 2024 03:01
@GnatorX GnatorX requested a review from a team as a code owner October 11, 2024 03:01
@GnatorX
Contributor Author

GnatorX commented Oct 11, 2024

Ya, I can do that.

@GnatorX
Contributor Author

GnatorX commented Oct 11, 2024

Ran make format.

@jayanthvn
Contributor

Updated the branch and started the test workflow.

@GnatorX
Contributor Author

GnatorX commented Oct 21, 2024

Thanks

@@ -37,7 +37,11 @@ func getIPAMDCacheFilters() map[client.Object]cache.ByObject {
 	return map[client.Object]cache.ByObject{
 		&corev1.Pod{}: {
 			Field: fields.Set{"spec.nodeName": nodeName}.AsSelector(),
-		}}
+		},
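For context, a minimal sketch of what the filter map plausibly looks like after this PR. The Pod entry is from the existing code above; the Node entry reflects the change this PR describes, with the metadata.name field key and the MY_NODE_NAME lookup being illustrative assumptions rather than details confirmed by the hunk shown here:

import (
	"os"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/fields"
	"sigs.k8s.io/controller-runtime/pkg/cache"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// getIPAMDCacheFilters restricts what the controller-runtime cache lists and watches.
func getIPAMDCacheFilters() map[client.Object]cache.ByObject {
	nodeName := os.Getenv("MY_NODE_NAME") // assumed source of the node name
	return map[client.Object]cache.ByObject{
		// Existing filter: only cache pods scheduled onto this node.
		&corev1.Pod{}: {
			Field: fields.Set{"spec.nodeName": nodeName}.AsSelector(),
		},
		// Added filter: only cache the Node object this CNI instance runs on.
		&corev1.Node{}: {
			Field: fields.Set{"metadata.name": nodeName}.AsSelector(),
		},
	}
}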
Member

The previous two changes related to the cache improvements were these.

I am trying to understand why only Pod{} was added to the filter here in the first place.

Contributor Author

I suspect it wasn't obvious because it requires a large cluster to show the increase in memory and networking bandwidth. We only saw this because our cluster has around 4k nodes.

@orsenthil
Member

@dl-stripe - What is the LHS value in the first graph here - #2888 (comment)

@dl-stripe

@dl-stripe - What is the LHS value in the first graph here - #2888 (comment)

It's the number of Node events pushed through informers (based on apiserver_watch_events_total) over a 5 minute period for 3 sample clusters we run the AWS VPC CNI in. They're all large clusters but have slightly different node counts.

@orsenthil
Member

When we were using node labels to signal the Security Groups for Pods feature - https://docs.aws.amazon.com/eks/latest/userguide/security-groups-pods-deployment.html

this change, which lets the Kubernetes client cache the node calls, could have had an impact, since it would depend on the client invalidating the cache when the labels change. Right now, since we don't depend on node labels in the CNI, this change looks good to me.

@dl-stripe / @GnatorX - Do you have Security Groups for Pods in your cluster under test, and did you see any impact from this change?

@GnatorX
Contributor Author

GnatorX commented Oct 21, 2024

this change, which lets the Kubernetes client cache the node calls, could have had an impact, since it would depend on the client invalidating the cache when the labels change. Right now, since we don't depend on node labels in the CNI, this change looks good to me.

This filters down the calls for nodes when the informer performs the list + watch. It shouldn't affect any label changes. It narrows the informer cache to watch only the node the VPC CNI is running on and to ignore all other nodes' updates and events.
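To illustrate the mechanism, here is a minimal client-go sketch (illustrative only; the VPC CNI wires this through controller-runtime rather than a raw informer factory) of how a field selector scopes an informer's list + watch to a single node:

import (
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/fields"
	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
)

// newNodeScopedInformerFactory builds a factory whose Node informer
// lists and watches only the named node.
func newNodeScopedInformerFactory(clientset kubernetes.Interface, nodeName string) informers.SharedInformerFactory {
	return informers.NewSharedInformerFactoryWithOptions(
		clientset,
		30*time.Minute, // resync period; illustrative value
		// The tweak applies to both the initial LIST and the subsequent WATCH,
		// so the API server only streams events for this one Node object.
		informers.WithTweakListOptions(func(opts *metav1.ListOptions) {
			opts.FieldSelector = fields.OneTermEqualSelector("metadata.name", nodeName).String()
		}),
	)
}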

@GnatorX
Contributor Author

GnatorX commented Oct 21, 2024

We do run with Security Groups for Pods and this change has no effect on it.

@orsenthil
Member

orsenthil commented Oct 21, 2024

This filters down the calls for nodes when the informer performs the list + watch. It shouldn't affect any label changes. It narrows the informer cache to watch only the node the VPC CNI is running on and to ignore all other nodes' updates and events.

Got it. This makes sense, and the change looks correct and helpful too.

Member

@orsenthil orsenthil left a comment


This change LGTM.

The filterMap used here -

ByObject: filterMap,

adds a filter to list/watch events for node operations, restricting them to the node that the CNI runs on. This is correct.

I understand that adding an additional cache field could increase the memory usage of VPC CNI, but it did not, per #2887.

We will verify the memory usage of VPC CNI after this change before it is released.

@GnatorX
Contributor Author

GnatorX commented Oct 22, 2024

The reason I initially wasn't able to see the memory improvement from this change is that our internal setup vendors the code, so when I made the change it didn't actually take effect.
At a high level, we are reducing the amount of data the VPC CNI's informer processes and stores.
https://github.com/kubernetes/sample-controller/raw/master/docs/images/client-go-controller-interaction.jpeg
The field selector changes the reflector's list & watch to just the node it is running on. This isn't an additional cache field; it is a filter based on the selector field.
This changes two things (see the sketch below):

  1. We no longer get watch events for other nodes, which reduces network calls and CPU processing.
  2. We no longer cache every node in the cluster within the store, which reduces memory consumption.
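As a concrete sketch of both points, this is roughly how a ByObject filter plugs into a controller-runtime manager's cache (assumed names throughout; MY_NODE_NAME stands in for however the daemonset learns its node name):

import (
	"os"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/fields"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/cache"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

func newManager() (ctrl.Manager, error) {
	nodeName := os.Getenv("MY_NODE_NAME") // illustrative: e.g. injected via the downward API

	return ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{
		Cache: cache.Options{
			ByObject: map[client.Object]cache.ByObject{
				// The reflector's list+watch for Nodes is scoped to this single
				// node, so the store holds one Node (less memory) and the watch
				// stream carries only its events (less network and CPU).
				&corev1.Node{}: {
					Field: fields.Set{"metadata.name": nodeName}.AsSelector(),
				},
			},
		},
	})
}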

@GnatorX
Contributor Author

GnatorX commented Oct 22, 2024

https://github.com/kubernetes-sigs/controller-runtime/blob/main/designs/use-selectors-at-cache.md
https://github.com/kubernetes-sigs/controller-runtime/blob/main/designs/cache_options.md

Some docs I found that explain what these ByObject filters do.

@orsenthil
Member

Thank you for the explanation and the links describing the behavior. This change looks good and helpful.

@orsenthil orsenthil merged commit 0703d03 into aws:master Oct 23, 2024
6 checks passed
orsenthil pushed a commit that referenced this pull request Oct 23, 2024
* Add byobject filter on nodes

---------

Co-authored-by: Jayanth Varavani <1111446+jayanthvn@users.noreply.github.com>
Co-authored-by: Garvin Pang <garvinp@stripe.com>
orsenthil added a commit that referenced this pull request Oct 23, 2024
* Add byobject filter on nodes

---------

Co-authored-by: Garvin Pang <garvinpang@protonmail.com>
Co-authored-by: Jayanth Varavani <1111446+jayanthvn@users.noreply.github.com>
Co-authored-by: Garvin Pang <garvinp@stripe.com>
@GnatorX GnatorX deleted the patch-1 branch October 29, 2024 20:26
@sidewinder12s

This is a great change; any other calls to the API server or AWS APIs that could use filtering should adopt it. This cut the memory usage of a 400-500 node cluster in half, from 40GB+ to ~20GB across all CNI pods.
