
Running Multiple Recommenders with Checkpointing results in deletion of checkpoints from other recommenders #6387

Open
javin-stripe opened this issue Dec 18, 2023 · 11 comments · May be fixed by #6767
Assignees
Labels
area/vertical-pod-autoscaler kind/bug Categorizes issue or PR as related to a bug. triage/accepted Indicates an issue or PR is ready to be actively worked on.

Comments

@javin-stripe

Which component are you using?:

Vertical Pod Autoscaler

What version of the component are you using?:

Component version: v0.13.0

What k8s version are you using (kubectl version)?:

kubectl version Output
$ kubectl version

Client Version: version.Info{Major:"1", Minor:"23", GitVersion:"v1.23.16", GitCommit:"60e5135f758b6e43d0523b3277e8d34b4ab3801f", GitTreeState:"archive", BuildDate:"2023-05-30T23:39:56Z", GoVersion:"go1.19.2", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"23", GitVersion:"v1.23.16", GitCommit:"60e5135f758b6e43d0523b3277e8d34b4ab3801f", GitTreeState:"clean", BuildDate:"2023-01-18T15:54:23Z", GoVersion:"go1.19.5", Compiler:"gc", Platform:"linux/amd64"}

What environment is this in?:

Self hosted on EC2

What did you expect to happen?:

When running multiple recommenders, only checkpoints that correspond to a given recommender should be eligible for deletion during that recommender's garbage collection. I.e., for a checkpoint to be deleted by RecommenderA, the checkpoint needs to be associated with RecommenderA.

What happened instead?:

RecommenderA filters out any VPAs that aren't associated with it, and then during garbage collection the checkpoints belonging to those filtered-out VPAs are considered 'orphaned' and deleted.

The following logs could be found:

I1216 06:14:17.567806     571 cluster_feeder.go:331] Orphaned VPA checkpoint cleanup - deleting some-namespace/some-vpa-checkpoint.

These lines only showed up for VPA checkpoints not associated with the recommender.

This can also be seen in code:

  1. VPA objects are refreshed inside the LoadVPAs function
  2. Inside that function, after retrieving all of the VPA objects in the cluster, it calls filterVPAs

Inside this function it includes any VPAs that match the recommender’s name, filtering out any VPAs that have a different recommender name.

We can see this happening based on logs “Ignoring vpaCRD vpa-something in namespace some-namespace as new-recommender recommender doesn't process CRDs implicitly destined to default recommender”

So we know that all of the VPAs can be loaded at this point, but some are being filtered out.

  3. After filtering, it processes the VPAs that match the current recommender and makes an important call to AddOrUpdateVpa

Inside this function, the VPA object in the cluster state is either updated or, if it isn't present yet, created and added.

This is the only place where entries are added to the clusterState.Vpas map.

  4. Later in the recommender tick, we get to the MaintainCheckpoints function
  5. If enough time has elapsed since the last checkpoint garbage collection (10 minutes by default), we initiate the GarbageCollectCheckpoints function

Inside this function we call LoadVPAs once again to make sure we are working on fresh data (remember that this function filters!).

We then grab all checkpoints in all namespaces (a listing that is not filtered) and check whether the VPA for each checkpoint exists in the clusterState.

Since we know from (3) that the clusterState only contains the filtered list of VPAs, and this function checks all VPA checkpoints, it starts deleting the VPA checkpoints not associated with that recommender.

We can validate that this happened based on seeing the logs: “I1216 06:14:17.567806 571 cluster_feeder.go:331] Orphaned VPA checkpoint cleanup - deleting some-namespace/some-vpa-checkpoint.”
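To make the interaction concrete, here is a simplified, self-contained sketch of the flow described above. This is not the actual VPA source; the types and helper are made up, but the filtering and the orphan check mirror what LoadVPAs/filterVPAs and GarbageCollectCheckpoints do:

package main

import "fmt"

// Illustrative stand-ins for the real types in the VPA model package.
type vpaID struct{ Namespace, Name string }

type vpa struct {
	ID          vpaID
	Recommender string // empty means implicitly destined for the default recommender
}

type checkpoint struct {
	Owner vpaID // the VPA this checkpoint belongs to
	Name  string
}

// filterVPAs mirrors the filtering LoadVPAs performs before populating
// clusterState.Vpas: only VPAs destined for this recommender are kept.
func filterVPAs(all []vpa, recommender string) map[vpaID]bool {
	kept := make(map[vpaID]bool)
	for _, v := range all {
		name := v.Recommender
		if name == "" {
			name = "default"
		}
		if name == recommender {
			kept[v.ID] = true
		}
	}
	return kept
}

func main() {
	allVPAs := []vpa{
		{ID: vpaID{"some-namespace", "vpa-a"}, Recommender: "default"},
		{ID: vpaID{"some-namespace", "vpa-b"}, Recommender: "new-recommender"},
	}
	allCheckpoints := []checkpoint{
		{Owner: vpaID{"some-namespace", "vpa-a"}, Name: "vpa-a-checkpoint"},
		{Owner: vpaID{"some-namespace", "vpa-b"}, Name: "vpa-b-checkpoint"},
	}

	// The default recommender's clusterState only contains vpa-a after filtering...
	clusterStateVPAs := filterVPAs(allVPAs, "default")

	// ...but checkpoint GC lists checkpoints in *all* namespaces and checks them
	// against that filtered set, so vpa-b's checkpoint looks orphaned and is deleted.
	for _, cp := range allCheckpoints {
		if !clusterStateVPAs[cp.Owner] {
			fmt.Printf("Orphaned VPA checkpoint cleanup - deleting %s/%s\n", cp.Owner.Namespace, cp.Name)
		}
	}
}

Running this prints the same "Orphaned VPA checkpoint cleanup" line for vpa-b's checkpoint, even though vpa-b still exists and is simply handled by the other recommender.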

How to reproduce it (as minimally and precisely as possible):

This can easily be reproduced by spinning up two recommenders in the same cluster: keep one as the default recommender (i.e., don't override the recommender name) and add a second one with a custom name.

Create a few VPAs for each of the corresponding Recommenders, and then start watching for the ‘Orphaned VPA checkpoint cleanup*’ log.

After getting them both running, list the checkpoints in that cluster and you should see them being deleted every ~10 minutes (i.e., the garbage collection interval).

Anything else we need to know?:

The fix should be relatively simple; I can put it together if we agree that this is actually a bug and not intended behavior.

The workaround is to set the garbage collection interval to something very large and replace your recommender pods before that interval elapses. However, this has the downside of effectively turning off the checkpoint garbage collection logic.

@javin-stripe javin-stripe added the kind/bug Categorizes issue or PR as related to a bug. label Dec 18, 2023
@javin-stripe javin-stripe changed the title Running Multiple Recommenders with Checkpointing results in incorrect classification of 'Orphaned Checkpoint' Running Multiple Recommenders with Checkpointing results in incorrect deletion of 'Orphaned Checkpoint' Dec 18, 2023
@javin-stripe javin-stripe changed the title Running Multiple Recommenders with Checkpointing results in incorrect deletion of 'Orphaned Checkpoint' Running Multiple Recommenders with Checkpointing results in deletion of checkpoints from other recommenders Dec 18, 2023
@voelzmo
Contributor

voelzmo commented Feb 16, 2024

Hey @javin-stripe, thanks for the great description and analysis! This absolutely does not feel like intended behavior, and it probably indicates that this isn't a widely used feature, or that all users of this feature use a different mechanism than checkpoints for history data.

Thanks for offering to provide a fix, this would be highly appreciated! When you're saying the fix should be relatively simple, which solution do you have in mind?

@javin-stripe
Author

Hey @voelzmo, thanks for getting back to me!

I haven't really done any work on this since posting the issue, beyond finding the workaround above (i.e., if you can cycle your nodes outside of VPA, the garbage collection timer is in-memory, so it resets on a new node). The way I see this fix being implemented is essentially here in the code:

checkpointList, err := feeder.vpaCheckpointClient.VerticalPodAutoscalerCheckpoints(namespace).List(context.TODO(), metav1.ListOptions{})

We will likely want to group the checkpoint and the VPA object itself before filtering the VPAs, then only look at checkpoints that are also post-filtering.
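A rough sketch of that grouping, continuing from the checkpointList call above (the map name is made up; this is just the shape I have in mind, not the eventual fix):

// Group the checkpoints by the VPA they belong to *before* any
// recommender-specific filtering happens, so the orphan decision can later
// be made against the full, unfiltered association.
checkpointsByVpa := make(map[model.VpaID][]string)
for _, checkpoint := range checkpointList.Items {
	id := model.VpaID{Namespace: checkpoint.Namespace, VpaName: checkpoint.Spec.VPAObjectName}
	checkpointsByVpa[id] = append(checkpointsByVpa[id], checkpoint.Name)
}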

Let me know if that doesn't make sense, or if there is anything you would like to see in the fix! I should have some cycles in the next few weeks to put up a PR.

@voelzmo
Contributor

voelzmo commented Feb 21, 2024

Hey @javin-stripe, thanks for outlining your thoughts on how to approach this before jumping to do the PR!

I think that inside GarbageCollectCheckpoints we probably don't need to call feeder.LoadVPAs(). It is called on several other occasions in the code already and is usually used to populate the clusterState with an updated list of filtered VPA objects.

For garbage collecting the checkpoints, you correctly found that we should operate on an unfiltered list of all existing VPA objects in the entire cluster, so my best guess would be to avoid calling feeder.LoadVPAs and instead use something like this to retrieve all VPAs and build a map indexed by VpaID objects:

allVpaCRDs, err := feeder.vpaLister.List(labels.Everything())
if err != nil {
	klog.Errorf("Cannot list VPAs for checkpoint garbage collection: %v", err)
	return
}
// Build a set of every VPA in the cluster, unfiltered by recommender.
allVpaIDs := make(map[model.VpaID]bool, len(allVpaCRDs))
for _, vpaCRD := range allVpaCRDs {
	allVpaIDs[model.VpaID{Namespace: vpaCRD.Namespace, VpaName: vpaCRD.Name}] = true
}

The rest of the method can probably stay the same: we just need to make sure that we check against the map of unfiltered VPAs and not the one in the clusterState.
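For illustration, the adjusted check could then look roughly like this, assuming the same checkpointList listing from the comment above and the allVpaIDs map just sketched (this is not the actual patch that ended up in the linked PR):

// A checkpoint is orphaned only if its owning VPA no longer exists anywhere
// in the cluster, regardless of which recommender that VPA is destined for.
for _, checkpoint := range checkpointList.Items {
	vpaID := model.VpaID{Namespace: checkpoint.Namespace, VpaName: checkpoint.Spec.VPAObjectName}
	if !allVpaIDs[vpaID] {
		// delete the checkpoint and log the "Orphaned VPA checkpoint cleanup" line
	}
}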

As an alternative, I thought about keeping a list of unfiltered VPAs in the clusterState, but given that this is the only place we seem to need it, that seemed like a waste of memory and like polluting the clusterState.

Does this make sense?

@javin-stripe
Author

Yup, that sounds good, I'll loop back with a PR at some point. Thanks for the thoughts / approach!

@BojanZelic

BojanZelic commented Mar 20, 2024

I'm facing the same issue, and this is currently keeping us from using multiple VPA recommenders. @javin-stripe, any progress with the PR? If you're busy I could also give it a shot.

@BojanZelic BojanZelic linked a pull request Apr 24, 2024 that will close this issue
@adrianmoisey
Member

/area vertical-pod-autoscaler

@javin-stripe
Author

@BojanZelic - unfortunately this was de-prioritized on our end. We have the workaround of setting the garbage collection timeframe to be high and making sure our VPA pods are cycled before then. Sadly I'm not too sure if/when I'll be able to pick this work up again.

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Oct 6, 2024
@BojanZelic

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Oct 7, 2024
@adrianmoisey
Member

Can confirm that this exists
/triage accepted

I can take a look at fixing it
/assign

@k8s-ci-robot k8s-ci-robot added the triage/accepted Indicates an issue or PR is ready to be actively worked on. label Dec 18, 2024
@adrianmoisey
Member

This is being worked on in #6767
