
Running Multiple Recommenders with Checkpointing results in deletion of checkpoints from other recommenders #6387

Open
javin-stripe opened this issue Dec 18, 2023 · 11 comments · May be fixed by #6767
Assignees
Labels
area/vertical-pod-autoscaler kind/bug Categorizes issue or PR as related to a bug. triage/accepted Indicates an issue or PR is ready to be actively worked on.

Comments

@javin-stripe

Which component are you using?:

Vertical Pod Autoscaler

What version of the component are you using?:

Component version: v0.13.0

What k8s version are you using (kubectl version)?:

kubectl version Output
$ kubectl version

Client Version: version.Info{Major:"1", Minor:"23", GitVersion:"v1.23.16", GitCommit:"60e5135f758b6e43d0523b3277e8d34b4ab3801f", GitTreeState:"archive", BuildDate:"2023-05-30T23:39:56Z", GoVersion:"go1.19.2", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"23", GitVersion:"v1.23.16", GitCommit:"60e5135f758b6e43d0523b3277e8d34b4ab3801f", GitTreeState:"clean", BuildDate:"2023-01-18T15:54:23Z", GoVersion:"go1.19.5", Compiler:"gc", Platform:"linux/amd64"}

What environment is this in?:

Self hosted on EC2

What did you expect to happen?:

When running multiple recommenders, only checkpoints that correspond to a given recommender should be eligible for deletion during that recommender's garbage collection. I.e., for a checkpoint to be deleted by RecommenderA, the checkpoint needs to be associated with RecommenderA.

What happened instead?:

RecommenderA filters out any VPAs that aren't associated with it, and then during garbage collection the checkpoints belonging to those filtered-out VPAs are considered 'orphaned' and deleted.

The following logs could be found:

I1216 06:14:17.567806     571 cluster_feeder.go:331] Orphaned VPA checkpoint cleanup - deleting some-namespace/some-vpa-checkpoint.

These lines only showed up for VPA checkpoints not associated with the recommender.

This can also be seen in code:

  1. VPA objects are refreshed inside the LoadVPAs function
  2. Inside that function, after retrieving all of the VPA objects in the cluster, it calls filterVPAs

Inside this function it includes any VPAs that match the recommender’s name, filtering out any VPAs that have a different recommender name.

We can see this happening based on logs “Ignoring vpaCRD vpa-something in namespace some-namespace as new-recommender recommender doesn't process CRDs implicitly destined to default recommender”

So we know that all of the VPAs can be loaded at this point, but some are being filtered out.

  3. After filtering, it processes the VPAs that match the current recommender and makes an important call to AddOrUpdateVpa

Inside this function, the VPA object in the cluster state is either updated or, if it isn't present yet, created and added.

This is the only place where entries are added to the clusterState.Vpas map.

  4. Later in the recommender tick, we get to the MaintainCheckpoints function
  5. If enough time has elapsed since the last checkpoint garbage collection (10 minutes by default), we initiate the GarbageCollectCheckpoints function

Inside this function we call LoadVPAs once again to make sure we are working on fresh data (remember that this function filters!).

We then grab all checkpoints in all namespaces (a listing that is not filtered) and check whether the VPA for each checkpoint exists in the clusterState.

Since we know from (3) that the clusterState only contains the filtered list of VPAs, and this function checks all VPA checkpoints, it starts deleting the VPA checkpoints not associated with that recommender.

We can validate that this happened based on seeing the logs: “I1216 06:14:17.567806 571 cluster_feeder.go:331] Orphaned VPA checkpoint cleanup - deleting some-namespace/some-vpa-checkpoint.”
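To make the interaction concrete, here is a simplified, self-contained sketch of the flow described above. This is not the actual VPA source; the types and helper are made up, but the filtering and the orphan check mirror what LoadVPAs/filterVPAs and GarbageCollectCheckpoints do:

package main

import "fmt"

// Illustrative stand-ins for the real types in the VPA model package.
type vpaID struct{ Namespace, Name string }

type vpa struct {
	ID          vpaID
	Recommender string // empty means implicitly destined for the default recommender
}

type checkpoint struct {
	Owner vpaID // the VPA this checkpoint belongs to
	Name  string
}

// filterVPAs mirrors the filtering LoadVPAs performs before populating
// clusterState.Vpas: only VPAs destined for this recommender are kept.
func filterVPAs(all []vpa, recommender string) map[vpaID]bool {
	kept := make(map[vpaID]bool)
	for _, v := range all {
		name := v.Recommender
		if name == "" {
			name = "default"
		}
		if name == recommender {
			kept[v.ID] = true
		}
	}
	return kept
}

func main() {
	allVPAs := []vpa{
		{ID: vpaID{"some-namespace", "vpa-a"}, Recommender: "default"},
		{ID: vpaID{"some-namespace", "vpa-b"}, Recommender: "new-recommender"},
	}
	allCheckpoints := []checkpoint{
		{Owner: vpaID{"some-namespace", "vpa-a"}, Name: "vpa-a-checkpoint"},
		{Owner: vpaID{"some-namespace", "vpa-b"}, Name: "vpa-b-checkpoint"},
	}

	// The default recommender's clusterState only contains vpa-a after filtering...
	clusterStateVPAs := filterVPAs(allVPAs, "default")

	// ...but checkpoint GC lists checkpoints in *all* namespaces and checks them
	// against that filtered set, so vpa-b's checkpoint looks orphaned and is deleted.
	for _, cp := range allCheckpoints {
		if !clusterStateVPAs[cp.Owner] {
			fmt.Printf("Orphaned VPA checkpoint cleanup - deleting %s/%s\n", cp.Owner.Namespace, cp.Name)
		}
	}
}

Running this prints the same "Orphaned VPA checkpoint cleanup" line for vpa-b's checkpoint, even though vpa-b still exists and is simply handled by the other recommender.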

How to reproduce it (as minimally and precisely as possible):

This can easily be reproduced by spinning up two recommenders in the same cluster: keep one as the default recommender (i.e., don't override the recommender name) and add a second one with a custom name.

Create a few VPAs for each of the corresponding Recommenders, and then start watching for the ‘Orphaned VPA checkpoint cleanup*’ log.

After getting them both running, list the checkpoints in that cluster and you should see them being deleted every ~10 minutes (i.e., the garbage collection interval).

Anything else we need to know?:

The fix should be relatively simple; I can put it together if we agree that this is actually a bug and not intended behavior.

The workaround is to set the garbage collection interval to something very large and replace your recommender pods before that interval elapses. However, this has the downside of effectively turning off the checkpoint garbage collection logic.

@javin-stripe javin-stripe added the kind/bug Categorizes issue or PR as related to a bug. label Dec 18, 2023
@javin-stripe javin-stripe changed the title Running Multiple Recommenders with Checkpointing results in incorrect classification of 'Orphaned Checkpoint' Running Multiple Recommenders with Checkpointing results in incorrect deletion of 'Orphaned Checkpoint' Dec 18, 2023
@javin-stripe javin-stripe changed the title Running Multiple Recommenders with Checkpointing results in incorrect deletion of 'Orphaned Checkpoint' Running Multiple Recommenders with Checkpointing results in deletion of checkpoints from other recommenders Dec 18, 2023
@voelzmo
Contributor

voelzmo commented Feb 16, 2024

Hey @javin-stripe, thanks for the great description and analysis! This absolutely does not feel like intended behavior, and it probably indicates that this isn't a widely used feature, or that all users of this feature use a different mechanism than checkpoints for history data.

Thanks for offering to provide a fix, this would be highly appreciated! When you're saying the fix should be relatively simple, which solution do you have in mind?

@javin-stripe
Author

Hey @voelzmo, thanks for getting back to me!

I haven't really done any work on this since posting the issue, beyond finding the workaround above (i.e., if you can cycle your nodes outside of VPA, the garbage collection timer is in-memory, so it resets on a new node). The way I see this fix being implemented is essentially here in the code:

checkpointList, err := feeder.vpaCheckpointClient.VerticalPodAutoscalerCheckpoints(namespace).List(context.TODO(), metav1.ListOptions{})

We will likely want to group the checkpoint and the VPA object itself before filtering the VPAs, then only look at checkpoints that are also post-filtering.
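A rough sketch of that grouping, continuing from the checkpointList call above (the map name is made up; this is just the shape I have in mind, not the eventual fix):

// Group the checkpoints by the VPA they belong to *before* any
// recommender-specific filtering happens, so the orphan decision can later
// be made against the full, unfiltered association.
checkpointsByVpa := make(map[model.VpaID][]string)
for _, checkpoint := range checkpointList.Items {
	id := model.VpaID{Namespace: checkpoint.Namespace, VpaName: checkpoint.Spec.VPAObjectName}
	checkpointsByVpa[id] = append(checkpointsByVpa[id], checkpoint.Name)
}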

Let me know if that doesn't make sense, or if there is anything you would like to see in the fix! I should have some cycles in the next few weeks to put up a PR.

@voelzmo
Contributor

voelzmo commented Feb 21, 2024

Hey @javin-stripe, thanks for outlining your thoughts on how to approach this before jumping to do the PR!

I think that inside GarbageCollectCheckpoints we probably don't need to call feeder.LoadVPAs(). It is called on several other occasions in the code already and is usually used to populate the clusterState with an updated list of filtered VPA objects.

For garbage collecting the checkpoints, you correctly found that we should operate on an unfiltered list of all existing VPA objects in the entire cluster, so my best guess would be to avoid calling feeder.LoadVPAs and instead use something like this to retrieve all VPAs and build a map indexed by VpaID objects:

allVpaCRDs, err := feeder.vpaLister.List(labels.Everything())
if err != nil {
	klog.Errorf("Cannot list VPAs for checkpoint garbage collection: %v", err)
	return
}
// Build a set of every VPA in the cluster, unfiltered by recommender.
allVpaIDs := make(map[model.VpaID]bool, len(allVpaCRDs))
for _, vpaCRD := range allVpaCRDs {
	allVpaIDs[model.VpaID{Namespace: vpaCRD.Namespace, VpaName: vpaCRD.Name}] = true
}

The rest of the method can probably stay the same: we just need to make sure that we check against the map of unfiltered VPAs and not the one in the clusterState.
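For illustration, the adjusted check could then look roughly like this, assuming the same checkpointList listing from the comment above and the allVpaIDs map just sketched (this is not the actual patch that ended up in the linked PR):

// A checkpoint is orphaned only if its owning VPA no longer exists anywhere
// in the cluster, regardless of which recommender that VPA is destined for.
for _, checkpoint := range checkpointList.Items {
	vpaID := model.VpaID{Namespace: checkpoint.Namespace, VpaName: checkpoint.Spec.VPAObjectName}
	if !allVpaIDs[vpaID] {
		// delete the checkpoint and log the "Orphaned VPA checkpoint cleanup" line
	}
}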

As an alternative, I thought about keeping a list of unfiltered VPAs in the clusterState, but given that this is the only place we seem to need it, that seemed like a waste of memory and like polluting the clusterState.

Does this make sense?

@javin-stripe
Author

Yup, that sounds good, I'll loop back with a PR at some point. Thanks for the thoughts / approach!

@BojanZelic

BojanZelic commented Mar 20, 2024

I'm facing the same issue, and this is currently keeping us from using multiple VPA recommenders. @javin-stripe, any progress with the PR? If you're busy I could also give it a shot.

@BojanZelic BojanZelic linked a pull request Apr 24, 2024 that will close this issue
@adrianmoisey
Member

/area vertical-pod-autoscaler

@javin-stripe
Author

@BojanZelic - unfortunately this was de-prioritized on our end. We have the workaround of setting the garbage collection timeframe to be high and making sure our VPA pods are cycled before then. Sadly I'm not too sure if/when I'll be able to pick this work up again.

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Oct 6, 2024
@BojanZelic

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Oct 7, 2024
@adrianmoisey
Member

Can confirm that this exists
/triage accepted

I can take a look at fixing it
/assign

@k8s-ci-robot k8s-ci-robot added the triage/accepted Indicates an issue or PR is ready to be actively worked on. label Dec 18, 2024
@adrianmoisey
Member

This is being worked on in #6767
