
Proposal: Long-lived deployment: Triggering Descheduler with Events (SharedIndexInformer rules, endpoints, etc) #696

Open
Dentrax opened this issue Jan 14, 2022 · 21 comments
Labels
kind/feature Categorizes issue or PR as related to a new feature. lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness.

Comments

@Dentrax
Contributor

Dentrax commented Jan 14, 2022

Is your feature request related to a problem? Please describe.

When I think about a problem that requires the Descheduler to take action when something happens in the cluster, my first target is one of the events that could trigger it, such as: 1

  • New Node Joined
  • Service Removed
  • Deployment Delete
  • HPA Update
  • Trigger manually by calling an endpoint
  • etc

(These are just random examples, not meant to represent the actual use cases.)

Describe the solution you'd like

To stay informed about when these events are triggered, we can use a primitive exposed by the Kubernetes client-go library called NewSharedIndexInformer, in the cache package.

informer.AddEventHandler(cache.ResourceEventHandlerFuncs{
    AddFunc:    func(obj interface{}) { trigger() },
    DeleteFunc: func(obj interface{}) { trigger() },
})
go informer.Run(...)

// onAdd is the function executed when the Kubernetes informer notifies us of
// the presence of a new Kubernetes node in the cluster.
func onAdd(obj interface{}) {
    node := obj.(*corev1.Node)
    ...
}
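
For context, here is a fuller sketch of how such a node informer could be wired up with client-go; the factory/stopCh names and the in-cluster config are illustrative assumptions, not existing descheduler code:

package main

import (
	corev1 "k8s.io/api/core/v1"
	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
	"k8s.io/client-go/tools/cache"
)

func main() {
	// Assumes an in-cluster kubeconfig; for local testing you would load a
	// kubeconfig file instead.
	config, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(config)

	// Shared informer factory with resync disabled (0).
	factory := informers.NewSharedInformerFactory(client, 0)
	nodeInformer := factory.Core().V1().Nodes().Informer()
	nodeInformer.AddEventHandler(cache.ResourceEventHandlerFuncs{
		AddFunc: func(obj interface{}) {
			node := obj.(*corev1.Node)
			_ = node // e.g. trigger a descheduling run for the new node
		},
		DeleteFunc: func(obj interface{}) {
			// e.g. trigger a descheduling run after a node is removed
		},
	})

	stopCh := make(chan struct{})
	factory.Start(stopCh)
	factory.WaitForCacheSync(stopCh)
	select {} // block forever; handlers fire on node add/delete events
}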

Describe alternatives you've considered

In the current architecture, node.go already handles this concern when we schedule the Descheduler to run as a cron job. But consider the motivation here: if we set the Descheduler to run every hour or once a week, and a service is deleted, a new node is added, etc., there will be a time window between the event and the last time the descheduler ran, which means we have to wait hours or days for the next run.

What version of descheduler are you using?

descheduler version: v0.22.1

Additional context
-

Footnotes

  1. https://www.cncf.io/blog/2019/10/15/extend-kubernetes-via-a-shared-informer/

@Dentrax Dentrax added the kind/feature Categorizes issue or PR as related to a new feature. label Jan 14, 2022
@damemi
Contributor

damemi commented Jan 14, 2022

Hi @Dentrax,
this is something we have toyed with before, so we're definitely interested in it. See the other issue #489 and the discussion in the matching PR, #488, which was my first attempt to implement it.

If there's any interest, feel free to read through some of the feedback there and pick this up, as it's been on the back burner for a while. It would be great to get this rolling again.

So, if you don't mind, we can close this issue and keep the discussion for this feature in one spot (#489).

@Dentrax
Contributor Author

Dentrax commented Jan 14, 2022

This is cool! I hadn't noticed that. Do you need any help with your PR? I can give a hand to lift and shift this feature.


You mentioned runMode takes 2 options: Default and Informed. For this requirement, I thought we could pass something like a Mixed option: the Descheduler should start as a Deployment for periodic runs, and in addition it should also watch Kubernetes events using a SharedIndexInformer.

The motivation here is to keep the Descheduler running on its schedule as-is, plus event watching with informers.

As a caveat, we cannot run as a cron job anymore, since event watching with informers requires a long-lived process.
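
To make the Mixed idea concrete, here is a minimal sketch of what such a long-lived loop could look like; runOnce, triggerCh, and interval are hypothetical names, not existing descheduler code:

package main

import (
	"context"
	"time"
)

// runOnce stands in for a single descheduling cycle.
func runOnce(ctx context.Context) { /* ... */ }

// runMixed performs periodic cycles and also reacts to informer events
// delivered on triggerCh, all within one long-lived process.
func runMixed(ctx context.Context, interval time.Duration, triggerCh <-chan struct{}) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		select {
		case <-ticker.C: // scheduled run, like the current cron-style mode
			runOnce(ctx)
		case <-triggerCh: // event-driven run, fed by informer callbacks
			runOnce(ctx)
		case <-ctx.Done():
			return
		}
	}
}

func main() {
	triggerCh := make(chan struct{}, 1)
	runMixed(context.Background(), time.Hour, triggerCh)
}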

@JaneLiuL
Member

JaneLiuL commented Jan 23, 2022

Hi ~ @damemi and @denkensk
In my opinion, the strategies can be separated into two types: one related to pods, the other related to nodes.
So would it be better to only care about the pod informer and the node informer?
Even for the HPA or Deployment informers, we actually only care about the pods and the nodes.
Like below:

	podInformer.Informer().AddEventHandler(cache.ResourceEventHandlerFuncs{
		AddFunc:    func(obj interface{}) { runPodrelatedStrategy() },
		UpdateFunc: func(oldObj, newObj interface{}) { runPodrelatedStrategy() },
		DeleteFunc: func(obj interface{}) { runPodrelatedStrategy() },
	})
	nodeInformer.Informer().AddEventHandler(cache.ResourceEventHandlerFuncs{
		AddFunc:    func(obj interface{}) { runNoderelatedStrategy() },
		DeleteFunc: func(obj interface{}) { runNoderelatedStrategy() },
	})

Then we can create one interface for all strategies; each can have a run() method for the default mode and an informed() method for the informed mode.

runPodrelatedStrategy and runNoderelatedStrategy would then run as informed methods.
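
A minimal sketch of what that shared interface could look like, with illustrative names only:

package strategy // illustrative package, not existing descheduler code

import "context"

// Strategy unifies both run modes behind one interface.
type Strategy interface {
	Run(ctx context.Context) error                       // default (periodic) mode
	Informed(ctx context.Context, obj interface{}) error // informed (event-driven) mode, called per event
}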

@Dentrax
Contributor Author

Dentrax commented Jan 25, 2022

Another alternative way to trigger the descheduler would be using custom endpoints, such as: [GET] /api/v1/trigger
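
For illustration only, such an endpoint could be sketched as follows; the route and triggerCh are hypothetical, and no such endpoint exists in the descheduler today:

package main

import "net/http"

func main() {
	// triggerCh would be drained by the descheduler's main loop.
	triggerCh := make(chan struct{}, 1)

	http.HandleFunc("/api/v1/trigger", func(w http.ResponseWriter, r *http.Request) {
		if r.Method != http.MethodGet {
			http.Error(w, "method not allowed", http.StatusMethodNotAllowed)
			return
		}
		select {
		case triggerCh <- struct{}{}: // request one descheduling cycle
			w.WriteHeader(http.StatusAccepted)
		default: // a run is already pending; do not queue more
			w.WriteHeader(http.StatusTooManyRequests)
		}
	})
	_ = http.ListenAndServe(":8080", nil)
}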

@Dentrax
Contributor Author

Dentrax commented Feb 3, 2022

Kindly ping @damemi 🤞 (#488 (comment))

@damemi
Contributor

damemi commented Feb 3, 2022

So sorry for the delay... I had actually started working on a response to this and got completely distracted!

To answer your questions:

You mentioned runMode takes 2 options: Default and Informed. For this requirement, I thought we could pass something like a Mixed option: the Descheduler should start as a Deployment for periodic runs, and in addition it should also watch Kubernetes events using a SharedIndexInformer.

@Dentrax this seems like a fine option to me. I'm not entirely sure why I originally made the run mode design mutually exclusive. I think this may bring up some challenges, such as a timed strategy running at the same time as an informed strategy, but this could probably be handled with good multithreading practices.

In my opinion, the strategies can be separated into two types: one related to pods, the other related to nodes.
So would it be better to only care about the pod informer and the node informer?
Even for the HPA or Deployment informers, we actually only care about the pods and the nodes.
Like below:

@JaneLiuL this is the general idea, yeah. And I think that we should definitely initially start with a basic list of informers (like just pod and node, as you mentioned). However, the implementation details are where this gets tricky. Not all strategies will need all the informers, and in the future new strategies might need different informers.

Then we can create one interface for all strategies; each can have a run() method for the default mode and an informed() method for the informed mode.

That's kind of how I was originally trying to address this need: by wrapping the existing default-mode functions within a new interface. That interface can be defined with the generic informer functions that will be needed by all strategies, but the actual implementation of the interface can vary for each strategy by only calling upon the relevant informers.

Another alternative way to trigger the descheduler would be using custom endpoints, such as: [GET] /api/v1/trigger

@Dentrax this is an interesting idea, but not one that I think we should immediately pursue. I think it is safer to keep the descheduler reactive to objects and events that are known to the API server. Allowing direct requests like this could bypass the need to know the "state" of the cluster and end up with conflicting or unpredictable results.

Ultimately I'd like to revisit my implementation, but if there are alternatives/cherry-picks that others would like to work on, please feel free. I think we are converging on a similar basic idea for the design that was started there.

@necatican

I’m concerned about how this would work out when adding multiple nodes. It might be nice to think about implementing a grace period. For example, if I were to add multiple nodes with a few seconds between each, it should wait for a while to ensure all nodes are in the equation.

An initial delay (e.g. 30s) after each join event is one way to resolve this issue. Queuing them will not cut it for our use case, considering it's common for us to add 40-50 nodes.

We could reset the countdown after each node join/update/delete event, and run the descheduler once the delay following the last node event has elapsed.

@Dentrax
Contributor Author

Dentrax commented Feb 7, 2022

Hey @damemi, thanks for clarifying. One point that I may have missed is that I want to pass Informed mode to the entire DeschedulerPolicy. In addition, being able to configure each DeschedulerStrategy's StrategyRunMode individually would be great. Here is an example config that covers the idea:

  • runMode: both means the policy will run on a periodic loop and also handle informers to react immediately.
  • if we pass runMode: default to any strategy, it overrides the policy's mode, and that strategy will not get triggered by informer events. (would that be overengineering?)
  • if we pass runMode: informed to any strategy, it overrides the policy's mode, and that strategy will not get triggered on periodic loops.
  • if we don't specify a runMode for a strategy, the strategy uses the policy's run mode (a small resolution sketch follows the config below).
  • Since Proposal: Informed strategies #489 implements the runMode just for the RemovePodsViolatingNodeAffinity and RemovePodsViolatingNodeTaints strategies, I showed only those two. Should we add this run mode field for each strategy?
  • concern: if we plan to add new runModes in the near future, would using a both type be shortsighted?
apiVersion: "descheduler/v1alpha1"
kind: "DeschedulerPolicy"
runMode: both
strategies:
  "RemovePodsViolatingNodeAffinity":
    enabled: true
    runMode: default
  "RemovePodsViolatingNodeTaints":
    enabled: true
    runMode: informed
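
A tiny sketch of the override resolution described above (a hypothetical helper, not existing descheduler code):

package policy // illustrative

// effectiveRunMode resolves a strategy's run mode, falling back to the
// policy-level mode when the strategy does not set one.
func effectiveRunMode(policyMode, strategyMode string) string {
	if strategyMode != "" {
		return strategyMode // strategy-level runMode overrides the policy's
	}
	return policyMode // e.g. "both", "default", or "informed"
}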

At the end of the day, as @necatican pointed out, we expect the descheduler to run each strategy whenever a node is joined/removed/updated, just like a cronjob would.

@damemi
Contributor

damemi commented Feb 7, 2022

That's a good point. I think a configurable batch-delay would help the informed strategies run more efficiently and effectively for use cases involving high-frequency periods of updates.

Regarding each strategy, I think this will still need to be considered on a case-by-case basis. For simplicity and safety, I would prefer to roll this out in just one or two strategies to start (with the intent of adding more/all later) to give the design soak time. If we do that, then we won't need a policy-level runMode field until a significant portion of the strategies have implemented an informed option (at that time it can be easily added).

This would set us up for a more backwards-friendly development pattern (adding new fields is always backward-compatible, but if at some point we decide to change or remove the policy-level runMode, this could break existing configs). wdyt?

@Dentrax
Contributor Author

Dentrax commented Feb 7, 2022

I would prefer to roll this out in just one or two strategies to start (with the intent of adding more/all later) to give the design soak time.

+1, sounds good to me!

I think a configurable batch-delay would help the informed strategies run more efficiently

So we would need to create a new batchTime field in duration format, and a timer in the main state that listens for informer events and resets after each new event occurs. If the timer reaches or exceeds batchTime, the descheduler should run the strategies? (A sketch of this follows the timeline below.)

batchTime: 30
|t2 - t1|: 15
|t3 - t2|: exceeds batch time

t          t1              t2                t3
|----------------------------------------------> timeline
           ^               ^                 ^
           |               |                 |
           node1 joined    node2 joined      no nodes joined
           (reset timer)   (reset timer)     (run strategies)
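
A minimal sketch of that reset-on-event countdown; all names are illustrative, not existing descheduler code:

package descheduler // illustrative

import (
	"context"
	"time"
)

// debounceRuns runs the strategies only after the cluster has been quiet for
// batchTime; every event on eventCh restarts the countdown.
func debounceRuns(ctx context.Context, batchTime time.Duration, eventCh <-chan struct{}, runStrategies func()) {
	var timerC <-chan time.Time // nil until the first event arrives
	for {
		select {
		case <-eventCh:
			// Node joined/left/updated: (re)start the countdown. Note that
			// time.After leaks a timer per event until it fires; a real
			// implementation would reuse a single time.Timer with Reset.
			timerC = time.After(batchTime)
		case <-timerC:
			// batchTime elapsed with no new events: run the strategies.
			runStrategies()
			timerC = nil
		case <-ctx.Done():
			return
		}
	}
}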

Some questions that might need some clarification:

  1. Should we create a new informer{} struct for informed strategies, with fields such as enabled, batchTime, etc.? There may be requirements for adding more informer configs in the future.
  2. Which events should we listen to in order to reset the timer? (join, delete, update, etc.)
  3. Should we allow passing batchTime for each strategy individually? If so, we might consider creating separate timers for each strategy instead of keeping one timer in the main state. In this case:
  • Should we only run the strategy whose timer was exceeded? vs. running all informed strategies, as follows:
Strategy1 --> batchTime: 10
Strategy2 --> batchTime: 30

|t2 - t1|: 15 --> Run **only** Strategy1
|t3 - t2|: 35 --> Run **only** Strategy2

vs.

|t2 - t1|: 15 --> Run **ALL** informed strategies

t          t1              t2                t3
|----------------------------------------------> timeline
(same as above)
  4. What about the field name: batchTime vs batchDelay?

PTAL @necatican

@necatican

Which events should we listen to in order to reset the timer? (join, delete, update, etc.)

Join and leave events are the obvious ones for sure; we should also consider adding support for status changes and label/taint updates.

Should we allow passing batchTime for each strategy individually? If so, we might consider creating separate timers for each strategy instead of keeping one timer in the main state. In this case:

This could be a nice feature, but I cannot think of an instance where it is necessary.

What about the field name: batchTime vs batchDelay?

How about batchWindow or batchGracePeriod? batchTime is not explanatory enough. Also, while we're using it to set a delay, the actual delay will not be equal to batchDelay, which could cause some confusion. @damemi might have a better idea.

Should we consider adding a check to determine the maximum allotted time? What happens if changes keep occurring in the cluster? (One way to bound this is sketched below.)
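
One way to bound it is a hard deadline per burst; a hedged sketch extending the hypothetical debounce above, with illustrative names only:

package descheduler // illustrative

import (
	"context"
	"time"
)

// debounceWithCap behaves like the plain debounce, but also fires once
// maxWait has elapsed since the first event of a burst, even if changes
// keep occurring.
func debounceWithCap(ctx context.Context, batchTime, maxWait time.Duration, eventCh <-chan struct{}, runStrategies func()) {
	var quietC, capC <-chan time.Time // nil = no run pending
	for {
		select {
		case <-eventCh:
			quietC = time.After(batchTime) // restart the quiet-period countdown
			if capC == nil {
				capC = time.After(maxWait) // arm the hard deadline once per burst
			}
		case <-quietC: // the cluster stayed quiet long enough
			runStrategies()
			quietC, capC = nil, nil
		case <-capC: // changes kept occurring; run anyway at the deadline
			runStrategies()
			quietC, capC = nil, nil
		case <-ctx.Done():
			return
		}
	}
}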

@Dentrax
Contributor Author

Dentrax commented Mar 2, 2022

Kind ping @damemi. Waiting for your final command, boss, so we can get to it!

@damemi
Contributor

damemi commented Mar 9, 2022

Sorry for the delay @Dentrax
I think that this could be a good aspect of a potential refactor of the descheduler to make it more like a "descheduler framework". I'm trying to gather some feedback on that idea and on how we can correlate a couple of big projects like this under one goal. If you're interested, it would be great if you could add some input in the doc or the issue here: #753

@Dentrax
Contributor Author

Dentrax commented Mar 9, 2022

Thank you so much for informing us! I'm definitely interested and want to take a look at that proposal.


On the other hand, we'd like to send a PoC PR for this proposal, if you want. It's obvious that there will be needs we haven't covered/proposed here. At the least, we can continue to work on this idea and get feedback/review after submitting a PR.

@binacs
Member

binacs commented Mar 10, 2022

/cc

@damemi
Contributor

damemi commented Mar 10, 2022

@Dentrax sure thing, it never hurts to send a PoC PR to get some more detailed discussion rolling. Thanks!

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jun 8, 2022
@damemi
Contributor

damemi commented Jun 15, 2022

/remove-lifecycle stale
/lifecycle frozen

@k8s-ci-robot k8s-ci-robot added lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Jun 15, 2022
@Dentrax
Contributor Author

Dentrax commented Sep 1, 2022

Hey @damemi @ingvagabund,

Although we couldn't contribute to the Descheduler Framework migration, we followed closely what was done. Months have passed since the initial proposal and the internal code base has changed. We're running Deschedulers in high-scale environments and are still looking forward to this feature, and we want to get into it. In the discussion above, I mostly clarified my concerns about what we're going to do and how. The missing question is where we are going to put the logic in the current codebase.

The plugins directory is currently brand-new. I'm not sure whether this proposal counts as a plugin; I think it is something that should be placed at the execution level rather than the plugin level. I'm a bit confused and lost here.

Can you please enlighten us about where we should write the actual logic in the current codebase? Thanks!

cc'ing @eminaktas for awareness

@damemi
Contributor

damemi commented Sep 2, 2022

@Dentrax thanks for bumping this. This is definitely one of the features we had in mind with the framework refactor.

You're right that this will be an execution-level change rather than a "plugin". But it will still need to be extendable/modifiable in similar ways to the plugins (for example, writing a new strategy plugin that needs to be triggered on custom events). In the old codebase there wasn't a good mechanism for wiring this kind of information through to the strategies that need it.

For now, we should focus on migrating the existing code into the framework model so we don't overload ourselves. I think the big blocker for this will be migrating the internal code to a plugin-registry (right now the strategies are migrated to "plugins", but the internals are still basically just wrapping them in the old code). With that, we will be able to start working on an event registry that can be mapped to corresponding plugins.

If you want to put together a preliminary design for how this could work in the new codebase, please do! Having this discussion now, while not a primary focus, will still help inform design decisions for the rest of the framework migration.

@Dentrax
Contributor Author

Dentrax commented Nov 1, 2022

Dropping a concern so I don't forget! One small thing worth mentioning is that there could be a slight time window between a Node delete event and the descheduling cycle, so there is a possibility of a race condition between the Scheduler and the Descheduler:

An example case is:

  • t0 = the descheduler runs on every node create/delete event
  • t1 = two nodes are deleted
  • t2 = the Scheduler will try to schedule the displaced Pods to other nodes
  • t2 = the Descheduler will try to deschedule Pods at the same time

My idea here is that we could listen to the scheduler to check whether there is any ongoing scheduling activity (I don't know how yet). As soon as a Node gets deleted, we should NOT trigger the descheduler; instead we can wait a while until all the scheduling work is done (all missing Pods are up and running), plus a grace period.

Does it make sense? 🤔
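
For example, before triggering, the descheduler could check whether any Pods are still Pending as a rough proxy for "the scheduler has caught up". A hedged client-go sketch (clusterSettled is a hypothetical helper; note that Pending can also mean permanently unschedulable, so a real check would need to be smarter):

package descheduler // illustrative

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// clusterSettled reports whether no Pods are currently Pending.
func clusterSettled(ctx context.Context, client kubernetes.Interface) (bool, error) {
	pods, err := client.CoreV1().Pods(metav1.NamespaceAll).List(ctx, metav1.ListOptions{
		FieldSelector: "status.phase=Pending",
		Limit:         1, // we only need to know whether any exist
	})
	if err != nil {
		return false, err
	}
	return len(pods.Items) == 0, nil
}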
