
Add workers autoscaling through KEDA #33

Open
mfateev opened this issue Nov 26, 2019 · 31 comments
Labels: enhancement (New feature or request), packaging, product-integration (New integration with the product), up-for-grabs (Issues to consider for external contribution)

@mfateev
Member

mfateev commented Nov 26, 2019

https://github.com/kedacore/keda

@rylandg rylandg added enhancement New feature or request product-integration New integration with the product packaging labels Apr 27, 2020
@samarabbas samarabbas added the up-for-grabs Issues to consider for external contribution label Jul 3, 2021
@yiminc
Member

yiminc commented Oct 29, 2022

This sounds like a feature request for the SDK?

@aakarim

aakarim commented Nov 1, 2022

Just shedding some light on this: we currently use KEDA with Postgres queries to scale our workers when there are more tasks than our base node pool can handle.

At the moment it's a base of 2 workers, and when there are more than 10 concurrent Workflows we spin up more servers, at 5 concurrent Workflows per node. It's working well, but we're always a little worried that the schema will change and break everything (a rough sketch of what this looks like is below). It would be great to abstract this out into a native integration with KEDA.
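A minimal sketch of this kind of postgresql-trigger setup; the Deployment name, the connection env var, and the query are all illustrative assumptions, since Temporal's visibility schema is internal and may change between server versions (exactly the fragility described above):

apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: temporal-worker-scaler
spec:
  scaleTargetRef:
    name: temporal-worker                         # hypothetical worker Deployment
  minReplicaCount: 2                              # the base pool of 2 workers
  triggers:
    - type: postgresql
      metadata:
        connectionFromEnv: TEMPORAL_DB_CONNECTION # assumed env var holding the connection string
        # Hypothetical query against Temporal's visibility store; the table
        # name and status value are illustrative, not a stable contract.
        query: "SELECT COUNT(*) FROM executions_visibility WHERE status = 1"
        targetQueryValue: "5"                     # aim for ~5 concurrent Workflows per worker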

@sync-by-unito sync-by-unito bot closed this as completed Mar 3, 2023
@yiminc yiminc reopened this Mar 3, 2023
@ross-p-smith

I have this working within KEDA using the temporal client's ListOpenWorkflow method and would love to chat with one of the Temporal maintainers about the best way to get this into the various repos, and whether the approach makes sense.

Hades32 pushed a commit to Hades32/temporal that referenced this issue Jul 4, 2023
Co-authored-by: Mike <mike@ultimatetournament.io>
@febinct

febinct commented Aug 8, 2023

PR for the same: https://github.com/kedacore/keda/pull/4863/files

@RonaldGalea

+1 would be a great feature to have. It's common to have specific workers (services) listening on specific task queues. Exposing the size of a given task queue would be a very precise autoscaling metric for those workers.

@sinkr

sinkr commented May 28, 2024

My organization would like to see this, too.

@henrytseng

henrytseng commented May 28, 2024

+1 This would be great to have

@justinfx

My studio has just started testing Temporal and it would be great to have this feature.

@jhecking

There were two previous attempts to implement a Temporal scaler for Keda, but both got closed. Ref. kedacore/keda#4721 and kedacore/keda#4863.

@cretz, since you were directly involved in kedacore/keda#4863, do you think the new Task Queue Statistics added in v1.25 would be the right way to implement a Keda scaler for Temporal? If so, any thoughts on whether the Temporal team might consider implementing this, or whether support would have to come from the community?

@jhecking

To provide some more context: In particular, we are interested in using Keda's ability to scale down Temporal workers to zero if there are no pending tasks on the worker's task queue(s) for some period of time. This is not possible using the (previously?) recommended way of scaling Temporal workers based on the temporal_worker_task_slots_available metric generated by the workers themselves.

@cretz
Member

cretz commented Sep 24, 2024

do you think the new Task Queue Statistics added in v1.25 would be the right way to implement a Keda scaler for Temporal?

Absolutely, and this is on our roadmap to build. We demo'd this at our Replay conference. Stay tuned for more info.

@jhecking

this is on our roadmap to build

@cretz Any further information you can share on this, i.e. possibly a rough timeline? This would help us decide whether we can wait for an official version or whether we need to build something in-house for our own use first.

@febinct

febinct commented Sep 25, 2024

You could autoscale using https://keda.sh/docs/2.15/scalers/prometheus/, @jhecking. For example:

histogram_quantile(0.95, sum(rate(task_schedule_to_start_latency_bucket{exported_namespace="namespace", task_type="Activity", taskqueue="queuename"}[5m])) by (taskqueue, task_type, le))

@jhecking

Thanks, @febinct. That's what we currently do for 1->n scaling. What we are looking for is a solution that can handle 0->1 / 1->0 scaling as well.

@febinct

febinct commented Sep 25, 2024

@jhecking Scaling from zero to one (or vice versa) isn't currently feasible with Temporal because it relies on workers continuously polling task queues: if no worker is running, there is nothing to execute tasks, and since the relevant metrics are exported from the SDK, there is nothing to scale on either. This setup doesn't support Lambda-style or KEDA ScaledJob-style use cases at the moment. I discussed this with Maxim (CTO of Temporal) during the last Temporal meetup, and he mentioned that it's on the roadmap, though I didn't get an ETA :)

Our team raised this PR, and we’re actively exploring Task Queue Statistics-based autoscaling. We plan to raise a PR in KEDA within the next 2-4 weeks.

As a hack to avoid running larger machines and to save cost, what we did was run a very small pod: from the Temporal workflow we triggered an SQS event, used SQS to drive https://keda.sh/docs/1.4/concepts/scaling-jobs/, and then waited for a signal to come back from the SQS processor that executes the expensive job. A sketch of the ScaledJob side is below.
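A minimal sketch of the ScaledJob half of that hack, using KEDA's aws-sqs-queue trigger; the image, queue URL, and region are hypothetical, and authentication config is omitted for brevity:

apiVersion: keda.sh/v1alpha1
kind: ScaledJob
metadata:
  name: expensive-job-runner
spec:
  jobTargetRef:
    template:
      spec:
        containers:
          - name: processor
            image: example.com/expensive-job:latest    # hypothetical job image
        restartPolicy: Never
  maxReplicaCount: 10
  triggers:
    - type: aws-sqs-queue
      metadata:
        queueURL: https://sqs.us-east-1.amazonaws.com/123456789012/expensive-jobs  # assumed queue
        queueLength: "1"                               # one Job per pending message
        awsRegion: us-east-1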

@jhecking

Scaling from zero to one (or vice versa) isn't currently feasible with Temporal because it relies on workers continuously polling task queues: if no worker is running, there is nothing to execute tasks, and since the relevant metrics are exported from the SDK, there is nothing to scale on either.

@febinct I don't think this is correct with regard to the new task queue statistics. I was able to spin up a v1.25 dev server using the Temporal CLI. Then I used the hello-world examples to start several workflows on task queues that had no active workers, as well as some additional workflows that ran activities on task queues that had no active workers. When I use the Temporal CLI to query the DescribeTaskQueue API, I get the expected stats, i.e.

❯ temporal task-queue describe --task-queue hello-world-workflow-on-task-queue-without-workers
Task Queue Statistics:
    BuildID    TaskQueueType  ApproximateBacklogCount  ApproximateBacklogAge  BacklogIncreaseRate  TasksAddRate  TasksDispatchRate
  UNVERSIONED  workflow                             2  1m 58.619811s                  0.028910344   0.028910344                  0
  UNVERSIONED  activity                             0  0s                                       0             0                  0
Pollers:
  BuildID  TaskQueueType  Identity  LastAccessTime  RatePerSecond
  
❯ temporal task-queue describe --task-queue hello-world-activity-on-task-queue-without-workers
Task Queue Statistics:
    BuildID    TaskQueueType  ApproximateBacklogCount  ApproximateBacklogAge  BacklogIncreaseRate  TasksAddRate  TasksDispatchRate
  UNVERSIONED  workflow                             0  0s                                       0             0                  0
  UNVERSIONED  activity                             3  13.814805s                     0.090869784   0.090869784                  0
Pollers:
  BuildID  TaskQueueType  Identity  LastAccessTime  RatePerSecond

So I think it should be possible to implement a Keda scaler that queries the DescribeTaskQueue API and uses the ApproximateBacklogCount metric to make 0->1 and 1->0 scaling decisions.

@cretz please correct me if I got any of this wrong.
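To make that concrete, here is a rough sketch of what such a scaler's configuration might look like; the temporal trigger type and its metadata fields are assumptions modeled on the PRs discussed in this thread, not a released KEDA API:

apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: temporal-backlog-scaler
spec:
  scaleTargetRef:
    name: hello-world-worker                    # hypothetical worker Deployment
  minReplicaCount: 0                            # allow 1->0 when the backlog drains
  maxReplicaCount: 10
  triggers:
    - type: temporal                            # assumed trigger name (under review, not yet released)
      metadata:
        endpoint: temporal-frontend.temporal:7233   # assumed frontend address
        namespace: default
        taskQueue: hello-world-workflow-on-task-queue-without-workers
        targetQueueSize: "5"                    # scale on ApproximateBacklogCount, ~5 tasks per replica

With minReplicaCount set to 0, KEDA would activate the first replica as soon as ApproximateBacklogCount rises above zero and scale back down once the backlog drains.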

@cretz
Member

cretz commented Sep 25, 2024

Any further information you can share on this, i.e. possibly a rough timeline?

I am afraid there is no specific timeline at this time.

So I think it should be possible to implement a Keda scaler that queries the DescribeTaskQueue API and uses the ApproximateBacklogCount metric to make 0->1 and 1->0 scaling decisions.

Yes, unlike schedule-to-start latency (which is worker-side and so requires a worker), backlog count can be used for scale-to-zero use cases and was one of the primary motivators behind this API.

Feel free to come discuss scaling in our community slack or our community forums.

@jhecking

Yes, unlike schedule-to-start latency (which is worker-side and so requires a worker), backlog count can be used for scale-to-zero use cases and was one of the primary motivators behind this API.

Great! Thanks for the confirmation.

@atihkin

atihkin commented Sep 25, 2024

Hi all 👋🏽 I'm Nikitha, a PM here at Temporal and I wanted to acknowledge all the great feedback and discussion in this thread.

I'm excited to share that we do have imminent plans to build and contribute a KEDA scaler upstream (yes, scale-to-zero will work, as @cretz confirmed). I don't have an ETA for you just yet, but it's actively in the works and we will share more soon!

@febinct

febinct commented Sep 26, 2024

PR for this: https://github.com/kedacore/keda/pull/6191/files. Please review, @cretz @jhecking.

@jhecking

PR for this: https://github.com/kedacore/keda/pull/6191/files. Please review, @cretz @jhecking.

Thank you! Will take a look.

@cretz
Member

cretz commented Sep 26, 2024

@febinct - from @atihkin above, "we do have imminent plans to build and contribute a KEDA scaler upstream", but adding an externally created scaler seems to preempt us from being able to build this ourselves. Our algorithm may differ slightly from the one in the PR (for instance, combined task queue stats are probably not the way to go unless opted in; build-id-specific stats may be better). I will get with the engineers on the scaling project and review the submission. We should hold off on merging this PR until Temporal takes a look and/or submits a similar alternative.

@febinct

febinct commented Sep 26, 2024

If it makes sense, please leave review comments. We're happy to collaborate as an extended team and to contribute to the growth of the Temporal community.

@jhecking

I, for one, am very grateful to @febinct and team for having put their own implementation out there. 🙏 I have reviewed the PR and I think it will meet our needs. We are planning to go ahead and run some tests with it to get a feel for how well the 0->1 / 1->0 scaling works for our workloads, though we would probably wait for the official implementation from the Temporal team before using it in prod.

@atihkin

atihkin commented Oct 7, 2024

To update folks on this thread: the Temporal team has taken a look and we've decided to go ahead with @febinct's proposal (thank you for your contribution, and also @jhecking for your review!). @robholland has left a few comments in https://github.com/kedacore/keda/pull/6191/files but we do hope to be able to merge this PR soon.

@febinct

febinct commented Oct 8, 2024

All the credit goes to https://github.com/Prajithp from our team. We are actively working on addressing the review comments so the PR can be merged. Will close this out soon.

Thanks @atihkin

@jhecking

jhecking commented Oct 8, 2024

Thank you @Prajithp and @febinct for pushing this forward! 🙏

But I do want to point out that from our perspective the new Keda Temporal scaler is not yet production ready, as we are still faced with the issue of Keda using up 100% of the allocated CPU as soon as we enable the new scaler. I'm continuing to debug the issue but have yet to find a solution.

@febinct

febinct commented Oct 8, 2024

We are also looking into the same issue, @jhecking. Right now we suspect the scaler is creating new gRPC connections too frequently; the MinConnectTimeout of 5 seconds might be causing rapid reconnections if the connection does not succeed within that time frame, which could be another potential cause. Can you also try bypassing Consul temporarily to see if the CPU load decreases? We don't have a Consul setup ourselves.

@jhecking

jhecking commented Oct 8, 2024

Can you also try bypassing Consul temporarily to see if the CPU load decreases? We don't have a Consul setup ourselves.

In our case, the Temporal workers are often running in a different cluster from the Temporal server, and Consul is required for the workers and Keda to connect to the Temporal server. So far, none of our Temporal workers (using the TypeScript, Java, and Python SDKs) have shown any similar issues. But I'll try to replicate this in a different cluster where Consul is not required.

@raanand-dig

Any update?

@robholland
Contributor

@raanand-dig please follow progress in https://github.com/kedacore/keda/pull/6191/files
