feat: add scaler for temporal #4863
Conversation
Thank you for your contribution! 🙏 We will review your PR as soon as possible.
While you are waiting, make sure to:
Can someone please review this?
}

// getQueueSize returns the queue size of open workflows.
func (s *temporalWorkflowScaler) getQueueSize(ctx context.Context) (int64, error) {
This does not get an accurate queue size (this paginates). You can use .CountWorkflow
but that's only for workflows, it doesn't help with activities (and often it's activities that are the reason for needing to scale).
The proper way to scale Temporal workers is to use the temporal_worker_task_slots_available
metric on the workers. See https://docs.temporal.io/dev-guide/worker-performance.
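For example, scaling on that metric is already possible with KEDA's existing Prometheus scaler; a rough sketch of a ScaledObject could look like this (the deployment name, Prometheus address, task queue label, and threshold are all placeholders, and the exact query shape depends on how your workers expose metrics):

```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: temporal-worker-scaler
spec:
  scaleTargetRef:
    name: temporal-worker            # placeholder: your worker Deployment
  triggers:
    - type: prometheus
      metadata:
        serverAddress: http://prometheus.monitoring.svc:9090   # placeholder
        # NOTE: available slots *drop* under load, so a real query usually
        # derives "used" slots (or otherwise inverts this) rather than
        # scaling on the raw value directly.
        query: sum(temporal_worker_task_slots_available{task_queue="my-task-queue"})
        threshold: "10"
```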
It would make more sense if we consider multiple activities within a single workflow and deploy workers for each activity. However, the current scaling mechanism relies on pending workflows rather than individual activities. I plan to review the SDK documentation to explore the possibility of integrating activities into the scaling process.
Notably, "temporal_worker_task_slots_available" serves as a Prometheus metric, which could potentially be employed alongside the Prometheus scaler for those interested in scaling based on this particular metric.
It would make more sense if we consider multiple activities within a single workflow and deploy workers for each activity.
I don't think it's a reasonable scaler if you don't consider activities. And I don't think the scaler is working that well if it's only for a single workflow type.
However, the current scaling mechanism relies on pending workflows rather than individual activities.
Pending activities matter too (maybe more). Even if you were only doing pending workflows, list open workflows is paginated, you are not getting full counts. Regardless, scaling a worker based on a single workflow is not the best way to write a scaler.
Notably, "temporal_worker_task_slots_available" serves as a Prometheus metric, which could potentially be employed alongside the Prometheus scaler for those interested in scaling based on this particular metric.
This is the metric that should be scaled on and is the one Temporal recommends scaling on (assuming you've configured individual worker resources properly based on your workflows/activities), see https://docs.temporal.io/dev-guide/worker-performance. The current scaler, which doesn't include activities, only works for a single workflow type, etc., is not sufficient IMO.
I have transitioned this process to a paginated approach. Unfortunately, I haven't discovered a method to integrate activity counts into the current setup. It seems that further research is necessary to explore potential solutions in this regard.
I have transitioned this process to a paginated approach.
You should just use CountWorkflow, not list every workflow. But regardless, we do not have a way for you to easily get all pending activities from the server for a task queue. The scaler needs to use the slots metric per the worker performance doc. Using list/count is not the best way to write the scaler.
@cretz Can you please review the recent changes?
The idea of a target queue size and listing workflows is not the recommended approach to determining whether to scale up or down (ntm it'd be better to use count with a query checking whether running). We recommend using the temporal_worker_task_slots_available metric (with a check whether the worker type is activity or workflow).
@cretz is one of the main contributors of temporal.io SDK repo.
I think that we should follow his recommendations at this point @Prajithp . Could you implement it? 🙏
@JorTurFer I am unsure about the feasibility of achieving this through the CountWorkflow method, as it might not provide visibility into whether the activity is presently running or not. As recommended by him, individuals aiming to scale based on Prometheus data can make use of the query specified above.
Can we please leave this pull request open for a while? This would allow for the possibility of additional suggestions from others. In the meantime, we will maintain our own fork and deploy it in our production environment.
Can we please leave this pull request open for a while?
Yes, sure, no problem at all
we will maintain our own fork and deploy it in our production environment
I'd suggest using the external scaler or the metrics API scaler instead of maintaining your own fork. I mean, KEDA can be extended using those scalers from a 3rd-party service that you can develop with the code that you prefer. Using this approach instead of maintaining your own fork gives you the option to upgrade KEDA without the hard effort of rebasing and adapting the code (as you have developed just a scaler this shouldn't be a drama, but extending is always better than modifying)
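For reference, wiring up such an external scaler on the KEDA side is just a trigger pointing at a gRPC service you run yourself, roughly like this (the address and the extra metadata keys are placeholders):

```yaml
triggers:
  - type: external
    metadata:
      scalerAddress: temporal-external-scaler.keda.svc:8080   # placeholder: your gRPC scaler service
      # any remaining metadata keys are passed through verbatim to your scaler
      namespace: default
      taskQueue: my-task-queue
```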
I had to abandon this as I did not have time to write the e2e tests #4721
@Prajithp can you please fix DCO issues and open a PR for docs please?
Also note the above so please bear with us ☝
Signed-off-by: Prajith P <prajithpalakkuda@gmail.com> Signed-off-by: Prajith P <prajith.palakkuda@cleartax.in>
Signed-off-by: Prajith P <prajith.palakkuda@cleartax.in>
Thanks for this addition ❤️ and sorry for the slow review 😞 , summer is complicated :/
I have left some comments inline. Apart from them, we use vendor to have reproducible builds, so you have to execute:
go mod tidy
go mod vendor
This will add all the new deps to the vendor folder, please commit and push them too
HostPort: meta.endpoint,
ConnectionOptions: sdk.ConnectionOptions{
	DialOptions: []grpc.DialOption{
		grpc.WithTimeout(time.Duration(temporalClientTimeOut) * time.Second),
As gRPC is HTTP at the end of the day, I think that we should use the environment variable that is in config.GlobalHTTPTimeout
Please, add the TLS options too. There is a helper you should use that unifies all TLS configs like minVersion or custom CAs.
		fmt.Sprintf("temporal-%s-%s", meta.namespace, meta.workflowName),
	),
)
meta.scalerIndex = config.ScalerIndex
I think that this line isn't necessary because you are already generating the metric name here (and that's the only reason to use the scalerIndex IIRC)
}

// getQueueSize returns the queue size of open workflows.
func (s *temporalWorkflowScaler) getQueueSize(ctx context.Context) (int64, error) {
TBH, I don't have any knowledge about temporal, so I can't give you any extra insight about the implementation.
@cretz , do you agree with the current implementation? As you (both) are the experts on temporal, I hope to get your consensus on the implementation
for {
	listOpenWorkflowExecutionsRequest := &workflowservice.ListOpenWorkflowExecutionsRequest{
		Namespace:       s.metadata.namespace,
		MaximumPageSize: 1000,
		NextPageToken:   nextPageToken,
		Filters: &workflowservice.ListOpenWorkflowExecutionsRequest_TypeFilter{
			TypeFilter: &tclfilter.WorkflowTypeFilter{
				Name: s.metadata.workflowName,
			},
		},
	}
	ws, err := s.tcl.ListOpenWorkflow(ctx, listOpenWorkflowExecutionsRequest)
	if err != nil {
		return 0, fmt.Errorf("failed to get workflows: %w", err)
	}

	for _, exec := range ws.GetExecutions() {
		execution := executionInfo{
			workflowId: exec.Execution.GetWorkflowId(),
			runId:      exec.Execution.RunId,
		}
		executions = append(executions, execution)
	}

	if nextPageToken = ws.NextPageToken; len(nextPageToken) == 0 {
		break
	}
}
I'm afraid about the performance impact of this. Could we face an infinite (or almost infinite) loop? If the backend responds slowly and we have to browse, idk, 50 pages, what will happen?
Is adding a limit for the pages doable? Maybe just with a parameter that users can modify at their own risk?
WDYT?
} | ||
|
||
// getQueueSize returns the queue size of open workflows. | ||
func (s *temporalWorkflowScaler) getQueueSize(ctx context.Context) (int64, error) { |
This PR's implementation and your implementation are really different: https://github.com/kedacore/keda/pull/4721/files#diff-f59fd700aa9c39c0f77d364730bcf70d05712e8841b100cc6b2d502f5224724bR183-R190
for _, execInfo := range executions {
	wg.Add(1)
	go func(e executionInfo) {
		defer wg.Done()

		workflowId := e.workflowId
		runId := e.runId

		if !s.isActivityRunning(ctx, workflowId, runId) {
			executionId := workflowId + "__" + runId
			pendingCh <- executionId
		}
	}(execInfo)
}
wg.Wait()
close(pendingCh)
Same as above: if there are thousands of pending executions, what will happen?
Probably I'm wrong, but I understand that we are listing all the workflows and, inside the workflows, checking all the executions to decide whether each is running or not. This raises some questions:
- Are executed activities removed at any moment, or will we have this queue growing and growing?
- Doesn't the workflow have any option to give that information during the first requests instead of having to navigate over all the activities?
	}
}

func TestParseTemporalMetadata(t *testing.T) {
Isn't this test duplicated?
spec:
  containers:
    - name: worker
      image: "prajithp/temporal-sample:1.0.0"
Could you open a PR to this repo adding the image? We prefer to have all the used images in the org infra to prevent possible issues
spec:
  containers:
    - name: workerflow
      image: "prajithp/temporal-sample:1.0.0"
same as above
@Prajithp any update please?
Any update here?
@Prajithp any updates on this? If we don't hear back soon we'll have to close this PR
@tomkerkhove, I think we should close this for now since @cretz doesn't seem to be happy with this approach.
Yes, I am sorry but the current approach is not how we tell users to scale and has many limitations and drawbacks. We at Temporal are considering building and contributing this, but I am afraid I have no details yet. Even without dedicated server support (e.g. task queue backlog counts), this would need to mirror the approach at https://docs.temporal.io/dev-guide/worker-performance to be a reasonable one.
@cretz thank you for the information. I am gonna close this PR. Feel free to reach out if you have a better proposal.
#6191: please check the PR based on the new approach using Temporal 1.25 Task Queue Statistics @cretz @zroubalik @JorTurFer
Implement a temporal scaler
Checklist
Relates to #4724