Add ScalersCache to reuse scalers unless they need changing #2187

ahmelsayed · 2021-10-12T12:46:17Z

Closes #1121

Checklist

Commits are signed with Developer Certificate of Origin (DCO - learn more)
Tests have been added
[N/A] A PR is opened to update our Helm chart (repo) (if applicable, ie. when deployment manifests are modified)
[N/A] A PR is opened to update the documentation on (repo) (if applicable)
Changelog has been updated

Fixes #1121

.github/workflows/pr-validation.yml

zroubalik

Thanks a ton for this!

I have a few questions and minor nits on import formatting :) I did a very quick review and will go through the code more thoroughly later.

adapter/main.go

pkg/scaling/scale_handler_test.go

zroubalik · 2021-10-12T14:58:02Z

pkg/scaling/scale_handler.go

+	h.lock.RLock()
+	if cache, ok := h.scalerCaches[key]; ok {
+		h.lock.RUnlock()
+		return cache, nil
+	}
+	h.lock.RUnlock()
+
+	h.lock.Lock()
+	defer h.lock.Unlock()
+	if cache, ok := h.scalerCaches[key]; ok {
+		return cache, nil
+	}


Could you please elaborate on this part? I am not sure I get it. Thanks!

lines 169-174 look to be the internal cache check, but @ahmelsayed are lines 178-180 duplicates?

Recheck the state after acquiring the W lock on line 176. If a previous ~~thread~~ goroutine has already created a cache, use it. I could change it to a sync.Map with LoadOrStore, but after reading https://github.com/golang/go/blob/master/src/sync/map.go#L12-L26 I wasn't sure that sync.Map is the best option here.

This can be changed if I take a W lock at line 169, but I thought an R lock there will reduce contention. It's not a high throughput scenario, but that's the idea.

Right, makes sense.
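The check, unlock, lock, recheck sequence discussed above is a double-checked get-or-create. A minimal self-contained sketch of the pattern follows; `handler`, `cacheEntry`, `getOrCreate`, and the `builds` counter are hypothetical stand-ins for illustration, not the PR's actual types.

```go
package main

import (
	"fmt"
	"sync"
)

// cacheEntry stands in for the PR's ScalersCache; its field is hypothetical.
type cacheEntry struct {
	key string
}

// handler mirrors the shape of the reviewed snippet: a map guarded by an RWMutex.
type handler struct {
	lock         sync.RWMutex
	scalerCaches map[string]*cacheEntry
	builds       int // counts how many entries were actually built
}

// getOrCreate returns the cached entry for key, building it at most once.
func (h *handler) getOrCreate(key string) *cacheEntry {
	// Fast path: a read lock lets concurrent readers proceed together.
	h.lock.RLock()
	if c, ok := h.scalerCaches[key]; ok {
		h.lock.RUnlock()
		return c
	}
	h.lock.RUnlock()

	// Slow path: another goroutine may have stored the entry between the
	// RUnlock above and the Lock below, so the map is checked a second time.
	h.lock.Lock()
	defer h.lock.Unlock()
	if c, ok := h.scalerCaches[key]; ok {
		return c
	}
	c := &cacheEntry{key: key}
	h.scalerCaches[key] = c
	h.builds++
	return c
}

func main() {
	h := &handler{scalerCaches: map[string]*cacheEntry{}}
	var wg sync.WaitGroup
	for i := 0; i < 50; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			h.getOrCreate("scaledobject.default.demo")
		}()
	}
	wg.Wait()
	fmt.Println("builds:", h.builds)
}
```

Without the second check, two goroutines that both miss on the fast path would each build and store an entry, and the first one stored would be silently replaced.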

got it! thanks, I missed the RUnlock call there. since this is pretty subtle, I think at a minimum it should have a comment explaining that there's a load or store operation going on here. better would be to use an abstraction like (sync.Map).LoadOrStore. I'll leave it up to you, though, since the code will change in a non-trivial way.

at a higher level, though, is there likely to be a lot of contention with this code? I'm asking to get a feel for whether it's worth using that read lock for the initial check.
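For comparison, a sketch of what the `(sync.Map).LoadOrStore` alternative might look like. The `handler` and `cacheEntry` types are again hypothetical. One trade-off is visible in the code: `LoadOrStore` takes an already-built value, so a goroutine that loses the race builds a candidate that is then discarded.

```go
package main

import (
	"fmt"
	"sync"
)

// cacheEntry stands in for the PR's ScalersCache; its field is hypothetical.
type cacheEntry struct {
	key string
}

// handler keeps its entries in a sync.Map instead of a map plus RWMutex.
type handler struct {
	scalerCaches sync.Map // effectively map[string]*cacheEntry
}

// getOrCreate returns the entry stored for key, storing a new one if absent.
func (h *handler) getOrCreate(key string) *cacheEntry {
	// Cheap lookup first, so the common hit path builds nothing.
	if v, ok := h.scalerCaches.Load(key); ok {
		return v.(*cacheEntry)
	}
	// The candidate is built before LoadOrStore runs. If another goroutine
	// stored an entry in the meantime, this candidate is dropped; a real
	// cache holding open connections would have to close it at that point.
	candidate := &cacheEntry{key: key}
	actual, _ := h.scalerCaches.LoadOrStore(key, candidate)
	return actual.(*cacheEntry)
}

func main() {
	h := &handler{}
	a := h.getOrCreate("demo")
	b := h.getOrCreate("demo")
	fmt.Println("same entry:", a == b)
}
```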

I don't expect much contention here. The code paths I could identify that might cause contention are:

Multiple ScaledObject firing their pollingInterval at exactly the same time.

If concurrent reconciliation is enabled.

If the metric adapter gets multiple requests for the same metric value at the same time.

@ahmelsayed would it make sense to just acquire a write lock, do the check, and then handle a cache miss then, to simplify this code?

pkg/scaling/scale_handler.go

pkg/scaling/cache/scalers_cache_test.go

adapter/main.go

arschles

Looks good @ahmelsayed - I left a few comments, mostly nits/ideas. One general comment: it would be good to add some Godoc comments, particularly on the ScalersCache

adapter/main.go

controllers/keda/suite_test.go

arschles · 2021-10-12T16:56:41Z

pkg/scaling/cache/scalers_cache.go

+func NewScalerCache(scalers []scalers.Scaler, factories []func() (scalers.Scaler, error), logger logr.Logger, recorder record.EventRecorder) (*ScalersCache, error) {
+	if len(scalers) != len(factories) {
+		return nil, fmt.Errorf("scalers and factories must match")
+	}


an idea - make the below scalerBuilder public and take in a []ScalerBuilder parameter instead of the scalers and factories ones. you wouldn't need to do this error check and callers wouldn't need to know to ensure len(scalers) == len(factories). WDYT?
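A rough sketch of the suggestion, assuming a reduced `Scaler` interface and a constructor named `NewScalersCache`; both are hypothetical simplifications, as are `Len` and `noopScaler`. Pairing each scaler with its factory in one struct makes the length mismatch unrepresentable, so the error return disappears.

```go
package main

import "fmt"

// Scaler is a reduced, hypothetical stand-in for the scalers.Scaler interface.
type Scaler interface {
	Close() error
}

// ScalerBuilder keeps a scaler next to the factory that can recreate it, so
// the two cannot drift apart the way two parallel slices can.
type ScalerBuilder struct {
	Scaler  Scaler
	Factory func() (Scaler, error)
}

// ScalersCache holds one builder per trigger.
type ScalersCache struct {
	builders []ScalerBuilder
}

// NewScalersCache needs no length check: every element is already a pair.
func NewScalersCache(builders []ScalerBuilder) *ScalersCache {
	return &ScalersCache{builders: builders}
}

// Len reports how many scalers the cache holds.
func (c *ScalersCache) Len() int { return len(c.builders) }

// noopScaler is a fake used only for the demonstration below.
type noopScaler struct{}

func (noopScaler) Close() error { return nil }

func main() {
	newScaler := func() (Scaler, error) { return noopScaler{}, nil }
	cache := NewScalersCache([]ScalerBuilder{
		{Scaler: noopScaler{}, Factory: newScaler},
		{Scaler: noopScaler{}, Factory: newScaler},
	})
	fmt.Println("scalers cached:", cache.Len())
}
```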

pkg/scaling/cache/scalers_cache.go

arschles · 2021-10-12T17:10:30Z

pkg/scaling/scale_handler.go

JorTurFer · 2021-10-12T17:36:46Z

This PR modifies the behavior related to the lifecycle of the scalers: basically, we are now going to keep them alive until a change requires an update, right?
Should these lifecycle changes be documented? (for example in https://github.com/kedacore/keda/blob/main/CREATE-NEW-SCALER.md#lifecycle-of-a-scaler)

ahmelsayed · 2021-10-12T22:50:31Z

This PR modifies the behavior related to the lifecycle of the scalers: basically, we are now going to keep them alive until a change requires an update, right? Should these lifecycle changes be documented? (for example in main/CREATE-NEW-SCALER.md#lifecycle-of-a-scaler)

That's correct. The behavior/contract shouldn't change at all though, so I wasn't sure if anything in particular needs to be documented.

Behavior before:

1. Create Scaler (with latest secrets/trigger auth values)
2. Check Scaler
3. Close Scaler
4. Sleep for pollingInterval
5. goto 1

if ScaledObject has changed, cancel loop above

Behavior after:

1. Create or get existing Scaler
2. Check Scaler
3. Refresh Scaler on error (in case a secret or authentication value changed in the meantime, since we don't/can't watch all secret sources)
4. Check Scaler
5. Sleep for pollingInterval
6. goto 1

if ScaledObject has changed, invalidate the cache.
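The check, refresh-on-error, check-again part of the new behavior can be sketched as below. This is an illustration under assumptions, not the PR's code: `Scaler` is reduced to two methods, and `entry`, `check`, and `fakeScaler` are hypothetical names.

```go
package main

import (
	"errors"
	"fmt"
)

// Scaler is a reduced, hypothetical stand-in for the real scaler interface.
type Scaler interface {
	IsActive() (bool, error)
	Close() error
}

// entry holds a long-lived scaler plus the factory that can rebuild it with
// freshly resolved secrets.
type entry struct {
	scaler  Scaler
	factory func() (Scaler, error)
}

// check asks the cached scaler first. Only on error does it rebuild the
// scaler and ask once more, returning the result of the second attempt.
func (e *entry) check() (bool, error) {
	active, err := e.scaler.IsActive()
	if err == nil {
		return active, nil
	}
	fresh, ferr := e.factory()
	if ferr != nil {
		return false, ferr
	}
	_ = e.scaler.Close() // release the stale scaler's connection
	e.scaler = fresh
	return e.scaler.IsActive()
}

// fakeScaler fails when fail is set and records whether it was closed.
type fakeScaler struct {
	fail   bool
	closed bool
}

func (f *fakeScaler) IsActive() (bool, error) {
	if f.fail {
		return false, errors.New("stale credentials")
	}
	return true, nil
}

func (f *fakeScaler) Close() error {
	f.closed = true
	return nil
}

func main() {
	stale := &fakeScaler{fail: true}
	e := &entry{
		scaler:  stale,
		factory: func() (Scaler, error) { return &fakeScaler{}, nil },
	}
	active, err := e.check()
	fmt.Println("active:", active, "err:", err, "stale closed:", stale.closed)
}
```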

So from a user/scaler author perspective, there is no change. If a scaler today has a long-lived connection and implements Close() correctly, it should work.

@arschles @JorTurFer @tomkerkhove @zroubalik do you think this behavior should be globally configurable (or configurable per ScaledObject)?

The main scenario I can think of is someone having a large number of ScaledObject that don't need to be checked frequently. For example, 100 MySQL scalers that need to be checked once an hour.

Before: each hourly check creates a short-lived SQL connection.

After: there will always be 100 open connections to the MySQL server.

arschles · 2021-10-12T23:07:47Z

@arschles @JorTurFer @tomkerkhove @zroubalik do you think this behavior should be globally configurable (or configurable per ScaledObject)?

what behavior specifically? whether a scaler applied to a Scaled{Object, Job} should be cached?

JorTurFer · 2021-10-13T06:28:42Z

The main scenario I can think of is someone having a large number of ScaledObject that don't need to be checked frequently. For example, 100 MySQL scalers that need to be checked once an hour.

Based on this use case, maybe we should support configuring it per trigger. I mean, inside the same ScaledObject we can have several triggers, some of them for a long time and others for a short time 🤔
I think achieving this behavior could be a bit tricky, so for me it's enough if we can define it at the ScaledObject level.

So from a user/scaler author perspective, there is no change. If a scaler today has a long-lived connection and implements Close() correctly, it should work.

That's true, and if we support enabling and disabling the cache, the behavior is exactly the same, you are right (in my previous comment I was thinking about the lifecycle of the internal objects and the requirement to treat them as ephemeral objects, but without the cache they are ephemeral, so the behavior is exactly the same).

zroubalik · 2021-10-13T09:59:56Z

@arschles @JorTurFer @tomkerkhove @zroubalik do you think this behavior should be globally configurable (or configurable per ScaledObject)?

Probably? Don't have a strong opinion on this.

The main scenario I can think of is someone having a large number of ScaledObject that don't need to be checked frequently. For example, 100 MySQL scalers that need to be checked once an hour.

Before: each hourly check creates a short-lived SQL connection.

After: there will always be 100 open connections to the MySQL server.

If I am not mistaken, HPA asks the Metrics Server for particular metrics periodically, therefore the MySQL server metrics will be scraped anyway. We would have to cache the metric value as well in order to do the actual check once an hour 🤔

Btw, I had been thinking about this feature some time ago: the current pollingInterval applies only to the Operator, not the Metrics Server. If we cache the metric value in the Metrics Server, the pollingInterval could then be applicable there as well. I'm not sure if this is useful, but it should limit the number of requests to the external service, if that's a user's concern.
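The metric-value caching idea floated here is not part of this PR. Purely as an illustration of what it could mean, the sketch below serves a stored value until a polling interval has elapsed and only then calls the external source again. `metricCache`, `get`, and the `fetch` callback are hypothetical names.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// metricCache remembers the last fetched value and when it was fetched.
type metricCache struct {
	mu       sync.Mutex
	interval time.Duration // plays the role of pollingInterval
	value    int64
	fetched  time.Time
	valid    bool
}

// get returns the cached value while it is younger than interval; otherwise
// it calls fetch, stores the result, and returns it.
func (m *metricCache) get(now time.Time, fetch func() (int64, error)) (int64, error) {
	m.mu.Lock()
	defer m.mu.Unlock()
	if m.valid && now.Sub(m.fetched) < m.interval {
		return m.value, nil
	}
	v, err := fetch()
	if err != nil {
		return 0, err
	}
	m.value, m.fetched, m.valid = v, now, true
	return v, nil
}

func main() {
	calls := 0
	fetch := func() (int64, error) {
		calls++
		return 42, nil
	}
	c := &metricCache{interval: time.Hour}
	now := time.Now()
	c.get(now, fetch)
	c.get(now.Add(30*time.Minute), fetch) // within the interval: served from cache
	c.get(now.Add(2*time.Hour), fetch)    // interval elapsed: fetched again
	fmt.Println("fetches:", calls)
}
```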

zroubalik · 2021-10-13T10:01:15Z

This PR modifies the behavior related to the lifecycle of the scalers: basically, we are now going to keep them alive until a change requires an update, right? Should these lifecycle changes be documented? (for example in main/CREATE-NEW-SCALER.md#lifecycle-of-a-scaler)

That's correct. The behavior/contract shouldn't change at all though, so I wasn't sure if anything in particular needs to be documented.

Behavior before:
1. Create Scaler (with latest secrets/trigger auth values)

2. Check Scaler

3. Close Scaler

4. Sleep for `pollingInterval`

5. goto `1`
if ScaledObject has changed, cancel loop above

Behavior after:
1. Create or get existing Scaler

2. Check Scaler

3. Refresh Scaler on error (in case a secret or authentication value changed in the meantime, since we don't/can't watch all secret sources)

4. Check Scaler

5. Sleep for `pollingInterval`

6. goto `1`
if ScaledObject has changed, invalidate the cache.

I agree we should document this for scalers developers. I think that similar description to the one above is enough.

ahmelsayed · 2021-10-14T23:58:53Z

@arschles @JorTurFer @tomkerkhove @zroubalik do you think this behavior should be globally configurable (or configurable per ScaledObject)?

Probably? Don't have a strong opinion on this.

The main scenario I can think of is someone having a large number of ScaledObject that don't need to be checked frequently. For example, 100 MySQL scalers that need to be checked once an hour.
Before: each hourly check creates a short-lived SQL connection.
After: there will always be 100 open connections to the MySQL server.

If I am not mistaken, HPA asks the Metrics Server for particular metrics periodically, therefore the MySQL server metrics will be scraped anyway. We would have to cache the metric value as well in order to do the actual check once an hour 🤔

Btw, I had been thinking about this feature some time ago: the current pollingInterval applies only to the Operator, not the Metrics Server. If we cache the metric value in the Metrics Server, the pollingInterval could then be applicable there as well. I'm not sure if this is useful, but it should limit the number of requests to the external service, if that's a user's concern.

That's a good point @zroubalik. I'll just add a feature env var that will invalidate the cache right after every check for now (effectively closing the scaler to have the same old behavior) in case the new behavior impacts someone. How does that sound?
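A sketch of how such a feature flag might be wired, under stated assumptions: the variable name `EXAMPLE_DISABLE_SCALERS_CACHE` is made up for this example (the thread does not name the real one), and `handler` and `afterCheck` are hypothetical. A real implementation would also close the scalers held by the dropped entry.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
)

// envDisableCache is a made-up name for this sketch; the thread does not say
// what the real flag would be called.
const envDisableCache = "EXAMPLE_DISABLE_SCALERS_CACHE"

// cacheDisabled reports whether the flag is set to a true value. getenv is
// injected so the function can be exercised without touching the process
// environment; callers normally pass os.Getenv.
func cacheDisabled(getenv func(string) string) bool {
	v, err := strconv.ParseBool(getenv(envDisableCache))
	return err == nil && v
}

// handler is a hypothetical stand-in for the scale handler.
type handler struct {
	caches map[string]string
}

// afterCheck runs once a polling cycle for key has finished.
func (h *handler) afterCheck(key string, getenv func(string) string) {
	if cacheDisabled(getenv) {
		// Dropping the entry forces the next cycle to build a new scaler,
		// which matches the old create, check, close behavior.
		delete(h.caches, key)
	}
}

func main() {
	h := &handler{caches: map[string]string{"demo": "cached scaler"}}
	h.afterCheck("demo", os.Getenv)
	fmt.Println("entries left:", len(h.caches))
}
```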

zroubalik · 2021-10-15T09:52:04Z

That's a good point @zroubalik. I'll just add a feature env var that will invalidate the cache right after every check for now (effectively closing the scaler to have the same old behavior) in case the new behavior impacts someone. How does that sound?

So do you propose to introduce a global env var that will affect all ScaledObjects, or a per-ScaledObject setting? I am fine with both approaches, or we can introduce the per-ScaledObject option later, if there's demand.

arschles · 2021-10-18T16:50:46Z

adapter/main.go

+
+	go func() {
+		if err := mgr.Start(context.Background()); err != nil {
+			panic(err)


an idea here: rather than panic, you could return an errgroup or <-chan error from runScaledObjectController so that the caller can choose what to do with the error in this goroutine?

doubtful this is necessary for this PR, and I'm unsure whether that would be an improvement, holistically, though.
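A sketch of the `<-chan error` variant of this idea, not the change that was eventually made in the PR. The `starter` interface, `startInBackground`, and `failingManager` are hypothetical; `starter` only models the one method this example needs from a manager.

```go
package main

import (
	"context"
	"errors"
	"fmt"
)

// starter models the single method this sketch needs from a manager.
type starter interface {
	Start(ctx context.Context) error
}

// startInBackground runs s.Start in a goroutine and reports a failure on the
// returned channel instead of panicking. The channel is closed when Start
// returns, so a receive that yields ok == false means a clean exit.
func startInBackground(ctx context.Context, s starter) <-chan error {
	errCh := make(chan error, 1) // buffered so the goroutine never blocks on send
	go func() {
		defer close(errCh)
		if err := s.Start(ctx); err != nil {
			errCh <- err
		}
	}()
	return errCh
}

// failingManager is a fake whose Start always fails.
type failingManager struct{}

func (failingManager) Start(context.Context) error {
	return errors.New("manager failed to start")
}

func main() {
	errCh := startInBackground(context.Background(), failingManager{})
	if err, ok := <-errCh; ok {
		fmt.Println("caller decides what to do with:", err)
	}
}
```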

It's a good point. The caller takes <-chan struct{} as a stopCh. I changed it to return that in 2381bfa (#2187).

pkg/scaling/cache/scalers_cache.go

arschles · 2021-10-18T22:14:36Z

pkg/scaling/cache/scalers_cache.go

+		if err != nil {
+			scalerLogger.V(1).Info("Error getting scaler.IsActive, but continue", "Error", err)
+			c.recorder.Event(scaledJob, corev1.EventTypeWarning, eventreason.KEDAScalerFailed, err.Error())
+			continue
+		}


I think this should be part of the previous if err != nil block?

we need a chance to refresh the scaler and call IsActive again. so the err checked here is either from L270 or L273

Did you mean L260 or L263? Then the err on L263 is the one defined in the block on L262.

Yes, you are right. Sorry I merged a suggestion without thinking about it too much. This has to update the outer err in case the scaler is still in an error state even after refreshing all the secrets.

ah sorry - I missed the closing } at https://github.com/kedacore/keda/pull/2187/files#diff-e15b5649231cd2fc16ca66df8b66cc1c705c0daf7007b23a9a017bf0948409c4R267. makes sense!
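The scoping issue in this exchange is a general Go pitfall: `:=` inside a block declares a new `err` that shadows the outer one, so the outer `err` never sees the retry's result. The sketch below illustrates the language rule only; `checkShadowed`, `checkFixed`, and `flaky` are hypothetical and not taken from the PR.

```go
package main

import (
	"errors"
	"fmt"
)

// flaky fails until refresh has been called once.
type flaky struct {
	healthy bool
}

func (f *flaky) check() (bool, error) {
	if !f.healthy {
		return false, errors.New("stale credentials")
	}
	return true, nil
}

func (f *flaky) refresh() { f.healthy = true }

// checkShadowed retries after a refresh, but ":=" inside the block declares
// new variables, so the function still returns the first attempt's result.
func checkShadowed(check func() (bool, error), refresh func()) (bool, error) {
	active, err := check()
	if err != nil {
		refresh()
		active, err := check() // new active and err, scoped to this block
		fmt.Println("retry inside block:", active, err)
	}
	return active, err
}

// checkFixed uses "=" so the retry overwrites the outer variables.
func checkFixed(check func() (bool, error), refresh func()) (bool, error) {
	active, err := check()
	if err != nil {
		refresh()
		active, err = check()
	}
	return active, err
}

func main() {
	a := &flaky{}
	active, err := checkShadowed(a.check, a.refresh)
	fmt.Println("shadowed:", active, err)

	b := &flaky{}
	active, err = checkFixed(b.check, b.refresh)
	fmt.Println("fixed:", active, err)
}
```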

pkg/scaling/cache/scalers_cache.go

pkg/scaling/cache/scalers_cache_test.go

arschles · 2021-10-18T22:21:14Z

pkg/scaling/scale_handler.go

zroubalik · 2021-10-21T08:44:16Z

controllers/keda/metrics_adapter_controller.go

+		Complete(r)
+}
+
+type MetricsScaledJobReconciler struct {


Job related Metrics Server Reconciler is not needed ;)

zroubalik · 2021-10-21T08:56:03Z

pkg/scaling/cache/scalers_cache.go

arschles

this looks great @ahmelsayed. I left a few comments, but nothing seems blocking to me.

pkg/scaling/cache/scalers_cache.go

arschles · 2021-10-26T15:46:54Z

pkg/scaling/cache/scalers_cache.go

zroubalik · 2021-10-26T16:24:25Z

@ahmelsayed FYI, we just merged this #2202 for context propagation so you'll need a rebase.

Would be nice if you could at least partially tackle the other item (2.) from the context propagation issue #2190 in this PR.

arschles · 2021-10-26T18:00:59Z

@ahmelsayed FYI, we just merged this #2202 for context propagation so you'll need a rebase.

@ahmelsayed this was my doing, sorry about all the conflicts. DM me if you'd like and I can help with the resolution.

Would be nice if you could at least partially tackle the other item (2.) from the context propagation issue #2190 in this PR.

@zroubalik to avoid expanding the scope of this PR, I am happy to tackle (2) from that issue in a follow-up PR after this is merged.

zroubalik · 2021-11-03T17:53:45Z

@ahmelsayed any update on this please? I'd like to start working on #2156 to have it for the upcoming release. The best thing would be to base it on your code (the controller part in the Metrics Server in particular, etc.), so I don't want to start on this until this is merged, to avoid complex rebases 😄 Thx :)

ahmelsayed · 2021-11-09T08:06:25Z

Sorry for the delay, I just rebased the PR, and I believe all the feedback has been addressed? (the conversation isn't easy to follow in the GitHub UI)

Closes #1121 Signed-off-by: Ahmed ElSayed <ahmels@microsoft.com>

zroubalik · 2021-11-09T11:57:28Z

/run-e2e

zroubalik · 2021-11-09T12:38:25Z

/run-e2e

zroubalik · 2021-11-09T15:01:40Z

/run-e2e

There is an intermittent failure in scalers/azure-queue-trigger-auth.test.ts, the rest seem to be ok 🎉

JorTurFer

LGTM!
Thanks a ton!

zroubalik · 2021-11-09T16:26:27Z

@ahmelsayed could you please update the changelog with this contribution?

arschles

@ahmelsayed sorry I haven't been back to this PR in some time. I left one comment regarding a log line (a nit-pick, really). I'm happy to do it in a follow-up if you think it'd be a good idea. Let me know.

Regardless, LGTM

pkg/scaling/cache/scalers_cache.go

Signed-off-by: Ahmed ElSayed <ahmels@microsoft.com>

ahmelsayed requested a review from zroubalik as a code owner October 12, 2021 12:46

tomkerkhove reviewed Oct 12, 2021

.github/workflows/pr-validation.yml Outdated

ahmelsayed force-pushed the ahmels/1121 branch from 378ae14 to cb0f1bc Compare October 12, 2021 13:46

zroubalik reviewed Oct 12, 2021

adapter/main.go Outdated

zroubalik added this to the v2.5.0 milestone Oct 12, 2021

arschles reviewed Oct 12, 2021

zroubalik changed the title ~~Add ScalersCache to reuse scales unless they need changing~~ Add ScalersCache to reuse scalers unless they need changing Oct 13, 2021

zroubalik mentioned this pull request Oct 13, 2021

Improve context propagation to Scalers #2190

Closed

arschles suggested changes Oct 18, 2021

ahmelsayed force-pushed the ahmels/1121 branch from 6f884ab to 97a55b4 Compare October 21, 2021 03:47

zroubalik reviewed Oct 21, 2021

arschles reviewed Oct 26, 2021

zroubalik mentioned this pull request Nov 2, 2021

AWS Cloudwatch Scaler metrics pulling logic is not optimized #2242

Closed

ahmelsayed force-pushed the ahmels/1121 branch from e9afa1d to 628e9b3 Compare November 9, 2021 08:03

ahmelsayed force-pushed the ahmels/1121 branch 3 times, most recently from fbd69ca to 4fdf72a Compare November 9, 2021 09:38

Add ScalersCache to reuse scales unless they need changing

d79a0bd

Closes #1121 Signed-off-by: Ahmed ElSayed <ahmels@microsoft.com>

ahmelsayed force-pushed the ahmels/1121 branch from 4fdf72a to d79a0bd Compare November 9, 2021 11:35

zroubalik requested a review from JorTurFer November 9, 2021 12:21

JorTurFer approved these changes Nov 9, 2021

arschles approved these changes Nov 9, 2021

pkg/scaling/cache/scalers_cache.go

Update CHANGELOG

68643a6

Signed-off-by: Ahmed ElSayed <ahmels@microsoft.com>

ahmelsayed force-pushed the ahmels/1121 branch from fe9b6b7 to 68643a6 Compare November 9, 2021 20:35

ahmelsayed merged commit 37a4324 into main Nov 9, 2021

ahmelsayed deleted the ahmels/1121 branch November 9, 2021 21:17

arschles mentioned this pull request Nov 9, 2021

Propagating contexts to all remaining scalers #2267

Merged

2 tasks

zroubalik mentioned this pull request Jan 3, 2022

Scalers don't retry if connection is lost #2415

Closed
