Add ScalersCache to reuse scalers unless they need changing #2187
Thanks a ton for this!
I have a few questions and minor nits on import formatting :) I have done a very quick review; I will go through the code more thoroughly later.
pkg/scaling/scale_handler.go
Outdated
    h.lock.RLock()
    if cache, ok := h.scalerCaches[key]; ok {
        h.lock.RUnlock()
        return cache, nil
    }
    h.lock.RUnlock()

    h.lock.Lock()
    defer h.lock.Unlock()
    if cache, ok := h.scalerCaches[key]; ok {
        return cache, nil
    }
Could you please elaborate on this part? I am not sure I get it. Thanks!
lines 169-174 look to be the internal cache check, but @ahmelsayed are lines 178-180 duplicates?
It rechecks the state after acquiring the write lock on line 176. If another goroutine has already created a cache, use it. I can change it to a sync.Map with LoadOrStore, but I was reading https://github.com/golang/go/blob/master/src/sync/map.go#L12-L26 and wasn't sure whether sync.Map is the best option here.
This can be changed if I take a write lock at line 169, but I thought a read lock there would reduce contention. It's not a high-throughput scenario, but that's the idea.
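For readers following the thread, the pattern described here (check under a read lock, then re-check under the write lock) can be sketched roughly as follows. The `handler` and `cache` types and the `builds` counter are hypothetical stand-ins for illustration, not the PR's actual code.

```go
package main

import (
	"fmt"
	"sync"
)

// cache stands in for the per-object scalers cache; its contents do not matter here.
type cache struct{ key string }

// handler is a hypothetical, trimmed-down stand-in for the scale handler.
type handler struct {
	lock         sync.RWMutex
	scalerCaches map[string]*cache
	builds       int // counts how many times a cache was actually built
}

// getCache returns the cache for key, building it at most once.
func (h *handler) getCache(key string) *cache {
	// Fast path: most callers find an existing entry under the read lock.
	h.lock.RLock()
	if c, ok := h.scalerCaches[key]; ok {
		h.lock.RUnlock()
		return c
	}
	h.lock.RUnlock()

	// Slow path: another goroutine may have stored the entry between
	// RUnlock and Lock, so check again before building.
	h.lock.Lock()
	defer h.lock.Unlock()
	if c, ok := h.scalerCaches[key]; ok {
		return c
	}
	h.builds++
	c := &cache{key: key}
	h.scalerCaches[key] = c
	return c
}

func main() {
	h := &handler{scalerCaches: map[string]*cache{}}
	var wg sync.WaitGroup
	for i := 0; i < 50; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			h.getCache("default/my-scaledobject")
		}()
	}
	wg.Wait()
	fmt.Println("builds:", h.builds)
}
```

Without the second check, two goroutines that both miss under the read lock would each build and store a cache, and the first one would be silently replaced.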
Right, makes sense.
got it! thanks, I missed the RUnlock call there. since this is pretty subtle, I think at a minimum it should have a comment explaining that there's a load or store operation going on here. better would be to use an abstraction like (sync.Map).LoadOrStore. I'll leave it up to you, though, since the code will change in a non-trivial way.
at a higher level, though, is there likely to be a lot of contention with this code? I'm asking to get a feel for whether it's worth using that read lock for the initial check.
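A rough sketch of what the `(sync.Map).LoadOrStore` alternative might look like is below. The types and the `Close` hook are assumptions made for illustration. One caveat is that `LoadOrStore` needs the value up front, so a goroutine that loses the race has built a cache it must throw away.

```go
package main

import (
	"fmt"
	"sync"
)

// cache stands in for the per-object scalers cache.
type cache struct{ key string }

// Close is a hypothetical cleanup hook for a cache that lost the race.
func (c *cache) Close() {}

// handler is a hypothetical stand-in that stores caches in a sync.Map.
type handler struct {
	scalerCaches sync.Map // effectively map[string]*cache
}

func (h *handler) getCache(key string) *cache {
	// Cheap lookup first, so the common case builds nothing.
	if v, ok := h.scalerCaches.Load(key); ok {
		return v.(*cache)
	}
	// Two goroutines may both reach this point and build a candidate;
	// LoadOrStore keeps exactly one of them.
	candidate := &cache{key: key}
	actual, loaded := h.scalerCaches.LoadOrStore(key, candidate)
	if loaded {
		candidate.Close() // someone else won; discard our copy
	}
	return actual.(*cache)
}

func main() {
	h := &handler{}
	a := h.getCache("default/my-scaledobject")
	b := h.getCache("default/my-scaledobject")
	fmt.Println(a == b)
}
```

If building a cache is expensive (for example, it opens connections), the discarded candidate is a real cost, which is one reason the explicit lock may still be preferable.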
I don't expect much contention here. The code paths I could identify that might cause contention are:
- Multiple ScaledObjects firing their pollingInterval at exactly the same time.
- If concurrent reconciliation is enabled.
- If the metric adapter gets multiple requests for the same metric value at the same time.
@ahmelsayed would it make sense to just acquire a write lock, do the check, and then handle a cache miss then, to simplify this code?
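The simplification suggested here might look something like the sketch below: a single exclusive lock held across both the lookup and the miss handling. The types are hypothetical stand-ins. The trade-off is that every lookup serializes, which seems acceptable if contention is as low as described above.

```go
package main

import (
	"fmt"
	"sync"
)

// cache stands in for the per-object scalers cache.
type cache struct{ key string }

// handler is a hypothetical stand-in using a plain mutex.
type handler struct {
	lock         sync.Mutex
	scalerCaches map[string]*cache
}

// getCache does the lookup and the miss handling under one lock,
// so no second check is needed.
func (h *handler) getCache(key string) *cache {
	h.lock.Lock()
	defer h.lock.Unlock()
	if c, ok := h.scalerCaches[key]; ok {
		return c
	}
	c := &cache{key: key}
	h.scalerCaches[key] = c
	return c
}

func main() {
	h := &handler{scalerCaches: map[string]*cache{}}
	fmt.Println(h.getCache("a") == h.getCache("a"))
}
```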
Looks good @ahmelsayed - I left a few comments, mostly nits/ideas. One general comment: it would be good to add some Godoc comments, particularly on the ScalersCache
pkg/scaling/cache/scalers_cache.go
Outdated
    func NewScalerCache(scalers []scalers.Scaler, factories []func() (scalers.Scaler, error), logger logr.Logger, recorder record.EventRecorder) (*ScalersCache, error) {
        if len(scalers) != len(factories) {
            return nil, fmt.Errorf("scalers and factories must match")
        }
an idea - make the below scalerBuilder public and take in a []ScalerBuilder parameter instead of the scalers and factories ones. you wouldn't need to do this error check and callers wouldn't need to know to ensure len(scalers) == len(factories). WDYT?
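One way to picture this suggestion is a constructor that takes a single slice of scaler/factory pairs. The `Scaler` interface, the field names, the `refresh` method, and `noopScaler` below are all illustrative assumptions, not the types in the PR.

```go
package main

import (
	"errors"
	"fmt"
)

// Scaler is a minimal stand-in for the real scaler interface.
type Scaler interface {
	Close() error
}

// ScalerBuilder keeps a live scaler next to the factory that can rebuild it,
// so the two can never get out of step.
type ScalerBuilder struct {
	Scaler  Scaler
	Factory func() (Scaler, error)
}

// ScalersCache here is a trimmed-down illustration, not the PR's type.
type ScalersCache struct {
	builders []ScalerBuilder
}

// NewScalersCache needs no length check: each scaler travels with its factory.
func NewScalersCache(builders []ScalerBuilder) *ScalersCache {
	return &ScalersCache{builders: builders}
}

// refresh rebuilds the scaler at index i using its own factory.
func (c *ScalersCache) refresh(i int) error {
	if i < 0 || i >= len(c.builders) {
		return errors.New("scaler index out of range")
	}
	fresh, err := c.builders[i].Factory()
	if err != nil {
		return err
	}
	_ = c.builders[i].Scaler.Close() // best-effort cleanup of the old scaler
	c.builders[i].Scaler = fresh
	return nil
}

// noopScaler is a placeholder implementation used only by this example.
type noopScaler struct{ id int }

func (noopScaler) Close() error { return nil }

func main() {
	c := NewScalersCache([]ScalerBuilder{{
		Scaler:  noopScaler{id: 1},
		Factory: func() (Scaler, error) { return noopScaler{id: 2}, nil },
	}})
	if err := c.refresh(0); err != nil {
		fmt.Println("refresh failed:", err)
		return
	}
	fmt.Println("scaler id after refresh:", c.builders[0].Scaler.(noopScaler).id)
}
```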
This PR modifies the behavior related to the life cycle of the scalers: basically, we are now going to keep them alive until a change requires an update, right?
That's correct. The behavior/contract shouldn't change at all, though, so I wasn't sure if anything in particular needs to be documented. Behavior before:
if
Behavior after:
if
So from a user/scaler author perspective, there is no change. If a scaler today has a long-lived connection and implements
@arschles @JorTurFer @tomkerkhove @zroubalik do you think this behavior should be globally configurable (or configurable per ScaledObject)? The main scenario I can think of is someone having a large number of ScaledObjects that don't need to be checked frequently. For example, 100 MySQL scalers that need to be checked once an hour. Before, this would create a short-lived SQL connection once an hour. After, there will always be 100 open connections to the MySQL server.
what behavior specifically? whether a scaler applied to a |
Based on this use case, maybe we should support configuring it per trigger. I mean, inside the same ScaledObject we can have several triggers, some of them long-lived and others short-lived 🤔
That's true, and if we support enabling and disabling the cache, the behavior is exactly the same; you are right. (In my previous comment I was thinking about the internal objects' lifecycle and the requirement of treating them as ephemeral objects, but without the cache they are ephemeral, so the behavior is exactly the same.)
Probably? Don't have a strong opinion on this.
If I am not mistaken, HPA is asking Metrics Server for particular metrics periodically, therefore the MySQL server metrics will be scraped anyway. We would have to cache the metric value as well in order to do the actual check once an hour 🤔 Btw I had been thinking about this feature some time before, current
I agree we should document this for scaler developers. I think a description similar to the one above is enough.
That's a good point @zroubalik. I'll just add a feature env var that will invalidate the cache right after every check for now (effectively closing the scaler to have the same old behavior) in case the new behavior impacts someone. How does that sound?
So do you propose to introduce a global env var that will affect all ScaledObjects, or a per-ScaledObject setting? I am fine with both approaches, or we can introduce the per-ScaledObject option later if there's demand.
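As a rough illustration of the global env var idea, a handler could read a flag once at startup and drop the cache entry after each check when it is set. The variable name `EXAMPLE_DISABLE_SCALERS_CACHE`, the `handler` type, and `afterCheck` are all made up for this sketch; the thread does not say what the real setting is called.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
)

// cacheDisabledEnv is a hypothetical variable name chosen for this sketch only.
const cacheDisabledEnv = "EXAMPLE_DISABLE_SCALERS_CACHE"

// handler is a hypothetical stand-in; the map value is a placeholder for a real cache.
type handler struct {
	scalerCaches map[string]string
	disableCache bool
}

// newHandler takes the lookup function as a parameter so it can be swapped in tests.
func newHandler(getenv func(string) string) *handler {
	// Unset or unparsable values fall back to false, so caching stays on.
	disabled, _ := strconv.ParseBool(getenv(cacheDisabledEnv))
	return &handler{scalerCaches: map[string]string{}, disableCache: disabled}
}

// afterCheck runs once a polling check has finished.
func (h *handler) afterCheck(key string) {
	if h.disableCache {
		// Real code would also close the cached scalers before dropping them.
		delete(h.scalerCaches, key)
	}
}

func main() {
	h := newHandler(os.Getenv)
	h.scalerCaches["default/my-scaledobject"] = "cached"
	h.afterCheck("default/my-scaledobject")
	fmt.Println("entries kept after check:", len(h.scalerCaches))
}
```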
adapter/main.go
Outdated

    go func() {
        if err := mgr.Start(context.Background()); err != nil {
            panic(err)
an idea here: rather than panic, you could return an errgroup or <-chan error from runScaledObjectController so that the caller can choose what to do with the error in this goroutine?
doubtful this is necessary for this PR, and I'm unsure whether that would be an improvement, holistically, though.
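A minimal sketch of the `<-chan error` variant might look like this. The `starter` interface, `runInBackground`, `failing`, and `demo` are hypothetical names used only to keep the example self-contained; they are not the adapter's actual functions.

```go
package main

import (
	"context"
	"errors"
	"fmt"
)

// starter is a hypothetical stand-in for a controller manager with a blocking Start.
type starter interface {
	Start(ctx context.Context) error
}

// runInBackground starts s in a goroutine and reports a failure on the
// returned channel instead of panicking. The channel is closed when Start returns.
func runInBackground(ctx context.Context, s starter) <-chan error {
	errCh := make(chan error, 1) // buffered so the goroutine never blocks on send
	go func() {
		defer close(errCh)
		if err := s.Start(ctx); err != nil {
			errCh <- err
		}
	}()
	return errCh
}

// failing simulates a manager whose Start returns an error immediately.
type failing struct{}

func (failing) Start(context.Context) error {
	return errors.New("manager failed to start")
}

// demo returns whatever the background goroutine reported (nil on clean exit).
func demo() error {
	return <-runInBackground(context.Background(), failing{})
}

func main() {
	if err := demo(); err != nil {
		fmt.Println("controller stopped:", err)
	}
}
```

The caller can then decide whether to log, retry, or exit, rather than having the goroutine take the process down.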
It's a good point. The caller takes <-chan struct{} as a stopCh. I changed it to return that in
pkg/scaling/cache/scalers_cache.go
Outdated
    if err != nil {
        scalerLogger.V(1).Info("Error getting scaler.IsActive, but continue", "Error", err)
        c.recorder.Event(scaledJob, corev1.EventTypeWarning, eventreason.KEDAScalerFailed, err.Error())
        continue
    }
I think this should be part of the previous if err != nil block?
we need a chance to refresh the scaler and call IsActive again, so the err checked here is either from L270 or L273
Did you mean L260 or L263? Then the err on L263 is the one defined in the block on L262.
Yes, you are right. Sorry I merged a suggestion without thinking about it too much. This has to update the outer err in case the scaler is still in an error state even after refreshing all the secrets.
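The issue being discussed appears to be Go's variable shadowing: using `:=` inside the retry block declares a new `err` that disappears at the closing brace, so the outer check still sees the first failure. The sketch below reproduces that with a hypothetical `check` function standing in for the refresh-then-`IsActive` call.

```go
package main

import (
	"errors"
	"fmt"
)

var errStale = errors.New("stale credentials")

// check simulates a scaler call that fails until it has been refreshed.
func check(refreshed bool) (bool, error) {
	if !refreshed {
		return false, errStale
	}
	return true, nil
}

// shadowed redeclares err inside the block, so the outer err keeps the first failure.
func shadowed() error {
	_, err := check(false)
	if err != nil {
		_, err := check(true) // := creates a new, block-scoped err
		_ = err
	}
	return err
}

// assigned reuses the outer variables, so the retry result is what callers see.
func assigned() error {
	active, err := check(false)
	if err != nil {
		active, err = check(true) // = updates the outer err
	}
	_ = active
	return err
}

func main() {
	fmt.Println("shadowed:", shadowed())
	fmt.Println("assigned:", assigned())
}
```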
ah sorry - I missed the closing } at https://github.com/kedacore/keda/pull/2187/files#diff-e15b5649231cd2fc16ca66df8b66cc1c705c0daf7007b23a9a017bf0948409c4R267. makes sense!
            Complete(r)
    }

    type MetricsScaledJobReconciler struct {
Job related Metrics Server Reconciler is not needed ;)
yep :)
this looks great @ahmelsayed. I left a few comments, but nothing seems blocking to me.
@ahmelsayed FYI, we just merged #2202 for context propagation, so you'll need a rebase. It would be nice if you could at least partially tackle the other item (2.) from the context-propagation issue #2190 in this PR.
@ahmelsayed this was my doing, sorry about all the conflicts. DM me if you'd like and I can help with the resolution.
@zroubalik to avoid expanding the scope of this PR, I am happy to tackle (2) from that issue in a follow-up PR after this is merged.
@ahmelsayed any update on this please? I'd like to start working on #2156 to have it for the upcoming release. The best thing would be to base it on your code (the controller part in Metric Server in particular, etc), so I don't want to start on this until this is merged, to avoid complex rebases 😄 Thx :)
Sorry for the delay, I just rebased the PR, and I believe all the feedback has been addressed? (The conversation isn't easy to follow in the GitHub UI.)
Closes #1121 Signed-off-by: Ahmed ElSayed <ahmels@microsoft.com>
/run-e2e
There is an intermittent failure in |
LGTM!
Thanks a ton!
@ahmelsayed could you please update the changelog with this contribution?
@ahmelsayed sorry I haven't been back to this PR in some time. I left one comment regarding a log line (a nit-pick, really). I'm happy to do it in a follow-up if you think it'd be a good idea. Let me know.
Regardless, LGTM
Signed-off-by: Ahmed ElSayed <ahmels@microsoft.com>
Closes #1121
Fixes #1121