Set last scale in and out event on autoscaler activity #52

Merged (22 commits) on Nov 22, 2021

Conversation

@gu-kevin (Contributor) commented on Oct 4, 2021

Currently, the last scale-in and scale-out event times are stored in global variables. If the service does not run in the same Lambda container, those values are reset to 0 and the scaler will scale in or out immediately. This change sets the last scale-in and scale-out times from the autoscaler's activity history, so even when the service runs in a new Lambda container it keeps the state of the last time it scaled in or out.
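For reference, here is a minimal sketch of that approach, assuming the aws-sdk-go v1 autoscaling client and matching on the activity description strings discussed later in this thread; the helper name and the exact substrings shown are illustrative, not the PR's code:

```go
package scaler

import (
	"strings"
	"time"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/autoscaling"
)

// lastScalingActivities returns the start times of the most recent scale-out
// and scale-in activities recorded against the ASG, identified by their
// description strings.
func lastScalingActivities(asgName string) (lastScaleOut, lastScaleIn time.Time, err error) {
	svc := autoscaling.New(session.Must(session.NewSession()))

	out, err := svc.DescribeScalingActivities(&autoscaling.DescribeScalingActivitiesInput{
		AutoScalingGroupName: aws.String(asgName),
	})
	if err != nil {
		return time.Time{}, time.Time{}, err
	}

	// Activities come back most-recent first, so the first match of each
	// kind is the latest scaling event of that kind.
	for _, a := range out.Activities {
		desc := aws.StringValue(a.Description)
		switch {
		case lastScaleOut.IsZero() && strings.Contains(desc, "Launching a new EC2 instance"):
			lastScaleOut = aws.TimeValue(a.StartTime)
		case lastScaleIn.IsZero() && strings.Contains(desc, "Terminating EC2 instance"):
			lastScaleIn = aws.TimeValue(a.StartTime)
		}
	}
	return lastScaleOut, lastScaleIn, nil
}
```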

@keithduncan (Contributor) commented on Oct 6, 2021

Thank you for this pull request @gu-kevin 😄

I think it could be interesting to restore the value of the local variables between lambda containers, but would hesitate to rely on it completely in case an API issue causes the scaler to stall with log.Fatal() 🤔

Do you know if there is a more reliable way to detect the scaling events without using the string comparisons? How would this handle a change to those strings?

@gu-kevin (Contributor, Author) commented on Oct 7, 2021

@keithduncan I think relying on the API is more reliable than the container state, because that state can change at any time.

The timestamps of the last scaling events are only available in the autoscaler activity history, and the API does not provide any information besides the description strings that you can filter on.

(Screenshot: autoscaler activity history, 2021-09-29 10:43:30 AM)

@keithduncan (Contributor) left a comment

Thank you for making those changes 😁 I think this is looking good, I’ve got one more change I’d like to make.

Have you measured how long the GetLastScalingInAndOutActivity function might take to run? I think we should bound it with a timeout so that the lambda doesn’t hang waiting for a TCP connection or HTTP response beyond its permitted runtime, preventing any scaling from being initiated.
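A hedged sketch of what bounding the lookup could look like, using context.WithTimeout with the SDK's WithContext call variant; the wrapper name and parameters are illustrative, not the PR's code:

```go
package scaler

import (
	"context"
	"time"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/service/autoscaling"
	"github.com/aws/aws-sdk-go/service/autoscaling/autoscalingiface"
)

// describeActivitiesWithDeadline aborts the activity lookup once the timeout
// expires, so the lambda can fall back to its current behaviour instead of
// hanging on a slow TCP connection or HTTP response.
func describeActivitiesWithDeadline(svc autoscalingiface.AutoScalingAPI, asgName string, timeout time.Duration) (*autoscaling.DescribeScalingActivitiesOutput, error) {
	ctx, cancel := context.WithTimeout(context.Background(), timeout)
	defer cancel()

	// The WithContext variant cancels the in-flight request when ctx expires.
	return svc.DescribeScalingActivitiesWithContext(ctx, &autoscaling.DescribeScalingActivitiesInput{
		AutoScalingGroupName: aws.String(asgName),
	})
}
```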

@gu-kevin (Contributor, Author) commented

> Thank you for making those changes 😁 I think this is looking good, I’ve got one more change I’d like to make.
>
> Have you measured how long the GetLastScalingInAndOutActivity function might take to run? I think we should bound it with a timeout so that the lambda doesn’t hang waiting for a TCP connection or HTTP response beyond its permitted runtime, preventing any scaling from being initiated.

The function increased the maximum duration from 10000 ms to 60000 ms. For the timeout, I am thinking it should default to a minute and be customizable.

@keithduncan (Contributor) commented

> The function increased the maximum duration from 10000 ms to 60000 ms.

Hmm that’s a lot longer than I had expected 🤔

The CloudFormation / Serverless Application template for this function currently schedules the lambda every 60s and there’s a function timeout of 120s. Taking up to 60s to discover the previous scaling events on boot would introduce significant latency before once again updating the desired count.

What do you think about enabling this functionality only if SCALE_OUT_COOLDOWN_PERIOD or SCALE_IN_COOLDOWN_PERIOD are provided, and then setting the deadline for discovering the previous events to the larger of those two? That way the function will wait at most the configured cooldown period, and if that deadline expires before we discover the true last scale events it doesn’t matter, because the cooldown has definitely been respected.
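A sketch of that gating logic, assuming the cooldowns arrive via the environment as Go duration strings; the helper name and the parsing are assumptions, only the env var names come from this thread:

```go
package scaler

import (
	"os"
	"time"
)

// activityDiscoveryDeadline reports whether the previous scaling events
// should be restored at all and, if so, how long to wait: the larger of the
// two configured cooldown periods.
func activityDiscoveryDeadline() (time.Duration, bool) {
	scaleOut, outErr := time.ParseDuration(os.Getenv("SCALE_OUT_COOLDOWN_PERIOD"))
	scaleIn, inErr := time.ParseDuration(os.Getenv("SCALE_IN_COOLDOWN_PERIOD"))

	// Skip the lookup entirely when neither cooldown is configured.
	if outErr != nil && inErr != nil {
		return 0, false
	}

	// Wait at most the larger cooldown: if this deadline expires before the
	// true last scale events are found, the cooldown has already been
	// respected anyway.
	if scaleIn > scaleOut {
		return scaleIn, true
	}
	return scaleOut, true
}
```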

@gu-kevin (Contributor, Author) commented on Oct 15, 2021

> > The function increased the maximum duration from 10000 ms to 60000 ms.
>
> Hmm that’s a lot longer than I had expected 🤔
>
> The CloudFormation / Serverless Application template for this function currently schedules the lambda every 60s and there’s a function timeout of 120s. Taking up to 60s to discover the previous scaling events on boot would introduce significant latency before once again updating the desired count.
>
> What do you think about enabling this functionality only if SCALE_OUT_COOLDOWN_PERIOD or SCALE_IN_COOLDOWN_PERIOD are provided, and then setting the deadline for discovering the previous events to the larger of those two? That way the function will wait at most the configured cooldown period, and if that deadline expires before we discover the true last scale events it doesn’t matter, because the cooldown has definitely been respected.

Makes sense to get the activity only if the scale in or out cooldown period is set. Currently, if the scale in or out cooldown period is not set, the lambda will fail. However, setting the deadline to the scale in or out cooldown period won't work if the cooldown period is set longer than the lambda timeout of 2 minutes.

@keithduncan (Contributor) commented

> Makes sense to get the activity only if the scale in or out cooldown period is set. Currently, if the scale in or out cooldown period is not set, the lambda will fail.

Scale in and out cooldown do look to be optional based on my reading.

> However, setting the deadline to the scale in or out cooldown period won't work if the cooldown period is set longer than the lambda timeout of 2 minutes.

Great point that the cooldown could be longer than the lifetime of the lambda. I think the behaviour we see today should be preserved; that is, the lambda presently takes at least one scaling decision per boot even if it then goes into cooldown for the remainder of the function invocation. While this may result in scaling more often than the configured cooldown, that is preferable to taking no scaling decisions. We need to avoid a "live lock" of sorts here, where the lambda takes no scaling decisions in an attempt to honour the cooldown, but in doing so far exceeds the cooldown between scaling decisions.

What do you think of capping the wait time on the ASG scale activity discovery at 10s? If we get a true-positive recollection of past scaling decisions within that deadline, great; otherwise the lambda would proceed as it does currently.

@gu-kevin (Contributor, Author) commented

> > Makes sense to get the activity only if the scale in or out cooldown period is set. Currently, if the scale in or out cooldown period is not set, the lambda will fail.
>
> Scale in and out cooldown do look to be optional based on my reading.

My mistake, you are correct.

> > However, setting the deadline to the scale in or out cooldown period won't work if the cooldown period is set longer than the lambda timeout of 2 minutes.
>
> Great point that the cooldown could be longer than the lifetime of the lambda. I think the behaviour we see today should be preserved; that is, the lambda presently takes at least one scaling decision per boot even if it then goes into cooldown for the remainder of the function invocation. While this may result in scaling more often than the configured cooldown, that is preferable to taking no scaling decisions. We need to avoid a "live lock" of sorts here, where the lambda takes no scaling decisions in an attempt to honour the cooldown, but in doing so far exceeds the cooldown between scaling decisions.
>
> What do you think of capping the wait time on the ASG scale activity discovery at 10s? If we get a true-positive recollection of past scaling decisions within that deadline, great; otherwise the lambda would proceed as it does currently.

I agree that the lambda should not far exceed the cooldown between scaling decisions. We can have a flag to enable this feature. The lambda running in my environment has been consistently taking 60s to 70s, so 10s would not be enough time.

@keithduncan (Contributor) commented

> The lambda running in my environment has been consistently taking 60s to 70s, so 10s would not be enough time.

Ah I see, are you referring to the overall runtime of the lambda function which loops and sleeps between polling the Buildkite API and operating on the results, or the execution duration of the ASG scale activity discovery function?

@gu-kevin (Contributor, Author) commented

> > The lambda running in my environment has been consistently taking 60s to 70s, so 10s would not be enough time.
>
> Ah I see, are you referring to the overall runtime of the lambda function which loops and sleeps between polling the Buildkite API and operating on the results, or the execution duration of the ASG scale activity discovery function?

I am referring to the overall time for the lambda function.

@keithduncan (Contributor) commented

> I am referring to the overall time for the lambda function.

Okay great, that’s the expected execution duration for the lambda function 😄

What do you think about adding some timestamped logs around asg.GetLastScalingInAndOutActivity() so that we have a record of how long that aspect of the function is taking to run?

I think defaulting the ASG scale activity discovery duration (asgActivityTimeoutDuration in your code) to 10s, after which the lambda would fail forward without restoring the scale activities, would be reasonable. Adding some timestamped logs before and after so we have a record of how long this takes, and logging both end cases (success or any of the failure outcomes), will also let us see how this is behaving in practice.
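Something along these lines, as a sketch: the 10s default and the timing logs bracket the discovery call, and both outcomes are logged. The discovery signature and wrapper name are assumed from this thread, not the PR's actual code:

```go
package scaler

import (
	"context"
	"log"
	"time"
)

// Assumed default for the discovery deadline discussed above.
const asgActivityTimeoutDuration = 10 * time.Second

// restoreLastScalingEvents wraps the (assumed) discovery call with timestamped
// logs for both the success and failure outcomes, so CloudWatch shows how
// long the lookup takes in practice.
func restoreLastScalingEvents(discover func(ctx context.Context) (lastScaleOut, lastScaleIn time.Time, err error)) (time.Time, time.Time) {
	ctx, cancel := context.WithTimeout(context.Background(), asgActivityTimeoutDuration)
	defer cancel()

	start := time.Now()
	lastOut, lastIn, err := discover(ctx)
	if err != nil {
		// Fail forward: zero timestamps mean no previous scaling event is
		// known, matching today's cold-start behaviour.
		log.Printf("Failed to restore scaling activity after %s, continuing without it: %v", time.Since(start), err)
		return time.Time{}, time.Time{}
	}
	log.Printf("Restored last scale out %v and last scale in %v in %s", lastOut, lastIn, time.Since(start))
	return lastOut, lastIn
}
```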

@gu-kevin (Contributor, Author) commented on Oct 20, 2021

> > I am referring to the overall time for the lambda function.
>
> Okay great, that’s the expected execution duration for the lambda function 😄
>
> What do you think about adding some timestamped logs around asg.GetLastScalingInAndOutActivity() so that we have a record of how long that aspect of the function is taking to run?
>
> I think defaulting the ASG scale activity discovery duration (asgActivityTimeoutDuration in your code) to 10s, after which the lambda would fail forward without restoring the scale activities, would be reasonable. Adding some timestamped logs before and after so we have a record of how long this takes, and logging both end cases (success or any of the failure outcomes), will also let us see how this is behaving in practice.

@keithduncan Added; it is taking around 200-500 ms.

@gu-kevin (Contributor, Author) commented

@keithduncan Do you have any update on when this can be merged? Thanks in advance.

@keithduncan (Contributor) commented

Thanks for making these updates @gu-kevin, we’re currently preparing our next release to the Elastic CI Stack for AWS but I’m afraid I haven’t had the time to test this change for incorporation yet. I’m hoping to get this tested as part of next month’s release 🙏

@gu-kevin (Contributor, Author) commented

> Thanks for making these updates @gu-kevin, we’re currently preparing our next release to the Elastic CI Stack for AWS but I’m afraid I haven’t had the time to test this change for incorporation yet. I’m hoping to get this tested as part of next month’s release 🙏

Thanks for the update.

@keithduncan merged commit 3d72466 into buildkite:master on Nov 22, 2021