Set last scale in and out event on autoscaler activity #52

Merged (22 commits) on Nov 22, 2021

Conversation

@gu-kevin (Contributor) commented on Oct 4, 2021

Currently, the last scale-in and scale-out event times are stored in global variables. If the service does not run in the same Lambda container, those values are reset to 0 and the scaler will scale in or out immediately. This change sets the last scale-in and scale-out times from the autoscaler's activity history, so even when the service runs in a new Lambda container it keeps the state of the last time it scaled in or out.
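For reference, here is a minimal sketch of that approach, assuming the aws-sdk-go v1 autoscaling client and matching on the activity description strings discussed later in this thread; the helper name and the exact substrings shown are illustrative, not the PR's code:

```go
package scaler

import (
	"strings"
	"time"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/autoscaling"
)

// lastScalingActivities returns the start times of the most recent scale-out
// and scale-in activities recorded against the ASG, identified by their
// description strings.
func lastScalingActivities(asgName string) (lastScaleOut, lastScaleIn time.Time, err error) {
	svc := autoscaling.New(session.Must(session.NewSession()))

	out, err := svc.DescribeScalingActivities(&autoscaling.DescribeScalingActivitiesInput{
		AutoScalingGroupName: aws.String(asgName),
	})
	if err != nil {
		return time.Time{}, time.Time{}, err
	}

	// Activities come back most-recent first, so the first match of each
	// kind is the latest scaling event of that kind.
	for _, a := range out.Activities {
		desc := aws.StringValue(a.Description)
		switch {
		case lastScaleOut.IsZero() && strings.Contains(desc, "Launching a new EC2 instance"):
			lastScaleOut = aws.TimeValue(a.StartTime)
		case lastScaleIn.IsZero() && strings.Contains(desc, "Terminating EC2 instance"):
			lastScaleIn = aws.TimeValue(a.StartTime)
		}
	}
	return lastScaleOut, lastScaleIn, nil
}
```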

@keithduncan (Contributor) commented on Oct 6, 2021

Thank you for this pull request @gu-kevin 😄

I think it could be interesting to restore the value of the local variables between lambda containers, but would hesitate to rely on it completely in case an API issue causes the scaler to stall with log.Fatal() 🤔

Do you know if there is a more reliable way to detect the scaling events without using the string comparisons? How would this handle a change to those strings?

@gu-kevin (Contributor, Author) commented on Oct 7, 2021

@keithduncan I think relying on the API is more reliable than the container state, because that state can change at any time.

The timestamps of the last scaling events are only available in the autoscaler activity history, and the API does not provide any information besides the description strings that you can filter on.

(Screenshot: autoscaler activity history, 2021-09-29 10:43:30 AM)

@keithduncan (Contributor) left a comment

Thank you for making those changes 😁 I think this is looking good, I’ve got one more change I’d like to make.

Have you measured how long the GetLastScalingInAndOutActivity function might take to run? I think we should bound it with a timeout so that the lambda doesn’t hang waiting for a TCP connection or HTTP response beyond its permitted runtime, preventing any scaling from being initiated.
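A hedged sketch of what bounding the lookup could look like, using context.WithTimeout with the SDK's WithContext call variant; the wrapper name and parameters are illustrative, not the PR's code:

```go
package scaler

import (
	"context"
	"time"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/service/autoscaling"
	"github.com/aws/aws-sdk-go/service/autoscaling/autoscalingiface"
)

// describeActivitiesWithDeadline aborts the activity lookup once the timeout
// expires, so the lambda can fall back to its current behaviour instead of
// hanging on a slow TCP connection or HTTP response.
func describeActivitiesWithDeadline(svc autoscalingiface.AutoScalingAPI, asgName string, timeout time.Duration) (*autoscaling.DescribeScalingActivitiesOutput, error) {
	ctx, cancel := context.WithTimeout(context.Background(), timeout)
	defer cancel()

	// The WithContext variant cancels the in-flight request when ctx expires.
	return svc.DescribeScalingActivitiesWithContext(ctx, &autoscaling.DescribeScalingActivitiesInput{
		AutoScalingGroupName: aws.String(asgName),
	})
}
```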

@gu-kevin (Contributor, Author) commented

> Thank you for making those changes 😁 I think this is looking good, I’ve got one more change I’d like to make.
>
> Have you measured how long the GetLastScalingInAndOutActivity function might take to run? I think we should bound it with a timeout so that the lambda doesn’t hang waiting for a TCP connection or HTTP response beyond its permitted runtime, preventing any scaling from being initiated.

The function increased the maximum duration from 10000 ms to 60000 ms. For the timeout, I am thinking it should default to a minute and be customizable.

@keithduncan (Contributor) commented

> The function increased the maximum duration from 10000 ms to 60000 ms.

Hmm that’s a lot longer than I had expected 🤔

The CloudFormation / Serverless Application template for this function currently schedules the lambda every 60s and there’s a function timeout of 120s. Taking up to 60s to discover the previous scaling events on boot would introduce significant latency before once again updating the desired count.

What do you think about enabling this functionality only if SCALE_OUT_COOLDOWN_PERIOD or SCALE_IN_COOLDOWN_PERIOD are provided, and then setting the deadline for discovering the previous events to the larger of those two? That way the function will wait at most the configured cooldown period, and if that deadline expires before we discover the true last scale events it doesn’t matter, because the cooldown has definitely been respected.
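A sketch of that gating logic, assuming the cooldowns arrive via the environment as Go duration strings; the helper name and the parsing are assumptions, only the env var names come from this thread:

```go
package scaler

import (
	"os"
	"time"
)

// activityDiscoveryDeadline reports whether the previous scaling events
// should be restored at all and, if so, how long to wait: the larger of the
// two configured cooldown periods.
func activityDiscoveryDeadline() (time.Duration, bool) {
	scaleOut, outErr := time.ParseDuration(os.Getenv("SCALE_OUT_COOLDOWN_PERIOD"))
	scaleIn, inErr := time.ParseDuration(os.Getenv("SCALE_IN_COOLDOWN_PERIOD"))

	// Skip the lookup entirely when neither cooldown is configured.
	if outErr != nil && inErr != nil {
		return 0, false
	}

	// Wait at most the larger cooldown: if this deadline expires before the
	// true last scale events are found, the cooldown has already been
	// respected anyway.
	if scaleIn > scaleOut {
		return scaleIn, true
	}
	return scaleOut, true
}
```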

@gu-kevin (Contributor, Author) commented on Oct 15, 2021

> > The function increased the maximum duration from 10000 ms to 60000 ms.
>
> Hmm that’s a lot longer than I had expected 🤔
>
> The CloudFormation / Serverless Application template for this function currently schedules the lambda every 60s and there’s a function timeout of 120s. Taking up to 60s to discover the previous scaling events on boot would introduce significant latency before once again updating the desired count.
>
> What do you think about enabling this functionality only if SCALE_OUT_COOLDOWN_PERIOD or SCALE_IN_COOLDOWN_PERIOD are provided, and then setting the deadline for discovering the previous events to the larger of those two? That way the function will wait at most the configured cooldown period, and if that deadline expires before we discover the true last scale events it doesn’t matter, because the cooldown has definitely been respected.

Makes sense to get the activity only if the scale in or out cooldown period is set. Currently, if the scale in or out cooldown period is not set, the lambda will fail. However, setting the deadline to the scale in or out cooldown period won't work if the cooldown period is set longer than the lambda timeout of 2 minutes.

@keithduncan (Contributor) commented

> Makes sense to get the activity only if the scale in or out cooldown period is set. Currently, if the scale in or out cooldown period is not set, the lambda will fail.

Scale in and out cooldown do look to be optional based on my reading.

> However, setting the deadline to the scale in or out cooldown period won't work if the cooldown period is set longer than the lambda timeout of 2 minutes.

Great point that the cooldown could be longer than the lifetime of the lambda. I think the behaviour we see today should be preserved; that is, the lambda presently takes at least one scaling decision per boot even if it then goes into cooldown for the remainder of the function invocation. While this may result in scaling more often than the configured cooldown, that is preferable to taking no scaling decisions. We need to avoid a "live lock" of sorts here, where the lambda takes no scaling decisions in an attempt to honour the cooldown, but in doing so far exceeds the cooldown between scaling decisions.

What do you think of capping the wait time on the ASG scale activity discovery at 10s? If we get a true-positive recollection of past scaling decisions within that deadline, great; otherwise the lambda would proceed as it does currently.

@gu-kevin (Contributor, Author) commented

> > Makes sense to get the activity only if the scale in or out cooldown period is set. Currently, if the scale in or out cooldown period is not set, the lambda will fail.
>
> Scale in and out cooldown do look to be optional based on my reading.

My mistake, you are correct.

> > However, setting the deadline to the scale in or out cooldown period won't work if the cooldown period is set longer than the lambda timeout of 2 minutes.
>
> Great point that the cooldown could be longer than the lifetime of the lambda. I think the behaviour we see today should be preserved; that is, the lambda presently takes at least one scaling decision per boot even if it then goes into cooldown for the remainder of the function invocation. While this may result in scaling more often than the configured cooldown, that is preferable to taking no scaling decisions. We need to avoid a "live lock" of sorts here, where the lambda takes no scaling decisions in an attempt to honour the cooldown, but in doing so far exceeds the cooldown between scaling decisions.
>
> What do you think of capping the wait time on the ASG scale activity discovery at 10s? If we get a true-positive recollection of past scaling decisions within that deadline, great; otherwise the lambda would proceed as it does currently.

I agree that the lambda should not far exceed the cooldown between scaling decisions. We can have a flag to enable this feature. The lambda running in my environment has been consistently taking 60s to 70s, so 10s would not be enough time.

@keithduncan (Contributor) commented

> The lambda running in my environment has been consistently taking 60s to 70s, so 10s would not be enough time.

Ah I see, are you referring to the overall runtime of the lambda function which loops and sleeps between polling the Buildkite API and operating on the results, or the execution duration of the ASG scale activity discovery function?

@gu-kevin (Contributor, Author) commented

> > The lambda running in my environment has been consistently taking 60s to 70s, so 10s would not be enough time.
>
> Ah I see, are you referring to the overall runtime of the lambda function which loops and sleeps between polling the Buildkite API and operating on the results, or the execution duration of the ASG scale activity discovery function?

I am referring to the overall time for the lambda function.

@keithduncan (Contributor) commented

> I am referring to the overall time for the lambda function.

Okay great, that’s the expected execution duration for the lambda function 😄

What do you think about adding some timestamped logs around asg.GetLastScalingInAndOutActivity() so that we have a record of how long that aspect of the function is taking to run?

I think defaulting the ASG scale activity discovery duration (asgActivityTimeoutDuration in your code) to 10s, after which the lambda would fail forward without restoring the scale activities, would be reasonable. Adding some timestamped logs before and after so we have a record of how long this takes, and logging both end cases (success or any of the failure outcomes), will also let us see how this is behaving in practice.
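Something along these lines, as a sketch: the 10s default and the timing logs bracket the discovery call, and both outcomes are logged. The discovery signature and wrapper name are assumed from this thread, not the PR's actual code:

```go
package scaler

import (
	"context"
	"log"
	"time"
)

// Assumed default for the discovery deadline discussed above.
const asgActivityTimeoutDuration = 10 * time.Second

// restoreLastScalingEvents wraps the (assumed) discovery call with timestamped
// logs for both the success and failure outcomes, so CloudWatch shows how
// long the lookup takes in practice.
func restoreLastScalingEvents(discover func(ctx context.Context) (lastScaleOut, lastScaleIn time.Time, err error)) (time.Time, time.Time) {
	ctx, cancel := context.WithTimeout(context.Background(), asgActivityTimeoutDuration)
	defer cancel()

	start := time.Now()
	lastOut, lastIn, err := discover(ctx)
	if err != nil {
		// Fail forward: zero timestamps mean no previous scaling event is
		// known, matching today's cold-start behaviour.
		log.Printf("Failed to restore scaling activity after %s, continuing without it: %v", time.Since(start), err)
		return time.Time{}, time.Time{}
	}
	log.Printf("Restored last scale out %v and last scale in %v in %s", lastOut, lastIn, time.Since(start))
	return lastOut, lastIn
}
```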

@gu-kevin (Contributor, Author) commented on Oct 20, 2021

> > I am referring to the overall time for the lambda function.
>
> Okay great, that’s the expected execution duration for the lambda function 😄
>
> What do you think about adding some timestamped logs around asg.GetLastScalingInAndOutActivity() so that we have a record of how long that aspect of the function is taking to run?
>
> I think defaulting the ASG scale activity discovery duration (asgActivityTimeoutDuration in your code) to 10s, after which the lambda would fail forward without restoring the scale activities, would be reasonable. Adding some timestamped logs before and after so we have a record of how long this takes, and logging both end cases (success or any of the failure outcomes), will also let us see how this is behaving in practice.

@keithduncan Added; it is taking around 200-500 ms.

@gu-kevin (Contributor, Author) commented

@keithduncan Do you have any update on when this can be merged? Thanks in advance.

@keithduncan (Contributor) commented

Thanks for making these updates @gu-kevin, we’re currently preparing our next release to the Elastic CI Stack for AWS but I’m afraid I haven’t had the time to test this change for incorporation yet. I’m hoping to get this tested as part of next month’s release 🙏

@gu-kevin (Contributor, Author) commented

> Thanks for making these updates @gu-kevin, we’re currently preparing our next release to the Elastic CI Stack for AWS but I’m afraid I haven’t had the time to test this change for incorporation yet. I’m hoping to get this tested as part of next month’s release 🙏

Thanks for the update.

@keithduncan merged commit 3d72466 into buildkite:master on Nov 22, 2021