Set last scale in and out event on autoscaler activity #52
Conversation
Thank you for this pull request @gu-kevin 😄 I think it could be interesting to restore the value of the local variables between lambda containers, but I would hesitate to rely on it completely in case an API issue causes the scaler to stall. Do you know if there is a more reliable way to detect the scaling events without using string comparisons? How would this handle a change to those strings? |
@keithduncan I think relying on the API is more reliable than the container, because that state can change at any time. The timestamps of the last scaling events are only available in the autoscaler activity, and the API does not provide any information besides the description strings that you can filter on. |
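A minimal sketch of the approach being discussed, assuming aws-sdk-go v1: read the Auto Scaling group's recent activity and match description substrings to recover the last scale-out and scale-in times. The group name `my-agent-asg` and the matched substrings are illustrative; the PR's actual matching logic may differ.

```go
package main

import (
	"fmt"
	"strings"
	"time"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/autoscaling"
)

// lastScaleEvents returns the most recent scale-out and scale-in activity
// start times for the given Auto Scaling group, or zero times if none found.
func lastScaleEvents(asgName string) (lastOut, lastIn time.Time, err error) {
	svc := autoscaling.New(session.Must(session.NewSession()))

	out, err := svc.DescribeScalingActivities(&autoscaling.DescribeScalingActivitiesInput{
		AutoScalingGroupName: aws.String(asgName),
	})
	if err != nil {
		return lastOut, lastIn, err
	}

	for _, activity := range out.Activities {
		desc := aws.StringValue(activity.Description)
		start := aws.TimeValue(activity.StartTime)

		// Keep the most recent match of each kind, independent of ordering.
		switch {
		case strings.Contains(desc, "Launching a new EC2 instance") && start.After(lastOut):
			lastOut = start
		case strings.Contains(desc, "Terminating EC2 instance") && start.After(lastIn):
			lastIn = start
		}
	}
	return lastOut, lastIn, nil
}

func main() {
	out, in, err := lastScaleEvents("my-agent-asg")
	if err != nil {
		fmt.Println("describe scaling activities:", err)
		return
	}
	fmt.Println("last scale out:", out, "last scale in:", in)
}
```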
Thank you for making those changes 😁 I think this is looking good, I’ve got one more change I’d like to make.
Have you measured how long the GetLastScalingInAndOutActivity function might take to run? I think we should bound it with a timeout so that the lambda doesn't hang waiting for a TCP connection or HTTP response beyond its permitted runtime, which would prevent any scaling from being initiated.
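One way to bound the call, sketched here under the assumption that the lookup uses aws-sdk-go v1 and its `*WithContext` variants: wrap the activity lookup in a context deadline so a slow TCP connection or HTTP response cannot consume the lambda's runtime. The function and group names are placeholders, not the PR's actual identifiers.

```go
package main

import (
	"context"
	"log"
	"time"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/autoscaling"
)

// describeActivitiesWithDeadline bounds the activity lookup with a timeout.
func describeActivitiesWithDeadline(asgName string, timeout time.Duration) (*autoscaling.DescribeScalingActivitiesOutput, error) {
	ctx, cancel := context.WithTimeout(context.Background(), timeout)
	defer cancel()

	svc := autoscaling.New(session.Must(session.NewSession()))

	// The WithContext variant returns early when the deadline expires,
	// instead of blocking the invocation on a slow connection or response.
	return svc.DescribeScalingActivitiesWithContext(ctx, &autoscaling.DescribeScalingActivitiesInput{
		AutoScalingGroupName: aws.String(asgName),
	})
}

func main() {
	if _, err := describeActivitiesWithDeadline("my-agent-asg", 10*time.Second); err != nil {
		log.Printf("scaling activity lookup failed: %v", err)
	}
}
```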
Adding the function increased the maximum duration from 10000 ms to 60000 ms. For the timeout, I am thinking it should default to a minute and be customizable. |
Hmm, that’s a lot longer than I had expected 🤔 The CloudFormation / Serverless Application template for this function currently schedules the lambda every 60s, and there’s a function timeout of 120s. Taking up to 60s to discover the previous scaling events on boot would introduce significant latency before once again updating the desired count. What do you think about enabling this functionality only if a scale in or out cooldown period is set? |
Makes sense to get the activity only if a scale in or out cooldown period is set. Currently, if the scale in or out cooldown period is not set, the lambda will fail. However, setting the deadline to the scale in or out cooldown period won't work if the cooldown period is set longer than the lambda timeout of 2 minutes. |
Scale in and out cooldown do look to be optional based on my reading.
Great point that the cooldown could be longer than the lifetime of the lambda. I think the behaviour we see today should be preserved, that is, the lambda presently takes at least one scaling decision per boot even if it then goes into cooldown for the remainder of the function invocation. While this may result in scaling more often than the configured cooldown, that is preferable to taking no scaling decisions. We need to avoid a "live lock" of sorts here, where the lambda takes no scaling decisions in an attempt to honour the cooldown, but in doing so far exceeds the cooldown between scaling decisions. What do you think of capping the wait time on the ASG scale activity discovery at 10s? If we get a true-positive recollection of past scaling decisions within that deadline, great; otherwise the lambda would proceed as it does currently. |
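A sketch of the fallback behaviour proposed above, using a hypothetical `getLastScaleEvents` helper as a stand-in for the PR's activity lookup: cap the discovery at 10s and, if it fails or times out, carry on with zero values so the lambda still takes at least one scaling decision per boot rather than live-locking on cooldown state it could not recover.

```go
package main

import (
	"context"
	"log"
	"time"
)

// getLastScaleEvents is a stand-in for the PR's activity lookup.
func getLastScaleEvents(ctx context.Context) (time.Time, time.Time, error) {
	// ... call DescribeScalingActivities with ctx here ...
	return time.Time{}, time.Time{}, context.DeadlineExceeded
}

// recoverCooldownState tries to recall past scaling events within 10s,
// falling back to zero values (today's behaviour) on any failure.
func recoverCooldownState() (lastScaleOut, lastScaleIn time.Time) {
	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()

	out, in, err := getLastScaleEvents(ctx)
	if err != nil {
		log.Printf("could not recover previous scaling events, proceeding without cooldown state: %v", err)
		return time.Time{}, time.Time{}
	}
	return out, in
}

func main() {
	out, in := recoverCooldownState()
	log.Printf("last scale out: %v, last scale in: %v", out, in)
}
```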
My mistake, you are correct.
I agree that the lambda should not far exceed the cooldown between scaling decisions. We can add a flag to enable this feature. The lambda running in my environment has been consistently taking 60s - 70s, so 10s would not be enough time. |
Ah I see, are you referring to the overall runtime of the lambda function which loops and sleeps between polling the Buildkite API and operating on the results, or the execution duration of the ASG scale activity discovery function? |
I am referring to the overall time for the lambda function. |
Okay great, that’s the expected execution duration for the lambda function 😄 What do you think about adding some timestamped logs around the ASG scale activity discovery so we can see how long it actually takes? I think defaulting the ASG scale activity discovery duration to 10s would be reasonable. |
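A minimal sketch of the timestamped logging suggested here, with `discover` as a hypothetical stand-in for GetLastScalingInAndOutActivity: record the wall-clock duration of the discovery step so its cost is visible in the lambda logs for each invocation.

```go
package main

import (
	"log"
	"time"
)

// discover is a stand-in for the PR's GetLastScalingInAndOutActivity.
func discover() (time.Time, time.Time, error) {
	return time.Time{}, time.Time{}, nil
}

func main() {
	start := time.Now()
	lastOut, lastIn, err := discover()
	// The elapsed duration shows how much of the invocation the lookup costs.
	log.Printf("scaling activity discovery took %s (out=%v in=%v err=%v)",
		time.Since(start), lastOut, lastIn, err)
}
```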
@keithduncan Added, it is taking around 200 - 500 ms. |
@keithduncan do you have any update on when this can be merged? Thanks in advance. |
Thanks for making these updates @gu-kevin, we’re currently preparing our next release to the Elastic CI Stack for AWS but I’m afraid I haven’t had the time to test this change for incorporation yet. I’m hoping to get this tested as part of next month’s release 🙏 |
thanks for the update |
Currently, the scale in and out event time is set and stored as a global variable. However, if the service does not run in the same container, the event times are reset to 0 and the autoscaler will scale in and out immediately. This change sets the last scale in and out event based on the autoscaler's activity events, so if the service does not run on the same lambda container, it keeps the state of the last time it scaled in or out.
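A minimal sketch of the cold-start behaviour described above, using hypothetical names: `lastScaleOut`/`lastScaleIn` live at package scope, so they survive warm invocations of the same container but are zero after a cold start, in which case they are seeded from the ASG activity history instead of allowing an immediate scale in or out.

```go
package main

import (
	"log"
	"time"
)

var (
	lastScaleOut time.Time // zero after a cold start
	lastScaleIn  time.Time
)

// lastScaleEventsFromASG is a stand-in for the activity lookup added in this PR.
func lastScaleEventsFromASG() (time.Time, time.Time, error) {
	return time.Time{}, time.Time{}, nil
}

// ensureCooldownState seeds the package-level state on a cold start so the
// scaler honours the cooldown instead of scaling immediately.
func ensureCooldownState() {
	// Warm container: the package-level state already holds the last events.
	if !lastScaleOut.IsZero() || !lastScaleIn.IsZero() {
		return
	}
	out, in, err := lastScaleEventsFromASG()
	if err != nil {
		log.Printf("could not seed cooldown state: %v", err)
		return
	}
	lastScaleOut, lastScaleIn = out, in
}

func main() {
	ensureCooldownState()
	log.Printf("last scale out: %v, last scale in: %v", lastScaleOut, lastScaleIn)
}
```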