Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Limit pagination of autoscaling:DescribeScalingActivity #81

Merged
merged 7 commits into from
May 16, 2023

Conversation

triarius
Copy link
Contributor

@triarius triarius commented Mar 31, 2023

This fixes a variety of issues with scaler, the main one being that, when scale in was disabled on the elastic ci stack (which it is by default), the search for the last scale in activity would alway go through the entire history of scaling events. As this is paginated, and stores 6 weeks of activity, many customers were experiencing rate limiting of of their autoscaling API calls. This rendered the elastic ci stack unable to scale out in some cases.

Until @YanivAssaf-at pointed this out, my main attempt at a solution was to impose an arbitrary limit on the pagination of the DescribeAutoScalingActivity API. I still think this could be useful, so I have kept it in.

@triarius triarius force-pushed the pdp-782-limit-pagination-in-buildkite-agent branch from 715f975 to b3f5159 Compare March 31, 2023 09:28
@triarius triarius force-pushed the pdp-782-limit-pagination-in-buildkite-agent branch from b3f5159 to 492e5d6 Compare March 31, 2023 09:36
@triarius triarius changed the base branch from master to pdp-777-fix-ci-for-buildkite-agent-scaler March 31, 2023 09:37
Base automatically changed from pdp-777-fix-ci-for-buildkite-agent-scaler to master April 2, 2023 08:27
@triarius triarius force-pushed the pdp-782-limit-pagination-in-buildkite-agent branch 5 times, most recently from 81aee9e to c5295e8 Compare April 3, 2023 00:28
scaler/asg.go Outdated
for i := 0; !hasFoundScalingActivities; {
i++
if a.MaxDescribeScalingActivitiesPages >= 0 && i >= a.MaxDescribeScalingActivitiesPages {
return nil, nil, fmt.Errorf("%d exceedes allowed pages for autoscaling:DescribeScalingActivities, %d", i, a.MaxDescribeScalingActivitiesPages)
Copy link

@YanivAssaf-at YanivAssaf-at May 4, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This may have found either lastScalingOutActivity or lastScalingInActivity in one of the previous pages. Can they be returned, even though the maximum number of pages has been reached?

This feels like an expected failure case, rather than an exception (from my past Java days). I'd log the warning and return what I can.

My Go knowledge is basically nonexistent, and my suggestions might be wrong.

Copy link
Contributor Author

@triarius triarius May 7, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good suggestion. The calling code used to fall back to the globally saved values on any error, but I've made it only do that if neither scale in nor scale out is found. Then, depending on what is found, it will use the globally saved value or the newly found one.

@triarius triarius force-pushed the pdp-782-limit-pagination-in-buildkite-agent branch from 8d61d1b to e6797eb Compare May 7, 2023 13:16
While there was a timeout that fires after 10 seconds, as the pagination
was happening in a goroutine, it would continue after the firing of the
timeout is pulled from the timeout channel. After this commit, the
context will be canceled and the next call to DescribeScalingActivities
would error.
…scale out activities

If an error occurs or the paging limit is reached, we may still have found one of the two types of scaling activity we are looking for
@triarius triarius force-pushed the pdp-782-limit-pagination-in-buildkite-agent branch from e6797eb to 41e1801 Compare May 7, 2023 13:28
If any one of them were disabled, then the corresponding scaling events
would not appear in the scaling activities and so the entire scaling
activity history would have to be searched each time.

In the elastic ci stack for aws, this was the case as scale in was
disabled. Many customers have reported that there were excessive calls
and rate limiting of the DescribeScalingActivites endpoint, so this was a real problem.

Also there is no need to check environment variables on each iteration
of the main loop, so we have moved them outside.
@triarius triarius marked this pull request as ready for review May 7, 2023 14:06
@triarius triarius requested review from YanivAssaf-at and a team and removed request for YanivAssaf-at May 7, 2023 14:06
Copy link

@YanivAssaf-at YanivAssaf-at left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you.

log.Printf("Encountered error when retrieving last scaling activities: %s", res.Err)
}

if res.LastScaleOutActivity == nil && res.LastScaleInActivity == nil {

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this break still necessary?

Copy link
Contributor Author

@triarius triarius May 9, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think so. Previously we would break for an error, but now we don't. Now, an error means that at most one of res.LastScaleOutActivity != nil or res.LastScaleInActivity != nil. Whereas as previously it meant that both were nil. If both of these are nil, we don't want to continue and print Succesfully retrieved last scaling activity events. Last scale out %v, last scale in %v. Discovery took %s., so we exit the select statement the same way that an error did before.

What I expect will happen is that the values stored globally in scaleInOutput and scaleOutOutput are used instead. And that's what would have happened if there was an error before.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suppose it's worth logging this expectation, though. I'm not sure that storing scaleInOutput and scaleInOutput globally actually does anything, but it seems to be the design that if describe scaling activities fails or times out, falling back to the global values is meaningful.

Copy link
Contributor

@moskyb moskyb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i have one comment about the default number of pages, but other than that, this all LGTM. awesome work @triarius

MaxDescribeScalingActivitiesPages:
Type: Number
Description: The number of pages to retrive for DescribeScalingActivity. Negative numbers mean unlimited.
Default: "-1"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should we default this to something low-but-reasonable? if the goal of this work is to consume AWS rate limits less, we should probably make the default something that will reduce the rate-limit-stress we're putting on customers

Copy link
Contributor Author

@triarius triarius May 16, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe, I believe the fact that we were searching for non-existent scale-in events is the major cause of the rate limiting. We can also tweak this parameter from the elastic ci stack, so I propose we leave it unconstrained here and pass in a limit through the elastic stack if we need to.

@triarius triarius merged commit ff97cca into master May 16, 2023
@triarius triarius deleted the pdp-782-limit-pagination-in-buildkite-agent branch May 16, 2023 13:12
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants