Limit pagination of autoscaling:DescribeScalingActivity #81

triarius · 2023-03-31T04:43:39Z

This fixes a variety of issues with the scaler. The main one: when scale in was disabled on the elastic ci stack (which it is by default), the search for the last scale in activity would always go through the entire history of scaling events. As this history is paginated and stores 6 weeks of activity, many customers were experiencing rate limiting of their autoscaling API calls. This rendered the elastic ci stack unable to scale out in some cases.

Until @YanivAssaf-at pointed this out, my main attempt at a solution was to impose an arbitrary limit on the pagination of the DescribeScalingActivities API. I still think this could be useful, so I have kept it in.

YanivAssaf-at · 2023-05-04T09:02:57Z

scaler/asg.go

+	for i := 0; !hasFoundScalingActivities; {
+		i++
+		if a.MaxDescribeScalingActivitiesPages >= 0 && i >= a.MaxDescribeScalingActivitiesPages {
+			return nil, nil, fmt.Errorf("%d exceedes allowed pages for autoscaling:DescribeScalingActivities, %d", i, a.MaxDescribeScalingActivitiesPages)


This may have found either lastScalingOutActivity or lastScalingInActivity in one of the previous pages. Can they be returned, even though the maximum number of pages has been reached?

This feels like an expected failure case, rather than an exception (from my past Java days). I'd log the warning and return what I can.

My Go knowledge is basically nonexistent, and my suggestions might be wrong.

Good suggestion. The calling code used to fall back to the globally saved values on any error, but I've made it only do that if neither scale in nor scale out is found. Then, depending on what is found, it will use the globally saved value or the newly found one.

There was a timeout that fired after 10 seconds, but because the pagination was happening in a goroutine, it would keep running after the timeout had been received from the timeout channel. After this commit, the context will be canceled, so the next call to DescribeScalingActivities will return an error.

If either scale in or scale out was disabled, the corresponding scaling events would not appear in the scaling activities, so the entire scaling activity history would have to be searched each time. In the elastic ci stack for aws this was the case, as scale in was disabled. Many customers have reported excessive calls to, and rate limiting of, the DescribeScalingActivities endpoint, so this was a real problem. Also, there is no need to check environment variables on each iteration of the main loop, so we have moved those checks outside it.

YanivAssaf-at

Thank you.

YanivAssaf-at · 2023-05-08T17:25:39Z

lambda/main.go

+			log.Printf("Encountered error when retrieving last scaling activities: %s", res.Err)
+		}
+
+		if res.LastScaleOutActivity == nil && res.LastScaleInActivity == nil {


Is this break still necessary?

I think so. Previously we would break on an error, but now we don't. Now, an error means that at most one of res.LastScaleOutActivity and res.LastScaleInActivity is non-nil, whereas previously it meant that both were nil. If both of these are nil, we don't want to continue and print "Succesfully retrieved last scaling activity events. Last scale out %v, last scale in %v. Discovery took %s.", so we exit the select statement the same way that an error did before.

What I expect will happen is that the values stored globally in scaleInOutput and scaleOutOutput are used instead. And that's what would have happened if there was an error before.

I suppose it's worth logging this expectation, though. I'm not sure that storing scaleInOutput and scaleOutOutput globally actually does anything, but it seems to be the design that if describe scaling activities fails or times out, falling back to the global values is meaningful.
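The fallback discussed above can be sketched as follows: for each of scale out and scale in, use the newly found activity if there is one, otherwise the previously saved value. The names `savedScaleOut`, `savedScaleIn`, and `resolve` are hypothetical, and updating the saved values whenever something new is found is an assumption of this sketch rather than confirmed behaviour of the scaler.

```go
package main

import "fmt"

// Activity is a simplified stand-in for an Auto Scaling activity record.
type Activity struct {
	Description string
}

// Hypothetical package-level values standing in for the globally saved
// results mentioned in the thread. They outlive a single invocation.
var (
	savedScaleOut *Activity
	savedScaleIn  *Activity
)

// resolve prefers a freshly found activity and falls back to the saved one
// when the search returned nil for that kind. Any fresh value is remembered
// for later calls (an assumption of this sketch).
func resolve(foundOut, foundIn *Activity) (out, in *Activity) {
	if foundOut != nil {
		savedScaleOut = foundOut
	}
	if foundIn != nil {
		savedScaleIn = foundIn
	}
	return savedScaleOut, savedScaleIn
}

func main() {
	first := &Activity{Description: "scale out found on first run"}
	resolve(first, nil)

	// A later search finds nothing, for example because it timed out, so the
	// saved scale-out value is used and scale in remains unknown.
	out, in := resolve(nil, nil)
	fmt.Println(out == first, in == nil)
}
```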

moskyb

i have one comment about the default number of pages, but other than that, this all LGTM. awesome work @triarius

moskyb · 2023-05-15T00:55:56Z

template.yaml

+  MaxDescribeScalingActivitiesPages:
+    Type: Number
+    Description: The number of pages to retrive for DescribeScalingActivity. Negative numbers mean unlimited.
+    Default: "-1"


should we default this to something low-but-reasonable? if the goal of this work is to consume AWS rate limits less, we should probably make the default something that will reduce the rate-limit-stress we're putting on customers

Maybe. I believe the fact that we were searching for non-existent scale-in events is the major cause of the rate limiting. We can also tweak this parameter from the elastic ci stack, so I propose we leave it unconstrained here and pass in a limit through the elastic ci stack if we need to.

triarius force-pushed the pdp-782-limit-pagination-in-buildkite-agent branch from 715f975 to b3f5159 Compare March 31, 2023 09:28

triarius mentioned this pull request Mar 31, 2023

Use OIDC to assume role to publish lambdas to s3 #82

Merged

triarius force-pushed the pdp-782-limit-pagination-in-buildkite-agent branch from b3f5159 to 492e5d6 Compare March 31, 2023 09:36

triarius changed the base branch from master to pdp-777-fix-ci-for-buildkite-agent-scaler March 31, 2023 09:37

Base automatically changed from pdp-777-fix-ci-for-buildkite-agent-scaler to master April 2, 2023 08:27

triarius force-pushed the pdp-782-limit-pagination-in-buildkite-agent branch 5 times, most recently from 81aee9e to c5295e8 Compare April 3, 2023 00:28

YanivAssaf-at reviewed May 4, 2023

triarius force-pushed the pdp-782-limit-pagination-in-buildkite-agent branch from 8d61d1b to e6797eb Compare May 7, 2023 13:16

triarius added 6 commits May 7, 2023 23:28

Limit pagination of autoscaling:DescribeScalingActivity

38e49c7

Add a parameter to the sam stack and rename variables after it

975dee5

Fix template default values must be strings

ffa63e4

Always return as much info as we can when searching for scale in and …

8eb8d86

…scale out activities If an error occurs or the paging limit is reached, we may still have found one of the two types of scaling activity we are looking for

Allow continuing if only one of the scale out or in activities were r…

41e1801

…etrieved

triarius force-pushed the pdp-782-limit-pagination-in-buildkite-agent branch from e6797eb to 41e1801 Compare May 7, 2023 13:28

triarius marked this pull request as ready for review May 7, 2023 14:06

triarius requested review from YanivAssaf-at and a team and removed request for YanivAssaf-at May 7, 2023 14:06

YanivAssaf-at approved these changes May 8, 2023

moskyb approved these changes May 15, 2023

triarius merged commit ff97cca into master May 16, 2023

triarius deleted the pdp-782-limit-pagination-in-buildkite-agent branch May 16, 2023 13:12

triarius mentioned this pull request May 17, 2023

Bump CHANGELOG and version for v1.4.0 #90

Merged
