Limit pagination of autoscaling:DescribeScalingActivity #81
Conversation
Force-pushed from 715f975 to b3f5159
Force-pushed from b3f5159 to 492e5d6
Force-pushed from 81aee9e to c5295e8
scaler/asg.go (outdated)

for i := 0; !hasFoundScalingActivities; {
	i++
	if a.MaxDescribeScalingActivitiesPages >= 0 && i >= a.MaxDescribeScalingActivitiesPages {
		return nil, nil, fmt.Errorf("%d exceeds allowed pages for autoscaling:DescribeScalingActivities, %d", i, a.MaxDescribeScalingActivitiesPages)
This may have found either lastScalingOutActivity or lastScalingInActivity in one of the previous pages. Can they be returned, even though the maximum number of pages has been reached?
This feels like an expected failure case, rather than an exception (from my past Java days). I'd log the warning and return what I can.
My Go knowledge is basically nonexistent, and my suggestions might be wrong.
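For illustration, here is a minimal sketch of the behaviour being suggested: log a warning when the page limit is hit and return whatever has been found so far. The `Activity` type, the `fetchPage` callback, and the `isScaleOut`/`isScaleIn` helpers are hypothetical stand-ins, not the scaler's real API.

```go
package main

import (
	"fmt"
	"log"
	"time"
)

// Activity is an illustrative stand-in for an autoscaling scaling activity.
type Activity struct {
	Description string
	StartTime   time.Time
}

// lastScalingActivities walks pages of scaling activities via fetchPage (a
// stand-in for autoscaling:DescribeScalingActivities) and gives up after
// maxPages pages. Instead of returning an error, it logs a warning and
// returns whatever partial results it has found, as suggested above.
func lastScalingActivities(
	fetchPage func(page int) (activities []Activity, more bool),
	maxPages int,
) (lastScaleOut, lastScaleIn *Activity) {
	for page := 0; ; page++ {
		if maxPages >= 0 && page >= maxPages {
			log.Printf("reached page limit (%d) before finding both activities; returning partial results", maxPages)
			return lastScaleOut, lastScaleIn
		}
		activities, more := fetchPage(page)
		for i := range activities {
			a := &activities[i]
			if lastScaleOut == nil && isScaleOut(a) {
				lastScaleOut = a
			}
			if lastScaleIn == nil && isScaleIn(a) {
				lastScaleIn = a
			}
		}
		if (lastScaleOut != nil && lastScaleIn != nil) || !more {
			return lastScaleOut, lastScaleIn
		}
	}
}

// Placeholder classification for the sketch only.
func isScaleOut(a *Activity) bool { return a.Description == "scale out" }
func isScaleIn(a *Activity) bool  { return a.Description == "scale in" }

func main() {
	// A fake, scale-out-only history spread over endless pages, mimicking a
	// stack where scale in is disabled.
	fetch := func(page int) ([]Activity, bool) {
		return []Activity{{Description: "scale out", StartTime: time.Now()}}, true
	}
	out, in := lastScalingActivities(fetch, 3)
	fmt.Printf("scale out found: %v, scale in found: %v\n", out != nil, in != nil)
}
```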
Good suggestion. The calling code used to fall back to the globally saved values on any error, but I've made it only do that if neither scale in nor scale out is found. Then, depending on what is found, it will use the globally saved value or the newly found one.
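Roughly, the per-value fallback described here could look like the following sketch; the `cached` type, the `choose` helper, and the string values are all illustrative, not the scaler's real types.

```go
package main

import "fmt"

// cached is an illustrative stand-in for the globally saved values mentioned
// above, kept from the last successful discovery.
type cached struct {
	lastScaleOut *string
	lastScaleIn  *string
}

var global cached

// choose prefers whatever was freshly discovered and falls back to the cached
// copy per value, so the global values are only relied on entirely when
// neither activity was found.
func choose(freshOut, freshIn *string) (out, in *string) {
	out, in = freshOut, freshIn
	if out == nil {
		out = global.lastScaleOut
	}
	if in == nil {
		in = global.lastScaleIn
	}
	return out, in
}

func main() {
	prevOut, prevIn := "previous scale out", "previous scale in"
	global = cached{lastScaleOut: &prevOut, lastScaleIn: &prevIn}

	newOut := "fresh scale out"
	out, in := choose(&newOut, nil) // scale in was not found this time
	fmt.Println(*out, "/", *in)     // fresh scale out / previous scale in
}
```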
Force-pushed from 8d61d1b to e6797eb
While there was a timeout that fired after 10 seconds, the pagination was happening in a goroutine, so it would keep running even after the timeout had fired and been pulled from the timeout channel. After this commit, the context is cancelled and the next call to DescribeScalingActivities will return an error.
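As a rough, self-contained sketch of that shape (using a hypothetical `describePage` helper rather than the real AWS SDK call), cancelling the context makes the paginating goroutine stop at its next page request:

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// describePage stands in for one autoscaling:DescribeScalingActivities call;
// it fails as soon as the context has been cancelled.
func describePage(ctx context.Context, page int) error {
	if err := ctx.Err(); err != nil {
		return fmt.Errorf("DescribeScalingActivities page %d: %w", page, err)
	}
	time.Sleep(30 * time.Millisecond) // pretend to talk to AWS
	return nil
}

func main() {
	// The pagination runs in a goroutine with a deadline. Without the
	// cancellable context it would keep paging after the timeout fired; with
	// it, the next page request returns an error and the goroutine exits.
	ctx, cancel := context.WithTimeout(context.Background(), 100*time.Millisecond)
	defer cancel()

	done := make(chan error, 1)
	go func() {
		for page := 0; ; page++ {
			if err := describePage(ctx, page); err != nil {
				done <- err
				return
			}
		}
	}()

	fmt.Println(<-done) // e.g. "DescribeScalingActivities page 3: context deadline exceeded"
}
```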
…scale out activities. If an error occurs or the paging limit is reached, we may still have found one of the two types of scaling activity we are looking for.
Force-pushed from e6797eb to 41e1801
If either of them was disabled, the corresponding scaling events would never appear in the scaling activities, so the entire scaling activity history had to be searched every time. This was the case in the Elastic CI Stack for AWS, where scale in is disabled. Many customers reported excessive calls to, and rate limiting of, the DescribeScalingActivities endpoint, so this was a real problem. There is also no need to check environment variables on each iteration of the main loop, so we have moved them outside it.
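A small sketch of the env-var hoisting (the variable name here is made up for illustration; the scaler's actual configuration keys may differ):

```go
package main

import (
	"fmt"
	"os"
	"strconv"
	"time"
)

func main() {
	// Read configuration from the environment once, before the main loop,
	// instead of re-checking it on every iteration.
	maxPages := -1 // negative means unlimited
	if v := os.Getenv("MAX_DESCRIBE_SCALING_ACTIVITIES_PAGES"); v != "" {
		if n, err := strconv.Atoi(v); err == nil {
			maxPages = n
		}
	}

	for i := 0; i < 3; i++ { // stand-in for the scaler's main loop
		fmt.Printf("iteration %d using maxPages=%d\n", i, maxPages)
		time.Sleep(10 * time.Millisecond)
	}
}
```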
Thank you.
	log.Printf("Encountered error when retrieving last scaling activities: %s", res.Err)
}

if res.LastScaleOutActivity == nil && res.LastScaleInActivity == nil {
Is this `break` still necessary?
I think so. Previously we would `break` on an error, but now we don't. Now, an error means that at most one of `res.LastScaleOutActivity != nil` or `res.LastScaleInActivity != nil` holds, whereas previously it meant that both were `nil`. If both of these are `nil`, we don't want to continue and print `Succesfully retrieved last scaling activity events. Last scale out %v, last scale in %v. Discovery took %s.`, so we exit the select statement the same way an error did before.

What I expect will happen is that the values stored globally in `scaleInOutput` and `scaleOutOutput` are used instead. And that's what would have happened if there was an error before.
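For concreteness, a runnable sketch of the select/`break`/fallback shape being discussed; the `result` type, the channel, and the cached values are illustrative stand-ins for the scaler's real ones.

```go
package main

import (
	"fmt"
	"log"
	"time"
)

// result mirrors the shape discussed above: an error can now arrive alongside
// one of the two activities. Field types are illustrative only.
type result struct {
	Err                  error
	LastScaleOutActivity *string
	LastScaleInActivity  *string
}

func main() {
	resultCh := make(chan result, 1)
	go func() {
		out := "scale out at 10:00"
		// An error occurred partway through paging, but a scale-out activity
		// had already been found, so it is returned alongside the error.
		resultCh <- result{Err: fmt.Errorf("page limit reached"), LastScaleOutActivity: &out}
	}()

	// Globally cached values from a previous run, used as the fallback.
	prevOut, prevIn := "cached scale out", "cached scale in"
	scaleOutOutput, scaleInOutput := &prevOut, &prevIn

	select {
	case res := <-resultCh:
		if res.Err != nil {
			log.Printf("Encountered error when retrieving last scaling activities: %s", res.Err)
		}
		if res.LastScaleOutActivity == nil && res.LastScaleInActivity == nil {
			// Nothing new was found at all: fall back to the cached values.
			// This break plays the role discussed above.
			break
		}
		// Otherwise use whatever was freshly found, keeping the cached value
		// for the one that is still missing.
		if res.LastScaleOutActivity != nil {
			scaleOutOutput = res.LastScaleOutActivity
		}
		if res.LastScaleInActivity != nil {
			scaleInOutput = res.LastScaleInActivity
		}
	case <-time.After(10 * time.Second):
		log.Printf("Timed out waiting for scaling activities; using cached values")
	}

	fmt.Println("scale out:", *scaleOutOutput)
	fmt.Println("scale in: ", *scaleInOutput)
}
```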
I suppose it's worth logging this expectation, though. I'm not sure that storing `scaleInOutput` and `scaleOutOutput` globally actually does anything, but it seems to be the design that if describing the scaling activities fails or times out, falling back to the global values is meaningful.
i have one comment about the default number of pages, but other than that, this all LGTM. awesome work @triarius
MaxDescribeScalingActivitiesPages:
  Type: Number
  Description: The number of pages to retrieve for DescribeScalingActivities. Negative numbers mean unlimited.
  Default: "-1"
should we default this to something low-but-reasonable? if the goal of this work is to consume AWS rate limits less, we should probably make the default something that will reduce the rate-limit-stress we're putting on customers
Maybe. I believe the fact that we were searching for non-existent scale-in events is the major cause of the rate limiting. We can also tweak this parameter from the Elastic CI Stack, so I propose we leave it unconstrained here and pass in a limit through the Elastic CI Stack if we need to.
This fixes a variety of issues with the scaler, the main one being that, when scale in was disabled on the Elastic CI Stack (which it is by default), the search for the last scale-in activity would always go through the entire history of scaling events. As this history is paginated and stores 6 weeks of activity, many customers were experiencing rate limiting of their autoscaling API calls. This rendered the Elastic CI Stack unable to scale out in some cases.
Until @YanivAssaf-at pointed this out, my main attempt at a solution was to impose an arbitrary limit on the pagination of the DescribeScalingActivities API. I still think this could be useful, so I have kept it in.