Relative scaling calculation #40

Merged · 4 commits · Apr 14, 2021
Conversation

dbaggerman
Contributor

Proposed solution to #39. It's fresh off the keyboard, and I haven't had a chance to set up a test lab to verify it does what I expect in the real world yet.

Implements an (opt-in) alternative scaling calculation using relative numbers, i.e. N more instances required, as opposed to N total instances required.

This should work better in cases where not all agents on an instance are running or accepting jobs. For example, while waiting for (potentially long-running) jobs to complete during a graceful shutdown.

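For readers unfamiliar with the two approaches, here is a minimal sketch in Go contrasting them. It is not the scaler's actual code; the package and all names (scheduledJobs, runningAgents, agentsPerInstance, currentDesired) are assumptions for illustration.

```go
// Package scaler is a placeholder name, not the project's real package.
package scaler

// desiredAbsolute: "N total instances required" - size the group to cover
// the whole queue, regardless of what is already running.
func desiredAbsolute(scheduledJobs, agentsPerInstance int64) int64 {
	return (scheduledJobs + agentsPerInstance - 1) / agentsPerInstance // ceiling division
}

// desiredRelative: "N more instances required" - only cover the jobs that the
// currently running (and accepting) agents cannot absorb.
func desiredRelative(scheduledJobs, runningAgents, agentsPerInstance, currentDesired int64) int64 {
	shortfall := scheduledJobs - runningAgents
	if shortfall <= 0 {
		return currentDesired // enough agents already; ask for nothing extra
	}
	extra := (shortfall + agentsPerInstance - 1) / agentsPerInstance // ceiling division
	return currentDesired + extra
}
```

In the steady case the two approaches land on the same desired count (as dbaggerman notes later in the thread); they diverge when some running instances host agents that aren't accepting jobs.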
@yob
Contributor

yob commented Feb 22, 2021

Thanks @dbaggerman, this is an interesting PR with plenty of useful context in the description and linked issue.

I'm going to chat to a few folks internally to collect some feedback.

One thought that occurs to me: there'll be a lag time between the scaler requesting a new instance and it showing up in the agent count metrics visible in the API. Is there potential for this approach to over-compensate and scale up too high for the queue size?

@yob
Contributor

yob commented Feb 22, 2021

> there'll be a lag time between the scaler requesting a new instance and it showing up in the agent count metrics visible in the API. Is there potential for this approach to over-compensate and scale up too high for the queue size?

Oh, I see. There's some accommodation for pending instances in the ASG when calculating the scale up 👍

@dbaggerman
Contributor Author

> There's some accommodation for pending instances in the ASG when calculating the scale up

That's right. There's still the possibility of a very small window where an instance is ready in the ASG but not yet registered as an agent, but that's hopefully small enough not to be much of a problem in practice.

The existing SCALE_OUT_COOLDOWN_PERIOD option could also be used to prevent it spinning up instances too frequently.
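As a rough illustration of the pending-instance accommodation discussed above, a minimal sketch with hypothetical names (not the PR's actual code):

```go
package scaler

// extraInstancesNeeded subtracts capacity that is already booting, so a
// freshly requested instance isn't requested again before its agents
// register. shortfallJobs is the number of scheduled jobs the current
// agents cannot cover.
func extraInstancesNeeded(shortfallJobs, agentsPerInstance, pendingInstances int64) int64 {
	needed := (shortfallJobs + agentsPerInstance - 1) / agentsPerInstance // ceiling division
	needed -= pendingInstances // instances the ASG is already bringing up
	if needed < 0 {
		needed = 0
	}
	return needed
}
```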

@yob
Contributor

yob commented Feb 26, 2021

I think we're a bit nervous about adding two scaling modes. Partly due to the small increase in ongoing maintenance complexity, but more importantly because it'll then require a config dial on the elastic stack that we have to explain and support. The stack already has more parameters than we'd like, and we're trying to hold the line on adding more where possible.

The issue you describe is real though! It's something we haven't felt internally on our own stacks because we only run a single agent per instance.

I wonder if there's a way to avoid the issue without adding a second scaling mode? Easier said than done maybe.

Is the problem most pronounced when the stack has only 1 instance left, with < 4 running agents, and < 4 scheduled jobs? Could we use the "fetch the number of running agents from Buildkite's perspective" trick you have here, and when that number is under the number of agents-per-instance automatically scale up by 1?

@dbaggerman
Contributor Author

> I think we're a bit nervous about adding two scaling modes.

Being concerned about the extra complexity is understandable. I did it that way because I thought it might make the scaling behaviour different, but in the end it produced exactly the same desired count as the current calculation - at least for the scenarios in the unit tests. Maybe there's an argument for making it the new standard behaviour, but I wouldn't be surprised if there are edge case differences not being caught by the tests.

> Is the problem most pronounced when the stack has only 1 instance left, with < 4 running agents, and < 4 scheduled jobs?

Yes, the issue is most pronounced when there is one instance in the pool which is not accepting jobs. That's where we end up with someone asking why their job has been waiting for an agent for over an hour. If the queue gets large enough to add a second instance, then at least jobs get processed, even if they might wait a bit longer.

> Could we use the "fetch the number of running agents from Buildkite's perspective" trick you have here, and when that number is under the number of agents-per-instance automatically scale up by 1?

I suppose we could create a hybrid calculation which infers the number of "inactive" instances and adjusts the absolute number. It would probably end up being slightly more complex than either of the other two calculations on their own - but might also be less likely to interfere with edge cases. I can put something together and add it to the PR so we can see what it looks like - we can unwind back to a single implementation later.
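A rough sketch of the hybrid idea described here, under assumed names and integer maths; this is an illustration, not the implementation that landed:

```go
package scaler

// desiredHybrid infers how many instances are "inactive" (in the ASG but with
// no registered agents, e.g. draining during a graceful shutdown) and adds
// them on top of the absolute calculation, so the waiting jobs are still
// covered by instances that can actually take them.
func desiredHybrid(scheduledJobs, agentsPerInstance, registeredAgents, currentInstances int64) int64 {
	absolute := (scheduledJobs + agentsPerInstance - 1) / agentsPerInstance // ceiling division

	activeInstances := registeredAgents / agentsPerInstance // instances whose agents are registered
	inactive := currentInstances - activeInstances
	if inactive < 0 {
		inactive = 0
	}
	return absolute + inactive
}
```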

@yob
Contributor

yob commented Feb 26, 2021

> I can put something together and add it to the PR so we can see what it looks like - we can unwind back to a single implementation later.

Awesome, thanks!

@dbaggerman
Contributor Author

@yob, I've rolled it back to just a single implementation. Let me know if there's anything else you would like to change.

@yob
Contributor

yob commented Mar 12, 2021

@dbaggerman thanks! Apologies we haven't given feedback yet, but it's not forgotten and we'll get to it soon ❤️

@yob
Contributor

yob commented Apr 1, 2021

@dbaggerman we're planning a backwards compatible 5.3.0 elastic stack release in the next week or three and I'd love to land this for you before that.

We like the general approach of the new logic being active without a mode flag, but we're a little nervous about how this might impact stacks with hundreds or thousands of instances, particularly the logic that tries to account for pending instances.

Are you open to making this a little more conservative - at least as a first step - and only kicking in the "anti stall" logic when the registered agent count is < AgentsPerInstance? Or maybe AgentsPerInstance*2? At that scale, if we over-scale and add an extra instance or two when they're not required, the cost impact is minimal. It sounds like it'll also address most of the pain you're experiencing?

Could we even get away with a "when asg.DesiredCount == 1 and registeredAgents is less than agentsPerInstance, increment desired instances by 1" safety net?

Apologies for the multiple rounds of feedback ❤️
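A minimal sketch of the safety net suggested above, using hypothetical names (the PR's real types and fields may differ):

```go
package scaler

// applyAntiStall bumps the desired count by one only on a very small stack
// whose agents aren't all registered while jobs are still waiting.
func applyAntiStall(desired, registeredAgents, agentsPerInstance, scheduledJobs int64) int64 {
	if desired == 1 && registeredAgents < agentsPerInstance && scheduledJobs > 0 {
		return desired + 1
	}
	return desired
}
```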

@dbaggerman
Contributor Author

> only kicking in the "anti stall" logic when the registered agent count is < AgentsPerInstance?

Sure, that would work OK for us. We run quite a few agents, but spread thinly across many queues / auto-scaling groups which means that we often have queues with only one or two instances running. Those queues are the ones that typically run into this.

@yob
Contributor

yob commented Apr 7, 2021

Perfect, we'll hold off 5.3.0 for a week or two then ✨

@yob
Contributor

yob commented Apr 14, 2021

I've booted a test stack using a scaler compiled from this PR, and everything worked as normal. I didn't test the specific "stall" situation this is designed to combat, but it was still reassuring to see the standard behaviour work as it should.

I have minor nerves about performance when iterating over the ASG instances to count the pending ones, particularly when the ASG has hundreds or thousands of instances. It'd be nice if the instance iteration was inside the recently added conditional, but I don't think it's a blocker. I'm happy to merge, and we can optimise later if need be.
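As an illustration of the optimisation mentioned here, a sketch (hypothetical types, not the merged code) that only iterates the ASG instances when the anti-stall condition could apply:

```go
package scaler

// instance is a stand-in for whatever the scaler gets back from the
// auto-scaling API; only the lifecycle state matters here.
type instance struct {
	LifecycleState string // e.g. "Pending", "InService"
}

// countPendingIfStalled walks the instance list only when the anti-stall
// condition could fire, so large, healthy ASGs skip the iteration entirely.
func countPendingIfStalled(registeredAgents, agentsPerInstance int64, instances []instance) int64 {
	if registeredAgents >= agentsPerInstance {
		return 0
	}
	var pending int64
	for _, inst := range instances {
		if inst.LifecycleState == "Pending" {
			pending++
		}
	}
	return pending
}
```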

@yob yob merged commit d6e447c into buildkite:master Apr 14, 2021
yob added a commit that referenced this pull request Apr 14, 2021
yob added a commit to buildkite/elastic-ci-stack-for-aws that referenced this pull request Apr 14, 2021
A single new feature is included in this release. When the elastic stack is very small (<= 2 running instances), consider adding a new instance when we suspect the current instances are shutting down and there are pending jobs.

Pull request with more details is here: buildkite/buildkite-agent-scaler#40
keithduncan added a commit that referenced this pull request Jun 8, 2021