Relative scaling calculation #40

Merged · 4 commits · Apr 14, 2021
Conversation

dbaggerman
Contributor

Proposed solution to #39. It's fresh off the keyboard, and I haven't had a chance to set up a test lab to verify it does what I expect in the real world yet.

Implements an (opt-in) alternative scaling calculation using relative numbers, i.e. N more instances required, as opposed to N total instances required.

This should work better in cases where not all agents on an instance are running or accepting jobs. For example, while waiting for (potentially long-running) jobs to complete during a graceful shutdown.

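For readers unfamiliar with the two approaches, here is a minimal sketch in Go contrasting them. It is not the scaler's actual code; the package and all names (scheduledJobs, runningAgents, agentsPerInstance, currentDesired) are assumptions for illustration.

```go
// Package scaler is a placeholder name, not the project's real package.
package scaler

// desiredAbsolute: "N total instances required" - size the group to cover
// the whole queue, regardless of what is already running.
func desiredAbsolute(scheduledJobs, agentsPerInstance int64) int64 {
	return (scheduledJobs + agentsPerInstance - 1) / agentsPerInstance // ceiling division
}

// desiredRelative: "N more instances required" - only cover the jobs that the
// currently running (and accepting) agents cannot absorb.
func desiredRelative(scheduledJobs, runningAgents, agentsPerInstance, currentDesired int64) int64 {
	shortfall := scheduledJobs - runningAgents
	if shortfall <= 0 {
		return currentDesired // enough agents already; ask for nothing extra
	}
	extra := (shortfall + agentsPerInstance - 1) / agentsPerInstance // ceiling division
	return currentDesired + extra
}
```

In the steady case the two approaches land on the same desired count (as dbaggerman notes later in the thread); they diverge when some running instances host agents that aren't accepting jobs.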
@yob
Contributor

yob commented Feb 22, 2021

Thanks @dbaggerman, this is an interesting PR with plenty of useful context in the description and linked issue.

I'm going to chat to a few folks internally to collect some feedback.

One thought that occurs to me: there'll be a lag time between the scaler requesting a new instance and it showing up in the agent count metrics visible in the API. Is there potential for this approach to over-compensate and scale up too high for the queue size?

@yob
Contributor

yob commented Feb 22, 2021

> there'll be a lag time between the scaler requesting a new instance and it showing up in the agent count metrics visible in the API. Is there potential for this approach to over-compensate and scale up too high for the queue size?

Oh, I see. There's some accommodation for pending instances in the ASG when calculating the scale up 👍

@dbaggerman
Contributor Author

> There's some accommodation for pending instances in the ASG when calculating the scale up

That's right. There's still the possibility of a very small window where an instance is ready in the ASG but not yet registered as an agent, but that's hopefully small enough not to be much of a problem in practice.

The existing SCALE_OUT_COOLDOWN_PERIOD option could also be used to prevent it spinning up instances too frequently.
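As a rough illustration of the pending-instance accommodation discussed above, a minimal sketch with hypothetical names (not the PR's actual code):

```go
package scaler

// extraInstancesNeeded subtracts capacity that is already booting, so a
// freshly requested instance isn't requested again before its agents
// register. shortfallJobs is the number of scheduled jobs the current
// agents cannot cover.
func extraInstancesNeeded(shortfallJobs, agentsPerInstance, pendingInstances int64) int64 {
	needed := (shortfallJobs + agentsPerInstance - 1) / agentsPerInstance // ceiling division
	needed -= pendingInstances // instances the ASG is already bringing up
	if needed < 0 {
		needed = 0
	}
	return needed
}
```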

@yob
Contributor

yob commented Feb 26, 2021

I think we're a bit nervous about adding two scaling modes. Partly due to the small increase in ongoing maintenance complexity, but more importantly because it'll then require a config dial on the elastic stack that we have to explain and support. The stack already has more parameters than we'd like, and we're trying to hold the line on adding more where possible.

The issue you describe is real though! It's something we haven't felt internally on our own stacks because we only run a single agent per instance.

I wonder if there's a way to avoid the issue without adding a second scaling mode? Easier said than done maybe.

Is the problem most pronounced when the stack has only 1 instance left, with < 4 running agents, and < 4 scheduled jobs? Could we use the "fetch the number of running agents from Buildkite's perspective" trick you have here, and when that number is under the number of agents-per-instance automatically scale up by 1?

@dbaggerman
Contributor Author

> I think we're a bit nervous about adding two scaling modes.

Being concerned about the extra complexity is understandable. I did it that way because I thought it might make the scaling behaviour different, but in the end it produced exactly the same desired count as the current calculation - at least for the scenarios in the unit tests. Maybe there's an argument for making it the new standard behaviour, but I wouldn't be surprised if there are edge case differences not being caught by the tests.

> Is the problem most pronounced when the stack has only 1 instance left, with < 4 running agents, and < 4 scheduled jobs?

Yes, the issue is most pronounced when there is one instance in the pool which is not accepting jobs. That's where we end up with someone asking why their job has been waiting for an agent for over an hour. If the queue gets large enough to add a second instance, then at least jobs get processed, even if they might wait a bit longer.

> Could we use the "fetch the number of running agents from Buildkite's perspective" trick you have here, and when that number is under the number of agents-per-instance automatically scale up by 1?

I suppose we could create a hybrid calculation which infers the number of "inactive" instances and adjusts the absolute number. It would probably end up being slightly more complex than either of the other two calculations on their own - but might also be less likely to interfere with edge cases. I can put something together and add it to the PR so we can see what it looks like - we can unwind back to a single implementation later.
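A rough sketch of the hybrid idea described here, under assumed names and integer maths; this is an illustration, not the implementation that landed:

```go
package scaler

// desiredHybrid infers how many instances are "inactive" (in the ASG but with
// no registered agents, e.g. draining during a graceful shutdown) and adds
// them on top of the absolute calculation, so the waiting jobs are still
// covered by instances that can actually take them.
func desiredHybrid(scheduledJobs, agentsPerInstance, registeredAgents, currentInstances int64) int64 {
	absolute := (scheduledJobs + agentsPerInstance - 1) / agentsPerInstance // ceiling division

	activeInstances := registeredAgents / agentsPerInstance // instances whose agents are registered
	inactive := currentInstances - activeInstances
	if inactive < 0 {
		inactive = 0
	}
	return absolute + inactive
}
```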

@yob
Contributor

yob commented Feb 26, 2021

> I can put something together and add it to the PR so we can see what it looks like - we can unwind back to a single implementation later.

Awesome, thanks!

@dbaggerman
Contributor Author

@yob, I've rolled it back to just a single implementation. Let me know if there's anything else you would like to change.

@yob
Contributor

yob commented Mar 12, 2021

@dbaggerman thanks! Apologies we haven't given feedback yet, but it's not forgotten and we'll get to it soon ❤️

@yob
Contributor

yob commented Apr 1, 2021

@dbaggerman we're planning a backwards compatible 5.3.0 elastic stack release in the next week or three and I'd love to land this for you before that.

We like the general approach of the new logic being active without a mode flag, but we're a little nervous about how this might impact stacks with hundreds or thousands of instances, particularly the logic that tries to account for pending instances.

Are you open to making this a little more conservative - at least as a first step - and only kicking in the "anti stall" logic when the registered agent count is < AgentsPerInstance? Or maybe AgentsPerInstance*2? At that scale, if we over-scale and add an extra instance or two when they're not required, the cost impact is minimal. It sounds like it'll also address most of the pain you're experiencing?

Could we even get away with a "when asg.DesiredCount == 1 and registeredAgents is less than agentsPerInstance, increment desired instances by 1" safety net?

Apologies for the multiple rounds of feedback ❤️
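A minimal sketch of the safety net suggested above, using hypothetical names (the PR's real types and fields may differ):

```go
package scaler

// applyAntiStall bumps the desired count by one only on a very small stack
// whose agents aren't all registered while jobs are still waiting.
func applyAntiStall(desired, registeredAgents, agentsPerInstance, scheduledJobs int64) int64 {
	if desired == 1 && registeredAgents < agentsPerInstance && scheduledJobs > 0 {
		return desired + 1
	}
	return desired
}
```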

@dbaggerman
Contributor Author

> only kicking in the "anti stall" logic when the registered agent count is < AgentsPerInstance?

Sure, that would work OK for us. We run quite a few agents, but spread thinly across many queues / auto-scaling groups which means that we often have queues with only one or two instances running. Those queues are the ones that typically run into this.

@yob
Contributor

yob commented Apr 7, 2021

Perfect, we'll hold off 5.3.0 for a week or two then ✨

@yob
Contributor

yob commented Apr 14, 2021

I've booted a test stack using a scaler compiled from this PR, and everything worked as normal. I didn't test the specific "stall" situation this is designed to combat, but it was still reassuring to see the standard behaviour work as it should.

I have minor nerves about performance when iterating over the ASG instances to count the pending ones, particularly when the ASG has hundreds or thousands of instances. It'd be nice if the instance iteration was inside the recently added conditional, but I don't think it's a blocker. I'm happy to merge, and we can optimise later if need be.
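As an illustration of the optimisation mentioned here, a sketch (hypothetical types, not the merged code) that only iterates the ASG instances when the anti-stall condition could apply:

```go
package scaler

// instance is a stand-in for whatever the scaler gets back from the
// auto-scaling API; only the lifecycle state matters here.
type instance struct {
	LifecycleState string // e.g. "Pending", "InService"
}

// countPendingIfStalled walks the instance list only when the anti-stall
// condition could fire, so large, healthy ASGs skip the iteration entirely.
func countPendingIfStalled(registeredAgents, agentsPerInstance int64, instances []instance) int64 {
	if registeredAgents >= agentsPerInstance {
		return 0
	}
	var pending int64
	for _, inst := range instances {
		if inst.LifecycleState == "Pending" {
			pending++
		}
	}
	return pending
}
```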

@yob yob merged commit d6e447c into buildkite:master Apr 14, 2021
yob added a commit that referenced this pull request Apr 14, 2021
yob added a commit to buildkite/elastic-ci-stack-for-aws that referenced this pull request Apr 14, 2021
A single new feature is included in this release. When the elastic stack is very small (<= 2 running instances), consider adding a new instance when we suspect the current instances are shutting down and there are pending jobs.

Pull request with more details is here: buildkite/buildkite-agent-scaler#40
keithduncan added a commit that referenced this pull request Jun 8, 2021