Add scaling to include waiting jobs #17

Merged
lox merged 6 commits into master from add-scaling-to-include-waiting-jobs
May 5, 2019

Conversation

lox
Contributor

@lox lox commented May 5, 2019

The agent metrics API now includes a WaitingJobCount metric, which tracks jobs that are blocked behind a wait step. This allows for pre-emptive scaling, so that instances are ready by the time those jobs are scheduled.

By default, waiting jobs aren't included in the scaling calculation, but they can be included by setting INCLUDE_WAITING=1 in the Lambda environment.
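A minimal sketch of how such a toggle could work, not the code in this PR. The `Metrics` struct, the `includeWaiting` and `desiredCount` helpers, and the plain additive formula are all assumptions made for illustration; only the `INCLUDE_WAITING=1` variable comes from the description above.

```go
package main

import "os"

// Metrics is a hypothetical stand-in for the job counts returned by the
// agent metrics API. The field names are assumptions for this sketch.
type Metrics struct {
	ScheduledJobs int64
	RunningJobs   int64
	WaitingJobs   int64
}

// includeWaiting reports whether INCLUDE_WAITING is set to "1" in the
// environment. Any other value, or no value, leaves waiting jobs out.
func includeWaiting() bool {
	return os.Getenv("INCLUDE_WAITING") == "1"
}

// desiredCount returns how many agents to scale to. Waiting jobs are only
// added when withWaiting is true. The additive formula is an assumption.
func desiredCount(m Metrics, withWaiting bool) int64 {
	count := m.ScheduledJobs + m.RunningJobs
	if withWaiting {
		count += m.WaitingJobs
	}
	return count
}
```

Under these assumptions, a caller would compute `desiredCount(m, includeWaiting())` on each scaling tick, so the default behaviour is unchanged unless the variable is set.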

@lox lox force-pushed the add-scaling-to-include-waiting-jobs branch from fad2399 to 0b61903 May 5, 2019 07:14
@lox
Contributor Author

lox commented May 5, 2019

Any thoughts on this one @etaoins?

@etaoins
Contributor

etaoins commented May 5, 2019

Interesting, I hadn't considered including waiting jobs.

I'm assuming this is off by default because some of the running agents would be free by the time the waiting jobs were unblocked, so this would overscale? I wonder if a better default-on heuristic would be:

count := metrics.ScheduledJobs + max(metrics.RunningJobs, metrics.WaitingJobs)

That way we would only scale up to account for waiting jobs if they exceeded the number of running jobs, i.e. once they're unblocked we won't have enough agents to run them.
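To make the difference concrete, here is a rough sketch comparing that heuristic with simply adding waiting jobs on top. The `JobMetrics` struct and the function names are hypothetical, and the additive version is an assumption about what "including waiting jobs" would otherwise mean.

```go
package main

// JobMetrics is a hypothetical container for the three job counts used in
// the heuristic. The field names mirror the snippet above.
type JobMetrics struct {
	ScheduledJobs int64
	RunningJobs   int64
	WaitingJobs   int64
}

// maxInt64 returns the larger of a and b. It is written out by hand so the
// sketch does not depend on the built-in max added in newer Go releases.
func maxInt64(a, b int64) int64 {
	if a > b {
		return a
	}
	return b
}

// additiveCount counts every waiting job on top of scheduled and running
// jobs (assumed behaviour, for comparison only).
func additiveCount(m JobMetrics) int64 {
	return m.ScheduledJobs + m.RunningJobs + m.WaitingJobs
}

// heuristicCount only lets waiting jobs raise the target when there are
// more of them than running jobs, since running agents free up over time.
func heuristicCount(m JobMetrics) int64 {
	return m.ScheduledJobs + maxInt64(m.RunningJobs, m.WaitingJobs)
}
```

With 2 scheduled, 10 running and 4 waiting jobs, the additive count is 16 while the heuristic gives 12. With 0 scheduled, 1 running and 40 waiting jobs, the heuristic gives 40, so the waiting jobs still drive the scale-up.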

I don't feel too strongly about this, just throwing it out there 😄

@lox
Contributor Author

lox commented May 5, 2019

Yeah, it's a reasonably new idea for me too. Basically we see lots of pipelines in the form of [upload] -> [long docker build] -> [tests with 40x parallelism]. My hope here was for pre-emptive scaling so that by the time the docker build is done there are some agents to run the tests.

It's a good point about some agents freeing up by the time waiting jobs become scheduled. I like the idea of max(running, waiting). 👌🏻

Really appreciate your feedback and insights on the last few PRs @etaoins.

@etaoins
Contributor

etaoins commented May 5, 2019

No worries, it's good to work with a company that takes open source seriously and I have a vested interest in my builds going faster 😉

Another pattern I see a lot at SEEK is:

steps:
  - label: 'Build & Test'
    queue: prod

  - wait

  - label: 'Deploy Dev'
    queue: dev

  - wait

  - label: 'Deploy Prod'
    queue: prod

Because most of our builds (e.g. branch builds) happen in the prod queue, it's usually already scaled up. I'm assuming including waiters would allow the dev queue to scale up preemptively for the deploy? That could be a big win for us.
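As a rough illustration of that scenario, the sketch below assumes metrics are reported per queue and that each queue is scaled independently. `QueueMetrics` and `desiredAgents` are hypothetical names, and the formula reuses the max(running, waiting) idea from earlier in this thread rather than anything confirmed to be in the PR.

```go
package main

// QueueMetrics holds hypothetical per-queue job counts. Reporting metrics
// per queue is an assumption made for this sketch.
type QueueMetrics struct {
	ScheduledJobs int64
	RunningJobs   int64
	WaitingJobs   int64
}

// desiredAgents returns the target agent count for a single queue.
// Without waiting jobs it is scheduled plus running. With them, waiting
// jobs only raise the target when they outnumber the running jobs.
func desiredAgents(m QueueMetrics, includeWaiting bool) int64 {
	if !includeWaiting {
		return m.ScheduledJobs + m.RunningJobs
	}
	busy := m.RunningJobs
	if m.WaitingJobs > busy {
		busy = m.WaitingJobs
	}
	return m.ScheduledJobs + busy
}
```

While 'Build & Test' runs, an idle dev queue with one waiting 'Deploy Dev' job would get a target of 1 instead of 0, whereas a prod queue with 5 running jobs and one waiting 'Deploy Prod' job would stay at 5.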

@lox
Contributor Author

lox commented May 5, 2019

> I'm assuming including waiters would allow the dev queue to scale up preemptively for the deploy? That could be a big win for us.

Yup!

@lox lox merged commit 89e127c into master May 5, 2019
@lox lox deleted the add-scaling-to-include-waiting-jobs branch May 5, 2019 10:48