Ephemeral runners and cancelled jobs #1853

Closed
npalm opened this issue Mar 15, 2022 · 1 comment

Comments

@npalm
Member

npalm commented Mar 15, 2022

Description

For non-ephemeral runners, the status of the workflow job is checked, and scaling is done only for queued jobs. For ephemeral runners this check is not applied, because the assumption was that every job needs a runner.

We found out that once we started scaling to a couple of hundred runners, this idea did not work as expected. We got a large number of cancelled jobs, for example because of a job timeout, while their events were still on the queue. This is typically the case when we have reached the maximum number of runners. The lambdas will create all the runners, but the runners remain idle since the jobs are cancelled. This is not a problem with a few cancelled jobs, but a huge number of cancelled jobs can cause a large fleet of useless runners.

Solution

We have tested a modified scale-up lambda in which we applied the job check in the same way as for non-ephemeral runners. In our case this solved the problem. However, since there is no correlation between job and runner, this approach could mean that some events are not used for scaling in cases where they should lead to scaling. As a mitigation we keep a very small fleet of runners in the pool to pick up those missed events.
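
To illustrate the idea, here is a minimal sketch of such a check, not the actual lambda code. The names `ScaleUpEvent`, `selectEventsToScale` and `statusOf` are hypothetical. The status lookup is injected as a plain function so the example stays self-contained; in a real lambda it would presumably be an asynchronous call to the GitHub API. The sketch relies on the fact that a cancelled job is reported with status `completed` rather than `queued`.

```typescript
// Hypothetical event shape: only the job id matters for this sketch.
interface ScaleUpEvent {
  jobId: number;
}

// Assumed lookup that returns the current status of a workflow job,
// e.g. "queued", "in_progress" or "completed".
type StatusLookup = (jobId: number) => string;

// Keep only the events whose job is still queued. Events for jobs that
// are cancelled (status "completed") or already picked up are dropped,
// so no runner is created for them.
function selectEventsToScale(
  events: ScaleUpEvent[],
  statusOf: StatusLookup
): ScaleUpEvent[] {
  return events.filter((event) => statusOf(event.jobId) === "queued");
}

// Usage: three events on the queue, only job 1 is still waiting.
const exampleStatuses: Record<number, string> = {
  1: "queued",
  2: "completed", // e.g. cancelled after a timeout
  3: "in_progress",
};
const toScale = selectEventsToScale(
  [{ jobId: 1 }, { jobId: 2 }, { jobId: 3 }],
  (jobId) => exampleStatuses[jobId]
);
console.log(toScale.map((event) => event.jobId));
```

Because a dropped event is never replayed, a filter like this can skip an event that should have led to scaling, which is why the small pool mentioned above is still needed.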

@alexellis

Hi @npalm

I'm working on an adjacent solution using Firecracker and pools of agents, but solely with ephemeral runners to ensure complete isolation and a fresh environment for each run.

We're building similar solutions at a conceptual level, but with a very different technical approach. If you'd like to compare notes, feel free to send me an email. See a demo / find out more: https://github.com/self-actuated/actuated

My question for you is: if you were only using ephemeral runners and creating a new VM for each workflow job event, how would you handle a cancelled workflow run? Let's say that your run created 20 jobs, so 20 VMs were started.

If each job is allocated to a runner, starts executing, then the run is cancelled, then the runner exits and everything is cleaned up.

But the challenge is when that run and its 20 jobs are cancelled before being allocated to a runner. At that point we have 20 VMs running and no good way of knowing when to shut them down or reap them.

Alex
