ScalingCalculator should use live agents rather than configured AgentsPerInstance #84

Open
HongboDu-at opened this issue Apr 6, 2023 · 3 comments

Comments

@HongboDu-at

Situation

There is one host, which has 10 agents.
When we gracefully terminate the host, idle agents self-terminate immediately, but busy agents finish their current job first and then self-terminate. This process can take as long as the running jobs need to finish.
ScalingCalculator behaves incorrectly in this situation.

agentsPerInstance: params.AgentsPerInstance,

Current behavior

ScalingCalculator thinks there are 10 agents available. But in reality, only the busy agents are still alive, so no new jobs will be picked up by an agent.

Expected behavior

ScalingCalculator should use live agents rather than configured AgentsPerInstance
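
To make the difference concrete, here is a minimal sketch of the two ways of sizing the fleet. It assumes the calculator currently divides the job count by the configured agents per instance; that formula, the function names, and the `idleLiveAgents` input are illustrative assumptions, not the scaler's actual code. It only covers the scale-out decision.

```go
package main

import "fmt"

// ceilDiv returns ceil(a / b) for b > 0, treating non-positive a as 0.
func ceilDiv(a, b int64) int64 {
	if a <= 0 {
		return 0
	}
	return (a + b - 1) / b
}

// desiredFromConfig models the assumed current behaviour: every instance is
// counted as contributing the full configured number of agents.
func desiredFromConfig(scheduledJobs, runningJobs, agentsPerInstance int64) int64 {
	return ceilDiv(scheduledJobs+runningJobs, agentsPerInstance)
}

// desiredFromLiveAgents is a hypothetical alternative: scheduled jobs are
// matched against agents that are alive and idle, and only the jobs left
// unserved trigger extra instances.
func desiredFromLiveAgents(currentInstances, scheduledJobs, idleLiveAgents, agentsPerInstance int64) int64 {
	unserved := scheduledJobs - idleLiveAgents
	if unserved <= 0 {
		return currentInstances
	}
	return currentInstances + ceilDiv(unserved, agentsPerInstance)
}

func main() {
	// One draining host configured for 10 agents: 3 are still busy,
	// the 7 idle ones have already exited, and 5 new jobs are scheduled.
	fmt.Println("configured:", desiredFromConfig(5, 3, 10))        // configured: 1
	fmt.Println("live-aware:", desiredFromLiveAgents(1, 5, 0, 10)) // live-aware: 2
}
```

With the configured value, 8 jobs appear to fit on one 10-agent instance, so nothing is launched. Counting only live agents shows that the 5 scheduled jobs have nowhere to run.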

@triarius
Contributor

triarius commented Apr 19, 2023

Hi @HongboDu-at, thanks for using the buildkite-agent-scaler.

This is an issue that quite a few customers are experiencing. Unfortunately, the metrics endpoint in the API does not expose information about which agents are on which hosts. It may be possible to reconstruct this information by equipping the scaler with a GraphQL token and extracting it from the GraphQL API.

We don't have any work planned for this at the moment, but we would welcome any PRs. I suggest making the GraphQL token (and therefore this functionality) optional, as it is not needed by an Elastic Stack at the moment and has a very broad scope.
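
One way the optional behaviour could be structured is sketched below. Everything here is hypothetical: the `LiveAgentCounter` interface, the `fixedCounter` stub, and `idleCapacity` are illustrative names and do not exist in the scaler. A real implementation would query the GraphQL API when a token is supplied; the stub only stands in for that.

```go
package main

import "fmt"

// LiveAgentCounter is a hypothetical abstraction over "how many agents are
// alive and idle right now". A GraphQL-backed implementation would only be
// constructed when a token is configured.
type LiveAgentCounter interface {
	IdleLiveAgents() (int64, error)
}

// fixedCounter is a stub used for illustration in place of a real API client.
type fixedCounter int64

func (f fixedCounter) IdleLiveAgents() (int64, error) { return int64(f), nil }

// idleCapacity prefers the live count and falls back to the configured
// estimate when no counter is available or the lookup fails.
func idleCapacity(counter LiveAgentCounter, instances, runningJobs, agentsPerInstance int64) int64 {
	if counter != nil {
		if idle, err := counter.IdleLiveAgents(); err == nil {
			return idle
		}
	}
	estimate := instances*agentsPerInstance - runningJobs
	if estimate < 0 {
		return 0
	}
	return estimate
}

func main() {
	// No token configured: behaves like today, assuming 10 agents per host.
	fmt.Println(idleCapacity(nil, 1, 3, 10)) // 7
	// Token configured: the draining host really has no idle agents left.
	fmt.Println(idleCapacity(fixedCounter(0), 1, 3, 10)) // 0
}
```

Keeping the counter behind an interface means stacks that do not supply a token keep the existing behaviour unchanged.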

@ilyakruchinin

ilyakruchinin commented Jun 5, 2024

We had to fork the repo and update the scaling calculator to achieve this.
We will look into submitting a PR, but we are not sure it would be accepted, because the change we made requires support from the Autoscaling Group: a Lifecycle Hook needs to be created so that newly launched EC2 instances stay in the "Pending" state, as seen by the Lambda, until the agents are ready. The Autoscaling Group lives in a different CloudFormation Stack, so the change would have to be made in both repos at once, with a "flag" to enable the feature.

Once we finish all the testing and implementation we'll consider making the PR.

@moskyb
Contributor

moskyb commented Jun 5, 2024

awesome! looking forward to seeing a PR, if one comes. very happy to consult as necessary.
