Staging - [Alerting] Azure quota usage for west us #1432

Closed
dotnet-eng-status-staging bot opened this issue Nov 16, 2023 · 4 comments

@dotnet-eng-status-staging

💔 Metric state changed to alerting

An Azure Resource Quota is nearing its limit in region westus!

  • percent of limit {resource=standardDv4Family} 97

Go to rule

@dotnet/dnceng, please investigate

Automation information below, do not change

Grafana-Automated-Alert-Id-e2be2ec3e22e46d28730bab54ff8fa77

@ilyas1974 ilyas1974 self-assigned this Nov 20, 2023
@garath garath assigned garath and unassigned ilyas1974 Nov 20, 2023
@garath
Member

garath commented Nov 20, 2023

I'm going to investigate why dv4 has been so high lately.

@ilyas1974
Contributor

I was finally able to get onto the Virtual Machine Scale Sets for HelixStaging, and it looks like the reason for this is that every one of our Android queues has 1 machine provisioned in it. Because of the sheer quantity of queues, we are "killed with numbers". Will work with Stu on getting this resolved and figuring out how we got here.

@garath
Member

garath commented Nov 21, 2023

I agree with what Ilya found.

TLDR, I believe we need to fix #1415 and remove the "unmonitored" state of the queues.

The highlights are...

  • Some helix-machines commit caused all Android queues to be re-deployed.
  • Deployed queues are, by default, set to an instance count of 1
  • There are a lot of android queues: API versions 21-32, inclusive, each with 4 queue purpose variants, and then each of those sets running on both ubuntu 1804 and 2204 => 176 total cores for just having one machine in each queue.
  • These queues are also set to "unmonitored" from Fix linux-ubuntu-android-emulator validation issues #1415
  • Autoscale would normally scale-down the queue after some time unused.
  • Autoscale is ignoring these queues because they are unmonitored, thus they never scale down to zero.
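The interaction between the last two points can be sketched as a toy model. This is only an illustration, not the actual Helix autoscaler: the `Queue` class, the `monitored` and `idle` flags, and the queue names are all hypothetical stand-ins for the behavior described above.

```python
from dataclasses import dataclass


@dataclass
class Queue:
    name: str
    instances: int = 1      # freshly deployed queues start with one instance
    monitored: bool = True  # hypothetical flag for the "unmonitored" state
    idle: bool = True       # no work has been sent to the queue recently


def autoscale_pass(queues):
    """Scale idle queues down to zero, skipping queues that are not monitored."""
    for queue in queues:
        if queue.monitored and queue.idle:
            queue.instances = 0
    return queues


def provisioned_instances(queues):
    """Count the machines still held across all queues."""
    return sum(queue.instances for queue in queues)


# Hypothetical queue names, for illustration only.
queues = [
    Queue("example.monitored.queue"),
    Queue("example.android.queue.a", monitored=False),
    Queue("example.android.queue.b", monitored=False),
]
autoscale_pass(queues)
```

In this model the monitored queue drops to zero while both unmonitored queues keep their single instance indefinitely, which is how many small idle queues can add up to a large share of the quota.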

So we just got ourselves into a bit of a spot.

I've scaled down all queues with "android" in the name to zero instances. This will give us back the headroom to let PRs run. This might break a PR or scheduled build that tries to test on them (because there is nothing to cause them to scale up again). If this happens, anyone in dnceng can simply scale that particular queue up again. (Make a note here so we have a paper trail.)
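The thread does not say how the scale-down was performed. As one possible approach, the sketch below builds (but does not run) an Azure CLI `az vmss scale` command for every scale set whose name contains "android". The resource group and scale set names are hypothetical placeholders, and it assumes each queue maps to one scale set.

```python
import shlex


def scale_commands(scale_set_names, resource_group, capacity=0, needle="android"):
    """Return one `az vmss scale` command per scale set whose name contains `needle`."""
    return [
        shlex.join([
            "az", "vmss", "scale",
            "--resource-group", resource_group,
            "--name", name,
            "--new-capacity", str(capacity),
        ])
        for name in scale_set_names
        if needle in name.lower()
    ]


# Hypothetical names, for illustration only.
names = ["example-android-queue-a", "example-windows-queue"]
commands = scale_commands(names, "example-resource-group")
for command in commands:
    print(command)
```

Printing the commands first gives a chance to review the list before anything is changed. Passing `capacity=1` with a narrower `needle` produces the matching scale-up command for a single queue if a build turns out to need it.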

@garath
Member

garath commented Nov 21, 2023

Closing this issue as the quota problem is fixed by the mass manual scale-down.

@garath garath closed this as completed Nov 21, 2023