Fix linux-ubuntu-android-emulator validation issues #1415

Open
1 of 3 tasks
dougbu opened this issue Nov 13, 2023 · 6 comments
Labels: dotnet-helix-machines; Ops - Service Maintenance (used to track issues related to maintaining the services .NET Eng supports); Ops - Spike (work items to be included in our Ops Spike)

Comments

@dougbu
Member

dougbu commented Nov 13, 2023

Builds of main since #20231102.02 have failed consistently when validating the linux-ubuntu-android-emulator artefact on various ubuntu.??04.amd64.android.*.open queues. The problem is reported as "... no running emulators at /etc/osob/validate/linux-ubuntu-android-emulator ...".

One possibility: 3 minutes may be insufficient time these days for the emulator(s) to start up.
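To make the timeout hypothesis concrete, here is a minimal sketch of a boot wait with a configurable limit. It is not the logic in validate.sh, which is not shown in this thread. It assumes `adb` is on the PATH and that the emulator uses the default serial `emulator-5554`; the helper names `emulator_booted` and `wait_until` are hypothetical. The check relies on the standard `sys.boot_completed` system property, which reads `1` once Android has finished booting.

```python
import subprocess
import time
from typing import Callable


def emulator_booted(serial: str = "emulator-5554") -> bool:
    """Return True once the emulator reports that Android has finished booting.

    Assumptions (not from the thread): adb is on PATH and the emulator
    uses the default first serial, emulator-5554.
    """
    try:
        result = subprocess.run(
            ["adb", "-s", serial, "shell", "getprop", "sys.boot_completed"],
            capture_output=True,
            text=True,
            timeout=15,
        )
    except (OSError, subprocess.TimeoutExpired):
        # adb is missing or hung: treat this as "not booted yet".
        return False
    return result.returncode == 0 and result.stdout.strip() == "1"


def wait_until(
    check: Callable[[], bool],
    timeout_s: float,
    interval_s: float = 5.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll check() until it returns True or timeout_s seconds have elapsed.

    clock and sleep are injectable so the loop can be exercised without
    real waiting.
    """
    deadline = clock() + timeout_s
    while True:
        if check():
            return True
        if clock() >= deadline:
            return False
        sleep(interval_s)
```

With this shape, `wait_until(emulator_booted, timeout_s=180)` corresponds to the current three-minute limit, and trying a longer limit is a one-argument change.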

Release Note Category

  • Feature changes/additions
  • Bug fixes
  • Internal Infrastructure Improvements

Release Note Description

Corrected a problem preventing validation of some of our queues.

dougbu added the Ops - Service Maintenance and dotnet-helix-machines labels Nov 13, 2023
dougbu self-assigned this Nov 13, 2023
@riarenas
Member

We have temporarily unmonitored the android queues in https://dnceng.visualstudio.com/internal/_git/dotnet-helix-machines/pullrequest/35248, and they should be added back to the deployment list once we're confident we understand the failures/fixes.

dougbu changed the title from "helix-machines' main is red" to "Fix linux-ubuntu-android-emulator validation issues" Nov 17, 2023
dougbu removed their assignment Nov 17, 2023
@dougbu
Member Author

dougbu commented Nov 23, 2023

@premun any ideas for getting to the root cause of our recent problems w/ the emulators❔ I could imagine creating a VM for one of the images that was failing before we unmonitored the queues. Might have an issue there b/c our first-run commands only execute w/in a scale set; would have to do similar things manually and hope to hit the validation failure…

@premun
Member

premun commented Nov 23, 2023

@akoeplinger is helping us with this. We spoke about this briefly and it seems that as the first step, we would make the Helix SDK collect the emulator log in case a Helix work item doesn't find it booted.
Alexander might open a PR in Arcade adding this. I am at a conference and OOF tomorrow so I won't be around but he will tag you on the PR.

We could then take the same emulator log collection command (I don't know what it is myself) and put it in our validate.sh to collect the log when we see validation failures in the helix-machine pipeline. Hopefully it will have some clues as to what the actual root cause might be. Unfortunately, I can't offer more advice as this is not really my area.
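As a rough illustration of the "collect logs on validation failure" idea, the sketch below runs a set of diagnostic commands and saves each output to a file. The actual collection command is not known in this thread, so `adb devices -l` and `adb logcat -d` are assumptions chosen because they are standard adb commands. They are not necessarily what the Helix SDK would use, and they do not capture the emulator process's own log. The names `run_command`, `DIAGNOSTIC_COMMANDS`, and `collect_diagnostics` are hypothetical.

```python
import subprocess
from pathlib import Path
from typing import Callable, Dict, List


def run_command(argv: List[str]) -> str:
    """Run a command and return its combined stdout and stderr; never raise."""
    try:
        result = subprocess.run(argv, capture_output=True, text=True, timeout=60)
    except (OSError, subprocess.TimeoutExpired) as exc:
        return f"failed to run {argv!r}: {exc}\n"
    return result.stdout + result.stderr


# Assumed diagnostics, not the commands used by the Helix SDK or validate.sh.
DIAGNOSTIC_COMMANDS: Dict[str, List[str]] = {
    "adb-devices.txt": ["adb", "devices", "-l"],  # which devices adb can see
    "logcat.txt": ["adb", "logcat", "-d"],        # dump the device log and exit
}


def collect_diagnostics(
    out_dir: str,
    commands: Dict[str, List[str]] = DIAGNOSTIC_COMMANDS,
    runner: Callable[[List[str]], str] = run_command,
) -> List[Path]:
    """Write the output of each diagnostic command to a file in out_dir."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    written = []
    for file_name, argv in commands.items():
        path = out / file_name
        path.write_text(runner(argv))
        written.append(path)
    return written
```

Because `run_command` swallows failures and records them in the output file, the collection step cannot itself turn a validation failure into a different error.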

@akoeplinger
Member

I'm experimenting with https://dev.azure.com/dnceng/internal/_git/dotnet-helix-machines/pullrequest/35535 to see if bumping the timeout to 10 minutes and moving the waiting logic from validate.sh to the first-run script helps.
I'll update the PR to capture logs once I figure out how to connect to the staging VM so I can experiment with the scripts.

@ilyas1974
Contributor

Due to changing priorities, Alex is not able to work on this currently. Moving the issue to our backlog.

ilyas1974 added the Ops - P1 label (Operations task, priority 1, highest priority) Jul 24, 2024
ilyas1974 added the Ops - Spike label and removed the Ops - P1 label Jul 25, 2024
riarenas assigned riarenas and unassigned riarenas Aug 15, 2024
@riarenas
Member

Adding an additional 5-minute wait won't work. It causes timeouts during custom script extension execution while the machine is trying to start up.

5 participants