Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Move all of our pipelines to use the global-build-job to build CoreCLR, Mono, and Libraries assets #99179

Merged
merged 84 commits into from
Mar 15, 2024

Conversation

jkoritzinsky
Copy link
Member

@jkoritzinsky jkoritzinsky commented Mar 1, 2024

When we first consolidated dotnet/coreclr, dotnet/corefx, dotnet/core-setup, and mono/mono in .NET 5, we brought in separate product build, test build, and test run pipeline jobs, as before consolidation we were seeing increased build throughput and reliability by having separate jobs, ensuring we built each asset exactly once, and a complex matrix of how they interacted.

Over time, we've removed a few of these jobs and allowed double-building to reduce dependencies:

  • Libraries removed their separate "build test" job.

  • The CoreCLR/Mono "build test" job now builds the assets it needs directly instead of depending on other jobs.

  • The Libraries job now builds System.Private.CoreLib in-job (and now can even just build the ref-assembly for CoreLib)
     
    Additionally, we've moved many jobs to build the product (and sometimes test) in one job using the "Global build" template:

  • All Mono devices jobs use the "global build" to build the product (and build runtime tests if needed)

  • All NativeAOT jobs use the "global build" template

  • Nearly all community-supported legs build using the "global build" template.

  • A few outerloop pipelines use the "global build" template for simplicity.

Today, our pipelines are split between using the "Global build" template that directly invokes the root build script to build the product in one command and the separate "build job" templates for building specific subsets (CoreCLR, Libraries, Mono, Installer) separately.

This has added significant complexity in many of our YAML scripts to support both cases and confusion about when to use one scenario or the other. For example, some jobs in the "runtime" pipeline use the global build job template to build the product and run tests whereas others don't.

Recently, we moved the official build to use the "global build" template for every job as a) we needed to move in this direction for the VMR and b) we believed that due to job queue time and upload/download time, we'd save on build execution time.

We saw decent improvements, as well as a significant reliability boost by merging these jobs together in the official build.

This PR takes us the next step: Merge all CoreCLR, Mono, and Libraries product build steps (and installer build steps when we aren't running tests) into single jobs instead of being split into 2 or 3 jobs (as all of the Mono runs also required some components of the corresponding CoreCLR build).

In addition, to avoid ending up in a split situation where many of our outerloop legs are still on the older system while PR and official builds are all on the new system, I have moved every pipeline to the "global build" model, not just the PR pipeline.

Job Dependency Graph Changes

CoreCLR + Libraries + Installer (w/o Installer Tests)

flowchart TD

subgraph Old Model
direction LR
CLR[CoreCLR Build Job]
Libs[Libraries Build Job]
Host[Installer Build Job]
CLR --> Host
Libs --> Host
BRT[Build Runtime Tests] --> RT
CLR --> RT[Runtime Tests]
Libs --> RT
CLR --> LT[Libraries Tests]
Libs --> LT
end
subgraph New Model
Global[Global build job: CLR + Libs + Host + Packs]
GBRT[Build Runtime Tests]
GRT[Runtime Tests]
GLT[Libraries Tests]
Global --> GRT
Global --> GLT
GBRT --> GRT
end
Loading

CoreCLR + Libraries + Installer w/ Tests

flowchart TD

subgraph Old Model
direction LR
CLR[CoreCLR Build Job]
Libs[Libraries Build Job]
Host[Installer Build and Test]
CLR --> Host
Libs --> Host
BRT[Build Runtime Tests] --> RT
CLR --> RT[Runtime Tests]
Libs --> RT
CLR --> LT[Libraries Tests]
Libs --> LT
end
subgraph New Model
Global[Global build job: CLR + Libs]
GI[Installer Build and Test]
GBRT[Build Runtime Tests]
GRT[Runtime Tests]
GLT[Libraries Tests]
Global --> GRT
Global --> GLT
GBRT --> GRT
Global --> GI
end
Loading

Mono Runtime Tests

flowchart TD

subgraph Old Model
direction LR
CLR[CoreCLR Build Job, all of CoreCLR]
Libs[Libraries Build Job]
Mono[Mono Build Job]
BRT[Build Runtime Tests] --> RT
CLR --> RT[Runtime Tests]
Libs --> RT
Mono --> RT
end
subgraph New Model
Global[Global build job: Mono + CoreRun + ilasm/dasm + Libs]
GBRT[Build Runtime Tests]
GRT[Runtime Tests]
Global --> GRT
GBRT --> GRT
end
Loading

YAML Template Flow Changes

In addition to the job simplifications above, there's also some YAML template simplification as well. In addition to removing the build-coreclr-and-libraries-job template, I consolidated the general flow of the JIT outerloop pipelines for CoreCLR tests into a single pipeline template, simplifying defining them.

Additionally, I moved all outerloop pipelines that run libraries tests to use the global build job with the "Send to Helix" extra step instead of shuffling files between jobs.

Q&A

  • Why do we still have separate jobs per vertical in the PR pipeline?
    • We don't want to wait to run the runtime tests until the libraries tests pass or vice versa. Additionally, building the runtime tests still takes a while (~30 minutes), so we don't want to make that a requirement before any tests start to run until we've reduced that time.
  • Why are installer tests still in a separate job?
    • Installer tests are very quick, but we don't want to block libraries and runtime tests from running if installer tests fail?
  • Why is installer build still in the installer test job?
    • It's easier given current infrastructure to keep the "Installer Build and Test" job as-is rather than moving the building steps out entirely.

For validation, I'm happy to run any requested pipelines on this PR. I obviously can't run them all, but I'm happy to run a good spread to validate that I didn't break anything (I've been running various pipelines throughout working on this to validate the experience as I went).

…t our GCC leg uses the global build pipeline and our CI images automatically select the correct compiler)
…e and remove parameters that were only used in these scenarios
…(one product-build/test-build/test-run set of jobs) runtime test pipelines to use the global build approach.
…job + extra steps instead of using separate jobs.
…t. Unify all python usages to use the same python variable template (and all use venv as well)
@jkoritzinsky
Copy link
Member Author

@DrewScoggins the failures in the perf legs looks like restore failures in the benchmarks, not infrastructure issues. Can you take a look?

@BruceForstall
Copy link
Member

I presume you are going to revert your change to jit.h once your testing is done? (It looks like that's causing a formatting job failure, also)

Copy link
Member

@BruceForstall BruceForstall left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's basically impossible to review something this massive by sight, but I scanned through various changes, so LGTM.

@jkoritzinsky
Copy link
Member Author

I presume you are going to revert your change to jit.h once your testing is done? (It looks like that's causing a formatting job failure, also)

Yep! Do you want me to keep the JIT-EE GUID change or should I revert that one before I merge this in?

@BruceForstall
Copy link
Member

Do you want me to keep the JIT-EE GUID change or should I revert that one before I merge this in?

You should revert it. It was just to avoid trashing our collections when you were testing.

@jkoritzinsky
Copy link
Member Author

/ba-g Timeouts in interpreter tests are known

Copy link
Member

@agocke agocke left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM I think. There might be more work to optimize things, but I think that can come later.

@jkoritzinsky jkoritzinsky merged commit 663839e into dotnet:main Mar 15, 2024
161 of 164 checks passed
@jkoritzinsky jkoritzinsky deleted the pr-single-job branch March 15, 2024 23:51
@akoeplinger
Copy link
Member

@jkoritzinsky thank you so much for doing this work! 🎉

@kg
Copy link
Member

kg commented Mar 16, 2024

The most recent commit in #99841 didn't run any lanes, is that potentially related to these yml changes?

@akoeplinger
Copy link
Member

@kg it looks like it's running on AzDO but didn't post back to GitHub: https://dev.azure.com/dnceng-public/public/_build/results?buildId=605726&view=results

@BruceForstall
Copy link
Member

@jkoritzinsky Looks like libraries-jitstress is broken:

https://dev.azure.com/dnceng-public/public/_build/results?buildId=605612&view=results

 Done building Helix work items for scenario no_tiered_compilation. Work item count: 0
D:\a\_work\1\s\src\libraries\sendtohelixhelp.proj(335,5): error : No helix work items, or APKs, or AppBundles found to test
##[error]src\libraries\sendtohelixhelp.proj(335,5): error : No helix work items, or APKs, or AppBundles found to test

@jkoritzinsky
Copy link
Member Author

@BruceForstall fix for that family of pipelines at #99868

@BruceForstall
Copy link
Member

Looks like runtime-coreclr outerloop is broken:

https://dev.azure.com/dnceng-public/public/_build?definitionId=108&_a=summary

The 'stages' parameter is not a valid StageList

/eng/pipelines/common/xplat-setup.yml: Could not find /eng/pipelines/common/templates/global-build-job.yml in repository self hosted on https://github.com/ using commit 0935105e91450a1bad02b5b2f83be52bea2bcf59. GitHub reported the error, "Not Found"
 ```

@jkoritzinsky
Copy link
Member Author

I've fixed that one as well in #99868 as I had other fixes I had to do there already.

akoeplinger added a commit to akoeplinger/runtime that referenced this pull request Mar 18, 2024
This seems to be a side-effect of dotnet#99179

Arcade checks whether `disableComponentGovernance` is the empty string to apply the default skipping logic: https://github.com/dotnet/arcade/blob/ace00d8719b8d1fdfd0cc05f71bb9af216338d27/eng/common/templates/job/job.yml#L168-L174

Changed our templates to make sure we pass that correctly.
akoeplinger added a commit that referenced this pull request Mar 18, 2024
This seems to be a side-effect of #99179

Arcade checks whether `disableComponentGovernance` is the empty string to apply the default skipping logic: https://github.com/dotnet/arcade/blob/ace00d8719b8d1fdfd0cc05f71bb9af216338d27/eng/common/templates/job/job.yml#L168-L174

Changed our templates to make sure we pass that correctly.
@BruceForstall
Copy link
Member

@jkoritzinsky The runtime-coreclr superpmi-asmdiffs-checked-release pipeline is broken:

https://dev.azure.com/dnceng-public/public/_build/results?buildId=607047&view=logs&j=83516c17-6666-5250-abde-63983ce72a49&t=c10d5f44-55ce-55d7-7975-407ed75d9a96

D:\a\_work\1\s\venv\Scripts\python.exe D:\a\_work\1\s/src/coreclr/scripts/superpmi_asmdiffs_checked_release_setup.py -source_directory D:\a\_work\1\s -checked_directory D:\a\_work\1\s/artifacts/bin/coreclr/windows.x64.Checked -release_directory D:\a\_work\1\s/artifacts/bin/coreclr/windows.x64.Release -arch x64
========================== Starting Command Output ===========================
"C:\Windows\system32\cmd.exe" /D /E:ON /V:OFF /S /C "CALL "D:\a\_work\_temp\cf222c71-0741-4f77-b00c-67fcd11c9759.cmd""
checked_directory doesn't exist

This is a pipeline that depends on both the release and checked jitrollingbuild. I remember there was some special YML to do that; maybe it got lost?

steveisok added a commit to steveisok/runtime that referenced this pull request Mar 18, 2024
dotnet#99179 made some changes and switched the template the workloads build was using.
In the process, workloadArtifactsPath and workloadPackagesPath got dropped, which meant the workloads build did not
pick up any manifests to process. As a result, the build failed.
steveisok added a commit that referenced this pull request Mar 18, 2024
#99179 made some changes and switched the template the workloads build was using.
In the process, workloadArtifactsPath and workloadPackagesPath got dropped, which meant the workloads build did not
pick up any manifests to process. As a result, the build failed.
@github-actions github-actions bot locked and limited conversation to collaborators Apr 18, 2024
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants