Use "pipeline-caching" for coreclr and libraries builds #62464
Conversation
I couldn't figure out the best area label to add to this PR. If you have write-permissions please help me learn by adding exactly one area label.
@dotnet/jit-contrib
@jeffschwMSFT , @Chrisboh , @markwilkie , @jkoritzinsky , @safern , @hoyosjs , @trylek , @agocke , @jkotas
'Directory.Solution.props',
# 'eng'
],
'coreclr': [
This looks fragile. The coreclr build depends on code under libraries (in particular, libraries/Common and src/coreclr/System.Private.CoreLib).
Yes. As called out in "Lookouts" above, this piece is intentionally left open-ended, and I am expecting reviewers to point out the right set of paths I should have under each category.
I understand. Coming up with the 100% right set of paths here is close to impossible.
Got it. Can you (or anyone else) brainstorm other ideas to capture this effectively without tracking commit hashes? While it might take several iterations to get right, I don't want to give up on this feature, given the value we can get out of it.
Also, I guess it should be less risky and more straightforward to enable this for test builds at least?
provided the value we can get out of it.
FWIW: Some data from September 2021:
- The CI is triggered 2.36 times per PR on average
- We spend ~150 hours of build machine time and ~300 hours of test machine time per PR on average
I do like this change and I think we should look into some sort of caching scheme like this. Where we are now, this change is only going to reduce the total CI costs by a few percent (<5%).
it should be less risky and straightforward to enable this for test builds at least?
Yes, I think so.
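For perspective, the figures above imply the best-case saving is small relative to the total CI cost per PR. A quick back-of-the-envelope check (the 450-minute figure is the best-case saving from the PR description; all other numbers are the approximations quoted in this comment):

```python
# Sanity check of the "<5%" estimate using the figures quoted above.
build_hours_per_pr = 150   # ~ build machine time per PR
test_hours_per_pr = 300    # ~ test machine time per PR
total_minutes_per_pr = (build_hours_per_pr + test_hours_per_pr) * 60  # 27,000 min

best_case_saving_min = 450  # best-case saving for one additional push

saving_pct = 100 * best_case_saving_min / total_minutes_per_pr
print(f"Best-case saving: {saving_pct:.1f}% of total CI time per PR")
# prints: Best-case saving: 1.7% of total CI time per PR
```

This is consistent with the "<5%" estimate: even a full best-case cache hit on one push saves under 2% of the CI time spent on an average PR.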
Another challenge that I faced, and want to call out, is the following:
I tested this mechanism on internal pipelines, where I triggered the run manually on the branch where I was developing this feature. So essentially, there were no changes to other folders/files except the ones in my branch. With that model, the cache was hit more frequently because the git history was linear; only my changes were going into the PR branch.
When I ported it to the public pipeline, I realized that on every single run we do a MERGE commit that brings newer changes into the PR branch, and that invalidates the cache more often: whenever a run is triggered after a new push to the PR branch, newer changes can come in from the main branch, so the cache key often fails to match.
Would it be possible to use a regular incremental build for this? I.e., download the cached artifacts and run a regular build on top of them, rebuilding everything that was invalidated as necessary. I think it would be much more robust than trying to manually duplicate the build dependencies on the side.
This isn't fully correct. It's also brittle. Things it doesn't consider: the environment, relationships to other directories (installer + coreclr has different behavior for single-file vs non-single-file, or at least it did at some point), the eng directory, the build scripts, the arcade runtime.
Depending on the build system, isn't this also brittle (though less so than a hardcoded dependency)? I thought MSBuild used timestamps. Does downloading from the cache preserve such metadata, so that the newly checked-out source won't always be newer?
Right, we would have to make sure that the timestamps are set right for the incremental logic to kick in.
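One way to satisfy that would be to re-stamp the restored outputs after the source checkout, so timestamp-based incremental logic treats them as up to date. A minimal sketch; the helper name is illustrative and this is not part of the PR, and the approach is only safe when the cache key already guarantees the corresponding sources are unchanged:

```python
import os
import time


def mark_restored_outputs_current(root: str) -> None:
    """Stamp every file under `root` with the current time.

    Run this on restored build outputs *after* the source checkout, so
    timestamp-based incremental logic (such as MSBuild's) sees the
    outputs as newer than the just-checked-out sources. Only safe when
    the cache key already guarantees those sources are unchanged.
    """
    now = time.time()
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            os.utime(os.path.join(dirpath, name), (now, now))
```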
I believe pipeline caching would essentially be the same as artifacts, but pipeline caching is tied to a pipeline and based on a key, which simplifies things in a sense; if we wanted to download from a pipeline, we would need to get the latest successful build and get its info to know whether it applies based on the changed paths. However, that could also help local development if we wanted, i.e., we could add a switch to a script to download the latest coreclr or libraries build from the latest CI pipeline to boost dev productivity.
Yes, I believe it all comes down to having the right paths set up in
I am not sure if
I think evaluating this as an incremental build might be worthwhile from the perspective of:
Ninja/native builds already reuse this really well, as Ninja checks hashes AFAIK (@jkoritzinsky might be able to confirm). This might be a reason to also move other platforms to it (unless make already handles this well). This means we need to cache the obj directory; I am not sure whether that is unwieldy (GBs of cache per leg). The only questions I have left are:
This would force us to improve the general .NET build system, so in some sense it's also product work, but it has the potential of being disruptive.
But what about the timestamp? If the files in
Answers to those questions are in https://docs.microsoft.com/en-us/azure/devops/pipelines/release/caching?view=azure-devops#q-when-does-a-cache-expire.
I am not sure.
- task: Cache@2
  displayName: Cache Build $(ConfigurationName)
  inputs:
    key: '"Build" | $(CommitsOutputFile)'
this should probably also include at least:
- Build machine OS (e.g. we can build the WASM target on both Windows and Linux)
- An arbitrary cache version key, e.g. v1, so we can bump it to invalidate the cache (see https://docs.microsoft.com/en-us/azure/devops/pipelines/release/caching?view=azure-devops#q-can-i-clear-a-cache)
- Path to the source directory (not sure if this is already implicitly handled by get_commit_hash.py)
I do share Jan's concern that it's hard to capture 100% of all the paths and environment settings that can affect whether a cache can be used.
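For illustration, the suggested components could be joined into a single key string along these lines. This is a hypothetical sketch: in the PR the key is actually assembled in YAML for the Cache@2 task, and the function and parameter names here are made up:

```python
def build_cache_key(cache_version: str, host_os: str, arch: str,
                    config: str, commits_fingerprint: str) -> str:
    """Compose a pipeline-cache key from an explicit version segment
    (bump it to invalidate all caches), the build machine OS, the
    target arch/config, and a fingerprint of the tracked paths'
    last-commit hashes (e.g. the contents of the commits file)."""
    return " | ".join([cache_version, host_os, arch, config, commits_fingerprint])


key = build_cache_key("v1", "windows", "x64", "Release", "3fa9c1d")
# key == 'v1 | windows | x64 | Release | 3fa9c1d'
```

Bumping the leading `v1` segment changes every key at once, which is the documented way to force a cache miss across the pipeline.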
this should probably also include at least:
- Build machine OS (e.g. we can build the WASM target on both Windows and Linux)

That should be fairly easy to include. I was not aware of that for WASM.

- An arbitrary cache version key, e.g. v1, so we can bump it to invalidate the cache (see https://docs.microsoft.com/en-us/azure/devops/pipelines/release/caching?view=azure-devops#q-can-i-clear-a-cache)

Good point.

- Path to the source directory (not sure if this is already implicitly handled by get_commit_hash.py)

That is handled by get_commit_hash.py.

I do share Jan's concern that it's hard to capture 100% of all the paths and environment settings that can affect whether a cache can be used.
Sure. The way I understand the existing comments, there is consensus on having this just for test artifacts, and then experimenting with whether caching and restoring the obj folder, followed by an incremental build, would help. If it works, we can have a job that continuously builds and caches the obj folder for newer commits, so all the PR builds can take advantage of them, improving build time. As @safern pointed out, we can have a script to download the obj folder matching the commit you are at, to make inner-loop development faster. I tried experimenting with reusing the obj folder on my Windows machine by cloning the runtime repo in a fresh location and copying a pre-built obj folder from another location into it. So far, I see that the obj folder contains a lot of files that have hardcoded paths to the original repo location. I tried replacing the paths with the new locations, but there were still some places that referenced the old repo location, and hence when I did a build in the new repo location, it was still building for the old location (at least that's what I understood from the console output). I am not too familiar with how this works, or what needs to be updated in the obj folder for it to get reused. If someone can try this out and confirm the necessary steps, I am happy to take it forward from there.
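As a first debugging step for the hardcoded-path problem, something like the following could list which generated files under obj embed the old repo location. This is a hypothetical helper, not part of the PR:

```python
import os
from typing import Iterator


def files_with_hardcoded_path(obj_root: str, old_repo_path: str) -> Iterator[str]:
    """Yield files under `obj_root` whose bytes contain `old_repo_path`.

    Useful to see which generated files (e.g. .props files or caches)
    pin the original repo location and would defeat reuse elsewhere.
    """
    needle = old_repo_path.encode()
    for dirpath, _dirs, names in os.walk(obj_root):
        for name in names:
            path = os.path.join(dirpath, name)
            try:
                with open(path, "rb") as f:
                    if needle in f.read():
                        yield path
            except OSError:
                continue  # unreadable file; skip it
```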
@@ -81,27 +81,102 @@ jobs:
- ${{ if eq(parameters.useHelix, true) }}:
  - _additionalBuildArguments: /p:ArchiveTests=true

# Currently only support pipeline caching for PRs and manual triggered runs
- ${{ if or(in(variables['Build.Reason'], 'PullRequest'), in(variables['Build.Reason'], 'Manual')) }}:
I guess manually triggered runs were mostly for testing, or are we keeping that?
- ${{ parameters.variables }}

steps:

# Get the commit hashes of interesting folder for caching
Should we put all these common cache steps in a common template file and then configure it via parameters?
value: '$(Build.SourcesDirectory)/src/coreclr/scripts/get_commit_hash.py'

# Otherwise set all the cache-specific variables to 'false' which will ensure to build product.
- ${{ if and(notin(variables['Build.Reason'], 'PullRequest'), notin(variables['Build.Reason'], 'Manual')) }}:
Can we put all these variables that are shared across different templates in xplat-setup, which is used by all templates?
Tagging subscribers to this area: @dotnet/runtime-infrastructure
Stale PR |
The goal of this PR is to enable pipeline-caching for the build scenarios of coreclr and libraries. In the future, we would like to extend it to scenarios like building tests, Mono, and, to some extent, test execution.
Overview
The idea behind pipeline caching is simple: after the artifacts of a particular job are produced, store them in an Azure cache against a particular "cache key". If a subsequent run happens in the pipeline (related or unrelated to the original run), and the "cache key" matches, the job will simply download the cached contents and skip producing the artifacts entirely, saving us some machine time. In this PR, the artifact creation I am optimizing is the creation of coreclr and libraries binaries. During the coreclr/libraries build step, we check whether we have already built the binaries; if yes, we simply download them from Azure storage and upload them as pipeline artifacts. All the steps related to installing prerequisite tools, product build, zipping, etc. are skipped. This is possible using the Cache@2 task of AzDo.
Scenarios
In this PR, support for pipeline-caching is added to all the pipelines (internal and public) that consume the coreclr/templates/build-job.yml and libraries/build-job.yml files, when triggered Manually or through a PullRequest. To take an example, suppose a developer submits a PullRequest that just touches the libraries folder. During the first execution of the runtime pipeline on the PR branch, build-job.yml builds the required coreclr/libraries binaries. If the developer pushes further changes, and between the initial and subsequent changes there were no additional changes in the coreclr folder on the main branch (see more details below), caching will restore the previously cached CoreCLR binaries without needing to build them again. It will also upload the required artifacts, just as happens today. Since there are at least 10-30 different configurations of coreclr builds, if the cache is "hit", we could save approximately 30 * 15 mins = 450 minutes of machine execution time on the subsequent run once this feature is enabled.

As of today, the cache key is only matched per PR branch. In other words, suppose there is a PR #AAA that touched the libraries folder and we built and cached the artifacts produced from it. Next, another PR #BBB is sent out that also just touches the libraries folder, and between AAA and BBB there were no coreclr commits in main. PR #BBB will still not be able to use the cached contents. I have opened microsoft/azure-pipelines-tasks#15589 to check why; the documentation clearly says that such scenarios are allowed, and the cache can be used across branches within the same pipeline. However, if there were further updates to PR AAA, the cached contents can be used.

To summarize, in the best case, for a single additional push to a PR branch, pipeline caching can save up to 450 minutes of machine time. In the worst case, when there are too many frequent changes in coreclr and libraries, the cache key will never match and we will end up building the product the way we do today, except that we will continue to cache the artifacts in the hope that some run will use them. The step that caches the contents takes approximately 20 seconds and hence should not regress the current workflow.

Cache Key
Cache@2 needs a cache key that is unique for a particular run. AzDo creates a fingerprint of the cache key and uses it to search for the cached contents. One portion of the cache key contains build-specific information like OS name, architecture, build type (Debug, Checked or Release), framework name, etc. However, we also need a unique identifier to tell whether there were any changes in folders/files that would require a build. The way I approached this is by checking the git hashes of the paths we are interested in. get_commit_hash is the python script that fetches the git hashes of the paths of interest for a particular build. For example, when building coreclr, I check the git commit hashes of the src/coreclr and src/native folders. All the git commit hashes, along with the build configuration (OS, architecture, etc.), are saved in a commit.txt file. I then use the contents of this file as the cache key that AzDo uses to check whether it is a "cache hit" or a "cache miss".

Before we run the pipeline, we always merge the latest main into the PR branch. In such cases, if files inside coreclr are touched, we will always find a unique merge commit hash for the coreclr folder and hence mismatch the previous cache key. This guarantees that we build coreclr if there are any changes in that folder. On the other hand, we only fetch 40 commits of history during checkout. I added code to fetch a little more history so we can correctly see the last commit that changed a folder/file. But sometimes that might not be enough, and the files/folders might have changed prior to the history we have on the AzDo machine. In such cases, I make an entry of grafted next to the file path. Here are sample logs from the get_commit_hash file.

Design
As mentioned previously, the UsePipelineCache variable is set to true only if the run is triggered from a PullRequest or Manually. It is used to determine whether Cache@2 needs to be executed. There are other variables, like BUILD_CACHE_RESTORED, BUILD_LOGS_RESTORED, etc., that are set if the cache is restored. The values of these variables are then used to determine whether we should perform a particular step or skip it.

There are a few templates where the check for BUILD_CACHE_RESTORED was needed and there is no reliable way to add it in a condition; hence, I had to add the check explicitly in those yml files.

Lookouts
Reviewers, please verify that the paths I have listed in get_commit_hash are correct and whether more paths need to be added in the given categories.
Next
I have a Kusto query to understand whether we are getting a "cache hit" or not. So in the future, if we see that we almost never hit the cache, we can look back and see how to improve things. Overall, if this is successful, it will also prove beneficial to cache the coreclr/libraries test builds, because they change less frequently and hence will almost certainly be restored from the cache.
Risk
It could happen that, due to a bug, we restore a cache that contains wrong or incorrect build artifacts. To work around this, a developer can always make a change to one of the paths listed under the common category, and in such cases the run is guaranteed to see a cache miss. I have also opened microsoft/azure-pipelines-tasks#15583 to track some other issues in this feature.

Contributes to #59598
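The per-path hashing described in the Cache Key section could be sketched roughly as follows. This is an illustrative reconstruction, not the actual get_commit_hash.py; the function names and the commit-file layout are assumptions:

```python
import subprocess


def last_commit_for_path(repo_root: str, path: str) -> str:
    """Return the hash of the last commit that touched `path`.

    With shallow history, git may not be able to find that commit; the
    PR's script records a 'grafted' marker in that situation, which is
    approximated here by falling back when git reports no commit.
    """
    out = subprocess.run(
        ["git", "-C", repo_root, "log", "-1", "--format=%H", "--", path],
        capture_output=True, text=True, check=True)
    return out.stdout.strip() or "grafted"


def make_commits_file_contents(build_config: str, path_hashes: dict) -> str:
    """Render the build configuration plus per-path commit hashes into
    the text used as the cache-key fingerprint (the commit.txt file)."""
    lines = [build_config]
    lines += [f"{path}: {commit}" for path, commit in sorted(path_hashes.items())]
    return "\n".join(lines)
```

Cache@2 then fingerprints this text; any commit that touches one of the tracked paths changes a hash and therefore changes the key, forcing a rebuild.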