Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[ci] Change to cache build dependencies #3562

Merged
merged 12 commits into from
Apr 4, 2024

Conversation

antiagainst
Copy link
Collaborator

@antiagainst antiagainst commented Apr 4, 2024

This commit updates the CI to cache build dependencies (llvm,
nvidia toolchain, and pybind11) to speed up build process.

In order to do this, we need to expose the versions of nvidia
toolchain and pybind11 needed like llvm.

With this chage, we now see a much faster triton install:

Diffing https://github.com/openai/triton/actions/runs/8556465920/job/23446341337?pr=3565 and
https://github.com/openai/triton/actions/runs/8558161016/job/23452293260?pr=3562:

  • NVIDIA A100: ~2.5min -> ~1.5min
  • NVIDIA H100: ~2min -> ~1min
  • AMD gfx90a: ~3min -> ~1min

More importantly we can reduce the chance of seeing
connection reset due to frequently download build
dependencies from the original source.

This commit updates the CI to cache build dependencies (LLVM,
NVIDIA toolchain, and pybind11) to speed up build process.
@jlebar
Copy link
Collaborator

jlebar commented Apr 4, 2024

Triggered CI for you!

@antiagainst
Copy link
Collaborator Author

Thanks @jlebar! :D I'm expecting some push buttons on this one given needing to iterating on it a bit. But don't need to review until it's tested working.

@antiagainst antiagainst marked this pull request as ready for review April 4, 2024 18:26
@antiagainst antiagainst requested a review from ptillet as a code owner April 4, 2024 18:26
~/.triton/llvm
~/.triton/nvidia
~/.triton/pybind11
key: ${{ runner.os }}-${{ runner.arch }}-llvm-${{ steps.cache-key.outputs.llvm }}-nvidia-${{ steps.cache-key.outputs.nvidia }}-pybind11-${{ steps.cache-key.outputs.pybind11 }}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you think it's worth doing this as three cache steps rather than one, so that when the LLVM version changes we don't have to re-download the other two?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OTOH I'm fine if you want to leave it as-is because debugging this is such a pain, and you finally found something that worked. :)

Copy link
Collaborator Author

@antiagainst antiagainst Apr 4, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Examples around I see uses only one actions/cache step with mulitple paths. So I just followed that to be on the tested path. Don't know what would happen if we have multiple actions/cache steps. Maybe doable. I can give it a try later.

I think we likely would only change LLVM frequently; other parts likely would stay the same for long time. So I won't worry about cache invdalidation or similar due to frequent changes on other parts. Having everything packaged in one also helps compression/uploading/downloading a bit I think.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we likely would only change LLVM frequently; other parts likely would stay the same for long time.

Yeah, the problem is when we change LLVM we have to re-download the others from the Web (right?). And that exposes us to flakiness.

Anyway sgtm to leave it like this for now.

with:
submodules: 'true'

- name: Set ROCM ENV
- name: Compute build dependency cache keys
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't suppose there's a way to avoid this code duplication?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For now probably not. It ties to the overall structure between the AMD and NVIDIA integration tests. I'd think we need to figure out how to unify them in general if that's the path we want to be on. But in general, they are based on different approaches (base machine vs docker) so even seems duplication, it should be fine. cause we may need to take different steps down the road.

Copy link
Collaborator

@jlebar jlebar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is awesome -- thank you so much!

Copy link
Collaborator

@ThomasRaoux ThomasRaoux left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

@jlebar jlebar merged commit 38e45bb into triton-lang:main Apr 4, 2024
5 checks passed
@antiagainst antiagainst deleted the cache-build-deps branch April 4, 2024 20:00
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants