add legate-boost conda packages #115

Closed
jameslamb opened this issue Jul 17, 2024 · 13 comments

@jameslamb
Member

jameslamb commented Jul 17, 2024

Description

For #101, we want to publish legate-boost conda packages to the legate channel (https://anaconda.org/legate/repo).

This captures the work to do that.

Benefits of this work

  • simplifies installation of legate-boost

Acceptance Criteria

  • conda packages are built and tested in CI here
  • conda package name should be legate-boost (with the -)
  • pushing a tag to the repo triggers publishing to the legate conda channel

Approach

Some patterns might be borrowed from RAPIDS libraries, including:

  • pyproject.toml instead of setup.py
  • scikit-build-core as a build backend instead of scikit-build
  • rapids-cmake to manage dependencies
  • rapids-dependency-file-generator to keep different lists of dependencies consistent

For an example of this, see how cuvs conda packages are built:

Notes

legate-core source, for reference: https://github.com/nv-legate/legate.core

@jameslamb jameslamb added the feature request New feature or request label Jul 17, 2024
This was referenced Jul 17, 2024
@jameslamb jameslamb self-assigned this Jul 17, 2024
@jameslamb
Member Author

After merging #176, I just did the following to check the code paths for stable releases.

Pushed a new release tag like this:

git checkout main
git pull upstream main
git tag -a v24.08.00 -m 'v24.08.00'
git push upstream 'v24.08.00'

That triggered this build: https://github.com/rapidsai/legate-boost/actions/runs/11616162673

The build succeeded, and it uploaded a v24.08.00 to the main label on the legate channel 🎉

Image

ref: https://anaconda.org/legate/legate-boost/files
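
The release tags pushed in this thread all look like vYY.MM.PP (for example v24.08.00). A minimal sketch of a guard that could run before tagging is below. The pattern is an assumption inferred from the tags in this issue, not a documented rule of the repo, and `is_release_tag` is a hypothetical helper name.

```shell
# Hypothetical helper: succeed only for tags shaped like v24.08.00.
# The vYY.MM.PP pattern is an assumption inferred from this thread.
is_release_tag() {
  printf '%s\n' "$1" | grep -Eq '^v[0-9]{2}\.[0-9]{2}\.[0-9]{2}$'
}

# Example guard before creating an annotated tag:
#   is_release_tag "v24.08.00" && git tag -a v24.08.00 -m 'v24.08.00'
```

A check like this would reject development-style names such as v24.06.01dev, which may or may not be what a given release process wants.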

Created an environment and ran the tests.

docker run \
  --rm \
  --gpus "2,3" \
  -v $(pwd):/opt/work \
  -w /opt/work \
  -it rapidsai/ci-conda \
  bash

# --override-channels just to be sure we're not cheating with channels
# configured globally in the env where I ran this
conda create \
  --name test-legate-boost \
  --yes \
  --override-channels \
  -c legate \
  -c legate/label/experimental \
  -c conda-forge \
    legate-boost=24.08 \
    python=3.11
Output of 'conda env export --name test-legate-boost':
name: test-legate-boost
channels:
  - legate
  - legate/label/experimental
  - rapidsai
  - rapidsai-nightly
  - dask/label/dev
  - pytorch
  - conda-forge
  - nvidia
dependencies:
  - _libgcc_mutex=0.1=conda_forge
  - _openmp_mutex=4.5=2_kmp_llvm
  - attr=2.5.1=h166bdaf_1
  - bzip2=1.0.8=h4bc722e_7
  - c-ares=1.34.2=heb4867d_0
  - ca-certificates=2024.8.30=hbcca054_0
  - cffi=1.17.1=py311hf29c0ef_0
  - cuda-cudart=12.6.77=h5888daf_0
  - cuda-cudart_linux-64=12.6.77=h3f2d84a_0
  - cuda-nvrtc=12.6.77=hbd13f7d_0
  - cuda-nvtx=12.6.77=hbd13f7d_0
  - cuda-version=12.6=h7480c83_3
  - cunumeric=24.09.00.dev109=cuda12_py311_gdf1344ba_109_gpu
  - cutensor=2.0.2.5=hbc370b7_0
  - elfutils=0.191=h924a536_0
  - gettext=0.22.5=he02047a_3
  - gettext-tools=0.22.5=he02047a_3
  - gnutls=3.8.7=h32866dd_0
  - joblib=1.4.2=pyhd8ed1ab_0
  - keyutils=1.6.1=h166bdaf_0
  - krb5=1.21.3=h659f571_0
  - ld_impl_linux-64=2.43=h712a8e2_2
  - legate=24.09.00.dev296=cuda12_py311_g9039cf6f_296_ucx_gpu
  - legate-boost=24.08.00=cuda12_py311_0_gpu
  - libarchive=3.7.4=hfca40fe_0
  - libasprintf=0.22.5=he8f35ee_3
  - libasprintf-devel=0.22.5=he8f35ee_3
  - libblas=3.9.0=25_linux64_openblas
  - libcap=2.69=h0f662aa_0
  - libcblas=3.9.0=25_linux64_openblas
  - libcublas=12.6.3.3=hbd13f7d_1
  - libcufft=11.3.0.4=hbd13f7d_0
  - libcurand=10.3.7.77=hbd13f7d_0
  - libcurl=8.10.1=hbbe4b11_0
  - libcusolver=11.7.1.2=hbd13f7d_0
  - libcusparse=12.5.4.2=hbd13f7d_0
  - libedit=3.1.20191231=he28a2e2_2
  - libev=4.33=hd590300_2
  - libexpat=2.6.3=h5888daf_0
  - libffi=3.4.2=h7f98852_5
  - libgcc=14.2.0=h77fa898_1
  - libgcc-ng=14.2.0=h69a702a_1
  - libgcrypt=1.11.0=h4ab18f5_1
  - libgettextpo=0.22.5=he02047a_3
  - libgettextpo-devel=0.22.5=he02047a_3
  - libgfortran=14.2.0=h69a702a_1
  - libgfortran-ng=14.2.0=h69a702a_1
  - libgfortran5=14.2.0=hd5240d6_1
  - libgpg-error=1.50=h4f305b6_0
  - libhwloc=2.11.2=default_he43201b_1000
  - libiconv=1.17=hd590300_2
  - libidn2=2.3.7=hd590300_0
  - liblapack=3.9.0=25_linux64_openblas
  - libmicrohttpd=1.0.1=hbc5bc17_1
  - libnghttp2=1.64.0=h161d5f1_0
  - libnl=3.10.0=h4bc722e_0
  - libnsl=2.0.1=hd590300_0
  - libnvjitlink=12.6.77=hbd13f7d_1
  - libopenblas=0.3.28=openmp_hd680484_0
  - libsqlite=3.47.0=hadc24fc_1
  - libssh2=1.11.0=h0841786_0
  - libstdcxx=14.2.0=hc0a3c3a_1
  - libstdcxx-ng=14.2.0=h4852527_1
  - libsystemd0=256.7=h2774228_1
  - libtasn1=4.19.0=h166bdaf_0
  - libudev1=256.7=hb9d3cd8_1
  - libunistring=0.9.10=h7f98852_0
  - libuuid=2.38.1=h0b41bf4_0
  - libxcrypt=4.4.36=hd590300_1
  - libxml2=2.13.4=h064dc61_2
  - libzlib=1.3.1=hb9d3cd8_2
  - llvm-openmp=19.1.3=h024ca30_0
  - lz4-c=1.9.4=hcb278e6_0
  - lzo=2.10=hd590300_1001
  - mpi=1.0=openmpi
  - nccl=2.23.4.1=h52f6c39_2
  - ncurses=6.5=he02047a_1
  - nettle=3.9.1=h7ab15ed_0
  - nomkl=1.0=h5ca1d4c_0
  - numpy=1.26.4=py311h64a7726_0
  - openblas=0.3.28=openmp_h44988d0_0
  - openmpi=4.1.6=hc5af2df_101
  - openssl=3.3.2=hb9d3cd8_0
  - opt_einsum=3.4.0=pyhd8ed1ab_0
  - p11-kit=0.24.1=hc5aa10d_0
  - pip=24.3.1=pyh8b19718_0
  - pycparser=2.22=pyhd8ed1ab_0
  - python=3.11.10=hc5c86c4_3_cpython
  - python_abi=3.11=5_cp311
  - rdma-core=54.0=h5888daf_1
  - readline=8.2=h8228510_1
  - scikit-learn=1.5.2=py311h57cc02b_1
  - scipy=1.14.1=py311he9a78e4_1
  - setuptools=75.3.0=pyhd8ed1ab_0
  - threadpoolctl=3.5.0=pyhc1e730c_0
  - tk=8.6.13=noxft_h4845f30_101
  - typing-extensions=4.12.2=hd8ed1ab_0
  - typing_extensions=4.12.2=pyha770c72_0
  - tzdata=2024b=hc8b5060_0
  - ucx=1.17.0=h05e919c_3
  - wheel=0.44.0=pyhd8ed1ab_0
  - xz=5.2.6=h166bdaf_0
  - zlib=1.3.1=hb9d3cd8_2
  - zstd=1.5.6=ha6fb4c9_0
prefix: /opt/conda/envs/test-legate-boost

(NOTICE: I'm on a system with CUDA, and it pulled in the _gpu builds over everything without me having to specify that!)
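
To make that check less manual, the variant can be read off the end of the build string. This is a minimal sketch that assumes the naming seen in this thread (build strings ending in `_gpu` or `_cpu`); `build_variant` is a hypothetical helper, not part of conda.

```shell
# Print the last underscore-separated field of a conda build string,
# e.g. "cuda12_py311_0_gpu" -> "gpu".
# Assumes the variant is always the final field, as in the builds listed above.
build_variant() {
  printf '%s\n' "${1##*_}"
}

build_variant "cuda12_py311_0_gpu"

# Against a live environment, the build string is the third column of `conda list`:
#   conda list --name test-legate-boost | awk '$1 == "legate-boost" { print $3 }'
```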

Then installed test dependencies and ran legate-boost's tests.

source activate test-legate-boost

conda install \
  -c conda-forge \
  --yes \
     'hypothesis>=6' \
     'matplotlib>=3.9' \
     'nbconvert>=7.16' \
     'notebook>=7' \
     'pytest>=7,<8' \
     'seaborn>=0.13' \
     'xgboost>=2.0'

./ci/run_pytests_gpu.sh

Those failed with fatal errors 😭

plugins: hypothesis-6.115.6, anyio-4.6.2.post1
collecting ... [0 - 7f1166dbb740]    0.000000 {5}{numa}: mems_allowed: ret=-1 errno=1 mask= count=64
[0 - 7f1166dbb740]    0.000000 {6}{gpu}: Failed to allocate GPU memory of size 29360128000
Fatal Python error: Aborted

Current thread 0x00007f1166dbb740 (most recent call first):
  File "/opt/conda/envs/test-legate-boost/lib/python3.1/site-packages/legate/core/__init__.py", line 109 in <module>

Will look into it.

@jameslamb
Member Author

> Those failed with fatal errors 😭

I missed the "Failed to allocate GPU memory" before! Looked at nvidia-smi and saw that there were other processes using the same GPUs as me (I ran this on a shared machine).

Re-ran with smaller memory requirements and saw the tests pass. This is working 😎
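
For reference, the failed allocation of 29360128000 bytes is exactly 28000 MiB. The sketch below shows that conversion, plus a hypothetical `fbmem_budget` helper that proposes a smaller budget from whatever memory is actually free. The 90% headroom is an arbitrary illustration; the thread does not say how the memory requirements were lowered for the re-run.

```shell
# Convert a byte count to whole MiB.
bytes_to_mib() {
  echo $(( $1 / 1048576 ))
}

# Hypothetical helper: propose a GPU memory budget (MiB) as 90% of free memory.
# The 90% headroom is an arbitrary choice for illustration.
fbmem_budget() {
  echo $(( $1 * 90 / 100 ))
}

# The size from the failed allocation in the log above.
bytes_to_mib 29360128000

# On a machine with GPUs, free memory per device (MiB) can be listed with:
#   nvidia-smi --query-gpu=index,memory.free --format=csv,noheader,nounits
```

Checking free memory first is mostly useful on shared machines like the one described here, where other processes may already hold part of each GPU.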

@jameslamb
Member Author

The docs deployment failed on that tag build: https://github.com/rapidsai/legate-boost/actions/runs/11616162673/job/32349185474

Like this:

Tag "v24.08.00" is not allowed to deploy to github-pages due to environment protection rules.
The deployment was rejected or didn't satisfy other protection rules.

I've asked ops to help take a look.

@jameslamb
Member Author

Alright, ops fixed the docs builds... the issue was just that deploy-github-pages-on-new-tags was not turned on in the repo settings.

I pushed another tag (v24.08.01) and saw docs published successfully!!

https://github.com/rapidsai/legate-boost/actions/runs/11619624304/job/32360219823

... but with the wrong version

Image

Put up #177 to fix that.

@jameslamb
Member Author

Alright, trying this again 😂

Merged #177 and saw that it successfully built packages and deployed docs: https://github.com/rapidsai/legate-boost/actions/runs/11632125999

I just pushed another tag, like this:

git checkout main
git pull upstream main
git tag -a v24.08.02 -m 'v24.08.02'
git push upstream 'v24.08.02'

That triggered this build: https://github.com/rapidsai/legate-boost/actions/runs/11632477453

Hopefully, we'll see all the CI jobs succeed and docs published with the correct version (24.08.02).

@jameslamb
Member Author

grrrr why did that not work.

The deployment says it succeeded: https://github.com/rapidsai/legate-boost/actions/runs/11632477453/job/32396301167

And I can see the docs-building job installed the version we wanted.

legate-boost              24.08.02        cuda12_py312_0_cpu    file:///tmp/local-conda-packages

(build link)

But the docs still have the wrong version in them (yes I cleared my browser cache):

Image

My next theory was "ok, maybe the wrong artifact is being pulled". Checked the logs from the build job:

With the provided path, there will be 1 file uploaded
Artifact name is valid!
Root directory input is valid!
Beginning upload of artifact content to blob storage
Uploaded bytes 4334606
Finished uploading artifact content to blob storage!
SHA256 hash of uploaded artifact zip is 9ed2d0645717aaeb74a7eebed1549b8ae80977098020d3220697fcf9d5be6551
Finalizing artifact upload
Artifact github-pages.zip successfully finalized. Artifact ID 2133838991
Artifact github-pages has been successfully uploaded! Final size is 4334606 bytes. Artifact ID is 2133838991
Artifact download URL: https://github.com/rapidsai/legate-boost/actions/runs/11632477453/artifacts/2133838991

Compared that to the deploy job:

Fetching artifact metadata for "github-pages" in this workflow run
Found 4 artifact(s)
Creating Pages deployment with payload:
{
	"artifact_id": 2133838991,
	"pages_build_version": "f0f5ac033092d849efb362d8bf66dad9243ec331",
	"oidc_token": "***"
}
Created deployment for f0f5ac033092d849efb362d8bf66dad9243ec331, ID: f0f5ac033092d849efb362d8bf66dad9243ec331
Getting Pages deployment status...
Reported success!

Those IDs exactly match.
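
The same comparison can be scripted instead of eyeballed. This is a minimal sketch that assumes the log lines have the shapes shown above; the function names are hypothetical, and the two sample lines are copied from the logs in this comment.

```shell
# Extract the number after "Artifact ID " from upload-step log text on stdin.
uploaded_artifact_id() {
  sed -n 's/.*Artifact ID \([0-9][0-9]*\).*/\1/p' | head -n 1
}

# Extract the value of "artifact_id" from deploy-step log text on stdin.
deployed_artifact_id() {
  sed -n 's/.*"artifact_id": *\([0-9][0-9]*\).*/\1/p' | head -n 1
}

build_id=$(printf '%s\n' 'Artifact github-pages.zip successfully finalized. Artifact ID 2133838991' | uploaded_artifact_id)
deploy_id=$(printf '%s\n' '  "artifact_id": 2133838991,' | deployed_artifact_id)

if [ "$build_id" = "$deploy_id" ]; then
  echo "artifact IDs match: $build_id"
else
  echo "artifact IDs differ: $build_id vs $deploy_id" >&2
fi
```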

@jameslamb
Member Author

I clicked on the github-pages artifact in the summary from the run that was triggered by the tag: https://github.com/rapidsai/legate-boost/actions/runs/11632477453

https://github.com/rapidsai/legate-boost/actions/runs/11632477453/artifacts/2133838991

And opened it up locally (it's index.html)... it has the correct version (24.08.02)! So it's not like the version is wrong in the HTML we're producing.

@jameslamb
Member Author

I just merged #178, which triggered this build: https://github.com/rapidsai/legate-boost/actions/runs/11673382407/workflow

... it failed again with workflow syntax errors 🙃

The workflow is not valid. .github/workflows/build.yaml (Line: 42, Col: 15): Unexpected symbol: '"tag"'. Located at position 22 within expression: github.event_name == "tag" || inputs.deploy_docs == true

I'll put up another PR fixing that, and testing the syntax directly on the PR. Sorry for all the noise getting this last part working 😭
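
Two general points about GitHub Actions are relevant to that error. String literals inside expressions must use single quotes, which is why `"tag"` is reported as an unexpected symbol. Separately, pushing a tag produces a `push` event whose ref starts with `refs/tags/`, so `github.event_name` is never `tag`. The sketch below shows one way to detect a tag build from inside a `run:` step using the default environment variables. It is an illustration only, not necessarily what the follow-up PRs in this thread did, and `is_tag_build` is a hypothetical helper.

```shell
# Succeed when the current run was triggered for a tag.
# GITHUB_REF and GITHUB_REF_TYPE are default variables set by GitHub Actions.
is_tag_build() {
  [ "${GITHUB_REF_TYPE:-}" = "tag" ] && return 0
  case "${GITHUB_REF:-}" in
    refs/tags/*) return 0 ;;
    *) return 1 ;;
  esac
}

if is_tag_build; then
  echo "tag build: docs should be deployed"
else
  echo "not a tag build: skipping docs deployment"
fi
```

Inside a workflow `if:` condition, the equivalent check would be written against the `github.ref_type` or `github.ref` context values, with single-quoted strings.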

@jameslamb
Member Author

Alright, it does look like the changes from #180 successfully led to the docs NOT being redeployed on a merge to main.

Image

(build link)

Pushed a new tag:

git checkout main
git pull upstream main
git tag -a v24.08.06 -m 'v24.08.06'
git push upstream 'v24.08.06'

That triggered this build: https://github.com/rapidsai/legate-boost/actions/runs/11710469244

which... ALSO skipped the deployment 😫 😫 😫 😫

@jameslamb
Member Author

Tried triggering with workflow dispatch and checking the "deploy docs?" box... that also did not deploy: https://github.com/rapidsai/legate-boost/actions/runs/11710994183/job/32619034518

@jameslamb
Member Author

Alright, trying this again. Just merged #181.

That triggered this build: https://github.com/rapidsai/legate-boost/actions/runs/11726389551/job/32664900010

That succeeded and did NOT re-deploy the docs site (which is what we wanted) 🎉

Image

Next, I pushed a tag:

git checkout main
git pull upstream main
git tag -a v24.08.07 -m 'v24.08.07'
git push upstream 'v24.08.07'

That triggered this build: https://github.com/rapidsai/legate-boost/actions/runs/11726677680

That succeeded and DID re-deploy the docs site 🎉

Image

Image

So we are good! I'll go delete all the testing-only packages and tags.

@jameslamb
Member Author

Deleted all the following tags here in the repo:

v24.08.07
v24.08.06
v24.08.05
v24.08.04
v24.08.03
v24.06.01dev
v0.1.0

Just leaving this one, v24.09.00.dev:

Image

https://github.com/rapidsai/legate-boost/tags

@RAMitchell @seberg @mfoerste4 you should probably delete them in your local checkout of the repo too, just to remove a source of difference between local and CI.

git tag -d v24.08.07
git tag -d v24.08.06
git tag -d v24.08.05
git tag -d v24.08.04
git tag -d v24.08.03
git tag -d v24.06.01dev
git tag -d v0.1.0
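
Since not every checkout will have every one of those tags, a loop that skips missing ones avoids a run of "tag not found" errors. This is a minimal sketch; `delete_local_tags` is a hypothetical helper, and it only touches the local repository.

```shell
# Delete each named tag from the local repository, skipping ones that are absent.
delete_local_tags() {
  for tag in "$@"; do
    if git rev-parse --quiet --verify "refs/tags/$tag" >/dev/null; then
      git tag -d "$tag"
    else
      echo "tag $tag not present locally, skipping"
    fi
  done
}

# Example, run from inside a checkout of the repo:
#   delete_local_tags v24.08.07 v24.08.06 v24.08.05 v24.08.04 v24.08.03 v24.06.01dev v0.1.0
```

A broader alternative is `git fetch upstream --prune --prune-tags`, which removes any local tag that no longer exists on the remote. That also deletes tags that were only ever created locally, so it is worth checking `git tag -l` first.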

@jameslamb
Member Author

jameslamb commented Nov 7, 2024

I've asked for RAPIDS ops help deleting all the testing-only packages, but we don't have to keep this open until then.

At this point, I think we can say this is closed! Thanks very much for the help everyone!
