
Update to Legate 24.09 nightlies #163

Merged · 2 commits into rapidsai:main · Oct 23, 2024

Conversation

@seberg (Contributor) commented Oct 11, 2024

This updates to the nightlies based on Jacob's work (i.e., most of the cmake changes and the larger part of the non-stream-related C changes).

A slightly bigger change is that there is now a warning for the stream pool API, so we have to pass through the stream more explicitly.
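To illustrate the kind of rewiring this means (a minimal sketch only, not the actual diff; `HypotheticalGpuTask` and `run_kernel` are made-up names):

```cpp
// Hypothetical sketch of moving off the deprecated stream pool API.
#include <cuda_runtime.h>

#include "legate.h"

// The CUDA work now receives the stream as an explicit parameter instead of
// fetching it from the global stream pool deep inside helper code.
void run_kernel(cudaStream_t stream);

struct HypotheticalGpuTask : public legate::LegateTask<HypotheticalGpuTask> {
  static void gpu_variant(legate::TaskContext context)
  {
    // Deprecated (now emits a warning):
    //   auto stream = legate::cuda::StreamPool::get_stream_pool().get_stream();
    auto stream = context.get_task_stream();  // take the stream from the task context
    run_kernel(stream);                       // ...and pass it through explicitly
  }
};
```

The design point is that the stream becomes part of each helper's signature, so CUDA calls no longer pull a stream from global state.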

Locally there is a test failing because `legate --cpus 2` thinks that `--cpus` is passed twice. Let's see; the question will be whether we can just remove it (probably using the default of 4 CPUs).

This comment was marked as resolved.

@seberg (Contributor, Author) commented Oct 11, 2024

OK... this is failing to build in CI due to https://github.com/nv-legate/legate.core.internal/issues/1246; I will mark this as a draft until that is fixed.

@seberg seberg marked this pull request as draft October 11, 2024 14:38
@RAMitchell (Contributor) left a comment

Thank you for taking this on @seberg! Looks not too bad in terms of code changes. Will take another look when the CI is working.

@jameslamb jameslamb self-requested a review October 11, 2024 15:23
@jameslamb (Member) left a comment

Thanks for doing this!

I think your diagnosis of why this is failing (#163 (comment)) is exactly correct.

I left some other small packaging-related suggestions for your consideration.

dependencies.yaml (outdated; resolved)
dependencies.yaml (outdated; resolved)
conda/recipes/legate-boost/conda_build_config.yaml (outdated; resolved)
ci/build_python.sh (resolved)
@seberg seberg force-pushed the legate-24.09 branch 3 times, most recently from 226e78b to a0981da on October 15, 2024 15:08
@seberg (Contributor, Author) commented Oct 16, 2024

Just to note: the way install_info.py is now created, it includes the full local build path among the paths searched for the library. I am not sure that is intentional (I don't think the cunumeric install_info.py includes it); on the other hand, maybe it helps with local builds?

@seberg seberg force-pushed the legate-24.09 branch 2 times, most recently from 69dcf7f to 889ab86 on October 16, 2024 10:18
@seberg (Contributor, Author) commented Oct 16, 2024

Ping CI to run (close/reopen).

@seberg seberg closed this Oct 16, 2024
@seberg seberg reopened this Oct 16, 2024
@jameslamb (Member) commented:

/ok to test

@seberg (Contributor, Author) commented Oct 16, 2024

This is getting somewhere: the test failures seem pretty harmless (one new mypy failure, and `--cpus` getting passed twice; see the details at the end).

Two notes about the setup:

  • install_info is now fully generated, and I am not sure it was previously. ./ci/run_pytests_gpu.sh works around this by moving to a different directory so the source package is not picked up. I am not sure whether in-place builds work with this (did they before?). Local builds are fine; the file shows up in-tree.
  • install_info.py seems to hardcode the actual build path. That doesn't seem right for a package; it is a potential problem, but it is not a new one.

As for the failure below: not sure what it is about. Maybe it is picking up the configuration of the parent process? Not sure that starting legate from within a legate process is a good idea :).

== LEGATE ERROR:
== LEGATE ERROR: Duplicate argument --cpus
== LEGATE ERROR:
Usage: LEGATE [--help] [--version] [--cpus INT] [--gpus INT] [--omps INT] [--ompthreads INT] [--utility INT] [--sysmem INT] [--numamem INT] [--fbmem INT] [--zcmem INT] [--regmem INT] [--eager-alloc-percentage INT] [--profile] [--spy] [--logging STRING] [--logdir STRING] [--log-to-file] [--freeze-on-error]
Optional arguments:
  -h, --help                    shows help message and exits 
  -v, --version                 prints version information and exits 
Legate arguments (detailed usage):
  --cpus INT                    number of CPU's to reserve, must be >=0 [default: -1]
  --gpus INT                    number of GPU's to reserve, must be >=0 [default: -1]
  --omps INT                    number of OpenMP processors to reserve, must be >=0 [default: -1]
  --ompthreads INT              number of OpenMP threads to use, must be >=0 [default: -1]
  --utility INT                 number of utility processors to reserve, must be >=0 [default: 2]
  --sysmem INT                  size (in megabytes) of system memory to reserve [default: -1]
  --numamem INT                 size (in megabytes) of NUMA memory to reserve [default: -1]
  --fbmem INT                   size (in megabytes) of GPU (or "frame buffer") memory to reservef [default: -1]
  --zcmem INT                   size (in megabytes) of zero-copy GPU memory to reserve [default: 128]
  --regmem INT                  size (in megabytes) of NIC-registered memory to reserve [default: 0]
  --eager-alloc-percentage INT  percentage of eager allocation [default: 50]
  --profile                     whether to enable Legion runtime profiling 
  --spy                         whether to enable Legion spy 
  --logging STRING              comma separated list of loggers to enable and their level, e.g. legate=3,foo=0,bar=5 [default: ""]
  --logdir STRING               directory to emit logfiles to [default: ""]
  --log-to-file                 wether to save logs to file 
  --freeze-on-error             whether to pause the program on first error 
test_examples.py::test_benchmark FAILED

@seberg (Contributor, Author) commented Oct 16, 2024

Hmmmm, I had to cancel the GPU job; it was still running after an hour, which I don't think is correct... hanging at:

EDIT: Hmm, no. There are more tests later; I just missed them among all the warnings. So it is working, but the neural-net tests are ~10x slower than previously:

models/test_nn.py::test_nn[float32-0.0-hidden_layer_sizes1-1] [0 - 7f0788d16740]    0.000000 {5}{numa}: mems_allowed: ret=-1 errno=1 mask= count=64
[0 - 7f056827c740]  698.580191 {4}{runtime}: [warning 1116] LEGION WARNING: Ignoring request by mapper legate.core on Node 0 to check for collective usage for region requirement 0 of task legateboost::BuildNNTask (UID 2743741) because region requirement has writing privileges. (from file /tmp/conda-croot/legate/work/arch-conda/skbuild_core/_deps/legion-src/runtime/legion/legion_tasks.cc:819)
For more information see:
http://legion.stanford.edu/messages/warning_code.html#warning_code_1116
[0 - 7f056827c740]  698.580213 {4}{runtime}: [warning 1116] LEGION WARNING: Ignoring request by mapper legate.core on Node 0 to check for collective usage for region requirement 1 of task legateboost::BuildNNTask (UID 2743741) because region requirement has writing privileges. (from file /tmp/conda-croot/legate/work/arch-conda/skbuild_core/_deps/legion-src/runtime/legion/legion_tasks.cc:819)
For more information see:
http://legion.stanford.edu/messages/warning_code.html#warning_code_1116
<similar warnings repeat>

This updates the code for 24.09 compatibility based on Jacob's start.

Additional/larger changes:
* `legate::cuda::StreamPool::get_stream_pool().get_stream()` is deprecated,
  so use `context.get_task_stream()` and rewire things as needed.
* We are always including the label/experimental channel.
* The `install_info.py` file generation seems slightly different, so
  the test scripts now move into the test folder and then run the tests.
  (This could maybe be improved.)
  This has no effect locally, though.

Signed-off-by: Sebastian Berg <sebastianb@nvidia.com>
@seberg seberg marked this pull request as ready for review October 17, 2024 15:48
@RAMitchell (Contributor) commented:

The reason for the slowness of the GPU tests is that they are actually the CPU tests :)

./ci/run_pytests_cpu.sh

@jameslamb a typo?

@seberg (Contributor, Author) commented Oct 23, 2024

> The reason for the slowness of the GPU tests is that they are actually the CPU tests :)

Let me fix that, although I am confused; I thought the new version of legate would default to using the GPU, if anything!?

Signed-off-by: Sebastian Berg <sebastianb@nvidia.com>
@seberg (Contributor, Author) commented Oct 23, 2024

Well, let's see. Maybe I also got the tags wrong, and we are actually compiling a CPU-only version for both, or something like that...

@jameslamb (Member) commented:

> The reason for the slowness of the GPU tests is that they are actually the CPU tests :)
>
> ./ci/run_pytests_cpu.sh
>
> @jameslamb a typo?

ah!!! Yes, absolutely a typo! Sorry about that, and thank you for fixing it.

If this PR is going to go on much longer, I recommend just putting up a separate one with that change to get it onto main.

@seberg (Contributor, Author) commented Oct 23, 2024

The CPU tests are taking 43 minutes right now. Maybe that is actually acceptable for the moment?

@RAMitchell (Contributor) commented:

It's acceptable for now, although let's not leave it like that for too long. From my perspective this PR is fine. @jameslamb, will this interfere with your packaging work if we merge it?

- cunumeric {{ legate_version }}
- legate-core {{ legate_version }}
- cunumeric {{ legate_version }} =*_gpu
- legate {{ legate_version }} =*_gpu
@jameslamb (Member) commented on these lines:

oh hey great! I'm glad they now added _gpu to the build strings for legate / cunumeric. When I'd asked about that before, the folks maintaining legate / cunumeric were against it: https://github.com/nv-legate/legate.core.internal/pull/1035#discussion_r1702268130

@jameslamb jameslamb self-requested a review October 23, 2024 14:43
@jameslamb (Member) left a comment

> From my perspective this PR is fine. @jameslamb will this interfere with your packaging work if we merged it?

Thanks for the @. I totally support merging this... focusing on just supporting legate / cunumeric 24.09 will help the packaging work towards publishing a legate-boost 24.09.

@seberg (Contributor, Author) commented Oct 23, 2024

I guess I'll put it in then, thanks! 🤞 that CI will be stable enough (and that the slowdown gets tracked down soon!).

@seberg seberg merged commit 6c8a1cb into rapidsai:main Oct 23, 2024
10 checks passed
@seberg seberg deleted the legate-24.09 branch October 23, 2024 19:11