
Unify par_dispatch, par_for_outer & par_for_inner overloads #1142

Open · wants to merge 113 commits into base: develop

Conversation

@acreyes (Contributor) commented Jul 29, 2024

PR Summary

Provides a single overload for par_dispatch, par_for_outer & par_for_inner that can handle both integer and IndexRange launch bounds.

  • par_dispatch & par_for_outer loops are handled by the same par_dispatch_impl struct that constructs the appropriate Kokkos policy and functor for kokkos_dispatch
  • relies on the TypeList struct to hold parameter packs, with accompanying type traits to figure out function signatures
  • Similar pattern for par_for_inner
  • Introduces a new loop pattern LoopPatternCollapse<team, thread, vector> that can be used to collapse a general ND loop over any combination of Kokkos teams, threads and vectors, inspired by #pragma acc collapse directives
    • specializes to LoopPatternTPTTR, LoopPatternTPTVR, LoopPatternTPTTRTVR, InnerLoopPatternTTR and InnerLoopPatternTVR patterns
  • Fallbacks for incompatible Tags & Patterns that can show up from DEFAULT_LOOP_PATTERN
  • generalizes tests for par_for & par_reduce and improves coverage for all patterns up to rank-7 loops

Addresses #1134
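
As a rough illustration of the unified interface (a sketch only; the variable names, views, and exact argument ordering here are illustrative, not lifted from the PR), integer and IndexRange bounds can be mixed in a single call:

```cpp
// Sketch: the same overload resolves plain integer bounds and IndexRange
// bounds, dispatching through the common par_dispatch_impl machinery.
parthenon::IndexRange jb{0, nx2 - 1};
parthenon::IndexRange ib{0, nx1 - 1};

parthenon::par_for(
    DEFAULT_LOOP_PATTERN, "example_loop", parthenon::DevExecSpace(),
    0, nx3 - 1,   // integer bounds for the outermost dimension ...
    jb, ib,       // ... mixed with IndexRange bounds for the inner ones
    KOKKOS_LAMBDA(const int k, const int j, const int i) {
      out(k, j, i) = in(k, j, i);
    });
```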

PR Checklist

  • Code passes cpplint
  • New features are documented.
  • Adds a test for any bugs fixed. Adds tests for new features.
  • Code is formatted
  • Changes are summarized in CHANGELOG.md
  • Change is breaking (API, behavior, ...)
    • Change is additionally added to CHANGELOG.md in the breaking section
    • PR is marked as breaking
    • Short summary of API changes at the top of the PR (optionally with an automated update/fix script)
  • CI has been triggered on Darwin for performance regression tests.
  • Docs build
  • (@lanl.gov employees) Update copyright on changed files

pgrete enabled auto-merge (squash) December 2, 2024 10:39
pgrete disabled auto-merge December 2, 2024 10:39

@pgrete (Collaborator) commented Dec 2, 2024

I'm rerunning the Cuda test as it failed with some (unexpected) host to device mem copies.

@pgrete (Collaborator) commented Dec 4, 2024

> I'm rerunning the Cuda test as it failed with some (unexpected) host to device mem copies.

The test repeatedly failed. Any idea where those extra copies come from?

@acreyes (Contributor Author) commented Dec 4, 2024

> I'm rerunning the Cuda test as it failed with some (unexpected) host to device mem copies.
>
> The test repeatedly failed. Any idea where those extra copies come from?

I'll check, but that is unexpected. I think it passed some time back in September, so at least there should be a pretty recent diff to use.

@pgrete (Collaborator) left a comment:

I really like this! Thanks for putting in all the effort.

I have to admit that mentally parsing the new machinery is more challenging than the old verbose one, but I think it's a way cleaner approach!
I'd like to do some downstream performance testing early next week and understand/track down the additional host/device copies before I finally approve.

doc/sphinx/src/par_for.rst: outdated review comments (resolved)
Comment on lines 120 to 124
template <>
struct UsesHierarchialPar<OuterLoopPatternTeams> : std::true_type {
static constexpr std::size_t Nvector = 0;
static constexpr std::size_t Nthread = 0;
};
Collaborator:

Can you comment on this trait?
I'm not sure I follow the default values for Nvector and Nthread.

Contributor Author:

The par_dispatch_impl::dispatch_impl tagged dispatches abstract the kernel launch over an outer flattened loop and an inner flattened loop. The inner loop flattening is used in the TPT[RTV]R patterns (hence the thread/vector) and also in the SimdFor pattern for the innermost vectorized loop. The default then is zero for all those loop patterns that don't have any vector/thread inner loops.
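
For comparison, a hedged sketch of how a pattern with an inner vector loop might specialize the same trait (illustrative only; the actual specializations in the PR may differ in detail):

```cpp
// Hypothetical specialization: a pattern whose innermost loop is a
// ThreadVectorRange reports one vector loop and no thread loop.
template <>
struct UsesHierarchialPar<InnerLoopPatternTVR> : std::true_type {
  static constexpr std::size_t Nvector = 1;
  static constexpr std::size_t Nthread = 0;
};
```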

Collaborator:

I see. This makes sense (I was probably confused because I associated sth different based on the naming and personal habits).

Comment on lines 196 to 199
static constexpr bool is_ParFor =
std::is_same<Tag, dispatch_impl::ParallelForDispatch>::value;
static constexpr bool is_ParScan =
std::is_same<Tag, dispatch_impl::ParallelScanDispatch>::value;
Collaborator:

No special handling/logic required for par_reduces below?

@acreyes (Contributor Author) commented Dec 6, 2024:

par_reduce should work with every pattern except for the SimdFor one. However SimdFor only works for par_for, which is why that is the only check that is done.

edit: maybe I take that back. The TPT[RTV]R I think could in principle work with par_reduce but certainly not the way they're currently written. I'll duplicate the SimdFor check for the Hierarchical ones
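
Something along these lines, presumably (a sketch of the kind of compile-time guard being described; the exact trait and template parameter names in the PR may differ):

```cpp
// Hypothetical guard: reject dispatch tags that a pattern cannot support,
// alongside the is_ParFor / is_ParScan checks above.
static constexpr bool is_SimdFor =
    std::is_same<Pattern, LoopPatternSimdFor>::value;
static_assert(!is_SimdFor || is_ParFor,
              "SimdFor loop pattern only supports par_for dispatch");
```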

Comment on lines 491 to 492
Kokkos::MDRangePolicy<Kokkos::Rank<Rank>>(exec_space, {bound_arr[OuterIs].s...},
{(1 + bound_arr[OuterIs].e)...}),
Collaborator:

Are the block sizes here forwarded (i.e., the old {1, 1, 1, 1, iu + 1 - il})?

Contributor Author:

Good catch, I think I had meant to come back and add this and forgot. It's in now.

src/kokkos_abstraction.hpp: outdated review comment (resolved)
using HierarchialPar = typename dispatch_type::HierarchialPar;
constexpr std::size_t Nvector = HierarchialPar::Nvector;
constexpr std::size_t Nthread = HierarchialPar::Nthread;
constexpr std::size_t Nouter = Rank - Nvector - Nthread;
Collaborator:

What exactly is Nouter here (given the Rank - Nvector - Nthread formula)?
I also tried to follow the MakeCollapse<Rank, Nouter trail below, but Nouter seems to become Nteam, which is then not used anymore.
I'm probably missing sth here.

Contributor Author:

This one covers all the various TPT[RTV]R patterns. These always have either 0 or 1 loops for the vector and thread ranges in the inner pattern; Nouter is all the remaining loops, which get flattened into the outer team-policy loop.
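
As a concrete (hypothetical) example of the bookkeeping:

```cpp
// A rank-5 loop dispatched with a TPTVR-style pattern: one inner vector
// loop, no inner thread loop, so the remaining four loops are flattened
// into the outer team-policy loop.
constexpr std::size_t Rank = 5, Nvector = 1, Nthread = 0;
constexpr std::size_t Nouter = Rank - Nvector - Nthread;
static_assert(Nouter == 4, "four loops collapse into the team policy");
```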

Comment on lines +162 to +165
template <std::size_t ND, typename T, typename State = empty_state_t>
using ParArray = typename ParArrayND_impl<std::integral_constant<std::size_t, ND>,
State>::template type<T>;

Collaborator:

and below: this is a new interface, isn't it? Might be worth briefly adding this to the docs alongside ParArray#D.

Contributor Author:

yes, I've added some new documentation for it

acreyes and others added 6 commits December 6, 2024 14:06
@acreyes (Contributor Author) commented Dec 6, 2024

> I'd like to do some downstream performance testing early next week and understand/track down the additional host/device copies before I finally approve.

👍

I believe I've tracked down the source of the HtoD copies. The Indexer struct is used to flatten/reconstruct the multidimensional indices and holds some Kokkos::Array<int, ND>s for that purpose. For some reason the lambda capture of this triggers a mem copy. It can be constructed inside the kernel instead, and that seems to solve it.

I don't understand the behavior though, and it also seems to be related to the Kokkos version. 4.0.1 doesn't have the copies, but starting at least in 4.2 the copies show up.

Even stranger, the same pattern is used for the LoopPatternFlatRange kernels but doesn't result in any mem copies, at least according to Nsight.
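
A minimal sketch of the workaround (the struct below is a stand-in to show the capture pattern, not the PR's actual Indexer; nj/ni are assumed loop extents):

```cpp
// Stand-in for the Indexer: holds a Kokkos::Array member and unflattens a
// 1D index into (j, i).
struct FlatIndexer {
  Kokkos::Array<int, 2> N;
  KOKKOS_INLINE_FUNCTION
  void operator()(const int idx, int &j, int &i) const {
    j = idx / N[1];
    i = idx - j * N[1];
  }
};

// Capturing a host-constructed FlatIndexer in the lambda appeared to trigger
// the extra HtoD copies on newer Kokkos; constructing it inside the kernel
// body means the lambda only captures plain ints.
Kokkos::parallel_for(
    "flat_loop", nj * ni, KOKKOS_LAMBDA(const int idx) {
      const FlatIndexer idxer{{nj, ni}};
      int j, i;
      idxer(idx, j, i);
      // ... kernel body using (j, i) ...
    });
```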

{(1 + bound_arr[OuterIs].e)...}),
function, std::forward<Args>(args)...);
constexpr std::size_t Nouter = sizeof...(OuterIs);
Kokkos::Array<int, Nouter> tiling{(OuterIs, 1)...};
Collaborator:

Is this working as expected?
If I infer the intent correctly, this should create an array initialized to 1 everywhere.
My compiler complains with a warning

/p/project/coldcluster/pgrete/athenapk/external/parthenon/src/kokkos_abstraction.hpp(493): warning #174-D: expression has no effect
      Kokkos::Array<int, Nouter> tiling{(OuterIs, 1)...};

AFAIK default init doesn't work for arrays, so we might need sth like

    std::array<int, Nouter> tiling;
    tiling.fill(1);

Contributor Author:

This was working for me, but better to avoid the warning.

The warning makes sense since (OuterIs, 1) will always just evaluate to 1.
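
For reference, one warning-free variant that keeps the pack expansion (a sketch; whether nvcc's #174-D is actually silenced by the explicit void cast would need to be verified):

```cpp
// Explicitly discard the pack element before the comma operator, then
// expand to Nouter ones.
Kokkos::Array<int, Nouter> tiling{{(static_cast<void>(OuterIs), 1)...}};
```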

@pgrete (Collaborator) commented Dec 12, 2024

I now ran some more detailed tests.
Compile times increase (as expected given the additional template magic), here tested for AthenaPK on 48 cores

  • CUDA: 5m29s -> 6m7s
  • HIP: 2m18s -> 2m40s

so nothing too dramatic (from my point of view).

However, performance is a concern.
I tested small(ish) and large blocks with 32x64x64 and 128x256x256 cells respectively on A100 and MI250X, and our flux kernels (which use hierarchical parallelism with scratch memory) are up to 22% slower with the new layout (whereas the flat kernels remain effectively identical in performance).
I'm not exactly sure where this difference comes from, but I suspect that the additional logic results in additional register usage, which limits the occupancy of the kernels.

Maybe we can discuss the performance implications during the sync today.

@fglines-nv (Collaborator) commented

I looked into the performance issues in AthenaPK and verified that there are performance issues in the flux kernels with this PR, but only for the X1 flux, not X2 and X3. It's definitely due to increased register pressure.

| Kernel  | Baseline Time | PR 1142 Time |
| ------- | ------------- | ------------ |
| x1 flux | 1.75 ms       | 1.98 ms      |
| x2 flux | 1.42 ms       | 1.43 ms      |
| x3 flux | 1.19 ms       | 1.20 ms      |

| Kernel  | Baseline Regs | PR 1142 Regs |
| ------- | ------------- | ------------ |
| x1 flux | 76            | 82           |
| x2 flux | 82            | 85           |
| x3 flux | 83            | 87           |

That jump from 76->82 registers is enough to push the kernel from running 6 blocks per SM to 5 blocks per SM, hence fewer warps in the pipeline doing loads and thus the ~20% drop in performance. The x2 and x3 kernels could potentially gain 20% if a few registers could be optimized away.
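
Back-of-envelope for that occupancy step (the per-SM register file on A100 is 65,536 32-bit registers; the exact cutoff also depends on the register allocation granularity and the block size used, which are assumptions here): blocks per SM is roughly floor(R_SM / (regs_per_thread * threads_per_block)), so a handful of extra registers per thread is enough to drop below the next integer boundary, which is the 6 -> 5 step observed here.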

You'd see this PR affect other high-register kernels the same way: the higher the count, the more they'd be impacted. Generally, the hierarchical kernels I've seen in Parthenon codes are more complex and have high register counts, sometimes higher than this.

I'll take a look at the arithmetic in this PR and see if we can reduce register usage without changing the interface.
