
Unify par_dispatch, par_for_outer & par_for_inner overloads #1142

Open · wants to merge 113 commits into base: develop

Conversation

@acreyes (Contributor) commented Jul 29, 2024

PR Summary

Provides a single overload for par_dispatch, par_for_outer & par_for_inner that can handle both integer and IndexRange launch bounds.

  • par_dispatch & par_for_outer loops are handled by the same par_dispatch_impl struct that constructs the appropriate Kokkos policy and functor for kokkos_dispatch
  • relies on the TypeList struct to hold parameter packs, with accompanying type traits to figure out function signatures
  • Similar pattern for par_for_inner
  • Introduces a new loop pattern LoopPatternCollapse<team, thread, vector> that can be used to collapse a general ND loop over any combination of Kokkos teams, threads and vectors, inspired by #pragma acc collapse directives
    • specializes to LoopPatternTPTTR, LoopPatternTPTVR, LoopPatternTPTTRTVR, InnerLoopPatternTTR and InnerLoopPatternTVR patterns
  • Fallbacks for incompatible Tags & Patterns that can show up from DEFAULT_LOOP_PATTERN
  • generalizes tests for par_for & par_reduce and improves coverage for all patterns up to rank-7 loops

Addresses #1134
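
As a rough illustration of the unified interface (a sketch only; the variable names, views, and exact argument ordering here are illustrative, not lifted from the PR), integer and IndexRange bounds can be mixed in a single call:

```cpp
// Sketch: the same overload resolves plain integer bounds and IndexRange
// bounds, dispatching through the common par_dispatch_impl machinery.
parthenon::IndexRange jb{0, nx2 - 1};
parthenon::IndexRange ib{0, nx1 - 1};

parthenon::par_for(
    DEFAULT_LOOP_PATTERN, "example_loop", parthenon::DevExecSpace(),
    0, nx3 - 1,   // integer bounds for the outermost dimension ...
    jb, ib,       // ... mixed with IndexRange bounds for the inner ones
    KOKKOS_LAMBDA(const int k, const int j, const int i) {
      out(k, j, i) = in(k, j, i);
    });
```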

PR Checklist

  • Code passes cpplint
  • New features are documented.
  • Adds a test for any bugs fixed. Adds tests for new features.
  • Code is formatted
  • Changes are summarized in CHANGELOG.md
  • Change is breaking (API, behavior, ...)
    • Change is additionally added to CHANGELOG.md in the breaking section
    • PR is marked as breaking
    • Short summary of API changes at the top of the PR (optionally with an automated update/fix script)
  • CI has been triggered on Darwin for performance regression tests.
  • Docs build
  • (@lanl.gov employees) Update copyright on changed files

pgrete enabled auto-merge (squash) December 2, 2024 10:39
pgrete disabled auto-merge December 2, 2024 10:39

@pgrete (Collaborator) commented Dec 2, 2024

I'm rerunning the Cuda test as it failed with some (unexpected) host to device mem copies.

@pgrete (Collaborator) commented Dec 4, 2024

> I'm rerunning the Cuda test as it failed with some (unexpected) host to device mem copies.

The test repeatedly failed. Any idea where those extra copies come from?

@acreyes (Contributor Author) commented Dec 4, 2024

> I'm rerunning the Cuda test as it failed with some (unexpected) host to device mem copies.
>
> The test repeatedly failed. Any idea where those extra copies come from?

I'll check, but that is unexpected. I think it passed some time back in September, so at least there should be a pretty recent diff to use.

@pgrete (Collaborator) left a comment:

I really like this! Thanks for putting in all the effort.

I have to admit that mentally parsing the new machinery is more challenging than the old verbose one, but I think it's a way cleaner approach!
I'd like to do some downstream performance testing early next week and understand/track down the additional host/device copies before I finally approve.

doc/sphinx/src/par_for.rst: outdated review comments (resolved)
Comment on lines 120 to 124
template <>
struct UsesHierarchialPar<OuterLoopPatternTeams> : std::true_type {
static constexpr std::size_t Nvector = 0;
static constexpr std::size_t Nthread = 0;
};
Collaborator:

Can you comment on this trait?
I'm not sure I follow the default values for Nvector and Nthread.

Contributor Author:

The par_dispatch_impl::dispatch_impl tagged dispatches abstract the kernel launch over an outer flattened loop and an inner flattened loop. The inner loop flattening is used in the TPT[RTV]R patterns (hence the thread/vector) and also in the SimdFor pattern for the innermost vectorized loop. The default then is zero for all those loop patterns that don't have any vector/thread inner loops.
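
For comparison, a hedged sketch of how a pattern with an inner vector loop might specialize the same trait (illustrative only; the actual specializations in the PR may differ in detail):

```cpp
// Hypothetical specialization: a pattern whose innermost loop is a
// ThreadVectorRange reports one vector loop and no thread loop.
template <>
struct UsesHierarchialPar<InnerLoopPatternTVR> : std::true_type {
  static constexpr std::size_t Nvector = 1;
  static constexpr std::size_t Nthread = 0;
};
```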

Collaborator:

I see. This makes sense (I was probably confused because I associated sth different based on the naming and personal habits).

Comment on lines 196 to 199
static constexpr bool is_ParFor =
std::is_same<Tag, dispatch_impl::ParallelForDispatch>::value;
static constexpr bool is_ParScan =
std::is_same<Tag, dispatch_impl::ParallelScanDispatch>::value;
Collaborator:

No special handling/logic required for par_reduces below?

@acreyes (Contributor Author) commented Dec 6, 2024:

par_reduce should work with every pattern except for the SimdFor one. However SimdFor only works for par_for, which is why that is the only check that is done.

edit: maybe I take that back. The TPT[RTV]R I think could in principle work with par_reduce but certainly not the way they're currently written. I'll duplicate the SimdFor check for the Hierarchical ones
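
Something along these lines, presumably (a sketch of the kind of compile-time guard being described; the exact trait and template parameter names in the PR may differ):

```cpp
// Hypothetical guard: reject dispatch tags that a pattern cannot support,
// alongside the is_ParFor / is_ParScan checks above.
static constexpr bool is_SimdFor =
    std::is_same<Pattern, LoopPatternSimdFor>::value;
static_assert(!is_SimdFor || is_ParFor,
              "SimdFor loop pattern only supports par_for dispatch");
```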

Comment on lines 491 to 492
Kokkos::MDRangePolicy<Kokkos::Rank<Rank>>(exec_space, {bound_arr[OuterIs].s...},
{(1 + bound_arr[OuterIs].e)...}),
Collaborator:

Are the block sizes here forwarded (i.e., the old {1, 1, 1, 1, iu + 1 - il})?

Contributor Author:

Good catch, I think I had meant to come back and add this and forgot. It's in now.

src/kokkos_abstraction.hpp: outdated review comment (resolved)
using HierarchialPar = typename dispatch_type::HierarchialPar;
constexpr std::size_t Nvector = HierarchialPar::Nvector;
constexpr std::size_t Nthread = HierarchialPar::Nthread;
constexpr std::size_t Nouter = Rank - Nvector - Nthread;
Collaborator:

What exactly is Nouter here (given the Rank - Nvector - Nthread formula)?
I also tried to follow the MakeCollapse<Rank, Nouter trail below, but Nouter seems to become Nteam, which is then not used anymore.
I'm probably missing sth here.

Contributor Author:

This one covers all the various TPT[RTV]R patterns. These always have either 0 or 1 loops for the vector and thread ranges in the inner pattern; Nouter is all the remaining loops, which get flattened into the outer team-policy loop.
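
As a concrete (hypothetical) example of the bookkeeping:

```cpp
// A rank-5 loop dispatched with a TPTVR-style pattern: one inner vector
// loop, no inner thread loop, so the remaining four loops are flattened
// into the outer team-policy loop.
constexpr std::size_t Rank = 5, Nvector = 1, Nthread = 0;
constexpr std::size_t Nouter = Rank - Nvector - Nthread;
static_assert(Nouter == 4, "four loops collapse into the team policy");
```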

Comment on lines +162 to +165
template <std::size_t ND, typename T, typename State = empty_state_t>
using ParArray = typename ParArrayND_impl<std::integral_constant<std::size_t, ND>,
State>::template type<T>;

Collaborator:

and below: this is a new interface, isn't it? Might be worth briefly adding this to the docs alongside ParArray#D.

Contributor Author:

yes, I've added some new documentation for it

acreyes and others added 6 commits December 6, 2024 14:06
@acreyes (Contributor Author) commented Dec 6, 2024

> I'd like to do some downstream performance testing early next week and understand/track down the additional host/device copies before I finally approve.

👍

I believe I've tracked down the source of the HtoD copies. The Indexer struct is used to flatten/reconstruct the multidimensional indices and holds some Kokkos::Array<int, ND>s for that purpose. For some reason the lambda capture of this triggers a mem copy. It can be constructed inside the kernel instead, and that seems to solve it.

I don't understand the behavior though, and it also seems to be related to the Kokkos version. 4.0.1 doesn't have the copies, but starting at least in 4.2 the copies show up.

Even stranger, the same pattern is used for the LoopPatternFlatRange kernels but doesn't result in any mem copies, at least according to Nsight.
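
A minimal sketch of the workaround (the struct below is a stand-in to show the capture pattern, not the PR's actual Indexer; nj/ni are assumed loop extents):

```cpp
// Stand-in for the Indexer: holds a Kokkos::Array member and unflattens a
// 1D index into (j, i).
struct FlatIndexer {
  Kokkos::Array<int, 2> N;
  KOKKOS_INLINE_FUNCTION
  void operator()(const int idx, int &j, int &i) const {
    j = idx / N[1];
    i = idx - j * N[1];
  }
};

// Capturing a host-constructed FlatIndexer in the lambda appeared to trigger
// the extra HtoD copies on newer Kokkos; constructing it inside the kernel
// body means the lambda only captures plain ints.
Kokkos::parallel_for(
    "flat_loop", nj * ni, KOKKOS_LAMBDA(const int idx) {
      const FlatIndexer idxer{{nj, ni}};
      int j, i;
      idxer(idx, j, i);
      // ... kernel body using (j, i) ...
    });
```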

{(1 + bound_arr[OuterIs].e)...}),
function, std::forward<Args>(args)...);
constexpr std::size_t Nouter = sizeof...(OuterIs);
Kokkos::Array<int, Nouter> tiling{(OuterIs, 1)...};
Collaborator:

Is this working as expected?
If I infer the intent correctly, this should create an array initialized to 1 everywhere.
My compiler complains with a warning

/p/project/coldcluster/pgrete/athenapk/external/parthenon/src/kokkos_abstraction.hpp(493): warning #174-D: expression has no effect
      Kokkos::Array<int, Nouter> tiling{(OuterIs, 1)...};

AFAIK default init doesn't work for arrays, so we might need sth like

    std::array<int, Nouter> tiling;
    tiling.fill(1);

Contributor Author:

This was working for me, but better to avoid the warning.

The warning makes sense since (OuterIs, 1) will always just evaluate to 1.
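
For reference, one warning-free variant that keeps the pack expansion (a sketch; whether nvcc's #174-D is actually silenced by the explicit void cast would need to be verified):

```cpp
// Explicitly discard the pack element before the comma operator, then
// expand to Nouter ones.
Kokkos::Array<int, Nouter> tiling{{(static_cast<void>(OuterIs), 1)...}};
```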

@pgrete (Collaborator) commented Dec 12, 2024

I now ran some more detailed tests.
Compile times increase (as expected given the additional template magic), here tested for AthenaPK on 48 cores

  • CUDA: 5m29s -> 6m7s
  • HIP: 2m18s -> 2m40s

so nothing too dramatic (from my point of view).

However, performance is a concern.
I tested small(ish) and large blocks with 32x64x64 and 128x256x256 cells respectively on A100 and MI250X, and our flux kernels (which use hierarchical parallelism with scratch memory) are up to 22% slower with the new layout (whereas the flat kernels remain effectively identical in performance).
I'm not exactly sure where this difference comes from, but I suspect that the additional logic results in additional register usage, which limits the occupancy of the kernels.

Maybe we can discuss the performance implications during the sync today.

@fglines-nv (Collaborator) commented

I looked into the performance issues in AthenaPK and verified that there are performance issues in the flux kernels with this PR, but only for the X1 flux, not X2 and X3. It's definitely due to increased register pressure.

| Kernel  | Baseline Time | PR 1142 Time |
| ------- | ------------- | ------------ |
| x1 flux | 1.75 ms       | 1.98 ms      |
| x2 flux | 1.42 ms       | 1.43 ms      |
| x3 flux | 1.19 ms       | 1.20 ms      |

| Kernel  | Baseline Regs | PR 1142 Regs |
| ------- | ------------- | ------------ |
| x1 flux | 76            | 82           |
| x2 flux | 82            | 85           |
| x3 flux | 83            | 87           |

That jump from 76->82 registers is enough to push the kernel from running 6 blocks per SM to 5 blocks per SM, hence fewer warps in the pipeline doing loads and thus the ~20% drop in performance. The x2 and x3 kernels could potentially gain 20% if a few registers could be optimized away.
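
Back-of-envelope for that occupancy step (the per-SM register file on A100 is 65,536 32-bit registers; the exact cutoff also depends on the register allocation granularity and the block size used, which are assumptions here): blocks per SM is roughly floor(R_SM / (regs_per_thread * threads_per_block)), so a handful of extra registers per thread is enough to drop below the next integer boundary, which is the 6 -> 5 step observed here.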

You'd see this PR affect other high-register kernels the same way: the higher the count, the more they'd be impacted. Generally, the hierarchical kernels I've seen in Parthenon codes are more complex and have high register counts, sometimes higher than this.

I'll take a look at the arithmetic in this PR and see if we can reduce register usage without changing the interface.
