Unify `par_dispatch`, `par_for_outer` & `par_for_inner` overloads #1142
base: develop
Conversation
I'm rerunning the Cuda test as it failed with some (unexpected) host-to-device mem copies.
The test repeatedly failed. Any idea where those extra copies come from?
I'll check, but that is unexpected. I think it passed some time back in September, so at least there should be a pretty recent diff to use.
I really like this! Thanks for putting in all the effort.
I have to admit that mentally parsing the new machinery is more challenging than the old verbose one, but I think it's a way cleaner approach!
I'd like to do some downstream performance testing early next week and understand/track down the additional host/device copies before I finally approve.
src/kokkos_abstraction.hpp (outdated)

```cpp
template <>
struct UsesHierarchialPar<OuterLoopPatternTeams> : std::true_type {
  static constexpr std::size_t Nvector = 0;
  static constexpr std::size_t Nthread = 0;
};
```
Can you comment on this trait? I'm not sure I follow the default values for `Nvector` and `Nthread`.
The `par_dispatch_impl::dispatch_impl` tagged dispatches abstract the kernel launch over an outer flattened loop and an inner flattened loop. The inner loop flattening is used in the `TPT[RTV]R` patterns (hence the thread/vector) and also in the `SimdFor` pattern for the innermost vectorized loop. The default is then zero for all those loop patterns that don't have any vector/thread inner loops.
I see. This makes sense (I was probably confused because I associated something different based on the naming and personal habits).
src/kokkos_abstraction.hpp (outdated)

```cpp
static constexpr bool is_ParFor =
    std::is_same<Tag, dispatch_impl::ParallelForDispatch>::value;
static constexpr bool is_ParScan =
    std::is_same<Tag, dispatch_impl::ParallelScanDispatch>::value;
```
No special handling/logic required for `par_reduce`s below?
`par_reduce` should work with every pattern except for the `SimdFor` one. However, `SimdFor` only works for `par_for`, which is why that is the only check that is done.

edit: maybe I take that back. The `TPT[RTV]R` patterns I think could in principle work with `par_reduce`, but certainly not the way they're currently written. I'll duplicate the `SimdFor` check for the `Hierarchical` ones.
src/kokkos_abstraction.hpp (outdated)

```cpp
Kokkos::MDRangePolicy<Kokkos::Rank<Rank>>(exec_space, {bound_arr[OuterIs].s...},
                                          {(1 + bound_arr[OuterIs].e)...}),
```
Are the block sizes here forwarded (i.e., the old `{1, 1, 1, 1, iu + 1 - il}`)?
Good catch, I think I had meant to come back and add this in and forgot. This is in now.
```cpp
using HierarchialPar = typename dispatch_type::HierarchialPar;
constexpr std::size_t Nvector = HierarchialPar::Nvector;
constexpr std::size_t Nthread = HierarchialPar::Nthread;
constexpr std::size_t Nouter = Rank - Nvector - Nthread;
```
What exactly is `Nouter` here (given the `Rank - Nvector - Nthread` formula)?
I also tried to follow the `MakeCollapse<Rank, Nouter` trail below, but `Nouter` seems to become `Nteam`, which is then not used anymore.
I'm probably missing something here.
This one covers all the various `TPT[RTV]R` patterns. These always have 1 or 0 loops for either the vector or thread range loops in the inner pattern. `Nouter` is all the remaining loops that become flattened into an outer team policy loop.
```cpp
template <std::size_t ND, typename T, typename State = empty_state_t>
using ParArray = typename ParArrayND_impl<std::integral_constant<std::size_t, ND>,
                                          State>::template type<T>;
```
And below: this is a new interface, isn't it? Might be worth briefly adding this to the docs alongside `ParArray#D`.
yes, I've added some new documentation for it
Co-authored-by: Philipp Grete <pgrete@hs.uni-hamburg.de>
👍 I believe I've tracked down the source of the HtoD copies. I don't understand the behavior though, and it also seems to be related to the Kokkos version: 4.0.1 doesn't have the copies, but starting at least in 4.2 the copies show up. Even stranger, the same pattern is used for the …
src/kokkos_abstraction.hpp (outdated)

```cpp
    {(1 + bound_arr[OuterIs].e)...}),
    function, std::forward<Args>(args)...);
constexpr std::size_t Nouter = sizeof...(OuterIs);
Kokkos::Array<int, Nouter> tiling{(OuterIs, 1)...};
```
Is this working as expected? If I infer the intent correctly, this should create an array initialized to `1` everywhere.
My compiler complains with a warning:

```
/p/project/coldcluster/pgrete/athenapk/external/parthenon/src/kokkos_abstraction.hpp(493): warning #174-D: expression has no effect
  Kokkos::Array<int, Nouter> tiling{(OuterIs, 1)...};
```

AFAIK default init doesn't work for arrays, so we might need something like

```cpp
std::array<int, Nouter> tiling;
tiling.fill(1);
```
This was working for me, but better to avoid the warning. The warning makes sense, since `(OuterIs, 1)` will always just evaluate to `1`.
I now ran some more detailed tests. However, performance is a concern. Maybe we can discuss the performance implications during the sync today.
I looked into the performance issues in AthenaPK and verified that there are performance issues in the flux kernels with this PR, but only for the X1 flux, not X2 and X3. It's definitely due to increased register pressure.

That jump from 76 to 82 registers is enough to push the kernel from running 6 blocks per SM to 5 blocks per SM, hence fewer warps in the pipeline doing loads and thus the ~20% drop in performance. The X2 and X3 kernels could potentially gain 20% if a few registers could be optimized away. You'd see this PR affect other high-register kernels the same way: the higher the count, the more they'd be impacted. Generally, the hierarchical kernels I've seen in Parthenon codes are more complex and have high register counts, sometimes higher than this. I'll take a look at the arithmetic in this PR and see if we can reduce register usage without changing the interface.
PR Summary

- Provides a single overload for `par_dispatch`, `par_for_outer` & `par_for_inner` that can handle both integer and `IndexRange` launch bounds.
- `par_dispatch` & `par_for_outer` loops are handled by the same `par_dispatch_impl` struct that constructs the appropriate Kokkos policy and functor for `kokkos_dispatch`.
- Adds a `TypeList` struct to hold parameter packs and accompanying type traits to figure out function signatures.
- `par_for_inner` loops go through a `LoopPatternCollapse<team, thread, vector>` that can be used to collapse a general ND loop over any combination of Kokkos teams, threads and vectors, inspired by `#pragma acc collapse` directives.
- Subsumes the `LoopPatternTPTTR`, `LoopPatternTPTVR`, `LoopPatternTPTTRTVR`, `InnerLoopPatternTTR` and `InnerLoopPatternTVR` patterns.
- Exercises the `Tag`s & `Pattern`s that can show up from `DEFAULT_LOOP_PATTERN` for `par_for` & `par_reduce` and improves coverage for all patterns up to rank 7 loops.

Addresses #1134
PR Checklist