-
-
Notifications
You must be signed in to change notification settings - Fork 430
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Support for data-parallelism for parallel algorithms #2330
Conversation
- adding tests - flyby: fix minmax algorithms
- more work on for_each - factoring out low-level customization points
- adding performance test
85ae0a4
to
ba1d4b8
Compare
- applying Vc build system settings to relevant files only
include(HPX_SetupVc) | ||
endif() | ||
if(NOT Vc_FOUND AND HPX_WITH_VC_DATAPAR) | ||
hpx_warn("Vc was not found while datapar support was requested, forcing HPX_WITH_VC_DATAPAR=OFF. Set Vc_ROOT to installation path of Vc") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should be an error.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok. I'll change that.
# if defined(__NVCC__) | ||
# define HPX_SINGLE_INHERITANCE __single_inheritance | ||
# endif | ||
# define HPX_CDECL __cdecl |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do we need this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The compiler flags for MSVC added by VC include a global __vectorcall
calling convention for all functions. We need to be able to force cdecl
for certain functions, though. This macro is used for those.
|
||
util::loop( | ||
policy, first, last, | ||
hpx::util::bind(std::move(f1), _1, std::ref(ret))); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
since f1 is a functor already, can we get of the bind here? ret
could be directly passed to count_iteration
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Only if we define yet another type... Or do I misunderstand what you have in mind?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, you're right, that would require additional code without much benefit.
|
||
util::loop( | ||
policy, first, last, | ||
hpx::util::bind(std::move(f1), _1, std::ref(ret))); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same as above.
{ | ||
return call_sequential(policy, std::forward<Args>(args)...); | ||
} | ||
#endif |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think I miss something here, why do we have to fall back to sequential execution if the user requested datapar?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This overload is selected if the iterators do not permit parallelization.
inner_product_partition< | ||
parallel::v1::sequential_execution_policy, | ||
Op1, Op2, T | ||
>{parallel::v1::seq, op1, op2, init}); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why don't you call std::inner_product instead?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point, will try.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Using std::inner_product
will not work as we need to iterate over Vc::Scalar::vector<T>
instead of the plain scalar. This is necessary to enable generic code in the invoked lambdas (especially needed for conditionals).
|
||
// loop_step properly advances the iterators | ||
auto part_sum = util::loop_step(policy, | ||
inner_product_indirect<Op2>{op2}, first1, first2); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Isn't this missing to invoke op1?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What about out of bounds check?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Op1, Op2, T | ||
>{parallel::v1::seq, op1, op2, result}); | ||
|
||
return result; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks wrong. Shouldn't it be something equivalent to std::inner_product? Or like the sequential version above?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The difference between the sequential and the parallel code is that the sequential code works off an initial value, the parallel code does not (it works on a partition only, and the overall initial value is applied last).
HPX_FORCEINLINE std::size_t data_alignment(Iter it) | ||
{ | ||
return reinterpret_cast<std::uintptr_t>(std::addressof(*it)) & | ||
(Vc::Vector<typename Iter::value_type>::MemoryAlignment - 1); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why not std::iterator_traits<Iter>::value_type
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, that's an oversight - will fix.
std::is_const< | ||
typename std::iterator_traits<Iter>::value_type | ||
>::value | ||
>::type> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What's wrong with storing the value of a const iterator?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This overload prevents storing the value back through a const iterator. It's not about preventing to store the iterator itself.
V11 tmp1(std::addressof(*it1), Vc::Aligned); | ||
V12 tmp2(std::addressof(*it2), Vc::Aligned); | ||
std::advance(it1, V11::Size); | ||
std::advance(it2, V12::Size); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The sizes should match, right? This should be checked with a static assert, I guess.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, since it is scalar, shouldn't V11::Size == V12::Size == 1?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The sizes are checked here. If the sizes are different we fall back to sequential execution anyways.
Also, since it is scalar, shouldn't V11::Size == V12::Size == 1?
Yes, you're right. a simple ++it1, ++it2
should do the trick here.
V12 tmp2(std::addressof(*it2), Vc::Aligned); | ||
std::advance(it1, V11::Size); | ||
std::advance(it2, V12::Size); | ||
return hpx::util::invoke(f, &tmp1, &tmp2); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Isn't this missing a store_on_exit
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is currently used for inner_product
only where op1 and op2 must not invalidate any iterators, including the end iterators, or modify any elements of the ranges involved.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right, I missed that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like the ability to have vectorization support in general and the implementation here looks nice (the review comments are just minor).
One of my biggest concerns is that the user currently can't know the type of the arguments in the algorithm callback. I suggest to add a datapar_traits<T>
facility in the future to allow to write function objects that are able to provide overloads for the vectorized and non vectorized versions.
This might also be helpful for choosing a scalar overload and perform manual loop unrolling and other compiler specific tricks to help the compilers auto vectorizer if only a scalar version of the algorithm callback exists.
V2 tmp2(std::addressof(*it2), Vc::Aligned); | ||
std::advance(it1, V1::Size); | ||
std::advance(it2, V2::Size); | ||
return hpx::util::invoke(f, &tmp1, &tmp2); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you simplify it to remove the unnecessary code? Looks like only the alignment parameter is different.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This can be factored out for sure.
V1 tmp1(std::addressof(*it1), Vc::Aligned); | ||
V2 tmp2(std::addressof(*it2), Vc::Aligned); | ||
std::advance(it1, V1::Size); | ||
std::advance(it2, V2::Size); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Again, is V1::Size == V2::Size?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is not supposed to be called if the sizes are not the same. I'll add a static_assert
.
HPX_HOST_DEVICE HPX_FORCEINLINE | ||
static typename std::result_of<F&&(V1*, V2*)>::type | ||
callv(F && f, Iter1& it1, Iter2& it2) | ||
{ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
missing store_on_exit
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same as above, op1 and op2 must not invalidate any iterators, including the end iterators, or modify any elements of the ranges involved.
|
||
typedef Vc::Scalar::Vector<value_type> V1; | ||
|
||
V1 tmp(std::addressof(*it), Vc::Aligned); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shouldn't this check for alignment?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can scalars be unaligned?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can't find the implementation of the ctor for the scalar vector ... but I guess those can't be unaligned ... while thinking about it ... wouldn't it make sense to use the regular built-in types in the scalar version instead?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The Vc::Scalar::Vector
is needed to allow for generic code in the operators. Plain scalars might not work. This is specifically important when conditionals are involved.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a flaw then. There are several locations in the loop, where loop_optimized returns false and the policy is switch to seq
(For example here) The passed function object has to work with built in types in either case.
This can be worked around with my proposal for datapar_traits
to provide the necessary overloads.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I agree, this was a flaw. I have changed all code to use Vc::Scalar::Vector<T>
now, which fixes the issue.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So, just to make sure what's implemented here is really what we want:
When the datapar execution policy is selected, the elements will get automatically mapped to a type from the Vc Library (be it a scalar or SIMD vector) for the purpose of handling it generically.
Now, for every other execution policy, we always have to have a third overload.
This might not seem dramatic for arithmetic functions, but overall, we require 3 overloads (or 3 template instantiations) to work correctly in the general case. Is this correct?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If the user wants to use the same kernel with different execution policies several versions of it will be generated, yes. Using the datapar
(and related policies) will instantiate the kernel twice, once for the vector type and once for the scalar 'vector' type.
|
||
V1 tmp(std::addressof(*it), Vc::Aligned); | ||
auto ret = hpx::util::invoke(f, &tmp); | ||
ret.store(std::addressof(*dest), Vc::Aligned); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same as above.
foreach(_flag ${Vc_ARCHITECTURE_FLAGS}) | ||
set_target_properties(inner_product_exe PROPERTIES COMPILE_FLAGS ${_flag}) | ||
endforeach() | ||
endif() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would it make sense to set those properties globally for hpx?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does it really make sense to compile all of HPX with some vectorization flag?
|
||
foreach(_flag ${Vc_ARCHITECTURE_FLAGS}) | ||
set_target_properties(${test}_test_exe PROPERTIES COMPILE_FLAGS ${_flag}) | ||
endforeach() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same as for the inner_product benchmark. I think it would be wise to have these properties set globally. Otherwise this will be a subtile source of errors.
|
||
if(Vc_FOUND) | ||
include_directories(SYSTEM ${Vc_INCLUDE_DIR}) | ||
link_directories(${Vc_LIB_DIR}) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
a call to hpx_library_dir
is missing here.
hpx_libraries(${Vc_LIBRARIES}) | ||
|
||
foreach(_flag ${Vc_DEFINITIONS}) | ||
add_definitions(${_flag}) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should be hpx_add_target_compile_definition
foreach(_flag ${Vc_DEFINITIONS}) | ||
add_definitions(${_flag}) | ||
endforeach() | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please also add ${Vc_ARCHITECTURE_FLAGS}
and ${Vc_COMPILE_FLAGS}
here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As said, I'm not sure if this is a good idea. What's the point in applying vectorization options to all of HPX?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would be mostly for convenience of the end user. Consider out-of-tree builds, without adding the above mentioned flags, the user has to provide those as well to get the expected results. This also leaks the usage of Vc (which is an implementation detail, IMHO) to third party applications and libraries.
When adding those flags, all of HPX might benefit from potential auto vectorization by the compiler.
11af7b4
to
5a799ae
Compare
@sithhell: All review comments have been addressed. This could be merged now. |
Hmm. Looks like my comment regarding the flags got lost. |
Sorry, what 'flags' did I miss? |
Ahh, the build-flags. I'm still not convinced that this is a good idea. For instance for MSVC, building HPX with all Vc flags wouldn't be even possible (this could be a flaw in Vc itself, though). I'd be fine with doing it for non-MSVC platforms only, if that's ok with you. |
My concern currently is mostly with third party applications that want to |
- flyby: spell fix in comment
@sithhell I have moved the compile flags to SetupVc. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good now!
Just as a general remark, I think having a trait that allows users to determine the actual value type of the iterated elements will be of great benefit when writing policy agnostic, generic code, but this is certainly outside of the scope of this PR.
This PR proposes to add a new execution policy (currently called
datapar_execution
, this name may change in the future) which applies certain transformations to each of the elements of the input/output sequences in order to be able to vectorize the loop iterations. The transformations pack a number of input elements into a 'vector-pack' which is then passed to the iteration. Note that the iteration function needs to be either a generic lambda or a polymorphic function object as the types used are not always 'known' to the user.The patch also supports using
datapar_execution(task)
to make the algorithm invocations asynchronous.The transformations are applied for random-access input/output sequences of arithmetic types only. For all other cases the execution will silently fall back to the
par
execution policy..This PR adds the
for_each
,for_each_n
,count
,count_if
, andtransform
algorithms. Later commits add support forinner_product
as well.Code using the new execution policies depends on the Vc library. Specify
Vc_ROOT=<vc_install_dir>
to cmake for it to be located.Other execution policies will be added in the future, such as
dataseq_execution
anddataseq_executon(task)
, i.e. vectorization without parallelization. We will also add support for executors and executor parameters.