Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support for data-parallelism for parallel algorithms #2330

Merged
merged 9 commits into from
Sep 20, 2016
Merged

Conversation

hkaiser
Copy link
Member

@hkaiser hkaiser commented Sep 11, 2016

This PR proposes to add a new execution policy (currently called datapar_execution, this name may change in the future) which applies certain transformations to each of the elements of the input/output sequences in order to be able to vectorize the loop iterations. The transformations pack a number of input elements into a 'vector-pack' which is then passed to the iteration. Note that the iteration function needs to be either a generic lambda or a polymorphic function object as the types used are not always 'known' to the user.

The patch also supports using datapar_execution(task) to make the algorithm invocations asynchronous.

The transformations are applied for random-access input/output sequences of arithmetic types only. For all other cases the execution will silently fall back to the par execution policy..

This PR adds the for_each, for_each_n, count, count_if, and transform algorithms. Later commits add support for inner_product as well.

Code using the new execution policies depends on the Vc library. Specify Vc_ROOT=<vc_install_dir> to cmake for it to be located.

Other execution policies will be added in the future, such as dataseq_execution and dataseq_executon(task), i.e. vectorization without parallelization. We will also add support for executors and executor parameters.

- adding performance test
@hkaiser hkaiser force-pushed the datapar branch 2 times, most recently from 85ae0a4 to ba1d4b8 Compare September 15, 2016 01:41
- applying Vc build system settings to relevant files only
include(HPX_SetupVc)
endif()
if(NOT Vc_FOUND AND HPX_WITH_VC_DATAPAR)
hpx_warn("Vc was not found while datapar support was requested, forcing HPX_WITH_VC_DATAPAR=OFF. Set Vc_ROOT to installation path of Vc")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be an error.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok. I'll change that.

# if defined(__NVCC__)
# define HPX_SINGLE_INHERITANCE __single_inheritance
# endif
# define HPX_CDECL __cdecl
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we need this?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The compiler flags for MSVC added by VC include a global __vectorcall calling convention for all functions. We need to be able to force cdecl for certain functions, though. This macro is used for those.


util::loop(
policy, first, last,
hpx::util::bind(std::move(f1), _1, std::ref(ret)));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

since f1 is a functor already, can we get of the bind here? ret could be directly passed to count_iteration.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Only if we define yet another type... Or do I misunderstand what you have in mind?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, you're right, that would require additional code without much benefit.


util::loop(
policy, first, last,
hpx::util::bind(std::move(f1), _1, std::ref(ret)));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same as above.

{
return call_sequential(policy, std::forward<Args>(args)...);
}
#endif
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I miss something here, why do we have to fall back to sequential execution if the user requested datapar?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This overload is selected if the iterators do not permit parallelization.

inner_product_partition<
parallel::v1::sequential_execution_policy,
Op1, Op2, T
>{parallel::v1::seq, op1, op2, init});
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why don't you call std::inner_product instead?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point, will try.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using std::inner_product will not work as we need to iterate over Vc::Scalar::vector<T> instead of the plain scalar. This is necessary to enable generic code in the invoked lambdas (especially needed for conditionals).


// loop_step properly advances the iterators
auto part_sum = util::loop_step(policy,
inner_product_indirect<Op2>{op2}, first1, first2);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Isn't this missing to invoke op1?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What about out of bounds check?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Isn't this missing to invoke op1?

No, this calculates the first intermediate value which is then used as the init for the subsequent operation.

What about out of bounds check?

That's done here and here

Op1, Op2, T
>{parallel::v1::seq, op1, op2, result});

return result;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks wrong. Shouldn't it be something equivalent to std::inner_product? Or like the sequential version above?

Copy link
Member Author

@hkaiser hkaiser Sep 16, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The difference between the sequential and the parallel code is that the sequential code works off an initial value, the parallel code does not (it works on a partition only, and the overall initial value is applied last).

HPX_FORCEINLINE std::size_t data_alignment(Iter it)
{
return reinterpret_cast<std::uintptr_t>(std::addressof(*it)) &
(Vc::Vector<typename Iter::value_type>::MemoryAlignment - 1);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not std::iterator_traits<Iter>::value_type?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, that's an oversight - will fix.

std::is_const<
typename std::iterator_traits<Iter>::value_type
>::value
>::type>
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's wrong with storing the value of a const iterator?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This overload prevents storing the value back through a const iterator. It's not about preventing to store the iterator itself.

V11 tmp1(std::addressof(*it1), Vc::Aligned);
V12 tmp2(std::addressof(*it2), Vc::Aligned);
std::advance(it1, V11::Size);
std::advance(it2, V12::Size);
Copy link
Member

@sithhell sithhell Sep 16, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The sizes should match, right? This should be checked with a static assert, I guess.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, since it is scalar, shouldn't V11::Size == V12::Size == 1?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The sizes are checked here. If the sizes are different we fall back to sequential execution anyways.

Also, since it is scalar, shouldn't V11::Size == V12::Size == 1?

Yes, you're right. a simple ++it1, ++it2 should do the trick here.

V12 tmp2(std::addressof(*it2), Vc::Aligned);
std::advance(it1, V11::Size);
std::advance(it2, V12::Size);
return hpx::util::invoke(f, &tmp1, &tmp2);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Isn't this missing a store_on_exit?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is currently used for inner_product only where op1 and op2 must not invalidate any iterators, including the end iterators, or modify any elements of the ranges involved.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, I missed that.

Copy link
Member

@sithhell sithhell left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like the ability to have vectorization support in general and the implementation here looks nice (the review comments are just minor).

One of my biggest concerns is that the user currently can't know the type of the arguments in the algorithm callback. I suggest to add a datapar_traits<T> facility in the future to allow to write function objects that are able to provide overloads for the vectorized and non vectorized versions.

This might also be helpful for choosing a scalar overload and perform manual loop unrolling and other compiler specific tricks to help the compilers auto vectorizer if only a scalar version of the algorithm callback exists.

V2 tmp2(std::addressof(*it2), Vc::Aligned);
std::advance(it1, V1::Size);
std::advance(it2, V2::Size);
return hpx::util::invoke(f, &tmp1, &tmp2);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you simplify it to remove the unnecessary code? Looks like only the alignment parameter is different.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This can be factored out for sure.

V1 tmp1(std::addressof(*it1), Vc::Aligned);
V2 tmp2(std::addressof(*it2), Vc::Aligned);
std::advance(it1, V1::Size);
std::advance(it2, V2::Size);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Again, is V1::Size == V2::Size?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not supposed to be called if the sizes are not the same. I'll add a static_assert.

HPX_HOST_DEVICE HPX_FORCEINLINE
static typename std::result_of<F&&(V1*, V2*)>::type
callv(F && f, Iter1& it1, Iter2& it2)
{
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

missing store_on_exit?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same as above, op1 and op2 must not invalidate any iterators, including the end iterators, or modify any elements of the ranges involved.


typedef Vc::Scalar::Vector<value_type> V1;

V1 tmp(std::addressof(*it), Vc::Aligned);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't this check for alignment?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can scalars be unaligned?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can't find the implementation of the ctor for the scalar vector ... but I guess those can't be unaligned ... while thinking about it ... wouldn't it make sense to use the regular built-in types in the scalar version instead?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The Vc::Scalar::Vector is needed to allow for generic code in the operators. Plain scalars might not work. This is specifically important when conditionals are involved.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a flaw then. There are several locations in the loop, where loop_optimized returns false and the policy is switch to seq (For example here) The passed function object has to work with built in types in either case.
This can be worked around with my proposal for datapar_traits to provide the necessary overloads.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I agree, this was a flaw. I have changed all code to use Vc::Scalar::Vector<T> now, which fixes the issue.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So, just to make sure what's implemented here is really what we want:
When the datapar execution policy is selected, the elements will get automatically mapped to a type from the Vc Library (be it a scalar or SIMD vector) for the purpose of handling it generically.
Now, for every other execution policy, we always have to have a third overload.
This might not seem dramatic for arithmetic functions, but overall, we require 3 overloads (or 3 template instantiations) to work correctly in the general case. Is this correct?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the user wants to use the same kernel with different execution policies several versions of it will be generated, yes. Using the datapar (and related policies) will instantiate the kernel twice, once for the vector type and once for the scalar 'vector' type.


V1 tmp(std::addressof(*it), Vc::Aligned);
auto ret = hpx::util::invoke(f, &tmp);
ret.store(std::addressof(*dest), Vc::Aligned);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same as above.

foreach(_flag ${Vc_ARCHITECTURE_FLAGS})
set_target_properties(inner_product_exe PROPERTIES COMPILE_FLAGS ${_flag})
endforeach()
endif()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it make sense to set those properties globally for hpx?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does it really make sense to compile all of HPX with some vectorization flag?


foreach(_flag ${Vc_ARCHITECTURE_FLAGS})
set_target_properties(${test}_test_exe PROPERTIES COMPILE_FLAGS ${_flag})
endforeach()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same as for the inner_product benchmark. I think it would be wise to have these properties set globally. Otherwise this will be a subtile source of errors.


if(Vc_FOUND)
include_directories(SYSTEM ${Vc_INCLUDE_DIR})
link_directories(${Vc_LIB_DIR})
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

a call to hpx_library_dir is missing here.

hpx_libraries(${Vc_LIBRARIES})

foreach(_flag ${Vc_DEFINITIONS})
add_definitions(${_flag})
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be hpx_add_target_compile_definition

foreach(_flag ${Vc_DEFINITIONS})
add_definitions(${_flag})
endforeach()

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please also add ${Vc_ARCHITECTURE_FLAGS} and ${Vc_COMPILE_FLAGS} here.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As said, I'm not sure if this is a good idea. What's the point in applying vectorization options to all of HPX?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be mostly for convenience of the end user. Consider out-of-tree builds, without adding the above mentioned flags, the user has to provide those as well to get the expected results. This also leaks the usage of Vc (which is an implementation detail, IMHO) to third party applications and libraries.
When adding those flags, all of HPX might benefit from potential auto vectorization by the compiler.

@hkaiser hkaiser force-pushed the datapar branch 2 times, most recently from 11af7b4 to 5a799ae Compare September 17, 2016 14:11
@hkaiser
Copy link
Member Author

hkaiser commented Sep 17, 2016

@sithhell: All review comments have been addressed. This could be merged now.

@sithhell
Copy link
Member

sithhell commented Sep 17, 2016

@sithhell: All review comments have been addressed. This could be merged now.

Hmm. Looks like my comment regarding the flags got lost.

@hkaiser
Copy link
Member Author

hkaiser commented Sep 17, 2016

Hmm. Looks like my comment regarding the flags got lost.

Sorry, what 'flags' did I miss?

@hkaiser
Copy link
Member Author

hkaiser commented Sep 17, 2016

Hmm. Looks like my comment regarding the flags got lost.

Sorry, what 'flags' did I miss?

Ahh, the build-flags. I'm still not convinced that this is a good idea. For instance for MSVC, building HPX with all Vc flags wouldn't be even possible (this could be a flaw in Vc itself, though). I'd be fine with doing it for non-MSVC platforms only, if that's ok with you.

@sithhell
Copy link
Member

sithhell commented Sep 17, 2016

Ahh, the build-flags. I'm still not convinced that this is a good idea. For instance for MSVC, building HPX with all Vc flags wouldn't be even possible (this could be a flaw in Vc itself, though). I'd be fine with doing it for non-MSVC platforms only, if that's ok with you.

My concern currently is mostly with third party applications that want to
use datapar. Those have to search for vc, set the flags for the TUs
requiring it. This sounds not very convenient, especially from the
viewpoint that the fact we use vc should be considered an implementation
detail.

- flyby: spell fix in comment
@hkaiser
Copy link
Member Author

hkaiser commented Sep 17, 2016

@sithhell I have moved the compile flags to SetupVc.

Copy link
Member

@sithhell sithhell left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good now!

Just as a general remark, I think having a trait that allows users to determine the actual value type of the iterated elements will be of great benefit when writing policy agnostic, generic code, but this is certainly outside of the scope of this PR.

@sithhell sithhell merged commit 884a26a into master Sep 20, 2016
@sithhell sithhell deleted the datapar branch September 20, 2016 05:19
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants