Support for data-parallelism for parallel algorithms #2330

hkaiser · 2016-09-11T14:09:57Z

This PR proposes to add a new execution policy (currently called datapar_execution; this name may change in the future) which applies certain transformations to the elements of the input/output sequences in order to be able to vectorize the loop iterations. The transformations pack a number of input elements into a 'vector-pack' which is then passed to the iteration. Note that the iteration function needs to be either a generic lambda or a polymorphic function object, as the types used are not always 'known' to the user.
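To make the requirement for a generic lambda concrete, here is a minimal sketch. It does not use Vc or HPX: `pack`, `for_each_packed` and `square_all` are hypothetical stand-ins invented for illustration, and the pack width of 4 is an assumption. The point is only that the same callable is invoked with two different argument types, so it has to be generic (C++14 or later).

```cpp
#include <array>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a SIMD 'vector-pack'; the PR uses types from the
// Vc library instead. N == 1 models the scalar pack used for leftovers.
template <typename T, std::size_t N>
struct pack
{
    std::array<T, N> lanes;

    static pack load(T const* p)
    {
        pack r;
        for (std::size_t i = 0; i != N; ++i) r.lanes[i] = p[i];
        return r;
    }
    void store(T* p) const
    {
        for (std::size_t i = 0; i != N; ++i) p[i] = lanes[i];
    }
    // Lane-wise multiplication.
    friend pack operator*(pack a, pack const& b)
    {
        for (std::size_t i = 0; i != N; ++i) a.lanes[i] *= b.lanes[i];
        return a;
    }
};

// Applies f to full packs of N elements, then to one-element packs for the
// remainder, storing each result back in place.
template <std::size_t N, typename T, typename F>
void for_each_packed(std::vector<T>& v, F f)
{
    std::size_t i = 0;
    for (; i + N <= v.size(); i += N)
    {
        auto p = pack<T, N>::load(v.data() + i);
        f(p);
        p.store(v.data() + i);
    }
    for (; i != v.size(); ++i)
    {
        auto p = pack<T, 1>::load(v.data() + i);
        f(p);
        p.store(v.data() + i);
    }
}

// Squares every element through a generic lambda, which is instantiated
// once for pack<int, 4> and once for pack<int, 1>.
inline std::vector<int> square_all(std::vector<int> v)
{
    for_each_packed<4>(v, [](auto& x) { x = x * x; });
    return v;
}
```

A lambda with a fixed parameter type such as `int&` would not compile here, because it cannot accept either pack type.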

The patch also supports using datapar_execution(task) to make the algorithm invocations asynchronous.

The transformations are applied for random-access input/output sequences of arithmetic types only. For all other cases the execution will silently fall back to the par execution policy.
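The condition described above can be expressed with standard type traits. The following is a sketch of the idea only, not the dispatch code in this PR; `can_vectorize` and `chosen_path` are hypothetical names, and the function merely reports which path would be taken.

```cpp
#include <iterator>
#include <list>
#include <string>
#include <type_traits>
#include <vector>

// True only for random-access iterators over arithmetic value types, the
// condition the description gives for applying the transformations.
template <typename Iter>
struct can_vectorize
  : std::integral_constant<bool,
        std::is_base_of<std::random_access_iterator_tag,
            typename std::iterator_traits<Iter>::iterator_category>::value &&
        std::is_arithmetic<
            typename std::iterator_traits<Iter>::value_type>::value>
{};

// Hypothetical tag-dispatched overloads; each returns the name of its path.
template <typename Iter>
char const* chosen_path(Iter, Iter, std::true_type) { return "datapar"; }

template <typename Iter>
char const* chosen_path(Iter, Iter, std::false_type) { return "par"; }

// Entry point: selects an overload at compile time from the iterator type.
template <typename Iter>
std::string chosen_path(Iter first, Iter last)
{
    return chosen_path(first, last, can_vectorize<Iter>{});
}
```

Because the selection happens at compile time, the caller sees no diagnostic when the fallback is taken, which matches the "silently" in the description.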

This PR adds the for_each, for_each_n, count, count_if, and transform algorithms. Later commits add support for inner_product as well.

Code using the new execution policies depends on the Vc library. Specify Vc_ROOT=<vc_install_dir> to cmake for it to be located.

Other execution policies will be added in the future, such as dataseq_execution and dataseq_execution(task), i.e. vectorization without parallelization. We will also add support for executors and executor parameters.

- adding tests
- flyby: fix minmax algorithms

- more work on for_each
- factoring out low-level customization points

- adding performance test

- applying Vc build system settings to relevant files only

sithhell · 2016-09-15T14:27:41Z

CMakeLists.txt

+  include(HPX_SetupVc)
+endif()
+if(NOT Vc_FOUND AND HPX_WITH_VC_DATAPAR)
+  hpx_warn("Vc was not found while datapar support was requested, forcing HPX_WITH_VC_DATAPAR=OFF. Set Vc_ROOT to installation path of Vc")


This should be an error.

Ok. I'll change that.

sithhell · 2016-09-16T07:32:22Z

hpx/config/compiler_specific.hpp

+#  if defined(__NVCC__)
+#    define HPX_SINGLE_INHERITANCE __single_inheritance
+#  endif
+#  define HPX_CDECL __cdecl


Why do we need this?

The compiler flags for MSVC added by VC include a global __vectorcall calling convention for all functions. We need to be able to force cdecl for certain functions, though. This macro is used for those.
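As a rough illustration of how such a macro is typically used: the sketch below defines a hypothetical `EXAMPLE_CDECL` (the PR's macro is HPX_CDECL) that expands to `__cdecl` on MSVC and to nothing elsewhere. The assumption is that the translation unit may be compiled with a flag such as MSVC's `/Gv`, which makes `__vectorcall` the default convention.

```cpp
// Hypothetical macro name for illustration; the PR defines HPX_CDECL.
#if defined(_MSC_VER)
#  define EXAMPLE_CDECL __cdecl
#else
#  define EXAMPLE_CDECL
#endif

// A function pointer type whose calling convention is spelled out, so it
// does not change with the compiler's default convention.
using callback_type = int (EXAMPLE_CDECL*)(int);

// A function explicitly declared with the fixed convention.
int EXAMPLE_CDECL twice(int x) { return 2 * x; }

// Calls the callback through the explicitly typed pointer.
int apply(callback_type cb, int x) { return cb(x); }
```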

sithhell · 2016-09-16T07:35:53Z

hpx/parallel/algorithms/count.hpp

+
+                util::loop(
+                    policy, first, last,
+                    hpx::util::bind(std::move(f1), _1, std::ref(ret)));


Since f1 is a functor already, can we get rid of the bind here? ret could be passed directly to count_iteration.

Only if we define yet another type... Or do I misunderstand what you have in mind?

Yeah, you're right, that would require additional code without much benefit.
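For readers unfamiliar with the pattern under discussion, here is a small sketch using `std::bind` rather than `hpx::util::bind`. The names `count_iteration_example`, `loop_example` and `count_example` are stand-ins invented for this example, and the loop passes elements rather than iterators for brevity.

```cpp
#include <cstddef>
#include <functional>
#include <vector>

// Stand-in for the per-element functor: bumps the running count on a match.
struct count_iteration_example
{
    int value;
    void operator()(int elem, std::ptrdiff_t& ret) const
    {
        if (elem == value) ++ret;
    }
};

// Stand-in for a generic loop helper: calls f with each element.
template <typename Iter, typename F>
void loop_example(Iter first, Iter last, F f)
{
    for (; first != last; ++first) f(*first);
}

inline std::ptrdiff_t count_example(std::vector<int> const& v, int value)
{
    using namespace std::placeholders;
    std::ptrdiff_t ret = 0;
    count_iteration_example f1{value};
    // std::ref makes the bound argument refer to 'ret' instead of a copy,
    // so the increments are visible after the loop.
    loop_example(v.begin(), v.end(), std::bind(f1, _1, std::ref(ret)));
    return ret;
}
```

Avoiding the bind would mean giving the functor a reference member for the accumulator, which is the "yet another type" mentioned in the reply.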

sithhell · 2016-09-16T07:38:59Z

hpx/parallel/algorithms/count.hpp

+
+                util::loop(
+                    policy, first, last,
+                    hpx::util::bind(std::move(f1), _1, std::ref(ret)));


Same as above.

sithhell · 2016-09-16T07:41:22Z

hpx/parallel/algorithms/detail/dispatch.hpp

+        {
+            return call_sequential(policy, std::forward<Args>(args)...);
+        }
+#endif


I think I'm missing something here: why do we have to fall back to sequential execution if the user requested datapar?

This overload is selected if the iterators do not permit parallelization.

sithhell · 2016-09-16T07:53:31Z

hpx/parallel/algorithms/inner_product.hpp

+                        inner_product_partition<
+                            parallel::v1::sequential_execution_policy,
+                            Op1, Op2, T
+                        >{parallel::v1::seq, op1, op2, init});


why don't you call std::inner_product instead?

Good point, will try.

Using std::inner_product will not work as we need to iterate over Vc::Scalar::Vector<T> instead of the plain scalar. This is necessary to enable generic code in the invoked lambdas (especially needed for conditionals).

sithhell · 2016-09-16T07:56:27Z

hpx/parallel/algorithms/inner_product.hpp

+
+                // loop_step properly advances the iterators
+                auto part_sum = util::loop_step(policy,
+                    inner_product_indirect<Op2>{op2}, first1, first2);


Isn't this missing to invoke op1?

What about out of bounds check?

Isn't this missing to invoke op1?

No, this calculates the first intermediate value which is then used as the init for the subsequent operation.

What about out of bounds check?

That's done here and here

sithhell · 2016-09-16T07:59:46Z

hpx/parallel/algorithms/inner_product.hpp

+                                    Op1, Op2, T
+                                >{parallel::v1::seq, op1, op2, result});
+
+                            return result;


This looks wrong. Shouldn't it be something equivalent to std::inner_product? Or like the sequential version above?

The difference between the sequential and the parallel code is that the sequential code works off an initial value, the parallel code does not (it works on a partition only, and the overall initial value is applied last).
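The distinction can be sketched in plain scalar code. This is an illustration of the scheme as described in the replies, not the PR's implementation: `partition_sum` and `chunked_inner_product` are hypothetical names, the partitions run serially here, and op1/op2 are fixed to addition and multiplication in the driver.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <numeric>
#include <vector>

// Reduces one non-empty partition without an initial value: the first op2
// result seeds the accumulator, later ones are folded in with op1.
template <typename T, typename Op1, typename Op2>
T partition_sum(T const* a, T const* b, std::size_t n, Op1 op1, Op2 op2)
{
    assert(n != 0);
    T acc = op2(a[0], b[0]);
    for (std::size_t i = 1; i != n; ++i)
        acc = op1(acc, op2(a[i], b[i]));
    return acc;
}

// Reduces each chunk independently (serially here; a parallel version would
// run the chunks concurrently). The overall initial value only enters in the
// final combine step.
template <typename T>
T chunked_inner_product(std::vector<T> const& a, std::vector<T> const& b,
    T init, std::size_t chunk)
{
    assert(a.size() == b.size() && chunk != 0);
    std::vector<T> partials;
    for (std::size_t pos = 0; pos < a.size(); pos += chunk)
    {
        std::size_t n = (a.size() - pos < chunk) ? a.size() - pos : chunk;
        partials.push_back(partition_sum(a.data() + pos, b.data() + pos, n,
            std::plus<T>{}, std::multiplies<T>{}));
    }
    return std::accumulate(partials.begin(), partials.end(), init);
}
```

Seeding every partition with the initial value instead would count it once per partition, which is why the partition code cannot simply mirror the sequential version.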

sithhell · 2016-09-16T08:08:03Z

hpx/parallel/datapar/detail/iterator_helpers.hpp

+        HPX_FORCEINLINE std::size_t data_alignment(Iter it)
+        {
+            return reinterpret_cast<std::uintptr_t>(std::addressof(*it)) &
+                (Vc::Vector<typename Iter::value_type>::MemoryAlignment - 1);


Why not std::iterator_traits<Iter>::value_type?

Thanks, that's an oversight - will fix.
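The practical difference is that raw pointers have no nested `value_type` member, while `std::iterator_traits` handles them. A minimal sketch follows; `data_alignment_example` is a hypothetical name, and the alignment of four elements is an assumption standing in for `Vc::Vector<T>::MemoryAlignment`.

```cpp
#include <cstddef>
#include <cstdint>
#include <iterator>
#include <memory>

// Returns the byte offset of *it from the previous aligned address.
// std::iterator_traits also works for raw pointers, for which
// 'typename Iter::value_type' would not compile.
template <typename Iter>
std::size_t data_alignment_example(Iter it)
{
    typedef typename std::iterator_traits<Iter>::value_type value_type;

    // Assumption for this sketch: a pack holds 4 elements and must be
    // aligned to its own size in bytes.
    constexpr std::size_t alignment = 4 * sizeof(value_type);
    static_assert((alignment & (alignment - 1)) == 0,
        "alignment must be a power of two");

    return static_cast<std::size_t>(
        reinterpret_cast<std::uintptr_t>(std::addressof(*it)) &
        (alignment - 1));
}
```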

sithhell · 2016-09-16T08:13:19Z

hpx/parallel/datapar/detail/iterator_helpers.hpp

+                std::is_const<
+                    typename std::iterator_traits<Iter>::value_type
+                >::value
+            >::type>


What's wrong with storing the value of a const iterator?

This overload prevents storing the value back through a const iterator. It's not about preventing the iterator itself from being stored.

sithhell · 2016-09-16T08:17:45Z

hpx/parallel/datapar/detail/iterator_helpers.hpp

+                V11 tmp1(std::addressof(*it1), Vc::Aligned);
+                V12 tmp2(std::addressof(*it2), Vc::Aligned);
+                std::advance(it1, V11::Size);
+                std::advance(it2, V12::Size);


The sizes should match, right? This should be checked with a static assert, I guess.

Also, since it is scalar, shouldn't V11::Size == V12::Size == 1?

The sizes are checked here. If the sizes are different we fall back to sequential execution anyways.

Also, since it is scalar, shouldn't V11::Size == V12::Size == 1?

Yes, you're right. A simple ++it1, ++it2 should do the trick here.

sithhell · 2016-09-16T08:20:44Z

hpx/parallel/datapar/detail/iterator_helpers.hpp

+                V12 tmp2(std::addressof(*it2), Vc::Aligned);
+                std::advance(it1, V11::Size);
+                std::advance(it2, V12::Size);
+                return hpx::util::invoke(f, &tmp1, &tmp2);


Isn't this missing a store_on_exit?

This is currently used for inner_product only where op1 and op2 must not invalidate any iterators, including the end iterators, or modify any elements of the ranges involved.

Right, I missed that.

sithhell

I like the ability to have vectorization support in general and the implementation here looks nice (the review comments are just minor).

One of my biggest concerns is that the user currently can't know the type of the arguments in the algorithm callback. I suggest adding a datapar_traits<T> facility in the future that allows writing function objects which provide overloads for the vectorized and non-vectorized versions.

This might also be helpful for choosing a scalar overload and performing manual loop unrolling and other compiler-specific tricks to help the compiler's auto-vectorizer if only a scalar version of the algorithm callback exists.
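One possible shape for such a facility is sketched below. Nothing here exists in HPX or Vc: `pack`, `datapar_traits_example` and `sum_lanes` are hypothetical names, and the trait's members are guesses at what a function object author would need.

```cpp
#include <array>
#include <cstddef>
#include <type_traits>

// Hypothetical pack type standing in for a Vc vector.
template <typename T, std::size_t N>
struct pack
{
    std::array<T, N> lanes;
};

// Hypothetical trait in the spirit of the suggested datapar_traits<T>:
// reports whether T is a pack and what its element type is.
template <typename T>
struct datapar_traits_example
{
    static constexpr bool is_vectorized = false;
    typedef T value_type;
};

template <typename T, std::size_t N>
struct datapar_traits_example<pack<T, N>>
{
    static constexpr bool is_vectorized = true;
    typedef T value_type;
};

// A function object with one overload per case, selected via the trait.
struct sum_lanes
{
    // Vectorized overload: adds up all lanes of the pack.
    template <typename V>
    typename std::enable_if<datapar_traits_example<V>::is_vectorized,
        typename datapar_traits_example<V>::value_type>::type
    operator()(V const& v) const
    {
        typename datapar_traits_example<V>::value_type s{};
        for (auto const& x : v.lanes) s += x;
        return s;
    }

    // Scalar overload: a plain value has nothing to reduce.
    template <typename V>
    typename std::enable_if<!datapar_traits_example<V>::is_vectorized, V>::type
    operator()(V const& v) const
    {
        return v;
    }
};
```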

sithhell · 2016-09-16T08:23:21Z

hpx/parallel/datapar/detail/iterator_helpers.hpp

+                V2 tmp2(std::addressof(*it2), Vc::Aligned);
+                std::advance(it1, V1::Size);
+                std::advance(it2, V2::Size);
+                return hpx::util::invoke(f, &tmp1, &tmp2);


Can you simplify it to remove the unnecessary code? Looks like only the alignment parameter is different.

This can be factored out for sure.

sithhell · 2016-09-16T08:23:47Z

hpx/parallel/datapar/detail/iterator_helpers.hpp

+                V1 tmp1(std::addressof(*it1), Vc::Aligned);
+                V2 tmp2(std::addressof(*it2), Vc::Aligned);
+                std::advance(it1, V1::Size);
+                std::advance(it2, V2::Size);


Again, is V1::Size == V2::Size?

This is not supposed to be called if the sizes are not the same. I'll add a static_assert.
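A sketch of what such a check could look like, using a hypothetical `pack` type in place of the Vc vectors and a hypothetical `dot_step` helper in place of the reviewed function:

```cpp
#include <array>
#include <cstddef>

// Hypothetical pack type exposing its lane count as 'Size'.
template <typename T, std::size_t N>
struct pack
{
    static constexpr std::size_t Size = N;
    std::array<T, N> lanes;

    static pack load(T const* p)
    {
        pack r;
        for (std::size_t i = 0; i != N; ++i) r.lanes[i] = p[i];
        return r;
    }
};

// Loads one pack from each sequence, advances both pointers in lockstep and
// returns the sum of the lane-wise products.
template <typename V1, typename V2, typename T1, typename T2>
double dot_step(T1 const*& it1, T2 const*& it2)
{
    // Rejects mismatched pack widths at compile time; otherwise the two
    // pointers would advance by different amounts.
    static_assert(V1::Size == V2::Size,
        "the sizes of the vector-packs must be equal");

    V1 a = V1::load(it1);
    V2 b = V2::load(it2);
    it1 += V1::Size;
    it2 += V2::Size;

    double sum = 0;
    for (std::size_t i = 0; i != V1::Size; ++i)
        sum += static_cast<double>(a.lanes[i]) * b.lanes[i];
    return sum;
}
```

With the assertion in place, instantiating the helper with, say, a 4-lane and an 8-lane pack fails to compile instead of silently desynchronizing the two iterators.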

sithhell · 2016-09-16T08:24:20Z

hpx/parallel/datapar/detail/iterator_helpers.hpp

+            HPX_HOST_DEVICE HPX_FORCEINLINE
+            static typename std::result_of<F&&(V1*, V2*)>::type
+            callv(F && f, Iter1& it1, Iter2& it2)
+            {


missing store_on_exit?

Same as above, op1 and op2 must not invalidate any iterators, including the end iterators, or modify any elements of the ranges involved.

sithhell · 2016-09-16T08:25:57Z

hpx/parallel/datapar/detail/iterator_helpers.hpp

+
+                typedef Vc::Scalar::Vector<value_type> V1;
+
+                V1 tmp(std::addressof(*it), Vc::Aligned);


Shouldn't this check for alignment?

Can scalars be unaligned?

I can't find the implementation of the ctor for the scalar vector ... but I guess those can't be unaligned ... while thinking about it ... wouldn't it make sense to use the regular built-in types in the scalar version instead?

The Vc::Scalar::Vector is needed to allow for generic code in the operators. Plain scalars might not work. This is specifically important when conditionals are involved.

This is a flaw then. There are several locations in the loop where loop_optimized returns false and the policy is switched to seq (for example here). The passed function object has to work with built-in types in either case.
This can be worked around with my proposal for datapar_traits to provide the necessary overloads.

Yes, I agree, this was a flaw. I have changed all code to use Vc::Scalar::Vector<T> now, which fixes the issue.

So, just to make sure what's implemented here is really what we want:
When the datapar execution policy is selected, the elements will get automatically mapped to a type from the Vc Library (be it a scalar or SIMD vector) for the purpose of handling it generically.
Now, for every other execution policy, we always have to have a third overload.
This might not seem dramatic for arithmetic functions, but overall, we require 3 overloads (or 3 template instantiations) to work correctly in the general case. Is this correct?

If the user wants to use the same kernel with different execution policies several versions of it will be generated, yes. Using the datapar (and related policies) will instantiate the kernel twice, once for the vector type and once for the scalar 'vector' type.

sithhell · 2016-09-16T08:26:11Z

hpx/parallel/datapar/detail/iterator_helpers.hpp

+
+                V1 tmp(std::addressof(*it), Vc::Aligned);
+                auto ret = hpx::util::invoke(f, &tmp);
+                ret.store(std::addressof(*dest), Vc::Aligned);


Same as above.

sithhell · 2016-09-16T08:59:20Z

tests/performance/local/CMakeLists.txt

+  foreach(_flag ${Vc_ARCHITECTURE_FLAGS})
+    set_target_properties(inner_product_exe PROPERTIES COMPILE_FLAGS ${_flag})
+  endforeach()
+endif()


Would it make sense to set those properties globally for hpx?

Does it really make sense to compile all of HPX with some vectorization flag?

sithhell · 2016-09-16T09:02:42Z

tests/unit/parallel/datapar_algorithms/CMakeLists.txt

+
+  foreach(_flag ${Vc_ARCHITECTURE_FLAGS})
+    set_target_properties(${test}_test_exe PROPERTIES COMPILE_FLAGS ${_flag})
+  endforeach()


Same as for the inner_product benchmark. I think it would be wise to have these properties set globally. Otherwise this will be a subtle source of errors.

sithhell · 2016-09-16T09:05:01Z

cmake/HPX_SetupVc.cmake

+
+if(Vc_FOUND)
+  include_directories(SYSTEM ${Vc_INCLUDE_DIR})
+  link_directories(${Vc_LIB_DIR})


A call to hpx_library_dir is missing here.

sithhell · 2016-09-16T09:06:29Z

cmake/HPX_SetupVc.cmake

+  hpx_libraries(${Vc_LIBRARIES})
+
+  foreach(_flag ${Vc_DEFINITIONS})
+    add_definitions(${_flag})


This should be hpx_add_target_compile_definition

sithhell · 2016-09-16T09:07:35Z

cmake/HPX_SetupVc.cmake

+  foreach(_flag ${Vc_DEFINITIONS})
+    add_definitions(${_flag})
+  endforeach()
+


Please also add ${Vc_ARCHITECTURE_FLAGS} and ${Vc_COMPILE_FLAGS} here.

As said, I'm not sure if this is a good idea. What's the point in applying vectorization options to all of HPX?

It would be mostly for the convenience of the end user. Consider out-of-tree builds: without the above-mentioned flags added here, the user has to provide those as well to get the expected results. This also leaks the usage of Vc (which is an implementation detail, IMHO) to third-party applications and libraries.
When adding those flags, all of HPX might benefit from potential auto-vectorization by the compiler.

hkaiser · 2016-09-17T14:13:08Z

@sithhell: All review comments have been addressed. This could be merged now.

sithhell · 2016-09-17T14:32:32Z

@sithhell: All review comments have been addressed. This could be merged now.

Hmm. Looks like my comment regarding the flags got lost.

hkaiser · 2016-09-17T14:55:09Z

Hmm. Looks like my comment regarding the flags got lost.

Sorry, what 'flags' did I miss?

hkaiser · 2016-09-17T15:01:03Z

Hmm. Looks like my comment regarding the flags got lost.

Sorry, what 'flags' did I miss?

Ahh, the build flags. I'm still not convinced that this is a good idea. For instance for MSVC, building HPX with all Vc flags wouldn't even be possible (this could be a flaw in Vc itself, though). I'd be fine with doing it for non-MSVC platforms only, if that's ok with you.

sithhell · 2016-09-17T15:22:30Z

Ahh, the build flags. I'm still not convinced that this is a good idea. For instance for MSVC, building HPX with all Vc flags wouldn't even be possible (this could be a flaw in Vc itself, though). I'd be fine with doing it for non-MSVC platforms only, if that's ok with you.

My concern currently is mostly with third-party applications that want to use datapar. Those have to search for Vc and set the flags for the TUs requiring it. This does not sound very convenient, especially given that our use of Vc should be considered an implementation detail.

- flyby: spell fix in comment

hkaiser · 2016-09-17T16:46:06Z

@sithhell I have moved the compile flags to SetupVc.

sithhell

Looks good now!

Just as a general remark, I think having a trait that allows users to determine the actual value type of the iterated elements will be of great benefit when writing policy-agnostic, generic code, but this is certainly outside the scope of this PR.

hkaiser added 5 commits September 7, 2016 17:30

Proof of concept data parallelism implementation for parallel algorithms

fe984b6

Adding loop with cancellation token

0c1b923

Adding asynchronous datapar execution

1b12caa

Adding datapar support for count, count_if, and for_each_n

e8896f6

- adding tests
- flyby: fix minmax algorithms

Adapting transform and transform_binary to support datapar

9d7b05b

- more work on for_each
- factoring out low-level customization points

hkaiser added type: enhancement category: algorithms labels Sep 11, 2016

hkaiser added this to the 1.0.0 milestone Sep 11, 2016

Implementing inner_product(datapar, ...)

09934f7

- adding performance test

hkaiser force-pushed the datapar branch 2 times, most recently from 85ae0a4 to ba1d4b8 Compare September 15, 2016 01:41

Adding test

98583d8

- applying Vc build system settings to relevant files only

hkaiser force-pushed the datapar branch from ba1d4b8 to 98583d8 Compare September 15, 2016 12:24

hkaiser mentioned this pull request Sep 15, 2016

Implement datapar for parallel algorithms #2333

Open

36 tasks

Cleaning up transform_loop implementation for datapar

10b7b35

sithhell reviewed Sep 16, 2016

sithhell requested changes Sep 16, 2016

hkaiser force-pushed the datapar branch 2 times, most recently from 11af7b4 to 5a799ae Compare September 17, 2016 14:11

Addressing comments from code review

c565b21

- flyby: spell fix in comment

hkaiser force-pushed the datapar branch from 5a799ae to c565b21 Compare September 17, 2016 16:45

sithhell approved these changes Sep 18, 2016

sithhell merged commit 884a26a into master Sep 20, 2016

sithhell deleted the datapar branch September 20, 2016 05:19

