Use 20000 as block size instead of the chunk size from traits::maximal_number_of_chunks(...).
taeguk committed Jul 26, 2017
1 parent f929e4c commit aae9fb6
Showing 1 changed file with 2 additions and 9 deletions.
11 changes: 2 additions & 9 deletions hpx/parallel/algorithms/partition.hpp
@@ -824,22 +824,15 @@ namespace hpx { namespace parallel { inline namespace v1
hpx::util::decay<ExPolicy>::type::executor_parameters_type
parameters_type;

typedef executor_parameter_traits<parameters_type> traits;

try {
if (first == last)
return algorithm_result::get(std::move(first));

auto count = std::distance(first, last);

std::size_t const cores = execution::processing_units_count(
policy.executor(), policy.parameters());
std::size_t max_chunks = traits::maximal_number_of_chunks(
policy.parameters(), policy.executor(), cores, count);

// TODO: Find good block size.
const std::size_t block_size = (std::max)
(std::size_t(count) / max_chunks, std::size_t(1));
// TODO: Find a better block size.
const std::size_t block_size = std::size_t(20000);
block_manager<FwdIter> block_manager(first, last, block_size);

std::vector<hpx::future<block<FwdIter>>>

7 comments on commit aae9fb6

@hkaiser
Member

We should definitely take the traits into account. This is a means for the user to influence things, and I wouldn't want that to become impossible.

@taeguk
Member Author

@taeguk commented on aae9fb6 Jul 26, 2017

@hkaiser But the problem is that the max_chunks returned by traits::maximal_number_of_chunks(...) is very big, and it hurts performance badly.

In this case, the chunk size from the existing traits is unsuitable for use as a block size. In my opinion, the block size should not be influenced by the element count.

@hkaiser
Member

In the end we need a means for the user to control the chunk sizes used. How do you suggest we do that?

@taeguk
Member Author

@taeguk commented on aae9fb6 Jul 26, 2017

@hkaiser
One idea is to add a new hook to the executor parameter traits (struct executor_parameter_traits): for example, a function get_block_size(...) in struct executor_parameter_traits { ... }.
The problem is that this new hook would only be used by parallel::partition (and, strictly, by the parallel algorithms built on top of parallel::partition). That seems wrong, because the executor parameter traits are a generic interface. Users could also confuse chunk size with block size.

Another idea is to add a parameter to the interface of parallel::partition. That violates the C++ standard interface, so it seems bad, although it has the appeal of being a very simple and clear solution. But because the block size would have to be propagated whenever other parallel algorithms use parallel::partition, it is a bad solution overall.

@mcopik
Contributor

@mcopik commented on aae9fb6 Jul 26, 2017

@taeguk Is there at least one chunker which implements maximal_number_of_chunks? Looking at the traits implementation, you should get 4 * cores. That should not be big.

The concept of "block size" is itself quite generic, many algorithms are "blocked", especially in linear algebra. I think it's perfectly fine to have both chunk and block size because they are intended to represent different concepts. As you noticed, chunk size may depend on the number of elements. For many problems, block size depends on cache size and CPU architecture.

@taeguk
Member Author

@taeguk commented on aae9fb6 Jul 27, 2017

@mcopik Sorry, I meant the chunk size, not max_chunks. As you say, maximal_number_of_chunks returns a small number, and the chunk size is therefore very big, because the chunk size is the data count divided by max_chunks.
In fact, the executor parameter traits do have get_chunk_size(...). If I use that with execution::par, I get (num_tasks + 4 * cores - 1) / (4 * cores) as the chunk size. Either way, using the chunk size as the block size in parallel::partition gives bad performance: with a big block size, the remaining_blocks left over after sub-partitioning in each thread are very large, so the sequential code runs for a long time.
In my view, the concept of "block size" is not generic. In parallel::partition, cache size and CPU architecture are not the only considerations for choosing a block size; as said above, a big block size makes the remaining_blocks bigger. The maximum total size of the remaining_blocks is block size * cores, and the remaining_blocks are processed sequentially, so a big block size can reduce parallelism. A very small block size is also bad, though, because it causes poor cache utilization and excessively many block fetches. It is therefore important to find an adequate block size. (If you want to know what remaining_blocks are, see

call(ExPolicy && policy, FwdIter first, FwdIter last,
    Pred && pred, Proj && proj)
{
    typedef util::detail::algorithm_result<
            ExPolicy, FwdIter
        > algorithm_result;
    typedef typename
        hpx::util::decay<ExPolicy>::type::executor_parameters_type
        parameters_type;

    try {
        if (first == last)
            return algorithm_result::get(std::move(first));

        std::size_t const cores = execution::processing_units_count(
            policy.executor(), policy.parameters());

        // TODO: Find a better block size.
        const std::size_t block_size = std::size_t(20000);
        block_manager<FwdIter> block_manager(first, last, block_size);

        std::vector<hpx::future<block<FwdIter>>>
            remaining_block_futures(cores);

        // Main parallel phase: perform sub-partitioning in each thread.
        for (std::size_t i = 0; i < remaining_block_futures.size(); ++i)
        {
            remaining_block_futures[i] = execution::async_execute(
                policy.executor(),
                [&block_manager, pred, proj]()
                {
                    return partition_thread(block_manager, pred, proj);
                });
        }

        // Wait for all sub-partitioning to finish.
        hpx::wait_all(remaining_block_futures);

        // Handle exceptions from the parallel phase.
        std::list<std::exception_ptr> errors;
        // TODO: Is it okay to use things in util::detail:: ?
        util::detail::handle_local_exceptions<ExPolicy>::call(
            remaining_block_futures, errors);

        std::vector<block<FwdIter>> remaining_blocks(
            remaining_block_futures.size());

        // Get the remaining blocks from the result of sub-partitioning.
        for (std::size_t i = 0; i < remaining_block_futures.size(); ++i)
            remaining_blocks[i] = remaining_block_futures[i].get();

        // Remove blocks that are empty.
        FwdIter boundary = block_manager.boundary();
        remaining_blocks.erase(std::remove_if(
            std::begin(remaining_blocks), std::end(remaining_blocks),
            [boundary](block<FwdIter> const& block) -> bool
            {
                return block.empty();
            }), std::end(remaining_blocks));

        // Sort the remaining blocks so they are listed from left to right.
        std::sort(std::begin(remaining_blocks),
            std::end(remaining_blocks));

        // Collapse the remaining blocks into each other.
        collapse_remaining_blocks(remaining_blocks, pred, proj);

        // Merge the remaining blocks into one block
        // which is adjacent to the boundary.
        block<FwdIter> unpartitioned_block =
            merge_remaining_blocks(remaining_blocks,
                block_manager.boundary(), first);

        // Perform a sequential partition on the unpartitioned range.
        FwdIter real_boundary = sequential_partition(
            unpartitioned_block.first, unpartitioned_block.last,
            pred, proj);

        return algorithm_result::get(std::move(real_boundary));
    }
)
Anyway, for the above reasons, I think the "block size of parallel::partition" is not a generic concept.
I would also note that, in general, 'block size' and 'chunk size' are often used with the same meaning, but the block size of parallel::partition differs from both, because it is chosen with the remaining_blocks in mind. (Its name may be confusing for that reason.)
If we do add the block size of parallel::partition to the traits, more explicit names might be better, such as 'partition_block_size', 'block_size_for_partition', 'partition_chunk_size', or 'chunk_size_for_partition'.

@hkaiser
Member

I find it very confusing to introduce both chunk_size and block_size. From the user's perspective these look very similar, even more so as (to the best of my knowledge) none of the algorithms would need both at the same time.
