[FEA] Reduce building time #93

PointKernel · 2021-07-06T14:18:35Z

Is your feature request related to a problem? Please describe.
After introducing rapids.cmake into the project, building cuco has become unreasonably expensive for such a small project. More precisely, it takes 11 mins to build the code with 6 concurrent threads. Linking STATIC_MAP_TEST and DYNAMIC_MAP_TEST is the major time killer.

Describe the solution you'd like
Get rid of the dynamic initialization warnings in these two tests and reduce building time.


jrhemstad · 2021-07-06T14:43:05Z

I'm guessing the problem is that it is building for all architectures now. @robertmaynard @trxcllnt what's the flag to specify a single architecture with rapids-cmake?

robertmaynard · 2021-07-06T14:56:10Z

Use -DCMAKE_CUDA_ARCHITECTURES=NATIVE to compile only for the GPU architectures on your machine. If you want to build for a custom architecture that is not on your machine, you can do something like -DCMAKE_CUDA_ARCHITECTURES=75.

PointKernel · 2021-09-27T21:14:35Z

// Thrust logical algorithms (any_of/all_of/none_of) don't work with device
// lambdas: See https://github.com/thrust/thrust/issues/1062
template <typename Iterator, typename Predicate>
bool all_of(Iterator begin, Iterator end, Predicate p)
{
  auto size = thrust::distance(begin, end);
  return size == thrust::count_if(begin, end, p);
}

Thrust 1.12's logical functions do support device lambdas, so in #107 I replaced the above code with using thrust::all_of;. After that change, building STATIC_MAP_TEST takes about 8.5 mins on my local machine, up from only 4 mins before.

Is it a known issue that compiling thrust::all_of is expensive?

alliepiper · 2021-09-29T02:39:13Z

Looking at the implementations:

thrust::count_if is implemented as a simple reduction using the predicate in a transform iterator.
thrust::all_of is implemented using find_if, which actually calls thrust::reduce repeatedly in an attempt to stop the search when the first match is found. There's a fair bit of extra tuple code in find to return the matching iterator that count_if doesn't have.
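To make the difference between the two strategies concrete, here is a rough host-side analogue written against the C++ standard library rather than Thrust. It is only a sketch: the helper names (all_of_via_count, all_of_via_find, counting_is_positive, demo) are made up for illustration, and it says nothing about Thrust's actual implementation or its compile cost. It only shows that a count needs a single total, while a search has to track where the first failing element is.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <iterator>
#include <vector>

// Hypothetical helper: all_of expressed as one full-range count, mirroring
// the count_if-based workaround quoted earlier in this thread.
template <typename Iterator, typename Predicate>
bool all_of_via_count(Iterator begin, Iterator end, Predicate p)
{
  auto const size = std::distance(begin, end);
  return size == std::count_if(begin, end, p);
}

// Hypothetical helper: all_of expressed as a search for the first element
// that fails the predicate. The search must produce a position, not a total.
template <typename Iterator, typename Predicate>
bool all_of_via_find(Iterator begin, Iterator end, Predicate p)
{
  return std::find_if_not(begin, end, p) == end;
}

// Predicate that records how many times it was called, so the two
// strategies can be compared.
struct counting_is_positive {
  std::size_t* calls;
  bool operator()(int x) const
  {
    ++(*calls);
    return x > 0;
  }
};

// Small self-check: both strategies agree on the answer, but the search
// is allowed to stop at the first failing element.
inline void demo()
{
  std::vector<int> const values{3, -1, 4, 1, 5};
  std::size_t count_calls = 0;
  std::size_t find_calls  = 0;

  bool const by_count =
    all_of_via_count(values.begin(), values.end(), counting_is_positive{&count_calls});
  bool const by_find =
    all_of_via_find(values.begin(), values.end(), counting_is_positive{&find_calls});

  assert(by_count == by_find);           // same result either way
  assert(!by_count);                     // -1 fails the predicate
  assert(count_calls == values.size());  // the count visits every element
  assert(find_calls <= count_calls);     // the search never needs more calls
}
```

On the host the search version is the cheaper one at runtime. The compile-time cost discussed in this thread comes from the extra machinery a parallel find needs, which this sketch does not model.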

I'm curious -- did you see a runtime perf drop when you made the change, too? It sounds like there's room for improvement in the all_of implementation.

jrhemstad · 2021-09-29T02:40:57Z

This is just test code, so perf doesn't really matter all that much.

> It sounds like there's room for improvement in the all_of implementation.

Yeah, I looked into this extensively a while back: NVIDIA/cccl#720

PointKernel mentioned this issue Nov 4, 2021

[WIP] Reduce compilation time #115

Closed

PointKernel added the improvement and topic: performance labels Dec 3, 2021

PointKernel mentioned this issue Jan 5, 2022

Reduce compilation time #131

Merged

PointKernel closed this as completed in #131 Jan 24, 2022

PointKernel mentioned this issue Oct 18, 2024

[ENHANCEMENT]: Get rid of custom test utilities #622

Open
