Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[FEA] Reduce building time #93

Closed
PointKernel opened this issue Jul 6, 2021 · 5 comments · Fixed by #131
Closed

[FEA] Reduce building time #93

PointKernel opened this issue Jul 6, 2021 · 5 comments · Fixed by #131
Labels
topic: performance Performance related issue

Comments

@PointKernel
Copy link
Member

Is your feature request related to a problem? Please describe.
After introducing rapids.cmake into the project, building cuco becomes unreasonably expensive for such a small project. More precisely, it takes 11 mins to build the code with 6 concurrent threads. Linking STATIC_MAP_TEST and DYNAMIC_MAP_TEST are the major time killers.

Describe the solution you'd like
Get rid of the dynamic initialization warnings in these two tests and reduce building time.

@jrhemstad
Copy link
Collaborator

I'm guessing the problem is that it is building for all architectures now. @robertmaynard @trxcllnt what's the flag to specify a single architecture with rapids-cmake?

@robertmaynard
Copy link
Collaborator

-DCMAKE_CUDA_ARCHITECTURES=NATIVE to compile only for the GPU architectures on your machine. If you want to build for a custom architecture not on your machine you do something like -DCMAKE_CUDA_ARCHITECTURES=75

@PointKernel
Copy link
Member Author

// Thrust logical algorithms (any_of/all_of/none_of) don't work with device
// lambdas: See https://github.com/thrust/thrust/issues/1062
template <typename Iterator, typename Predicate>
bool all_of(Iterator begin, Iterator end, Predicate p)
{
  auto size = thrust::distance(begin, end);
  return size == thrust::count_if(begin, end, p);
}

Thrust-1.12 logical functions do support device lambdas thus I replaced the above code with using thrust::all_of; in #107. By doing so, building STATIC_MAP_TEST now takes about 8.5 mins on my local machine, and it was only 4 mins before.

Is it a known issue that compiling thrust::all_of is expensive?

@alliepiper
Copy link
Collaborator

Looking at the implementations:

  • thrust::count_if is implemented as a simple reduction using the predicate in a transform iterator.
  • thrust::all_of is implemented using find_if, which actually calls thrust::reduce repeatedly in an attempt to stop the search when the first match is found. There's a fair bit of extra tuple code in find to return the matching iterator that count_if doesn't have.

I'm curious -- did you see a runtime perf drop when you made the change, too? It sounds like there's room for improvement in the all_of implementation.

@jrhemstad
Copy link
Collaborator

This is just test code, so perf doesn't really matter all that much.

It sounds like there's room for improvement in the all_of implementation.

Yeah, I looked into this extensively a while back: NVIDIA/cccl#720

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
topic: performance Performance related issue
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants