Compile times are growing significantly #581

Open
jrhemstad opened this issue Sep 9, 2018 · 12 comments · May be fixed by #17078
Labels: CMake (CMake build issue)

@jrhemstad
Contributor

Feature request

As anyone who has built libgdf recently has surely noticed, the time to compile the library from scratch has grown significantly in the last few months. For example, compiling on all 10 cores of my i9-7900X @ 3.30GHz takes 11 minutes as reported by time make -j.

Compiling with -ftime-report may be a good place to start to see where all the compilation time is being spent.

This is likely due to the large amount of template instantiation required to instantiate functions for all possible types. We should make sure that best practices are being followed in template instantiation, so that a template for a given type only has to be instantiated once, via explicit instantiation.
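As a rough sketch of that practice (the names below are made up for illustration, not actual libgdf code), an extern template declaration in the header tells every other translation unit that the instantiation already exists, and a single source file provides it via explicit instantiation:

```cpp
// reduce.hpp -- hypothetical header: declare the template and tell every
// includer that the common instantiations already exist in one source file.
template <typename T>
T device_reduce(T const* data, int size);

extern template int device_reduce<int>(int const*, int);
extern template double device_reduce<double>(double const*, int);

// reduce.cpp -- define the template and explicitly instantiate it exactly once.
template <typename T>
T device_reduce(T const* data, int size)
{
  T sum{};
  for (int i = 0; i < size; ++i) { sum += data[i]; }
  return sum;
}

template int device_reduce<int>(int const*, int);
template double device_reduce<double>(double const*, int);
```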

Much of our code is implemented in headers, which causes it to be recompiled in every translation unit that includes those headers. Using pre-compiled headers may help:
https://gcc.gnu.org/onlinedocs/gcc/Precompiled-Headers.html
http://itscompiling.eu/2017/01/12/precompiled-headers-cpp-compilation/
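A minimal sketch of how that could look with GCC (the file contents and names here are hypothetical): gather the heavy, rarely-changing includes into one aggregate header, precompile it once, and include that header first in every translation unit.

```cpp
// pch.hpp -- hypothetical aggregate of heavy, rarely-changing headers.
#include <algorithm>
#include <memory>
#include <string>
#include <unordered_map>
#include <vector>

// Precompile it once:
//   g++ -std=c++14 -x c++-header pch.hpp -o pch.hpp.gch
//
// Any translation unit whose first include is "pch.hpp" and that is built
// with the same flags will then load pch.hpp.gch instead of reparsing these
// headers from scratch.
```

Newer CMake releases (3.16 and later) can also manage this automatically via target_precompile_headers.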

Furthermore, code should be scrubbed of excessive and unnecessary #includes. Compiling with -MD will generate dependency files that show exactly which headers each translation unit pulls in.

Here's a Clang tool that ensures you only include the necessary headers: https://github.com/include-what-you-use/include-what-you-use

Here's a Clang tool to profile time spent in template instantiation: https://github.com/mikael-s-persson/templight

@harrism
Member

harrism commented Oct 26, 2018

Do the clang tools work with nvcc? Much of our template time is in nvcc. You can use nvcc --time foo.csv to dump timing for different nvcc phases.

@jrhemstad
Contributor Author

I'm not sure. You can technically get Clang to compile device code, so that may be a path worth exploring using Clang + these tools.

@mike-wendt transferred this issue from rapidsai/libgdf Dec 20, 2018
@mike-wendt added the Needs Triage label Dec 20, 2018
@kkraus14 added the code quality and CMake labels and removed the Needs Triage label Feb 13, 2019
@hcho3
Contributor

hcho3 commented Jun 12, 2020

Any updates on this? I'd love to use precompiled headers with CUDA projects.

@harrism
Member

harrism commented Jun 15, 2020

Compile time continues to grow, but that is largely because our supported feature set and supported types continue to grow. In 0.15 we are aiming to add at least 10 new types (4 unsigned int types, 4 timestamp types, list column type, decimal fixed-point type). Naturally this will increase compile time and binary size.

Meanwhile, in 0.14 we dropped all of the legacy APIs that were previously deprecated, which reduced compile time a bit, and significantly reduced binary size. There have been and will continue to be various efforts to reduce compile time of certain components. We are investigating possibly splitting libcudf into multiple libraries.

We have not discussed precompiled headers.

@beckernick
Member

@jrhemstad @harrism, is this still a relevant issue?

@jrhemstad
Contributor Author

Our compile time is worse than ever, so I guess it's still relevant. We could benefit from someone putting in a concerted effort to eliminate unnecessary includes across the library.
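For illustration, one common pattern in such an effort (a generic example, not an actual libcudf header): headers that only use a type by pointer or reference can forward-declare it instead of including its full definition, so the heavy include stops propagating to every file that includes the header.

```cpp
// widget.hpp -- before, this header did #include "table.hpp" just to name the
// type; a forward declaration is enough for references and pointers, so files
// that include widget.hpp no longer reparse table.hpp.
class table;

class widget {
 public:
  explicit widget(table const& t);

 private:
  table const* table_;
};

// widget.cpp -- only the implementation needs the full definition.
// #include "table.hpp"
```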

@vyasr
Contributor

vyasr commented Dec 20, 2022

I'm not sure. You can technically get Clang to compile device code, so that may be a path worth exploring using Clang + these tools.

Out of curiosity I gave this a quick shot. (Unsurprisingly) clang does not currently support the experimental CUDA features that we have enabled (--expt-extended-lambda --expt-relaxed-constexpr), so the compilation terminates pretty quickly. I'm not sure whether there are other downstream issues we would face if we attempted this after stripping those out (not suggesting that we should, although #7795 remains open, so maybe it is worth pursuing).

@jrhemstad
Contributor Author

clang does not currently support the experimental CUDA features that we have enabled (--expt-extended-lambda --expt-relaxed-constexpr)

Pretty sure clang supports those features natively without the need for any extra compile flags. I'm guessing the error was caused by clang not recognizing those flags.
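For reference, this is the kind of code those nvcc flags gate; a minimal standalone sketch (not taken from libcudf) that needs --expt-extended-lambda under nvcc but should compile in clang's CUDA mode without extra flags:

```cpp
// lambda_reduce.cu
#include <thrust/execution_policy.h>
#include <thrust/iterator/counting_iterator.h>
#include <thrust/iterator/transform_iterator.h>
#include <thrust/reduce.h>

int main()
{
  // A __device__ lambda wrapped in a transform_iterator: nvcc requires
  // --expt-extended-lambda for this, while clang accepts it with no extra flag.
  auto square = [] __device__(int i) { return i * i; };
  auto first =
    thrust::make_transform_iterator(thrust::counting_iterator<int>(0), square);

  int sum = thrust::reduce(thrust::device, first, first + 10);  // 0^2 + ... + 9^2 = 285
  return sum == 285 ? 0 : 1;
}
```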

@vyasr
Contributor

vyasr commented Dec 21, 2022

You're right, it does. I removed those and made some progress, but not nearly enough for a working build with clang yet. Here's a list of necessary changes so far:

  • Remove all CUDF_CUDA_FLAGS set in ConfigureCUDA.cmake. Most are either unsupported or ignored (including some of the warning flags), so the blanket removal is easiest.
  • Remove [[nodiscard]] attributes, which don't appear to be supported in clang device code yet.
  • Set -D__STRICT_ANSI__ as a CUDA compiler flag. Otherwise it tries to compile float128 code, which is not yet supported in clang.
  • Pass -fcuda-allow-variadic-functions to the clang compiler (or via the CMake configure CLI: -DCMAKE_CUDA_FLAGS="-Xclang -fcuda-allow-variadic-functions").

At this point I start seeing failures like this:

...type_traits:2777:5: error: no type named 'type' in 'std::invoke_result<cudf::detail::indexalator_factory::nullable_index_accessor, int>'

and

error: type 'thrust::transform_iterator<(lambda at ...gather.cu:49:26), cudf::detail::input_indexalator>::super_t' (aka 'iterator_adaptor<transform_iterator<(lambda at ...gather.cu:49:26), cudf::detail::input_indexalator, thrust::use_default, thrust::use_default>, cudf::detail::input_indexalator, int, thrust::use_default, std::random_access_iterator_tag, int>') is not a direct or virtual base of 'thrust::transform_iterator<(lambda at ...gather.cu:49:26), cudf::detail::input_indexalator>'

I need to track this down a bit further, but it looks like some aspect of how thrust SFINAEs different code paths isn't supported in clang device code yet either.

@PointKernel
Member

PointKernel commented Dec 21, 2022

I tried to build cuco with clang about a year ago and was blocked by its dependencies, like thrust and libcudacxx, which could not be built with clang. To find out how much effort is required to build device code with clang, I would suggest starting with a smaller library like cuco and seeing how it goes from there.

Related issues:

@vyasr
Contributor

vyasr commented Dec 21, 2022

Well then... looks like we've got to work our way all the way up the stack for this. For the purpose of something like clang-tidy we might be able to get partial results based on the discussion in rapidsai/raft#424, but that is only partial support at best, and I don't know whether it will work with the other tools of interest like IWYU.

@vyasr
Contributor

vyasr commented Oct 9, 2024

Compile times are an ever-present problem for us. This issue as currently framed isn't clearly actionable, so let's lay out some concrete points.

We should make sure that best practices are being followed in template instantiation

#379 implemented the type_dispatcher, which now controls all our template instantiations and ensures that we have a minimal set.
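To make that concrete, here is a heavily simplified sketch of the pattern (not cudf's actual type_dispatcher signature): a single runtime switch maps a type id to one instantiation of the caller's functor, so each operation is instantiated exactly once per supported type, in one place.

```cpp
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <utility>

// Illustrative only: a runtime type id and the single switch that maps it to
// one template instantiation of whatever functor the caller passes in.
enum class type_id { INT32, FLOAT64 };

template <typename Functor, typename... Args>
auto type_dispatcher(type_id id, Functor f, Args&&... args)
{
  switch (id) {
    case type_id::INT32:
      return f.template operator()<std::int32_t>(std::forward<Args>(args)...);
    case type_id::FLOAT64:
      return f.template operator()<double>(std::forward<Args>(args)...);
    default: throw std::logic_error("unsupported type");
  }
}

// Each operation is written once as a functor with a templated operator();
// the dispatcher is the only place that enumerates the supported types.
struct size_of_functor {
  template <typename T>
  std::size_t operator()() const { return sizeof(T); }
};

// Usage: type_dispatcher(type_id::FLOAT64, size_of_functor{}) returns 8.
```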

Compiling with -ftime-report may be a good place to start to see where all the compilation time is being spent.

Since #9631 we have been tracking build times in CI. We monitor this and keep an eye on TUs that are particularly slow to compile. Where necessary, we have reacted to slow compilation by reorganizing the code and explicitly instantiating templates.

Furthermore, code should be scrubbed of excessive and unnecessary #includes.

This seems like the main action item remaining. As discussed above, include-what-you-use is a good tool for this, so I would consider evaluating it and then either adopting or rejecting it as the only thing left to do here. Since getting the code to compile with clang is the bottleneck, I propose that we get our C/C++ source files working first. Once we have systematic approaches in place for those and are regularly checking on the quality, we can incrementally work up to getting CUDA source files working, since that is a much heavier lift (and at that point we can also split between host and device code in .cu files). This is similar to the approach we are taking in #16958 for clang-tidy.

@vyasr linked a pull request Oct 14, 2024 that will close this issue