
how to make CUDA functionalities an extension #2265

Closed
CarloLucibello opened this issue Jun 14, 2023 · 13 comments
@CarloLucibello
Member

CarloLucibello commented Jun 14, 2023

Now that NNlibCUDA is an extension in NNlib with a weak dependence on CUDA and cuDNN (see FluxML/NNlib.jl#492), we have to decide how to move forward here.

The CUDA code we currently have here is quite lean (https://github.com/FluxML/Flux.jl/blob/master/src/cuda/cudnn.jl), and it could be reduced to nothing by moving all the functional implementations of normalization to NNlib.

What really matters is that both CUDA.jl and cuDNN.jl are loaded, in order to activate all the CUDA functionality in NNlib.

I think we basically have 2 options:

  1. We tell users to explicitly do using Flux, CUDA, cuDNN to unlock CUDA functionality. We will have an extension FluxCUDAcuDNNExt here, with CUDA and cuDNN as weak dependencies.
    Notice that the extension is also loaded with using Flux, cuDNN, since cuDNN loads CUDA.
  2. We create a dummy package FluxCUDA with no code inside, but with CUDA and cuDNN as dependencies.
    Here we will have an extension FluxFluxCUDAExt, with FluxCUDA as a weak dependency. Users will have to do using Flux, FluxCUDA.
I lean more towards option ~~2~~ 1. Thoughts?
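To make option 1 concrete, the Project.toml entries could look roughly like the sketch below. This is only an illustration of the weak-dependency mechanism, not taken from the actual Flux.jl Project.toml; the UUIDs are left as placeholders.

```toml
# Hypothetical Project.toml entries for option 1 (UUIDs left as placeholders).
[weakdeps]
CUDA = "<CUDA.jl UUID>"
cuDNN = "<cuDNN.jl UUID>"

[extensions]
# FluxCUDAcuDNNExt is loaded automatically once Flux, CUDA and cuDNN
# are all present in the same session.
FluxCUDAcuDNNExt = ["CUDA", "cuDNN"]
```

With entries like these, using Flux, CUDA, cuDNN (or just using Flux, cuDNN, since cuDNN loads CUDA) would trigger the extension, matching the user-facing story described in option 1.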

@ToucheSir
Member

Copying comment from #2268:

I'm pretty ambivalent on whether 1) or 2) would be better, but 2) might let us keep compat with older Julia versions if we made it load NNlibCUDA and/or the Flux CUDA code on <1.9.

RE:

Here we will have an extension FluxFluxCUDAExt, with FluxCUDA as a weak dependency.

I don't believe this is required. Since it's a proper package and not an extension, we could move the existing Flux CUDA code into it. Low-tech, but it would work. Whenever we figure out how to reduce the import-time issue, we can always convert that code into a package extension in Flux itself and deprecate FluxCUDA.

Alternatively, we could still use a FluxCUDAExt and FluxCUDA would trigger that. However, that would require us to have a (frozen, backport only) copy of the Flux CUDA code in FluxCUDA itself to support Julia <1.9.
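As a sketch of what such a FluxCUDA trigger package itself could contain (hypothetical code; no such package exists yet):

```julia
# Hypothetical FluxCUDA.jl: an (almost) empty trigger package for option 2.
# Loading it pulls in the strong dependencies, which in turn activates the
# CUDA code paths in Flux and NNlib.
module FluxCUDA

using CUDA, cuDNN   # strong dependencies declared in FluxCUDA's Project.toml

# In the variant described above, the existing Flux CUDA code (e.g. the
# cudnn.jl wrappers) could also be moved here directly to support Julia < 1.9.

end # module
```

Users would then simply write using Flux, FluxCUDA.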

@CarloLucibello
Member Author

CarloLucibello commented Jun 17, 2023

Since it's a proper package and not an extension, we could move the existing Flux CUDA code into it.

Yes, we can do that, but I'd rather have all of the Flux code in a single repo; it would make changes involving one or multiple backends, and maintenance in general, much easier.

However, that would require us to have a (frozen, backport only) copy of the Flux CUDA code in FluxCUDA itself to support Julia <1.9.

I'd rather drop < 1.9; supporting old Julia versions is not a good use of our time. We can do it when it doesn't take much effort, but with Julia 1.9 everything changes, and we want to reap the benefits without being burdened by the need to support previous releases.

@darsnack
Member

Before we drop support for <1.9, we should count bug fixes over the last year that would rise to the level of a backport (i.e. if our Julia compat had increased, these are fixes we would want to backport).

Personally, I think dropping 1.6 support should be considered carefully. It would not be a great situation if the foremost ML package did not support the LTS and the last LTS-compatible release had major bugs.

@darsnack
Member

Also my vote is for Option 2 regardless of where we end up putting the CUDA-related code.

@CarloLucibello
Member Author

If we go with FluxCUDA, should we also have FluxAMDGPU and FluxMetal? Those backends are currently activated by the single imports using AMDGPU and using Metal.

@ToucheSir
Member

I think those are less important because they don't support the LTS. Unlike with CUDA.jl, people are going to want to use a very recent version of Julia with AMDGPU or Metal.

@CarloLucibello
Member Author

CarloLucibello commented Jun 17, 2023

I thought a bit more about this and I think we can orthogonalize the two discussions:

a) Whether to go with option 1) or 2)

b) Whether we support 1.6 only through backports, or keep the Julia compat at v1.6 in future versions, meaning that Flux should behave consistently across different Julia versions.

In order to keep the 1.6 compat, independently of the outcome of a), we need to do the following:

  • put the CUDA extension in NNlib under @require for VERSION < v"1.9" and relax the Julia compat there back to 1.6
  • do the same here in Flux

NNlib already depends on Requires.jl, although I was hoping to get rid of it (FluxML/NNlib.jl#494).
So I guess this option is not so bad: only a few extra lines of Requires code to keep the compatibility.
Should we do it?
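For concreteness, the Requires.jl fallback discussed above would follow the usual @require pattern, roughly as below. This is a sketch and not existing Flux or NNlib code: triggering on CUDA alone is a simplification (the real version would also need to account for cuDNN), and the include path is an illustrative placeholder.

```julia
# Sketch of a Requires.jl fallback for Julia < 1.9; on 1.9+ the built-in
# package-extension mechanism would be used instead.
@static if !isdefined(Base, :get_extension)   # true on Julia < 1.9
    using Requires
end

function __init__()
    @static if !isdefined(Base, :get_extension)
        # CUDA.jl's UUID; the include path below is an illustrative placeholder.
        @require CUDA = "052768ef-5323-5732-b1bb-66c8b64840ba" begin
            include("../ext/FluxCUDAcuDNNExt/FluxCUDAcuDNNExt.jl")
        end
    end
end
```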

@ToucheSir
Member

ToucheSir commented Jun 17, 2023

put the CUDA extension in NNlib under @require for VERSION < v1.9 and relax the julia compat there again to 1.6

That's what I tried in FluxML/NNlib.jl#445, but as noted in that thread, the additional latency from giving up precompilation was way too extreme. Requires would probably work for the Flux functionality though, given that it is much smaller (NNlibCUDA is massive in comparison).

@CarloLucibello
Member Author

That's what I tried in FluxML/NNlib.jl#445, but as noted in that thread the additional latency from giving up precompilation was way too extreme.

So we should use Requires in Flux but not in NNlib? In that case, on older Julia versions we would have to point Flux to NNlibCUDA, but I would strongly oppose this: we would be using a deprecated package or not, depending on the Julia version.

@ToucheSir
Member

This is precisely why I've noted on multiple occasions how careful we need to be about this transition! There are so many options that either don't work technically or that people do not like, so taking some time to really nail down the steps required is a good idea.

@CarloLucibello
Member Author

We have to drop 1.6; I don't see any other option. We can explicitly state, though, that we guarantee backports to the LTS for critical bugs.

@CarloLucibello
Member Author

What about options 1) and 2)? I think I lean more towards option 1) now, for symmetry with the other extensions, and because @mcabbott's argument along the lines of "people should not need to know about an obscure package" (cuDNN) could be said to apply to FluxCUDA as well, so it may be better to be transparent.
But maybe having a FluxCUDA package is more future-proof: it could avoid breaking changes in case different loading schemes are needed in the future.

@CarloLucibello
Member Author

done in #2268
