
Option for filtering by CUDNN_DETERMINISTIC in cudnnConvolutionAlgoPerfChoose #938

Open
ToucheSir opened this issue May 30, 2021 · 9 comments
Labels: cuda libraries, enhancement, good first issue

Comments

@ToucheSir
Contributor

Is your feature request related to a problem? Please describe.

Following from https://discourse.julialang.org/t/flux-reproducibility-of-gpu-experiments/62092, there is currently no way for users of e.g. NNlibCUDA to ensure that convolution operations only use deterministic algorithms.

Describe the solution you'd like

Something along the lines of
https://github.com/pytorch/pytorch/blob/6c70cbedb6102da08fe91186d40a41b50991681d/aten/src/ATen/native/cudnn/Conv_v7.cpp#L219-L252. Whether this would need to be plumbed through higher-level functions, set as a global option or exposed through a context manager is left for debate.
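
A minimal sketch of what such a filter could look like on the Julia side, assuming the perf structs returned by cuDNN's algorithm search are available (the determinism field and the CUDNN_DETERMINISTIC constant come from the cuDNN C API's cudnnConvolutionFwdAlgoPerf_t and cudnnDeterminism_t; the helper name and keyword here are hypothetical, not an existing CUDA.jl API):

# Hypothetical helper, not an existing CUDA.jl function: given cuDNN perf
# results (sorted fastest-first), optionally keep only algorithms that cuDNN
# reports as deterministic before picking the fastest one.
function choose_algo(perf_results; deterministic::Bool=false)
    if deterministic
        perf_results = filter(p -> p.determinism == CUDNN_DETERMINISTIC, perf_results)
        isempty(perf_results) && error("no deterministic convolution algorithm available")
    end
    return first(perf_results).algo
end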

Describe alternatives you've considered

The only solution right now seems to be pirating cudnnConvolutionForwardAD so that it doesn't use cudnnConvolutionFwdAlgoPerf?

Additional context

https://docs.nvidia.com/deeplearning/cudnn/developer-guide/index.html#reproducibility states:

...the following routines do not guarantee reproducibility because they use atomic operations:

  • cudnnConvolutionBackwardFilter when CUDNN_CONVOLUTION_BWD_FILTER_ALGO_0 or CUDNN_CONVOLUTION_BWD_FILTER_ALGO_3 is used
  • cudnnConvolutionBackwardData when CUDNN_CONVOLUTION_BWD_DATA_ALGO_0 is used

So I have no clue if/how often these show up in practice in cudnnConvolution*AlgoPerf, but I assume they must if users are seeing non-deterministic results?
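
For reference, the perf structs returned by cuDNN's search routines already report this: each entry carries both the algorithm enum and a determinism flag. A rough way to probe the question above (constant and field names follow the cuDNN C API; how the perf results are obtained is left out of this sketch):

# Sketch only: check whether a backward-filter perf result is one of the
# atomics-based algorithms quoted above, or is flagged non-deterministic by
# cuDNN itself via its determinism field.
const ATOMIC_BWD_FILTER_ALGOS = (CUDNN_CONVOLUTION_BWD_FILTER_ALGO_0,
                                 CUDNN_CONVOLUTION_BWD_FILTER_ALGO_3)

is_nondeterministic(perf) =
    perf.algo in ATOMIC_BWD_FILTER_ALGOS ||
    perf.determinism == CUDNN_NON_DETERMINISTIC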

@ToucheSir added the enhancement label May 30, 2021
@maleadt added the cuda libraries and good first issue labels Jun 1, 2021
@ericphanson

I saw this has a "good first issue" tag; any hints on how to approach it?

@maleadt
Member

maleadt commented May 17, 2022

I haven't put much thought into it; a good first step would be to have a good look at what other libraries/frameworks do (although there are a couple of references in the OP already). We also already have a math_mode, so maybe the deterministic mode should piggy-back on the pedantic math mode: https://github.com/JuliaGPU/CUDA.jl/blob/056a5269e41d3f447e957b4058bc424683bbbd09/lib/cudnn/CUDNN.jl#L43-L55=. If that doesn't work, a separate determinism TLS entry seems like the best way to go (similar to how math_mode is implemented).
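
For illustration, a determinism flag stored next to math_mode could be as simple as the sketch below; a plain Ref stands in for the task-local state CUDA.jl actually uses for math_mode, and none of these names exist in the package today:

# Illustrative only: a process-wide flag that the cuDNN convolution wrappers
# would consult when selecting an algorithm.
const DETERMINISTIC_ALGORITHMS = Ref(false)

# Request (or release) deterministic cuDNN convolution algorithms.
deterministic!(flag::Bool=true) = (DETERMINISTIC_ALGORITHMS[] = flag; nothing)
deterministic() = DETERMINISTIC_ALGORITHMS[]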

@marcpabst

marcpabst commented Sep 10, 2022

This is still an issue that seems easily fixable once there is agreement on how to implement a deterministic option/flag. I wonder, however, whether

  • the deterministic behavior should maybe be the default? I find it super unintuitive that I can use CUDA.seed! and still get non-reproducible results (maybe CUDA.seed! should implicitly set the deterministic flag?),
  • setting the deterministic flag should invalidate the cache (I guess?), and
  • there are situations where no deterministic algorithm is available.

@maleadt
Member

maleadt commented Sep 10, 2022

CUDA.seed! seeds the RNG, which is unrelated to CUDNN, so I don't see why we should couple those.

@marcpabst

Yeah, I agree. It's just that, coming from a scientific computing language, I kinda expect reproducibility by default. At the very least, there should be a clear warning in the docs (though that's probably better suited to Flux.jl, I guess).

@ToucheSir
Contributor Author

Yeah, I agree. It's just that, coming from a scientific computing language, I kinda expect reproducibility by default.

The missing context here is that deep learning in general has a much looser notion of "reproducibility" than most other scientific computing domains. This is due to many reasons and could be an entire discussion on its own, but suffice it to say that the slight deviations caused by allowing non-deterministic algorithms are not enough to make most deep learning code stop using them. Indeed, none of PyTorch/TF/JAX filters for deterministic-only algorithms by default, and Flux is no exception.

@maleadt
Member

maleadt commented Sep 11, 2022

It's just that, coming from a scientific computing language, I kinda expect reproducibility by default.

We've had this discussion in the past already. Using Julia doesn't guarantee anything; it all depends on the specific domain. In the case of GPU computing, sacrificing some accuracy is the norm simply because of how the devices work (threads executing in a non-deterministic order, resulting in inaccuracies due to floating-point semantics). We are not going to degrade performance by default when none of the other software packages in this specific domain (GPU programming, deep learning, etc.) do so.

@ToucheSir
Contributor Author

Back to the main issue: the big unanswered question is how the layering works with so many libraries in the stack. Flux probably shouldn't be involved, because we'd like it to stay accelerator agnostic.

So that leaves CUDA, NNlib and NNlibCUDA. I think a global flag in CUDA.jl might be awkward because, unlike MATH_MODE, algorithm filtering doesn't neatly map to a single cu*Set*Mode call. If cudnnConvolutionForward! and friends could take either an algorithm override or a callback to filter algorithms, NNlibCUDA could handle most of this logic. Then NNlib(CUDA) can choose to add global functions to toggle deterministic conv algorithms (i.e. flip a Ref value and check that in conv!) at its leisure, as in the rough sketch below.
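
To make that concrete, here is a hypothetical sketch of the NNlib-level toggle and the filter it could pass down; none of these names exist in NNlib or NNlibCUDA, and the algo_filter keyword on cudnnConvolutionForward! is assumed, not real:

# Hypothetical NNlib-side toggle: a Ref that conv! checks before dispatching.
const DETERMINISTIC_CONV = Ref(false)
deterministic_conv!(flag::Bool=true) = (DETERMINISTIC_CONV[] = flag; nothing)

# NNlibCUDA's conv! method would then forward a filter like this:
#   cudnnConvolutionForward!(y, w, x;
#       algo_filter = perf -> !DETERMINISTIC_CONV[] ||
#                             perf.determinism == CUDNN_DETERMINISTIC)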

@renatobellotti

I noticed that sparse matrix-vector multiplications are also not reproducible: the same product yields a different result each time it is executed:

using CUDA
using Random
using SparseArrays  # sprand for sparse matrices lives here

Random.seed!(546784)

# random sparse matrix and dense vector, moved to the GPU
A = cu(sprand(Float64, 1000, 1000, 0.6))
x = cu(rand(Float64, 1000))

# the same product, computed twice
y1 = A * x
y2 = A * x

# nonzero: the two runs do not agree bit-for-bit
maximum(abs.(y2 - y1))
