
port improvements from vchuravy/NCCL.jl #5

Closed · wants to merge 1 commit

Conversation

vchuravy (Member)

TODO

  • figure out how to test examples
  • add tests for new functionality

    end
    uid = MPI.bcast(uid, 0, comm)::NCCL.UniqueID

    dev = CuDevice(parse(Int, first(split(ENV["CUDA_VISIBLE_DEVICES"], ","))))
vchuravy (Member, Author)

The "right" way of doing this is:

    lcomm = MPI.Comm_split_type(mpicomm, MPI.MPI_COMM_TYPE_SHARED,
                                MPI.Comm_rank(mpicomm))
    CUDAnative.device!(MPI.Comm_rank(lcomm))

For NCCL to work best, all devices need to be visible to all MPI processes, and there should be at most n local ranks for n devices.
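
Putting that together, a minimal end-to-end sketch of the device-selection pattern (assuming MPI.jl's then-current API, a zero-argument `NCCL.UniqueID()` constructor, and a `NCCL.Communicator(nranks, uid, rank)` constructor; anything beyond the snippet above is illustrative, not the exact code in this PR):

    using MPI, CUDAnative, NCCL

    MPI.Init()
    mpicomm = MPI.COMM_WORLD

    # Group the ranks that share this node, then use the node-local rank
    # as the device ordinal; this assumes all GPUs are visible to every
    # process and at most n local ranks per n devices.
    lcomm = MPI.Comm_split_type(mpicomm, MPI.MPI_COMM_TYPE_SHARED,
                                MPI.Comm_rank(mpicomm))
    CUDAnative.device!(MPI.Comm_rank(lcomm))

    # Rank 0 generates the NCCL unique id and broadcasts it to all ranks,
    # which then join the same NCCL communicator.
    uid = MPI.Comm_rank(mpicomm) == 0 ? NCCL.UniqueID() : nothing
    uid = MPI.bcast(uid, 0, mpicomm)::NCCL.UniqueID
    comm = NCCL.Communicator(MPI.Comm_size(mpicomm), uid,
                             MPI.Comm_rank(mpicomm))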

    @@ -2,6 +2,12 @@

    export Allreduce!, Broadcast!, Reduce!, Allgather!, ReduceScatter!

    function allReduce!(::Op, sendbuf, recvbuf, comm::Communicator; stream=CUDAdrv.CuDefaultStream()) where Op
vchuravy (Member, Author)

Following the MPI.jl convention of using lowercase names for Julia-specific implementations.
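
For illustration, a hypothetical sketch of that split as it might appear inside the NCCL.jl module (the `Allreduce!` signature and the delegation are assumptions, not the exact code in this PR): the lowercase `allReduce!` carries the Julia-specific implementation, while the exported capitalized name keeps the MPI.jl-style public entry point:

    # Julia-specific implementation, lowercase per the MPI.jl convention;
    # dispatches on the op type and would forward to the libnccl wrapper.
    function allReduce!(op::Op, sendbuf, recvbuf, comm::Communicator;
                        stream=CUDAdrv.CuDefaultStream()) where Op
        # ... call ncclAllReduce on sendbuf/recvbuf over comm and stream ...
    end

    # Exported, capitalized public entry point delegating to it:
    Allreduce!(sendbuf, recvbuf, op, comm::Communicator; kwargs...) =
        allReduce!(op, sendbuf, recvbuf, comm; kwargs...)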

vchuravy requested review from kshyatt and maleadt on November 27, 2019 18:01
maleadt (Member) commented Nov 27, 2019

Interesting

    Test Failed at /builds/JuliaGPU/NCCL.jl/test/runtests.jl:9
      Expression: size(comms[1]) == 2
       Evaluated: 8 == 2
    Stacktrace:
     [1] macro expansion at /builds/JuliaGPU/NCCL.jl/test/runtests.jl:9 [inlined]
     [2] macro expansion at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.0/Test/src/Test.jl:1083 [inlined]
     [3] macro expansion at /builds/JuliaGPU/NCCL.jl/test/runtests.jl:6 [inlined]
     [4] macro expansion at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.0/Test/src/Test.jl:1083 [inlined]
     [5] top-level scope at /builds/JuliaGPU/NCCL.jl/test/runtests.jl:5

and the ensuing:

    Reduce!: Error During Test at /builds/JuliaGPU/NCCL.jl/test/runtests.jl:66
      Got exception outside of a @test
      CUDA error: an illegal memory access was encountered (code 700, ERROR_ILLEGAL_ADDRESS)
      Stacktrace:
       [1] macro expansion at /builds/JuliaGPU/NCCL.jl/.julia/packages/CUDAdrv/3EzC1/src/error.jl:123 [inlined]
       [2] cuMemcpyDtoH_v2(::Ptr{Float64}, ::CuPtr{Float64}, ::Int64) at /builds/JuliaGPU/NCCL.jl/.julia/packages/CUDAdrv/3EzC1/src/libcuda.jl:444
       [3] #unsafe_copyto!#5 at /builds/JuliaGPU/NCCL.jl/.julia/packages/CUDAdrv/3EzC1/src/memory.jl:285 [inlined]
       [4] unsafe_copyto! at /builds/JuliaGPU/NCCL.jl/.julia/packages/CUDAdrv/3EzC1/src/memory.jl:278 [inlined]
       [5] copyto!(::Array{Float64,1}, ::Int64, ::CuArray{Float64,1,Nothing}, ::Int64, ::Int64) at /builds/JuliaGPU/NCCL.jl/.julia/packages/CuArrays/7z7MV/src/array.jl:265
       [6] copyto! at /builds/JuliaGPU/NCCL.jl/.julia/packages/GPUArrays/0lvhc/src/abstractarray.jl:118 [inlined]
       [7] collect(::CuArray{Float64,1,Nothing}) at /builds/JuliaGPU/NCCL.jl/.julia/packages/CuArrays/7z7MV/src/array.jl:245
       [8] macro expansion at /builds/JuliaGPU/NCCL.jl/test/runtests.jl:87 [inlined]
       [9] macro expansion at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.0/Test/src/Test.jl:1083 [inlined]
       [10] macro expansion at /builds/JuliaGPU/NCCL.jl/test/runtests.jl:67 [inlined]
       [11] macro expansion at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.0/Test/src/Test.jl:1083 [inlined]
       [12] top-level scope at /builds/JuliaGPU/NCCL.jl/test/runtests.jl:5
       [13] include at ./boot.jl:317 [inlined]
       [14] include_relative(::Module, ::String) at ./loading.jl:1044
       [15] include(::Module, ::String) at ./sysimg.jl:29
       [16] include(::String) at ./client.jl:392
       [17] top-level scope at none:0
       [18] eval(::Module, ::Any) at ./boot.jl:319
       [19] exec_options(::Base.JLOptions) at ./client.jl:243
       [20] _start() at ./client.jl:425

maleadt force-pushed the master branch 8 times, most recently from 859e000 to 68aebf8 on January 22, 2020 15:17
vchuravy mentioned this pull request on Dec 4, 2020
maleadt (Member) commented Feb 22, 2024

Most of the functionality here has been included already, so I think we can close this.

maleadt closed this on Feb 22, 2024