Add packet copy (LL) for AllReduce #85

Merged
merged 54 commits into from
Jun 12, 2023


Contributor

@chhwang chhwang commented May 25, 2023

Add packet copy interfaces and an example kernel for AllReduce

@chhwang chhwang requested a review from Binyang2014 May 25, 2023 02:57
Base automatically changed from binyli/all-reduce to main June 7, 2023 10:58
Contributor Author

chhwang commented Jun 8, 2023

1 GiB AllReduce, 8 GPUs, kernel 1 (ring reduce): throughput after the bug fix is 236.64 GB/s (busbw).

# minBytes 1073741824 maxBytes 1073741824 step: 2(factor) warmup iters: 10 iters: 10 validation: 1 graph: 10, kernel num: 1
#
# Using devices
#  Rank  0 Pid 1665108 on costsim-dev-00000A device  0 [0001:00:00.0] NVIDIA A100-SXM4-80GB
#  Rank  1 Pid 1665109 on costsim-dev-00000A device  1 [0002:00:00.0] NVIDIA A100-SXM4-80GB
#  Rank  2 Pid 1665110 on costsim-dev-00000A device  2 [0003:00:00.0] NVIDIA A100-SXM4-80GB
#  Rank  3 Pid 1665111 on costsim-dev-00000A device  3 [0004:00:00.0] NVIDIA A100-SXM4-80GB
#  Rank  4 Pid 1665112 on costsim-dev-00000A device  4 [000B:00:00.0] NVIDIA A100-SXM4-80GB
#  Rank  5 Pid 1665113 on costsim-dev-00000A device  5 [000C:00:00.0] NVIDIA A100-SXM4-80GB
#  Rank  6 Pid 1665114 on costsim-dev-00000A device  6 [000D:00:00.0] NVIDIA A100-SXM4-80GB
#  Rank  7 Pid 1665115 on costsim-dev-00000A device  7 [000E:00:00.0] NVIDIA A100-SXM4-80GB
#
# Initializing MSCCL++
# Setting up the connection in MSCCL++
#
#                                    in-place                       out-of-place          
#       size         count     time   algbw   busbw  #wrong     time   algbw   busbw  #wrong
#        (B)    (elements)     (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)       
  1073741824     268435456   7940.6  135.22  236.64      0

# Out of bounds values : 0 OK
#
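
For readers unfamiliar with the `algbw` and `busbw` columns, the sketch below reproduces them from the reported size and time. It assumes the nccl-tests convention for AllReduce (`busbw = algbw * 2 * (n - 1) / n`, with 1 GB = 1e9 bytes), which the numbers in the log above are consistent with. The function names are hypothetical and not part of mscclpp-test.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

// Algorithm bandwidth in GB/s (1 GB = 1e9 bytes): payload bytes divided by elapsed time.
double algBandwidthGBps(std::uint64_t bytes, double timeUs) {
  return static_cast<double>(bytes) / (timeUs * 1e-6) / 1e9;
}

// Bus bandwidth for AllReduce under the assumed nccl-tests convention:
// each rank moves 2 * (n - 1) / n times the payload over the interconnect.
double allReduceBusBandwidthGBps(double algbw, int nRanks) {
  return algbw * 2.0 * (nRanks - 1) / nRanks;
}
```

With the 1 GiB run above, `algBandwidthGBps(1073741824, 7940.6)` gives about 135.22 GB/s, and scaling by 2 * 7 / 8 for 8 ranks gives about 236.64 GB/s.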

Contributor Author

chhwang commented Jun 8, 2023

10 KiB AllReduce, 8 GPUs, kernel 2 (packet copy): latency is 6.54 us with 7 thread blocks (1 block per peer). Using more thread blocks does not help.

# minBytes 10240 maxBytes 10240 step: 2(factor) warmup iters: 10 iters: 1000 validation: 1 graph: 100, kernel num: 2
#
# Using devices
#  Rank  0 Pid 1692118 on costsim-dev-00000A device  0 [0001:00:00.0] NVIDIA A100-SXM4-80GB
#  Rank  1 Pid 1692119 on costsim-dev-00000A device  1 [0002:00:00.0] NVIDIA A100-SXM4-80GB
#  Rank  2 Pid 1692120 on costsim-dev-00000A device  2 [0003:00:00.0] NVIDIA A100-SXM4-80GB
#  Rank  3 Pid 1692121 on costsim-dev-00000A device  3 [0004:00:00.0] NVIDIA A100-SXM4-80GB
#  Rank  4 Pid 1692122 on costsim-dev-00000A device  4 [000B:00:00.0] NVIDIA A100-SXM4-80GB
#  Rank  5 Pid 1692123 on costsim-dev-00000A device  5 [000C:00:00.0] NVIDIA A100-SXM4-80GB
#  Rank  6 Pid 1692124 on costsim-dev-00000A device  6 [000D:00:00.0] NVIDIA A100-SXM4-80GB
#  Rank  7 Pid 1692125 on costsim-dev-00000A device  7 [000E:00:00.0] NVIDIA A100-SXM4-80GB
#
# Initializing MSCCL++
# Setting up the connection in MSCCL++
#
#                                    in-place                       out-of-place          
#       size         count     time   algbw   busbw  #wrong     time   algbw   busbw  #wrong
#        (B)    (elements)     (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)       
       10240          2560     6.54    1.56    2.74      0

# Out of bounds values : 0 OK
#
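
As background on the packet copy (LL) path measured above, here is a minimal host-side sketch of the general low-latency packet idea: each pair of 4-byte payload words travels together with a flag, so the receiver can detect arrival by checking the flags instead of waiting on a separate signal. The struct layout and the names `LLPacketSketch`, `packLL`, and `tryUnpackLL` are assumptions for illustration, not the interfaces added in this PR, and the real kernel does this with GPU threads and vectorized stores.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical 16-byte LL packet: two payload words, each followed by a flag.
struct LLPacketSketch {
  std::uint32_t data1;
  std::uint32_t flag1;
  std::uint32_t data2;
  std::uint32_t flag2;
};
static_assert(sizeof(LLPacketSketch) == 16, "packet is expected to be 16 bytes");

// Sender side: wrap every pair of payload words with the flag for this round.
// An odd trailing word is padded with zero.
std::vector<LLPacketSketch> packLL(const std::vector<std::uint32_t>& src, std::uint32_t flag) {
  std::vector<LLPacketSketch> pkts;
  pkts.reserve((src.size() + 1) / 2);
  for (std::size_t i = 0; i < src.size(); i += 2) {
    std::uint32_t second = (i + 1 < src.size()) ? src[i + 1] : 0u;
    pkts.push_back({src[i], flag, second, flag});
  }
  return pkts;
}

// Receiver side: the payload is valid only when both flags match the expected round.
bool tryUnpackLL(const LLPacketSketch& pkt, std::uint32_t expectedFlag, std::uint32_t out[2]) {
  if (pkt.flag1 != expectedFlag || pkt.flag2 != expectedFlag) return false;
  out[0] = pkt.data1;
  out[1] = pkt.data2;
  return true;
}
```

Under this layout the 2560-element (10240 B) message becomes 1280 packets, or 20480 B on the wire. That is the usual trade of an LL scheme: half the bytes are flags, in exchange for not needing a separate synchronization step.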

@chhwang chhwang requested a review from saeedmaleki June 8, 2023 13:42
Contributor Author

chhwang commented Jun 8, 2023

@saeedmaleki I struggled with an occasional correctness bug in the LL AllReduce (kernel 2) and, after many tests, concluded that we need a __threadfence() here. However, it is still unclear why it is needed. Do you have any ideas? https://github.com/microsoft/mscclpp/blob/chhwang/ll/test/mscclpp-test/allreduce_test.cu#L238
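
For context on what a fence guarantees in general, the following host-side C++ sketch shows the classic write-data-then-set-flag handoff. It is only loosely analogous to `__threadfence()` on the GPU and does not claim to explain the bug in kernel 2, whose cause is not established in this thread. The function name `runFencedHandoff` is hypothetical.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Producer writes a payload, issues a release fence, then sets a flag.
// Consumer spins on the flag, issues an acquire fence, then reads the payload.
// The two fences order the payload write before the payload read; without
// them the relaxed flag accesses alone would not give that guarantee.
int runFencedHandoff() {
  int payload = 0;
  std::atomic<int> flag{0};
  int observed = -1;

  std::thread consumer([&] {
    while (flag.load(std::memory_order_relaxed) == 0) {
      std::this_thread::yield();  // wait until the producer publishes
    }
    std::atomic_thread_fence(std::memory_order_acquire);
    observed = payload;
  });

  payload = 42;
  std::atomic_thread_fence(std::memory_order_release);
  flag.store(1, std::memory_order_relaxed);

  consumer.join();
  return observed;
}
```

Calling `runFencedHandoff()` returns the published payload. Removing either fence makes the read of `payload` a data race in the C++ memory model, which is the kind of ordering gap a missing fence can leave open.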

Review thread: test/mscclpp-test/allreduce_test.cu (outdated, resolved)
Review thread: include/mscclpp/channel.hpp (resolved)
Review thread: test/mscclpp-test/allreduce_test.cu (outdated, resolved)
Contributor Author

chhwang commented Jun 12, 2023

10 KiB AllReduce, 8 GPUs, kernel 2 (packet copy): latency is 5.99 us (7 blocks, 1 block per peer).

# minBytes 10240 maxBytes 10240 step: 2(factor) warmup iters: 10 iters: 1000 validation: 1 graph: 100, kernel num: 2
#
# Using devices
#  Rank  0 Pid 2196015 on costsim-dev-00000A device  0 [0001:00:00.0] NVIDIA A100-SXM4-80GB
#  Rank  1 Pid 2196016 on costsim-dev-00000A device  1 [0002:00:00.0] NVIDIA A100-SXM4-80GB
#  Rank  2 Pid 2196017 on costsim-dev-00000A device  2 [0003:00:00.0] NVIDIA A100-SXM4-80GB
#  Rank  3 Pid 2196018 on costsim-dev-00000A device  3 [0004:00:00.0] NVIDIA A100-SXM4-80GB
#  Rank  4 Pid 2196019 on costsim-dev-00000A device  4 [000B:00:00.0] NVIDIA A100-SXM4-80GB
#  Rank  5 Pid 2196020 on costsim-dev-00000A device  5 [000C:00:00.0] NVIDIA A100-SXM4-80GB
#  Rank  6 Pid 2196021 on costsim-dev-00000A device  6 [000D:00:00.0] NVIDIA A100-SXM4-80GB
#  Rank  7 Pid 2196022 on costsim-dev-00000A device  7 [000E:00:00.0] NVIDIA A100-SXM4-80GB
#
# Initializing MSCCL++
# Setting up the connection in MSCCL++
#
#                                    in-place                       out-of-place          
#       size         count     time   algbw   busbw  #wrong     time   algbw   busbw  #wrong
#        (B)    (elements)     (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)       
       10240          2560     5.99    1.71    2.99      0

# Out of bounds values : 0 OK
#

@chhwang chhwang enabled auto-merge (squash) June 12, 2023 07:57
@chhwang chhwang merged commit 43de015 into main Jun 12, 2023
@chhwang chhwang deleted the chhwang/ll branch June 12, 2023 13:53

3 participants