Add packet copy (LL) for AllReduce #85

chhwang · 2023-05-25T02:56:52Z

Add packet copy interfaces and an example kernel for AllReduce

chhwang · 2023-06-08T12:57:42Z

1 GiB AllReduce, 8 GPUs, kernel 1 (ring reduce): bus bandwidth (busbw) after the bug fix is 236.64 GB/s

# minBytes 1073741824 maxBytes 1073741824 step: 2(factor) warmup iters: 10 iters: 10 validation: 1 graph: 10, kernel num: 1
#
# Using devices
#  Rank  0 Pid 1665108 on costsim-dev-00000A device  0 [0001:00:00.0] NVIDIA A100-SXM4-80GB
#  Rank  1 Pid 1665109 on costsim-dev-00000A device  1 [0002:00:00.0] NVIDIA A100-SXM4-80GB
#  Rank  2 Pid 1665110 on costsim-dev-00000A device  2 [0003:00:00.0] NVIDIA A100-SXM4-80GB
#  Rank  3 Pid 1665111 on costsim-dev-00000A device  3 [0004:00:00.0] NVIDIA A100-SXM4-80GB
#  Rank  4 Pid 1665112 on costsim-dev-00000A device  4 [000B:00:00.0] NVIDIA A100-SXM4-80GB
#  Rank  5 Pid 1665113 on costsim-dev-00000A device  5 [000C:00:00.0] NVIDIA A100-SXM4-80GB
#  Rank  6 Pid 1665114 on costsim-dev-00000A device  6 [000D:00:00.0] NVIDIA A100-SXM4-80GB
#  Rank  7 Pid 1665115 on costsim-dev-00000A device  7 [000E:00:00.0] NVIDIA A100-SXM4-80GB
#
# Initializing MSCCL++
# Setting up the connection in MSCCL++
#
#                                    in-place                       out-of-place          
#       size         count     time   algbw   busbw  #wrong     time   algbw   busbw  #wrong
#        (B)    (elements)     (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)       
  1073741824     268435456   7940.6  135.22  236.64      0

# Out of bounds values : 0 OK
#
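
For readers checking the table, the `algbw` and `busbw` columns can be reproduced from `size` and `time`. The sketch below assumes mscclpp-test follows the nccl-tests convention for AllReduce (busbw = algbw × 2(n−1)/n with 1 GB = 1e9 bytes); the numbers in the log are consistent with that, but the function names here are hypothetical and not part of the test harness.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>

// Algorithm bandwidth in GB/s (1 GB = 1e9 bytes): payload size over elapsed time.
double algBwGBps(std::size_t bytes, double timeUs) {
  return static_cast<double>(bytes) / (timeUs * 1e-6) / 1e9;
}

// Bus bandwidth for AllReduce, ASSUMING the nccl-tests convention:
// busbw = algbw * 2 * (n - 1) / n for n ranks.
double allReduceBusBwGBps(double algBw, int nRanks) {
  return algBw * 2.0 * (nRanks - 1) / nRanks;
}
```

With the row above (1073741824 bytes, 7940.6 us, 8 ranks) these give roughly 135.22 GB/s and 236.64 GB/s, matching the reported columns.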

chhwang · 2023-06-08T13:42:06Z

10 KiB AllReduce, 8 GPUs, kernel 2 (packet copy): latency is 6.54 us with 7 thread blocks (1 block per peer). Using more thread blocks doesn't help.

# minBytes 10240 maxBytes 10240 step: 2(factor) warmup iters: 10 iters: 1000 validation: 1 graph: 100, kernel num: 2
#
# Using devices
#  Rank  0 Pid 1692118 on costsim-dev-00000A device  0 [0001:00:00.0] NVIDIA A100-SXM4-80GB
#  Rank  1 Pid 1692119 on costsim-dev-00000A device  1 [0002:00:00.0] NVIDIA A100-SXM4-80GB
#  Rank  2 Pid 1692120 on costsim-dev-00000A device  2 [0003:00:00.0] NVIDIA A100-SXM4-80GB
#  Rank  3 Pid 1692121 on costsim-dev-00000A device  3 [0004:00:00.0] NVIDIA A100-SXM4-80GB
#  Rank  4 Pid 1692122 on costsim-dev-00000A device  4 [000B:00:00.0] NVIDIA A100-SXM4-80GB
#  Rank  5 Pid 1692123 on costsim-dev-00000A device  5 [000C:00:00.0] NVIDIA A100-SXM4-80GB
#  Rank  6 Pid 1692124 on costsim-dev-00000A device  6 [000D:00:00.0] NVIDIA A100-SXM4-80GB
#  Rank  7 Pid 1692125 on costsim-dev-00000A device  7 [000E:00:00.0] NVIDIA A100-SXM4-80GB
#
# Initializing MSCCL++
# Setting up the connection in MSCCL++
#
#                                    in-place                       out-of-place          
#       size         count     time   algbw   busbw  #wrong     time   algbw   busbw  #wrong
#        (B)    (elements)     (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)       
       10240          2560     6.54    1.56    2.74      0

# Out of bounds values : 0 OK
#

chhwang · 2023-06-08T13:51:51Z

@saeedmaleki I struggled with an occasional correctness bug in the LL AllReduce (kernel 2) and, after many tests, concluded that we need a __threadfence() here. However, it is still unclear why it is needed. Do you have any ideas? https://github.com/microsoft/mscclpp/blob/chhwang/ll/test/mscclpp-test/allreduce_test.cu#L238
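
For context on what "packet copy (LL)" refers to, here is a host-side sketch of the general low-latency packet idea: data and a flag travel in one word, so the receiver can poll the flag instead of using a separate signal. This is an illustration under assumptions, not the code in this PR: the `Packet` type, the 32-bit data / 32-bit flag layout, and the function names are hypothetical, and `std::atomic` on the CPU does not model GPU memory ordering, so it does not answer the `__threadfence()` question above.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Hypothetical packet: a 32-bit payload and a 32-bit flag packed into one
// 64-bit word, so both become visible with a single atomic store.
struct Packet {
  std::atomic<uint64_t> word{0};
};

// Sender: publish data and flag together in one store.
void putPacket(Packet& p, uint32_t data, uint32_t flag) {
  uint64_t w = (static_cast<uint64_t>(flag) << 32) | data;
  p.word.store(w, std::memory_order_release);
}

// Receiver: returns true and fills `data` only if the expected flag is present.
bool tryGetPacket(const Packet& p, uint32_t expectedFlag, uint32_t& data) {
  uint64_t w = p.word.load(std::memory_order_acquire);
  if (static_cast<uint32_t>(w >> 32) != expectedFlag) return false;
  data = static_cast<uint32_t>(w);
  return true;
}
```

A receiver would typically call `tryGetPacket` in a loop until it returns true, using a flag value that changes every iteration so stale packets from a previous round are not mistaken for new ones.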

chhwang · 2023-06-12T07:51:23Z

10 KiB AllReduce, 8 GPUs, kernel 2 (packet copy): latency is 5.99 us with 7 thread blocks (1 block per peer)

# minBytes 10240 maxBytes 10240 step: 2(factor) warmup iters: 10 iters: 1000 validation: 1 graph: 100, kernel num: 2
#
# Using devices
#  Rank  0 Pid 2196015 on costsim-dev-00000A device  0 [0001:00:00.0] NVIDIA A100-SXM4-80GB
#  Rank  1 Pid 2196016 on costsim-dev-00000A device  1 [0002:00:00.0] NVIDIA A100-SXM4-80GB
#  Rank  2 Pid 2196017 on costsim-dev-00000A device  2 [0003:00:00.0] NVIDIA A100-SXM4-80GB
#  Rank  3 Pid 2196018 on costsim-dev-00000A device  3 [0004:00:00.0] NVIDIA A100-SXM4-80GB
#  Rank  4 Pid 2196019 on costsim-dev-00000A device  4 [000B:00:00.0] NVIDIA A100-SXM4-80GB
#  Rank  5 Pid 2196020 on costsim-dev-00000A device  5 [000C:00:00.0] NVIDIA A100-SXM4-80GB
#  Rank  6 Pid 2196021 on costsim-dev-00000A device  6 [000D:00:00.0] NVIDIA A100-SXM4-80GB
#  Rank  7 Pid 2196022 on costsim-dev-00000A device  7 [000E:00:00.0] NVIDIA A100-SXM4-80GB
#
# Initializing MSCCL++
# Setting up the connection in MSCCL++
#
#                                    in-place                       out-of-place          
#       size         count     time   algbw   busbw  #wrong     time   algbw   busbw  #wrong
#        (B)    (elements)     (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)       
       10240          2560     5.99    1.71    2.99      0

# Out of bounds values : 0 OK
#

Binyang2014 and others added 29 commits April 27, 2023 07:36

code complete

35814e9

fix correctness issue

6d0815b

Fix correctness issue

76cffe5

fix lint

17a78c3

pass compile

6c8a265

Fix build issue

90a71c5

Fix runtime error

8ec7723

Fix correctness issue

bdbd70e

merge main

ce7f305

Fix crash issue

24cc3fc

minor change

a448ab3

Fix memory leak

5520b7e

Fix review comments

59ce235

Finish allgather

72fe5dd

address comments

e51039d

load element to register first then store to remote address

cf0cf3d

merge mscclpp-test branch

e0ea017

Finish allGather

e5dc130

merge main

de18ead

Merge branch 'main' into binyli/all-gather

668d540

Merge branch 'main' into binyli/ring-reduce

32633dd

init

38cffbb

Merge binyli/all-gather

c61fd39

Build connections

07a3825

allreduce_test works

a1ea369

Merge branch 'main' into binyli/all-reduce

52ef4b3

Bug fix

13f1798

Add CUDA flags

aa67880

Add packet copy (LL)

9a26af0

chhwang requested a review from Binyang2014 May 25, 2023 02:57

chhwang added 5 commits June 6, 2023 02:46

Fix overlapping

1010889

Improve vector sum

af67914

Merge branch 'main' into binyli/all-reduce

01c4891

Merge branch 'main' into binyli/all-reduce

7f3f366

Merge branch 'binyli/all-reduce' into chhwang/ll

d9d67b4

chhwang force-pushed the chhwang/ll branch from 3afe72b to d9d67b4 Compare June 6, 2023 07:52

Base automatically changed from binyli/all-reduce to main June 7, 2023 10:58

Enhance LL AllReduce

8191379

chhwang force-pushed the chhwang/ll branch from 25b02ba to 8191379 Compare June 8, 2023 08:51

chhwang added 3 commits June 8, 2023 11:30

Support multiple blocks per peer

824dbb9

Merge branch 'main' into chhwang/ll

34bc4ec

Fix a ring reduce bug

e67f7f3

chhwang added 2 commits June 8, 2023 13:31

Fix an AllReduce kernel 2 bug

75da9ac

Bug fix

2f6c39c

chhwang requested a review from saeedmaleki June 8, 2023 13:42

saeedmaleki approved these changes Jun 12, 2023

chhwang added 3 commits June 12, 2023 07:33

Out-of-place for AllReduce kernel 2

740997c

Optimize sync()

72a8376

Merge branch 'main' into chhwang/ll

59e0a9a

chhwang enabled auto-merge (squash) June 12, 2023 07:57

Binyang2014 reviewed Jun 12, 2023

Binyang2014 reviewed Jun 12, 2023

Init TestEngine with TestArgs

f6adba9

Binyang2014 approved these changes Jun 12, 2023

chhwang merged commit 43de015 into main Jun 12, 2023

chhwang deleted the chhwang/ll branch June 12, 2023 13:53
