TL/UCP: use pipelining in SRA allreduce for CUDA #873

Sergei-Lebedev · 2023-11-07T15:04:13Z

What

Use pipelining in the SRA knomial allreduce algorithm for in-place CUDA buffers.

Why ?

Improves performance for large messages. In the in-place case SRA needs scratch space, and for large messages that space always has to be allocated through cudaMalloc, which is slow.

How ?

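The SRA allreduce is split into chunks, each the size of one CUDA memory pool element, so the scratch space for every chunk can come from the pool instead of a new cudaMalloc call.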
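The chunking idea can be illustrated with a minimal sketch. This is not the UCC implementation: it is a plain Python simulation in which the names `split_into_chunks`, `chunked_inplace_allreduce` and `pool_elem_count` are hypothetical, ranks are modelled as Python lists, a simple sum stands in for the scatter-reduce and allgather phases, and chunks are processed one after another rather than overlapped. It only shows that a fixed-size scratch buffer (standing in for one memory pool element) is enough to reduce an arbitrarily large in-place buffer.

```python
from typing import List, Tuple


def split_into_chunks(total: int, pool_elem_count: int) -> List[Tuple[int, int]]:
    """Return (offset, length) pairs covering `total` items.

    Each chunk holds at most `pool_elem_count` items, the assumed capacity
    of one pre-allocated memory pool element (hypothetical parameter).
    """
    if pool_elem_count <= 0:
        raise ValueError("pool_elem_count must be positive")
    chunks = []
    offset = 0
    while offset < total:
        length = min(pool_elem_count, total - offset)
        chunks.append((offset, length))
        offset += length
    return chunks


def chunked_inplace_allreduce(rank_buffers: List[List[int]],
                              pool_elem_count: int) -> None:
    """Sum-allreduce the simulated rank buffers in place, chunk by chunk.

    A single scratch buffer of `pool_elem_count` items is reused for every
    chunk, so no allocation scales with the message size.
    """
    total = len(rank_buffers[0])
    scratch = [0] * pool_elem_count  # stands in for one pool element
    for offset, length in split_into_chunks(total, pool_elem_count):
        # Reduce this chunk across all ranks into the scratch buffer.
        for i in range(length):
            scratch[i] = sum(buf[offset + i] for buf in rank_buffers)
        # Write the reduced chunk back into every rank's buffer.
        for buf in rank_buffers:
            buf[offset:offset + length] = scratch[:length]
```

For example, calling `chunked_inplace_allreduce` on two simulated rank buffers of five items with `pool_elem_count=2` processes three chunks (two full, one partial) and leaves both buffers holding the element-wise sum.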

Sergei-Lebedev force-pushed the topic/sra_pipeline branch from c5d2cdf to 0db620d Compare November 14, 2023 11:19

Sergei-Lebedev requested review from manjugv, bureddy and samnordmann November 14, 2023 14:30

Sergei-Lebedev added the Ready-for-Review label Nov 14, 2023

samnordmann approved these changes Nov 23, 2023

bureddy approved these changes Dec 6, 2023

TL/UCP: use pipelining in SRA allreduce for CUDA

b28f886

Sergei-Lebedev force-pushed the topic/sra_pipeline branch from 0db620d to b28f886 Compare December 8, 2023 12:22

Sergei-Lebedev merged commit d257388 into openucx:master Dec 8, 2023
11 checks passed

Sergei-Lebedev deleted the topic/sra_pipeline branch December 8, 2023 16:20

B-a-S pushed a commit to B-a-S/ucc that referenced this pull request Jan 4, 2024

TL/UCP: use pipelining in SRA allreduce for CUDA (openucx#873)

52c86d8

janjust pushed a commit to janjust/ucc that referenced this pull request Jan 31, 2024

TL/UCP: use pipelining in SRA allreduce for CUDA (openucx#873)

1cbd598
