TL/UCP: use pipelining in SRA allreduce for CUDA #873

Merged Dec 8, 2023 (1 commit)


Sergei-Lebedev
Contributor

What

Use pipelining in the SRA knomial allreduce algorithm for in-place CUDA buffers.

Why ?

Improves performance for large messages. In the in-place case SRA needs scratch space, and for large messages that space always has to be allocated through cudaMalloc, which is slow.

How ?

The SRA allreduce is split into chunks whose size equals the CUDA memory pool element size, so the scratch space for each chunk fits in a single pool element.
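To illustrate the chunking arithmetic only, here is a minimal sketch in C. It is not taken from the UCC source: the function names `sra_num_frags` and `sra_frag_range` and their parameters are hypothetical, and `chunk_bytes` is assumed to stand in for the CUDA memory pool element size.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helpers, not part of UCC: they only show how a message
 * could be cut into fragments of at most chunk_bytes each. */

/* Number of fragments needed to cover msg_bytes. */
size_t sra_num_frags(size_t msg_bytes, size_t chunk_bytes)
{
    assert(chunk_bytes > 0);
    /* Ceiling division written without overflow. */
    return msg_bytes / chunk_bytes + (msg_bytes % chunk_bytes != 0);
}

/* Byte offset and length of fragment frag_idx within the message.
 * Every fragment is chunk_bytes long except possibly the last one. */
void sra_frag_range(size_t msg_bytes, size_t chunk_bytes, size_t frag_idx,
                    size_t *offset, size_t *len)
{
    size_t remaining;

    assert(frag_idx < sra_num_frags(msg_bytes, chunk_bytes));
    *offset   = frag_idx * chunk_bytes;
    remaining = msg_bytes - *offset;
    *len      = remaining < chunk_bytes ? remaining : chunk_bytes;
}
```

Under these assumptions, a 10-byte message with a 4-byte chunk yields three fragments covering bytes 0-3, 4-7, and 8-9. Because no fragment exceeds `chunk_bytes`, each fragment's scratch buffer could be served from a preallocated pool element rather than a fresh cudaMalloc.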

@Sergei-Lebedev Sergei-Lebedev merged commit d257388 into openucx:master Dec 8, 2023
11 checks passed
@Sergei-Lebedev Sergei-Lebedev deleted the topic/sra_pipeline branch December 8, 2023 16:20
B-a-S pushed a commit to B-a-S/ucc that referenced this pull request Jan 4, 2024
janjust pushed a commit to janjust/ucc that referenced this pull request Jan 31, 2024
3 participants