Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

TL/CUDA: fix linear algorithms #751

Merged

Conversation

Sergei-Lebedev
Copy link
Contributor

What

  • Fixes linear algorithms in TL CUDA for NVSwitch, add checks for max number of peers
  • Add copy multi performance test to ucc_perftest
  • Use cudaMemcpyAysnc for large size of copy multi operation

Why ?

Fixes crashes on some NVLink topologies, improves performance of linear algorithms

@Sergei-Lebedev Sergei-Lebedev force-pushed the topic/tl_cuda_linear_fix branch 2 times, most recently from b5c71d1 to 51cada1 Compare March 20, 2023 13:41
src/components/ec/base/ucc_ec_base.h Show resolved Hide resolved
src/components/ec/cuda/ec_cuda.h Outdated Show resolved Hide resolved
cudaGraphGetNodes(ee_task->graph, nodes, &num_nodes);
for (i = 0; i < task_args->copy_multi.num_vectors; i++) {
status = CUDA_FUNC(
cudaGraphExecMemcpyNodeSetParams1D(ee_task->graph_exec, nodes[i],
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: zero-length operations are not supported, so maybe skip when counts[i]=0

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can't skip it since all nodes of the graphs should be valid

src/components/ec/cuda/ec_cuda_executor_interruptible.c Outdated Show resolved Hide resolved
tools/perf/ucc_pt_benchmark.cc Show resolved Hide resolved
@@ -19,8 +20,18 @@ ucc_pt_op_memcpy::ucc_pt_op_memcpy(ucc_datatype_t dt, ucc_memory_type mt,
has_range_ = true;
has_bw_ = true;

if (nbufs == UCC_PT_DEFAULT_N_BUFS) {
nbufs = 1;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe define "1" in a macro ?
Same for the constants used in reduce and reduce_strided

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it's used in one place and default is different for different tests, can you pls elaborate what do you want to change

Copy link
Collaborator

@samnordmann samnordmann Mar 28, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what I suggest is to do something along the line of:
in tools/perf/ucc_pt_config.h

#define UCC_PT_DEFAULT_N_BUFS 0
#define UCC_PT_DEFAULT_N_BUFS_MEMCPY 1
#define UCC_PT_DEFAULT_N_BUFS_REDUCE 1
#define UCC_PT_DEFAULT_N_BUFS_REDUCE_STRIDED 2

and then to use those variables in the different files instead of the hardcoded numbers.

If you think this is irrelevant please discard this suggestion

@Sergei-Lebedev Sergei-Lebedev force-pushed the topic/tl_cuda_linear_fix branch from 51cada1 to 0ed5edd Compare March 28, 2023 14:15
@manjugv manjugv added this to the v1.2.0 Release milestone Mar 29, 2023
@Sergei-Lebedev Sergei-Lebedev force-pushed the topic/tl_cuda_linear_fix branch from 0ed5edd to 1e1d387 Compare April 20, 2023 08:09
@Sergei-Lebedev Sergei-Lebedev merged commit 67828ae into openucx:master Apr 20, 2023
@Sergei-Lebedev Sergei-Lebedev deleted the topic/tl_cuda_linear_fix branch May 3, 2023 07:06
janjust pushed a commit to janjust/ucc that referenced this pull request Jan 31, 2024
* TL/CUDA: fix linear algorithms

* REVIEW: fix review comments
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants