CL/HIER: bcast 2step algorithm #620

Merged (8 commits) on Feb 1, 2023

@vspetrov vspetrov commented Sep 5, 2022

What

Adds a two-level pipelined broadcast algorithm to CL/HIER.

Performance improvement. Example on 32 nodes, ppn=56:

| UCC master (tl/ucp) | UCC new (cl/hier, shm+ucp) | UCC new (cl/hier, shm+ucp) + BCAST_KN_RADIX=6 |
|---|---|---|
| 6.46 | 5.27 | 4.38 |
| 6.49 | 5.32 | 4.47 |
| 6.51 | 5.32 | 4.39 |
| 6.46 | 5.33 | 4.44 |
| 6.45 | 5.29 | 4.39 |
| 7.21 | 5.54 | 4.59 |
| 8.19 | 6.3 | 5.21 |
| 8.76 | 7.2 | 6.42 |
| 10.95 | 7.6 | 6.15 |
| 11.72 | 8.25 | 6.55 |
| 14.06 | 9.15 | 7.44 |
| 16.86 | 11.78 | 9.72 |
| 24.28 | 17.73 | 14.92 |

@vspetrov vspetrov force-pushed the topic/cl_hier_bcast_2step branch 2 times, most recently from 4ae0f20 to 4c8ae5e Compare November 20, 2022 20:11
@vspetrov vspetrov force-pushed the topic/cl_hier_bcast_2step branch from 4c8ae5e to e08014e Compare December 5, 2022 20:03
@Sergei-Lebedev
Contributor

@vspetrov it looks like ASan found a memory leak with the new gtest. Do you want to fix it in this PR, or should I create a fix in a separate PR?

@vspetrov vspetrov force-pushed the topic/cl_hier_bcast_2step branch from ac489b1 to 9ffa8b2 Compare December 7, 2022 08:20
goto out;
}
n_tasks++;
if (root_on_local_node && (root != rank)) {
Contributor

It seems this condition is always false:
if root != rank, then the root is on a remote node, because we are inside if (SBGP_ENABLED(cl_team, NODE_LEADERS));
if root_on_local_node, then root == rank, for the same reason.

In that case you don't need first_task, because first_task is always 0.

Collaborator Author

Yes, I believe you are right. I re-checked and tested.

@vspetrov vspetrov force-pushed the topic/cl_hier_bcast_2step branch from 9ffa8b2 to c322720 Compare December 12, 2022 08:12
ucc_schedule_add_task(schedule, tasks[0]);
if (n_tasks > 1) {
if (root == rank) {
ucc_task_subscribe_dep(&schedule->super, tasks[1],
Contributor

This is not correct: you can start the 2 tasks (node leaders bcast and node bcast) in parallel only if the root rank belongs to the node leaders group.

Collaborator Author

> it's not correct, you can start 2 tasks (node leaders bcast and node bcast) in parallel only if root rank belongs to node leaders group.

If the root is not a node leader, then it has only 1 task and will not get there.

Contributor

OK, but the node leader task should depend on the node task in that case.

Collaborator Author

Right, this is fun. That is exactly why I had "first_task" before, and the condition above was actually correct: it handles that case. I'll revert that last change.

@vspetrov vspetrov force-pushed the topic/cl_hier_bcast_2step branch from c322720 to 2155d14 Compare December 12, 2022 15:07
@vspetrov
Collaborator Author

So, I added "first_task" back, since it handles exactly the required case. Also, the reason gtest didn't catch the bug when I removed "first_task" is that there was an incorrect "reset" callback in the gtest bcast unit test. Fixed that as well.

@vspetrov vspetrov force-pushed the topic/cl_hier_bcast_2step branch from 2155d14 to dbecc38 Compare December 12, 2022 17:23
@vspetrov
Collaborator Author

It looks like the persistent gtest hangs with the new bcast 2step. Need to debug...

@manjugv
Contributor

manjugv commented Dec 14, 2022

Need to check with UCF folks on the copyright

@manjugv
Contributor

manjugv commented Jan 27, 2023

@vspetrov Please add the copyright blurb you want. It was cleared by UCF.

@vspetrov vspetrov force-pushed the topic/cl_hier_bcast_2step branch from dbecc38 to 15915dd Compare February 1, 2023 10:19
@Sergei-Lebedev Sergei-Lebedev merged commit fb47917 into openucx:master Feb 1, 2023
janjust pushed a commit to janjust/ucc that referenced this pull request Jan 31, 2024
* CL/HIER: bcast 2step algorithm

* TEST: gtest cl_hier_rab allreduce gtest

* TEST: bcast cl_hier_2step gtest

* CI: add enable-assert to default clang-tidy

* CL/HIER: fix oob mem leak

* REVIEW: address comments

* TEST: bcast gtest "reset" fix

* CL/HIER: bcast 2 step persistent fallback

---------

Co-authored-by: Valentin Petrov <valentinp@nvidia.com>