Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

TL/MLX5: various optimizations #1012

Open
wants to merge 5 commits into
base: master
Choose a base branch
from

Conversation

samnordmann
Copy link
Collaborator

@samnordmann samnordmann commented Aug 21, 2024

What

This PR contains various optimizations for TL/MLX5/a2a. In order of importance/relevance:

  1. support rectangular blocks
  2. other configurations in how we post the WQEs:
    • iterate across nodes before blocks when posting the WQEs
    • reuse dm chunks
    • send blocks by batch
  3. knomial fan-in for the internode sync

We might want to merge this PR as is, or to divide it into several smaller ones. But this branch is at least a pointer for a working version, that can be used as is for performance experimentation.

TODO:

One important optimization that is yet to be implemented is to support using several NICs. So far, our algorithm only uses one NIC.

cc @lappazos @x41lakazam

@janjust janjust requested review from MamziB and janjust September 19, 2024 15:23
ucc_tl_mlx5_alltoall_t *a2a = team->a2a;
int seq_index = task->alltoall.seq_index;
int npolls = UCC_TL_MLX5_TEAM_CTX(team)->cfg.npolls;
int radix = UCC_TL_MLX5_TEAM_LIB(team)->cfg.fanin_kn_radix;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

alignment

}
*dist *= radix;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

alignment

@@ -242,7 +282,13 @@ static ucc_status_t ucc_tl_mlx5_fanout_start(ucc_coll_task_t *coll_task)

tl_debug(UCC_TASK_LIB(task), "fanout start");
/* start task if completion event received */
UCC_TL_MLX5_PROFILE_REQUEST_EVENT(task, "mlx5_alltoall_fanout_start", 0);
if (team->a2a->node.sbgp->group_rank == team->a2a->node.asr_rank) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we remove this if/else when PROFILING is not requested? this is to improve the performance when profiling is not needed

ucc_tl_mlx5_alltoall_t *a2a = team->a2a;
int node_size = a2a->node.sbgp->group_size;
int net_size = a2a->net.sbgp->group_size;
int op_msgsize = node_size * a2a->max_msg_size * UCC_TL_TEAM_SIZE(team) *
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

align

int block_msgsize = block_h * block_w * task->alltoall.msg_size;
ucc_status_t status = UCC_OK;
int node_grid_w = node_size / block_w;
int node_nbr_blocks = (node_size * node_size) / (block_h * block_w);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

align

if (!send_to_self &&
task->alltoall.op->blocks_sent[cyc_rank] < node_nbr_blocks) {
dm = ucc_tl_mlx5_a2a_wait_for_dm_chunk(task);
if (status != UCC_OK) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

which status you are checking?

@@ -28,6 +28,19 @@ static ucc_config_field_t ucc_tl_mlx5_lib_config_table[] = {
ucc_offsetof(ucc_tl_mlx5_lib_config_t, dm_buf_num),
UCC_CONFIG_TYPE_ULUNITS},

{"FORCE_REGULAR", "y",
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you encapsulate all these three env vars into one with 3 options

c->offset = (ptrdiff_t)team->dm_offset;
team->dm_offset = PTR_OFFSET(team->dm_offset,
UCC_TL_MLX5_TEAM_LIB(team)->cfg.dm_buf_size);
c->addr = (uintptr_t)PTR_OFFSET(
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

align

@swx-jenkins3
Copy link

Can one of the admins verify this patch?

TL/MLX5: add npolls cfg for FANIN

TL/MLX5: knomial fanin

TL/MLX5: add prints and profile events

TL/MLX5: remove debug prints
tiny bit more robust

print blocks dimensions

fully working configurable batch_size, serialization, and pollings
clean and working

TL/MLX5: add more config for block dimensions

force longer by default
@samnordmann samnordmann force-pushed the tl/mlx5/pr/main_clean_history branch from 32d0718 to 43dd1d7 Compare December 18, 2024 16:48
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants