TL/MLX5: revise team and ctx init #815

Merged: 14 commits, Aug 22, 2023

@samnordmann commented Jul 31, 2023

What

Revise TL/MLX5 context and team init.

  • Allow other collectives to run if a2a setup fails
  • Handle device memory allocation errors and other errors that occur during team alltoall init
  • Allocate device memory atomics before the device memory pool
  • Change setup/cleanup error logs to debug logs

Along the way, we fix two important bugs
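
A minimal sketch of the behavior listed above, under loud assumptions: every name here (`sketch_team_t`, `sketch_a2a_init`, `sketch_team_create`, and so on) is hypothetical and stands in for the real TL/MLX5 code, and plain `malloc` stands in for device memory. It only illustrates the intended control flow: atomics are allocated before the pool, a failure in alltoall setup is logged at debug level and cleaned up, and team creation still succeeds so that other collectives can run.

```c
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical types and names; not the TL/MLX5 API. */
typedef enum { SKETCH_OK = 0, SKETCH_ERR_NO_MEMORY = -1 } sketch_status_t;

typedef struct {
    void *dm_atomics;    /* stands in for device-memory atomics */
    void *dm_pool;       /* stands in for the device-memory pool */
    bool  a2a_available; /* false when alltoall setup failed */
} sketch_team_t;

/* Debug-level message: a failed optional setup is not reported as an error. */
static void sketch_debug(const char *msg)
{
    fprintf(stderr, "debug: %s\n", msg);
}

/* Allocation wrapper with a switch that simulates an allocation failure. */
static void *sketch_alloc(size_t size, bool simulate_failure)
{
    return simulate_failure ? NULL : malloc(size);
}

/* Alltoall setup: atomics first, then the pool; undo on failure. */
static sketch_status_t sketch_a2a_init(sketch_team_t *team, bool fail_atomics,
                                       bool fail_pool)
{
    team->dm_atomics = sketch_alloc(64, fail_atomics);
    if (!team->dm_atomics) {
        sketch_debug("failed to allocate dm atomics");
        return SKETCH_ERR_NO_MEMORY;
    }
    team->dm_pool = sketch_alloc(4096, fail_pool);
    if (!team->dm_pool) {
        sketch_debug("failed to allocate dm pool");
        free(team->dm_atomics); /* release what was already set up */
        team->dm_atomics = NULL;
        return SKETCH_ERR_NO_MEMORY;
    }
    return SKETCH_OK;
}

/* Team creation succeeds even when alltoall setup fails. */
static sketch_status_t sketch_team_create(sketch_team_t *team,
                                          bool fail_atomics, bool fail_pool)
{
    team->dm_atomics    = NULL;
    team->dm_pool       = NULL;
    team->a2a_available =
        (sketch_a2a_init(team, fail_atomics, fail_pool) == SKETCH_OK);
    if (!team->a2a_available) {
        sketch_debug("alltoall unavailable; other collectives remain usable");
    }
    return SKETCH_OK;
}

/* Cleanup is safe whether or not alltoall setup succeeded. */
static void sketch_team_destroy(sketch_team_t *team)
{
    free(team->dm_pool);
    free(team->dm_atomics);
    team->dm_pool    = NULL;
    team->dm_atomics = NULL;
}
```

In this sketch a caller would check `a2a_available` before selecting the alltoall path and fall back to another implementation otherwise; how the real code advertises availability is not shown here.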

@samnordmann force-pushed the tl_mlx5/fix_team_creation branch 2 times, most recently from 1575241 to 2b0dfb3 on July 31, 2023 10:06
@samnordmann force-pushed the tl_mlx5/fix_team_creation branch from 26c7d95 to 9008049 on August 7, 2023 12:34
@samnordmann requested a review from MamziB on August 7, 2023 12:35
@samnordmann force-pushed the tl_mlx5/fix_team_creation branch from 9008049 to 9744aec on August 7, 2023 12:45
@samnordmann requested a review from bureddy on August 10, 2023 11:24
@samnordmann force-pushed the tl_mlx5/fix_team_creation branch from 2d8cc32 to 855f390 on August 17, 2023 17:14
@samnordmann force-pushed the tl_mlx5/fix_team_creation branch from ae74c70 to bff0c7e on August 21, 2023 10:26
@samnordmann requested a review from manjugv on August 21, 2023 15:21
@Sergei-Lebedev merged commit cd7175c into openucx:master on Aug 22, 2023
nsarka pushed a commit to nsarka/ucc that referenced this pull request Oct 24, 2023
* TL/MLX5: revise team and ctx init

* TL/MLX5: change setup/cleanup error log to debug

* TL/MLX5: minor reviews

* CODESTYLE: clang-tidy

* TL/MLX5: alloc dm atomics before mpool

* CODESTYLE: clang-tidy

* TL/MLX5: fix bug with with socket

* REVIEW: minor comments

* TL/MLX5: disable coverity issue

* REVIEW: minor comments

* TL/MLX5: fix socket closing

* TL/MLX5: score map if a2a not avail

* CODESTYLE: clang format
janjust pushed a commit to janjust/ucc that referenced this pull request Jan 31, 2024
6 participants