TL/SHARP: SHARP OOB fixes #746

Merged (4 commits) on Mar 15, 2023

@bureddy (Collaborator) commented Mar 6, 2023

  • Hide SHARP library errors.
  • Disable lazy init by default.
  • Remove the hard-coded SHARP IB device.

Why?

  • Users see SHARP library errors when the application tries to contact the sharp_am service on setups where SHARP is not enabled.
  • SHARP groups are a limited resource, so SHARP team creation needs to know whether resources are available; otherwise the failure only shows up later, during the collective post.
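
The second point can be illustrated with a toy model. This is only a sketch under stated assumptions: `toy_pool_t`, `toy_team_t`, and the `toy_*` functions are hypothetical stand-ins and are not part of UCC or libsharp. It shows why allocating the group at team creation (non-lazy) reports exhaustion early, while lazy allocation defers the failure to the first collective post.

```c
/* Hypothetical stand-in for a limited pool of SHARP groups. */
typedef struct {
    int free_groups;
} toy_pool_t;

/* Hypothetical team that may or may not own a group yet. */
typedef struct {
    toy_pool_t *pool;
    int         has_group;
} toy_team_t;

/* Take one group from the pool; returns 0 on success, -1 if exhausted. */
static int toy_group_alloc(toy_pool_t *pool)
{
    if (pool->free_groups <= 0) {
        return -1;
    }
    pool->free_groups--;
    return 0;
}

/* With lazy != 0 the group allocation is deferred to the first collective,
 * so team creation always succeeds. With lazy == 0 an exhausted pool is
 * reported here, at team creation. */
static int toy_team_create(toy_team_t *team, toy_pool_t *pool, int lazy)
{
    team->pool      = pool;
    team->has_group = 0;
    if (lazy) {
        return 0;
    }
    if (toy_group_alloc(pool) != 0) {
        return -1;
    }
    team->has_group = 1;
    return 0;
}

/* Posting a collective needs a group; a lazy team allocates it only now. */
static int toy_collective_post(toy_team_t *team)
{
    if (!team->has_group) {
        if (toy_group_alloc(team->pool) != 0) {
            return -1;
        }
        team->has_group = 1;
    }
    return 0;
}
```

In this model, a team created with `lazy == 0` on an empty pool fails immediately, whereas a lazy team is created successfully and fails only in `toy_collective_post`.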

@bureddy bureddy force-pushed the sharp-oob branch 2 times, most recently from 5d07eaa to a5a3e16 Compare March 6, 2023 21:07
    if (!sharp_ctx->cfg.enable_lazy_group_alloc) {
        init_spec.config.flags |= SHARP_COLL_DISABLE_LAZY_GROUP_RESOURCE_ALLOC;
    }
#endif

Contributor: Print a warning if SHARP does not support disabling lazy group allocation and the user has changed the default value.

Collaborator (author): Maybe we can also expose the user option only when the flag is available.
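
The warning suggested above could look roughly like the following. This is a minimal sketch, not the code that was merged: the helper name, the `TOY_DEFAULT_ENABLE_LAZY_GROUP_ALLOC` constant, and the `disable_flag_supported` parameter are assumptions made for illustration. In the real code the support check would presumably happen at build time, as the `#endif` in the hunk suggests.

```c
#include <stdio.h>

/* Assumed default of the user option (lazy allocation disabled by default,
 * per the PR description). Hypothetical name. */
#define TOY_DEFAULT_ENABLE_LAZY_GROUP_ALLOC 0

/* Returns 1 and prints a warning when the SHARP library cannot disable lazy
 * group allocation but the user changed the option from its default.
 * Returns 0 otherwise. */
static int warn_if_lazy_option_ignored(int disable_flag_supported,
                                       int enable_lazy_group_alloc)
{
    if (disable_flag_supported) {
        return 0; /* the option is honored, nothing to report */
    }
    if (enable_lazy_group_alloc == TOY_DEFAULT_ENABLE_LAZY_GROUP_ALLOC) {
        return 0; /* the user did not touch the option */
    }
    fprintf(stderr,
            "warning: this SHARP version does not support disabling lazy "
            "group allocation; the option is ignored\n");
    return 1;
}
```

The alternative raised in the reply, exposing the option only when the flag exists, would make this check unnecessary, because the user could not set an option the library cannot honor.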

@@ -30,7 +30,7 @@ static ucc_config_field_t ucc_tl_sharp_context_config_table[] = {
     {"", "", NULL, ucc_offsetof(ucc_tl_sharp_context_config_t, super),
      UCC_CONFIG_TYPE_TABLE(ucc_tl_context_config_table)},

-    {"DEVICES", "mlx5_0:1",
+    {"DEVICES", "",

Contributor: How will the device be selected with previous SHARP releases?

Collaborator (author): It was the libsharp user's responsibility to specify the device; there was no default device selection.

Contributor: Right, that is my question: if one builds UCC 1.2 with SHARP 3.0, how will it work without a device being provided?

Collaborator (author): The default device selection was fixed in libsharp in the last HPCX release. It will fail with anything older than that, which should be OK.
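
The effect of the empty `DEVICES` default can be sketched as follows. The helper name is hypothetical, and how UCC actually hands the device list to libsharp is not shown in this thread, so treat this as an illustration of the idea only: an empty setting means "no device specified", which leaves the choice to libsharp.

```c
#include <stddef.h>

/* Returns the configured device list, or NULL when none was configured.
 * NULL here stands for "let the SHARP library pick a default device",
 * which per the discussion only works with a recent enough libsharp. */
static const char *toy_sharp_device_or_default(const char *configured)
{
    if (configured == NULL || configured[0] == '\0') {
        return NULL;
    }
    return configured;
}
```

With the old default, this helper would always return "mlx5_0:1" unless the user overrode it. With the new empty default it returns NULL, and an older libsharp without default device selection is expected to fail.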

@bureddy bureddy requested a review from Sergei-Lebedev March 9, 2023 15:54
@bureddy bureddy merged commit b837e87 into openucx:master Mar 15, 2023
@bureddy bureddy deleted the sharp-oob branch March 15, 2023 19:20