UCC returns an error when modifying non-existing TLs in context config #656

Closed
almogsegal opened this issue Oct 19, 2022 · 5 comments

Comments

almogsegal commented Oct 19, 2022

UCC returns an error when modifying a TL that does not exist in the context config. I understand the reasoning for failing in such a case, but there is no way to predict it. I'd expect one of the following behaviors:

  1. UCC doesn't fail if the TL doesn't exist in the context.
  2. UCC allows querying the context for existing TLs, so the user can modify them without failing. For example:
UCC_CHECK(ucc_context_config_query(ctx_config, "tl/cuda", "VALUE", &has_cuda));
if (has_cuda)
{
    UCC_CHECK(ucc_context_config_modify(ctx_config, "tl/cuda", "TUNE", "0"));
}
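
The query-then-modify pattern in the snippet above can be sketched without UCC itself. The block below is a self-contained mock, not UCC code: `mock_ctx_config_t`, `mock_config_has_tl`, `mock_config_modify`, and `modify_if_present` are hypothetical names that only mimic the behavior described in this issue (modify fails when the TL is missing). It is meant to show how a query call would let a caller skip absent TLs instead of failing.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-ins for illustration only; NOT UCC types or functions. */
typedef enum { MOCK_OK = 0, MOCK_ERR_NOT_FOUND = -1 } mock_status_t;

typedef struct {
    const char *const *tls;   /* names of the TLs present, e.g. "tl/ucp" */
    size_t             n_tls;
} mock_ctx_config_t;

/* Returns 1 if the component is part of the config, 0 otherwise. */
int mock_config_has_tl(const mock_ctx_config_t *cfg, const char *component)
{
    assert(cfg != NULL && component != NULL);
    for (size_t i = 0; i < cfg->n_tls; i++) {
        if (strcmp(cfg->tls[i], component) == 0) {
            return 1;
        }
    }
    return 0;
}

/* Mimics the reported behavior: modifying a missing TL is an error. */
mock_status_t mock_config_modify(const mock_ctx_config_t *cfg,
                                 const char *component,
                                 const char *name, const char *value)
{
    assert(name != NULL && value != NULL);
    if (!mock_config_has_tl(cfg, component)) {
        return MOCK_ERR_NOT_FOUND;
    }
    /* A real implementation would store name=value here. */
    return MOCK_OK;
}

/* Query first, modify only when present: absent TLs are skipped, not fatal. */
mock_status_t modify_if_present(const mock_ctx_config_t *cfg,
                                const char *component,
                                const char *name, const char *value)
{
    if (!mock_config_has_tl(cfg, component)) {
        return MOCK_OK;
    }
    return mock_config_modify(cfg, component, name, value);
}
```

With a config that holds only "tl/ucp" and "tl/cuda", `mock_config_modify` on "tl/nccl" returns `MOCK_ERR_NOT_FOUND`, while `modify_if_present` on the same component returns `MOCK_OK` and leaves the config untouched.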
@almogsegal
Author

@manjugv please review.

@almogsegal
Author

Below is what my output looks like when I run with 16 processes. It gets harder to find what I need in the output as I run with more processes.

[1667117403.755480] [luna-0062:1724877:0]     ucc_context.c:240  UCC  ERROR required TL nccl is not part of the context
[1667117403.839909] [luna-0062:1724878:0]     ucc_context.c:240  UCC  ERROR required TL nccl is not part of the context
[1667117403.842918] [luna-0063:551408:0]     ucc_context.c:240  UCC  ERROR required TL nccl is not part of the context
[1667117403.857859] [luna-0062:1724865:0]     ucc_context.c:240  UCC  ERROR required TL nccl is not part of the context
[1667117403.862395] [luna-0062:1724874:0]     ucc_context.c:240  UCC  ERROR required TL nccl is not part of the context
[1667117403.869124] [luna-0062:1724863:0]     ucc_context.c:240  UCC  ERROR required TL nccl is not part of the context
[1667117403.876273] [luna-0062:1724870:0]     ucc_context.c:240  UCC  ERROR required TL nccl is not part of the context
[1667117403.878267] [luna-0063:551407:0]     ucc_context.c:240  UCC  ERROR required TL nccl is not part of the context
[1667117403.891668] [luna-0063:551413:0]     ucc_context.c:240  UCC  ERROR required TL nccl is not part of the context
[1667117403.893574] [luna-0063:551415:0]     ucc_context.c:240  UCC  ERROR required TL nccl is not part of the context
[1667117403.900817] [luna-0062:1724852:0]     ucc_context.c:240  UCC  ERROR required TL nccl is not part of the context
[1667117403.902945] [luna-0063:551410:0]     ucc_context.c:240  UCC  ERROR required TL nccl is not part of the context
[1667117403.908191] [luna-0062:1724876:0]     ucc_context.c:240  UCC  ERROR required TL nccl is not part of the context
[1667117403.910438] [luna-0063:551414:0]     ucc_context.c:240  UCC  ERROR required TL nccl is not part of the context
[1667117403.920437] [luna-0063:551411:0]     ucc_context.c:240  UCC  ERROR required TL nccl is not part of the context
[1667117403.929928] [luna-0063:551409:0]     ucc_context.c:240  UCC  ERROR required TL nccl is not part of the context
[1667117404.465478] [luna-0063:551408:0]     ucc_context.c:240  UCC  ERROR required TL nccl is not part of the context
[1667117404.465504] [luna-0063:551409:0]     ucc_context.c:240  UCC  ERROR required TL nccl is not part of the context
[1667117404.465552] [luna-0063:551413:0]     ucc_context.c:240  UCC  ERROR required TL nccl is not part of the context
[1667117404.465626] [luna-0063:551414:0]     ucc_context.c:240  UCC  ERROR required TL nccl is not part of the context
[1667117404.465313] [luna-0062:1724876:0]     ucc_context.c:240  UCC  ERROR required TL nccl is not part of the context
[1667117404.465598] [luna-0063:551411:0]     ucc_context.c:240  UCC  ERROR required TL nccl is not part of the context
[1667117404.465696] [luna-0063:551410:0]     ucc_context.c:240  UCC  ERROR required TL nccl is not part of the context
[1667117404.465348] [luna-0062:1724877:0]     ucc_context.c:240  UCC  ERROR required TL nccl is not part of the context
[1667117404.465342] [luna-0062:1724874:0]     ucc_context.c:240  UCC  ERROR required TL nccl is not part of the context
[1667117404.465360] [luna-0062:1724865:0]     ucc_context.c:240  UCC  ERROR required TL nccl is not part of the context
[1667117404.465739] [luna-0063:551407:0]     ucc_context.c:240  UCC  ERROR required TL nccl is not part of the context
[1667117404.465417] [luna-0062:1724878:0]     ucc_context.c:240  UCC  ERROR required TL nccl is not part of the context
[1667117404.465449] [luna-0062:1724870:0]     ucc_context.c:240  UCC  ERROR required TL nccl is not part of the context
[1667117404.466063] [luna-0063:551415:0]     ucc_context.c:240  UCC  ERROR required TL nccl is not part of the context
[1667117404.465776] [luna-0062:1724863:0]     ucc_context.c:240  UCC  ERROR required TL nccl is not part of the context
[1667117404.465772] [luna-0062:1724852:0]     ucc_context.c:240  UCC  ERROR required TL nccl is not part of the context
[1667117405.608562] [luna-0062:1724852:0]     ucc_context.c:240  UCC  ERROR required TL nccl is not part of the context
[1667117405.609375] [luna-0063:551407:0]     ucc_context.c:240  UCC  ERROR required TL nccl is not part of the context
[1667117405.609399] [luna-0063:551415:0]     ucc_context.c:240  UCC  ERROR required TL nccl is not part of the context
[1667117405.609489] [luna-0063:551413:0]     ucc_context.c:240  UCC  ERROR required TL nccl is not part of the context
[1667117405.609213] [luna-0062:1724870:0]     ucc_context.c:240  UCC  ERROR required TL nccl is not part of the context
[1667117405.609576] [luna-0063:551410:0]     ucc_context.c:240  UCC  ERROR required TL nccl is not part of the context
[1667117405.609332] [luna-0062:1724878:0]     ucc_context.c:240  UCC  ERROR required TL nccl is not part of the context
[1667117405.609749] [luna-0063:551408:0]     ucc_context.c:240  UCC  ERROR required TL nccl is not part of the context
[1667117405.609793] [luna-0063:551409:0]     ucc_context.c:240  UCC  ERROR required TL nccl is not part of the context
[1667117405.609545] [luna-0062:1724863:0]     ucc_context.c:240  UCC  ERROR required TL nccl is not part of the context
[1667117405.609594] [luna-0062:1724876:0]     ucc_context.c:240  UCC  ERROR required TL nccl is not part of the context
[1667117405.609811] [luna-0062:1724877:0]     ucc_context.c:240  UCC  ERROR required TL nccl is not part of the context
[1667117405.610080] [luna-0063:551411:0]     ucc_context.c:240  UCC  ERROR required TL nccl is not part of the context
[1667117405.609831] [luna-0062:1724874:0]     ucc_context.c:240  UCC  ERROR required TL nccl is not part of the context
[1667117405.610269] [luna-0063:551414:0]     ucc_context.c:240  UCC  ERROR required TL nccl is not part of the context
[1667117405.610001] [luna-0062:1724865:0]     ucc_context.c:240  UCC  ERROR required TL nccl is not part of the context
[1667117406.252270] [luna-0062:1724878:0]     ucc_context.c:240  UCC  ERROR required TL nccl is not part of the context
[1667117406.252486] [luna-0062:1724870:0]     ucc_context.c:240  UCC  ERROR required TL nccl is not part of the context
[1667117406.252972] [luna-0063:551407:0]     ucc_context.c:240  UCC  ERROR required TL nccl is not part of the context
[1667117406.252988] [luna-0063:551410:0]     ucc_context.c:240  UCC  ERROR required TL nccl is not part of the context
[1667117406.253180] [luna-0063:551408:0]     ucc_context.c:240  UCC  ERROR required TL nccl is not part of the context
[1667117406.253213] [luna-0063:551414:0]     ucc_context.c:240  UCC  ERROR required TL nccl is not part of the context
[1667117406.252935] [luna-0062:1724877:0]     ucc_context.c:240  UCC  ERROR required TL nccl is not part of the context
[1667117406.252946] [luna-0062:1724865:0]     ucc_context.c:240  UCC  ERROR required TL nccl is not part of the context
[1667117406.275740] [luna-0062:1724876:0]     ucc_context.c:240  UCC  ERROR required TL nccl is not part of the context
[1667117406.276116] [luna-0063:551411:0]     ucc_context.c:240  UCC  ERROR required TL nccl is not part of the context
[1667117406.276354] [luna-0063:551413:0]     ucc_context.c:240  UCC  ERROR required TL nccl is not part of the context
[1667117406.276087] [luna-0062:1724874:0]     ucc_context.c:240  UCC  ERROR required TL nccl is not part of the context
[1667117406.278249] [luna-0063:551409:0]     ucc_context.c:240  UCC  ERROR required TL nccl is not part of the context
[1667117406.277894] [luna-0062:1724852:0]     ucc_context.c:240  UCC  ERROR required TL nccl is not part of the context
[1667117406.278279] [luna-0063:551415:0]     ucc_context.c:240  UCC  ERROR required TL nccl is not part of the context
[1667117406.278038] [luna-0062:1724863:0]     ucc_context.c:240  UCC  ERROR required TL nccl is not part of the context
[1667117406.835123] [luna-0063:551414:0]     ucc_context.c:240  UCC  ERROR required TL nccl is not part of the context
[1667117406.835124] [luna-0063:551410:0]     ucc_context.c:240  UCC  ERROR required TL nccl is not part of the context
[1667117406.835127] [luna-0063:551411:0]     ucc_context.c:240  UCC  ERROR required TL nccl is not part of the context
[1667117406.835126] [luna-0063:551409:0]     ucc_context.c:240  UCC  ERROR required TL nccl is not part of the context
[1667117406.836288] [luna-0063:551408:0]     ucc_context.c:240  UCC  ERROR required TL nccl is not part of the context
[1667117406.836295] [luna-0063:551407:0]     ucc_context.c:240  UCC  ERROR required TL nccl is not part of the context
[1667117406.836293] [luna-0063:551415:0]     ucc_context.c:240  UCC  ERROR required TL nccl is not part of the context
[1667117406.836305] [luna-0063:551413:0]     ucc_context.c:240  UCC  ERROR required TL nccl is not part of the context
[1667117406.853934] [luna-0062:1724865:0]     ucc_context.c:240  UCC  ERROR required TL nccl is not part of the context
[1667117406.853936] [luna-0062:1724870:0]     ucc_context.c:240  UCC  ERROR required TL nccl is not part of the context
[1667117406.853941] [luna-0062:1724876:0]     ucc_context.c:240  UCC  ERROR required TL nccl is not part of the context
[1667117406.853936] [luna-0062:1724852:0]     ucc_context.c:240  UCC  ERROR required TL nccl is not part of the context
[1667117406.857540] [luna-0062:1724877:0]     ucc_context.c:240  UCC  ERROR required TL nccl is not part of the context
[1667117406.857543] [luna-0062:1724878:0]     ucc_context.c:240  UCC  ERROR required TL nccl is not part of the context
[1667117406.857547] [luna-0062:1724874:0]     ucc_context.c:240  UCC  ERROR required TL nccl is not part of the context
[1667117406.857540] [luna-0062:1724863:0]     ucc_context.c:240  UCC  ERROR required TL nccl is not part of the context
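
One way to cope with output like the above on the application side is to collapse repeats before reading it. The following is a minimal sketch, assuming every line follows the `[timestamp] [host:pid:tid] file:line UCC LEVEL message` layout seen in this log; `log_message_part` and `count_unique_messages` are helper names made up for this example and are not part of UCC.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Returns the part of a log line after the "[timestamp] [host:pid:tid]"
 * prefix, with leading spaces skipped. Lines that do not match the
 * assumed layout are returned unchanged. */
const char *log_message_part(const char *line)
{
    assert(line != NULL);
    if (line[0] != '[') {
        return line;
    }
    const char *p = strchr(line, ']');   /* end of "[timestamp]" */
    if (p == NULL) {
        return line;
    }
    p = strchr(p + 1, '[');              /* start of "[host:pid:tid]" */
    if (p == NULL) {
        return line;
    }
    p = strchr(p, ']');                  /* end of "[host:pid:tid]" */
    if (p == NULL) {
        return line;
    }
    p++;
    while (*p == ' ') {
        p++;
    }
    return p;
}

/* Counts distinct messages among n lines, ignoring the per-process prefix.
 * Uses O(n^2) string comparisons, which is acceptable for short logs. */
size_t count_unique_messages(const char *const *lines, size_t n)
{
    size_t unique = 0;
    for (size_t i = 0; i < n; i++) {
        const char *msg = log_message_part(lines[i]);
        size_t j;
        for (j = 0; j < i; j++) {
            if (strcmp(msg, log_message_part(lines[j])) == 0) {
                break;
            }
        }
        if (j == i) {
            unique++;   /* no earlier line carried this message */
        }
    }
    return unique;
}
```

Applied to the 80 lines above, every line reduces to the same message once the timestamp and process prefix are stripped, so the whole block collapses to a single distinct error.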

@manjugv
Contributor

manjugv commented Nov 10, 2022

@almogsegal #667 does this work for you?

@almogsegal
Author

almogsegal commented Nov 10, 2022

@manjugv it definitely does. Thank you!
For the long term, I think it would be nice to be able to query the context as I suggested, so that libraries and other users can give performance hints to their own users. E.g.:

UCC was not compiled with NCCL support. To achieve better performance, consider recompiling with NCCL.

@Sergei-Lebedev
Contributor

@almogsegal fyi just merged #667

@manjugv manjugv closed this as completed Feb 1, 2023