CORE: improve logging #721

Sergei-Lebedev · 2023-01-30T07:55:49Z

What

Improve logging in different components of UCC

Why ?

Currently there is no unified rules between TLs, CLs, CORE about what to print and how to print.

How ?

Make LOG_LEVE=info less verbose, print only essential info about UCC runtime including
1.1. UCC version
1.2. UCC runtime path (path to the libucc.so)
1.3. global config path if used or n/a otherwise
1.4. team create log with team id and team size
1.5. score map i.e. what collectives available, what memory types available, what TLs/CLs are used
1.6. team destroy log
Added UCC_COLL_TRACE (default is no)
2.1. Leader rank prints collective info on each ucc_collective_init if UCC log level is info
2.2. All ranks from team print collective info on ucc_collective_init if UCC log level is debug or above
2.3. For LOG_LEVEL=info collective info includes, src and dst buffer address, count, datatype, memory type if available; inplace flag if set; root for rooted collectives; operation for reduction collectives; chosen TL/CL; team id. For V collectives (allgatherv, alltoallv, scatterv ...) count is sum of corresponding values in counts buffer.
2.4. For LOG_LEVEL=debug collective info output additionally includes team rank number, context rank number, collective sequence number, pointer to request

All other prints moved from info log level to debug log level.

UCC_LOG_LEVEL=info example

ucc_constructor.c:183  UCC  INFO  version: 1.2.0, loaded from: /home/ucc/build-debug/install/lib/libucc.so.1, cfg file: n/a
ucc_constructor.c:183  UCC  INFO  version: 1.2.0, loaded from: /home/ucc/build-debug/install/lib/libucc.so.1, cfg file: n/a
       ucc_team.c:471  UCC  INFO  ===== COLL_SCORE_MAP (team_id 32768, size 2) =====
ucc_coll_score_map.c:201  UCC  INFO  Allgather:
ucc_coll_score_map.c:201  UCC  INFO     Host: {0..inf}:TL_UCP:10
ucc_coll_score_map.c:201  UCC  INFO     Cuda: {0..inf}:TL_UCP:10
ucc_coll_score_map.c:201  UCC  INFO     CudaManaged: {0..inf}:TL_UCP:10
ucc_coll_score_map.c:201  UCC  INFO  Allgatherv:
ucc_coll_score_map.c:201  UCC  INFO     Host: {0..inf}:TL_UCP:10
ucc_coll_score_map.c:201  UCC  INFO     Cuda: {0..inf}:TL_UCP:10
ucc_coll_score_map.c:201  UCC  INFO     CudaManaged: {0..inf}:TL_UCP:10
ucc_coll_score_map.c:201  UCC  INFO  Allreduce:
ucc_coll_score_map.c:201  UCC  INFO     Host: {0..4095}:TL_UCP:10 {4K..inf}:TL_UCP:10
ucc_coll_score_map.c:201  UCC  INFO     Cuda: {0..4095}:TL_UCP:10 {4K..inf}:TL_UCP:10
ucc_coll_score_map.c:201  UCC  INFO     CudaManaged: {0..4095}:TL_UCP:10 {4K..inf}:TL_UCP:10
ucc_coll_score_map.c:201  UCC  INFO  Alltoall:
ucc_coll_score_map.c:201  UCC  INFO     Host: {0..inf}:TL_UCP:10
ucc_coll_score_map.c:201  UCC  INFO     Cuda: {0..inf}:TL_UCP:10
ucc_coll_score_map.c:201  UCC  INFO     CudaManaged: {0..inf}:TL_UCP:10
ucc_coll_score_map.c:201  UCC  INFO  Alltoallv:
ucc_coll_score_map.c:201  UCC  INFO     Host: {0..inf}:TL_UCP:10
ucc_coll_score_map.c:201  UCC  INFO     Cuda: {0..inf}:TL_UCP:10
ucc_coll_score_map.c:201  UCC  INFO     CudaManaged: {0..inf}:TL_UCP:10
ucc_coll_score_map.c:201  UCC  INFO  Barrier:
ucc_coll_score_map.c:201  UCC  INFO     Host: {0..inf}:TL_UCP:10
ucc_coll_score_map.c:201  UCC  INFO     Cuda: {0..inf}:TL_UCP:10
ucc_coll_score_map.c:201  UCC  INFO     CudaManaged: {0..inf}:TL_UCP:10
ucc_coll_score_map.c:201  UCC  INFO  Bcast:
ucc_coll_score_map.c:201  UCC  INFO     Host: {0..32767}:TL_UCP:10 {32K..inf}:TL_UCP:10
ucc_coll_score_map.c:201  UCC  INFO     Cuda: {0..32767}:TL_UCP:10 {32K..inf}:TL_UCP:10
ucc_coll_score_map.c:201  UCC  INFO     CudaManaged: {0..32767}:TL_UCP:10 {32K..inf}:TL_UCP:10
ucc_coll_score_map.c:201  UCC  INFO  Fanin:
ucc_coll_score_map.c:201  UCC  INFO     Host: {0..inf}:TL_UCP:10
ucc_coll_score_map.c:201  UCC  INFO     Cuda: {0..inf}:TL_UCP:10
ucc_coll_score_map.c:201  UCC  INFO     CudaManaged: {0..inf}:TL_UCP:10
ucc_coll_score_map.c:201  UCC  INFO  Fanout:
ucc_coll_score_map.c:201  UCC  INFO     Host: {0..inf}:TL_UCP:10
ucc_coll_score_map.c:201  UCC  INFO     Cuda: {0..inf}:TL_UCP:10
ucc_coll_score_map.c:201  UCC  INFO     CudaManaged: {0..inf}:TL_UCP:10
ucc_coll_score_map.c:201  UCC  INFO  Gather:
ucc_coll_score_map.c:201  UCC  INFO     Host: {0..inf}:TL_UCP:10
ucc_coll_score_map.c:201  UCC  INFO     Cuda: {0..inf}:TL_UCP:10
ucc_coll_score_map.c:201  UCC  INFO     CudaManaged: {0..inf}:TL_UCP:10
ucc_coll_score_map.c:201  UCC  INFO  Gatherv:
ucc_coll_score_map.c:201  UCC  INFO     Host: {0..inf}:TL_UCP:10
ucc_coll_score_map.c:201  UCC  INFO     Cuda: {0..inf}:TL_UCP:10
ucc_coll_score_map.c:201  UCC  INFO     CudaManaged: {0..inf}:TL_UCP:10
ucc_coll_score_map.c:201  UCC  INFO  Reduce:
ucc_coll_score_map.c:201  UCC  INFO     Host: {0..inf}:TL_UCP:10
ucc_coll_score_map.c:201  UCC  INFO     Cuda: {0..inf}:TL_UCP:10
ucc_coll_score_map.c:201  UCC  INFO     CudaManaged: {0..inf}:TL_UCP:10
ucc_coll_score_map.c:201  UCC  INFO  Reduce_scatter:
ucc_coll_score_map.c:201  UCC  INFO     Host: {0..inf}:TL_UCP:10
ucc_coll_score_map.c:201  UCC  INFO     Cuda: {0..inf}:TL_UCP:10
ucc_coll_score_map.c:201  UCC  INFO     CudaManaged: {0..inf}:TL_UCP:10
ucc_coll_score_map.c:201  UCC  INFO  Reduce_scatterv:
ucc_coll_score_map.c:201  UCC  INFO     Host: {0..inf}:TL_UCP:10
ucc_coll_score_map.c:201  UCC  INFO     Cuda: {0..inf}:TL_UCP:10
ucc_coll_score_map.c:201  UCC  INFO     CudaManaged: {0..inf}:TL_UCP:10
ucc_coll_score_map.c:201  UCC  INFO  Scatterv:
ucc_coll_score_map.c:201  UCC  INFO     Host: {0..inf}:TL_UCP:10
ucc_coll_score_map.c:201  UCC  INFO     Cuda: {0..inf}:TL_UCP:10
ucc_coll_score_map.c:201  UCC  INFO     CudaManaged: {0..inf}:TL_UCP:10
       ucc_team.c:474  UCC  INFO  ================================================
OSU MPI All-to-All Personalized Exchange Latency Test v5.8

Size       Avg Latency(us)

....

ucc_team.c:528  UCC  INFO  team destroyed, team_id 32768

UCC_COLL_TRACE=y example

ucc_coll.c:257  UCC  INFO  coll_init: Barrier; TL_UCP, team_id 32768
ucc_coll.c:257  UCC  INFO  coll_init: Alltoall: src={0x7f7f2d789000, 2097152, int8, Host}, dst={0x7f7f2d587000, 2097152, int8, Host}; TL_UCP, team_id 32768
ucc_coll.c:257  UCC  INFO  coll_init: Barrier; TL_UCP, team_id 32768
ucc_coll.c:257  UCC  INFO  coll_init: Reduce min root 0: src={0x7ffd97e4ebd0, 1, float64, Host}, dst={0x7ffd97e4ebe8, 1, float64, Host}; TL_UCP, team_id 32768
ucc_coll.c:257  UCC  INFO  coll_init: Reduce max root 0: src={0x7ffd97e4ebd0, 1, float64, Host}, dst={0x7ffd97e4ebe0, 1, float64, Host}; TL_UCP, team_id 32768
ucc_coll.c:257  UCC  INFO  coll_init: Reduce sum root 0: src={0x7ffd97e4ebd0, 1, float64, Host}, dst={0x7ffd97e4ebd8, 1, float64, Host}; TL_UCP, team_id 32768

Sergei-Lebedev · 2023-01-30T08:08:11Z

@jirikraus please take a look

Sergei-Lebedev · 2023-01-30T10:24:11Z

bot:retest

shimmybalsam · 2023-01-31T12:30:30Z

General: in the print you show from UCC_LOG_LEVEL=info example there is no CL information, only TL score map. perhaps chosen CL info should be added as well.

shimmybalsam · 2023-01-31T12:38:23Z

src/core/ucc_global_opts.c

@@ -30,6 +30,13 @@ ucc_config_field_t ucc_global_config_table[] = {
     ucc_offsetof(ucc_global_config_t, log_component),
     UCC_CONFIG_TYPE_LOG_COMP},

+    {"COLL_TRACE", "n",
+     "If UCC logging level is INFO or higher UCC will print info message"


a bit confusing since you're explaining what would happen in case of "y" but defining as "n"

why is it confusing? by default coll trace is disabled because it produces a lot of output. If you want to see it just set UCC_COLL_TRACE=y.

In my opinion it would be more clear like this:
"If enabled and UCC logging level is INFO or higher UCC will print info message per collective.\n"

jirikraus · 2023-01-31T13:17:57Z

Thanks @Sergei-Lebedev looking at the example output this looks great. I can't comment on the source change but assume this is not requested. As @shimmybalsam points out I think it makes sense to also cover the choosen CL.

vspetrov

The change is awesome! I have just minor comments.
Why not keeping coll seq_num on the trace with INFO level. the line will be almost same length? It gives a lot of info on how heavily the team is used (if seq_num is large or not), etc.

Looks like we don't have trace for triggered post, lets add.

A couple of next steps we could as a follow up:

Add UCC_TL_COLL_TRACE
basically same tracing but at TL level. This var should of course NOT inherit from UCC_COLL_TRACE. It would be convenient to trace in detail actual algorithms start/finalize at TL level.
With existing tracing, what else do we need to add to logging to potentially "replay" the trace ? This could be a very handy tool for perf analysis

vspetrov · 2023-02-01T07:46:42Z

src/utils/ucc_coll_utils.c

+        }
+        break;
+    case UCC_COLL_TYPE_ALLGATHERV:
+    case UCC_COLL_TYPE_REDUCE_SCATTERV:


maybe move thise entire blob into separate util function:

static inline void ucc_coll_args_get_info(const ucc_coll_args_t *args, ucc_coll_buffer_info_t *s_info, ucc_coll_buffer_info_t *d_info) { ... }

it can set s_info.buffer or r_info.buffer to NULL that would indicate the same as has_src/has_dst flags.

s_info.buffer == NULL is actually valid buffer, for example we did alltoallv with 64bit displacement in pytorch

you are right. still worth moving to separate fn imho, even with 2 more flags. current function is very large. just from readability perspective.

Sergei-Lebedev · 2023-02-01T09:02:21Z

Thanks @Sergei-Lebedev looking at the example output this looks great. I can't comment on the source change but assume this is not requested. As @shimmybalsam points out I think it makes sense to also cover the choosen CL.

It prints CL HIER if it's used and some TL if CL BASIC is used. The logic is that we print top level task info which is TL task for CL BASIC and CL task for CL HIER.

Sergei-Lebedev · 2023-02-01T09:14:26Z

Why not keeping coll seq_num on the trace with INFO level. the line will be almost same length? It gives a lot of info on how heavily the team is used (if seq_num is large or not), etc.

Sure, no problem, will add.

Looks like we don't have trace for triggered post, lets add.

We print this info in coll_init so we will have it for triggered post as well. But maybe move it to collective post and triggered post instead? I'm not sure which option is better.

Add UCC_TL_COLL_TRACE
basically same tracing but at TL level. This var should of course NOT inherit from UCC_COLL_TRACE. It would be convenient to trace in detail actual algorithms start/finalize at TL level.

Agree, but we have much more option where to add in TL. For example in every init, post, frag start ...

With existing tracing, what else do we need to add to logging to potentially "replay" the trace ? This could be a very handy tool for perf analysis

Yes, i wanted to start with some script that can collect different statistic, but replay is also an option.

jirikraus · 2023-02-01T10:00:07Z

Thanks @Sergei-Lebedev looking at the example output this looks great. I can't comment on the source change but assume this is not requested. As @shimmybalsam points out I think it makes sense to also cover the choosen CL.

It prints CL HIER if it's used and some TL if CL BASIC is used. The logic is that we print top level task info which is TL task for CL BASIC and CL task for CL HIER.

I think that would address my point. However I am lacking background to figure how this might look like in practice. Can you provide an example how that would look? The example you provide above only cover TL_UCP not the CL used.

Sergei-Lebedev · 2023-02-01T11:25:35Z

Thanks @Sergei-Lebedev looking at the example output this looks great. I can't comment on the source change but assume this is not requested. As @shimmybalsam points out I think it makes sense to also cover the choosen CL.

It prints CL HIER if it's used and some TL if CL BASIC is used. The logic is that we print top level task info which is TL task for CL BASIC and CL task for CL HIER.

I think that would address my point. However I am lacking background to figure how this might look like in practice. Can you provide an example how that would look? The example you provide above only cover TL_UCP not the CL used.

sure, it's similar:

ucc_coll.c:257  UCC  INFO  coll_init: Barrier; CL_HIER, team_id 1
ucc_coll.c:257  UCC  INFO  coll_init: Allreduce min: src={0x7ffd443b8238, 1, float64, Host}, dst={0x7ffd443b8228, 1, float64, Host}; CL_HIER, team_id 1
ucc_coll.c:257  UCC  INFO  coll_init: Allreduce max: src={0x7ffd443b8238, 1, float64, Host}, dst={0x7ffd443b8220, 1, float64, Host}; CL_HIER, team_id 1
ucc_coll.c:257  UCC  INFO  coll_init: Allreduce sum: src={0x7ffd443b8238, 1, float64, Host}, dst={0x7ffd443b8230, 1, float64, Host}; CL_HIER, team_id 1

shimmybalsam · 2023-02-01T11:53:01Z

Thanks @Sergei-Lebedev looking at the example output this looks great. I can't comment on the source change but assume this is not requested. As @shimmybalsam points out I think it makes sense to also cover the choosen CL.

It prints CL HIER if it's used and some TL if CL BASIC is used. The logic is that we print top level task info which is TL task for CL BASIC and CL task for CL HIER.

I think that would address my point. However I am lacking background to figure how this might look like in practice. Can you provide an example how that would look? The example you provide above only cover TL_UCP not the CL used.

sure, it's similar:
ucc_coll.c:257  UCC  INFO  coll_init: Barrier; CL_HIER, team_id 1
ucc_coll.c:257  UCC  INFO  coll_init: Allreduce min: src={0x7ffd443b8238, 1, float64, Host}, dst={0x7ffd443b8228, 1, float64, Host}; CL_HIER, team_id 1
ucc_coll.c:257  UCC  INFO  coll_init: Allreduce max: src={0x7ffd443b8238, 1, float64, Host}, dst={0x7ffd443b8220, 1, float64, Host}; CL_HIER, team_id 1
ucc_coll.c:257  UCC  INFO  coll_init: Allreduce sum: src={0x7ffd443b8238, 1, float64, Host}, dst={0x7ffd443b8230, 1, float64, Host}; CL_HIER, team_id 1

Looks good but I still think that also for CL BASIC we should output that we are using CL_BASIC and then which TLs

jirikraus · 2023-02-01T11:57:35Z

Thanks @Sergei-Lebedev looking at the example output this looks great. I can't comment on the source change but assume this is not requested. As @shimmybalsam points out I think it makes sense to also cover the choosen CL.

It prints CL HIER if it's used and some TL if CL BASIC is used. The logic is that we print top level task info which is TL task for CL BASIC and CL task for CL HIER.

I think that would address my point. However I am lacking background to figure how this might look like in practice. Can you provide an example how that would look? The example you provide above only cover TL_UCP not the CL used.

sure, it's similar:
ucc_coll.c:257  UCC  INFO  coll_init: Barrier; CL_HIER, team_id 1
ucc_coll.c:257  UCC  INFO  coll_init: Allreduce min: src={0x7ffd443b8238, 1, float64, Host}, dst={0x7ffd443b8228, 1, float64, Host}; CL_HIER, team_id 1
ucc_coll.c:257  UCC  INFO  coll_init: Allreduce max: src={0x7ffd443b8238, 1, float64, Host}, dst={0x7ffd443b8220, 1, float64, Host}; CL_HIER, team_id 1
ucc_coll.c:257  UCC  INFO  coll_init: Allreduce sum: src={0x7ffd443b8238, 1, float64, Host}, dst={0x7ffd443b8230, 1, float64, Host}; CL_HIER, team_id 1
Looks good but I still think that also for CL BASIC we should output that we are using CL_BASIC and then which TLs

And as @vspetrov suggested add UCC_TL_COLL_TRACE to provide additional details. To make sure I understand this right when CL hier is used we won't see the TLs it leverages? Could we show that without getting too verbose, E.g.

ucc_coll.c:257  UCC  INFO  coll_init: Barrier; CL_HIER, team_id 1
ucc_coll.c:257  UCC  INFO  coll_init: Allreduce min: src={0x7ffd443b8238, 1, float64, Host}, dst={0x7ffd443b8228, 1, float64, Host}; CL_HIER ( comma seperated list of TLs levearge), team_id 1
ucc_coll.c:257  UCC  INFO  coll_init: Allreduce max: src={0x7ffd443b8238, 1, float64, Host}, dst={0x7ffd443b8220, 1, float64, Host}; CL_HIER ( comma seperated list of TLs levearge), team_id 1
ucc_coll.c:257  UCC  INFO  coll_init: Allreduce sum: src={0x7ffd443b8238, 1, float64, Host}, dst={0x7ffd443b8230, 1, float64, Host}; CL_HIER ( comma seperated list of TLs levearge, e.g. (TL_UCP, TL_SHARP)) , team_id 1

and something similar for CL Basic?

vspetrov · 2023-02-01T17:17:49Z

We print this info in coll_init so we will have it for triggered post as well. But maybe move it to collective post and triggered post instead? I'm not sure which option is better.

right, i'm starting to forget stuff... i think, we should keep as is - print detailed info on collective_init and on post/finalize we have a verbose messages that will be printed under DEBUG. BTW, in post/finalize we print pointer of the request which makes it easy to match initialized request with the posted/finalized. I saw you no longer print the request pointer in coll_init. Maybe lets add it under DEBUG.

Sergei-Lebedev · 2023-02-01T18:00:06Z

@jirikraus @vspetrov @shimmybalsam
I slightly changed the log for collective trace. Now UCC_COLL_TRACE is enum instead of binary value, which allows to finer control over verbosity level. Coll trace doesn't depend on main LOG_LEVEL anymore, examples

UCC_COLL_TRACE=diag: leader ranks prints info from coll_args structure (coll type, buffer pointer, size, datatype, memory type)

ucc_coll.c:255  UCC_COLL DIAG  coll_init: Barrier
ucc_coll.c:255  UCC_COLL DIAG  coll_init: Alltoall: src={0x8ae9310, 1024, float32, Host}, dst={0x88772d0, 1024, float32, Host}
ucc_coll.c:255  UCC_COLL DIAG  coll_init: Barrier
ucc_coll.c:255  UCC_COLL DIAG  coll_init: Alltoall: src={0x8ae9310, 1024, float32, Host}, dst={0x88772d0, 1024, float32, Host}
ucc_coll.c:255  UCC_COLL DIAG  coll_init: Barrier
ucc_coll.c:255  UCC_COLL DIAG  coll_init: Allreduce min: src={0x7ffe38c79368, 1, float64, Host}, dst={0x7ffe38c79358, 1, float64, Host}
ucc_coll.c:255  UCC_COLL DIAG  coll_init: Allreduce max: src={0x7ffe38c79368, 1, float64, Host}, dst={0x7ffe38c79350, 1, float64, Host}
ucc_coll.c:255  UCC_COLL DIAG  coll_init: Allreduce sum: src={0x7ffe38c79368, 1, float64, Host}, dst={0x7ffe38c79360, 1, float64, Host}

UCC_COLL_TRACE=info: same as log level diag but adds team_id and CL/TL being used

ucc_coll.c:255  UCC_COLL INFO  coll_init: Barrier; CL_HIER {TL_UCP, TL_SHARP}, team_id 1
ucc_coll.c:255  UCC_COLL INFO  coll_init: Alltoall: src={0x776d2d0, 1024, float32, Host}, dst={0x75012d0, 1024, float32, Host}; CL_BASIC {TL_UCP}, team_id 1
ucc_coll.c:255  UCC_COLL INFO  coll_init: Barrier; CL_HIER {TL_UCP, TL_SHARP}, team_id 1
ucc_coll.c:255  UCC_COLL INFO  coll_init: Alltoall: src={0x776d2d0, 1024, float32, Host}, dst={0x75012d0, 1024, float32, Host}; CL_BASIC {TL_UCP}, team_id 1
ucc_coll.c:255  UCC_COLL INFO  coll_init: Barrier; CL_HIER {TL_UCP, TL_SHARP}, team_id 1
ucc_coll.c:255  UCC_COLL INFO  coll_init: Allreduce min: src={0x7ffd253387f8, 1, float64, Host}, dst={0x7ffd253387e8, 1, float64, Host}; CL_HIER {TL_UCP, TL_SHARP}, team_id 1
ucc_coll.c:255  UCC_COLL INFO  coll_init: Allreduce max: src={0x7ffd253387f8, 1, float64, Host}, dst={0x7ffd253387e0, 1, float64, Host}; CL_HIER {TL_UCP, TL_SHARP}, team_id 1
ucc_coll.c:255  UCC_COLL INFO  coll_init: Allreduce sum: src={0x7ffd253387f8, 1, float64, Host}, dst={0x7ffd253387f0, 1, float64, Host}; CL_HIER {TL_UCP, TL_SHARP}, team_id 1

UCC_COLL_TRACE=debug: same as log level info but adds rank, ctx_rank, sequence number and reuqest pointer

ucc_coll.c:255  UCC_COLL DEBUG coll_init: Barrier; CL_HIER {TL_UCP, TL_SHARP}, team_id 1 rank 0, ctx_rank 0, seq_num 0, req 0x8a34440
ucc_coll.c:255  UCC_COLL DEBUG coll_init: Alltoall: src={0x8e632d0, 1024, float32, Host}, dst={0x8bf72d0, 1024, float32, Host}; CL_BASIC {TL_UCP}, team_id 1 rank 0, ctx_rank 0, seq_num 1, req 0x8bf6e00
ucc_coll.c:255  UCC_COLL DEBUG coll_init: Barrier; CL_HIER {TL_UCP, TL_SHARP}, team_id 1 rank 0, ctx_rank 0, seq_num 2, req 0x8a34440
ucc_coll.c:255  UCC_COLL DEBUG coll_init: Alltoall: src={0x8e632d0, 1024, float32, Host}, dst={0x8bf72d0, 1024, float32, Host}; CL_BASIC {TL_UCP}, team_id 1 rank 0, ctx_rank 0, seq_num 3, req 0x8bf7040
ucc_coll.c:255  UCC_COLL DEBUG coll_init: Barrier; CL_HIER {TL_UCP, TL_SHARP}, team_id 1 rank 0, ctx_rank 0, seq_num 4, req 0x8a34440
ucc_coll.c:255  UCC_COLL DEBUG coll_init: Allreduce min: src={0x7ffe3c4af258, 1, float64, Host}, dst={0x7ffe3c4af248, 1, float64, Host}; CL_HIER {TL_UCP, TL_SHARP}, team_id 1 rank 0, ctx_rank 0, seq_num 5, req 0x8a34440
ucc_coll.c:255  UCC_COLL DEBUG coll_init: Allreduce max: src={0x7ffe3c4af258, 1, float64, Host}, dst={0x7ffe3c4af240, 1, float64, Host}; CL_HIER {TL_UCP, TL_SHARP}, team_id 1 rank 0, ctx_rank 0, seq_num 6, req 0x8a34440
ucc_coll.c:255  UCC_COLL DEBUG coll_init: Allreduce sum: src={0x7ffe3c4af258, 1, float64, Host}, dst={0x7ffe3c4af250, 1, float64, Host}; CL_HIER {TL_UCP, TL_SHARP}, team_id 1 rank 0, ctx_rank 0, seq_num 7, req 0x8a34440

UCC_COLL_TRACE=trace is identical to UCC_COLL_TRACE=debug but all ranks print collective info
Note that list of TLs inside CL HIER depends on what rank prints it. If rank belongs to multiple topology subgroups it will use multiple TLs, but if rank is only part of one subgroup it will print single TL

jirikraus · 2023-02-02T09:12:19Z

@jirikraus @vspetrov @shimmybalsam I slightly changed the log for collective trace. Now UCC_COLL_TRACE is enum instead of binary value, which allows to finer control over verbosity level. Coll trace doesn't depend on main LOG_LEVEL anymore, examples

Thanks this looks great. @Sergei-Lebedev when is this expected to land in a UCC release? I wonder how we should track adding a description of this to the user_guide.md discussed in PR #720 ?

Sergei-Lebedev · 2023-02-02T10:53:26Z

@jirikraus @vspetrov @shimmybalsam I slightly changed the log for collective trace. Now UCC_COLL_TRACE is enum instead of binary value, which allows to finer control over verbosity level. Coll trace doesn't depend on main LOG_LEVEL anymore, examples

Thanks this looks great. @Sergei-Lebedev when is this expected to land in a UCC release? I wonder how we should track adding a description of this to the user_guide.md discussed in PR #720 ?

If everybody agrees on logging format, then this PR will be merged pretty soon. After it's merged you can update User Guide accordingly and I don't think it depends on release date. But answering your question we are going to release new UCC version in April.

Sergei-Lebedev · 2023-02-17T12:22:04Z

@vspetrov @bureddy kind remainder, pls approve if you are ok with the changes

* CORE: improve logging * CORE: add coll_trace loggger * CORE: add post and finalize to debug log

Sergei-Lebedev added the Ready-for-Review label Jan 30, 2023

Sergei-Lebedev requested review from bureddy, vspetrov, samnordmann, manjugv and shimmybalsam January 30, 2023 08:07

shimmybalsam approved these changes Jan 31, 2023

View reviewed changes

shimmybalsam suggested changes Jan 31, 2023

View reviewed changes

samnordmann approved these changes Jan 31, 2023

View reviewed changes

vspetrov reviewed Feb 1, 2023

View reviewed changes

shimmybalsam approved these changes Feb 2, 2023

View reviewed changes

Sergei-Lebedev force-pushed the topic/better_log branch from a2a73cf to 4c90c34 Compare February 2, 2023 10:48

vspetrov approved these changes Mar 1, 2023

View reviewed changes

manjugv approved these changes Mar 5, 2023

View reviewed changes

CORE: improve logging

c7c4375

Sergei-Lebedev added 2 commits March 5, 2023 13:24

CORE: add coll_trace loggger

614c791

CORE: add post and finalize to debug log

ef8e5f8

Sergei-Lebedev force-pushed the topic/better_log branch from 4c90c34 to ef8e5f8 Compare March 5, 2023 11:02

Sergei-Lebedev merged commit 751c80c into openucx:master Mar 6, 2023

Sergei-Lebedev deleted the topic/better_log branch March 6, 2023 06:03

janjust pushed a commit to janjust/ucc that referenced this pull request Jan 31, 2024

CORE: improve logging (openucx#721)

f816b76

* CORE: improve logging * CORE: add coll_trace loggger * CORE: add post and finalize to debug log

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

CORE: improve logging #721

CORE: improve logging #721

Sergei-Lebedev commented Jan 30, 2023

Sergei-Lebedev commented Jan 30, 2023

Sergei-Lebedev commented Jan 30, 2023

shimmybalsam commented Jan 31, 2023

shimmybalsam Jan 31, 2023

Sergei-Lebedev Feb 1, 2023

shimmybalsam Feb 1, 2023

jirikraus commented Jan 31, 2023

vspetrov left a comment

vspetrov Feb 1, 2023

Sergei-Lebedev Feb 1, 2023

vspetrov Feb 1, 2023

Sergei-Lebedev commented Feb 1, 2023

Sergei-Lebedev commented Feb 1, 2023

jirikraus commented Feb 1, 2023

Sergei-Lebedev commented Feb 1, 2023

shimmybalsam commented Feb 1, 2023

jirikraus commented Feb 1, 2023

vspetrov commented Feb 1, 2023

Sergei-Lebedev commented Feb 1, 2023

jirikraus commented Feb 2, 2023

Sergei-Lebedev commented Feb 2, 2023

Sergei-Lebedev commented Feb 17, 2023

CORE: improve logging #721

CORE: improve logging #721

Conversation

Sergei-Lebedev commented Jan 30, 2023

What

Why ?

How ?

Sergei-Lebedev commented Jan 30, 2023

Sergei-Lebedev commented Jan 30, 2023

shimmybalsam commented Jan 31, 2023

shimmybalsam Jan 31, 2023

Choose a reason for hiding this comment

Sergei-Lebedev Feb 1, 2023

Choose a reason for hiding this comment

shimmybalsam Feb 1, 2023

Choose a reason for hiding this comment

jirikraus commented Jan 31, 2023

vspetrov left a comment

Choose a reason for hiding this comment

vspetrov Feb 1, 2023

Choose a reason for hiding this comment

Sergei-Lebedev Feb 1, 2023

Choose a reason for hiding this comment

vspetrov Feb 1, 2023

Choose a reason for hiding this comment

Sergei-Lebedev commented Feb 1, 2023

Sergei-Lebedev commented Feb 1, 2023

jirikraus commented Feb 1, 2023

Sergei-Lebedev commented Feb 1, 2023

shimmybalsam commented Feb 1, 2023

jirikraus commented Feb 1, 2023

vspetrov commented Feb 1, 2023

Sergei-Lebedev commented Feb 1, 2023

jirikraus commented Feb 2, 2023

Sergei-Lebedev commented Feb 2, 2023

Sergei-Lebedev commented Feb 17, 2023