-
Notifications
You must be signed in to change notification settings - Fork 102
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
CORE: improve logging #721
CORE: improve logging #721
Conversation
@jirikraus please take a look |
bot:retest |
General: in the print you show from UCC_LOG_LEVEL=info example there is no CL information, only TL score map. perhaps chosen CL info should be added as well. |
src/core/ucc_global_opts.c
Outdated
@@ -30,6 +30,13 @@ ucc_config_field_t ucc_global_config_table[] = { | |||
ucc_offsetof(ucc_global_config_t, log_component), | |||
UCC_CONFIG_TYPE_LOG_COMP}, | |||
|
|||
{"COLL_TRACE", "n", | |||
"If UCC logging level is INFO or higher UCC will print info message" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
a bit confusing since you're explaining what would happen in case of "y" but defining as "n"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why is it confusing? by default coll trace is disabled because it produces a lot of output. If you want to see it just set UCC_COLL_TRACE=y.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In my opinion it would be more clear like this:
"If enabled and UCC logging level is INFO or higher UCC will print info message per collective.\n"
Thanks @Sergei-Lebedev looking at the example output this looks great. I can't comment on the source change but assume this is not requested. As @shimmybalsam points out I think it makes sense to also cover the choosen CL. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The change is awesome! I have just minor comments.
Why not keeping coll seq_num on the trace with INFO level. the line will be almost same length? It gives a lot of info on how heavily the team is used (if seq_num is large or not), etc.
Looks like we don't have trace for triggered post, lets add.
A couple of next steps we could as a follow up:
-
Add UCC_TL_COLL_TRACE
basically same tracing but at TL level. This var should of course NOT inherit from UCC_COLL_TRACE. It would be convenient to trace in detail actual algorithms start/finalize at TL level. -
With existing tracing, what else do we need to add to logging to potentially "replay" the trace ? This could be a very handy tool for perf analysis
} | ||
break; | ||
case UCC_COLL_TYPE_ALLGATHERV: | ||
case UCC_COLL_TYPE_REDUCE_SCATTERV: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe move thise entire blob into separate util function:
static inline void ucc_coll_args_get_info(const ucc_coll_args_t *args, ucc_coll_buffer_info_t *s_info,
ucc_coll_buffer_info_t *d_info) {
...
}
it can set s_info.buffer or r_info.buffer to NULL that would indicate the same as has_src/has_dst flags.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
s_info.buffer == NULL is actually valid buffer, for example we did alltoallv with 64bit displacement in pytorch
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you are right. still worth moving to separate fn imho, even with 2 more flags. current function is very large. just from readability perspective.
It prints CL HIER if it's used and some TL if CL BASIC is used. The logic is that we print top level task info which is TL task for CL BASIC and CL task for CL HIER. |
Sure, no problem, will add.
We print this info in coll_init so we will have it for triggered post as well. But maybe move it to collective post and triggered post instead? I'm not sure which option is better.
Agree, but we have much more option where to add in TL. For example in every init, post, frag start ...
Yes, i wanted to start with some script that can collect different statistic, but replay is also an option. |
I think that would address my point. However I am lacking background to figure how this might look like in practice. Can you provide an example how that would look? The example you provide above only cover TL_UCP not the CL used. |
sure, it's similar:
|
Looks good but I still think that also for CL BASIC we should output that we are using CL_BASIC and then which TLs |
And as @vspetrov suggested add UCC_TL_COLL_TRACE to provide additional details. To make sure I understand this right when CL hier is used we won't see the TLs it leverages? Could we show that without getting too verbose, E.g.
and something similar for CL Basic? |
right, i'm starting to forget stuff... i think, we should keep as is - print detailed info on collective_init and on post/finalize we have a verbose messages that will be printed under DEBUG. BTW, in post/finalize we print pointer of the request which makes it easy to match initialized request with the posted/finalized. I saw you no longer print the request pointer in coll_init. Maybe lets add it under DEBUG. |
@jirikraus @vspetrov @shimmybalsam UCC_COLL_TRACE=diag: leader ranks prints info from coll_args structure (coll type, buffer pointer, size, datatype, memory type)
UCC_COLL_TRACE=info: same as log level diag but adds team_id and CL/TL being used
UCC_COLL_TRACE=debug: same as log level info but adds rank, ctx_rank, sequence number and reuqest pointer
UCC_COLL_TRACE=trace is identical to UCC_COLL_TRACE=debug but all ranks print collective info |
Thanks this looks great. @Sergei-Lebedev when is this expected to land in a UCC release? I wonder how we should track adding a description of this to the user_guide.md discussed in PR #720 ? |
a2a73cf
to
4c90c34
Compare
If everybody agrees on logging format, then this PR will be merged pretty soon. After it's merged you can update User Guide accordingly and I don't think it depends on release date. But answering your question we are going to release new UCC version in April. |
4c90c34
to
ef8e5f8
Compare
* CORE: improve logging * CORE: add coll_trace loggger * CORE: add post and finalize to debug log
What
Improve logging in different components of UCC
Why ?
Currently there is no unified rules between TLs, CLs, CORE about what to print and how to print.
How ?
1.1. UCC version
1.2. UCC runtime path (path to the libucc.so)
1.3. global config path if used or n/a otherwise
1.4. team create log with team id and team size
1.5. score map i.e. what collectives available, what memory types available, what TLs/CLs are used
1.6. team destroy log
2.1. Leader rank prints collective info on each ucc_collective_init if UCC log level is info
2.2. All ranks from team print collective info on ucc_collective_init if UCC log level is debug or above
2.3. For LOG_LEVEL=info collective info includes, src and dst buffer address, count, datatype, memory type if available; inplace flag if set; root for rooted collectives; operation for reduction collectives; chosen TL/CL; team id. For V collectives (allgatherv, alltoallv, scatterv ...) count is sum of corresponding values in counts buffer.
2.4. For LOG_LEVEL=debug collective info output additionally includes team rank number, context rank number, collective sequence number, pointer to request
All other prints moved from info log level to debug log level.
UCC_LOG_LEVEL=info example
UCC_COLL_TRACE=y example