Add trace hang function #59217
Conversation
LGTM
int* gpu_global_ranks = nullptr;
size_t gpu_global_ranks_size = num_ranks * sizeof(int);
CUDA_CHECK(cudaMalloc(&gpu_global_ranks, gpu_global_ranks_size));
Why allocate device memory with a raw CUDA API call instead of going through the framework's memory allocator?
"Why allocate device memory with a raw CUDA API call instead of going through the framework's memory allocator?"
No tensors, streams, or device contexts are needed here, so there is no reason to go through the more complex wrapper.
* fix trace hang
* fix compile error
* fix code style
* tinyfix
* tiny update
* fix code style

Co-authored-by: ForFishes <1422485404@qq.com>
* fix nccl_async_trace destruct problem when train finished
* update
* format code style
Force-pushed from 316ad3a to 8ab79e9
LGTM
Force-pushed from adee3bb to f6a4e0d
LGTM
PR types: Others
PR changes: Others
Description: Pcard-70448