Add trace hang function #59217
Conversation
LGTM
int* gpu_global_ranks = nullptr;
size_t gpu_global_ranks_size = num_ranks * sizeof(int);
CUDA_CHECK(cudaMalloc(&gpu_global_ranks, gpu_global_ranks_size));
Why allocate device memory with a raw CUDA API call instead of going through the framework's memory allocator?
"Why allocate device memory with a raw CUDA API call instead of going through the framework's memory allocator?"
No tensors, streams, or device contexts are needed here, so there is no reason to go through the more complex wrapper.
* fix trace hang
* fix compile error
* fix code style
* tinyfix
* tiny update
* fix code style

Co-authored-by: ForFishes <1422485404@qq.com>
* fix nccl_async_trace destruct problem when train finished
* update
* format code style
Force-pushed from 316ad3a to 8ab79e9
LGTM
Force-pushed from adee3bb to f6a4e0d
LGTM
PR types: Others
PR changes: Others
Description: Pcard-70448