introduce USDT for the perf tracing #1532
Comments
Hi, the networking aspect of NCCL is far from being our only focus. NCCL bridges NVLink technologies and networking technologies through CUDA kernels running on the GPU, and the profiler is there to let people see what happens in the different NCCL layers. It is not a network analysis tool. We're looking into adding a profiler entry point to the network plugins, but it's not there yet. Also, I'm not very familiar with USDT/eBPF, but it doesn't look like it was developed for high-speed RDMA NICs, which don't even go through kernel space to send over the network. It looks like it was designed for TCP/IP traffic. In any case, if you have some insight into how we could improve NCCL to make it compatible with other tools, feel free to let us know. We can consider it when we improve our profiling capabilities.
No, with eBPF the RDMA traffic itself is unaffected. A dormant USDT probe is just a NOP instruction. Once it's activated, the NOP is replaced with INT3, which traps into the kernel to execute the eBPF program. With this we can trace the user-level code, not the RDMA driver.
Oh, but then every trace point would be an interrupt, which would cause a switch to kernel mode. That seems very slow relative to the kind of message rate and latency we have with RDMA operations.
Actually, depending on the CPU type, you may also need to pass |
Yes, that's exactly my concern. I've been looking into user-level BPF runtimes such as bpftime, but they seem hard to integrate.
Hello,
I see that some profiler trace points were added in a recent release, but they are not very practical for online/training systems.
So is there any plan to introduce other mature tools, like USDT/eBPF? The code change and the performance impact are very small. The only concern is that an active probe traps into the kernel via INT3. I'm not sure what the performance impact would be on a high-speed NIC, such as 400Gb. I tested it with my 40Gb NIC and a 3060 GPU and observed no performance degradation.
Thanks,