introduce USDT for the perf tracing #1532

Open
gangxie112 opened this issue Dec 6, 2024 · 5 comments

Comments

@gangxie112

Hello,

I see that some profiler tracing points were added in a recent release, but they are not very practical for online/training systems.
Is there any plan to support other, more mature tools, like USDT/eBPF? The code change and the performance impact are both very small. My only concern is that an active probe traps into the kernel through INT3. I'm not sure what the performance impact would be with an EDR NIC, like 400Gb. I tested with my 40Gb NIC and a 3060 GPU and observed no performance degradation.

Thanks,

@sjeaugey
Member

sjeaugey commented Dec 6, 2024

Hi,

The networking aspect in NCCL is far from being our only focus. NCCL bridges together NVLink technologies and networking technologies through the use of CUDA kernels running on the GPU, and the profiler is here to allow people to see what happens in the different NCCL layers. But it's not a network analysis tool. We're looking into adding profiler entry point in the network plugins, but it's not there yet.

Also, I'm not very familiar with USDT/eBPF but it doesn't look like it was developed for high speed RDMA NICs which don't even go through kernel space to send over the network. Looks like it was done for TCP/IP traffic.

In any case, if you have some insight as to how we could improve NCCL to make it compatible with some other tools, feel free to let us know. We can consider it when we improve our profiling capabilities.

@gangxie112
Author

> Hi,
>
> The networking aspect in NCCL is far from being our only focus. NCCL bridges together NVLink technologies and networking technologies through the use of CUDA kernels running on the GPU, and the profiler is here to allow people to see what happens in the different NCCL layers. But it's not a network analysis tool. We're looking into adding profiler entry point in the network plugins, but it's not there yet.
>
> Also, I'm not very familiar with USDT/eBPF but it doesn't look like it was developed for high speed RDMA NICs which don't even go through kernel space to send over the network. Looks like it was done for TCP/IP traffic.
>
> In any case, if you have some insight as to how we could improve NCCL to make it compatible with some other tools, feel free to let us know. We can consider it when we improve our profiling capabilities.

No, with eBPF the RDMA traffic is not affected. A USDT probe is just a NOP instruction until it is activated. Once it is activated, the NOP is replaced with INT3, which traps into the kernel to execute the eBPF program. With this, we can trace the user-level code, not the RDMA driver.
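
To make the mechanism described above concrete, here is a minimal sketch of what a USDT probe site could look like in C. It assumes the SystemTap `sys/sdt.h` header is available and falls back to a no-op macro when it is not. The provider name `nccl_demo`, the probe names, and the `demo_proxy_op` struct are all hypothetical and are not taken from the NCCL sources.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Use the real USDT macros when the SystemTap SDT header is installed;
 * otherwise compile the probes away so the sketch still builds. */
#if defined(__has_include)
#  if __has_include(<sys/sdt.h>)
#    include <sys/sdt.h>
#    define DEMO_HAVE_SDT 1
#  endif
#endif
#ifndef DEMO_HAVE_SDT
#  define DTRACE_PROBE2(provider, name, a, b) ((void)(a), (void)(b))
#endif

/* Hypothetical stand-in for a control-path operation; not an NCCL type. */
struct demo_proxy_op {
    uint64_t op_id;
    int      n_steps;
};

/* Runs one hypothetical control-path operation and returns the number of
 * steps executed. Each DTRACE_PROBE2 site is a single NOP in the
 * instruction stream until a tracer attaches to it. */
int demo_proxy_progress(const struct demo_proxy_op *op)
{
    assert(op != NULL);

    /* Probe arguments: operation id and planned step count. */
    DTRACE_PROBE2(nccl_demo, proxy_op_begin, op->op_id, op->n_steps);

    int done = 0;
    for (int i = 0; i < op->n_steps; i++) {
        done++;  /* placeholder for the real work of one step */
    }

    /* Probe arguments: operation id and steps actually completed. */
    DTRACE_PROBE2(nccl_demo, proxy_op_end, op->op_id, done);
    return done;
}
```

When built against the real header, the probe sites are recorded in the `.note.stapsdt` ELF note section, which `readelf -n` on the resulting binary should list. A tool such as bpftrace could then attach with a probe specification along the lines of `usdt:./demo:nccl_demo:proxy_op_begin`, where `./demo` is an assumed binary name.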

@sjeaugey
Member

sjeaugey commented Dec 9, 2024

Oh, but then every trace point would be an interrupt, which would cause a switch to kernel mode. That seems very slow -- relative to the kind of message rate and latency we have with RDMA operations.

@sjeaugey
Member

sjeaugey commented Dec 9, 2024

Actually, depending on the CPU type, you may also need to pass iommu=pt as a kernel boot option.

@gangxie112
Author

gangxie112 commented Dec 11, 2024

> Oh, but then every trace point would be an interrupt, which would cause a switch to kernel mode. That seems very slow -- relative to the kind of message rate and latency we have with RDMA operations.

Yes, that is exactly my concern. I'm looking into user-level BPF, like bpftime, but it seems hard to integrate.
On the other hand, if we only hook uprobes on the control path, like proxy ops, I think the cost may be acceptable, because when we want to probe the NCCL process, there is most likely some significant performance degradation anyway. I will run some performance tests on this.
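
A performance test of the kind mentioned above could start from a small timing harness like the sketch below. It measures the average cost of calling a function that contains one probe site, so the same binary can be timed once with no tracer attached and once with a tracer attached. Everything here is an assumption for illustration: `demo_control_step`, the `nccl_demo` provider, and the probe name are hypothetical and do not come from NCCL, and the fallback macro makes the probe a no-op when `sys/sdt.h` is not installed.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L  /* for clock_gettime under strict -std modes */
#endif
#include <assert.h>
#include <stdint.h>
#include <time.h>

/* Use the real USDT macro when available, otherwise compile it away. */
#if defined(__has_include)
#  if __has_include(<sys/sdt.h>)
#    include <sys/sdt.h>
#    define DEMO_HAVE_SDT 1
#  endif
#endif
#ifndef DEMO_HAVE_SDT
#  define DTRACE_PROBE1(provider, name, a) ((void)(a))
#endif

/* Volatile accumulator so the compiler cannot drop the loop body. */
static volatile uint64_t demo_sink;

/* Hypothetical control-path step carrying a single probe site. */
void demo_control_step(uint64_t seq)
{
    DTRACE_PROBE1(nccl_demo, control_step, seq);
    demo_sink += seq;
}

/* Returns the average wall-clock cost of one demo_control_step() call,
 * in nanoseconds, over the given number of iterations. */
double demo_ns_per_call(uint64_t iterations)
{
    assert(iterations > 0);

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (uint64_t i = 0; i < iterations; i++) {
        demo_control_step(i);
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double elapsed_ns = (double)(t1.tv_sec - t0.tv_sec) * 1e9
                      + (double)(t1.tv_nsec - t0.tv_nsec);
    return elapsed_ns / (double)iterations;
}
```

Comparing the value returned by `demo_ns_per_call` with and without an attached tracer gives a rough per-probe cost, which can then be weighed against how often the control path actually runs. This is only a micro-benchmark and says nothing by itself about end-to-end collective performance.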
