rmda user-manual #190
f1248db to d8e1f44
Signed-off-by: iostream2008@163.com <iostream2008@163.com>
Signed-off-by: wangjianyu <wangjianyu.wjy@alibaba-inc.com>
## A test report on affinity scheduling of rdma nic and GPU on k8s and high speed communication of RDMA computing network

### Introduction
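- This section should mainly describe the problem and explain how Koordinator solves it.
- Koordinator already supports joint allocation of GPU & RDMA, so the text should no longer say that this capability is missing.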
@@ -0,0 +1,1219 @@
## A test report on affinity scheduling of rdma nic and GPU on k8s and high speed communication of RDMA computing network
The title is too long.
OK. I'll revise the title and keep it as short as possible.

#### Prerequisite

<div>The basic K8S cluster environment for GPUs has been installed. The Nvidia driver and containerd have been installed on each GPU node, and the Mellanox NIC driver has been installed on the server.</div>
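
The prerequisites above can be spot-checked on each GPU node before continuing. The following is a minimal sketch and not part of the original document: the helper name `check_tool` is hypothetical, and the tool list (`nvidia-smi` for the Nvidia driver, `containerd`, `ibv_devices` from the RDMA userspace utilities for the Mellanox NIC, `kubectl` for the cluster) is an assumption about what is installed on the node.

```shell
#!/bin/sh
# check_tool NAME: report whether NAME is found on PATH.
# Hypothetical helper for illustration; it only checks that the binary
# exists, not that the driver or service behind it is healthy.
check_tool() {
  if command -v "$1" >/dev/null 2>&1; then
    echo "ok: $1"
  else
    echo "missing: $1"
    return 1
  fi
}

# Assumed tool list; adjust to the node's actual installation.
for tool in nvidia-smi containerd ibv_devices kubectl; do
  check_tool "$tool" || true   # keep going so every tool is reported
done
```

A "missing" line only means the binary is not on `PATH` for the current user; it does not by itself prove that the driver is absent.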
I suggest adjusting the overall structure as follows:
- Introduction: problem description and introduction to the Koordinator solution
- Experiment Setting:
- Test Scenarios
- Cluster and Nodes
- Initialize the nodes
- Deploy Koordinator and Multus
- Deploy Test Application and Check its allocation result
- NCCL Testing
```
...
}'
```

Plan: Nad configuration file name of NIC ens3f0np0 on node2: sriov-attach-k8s-node2-ens3f0np0-kubeflow-conf.yaml.
Shouldn't the NAD installation go in the Multus installation section?
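To be precise, the NAD is referenced by Multus, but its specific contents depend on the pod.yaml. So I suggest it is more appropriate to edit and install it right before the pod.yaml.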
1* GPU1+1* RDMA communication between two Pods 2G data volume communication scenario

```shell
mpirun --allow-run-as-root -H 10.244.1.10:1,10.244.2.21:1 -mca plm_rsh_args "-p 20024" -x NCCL_IB_DISABLE=0 -x NCCL_DEBUG=INFO -x NCCL_SOCKET_IFNAME=eth0 -x NCCL_IB_HCA==mlx5_2 -x UCX_NET_DEVICES=eth0 -x NCCL_NET_GDR_READ=1 ./build/all_reduce_perf -b 2M -e 2G -f 2 -g 1 -n 100 -w 5
```
Please give a brief introduction to mpirun.
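
For reference, `mpirun` is the Open MPI launcher: it starts one copy of a program on each listed host and wires the processes together. The sketch below rebuilds the command from the diff one option at a time, with a comment on each. The option values are copied from that command; the function name `launch_all_reduce` and its `print`/`run` switch are hypothetical, added only so the sketch can be inspected without a cluster.

```shell
#!/bin/sh
# launch_all_reduce [print|run]: assemble the mpirun command from the diff.
# "print" (default) lists one argument per line; "run" executes it.
launch_all_reduce() {
  mode=${1:-print}
  set -- mpirun
  set -- "$@" --allow-run-as-root             # Open MPI refuses to run as root without this
  set -- "$@" -H 10.244.1.10:1,10.244.2.21:1  # two hosts (the Pod IPs), one slot (process) on each
  set -- "$@" -mca plm_rsh_args "-p 20024"    # extra ssh arguments for the launcher: sshd port 20024
  set -- "$@" -x NCCL_IB_DISABLE=0            # -x exports a variable to every rank; 0 keeps the IB transport on
  set -- "$@" -x NCCL_DEBUG=INFO              # verbose NCCL logging
  set -- "$@" -x NCCL_SOCKET_IFNAME=eth0      # interface for NCCL socket traffic
  set -- "$@" -x NCCL_IB_HCA==mlx5_2          # value "=mlx5_2": leading "=" asks for an exact device match
  set -- "$@" -x UCX_NET_DEVICES=eth0         # network device UCX may use
  set -- "$@" -x NCCL_NET_GDR_READ=1          # enable GPUDirect RDMA when sending
  set -- "$@" ./build/all_reduce_perf         # nccl-tests all-reduce benchmark binary
  set -- "$@" -b 2M -e 2G -f 2                # message sizes from 2M to 2G, doubling each step
  set -- "$@" -g 1                            # one GPU per process
  set -- "$@" -n 100 -w 5                     # 100 timed iterations after 5 warm-up iterations
  if [ "$mode" = run ]; then
    "$@"
  else
    printf '%s\n' "$@"
  fi
}

launch_all_reduce print
```

Calling `launch_all_reduce run` inside the launcher Pod would execute the same command as in the diff, assuming `mpirun` and the nccl-tests build are present there.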
/close
Ⅰ. Describe what this PR does
Since GPUs in AI scenarios require RDMA NICs for high-speed NCCL communication, end-to-end support for RDMA devices must be added, including device discovery, device registration, node resource update, scheduling, and allocation.
Ⅱ. Does this pull request fix one issue?
No
Ⅲ. Describe how to verify it
Ⅳ. Special notes for reviews
V. Checklist
I have written necessary docs and comments
I have added necessary unit tests and integration tests
All checks passed in make test