
rmda user-manual #190

Closed
wants to merge 2 commits into from

Conversation

ferris-cx

Ⅰ. Describe what this PR does
Since GPUs in AI scenarios require RDMA NICs for high-speed NCCL communication, end-to-end support for RDMA devices must be added, including device discovery, device registration, node resource updates, scheduling, and allocation.
Ⅱ. Does this pull request fix one issue?
No
Ⅲ. Describe how to verify it
Ⅳ. Special notes for reviewers
Ⅴ. Checklist
I have written necessary docs and comments
I have added necessary unit tests and integration tests
All checks passed in make test
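
To make the end-to-end flow concrete, a pod that jointly requests GPU and RDMA resources might look roughly like the sketch below. The scheduler name, the resource keys (`koordinator.sh/rdma` in particular), the quantities, and the image are assumptions for illustration only, not confirmed by this PR:

```yaml
# Hypothetical pod spec: joint GPU + RDMA request (names and keys are assumptions)
apiVersion: v1
kind: Pod
metadata:
  name: nccl-test-worker
spec:
  schedulerName: koord-scheduler      # assumed Koordinator scheduler name
  containers:
  - name: worker
    image: nccl-tests:latest          # placeholder image
    resources:
      limits:
        nvidia.com/gpu: "1"           # one GPU
        koordinator.sh/rdma: "100"    # assumed RDMA resource key and share
```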

@ZiMengSheng ZiMengSheng requested review from ZiMengSheng and saintube and removed request for ZiMengSheng November 28, 2024 12:52
@ZiMengSheng ZiMengSheng force-pushed the rdma branch 2 times, most recently from f1248db to d8e1f44 on December 4, 2024 08:45
Signed-off-by: iostream2008@163.com <iostream2008@163.com>
Signed-off-by: wangjianyu <wangjianyu.wjy@alibaba-inc.com>
## A test report on affinity scheduling of rdma nic and GPU on k8s and high speed communication of RDMA computing network

### Introduction

Contributor

  1. This part should mainly describe the problem, and how Koordinator solves it.
  2. Koordinator already supports joint allocation of GPU & RDMA, so the report can no longer claim this capability is missing.

@@ -0,0 +1,1219 @@
## A test report on affinity scheduling of rdma nic and GPU on k8s and high speed communication of RDMA computing network
Contributor

The title is too long.

Author

OK. I'll revise the title to make it as short as possible.


#### Prerequisite

<div>The basic Kubernetes cluster environment for GPUs has been installed: the NVIDIA driver and containerd have been installed on each GPU node, and the Mellanox NIC driver has been installed on the server.</div>
Contributor

I suggest the overall structure be adjusted as follows:

  1. Introduction: problem description and an introduction to the Koordinator solution
  2. Experiment Setting:
    1. Test Scenarios
    2. Cluster and Nodes
  3. Initialize the nodes
  4. Deploy Koordinator and Multus
  5. Deploy the test application and check its allocation result
  6. NCCL Testing


Plan: the NAD configuration file for NIC ens3f0np0 on node2 is named sriov-attach-k8s-node2-ens3f0np0-kubeflow-conf.yaml.
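
For reference, a NetworkAttachmentDefinition of this kind might look roughly as follows; the `resourceName` annotation, CNI type, and IPAM settings are assumptions for illustration and must be adapted to the actual environment:

```yaml
# Hypothetical NAD sketch for the SR-IOV attachment on ens3f0np0 (node2)
apiVersion: k8s.cni.cncf.io/v1
kind: NetworkAttachmentDefinition
metadata:
  name: sriov-attach-k8s-node2-ens3f0np0-kubeflow-conf
  namespace: kubeflow
  annotations:
    k8s.v1.cni.cncf.io/resourceName: koordinator.sh/rdma   # assumed resource key
spec:
  config: |
    {
      "cniVersion": "0.3.1",
      "type": "sriov",
      "ipam": {
        "type": "host-local",
        "subnet": "10.244.0.0/16"
      }
    }
```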
Contributor

Shouldn't the NAD installation go in the Multus installation section?

Author

@ferris-cx ferris-cx Dec 17, 2024

To be precise, the NAD is referenced from Multus, but its concrete contents still depend on pod.yaml. So I suggest it is more appropriate to edit and install it just before pod.yaml.

Scenario: 1 GPU + 1 RDMA NIC per pod, communication between two pods with a 2 GB data volume

```shell
mpirun --allow-run-as-root -H 10.244.1.10:1,10.244.2.21:1 -mca plm_rsh_args "-p 20024" -x NCCL_IB_DISABLE=0 -x NCCL_DEBUG=INFO -x NCCL_SOCKET_IFNAME=eth0 -x NCCL_IB_HCA==mlx5_2 -x UCX_NET_DEVICES=eth0 -x NCCL_NET_GDR_READ=1 ./build/all_reduce_perf -b 2M -e 2G -f 2 -g 1 -n 100 -w 5
```
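
For readers unfamiliar with the command, the invocation above can be read flag by flag. The annotations below are a sketch based on standard Open MPI, NCCL, and nccl-tests documentation; the pod IPs, SSH port, and HCA name are specific to this test environment:

```shell
# Annotated reading of the mpirun command above (flag meanings per Open MPI,
# NCCL, and nccl-tests documentation; values are from this test environment):
#
#   -H 10.244.1.10:1,10.244.2.21:1    launch one rank (slot) on each pod IP
#   -mca plm_rsh_args "-p 20024"      pass "-p 20024" to ssh when starting remote ranks
#   -x NCCL_IB_DISABLE=0              allow NCCL to use the InfiniBand/RoCE transport
#   -x NCCL_DEBUG=INFO                enable verbose NCCL logging
#   -x NCCL_SOCKET_IFNAME=eth0        use eth0 for NCCL bootstrap/socket traffic
#   -x NCCL_IB_HCA==mlx5_2            a leading "=" in the value means exact match on HCA name mlx5_2
#   -x UCX_NET_DEVICES=eth0           restrict UCX to the eth0 interface
#   -x NCCL_NET_GDR_READ=1            enable GPUDirect RDMA reads from GPU memory on send
#   all_reduce_perf -b 2M -e 2G -f 2  sweep message sizes from 2 MB to 2 GB, doubling each step
#   -g 1 -n 100 -w 5                  1 GPU per rank, 100 iterations, 5 warm-up runs
```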
Contributor

Please give a brief introduction about mpirun.

@ZiMengSheng
Contributor

/close

3 participants