From dfcacdb7e5c10118a5fc10db6bc5d247332ac70f Mon Sep 17 00:00:00 2001
From: Frame
Date: Wed, 19 Jun 2024 13:46:11 +0800
Subject: [PATCH] gpu & rdma joint allocation best practice

Signed-off-by: iostream2008@163.com
Signed-off-by: wangjianyu
---
 .../gpu-rdma-joint-allocation.md | 1219 +++++++++++++++++
 sidebars.js                      |    1 +
 static/img/rdma-nic-topo.png     |  Bin 0 -> 27920 bytes
 3 files changed, 1220 insertions(+)
 create mode 100644 docs/best-practices/gpu-rdma-joint-allocation.md
 create mode 100644 static/img/rdma-nic-topo.png

diff --git a/docs/best-practices/gpu-rdma-joint-allocation.md b/docs/best-practices/gpu-rdma-joint-allocation.md
new file mode 100644
index 000000000..42c156599
--- /dev/null
+++ b/docs/best-practices/gpu-rdma-joint-allocation.md
@@ -0,0 +1,1219 @@
## A test report on GPU and RDMA NIC affinity scheduling on Kubernetes and high-speed communication over the RDMA compute network

### Introduction

Currently, Koordinator only supports the end-to-end device capability for GPUs. Since GPUs in AI scenarios rely on RDMA compute NICs for high-speed NCCL communication, end-to-end support for RDMA devices must be added as well, covering device discovery, device registration, node resource updates, scheduling, and allocation.

### GPU cluster environment

#### Test scenario

Schedule one Pod on each of the two nodes in SR-IOV VF mode, assign the two planned RDMA ports to the two Pods, and run traffic verification between them, as shown in the following figure:

![img](/img/rdma-nic-topo.png)

#### Prerequisite
The basic Kubernetes cluster environment for GPUs has been installed: the NVIDIA driver and containerd are installed on each GPU node, and the Mellanox NIC driver is installed on the servers.
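The driver installation can be spot-checked before going further. A minimal sanity check (assuming `ofed_info` from MLNX_OFED and `ibstat` from infiniband-diags are on the PATH):

```shell
nvidia-smi                       # GPU driver and CUDA version
ofed_info -s                     # MLNX_OFED driver version
ibstat | grep -E "CA '|State:"   # RDMA HCAs and their port states
```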

| Software/hardware name | Version/model |
| ------------------------ | ------------------------------------------------------------ |
| server*3 | OS: Ubuntu 22.04.2 LTS<br/>Kernel: 6.8.0-47-generic |
| k8s | v1.28.15 |
| koordinator | Based on 1.5, version to be released |
| containerd | v1.7.8 |
| nvidia-container-runtime | 3.14.0-1 |
| runc | 1.1.10 |
| nic | Mellanox Technologies MT27800 Family [ConnectX-5]<br/>Driver version: MLNX_OFED_LINUX-24.07-0.6.1.0 |
| GPU | Model: P40<br/>Driver version: 550.127.05 |
| cuda | 12.4 |
| nccl | 2.21.5 |
| multus-cni | A custom version based on release-v3 |

##### k8s Cluster Info

| node name | k8s version | IP | OS | Kernel | GPU | Containerd |
| ---------- | ----------- | -------------- | ------------------ | ---------------- | ----- | ------------------- |
| k8s-master | v1.28.15 | 192.168.10.203 | Ubuntu 22.04.4 LTS | 6.8.0-45-generic | / | containerd://1.7.22 |
| k8s-node1 | v1.28.15 | 192.168.10.232 | Ubuntu 22.04.4 LTS | 6.8.0-45-generic | P40*4 | containerd://1.7.22 |
| k8s-node2 | v1.28.15 | 192.168.10.231 | Ubuntu 22.04.4 LTS | 6.8.0-45-generic | P40*4 | containerd://1.7.22 |

In this test, the RDMA devices on node1 and node2 are used for the traffic tests.

##### GPU Info

###### k8s-node1

```shell
root@k8s-node1:~/ss/koo/script# nvidia-smi
Wed Nov 27 16:21:46 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.127.05 Driver Version: 550.127.05 CUDA Version: 12.4 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 Tesla P40 Off | 00000000:02:00.0 Off | 0 |
| N/A 21C P8 12W / 250W | 0MiB / 23040MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
| 1 Tesla P40 Off | 00000000:03:00.0 Off | 0 |
| N/A 26C P8 10W / 250W | 0MiB / 23040MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
| 2 Tesla P40 Off | 00000000:82:00.0 Off | 0 |
| N/A 23C P8 10W / 250W | 0MiB / 23040MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
| 3 Tesla P40 Off | 00000000:83:00.0 Off | 0 |
| N/A 18C P8 8W / 250W | 0MiB / 23040MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| No running processes found |
+-----------------------------------------------------------------------------------------+
```

###### k8s-node2

```shell
root@k8s-node2:~# nvidia-smi
Wed Nov 27 16:22:16 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.120 Driver Version: 550.120 CUDA Version: 12.4 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
| +|=========================================+========================+======================| +| 0 Tesla P40 Off | 00000000:02:00.0 Off | 0 | +| N/A 31C P8 10W / 250W | 0MiB / 23040MiB | 0% Default | +| | | N/A | ++-----------------------------------------+------------------------+----------------------+ +| 1 Tesla P40 Off | 00000000:03:00.0 Off | 0 | +| N/A 31C P8 10W / 250W | 0MiB / 23040MiB | 0% Default | +| | | N/A | ++-----------------------------------------+------------------------+----------------------+ +| 2 Tesla P40 Off | 00000000:82:00.0 Off | 0 | +| N/A 37C P8 10W / 250W | 0MiB / 23040MiB | 0% Default | +| | | N/A | ++-----------------------------------------+------------------------+----------------------+ +| 3 Tesla P40 Off | 00000000:83:00.0 Off | 0 | +| N/A 30C P8 10W / 250W | 0MiB / 23040MiB | 0% Default | +| | | N/A | ++-----------------------------------------+------------------------+----------------------+ + ++-----------------------------------------------------------------------------------------+ +| Processes: | +| GPU GI CI PID Type Process name GPU Memory | +| ID ID Usage | +|=========================================================================================| +| No running processes found | ++-----------------------------------------------------------------------------------------+ +``` + +### Step + +#### Basic environmental preparation + +- To deploy multus-cni, download the plug-in (yaml file) from the Internet and run kubectl apply -f multus-CNi-daemon. yam. After the execution, the following information is displayed, indicating that the installation is successful. + + ```shell +root@k8s-master:~# kubectl get po -n kube-system |grep multus +kube-multus-ds-7ddbh 1/1 Running 0 38h +kube-multus-ds-cgvqq 1/1 Running 0 38h +kube-multus-ds-lc6nv 1/1 Running 0 38h +kube-multus-ds-t87r5 1/1 Running 0 38h + ``` + +#### Create VF based on physical nics + +- Plan the physical NIC for the test + + | node name | nic name | Nic model | NAD name | Ip address | remark | + | --------- | -------------------------- | ------------------------------------------------------------ | ----------------------------------------------- | ------------ | ------------------------------------------------------------ | + | K8s-node1 | ens11f0np0
ens11f1np1 | 01:00.0 Ethernet controller [0200]: Mellanox Technologies MT27800 Family [ConnectX-5] [15b3:1017]
01:00.1 Ethernet controller [0200]: Mellanox Technologies MT27800 Family [ConnectX-5] [15b3:1017] | sriov-attach-k8s-node1-ens11f0np0-kubeflow-conf | 10.20.12.121 | To simplify the testing, we create pod01 and have it schedule directionally to node1 and occupy the VF on node1 | + | K8s-node2 | ens3f0np0
ens3f1np1 | 81:00.0 Ethernet controller [0200]: Mellanox Technologies MT27800 Family [ConnectX-5] [15b3:1017]
81:00.1 Ethernet controller [0200]: Mellanox Technologies MT27800 Family [ConnectX-5] [15b3:1017] | sriov-attach-k8s-node2-ens3f0np0-kubeflow-conf | 10.20.12.134 | To simplify the testing, we create pod02 and have it schedule directionally to node2 and occupy this VF | + +- Create a VF on node1 + + Log in to node1 and create VF based on the Mellanox CX5 network adapter. Since the host already has two nics, three cx5 nics will appear if VF is successfully created. + + Create instruction is as follows: + + ``` +echo '1' > / sys/class/net/ens11f0np0 / device/sriov_numvfs + ``` + + The host runs the following command: "lspci |grep Mell". If [ConnectX-5 Virtual Function] is displayed, VF is created successfully. + + ```shell + root@k8s-node1:/data/cc/code/koordinator# lspci |grep Mell + 01:00.0 Ethernet controller: Mellanox Technologies MT27800 Family [ConnectX-5] + 01:00.1 Ethernet controller: Mellanox Technologies MT27800 Family [ConnectX-5] + 01:00.2 Ethernet controller: Mellanox Technologies MT27800 Family [ConnectX-5 Virtual Function] //VF + ``` + +If you run ibstat, mlx5_2 in the output is VF: + + ```shell + CA 'mlx5_0' + CA type: MT4119 + Number of ports: 1 + Firmware version: 16.35.4030 + Hardware version: 0 + Node GUID: 0x1070fd0300a4487a + System image GUID: 0x1070fd0300a4487a + Port 1: + State: Active + Physical state: LinkUp + Rate: 25 + Base lid: 0 + LMC: 0 + SM lid: 0 + Capability mask: 0x00010000 + Port GUID: 0x1270fdfffea4487a + Link layer: Ethernet + CA 'mlx5_1' + CA type: MT4119 + Number of ports: 1 + Firmware version: 16.35.4030 + Hardware version: 0 + Node GUID: 0x1070fd0300a4487b + System image GUID: 0x1070fd0300a4487a + Port 1: + State: Down + Physical state: Disabled + Rate: 25 + Base lid: 0 + LMC: 0 + SM lid: 0 + Capability mask: 0x00010000 + Port GUID: 0x1270fdfffea4487b + Link layer: Ethernet + CA 'mlx5_2' //VF + CA type: MT4120 + Number of ports: 1 + Firmware version: 16.35.4030 + Hardware version: 0 + Node GUID: 0x0000000000000000 + System image GUID: 0x1070fd0300a4487a + Port 1: + State: Active + Physical state: LinkUp + Rate: 25 + Base lid: 0 + LMC: 0 + SM lid: 0 + Capability mask: 0x00010000 + Port GUID: 0x0000000000000000 + Link layer: Ethernet + ``` + + + +- node2创建1个VF + + Log in to node2 and create VF based on the Mellanox CX5 network adapter. The host already has two cx5 nics. If the VF is created successfully, three cx5 nics are displayed. + + Create instruction is as follows: + + ``` + echo '1' > / sys/class/net/ens11f0np0 / device/sriov_numvfs + ``` + + The host runs the following command: "lspci |grep Mell". If [ConnectX-5 Virtual Function] is displayed, VF is created successfully. 

  ```shell
  root@k8s-node2:~# lspci |grep Mell
  d2:00.0 Ethernet controller: Mellanox Technologies MT27800 Family [ConnectX-5]
  d2:00.1 Ethernet controller: Mellanox Technologies MT27800 Family [ConnectX-5]
  d2:01.2 Ethernet controller: Mellanox Technologies MT27800 Family [ConnectX-5 Virtual Function] //VF
  ```

  If you run ibstat, mlx5_2 in the output is the VF:

  ```shell
  CA 'mlx5_0'
      CA type: MT4119
      Number of ports: 1
      Firmware version: 16.32.1010
      Hardware version: 0
      Node GUID: 0x1070fd0300a4486a
      System image GUID: 0x1070fd0300a4486a
      Port 1:
          State: Down
          Physical state: Disabled
          Rate: 40
          Base lid: 0
          LMC: 0
          SM lid: 0
          Capability mask: 0x00010000
          Port GUID: 0x0000000000000000
          Link layer: Ethernet
  CA 'mlx5_1'
      CA type: MT4119
      Number of ports: 1
      Firmware version: 16.32.1010
      Hardware version: 0
      Node GUID: 0x1070fd0300a4486b
      System image GUID: 0x1070fd0300a4486a
      Port 1:
          State: Down
          Physical state: Disabled
          Rate: 25
          Base lid: 0
          LMC: 0
          SM lid: 0
          Capability mask: 0x00010000
          Port GUID: 0x0000000000000000
          Link layer: Ethernet
  CA 'mlx5_2' //VF
      CA type: MT4119
      Number of ports: 1
      Firmware version: 16.35.3006
      Hardware version: 0
      Node GUID: 0x1070fd0300a44882
      System image GUID: 0x1070fd0300a44882
      Port 1:
          State: Down
          Physical state: Disabled
          Rate: 40
          Base lid: 0
          LMC: 0
          SM lid: 0
          Capability mask: 0x00010000
          Port GUID: 0x0000000000000000
          Link layer: Ethernet
  ```

#### Edit the Pod YAML files and deploy

- Note: This test requires two Pods, so two Pod YAML files are needed: one Pod pinned to node1 and one pinned to node2.

- Label: To pin each Pod to a fixed node, the nodes must be labeled first:

  ```shell
  kubectl label nodes k8s-node1 koo=node1;kubectl label nodes k8s-node2 koo=node2
  ```

- pod-vf01.yaml is as follows (the two `scheduling.koordinator.sh` annotations carry JSON strings, so the explanatory comments are placed outside them as YAML `#` comments):

  ```yaml
  apiVersion: v1
  kind: Pod
  metadata:
    name: pod-vf01
    namespace: kubeflow
    annotations:
      # this NAD is defined separately below
      k8s.v1.cni.cncf.io/networks: sriov-attach-k8s-node1-ens11f0np0-kubeflow-conf
      # jointly allocate the GPU and the RDMA device
      scheduling.koordinator.sh/device-joint-allocate: |-
        {
          "deviceTypes": ["gpu","rdma"]
        }
      # hint: allocate a VF of the RDMA device
      scheduling.koordinator.sh/device-allocate-hint: |-
        {
          "rdma": {
            "vfSelector": {}
          }
        }
    labels:
      selector-type: pod
  spec:
    nodeSelector:
      koo: node1                     # pin the Pod to node1
    schedulerName: koord-scheduler   # use the Koordinator scheduler
    containers:
    - name: container-vf
      image: nvcr.io/nvidia/pytorch:24.04-py3
      securityContext:
        capabilities:
          add: [ "IPC_LOCK" ]
      imagePullPolicy: IfNotPresent
      command: [ "/bin/bash", "-c", "--" ]
      args: [ "while true; do sleep 300000; done;" ]
      volumeMounts:
      - mountPath: /dev/shm
        name: shm
      resources:
        requests:
          koordinator.sh/gpu: 100    # request one whole GPU
          koordinator.sh/rdma: 100   # request one RDMA VF
        limits:
          koordinator.sh/gpu: 100
          koordinator.sh/rdma: 100
    volumes:
    - name: shm
      emptyDir:
        medium: Memory
        sizeLimit: "10Gi"
  ```

- pod-vf02.yaml is as follows.
It is basically the same as pod-vf01; only the Pod name and the target node differ:

  ```yaml
  apiVersion: v1
  kind: Pod
  metadata:
    name: pod-vf02
    namespace: kubeflow
    annotations:
      k8s.v1.cni.cncf.io/networks: sriov-attach-k8s-node2-ens3f0np0-kubeflow-conf
      scheduling.koordinator.sh/device-joint-allocate: |-
        {
          "deviceTypes": ["gpu","rdma"]
        }
      scheduling.koordinator.sh/device-allocate-hint: |-
        {
          "rdma": {
            "vfSelector": {}
          }
        }
    labels:
      selector-type: pod
  spec:
    nodeSelector:
      koo: node2
    schedulerName: koord-scheduler
    containers:
    - name: container-vf
      image: nvcr.io/nvidia/pytorch:24.04-py3
      securityContext:
        capabilities:
          add: [ "IPC_LOCK" ]
      imagePullPolicy: IfNotPresent
      command: [ "/bin/bash", "-c", "--" ]
      args: [ "while true; do sleep 300000; done;" ]
      volumeMounts:
      - mountPath: /dev/shm
        name: shm
      resources:
        requests:
          koordinator.sh/gpu: 100
          koordinator.sh/rdma: 100
        limits:
          koordinator.sh/gpu: 100
          koordinator.sh/rdma: 100
    volumes:
    - name: shm
      emptyDir:
        medium: Memory
        sizeLimit: "10Gi"
  ```

- Edit the NADs

  Because the Pods attach an additional NIC through multus, they depend on NetworkAttachmentDefinition (NAD) objects, so the NAD configuration files must be written in advance. Note that `spec.config` holds a JSON string, which must not contain comments; the `rangeStart`/`rangeEnd` fields pin the Pod's IP address to the planned value.

  Plan: the NAD configuration file for NIC ens11f0np0 on node1 is named sriov-attach-k8s-node1-ens11f0np0-kubeflow-conf.yaml (its `rangeStart`/`rangeEnd` pin the Pod IP to 10.20.12.121):

  ```yaml
  apiVersion: "k8s.cni.cncf.io/v1"
  kind: NetworkAttachmentDefinition
  metadata:
    name: sriov-attach-k8s-node1-ens11f0np0-kubeflow-conf
    namespace: kubeflow
    annotations:
      k8s.v1.cni.cncf.io/resourceName: koordinator.sh/rdma
  spec:
    config: '{
      "cniVersion": "0.3.1",
      "name": "sriov-attach",
      "type": "sriov",
      "capabilities": {
        "mac": true,
        "ipam": true
      },
      "master": "ens11f0np0",
      "mode": "passthrough",
      "ipam": {
        "type": "host-local",
        "subnet": "10.20.12.0/24",
        "rangeStart": "10.20.12.121",
        "rangeEnd": "10.20.12.121"
      }
    }'
  ```
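Because the CNI configuration is embedded as a JSON string in `spec.config`, a stray comment or trailing comma only surfaces as an attach failure when a Pod is created. The embedded JSON can be checked up front; a small sketch, assuming `jq` is installed on the operator machine:

```shell
# Validate the manifest without creating it, extract the embedded CNI config,
# and let jq fail loudly if the string is not valid JSON.
kubectl apply --dry-run=client -f sriov-attach-k8s-node1-ens11f0np0-kubeflow-conf.yaml \
  -o jsonpath='{.spec.config}' | jq .
```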
Plan: the NAD configuration file for NIC ens3f0np0 on node2 is named sriov-attach-k8s-node2-ens3f0np0-kubeflow-conf.yaml (its `rangeStart`/`rangeEnd` pin the Pod IP to 10.20.12.134):

```yaml
  apiVersion: "k8s.cni.cncf.io/v1"
  kind: NetworkAttachmentDefinition
  metadata:
    name: sriov-attach-k8s-node2-ens3f0np0-kubeflow-conf
    namespace: kubeflow
    annotations:
      k8s.v1.cni.cncf.io/resourceName: koordinator.sh/rdma
  spec:
    config: '{
      "cniVersion": "0.3.1",
      "name": "sriov-attach",
      "type": "sriov",
      "capabilities": {
        "mac": true,
        "ipam": true
      },
      "master": "ens3f0np0",
      "mode": "passthrough",
      "ipam": {
        "type": "host-local",
        "subnet": "10.20.12.0/24",
        "rangeStart": "10.20.12.134",
        "rangeEnd": "10.20.12.134"
      }
    }'
  ```

- Create the namespace on the k8s cluster

  Log in to the k8s master node and run the following command:

  ```
  kubectl create ns kubeflow
  ```

- Run the following commands to deploy the NADs:

  ```shell
  kubectl apply -f sriov-attach-k8s-node1-ens11f0np0-kubeflow-conf.yaml
  kubectl apply -f sriov-attach-k8s-node2-ens3f0np0-kubeflow-conf.yaml
  ```

- Run the following commands to deploy the Pods:

  ```shell
  kubectl apply -f pod-vf01.yaml
  kubectl apply -f pod-vf02.yaml
  ```

- Check the Pod status:

  ```shell
  root@k8s-master:~/ss/koo/rdma/sriov# kubectl get po -n kubeflow -owide
  NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
  pod-vf01 1/1 Running 0 103m 10.244.1.10 k8s-node1
  pod-vf02 1/1 Running 0 10h 10.244.2.18 k8s-node2
  ```

  If both Pods are in the Running state, they were created and are running successfully.

#### Check the device allocation result of pod-vf01

- pod-vf01 has been scheduled to node1. First look at the allocation result recorded on the Pod. Run the command:

  ```shell
  kubectl get pod pod-vf01 -n kubeflow -oyaml
  ```

- The output is long, so we only look at the `scheduling.koordinator.sh/device-allocated` annotation:

  ```yaml
  scheduling.koordinator.sh/device-allocated: '{"gpu":[{"minor":0,"resources":{"koordinator.sh/gpu-core":"100","koordinator.sh/gpu-memory":"23040Mi","koordinator.sh/gpu-memory-ratio":"100"}}],"rdma":[{"minor":0,"resources":{"koordinator.sh/rdma":"1"},"extension":{"vfs":[{"minor":-1,"busID":"0000:01:00.2"}]}}]}'
  ......
  dnsPolicy: ClusterFirst
  enableServiceLinks: true
  nodeName: k8s-node1   # it has been scheduled to node1
  nodeSelector:
    koo: node1
  preemptionPolicy: PreemptLowerPriority
  priority: 0
  restartPolicy: Always
  schedulerName: koord-scheduler
  ```

- Check the GPU allocation result

  ```shell
  root@pod-vf01:/home# nvidia-smi
  Fri Nov 22 06:55:59 2024
  +-----------------------------------------------------------------------------------------+
  | NVIDIA-SMI 550.127.05 Driver Version: 550.127.05 CUDA Version: 12.4 |
  |-----------------------------------------+------------------------+----------------------+
  | GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
  | Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
  | | | MIG M. |
  |=========================================+========================+======================|
  | 0 Tesla P40 Off | 00000000:02:00.0 Off | 0 |
  | N/A 24C P8 10W / 250W | 0MiB / 23040MiB | 0% Default |
  | | | N/A |
  +-----------------------------------------+------------------------+----------------------+

  +-----------------------------------------------------------------------------------------+
  | Processes: |
  | GPU GI CI PID Type Process name GPU Memory |
  | ID ID Usage |
  |=========================================================================================|
  | No running processes found |
  +-----------------------------------------------------------------------------------------+
  ```

- Check whether the device allocation result of pod-vf01 satisfies affinity

  Because pod-vf01 is scheduled to k8s-node1, check the device topology CR of k8s-node1. Run the command:

  ```shell
  kubectl get devices.scheduling.koordinator.sh k8s-node1 -oyaml
  ```

  Check the topology information in the device CR, as follows:

  ```yaml
  apiVersion: scheduling.koordinator.sh/v1alpha1
  kind: Device
  metadata:
  .....
  spec:
    devices:
    - health: true
      id: GPU-989aa251-1dfe-5bbc-7c12-46e817b1de9a
      minor: 0   # pod-vf01 is allocated this GPU (GPU 0); the corresponding PCIe root is pci0000:00
      resources:
        koordinator.sh/gpu-core: "100"
        koordinator.sh/gpu-memory: 23040Mi
        koordinator.sh/gpu-memory-ratio: "100"
      topology:
        busID: "0000:02:00.0"
        nodeID: 0
        pcieID: pci0000:00
        socketID: -1
      type: gpu
    - health: true
      id: "0000:01:00.0"
      minor: 0
      resources:
        koordinator.sh/rdma: "100"
      topology:
        busID: "0000:01:00.0"
        nodeID: 0
        pcieID: pci0000:00
        socketID: -1
      type: rdma
      vfGroups:
      - vfs:
        - busID: "0000:01:00.2"   # pod-vf01 is allocated this VF; the corresponding PCIe root is pci0000:00
          minor: -1
    - health: true
      id: GPU-e8a40bd0-e484-2d1b-cad9-75b043139b0c
      minor: 1
      resources:
        koordinator.sh/gpu-core: "100"
        koordinator.sh/gpu-memory: 23040Mi
        koordinator.sh/gpu-memory-ratio: "100"
      topology:
        busID: "0000:03:00.0"
        nodeID: 0
        pcieID: pci0000:00
        socketID: -1
      type: gpu
    - health: true
      id: "0000:01:00.1"
      minor: 1
      resources:
        koordinator.sh/rdma: "100"
      topology:
        busID: "0000:01:00.1"
        nodeID: 0
        pcieID: pci0000:00
        socketID: -1
      type: rdma
    - health: true
      id: GPU-5293b3a7-2bbb-e135-c6ab-c548b5c5b0a6
      minor: 2
      resources:
        koordinator.sh/gpu-core: "100"
        koordinator.sh/gpu-memory: 23040Mi
        koordinator.sh/gpu-memory-ratio: "100"
      topology:
        busID: 0000:82:00.0
        nodeID: 0
        pcieID: pci0000:80
        socketID: -1
      type: gpu
    - health: true
      id: "0000:05:00.0"
      minor: 2
      resources:
        koordinator.sh/rdma: "100"
      topology:
        busID: "0000:05:00.0"
        nodeID: 0
        pcieID: pci0000:00
        socketID: -1
      type: rdma
    - health: true
      id: GPU-d60a283a-a846-eaa7-f551-c0c4f6f4402a
      minor: 3
      resources:
        koordinator.sh/gpu-core: "100"
        koordinator.sh/gpu-memory: 23040Mi
        koordinator.sh/gpu-memory-ratio: "100"
      topology:
        busID: 0000:83:00.0
        nodeID: 0
        pcieID: pci0000:80
        socketID: -1
      type: gpu
  status: {}
  ```

  According to the topology information, pod-vf01 is allocated the VF with busID "0000:01:00.2", whose PCIe root is pci0000:00, and GPU 0, whose PCIe root is also pci0000:00. Since the two PCIe IDs are the same, the GPU and the NIC satisfy the expected topology affinity.
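  The same comparison can be scripted instead of read off by eye. A sketch, assuming `jq` is available; it pulls the allocation from the Pod annotation and the topology from the Device CR shown above:

  ```shell
  POD=pod-vf01; NODE=k8s-node1
  # Allocation decision recorded on the Pod by koord-scheduler.
  ALLOC=$(kubectl get pod "$POD" -n kubeflow \
    -o jsonpath="{.metadata.annotations['scheduling\.koordinator\.sh/device-allocated']}")
  GPU_MINOR=$(echo "$ALLOC" | jq '.gpu[0].minor')
  RDMA_MINOR=$(echo "$ALLOC" | jq '.rdma[0].minor')
  # Look up the pcieID of both allocated devices in the node's Device CR;
  # the two entries printed should carry the same pcieID.
  kubectl get devices.scheduling.koordinator.sh "$NODE" -o json | jq \
    --argjson g "$GPU_MINOR" --argjson r "$RDMA_MINOR" \
    '[.spec.devices[]
      | select((.type == "gpu" and .minor == $g) or (.type == "rdma" and .minor == $r))
      | {type, minor, pcieID: .topology.pcieID}]'
  ```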

#### Check the device allocation result of pod-vf02

- In the same way, check whether the device allocation result of pod-vf02 satisfies affinity.

  Run the command:

  ```shell
  kubectl get pod pod-vf02 -n kubeflow -oyaml
  ```

  Again, we only look at the `scheduling.koordinator.sh/device-allocated` annotation:

  ```yaml
  scheduling.koordinator.sh/device-allocated: '{"gpu":[{"minor":2,"resources":{"koordinator.sh/gpu-core":"100","koordinator.sh/gpu-memory":"23040Mi","koordinator.sh/gpu-memory-ratio":"100"}}],"rdma":[{"minor":9,"resources":{"koordinator.sh/rdma":"1"},"extension":{"vfs":[{"minor":-1,"busID":"0000:d2:01.2"}]}}]}'
  ```

- Check the GPU allocation result:

  ```shell
  root@pod-vf02:/home/nccl-tests# nvidia-smi
  Fri Nov 22 06:56:58 2024
  +-----------------------------------------------------------------------------------------+
  | NVIDIA-SMI 550.120 Driver Version: 550.120 CUDA Version: 12.4 |
  |-----------------------------------------+------------------------+----------------------+
  | GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
  | Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
  | | | MIG M. |
  |=========================================+========================+======================|
  | 0 Tesla P40 Off | 00000000:57:00.0 Off | Off |
  | N/A 23C P8 11W / 250W | 0MiB / 24576MiB | 0% Default |
  | | | N/A |
  +-----------------------------------------+------------------------+----------------------+

  +-----------------------------------------------------------------------------------------+
  | Processes: |
  | GPU GI CI PID Type Process name GPU Memory |
  | ID ID Usage |
  |=========================================================================================|
  | No running processes found |
  +-----------------------------------------------------------------------------------------+
  ```

- Check whether the device allocation result of pod-vf02 satisfies affinity

  Because pod-vf02 is scheduled to k8s-node2, check the device topology CR of k8s-node2. Run the command:

  ```shell
  kubectl get devices.scheduling.koordinator.sh k8s-node2 -oyaml
  ```

  Check the topology information in the device CR, as follows:

  ```yaml
  apiVersion: scheduling.koordinator.sh/v1alpha1
  kind: Device
  metadata:
  ......
  spec:
    devices:
    - health: true
      id: GPU-8fee8688-ebf4-4281-dd1a-9c1087aeb02d
      minor: 0
      resources:
        koordinator.sh/gpu-core: "100"
        koordinator.sh/gpu-memory: 23040Mi
        koordinator.sh/gpu-memory-ratio: "100"
      topology:
        busID: "0000:56:00.0"
        nodeID: 0
        pcieID: pci0000:4a
        socketID: -1
      type: gpu
    - health: true
      id: GPU-b45a64b3-d78d-08fc-669f-041859f90658
      minor: 1
      resources:
        koordinator.sh/gpu-core: "100"
        koordinator.sh/gpu-memory: 24Gi
        koordinator.sh/gpu-memory-ratio: "100"
      topology:
        busID: "0000:57:00.0"
        nodeID: 0
        pcieID: pci0000:4a
        socketID: -1
      type: gpu
    - health: true
      id: GPU-cb146f50-4880-4d17-b6cd-7dc665c0c867
      minor: 2   # pod-vf02 is allocated this GPU (GPU 2); the corresponding PCIe root is pci0000:c9
      resources:
        koordinator.sh/gpu-core: "100"
        koordinator.sh/gpu-memory: 23040Mi
        koordinator.sh/gpu-memory-ratio: "100"
      topology:
        busID: 0000:d1:00.0
        nodeID: 1
        pcieID: pci0000:c9
        socketID: -1
      type: gpu
    - health: true
      id: GPU-5d758779-e34e-d058-c938-a3cd1eb1ed8c
      minor: 3
      resources:
        koordinator.sh/gpu-core: "100"
        koordinator.sh/gpu-memory: 23040Mi
        koordinator.sh/gpu-memory-ratio: "100"
      topology:
        busID: 0000:d6:00.0
        nodeID: 1
        pcieID: pci0000:c9
        socketID: -1
      type: gpu
    - health: true
      id: 0000:d2:00.0
      minor: 8
      resources:
        koordinator.sh/rdma: "100"
      topology:
        busID: 0000:d2:00.0
        nodeID: 1
        pcieID: pci0000:c9
        socketID: -1
      type: rdma
    - health: true
      id: 0000:d2:00.1
      minor: 9
      resources:
        koordinator.sh/rdma: "100"
      topology:
        busID: 0000:d2:00.1
        nodeID: 1
        pcieID: pci0000:c9
        socketID: -1
      type: rdma
      vfGroups:
      - vfs:
        - busID: 0000:d2:01.2   # pod-vf02 is allocated this VF; the corresponding PCIe root is pci0000:c9
          minor: -1
  status: {}
  ```

  According to the topology information, pod-vf02 is allocated the VF with busID "0000:d2:01.2", whose PCIe root is pci0000:c9, and GPU 2, whose PCIe root is also pci0000:c9. Since the two PCIe IDs are the same, the GPU and the NIC satisfy the expected topology affinity.

  At this point, the one GPU and one RDMA device requested by each of the two Pods have been allocated successfully, and both allocations satisfy topology affinity.

- Check the IP information allocated inside the Pods, using pod-vf01 as an example (pod-vf02 is checked in the same way and is not described separately).

  To enter the container, run the following command:

  ```shell
  kubectl exec -it pod-vf01 -n kubeflow -- bash
  ```

  The container lacks basic network test tools, so install the following packages first:

  ```shell
  apt-get update
  apt-get install -y net-tools
  apt install -y iputils-ping
  apt-get install infiniband-diags -y
  apt-get install -y kmod
  apt-get install -y perftest
  apt-get install -y ethtool
  ......
  ```

  After the installation succeeds, run the ifconfig command to check the IP address assignment.

  ```shell
  eth0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1450
        inet 10.244.1.10  netmask 255.255.255.0  broadcast 10.244.1.255
        inet6 fe80::e4c7:a3ff:fe4c:9d15  prefixlen 64  scopeid 0x20<link>
        ether e6:c7:a3:4c:9d:15  txqueuelen 0  (Ethernet)
        RX packets 17129  bytes 57434980 (57.4 MB)
        RX errors 0  dropped 244  overruns 0  frame 0
        TX packets 13383  bytes 1019323 (1.0 MB)
        TX errors 0  dropped 0  overruns 0  carrier 0  collisions 0

  lo: flags=73<UP,LOOPBACK,RUNNING>  mtu 65536
        inet 127.0.0.1  netmask 255.0.0.0
        inet6 ::1  prefixlen 128  scopeid 0x10<host>
        loop  txqueuelen 1000  (Local Loopback)
        RX packets 487  bytes 211446 (211.4 KB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 487  bytes 211446 (211.4 KB)
        TX errors 0  dropped 0  overruns 0  carrier 0  collisions 0

  net1: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet 10.20.12.121  netmask 255.255.255.0  broadcast 10.20.12.255
        inet6 fe80::6ce7:bfff:fee0:9382  prefixlen 64  scopeid 0x20<link>
        ether 6e:e7:bf:e0:93:82  txqueuelen 1000  (Ethernet)
        RX packets 477  bytes 86270 (86.2 KB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 327  bytes 47335 (47.3 KB)
        TX errors 0  dropped 0  overruns 0  carrier 0  collisions 0
  ```

  The net1 interface here is the port attached to the Pod by multus-cni, and its address, 10.20.12.121, comes from the range we configured in the NAD named sriov-attach-k8s-node1-ens11f0np0-kubeflow-conf.

  As with pod-vf01, check the IP address of pod-vf02 with the ifconfig command:

  ```shell
  eth0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1450
        inet 10.244.2.21  netmask 255.255.255.0  broadcast 10.244.2.255
        inet6 fe80::f45c:90ff:fe3a:67a2  prefixlen 64  scopeid 0x20<link>
        ether f6:5c:90:3a:67:a2  txqueuelen 0  (Ethernet)
        RX packets 21690  bytes 65555332 (65.5 MB)
        RX errors 0  dropped 1310  overruns 0  frame 0
        TX packets 15612  bytes 1218973 (1.2 MB)
        TX errors 0  dropped 0  overruns 0  carrier 0  collisions 0

  lo: flags=73<UP,LOOPBACK,RUNNING>  mtu 65536
        inet 127.0.0.1  netmask 255.0.0.0
        inet6 ::1  prefixlen 128  scopeid 0x10<host>
        loop  txqueuelen 1000  (Local Loopback)
        RX packets 794  bytes 277124 (277.1 KB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 794  bytes 277124 (277.1 KB)
        TX errors 0  dropped 0  overruns 0  carrier 0  collisions 0

  net1: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet 10.20.12.134  netmask 255.255.255.0  broadcast 10.20.12.255
        inet6 fe80::ac97:a4ff:fe72:d1f1  prefixlen 64  scopeid 0x20<link>
        ether ae:97:a4:72:d1:f1  txqueuelen 1000  (Ethernet)
        RX packets 492  bytes 110501 (110.5 KB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 318  bytes 42371 (42.3 KB)
        TX errors 0  dropped 0  overruns 0  carrier 0  collisions 0
  ```

  The net1 interface here is likewise attached by multus-cni, and its address, 10.20.12.134, comes from the range we configured in the NAD named sriov-attach-k8s-node2-ens3f0np0-kubeflow-conf.

Now ping between the two containers: from inside pod-vf01, ping pod-vf02's net1 port:

  ```shell
  root@pod-vf01:/workspace# ping 10.20.12.134
  PING 10.20.12.134 (10.20.12.134) 56(84) bytes of data.
  64 bytes from 10.20.12.134: icmp_seq=1 ttl=64 time=0.293 ms
  64 bytes from 10.20.12.134: icmp_seq=2 ttl=64 time=0.212 ms
  64 bytes from 10.20.12.134: icmp_seq=3 ttl=64 time=0.216 ms
  64 bytes from 10.20.12.134: icmp_seq=4 ttl=64 time=0.221 ms
  ```

The results show that the two Pods can reach each other. However, ping alone is not enough to prove that the VF ports of the two CX5 NICs can communicate; further tests against the specific VF ports are required.
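As an intermediate check, the ICMP test can at least be forced onto the VF interface rather than the default eth0 route by pinning the source interface (this still exercises the TCP/IP stack, not RDMA):

```shell
# Inside pod-vf01: ping pod-vf02's net1 address out of pod-vf01's net1 (the VF).
ping -I net1 10.20.12.134
```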

- Check the VF device mounted inside the Pod, using pod-vf01 as an example (pod-vf02 is checked in the same way and is not described separately).

  Enter pod-vf01 and run the ibstat command:

  ```shell
  root@pod-vf01:/workspace# ibstat
  CA 'mlx5_0'
      CA type: MT4119
      Number of ports: 1
      Firmware version: 16.35.4030
      Hardware version: 0
      Node GUID: 0x1070fd0300a4487a
      System image GUID: 0x1070fd0300a4487a
      Port 1:
          State: Active
          Physical state: LinkUp
          Rate: 25
          Base lid: 0
          LMC: 0
          SM lid: 0
          Capability mask: 0x00010000
          Port GUID: 0x0000000000000000
          Link layer: Ethernet
  CA 'mlx5_1'
      CA type: MT4119
      Number of ports: 1
      Firmware version: 16.35.4030
      Hardware version: 0
      Node GUID: 0x1070fd0300a4487b
      System image GUID: 0x1070fd0300a4487a
      Port 1:
          State: Down
          Physical state: Disabled
          Rate: 25
          Base lid: 0
          LMC: 0
          SM lid: 0
          Capability mask: 0x00010000
          Port GUID: 0x0000000000000000
          Link layer: Ethernet
  CA 'mlx5_2' //VF
      CA type: MT4120
      Number of ports: 1
      Firmware version: 16.35.4030
      Hardware version: 0
      Node GUID: 0x0000000000000000
      System image GUID: 0x1070fd0300a4487a
      Port 1:
          State: Active
          Physical state: LinkUp
          Rate: 25
          Base lid: 0
          LMC: 0
          SM lid: 0
          Capability mask: 0x00010000
          Port GUID: 0x6ce7bffffee09382
          Link layer: Ethernet
  ```

  Three ports are visible: mlx5_0 (Up), mlx5_1 (Down), and mlx5_2 (Up). The VF we requested is mlx5_2, which is virtualized from the physical port mlx5_0; mlx5_2 is therefore a virtual interface derived from mlx5_0, while mlx5_1 is Down and unavailable. Inside the Pod, only the mlx5_2 VF should actually be used for communication. Likewise, the VF port used inside pod-vf02 is mlx5_2. Let's verify this with a traffic test.

- ib_write_bw traffic test

  Enter the pod-vf01 container and start the ib_write_bw listening service on the mlx5_2 (VF) port:

  ```shell
  root@pod-vf01:/workspace# ib_write_bw -d mlx5_2 -F

  ************************************
  * Waiting for client to connect... *
  ************************************

  ```

  Enter the pod-vf02 container and start an ib_write_bw client on its mlx5_2 (VF) port that connects to pod-vf01:

  ```shell
  root@pod-vf02:/workspace# ib_write_bw -d mlx5_2 10.20.12.121
  ---------------------------------------------------------------------------------------
                      RDMA_Write BW Test
  Dual-port       : OFF          Device         : mlx5_2
  Number of qps   : 1            Transport type : IB
  Connection type : RC           Using SRQ      : OFF
  PCIe relax order: ON
  ibv_wr* API     : ON
  TX depth        : 128
  CQ Moderation   : 1
  Mtu             : 1024[B]
  Link type       : Ethernet
  GID index       : 3
  Max inline data : 0[B]
  rdma_cm QPs     : OFF
  Data ex. method : Ethernet
  ---------------------------------------------------------------------------------------
  local address: LID 0000 QPN 0x03ad PSN 0x17d925 RKey 0x029300 VAddr 0x0073f17a0af000
  GID: 00:00:00:00:00:00:00:00:00:00:255:255:10:20:12:134
  remote address: LID 0000 QPN 0x00e1 PSN 0x146e34 RKey 0x021400 VAddr 0x007bc5c59c3000
  GID: 00:00:00:00:00:00:00:00:00:00:255:255:10:20:12:121
  ---------------------------------------------------------------------------------------
  #bytes #iterations BW peak[MB/sec] BW average[MB/sec] MsgRate[Mpps]
  Conflicting CPU frequency values detected: 800.000000 != 2000.000000. CPU Frequency is not max.
+ 65536 5000 2758.40 2758.38 0.044134 + --------------------------------------------------------------------------------------- + ``` + + Analysis of results: + + 1. Field interpretation + + bytes: The size of data transmitted each time is 65536 bytes. + + iterations: 5000 iterations are performed. + + BW peak[MB/sec] : The peak bandwidth is 2758.40 MB/s. + + BW average[MB/sec] : The average bandwidth is 2758.38 MB/s. + + MsgRate[Mpps] : The message rate (messages per second) is 0.044134 Mpps. + + 2. In the preceding result, ibv_wr* API:ON indicates that ibv_wr* API is used to perform RDMA operations. Transport type: IB: indicates InfiniBand. Note: The IB nic device of the RDMA protocol is used for network communication, which meets expectations. + +Next, we test GPU communication, that is, we used GPU collection communication library NCCL to carry out NCCL communication test on VF network ports of two cx5. + +#### The GPU uses an RDMA device for NCCL testing + +##### Install the nccl program and compile + +Enter the containers of pod-vf01 and pod-vf02 respectively, and execute the install nccl and compile commands, taking pod-vf01 as an example:: + +- Entry container + + ``` + kc exec -it pod-vf07 -n kubeflow -- bash + ``` + +- Enter directory /home + + ``` + cd /home/ + ``` + +- Download code + + ``` + git clone https://github.com/NVIDIA/nccl-tests.git + ``` + +- Enter directory /home/nccl-tests + + ``` + cd /home/nccl-tests + ``` + +- Compile + + ``` + make MPI=1 MPI_HOME=/usr/local/mpi + ``` + + After the make compilation is successful, perform the next mutual trust configuration. + +##### Set ssh trust between two containers + +###### Go to the pod-vf01 container and run the install openssh command + +- Entry container + + ```shell + kc exec -it pod-vf07 -n kubeflow -- bash + ``` + +- apt-update + + ``` + apt update + ``` + +- install openssh-server + + ``` + apt install vim openssh-server openssh-client -y + ``` + +- To generate a ras key, run the ssh-keygen -t rsa command and press Enter + + ``` + ssh-keygen -t rsa + ``` + +###### Repeat the above steps on pod-vf02 + +###### Write the contents of the file /root/.ssh/id_rsa.pub to /root/.ssh/authorized_keys in all containers: Copy by hand, one per line + +``` +ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABgQCVRX69XvcjVlF6a1wqxMMh4ZHDNSzEGwPm7qJdsCkO1JPUpCI+2h44NzRtKBFMf1kfw3d6fOqTh/mVhuhBFTmsQVHaGjj8tffkVzieSJ3RAQYFHKvv4ZPvcN3bsbiqbjE9Syq0JLDahZy1sfTygI0ax6p0uJVAVr03bKy31WVAVi2R6f2Hc6QB5tsHVOzIBK7hCehhNe0wfPW8q0vVK8y36DBLwZC92DLPn77x27c8zT87K2nIuDiVGGkKAu3Fkk6utYswPijlZIW6OjMY1Orx8400eo77wZSybCfZJc25Fr9C14l53db7BV4x1vOcy1teGh8OkOJXwtDo6okQpOJhpuG25FlIpFEgQJZPFkYHOFB+q783+o8vAFd7g3xouS2ARlNnqsO7jB8ZvMTaa89NyKlQKWI3ObVkqjqYvRXlZ/gDhRG2Z5QSV/eVhsY3Dx5IMVPobz4R3rV3/n5QIUXRnMebEAxdfM+VeX+0P11yjPOrYyti7D+p1rYB+3Yf5/0= +ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABgQCZRkemmpzBFIl8CQ3lb8uzzMs5H9f7Mo8eHm/IVYRR8FF6X1Gh+z8c88q1fdMgfa9vup2JbRywUeHS2LY9+I3Ln2MK6VB568LjRGJFaGK2vrEcBnaQgPKa9W1xXX+k+93CcAgjECw92nVVKCkfALLUyZEEqmw9Va5iV74cPM7le7VBQOfbOWfogweYuwE7FwRHrFDbueyc9GX1BvzOscSFn/V2YEuQzKOkZQHmcX+OAeV/TepZVKzYzt5mN0Q0P7UWmgn2CD+a4IFjQjXxbPw1zDP+wYmD6jIADks2GNHJu8huCK4IMJQzesMOWoch+2kkK80b0UvAQjTUMwMr2t6CPgOQafEygOr623clROYSSycTQ09ikt9g6SO31UZ4idNcoRcYqomDUs3+pceorer9adLHXM8MmRyRl6wEhCufJ4p4hYhwkL0rLCpBQ011NCP0hzoxUlQyVMnW13ztaKazX65ibunelGdpxJVeI++ldHDD6I3ZdhyP9Yiw767ka2k= +``` + +##### Two containers start sshd + +Execute the following commands inside each container + +``` +mkdir -p /var/run/sshd && /usr/sbin/sshd -p 20024 +``` + +##### Test without secret 
##### Test passwordless SSH

The IP address of pod-vf02 is 10.244.2.21. From inside pod-vf01, run `ssh root@10.244.2.21 -p 20024`:

```shell
root@pod-vf01:/home# ssh root@10.244.2.21 -p 20024
Welcome to Ubuntu 22.04.4 LTS (GNU/Linux 6.5.0-41-generic x86_64)

 * Documentation: https://help.ubuntu.com
 * Management: https://landscape.canonical.com
 * Support: https://ubuntu.com/pro

This system has been minimized by removing packages and content that are
not required on a system that users do not log into.

To restore this content, you can run the 'unminimize' command.
Last login: Fri Nov 22 06:51:03 2024 from 10.244.2.1
root@pod-vf02:~#
```

If you land directly in a shell inside the pod-vf02 container, the passwordless setup succeeded.

##### Two-node IB communication test

Scenario: 1 GPU + 1 RDMA VF per Pod, all_reduce communication between the two Pods with data sizes up to 2 GB:

```shell
mpirun --allow-run-as-root -H 10.244.1.10:1,10.244.2.21:1 -mca plm_rsh_args "-p 20024" -x NCCL_IB_DISABLE=0 -x NCCL_DEBUG=INFO -x NCCL_SOCKET_IFNAME=eth0 -x NCCL_IB_HCA==mlx5_2 -x UCX_NET_DEVICES=eth0 -x NCCL_NET_GDR_READ=1 ./build/all_reduce_perf -b 2M -e 2G -f 2 -g 1 -n 100 -w 5
```

Notes:

1. `-x NCCL_IB_HCA==mlx5_2`: the VF NIC device name (the extra `=` makes NCCL match the device name exactly rather than as a prefix);

2. `-H 10.244.1.10:1,10.244.2.21:1`: the IP addresses of the two containers, where `:1` is the number of GPUs (slots) used on each.

The command can be run inside either container; in this test it was executed inside pod-vf02:

```shell
root@pod-vf02:/home/nccl-tests# mpirun --allow-run-as-root -H 10.244.1.10:1,10.244.2.21:1 -mca plm_rsh_args "-p 20024" -x NCCL_IB_DISABLE=0 -x NCCL_DEBUG=INFO -x NCCL_SOCKET_IFNAME=eth0 -x NCCL_IB_HCA==mlx5_2 -x UCX_NET_DEVICES=eth0 -x NCCL_NET_GDR_READ=1 ./build/all_reduce_perf -b 2M -e 2G -f 2 -g 1 -n 100 -w 5
# nThread 1 nGpus 1 minBytes 2097152 maxBytes 2147483648 step: 2(factor) warmup iters: 5 iters: 100 agg iters: 1 validation: 1 graph: 0
...............
NCCL version 2.21.5+cuda12.4
pod-vf01:15718:15718 [0] NCCL INFO cudaDriverVersion 12040
pod-vf01:15718:15718 [0] NCCL INFO NCCL_SOCKET_IFNAME set by environment to eth0
pod-vf01:15718:15718 [0] NCCL INFO Bootstrap : Using eth0:10.244.1.10<0>
pod-vf02:12090:12099 [0] NCCL INFO Plugin Path : /opt/hpcx/nccl_rdma_sharp_plugin/lib/libnccl-net.so
pod-vf02:12090:12099 [0] NCCL INFO P2P plugin IBext_v8
pod-vf02:12090:12099 [0] NCCL INFO NCCL_SOCKET_IFNAME set by environment to eth0
pod-vf02:12090:12099 [0] NCCL INFO NET/IB : Using [0]mlx5_2:1/RoCE [RO]; OOB eth0:10.244.2.21<0>
pod-vf02:12090:12099 [0] NCCL INFO Using non-device net plugin version 0
pod-vf02:12090:12099 [0] NCCL INFO Using network IBext_v8
pod-vf01:15718:15726 [0] NCCL INFO Plugin Path : /opt/hpcx/nccl_rdma_sharp_plugin/lib/libnccl-net.so
pod-vf01:15718:15726 [0] NCCL INFO P2P plugin IBext_v8
pod-vf01:15718:15726 [0] NCCL INFO NCCL_SOCKET_IFNAME set by environment to eth0
pod-vf01:15718:15726 [0] NCCL INFO NET/IB : Using [0]mlx5_2:1/RoCE [RO]; OOB eth0:10.244.1.10<0>
pod-vf01:15718:15726 [0] NCCL INFO Using non-device net plugin version 0
pod-vf01:15718:15726 [0] NCCL INFO Using network IBext_v8
..............

pod-vf02:12090:12099 [0] NCCL INFO ncclCommInitRank comm 0x5e303a52bd70 rank 0 nranks 2 cudaDev 0 nvmlDev 0 busId 57000 commId 0xadcb40d61cc1bc4b - Init COMPLETE
#
# out-of-place in-place
# size count type redop root time algbw busbw #wrong time algbw busbw #wrong
# (B) (elements) (us) (GB/s) (GB/s) (us) (GB/s) (GB/s)
 2097152 524288 float sum -1 880.2 2.38 2.38 0 877.1 2.39 2.39 0
 4194304 1048576 float sum -1 1735.3 2.42 2.42 0 1737.9 2.41 2.41 0
 8388608 2097152 float sum -1 3444.5 2.44 2.44 0 3440.1 2.44 2.44 0
 16777216 4194304 float sum -1 6828.2 2.46 2.46 0 6857.6 2.45 2.45 0
 33554432 8388608 float sum -1 13405 2.50 2.50 0 13311 2.52 2.52 0
 67108864 16777216 float sum -1 25563 2.63 2.63 0 25467 2.64 2.64 0
 134217728 33554432 float sum -1 49333 2.72 2.72 0 49034 2.74 2.74 0
 268435456 67108864 float sum -1 96904 2.77 2.77 0 96606 2.78 2.78 0
 536870912 134217728 float sum -1 190709 2.82 2.82 0 190911 2.81 2.81 0
 1073741824 268435456 float sum -1 379615 2.83 2.83 0 380115 2.82 2.82 0
 2147483648 536870912 float sum -1 756857 2.84 2.84 0 757311 2.84 2.84 0
pod-vf01:15718:15718 [0] NCCL INFO comm 0x576eb5d4d740 rank 1 nranks 2 cudaDev 0 busId 2000 - Destroy COMPLETE
pod-vf02:12090:12090 [0] NCCL INFO comm 0x5e303a52bd70 rank 0 nranks 2 cudaDev 0 busId 57000 - Destroy COMPLETE
# Out of bounds values : 0 OK
# Avg bus bandwidth : 2.61937
```

The test results above show that NCCL ran successfully and that the GPU communication between the containers used the mlx5_2 device.

The following lines from the NCCL log confirm that the IB device mlx5_2 was used:

```
pod-vf02:12090:12099 [0] NCCL INFO NET/IB : Using [0]mlx5_2:1/RoCE [RO]; OOB eth0:10.244.2.21<0>
pod-vf01:15718:15726 [0] NCCL INFO NET/IB : Using [0]mlx5_2:1/RoCE [RO]; OOB eth0:10.244.1.10<0>
```

This proves that the Koordinator scheduling framework can jointly schedule GPU and RDMA devices, that the RDMA devices are successfully mounted into the containers, and that the GPU and RDMA allocations satisfy topology affinity, which can greatly improve GPU communication efficiency and, in turn, the training efficiency of large models.
\ No newline at end of file diff --git a/sidebars.js b/sidebars.js index d2bd04431..4efb84c35 100644 --- a/sidebars.js +++ b/sidebars.js @@ -105,6 +105,7 @@ const sidebars = { 'best-practices/fine-grained-cpu-orchestration', 'best-practices/colocation-of-hadoop-yarn', 'best-practices/network-qos-with-terwayqos', + 'best-practices/gpu-rdma-joint-allocation' ], }, ], diff --git a/static/img/rdma-nic-topo.png b/static/img/rdma-nic-topo.png new file mode 100644 index 0000000000000000000000000000000000000000..77f406b5af4a91fd5ab0b895e4a9ffcb13c8de8e GIT binary patch literal 27920 zcmeFY_g7O}6fSB(L8XWsr5kAi(u+VSi3*6Jib$^#ks?LF00BZk6r@8a0Vz?cbm<-B zNDtBpy$FN|p-4gtq1<@xz5l?yZ@lq-cw;0Zd#|j$$6DW-YtC<$74cMGi|H!Y)pO_0 zF+rYaJUe&pV(Phb=P?YIX)Vgn5aP6-^WM+29-S*2;8~;H&^bQTdwA|#MclRH7nf-F zSKd4^_da)yo%QeceBWI%_}n?yJc!1_=YBSub5~~Xn_Ev(4iHoa8GfqZmmk+(Mq~?K z*NC|=5F9G5$Mdly`!c=2$5=M$N24<{9_8ZgUmxaj9#2B07jI?iDpqpno{Onu>bS&q z{>rB=ktW%Q$&*&^puMp+%E7Pab439}5OJGI$tc(=EwY5Ij};mbY>f{<-GUI*|NW=h zH}>F1=t`lGT2Ah39e-YdSAI^AIMKD@M&IsO#Mj)q6nS4@c;{UQ;JCH+(HbhnsyoA> zGG^iFNXy?IC-xAq?MQc%uUkL02X)(KgqPE>xhWSTv4)sy`l(i&E1b0Zduy(~J7u>X zWzXG%l|t|CZ&`DbK^3`LZ?|K4sm~#879&M&7?%OiVm%dy@;XIk3&ys=zPaxk%$JK3 z8&Nh3Ew7ioldj;I`&Q7}{@#!PYk=UU!9FmB9E`HDy0J<45N*bQMLdUiwq7LoPw&G9 z<)&uI*7F3T0bgR)=xXXH;j=2nY9zzIX%j?mV>lwLV`1#ugO$_iqnx3IV)oLyTq}`I z;M=y(c6sy1urTVq=^?U?@(c^1zETi0q3t0O!RtR(VwvKt(IC|jD~f%IY^};@vT z1(}a-b;-5t0dElJ|6zJw0%e?eFjqpp?kpk&o%s`jQi8j>tOw!WKPB*u41qY*=cmGO zq$ld#vKa1rY+WqLz3yZ55b8uDq1Q>ka_#FrkfGI{ik3AQsGcKcIJ)R$Y=;fp{BIJ| zgzw6GK8i;?R!)xa;&oW;++df~6SCS6gV zmuoe?+hj;v7ms0^pTL1l?`qBT#<5z0x2~k)^g7SRPvGEKe%7{^T~}Z!!UmZI_d=xl z>!*jpbT_>_6u!;R)-A$T@|Z__at7Ng^OH0|js!$J8C=WqI(-8pdQ#VD^USBBABeaq zA?P!Zp?$*?V1enqDQlXQW0zT(aSvmMw&)&nSY>V?mW^^mg90rO~2%Ou~Q(N$KkAqvldkOgrT-ULf0R40qeT;yI zda=!_e>%C5CTPSM2gXui=@@udY!i?`gdG^{E`;H3I+z?t;5Lh$_=xiR{RUjh4@~vI zB{)&6u~KTbAz_hAWmx~dZ`yFrn{Uh!4sUMW`;#Sa!jj)ViYoWPUq(I=LvlJiy1mZU zZgw%scr?6!jvn~H{RLcC7|*tqDhl`b$%fa@vRIOXK?jva6xKs?LqlrcLj(2G8;Y!z ziZ(CLV0}hc-S4Gna>+Vmr8E$lz6z)lR5bp4xBmCr7SJuS0TF9*>XPJW?P!Ykm^&8e zOa5m@`nj)Hu zrehJ005GL44+t4%1wL#Q>RVAek#;QxJyGVV9VXZTcUkQlj0T$26*j$t-TE;v%wh3{ z%RDJ{Arr+%qBK>qc^lCG!C+YUCH@DlcZlk9AMQ@*r0Q9|EBk1`Vk-4M)AFD zIa0*Y62ZK$24?82=m4EIpt}ADsM=wePJ;C}06`^=k+1`ny{JZtL6fiTZ|0LfyeHBn zGp56j_(|pEBHz?G7a~u|Ax zKdyO}@xTR2{B>>>W}WhxF&x+JM;iTCo~!OUe#S!UTTS6IMV)*I_6T4vgK{}J5a9(eP6TMu*BOP6Y2>?+)0^QzlirLG+GH6#Njr*qkE}* zQ9HS}R&9?0>s&G=ya$~_UD-U+L)kyCIYu{voGl?MrIqIXA-1SjekgiuC>)e}JTr zC)wGZ{%N;j;MNo}f#IqEI8cEE>2r9O`Z$Q(xE|JgYDK3cBXLjehm-V^(!^)hv93C= zdCgZ6$7FretlVB6J+*!yo+C+=*-10$w$sf#YX)0ptD4zLre{nd)n~d-qh#%Ryhv1B zkfp;G=wo2){m&I{V{1Yyl||%q!*GGK+#?Q*%iPhOwZ)GfBoqx9<8~*aP*BTXJyf5;+$3i~LTg(~s(b81KOCX)h z#zcOqrE1F>5tcUCU2oFx6aw{6a`@;~)E(~hC!Bd#ywB6g{LWyKUU7+~iG&;%^~*87 zGOd#7quX7*pM-cT7K_QURDJt40-O6R{B8{w_qrh{jgTbu)YX0*^l>jH+1QmZhLFAs zP=Ugj5)Pj5bQgn+EGt1zrrWYJBNgl|5wZSi<{-zP&TZs|S2P_utw^#Zur-{;WA6;8 zls<^UYrx0cPEm;1>+mnlL^m_qzJq1gl*JGB8ubGUN>RjzI$8Ss<%SBb24i|7Pt~Sr z@Q|dR)0=Dxl-vsku5Mtgf)Xs;9!S)RjwmB;{=BXd`D-i;bb;ANKP#^97X76k%~jBD zuQ%2A+Lh`1llpi^mlp?^D3i)LA+e$0zOWs2{-?b+eSW`B99HU9Wj$m}^BNkgaY<{) zAa=#O^_}O!S_n;4nJg<$6!~9=WoXz~+HBGA|`o_Vp!ZimJMW15JA;8}HpQC^)UzSKm+} zB^PX)&;;ep*S2pifERr`(w&V6=BGA0=F`D4JHS5qi7$mTH2(R(6C+<(_Q$;JY3bDg z0>4tu3kJ$(z6ua7llxaRul^;Eyz6kzYe)8oimchg*g|fqF};BO?J&Ns1T7mDp;bPn zWG%asJ!OUa4r!j9iT_YzJuAv{PNS1qND(fDp~@4zDORoOF|y0htglqYoL^k)92%iu z6Z!%(8YUjlJ%L*9aS%shoKcG07;|(4o3OJn!GZffJT}K_sXWzcJHQ*{p{^l%r_!j} zIEC_L`<4WB#Xbnf?slJ8)Lo-7qho%@;~+jYTs2I@mefQ1@(=blmJJ?vqS4y3e4|GS zD+o~cvX$89C2IwFop4rl0849K7FO1v>^HNK?X~ZkH)$}fKPXd~N{nrR;a)c!jUk%4$3ccoTyq8=cB@Sf 
zC|5MkzM7!)i4I)xEQ<(wh#hi#0$rzMtKE&~8tEIH^r?QfQq|QUL61()Sa30-@~o83 zBbyzO-L-L<7_@4E!(M90&RL0&RUlQ6e>MN~|FgW-@-ThGrnq;B$F2lZ+mbw*=2A#4qPb77m=48d2_PxVo zt~%B(J7!z%y+ddVbb<1-)Jk(3V!q!6jj#|JwZJ9FBV{aj;Y8@<4fm3Z1I=5B{rPp- zd#Jw6i{RkvO5=+7>FoY&o0tE21}$^Hz5jR=Dos?~KKjQ)T=ZM(LCQbp&z&iM0Rgn8 zSDKF3e* zxcXT=NhiaDq`*{DBObQ~1gGx35F#-IZIx|2ELr}_CN zJ!Io|MOa|wWnt#vFBj=_g*7|-XB3`Sczl2^K*%>K~jQ zW_TGY9GU{t?gxA}j!tqDXFzFM( zf}hEQp6dL72K`rX&I%^@!uZV1lyz_2!ZX9JQvalb_6%=|GSzW>Cy|Vz78c_YX)SI; z8iu{=7u8Ic7%Jm^U%qy%60pQX4OS~3HW*?t=0f+nC(qx2_tHPfh&FSyE}U%rO*bK2 z{dJ`@Oc5XtGw&jp&1dY@&`bu`t8Xl+9wkzv}TE;Yg?Bpj6OP0laRXyCv`azRac>b+Q9Iy#Nu`J z%t1YxxcmrFi1hjFUQ8o)L#NWQwdL+jWO$uBdw0nU!E7A2daBMWo?T+$_G*4h8SGh> zIz#xlQ%aso%^UPhIzI}5!STlvgj5;_{&;LanN0#|&4)G4ueyCL(44B?e+X*B?WH(Z zFc}S$8`h;FpY*G%6_QbL<YQ+axNpJqfo4_6h`;gn-GYV~jn#;s^nzw*z z;eAIIlQ^8RY<_2BerjcZb4>JrL--YQ?N*4+}Zg&BOhAFB^GPd907#eu}Y*#&i@?ymGRLc1{S zwwr4WOlKdszAp>;A`1aBa+)cZcxRU#nqBry(S1?VC@BOqPOUHA-pIdEPNI+=e? zCOp58kM9_8zmHlJOb{~@#0^`V-Ak9>x67>w__Y7-G zL#u`Q0!y0t0lA|2;)%yqF%`2@th#Gu5OJ9?|V5)BgSt&)>e$;B<2VlM~u!TKY1#h zG0?9p9i*;l)xIM4*A(4ZRu_D@Tly%$&Pr{;tj`{UTMq94Igf`#lOGd#el!%!`=)1& zhIQ5TuX#&*xO(VA_oVi%PdIBck840^vb+iU(MFZYTyGfZU&6ZbKlTCSO2s+wXHa!4 zfTW0X#EhCw)%}fjz0dS=ha9gPpTdeBLT9LVq>v@<0x6l~46JKpMd0RRMEqMN4NK{<{1TV+i8b%B&AAK(a1`Ek znvgmgbCjF51mEsbgEx6pt$rjov-ln%kev%LRx#^P}};)|boteSX^ZGx-Yg^8fBMR2=O_ zqDabK7ByvN^z3j8r zv`4sl&_#-4KHtg5_cZ~qu0;(Y9gSY<0K8ESYo0h1Af2mA4tkIz151NjL_Ai+eB)FM z!xJw1IGvjLf+C(O+8<9kro{~=l;5gVBYlr|kd=(@Zno}EEC?6sb1XbZQ>;45vnj4r zUmf?e9vv}>WU1*Mrzq#D96d)0@S9!cRFGIz;nK1E{dhBWhwFq<6UqxIDo@YJ&KGaD zl61GU8N!1xyh)}V5Xh3GWpis37zE|a?NRS?v{x>-EVR*YX{k^%B5H~k8@oeiN+dpi z_SWRDS<&Vu6SOO4RH7q${S`MqkJx?VEoFvx8Q;b!zO4~{r_1ca4BDuZO^0sWDINYM zjVun~R0rb2vYtns=(;4oupx|viIn?ptn-o@pR5$5peH`H-m{fyF%A+}@s-VXuaTBO z8@ff1B1oyvq`o)jnlyZC@lFw1d(-2ln>(FRi>&KtR0R4adelOVPje8t85WyCNt*6C z2hcRV`+LW9N3Z*}&fLnZZs#7&$`>%Al?rh7X$cmD?&fA(1I)P9p8E=qWskt9d3RB( zr`3kj_^4G%u?A;VhO2(d1Mvc5cb+=$lsu+4_&fKUd7k9ny2nYIDLdxNvJRfV-sOJq z%s=Pnib7$@U!zMy<)bP&)626Q+mfevp!CiXT*0E`SHmAW2EP0Ws`a(V zi!mLy!}>s@5}HW$e`njO8OP-KRvWc9X(1E=oJvYC!+3 z>=iDDsDI^N5#_2h`Jq_A*B4`c4gc@alZFcSWao^Mu09d{TMf(tB-P!iNfz%uh5d!0 zMg>s-NJSW25&fl+qg!TAf8KW*OBQGQ)!Gt#ggT_IopAyUZH+VxhH0 zmXCf_31qXTJ!4f`wAUg7ZHyq6?FMgoEDKL{c^FEsg5@Q;jJ|jaOelYej_6Aj+g0|W z80x=pN$YgXAx%VtNlGb7OZH_an|ZX9DcMbFi7O6*s~5}$;?3rV&RYwBhlh;r7O&qX zkM_tRgRIe7D^T%rzExfe$FV4-!j9%r{~I(GbvC5xNpg`;#KbQtW!gK<)09JB?JF8$ z%VM<^pjIJX(2NUJ=Ixf?IOfaX^=0C|my<W4S14gW{|nQNGC4$F5bi*p-xye!9ZQxm?HdtDCpjtp&0>C4I{KEe??$b zsHWrXIwyZyslTj(1mjOOL^IBD$|OI*f*Qnt5g0e&T`;Fb*%C$2q7Ko<4vBRY$&M1A zB_~C1hFnFn$THI@he4U-F%hteScWh!hZpeS-z(azH=~^QlxbwhBIcpa^De6|6Rg-m zl|Cb^I3cr3C7;|NBA1eQz+Ij*n<|WPvTxBXX6YB1IJ7DeL$?>vb|zjgjos60sUyr7 z7AN~TB#ttrUt2owm=Nb5U>Ua0zSJi}VN7Tb(*V?O zIY&l?#Y(Ne={Ewfp_tKbH7C(=_#p*N%YUVDy@aLXUg?422DN@l;SAY{+BK4-53l}V zH2t&|jMF)V^HNq3uMxDddaz)l*OJUzhg5eP%#cZ{$2Y{ZZ5cJW@2IxofN>xC6nh*3`+{i_;>Va8%xKvD7_?jo|TE+aVh(iM-IV6c?<;&&BK%D85Yemk~^nGf^-^uEQY8sP zQ!!*uzE-iojQ)qe1u-xQooSRA=c=7n0ZK8#5FeGG_vmo!L}8B_tFR2MM;QT6)Zw>P zz<61HJjh0^bl$fDL&zAF$@{DES7ovA7irP2$JfkovmmXcOSF*2@Rv1jJz#6XL^_cv zKgmn>G4Ber1R-u06hIu7xVlBC@Yze zIAKNGQeBF;A3h%#V!LtN^f!c~;;}D!h8}(*_9i8P4KC_sK^iJ-8n(xm_ByLsZ0? 
z`MNF-0j~|4@@?}Ge4q4>DsOi<7UUk0i5a5>eS6`1wa-ezJE#`B7uQd&V}8Q;X`TUT z{@&n!LsD=6S0JG?7NQ4O?-P%oI+4oyOEAPn%y`tTRp_Zn!-od!SeTdQWJgk_%KOM< zu^BVgDv+4YyNINB&qX` zn_(G4dTOP2_e!57CVI7JM$YHokF*%k*Za)xlpyiAmUpS0m0caGE3oxbm!FFG;LIuj z6b>bKou9KHjJcT$k%pDmG+gf6eUAMzg}W)QaccHB=5eGmbnni(PV4oGqG)e=dj?8b zoy^2f*F@44@AXUA$iQzKd(>^n8&eR05w2z~Eq}hPkf)Ggs7gqbOIz{QA zWYjdWHn7C4Y=+u1I_dayP|*SKb1JT4uxoX0B?Dx<0?+imw+ua#u`QLPy3T>Ri)3SE z1!_oi>#ycP&$fzJ26vR>Jo1d$wP;M_n^bC>v3tng(l z(_2Z1JXI4yROS{WSFb^wBdwSflXe|dq*SP-(H4{Fk0k6GCix%x*}D&^-8%l%2lf`& zYg9hr9E_wWA?}|Zv2E`hUeoriVHQh!0HP6~zr%fUT|G{}tvuJC8*DX_3hOJ_{}`mS z8g%j|S!Gup54__WK{o{=Cz76H!YMq2pm)i96gea1St#k!$`?@k zvo7VC43-_OzZuX!9tS^CX3GPR;Mr^G8Kb`H83#bndUL{3eJ_QDxDQ-1<199kS`VMq znDUuR=O-Y?eWn20V0 zoGsq@VtAO33d6DTk}zxG>Go={Hx17hKdDs$aIH0f)y;^|WzUmuABt`LjA*cQYmNNS zzACdNG}kmo{*Cg=TK_8$Y2c0^j-t4XH$d~J9$ues`f_#ayPEE*$A*fU?wg-YLL7}M zN;7uXuW8gc(jwn6hZ$m}DCV$6iJ#69wcvMI6$)!RLc`lVA*^v1Tm1<53J$O?QbvV{R z*9pw)7s@m9%1`9BK2@sXnq1O|ACGHO4AcRm%!@YT_QPrm-2r5Z+<;*}+oqX&74V>|dGEs}&9KKd^X?H` zj0VxeHg!jIz%#}PR)YAmx!5WY4qF{K&p#;BI)2dSkckH3XXOaT9s5RM>ARi$nDh+a z6y<_iTCxX^>D&b(nDb?5l@>k?psK7djeUnbTK(y9G0hoz8fWPp3V2xXL4}lkmP~P< z$u-L&{yU2L&)@p+i*j`|wny)5>{>j3-^lut$^12MZsV{OJH4TJVsm0aMYVno8f}oh z^D4m1Mt2z@L7(6nkfhEmzwTwHyllQt%UL!5@h_}&H80xIn<|58o`&f^E>$+j;aHwj zEPX7K(o|ISkkVcG#8rTJ$B>?Mx(Xu(Xy_!f%zT>AYBjqn3P9!>?}a&Z$_nyI(BcQx zdBgidDMR}s^Z0Al4HA*eVr1N=S9PK;cGJs!IUF~9wNc?=Iv4hpT1njAW2z1ax8`vt zg?Hv3>>wf%*YM6OdTm9ZkB?9z87?x4BRwj+R?+Vk31i9}El91jCPtEY?uEl>DaqxR zeGNbifnv469!9YmRJNp8kq!s;qB^Kg;KxZ&={~19sFs|BG3y1DviTjsE_@~zy@*Ql zc<(E#=a=un9%oq+KQDQVIE|4q$3&eoVcm|F_^yYb{T~-~u}lvIQ%3uAl7OS(D{Kl8 zcn|C|m`9&asTfJI)+2mfmC2^~i&jRzsrBAvtR1%{^;5?wmx{BvIVb_Gq(Ibgbm;T^ zF7xx)NdGF06X#6_4c*lY_+*4Q?I-dmmv=S9y?Bg#>BMU^{RhTaQ?&ve|IjPrS(eSX z?q-iD@$DIEi1l*H49(99SPyiqL=<-@=VugziNp0vH7R9P0Fmx5AT6sqi#%$V$b?oq z{S6P2Vs)8NoJTwwMg@l)4IGO*E?@^W0WY7&AC7nHwSnXDcoMN-PwH`J`%mqgc|7Ip68J=JVbr*d~+UJ4nQ5)Q=L81L(t6>XIoy)x2N z%dGS$*rNY7TbJB%VBGqNl?^ZZBgt#VgtD^PnSppBFz8!w+iYZnsylL8TN>1D)@I)a zbkpVr(BFBkXLW`kTnJr+a1(S<*N{H8Z#moZU_?0o?aENk%n-nBwngkP(RNk8y1QIfuGSve1 zXVcvq*UAragq#{1^3N( z>i||1REz|j&+fV5ga1UGho(G}3m$U=+(;eRM7FLIDZ$8$ZzL}Q@JMc<9Jt3i@rsFmr2Yrb{ zkIh{X=_cB~g$g<&o4g@bf;rosGw%Hoh}_)zp_8Lb!3(hgB*?#-|!`*6L^NXYYP<*a{|>Z%=8*Uy;n_j#w!(wuu=_f;3`KnHpC zA(o^Obn9*um^w9R^VDYKtOt>G_8`1C7c7p5S4=rjOj_P)+tx+ph7yj+gKgN}L}B>Y znFrcuvspIsw0pBDLeqJB-ukj>a_yGwX7p56DSaxtjo&YHN(|z~)DUtk?q8zQBsN}* zf|=$Nf`pGIo7zw&KnD$juo#5%HOx1;+SeP%+~~H(!}1dm$Tc9XOqblZ^VkkPktx$X zZ}H5k_h2bDSG~L%7ANX z6)1h{129%@mETu=tRs9VjIBS5nLy!;jyMOUPGZ;(Jiu15Vz)Js#%DID*y?Li&3`=h z4N^eGfIXGG%;VS*{ni3x+sDp}{Uh}YZHz%hFEaptMB!VqC!gH21=ntAr&BUUY)(1K zbpm>q?^UM-irS7a;?4XW-hS$<*d4@r?w-uiK}IC3{MYgfq&8PRqLC3G9u=7Iy z$akcX^DIS3LH2Ofa^VQK5cKN<`_Ye^DEW3xcI&qbiNXdWLuV~0ck#{N>e}jOvF_Qq z`LD|wR!smWzqJ-@Y)EI~;5pmLQ>0J6Hp--7cSHx7%eWK~N5)bQKebQax+1Lu4Z_^8 z8u@-!gDR8Yqz<^h6LhW|Ny!p%Ta!vc$4RY$Q<%x>I`NbW+PkHm9&wjTgchXiVszfX z=l=?V%3qjX=1Jsj51)wjjUeY&j2FrB$FyEu`^_CJhY9VS{a7%kj*vgs+| zF9dFAE3aN^-{&Sf)9J}=iI`^g?OYq+?NbxZW`Dw5g_w51@9EPTLFbE(gIRfPJ553e z%{8ReXpGg5X`13#Jk#dHsd`M*k!smp`un_k z-u6!6jh8_IXDQqEplqZX_O7qA^CAUua*%X6Xp_MdHD)r6E5HbE5p8Ps{PsR`}iXRP=PRb4+8 z(Yi<KcP-PL(-}^Q(G_ za~c1rt}~Ki z^>$eEOi1!g_lFkXD|#BdZQh)!0fbDZW}sKHgDB-N0ooJ%w-aNoA&mM@m{%K4{?t*w2s$6Ub(4xKdb^z~ zu-Fm7W{r@_%sBXOva2h-iQ5il(~E7xF{yU$!Bh81dEVaY2l}Y$*V@ac!>0~b8;P;+ zzHaZU2|A7FXVI}eSaf|1M(W;dVFGz)8rhp69pjCLWB%&{kPv*LJI{1<|-7w zC*@lE9PO@wrDPTh;MPZciAMk;i2t>K6r@Q!>vlvGVuW>g{NUMHOT7dnz|Q>)>J<#{ zylFaHDxQHf^2*n!L^jKrpsFoLp!blQv<)lVlrX|)sw^L*riI!!nK%w{yk^(Hw9hIa8dVf`w0L4iycyV?B;3Ec2m*+Wx)~Wx9zn5Ufp^T 
zEc$ma7VppJ|D|fqtd~>&w&!Yc{Qn*PKZ`@h$~kg=y1Vjx+kcX^cFm9e{{hhoz}pD5 zywIc-AvD~6Izraz!@^Ar!9NPJ$cTrCkM)JgDDbC?la z`>x`>+P`y}C+G~2`#xqvA)#*9@FR>N-q*jj)aA0HlhEWo+Np&b?Z2+$V3i%a9}%JN__1T3N+uM;D)rZ0hINK>58q6YNI_4oa|BfjZk=n*5!_-#F^rYn6Q$2K z_#Iw;PxQh>^n*B+^uwbd4&0pn?eI%+`#0{yy*QC!390p!p}n+Pm{I!v`#T@BhxDA9Aintf_;avw`b>1hO7}i)LW3^xa-vlDrewL8@QlnMJHH-|@Y4MI2K9O_0QW z+xkXa&S~I*dn2qau?fw}0Ima+kycSXXTsWzDl|ie%AI=^TzKmO&uwlz>(ApH*;(Ud z8N&+(Zxpu+wC%YTkR?dHS*F=R^#k`qu#Wf@L)Am-)pphu&!g&(;MJ@5AQ&+W-g6Aapu`1@S7$o{G^bFIDxq&QBlgsasjyc*upJ$~@jxpJx-?>2baV4*W5v z5yW}_Z0R*!#jJ4#>5Q*nF_sVh0s-#$_`Vt82J!Swz9YRgq|fT>{BRx$q)l+y{dd68 z7aE>#?{L@^J^p?%=v=`EJJdN*`WJM@$NXXpL-k_a@!!{*w2L}@xwb8D_%aEIEHm>w z6hC`KH;czEP^{?TznJ+CYsPubg{is^SU7Kg(yd~t+kGM1^2`W8{7$JI%kd3yUyb%Q26Otj&ekeH(hcA1 z=wo;O^9pA9uNtlDx@>;JfZBxj}c3SQxg`*G>5j3*}oFV*r+L2%RmY3#!DA< zB}5n61-|AOTCm4(-X3EiEV|!`0r*Px()&s(_UgnuzJC^{z4e^=QbKWct=zi-zmD1u zw*94}v)iuPsJBTgw!XGkL+1B{c2;hM0N68ufKht51B5QD_(H|PR(Q?fux8dl9_KQ` zq4h&w;K`eYl}sIdLHVM^?kJO!xIWk)0iDn10m`DD-Szs!%MB<&-b<8;If~x}-;$e#rzJ)J|w<>M?Ge{fOiL_KB>nC@vLr9-T-J zp09fQhEw>)?S@ihvj|NcQiLq@wHMiG(WjBsrHQ$PF&4$Wizo3x1}EeEM}UQ9awSnk zqb;)a%6TsK;$>f+Qy7HKg)Rpf$9m}@H+dM~=MUM_d`Y9)bGkvFWtwckxl;0kB{_I* zd8pr`;+yYD#-bO78N9-S>eRK8&gaW=MH??5a(^v*MhL^UeXa{yH<2rwio0_v8QTvT zvLA$E%X4qgd4-pK_$+^W^JABHvT}W{p%p3ChS#dhf5j(pQ*hzfR{zhoao2^h?~st< zNR(iqM5{l^^5l)t@ZjNIkQugps^KL5K~hutu2A;irJ%#OrVq!Ui}g*tyS_>CV;Hy& zxWwaX$Jyy>>~EeO44s)2c)eb`U;WNTPGw`-`Ft0t&G6FiVF#w$&8qvAK@7J~K6`$7 z7>T_2Z~Q@Y9Mby6Ql#l?UU?L23F)wNOeMZt(%G_I$dr{phkijD-V@>9FVh(^x`RZ5 zon3vc)}rw=A&r0IzCyfD(Q;rw`MKP)Em_T$8uiou%3SvSSVj$lTwVxnX#==No^}o8 zK* zbni0QDE>?YtpR7jf$hLc;oh!vX_PY$Z~dPnQy)_K6x&Bxu#MR>W_$QgFN^l|yIz<( zNZJ8Y2ZZZI^0&+t_?9ncTN41+wgo_B3nx84I}%5{KO*Y*y##lpQ-A%FBr_we*n{Ky zYhJjnwHkb%lYXbg)xyBf2>`pQs{ddTvL0qmn1wE>!eO1R*c!B)##8WZJF4H02?`JPcEL_#(N9C$jX_BBb_%k>` z%ANqIj)T&rKzl>0+Z*mg1V@hSR)^;=x9xUIlW`0)$jNy*k z>>?E80+p^*r^qB7hGe-$&{jv|>7b*MVTY{Q?YLDx{8O4_?S%T!f$0=8wn+^M-izKk zOOM=IfAm~3z92WmVrOyMr+>ox@1edxgtyaarzuE1SnD>!+bbgqoukji(ShEk3Q#Jk zA>EA+u$S4xHyhI5=7f=wTxb!j{jf)B(LZ8lEn%zgzVtfixyV9q^mc)lqUrR<^Z6!U zhO^JRUXqr7FXJZi)IP9ZZtP8LDKzq!BI+EQaVF}#vGQ%rhaU|GeR+a<`?QmVzH9rz zm=ZzrbCTzI#S7|NH@es!&CZojg=T*Weia-iycj4IPyZb4;6}Q(MOUDD+Bc{ECInI9 zAH9K?Z>#A%kO&em%*16ZOR2kO8M(yMS6X}mD>Q`tMP|f zt4S(vfp(ei7E#;oD3;q(LiJXp*W#*P*f?rPuw;;{C&P@=9=qrAgc}8!{wv|+sYTR( zUfTYRhn%gIfZ9L2;H~Scws1V`c#rs!1QRJb56tJ;fpbHqv<0>rdhF zrO-D2t3oH&CaSYw%3)-(`{<^9oHcPfXi}chN z5~tvmiKL{^I@1I7*0`SKuM6WXGZdRwCmu>mwkaKFu_de`kMA7loV>ZYbpzLqZ;UIg zwK9LkEJ<#&=DYzX$SmNjb+_CfpEH-*#-wNr^_DNhUium$8wGGexxf0KE7j2^j}6fa zicY?B_mr?7b;QTexv_RqA7NJhi zB6H=~|C%Jf^m0HH>*R3|M3q@5SpB;C_3q{L9X}`LR)&vWcIY!0r1hHCpI+++)i~A# zOvVwmgf4Ff`hd<=HO1m=CYHwRXiG~y;`3vyujzl{s`Xj?P*EaDXh!2(@7o*jXxq8G zX3O(3zVFf=fBs1*mlDN2TsnJM*?*`nLj6EVIesX>`zG;XzxMzzybreFHyjf$xLG8u zwgw^WO63XUyj5vD)%vBm^CRY5{DTj{AGURFG;0oN*>Qkfe^}SQ_O+Qmc5;=to$SCl z>wInYk|2j`X=u~?1%3))CW0aKBbwA%zYyX9i zp$VKb3D7Q#{dJ$}(0fKMJSlvQGzw64@q(VPBX`&Vo3`_Nw1`OEmMtx?(Q6vmNsF2w zDGkNttw#pbiIFcp`x{fgkr^{AFhAC>*T2mc<*0PcM8=wLuQbUcb{xAObo$ zEwof0soccI3>rYZ&j7DpbE-$!eR~W{(5jVKg{Di1JG)%h%MvYajy3~eWxlEMYuj<2 z#2t?RGNfU{-tt>>R{!kGXMU-=Medw}o-fMkh^5)IW{J*`jYbT?V1!IU-K~X3G z5gOV|&K(!_p@SgIpXLpp5ju}((&d2EgZ%~APlQwVd}F;~5f5;2zR!YqGl4N?YaLqP z4stqB>>Hsc9nlq9CZjCKIv7y^D!Hi8qzwS{T>iL|S)oAZl3LW@KM?U)qul|vpQ^*4 zP@X;=ZK^yitcNJgeVw!2S0FK%lp?c~#@Kch<}@M8C4+-F>C!BEf<(ri>p>>CTA+X&1R=24! 
zk!ZHW=E{%Q8H_5316*QYcY;WoAYl>`wO!%d`zlxIZ#|DS9j8zx8N+?RjX+n%v=HZ+ z*>covC%M{I-+aG9ZUXShWAqO*9UCwJ;R%NkZtA<0ny-xZKwqT)NJ z!t(@}mBi1qom=ricT-8zc^=E=Cfn<)+mhoq`WO9g^bT#F_FrieMvwqQ7z%aD928&5 z4h}JQ?q-GY6Q)mI;1;+OPe?mC`K@TnUCZ+BFk?^eyN4QS^|>$prxMiPfpi`uNULc?HBqx6Qr7D#8`cJYAe4nZ=JGJ~4Y|YluyaB`HTI z5nGL>FxFO98S%FjGJ%1ILW9GZIljS#_Bv}n_`&SnEcu0{@J9U+!XhkDfg=ESr=-e` zZ|x$U+18H1{fPNo%gFNJ-lOP>?cP@-}z@payZ>+XOxu)_Ucu(r(nY;1p%G7gR6*Lq( z$mATnwuoKJ8O5aBj5Ouip)2x=iB(QcnglrR z=9d)G3?v2F6T0S?W2t$aKn2p?!0nN~oWEllort5RlO#}x^dLa~#1nfE2oVQo)eXb* z=Gkn_E~aixzxhh$);-OisHtn>rKbMO-aMcB+>EpH{3#@|FhKsKoNW@?gk1|8J>6YjUJFl_E!XPyFmNDXE zMFF=Vaik_1BZwsqhS_R3!RrvLfQijgS!0Ei2Hd1P2U(3-dqzx~xEnhO1tug{4-)Q^ zus?m5DKPkX`W3xVd452)zU1QCinx?9@d?q&$x(Ej`whW*OU%~I%~j*TjHNrvRx?#= z9s-;aB~`E&T)))@-a!N8>gV?rsVHz1%2@#nx=loALjSbtch(?!Z;ka>!F4_drR~^? zu-%<7?-=I+?j}4uK9;LSN1a6&hV0#WxfsAcIm{uIfpXE{d3);anVtz z`r!F)ZcW|0Tv(7aQXFO|QoXe~SZ((;?>#qG_1*Dxxd}`=jpX5f&lnQ>yIkLFqoe-chfst<+y%{DQn%gx8Cay}k5 zja&SPuuB<4*b<}HzgX8V1RmZje_!othxDi|dN|X_!Zv1r4rqOCSXF3TZOvi-%*lkU z&iurtR#VWORpZ5HHihI{;)2zpCg^Fc_gZo$075^*#Av~lY%|lzaSv%9TMWf-si#c~ zH3GpAd6jG$7^%#7+kpXAQcSiD_=BsgkuvmmxKG`M> z|MI#n^0ofrpy~IL%snO88MNuv%28u%(VEU&WZP?AhvCWBx;<4(ad4yM%-X|vsI5^R z-C|@0wN#t|-u1f!;9Pae^GWU5eO(@>T_luFEp9Aj#KJR>cWzhs0JpnbfyE4X zT=egMJFSbDG1~R{KVbXV;S+#?5aEVfI~i#5mY<1?vDlD0N03|Jt|}!I7tPI|dHB_( zn()HZ?pM_fbXpmmr>kYEhc6w}5aJm|@LW7x-3sWzc0aPm@?Rf;2Jv(`{`Jy7nf}9U zy4TVUy}vNeAmM)2NT!FpvE3k$ZK{{#_8X1@IEyS(8nSEff57}r6Bv)=cOEAJ>jIq{ z%!Fv=dM$eT>>j9hhnv{&(1)Q@cS(K9is2o-Ru!= z+UL>f-dT1tifdr(H#ZR_i4@8=5<$GnP1B%`<=-2XHa4=YAn*Ek>nbjJ-*e=j6NQ>7 zrE>Cux_8DbT++3Q*4M^^JdA^NJM0Bxt7QiYe%Yqw;GieSWq%JzpN4rFcWIqSZi8xv zPy%*M?A$f+luzYlPg@g(8SDv5p00c1CbS!>Nm!t_j0+$d zMy`;jOJ3Zh{X#Kq=4e-?u0G}KCGKT`{phjJl3+{vOi4m;az$;uvgyJR+|l6Jizn`r zy_hR2@0qVC?&8J#AcwQ@Q!m}+kiW|!RkF50!U+rRC=XRX`bQ5V-b&S$9z3TnRWB2aCt zeICW31)a35aBM{s^aWH2Dy-&rJf~ReZMehbyJnFk?R7;7kpni6-VVa$GC}m^YK({R zc(?FFDWGnM^m^rtI&_TOTBD7KGEy$Ml6|d@@`-ymd{Wb_>3fHB1%DU5MlsE{z-Bfx zxV^&4*6?=vh3tTrmnF})59WKLWab$@pMmH+&K&(qb|oDz|%<{B+$=ESai{Kl}W>yILRVO{m=21ozUNF2Jsf$3{SzG4hFtWf62cR&Pb=0@}W z{VLJyULnbfFu9Ov=--xAYjc2s6K~sIKqy~vQWK&+e$)Ft(SvL@@r8_A4QT3u5mN-i zBePdsV8gz5QjUvn$YIOsq&54)VVL$+6^RFA$ED;?27f`yPiHEdc3U;5kZYgdHG z@Rx1J!ojSzvagS9-KmLQ`k^2yVI}*nv$uPONX}^8sacT`31iL94^jQI$Ozj2Gh|?x zOgA8OwU;^-e>aDu2a5BuWPwLds5gYP#lvgjDXBpPXs6r6OR`aQGIb5Pe#-G)+-FBq z3r#v1jO`@Z3y8%Ru(cMd$#zdS-ghU#)GZ2ve=)Pf@&7AL7(bRj6ubCaw%(88fr+Eg z02uulJiY#NE+wTkUm>CQ#U#Z!Wz)H#fplg4a@OW}3`VNoe|hGTDS$;|!5G*kwHsebGz=dxG%3kn`a?ReeW^MY1XVr-$xCo#-v}{?Pn;{41RH`)d)9spqDT!KQ zi{7+>`tn?$Q4cDyoFVx1AR3oAUhg%2%TE1k!+Vg4{=*yLnl5rcTLH~a-!_%zy=0+- zNSD`DF{HN&Pl}C9ft37P&-uyuh}do3F|Yd(+WQTAaIAagd&$f{--bbH26fGrPcJo~ zW5fPR_81GKc&N`r@#ZLIoPN$f|E^K(KCSBZxmt(z+5Lol#eYu|2QAH?rbtI+(Y zQJkP~2xu9IyC#3;X@3p>53{kyLJM(~pCS=`LZL0zK%>E^?P*`_V?{CY%DP9kR)Vf( z!5d_h_au$`fUc2#?w6#Y1Af0aFG_Pbh699AZyOPjr*%Gn2He;=>p-y(|B`srNzI7g zI|A>s?q{c~_r$gf@uI`B==qE1G>gl0WJYLnXHBUeXvaIC((&i>|HR&js_Q|yY!9Zj zUie`Mop`+!8O4x)TBHq9Z{6}PsU9nqnfT;4rDDGJ{v)?V^@V;@^5#nFrB&ROY^#|M z+e4Nnfz{8izA|`o`|<^P&!$267{-9T)-Rhh-Zg@e;{`a#ODf($$gdu-f7e$hdZ@-4-iJN%bae6_YCzA9&oceuCHAJx@k({t%nRcnFyS6vX9Cokz zRfXftk7GC$LeIu-bGY%DEy`)+x-Yi3)nZcU=Tx1Yl;d{}Fp}1_(oqBh@lwVS z+-3aLF`wGd)&zYBzOLgkKE9=$ z=!X5&Vix0!dSCU}W=AjtWc|X~IFIl&9iwSS3VS@BqSv>6BNGvrWz@$SDF2oY1~s~N zSDoDkua})7w^nH8*hcO&xwwA_xSecq<0ATkuJEFb>tOGP<@pz6=EoM(Xu)_w$q3Wc zMD~+b_=W7HM8btcq@n;&NPajI#ahO+h1Fza85I}M#A`A!` zW%>*44~xFq_e!M~Sr)crv?1*)87(O$Jqt62vY#yL$6eUBjmol5G2=V4jwgi|KCPTB z7JBIuzwQ+Xb4=?~S^x(pU+hy+E51j#a52eUJ`KMnFOJ^o>gxtZ*bQr?^w>Kqu%HY~ 
z8bYr3mME3!phq6!u5@47xRD96#~mg8*~%>@bswOM1ZY<;zez`34AK_Lf_WXkq8<}& zbKPNx#O{W*8x@rGA9$^vPewAwd|X9yB3oql;Y{#pG#R#;ADem}=8O@qQt=#a);e*rCyynn%&x|M-dn#m%4_3*ZhCa4jj z4Px5cUp;(iCvfz6)7n6{_}g|Eu-iRoxCb~Aej%;YX5cpIXVBk@jKq6mWgdOKVW3pi z0dvuXMLshVAfM-PV`CUDyxzZ|Nt+sH^)nxTlos79QEJSj z@lbjXR=Cx!z&qODbkyu)40pkJ?5^O)JdNr zOzy8GL)+_F;(w_~8*;;nN(63^zVXJDMZH?6ayf|8i#%u0VBOD5N7ePu`kJN~;)lw4 zay%r&q3)>x5UEnDF%O%xIDS^ZT<%#f+n@ua5~C5rChi_oQ%E*w!!(vH;_<(YWB;G)_oPkFtmzho2S%Mb_B&*-~dquR^V zPMn59Wu09__;c-gflNHwrC*OENlsS$#5pB}8%rW9-$%x!msT#;j{aD|pfAv&NWyuJOCTi$n-QrHx9$JqOl{@^4 z+YPfZR&0EZIQe7?62umZ(cE&hrg9U_htrxG^Z(@hq##_ILR7L5$~jt;LSQd9vE*Eb zuxW4`k{cgxQ4`1Cd@3jJ)3f>_bDI7^W8?=b9qn$!-4jdN% z^jK_IN>npI7*-{4_eCxNAhkW9v`>EbvfC}|VCG!ziY6-yIQis(Vx8%6rAqNrsn`S1 zf@CE1$PPwD?+iBqdG&~L|r9fx^Kc|R*+S~VZ?g>_2usCNIZ^Nj4me_3|(FD@4e4I6CGj_j~aLsv-9WQcIVD9;4ka{BqF*fP-cr-HJsf!>3P2}lXX z^(mZYEPO4Ftm2n#=5I%lhSz+oR4Pkkf0PKU+Fu9y>t!vhWlipUl~yBR2>E{9SBxs_ z1pwzer+l6Th8Wql#v|T+MEuioUBbRk*MG;1wzvU68JB{&;`7 zVgeNC6FC?c%yaV&Z(DTu>6{qC!H3j68S4yuTJ6nHOYa!q(nRw+kgkdLPET{UH6d58j53F(6fIT=zd$3t{sG-ACrm7&=D74x@7r2ZDAV^opTUtku zoBCdszvXNTb}!#VRjGG>jMVHP(*(z#fWpBDu>ks+t}0LncSO4^nT9wCc8f782KGP2 zZeb+#8C~P2x-S8$7RZrfy>XrL8Q+#hXa4~ja--s6tCjRaD3S*Hr&U z01|-anPdPzAs`f*^_`;Trd3U88KfPiGZV9>t0K4TPzYn9UdY9532aMwpIecEQMBIg zfRK-hp}R7Oc^5S~01W!vW)qL(iTOY_u$7pZT2%3lkulcQsXUXDO?*1`^Pg`3V~5Tk z4sofl*E_ngnpwLPFm`K02gw_{YT>ygLtX zY01e~K5(&jBEK#C-@5J>B#yuR_fK1k!^OO~w{u$F}fRgh}o#iKWdPDZA}kF(mq!0?xYn|37~8Y1S5$} zhAmp4BKpeI>!t-bL*Nq(pgzq{YojMlw*sU^U`=c`xI9i`RCxs=`x|H zIem`Sk~;4bhpu0tU|Hi4{#`iOLI}PbgNj;xouj>xNzpCNs8{~{G_}jD! zjgA?TWv2FFUh^@Js7}jlV*U9o$#j{H@4Z>H zKBdj4{K|7_p%dYhOZ4`JIYaQmDn4>%T`;spA3{z$$dj{CQ4lp%%#i14TBpXTrY^+>AyNi091nGFGqSTi{6?j?8KfmDzP=hr0r zmv4vob#}WS`?@-BC?h?G0Odnsp>?I)r9+`&;WNTu1QU;Sb=fI%HqC-jRAHOg$*2Ua zn8=OeIa{qt@Hc7Y9~}Oe$ucaFW%Z5Qg^eQu(DC5G;1(m@Re&%1GNkUUU+b`2Yk8>_@t-zXvw-`YP`@?uYgP%Z*bX-t#1Y9wFkhirM=Z-TN)}A*uUx*bMZ8cod zlxt*-(tGX%Y>t94+V|(KVIrXioyQPy{&UtIB*D{qCnt4^j*|+)I`k!o0@71b zpQL)O3D!)oXVpVn4)*+rkyv}T)30Rc-;q`KdB};!@Tq14A|0XPd6IS~U@(pT^!gj^ z$(pxnGGL5n7y@fQO4O<*bh+k*UW>W;bX!Y?`CnySHFty zjbN+voB=|nvu%x*@reVrSg$JU#_j&`WFrpPjUM^7!s!RO+(KeyEAzo||o z9Jgr%l3g7Wn|f5`sq@hO9Q@wkNOeG{4cQEY4DRn6yOpib8S7DzY`ydRrHnCE6_6$V zN&}@9H{x%o7%X3(N&)TNA4=WYD?yuWlYZ|m!QH}zw(OR0kE`$$pI)V_9brUYRe|65 z1W&|Zk&|4`(rY+>8bctaPyW`w8!F~Ti;HygPetmGdfm>Wzw z`MUP0n^U&jtAx9Y>a^_g#E8k|Z0%o%>ST?->~0YQ$55SB%YYQnRhvf?&?A2p4xJ0T zf3bo8(5&oPGWXhS(|I0co)2LT7nig52(OoZ-YtQ3oH+QPRvNc$y8)z&N#z$(dy>uu zE3b0i4%X$v>K80*2b$eWy~RTLNO4HH*ISS@=5Hr>#d)5qGurL$&7T_mMvifD3_ye9 zTiv~7agL^oYjDkJ1(G&zH(I6`zq?~rh25=c&r*FV!=nL@P6buPBNhppN2pr_MG_t`^v{FfH7om$UlX0OW*`t%tZ4)fB(Zs hS2&;1+TiZf&Nr0$c&FO}_!YA~#)f7FCD&Y@{x1ZeeNg}a literal 0 HcmV?d00001