This repository has been archived by the owner on Jun 6, 2024. It is now read-only.
Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
[Azure RDMA] Merge Azure RDMA change into master branch (#2091)
* [Azure-RDMA] cluster configuration preparation for az-rdma (#2004)
* [Restserver] Add azure RDMA configuration depending on the environment passed in (#2010)
* Add necessary rdma enviroment in azure to restserver's yarn container startup script.
* [Doc] update job tutorial doc about minFailedTaskCount and minSucceededTaskCount (#2009)
* [Restserver] Append ip hostanme pairs into /etc/hosts in rdma workload. (#2038)
* [Restserver] Add az-rdma switch in user's job.json (#2024)
* [Paictl] SSH & SFTP-Copy tool for admin to maintain cluster. (#2058)
* [Azure RDMA] [Job Example] An example of intel mpi benchmark based on azure rdma (#2089)
* [Azure RDMA] [Doc] Update tutorial of rdma for admin to enable it in OpenPAI. (#2090)
* Optimize machine list variable in sftp_copy and ssh
* Use for...else... to remove flag.
* "===" to replace "=="
* Move the logic of machine list to paictl.
Showing 24 changed files with 805 additions and 3 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,93 @@ | ||
# Copyright (c) Microsoft Corporation | ||
# All rights reserved. | ||
# | ||
# MIT License | ||
# | ||
# Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated | ||
# documentation files (the "Software"), to deal in the Software without restriction, including without limitation | ||
# the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and | ||
# to permit persons to whom the Software is furnished to do so, subject to the following conditions: | ||
# The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software. | ||
# | ||
# THE SOFTWARE IS PROVIDED *AS IS*, WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING | ||
# BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND | ||
# NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, | ||
# DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, | ||
# OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE. | ||
|
||
from ..clusterObjectModel import cluster_object_model | ||
from ..k8sPaiLibrary.maintainlib import common | ||
|
||
import sys | ||
import logging | ||
import logging.config | ||
|
||
|
||
class OpenPaiSftpCopy: | ||
|
||
    def __init__(self, filename, source, dest, machine_list, filter): | ||
        self.filename = filename | ||
        self.source = source | ||
        self.dest = dest | ||
        self.origin_machine_list = machine_list | ||
        self.filter_rule = filter | ||
        self.machine_list = {} | ||
|
||
        self.logger = logging.getLogger(__name__) | ||
|
||
|
||
    def construct_machine_list(self): | ||
        rule_list = [] | ||
        self.logger.info("=============================================") | ||
        self.logger.info("================ Filter Rule ================") | ||
        self.logger.info("=============================================") | ||
        if self.filter_rule is not None: | ||
            for rule in self.filter_rule: | ||
                kv = rule.split("=", 1) | ||
                rule_list.append({"key":kv[0], "value":kv[1]}) | ||
                self.logger.info("key = {0}, value = {1}".format(kv[0], kv[1])) | ||
        else: | ||
            self.logger.info("No filter rule.") | ||
        self.logger.info("\n") | ||
        self.logger.info("\n") | ||
|
||
        self.logger.info("=============================================") | ||
        self.logger.info("======= Machine List After filtered =========") | ||
        self.logger.info("=============================================") | ||
        for hostname in self.origin_machine_list: | ||
            host = self.origin_machine_list[hostname] | ||
            for rule in rule_list: | ||
                if rule["key"] not in host: | ||
                    break | ||
                if host[rule["key"]] != rule["value"]: | ||
                    break | ||
            else: | ||
                self.machine_list[hostname] = host | ||
                self.logger.info("Machine Host Name: {0}, Machine Ip Address: {1}".format(hostname, host["hostip"])) | ||
        self.logger.info("\n") | ||
        self.logger.info("\n") | ||
|
||
        count_input = 0 | ||
        while True: | ||
            user_input = raw_input("Do you want to continue this operation? (Y/N) ") | ||
            if user_input == "N": | ||
                sys.exit(1) | ||
            elif user_input == "Y": | ||
                break | ||
            else: | ||
                print(" Please type Y or N.") | ||
            count_input = count_input + 1 | ||
            if count_input == 3: | ||
                self.logger.warning("3 Times......... Sorry, we will force stopping your operation.") | ||
                sys.exit(1) | ||
|
||
    def run(self): | ||
|
||
        self.construct_machine_list() | ||
|
||
        for hostname in self.machine_list: | ||
            host = self.machine_list[hostname] | ||
            if common.sftp_paramiko(self.source, self.dest, self.filename, host) == False: | ||
                self.logger.error("[ Failed ]: Task on the machine [ hostname: {0}, ip-address: {1} ]".format(hostname, host["hostip"])) | ||
            else: | ||
                self.logger.info("[ Successful ]: Task on the machine [ hostname: {0}, ip-address: {1} ]".format(hostname, host["hostip"])) |
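The filtering done in `construct_machine_list` can be read in isolation as the minimal sketch below. It is written for Python 3 and is not part of this commit; the function names `parse_filter_rules` and `select_machines` and the sample machine data are hypothetical, chosen only to illustrate the `for ... else` matching that the commit message mentions.

```python
def parse_filter_rules(filter_rules):
    """Turn ["key=value", ...] into a list of (key, value) pairs."""
    rules = []
    for rule in filter_rules or []:
        # Split on the first "=" only, so a value may itself contain "=".
        key, sep, value = rule.partition("=")
        if not sep:
            raise ValueError("filter rule must look like key=value: {0!r}".format(rule))
        rules.append((key, value))
    return rules


def select_machines(machine_list, filter_rules=None):
    """Keep only the hosts whose fields match every rule."""
    rules = parse_filter_rules(filter_rules)
    selected = {}
    for hostname, host in machine_list.items():
        for key, value in rules:
            if host.get(key) != value:
                break  # one rule failed, so this host is skipped
        else:
            # The inner loop ended without break: every rule matched.
            selected[hostname] = host
    return selected


# Hypothetical machine list, shaped like the dict the class receives.
machines = {
    "worker-a": {"hostip": "10.0.0.1", "rdma": "true"},
    "worker-b": {"hostip": "10.0.0.2"},
}
print(sorted(select_machines(machines, ["rdma=true"])))
```

With no rules at all, the inner loop body never runs, the `else` branch fires for every host, and the whole machine list is selected, which matches the "No filter rule." path in the class.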
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,94 @@ | ||
# Copyright (c) Microsoft Corporation | ||
# All rights reserved. | ||
# | ||
# MIT License | ||
# | ||
# Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated | ||
# documentation files (the "Software"), to deal in the Software without restriction, including without limitation | ||
# the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and | ||
# to permit persons to whom the Software is furnished to do so, subject to the following conditions: | ||
# The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software. | ||
# | ||
# THE SOFTWARE IS PROVIDED *AS IS*, WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING | ||
# BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND | ||
# NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, | ||
# DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, | ||
# OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE. | ||
|
||
from ..clusterObjectModel import cluster_object_model | ||
from ..k8sPaiLibrary.maintainlib import common | ||
import sys | ||
import logging | ||
import logging.config | ||
|
||
|
||
class OpenPaiSSH: | ||
|
||
    def __init__(self, command, machine_list, filter): | ||
        self.cmd = command | ||
        self.origin_machine_list = machine_list | ||
        self.filter_rule = filter | ||
        self.machine_list = {} | ||
|
||
        self.logger = logging.getLogger(__name__) | ||
|
||
|
||
    def construct_machine_list(self): | ||
        rule_list = [] | ||
        self.logger.info("=============================================") | ||
        self.logger.info("================ Filter Rule ================") | ||
        self.logger.info("=============================================") | ||
        if self.filter_rule is not None: | ||
            for rule in self.filter_rule: | ||
                kv = rule.split("=", 1) | ||
                rule_list.append({"key":kv[0], "value":kv[1]}) | ||
                self.logger.info("key = {0}, value = {1}".format(kv[0], kv[1])) | ||
        else: | ||
            self.logger.info("No filter rule.") | ||
        self.logger.info("\n") | ||
        self.logger.info("\n") | ||
|
||
        self.logger.info("=============================================") | ||
        self.logger.info("======= Machine List After filtered =========") | ||
        self.logger.info("=============================================") | ||
        for hostname in self.origin_machine_list: | ||
            host = self.origin_machine_list[hostname] | ||
            for rule in rule_list: | ||
                if rule["key"] not in host: | ||
                    break | ||
                if host[rule["key"]] != rule["value"]: | ||
                    break | ||
            else: | ||
                self.machine_list[hostname] = host | ||
                self.logger.info("Machine Host Name: {0}, Machine Ip Address: {1}".format(hostname, host["hostip"])) | ||
        self.logger.info("\n") | ||
        self.logger.info("\n") | ||
|
||
        count_input = 0 | ||
        while True: | ||
            user_input = raw_input("Do you want to continue this operation? (Y/N) ") | ||
            if user_input == "N": | ||
                sys.exit(1) | ||
            elif user_input == "Y": | ||
                break | ||
            else: | ||
                print(" Please type Y or N.") | ||
            count_input = count_input + 1 | ||
            if count_input == 3: | ||
                self.logger.warning("3 Times......... Sorry, we will force stopping your operation.") | ||
                sys.exit(1) | ||
|
||
|
||
    def run(self): | ||
|
||
        self.construct_machine_list() | ||
|
||
        for hostname in self.machine_list: | ||
            host = self.machine_list[hostname] | ||
            if common.ssh_shell_with_password_input_paramiko(host, self.cmd) == False: | ||
                self.logger.error("[ Failed ]: Task on the machine [ hostname: {0}, ip-address: {1} ]".format(hostname, host["hostip"])) | ||
            else: | ||
                self.logger.info("[ Successful ]: Task on the machine [ hostname: {0}, ip-address: {1} ]".format(hostname, host["hostip"])) | ||
|
||
|
||
|
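Both classes repeat the same Y/N confirmation loop. As a minimal sketch, assuming Python 3 (the committed code uses Python 2's `raw_input`), the loop could be factored into one helper that returns a boolean instead of calling `sys.exit`. The name `confirm` and its `read_line` parameter are hypothetical and do not appear in this commit; `read_line` exists only so the prompt can be driven without a terminal.

```python
def confirm(prompt, read_line=input, max_attempts=3):
    """Return True on "Y", False on "N" or after max_attempts invalid answers."""
    for _ in range(max_attempts):
        answer = read_line(prompt)
        if answer == "Y":
            return True
        if answer == "N":
            return False
        # Anything else (including lowercase "y") counts as an invalid attempt.
        print(" Please type Y or N.")
    return False


# Drive the prompt with scripted answers instead of real keyboard input.
answers = iter(["maybe", "y", "Y"])
print(confirm("Do you want to continue this operation? (Y/N) ",
              read_line=lambda _prompt: next(answers)))
```

A caller would then decide what to do with a `False` result, for example exiting with status 1 as the committed code does.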
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,102 @@ | ||
<!-- | ||
Copyright (c) Microsoft Corporation | ||
All rights reserved. | ||
MIT License | ||
Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated | ||
documentation files (the "Software"), to deal in the Software without restriction, including without limitation | ||
the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and | ||
to permit persons to whom the Software is furnished to do so, subject to the following conditions: | ||
The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software. | ||
THE SOFTWARE IS PROVIDED *AS IS*, WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING | ||
BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND | ||
NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, | ||
DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, | ||
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE. | ||
--> | ||
|
||
|
||
### Enable the capability of RDMA for your VM in Azure | ||
|
||
#### Knowledge <a name="knowledge"></a> | ||
The RDMA-capable instances (Important): https://docs.microsoft.com/en-us/azure/virtual-machines/linux/sizes-hpc#rdma-capable-instances | ||
|
||
The cluster configuration options to enable RDMA (Important): https://docs.microsoft.com/en-us/azure/virtual-machines/linux/sizes-hpc#cluster-configuration-options | ||
|
||
The network topology considerations (Important): https://docs.microsoft.com/en-us/azure/virtual-machines/linux/sizes-hpc#network-topology-considerations | ||
|
||
#### Steps | ||
|
||
###### ``` 1. Mark the RDMA-capable machines with a label``` | ||
|
||
|
||
Based on [the knowledge section](#knowledge), mark each RDMA-capable machine with a specific label in [cluster-configuration.yaml](../../../../examples/cluster-configuration/cluster-configuration.yaml). You can choose any label you like. | ||
|
||
This tutorial uses the following label. | ||
|
||
```YAML | ||
machine-list: | ||
  - hostname: example-hosts | ||
    hostip: x.x.x.x | ||
    machine-type: example | ||
    k8s-role: worker | ||
    pai-worker: "true" | ||
    # The label of RDMA-capable machines in this example | ||
    rdma: "true" | ||
``` | ||
###### ``` 2. Copy the Azure RDMA enable script to the target path ``` | ||
|
||
```bash | ||
cd pai/ | ||
sudo ./paictl.py utility sftp-copy -p /path/to/cluster/config -n Azure-RDMA-enable.sh -s src/azure-rdma -d /tmp -f rdma=true | ||
``` | ||
|
||
###### ``` 3. Enable Azure RDMA with the script ``` | ||
|
||
```bash | ||
cd pai/ | ||
sudo ./paictl.py utility ssh -p /path/to/cluster/config -f rdma=true -c "sudo /bin/bash /tmp/Azure-RDMA-enable.sh" | ||
``` | ||
|
||
|
||
###### ``` 4. Restart all your RDMA-capable machines in the Azure portal``` | ||
|
||
Please ask your cluster owner to reboot the RDMA machines after completing the previous steps. | ||
|
||
###### ``` 5. Turn on the az-rdma switch, whose default value is false``` | ||
|
||
In [services-configuration.yaml](../../../../examples/cluster-configuration/services-configuration.yaml), uncomment the configuration field ```cluster.common.az-rdma``` and set its value to ```"true"```. | ||
|
||
|
||
For example, modify it as follows. | ||
```YAML | ||
cluster: | ||
# | ||
  common: | ||
#    clusterid: pai | ||
# | ||
#    # HDFS, zookeeper data path on your cluster machine. | ||
#    data-path: "/datastorage" | ||
# | ||
#    # Enable QoS feature or not. Default value is "true" | ||
#    qos-switch: "true" | ||
# | ||
#    # If your cluster is created by Azure and the machine is rdma enabled. | ||
#    # Set this configuration as "true", the rdma environment will be set into your container. | ||
    az-rdma: "true" | ||
``` | ||
|
||
|
||
###### Note | ||
- To enable the Azure RDMA feature in your cluster, make sure that all the worker machines in your cluster are Azure RDMA capable. | ||
- TODO: YARN should only schedule RDMA jobs to Azure RDMA-capable machines. | ||
- After enabling the Azure RDMA feature in your cluster, restart restserver every time you add a machine to or remove a machine from the cluster, so that it refreshes its machine list. | ||
- TODO: Make restserver able to update the machine list through configmap in a loop. |
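The first note can be checked before turning on `az-rdma`. The sketch below is an assumption-laden illustration, not part of this commit: it assumes the `machine-list` section of cluster-configuration.yaml has already been loaded into a list of dicts (the YAML loading is left out), that workers are marked with `pai-worker: "true"` as in the example from step 1, and that the label is `rdma: "true"`. The function name `workers_missing_label` and the sample hosts are hypothetical.

```python
def workers_missing_label(machine_list, label="rdma", expected="true"):
    """Return hostnames of worker machines that do not carry the expected label."""
    return [
        machine["hostname"]
        for machine in machine_list
        if machine.get("pai-worker") == "true" and machine.get(label) != expected
    ]


# Hypothetical machine-list, shaped like the YAML example in step 1.
machine_list = [
    {"hostname": "worker-a", "hostip": "10.0.0.1", "pai-worker": "true", "rdma": "true"},
    {"hostname": "worker-b", "hostip": "10.0.0.2", "pai-worker": "true"},
    {"hostname": "master-a", "hostip": "10.0.0.3", "k8s-role": "master"},
]
print(workers_missing_label(machine_list))
```

An empty result means every worker carries the label; any hostname returned is a worker that would need to be made RDMA capable (or removed) before enabling the feature.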
36 changes: 36 additions & 0 deletions
examples/azure-rdma-inte-mpi-benchmark-with-horovod-image/DOCKER.md
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,36 @@ | ||
## Basic environment | ||
|
||
PAI runs all jobs in Docker containers, so you first need the following: | ||
|
||
1: Install Docker CE | ||
|
||
2: Get a Docker Hub account to store your image. | ||
|
||
# AzureRDMA && IntelMpi on PAI docker env | ||
|
||
## Contents | ||
|
||
|
||
We need to build an AzureRDMA&IntelMPI image to run the Intel MPI benchmark workload on OpenPAI. This can be done with the following steps: | ||
|
||
|
||
- Get a license for your Intel MPI, then modify ```ACTIVATION_TYPE``` in [silent.cfg](./silent.cfg). | ||
|
||
- Write an AzureRDMA&IntelMPI Dockerfile and save it to `Dockerfile.example.horovod-intelmpi-az-rdma`: | ||
|
||
- You can refer to this [Dockerfile](./Dockerfile.example.horovod-intelmpi-az-rdma). | ||
- If your Intel MPI is activated by a license file, copy the file into the Docker image when building it. | ||
- Keep the image in a private registry, because the license is built into the image. | ||
|
||
- Build the Docker image from `Dockerfile.example.horovod-intelmpi-az-rdma`: | ||
|
||
```bash | ||
$ sudo docker build -f Dockerfile.example.horovod-intelmpi-az-rdma -t USER/pai.example.horovod-intelmpi-az-rdma . | ||
``` | ||
|
||
- Push the Docker image to a Docker registry: | ||
|
||
```bash | ||
$ sudo docker push USER/pai.example.horovod-intelmpi-az-rdma | ||
``` | ||
Note: Replace USER with your registered Docker Hub username. You will be required to log in before pushing the Docker image. |