From d77a99c9966f8944817dff02950b20f7c7f30bb3 Mon Sep 17 00:00:00 2001 From: "Shinai Yang (FA TALENT)" Date: Tue, 25 Dec 2018 11:17:44 +0800 Subject: [PATCH 01/41] fix remote bug --- src/nni_manager/rest_server/restValidationSchemas.ts | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/src/nni_manager/rest_server/restValidationSchemas.ts b/src/nni_manager/rest_server/restValidationSchemas.ts index b1dc5293a7..bfb1ff24d2 100644 --- a/src/nni_manager/rest_server/restValidationSchemas.ts +++ b/src/nni_manager/rest_server/restValidationSchemas.ts @@ -28,7 +28,7 @@ export namespace ValidationSchemas { username: joi.string().required(), ip: joi.string().ip().required(), port: joi.number().min(1).max(65535).required(), - passwd: joi.string().required(), + passwd: joi.string(), sshKeyPath: joi.string(), passphrase: joi.string() })), From 7cb03f99d86efbc90ddae5809dfb0f412e666417 Mon Sep 17 00:00:00 2001 From: "Shinai Yang (FA TALENT)" Date: Thu, 27 Dec 2018 17:17:34 +0800 Subject: [PATCH 02/41] add document --- README.md | 11 +++-- docs/FrameworkControllerMode.md | 84 +++++++++++++++++++++++++++++++++ docs/KubeflowMode.md | 2 +- 3 files changed, 91 insertions(+), 6 deletions(-) create mode 100644 docs/FrameworkControllerMode.md diff --git a/README.md b/README.md index 4d57b720a0..0972751e95 100644 --- a/README.md +++ b/README.md @@ -76,11 +76,12 @@ You can use these commands to get more information about the experiment commands description 1. nnictl experiment show show the information of experiments 2. nnictl trial ls list all of trial jobs -3. nnictl log stderr show stderr log content -4. nnictl log stdout show stdout log content -5. nnictl stop stop an experiment -6. nnictl trial kill kill a trial job by id -7. nnictl --help get help information about nnictl +3. nnictl top monitor the status of running experiments +4. nnictl log stderr show stderr log content +5. nnictl log stdout show stdout log content +6. nnictl stop stop an experiment +7. 
nnictl trial kill kill a trial job by id +8. nnictl --help get help information about nnictl ----------------------------------------------------------------------- ``` diff --git a/docs/FrameworkControllerMode.md b/docs/FrameworkControllerMode.md new file mode 100644 index 0000000000..00bf25b267 --- /dev/null +++ b/docs/FrameworkControllerMode.md @@ -0,0 +1,84 @@ +**Run an Experiment on FrameworkController** +=== +NNI supports running experiment using [FrameworkController](https://github.com/Microsoft/frameworkcontroller), called frameworkcontroller mode. FrameworkController is built to orchestrate all kinds of applications on Kubernetes, and you have to set a kubernetes cluster before using frameworkcontroller. + +## Set up Kubernetes Service and kubeconfig +FrameworkController has same prerequisites as kubeflow mode except that you don't need to install kubeflow. Please refer the [document](./KubeflowMode.md) to set up your kubernetes cluster and other prerequisites for nni. + +## Set up FrameworkController +Follow the [guideline](https://github.com/Microsoft/frameworkcontroller/tree/master/example/run) to set up the frameworkcontroller in the kubernetes cluster, nni support frameworkcontroller by the statefulset mode. + +## Design +Please refer the design of [kubeflow training service](./KubeflowMode.md), frameworkcontroller training service pipeline is similar with kubeflow training service. 
+ +## Example + +The frameworkcontroller config file format is: +``` +authorName: default +experimentName: example_mnist +trialConcurrency: 1 +maxExecDuration: 10h +maxTrialNum: 100 +#choice: local, remote, pai, kubeflow, frameworkcontroller +trainingServicePlatform: frameworkcontroller +searchSpacePath: ~/nni/examples/trials/mnist/search_space.json +#choice: true, false +useAnnotation: false +tuner: + #choice: TPE, Random, Anneal, Evolution + builtinTunerName: TPE + classArgs: + #choice: maximize, minimize + optimize_mode: maximize +#assessor: +# builtinAssessorName: Medianstop +# classArgs: +# optimize_mode: maximize +# gpuNum: 0 +trial: + codeDir: ~/nni/examples/trials/mnist + taskRoles: + - name: worker + taskNum: 1 + command: python3 mnist.py + gpuNum: 1 + cpuNum: 1 + memoryMB: 8192 + image: msranni/nni:latest + frameworkAttemptCompletionPolicy: + minFailedTaskCount: 1 + minSucceededTaskCount: 1 +frameworkcontrollerConfig: + storage: nfs + nfs: + server: {your_nfs_server} + path: {your_nfs_server_exported_path} +``` +If you use Azure Kubernetes Service, you should set `kubeflowConfig` in your config yaml file as follows: +``` +frameworkcontrollerConfig: + storage: azureStorage + keyVault: + vaultName: {your_vault_name} + name: {your_secert_name} + azureStorage: + accountName: {your_storage_account_name} + azureShare: {your_azure_share_name} +``` +Note: You should explicitly set `trainingServicePlatform: frameworkcontroller` in nni config yaml file if you want to start experiment in kubeflow mode. + +The trial's config format for nni frameworkcontroller mode is a simple version of frameworkcontroller's offical config, you could refer the [tensorflow example](https://github.com/Microsoft/frameworkcontroller/blob/master/example/framework/scenario/tensorflow/cpu/tensorflowdistributedtrainingwithcpu.yaml) for deeply understanding. 
+Trial configuration in frameworkcontroller mode have the following configuration keys: +* taskRoles: you could set multiple task roles in config file, and each task role is a basic unit to process in kubernetes cluster. + * name: the name of task role specified, like "worker", "ps", "master". + * taskNum: the replica number of the task role. + * command: the users' command to be used in the container. + * gpuNum: the number of gpu device used in container. + * cpuNum: the number of cpu device used in container. + * memoryMB: the memory limitaion to be specified in container. + * image: the docker image used to create pod and run the program. + * frameworkAttemptCompletionPolicy: the policy to run framework, please refer the [user-manual](https://github.com/Microsoft/frameworkcontroller/blob/master/doc/user-manual.md) to get the specific information. + +## How to run example +After you prepare a config file, you could run your experiment by nnictl. The way to start an experiment on frameworkcontroller is similar to kubeflow, please the [document](./KubeflowMode.md) for more information. \ No newline at end of file diff --git a/docs/KubeflowMode.md b/docs/KubeflowMode.md index 2c4721b971..c312c1e3b7 100644 --- a/docs/KubeflowMode.md +++ b/docs/KubeflowMode.md @@ -100,7 +100,7 @@ Trial configuration in kubeflow mode have the following configuration keys: * gpuNum * image * Required key. In kubeflow mode, your trial program will be scheduled by Kubernetes to run in [Pod](https://kubernetes.io/docs/concepts/workloads/pods/pod/). This key is used to specify the Docker image used to create the pod where your trail program will run. - * We already build a docker image [nnimsra/nni](https://hub.docker.com/r/msranni/nni/) on [Docker Hub](https://hub.docker.com/). It contains NNI python packages, Node modules and javascript artifact files required to start experiment, and all of NNI dependencies. 
The docker file used to build this image can be found at [here](../deployment/Dockerfile.build.base). You can either use this image directly in your config file, or build your own image based on it. + * We already build a docker image [msranni/nni](https://hub.docker.com/r/msranni/nni/) on [Docker Hub](https://hub.docker.com/). It contains NNI python packages, Node modules and javascript artifact files required to start experiment, and all of NNI dependencies. The docker file used to build this image can be found at [here](../deployment/Dockerfile.build.base). You can either use this image directly in your config file, or build your own image based on it. * ps (optional). This config section is used to configure tensorflow parameter server role. Once complete to fill nni experiment config file and save (for example, save as exp_kubeflow.yaml), then run the following command From 44d156550e97e27bf365569810668933d4234023 Mon Sep 17 00:00:00 2001 From: "Shinai Yang (FA TALENT)" Date: Thu, 27 Dec 2018 17:21:06 +0800 Subject: [PATCH 03/41] add document --- docs/FrameworkControllerMode.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/FrameworkControllerMode.md b/docs/FrameworkControllerMode.md index 00bf25b267..a977e20e20 100644 --- a/docs/FrameworkControllerMode.md +++ b/docs/FrameworkControllerMode.md @@ -1,6 +1,6 @@ **Run an Experiment on FrameworkController** === -NNI supports running experiment using [FrameworkController](https://github.com/Microsoft/frameworkcontroller), called frameworkcontroller mode. FrameworkController is built to orchestrate all kinds of applications on Kubernetes, and you have to set a kubernetes cluster before using frameworkcontroller. +NNI supports running experiment using [FrameworkController](https://github.com/Microsoft/frameworkcontroller), called frameworkcontroller mode. 
FrameworkController is built to orchestrate all kinds of applications on Kubernetes, you could use frameworkcontroller as a training service to run your experiment. ## Set up Kubernetes Service and kubeconfig FrameworkController has same prerequisites as kubeflow mode except that you don't need to install kubeflow. Please refer the [document](./KubeflowMode.md) to set up your kubernetes cluster and other prerequisites for nni. From 7ab7386d403987a29b805df027a042232ba8d259 Mon Sep 17 00:00:00 2001 From: "Shinai Yang (FA TALENT)" Date: Thu, 27 Dec 2018 17:26:53 +0800 Subject: [PATCH 04/41] update --- docs/FrameworkControllerMode.md | 18 +++++++++--------- 1 file changed, 9 insertions(+), 9 deletions(-) diff --git a/docs/FrameworkControllerMode.md b/docs/FrameworkControllerMode.md index a977e20e20..15d3edbc48 100644 --- a/docs/FrameworkControllerMode.md +++ b/docs/FrameworkControllerMode.md @@ -9,7 +9,7 @@ FrameworkController has same prerequisites as kubeflow mode except that you don' Follow the [guideline](https://github.com/Microsoft/frameworkcontroller/tree/master/example/run) to set up the frameworkcontroller in the kubernetes cluster, nni support frameworkcontroller by the statefulset mode. ## Design -Please refer the design of [kubeflow training service](./KubeflowMode.md), frameworkcontroller training service pipeline is similar with kubeflow training service. +Please refer the design of [kubeflow training service](./KubeflowMode.md), frameworkcontroller training service pipeline is similar to kubeflow training service. 
## Example @@ -31,11 +31,11 @@ tuner: classArgs: #choice: maximize, minimize optimize_mode: maximize -#assessor: -# builtinAssessorName: Medianstop -# classArgs: -# optimize_mode: maximize -# gpuNum: 0 +assessor: + builtinAssessorName: Medianstop + classArgs: + optimize_mode: maximize + gpuNum: 0 trial: codeDir: ~/nni/examples/trials/mnist taskRoles: @@ -55,7 +55,7 @@ frameworkcontrollerConfig: server: {your_nfs_server} path: {your_nfs_server_exported_path} ``` -If you use Azure Kubernetes Service, you should set `kubeflowConfig` in your config yaml file as follows: +If you use Azure Kubernetes Service, you should set `frameworkcontrollerConfig` in your config yaml file as follows: ``` frameworkcontrollerConfig: storage: azureStorage @@ -66,9 +66,9 @@ frameworkcontrollerConfig: accountName: {your_storage_account_name} azureShare: {your_azure_share_name} ``` -Note: You should explicitly set `trainingServicePlatform: frameworkcontroller` in nni config yaml file if you want to start experiment in kubeflow mode. +Note: You should explicitly set `trainingServicePlatform: frameworkcontroller` in nni config yaml file if you want to start experiment in frameworkcontrollerConfig mode. -The trial's config format for nni frameworkcontroller mode is a simple version of frameworkcontroller's offical config, you could refer the [tensorflow example](https://github.com/Microsoft/frameworkcontroller/blob/master/example/framework/scenario/tensorflow/cpu/tensorflowdistributedtrainingwithcpu.yaml) for deeply understanding. +The trial's config format for nni frameworkcontroller mode is a simple version of frameworkcontroller's offical config, you could refer the [tensorflow example of frameworkcontroller](https://github.com/Microsoft/frameworkcontroller/blob/master/example/framework/scenario/tensorflow/cpu/tensorflowdistributedtrainingwithcpu.yaml) for deeply understanding. 
Trial configuration in frameworkcontroller mode have the following configuration keys: * taskRoles: you could set multiple task roles in config file, and each task role is a basic unit to process in kubernetes cluster. * name: the name of task role specified, like "worker", "ps", "master". From d9e1ea8261f8ede301e2a1ce18c0d2027a8a5f76 Mon Sep 17 00:00:00 2001 From: "Shinai Yang (FA TALENT)" Date: Thu, 27 Dec 2018 17:42:12 +0800 Subject: [PATCH 05/41] update --- docs/KubeflowMode.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/KubeflowMode.md b/docs/KubeflowMode.md index c312c1e3b7..4f830f0e82 100644 --- a/docs/KubeflowMode.md +++ b/docs/KubeflowMode.md @@ -1,6 +1,6 @@ **Run an Experiment on Kubeflow** === -Now NNI supports running experiment on [Kubeflow](https://github.com/kubeflow/kubeflow), called kubeflow mode. Before starting to use NNI kubeflow mode, you should have a kubernetes cluster, either on-prem or [Azure Kubernetes Service(AKS)](https://azure.microsoft.com/en-us/services/kubernetes-service/), a Ubuntu machine on which [kubeconfig](https://kubernetes.io/docs/concepts/configuration/organize-cluster-access-kubeconfig/) is setup to connect to your kubernetes cluster. If you are not familiar with kubernetes, [here](https://kubernetes.io/docs/tutorials/kubernetes-basics/) is a goot start. In kubeflow mode, your trial program will run as kubeflow job in kubernetes cluster. +Now NNI supports running experiment on [Kubeflow](https://github.com/kubeflow/kubeflow), called kubeflow mode. Before starting to use NNI kubeflow mode, you should have a kubernetes cluster, either on-prem or [Azure Kubernetes Service(AKS)](https://azure.microsoft.com/en-us/services/kubernetes-service/), a Ubuntu machine on which [kubeconfig](https://kubernetes.io/docs/concepts/configuration/organize-cluster-access-kubeconfig/) is setup to connect to your kubernetes cluster. 
If you are not familiar with kubernetes, [here](https://kubernetes.io/docs/tutorials/kubernetes-basics/) is a good start. In kubeflow mode, your trial program will run as kubeflow job in kubernetes cluster. ## Prerequisite for on-premises Kubernetes Service 1. A **Kubernetes** cluster using Kubernetes 1.8 or later. Follow this [guideline](https://kubernetes.io/docs/setup/) to set up Kubernetes From 2c225a840f16124107b6ae2a57b98422298239f5 Mon Sep 17 00:00:00 2001 From: "Shinai Yang (FA TALENT)" Date: Thu, 27 Dec 2018 17:47:56 +0800 Subject: [PATCH 06/41] update --- docs/FrameworkControllerMode.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/FrameworkControllerMode.md b/docs/FrameworkControllerMode.md index 15d3edbc48..3f93f112f8 100644 --- a/docs/FrameworkControllerMode.md +++ b/docs/FrameworkControllerMode.md @@ -81,4 +81,4 @@ Trial configuration in frameworkcontroller mode have the following configuration * frameworkAttemptCompletionPolicy: the policy to run framework, please refer the [user-manual](https://github.com/Microsoft/frameworkcontroller/blob/master/doc/user-manual.md) to get the specific information. ## How to run example -After you prepare a config file, you could run your experiment by nnictl. The way to start an experiment on frameworkcontroller is similar to kubeflow, please the [document](./KubeflowMode.md) for more information. \ No newline at end of file +After you prepare a config file, you could run your experiment by nnictl. The way to start an experiment on frameworkcontroller is similar to kubeflow, please refer the [document](./KubeflowMode.md) for more information. 
\ No newline at end of file From be23f553b4f50e38d298fffc1a3332d21f2535fb Mon Sep 17 00:00:00 2001 From: "Shinai Yang (FA TALENT)" Date: Sat, 29 Dec 2018 15:41:29 +0800 Subject: [PATCH 07/41] update --- docs/FrameworkControllerMode.md | 10 +++++----- 1 file changed, 5 insertions(+), 5 deletions(-) diff --git a/docs/FrameworkControllerMode.md b/docs/FrameworkControllerMode.md index 3f93f112f8..0273d87321 100644 --- a/docs/FrameworkControllerMode.md +++ b/docs/FrameworkControllerMode.md @@ -1,15 +1,15 @@ **Run an Experiment on FrameworkController** === -NNI supports running experiment using [FrameworkController](https://github.com/Microsoft/frameworkcontroller), called frameworkcontroller mode. FrameworkController is built to orchestrate all kinds of applications on Kubernetes, you could use frameworkcontroller as a training service to run your experiment. +NNI supports running experiment using [FrameworkController](https://github.com/Microsoft/frameworkcontroller), called frameworkcontroller mode. FrameworkController is built to orchestrate all kinds of applications on Kubernetes, now you can use frameworkcontroller as the training service to run NNI experiment. ## Set up Kubernetes Service and kubeconfig -FrameworkController has same prerequisites as kubeflow mode except that you don't need to install kubeflow. Please refer the [document](./KubeflowMode.md) to set up your kubernetes cluster and other prerequisites for nni. +FrameworkController has similar prerequisites as kubeflow mode, like K8S installation and storage preparation, but you don't need to install kubeflow. Please refer Kubeflow mode [document](./KubeflowMode.md) to set up your kubernetes cluster and other prerequisites for nni. ## Set up FrameworkController -Follow the [guideline](https://github.com/Microsoft/frameworkcontroller/tree/master/example/run) to set up the frameworkcontroller in the kubernetes cluster, nni support frameworkcontroller by the statefulset mode. 
+Follow the [guideline](https://github.com/Microsoft/frameworkcontroller/tree/master/example/run) to set up frameworkcontroller in the kubernetes cluster, nni support frameworkcontroller by the statefulset mode. ## Design -Please refer the design of [kubeflow training service](./KubeflowMode.md), frameworkcontroller training service pipeline is similar to kubeflow training service. +Please refer the design of [kubeflow training service](./KubeflowMode.md), frameworkcontroller training service pipeline is similar. ## Example @@ -68,7 +68,7 @@ frameworkcontrollerConfig: ``` Note: You should explicitly set `trainingServicePlatform: frameworkcontroller` in nni config yaml file if you want to start experiment in frameworkcontrollerConfig mode. -The trial's config format for nni frameworkcontroller mode is a simple version of frameworkcontroller's offical config, you could refer the [tensorflow example of frameworkcontroller](https://github.com/Microsoft/frameworkcontroller/blob/master/example/framework/scenario/tensorflow/cpu/tensorflowdistributedtrainingwithcpu.yaml) for deeply understanding. +The trial's config format for nni frameworkcontroller mode is a simple version of frameworkcontroller's offical config, you could refer the [tensorflow example of frameworkcontroller](https://github.com/Microsoft/frameworkcontroller/blob/master/example/framework/scenario/tensorflow/cpu/tensorflowdistributedtrainingwithcpu.yaml) for deep understanding. Trial configuration in frameworkcontroller mode have the following configuration keys: * taskRoles: you could set multiple task roles in config file, and each task role is a basic unit to process in kubernetes cluster. * name: the name of task role specified, like "worker", "ps", "master". 
From 91612098cf4e29ea5427b0ee00f11812b9422af6 Mon Sep 17 00:00:00 2001 From: "Shinai Yang (FA TALENT)" Date: Thu, 3 Jan 2019 16:56:37 +0800 Subject: [PATCH 08/41] fix remote issue --- .../remote_machine/remoteMachineTrainingService.ts | 7 +++---- 1 file changed, 3 insertions(+), 4 deletions(-) diff --git a/src/nni_manager/training_service/remote_machine/remoteMachineTrainingService.ts b/src/nni_manager/training_service/remote_machine/remoteMachineTrainingService.ts index 5302d90bbb..7ab2adf943 100644 --- a/src/nni_manager/training_service/remote_machine/remoteMachineTrainingService.ts +++ b/src/nni_manager/training_service/remote_machine/remoteMachineTrainingService.ts @@ -110,15 +110,14 @@ class RemoteMachineTrainingService implements TrainingService { /** * List submitted trial jobs */ - public listTrialJobs(): Promise<TrialJobDetail[]> { + public async listTrialJobs(): Promise<TrialJobDetail[]> { const jobs: TrialJobDetail[] = []; const deferred: Deferred<TrialJobDetail[]> = new Deferred<TrialJobDetail[]>(); - - this.trialJobsMap.forEach(async (value: RemoteMachineTrialJobDetail, key: string) => { + for (const [key, value] of this.trialJobsMap) { if (value.form.jobType === 'TRIAL') { jobs.push(await this.getTrialJob(key)); } - }); + }; deferred.resolve(jobs); return deferred.promise; From e661c5528353e656db8de6a65b0c392bf90362f6 Mon Sep 17 00:00:00 2001 From: "Shinai Yang (FA TALENT)" Date: Thu, 3 Jan 2019 17:10:50 +0800 Subject: [PATCH 09/41] fix forEach --- .../kubernetes/kubernetesTrainingService.ts | 6 +++--- src/nni_manager/training_service/pai/paiTrainingService.ts | 4 ++-- 2 files changed, 5 insertions(+), 5 deletions(-) diff --git a/src/nni_manager/training_service/kubernetes/kubernetesTrainingService.ts b/src/nni_manager/training_service/kubernetes/kubernetesTrainingService.ts index 631b0e98ea..a97ee63d57 100644 --- a/src/nni_manager/training_service/kubernetes/kubernetesTrainingService.ts +++ b/src/nni_manager/training_service/kubernetes/kubernetesTrainingService.ts @@ -81,14 +81,14 @@ abstract class 
KubernetesTrainingService { } } - public listTrialJobs(): Promise<TrialJobDetail[]> { + public async listTrialJobs(): Promise<TrialJobDetail[]> { const jobs: TrialJobDetail[] = []; - this.trialJobsMap.forEach(async (value: KubernetesTrialJobDetail, key: string) => { + for (const [key, value] of this.trialJobsMap) { if (value.form.jobType === 'TRIAL') { jobs.push(await this.getTrialJob(key)); } - }); + }; return Promise.resolve(jobs); } diff --git a/src/nni_manager/training_service/pai/paiTrainingService.ts b/src/nni_manager/training_service/pai/paiTrainingService.ts index c1ef8ccb60..a19a225d09 100644 --- a/src/nni_manager/training_service/pai/paiTrainingService.ts +++ b/src/nni_manager/training_service/pai/paiTrainingService.ts @@ -104,11 +104,11 @@ class PAITrainingService implements TrainingService { public async listTrialJobs(): Promise<TrialJobDetail[]> { const jobs: TrialJobDetail[] = []; - this.trialJobsMap.forEach(async (value: PAITrialJobDetail, key: string) => { + for (const [key, value] of this.trialJobsMap) { if (value.form.jobType === 'TRIAL') { jobs.push(await this.getTrialJob(key)); } - }); + }; return Promise.resolve(jobs); } From 4fec2cc7c6b2b744e65dab21b9bee729b2f590e7 Mon Sep 17 00:00:00 2001 From: "Shinai Yang (FA TALENT)" Date: Mon, 7 Jan 2019 14:15:15 +0800 Subject: [PATCH 10/41] update doc according to comments --- docs/FrameworkControllerMode.md | 24 ++++++++++++++++++++---- 1 file changed, 20 insertions(+), 4 deletions(-) diff --git a/docs/FrameworkControllerMode.md b/docs/FrameworkControllerMode.md index 0273d87321..805b0d0b50 100644 --- a/docs/FrameworkControllerMode.md +++ b/docs/FrameworkControllerMode.md @@ -1,9 +1,25 @@ **Run an Experiment on FrameworkController** === -NNI supports running experiment using [FrameworkController](https://github.com/Microsoft/frameworkcontroller), called frameworkcontroller mode. FrameworkController is built to orchestrate all kinds of applications on Kubernetes, now you can use frameworkcontroller as the training service to run NNI experiment. 
+NNI supports running experiment using [FrameworkController](https://github.com/Microsoft/frameworkcontroller), called frameworkcontroller mode. FrameworkController is built to orchestrate all kinds of applications on Kubernetes, you don't need to install kubeflow for specific deeplearning framework like tf-operator or pytorch-operator any more. Now you can use frameworkcontroller as the training service to run NNI experiment. + +## Prerequisite for on-premises Kubernetes Service +1. A **Kubernetes** cluster using Kubernetes 1.8 or later. Follow this [guideline](https://kubernetes.io/docs/setup/) to set up Kubernetes +2. Prepare a **kubeconfig** file, which will be used by NNI to interact with your kubernetes API server. By default, NNI manager will use $(HOME)/.kube/config as kubeconfig file's path. You can also specify other kubeconfig files by setting the **KUBECONFIG** environment variable. Refer this [guideline]( https://kubernetes.io/docs/concepts/configuration/organize-cluster-access-kubeconfig) to learn more about kubeconfig. +3. If your NNI trial job needs GPU resource, you should follow this [guideline](https://github.com/NVIDIA/k8s-device-plugin) to configure **Nvidia device plugin for Kubernetes**. +4. Prepare a **NFS server** and export a general purpose mount (we recommend to map your NFS server path in `root_squash option`, otherwise permission issue may raise when nni copy files to NFS. Refer this [page](https://linux.die.net/man/5/exports) to learn what root_squash option is), or **Azure File Storage**. +5. Install **NFS client** on the machine where you install NNI and run nnictl to create experiment. Run this command to install NFSv4 client: + ``` + apt-get install nfs-common + ``` + +6. Install **NNI**, follow the install guide [here](GetStarted.md). + +## Prerequisite for Azure Kubernetes Service +1. 
NNI support kubeflow based on Azure Kubernetes Service, follow the [guideline](https://azure.microsoft.com/en-us/services/kubernetes-service/) to set up Azure Kubernetes Service. +2. Install [Azure CLI](https://docs.microsoft.com/en-us/cli/azure/install-azure-cli?view=azure-cli-latest) and __kubectl__. Use `az login` to set azure account, and connect kubectl client to AKS, refer this [guideline](https://docs.microsoft.com/en-us/azure/aks/kubernetes-walkthrough#connect-to-the-cluster). +3. Follow the [guideline](https://docs.microsoft.com/en-us/azure/storage/common/storage-quickstart-create-account?tabs=portal) to create azure file storage account. If you use Azure Kubernetes Service, nni need Azure Storage Service to store code files and the output files. +4. To access Azure storage service, nni need the access key of the storage account, and nni use [Azure Key Vault](https://azure.microsoft.com/en-us/services/key-vault/) Service to protect your private key. Set up Azure Key Vault Service, add a secret to Key Vault to store the access key of Azure storage account. Follow this [guideline](https://docs.microsoft.com/en-us/azure/key-vault/quick-create-cli) to store the access key. -## Set up Kubernetes Service and kubeconfig -FrameworkController has similar prerequisites as kubeflow mode, like K8S installation and storage preparation, but you don't need to install kubeflow. Please refer Kubeflow mode [document](./KubeflowMode.md) to set up your kubernetes cluster and other prerequisites for nni. ## Set up FrameworkController Follow the [guideline](https://github.com/Microsoft/frameworkcontroller/tree/master/example/run) to set up frameworkcontroller in the kubernetes cluster, nni support frameworkcontroller by the statefulset mode. @@ -78,7 +94,7 @@ Trial configuration in frameworkcontroller mode have the following configuration * cpuNum: the number of cpu device used in container. * memoryMB: the memory limitaion to be specified in container. 
* image: the docker image used to create pod and run the program. - * frameworkAttemptCompletionPolicy: the policy to run framework, please refer the [user-manual](https://github.com/Microsoft/frameworkcontroller/blob/master/doc/user-manual.md) to get the specific information. + * frameworkAttemptCompletionPolicy: the policy to run framework, please refer the [user-manual](https://github.com/Microsoft/frameworkcontroller/blob/master/doc/user-manual.md#frameworkattemptcompletionpolicy) to get the specific information. Users could use the policy to control the pod, for example, if ps does not stop, only worker stops, this completionpolicy could helps stop ps. ## How to run example After you prepare a config file, you could run your experiment by nnictl. The way to start an experiment on frameworkcontroller is similar to kubeflow, please refer the [document](./KubeflowMode.md) for more information. \ No newline at end of file From 11fec6f1e9d97cabe15e78128e63018ffc227f85 Mon Sep 17 00:00:00 2001 From: "Shinai Yang (FA TALENT)" Date: Mon, 7 Jan 2019 14:40:58 +0800 Subject: [PATCH 11/41] update --- docs/KubeflowMode.md | 149 +++++++++++++++++++++++++++++++++---------- 1 file changed, 114 insertions(+), 35 deletions(-) diff --git a/docs/KubeflowMode.md b/docs/KubeflowMode.md index 4f830f0e82..12664ced34 100644 --- a/docs/KubeflowMode.md +++ b/docs/KubeflowMode.md @@ -28,64 +28,140 @@ Kubeflow training service instantiates a kubernetes rest client to interact with For each trial, we will upload all the files in your local codeDir path (configured in nni_config.yaml) together with NNI generated files like parameter.cfg into a storage volumn. Right now we support two kinds of storage volumns: [nfs](https://en.wikipedia.org/wiki/Network_File_System) and [azure file storage](https://azure.microsoft.com/en-us/services/storage/files/), you should configure the storage volumn in nni config yaml file. 
After files are prepared, Kubeflow training service will call K8S rest API to create kubeflow jobs ([tf-operator](https://github.com/kubeflow/tf-operator) job or [pytorch-operator](https://github.com/kubeflow/pytorch-operator) job) in K8S, and mount your storage volumn into the job's pod. Output files of kubeflow job, like stdout, stderr, trial.log or model files, will also be copied back to the storage volumn. NNI will show the storage volumn's URL for each trial in WebUI, to allow user browse the log files and job's output files. +## Supported operator +NNI only support tf-operator and pytorch-operator of kubeflow, other operators is not tested. +Users could set operator type in config file. +``` +operator: tf-operator +``` +If users want to use tf-operator, he could set `ps` and `worker` in trial config. If users want to use pytorch-operator, he could set `master` and `worker` in trial config. + +## Supported sotrage type +NNI support NFS and Azure Storage to store the code and output files, users could set storage type in config file and set the corresponding config. +The setting for NFS storage are as follows: +``` +kubeflowConfig: + operator: tf-operator + apiVersion: v1alpha2 + storage: nfs + nfs: + # Your NFS server IP, like 10.10.10.10 + server: {your_nfs_server_ip} + # Your NFS server export path, like /var/nfs/nni + path: {your_nfs_server_export_path} +``` +If you use Azure storage, you should set `kubeflowConfig` in your config yaml file as follows: +``` +kubeflowConfig: + operator: tf-operator + apiVersion: v1alpha2 + storage: azureStorage + keyVault: + vaultName: {your_vault_name} + name: {your_secert_name} + azureStorage: + accountName: {your_storage_account_name} + azureShare: {your_azure_share_name} +``` + + ## Run an experiment -Use `examples/trials/mnist` as an example. The nni config yaml file's content is like: +Use `examples/trials/mnist` as an example. This is a tensorflow job, and use tf-operator of kubeflow. 
The nni config yaml file's content is like: ``` -authorName: your_name +authorName: default experimentName: example_mnist -# how many trials could be concurrently running -trialConcurrency: 4 -# maximum experiment running duration -maxExecDuration: 3h -# empty means never stop -maxTrialNum: 100 -# choice: local, remote, pai, kubeflow +trialConcurrency: 2 +maxExecDuration: 1h +maxTrialNum: 20 +#choice: local, remote, pai, kubeflow trainingServicePlatform: kubeflow -# choice: true, false +searchSpacePath: search_space.json +#choice: true, false useAnnotation: false tuner: + #choice: TPE, Random, Anneal, Evolution builtinTunerName: TPE classArgs: #choice: maximize, minimize optimize_mode: maximize +assessor: + builtinAssessorName: Medianstop + classArgs: + optimize_mode: maximize + gpuNum: 0 trial: - codeDir: ~/nni/examples/trials/mnist - ps: - replicas: 1 - command: python mnist-keras.py - gpuNum: 0 + codeDir: . + worker: + replicas: 2 + command: python3 dist_mnist.py + gpuNum: 1 cpuNum: 1 memoryMB: 8196 - image: {your_docker_image_for_tensorflow_ps} - worker: - replicas: 1 - command: python mnist-keras.py - gpuNum: 2 + image: msranni/nni:latest + ps: + replicas: 1 + command: python3 dist_mnist.py + gpuNum: 0 cpuNum: 1 memoryMB: 8196 - image: {your_docker_image_for_tensorflow_worker} + image: msranni/nni:latest kubeflowConfig: operator: tf-operator + apiVersion: v1alpha2 storage: nfs nfs: - server: {your_nfs_server} - path: {your_nfs_server_exported_path} + # Your NFS server IP, like 10.10.10.10 + server: {your_nfs_server_ip} + # Your NFS server export path, like /var/nfs/nni + path: {your_nfs_server_export_path} ``` -If you use Azure Kubernetes Service, you should set `kubeflowConfig` in your config yaml file as follows: + +Note: You should explicitly set `trainingServicePlatform: kubeflow` in nni config yaml file if you want to start experiment in kubeflow mode. 
+ +If you want to run Pytorch jobs, you could set your config files as follow: ``` +authorName: default +experimentName: example_mnist_distributed_pytorch +trialConcurrency: 1 +maxExecDuration: 1h +maxTrialNum: 10 +#choice: local, remote, pai, kubeflow +trainingServicePlatform: kubeflow +searchSpacePath: search_space.json +#choice: true, false +useAnnotation: false +tuner: + #choice: TPE, Random, Anneal, Evolution + builtinTunerName: TPE + classArgs: + #choice: maximize, minimize + optimize_mode: minimize +trial: + codeDir: . + master: + replicas: 1 + command: python3 dist_mnist.py + gpuNum: 1 + cpuNum: 1 + memoryMB: 2048 + image: msranni/nni:latest + worker: + replicas: 1 + command: python3 dist_mnist.py + gpuNum: 0 + cpuNum: 1 + memoryMB: 2048 + image: msranni/nni:latest kubeflowConfig: - operator: tf-operator - storage: azureStorage - keyVault: - vaultName: {your_vault_name} - name: {your_secert_name} - azureStorage: - accountName: {your_storage_account_name} - azureShare: {your_azure_share_name} + operator: pytorch-operator + apiVersion: v1alpha2 + nfs: + # Your NFS server IP, like 10.10.10.10 + server: {your_nfs_server_ip} + # Your NFS server export path, like /var/nfs/nni + path: {your_nfs_server_export_path} ``` -Note: You should explicitly set `trainingServicePlatform: kubeflow` in nni config yaml file if you want to start experiment in kubeflow mode. - Trial configuration in kubeflow mode have the following configuration keys: * codeDir * code directory, where you put training code and config files @@ -101,13 +177,16 @@ Trial configuration in kubeflow mode have the following configuration keys: * image * Required key. In kubeflow mode, your trial program will be scheduled by Kubernetes to run in [Pod](https://kubernetes.io/docs/concepts/workloads/pods/pod/). This key is used to specify the Docker image used to create the pod where your trail program will run. 
* We already build a docker image [msranni/nni](https://hub.docker.com/r/msranni/nni/) on [Docker Hub](https://hub.docker.com/). It contains NNI python packages, Node modules and javascript artifact files required to start experiment, and all of NNI dependencies. The docker file used to build this image can be found at [here](../deployment/Dockerfile.build.base). You can either use this image directly in your config file, or build your own image based on it. + * apiVersion + * Required key. The API version of your kubeflow. * ps (optional). This config section is used to configure tensorflow parameter server role. +* master(optional). This config section is used to configure pytorch parameter server role. Once complete to fill nni experiment config file and save (for example, save as exp_kubeflow.yaml), then run the following command ``` nnictl create --config exp_kubeflow.yaml ``` -to start the experiment in kubeflow mode. NNI will create Kubeflow tfjob for each trial, and the job name format is something like `nni_exp_{experiment_id}_trial_{trial_id}`. +to start the experiment in kubeflow mode. NNI will create Kubeflow tfjob or pytorchjob for each trial, and the job name format is something like `nni_exp_{experiment_id}_trial_{trial_id}`. You can see the kubeflow tfjob created by NNI in your Kubernetes dashboard. Notice: In kubeflow mode, NNIManager will start a rest server and listen on a port which is your NNI WebUI's port plus 1. For example, if your WebUI port is `8080`, the rest server will listen on `8081`, to receive metrics from trial job running in Kubernetes. So you should `enable 8081` TCP port in your firewall rule to allow incoming traffic. 
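The port rule and job-name pattern described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not NNI's implementation: the helper names `rest_server_port` and `trial_job_name` are hypothetical, and the name pattern is only the approximate format quoted in the text ("something like").

```python
def rest_server_port(webui_port):
    """Return the port the kubeflow-mode rest server is described as using.

    Assumption taken from the text: the rest server listens on the
    WebUI port plus 1 (for example 8080 -> 8081).
    """
    if not 1 <= webui_port < 65535:
        # webui_port + 1 must still be a valid TCP port
        raise ValueError('WebUI port must be in the range 1..65534')
    return webui_port + 1


def trial_job_name(experiment_id, trial_id):
    """Build a job name in the approximate format quoted in the text.

    Hypothetical helper; the real naming code may differ.
    """
    return 'nni_exp_{0}_trial_{1}'.format(experiment_id, trial_id)


if __name__ == '__main__':
    # The firewall rule should allow incoming TCP traffic on this port.
    print(rest_server_port(8080))
    print(trial_job_name('abc', 'xyz'))
```

The firewall rule mentioned in the notice would need to allow the value returned by `rest_server_port` for whichever WebUI port the experiment was started on.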
From a03a1912048d73af61868305012d5c39e503e957 Mon Sep 17 00:00:00 2001 From: "Shinai Yang (FA TALENT)" Date: Mon, 7 Jan 2019 14:44:28 +0800 Subject: [PATCH 12/41] update --- docs/KubeflowMode.md | 13 ++++++++----- 1 file changed, 8 insertions(+), 5 deletions(-) diff --git a/docs/KubeflowMode.md b/docs/KubeflowMode.md index 12664ced34..2beb69b59a 100644 --- a/docs/KubeflowMode.md +++ b/docs/KubeflowMode.md @@ -31,8 +31,15 @@ For each trial, we will upload all the files in your local codeDir path (configu ## Supported operator NNI only support tf-operator and pytorch-operator of kubeflow, other operators is not tested. Users could set operator type in config file. +The setting of tf-operator: ``` -operator: tf-operator +kubeflowConfig: + operator: tf-operator +``` +The setting of pytorch-operator: +``` +kubeflowConfig: + operator: pytorch-operator ``` If users want to use tf-operator, he could set `ps` and `worker` in trial config. If users want to use pytorch-operator, he could set `master` and `worker` in trial config. 
@@ -41,8 +48,6 @@ NNI support NFS and Azure Storage to store the code and output files, users coul The setting for NFS storage are as follows: ``` kubeflowConfig: - operator: tf-operator - apiVersion: v1alpha2 storage: nfs nfs: # Your NFS server IP, like 10.10.10.10 @@ -53,8 +58,6 @@ kubeflowConfig: If you use Azure storage, you should set `kubeflowConfig` in your config yaml file as follows: ``` kubeflowConfig: - operator: tf-operator - apiVersion: v1alpha2 storage: azureStorage keyVault: vaultName: {your_vault_name} From 7c7832cef0579539479460d07e1282e898489868 Mon Sep 17 00:00:00 2001 From: "Shinai Yang (FA TALENT)" Date: Mon, 7 Jan 2019 15:17:22 +0800 Subject: [PATCH 13/41] update --- README.md | 11 +++++------ 1 file changed, 5 insertions(+), 6 deletions(-) diff --git a/README.md b/README.md index 1d56023093..58b034fc78 100644 --- a/README.md +++ b/README.md @@ -83,12 +83,11 @@ You can use these commands to get more information about the experiment commands description 1. nnictl experiment show show the information of experiments 2. nnictl trial ls list all of trial jobs -3. nnictl top monitor the status of running experiments -4. nnictl log stderr show stderr log content -5. nnictl log stdout show stdout log content -6. nnictl stop stop an experiment -7. nnictl trial kill kill a trial job by id -8. nnictl --help get help information about nnictl +3. nnictl log stderr show stderr log content +4. nnictl log stdout show stdout log content +5. nnictl stop stop an experiment +6. nnictl trial kill kill a trial job by id +7. 
nnictl --help get help information about nnictl ----------------------------------------------------------------------- ``` From 85c015dc71a60879440b1a64dfa2e9f8b395f75e Mon Sep 17 00:00:00 2001 From: "Shinai Yang (FA TALENT)" Date: Tue, 8 Jan 2019 15:30:37 +0800 Subject: [PATCH 14/41] remove 'any more' --- docs/FrameworkControllerMode.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/FrameworkControllerMode.md b/docs/FrameworkControllerMode.md index 805b0d0b50..c54c33756b 100644 --- a/docs/FrameworkControllerMode.md +++ b/docs/FrameworkControllerMode.md @@ -1,6 +1,6 @@ **Run an Experiment on FrameworkController** === -NNI supports running experiment using [FrameworkController](https://github.com/Microsoft/frameworkcontroller), called frameworkcontroller mode. FrameworkController is built to orchestrate all kinds of applications on Kubernetes, you don't need to install kubeflow for specific deeplearning framework like tf-operator or pytorch-operator any more. Now you can use frameworkcontroller as the training service to run NNI experiment. +NNI supports running experiment using [FrameworkController](https://github.com/Microsoft/frameworkcontroller), called frameworkcontroller mode. FrameworkController is built to orchestrate all kinds of applications on Kubernetes, you don't need to install kubeflow for specific deeplearning framework like tf-operator or pytorch-operator. Now you can use frameworkcontroller as the training service to run NNI experiment. ## Prerequisite for on-premises Kubernetes Service 1. A **Kubernetes** cluster using Kubernetes 1.8 or later. 
Follow this [guideline](https://kubernetes.io/docs/setup/) to set up Kubernetes From 4a13aebe4e4622304eb41c3907a5a343ac6bde38 Mon Sep 17 00:00:00 2001 From: "Shinai Yang (FA TALENT)" Date: Mon, 18 Mar 2019 12:07:23 +0800 Subject: [PATCH 15/41] fix kubeflow-it --- test/pipelines-it-kubeflow.yml | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/test/pipelines-it-kubeflow.yml b/test/pipelines-it-kubeflow.yml index b7f86a968a..7746636a7f 100644 --- a/test/pipelines-it-kubeflow.yml +++ b/test/pipelines-it-kubeflow.yml @@ -43,9 +43,9 @@ jobs: python3 -m pip install torch==0.4.1 --user python3 -m pip install torchvision==0.2.1 --user python3 -m pip install keras==2.1.6 --user - python3 -m pip install tensorflow-gpu==1.10.0 --user + python3 -m pip install tensorflow-gpu==1.12.0 --user sudo apt-get install swig -y - nnictl package install --name=SMAC + PATH=$HOME/.local/bin:$PATH nnictl package install --name=SMAC displayName: 'Install dependencies for integration tests in Kubeflow mode' - script: | From b18c82fd80b203ead1323f3f7741c5b50c88764f Mon Sep 17 00:00:00 2001 From: "Shinai Yang (FA TALENT)" Date: Mon, 18 Mar 2019 13:47:56 +0800 Subject: [PATCH 16/41] update --- test/pipelines-it-kubeflow.yml | 2 +- test/pipelines-it-local.yml | 4 ++-- test/pipelines-it-pai.yml | 4 ++-- test/pipelines-it-remote.yml | 6 +++++- 4 files changed, 10 insertions(+), 6 deletions(-) diff --git a/test/pipelines-it-kubeflow.yml b/test/pipelines-it-kubeflow.yml index 7746636a7f..fd13191520 100644 --- a/test/pipelines-it-kubeflow.yml +++ b/test/pipelines-it-kubeflow.yml @@ -43,7 +43,7 @@ jobs: python3 -m pip install torch==0.4.1 --user python3 -m pip install torchvision==0.2.1 --user python3 -m pip install keras==2.1.6 --user - python3 -m pip install tensorflow-gpu==1.12.0 --user + python3 -m pip install tensorflow==1.12.0 --user sudo apt-get install swig -y PATH=$HOME/.local/bin:$PATH nnictl package install --name=SMAC displayName: 'Install dependencies for integration 
tests in Kubeflow mode' diff --git a/test/pipelines-it-local.yml b/test/pipelines-it-local.yml index e10dd014bd..b6d6dca584 100644 --- a/test/pipelines-it-local.yml +++ b/test/pipelines-it-local.yml @@ -12,9 +12,9 @@ jobs: python3 -m pip install torch==0.4.1 --user python3 -m pip install torchvision==0.2.1 --user python3 -m pip install keras==2.1.6 --user - python3 -m pip install tensorflow-gpu==1.10.0 --user + python3 -m pip install tensorflow-gpu==1.12.0 --user sudo apt-get install swig -y - nnictl package install --name=SMAC + PATH=$HOME/.local/bin:$PATH nnictl package install --name=SMAC displayName: 'Install dependencies for integration tests' - script: | cd test diff --git a/test/pipelines-it-pai.yml b/test/pipelines-it-pai.yml index faf5af3f90..beaa230561 100644 --- a/test/pipelines-it-pai.yml +++ b/test/pipelines-it-pai.yml @@ -43,9 +43,9 @@ jobs: python3 -m pip install torch==0.4.1 --user python3 -m pip install torchvision==0.2.1 --user python3 -m pip install keras==2.1.6 --user - python3 -m pip install tensorflow-gpu==1.10.0 --user + python3 -m pip install tensorflow-gpu==1.12.0 --user sudo apt-get install swig -y - nnictl package install --name=SMAC + PATH=$HOME/.local/bin:$PATH nnictl package install --name=SMAC displayName: 'Install dependencies for integration tests in PAI mode' - script: | diff --git a/test/pipelines-it-remote.yml b/test/pipelines-it-remote.yml index 6154da47fe..50f1c0229d 100644 --- a/test/pipelines-it-remote.yml +++ b/test/pipelines-it-remote.yml @@ -2,11 +2,15 @@ jobs: - job: 'integration_test_remote' steps: - - script: python3 -m pip install --upgrade pip setuptools + - script: python3 -m pip install --upgrade pip setuptools --user displayName: 'Install python tools' - script: | source install.sh displayName: 'Install nni toolkit via source code' + - script: | + sudo apt-get install swig -y + PATH=$HOME/.local/bin:$PATH nnictl package install --name=SMAC + displayName: 'Install dependencies for integration tests in remote mode' - 
task: CopyFilesOverSSH@0 inputs: sshEndpoint: remote_nni-ci-gpu-01 From b7b85bfc9c638454457e4ef54b69f8ecc0493528 Mon Sep 17 00:00:00 2001 From: "Shinai Yang (FA TALENT)" Date: Mon, 18 Mar 2019 18:30:24 +0800 Subject: [PATCH 17/41] disable versionCheck to debug --- test/training_service.yml | 1 + 1 file changed, 1 insertion(+) diff --git a/test/training_service.yml b/test/training_service.yml index 5742c29987..57f38b2fef 100644 --- a/test/training_service.yml +++ b/test/training_service.yml @@ -1,6 +1,7 @@ kubeflow: maxExecDuration: 15m nniManagerIp: + versionCheck: false kubeflowConfig: operator: tf-operator apiVersion: v1alpha2 From 9f0904edf62ba84e3f912df3000a8153b1d8c401 Mon Sep 17 00:00:00 2001 From: "Shinai Yang (FA TALENT)" Date: Mon, 18 Mar 2019 19:10:59 +0800 Subject: [PATCH 18/41] revert code --- test/training_service.yml | 1 - 1 file changed, 1 deletion(-) diff --git a/test/training_service.yml b/test/training_service.yml index 57f38b2fef..5742c29987 100644 --- a/test/training_service.yml +++ b/test/training_service.yml @@ -1,7 +1,6 @@ kubeflow: maxExecDuration: 15m nniManagerIp: - versionCheck: false kubeflowConfig: operator: tf-operator apiVersion: v1alpha2 From d1a703779ce48d3a83767ea72a9e161e9b7d8ed6 Mon Sep 17 00:00:00 2001 From: "Shinai Yang (FA TALENT)" Date: Wed, 20 Mar 2019 19:45:01 +0800 Subject: [PATCH 19/41] add debug config --- test/generate_ts_config.py | 3 +++ 1 file changed, 3 insertions(+) diff --git a/test/generate_ts_config.py b/test/generate_ts_config.py index d13c451a9e..8f312faeb6 100644 --- a/test/generate_ts_config.py +++ b/test/generate_ts_config.py @@ -42,6 +42,7 @@ def update_training_service_config(args): config[args.ts]['trial']['outputDir'] = args.output_dir if args.vc is not None: config[args.ts]['trial']['virtualCluster'] = args.vc + config[args.ts]['debug'] = True elif args.ts == 'kubeflow': if args.nfs_server is not None: config[args.ts]['kubeflowConfig']['nfs']['server'] = args.nfs_server @@ -57,6 +58,7 @@ def 
update_training_service_config(args): config[args.ts]['kubeflowConfig']['azureStorage']['azureShare'] = args.azs_share if args.nni_docker_image is not None: config[args.ts]['trial']['worker']['image'] = args.nni_docker_image + config[args.ts]['debug'] = True elif args.ts == 'remote': if args.remote_user is not None: config[args.ts]['machineList'][0]['username'] = args.remote_user @@ -66,6 +68,7 @@ def update_training_service_config(args): config[args.ts]['machineList'][0]['port'] = args.remote_port if args.remote_pwd is not None: config[args.ts]['machineList'][0]['passwd'] = args.remote_pwd + config[args.ts]['debug'] = True dump_yml_content(TRAINING_SERVICE_FILE, config) From 01eb4362d8b08aeab1d4f85d12647459b98e3d64 Mon Sep 17 00:00:00 2001 From: "Shinai Yang (FA TALENT)" Date: Wed, 20 Mar 2019 21:27:21 +0800 Subject: [PATCH 20/41] fix debug mode --- tools/nni_cmd/launcher.py | 3 +-- 1 file changed, 1 insertion(+), 2 deletions(-) diff --git a/tools/nni_cmd/launcher.py b/tools/nni_cmd/launcher.py index b52847c089..f3501513d0 100644 --- a/tools/nni_cmd/launcher.py +++ b/tools/nni_cmd/launcher.py @@ -443,7 +443,7 @@ def launch_experiment(args, experiment_config, mode, config_file_name, experimen # start a new experiment print_normal('Starting experiment...') # set debug configuration - if args.debug is not None: + if experiment_config.get('debug') is None: experiment_config['debug'] = args.debug response = set_experiment(experiment_config, mode, args.port, config_file_name) if response: @@ -510,7 +510,6 @@ def create_experiment(args): exit(1) experiment_config = get_yml_content(config_path) validate_all_content(experiment_config, config_path) - nni_config.set_config('experimentConfig', experiment_config) launch_experiment(args, experiment_config, 'new', config_file_name) nni_config.set_config('restServerPort', args.port) From b5693061da12eb20e23b6f539bd1f7273692e56d Mon Sep 17 00:00:00 2001 From: "Shinai Yang (FA TALENT)" Date: Thu, 21 Mar 2019 20:44:33 +0800 Subject: 
[PATCH 21/41] fix Makefile --- Makefile | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/Makefile b/Makefile index 03452dad2a..abb3d11670 100644 --- a/Makefile +++ b/Makefile @@ -57,7 +57,7 @@ NNI_YARN_FOLDER ?= $(NNI_DEPENDENCY_FOLDER)/nni-yarn NNI_YARN := PATH=$(BIN_FOLDER):$${PATH} $(NNI_YARN_FOLDER)/bin/yarn ## Version number -NNI_VERSION_VALUE = $(shell git describe --tags) +NNI_VERSION_VALUE = $(shell git describe --tags --abbrev=0) NNI_VERSION_TEMPLATE = 999.0.0-developing # Main targets From 4163760110a239e7de7db402f772a0b8a922bad6 Mon Sep 17 00:00:00 2001 From: "Shinai Yang (FA TALENT)" Date: Thu, 21 Mar 2019 20:46:09 +0800 Subject: [PATCH 22/41] add debug in remote --- test/generate_ts_config.py | 2 -- 1 file changed, 2 deletions(-) diff --git a/test/generate_ts_config.py b/test/generate_ts_config.py index 8f312faeb6..8179334f22 100644 --- a/test/generate_ts_config.py +++ b/test/generate_ts_config.py @@ -42,7 +42,6 @@ def update_training_service_config(args): config[args.ts]['trial']['outputDir'] = args.output_dir if args.vc is not None: config[args.ts]['trial']['virtualCluster'] = args.vc - config[args.ts]['debug'] = True elif args.ts == 'kubeflow': if args.nfs_server is not None: config[args.ts]['kubeflowConfig']['nfs']['server'] = args.nfs_server @@ -58,7 +57,6 @@ def update_training_service_config(args): config[args.ts]['kubeflowConfig']['azureStorage']['azureShare'] = args.azs_share if args.nni_docker_image is not None: config[args.ts]['trial']['worker']['image'] = args.nni_docker_image - config[args.ts]['debug'] = True elif args.ts == 'remote': if args.remote_user is not None: config[args.ts]['machineList'][0]['username'] = args.remote_user From 426c3b5657382332f39995a4f826240d0bbddca5 Mon Sep 17 00:00:00 2001 From: "Shinai Yang (FA TALENT)" Date: Thu, 21 Mar 2019 20:48:02 +0800 Subject: [PATCH 23/41] revert code --- tools/nni_cmd/launcher.py | 1 + 1 file changed, 1 insertion(+) diff --git a/tools/nni_cmd/launcher.py 
b/tools/nni_cmd/launcher.py index f3501513d0..6341c754da 100644 --- a/tools/nni_cmd/launcher.py +++ b/tools/nni_cmd/launcher.py @@ -510,6 +510,7 @@ def create_experiment(args): exit(1) experiment_config = get_yml_content(config_path) validate_all_content(experiment_config, config_path) + nni_config.set_config('experimentConfig', experiment_config) launch_experiment(args, experiment_config, 'new', config_file_name) nni_config.set_config('restServerPort', args.port) From a8f9ac491bfd5d2a1d7bfb9f9e4c256e1d6e5ae1 Mon Sep 17 00:00:00 2001 From: "Shinai Yang (FA TALENT)" Date: Fri, 22 Mar 2019 11:15:45 +0800 Subject: [PATCH 24/41] remove prefix in Makefile --- Makefile | 1 + deployment/pypi/Makefile | 1 + 2 files changed, 2 insertions(+) diff --git a/Makefile b/Makefile index abb3d11670..351e1e0757 100644 --- a/Makefile +++ b/Makefile @@ -58,6 +58,7 @@ NNI_YARN := PATH=$(BIN_FOLDER):$${PATH} $(NNI_YARN_FOLDER)/bin/yarn ## Version number NNI_VERSION_VALUE = $(shell git describe --tags --abbrev=0) +NNI_VERSION_VALUE = $(NNI_VERSION_VALUE#*v) #remove prefix 'v' in version NNI_VERSION_TEMPLATE = 999.0.0-developing # Main targets diff --git a/deployment/pypi/Makefile b/deployment/pypi/Makefile index b75cc3212c..07ea2526c6 100644 --- a/deployment/pypi/Makefile +++ b/deployment/pypi/Makefile @@ -13,6 +13,7 @@ endif TIME_STAMP = $(shell date -u "+%y%m%d%H%M") NNI_VERSION_VALUE = $(shell git describe --tags --abbrev=0) +NNI_VERSION_VALUE = $(NNI_VERSION_VALUE#*v) #remove prefix 'v' in version # To include time stamp in version value, run: # make version_ts=true build From 570a50392595fe1b385baa2f017f8da6773918db Mon Sep 17 00:00:00 2001 From: "Shinai Yang (FA TALENT)" Date: Fri, 22 Mar 2019 11:32:02 +0800 Subject: [PATCH 25/41] remove prefix of version --- Makefile | 3 ++- deployment/pypi/Makefile | 4 ++-- 2 files changed, 4 insertions(+), 3 deletions(-) diff --git a/Makefile b/Makefile index 351e1e0757..687e7a34e1 100644 --- a/Makefile +++ b/Makefile @@ -58,7 +58,8 @@ NNI_YARN 
:= PATH=$(BIN_FOLDER):$${PATH} $(NNI_YARN_FOLDER)/bin/yarn ## Version number NNI_VERSION_VALUE = $(shell git describe --tags --abbrev=0) -NNI_VERSION_VALUE = $(NNI_VERSION_VALUE#*v) #remove prefix 'v' in version +NNI_VERSION_VALUE_WITH_PREFIX = $(shell git describe --tags --abbrev=0) +NNI_VERSION_VALUE = $(NNI_VERSION_VALUE_WITH_PREFIX:v%=%) #remove prefix 'v' in version NNI_VERSION_TEMPLATE = 999.0.0-developing # Main targets diff --git a/deployment/pypi/Makefile b/deployment/pypi/Makefile index 07ea2526c6..d777c3d523 100644 --- a/deployment/pypi/Makefile +++ b/deployment/pypi/Makefile @@ -12,8 +12,8 @@ else endif TIME_STAMP = $(shell date -u "+%y%m%d%H%M") -NNI_VERSION_VALUE = $(shell git describe --tags --abbrev=0) -NNI_VERSION_VALUE = $(NNI_VERSION_VALUE#*v) #remove prefix 'v' in version +NNI_VERSION_VALUE_WITH_PREFIX = $(shell git describe --tags --abbrev=0) +NNI_VERSION_VALUE = $(NNI_VERSION_VALUE_WITH_PREFIX:v%=%) #remove prefix 'v' in version # To include time stamp in version value, run: # make version_ts=true build From ed04fc5325c3cad03b8ce834899f30baea22664d Mon Sep 17 00:00:00 2001 From: "Shinai Yang (FA TALENT)" Date: Fri, 22 Mar 2019 11:47:04 +0800 Subject: [PATCH 26/41] add strip() in trialkeeper --- tools/nni_trial_tool/trial_keeper.py | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/tools/nni_trial_tool/trial_keeper.py b/tools/nni_trial_tool/trial_keeper.py index a949688f2a..770df98d97 100644 --- a/tools/nni_trial_tool/trial_keeper.py +++ b/tools/nni_trial_tool/trial_keeper.py @@ -119,7 +119,7 @@ def check_version(args): if not args.version: # skip version check nni_log(LogType.Warning, 'Skipping version check!') - elif trial_keeper_version != args.version: + elif trial_keeper_version.strip() != args.version.strip(): nni_log(LogType.Error, 'Exit trial keeper, trial keeper version is {}, and trainingService version is {}, \ versions does not match, please check your code and image versions!'.format(trial_keeper_version, 
args.version)) os._exit(1) From 3a99187c8c8a2e656807cc48ddbb66af96275aa7 Mon Sep 17 00:00:00 2001 From: "Shinai Yang (FA TALENT)" Date: Fri, 22 Mar 2019 21:57:24 +0800 Subject: [PATCH 27/41] fix remote pipeline --- test/pipelines-it-remote.yml | 19 +++++++++---------- test/remote_docker.py | 4 +--- 2 files changed, 10 insertions(+), 13 deletions(-) diff --git a/test/pipelines-it-remote.yml b/test/pipelines-it-remote.yml index 50f1c0229d..9b63fe953d 100644 --- a/test/pipelines-it-remote.yml +++ b/test/pipelines-it-remote.yml @@ -4,6 +4,12 @@ jobs: steps: - script: python3 -m pip install --upgrade pip setuptools --user displayName: 'Install python tools' + - script: | + cd deployment/pypi + echo 'building prerelease package...' + make build + ls $(Build.SourcesDirectory)/deployment/pypi/dist/ + displayName: 'build nni bdsit_wheel' - script: | source install.sh displayName: 'Install nni toolkit via source code' @@ -14,17 +20,10 @@ jobs: - task: CopyFilesOverSSH@0 inputs: sshEndpoint: remote_nni-ci-gpu-01 - sourceFolder: src/sdk/pynni - targetFolder: /tmp/nnitest/$(Build.BuildId)/pynni - overwrite: true - displayName: 'Copy sdk files to remote machine' - - task: CopyFilesOverSSH@0 - inputs: - sshEndpoint: remote_nni-ci-gpu-01 - sourceFolder: tools - targetFolder: /tmp/nnitest/$(Build.BuildId)/tools + sourceFolder: src/sdk/pynni/dist/ + targetFolder: /tmp/nnitest/$(Build.BuildId)/dist overwrite: true - displayName: 'Copy tool files to remote machine' + displayName: 'Copy dist files to remote machine' - task: CopyFilesOverSSH@0 inputs: sshEndpoint: remote_nni-ci-gpu-01 diff --git a/test/remote_docker.py b/test/remote_docker.py index e134182dea..9c43907d5e 100644 --- a/test/remote_docker.py +++ b/test/remote_docker.py @@ -28,10 +28,8 @@ def start_container(image, name): run_cmds = ['docker', 'run', '-d', '-p', str(port) + ':22', '--name', name, '--mount', 'type=bind,source=' + source_dir + ',target=/tmp/nni', image] output = check_output(run_cmds) commit_id = 
output.decode('utf-8') - sdk_cmds = ['docker', 'exec', name, 'python3', '-m', 'pip', 'install', '--user', '--no-cache-dir', '/tmp/nni/pynni/'] + sdk_cmds = ['docker', 'exec', name, 'python3', '-m', 'pip', 'install', '/tmp/nni/dist/nni-*.whl'] check_call(sdk_cmds) - tools_cmds = ['docker', 'exec', name, 'python3', '-m', 'pip', 'install', '--user', '--no-cache-dir', '/tmp/nni/tools'] - check_call(tools_cmds) with open(source_dir + '/port', 'w') as file: file.write(str(port)) From ebce7f22a69b411ffd6589c9225e72af017fc384 Mon Sep 17 00:00:00 2001 From: "Shinai Yang (FA TALENT)" Date: Fri, 22 Mar 2019 22:07:29 +0800 Subject: [PATCH 28/41] update makefile --- Makefile | 2 -- deployment/pypi/Makefile | 3 +-- test/generate_ts_config.py | 1 - tools/nni_cmd/launcher.py | 1 - 4 files changed, 1 insertion(+), 6 deletions(-) diff --git a/Makefile b/Makefile index 687e7a34e1..abb3d11670 100644 --- a/Makefile +++ b/Makefile @@ -58,8 +58,6 @@ NNI_YARN := PATH=$(BIN_FOLDER):$${PATH} $(NNI_YARN_FOLDER)/bin/yarn ## Version number NNI_VERSION_VALUE = $(shell git describe --tags --abbrev=0) -NNI_VERSION_VALUE_WITH_PREFIX = $(shell git describe --tags --abbrev=0) -NNI_VERSION_VALUE = $(NNI_VERSION_VALUE_WITH_PREFIX:v%=%) #remove prefix 'v' in version NNI_VERSION_TEMPLATE = 999.0.0-developing # Main targets diff --git a/deployment/pypi/Makefile b/deployment/pypi/Makefile index d777c3d523..b75cc3212c 100644 --- a/deployment/pypi/Makefile +++ b/deployment/pypi/Makefile @@ -12,8 +12,7 @@ else endif TIME_STAMP = $(shell date -u "+%y%m%d%H%M") -NNI_VERSION_VALUE_WITH_PREFIX = $(shell git describe --tags --abbrev=0) -NNI_VERSION_VALUE = $(NNI_VERSION_VALUE_WITH_PREFIX:v%=%) #remove prefix 'v' in version +NNI_VERSION_VALUE = $(shell git describe --tags --abbrev=0) # To include time stamp in version value, run: # make version_ts=true build diff --git a/test/generate_ts_config.py b/test/generate_ts_config.py index 8179334f22..d13c451a9e 100644 --- a/test/generate_ts_config.py +++ 
b/test/generate_ts_config.py @@ -66,7 +66,6 @@ def update_training_service_config(args): config[args.ts]['machineList'][0]['port'] = args.remote_port if args.remote_pwd is not None: config[args.ts]['machineList'][0]['passwd'] = args.remote_pwd - config[args.ts]['debug'] = True dump_yml_content(TRAINING_SERVICE_FILE, config) diff --git a/tools/nni_cmd/launcher.py b/tools/nni_cmd/launcher.py index d6e4b27d19..4b541ee5b5 100644 --- a/tools/nni_cmd/launcher.py +++ b/tools/nni_cmd/launcher.py @@ -512,7 +512,6 @@ def create_experiment(args): exit(1) experiment_config = get_yml_content(config_path) validate_all_content(experiment_config, config_path) - nni_config.set_config('experimentConfig', experiment_config) launch_experiment(args, experiment_config, 'new', config_file_name) nni_config.set_config('restServerPort', args.port) From 43b23817f5e63ecd56ef60785c3ef510ec450b79 Mon Sep 17 00:00:00 2001 From: "Shinai Yang (FA TALENT)" Date: Fri, 22 Mar 2019 22:18:45 +0800 Subject: [PATCH 29/41] set gpu num be 0 in kubeflow --- test/training_service.yml | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/test/training_service.yml b/test/training_service.yml index 5742c29987..159c756359 100644 --- a/test/training_service.yml +++ b/test/training_service.yml @@ -15,7 +15,7 @@ kubeflow: worker: replicas: 1 command: - gpuNum: 1 + gpuNum: 0 cpuNum: 1 memoryMB: 8192 image: From 54e4f9ad3121524e2b35f80012168d455b544d93 Mon Sep 17 00:00:00 2001 From: "Shinai Yang (FA TALENT)" Date: Fri, 22 Mar 2019 22:22:41 +0800 Subject: [PATCH 30/41] fix remote it --- test/pipelines-it-remote.yml | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/test/pipelines-it-remote.yml b/test/pipelines-it-remote.yml index 9b63fe953d..b3e4133ecd 100644 --- a/test/pipelines-it-remote.yml +++ b/test/pipelines-it-remote.yml @@ -20,7 +20,7 @@ jobs: - task: CopyFilesOverSSH@0 inputs: sshEndpoint: remote_nni-ci-gpu-01 - sourceFolder: src/sdk/pynni/dist/ + sourceFolder: 
/deployment/pypi/dist/ targetFolder: /tmp/nnitest/$(Build.BuildId)/dist overwrite: true displayName: 'Copy dist files to remote machine' From 1ee422ac4f4ad3ec02b00d71a5f96e07d5978af6 Mon Sep 17 00:00:00 2001 From: "Shinai Yang (FA TALENT)" Date: Mon, 25 Mar 2019 10:47:50 +0800 Subject: [PATCH 31/41] fix remote-it --- test/pipelines-it-remote.yml | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/test/pipelines-it-remote.yml b/test/pipelines-it-remote.yml index b3e4133ecd..10cdfa6a1e 100644 --- a/test/pipelines-it-remote.yml +++ b/test/pipelines-it-remote.yml @@ -20,7 +20,7 @@ jobs: - task: CopyFilesOverSSH@0 inputs: sshEndpoint: remote_nni-ci-gpu-01 - sourceFolder: /deployment/pypi/dist/ + sourceFolder: deployment/pypi/dist/ targetFolder: /tmp/nnitest/$(Build.BuildId)/dist overwrite: true displayName: 'Copy dist files to remote machine' From 188e9aee1a165d63e9a61021d807f94ce3702977 Mon Sep 17 00:00:00 2001 From: "Shinai Yang (FA TALENT)" Date: Mon, 25 Mar 2019 11:05:18 +0800 Subject: [PATCH 32/41] fix remote_docker.py --- test/remote_docker.py | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/test/remote_docker.py b/test/remote_docker.py index 9c43907d5e..3117c0e060 100644 --- a/test/remote_docker.py +++ b/test/remote_docker.py @@ -28,7 +28,7 @@ def start_container(image, name): run_cmds = ['docker', 'run', '-d', '-p', str(port) + ':22', '--name', name, '--mount', 'type=bind,source=' + source_dir + ',target=/tmp/nni', image] output = check_output(run_cmds) commit_id = output.decode('utf-8') - sdk_cmds = ['docker', 'exec', name, 'python3', '-m', 'pip', 'install', '/tmp/nni/dist/nni-*.whl'] + sdk_cmds = ['docker', 'exec', name, 'python3', '-m', 'pip', 'install', '/tmp/nni/dist/*.whl'] check_call(sdk_cmds) with open(source_dir + '/port', 'w') as file: file.write(str(port)) From 600dd96e995188ddda7c9a727c5a547204da70f1 Mon Sep 17 00:00:00 2001 From: "Shinai Yang (FA TALENT)" Date: Mon, 25 Mar 2019 11:27:21 +0800 Subject: [PATCH 33/41] fix 
remote_docker.py --- test/remote_docker.py | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/test/remote_docker.py b/test/remote_docker.py index 3117c0e060..3d054eb5b4 100644 --- a/test/remote_docker.py +++ b/test/remote_docker.py @@ -28,7 +28,7 @@ def start_container(image, name): run_cmds = ['docker', 'run', '-d', '-p', str(port) + ':22', '--name', name, '--mount', 'type=bind,source=' + source_dir + ',target=/tmp/nni', image] output = check_output(run_cmds) commit_id = output.decode('utf-8') - sdk_cmds = ['docker', 'exec', name, 'python3', '-m', 'pip', 'install', '/tmp/nni/dist/*.whl'] + sdk_cmds = ['docker', 'exec', name, 'python3', '-m', 'pip', 'install', '/tmp/nni/dist/nni-0.1-py3-none-manylinux1_x86_64.whl'] check_call(sdk_cmds) with open(source_dir + '/port', 'w') as file: file.write(str(port)) From 5f64369478d22941fd386fbc7577a9e9d93eb14b Mon Sep 17 00:00:00 2001 From: "Shinai Yang (FA TALENT)" Date: Mon, 25 Mar 2019 14:25:47 +0800 Subject: [PATCH 34/41] fix remote pipeline --- test/pipelines-it-remote.yml | 2 +- test/remote_docker.py | 8 ++++++-- 2 files changed, 7 insertions(+), 3 deletions(-) diff --git a/test/pipelines-it-remote.yml b/test/pipelines-it-remote.yml index 10cdfa6a1e..469b8a0df4 100644 --- a/test/pipelines-it-remote.yml +++ b/test/pipelines-it-remote.yml @@ -35,7 +35,7 @@ jobs: inputs: sshEndpoint: remote_nni-ci-gpu-01 runOptions: commands - commands: python3 /tmp/nnitest/$(Build.BuildId)/test/remote_docker.py --mode start --name $(Build.BuildId) --image nni/nni + commands: python3 /tmp/nnitest/$(Build.BuildId)/test/remote_docker.py --mode start --name $(Build.BuildId) --image nni/nni --tag $(git describe --tags --abbrev=0) displayName: 'Start docker' - task: DownloadSecureFile@1 inputs: diff --git a/test/remote_docker.py b/test/remote_docker.py index 3d054eb5b4..e3806fc492 100644 --- a/test/remote_docker.py +++ b/test/remote_docker.py @@ -3,6 +3,7 @@ from subprocess import check_output, check_call import socket import 
random +import re def detect_port(port): '''Detect if the port is used, return True if the port is used''' @@ -28,7 +29,9 @@ def start_container(image, name): run_cmds = ['docker', 'run', '-d', '-p', str(port) + ':22', '--name', name, '--mount', 'type=bind,source=' + source_dir + ',target=/tmp/nni', image] output = check_output(run_cmds) commit_id = output.decode('utf-8') - sdk_cmds = ['docker', 'exec', name, 'python3', '-m', 'pip', 'install', '/tmp/nni/dist/nni-0.1-py3-none-manylinux1_x86_64.whl'] + regular = re.compile('v?(?P<version>[0-9](\.[0-9]){0,2}).*') + version = regular.search(args.tag).group('version') + sdk_cmds = ['docker', 'exec', name, 'python3', '-m', 'pip', 'install', '/tmp/nni/dist/nni-{0}-py3-none-manylinux1_x86_64.whl'.format(version)] check_call(sdk_cmds) with open(source_dir + '/port', 'w') as file: file.write(str(port)) @@ -44,7 +47,8 @@ def stop_container(name): parser = argparse.ArgumentParser() parser.add_argument('--mode', required=True, choices=['start', 'stop'], dest='mode', help='start or stop a container') parser.add_argument('--name', required=True, dest='name', help='the name of container to be used') - parser.add_argument('--image', dest='image', help='the image to be used') + parser.add_argument('--image', required=True, dest='image', help='the image to be used') + parser.add_argument('--tag', required=True, help='the tag of branch, used in wheel name') args = parser.parse_args() if args.mode == 'start': start_container(args.image, args.name) From c0715a0b456c0a3d054f03206c232b688076d34d Mon Sep 17 00:00:00 2001 From: "Shinai Yang (FA TALENT)" Date: Mon, 25 Mar 2019 14:46:17 +0800 Subject: [PATCH 35/41] refactor remote_docker.py --- test/pipelines-it-remote.yml | 2 +- test/remote_docker.py | 16 ++++++++++++---- 2 files changed, 13 insertions(+), 5 deletions(-) diff --git a/test/pipelines-it-remote.yml b/test/pipelines-it-remote.yml index 469b8a0df4..10cdfa6a1e 100644 --- a/test/pipelines-it-remote.yml +++ b/test/pipelines-it-remote.yml @@ 
-35,7 +35,7 @@ jobs: inputs: sshEndpoint: remote_nni-ci-gpu-01 runOptions: commands - commands: python3 /tmp/nnitest/$(Build.BuildId)/test/remote_docker.py --mode start --name $(Build.BuildId) --image nni/nni --tag $(git describe --tags --abbrev=0) + commands: python3 /tmp/nnitest/$(Build.BuildId)/test/remote_docker.py --mode start --name $(Build.BuildId) --image nni/nni displayName: 'Start docker' - task: DownloadSecureFile@1 inputs: diff --git a/test/remote_docker.py b/test/remote_docker.py index e3806fc492..51de3c8f94 100644 --- a/test/remote_docker.py +++ b/test/remote_docker.py @@ -22,6 +22,13 @@ def find_port(): port = random.randint(10000, 20000) return port +def find_wheel_package(dir): + '''Find the wheel package uploaded to this machine''' + for file_name in os.listdir(dir): + if file_name.endswith('.whl'): + return file_name + return None + def start_container(image, name): '''Start docker container, generate a port in /tmp/nnitest/{name}/port file''' port = find_port() @@ -29,9 +36,11 @@ def start_container(image, name): run_cmds = ['docker', 'run', '-d', '-p', str(port) + ':22', '--name', name, '--mount', 'type=bind,source=' + source_dir + ',target=/tmp/nni', image] output = check_output(run_cmds) commit_id = output.decode('utf-8') - regular = re.compile('v?(?P<version>[0-9](\.[0-9]){0,2}).*') - version = regular.search(args.tag).group('version') - sdk_cmds = ['docker', 'exec', name, 'python3', '-m', 'pip', 'install', '/tmp/nni/dist/nni-{0}-py3-none-manylinux1_x86_64.whl'.format(version)] + wheel_name = find_wheel_package(os.path.join(source_dir, 'dist')) + if not wheel_name: + print('Error: could not find wheel package in {0}'.format(source_dir)) + exit(1) + sdk_cmds = ['docker', 'exec', name, 'python3', '-m', 'pip', 'install', '/tmp/nni/dist/{0}'.format(wheel_name)] check_call(sdk_cmds) with open(source_dir + '/port', 'w') as file: file.write(str(port)) @@ -48,7 +57,6 @@ def stop_container(name): parser.add_argument('--mode', required=True, 
choices=['start', 
'stop'], dest='mode', help='start or stop a container') parser.add_argument('--name', required=True, dest='name', help='the name of container to be used') parser.add_argument('--image', required=True, dest='image', help='the image to be used') - parser.add_argument('--tag', required=True, help='the tag of branch, used in wheel name') args = parser.parse_args() if args.mode == 'start': start_container(args.image, args.name) From db5ee0965e362471034b765efb0af24530801ea8 Mon Sep 17 00:00:00 2001 From: "Shinai Yang (FA TALENT)" Date: Mon, 25 Mar 2019 14:57:09 +0800 Subject: [PATCH 36/41] revert code --- test/training_service.yml | 2 +- tools/nni_cmd/launcher.py | 1 + 2 files changed, 2 insertions(+), 1 deletion(-) diff --git a/test/training_service.yml b/test/training_service.yml index 159c756359..5742c29987 100644 --- a/test/training_service.yml +++ b/test/training_service.yml @@ -15,7 +15,7 @@ kubeflow: worker: replicas: 1 command: - gpuNum: 0 + gpuNum: 1 cpuNum: 1 memoryMB: 8192 image: diff --git a/tools/nni_cmd/launcher.py b/tools/nni_cmd/launcher.py index 4b541ee5b5..d6e4b27d19 100644 --- a/tools/nni_cmd/launcher.py +++ b/tools/nni_cmd/launcher.py @@ -512,6 +512,7 @@ def create_experiment(args): exit(1) experiment_config = get_yml_content(config_path) validate_all_content(experiment_config, config_path) + nni_config.set_config('experimentConfig', experiment_config) launch_experiment(args, experiment_config, 'new', config_file_name) nni_config.set_config('restServerPort', args.port) From 86da5ef8522218e1976b1bf9efe2afa23b9f5f4b Mon Sep 17 00:00:00 2001 From: "Shinai Yang (FA TALENT)" Date: Mon, 25 Mar 2019 14:57:54 +0800 Subject: [PATCH 37/41] revert unused code --- tools/nni_cmd/launcher.py | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/tools/nni_cmd/launcher.py b/tools/nni_cmd/launcher.py index d6e4b27d19..ed22737bff 100644 --- a/tools/nni_cmd/launcher.py +++ b/tools/nni_cmd/launcher.py @@ -512,7 +512,7 @@ def create_experiment(args): exit(1) 
experiment_config = get_yml_content(config_path) validate_all_content(experiment_config, config_path) - + nni_config.set_config('experimentConfig', experiment_config) launch_experiment(args, experiment_config, 'new', config_file_name) nni_config.set_config('restServerPort', args.port) From 03601bad127a5044c35f18b3d4d493226373c329 Mon Sep 17 00:00:00 2001 From: "Shinai Yang (FA TALENT)" Date: Mon, 25 Mar 2019 15:05:43 +0800 Subject: [PATCH 38/41] revert unused code --- test/remote_docker.py | 1 - 1 file changed, 1 deletion(-) diff --git a/test/remote_docker.py b/test/remote_docker.py index 51de3c8f94..785e5644b2 100644 --- a/test/remote_docker.py +++ b/test/remote_docker.py @@ -3,7 +3,6 @@ from subprocess import check_output, check_call import socket import random -import re def detect_port(port): '''Detect if the port is used, return True if the port is used''' From d8ba386be8d614b8cf3fb50cbea365c58777793a Mon Sep 17 00:00:00 2001 From: "Shinai Yang (FA TALENT)" Date: Mon, 25 Mar 2019 15:08:02 +0800 Subject: [PATCH 39/41] revert unused code --- Makefile | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/Makefile b/Makefile index abb3d11670..03452dad2a 100644 --- a/Makefile +++ b/Makefile @@ -57,7 +57,7 @@ NNI_YARN_FOLDER ?= $(NNI_DEPENDENCY_FOLDER)/nni-yarn NNI_YARN := PATH=$(BIN_FOLDER):$${PATH} $(NNI_YARN_FOLDER)/bin/yarn ## Version number -NNI_VERSION_VALUE = $(shell git describe --tags --abbrev=0) +NNI_VERSION_VALUE = $(shell git describe --tags) NNI_VERSION_TEMPLATE = 999.0.0-developing # Main targets From 51fef76cf25b708a018a69d0f38603c832ee34f6 Mon Sep 17 00:00:00 2001 From: "Shinai Yang (FA TALENT)" Date: Mon, 25 Mar 2019 15:18:39 +0800 Subject: [PATCH 40/41] fix remote_docker.py --- test/remote_docker.py | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/test/remote_docker.py b/test/remote_docker.py index 785e5644b2..5cc12746fa 100644 --- a/test/remote_docker.py +++ b/test/remote_docker.py @@ -55,7 +55,7 @@ def 
stop_container(name): parser = argparse.ArgumentParser() parser.add_argument('--mode', required=True, choices=['start', 'stop'], dest='mode', help='start or stop a container') parser.add_argument('--name', required=True, dest='name', help='the name of container to be used') - parser.add_argument('--image', required=True, dest='image', help='the image to be used') + parser.add_argument('--image', dest='image', help='the image to be used') args = parser.parse_args() if args.mode == 'start': start_container(args.image, args.name) From 084774ea8313fec917c9c4720ce213a7fa9e3d8f Mon Sep 17 00:00:00 2001 From: "Shinai Yang (FA TALENT)" Date: Mon, 25 Mar 2019 16:07:20 +0800 Subject: [PATCH 41/41] fix remote_docker.py --- test/remote_docker.py | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/test/remote_docker.py b/test/remote_docker.py index 5cc12746fa..98f37a1444 100644 --- a/test/remote_docker.py +++ b/test/remote_docker.py @@ -3,6 +3,7 @@ from subprocess import check_output, check_call import socket import random +import re def detect_port(port): '''Detect if the port is used, return True if the port is used''' @@ -23,8 +24,9 @@ def find_port(): def find_wheel_package(dir): '''Find the wheel package uploaded to this machine''' + regular = re.compile('^nni-.*\.whl$') for file_name in os.listdir(dir): - if file_name.endswith('.whl'): + if regular.search(file_name): return file_name return None
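
Note on PATCH 41/41: the wheel lookup that this series ends up with in test/remote_docker.py can be exercised on its own. The following is a minimal standalone sketch, not the file as committed. It keeps the `^nni-.*\.whl$` pattern from the patch, but the `sorted()` call, the `WHEEL_PATTERN` name, the `directory` parameter name, and the `demo()` helper are assumptions added here for illustration (`os.listdir` returns entries in arbitrary order, so sorting only makes the sketch deterministic).

```python
import os
import re
import tempfile

# Same pattern as in PATCH 41/41: only files named nni-*.whl qualify.
WHEEL_PATTERN = re.compile(r'^nni-.*\.whl$')


def find_wheel_package(directory):
    '''Return the name of the first nni wheel in directory, or None.'''
    # sorted() is an addition for determinism; the patch iterates
    # os.listdir() directly.
    for file_name in sorted(os.listdir(directory)):
        if WHEEL_PATTERN.search(file_name):
            return file_name
    return None


def demo():
    '''Hypothetical helper: build a throwaway dist dir and search it.'''
    with tempfile.TemporaryDirectory() as dist_dir:
        names = (
            'README.txt',
            'other-1.0-py3-none-any.whl',
            'nni-0.5.2-py3-none-manylinux1_x86_64.whl',
        )
        for name in names:
            # Create empty placeholder files; contents do not matter here.
            open(os.path.join(dist_dir, name), 'w').close()
        return find_wheel_package(dist_dir)


if __name__ == '__main__':
    print(demo())
```

Compared with the `endswith('.whl')` check from PATCH 35/41, the anchored pattern skips unrelated wheels that may sit in the same dist directory, which appears to be the motivation for the final change.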