From d13964dc4444d24b73e3ec1f6f8250e385d72cec Mon Sep 17 00:00:00 2001
From: SparkSnail
Date: Tue, 8 Jan 2019 16:06:32 +0800
Subject: [PATCH] Add frameworkcontroller document (#530)

Add frameworkcontroller document.
Fix other document small issues.
---
 docs/FrameworkControllerMode.md | 100 ++++++++++++++++++++
 docs/KubeflowMode.md | 156 ++++++++++++++++++++++++--------
 2 files changed, 219 insertions(+), 37 deletions(-)
 create mode 100644 docs/FrameworkControllerMode.md

diff --git a/docs/FrameworkControllerMode.md b/docs/FrameworkControllerMode.md
new file mode 100644
index 0000000000..c54c33756b
--- /dev/null
+++ b/docs/FrameworkControllerMode.md
@@ -0,0 +1,100 @@
+**Run an Experiment on FrameworkController**
+===
+NNI supports running experiments using [FrameworkController](https://github.com/Microsoft/frameworkcontroller), called frameworkcontroller mode. FrameworkController is built to orchestrate all kinds of applications on Kubernetes, so you do not need to install Kubeflow or a framework-specific operator such as tf-operator or pytorch-operator. You can now use FrameworkController as the training service to run NNI experiments.
+
+## Prerequisite for on-premises Kubernetes Service
+1. A **Kubernetes** cluster using Kubernetes 1.8 or later. Follow this [guideline](https://kubernetes.io/docs/setup/) to set up Kubernetes.
+2. Prepare a **kubeconfig** file, which will be used by NNI to interact with your Kubernetes API server. By default, NNI manager uses $(HOME)/.kube/config as the kubeconfig file's path. You can also specify another kubeconfig file by setting the **KUBECONFIG** environment variable. Refer to this [guideline](https://kubernetes.io/docs/concepts/configuration/organize-cluster-access-kubeconfig) to learn more about kubeconfig.
+3. If your NNI trial jobs need GPU resources, follow this [guideline](https://github.com/NVIDIA/k8s-device-plugin) to configure the **Nvidia device plugin for Kubernetes**.
+4. 
Prepare an **NFS server** and export a general purpose mount (we recommend configuring the `root_squash` option on your NFS export, otherwise permission issues may arise when NNI copies files to NFS; refer to this [page](https://linux.die.net/man/5/exports) to learn what the root_squash option is), or **Azure File Storage**.
+5. Install the **NFS client** on the machine where you install NNI and run nnictl to create experiments. Run this command to install the NFSv4 client:
+    ```
+    apt-get install nfs-common
+    ```
+
+6. Install **NNI**, following the install guide [here](GetStarted.md).
+
+## Prerequisite for Azure Kubernetes Service
+1. NNI supports experiments based on Azure Kubernetes Service; follow the [guideline](https://azure.microsoft.com/en-us/services/kubernetes-service/) to set up Azure Kubernetes Service.
+2. Install the [Azure CLI](https://docs.microsoft.com/en-us/cli/azure/install-azure-cli?view=azure-cli-latest) and __kubectl__. Use `az login` to set your Azure account and connect the kubectl client to AKS; refer to this [guideline](https://docs.microsoft.com/en-us/azure/aks/kubernetes-walkthrough#connect-to-the-cluster).
+3. Follow the [guideline](https://docs.microsoft.com/en-us/azure/storage/common/storage-quickstart-create-account?tabs=portal) to create an Azure file storage account. If you use Azure Kubernetes Service, NNI needs the Azure Storage Service to store code files and output files.
+4. To access the Azure storage service, NNI needs the access key of the storage account, and NNI uses the [Azure Key Vault](https://azure.microsoft.com/en-us/services/key-vault/) Service to protect your private key. Set up the Azure Key Vault Service and add a secret to Key Vault to store the access key of the Azure storage account. Follow this [guideline](https://docs.microsoft.com/en-us/azure/key-vault/quick-create-cli) to store the access key. 
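The kubeconfig lookup described in the on-premises prerequisites (honor **KUBECONFIG** if set, otherwise fall back to $(HOME)/.kube/config) can be sketched in a few lines; `resolve_kubeconfig` is an illustrative helper written for this document, not part of NNI:

```python
import os

def resolve_kubeconfig(env=None):
    """Return the kubeconfig path following the lookup order in the
    prerequisites: honor the KUBECONFIG environment variable if set,
    otherwise fall back to the default $(HOME)/.kube/config."""
    env = os.environ if env is None else env
    explicit = env.get("KUBECONFIG")
    if explicit:
        return explicit
    home = env.get("HOME", os.path.expanduser("~"))
    return os.path.join(home, ".kube", "config")

# With KUBECONFIG set, the explicit path wins.
print(resolve_kubeconfig({"KUBECONFIG": "/tmp/admin.conf", "HOME": "/home/nni"}))  # → /tmp/admin.conf
# Without it, the default location is used.
print(resolve_kubeconfig({"HOME": "/home/nni"}))  # → /home/nni/.kube/config
```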
+
+
+## Set up FrameworkController
+Follow the [guideline](https://github.com/Microsoft/frameworkcontroller/tree/master/example/run) to set up FrameworkController in your Kubernetes cluster. NNI supports FrameworkController in statefulset mode.
+
+## Design
+Please refer to the design of the [kubeflow training service](./KubeflowMode.md); the frameworkcontroller training service pipeline is similar.
+
+## Example
+
+The frameworkcontroller config file format is:
+```
+authorName: default
+experimentName: example_mnist
+trialConcurrency: 1
+maxExecDuration: 10h
+maxTrialNum: 100
+#choice: local, remote, pai, kubeflow, frameworkcontroller
+trainingServicePlatform: frameworkcontroller
+searchSpacePath: ~/nni/examples/trials/mnist/search_space.json
+#choice: true, false
+useAnnotation: false
+tuner:
+  #choice: TPE, Random, Anneal, Evolution
+  builtinTunerName: TPE
+  classArgs:
+    #choice: maximize, minimize
+    optimize_mode: maximize
+assessor:
+  builtinAssessorName: Medianstop
+  classArgs:
+    optimize_mode: maximize
+  gpuNum: 0
+trial:
+  codeDir: ~/nni/examples/trials/mnist
+  taskRoles:
+    - name: worker
+      taskNum: 1
+      command: python3 mnist.py
+      gpuNum: 1
+      cpuNum: 1
+      memoryMB: 8192
+      image: msranni/nni:latest
+      frameworkAttemptCompletionPolicy:
+        minFailedTaskCount: 1
+        minSucceededTaskCount: 1
+frameworkcontrollerConfig:
+  storage: nfs
+  nfs:
+    server: {your_nfs_server}
+    path: {your_nfs_server_exported_path}
+```
+If you use Azure Kubernetes Service, you should set `frameworkcontrollerConfig` in your config yaml file as follows:
+```
+frameworkcontrollerConfig:
+  storage: azureStorage
+  keyVault:
+    vaultName: {your_vault_name}
+    name: {your_secret_name}
+  azureStorage:
+    accountName: {your_storage_account_name}
+    azureShare: {your_azure_share_name}
+```
+Note: You should explicitly set `trainingServicePlatform: frameworkcontroller` in the nni config yaml file if you want to start an experiment in frameworkcontroller mode. 
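Since the per-task-role keys in the example above are fixed, a quick pre-flight check can catch a missing key before an experiment is submitted. This is a hypothetical validator written for this document, not code shipped with NNI:

```python
# Hypothetical pre-flight check for the trial section of a
# frameworkcontroller config; the key list mirrors the example above.
REQUIRED_ROLE_KEYS = {
    "name", "taskNum", "command", "gpuNum", "cpuNum",
    "memoryMB", "image", "frameworkAttemptCompletionPolicy",
}

def validate_trial(trial):
    """Return a list of human-readable problems, empty if the config looks sane."""
    problems = []
    roles = trial.get("taskRoles", [])
    if not roles:
        problems.append("trial.taskRoles must define at least one task role")
    for role in roles:
        missing = REQUIRED_ROLE_KEYS - role.keys()
        if missing:
            problems.append(f"task role {role.get('name', '?')} is missing keys: {sorted(missing)}")
    return problems

# The worker role from the example config, expressed as a Python dict.
trial = {
    "codeDir": "~/nni/examples/trials/mnist",
    "taskRoles": [{
        "name": "worker", "taskNum": 1, "command": "python3 mnist.py",
        "gpuNum": 1, "cpuNum": 1, "memoryMB": 8192, "image": "msranni/nni:latest",
        "frameworkAttemptCompletionPolicy": {
            "minFailedTaskCount": 1, "minSucceededTaskCount": 1,
        },
    }],
}
print(validate_trial(trial))  # → []
```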
+
+The trial's config format for nni frameworkcontroller mode is a simplified version of FrameworkController's official config; you can refer to the [tensorflow example of frameworkcontroller](https://github.com/Microsoft/frameworkcontroller/blob/master/example/framework/scenario/tensorflow/cpu/tensorflowdistributedtrainingwithcpu.yaml) for a deeper understanding.
+Trial configuration in frameworkcontroller mode has the following configuration keys:
+* taskRoles: you can set multiple task roles in the config file; each task role is a basic unit of work scheduled in the Kubernetes cluster.
+    * name: the name of the task role, like "worker", "ps", "master".
+    * taskNum: the replica number of the task role.
+    * command: the user's command to be run in the container.
+    * gpuNum: the number of GPU devices used in the container.
+    * cpuNum: the number of CPU devices used in the container.
+    * memoryMB: the memory limitation to be specified for the container.
+    * image: the docker image used to create the pod and run the program.
+    * frameworkAttemptCompletionPolicy: the policy that decides when a framework attempt is completed; please refer to the [user-manual](https://github.com/Microsoft/frameworkcontroller/blob/master/doc/user-manual.md#frameworkattemptcompletionpolicy) for the specific information. Users can use this policy to control the pods; for example, if the ps role never stops on its own and only the worker stops, the completion policy can be used to stop the ps as well.
+
+## How to run example
+After you have prepared a config file, you can run your experiment with nnictl. The way to start an experiment on frameworkcontroller is similar to kubeflow; please refer to the [document](./KubeflowMode.md) for more information.
\ No newline at end of file
diff --git a/docs/KubeflowMode.md b/docs/KubeflowMode.md
index 2c4721b971..2beb69b59a 100644
--- a/docs/KubeflowMode.md
+++ b/docs/KubeflowMode.md
@@ -1,6 +1,6 @@
 **Run an Experiment on Kubeflow**
 ===
-Now NNI supports running experiment on [Kubeflow](https://github.com/kubeflow/kubeflow), called kubeflow mode. 
Before starting to use NNI kubeflow mode, you should have a kubernetes cluster, either on-prem or [Azure Kubernetes Service(AKS)](https://azure.microsoft.com/en-us/services/kubernetes-service/), a Ubuntu machine on which [kubeconfig](https://kubernetes.io/docs/concepts/configuration/organize-cluster-access-kubeconfig/) is setup to connect to your kubernetes cluster. If you are not familiar with kubernetes, [here](https://kubernetes.io/docs/tutorials/kubernetes-basics/) is a goot start. In kubeflow mode, your trial program will run as kubeflow job in kubernetes cluster.
+Now NNI supports running experiments on [Kubeflow](https://github.com/kubeflow/kubeflow), called kubeflow mode. Before starting to use NNI kubeflow mode, you should have a kubernetes cluster, either on-prem or [Azure Kubernetes Service(AKS)](https://azure.microsoft.com/en-us/services/kubernetes-service/), and an Ubuntu machine on which [kubeconfig](https://kubernetes.io/docs/concepts/configuration/organize-cluster-access-kubeconfig/) is set up to connect to your kubernetes cluster. If you are not familiar with kubernetes, [here](https://kubernetes.io/docs/tutorials/kubernetes-basics/) is a good start. In kubeflow mode, your trial program will run as a kubeflow job in the kubernetes cluster.
 
 ## Prerequisite for on-premises Kubernetes Service
 1. A **Kubernetes** cluster using Kubernetes 1.8 or later. Follow this [guideline](https://kubernetes.io/docs/setup/) to set up Kubernetes
@@ -28,64 +28,143 @@ Kubeflow training service instantiates a kubernetes rest client to interact with
 
 For each trial, we will upload all the files in your local codeDir path (configured in nni_config.yaml), together with NNI-generated files like parameter.cfg, into a storage volume. Right now we support two kinds of storage volumes: [nfs](https://en.wikipedia.org/wiki/Network_File_System) and [azure file storage](https://azure.microsoft.com/en-us/services/storage/files/); you should configure the storage volume in the nni config yaml file. 
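The per-trial preparation described above (copying codeDir plus NNI-generated files such as parameter.cfg into the storage volume) can be sketched as follows. The directory layout, file name, and parameter content here are illustrative assumptions, not NNI's exact implementation:

```python
import json, shutil, tempfile
from pathlib import Path

def stage_trial(code_dir: Path, storage_root: Path, trial_id: str, params: dict) -> Path:
    """Copy the user's codeDir into the storage volume and write the
    generated hyper-parameter file next to it, as the text describes.
    Layout and file name are illustrative, not NNI's exact ones."""
    trial_dir = storage_root / trial_id
    shutil.copytree(code_dir, trial_dir)                           # user code
    (trial_dir / "parameter.cfg").write_text(json.dumps(params))   # NNI-generated file
    return trial_dir

# Demo with throwaway directories standing in for codeDir and the NFS mount.
code = Path(tempfile.mkdtemp())
(code / "mnist.py").write_text("print('train')")
nfs = Path(tempfile.mkdtemp())
trial = stage_trial(code, nfs, "trial_0001", {"learning_rate": 0.01})
print(sorted(p.name for p in trial.iterdir()))  # → ['mnist.py', 'parameter.cfg']
```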
After files are prepared, the Kubeflow training service will call the K8S rest API to create kubeflow jobs ([tf-operator](https://github.com/kubeflow/tf-operator) jobs or [pytorch-operator](https://github.com/kubeflow/pytorch-operator) jobs) in K8S, and mount your storage volume into the job's pod. Output files of the kubeflow job, like stdout, stderr, trial.log or model files, will also be copied back to the storage volume. NNI will show the storage volume's URL for each trial in the WebUI, to allow users to browse the log files and the job's output files.
+
+## Supported operator
+NNI only supports the tf-operator and pytorch-operator of kubeflow; other operators are not tested.
+Users can set the operator type in the config file.
+The setting for tf-operator:
+```
+kubeflowConfig:
+  operator: tf-operator
+```
+The setting for pytorch-operator:
+```
+kubeflowConfig:
+  operator: pytorch-operator
+```
+If you want to use tf-operator, you can set `ps` and `worker` in the trial config. If you want to use pytorch-operator, you can set `master` and `worker` in the trial config.
+
+## Supported storage type
+NNI supports NFS and Azure Storage to store the code and output files; users can set the storage type in the config file and set the corresponding config.
+The setting for NFS storage is as follows:
+```
+kubeflowConfig:
+  storage: nfs
+  nfs:
+    # Your NFS server IP, like 10.10.10.10
+    server: {your_nfs_server_ip}
+    # Your NFS server export path, like /var/nfs/nni
+    path: {your_nfs_server_export_path}
+```
+If you use Azure storage, you should set `kubeflowConfig` in your config yaml file as follows:
+```
+kubeflowConfig:
+  storage: azureStorage
+  keyVault:
+    vaultName: {your_vault_name}
+    name: {your_secret_name}
+  azureStorage:
+    accountName: {your_storage_account_name}
+    azureShare: {your_azure_share_name}
+```
+
+
 ## Run an experiment
-Use `examples/trials/mnist` as an example. The nni config yaml file's content is like:
+Use `examples/trials/mnist` as an example. 
This is a tensorflow job that uses the tf-operator of kubeflow. The nni config yaml file's content is like:
 ```
-authorName: your_name
+authorName: default
 experimentName: example_mnist
-# how many trials could be concurrently running
-trialConcurrency: 4
-# maximum experiment running duration
-maxExecDuration: 3h
-# empty means never stop
-maxTrialNum: 100
-# choice: local, remote, pai, kubeflow
+trialConcurrency: 2
+maxExecDuration: 1h
+maxTrialNum: 20
+#choice: local, remote, pai, kubeflow
 trainingServicePlatform: kubeflow
-# choice: true, false
+searchSpacePath: search_space.json
+#choice: true, false
 useAnnotation: false
 tuner:
+  #choice: TPE, Random, Anneal, Evolution
   builtinTunerName: TPE
   classArgs:
     #choice: maximize, minimize
     optimize_mode: maximize
+assessor:
+  builtinAssessorName: Medianstop
+  classArgs:
+    optimize_mode: maximize
+  gpuNum: 0
 trial:
-  codeDir: ~/nni/examples/trials/mnist
-  ps:
-    replicas: 1
-    command: python mnist-keras.py
-    gpuNum: 0
+  codeDir: .
+  worker:
+    replicas: 2
+    command: python3 dist_mnist.py
+    gpuNum: 1
     cpuNum: 1
     memoryMB: 8196
-    image: {your_docker_image_for_tensorflow_ps}
-  worker:
-    replicas: 1
-    command: python mnist-keras.py
-    gpuNum: 2
+    image: msranni/nni:latest
+  ps:
+    replicas: 1
+    command: python3 dist_mnist.py
+    gpuNum: 0
     cpuNum: 1
     memoryMB: 8196
-    image: {your_docker_image_for_tensorflow_worker}
+    image: msranni/nni:latest
 kubeflowConfig:
   operator: tf-operator
+  apiVersion: v1alpha2
   storage: nfs
   nfs:
-    server: {your_nfs_server}
-    path: {your_nfs_server_exported_path}
+    # Your NFS server IP, like 10.10.10.10
+    server: {your_nfs_server_ip}
+    # Your NFS server export path, like /var/nfs/nni
+    path: {your_nfs_server_export_path}
 ```
-If you use Azure Kubernetes Service, you should set `kubeflowConfig` in your config yaml file as follows:
+
+Note: You should explicitly set `trainingServicePlatform: kubeflow` in the nni config yaml file if you want to start an experiment in kubeflow mode. 
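As the operator section explains, the roles you may configure depend on the chosen operator: `ps`/`worker` for tf-operator and `master`/`worker` for pytorch-operator. A small sketch of that pairing (`check_roles` is our own illustrative helper, not NNI code):

```python
# Role names each supported operator expects, per the text above.
OPERATOR_ROLES = {
    "tf-operator": {"ps", "worker"},
    "pytorch-operator": {"master", "worker"},
}

def check_roles(operator: str, configured_roles: set) -> set:
    """Return the configured roles the given operator does not understand."""
    allowed = OPERATOR_ROLES.get(operator)
    if allowed is None:
        raise ValueError(f"unsupported operator: {operator}")
    return configured_roles - allowed

print(check_roles("tf-operator", {"ps", "worker"}))       # → set()
print(check_roles("pytorch-operator", {"ps", "worker"}))  # → {'ps'}
```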
+
+If you want to run Pytorch jobs, you could set your config file as follows:
 ```
+authorName: default
+experimentName: example_mnist_distributed_pytorch
+trialConcurrency: 1
+maxExecDuration: 1h
+maxTrialNum: 10
+#choice: local, remote, pai, kubeflow
+trainingServicePlatform: kubeflow
+searchSpacePath: search_space.json
+#choice: true, false
+useAnnotation: false
+tuner:
+  #choice: TPE, Random, Anneal, Evolution
+  builtinTunerName: TPE
+  classArgs:
+    #choice: maximize, minimize
+    optimize_mode: minimize
+trial:
+  codeDir: .
+  master:
+    replicas: 1
+    command: python3 dist_mnist.py
+    gpuNum: 1
+    cpuNum: 1
+    memoryMB: 2048
+    image: msranni/nni:latest
+  worker:
+    replicas: 1
+    command: python3 dist_mnist.py
+    gpuNum: 0
+    cpuNum: 1
+    memoryMB: 2048
+    image: msranni/nni:latest
+kubeflowConfig:
-  operator: tf-operator
-  storage: azureStorage
-  keyVault:
-    vaultName: {your_vault_name}
-    name: {your_secert_name}
-  azureStorage:
-    accountName: {your_storage_account_name}
-    azureShare: {your_azure_share_name}
+  operator: pytorch-operator
+  apiVersion: v1alpha2
+  nfs:
+    # Your NFS server IP, like 10.10.10.10
+    server: {your_nfs_server_ip}
+    # Your NFS server export path, like /var/nfs/nni
+    path: {your_nfs_server_export_path}
 ```
-Note: You should explicitly set `trainingServicePlatform: kubeflow` in nni config yaml file if you want to start experiment in kubeflow mode.
-
 Trial configuration in kubeflow mode has the following configuration keys:
 * codeDir
   * code directory, where you put training code and config files
@@ -100,14 +179,17 @@ Trial configuration in kubeflow mode have the following configuration keys:
 * gpuNum
 * image
   * Required key. In kubeflow mode, your trial program will be scheduled by Kubernetes to run in [Pod](https://kubernetes.io/docs/concepts/workloads/pods/pod/). This key is used to specify the Docker image used to create the pod where your trial program will run. 
-  * We already build a docker image [nnimsra/nni](https://hub.docker.com/r/msranni/nni/) on [Docker Hub](https://hub.docker.com/). It contains NNI python packages, Node modules and javascript artifact files required to start experiment, and all of NNI dependencies. The docker file used to build this image can be found at [here](../deployment/Dockerfile.build.base). You can either use this image directly in your config file, or build your own image based on it.
+  * We have already built a docker image [msranni/nni](https://hub.docker.com/r/msranni/nni/) on [Docker Hub](https://hub.docker.com/). It contains the NNI python packages, the Node modules and javascript artifact files required to start an experiment, and all of NNI's dependencies. The docker file used to build this image can be found [here](../deployment/Dockerfile.build.base). You can either use this image directly in your config file, or build your own image based on it.
+* apiVersion
+  * Required key. The API version of your kubeflow, like `v1alpha2` in the examples above.
 * ps (optional). This config section is used to configure the tensorflow parameter server role.
+* master (optional). This config section is used to configure the pytorch parameter server role.
 
 Once you have finished filling in the nni experiment config file, save it (for example, as exp_kubeflow.yaml) and run the following command
 ```
 nnictl create --config exp_kubeflow.yaml
 ```
-to start the experiment in kubeflow mode. NNI will create Kubeflow tfjob for each trial, and the job name format is something like `nni_exp_{experiment_id}_trial_{trial_id}`.
+to start the experiment in kubeflow mode. NNI will create a Kubeflow tfjob or pytorchjob for each trial, and the job name format is something like `nni_exp_{experiment_id}_trial_{trial_id}`.
 
 You can see the kubeflow tfjob created by NNI in your Kubernetes dashboard.
 
 Notice: In kubeflow mode, NNIManager will start a rest server and listen on a port which is your NNI WebUI's port plus 1. 
For example, if your WebUI port is `8080`, the rest server will listen on `8081` to receive metrics from trial jobs running in Kubernetes. So you should open TCP port `8081` in your firewall rules to allow incoming traffic.
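The port rule in the notice can be captured in one line; `nni_rest_port` is an illustrative name, not an NNI API:

```python
def nni_rest_port(webui_port: int) -> int:
    """In kubeflow mode the NNIManager rest server listens on the NNI
    WebUI's port plus one; that port must be open in the firewall."""
    return webui_port + 1

print(nni_rest_port(8080))  # → 8081
```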