SparkSnail · SparkSnail · Jan 9, 2019 · Jan 8, 2019 · Jan 8, 2019 · Jan 8, 2019
diff --git a/docs/AdvancedNAS.md b/docs/AdvancedNAS.md
@@ -0,0 +1,87 @@
+# Tutorial for Advanced Neural Architecture Search
+Many NAS algorithms currently use **weight sharing** among trials to accelerate training. For example, [ENAS][1] reports a 1000x efficiency gain from '_parameter sharing between child models_' compared with the earlier [NASNet][2] algorithm. Other NAS algorithms such as [DARTS][3], [Network Morphism][4], and [Evolution][5] also leverage, or have the potential to leverage, weight sharing.
+
+This is a tutorial on how to enable weight sharing in NNI.
+
+## Weight Sharing among trials
+Currently we recommend sharing weights through NFS (Network File System), which supports sharing files across machines and is lightweight and (relatively) efficient. We also welcome contributions from the community on more efficient techniques.
+
+### Weight Sharing through NFS file
+With NFS set up (see below), trial code can share model weights by saving and loading files. We recommend that users pass the storage path to the tuner:
+```yaml
+tuner:
+  codeDir: path/to/customer_tuner
+  classFileName: customer_tuner.py 
+  className: CustomerTuner
+  classArgs:
+    ...
+    save_dir_root: /nfs/storage/path/
+```
+The tuner then decides where weights are saved and loaded, and passes those paths to trials through `nni.get_next_parameters()`:
+
+![weight_sharing_design](./img/weight_sharing.png)
+
+For example, in TensorFlow:
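+```python
+import os
+import tensorflow as tf  # TensorFlow 1.x API
+
+# `sess` is an open tf.Session; `params` comes from nni.get_next_parameters()
+# save the model under the path chosen by the tuner
+saver = tf.train.Saver()
+saver.save(sess, os.path.join(params['save_path'], 'model.ckpt'))
+# load the parent's weights from the path chosen by the tuner
+saver.restore(sess, os.path.join(params['restore_path'], 'model.ckpt'))
+```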
+where the `'save_path'` and `'restore_path'` entries of the hyper-parameters are managed by the tuner.
+
+### NFS Setup
+In NFS, files are physically stored on a server machine, and trials on client machines can read and write those files in the same way that they access local files.
+
+#### Install NFS on server machine
+First, install NFS server:
+```bash
+sudo apt-get install nfs-kernel-server
+```
+Suppose `/tmp/nni/shared` is used as the physical storage, then run:
+```bash
+sudo mkdir -p /tmp/nni/shared
+echo "/tmp/nni/shared *(rw,sync,no_subtree_check,no_root_squash)" | sudo tee -a /etc/exports
+sudo service nfs-kernel-server restart
+```
+You can check whether the directory above was successfully exported by NFS with `sudo showmount -e localhost`.
+
+#### Install NFS on client machine
+First, install NFS client:
+```bash
+sudo apt-get install nfs-common
+```
+Then create a mount point and mount the shared directory on it:
+```bash
+sudo mkdir -p /mnt/nfs/nni/
+sudo mount -t nfs 10.10.10.10:/tmp/nni/shared /mnt/nfs/nni
+```
+where `10.10.10.10` should be replaced by the actual IP address of the NFS server machine.
+
+## Asynchronous Dispatcher Mode for trial dependency control
+Weight sharing lets trials on different machines exchange weights, and in most cases **read after write** consistency must be guaranteed: a child model should not load the parent model before the parent trial finishes training. To handle this, users can enable **asynchronous dispatcher mode** with `multiThread: true` in `config.yml`. In this mode the dispatcher assigns a tuner thread each time a `NEW_TRIAL` request comes in, and the tuner thread decides when to submit a new trial by blocking and unblocking itself. For example:
+```python
+    # requires `import threading` at module level
+    def generate_parameters(self, parameter_id):
+        with self.thread_lock:
+            indiv = ...  # placeholder: build the configuration for a new trial
+            self.events[parameter_id] = threading.Event()
+        # wait outside the lock until the parent trial has reported its result
+        if indiv.parent_id is not None:
+            self.events[indiv.parent_id].wait()
+        # (return the parameters for the new trial here)
+
+    def receive_trial_result(self, parameter_id, parameters, reward):
+        with self.thread_lock:
+            pass  # placeholder: process the trial result
+        # wake up any child trials blocked on this trial
+        self.events[parameter_id].set()
+```
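+
+As a minimal sketch, the relevant part of `config.yml` might look as follows. This assumes `multiThread` is a top-level key of the experiment configuration and reuses the customized tuner settings from the NFS section above; all paths are placeholders.
+```yaml
+# config.yml (fragment); other experiment keys are omitted
+multiThread: true        # assumed to be a top-level key
+tuner:
+  codeDir: path/to/customer_tuner
+  classFileName: customer_tuner.py
+  className: CustomerTuner
+  classArgs:
+    save_dir_root: /nfs/storage/path/
+```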
+
+## Examples
+For details, please refer to this [simple weight sharing example](../test/async_sharing_test). We also provide a [practical example](../examples/trials/weight_sharing/ga_squad) for reading comprehension, based on the earlier [ga_squad](../examples/trials/ga_squad) example.
+
+[1]: https://arxiv.org/abs/1802.03268
+[2]: https://arxiv.org/abs/1707.07012
+[3]: https://arxiv.org/abs/1806.09055
+[4]: https://arxiv.org/abs/1806.10282
+[5]: https://arxiv.org/abs/1703.01041 
diff --git a/docs/FrameworkControllerMode.md b/docs/FrameworkControllerMode.md
@@ -0,0 +1,100 @@
+**Run an Experiment on FrameworkController**
+===
+NNI supports running experiments with [FrameworkController](https://github.com/Microsoft/frameworkcontroller); this is called frameworkcontroller mode. FrameworkController is built to orchestrate all kinds of applications on Kubernetes, so you do not need to install a kubeflow operator for a specific deep learning framework, such as tf-operator or pytorch-operator. You can use frameworkcontroller as the training service to run NNI experiments.
+
+## Prerequisite for on-premises Kubernetes Service
+1. A **Kubernetes** cluster using Kubernetes 1.8 or later. Follow this [guideline](https://kubernetes.io/docs/setup/) to set up Kubernetes.
+2. Prepare a **kubeconfig** file, which NNI uses to interact with your Kubernetes API server. By default, NNI manager uses $(HOME)/.kube/config as the kubeconfig file's path. You can also specify other kubeconfig files by setting the **KUBECONFIG** environment variable. Refer to this [guideline](https://kubernetes.io/docs/concepts/configuration/organize-cluster-access-kubeconfig) to learn more about kubeconfig.
+3. If your NNI trial job needs GPU resources, follow this [guideline](https://github.com/NVIDIA/k8s-device-plugin) to configure the **Nvidia device plugin for Kubernetes**.
+4. Prepare an **NFS server** and export a general purpose mount, or prepare **Azure File Storage**. For NFS, we recommend exporting your NFS server path with the `root_squash` option; otherwise permission issues may arise when NNI copies files to NFS. Refer to this [page](https://linux.die.net/man/5/exports) to learn what the `root_squash` option is.
+5. Install an **NFS client** on the machine where you install NNI and run nnictl to create experiments. Run this command to install the NFSv4 client:
+    ```
+    apt-get install nfs-common 
+    ```     
+
+6. Install **NNI** by following the install guide [here](GetStarted.md).
+
+## Prerequisite for Azure Kubernetes Service
+1. NNI supports Azure Kubernetes Service (AKS); follow the [guideline](https://azure.microsoft.com/en-us/services/kubernetes-service/) to set it up.
+2. Install [Azure CLI](https://docs.microsoft.com/en-us/cli/azure/install-azure-cli?view=azure-cli-latest) and __kubectl__. Use `az login` to set your Azure account, then connect the kubectl client to AKS by following this [guideline](https://docs.microsoft.com/en-us/azure/aks/kubernetes-walkthrough#connect-to-the-cluster).
+3. Follow the [guideline](https://docs.microsoft.com/en-us/azure/storage/common/storage-quickstart-create-account?tabs=portal) to create an Azure file storage account. If you use Azure Kubernetes Service, NNI needs Azure Storage Service to store code files and output files.
+4. To access Azure Storage Service, NNI needs the access key of the storage account, and it uses the [Azure Key Vault](https://azure.microsoft.com/en-us/services/key-vault/) Service to protect your private key. Set up Azure Key Vault Service and add a secret to Key Vault that stores the access key of the Azure storage account. Follow this [guideline](https://docs.microsoft.com/en-us/azure/key-vault/quick-create-cli) to store the access key.
+
+
+## Set up FrameworkController
+Follow the [guideline](https://github.com/Microsoft/frameworkcontroller/tree/master/example/run) to set up frameworkcontroller in the Kubernetes cluster. NNI supports frameworkcontroller through its statefulset mode.
+
+## Design
+Please refer to the design of the [kubeflow training service](./KubeflowMode.md); the frameworkcontroller training service pipeline is similar.
+
+## Example
+
+The frameworkcontroller config file format is:
+```
+authorName: default
+experimentName: example_mnist
+trialConcurrency: 1
+maxExecDuration: 10h
+maxTrialNum: 100
+#choice: local, remote, pai, kubeflow, frameworkcontroller
+trainingServicePlatform: frameworkcontroller
+searchSpacePath: ~/nni/examples/trials/mnist/search_space.json
+#choice: true, false
+useAnnotation: false
+tuner:
+  #choice: TPE, Random, Anneal, Evolution
+  builtinTunerName: TPE
+  classArgs:
+    #choice: maximize, minimize
+    optimize_mode: maximize
+assessor:
+  builtinAssessorName: Medianstop
+  classArgs:
+    optimize_mode: maximize
+  gpuNum: 0
+trial:
+  codeDir: ~/nni/examples/trials/mnist
+  taskRoles:
+    - name: worker
+      taskNum: 1
+      command: python3 mnist.py
+      gpuNum: 1
+      cpuNum: 1
+      memoryMB: 8192
+      image: msranni/nni:latest
+      frameworkAttemptCompletionPolicy:
+        minFailedTaskCount: 1
+        minSucceededTaskCount: 1
+frameworkcontrollerConfig:
+  storage: nfs
+  nfs:
+    server: {your_nfs_server}
+    path: {your_nfs_server_exported_path}
+```
+If you use Azure Kubernetes Service, set `frameworkcontrollerConfig` in your config yaml file as follows:
+```
+frameworkcontrollerConfig:
+  storage: azureStorage
+  keyVault:
+    vaultName: {your_vault_name}
+    name: {your_secret_name}
+  azureStorage:
+    accountName: {your_storage_account_name}
+    azureShare: {your_azure_share_name}
+```
+Note: You must explicitly set `trainingServicePlatform: frameworkcontroller` in the NNI config yaml file if you want to start an experiment in frameworkcontroller mode.
+
+The trial config format for NNI frameworkcontroller mode is a simplified version of frameworkcontroller's official config. Refer to the [tensorflow example of frameworkcontroller](https://github.com/Microsoft/frameworkcontroller/blob/master/example/framework/scenario/tensorflow/cpu/tensorflowdistributedtrainingwithcpu.yaml) for a deeper understanding.  
+Trial configuration in frameworkcontroller mode has the following configuration keys:
+* taskRoles: you can set multiple task roles in the config file; each task role is a basic unit to process in the Kubernetes cluster.
+   * name: the name of the task role, such as "worker", "ps", or "master".
+   * taskNum: the number of replicas of the task role.
+   * command: the user's command to run in the container.
+   * gpuNum: the number of GPU devices used in the container.
+   * cpuNum: the number of CPU devices used in the container.
+   * memoryMB: the memory limit of the container, in MB.
+   * image: the docker image used to create the pod and run the program.
+   * frameworkAttemptCompletionPolicy: the policy for completing a framework attempt; refer to the [user-manual](https://github.com/Microsoft/frameworkcontroller/blob/master/doc/user-manual.md#frameworkattemptcompletionpolicy) for details. Users can use the policy to control the pods. For example, if the workers stop but the ps does not, the completion policy can help stop the ps.
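+
+As an illustrative sketch of the keys above, a trial section with two task roles might look like the following. The `codeDir`, the commands, and the resource numbers are hypothetical, and the completion policy values are assumptions that should be checked against the FrameworkController user manual.
+```yaml
+trial:
+  codeDir: path/to/distributed_trial            # hypothetical path
+  taskRoles:
+    - name: worker
+      taskNum: 2
+      command: python3 train.py --role worker   # hypothetical command
+      gpuNum: 1
+      cpuNum: 1
+      memoryMB: 8192
+      image: msranni/nni:latest
+      frameworkAttemptCompletionPolicy:
+        minFailedTaskCount: 1
+        minSucceededTaskCount: 2                # assumed: complete once both workers succeed
+    - name: ps
+      taskNum: 1
+      command: python3 train.py --role ps       # hypothetical command
+      gpuNum: 0
+      cpuNum: 1
+      memoryMB: 8192
+      image: msranni/nni:latest
+      frameworkAttemptCompletionPolicy:
+        minFailedTaskCount: 1
+        minSucceededTaskCount: -1               # assumed: ps success never completes the attempt
+```
+With a policy along these lines, the attempt is meant to finish when the workers succeed, so a ps that never exits on its own does not keep the trial running.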
+
+## How to run example
+After you prepare a config file, you can run your experiment with nnictl. Starting an experiment on frameworkcontroller is similar to starting one on kubeflow; please refer to the [document](./KubeflowMode.md) for more information.