add doc for one-shot training modes of NAS (#1254)

* update nas doc
microsoft · Aug 5, 2019 · 1fda860 · 1fda860
1 parent 1dab311
commit 1fda860
Showing 3 changed files with 117 additions and 27 deletions.
diff --git a/docs/en_US/AdvancedFeature/GeneralNasInterfaces.md b/docs/en_US/AdvancedFeature/GeneralNasInterfaces.md
@@ -1,11 +1,13 @@
 # General Programming Interface for Neural Architecture Search (experimental feature)
 
-_*This is an experimental feature, currently, we only implemented the general NAS programming interface. Weight sharing and one-shot NAS based on this programming interface will be supported in the following releases._
+_*This is an experimental feature. Currently, only the general NAS programming interface is implemented. Weight sharing will be supported in the following releases._
 
 Automatic neural architecture search (NAS) is playing an increasingly important role in finding better models. Recent research has demonstrated the feasibility of automatic NAS and has found models that beat manually designed and tuned ones. Representative works include [NASNet][2], [ENAS][1], [DARTS][3], [Network Morphism][4], and [Evolution][5], and new innovations keep emerging. However, implementing these algorithms takes great effort, and it is hard to reuse the code base of one algorithm to implement another.
 
 To facilitate NAS innovations (e.g., design/implement new NAS models, compare different NAS models side-by-side), an easy-to-use and flexible programming interface is crucial.
 
+<a name="ProgInterface"></a>
+
 ## Programming interface
 
  A new programming interface for designing and searching for a model is often needed in two scenarios. 1) When designing a neural network, the designer may have multiple choices for a layer, sub-model, or connection, and may not be sure which one, or which combination, performs best. It would be appealing to have an easy way to express the candidate layers/sub-models they want to try. 2) Researchers working on automatic NAS want a unified way to express the search space of neural architectures, so that unchanged trial code can be adapted to different search algorithms.
@@ -53,13 +55,16 @@ After finishing the trial code through the annotation above, users have implicit
 ```javascript
 {
     "mutable_1": {
-        "layer_1": {
-            "layer_choice": ["conv(ch=128)", "pool", "identity"],
-            "optional_inputs": ["out1", "out2", "out3"],
-            "optional_input_size": 2
-        },
-        "layer_2": {
-            ...
+        "_type": "mutable_layer",
+        "_value": {
+            "layer_1": {
+                "layer_choice": ["conv(ch=128)", "pool", "identity"],
+                "optional_inputs": ["out1", "out2", "out3"],
+                "optional_input_size": 2
+            },
+            "layer_2": {
+                ...
+            }
         }
     }
 }
@@ -83,9 +88,109 @@ Accordingly, a specified neural architecture (generated by tuning algorithm) is
 
 With the specification of the format of search space and architecture (choice) expression, users are free to implement various (general) tuning algorithms for neural architecture search on NNI. One future work is to provide a general NAS algorithm.
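
As a rough illustration of what a general tuning algorithm consumes and produces, the sketch below draws a random architecture from a search space in the `_type`/`_value` form shown above. It is not NNI's tuner API: the function name `sample_architecture` and the output keys `chosen_layer` and `chosen_inputs` are assumptions made for this example, and the elided `layer_2` entry is simply left out.

```python
import random

# Search space in the format shown above (the elided "layer_2" entry is omitted).
search_space = {
    "mutable_1": {
        "_type": "mutable_layer",
        "_value": {
            "layer_1": {
                "layer_choice": ["conv(ch=128)", "pool", "identity"],
                "optional_inputs": ["out1", "out2", "out3"],
                "optional_input_size": 2,
            }
        },
    }
}


def sample_architecture(space, rng):
    """Pick one candidate op and `optional_input_size` inputs for every layer."""
    arch = {}
    for block_name, block in space.items():
        if block.get("_type") != "mutable_layer":
            continue  # skip entries that are not mutable layers
        arch[block_name] = {}
        for layer_name, layer in block["_value"].items():
            arch[block_name][layer_name] = {
                # "chosen_layer" / "chosen_inputs" are assumed key names.
                "chosen_layer": rng.choice(layer["layer_choice"]),
                "chosen_inputs": rng.sample(
                    layer["optional_inputs"], layer["optional_input_size"]
                ),
            }
    return arch


print(sample_architecture(search_space, random.Random(0)))
```

A real tuner would replace the random draws with a learned policy (for example a controller), but the input and output shapes would stay comparable.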
 
+## Support of One-Shot NAS
+
+One-Shot NAS is a popular approach for finding a good neural architecture within a limited time and resource budget. It builds a full graph based on the search space and uses gradient descent to eventually find the best subgraph. There are different training approaches, such as [training subgraphs (per mini-batch)][1], [training the full graph through dropout][6], and [training with architecture weights (regularization)][3].
+
+NNI supports general NAS as demonstrated above. From the user's point of view, One-Shot NAS and NAS have the same search space specification, so they can share the same programming interface and differ only in training mode. NNI provides four training modes:
+
+**\*classic_mode\***: this mode is described [above](#ProgInterface). In this mode, each subgraph runs as a trial job. To use it, enable NNI annotation and specify a tuner for NAS in the experiment config file. [Here](https://github.com/microsoft/nni/tree/master/examples/trials/mnist-nas) is an example showing how to write the trial code and the config file, and [here](https://github.com/microsoft/nni/tree/master/examples/tuners/random_nas_tuner) is a simple tuner for NAS.
+
+**\*enas_mode\***: follows the training approach in the [ENAS paper][1]. It builds the full graph based on the neural architecture search space and, for each mini-batch, activates only the one subgraph generated by the controller. [Detailed Description](#ENASMode). (Currently only supported on TensorFlow.)
+
+To use enas_mode, you should add one more field in the `trial` config as shown below.
+```diff
+trial:
+    command: your command to run the trial
+    codeDir: the directory where the trial's code is located
+    gpuNum: the number of GPUs that one trial job needs
++   #choice: classic_mode, enas_mode, oneshot_mode, darts_mode
++   nasMode: enas_mode
+```
+As in classic_mode, in enas_mode you need to specify a tuner for NAS, because the trial also receives subgraphs from the tuner (or controller, in the paper's terminology). Since the trial job receives multiple subgraphs from the tuner, one per mini-batch, two lines must be added to the trial code: one to receive the next subgraph (i.e., `nni.training_update`) and one to report the result of the current subgraph. Below is an example:
+```python
+for _ in range(num):
+    # receive and enable a new subgraph
+    """@nni.training_update(tf=tf, session=self.session)"""
+    loss, _ = self.session.run([loss_op, train_op])
+    # report the loss of this mini-batch
+    """@nni.report_final_result(loss)"""
+```
+Here, `nni.training_update` performs an update on the full graph. In enas_mode, the update means receiving a subgraph and enabling it for the next mini-batch, while in darts_mode it means training the architecture weights (details in darts_mode). In enas_mode, you need to pass the imported TensorFlow package as `tf` and the session as `session`.
+
+**\*oneshot_mode\***: follows the training approach in [this paper][6]. Unlike enas_mode, which trains the full graph by training large numbers of subgraphs, oneshot_mode builds the full graph and adds dropout to the candidate inputs and to the outputs of the candidate ops. The full graph is then trained like any other DL model. [Detailed Description](#OneshotMode). (Currently only supported on TensorFlow.)
+
+To use oneshot_mode, you should add one more field in the `trial` config as shown below. This mode does not use a tuner. Note, however, that for now you still need to specify a tuner (any tuner) in the config file. There is also no need to add `nni.training_update` in this mode, because no special processing (or update) is needed during training.
+```diff
+trial:
+    command: your command to run the trial
+    codeDir: the directory where the trial's code is located
+    gpuNum: the number of GPUs that one trial job needs
++   #choice: classic_mode, enas_mode, oneshot_mode, darts_mode
++   nasMode: oneshot_mode
+```
+
+**\*darts_mode\***: follows the training approach in [this paper][3]. It is similar to oneshot_mode, with two differences: darts_mode only adds architecture weights to the outputs of candidate ops, and it trains model weights and architecture weights in an interleaved manner. [Detailed Description](#DartsMode).
+
+To use darts_mode, you should add one more field in the `trial` config as shown below. This mode does not use a tuner either. Note, however, that for now you still need to specify a tuner (any tuner) in the config file.
+```diff
+trial:
+    command: your command to run the trial
+    codeDir: the directory where the trial's code is located
+    gpuNum: the number of GPUs that one trial job needs
++   #choice: classic_mode, enas_mode, oneshot_mode, darts_mode
++   nasMode: darts_mode
+```
+
+When using darts_mode, call `nni.training_update` as shown below whenever the architecture weights should be updated. Updating the architecture weights requires the `loss` as well as the training data (i.e., `feed_dict`).
+```python
+for _ in range(num):
+    # train the architecture weights
+    """@nni.training_update(tf=tf, session=self.session, loss=loss, feed_dict=feed_dict)"""
+    loss, _ = self.session.run([loss_op, train_op])
+```
+
+**Note:** for enas_mode, oneshot_mode, and darts_mode, NNI only handles the training phase. Each mode also has its own inference phase, which is not handled by NNI. For enas_mode, the inference phase generates new subgraphs through the controller. For oneshot_mode, it samples new subgraphs randomly and chooses good ones. For darts_mode, it prunes a proportion of candidate ops based on architecture weights.
+
+<a name="ENASMode"></a>
+
+### enas_mode
+
+In enas_mode, the compiled trial code builds the full graph (rather than a subgraph). It receives a chosen architecture, trains this architecture on the full graph for one mini-batch, and then requests another chosen architecture. This is supported by [NNI multi-phase](./multiPhase.md).
+
+Specifically, for trials using TensorFlow, we create TensorFlow variables that act as signals and use TensorFlow conditional functions to control the search space (full graph), so that it can be switched to different subgraphs (multiple times) depending on these signals. [Here]() is an example for enas_mode.
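
The signal idea can be pictured with a framework-free sketch. This is not NNI's TensorFlow implementation: the class `FullGraphLayer`, its methods, and the three toy candidate ops are hypothetical stand-ins, with a plain attribute playing the role of the signal variable and a dictionary lookup playing the role of the conditional function.

```python
# Hypothetical candidate ops for one mutable layer (plain functions stand in for TF ops).
CANDIDATE_OPS = {
    "double": lambda x: 2.0 * x,
    "negate": lambda x: -x,
    "identity": lambda x: x,
}


class FullGraphLayer:
    """Holds every candidate op; a mutable signal decides which one is active."""

    def __init__(self, ops):
        self.ops = ops
        self.signal = next(iter(ops))  # currently enabled candidate

    def enable(self, chosen):
        # Analogous to assigning a new value to the signal variable.
        if chosen not in self.ops:
            raise KeyError(chosen)
        self.signal = chosen

    def forward(self, x):
        # Analogous to a conditional that routes through the active branch only.
        return self.ops[self.signal](x)


layer = FullGraphLayer(CANDIDATE_OPS)  # the full graph is built once
outputs = []
for chosen in ["negate", "double", "identity"]:  # one subgraph per mini-batch
    layer.enable(chosen)
    outputs.append(layer.forward(3.0))
print(outputs)
```

The point of the design is that switching subgraphs only changes a signal value, so the graph never has to be rebuilt between mini-batches.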
+
+<a name="OneshotMode"></a>
+
+### oneshot_mode
+
+The figure below shows where dropout is added to the full graph for one layer in `nni.mutable_layers`. Inputs 1 to k are candidate inputs, and the four ops are candidate ops.
+
+![](../../img/oneshot_mode.png)
+
+As suggested in the [paper][6], dropout is applied to the inputs of every layer. The dropout rate is set to r^(1/k), where 0 < r < 1 is a hyper-parameter of the model (default 0.01) and k is the number of optional inputs of a specific layer. The higher the fan-in, the more likely each possible input is to be dropped out. However, the probability of dropping out all optional_inputs of a layer is kept constant regardless of its fan-in. Suppose r = 0.05. If a layer has k = 2 optional_inputs, then each one is independently dropped out with probability 0.05^(1/2) ≈ 0.22 and retained with probability 0.78. If a layer has k = 7 optional_inputs, then each one is independently dropped out with probability 0.05^(1/7) ≈ 0.65 and retained with probability 0.35. In both cases, the probability of dropping out all of the layer's optional_inputs is 5%. The outputs of candidate ops are dropped out in the same way. [Here]() is an example for oneshot_mode.
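
The arithmetic in this paragraph can be checked with a few lines of Python. The helper name `input_drop_rate` is made up for this sketch and is not part of NNI.

```python
def input_drop_rate(r: float, k: int) -> float:
    """Per-input dropout rate r**(1/k) for a layer with k optional inputs."""
    if not 0.0 < r < 1.0:
        raise ValueError("r must satisfy 0 < r < 1")
    if k < 1:
        raise ValueError("k must be a positive integer")
    return r ** (1.0 / k)


for k in (2, 7):
    rate = input_drop_rate(0.05, k)
    # Inputs are dropped independently, so all k are dropped with probability rate**k.
    print(f"k={k}: drop={rate:.2f}, keep={1 - rate:.2f}, all dropped={rate ** k:.2f}")
```

Because the per-input rate is the k-th root of r, raising it to the power k always gives back r, which is why the chance of losing every input does not depend on the fan-in.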
+
+<a name="DartsMode"></a>
+
+### darts_mode
+
+The figure below shows where architecture weights are added to the full graph for one layer in `nni.mutable_layers`. The output of each candidate op is multiplied by a weight, called the architecture weight.
+
+![](../../img/darts_mode.png)
+
+In `nni.training_update`, the TensorFlow MomentumOptimizer is used to train the architecture weights based on the passed `loss` and `feed_dict`. [Here]() is an example for darts_mode.
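
A minimal, framework-free sketch of the two ideas in this section: scaling each candidate op's output by its architecture weight, and pruning candidates by weight in the inference phase, which NNI does not handle. The function names `mixed_output` and `prune` and the example numbers are assumptions for illustration. Whether the weights are normalized first (the DARTS paper applies a softmax) is not specified here, so the sketch uses them as given.

```python
def mixed_output(op_outputs, arch_weights):
    """Sum of candidate-op outputs, each scaled by its architecture weight."""
    if len(op_outputs) != len(arch_weights):
        raise ValueError("need one architecture weight per candidate op")
    return sum(w * out for w, out in zip(arch_weights, op_outputs))


def prune(op_names, arch_weights, keep):
    """Keep the `keep` candidate ops with the largest architecture weights."""
    ranked = sorted(zip(arch_weights, op_names), reverse=True)
    return [name for _, name in ranked[:keep]]


ops = ["conv", "pool", "identity", "sep_conv"]  # hypothetical candidate ops
weights = [0.5, 0.1, 0.3, 0.1]                  # hypothetical trained weights
print(mixed_output([1.0, 2.0, 3.0, 4.0], weights))
print(prune(ops, weights, keep=2))
```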
+
+### [__TODO__] Multiple trial jobs for One-Shot NAS
+
+One-Shot NAS usually has only one trial job with the full graph. However, running multiple such trial jobs has benefits. For example, in enas_mode multiple trial jobs could share the weights of the full graph to speed up model training (or convergence). Also, some One-Shot approaches are not stable, so running multiple trial jobs increases the chance of finding better models.
+
+NNI natively supports running multiple such trial jobs. The figure below shows how multiple trial jobs run on NNI.
+
+![](../../img/one-shot_training.png)
+
 =============================================================
 
-## Neural architecture search on NNI
+## System design of NAS on NNI
 
 ### Basic flow of experiment execution
 
@@ -95,7 +200,7 @@ NNI's annotation compiler transforms the annotated trial code to the code that c
 
 The above figure shows how the trial code runs on NNI. `nnictl` processes user trial code to generate a search space file and compiled trial code. The former is fed to tuner, and the latter is used to run trials. 
 
-[Simple example of NAS on NNI](https://github.com/microsoft/nni/tree/v0.8/examples/trials/mnist-nas).
+[Simple example of NAS on NNI](https://github.com/microsoft/nni/tree/master/examples/trials/mnist-nas).
 
 ### [__TODO__] Weight sharing
 
@@ -107,24 +212,9 @@ We believe weight sharing (transferring) plays a key role on speeding up NAS, wh
 
 Example of weight sharing on NNI.
 
-### [__TODO__] Support of One-Shot NAS
-
-One-Shot NAS is a popular approach to find good neural architecture within a limited time and resource budget. Basically, it builds a full graph based on the search space, and uses gradient descent to at last find the best subgraph. There are different training approaches, such as [training subgraphs (per mini-batch)][1], [training full graph through dropout][6], [training with architecture weights (regularization)][3]. Here we focus on the first approach, i.e., training subgraphs (ENAS).
-
-With the same annotated trial code, users could choose One-Shot NAS as execution mode on NNI. Specifically, the compiled trial code builds the full graph (rather than subgraph demonstrated above), it receives a chosen architecture and training this architecture on the full graph for a mini-batch, then request another chosen architecture. It is supported by [NNI multi-phase](MultiPhase.md). We support this training approach because training a subgraph is very fast, building the graph every time training a subgraph induces too much overhead.
-
-![](../../img/one-shot_training.png)
-
-The design of One-Shot NAS on NNI is shown in the above figure. One-Shot NAS usually only has one trial job with full graph. NNI supports running multiple such trial jobs each of which runs independently. As One-Shot NAS is not stable, running multiple instances helps find better model. Moreover, trial jobs are also able to synchronize weights during running (i.e., there is only one copy of weights, like asynchronous parameter-server mode). This may speedup converge.
-
-Example of One-Shot NAS on NNI.
-
-
-## [__TODO__] General tuning algorithms for NAS
-
-Like hyperparameter tuning, a relatively general algorithm for NAS is required. The general programming interface makes this task easier to some extent. We have a RL-based tuner algorithm for NAS from our contributors. We expect efforts from community to design and implement better NAS algorithms.
+## General tuning algorithms for NAS
 
-More tuning algorithms for NAS.
+Like hyperparameter tuning, NAS requires a relatively general algorithm. The general programming interface makes this task easier to some extent. We have an [RL tuner based on the PPO algorithm](https://github.com/microsoft/nni/tree/master/src/sdk/pynni/nni/ppo_tuner) for NAS. We expect efforts from the community to design and implement better NAS algorithms.
 
 ## [__TODO__] Export best neural architecture and code
 

diff --git a/docs/img/darts_mode.png b/docs/img/darts_mode.png
diff --git a/docs/img/oneshot_mode.png b/docs/img/oneshot_mode.png