
Commit

Merge pull request #113 from Microsoft/master
merge master
SparkSnail authored Jan 9, 2019
2 parents 85cb472 + c288a16 commit 3784355
Showing 41 changed files with 3,670 additions and 109 deletions.
87 changes: 87 additions & 0 deletions docs/AdvancedNAS.md
@@ -0,0 +1,87 @@
# Tutorial for Advanced Neural Architecture Search
Currently, many NAS algorithms leverage the technique of **weight sharing** among trials to accelerate the training process. For example, [ENAS][1] delivers 1000x efficiency with '_parameter sharing between child models_', compared with the previous [NASNet][2] algorithm. Other NAS algorithms such as [DARTS][3], [Network Morphism][4], and [Evolution][5] also leverage, or have the potential to leverage, weight sharing.

This is a tutorial on how to enable weight sharing in NNI.

## Weight Sharing among trials
Currently we recommend sharing weights through NFS (Network File System), which supports sharing files across machines and is lightweight and (relatively) efficient. We also welcome contributions from the community on more efficient techniques.

### Weight Sharing through NFS files
With NFS set up (see below), trial code can share model weights by loading & saving files. Here we recommend that users feed the tuner with the storage path:
```yaml
tuner:
  codeDir: path/to/customer_tuner
  classFileName: customer_tuner.py
  className: CustomerTuner
  classArgs:
    ...
    save_dir_root: /nfs/storage/path/
```
The tuner then decides where to save & load weights and feeds those paths to trials through `nni.get_next_parameters()`:

![weight_sharing_design](./img/weight_sharing.png)

For example, in tensorflow:
```python
import os
import tensorflow as tf

# save model weights under the tuner-assigned path
saver = tf.train.Saver()
saver.save(sess, os.path.join(params['save_path'], 'model.ckpt'))
# load model weights from the parent trial's checkpoint (maps all variables)
tf.train.init_from_checkpoint(params['restore_path'], {'/': '/'})
```
where the `'save_path'` and `'restore_path'` hyper-parameters are managed by the tuner.
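
On the trial side, a minimal sketch of this flow could look like the following. It assumes the tuner places `save_path` and `restore_path` into the generated hyper-parameters, and uses NNI's trial API for fetching parameters and reporting results (`nni.get_next_parameter()` / `nni.report_final_result()` in current NNI releases):
```python
import os
import nni
import tensorflow as tf

params = nni.get_next_parameter()  # hyper-parameters, including the paths chosen by the tuner

# ... build the model graph from params here ...

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    saver = tf.train.Saver()
    if params.get('restore_path'):
        # warm-start the child model from its parent's checkpoint on the shared NFS volume
        saver.restore(sess, os.path.join(params['restore_path'], 'model.ckpt'))
    # ... train and evaluate the model here ...
    saver.save(sess, os.path.join(params['save_path'], 'model.ckpt'))
    nni.report_final_result(0.0)  # replace 0.0 with the trial's actual metric
```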

### NFS Setup
In NFS, files are physically stored on a server machine, and trials on the client machine can read/write those files in the same way that they access local files.

#### Install NFS on server machine
First, install NFS server:
```bash
sudo apt-get install nfs-kernel-server
```
Suppose `/tmp/nni/shared` is used as the physical storage; then run:
```bash
sudo mkdir -p /tmp/nni/shared
echo "/tmp/nni/shared *(rw,sync,no_subtree_check,no_root_squash)" | sudo tee -a /etc/exports
sudo service nfs-kernel-server restart
```
You can check whether the directory has been successfully exported by NFS using `sudo showmount -e localhost`.

#### Install NFS on client machine
First, install NFS client:
```bash
sudo apt-get install nfs-common
```
Then create a local mount point and mount the shared directory:
```bash
sudo mkdir -p /mnt/nfs/nni/
sudo mount -t nfs 10.10.10.10:/tmp/nni/shared /mnt/nfs/nni
```
where `10.10.10.10` should be replaced with the actual IP address of the NFS server machine.
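
Since a mis-mounted share silently breaks weight sharing, it can help to verify from trial code that the mount point is actually writable before starting an experiment. Below is a minimal sketch under the setup above (the path `/mnt/nfs/nni` follows the mount command; the helper name `check_nfs_writable` is ours, not part of NNI):
```python
import os
import tempfile

def check_nfs_writable(mount_point='/mnt/nfs/nni'):
    """Return True if the shared directory exists and accepts writes."""
    if not os.path.isdir(mount_point):
        return False
    try:
        # create and immediately delete a scratch file to confirm write access
        with tempfile.NamedTemporaryFile(dir=mount_point):
            pass
        return True
    except OSError:
        return False

if __name__ == '__main__':
    print(check_nfs_writable())
```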

## Asynchronous Dispatcher Mode for trial dependency control
Weight sharing creates dependencies between trials, possibly running on different machines, for which **read-after-write** consistency must usually be assured: a child model should not load the parent model before the parent trial has finished training. To deal with this, users can enable **asynchronous dispatcher mode** by setting `multiThread: true` in NNI's `config.yml`. In this mode the dispatcher assigns a tuner thread to each incoming `NEW_TRIAL` request, and the tuner thread decides when to submit a new trial by blocking and unblocking itself. For example:
```python
def generate_parameters(self, parameter_id):
    self.thread_lock.acquire()
    indiv = ...  # configuration for a new trial
    self.events[parameter_id] = threading.Event()
    self.thread_lock.release()
    if indiv.parent_id is not None:
        # block until the parent trial has reported its result
        self.events[indiv.parent_id].wait()

def receive_trial_result(self, parameter_id, parameters, reward):
    self.thread_lock.acquire()
    # code for processing trial results
    self.thread_lock.release()
    # unblock any child trial waiting on this parent
    self.events[parameter_id].set()
```
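
The snippet above only handles synchronization; the tuner also has to tell each child trial where its parent's weights live. A rough sketch of how such paths could be assembled under `save_dir_root` follows (the `parent_id` attribute matches the example above; the helper name, `indiv.config`, and the `save_path`/`restore_path` fields are illustrative assumptions, not a fixed NNI API):
```python
import os

def _build_trial_params(self, indiv, parameter_id):
    # illustrative helper: compose the hyper-parameters handed to one trial
    params = dict(indiv.config)
    params['save_path'] = os.path.join(self.save_dir_root, str(parameter_id))
    if indiv.parent_id is not None:
        # the child trial restores from the directory its parent saved to
        params['restore_path'] = os.path.join(self.save_dir_root, str(indiv.parent_id))
    else:
        params['restore_path'] = None
    return params
```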

## Examples
For details, please refer to this [simple weight sharing example](../test/async_sharing_test). We also provide a [practical example](../examples/trials/weight_sharing/ga_squad) for reading comprehension, based on the previous [ga_squad](../examples/trials/ga_squad) example.

[1]: https://arxiv.org/abs/1802.03268
[2]: https://arxiv.org/abs/1707.07012
[3]: https://arxiv.org/abs/1806.09055
[4]: https://arxiv.org/abs/1806.10282
[5]: https://arxiv.org/abs/1703.01041
Binary file added docs/img/weight_sharing.png
6 changes: 3 additions & 3 deletions examples/trials/ga_squad/trial.py
@@ -338,7 +338,7 @@ def train_with_graph(graph, qp_pairs, dev_qp_pairs):
             answers = generate_predict_json(
                 position1, position2, ids, contexts)
             if save_path is not None:
-                with open(save_path + 'epoch%d.prediction' % epoch, 'w') as file:
+                with open(os.path.join(save_path, 'epoch%d.prediction' % epoch), 'w') as file:
                     json.dump(answers, file)
             else:
                 answers = json.dumps(answers)
@@ -359,8 +359,8 @@ def train_with_graph(graph, qp_pairs, dev_qp_pairs):
                 bestacc = acc

                 if save_path is not None:
-                    saver.save(sess, save_path + 'epoch%d.model' % epoch)
-                    with open(save_path + 'epoch%d.score' % epoch, 'wb') as file:
+                    saver.save(sess, os.path.join(save_path, 'epoch%d.model' % epoch))
+                    with open(os.path.join(save_path, 'epoch%d.score' % epoch), 'wb') as file:
                         pickle.dump(
                             (position1, position2, ids, contexts), file)
             logger.debug('epoch %d acc %g bestacc %g' %
41 changes: 41 additions & 0 deletions examples/trials/mnist/config_frameworkcontroller.yml
@@ -0,0 +1,41 @@
authorName: default
experimentName: example_mnist
trialConcurrency: 1
maxExecDuration: 1h
maxTrialNum: 10
#choice: local, remote, pai, kubeflow
trainingServicePlatform: frameworkcontroller
searchSpacePath: search_space.json
#choice: true, false
useAnnotation: false
tuner:
  #choice: TPE, Random, Anneal, Evolution
  builtinTunerName: TPE
  classArgs:
    #choice: maximize, minimize
    optimize_mode: maximize
assessor:
  builtinAssessorName: Medianstop
  classArgs:
    optimize_mode: maximize
  gpuNum: 0
trial:
  codeDir: .
  taskRoles:
    - name: worker
      taskNum: 1
      command: python3 mnist.py
      gpuNum: 1
      cpuNum: 1
      memoryMB: 8192
      image: msranni/nni:latest
      frameworkAttemptCompletionPolicy:
        minFailedTaskCount: 1
        minSucceededTaskCount: 1
frameworkcontrollerConfig:
  storage: nfs
  nfs:
    # Your NFS server IP, like 10.10.10.10
    server: {your_nfs_server_ip}
    # Your NFS server export path, like /var/nfs/nni
    path: {your_nfs_server_export_path}
171 changes: 171 additions & 0 deletions examples/trials/weight_sharing/ga_squad/attention.py
@@ -0,0 +1,171 @@
# Copyright (c) Microsoft Corporation
# All rights reserved.
#
# MIT License
#
# Permission is hereby granted, free of charge,
# to any person obtaining a copy of this software and associated
# documentation files (the "Software"),
# to deal in the Software without restriction, including without limitation
# the rights to use, copy, modify, merge, publish, distribute, sublicense,
# and/or sell copies of the Software, and
# to permit persons to whom the Software is furnished to do so, subject to the following conditions:
# The above copyright notice and this permission notice shall be included
# in all copies or substantial portions of the Software.
#
# THE SOFTWARE IS PROVIDED *AS IS*, WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING
# BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
# NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM,
# DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
# OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

import math

import tensorflow as tf
from tensorflow.python.ops.rnn_cell_impl import RNNCell


def _get_variable(variable_dict, name, shape, initializer=None, dtype=tf.float32):
    if name not in variable_dict:
        variable_dict[name] = tf.get_variable(
            name=name, shape=shape, initializer=initializer, dtype=dtype)
    return variable_dict[name]


class DotAttention:
    '''
    DotAttention
    '''

    def __init__(self, name,
                 hidden_dim,
                 is_vanilla=True,
                 is_identity_transform=False,
                 need_padding=False):
        self._name = '/'.join([name, 'dot_att'])
        self._hidden_dim = hidden_dim
        self._is_identity_transform = is_identity_transform
        self._need_padding = need_padding
        self._is_vanilla = is_vanilla
        self._var = {}

    @property
    def is_identity_transform(self):
        return self._is_identity_transform

    @property
    def is_vanilla(self):
        return self._is_vanilla

    @property
    def need_padding(self):
        return self._need_padding

    @property
    def hidden_dim(self):
        return self._hidden_dim

    @property
    def name(self):
        return self._name

    @property
    def var(self):
        return self._var

    def _get_var(self, name, shape, initializer=None):
        with tf.variable_scope(self.name):
            return _get_variable(self.var, name, shape, initializer)

    def _define_params(self, src_dim, tgt_dim):
        hidden_dim = self.hidden_dim
        self._get_var('W', [src_dim, hidden_dim])
        if not self.is_vanilla:
            self._get_var('V', [src_dim, hidden_dim])
            if self.need_padding:
                self._get_var('V_s', [src_dim, src_dim])
                self._get_var('V_t', [tgt_dim, tgt_dim])
            if not self.is_identity_transform:
                self._get_var('T', [tgt_dim, src_dim])
        self._get_var('U', [tgt_dim, hidden_dim])
        self._get_var('b', [1, hidden_dim])
        self._get_var('v', [hidden_dim, 1])

    def get_pre_compute(self, s):
        '''
        :param s: [src_sequence, batch_size, src_dim]
        :return: [src_sequence, batch_size, hidden_dim]
        '''
        hidden_dim = self.hidden_dim
        src_dim = s.get_shape().as_list()[-1]
        assert src_dim is not None, 'src dim must be defined'
        W = self._get_var('W', shape=[src_dim, hidden_dim])
        b = self._get_var('b', shape=[1, hidden_dim])
        return tf.tensordot(s, W, [[2], [0]]) + b

    def get_prob(self, src, tgt, mask, pre_compute, return_logits=False):
        '''
        :param s: [src_sequence_length, batch_size, src_dim]
        :param h: [batch_size, tgt_dim] or [tgt_sequence_length, batch_size, tgt_dim]
        :param mask: [src_sequence_length, batch_size]\
            or [tgt_sequence_length, src_sequence_length, batch_size]
        :param pre_compute: [src_sequence_length, batch_size, hidden_dim]
        :return: [src_sequence_length, batch_size]\
            or [tgt_sequence_length, src_sequence_length, batch_size]
        '''
        s_shape = src.get_shape().as_list()
        h_shape = tgt.get_shape().as_list()
        src_dim = s_shape[-1]
        tgt_dim = h_shape[-1]
        assert src_dim is not None, 'src dimension must be defined'
        assert tgt_dim is not None, 'tgt dimension must be defined'

        self._define_params(src_dim, tgt_dim)

        if len(h_shape) == 2:
            tgt = tf.expand_dims(tgt, 0)
        if pre_compute is None:
            pre_compute = self.get_pre_compute(src)

        buf0 = pre_compute
        buf1 = tf.tensordot(tgt, self.var['U'], axes=[[2], [0]])
        buf2 = tf.tanh(tf.expand_dims(buf0, 0) + tf.expand_dims(buf1, 1))

        if not self.is_vanilla:
            xh1 = tgt
            xh2 = tgt
            s1 = src
            if self.need_padding:
                xh1 = tf.tensordot(xh1, self.var['V_t'], 1)
                xh2 = tf.tensordot(xh2, self.var['S_t'], 1)
                s1 = tf.tensordot(s1, self.var['V_s'], 1)
            if not self.is_identity_transform:
                xh1 = tf.tensordot(xh1, self.var['T'], 1)
                xh2 = tf.tensordot(xh2, self.var['T'], 1)
            buf3 = tf.expand_dims(s1, 0) * tf.expand_dims(xh1, 1)
            buf3 = tf.tanh(tf.tensordot(buf3, self.var['V'], axes=[[3], [0]]))
            buf = tf.reshape(tf.tanh(buf2 + buf3), shape=tf.shape(buf3))
        else:
            buf = buf2
        v = self.var['v']
        e = tf.tensordot(buf, v, [[3], [0]])
        e = tf.squeeze(e, axis=[3])
        tmp = tf.reshape(e + (mask - 1) * 10000.0, shape=tf.shape(e))
        prob = tf.nn.softmax(tmp, 1)
        if len(h_shape) == 2:
            prob = tf.squeeze(prob, axis=[0])
            tmp = tf.squeeze(tmp, axis=[0])
        if return_logits:
            return prob, tmp
        return prob

    def get_att(self, s, prob):
        '''
        :param s: [src_sequence_length, batch_size, src_dim]
        :param prob: [src_sequence_length, batch_size]\
            or [tgt_sequence_length, src_sequence_length, batch_size]
        :return: [batch_size, src_dim] or [tgt_sequence_length, batch_size, src_dim]
        '''
        buf = s * tf.expand_dims(prob, axis=-1)
        att = tf.reduce_sum(buf, axis=-3)
        return att
31 changes: 31 additions & 0 deletions examples/trials/weight_sharing/ga_squad/config_remote.yml
@@ -0,0 +1,31 @@
authorName: default
experimentName: ga_squad_weight_sharing
trialConcurrency: 2
maxExecDuration: 1h
maxTrialNum: 200
#choice: local, remote, pai
trainingServicePlatform: remote
#choice: true, false
useAnnotation: false
multiThread: true
tuner:
  codeDir: ../../../tuners/weight_sharing/ga_customer_tuner
  classFileName: customer_tuner.py
  className: CustomerTuner
  classArgs:
    optimize_mode: maximize
    population_size: 32
    save_dir_root: /mnt/nfs/nni/ga_squad
trial:
  command: python3 trial.py --input_file /mnt/nfs/nni/train-v1.1.json --dev_file /mnt/nfs/nni/dev-v1.1.json --max_epoch 1 --embedding_file /mnt/nfs/nni/glove.6B.300d.txt
  codeDir: .
  gpuNum: 1
machineList:
  - ip: remote-ip-0
    port: 8022
    username: root
    passwd: screencast
  - ip: remote-ip-1
    port: 8022
    username: root
    passwd: screencast

0 comments on commit 3784355
