docs & examples: increase training iterations
ymjiang committed Jul 23, 2019
1 parent c59d988 commit f6f7f96
Showing 3 changed files with 9 additions and 9 deletions.
12 changes: 6 additions & 6 deletions docs/step-by-step-tutorial.md
@@ -26,7 +26,7 @@ export DMLC_PS_ROOT_PORT=1234
export EVAL_TYPE=benchmark
python /usr/local/byteps/launcher/launch.py \
/usr/local/byteps/example/tensorflow/run_tensorflow_byteps.sh \
- --model ResNet50 --num-iters 1000
+ --model ResNet50 --num-iters 1000000
```

### PyTorch
@@ -51,7 +51,7 @@ export DMLC_PS_ROOT_PORT=1234
export EVAL_TYPE=benchmark
python /usr/local/byteps/launcher/launch.py \
/usr/local/byteps/example/pytorch/start_pytorch_byteps.sh \
- --model resnet50 --num-iters 1000
+ --model resnet50 --num-iters 1000000
```

### MXNet
@@ -166,15 +166,15 @@ If your workers use TensorFlow, you need to change the image name to `bytepsimag
```
python /usr/local/byteps/launcher/launch.py \
/usr/local/byteps/example/tensorflow/run_tensorflow_byteps.sh \
- --model ResNet50 --num-iters 1000
+ --model ResNet50 --num-iters 1000000
```

If your workers use PyTorch, you need to change the image name to `bytepsimage/worker_pytorch`, and replace the python script with

```
python /usr/local/byteps/launcher/launch.py \
/usr/local/byteps/example/pytorch/start_pytorch_byteps.sh \
- --model resnet50 --num-iters 1000
+ --model resnet50 --num-iters 1000000
```

## Distributed Training with RDMA
@@ -300,13 +300,13 @@ If your workers use TensorFlow, you need to change the image name to `bytepsimag
```
python /usr/local/byteps/launcher/launch.py \
/usr/local/byteps/example/tensorflow/run_tensorflow_byteps.sh \
- --model ResNet50 --num-iters 1000
+ --model ResNet50 --num-iters 1000000
```

If your workers use PyTorch, you need to change the image name to `bytepsimage/worker_pytorch_rdma`, and replace the python script with

```
python /usr/local/byteps/launcher/launch.py \
/usr/local/byteps/example/pytorch/start_pytorch_byteps.sh \
- --model resnet50 --num-iters 1000
+ --model resnet50 --num-iters 1000000
```
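
The launch commands touched in this file share one shape: the launcher, a framework-specific wrapper script, a model name, and the iteration count. As a rough illustration of that shape, the sketch below assembles the same argument list in Python. The `launch_command` helper is hypothetical and not part of BytePS; it only builds the command and does not run it.

```python
LAUNCHER = '/usr/local/byteps/launcher/launch.py'

def launch_command(wrapper_script, model, num_iters=1000000):
    # Hypothetical helper: assembles the argv shown in the tutorial.
    # It only returns the list; nothing is executed.
    return ['python', LAUNCHER, wrapper_script,
            '--model', model, '--num-iters', str(num_iters)]

tf_cmd = launch_command(
    '/usr/local/byteps/example/tensorflow/run_tensorflow_byteps.sh',
    'ResNet50')
pt_cmd = launch_command(
    '/usr/local/byteps/example/pytorch/start_pytorch_byteps.sh',
    'resnet50', num_iters=1000)

print(' '.join(tf_cmd))
print(' '.join(pt_cmd))
```

Passing a smaller `num_iters`, as in the second call, reproduces the pre-commit value for a quick smoke test.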
4 changes: 2 additions & 2 deletions example/pytorch/train_mnist_byteps.py
@@ -14,8 +14,8 @@
help='input batch size for training (default: 64)')
parser.add_argument('--test-batch-size', type=int, default=1000, metavar='N',
help='input batch size for testing (default: 1000)')
- parser.add_argument('--epochs', type=int, default=10, metavar='N',
- help='number of epochs to train (default: 10)')
+ parser.add_argument('--epochs', type=int, default=100, metavar='N',
+ help='number of epochs to train (default: 100)')
parser.add_argument('--lr', type=float, default=0.01, metavar='LR',
help='learning rate (default: 0.01)')
parser.add_argument('--momentum', type=float, default=0.5, metavar='M',
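
The new default only takes effect when `--epochs` is omitted; a shorter run is still available by passing the flag explicitly. A minimal sketch of that behaviour, assuming a standalone parser that mirrors only the `--epochs` argument from this hunk (the `build_parser` helper is hypothetical and not part of `train_mnist_byteps.py`):

```python
import argparse

def build_parser():
    # Hypothetical helper: reproduces only the --epochs argument from the diff.
    parser = argparse.ArgumentParser(description='epochs default sketch')
    parser.add_argument('--epochs', type=int, default=100, metavar='N',
                        help='number of epochs to train (default: 100)')
    return parser

# No flag given: the new default applies.
default_args = build_parser().parse_args([])
# Flag given: the explicit value overrides the default.
short_args = build_parser().parse_args(['--epochs', '10'])

print(default_args.epochs, short_args.epochs)
```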
2 changes: 1 addition & 1 deletion example/tensorflow/tensorflow_mnist.py
@@ -128,7 +128,7 @@ def main(_):
bps.BroadcastGlobalVariablesHook(0),

# BytePS: adjust number of steps based on number of GPUs.
- tf.train.StopAtStepHook(last_step=20000 // bps.size()),
+ tf.train.StopAtStepHook(last_step=200000 // bps.size()),

tf.train.LoggingTensorHook(tensors={'step': global_step, 'loss': loss},
every_n_iter=10),
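
The integer division in the `StopAtStepHook` line means the stopping step shrinks as more workers join. A rough sketch of that arithmetic, with plain integers standing in for `bps.size()` (the `last_step_per_worker` helper is illustrative only and does not appear in `tensorflow_mnist.py`):

```python
def last_step_per_worker(total_steps, num_workers):
    # Mirrors `last_step=200000 // bps.size()`;
    # num_workers is a stand-in for the value bps.size() would return.
    if num_workers < 1:
        raise ValueError('num_workers must be at least 1')
    return total_steps // num_workers

for workers in (1, 4, 8):
    print(workers, last_step_per_worker(200000, workers))
```

With the previous budget of 20000, the same division on 8 workers would stop after 2500 steps.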