- Pick an untaken pair of instances from this sheet and put your name in the Taken By column.
- ssh into the master instance (Primary IP) with the private key
net-accel-dl-tutorial.pem
available in the Slack channel:
ssh -i /path/to/net-accel-dl-tutorial.pem ubuntu@$PRIMARY_IP
If you see WARNING: UNPROTECTED PRIVATE KEY FILE!
, please do chmod 400 /path/to/net-accel-dl-tutorial.pem
if you are on a Unix-like system.
Refer to this guide if you use Windows.
- Run the script:
~/scripts/1_tf_mnist.sh
# the first run may take up to 3 minutes
# subsequent runs should take about 40 seconds
The script calls horovodrun. The Horovod documentation has detailed instructions on how to run it, as well as how to integrate Horovod into existing training scripts.
- Inspect the training script and identify Horovod code:
vim ~/scripts/tensorflow_mnist.py
This is Python code that trains an MNIST classifier using Horovod. Can you identify where the Horovod API is being used?
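If you have not used Horovod before, these are the kinds of calls to look for. The snippet below is a generic TF1-style sketch of the Horovod API, not a copy of tensorflow_mnist.py:
import tensorflow as tf
import horovod.tensorflow as hvd

# initialize Horovod (sets up ranks and communication)
hvd.init()

# pin each process to a single GPU, if GPUs are present
config = tf.ConfigProto()
config.gpu_options.visible_device_list = str(hvd.local_rank())

# scale the learning rate by the number of workers
opt = tf.train.AdamOptimizer(0.001 * hvd.size())

# wrap the optimizer so gradients are averaged across workers with allreduce
opt = hvd.DistributedOptimizer(opt)

# broadcast initial variables from rank 0 so all workers start from the same state
hooks = [hvd.BroadcastGlobalVariablesHook(0)]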
This hands-on will give you a taste of what it's like to configure and run SwitchML. It will then walk you through using SwitchML's API in code and show how you can compile your own SwitchML applications. Finally, as an additional exercise, you will write your own prepostprocessor to control how data is loaded from user buffers into messages/packets and unloaded back.
For simplicity, and since not all of our attendees have a testbed with multiple nodes and a Tofino switch connecting them, we will be using a "dummy" backend that sleeps to simulate communication and is simple enough to run on a single node on your own personal laptops.
Without further ado, let's get started!
The time for the tutorial is very limited, so to get the most out of it and avoid delays due to a slow internet connection or a slow machine, we encourage you to download the needed materials and prepare your setup in advance by following these two simple steps:
- Download and install docker on your machine.
- Pull the switchml sigcomm image from docker hub by running:
docker pull omaralama/switchml:sigcomm21_exercise
in your machine's command line.
We also encourage you to browse SwitchML's documentation and source code a bit to get a general feel for the project although this is not required to benefit from the tutorial or to complete all exercises.
Alright, the first thing to do, if you haven't already, is to start a container from the image we pulled earlier. To do that you can run
docker run -it --name switchml omaralama/switchml:sigcomm21_exercise
This starts a container from the image and drops you into a new shell inside it.
Next, navigate to where we have pre-downloaded SwitchML for you, then go to the dev_root directory where all of the source code is located (we will list all of our paths starting from this directory).
cd /home/switchml/dev_root
Next, let us compile the SwitchML client library. The client library is required to compile all examples, benchmarks, and framework integrations. To do that, go to the client_lib directory and run make.
cd client_lib
make
This builds the SwitchML client library; since we passed neither DPDK=1 nor RDMA=1 to the Makefile, it will only include the dummy backend. Verify that the library has been generated by listing the contents of dev_root/build/lib.
ls ../build/lib/
It should include libswitchml-client.a
Now that we have SwitchML's client library compiled, we can go ahead and compile the microbenchmark.
Navigate to dev_root/benchmarks
and run make
.
cd ../benchmarks
make
This should generate an allreduce_benchmark binary in dev_root/build/bin.
To verify that everything is working, navigate to dev_root/build/bin and print the benchmark's help message.
cd ../build/bin
./allreduce_benchmark --help
If you do not encounter any errors and you see the benchmark message then you are all set.
For this exercise we want to run the microbenchmark with different configurations and arguments and see how that affects performance.
Firstly, for any SwitchML application to run, we need to copy a configuration file to our working directory. A default configuration file is always generated with any SwitchML build and includes only the options applicable to that build. You can find it as /home/switchml/dev_root/build/switchml.cfg.
Go ahead and copy it to our bin folder: cp ../switchml.cfg .
Now let us open up the copied config and perform the following edits:
- Set num_worker_threads=1 to start out with 1 worker thread.
- Set packet_numel=131072 to start out with a big packet, which reduces some processing in the dummy backend. (Remember, nothing is actually sent with the dummy backend, so we can make this arbitrarily big.)
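After these edits, the two relevant entries in your copy of switchml.cfg should read as follows (leave the rest of the generated file, including any section headers, as is):
num_worker_threads=1
packet_numel=131072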
The container has vim
and nano
at hand to perform these edits but feel free to quickly install your favorite editor if you like.
Now that we have our configuration ready, we can run the benchmark with different arguments.
As with the first exercise, feel free to print the help message to read through all available arguments.
For now, we can simply change the number of elements in each tensor that we want to aggregate and observe the effect on performance. Let's use floating-point tensors, for example.
Run:
./allreduce_benchmark --tensor-type float --tensor-numel 10000
At this point you should have some throughput statistics printed out. If you don't then please contact one of the organizers.
Make a note of the mean throughput, then run 3 more times with different numbers of elements and see how that affects your performance.
Record all your results in a table.
Go back to step 2.2 and edit your configuration, increasing the number of worker threads from 1 to 4 to 8, and test each configuration with different numbers of elements. How does increasing the number of threads affect performance? Any trends? Compare with your previously recorded results.
You have successfully run SwitchML. The RDMA and DPDK backends are just as easy to run, assuming you have set up the switch side correctly. Each backend has its own simple configuration.
If time permits feel free to play around with other configurations or arguments, record and plot your results, and deduce relationships and correlations.
In this exercise, we will create our own small SwitchML benchmark from scratch. The objective is to perform the following
- Allocate a local buffer.
- Start the SwitchML context and get its handle.
- Submit a synchronous allreduce job to reduce the local buffer's data
- Report how much time that took.
- Submit an asynchronous allreduce job to reduce the same local buffer's data.
- Report how much time that took.
- Wait for the job to actually finish.
- Report how much time that took.
- Stop the context.
To start, let us create a new directory in dev_root/examples named after you.
mkdir your_name
cd your_name
pwd
pwd
should print /home/switchml/dev_root/examples/your_name
Now we will create our main.cc
file using vim main.cc
and start writing our own application. Copy the following template over to your new main.cc
file as a starting point. (You may need to copy the code to a plain-text editor first to get rid of any formatting.)
// TODO(1): Include switchml's context header
#include <stdio.h>
#include <chrono>

typedef std::chrono::steady_clock clk;

int main() {
    std::chrono::time_point<clk> begin, end;

    // TODO(2): Get a reference to the context
    // (First time this is called the context will be created)

    printf("Hello from %s, this is %s's SwitchML benchmark.\n",
           "<Your country>", "<Your name>");

    // TODO(3): Start the context
    // This loads configuration and starts all worker threads

    printf("I am allocating a tensor\n");
    float* our_data = new float[1<<24]; // 64 MB tensor. Feel free to change this.

    begin = clk::now();
    // TODO(4) Submit synchronous allreduce job.
    end = clk::now();
    printf("Synchronous allreduce call took %ld ns.\n",
           std::chrono::duration_cast<std::chrono::nanoseconds>(end - begin).count());

    begin = clk::now();
    // TODO(5) Submit asynchronous allreduce job and store its job handle.
    end = clk::now();
    printf("Asynchronous allreduce call took %ld ns.\n",
           std::chrono::duration_cast<std::chrono::nanoseconds>(end - begin).count());

    begin = clk::now();
    // TODO(6) Wait for the job you submitted to finish.
    end = clk::now();
    printf("Asynchronous allreduce waiting for the job took %ld ns.\n",
           std::chrono::duration_cast<std::chrono::nanoseconds>(end - begin).count());

    // TODO(7) Stop the switchml context

    delete [] our_data;
}
To continue the exercise, consult SwitchML's Context API documentation and the hello_world example located in dev_root/examples/hello_world.
Once you are ready to test your application, it is time to compile it. SwitchML has multiple libraries that need to be linked at the final compilation stage, so we have to use the examples Makefile and edit it slightly to include our new application.
- Open up dev_root/examples/Makefile
- Locate the following lines at the end of the file:
hello_world: $(BINDIR)/hello_world
$(BINDIR)/hello_world: $(BINDIR)
	$(CXX) $(CXXFLAGS) $(INC) $(SRCDIR)/hello_world/main.cc $(LDFLAGS) -o $(BINDIR)/hello_world
- Duplicate these lines, replacing all hello_world strings with the name of your application:
your_name: $(BINDIR)/your_name
$(BINDIR)/your_name: $(BINDIR)
	$(CXX) $(CXXFLAGS) $(INC) $(SRCDIR)/your_name/main.cc $(LDFLAGS) -o $(BINDIR)/your_name
That's it. Save your file and quit. You can now compile your application by running make your_name
in the dev_root/examples
directory.
Assuming your application was written correctly the first time, the compilation should finish without any errors. However, we all know that this is a far-fetched assumption :D. So when an error pops up, go back, fix it, and try again.
Once the compilation completes with no errors, then make sure your application's binary has been generated in dev_root/build/bin
.
Just like the second exercise, copy the SwitchML configuration over if it's not already there, and run your application. If you survive segmentation faults on your first run then congratulations. If not, go back, debug, remove the old benchmark binary, compile, and test again.
Once you start seeing meaningful results, record them and compare synchronous calls with asynchronous calls and waits.
If you are ahead, then you can go back to your source code and add more jobs, change the size of the buffer, or add verification code. Then compile again and test your changes.
Congratulations! You have successfully compiled, configured, and run SwitchML. You have also written, compiled, and tested your own small benchmark.
We hope that the tutorial was useful to you and that it showed you how easy it is to use and build on top of SwitchML.
Don't hesitate to contact the SwitchML team if you have any questions.
You need recent versions of Git and Vagrant (with a supported VM provider) installed on your local computer.
This part of the tutorial requires running a pair of VMs. To simplify the process, we ship a pre-imaged VM box using Vagrant.
First, you will need to clone our git repository.
$ git clone https://github.com/sands-lab/sigcomm21_omnireduce_tutorial.git
To set up the Vagrant box, simply cd sigcomm21_omnireduce_tutorial and run the following command on your host
# The first `vagrant up` invocation fetches the vagrant box.
# It is likely that this takes some time, so launch this command ASAP!
$ vagrant up
Once done, two virtual machines, node1
and node2
, are created and started.
We use Soft-RoCE to support RDMA and have already pre-configured it within the VMs, but you still need to add an RDMA link of the rxe type for the network device on both node1 and node2.
Firstly, open two terminals on your host and run the following commands to SSH into node1 and node2
# In terminal 1, SSH node1
$ vagrant ssh node1
# In terminal 2, SSH node2
$ vagrant ssh node2
From now on, unless otherwise stated, we assume all commands are run inside the Vagrant boxes.
After that, you need to run the following commands inside both node1
and node2
# Check the network interface name, it should be eth1
$ ifconfig
# Add an rdma link for rxe to eth1
$ sudo rdma link add roce type rxe netdev eth1
# Check the status of RDMA configuration, make sure that device was added under RDEV (rxe device)
# If there is no error, you will see the following outputs.
$ ibv_devices
# device node GUID
# ------ ----------------
# rocep0s8 0a0027fffedd1943
We have configured passwordless SSH login between node1 and node2 for mpirun. You can check it with the following commands
# On node1
$ ssh 192.168.59.102
# On node2
$ ssh 192.168.59.101
By default, Vagrant shares your project directory (the directory with the Vagrantfile) to /vagrant. Check whether node1 and node2 can access this folder. If it is accessible, copy the omnireduce project folder to /vagrant with the following command
# On node1
$ cp -r /home/vagrant/omnireduce /vagrant
After the tutorial, you may clean up the files that the above copy leaves behind on your local disk.
We run OmniReduce on node1
and node2
in colocated mode, which means we run both an aggregator and a worker process on each VM. To reduce the hardware requirements, we only run AllReduce on CPU tensors in this tutorial. If your machine has GPUs, you can also run it on GPU tensors.
The steps are as follows.
cd to the cloned git repository on your host and start three terminals: two to SSH into node1 and one to SSH into node2.
# In terminal 1, SSH node1
$ vagrant ssh node1
# In terminal 2, SSH node1
$ vagrant ssh node1
# In terminal 3, SSH node2
$ vagrant ssh node2
Inspect omnireduce.cfg
in /vagrant/omnireduce/omnireduce-RDMA/example
In most cases, you do not need to change anything in this file. One thing you need to confirm is that ib_hca matches the device name reported by the ibv_devices command (rocep0s8 in the example output above).
Let's start two aggregator processes, one per node, in terminals 2 and 3.
# In terminal 2, inside node1
$ cd /vagrant/omnireduce/omnireduce-RDMA/example
$ ./aggregator
# In terminal 3, inside node2
$ cd /vagrant/omnireduce/omnireduce-RDMA/example
$ ./aggregator
We use mpirun
to execute a simple microbenchmark based on OmniReduce. This command runs a copy of the worker program on each node.
# In terminal 1, inside node1
$ cd /vagrant/omnireduce/omnireduce-RDMA/example
$ mpirun -n 2 -hosts 192.168.59.101,192.168.59.102 ./worker
The benchmark code we run is worker_test.cpp
. In the default setting, the float tensor size is 262144 and block density is 1.0.
If you still have time, you can modify these values in the worker_test.cpp
(tensor_size
in line 16 and density
in line 25). Then run make
in this folder. Try to run the experiment again and observe the differences. Note that you need to restart aggregator
before each run.
Use the same instructions as with the initial part of this tutorial to acquire a pair of VMs.
GRACE overrides Horovod's DistributedOptimizer
and DistributedGradientTape
APIs. Users simply need to construct and pass a parameter grace
to the above APIs:
# TensorFlow minimal example
from grace_dl.tensorflow.communicator.allgather import Allgather
from grace_dl.tensorflow.compressor.topk import TopKCompressor
from grace_dl.tensorflow.memory.residual import ResidualMemory
grc = Allgather(TopKCompressor(0.3), ResidualMemory(), hvd.size())
optimizer = hvd.DistributedOptimizer(optimizer, grace=grc)
## or:
# tape = hvd.DistributedGradientTape(tape, grace=grc)
GRACE also provides a handy helper which parses dict input and returns the grace object:
from grace_dl.tensorflow.helper import grace_from_params
# users can make params a command line input with argparse
params = {'compressor': 'topk', 'memory': 'residual', 'communicator': 'allgather', 'compress_ratio': 0.01}
grc = grace_from_params(params)
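The returned object is then passed to Horovod exactly as in the previous snippet, for example:
optimizer = hvd.DistributedOptimizer(optimizer, grace=grc)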
Modify ~/scripts/tensorflow_mnist_grace.py to use GRACE. You can then run:
~/scripts/2_tf_mnist_grace.sh
to check correctness. If done correctly, you should see output like:
==Debug== grace is called
rather than
==Debug== grace not called
Refer to ~/scripts/exercise_solutions/tensorflow_mnist_grace.py
for a sample solution.
The main components of the GRACE framework are the Communicator, Compressor, and Memory abstract classes. Taking the PyTorch API as an example, the entry point for GRACE is:
class Communicator(ABC):

    @abstractmethod
    def send_receive(self, tensors, name, ctx):
        # 1. communicate
        # 2. decompress
        # 3. aggregation
        raise NotImplementedError("send was not implemented.")

    def __init__(self, compressor, memory, world_size):
        self.compressor = compressor
        self.memory = memory
        self.world_size = world_size

    def step(self, tensor, name):
        tensor = self.memory.compensate(tensor, name)
        tensors_compressed, ctx = self.compressor.compress(tensor, name)
        self.memory.update(tensor, name, self.compressor, tensors_compressed, ctx)
        return self.send_receive(tensors_compressed, name, ctx)
For custom Compressors, users need to implement the compress and decompress methods. Take the PyTorch TopKCompressor for example:
class TopKCompressor(Compressor):

    def __init__(self, compress_ratio):
        super().__init__()
        # compressor specific params should go here
        self.compress_ratio = compress_ratio

    def compress(self, tensor, name):
        # you are given a gradient tensor and its unique name as the input
        tensor_flat = tensor.flatten()
        k = max(1, int(tensor_flat.numel() * self.compress_ratio))
        _, indices = torch.topk(tensor_flat.abs(), k, sorted=False)
        values = torch.gather(tensor_flat, 0, indices)
        tensors = values, indices
        ctx = tensor.numel(), tensor.size()
        # ctx is anything you need to save and use when decompressing
        return tensors, ctx

    def decompress(self, tensors, ctx):
        """Decompress by filling empty slots with zeros and reshape back using the original shape"""
        numel, shape = ctx
        values, indices = tensors
        tensor_decompressed = torch.zeros(numel, dtype=values.dtype, layout=values.layout, device=values.device)
        tensor_decompressed.scatter_(0, indices, values)
        # you need to return a tensor with the same shape as the "original" gradient
        return tensor_decompressed.view(shape)
We will implement a variant of Top-K and Threshold in which we use a threshold to clip the gradients, but ensure that we send at least K% of the gradient elements. This prevents a bad threshold selection from zeroing out the whole gradient. We will name it guarded_threshold
.
Go ahead to ~/scripts/guarded_threshold.py
and implement the compress
method. A template is already provided for you.
You may refer to the TopK and Threshold implementation available at
~/src/grace-dl/grace_dl/tensorflow/compressor/topk.py
and
~/src/grace-dl/grace_dl/tensorflow/compressor/threshold.py
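For orientation, here is one way the logic could look, written in the PyTorch flavor to mirror the TopKCompressor above. The actual template in ~/scripts/guarded_threshold.py targets the TensorFlow API and may use different names, so treat this purely as a sketch of the idea; the threshold and ratio parameters below are illustrative:
import torch

class GuardedThresholdCompressor:  # the real template subclasses GRACE's Compressor base class
    def __init__(self, threshold, compress_ratio):
        self.threshold = threshold            # magnitudes above this are always sent
        self.compress_ratio = compress_ratio  # minimum fraction of elements to send (the "guard")

    def compress(self, tensor, name):
        tensor_flat = tensor.flatten()
        # never send fewer than K% of the elements
        k_min = max(1, int(tensor_flat.numel() * self.compress_ratio))
        # number of elements that actually exceed the threshold
        n_above = int((tensor_flat.abs() > self.threshold).sum())
        k = max(k_min, n_above)
        # taking the k largest magnitudes keeps everything above the threshold
        # and, if those are too few, tops up to the guaranteed minimum
        _, indices = torch.topk(tensor_flat.abs(), k, sorted=False)
        values = torch.gather(tensor_flat, 0, indices)
        ctx = tensor.numel(), tensor.size()   # saved for decompress, as in TopKCompressor
        return (values, indices), ctx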
A simple test script is provided for your convenience; invoke
~/scripts/3_test_custom_compressor.sh
and you will see Test passed!
if your implementation works. If you messed up the code and want to reset, you can do
cd ~/scripts
git checkout -- guarded_threshold.py
Refer to ~/scripts/exercise_solutions/guarded_threshold.py
for a sample solution.
Finally, you can run the end-to-end experiments with:
~/scripts/4_tf_mnist_custom_compressor.sh
Check out how we use the GuardedThresholdCompressor
in tensorflow_mnist_custom_compressor.py
lines 132-136
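Based on the GRACE API shown earlier, that usage will look roughly like the following sketch (the import path and constructor arguments here are illustrative, not a copy of the script):
import horovod.tensorflow as hvd
from grace_dl.tensorflow.communicator.allgather import Allgather
from grace_dl.tensorflow.memory.residual import ResidualMemory
from guarded_threshold import GuardedThresholdCompressor  # the compressor you just implemented

# the threshold and minimum-ratio values below are illustrative
grc = Allgather(GuardedThresholdCompressor(0.01, 0.05), ResidualMemory(), hvd.size())
optimizer = hvd.DistributedOptimizer(optimizer, grace=grc)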