Merge pull request #98 from microsoft/master
Add OpEvo example (microsoft#2549)
chicm-ms authored Jun 30, 2020
2 parents c8f3c5d + 6de1570 commit 4c306f0
Showing 37 changed files with 2,489 additions and 1 deletion.
3 changes: 2 additions & 1 deletion README.md
@@ -90,6 +90,7 @@ Within the following table, we summarized the current NNI capabilities, we are g
<li><a href="docs/en_US/TrialExample/Cifar10Examples.md">Cifar10-pytorch</a></li>
<li><a href="docs/en_US/TrialExample/SklearnExamples.md">Scikit-learn</a></li>
<li><a href="docs/en_US/TrialExample/EfficientNet.md">EfficientNet</a></li>
<li><a href="docs/en_US/TrialExample/OpEvoExamples.md">Kernel Tuning</a></li>
<a href="docs/en_US/SupportedFramework_Library.md">More...</a><br/>
</ul>
</ul>
@@ -170,7 +171,7 @@ Within the following table, we summarized the current NNI capabilities, we are g
<li><a href="docs/en_US/TrainingService/KubeflowMode.md">Kubeflow</a></li>
<li><a href="docs/en_US/TrainingService/FrameworkControllerMode.md">FrameworkController on K8S (AKS etc.)</a></li>
</ul>
<ul><li><a href="docs/en_US/TrainingService/DLTSMode.md">DLWorkspace (aka. DLTS)</a></li>
<ul><li><a href="docs/en_US/TrainingService/DLTSMode.md">DLWorkspace (aka. DLTS)</a></li>
</ul>
</td>
</tr>
85 changes: 85 additions & 0 deletions docs/en_US/TrialExample/OpEvoExamples.md
@@ -0,0 +1,85 @@
# Tuning Tensor Operators on NNI

## Overview

A growing range of applications demands that deep neural networks (DNNs) be trained and served efficiently on diverse hardware platforms, from cloud servers to embedded devices. Moreover, graph-level optimizations of DNNs, such as tensor operator fusion, can introduce new tensor operators. Manually optimized operators shipped in hardware-specific libraries therefore struggle to keep up with both new hardware platforms and new operators, which makes automatic tensor operator optimization across diverse hardware essential for deploying deep learning at scale in real-world applications.

Tensor operator optimization is essentially a combinatorial optimization problem. The objective function is the performance of a tensor operator on a specific hardware platform, which should be maximized with respect to the hyper-parameters of the corresponding device code, such as how to tile a matrix or whether to unroll a loop. This example illustrates how to automatically tune tensor operators with NNI. Three tuning algorithms are provided: OpEvo, G-BFS, and N-A2C. Please refer to [OpEvo: An Evolutionary Method for Tensor Operator Optimization](https://arxiv.org/abs/2006.05664) for a detailed explanation of these algorithms.
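To make the combinatorial framing concrete, here is a minimal Python sketch under stated assumptions; it is not the OpEvo algorithm or the example's trial code. It models a single hypothetical knob, splitting a loop of extent 512 into four ordered tile factors whose product must equal 512, enumerates that discrete space, and runs a toy evolutionary search whose mutation moves one prime factor between tile levels so that every candidate stays valid. `toy_latency` is an invented cost model; in this example the real objective is the measured performance of the generated device code.

```python
import random
from typing import List, Tuple

Tiling = Tuple[int, ...]


def enumerate_tilings(extent: int, levels: int) -> List[Tiling]:
    """All ordered tuples of `levels` positive ints whose product is `extent`."""
    if levels == 1:
        return [(extent,)]
    result: List[Tiling] = []
    for first in range(1, extent + 1):
        if extent % first == 0:
            for rest in enumerate_tilings(extent // first, levels - 1):
                result.append((first,) + rest)
    return result


def prime_factors(n: int) -> List[int]:
    """Prime factors of n with multiplicity, e.g. 12 -> [2, 2, 3]."""
    factors, p = [], 2
    while p * p <= n:
        while n % p == 0:
            factors.append(p)
            n //= p
        p += 1
    if n > 1:
        factors.append(n)
    return factors


def mutate(tiling: Tiling, rng: random.Random) -> Tiling:
    """Move one prime factor from one tile level to another.

    The product of all levels (the loop extent) is preserved, so every
    mutant is still a valid tiling. Assumes extent > 1 and at least 2 levels.
    """
    tiles = list(tiling)
    src = rng.choice([i for i, t in enumerate(tiles) if t > 1])
    dst = rng.choice([i for i in range(len(tiles)) if i != src])
    p = rng.choice(prime_factors(tiles[src]))
    tiles[src] //= p
    tiles[dst] *= p
    return tuple(tiles)


def toy_latency(tiling: Tiling) -> float:
    """HYPOTHETICAL cost model standing in for an on-device measurement."""
    return abs(tiling[-1] - 16) + abs(tiling[0] - 8) + 0.1 * tiling[1]


def evolve(extent: int, levels: int, generations: int = 200,
           population: int = 8, seed: int = 0) -> Tiling:
    """Toy (mu + lambda) evolutionary search over the tiling space."""
    rng = random.Random(seed)
    pop = rng.sample(enumerate_tilings(extent, levels), population)
    for _ in range(generations):
        children = [mutate(rng.choice(pop), rng) for _ in range(population)]
        # Keep the `population` distinct tilings with the lowest toy latency.
        pop = sorted(set(pop + children), key=toy_latency)[:population]
    return pop[0]


if __name__ == "__main__":
    space = enumerate_tilings(512, 4)
    print(len(space))  # 220 ordered ways to split 512 = 2**9 into 4 factors
    best = evolve(512, 4)
    print(best, toy_latency(best))
```

With one knob the space has only 220 points and could be searched exhaustively; real operators combine several tiling, unrolling, and other knobs, so the space grows multiplicatively, which is what motivates dedicated search algorithms.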


## Environment Setup

We provide a Dockerfile for setting up the experiment environment. Before starting, make sure the Docker daemon is running and the driver for your GPU is properly installed. Then change into the example folder `examples/trials/systems/opevo` and run one of the commands below, which builds a Docker image from the Dockerfile and starts a container from it.
```bash
# if you are using Nvidia GPU
make cuda-env
# if you are using AMD GPU
make rocm-env
```
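If you are unsure which target applies, the small Python helper below guesses it from the vendor tools found on `PATH`. It is a sketch that assumes `nvidia-smi` is installed with the NVIDIA driver and `rocm-smi` with ROCm; it is not part of the example, and `pick_make_target` is a hypothetical name.

```python
import shutil
from typing import Callable, Optional


def pick_make_target(which: Callable[[str], Optional[str]] = shutil.which) -> str:
    """Guess the Makefile target from the GPU tools available on PATH.

    Heuristic assumption: nvidia-smi implies an NVIDIA driver, rocm-smi
    implies a ROCm installation. NVIDIA is checked first.
    """
    if which("nvidia-smi"):
        return "cuda-env"
    if which("rocm-smi"):
        return "rocm-env"
    raise RuntimeError("neither nvidia-smi nor rocm-smi found; check the GPU driver installation")


if __name__ == "__main__":
    try:
        print("make", pick_make_target())
    except RuntimeError as err:
        print(err)
```

The `which` parameter is injectable so the heuristic can be tested without a GPU.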

## Run Experiments

Three representative kinds of tensor operators, **matrix multiplication**, **batched matrix multiplication**, and **2D convolution**, are taken from BERT and AlexNet and tuned with NNI. The `Trial` code for all tensor operators is `/root/compiler_auto_tune_stable.py`, and the `Search Space` and `config` files for each tuning algorithm are located in `/root/experiments/`, organized by tensor operator. Here `/root` is the directory inside the container into which the example's `src` folder is copied; it is also the container's working directory.
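To see which operators, shapes, and tuners are available inside the container, a short Python sketch can walk `/root/experiments/`. It assumes the layout used by the commands in this document, `experiments/<operator>/<shape>/config_<tuner>.yml`; `find_configs` is a hypothetical helper, not part of the example.

```python
from pathlib import Path
from typing import Dict, List


def find_configs(experiments_dir: str = "/root/experiments") -> Dict[str, Dict[str, List[str]]]:
    """Map operator -> shape directory -> tuner names found in that directory.

    Assumed layout: <experiments_dir>/<operator>/<shape>/config_<tuner>.yml
    """
    layout: Dict[str, Dict[str, List[str]]] = {}
    # Sorting the paths makes the output order (and each tuner list) stable.
    for cfg in sorted(Path(experiments_dir).glob("*/*/config_*.yml")):
        operator, shape = cfg.parent.parent.name, cfg.parent.name
        tuner = cfg.stem[len("config_"):]
        layout.setdefault(operator, {}).setdefault(shape, []).append(tuner)
    return layout
```

For example, `find_configs()["mm"]` would list the matrix multiplication shapes and, for each one, the tuners that have a config file.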

To tune the matrix multiplication operators, run the following commands from `/root`:
```bash
# (N, K) x (K, M) denotes a matrix of shape (N, K) multiplied by a matrix of shape (K, M)

# (512, 1024) x (1024, 1024)
# tuning with opevo
nnictl create --config experiments/mm/N512K1024M1024/config_opevo.yml
# tuning with g-bfs
nnictl create --config experiments/mm/N512K1024M1024/config_gbfs.yml
# tuning with n-a2c
nnictl create --config experiments/mm/N512K1024M1024/config_na2c.yml

# (512, 1024) x (1024, 4096)
# tuning with opevo
nnictl create --config experiments/mm/N512K1024M4096/config_opevo.yml
# tuning with g-bfs
nnictl create --config experiments/mm/N512K1024M4096/config_gbfs.yml
# tuning with n-a2c
nnictl create --config experiments/mm/N512K1024M4096/config_na2c.yml

# (512, 4096) x (4096, 1024)
# tuning with opevo
nnictl create --config experiments/mm/N512K4096M1024/config_opevo.yml
# tuning with g-bfs
nnictl create --config experiments/mm/N512K4096M1024/config_gbfs.yml
# tuning with n-a2c
nnictl create --config experiments/mm/N512K4096M1024/config_na2c.yml
```
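Because each directory name appears to encode the shape as `N<N>K<K>M<M>`, the nine commands can also be generated rather than typed. The Python sketch below only prints them; it assumes that naming convention (including `N512K4096M1024` for the (512, 4096) x (4096, 1024) case) and is not part of the example. It also includes the usual FLOP count for a matrix multiplication, one way to turn a measured latency into throughput when comparing tuners; this is an illustration, not necessarily the metric the trial code reports.

```python
import shlex
from typing import List, Tuple

# (N, K, M) for the three matrix multiplications listed in this section.
MM_SHAPES: List[Tuple[int, int, int]] = [
    (512, 1024, 1024),
    (512, 1024, 4096),
    (512, 4096, 1024),
]
TUNERS = ["opevo", "gbfs", "na2c"]


def mm_config_path(n: int, k: int, m: int, tuner: str) -> str:
    """Config path, assuming directories are named N<N>K<K>M<M>."""
    return f"experiments/mm/N{n}K{k}M{m}/config_{tuner}.yml"


def mm_commands() -> List[str]:
    """One `nnictl create` command per (shape, tuner) pair."""
    return [
        shlex.join(["nnictl", "create", "--config", mm_config_path(n, k, m, tuner)])
        for n, k, m in MM_SHAPES
        for tuner in TUNERS
    ]


def mm_flops(n: int, k: int, m: int) -> int:
    """(N, K) x (K, M) needs N*M*K multiply-adds, i.e. 2*N*K*M FLOPs."""
    return 2 * n * k * m


if __name__ == "__main__":
    for cmd in mm_commands():
        print(cmd)
```

Run the printed commands one at a time: each `nnictl create` starts an experiment whose web UI uses port 8080 by default, the port the Makefile publishes, so you will likely want to stop the current experiment with `nnictl stop` before launching the next.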

To tune the batched matrix multiplication operators, run the following commands from `/root`:
```bash
# a batch of 960 matrices of shape (128, 128) multiplied by a batch of 960 matrices of shape (128, 64)
nnictl create --config experiments/bmm/B960N128K128M64PNN/config_opevo.yml
# a batch of 960 matrices of shape (128, 128) is transposed first, then multiplied by a batch of 960 matrices of shape (128, 64)
nnictl create --config experiments/bmm/B960N128K128M64PTN/config_opevo.yml
# a batch of 960 matrices of shape (128, 64) multiplied on the right by the transpose of another batch of 960 matrices of shape (128, 64)
nnictl create --config experiments/bmm/B960N128K64M128PNT/config_opevo.yml
```
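The three variants differ only in which operand is transposed. The pure-Python reference below is a sketch that reads the two letters after `P` in each directory name as transpose flags for the first and second operand (`N` = used as stored, `T` = transposed). That reading is inferred from the comments above rather than taken from the trial code, and all helper names are hypothetical.

```python
from typing import List, Tuple

Matrix = List[List[float]]
Shape = Tuple[int, int, int]


def transpose(a: Matrix) -> Matrix:
    return [list(row) for row in zip(*a)]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    if len(a[0]) != len(b):
        raise ValueError("inner dimensions do not match")
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def batched_matmul(a_batch: List[Matrix], b_batch: List[Matrix],
                   trans_a: bool = False, trans_b: bool = False) -> List[Matrix]:
    """Multiply matching pairs of matrices, optionally transposing each operand first."""
    if len(a_batch) != len(b_batch):
        raise ValueError("batch sizes do not match")
    return [
        matmul(transpose(a) if trans_a else a, transpose(b) if trans_b else b)
        for a, b in zip(a_batch, b_batch)
    ]


def stored_shapes(batch: int, n: int, k: int, m: int,
                  trans_a: bool, trans_b: bool) -> Tuple[Shape, Shape, Shape]:
    """Shapes of the two stored operands and of the (batch, N, M) result."""
    a_shape = (batch, k, n) if trans_a else (batch, n, k)
    b_shape = (batch, m, k) if trans_b else (batch, k, m)
    return a_shape, b_shape, (batch, n, m)
```

For example, `stored_shapes(960, 128, 64, 128, False, True)` gives operand shapes `(960, 128, 64)` and `(960, 128, 64)` and a `(960, 128, 128)` result, matching the `PNT` command.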

To tune the 2D convolution operators, run the following commands from `/root`:
```bash
# input tensor of shape (512, 3, 227, 227) convolved with a kernel tensor of shape (64, 3, 11, 11), stride 4, padding 0
nnictl create --config experiments/conv/N512C3HW227F64K11ST4PD0/config_opevo.yml
# input tensor of shape (512, 64, 27, 27) convolved with a kernel tensor of shape (192, 64, 5, 5), stride 1, padding 2
nnictl create --config experiments/conv/N512C64HW27F192K5ST1PD2/config_opevo.yml
```
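Judging from the comments above, the directory names appear to encode the convolution parameters: batch `N`, input channels `C`, square input size `HW`, number of filters `F`, square kernel size `K`, stride `ST`, and padding `PD`. Under that assumption, the Python sketch below decodes a name and applies the standard output-size formula, floor((HW + 2·PD − K) / ST) + 1, plus a FLOP count. The helper names are hypothetical and not part of the example.

```python
import re
from typing import Dict, Tuple

CONV_NAME = re.compile(r"N(\d+)C(\d+)HW(\d+)F(\d+)K(\d+)ST(\d+)PD(\d+)")


def parse_conv_name(name: str) -> Dict[str, int]:
    """Decode e.g. 'N512C3HW227F64K11ST4PD0' (square input and kernel assumed)."""
    match = CONV_NAME.fullmatch(name)
    if match is None:
        raise ValueError(f"unrecognized conv directory name: {name}")
    keys = ("n", "c", "hw", "f", "k", "stride", "padding")
    return dict(zip(keys, map(int, match.groups())))


def conv2d_output_shape(n: int, c: int, hw: int, f: int, k: int,
                        stride: int, padding: int) -> Tuple[int, int, int, int]:
    """Output of an (n, c, hw, hw) input convolved with an (f, c, k, k) kernel."""
    out = (hw + 2 * padding - k) // stride + 1
    return (n, f, out, out)


def conv2d_flops(n: int, c: int, hw: int, f: int, k: int,
                 stride: int, padding: int) -> int:
    """2 FLOPs per multiply-add; no bias, dilation, or groups."""
    _, _, out, _ = conv2d_output_shape(n, c, hw, f, k, stride, padding)
    return 2 * n * f * out * out * c * k * k


if __name__ == "__main__":
    for name in ("N512C3HW227F64K11ST4PD0", "N512C64HW27F192K5ST1PD2"):
        params = parse_conv_name(name)
        print(name, conv2d_output_shape(**params), conv2d_flops(**params))
```

For the first AlexNet layer this gives a (512, 64, 55, 55) output, and for the second a (512, 192, 27, 27) output.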

Note that G-BFS and N-A2C cannot be used to tune the batched matrix multiplication and 2D convolution operators, because the search spaces of these operators contain parameters that these two algorithms do not support.

## Citing OpEvo

If you use OpEvo in your research, please consider citing the paper as follows:
```
@misc{gao2020opevo,
title={OpEvo: An Evolutionary Method for Tensor Operator Optimization},
author={Xiaotian Gao and Cui Wei and Lintao Zhang and Mao Yang},
year={2020},
eprint={2006.05664},
archivePrefix={arXiv},
primaryClass={cs.LG}
}
```
1 change: 1 addition & 0 deletions docs/en_US/examples.rst
@@ -11,5 +11,6 @@ Examples
EvolutionSQuAD<./TrialExample/SquadEvolutionExamples>
GBDT<./TrialExample/GbdtExample>
RocksDB <./TrialExample/RocksdbExamples>
OpEvo <./TrialExample/OpEvoExamples>
KDExample <./TrialExample/KDExample>
EfficientNet <./TrialExample/EfficientNet>
42 changes: 42 additions & 0 deletions examples/trials/systems/opevo/Dockerfile
@@ -0,0 +1,42 @@
FROM nvidia/cuda:10.0-cudnn7-devel-ubuntu18.04

ENV PYTHONDONTWRITEBYTECODE 1
ENV HIP_PLATFORM hcc
ENV PATH $PATH:/opt/rocm/bin:/usr/local/nvidia/lib64/bin
ENV TVM_HOME=/opt/tvm
ENV PYTHONPATH=/usr/local/rocm/src:$TVM_HOME/python:$TVM_HOME/topi/python:$TVM_HOME/nnvm/python
ENV HSA_USERPTR_FOR_PAGED_MEM=0

RUN env > /etc/environment

RUN apt-get update && apt-get install -y --no-install-recommends git ca-certificates \
python3-pip python3-wheel python3-setuptools python3-dev python3-pytest \
vim less netcat-openbsd inetutils-ping curl patch iproute2 \
g++ libpci3 libnuma-dev make cmake file openssh-server kmod gdb libopenmpi-dev openmpi-bin \
autoconf automake autotools-dev libtool multiarch-support \
&& rm -rf /var/lib/apt/lists/*

RUN curl -sL http://repo.radeon.com/rocm/apt/debian/rocm.gpg.key | apt-key add - && \
printf "deb [arch=amd64] http://repo.radeon.com/rocm/apt/3.3/ xenial main" | tee /etc/apt/sources.list.d/rocm_hip.list && \
apt-get update && DEBIAN_FRONTEND=noninteractive apt-get install -y --no-install-recommends \
rocm-dev zlib1g-dev unzip librdmacm-dev rocblas hipsparse rccl rocfft rocrand miopen-hip && apt-get clean && rm -rf /var/lib/apt/lists/*
RUN ln -sf libcudart.so /usr/local/cuda-10.0/targets/x86_64-linux/lib/libcudart_static.a

RUN pip3 install tornado psutil xgboost==0.80 numpy decorator attrs && rm -rf ~/.cache
RUN git clone https://github.com/dmlc/tvm $TVM_HOME

RUN cd $TVM_HOME && git checkout v0.6 && git submodule init && git submodule update && \
mkdir -p build && cd build && cp ../cmake/config.cmake . && \
sed -i 's/LLVM ON/LLVM OFF/g' config.cmake && sed -i 's/CUDA OFF/CUDA ON/g' config.cmake && \
cmake .. && make -j16

RUN pip3 install nni==1.5 && rm -rf ~/.cache
RUN pip3 install torch==1.5.0+cpu torchvision==0.6.0+cpu -f https://download.pytorch.org/whl/torch_stable.html && rm -rf ~/.cache

ADD tvm_patches/tvm_v0.6.patch $TVM_HOME/tvm_v0.6.patch
ADD tvm_patches/libcuda.so.1 $TVM_HOME/build
RUN ln -sf libcuda.so.1 $TVM_HOME/build/libcudart.so.10.0
RUN cd $TVM_HOME && git apply tvm_v0.6.patch && cd build && make -j16

ADD src /root/

14 changes: 14 additions & 0 deletions examples/trials/systems/opevo/Makefile
@@ -0,0 +1,14 @@
rocm-env: build
	docker run -it --rm --privileged -v /:/host -w /root \
		-e BACKEND=c-rocm -p 8080:8080 \
		tvm4nni bash || true

cuda-env: build
	docker run -it --rm --privileged -v /:/host -w /root \
		-e BACKEND=c-cuda -p 8080:8080 \
		-v /usr/lib/x86_64-linux-gnu/libcuda.so.1:/usr/lib/x86_64-linux-gnu/libcuda.so.1 \
		-v $(shell dirname `ldd /usr/lib/x86_64-linux-gnu/libcuda.so.1 | grep nvidia-fatbinaryloader | awk '{print $$3}'`):/usr/local/nvidia/lib64 \
		tvm4nni bash || true

build:
	docker build -t tvm4nni --network=host .
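The `-v $(shell dirname ...)` argument of the `cuda-env` target is dense. It runs `ldd` on the host's `libcuda.so.1`, keeps the line that mentions `nvidia-fatbinaryloader`, takes its third whitespace-separated field (the resolved library path; `$$3` is Make's escape for awk's `$3`), and mounts that file's directory at `/usr/local/nvidia/lib64` in the container. The Python sketch below reproduces only the parsing step so it can be checked on a host before running `make cuda-env`; the function names are hypothetical, and an empty result means no matching library was found.

```python
import os
import subprocess
from typing import List


def ldd_output(library: str) -> str:
    """Run `ldd` on a shared library and return its stdout (Linux only)."""
    return subprocess.run(["ldd", library], capture_output=True, text=True, check=True).stdout


def library_dirs(ldd_text: str, pattern: str = "nvidia-fatbinaryloader") -> List[str]:
    """Python version of the Makefile pipeline: dirname of field 3 of each matching line.

    ldd lines have the form `name => /resolved/path (0xADDRESS)`, so the
    third whitespace-separated field is the resolved path.
    """
    dirs = []
    for line in ldd_text.splitlines():
        fields = line.split()
        if pattern in line and len(fields) >= 3:
            dirs.append(os.path.dirname(fields[2]))
    return dirs
```

On a host, `library_dirs(ldd_output("/usr/lib/x86_64-linux-gnu/libcuda.so.1"))` shows the directory that the Makefile would mount.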