Merge pull request #98 from microsoft/master

Add OpEvo example (microsoft#2549)
chicm-ms · Jun 30, 2020 · 4c306f0
2 parents c8f3c5d + 6de1570

Showing 37 changed files with 2,489 additions and 1 deletion.
diff --git a/README.md b/README.md
@@ -90,6 +90,7 @@ Within the following table, we summarized the current NNI capabilities, we are g
            <li><a href="docs/en_US/TrialExample/Cifar10Examples.md">Cifar10-pytorch</li></a>
            <li><a href="docs/en_US/TrialExample/SklearnExamples.md">Scikit-learn</a></li>
            <li><a href="docs/en_US/TrialExample/EfficientNet.md">EfficientNet</a></li>
+           <li><a href="docs/en_US/TrialExample/OpEvoExamples.md">Kernel Tuning</a></li>
               <a href="docs/en_US/SupportedFramework_Library.md">More...</a><br/>
           </ul>
         </ul>
@@ -170,7 +171,7 @@ Within the following table, we summarized the current NNI capabilities, we are g
             <li><a href="docs/en_US/TrainingService/KubeflowMode.md">Kubeflow</a></li>
             <li><a href="docs/en_US/TrainingService/FrameworkControllerMode.md">FrameworkController on K8S (AKS etc.)</a></li>
             </ul>
-            <ul><li><a href="docs/en_US/TrainingService/DLTSMode.md">DLWorkspace (aka. DLTS)</a></li>        
+            <ul><li><a href="docs/en_US/TrainingService/DLTSMode.md">DLWorkspace (aka. DLTS)</a></li>
       </ul>
       </td>
     </tr>

diff --git a/docs/en_US/TrialExample/OpEvoExamples.md b/docs/en_US/TrialExample/OpEvoExamples.md
@@ -0,0 +1,85 @@
+# Tuning Tensor Operators on NNI
+
+## Overview
+
+The growing number of applications increases the demand for training and running inference with deep neural networks (DNNs) efficiently on diverse hardware platforms, ranging from cloud servers to embedded devices. Moreover, computational graph-level optimizations of deep neural networks, such as tensor operator fusion, may introduce new tensor operators. Manually optimized tensor operators provided by hardware-specific libraries are therefore limited in their support for new hardware platforms and new operators, so automatically optimizing tensor operators on diverse hardware platforms is essential for large-scale deployment and application of deep learning technologies to real-world problems.
+
+Tensor operator optimization is essentially a combinatorial optimization problem. The objective function is the performance of a tensor operator on a specific hardware platform, which should be maximized with respect to the hyper-parameters of the corresponding device code, such as how to tile a matrix or whether to unroll a loop. This example illustrates how to automatically tune tensor operators with NNI. Three tuning algorithms are provided: OpEvo, G-BFS and N-A2C. Please refer to [OpEvo: An Evolutionary Method for Tensor Operator Optimization](https://arxiv.org/abs/2006.05664) for a detailed explanation of these algorithms.
+
+
+## Environment Setup
+
+We prepared a Dockerfile for setting up the experiment environment. Before starting, please make sure the Docker daemon is running and the driver for your GPU accelerator is properly installed. Enter the example folder `examples/trials/systems/opevo` and run the command below to build the Docker image from the Dockerfile and start a container.
+```bash
+# if you are using Nvidia GPU
+make cuda-env
+# if you are using AMD GPU
+make rocm-env
+```
+
+## Run Experiments
+
+Three representative kinds of tensor operators, **matrix multiplication**, **batched matrix multiplication** and **2D convolution**, are chosen from BERT and AlexNet and tuned with NNI. The `Trial` code for all tensor operators is `/root/compiler_auto_tune_stable.py`, and the `Search Space` files and `config` files for each tuning algorithm are located in `/root/experiments/`, organized by tensor operator. Here `/root` refers to the `/root` directory inside the container.
+
+To tune the matrix multiplication operators, run the commands below from `/root`:
+```bash
+# (N, K) x (K, M) denotes a matrix of shape (N, K) multiplied by a matrix of shape (K, M)
+
+# (512, 1024) x (1024, 1024)
+# tuning with opevo
+nnictl create --config experiments/mm/N512K1024M1024/config_opevo.yml
+# tuning with g-bfs
+nnictl create --config experiments/mm/N512K1024M1024/config_gbfs.yml
+# tuning with n-a2c
+nnictl create --config experiments/mm/N512K1024M1024/config_na2c.yml
+
+# (512, 1024) x (1024, 4096)
+# tuning with opevo
+nnictl create --config experiments/mm/N512K1024M4096/config_opevo.yml
+# tuning with g-bfs
+nnictl create --config experiments/mm/N512K1024M4096/config_gbfs.yml
+# tuning with n-a2c
+nnictl create --config experiments/mm/N512K1024M4096/config_na2c.yml
+
+# (512, 4096) x (4096, 1024)
+# tuning with opevo
+nnictl create --config experiments/mm/N512K4096M1024/config_opevo.yml
+# tuning with g-bfs
+nnictl create --config experiments/mm/N512K4096M1024/config_gbfs.yml
+# tuning with n-a2c
+nnictl create --config experiments/mm/N512K4096M1024/config_na2c.yml
+```
+
+To tune the batched matrix multiplication operators, run the commands below from `/root`:
+```bash
+# a batched matrix with batch size 960 and matrix shape (128, 128) multiplied by a batched matrix with batch size 960 and matrix shape (128, 64)
+nnictl create --config experiments/bmm/B960N128K128M64PNN/config_opevo.yml
+# a batched matrix with batch size 960 and matrix shape (128, 128) is transposed first and then multiplied by a batched matrix with batch size 960 and matrix shape (128, 64)
+nnictl create --config experiments/bmm/B960N128K128M64PTN/config_opevo.yml
+# a batched matrix with batch size 960 and matrix shape (128, 64) multiplied by the transpose of a batched matrix with batch size 960 and matrix shape (128, 64)
+nnictl create --config experiments/bmm/B960N128K64M128PNT/config_opevo.yml
+```
+
+To tune the 2D convolution operators, run the commands below from `/root`:
+```bash
+# an image tensor of shape (512, 3, 227, 227) convolved with a kernel tensor of shape (64, 3, 11, 11), with stride 4 and padding 0
+nnictl create --config experiments/conv/N512C3HW227F64K11ST4PD0/config_opevo.yml
+# an image tensor of shape (512, 64, 27, 27) convolved with a kernel tensor of shape (192, 64, 5, 5), with stride 1 and padding 2
+nnictl create --config experiments/conv/N512C64HW27F192K5ST1PD2/config_opevo.yml
+```
+
+Please note that G-BFS and N-A2C cannot be used to tune the batched matrix multiplication and 2D convolution operators, since the search spaces of these operators contain parameters that these two algorithms do not support.
+
+## Citing OpEvo
+
+If you use OpEvo in your research, please consider citing the paper as follows:
+```
+@misc{gao2020opevo,
+    title={OpEvo: An Evolutionary Method for Tensor Operator Optimization},
+    author={Xiaotian Gao and Wei Cui and Lintao Zhang and Mao Yang},
+    year={2020},
+    eprint={2006.05664},
+    archivePrefix={arXiv},
+    primaryClass={cs.LG}
+}
+```
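
The Overview in `OpEvoExamples.md` describes operator tuning as a combinatorial search over device-code hyper-parameters such as tile sizes and loop unrolling. The following is a minimal Python sketch of that idea, not the OpEvo algorithm and not the search space shipped with this example. The knob names (`tile_n`, `tile_m`, `unroll`) and the `toy_score` function are hypothetical stand-ins; in a real experiment the score would come from measuring the compiled kernel on the target device.

```python
import itertools


def tilings(extent, levels):
    """Return every ordered way to factor `extent` into `levels` positive integers."""
    if levels == 1:
        return [(extent,)]
    result = []
    for factor in range(1, extent + 1):
        if extent % factor == 0:
            for rest in tilings(extent // factor, levels - 1):
                result.append((factor,) + rest)
    return result


def search_space(n, m, levels=2):
    """Cartesian product of tilings for two axes and an unroll flag (hypothetical knobs)."""
    return [
        {"tile_n": tile_n, "tile_m": tile_m, "unroll": unroll}
        for tile_n, tile_m, unroll in itertools.product(
            tilings(n, levels), tilings(m, levels), (False, True)
        )
    ]


def toy_score(config):
    """Stand-in objective, NOT a real measurement: prefers inner tiles close to 16."""
    penalty = abs(config["tile_n"][-1] - 16) + abs(config["tile_m"][-1] - 16)
    return -penalty + (1 if config["unroll"] else 0)


# Exhaustive search is only feasible because this toy space is tiny (84 configurations).
space = search_space(64, 32)
best = max(space, key=toy_score)
print(len(space), best)
```

Even this two-axis toy grows quickly as axes and tiling levels are added, which is why the example relies on tuners rather than exhaustive enumeration.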
diff --git a/docs/en_US/examples.rst b/docs/en_US/examples.rst
@@ -11,5 +11,6 @@ Examples
     EvolutionSQuAD<./TrialExample/SquadEvolutionExamples>
     GBDT<./TrialExample/GbdtExample>
     RocksDB <./TrialExample/RocksdbExamples>
+    OpEvo <./TrialExample/OpEvoExamples>
     KDExample <./TrialExample/KDExample>
     EfficientNet <./TrialExample/EfficientNet>
diff --git a/examples/trials/systems/opevo/Dockerfile b/examples/trials/systems/opevo/Dockerfile
@@ -0,0 +1,42 @@
+FROM nvidia/cuda:10.0-cudnn7-devel-ubuntu18.04
+
+ENV PYTHONDONTWRITEBYTECODE 1
+ENV HIP_PLATFORM hcc
+ENV PATH $PATH:/opt/rocm/bin:/usr/local/nvidia/lib64/bin
+ENV TVM_HOME=/opt/tvm
+ENV PYTHONPATH=/usr/local/rocm/src:$TVM_HOME/python:$TVM_HOME/topi/python:$TVM_HOME/nnvm/python
+ENV HSA_USERPTR_FOR_PAGED_MEM=0
+
+RUN env > /etc/environment
+
+RUN apt-get update && apt-get install -y --no-install-recommends git ca-certificates \
+    python3-pip python3-wheel python3-setuptools python3-dev python3-pytest \
+    vim less netcat-openbsd inetutils-ping curl patch iproute2 \
+    g++ libpci3 libnuma-dev make cmake file openssh-server kmod gdb libopenmpi-dev openmpi-bin \
+        autoconf automake autotools-dev libtool multiarch-support \
+    && rm -rf /var/lib/apt/lists/*
+
+RUN curl -sL http://repo.radeon.com/rocm/apt/debian/rocm.gpg.key | apt-key add - && \
+    printf "deb [arch=amd64] http://repo.radeon.com/rocm/apt/3.3/ xenial main" | tee /etc/apt/sources.list.d/rocm_hip.list && \
+    apt-get update && DEBIAN_FRONTEND=noninteractive apt-get install -y --no-install-recommends \
+    rocm-dev zlib1g-dev unzip librdmacm-dev rocblas hipsparse rccl rocfft rocrand miopen-hip && apt-get clean && rm -rf /var/lib/apt/lists/*
+RUN ln -sf libcudart.so /usr/local/cuda-10.0/targets/x86_64-linux/lib/libcudart_static.a
+
+RUN pip3 install tornado psutil xgboost==0.80 numpy decorator attrs && rm -rf ~/.cache
+RUN git clone https://github.com/dmlc/tvm $TVM_HOME
+
+RUN cd $TVM_HOME && git checkout v0.6 && git submodule init && git submodule update && \
+    mkdir -p build && cd build && cp ../cmake/config.cmake . && \
+    sed -i 's/LLVM ON/LLVM OFF/g' config.cmake && sed -i 's/CUDA OFF/CUDA ON/g' config.cmake && \
+    cmake .. && make -j16
+
+RUN pip3 install nni==1.5 && rm -rf ~/.cache
+RUN pip3 install torch==1.5.0+cpu torchvision==0.6.0+cpu -f https://download.pytorch.org/whl/torch_stable.html && rm -rf ~/.cache
+
+ADD tvm_patches/tvm_v0.6.patch $TVM_HOME/tvm_v0.6.patch
+ADD tvm_patches/libcuda.so.1 $TVM_HOME/build
+RUN ln -sf libcuda.so.1 $TVM_HOME/build/libcudart.so.10.0
+RUN cd $TVM_HOME && git apply tvm_v0.6.patch && cd build && make -j16
+
+ADD src /root/
+
diff --git a/examples/trials/systems/opevo/Makefile b/examples/trials/systems/opevo/Makefile
@@ -0,0 +1,14 @@
+rocm-env: build
+	docker run -it --rm --privileged -v /:/host -w /root \
+		-e BACKEND=c-rocm -p 8080:8080 \
+		tvm4nni bash || true
+
+cuda-env: build
+	docker run -it --rm --privileged -v /:/host -w /root \
+		-e BACKEND=c-cuda -p 8080:8080 \
+		-v /usr/lib/x86_64-linux-gnu/libcuda.so.1:/usr/lib/x86_64-linux-gnu/libcuda.so.1 \
+		-v $(shell dirname `ldd /usr/lib/x86_64-linux-gnu/libcuda.so.1 | grep nvidia-fatbinaryloader | awk '{print $$3}'`):/usr/local/nvidia/lib64 \
+		tvm4nni bash || true
+
+build:
+	docker build -t tvm4nni --network=host .