This code branch is used for OSDI'23 Artifact Evaluation of paper #847, titled "Welder: Scheduling Deep Learning Memory Access via Tile-graph".
Artifacts Available:
- Most Welder-related code is open-sourced in this repo. Some Welder-related code is implemented in the welder branch of the TVM and NNFusion repos.
Artifacts Functional:
- Documentation: the following documents include detailed guidelines on how to build, install, and test Welder, as well as how to run the experiments that compare it with other baselines.
- Completeness: the source code under the "welder/" folder includes all the key components of Welder described in the paper.
- Exercisability: under the artifacts folder, we provide all the scripts and data needed to reproduce the experiments, organized into individual folders named after the corresponding figure or table in the paper.
Results Reproduced:
- To reproduce the main results presented in our paper, we provide a Docker image containing the required environments. Because the GraphCore and ROCm environments are internal resources with restricted accessibility, we use the CUDA GPU (NVIDIA Tesla V100) environment to reproduce the main results, with detailed step-by-step guidelines. For the remaining, inaccessible environments, we also provide detailed guidelines to help reproduce the results step by step.
To ease the process of installing all the dependencies, baseline software, and Welder code, we provide a Dockerfile and a simple guideline for building a Docker image with all of the above installed.
cd welder
# build the image
docker build -t welder_cuda .
# run the container
nvidia-docker run -it --cap-add=SYS_ADMIN --network=host --name welder_test welder_cuda bash
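Optionally, you can verify that the GPU is visible inside the container before running any experiments (a simple sanity check, not required by the scripts themselves):
# optional sanity check: the V100 should appear in the output
nvidia-smi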
Welder's paper evaluates different models with different batch sizes and data types, leading to more than 50 models that must be tuned to completely reproduce the paper's results. To help reproduce them quickly, we have uploaded all of Welder's compiled models for the V100 GPU as temp.tar.gz on Google Drive:
pip install gdown
gdown https://drive.google.com/u/0/uc?id=1xJUk7ZBoe6bjaqMpTI-n9gqGtc01IOWG
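After the download finishes, unpack the archive so that the pre-compiled models end up under artifacts/temp. One possible way to do this (a sketch, assuming temp.tar.gz was downloaded into the artifacts folder and unpacks into a temp/ directory; adjust the paths if your layout differs):
# extract the pre-compiled models (paths are an assumption, adjust as needed)
cd artifacts
tar -xzf temp.tar.gz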
You should then see many model folders under artifacts/temp. With these pre-compiled models, the results can be reproduced quickly with a few commands. Here is the list of scripts we provide:
Name | Description | Folder |
---|---|---|
Figure1 | onnxruntime memory performance for different models | Figure1 |
Figure2 | Latency Number of a simple case | Figure2 |
Figure9 | Model inference performance on V100 FP32 | Figure9_10 |
Figure10 | Model inference performance on V100 FP16 (TensorCore) | Figure9_10 |
Figure11 | Model inference performance on V100 FP16 (No TensorCore) | Figure11 |
Figure13 | Latency, kernel count, global memory transaction and IRS | Figure13 |
Table3 | Performance for WELDER and FasterTransformer | Table3 |
Table5 | Compilation time of Ansor and Welder | Table5 |
Table6 | Performance on compute intensive models | Table6 |
Table7 | Scale-up large DNN models to host memory (GPU) | Table7 |
Note that the results in Figure13 and part of Table6 require ROCm GPU / GraphCore IPU environments, which are not directly available here.
For Figure1, the run instruction is
python run_all.py
For Figure2, the run instruction is
python run_all.py
Figure9 and Figure10 (the Figure9_10 folder) include several baselines. The run instructions for Welder, onnxruntime, PyTorch, TensorRT, and Rammer are
python profile_rammer_all.py
python profile_ort_all.py
python profile_torch_all.py
python profile_welder_all.py
python profile_trt_all.py
The run instructions for Ansor are below; they require an additional step before running.
# Our tuning log for Ansor only applies to this version.
cd /root/tvm/build && git checkout v0.9.0 && make -j
# after switching branch
cd -
python profile_ansor_all.py
# don't forget to get back
cd /root/tvm/build && git checkout welder && make -j
For Figure11, the run instruction is
python profile_welder_no_tc.py
Note that the Ansor baseline's result is already produced in the section above by profile_ansor_all.py.
For Figure13, the run instructions are
# measure latency, IRS and kernel count
python get_IRS.py
# measure memory perf
python get_metrics.py
# measure Ansor's latency, IRS, kernel count and memory perf
cd /root/tvm/build && git checkout v0.9.0 && make -j
python get_ansor_data.py
cd /root/tvm/build && git checkout welder && make -j
Note 1: get_ansor_data.py requires TVM v0.9.0; please switch to that version following the instructions above.
Note 2: the memory perf (Load/Store trans) reported by get_ansor_data.py should be halved, because the evaluator actually runs the model twice.
For Table3, the run instruction is
python run_ft_cpp_all.py
If FasterTransformer is not installed, build it with the following commands:
git clone https://github.com/NVIDIA/FasterTransformer
cd FasterTransformer
git checkout release/v5.2_bug_fix_tag
# remove line 20, add_definitions("-DENABLE_BF16"), in CMakeLists.txt;
# we don't use BF16 and it causes a compile error.
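# one possible way to do this non-interactively (a sketch; deletes any line mentioning -DENABLE_BF16):
sed -i '/-DENABLE_BF16/d' CMakeLists.txt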
mkdir build && cd build
cmake .. -DSM=70 -DCMAKE_BUILD_TYPE=Release
make bert_example bert_gemm vit_example vit_gemm swin_example swin_gemm -j
For Table5, the run instructions are
python estimate_run_time_welder.py
python estimate_run_time_ansor.py
For Table6, the run instruction is
python run_all.py
For Table7, the run instruction is
bash run_all.sh
Instead of using the pre-compiled models provided above, you can also run Welder from scratch. Compiling a model with Welder involves several steps.
python torch2onnx.py MODEL --prefix PREFIX [--bs BATCHSIZE] [--fp16]
To generate an ONNX model, we first use the script torch2onnx.py (shown above) to produce an onnx file under the PREFIX folder. It is recommended to create a new PREFIX folder for every model.
The MODEL parameter can be one of the ten models evaluated in the paper (bert, vit, swin_transformer, BSRN, NAFNet, Restormer, mobilevit, Conformer, mobilenet and NeRF).
The default batch size is 1 and can be set with the --bs flag. The default data type is float32; if --fp16 is used, the data type will be float16.
After running this command, the PREFIX folder will be created containing a model.onnx file. This PREFIX is used in the following Welder compilation steps; some other baselines also use this PREFIX as their workspace.
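For example, exporting bert with batch size 64 in float32 could look like the following (the PREFIX name bert_b64_fp32 is just an illustration):
python torch2onnx.py bert --prefix bert_b64_fp32 --bs 64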
After the PREFIX folder is created, run the following command
python tune_welder.py PREFIX --topk 20 --arch V100
This command compiles the model.onnx under the PREFIX folder. The --topk 20 and --arch V100 flags indicate that 20 trials are made for each task (subgraph) and that the V100 GPU is the target.
When reproducing the results in the paper, a special flag is added in three cases: bert (fp32, bs=1 and 64) and swin_transformer (fp16, bs=1). In these three cases, we add an additional compile flag:
python tune_welder.py PREFIX --topk 20 --arch V100 --skip_dot
This flag lowers some Dot kernels to the CUDA library (cuBLAS), which performs better than the generated kernels in these three cases.
After running the previous command, you can profile the latency of Welder's generated model:
# to evaluate inference performance, you can directly use an executable
cd PREFIX/nnfusion_rt/cuda_codegen
./build/main_test
# OR use python script which feeds data with pytorch and ctypes
python3 run_welder.py PREFIX
To check the correctness of Welder's compiled model, you can run the following command to compare Welder's output with onnx-runtime's output.
python3 test_acc.py PREFIX
You can also run other baselines on this model:
# torch
python run_torch.py MODEL [--bs BATCHSIZE] [--fp16]
# TensorRT
python run_trt.py --prefix PREFIX [--fp16]
# onnxruntime
python run_onnxrt.py --prefix PREFIX
# Ansor, note that Ansor requires about one day to tune for a model
python run_ansor.py --prefix PREFIX
# Astitch
python3 run_blade.py MODEL [--bs BATCHSIZE] [--fp16]
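Putting the steps above together, a from-scratch run for one of the --skip_dot cases could look like the following sketch (the PREFIX name swin_b1_fp16 is only an illustration):
# export the model, compile it with Welder, then measure latency and check correctness
python torch2onnx.py swin_transformer --prefix swin_b1_fp16 --bs 1 --fp16
python tune_welder.py swin_b1_fp16 --topk 20 --arch V100 --skip_dot
python3 run_welder.py swin_b1_fp16
python3 test_acc.py swin_b1_fp16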