To reproduce the evaluation presented in the CS592 paper:
First, log in to the CUDA server cuda.cs.purdue.edu (make sure you have access to this server):

```bash
ssh username@cuda.cs.purdue.edu
```
Then, clone the code from GitHub and enter the repository directory:

```bash
git clone https://github.com/Kuigesi/PipelineExecution-Reproducible.git
cd PipelineExecution-Reproducible
```
Before running the evaluation, check the GPU usage: the evaluation requires 4 GPUs and occupies at most 5 GB of memory per GPU. To check GPU usage, run:

```bash
nvidia-smi
```
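If you prefer a compact view, nvidia-smi can also report just the index, model name, and free memory of each GPU (these query flags are standard nvidia-smi options, not something provided by this repository):

```bash
# List every GPU's index, model name, and free memory in CSV form
nvidia-smi --query-gpu=index,name,memory.free --format=csv
```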
The evaluation in our paper uses the first 4 GPUs (0, 1, 2, 3). To reproduce it, make sure these 4 GPUs are all available and each has at least 5 GB of free memory, then run:

```bash
bash ./runtest.sh
```
This will produce the following files:

- `./benchmark/data/benchmark.csv`, which collects the measured running times of the different parallel settings.
- Two figures, `./benchmark/pictures/pipelineparallelruntime.pdf` and `./benchmark/pictures/pipelineparallelspeedup.pdf`, which illustrate the runtime and speedup of the different parallel settings.
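To sanity-check the collected timings directly on the server, you can print the CSV as an aligned table (the exact column layout depends on the benchmark script and is not specified here):

```bash
# Show the benchmark results, splitting on commas into aligned columns
column -s, -t < ./benchmark/data/benchmark.csv
```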
To view the generated figures, transfer the PDF files to your local computer.
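For example, from your local machine you can copy them with scp (this assumes the repository was cloned into your home directory on the server):

```bash
# Copy both generated figures from the server to the current local directory
scp 'username@cuda.cs.purdue.edu:~/PipelineExecution-Reproducible/benchmark/pictures/*.pdf' .
```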
There are 8 GPUs in total on cuda.cs.purdue.edu: GPUs 0-3 are TITAN Xp, GPUs 4-5 are GeForce GTX TITAN, and GPUs 6-7 are Tesla K40c, so they have different compute capabilities. If GPUs 0-3 are not all available, you can run the experiment on other GPUs, but the results will differ substantially from the paper because the three GPU groups differ in compute capability.
You can switch to different GPU devices by passing the GPU device IDs to the script. Data parallelism on 2 GPUs will run on the first 2 given GPUs; the other settings will use all 4 given GPUs. For example, run

```bash
bash ./runtest.sh 4 5 2 3
```

to run the experiment on GPU 4, GPU 5, GPU 2, and GPU 3. GPU 4 and GPU 5 run significantly slower than GPUs 0-3, so the results will differ from those in the paper.
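Before picking alternative device IDs, you may want to confirm that those specific GPUs are idle and have enough free memory. nvidia-smi's `-i` flag restricts the report to the listed devices (a standard nvidia-smi option; the IDs below match the example above):

```bash
# Check free memory on just the GPUs you plan to use
nvidia-smi --query-gpu=index,name,memory.free --format=csv -i 4,5,2,3
```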