The ReproMPI Benchmark is a tool designed to accurately measure the run-time of MPI blocking collective operations. It provides multiple process synchronization methods and a flexible mechanism for predicting the number of measurements that are sufficient to obtain statistically sound results.
- Sascha Hunold, Alexandra Carpen-Amarie: On the Impact of Synchronizing Clocks and Processes on Benchmarking MPI Collectives. EuroMPI 2015: 8:1-8:10
- Sascha Hunold, Alexandra Carpen-Amarie, Jesper Larsson Träff: Reproducible MPI Micro-Benchmarking Isn't As Easy As You Think. EuroMPI/ASIA 2014: 69
- Sascha Hunold, Alexandra Carpen-Amarie: Reproducible MPI Benchmarking is Still Not as Easy as You Think. IEEE Trans. Parallel Distributed Syst. 27(12): 3617-3630 (2016)
- Sascha Hunold, Alexandra Carpen-Amarie: Hierarchical Clock Synchronization in MPI. CLUSTER 2018: 325-336
- Sascha Hunold, Alexandra Carpen-Amarie: Autotuning MPI Collectives using Performance Guidelines. HPC Asia 2018: 64-74
- Joseph Schuchart, Sascha Hunold, George Bosilca: Synchronizing MPI Processes in Space and Time. EuroMPI 2023: 7:1-7:11
- `mpibenchmark`: the actual MPI benchmark for collectives
- `pgchecker`: the performance guideline checker
Prerequisites:

- an MPI library
- CMake (version >= 3.0)
- the GNU Scientific Library (GSL)
```
cd $BENCHMARK_PATH
cmake .
make
```
For specific configuration options, see the Benchmark Configuration section below.
The ReproMPI code serves two purposes: benchmarking MPI collective operations (`mpibenchmark`) and checking performance guidelines (`pgchecker`).
The most common usage scenario of the benchmark is to specify an MPI collective function to be benchmarked, a (list of) message sizes and the number of measurement repetitions for each test, as in the following example.
```
mpirun -np 4 ./bin/mpibenchmark --calls-list=MPI_Bcast,MPI_Allgather \
    --msizes-list=8,1024,2048 --nrep=10
```
The benchmark supports the following command-line options:

- `-h`: print help
- `-v`: print the run-times measured for each process
- `--msizes-list=<values>`: list of comma-separated message sizes in Bytes, e.g., `--msizes-list=10,1024`
- `--msize-interval=min=<min>,max=<max>,step=<step>`: list of power-of-two message sizes in the interval between $2^{min}$ and $2^{max}$, with a factor of $2^{step}$ between consecutive values, e.g., `--msize-interval=min=1,max=4,step=1`
- `--calls-list=<args>`: list of comma-separated MPI calls to be benchmarked, e.g., `--calls-list=MPI_Bcast,MPI_Allgather`
- `--root-proc=<process_id>`: root process for collective operations
- `--operation=<mpi_op>`: MPI operation applied by collective operations (where applicable), e.g., `--operation=MPI_BOR`; supported operations: MPI_BOR, MPI_BAND, MPI_LOR, MPI_LAND, MPI_MIN, MPI_MAX, MPI_SUM, MPI_PROD
- `--datatype=<mpi_type>`: MPI datatype used by collective operations, e.g., `--datatype=MPI_CHAR`; supported datatypes: MPI_CHAR, MPI_INT, MPI_FLOAT, MPI_DOUBLE
- `--shuffle-jobs`: shuffle the experiments before running the benchmark
- `--params=k1:v1,k2:v2`: list of comma-separated key:value pairs to be printed in the benchmark output
- `-f | --input-file=<path>`: input file containing the list of benchmarking jobs (tuples of MPI function, message size, and number of repetitions); it replaces all the other common options (see the sketch after this list)
- `--window-size=<win>`: window size in microseconds for window-based synchronization
- `--nrep=<nrep>`: set the number of experiment repetitions
- `--summary=<args>`: list of comma-separated data summarizing methods (mean, median, min, max, var, stddev), e.g., `--summary=mean,max`
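For instance, several of these options can be combined in one run; a sketch, in which the process count and all option values are illustrative:

```
mpirun -np 16 ./bin/mpibenchmark --calls-list=MPI_Bcast,MPI_Allgather \
    --msize-interval=min=1,max=10,step=1 --nrep=30 --summary=mean,median,max
```

For `-f | --input-file`, the benchmark reads one benchmarking job per line. The exact file syntax is not documented here; the sketch below assumes a whitespace-separated tuple of MPI function, message size in Bytes, and number of repetitions, as suggested by the option description:

```
# hypothetical job list: <MPI function> <message size> <nrep>
MPI_Bcast 1024 10
MPI_Allgather 2048 10
```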
The following blocking collective operations can be benchmarked:

- MPI_Allgather
- MPI_Allreduce
- MPI_Alltoall
- MPI_Barrier
- MPI_Bcast
- MPI_Exscan
- MPI_Gather
- MPI_Reduce
- MPI_Reduce_scatter
- MPI_Reduce_scatter_block
- MPI_Scan
- MPI_Scatter
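Any of these collectives can be passed to `--calls-list`; for reduction-based collectives, the operation and datatype can be selected as well. A sketch with illustrative values:

```
mpirun -np 8 ./bin/mpibenchmark --calls-list=MPI_Reduce,MPI_Allreduce \
    --operation=MPI_SUM --datatype=MPI_DOUBLE --msizes-list=1024 --nrep=20
```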
For each collective, `pgchecker` can compare the default implementation provided by the MPI library against the following alternative (mock-up) implementations:

| MPI_Allgather | MPI_Allreduce | MPI_Alltoall | MPI_Bcast | MPI_Gather | MPI_Reduce | MPI_Reduce_scatter_block | MPI_Scan | MPI_Scatter |
|---|---|---|---|---|---|---|---|---|
| Default | Default | Default | Default | Default | Default | Default | Default | Default |
| Allgatherv | Reduce+Bcast | Alltoallv | Allgatherv | Allgather | Allreduce | Reduce+Scatter | Exscan+Reducelocal | Bcast |
| Allreduce | Reducescatterblock+Allgather | Lane | Scatter+Allgather | Gatherv | Reducescatterblock+Gather | Reducescatter | Lane | Scatterv |
| Alltoall | Reducescatter+Allgatherv | Lane | Reduce | Reducescatter+Gatherv | Allreduce | Hier | Lane | |
| Gather+Bcast | Lane | Hier | Lane | Reducescatter | Hier | Hier | | |
| Lane | Hier | Hier | Lane | Lane | | | | |
| Lane Zero | Hier | | | | | | | |
| Hier | | | | | | | | |
Synchronization with `MPI_Barrier` is the default process synchronization method enabled for the benchmark.
To benchmark collective operations across multiple MPI libraries using the same barrier implementation, the benchmark also provides a dissemination barrier that can replace the default `MPI_Barrier` for synchronizing processes.
To enable the dissemination barrier, the following flag has to be set before compiling the benchmark (e.g., using the `ccmake` command):
ENABLE_BENCHMARK_BARRIER
Both barrier-based synchronization methods can alternatively use a double barrier before each measurement, which is enabled by the following flag:
ENABLE_DOUBLE_BARRIER
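For example, a minimal configuration sketch (assuming an in-source build in `$BENCHMARK_PATH`) that enables both the dissemination barrier and the double barrier:

```
cd $BENCHMARK_PATH
# enable the dissemination barrier and the double barrier at configure time
cmake -DENABLE_BENCHMARK_BARRIER=ON -DENABLE_DOUBLE_BARRIER=ON .
make
```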
The ReproMPI benchmark implements a window-based process synchronization mechanism, which estimates the clock offset/drift of each process relative to a reference process and then uses the obtained global clocks to synchronize processes before each measurement and to compute run-times.
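A sketch of a window-based run, assuming the benchmark was compiled with one of the window-based clock synchronization flags listed in the Benchmark Configuration section (here, `ENABLE_WINDOWSYNC_HCA`); the window size and process count are illustrative:

```
# configure with a window-based clock synchronization method
cmake -DENABLE_WINDOWSYNC_HCA=ON .
make
# run with a 1000-microsecond synchronization window
mpirun -np 16 ./bin/mpibenchmark --calls-list=MPI_Bcast --msizes-list=1024 \
    --nrep=30 --window-size=1000
```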
The MPI operation run-time is computed in a different manner depending on the selected clock synchronization method. If global clocks are available, the run-times are computed as the difference between the largest exit time and the first start time among all processes.
If a barrier-based synchronization is used, the run-time of an MPI call is computed as the largest local run-time across all processes.
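In symbols (notation introduced here for illustration): let $s_p$ and $e_p$ denote the start and exit times of the MPI call on process $p$. Then

$$t_{\mathrm{global}} = \max_p e_p - \min_p s_p, \qquad t_{\mathrm{local}} = \max_p \left( e_p - s_p \right),$$

where $t_{\mathrm{global}}$ is the run-time obtained with global clocks and $t_{\mathrm{local}}$ the run-time obtained with barrier-based synchronization.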
However, the timing procedure that relies on global clocks can also be used in combination with a barrier-based synchronization when the following flag is enabled:
ENABLE_GLOBAL_TIMES
The `MPI_Wtime` call is used by default to obtain the current time. To obtain accurate measurements of short time intervals, the benchmark can rely on the high-resolution `RDTSC`/`RDTSCP` instructions (if they are available on the test machines) by setting one of the following flags:
ENABLE_RDTSC
ENABLE_RDTSCP
Additionally, setting the clock frequency of the CPU is required to obtain accurate measurements:
FREQUENCY_MHZ 2300
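For example, a configuration sketch that enables `RDTSCP`-based timing with a manually specified CPU frequency (the value 2300 MHz is illustrative and must match the actual clock frequency of the test machine):

```
# use RDTSCP timers with a fixed, manually specified clock frequency
cmake -DENABLE_RDTSCP=ON -DFREQUENCY_MHZ=2300 .
make
```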
The clock frequency can also be automatically estimated (as done by the NetGauge tool) by enabling the following variable:
CALIBRATE_RDTSC
However, this method reduces the accuracy of the results, and we advise manually setting the highest CPU frequency instead. More details about the usage of `RDTSC`-based timers can be found in our research report.
This is the full list of compilation flags that can be used to control all the previously detailed configuration parameters.
```
CALIBRATE_RDTSC            OFF
COMPILE_BENCH_TESTS        OFF
COMPILE_SANITY_CHECK_TESTS OFF
ENABLE_BENCHMARK_BARRIER   OFF
ENABLE_DOUBLE_BARRIER      OFF
ENABLE_GLOBAL_TIMES        OFF
ENABLE_LOGP_SYNC           OFF
ENABLE_RDTSC               OFF
ENABLE_RDTSCP              OFF
ENABLE_WINDOWSYNC_HCA      OFF
ENABLE_WINDOWSYNC_JK       OFF
ENABLE_WINDOWSYNC_SK       OFF
FREQUENCY_MHZ              2300
```
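These flags can be inspected and modified interactively with `ccmake`, or set non-interactively on the CMake command line; a sketch (the flag choices are illustrative):

```
# inspect and edit the cached configuration interactively
ccmake $BENCHMARK_PATH
# or set flags directly at configure time
cmake -DENABLE_GLOBAL_TIMES=ON -DENABLE_WINDOWSYNC_SK=ON .
```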
The benchmark additionally provides a two-level hierarchical clock synchronization:

- the top level synchronizes the clocks between compute nodes;
- the bottom level synchronizes the clocks within each compute node.

By default, the top level uses HCA3 and the bottom level uses ClockPropagation.