BLISlab: A Sandbox for Optimizing GEMM
Matrix-matrix multiplication is a fundamental operation of great importance to scientific computing and, increasingly, machine learning. It is a simple enough concept to be introduced in a typical high school algebra course, yet in practice important enough that its implementation on computers continues to be an active research topic. This note describes a set of exercises that use this operation to illustrate how high performance can be attained on modern CPUs with hierarchical memories (multiple caches). It does so by building on the insights that underlie the BLAS-like Library Instantiation Software (BLIS) framework, exposing a simplified "sandbox" that mimics the implementation in BLIS. As such, it also becomes a vehicle for the "crowd sourcing" of the optimization of BLIS. We call this set of exercises BLISlab.
Matrix-matrix multiplication (GEMM) is frequently used as a simple example with which to raise awareness of how to optimize code on modern processors. The reason is that the operation is simple to describe, challenging to fully optimize, and of practical importance. In this document, we walk the reader through the techniques that underlie the currently fastest implementations for CPU architectures.
The Basic Linear Algebra Subprograms (BLAS) form an interface for a set of linear algebra operations upon which higher level linear algebra libraries, such as LAPACK and libflame, are built. The idea is that if someone optimizes the BLAS for a given architecture, then all applications and libraries that are written in terms of calls to the BLAS will benefit from such optimizations.
The BLAS are divided into three sets: the level-1 BLAS (vector-vector operations), the level-2 BLAS (matrix-vector operations), and the level-3 BLAS (matrix-matrix operations). The last set benefits from the fact that, if all matrix operands are $ n \times n $ in size, $ O( n^3 ) $ floating point operations are performed with $ O( n^2 ) $ data so that the cost of moving data between memory layers (main memory, the caches, and the registers) can be amortized over many computations. As a result, high performance can in principle be achieved if these operations are carefully implemented.
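To make the amortization argument concrete, here is a rough back-of-the-envelope count (our illustration) for $C := A B + C$ with all operands $n \times n$: the operation performs about $2 n^3$ flops while touching only about $3 n^2$ matrix elements, so

$$ \frac{2 n^3 \ \text{flops}}{3 n^2 \ \text{elements}} = \frac{2n}{3} \ \text{flops per element}, $$

which means that each element brought into a cache or register can, in principle, be reused $O(n)$ times. This reuse is what a careful level-3 BLAS implementation exploits.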
In particular, matrix-matrix multiplication with double precision floating point numbers is supported by the BLAS with the (Fortran) call
dgemm( transa, transb, m, n, k, alpha, A, lda, B, ldb, beta, C, ldc )
which, by appropriately choosing transa and
transb, computes
$ C := \alpha A B + \beta C $, $ C := \alpha A^T B + \beta C $, $ C := \alpha A B^T + \beta C $, or $ C := \alpha A^T B^T + \beta C $.
In our exercises, we consider the simplified version of this operation,
$ C := A B + C $,
where $ C $ is $ m \times n $, $ A $ is $ m \times k $, and $ B $ is $ k \times n $.
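By way of a baseline, the sketch below shows a straightforward triple-loop implementation of this simplified operation in C. It assumes column-major storage with leading dimensions, as in the BLAS interface above; the name dgemm_ref is ours and is not part of the BLISlab code itself.

```c
/* Reference (unoptimized) implementation of C := A B + C.
   A is m x k, B is k x n, C is m x n, all stored in column-major
   order with leading dimensions lda, ldb, ldc (BLAS convention). */
void dgemm_ref( int m, int n, int k,
                const double *A, int lda,
                const double *B, int ldb,
                double       *C, int ldc )
{
    for ( int j = 0; j < n; j++ )              /* columns of C and B   */
        for ( int p = 0; p < k; p++ )          /* the "third" dimension */
            for ( int i = 0; i < m; i++ )      /* rows of C and A      */
                C[ i + j * ldc ] += A[ i + p * lda ] * B[ p + j * ldb ];
}
```

The exercises start from code of essentially this form and successively restructure it to expose data reuse in the caches and registers.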
The intricacies of high-performance implementations are such that implementation of the BLAS in general, and gemm in particular, was often relegated to unsung experts who developed numerical libraries for the hardware vendors, for example as part of IBM's ESSL, Intel's MKL, Cray's LibSci, and AMD's ACML libraries. These libraries were typically written (at least partially) in assembly code and highly specialized for a specific processor.
A key paper [@IBM:P2] showed how an "algorithms and architectures" approach to hand-in-hand designing architectures, compilers, and algorithms allowed BLAS to be written in a high level language (Fortran) for the IBM Power architectures and explained the intricacies of achieving high performance on those processors. The Portable High Performance ANSI C (PHiPAC) [@PHiPAC97] project subsequently provided guidelines for writing high-performance code in C and suggested how to autogenerate and tune gemm implementations written this way. The Automatically Tuned Linear Algebra Software (ATLAS) [@ATLAS; @ATLAS_journal] built upon these insights and made autotuning and autogeneration of BLAS libraries mainstream.
As part of this document we discuss more recent papers on the subject, including the paper that introduced the Goto approach to implementing gemm [@Goto:2008:AHP] and the BLIS refactoring of that approach [@BLIS1], as well as other papers that are of more direct relevance.
There are others who have put together exercises based on gemm. Recent efforts relevant to this paper are by Michael Lehn at Ulm University and a wiki on optimizing gemm that we ourselves put together.
The purpose of this paper is to guide you towards high-performance implementations of gemm. Our ulterior motive is that our BLIS framework for implementing BLAS requires a so-called micro-kernel to be highly optimized for various CPUs. In teaching you the basic techniques, we are hoping to identify "The One" who will contribute the best micro-kernel. Think of it as our version of "HPC's Got Talent". Although we focus in our description on optimization for the Intel Haswell architecture, the setup can be easily modified to instead help you (and us) optimize for other CPUs. Indeed, BLIS itself supports architectures that include AMD and Intel's x86 processors, IBM's Power processors, ARM processors, and Texas Instruments DSP processors [@BLIS2; @BLIS3; @BLIS-TI].
The remaining sections are:
- Step 1
- Step 2
- Step 3
- Step 4
- Conclusion
- Links
We thank the other members of the Science of High-Performance Computing (SHPC) team for their support. This research was partially sponsored by the National Science Foundation grant ACI-1148125/1340293 and Intel through its funding of the SHPC group as an Intel Parallel Computing Center.
Any opinions, findings and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation (NSF).