\documentclass[smallcondensed,final]{svjour3}
\usepackage{algorithm}
\usepackage{algorithmicx}
\usepackage[noend]{algpseudocode}
\usepackage{amsmath}
\usepackage{amsfonts,amssymb}
\usepackage{calc}
\usepackage[labelfont=bf,labelsep=quad]{caption} % bold fig in captions
\usepackage{subfig}
\usepackage{textcomp}
\usepackage{xspace}
%% I don't want to use dvips so let's load them again
\usepackage{graphicx}
\usepackage[pdftex]{hyperref}
% CUSTOM COMMAND DEFINITIONS
\newcommand{\R}{\mathbb{R}}
\newcommand{\Z}{\mathbb{Z}}
\newcommand{\K}{\mathbb{K}}
\newcommand{\cpu}{\textsc{cpu}}
\newcommand{\gpu}{\textsc{gpu}}
\newcommand{\bem}{\textsc{bem}\xspace}
\newcommand{\fmm}{\textsc{fmm}\xspace}
\newcommand{\fmmbem}{\fmm-\bem}
\newcommand{\cpp}{C$^{++}$}
\newcommand{\blas}{\textsc{blas}}
\newcommand{\sse}{\textsc{sse}}
% CURLY LETTERS
\newcommand{\bigO}{\mathcal{O}}
\renewcommand{\O}[1]{\mathcal{O}(#1)}
% FMM OPERATOR DEFINITIONS
\newcommand{\ptom}{\textsc{p}\texttwooldstyle\textsc{m}\xspace} % P2M
\newcommand{\ltop}{\textsc{l}\texttwooldstyle\textsc{p}\xspace} % L2P
\newcommand{\mtop}{\textsc{m}\texttwooldstyle\textsc{p}\xspace} % M2P
\newcommand{\mtom}{\textsc{m}\texttwooldstyle\textsc{m}\xspace} % M2M
\newcommand{\mtol}{\textsc{m}\texttwooldstyle\textsc{l}\xspace} % M2L
\newcommand{\ltol}{\textsc{l}\texttwooldstyle\textsc{l}\xspace} % L2L
\newcommand{\ptop}{\textsc{p}\texttwooldstyle\textsc{p}\xspace} % P2P
% MISC THINGS
\newcommand{\ncrit}{N_{\text{CRIT}}}
\newcommand{\pmin}{p_{\text{min}}}
\newcommand{\tsolve}{t_{\text{solve}}}
\newcommand{\mac}{\textsc{mac}}
% SOLVER DEFINITIONS
\newcommand{\cg}{\textsc{cg}}
\newcommand{\gmres}{\textsc{gmres}\xspace}
\newcommand{\fgmres}{\textsc{fgmres}\xspace}
\newcommand{\bicgstab}{\textsc{bicgstab}\xspace}
% the text 'd' for integrals
\newcommand{\di}[1]{\text{d}#1}
% partial derivatives (frac)
\newcommand{\partiald}[2]{\frac{\partial #1}{\partial #2}}
% partial derivatives (inline)
\newcommand{\partialdi}[2]{\partial #1 / \partial #2}
% define a vector - undertilde
%\newcommand{\vect}[1]{\utilde{#1}}
% - bold
\newcommand{\vect}[1]{\mathbf{#1}}
% \hat{n}
\newcommand{\nhat}{\hat{\mathbf{n}}}
% curly L
\renewcommand{\L}{\mathcal{L}}
% curly D
\newcommand{\D}{\mathcal{D}}
% sign
\newcommand{\sign}{\text{sign}}
% basis vectors
\newcommand{\e}{\vect{e}}
% dyadic product
\newcommand{\dyad}[2]{#1 \otimes #2}
\graphicspath{{figs/}} % PATH to figure files-- change to ./ for submission
\journalname{Advances in Computational Mathematics}
\begin{document}
\title{Inexact GMRES iterations and relaxation strategies with fast-multipole boundary element method \thanks{
This work was partially supported by the National Science Foundation under award ACI-1149784.}
}
\author{%
Tingyu Wang \and Simon K. Layton \and Lorena A. Barba
}
\institute{
Tingyu Wang \at Department of Mechanical and Aerospace Engineering, The George Washington University, Washington, DC
\and
Simon K. Layton \at Nvidia, Corp., Santa Clara, CA
\and
Lorena A. Barba \at Department of Mechanical and Aerospace Engineering, The George Washington University, Washington, DC \\
\email{labarba@gwu.edu} \\
}
\date{Received: date / Accepted: date}
% The correct dates will be entered by the editor
\maketitle
\begin{abstract}
Boundary element methods produce dense linear systems whose solution can be accelerated via multipole expansions. When such systems are solved with Krylov methods, the matrix-vector product within each iteration is computed with some error, at an accuracy controlled by the order of the expansion, $p$. We take advantage of a unique property of Krylov iterations that allows lower accuracy of the matrix-vector products as convergence proceeds, and propose a relaxation strategy based on progressively decreasing $p$. In extensive numerical tests of the relaxed Krylov iterations, we obtained speed-ups of between $1.5\times$ and $2.3\times$ for Laplace problems and between $2.7\times$ and $3.3\times$ for Stokes problems. We include an application to Stokes flow around red blood cells, computing with up to 64 cells and problem size up to 131k boundary elements and nearly 400k unknowns. The study was done with an in-house multi-threaded C++ code, on a hexa-core CPU. The code is available on its version-control repository, \url{https://github.com/barbagroup/fmm-bem-relaxed}, and we share reproducibility packages for all results in \url{https://github.com/barbagroup/inexact-gmres/}.
\keywords{boundary integral equation \and boundary element method \and collocation method \and fast multipole method \and iterative solvers \and Krylov methods \and Stokes flow}
\subclass{35Q35 \and 35Q99 \and 45B05 \and 76D07 \and 76Z99}
\end{abstract}
\section{Introduction}
The boundary integral formulation is popular in applied mechanics for solving problems governed by the equations of Laplace, Stokes, Helmholtz and Lam{\'e}. Applications include electrostatics, low-Reynolds-number flow, acoustics and linear elasticity and span all scales from biomolecules, micro-electromechanical and biomedical devices to wind turbines, submarines and problems in aerospace and geodynamics. The key is formulating the governing partial differential equations in equivalent boundary integral equations, and discretizing over the boundaries.
The two ways to derive a discrete linear system from boundary integral equations are collocation and Galerkin methods.
In this paper, we use a collocation boundary element method (\bem).
The approach reduces the dimensionality of the problem (three-dimensional problems are solved on two-dimensional surfaces), but generates dense systems of algebraic equations.
The computational complexity of dense solvers, scaling as $\O{N^3}$ for direct methods and $\O{N^2}$ for iterative methods, frustrated large-scale applications of \bem\ until the late 1990s, when researchers began working out how to incorporate fast algorithms
\cite{Nishimura2002,Liu2006}. Treecodes and fast multipole methods (\fmm) reduce the complexity of \bem\ solutions to $\O{N \log N}$ and $\O{N}$, respectively, although often with serious programming effort. Nevertheless, large-scale \bem\ simulations are now possible, especially given the parallel scalability of the \fmm \cite{YokotaETal2011a,YokotaBarba2011a}.
Accelerating \bem\ solutions with \fmm\ hinges on seeing the dense matrix-vector products done within each iteration of the linear solver as $N$-body problems. Gauss quadrature points on source panels and collocation points on target panels interact via the Green's function of the governing equation, similar to how charges, particles or masses interact under electrostatic or gravitational potentials. In the same way, far-away sources can be clustered together and represented by series expansions to calculate their contribution at a target point with controllable accuracy. As a result, the matrix-vector products are computed with some error. To ensure convergence of the iterative solver, or achieve a desired accuracy in the solution, we might require the order of the series expansions, $p$, to be sufficiently large. But, as heuristic observations show, \gmres and other Krylov methods have the surprising behavior of requiring high accuracy only in early iterations, while accuracy requirements can be relaxed in later iterations. This property offers the opportunity to use different values of $p$ in the \fmm\ as the iterations progress to convergence, reducing the total time to solution.
This paper presents a relaxation strategy for fast-multipole boundary element methods consisting of decreasing the order of expansion $p$ as Krylov iterations progress in the solver. We tested extensively using the Laplace and Stokes equations, and also with an application of Stokes flow around red blood cells. The aim of the study was to show that a \bem\ solution with inexact Krylov iterations converges, despite low-accuracy matrix-vector multiplications at later iterations, and to find out the speed-ups that can be obtained. We wrote an in-house multi-threaded code in \cpp\ that enables experimentation in a variety of scenarios.\footnote{The code is available for reproduction of our results at \url{https://github.com/barbagroup/fmm-bem-relaxed}.}
%% METHODS
\section{Methods for the integral solution of elliptic equations using inexact {\small GMRES}}
\subsection{Boundary-integral solution of the Laplace equation}
To write the Laplace equation, $\nabla^{2}\phi(\vect{x}) = 0$, in its integral formulation, we use the classical procedure of multiplying by the Green's function and integrating, applying the divergence theorem of Gauss and the chain rule, then dealing with singularities by a limiting process. This results in
%
\begin{equation}\label{eqn:laplace_bem_final}
\frac{1}{2}\phi + \int_{\Gamma} \phi\partiald{G}{\nhat}\;\di{\Gamma} = \int_{\Gamma}\partiald{\phi}{\nhat}G\;\di{\Gamma},
\end{equation}
\noindent where $G = 1/4\pi r$ is the free-space Green's function for the Laplace equation ($\nabla^{2}G = -\delta$), $\partiald{\cdots}{\nhat}$ represents the partial derivative in the direction normal to the boundary surface, and the (Cauchy principal-value) integrals are on the boundary $\Gamma$ of the domain. (The details of the derivation can be found in Ref. \cite{BrebbiaDominguez1992}.) The boundary element method (\bem) consists of discretizing the boundary into surface panels and enforcing Equation \eqref{eqn:laplace_bem_final} on a set of target points (collocation version). In a typical \bem, surface panels take a constant value $\phi_j$, and the surface integrals become sums over $N$ flat surface elements, $\Gamma_j$, resulting in the following discretized equation:
%
\begin{equation}
\frac{1}{2}\phi_i = \sum_{j=1}^{N} \partiald{\phi_j}{\nhat_j}\;\int_{\Gamma}G_{ij}\di{\Gamma_j} - \sum_{j=1}^{N} \phi_j\int_{\Gamma}\partiald{G_{ij}}{\nhat_j}\;\di{\Gamma_j}.
\end{equation}
The values of the potential or its normal derivative on each panel are known from boundary conditions, resulting in either first-kind or second-kind integral equations. Finding the remaining unknowns requires solving a system of linear equations $A\vect{x}=\vect{b}$. The elements of the coefficient matrix are
%
\begin{equation} \label{eqn:laplace_matrix}
A_{ij} =
\begin{cases}
\int_{\Gamma} G_{ij}\;\di{\Gamma_j}, & \phi\;\text{given on panel}\;j \\
\int_{\Gamma} \partiald{G_{ij}}{\nhat_j}\;\di{\Gamma_j} + \frac{1}{2} I_{ij}, & \partiald{\phi}{\nhat}\;\text{given on panel } j
\end{cases}
\end{equation}
\noindent
where $I$ is an identity matrix. The right-hand side vector $\vect{b}$ is formed with the known terms: e.g., if $\phi$ is given on panel $j$, then $\phi_j\int_{\Gamma_j}\partialdi{G_{ij}}{\nhat_j}\;\di{\Gamma_j}$ will appear in the term $b_i$.
Expressing $G_{ij}$ and $\partialdi{G_{ij}}{\nhat_j}$ in terms of $1/r$ and $\nhat_j\cdot\nabla(1/r)$ gives
%
\begin{eqnarray}
\label{eqn:laplace_bem_G}\int_{\Gamma} G_{ij}\;\di{\Gamma_j} & = & \frac{1}{4\pi}\int_{\Gamma} \frac{1}{|\vect{x}_i-\vect{x}_j|} \;\di{\Gamma_j} \\
\label{eqn:laplace_bem_dGdn}\int_{\Gamma} \partiald{G_{ij}}{\nhat_j}\;\di{\Gamma_j} & = & \frac{1}{4\pi}\int_{\Gamma}\frac{d\vect{x}\cdot\nhat_j}{|\vect{x}_i-\vect{x}_j|^{3}}\;\di{\Gamma_j}
\end{eqnarray}
The next steps are to apply an appropriate numerical integration scheme in order to generate all the terms of the coefficient matrix, and subsequently solve the linear system of equations, as described below.
\subsection{Boundary-integral solution of the Stokes equation}
The Stokes equation for a flow at very low Reynolds number, $\mu\nabla^{2}\vect{u} = \nabla p$ (where $\mu$ is the viscosity and $p$ the pressure), can be rewritten in its integral formulation by means of a process similar to that described above for the Laplace equation. But it is a vector equation and its fundamental solutions are tensors. The boundary integral form of the Stokes equation is
\begin{equation}
\label{eqn:stokes_bem_12}
\frac{1}{2}u_j(\vect{x_0}) = -\frac{1}{8\pi\mu}\int_{\Gamma} t_i(\vect{x})G_{ij}(\vect{x},\vect{x}_0)\;\di{\Gamma} + \frac{1}{8\pi} \int_{\Gamma} u_i(\vect{x})T_{ijk}(\vect{x},\vect{x}_0)n_k(\vect{x})\;\di{\Gamma}.
\end{equation}
\noindent where $\vect{u}$ is the velocity vector satisfying the Stokes equation (Einstein indicial summation implied), with $\sigma$ the corresponding stress tensor and $\vect{t} = \sigma\cdot\nhat$ the traction; vectors $\vect{x}$ and $\vect{x}_0$ are two distinct points on the boundary; and $\vect{G}$ and $\vect{T}$ are the stokeslet and stresslet fundamental solutions:
%
\begin{equation}
\label{eqn:stokeslet}
G_{ij}(\vect{x},\vect{y}) = \frac{\delta_{ij}}{r} + \frac{(x_i-y_i)(x_j-y_j)}{r^{3}}
\end{equation}
\begin{equation}
\label{eqn:stresslet}
T_{ijk}(\vect{x},\vect{y}) = - 6\frac{(x_i-y_i)(x_j-y_j)(x_k-y_k)}{r^{5}}.
\end{equation}
Indices $i, j, k$ denote here the Cartesian-tensor components and $\delta_{ij}$ is the Kronecker delta. Discretizing the boundary with $N$ surface panels results in sums that we now number with the index $J$.
The discretized form with constant surface panels becomes
%
\begin{equation}
\label{eqn:stokes_bem_discretized}
\frac{1}{2}u_j(\vect{x_0}) = -\frac{1}{8\pi\mu}\sum_{J=1}^{N}t_i\int_{\Gamma} G_{ij}(\vect{x}_J, \vect{x}_0)\;\di{\Gamma_J} + \frac{1}{8\pi} \sum_{J=1}^{N}u_i\int_{\Gamma} T_{ijk}(\vect{x}_J, \vect{x}_0)\cdot n_k\;\di{\Gamma_J}.
\end{equation}
\subsection{Numerical, semi-analytical and analytical integration methods}
The boundary integral formulations all demand that we compute integrals of the type $\int_{\Gamma} \K_{ij}\;\di{\Gamma_j}$, where $\K_{ij}=\K(\vect{x}_j-\vect{x}_i)$ is the kernel, the point $\vect{x}_j$ is on the panel surface $\Gamma_j$ and the point $\vect{x}_i$ is a target or evaluation point. Because the kernel $\K$ is often singular, we need specific approaches depending on the distance $|\vect{x}_j-\vect{x}_i|$. Where the target point is far enough from the surface $\Gamma_j$, simple quadrature with a few Gauss points will suffice. As the target point \emph{nears} the source panel (with the definition of ``near'' to be determined), we need high-accuracy quadrature. Finally, in the case where the target point is on the source panel, the integral is (close to) singular and we must use analytical or semi-analytical methods. Figure \ref{fig:integration_domain} illustrates the three situations.
\begin{figure}
\begin{centering}
\includegraphics[width=0.7\textwidth]{IntegrationDomain.pdf}
\caption{Depending on the distance between a source panel and target point, we use different numerical, analytical or semi-analytical integration methods, balancing computational efficiency and accuracy near singularities. The threshold distance between far and near-singular regions is $2\sqrt{2 S_j}$, where $S_j$ is the surface area of the source panel.}
\label{fig:integration_domain}
\end{centering}
\end{figure}
Applying Gauss quadrature to the integrals appearing in the coefficients of the boundary element discretization of the Laplace equation, \eqref{eqn:laplace_bem_G} and \eqref{eqn:laplace_bem_dGdn}, for example, gives
%
\begin{eqnarray}
\label{eqn:gauss:1st-kind}
\int_{\Gamma} G(\vect{x}_i,\vect{x}_j)\;\di{\Gamma_j} & \approx & \frac{1}{4\pi} \sum_{k=1}^{K} q_k\cdot S_j\cdot \frac{1}{|\vect{x}_i-\vect{x}_k|},\;\;\vect{x}_k \in \Gamma_j, \\
\label{eqn:gauss:2nd-kind}
\int_{\Gamma} \partiald{G(\vect{x}_i,\vect{x}_j)}{\nhat_j}\;\di{\Gamma_j} & \approx & \frac{1}{4\pi} \sum_{k=1}^{K}q_k\cdot S_j\cdot \frac{d\vect{x}\cdot\nhat_j}{|\vect{x}_i-\vect{x}_k|^{3}},\;\;\vect{x}_k \in \Gamma_j,
\end{eqnarray}
\noindent
with $q_k$ the area-normalized Gauss quadrature weights and $S_j$ the surface area of panel $\Gamma_j$. To control the accuracy of the numerical integration, we vary the number of quadrature points, using for example $K= 4$ for targets that are far from the source panel and $K\approx 20$ for near-singular situations.
The near-singular region is within a distance of $2\sqrt{2 S_j}$, a criterion that we settled on after testing several choices based on a panel's characteristic length scale.
When the target point is on the source panel, the standard approach is to use analytical or semi-analytical methods for the singular and hyper-singular integrals over the panels.
For the Laplace equation, we used a semi-analytical method.
It applies the technique first presented in the classic work of Hess and Smith \cite[p.~49, ff.]{HessSmith1967} for decomposing the integral into a sum of integrals over triangles formed by the projection of the target point on the panel plane, and the panel vertices. Using polar coordinates, the integration over the radial component can be done analytically and the integration over the angular component is done by quadrature. % Also in \cite{ZhuHuangSongWhite2001}
Several analytic integration techniques are at our disposal for dealing with the singular integrals from boundary element methods. Explicit expressions for these integrals over flat triangular domains result in recursive formulae on the edges of the integration triangle. These are available for Laplace potentials \cite{Fata2009} and linear elastic surface potentials \cite{Fata2011}.
We obtained the analytic integrals for the Stokes equation from Fata's formulas for linear elasticity, after setting the Poisson ratio to $1/2$.
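As a concrete illustration, the choice among the three integration approaches described above reduces to a simple dispatch on the target-panel distance. The following is a minimal sketch in \cpp, not our production code; the names are illustrative and only the thresholds stated above are taken from our implementation:
\begin{verbatim}
// Sketch of the integration-method dispatch; names are illustrative.
#include <cmath>

enum class Integration { FarGauss,       // few Gauss points (e.g., K = 4)
                         NearGauss,      // high-order quadrature (K ~ 20)
                         SemiAnalytic }; // target lies on the source panel

// dist: distance from the target point to the source panel;
// area: surface area S_j of the source panel.
Integration choose_integration(double dist, double area, bool on_panel) {
  if (on_panel) return Integration::SemiAnalytic;
  if (dist < 2.0 * std::sqrt(2.0 * area)) return Integration::NearGauss;
  return Integration::FarGauss;
}
\end{verbatim}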
\subsection{Krylov subspace methods}
For large linear systems of equations $A\vect{x}=\vect{b}$, direct solution is generally impractical and iterative solution methods are preferred. Krylov subspace methods derive from the Cayley-Hamilton theorem, from which the inverse of a (nonsingular) matrix $A$ can be expressed as a linear combination of its powers. The Krylov subspace of order $m$ is spanned by the products of the initial residual $r_0$ and powers of $A$: $K_{m}(A,r_0) = \text{span}\{ r_0, Ar_0, A^{2}r_0, ..., A^{m-1}r_0\}$.
Krylov methods include the conjugate gradient method, the biconjugate gradient stabilized method (\textsc{bicgstab}) and the generalized minimal residual method, \gmres \cite{SaadSchultz1986}, which we use. In boundary element applications, the greatest cost per iteration is the matrix-vector product (mat-vec), $w\gets A\cdot z_j$, taking $\O{N^{2}}$ time in a direct implementation. However, given the structure of the coefficient matrix in boundary element methods, this operation can be reduced to $\O{N}$ time using, for example, the fast multipole method.
The key is understanding the matrix-vector product as an $N$-body problem. Let's consider the first-kind integral problem for the Laplace equation. In \eqref{eqn:laplace_matrix} we see that the matrix coefficients are $\int_{\Gamma} G_{ij}\,\di{\Gamma_j}$. Applying Gauss quadrature to obtain the coefficients gives \eqref{eqn:gauss:1st-kind}. Thus, the matrix-vector product gives the following for row $i$
%
\begin{equation}\label{eqn:matvec-onerow}
\sum_{j=1}^{N_p} \partiald{\phi_j}{\nhat_j}\; \sum_{k=1}^{K} q_k\cdot S_j\cdot \frac{1}{|\vect{x}_i-\vect{x}_k|},\;\;\vect{x}_k \in \Gamma_j,
\end{equation}
\noindent
The inner sum is over the Gauss quadrature points (only a few per panel) and the outer sum is over the integration panels. We list the pseudocode of the full mat-vec in the Appendix as Algorithm \ref{alg:matvec}. Taking all the quadrature points together as the set of ``sources'' and all the collocation points on the panels as the set of ``targets,'' the algorithm is reduced to two for-loops instead of three, making the analogy with an $N$-body problem more clear.
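To make the two-loop structure concrete, here is a minimal sketch (in \cpp, with illustrative names only; this is not our production code) of the direct $\O{N^{2}}$ evaluation, where the weight $w_j$ of each quadrature point already folds together $q_k$, $S_j$ and the panel strength:
\begin{verbatim}
// Sketch of the direct O(N^2) mat-vec as an N-body sum:
// sources = all Gauss quadrature points, targets = collocation points.
#include <cmath>
#include <cstddef>
#include <vector>

struct Point { double x, y, z; };

std::vector<double> matvec_direct(const std::vector<Point>& targets,
                                  const std::vector<Point>& sources,
                                  const std::vector<double>& w) {
  constexpr double kPi = 3.141592653589793;
  std::vector<double> phi(targets.size(), 0.0);
  for (std::size_t i = 0; i < targets.size(); ++i)      // collocation points
    for (std::size_t j = 0; j < sources.size(); ++j) {  // all Gauss points
      double dx = targets[i].x - sources[j].x;
      double dy = targets[i].y - sources[j].y;
      double dz = targets[i].z - sources[j].z;
      double r  = std::sqrt(dx*dx + dy*dy + dz*dz);
      if (r > 0.0) phi[i] += w[j] / (4.0 * kPi * r);    // Laplace kernel 1/(4 pi r)
    }
  return phi;
}
\end{verbatim}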
\subsection{Fast multipole method}
Fast multipole methods were invented to accelerate the solution of $N$-body problems, that is, problems seeking to determine the motion of $N$ bodies that interact with each other via a long-distance effect (like electrostatics or gravitation). A direct approach to such a problem takes $\O{N^{2}}$ time to compute. The first \emph{fast} algorithms for $N$-body problems \cite{Appel1985,BarnesHut1986} combined two ideas: (1) approximating the effect of groups of distant bodies with a few moments (of the charges or masses), and (2) using a hierarchical sub-division of space to determine the acceptable distances to apply these approximations.
These ideas produced the treecode algorithm, with $\bigO(N\log N)$ time to compute.
The fast multipole method \cite{GreengardRokhlin1987} introduces a third key idea that leads to $\bigO(N)$ scaling: allowing groups of distant bodies to interact with \emph{groups} of targets, by means of a mathematical representation called local expansion.
A typical $N$-body problem evaluates a potential $\phi$ on $i=1, 2, \cdots, N$ bodies
using the following expression
%
\begin{equation}\label{eqn:nbody}
\phi_{i} = \sum_{j=1}^{N} m_{j}\cdot\K(\vect{x}_{i},\vect{y}_{j}) \; = \; \sum_{j=1}^{N}\K_{ij}m_{j},
\end{equation}
\noindent where $\K_{ij} = \K(\vect{x}_{i},\vect{y}_{j})$ is referred to as the \emph{kernel}, and the potential is a solution of an elliptic equation, e.g., the Poisson equation for gravitation. The expression in \eqref{eqn:nbody} is analogous to that for one row of the \bem\ mat-vec in Equation \eqref{eqn:matvec-onerow}, taking all $N_p \cdot K$ Gauss points collectively.
The first step in the \fmm\ acceleration of \bem\ is to group the Gauss quadrature points (i.e., the ``sources'') into clusters, and represent their influence via multipole expansions at the cluster centers. If using Taylor expansions, for example, truncated to the first $p$ terms, the potential at a point is approximated by
%
\begin{equation}
\phi(\vect{x}_i) \approx \sum_{||\vect{k}||=0}^{p}\frac{1}{\vect{k}!}D^{\vect{k}}_{\vect{y}} \K(\vect{x}_i,\vect{y}_c)\, \sum_{j=1}^{n_c} m_j (\vect{y}_j-\vect{y}_c)^{\vect{k}},
\label{eqn:cartesian_multipole}
\end{equation}
\noindent where $\vect{k}=\{k_1, k_2, k_3\}$ is a multi-index, $\vect{k}! = k_1!k_2!k_3!$, $\vect{y}^{\vect{k}} = y_1^{k_1}y_2^{k_2}y_3^{k_3}$, $D_{\vect{y}}^{\vect{k}} = D^{k_1}_{y_1}D^{k_2}_{y_2}D^{k_3}_{y_3}$ is the derivative operator and $p$ is the order of the expansion. The right-most terms are the multipoles, i.e., powers of distances with respect to the cluster center $\vect{y}_c$, and they can all be pre-calculated for a set of sources. Forming the clusters consists of recursively sub-dividing the spatial domain until a limit number of sources $\ncrit$ remains in each box, resulting in an adaptive octree structure.
The step of computing the multipole expansions for each source cluster at the deepest level of the tree is referred to as the particle-to-multipole operation, {\ptom}. These multipole expansions are then translated and added to represent larger clusters (at higher levels of the tree) in the multipole-to-multipole operation, {\mtom}. The process just described is called the \emph{upward sweep}.
At this point, a treecode algorithm evaluates the potential on target points, performing multipole-to-particle operations, {\mtop}---in the \bem, the multipoles represent clusters of quadrature points and the potential is evaluated on collocation points, as illustrated in Figure \ref{fig:rbc_fmmbox}. As implied in Equation \eqref{eqn:cartesian_multipole}, this is an approximation; and as implied in Figure \ref{fig:rbc_fmmbox}, the approximation is acceptable for remote target points only. The parameter that dictates whether we make the approximation is the \emph{multipole acceptance criterion}, \mac, denoted by $\theta$ and enforced as the maximum allowed ratio between cluster size and distance between targets and cluster center. When the \mac\ is not satisfied, sources and targets interact directly via \eqref{eqn:nbody} (called particle-to-particle operation, \ptop). The key to achieving the optimal $\bigO(N)$ scaling is the use of local expansions, representing a group of target points; thus, the \fmm\ adds three operations to the algorithm: the multipole-to-local transformation (\mtol), the local-to-local translation (\ltol) and the local-to-particle evaluation (\ltop).
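For reference, the multipole acceptance criterion amounts to a one-line geometric test. The sketch below uses illustrative names and takes the cluster size to be its radius; $\theta = 0.5$ is the value used throughout this study:
\begin{verbatim}
// Sketch of the multipole acceptance criterion (MAC): a source cluster of
// radius a centered at yc may act on a target x through its expansion
// only if a / |x - yc| < theta; otherwise we fall back to direct P2P.
#include <cmath>

struct Vec3 { double x, y, z; };

bool mac_accepts(const Vec3& x, const Vec3& yc,
                 double cluster_radius, double theta = 0.5) {
  double dx = x.x - yc.x, dy = x.y - yc.y, dz = x.z - yc.z;
  double r  = std::sqrt(dx*dx + dy*dy + dz*dz);
  return cluster_radius < theta * r;
}
\end{verbatim}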
\begin{figure}
\begin{center}
\includegraphics[width=0.7\textwidth]{redbloodcell-panel-fmm.pdf}
\caption{First step in viewing the influence of boundary elements as an $N$-body problem accelerated by the fast multipole method: Gauss quadrature points in a region of the surface triangulation are grouped together in a cluster, and their contribution to the potential at a remote target point (white circle) is considered via a series expansion at the cluster center.}
\label{fig:rbc_fmmbox}
\end{center}
\end{figure}
The Cartesian-expansion \mtol\ operation scales as $\O{p^{6}}$, which quickly becomes expensive when requiring higher accuracy via higher $p$.
With spherical-harmonic expansions, \mtol scales as $\O{p^{4}}$ without rotation, or $\O{p^{3}}$ with rotation \cite{ChengETal1999}. The break-even point between the two types of \mtol operator occurs only at high $p$, however \cite{white1996rotating}. In this work, we use the $\O{p^{4}}$ operator, given that we operate at low $p$ most of the time.
For the sake of brevity, we omit the details and write the final form of the spherical expansions of the Laplace potential---denoted here by $\Phi(\vect{x}_i)$ to distinguish it from the spherical coordinate $\phi$---as follows
\begin{eqnarray}
\Phi(\vect{x}_i) & = & \sum_{n=0}^{p}\sum_{m=-n}^{n}\frac{Y^{m}_n(\theta_i,\phi_i)}{r_i^{n+1}}\underbrace{\left \{ \sum_j^{N}q_j\rho^{n}_jY^{-m}_n(\alpha_i,\beta_i)\right \} }_{M^{m}_n} \, \text{and}\\
\Phi(\vect{x}_i) & = & \sum_{n=0}^{p}\sum_{m=-n}^{n}r_i^{n}Y^{m}_n(\theta_i,\phi_i)\underbrace{\left \{ \sum_j^{N}q_j\frac{Y^{-m}_n(\alpha_i,\beta_i)}{\rho^{n+1}_j}\right \} }_{L^{m}_n}.
\end{eqnarray}
\noindent
Here, $M^{m}_n$ and $L^{m}_n$ denote the multipole and local expansion coefficients; $Y_{n}^{m}$ is the spherical harmonic function; $(r,\theta,\phi)$ and $(\rho,\alpha,\beta)$ are the spherical coordinates of the vectors from the center of the expansion to the points $\vect{x}_i$ and $\vect{x}_j$, respectively; and $q_j$ are the source weights. For our purposes, we just want to make clear the parallel between a fast $N$-body algorithm and a fast \bem\ matrix-vector product via this necessarily abbreviated summary of the \fmm.
\subsection{Inexact Krylov iterations and relaxation strategies}
Accelerating a \bem\ solution with \fmm\ implies computing the matrix-vector products with some error, and the parameter controlling this accuracy is the order of multipole expansions, $p$.
Two natural questions arise: how large does the value of $p$ need to be to ensure that the iterative solver still converges, and what is the impact of the error in the computed mat-vec on the accuracy of the converged solution?
We might expect to choose that value of $p$ that ensures convergence and desired accuracy, and apply it evenly for all iterations.
But heuristically, it turns out that \gmres has an interesting behavior: early iterations need to be computed with high accuracy, but accuracy requirements can be \emph{relaxed} for later iterations.
With a $\O{p^{4}}$ scaling in the \mtol operator, this property of Krylov methods could offer the potential for further accelerating the \bem\ solution.
Bouras and Frayss{\'e} \cite{bouras2000relaxation,bourasfraysse2005} studied the effect of inexact Krylov iterations on convergence and accuracy via numerical experiments.
They found that if the system matrix is perturbed, so we are computing $(A+\Delta A_k)\vect{z}$ on each iteration, and the perturbation stays nearly equal in norm to $\eta \|A\|$, then the computed solution will have an error of the same order, $\eta$. This is the situation when simply computing the mat-vec within the limits of machine precision. But they also showed the more surprising result that the magnitude of the perturbations $\Delta A_k$ can be allowed to grow as the iterations progress.
They define a relaxation strategy whereby, for a desired final tolerance $\eta$ in the solution of the system, the mat-vecs in each iteration are computed with a coefficient matrix perturbed with $\Delta A_k$, where if $r_k$ is the absolute residual at step $k$,
%
\begin{equation}\label{eqn:matrix-perturbation}
\|\Delta A_k\| = \varepsilon_k\|A\|, \quad \varepsilon_k=\min\left( \frac{\eta}{\min(\|r_{k-1}\|,1)}, 1\right).
\end{equation}
\noindent The matrix perturbation is never smaller than the target tolerance $\eta$, and $\varepsilon_k$ increases as the residual decreases (without surpassing $1$). In other words, the accuracy of the system mat-vecs is \emph{relaxed} as iterations proceed. This relaxation strategy led to converged solutions with \gmres, \textsc{cg} and \textsc{bicgstab}, using a variety of test matrices---often, and remarkably, in about the same number of iterations as a non-relaxed solution. In sum, through empirical results from previous studies, Krylov methods proved to be robust to inexact mat-vecs, and only the first Krylov vectors need to be computed accurately.
The numerical evidence is also supported by theoretical studies \cite{simonciniszyld2003,vandeneshofsleijpen2004}.
In the fast multipole method, we have error bounds available for the approximations made in the various operations that make up the algorithm. Using the spherical-harmonics expansion for the Laplace kernel, for example, the error is bounded as follows
%
\begin{equation}\label{eqn:multipole_error}
\left | \phi(r, \theta, \phi) - \sum_{n=0}^{p}\sum_{m=-n}^{n}\frac{M^{m}_{n}}{r^{n+1}}\cdot Y^{m}_{n}(\theta, \phi) \right | \leq \frac{\sum_{i=1}^{N}q_{i}}{r-a}\left ( \frac{a}{r} \right )^{p+1},
\end{equation}
\noindent where $a$ is the cluster radius and $r$ is the distance between the multipole center and the target. The above inequality is given in Ref. \cite[p.~55]{greengard1987} with label (3.38), and proved using the triangle inequality. This reference also gives similar bounds for the translation and evaluation operators. We used $\theta_{\text{MAC}} = 0.5$ throughout this study, so the closest that a target can be to the center of expansion of sources acting on it is $2a$, i.e., $r/a \geq 2$. In Equation \eqref{eqn:multipole_error}, we replace $a/r$ with $1/2$ (the worst case), equate with \eqref{eqn:matrix-perturbation} and solve for $p$ by applying a logarithm:
\begin{equation}\label{eqn:fmm_p}
p_k \sim \lceil -\log_{2}(\varepsilon_k) \rceil,
\end{equation}
\noindent where the base-$2$ logarithm follows from our multipole acceptance criterion, which guarantees $a/r \leq 1/2$. We thus have a relaxation heuristic that connects the allowed perturbation on the mat-vec (for the inexact Krylov iterations to converge) with the order of the multipole expansion in the \fmm at step $k$.
Although the \fmm error bounds are known to be rather loose, we use this approach for a conservative relaxation of $p$, and we expect that with experience and more detailed studies, the relaxation strategy could be pushed to give better speed-ups in specific applications.
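In practice, the relaxation heuristic of Equations \eqref{eqn:matrix-perturbation} and \eqref{eqn:fmm_p} amounts to a few lines of code before each iteration. The sketch below uses illustrative names, not those of our code; clamping to the initial order and the optional floor $\pmin$ (used later in the Stokes experiments) are implementation choices, not part of the heuristic itself:
\begin{verbatim}
// Sketch of the relaxed choice of expansion order before each iteration.
#include <algorithm>
#include <cmath>

// eta: solver tolerance; residual_norm: ||r_{k-1}|| from the previous
// iteration; returns the multipole order p_k for the next mat-vec.
int relaxed_order(double eta, double residual_norm,
                  int p_initial, int p_min = 1) {
  double eps_k = std::min(eta / std::min(residual_norm, 1.0), 1.0);
  int    p_k   = static_cast<int>(std::ceil(-std::log2(eps_k)));
  return std::max(p_min, std::min(p_k, p_initial));
}
\end{verbatim}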
%$ RESULTS
\section{Results and discussion}
Our results include numerical experiments with \bem\ solutions for the Laplace and Stokes equations, including an application to Stokes flow around red blood cells. The \bem\ solver is accelerated with a fast multipole method using spherical expansions, implemented in a code designed to use boundary elements (panels) as the sources and targets, and written in multi-threaded \cpp. To facilitate reproducibility of our research, the code is made available via its version-control repository,\footnote{\href{https://github.com/barbagroup/fmm-bem-relaxed}{https://github.com/barbagroup/fmm-bem-relaxed}} under an MIT license. The paper manuscript is also available via its repository,\footnote{\href{https://github.com/barbagroup/inexact-gmres}{https://github.com/barbagroup/inexact-gmres}} which includes running scripts and plotting scripts to produce the included figures (see each figure caption for details).
We ran all our tests on a lab workstation with an Intel Core i7-5930K hexa-core processor and 32GB RAM. Using a test with the Laplace kernel, we confirmed that our \fmm\ code scales as $\O{N}$ by timing several runs with increasing problem size; see Figure \ref{fig:fmm_scaling}. The performance of our code falls within the range of several major \fmm codes tested by Yokota \cite{Yokota2013-dualtree}, showing that the \fmm code used in this work performs competitively. Detailed results can be found in the supplementary Jupyter Notebook.\footnote{See the \href{https://github.com/barbagroup/inexact-gmres/blob/master/supplementary-materials/inexact-GMRES-result.ipynb}{``supplementary-materials''} folder in the paper repository.} We also found that the preconditioned cases take fewer iterations but more time to converge than the unpreconditioned ones, because the residuals after the first several iterations in the preconditioned cases are greater. For an inexact {\gmres} iteration, a bigger residual leads to a higher required $p$, which offsets the benefit of using fewer iterations. Therefore, we performed these tests without preconditioning.
The subsequent experiments investigate the use of a Krylov relaxation strategy by reducing the order of multipole expansions, $p$, as the iterations progress to convergence, determining the speed-up provided by such a strategy for different scenarios and problem sizes.
In each case, we chose the optimal $\ncrit$ (establishing the balance between near and far field in \fmm) to obtain the shortest solution time.
\begin{figure}
\begin{center}
\includegraphics[width=0.5\textwidth]{FMMScaling.pdf}
\caption{Scaling of the \fmm\ code with respect to problem size $N$, using a Laplace kernel with spherical expansions, with particles randomly distributed in a unit cube, $\theta_{\text{MAC}} = 0.5$, $p=5$ and optimal $\ncrit = 125$, obtaining 4 significant digits of accuracy for the force. The tests ran on a lab workstation with an Intel Core i7-5930K hexa-core processor, using six \cpu\ threads. The dotted line shows the expected $\O{N}$ scaling. Plotting script and figure available under CC-BY \cite{WangLaytonBarba2016-figshare1}.}
\label{fig:fmm_scaling}
\end{center}
\end{figure}
\subsection{Inexact {\small GMRES} for the solution of the Laplace equation}
\label{sec:inexactLaplace}
To start, we studied grid convergence by comparing numerical results with the analytical solution for a sphere with constant potential and charge on the surface: $\phi = \partialdi{\phi}{\nhat} = 1$. To make surface triangulations of a sphere with increasing refinement, we started with an 8-triangle closed surface, then recursively split each triangle into four smaller ones. Figure \ref{fig:glob_spheres} shows two example discretizations. We solved the boundary-element problem by collocation in both the first-kind and second-kind integral formulations, using a standard \gmres with fast-multipole-accelerated mat-vecs and the semi-analytical integrals for the singular terms. Figure \ref{fig:laplaceconvergence} shows the resulting convergence for both first-kind and second-kind formulations of the boundary element method on a sphere. To reveal the discretization error (even for the finest mesh with 131,072 panels), we first used a set of tight parameters: the order of spherical-harmonic expansions $p=20$, the number of Gauss points $K=13$ and the tolerance in the iterative solver $\eta=10^{-10}$. The convergence result is represented by the thicker lines in Figure \ref{fig:laplaceconvergence}. The observed order of convergence is 0.98 for the 1st-kind formulation and 1.01 for the 2nd-kind formulation, computed using the three middle points of each line.
Then we carefully slackened these parameters $p$, $K$, $\eta$ one at a time for each case (with and without relaxation, both first-kind and second-kind formulations) until accuracy dropped. Table \ref{tab:laplace_loose_params} summarizes the loosened parameters in the three larger cases,\footnote{We did not report the loose parameters for the three smaller cases because the solving time is too short (within 1 second) to demonstrate the speedup due to relaxation; we ran the smaller cases only to complete the convergence study.} along with the optimal $\ncrit$ which minimizes the execution time for each case. It is intuitive that the two larger cases in Table \ref{tab:laplace_loose_params} require a higher order ($p=10$) than the smaller one ($p=8$), since we have to use a more accurate \fmm-accelerated mat-vec to match a smaller discretization error. The thinner lines in Figure \ref{fig:laplaceconvergence} show the convergence results obtained with the loose parameters. This convergence analysis gives confidence in our \bem code, the singular/near-singular integral calculations, the far-field approximation using the \fmm, and the choice of parameters; above all, it shows that our relaxation strategy for \gmres does not degrade accuracy. Since the loose parameters are ``accurate enough'' and optimal, we used them in the following speedup tests.
\begin{figure}
\begin{center}
\subfloat[][128 panels]{\includegraphics[width=0.45\textwidth]{sphere128.pdf}\label{fig:sphere128}}%\qquad
\subfloat[][2048 panels]{\includegraphics[width=0.45\textwidth]{sphere2048.pdf}\label{fig:sphere2048}}
\caption{Triangular discretizations of a spherical surface.}
\label{fig:glob_spheres}
\end{center}
\end{figure}
%
\begin{figure}
\begin{center}
\includegraphics[width=1.0\textwidth]{LaplaceConvergence.pdf}
\caption{Convergence of 1st-kind (left plot) and 2nd-kind (right plot) solvers for the Laplace equation on a sphere, using a \gmres with \fmm-accelerated matrix-vector products, with $\theta_{\text{MAC}} = 0.5$. The vertical axis is the relative error of the potential evaluated at an exterior point with respect to the analytical solution, for a constant potential or charge on the surface: $\phi = \partialdi{\phi}{\nhat} = 1$. The tight parameters: $p=20$, $K=13$, $\eta=10^{-10}$; the loose parameters are listed in Table \ref{tab:laplace_loose_params}. Plotting script and figure available under CC-BY \cite{WangLaytonBarba2016-figshare2}.}
\label{fig:laplaceconvergence}
\end{center}
\end{figure}
\begin{table}[h]
\resizebox{\textwidth}{!}{
\begin{tabular}{c|ccccc|ccccc}
& \multicolumn{5}{c|}{Non-Relaxed} & \multicolumn{5}{c}{Relaxed}\\
N & $K$ & $p$ & $\eta$ & $N_{\text{CRIT}_\text{1st}}$ & $N_{\text{CRIT}_\text{2nd}}$ & $K$ & $p_{init}$ & $\eta$ & $N_{\text{CRIT}_\text{1st}}$ & $N_{\text{CRIT}_\text{2nd}}$\\
\hline
& & & & & & & & & &\\
8192 & 4 & 8 & $10^{-6}$ & 300 & 300 & 4 & 8 & $10^{-6}$ & 100 & 300\\
32768 & 4 & 10 & $10^{-6}$ & 400 & 400 & 4 & 10 & $10^{-6}$ & 100 & 300\\
131072 & 4 & 10 & $10^{-6}$ & 500 & 500 & 4 & 10 & $10^{-6}$ & 200 & 300\\
\end{tabular}
}
\caption{Loose parameters for the Laplace solver; the subscript on $\ncrit$ indicates the 1st-kind or 2nd-kind integral formulation.}
\label{tab:laplace_loose_params}
\end{table}%
Next, we examined how the residual changes as the \gmres iterations proceed and what value of $p$ is required in the \fmm-accelerated mat-vecs to continue converging, according to Equation \eqref{eqn:fmm_p}. We discretized a sphere with $32,768$ surface triangles and solved a first-kind integral equation using a solver tolerance of $10^{-6}$, with the initial value of $p$ set to 8 and to 10 in two separate runs (Figure \ref{fig:residualp}). As the residual gets smaller, the value of $p$ needed to maintain convergence of the solver drops, and a $p$ of just 2 is sufficient by the eleventh iteration for $p_{\text{initial}}=8$ and by the eighth iteration for $p_{\text{initial}}=10$. This offers the potential for substantial speed-ups in the calculations, because the translation operators of the \fmm scale as $\bigO(p^{4})$ for spherical harmonics. A more accurate first iteration results in a faster drop of the required $p$ and fewer iterations.
But we note that only the far-field evaluation can be sped up with the relaxation strategy, which means that the correct balance between near field and far field in the \fmm could change as we reset $p$ in the later iterations.
\begin{figure}
\centering
\includegraphics[width=1.0\textwidth]{LaplaceResidualIterations.pdf}
\caption{In a test solving a first-kind integral equation with relaxation, using a sphere discretized with $32,768$ triangles, with $K=4$, $p_\text{initial}=8$ (left plot), $p_\text{initial}=10$ (right plot) and optimal $\ncrit = 100$, the residual $\|r_{k}\|$ (solid line, left axis) decreases with successive \gmres iterations while the $p$ necessary to achieve convergence (open circles, right axis) drops quickly. Plotting script and figure available under CC-BY \cite{WangLaytonBarba2016-figshare2}.}
\label{fig:residualp}
\end{figure}
To find out the potential speed-up, we compared the time to solution for different cases with and without the relaxation strategy. Using three surface discretizations, we solved the boundary-element problem with 1st- and 2nd-kind formulations, with a multi-threaded evaluator on 6 \cpu\ cores. In each case, we were careful to set the value of $\ncrit$ to minimize the time to solution of the particular test case.
The detailed timings, the number of iterations to converge and the relative error of the potential evaluated at an exterior point are given in Tables \ref{tab:laplace_1st_relaxation} and \ref{tab:laplace_2nd_relaxation}.
Figure \ref{fig:relaxation_timing} shows the speed-up in the time spent solving the linear system of equations to the specified tolerance.
As indicated in the caption, we used the loose parameters in Table \ref{tab:laplace_loose_params} in these experiments, so the speedup measured for each case is a conservative one.
\begin{figure}
\centering
\includegraphics[width=0.5\textwidth]{LaplaceSpeedupRelaxation.pdf}
\caption{Speed-up using a relaxation strategy for three different triangulations of a sphere ($N$ is the number of surface panels), using 1st-kind and 2nd-kind integral formulations, with loose parameter settings in Table \ref{tab:laplace_loose_params}. Time is measured by averaging the solving time of three identical runs. (Multi-threaded evaluator running on 6 \cpu\ cores.) Plotting script and figure available under CC-BY \cite{WangLaytonBarba2016-figshare2}.}
\label{fig:relaxation_timing}
\end{figure}
\begin{table}[h]
%\footnotesize
%\begin{center}
\resizebox{\textwidth}{!}{
\begin{tabular}{c|cccc|cccc|c}
& \multicolumn{4}{c|}{Non-Relaxed} & \multicolumn{4}{c|}{Relaxed} & \\
N & $\ncrit$ & $\tsolve$ & Error & \# iters & $\ncrit$ & $\tsolve$ & Error & \# iters & Speed-up\\
\hline
& & & & & & & & &\\
8192 & 300 & 3.14 & $6.17\times10^{-4}$ & 11 & 100 & 2.07 & $6.13\times10^{-4}$ & 12 & 1.52\\
32768 & 400 & 19.84 & $1.69\times10^{-4}$ & 11 & 100 & 11.48 & $1.70\times10^{-4}$ & 13 & 1.73\\
131072 & 500 & 97.37 & $4.94\times10^{-5}$ & 13 & 200 & 43.06 & $5.06\times10^{-5}$ & 14 & 2.26\\
\end{tabular}
}
%\end{center}
\caption{Speed-ups for the relaxation strategy on a Laplace 1st-kind integral solver.}
\label{tab:laplace_1st_relaxation}
\end{table}%
\begin{table}[h]
%\footnotesize
%\begin{center}
\resizebox{\textwidth}{!}{
\begin{tabular}{c|cccc|cccc|c}
& \multicolumn{4}{c|}{Non-Relaxed} & \multicolumn{4}{c|}{Relaxed} & \\
N & $\ncrit$ & $\tsolve$ & Error & \# iters & $\ncrit$ & $\tsolve$ & Error & \# iters & Speed-up\\
\hline
& & & & & & & & &\\
8192 & 300 & 1.19 & $9.74\times10^{-4}$ & 2 & 300 & 1.00 & $9.74\times10^{-4}$ & 2 & 1.19\\
32768 & 400 & 7.10 & $2.29\times10^{-4}$ & 2 & 300 & 5.49 & $2.29\times10^{-4}$ & 2 & 1.29\\
131072 & 500 & 27.49 & $5.01\times10^{-5}$ & 2 & 300 & 22.49 & $4.97\times10^{-5}$ & 2 & 1.22\\
\end{tabular}
}
%\end{center}
\caption{Speed-ups for the relaxation strategy on a Laplace 2nd-kind integral solver.}
\label{tab:laplace_2nd_relaxation}
\end{table}%
The results in Tables \ref{tab:laplace_1st_relaxation} and \ref{tab:laplace_2nd_relaxation} show a speed-up of between $1.5\times$ and $2.3\times$ for the 1st-kind integral formulation and between $1.1\times$ and $1.3\times$ for the 2nd-kind formulation. We also report the number of iterations to converge and the relative error with respect to the analytical solution. At the cost of one or two more iterations, the relaxation strategy yields a decent speedup without degrading accuracy.
Only 2 iterations are needed for the 2nd-kind formulation, which explains why the speedup is much lower than for the 1st-kind formulation.
These tests also taught us that one has to give up on the idea of partitioning the domain between a near-field and a far-field in a way that balances the time spent computing each one---an accepted idea in \fmm applications. When relaxing the accuracy of the \gmres iterations, the time taken to compute the far field decreases significantly. This means that to minimize time-to-solution when using relaxed \gmres, the near and far fields should not be balanced, but rather the far field should be bloated. As a result, the first few iterations are completely dominated by the time to compute the far field, but this is offset by the benefit of much cheaper iterations from then on. This is a simple but unexpected and counter-intuitive algorithmic consequence of using inexact \gmres with \fmm.
\subsection{Inexact {\small GMRES} for solving the Stokes equation}
As in \S\ref{sec:inexactLaplace}, we start with a grid-convergence study to build confidence that the Stokes solver is correct and converges to the right solution at the expected rate. As an application of the Stokes equation, we chose low-Reynolds-number flow, using a spherical geometry for the grid-convergence study. This classical problem of fluid mechanics has an analytical solution that gives the drag force on the sphere as $F_d = 6\pi\mu Ru_x$, where $\mu$ is the viscosity of the fluid, $R$ is the radius of the sphere and $u_x$ is the freestream velocity, taken in the $x$-direction. We solve a first-kind integral equation for the traction force, $\vect{t}$, by imposing $\vect{u} = (1,0,0)^{T}$ at the center of every panel, and compute the drag force with
\begin{equation}
\label{eqn:stokes_traction_drag}
F_d = \int_\Gamma t_x\;\di{\Gamma'} = \sum_{j=1}^{N} t_{x_j}\cdot A_j.
\end{equation}
For all the tests, we set $R=1$, $u_x = 1$ and $\mu = 10^{-3}$, giving a drag force of $F_d = 0.01885$. We solve the integral problem using a boundary element method with fast-multipole-accelerated mat-vecs in a \gmres solver.
As with the Laplace equation, we first used a set of stringent parameters: $p=20$, $K=13$ and $\eta=10^{-10}$, to guarantee that discretization error dominates over error from other sources. The thicker line in Figure \ref{fig:stokes_convergence} shows that we observe convergence at the expected rate of $\O{1 / \sqrt{N}}$ for first-kind integral equations using tight parameters. Again, for both cases with and without relaxation, we gradually slackened these tight parameters one at a time to find the loosest possible values without affecting accuracy; they are listed in Table \ref{tab:stokes_loose_params}. Through our extensive tests, we also found that we need to enforce a minimum $p$ to maintain accuracy for the relaxed cases in the Stokes solver, which means the required $p$ at each step is determined by Equation \eqref{eqn:fmm_p} with a floor of $p_{\text{min}}$. The values of $\ncrit$ listed in Table \ref{tab:stokes_loose_params} are the optimal ones. The finer lines in Figure \ref{fig:stokes_convergence} show the convergence obtained with the loose parameters, confirming our choice of parameters and the relaxation strategy on $p$.
\begin{figure}
\begin{center}
\includegraphics[width=0.5\textwidth]{StokesConvergence.pdf}
\caption{Convergence of the boundary-integral solution for Stokes flow around a sphere, using a first-kind equation. The relative error is with respect to the analytical solution for drag on a sphere: $F_d = 6\pi\mu Ru_x$. The tight parameters: $p=20$, $K=13$, $\eta=10^{-10}$; the loose parameters are listed in Table \ref{tab:stokes_loose_params}. Plotting script and figure available under CC-BY \cite{WangLaytonBarba2016-figshare3}.}
\label{fig:stokes_convergence}
\end{center}
\end{figure}
\begin{table}[h]
\footnotesize
\begin{center}
\begin{tabular}{c|cccc|ccccc}
& \multicolumn{4}{c|}{Non-Relaxed} & \multicolumn{5}{c}{Relaxed}\\
N & $K$ & $p$ & $\eta$ & $\ncrit$ & $K$ & $p_{init}$ & $p_{min}$ & $\eta$ & $\ncrit$\\
\hline
& & & & & & & & &\\
8192 & 4 & 12 & $10^{-5}$ & 300 & 4 & 12 & 4 & $10^{-5}$ & 100 \\
32768 & 4 & 12 & $10^{-5}$ & 400 & 4 & 12 & 3 & $10^{-5}$ & 60 \\
131072 & 4 & 14 & $10^{-5}$ & 150\footnotemark[1] & 4 & 14 & 4 & $5\times10^{-6}$ & 80 \\
\end{tabular}
\end{center}
\caption{Loose parameters on a Stokes solver.}
\label{tab:stokes_loose_params}
\end{table}%
\footnotetext[1]{Due to memory constraints, we cannot run the largest case without relaxation by using its optimal $\ncrit$ (above 400). Instead, we used the closest possible $\ncrit=150$ in our experiments.}
Figure \ref{fig:stokes_relaxation_breakdown} illustrates clearly how we need to adjust the balance between near field and far field when using relaxation strategies. We ran the Stokes solver on a sphere discretized with 32,768 panels, using loose parameters and the corresponding optimal $\ncrit$ in Table \ref{tab:stokes_loose_params}. Because most of the iterations are spent computing at the low values of $p$, we need to start with a bloated far field. The bar plot shows the breakdown of time spent in the {\ptop} and {\mtol} kernels for each iteration: although the first iteration is unbalanced, with far field taking about 10 times as much \cpu\ time as near field, later iterations are close to balanced and the total time to solution is optimal: {\ptop} takes 63s in total, while {\mtol} costs 47s.
The number of iterations needed to converge is larger in the case of the Stokes equation compared to the Laplace equation, which bodes well for the speed-up that we could get from relaxation. Moreover, computing the spherical expansion for the Stokes kernel is equivalent to computing four Laplace expansions, which combines with the larger number of iterations to offer greater speed-ups. Figure \ref{fig:stokes_speedup} shows the speed-up resulting from the relaxed \gmres iterations for three increasingly fine surface discretizations, where we observed a speedup of around $3.0\times$. Table \ref{tab:stokes_speedup} shows the detailed results of the speedup test, including the relative error compared with the analytical solution and the number of iterations to converge.
\begin{figure}%[ht]
\begin{center}
\includegraphics[width=1\textwidth]{StokesSolveBreakdown.pdf}
\caption{Time breakdown between {\ptop} and {\mtol} when using a relaxation strategy for solving surface traction on the surface of a sphere with $32,768$ panels, with the loose set of parameters listed in Table \ref{tab:stokes_loose_params}. Plotting script and figure available under CC-BY \cite{WangLaytonBarba2016-figshare3}.}
\label{fig:stokes_relaxation_breakdown}
\end{center}
\end{figure}
\begin{figure}%[ht]
\begin{center}
\includegraphics[width=0.5\textwidth]{StokesSpeedupRelaxation.pdf}
\caption{Speed-up for solving first-kind Stokes equation on the surface of a sphere, varying $N$, with loose parameter settings in Table \ref{tab:stokes_loose_params}. Time is measured by averaging the solving time of three identical runs. Plotting script and figure available under CC-BY \cite{WangLaytonBarba2016-figshare3}.}
\label{fig:stokes_speedup}
\end{center}
\end{figure}
\begin{table}[h]
%\footnotesize
%\begin{center}
\resizebox{\textwidth}{!}{
\begin{tabular}{c|cccc|cccc|c}
& \multicolumn{4}{c|}{Non-Relaxed} & \multicolumn{4}{c|}{Relaxed} & \\
N & $\ncrit$ & $\tsolve$ & Error & \# iters & $\ncrit$ & $\tsolve$ & Error & \# iters & Speed-up\\
\hline
& & & & & & & & &\\
8192 & 300 & 82.57 & $2.28\times10^{-2}$ & 27 & 100 & 30.56 & $2.32\times10^{-2}$ & 29 & 2.70\\
32768 & 400 & 370.18 & $1.11\times10^{-2}$ & 27 & 100 & 111.93 & $1.11\times10^{-2}$ & 30 & 3.31\\
131072 & 150 & 3058.93 & $5.78\times10^{-3}$ & 24 & 200 & 682.14 & $6.55\times10^{-3}$ & 34 & 4.48\\
\end{tabular}
}
%\end{center}
\caption{Speed-ups for the relaxation strategy on a Stokes solver.}
\label{tab:stokes_speedup}
\end{table}%
\subsection{Application to red blood cells in Stokes flow}
A number of medical applications will benefit from greater understanding of the microflows around red blood cells and of the mechanical effects on the cells from this flow.
The most notable example is perhaps the deadly malaria infection, which makes red blood cells stiffer, thus disrupting the flow of blood in capillaries \cite{FedosovETal2011}.
Any design of a biomedical device that processes blood at the micrometer-scale needs to consider the mechanical behavior of blood at the cellular level \cite{Freund2014}. Blood is a dense suspension of mostly red blood cells and smaller concentrations of white blood cells and platelets. The flow regime in small capillaries is at very low Reynolds numbers, and thus completely dominated by viscous effects.
Red blood cells are very flexible, so any physiologically realistic simulation should take into account their elastic deformations. But here we are only attempting to show the benefit of our relaxation strategy on the Stokes solver, and thus limit our study to the steady Stokes flow around a red-blood-cell geometry. The unsteady problem of coupled Stokes flow and linear elasticity can be approached by repeated solution of boundary-integral problems at every time step, and would equally benefit from the speed-ups seen on a single Stokes solution.
To create a surface discretization for a red blood cell, we start with a sphere discretized into triangular panels and transform every vertex $v = (x,y,z)$, with $x,y,z\in [-1,1]$, into $v' = (x',y',z(\rho))$ using the formula presented in Ref.~\cite{EvansFung1972}:
%
\begin{equation}
\label{eqn:rbc_parameterization}
z(\rho) = \pm \frac{1}{2}\sqrt{1 - \left(\frac{\rho}{r}\right)^{2}}\left ( C_0 + C_2 \left(\frac{\rho}{r}\right)^{2} + C_4\left(\frac{\rho}{r}\right)^{4}\right ),
\end{equation}
\noindent where $x' = x\cdot r,\; y' = y\cdot r,\; \rho = \sqrt{x'^{2}+y'^{2}}$, and the coefficients are: $r=3.91\mu$m, $C_0= 0.81\mu$m, $C_2= 7.83\mu$m and $C_4=-4.39\mu$m.
Figure \ref{fig:glob_rbc} shows two examples of transformed shapes obtained from sphere triangulations using Equation \eqref{eqn:rbc_parameterization}.
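For concreteness, the short Python sketch below (a standalone illustration, not our mesh-generation code; we take the $\pm$ sign in Equation \eqref{eqn:rbc_parameterization} from the sign of the original $z$ coordinate) applies the transformation to the vertex array of a unit-sphere triangulation.
\begin{verbatim}
import numpy as np

# Coefficients from Evans & Fung (1972), in micrometers
r, C0, C2, C4 = 3.91, 0.81, 7.83, -4.39

def sphere_to_rbc(vertices):
    """Map unit-sphere vertices, shape (N, 3), to a red-blood-cell shape.

    x' = r*x, y' = r*y; z' follows the biconcave profile, with the
    +/- sign taken from the sign of the original z coordinate.
    """
    x, y, z = vertices.T
    xp, yp = r * x, r * y
    rho2 = (xp**2 + yp**2) / r**2                   # (rho / r)^2, in [0, 1]
    root = np.sqrt(np.clip(1.0 - rho2, 0.0, None))  # guard against round-off
    zp = np.sign(z) * 0.5 * root * (C0 + C2 * rho2 + C4 * rho2**2)
    return np.column_stack([xp, yp, zp])
\end{verbatim}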
\begin{figure}%[ht]
\begin{center}
\subfloat[512 panels]{\includegraphics[width=0.4\textwidth]{RBC512.pdf}\label{fig:rbc512}} \qquad
\subfloat[2048 panels]{\includegraphics[width=0.4\textwidth]{RBC2048.pdf}\label{fig:rbc2048}}
\caption{Surface geometries of red blood cells obtained from transforming sphere triangulations using Equation \eqref{eqn:rbc_parameterization}.}
\label{fig:glob_rbc}
\end{center}
\end{figure}
A grid-convergence study using the geometry of a red blood cell requires Richardson extrapolation \cite{roache1998}, since we do not have an analytical solution in this case. We calculated the drag on a red blood cell in uniform Stokes flow using three surface meshes, consecutively refined by a constant factor $c=4$.
If $f_1$ denotes the value obtained with the coarsest mesh, and $f_2$ and $f_3$ those obtained with the consecutively refined meshes, then the extrapolated value approximating the exact solution is given by
\begin{equation}
\bar{f} = \frac{f_1f_3-f_2^{2}}{f_1 -2f_2+f_3}.
\end{equation}
Table \ref{tab:rbc_richardson_values} presents the computed values of the drag force obtained with three different meshes, of sizes $N=512$, $2048$ and $8192$, using a set of tight parameters: $p=20$, $K=13$ and $\eta=10^{-10}$. We can also obtain the \emph{observed order of convergence}, $p$ (not to be confused with the truncation order of the multipole expansions, for which we use the same symbol), as follows:
%
\begin{equation}
p = \frac{\ln{\left(\frac{f_2-f_1}{f_3-f_2}\right)}}{\ln{c}},
\end{equation}
\noindent where $c$ is the refinement ratio between two consecutive meshes. With the values in Table \ref{tab:rbc_richardson_values}, the observed order of convergence comes out to $0.52$, consistent with the expected convergence rate of $\O{1 / \sqrt{N}}$.
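For reference, both formulas can be evaluated with a few lines of Python; the sketch below is standalone (not part of our solver), and because the drag values in Table \ref{tab:rbc_richardson_values} are rounded, recomputing from them gives slightly different numbers than the full-precision data behind our reported values.
\begin{verbatim}
import math

def richardson(f1, f2, f3, c):
    """Extrapolated value and observed order of convergence from
    results on three consecutively refined meshes, with constant
    refinement ratio c between consecutive meshes."""
    f_bar = (f1 * f3 - f2**2) / (f1 - 2.0 * f2 + f3)
    p_obs = math.log((f2 - f1) / (f3 - f2)) / math.log(c)
    return f_bar, p_obs

# rounded drag values for N = 512, 2048, 8192
f_bar, p_obs = richardson(-0.057, -0.070, -0.077, c=4)
\end{verbatim}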
The thicker line in Figure \ref{fig:rbc_extrapolated_convergence} shows the error with respect to the extrapolated value, as a function of mesh size. The plot includes the errors obtained with five meshes, while the extrapolated value uses only the three coarsest ones. As before, we also wish to select a loose set of parameters for the convergence tests and benchmarks. Because the problem lacks an analytical solution, and the tests below combine different numbers of cells with different numbers of panels per cell, searching for loose parameters case by case would be cumbersome. Instead, starting from our choice for Stokes flow around a sphere (Table \ref{tab:stokes_loose_params}), we chose a single conservative set of loose parameters---$p_{\text{initial}} = 14$, $p_{\text{min}}=3$, $K=4$ and $\eta=10^{-5}$---for all the red-blood-cell tests, and then found the optimal $\ncrit$ for each case with and without relaxation. These values are listed in Table \ref{tab:rbc_loose_params}. The two thinner lines in Figure \ref{fig:rbc_extrapolated_convergence} show the convergence obtained with the loose settings: the drag force computed with loose parameters and relaxation matches the result with tight parameters and no relaxation, which justifies using the looser set in the studies that follow.
\begin{table}[h]
\footnotesize
\begin{center}
\begin{tabular}{c|c}
$N$ & $f_x$ \\
\hline
& \\
$512$ & $-0.057$ \\
$2048$ & $-0.070$ \\
$8192$ & $-0.077$ \\
% $32768$ & $-0.080$
\end{tabular}
\end{center}
\caption{Surface mesh sizes and calculated drag force for the convergence study using a red blood cell in uniform Stokes flow.}
\label{tab:rbc_richardson_values}
\end{table}%
\begin{figure}
\begin{center}
\includegraphics[width=0.5\textwidth]{EthrocyteConvergence.pdf}
\caption{Observed convergence for Stokes flow around a red blood cell, with respect to the extrapolated value of the drag coefficient obtained using Richardson extrapolation \cite{roache1998}. The tight parameters are $p=20$, $K=13$ and $\eta=10^{-10}$; the loose parameters are listed in Table \ref{tab:rbc_loose_params}. Plotting script and figure available under CC-BY \cite{WangLaytonBarba2016-figshare4}.}
\label{fig:rbc_extrapolated_convergence}
\end{center}
\end{figure}
\begin{table}[h]
\footnotesize
\begin{center}
\begin{tabular}{c|cccc|ccccc}
& \multicolumn{4}{c|}{Non-Relaxed} & \multicolumn{5}{c}{Relaxed}\\
N & $K$ & $p$ & $\eta$ & $\ncrit$ & $K$ & $p_{init}$ & $p_{min}$ & $\eta$ & $\ncrit$\\
\hline
& & & & & & & & &\\
2048 & 4 & 14 & $10^{-5}$ & 400 & 4 & 14 & 3 & $10^{-5}$ & 130 \\
8192 & 4 & 14 & $10^{-5}$ & 400 & 4 & 14 & 3 & $10^{-5}$ & 140 \\
32768 & 4 & 14 & $10^{-5}$ & 400 & 4 & 14 & 3 & $10^{-5}$ & 120 \\
131072 & 4 & 14 & $10^{-5}$ & 150\footnotemark[2] & 4 & 14 & 3 & $10^{-5}$ & 100 \\
\end{tabular}
\end{center}
\caption{Loose parameters for the application to one red blood cell in Stokes flow.}
\label{tab:rbc_loose_params}
\end{table}%
\footnotetext[2]{Due to the same memory constraints, $\ncrit=150$ gives the best attainable performance when running the largest case on our workstation.}
Calculations with increasingly fine surface meshes take more time to complete not only because the number of unknowns is larger, but also because they may require more iterations to converge to the desired residual.
Figure \ref{fig:single_cell_iterations} shows that, for the non-relaxed solver on one red blood cell, the number of iterations increases from 18 to 39 as the surface mesh is refined from $N=128$ to $8,192$ panels. For further refined meshes, the number of iterations remains between 31 and 39, an indication that a mesh of $8,192$ panels is sufficiently refined for one red blood cell.
\begin{figure}
\begin{center}
\includegraphics[width=0.5\textwidth]{EthrocyteSingleCellIterations.pdf}
\caption{Number of iterations needed for the system to converge for increasingly refined surface meshes on one red blood cell, using a non-relaxed \gmres solver with loose parameter settings and optimal $\ncrit$. Plotting script and figure available under CC-BY \cite{WangLaytonBarba2016-figshare4}.}
\label{fig:single_cell_iterations}
\end{center}
\end{figure}
For the next test, we used several red blood cells in a sparse spatial arrangement. Realistic blood flows have densely packed red blood cells, but the purpose of this test is simply to demonstrate the boundary element solver on a larger problem. We set up a collection of red blood cells by making copies of one discretized cell, randomly rotating each copy, and shifting it in each coordinate direction by a positive random amount large enough that the cells do not overlap; a minimal sketch of this setup follows. The resulting arrangement may look like the one shown in Figure \ref{fig:multiple_cells}.
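The sketch below (Python, illustrative only: the rotation construction, the \texttt{spacing} and \texttt{jitter} parameters, and the diagonal placement are our choices here, not the exact recipe in our scripts) shows one way to generate such an arrangement.
\begin{verbatim}
import numpy as np

rng = np.random.default_rng(seed=7)

def random_rotation():
    """Haar-random 3x3 rotation matrix from the QR factorization
    of a Gaussian random matrix."""
    q, r = np.linalg.qr(rng.standard_normal((3, 3)))
    q = q * np.sign(np.diag(r))      # make the factorization unique
    if np.linalg.det(q) < 0:         # ensure a proper rotation (det = +1)
        q[:, 0] = -q[:, 0]
    return q

def place_cells(base_vertices, n_cells, spacing, jitter):
    """Copy, randomly rotate, and shift one discretized cell.

    Each copy is shifted by a positive random amount in every
    coordinate direction; copies cannot overlap as long as
    spacing > (cell diameter) + jitter."""
    cells = []
    for k in range(n_cells):
        R = random_rotation()
        shift = k * spacing + rng.uniform(0.0, jitter, size=3)
        cells.append(base_vertices @ R.T + shift)
    return cells
\end{verbatim}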
\begin{figure}[ht]
\begin{center}
\includegraphics[width=0.7\textwidth]{4Cells_arrow.pdf}
\caption{Surfaces for four red blood cells in a uniform Stokes flow.}
\label{fig:multiple_cells}
\end{center}
\end{figure}
Using 2, 4 and 8 red blood cells, we looked at the number of iterations needed to converge to a solver tolerance of $10^{-5}$ when using three different surface mesh sizes on each cell: $N=2048$, $8192$ and $32,768$. In all cases, we ran the non-relaxed solver with the loose parameters and optimal $\ncrit$. Figure \ref{fig:multiple_cell_iterations} shows that the number of iterations needed to converge increases sharply with the number of red blood cells in the system, while the number of panels per cell has a smaller effect in this range of mesh sizes. In all cases with multiple cells, the number of iterations is above 45, and thus we expect to see good speed-ups using the relaxation strategy.
\begin{figure}
\begin{center}
\includegraphics[width=1.0\textwidth]{EthrocyteMultipleCellIterations.pdf}
\caption{Number of iterations needed to converge for systems with multiple red blood cells, discretized with different mesh sizes, using a non-relaxed \gmres solver with loose parameters and optimal $\ncrit$. Plotting script and figure available under CC-BY \cite{WangLaytonBarba2016-figshare4}.}
\label{fig:multiple_cell_iterations}
\end{center}
\end{figure}
We completed several tests of the relaxation strategy, using between 1 and 64 red blood cells and various mesh sizes on each cell. The common parameter settings for these tests are listed in Table \ref{tab:cells_relaxation_settings} and, in each case, we chose the value of $\ncrit$ that gave the smallest time to solution for that run. The detailed results are listed in Tables \ref{tab:single_cell_relaxation_results}, \ref{tab:multiple_cell_relaxation_results_2048}, \ref{tab:multiple_cell_relaxation_results_8192} and \ref{tab:multiple_cell_relaxation_results_32768}, where we report the calculated drag force $f_x$, given the lack of an analytical solution. Figure \ref{fig:multiple_cell_speedup} summarizes the observed speed-ups, which mostly hover close to $3\times$. The largest problems, with a total of $131,072$ panels (all cells combined), are atypical because in these cases the large memory requirement prevents us from using an efficient sparse-matrix representation of the near field. But this can also be seen as an advantage of the relaxation strategy, which leads to smaller near fields and thus extends the range of problem sizes for which the efficient sparse-matrix representation remains usable. Indeed, if one needed to solve a problem of size $N=131,072$ (on one \cpu\ using six threads, as we do here), then the potential for a $4\times$ speed-up is real.
\begin{table}[h]
\footnotesize
\begin{center}
\begin{tabular}{c|c}
Variable & Setting \\
\hline
& \\
$p_{\text{initial}}$ & $14$ \\
$p_{\text{min}}$ & $3$ \\
solver tolerance & $10^{-5}$ \\
Near-field & Sparse matrix \\
Threads & $6$ \\
Solver & {\gmres} \\
Preconditioner & None
\end{tabular}
\end{center}
\caption{Parameters for the tests of the relaxation strategy with red blood cells in uniform Stokes flow.}
\label{tab:cells_relaxation_settings}
\end{table}%
\begin{table}[htp]
%\footnotesize
%\begin{center}
\resizebox{\textwidth}{!}{
\begin{tabular}{c|c|cccc|cccc|c}
& & \multicolumn{4}{c|}{Non-Relaxed} & \multicolumn{4}{c|}{Relaxed} & \\
N & \# unknowns & $\ncrit$ & $\tsolve$ & $f_x$ & \# iters & $\ncrit$ & $\tsolve$ & $f_x$ & \# iters & Speed-up\\
\hline
& & & & & & & & & &\\
2048 & 6144 & 400 & 33.64 & -0.07033 & 35 & 130 & 11.21 & -0.07010 & 37 & 3.00\\
8192 & 24576 & 400 & 183.73 & -0.07678 & 39 & 140 & 64.92 & -0.07780 & 41 & 2.83\\
32768 & 98304 & 400 & 809.90 & -0.07989 & 35 & 120 & 253.90 & -0.07977 & 39 & 3.19\\
131072 & 393216 & 150 & 4407.47 & -0.08141 & 31 & 100 & 980.65 & -0.08149 & 35 & 4.49\\
\end{tabular}
}
%\end{center}
\caption{Timings and speed-up of the relaxation strategy on a single red blood cell in uniform Stokes flow, with the test parameters shown in Table \ref{tab:cells_relaxation_settings}.}
\label{tab:single_cell_relaxation_results}
\end{table}%
\begin{table}[htp]
%\footnotesize
%\begin{center}
\resizebox{\textwidth}{!}{
\begin{tabular}{c|c|c|cccc|cccc|c}
& & & \multicolumn{4}{c|}{Non-Relaxed} & \multicolumn{4}{c|}{Relaxed} & \\
N & \# unknowns & $N_c$ & $\ncrit$ & $\tsolve$ & $f_x$ & \# iters & $\ncrit$ & $\tsolve$ & $f_x$ & \# iters & Speed-up\\
\hline
& & & & & & & & & &\\
2048 & 6144 & 1 & 400 & 33.64 & -0.07033 & 35 & 130 & 11.21 & -0.07010 & 37 & 3.00\\
8192 & 24576 & 4 & 400 & 204.43 & -0.19825 & 51 & 140 & 87.34 & -0.19765 & 52 & 2.34\\
32768 & 98304 & 16 & 400 & 1128.50 & -0.67816 & 67 & 100 & 466.12 & -0.67270 & 70 & 2.42\\
131072 & 393216 & 64 & 150 & 6186.23 & -2.41785 & 76 & 120 & 1953.23 & -2.41026 & 79 & 3.17\\
\end{tabular}
}
%\end{center}
\caption{Timings and speed-up of the relaxation strategy with several red blood cells in uniform Stokes flow, each cell discretized with 2048 panels and test parameters shown in Table \ref{tab:cells_relaxation_settings}.}
\label{tab:multiple_cell_relaxation_results_2048}
\end{table}
\begin{table}[htp]
%\footnotesize
%\begin{center}
\resizebox{\textwidth}{!}{
\begin{tabular}{c|c|c|cccc|cccc|c}
& & & \multicolumn{4}{c|}{Non-Relaxed} & \multicolumn{4}{c|}{Relaxed} & \\
N & \# unknowns & $N_c$ & $\ncrit$ & $\tsolve$ & $f_x$ & \# iters & $\ncrit$ & $\tsolve$ & $f_x$ & \# iters & Speed-up\\
\hline
& & & & & & & & & &\\
8192 & 24576 & 1 & 400 & 183.73 & -0.07678 & 39 & 140 & 64.92 & -0.07780 & 41 & 2.83\\
32768 & 98304 & 4 & 400 & 1300.30 & -0.21589 & 54 & 120 & 423.83 & -0.21703 & 55 & 3.07\\
131072 & 393216 & 16 & 150 & 8837.73 & -0.73883 & 67 & 100 & 2097.87 & -0.73841 & 68 & 4.21\\
\end{tabular}
}
%\end{center}
\caption{Timings and speed-up of the relaxation strategy with several red blood cells in uniform Stokes flow, each cell discretized with $8,192$ panels and test parameters shown in Table \ref{tab:cells_relaxation_settings}.}
\label{tab:multiple_cell_relaxation_results_8192}
\end{table}
\begin{table}[htp]
%\footnotesize
%\begin{center}
\resizebox{\textwidth}{!}{
\begin{tabular}{c|c|c|cccc|cccc|c}
& & & \multicolumn{4}{c|}{Non-Relaxed} & \multicolumn{4}{c|}{Relaxed} & \\
N & \# unknowns & $N_c$ & $\ncrit$ & $\tsolve$ & $f_x$ & \# iters & $\ncrit$ & $\tsolve$ & $f_x$ & \# iters & Speed-up\\
\hline
& & & & & & & & & &\\
32768 & 98304 & 1 & 400 & 809.90 & -0.07989 & 35 & 120 & 253.90 & -0.07977 & 39 & 3.19\\
131072 & 393216 & 4 & 150 & 6171.47 & -0.22452 & 49 & 120 & 1599.33 & -0.22450 & 51 & 3.86\\
\end{tabular}
}
%\end{center}
\caption{Timings and speed-up of the relaxation strategy with several red blood cells in uniform Stokes flow, each cell discretized with $32,768$ panels and test parameters shown in Table \ref{tab:cells_relaxation_settings}.}
\label{tab:multiple_cell_relaxation_results_32768}
\end{table}
\begin{figure}[ht]
\begin{center}
\includegraphics[width=0.7\textwidth]{EthrocyteMultipleCellSpeedup.pdf}
\caption{Speed-ups of the relaxation strategy with several red blood cells in uniform Stokes flow, using different mesh sizes on each cell. The abscissa corresponds to the total number of panels (all cells combined). Test parameters shown in Table \ref{tab:cells_relaxation_settings}. Times are averages over three identical runs.}
\label{fig:multiple_cell_speedup}
\end{center}
\end{figure}
\subsection{Reproducibility}
Our inexact \gmres code is open source and available on GitHub. Recognizing the increasing importance of computational reproducibility, we provide a ``\textit{reproducibility package}''---containing the run and post-processing scripts---to generate the figures in the Results section. To save readers from the potential headaches of dependency mismatches, we prepared a Dockerfile that builds the software environment (a Docker container) in which the reproducibility package can be used to reproduce our results. We also provide a Jupyter Notebook as supplementary material; it contains details of the parameter searches and the timing benchmarks with \fmm.
\section{Conclusion}
We have shown the first successful application of a relaxation strategy for fast-multipole-accelerated boundary element methods, based on the theory of inexact \gmres. Testing the relaxation strategy on Laplace problems, we confirmed that it converges to the correct solution, that it provides moderate speed-ups over using a fixed $p$, and that the minimum time to solution is obtained with an initially bloated far field (a smaller $\ncrit$ than in the fixed-$p$ case).
Exploring the performance advantage of relaxing the value of $p$ as \gmres iterations advance, we concluded that problems requiring high accuracy and/or resulting in more ill-conditioned linear systems will experience the best speed-ups, which for Laplace problems were between $1.5\times$ and $2.3\times$ in our tests on a sphere with constant potential.
In the case of the Stokes equation, the speed-ups obtained with the relaxation strategy are larger, because Stokes problems require both more iterations to converge (so the relaxed solver spends more time at low $p$) and more work per iteration (equivalent to four Laplace evaluations).
We also found that, for Stokes problems, it is important to enforce a minimum value of $p$ to avoid degrading accuracy or convergence.
Relaxed \gmres iterations in this case reduced the time to solution by about $3\times$ in tests of Stokes flow around a sphere.
We completed various tests for Stokes flow around red blood cells, with up to 64 cells. We studied numerical convergence in this situation using Richardson extrapolation and obtained an observed order of convergence of $0.52$, close to the expected value of $1/2$. The speed-ups resulting from the relaxation strategy in these tests were in most cases above $3\times$.
Relaxing the truncation order $p$ in the multipole expansions as Krylov iterations progress is one of those seemingly simple ideas that strike one as obvious a posteriori. Yet, as far as we know, it has not been tried before, nor has it been implemented in a \bem. Given this method's wide popularity in computational engineering, we look forward to many applications benefiting from healthy speed-ups through relaxed-$p$ \fmm. We showed that Stokes problems, in particular, can expect up to $4.5\times$ speed-ups for large problems that still fit on one workstation. Linear elasticity problems should experience similar speed-ups (although we did not test them). This is a purely algorithmic speed-up, which should multiply with any hardware speed-ups obtained, for example, by moving computational kernels to \gpu s.
\appendix
\section{Algorithm listings}\label{sec:algorithms}
\begin{algorithm}
\footnotesize
\caption{Matrix-vector multiplication.}
\label{alg:matvec}
\begin{algorithmic}
\Require panel values $v_j$, Gauss quadrature weights $q_k$, panel areas $S_j$, quadrature points $\mathbf{x}_k$, collocation points $\mathbf{x}_i$
\State $\mathbf{w} \gets \mathbf{0}$
\For{Collocation points $i=1\cdots N_p$}
\For{Integration panels $j=1 \cdots N_p$}
\For{Gauss quadrature points $k= 1 \cdots K$ on panel $j$}
\State $w_i \gets w_i + v_j \, q_k \, S_j \, \frac{1}{|\mathbf{x}_i-\mathbf{x}_k|}$
\EndFor
\EndFor
\EndFor
\end{algorithmic}
\end{algorithm}
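For readers who prefer code, the following Python sketch (illustrative only; array names are ours, and the kernel carries no $1/4\pi$ factor, matching the listing above) performs the same dense matrix-vector product. Singular panels, where the collocation point lies on the integration panel, would need special quadrature, which is omitted here as in the listing.
\begin{verbatim}
import numpy as np

def direct_matvec(v, gauss_points, gauss_weights, areas, colloc_points):
    """Dense matrix-vector product w = A v of the listing above.

    v             : (Np,)      panel values
    gauss_points  : (Np, K, 3) quadrature points of each panel
    gauss_weights : (K,)       Gauss quadrature weights
    areas         : (Np,)      panel areas
    colloc_points : (Np, 3)    collocation points
    """
    Np = v.shape[0]
    w = np.zeros(Np)
    for i in range(Np):                            # collocation points
        for j in range(Np):                        # integration panels
            r = colloc_points[i] - gauss_points[j]     # (K, 3)
            dist = np.linalg.norm(r, axis=1)           # (K,)
            w[i] += v[j] * areas[j] * np.sum(gauss_weights / dist)
    return w
\end{verbatim}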
\clearpage
% Conflicts of interest/Competing interests (include appropriate disclosures)
\subsection*{Conflict of interest}
The authors declare that they have no conflict of interest.
\begin{acknowledgements}
This work was supported by the National Science Foundation via NSF CAREER award OCI-1149784. LAB also acknowledges support from NVIDIA Corp.\ via the CUDA Fellows Program.
Dr. Cris Cecka (previously at Harvard University, currently at NVIDIA Corp.) contributed to the development of the code, particularly writing the octree and base evaluator. He later continued developing his fast-multipole framework, which evolved into his FMMTL project; see \href{https://github.com/ccecka/fmmtl}{https://github.com/ccecka/fmmtl}.
The authors also wish to acknowledge valuable interactions with Dr. Christopher Cooper (previously at Boston University, currently at Universidad T{\'e}cnica Federico Santa Mar{\'i}a) that helped with the implementation of the boundary element method.
\end{acknowledgements}
%% Bibliography
\bibliographystyle{spmpsci}
\bibliography{bem,cfd,compbio,fastmethods,inexactmatvec,scicomp}
\end{document}
%% end of file