\section{A Framework for Aging}\label{sec:fsa-framework}
\subsection{Natural Transfer Size}\label{sec:fsa-nts}
Our model of aging is based on the observation that the bandwidth of many types
of hardware is maximized when I/Os are large; that is, sequential I/Os are
faster than random I/Os. We abstract away from the particulars of the storage
hardware by defining the \defn{natural transfer size} (NTS) to be the amount of
sequential data that must be transferred per I/O in order to obtain some fixed
fraction of maximum throughput, say 50\% or 90\%. Reads of at least the NTS of
a device will therefore run at near-maximum bandwidth.
From \cref{fig:alpha-plot}, which plots SSD and HDD bandwidth as a
function of read size, we conclude that a reasonable NTS for both the SSDs and
HDDs we measured is 4MiB.
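The measurement behind \cref{fig:alpha-plot} can be reproduced, to a first
approximation, with a simple random-read benchmark. The Python sketch below is
illustrative rather than our actual measurement harness (the file path and
read counts are arbitrary); in particular, a faithful measurement must bypass
the OS page cache, e.g., via \texttt{O\_DIRECT} with aligned buffers, and
average over repeated runs.
\begin{verbatim}
import os, random, time

def effective_bandwidth(path, read_size, num_reads):
    # Issue num_reads reads of read_size bytes at random offsets
    # within a large pre-created file; report bytes/second achieved.
    fd = os.open(path, os.O_RDONLY)
    file_size = os.fstat(fd).st_size  # assumed much larger than read_size
    start = time.monotonic()
    for _ in range(num_reads):
        offset = random.randrange(0, file_size - read_size)
        os.pread(fd, read_size, offset)
    elapsed = time.monotonic() - start
    os.close(fd)
    return read_size * num_reads / elapsed

# Sweep read sizes from 4KiB to 256MiB, as in the figure.
for exp in range(12, 29, 2):
    size = 1 << exp
    print(size, effective_bandwidth("bench.dat", size, 64))
\end{verbatim}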
The cause of the gap between sequential- and random-I/O speeds differs for
different hardware. For HDDs, seek times offer a simple explanation. For
SSDs, this gap is hard to explain conclusively without vendor support, but
common theories include: sequential accesses are easier to stripe across
internal banks, better leveraging
parallelism~\cite{DBLP:conf/sigmetrics/JungK13}; some FTL translation data
structures have nonuniform search times~\cite{DBLP:journals/csur/MaFL14}; and
fragmented SSDs are not able to prefetch
data~\cite{DBLP:conf/sigmetrics/ChenKZ09} or
metadata~\cite{DBLP:conf/hotstorage/JiCSWLX16}. Whatever the reason, SSDs show
a gap between sequential and random reads, though not as great as on disks.
In order to avoid aging, file systems should avoid breaking large files into
pieces significantly smaller than the NTS of the hardware. They should also
group small files that are logically related (close in recursive traversal
order) into clusters of size at least the NTS and store the clusters near each
other on disk. We consider the major classes of file systems and explore the
challenges each file system type encounters in achieving these two goals.
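As an illustration of the second goal, the following Python sketch shows a
hypothetical clustering heuristic, not one taken from any of the file systems
studied here: it greedily packs small files, in recursive-traversal order,
into clusters of at least the NTS, which could then be allocated contiguously.
\begin{verbatim}
NTS = 4 * 1024 * 1024  # natural transfer size (4MiB, per the measurements above)

def cluster(files, nts=NTS):
    # files: list of (name, size) pairs in recursive-traversal order.
    # Returns groups whose total size is at least nts (except possibly
    # the last), so each group can be laid out contiguously on disk.
    clusters, current, current_size = [], [], 0
    for name, size in files:
        current.append(name)
        current_size += size
        if current_size >= nts:
            clusters.append(current)
            current, current_size = [], 0
    if current:
        clusters.append(current)
    return clusters
\end{verbatim}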
\newcommand{\addalphaplot}[3]
{
% Effective bandwidth in MiB/s, matching the axis label:
% (bytes per read) * (number of reads) / (elapsed seconds) / 2^20.
\addplot[color=#2, line width=0.75pt, mark=#3]
table[x=read_size_bytes, y expr=\thisrow{read_size_bytes} * \thisrow{num_reads} / \thisrow{time_seconds} / 1048576] {fsa-data/device_measurements/#1_alpha_bw.csv};
%\addlegendentry{\pgfkeysvalueof{/hardware-names/#1}}
}
\begin{figure}
{\centering
\tikzsetnextfilename{seq-read_bandwidth}
\begin{tikzpicture}[yscale=0.825, xscale=0.825, trim axis left, trim axis right]
\begin{axis}[
width=\hsize,
height=0.6\hsize,
enlargelimits=0.05,
% scale only axis,
% title=Insertion per second against Load Factor,
xlabel style={at={(axis description cs:0.5,-0.05)},anchor=north},
xlabel={Read size (MiB)},
ylabel={Effective bandwidth (MiB per second)},
xmin=4096,
%xmax=16777216,
xmax=268435456,
%ymin=0,
% ymax=50,
xtick={4096,16384,65536,262144,1048576,4194304,16777216,67108864,268435456},
xticklabel style={align=center},
%xticklabels={4\\KiB, 16\\KiB, 64\\KiB, 256\\KiB, 1\\MiB, 4\\MiB, 16\\MiB, 64\\MiB, 256\\MiB},
xticklabels={0.004, 0.016, 0.063, 0.25, 1, 4, 16, 64, 256},
ytick={0.25,1,4,16,64,256,1024},
yticklabel style={align=center},
%yticklabels={256\\KiB, 1\\MiB, 4\\MiB, 16\\MiB, 64\\MiB, 256\\MiB, 1\\GiB},
yticklabels={0.25, 1, 4, 16, 64, 256, 1024},
xmode=log,
ymode=log,
log basis x=2,
log basis y=2,
grid=major,
scaled x ticks=false,
scaled y ticks=false,
legend columns=2,
legend cell align=left,
legend pos=north west,
% legend to name=alphaplotslegend,
% transpose legend,
]
\addalphaplot{ssd}{red}{square*}
\addlegendentry{SSD}
\addalphaplot{hdd}{blue}{*}
\addlegendentry{HDD}
\end{axis}
\end{tikzpicture}
\caption{\label{fig:alpha-plot}Effective bandwidth vs.\ read size (log-log
scale, higher is better). Even on SSDs, large I/Os can yield an order of
magnitude more bandwidth than small I/Os.}
}
\end{figure}
\subsection{Allocation Strategies and Aging}\label{sec:fsa-allocation}
The major file systems currently in use can be roughly categorized as
B-tree-based (e.g., \xfs, \zfs, and \btrfs), update-in-place (e.g., \ext), and
log-structured (e.g., \ftwofs~\cite{DBLP:conf/fast/LeeSHC15}). The research
file system that we consider, \betrfs, is based on \bets. Each of these
fundamental designs creates different aging considerations, which we discuss
in turn below; later sections present experimental validation of these design
principles.
\paragraph{B-trees.}
The aging profile of a B-tree depends on the leaf size. If the
leaves are much smaller than the NTS, then the
B-tree will age as the leaves are split and merged, and thus moved
around on the storage device.
Making leaves as large as the NTS increases \emph{write amplification}, the
ratio of the amount of data written to storage to the amount of data logically
changed. In the extreme case, a single-bit change to a B-tree leaf can cause
the entire leaf to be rewritten. Thus, B-trees are usually implemented with
small leaves. Consequently, we expect them to age under a wide variety of
workloads.
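To make the trade-off concrete (the record size here is purely illustrative):
with 4MiB leaves, updating a single 200-byte record rewrites $4 \cdot 2^{20}$
bytes, a write amplification of roughly $20{,}000\times$, whereas 64KiB leaves
incur only about $330\times$ for the same update.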
In~\cref{sec:fsa-git}, we show that the aging of \btrfs is inversely related to
the size of the leaves, as predicted. There are, in theory, ways to mitigate
the aging due to B-tree leaf movements. For example, the leaves could be
stored in a packed memory array~\cite{DBLP:journals/siamcomp/BenderDF05}.
However, such an arrangement might well incur an unacceptable performance
overhead to keep the leaves arranged in logical order, and we know of no
examples of B-trees implemented with such leaf-arrangement algorithms.
\paragraph{Write-Once or Update-in-Place File Systems.}
When data is written once and never moved, such as in update-in-place file
systems like \ext, sequential order is very difficult to maintain: imagine a
workload that writes two files to disk, and then creates files that should
logically occur between them. Without moving one of the original files, data
cannot be maintained sequentially. Such pathological cases abound, and the
resulting layouts are quite brittle. As noted above, delayed allocation
attempts to mitigate such cases by batching writes and updates before
committing them to the on-disk structure.
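The following Python sketch demonstrates the pathological pattern described
above (the directory name and file sizes are arbitrary choices for
illustration): the first two files may be allocated adjacently, but the files
created afterward sort between them in traversal order and, on an
update-in-place file system, generally cannot be placed between them without
relocating data.
\begin{verbatim}
import os

os.makedirs("demo", exist_ok=True)
# Two files written back-to-back; likely allocated near each other.
for name in ["aaa", "zzz"]:
    with open(os.path.join("demo", name), "wb") as f:
        f.write(os.urandom(1 << 20))  # 1MiB of data
# Files that sort between "aaa" and "zzz" in traversal order, but
# whose blocks are allocated only after the originals are placed.
for i in range(8):
    with open(os.path.join("demo", "mid%02d" % i), "wb") as f:
        f.write(os.urandom(1 << 20))
\end{verbatim}
A recursive read of \texttt{demo} now visits the files in name order, which
typically no longer matches their allocation order, forcing non-sequential
I/O.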
\paragraph{\bets.} \bets batch changes to the file system in a sequence of
cascading logs, one per node of the tree. Each time a node's log overflows, it
is flushed to the logs of that node's children. The seeming disadvantage is
that data is written many times, thus increasing the write amplification.
However, each time a node is modified, it receives many changes, as opposed to
a B-tree node, which might receive only one. Thus, a \bet has asymptotically
lower write
amplification than a B-tree. Consequently, it can have much larger nodes, and
typically does in implementation. \betrfs uses a \bet with 4MiB nodes.
Since 4MiB is around the NTS for our storage devices, we expect \betrfs not to
age---which we verify below.
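To illustrate the batching mechanism, here is a toy Python sketch of
\bet-style message buffering; it is a deliberately simplified model (no node
splitting, rebalancing, or persistence), not the data structure used by
\betrfs:
\begin{verbatim}
class Node:
    def __init__(self, capacity, children=None):
        self.capacity = capacity        # max buffered messages per node
        self.buffer = []                # pending (key, value) messages
        self.children = children or []  # (pivot, Node) pairs; [] = leaf
        self.data = {}                  # leaf contents

    def insert(self, key, value):
        self.buffer.append((key, value))
        if len(self.buffer) >= self.capacity:
            self.flush()

    def flush(self):
        # A single (large) node write applies a whole batch of changes,
        # unlike a B-tree leaf, which may be rewritten to apply just one.
        if not self.children:
            self.data.update(self.buffer)
        else:
            for key, value in self.buffer:
                self._child_for(key).insert(key, value)
        self.buffer = []

    def _child_for(self, key):
        for pivot, child in self.children[:-1]:
            if key <= pivot:
                return child
        return self.children[-1][1]
\end{verbatim}
Because a node absorbs up to \texttt{capacity} messages before it is
rewritten, the per-change write cost shrinks as nodes grow, which is what
allows \bets to use NTS-sized nodes without prohibitive write amplification.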
Log-structured merge trees (LSMs)~\cite{DBLP:journals/acta/ONeilCGO96} and
other write-optimized dictionaries can resist aging, depending on the
implementation. As with \bets, it is essential that node sizes match the NTS,
the schema reflect logical access order, and enough writes are batched to avoid
heavy write amplification.