final updates to hpctoolkit manual for the 2018-09 release.
jmellorcrummey committed Sep 30, 2018
1 parent e33feb4 commit d37524d
Showing 4 changed files with 84 additions and 53 deletions.
Binary file modified doc/manual/HPCToolkit-users-manual.pdf
Binary file not shown.
30 changes: 18 additions & 12 deletions doc/manual/HPCToolkit-users-manual.tex
Expand Up @@ -116,7 +116,7 @@
% ***************************************************************************
% ***************************************************************************

\title{\HPCToolkit{} User's Manual}
\title{\HPCToolkit{} User's Manual\\[.5in]Version 2018.09}
%\subtitle{}

\author{
Expand Down Expand Up @@ -449,7 +449,7 @@ \subsection{Measuring Application Performance}
For instance:
\begin{quote}
\begin{verbatim}
export HPCRUN_EVENT_LIST="PAPI_TOT_CYC@4000001"
export HPCRUN_EVENT_LIST="CYCLES@f200"
[<mpi-launcher>] app [app-arguments]
\end{verbatim}
\end{quote}
Expand All @@ -473,16 +473,17 @@ \subsubsection{Specifying Sample Sources}

\HPCToolkit{} primarily monitors an application using asynchronous sampling.
Consequently, the most common option to \hpcrun{} is a list of sample sources that define how samples are generated.
A sample source takes the form of an event name $e$ and period $p$ and is specified as \texttt{$e$@$p$}, \eg{}, \mytt{PAPI_TOT_CYC@4000001}.
A sample source takes the form of an event name $e$ and \texttt{howoften}, specified as \texttt{$e$@howoften}. The specifier \texttt{howoften} may
be a number, indicating a period, \eg{}, \mytt{CYCLES@4000001}, or it may be \texttt{f} followed by a number, \eg{}, \mytt{CYCLES@f200}, indicating a frequency in samples/second.
For a sample source with event $e$ and period $p$, after every \emph{p} instances of \emph{e}, a sample is generated that causes \hpcrun{} to inspect and record information about the monitored application.

To configure \hpcrun{} with two samples sources, \texttt{$e_1$@$p_1$} and \texttt{$e_2$@$p_2$}, use the following options:
To configure \hpcrun{} with two sample sources, \texttt{$e_1$@howoften$_1$} and \texttt{$e_2$@howoften$_2$}, use the following options:
\begin{quote}
\texttt{--event $e_1$@$p_1$ --event $e_2$@$p_2$}
\texttt{--event $e_1$@howoften$_1$ --event $e_2$@howoften$_2$}
\end{quote}
To use the same sample sources with an \hpclink{}-ed application, use a command similar to:
\begin{quote}
\texttt{export HPCRUN\_EVENT\_LIST="$e_1$@$p_1$;$e_2$@$p_2$"}
\texttt{export HPCRUN\_EVENT\_LIST="$e_1$@howoften$_1$;$e_2$@howoften$_2$"}
\end{quote}
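
The two forms above can be generated mechanically. The following Python sketch is illustrative only: the helper names \texttt{event\_spec}, \texttt{hpcrun\_args} and \texttt{event\_list} are hypothetical and are not part of \HPCToolkit{}. It assumes the \texttt{;} separator shown in the \verb|HPCRUN_EVENT_LIST| example above.

```python
def event_spec(event, period=None, frequency=None):
    """Format one sample source as 'event', 'event@period', or 'event@f<frequency>'."""
    if period is not None and frequency is not None:
        raise ValueError("give a period or a frequency, not both")
    if frequency is not None:
        return f"{event}@f{frequency}"   # frequency in samples/second
    if period is not None:
        return f"{event}@{period}"       # period in event counts
    return event                         # no howoften: rely on the default


def hpcrun_args(specs):
    """Build the '--event spec' options for a dynamically linked application."""
    args = []
    for spec in specs:
        args += ["--event", spec]
    return args


def event_list(specs, sep=";"):
    """Build the value of HPCRUN_EVENT_LIST for an hpclink-ed application."""
    return sep.join(specs)


if __name__ == "__main__":
    specs = [
        event_spec("CYCLES", frequency=200),
        event_spec("INSTRUCTIONS", period=4000000),
    ]
    print(" ".join(hpcrun_args(specs)))
    print('export HPCRUN_EVENT_LIST="%s"' % event_list(specs))
```

Keeping the specifications in one list and deriving both forms from it helps the dynamically and statically linked launch paths stay consistent.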


Expand Down Expand Up @@ -995,7 +996,7 @@ \section{Running and Analyzing MPI Programs}
%
\begin{quote}
\begin{verbatim}
export HPCRUN_EVENT_LIST="PAPI_TOT_CYC@4000001"
export HPCRUN_EVENT_LIST="CYCLES@f200"
<mpi-launcher> app [app-arguments]
\end{verbatim}
%
Expand Down Expand Up @@ -1162,7 +1163,7 @@ \section{Running a Statically Linked Binary}
#PBS -l size=64
#PBS -l walltime=01:00:00
cd $PBS_O_WORKDIR
export HPCRUN_EVENT_LIST="PAPI_TOT_CYC@4000000 PAPI_L2_TCM@400000"
export HPCRUN_EVENT_LIST="CYCLES@f200 PERF_COUNT_HW_CACHE_MISSES@f200"
aprun -n 64 ./app arg ...
\end{verbatim}
\end{quote}
Expand Down Expand Up @@ -1206,6 +1207,9 @@ \chapter{FAQ and Troubleshooting}
\section{How do I choose \hpcrun{} sampling periods?}
\label{sec:troubleshooting:hpcrun-sample-periods}

When using sample sources for hardware counter and software counter events provided by Linux \verb|perf_events|,
we recommend that you use frequency-based sampling. The default frequency is 300 samples/second.

Statisticians use sample sizes of approximately 3500 to make accurate projections about the voting preferences of millions of people.
In an analogous way, rather than collect unnecessarily large amounts of performance information, sampling-based performance measurement collects ``just enough'' representative performance data.
You can control \hpcrun{}'s sampling periods to collect ``just enough'' representative data even for very long executions and, to a lesser degree, for very short executions.
Expand All @@ -1214,8 +1218,8 @@ \section{How do I choose \hpcrun{} sampling periods?}
Since unimportant contexts are irrelevant to performance, as long as this condition is met (and as long as samples are not correlated, etc.), \HPCToolkit{}'s performance data should be accurate.

We typically recommend targeting a frequency of hundreds of samples per second.
For very short runs, you may need to try thousands of samples per second.
For very long runs, tens of samples per second can be quite reasonable.
For very short runs, you may need to collect thousands of samples per second to record an adequate number of samples.
For long runs, tens of samples per second may suffice for performance diagnosis.

Choosing sampling periods for some events, such as Linux timers, cycles and instructions, is easy given a target sampling frequency.
Choosing sampling periods for other events such as cache misses is harder.
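
As a rough guide, a period can be derived from a target sampling frequency by dividing the expected event rate by that frequency. The Python sketch below is only an illustration: the function name \texttt{period\_for\_frequency} is hypothetical, and the 2~GHz clock rate is an assumed example value. For events such as cache misses, the event rate is not known in advance and must first be estimated, which is why choosing their periods is harder.

```python
def period_for_frequency(events_per_second, samples_per_second):
    """Return the period (events per sample) that approximates a target frequency."""
    if events_per_second <= 0 or samples_per_second <= 0:
        raise ValueError("rates must be positive")
    # A period of at least 1 is required; round to the nearest whole event count.
    return max(1, round(events_per_second / samples_per_second))


if __name__ == "__main__":
    # Timer-based sources count microseconds, so the event rate is 1,000,000/second.
    print(period_for_frequency(1_000_000, 200))
    # Cycles on a core assumed to run at 2 GHz (example value only).
    print(period_for_frequency(2.0e9, 200))
```

For the timer case this gives a period of 5,000 microseconds for 200 samples/second.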
Expand Down Expand Up @@ -1250,7 +1254,9 @@ \section{\hpcrun{} incurs high overhead! Why?}
\begin{quote}
\verb|hpcsummary --all <hpctoolkit-measurements>|
\end{quote}
Please let us know if there are problems.
Note: The \verb|hpcsummary| script is no longer included in the \verb|bin| directory of an \HPCToolkit{} installation;
it is a developer script that can be found in the \verb|libexec/hpctoolkit| directory.
Let us know if you encounter significant problems with bad unwinds.

\item You have very long call paths where long is in the hundreds or thousands.
On x86-based architectures, try additionally using \hpcrun{}'s \texttt{RETCNT} event.
Expand Down Expand Up @@ -1323,7 +1329,7 @@ \section{\hpcviewer{} writes a long list of Java error messages to the terminal!
\texttt{\$HOME/.hpctoolkit/hpcviewer} \\
and run \hpcviewer{} again.

On MacOS, persistent state is currently stored within Mac app. If the Eclipse persistent state gets corrupted, one cant simply clear the workspace because some initial persistent state is needed for Eclipse to function properly. For MacOS, the thing to try is downloading a fresh copy of hpcviewer and running the freshly downloaded copy.
On MacOS, persistent state is currently stored within the Mac app. If the Eclipse persistent state gets corrupted, one can't simply clear the workspace because some initial persistent state is needed for Eclipse to function properly. For MacOS, the thing to try is downloading a fresh copy of hpcviewer and running the freshly downloaded copy.

If one of the aforementioned suggestions doesn’t fix the problem, report a bug.

Expand Down
32 changes: 17 additions & 15 deletions doc/manual/environ.tex
Expand Up @@ -28,6 +28,23 @@ \chapter{Environment Variables}
\section{Environment Variables for Users}
\label{user-env}

\paragraph{HPCTOOLKIT.}
Under normal circumstances, there is no need to use this environment variable.
However, there are two situations in which \hpcrun{}
\emph{must} consult the \verb+HPCTOOLKIT+ environment variable to determine the location
of \HPCToolkit{}'s top-level installation directory:

\begin{itemize}
\item On some systems, parallel job launchers (e.g., Cray's aprun) \emph{copy} the
\hpcrun{} script to a different location. In this case, for \hpcrun{} to find libraries
and utilities it needs at runtime, you must set the \verb+HPCTOOLKIT+ environment variable to
\HPCToolkit{}'s top-level installation directory.
\item
If you launch the \hpcrun{} script via a file system link,
you must set \verb+HPCTOOLKIT+ for the same reason.
\end{itemize}


\paragraph{HPCRUN\_EVENT\_LIST.}

This environment variable is used to provide a set of (event, period)
Expand Down Expand Up @@ -219,21 +236,6 @@ \section{Environment Variables for Developers}
core dump for each process, depending upon the settings for your
system. Be careful!

\paragraph{HPCRUN\_QUIET}

If this unfortunately-named environment variable is set, HPCToolkit's
measurement subsystem will turn on a default set of dynamic debugging
variables to log information about HPCToolkit's stack unwinding
based on on-the-fly binary analysis. If set, HPCToolkit's measurement
subsystem log information associated with the following debug flags:
TROLL (when a return address was not found algorithmically
and \HPCToolkit{} begins looking for possible return address values
on the call stack), SUSPICIOUS\_INTERVAL (when an x86 unwind recipe
is suspicious because it indicates that a base pointer is saved on
the stack when a return instruction is encountered) and DROP (when
samples are dropped because the measurement infrastructure was
unable to record a sample in a timely fashion).

\paragraph{HPCRUN\_FNBOUNDS\_CMD}

For dynamically-linked executables, this environment variable must
Expand Down
75 changes: 49 additions & 26 deletions doc/manual/hpcrun.tex
Expand Up @@ -19,35 +19,42 @@ \section{Using \hpcrun{}}

The basic options for \hpcrun{} are \verb|-e| (or \verb|--event|) to
specify a sampling source and rate and \verb|-t| (or \verb|--trace|) to
turn on tracing. Sample sources are specified as `\verb|event@period|'
where \verb|event| is the name of the source and \verb|period| is the
period (threshold) for that event, and this option may be used
multiple times. Note that a higher period implies a lower rate of
sampling. The basic syntax for profiling an application with
turn on tracing. Sample sources are specified as `\verb|event@howoften|'
where \verb|event| is the name of the source and \verb|howoften| is either
a number specifying the period (threshold) for that event, or \verb|f| followed by a number, \eg{}, \verb|@f100|
specifying a target sampling frequency for the event in samples/second.\footnote{Frequency-based sampling and
the frequency-based notation for {\tt howoften} are only
available for sample sources managed by Linux {\tt perf\_events}. For Linux {\tt perf\_events}, \HPCToolkit{} uses
a default sampling frequency of 300 samples/second.}
Note that a higher period implies a lower rate of sampling.
The \verb|-e| option may be used multiple times to specify that multiple
sample sources be used for measuring an execution.
The basic syntax for profiling an application with
\hpcrun{} is:

\begin{quote}
\begin{verbatim}
hpcrun -t -e event@period ... app arg ...
hpcrun -t -e event@howoften ... app arg ...
\end{verbatim}
\end{quote}

For example, to profile an application and sample every 15,000,000
total cycles and every 400,000 L2 cache misses you would use:
For example, to profile an application using hardware counter sample sources
provided by Linux \verb|perf_events| and sample cycles at 300 times/second (the default sampling frequency) and sample every 4,000,000 instructions,
you would use:

\begin{quote}
\begin{verbatim}
hpcrun -e PAPI_TOT_CYC@15000000 -e PAPI_L2_TCM@400000 app arg ...
hpcrun -e CYCLES -e INSTRUCTIONS@4000000 app arg ...
\end{verbatim}
\end{quote}

The units for the \verb|WALLCLOCK| sample source are in microseconds,
The units for timer-based sample sources (\verb|CPUTIME|, \verb|REALTIME|, and \verb|WALLCLOCK|) are microseconds,
so to sample an application with tracing every 5,000 microseconds
(200~times/second), you would use:

\begin{quote}
\begin{verbatim}
hpcrun -t -e WALLCLOCK@5000 app arg ...
hpcrun -t -e CPUTIME@5000 app arg ...
\end{verbatim}
\end{quote}

Expand All @@ -74,7 +81,7 @@ \section{Using \hpcrun{}}

\begin{quote}
\begin{verbatim}
mpirun -n 4 hpcrun -e PAPI_TOT_CYC@15000000 mpiapp arg ...
mpirun -n 4 hpcrun -e CYCLES mpiapp arg ...
\end{verbatim}
\end{quote}

Expand Down Expand Up @@ -103,7 +110,7 @@ \section{Using \hpclink{}}

% ===========================================================================

\section{Harware counter event names}
\section{Hardware Counter Event Names}

HPCToolkit uses libpfm4\cite{libpfm-www} to translate from an event name string to an event code recognized by the kernel.
An event name is case insensitive and is defined as follows:
Expand Down Expand Up @@ -456,11 +463,11 @@ \subsection{PAPI}
enough, the count for the loop as a whole (and up the tree) should be
accurate.

\subsection{Wallclock, Realtime and Cputime}
\subsection{WALLCLOCK, REALTIME and CPUTIME}

\HPCToolkit{} supports three timer-based sample sources: \verb|CPUTIME|,
\verb|REALTIME| and \verb|WALLCLOCK|.
The units for periods of these timers are all in microseconds.
The unit for periods of these timers is microseconds.

Before describing this capability further, it is worth noting
that the CYCLES event supported by Linux \perfevents{} or PAPI's \verb|PAPI_TOT_CYC|
Expand Down Expand Up @@ -550,7 +557,7 @@ \subsection{IO}
thus is able to more accurately count the time spent in these
functions.

\subsection{Memleak}
\subsection{MEMLEAK}

The \verb|MEMLEAK| sample source counts the number of bytes allocated
and freed. Like \verb|IO|, \verb|MEMLEAK| is a synchronous sample
Expand Down Expand Up @@ -651,9 +658,9 @@ \section{Process Fraction}

\begin{quote}
\begin{tabular}{@{}cl}
(dynamic) & \verb|hpcrun -f 0.10 -e event@period app arg ...| \\
(dynamic) & \verb|hpcrun -f 1/10 -e event@period app arg ...| \\
(static) & \verb|export HPCRUN_EVENT_LIST='event@period'| \\
(dynamic) & \verb|hpcrun -f 0.10 -e event@howoften app arg ...| \\
(dynamic) & \verb|hpcrun -f 1/10 -e event@howoften app arg ...| \\
(static) & \verb|export HPCRUN_EVENT_LIST='event@howoften'| \\
& \verb|export HPCRUN_PROCESS_FRACTION=0.10| \\
& \verb|app arg ...|
\end{tabular}
Expand Down Expand Up @@ -757,8 +764,8 @@ \section{Starting and Stopping Sampling}

\begin{quote}
\begin{tabular}{@{}cl}
(dynamic) & \verb|hpcrun -ds -e event@period app arg ...| \\
(static) & \verb|export HPCRUN_EVENT_LIST='event@period'| \\
(dynamic) & \verb|hpcrun -ds -e event@howoften app arg ...| \\
(static) & \verb|export HPCRUN_EVENT_LIST='event@howoften'| \\
& \verb|export HPCRUN_DELAY_SAMPLING=1| \\
& \verb|app arg ...|
\end{tabular}
Expand Down Expand Up @@ -791,17 +798,17 @@ \section{Environment Variables for \hpcrun{}}
would be convenient for users.

\section{Platform-Specific Notes}
\label{sec:platform-specific}

%
% system specific notes for titan, keenland?
%
\subsection{Cray XE6 and XK6}
\label{sec:platform-specific}
\subsection{Cray Systems}

The ALPS job launcher used on Cray XE6 and XK6 systems copies
The ALPS job launcher used on Cray systems copies
programs to a special staging area before launching them,
as described in Section~\ref{sec:env-vars}.
Consequently, when using \hpcrun{} to monitor dynamically linked binaries on Cray XE6 and XK6 systems, you
Consequently, when using \hpcrun{} to monitor dynamically-linked binaries on Cray systems, you
should add the \verb|HPCTOOLKIT| environment variable to your launch
script.
Set \verb|HPCTOOLKIT| to the top-level \HPCToolkit{} installation directory
Expand All @@ -822,7 +829,7 @@ \subsection{Cray XE6 and XK6}
export CRAY_ROOTFS=DSL
cd $PBS_O_WORKDIR
aprun -n #nodes hpcrun -e event@period dynamic-app arg ...
aprun -n #nodes hpcrun -e event@howoften dynamic-app arg ...
\end{verbatim}
\end{quote}
% $
Expand All @@ -849,3 +856,19 @@ \subsection{Cray XE6 and XK6}
correct settings for \verb|PATH|, \verb|HPCTOOLKIT|, etc. In that case,
the easiest solution is to load the \verb|hpctoolkit| module. Try
``\verb|module show hpctoolkit|'' to see if it sets \verb|HPCTOOLKIT|.

\subsection{Blue Gene/Q Systems}
Blue Gene/Q systems provide the \verb|WALLCLOCK| interval timer, but not the
POSIX \verb|CPUTIME| and \verb|REALTIME| timers.

The Linux \verb|perf_events| subsystem is unavailable on Blue Gene/Q systems.
One should use the PAPI interface to monitor executions using hardware performance counters.

\subsection{ARM Systems}
\HPCToolkit{}'s measurement infrastructure depends upon \verb|libunwind| for call stack unwinding on ARM.
On some ARM systems, compilers put DWARF Frame Description Entries (FDEs) in the ELF \verb|.debug_frame| section rather
than the \verb|.eh_frame| section. In such cases, \HPCToolkit{} requires a bleeding-edge version of \verb|libunwind| that is not included in
\HPCToolkit{}'s \verb|hpctoolkit-externals| package.
\footnote{We are in the midst of deprecating {\tt hpctoolkit-externals} as we move to a spack-based distribution system. While it is used for the
current release, we are no longer maintaining it.}
Contact the \HPCToolkit{} forum if you need a copy of a newer \verb|libunwind|.
