
Commit

Updating documentation to v2.2.0

khuck committed Aug 31, 2020
1 parent 08e1986 commit 7c80b79
Showing 6 changed files with 75 additions and 55 deletions.
30 changes: 19 additions & 11 deletions doc/webdocs/docs/feature.md
Original file line number Diff line number Diff line change
@@ -31,10 +31,10 @@ APEX provides both *performance awareness* and *performance adaptation*.
* Software can subsequently associate performance state with policy for feedback control
* APEX introspection
* OS: track system resources, utilization, job contention, overhead
* Runtime (HPX, HPX-5, OpenMP...): track threads, queues, concurrency, remote operations, parcels, memory management
* Runtime (e.g. HPX, OpenMP, CUDA, OpenACC, Kokkos...): track threads, queues, concurrency, remote operations, parcels, memory management
* Application timer / counter observation

![Screenshot](img/APEX_diagram.pdf)
![Screenshot](img/APEX_arch.png)

*Above: APEX architecture diagram (when linked with an HPX application). The application and runtime send events to the APEX instrumentation API, which updates the performance state. The Policy Engine executes policies that change application behavior based on rule outcomes.*

@@ -44,16 +44,19 @@ APEX collects data through *inspectors*. The synchronous data collection uses an

* Initialize, terminate, new thread
* added to the HPX thread scheduler
* added to the HPX-5 thread scheduler
* added to the OpenMP runtime using the OMPT interface
* added to the pthread runtime by wrapping the pthread API calls
* Timer start, stop, yield, resume
* added to HPX task scheduler
* added to HPX-5 task scheduler
* added to the OpenMP runtime using the OMPT interface
* added to the pthread runtime by wrapping the pthread API calls
* added to the CUDA runtime by subscribing to CUPTI callbacks and asynchronous GPU activity
* added to the Kokkos runtime by registering for callbacks
* added to the OpenACC runtime by registering for callbacks
* Sampled values
* counters from HPX, HPX-5
* counters from HPX
* counters from OpenMP
* counters from CUPTI
* Custom events (meta-events)
* useful for triggering policies
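The synchronous event flow above (instrumentation events fanned out to registered observers) can be sketched as follows. This is an illustrative observer pattern only; the class and method names are hypothetical, not the actual APEX API.

```python
# Illustrative sketch of synchronous event dispatch: the runtime calls into
# the inspector, which forwards each event to every registered listener.
class Listener:
    """Base class: a listener reacts to events forwarded by the inspector."""
    def on_start(self, timer_name): pass
    def on_stop(self, timer_name): pass
    def on_sample(self, counter_name, value): pass

class Inspector:
    """Fans each instrumentation event out to every registered listener."""
    def __init__(self):
        self.listeners = []
    def register(self, listener):
        self.listeners.append(listener)
    def start(self, timer_name):
        for listener in self.listeners:
            listener.on_start(timer_name)
    def stop(self, timer_name):
        for listener in self.listeners:
            listener.on_stop(timer_name)
    def sample(self, counter_name, value):
        for listener in self.listeners:
            listener.on_sample(counter_name, value)

class CountingListener(Listener):
    """Trivial listener that just counts the events it sees."""
    def __init__(self):
        self.events = {"start": 0, "stop": 0, "sample": 0}
    def on_start(self, timer_name): self.events["start"] += 1
    def on_stop(self, timer_name): self.events["stop"] += 1
    def on_sample(self, counter_name, value): self.events["sample"] += 1

inspector = Inspector()
counter = CountingListener()
inspector.register(counter)
inspector.start("task"); inspector.stop("task")
inspector.sample("queue_length", 42)
```

Because dispatch is synchronous, each listener sees events in program order on the calling thread; any expensive processing is deferred (see the consumer thread below the Event Listeners section).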

@@ -65,26 +68,31 @@ Asynchronous data collection does not rely on events, but occurs periodically. A
* /proc/net/dev
* /proc/self/status
* lm_sensors
* power measurements
* counters from NVIDIA Monitoring Library (NVML)
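The periodic collection described above amounts to a background thread polling a data source at a fixed interval. A minimal sketch, with a generic reader callback standing in for a `/proc` file or NVML query (the class name is illustrative, not an APEX type):

```python
# Minimal sketch of asynchronous (periodic) data collection: a background
# thread calls a reader at a fixed period until asked to stop.
import threading
import time

class PeriodicSampler:
    def __init__(self, reader, period=0.01):
        self.reader = reader          # callable returning one sampled value
        self.period = period          # seconds between samples
        self.samples = []
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._run, daemon=True)

    def _run(self):
        while not self._stop.is_set():
            self.samples.append(self.reader())
            self._stop.wait(self.period)   # sleep, but wake early on stop()

    def start(self):
        self._thread.start()

    def stop(self):
        self._stop.set()
        self._thread.join()

# A real reader would parse /proc/self/status or query NVML; a clock
# stands in here so the sketch runs anywhere.
sampler = PeriodicSampler(reader=time.monotonic, period=0.005)
sampler.start()
time.sleep(0.05)
sampler.stop()
```

Using `Event.wait()` instead of `time.sleep()` lets the sampler shut down promptly at termination rather than finishing a full sleep interval.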

## Event Listeners

There are a number of listeners in APEX that are triggered by the events passed in through the API. For example, the **Profiling Listener** records events related to maintaining the performance state.

* Start Event: records the name/address of the timer, gets a timestamp (using rdtsc), returns a profiler handle
* Stop Event: gets a timestamp, puts the profiler object in a queue for back-end processing and returns
* Stop Event: gets a timestamp, optionally puts the profiler object in a queue for back-end processing and returns
* Sample Event: puts the name & value in the queue

Internally to APEX, there is an asynchronous consumer thread that processes profiler objects and samples to build a performance profile (in HPX, this thread is processed/scheduled as an HPX thread/task).
Internally to APEX, there is an asynchronous consumer thread that processes profiler objects and samples to build a performance profile, construct task graphs, and generate scatterplots of sampled task times (in HPX, this thread is scheduled as an HPX thread/task).
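The start/stop/consume flow described above can be sketched as follows. The handle, queue, and profile shapes are illustrative stand-ins for APEX's internal types, and `time.perf_counter()` stands in for the `rdtsc` timestamp source:

```python
# Sketch of the profiling-listener flow: start returns a profiler handle
# with a timestamp, stop enqueues it, and an asynchronous consumer thread
# drains the queue into an aggregate profile.
import queue
import threading
import time
from collections import defaultdict

class ProfilerHandle:
    def __init__(self, name):
        self.name = name
        self.t0 = time.perf_counter()   # stand-in for rdtsc
        self.t1 = None

profile = defaultdict(lambda: {"calls": 0, "total": 0.0})
work_queue = queue.Queue()

def consumer():
    """Drain profiler objects and fold them into the profile."""
    while True:
        handle = work_queue.get()
        if handle is None:              # sentinel: shut down
            return
        entry = profile[handle.name]
        entry["calls"] += 1
        entry["total"] += handle.t1 - handle.t0

def start(name):
    return ProfilerHandle(name)

def stop(handle):
    handle.t1 = time.perf_counter()
    work_queue.put(handle)              # hand off; measurement path stays cheap

worker = threading.Thread(target=consumer)
worker.start()
for _ in range(3):
    h = start("compute")
    stop(h)
work_queue.put(None)
worker.join()
```

The point of the design is that the instrumented thread only takes timestamps and enqueues; all aggregation cost is paid on the consumer thread.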

The TAU Listener (used for postmortem analysis) synchronously passes all measurement events to TAU to build an offline profile or trace. TAU will also capture any other events for which it is configured, including MPI, memory, file I/O, etc.
The **TAU Listener** (used for postmortem analysis) synchronously passes all measurement events to TAU to build an offline profile or trace. TAU will also capture any other events for which it is configured, including MPI, memory, file I/O, etc.

The concurrency listener (also used for postmortem analysis) maintains a timeline of total concurrency, periodically sampled from within APEX.
The **concurrency listener** (also used for postmortem analysis) maintains a timeline of total concurrency, periodically sampled from within APEX.

* Start event: push timer ID on stack
* Stop event: pop timer ID off stack
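The push/pop bookkeeping above can be sketched with a per-thread stack of timer IDs; a periodic sample then records the top of each stack. Names here are illustrative, not APEX internals:

```python
# Sketch of the concurrency listener's bookkeeping: start pushes a timer ID
# on the calling thread's stack, stop pops it, and a sampler snapshots the
# currently running timer on every thread.
import threading
from collections import defaultdict

timer_stacks = defaultdict(list)       # thread id -> stack of timer IDs
lock = threading.Lock()

def on_start(timer_id):
    with lock:
        timer_stacks[threading.get_ident()].append(timer_id)

def on_stop():
    with lock:
        timer_stacks[threading.get_ident()].pop()

def sample_concurrency():
    """One periodic sample: the innermost running timer on each thread."""
    with lock:
        return [stack[-1] for stack in timer_stacks.values() if stack]

on_start("outer")
on_start("inner")
snapshot = sample_concurrency()        # this thread is inside "inner"
on_stop()
on_stop()
```

Accumulating such snapshots over time yields the concurrency timeline that is written out (with a gnuplot script) at termination.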

An asynchronous consumer thread periodically logs the current timer for each thread. This thread will output a concurrency data report and gnuplot script at APEX termination.

The **OTF2 listener** will construct a full event trace and write the events out to an [OTF2](https://www.vi-hps.org/projects/score-p/) archive. OTF2 files can be visualized with tools like [Vampir](https://tu-dresden.de/zih/forschung/projekte/vampir/index?set_language=en) or [Traveler](https://github.com/hdc-arizona/traveler-integrated). Due to the constraints of OTF2 trace collection, tasks that start on one OS thread and end on another OS thread are not supported. Similarly, tasks/functions that are not perfectly nested are not supported by OTF2 tracing. For those types of tasks, we recommend the Trace Event listener.

The **Trace Event listener** will construct a full event trace and write the events to one or more [Google Trace Event](https://docs.google.com/document/d/1CvAClvFfyA5R-PhYUmn5OOQtYMH4h6I0nSsKchNAySU/edit#) trace files. The files can be visualized with the Google Chrome web browser, by navigating to the `chrome://tracing` URL. Other tools can be used to visualize or analyze traces, like [Catapult](https://chromium.googlesource.com/catapult).
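The Trace Event format itself is plain JSON: each timer becomes a begin/end (`"B"`/`"E"`) event pair with a name, a timestamp in microseconds, and process/thread IDs. A minimal writer sketch, assuming well-nested timers (the `begin`/`end` helper names are illustrative):

```python
# Minimal Google Trace Event writer: duration events are "B"/"E" pairs with
# microsecond timestamps; the resulting JSON loads in chrome://tracing.
import json
import os
import threading
import time

events = []

def begin(name):
    events.append({"name": name, "ph": "B", "ts": time.perf_counter() * 1e6,
                   "pid": os.getpid(), "tid": threading.get_ident()})

def end(name):
    events.append({"name": name, "ph": "E", "ts": time.perf_counter() * 1e6,
                   "pid": os.getpid(), "tid": threading.get_ident()})

begin("solve")
begin("assemble")
end("assemble")
end("solve")
trace = json.dumps({"traceEvents": events})
```

Because each `"E"` closes the most recent `"B"` on the same thread, this format, like OTF2, assumes nesting; unlike OTF2, a trace with imperfect nesting still loads, it just renders oddly.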

## Policy Listener

Binary file added doc/webdocs/docs/img/APEX_arch.pdf
Binary file not shown.
Binary file added doc/webdocs/docs/img/APEX_arch.png
18 changes: 13 additions & 5 deletions doc/webdocs/docs/index.md
@@ -3,16 +3,24 @@

# APEX: Autonomic Performance Environment for eXascale

One of the key components of the XPRESS project is a new approach to performance observation, measurement, analysis and runtime decision making in order to optimize performance. The particular challenges of accurately measuring the performance characteristics of ParalleX [[1]](#fn1) applications (as well as other asynchronous multitasking runtime architectures) requires a new approach to parallel performance observation. The standard model of multiple operating system processes and threads observing themselves in a first-person manner while writing out performance profiles or traces for offline analysis will not adequately capture the full execution context, nor provide opportunities for runtime adaptation within OpenX. The approach taken in the XPRESS project is a new performance measurement system, called (Autonomic Performance Environment for eXascale). APEX includes methods for information sharing between the layers of the software stack, from the hardware through operating and runtime systems, all the way to domain specific or legacy applications. The performance measurement components incorporate relevant information across stack layers, with merging of third-person performance observation of node-level and global resources, remote processes, and both operating and runtime system threads. For a complete academic description of APEX, see the publication "APEX: An Autonomic Performance Environment for eXascale" [[2]](#References).
One of the key components of the US Department of Energy funded *XPRESS* project was a new approach to performance observation, measurement, analysis and runtime decision making in order to optimize performance. The particular challenges of accurately measuring the performance characteristics of ParalleX [[1]](#fn1) applications (e.g. HPX), as well as other asynchronous multitasking runtime architectures, require a new approach to parallel performance observation. The traditional model of multiple operating system processes and threads observing themselves in a first-person manner while writing out performance profiles or traces for offline analysis will not adequately capture the full execution context, nor provide opportunities for runtime adaptation. The approach taken in the completed XPRESS project was a new performance measurement system, called APEX (Autonomic Performance Environment for eXascale). APEX includes methods for information sharing between the layers of the software stack, from the hardware through operating and runtime systems, all the way to domain specific or legacy applications. The performance measurement components incorporate relevant information across stack layers, merging third-person performance observation of node-level and global resources, remote processes, and both operating and runtime system threads. For a complete design description of APEX, see the publication "APEX: An Autonomic Performance Environment for eXascale" [[3]](#References). Since the original project, APEX has been extended to support multiple runtime systems.

In short, APEX is an introspection and runtime adaptation library for asynchronous multitasking runtime systems. However, APEX is not *only* useful for AMT runtimes running on future exascale systems - it can be used by any application wanting to perform runtime adaptation to deal with heterogeneous and/or variable environments.

## Introspection
APEX provides an API for measuring actions within a runtime. The API includes methods for timer start/stop, as well as sampled counter values. APEX is designed to be integrated into a runtime, library and/or application and provide performance introspection for the purpose of runtime adaptation. While APEX *can* provide rudimentary post-mortem performance analysis measurement, there are many other performance measurement tools that perform that task much better (such as TAU http://tau.uoregon.edu). That said, APEX includes an event listener that integrates with the TAU measurement system, so APEX events can be forwarded to TAU and collected in a TAU profile and/or trace to be used for post-mortem performance anlaysis.
APEX provides an API for measuring actions within a runtime. The API includes methods for timer start/stop, as well as sampled counter values. APEX is designed to be integrated into a runtime, library and/or application and provide performance introspection for the purpose of runtime adaptation. While APEX *can* provide rudimentary post-mortem performance analysis measurement, there are many other performance measurement tools that perform that task more robustly (such as TAU http://tau.uoregon.edu). That said, APEX includes an event listener that integrates with the TAU measurement system, so APEX events can be forwarded to TAU and collected in a TAU profile and/or trace to be used for post-mortem performance analysis.

## Runtime Adaptation
APEX provides a mechanism for dynamic runtime behavior, either for autotuning or adaptation to changing environment. The infrastruture that provides the adaptation is the Policy Engine, which executes policies either periodically or triggered by events. The policies have access to the performance state as observed by the APEX introspection API. APEX is integrated with Active Harmony (http://www.dyninst.org/harmony) to provide dynamic search for autotuning.
APEX provides a mechanism for dynamic runtime behavior, either for autotuning or adaptation to a changing environment. The infrastructure that provides the adaptation is the *Policy Engine*, which executes policies either periodically or triggered by events. The policies have access to the performance state as observed by the APEX introspection API. APEX is integrated with Active Harmony (http://www.dyninst.org/harmony) to provide dynamic search for autotuning.
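The two policy trigger modes described above (periodic and event-driven) can be sketched as follows. The registration API, the `state` dictionary, and the toy throttling rule are all hypothetical, shown only to illustrate how policies read the performance state and act on it:

```python
# Sketch of a Policy Engine: policies run either when a named event fires
# or periodically, and each policy reads the shared performance state.
import threading

class PolicyEngine:
    def __init__(self, state):
        self.state = state                  # shared performance state
        self.event_policies = {}            # event name -> list of policies

    def register_event_policy(self, event, policy):
        self.event_policies.setdefault(event, []).append(policy)

    def trigger(self, event):
        for policy in self.event_policies.get(event, []):
            policy(self.state)

    def run_periodic(self, policy, period, iterations):
        """Run a policy every `period` seconds, `iterations` times."""
        pause = threading.Event()
        def loop():
            for _ in range(iterations):
                policy(self.state)
                pause.wait(period)          # fixed-period cadence
        worker = threading.Thread(target=loop)
        worker.start()
        worker.join()

state = {"idle_threads": 7, "actions": []}
engine = PolicyEngine(state)

def throttle_policy(state):
    # Toy rule: if many threads sit idle, record a "reduce parallelism"
    # action (a real policy might resize a thread pool or pick a new
    # Active Harmony search point).
    if state["idle_threads"] > 4:
        state["actions"].append("reduce_parallelism")

engine.register_event_policy("sample", throttle_policy)
engine.trigger("sample")                    # event-driven invocation
engine.run_periodic(throttle_policy, period=0.001, iterations=2)
```

Keeping the policy signature to a single "read the state, maybe act" callable is what lets the same policy be attached to either trigger mode.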

## References
## References & APEX-related Publications
1. <a name="fn1"></a> Thomas Sterling, Daniel Kogler, Matthew Anderson, and Maciej Brodowicz. "SLOWER: A performance model for Exascale computing". *Supercomputing Frontiers and Innovations*, 1:42–57, September 2014. <http://superfri.org/superfri/article/view/10>
2. <a name="fn2"></a> Kevin A. Huck, Allan Porterfield, Nick Chaimov, Hartmut Kaiser, Allen D. Malony, Thomas Sterling, Rob Fowler. "An Autonomic Performance Environment for eXascale", *Journal of Supercomputing Frontiers and Innovations*, 2015. <http://superfri.org/superfri/article/view/64>
2. <a name="fn2"></a> Koniges, Alice, Jayashree Ajay Candadai, Hartmut Kaiser, Kevin Huck, Jeremy Kemp, Thomas Heller, Matthew Anderson et al. "HPX Applications and Performance Adaptation". No. SAND2015-8999C. Sandia National Lab.(SNL-NM), Albuquerque, NM (United States), 2015. <https://www.osti.gov/servlets/purl/1332791>
3. <a name="fn3"></a> Kevin A. Huck, Allan Porterfield, Nick Chaimov, Hartmut Kaiser, Allen D. Malony, Thomas Sterling, Rob Fowler. "An Autonomic Performance Environment for eXascale", *Journal of Supercomputing Frontiers and Innovations*, 2015. <http://superfri.org/superfri/article/view/64>
4. <a name="fn4"></a> Grubel, Patricia, Hartmut Kaiser, Kevin Huck, and Jeanine Cook. "Using intrinsic performance counters to assess efficiency in task-based parallel applications." In *2016 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW)*, pp. 1692-1701. IEEE, 2016. <https://www.cs.uoregon.edu/research/paracomp/papers/ipdps16/hpcmaspa2016.pdf>
5. <a name="fn5"></a> Bari, Md Abdullah Shahneous, Nicholas Chaimov, Abid M. Malik, Kevin A. Huck, Barbara Chapman, Allen D. Malony, and Osman Sarood. "Arcs: Adaptive runtime configuration selection for power-constrained openmp applications." In *2016 IEEE International Conference on Cluster Computing (CLUSTER)*, pp. 461-470. IEEE, 2016. <https://www.cs.uoregon.edu/research/paracomp/papers/cluster16/arcs.pdf>
6. <a name="fn6"></a> Tohid, R., Bibek Wagle, Shahrzad Shirzad, Patrick Diehl, Adrian Serio, Alireza Kheirkhahan, Parsa Amini et al. "Asynchronous execution of python code on task-based runtime systems." In 2018 IEEE/ACM 4th International Workshop on Extreme Scale Programming Models and Middleware (ESPM2), pp. 37-45. IEEE, 2018. <http://hdc.cs.arizona.edu/papers/espm2_2018_phylanx.pdf>
7. Heller, Thomas, Bryce Adelstein Lelbach, Kevin A. Huck, John Biddiscombe, Patricia Grubel, Alice E. Koniges, Matthias Kretz et al. "Harnessing billions of tasks for a scalable portable hydrodynamic simulation of the merger of two stars." The International Journal of High Performance Computing Applications 33, no. 4 (2019): 699-715. <https://journals.sagepub.com/doi/full/10.1177/1094342018819744>
8. Wagle, Bibek, Mohammad Alaul Haque Monil, Kevin Huck, Allen D. Malony, Adrian Serio, and Hartmut Kaiser. "Runtime adaptive task inlining on asynchronous multitasking runtime systems." In Proceedings of the 48th International Conference on Parallel Processing, pp. 1-10. 2019. <https://dl.acm.org/doi/abs/10.1145/3337821.3337915>
9. Daiß, Gregor, Parsa Amini, John Biddiscombe, Patrick Diehl, Juhan Frank, Kevin Huck, Hartmut Kaiser, Dominic Marcello, David Pfander, and Dirk Pflüger. "From piz daint to the stars: simulation of stellar mergers using high-level abstractions." In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, pp. 1-37. 2019. <https://arxiv.org/abs/1908.03121>
10. Steven R. Brandt, Alex Bigelow, Sayef Azad Sakin, Katy Williams, Katherine E. Isaacs, Kevin Huck, Rod Tohid, Bibek Wagle, Shahrzad Shirzad, and Hartmut Kaiser. 2020. JetLag: An Interactive, Asynchronous Array Computing Environment. In Practice and Experience in Advanced Research Computing (PEARC '20). Association for Computing Machinery, New York, NY, USA, 8–12. DOI: <https://doi.org/10.1145/3311790.3396657>
