From 979f20be41e4875bef14071ac4f9b9f07ff6e259 Mon Sep 17 00:00:00 2001
From: Kevin Huck
diff --git a/install/index.html b/install/index.html
index d05a34bb..d65b373a 100644
--- a/install/index.html
+++ b/install/index.html
@@ -120,7 +120,7 @@
Installation with HPX
APEX is integrated into the HPX runtime, and is integrated into the HPX build system. To enable APEX measurement with HPX, enable the following CMake flags:
 -DHPX_WITH_APEX=TRUE
-The -DHPX_WITH_APEX_TAG=develop can be used to indicate a specific release version of APEX, or to use a specific GitHub branch of APEX. We recommend using the default configured version that comes with HPX (currently v2.6.4) or the develop branch. Additional CMake flags include:
+The -DHPX_WITH_APEX_TAG=develop can be used to indicate a specific release version of APEX, or to use a specific GitHub branch of APEX. We recommend using the default configured version that comes with HPX (currently v2.6.5) or the develop branch. Additional CMake flags include:
 -DAPEX_WITH_LM_SENSORS=TRUE to enable LM sensors support (assumed to be installed in default system paths)
 -DAPEX_WITH_PAPI=TRUE and -DPAPI_ROOT=... to enable PAPI support
APEX is open source, and available on Github at http://github.com/UO-OACISS/apex.
-For stability, most users will want to download the most recent release of APEX (for example, v2.6.4):
-wget https://github.com/UO-OACISS/apex/archive/refs/tags/v2.6.4.tar.gz
-tar -xvzf v2.6.4.tar.gz
-cd apex-2.6.4
+For stability, most users will want to download the most recent release of APEX (for example, v2.6.5):
+wget https://github.com/UO-OACISS/apex/archive/refs/tags/v2.6.5.tar.gz
+tar -xvzf v2.6.5.tar.gz
+cd apex-2.6.5
Other users may want to work with the most recent code available, in which case you can clone the git repo:
git clone https://github.com/UO-OACISS/apex.git
@@ -248,7 +248,7 @@ Configuring and building APEX
The process for building APEX is:
1) Get the code (see above)
2) Enter the repo directory:
-cd apex-2.6.4
+cd apex-2.6.5
3) configure using CMake:
cmake -B build -DCMAKE_INSTALL_PREFIX=<installation-path> -DCMAKE_BUILD_TYPE=RelWithDebInfo ..
diff --git a/search/search_index.json b/search/search_index.json
index b5d0e516..e70fb5ed 100644
--- a/search/search_index.json
+++ b/search/search_index.json
@@ -1 +1 @@
-{"config":{"indexing":"full","lang":["en"],"min_search_length":3,"prebuild_index":false,"separator":"[\\s\\-]+"},"docs":[{"location":"","text":"APEX: Autonomic Performance Environment for eXascale \u00b6 One of the key components of the US Department of Energy funded XPRESS project was a new approach to performance observation, measurement, analysis and runtime decision making in order to optimize performance. The particular challenges of accurately measuring the performance characteristics of ParalleX [1] (e.g. HPX) applications (as well as other asynchronous multitasking runtime architectures) requires a new approach to parallel performance observation. The traditional model of multiple operating system processes and threads observing themselves in a first-person manner while writing out performance profiles or traces for offline analysis will not adequately capture the full execution context, nor provide opportunities for runtime adaptation. The approach taken in the completed XPRESS project was a new performance measurement system, called (Autonomic Performance Environment for eXascale). APEX includes methods for information sharing between the layers of the software stack, from the hardware through operating and runtime systems, all the way to domain specific or legacy applications. The performance measurement components incorporate relevant information across stack layers, with merging of third-person performance observation of node-level and global resources, remote processes, and both operating and runtime system threads. For a complete design description of APEX, see the publication \"APEX: An Autonomic Performance Environment for eXascale\" [3] . Since it's original project, APEX has been extended to support many popular runtime systems [11] . In short, APEX is an introspection and runtime adaptation library for asynchronous multitasking runtime systems. 
However, APEX is not only useful for AMT/AMR runtimes running on future exascale systems - it can be used by any application wanting to perform runtime adaptation to deal with heterogeneous and/or variable environments. Introspection \u00b6 APEX provides an API for measuring actions within a runtime. The API includes methods for timer start/stop, as well as sampled counter values. APEX is designed to be integrated into a runtime, library and/or application and provide performance introspection for the purpose of runtime adaptation. While APEX can provide rudimentary post-mortem performance analysis measurement, there are many other performance measurement tools that perform that task more robustly (such as TAU http://tau.uoregon.edu ). That said, APEX includes an event listener that integrates with the TAU measurement system, so APEX events can be forwarded to TAU and collected in a TAU profile and/or trace to be used for post-mortem performance anlaysis. Runtime Adaptation \u00b6 APEX provides a mechanism for dynamic runtime behavior, either for autotuning or adaptation to changing environment. The infrastruture that provides the adaptation is the Policy Engine , which executes policies either periodically or triggered by events. The policies have access to the performance state as observed by the APEX introspection API. APEX has several built in search strategies, including exhaustive, random, simulated annealing, and hill climibing. APEX is also integrated with Active Harmony http://www.dyninst.org/harmony to provide dynamic search using the Nelder Mead algorithm. Citing APEX \u00b6 Please use the following citation: https://doi.org/10.1109/ESPM256814.2022.00008 References & APEX-related Publications \u00b6 Thomas Sterling, Daniel Kogler, Matthew Anderson, and Maciej Brodowicz. \"SLOWER: A performance model for Exascale computing\". Supercomputing Frontiers and Innovations , 1:42\u201357, September 2014. 
http://superfri.org/superfri/article/view/10 Koniges, Alice, Jayashree Ajay Candadai, Hartmut Kaiser, Kevin Huck, Jeremy Kemp, Thomas Heller, Matthew Anderson et al. \"HPX Applications and Performance Adaptation\". No. SAND2015-8999C. Sandia National Lab.(SNL-NM), Albuquerque, NM (United States), 2015. https://www.osti.gov/servlets/purl/1332791 Kevin A. Huck, Allan Porterfield, Nick Chaimov, Hartmut Kaiser, Allen D. Malony, Thomas Sterling, Rob Fowler. \"An Autonomic Performance Environment for eXascale\", Journal of Supercomputing Frontiers and Innovations , 2015. http://superfri.org/superfri/article/view/64 Grubel, Patricia, Hartmut Kaiser, Kevin Huck, and Jeanine Cook. \"Using intrinsic performance counters to assess efficiency in task-based parallel applications.\" In 2016 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW) , pp. 1692-1701. IEEE, 2016. https://www.cs.uoregon.edu/research/paracomp/papers/ipdps16/hpcmaspa2016.pdf Bari, Md Abdullah Shahneous, Nicholas Chaimov, Abid M. Malik, Kevin A. Huck, Barbara Chapman, Allen D. Malony, and Osman Sarood. \"Arcs: Adaptive runtime configuration selection for power-constrained openmp applications.\" In 2016 IEEE International Conference on Cluster Computing (CLUSTER) , pp. 461-470. IEEE, 2016. https://www.cs.uoregon.edu/research/paracomp/papers/cluster16/arcs.pdf Tohid, R., Bibek Wagle, Shahrzad Shirzad, Patrick Diehl, Adrian Serio, Alireza Kheirkhahan, Parsa Amini et al. \"Asynchronous execution of python code on task-based runtime systems.\" In 2018 IEEE/ACM 4th International Workshop on Extreme Scale Programming Models and Middleware (ESPM2), pp. 37-45. IEEE, 2018. http://hdc.cs.arizona.edu/papers/espm2_2018_phylanx.pdf Heller, Thomas, Bryce Adelstein Lelbach, Kevin A. Huck, John Biddiscombe, Patricia Grubel, Alice E. Koniges, Matthias Kretz et al. 
\"Harnessing billions of tasks for a scalable portable hydrodynamic simulation of the merger of two stars.\" The International Journal of High Performance Computing Applications 33, no. 4 (2019): 699-715. https://journals.sagepub.com/doi/full/10.1177/1094342018819744 Wagle, Bibek, Mohammad Alaul Haque Monil, Kevin Huck, Allen D. Malony, Adrian Serio, and Hartmut Kaiser. \"Runtime adaptive task inlining on asynchronous multitasking runtime systems.\" In Proceedings of the 48th International Conference on Parallel Processing, pp. 1-10. 2019. https://dl.acm.org/doi/abs/10.1145/3337821.3337915 Dai\u00df, Gregor, Parsa Amini, John Biddiscombe, Patrick Diehl, Juhan Frank, Kevin Huck, Hartmut Kaiser, Dominic Marcello, David Pfander, and Dirk Pf\u00fcger. \"From piz daint to the stars: simulation of stellar mergers using high-level abstractions.\" In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, pp. 1-37. 2019. https://arxiv.org/abs/1908.03121 Steven R. Brandt, Alex Bigelow, Sayef Azad Sakin, Katy Williams, Katherine E. Isaacs, Kevin Huck, Rod Tohid, Bibek Wagle, Shahrzad Shirzad, and Hartmut Kaiser. 2020. \"JetLag: An Interactive, Asynchronous Array Computing Environment\". In Practice and Experience in Advanced Research Computing (PEARC '20). Association for Computing Machinery, New York, NY, USA, 8\u201312. DOI: https://doi.org/10.1145/3311790.3396657 Kevin A. Huck, \"Broad Performance Measurement Support for Asynchronous Multi-Tasking with APEX,\" 2022 IEEE/ACM 7th International Workshop on Extreme Scale Programming Models and Middleware (ESPM2), Dallas, TX, USA, 2022, pp. 20-29. 
https://doi.org/10.1109/ESPM256814.2022.00008","title":"Home"},{"location":"#apex_autonomic_performance_environment_for_exascale","text":"One of the key components of the US Department of Energy funded XPRESS project was a new approach to performance observation, measurement, analysis and runtime decision making in order to optimize performance. The particular challenges of accurately measuring the performance characteristics of ParalleX [1] (e.g. HPX) applications (as well as other asynchronous multitasking runtime architectures) requires a new approach to parallel performance observation. The traditional model of multiple operating system processes and threads observing themselves in a first-person manner while writing out performance profiles or traces for offline analysis will not adequately capture the full execution context, nor provide opportunities for runtime adaptation. The approach taken in the completed XPRESS project was a new performance measurement system, called (Autonomic Performance Environment for eXascale). APEX includes methods for information sharing between the layers of the software stack, from the hardware through operating and runtime systems, all the way to domain specific or legacy applications. The performance measurement components incorporate relevant information across stack layers, with merging of third-person performance observation of node-level and global resources, remote processes, and both operating and runtime system threads. For a complete design description of APEX, see the publication \"APEX: An Autonomic Performance Environment for eXascale\" [3] . Since it's original project, APEX has been extended to support many popular runtime systems [11] . In short, APEX is an introspection and runtime adaptation library for asynchronous multitasking runtime systems. 
However, APEX is not only useful for AMT/AMR runtimes running on future exascale systems - it can be used by any application wanting to perform runtime adaptation to deal with heterogeneous and/or variable environments.","title":"APEX: Autonomic Performance Environment for eXascale"},{"location":"#introspection","text":"APEX provides an API for measuring actions within a runtime. The API includes methods for timer start/stop, as well as sampled counter values. APEX is designed to be integrated into a runtime, library and/or application and provide performance introspection for the purpose of runtime adaptation. While APEX can provide rudimentary post-mortem performance analysis measurement, there are many other performance measurement tools that perform that task more robustly (such as TAU http://tau.uoregon.edu ). That said, APEX includes an event listener that integrates with the TAU measurement system, so APEX events can be forwarded to TAU and collected in a TAU profile and/or trace to be used for post-mortem performance anlaysis.","title":"Introspection"},{"location":"#runtime_adaptation","text":"APEX provides a mechanism for dynamic runtime behavior, either for autotuning or adaptation to changing environment. The infrastruture that provides the adaptation is the Policy Engine , which executes policies either periodically or triggered by events. The policies have access to the performance state as observed by the APEX introspection API. APEX has several built in search strategies, including exhaustive, random, simulated annealing, and hill climibing. 
APEX is also integrated with Active Harmony http://www.dyninst.org/harmony to provide dynamic search using the Nelder Mead algorithm.","title":"Runtime Adaptation"},{"location":"#citing_apex","text":"Please use the following citation: https://doi.org/10.1109/ESPM256814.2022.00008","title":"Citing APEX"},{"location":"#references_apex-related_publications","text":"Thomas Sterling, Daniel Kogler, Matthew Anderson, and Maciej Brodowicz. \"SLOWER: A performance model for Exascale computing\". Supercomputing Frontiers and Innovations , 1:42\u201357, September 2014. http://superfri.org/superfri/article/view/10 Koniges, Alice, Jayashree Ajay Candadai, Hartmut Kaiser, Kevin Huck, Jeremy Kemp, Thomas Heller, Matthew Anderson et al. \"HPX Applications and Performance Adaptation\". No. SAND2015-8999C. Sandia National Lab.(SNL-NM), Albuquerque, NM (United States), 2015. https://www.osti.gov/servlets/purl/1332791 Kevin A. Huck, Allan Porterfield, Nick Chaimov, Hartmut Kaiser, Allen D. Malony, Thomas Sterling, Rob Fowler. \"An Autonomic Performance Environment for eXascale\", Journal of Supercomputing Frontiers and Innovations , 2015. http://superfri.org/superfri/article/view/64 Grubel, Patricia, Hartmut Kaiser, Kevin Huck, and Jeanine Cook. \"Using intrinsic performance counters to assess efficiency in task-based parallel applications.\" In 2016 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW) , pp. 1692-1701. IEEE, 2016. https://www.cs.uoregon.edu/research/paracomp/papers/ipdps16/hpcmaspa2016.pdf Bari, Md Abdullah Shahneous, Nicholas Chaimov, Abid M. Malik, Kevin A. Huck, Barbara Chapman, Allen D. Malony, and Osman Sarood. \"Arcs: Adaptive runtime configuration selection for power-constrained openmp applications.\" In 2016 IEEE International Conference on Cluster Computing (CLUSTER) , pp. 461-470. IEEE, 2016. 
https://www.cs.uoregon.edu/research/paracomp/papers/cluster16/arcs.pdf Tohid, R., Bibek Wagle, Shahrzad Shirzad, Patrick Diehl, Adrian Serio, Alireza Kheirkhahan, Parsa Amini et al. \"Asynchronous execution of python code on task-based runtime systems.\" In 2018 IEEE/ACM 4th International Workshop on Extreme Scale Programming Models and Middleware (ESPM2), pp. 37-45. IEEE, 2018. http://hdc.cs.arizona.edu/papers/espm2_2018_phylanx.pdf Heller, Thomas, Bryce Adelstein Lelbach, Kevin A. Huck, John Biddiscombe, Patricia Grubel, Alice E. Koniges, Matthias Kretz et al. \"Harnessing billions of tasks for a scalable portable hydrodynamic simulation of the merger of two stars.\" The International Journal of High Performance Computing Applications 33, no. 4 (2019): 699-715. https://journals.sagepub.com/doi/full/10.1177/1094342018819744 Wagle, Bibek, Mohammad Alaul Haque Monil, Kevin Huck, Allen D. Malony, Adrian Serio, and Hartmut Kaiser. \"Runtime adaptive task inlining on asynchronous multitasking runtime systems.\" In Proceedings of the 48th International Conference on Parallel Processing, pp. 1-10. 2019. https://dl.acm.org/doi/abs/10.1145/3337821.3337915 Dai\u00df, Gregor, Parsa Amini, John Biddiscombe, Patrick Diehl, Juhan Frank, Kevin Huck, Hartmut Kaiser, Dominic Marcello, David Pfander, and Dirk Pf\u00fcger. \"From piz daint to the stars: simulation of stellar mergers using high-level abstractions.\" In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, pp. 1-37. 2019. https://arxiv.org/abs/1908.03121 Steven R. Brandt, Alex Bigelow, Sayef Azad Sakin, Katy Williams, Katherine E. Isaacs, Kevin Huck, Rod Tohid, Bibek Wagle, Shahrzad Shirzad, and Hartmut Kaiser. 2020. \"JetLag: An Interactive, Asynchronous Array Computing Environment\". In Practice and Experience in Advanced Research Computing (PEARC '20). Association for Computing Machinery, New York, NY, USA, 8\u201312. 
DOI: https://doi.org/10.1145/3311790.3396657 Kevin A. Huck, \"Broad Performance Measurement Support for Asynchronous Multi-Tasking with APEX,\" 2022 IEEE/ACM 7th International Workshop on Extreme Scale Programming Models and Middleware (ESPM2), Dallas, TX, USA, 2022, pp. 20-29. https://doi.org/10.1109/ESPM256814.2022.00008","title":"References & APEX-related Publications"},{"location":"environment/","text":"APEX Runtime Options \u00b6 Environment Variables \u00b6 There are a number of environment variables that control APEX behavior at runtime. The variables can be defined in the environment before application execution, or specified in a file called apex.conf in the current execution directory. The format of the configuration file is: APEX_VARIABLE1=value APEX_VARIABLE2=value ... To generate a default APEX configuration file in the current working directory, run the ./install/bin/apex_make_default_config program. To get a list of all known environment variables, run the ./install/bin/apex_environment_help program. 
Environment Variable Default Value Valid Values Description APEX_DISABLE 0 0,1 Disable APEX during the application execution APEX_SUSPEND 0 0,1 Suspend APEX timers and counters during the application execution APEX_PAPI_SUSPEND 0 0,1 Suspend PAPI counters during the application execution APEX_SCREEN_OUTPUT 0 0,1 Output APEX performance summary at exit APEX_VERBOSE 0 0,1 Output APEX options at entry APEX_PROFILE_OUTPUT 0 0,1 Output TAU profile of performance summary APEX_CSV_OUTPUT 0 0,1 Output CSV profile of performance summary APEX_TASKGRAPH_OUTPUT 0 0,1 Output graphviz reduced taskgraph APEX_POLICY 1 0,1 Enable APEX policy listener and execute registered policies APEX_PROC_STAT 1 0,1 Periodically read data from /proc/stat APEX_PROC_CPUINFO 0 0,1 Read data (once) from /proc/cpuinfo APEX_PROC_MEMINFO 0 0,1 Periodically read data from /proc/meminfo APEX_PROC_NET_DEV 0 0,1 Periodically read data from /proc/net/dev APEX_PROC_SELF_STATUS 0 0,1 Periodically read data from /proc/self/status APEX_PROC_SELF_IO 0 0,1 Periodically read data from /proc/self/io APEX_PROC_STAT_DETAILS 0 0,1 Periodically read detailed data from /proc/self/stat APEX_PROC_PERIOD 1000000 Integer /proc data read sampling period, in microseconds APEX_MEASURE_CONCURRENCY 0 0,1 Periodically sample thread activity and output report at exit APEX_MEASURE_CONCURRENCY_PERIOD 1000000 Integer Thread concurrency sampling period, in microseconds APEX_OTF2 0 0,1 Enable OTF2 trace output. APEX_TRACE_EVENT 0 0,1 Enable Google Trace Event output. APEX_OTF2_ARCHIVE_PATH OTF2_archive valid path OTF2 trace directory. APEX_OTF2_ARCHIVE_NAME APEX valid string OTF2 trace filename. APEX_TAU 0 0,1 Enable TAU profiling (if application is executed with tau_exec ). 
APEX_THROTTLE_CONCURRENCY 0 0,1 Enable thread concurrency throttling APEX_THROTTLING_MIN_THREADS 1 0,1 Minimum threads allowed APEX_THROTTLING_MAX_THREADS 8 0,1 Maximum threads allowed APEX_THROTTLE_ENERGY 0 0,1 Enable energy throttling APEX_THROTTLE_ENERGY_PERIOD 1000000 Integer Power sampling period, in microseconds APEX_THROTTLING_MIN_WATTS 150 Integer Minimum Watt threshold APEX_THROTTLING_MAX_WATTS 300 Integer Maximum Watt threshold APEX_PTHREAD_WRAPPER_STACK_SIZE 0 16k-8M When wrapping pthread_create, use this size for the stack. APEX_PAPI_METRICS null space-delimited string of metric names List of metrics to be measured by APEX when timers are used. Only meaningful if APEX is configured with PAPI support. Any supported metric from papi_avail ( see PAPI Documentation ) can be used. APEX_PAPI_SUSPEND 0 0,1 Suspend collection of PAPI metrics for APEX timers during the application execution APEX_PROCESS_ASYNC_STATE 1 0,1 Enable/disable asynchronous processing of statistics (useful when only collecting trace data) APEX_UNTIED_TIMERS 0 0,1 Disable callstack state maintenance for specific OS threads. This allows APEX timers to start on one thread and stop on another. This is not compatible with tracing. APEX_OMPT_REQUIRED_EVENTS_ONLY 0 0,1 Disable moderate-frequency, moderate-overhead OMPT events. APEX_OMPT_HIGH_OVERHEAD_EVENTS 0 0,1 Disable high-frequency, high-overhead OMPT events. APEX_PIN_APEX_THREADS 1 0,1 Pin APEX asynchronous threads to the last core/PU on the system. APEX_TASK_SCATTERPLOT 0 0,1 Periodically sample APEX tasks, generating a scatterplot of time distributions. APEX_TIME_TOP_LEVEL_OS_THREADS 0 0,1 When registering threads, measure their lifetimes. APEX_CUDA_COUNTERS 0 0,1 Enable CUDA CUPTI counter measurement. APEX_CUDA_KERNEL_DETAILS 0 0,1 Enable Context information for CUDA CUPTI counter measurement and CUDA CUPTI API callback timers. APEX_CUDA_RUNTIME_API 1 0,1 Enable callbacks for the CUDA Runtime API ( cuda*() functions). 
APEX_CUDA_DRIVER_API 0 0,1 Enable callbacks for the CUDA Driver API ( cu*() functions). APEX_JUPYTER_SUPPORT 0 0,1 When running HPX in a Jupyter notebook, enable special handling for APEX data output and system reset. apex_exec flags \u00b6 To control the behavior of APEX when using apex_exec , many flags are available, several of which will automatically set the above environment variables as necessary: Usage: apex_exec executable where APEX options are zero or more of: --apex:help show this usage message --apex:debug run with APEX in debugger --apex:verbose enable verbose list of APEX environment variables --apex:screen enable screen text output (on by default) --apex:screen-detail enable detailed text output (off by default) --apex:quiet disable screen text output --apex:final-output-only only output performance data at exit (ignore intermediate dump calls) --apex:csv enable csv text output --apex:tau enable tau profile output --apex:taskgraph enable taskgraph output (graphviz required for post-processing) --apex:tasktree enable tasktree output (python3 with Pandas required for post-processing) --apex:hatchet enable Hatchet tasktree output (python3 with Hatchet required for post-processing) --apex:concur Periodically sample thread activity (default: off) --apex:concur-max Max timers to track with concurrency activity (default: 5) --apex:concur-period Frequency of concurrency sampling, in microseconds (default: 1000000) --apex:throttle throttle short-lived timers to reduce overhead (default: off) --apex:throttle-calls minimum number of calls before throttling (default: 1000) --apex:throttle-per minimum timer duration in microseconds (default: 10) --apex:otf2 enable OTF2 trace output (requries --apex:mpi with MPI configurations) --apex:otf2path specify location of OTF2 archive (default: ./OTF2_archive) --apex:otf2name specify name of OTF2 file (default: APEX) --apex:gtrace enable Google Trace Events output (deprecated) --apex:pftrace enable Perfetto Trace output 
--apex:scatter enable scatterplot output (python required for post-processing) --apex:openacc enable OpenACC support --apex:kokkos enable Kokkos support --apex:kokkos-tuning enable Kokkos runtime autotuning support --apex:kokkos-fence enable Kokkos fences for async kernels --apex:raja enable RAJA support --apex:pthread enable pthread wrapper support --apex:gpu-memory enable GPU memory wrapper support --apex:cpu-memory enable CPU memory wrapper support --apex:untied enable tasks to migrate cores/OS threads during execution (not compatible with trace output) --apex:cuda enable CUDA/CUPTI measurement (default: off) --apex:cuda-counters enable CUDA/CUPTI counter support (default: off) --apex:cuda-driver enable CUDA driver API callbacks (default: off) --apex:cuda-details enable per-kernel statistics where available (default: off) --apex:hip enable HIP/ROCTracer measurement (default: off) --apex:hip-metrics enable HIP/ROCProfiler metric support (default: off) --apex:hip-counters enable HIP/ROCTracer counter support (default: off) --apex:hip-driver enable HIP/ROCTracer KSA driver API callbacks (default: off) --apex:hip-details enable per-kernel statistics where available (default: off) --apex:monitor-gpu enable GPU monitoring services (CUDA NVML, ROCm SMI) --apex:level0 enable OneAPI Level0 measurement (default: off) --apex:cpuinfo enable sampling of /proc/cpuinfo (Linux only) --apex:meminfo enable sampling of /proc/meminfo (Linux only) --apex:net enable sampling of /proc/net/dev (Linux only) --apex:status enable sampling of /proc/self/status (Linux only) --apex:io enable sampling of /proc/self/io (Linux only) --apex:period specify frequency of OS/HW sampling --apex:mpi enable MPI profiling (required for OTF2 support with MPI configurations) --apex:ompt enable OpenMP profiling (requires runtime support) --apex:ompt-simple only enable OpenMP Tools required events --apex:ompt-details enable all OpenMP Tools events --apex:source resolve function, file and line info for 
address lookups with binutils (default: function only) --apex:preload extra libraries to load with LD_PRELOAD _before_ APEX libraries (LD_PRELOAD value is added _after_ APEX libraries) --apex:postprocess run post-process scripts (graphviz, python) on output data after exit","title":"Useful Environment Variables"},{"location":"environment/#apex_runtime_options","text":"","title":"APEX Runtime Options"},{"location":"environment/#environment_variables","text":"There are a number of environment variables that control APEX behavior at runtime. The variables can be defined in the environment before application execution, or specified in a file called apex.conf in the current execution directory. The format of the configuration file is: APEX_VARIABLE1=value APEX_VARIABLE2=value ... To generate a default APEX configuration file in the current working directory, run the ./install/bin/apex_make_default_config program. To get a list of all known environment variables, run the ./install/bin/apex_environment_help program. 
Environment Variable Default Value Valid Values Description APEX_DISABLE 0 0,1 Disable APEX during the application execution APEX_SUSPEND 0 0,1 Suspend APEX timers and counters during the application execution APEX_PAPI_SUSPEND 0 0,1 Suspend PAPI counters during the application execution APEX_SCREEN_OUTPUT 0 0,1 Output APEX performance summary at exit APEX_VERBOSE 0 0,1 Output APEX options at entry APEX_PROFILE_OUTPUT 0 0,1 Output TAU profile of performance summary APEX_CSV_OUTPUT 0 0,1 Output CSV profile of performance summary APEX_TASKGRAPH_OUTPUT 0 0,1 Output graphviz reduced taskgraph APEX_POLICY 1 0,1 Enable APEX policy listener and execute registered policies APEX_PROC_STAT 1 0,1 Periodically read data from /proc/stat APEX_PROC_CPUINFO 0 0,1 Read data (once) from /proc/cpuinfo APEX_PROC_MEMINFO 0 0,1 Periodically read data from /proc/meminfo APEX_PROC_NET_DEV 0 0,1 Periodically read data from /proc/net/dev APEX_PROC_SELF_STATUS 0 0,1 Periodically read data from /proc/self/status APEX_PROC_SELF_IO 0 0,1 Periodically read data from /proc/self/io APEX_PROC_STAT_DETAILS 0 0,1 Periodically read detailed data from /proc/self/stat APEX_PROC_PERIOD 1000000 Integer /proc data read sampling period, in microseconds APEX_MEASURE_CONCURRENCY 0 0,1 Periodically sample thread activity and output report at exit APEX_MEASURE_CONCURRENCY_PERIOD 1000000 Integer Thread concurrency sampling period, in microseconds APEX_OTF2 0 0,1 Enable OTF2 trace output. APEX_TRACE_EVENT 0 0,1 Enable Google Trace Event output. APEX_OTF2_ARCHIVE_PATH OTF2_archive valid path OTF2 trace directory. APEX_OTF2_ARCHIVE_NAME APEX valid string OTF2 trace filename. APEX_TAU 0 0,1 Enable TAU profiling (if application is executed with tau_exec ). 
APEX_THROTTLE_CONCURRENCY 0 0,1 Enable thread concurrency throttling APEX_THROTTLING_MIN_THREADS 1 0,1 Minimum threads allowed APEX_THROTTLING_MAX_THREADS 8 0,1 Maximum threads allowed APEX_THROTTLE_ENERGY 0 0,1 Enable energy throttling APEX_THROTTLE_ENERGY_PERIOD 1000000 Integer Power sampling period, in microseconds APEX_THROTTLING_MIN_WATTS 150 Integer Minimum Watt threshold APEX_THROTTLING_MAX_WATTS 300 Integer Maximum Watt threshold APEX_PTHREAD_WRAPPER_STACK_SIZE 0 16k-8M When wrapping pthread_create, use this size for the stack. APEX_PAPI_METRICS null space-delimited string of metric names List of metrics to be measured by APEX when timers are used. Only meaningful if APEX is configured with PAPI support. Any supported metric from papi_avail ( see PAPI Documentation ) can be used. APEX_PAPI_SUSPEND 0 0,1 Suspend collection of PAPI metrics for APEX timers during the application execution APEX_PROCESS_ASYNC_STATE 1 0,1 Enable/disable asynchronous processing of statistics (useful when only collecting trace data) APEX_UNTIED_TIMERS 0 0,1 Disable callstack state maintenance for specific OS threads. This allows APEX timers to start on one thread and stop on another. This is not compatible with tracing. APEX_OMPT_REQUIRED_EVENTS_ONLY 0 0,1 Disable moderate-frequency, moderate-overhead OMPT events. APEX_OMPT_HIGH_OVERHEAD_EVENTS 0 0,1 Disable high-frequency, high-overhead OMPT events. APEX_PIN_APEX_THREADS 1 0,1 Pin APEX asynchronous threads to the last core/PU on the system. APEX_TASK_SCATTERPLOT 0 0,1 Periodically sample APEX tasks, generating a scatterplot of time distributions. APEX_TIME_TOP_LEVEL_OS_THREADS 0 0,1 When registering threads, measure their lifetimes. APEX_CUDA_COUNTERS 0 0,1 Enable CUDA CUPTI counter measurement. APEX_CUDA_KERNEL_DETAILS 0 0,1 Enable Context information for CUDA CUPTI counter measurement and CUDA CUPTI API callback timers. APEX_CUDA_RUNTIME_API 1 0,1 Enable callbacks for the CUDA Runtime API ( cuda*() functions). 
APEX_CUDA_DRIVER_API 0 0,1 Enable callbacks for the CUDA Driver API ( cu*() functions). APEX_JUPYTER_SUPPORT 0 0,1 When running HPX in a Jupyter notebook, enable special handling for APEX data output and system reset.","title":"Environment Variables"},{"location":"environment/#apex_exec_flags","text":"To control the behavior of APEX when using apex_exec , many flags are available, several of which will automatically set the above environment variables as necessary: Usage: apex_exec executable where APEX options are zero or more of: --apex:help show this usage message --apex:debug run with APEX in debugger --apex:verbose enable verbose list of APEX environment variables --apex:screen enable screen text output (on by default) --apex:screen-detail enable detailed text output (off by default) --apex:quiet disable screen text output --apex:final-output-only only output performance data at exit (ignore intermediate dump calls) --apex:csv enable csv text output --apex:tau enable tau profile output --apex:taskgraph enable taskgraph output (graphviz required for post-processing) --apex:tasktree enable tasktree output (python3 with Pandas required for post-processing) --apex:hatchet enable Hatchet tasktree output (python3 with Hatchet required for post-processing) --apex:concur Periodically sample thread activity (default: off) --apex:concur-max Max timers to track with concurrency activity (default: 5) --apex:concur-period Frequency of concurrency sampling, in microseconds (default: 1000000) --apex:throttle throttle short-lived timers to reduce overhead (default: off) --apex:throttle-calls minimum number of calls before throttling (default: 1000) --apex:throttle-per minimum timer duration in microseconds (default: 10) --apex:otf2 enable OTF2 trace output (requries --apex:mpi with MPI configurations) --apex:otf2path specify location of OTF2 archive (default: ./OTF2_archive) --apex:otf2name specify name of OTF2 file (default: APEX) --apex:gtrace enable Google Trace Events 
output (deprecated) --apex:pftrace enable Perfetto Trace output --apex:scatter enable scatterplot output (python required for post-processing) --apex:openacc enable OpenACC support --apex:kokkos enable Kokkos support --apex:kokkos-tuning enable Kokkos runtime autotuning support --apex:kokkos-fence enable Kokkos fences for async kernels --apex:raja enable RAJA support --apex:pthread enable pthread wrapper support --apex:gpu-memory enable GPU memory wrapper support --apex:cpu-memory enable CPU memory wrapper support --apex:untied enable tasks to migrate cores/OS threads during execution (not compatible with trace output) --apex:cuda enable CUDA/CUPTI measurement (default: off) --apex:cuda-counters enable CUDA/CUPTI counter support (default: off) --apex:cuda-driver enable CUDA driver API callbacks (default: off) --apex:cuda-details enable per-kernel statistics where available (default: off) --apex:hip enable HIP/ROCTracer measurement (default: off) --apex:hip-metrics enable HIP/ROCProfiler metric support (default: off) --apex:hip-counters enable HIP/ROCTracer counter support (default: off) --apex:hip-driver enable HIP/ROCTracer KSA driver API callbacks (default: off) --apex:hip-details enable per-kernel statistics where available (default: off) --apex:monitor-gpu enable GPU monitoring services (CUDA NVML, ROCm SMI) --apex:level0 enable OneAPI Level0 measurement (default: off) --apex:cpuinfo enable sampling of /proc/cpuinfo (Linux only) --apex:meminfo enable sampling of /proc/meminfo (Linux only) --apex:net enable sampling of /proc/net/dev (Linux only) --apex:status enable sampling of /proc/self/status (Linux only) --apex:io enable sampling of /proc/self/io (Linux only) --apex:period specify frequency of OS/HW sampling --apex:mpi enable MPI profiling (required for OTF2 support with MPI configurations) --apex:ompt enable OpenMP profiling (requires runtime support) --apex:ompt-simple only enable OpenMP Tools required events --apex:ompt-details enable all OpenMP Tools 
events --apex:source resolve function, file and line info for address lookups with binutils (default: function only) --apex:preload extra libraries to load with LD_PRELOAD _before_ APEX libraries (LD_PRELOAD value is added _after_ APEX libraries) --apex:postprocess run post-process scripts (graphviz, python) on output data after exit","title":"apex_exec flags"},{"location":"examples/","text":"HPX-3 and 1D stencil \u00b6 ...coming soon... HPX-5 and LULESH \u00b6 ...coming soon... HPX-5 and SSSP \u00b6 ...coming soon... HPX-5 and MiniGhost \u00b6 ...coming soon... OpenMP and LULESH 2.0 \u00b6 ...coming soon... OpenMP and NPB 3.2.1 \u00b6 ...coming soon... MPI applications \u00b6 ...coming soon...","title":"HPX-3 and 1D stencil"},{"location":"examples/#hpx-3_and_1d_stencil","text":"...coming soon...","title":"HPX-3 and 1D stencil"},{"location":"examples/#hpx-5_and_lulesh","text":"...coming soon...","title":"HPX-5 and LULESH"},{"location":"examples/#hpx-5_and_sssp","text":"...coming soon...","title":"HPX-5 and SSSP"},{"location":"examples/#hpx-5_and_minighost","text":"...coming soon...","title":"HPX-5 and MiniGhost"},{"location":"examples/#openmp_and_lulesh_20","text":"...coming soon...","title":"OpenMP and LULESH 2.0"},{"location":"examples/#openmp_and_npb_321","text":"...coming soon...","title":"OpenMP and NPB 3.2.1"},{"location":"examples/#mpi_applications","text":"...coming soon...","title":"MPI applications"},{"location":"feature/","text":"Feature Overview \u00b6 APEX: Motivation \u00b6 Frequently, software components or even entire applications run into a situation where the context of the execution environment has changed in some way (or does not meet assumptions). In those situations, the software requires some mechanism for evaluating its own performance and that of the underlying runtime system, operating system and hardware. 
The types of adaptation that the software wants to do could include: Controlling concurrency to improve energy efficiency for performance Parametric variability adjust the decomposition granularity for this machine / dataset choose a different algorithm for better performance/accuracy choose a different preconditioner for better performance/accuracy choose a different solver for better performance/accuracy Load Balancing when to perform AGAS migration? when to perform repartitioning? when to perform data exchanges? Parallel Algorithms (for_each\u2026) - choose a different execution model separate what from how Address the \u201cSLOW(ER)\u201d performance model avoid S tarvation reduce L atency reduce O verhead reduce W aiting reduce E nergy consumption improve R esiliency APEX provides both performance awareness and performance adaptation . APEX provides top-down and bottom-up performance mapping and feedback. APEX exposes node-wide resource utilization data and analysis, energy consumption, and health information in real time Software can subsequently associate performance state with policy for feedback control APEX introspection OS: track system resources, utilization, job contention, overhead Runtime (e.g. HPX, OpenMP, CUDA, OpenACC, Kokkos...): track threads, queues, concurrency, remote operations, parcels, memory management Application timer / counter observation Above: APEX architecture diagram (when linked with an HPX application). The application and runtime send events to the APEX instrumentation API, which updates the performance state. The Policy Engine executes policies that change application behavior based on rule outcomes. Supported Parallel Models \u00b6 HPX - APEX is fully integrated into the HPX runtime, so that all tasks that are scheduled by the thread scheduler are measured by APEX. In addition, all HPX counters are captured by APEX. 
C++ threads ( std::thread , std::async ) and vanilla POSIX threads - Using a pthread_create() wrapper, APEX can capture all spawned threads and measure the time spent in those top level functions. OpenMP - Using the OpenMP 5.0 OMPT interface, APEX can capture performance data related to OpenMP pragmas. OpenACC - Using the OpenACC Profiling interface, APEX can capture performance data related to OpenACC pragmas. Kokkos - Using the Kokkos profiling interface, APEX can capture performance data related to Kokkos parallel abstractions. RAJA - Using the RAJA profiling interface, APEX can capture performance data related to RAJA parallel abstractions. Unlike Kokkos, RAJA doesn't give any details, so don't expect much. CUDA - Using the NVIDIA CUPTI and NVML libraries, APEX can capture runtime and driver API calls as well as memory transfers and kernels executed on a device, and monitor GPU utilization. HIP - Using the AMD Roctracer, Rocprofiler and ROCM-SMI libraries, APEX can capture runtime and driver API calls as well as memory transfers and kernels executed on a device, and monitor GPU utilization. Intel SYCL - Using the Intel Level0 libraries, APEX can capture runtime and driver API calls as well as memory transfers and kernels executed on a device, and monitor GPU utilization. PhiProf - APEX is integrated with support to intercept PhiProf profiling data. See https://github.com/fmihpc/phiprof . StarPU - APEX is integrated with support to profile StarPU. See https://starpu.gitlabpages.inria.fr . Distributed Execution over MPI - While APEX doesn't measure all MPI function calls, it is \"MPI-aware\", and can detect when used in a distributed run so that each process can write separate or aggregated performance data. APEX provides rudimentary support for measuring point-to-point and collectives. 
Parallel Models with Experimental Support / In Development / Wish List \u00b6 Argobots - APEX has been used to instrument services based on Argobots, but it is not integrated into the runtime. TBB - The APEX team is evaluating integrated TBB support. Legion - No plans at this time. Charm++ - No plans at this time. Iris - Plans are afoot. Stay tuned. YAKL - Plans are afoot. Stay tuned. Introspection \u00b6 APEX collects data through inspectors . The synchronous data collection uses an event API and event listeners . The API includes events for: Initialize, terminate, thread creation, thread exit added to the HPX thread scheduler added to the OpenMP runtime using the OMPT interface added to the pthread runtime by wrapping the pthread API calls Timer start, stop, yield, resume added to HPX task scheduler added to the OpenMP runtime using the OMPT interface added to the pthread runtime by wrapping the pthread API calls added to the CUDA runtime by subscribing to CUPTI callbacks and asynchronous GPU activity added to the Kokkos runtime by registering for callbacks added to the OpenACC runtime by registering for callbacks Sampled values counters from HPX counters from OpenMP counters from CUPTI Custom events (meta-events) useful for triggering policies Asynchronous data collection does not rely on events, but occurs periodically. APEX exploits access to performance data from lower stack components (i.e. the runtime) or by reading from the RCR blackboard (i.e., power, energy). Other operating system and hardware health data is collected through other interfaces: /proc/stat /proc/cpuinfo /proc/meminfo /proc/net/dev /proc/self/status lm_sensors power measurements counters from NVIDIA Monitoring Library (NVML) PAPI hardware counters and components Event Listeners \u00b6 There are a number of listeners in APEX that are triggered by the events passed in through the API. For example, the Profiling Listener records events related to maintaining the performance state. 
Profiling Listener \u00b6 Start Event: records the name/address of the timer, gets a timestamp (using rdtsc), returns a profiler handle Stop Event: gets a timestamp, optionally puts the profiler object in a queue for back-end processing and returns Sample Event: put the name & value in the queue Internally to APEX, there is an asynchronous consumer thread that processes profiler objects and samples to build a performance profile (in HPX, this thread is processed/scheduled as an HPX thread/task), construct task graphs, and scatterplots of sampled task times. TAU Listener \u00b6 The TAU Listener (used for postmortem analysis) synchronously passes all measurement events to TAU to build an offline profile or trace. TAU will also capture any other events for which it is configured, including MPI, memory, file I/O, etc. Concurrency Tracking \u00b6 The concurrency listener (also used for postmortem analysis) maintains a timeline of total concurrency, periodically sampled from within APEX. Start event: push timer ID on stack Stop event: pop timer ID off stack An asynchronous consumer thread periodically logs the current timer for each thread. This thread will output a concurrency data report and gnuplot script at APEX termination. OTF2 Tracing \u00b6 The OTF2 listener will construct a full event trace and write the events out to an OTF2 archive. OTF2 files can be visualized with tools like Vampir or Traveler . Due to the constraints of OTF2 trace collection, tasks that start on one OS thread and end on another OS thread are not supported. Similarly, tasks/functions that are not perfectly nested are not supported by OTF2 tracing. For those types of tasks, we recommend the Trace Event listener. Google Trace Event Listener \u00b6 The Trace Event listener will construct a full event trace and write the events to one or more Google Trace Event trace files. The files can be visualized with the Google Chrome web browser, by navigating to the https://ui.perfetto.dev URL. 
Policy Listener \u00b6 Policies are rules that decide on outcomes based on observed state. Triggered policies are invoked by introspection API events. Periodic policies are run periodically on an asynchronous thread. Policies are registered with the Policy Engine at program startup by runtime code and/or from the application. Applications, runtimes, and the OS can register callback functions to be executed. Callback functions define the policy rules - \u201cIf x < y then...(take some action!)\u201d. Enables runtime adaptation using introspection data Engages actuators across stack layers Is also used to involve online auto-tuning support","title":"Feature Overview"},{"location":"feature/#feature_overview","text":"","title":"Feature Overview"},{"location":"feature/#apex_motivation","text":"Frequently, software components or even entire applications run into a situation where the context of the execution environment has changed in some way (or does not meet assumptions). In those situations, the software requires some mechanism for evaluating its own performance and that of the underlying runtime system, operating system and hardware. The types of adaptation that the software wants to do could include: Controlling concurrency to improve energy efficiency for performance Parametric variability adjust the decomposition granularity for this machine / dataset choose a different algorithm for better performance/accuracy choose a different preconditioner for better performance/accuracy choose a different solver for better performance/accuracy Load Balancing when to perform AGAS migration? when to perform repartitioning? when to perform data exchanges? Parallel Algorithms (for_each\u2026) - choose a different execution model separate what from how Address the \u201cSLOW(ER)\u201d performance model avoid S tarvation reduce L atency reduce O verhead reduce W aiting reduce E nergy consumption improve R esiliency APEX provides both performance awareness and performance adaptation . 
APEX provides top-down and bottom-up performance mapping and feedback. APEX exposes node-wide resource utilization data and analysis, energy consumption, and health information in real time Software can subsequently associate performance state with policy for feedback control APEX introspection OS: track system resources, utilization, job contention, overhead Runtime (e.g. HPX, OpenMP, CUDA, OpenACC, Kokkos...): track threads, queues, concurrency, remote operations, parcels, memory management Application timer / counter observation Above: APEX architecture diagram (when linked with an HPX application). The application and runtime send events to the APEX instrumentation API, which updates the performance state. The Policy Engine executes policies that change application behavior based on rule outcomes.","title":"APEX: Motivation"},{"location":"feature/#supported_parallel_models","text":"HPX - APEX is fully integrated into the HPX runtime, so that all tasks that are scheduled by the thread scheduler are measured by APEX. In addition, all HPX counters are captured by APEX. C++ threads ( std::thread , std::async ) and vanilla POSIX threads - Using a pthread_create() wrapper, APEX can capture all spawned threads and measure the time spent in those top level functions. OpenMP - Using the OpenMP 5.0 OMPT interface, APEX can capture performance data related to OpenMP pragmas. OpenACC - Using the OpenACC Profiling interface, APEX can capture performance data related to OpenACC pragmas. Kokkos - Using the Kokkos profiling interface, APEX can capture performance data related to Kokkos parallel abstractions. RAJA - Using the RAJA profiling interface, APEX can capture performance data related to RAJA parallel abstractions. Unlike Kokkos, RAJA doesn't give any details, so don't expect much. CUDA - Using the NVIDIA CUPTI and NVML libraries, APEX can capture runtime and driver API calls as well as memory transfers and kernels executed on a device, and monitor GPU utilization. 
HIP - Using the AMD Roctracer, Rocprofiler and ROCM-SMI libraries, APEX can capture runtime and driver API calls as well as memory transfers and kernels executed on a device, and monitor GPU utilization. Intel SYCL - Using the Intel Level0 libraries, APEX can capture runtime and driver API calls as well as memory transfers and kernels executed on a device, and monitor GPU utilization. PhiProf - APEX is integrated with support to intercept PhiProf profiling data. See https://github.com/fmihpc/phiprof . StarPU - APEX is integrated with support to profile StarPU. See https://starpu.gitlabpages.inria.fr . Distributed Execution over MPI - While APEX doesn't measure all MPI function calls, it is \"MPI-aware\", and can detect when used in a distributed run so that each process can write separate or aggregated performance data. APEX provides rudimentary support for measuring point-to-point and collectives.","title":"Supported Parallel Models"},{"location":"feature/#parallel_models_with_experimental_support_in_development_wish_list","text":"Argobots - APEX has been used to instrument services based on Argobots, but it is not integrated into the runtime. TBB - The APEX team is evaluating integrated TBB support. Legion - No plans at this time. Charm++ - No plans at this time. Iris - Plans are afoot. Stay tuned. YAKL - Plans are afoot. Stay tuned.","title":"Parallel Models with Experimental Support / In Development / Wish List"},{"location":"feature/#introspection","text":"APEX collects data through inspectors . The synchronous data collection uses an event API and event listeners . 
The API includes events for: Initialize, terminate, thread creation, thread exit added to the HPX thread scheduler added to the OpenMP runtime using the OMPT interface added to the pthread runtime by wrapping the pthread API calls Timer start, stop, yield, resume added to HPX task scheduler added to the OpenMP runtime using the OMPT interface added to the pthread runtime by wrapping the pthread API calls added to the CUDA runtime by subscribing to CUPTI callbacks and asynchronous GPU activity added to the Kokkos runtime by registering for callbacks added to the OpenACC runtime by registering for callbacks Sampled values counters from HPX counters from OpenMP counters from CUPTI Custom events (meta-events) useful for triggering policies Asynchronous data collection does not rely on events, but occurs periodically. APEX exploits access to performance data from lower stack components (i.e. the runtime) or by reading from the RCR blackboard (i.e., power, energy). Other operating system and hardware health data is collected through other interfaces: /proc/stat /proc/cpuinfo /proc/meminfo /proc/net/dev /proc/self/status lm_sensors power measurements counters from NVIDIA Monitoring Library (NVML) PAPI hardware counters and components","title":"Introspection"},{"location":"feature/#event_listeners","text":"There are a number of listeners in APEX that are triggered by the events passed in through the API. 
For example, the Profiling Listener records events related to maintaining the performance state.","title":"Event Listeners"},{"location":"feature/#profiling_listener","text":"Start Event: records the name/address of the timer, gets a timestamp (using rdtsc), returns a profiler handle Stop Event: gets a timestamp, optionally puts the profiler object in a queue for back-end processing and returns Sample Event: put the name & value in the queue Internally to APEX, there is an asynchronous consumer thread that processes profiler objects and samples to build a performance profile (in HPX, this thread is processed/scheduled as an HPX thread/task), construct task graphs, and scatterplots of sampled task times.","title":"Profiling Listener"},{"location":"feature/#tau_listener","text":"The TAU Listener (used for postmortem analysis) synchronously passes all measurement events to TAU to build an offline profile or trace. TAU will also capture any other events for which it is configured, including MPI, memory, file I/O, etc.","title":"TAU Listener"},{"location":"feature/#concurrency_tracking","text":"The concurrency listener (also used for postmortem analysis) maintains a timeline of total concurrency, periodically sampled from within APEX. Start event: push timer ID on stack Stop event: pop timer ID off stack An asynchronous consumer thread periodically logs the current timer for each thread. This thread will output a concurrency data report and gnuplot script at APEX termination.","title":"Concurrency Tracking"},{"location":"feature/#otf2_tracing","text":"The OTF2 listener will construct a full event trace and write the events out to an OTF2 archive. OTF2 files can be visualized with tools like Vampir or Traveler . Due to the constraints of OTF2 trace collection, tasks that start on one OS thread and end on another OS thread are not supported. Similarly, tasks/functions that are not perfectly nested are not supported by OTF2 tracing. 
For those types of tasks, we recommend the Trace Event listener.","title":"OTF2 Tracing"},{"location":"feature/#google_trace_event_listener","text":"The Trace Event listener will construct a full event trace and write the events to one or more Google Trace Event trace files. The files can be visualized with the Google Chrome web browser, by navigating to the https://ui.perfetto.dev URL.","title":"Google Trace Event Listener"},{"location":"feature/#policy_listener","text":"Policies are rules that decide on outcomes based on observed state. Triggered policies are invoked by introspection API events. Periodic policies are run periodically on an asynchronous thread. Policies are registered with the Policy Engine at program startup by runtime code and/or from the application. Applications, runtimes, and the OS can register callback functions to be executed. Callback functions define the policy rules - \u201cIf x < y then...(take some action!)\u201d. Enables runtime adaptation using introspection data Engages actuators across stack layers Is also used to involve online auto-tuning support","title":"Policy Listener"},{"location":"hpx5/","text":"Supported Runtime Systems \u00b6 HPX-5 (Indiana University) \u00b6 Note: Support for HPX-5 has stalled since the end of the XPRESS project. These instructions were valid as of ~2017. HPX-5 High Performance ParalleX is a second implementation of the ParalleX model. Developed and maintained by the CREST Group at Indiana University, HPX-5 is implemented in C. For more information, see https://hpx.crest.iu.edu . Configuring HPX-5 with APEX \u00b6 APEX is built as a pre-requisite dependency of HPX-5. So, before configuring and building HPX-5, configure and build APEX as a standalone library. 
In addition to the usual required options for CMake, we will also include the options to include Active Harmony (for policies), TAU (for performance analysis - see APEX with TAU for instructions on configuring TAU) and Binutils support, because the HPX-5 instrumentation uses function addresses to identify timers rather than strings. To include Binutils, we can choose one of: use a system-installed binutils by specifying -DUSE_BFD=TRUE use a custom build of Binutils by specifying -DUSE_BFD=TRUE -DBFD_ROOT= have APEX download and build Binutils automatically by specifying -DBUILD_BFD=TRUE . Note: HPX-5 uses JEMalloc, TBB Malloc or DLMalloc, so DO NOT configure APEX with either TCMalloc or JEMalloc. For example, assume TAU is installed in /usr/local/tau/2.25 and we will have CMake download and build Binutils and Active Harmony, and we want to install APEX to /usr/local/xpress-apex/2.3.1. To configure, build and install APEX in the main source directory (your paths may vary): cd $HOME/src wget https://github.com/khuck/xpress-apex/archive/v2.3.1.tar.gz tar -xvzf v2.3.1.tar.gz cd xpress-apex-2.3.1 mkdir build cd build cmake \\ -DBUILD_BFD=TRUE -DCMAKE_INSTALL_PREFIX=/usr/local/xpress-apex/2.3.1 -DCMAKE_BUILD_TYPE=RelWithDebInfo .. make make test # optional make doc # optional make install Keep in mind that APEX will automatically download, configure and build Active Harmony as part of the build process, unless you pass -DUSE_ACTIVEHARMONY=FALSE to the cmake command. 
After the build is complete, add the package configuration path to your PKG_CONFIG_PATH environment variable (HPX-5 uses autotools for configuration so it will find APEX using the utility pkg-config): export PKG_CONFIG_PATH=$PKG_CONFIG_PATH:/usr/local/xpress-apex/2.3.1/lib/pkgconfig To confirm the PKG_CONFIG_PATH variable is set correctly, try executing the pkg-config command: pkg-config --libs apex Which should give the following output (or something similar): -L/usr/local/xpress-apex/2.3.1/lib -L/usr/local/tau/2.25/x86_64/lib -L/usr/local/xpress-apex/2.3.1/lib -lapex -lpthread -lTAUsh-papi-pthread -lharmony -lbfd -liberty -lz -lm -Wl,-rpath,/usr/local/tau/2.25/x86_64/lib,-rpath,/usr/local/xpress-apex/2.3.1/lib -lstdc++ Once APEX is installed, you can configure and build HPX-5 with APEX. To include APEX in the HPX-5 configuration, include the --with-apex=yes option when calling configure. Assuming you have downloaded HPX-5 v.3.0, you would do the following: # go to the HPX source directory cd HPX_Release_v3.0.0/hpx # If you haven't already set the pkgconfig path, do so now... export PKG_CONFIG_PATH=$PKG_CONFIG_PATH:/usr/local/xpress-apex/2.3.1/lib/pkgconfig # configure ./bootstrap ./configure --enable-testsuite --prefix=/home/khuck/src/hpx-iu/hpx-install --with-apex=yes # build! make -j8 # install! 
make install To confirm that HPX-5 was configured and built with APEX correctly, run the simple APEX example: export APEX_SCREEN_OUTPUT=1 ./tests/unit/apex Which should give output similar to this: v0.1-5e4ac87-master Built on: 13:23:34 Dec 17 2015 C++ Language Standard version : 201402 GCC Compiler version : 5.2.1 20151010 APEX_TAU : 0 APEX_POLICY : 1 APEX_MEASURE_CONCURRENCY : 0 APEX_MEASURE_CONCURRENCY_PERIOD : 1000000 APEX_SCREEN_OUTPUT : 1 APEX_PROFILE_OUTPUT : 0 APEX_CSV_OUTPUT : 0 APEX_TASKGRAPH_OUTPUT : 0 APEX_PROC_CPUINFO : 0 APEX_PROC_MEMINFO : 0 APEX_PROC_NET_DEV : 0 APEX_PROC_SELF_STATUS : 0 APEX_PROC_STAT : 1 APEX_THROTTLE_CONCURRENCY : 0 APEX_THROTTLING_MAX_THREADS : 8 APEX_THROTTLING_MIN_THREADS : 1 APEX_THROTTLE_ENERGY : 0 APEX_THROTTLING_MAX_WATTS : 300 APEX_THROTTLING_MIN_WATTS : 150 APEX_PTHREAD_WRAPPER_STACK_SIZE : 0 APEX_PAPI_METRICS : Missing fib number. Using 10. fib(10)=55 seconds: 0.0005629 localities: 1 threads/locality: 8 Info: 34 items remaining on on the profiler_listener queue...done. CPU is 2.66036e+09 Hz. Elapsed time: 0.0364015 Cores detected: 8 Worker Threads observed: 8 Available CPU time: 0.291212 Action : #calls | minimum | mean | maximum | total | stddev | % total ------------------------------------------------------------------------------------------------------------ _fib_main_action [{/home/kh... : 1 --n/a-- 4.52e-04 --n/a-- 4.52e-04 --n/a-- 0.155 _fib_action [{/home/khuck/s... : 177 --n/a-- 4.39e-06 --n/a-- 7.77e-04 --n/a-- 0.267 _locality_stop_handler [{/h... 
: 1 --n/a-- 1.21e-05 --n/a-- 1.21e-05 --n/a-- 0.004 failed steals : 1 0.00e+00 0.00e+00 0.00e+00 0.00e+00 0.00e+00 --n/a-- mail : 1 0.00e+00 0.00e+00 0.00e+00 0.00e+00 0.00e+00 --n/a-- spawns : 1 0.00e+00 0.00e+00 0.00e+00 0.00e+00 0.00e+00 --n/a-- stacks : 1 0.00e+00 0.00e+00 0.00e+00 0.00e+00 0.00e+00 --n/a-- steals : 1 0.00e+00 0.00e+00 0.00e+00 0.00e+00 0.00e+00 --n/a-- yields : 1 0.00e+00 0.00e+00 0.00e+00 0.00e+00 0.00e+00 --n/a-- APEX Idle : --n/a-- --n/a-- --n/a-- --n/a-- 2.90e-01 --n/a-- 99.574 ------------------------------------------------------------------------------------------------------------ Building HPX-5 applications with APEX \u00b6 APEX will automatically be included in the link when HPX-5 applications are built. To build an example, go to the hpx-apps directory and build the LULESH parcels example: cd hpx-apps/lulesh/parcels # assuming HPX-5 is installed in /usr/local/hpx/3.0, set the pkgconfig path export PKG_CONFIG_PATH=$PKG_CONFIG_PATH:/usr/local/hpx/3.0/lib/pkgconfig # configure ./bootstrap ./configure # make! 
make Then, to run the LULESH example: export APEX_SCREEN_OUTPUT=1 ./luleshparcels -n 8 -x 24 -i 100 --hpx-threads=8 Should give the following output (or similar): v0.1-907c977-master Built on: 09:50:08 Dec 23 2015 C++ Language Standard version : 201402 GCC Compiler version : 5.2.1 20151010 APEX_TAU : 0 APEX_POLICY : 1 APEX_MEASURE_CONCURRENCY : 0 APEX_MEASURE_CONCURRENCY_PERIOD : 1000000 APEX_SCREEN_OUTPUT : 1 APEX_PROFILE_OUTPUT : 0 APEX_CSV_OUTPUT : 0 APEX_TASKGRAPH_OUTPUT : 0 APEX_PROC_CPUINFO : 0 APEX_PROC_MEMINFO : 0 APEX_PROC_NET_DEV : 0 APEX_PROC_SELF_STATUS : 0 APEX_PROC_STAT : 1 APEX_THROTTLE_CONCURRENCY : 0 APEX_THROTTLING_MAX_THREADS : 8 APEX_THROTTLING_MIN_THREADS : 1 APEX_THROTTLE_ENERGY : 0 APEX_THROTTLING_MAX_WATTS : 300 APEX_THROTTLING_MIN_WATTS : 150 APEX_PTHREAD_WRAPPER_STACK_SIZE : 0 APEX_PAPI_METRICS : Number of domains: 8 nx: 24 maxcycles: 100 core-major ordering: 1 START_LOG PROGNAME: lulesh-parcels Elapsed time = 1.255209e+01 Run completed: Problem size = 24 Iteration count = 100 Final Origin Energy = 4.739209e+06 Testing plane 0 of energy array: MaxAbsDiff = 9.313226e-10 TotalAbsDiff = 2.841568e-09 MaxRelDiff = 2.946213e-12 END_LOG time_in_SBN3 = 4.570989e-01 time_in_PosVel = 2.182410e-01 time_in_MonoQ = 4.889381e+00 Elapsed: 12599.4 CPU is 2.66028e+09 Hz. Elapsed time: 12.6192 Cores detected: 8 Worker Threads observed: 8 Available CPU time: 100.953 Action : #calls | minimum | mean | maximum | total | stddev | % total ------------------------------------------------------------------------------------------------------------ _advanceDomain_action [{/ho... : 8 --n/a-- 1.17e+01 --n/a-- 9.34e+01 --n/a-- 92.506 _initDomain_action [{/home/... : 8 --n/a-- 2.04e-02 --n/a-- 1.63e-01 --n/a-- 0.162 _finiDomain_action [{/home/... : 8 --n/a-- 2.81e-03 --n/a-- 2.25e-02 --n/a-- 0.022 _main_action [{/home/khuck/... : 1 --n/a-- 4.73e-03 --n/a-- 4.73e-03 --n/a-- 0.005 _SBN1_result_action [{/home... 
: 56 --n/a-- 1.42e-03 --n/a-- 7.93e-02 --n/a-- 0.079 _SBN1_sends_action [{/home/... : 56 --n/a-- 1.87e-04 --n/a-- 1.05e-02 --n/a-- 0.010 _SBN3_result_action [{/home... : 5600 --n/a-- 1.33e-04 --n/a-- 7.45e-01 --n/a-- 0.738 _SBN3_sends_action [{/home/... : 5600 --n/a-- 9.05e-05 --n/a-- 5.07e-01 --n/a-- 0.502 _PosVel_result_action [{/ho... : 2800 --n/a-- 1.61e-04 --n/a-- 4.50e-01 --n/a-- 0.445 _PosVel_sends_action [{/hom... : 2800 --n/a-- 1.43e-04 --n/a-- 4.00e-01 --n/a-- 0.396 _MonoQ_result_action [{/hom... : 2400 --n/a-- 1.03e-04 --n/a-- 2.47e-01 --n/a-- 0.245 _MonoQ_sends_action [{/home... : 2400 --n/a-- 1.79e-04 --n/a-- 4.29e-01 --n/a-- 0.425 _locality_stop_handler [{/h... : 1 --n/a-- 2.45e-04 --n/a-- 2.45e-04 --n/a-- 0.000 _allreduce_init_handler [{/... : 2 --n/a-- 5.49e-04 --n/a-- 1.10e-03 --n/a-- 0.001 _allreduce_fini_handler [{/... : 2 --n/a-- 2.44e-04 --n/a-- 4.89e-04 --n/a-- 0.000 _allreduce_add_handler [{/h... : 9 --n/a-- 6.74e-05 --n/a-- 6.07e-04 --n/a-- 0.001 _allreduce_remove_handler [... : 9 --n/a-- 4.31e-05 --n/a-- 3.88e-04 --n/a-- 0.000 _allreduce_join_handler [{/... : 99 --n/a-- 4.90e-05 --n/a-- 4.86e-03 --n/a-- 0.005 _allreduce_bcast_handler [{... 
: 99 --n/a-- 2.75e-05 --n/a-- 2.72e-03 --n/a-- 0.003 CPU Guest % : 12 0.000 0.000 0.000 0.000 0.000 --n/a-- CPU I/O Wait % : 12 0.000 0.000 0.000 0.000 0.000 --n/a-- CPU IRQ % : 12 0.000 0.000 0.000 0.000 0.000 --n/a-- CPU Idle % : 12 0.000 0.789 8.429 9.464 2.305 --n/a-- CPU Nice % : 12 0.000 0.000 0.000 0.000 0.000 --n/a-- CPU Steal % : 12 0.000 0.000 0.000 0.000 0.000 --n/a-- CPU System % : 12 21.000 22.387 24.286 268.643 0.941 --n/a-- CPU User % : 12 77.500 80.426 89.714 965.107 4.315 --n/a-- CPU soft IRQ % : 12 0.000 0.010 0.125 0.125 0.035 --n/a-- failed steals : 1 0.00e+00 0.00e+00 0.00e+00 0.00e+00 0.00e+00 --n/a-- mail : 1 0.00e+00 0.00e+00 0.00e+00 0.00e+00 0.00e+00 --n/a-- spawns : 1 0.00e+00 0.00e+00 0.00e+00 0.00e+00 0.00e+00 --n/a-- stacks : 1 0.00e+00 0.00e+00 0.00e+00 0.00e+00 0.00e+00 --n/a-- steals : 1 0.00e+00 0.00e+00 0.00e+00 0.00e+00 0.00e+00 --n/a-- yields : 1 0.00e+00 0.00e+00 0.00e+00 0.00e+00 0.00e+00 --n/a-- APEX Idle : --n/a-- --n/a-- --n/a-- --n/a-- 4.50e+00 --n/a-- 4.455 ------------------------------------------------------------------------------------------------------------ To enable TAU profiling, set the APEX_TAU environment variable to 1. We will also set some other TAU environment variables and re-run the program: export APEX_TAU=1 export TAU_PROFILE_FORMAT=merged export TAU_SAMPLING=1 ./luleshparcels -n 8 -x 24 -i 100 --hpx-threads=8 The \"merged\" profile setting will create a single file (tauprofile.xml) for the whole application, rather than a profile.\\* file for each thread. The sampling flag will enable periodic interruption of the application to get a more detailed profile. After execution, there is a TAU profile file called \"tauprofile.xml\". To view the results of the profiling, run the ParaProf application on the profile (assuming the TAU utilities are in your path): paraprof tauprofile.xml Which should result in a profile like the following: Above: ParaProf main profiler window showing all threads of execution. 
Above: ParaProf main profiler window showing one thread of execution. Above: ParaProf main profiler window showing one thread of execution, in a callgraph view. For more information on using TAU with APEX, see APEX with TAU .","title":"Supported Runtime Systems"},{"location":"hpx5/#supported_runtime_systems","text":"","title":"Supported Runtime Systems"},{"location":"hpx5/#hpx-5_indiana_university","text":"Note: Support for HPX-5 has stalled since the end of the XPRESS project. These instructions were valid as of ~2017. HPX-5 (High Performance ParalleX) is a second implementation of the ParalleX model. Developed and maintained by the CREST Group at Indiana University, HPX-5 is implemented in C. For more information, see https://hpx.crest.iu.edu .","title":"HPX-5 (Indiana University)"},{"location":"hpx5/#configuring_hpx-5_with_apex","text":"APEX is a prerequisite of HPX-5, so before configuring and building HPX-5, configure and build APEX as a standalone library. In addition to the usual required options for CMake, we will also add the options for Active Harmony (for policies), TAU (for performance analysis - see APEX with TAU for instructions on configuring TAU) and Binutils support, because the HPX-5 instrumentation uses function addresses to identify timers rather than strings. To include Binutils, we can choose one of: use a system-installed binutils by specifying -DUSE_BFD=TRUE use a custom build of Binutils by specifying -DUSE_BFD=TRUE -DBFD_ROOT= have APEX download and build Binutils automatically by specifying -DBUILD_BFD=TRUE . Note: HPX-5 uses JEMalloc, TBB Malloc or DLMalloc, so DO NOT configure APEX with either TCMalloc or JEMalloc. For example, assume TAU is installed in /usr/local/tau/2.25 and we will have CMake download and build Binutils and Active Harmony, and we want to install APEX to /usr/local/xpress-apex/2.3.1. 
To configure, build and install APEX in the main source directory (your paths may vary): cd $HOME/src wget https://github.com/khuck/xpress-apex/archive/v2.3.1.tar.gz tar -xvzf v2.3.1.tar.gz cd xpress-apex-2.3.1 mkdir build cd build cmake \\ -DBUILD_BFD=TRUE -DCMAKE_INSTALL_PREFIX=/usr/local/xpress-apex/2.3.1 -DCMAKE_BUILD_TYPE=RelWithDebInfo .. make make test # optional make doc # optional make install Keep in mind that APEX will automatically download, configure and build Active Harmony as part of the build process, unless you pass -DUSE_ACTIVEHARMONY=FALSE to the cmake command. After the build is complete, add the package configuration path to your PKG_CONFIG_PATH environment variable (HPX-5 uses autotools for configuration so it will find APEX using the utility pkg-config): export PKG_CONFIG_PATH=$PKG_CONFIG_PATH:/usr/local/xpress-apex/2.3.1/lib/pkgconfig To confirm the PKG_CONFIG_PATH variable is set correctly, try executing the pkg-config command: pkg-config --libs apex Which should give the following output (or something similar): -L/usr/local/xpress-apex/2.3.1/lib -L/usr/local/tau/2.25/x86_64/lib -L/usr/local/xpress-apex/2.3.1/lib -lapex -lpthread -lTAUsh-papi-pthread -lharmony -lbfd -liberty -lz -lm -Wl,-rpath,/usr/local/tau/2.25/x86_64/lib,-rpath,/usr/local/xpress-apex/2.3.1/lib -lstdc++ Once APEX is installed, you can configure and build HPX-5 with APEX. To include APEX in the HPX-5 configuration, include the --with-apex=yes option when calling configure. Assuming you have downloaded HPX-5 v.3.0, you would do the following: # go to the HPX source directory cd HPX_Release_v3.0.0/hpx # If you haven't already set the pkgconfig path, do so now... export PKG_CONFIG_PATH=$PKG_CONFIG_PATH:/usr/local/xpress-apex/2.3.1/lib/pkgconfig # configure ./bootstrap ./configure --enable-testsuite --prefix=/home/khuck/src/hpx-iu/hpx-install --with-apex=yes # build! make -j8 # install! 
make install To confirm that HPX-5 was configured and built with APEX correctly, run the simple APEX example: export APEX_SCREEN_OUTPUT=1 ./tests/unit/apex Which should give output similar to this: v0.1-5e4ac87-master Built on: 13:23:34 Dec 17 2015 C++ Language Standard version : 201402 GCC Compiler version : 5.2.1 20151010 APEX_TAU : 0 APEX_POLICY : 1 APEX_MEASURE_CONCURRENCY : 0 APEX_MEASURE_CONCURRENCY_PERIOD : 1000000 APEX_SCREEN_OUTPUT : 1 APEX_PROFILE_OUTPUT : 0 APEX_CSV_OUTPUT : 0 APEX_TASKGRAPH_OUTPUT : 0 APEX_PROC_CPUINFO : 0 APEX_PROC_MEMINFO : 0 APEX_PROC_NET_DEV : 0 APEX_PROC_SELF_STATUS : 0 APEX_PROC_STAT : 1 APEX_THROTTLE_CONCURRENCY : 0 APEX_THROTTLING_MAX_THREADS : 8 APEX_THROTTLING_MIN_THREADS : 1 APEX_THROTTLE_ENERGY : 0 APEX_THROTTLING_MAX_WATTS : 300 APEX_THROTTLING_MIN_WATTS : 150 APEX_PTHREAD_WRAPPER_STACK_SIZE : 0 APEX_PAPI_METRICS : Missing fib number. Using 10. fib(10)=55 seconds: 0.0005629 localities: 1 threads/locality: 8 Info: 34 items remaining on on the profiler_listener queue...done. CPU is 2.66036e+09 Hz. Elapsed time: 0.0364015 Cores detected: 8 Worker Threads observed: 8 Available CPU time: 0.291212 Action : #calls | minimum | mean | maximum | total | stddev | % total ------------------------------------------------------------------------------------------------------------ _fib_main_action [{/home/kh... : 1 --n/a-- 4.52e-04 --n/a-- 4.52e-04 --n/a-- 0.155 _fib_action [{/home/khuck/s... : 177 --n/a-- 4.39e-06 --n/a-- 7.77e-04 --n/a-- 0.267 _locality_stop_handler [{/h... 
: 1 --n/a-- 1.21e-05 --n/a-- 1.21e-05 --n/a-- 0.004 failed steals : 1 0.00e+00 0.00e+00 0.00e+00 0.00e+00 0.00e+00 --n/a-- mail : 1 0.00e+00 0.00e+00 0.00e+00 0.00e+00 0.00e+00 --n/a-- spawns : 1 0.00e+00 0.00e+00 0.00e+00 0.00e+00 0.00e+00 --n/a-- stacks : 1 0.00e+00 0.00e+00 0.00e+00 0.00e+00 0.00e+00 --n/a-- steals : 1 0.00e+00 0.00e+00 0.00e+00 0.00e+00 0.00e+00 --n/a-- yields : 1 0.00e+00 0.00e+00 0.00e+00 0.00e+00 0.00e+00 --n/a-- APEX Idle : --n/a-- --n/a-- --n/a-- --n/a-- 2.90e-01 --n/a-- 99.574 ------------------------------------------------------------------------------------------------------------","title":"Configuring HPX-5 with APEX"},{"location":"hpx5/#building_hpx-5_applications_with_apex","text":"APEX will automatically be included in the link when HPX-5 applications are built. To build an example, go to the hpx-apps directory and build the LULESH parcels example: cd hpx-apps/lulesh/parcels # assuming HPX-5 is installed in /usr/local/hpx/3.0, set the pkgconfig path export PKG_CONFIG_PATH=$PKG_CONFIG_PATH:/usr/local/hpx/3.0/lib/pkgconfig # configure ./bootstrap ./configure # make! 
make Then, to run the LULESH example: export APEX_SCREEN_OUTPUT=1 ./luleshparcels -n 8 -x 24 -i 100 --hpx-threads=8 Should give the following output (or similar): v0.1-907c977-master Built on: 09:50:08 Dec 23 2015 C++ Language Standard version : 201402 GCC Compiler version : 5.2.1 20151010 APEX_TAU : 0 APEX_POLICY : 1 APEX_MEASURE_CONCURRENCY : 0 APEX_MEASURE_CONCURRENCY_PERIOD : 1000000 APEX_SCREEN_OUTPUT : 1 APEX_PROFILE_OUTPUT : 0 APEX_CSV_OUTPUT : 0 APEX_TASKGRAPH_OUTPUT : 0 APEX_PROC_CPUINFO : 0 APEX_PROC_MEMINFO : 0 APEX_PROC_NET_DEV : 0 APEX_PROC_SELF_STATUS : 0 APEX_PROC_STAT : 1 APEX_THROTTLE_CONCURRENCY : 0 APEX_THROTTLING_MAX_THREADS : 8 APEX_THROTTLING_MIN_THREADS : 1 APEX_THROTTLE_ENERGY : 0 APEX_THROTTLING_MAX_WATTS : 300 APEX_THROTTLING_MIN_WATTS : 150 APEX_PTHREAD_WRAPPER_STACK_SIZE : 0 APEX_PAPI_METRICS : Number of domains: 8 nx: 24 maxcycles: 100 core-major ordering: 1 START_LOG PROGNAME: lulesh-parcels Elapsed time = 1.255209e+01 Run completed: Problem size = 24 Iteration count = 100 Final Origin Energy = 4.739209e+06 Testing plane 0 of energy array: MaxAbsDiff = 9.313226e-10 TotalAbsDiff = 2.841568e-09 MaxRelDiff = 2.946213e-12 END_LOG time_in_SBN3 = 4.570989e-01 time_in_PosVel = 2.182410e-01 time_in_MonoQ = 4.889381e+00 Elapsed: 12599.4 CPU is 2.66028e+09 Hz. Elapsed time: 12.6192 Cores detected: 8 Worker Threads observed: 8 Available CPU time: 100.953 Action : #calls | minimum | mean | maximum | total | stddev | % total ------------------------------------------------------------------------------------------------------------ _advanceDomain_action [{/ho... : 8 --n/a-- 1.17e+01 --n/a-- 9.34e+01 --n/a-- 92.506 _initDomain_action [{/home/... : 8 --n/a-- 2.04e-02 --n/a-- 1.63e-01 --n/a-- 0.162 _finiDomain_action [{/home/... : 8 --n/a-- 2.81e-03 --n/a-- 2.25e-02 --n/a-- 0.022 _main_action [{/home/khuck/... : 1 --n/a-- 4.73e-03 --n/a-- 4.73e-03 --n/a-- 0.005 _SBN1_result_action [{/home... 
: 56 --n/a-- 1.42e-03 --n/a-- 7.93e-02 --n/a-- 0.079 _SBN1_sends_action [{/home/... : 56 --n/a-- 1.87e-04 --n/a-- 1.05e-02 --n/a-- 0.010 _SBN3_result_action [{/home... : 5600 --n/a-- 1.33e-04 --n/a-- 7.45e-01 --n/a-- 0.738 _SBN3_sends_action [{/home/... : 5600 --n/a-- 9.05e-05 --n/a-- 5.07e-01 --n/a-- 0.502 _PosVel_result_action [{/ho... : 2800 --n/a-- 1.61e-04 --n/a-- 4.50e-01 --n/a-- 0.445 _PosVel_sends_action [{/hom... : 2800 --n/a-- 1.43e-04 --n/a-- 4.00e-01 --n/a-- 0.396 _MonoQ_result_action [{/hom... : 2400 --n/a-- 1.03e-04 --n/a-- 2.47e-01 --n/a-- 0.245 _MonoQ_sends_action [{/home... : 2400 --n/a-- 1.79e-04 --n/a-- 4.29e-01 --n/a-- 0.425 _locality_stop_handler [{/h... : 1 --n/a-- 2.45e-04 --n/a-- 2.45e-04 --n/a-- 0.000 _allreduce_init_handler [{/... : 2 --n/a-- 5.49e-04 --n/a-- 1.10e-03 --n/a-- 0.001 _allreduce_fini_handler [{/... : 2 --n/a-- 2.44e-04 --n/a-- 4.89e-04 --n/a-- 0.000 _allreduce_add_handler [{/h... : 9 --n/a-- 6.74e-05 --n/a-- 6.07e-04 --n/a-- 0.001 _allreduce_remove_handler [... : 9 --n/a-- 4.31e-05 --n/a-- 3.88e-04 --n/a-- 0.000 _allreduce_join_handler [{/... : 99 --n/a-- 4.90e-05 --n/a-- 4.86e-03 --n/a-- 0.005 _allreduce_bcast_handler [{... 
: 99 --n/a-- 2.75e-05 --n/a-- 2.72e-03 --n/a-- 0.003 CPU Guest % : 12 0.000 0.000 0.000 0.000 0.000 --n/a-- CPU I/O Wait % : 12 0.000 0.000 0.000 0.000 0.000 --n/a-- CPU IRQ % : 12 0.000 0.000 0.000 0.000 0.000 --n/a-- CPU Idle % : 12 0.000 0.789 8.429 9.464 2.305 --n/a-- CPU Nice % : 12 0.000 0.000 0.000 0.000 0.000 --n/a-- CPU Steal % : 12 0.000 0.000 0.000 0.000 0.000 --n/a-- CPU System % : 12 21.000 22.387 24.286 268.643 0.941 --n/a-- CPU User % : 12 77.500 80.426 89.714 965.107 4.315 --n/a-- CPU soft IRQ % : 12 0.000 0.010 0.125 0.125 0.035 --n/a-- failed steals : 1 0.00e+00 0.00e+00 0.00e+00 0.00e+00 0.00e+00 --n/a-- mail : 1 0.00e+00 0.00e+00 0.00e+00 0.00e+00 0.00e+00 --n/a-- spawns : 1 0.00e+00 0.00e+00 0.00e+00 0.00e+00 0.00e+00 --n/a-- stacks : 1 0.00e+00 0.00e+00 0.00e+00 0.00e+00 0.00e+00 --n/a-- steals : 1 0.00e+00 0.00e+00 0.00e+00 0.00e+00 0.00e+00 --n/a-- yields : 1 0.00e+00 0.00e+00 0.00e+00 0.00e+00 0.00e+00 --n/a-- APEX Idle : --n/a-- --n/a-- --n/a-- --n/a-- 4.50e+00 --n/a-- 4.455 ------------------------------------------------------------------------------------------------------------ To enable TAU profiling, set the APEX_TAU environment variable to 1. We will also set some other TAU environment variables and re-run the program: export APEX_TAU=1 export TAU_PROFILE_FORMAT=merged export TAU_SAMPLING=1 ./luleshparcels -n 8 -x 24 -i 100 --hpx-threads=8 The \"merged\" profile setting will create a single file (tauprofile.xml) for the whole application, rather than a profile.\\* file for each thread. The sampling flag will enable periodic interruption of the application to get a more detailed profile. After execution, there is a TAU profile file called \"tauprofile.xml\". To view the results of the profiling, run the ParaProf application on the profile (assuming the TAU utilities are in your path): paraprof tauprofile.xml Which should result in a profile like the following: Above: ParaProf main profiler window showing all threads of execution. 
Above: ParaProf main profiler window showing one thread of execution. Above: ParaProf main profiler window showing one thread of execution, in a callgraph view. For more information on using TAU with APEX, see APEX with TAU .","title":"Building HPX-5 applications with APEX"},{"location":"install/","text":"Installing APEX \u00b6 Installation with HPX \u00b6 APEX is integrated into the HPX runtime , and is integrated into the HPX build system. To enable APEX measurement with HPX, enable the following CMake flags: -DHPX_WITH_APEX=TRUE The -DHPX_WITH_APEX_TAG=develop can be used to indicate a specific release version of APEX, or to use a specific GitHub branch of APEX. We recommend using the default configured version that comes with HPX (currently v2.6.4 ) or the develop branch. Additional CMake flags include: -DAPEX_WITH_LM_SENSORS=TRUE to enable LM sensors support (assumed to be installed in default system paths) -DAPEX_WITH_PAPI=TRUE and -DPAPI_ROOT=... to enable PAPI support -DAPEX_WITH_BFD=TRUE and -DBFD_ROOT=... or -DAPEX_BUILD_BFD=TRUE to enable Binutils support for converting function/lambda/instruction pointers to human-readable code regions. For demangling of C++ symbols, demangle.h needs to be installed with the binutils headers (not typical in system installations). -DAPEX_WITH_MSR=TRUE to enable libmsr support for RAPL power measurement (typically not needed, as RAPL support is natively handled where available) -DAPEX_WITH_OTF2=TRUE and -DOTF2_ROOT=... to enable OTF2 tracing support -DHPX_WITH_HPXMP=TRUE to enable HPX OpenMP support and OMPT measurement support from APEX -DAPEX_WITH_ACTIVEHARMONY=TRUE and -DACTIVEHARMONY_ROOT=... to enable Active Harmony support -DAPEX_WITH_CUDA=TRUE to enable CUPTI and/or NVML support. Examples require a working nvcc compiler in your path. Standalone Installation \u00b6 APEX is open source, and available on Github at http://github.com/UO-OACISS/apex . 
For stability, most users will want to download the most recent release of APEX (for example, v2.6.4): wget https://github.com/UO-OACISS/apex/archive/refs/tags/v2.6.4.tar.gz tar -xvzf v2.6.4.tar.gz cd apex-2.6.4 Other users may want to work with the most recent code available, in which case you can clone the git repo: git clone https://github.com/UO-OACISS/apex.git cd apex Configuring and building APEX with Spack \u00b6 APEX can be installed with the Spack package management tool . See spack info apex for details. You should see something like this: CMakePackage: apex Description: Autonomic Performance Environment for eXascale (APEX). Homepage: https://uo-oaciss.github.io/apex Preferred version: 2.6.3 https://github.com/UO-OACISS/apex/archive/v2.6.3.tar.gz Safe versions: develop [git] https://github.com/UO-OACISS/apex on branch develop master [git] https://github.com/UO-OACISS/apex on branch master 2.6.3 https://github.com/UO-OACISS/apex/archive/v2.6.3.tar.gz 2.6.2 https://github.com/UO-OACISS/apex/archive/v2.6.2.tar.gz 2.6.1 https://github.com/UO-OACISS/apex/archive/v2.6.1.tar.gz 2.6.0 https://github.com/UO-OACISS/apex/archive/v2.6.0.tar.gz Deprecated versions: 2.5.1 https://github.com/UO-OACISS/apex/archive/v2.5.1.tar.gz 2.5.0 https://github.com/UO-OACISS/apex/archive/v2.5.0.tar.gz 2.4.1 https://github.com/UO-OACISS/apex/archive/v2.4.1.tar.gz 2.4.0 https://github.com/UO-OACISS/apex/archive/v2.4.0.tar.gz 2.3.2 https://github.com/UO-OACISS/apex/archive/v2.3.2.tar.gz 2.3.1 https://github.com/UO-OACISS/apex/archive/v2.3.1.tar.gz 2.3.0 https://github.com/UO-OACISS/apex/archive/v2.3.0.tar.gz 2.2.0 https://github.com/UO-OACISS/apex/archive/v2.2.0.tar.gz Variants: activeharmony [true] false, true Enables Active Harmony support binutils [false] false, true Enables Binutils support boost [false] false, true Enables Boost support build_system [cmake] cmake Build systems supported by the package cuda [false] false, true Enables CUDA support examples [false] false, true Build 
Examples gperftools [false] false, true Enables Google PerfTools TCMalloc support hip [false] false, true Enables ROCm/HIP support jemalloc [false] false, true Enables JEMalloc support lmsensors [false] false, true Enables LM-Sensors support mpi [false] false, true Enables MPI support openmp [false] false, true Enables OpenMP support otf2 [true] false, true Enables OTF2 support papi [false] false, true Enables PAPI support plugins [true] false, true Enables Policy Plugin support sycl [false] false, true Enables Intel SYCL support (Level0) tests [false] false, true Build Unit Tests when build_system=cmake build_type [Release] Debug, MinSizeRel, RelWithDebInfo, Release CMake build type generator [make] none the build system generator to use when build_system=cmake ^cmake@3.9: ipo [false] false, true CMake interprocedural optimization Build Dependencies: activeharmony boost cuda gmake hip lm-sensors ninja papi roctracer-dev zlib-api binutils cmake gettext gperftools jemalloc mpi otf2 rocm-smi-lib sycl Link Dependencies: activeharmony binutils boost cuda gettext gperftools hip jemalloc lm-sensors mpi otf2 papi rocm-smi-lib roctracer-dev sycl zlib-api Run Dependencies: None Licenses: None Configuring and building APEX with CMake \u00b6 APEX is built with CMake. The minimum CMake settings needed for APEX are: -DCMAKE_INSTALL_PREFIX=... some path to an installation location -DCMAKE_BUILD_TYPE=... one of Release, Debug, or RelWithDebInfo (Release recommended) The process for building APEX is: 1) Get the code (see above) 2) Enter the repo directory: cd apex-2.6.4 3) Configure using CMake: cmake -B build -DCMAKE_INSTALL_PREFIX= -DCMAKE_BUILD_TYPE=RelWithDebInfo . 4) Build with CMake: cmake --build build # Run tests, if desired ctest --test-dir build # Build documentation, if desired cd build ; make doc ; cd .. 
# Install, if desired cmake --install build Other CMake settings, depending on your needs/wants \u00b6 Note 1: The recommended packages include: Active Harmony - for autotuning policies (optional, no longer recommended) OMPT - if OpenMP support is required (see the OpenMP use case for an example) and your compiler supports OpenMP-Tools. Note: GCC does not support OpenMP-Tools, and has no plans to as of January 2024. Compilers known to support OMPT include Clang/LLVM, Intel, NVIDIA, AMD Clang. Binutils/BFD - if your runtime/application uses instruction addresses to identify timers, e.g. OpenMP, CUDA, HIP, OneAPI, OpenACC, etc. PAPI - if you want hardware counter support (see the PAPI use case for an example) JEMalloc/TCMalloc - if your application is not already using a heap manager - see Note 2, below CUDA - if your application uses CUDA, APEX will use CUPTI/NVML to measure GPU activity ROCM - if your application uses HIP/ROCm, APEX will use Rocprofiler/Roctracer/ROC-SMI to measure GPU activity OneAPI - if your application uses Intel SYCL, APEX will use OneAPI/LevelZero to measure GPU activity Note 2: TCMalloc or JEMalloc will potentially speed up memory allocations significantly in APEX (and in your application). HOWEVER, if your application already uses TCMalloc, JEMalloc or TBBMalloc, DO NOT configure APEX with TCMalloc or JEMalloc. They will be included at application link time, and may conflict with the version detected by and linked into APEX. If you get a tcmalloc crash or error at startup, preload the dependent tcmalloc shared object library with '--apex:preload /path/to/libtcmalloc.so'. There are several utility libraries that provide additional functionality in APEX. Not all libraries are required, but some are recommended. For the following options, the default values are in italics . -DAPEX_BUILD_EXAMPLES= TRUE or FALSE . Whether or not to build the application examples in APEX. -DAPEX_BUILD_TESTS= TRUE or FALSE . 
Whether or not to build the APEX unit tests. -DAPEX_WITH_ACTIVEHARMONY= TRUE or FALSE . Active Harmony is a library that intelligently searches for parametric combinations to support adapting to heterogeneous and changing environments. For more information, see http://www.dyninst.org/harmony . APEX uses Active Harmony for runtime adaptation. -DACTIVEHARMONY_ROOT= the path to Active Harmony, or set the ACTIVEHARMONY_ROOT environment variable before running cmake. It should be noted that if Active Harmony is not specified and -DAPEX_WITH_ACTIVEHARMONY is TRUE or not set, APEX will download and build Active Harmony as a CMake project. To disable Active Harmony entirely, specify -DAPEX_WITH_ACTIVEHARMONY=FALSE. -DAPEX_BUILD_ACTIVEHARMONY= TRUE or FALSE . Whether or not Active Harmony is installed on the system, this option forces CMake to automatically download and build Active Harmony as part of the APEX project. -DAPEX_WITH_BFD= TRUE or FALSE . APEX uses libbfd (Binutils) to convert instruction addresses to source code locations. BFD support is useful for generating human-readable output for summaries and concurrency graphs. Libbfd is not required for runtime adaptation. For more information, see https://www.gnu.org/software/binutils/ . -DBFD_ROOT= path to Binutils, or set the BFD_ROOT environment variable. -DAPEX_BUILD_BFD= TRUE or FALSE . Whether or not binutils is found by CMake, this option forces CMake to automatically download and build binutils as part of the APEX project. -DAPEX_WITH_CUDA= TRUE or FALSE . APEX uses CUPTI to measure CUDA kernels and API calls, and/or NVML support to monitor the GPU activity passively. -DCUDAToolkit_ROOT= the path to the CUDA installation, if necessary. -DAPEX_WITH_HIP= TRUE or FALSE . APEX uses Rocprofiler and Roctracer to measure HIP kernels and API calls, and/or ROCM-SMI support to monitor the GPU activity passively. -DROCM_ROOT= the path to the ROCm installation, if necessary. -DAPEX_WITH_KOKKOS= TRUE or FALSE. 
-DKokkos_ROOT= the path to the Kokkos installation, if necessary. APEX will grab Kokkos as a submodule if not found; only the headers are needed. -DAPEX_WITH_JEMALLOC= TRUE or FALSE . JEMalloc is a heap management library. For more information, see http://www.canonware.com/jemalloc/ . JEMalloc provides faster memory performance in multithreaded environments. -DJEMALLOC_ROOT= path to JEMalloc, or set the JEMALLOC_ROOT environment variable before running cmake. -DAPEX_WITH_LEVEL0= TRUE or FALSE . APEX uses Level0 to measure Intel SYCL kernels and API calls and to monitor the GPU activity passively. -DAPEX_WITH_LM_SENSORS= TRUE or FALSE . Lm_sensors (Linux Monitoring Sensors) is a library for monitoring hardware temperatures and fan speeds. For more information, see https://en.wikipedia.org/wiki/Lm_sensors . APEX uses lm_sensors to monitor hardware, where available. -DAPEX_WITH_MPI= TRUE or FALSE . Whether to build MPI global support and related examples. -DAPEX_WITH_OMPT= TRUE or FALSE . OMP-Tools is the 5.0+ standard for OpenMP runtimes to provide callback hooks to performance tools. For more information, see the OpenMP specification v5.0 or newer. APEX has support for most OMPT OpenMP trace events. See the OpenMP use case for an example. Some compilers (Clang 10+, Intel 19+, IBM XL 16+) include OMPT support already, and APEX will use the built-in support. -DAPEX_WITH_OTF2= TRUE or FALSE . Used to enable OTF2 tracing support for the Vampir trace visualization tool. -DOTF2_ROOT= path to an OTF2 installation. -DAPEX_BUILD_OTF2= TRUE or FALSE . If OTF2 is not found by CMake, this option forces CMake to automatically download and build OTF2 as part of the APEX project. -DAPEX_WITH_PAPI= TRUE or FALSE . PAPI (Performance Application Programming Interface) provides the tool designer and application engineer with a consistent interface and methodology for use of the performance counter hardware found in most major microprocessors. 
For more information, see http://icl.cs.utk.edu/papi/ . APEX uses PAPI to optionally collect hardware counters for timed events. -DPAPI_ROOT= some path to PAPI, or set the PAPI_ROOT environment variable before running cmake. See the PAPI use case for an example. -DAPEX_WITH_PERFETTO= TRUE or FALSE . Enables native Perfetto trace support, increases build/link time significantly. Only used if you want native Perfetto output support, otherwise APEX will write compressed JSON output of the same data (which is actually smaller than the binary native format). -DAPEX_WITH_PLUGINS= TRUE or FALSE. Enables APEX policy plugin support. -DAPEX_WITH_TCMALLOC= TRUE or FALSE . TCMalloc is a heap management library distributed as part of Google perftools. For more information, see https://github.com/gperftools/gperftools . TCMalloc provides faster memory performance in multithreaded environments. -DGPERFTOOLS_ROOT= path to gperftools (TCMalloc), or set the GPERFTOOLS_ROOT environment variable before running cmake. Other CMake variables of interest \u00b6 For any others not listed, see https://github.com/UO-OACISS/apex/blob/develop/cmake/Modules/APEX_DefaultOptions.cmake","title":"Installation"},{"location":"install/#installing_apex","text":"","title":"Installing APEX"},{"location":"install/#installation_with_hpx","text":"APEX is integrated into the HPX runtime , and is integrated into the HPX build system. To enable APEX measurement with HPX, enable the following CMake flags: -DHPX_WITH_APEX=TRUE The -DHPX_WITH_APEX_TAG=develop can be used to indicate a specific release version of APEX, or to use a specific GitHub branch of APEX. We recommend using the default configured version that comes with HPX (currently v2.6.4 ) or the develop branch. Additional CMake flags include: -DAPEX_WITH_LM_SENSORS=TRUE to enable LM sensors support (assumed to be installed in default system paths) -DAPEX_WITH_PAPI=TRUE and -DPAPI_ROOT=... to enable PAPI support -DAPEX_WITH_BFD=TRUE and -DBFD_ROOT=... 
or -DAPEX_BUILD_BFD=TRUE to enable Binutils support for converting function/lambda/instruction pointers to human-readable code regions. For demangling of C++ symbols, demangle.h needs to be installed with the binutils headers (not typical in system installations). -DAPEX_WITH_MSR=TRUE to enable libmsr support for RAPL power measurement (typically not needed, as RAPL support is natively handled where available) -DAPEX_WITH_OTF2=TRUE and -DOTF2_ROOT=... to enable OTF2 tracing support -DHPX_WITH_HPXMP=TRUE to enable HPX OpenMP support and OMPT measurement support from APEX -DAPEX_WITH_ACTIVEHARMONY=TRUE and -DACTIVEHARMONY_ROOT=... to enable Active Harmony support -DAPEX_WITH_CUDA=TRUE to enable CUPTI and/or NVML support. Examples require a working nvcc compiler in your path.","title":"Installation with HPX"},{"location":"install/#standalone_installation","text":"APEX is open source, and available on Github at http://github.com/UO-OACISS/apex . For stability, most users will want to download the most recent release of APEX (for example, v2.6.4): wget https://github.com/UO-OACISS/apex/archive/refs/tags/v2.6.4.tar.gz tar -xvzf v2.6.4.tar.gz cd apex-2.6.4 Other users may want to work with the most recent code available, in which case you can clone the git repo: git clone https://github.com/UO-OACISS/apex.git cd apex","title":"Standalone Installation"},{"location":"install/#configuring_and_building_apex_with_spack","text":"APEX can be installed with the Spack package management tool . See spack info apex for details. You should see something like this: CMakePackage: apex Description: Autonomic Performance Environment for eXascale (APEX). 
Homepage: https://uo-oaciss.github.io/apex Preferred version: 2.6.3 https://github.com/UO-OACISS/apex/archive/v2.6.3.tar.gz Safe versions: develop [git] https://github.com/UO-OACISS/apex on branch develop master [git] https://github.com/UO-OACISS/apex on branch master 2.6.3 https://github.com/UO-OACISS/apex/archive/v2.6.3.tar.gz 2.6.2 https://github.com/UO-OACISS/apex/archive/v2.6.2.tar.gz 2.6.1 https://github.com/UO-OACISS/apex/archive/v2.6.1.tar.gz 2.6.0 https://github.com/UO-OACISS/apex/archive/v2.6.0.tar.gz Deprecated versions: 2.5.1 https://github.com/UO-OACISS/apex/archive/v2.5.1.tar.gz 2.5.0 https://github.com/UO-OACISS/apex/archive/v2.5.0.tar.gz 2.4.1 https://github.com/UO-OACISS/apex/archive/v2.4.1.tar.gz 2.4.0 https://github.com/UO-OACISS/apex/archive/v2.4.0.tar.gz 2.3.2 https://github.com/UO-OACISS/apex/archive/v2.3.2.tar.gz 2.3.1 https://github.com/UO-OACISS/apex/archive/v2.3.1.tar.gz 2.3.0 https://github.com/UO-OACISS/apex/archive/v2.3.0.tar.gz 2.2.0 https://github.com/UO-OACISS/apex/archive/v2.2.0.tar.gz Variants: activeharmony [true] false, true Enables Active Harmony support binutils [false] false, true Enables Binutils support boost [false] false, true Enables Boost support build_system [cmake] cmake Build systems supported by the package cuda [false] false, true Enables CUDA support examples [false] false, true Build Examples gperftools [false] false, true Enables Google PerfTools TCMalloc support hip [false] false, true Enables ROCm/HIP support jemalloc [false] false, true Enables JEMalloc support lmsensors [false] false, true Enables LM-Sensors support mpi [false] false, true Enables MPI support openmp [false] false, true Enables OpenMP support otf2 [true] false, true Enables OTF2 support papi [false] false, true Enables PAPI support plugins [true] false, true Enables Policy Plugin support sycl [false] false, true Enables Intel SYCL support (Level0) tests [false] false, true Build Unit Tests when build_system=cmake build_type [Release] Debug, 
MinSizeRel, RelWithDebInfo, Release CMake build type generator [make] none the build system generator to use when build_system=cmake ^cmake@3.9: ipo [false] false, true CMake interprocedural optimization Build Dependencies: activeharmony boost cuda gmake hip lm-sensors ninja papi roctracer-dev zlib-api binutils cmake gettext gperftools jemalloc mpi otf2 rocm-smi-lib sycl Link Dependencies: activeharmony binutils boost cuda gettext gperftools hip jemalloc lm-sensors mpi otf2 papi rocm-smi-lib roctracer-dev sycl zlib-api Run Dependencies: None Licenses: None","title":"Configuring and building APEX with Spack"},{"location":"install/#configuring_and_building_apex_with_cmake","text":"APEX is built with CMake. The minimum CMake settings needed for APEX are: -DCMAKE_INSTALL_PREFIX=... some path to an installation location -DCMAKE_BUILD_TYPE=... one of Release, Debug, or RelWithDebInfo (Release recommended) The process for building APEX is: 1) Get the code (see above) 2) Enter the repo directory: cd apex-2.6.4 3) Configure using CMake: cmake -B build -DCMAKE_INSTALL_PREFIX= -DCMAKE_BUILD_TYPE=RelWithDebInfo . 4) Build with CMake: cmake --build build # Run tests, if desired ctest --test-dir build # Build documentation, if desired cd build ; make doc ; cd .. # Install, if desired cmake --install build","title":"Configuring and building APEX with CMake"},{"location":"install/#other_cmake_settings_depending_on_your_needswants","text":"Note 1: The recommended packages include: Active Harmony - for autotuning policies (optional, no longer recommended) OMPT - if OpenMP support is required (see the OpenMP use case for an example) and your compiler supports OpenMP-Tools. Note: GCC does not support OpenMP-Tools, and has no plans to as of January 2024. Compilers known to support OMPT include Clang/LLVM, Intel, NVIDIA, AMD Clang. Binutils/BFD - if your runtime/application uses instruction addresses to identify timers, e.g. OpenMP, CUDA, HIP, OneAPI, OpenACC, etc. 
PAPI - if you want hardware counter support (see the PAPI use case for an example) JEMalloc/TCMalloc - if your application is not already using a heap manager - see Note 2, below CUDA - if your application uses CUDA, APEX will use CUPTI/NVML to measure GPU activity ROCM - if your application uses HIP/ROCm, APEX will use Rocprofiler/Roctracer/ROC-SMI to measure GPU activity OneAPI - if your application uses Intel SYCL, APEX will use OneAPI/LevelZero to measure GPU activity Note 2: TCMalloc or JEMalloc will potentially speed up memory allocations significantly in APEX (and in your application). HOWEVER, if your application already uses TCMalloc, JEMalloc or TBBMalloc, DO NOT configure APEX with TCMalloc or JEMalloc. They will be included at application link time, and may conflict with the version detected by and linked into APEX. If you get a tcmalloc crash or error at startup, preload the dependent tcmalloc shared object library with '--apex:preload /path/to/libtcmalloc.so'. There are several utility libraries that provide additional functionality in APEX. Not all libraries are required, but some are recommended. For the following options, the default values are in italics . -DAPEX_BUILD_EXAMPLES= TRUE or FALSE . Whether or not to build the application examples in APEX. -DAPEX_BUILD_TESTS= TRUE or FALSE . Whether or not to build the APEX unit tests. -DAPEX_WITH_ACTIVEHARMONY= TRUE or FALSE . Active Harmony is a library that intelligently searches for parametric combinations to support adapting to heterogeneous and changing environments. For more information, see http://www.dyninst.org/harmony . APEX uses Active Harmony for runtime adaptation. -DACTIVEHARMONY_ROOT= the path to Active Harmony, or set the ACTIVEHARMONY_ROOT environment variable before running cmake. It should be noted that if Active Harmony is not specified and -DAPEX_WITH_ACTIVEHARMONY is TRUE or not set, APEX will download and build Active Harmony as a CMake project. 
To disable Active Harmony entirely, specify -DAPEX_WITH_ACTIVEHARMONY=FALSE. -DAPEX_BUILD_ACTIVEHARMONY= TRUE or FALSE . Whether or not Active Harmony is installed on the system, this option forces CMake to automatically download and build Active Harmony as part of the APEX project. -DAPEX_WITH_BFD= TRUE or FALSE . APEX uses libbfd (Binutils) to convert instruction addresses to source code locations. BFD support is useful for generating human-readable output for summaries and concurrency graphs. Libbfd is not required for runtime adaptation. For more information, see https://www.gnu.org/software/binutils/ . -DBFD_ROOT= path to Binutils, or set the BFD_ROOT environment variable. -DAPEX_BUILD_BFD= TRUE or FALSE . Whether or not binutils is found by CMake, this option forces CMake to automatically download and build binutils as part of the APEX project. -DAPEX_WITH_CUDA= TRUE or FALSE . APEX uses CUPTI to measure CUDA kernels and API calls, and/or NVML support to monitor the GPU activity passively. -DCUDAToolkit_ROOT= the path to the CUDA installation, if necessary. -DAPEX_WITH_HIP= TRUE or FALSE . APEX uses Rocprofiler and Roctracer to measure HIP kernels and API calls, and/or ROCM-SMI support to monitor the GPU activity passively. -DROCM_ROOT= the path to the ROCm installation, if necessary. -DAPEX_WITH_KOKKOS= TRUE or FALSE. -DKokkos_ROOT= the path to the Kokkos installation, if necessary. APEX will grab Kokkos as a submodule if not found; only the headers are needed. -DAPEX_WITH_JEMALLOC= TRUE or FALSE . JEMalloc is a heap management library. For more information, see http://www.canonware.com/jemalloc/ . JEMalloc provides faster memory performance in multithreaded environments. -DJEMALLOC_ROOT= path to JEMalloc, or set the JEMALLOC_ROOT environment variable before running cmake. -DAPEX_WITH_LEVEL0= TRUE or FALSE . APEX uses Level0 to measure Intel SYCL kernels and API calls and to monitor the GPU activity passively. -DAPEX_WITH_LM_SENSORS= TRUE or FALSE . 
Lm_sensors (Linux Monitoring Sensors) is a library for monitoring hardware temperatures and fan speeds. For more information, see https://en.wikipedia.org/wiki/Lm_sensors . APEX uses lm_sensors to monitor hardware, where available. -DAPEX_WITH_MPI= TRUE or FALSE . Whether to build MPI global support and related examples. -DAPEX_WITH_OMPT= TRUE or FALSE . OMP-Tools is the 5.0+ standard for OpenMP runtimes to provide callback hooks to performance tools. For more information, see the OpenMP specification v5.0 or newer. APEX has support for most OMPT OpenMP trace events. See the OpenMP use case for an example. Some compilers (Clang 10+, Intel 19+, IBM XL 16+) include OMPT support already, and APEX will use the built-in support. -DAPEX_WITH_OTF2= TRUE or FALSE . Used to enable OTF2 tracing support for the Vampir trace visualization tool. -DOTF2_ROOT= path to an OTF2 installation. -DAPEX_BUILD_OTF2= TRUE or FALSE . If OTF2 is not found by CMake, this option forces CMake to automatically download and build OTF2 as part of the APEX project. -DAPEX_WITH_PAPI= TRUE or FALSE . PAPI (Performance Application Programming Interface) provides the tool designer and application engineer with a consistent interface and methodology for use of the performance counter hardware found in most major microprocessors. For more information, see http://icl.cs.utk.edu/papi/ . APEX uses PAPI to optionally collect hardware counters for timed events. -DPAPI_ROOT= some path to PAPI, or set the PAPI_ROOT environment variable before running cmake. See the PAPI use case for an example. -DAPEX_WITH_PERFETTO= TRUE or FALSE . Enables native Perfetto trace support, increases build/link time significantly. Only used if you want native Perfetto output support, otherwise APEX will write compressed JSON output of the same data (which is actually smaller than the binary native format). -DAPEX_WITH_PLUGINS= TRUE or FALSE. Enables APEX policy plugin support. -DAPEX_WITH_TCMALLOC= TRUE or FALSE . 
TCMalloc is a heap management library distributed as part of Google perftools. For more information, see https://github.com/gperftools/gperftools . TCMalloc provides faster memory performance in multithreaded environments. -DGPERFTOOLS_ROOT= path to gperftools (TCMalloc), or set the GPERFTOOLS_ROOT environment variable before running cmake.","title":"Other CMake settings, depending on your needs/wants"},{"location":"install/#other_cmake_variables_of_interest","text":"For any others not listed, see https://github.com/UO-OACISS/apex/blob/develop/cmake/Modules/APEX_DefaultOptions.cmake","title":"Other CMake variables of interest"},{"location":"quickstart/","text":"APEX Quickstart \u00b6 Tutorial \u00b6 For an APEX tutorial, please see https://github.com/khuck/apex-tutorial . Installation \u00b6 For detailed instructions and information on dependencies, see build instructions To build APEX stand-alone (to use with OpenMP, OpenACC, CUDA, Kokkos, TBB, C++ threads, etc.) do the following: git clone https://github.com/UO-OACISS/apex.git cd apex cmake -B build -DCMAKE_BUILD_TYPE=Release -DBUILD_EXAMPLES=TRUE . cmake --build build --parallel Runtime \u00b6 To run an example (since -DBUILD_EXAMPLES=TRUE was set), just run the Matmult example and you should get similar output: [khuck@eagle apex]$ ./build/src/examples/Matmult/matmult Spawned thread 1... Spawned thread 2... Spawned thread 3... Done. 
Elapsed time: 0.300207 seconds Cores detected: 128 Worker Threads observed: 4 Available CPU time: 1.20083 seconds Counter : #samples | minimum | mean | maximum | stddev ------------------------------------------------------------------------------------------------ status:Threads : 1 6.000 6.000 6.000 0.000 status:VmData : 1 4.93e+04 4.93e+04 4.93e+04 0.000 status:VmExe : 1 64.000 64.000 64.000 0.000 status:VmHWM : 1 7808.000 7808.000 7808.000 0.000 status:VmLck : 1 0.000 0.000 0.000 0.000 status:VmLib : 1 6336.000 6336.000 6336.000 0.000 status:VmPMD : 1 16.000 16.000 16.000 0.000 status:VmPTE : 1 4.000 4.000 4.000 0.000 status:VmPeak : 1 3.80e+05 3.80e+05 3.80e+05 0.000 status:VmPin : 1 0.000 0.000 0.000 0.000 status:VmRSS : 1 7808.000 7808.000 7808.000 0.000 status:VmSize : 1 3.15e+05 3.15e+05 3.15e+05 0.000 status:VmStk : 1 192.000 192.000 192.000 0.000 status:VmSwap : 1 0.000 0.000 0.000 0.000 status:nonvoluntary_ctxt_switches : 1 0.000 0.000 0.000 0.000 status:voluntary_ctxt_switches : 1 77.000 77.000 77.000 0.000 ------------------------------------------------------------------------------------------------ Timer : #calls | mean | total | % total ------------------------------------------------------------------------------------------------ APEX MAIN : 1 0.300 0.300 100.000 allocateMatrix : 12 0.009 0.108 9.023 compute : 4 0.206 0.825 68.736 compute_interchange : 4 0.064 0.257 21.369 do_work : 4 0.298 1.193 99.313 freeMatrix : 12 0.000 0.000 0.025 initialize : 12 0.000 0.002 0.146 main : 1 0.299 0.299 24.930 ------------------------------------------------------------------------------------------------ Total timers : 49 Using apex_exec \u00b6 The wrapper script apex_exec can be used to measure applications that don't have APEX linked in. 
For details, see apex_exec usage .","title":"Quick Start (standalone)"},{"location":"quickstart/#apex_quickstart","text":"","title":"APEX Quickstart"},{"location":"quickstart/#tutorial","text":"For an APEX tutorial, please see https://github.com/khuck/apex-tutorial .","title":"Tutorial"},{"location":"quickstart/#installation","text":"For detailed instructions and information on dependencies, see build instructions To build APEX stand-alone (to use with OpenMP, OpenACC, CUDA, Kokkos, TBB, C++ threads, etc.) do the following: git clone https://github.com/UO-OACISS/apex.git cd apex cmake -B build -DCMAKE_BUILD_TYPE=Release -DBUILD_EXAMPLES=TRUE . cmake --build build --parallel","title":"Installation"},{"location":"quickstart/#runtime","text":"To run an example (since -DBUILD_EXAMPLES=TRUE was set), just run the Matmult example and you should get similar output: [khuck@eagle apex]$ ./build/src/examples/Matmult/matmult Spawned thread 1... Spawned thread 2... Spawned thread 3... Done. Elapsed time: 0.300207 seconds Cores detected: 128 Worker Threads observed: 4 Available CPU time: 1.20083 seconds Counter : #samples | minimum | mean | maximum | stddev ------------------------------------------------------------------------------------------------ status:Threads : 1 6.000 6.000 6.000 0.000 status:VmData : 1 4.93e+04 4.93e+04 4.93e+04 0.000 status:VmExe : 1 64.000 64.000 64.000 0.000 status:VmHWM : 1 7808.000 7808.000 7808.000 0.000 status:VmLck : 1 0.000 0.000 0.000 0.000 status:VmLib : 1 6336.000 6336.000 6336.000 0.000 status:VmPMD : 1 16.000 16.000 16.000 0.000 status:VmPTE : 1 4.000 4.000 4.000 0.000 status:VmPeak : 1 3.80e+05 3.80e+05 3.80e+05 0.000 status:VmPin : 1 0.000 0.000 0.000 0.000 status:VmRSS : 1 7808.000 7808.000 7808.000 0.000 status:VmSize : 1 3.15e+05 3.15e+05 3.15e+05 0.000 status:VmStk : 1 192.000 192.000 192.000 0.000 status:VmSwap : 1 0.000 0.000 0.000 0.000 status:nonvoluntary_ctxt_switches : 1 0.000 0.000 0.000 0.000 status:voluntary_ctxt_switches 
: 1 77.000 77.000 77.000 0.000 ------------------------------------------------------------------------------------------------ Timer : #calls | mean | total | % total ------------------------------------------------------------------------------------------------ APEX MAIN : 1 0.300 0.300 100.000 allocateMatrix : 12 0.009 0.108 9.023 compute : 4 0.206 0.825 68.736 compute_interchange : 4 0.064 0.257 21.369 do_work : 4 0.298 1.193 99.313 freeMatrix : 12 0.000 0.000 0.025 initialize : 12 0.000 0.002 0.146 main : 1 0.299 0.299 24.930 ------------------------------------------------------------------------------------------------ Total timers : 49","title":"Runtime"},{"location":"quickstart/#using_apex_exec","text":"The wrapper script apex_exec can be used to measure applications that don't have APEX linked in. For details, see apex_exec usage .","title":"Using apex_exec"},{"location":"quickstarthpx/","text":"APEX Quickstart \u00b6 Installation \u00b6 For detailed instructions and information on dependencies, see build instructions . APEX is integrated into the HPX runtime , and is integrated into the HPX build system. To enable APEX measurement with HPX, enable the following CMake flags: -DHPX_WITH_APEX=TRUE The CMake flag -DHPX_WITH_APEX_TAG=develop can be used to indicate a specific release version of APEX, or to use a specific GitHub branch of APEX. We recommend using the default configured version that comes with your version of HPX or the develop branch to access the latest features and bug fixes. Runtime \u00b6 To see APEX data after an HPX run, set the APEX_SCREEN_OUTPUT=1 environment variable. 
After execution, you'll see output like this: [khuck@eagle build]$ export APEX_SCREEN_OUTPUT=1 [khuck@eagle build]$ ./bin/fibonacci fibonacci(10) == 55 elapsed time: 0.112029 [s] Elapsed time: 0.19137 seconds Cores detected: 128 Worker Threads observed: 32 Available CPU time: 6.12383 seconds Timer : #calls | mean | total | % total ------------------------------------------------------------------------------------------------ APEX MAIN : 1 0.191 0.191 100.000 async : 2 0.000 0.000 0.001 async_launch_policy_dispatch : 5 0.001 0.003 0.041 broadcast_call_shutdown_functions_action : 2 0.000 0.001 0.012 call_shutdown_functions_action : 2 0.002 0.005 0.081 fibonacci_action : 174 0.015 2.569 41.957 load_components_action : 1 0.014 0.014 0.230 primary_namespace_colocate_action : 2 0.000 0.001 0.011 run_helper : 1 0.015 0.015 0.250 shutdown_all_action : 1 0.002 0.002 0.040 APEX Idle : 3.514 57.375 ------------------------------------------------------------------------------------------------ Total timers : 190 HPX applications can also use the apex_exec wrapper script, please see apex_exec flags for details.","title":"Quick Start (HPX)"},{"location":"quickstarthpx/#apex_quickstart","text":"","title":"APEX Quickstart"},{"location":"quickstarthpx/#installation","text":"For detailed instructions and information on dependencies, see build instructions . APEX is integrated into the HPX runtime , and is integrated into the HPX build system. To enable APEX measurement with HPX, enable the following CMake flags: -DHPX_WITH_APEX=TRUE The CMake flag -DHPX_WITH_APEX_TAG=develop can be used to indicate a specific release version of APEX, or to use a specific GitHub branch of APEX. We recommend using the default configured version that comes with your version of HPX or the develop branch to access the latest features and bug fixes.","title":"Installation"},{"location":"quickstarthpx/#runtime","text":"To see APEX data after an HPX run, set the APEX_SCREEN_OUTPUT=1 environment variable. 
After execution, you'll see output like this: [khuck@eagle build]$ export APEX_SCREEN_OUTPUT=1 [khuck@eagle build]$ ./bin/fibonacci fibonacci(10) == 55 elapsed time: 0.112029 [s] Elapsed time: 0.19137 seconds Cores detected: 128 Worker Threads observed: 32 Available CPU time: 6.12383 seconds Timer : #calls | mean | total | % total ------------------------------------------------------------------------------------------------ APEX MAIN : 1 0.191 0.191 100.000 async : 2 0.000 0.000 0.001 async_launch_policy_dispatch : 5 0.001 0.003 0.041 broadcast_call_shutdown_functions_action : 2 0.000 0.001 0.012 call_shutdown_functions_action : 2 0.002 0.005 0.081 fibonacci_action : 174 0.015 2.569 41.957 load_components_action : 1 0.014 0.014 0.230 primary_namespace_colocate_action : 2 0.000 0.001 0.011 run_helper : 1 0.015 0.015 0.250 shutdown_all_action : 1 0.002 0.002 0.040 APEX Idle : 3.514 57.375 ------------------------------------------------------------------------------------------------ Total timers : 190 HPX applications can also use the apex_exec wrapper script, please see apex_exec flags for details.","title":"Runtime"},{"location":"refman/","text":"API Doxygen Reference \u00b6 The source code is instrumented with Doxygen comments, and the API reference manual can be generated by executing 'make doc' in the build directory, after CMake configuration. 
A fairly recent version of the API reference documentation is available here: http://www.nic.uoregon.edu/~khuck/apex_docs/doc/html/index.html http://www.nic.uoregon.edu/~khuck/apex_docs/doc/refman.pdf In the event that the API specification and the reference implementation (as generated by the Doxygen comments from the actual source code) do not match, assume that the specification is correct and that the implementation is non-compliant - and subsequently contact the project maintainers so that we may bring the implementation into compliance.","title":"API Doxygen Reference"},{"location":"refman/#api_doxygen_reference","text":"The source code is instrumented with Doxygen comments, and the API reference manual can be generated by executing 'make doc' in the build directory, after CMake configuration. A fairly recent version of the API reference documentation is available here: http://www.nic.uoregon.edu/~khuck/apex_docs/doc/html/index.html http://www.nic.uoregon.edu/~khuck/apex_docs/doc/refman.pdf In the event that the API specification and the reference implementation (as generated by the Doxygen comments from the actual source code) do not match, assume that the specification is correct and that the implementation is non-compliant - and subsequently contact the project maintainers so that we may bring the implementation into compliance.","title":"API Doxygen Reference"},{"location":"spec/","text":"APEX Specification (DRAFT) \u00b6 *...to be fully implemented in a future release. While the following specification is slightly different than the current implementation, the differences are minor. When in doubt, the current implementation is documented by Doxygen, and is available here: http://www.nic.uoregon.edu/~khuck/apex_docs/doc/html/index.html http://www.nic.uoregon.edu/~khuck/apex_docs/doc/refman.pdf * READ ME FIRST! \u00b6 The API specification is provided for users who wish to instrument their own applications, or who wish to instrument a runtime. 
Please note that most runtimes have already been instrumented (or provide callbacks), and that users typically do not have to make any calls to the APEX API, other than to add application level timers or to write custom policy rules. If that is you, please see the tutorial with lots of up-to-date examples, https://github.com/khuck/apex-tutorial . Introduction \u00b6 This page contains the API specification for APEX. The API specification provides a high-level overview of the API and its functionality. The implementation has Doxygen comments inserted, so for full implementation details, please see the API Reference Manual . A note about C++ \u00b6 The following specification contains both the C and the C++ API. Typically, the C++ names use overloading for different argument lists, and will replace the apex_ prefix with the apex:: namespace. Because both APIs return handles to internal APEX objects, the type definitions of these objects use the C naming convention. In addition to the simple API presented below, the C++ API includes scoped timers and threads. See http://www.nic.uoregon.edu/~khuck/apex_docs/doc/html/namespaceapex.html for details. Terminology \u00b6 Unfortunately, many terms in Computer Science are overloaded. The following definitions are in use in this document: Thread : an operating system (OS) thread of execution. For example, Posix threads (pthreads). Task : a scheduled unit of work, such as an OpenMP task or an HPX thread. APEX timers are typically used to measure tasks. C example \u00b6 The following is a very small C program that uses the APEX API. For more examples, please see the programs in the src/examples and src/unit_tests/C directories of the APEX source code. 
#include #include #include \"apex.h\" int foo(int i) { /* start an APEX timer for the function foo */ apex_profiler_handle profiler = apex_start(APEX_FUNCTION_ADDRESS, &foo); int j = i * i; /* stop the APEX timer */ apex_stop(profiler); return j; } int main (int argc, char** argv) { /* initialize APEX */ apex_init(\"apex_start unit test\"); /* start a timer, passing in the address of the main function */ apex_profiler_handle profiler = apex_start(APEX_FUNCTION_ADDRESS, &main); int i,j = 0; for (i = 0 ; i < 3 ; i++) { j += foo(i); } /* stop the timer */ apex_stop(profiler); /* finalize APEX */ apex_finalize(); /* free all memory allocated by APEX */ apex_cleanup(); return 0; } C++ example \u00b6 The following is a slightly more complicated C++ pthread program that uses the APEX API. For more examples, please see the programs in the src/examples and src/unit_tests/C++ directories of the APEX source code. #include #include #include #include \"apex_api.hpp\" void* someThread(void* tmp) { int* tid = (int*)tmp; char name[32]; sprintf(name, \"worker thread %d\", *tid); /* Register this thread with APEX */ apex::register_thread(name); /* Start a timer */ apex::profiler* p = apex::start((apex_function_address)&someThread); /* ... */ /* do some computation */ /* ... 
*/ /* stop the timer */ apex::stop(p); /* tell APEX that this thread is exiting */ apex::exit_thread(); return NULL; } int main (int argc, char** argv) { /* initialize APEX */ apex::init(\"apex::start unit test\"); /* set our node ID */ apex::set_node_id(0); /* start a timer */ apex::profiler* p = apex::start(\"main\"); /* Spawn two threads */ pthread_t thread[2]; int tid = 0; pthread_create(&(thread[0]), NULL, someThread, &tid); int tid2 = 1; pthread_create(&(thread[1]), NULL, someThread, &tid2); /* wait for the threads to finish */ pthread_join(thread[0], NULL); pthread_join(thread[1], NULL); /* stop our main timer */ apex::stop(p); /* finalize APEX */ apex::finalize(); /* free all memory allocated by APEX */ apex::cleanup(); return 0; } Constants, types and enumerations \u00b6 Constants \u00b6 /** A null pointer representing an APEX profiler handle. * Used when a null APEX profile handle is to be passed in to * apex::stop when the profiler object was not retained locally. */ #define APEX_NULL_PROFILER_HANDLE (apex_profiler_handle)(NULL) // for comparisons #define APEX_MAX_EVENTS 128 /*!< The maximum number of event types. Allows for ~20 custom events. */ #define APEX_NULL_FUNCTION_ADDRESS 0L // for comparisons Pre-defined types \u00b6 /** The address of a C++ object in APEX. * Not useful for the caller that gets it back, but required * for stopping the timer later. */ typedef uintptr_t apex_profiler_handle; // address of internal C++ object /** Not useful for the caller that gets it back, but required * for deregistering policies after registration. */ typedef uintptr_t apex_policy_handle; // address of internal C++ object /** Rather than use void pointers everywhere, be explicit about * what the functions are expecting. 
*/ typedef uintptr_t apex_function_address; // generic function pointer Enumerations \u00b6 /** * Typedef for enumerating the different timer types */ typedef enum _apex_profiler_type { APEX_FUNCTION_ADDRESS = 0, /*!< The ID is a function (or instruction) address */ APEX_NAME_STRING, /*!< The ID is a character string */ APEX_FUNCTOR /*!< C++ Object with the () operator defined */ } apex_profiler_type; /** * Typedef for enumerating the different event types */ typedef enum _event_type { APEX_INVALID_EVENT = -1, APEX_STARTUP = 0, /*!< APEX is initialized */ APEX_SHUTDOWN, /*!< APEX is terminated */ APEX_NEW_NODE, /*!< APEX has registered a new process ID */ APEX_NEW_THREAD, /*!< APEX has registered a new OS thread */ APEX_EXIT_THREAD, /*!< APEX has exited an OS thread */ APEX_START_EVENT, /*!< APEX has processed a timer start event */ APEX_RESUME_EVENT, /*!< APEX has processed a timer resume event (the number of calls is not incremented) */ APEX_STOP_EVENT, /*!< APEX has processed a timer stop event */ APEX_YIELD_EVENT, /*!< APEX has processed a timer yield event */ APEX_SAMPLE_VALUE, /*!< APEX has processed a sampled value */ APEX_PERIODIC, /*!< APEX has processed a periodic timer */ APEX_CUSTOM_EVENT_1, /*!< APEX has processed a custom event - useful for large granularity application control events */ APEX_CUSTOM_EVENT_2, // these are just here for padding, and so we can APEX_CUSTOM_EVENT_3, // test with them. APEX_CUSTOM_EVENT_4, APEX_CUSTOM_EVENT_5, APEX_CUSTOM_EVENT_6, APEX_CUSTOM_EVENT_7, APEX_CUSTOM_EVENT_8, APEX_UNUSED_EVENT = APEX_MAX_EVENTS // can't have more custom events than this } apex_event_type; /** * Typedef for enumerating the OS thread states. 
*/ typedef enum _thread_state { APEX_IDLE, /*!< Thread is idle */ APEX_BUSY, /*!< Thread is working */ APEX_THROTTLED, /*!< Thread is throttled (sleeping) */ APEX_WAITING, /*!< Thread is waiting for a resource */ APEX_BLOCKED /*!< Thread is otherwise blocked */ } apex_thread_state; /** * Typedef for enumerating the different optimization strategies * for throttling. */ typedef enum {APEX_MAXIMIZE_THROUGHPUT, /*!< maximize the number of calls to a timer/counter */ APEX_MAXIMIZE_ACCUMULATED, /*!< maximize the accumulated value of a timer/counter */ APEX_MINIMIZE_ACCUMULATED /*!< minimize the accumulated value of a timer/counter */ } apex_optimization_criteria_t; /** * Typedef for enumerating the different optimization methods * for throttling. */ typedef enum {APEX_SIMPLE_HYSTERESIS, /*!< optimize using sliding window of historical observations. A running average of the most recent N observations are used as the measurement. */ APEX_DISCRETE_HILL_CLIMBING, /*!< Use a discrete hill climbing algorithm for optimization */ APEX_ACTIVE_HARMONY /*!< Use Active Harmony for optimization. */ } apex_optimization_method_t; /** The type of a profiler object * */ typedef enum _profile_type { APEX_TIMER, /*!< This profile is a instrumented timer */ APEX_COUNTER /*!< This profile is a sampled counter */ } apex_profile_type; Data structures and classes \u00b6 /** * The APEX context when an event occurs. This context will be passed to * any policies registered for this event. */ typedef struct _context { apex_event_type event_type; /*!< The type of the event currently processing */ apex_policy_handle* policy_handle; /*!< The policy handle for the current policy function */ void * data; /*!< Data associated with the event, such as the custom_data for a custom_event */ } apex_context; /** * The profile object for a timer in APEX. * Returned by the apex_get_profile() call. 
*/ typedef struct _profile { double calls; /*!< Number of times a timer was called, or the number of samples collected for a counter */ double accumulated; /*!< Accumulated values for all calls/samples */ double sum_squares; /*!< Running sum of squares calculation for all calls/samples */ double minimum; /*!< Minimum value seen by the timer or counter */ double maximum; /*!< Maximum value seen by the timer or counter */ apex_profile_type type; /*!< Whether this is a timer or a counter */ double papi_metrics[8]; /*!< Array of accumulated PAPI hardware metrics */ } apex_profile; /** * The APEX tuning request structures. */ typedef struct _apex_param { char * init_value; /*!< Initial value */ const char * value; /*!< Current value */ int num_possible_values; /*!< Number of possible values */ char * possible_values[]; } apex_param_struct; typedef struct _apex_tuning_request { char * name; /*!< Tuning request name */ double (*metric)(void); /*!< function to return the address of the output parameter */ int num_params; /*!< number of tuning input parameters */ char * param_names[]; /*!< the input parameter names */ apex_param_struct * params[]; /*!< the input parameters */ apex_event_type trigger; /*!< the event that triggers the tuning update */ apex_tuning_session_handle tuning_session_handle; /*!< the Active Harmony tuning session handle */ bool running; /*!< the current state of the tuning */ apex_ah_tuning_strategy strategy; /*!< the requested Active Harmony tuning strategy */ } apex_tuning_request_struct; Environment variables \u00b6 Please see the environment variables section of the documentation. Please note that all environment variables can also be queried or set at runtime with associated API calls. 
For example, the APEX_CSV_OUTPUT variable can also be set/queried with: void apex_set_csv_output (int); int apex_get_csv_output (void); General Utility functions \u00b6 Initialization \u00b6 /* C++ */ void apex::init (const char *thread_name); /* C */ void apex_init (const char *thread_name); APEX initialization is required to set up data structures and spawn the necessary helper threads, including the background system state query thread, the policy engine thread, and the profile handler thread. The thread name parameter will be used as the top-level timer for the main thread of execution. Finalization \u00b6 /* C++ */ void apex::finalize (void); /* C */ void apex_finalize (void); APEX finalization is required to format any desired output (screen, csv, profile, etc.) and terminate all APEX helper threads. No memory is freed at this point - that is done by the apex_cleanup() call. The reason for this is that applications may want to perform reporting after finalization, so the performance state of the application should still exist. Cleanup \u00b6 /* C++ */ void apex::cleanup (void); /* C */ void apex_cleanup (void); APEX cleanup frees all memory associated with APEX. Setting node ID \u00b6 /* C++ */ void apex::set_node_id (const uint64_t id); /* C */ void apex_set_node_id (const uint64_t id); When running in distributed environments, assign the specified id number as the APEX node ID. This can be an MPI rank or an HPX locality, for example. Registering threads \u00b6 /* C++ */ void apex::register_thread (const std::string &name); /* C */ void apex_register_thread (const char *name); Register a new OS thread with APEX. This method should be called whenever a new OS thread is spawned by the application or the runtime. An empty string or null string is valid input. 
Exiting a thread \u00b6 /* C++ */ void apex::exit_thread (void); /* C */ void apex_exit_thread (void); Before any thread other than the main thread of execution exits, notify APEX that the thread is exiting. The main thread should not call this function, but apex_finalize instead. Exiting the thread will trigger an event in APEX, so any policies associated with a thread exit will be executed. Getting the APEX version \u00b6 /* C++ */ std::string & apex::version (void); /* C */ const char * apex_version (void); Return the APEX version as a string. Getting the APEX settings \u00b6 /* C++ */ std::string & apex::get_options (void); /* C */ const char * apex_get_options (void); Return the current APEX options as a string. Basic measurement Functions (introspection) \u00b6 Starting a timer \u00b6 /* C++ */ apex_profiler_handle apex::start (const std::string &timer_name); apex_profiler_handle apex::start (const apex_function_address function_address); /* C */ apex_profiler_handle apex_start (apex_profiler_type type, const void * identifier); Create an APEX timer and start it. An APEX profiler object is returned, containing an identifier that APEX uses to stop the timer. The timer is either identified by a name or a function/task instruction pointer address. Stopping a timer \u00b6 /* C++ */ void apex::stop (apex_profiler_handle the_profiler); /* C */ void apex_stop (apex_profiler_handle the_profiler); The timer associated with the profiler object is stopped and placed on an internal queue to be processed by the profiler handler thread in the background. The profiler object is flagged as \"stopped\", so that when the profiler is processed the call count for this particular timer will be incremented by 1, unless the timer was started by apex_resume() (see below). The profiler handle will be freed internally by APEX after processing. 
Yielding a timer \u00b6 /* C++ */ void apex::yield (apex_profiler_handle the_profiler); /* C */ void apex_yield (apex_profiler_handle the_profiler); The timer associated with the profiler object is stopped and placed on an internal queue to be processed by the profiler handler thread in the background. The profiler object is flagged as NOT stopped , so that when the profiler is processed the call count will NOT be incremented. An application using apex_yield should not use apex_resume to restart the timer; it should use apex_start. apex_yield() is intended for situations when the completion state of the task is known and the state is not complete . The profiler handle will be freed internally by APEX after processing. Resuming a timer \u00b6 /* C++ */ apex_profiler_handle apex::resume (const std::string &timer_name); apex_profiler_handle apex::resume (const apex_function_address function_address); /* C */ apex_profiler_handle apex_resume (apex_profiler_type type, const void * identifier); Create an APEX timer and start it. An APEX profiler object is returned, containing an identifier that APEX uses to stop the timer. The profiler is flagged as NOT a new task , so that when it is stopped by apex_stop the call count for this particular timer will not be incremented. apex_resume() is intended for situations when the completion state of a task is NOT known when control is returned to the task scheduler, but is known when an interrupted task is resumed. Creating a new task dependency \u00b6 /* C++ */ void apex::new_task (std::string & name, const void * task_id); void apex::new_task (const apex_function_address function_address, const void * task_id); /* C */ void apex_new_task (apex_profiler_type type, const void * identifier, const void * task_id) Register the creation of a new task. This is used to track task dependencies in APEX. APEX assumes that the current APEX profiler refers to the task that is the parent of this new task. 
The task_info object is a generic pointer to whatever data might need to be passed to a policy executed when a new task is created. Sampling a value \u00b6 /* C++ */ void apex::sample_value (const std::string & name, const double value) /* C */ void apex_sample_value (const char * name, const double value); Record a measurement of the specified counter with the specified value. For example, \"bytes transferred\" and \"1024\". Setting the OS thread state \u00b6 /* C++ */ void apex::set_state (apex_thread_state state); /* C */ void apex_set_state (apex_thread_state state); Set the state of the current OS thread. States can include things like idle, busy, waiting, throttled, blocked. Policy-related methods (adaptation) \u00b6 Registering an event-based policy function \u00b6 /* C++ */ apex_policy_handle apex::register_policy (const apex_event_type when, std::function f); std::set apex::register_policy (std::set when, std::function f); /* C */ apex_policy_handle apex_register_policy (const apex_event_type when, int(*f)(apex_context const&)); APEX provides the ability to call an application-specified function when certain events occur in the APEX library, or periodically. This assigns the passed in function to the event, so that when that event occurs in APEX, the function is called. The context for the event will be passed to the registered function. A set of events can also be used to register a policy function, which will return a set of policy handles. When any event in the set occurs, the function will be called. Registering a periodic policy \u00b6 /* C++ */ apex_policy_handle apex::register_periodic_policy(const unsigned long period, std::function f); /* C */ apex_policy_handle apex_register_periodic_policy (const unsigned long period, int(*f)(apex_context const&)); APEX provides the ability to call an application-specified function periodically. This method assigns the passed in function to be called on a periodic basis. 
The context for the event will be passed to the registered function. The period is specified in microseconds (us). De-registering a policy \u00b6 /* C++ */ apex::deregister_policy (apex_policy_handle handle); /* C */ apex_deregister_policy (apex_policy_handle handle); Remove the specified policy so that it will no longer be executed, whether it is event-based or periodic. The calling code should not try to dereference the policy handle after this call, as the memory pointed to by the handle will be freed. Registering a custom event \u00b6 /* C++ */ apex_event_type apex::register_custom_event (const std::string & name); /* C */ apex_event_type apex_register_custom_event (const char * name); Register a new event type with APEX. Trigger a custom event \u00b6 /* C++ */ void apex::custom_event (apex_event_type event_type, const void * event_data); /* C */ void apex_custom_event (const char * name, const void * event_data); Trigger a custom event. This function will pass a custom event to the APEX event listeners. Each listener's custom event handler will handle the custom event. Policy functions will be passed the custom event name in the event context. The event data pointer is used to pass memory to the policy function from the code that triggered the event. Request a profile from APEX \u00b6 /* C++ */ apex_profile * apex::get_profile (const std::string & name); apex_profile * apex::get_profile (const apex_function_address function_address); /* C */ apex_profile * apex_get_profile (apex_profiler_type type, const void * identifier) This function will return the current profile for the specified identifier. Because profiles are updated out-of-band, the returned profile values may be out of date. This profile can be either a timer or a sampled value.
Reset a profile \u00b6 /* C++ */ void apex::reset (const std::string & timer_name); void apex::reset (const apex_function_address function_address); /* C */ void apex_reset (apex_profiler_type type, const void * identifier) This function will reset the profile associated with the specified timer or counter id to zero. If the identifier is null, all timers and counters will be reset. Concurrency Throttling Policy Functions \u00b6 Setup tuning for adaptation \u00b6 /* C++ */ apex_tuning_session_handle setup_custom_tuning(apex_tuning_request & request); apex_tuning_session_handle setup_custom_tuning(apex_tuning_request * request); Setup tuning of specified parameters to optimize for a custom metric, using multiple input criteria. This function will initialize a policy to optimize a custom metric, using the list of tunable parameters. The system tries to minimize the custom metric. After evaluating the state of the system, the policy will assign new values to the inputs. Get the current thread cap \u00b6 /* C++ */ int apex::get_thread_cap (void); /* C */ int apex_get_thread_cap (void); This function will return the current thread cap based on the throttling policy. Set the current thread cap \u00b6 /* C++ */ void apex::set_thread_cap (int new_cap); /* C */ void apex_set_thread_cap (int new_cap); This function will set the current thread cap based on an external throttling policy. Event-based API (OCR, Legion support - TBD ) \u00b6 The OCR and Legion runtimes teams have met to propose a common API for measuring asynchronous task-based runtimes. For more details, see https://github.com/UO-OACISS/apex/issues/37 . /* C++ */ apex::task_create (uint64_t parent_id) apex::dependency_reached (uint64_t event_id, uint64_t data_id, uint64_t task_id, uint64_t parent_id, ?) 
apex::task_ready (uint64_t why_ready) apex::task_execute (uint64_t why_delay, const apex_function_address function) apex::task_finished (uint64_t task_id) apex::task_destroy (uint64_t task_id) apex::data_create (uint64_t data_id) apex::data_new_size (uint64_t data_id) apex::data_move_from (uint64_t data_id, uint64_t target_location) apex::data_move_to (uint64_t data_id, uint64_t source_location) apex::data_replace (uint64_t data_id, uint64_t new_id) apex::data_destroy (uint64_t data_id) apex::event_create (uint64_t event_id, parent_task_id) apex::event_add_dependency (uint64_t event_id, uint64_t data_event_task_id, uint64_t parent_task_id) apex::event_trigger (uint64_t event_id) apex::event_destroy (uint64_t event_id) /* C API tbd */","title":"API Specification"},{"location":"spec/#apex_specification_draft","text":"*...to be fully implemented in a future release. While the following specification is slightly different than the current implementation, the differences are minor. When in doubt, the current implementation is documented by Doxygen, and is available here: http://www.nic.uoregon.edu/~khuck/apex_docs/doc/html/index.html http://www.nic.uoregon.edu/~khuck/apex_docs/doc/refman.pdf *","title":"APEX Specification (DRAFT)"},{"location":"spec/#read_me_first","text":"The API specification is provided for users who wish to instrument their own applications, or who wish to instrument a runtime. Please note that most runtimes have already been instrumented (or provide callbacks), and that users typically do not have to make any calls to the APEX API, other than to add application level timers or to write custom policy rules. If that is you, please see the tutorial with lots of up-to-date examples, https://github.com/khuck/apex-tutorial .","title":"READ ME FIRST!"},{"location":"spec/#introduction","text":"This page contains the API specification for APEX. The API specification provides a high-level overview of the API and its functionality. 
The implementation has Doxygen comments inserted, so for full implementation details, please see the API Reference Manual.","title":"Introduction"},{"location":"spec/#a_note_about_c","text":"The following specification contains both the C and the C++ API. Typically, the C++ names use overloading for different argument lists, and will replace the apex_ prefix with the apex:: namespace. Because both APIs return handles to internal APEX objects, the type definitions of these objects use the C naming convention. In addition to the simple API presented below, the C++ API includes scoped timers and threads. See http://www.nic.uoregon.edu/~khuck/apex_docs/doc/html/namespaceapex.html for details.","title":"A note about C++"},{"location":"spec/#terminology","text":"Unfortunately, many terms in computer science are overloaded. The following definitions are in use in this document: Thread: an operating system (OS) thread of execution. For example, POSIX threads (pthreads). Task: a scheduled unit of work, such as an OpenMP task or an HPX thread. APEX timers are typically used to measure tasks.","title":"Terminology"},{"location":"spec/#c_example","text":"The following is a very small C program that uses the APEX API. For more examples, please see the programs in the src/examples and src/unit_tests/C directories of the APEX source code.
#include #include #include \"apex.h\" int foo(int i) { /* start an APEX timer for the function foo */ apex_profiler_handle profiler = apex_start(APEX_FUNCTION_ADDRESS, &foo); int j = i * i; /* stop the APEX timer */ apex_stop(profiler); return j; } int main (int argc, char** argv) { /* initialize APEX */ apex_init(\"apex_start unit test\"); /* start a timer, passing in the address of the main function */ apex_profiler_handle profiler = apex_start(APEX_FUNCTION_ADDRESS, &main); int i,j = 0; for (i = 0 ; i < 3 ; i++) { j += foo(i); } /* stop the timer */ apex_stop(profiler); /* finalize APEX */ apex_finalize(); /* free all memory allocated by APEX */ apex_cleanup(); return 0; }","title":"C example"},{"location":"spec/#c_example_1","text":"The following is a slightly more complicated C++ pthread program that uses the APEX API. For more examples, please see the programs in the src/examples and src/unit_tests/C++ directories of the APEX source code. #include #include #include #include \"apex_api.hpp\" void* someThread(void* tmp) { int* tid = (int*)tmp; char name[32]; sprintf(name, \"worker thread %d\", *tid); /* Register this thread with APEX */ apex::register_thread(name); /* Start a timer */ apex::profiler* p = apex::start((apex_function_address)&someThread); /* ... */ /* do some computation */ /* ... 
*/ /* stop the timer */ apex::stop(p); /* tell APEX that this thread is exiting */ apex::exit_thread(); return NULL; } int main (int argc, char** argv) { /* initialize APEX */ apex::init(\"apex::start unit test\"); /* set our node ID */ apex::set_node_id(0); /* start a timer */ apex::profiler* p = apex::start(\"main\"); /* Spawn two threads */ pthread_t thread[2]; int tid = 0; pthread_create(&(thread[0]), NULL, someThread, &tid); int tid2 = 1; pthread_create(&(thread[1]), NULL, someThread, &tid2); /* wait for the threads to finish */ pthread_join(thread[0], NULL); pthread_join(thread[1], NULL); /* stop our main timer */ apex::stop(p); /* finalize APEX */ apex::finalize(); /* free all memory allocated by APEX */ apex::cleanup(); return 0; }","title":"C++ example"},{"location":"spec/#constants_types_and_enumerations","text":"","title":"Constants, types and enumerations"},{"location":"spec/#constants","text":"/** A null pointer representing an APEX profiler handle. * Used when a null APEX profile handle is to be passed in to * apex::stop when the profiler object was not retained locally. */ #define APEX_NULL_PROFILER_HANDLE (apex_profiler_handle)(NULL) // for comparisons #define APEX_MAX_EVENTS 128 /*!< The maximum number of event types. Allows for ~20 custom events. */ #define APEX_NULL_FUNCTION_ADDRESS 0L // for comparisons","title":"Constants"},{"location":"spec/#pre-defined_types","text":"/** The address of a C++ object in APEX. * Not useful for the caller that gets it back, but required * for stopping the timer later. */ typedef uintptr_t apex_profiler_handle; // address of internal C++ object /** Not useful for the caller that gets it back, but required * for deregistering policies after registration. */ typedef uintptr_t apex_policy_handle; // address of internal C++ object /** Rather than use void pointers everywhere, be explicit about * what the functions are expecting. 
*/ typedef uintptr_t apex_function_address; // generic function pointer","title":"Pre-defined types"},{"location":"spec/#enumerations","text":"/** * Typedef for enumerating the different timer types */ typedef enum _apex_profiler_type { APEX_FUNCTION_ADDRESS = 0, /*!< The ID is a function (or instruction) address */ APEX_NAME_STRING, /*!< The ID is a character string */ APEX_FUNCTOR /*!< C++ Object with the () operator defined */ } apex_profiler_type; /** * Typedef for enumerating the different event types */ typedef enum _event_type { APEX_INVALID_EVENT = -1, APEX_STARTUP = 0, /*!< APEX is initialized */ APEX_SHUTDOWN, /*!< APEX is terminated */ APEX_NEW_NODE, /*!< APEX has registered a new process ID */ APEX_NEW_THREAD, /*!< APEX has registered a new OS thread */ APEX_EXIT_THREAD, /*!< APEX has exited an OS thread */ APEX_START_EVENT, /*!< APEX has processed a timer start event */ APEX_RESUME_EVENT, /*!< APEX has processed a timer resume event (the number of calls is not incremented) */ APEX_STOP_EVENT, /*!< APEX has processed a timer stop event */ APEX_YIELD_EVENT, /*!< APEX has processed a timer yield event */ APEX_SAMPLE_VALUE, /*!< APEX has processed a sampled value */ APEX_PERIODIC, /*!< APEX has processed a periodic timer */ APEX_CUSTOM_EVENT_1, /*!< APEX has processed a custom event - useful for large granularity application control events */ APEX_CUSTOM_EVENT_2, // these are just here for padding, and so we can APEX_CUSTOM_EVENT_3, // test with them. APEX_CUSTOM_EVENT_4, APEX_CUSTOM_EVENT_5, APEX_CUSTOM_EVENT_6, APEX_CUSTOM_EVENT_7, APEX_CUSTOM_EVENT_8, APEX_UNUSED_EVENT = APEX_MAX_EVENTS // can't have more custom events than this } apex_event_type; /** * Typedef for enumerating the OS thread states. 
*/ typedef enum _thread_state { APEX_IDLE, /*!< Thread is idle */ APEX_BUSY, /*!< Thread is working */ APEX_THROTTLED, /*!< Thread is throttled (sleeping) */ APEX_WAITING, /*!< Thread is waiting for a resource */ APEX_BLOCKED /*!< Thread is otherwise blocked */ } apex_thread_state; /** * Typedef for enumerating the different optimization strategies * for throttling. */ typedef enum {APEX_MAXIMIZE_THROUGHPUT, /*!< maximize the number of calls to a timer/counter */ APEX_MAXIMIZE_ACCUMULATED, /*!< maximize the accumulated value of a timer/counter */ APEX_MINIMIZE_ACCUMULATED /*!< minimize the accumulated value of a timer/counter */ } apex_optimization_criteria_t; /** * Typedef for enumerating the different optimization methods * for throttling. */ typedef enum {APEX_SIMPLE_HYSTERESIS, /*!< optimize using sliding window of historical observations. A running average of the most recent N observations are used as the measurement. */ APEX_DISCRETE_HILL_CLIMBING, /*!< Use a discrete hill climbing algorithm for optimization */ APEX_ACTIVE_HARMONY /*!< Use Active Harmony for optimization. */ } apex_optimization_method_t; /** The type of a profiler object * */ typedef enum _profile_type { APEX_TIMER, /*!< This profile is a instrumented timer */ APEX_COUNTER /*!< This profile is a sampled counter */ } apex_profile_type;","title":"Enumerations"},{"location":"spec/#data_structures_and_classes","text":"/** * The APEX context when an event occurs. This context will be passed to * any policies registered for this event. */ typedef struct _context { apex_event_type event_type; /*!< The type of the event currently processing */ apex_policy_handle* policy_handle; /*!< The policy handle for the current policy function */ void * data; /*!< Data associated with the event, such as the custom_data for a custom_event */ } apex_context; /** * The profile object for a timer in APEX. * Returned by the apex_get_profile() call. 
*/ typedef struct _profile { double calls; /*!< Number of times a timer was called, or the number of samples collected for a counter */ double accumulated; /*!< Accumulated values for all calls/samples */ double sum_squares; /*!< Running sum of squares calculation for all calls/samples */ double minimum; /*!< Minimum value seen by the timer or counter */ double maximum; /*!< Maximum value seen by the timer or counter */ apex_profile_type type; /*!< Whether this is a timer or a counter */ double papi_metrics[8]; /*!< Array of accumulated PAPI hardware metrics */ } apex_profile; /** * The APEX tuning request structures. */ typedef struct _apex_param { char * init_value; /*!< Initial value */ const char * value; /*!< Current value */ int num_possible_values; /*!< Number of possible values */ char * possible_values[]; } apex_param_struct; typedef struct _apex_tuning_request { char * name; /*!< Tuning request name */ double (*metric)(void); /*!< function to return the address of the output parameter */ int num_params; /*!< number of tuning input parameters */ char * param_names[]; /*!< the input parameter names */ apex_param_struct * params[]; /*!< the input parameters */ apex_event_type trigger; /*!< the event that triggers the tuning update */ apex_tuning_session_handle tuning_session_handle; /*!< the Active Harmony tuning session handle */ bool running; /*!< the current state of the tuning */ apex_ah_tuning_strategy strategy; /*!< the requested Active Harmony tuning strategy */ } apex_tuning_request_struct;","title":"Data structures and classes"},{"location":"spec/#environment_variables","text":"Please see the environment variables section of the documentation. Please note that all environment variables can also be queried or set at runtime with associated API calls. 
For example, the APEX_CSV_OUTPUT variable can also be set/queried with: void apex_set_csv_output (int); int apex_get_csv_output (void);","title":"Environment variables"},{"location":"spec/#general_utility_functions","text":"","title":"General Utility functions"},{"location":"spec/#initialization","text":"/* C++ */ void apex::init (const char *thread_name); /* C */ void apex_init (const char *thread_name); APEX initialization is required to set up data structures and spawn the necessary helper threads, including the background system state query thread, the policy engine thread, and the profile handler thread. The thread name parameter will be used as the name of the top-level timer for the main thread of execution.","title":"Initialization"},{"location":"spec/#finalization","text":"/* C++ */ void apex::finalize (void); /* C */ void apex_finalize (void); APEX finalization is required to format any desired output (screen, csv, profile, etc.) and terminate all APEX helper threads. No memory is freed at this point; that is done by the apex_cleanup() call. The reason for this is that applications may want to perform reporting after finalization, so the performance state of the application should still exist.","title":"Finalization"},{"location":"spec/#cleanup","text":"/* C++ */ void apex::cleanup (void); /* C */ void apex_cleanup (void); APEX cleanup frees all memory associated with APEX.","title":"Cleanup"},{"location":"spec/#setting_node_id","text":"/* C++ */ void apex::set_node_id (const uint64_t id); /* C */ void apex_set_node_id (const uint64_t id); When running in distributed environments, assign the specified id number as the APEX node ID. This can be an MPI rank or an HPX locality, for example.","title":"Setting node ID"},{"location":"spec/#registering_threads","text":"/* C++ */ void apex::register_thread (const std::string &name); /* C */ void apex_register_thread (const char *name); Register a new OS thread with APEX.
This method should be called whenever a new OS thread is spawned by the application or the runtime. An empty string or null string is valid input.","title":"Registering threads"},{"location":"spec/#exiting_a_thread","text":"/* C++ */ void apex::exit_thread (void); /* C */ void apex_exit_thread (void); Before any thread other than the main thread of execution exits, notify APEX that the thread is exiting. The main thread should not call this function, but apex_finalize instead. Exiting the thread will trigger an event in APEX, so any policies associated with a thread exit will be executed.","title":"Exiting a thread"},{"location":"spec/#getting_the_apex_version","text":"/* C++ */ std::string & apex::version (void); /* C */ const char * apex_version (void); Return the APEX version as a string.","title":"Getting the APEX version"},{"location":"spec/#getting_the_apex_settings","text":"/* C++ */ std::string & apex::get_options (void); /* C */ const char * apex_get_options (void); Return the current APEX options as a string.","title":"Getting the APEX settings"},{"location":"spec/#basic_measurement_functions_introspection","text":"","title":"Basic measurement Functions (introspection)"},{"location":"spec/#starting_a_timer","text":"/* C++ */ apex_profiler_handle apex::start (const std::string &timer_name); apex_profiler_handle apex::start (const apex_function_address function_address); /* C */ apex_profiler_handle apex_start (apex_profiler_type type, const void * identifier); Create an APEX timer and start it. An APEX profiler object is returned, containing an identifier that APEX uses to stop the timer. 
The timer is identified either by a name or by a function/task instruction pointer address.","title":"Starting a timer"},{"location":"spec/#stopping_a_timer","text":"/* C++ */ void apex::stop (apex_profiler_handle the_profiler); /* C */ void apex_stop (apex_profiler_handle the_profiler); The timer associated with the profiler object is stopped and placed on an internal queue to be processed by the profiler handler thread in the background. The profiler object is flagged as \"stopped\", so that when the profiler is processed the call count for this particular timer will be incremented by 1, unless the timer was started by apex_resume() (see below). The profiler handle will be freed internally by APEX after processing.","title":"Stopping a timer"},{"location":"spec/#yielding_a_timer","text":"/* C++ */ void apex::yield (apex_profiler_handle the_profiler); /* C */ void apex_yield (apex_profiler_handle the_profiler); The timer associated with the profiler object is stopped and placed on an internal queue to be processed by the profiler handler thread in the background. The profiler object is flagged as NOT stopped, so that when the profiler is processed the call count will NOT be incremented. An application using apex_yield should not use apex_resume to restart the timer; it should use apex_start instead. apex_yield() is intended for situations when the completion state of the task is known and the state is not complete. The profiler handle will be freed internally by APEX after processing.","title":"Yielding a timer"},{"location":"spec/#resuming_a_timer","text":"/* C++ */ apex_profiler_handle apex::resume (const std::string &timer_name); apex_profiler_handle apex::resume (const apex_function_address function_address); /* C */ apex_profiler_handle apex_resume (apex_profiler_type type, const void * identifier); Create an APEX timer and start it. An APEX profiler object is returned, containing an identifier that APEX uses to stop the timer.
The profiler is flagged as NOT a new task, so that when it is stopped by apex_stop the call count for this particular timer will not be incremented. apex_resume() is intended for situations when the completion state of a task is NOT known when control is returned to the task scheduler, but is known when an interrupted task is resumed.","title":"Resuming a timer"},{"location":"spec/#creating_a_new_task_dependency","text":"/* C++ */ void apex::new_task (std::string & name, const void * task_id); void apex::new_task (const apex_function_address function_address, const void * task_id); /* C */ void apex_new_task (apex_profiler_type type, const void * identifier, const void * task_id) Register the creation of a new task. This is used to track task dependencies in APEX. APEX assumes that the current APEX profiler refers to the task that is the parent of this new task. The task_id argument is a generic pointer to whatever data might need to be passed to a policy executed when a new task is created.","title":"Creating a new task dependency"},{"location":"spec/#sampling_a_value","text":"/* C++ */ void apex::sample_value (const std::string & name, const double value) /* C */ void apex_sample_value (const char * name, const double value); Record a measurement of the specified counter with the specified value. For example, \"bytes transferred\" and \"1024\".","title":"Sampling a value"},{"location":"spec/#setting_the_os_thread_state","text":"/* C++ */ void apex::set_state (apex_thread_state state); /* C */ void apex_set_state (apex_thread_state state); Set the state of the current OS thread.
States include idle, busy, waiting, throttled, and blocked.","title":"Setting the OS thread state"},{"location":"spec/#policy-related_methods_adaptation","text":"","title":"Policy-related methods (adaptation)"},{"location":"spec/#registering_an_event-based_policy_function","text":"/* C++ */ apex_policy_handle apex::register_policy (const apex_event_type when, std::function f); std::set apex::register_policy (std::set when, std::function f); /* C */ apex_policy_handle apex_register_policy (const apex_event_type when, int(*f)(apex_context const&)); APEX provides the ability to call an application-specified function when certain events occur in the APEX library, or periodically. This assigns the passed-in function to the event, so that when that event occurs in APEX, the function is called. The context for the event will be passed to the registered function. A set of events can also be used to register a policy function, which will return a set of policy handles. When any event in the set occurs, the function will be called.","title":"Registering an event-based policy function"},{"location":"spec/#registering_a_periodic_policy","text":"/* C++ */ apex_policy_handle apex::register_periodic_policy(const unsigned long period, std::function f); /* C */ apex_policy_handle apex_register_periodic_policy (const unsigned long period, int(*f)(apex_context const&)); APEX provides the ability to call an application-specified function periodically. This method assigns the passed-in function to be called on a periodic basis. The context for the event will be passed to the registered function. The period is specified in microseconds (us).","title":"Registering a periodic policy"},{"location":"spec/#de-registering_a_policy","text":"/* C++ */ apex::deregister_policy (apex_policy_handle handle); /* C */ apex_deregister_policy (apex_policy_handle handle); Remove the specified policy so that it will no longer be executed, whether it is event-based or periodic.
The calling code should not try to dereference the policy handle after this call, as the memory pointed to by the handle will be freed.","title":"De-registering a policy"},{"location":"spec/#registering_a_custom_event","text":"/* C++ */ apex_event_type apex::register_custom_event (const std::string & name); /* C */ apex_event_type apex_register_custom_event (const char * name); Register a new event type with APEX.","title":"Registering a custom event"},{"location":"spec/#trigger_a_custom_event","text":"/* C++ */ void apex::custom_event (apex_event_type event_type, const void * event_data); /* C */ void apex_custom_event (const char * name, const void * event_data); Trigger a custom event. This function will pass a custom event to the APEX event listeners. Each listener's custom event handler will handle the custom event. Policy functions will be passed the custom event name in the event context. The event data pointer is used to pass memory to the policy function from the code that triggered the event.","title":"Trigger a custom event"},{"location":"spec/#request_a_profile_from_apex","text":"/* C++ */ apex_profile * apex::get_profile (const std::string & name); apex_profile * apex::get_profile (const apex_function_address function_address); /* C */ apex_profile * apex_get_profile (apex_profiler_type type, const void * identifier) This function will return the current profile for the specified identifier. Because profiles are updated out-of-band, the returned profile values may be out of date. This profile can be either a timer or a sampled value.","title":"Request a profile from APEX"},{"location":"spec/#reset_a_profile","text":"/* C++ */ void apex::reset (const std::string & timer_name); void apex::reset (const apex_function_address function_address); /* C */ void apex_reset (apex_profiler_type type, const void * identifier) This function will reset the profile associated with the specified timer or counter id to zero.
If the identifier is null, all timers and counters will be reset.","title":"Reset a profile"},{"location":"spec/#concurrency_throttling_policy_functions","text":"","title":"Concurrency Throttling Policy Functions"},{"location":"spec/#setup_tuning_for_adaptation","text":"/* C++ */ apex_tuning_session_handle setup_custom_tuning(apex_tuning_request & request); apex_tuning_session_handle setup_custom_tuning(apex_tuning_request * request); Setup tuning of specified parameters to optimize for a custom metric, using multiple input criteria. This function will initialize a policy to optimize a custom metric, using the list of tunable parameters. The system tries to minimize the custom metric. After evaluating the state of the system, the policy will assign new values to the inputs.","title":"Setup tuning for adaptation"},{"location":"spec/#get_the_current_thread_cap","text":"/* C++ */ int apex::get_thread_cap (void); /* C */ int apex_get_thread_cap (void); This function will return the current thread cap based on the throttling policy.","title":"Get the current thread cap"},{"location":"spec/#set_the_current_thread_cap","text":"/* C++ */ void apex::set_thread_cap (int new_cap); /* C */ void apex_set_thread_cap (int new_cap); This function will set the current thread cap based on an external throttling policy.","title":"Set the current thread cap"},{"location":"spec/#event-based_api_ocr_legion_support_-_tbd","text":"The OCR and Legion runtimes teams have met to propose a common API for measuring asynchronous task-based runtimes. For more details, see https://github.com/UO-OACISS/apex/issues/37 . /* C++ */ apex::task_create (uint64_t parent_id) apex::dependency_reached (uint64_t event_id, uint64_t data_id, uint64_t task_id, uint64_t parent_id, ?) 
apex::task_ready (uint64_t why_ready) apex::task_execute (uint64_t why_delay, const apex_function_address function) apex::task_finished (uint64_t task_id) apex::task_destroy (uint64_t task_id) apex::data_create (uint64_t data_id) apex::data_new_size (uint64_t data_id) apex::data_move_from (uint64_t data_id, uint64_t target_location) apex::data_move_to (uint64_t data_id, uint64_t source_location) apex::data_replace (uint64_t data_id, uint64_t new_id) apex::data_destroy (uint64_t data_id) apex::event_create (uint64_t event_id, parent_task_id) apex::event_add_dependency (uint64_t event_id, uint64_t data_event_task_id, uint64_t parent_task_id) apex::event_trigger (uint64_t event_id) apex::event_destroy (uint64_t event_id) /* C API tbd */","title":"Event-based API (OCR, Legion support - TBD)"},{"location":"usage/","text":"Usage \u00b6 Tutorial \u00b6 For an APEX tutorial, please see https://github.com/khuck/apex-tutorial . Supported Runtime Systems \u00b6 HPX (Louisiana State University) \u00b6 HPX (High Performance ParalleX) is the original implementation of the ParalleX model. Developed and maintained by the Ste||ar Group at Louisiana State University, HPX is implemented in C++. For more information, see http://stellar-group.org/projects/hpx/ . For a tutorial on HPX with APEX (presented at SC'15, Austin TX) see https://github.com/khuck/SC15_APEX_tutorial (somewhat outdated). APEX is configured and built as part of HPX. In fact, you don't even need to download it separately; it is automatically checked out from GitHub as part of the HPX CMake configuration. However, you do need to pass the correct CMake options to the HPX configuration step. Configuring HPX with APEX \u00b6 See Installation with HPX . Running HPX with APEX \u00b6 See APEX Quickstart . OpenMP \u00b6 The OpenMP API supports multi-platform shared-memory parallel programming in C/C++ and Fortran.
The OpenMP API defines a portable, scalable model with a simple and flexible interface for developing parallel applications on platforms from the desktop to the supercomputer. For more information, see http://openmp.org/ . Configuring APEX for OpenMP OMPT support \u00b6 The CMake process will automatically detect whether your compiler has OpenMP support. If you configure APEX with -DUSE_OMPT=TRUE and have a compiler with full OpenMP 5.0 OMPT support, APEX will detect the support. If your compiler is GCC, Intel or Clang and does not have native OMPT support, APEX can build and use the open source LLVM OpenMP runtime as a drop-in replacement for the compiler's native runtime library, but this is no longer recommended and is deprecated. APEX uses Binutils to resolve the OpenMP outlined regions from instruction addresses to human-readable names, so also configure APEX with -DUSE_BFD=TRUE (see Other CMake Settings ). The following example was configured and run with Intel 20 compilers. The CMake configuration for this example was: cmake -DCMAKE_C_COMPILER=`which icc` -DCMAKE_CXX_COMPILER=`which icpc` -DCMAKE_BUILD_TYPE=Release -DCMAKE_INSTALL_PREFIX=../install -DBUILD_TESTS=TRUE -DUSE_BFD=TRUE -DBFD_ROOT=/usr/local/packages/binutils/2.34 -DUSE_OMPT=TRUE .. Running OpenMP applications with APEX \u00b6 Using the apex_exec wrapper script, execute the OpenMP program as normal: [khuck@delphi apex]$ ./install/bin/apex_exec --apex:screen --apex:taskgraph --apex:ompt build/src/unit_tests/C++/apex_openmp_cpp Program to run : build/src/unit_tests/C++/apex_openmp_cpp Initializing... No Sharing... Result: 2690568.772590 Elapsed time: 0.0398378 seconds Cores detected: 72 Worker Threads observed: 72 Available CPU time: 2.86832 seconds Counter : #samples | minimum | mean | maximum | stddev ------------------------------------------------------------------------------------------------ Iterations: OpenMP Work Loop: no_shari... 
: 71 1.05e+06 1.05e+06 1.05e+06 0.000 Iterations: OpenMP Work Loop: my_init(... : 144 1.05e+06 1.05e+06 1.05e+06 0.000 OpenMP Initial Thread : 1 1.000 1.000 1.000 0.000 OpenMP Worker Thread : 71 1.000 1.000 1.000 0.000 Single: OpenMP Work Single Executor: L... : 1 1.000 1.000 1.000 0.000 Single: OpenMP Work Single Executor: L... : 2 1.000 1.000 1.000 0.000 Single: OpenMP Work Single Other: L__Z... : 71 1.000 1.000 1.000 0.000 Single: OpenMP Work Single Other: L__Z... : 142 1.000 1.000 1.000 0.000 status:Threads : 1 3.000 3.000 3.000 0.000 status:VmData : 1 1.07e+05 1.07e+05 1.07e+05 0.000 status:VmExe : 1 20.000 20.000 20.000 0.000 status:VmHWM : 1 9356.000 9356.000 9356.000 0.000 status:VmLck : 1 0.000 0.000 0.000 0.000 status:VmLib : 1 4.39e+04 4.39e+04 4.39e+04 0.000 status:VmPTE : 1 128.000 128.000 128.000 0.000 status:VmPeak : 1 2.49e+05 2.49e+05 2.49e+05 0.000 status:VmPin : 1 0.000 0.000 0.000 0.000 status:VmRSS : 1 9356.000 9356.000 9356.000 0.000 status:VmSize : 1 1.84e+05 1.84e+05 1.84e+05 0.000 status:VmStk : 1 136.000 136.000 136.000 0.000 status:VmSwap : 1 0.000 0.000 0.000 0.000 status:nonvoluntary_ctxt_switches : 1 43.000 43.000 43.000 0.000 status:voluntary_ctxt_switches : 1 46.000 46.000 46.000 0.000 ------------------------------------------------------------------------------------------------ Timer : #calls | mean | total | % total ------------------------------------------------------------------------------------------------ APEX MAIN : 1 0.040 0.040 100.000 OpenMP Parallel Region: no_sharing(double*, doubl... : 1 0.006 0.006 0.211 OpenMP Parallel Region: my_init(double*) [{/home/... : 2 0.014 0.028 0.961 OpenMP Work Loop: no_sharing(double*, double*) [{... : 72 0.003 0.195 6.806 OpenMP Work Loop: my_init(double*) [{/home/users/... : 143 0.001 0.161 5.622 OpenMP Work Single Executor: L__Z10no_sharingPdS_... : 1 0.001 0.001 0.028 OpenMP Work Single Executor: L__Z7my_initPd_39__p... 
: 2 0.000 0.001 0.018 OpenMP Work Single Other: L__Z10no_sharingPdS__20... : 71 0.000 0.029 1.027 OpenMP Work Single Other: L__Z7my_initPd_39__par_... : 141 0.001 0.100 3.472 ------------------------------------------------------------------------------------------------ Total timers : 433 If GraphViz is installed on your system, the dot program will generate a taskgraph image based on the taskgraph.0.dot file that was generated by APEX: OpenACC \u00b6 Configuring APEX for OpenACC support \u00b6 Nothing special needs to be done to enable OpenACC support. If your compiler supports OpenACC (PGI, GCC 10+), then CMake will detect it and enable OpenACC support in APEX. In this example, APEX was configured with GCC 10.0.0: cmake -DCMAKE_C_COMPILER=`which gcc` -DCMAKE_CXX_COMPILER=`which g++` -DCMAKE_BUILD_TYPE=Release -DCMAKE_INSTALL_PREFIX=../install -DBUILD_TESTS=TRUE -DUSE_BFD=FALSE -DBFD_ROOT=/usr/local/packages/binutils/2.34 .. Running OpenACC programs with APEX \u00b6 Enabling OpenACC support requires setting the ACC_PROFLIB environment variable with the path to libapex.so , or by using the apex_exec script with the --apex:openacc flag: [khuck@gorgon apex]$ ./install/bin/apex_exec --apex:screen --apex:taskgraph --apex:openacc ./build/src/unit_tests/C/apex_openacc Program to run : ./build/src/unit_tests/C/apex_openacc Jacobi relaxation Calculation: 128 x 128 mesh Device API: none Device type: default Device vendor: -1 Device API: CUDA Device type: nvidia Device vendor: -1 0, 0.250000 Elapsed time: 0.451705 seconds Cores detected: 160 Worker Threads observed: 1 Available CPU time: 0.451705 seconds Counter : #samples | minimum | mean | maximum | stddev ------------------------------------------------------------------------------------------------ OpenACC Gangs : 200 1.000 2560.500 5120.000 2559.500 OpenACC Vector Lanes : 200 32.000 32.000 32.000 0.000 OpenACC Workers : 200 1.000 1.000 1.000 0.000 OpenACC device alloc (implicit) parall... 
: 301 15.000 889.206 2.62e+05 1.51e+04 OpenACC device free (implicit) paralle... : 301 0.000 0.000 0.000 0.000 OpenACC enqueue data transfer (HtoD) (... : 200 16.000 20.000 24.000 4.000 status:Threads : 1 3.000 3.000 3.000 0.000 status:VmData : 1 1.81e+04 1.81e+04 1.81e+04 0.000 status:VmExe : 1 64.000 64.000 64.000 0.000 status:VmHWM : 1 4416.000 4416.000 4416.000 0.000 status:VmLck : 1 0.000 0.000 0.000 0.000 status:VmLib : 1 8640.000 8640.000 8640.000 0.000 status:VmPMD : 1 16.000 16.000 16.000 0.000 status:VmPTE : 1 3.000 3.000 3.000 0.000 status:VmPeak : 1 1.59e+05 1.59e+05 1.59e+05 0.000 status:VmPin : 1 0.000 0.000 0.000 0.000 status:VmRSS : 1 4416.000 4416.000 4416.000 0.000 status:VmSize : 1 9.34e+04 9.34e+04 9.34e+04 0.000 status:VmStk : 1 192.000 192.000 192.000 0.000 status:VmSwap : 1 0.000 0.000 0.000 0.000 status:nonvoluntary_ctxt_switches : 1 0.000 0.000 0.000 0.000 status:voluntary_ctxt_switches : 1 46.000 46.000 46.000 0.000 ------------------------------------------------------------------------------------------------ Timer : #calls | mean | total | % total ------------------------------------------------------------------------------------------------ APEX MAIN : 1 0.452 0.452 100.000 OpenACC compute construct parallel : 200 0.001 0.215 47.492 OpenACC device init (implicit) parallel : 1 0.081 0.081 17.965 OpenACC enqueue data transfer (HtoD) (implicit) p... : 200 0.000 0.002 0.523 OpenACC enqueue launch: main$_omp_fn$0 (implicit)... : 100 0.000 0.001 0.288 OpenACC enqueue launch: main$_omp_fn$1 (implicit)... 
: 100 0.000 0.001 0.267 OpenACC enter data (implicit) parallel : 200 0.000 0.002 0.491 OpenACC enter data data : 1 0.000 0.000 0.078 OpenACC exit data (implicit) parallel : 200 0.000 0.003 0.733 OpenACC exit data data : 1 0.000 0.000 0.043 APEX Idle : 0.145 32.120 ------------------------------------------------------------------------------------------------ Total timers : 1003 CUDA \u00b6 Configuring APEX for CUDA support \u00b6 Enabling CUDA support in APEX requires the -DAPEX_WITH_CUDA=TRUE flag and the -DCUDA_ROOT=/path/to/cuda CMake variables at configuration time. CMake will look for the CUPTI and NVML libraries in the installation, and if found the support will be enabled. cmake -DCMAKE_C_COMPILER=`which gcc` -DCMAKE_CXX_COMPILER=`which g++` -DCMAKE_BUILD_TYPE=Release -DCMAKE_INSTALL_PREFIX=../install -DBUILD_TESTS=TRUE -DUSE_BFD=TRUE -DAPEX_WITH_CUDA=TRUE -DCUDA_ROOT=/usr/local/packages/cuda/10.2 -DBFD_ROOT=/usr/local/packages/binutils/2.34 .. Running CUDA programs with APEX \u00b6 Enabling CUDA support only requires using the apex_exec wrapper script. 
[khuck@gorgon apex]$ ./install/bin/apex_exec --apex:screen --apex:taskgraph --apex:cuda ./build/src/unit_tests/CUDA/apex_cuda_cu Program to run : ./build/src/unit_tests/CUDA/apex_cuda_cu On device: name=hello, value=10 On device: name=dello, value=11 On device: name=dello, value=12 On device: name=dello, value=13 On host: name=dello, value=14 Elapsed time: 0.410402 seconds Cores detected: 160 Worker Threads observed: 1 Available CPU time: 0.410402 seconds Counter : #samples | minimum | mean | maximum | stddev ------------------------------------------------------------------------------------------------ Device 0 GPU Clock Memory (MHz) : 1 877.000 877.000 877.000 0.000 Device 0 GPU Clock SM (MHz) : 1 135.000 135.000 135.000 0.000 Device 0 GPU Memory Free (MB) : 1 3.41e+04 3.41e+04 3.41e+04 0.000 Device 0 GPU Memory Used (MB) : 1 0.197 0.197 0.197 0.000 Device 0 GPU Memory Utilization % : 1 0.000 0.000 0.000 0.000 Device 0 GPU NvLink Link Count : 1 6.000 6.000 6.000 0.000 Device 0 GPU NvLink Speed MB/s : 1 2.58e+04 2.58e+04 2.58e+04 0.000 Device 0 GPU NvLink Utilization C0 : 1 0.000 0.000 0.000 0.000 Device 0 GPU NvLink Utilization C1 : 1 0.000 0.000 0.000 0.000 Device 0 GPU Power (W) : 1 38.912 38.912 38.912 0.000 Device 0 GPU Temperature (C) : 1 33.000 33.000 33.000 0.000 Device 0 GPU Utilization % : 1 0.000 0.000 0.000 0.000 Device 0 PCIe RX Throughput (MB/s) : 1 1.000 1.000 1.000 0.000 Device 0 PCIe TX Throughput (MB/s) : 1 3.000 3.000 3.000 0.000 GPU: Bytes Allocated : 2 6.000 11.000 16.000 5.000 status:Threads : 1 4.000 4.000 4.000 0.000 status:VmData : 1 5.72e+04 5.72e+04 5.72e+04 0.000 status:VmExe : 1 64.000 64.000 64.000 0.000 status:VmHWM : 1 1.77e+04 1.77e+04 1.77e+04 0.000 status:VmLck : 1 0.000 0.000 0.000 0.000 status:VmLib : 1 6.92e+04 6.92e+04 6.92e+04 0.000 status:VmPMD : 1 12.000 12.000 12.000 0.000 status:VmPTE : 1 7.000 7.000 7.000 0.000 status:VmPeak : 1 2.58e+05 2.58e+05 2.58e+05 0.000 status:VmPin : 1 0.000 0.000 0.000 0.000 status:VmRSS : 1 
1.77e+04 1.77e+04 1.77e+04 0.000 status:VmSize : 1 1.93e+05 1.93e+05 1.93e+05 0.000 status:VmStk : 1 192.000 192.000 192.000 0.000 status:VmSwap : 1 0.000 0.000 0.000 0.000 status:nonvoluntary_ctxt_switches : 1 0.000 0.000 0.000 0.000 status:voluntary_ctxt_switches : 1 102.000 102.000 102.000 0.000 ------------------------------------------------------------------------------------------------ Timer : #calls | mean | total | % total ------------------------------------------------------------------------------------------------ APEX MAIN : 1 0.410 0.410 100.000 GPU: Unified Memory copy DTOH : 1 0.000 0.000 0.001 GPU: Unified Memory copy HTOD : 1 0.000 0.000 0.001 GPU: Kernel(DataElement*) : 4 0.000 0.000 0.084 cudaDeviceSynchronize : 4 0.000 0.000 0.092 cudaFree : 2 0.000 0.000 0.045 cudaLaunchKernel : 4 0.000 0.000 0.007 cudaMallocManaged : 2 0.104 0.208 50.601 launch [/home/users/khuck/src/apex/src/unit_tests... : 4 0.001 0.003 0.798 APEX Idle : 0.199 48.371 ------------------------------------------------------------------------------------------------ Total timers : 22 To get additional information you can also enable the --apex:cuda_driver flag to see CUDA driver API calls, or enable the --apex:cuda_counters flag to enable CUDA counters. 
[khuck@gorgon apex]$ ./install/bin/apex_exec --apex:screen --apex:taskgraph --apex:cuda --apex:cuda_counters --apex:cuda_driver ./build/src/unit_tests/CUDA/apex_cuda_cu Program to run : ./build/src/unit_tests/CUDA/apex_cuda_cu On device: name=hello, value=10 On device: name=dello, value=11 On device: name=dello, value=12 On device: name=dello, value=13 On host: name=dello, value=14 Elapsed time: 0.309145 seconds Cores detected: 160 Worker Threads observed: 1 Available CPU time: 0.309145 seconds Counter : #samples | minimum | mean | maximum | stddev ------------------------------------------------------------------------------------------------ Device 0 GPU Clock Memory (MHz) : 1 877.000 877.000 877.000 0.000 Device 0 GPU Clock SM (MHz) : 1 135.000 135.000 135.000 0.000 Device 0 GPU Memory Free (MB) : 1 3.41e+04 3.41e+04 3.41e+04 0.000 Device 0 GPU Memory Used (MB) : 1 0.197 0.197 0.197 0.000 Device 0 GPU Memory Utilization % : 1 0.000 0.000 0.000 0.000 Device 0 GPU NvLink Link Count : 1 6.000 6.000 6.000 0.000 Device 0 GPU NvLink Speed MB/s : 1 2.58e+04 2.58e+04 2.58e+04 0.000 Device 0 GPU NvLink Utilization C0 : 1 0.000 0.000 0.000 0.000 Device 0 GPU NvLink Utilization C1 : 1 0.000 0.000 0.000 0.000 Device 0 GPU Power (W) : 1 38.912 38.912 38.912 0.000 Device 0 GPU Temperature (C) : 1 33.000 33.000 33.000 0.000 Device 0 GPU Utilization % : 1 0.000 0.000 0.000 0.000 Device 0 PCIe RX Throughput (MB/s) : 1 2.000 2.000 2.000 0.000 Device 0 PCIe TX Throughput (MB/s) : 1 3.000 3.000 3.000 0.000 GPU: Bandwith (GB/s) <- Unified Memory... : 1 18.618 18.618 18.618 0.000 GPU: Bandwith (GB/s) <- Unified Memory... 
: 1 11.770 11.770 11.770 0.000 GPU: Bytes <- Unified Memory copy DTOH : 1 6.55e+04 6.55e+04 6.55e+04 0.000 GPU: Bytes <- Unified Memory copy HTOD : 1 6.55e+04 6.55e+04 6.55e+04 0.000 GPU: Bytes Allocated : 3 0.000 7.333 16.000 6.600 GPU: Dynamic Shared Memory (B) : 4 0.000 0.000 0.000 0.000 GPU: Local Memory Per Thread (B) : 4 0.000 0.000 0.000 0.000 GPU: Local Memory Total (B) : 4 1.36e+08 1.36e+08 1.36e+08 0.000 GPU: Registers Per Thread : 4 32.000 32.000 32.000 0.000 GPU: Shared Memory Size (B) : 4 0.000 0.000 0.000 0.000 GPU: Static Shared Memory (B) : 4 0.000 0.000 0.000 0.000 Unified Memory CPU Page Fault Count : 2 1.000 1.000 1.000 0.000 Unified Memory GPU Page Fault Groups : 1 1.000 1.000 1.000 0.000 status:Threads : 1 4.000 4.000 4.000 0.000 status:VmData : 1 5.69e+04 5.69e+04 5.69e+04 0.000 status:VmExe : 1 64.000 64.000 64.000 0.000 status:VmHWM : 1 1.70e+04 1.70e+04 1.70e+04 0.000 status:VmLck : 1 0.000 0.000 0.000 0.000 status:VmLib : 1 6.92e+04 6.92e+04 6.92e+04 0.000 status:VmPMD : 1 16.000 16.000 16.000 0.000 status:VmPTE : 1 7.000 7.000 7.000 0.000 status:VmPeak : 1 2.58e+05 2.58e+05 2.58e+05 0.000 status:VmPin : 1 0.000 0.000 0.000 0.000 status:VmRSS : 1 1.70e+04 1.70e+04 1.70e+04 0.000 status:VmSize : 1 1.93e+05 1.93e+05 1.93e+05 0.000 status:VmStk : 1 192.000 192.000 192.000 0.000 status:VmSwap : 1 0.000 0.000 0.000 0.000 status:nonvoluntary_ctxt_switches : 1 0.000 0.000 0.000 0.000 status:voluntary_ctxt_switches : 1 100.000 100.000 100.000 0.000 ------------------------------------------------------------------------------------------------ Timer : #calls | mean | total | % total ------------------------------------------------------------------------------------------------ APEX MAIN : 1 0.309 0.309 100.000 GPU: Unified Memory copy DTOH : 1 0.000 0.000 0.001 GPU: Unified Memory copy HTOD : 1 0.000 0.000 0.002 GPU: Kernel(DataElement*) : 4 0.000 0.001 0.353 cuCtxGetCurrent : 2 0.000 0.000 0.002 cuCtxGetDevice : 1 0.000 0.000 0.001 
cuCtxSetCurrent : 1 0.000 0.000 0.001 cuCtxSynchronize : 4 0.000 0.001 0.349 cuDeviceGet : 4 0.000 0.000 0.002 cuDeviceGetAttribute : 376 0.000 0.002 0.754 cuDeviceGetCount : 1 0.000 0.000 0.008 cuDeviceGetName : 4 0.000 0.000 0.046 cuDeviceGetUuid : 4 0.000 0.000 0.002 cuDevicePrimaryCtxRetain : 1 0.111 0.111 35.773 cuDeviceTotalMem_v2 : 4 0.002 0.006 2.022 cuLaunchKernel : 4 0.000 0.000 0.005 cuMemAllocManaged : 2 0.012 0.024 7.743 cuMemFree_v2 : 2 0.000 0.000 0.051 cuModuleGetFunction : 1 0.000 0.000 0.005 cudaDeviceSynchronize : 4 0.000 0.001 0.361 cudaFree : 2 0.000 0.000 0.057 cudaLaunchKernel : 4 0.000 0.000 0.051 cudaMallocManaged : 2 0.060 0.120 38.773 launch [/home/users/khuck/src/apex/src/unit_tests... : 4 0.000 0.001 0.442 APEX Idle : 0.041 13.195 ------------------------------------------------------------------------------------------------ Total timers : 433 The following flags will enable different types of CUDA support: --apex:cuda enable CUDA/CUPTI measurement (default: off) --apex:cuda-counters enable CUDA/CUPTI counter support (default: off) --apex:cuda-driver enable CUDA driver API callbacks (default: off) --apex:cuda-details enable per-kernel statistics where available (default: off) --apex:monitor-gpu enable GPU monitoring services (CUDA NVML, ROCm SMI) HIP/ROCm \u00b6 APEX supports HIP measurement using the Roc* libraries provided by AMD. Configuring APEX for HIP support \u00b6 Enabling HIP support in APEX requires the -DAPEX_WITH_HIP=TRUE flag and the -DROCM_ROOT=/path/to/rocm CMake variables at configuration time. CMake will look for the profile/trace and smi libraries in the installation, and if found the support will be enabled. cmake -B build -DCMAKE_C_COMPILER=`which clang` -DCMAKE_CXX_COMPILER=`which hipcc` -DCMAKE_BUILD_TYPE=Release -DCMAKE_INSTALL_PREFIX=./install -DBUILD_TESTS=TRUE -DUSE_BFD=TRUE -DAPEX_WITH_HIP=TRUE -DROCM_ROOT=/opt/rocm-5.7.1 -DBFD_ROOT=/usr/local/packages/binutils/2.34 .. 
Running HIP programs with APEX \u00b6 Enabling HIP support only requires using the apex_exec wrapper script. The following flags will enable additional support: --apex:hip enable HIP/ROCTracer measurement (default: off) --apex:hip-metrics enable HIP/ROCProfiler metric support (default: off) --apex:hip-counters enable HIP/ROCTracer counter support (default: off) --apex:hip-driver enable HIP/ROCTracer KSA driver API callbacks (default: off) --apex:hip-details enable per-kernel statistics where available (default: off) --apex:monitor-gpu enable GPU monitoring services (CUDA NVML, ROCm SMI) Kokkos \u00b6 Configuring APEX for Kokkos support \u00b6 Like OpenACC, nothing special needs to be done to enable Kokkos support. Running Kokkos programs with APEX \u00b6 Enabling Kokkos support requires either setting the KOKKOS_PROFILE_LIBRARY environment variable to the path to libapex.so , or using the apex_exec script with the --apex:kokkos flag. We also recommend using the --apex:kokkos-fence option, which will time the full kernel execution, not just the time to launch a kernel, if the back-end activity is not measured by some other method (OMPT, CUDA, HIP, SYCL, OpenACC). APEX also has experimental autotuning support for Kokkos kernels, see https://github.com/UO-OACISS/apex/wiki/Using-APEX-with-Kokkos#autotuning-support . Configuring APEX for RAJA support \u00b6 Like OpenACC, nothing special needs to be done to enable RAJA support. Running RAJA programs with APEX \u00b6 Enabling RAJA support requires either setting the RAJA_PLUGINS environment variable to the path to libapex.so , or using the apex_exec script with the --apex:raja flag. 
The following flags will enable different types of Kokkos support: --apex:kokkos enable Kokkos support --apex:kokkos-tuning enable Kokkos runtime autotuning support --apex:kokkos-fence enable Kokkos fences for async kernels C++ Threads \u00b6 APEX supports C++ threads on Linux, with the assumption that they are implemented on top of POSIX threads. Configuring APEX for C++ Thread support \u00b6 Nothing special needs to be done to enable C++ thread support. Running C++ Thread programs with APEX \u00b6 Enabling C++ Thread support requires using the apex_exec script with the --apex:pthread flag. That will enable the preloading of a wrapper library to intercept pthread_create() calls. A sample program with C++ threads is in the APEX unit tests: khuck@Kevins-MacBook-Air build % ../install/bin/apex_exec --apex:pthread src/unit_tests/C++/apex_fibonacci_std_async_cpp Program to run : src/unit_tests/C++/apex_fibonacci_std_async_cpp usage: apex_fibonacci_std_async_cpp Using default value of 10 fib of 10 is 55 (valid value: 55) Elapsed time: 0.005359 seconds Cores detected: 8 Worker Threads observed: 178 Available CPU time: 0.042872 seconds Timer : #calls | mean | total | % total ------------------------------------------------------------------------------------------------ fib(int, std::__1::shared_ptr) : 177 0.001 0.171 --n/a-- APEX MAIN : 1 0.005 0.005 100.000 ------------------------------------------------------------------------------------------------ Total timers : 177 Note that APEX detected 178 total OS threads. That is because some C++ thread implementations (GCC, Clang, others) implement every std::async() call as a new OS thread, resulting in a pthread_create() call. Other Runtime Systems \u00b6 We are currently evaluating support for TBB, OpenCL, SYCL/DPC++/OneAPI, among others. Performance Measurement Features \u00b6 For all the following examples, we will use a simple CUDA program that is in the APEX unit tests. 
Profiling \u00b6 Profiling with APEX is the usual and most simple mode of operation. In order to profile an application and get a report at the end of execution, enable screen output (see Environment Variables for details) and run an application linked with the APEX library or with the apex_exec --apex:screen flag (enabled by default). The output should look like examples shown previously. [khuck@cyclops apex]$ export APEX_SCREEN_OUTPUT=1 [khuck@cyclops apex]$ ./build/src/unit_tests/CUDA/apex_cuda_cu Found 4 total devices On device: name=hello, value=10 On device: name=dello, value=11 On device: name=dello, value=12 On device: name=dello, value=13 On host: name=dello, value=14 Elapsed time: 0.46147 seconds Cores detected: 160 Worker Threads observed: 1 Available CPU time: 0.46147 seconds Counter : #samples | minimum | mean | maximum | stddev ------------------------------------------------------------------------------------------------ 1 Minute Load average : 1 13.320 13.320 13.320 0.000 Device 0 GPU Clock Memory (MHz) : 1 877.000 877.000 877.000 0.000 Device 0 GPU Clock SM (MHz) : 1 1530.000 1530.000 1530.000 0.000 Device 0 GPU Memory Free (MB) : 1 1.34e+04 1.34e+04 1.34e+04 0.000 Device 0 GPU Memory Used (MB) : 1 2.07e+04 2.07e+04 2.07e+04 0.000 Device 0 GPU Memory Utilization % : 1 48.000 48.000 48.000 0.000 Device 0 GPU NvLink Link Count : 1 6.000 6.000 6.000 0.000 Device 0 GPU NvLink Speed MB/s : 1 2.58e+04 2.58e+04 2.58e+04 0.000 Device 0 GPU NvLink Utilization C0 : 1 0.000 0.000 0.000 0.000 Device 0 GPU NvLink Utilization C1 : 1 0.000 0.000 0.000 0.000 Device 0 GPU Power (W) : 1 240.573 240.573 240.573 0.000 Device 0 GPU Temperature (C) : 1 73.000 73.000 73.000 0.000 Device 0 GPU Utilization % : 1 95.000 95.000 95.000 0.000 Device 0 PCIe RX Throughput (MB/s) : 1 5.000 5.000 5.000 0.000 Device 0 PCIe TX Throughput (MB/s) : 1 0.000 0.000 0.000 0.000 GPU: Bytes Allocated : 2 6.000 11.000 16.000 5.000 status:Threads : 1 7.000 7.000 7.000 0.000 status:VmData : 1 
2.77e+05 2.77e+05 2.77e+05 0.000 status:VmExe : 1 64.000 64.000 64.000 0.000 status:VmHWM : 1 2.19e+05 2.19e+05 2.19e+05 0.000 status:VmLck : 1 0.000 0.000 0.000 0.000 status:VmLib : 1 8.74e+04 8.74e+04 8.74e+04 0.000 status:VmPMD : 1 16.000 16.000 16.000 0.000 status:VmPTE : 1 35.000 35.000 35.000 0.000 status:VmPeak : 1 7.17e+05 7.17e+05 7.17e+05 0.000 status:VmPin : 1 1.67e+05 1.67e+05 1.67e+05 0.000 status:VmRSS : 1 2.19e+05 2.19e+05 2.19e+05 0.000 status:VmSize : 1 6.52e+05 6.52e+05 6.52e+05 0.000 status:VmStk : 1 192.000 192.000 192.000 0.000 status:VmSwap : 1 0.000 0.000 0.000 0.000 status:nonvoluntary_ctxt_switches : 1 9.000 9.000 9.000 0.000 status:voluntary_ctxt_switches : 1 1331.000 1331.000 1331.000 0.000 ------------------------------------------------------------------------------------------------ Timer : #calls | mean | total | % total ------------------------------------------------------------------------------------------------ APEX MAIN : 1 0.461 0.461 100.000 GPU: Unified Memcpy DTOH : 1 0.000 0.000 0.001 GPU: Unified Memcpy HTOD : 1 0.000 0.000 0.001 GPU: Kernel(DataElement*) : 4 0.000 0.000 0.086 cudaDeviceSynchronize : 4 0.000 0.001 0.169 cudaFree : 2 0.000 0.000 0.052 cudaLaunchKernel : 4 0.000 0.000 0.021 cudaMallocManaged : 2 0.135 0.269 58.397 launch [/home/users/khuck/src/apex/src/unit_tests... : 4 0.028 0.110 23.870 APEX Idle : 0.080 17.403 ------------------------------------------------------------------------------------------------ Total timers : 22 Profiling with CSV output \u00b6 To enable CSV output, use one of the methods described in the Environment Variables page, and run as the previous example. 
[khuck@cyclops apex]$ export APEX_CSV_OUTPUT=1 [khuck@cyclops apex]$ ./build/src/unit_tests/CUDA/apex_cuda_cu Found 4 total devices On device: name=hello, value=10 On device: name=dello, value=11 On device: name=dello, value=12 On device: name=dello, value=13 On host: name=dello, value=14 [khuck@cyclops apex]$ cat apex.0.csv \"counter\",\"num samples\",\"minimum\",\"mean\"\"maximum\",\"stddev\" \"1 Minute Load average\",1,22,22,22,0 \"Device 0 GPU Clock Memory (MHz)\",1,877,877,877,0 \"Device 0 GPU Clock SM (MHz)\",1,1530,1530,1530,0 \"Device 0 GPU Memory Free (MB)\",1,13411,13411,13411,0 \"Device 0 GPU Memory Used (MB)\",1,20679,20679,20679,0 \"Device 0 GPU Memory Utilization %\",1,58,58,58,0 \"Device 0 GPU NvLink Link Count\",1,6,6,6,0 \"Device 0 GPU NvLink Speed MB/s\",1,25781,25781,25781,0 \"Device 0 GPU NvLink Utilization C0\",1,0,0,0,0 \"Device 0 GPU NvLink Utilization C1\",1,0,0,0,0 \"Device 0 GPU Power (W)\",1,255,255,255,0 \"Device 0 GPU Temperature (C)\",1,75,75,75,0 \"Device 0 GPU Utilization %\",1,99,99,99,0 \"Device 0 PCIe RX Throughput (MB/s)\",1,7,7,7,0 \"Device 0 PCIe TX Throughput (MB/s)\",1,2,2,2,0 \"GPU: Bytes Allocated\",2,6,11,16,5 \"status:Threads\",1,7,7,7,0 \"status:VmData\",1,277120,277120,277120,0 \"status:VmExe\",1,64,64,64,0 \"status:VmHWM\",1,219008,219008,219008,0 \"status:VmLck\",1,0,0,0,0 \"status:VmLib\",1,87424,87424,87424,0 \"status:VmPMD\",1,16,16,16,0 \"status:VmPTE\",1,36,36,36,0 \"status:VmPeak\",1,717248,717248,717248,0 \"status:VmPin\",1,166528,166528,166528,0 \"status:VmRSS\",1,219008,219008,219008,0 \"status:VmSize\",1,652032,652032,652032,0 \"status:VmStk\",1,192,192,192,0 \"status:VmSwap\",1,0,0,0,0 \"status:nonvoluntary_ctxt_switches\",1,8,8,8,0 \"status:voluntary_ctxt_switches\",1,1276,1276,1276,0 \"task\",\"num calls\",\"total cycles\",\"total microseconds\" \"APEX MAIN\",1,0,431162 \"GPU: Unified Memcpy DTOH\",1,0,3 \"GPU: Unified Memcpy HTOD\",1,0,4 \"GPU: Kernel(DataElement*)\",4,0,1082 
\"cudaDeviceSynchronize\",4,0,9993 \"cudaFree\",2,0,172 \"cudaLaunchKernel\",4,0,66 \"cudaMallocManaged\",2,0,194367 \"launch [/home/users/khuck/src/apex/src/unit_tests/CUDA/apex_cuda.cu:35]\",4,0,164490 Profiling with TAU profile output \u00b6 To enable TAU profile output, use one of the methods described in the Environment Variables page, and run as the previous example. The output can be summarized with the TAU pprof command, which is installed with the TAU software. [khuck@cyclops apex]$ export APEX_CSV_OUTPUT=1 [khuck@cyclops apex]$ ./build/src/unit_tests/CUDA/apex_cuda_cu Found 4 total devices On device: name=hello, value=10 On device: name=dello, value=11 On device: name=dello, value=12 On device: name=dello, value=13 On host: name=dello, value=14 [khuck@cyclops apex]$ cat profile.0.0.0 9 templated_functions_MULTI_TIME # Name Calls Subrs Excl Incl ProfileCalls # \"GPU: Unified Memcpy DTOH\" 1 0 2.656 2.656 0 GROUP=\"TAU_USER\" \"cudaFree\" 2 0 193.18 193.18 0 GROUP=\"TAU_USER\" \"cudaMallocManaged\" 2 0 184435 184435 0 GROUP=\"TAU_USER\" \"GPU: Unified Memcpy HTOD\" 1 0 4.64 4.64 0 GROUP=\"TAU_USER\" \"GPU: Kernel(DataElement*)\" 4 0 355.293 355.293 0 GROUP=\"TAU_USER\" \"cudaLaunchKernel\" 4 0 67.4 67.4 0 GROUP=\"TAU_USER\" \"cudaDeviceSynchronize\" 4 0 811.244 811.244 0 GROUP=\"TAU_USER\" \"launch [/home/users/khuck/src/apex/src/unit_tests/CUDA/apex_cuda.cu:35]\" 4 0 100327 100327 0 GROUP=\"TAU_USER\" \"APEX MAIN\" 1 0 67830.2 354026 0 GROUP=\"TAU_USER\" 0 aggregates 32 userevents # eventname numevents max min mean sumsqr \"status:VmSwap\" 1 0 0 0 0 \"status:VmSize\" 1 652032 652032 652032 4.25146e+11 \"status:Threads\" 1 7 7 7 49 \"status:VmPeak\" 1 717248 717248 717248 5.14445e+11 \"Device 0 GPU Power (W)\" 1 224.057 224.057 224.057 50201.5 \"Device 0 GPU NvLink Speed MB/s\" 1 25781 25781 25781 6.6466e+08 \"status:VmExe\" 1 64 64 64 4096 \"status:nonvoluntary_ctxt_switches\" 1 12 12 12 144 \"Device 0 GPU Memory Utilization %\" 1 73 73 73 5329 
\"status:VmStk\" 1 192 192 192 36864 \"status:VmData\" 1 277120 277120 277120 7.67955e+10 \"status:VmLck\" 1 0 0 0 0 \"status:VmPin\" 1 166528 166528 166528 2.77316e+10 \"status:VmPTE\" 1 35 35 35 1225 \"Device 0 GPU NvLink Utilization C1\" 1 0 0 0 0 \"status:VmHWM\" 1 219008 219008 219008 4.79645e+10 \"status:VmRSS\" 1 219008 219008 219008 4.79645e+10 \"GPU: Bytes Allocated\" 2 16 6 11 292 \"status:VmLib\" 1 87424 87424 87424 7.64296e+09 \"Device 0 GPU Utilization %\" 1 99 99 99 9801 \"status:voluntary_ctxt_switches\" 1 1320 1320 1320 1.7424e+06 \"Device 0 GPU Clock SM (MHz)\" 1 1530 1530 1530 2.3409e+06 \"status:VmPMD\" 1 20 20 20 400 \"1 Minute Load average\" 1 16.43 16.43 16.43 269.945 \"Device 0 GPU Clock Memory (MHz)\" 1 877 877 877 769129 \"Device 0 PCIe TX Throughput (MB/s)\" 1 2 2 2 4 \"Device 0 GPU Temperature (C)\" 1 73 73 73 5329 \"Device 0 PCIe RX Throughput (MB/s)\" 1 6 6 6 36 \"Device 0 GPU Memory Used (MB)\" 1 20679.1 20679.1 20679.1 4.27625e+08 \"Device 0 GPU NvLink Utilization C0\" 1 0 0 0 0 \"Device 0 GPU NvLink Link Count\" 1 6 6 6 36 \"Device 0 GPU Memory Free (MB)\" 1 13410.6 13410.6 13410.6 1.79845e+08 [khuck@cyclops apex]$ which pprof ~/src/tau2/ibm64linux/bin/pprof [khuck@cyclops apex]$ pprof Reading Profile files in profile.* NODE 0;CONTEXT 0;THREAD 0: --------------------------------------------------------------------------------------- %Time Exclusive Inclusive #Call #Subrs Inclusive Name msec total msec usec/call --------------------------------------------------------------------------------------- 100.0 67 354 1 0 354026 APEX MAIN 52.1 184 184 2 0 92218 cudaMallocManaged 28.3 100 100 4 0 25082 launch [/home/users/khuck/src/apex/src/unit_tests/CUDA/apex_cuda.cu:35] 0.2 0.811 0.811 4 0 203 cudaDeviceSynchronize 0.1 0.355 0.355 4 0 89 GPU: Kernel(DataElement*) 0.1 0.193 0.193 2 0 97 cudaFree 0.0 0.0674 0.0674 4 0 17 cudaLaunchKernel 0.0 0.00464 0.00464 1 0 5 GPU: Unified Memcpy HTOD 0.0 0.00266 0.00266 1 0 3 GPU: Unified Memcpy DTOH 
--------------------------------------------------------------------------------------- USER EVENTS Profile :NODE 0, CONTEXT 0, THREAD 0 --------------------------------------------------------------------------------------- NumSamples MaxValue MinValue MeanValue Std. Dev. Event Name --------------------------------------------------------------------------------------- 1 16.43 16.43 16.43 0.01 1 Minute Load average 1 877 877 877 0 Device 0 GPU Clock Memory (MHz) 1 1530 1530 1530 0 Device 0 GPU Clock SM (MHz) 1 1.341E+04 1.341E+04 1.341E+04 28.42 Device 0 GPU Memory Free (MB) 1 2.068E+04 2.068E+04 2.068E+04 13.3 Device 0 GPU Memory Used (MB) 1 73 73 73 0 Device 0 GPU Memory Utilization % 1 6 6 6 0 Device 0 GPU NvLink Link Count 1 2.578E+04 2.578E+04 2.578E+04 6.245 Device 0 GPU NvLink Speed MB/s 1 0 0 0 0 Device 0 GPU NvLink Utilization C0 1 0 0 0 0 Device 0 GPU NvLink Utilization C1 1 224.1 224.1 224.1 0.1981 Device 0 GPU Power (W) 1 73 73 73 0 Device 0 GPU Temperature (C) 1 99 99 99 0 Device 0 GPU Utilization % 1 6 6 6 0 Device 0 PCIe RX Throughput (MB/s) 1 2 2 2 0 Device 0 PCIe TX Throughput (MB/s) 2 16 6 11 5 GPU: Bytes Allocated 1 7 7 7 0 status:Threads 1 2.771E+05 2.771E+05 2.771E+05 74.83 status:VmData 1 64 64 64 0 status:VmExe 1 2.19E+05 2.19E+05 2.19E+05 63.75 status:VmHWM 1 0 0 0 0 status:VmLck 1 8.742E+04 8.742E+04 8.742E+04 64.99 status:VmLib 1 20 20 20 0 status:VmPMD 1 35 35 35 0 status:VmPTE 1 7.172E+05 7.172E+05 7.172E+05 553.6 status:VmPeak 1 1.665E+05 1.665E+05 1.665E+05 158.8 status:VmPin 1 2.19E+05 2.19E+05 2.19E+05 63.75 status:VmRSS 1 6.52E+05 6.52E+05 6.52E+05 520.6 status:VmSize 1 192 192 192 0 status:VmStk 1 0 0 0 0 status:VmSwap 1 12 12 12 0 status:nonvoluntary_ctxt_switches 1 1320 1320 1320 0 status:voluntary_ctxt_switches --------------------------------------------------------------------------------------- Profiling with Taskgraph output \u00b6 APEX can capture the task dependency graph from the application, and output it as a GraphViz 
graph. The graph represents summarized task \"type\" dependencies, not a full dependency graph/tree with every task instance. [khuck@cyclops apex]$ apex_exec --apex:taskgraph --apex:cuda ./build/src/unit_tests/CUDA/apex_cuda_cu [khuck@cyclops apex]$ dot -Tpdf -O taskgraph.0.dot Profiling with Tasktree output \u00b6 APEX can capture the task dependency tree from the application, and output it as a GraphViz graph or ASCII. The graph represents summarized task \"type\" dependencies, not a full dependency graph/tree with every task instance. The difference between the graph and the tree is that in the tree, there are no cycles and child tasks have only one parent. [khuck@cyclops apex]$ apex_exec --apex:tasktree --apex:cuda ./build/src/unit_tests/CUDA/apex_cuda_cu [khuck@cyclops apex]$ apex-treesummary.py apex_tasktree.csv Profiling with Scatterplot output \u00b6 For this example, we are using an HPX quickstart example, the fibonacci example. After execution, APEX writes a sample data file to disk, apex_task_samples.csv . That file is post-processed with the APEX python script task_scatterplot.py . [khuck@cyclops apex]$ export APEX_TASK_SCATTERPLOT=1 [khuck@cyclops build]$ ./bin/fibonacci --n-value=20 [khuck@cyclops build]$ /home/users/khuck/src/apex/install/bin/task_scatterplot.py Parsed 2362 samples Plotting async_launch_policy_dispatch Plotting async_launch_policy_dispatch::call Plotting async Rendering... Profiling with OTF2 Trace output \u00b6 For this example, we are using an APEX unit test that computes the value of PI. OTF2 is the \"Open Trace Format v2\", used for tracing large scale HPC applications. For more information on OTF2 and associated tools, see The VI-HPS Score-P web site . Vampir is a commercial trace viewer that can be used to visualize and analyze OTF2 trace data. Traveler is an open source tool that can be used to visualize and analyze APEX OTF2 trace data. 
[khuck@cyclops apex]$ export APEX_OTF2=1 [khuck@cyclops apex]$ ./build/src/unit_tests/CUDA/pi_cu Found 4 total devices 134217728 num streams 4 making streams starting compute n is 0 num darts in circle 0: 105418094 pi is 3.141704 Closing OTF2 event files... Writing OTF2 definition files... Writing OTF2 Global definition file... Writing OTF2 Node information... Writing OTF2 Communicators... Closing the archive... done. [khuck@eagle apex]$ module load vampir [khuck@eagle apex]$ vampir OTF2_archive/APEX.otf2 Profiling with Google Trace Events Format output \u00b6 For this example, we are using an APEX unit test that computes the value of PI. Google Trace Events is a format developed by Google for tracing activity on devices, but is free and open and JSON based. For more information on Google Trace Events and associated tools, see the Google Trace Event Format document . The Google Chrome Web Browser can be used to visualize and analyze GTE trace data. [khuck@cyclops apex]$ export APEX_TRACE_EVENT=1 [khuck@cyclops apex]$ ./build/src/unit_tests/CUDA/pi_cu","title":"Usage"},{"location":"usage/#usage","text":"","title":"Usage"},{"location":"usage/#tutorial","text":"For an APEX tutorial, please see https://github.com/khuck/apex-tutorial .","title":"Tutorial"},{"location":"usage/#supported_runtime_systems","text":"","title":"Supported Runtime Systems"},{"location":"usage/#hpx_louisiana_state_university","text":"HPX (High Performance ParalleX) is the original implementation of the ParalleX model. Developed and maintained by the Ste||ar Group at Louisiana State University, HPX is implemented in C++. For more information, see http://stellar-group.org/projects/hpx/ . For a tutorial on HPX with APEX (presented at SC'15, Austin TX) see https://github.com/khuck/SC15_APEX_tutorial (somewhat outdated). APEX is configured and built as part of HPX. 
In fact, you don't even need to download it separately - it will be automatically checked out from GitHub as part of the HPX CMake configuration. However, you do need to pass the correct CMake options to the HPX configuration step.","title":"HPX (Louisiana State University)"},{"location":"usage/#configuring_hpx_with_apex","text":"See Installation with HPX .","title":"Configuring HPX with APEX"},{"location":"usage/#running_hpx_with_apex","text":"See APEX Quickstart .","title":"Running HPX with APEX"},{"location":"usage/#openmp","text":"The OpenMP API supports multi-platform shared-memory parallel programming in C/C++ and Fortran. The OpenMP API defines a portable, scalable model with a simple and flexible interface for developing parallel applications on platforms from the desktop to the supercomputer. For more information, see http://openmp.org/ .","title":"OpenMP"},{"location":"usage/#configuring_apex_for_openmp_ompt_support","text":"The CMake process will automatically detect whether your compiler has OpenMP support. If you configure APEX with -DUSE_OMPT=TRUE and have a compiler with full OpenMP 5.0 OMPT support, APEX will detect the support. If your compiler is GCC, Intel or Clang and does not have native OMPT support, APEX can build and use the open source LLVM OpenMP runtime as a drop-in replacement for the compiler's native runtime library, but this approach is deprecated and no longer recommended. APEX uses Binutils to resolve the OpenMP outlined regions from instruction addresses to human-readable names, so also configure APEX with -DUSE_BFD=TRUE (see Other CMake Settings ). The following example was configured and run with Intel 20 compilers. 
The CMake configuration for this example was: cmake -DCMAKE_C_COMPILER=`which icc` -DCMAKE_CXX_COMPILER=`which icpc` -DCMAKE_BUILD_TYPE=Release -DCMAKE_INSTALL_PREFIX=../install -DBUILD_TESTS=TRUE -DUSE_BFD=TRUE -DBFD_ROOT=/usr/local/packages/binutils/2.34 -DUSE_OMPT=TRUE ..","title":"Configuring APEX for OpenMP OMPT support"},{"location":"usage/#running_openmp_applications_with_apex","text":"Using the apex_exec wrapper script, execute the OpenMP program as normal: [khuck@delphi apex]$ ./install/bin/apex_exec --apex:screen --apex:taskgraph --apex:ompt build/src/unit_tests/C++/apex_openmp_cpp Program to run : build/src/unit_tests/C++/apex_openmp_cpp Initializing... No Sharing... Result: 2690568.772590 Elapsed time: 0.0398378 seconds Cores detected: 72 Worker Threads observed: 72 Available CPU time: 2.86832 seconds Counter : #samples | minimum | mean | maximum | stddev ------------------------------------------------------------------------------------------------ Iterations: OpenMP Work Loop: no_shari... : 71 1.05e+06 1.05e+06 1.05e+06 0.000 Iterations: OpenMP Work Loop: my_init(... : 144 1.05e+06 1.05e+06 1.05e+06 0.000 OpenMP Initial Thread : 1 1.000 1.000 1.000 0.000 OpenMP Worker Thread : 71 1.000 1.000 1.000 0.000 Single: OpenMP Work Single Executor: L... : 1 1.000 1.000 1.000 0.000 Single: OpenMP Work Single Executor: L... : 2 1.000 1.000 1.000 0.000 Single: OpenMP Work Single Other: L__Z... : 71 1.000 1.000 1.000 0.000 Single: OpenMP Work Single Other: L__Z... 
: 142 1.000 1.000 1.000 0.000 status:Threads : 1 3.000 3.000 3.000 0.000 status:VmData : 1 1.07e+05 1.07e+05 1.07e+05 0.000 status:VmExe : 1 20.000 20.000 20.000 0.000 status:VmHWM : 1 9356.000 9356.000 9356.000 0.000 status:VmLck : 1 0.000 0.000 0.000 0.000 status:VmLib : 1 4.39e+04 4.39e+04 4.39e+04 0.000 status:VmPTE : 1 128.000 128.000 128.000 0.000 status:VmPeak : 1 2.49e+05 2.49e+05 2.49e+05 0.000 status:VmPin : 1 0.000 0.000 0.000 0.000 status:VmRSS : 1 9356.000 9356.000 9356.000 0.000 status:VmSize : 1 1.84e+05 1.84e+05 1.84e+05 0.000 status:VmStk : 1 136.000 136.000 136.000 0.000 status:VmSwap : 1 0.000 0.000 0.000 0.000 status:nonvoluntary_ctxt_switches : 1 43.000 43.000 43.000 0.000 status:voluntary_ctxt_switches : 1 46.000 46.000 46.000 0.000 ------------------------------------------------------------------------------------------------ Timer : #calls | mean | total | % total ------------------------------------------------------------------------------------------------ APEX MAIN : 1 0.040 0.040 100.000 OpenMP Parallel Region: no_sharing(double*, doubl... : 1 0.006 0.006 0.211 OpenMP Parallel Region: my_init(double*) [{/home/... : 2 0.014 0.028 0.961 OpenMP Work Loop: no_sharing(double*, double*) [{... : 72 0.003 0.195 6.806 OpenMP Work Loop: my_init(double*) [{/home/users/... : 143 0.001 0.161 5.622 OpenMP Work Single Executor: L__Z10no_sharingPdS_... : 1 0.001 0.001 0.028 OpenMP Work Single Executor: L__Z7my_initPd_39__p... : 2 0.000 0.001 0.018 OpenMP Work Single Other: L__Z10no_sharingPdS__20... : 71 0.000 0.029 1.027 OpenMP Work Single Other: L__Z7my_initPd_39__par_... 
: 141 0.001 0.100 3.472 ------------------------------------------------------------------------------------------------ Total timers : 433 If GraphViz is installed on your system, the dot program will generate a taskgraph image based on the taskgraph.0.dot file that was generated by APEX:","title":"Running OpenMP applications with APEX"},{"location":"usage/#openacc","text":"","title":"OpenACC"},{"location":"usage/#configuring_apex_for_openacc_support","text":"Nothing special needs to be done to enable OpenACC support. If your compiler supports OpenACC (PGI, GCC 10+), then CMake will detect it and enable OpenACC support in APEX. In this example, APEX was configured with GCC 10.0.0: cmake -DCMAKE_C_COMPILER=`which gcc` -DCMAKE_CXX_COMPILER=`which g++` -DCMAKE_BUILD_TYPE=Release -DCMAKE_INSTALL_PREFIX=../install -DBUILD_TESTS=TRUE -DUSE_BFD=FALSE -DBFD_ROOT=/usr/local/packages/binutils/2.34 ..","title":"Configuring APEX for OpenACC support"},{"location":"usage/#running_openacc_programs_with_apex","text":"Enabling OpenACC support requires either setting the ACC_PROFLIB environment variable to the path to libapex.so , or using the apex_exec script with the --apex:openacc flag: [khuck@gorgon apex]$ ./install/bin/apex_exec --apex:screen --apex:taskgraph --apex:openacc ./build/src/unit_tests/C/apex_openacc Program to run : ./build/src/unit_tests/C/apex_openacc Jacobi relaxation Calculation: 128 x 128 mesh Device API: none Device type: default Device vendor: -1 Device API: CUDA Device type: nvidia Device vendor: -1 0, 0.250000 Elapsed time: 0.451705 seconds Cores detected: 160 Worker Threads observed: 1 Available CPU time: 0.451705 seconds Counter : #samples | minimum | mean | maximum | stddev ------------------------------------------------------------------------------------------------ OpenACC Gangs : 200 1.000 2560.500 5120.000 2559.500 OpenACC Vector Lanes : 200 32.000 32.000 32.000 0.000 OpenACC Workers : 200 1.000 1.000 1.000 0.000 OpenACC device alloc (implicit) 
parall... : 301 15.000 889.206 2.62e+05 1.51e+04 OpenACC device free (implicit) paralle... : 301 0.000 0.000 0.000 0.000 OpenACC enqueue data transfer (HtoD) (... : 200 16.000 20.000 24.000 4.000 status:Threads : 1 3.000 3.000 3.000 0.000 status:VmData : 1 1.81e+04 1.81e+04 1.81e+04 0.000 status:VmExe : 1 64.000 64.000 64.000 0.000 status:VmHWM : 1 4416.000 4416.000 4416.000 0.000 status:VmLck : 1 0.000 0.000 0.000 0.000 status:VmLib : 1 8640.000 8640.000 8640.000 0.000 status:VmPMD : 1 16.000 16.000 16.000 0.000 status:VmPTE : 1 3.000 3.000 3.000 0.000 status:VmPeak : 1 1.59e+05 1.59e+05 1.59e+05 0.000 status:VmPin : 1 0.000 0.000 0.000 0.000 status:VmRSS : 1 4416.000 4416.000 4416.000 0.000 status:VmSize : 1 9.34e+04 9.34e+04 9.34e+04 0.000 status:VmStk : 1 192.000 192.000 192.000 0.000 status:VmSwap : 1 0.000 0.000 0.000 0.000 status:nonvoluntary_ctxt_switches : 1 0.000 0.000 0.000 0.000 status:voluntary_ctxt_switches : 1 46.000 46.000 46.000 0.000 ------------------------------------------------------------------------------------------------ Timer : #calls | mean | total | % total ------------------------------------------------------------------------------------------------ APEX MAIN : 1 0.452 0.452 100.000 OpenACC compute construct parallel : 200 0.001 0.215 47.492 OpenACC device init (implicit) parallel : 1 0.081 0.081 17.965 OpenACC enqueue data transfer (HtoD) (implicit) p... : 200 0.000 0.002 0.523 OpenACC enqueue launch: main$_omp_fn$0 (implicit)... : 100 0.000 0.001 0.288 OpenACC enqueue launch: main$_omp_fn$1 (implicit)... 
: 100 0.000 0.001 0.267 OpenACC enter data (implicit) parallel : 200 0.000 0.002 0.491 OpenACC enter data data : 1 0.000 0.000 0.078 OpenACC exit data (implicit) parallel : 200 0.000 0.003 0.733 OpenACC exit data data : 1 0.000 0.000 0.043 APEX Idle : 0.145 32.120 ------------------------------------------------------------------------------------------------ Total timers : 1003","title":"Running OpenACC programs with APEX"},{"location":"usage/#cuda","text":"","title":"CUDA"},{"location":"usage/#configuring_apex_for_cuda_support","text":"Enabling CUDA support in APEX requires the -DAPEX_WITH_CUDA=TRUE flag and the -DCUDA_ROOT=/path/to/cuda CMake variables at configuration time. CMake will look for the CUPTI and NVML libraries in the installation, and if found the support will be enabled. cmake -DCMAKE_C_COMPILER=`which gcc` -DCMAKE_CXX_COMPILER=`which g++` -DCMAKE_BUILD_TYPE=Release -DCMAKE_INSTALL_PREFIX=../install -DBUILD_TESTS=TRUE -DUSE_BFD=TRUE -DAPEX_WITH_CUDA=TRUE -DCUDA_ROOT=/usr/local/packages/cuda/10.2 -DBFD_ROOT=/usr/local/packages/binutils/2.34 ..","title":"Configuring APEX for CUDA support"},{"location":"usage/#running_cuda_programs_with_apex","text":"Enabling CUDA support only requires using the apex_exec wrapper script. 
[khuck@gorgon apex]$ ./install/bin/apex_exec --apex:screen --apex:taskgraph --apex:cuda ./build/src/unit_tests/CUDA/apex_cuda_cu Program to run : ./build/src/unit_tests/CUDA/apex_cuda_cu On device: name=hello, value=10 On device: name=dello, value=11 On device: name=dello, value=12 On device: name=dello, value=13 On host: name=dello, value=14 Elapsed time: 0.410402 seconds Cores detected: 160 Worker Threads observed: 1 Available CPU time: 0.410402 seconds Counter : #samples | minimum | mean | maximum | stddev ------------------------------------------------------------------------------------------------ Device 0 GPU Clock Memory (MHz) : 1 877.000 877.000 877.000 0.000 Device 0 GPU Clock SM (MHz) : 1 135.000 135.000 135.000 0.000 Device 0 GPU Memory Free (MB) : 1 3.41e+04 3.41e+04 3.41e+04 0.000 Device 0 GPU Memory Used (MB) : 1 0.197 0.197 0.197 0.000 Device 0 GPU Memory Utilization % : 1 0.000 0.000 0.000 0.000 Device 0 GPU NvLink Link Count : 1 6.000 6.000 6.000 0.000 Device 0 GPU NvLink Speed MB/s : 1 2.58e+04 2.58e+04 2.58e+04 0.000 Device 0 GPU NvLink Utilization C0 : 1 0.000 0.000 0.000 0.000 Device 0 GPU NvLink Utilization C1 : 1 0.000 0.000 0.000 0.000 Device 0 GPU Power (W) : 1 38.912 38.912 38.912 0.000 Device 0 GPU Temperature (C) : 1 33.000 33.000 33.000 0.000 Device 0 GPU Utilization % : 1 0.000 0.000 0.000 0.000 Device 0 PCIe RX Throughput (MB/s) : 1 1.000 1.000 1.000 0.000 Device 0 PCIe TX Throughput (MB/s) : 1 3.000 3.000 3.000 0.000 GPU: Bytes Allocated : 2 6.000 11.000 16.000 5.000 status:Threads : 1 4.000 4.000 4.000 0.000 status:VmData : 1 5.72e+04 5.72e+04 5.72e+04 0.000 status:VmExe : 1 64.000 64.000 64.000 0.000 status:VmHWM : 1 1.77e+04 1.77e+04 1.77e+04 0.000 status:VmLck : 1 0.000 0.000 0.000 0.000 status:VmLib : 1 6.92e+04 6.92e+04 6.92e+04 0.000 status:VmPMD : 1 12.000 12.000 12.000 0.000 status:VmPTE : 1 7.000 7.000 7.000 0.000 status:VmPeak : 1 2.58e+05 2.58e+05 2.58e+05 0.000 status:VmPin : 1 0.000 0.000 0.000 0.000 status:VmRSS : 1 
1.77e+04 1.77e+04 1.77e+04 0.000 status:VmSize : 1 1.93e+05 1.93e+05 1.93e+05 0.000 status:VmStk : 1 192.000 192.000 192.000 0.000 status:VmSwap : 1 0.000 0.000 0.000 0.000 status:nonvoluntary_ctxt_switches : 1 0.000 0.000 0.000 0.000 status:voluntary_ctxt_switches : 1 102.000 102.000 102.000 0.000 ------------------------------------------------------------------------------------------------ Timer : #calls | mean | total | % total ------------------------------------------------------------------------------------------------ APEX MAIN : 1 0.410 0.410 100.000 GPU: Unified Memory copy DTOH : 1 0.000 0.000 0.001 GPU: Unified Memory copy HTOD : 1 0.000 0.000 0.001 GPU: Kernel(DataElement*) : 4 0.000 0.000 0.084 cudaDeviceSynchronize : 4 0.000 0.000 0.092 cudaFree : 2 0.000 0.000 0.045 cudaLaunchKernel : 4 0.000 0.000 0.007 cudaMallocManaged : 2 0.104 0.208 50.601 launch [/home/users/khuck/src/apex/src/unit_tests... : 4 0.001 0.003 0.798 APEX Idle : 0.199 48.371 ------------------------------------------------------------------------------------------------ Total timers : 22 To get additional information you can also enable the --apex:cuda_driver flag to see CUDA driver API calls, or enable the --apex:cuda_counters flag to enable CUDA counters. 
[khuck@gorgon apex]$ ./install/bin/apex_exec --apex:screen --apex:taskgraph --apex:cuda --apex:cuda_counters --apex:cuda_driver ./build/src/unit_tests/CUDA/apex_cuda_cu Program to run : ./build/src/unit_tests/CUDA/apex_cuda_cu On device: name=hello, value=10 On device: name=dello, value=11 On device: name=dello, value=12 On device: name=dello, value=13 On host: name=dello, value=14 Elapsed time: 0.309145 seconds Cores detected: 160 Worker Threads observed: 1 Available CPU time: 0.309145 seconds Counter : #samples | minimum | mean | maximum | stddev ------------------------------------------------------------------------------------------------ Device 0 GPU Clock Memory (MHz) : 1 877.000 877.000 877.000 0.000 Device 0 GPU Clock SM (MHz) : 1 135.000 135.000 135.000 0.000 Device 0 GPU Memory Free (MB) : 1 3.41e+04 3.41e+04 3.41e+04 0.000 Device 0 GPU Memory Used (MB) : 1 0.197 0.197 0.197 0.000 Device 0 GPU Memory Utilization % : 1 0.000 0.000 0.000 0.000 Device 0 GPU NvLink Link Count : 1 6.000 6.000 6.000 0.000 Device 0 GPU NvLink Speed MB/s : 1 2.58e+04 2.58e+04 2.58e+04 0.000 Device 0 GPU NvLink Utilization C0 : 1 0.000 0.000 0.000 0.000 Device 0 GPU NvLink Utilization C1 : 1 0.000 0.000 0.000 0.000 Device 0 GPU Power (W) : 1 38.912 38.912 38.912 0.000 Device 0 GPU Temperature (C) : 1 33.000 33.000 33.000 0.000 Device 0 GPU Utilization % : 1 0.000 0.000 0.000 0.000 Device 0 PCIe RX Throughput (MB/s) : 1 2.000 2.000 2.000 0.000 Device 0 PCIe TX Throughput (MB/s) : 1 3.000 3.000 3.000 0.000 GPU: Bandwith (GB/s) <- Unified Memory... : 1 18.618 18.618 18.618 0.000 GPU: Bandwith (GB/s) <- Unified Memory... 
: 1 11.770 11.770 11.770 0.000 GPU: Bytes <- Unified Memory copy DTOH : 1 6.55e+04 6.55e+04 6.55e+04 0.000 GPU: Bytes <- Unified Memory copy HTOD : 1 6.55e+04 6.55e+04 6.55e+04 0.000 GPU: Bytes Allocated : 3 0.000 7.333 16.000 6.600 GPU: Dynamic Shared Memory (B) : 4 0.000 0.000 0.000 0.000 GPU: Local Memory Per Thread (B) : 4 0.000 0.000 0.000 0.000 GPU: Local Memory Total (B) : 4 1.36e+08 1.36e+08 1.36e+08 0.000 GPU: Registers Per Thread : 4 32.000 32.000 32.000 0.000 GPU: Shared Memory Size (B) : 4 0.000 0.000 0.000 0.000 GPU: Static Shared Memory (B) : 4 0.000 0.000 0.000 0.000 Unified Memory CPU Page Fault Count : 2 1.000 1.000 1.000 0.000 Unified Memory GPU Page Fault Groups : 1 1.000 1.000 1.000 0.000 status:Threads : 1 4.000 4.000 4.000 0.000 status:VmData : 1 5.69e+04 5.69e+04 5.69e+04 0.000 status:VmExe : 1 64.000 64.000 64.000 0.000 status:VmHWM : 1 1.70e+04 1.70e+04 1.70e+04 0.000 status:VmLck : 1 0.000 0.000 0.000 0.000 status:VmLib : 1 6.92e+04 6.92e+04 6.92e+04 0.000 status:VmPMD : 1 16.000 16.000 16.000 0.000 status:VmPTE : 1 7.000 7.000 7.000 0.000 status:VmPeak : 1 2.58e+05 2.58e+05 2.58e+05 0.000 status:VmPin : 1 0.000 0.000 0.000 0.000 status:VmRSS : 1 1.70e+04 1.70e+04 1.70e+04 0.000 status:VmSize : 1 1.93e+05 1.93e+05 1.93e+05 0.000 status:VmStk : 1 192.000 192.000 192.000 0.000 status:VmSwap : 1 0.000 0.000 0.000 0.000 status:nonvoluntary_ctxt_switches : 1 0.000 0.000 0.000 0.000 status:voluntary_ctxt_switches : 1 100.000 100.000 100.000 0.000 ------------------------------------------------------------------------------------------------ Timer : #calls | mean | total | % total ------------------------------------------------------------------------------------------------ APEX MAIN : 1 0.309 0.309 100.000 GPU: Unified Memory copy DTOH : 1 0.000 0.000 0.001 GPU: Unified Memory copy HTOD : 1 0.000 0.000 0.002 GPU: Kernel(DataElement*) : 4 0.000 0.001 0.353 cuCtxGetCurrent : 2 0.000 0.000 0.002 cuCtxGetDevice : 1 0.000 0.000 0.001 
cuCtxSetCurrent : 1 0.000 0.000 0.001 cuCtxSynchronize : 4 0.000 0.001 0.349 cuDeviceGet : 4 0.000 0.000 0.002 cuDeviceGetAttribute : 376 0.000 0.002 0.754 cuDeviceGetCount : 1 0.000 0.000 0.008 cuDeviceGetName : 4 0.000 0.000 0.046 cuDeviceGetUuid : 4 0.000 0.000 0.002 cuDevicePrimaryCtxRetain : 1 0.111 0.111 35.773 cuDeviceTotalMem_v2 : 4 0.002 0.006 2.022 cuLaunchKernel : 4 0.000 0.000 0.005 cuMemAllocManaged : 2 0.012 0.024 7.743 cuMemFree_v2 : 2 0.000 0.000 0.051 cuModuleGetFunction : 1 0.000 0.000 0.005 cudaDeviceSynchronize : 4 0.000 0.001 0.361 cudaFree : 2 0.000 0.000 0.057 cudaLaunchKernel : 4 0.000 0.000 0.051 cudaMallocManaged : 2 0.060 0.120 38.773 launch [/home/users/khuck/src/apex/src/unit_tests... : 4 0.000 0.001 0.442 APEX Idle : 0.041 13.195 ------------------------------------------------------------------------------------------------ Total timers : 433 The following flags will enable different types of CUDA support: --apex:cuda enable CUDA/CUPTI measurement (default: off) --apex:cuda-counters enable CUDA/CUPTI counter support (default: off) --apex:cuda-driver enable CUDA driver API callbacks (default: off) --apex:cuda-details enable per-kernel statistics where available (default: off) --apex:monitor-gpu enable GPU monitoring services (CUDA NVML, ROCm SMI)","title":"Running CUDA programs with APEX"},{"location":"usage/#hiprocm","text":"APEX supports HIP measurement using the Roc* libraries provided by AMD.","title":"HIP/ROCm"},{"location":"usage/#configuring_apex_for_hip_support","text":"Enabling HIP support in APEX requires the -DAPEX_WITH_HIP=TRUE and -DROCM_ROOT=/path/to/rocm CMake variables at configuration time. CMake will look for the profile/trace and smi libraries in the installation, and if they are found the support will be enabled. 
cmake -B build -DCMAKE_C_COMPILER=`which clang` -DCMAKE_CXX_COMPILER=`which hipcc` -DCMAKE_BUILD_TYPE=Release -DCMAKE_INSTALL_PREFIX=./install -DBUILD_TESTS=TRUE -DUSE_BFD=TRUE -DAPEX_WITH_HIP=TRUE -DROCM_ROOT=/opt/rocm-5.7.1 -DBFD_ROOT=/usr/local/packages/binutils/2.34 ..","title":"Configuring APEX for HIP support"},{"location":"usage/#running_hip_programs_with_apex","text":"Enabling HIP support only requires using the apex_exec wrapper script. The following flags will enable additional support: --apex:hip enable HIP/ROCTracer measurement (default: off) --apex:hip-metrics enable HIP/ROCProfiler metric support (default: off) --apex:hip-counters enable HIP/ROCTracer counter support (default: off) --apex:hip-driver enable HIP/ROCTracer KSA driver API callbacks (default: off) --apex:hip-details enable per-kernel statistics where available (default: off) --apex:monitor-gpu enable GPU monitoring services (CUDA NVML, ROCm SMI)","title":"Running HIP programs with APEX"},{"location":"usage/#kokkos","text":"","title":"Kokkos"},{"location":"usage/#configuring_apex_for_kokkos_support","text":"Like OpenACC, nothing special needs to be done to enable Kokkos support.","title":"Configuring APEX for Kokkos support"},{"location":"usage/#running_kokkos_programs_with_apex","text":"Enabling Kokkos support requires either setting the KOKKOS_PROFILE_LIBRARY environment variable to the path to libapex.so , or using the apex_exec script with the --apex:kokkos flag. We also recommend using the --apex:kokkos-fence option, which will time the full kernel execution, not just the time to launch a kernel, if the back-end activity is not measured by some other method (OMPT, CUDA, HIP, SYCL, OpenACC). 
APEX also has experimental autotuning support for Kokkos kernels; see https://github.com/UO-OACISS/apex/wiki/Using-APEX-with-Kokkos#autotuning-support .","title":"Running Kokkos programs with APEX"},{"location":"usage/#configuring_apex_for_raja_support","text":"Like OpenACC, nothing special needs to be done to enable RAJA support.","title":"Configuring APEX for RAJA support"},{"location":"usage/#running_raja_programs_with_apex","text":"Enabling RAJA support requires either setting the RAJA_PLUGINS environment variable to the path to libapex.so , or using the apex_exec script with the --apex:raja flag. The following flags will enable different types of Kokkos support: --apex:kokkos enable Kokkos support --apex:kokkos-tuning enable Kokkos runtime autotuning support --apex:kokkos-fence enable Kokkos fences for async kernels","title":"Running RAJA programs with APEX"},{"location":"usage/#c_threads","text":"APEX supports C++ threads on Linux, with the assumption that they are implemented on top of POSIX threads.","title":"C++ Threads"},{"location":"usage/#configuring_apex_for_c_thread_support","text":"Nothing special needs to be done to enable C++ thread support.","title":"Configuring APEX for C++ Thread support"},{"location":"usage/#running_c_thread_programs_with_apex","text":"Enabling C++ Thread support requires using the apex_exec script with the --apex:pthread flag. That will enable the preloading of a wrapper library to intercept pthread_create() calls. 
A sample program with C++ threads is in the APEX unit tests: khuck@Kevins-MacBook-Air build % ../install/bin/apex_exec --apex:pthread src/unit_tests/C++/apex_fibonacci_std_async_cpp Program to run : src/unit_tests/C++/apex_fibonacci_std_async_cpp usage: apex_fibonacci_std_async_cpp Using default value of 10 fib of 10 is 55 (valid value: 55) Elapsed time: 0.005359 seconds Cores detected: 8 Worker Threads observed: 178 Available CPU time: 0.042872 seconds Timer : #calls | mean | total | % total ------------------------------------------------------------------------------------------------ fib(int, std::__1::shared_ptr) : 177 0.001 0.171 --n/a-- APEX MAIN : 1 0.005 0.005 100.000 ------------------------------------------------------------------------------------------------ Total timers : 177 Note that APEX detected 178 total OS threads. That is because some C++ thread implementations (GCC, Clang, others) implement every std::async() call as a new OS thread, resulting in a pthread_create() call.","title":"Running C++ Thread programs with APEX"},{"location":"usage/#other_runtime_systems","text":"We are currently evaluating support for TBB, OpenCL, SYCL/DPC++/OneAPI, among others.","title":"Other Runtime Systems"},{"location":"usage/#performance_measurement_features","text":"For all the following examples, we will use a simple CUDA program that is in the APEX unit tests.","title":"Performance Measurement Features"},{"location":"usage/#profiling","text":"Profiling is the usual and simplest mode of operation for APEX. To profile an application and get a report at the end of execution, enable screen output (see Environment Variables for details) and run an application linked with the APEX library, or run it with the apex_exec --apex:screen flag (enabled by default). The output should look like the examples shown previously. 
[khuck@cyclops apex]$ export APEX_SCREEN_OUTPUT=1 [khuck@cyclops apex]$ ./build/src/unit_tests/CUDA/apex_cuda_cu Found 4 total devices On device: name=hello, value=10 On device: name=dello, value=11 On device: name=dello, value=12 On device: name=dello, value=13 On host: name=dello, value=14 Elapsed time: 0.46147 seconds Cores detected: 160 Worker Threads observed: 1 Available CPU time: 0.46147 seconds Counter : #samples | minimum | mean | maximum | stddev ------------------------------------------------------------------------------------------------ 1 Minute Load average : 1 13.320 13.320 13.320 0.000 Device 0 GPU Clock Memory (MHz) : 1 877.000 877.000 877.000 0.000 Device 0 GPU Clock SM (MHz) : 1 1530.000 1530.000 1530.000 0.000 Device 0 GPU Memory Free (MB) : 1 1.34e+04 1.34e+04 1.34e+04 0.000 Device 0 GPU Memory Used (MB) : 1 2.07e+04 2.07e+04 2.07e+04 0.000 Device 0 GPU Memory Utilization % : 1 48.000 48.000 48.000 0.000 Device 0 GPU NvLink Link Count : 1 6.000 6.000 6.000 0.000 Device 0 GPU NvLink Speed MB/s : 1 2.58e+04 2.58e+04 2.58e+04 0.000 Device 0 GPU NvLink Utilization C0 : 1 0.000 0.000 0.000 0.000 Device 0 GPU NvLink Utilization C1 : 1 0.000 0.000 0.000 0.000 Device 0 GPU Power (W) : 1 240.573 240.573 240.573 0.000 Device 0 GPU Temperature (C) : 1 73.000 73.000 73.000 0.000 Device 0 GPU Utilization % : 1 95.000 95.000 95.000 0.000 Device 0 PCIe RX Throughput (MB/s) : 1 5.000 5.000 5.000 0.000 Device 0 PCIe TX Throughput (MB/s) : 1 0.000 0.000 0.000 0.000 GPU: Bytes Allocated : 2 6.000 11.000 16.000 5.000 status:Threads : 1 7.000 7.000 7.000 0.000 status:VmData : 1 2.77e+05 2.77e+05 2.77e+05 0.000 status:VmExe : 1 64.000 64.000 64.000 0.000 status:VmHWM : 1 2.19e+05 2.19e+05 2.19e+05 0.000 status:VmLck : 1 0.000 0.000 0.000 0.000 status:VmLib : 1 8.74e+04 8.74e+04 8.74e+04 0.000 status:VmPMD : 1 16.000 16.000 16.000 0.000 status:VmPTE : 1 35.000 35.000 35.000 0.000 status:VmPeak : 1 7.17e+05 7.17e+05 7.17e+05 0.000 status:VmPin : 1 1.67e+05 1.67e+05 
1.67e+05 0.000 status:VmRSS : 1 2.19e+05 2.19e+05 2.19e+05 0.000 status:VmSize : 1 6.52e+05 6.52e+05 6.52e+05 0.000 status:VmStk : 1 192.000 192.000 192.000 0.000 status:VmSwap : 1 0.000 0.000 0.000 0.000 status:nonvoluntary_ctxt_switches : 1 9.000 9.000 9.000 0.000 status:voluntary_ctxt_switches : 1 1331.000 1331.000 1331.000 0.000 ------------------------------------------------------------------------------------------------ Timer : #calls | mean | total | % total ------------------------------------------------------------------------------------------------ APEX MAIN : 1 0.461 0.461 100.000 GPU: Unified Memcpy DTOH : 1 0.000 0.000 0.001 GPU: Unified Memcpy HTOD : 1 0.000 0.000 0.001 GPU: Kernel(DataElement*) : 4 0.000 0.000 0.086 cudaDeviceSynchronize : 4 0.000 0.001 0.169 cudaFree : 2 0.000 0.000 0.052 cudaLaunchKernel : 4 0.000 0.000 0.021 cudaMallocManaged : 2 0.135 0.269 58.397 launch [/home/users/khuck/src/apex/src/unit_tests... : 4 0.028 0.110 23.870 APEX Idle : 0.080 17.403 ------------------------------------------------------------------------------------------------ Total timers : 22","title":"Profiling"},{"location":"usage/#profiling_with_csv_output","text":"To enable CSV output, use one of the methods described in the Environment Variables page, and run as the previous example. 
[khuck@cyclops apex]$ export APEX_CSV_OUTPUT=1 [khuck@cyclops apex]$ ./build/src/unit_tests/CUDA/apex_cuda_cu Found 4 total devices On device: name=hello, value=10 On device: name=dello, value=11 On device: name=dello, value=12 On device: name=dello, value=13 On host: name=dello, value=14 [khuck@cyclops apex]$ cat apex.0.csv \"counter\",\"num samples\",\"minimum\",\"mean\"\"maximum\",\"stddev\" \"1 Minute Load average\",1,22,22,22,0 \"Device 0 GPU Clock Memory (MHz)\",1,877,877,877,0 \"Device 0 GPU Clock SM (MHz)\",1,1530,1530,1530,0 \"Device 0 GPU Memory Free (MB)\",1,13411,13411,13411,0 \"Device 0 GPU Memory Used (MB)\",1,20679,20679,20679,0 \"Device 0 GPU Memory Utilization %\",1,58,58,58,0 \"Device 0 GPU NvLink Link Count\",1,6,6,6,0 \"Device 0 GPU NvLink Speed MB/s\",1,25781,25781,25781,0 \"Device 0 GPU NvLink Utilization C0\",1,0,0,0,0 \"Device 0 GPU NvLink Utilization C1\",1,0,0,0,0 \"Device 0 GPU Power (W)\",1,255,255,255,0 \"Device 0 GPU Temperature (C)\",1,75,75,75,0 \"Device 0 GPU Utilization %\",1,99,99,99,0 \"Device 0 PCIe RX Throughput (MB/s)\",1,7,7,7,0 \"Device 0 PCIe TX Throughput (MB/s)\",1,2,2,2,0 \"GPU: Bytes Allocated\",2,6,11,16,5 \"status:Threads\",1,7,7,7,0 \"status:VmData\",1,277120,277120,277120,0 \"status:VmExe\",1,64,64,64,0 \"status:VmHWM\",1,219008,219008,219008,0 \"status:VmLck\",1,0,0,0,0 \"status:VmLib\",1,87424,87424,87424,0 \"status:VmPMD\",1,16,16,16,0 \"status:VmPTE\",1,36,36,36,0 \"status:VmPeak\",1,717248,717248,717248,0 \"status:VmPin\",1,166528,166528,166528,0 \"status:VmRSS\",1,219008,219008,219008,0 \"status:VmSize\",1,652032,652032,652032,0 \"status:VmStk\",1,192,192,192,0 \"status:VmSwap\",1,0,0,0,0 \"status:nonvoluntary_ctxt_switches\",1,8,8,8,0 \"status:voluntary_ctxt_switches\",1,1276,1276,1276,0 \"task\",\"num calls\",\"total cycles\",\"total microseconds\" \"APEX MAIN\",1,0,431162 \"GPU: Unified Memcpy DTOH\",1,0,3 \"GPU: Unified Memcpy HTOD\",1,0,4 \"GPU: Kernel(DataElement*)\",4,0,1082 
\"cudaDeviceSynchronize\",4,0,9993 \"cudaFree\",2,0,172 \"cudaLaunchKernel\",4,0,66 \"cudaMallocManaged\",2,0,194367 \"launch [/home/users/khuck/src/apex/src/unit_tests/CUDA/apex_cuda.cu:35]\",4,0,164490","title":"Profiling with CSV output"},{"location":"usage/#profiling_with_tau_profile_output","text":"To enable TAU profile output, use one of the methods described in the Environment Variables page, and run as the previous example. The output can be summarized with the TAU pprof command, which is installed with the TAU software. [khuck@cyclops apex]$ export APEX_CSV_OUTPUT=1 [khuck@cyclops apex]$ ./build/src/unit_tests/CUDA/apex_cuda_cu Found 4 total devices On device: name=hello, value=10 On device: name=dello, value=11 On device: name=dello, value=12 On device: name=dello, value=13 On host: name=dello, value=14 [khuck@cyclops apex]$ cat profile.0.0.0 9 templated_functions_MULTI_TIME # Name Calls Subrs Excl Incl ProfileCalls # \"GPU: Unified Memcpy DTOH\" 1 0 2.656 2.656 0 GROUP=\"TAU_USER\" \"cudaFree\" 2 0 193.18 193.18 0 GROUP=\"TAU_USER\" \"cudaMallocManaged\" 2 0 184435 184435 0 GROUP=\"TAU_USER\" \"GPU: Unified Memcpy HTOD\" 1 0 4.64 4.64 0 GROUP=\"TAU_USER\" \"GPU: Kernel(DataElement*)\" 4 0 355.293 355.293 0 GROUP=\"TAU_USER\" \"cudaLaunchKernel\" 4 0 67.4 67.4 0 GROUP=\"TAU_USER\" \"cudaDeviceSynchronize\" 4 0 811.244 811.244 0 GROUP=\"TAU_USER\" \"launch [/home/users/khuck/src/apex/src/unit_tests/CUDA/apex_cuda.cu:35]\" 4 0 100327 100327 0 GROUP=\"TAU_USER\" \"APEX MAIN\" 1 0 67830.2 354026 0 GROUP=\"TAU_USER\" 0 aggregates 32 userevents # eventname numevents max min mean sumsqr \"status:VmSwap\" 1 0 0 0 0 \"status:VmSize\" 1 652032 652032 652032 4.25146e+11 \"status:Threads\" 1 7 7 7 49 \"status:VmPeak\" 1 717248 717248 717248 5.14445e+11 \"Device 0 GPU Power (W)\" 1 224.057 224.057 224.057 50201.5 \"Device 0 GPU NvLink Speed MB/s\" 1 25781 25781 25781 6.6466e+08 \"status:VmExe\" 1 64 64 64 4096 \"status:nonvoluntary_ctxt_switches\" 1 12 12 12 144 
\"Device 0 GPU Memory Utilization %\" 1 73 73 73 5329 \"status:VmStk\" 1 192 192 192 36864 \"status:VmData\" 1 277120 277120 277120 7.67955e+10 \"status:VmLck\" 1 0 0 0 0 \"status:VmPin\" 1 166528 166528 166528 2.77316e+10 \"status:VmPTE\" 1 35 35 35 1225 \"Device 0 GPU NvLink Utilization C1\" 1 0 0 0 0 \"status:VmHWM\" 1 219008 219008 219008 4.79645e+10 \"status:VmRSS\" 1 219008 219008 219008 4.79645e+10 \"GPU: Bytes Allocated\" 2 16 6 11 292 \"status:VmLib\" 1 87424 87424 87424 7.64296e+09 \"Device 0 GPU Utilization %\" 1 99 99 99 9801 \"status:voluntary_ctxt_switches\" 1 1320 1320 1320 1.7424e+06 \"Device 0 GPU Clock SM (MHz)\" 1 1530 1530 1530 2.3409e+06 \"status:VmPMD\" 1 20 20 20 400 \"1 Minute Load average\" 1 16.43 16.43 16.43 269.945 \"Device 0 GPU Clock Memory (MHz)\" 1 877 877 877 769129 \"Device 0 PCIe TX Throughput (MB/s)\" 1 2 2 2 4 \"Device 0 GPU Temperature (C)\" 1 73 73 73 5329 \"Device 0 PCIe RX Throughput (MB/s)\" 1 6 6 6 36 \"Device 0 GPU Memory Used (MB)\" 1 20679.1 20679.1 20679.1 4.27625e+08 \"Device 0 GPU NvLink Utilization C0\" 1 0 0 0 0 \"Device 0 GPU NvLink Link Count\" 1 6 6 6 36 \"Device 0 GPU Memory Free (MB)\" 1 13410.6 13410.6 13410.6 1.79845e+08 [khuck@cyclops apex]$ which pprof ~/src/tau2/ibm64linux/bin/pprof [khuck@cyclops apex]$ pprof Reading Profile files in profile.* NODE 0;CONTEXT 0;THREAD 0: --------------------------------------------------------------------------------------- %Time Exclusive Inclusive #Call #Subrs Inclusive Name msec total msec usec/call --------------------------------------------------------------------------------------- 100.0 67 354 1 0 354026 APEX MAIN 52.1 184 184 2 0 92218 cudaMallocManaged 28.3 100 100 4 0 25082 launch [/home/users/khuck/src/apex/src/unit_tests/CUDA/apex_cuda.cu:35] 0.2 0.811 0.811 4 0 203 cudaDeviceSynchronize 0.1 0.355 0.355 4 0 89 GPU: Kernel(DataElement*) 0.1 0.193 0.193 2 0 97 cudaFree 0.0 0.0674 0.0674 4 0 17 cudaLaunchKernel 0.0 0.00464 0.00464 1 0 5 GPU: Unified Memcpy HTOD 
0.0 0.00266 0.00266 1 0 3 GPU: Unified Memcpy DTOH --------------------------------------------------------------------------------------- USER EVENTS Profile :NODE 0, CONTEXT 0, THREAD 0 --------------------------------------------------------------------------------------- NumSamples MaxValue MinValue MeanValue Std. Dev. Event Name --------------------------------------------------------------------------------------- 1 16.43 16.43 16.43 0.01 1 Minute Load average 1 877 877 877 0 Device 0 GPU Clock Memory (MHz) 1 1530 1530 1530 0 Device 0 GPU Clock SM (MHz) 1 1.341E+04 1.341E+04 1.341E+04 28.42 Device 0 GPU Memory Free (MB) 1 2.068E+04 2.068E+04 2.068E+04 13.3 Device 0 GPU Memory Used (MB) 1 73 73 73 0 Device 0 GPU Memory Utilization % 1 6 6 6 0 Device 0 GPU NvLink Link Count 1 2.578E+04 2.578E+04 2.578E+04 6.245 Device 0 GPU NvLink Speed MB/s 1 0 0 0 0 Device 0 GPU NvLink Utilization C0 1 0 0 0 0 Device 0 GPU NvLink Utilization C1 1 224.1 224.1 224.1 0.1981 Device 0 GPU Power (W) 1 73 73 73 0 Device 0 GPU Temperature (C) 1 99 99 99 0 Device 0 GPU Utilization % 1 6 6 6 0 Device 0 PCIe RX Throughput (MB/s) 1 2 2 2 0 Device 0 PCIe TX Throughput (MB/s) 2 16 6 11 5 GPU: Bytes Allocated 1 7 7 7 0 status:Threads 1 2.771E+05 2.771E+05 2.771E+05 74.83 status:VmData 1 64 64 64 0 status:VmExe 1 2.19E+05 2.19E+05 2.19E+05 63.75 status:VmHWM 1 0 0 0 0 status:VmLck 1 8.742E+04 8.742E+04 8.742E+04 64.99 status:VmLib 1 20 20 20 0 status:VmPMD 1 35 35 35 0 status:VmPTE 1 7.172E+05 7.172E+05 7.172E+05 553.6 status:VmPeak 1 1.665E+05 1.665E+05 1.665E+05 158.8 status:VmPin 1 2.19E+05 2.19E+05 2.19E+05 63.75 status:VmRSS 1 6.52E+05 6.52E+05 6.52E+05 520.6 status:VmSize 1 192 192 192 0 status:VmStk 1 0 0 0 0 status:VmSwap 1 12 12 12 0 status:nonvoluntary_ctxt_switches 1 1320 1320 1320 0 status:voluntary_ctxt_switches ---------------------------------------------------------------------------------------","title":"Profiling with TAU profile 
output"},{"location":"usage/#profiling_with_taskgraph_output","text":"APEX can capture the task dependency graph from the application, and output it as a GraphViz graph. The graph represents summarized task \"type\" dependencies, not a full dependency graph/tree with every task instance. [khuck@cyclops apex]$ apex_exec --apex:taskgraph --apex:cuda ./build/src/unit_tests/CUDA/apex_cuda_cu [khuck@cyclops apex]$ dot -Tpdf -O taskgraph.0.dot","title":"Profiling with Taskgraph output"},{"location":"usage/#profiling_with_tasktree_output","text":"APEX can capture the task dependency tree from the application, and output it as a GraphViz graph or ASCII. The graph represents summarized task \"type\" dependencies, not a full dependency graph/tree with every task instance. The difference between the graph and the tree is that in the tree, there are no cycles and child tasks have only one parent. [khuck@cyclops apex]$ apex_exec --apex:tasktree --apex:cuda ./build/src/unit_tests/CUDA/apex_cuda_cu [khuck@cyclops apex]$ apex-treesummary.py apex_tasktree.csv","title":"Profiling with Tasktree output"},{"location":"usage/#profiling_with_scatterplot_output","text":"For this example, we are using an HPX quickstart example, the fibonacci example. After execution, APEX writes a sample data file to disk, apex_task_samples.csv . That file is post-processed with the APEX python script task_scatterplot.py . [khuck@cyclops apex]$ export APEX_TASK_SCATTERPLOT=1 [khuck@cyclops build]$ ./bin/fibonacci --n-value=20 [khuck@cyclops build]$ /home/users/khuck/src/apex/install/bin/task_scatterplot.py Parsed 2362 samples Plotting async_launch_policy_dispatch Plotting async_launch_policy_dispatch::call Plotting async Rendering...","title":"Profiling with Scatterplot output"},{"location":"usage/#profiling_with_otf2_trace_output","text":"For this example, we are using an APEX unit test that computes the value of PI. OTF2 is the \"Open Trace Format v2\", used for tracing large scale HPC applications. 
For more information on OTF2 and associated tools, see The VI-HPS Score-P web site . Vampir is a commercial trace viewer that can be used to visualize and analyze OTF2 trace data. Traveler is an open source tool that can be used to visualize and analyze APEX OTF2 trace data. [khuck@cyclops apex]$ export APEX_OTF2=1 [khuck@cyclops apex]$ ./build/src/unit_tests/CUDA/pi_cu Found 4 total devices 134217728 num streams 4 making streams starting compute n is 0 num darts in circle 0: 105418094 pi is 3.141704 Closing OTF2 event files... Writing OTF2 definition files... Writing OTF2 Global definition file... Writing OTF2 Node information... Writing OTF2 Communicators... Closing the archive... done. [khuck@eagle apex]$ module load vampir [khuck@eagle apex]$ vampir OTF2_archive/APEX.otf2","title":"Profiling with OTF2 Trace output"},{"location":"usage/#profiling_with_google_trace_events_format_output","text":"For this example, we are using an APEX unit test that computes the value of PI. Google Trace Events is a format developed by Google for tracing activity on devices, but is free and open and JSON based. For more information on Google Trace Events and associated tools, see the Google Trace Event Format document . The Google Chrome Web Browser can be used to visualize and analyze GTE trace data. [khuck@cyclops apex]$ export APEX_TRACE_EVENT=1 [khuck@cyclops apex]$ ./build/src/unit_tests/CUDA/pi_cu","title":"Profiling with Google Trace Events Format output"},{"location":"usecases/","text":"Before you start \u00b6 All examples on this page assume you have downloaded, configured and built APEX. See the Getting Started page for instructions on how to do that. Simple example \u00b6 In the APEX installation directory, there is a bin directory. In the bin directory are a number of examples, one of which is a simple matrix multiplication example, matmult . To run the matmult example, simply type 'matmult'. 
The output should be something like this: khuck@ktau:~/src/apex/install/bin$ ./matmult Spawned thread 1... Spawned thread 2... Spawned thread 3... Done. Not very interesting, eh? To see what APEX measured, set the APEX_SCREEN_OUTPUT environment variable to 1, and run it again: khuck@ktau:~/src/apex/install/bin$ export APEX_SCREEN_OUTPUT=1 khuck@ktau:~/src/apex/install/bin$ ./matmult v0.1-e050e17-master Built on: 14:38:56 Dec 22 2015 C++ Language Standard version : 201402 GCC Compiler version : 5.2.1 20151010 APEX_TAU : 0 APEX_POLICY : 1 APEX_MEASURE_CONCURRENCY : 0 APEX_MEASURE_CONCURRENCY_PERIOD : 1000000 APEX_SCREEN_OUTPUT : 1 APEX_PROFILE_OUTPUT : 0 APEX_CSV_OUTPUT : 1 APEX_TASKGRAPH_OUTPUT : 0 APEX_PROC_CPUINFO : 0 APEX_PROC_MEMINFO : 0 APEX_PROC_NET_DEV : 0 APEX_PROC_SELF_STATUS : 0 APEX_PROC_STAT : 1 APEX_THROTTLE_CONCURRENCY : 1 APEX_THROTTLING_MAX_THREADS : 8 APEX_THROTTLING_MIN_THREADS : 1 APEX_THROTTLE_ENERGY : 0 APEX_THROTTLING_MAX_WATTS : 300 APEX_THROTTLING_MIN_WATTS : 150 APEX_PTHREAD_WRAPPER_STACK_SIZE : 0 APEX_PAPI_METRICS : Spawned thread 1... Spawned thread 2... Spawned thread 3... Done. CPU is 2.66013e+09 Hz. 
Elapsed time: 0.966516 Cores detected: 8 Worker Threads observed: 4 Available CPU time: 3.86607 Action : #calls | minimum | mean | maximum | total | stddev | % total ------------------------------------------------------------------------------------------------------------ allocateMatrix : 12 --n/a-- 1.94e-02 --n/a-- 2.33e-01 --n/a-- 6.014 compute : 4 --n/a-- 6.89e-01 --n/a-- 2.76e+00 --n/a-- 71.279 compute_interchange : 4 --n/a-- 1.85e-01 --n/a-- 7.38e-01 --n/a-- 19.091 do_work : 4 --n/a-- 9.43e-01 --n/a-- 3.77e+00 --n/a-- 97.601 freeMatrix : 12 --n/a-- 2.36e-04 --n/a-- 2.83e-03 --n/a-- 0.073 initialize : 12 --n/a-- 3.56e-03 --n/a-- 4.27e-02 --n/a-- 1.104 main : 1 --n/a-- 9.66e-01 --n/a-- 9.66e-01 --n/a-- 24.983 APEX Idle : --n/a-- --n/a-- --n/a-- --n/a-- --n/a-- --n/a-- --n/a-- ------------------------------------------------------------------------------------------------------------ In this output, we see the status of all of the environment variables (as read by APEX at initialization), the regular program output, and then a summary from APEX at the end. Because APEX captures timestamps using the low-overhead rdtsc function call (where available), the measurements are done in cycles. APEX estimates the Hz rating of the CPU to convert to seconds for output. APEX reports the elapsed wall-clock time, the number of cores detected, the number of worker threads observed, as well as the total available CPU time (wall-clock times workers). OpenMP example \u00b6 In the APEX installation directory, there is a bin directory. In the bin directory are a number of examples, one of which is the OpenMP implementation of LULESH (for details, see the LLNL explanation of LULESH ). When APEX is configured with OpenMP OMPT support (using the -DBUILD_OMPT=TRUE or equivalent CMake configuration settings) it will measure OpenMP events. 
Executing the LULESH example (with APEX_SCREEN_OUTPUT=1) gives the following output: khuck@ktau:~/src/apex$ ./install/bin/lulesh_OpenMP_2.0 v0.1-e050e17-master Built on: 14:38:56 Dec 22 2015 C++ Language Standard version : 201402 GCC Compiler version : 5.2.1 20151010 APEX_TAU : 0 APEX_POLICY : 1 APEX_MEASURE_CONCURRENCY : 0 APEX_MEASURE_CONCURRENCY_PERIOD : 1000000 APEX_SCREEN_OUTPUT : 1 APEX_PROFILE_OUTPUT : 0 APEX_CSV_OUTPUT : 1 APEX_TASKGRAPH_OUTPUT : 0 APEX_PROC_CPUINFO : 0 APEX_PROC_MEMINFO : 0 APEX_PROC_NET_DEV : 0 APEX_PROC_SELF_STATUS : 0 APEX_PROC_STAT : 1 APEX_THROTTLE_CONCURRENCY : 1 APEX_THROTTLING_MAX_THREADS : 8 APEX_THROTTLING_MIN_THREADS : 1 APEX_THROTTLE_ENERGY : 0 APEX_THROTTLING_MAX_WATTS : 300 APEX_THROTTLING_MIN_WATTS : 150 APEX_PTHREAD_WRAPPER_STACK_SIZE : 0 APEX_PAPI_METRICS : Running problem size 30^3 per domain until completion Num processors: 1 Registering OMPT events...done. Num threads: 8 Total number of elements: 27000 To run other sizes, use -s . To run a fixed number of iterations, use -i . To run a more or less balanced region set, use -b . To change the relative costs of regions, use -c . To print out progress, use -p To write an output file for VisIt, use -v See help (-h) for more options APEX: disabling lightweight timer OpenMP_BARRIER: CalcPressur... APEX: disabling lightweight timer OpenMP_BARRIER: CalcPressur... APEX: disabling lightweight timer OpenMP_BARRIER: EvalEOSForE... APEX: disabling lightweight timer OpenMP_BARRIER: CalcEnergyF... APEX: disabling lightweight timer OpenMP_BARRIER: CalcEnergyF... APEX: disabling lightweight timer OpenMP_BARRIER: CalcEnergyF... APEX: disabling lightweight timer OpenMP_BARRIER: CalcEnergyF... APEX: disabling lightweight timer OpenMP_BARRIER: CalcEnergyF... APEX: disabling lightweight timer OpenMP_BARRIER: CalcCourant... APEX: disabling lightweight timer OpenMP_BARRIER: CalcHydroCo... APEX: disabling lightweight timer OpenMP_BARRIER: CalcMonoton... 
APEX: disabling lightweight timer OpenMP_BARRIER: EvalEOSForE... APEX: disabling lightweight timer OpenMP_BARRIER: CalcSoundSp... APEX: disabling lightweight timer OpenMP_BARRIER: InitStressT... APEX: disabling lightweight timer OpenMP_BARRIER: CalcVolumeF... APEX: disabling lightweight timer OpenMP_BARRIER: CalcAcceler... APEX: disabling lightweight timer OpenMP_BARRIER: CalcVelocit... APEX: disabling lightweight timer OpenMP_BARRIER: CalcPositio... APEX: disabling lightweight timer OpenMP_BARRIER: CalcLagrang... APEX: disabling lightweight timer OpenMP_BARRIER: UpdateVolum... APEX: disabling lightweight timer OpenMP_BARRIER: ApplyAccele... APEX: disabling lightweight timer OpenMP_BARRIER: CalcForceFo... Run completed: Problem size = 30 MPI tasks = 1 Iteration count = 932 Final Origin Energy = 2.025075e+05 Testing Plane 0 of Energy Array on rank 0: MaxAbsDiff = 6.548362e-11 TotalAbsDiff = 8.615093e-10 MaxRelDiff = 1.461140e-12 Elapsed time = 55.00 (s) Grind time (us/z/c) = 2.1855548 (per dom) ( 2.1855548 overall) FOM = 457.54973 (z/s) CPU is 2.66013e+09 Hz. Elapsed time: 55.0085 Cores detected: 8 Worker Threads observed: 8 Available CPU time: 440.068 Action : #calls | minimum | mean | maximum | total | stddev | % total ------------------------------------------------------------------------------------------------------------ CPU Guest % : 54 0.000 0.000 0.000 0.000 0.000 --n/a-- CPU I/O Wait % : 54 0.000 0.040 0.714 2.143 0.133 --n/a-- CPU IRQ % : 54 0.000 0.000 0.000 0.000 0.000 --n/a-- CPU Idle % : 54 0.857 1.384 4.857 74.714 0.763 --n/a-- CPU Nice % : 54 0.000 0.000 0.000 0.000 0.000 --n/a-- CPU Steal % : 54 0.000 0.000 0.000 0.000 0.000 --n/a-- CPU System % : 54 15.286 23.339 26.714 1260.286 2.301 --n/a-- CPU User % : 54 84.143 88.373 97.143 4772.143 2.268 --n/a-- CPU soft IRQ % : 54 0.000 0.026 0.286 1.429 0.068 --n/a-- OpenMP_BARRIER: ApplyAccele... : DISABLED (high frequency, short duration) OpenMP_BARRIER: ApplyMateri... 
: 14912 --n/a-- 3.96e-05 --n/a-- 5.91e-01 --n/a-- 0.134 OpenMP_BARRIER: CalcAcceler... : DISABLED (high frequency, short duration) OpenMP_BARRIER: CalcCourant... : DISABLED (high frequency, short duration) OpenMP_BARRIER: CalcEnergyF... : DISABLED (high frequency, short duration) OpenMP_BARRIER: CalcEnergyF... : DISABLED (high frequency, short duration) OpenMP_BARRIER: CalcEnergyF... : DISABLED (high frequency, short duration) OpenMP_BARRIER: CalcEnergyF... : DISABLED (high frequency, short duration) OpenMP_BARRIER: CalcEnergyF... : DISABLED (high frequency, short duration) OpenMP_BARRIER: CalcFBHourg... : 7456 --n/a-- 1.11e-04 --n/a-- 8.27e-01 --n/a-- 0.188 OpenMP_BARRIER: CalcFBHourg... : 7456 --n/a-- 1.49e-04 --n/a-- 1.11e+00 --n/a-- 0.252 OpenMP_BARRIER: CalcForceFo... : DISABLED (high frequency, short duration) OpenMP_BARRIER: CalcHourgla... : 7456 --n/a-- 1.32e-04 --n/a-- 9.84e-01 --n/a-- 0.224 OpenMP_BARRIER: CalcHydroCo... : DISABLED (high frequency, short duration) OpenMP_BARRIER: CalcKinemat... : 7456 --n/a-- 7.88e-05 --n/a-- 5.88e-01 --n/a-- 0.134 OpenMP_BARRIER: CalcLagrang... : DISABLED (high frequency, short duration) OpenMP_BARRIER: CalcMonoton... : 7456 --n/a-- 6.98e-05 --n/a-- 5.21e-01 --n/a-- 0.118 OpenMP_BARRIER: CalcMonoton... : DISABLED (high frequency, short duration) OpenMP_BARRIER: CalcPositio... : DISABLED (high frequency, short duration) OpenMP_BARRIER: CalcPressur... : DISABLED (high frequency, short duration) OpenMP_BARRIER: CalcPressur... : DISABLED (high frequency, short duration) OpenMP_BARRIER: CalcSoundSp... : DISABLED (high frequency, short duration) OpenMP_BARRIER: CalcVelocit... : DISABLED (high frequency, short duration) OpenMP_BARRIER: CalcVolumeF... : DISABLED (high frequency, short duration) OpenMP_BARRIER: EvalEOSForE... : DISABLED (high frequency, short duration) OpenMP_BARRIER: EvalEOSForE... : DISABLED (high frequency, short duration) OpenMP_BARRIER: InitStressT... 
: DISABLED (high frequency, short duration) OpenMP_BARRIER: IntegrateSt... : 7456 --n/a-- 6.66e-05 --n/a-- 4.97e-01 --n/a-- 0.113 OpenMP_BARRIER: IntegrateSt... : 7456 --n/a-- 1.28e-04 --n/a-- 9.54e-01 --n/a-- 0.217 OpenMP_BARRIER: UpdateVolum... : DISABLED (high frequency, short duration) OpenMP_PARALLEL_REGION: App... : 932 --n/a-- 1.09e-04 --n/a-- 1.01e-01 --n/a-- 0.023 OpenMP_PARALLEL_REGION: App... : 932 --n/a-- 2.58e-04 --n/a-- 2.40e-01 --n/a-- 0.055 OpenMP_PARALLEL_REGION: Cal... : 932 --n/a-- 7.83e-04 --n/a-- 7.30e-01 --n/a-- 0.166 OpenMP_PARALLEL_REGION: Cal... : 10252 --n/a-- 7.72e-05 --n/a-- 7.91e-01 --n/a-- 0.180 OpenMP_PARALLEL_REGION: Cal... : 32620 --n/a-- 4.29e-05 --n/a-- 1.40e+00 --n/a-- 0.318 OpenMP_PARALLEL_REGION: Cal... : 32620 --n/a-- 5.07e-05 --n/a-- 1.65e+00 --n/a-- 0.376 OpenMP_PARALLEL_REGION: Cal... : 32620 --n/a-- 3.31e-05 --n/a-- 1.08e+00 --n/a-- 0.245 OpenMP_PARALLEL_REGION: Cal... : 32620 --n/a-- 4.75e-05 --n/a-- 1.55e+00 --n/a-- 0.352 OpenMP_PARALLEL_REGION: Cal... : 32620 --n/a-- 4.09e-05 --n/a-- 1.34e+00 --n/a-- 0.303 OpenMP_PARALLEL_REGION: Cal... : 932 --n/a-- 8.10e-03 --n/a-- 7.55e+00 --n/a-- 1.715 OpenMP_PARALLEL_REGION: Cal... : 932 --n/a-- 3.51e-03 --n/a-- 3.28e+00 --n/a-- 0.744 OpenMP_PARALLEL_REGION: Cal... : 932 --n/a-- 4.34e-04 --n/a-- 4.05e-01 --n/a-- 0.092 OpenMP_PARALLEL_REGION: Cal... : 932 --n/a-- 4.27e-03 --n/a-- 3.98e+00 --n/a-- 0.905 OpenMP_PARALLEL_REGION: Cal... : 10252 --n/a-- 4.72e-05 --n/a-- 4.84e-01 --n/a-- 0.110 OpenMP_PARALLEL_REGION: Cal... : 932 --n/a-- 1.68e-03 --n/a-- 1.57e+00 --n/a-- 0.356 OpenMP_PARALLEL_REGION: Cal... : 932 --n/a-- 2.29e-04 --n/a-- 2.13e-01 --n/a-- 0.048 OpenMP_PARALLEL_REGION: Cal... : 932 --n/a-- 1.15e-03 --n/a-- 1.07e+00 --n/a-- 0.244 OpenMP_PARALLEL_REGION: Cal... : 10252 --n/a-- 2.29e-04 --n/a-- 2.34e+00 --n/a-- 0.533 OpenMP_PARALLEL_REGION: Cal... : 932 --n/a-- 4.98e-04 --n/a-- 4.64e-01 --n/a-- 0.105 OpenMP_PARALLEL_REGION: Cal... 
: 97860 --n/a-- 3.26e-05 --n/a-- 3.19e+00 --n/a-- 0.725 OpenMP_PARALLEL_REGION: Cal... : 97860 --n/a-- 3.20e-05 --n/a-- 3.13e+00 --n/a-- 0.712 OpenMP_PARALLEL_REGION: Cal... : 10252 --n/a-- 4.52e-05 --n/a-- 4.63e-01 --n/a-- 0.105 OpenMP_PARALLEL_REGION: Cal... : 932 --n/a-- 3.39e-04 --n/a-- 3.16e-01 --n/a-- 0.072 OpenMP_PARALLEL_REGION: Cal... : 932 --n/a-- 1.57e-04 --n/a-- 1.47e-01 --n/a-- 0.033 OpenMP_PARALLEL_REGION: Eva... : 32620 --n/a-- 1.07e-04 --n/a-- 3.50e+00 --n/a-- 0.796 OpenMP_PARALLEL_REGION: Eva... : 10252 --n/a-- 2.86e-05 --n/a-- 2.93e-01 --n/a-- 0.067 OpenMP_PARALLEL_REGION: Ini... : 932 --n/a-- 3.52e-04 --n/a-- 3.28e-01 --n/a-- 0.074 OpenMP_PARALLEL_REGION: Int... : 932 --n/a-- 3.14e-03 --n/a-- 2.93e+00 --n/a-- 0.666 OpenMP_PARALLEL_REGION: Int... : 932 --n/a-- 2.18e-03 --n/a-- 2.03e+00 --n/a-- 0.461 OpenMP_PARALLEL_REGION: Upd... : 932 --n/a-- 1.34e-04 --n/a-- 1.25e-01 --n/a-- 0.028 APEX Idle : --n/a-- --n/a-- --n/a-- --n/a-- 3.87e+02 --n/a-- 88.011 ------------------------------------------------------------------------------------------------------------ There are several lightweight events that APEX elects to ignore. The other events are timed by APEX and reported at exit, along with the /proc/stat data (CPU % counters). With PAPI \u00b6 When APEX is configured with PAPI support (using -DPAPI_ROOT=/path/to/papi and -DUSE_PAPI=TRUE), hardware counter data can also be collected by APEX. 
To specify hardware counters of interest, use the APEX_PAPI_METRICS environment variable: khuck@ktau:~/src/apex$ export APEX_PAPI_METRICS=\"PAPI_TOT_INS PAPI_L2_TCM\" ...and then execute as normal: khuck@ktau:~/src/apex$ ./install/bin/matmult v0.1-e050e17-master Built on: 14:38:56 Dec 22 2015 C++ Language Standard version : 201402 GCC Compiler version : 5.2.1 20151010 APEX_TAU : 1 APEX_POLICY : 1 APEX_MEASURE_CONCURRENCY : 0 APEX_MEASURE_CONCURRENCY_PERIOD : 1000000 APEX_SCREEN_OUTPUT : 1 APEX_PROFILE_OUTPUT : 0 APEX_CSV_OUTPUT : 1 APEX_TASKGRAPH_OUTPUT : 0 APEX_PROC_CPUINFO : 0 APEX_PROC_MEMINFO : 0 APEX_PROC_NET_DEV : 0 APEX_PROC_SELF_STATUS : 0 APEX_PROC_STAT : 1 APEX_THROTTLE_CONCURRENCY : 1 APEX_THROTTLING_MAX_THREADS : 8 APEX_THROTTLING_MIN_THREADS : 1 APEX_THROTTLE_ENERGY : 0 APEX_THROTTLING_MAX_WATTS : 300 APEX_THROTTLING_MIN_WATTS : 150 APEX_PTHREAD_WRAPPER_STACK_SIZE : 0 APEX_PAPI_METRICS : PAPI_TOT_INS PAPI_L2_TCM Spawned thread 1... Spawned thread 2... Spawned thread 3... Done. CPU is 2.66019e+09 Hz. 
Elapsed time: 0.954974 Cores detected: 8 Worker Threads observed: 4 Available CPU time: 3.81989 Action : #calls | minimum | mean | maximum | total | stddev | % total PAPI_TOT_INS PAPI_L2_TCM ------------------------------------------------------------------------------------------------------------ allocateMatrix : 12 --n/a-- 2.21e-02 --n/a-- 2.65e-01 --n/a-- 6.930 1.62e+06 9.10e+03 compute : 4 --n/a-- 6.85e-01 --n/a-- 2.74e+00 --n/a-- 71.743 4.31e+09 1.71e+06 compute_interchange : 4 --n/a-- 1.81e-01 --n/a-- 7.23e-01 --n/a-- 18.922 3.77e+09 8.12e+05 do_work : 4 --n/a-- 9.44e-01 --n/a-- 3.78e+00 --n/a-- 98.851 8.10e+09 2.92e+06 freeMatrix : 12 --n/a-- 2.07e-04 --n/a-- 2.49e-03 --n/a-- 0.065 1.13e+06 6.30e+03 initialize : 12 --n/a-- 3.58e-03 --n/a-- 4.29e-02 --n/a-- 1.124 2.21e+07 3.80e+05 main : 1 --n/a-- 9.54e-01 --n/a-- 9.54e-01 --n/a-- 24.978 2.03e+09 7.66e+05 APEX Idle : --n/a-- --n/a-- --n/a-- --n/a-- --n/a-- --n/a-- --n/a-- ------------------------------------------------------------------------------------------------------------ CSV output \u00b6 While APEX is not designed for post-mortem performance analysis, you can export the data that APEX collected. If you set the APEX_CSV_OUTPUT environment variable to 1, APEX will also dump the timer statistics as a CSV file: khuck@ktau:~/src/apex$ cat apex.0.csv \"task\",\"num calls\",\"total cycles\",\"total microseconds\",\"PAPI_TOT_INS\",\"PAPI_L2_TCM\" \"allocateMatrix\",12,704195504,264717,1615804,9100 \"compute\",4,7290209200,2740489,4306522734,1709040 \"compute_interchange\",4,1922797744,722806,3769652571,812196 \"do_work\",4,10044907856,3776018,8101109302,2922142 \"freeMatrix\",12,6613336,2486,1132717,6301 \"initialize\",12,114177592,42921,22093639,379785 \"main\",1,2538202992,954145,2025172707,766218 With TAU \u00b6 If APEX is configured with TAU support, then APEX measurements will be forwarded to TAU and recorded as a TAU profile. 
In addition, all other TAU features are supported, including sampling, MPI measurement, I/O measurement, tracing, etc. To configure APEX with TAU, specify the flags -DUSE_TAU, -DTAU_ROOT, -DTAU_ARCH, and -DTAU_OPTIONS. For example, if TAU was configured with \"./configure -pthread\" on an x86_64 Linux machine, the APEX configuration options would be \"-DUSE_TAU=1 -DTAU_ROOT=/path/to/tau -DTAU_ARCH=x86_64 -DTAU_OPTIONS=-pthread\". If TAU was configured with \"./configure -mpi -pthread\" on an x86_64 Linux machine, the APEX configuration options would be \"-DUSE_TAU=1 -DTAU_ROOT=/path/to/tau -DTAU_ARCH=x86_64 -DTAU_OPTIONS=-mpi-pthread\". Here is a suggested configuration for TAU on x86-Linux to use with APEX (some systems require special flags - please contact the maintainers if you are interested): # download the latest TAU release wget http://www.cs.uoregon.edu/research/paracomp/tau/tauprofile/dist/tau_latest.tar.gz # expand the tar file tar -xvzf tau_latest.tar.gz cd tau-2.25 # configure TAU ./configure -papi=/usr/local/papi/5.3.2 -pthread -prefix=/usr/local/tau/2.25 # build make -j install # set our path to include the new TAU installation export PATH=$PATH:/usr/local/tau/2.25/x86_64/bin Here is a suggested configuration for APEX to use the above TAU installation: cd xpress-apex mkdir build-tau cd build-tau cmake -DBUILD_EXAMPLES=TRUE -DBUILD_TESTS=TRUE -DCMAKE_BUILD_TYPE=RelWithDebInfo \\ -DUSE_TAU=TRUE -DTAU_ROOT=/usr/local/tau/2.25 -DTAU_ARCH=x86_64 -DTAU_OPTIONS=-papi-pthread \\ -DBUILD_BFD=TRUE -DBUILD_ACTIVEHARMONY=TRUE -DCMAKE_INSTALL_PREFIX=../install-tau .. make make tests make install After configuring, building and installing TAU and then configuring, building and installing APEX, the TAU profiling is enabled by setting the environment variable \"APEX_TAU=1\". 
After executing an example (say 'matmult'), there should be profile.* files in the working directory: khuck@ktau:~/src/xpress-apex$ export APEX_TAU=1 khuck@ktau:~/src/xpress-apex$ ./install/bin/matmult Spawned thread 1... Spawned thread 2... Spawned thread 3... Done. khuck@ktau:~/src/xpress-apex$ ls profile.* profile.0.0.0 profile.0.0.1 profile.0.0.2 profile.0.0.3 profile.0.0.4 profile.0.0.5 If the TAU analysis utilities are in your path, you can execute paraprof to view the profiles: khuck@ktau:~/src/xpress-apex$ paraprof ...which should launch the ParaProf profile viewer/analysis program. The profile should look something like the following (for a complete manual on using ParaProf, see the TAU website ). If you want to collect a TAU trace, you would enable the appropriate TAU environment variable (TAU_TRACE=1), and then re-run the example. After the execution, the trace files need to be merged (using tau_treemerge.pl) and then converted (with tau2slog2) to be viewed with the Jumpshot trace viewer (included with TAU): khuck@ktau:~/src/xpress-apex$ export APEX_TAU=1 khuck@ktau:~/src/xpress-apex$ export TAU_TRACE=1 khuck@ktau:~/src/xpress-apex$ ./install/bin/matmult Spawned thread 1... Spawned thread 2... Spawned thread 3... Done. khuck@ktau:~/src/xpress-apex$ ls *.edf *.trc events.0.edf tautrace.0.0.1.trc tautrace.0.0.3.trc tautrace.0.0.5.trc tautrace.0.0.0.trc tautrace.0.0.2.trc tautrace.0.0.4.trc # merge the trace khuck@ktau:~/src/xpress-apex$ tau_treemerge.pl /home/khuck/src/tau2/x86_64/bin/tau_merge -m tau.edf -e events.0.edf events.0.edf events.0.edf events.0.edf events.0.edf events.0.edf tautrace.0.0.0.trc tautrace.0.0.1.trc tautrace.0.0.2.trc tautrace.0.0.3.trc tautrace.0.0.4.trc tautrace.0.0.5.trc tau.trc tautrace.0.0.0.trc: 34 records read. tautrace.0.0.1.trc: 8 records read. tautrace.0.0.2.trc: 8 records read. tautrace.0.0.3.trc: 30 records read. tautrace.0.0.4.trc: 30 records read. tautrace.0.0.5.trc: 30 records read. 
# convert the trace khuck@ktau:~/src/xpress-apex$ tau2slog2 tau.trc tau.edf -o tau.slog2 140 records initialized. Processing. 2 Records read. 1% converted 4 Records read. 2% converted 6 Records read. 4% converted 8 Records read. 5% converted 10 Records read. 7% converted 12 Records read. 8% converted 14 Records read. 10% converted 16 Records read. 11% converted 18 Records read. 12% converted 20 Records read. 14% converted 22 Records read. 15% converted 24 Records read. 17% converted 26 Records read. 18% converted 28 Records read. 20% converted 30 Records read. 21% converted 32 Records read. 22% converted 34 Records read. 24% converted 36 Records read. 25% converted 38 Records read. 27% converted 40 Records read. 28% converted 42 Records read. 30% converted 44 Records read. 31% converted 46 Records read. 32% converted 48 Records read. 34% converted 50 Records read. 35% converted 52 Records read. 37% converted 54 Records read. 38% converted 56 Records read. 40% converted 58 Records read. 41% converted 60 Records read. 42% converted 62 Records read. 44% converted 64 Records read. 45% converted 66 Records read. 47% converted 68 Records read. 48% converted 70 Records read. 50% converted 72 Records read. 51% converted 74 Records read. 52% converted 76 Records read. 54% converted 78 Records read. 55% converted 80 Records read. 57% converted 82 Records read. 58% converted 84 Records read. 60% converted 86 Records read. 61% converted 88 Records read. 62% converted 90 Records read. 64% converted 92 Records read. 65% converted 94 Records read. 67% converted 96 Records read. 68% converted 98 Records read. 70% converted 100 Records read. 71% converted 102 Records read. 72% converted 104 Records read. 74% converted 106 Records read. 75% converted 108 Records read. 77% converted 110 Records read. 78% converted 112 Records read. 80% converted 114 Records read. 81% converted 116 Records read. 82% converted 118 Records read. 84% converted 120 Records read. 
85% converted 122 Records read. 87% converted 124 Records read. 88% converted 1521 enters: 0 exits: 0 126 Records read. 90% converted 1521 enters: 0 exits: 0 128 Records read. 91% converted 130 Records read. 92% converted 1521 enters: 0 exits: 0 132 Records read. 94% converted 1521 enters: 0 exits: 0 134 Records read. 95% converted 136 Records read. 97% converted 1521 enters: 0 exits: 0 138 Records read. 98% converted 1521 enters: 0 exits: 0 140 Records read. 100% converted Reached end of trace file. Getting YMap, Maxnode: 0, Maxthread: 5 SLOG-2 Header: version = SLOG 2.0.6 NumOfChildrenPerNode = 2 TreeLeafByteSize = 65536 MaxTreeDepth = 0 MaxBufferByteSize = 1960 Categories is FBinfo(641 @ 2068) MethodDefs is FBinfo(0 @ 0) LineIDMaps is FBinfo(197 @ 2709) TreeRoot is FBinfo(1960 @ 108) TreeDir is FBinfo(38 @ 2906) Annotations is FBinfo(0 @ 0) Postamble is FBinfo(0 @ 0) 1521 enters: 0 exits: 0 Number of Drawables = 58 timeElapsed between 1 & 2 = 67 msec timeElapsed between 2 & 3 = 28 msec # open jumpshot khuck@ktau:~/src/xpress-apex$ jumpshot tau.slog2 Policy Rules and Runtime Adaptation \u00b6 ...Coming soon!","title":"Before you start"},{"location":"usecases/#before_you_start","text":"All examples on this page assume you have downloaded, configured and built APEX. See the Getting Started page for instructions on how to do that.","title":"Before you start"},{"location":"usecases/#simple_example","text":"In the APEX installation directory, there is a bin directory. In the bin directory are a number of examples, one of which is a simple matrix multiplication example, matmult . To run the matmult example, simply type 'matmult'. The output should be something like this: khuck@ktau:~/src/apex/install/bin$ ./matmult Spawned thread 1... Spawned thread 2... Spawned thread 3... Done. Not very interesting, eh? 
To see what APEX measured, set the APEX_SCREEN_OUTPUT environment variable to 1, and run it again: khuck@ktau:~/src/apex/install/bin$ export APEX_SCREEN_OUTPUT=1 khuck@ktau:~/src/apex/install/bin$ ./matmult v0.1-e050e17-master Built on: 14:38:56 Dec 22 2015 C++ Language Standard version : 201402 GCC Compiler version : 5.2.1 20151010 APEX_TAU : 0 APEX_POLICY : 1 APEX_MEASURE_CONCURRENCY : 0 APEX_MEASURE_CONCURRENCY_PERIOD : 1000000 APEX_SCREEN_OUTPUT : 1 APEX_PROFILE_OUTPUT : 0 APEX_CSV_OUTPUT : 1 APEX_TASKGRAPH_OUTPUT : 0 APEX_PROC_CPUINFO : 0 APEX_PROC_MEMINFO : 0 APEX_PROC_NET_DEV : 0 APEX_PROC_SELF_STATUS : 0 APEX_PROC_STAT : 1 APEX_THROTTLE_CONCURRENCY : 1 APEX_THROTTLING_MAX_THREADS : 8 APEX_THROTTLING_MIN_THREADS : 1 APEX_THROTTLE_ENERGY : 0 APEX_THROTTLING_MAX_WATTS : 300 APEX_THROTTLING_MIN_WATTS : 150 APEX_PTHREAD_WRAPPER_STACK_SIZE : 0 APEX_PAPI_METRICS : Spawned thread 1... Spawned thread 2... Spawned thread 3... Done. CPU is 2.66013e+09 Hz. Elapsed time: 0.966516 Cores detected: 8 Worker Threads observed: 4 Available CPU time: 3.86607 Action : #calls | minimum | mean | maximum | total | stddev | % total ------------------------------------------------------------------------------------------------------------ allocateMatrix : 12 --n/a-- 1.94e-02 --n/a-- 2.33e-01 --n/a-- 6.014 compute : 4 --n/a-- 6.89e-01 --n/a-- 2.76e+00 --n/a-- 71.279 compute_interchange : 4 --n/a-- 1.85e-01 --n/a-- 7.38e-01 --n/a-- 19.091 do_work : 4 --n/a-- 9.43e-01 --n/a-- 3.77e+00 --n/a-- 97.601 freeMatrix : 12 --n/a-- 2.36e-04 --n/a-- 2.83e-03 --n/a-- 0.073 initialize : 12 --n/a-- 3.56e-03 --n/a-- 4.27e-02 --n/a-- 1.104 main : 1 --n/a-- 9.66e-01 --n/a-- 9.66e-01 --n/a-- 24.983 APEX Idle : --n/a-- --n/a-- --n/a-- --n/a-- --n/a-- --n/a-- --n/a-- ------------------------------------------------------------------------------------------------------------ In this output, we see the status of all of the environment variables (as read by APEX at initialization), the regular program 
output, and then a summary from APEX at the end. Because APEX captures timestamps using the low-overhead rdtsc function call (where available), the measurements are done in cycles. APEX estimates the Hz rating of the CPU to convert to seconds for output. APEX reports the elapsed wall-clock time, the number of cores detected, the number of worker threads observed, as well as the total available CPU time (wall-clock times workers).","title":"Simple example"},{"location":"usecases/#openmp_example","text":"In the APEX installation directory, there is a bin directory. In the bin directory are a number of examples, one of which is the OpenMP implementation of LULESH (for details, see the LLNL explanation of LULESH ). When APEX is configured with OpenMP OMPT support (using the -DBUILD_OMPT=TRUE or equivalent CMake configuration settings) it will measure OpenMP events. Executing the LULESH example (with APEX_SCREEN_OUTPUT=1) gives the following output: khuck@ktau:~/src/apex$ ./install/bin/lulesh_OpenMP_2.0 v0.1-e050e17-master Built on: 14:38:56 Dec 22 2015 C++ Language Standard version : 201402 GCC Compiler version : 5.2.1 20151010 APEX_TAU : 0 APEX_POLICY : 1 APEX_MEASURE_CONCURRENCY : 0 APEX_MEASURE_CONCURRENCY_PERIOD : 1000000 APEX_SCREEN_OUTPUT : 1 APEX_PROFILE_OUTPUT : 0 APEX_CSV_OUTPUT : 1 APEX_TASKGRAPH_OUTPUT : 0 APEX_PROC_CPUINFO : 0 APEX_PROC_MEMINFO : 0 APEX_PROC_NET_DEV : 0 APEX_PROC_SELF_STATUS : 0 APEX_PROC_STAT : 1 APEX_THROTTLE_CONCURRENCY : 1 APEX_THROTTLING_MAX_THREADS : 8 APEX_THROTTLING_MIN_THREADS : 1 APEX_THROTTLE_ENERGY : 0 APEX_THROTTLING_MAX_WATTS : 300 APEX_THROTTLING_MIN_WATTS : 150 APEX_PTHREAD_WRAPPER_STACK_SIZE : 0 APEX_PAPI_METRICS : Running problem size 30^3 per domain until completion Num processors: 1 Registering OMPT events...done. Num threads: 8 Total number of elements: 27000 To run other sizes, use -s . To run a fixed number of iterations, use -i . To run a more or less balanced region set, use -b . 
To change the relative costs of regions, use -c . To print out progress, use -p To write an output file for VisIt, use -v See help (-h) for more options APEX: disabling lightweight timer OpenMP_BARRIER: CalcPressur... APEX: disabling lightweight timer OpenMP_BARRIER: CalcPressur... APEX: disabling lightweight timer OpenMP_BARRIER: EvalEOSForE... APEX: disabling lightweight timer OpenMP_BARRIER: CalcEnergyF... APEX: disabling lightweight timer OpenMP_BARRIER: CalcEnergyF... APEX: disabling lightweight timer OpenMP_BARRIER: CalcEnergyF... APEX: disabling lightweight timer OpenMP_BARRIER: CalcEnergyF... APEX: disabling lightweight timer OpenMP_BARRIER: CalcEnergyF... APEX: disabling lightweight timer OpenMP_BARRIER: CalcCourant... APEX: disabling lightweight timer OpenMP_BARRIER: CalcHydroCo... APEX: disabling lightweight timer OpenMP_BARRIER: CalcMonoton... APEX: disabling lightweight timer OpenMP_BARRIER: EvalEOSForE... APEX: disabling lightweight timer OpenMP_BARRIER: CalcSoundSp... APEX: disabling lightweight timer OpenMP_BARRIER: InitStressT... APEX: disabling lightweight timer OpenMP_BARRIER: CalcVolumeF... APEX: disabling lightweight timer OpenMP_BARRIER: CalcAcceler... APEX: disabling lightweight timer OpenMP_BARRIER: CalcVelocit... APEX: disabling lightweight timer OpenMP_BARRIER: CalcPositio... APEX: disabling lightweight timer OpenMP_BARRIER: CalcLagrang... APEX: disabling lightweight timer OpenMP_BARRIER: UpdateVolum... APEX: disabling lightweight timer OpenMP_BARRIER: ApplyAccele... APEX: disabling lightweight timer OpenMP_BARRIER: CalcForceFo... Run completed: Problem size = 30 MPI tasks = 1 Iteration count = 932 Final Origin Energy = 2.025075e+05 Testing Plane 0 of Energy Array on rank 0: MaxAbsDiff = 6.548362e-11 TotalAbsDiff = 8.615093e-10 MaxRelDiff = 1.461140e-12 Elapsed time = 55.00 (s) Grind time (us/z/c) = 2.1855548 (per dom) ( 2.1855548 overall) FOM = 457.54973 (z/s) CPU is 2.66013e+09 Hz. 
Elapsed time: 55.0085 Cores detected: 8 Worker Threads observed: 8 Available CPU time: 440.068 Action : #calls | minimum | mean | maximum | total | stddev | % total ------------------------------------------------------------------------------------------------------------ CPU Guest % : 54 0.000 0.000 0.000 0.000 0.000 --n/a-- CPU I/O Wait % : 54 0.000 0.040 0.714 2.143 0.133 --n/a-- CPU IRQ % : 54 0.000 0.000 0.000 0.000 0.000 --n/a-- CPU Idle % : 54 0.857 1.384 4.857 74.714 0.763 --n/a-- CPU Nice % : 54 0.000 0.000 0.000 0.000 0.000 --n/a-- CPU Steal % : 54 0.000 0.000 0.000 0.000 0.000 --n/a-- CPU System % : 54 15.286 23.339 26.714 1260.286 2.301 --n/a-- CPU User % : 54 84.143 88.373 97.143 4772.143 2.268 --n/a-- CPU soft IRQ % : 54 0.000 0.026 0.286 1.429 0.068 --n/a-- OpenMP_BARRIER: ApplyAccele... : DISABLED (high frequency, short duration) OpenMP_BARRIER: ApplyMateri... : 14912 --n/a-- 3.96e-05 --n/a-- 5.91e-01 --n/a-- 0.134 OpenMP_BARRIER: CalcAcceler... : DISABLED (high frequency, short duration) OpenMP_BARRIER: CalcCourant... : DISABLED (high frequency, short duration) OpenMP_BARRIER: CalcEnergyF... : DISABLED (high frequency, short duration) OpenMP_BARRIER: CalcEnergyF... : DISABLED (high frequency, short duration) OpenMP_BARRIER: CalcEnergyF... : DISABLED (high frequency, short duration) OpenMP_BARRIER: CalcEnergyF... : DISABLED (high frequency, short duration) OpenMP_BARRIER: CalcEnergyF... : DISABLED (high frequency, short duration) OpenMP_BARRIER: CalcFBHourg... : 7456 --n/a-- 1.11e-04 --n/a-- 8.27e-01 --n/a-- 0.188 OpenMP_BARRIER: CalcFBHourg... : 7456 --n/a-- 1.49e-04 --n/a-- 1.11e+00 --n/a-- 0.252 OpenMP_BARRIER: CalcForceFo... : DISABLED (high frequency, short duration) OpenMP_BARRIER: CalcHourgla... : 7456 --n/a-- 1.32e-04 --n/a-- 9.84e-01 --n/a-- 0.224 OpenMP_BARRIER: CalcHydroCo... : DISABLED (high frequency, short duration) OpenMP_BARRIER: CalcKinemat... : 7456 --n/a-- 7.88e-05 --n/a-- 5.88e-01 --n/a-- 0.134 OpenMP_BARRIER: CalcLagrang... 
: DISABLED (high frequency, short duration) OpenMP_BARRIER: CalcMonoton... : 7456 --n/a-- 6.98e-05 --n/a-- 5.21e-01 --n/a-- 0.118 OpenMP_BARRIER: CalcMonoton... : DISABLED (high frequency, short duration) OpenMP_BARRIER: CalcPositio... : DISABLED (high frequency, short duration) OpenMP_BARRIER: CalcPressur... : DISABLED (high frequency, short duration) OpenMP_BARRIER: CalcPressur... : DISABLED (high frequency, short duration) OpenMP_BARRIER: CalcSoundSp... : DISABLED (high frequency, short duration) OpenMP_BARRIER: CalcVelocit... : DISABLED (high frequency, short duration) OpenMP_BARRIER: CalcVolumeF... : DISABLED (high frequency, short duration) OpenMP_BARRIER: EvalEOSForE... : DISABLED (high frequency, short duration) OpenMP_BARRIER: EvalEOSForE... : DISABLED (high frequency, short duration) OpenMP_BARRIER: InitStressT... : DISABLED (high frequency, short duration) OpenMP_BARRIER: IntegrateSt... : 7456 --n/a-- 6.66e-05 --n/a-- 4.97e-01 --n/a-- 0.113 OpenMP_BARRIER: IntegrateSt... : 7456 --n/a-- 1.28e-04 --n/a-- 9.54e-01 --n/a-- 0.217 OpenMP_BARRIER: UpdateVolum... : DISABLED (high frequency, short duration) OpenMP_PARALLEL_REGION: App... : 932 --n/a-- 1.09e-04 --n/a-- 1.01e-01 --n/a-- 0.023 OpenMP_PARALLEL_REGION: App... : 932 --n/a-- 2.58e-04 --n/a-- 2.40e-01 --n/a-- 0.055 OpenMP_PARALLEL_REGION: Cal... : 932 --n/a-- 7.83e-04 --n/a-- 7.30e-01 --n/a-- 0.166 OpenMP_PARALLEL_REGION: Cal... : 10252 --n/a-- 7.72e-05 --n/a-- 7.91e-01 --n/a-- 0.180 OpenMP_PARALLEL_REGION: Cal... : 32620 --n/a-- 4.29e-05 --n/a-- 1.40e+00 --n/a-- 0.318 OpenMP_PARALLEL_REGION: Cal... : 32620 --n/a-- 5.07e-05 --n/a-- 1.65e+00 --n/a-- 0.376 OpenMP_PARALLEL_REGION: Cal... : 32620 --n/a-- 3.31e-05 --n/a-- 1.08e+00 --n/a-- 0.245 OpenMP_PARALLEL_REGION: Cal... : 32620 --n/a-- 4.75e-05 --n/a-- 1.55e+00 --n/a-- 0.352 OpenMP_PARALLEL_REGION: Cal... : 32620 --n/a-- 4.09e-05 --n/a-- 1.34e+00 --n/a-- 0.303 OpenMP_PARALLEL_REGION: Cal... 
: 932 --n/a-- 8.10e-03 --n/a-- 7.55e+00 --n/a-- 1.715 OpenMP_PARALLEL_REGION: Cal... : 932 --n/a-- 3.51e-03 --n/a-- 3.28e+00 --n/a-- 0.744 OpenMP_PARALLEL_REGION: Cal... : 932 --n/a-- 4.34e-04 --n/a-- 4.05e-01 --n/a-- 0.092 OpenMP_PARALLEL_REGION: Cal... : 932 --n/a-- 4.27e-03 --n/a-- 3.98e+00 --n/a-- 0.905 OpenMP_PARALLEL_REGION: Cal... : 10252 --n/a-- 4.72e-05 --n/a-- 4.84e-01 --n/a-- 0.110 OpenMP_PARALLEL_REGION: Cal... : 932 --n/a-- 1.68e-03 --n/a-- 1.57e+00 --n/a-- 0.356 OpenMP_PARALLEL_REGION: Cal... : 932 --n/a-- 2.29e-04 --n/a-- 2.13e-01 --n/a-- 0.048 OpenMP_PARALLEL_REGION: Cal... : 932 --n/a-- 1.15e-03 --n/a-- 1.07e+00 --n/a-- 0.244 OpenMP_PARALLEL_REGION: Cal... : 10252 --n/a-- 2.29e-04 --n/a-- 2.34e+00 --n/a-- 0.533 OpenMP_PARALLEL_REGION: Cal... : 932 --n/a-- 4.98e-04 --n/a-- 4.64e-01 --n/a-- 0.105 OpenMP_PARALLEL_REGION: Cal... : 97860 --n/a-- 3.26e-05 --n/a-- 3.19e+00 --n/a-- 0.725 OpenMP_PARALLEL_REGION: Cal... : 97860 --n/a-- 3.20e-05 --n/a-- 3.13e+00 --n/a-- 0.712 OpenMP_PARALLEL_REGION: Cal... : 10252 --n/a-- 4.52e-05 --n/a-- 4.63e-01 --n/a-- 0.105 OpenMP_PARALLEL_REGION: Cal... : 932 --n/a-- 3.39e-04 --n/a-- 3.16e-01 --n/a-- 0.072 OpenMP_PARALLEL_REGION: Cal... : 932 --n/a-- 1.57e-04 --n/a-- 1.47e-01 --n/a-- 0.033 OpenMP_PARALLEL_REGION: Eva... : 32620 --n/a-- 1.07e-04 --n/a-- 3.50e+00 --n/a-- 0.796 OpenMP_PARALLEL_REGION: Eva... : 10252 --n/a-- 2.86e-05 --n/a-- 2.93e-01 --n/a-- 0.067 OpenMP_PARALLEL_REGION: Ini... : 932 --n/a-- 3.52e-04 --n/a-- 3.28e-01 --n/a-- 0.074 OpenMP_PARALLEL_REGION: Int... : 932 --n/a-- 3.14e-03 --n/a-- 2.93e+00 --n/a-- 0.666 OpenMP_PARALLEL_REGION: Int... : 932 --n/a-- 2.18e-03 --n/a-- 2.03e+00 --n/a-- 0.461 OpenMP_PARALLEL_REGION: Upd... 
: 932 --n/a-- 1.34e-04 --n/a-- 1.25e-01 --n/a-- 0.028 APEX Idle : --n/a-- --n/a-- --n/a-- --n/a-- 3.87e+02 --n/a-- 88.011 ------------------------------------------------------------------------------------------------------------ There are several lightweight events that APEX elects to ignore. The other events are timed by APEX and reported at exit, along with the /proc/stat data (CPU % counters).","title":"OpenMP example"},{"location":"usecases/#with_papi","text":"When APEX is configured with PAPI support (using -DPAPI_ROOT=/path/to/papi and -DUSE_PAPI=TRUE), hardware counter data can also be collected by APEX. To specify hardware counters of interest, use the APEX_PAPI_METRICS environment variable: khuck@ktau:~/src/apex$ export APEX_PAPI_METRICS=\"PAPI_TOT_INS PAPI_L2_TCM\" ...and then execute as normal: khuck@ktau:~/src/apex$ ./install/bin/matmult v0.1-e050e17-master Built on: 14:38:56 Dec 22 2015 C++ Language Standard version : 201402 GCC Compiler version : 5.2.1 20151010 APEX_TAU : 1 APEX_POLICY : 1 APEX_MEASURE_CONCURRENCY : 0 APEX_MEASURE_CONCURRENCY_PERIOD : 1000000 APEX_SCREEN_OUTPUT : 1 APEX_PROFILE_OUTPUT : 0 APEX_CSV_OUTPUT : 1 APEX_TASKGRAPH_OUTPUT : 0 APEX_PROC_CPUINFO : 0 APEX_PROC_MEMINFO : 0 APEX_PROC_NET_DEV : 0 APEX_PROC_SELF_STATUS : 0 APEX_PROC_STAT : 1 APEX_THROTTLE_CONCURRENCY : 1 APEX_THROTTLING_MAX_THREADS : 8 APEX_THROTTLING_MIN_THREADS : 1 APEX_THROTTLE_ENERGY : 0 APEX_THROTTLING_MAX_WATTS : 300 APEX_THROTTLING_MIN_WATTS : 150 APEX_PTHREAD_WRAPPER_STACK_SIZE : 0 APEX_PAPI_METRICS : PAPI_TOT_INS PAPI_L2_TCM Spawned thread 1... Spawned thread 2... Spawned thread 3... Done. CPU is 2.66019e+09 Hz. 
Elapsed time: 0.954974 Cores detected: 8 Worker Threads observed: 4 Available CPU time: 3.81989 Action : #calls | minimum | mean | maximum | total | stddev | % total PAPI_TOT_INS PAPI_L2_TCM ------------------------------------------------------------------------------------------------------------ allocateMatrix : 12 --n/a-- 2.21e-02 --n/a-- 2.65e-01 --n/a-- 6.930 1.62e+06 9.10e+03 compute : 4 --n/a-- 6.85e-01 --n/a-- 2.74e+00 --n/a-- 71.743 4.31e+09 1.71e+06 compute_interchange : 4 --n/a-- 1.81e-01 --n/a-- 7.23e-01 --n/a-- 18.922 3.77e+09 8.12e+05 do_work : 4 --n/a-- 9.44e-01 --n/a-- 3.78e+00 --n/a-- 98.851 8.10e+09 2.92e+06 freeMatrix : 12 --n/a-- 2.07e-04 --n/a-- 2.49e-03 --n/a-- 0.065 1.13e+06 6.30e+03 initialize : 12 --n/a-- 3.58e-03 --n/a-- 4.29e-02 --n/a-- 1.124 2.21e+07 3.80e+05 main : 1 --n/a-- 9.54e-01 --n/a-- 9.54e-01 --n/a-- 24.978 2.03e+09 7.66e+05 APEX Idle : --n/a-- --n/a-- --n/a-- --n/a-- --n/a-- --n/a-- --n/a-- ------------------------------------------------------------------------------------------------------------","title":"With PAPI"},{"location":"usecases/#csv_output","text":"While APEX is not designed for post-mortem performance analysis, you can export the data that APEX collected. 
If you set the APEX_CSV_OUTPUT environment variable to 1, APEX will also dump the timer statistics as a CSV file: khuck@ktau:~/src/apex$ cat apex.0.csv \"task\",\"num calls\",\"total cycles\",\"total microseconds\",\"PAPI_TOT_INS\",\"PAPI_L2_TCM\" \"allocateMatrix\",12,704195504,264717,1615804,9100 \"compute\",4,7290209200,2740489,4306522734,1709040 \"compute_interchange\",4,1922797744,722806,3769652571,812196 \"do_work\",4,10044907856,3776018,8101109302,2922142 \"freeMatrix\",12,6613336,2486,1132717,6301 \"initialize\",12,114177592,42921,22093639,379785 \"main\",1,2538202992,954145,2025172707,766218","title":"CSV output"},{"location":"usecases/#with_tau","text":"If APEX is configured with TAU support, then APEX measurements will be forwarded to TAU and recorded as a TAU profile. In addition, all other TAU features are supported, including sampling, MPI measurement, I/O measurement, tracing, etc. To configure APEX with TAU, specify the flags -DUSE_TAU, -DTAU_ROOT, -DTAU_ARCH, and -DTAU_OPTIONS. For example, if TAU was configured with \"./configure -pthread\" on an x86_64 Linux machine, the APEX configuration options would be \"-DUSE_TAU=1 -DTAU_ROOT=/path/to/tau -DTAU_ARCH=x86_64 -DTAU_OPTIONS=-pthread\". If TAU was configured with \"./configure -mpi -pthread\" on an x86_64 Linux machine, the APEX configuration options would be \"-DUSE_TAU=1 -DTAU_ROOT=/path/to/tau -DTAU_ARCH=x86_64 -DTAU_OPTIONS=-mpi-pthread\". 
Here is a suggested configuration for TAU on x86-Linux to use with APEX (some systems require special flags - please contact the maintaners if you are interested): # download the latest TAU release wget http://www.cs.uoregon.edu/research/paracomp/tau/tauprofile/dist/tau_latest.tar.gz # expand the tar file tar -xvzf tau_latest.tar.gz cd tau-2.25 # configure TAU ./configure -papi=/usr/local/papi/5.3.2 -pthread -prefix=/usr/local/tau/2.25 # build make -j install # set our path to include the new TAU installation export PATH=$PATH:/usr/local/tau/2.25/x86_64/bin Here is a suggested configuration for APEX to use the above TAU installation: cd xpress-apex mkdir build-tau cd build-tau cmake -DBUILD_EXAMPLES=TRUE -DBUILD_TESTS=TRUE -DCMAKE_BUILD_TYPE=RelWithDebInfo \\ -DUSE_TAU=TRUE -DTAU_ROOT=/usr/local/tau/2.25 -DTAU_ARCH=x86_64 -DTAU_OPTIONS=-papi-pthread \\ -DBUILD_BFD=TRUE -DBUILD_ACTIVEHARMONY=TRUE -DCMAKE_INSTALL_PREFIX=../install-tau .. make make tests make install After configuring, building and installing TAU and then configuring, building and installing APEX, the TAU profiling is enabled by setting the environment variable \"APEX_TAU=1\". After executing an example (say 'matmult'), there should be profile.* files in the working directory: khuck@ktau:~/src/xpress-apex$ export APEX_TAU=1 khuck@ktau:~/src/xpress-apex$ ./install/bin/matmult Spawned thread 1... Spawned thread 2... Spawned thread 3... Done. khuck@ktau:~/src/xpress-apex$ ls profile.* profile.0.0.0 profile.0.0.1 profile.0.0.2 profile.0.0.3 profile.0.0.4 profile.0.0.5 If the TAU analysis utilties are in your path, you can execute paraprof to view the profiles: khuck@ktau:~/src/xpress-apex$ paraprof ...which should launch the ParaProf profile viewer/analysis program. The profile should look something like the following (for a complete manual on using ParaProf, see the TAU website ). 
If you want to collect a TAU trace, you would enable the appropriate TAU environment variable (TAU_TRACE=1), and then re-run the example. After the execution, the trace files need to be merged (using tau_treemerge.pl) and then converted (with tau2slog2) to be viewed with the Jumpshot trace viewer (included with TAU): khuck@ktau:~/src/xpress-apex$ export APEX_TAU=1 khuck@ktau:~/src/xpress-apex$ export TAU_TRACE=1 khuck@ktau:~/src/xpress-apex$ ./install/bin/matmult Spawned thread 1... Spawned thread 2... Spawned thread 3... Done. khuck@ktau:~/src/xpress-apex$ ls *.edf *.trc events.0.edf tautrace.0.0.1.trc tautrace.0.0.3.trc tautrace.0.0.5.trc tautrace.0.0.0.trc tautrace.0.0.2.trc tautrace.0.0.4.trc # merge the trace khuck@ktau:~/src/xpress-apex$ tau_treemerge.pl /home/khuck/src/tau2/x86_64/bin/tau_merge -m tau.edf -e events.0.edf events.0.edf events.0.edf events.0.edf events.0.edf events.0.edf tautrace.0.0.0.trc tautrace.0.0.1.trc tautrace.0.0.2.trc tautrace.0.0.3.trc tautrace.0.0.4.trc tautrace.0.0.5.trc tau.trc tautrace.0.0.0.trc: 34 records read. tautrace.0.0.1.trc: 8 records read. tautrace.0.0.2.trc: 8 records read. tautrace.0.0.3.trc: 30 records read. tautrace.0.0.4.trc: 30 records read. tautrace.0.0.5.trc: 30 records read. # convert the trace khuck@ktau:~/src/xpress-apex$ tau2slog2 tau.trc tau.edf -o tau.slog2 140 records initialized. Processing. 2 Records read. 1% converted 4 Records read. 2% converted 6 Records read. 4% converted 8 Records read. 5% converted 10 Records read. 7% converted 12 Records read. 8% converted 14 Records read. 10% converted 16 Records read. 11% converted 18 Records read. 12% converted 20 Records read. 14% converted 22 Records read. 15% converted 24 Records read. 17% converted 26 Records read. 18% converted 28 Records read. 20% converted 30 Records read. 21% converted 32 Records read. 22% converted 34 Records read. 24% converted 36 Records read. 25% converted 38 Records read. 27% converted 40 Records read. 28% converted 42 Records read. 
30% converted 44 Records read. 31% converted 46 Records read. 32% converted 48 Records read. 34% converted 50 Records read. 35% converted 52 Records read. 37% converted 54 Records read. 38% converted 56 Records read. 40% converted 58 Records read. 41% converted 60 Records read. 42% converted 62 Records read. 44% converted 64 Records read. 45% converted 66 Records read. 47% converted 68 Records read. 48% converted 70 Records read. 50% converted 72 Records read. 51% converted 74 Records read. 52% converted 76 Records read. 54% converted 78 Records read. 55% converted 80 Records read. 57% converted 82 Records read. 58% converted 84 Records read. 60% converted 86 Records read. 61% converted 88 Records read. 62% converted 90 Records read. 64% converted 92 Records read. 65% converted 94 Records read. 67% converted 96 Records read. 68% converted 98 Records read. 70% converted 100 Records read. 71% converted 102 Records read. 72% converted 104 Records read. 74% converted 106 Records read. 75% converted 108 Records read. 77% converted 110 Records read. 78% converted 112 Records read. 80% converted 114 Records read. 81% converted 116 Records read. 82% converted 118 Records read. 84% converted 120 Records read. 85% converted 122 Records read. 87% converted 124 Records read. 88% converted 1521 enters: 0 exits: 0 126 Records read. 90% converted 1521 enters: 0 exits: 0 128 Records read. 91% converted 130 Records read. 92% converted 1521 enters: 0 exits: 0 132 Records read. 94% converted 1521 enters: 0 exits: 0 134 Records read. 95% converted 136 Records read. 97% converted 1521 enters: 0 exits: 0 138 Records read. 98% converted 1521 enters: 0 exits: 0 140 Records read. 100% converted Reached end of trace file. 
Getting YMap, Maxnode: 0, Maxthread: 5 SLOG-2 Header: version = SLOG 2.0.6 NumOfChildrenPerNode = 2 TreeLeafByteSize = 65536 MaxTreeDepth = 0 MaxBufferByteSize = 1960 Categories is FBinfo(641 @ 2068) MethodDefs is FBinfo(0 @ 0) LineIDMaps is FBinfo(197 @ 2709) TreeRoot is FBinfo(1960 @ 108) TreeDir is FBinfo(38 @ 2906) Annotations is FBinfo(0 @ 0) Postamble is FBinfo(0 @ 0) 1521 enters: 0 exits: 0 Number of Drawables = 58 timeElapsed between 1 & 2 = 67 msec timeElapsed between 2 & 3 = 28 msec # open jumpshot khuck@ktau:~/src/xpress-apex$ jumpshot tau.slog2","title":"With TAU"},{"location":"usecases/#policy_rules_and_runtime_adaptation","text":"...Coming soon!","title":"Policy Rules and Runtime Adaptation"}]}
\ No newline at end of file
+{"config":{"indexing":"full","lang":["en"],"min_search_length":3,"prebuild_index":false,"separator":"[\\s\\-]+"},"docs":[{"location":"","text":"APEX: Autonomic Performance Environment for eXascale \u00b6 One of the key components of the US Department of Energy funded XPRESS project was a new approach to performance observation, measurement, analysis and runtime decision making in order to optimize performance. The particular challenges of accurately measuring the performance characteristics of ParalleX [1] (e.g. HPX) applications (as well as other asynchronous multitasking runtime architectures) require a new approach to parallel performance observation. The traditional model of multiple operating system processes and threads observing themselves in a first-person manner while writing out performance profiles or traces for offline analysis will not adequately capture the full execution context, nor provide opportunities for runtime adaptation. The approach taken in the completed XPRESS project was a new performance measurement system, called APEX (Autonomic Performance Environment for eXascale). APEX includes methods for information sharing between the layers of the software stack, from the hardware through operating and runtime systems, all the way to domain specific or legacy applications. The performance measurement components incorporate relevant information across stack layers, with merging of third-person performance observation of node-level and global resources, remote processes, and both operating and runtime system threads. For a complete design description of APEX, see the publication \"APEX: An Autonomic Performance Environment for eXascale\" [3]. Since its original project, APEX has been extended to support many popular runtime systems [11]. In short, APEX is an introspection and runtime adaptation library for asynchronous multitasking runtime systems. 
However, APEX is not only useful for AMT/AMR runtimes running on future exascale systems - it can be used by any application wanting to perform runtime adaptation to deal with heterogeneous and/or variable environments. Introspection \u00b6 APEX provides an API for measuring actions within a runtime. The API includes methods for timer start/stop, as well as sampled counter values. APEX is designed to be integrated into a runtime, library and/or application and provide performance introspection for the purpose of runtime adaptation. While APEX can provide rudimentary post-mortem performance analysis measurement, there are many other performance measurement tools that perform that task more robustly (such as TAU http://tau.uoregon.edu ). That said, APEX includes an event listener that integrates with the TAU measurement system, so APEX events can be forwarded to TAU and collected in a TAU profile and/or trace to be used for post-mortem performance analysis. Runtime Adaptation \u00b6 APEX provides a mechanism for dynamic runtime behavior, either for autotuning or adaptation to a changing environment. The infrastructure that provides the adaptation is the Policy Engine, which executes policies either periodically or triggered by events. The policies have access to the performance state as observed by the APEX introspection API. APEX has several built-in search strategies, including exhaustive, random, simulated annealing, and hill climbing. APEX is also integrated with Active Harmony http://www.dyninst.org/harmony to provide dynamic search using the Nelder Mead algorithm. Citing APEX \u00b6 Please use the following citation: https://doi.org/10.1109/ESPM256814.2022.00008 References & APEX-related Publications \u00b6 Thomas Sterling, Daniel Kogler, Matthew Anderson, and Maciej Brodowicz. \"SLOWER: A performance model for Exascale computing\". Supercomputing Frontiers and Innovations , 1:42\u201357, September 2014. 
http://superfri.org/superfri/article/view/10 Koniges, Alice, Jayashree Ajay Candadai, Hartmut Kaiser, Kevin Huck, Jeremy Kemp, Thomas Heller, Matthew Anderson et al. \"HPX Applications and Performance Adaptation\". No. SAND2015-8999C. Sandia National Lab.(SNL-NM), Albuquerque, NM (United States), 2015. https://www.osti.gov/servlets/purl/1332791 Kevin A. Huck, Allan Porterfield, Nick Chaimov, Hartmut Kaiser, Allen D. Malony, Thomas Sterling, Rob Fowler. \"An Autonomic Performance Environment for eXascale\", Journal of Supercomputing Frontiers and Innovations , 2015. http://superfri.org/superfri/article/view/64 Grubel, Patricia, Hartmut Kaiser, Kevin Huck, and Jeanine Cook. \"Using intrinsic performance counters to assess efficiency in task-based parallel applications.\" In 2016 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW) , pp. 1692-1701. IEEE, 2016. https://www.cs.uoregon.edu/research/paracomp/papers/ipdps16/hpcmaspa2016.pdf Bari, Md Abdullah Shahneous, Nicholas Chaimov, Abid M. Malik, Kevin A. Huck, Barbara Chapman, Allen D. Malony, and Osman Sarood. \"Arcs: Adaptive runtime configuration selection for power-constrained openmp applications.\" In 2016 IEEE International Conference on Cluster Computing (CLUSTER) , pp. 461-470. IEEE, 2016. https://www.cs.uoregon.edu/research/paracomp/papers/cluster16/arcs.pdf Tohid, R., Bibek Wagle, Shahrzad Shirzad, Patrick Diehl, Adrian Serio, Alireza Kheirkhahan, Parsa Amini et al. \"Asynchronous execution of python code on task-based runtime systems.\" In 2018 IEEE/ACM 4th International Workshop on Extreme Scale Programming Models and Middleware (ESPM2), pp. 37-45. IEEE, 2018. http://hdc.cs.arizona.edu/papers/espm2_2018_phylanx.pdf Heller, Thomas, Bryce Adelstein Lelbach, Kevin A. Huck, John Biddiscombe, Patricia Grubel, Alice E. Koniges, Matthias Kretz et al. 
\"Harnessing billions of tasks for a scalable portable hydrodynamic simulation of the merger of two stars.\" The International Journal of High Performance Computing Applications 33, no. 4 (2019): 699-715. https://journals.sagepub.com/doi/full/10.1177/1094342018819744 Wagle, Bibek, Mohammad Alaul Haque Monil, Kevin Huck, Allen D. Malony, Adrian Serio, and Hartmut Kaiser. \"Runtime adaptive task inlining on asynchronous multitasking runtime systems.\" In Proceedings of the 48th International Conference on Parallel Processing, pp. 1-10. 2019. https://dl.acm.org/doi/abs/10.1145/3337821.3337915 Dai\u00df, Gregor, Parsa Amini, John Biddiscombe, Patrick Diehl, Juhan Frank, Kevin Huck, Hartmut Kaiser, Dominic Marcello, David Pfander, and Dirk Pf\u00fcger. \"From piz daint to the stars: simulation of stellar mergers using high-level abstractions.\" In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, pp. 1-37. 2019. https://arxiv.org/abs/1908.03121 Steven R. Brandt, Alex Bigelow, Sayef Azad Sakin, Katy Williams, Katherine E. Isaacs, Kevin Huck, Rod Tohid, Bibek Wagle, Shahrzad Shirzad, and Hartmut Kaiser. 2020. \"JetLag: An Interactive, Asynchronous Array Computing Environment\". In Practice and Experience in Advanced Research Computing (PEARC '20). Association for Computing Machinery, New York, NY, USA, 8\u201312. DOI: https://doi.org/10.1145/3311790.3396657 Kevin A. Huck, \"Broad Performance Measurement Support for Asynchronous Multi-Tasking with APEX,\" 2022 IEEE/ACM 7th International Workshop on Extreme Scale Programming Models and Middleware (ESPM2), Dallas, TX, USA, 2022, pp. 20-29. 
https://doi.org/10.1109/ESPM256814.2022.00008","title":"Home"},{"location":"#apex_autonomic_performance_environment_for_exascale","text":"One of the key components of the US Department of Energy funded XPRESS project was a new approach to performance observation, measurement, analysis and runtime decision making in order to optimize performance. The particular challenges of accurately measuring the performance characteristics of ParalleX [1] (e.g. HPX) applications (as well as other asynchronous multitasking runtime architectures) require a new approach to parallel performance observation. The traditional model of multiple operating system processes and threads observing themselves in a first-person manner while writing out performance profiles or traces for offline analysis will not adequately capture the full execution context, nor provide opportunities for runtime adaptation. The approach taken in the completed XPRESS project was a new performance measurement system, called APEX (Autonomic Performance Environment for eXascale). APEX includes methods for information sharing between the layers of the software stack, from the hardware through operating and runtime systems, all the way to domain specific or legacy applications. The performance measurement components incorporate relevant information across stack layers, with merging of third-person performance observation of node-level and global resources, remote processes, and both operating and runtime system threads. For a complete design description of APEX, see the publication \"APEX: An Autonomic Performance Environment for eXascale\" [3]. Since its original project, APEX has been extended to support many popular runtime systems [11]. In short, APEX is an introspection and runtime adaptation library for asynchronous multitasking runtime systems. 
However, APEX is not only useful for AMT/AMR runtimes running on future exascale systems - it can be used by any application wanting to perform runtime adaptation to deal with heterogeneous and/or variable environments.","title":"APEX: Autonomic Performance Environment for eXascale"},{"location":"#introspection","text":"APEX provides an API for measuring actions within a runtime. The API includes methods for timer start/stop, as well as sampled counter values. APEX is designed to be integrated into a runtime, library and/or application and provide performance introspection for the purpose of runtime adaptation. While APEX can provide rudimentary post-mortem performance analysis measurement, there are many other performance measurement tools that perform that task more robustly (such as TAU http://tau.uoregon.edu ). That said, APEX includes an event listener that integrates with the TAU measurement system, so APEX events can be forwarded to TAU and collected in a TAU profile and/or trace to be used for post-mortem performance analysis.","title":"Introspection"},{"location":"#runtime_adaptation","text":"APEX provides a mechanism for dynamic runtime behavior, either for autotuning or adaptation to a changing environment. The infrastructure that provides the adaptation is the Policy Engine, which executes policies either periodically or triggered by events. The policies have access to the performance state as observed by the APEX introspection API. APEX has several built-in search strategies, including exhaustive, random, simulated annealing, and hill climbing. 
APEX is also integrated with Active Harmony http://www.dyninst.org/harmony to provide dynamic search using the Nelder Mead algorithm.","title":"Runtime Adaptation"},{"location":"#citing_apex","text":"Please use the following citation: https://doi.org/10.1109/ESPM256814.2022.00008","title":"Citing APEX"},{"location":"#references_apex-related_publications","text":"Thomas Sterling, Daniel Kogler, Matthew Anderson, and Maciej Brodowicz. \"SLOWER: A performance model for Exascale computing\". Supercomputing Frontiers and Innovations , 1:42\u201357, September 2014. http://superfri.org/superfri/article/view/10 Koniges, Alice, Jayashree Ajay Candadai, Hartmut Kaiser, Kevin Huck, Jeremy Kemp, Thomas Heller, Matthew Anderson et al. \"HPX Applications and Performance Adaptation\". No. SAND2015-8999C. Sandia National Lab.(SNL-NM), Albuquerque, NM (United States), 2015. https://www.osti.gov/servlets/purl/1332791 Kevin A. Huck, Allan Porterfield, Nick Chaimov, Hartmut Kaiser, Allen D. Malony, Thomas Sterling, Rob Fowler. \"An Autonomic Performance Environment for eXascale\", Journal of Supercomputing Frontiers and Innovations , 2015. http://superfri.org/superfri/article/view/64 Grubel, Patricia, Hartmut Kaiser, Kevin Huck, and Jeanine Cook. \"Using intrinsic performance counters to assess efficiency in task-based parallel applications.\" In 2016 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW) , pp. 1692-1701. IEEE, 2016. https://www.cs.uoregon.edu/research/paracomp/papers/ipdps16/hpcmaspa2016.pdf Bari, Md Abdullah Shahneous, Nicholas Chaimov, Abid M. Malik, Kevin A. Huck, Barbara Chapman, Allen D. Malony, and Osman Sarood. \"Arcs: Adaptive runtime configuration selection for power-constrained openmp applications.\" In 2016 IEEE International Conference on Cluster Computing (CLUSTER) , pp. 461-470. IEEE, 2016. 
https://www.cs.uoregon.edu/research/paracomp/papers/cluster16/arcs.pdf Tohid, R., Bibek Wagle, Shahrzad Shirzad, Patrick Diehl, Adrian Serio, Alireza Kheirkhahan, Parsa Amini et al. \"Asynchronous execution of python code on task-based runtime systems.\" In 2018 IEEE/ACM 4th International Workshop on Extreme Scale Programming Models and Middleware (ESPM2), pp. 37-45. IEEE, 2018. http://hdc.cs.arizona.edu/papers/espm2_2018_phylanx.pdf Heller, Thomas, Bryce Adelstein Lelbach, Kevin A. Huck, John Biddiscombe, Patricia Grubel, Alice E. Koniges, Matthias Kretz et al. \"Harnessing billions of tasks for a scalable portable hydrodynamic simulation of the merger of two stars.\" The International Journal of High Performance Computing Applications 33, no. 4 (2019): 699-715. https://journals.sagepub.com/doi/full/10.1177/1094342018819744 Wagle, Bibek, Mohammad Alaul Haque Monil, Kevin Huck, Allen D. Malony, Adrian Serio, and Hartmut Kaiser. \"Runtime adaptive task inlining on asynchronous multitasking runtime systems.\" In Proceedings of the 48th International Conference on Parallel Processing, pp. 1-10. 2019. https://dl.acm.org/doi/abs/10.1145/3337821.3337915 Dai\u00df, Gregor, Parsa Amini, John Biddiscombe, Patrick Diehl, Juhan Frank, Kevin Huck, Hartmut Kaiser, Dominic Marcello, David Pfander, and Dirk Pf\u00fcger. \"From piz daint to the stars: simulation of stellar mergers using high-level abstractions.\" In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, pp. 1-37. 2019. https://arxiv.org/abs/1908.03121 Steven R. Brandt, Alex Bigelow, Sayef Azad Sakin, Katy Williams, Katherine E. Isaacs, Kevin Huck, Rod Tohid, Bibek Wagle, Shahrzad Shirzad, and Hartmut Kaiser. 2020. \"JetLag: An Interactive, Asynchronous Array Computing Environment\". In Practice and Experience in Advanced Research Computing (PEARC '20). Association for Computing Machinery, New York, NY, USA, 8\u201312. 
DOI: https://doi.org/10.1145/3311790.3396657 Kevin A. Huck, \"Broad Performance Measurement Support for Asynchronous Multi-Tasking with APEX,\" 2022 IEEE/ACM 7th International Workshop on Extreme Scale Programming Models and Middleware (ESPM2), Dallas, TX, USA, 2022, pp. 20-29. https://doi.org/10.1109/ESPM256814.2022.00008","title":"References & APEX-related Publications"},{"location":"environment/","text":"APEX Runtime Options \u00b6 Environment Variables \u00b6 There are a number of environment variables that control APEX behavior at runtime. The variables can be defined in the environment before application execution, or specified in a file called apex.conf in the current execution directory. The format of the configuration file is: APEX_VARIABLE1=value APEX_VARIABLE2=value ... To generate a default APEX configuration file in the current working directory, run the ./install/bin/apex_make_default_config program. To get a list of all known environment variables, run the ./install/bin/apex_environment_help program. 
Environment Variable Default Value Valid Values Description APEX_DISABLE 0 0,1 Disable APEX during the application execution APEX_SUSPEND 0 0,1 Suspend APEX timers and counters during the application execution APEX_PAPI_SUSPEND 0 0,1 Suspend PAPI counters during the application execution APEX_SCREEN_OUTPUT 0 0,1 Output APEX performance summary at exit APEX_VERBOSE 0 0,1 Output APEX options at entry APEX_PROFILE_OUTPUT 0 0,1 Output TAU profile of performance summary APEX_CSV_OUTPUT 0 0,1 Output CSV profile of performance summary APEX_TASKGRAPH_OUTPUT 0 0,1 Output graphviz reduced taskgraph APEX_POLICY 1 0,1 Enable APEX policy listener and execute registered policies APEX_PROC_STAT 1 0,1 Periodically read data from /proc/stat APEX_PROC_CPUINFO 0 0,1 Read data (once) from /proc/cpuinfo APEX_PROC_MEMINFO 0 0,1 Periodically read data from /proc/meminfo APEX_PROC_NET_DEV 0 0,1 Periodically read data from /proc/net/dev APEX_PROC_SELF_STATUS 0 0,1 Periodically read data from /proc/self/status APEX_PROC_SELF_IO 0 0,1 Periodically read data from /proc/self/io APEX_PROC_STAT_DETAILS 0 0,1 Periodically read detailed data from /proc/self/stat APEX_PROC_PERIOD 1000000 Integer /proc data read sampling period, in microseconds APEX_MEASURE_CONCURRENCY 0 0,1 Periodically sample thread activity and output report at exit APEX_MEASURE_CONCURRENCY_PERIOD 1000000 Integer Thread concurrency sampling period, in microseconds APEX_OTF2 0 0,1 Enable OTF2 trace output. APEX_TRACE_EVENT 0 0,1 Enable Google Trace Event output. APEX_OTF2_ARCHIVE_PATH OTF2_archive valid path OTF2 trace directory. APEX_OTF2_ARCHIVE_NAME APEX valid string OTF2 trace filename. APEX_TAU 0 0,1 Enable TAU profiling (if application is executed with tau_exec ). 
APEX_THROTTLE_CONCURRENCY 0 0,1 Enable thread concurrency throttling APEX_THROTTLING_MIN_THREADS 1 Integer Minimum threads allowed APEX_THROTTLING_MAX_THREADS 8 Integer Maximum threads allowed APEX_THROTTLE_ENERGY 0 0,1 Enable energy throttling APEX_THROTTLE_ENERGY_PERIOD 1000000 Integer Power sampling period, in microseconds APEX_THROTTLING_MIN_WATTS 150 Integer Minimum Watt threshold APEX_THROTTLING_MAX_WATTS 300 Integer Maximum Watt threshold APEX_PTHREAD_WRAPPER_STACK_SIZE 0 16k-8M When wrapping pthread_create, use this size for the stack. APEX_PAPI_METRICS null space-delimited string of metric names List of metrics to be measured by APEX when timers are used. Only meaningful if APEX is configured with PAPI support. Any supported metric from papi_avail ( see PAPI Documentation ) can be used. APEX_PAPI_SUSPEND 0 0,1 Suspend collection of PAPI metrics for APEX timers during the application execution APEX_PROCESS_ASYNC_STATE 1 0,1 Enable/disable asynchronous processing of statistics (useful when only collecting trace data) APEX_UNTIED_TIMERS 0 0,1 Disable callstack state maintenance for specific OS threads. This allows APEX timers to start on one thread and stop on another. This is not compatible with tracing. APEX_OMPT_REQUIRED_EVENTS_ONLY 0 0,1 Disable moderate-frequency, moderate-overhead OMPT events. APEX_OMPT_HIGH_OVERHEAD_EVENTS 0 0,1 Disable high-frequency, high-overhead OMPT events. APEX_PIN_APEX_THREADS 1 0,1 Pin APEX asynchronous threads to the last core/PU on the system. APEX_TASK_SCATTERPLOT 0 0,1 Periodically sample APEX tasks, generating a scatterplot of time distributions. APEX_TIME_TOP_LEVEL_OS_THREADS 0 0,1 When registering threads, measure their lifetimes. APEX_CUDA_COUNTERS 0 0,1 Enable CUDA CUPTI counter measurement. APEX_CUDA_KERNEL_DETAILS 0 0,1 Enable Context information for CUDA CUPTI counter measurement and CUDA CUPTI API callback timers. APEX_CUDA_RUNTIME_API 1 0,1 Enable callbacks for the CUDA Runtime API ( cuda*() functions). 
APEX_CUDA_DRIVER_API 0 0,1 Enable callbacks for the CUDA Driver API ( cu*() functions). APEX_JUPYTER_SUPPORT 0 0,1 When running HPX in a Jupyter notebook, enable special handling for APEX data output and system reset. apex_exec flags \u00b6 To control the behavior of APEX when using apex_exec , many flags are available, several of which will automatically set the above environment variables as necessary: Usage: apex_exec executable where APEX options are zero or more of: --apex:help show this usage message --apex:debug run with APEX in debugger --apex:verbose enable verbose list of APEX environment variables --apex:screen enable screen text output (on by default) --apex:screen-detail enable detailed text output (off by default) --apex:quiet disable screen text output --apex:final-output-only only output performance data at exit (ignore intermediate dump calls) --apex:csv enable csv text output --apex:tau enable tau profile output --apex:taskgraph enable taskgraph output (graphviz required for post-processing) --apex:tasktree enable tasktree output (python3 with Pandas required for post-processing) --apex:hatchet enable Hatchet tasktree output (python3 with Hatchet required for post-processing) --apex:concur Periodically sample thread activity (default: off) --apex:concur-max Max timers to track with concurrency activity (default: 5) --apex:concur-period Frequency of concurrency sampling, in microseconds (default: 1000000) --apex:throttle throttle short-lived timers to reduce overhead (default: off) --apex:throttle-calls minimum number of calls before throttling (default: 1000) --apex:throttle-per minimum timer duration in microseconds (default: 10) --apex:otf2 enable OTF2 trace output (requires --apex:mpi with MPI configurations) --apex:otf2path specify location of OTF2 archive (default: ./OTF2_archive) --apex:otf2name specify name of OTF2 file (default: APEX) --apex:gtrace enable Google Trace Events output (deprecated) --apex:pftrace enable Perfetto Trace output 
--apex:scatter enable scatterplot output (python required for post-processing) --apex:openacc enable OpenACC support --apex:kokkos enable Kokkos support --apex:kokkos-tuning enable Kokkos runtime autotuning support --apex:kokkos-fence enable Kokkos fences for async kernels --apex:raja enable RAJA support --apex:pthread enable pthread wrapper support --apex:gpu-memory enable GPU memory wrapper support --apex:cpu-memory enable CPU memory wrapper support --apex:untied enable tasks to migrate cores/OS threads during execution (not compatible with trace output) --apex:cuda enable CUDA/CUPTI measurement (default: off) --apex:cuda-counters enable CUDA/CUPTI counter support (default: off) --apex:cuda-driver enable CUDA driver API callbacks (default: off) --apex:cuda-details enable per-kernel statistics where available (default: off) --apex:hip enable HIP/ROCTracer measurement (default: off) --apex:hip-metrics enable HIP/ROCProfiler metric support (default: off) --apex:hip-counters enable HIP/ROCTracer counter support (default: off) --apex:hip-driver enable HIP/ROCTracer KSA driver API callbacks (default: off) --apex:hip-details enable per-kernel statistics where available (default: off) --apex:monitor-gpu enable GPU monitoring services (CUDA NVML, ROCm SMI) --apex:level0 enable OneAPI Level0 measurement (default: off) --apex:cpuinfo enable sampling of /proc/cpuinfo (Linux only) --apex:meminfo enable sampling of /proc/meminfo (Linux only) --apex:net enable sampling of /proc/net/dev (Linux only) --apex:status enable sampling of /proc/self/status (Linux only) --apex:io enable sampling of /proc/self/io (Linux only) --apex:period specify frequency of OS/HW sampling --apex:mpi enable MPI profiling (required for OTF2 support with MPI configurations) --apex:ompt enable OpenMP profiling (requires runtime support) --apex:ompt-simple only enable OpenMP Tools required events --apex:ompt-details enable all OpenMP Tools events --apex:source resolve function, file and line info for 
address lookups with binutils (default: function only) --apex:preload extra libraries to load with LD_PRELOAD _before_ APEX libraries (LD_PRELOAD value is added _after_ APEX libraries) --apex:postprocess run post-process scripts (graphviz, python) on output data after exit","title":"Useful Environment Variables"},{"location":"environment/#apex_runtime_options","text":"","title":"APEX Runtime Options"},{"location":"environment/#environment_variables","text":"There are a number of environment variables that control APEX behavior at runtime. The variables can be defined in the environment before application execution, or specified in a file called apex.conf in the current execution directory. The format of the configuration file is: APEX_VARIABLE1=value APEX_VARIABLE2=value ... To generate a default APEX configuration file in the current working directory, run the ./install/bin/apex_make_default_config program. To get a list of all known environment variables, run the ./install/bin/apex_environment_help program. 
Environment Variable Default Value Valid Values Description APEX_DISABLE 0 0,1 Disable APEX during the application execution APEX_SUSPEND 0 0,1 Suspend APEX timers and counters during the application execution APEX_PAPI_SUSPEND 0 0,1 Suspend PAPI counters during the application execution APEX_SCREEN_OUTPUT 0 0,1 Output APEX performance summary at exit APEX_VERBOSE 0 0,1 Output APEX options at entry APEX_PROFILE_OUTPUT 0 0,1 Output TAU profile of performance summary APEX_CSV_OUTPUT 0 0,1 Output CSV profile of performance summary APEX_TASKGRAPH_OUTPUT 0 0,1 Output graphviz reduced taskgraph APEX_POLICY 1 0,1 Enable APEX policy listener and execute registered policies APEX_PROC_STAT 1 0,1 Periodically read data from /proc/stat APEX_PROC_CPUINFO 0 0,1 Read data (once) from /proc/cpuinfo APEX_PROC_MEMINFO 0 0,1 Periodically read data from /proc/meminfo APEX_PROC_NET_DEV 0 0,1 Periodically read data from /proc/net/dev APEX_PROC_SELF_STATUS 0 0,1 Periodically read data from /proc/self/status APEX_PROC_SELF_IO 0 0,1 Periodically read data from /proc/self/io APEX_PROC_STAT_DETAILS 0 0,1 Periodically read detailed data from /proc/self/stat APEX_PROC_PERIOD 1000000 Integer /proc data read sampling period, in microseconds APEX_MEASURE_CONCURRENCY 0 0,1 Periodically sample thread activity and output report at exit APEX_MEASURE_CONCURRENCY_PERIOD 1000000 Integer Thread concurrency sampling period, in microseconds APEX_OTF2 0 0,1 Enable OTF2 trace output. APEX_TRACE_EVENT 0 0,1 Enable Google Trace Event output. APEX_OTF2_ARCHIVE_PATH OTF2_archive valid path OTF2 trace directory. APEX_OTF2_ARCHIVE_NAME APEX valid string OTF2 trace filename. APEX_TAU 0 0,1 Enable TAU profiling (if application is executed with tau_exec ). 
APEX_THROTTLE_CONCURRENCY 0 0,1 Enable thread concurrency throttling APEX_THROTTLING_MIN_THREADS 1 Integer Minimum threads allowed APEX_THROTTLING_MAX_THREADS 8 Integer Maximum threads allowed APEX_THROTTLE_ENERGY 0 0,1 Enable energy throttling APEX_THROTTLE_ENERGY_PERIOD 1000000 Integer Power sampling period, in microseconds APEX_THROTTLING_MIN_WATTS 150 Integer Minimum Watt threshold APEX_THROTTLING_MAX_WATTS 300 Integer Maximum Watt threshold APEX_PTHREAD_WRAPPER_STACK_SIZE 0 16k-8M When wrapping pthread_create, use this size for the stack. APEX_PAPI_METRICS null space-delimited string of metric names List of metrics to be measured by APEX when timers are used. Only meaningful if APEX is configured with PAPI support. Any supported metric from papi_avail ( see PAPI Documentation ) can be used. APEX_PAPI_SUSPEND 0 0,1 Suspend collection of PAPI metrics for APEX timers during the application execution APEX_PROCESS_ASYNC_STATE 1 0,1 Enable/disable asynchronous processing of statistics (useful when only collecting trace data) APEX_UNTIED_TIMERS 0 0,1 Disable callstack state maintenance for specific OS threads. This allows APEX timers to start on one thread and stop on another. This is not compatible with tracing. APEX_OMPT_REQUIRED_EVENTS_ONLY 0 0,1 Disable moderate-frequency, moderate-overhead OMPT events. APEX_OMPT_HIGH_OVERHEAD_EVENTS 0 0,1 Disable high-frequency, high-overhead OMPT events. APEX_PIN_APEX_THREADS 1 0,1 Pin APEX asynchronous threads to the last core/PU on the system. APEX_TASK_SCATTERPLOT 0 0,1 Periodically sample APEX tasks, generating a scatterplot of time distributions. APEX_TIME_TOP_LEVEL_OS_THREADS 0 0,1 When registering threads, measure their lifetimes. APEX_CUDA_COUNTERS 0 0,1 Enable CUDA CUPTI counter measurement. APEX_CUDA_KERNEL_DETAILS 0 0,1 Enable Context information for CUDA CUPTI counter measurement and CUDA CUPTI API callback timers. APEX_CUDA_RUNTIME_API 1 0,1 Enable callbacks for the CUDA Runtime API ( cuda*() functions). 
APEX_CUDA_DRIVER_API 0 0,1 Enable callbacks for the CUDA Driver API ( cu*() functions). APEX_JUPYTER_SUPPORT 0 0,1 When running HPX in a Jupyter notebook, enable special handling for APEX data output and system reset.","title":"Environment Variables"},{"location":"environment/#apex_exec_flags","text":"To control the behavior of APEX when using apex_exec , many flags are available, several of which will automatically set the above environment variables as necessary: Usage: apex_exec executable where APEX options are zero or more of: --apex:help show this usage message --apex:debug run with APEX in debugger --apex:verbose enable verbose list of APEX environment variables --apex:screen enable screen text output (on by default) --apex:screen-detail enable detailed text output (off by default) --apex:quiet disable screen text output --apex:final-output-only only output performance data at exit (ignore intermediate dump calls) --apex:csv enable csv text output --apex:tau enable tau profile output --apex:taskgraph enable taskgraph output (graphviz required for post-processing) --apex:tasktree enable tasktree output (python3 with Pandas required for post-processing) --apex:hatchet enable Hatchet tasktree output (python3 with Hatchet required for post-processing) --apex:concur Periodically sample thread activity (default: off) --apex:concur-max Max timers to track with concurrency activity (default: 5) --apex:concur-period Frequency of concurrency sampling, in microseconds (default: 1000000) --apex:throttle throttle short-lived timers to reduce overhead (default: off) --apex:throttle-calls minimum number of calls before throttling (default: 1000) --apex:throttle-per minimum timer duration in microseconds (default: 10) --apex:otf2 enable OTF2 trace output (requires --apex:mpi with MPI configurations) --apex:otf2path specify location of OTF2 archive (default: ./OTF2_archive) --apex:otf2name specify name of OTF2 file (default: APEX) --apex:gtrace enable Google Trace Events 
output (deprecated) --apex:pftrace enable Perfetto Trace output --apex:scatter enable scatterplot output (python required for post-processing) --apex:openacc enable OpenACC support --apex:kokkos enable Kokkos support --apex:kokkos-tuning enable Kokkos runtime autotuning support --apex:kokkos-fence enable Kokkos fences for async kernels --apex:raja enable RAJA support --apex:pthread enable pthread wrapper support --apex:gpu-memory enable GPU memory wrapper support --apex:cpu-memory enable CPU memory wrapper support --apex:untied enable tasks to migrate cores/OS threads during execution (not compatible with trace output) --apex:cuda enable CUDA/CUPTI measurement (default: off) --apex:cuda-counters enable CUDA/CUPTI counter support (default: off) --apex:cuda-driver enable CUDA driver API callbacks (default: off) --apex:cuda-details enable per-kernel statistics where available (default: off) --apex:hip enable HIP/ROCTracer measurement (default: off) --apex:hip-metrics enable HIP/ROCProfiler metric support (default: off) --apex:hip-counters enable HIP/ROCTracer counter support (default: off) --apex:hip-driver enable HIP/ROCTracer KSA driver API callbacks (default: off) --apex:hip-details enable per-kernel statistics where available (default: off) --apex:monitor-gpu enable GPU monitoring services (CUDA NVML, ROCm SMI) --apex:level0 enable OneAPI Level0 measurement (default: off) --apex:cpuinfo enable sampling of /proc/cpuinfo (Linux only) --apex:meminfo enable sampling of /proc/meminfo (Linux only) --apex:net enable sampling of /proc/net/dev (Linux only) --apex:status enable sampling of /proc/self/status (Linux only) --apex:io enable sampling of /proc/self/io (Linux only) --apex:period specify frequency of OS/HW sampling --apex:mpi enable MPI profiling (required for OTF2 support with MPI configurations) --apex:ompt enable OpenMP profiling (requires runtime support) --apex:ompt-simple only enable OpenMP Tools required events --apex:ompt-details enable all OpenMP Tools 
events --apex:source resolve function, file and line info for address lookups with binutils (default: function only) --apex:preload extra libraries to load with LD_PRELOAD _before_ APEX libraries (LD_PRELOAD value is added _after_ APEX libraries) --apex:postprocess run post-process scripts (graphviz, python) on output data after exit","title":"apex_exec flags"},{"location":"examples/","text":"HPX-3 and 1D stencil \u00b6 ...coming soon... HPX-5 and LULESH \u00b6 ...coming soon... HPX-5 and SSSP \u00b6 ...coming soon... HPX-5 and MiniGhost \u00b6 ...coming soon... OpenMP and LULESH 2.0 \u00b6 ...coming soon... OpenMP and NPB 3.2.1 \u00b6 ...coming soon... MPI applications \u00b6 ...coming soon...","title":"HPX-3 and 1D stencil"},{"location":"examples/#hpx-3_and_1d_stencil","text":"...coming soon...","title":"HPX-3 and 1D stencil"},{"location":"examples/#hpx-5_and_lulesh","text":"...coming soon...","title":"HPX-5 and LULESH"},{"location":"examples/#hpx-5_and_sssp","text":"...coming soon...","title":"HPX-5 and SSSP"},{"location":"examples/#hpx-5_and_minighost","text":"...coming soon...","title":"HPX-5 and MiniGhost"},{"location":"examples/#openmp_and_lulesh_20","text":"...coming soon...","title":"OpenMP and LULESH 2.0"},{"location":"examples/#openmp_and_npb_321","text":"...coming soon...","title":"OpenMP and NPB 3.2.1"},{"location":"examples/#mpi_applications","text":"...coming soon...","title":"MPI applications"},{"location":"feature/","text":"Feature Overview \u00b6 APEX: Motivation \u00b6 Frequently, software components or even entire applications run into a situation where the context of the execution environment has changed in some way (or does not meet assumptions). In those situations, the software requires some mechanism for evaluating its own performance and that of the underlying runtime system, operating system and hardware. 
The types of adaptation that the software wants to do could include: Controlling concurrency to improve energy efficiency for performance Parametric variability adjust the decomposition granularity for this machine / dataset choose a different algorithm for better performance/accuracy choose a different preconditioner for better performance/accuracy choose a different solver for better performance/accuracy Load Balancing when to perform AGAS migration? when to perform repartitioning? when to perform data exchanges? Parallel Algorithms (for_each\u2026) - choose a different execution model separate what from how Address the \u201cSLOW(ER)\u201d performance model avoid S tarvation reduce L atency reduce O verhead reduce W aiting reduce E nergy consumption improve R esiliency APEX provides both performance awareness and performance adaptation . APEX provides top-down and bottom-up performance mapping and feedback. APEX exposes node-wide resource utilization data and analysis, energy consumption, and health information in real time Software can subsequently associate performance state with policy for feedback control APEX introspection OS: track system resources, utilization, job contention, overhead Runtime (e.g. HPX, OpenMP, CUDA, OpenACC, Kokkos...): track threads, queues, concurrency, remote operations, parcels, memory management Application timer / counter observation Above: APEX architecture diagram (when linked with an HPX application). The application and runtime send events to the APEX instrumentation API, which updates the performance state. The Policy Engine executes policies that change application behavior based on rule outcomes. Supported Parallel Models \u00b6 HPX - APEX is fully integrated into the HPX runtime, so that all tasks that are scheduled by the thread scheduler are measured by APEX. In addition, all HPX counters are captured by APEX. 
C++ threads ( std::thread , std::async ) and vanilla POSIX threads - Using a pthread_create() wrapper, APEX can capture all spawned threads and measure the time spent in those top level functions. OpenMP - Using the OpenMP 5.0 OMPT interface, APEX can capture performance data related to OpenMP pragmas. OpenACC - Using the OpenACC Profiling interface, APEX can capture performance data related to OpenACC pragmas. Kokkos - Using the Kokkos profiling interface, APEX can capture performance data related to Kokkos parallel abstractions. RAJA - Using the RAJA profiling interface, APEX can capture performance data related to RAJA parallel abstractions. Unlike Kokkos, RAJA doesn't give any details, so don't expect much. CUDA - Using the NVIDIA CUPTI and NVML libraries, APEX can capture runtime and driver API calls as well as memory transfers and kernels executed on a device, and monitor GPU utilization. HIP - Using the AMD Roctracer, Rocprofiler and ROCM-SMI libraries, APEX can capture runtime and driver API calls as well as memory transfers and kernels executed on a device, and monitor GPU utilization. Intel SYCL - Using the Intel Level0 libraries, APEX can capture runtime and driver API calls as well as memory transfers and kernels executed on a device, and monitor GPU utilization. PhiProf - APEX is integrated with support to intercept PhiProf profiling data. See https://github.com/fmihpc/phiprof . StarPU - APEX is integrated with support to profile StarPU. See https://starpu.gitlabpages.inria.fr . Distributed Execution over MPI - While APEX doesn't measure all MPI function calls, it is \"MPI-aware\", and can detect when used in a distributed run so that each process can write separate or aggregated performance data. APEX provides rudimentary support for measuring point-to-point and collectives. 
Parallel Models with Experimental Support / In Development / Wish List \u00b6 Argobots - APEX has been used to instrument services based on Argobots, but it is not integrated into the runtime. TBB - The APEX team is evaluating integrated TBB support. Legion - No plans at this time. Charm++ - No plans at this time. Iris - Plans are afoot. Stay tuned. YAKL - Plans are afoot. Stay tuned. Introspection \u00b6 APEX collects data through inspectors . The synchronous data collection uses an event API and event listeners . The API includes events for: Initialize, terminate, thread creation, thread exit added to the HPX thread scheduler added to the OpenMP runtime using the OMPT interface added to the pthread runtime by wrapping the pthread API calls Timer start, stop, yield, resume added to HPX task scheduler added to the OpenMP runtime using the OMPT interface added to the pthread runtime by wrapping the pthread API calls added to the CUDA runtime by subscribing to CUPTI callbacks and asynchronous GPU activity added to the Kokkos runtime by registering for callbacks added to the OpenACC runtime by registering for callbacks Sampled values counters from HPX counters from OpenMP counters from CUPTI Custom events (meta-events) useful for triggering policies Asynchronous data collection does not rely on events, but occurs periodically. APEX exploits access to performance data from lower stack components (i.e. the runtime) or by reading from the RCR blackboard (i.e., power, energy). Other operating system and hardware health data is collected through other interfaces: /proc/stat /proc/cpuinfo /proc/meminfo /proc/net/dev /proc/self/status lm_sensors power measurements counters from NVIDIA Monitoring Library (NVML) PAPI hardware counters and components Event Listeners \u00b6 There are a number of listeners in APEX that are triggered by the events passed in through the API. For example, the Profiling Listener records events related to maintaining the performance state. 
Profiling Listener \u00b6 Start Event: records the name/address of the timer, gets a timestamp (using rdtsc), returns a profiler handle Stop Event: gets a timestamp, optionally puts the profiler object in a queue for back-end processing and returns Sample Event: put the name & value in the queue Internally to APEX, there is an asynchronous consumer thread that processes profiler objects and samples to build a performance profile (in HPX, this thread is processed/scheduled as an HPX thread/task), construct task graphs, and scatterplots of sampled task times. TAU Listener \u00b6 The TAU Listener (used for postmortem analysis) synchronously passes all measurement events to TAU to build an offline profile or trace. TAU will also capture any other events for which it is configured, including MPI, memory, file I/O, etc. Concurrency Tracking \u00b6 The concurrency listener (also used for postmortem analysis) maintains a timeline of total concurrency, periodically sampled from within APEX. Start event: push timer ID on stack Stop event: pop timer ID off stack An asynchronous consumer thread periodically logs the current timer for each thread. This thread will output a concurrency data report and gnuplot script at APEX termination. OTF2 Tracing \u00b6 The OTF2 listener will construct a full event trace and write the events out to an OTF2 archive. OTF2 files can be visualized with tools like Vampir or Traveler . Due to the constraints of OTF2 trace collection, tasks that start on one OS thread and end on another OS thread are not supported. Similarly, tasks/functions that are not perfectly nested are not supported by OTF2 tracing. For those types of tasks, we recommend the Trace Event listener. Google Trace Event Listener \u00b6 The Trace Event listener will construct a full event trace and write the events to one or more Google Trace Event trace files. The files can be visualized with the Google Chrome web browser, by navigating to the https://ui.perfetto.dev URL. 
Policy Listener \u00b6 Policies are rules that decide on outcomes based on observed state. Triggered policies are invoked by introspection API events. Periodic policies are run periodically on an asynchronous thread. Policies are registered with the Policy Engine at program startup by runtime code and/or from the application. Applications, runtimes, and the OS can register callback functions to be executed. Callback functions define the policy rules - \u201cIf x < y then...(take some action!)\u201d. Enables runtime adaptation using introspection data Engages actuators across stack layers Is also used to invoke online auto-tuning support","title":"Feature Overview"},{"location":"feature/#feature_overview","text":"","title":"Feature Overview"},{"location":"feature/#apex_motivation","text":"Frequently, software components or even entire applications run into a situation where the context of the execution environment has changed in some way (or does not meet assumptions). In those situations, the software requires some mechanism for evaluating its own performance and that of the underlying runtime system, operating system and hardware. The types of adaptation that the software wants to do could include: Controlling concurrency to improve energy efficiency for performance Parametric variability adjust the decomposition granularity for this machine / dataset choose a different algorithm for better performance/accuracy choose a different preconditioner for better performance/accuracy choose a different solver for better performance/accuracy Load Balancing when to perform AGAS migration? when to perform repartitioning? when to perform data exchanges? Parallel Algorithms (for_each\u2026) - choose a different execution model separate what from how Address the \u201cSLOW(ER)\u201d performance model avoid S tarvation reduce L atency reduce O verhead reduce W aiting reduce E nergy consumption improve R esiliency APEX provides both performance awareness and performance adaptation . 
APEX provides top-down and bottom-up performance mapping and feedback. APEX exposes node-wide resource utilization data and analysis, energy consumption, and health information in real time Software can subsequently associate performance state with policy for feedback control APEX introspection OS: track system resources, utilization, job contention, overhead Runtime (e.g. HPX, OpenMP, CUDA, OpenACC, Kokkos...): track threads, queues, concurrency, remote operations, parcels, memory management Application timer / counter observation Above: APEX architecture diagram (when linked with an HPX application). The application and runtime send events to the APEX instrumentation API, which updates the performance state. The Policy Engine executes policies that change application behavior based on rule outcomes.","title":"APEX: Motivation"},{"location":"feature/#supported_parallel_models","text":"HPX - APEX is fully integrated into the HPX runtime, so that all tasks that are scheduled by the thread scheduler are measured by APEX. In addition, all HPX counters are captured by APEX. C++ threads ( std::thread , std::async ) and vanilla POSIX threads - Using a pthread_create() wrapper, APEX can capture all spawned threads and measure the time spent in those top level functions. OpenMP - Using the OpenMP 5.0 OMPT interface, APEX can capture performance data related to OpenMP pragmas. OpenACC - Using the OpenACC Profiling interface, APEX can capture performance data related to OpenACC pragmas. Kokkos - Using the Kokkos profiling interface, APEX can capture performance data related to Kokkos parallel abstractions. RAJA - Using the RAJA profiling interface, APEX can capture performance data related to RAJA parallel abstractions. Unlike Kokkos, RAJA doesn't give any details, so don't expect much. CUDA - Using the NVIDIA CUPTI and NVML libraries, APEX can capture runtime and driver API calls as well as memory transfers and kernels executed on a device, and monitor GPU utilization. 
HIP - Using the AMD Roctracer, Rocprofiler and ROCM-SMI libraries, APEX can capture runtime and driver API calls as well as memory transfers and kernels executed on a device, and monitor GPU utilization. Intel SYCL - Using the Intel Level0 libraries, APEX can capture runtime and driver API calls as well as memory transfers and kernels executed on a device, and monitor GPU utilization. PhiProf - APEX is integrated with support to intercept PhiProf profiling data. See https://github.com/fmihpc/phiprof . StarPU - APEX is integrated with support to profile StarPU. See https://starpu.gitlabpages.inria.fr . Distributed Execution over MPI - While APEX doesn't measure all MPI function calls, it is \"MPI-aware\", and can detect when used in a distributed run so that each process can write separate or aggregated performance data. APEX provides rudimentary support for measuring point-to-point and collectives.","title":"Supported Parallel Models"},{"location":"feature/#parallel_models_with_experimental_support_in_development_wish_list","text":"Argobots - APEX has been used to instrument services based on Argobots, but it is not integrated into the runtime. TBB - The APEX team is evaluating integrated TBB support. Legion - No plans at this time. Charm++ - No plans at this time. Iris - Plans are afoot. Stay tuned. YAKL - Plans are afoot. Stay tuned.","title":"Parallel Models with Experimental Support / In Development / Wish List"},{"location":"feature/#introspection","text":"APEX collects data through inspectors . The synchronous data collection uses an event API and event listeners . 
The API includes events for: Initialize, terminate, thread creation, thread exit added to the HPX thread scheduler added to the OpenMP runtime using the OMPT interface added to the pthread runtime by wrapping the pthread API calls Timer start, stop, yield, resume added to HPX task scheduler added to the OpenMP runtime using the OMPT interface added to the pthread runtime by wrapping the pthread API calls added to the CUDA runtime by subscribing to CUPTI callbacks and asynchronous GPU activity added to the Kokkos runtime by registering for callbacks added to the OpenACC runtime by registering for callbacks Sampled values counters from HPX counters from OpenMP counters from CUPTI Custom events (meta-events) useful for triggering policies Asynchronous data collection does not rely on events, but occurs periodically. APEX exploits access to performance data from lower stack components (i.e. the runtime) or by reading from the RCR blackboard (i.e., power, energy). Other operating system and hardware health data is collected through other interfaces: /proc/stat /proc/cpuinfo /proc/meminfo /proc/net/dev /proc/self/status lm_sensors power measurements counters from NVIDIA Monitoring Library (NVML) PAPI hardware counters and components","title":"Introspection"},{"location":"feature/#event_listeners","text":"There are a number of listeners in APEX that are triggered by the events passed in through the API. 
For example, the Profiling Listener records events related to maintaining the performance state.","title":"Event Listeners"},{"location":"feature/#profiling_listener","text":"Start Event: records the name/address of the timer, gets a timestamp (using rdtsc), returns a profiler handle Stop Event: gets a timestamp, optionally puts the profiler object in a queue for back-end processing and returns Sample Event: put the name & value in the queue Internally to APEX, there is an asynchronous consumer thread that processes profiler objects and samples to build a performance profile (in HPX, this thread is processed/scheduled as an HPX thread/task), construct task graphs, and scatterplots of sampled task times.","title":"Profiling Listener"},{"location":"feature/#tau_listener","text":"The TAU Listener (used for postmortem analysis) synchronously passes all measurement events to TAU to build an offline profile or trace. TAU will also capture any other events for which it is configured, including MPI, memory, file I/O, etc.","title":"TAU Listener"},{"location":"feature/#concurrency_tracking","text":"The concurrency listener (also used for postmortem analysis) maintains a timeline of total concurrency, periodically sampled from within APEX. Start event: push timer ID on stack Stop event: pop timer ID off stack An asynchronous consumer thread periodically logs the current timer for each thread. This thread will output a concurrency data report and gnuplot script at APEX termination.","title":"Concurrency Tracking"},{"location":"feature/#otf2_tracing","text":"The OTF2 listener will construct a full event trace and write the events out to an OTF2 archive. OTF2 files can be visualized with tools like Vampir or Traveler . Due to the constraints of OTF2 trace collection, tasks that start on one OS thread and end on another OS thread are not supported. Similarly, tasks/functions that are not perfectly nested are not supported by OTF2 tracing. 
For those types of tasks, we recommend the Trace Event listener.","title":"OTF2 Tracing"},{"location":"feature/#google_trace_event_listener","text":"The Trace Event listener will construct a full event trace and write the events to one or more Google Trace Event trace files. The files can be visualized with the Google Chrome web browser, by navigating to the https://ui.perfetto.dev URL.","title":"Google Trace Event Listener"},{"location":"feature/#policy_listener","text":"Policies are rules that decide on outcomes based on observed state. Triggered policies are invoked by introspection API events. Periodic policies are run periodically on an asynchronous thread. Policies are registered with the Policy Engine at program startup by runtime code and/or from the application. Applications, runtimes, and the OS can register callback functions to be executed. Callback functions define the policy rules - \u201cIf x < y then...(take some action!)\u201d. Enables runtime adaptation using introspection data Engages actuators across stack layers Is also used to involve online auto-tuning support","title":"Policy Listener"},{"location":"hpx5/","text":"Supported Runtime Systems \u00b6 HPX-5 (Indiana University) \u00b6 Note: Support for HPX-5 has stalled since the end of the XPRESS project. These instructions were valid as of ~2017. HPX-5 High Performance ParalleX is a second implementation of the ParalleX model. Developed and maintained by the CREST Group at Indiana University, HPX-5 is implemented in C. For more information, see https://hpx.crest.iu.edu . Configuring HPX-5 with APEX \u00b6 APEX is built as a prerequisite dependency of HPX-5. So, before configuring and building HPX-5, configure and build APEX as a standalone library. 
In addition to the usual required options for CMake, we will also include the options to include Active Harmony (for policies), TAU (for performance analysis - see APEX with TAU for instructions on configuring TAU) and Binutils support, because the HPX-5 instrumentation uses function addresses to identify timers rather than strings. To include Binutils, we can choose one of: use a system-installed binutils by specifying -DUSE_BFD=TRUE use a custom build of Binutils by specifying -DUSE_BFD=TRUE -DBFD_ROOT= have APEX download and build Binutils automatically by specifying -DBUILD_BFD=TRUE . Note: HPX-5 uses JEMalloc, TBB Malloc or DLMalloc, so DO NOT configure APEX with either TCMalloc or JEMalloc. For example, assume TAU is installed in /usr/local/tau/2.25 and we will have CMake download and build Binutils and Active Harmony, and we want to install APEX to /usr/local/xpress-apex/2.3.1. To configure, build and install APEX in the main source directory (your paths may vary): cd $HOME/src wget https://github.com/khuck/xpress-apex/archive/v2.3.1.tar.gz tar -xvzf v2.3.1.tar.gz cd xpress-apex-2.3.1 mkdir build cd build cmake \\ -DBUILD_BFD=TRUE -DCMAKE_INSTALL_PREFIX=/usr/local/xpress-apex/2.3.1 -DCMAKE_BUILD_TYPE=RelWithDebInfo .. make make test # optional make doc # optional make install Keep in mind that APEX will automatically download, configure and build Active Harmony as part of the build process, unless you pass -DUSE_ACTIVEHARMONY=FALSE to the cmake command. 
After the build is complete, add the package configuration path to your PKG_CONFIG_PATH environment variable (HPX-5 uses autotools for configuration so it will find APEX using the utility pkg-config): export PKG_CONFIG_PATH=$PKG_CONFIG_PATH:/usr/local/xpress-apex/2.3.1/lib/pkgconfig To confirm the PKG_CONFIG_PATH variable is set correctly, try executing the pkg-config command: pkg-config --libs apex Which should give the following output (or something similar): -L/usr/local/xpress-apex/2.3.1/lib -L/usr/local/tau/2.25/x86_64/lib -L/usr/local/xpress-apex/2.3.1/lib -lapex -lpthread -lTAUsh-papi-pthread -lharmony -lbfd -liberty -lz -lm -Wl,-rpath,/usr/local/tau/2.25/x86_64/lib,-rpath,/usr/local/xpress-apex/2.3.1/lib -lstdc++ Once APEX is installed, you can configure and build HPX-5 with APEX. To include APEX in the HPX-5 configuration, include the --with-apex=yes option when calling configure. Assuming you have downloaded HPX-5 v.3.0, you would do the following: # go to the HPX source directory cd HPX_Release_v3.0.0/hpx # If you haven't already set the pkgconfig path, do so now... export PKG_CONFIG_PATH=$PKG_CONFIG_PATH:/usr/local/xpress-apex/2.3.1/lib/pkgconfig # configure ./bootstrap ./configure --enable-testsuite --prefix=/home/khuck/src/hpx-iu/hpx-install --with-apex=yes # build! make -j8 # install! 
make install To confirm that HPX-5 was configured and built with APEX correctly, run the simple APEX example: export APEX_SCREEN_OUTPUT=1 ./tests/unit/apex Which should give output similar to this: v0.1-5e4ac87-master Built on: 13:23:34 Dec 17 2015 C++ Language Standard version : 201402 GCC Compiler version : 5.2.1 20151010 APEX_TAU : 0 APEX_POLICY : 1 APEX_MEASURE_CONCURRENCY : 0 APEX_MEASURE_CONCURRENCY_PERIOD : 1000000 APEX_SCREEN_OUTPUT : 1 APEX_PROFILE_OUTPUT : 0 APEX_CSV_OUTPUT : 0 APEX_TASKGRAPH_OUTPUT : 0 APEX_PROC_CPUINFO : 0 APEX_PROC_MEMINFO : 0 APEX_PROC_NET_DEV : 0 APEX_PROC_SELF_STATUS : 0 APEX_PROC_STAT : 1 APEX_THROTTLE_CONCURRENCY : 0 APEX_THROTTLING_MAX_THREADS : 8 APEX_THROTTLING_MIN_THREADS : 1 APEX_THROTTLE_ENERGY : 0 APEX_THROTTLING_MAX_WATTS : 300 APEX_THROTTLING_MIN_WATTS : 150 APEX_PTHREAD_WRAPPER_STACK_SIZE : 0 APEX_PAPI_METRICS : Missing fib number. Using 10. fib(10)=55 seconds: 0.0005629 localities: 1 threads/locality: 8 Info: 34 items remaining on on the profiler_listener queue...done. CPU is 2.66036e+09 Hz. Elapsed time: 0.0364015 Cores detected: 8 Worker Threads observed: 8 Available CPU time: 0.291212 Action : #calls | minimum | mean | maximum | total | stddev | % total ------------------------------------------------------------------------------------------------------------ _fib_main_action [{/home/kh... : 1 --n/a-- 4.52e-04 --n/a-- 4.52e-04 --n/a-- 0.155 _fib_action [{/home/khuck/s... : 177 --n/a-- 4.39e-06 --n/a-- 7.77e-04 --n/a-- 0.267 _locality_stop_handler [{/h... 
: 1 --n/a-- 1.21e-05 --n/a-- 1.21e-05 --n/a-- 0.004 failed steals : 1 0.00e+00 0.00e+00 0.00e+00 0.00e+00 0.00e+00 --n/a-- mail : 1 0.00e+00 0.00e+00 0.00e+00 0.00e+00 0.00e+00 --n/a-- spawns : 1 0.00e+00 0.00e+00 0.00e+00 0.00e+00 0.00e+00 --n/a-- stacks : 1 0.00e+00 0.00e+00 0.00e+00 0.00e+00 0.00e+00 --n/a-- steals : 1 0.00e+00 0.00e+00 0.00e+00 0.00e+00 0.00e+00 --n/a-- yields : 1 0.00e+00 0.00e+00 0.00e+00 0.00e+00 0.00e+00 --n/a-- APEX Idle : --n/a-- --n/a-- --n/a-- --n/a-- 2.90e-01 --n/a-- 99.574 ------------------------------------------------------------------------------------------------------------ Building HPX-5 applications with APEX \u00b6 APEX will automatically be included in the link when HPX-5 applications are built. To build an example, go to the hpx-apps directory and build the LULESH parcels example: cd hpx-apps/lulesh/parcels # assuming HPX-5 is installed in /usr/local/hpx/3.0, set the pkgconfig path export PKG_CONFIG_PATH=$PKG_CONFIG_PATH:/usr/local/hpx/3.0/lib/pkgconfig # configure ./bootstrap ./configure # make! 
make Then, to run the LULESH example: export APEX_SCREEN_OUTPUT=1 ./luleshparcels -n 8 -x 24 -i 100 --hpx-threads=8 Should give the following output (or similar): v0.1-907c977-master Built on: 09:50:08 Dec 23 2015 C++ Language Standard version : 201402 GCC Compiler version : 5.2.1 20151010 APEX_TAU : 0 APEX_POLICY : 1 APEX_MEASURE_CONCURRENCY : 0 APEX_MEASURE_CONCURRENCY_PERIOD : 1000000 APEX_SCREEN_OUTPUT : 1 APEX_PROFILE_OUTPUT : 0 APEX_CSV_OUTPUT : 0 APEX_TASKGRAPH_OUTPUT : 0 APEX_PROC_CPUINFO : 0 APEX_PROC_MEMINFO : 0 APEX_PROC_NET_DEV : 0 APEX_PROC_SELF_STATUS : 0 APEX_PROC_STAT : 1 APEX_THROTTLE_CONCURRENCY : 0 APEX_THROTTLING_MAX_THREADS : 8 APEX_THROTTLING_MIN_THREADS : 1 APEX_THROTTLE_ENERGY : 0 APEX_THROTTLING_MAX_WATTS : 300 APEX_THROTTLING_MIN_WATTS : 150 APEX_PTHREAD_WRAPPER_STACK_SIZE : 0 APEX_PAPI_METRICS : Number of domains: 8 nx: 24 maxcycles: 100 core-major ordering: 1 START_LOG PROGNAME: lulesh-parcels Elapsed time = 1.255209e+01 Run completed: Problem size = 24 Iteration count = 100 Final Origin Energy = 4.739209e+06 Testing plane 0 of energy array: MaxAbsDiff = 9.313226e-10 TotalAbsDiff = 2.841568e-09 MaxRelDiff = 2.946213e-12 END_LOG time_in_SBN3 = 4.570989e-01 time_in_PosVel = 2.182410e-01 time_in_MonoQ = 4.889381e+00 Elapsed: 12599.4 CPU is 2.66028e+09 Hz. Elapsed time: 12.6192 Cores detected: 8 Worker Threads observed: 8 Available CPU time: 100.953 Action : #calls | minimum | mean | maximum | total | stddev | % total ------------------------------------------------------------------------------------------------------------ _advanceDomain_action [{/ho... : 8 --n/a-- 1.17e+01 --n/a-- 9.34e+01 --n/a-- 92.506 _initDomain_action [{/home/... : 8 --n/a-- 2.04e-02 --n/a-- 1.63e-01 --n/a-- 0.162 _finiDomain_action [{/home/... : 8 --n/a-- 2.81e-03 --n/a-- 2.25e-02 --n/a-- 0.022 _main_action [{/home/khuck/... : 1 --n/a-- 4.73e-03 --n/a-- 4.73e-03 --n/a-- 0.005 _SBN1_result_action [{/home... 
: 56 --n/a-- 1.42e-03 --n/a-- 7.93e-02 --n/a-- 0.079 _SBN1_sends_action [{/home/... : 56 --n/a-- 1.87e-04 --n/a-- 1.05e-02 --n/a-- 0.010 _SBN3_result_action [{/home... : 5600 --n/a-- 1.33e-04 --n/a-- 7.45e-01 --n/a-- 0.738 _SBN3_sends_action [{/home/... : 5600 --n/a-- 9.05e-05 --n/a-- 5.07e-01 --n/a-- 0.502 _PosVel_result_action [{/ho... : 2800 --n/a-- 1.61e-04 --n/a-- 4.50e-01 --n/a-- 0.445 _PosVel_sends_action [{/hom... : 2800 --n/a-- 1.43e-04 --n/a-- 4.00e-01 --n/a-- 0.396 _MonoQ_result_action [{/hom... : 2400 --n/a-- 1.03e-04 --n/a-- 2.47e-01 --n/a-- 0.245 _MonoQ_sends_action [{/home... : 2400 --n/a-- 1.79e-04 --n/a-- 4.29e-01 --n/a-- 0.425 _locality_stop_handler [{/h... : 1 --n/a-- 2.45e-04 --n/a-- 2.45e-04 --n/a-- 0.000 _allreduce_init_handler [{/... : 2 --n/a-- 5.49e-04 --n/a-- 1.10e-03 --n/a-- 0.001 _allreduce_fini_handler [{/... : 2 --n/a-- 2.44e-04 --n/a-- 4.89e-04 --n/a-- 0.000 _allreduce_add_handler [{/h... : 9 --n/a-- 6.74e-05 --n/a-- 6.07e-04 --n/a-- 0.001 _allreduce_remove_handler [... : 9 --n/a-- 4.31e-05 --n/a-- 3.88e-04 --n/a-- 0.000 _allreduce_join_handler [{/... : 99 --n/a-- 4.90e-05 --n/a-- 4.86e-03 --n/a-- 0.005 _allreduce_bcast_handler [{... 
: 99 --n/a-- 2.75e-05 --n/a-- 2.72e-03 --n/a-- 0.003 CPU Guest % : 12 0.000 0.000 0.000 0.000 0.000 --n/a-- CPU I/O Wait % : 12 0.000 0.000 0.000 0.000 0.000 --n/a-- CPU IRQ % : 12 0.000 0.000 0.000 0.000 0.000 --n/a-- CPU Idle % : 12 0.000 0.789 8.429 9.464 2.305 --n/a-- CPU Nice % : 12 0.000 0.000 0.000 0.000 0.000 --n/a-- CPU Steal % : 12 0.000 0.000 0.000 0.000 0.000 --n/a-- CPU System % : 12 21.000 22.387 24.286 268.643 0.941 --n/a-- CPU User % : 12 77.500 80.426 89.714 965.107 4.315 --n/a-- CPU soft IRQ % : 12 0.000 0.010 0.125 0.125 0.035 --n/a-- failed steals : 1 0.00e+00 0.00e+00 0.00e+00 0.00e+00 0.00e+00 --n/a-- mail : 1 0.00e+00 0.00e+00 0.00e+00 0.00e+00 0.00e+00 --n/a-- spawns : 1 0.00e+00 0.00e+00 0.00e+00 0.00e+00 0.00e+00 --n/a-- stacks : 1 0.00e+00 0.00e+00 0.00e+00 0.00e+00 0.00e+00 --n/a-- steals : 1 0.00e+00 0.00e+00 0.00e+00 0.00e+00 0.00e+00 --n/a-- yields : 1 0.00e+00 0.00e+00 0.00e+00 0.00e+00 0.00e+00 --n/a-- APEX Idle : --n/a-- --n/a-- --n/a-- --n/a-- 4.50e+00 --n/a-- 4.455 ------------------------------------------------------------------------------------------------------------ To enable TAU profiling, set the APEX_TAU environment variable to 1. We will also set some other TAU environment variables and re-run the program: export APEX_TAU=1 export TAU_PROFILE_FORMAT=merged export TAU_SAMPLING=1 ./luleshparcels -n 8 -x 24 -i 100 --hpx-threads=8 The \"merged\" profile setting will create a single file (tauprofile.xml) for the whole application, rather than a profile.\\* file for each thread. The sampling flag will enable periodic interruption of the application to get a more detailed profile. After execution, there is a TAU profile file called \"tauprofile.xml\". To view the results of the profiling, run the ParaProf application on the profile (assuming the TAU utilities are in your path): paraprof tauprofile.xml Which should result in a profile like the following: Above: ParaProf main profiler window showing all threads of execution. 
Above: ParaProf main profiler window showing one thread of execution. Above: ParaProf main profiler window showing one thread of execution, in a callgraph view. For more information on using TAU with APEX, see APEX with TAU .","title":"Supported Runtime Systems"},{"location":"hpx5/#supported_runtime_systems","text":"","title":"Supported Runtime Systems"},{"location":"hpx5/#hpx-5_indiana_university","text":"Note: Support for HPX-5 has stalled since the end of the XPRESS project. These instructions were valid as of ~2017. HPX-5 High Performance ParalleX is a second implementation of the ParalleX model. Developed and maintained by the CREST Group at Indiana University, HPX-5 is implemented in C. For more information, see https://hpx.crest.iu.edu .","title":"HPX-5 (Indiana University)"},{"location":"hpx5/#configuring_hpx-5_with_apex","text":"APEX is built as a prerequisite dependency of HPX-5. So, before configuring and building HPX-5, configure and build APEX as a standalone library. In addition to the usual required options for CMake, we will also include the options to include Active Harmony (for policies), TAU (for performance analysis - see APEX with TAU for instructions on configuring TAU) and Binutils support, because the HPX-5 instrumentation uses function addresses to identify timers rather than strings. To include Binutils, we can choose one of: use a system-installed binutils by specifying -DUSE_BFD=TRUE use a custom build of Binutils by specifying -DUSE_BFD=TRUE -DBFD_ROOT= have APEX download and build Binutils automatically by specifying -DBUILD_BFD=TRUE . Note: HPX-5 uses JEMalloc, TBB Malloc or DLMalloc, so DO NOT configure APEX with either TCMalloc or JEMalloc. For example, assume TAU is installed in /usr/local/tau/2.25 and we will have CMake download and build Binutils and Active Harmony, and we want to install APEX to /usr/local/xpress-apex/2.3.1. 
To configure, build and install APEX in the main source directory (your paths may vary): cd $HOME/src wget https://github.com/khuck/xpress-apex/archive/v2.3.1.tar.gz tar -xvzf v2.3.1.tar.gz cd xpress-apex-2.3.1 mkdir build cd build cmake \\ -DBUILD_BFD=TRUE -DCMAKE_INSTALL_PREFIX=/usr/local/xpress-apex/2.3.1 -DCMAKE_BUILD_TYPE=RelWithDebInfo .. make make test # optional make doc # optional make install Keep in mind that APEX will automatically download, configure and build Active Harmony as part of the build process, unless you pass -DUSE_ACTIVEHARMONY=FALSE to the cmake command. After the build is complete, add the package configuration path to your PKG_CONFIG_PATH environment variable (HPX-5 uses autotools for configuration so it will find APEX using the utility pkg-config): export PKG_CONFIG_PATH=$PKG_CONFIG_PATH:/usr/local/xpress-apex/2.3.1/lib/pkgconfig To confirm the PKG_CONFIG_PATH variable is set correctly, try executing the pkg-config command: pkg-config --libs apex Which should give the following output (or something similar): -L/usr/local/xpress-apex/2.3.1/lib -L/usr/local/tau/2.25/x86_64/lib -L/usr/local/xpress-apex/2.3.1/lib -lapex -lpthread -lTAUsh-papi-pthread -lharmony -lbfd -liberty -lz -lm -Wl,-rpath,/usr/local/tau/2.25/x86_64/lib,-rpath,/usr/local/xpress-apex/2.3.1/lib -lstdc++ Once APEX is installed, you can configure and build HPX-5 with APEX. To include APEX in the HPX-5 configuration, include the --with-apex=yes option when calling configure. Assuming you have downloaded HPX-5 v.3.0, you would do the following: # go to the HPX source directory cd HPX_Release_v3.0.0/hpx # If you haven't already set the pkgconfig path, do so now... export PKG_CONFIG_PATH=$PKG_CONFIG_PATH:/usr/local/xpress-apex/2.3.1/lib/pkgconfig # configure ./bootstrap ./configure --enable-testsuite --prefix=/home/khuck/src/hpx-iu/hpx-install --with-apex=yes # build! make -j8 # install! 
make install To confirm that HPX-5 was configured and built with APEX correctly, run the simple APEX example: export APEX_SCREEN_OUTPUT=1 ./tests/unit/apex Which should give output similar to this: v0.1-5e4ac87-master Built on: 13:23:34 Dec 17 2015 C++ Language Standard version : 201402 GCC Compiler version : 5.2.1 20151010 APEX_TAU : 0 APEX_POLICY : 1 APEX_MEASURE_CONCURRENCY : 0 APEX_MEASURE_CONCURRENCY_PERIOD : 1000000 APEX_SCREEN_OUTPUT : 1 APEX_PROFILE_OUTPUT : 0 APEX_CSV_OUTPUT : 0 APEX_TASKGRAPH_OUTPUT : 0 APEX_PROC_CPUINFO : 0 APEX_PROC_MEMINFO : 0 APEX_PROC_NET_DEV : 0 APEX_PROC_SELF_STATUS : 0 APEX_PROC_STAT : 1 APEX_THROTTLE_CONCURRENCY : 0 APEX_THROTTLING_MAX_THREADS : 8 APEX_THROTTLING_MIN_THREADS : 1 APEX_THROTTLE_ENERGY : 0 APEX_THROTTLING_MAX_WATTS : 300 APEX_THROTTLING_MIN_WATTS : 150 APEX_PTHREAD_WRAPPER_STACK_SIZE : 0 APEX_PAPI_METRICS : Missing fib number. Using 10. fib(10)=55 seconds: 0.0005629 localities: 1 threads/locality: 8 Info: 34 items remaining on on the profiler_listener queue...done. CPU is 2.66036e+09 Hz. Elapsed time: 0.0364015 Cores detected: 8 Worker Threads observed: 8 Available CPU time: 0.291212 Action : #calls | minimum | mean | maximum | total | stddev | % total ------------------------------------------------------------------------------------------------------------ _fib_main_action [{/home/kh... : 1 --n/a-- 4.52e-04 --n/a-- 4.52e-04 --n/a-- 0.155 _fib_action [{/home/khuck/s... : 177 --n/a-- 4.39e-06 --n/a-- 7.77e-04 --n/a-- 0.267 _locality_stop_handler [{/h... 
: 1 --n/a-- 1.21e-05 --n/a-- 1.21e-05 --n/a-- 0.004 failed steals : 1 0.00e+00 0.00e+00 0.00e+00 0.00e+00 0.00e+00 --n/a-- mail : 1 0.00e+00 0.00e+00 0.00e+00 0.00e+00 0.00e+00 --n/a-- spawns : 1 0.00e+00 0.00e+00 0.00e+00 0.00e+00 0.00e+00 --n/a-- stacks : 1 0.00e+00 0.00e+00 0.00e+00 0.00e+00 0.00e+00 --n/a-- steals : 1 0.00e+00 0.00e+00 0.00e+00 0.00e+00 0.00e+00 --n/a-- yields : 1 0.00e+00 0.00e+00 0.00e+00 0.00e+00 0.00e+00 --n/a-- APEX Idle : --n/a-- --n/a-- --n/a-- --n/a-- 2.90e-01 --n/a-- 99.574 ------------------------------------------------------------------------------------------------------------","title":"Configuring HPX-5 with APEX"},{"location":"hpx5/#building_hpx-5_applications_with_apex","text":"APEX will automatically be included in the link when HPX-5 applications are built. To build an example, go to the hpx-apps directory and build the LULESH parcels example: cd hpx-apps/lulesh/parcels # assuming HPX-5 is installed in /usr/local/hpx/3.0, set the pkgconfig path export PKG_CONFIG_PATH=$PKG_CONFIG_PATH:/usr/local/hpx/3.0/lib/pkgconfig # configure ./bootstrap ./configure # make! 
make Then, to run the LULESH example: export APEX_SCREEN_OUTPUT=1 ./luleshparcels -n 8 -x 24 -i 100 --hpx-threads=8 Should give the following output (or similar): v0.1-907c977-master Built on: 09:50:08 Dec 23 2015 C++ Language Standard version : 201402 GCC Compiler version : 5.2.1 20151010 APEX_TAU : 0 APEX_POLICY : 1 APEX_MEASURE_CONCURRENCY : 0 APEX_MEASURE_CONCURRENCY_PERIOD : 1000000 APEX_SCREEN_OUTPUT : 1 APEX_PROFILE_OUTPUT : 0 APEX_CSV_OUTPUT : 0 APEX_TASKGRAPH_OUTPUT : 0 APEX_PROC_CPUINFO : 0 APEX_PROC_MEMINFO : 0 APEX_PROC_NET_DEV : 0 APEX_PROC_SELF_STATUS : 0 APEX_PROC_STAT : 1 APEX_THROTTLE_CONCURRENCY : 0 APEX_THROTTLING_MAX_THREADS : 8 APEX_THROTTLING_MIN_THREADS : 1 APEX_THROTTLE_ENERGY : 0 APEX_THROTTLING_MAX_WATTS : 300 APEX_THROTTLING_MIN_WATTS : 150 APEX_PTHREAD_WRAPPER_STACK_SIZE : 0 APEX_PAPI_METRICS : Number of domains: 8 nx: 24 maxcycles: 100 core-major ordering: 1 START_LOG PROGNAME: lulesh-parcels Elapsed time = 1.255209e+01 Run completed: Problem size = 24 Iteration count = 100 Final Origin Energy = 4.739209e+06 Testing plane 0 of energy array: MaxAbsDiff = 9.313226e-10 TotalAbsDiff = 2.841568e-09 MaxRelDiff = 2.946213e-12 END_LOG time_in_SBN3 = 4.570989e-01 time_in_PosVel = 2.182410e-01 time_in_MonoQ = 4.889381e+00 Elapsed: 12599.4 CPU is 2.66028e+09 Hz. Elapsed time: 12.6192 Cores detected: 8 Worker Threads observed: 8 Available CPU time: 100.953 Action : #calls | minimum | mean | maximum | total | stddev | % total ------------------------------------------------------------------------------------------------------------ _advanceDomain_action [{/ho... : 8 --n/a-- 1.17e+01 --n/a-- 9.34e+01 --n/a-- 92.506 _initDomain_action [{/home/... : 8 --n/a-- 2.04e-02 --n/a-- 1.63e-01 --n/a-- 0.162 _finiDomain_action [{/home/... : 8 --n/a-- 2.81e-03 --n/a-- 2.25e-02 --n/a-- 0.022 _main_action [{/home/khuck/... : 1 --n/a-- 4.73e-03 --n/a-- 4.73e-03 --n/a-- 0.005 _SBN1_result_action [{/home... 
: 56 --n/a-- 1.42e-03 --n/a-- 7.93e-02 --n/a-- 0.079 _SBN1_sends_action [{/home/... : 56 --n/a-- 1.87e-04 --n/a-- 1.05e-02 --n/a-- 0.010 _SBN3_result_action [{/home... : 5600 --n/a-- 1.33e-04 --n/a-- 7.45e-01 --n/a-- 0.738 _SBN3_sends_action [{/home/... : 5600 --n/a-- 9.05e-05 --n/a-- 5.07e-01 --n/a-- 0.502 _PosVel_result_action [{/ho... : 2800 --n/a-- 1.61e-04 --n/a-- 4.50e-01 --n/a-- 0.445 _PosVel_sends_action [{/hom... : 2800 --n/a-- 1.43e-04 --n/a-- 4.00e-01 --n/a-- 0.396 _MonoQ_result_action [{/hom... : 2400 --n/a-- 1.03e-04 --n/a-- 2.47e-01 --n/a-- 0.245 _MonoQ_sends_action [{/home... : 2400 --n/a-- 1.79e-04 --n/a-- 4.29e-01 --n/a-- 0.425 _locality_stop_handler [{/h... : 1 --n/a-- 2.45e-04 --n/a-- 2.45e-04 --n/a-- 0.000 _allreduce_init_handler [{/... : 2 --n/a-- 5.49e-04 --n/a-- 1.10e-03 --n/a-- 0.001 _allreduce_fini_handler [{/... : 2 --n/a-- 2.44e-04 --n/a-- 4.89e-04 --n/a-- 0.000 _allreduce_add_handler [{/h... : 9 --n/a-- 6.74e-05 --n/a-- 6.07e-04 --n/a-- 0.001 _allreduce_remove_handler [... : 9 --n/a-- 4.31e-05 --n/a-- 3.88e-04 --n/a-- 0.000 _allreduce_join_handler [{/... : 99 --n/a-- 4.90e-05 --n/a-- 4.86e-03 --n/a-- 0.005 _allreduce_bcast_handler [{... 
: 99 --n/a-- 2.75e-05 --n/a-- 2.72e-03 --n/a-- 0.003 CPU Guest % : 12 0.000 0.000 0.000 0.000 0.000 --n/a-- CPU I/O Wait % : 12 0.000 0.000 0.000 0.000 0.000 --n/a-- CPU IRQ % : 12 0.000 0.000 0.000 0.000 0.000 --n/a-- CPU Idle % : 12 0.000 0.789 8.429 9.464 2.305 --n/a-- CPU Nice % : 12 0.000 0.000 0.000 0.000 0.000 --n/a-- CPU Steal % : 12 0.000 0.000 0.000 0.000 0.000 --n/a-- CPU System % : 12 21.000 22.387 24.286 268.643 0.941 --n/a-- CPU User % : 12 77.500 80.426 89.714 965.107 4.315 --n/a-- CPU soft IRQ % : 12 0.000 0.010 0.125 0.125 0.035 --n/a-- failed steals : 1 0.00e+00 0.00e+00 0.00e+00 0.00e+00 0.00e+00 --n/a-- mail : 1 0.00e+00 0.00e+00 0.00e+00 0.00e+00 0.00e+00 --n/a-- spawns : 1 0.00e+00 0.00e+00 0.00e+00 0.00e+00 0.00e+00 --n/a-- stacks : 1 0.00e+00 0.00e+00 0.00e+00 0.00e+00 0.00e+00 --n/a-- steals : 1 0.00e+00 0.00e+00 0.00e+00 0.00e+00 0.00e+00 --n/a-- yields : 1 0.00e+00 0.00e+00 0.00e+00 0.00e+00 0.00e+00 --n/a-- APEX Idle : --n/a-- --n/a-- --n/a-- --n/a-- 4.50e+00 --n/a-- 4.455 ------------------------------------------------------------------------------------------------------------ To enable TAU profiling, set the APEX_TAU environment variable to 1. We will also set some other TAU environment variables and re-run the program: export APEX_TAU=1 export TAU_PROFILE_FORMAT=merged export TAU_SAMPLING=1 ./luleshparcels -n 8 -x 24 -i 100 --hpx-threads=8 The \"merged\" profile setting will create a single file (tauprofile.xml) for the whole application, rather than a profile.\\* file for each thread. The sampling flag will enable periodic interruption of the application to get a more detailed profile. After execution, there is a TAU profile file called \"tauprofile.xml\". To view the results of the profiling, run the ParaProf application on the profile (assuming the TAU utilities are in your path): paraprof tauprofile.xml Which should result in a profile like the following: Above: ParaProf main profiler window showing all threads of execution. 
Above: ParaProf main profiler window showing one thread of execution. Above: ParaProf main profiler window showing one thread of execution, in a callgraph view. For more information on using TAU with APEX, see APEX with TAU .","title":"Building HPX-5 applications with APEX"},{"location":"install/","text":"Installing APEX \u00b6 Installation with HPX \u00b6 APEX is integrated into the HPX runtime , and is integrated into the HPX build system. To enable APEX measurement with HPX, enable the following CMake flags: -DHPX_WITH_APEX=TRUE The -DHPX_WITH_APEX_TAG=develop can be used to indicate a specific release version of APEX, or to use a specific GitHub branch of APEX. We recommend using the default configured version that comes with HPX (currently v2.6.5 ) or the develop branch. Additional CMake flags include: -DAPEX_WITH_LM_SENSORS=TRUE to enable LM sensors support (assumed to be installed in default system paths) -DAPEX_WITH_PAPI=TRUE and -DPAPI_ROOT=... to enable PAPI support -DAPEX_WITH_BFD=TRUE and -DBFD_ROOT=... or -DAPEX_BUILD_BFD=TRUE to enable Binutils support for converting function/lambda/instruction pointers to human-readable code regions. For demangling of C++ symbols, demangle.h needs to be installed with the binutils headers (not typical in system installations). -DAPEX_WITH_MSR=TRUE to enable libmsr support for RAPL power measurement (typically not needed, as RAPL support is natively handled where available) -DAPEX_WITH_OTF2=TRUE and -DOTF2_ROOT=... to enable OTF2 tracing support -DHPX_WITH_HPXMP=TRUE to enable HPX OpenMP support and OMPT measurement support from APEX -DAPEX_WITH_ACTIVEHARMONY=TRUE and -DACTIVEHARMONY_ROOT=... to enable Active Harmony support -DAPEX_WITH_CUDA=TRUE to enable CUPTI and/or NVML support. Examples require a working nvcc compiler in your path. Standalone Installation \u00b6 APEX is open source, and available on Github at http://github.com/UO-OACISS/apex . 
For stability, most users will want to download the most recent release of APEX (for example, v2.6.5): wget https://github.com/UO-OACISS/apex/archive/refs/tags/v2.6.5.tar.gz tar -xvzf v2.6.5.tar.gz cd apex-2.6.5 Other users may want to work with the most recent code available, in which case you can clone the git repo: git clone https://github.com/UO-OACISS/apex.git cd apex Configuring and building APEX with Spack \u00b6 APEX can be installed with the Spack package management tool . See spack info apex for details. You should see something like this: CMakePackage: apex Description: Autonomic Performance Environment for eXascale (APEX). Homepage: https://uo-oaciss.github.io/apex Preferred version: 2.6.3 https://github.com/UO-OACISS/apex/archive/v2.6.3.tar.gz Safe versions: develop [git] https://github.com/UO-OACISS/apex on branch develop master [git] https://github.com/UO-OACISS/apex on branch master 2.6.3 https://github.com/UO-OACISS/apex/archive/v2.6.3.tar.gz 2.6.2 https://github.com/UO-OACISS/apex/archive/v2.6.2.tar.gz 2.6.1 https://github.com/UO-OACISS/apex/archive/v2.6.1.tar.gz 2.6.0 https://github.com/UO-OACISS/apex/archive/v2.6.0.tar.gz Deprecated versions: 2.5.1 https://github.com/UO-OACISS/apex/archive/v2.5.1.tar.gz 2.5.0 https://github.com/UO-OACISS/apex/archive/v2.5.0.tar.gz 2.4.1 https://github.com/UO-OACISS/apex/archive/v2.4.1.tar.gz 2.4.0 https://github.com/UO-OACISS/apex/archive/v2.4.0.tar.gz 2.3.2 https://github.com/UO-OACISS/apex/archive/v2.3.2.tar.gz 2.3.1 https://github.com/UO-OACISS/apex/archive/v2.3.1.tar.gz 2.3.0 https://github.com/UO-OACISS/apex/archive/v2.3.0.tar.gz 2.2.0 https://github.com/UO-OACISS/apex/archive/v2.2.0.tar.gz Variants: activeharmony [true] false, true Enables Active Harmony support binutils [false] false, true Enables Binutils support boost [false] false, true Enables Boost support build_system [cmake] cmake Build systems supported by the package cuda [false] false, true Enables CUDA support examples [false] false, true Build 
Examples gperftools [false] false, true Enables Google PerfTools TCMalloc support hip [false] false, true Enables ROCm/HIP support jemalloc [false] false, true Enables JEMalloc support lmsensors [false] false, true Enables LM-Sensors support mpi [false] false, true Enables MPI support openmp [false] false, true Enables OpenMP support otf2 [true] false, true Enables OTF2 support papi [false] false, true Enables PAPI support plugins [true] false, true Enables Policy Plugin support sycl [false] false, true Enables Intel SYCL support (Level0) tests [false] false, true Build Unit Tests when build_system=cmake build_type [Release] Debug, MinSizeRel, RelWithDebInfo, Release CMake build type generator [make] none the build system generator to use when build_system=cmake ^cmake@3.9: ipo [false] false, true CMake interprocedural optimization Build Dependencies: activeharmony boost cuda gmake hip lm-sensors ninja papi roctracer-dev zlib-api binutils cmake gettext gperftools jemalloc mpi otf2 rocm-smi-lib sycl Link Dependencies: activeharmony binutils boost cuda gettext gperftools hip jemalloc lm-sensors mpi otf2 papi rocm-smi-lib roctracer-dev sycl zlib-api Run Dependencies: None Licenses: None Configuring and building APEX with CMake \u00b6 APEX is built with CMake. The minimum CMake settings needed for APEX are: -DCMAKE_INSTALL_PREFIX=... some path to an installation location -DCMAKE_BUILD_TYPE=... one of Release, Debug, or RelWithDebInfo (Release recommended) The process for building APEX is: 1) Get the code (see above) 2) Enter the repo directory: cd apex-2.6.5 3) configure using CMake: cmake -B build -DCMAKE_INSTALL_PREFIX= -DCMAKE_BUILD_TYPE=RelWithDebInfo . 4) build with cmake: cmake --build build # Run tests, if desired ctest --test-dir build # Build documentation, if desired cd build ; make doc ; cd .. 
# Install, if desired cmake --install build Other CMake settings, depending on your needs/wants \u00b6 Note 1: The recommended packages include: Active Harmony - for autotuning policies (optional, no longer recommended) OMPT - if OpenMP support is required ( See the OpenMP use case for an example) and your compiler supports OpenMP-Tools. note: GCC does not support OpenMP-Tools, and has no plans to as of January 2024. Compilers known to support OMPT include Clang/LLVM, Intel, NVIDIA, AMD Clang. Binutils/BFD - if your runtime/application uses instruction addresses to identify timers, e.g. OpenMP, CUDA, HIP, OneAPI, OpenACC, etc. PAPI - if you want hardware counter support ( See the PAPI use case for an example) JEMalloc/TCMalloc - if your application is not already using a heap manager - see Note 2, below CUDA - if your application uses CUDA, APEX will use CUPTI/NVML to measure GPU activity ROCM - if your application uses HIP/ROCm, APEX will use Rocprofiler/Roctracer/ROC-SMI to measure GPU activity OneAPI - if your application uses Intel SYCL, APEX will use OneAPI/LevelZero to measure GPU activity Note 2: TCMalloc or JEMalloc will potentially speed up memory allocations significantly in APEX (and in your application). HOWEVER, If your application already uses TCMalloc, JEMalloc or TBBMalloc, DO NOT configure APEX with TCMalloc or JEMalloc. They will be included at application link time, and may conflict with the version detected by and linked into APEX. If you got some kind of tcmalloc crash/error at startup, please preload the dependent tcmalloc shared object library with '--apex:preload /path/to/libtcmalloc.so'. There are several utility libraries that provide additional functionality in APEX. Not all libraries are required, but some are recommended. For the following options, the default values are in italics . -DAPEX_BUILD_EXAMPLES= TRUE or FALSE . Whether or not to build the application examples in APEX. -DAPEX_BUILD_TESTS= TRUE or FALSE . 
Whether or not to build the APEX unit tests. -DAPEX_WITH_ACTIVEHARMONY= TRUE or FALSE . Active Harmony is a library that intelligently searches for parametric combinations to support adapting to heterogeneous and changing environments. For more information, see http://www.dyninst.org/harmony . APEX uses Active Harmony for runtime adaptation. -DACTIVEHARMONY_ROOT= the path to Active Harmony, or set the ACTIVEHARMONY_ROOT environment variable before running cmake. It should be noted that if Active Harmony is not specified and -DAPEX_WITH_ACTIVEHARMONY is TRUE or not set, APEX will download and build Active Harmony as a CMake project. To disable Active Harmony entirely, specify -DAPEX_WITH_ACTIVEHARMONY=FALSE. -DAPEX_BUILD_ACTIVEHARMONY= TRUE or FALSE . Whether or not Active Harmony is installed on the system, this option forces CMake to automatically download and build Active Harmony as part of the APEX project. -DAPEX_WITH_BFD= TRUE or FALSE . APEX uses libbfd (Binutils) to convert instruction addresses to source code locations. BFD support is useful for generating human-readable output for summaries and concurrency graphs. Libbfd is not required for runtime adaptation. For more information, see https://www.gnu.org/software/binutils/ . -DBFD_ROOT= path to Binutils, or set the BFD_ROOT environment variable. -DAPEX_BUILD_BFD= TRUE or FALSE . Whether or not binutils is found by CMake, this option forces CMake to automatically download and build binutils as part of the APEX project. -DAPEX_WITH_CUDA= TRUE or FALSE . APEX uses CUPTI to measure CUDA kernels and API calls, and/or NVML support to monitor the GPU activity passively. -DCUDAToolkit_ROOT= the path to the CUDA installation, if necessary. -DAPEX_WITH_HIP= TRUE or FALSE . APEX uses Rocprofiler and Roctracer to measure HIP kernels and API calls, and/or ROCM-SMI support to monitor the GPU activity passively. -DROCM_ROOT= the path to the ROCm installation, if necessary. -DAPEX_WITH_KOKKOS= TRUE or FALSE. 
-DKokkos_ROOT= the path to the Kokkos installation, if necessary. APEX will grab Kokkos as a submodule if not found; only the headers are needed. -DAPEX_WITH_JEMALLOC= TRUE or FALSE . JEMalloc is a heap management library. For more information, see http://www.canonware.com/jemalloc/ . JEMalloc provides faster memory performance in multithreaded environments. -DJEMALLOC_ROOT= path to JEMalloc, or set the JEMALLOC_ROOT environment variable before running cmake. -DAPEX_WITH_LEVEL0= TRUE or FALSE . APEX uses Level0 to measure Intel SYCL kernels and API calls and to monitor the GPU activity passively. -DAPEX_WITH_LM_SENSORS= TRUE or FALSE . Lm_sensors (Linux Monitoring Sensors) is a library for monitoring hardware temperatures and fan speeds. For more information, see https://en.wikipedia.org/wiki/Lm_sensors . APEX uses lm_sensors to monitor hardware, where available. -DAPEX_WITH_MPI= TRUE or FALSE . Whether to build MPI global support and related examples. -DAPEX_WITH_OMPT= TRUE or FALSE . OMP-Tools is the 5.0+ standard for OpenMP runtimes to provide callback hooks to performance tools. For more information, see the OpenMP specification v5.0 or newer. APEX has support for most OMPT OpenMP trace events. See the OpenMP use case for an example. Some compilers (Clang 10+, Intel 19+, IBM XL 16+) include OMPT support already, and APEX will use the built-in support. -DAPEX_WITH_OTF2= TRUE or FALSE . Used to enable OTF2 tracing support for the Vampir trace visualization tool. -DOTF2_ROOT= path to an OTF2 installation. -DAPEX_BUILD_OTF2= TRUE or FALSE . If OTF2 is not found by CMake, this option forces CMake to automatically download and build OTF2 as part of the APEX project. -DAPEX_WITH_PAPI= TRUE or FALSE . PAPI (Performance Application Programming Interface) provides the tool designer and application engineer with a consistent interface and methodology for use of the performance counter hardware found in most major microprocessors. 
For more information, see http://icl.cs.utk.edu/papi/ . APEX uses PAPI to optionally collect hardware counters for timed events. -DPAPI_ROOT= some path to PAPI, or set the PAPI_ROOT environment variable before running cmake. See the PAPI use case for an example. -DAPEX_WITH_PERFETTO= TRUE or FALSE . Enables native Perfetto trace support, increases build/link time significantly. Only used if you want native Perfetto output support, otherwise APEX will write compressed JSON output of the same data (which is actually smaller than the binary native format). -DAPEX_WITH_PLUGINS= TRUE or FALSE. Enables APEX policy plugin support. -DAPEX_WITH_TCMALLOC= TRUE or FALSE . TCMalloc is a heap management library distributed as part of Google perftools. For more information, see https://github.com/gperftools/gperftools . TCMalloc provides faster memory performance in multithreaded environments. -DGPERFTOOLS_ROOT= path to gperftools (TCMalloc), or set the GPERFTOOLS_ROOT environment variable before running cmake. Other CMake variables of interest \u00b6 For any others not listed, see https://github.com/UO-OACISS/apex/blob/develop/cmake/Modules/APEX_DefaultOptions.cmake","title":"Installation"},{"location":"install/#installing_apex","text":"","title":"Installing APEX"},{"location":"install/#installation_with_hpx","text":"APEX is integrated into the HPX runtime , and is integrated into the HPX build system. To enable APEX measurement with HPX, enable the following CMake flags: -DHPX_WITH_APEX=TRUE The -DHPX_WITH_APEX_TAG=develop can be used to indicate a specific release version of APEX, or to use a specific GitHub branch of APEX. We recommend using the default configured version that comes with HPX (currently v2.6.5 ) or the develop branch. Additional CMake flags include: -DAPEX_WITH_LM_SENSORS=TRUE to enable LM sensors support (assumed to be installed in default system paths) -DAPEX_WITH_PAPI=TRUE and -DPAPI_ROOT=... to enable PAPI support -DAPEX_WITH_BFD=TRUE and -DBFD_ROOT=... 
or -DAPEX_BUILD_BFD=TRUE to enable Binutils support for converting function/lambda/instruction pointers to human-readable code regions. For demangling of C++ symbols, demangle.h needs to be installed with the binutils headers (not typical in system installations). -DAPEX_WITH_MSR=TRUE to enable libmsr support for RAPL power measurement (typically not needed, as RAPL support is natively handled where available) -DAPEX_WITH_OTF2=TRUE and -DOTF2_ROOT=... to enable OTF2 tracing support -DHPX_WITH_HPXMP=TRUE to enable HPX OpenMP support and OMPT measurement support from APEX -DAPEX_WITH_ACTIVEHARMONY=TRUE and -DACTIVEHARMONY_ROOT=... to enable Active Harmony support -DAPEX_WITH_CUDA=TRUE to enable CUPTI and/or NVML support. Examples require a working nvcc compiler in your path.","title":"Installation with HPX"},{"location":"install/#standalone_installation","text":"APEX is open source, and available on Github at http://github.com/UO-OACISS/apex . For stability, most users will want to download the most recent release of APEX (for example, v2.6.5): wget https://github.com/UO-OACISS/apex/archive/refs/tags/v2.6.5.tar.gz tar -xvzf v2.6.5.tar.gz cd apex-2.6.5 Other users may want to work with the most recent code available, in which case you can clone the git repo: git clone https://github.com/UO-OACISS/apex.git cd apex","title":"Standalone Installation"},{"location":"install/#configuring_and_building_apex_with_spack","text":"APEX can be installed with the Spack package management tool . See spack info apex for details. You should see something like this: CMakePackage: apex Description: Autonomic Performance Environment for eXascale (APEX). 
Homepage: https://uo-oaciss.github.io/apex Preferred version: 2.6.3 https://github.com/UO-OACISS/apex/archive/v2.6.3.tar.gz Safe versions: develop [git] https://github.com/UO-OACISS/apex on branch develop master [git] https://github.com/UO-OACISS/apex on branch master 2.6.3 https://github.com/UO-OACISS/apex/archive/v2.6.3.tar.gz 2.6.2 https://github.com/UO-OACISS/apex/archive/v2.6.2.tar.gz 2.6.1 https://github.com/UO-OACISS/apex/archive/v2.6.1.tar.gz 2.6.0 https://github.com/UO-OACISS/apex/archive/v2.6.0.tar.gz Deprecated versions: 2.5.1 https://github.com/UO-OACISS/apex/archive/v2.5.1.tar.gz 2.5.0 https://github.com/UO-OACISS/apex/archive/v2.5.0.tar.gz 2.4.1 https://github.com/UO-OACISS/apex/archive/v2.4.1.tar.gz 2.4.0 https://github.com/UO-OACISS/apex/archive/v2.4.0.tar.gz 2.3.2 https://github.com/UO-OACISS/apex/archive/v2.3.2.tar.gz 2.3.1 https://github.com/UO-OACISS/apex/archive/v2.3.1.tar.gz 2.3.0 https://github.com/UO-OACISS/apex/archive/v2.3.0.tar.gz 2.2.0 https://github.com/UO-OACISS/apex/archive/v2.2.0.tar.gz Variants: activeharmony [true] false, true Enables Active Harmony support binutils [false] false, true Enables Binutils support boost [false] false, true Enables Boost support build_system [cmake] cmake Build systems supported by the package cuda [false] false, true Enables CUDA support examples [false] false, true Build Examples gperftools [false] false, true Enables Google PerfTools TCMalloc support hip [false] false, true Enables ROCm/HIP support jemalloc [false] false, true Enables JEMalloc support lmsensors [false] false, true Enables LM-Sensors support mpi [false] false, true Enables MPI support openmp [false] false, true Enables OpenMP support otf2 [true] false, true Enables OTF2 support papi [false] false, true Enables PAPI support plugins [true] false, true Enables Policy Plugin support sycl [false] false, true Enables Intel SYCL support (Level0) tests [false] false, true Build Unit Tests when build_system=cmake build_type [Release] Debug, 
MinSizeRel, RelWithDebInfo, Release CMake build type generator [make] none the build system generator to use when build_system=cmake ^cmake@3.9: ipo [false] false, true CMake interprocedural optimization Build Dependencies: activeharmony boost cuda gmake hip lm-sensors ninja papi roctracer-dev zlib-api binutils cmake gettext gperftools jemalloc mpi otf2 rocm-smi-lib sycl Link Dependencies: activeharmony binutils boost cuda gettext gperftools hip jemalloc lm-sensors mpi otf2 papi rocm-smi-lib roctracer-dev sycl zlib-api Run Dependencies: None Licenses: None","title":"Configuring and building APEX with Spack"},{"location":"install/#configuring_and_building_apex_with_cmake","text":"APEX is built with CMake. The minimum CMake settings needed for APEX are: -DCMAKE_INSTALL_PREFIX=... some path to an installation location -DCMAKE_BUILD_TYPE=... one of Release, Debug, or RelWithDebInfo (Release recommended) The process for building APEX is: 1) Get the code (see above) 2) Enter the repo directory: cd apex-2.6.5 3) configure using CMake: cmake -B build -DCMAKE_INSTALL_PREFIX= -DCMAKE_BUILD_TYPE=RelWithDebInfo . 4) build with cmake: cmake --build build # Run tests, if desired ctest --test-dir build # Build documentation, if desired cd build ; make doc ; cd .. # Install, if desired cmake --install build","title":"Configuring and building APEX with CMake"},{"location":"install/#other_cmake_settings_depending_on_your_needswants","text":"Note 1: The recommended packages include: Active Harmony - for autotuning policies (optional, no longer recommended) OMPT - if OpenMP support is required ( See the OpenMP use case for an example) and your compiler supports OpenMP-Tools. note: GCC does not support OpenMP-Tools, and has no plans to as of January 2024. Compilers known to support OMPT include Clang/LLVM, Intel, NVIDIA, AMD Clang. Binutils/BFD - if your runtime/application uses instruction addresses to identify timers, e.g. OpenMP, CUDA, HIP, OneAPI, OpenACC, etc. 
PAPI - if you want hardware counter support ( See the PAPI use case for an example) JEMalloc/TCMalloc - if your application is not already using a heap manager - see Note 2, below CUDA - if your application uses CUDA, APEX will use CUPTI/NVML to measure GPU activity ROCM - if your application uses HIP/ROCm, APEX will use Rocprofiler/Roctracer/ROC-SMI to measure GPU activity OneAPI - if your application uses Intel SYCL, APEX will use OneAPI/LevelZero to measure GPU activity Note 2: TCMalloc or JEMalloc will potentially speed up memory allocations significantly in APEX (and in your application). HOWEVER, If your application already uses TCMalloc, JEMalloc or TBBMalloc, DO NOT configure APEX with TCMalloc or JEMalloc. They will be included at application link time, and may conflict with the version detected by and linked into APEX. If you got some kind of tcmalloc crash/error at startup, please preload the dependent tcmalloc shared object library with '--apex:preload /path/to/libtcmalloc.so'. There are several utility libraries that provide additional functionality in APEX. Not all libraries are required, but some are recommended. For the following options, the default values are in italics . -DAPEX_BUILD_EXAMPLES= TRUE or FALSE . Whether or not to build the application examples in APEX. -DAPEX_BUILD_TESTS= TRUE or FALSE . Whether or not to build the APEX unit tests. -DAPEX_WITH_ACTIVEHARMONY= TRUE or FALSE . Active Harmony is a library that intelligently searches for parametric combinations to support adapting to heterogeneous and changing environments. For more information, see http://www.dyninst.org/harmony . APEX uses Active Harmony for runtime adaptation. -DACTIVEHARMONY_ROOT= the path to Active Harmony, or set the ACTIVEHARMONY_ROOT environment variable before running cmake. It should be noted that if Active Harmony is not specified and -DAPEX_WITH_ACTIVEHARMONY is TRUE or not set, APEX will download and build Active Harmony as a CMake project. 
To disable Active Harmony entirely, specify -DAPEX_WITH_ACTIVEHARMONY=FALSE. -DAPEX_BUILD_ACTIVEHARMONY= TRUE or FALSE . Whether or not Active Harmony is installed on the system, this option forces CMake to automatically download and build Active Harmony as part of the APEX project. -DAPEX_WITH_BFD= TRUE or FALSE . APEX uses libbfd (Binutils) to convert instruction addresses to source code locations. BFD support is useful for generating human-readable output for summaries and concurrency graphs. Libbfd is not required for runtime adaptation. For more information, see https://www.gnu.org/software/binutils/ . -DBFD_ROOT= path to Binutils, or set the BFD_ROOT environment variable. -DAPEX_BUILD_BFD= TRUE or FALSE . Whether or not binutils is found by CMake, this option forces CMake to automatically download and build binutils as part of the APEX project. -DAPEX_WITH_CUDA= TRUE or FALSE . APEX uses CUPTI to measure CUDA kernels and API calls, and/or NVML support to monitor the GPU activity passively. -DCUDAToolkit_ROOT= the path to the CUDA installation, if necessary. -DAPEX_WITH_HIP= TRUE or FALSE . APEX uses Rocprofiler and Roctracer to measure HIP kernels and API calls, and/or ROCM-SMI support to monitor the GPU activity passively. -DROCM_ROOT= the path to the ROCm installation, if necessary. -DAPEX_WITH_KOKKOS= TRUE or FALSE. -DKokkos_ROOT= the path to the Kokkos installation, if necessary. APEX will grab Kokkos as a submodule if not found; only the headers are needed. -DAPEX_WITH_JEMALLOC= TRUE or FALSE . JEMalloc is a heap management library. For more information, see http://www.canonware.com/jemalloc/ . JEMalloc provides faster memory performance in multithreaded environments. -DJEMALLOC_ROOT= path to JEMalloc, or set the JEMALLOC_ROOT environment variable before running cmake. -DAPEX_WITH_LEVEL0= TRUE or FALSE . APEX uses Level0 to measure Intel SYCL kernels and API calls and to monitor the GPU activity passively. -DAPEX_WITH_LM_SENSORS= TRUE or FALSE . 
Lm_sensors (Linux Monitoring Sensors) is a library for monitoring hardware temperatures and fan speeds. For more information, see https://en.wikipedia.org/wiki/Lm_sensors . APEX uses lm_sensors to monitor hardware, where available. -DAPEX_WITH_MPI= TRUE or FALSE . Whether to build MPI global support and related examples. -DAPEX_WITH_OMPT= TRUE or FALSE . OMP-Tools is the 5.0+ standard for OpenMP runtimes to provide callback hooks to performance tools. For more information, see the OpenMP specification v5.0 or newer. APEX has support for most OMPT OpenMP trace events. See the OpenMP use case for an example. Some compilers (Clang 10+, Intel 19+, IBM XL 16+) include OMPT support already, and APEX will use the built-in support. -DAPEX_WITH_OTF2= TRUE or FALSE . Used to enable OTF2 tracing support for the Vampir trace visualization tool. -DOTF2_ROOT= path to an OTF2 installation. -DAPEX_BUILD_OTF2= TRUE or FALSE . If OTF2 is not found by CMake, this option forces CMake to automatically download and build OTF2 as part of the APEX project. -DAPEX_WITH_PAPI= TRUE or FALSE . PAPI (Performance Application Programming Interface) provides the tool designer and application engineer with a consistent interface and methodology for use of the performance counter hardware found in most major microprocessors. For more information, see http://icl.cs.utk.edu/papi/ . APEX uses PAPI to optionally collect hardware counters for timed events. -DPAPI_ROOT= some path to PAPI, or set the PAPI_ROOT environment variable before running cmake. See the PAPI use case for an example. -DAPEX_WITH_PERFETTO= TRUE or FALSE . Enables native Perfetto trace support, increases build/link time significantly. Only used if you want native Perfetto output support, otherwise APEX will write compressed JSON output of the same data (which is actually smaller than the binary native format). -DAPEX_WITH_PLUGINS= TRUE or FALSE. Enables APEX policy plugin support. -DAPEX_WITH_TCMALLOC= TRUE or FALSE . 
TCMalloc is a heap management library distributed as part of Google perftools. For more information, see https://github.com/gperftools/gperftools . TCMalloc provides faster memory performance in multithreaded environments. -DGPERFTOOLS_ROOT= path to gperftools (TCMalloc), or set the GPERFTOOLS_ROOT environment variable before running cmake.","title":"Other CMake settings, depending on your needs/wants"},{"location":"install/#other_cmake_variables_of_interest","text":"For any others not listed, see https://github.com/UO-OACISS/apex/blob/develop/cmake/Modules/APEX_DefaultOptions.cmake","title":"Other CMake variables of interest"},{"location":"quickstart/","text":"APEX Quickstart \u00b6 Tutorial \u00b6 For an APEX tutorial, please see https://github.com/khuck/apex-tutorial . Installation \u00b6 For detailed instructions and information on dependencies, see build instructions To build APEX stand-alone (to use with OpenMP, OpenACC, CUDA, Kokkos, TBB, C++ threads, etc.) do the following: git clone https://github.com/UO-OACISS/apex.git cd apex cmake -B build -DCMAKE_BUILD_TYPE=Release -DBUILD_EXAMPLES=TRUE . cmake --build build --parallel Runtime \u00b6 To run an example (since -DBUILD_EXAMPLES=TRUE was set), just run the Matmult example and you should get similar output: [khuck@eagle apex]$ ./build/src/examples/Matmult/matmult Spawned thread 1... Spawned thread 2... Spawned thread 3... Done. 
Elapsed time: 0.300207 seconds Cores detected: 128 Worker Threads observed: 4 Available CPU time: 1.20083 seconds Counter : #samples | minimum | mean | maximum | stddev ------------------------------------------------------------------------------------------------ status:Threads : 1 6.000 6.000 6.000 0.000 status:VmData : 1 4.93e+04 4.93e+04 4.93e+04 0.000 status:VmExe : 1 64.000 64.000 64.000 0.000 status:VmHWM : 1 7808.000 7808.000 7808.000 0.000 status:VmLck : 1 0.000 0.000 0.000 0.000 status:VmLib : 1 6336.000 6336.000 6336.000 0.000 status:VmPMD : 1 16.000 16.000 16.000 0.000 status:VmPTE : 1 4.000 4.000 4.000 0.000 status:VmPeak : 1 3.80e+05 3.80e+05 3.80e+05 0.000 status:VmPin : 1 0.000 0.000 0.000 0.000 status:VmRSS : 1 7808.000 7808.000 7808.000 0.000 status:VmSize : 1 3.15e+05 3.15e+05 3.15e+05 0.000 status:VmStk : 1 192.000 192.000 192.000 0.000 status:VmSwap : 1 0.000 0.000 0.000 0.000 status:nonvoluntary_ctxt_switches : 1 0.000 0.000 0.000 0.000 status:voluntary_ctxt_switches : 1 77.000 77.000 77.000 0.000 ------------------------------------------------------------------------------------------------ Timer : #calls | mean | total | % total ------------------------------------------------------------------------------------------------ APEX MAIN : 1 0.300 0.300 100.000 allocateMatrix : 12 0.009 0.108 9.023 compute : 4 0.206 0.825 68.736 compute_interchange : 4 0.064 0.257 21.369 do_work : 4 0.298 1.193 99.313 freeMatrix : 12 0.000 0.000 0.025 initialize : 12 0.000 0.002 0.146 main : 1 0.299 0.299 24.930 ------------------------------------------------------------------------------------------------ Total timers : 49 Using apex_exec \u00b6 The wrapper script apex_exec can be used to measure applications that don't have APEX linked in. 
For details, see apex_exec usage .","title":"Quick Start (standalone)"},{"location":"quickstart/#apex_quickstart","text":"","title":"APEX Quickstart"},{"location":"quickstart/#tutorial","text":"For an APEX tutorial, please see https://github.com/khuck/apex-tutorial .","title":"Tutorial"},{"location":"quickstart/#installation","text":"For detailed instructions and information on dependencies, see build instructions To build APEX stand-alone (to use with OpenMP, OpenACC, CUDA, Kokkos, TBB, C++ threads, etc.) do the following: git clone https://github.com/UO-OACISS/apex.git cd apex cmake -B build -DCMAKE_BUILD_TYPE=Release -DBUILD_EXAMPLES=TRUE . cmake --build build --parallel","title":"Installation"},{"location":"quickstart/#runtime","text":"To run an example (since -DBUILD_EXAMPLES=TRUE was set), just run the Matmult example and you should get similar output: [khuck@eagle apex]$ ./build/src/examples/Matmult/matmult Spawned thread 1... Spawned thread 2... Spawned thread 3... Done. Elapsed time: 0.300207 seconds Cores detected: 128 Worker Threads observed: 4 Available CPU time: 1.20083 seconds Counter : #samples | minimum | mean | maximum | stddev ------------------------------------------------------------------------------------------------ status:Threads : 1 6.000 6.000 6.000 0.000 status:VmData : 1 4.93e+04 4.93e+04 4.93e+04 0.000 status:VmExe : 1 64.000 64.000 64.000 0.000 status:VmHWM : 1 7808.000 7808.000 7808.000 0.000 status:VmLck : 1 0.000 0.000 0.000 0.000 status:VmLib : 1 6336.000 6336.000 6336.000 0.000 status:VmPMD : 1 16.000 16.000 16.000 0.000 status:VmPTE : 1 4.000 4.000 4.000 0.000 status:VmPeak : 1 3.80e+05 3.80e+05 3.80e+05 0.000 status:VmPin : 1 0.000 0.000 0.000 0.000 status:VmRSS : 1 7808.000 7808.000 7808.000 0.000 status:VmSize : 1 3.15e+05 3.15e+05 3.15e+05 0.000 status:VmStk : 1 192.000 192.000 192.000 0.000 status:VmSwap : 1 0.000 0.000 0.000 0.000 status:nonvoluntary_ctxt_switches : 1 0.000 0.000 0.000 0.000 status:voluntary_ctxt_switches 
: 1 77.000 77.000 77.000 0.000 ------------------------------------------------------------------------------------------------ Timer : #calls | mean | total | % total ------------------------------------------------------------------------------------------------ APEX MAIN : 1 0.300 0.300 100.000 allocateMatrix : 12 0.009 0.108 9.023 compute : 4 0.206 0.825 68.736 compute_interchange : 4 0.064 0.257 21.369 do_work : 4 0.298 1.193 99.313 freeMatrix : 12 0.000 0.000 0.025 initialize : 12 0.000 0.002 0.146 main : 1 0.299 0.299 24.930 ------------------------------------------------------------------------------------------------ Total timers : 49","title":"Runtime"},{"location":"quickstart/#using_apex_exec","text":"The wrapper script apex_exec can be used to measure applications that don't have APEX linked in. For details, see apex_exec usage .","title":"Using apex_exec"},{"location":"quickstarthpx/","text":"APEX Quickstart \u00b6 Installation \u00b6 For detailed instructions and information on dependencies, see build instructions . APEX is integrated into the HPX runtime , and is integrated into the HPX build system. To enable APEX measurement with HPX, enable the following CMake flags: -DHPX_WITH_APEX=TRUE The CMake flag -DHPX_WITH_APEX_TAG=develop can be used to indicate a specific release version of APEX, or to use a specific GitHub branch of APEX. We recommend using the default configured version that comes with your version of HPX or the develop branch to access the latest features and bug fixes. Runtime \u00b6 To see APEX data after an HPX run, set the APEX_SCREEN_OUTPUT=1 environment variable. 
After execution, you'll see output like this: [khuck@eagle build]$ export APEX_SCREEN_OUTPUT=1 [khuck@eagle build]$ ./bin/fibonacci fibonacci(10) == 55 elapsed time: 0.112029 [s] Elapsed time: 0.19137 seconds Cores detected: 128 Worker Threads observed: 32 Available CPU time: 6.12383 seconds Timer : #calls | mean | total | % total ------------------------------------------------------------------------------------------------ APEX MAIN : 1 0.191 0.191 100.000 async : 2 0.000 0.000 0.001 async_launch_policy_dispatch : 5 0.001 0.003 0.041 broadcast_call_shutdown_functions_action : 2 0.000 0.001 0.012 call_shutdown_functions_action : 2 0.002 0.005 0.081 fibonacci_action : 174 0.015 2.569 41.957 load_components_action : 1 0.014 0.014 0.230 primary_namespace_colocate_action : 2 0.000 0.001 0.011 run_helper : 1 0.015 0.015 0.250 shutdown_all_action : 1 0.002 0.002 0.040 APEX Idle : 3.514 57.375 ------------------------------------------------------------------------------------------------ Total timers : 190 HPX applications can also use the apex_exec wrapper script, please see apex_exec flags for details.","title":"Quick Start (HPX)"},{"location":"quickstarthpx/#apex_quickstart","text":"","title":"APEX Quickstart"},{"location":"quickstarthpx/#installation","text":"For detailed instructions and information on dependencies, see build instructions . APEX is integrated into the HPX runtime , and is integrated into the HPX build system. To enable APEX measurement with HPX, enable the following CMake flags: -DHPX_WITH_APEX=TRUE The CMake flag -DHPX_WITH_APEX_TAG=develop can be used to indicate a specific release version of APEX, or to use a specific GitHub branch of APEX. We recommend using the default configured version that comes with your version of HPX or the develop branch to access the latest features and bug fixes.","title":"Installation"},{"location":"quickstarthpx/#runtime","text":"To see APEX data after an HPX run, set the APEX_SCREEN_OUTPUT=1 environment variable. 
After execution, you'll see output like this: [khuck@eagle build]$ export APEX_SCREEN_OUTPUT=1 [khuck@eagle build]$ ./bin/fibonacci fibonacci(10) == 55 elapsed time: 0.112029 [s] Elapsed time: 0.19137 seconds Cores detected: 128 Worker Threads observed: 32 Available CPU time: 6.12383 seconds Timer : #calls | mean | total | % total ------------------------------------------------------------------------------------------------ APEX MAIN : 1 0.191 0.191 100.000 async : 2 0.000 0.000 0.001 async_launch_policy_dispatch : 5 0.001 0.003 0.041 broadcast_call_shutdown_functions_action : 2 0.000 0.001 0.012 call_shutdown_functions_action : 2 0.002 0.005 0.081 fibonacci_action : 174 0.015 2.569 41.957 load_components_action : 1 0.014 0.014 0.230 primary_namespace_colocate_action : 2 0.000 0.001 0.011 run_helper : 1 0.015 0.015 0.250 shutdown_all_action : 1 0.002 0.002 0.040 APEX Idle : 3.514 57.375 ------------------------------------------------------------------------------------------------ Total timers : 190 HPX applications can also use the apex_exec wrapper script, please see apex_exec flags for details.","title":"Runtime"},{"location":"refman/","text":"API Doxygen Reference \u00b6 The source code is instrumented with Doxygen comments, and the API reference manual can be generated by executing 'make doc' in the build directory, after CMake configuration. 
A fairly recent version of the API reference documentation is available here: http://www.nic.uoregon.edu/~khuck/apex_docs/doc/html/index.html http://www.nic.uoregon.edu/~khuck/apex_docs/doc/refman.pdf In the event that the API specification and the reference implementation (as generated by the Doxygen comments from the actual source code) do not match, assume that the specification is correct and that the implementation is non-compliant - and subsequently contact the project maintainers so that we may bring the implementation into compliance.","title":"API Doxygen Reference"},{"location":"refman/#api_doxygen_reference","text":"The source code is instrumented with Doxygen comments, and the API reference manual can be generated by executing 'make doc' in the build directory, after CMake configuration. A fairly recent version of the API reference documentation is available here: http://www.nic.uoregon.edu/~khuck/apex_docs/doc/html/index.html http://www.nic.uoregon.edu/~khuck/apex_docs/doc/refman.pdf In the event that the API specification and the reference implementation (as generated by the Doxygen comments from the actual source code) do not match, assume that the specification is correct and that the implementation is non-compliant - and subsequently contact the project maintainers so that we may bring the implementation into compliance.","title":"API Doxygen Reference"},{"location":"spec/","text":"APEX Specification (DRAFT) \u00b6 *...to be fully implemented in a future release. While the following specification is slightly different than the current implementation, the differences are minor. When in doubt, the current implementation is documented by Doxygen, and is available here: http://www.nic.uoregon.edu/~khuck/apex_docs/doc/html/index.html http://www.nic.uoregon.edu/~khuck/apex_docs/doc/refman.pdf * READ ME FIRST! \u00b6 The API specification is provided for users who wish to instrument their own applications, or who wish to instrument a runtime. 
Please note that most runtimes have already been instrumented (or provide callbacks), and that users typically do not have to make any calls to the APEX API, other than to add application level timers or to write custom policy rules. If that is you, please see the tutorial with lots of up-to-date examples, https://github.com/khuck/apex-tutorial . Introduction \u00b6 This page contains the API specification for APEX. The API specification provides a high-level overview of the API and its functionality. The implementation has Doxygen comments inserted, so for full implementation details, please see the API Reference Manual . A note about C++ \u00b6 The following specification contains both the C and the C++ API. Typically, the C++ names use overloading for different argument lists, and will replace the apex_ prefix with the apex:: namespace. Because both APIs return handles to internal APEX objects, the type definitions of these objects use the C naming convention. In addition to the simple API presented below, the C++ API includes scoped timers and threads. See http://www.nic.uoregon.edu/~khuck/apex_docs/doc/html/namespaceapex.html for details. Terminology \u00b6 Unfortunately, many terms in Computer Science are overloaded. The following definitions are in use in this document: Thread : an operating system (OS) thread of execution. For example, POSIX threads (pthreads). Task : a scheduled unit of work, such as an OpenMP task or an HPX thread. APEX timers are typically used to measure tasks. C example \u00b6 The following is a very small C program that uses the APEX API. For more examples, please see the programs in the src/examples and src/unit_tests/C directories of the APEX source code. 
#include #include #include \"apex.h\" int foo(int i) { /* start an APEX timer for the function foo */ apex_profiler_handle profiler = apex_start(APEX_FUNCTION_ADDRESS, &foo); int j = i * i; /* stop the APEX timer */ apex_stop(profiler); return j; } int main (int argc, char** argv) { /* initialize APEX */ apex_init(\"apex_start unit test\"); /* start a timer, passing in the address of the main function */ apex_profiler_handle profiler = apex_start(APEX_FUNCTION_ADDRESS, &main); int i,j = 0; for (i = 0 ; i < 3 ; i++) { j += foo(i); } /* stop the timer */ apex_stop(profiler); /* finalize APEX */ apex_finalize(); /* free all memory allocated by APEX */ apex_cleanup(); return 0; } C++ example \u00b6 The following is a slightly more complicated C++ pthread program that uses the APEX API. For more examples, please see the programs in the src/examples and src/unit_tests/C++ directories of the APEX source code. #include #include #include #include \"apex_api.hpp\" void* someThread(void* tmp) { int* tid = (int*)tmp; char name[32]; sprintf(name, \"worker thread %d\", *tid); /* Register this thread with APEX */ apex::register_thread(name); /* Start a timer */ apex::profiler* p = apex::start((apex_function_address)&someThread); /* ... */ /* do some computation */ /* ... 
*/ /* stop the timer */ apex::stop(p); /* tell APEX that this thread is exiting */ apex::exit_thread(); return NULL; } int main (int argc, char** argv) { /* initialize APEX */ apex::init(\"apex::start unit test\"); /* set our node ID */ apex::set_node_id(0); /* start a timer */ apex::profiler* p = apex::start(\"main\"); /* Spawn two threads */ pthread_t thread[2]; int tid = 0; pthread_create(&(thread[0]), NULL, someThread, &tid); int tid2 = 1; pthread_create(&(thread[1]), NULL, someThread, &tid2); /* wait for the threads to finish */ pthread_join(thread[0], NULL); pthread_join(thread[1], NULL); /* stop our main timer */ apex::stop(p); /* finalize APEX */ apex::finalize(); /* free all memory allocated by APEX */ apex::cleanup(); return 0; } Constants, types and enumerations \u00b6 Constants \u00b6 /** A null pointer representing an APEX profiler handle. * Used when a null APEX profile handle is to be passed in to * apex::stop when the profiler object was not retained locally. */ #define APEX_NULL_PROFILER_HANDLE (apex_profiler_handle)(NULL) // for comparisons #define APEX_MAX_EVENTS 128 /*!< The maximum number of event types. Allows for ~20 custom events. */ #define APEX_NULL_FUNCTION_ADDRESS 0L // for comparisons Pre-defined types \u00b6 /** The address of a C++ object in APEX. * Not useful for the caller that gets it back, but required * for stopping the timer later. */ typedef uintptr_t apex_profiler_handle; // address of internal C++ object /** Not useful for the caller that gets it back, but required * for deregistering policies after registration. */ typedef uintptr_t apex_policy_handle; // address of internal C++ object /** Rather than use void pointers everywhere, be explicit about * what the functions are expecting. 
*/ typedef uintptr_t apex_function_address; // generic function pointer Enumerations \u00b6 /** * Typedef for enumerating the different timer types */ typedef enum _apex_profiler_type { APEX_FUNCTION_ADDRESS = 0, /*!< The ID is a function (or instruction) address */ APEX_NAME_STRING, /*!< The ID is a character string */ APEX_FUNCTOR /*!< C++ Object with the () operator defined */ } apex_profiler_type; /** * Typedef for enumerating the different event types */ typedef enum _event_type { APEX_INVALID_EVENT = -1, APEX_STARTUP = 0, /*!< APEX is initialized */ APEX_SHUTDOWN, /*!< APEX is terminated */ APEX_NEW_NODE, /*!< APEX has registered a new process ID */ APEX_NEW_THREAD, /*!< APEX has registered a new OS thread */ APEX_EXIT_THREAD, /*!< APEX has exited an OS thread */ APEX_START_EVENT, /*!< APEX has processed a timer start event */ APEX_RESUME_EVENT, /*!< APEX has processed a timer resume event (the number of calls is not incremented) */ APEX_STOP_EVENT, /*!< APEX has processed a timer stop event */ APEX_YIELD_EVENT, /*!< APEX has processed a timer yield event */ APEX_SAMPLE_VALUE, /*!< APEX has processed a sampled value */ APEX_PERIODIC, /*!< APEX has processed a periodic timer */ APEX_CUSTOM_EVENT_1, /*!< APEX has processed a custom event - useful for large granularity application control events */ APEX_CUSTOM_EVENT_2, // these are just here for padding, and so we can APEX_CUSTOM_EVENT_3, // test with them. APEX_CUSTOM_EVENT_4, APEX_CUSTOM_EVENT_5, APEX_CUSTOM_EVENT_6, APEX_CUSTOM_EVENT_7, APEX_CUSTOM_EVENT_8, APEX_UNUSED_EVENT = APEX_MAX_EVENTS // can't have more custom events than this } apex_event_type; /** * Typedef for enumerating the OS thread states. 
*/ typedef enum _thread_state { APEX_IDLE, /*!< Thread is idle */ APEX_BUSY, /*!< Thread is working */ APEX_THROTTLED, /*!< Thread is throttled (sleeping) */ APEX_WAITING, /*!< Thread is waiting for a resource */ APEX_BLOCKED /*!< Thread is otherwise blocked */ } apex_thread_state; /** * Typedef for enumerating the different optimization strategies * for throttling. */ typedef enum {APEX_MAXIMIZE_THROUGHPUT, /*!< maximize the number of calls to a timer/counter */ APEX_MAXIMIZE_ACCUMULATED, /*!< maximize the accumulated value of a timer/counter */ APEX_MINIMIZE_ACCUMULATED /*!< minimize the accumulated value of a timer/counter */ } apex_optimization_criteria_t; /** * Typedef for enumerating the different optimization methods * for throttling. */ typedef enum {APEX_SIMPLE_HYSTERESIS, /*!< optimize using a sliding window of historical observations. A running average of the most recent N observations is used as the measurement. */ APEX_DISCRETE_HILL_CLIMBING, /*!< Use a discrete hill climbing algorithm for optimization */ APEX_ACTIVE_HARMONY /*!< Use Active Harmony for optimization. */ } apex_optimization_method_t; /** The type of a profiler object * */ typedef enum _profile_type { APEX_TIMER, /*!< This profile is an instrumented timer */ APEX_COUNTER /*!< This profile is a sampled counter */ } apex_profile_type; Data structures and classes \u00b6 /** * The APEX context when an event occurs. This context will be passed to * any policies registered for this event. */ typedef struct _context { apex_event_type event_type; /*!< The type of the event currently processing */ apex_policy_handle* policy_handle; /*!< The policy handle for the current policy function */ void * data; /*!< Data associated with the event, such as the custom_data for a custom_event */ } apex_context; /** * The profile object for a timer in APEX. * Returned by the apex_get_profile() call. 
*/ typedef struct _profile { double calls; /*!< Number of times a timer was called, or the number of samples collected for a counter */ double accumulated; /*!< Accumulated values for all calls/samples */ double sum_squares; /*!< Running sum of squares calculation for all calls/samples */ double minimum; /*!< Minimum value seen by the timer or counter */ double maximum; /*!< Maximum value seen by the timer or counter */ apex_profile_type type; /*!< Whether this is a timer or a counter */ double papi_metrics[8]; /*!< Array of accumulated PAPI hardware metrics */ } apex_profile; /** * The APEX tuning request structures. */ typedef struct _apex_param { char * init_value; /*!< Initial value */ const char * value; /*!< Current value */ int num_possible_values; /*!< Number of possible values */ char * possible_values[]; } apex_param_struct; typedef struct _apex_tuning_request { char * name; /*!< Tuning request name */ double (*metric)(void); /*!< function to return the address of the output parameter */ int num_params; /*!< number of tuning input parameters */ char * param_names[]; /*!< the input parameter names */ apex_param_struct * params[]; /*!< the input parameters */ apex_event_type trigger; /*!< the event that triggers the tuning update */ apex_tuning_session_handle tuning_session_handle; /*!< the Active Harmony tuning session handle */ bool running; /*!< the current state of the tuning */ apex_ah_tuning_strategy strategy; /*!< the requested Active Harmony tuning strategy */ } apex_tuning_request_struct; Environment variables \u00b6 Please see the environment variables section of the documentation. Please note that all environment variables can also be queried or set at runtime with associated API calls. 
For example, the APEX_CSV_OUTPUT variable can also be set/queried with: void apex_set_csv_output (int); int apex_get_csv_output (void); General Utility functions \u00b6 Initialization \u00b6 /* C++ */ void apex::init (const char *thread_name); /* C */ void apex_init (const char *thread_name); APEX initialization is required to set up data structures and spawn the necessary helper threads, including the background system state query thread, the policy engine thread, and the profile handler thread. The thread name parameter will be used as the top-level timer for the main thread of execution. Finalization \u00b6 /* C++ */ void apex::finalize (void); /* C */ void apex_finalize (void); APEX finalization is required to format any desired output (screen, csv, profile, etc.) and terminate all APEX helper threads. No memory is freed at this point - that is done by the apex_cleanup() call. The reason for this is that applications may want to perform reporting after finalization, so the performance state of the application should still exist. Cleanup \u00b6 /* C++ */ void apex::cleanup (void); /* C */ void apex_cleanup (void); APEX cleanup frees all memory associated with APEX. Setting node ID \u00b6 /* C++ */ void apex::set_node_id (const uint64_t id); /* C */ void apex_set_node_id (const uint64_t id); When running in distributed environments, assign the specified id number as the APEX node ID. This can be an MPI rank or an HPX locality, for example. Registering threads \u00b6 /* C++ */ void apex::register_thread (const std::string &name); /* C */ void apex_register_thread (const char *name); Register a new OS thread with APEX. This method should be called whenever a new OS thread is spawned by the application or the runtime. An empty string or null string is valid input. 
Exiting a thread \u00b6 /* C++ */ void apex::exit_thread (void); /* C */ void apex_exit_thread (void); Before any thread other than the main thread of execution exits, notify APEX that the thread is exiting. The main thread should not call this function, but apex_finalize instead. Exiting the thread will trigger an event in APEX, so any policies associated with a thread exit will be executed. Getting the APEX version \u00b6 /* C++ */ std::string & apex::version (void); /* C */ const char * apex_version (void); Return the APEX version as a string. Getting the APEX settings \u00b6 /* C++ */ std::string & apex::get_options (void); /* C */ const char * apex_get_options (void); Return the current APEX options as a string. Basic measurement Functions (introspection) \u00b6 Starting a timer \u00b6 /* C++ */ apex_profiler_handle apex::start (const std::string &timer_name); apex_profiler_handle apex::start (const apex_function_address function_address); /* C */ apex_profiler_handle apex_start (apex_profiler_type type, const void * identifier); Create an APEX timer and start it. An APEX profiler object is returned, containing an identifier that APEX uses to stop the timer. The timer is either identified by a name or a function/task instruction pointer address. Stopping a timer \u00b6 /* C++ */ void apex::stop (apex_profiler_handle the_profiler); /* C */ void apex_stop (apex_profiler_handle the_profiler); The timer associated with the profiler object is stopped and placed on an internal queue to be processed by the profiler handler thread in the background. The profiler object is flagged as \"stopped\", so that when the profiler is processed the call count for this particular timer will be incremented by 1, unless the timer was started by apex_resume() (see below). The profiler handle will be freed internally by APEX after processing. 
Yielding a timer \u00b6 /* C++ */ void apex::yield (apex_profiler_handle the_profiler); /* C */ void apex_yield (apex_profiler_handle the_profiler); The timer associated with the profiler object is stopped and placed on an internal queue to be processed by the profiler handler thread in the background. The profiler object is flagged as NOT stopped , so that when the profiler is processed the call count will NOT be incremented. An application using apex_yield should not use apex_resume to restart the timer; it should use apex_start instead. apex_yield() is intended for situations when the completion state of the task is known and the state is not complete . The profiler handle will be freed internally by APEX after processing. Resuming a timer \u00b6 /* C++ */ apex_profiler_handle apex::resume (const std::string &timer_name); apex_profiler_handle apex::resume (const apex_function_address function_address); /* C */ apex_profiler_handle apex_resume (apex_profiler_type type, const void * identifier); Create an APEX timer and start it. An APEX profiler object is returned, containing an identifier that APEX uses to stop the timer. The profiler is flagged as NOT a new task , so that when it is stopped by apex_stop the call count for this particular timer will not be incremented. apex_resume() is intended for situations when the completion state of a task is NOT known when control is returned to the task scheduler, but is known when an interrupted task is resumed. Creating a new task dependency \u00b6 /* C++ */ void apex::new_task (std::string & name, const void * task_id); void apex::new_task (const apex_function_address function_address, const void * task_id); /* C */ void apex_new_task (apex_profiler_type type, const void * identifier, const void * task_id); Register the creation of a new task. This is used to track task dependencies in APEX. APEX assumes that the current APEX profiler refers to the task that is the parent of this new task. 
The task_id parameter is a generic pointer to whatever data might need to be passed to a policy executed when a new task is created. Sampling a value \u00b6 /* C++ */ void apex::sample_value (const std::string & name, const double value); /* C */ void apex_sample_value (const char * name, const double value); Record a measurement of the specified counter with the specified value. For example, \"bytes transferred\" and \"1024\". Setting the OS thread state \u00b6 /* C++ */ void apex::set_state (apex_thread_state state); /* C */ void apex_set_state (apex_thread_state state); Set the state of the current OS thread. States can include things like idle, busy, waiting, throttled, blocked. Policy-related methods (adaptation) \u00b6 Registering an event-based policy function \u00b6 /* C++ */ apex_policy_handle apex::register_policy (const apex_event_type when, std::function<int(apex_context const&)> f); std::set<apex_policy_handle> apex::register_policy (std::set<apex_event_type> when, std::function<int(apex_context const&)> f); /* C */ apex_policy_handle apex_register_policy (const apex_event_type when, int(*f)(apex_context const&)); APEX provides the ability to call an application-specified function when certain events occur in the APEX library, or periodically. This assigns the passed-in function to the event, so that when that event occurs in APEX, the function is called. The context for the event will be passed to the registered function. A set of events can also be used to register a policy function, which will return a set of policy handles. When any event in the set occurs, the function will be called. Registering a periodic policy \u00b6 /* C++ */ apex_policy_handle apex::register_periodic_policy(const unsigned long period, std::function<int(apex_context const&)> f); /* C */ apex_policy_handle apex_register_periodic_policy (const unsigned long period, int(*f)(apex_context const&)); APEX provides the ability to call an application-specified function periodically. This method assigns the passed-in function to be called on a periodic basis. 
The context for the event will be passed to the registered function. The period is specified in microseconds (us). De-registering a policy \u00b6 /* C++ */ apex::deregister_policy (apex_policy_handle handle); /* C */ apex_deregister_policy (apex_policy_handle handle); Remove the specified policy so that it will no longer be executed, whether it is event-based or periodic. The calling code should not try to dereference the policy handle after this call, as the memory pointed to by the handle will be freed. Registering a custom event \u00b6 /* C++ */ apex_event_type apex::register_custom_event (const std::string & name); /* C */ apex_event_type apex_register_custom_event (const char * name); Register a new event type with APEX. Trigger a custom event \u00b6 /* C++ */ void apex::custom_event (apex_event_type event_type, const void * event_data); /* C */ void apex_custom_event (const char * name, const void * event_data); Trigger a custom event. This function will pass a custom event to the APEX event listeners. Each listener's custom event handler will handle the custom event. Policy functions will be passed the custom event name in the event context. The event data pointer is to be used to pass memory to the policy function from the code that triggered the event. Request a profile from APEX \u00b6 /* C++ */ apex_profile * apex::get_profile (const std::string & name); apex_profile * apex::get_profile (const apex_function_address function_address); /* C */ apex_profile * apex_get_profile (apex_profiler_type type, const void * identifier); This function will return the current profile for the specified identifier. Because profiles are updated out-of-band, it is possible that the profile values are out of date. This profile can be either a timer or a sampled value. 
Reset a profile \u00b6 /* C++ */ void apex::reset (const std::string & timer_name); void apex::reset (const apex_function_address function_address); /* C */ void apex_reset (apex_profiler_type type, const void * identifier) This function will reset the profile associated with the specified timer or counter id to zero. If the identifier is null, all timers and counters will be reset. Concurrency Throttling Policy Functions \u00b6 Setup tuning for adaptation \u00b6 /* C++ */ apex_tuning_session_handle setup_custom_tuning(apex_tuning_request & request); apex_tuning_session_handle setup_custom_tuning(apex_tuning_request * request); Setup tuning of specified parameters to optimize for a custom metric, using multiple input criteria. This function will initialize a policy to optimize a custom metric, using the list of tunable parameters. The system tries to minimize the custom metric. After evaluating the state of the system, the policy will assign new values to the inputs. Get the current thread cap \u00b6 /* C++ */ int apex::get_thread_cap (void); /* C */ int apex_get_thread_cap (void); This function will return the current thread cap based on the throttling policy. Set the current thread cap \u00b6 /* C++ */ void apex::set_thread_cap (int new_cap); /* C */ void apex_set_thread_cap (int new_cap); This function will set the current thread cap based on an external throttling policy. Event-based API (OCR, Legion support - TBD ) \u00b6 The OCR and Legion runtimes teams have met to propose a common API for measuring asynchronous task-based runtimes. For more details, see https://github.com/UO-OACISS/apex/issues/37 . /* C++ */ apex::task_create (uint64_t parent_id) apex::dependency_reached (uint64_t event_id, uint64_t data_id, uint64_t task_id, uint64_t parent_id, ?) 
apex::task_ready (uint64_t why_ready) apex::task_execute (uint64_t why_delay, const apex_function_address function) apex::task_finished (uint64_t task_id) apex::task_destroy (uint64_t task_id) apex::data_create (uint64_t data_id) apex::data_new_size (uint64_t data_id) apex::data_move_from (uint64_t data_id, uint64_t target_location) apex::data_move_to (uint64_t data_id, uint64_t source_location) apex::data_replace (uint64_t data_id, uint64_t new_id) apex::data_destroy (uint64_t data_id) apex::event_create (uint64_t event_id, parent_task_id) apex::event_add_dependency (uint64_t event_id, uint64_t data_event_task_id, uint64_t parent_task_id) apex::event_trigger (uint64_t event_id) apex::event_destroy (uint64_t event_id) /* C API tbd */","title":"API Specification"},{"location":"spec/#apex_specification_draft","text":"*...to be fully implemented in a future release. While the following specification is slightly different than the current implementation, the differences are minor. When in doubt, the current implementation is documented by Doxygen, and is available here: http://www.nic.uoregon.edu/~khuck/apex_docs/doc/html/index.html http://www.nic.uoregon.edu/~khuck/apex_docs/doc/refman.pdf *","title":"APEX Specification (DRAFT)"},{"location":"spec/#read_me_first","text":"The API specification is provided for users who wish to instrument their own applications, or who wish to instrument a runtime. Please note that most runtimes have already been instrumented (or provide callbacks), and that users typically do not have to make any calls to the APEX API, other than to add application level timers or to write custom policy rules. If that is you, please see the tutorial with lots of up-to-date examples, https://github.com/khuck/apex-tutorial .","title":"READ ME FIRST!"},{"location":"spec/#introduction","text":"This page contains the API specification for APEX. The API specification provides a high-level overview of the API and its functionality. 
The implementation has Doxygen comments inserted, so for full implementation details, please see the API Reference Manual .","title":"Introduction"},{"location":"spec/#a_note_about_c","text":"The following specification contains both the C and the C++ API. Typically, the C++ names use overloading for different argument lists, and will replace the apex_ prefix with the apex:: namespace. Because both APIs return handles to internal APEX objects, the type definitions of these objects use the C naming convention. In addition to the simple API presented below, the C++ API includes scoped timers and threads. See http://www.nic.uoregon.edu/~khuck/apex_docs/doc/html/namespaceapex.html for details.","title":"A note about C++"},{"location":"spec/#terminology","text":"Unfortunately, many terms in Computer Science are overloaded. The following definitions are in use in this document: Thread : an operating system (OS) thread of execution. For example, POSIX threads (pthreads). Task : a scheduled unit of work, such as an OpenMP task or an HPX thread. APEX timers are typically used to measure tasks.","title":"Terminology"},{"location":"spec/#c_example","text":"The following is a very small C program that uses the APEX API. For more examples, please see the programs in the src/examples and src/unit_tests/C directories of the APEX source code. 
#include #include #include \"apex.h\" int foo(int i) { /* start an APEX timer for the function foo */ apex_profiler_handle profiler = apex_start(APEX_FUNCTION_ADDRESS, &foo); int j = i * i; /* stop the APEX timer */ apex_stop(profiler); return j; } int main (int argc, char** argv) { /* initialize APEX */ apex_init(\"apex_start unit test\"); /* start a timer, passing in the address of the main function */ apex_profiler_handle profiler = apex_start(APEX_FUNCTION_ADDRESS, &main); int i,j = 0; for (i = 0 ; i < 3 ; i++) { j += foo(i); } /* stop the timer */ apex_stop(profiler); /* finalize APEX */ apex_finalize(); /* free all memory allocated by APEX */ apex_cleanup(); return 0; }","title":"C example"},{"location":"spec/#c_example_1","text":"The following is a slightly more complicated C++ pthread program that uses the APEX API. For more examples, please see the programs in the src/examples and src/unit_tests/C++ directories of the APEX source code. #include #include #include #include \"apex_api.hpp\" void* someThread(void* tmp) { int* tid = (int*)tmp; char name[32]; sprintf(name, \"worker thread %d\", *tid); /* Register this thread with APEX */ apex::register_thread(name); /* Start a timer */ apex::profiler* p = apex::start((apex_function_address)&someThread); /* ... */ /* do some computation */ /* ... 
*/ /* stop the timer */ apex::stop(p); /* tell APEX that this thread is exiting */ apex::exit_thread(); return NULL; } int main (int argc, char** argv) { /* initialize APEX */ apex::init(\"apex::start unit test\"); /* set our node ID */ apex::set_node_id(0); /* start a timer */ apex::profiler* p = apex::start(\"main\"); /* Spawn two threads */ pthread_t thread[2]; int tid = 0; pthread_create(&(thread[0]), NULL, someThread, &tid); int tid2 = 1; pthread_create(&(thread[1]), NULL, someThread, &tid2); /* wait for the threads to finish */ pthread_join(thread[0], NULL); pthread_join(thread[1], NULL); /* stop our main timer */ apex::stop(p); /* finalize APEX */ apex::finalize(); /* free all memory allocated by APEX */ apex::cleanup(); return 0; }","title":"C++ example"},{"location":"spec/#constants_types_and_enumerations","text":"","title":"Constants, types and enumerations"},{"location":"spec/#constants","text":"/** A null pointer representing an APEX profiler handle. * Used when a null APEX profile handle is to be passed in to * apex::stop when the profiler object was not retained locally. */ #define APEX_NULL_PROFILER_HANDLE (apex_profiler_handle)(NULL) // for comparisons #define APEX_MAX_EVENTS 128 /*!< The maximum number of event types. Allows for ~20 custom events. */ #define APEX_NULL_FUNCTION_ADDRESS 0L // for comparisons","title":"Constants"},{"location":"spec/#pre-defined_types","text":"/** The address of a C++ object in APEX. * Not useful for the caller that gets it back, but required * for stopping the timer later. */ typedef uintptr_t apex_profiler_handle; // address of internal C++ object /** Not useful for the caller that gets it back, but required * for deregistering policies after registration. */ typedef uintptr_t apex_policy_handle; // address of internal C++ object /** Rather than use void pointers everywhere, be explicit about * what the functions are expecting. 
*/ typedef uintptr_t apex_function_address; // generic function pointer","title":"Pre-defined types"},{"location":"spec/#enumerations","text":"/** * Typedef for enumerating the different timer types */ typedef enum _apex_profiler_type { APEX_FUNCTION_ADDRESS = 0, /*!< The ID is a function (or instruction) address */ APEX_NAME_STRING, /*!< The ID is a character string */ APEX_FUNCTOR /*!< C++ Object with the () operator defined */ } apex_profiler_type; /** * Typedef for enumerating the different event types */ typedef enum _event_type { APEX_INVALID_EVENT = -1, APEX_STARTUP = 0, /*!< APEX is initialized */ APEX_SHUTDOWN, /*!< APEX is terminated */ APEX_NEW_NODE, /*!< APEX has registered a new process ID */ APEX_NEW_THREAD, /*!< APEX has registered a new OS thread */ APEX_EXIT_THREAD, /*!< APEX has exited an OS thread */ APEX_START_EVENT, /*!< APEX has processed a timer start event */ APEX_RESUME_EVENT, /*!< APEX has processed a timer resume event (the number of calls is not incremented) */ APEX_STOP_EVENT, /*!< APEX has processed a timer stop event */ APEX_YIELD_EVENT, /*!< APEX has processed a timer yield event */ APEX_SAMPLE_VALUE, /*!< APEX has processed a sampled value */ APEX_PERIODIC, /*!< APEX has processed a periodic timer */ APEX_CUSTOM_EVENT_1, /*!< APEX has processed a custom event - useful for large granularity application control events */ APEX_CUSTOM_EVENT_2, // these are just here for padding, and so we can APEX_CUSTOM_EVENT_3, // test with them. APEX_CUSTOM_EVENT_4, APEX_CUSTOM_EVENT_5, APEX_CUSTOM_EVENT_6, APEX_CUSTOM_EVENT_7, APEX_CUSTOM_EVENT_8, APEX_UNUSED_EVENT = APEX_MAX_EVENTS // can't have more custom events than this } apex_event_type; /** * Typedef for enumerating the OS thread states. 
*/ typedef enum _thread_state { APEX_IDLE, /*!< Thread is idle */ APEX_BUSY, /*!< Thread is working */ APEX_THROTTLED, /*!< Thread is throttled (sleeping) */ APEX_WAITING, /*!< Thread is waiting for a resource */ APEX_BLOCKED /*!< Thread is otherwise blocked */ } apex_thread_state; /** * Typedef for enumerating the different optimization strategies * for throttling. */ typedef enum {APEX_MAXIMIZE_THROUGHPUT, /*!< maximize the number of calls to a timer/counter */ APEX_MAXIMIZE_ACCUMULATED, /*!< maximize the accumulated value of a timer/counter */ APEX_MINIMIZE_ACCUMULATED /*!< minimize the accumulated value of a timer/counter */ } apex_optimization_criteria_t; /** * Typedef for enumerating the different optimization methods * for throttling. */ typedef enum {APEX_SIMPLE_HYSTERESIS, /*!< optimize using a sliding window of historical observations. A running average of the most recent N observations is used as the measurement. */ APEX_DISCRETE_HILL_CLIMBING, /*!< Use a discrete hill climbing algorithm for optimization */ APEX_ACTIVE_HARMONY /*!< Use Active Harmony for optimization. */ } apex_optimization_method_t; /** The type of a profiler object * */ typedef enum _profile_type { APEX_TIMER, /*!< This profile is an instrumented timer */ APEX_COUNTER /*!< This profile is a sampled counter */ } apex_profile_type;","title":"Enumerations"},{"location":"spec/#data_structures_and_classes","text":"/** * The APEX context when an event occurs. This context will be passed to * any policies registered for this event. */ typedef struct _context { apex_event_type event_type; /*!< The type of the event currently processing */ apex_policy_handle* policy_handle; /*!< The policy handle for the current policy function */ void * data; /*!< Data associated with the event, such as the custom_data for a custom_event */ } apex_context; /** * The profile object for a timer in APEX. * Returned by the apex_get_profile() call. 
*/ typedef struct _profile { double calls; /*!< Number of times a timer was called, or the number of samples collected for a counter */ double accumulated; /*!< Accumulated values for all calls/samples */ double sum_squares; /*!< Running sum of squares calculation for all calls/samples */ double minimum; /*!< Minimum value seen by the timer or counter */ double maximum; /*!< Maximum value seen by the timer or counter */ apex_profile_type type; /*!< Whether this is a timer or a counter */ double papi_metrics[8]; /*!< Array of accumulated PAPI hardware metrics */ } apex_profile; /** * The APEX tuning request structures. */ typedef struct _apex_param { char * init_value; /*!< Initial value */ const char * value; /*!< Current value */ int num_possible_values; /*!< Number of possible values */ char * possible_values[]; } apex_param_struct; typedef struct _apex_tuning_request { char * name; /*!< Tuning request name */ double (*metric)(void); /*!< function to return the address of the output parameter */ int num_params; /*!< number of tuning input parameters */ char * param_names[]; /*!< the input parameter names */ apex_param_struct * params[]; /*!< the input parameters */ apex_event_type trigger; /*!< the event that triggers the tuning update */ apex_tuning_session_handle tuning_session_handle; /*!< the Active Harmony tuning session handle */ bool running; /*!< the current state of the tuning */ apex_ah_tuning_strategy strategy; /*!< the requested Active Harmony tuning strategy */ } apex_tuning_request_struct;","title":"Data structures and classes"},{"location":"spec/#environment_variables","text":"Please see the environment variables section of the documentation. Please note that all environment variables can also be queried or set at runtime with associated API calls. 
For example, the APEX_CSV_OUTPUT variable can also be set/queried with: void apex_set_csv_output (int); int apex_get_csv_output (void);","title":"Environment variables"},{"location":"spec/#general_utility_functions","text":"","title":"General Utility functions"},{"location":"spec/#initialization","text":"/* C++ */ void apex::init (const char *thread_name); /* C */ void apex_init (const char *thread_name); APEX initialization is required to set up data structures and spawn the necessary helper threads, including the background system state query thread, the policy engine thread, and the profile handler thread. The thread name parameter will be used as the name of the top-level timer for the main thread of execution.","title":"Initialization"},{"location":"spec/#finalization","text":"/* C++ */ void apex::finalize (void); /* C */ void apex_finalize (void); APEX finalization is required to format any desired output (screen, csv, profile, etc.) and terminate all APEX helper threads. No memory is freed at this point - that is done by the apex_cleanup() call. The reason for this is that applications may want to perform reporting after finalization, so the performance state of the application should still exist.","title":"Finalization"},{"location":"spec/#cleanup","text":"/* C++ */ void apex::cleanup (void); /* C */ void apex_cleanup (void); APEX cleanup frees all memory associated with APEX.","title":"Cleanup"},{"location":"spec/#setting_node_id","text":"/* C++ */ void apex::set_node_id (const uint64_t id); /* C */ void apex_set_node_id (const uint64_t id); When running in distributed environments, assign the specified id number as the APEX node ID. This can be an MPI rank or an HPX locality, for example.","title":"Setting node ID"},{"location":"spec/#registering_threads","text":"/* C++ */ void apex::register_thread (const std::string &name); /* C */ void apex_register_thread (const char *name); Register a new OS thread with APEX. 
This method should be called whenever a new OS thread is spawned by the application or the runtime. An empty string or null string is valid input.","title":"Registering threads"},{"location":"spec/#exiting_a_thread","text":"/* C++ */ void apex::exit_thread (void); /* C */ void apex_exit_thread (void); Before any thread other than the main thread of execution exits, notify APEX that the thread is exiting. The main thread should not call this function, but apex_finalize instead. Exiting the thread will trigger an event in APEX, so any policies associated with a thread exit will be executed.","title":"Exiting a thread"},{"location":"spec/#getting_the_apex_version","text":"/* C++ */ std::string & apex::version (void); /* C */ const char * apex_version (void); Return the APEX version as a string.","title":"Getting the APEX version"},{"location":"spec/#getting_the_apex_settings","text":"/* C++ */ std::string & apex::get_options (void); /* C */ const char * apex_get_options (void); Return the current APEX options as a string.","title":"Getting the APEX settings"},{"location":"spec/#basic_measurement_functions_introspection","text":"","title":"Basic measurement Functions (introspection)"},{"location":"spec/#starting_a_timer","text":"/* C++ */ apex_profiler_handle apex::start (const std::string &timer_name); apex_profiler_handle apex::start (const apex_function_address function_address); /* C */ apex_profiler_handle apex_start (apex_profiler_type type, const void * identifier); Create an APEX timer and start it. An APEX profiler object is returned, containing an identifier that APEX uses to stop the timer. 
The timer is identified either by a name or by a function/task instruction pointer address.","title":"Starting a timer"},{"location":"spec/#stopping_a_timer","text":"/* C++ */ void apex::stop (apex_profiler_handle the_profiler); /* C */ void apex_stop (apex_profiler_handle the_profiler); The timer associated with the profiler object is stopped and placed on an internal queue to be processed by the profiler handler thread in the background. The profiler object is flagged as \"stopped\", so that when the profiler is processed the call count for this particular timer will be incremented by 1, unless the timer was started by apex_resume() (see below). The profiler handle will be freed internally by APEX after processing.","title":"Stopping a timer"},{"location":"spec/#yielding_a_timer","text":"/* C++ */ void apex::yield (apex_profiler_handle the_profiler); /* C */ void apex_yield (apex_profiler_handle the_profiler); The timer associated with the profiler object is stopped and placed on an internal queue to be processed by the profiler handler thread in the background. The profiler object is flagged as NOT stopped, so that when the profiler is processed the call count will NOT be incremented. An application using apex_yield should not use apex_resume to restart the timer; it should use apex_start instead. apex_yield() is intended for situations when the completion state of the task is known and the state is not complete. The profiler handle will be freed internally by APEX after processing.","title":"Yielding a timer"},{"location":"spec/#resuming_a_timer","text":"/* C++ */ apex_profiler_handle apex::resume (const std::string &timer_name); apex_profiler_handle apex::resume (const apex_function_address function_address); /* C */ apex_profiler_handle apex_resume (apex_profiler_type type, const void * identifier); Create an APEX timer and start it. An APEX profiler object is returned, containing an identifier that APEX uses to stop the timer. 
The profiler is flagged as NOT a new task, so that when it is stopped by apex_stop the call count for this particular timer will not be incremented. apex_resume() is intended for situations when the completion state of a task is NOT known when control is returned to the task scheduler, but is known when an interrupted task is resumed.","title":"Resuming a timer"},{"location":"spec/#creating_a_new_task_dependency","text":"/* C++ */ void apex::new_task (std::string & name, const void * task_id); void apex::new_task (const apex_function_address function_address, const void * task_id); /* C */ void apex_new_task (apex_profiler_type type, const void * identifier, const void * task_id); Register the creation of a new task. This is used to track task dependencies in APEX. APEX assumes that the current APEX profiler refers to the task that is the parent of this new task. The task_id argument is a generic pointer to whatever data might need to be passed to a policy executed when a new task is created.","title":"Creating a new task dependency"},{"location":"spec/#sampling_a_value","text":"/* C++ */ void apex::sample_value (const std::string & name, const double value); /* C */ void apex_sample_value (const char * name, const double value); Record a measurement of the specified counter with the specified value. For example, \"bytes transferred\" and \"1024\".","title":"Sampling a value"},{"location":"spec/#setting_the_os_thread_state","text":"/* C++ */ void apex::set_state (apex_thread_state state); /* C */ void apex_set_state (apex_thread_state state); Set the state of the current OS thread. 
The possible states are idle, busy, throttled, waiting, and blocked.","title":"Setting the OS thread state"},{"location":"spec/#policy-related_methods_adaptation","text":"","title":"Policy-related methods (adaptation)"},{"location":"spec/#registering_an_event-based_policy_function","text":"/* C++ */ apex_policy_handle apex::register_policy (const apex_event_type when, std::function f); std::set apex::register_policy (std::set when, std::function f); /* C */ apex_policy_handle apex_register_policy (const apex_event_type when, int(*f)(apex_context const&)); APEX provides the ability to call an application-specified function when certain events occur in the APEX library, or periodically. This method assigns the passed-in function to the event, so that when that event occurs in APEX, the function is called. The context for the event will be passed to the registered function. A set of events can also be used to register a policy function, which will return a set of policy handles. When any event in the set occurs, the function will be called.","title":"Registering an event-based policy function"},{"location":"spec/#registering_a_periodic_policy","text":"/* C++ */ apex_policy_handle apex::register_periodic_policy(const unsigned long period, std::function f); /* C */ apex_policy_handle apex_register_periodic_policy (const unsigned long period, int(*f)(apex_context const&)); APEX provides the ability to call an application-specified function periodically. This method assigns the passed-in function to be called on a periodic basis. The context for the event will be passed to the registered function. The period is specified in microseconds (us).","title":"Registering a periodic policy"},{"location":"spec/#de-registering_a_policy","text":"/* C++ */ apex::deregister_policy (apex_policy_handle handle); /* C */ apex_deregister_policy (apex_policy_handle handle); Remove the specified policy so that it will no longer be executed, whether it is event-based or periodic. 
The calling code should not try to dereference the policy handle after this call, as the memory pointed to by the handle will be freed.","title":"De-registering a policy"},{"location":"spec/#registering_a_custom_event","text":"/* C++ */ apex_event_type apex::register_custom_event (const std::string & name); /* C */ apex_event_type apex_register_custom_event (const char * name); Register a new event type with APEX.","title":"Registering a custom event"},{"location":"spec/#trigger_a_custom_event","text":"/* C++ */ void apex::custom_event (apex_event_type event_type, const void * event_data); /* C */ void apex_custom_event (const char * name, const void * event_data); Trigger a custom event. This function will pass a custom event to the APEX event listeners. Each listener's custom event handler will handle the custom event. Policy functions will be passed the custom event name in the event context. The event data pointer is to be used to pass memory to the policy function from the code that triggered the event.","title":"Trigger a custom event"},{"location":"spec/#request_a_profile_from_apex","text":"/* C++ */ apex_profile * apex::get_profile (const std::string & name); apex_profile * apex::get_profile (const apex_function_address function_address); /* C */ apex_profile * apex_get_profile (apex_profiler_type type, const void * identifier); This function will return the current profile for the specified identifier. Because profiles are updated out-of-band, it is possible that the profile values are out of date. The profile can be for either a timer or a sampled value.","title":"Request a profile from APEX"},{"location":"spec/#reset_a_profile","text":"/* C++ */ void apex::reset (const std::string & timer_name); void apex::reset (const apex_function_address function_address); /* C */ void apex_reset (apex_profiler_type type, const void * identifier); This function will reset the profile associated with the specified timer or counter id to zero. 
If the identifier is null, all timers and counters will be reset.","title":"Reset a profile"},{"location":"spec/#concurrency_throttling_policy_functions","text":"","title":"Concurrency Throttling Policy Functions"},{"location":"spec/#setup_tuning_for_adaptation","text":"/* C++ */ apex_tuning_session_handle setup_custom_tuning(apex_tuning_request & request); apex_tuning_session_handle setup_custom_tuning(apex_tuning_request * request); Set up tuning of the specified parameters to optimize for a custom metric, using multiple input criteria. This function will initialize a policy to optimize a custom metric, using the list of tunable parameters. The system tries to minimize the custom metric. After evaluating the state of the system, the policy will assign new values to the inputs.","title":"Setup tuning for adaptation"},{"location":"spec/#get_the_current_thread_cap","text":"/* C++ */ int apex::get_thread_cap (void); /* C */ int apex_get_thread_cap (void); This function will return the current thread cap based on the throttling policy.","title":"Get the current thread cap"},{"location":"spec/#set_the_current_thread_cap","text":"/* C++ */ void apex::set_thread_cap (int new_cap); /* C */ void apex_set_thread_cap (int new_cap); This function will set the current thread cap based on an external throttling policy.","title":"Set the current thread cap"},{"location":"spec/#event-based_api_ocr_legion_support_-_tbd","text":"The OCR and Legion runtime teams have met to propose a common API for measuring asynchronous task-based runtimes. For more details, see https://github.com/UO-OACISS/apex/issues/37 . /* C++ */ apex::task_create (uint64_t parent_id) apex::dependency_reached (uint64_t event_id, uint64_t data_id, uint64_t task_id, uint64_t parent_id, ?) 
apex::task_ready (uint64_t why_ready) apex::task_execute (uint64_t why_delay, const apex_function_address function) apex::task_finished (uint64_t task_id) apex::task_destroy (uint64_t task_id) apex::data_create (uint64_t data_id) apex::data_new_size (uint64_t data_id) apex::data_move_from (uint64_t data_id, uint64_t target_location) apex::data_move_to (uint64_t data_id, uint64_t source_location) apex::data_replace (uint64_t data_id, uint64_t new_id) apex::data_destroy (uint64_t data_id) apex::event_create (uint64_t event_id, parent_task_id) apex::event_add_dependency (uint64_t event_id, uint64_t data_event_task_id, uint64_t parent_task_id) apex::event_trigger (uint64_t event_id) apex::event_destroy (uint64_t event_id) /* C API tbd */","title":"Event-based API (OCR, Legion support - TBD)"},{"location":"usage/","text":"Usage \u00b6 Tutorial \u00b6 For an APEX tutorial, please see https://github.com/khuck/apex-tutorial . Supported Runtime Systems \u00b6 HPX (Louisiana State University) \u00b6 HPX (High Performance ParalleX) is the original implementation of the ParalleX model. Developed and maintained by the Ste||ar Group at Louisiana State University, HPX is implemented in C++. For more information, see http://stellar-group.org/projects/hpx/ . For a tutorial on HPX with APEX (presented at SC'15, Austin TX) see https://github.com/khuck/SC15_APEX_tutorial (somewhat outdated). APEX is configured and built as part of HPX. In fact, you don't even need to download it separately - it will be automatically checked out from GitHub as part of the HPX CMake configuration. However, you do need to pass the correct CMake options to the HPX configuration step. Configuring HPX with APEX \u00b6 See Installation with HPX . Running HPX with APEX \u00b6 See APEX Quickstart . OpenMP \u00b6 The OpenMP API supports multi-platform shared-memory parallel programming in C/C++ and Fortran. 
The OpenMP API defines a portable, scalable model with a simple and flexible interface for developing parallel applications on platforms from the desktop to the supercomputer. For more information, see http://openmp.org/ . Configuring APEX for OpenMP OMPT support \u00b6 The CMake process will automatically detect whether your compiler has OpenMP support. If you configure APEX with -DUSE_OMPT=TRUE and have a compiler with full OpenMP 5.0 OMPT support, APEX will detect the support. If your compiler is GCC, Intel or Clang and does not have native OMPT support, APEX can build and use the open source LLVM OpenMP runtime as a drop-in replacement for the compiler's native runtime library, but this is no longer recommended and is deprecated. APEX uses Binutils to resolve the OpenMP outlined regions from instruction addresses to human-readable names, so also configure APEX with -DUSE_BFD=TRUE (see Other CMake Settings ). The following example was configured and run with Intel 20 compilers. The CMake configuration for this example was: cmake -DCMAKE_C_COMPILER=`which icc` -DCMAKE_CXX_COMPILER=`which icpc` -DCMAKE_BUILD_TYPE=Release -DCMAKE_INSTALL_PREFIX=../install -DBUILD_TESTS=TRUE -DUSE_BFD=TRUE -DBFD_ROOT=/usr/local/packages/binutils/2.34 -DUSE_OMPT=TRUE .. Running OpenMP applications with APEX \u00b6 Using the apex_exec wrapper script, execute the OpenMP program as normal: [khuck@delphi apex]$ ./install/bin/apex_exec --apex:screen --apex:taskgraph --apex:ompt build/src/unit_tests/C++/apex_openmp_cpp Program to run : build/src/unit_tests/C++/apex_openmp_cpp Initializing... No Sharing... Result: 2690568.772590 Elapsed time: 0.0398378 seconds Cores detected: 72 Worker Threads observed: 72 Available CPU time: 2.86832 seconds Counter : #samples | minimum | mean | maximum | stddev ------------------------------------------------------------------------------------------------ Iterations: OpenMP Work Loop: no_shari... 
: 71 1.05e+06 1.05e+06 1.05e+06 0.000 Iterations: OpenMP Work Loop: my_init(... : 144 1.05e+06 1.05e+06 1.05e+06 0.000 OpenMP Initial Thread : 1 1.000 1.000 1.000 0.000 OpenMP Worker Thread : 71 1.000 1.000 1.000 0.000 Single: OpenMP Work Single Executor: L... : 1 1.000 1.000 1.000 0.000 Single: OpenMP Work Single Executor: L... : 2 1.000 1.000 1.000 0.000 Single: OpenMP Work Single Other: L__Z... : 71 1.000 1.000 1.000 0.000 Single: OpenMP Work Single Other: L__Z... : 142 1.000 1.000 1.000 0.000 status:Threads : 1 3.000 3.000 3.000 0.000 status:VmData : 1 1.07e+05 1.07e+05 1.07e+05 0.000 status:VmExe : 1 20.000 20.000 20.000 0.000 status:VmHWM : 1 9356.000 9356.000 9356.000 0.000 status:VmLck : 1 0.000 0.000 0.000 0.000 status:VmLib : 1 4.39e+04 4.39e+04 4.39e+04 0.000 status:VmPTE : 1 128.000 128.000 128.000 0.000 status:VmPeak : 1 2.49e+05 2.49e+05 2.49e+05 0.000 status:VmPin : 1 0.000 0.000 0.000 0.000 status:VmRSS : 1 9356.000 9356.000 9356.000 0.000 status:VmSize : 1 1.84e+05 1.84e+05 1.84e+05 0.000 status:VmStk : 1 136.000 136.000 136.000 0.000 status:VmSwap : 1 0.000 0.000 0.000 0.000 status:nonvoluntary_ctxt_switches : 1 43.000 43.000 43.000 0.000 status:voluntary_ctxt_switches : 1 46.000 46.000 46.000 0.000 ------------------------------------------------------------------------------------------------ Timer : #calls | mean | total | % total ------------------------------------------------------------------------------------------------ APEX MAIN : 1 0.040 0.040 100.000 OpenMP Parallel Region: no_sharing(double*, doubl... : 1 0.006 0.006 0.211 OpenMP Parallel Region: my_init(double*) [{/home/... : 2 0.014 0.028 0.961 OpenMP Work Loop: no_sharing(double*, double*) [{... : 72 0.003 0.195 6.806 OpenMP Work Loop: my_init(double*) [{/home/users/... : 143 0.001 0.161 5.622 OpenMP Work Single Executor: L__Z10no_sharingPdS_... : 1 0.001 0.001 0.028 OpenMP Work Single Executor: L__Z7my_initPd_39__p... 
: 2 0.000 0.001 0.018 OpenMP Work Single Other: L__Z10no_sharingPdS__20... : 71 0.000 0.029 1.027 OpenMP Work Single Other: L__Z7my_initPd_39__par_... : 141 0.001 0.100 3.472 ------------------------------------------------------------------------------------------------ Total timers : 433 If GraphViz is installed on your system, the dot program will generate a taskgraph image based on the taskgraph.0.dot file that was generated by APEX: OpenACC \u00b6 Configuring APEX for OpenACC support \u00b6 Nothing special needs to be done to enable OpenACC support. If your compiler supports OpenACC (PGI, GCC 10+), then CMake will detect it and enable OpenACC support in APEX. In this example, APEX was configured with GCC 10.0.0: cmake -DCMAKE_C_COMPILER=`which gcc` -DCMAKE_CXX_COMPILER=`which g++` -DCMAKE_BUILD_TYPE=Release -DCMAKE_INSTALL_PREFIX=../install -DBUILD_TESTS=TRUE -DUSE_BFD=FALSE -DBFD_ROOT=/usr/local/packages/binutils/2.34 .. Running OpenACC programs with APEX \u00b6 Enabling OpenACC support requires setting the ACC_PROFLIB environment variable with the path to libapex.so , or by using the apex_exec script with the --apex:openacc flag: [khuck@gorgon apex]$ ./install/bin/apex_exec --apex:screen --apex:taskgraph --apex:openacc ./build/src/unit_tests/C/apex_openacc Program to run : ./build/src/unit_tests/C/apex_openacc Jacobi relaxation Calculation: 128 x 128 mesh Device API: none Device type: default Device vendor: -1 Device API: CUDA Device type: nvidia Device vendor: -1 0, 0.250000 Elapsed time: 0.451705 seconds Cores detected: 160 Worker Threads observed: 1 Available CPU time: 0.451705 seconds Counter : #samples | minimum | mean | maximum | stddev ------------------------------------------------------------------------------------------------ OpenACC Gangs : 200 1.000 2560.500 5120.000 2559.500 OpenACC Vector Lanes : 200 32.000 32.000 32.000 0.000 OpenACC Workers : 200 1.000 1.000 1.000 0.000 OpenACC device alloc (implicit) parall... 
: 301 15.000 889.206 2.62e+05 1.51e+04 OpenACC device free (implicit) paralle... : 301 0.000 0.000 0.000 0.000 OpenACC enqueue data transfer (HtoD) (... : 200 16.000 20.000 24.000 4.000 status:Threads : 1 3.000 3.000 3.000 0.000 status:VmData : 1 1.81e+04 1.81e+04 1.81e+04 0.000 status:VmExe : 1 64.000 64.000 64.000 0.000 status:VmHWM : 1 4416.000 4416.000 4416.000 0.000 status:VmLck : 1 0.000 0.000 0.000 0.000 status:VmLib : 1 8640.000 8640.000 8640.000 0.000 status:VmPMD : 1 16.000 16.000 16.000 0.000 status:VmPTE : 1 3.000 3.000 3.000 0.000 status:VmPeak : 1 1.59e+05 1.59e+05 1.59e+05 0.000 status:VmPin : 1 0.000 0.000 0.000 0.000 status:VmRSS : 1 4416.000 4416.000 4416.000 0.000 status:VmSize : 1 9.34e+04 9.34e+04 9.34e+04 0.000 status:VmStk : 1 192.000 192.000 192.000 0.000 status:VmSwap : 1 0.000 0.000 0.000 0.000 status:nonvoluntary_ctxt_switches : 1 0.000 0.000 0.000 0.000 status:voluntary_ctxt_switches : 1 46.000 46.000 46.000 0.000 ------------------------------------------------------------------------------------------------ Timer : #calls | mean | total | % total ------------------------------------------------------------------------------------------------ APEX MAIN : 1 0.452 0.452 100.000 OpenACC compute construct parallel : 200 0.001 0.215 47.492 OpenACC device init (implicit) parallel : 1 0.081 0.081 17.965 OpenACC enqueue data transfer (HtoD) (implicit) p... : 200 0.000 0.002 0.523 OpenACC enqueue launch: main$_omp_fn$0 (implicit)... : 100 0.000 0.001 0.288 OpenACC enqueue launch: main$_omp_fn$1 (implicit)... 
: 100 0.000 0.001 0.267 OpenACC enter data (implicit) parallel : 200 0.000 0.002 0.491 OpenACC enter data data : 1 0.000 0.000 0.078 OpenACC exit data (implicit) parallel : 200 0.000 0.003 0.733 OpenACC exit data data : 1 0.000 0.000 0.043 APEX Idle : 0.145 32.120 ------------------------------------------------------------------------------------------------ Total timers : 1003 CUDA \u00b6 Configuring APEX for CUDA support \u00b6 Enabling CUDA support in APEX requires the -DAPEX_WITH_CUDA=TRUE flag and the -DCUDA_ROOT=/path/to/cuda CMake variables at configuration time. CMake will look for the CUPTI and NVML libraries in the installation, and if found the support will be enabled. cmake -DCMAKE_C_COMPILER=`which gcc` -DCMAKE_CXX_COMPILER=`which g++` -DCMAKE_BUILD_TYPE=Release -DCMAKE_INSTALL_PREFIX=../install -DBUILD_TESTS=TRUE -DUSE_BFD=TRUE -DAPEX_WITH_CUDA=TRUE -DCUDA_ROOT=/usr/local/packages/cuda/10.2 -DBFD_ROOT=/usr/local/packages/binutils/2.34 .. Running CUDA programs with APEX \u00b6 Enabling CUDA support only requires using the apex_exec wrapper script. 
[khuck@gorgon apex]$ ./install/bin/apex_exec --apex:screen --apex:taskgraph --apex:cuda ./build/src/unit_tests/CUDA/apex_cuda_cu Program to run : ./build/src/unit_tests/CUDA/apex_cuda_cu On device: name=hello, value=10 On device: name=dello, value=11 On device: name=dello, value=12 On device: name=dello, value=13 On host: name=dello, value=14 Elapsed time: 0.410402 seconds Cores detected: 160 Worker Threads observed: 1 Available CPU time: 0.410402 seconds Counter : #samples | minimum | mean | maximum | stddev ------------------------------------------------------------------------------------------------ Device 0 GPU Clock Memory (MHz) : 1 877.000 877.000 877.000 0.000 Device 0 GPU Clock SM (MHz) : 1 135.000 135.000 135.000 0.000 Device 0 GPU Memory Free (MB) : 1 3.41e+04 3.41e+04 3.41e+04 0.000 Device 0 GPU Memory Used (MB) : 1 0.197 0.197 0.197 0.000 Device 0 GPU Memory Utilization % : 1 0.000 0.000 0.000 0.000 Device 0 GPU NvLink Link Count : 1 6.000 6.000 6.000 0.000 Device 0 GPU NvLink Speed MB/s : 1 2.58e+04 2.58e+04 2.58e+04 0.000 Device 0 GPU NvLink Utilization C0 : 1 0.000 0.000 0.000 0.000 Device 0 GPU NvLink Utilization C1 : 1 0.000 0.000 0.000 0.000 Device 0 GPU Power (W) : 1 38.912 38.912 38.912 0.000 Device 0 GPU Temperature (C) : 1 33.000 33.000 33.000 0.000 Device 0 GPU Utilization % : 1 0.000 0.000 0.000 0.000 Device 0 PCIe RX Throughput (MB/s) : 1 1.000 1.000 1.000 0.000 Device 0 PCIe TX Throughput (MB/s) : 1 3.000 3.000 3.000 0.000 GPU: Bytes Allocated : 2 6.000 11.000 16.000 5.000 status:Threads : 1 4.000 4.000 4.000 0.000 status:VmData : 1 5.72e+04 5.72e+04 5.72e+04 0.000 status:VmExe : 1 64.000 64.000 64.000 0.000 status:VmHWM : 1 1.77e+04 1.77e+04 1.77e+04 0.000 status:VmLck : 1 0.000 0.000 0.000 0.000 status:VmLib : 1 6.92e+04 6.92e+04 6.92e+04 0.000 status:VmPMD : 1 12.000 12.000 12.000 0.000 status:VmPTE : 1 7.000 7.000 7.000 0.000 status:VmPeak : 1 2.58e+05 2.58e+05 2.58e+05 0.000 status:VmPin : 1 0.000 0.000 0.000 0.000 status:VmRSS : 1 
1.77e+04 1.77e+04 1.77e+04 0.000 status:VmSize : 1 1.93e+05 1.93e+05 1.93e+05 0.000 status:VmStk : 1 192.000 192.000 192.000 0.000 status:VmSwap : 1 0.000 0.000 0.000 0.000 status:nonvoluntary_ctxt_switches : 1 0.000 0.000 0.000 0.000 status:voluntary_ctxt_switches : 1 102.000 102.000 102.000 0.000 ------------------------------------------------------------------------------------------------ Timer : #calls | mean | total | % total ------------------------------------------------------------------------------------------------ APEX MAIN : 1 0.410 0.410 100.000 GPU: Unified Memory copy DTOH : 1 0.000 0.000 0.001 GPU: Unified Memory copy HTOD : 1 0.000 0.000 0.001 GPU: Kernel(DataElement*) : 4 0.000 0.000 0.084 cudaDeviceSynchronize : 4 0.000 0.000 0.092 cudaFree : 2 0.000 0.000 0.045 cudaLaunchKernel : 4 0.000 0.000 0.007 cudaMallocManaged : 2 0.104 0.208 50.601 launch [/home/users/khuck/src/apex/src/unit_tests... : 4 0.001 0.003 0.798 APEX Idle : 0.199 48.371 ------------------------------------------------------------------------------------------------ Total timers : 22 To get additional information you can also enable the --apex:cuda_driver flag to see CUDA driver API calls, or enable the --apex:cuda_counters flag to enable CUDA counters. 
[khuck@gorgon apex]$ ./install/bin/apex_exec --apex:screen --apex:taskgraph --apex:cuda --apex:cuda_counters --apex:cuda_driver ./build/src/unit_tests/CUDA/apex_cuda_cu Program to run : ./build/src/unit_tests/CUDA/apex_cuda_cu On device: name=hello, value=10 On device: name=dello, value=11 On device: name=dello, value=12 On device: name=dello, value=13 On host: name=dello, value=14 Elapsed time: 0.309145 seconds Cores detected: 160 Worker Threads observed: 1 Available CPU time: 0.309145 seconds Counter : #samples | minimum | mean | maximum | stddev ------------------------------------------------------------------------------------------------ Device 0 GPU Clock Memory (MHz) : 1 877.000 877.000 877.000 0.000 Device 0 GPU Clock SM (MHz) : 1 135.000 135.000 135.000 0.000 Device 0 GPU Memory Free (MB) : 1 3.41e+04 3.41e+04 3.41e+04 0.000 Device 0 GPU Memory Used (MB) : 1 0.197 0.197 0.197 0.000 Device 0 GPU Memory Utilization % : 1 0.000 0.000 0.000 0.000 Device 0 GPU NvLink Link Count : 1 6.000 6.000 6.000 0.000 Device 0 GPU NvLink Speed MB/s : 1 2.58e+04 2.58e+04 2.58e+04 0.000 Device 0 GPU NvLink Utilization C0 : 1 0.000 0.000 0.000 0.000 Device 0 GPU NvLink Utilization C1 : 1 0.000 0.000 0.000 0.000 Device 0 GPU Power (W) : 1 38.912 38.912 38.912 0.000 Device 0 GPU Temperature (C) : 1 33.000 33.000 33.000 0.000 Device 0 GPU Utilization % : 1 0.000 0.000 0.000 0.000 Device 0 PCIe RX Throughput (MB/s) : 1 2.000 2.000 2.000 0.000 Device 0 PCIe TX Throughput (MB/s) : 1 3.000 3.000 3.000 0.000 GPU: Bandwith (GB/s) <- Unified Memory... : 1 18.618 18.618 18.618 0.000 GPU: Bandwith (GB/s) <- Unified Memory... 
: 1 11.770 11.770 11.770 0.000 GPU: Bytes <- Unified Memory copy DTOH : 1 6.55e+04 6.55e+04 6.55e+04 0.000 GPU: Bytes <- Unified Memory copy HTOD : 1 6.55e+04 6.55e+04 6.55e+04 0.000 GPU: Bytes Allocated : 3 0.000 7.333 16.000 6.600 GPU: Dynamic Shared Memory (B) : 4 0.000 0.000 0.000 0.000 GPU: Local Memory Per Thread (B) : 4 0.000 0.000 0.000 0.000 GPU: Local Memory Total (B) : 4 1.36e+08 1.36e+08 1.36e+08 0.000 GPU: Registers Per Thread : 4 32.000 32.000 32.000 0.000 GPU: Shared Memory Size (B) : 4 0.000 0.000 0.000 0.000 GPU: Static Shared Memory (B) : 4 0.000 0.000 0.000 0.000 Unified Memory CPU Page Fault Count : 2 1.000 1.000 1.000 0.000 Unified Memory GPU Page Fault Groups : 1 1.000 1.000 1.000 0.000 status:Threads : 1 4.000 4.000 4.000 0.000 status:VmData : 1 5.69e+04 5.69e+04 5.69e+04 0.000 status:VmExe : 1 64.000 64.000 64.000 0.000 status:VmHWM : 1 1.70e+04 1.70e+04 1.70e+04 0.000 status:VmLck : 1 0.000 0.000 0.000 0.000 status:VmLib : 1 6.92e+04 6.92e+04 6.92e+04 0.000 status:VmPMD : 1 16.000 16.000 16.000 0.000 status:VmPTE : 1 7.000 7.000 7.000 0.000 status:VmPeak : 1 2.58e+05 2.58e+05 2.58e+05 0.000 status:VmPin : 1 0.000 0.000 0.000 0.000 status:VmRSS : 1 1.70e+04 1.70e+04 1.70e+04 0.000 status:VmSize : 1 1.93e+05 1.93e+05 1.93e+05 0.000 status:VmStk : 1 192.000 192.000 192.000 0.000 status:VmSwap : 1 0.000 0.000 0.000 0.000 status:nonvoluntary_ctxt_switches : 1 0.000 0.000 0.000 0.000 status:voluntary_ctxt_switches : 1 100.000 100.000 100.000 0.000 ------------------------------------------------------------------------------------------------ Timer : #calls | mean | total | % total ------------------------------------------------------------------------------------------------ APEX MAIN : 1 0.309 0.309 100.000 GPU: Unified Memory copy DTOH : 1 0.000 0.000 0.001 GPU: Unified Memory copy HTOD : 1 0.000 0.000 0.002 GPU: Kernel(DataElement*) : 4 0.000 0.001 0.353 cuCtxGetCurrent : 2 0.000 0.000 0.002 cuCtxGetDevice : 1 0.000 0.000 0.001 
cuCtxSetCurrent : 1 0.000 0.000 0.001 cuCtxSynchronize : 4 0.000 0.001 0.349 cuDeviceGet : 4 0.000 0.000 0.002 cuDeviceGetAttribute : 376 0.000 0.002 0.754 cuDeviceGetCount : 1 0.000 0.000 0.008 cuDeviceGetName : 4 0.000 0.000 0.046 cuDeviceGetUuid : 4 0.000 0.000 0.002 cuDevicePrimaryCtxRetain : 1 0.111 0.111 35.773 cuDeviceTotalMem_v2 : 4 0.002 0.006 2.022 cuLaunchKernel : 4 0.000 0.000 0.005 cuMemAllocManaged : 2 0.012 0.024 7.743 cuMemFree_v2 : 2 0.000 0.000 0.051 cuModuleGetFunction : 1 0.000 0.000 0.005 cudaDeviceSynchronize : 4 0.000 0.001 0.361 cudaFree : 2 0.000 0.000 0.057 cudaLaunchKernel : 4 0.000 0.000 0.051 cudaMallocManaged : 2 0.060 0.120 38.773 launch [/home/users/khuck/src/apex/src/unit_tests... : 4 0.000 0.001 0.442 APEX Idle : 0.041 13.195 ------------------------------------------------------------------------------------------------ Total timers : 433 The following flags will enable different types of CUDA support: --apex:cuda enable CUDA/CUPTI measurement (default: off) --apex:cuda-counters enable CUDA/CUPTI counter support (default: off) --apex:cuda-driver enable CUDA driver API callbacks (default: off) --apex:cuda-details enable per-kernel statistics where available (default: off) --apex:monitor-gpu enable GPU monitoring services (CUDA NVML, ROCm SMI) HIP/ROCm \u00b6 APEX supports HIP measurement using the Roc* libraries provided by AMD. Configuring APEX for HIP support \u00b6 Enabling HIP support in APEX requires the -DAPEX_WITH_HIP=TRUE and -DROCM_ROOT=/path/to/rocm CMake variables at configuration time. CMake will look for the profile/trace and smi libraries in the installation, and if they are found, HIP support will be enabled. cmake -B build -DCMAKE_C_COMPILER=`which clang` -DCMAKE_CXX_COMPILER=`which hipcc` -DCMAKE_BUILD_TYPE=Release -DCMAKE_INSTALL_PREFIX=./install -DBUILD_TESTS=TRUE -DUSE_BFD=TRUE -DAPEX_WITH_HIP=TRUE -DROCM_ROOT=/opt/rocm-5.7.1 -DBFD_ROOT=/usr/local/packages/binutils/2.34 .. 
Running HIP programs with APEX \u00b6 Enabling HIP support only requires using the apex_exec wrapper script. The following flags will enable additional support: --apex:hip enable HIP/ROCTracer measurement (default: off) --apex:hip-metrics enable HIP/ROCProfiler metric support (default: off) --apex:hip-counters enable HIP/ROCTracer counter support (default: off) --apex:hip-driver enable HIP/ROCTracer KSA driver API callbacks (default: off) --apex:hip-details enable per-kernel statistics where available (default: off) --apex:monitor-gpu enable GPU monitoring services (CUDA NVML, ROCm SMI) Kokkos \u00b6 Configuring APEX for Kokkos support \u00b6 Like OpenACC, nothing special needs to be done to enable Kokkos support. Running Kokkos programs with APEX \u00b6 Enabling Kokkos support requires either setting the KOKKOS_PROFILE_LIBRARY environment variable to the path to libapex.so , or using the apex_exec script with the --apex:kokkos flag. We also recommend using the --apex:kokkos-fence option, which times the full kernel execution rather than just the kernel launch when the back-end activity is not measured by some other method (OMPT, CUDA, HIP, SYCL, OpenACC). APEX also has experimental autotuning support for Kokkos kernels, see https://github.com/UO-OACISS/apex/wiki/Using-APEX-with-Kokkos#autotuning-support . Configuring APEX for RAJA support \u00b6 Like OpenACC, nothing special needs to be done to enable RAJA support. Running RAJA programs with APEX \u00b6 Enabling RAJA support requires either setting the RAJA_PLUGINS environment variable to the path to libapex.so , or using the apex_exec script with the --apex:raja flag. 
The following flags will enable different types of Kokkos support: --apex:kokkos enable Kokkos support --apex:kokkos-tuning enable Kokkos runtime autotuning support --apex:kokkos-fence enable Kokkos fences for async kernels C++ Threads \u00b6 APEX supports C++ threads on Linux, with the assumption that they are implemented on top of POSIX threads. Configuring APEX for C++ Thread support \u00b6 Nothing special needs to be done to enable C++ thread support. Running C++ Thread programs with APEX \u00b6 Enabling C++ Thread support requires using the apex_exec script with the --apex:pthread flag. This flag preloads a wrapper library that intercepts pthread_create() calls. A sample program with C++ threads is in the APEX unit tests: khuck@Kevins-MacBook-Air build % ../install/bin/apex_exec --apex:pthread src/unit_tests/C++/apex_fibonacci_std_async_cpp Program to run : src/unit_tests/C++/apex_fibonacci_std_async_cpp usage: apex_fibonacci_std_async_cpp Using default value of 10 fib of 10 is 55 (valid value: 55) Elapsed time: 0.005359 seconds Cores detected: 8 Worker Threads observed: 178 Available CPU time: 0.042872 seconds Timer : #calls | mean | total | % total ------------------------------------------------------------------------------------------------ fib(int, std::__1::shared_ptr) : 177 0.001 0.171 --n/a-- APEX MAIN : 1 0.005 0.005 100.000 ------------------------------------------------------------------------------------------------ Total timers : 177 Note that APEX detected 178 total OS threads. That is because some C++ thread implementations (GCC, Clang, others) implement every std::async() call as a new OS thread, resulting in a pthread_create() call. Other Runtime Systems \u00b6 We are currently evaluating support for TBB, OpenCL, SYCL/DPC++/OneAPI, among others. Performance Measurement Features \u00b6 For all the following examples, we will use a simple CUDA program that is in the APEX unit tests. 
Profiling \u00b6 Profiling with APEX is the usual and most simple mode of operation. In order to profile an application and get a report at the end of execution, enable screen output (see Environment Variables for details) and run an application linked with the APEX library or with the apex_exec --apex:screen flag (enabled by default). The output should look like examples shown previously. [khuck@cyclops apex]$ export APEX_SCREEN_OUTPUT=1 [khuck@cyclops apex]$ ./build/src/unit_tests/CUDA/apex_cuda_cu Found 4 total devices On device: name=hello, value=10 On device: name=dello, value=11 On device: name=dello, value=12 On device: name=dello, value=13 On host: name=dello, value=14 Elapsed time: 0.46147 seconds Cores detected: 160 Worker Threads observed: 1 Available CPU time: 0.46147 seconds Counter : #samples | minimum | mean | maximum | stddev ------------------------------------------------------------------------------------------------ 1 Minute Load average : 1 13.320 13.320 13.320 0.000 Device 0 GPU Clock Memory (MHz) : 1 877.000 877.000 877.000 0.000 Device 0 GPU Clock SM (MHz) : 1 1530.000 1530.000 1530.000 0.000 Device 0 GPU Memory Free (MB) : 1 1.34e+04 1.34e+04 1.34e+04 0.000 Device 0 GPU Memory Used (MB) : 1 2.07e+04 2.07e+04 2.07e+04 0.000 Device 0 GPU Memory Utilization % : 1 48.000 48.000 48.000 0.000 Device 0 GPU NvLink Link Count : 1 6.000 6.000 6.000 0.000 Device 0 GPU NvLink Speed MB/s : 1 2.58e+04 2.58e+04 2.58e+04 0.000 Device 0 GPU NvLink Utilization C0 : 1 0.000 0.000 0.000 0.000 Device 0 GPU NvLink Utilization C1 : 1 0.000 0.000 0.000 0.000 Device 0 GPU Power (W) : 1 240.573 240.573 240.573 0.000 Device 0 GPU Temperature (C) : 1 73.000 73.000 73.000 0.000 Device 0 GPU Utilization % : 1 95.000 95.000 95.000 0.000 Device 0 PCIe RX Throughput (MB/s) : 1 5.000 5.000 5.000 0.000 Device 0 PCIe TX Throughput (MB/s) : 1 0.000 0.000 0.000 0.000 GPU: Bytes Allocated : 2 6.000 11.000 16.000 5.000 status:Threads : 1 7.000 7.000 7.000 0.000 status:VmData : 1 
2.77e+05 2.77e+05 2.77e+05 0.000 status:VmExe : 1 64.000 64.000 64.000 0.000 status:VmHWM : 1 2.19e+05 2.19e+05 2.19e+05 0.000 status:VmLck : 1 0.000 0.000 0.000 0.000 status:VmLib : 1 8.74e+04 8.74e+04 8.74e+04 0.000 status:VmPMD : 1 16.000 16.000 16.000 0.000 status:VmPTE : 1 35.000 35.000 35.000 0.000 status:VmPeak : 1 7.17e+05 7.17e+05 7.17e+05 0.000 status:VmPin : 1 1.67e+05 1.67e+05 1.67e+05 0.000 status:VmRSS : 1 2.19e+05 2.19e+05 2.19e+05 0.000 status:VmSize : 1 6.52e+05 6.52e+05 6.52e+05 0.000 status:VmStk : 1 192.000 192.000 192.000 0.000 status:VmSwap : 1 0.000 0.000 0.000 0.000 status:nonvoluntary_ctxt_switches : 1 9.000 9.000 9.000 0.000 status:voluntary_ctxt_switches : 1 1331.000 1331.000 1331.000 0.000 ------------------------------------------------------------------------------------------------ Timer : #calls | mean | total | % total ------------------------------------------------------------------------------------------------ APEX MAIN : 1 0.461 0.461 100.000 GPU: Unified Memcpy DTOH : 1 0.000 0.000 0.001 GPU: Unified Memcpy HTOD : 1 0.000 0.000 0.001 GPU: Kernel(DataElement*) : 4 0.000 0.000 0.086 cudaDeviceSynchronize : 4 0.000 0.001 0.169 cudaFree : 2 0.000 0.000 0.052 cudaLaunchKernel : 4 0.000 0.000 0.021 cudaMallocManaged : 2 0.135 0.269 58.397 launch [/home/users/khuck/src/apex/src/unit_tests... : 4 0.028 0.110 23.870 APEX Idle : 0.080 17.403 ------------------------------------------------------------------------------------------------ Total timers : 22 Profiling with CSV output \u00b6 To enable CSV output, use one of the methods described in the Environment Variables page, and run as the previous example. 
[khuck@cyclops apex]$ export APEX_CSV_OUTPUT=1 [khuck@cyclops apex]$ ./build/src/unit_tests/CUDA/apex_cuda_cu Found 4 total devices On device: name=hello, value=10 On device: name=dello, value=11 On device: name=dello, value=12 On device: name=dello, value=13 On host: name=dello, value=14 [khuck@cyclops apex]$ cat apex.0.csv \"counter\",\"num samples\",\"minimum\",\"mean\"\"maximum\",\"stddev\" \"1 Minute Load average\",1,22,22,22,0 \"Device 0 GPU Clock Memory (MHz)\",1,877,877,877,0 \"Device 0 GPU Clock SM (MHz)\",1,1530,1530,1530,0 \"Device 0 GPU Memory Free (MB)\",1,13411,13411,13411,0 \"Device 0 GPU Memory Used (MB)\",1,20679,20679,20679,0 \"Device 0 GPU Memory Utilization %\",1,58,58,58,0 \"Device 0 GPU NvLink Link Count\",1,6,6,6,0 \"Device 0 GPU NvLink Speed MB/s\",1,25781,25781,25781,0 \"Device 0 GPU NvLink Utilization C0\",1,0,0,0,0 \"Device 0 GPU NvLink Utilization C1\",1,0,0,0,0 \"Device 0 GPU Power (W)\",1,255,255,255,0 \"Device 0 GPU Temperature (C)\",1,75,75,75,0 \"Device 0 GPU Utilization %\",1,99,99,99,0 \"Device 0 PCIe RX Throughput (MB/s)\",1,7,7,7,0 \"Device 0 PCIe TX Throughput (MB/s)\",1,2,2,2,0 \"GPU: Bytes Allocated\",2,6,11,16,5 \"status:Threads\",1,7,7,7,0 \"status:VmData\",1,277120,277120,277120,0 \"status:VmExe\",1,64,64,64,0 \"status:VmHWM\",1,219008,219008,219008,0 \"status:VmLck\",1,0,0,0,0 \"status:VmLib\",1,87424,87424,87424,0 \"status:VmPMD\",1,16,16,16,0 \"status:VmPTE\",1,36,36,36,0 \"status:VmPeak\",1,717248,717248,717248,0 \"status:VmPin\",1,166528,166528,166528,0 \"status:VmRSS\",1,219008,219008,219008,0 \"status:VmSize\",1,652032,652032,652032,0 \"status:VmStk\",1,192,192,192,0 \"status:VmSwap\",1,0,0,0,0 \"status:nonvoluntary_ctxt_switches\",1,8,8,8,0 \"status:voluntary_ctxt_switches\",1,1276,1276,1276,0 \"task\",\"num calls\",\"total cycles\",\"total microseconds\" \"APEX MAIN\",1,0,431162 \"GPU: Unified Memcpy DTOH\",1,0,3 \"GPU: Unified Memcpy HTOD\",1,0,4 \"GPU: Kernel(DataElement*)\",4,0,1082 
\"cudaDeviceSynchronize\",4,0,9993 \"cudaFree\",2,0,172 \"cudaLaunchKernel\",4,0,66 \"cudaMallocManaged\",2,0,194367 \"launch [/home/users/khuck/src/apex/src/unit_tests/CUDA/apex_cuda.cu:35]\",4,0,164490 Profiling with TAU profile output \u00b6 To enable TAU profile output, use one of the methods described in the Environment Variables page, and run as the previous example. The output can be summarized with the TAU pprof command, which is installed with the TAU software. [khuck@cyclops apex]$ export APEX_CSV_OUTPUT=1 [khuck@cyclops apex]$ ./build/src/unit_tests/CUDA/apex_cuda_cu Found 4 total devices On device: name=hello, value=10 On device: name=dello, value=11 On device: name=dello, value=12 On device: name=dello, value=13 On host: name=dello, value=14 [khuck@cyclops apex]$ cat profile.0.0.0 9 templated_functions_MULTI_TIME # Name Calls Subrs Excl Incl ProfileCalls # \"GPU: Unified Memcpy DTOH\" 1 0 2.656 2.656 0 GROUP=\"TAU_USER\" \"cudaFree\" 2 0 193.18 193.18 0 GROUP=\"TAU_USER\" \"cudaMallocManaged\" 2 0 184435 184435 0 GROUP=\"TAU_USER\" \"GPU: Unified Memcpy HTOD\" 1 0 4.64 4.64 0 GROUP=\"TAU_USER\" \"GPU: Kernel(DataElement*)\" 4 0 355.293 355.293 0 GROUP=\"TAU_USER\" \"cudaLaunchKernel\" 4 0 67.4 67.4 0 GROUP=\"TAU_USER\" \"cudaDeviceSynchronize\" 4 0 811.244 811.244 0 GROUP=\"TAU_USER\" \"launch [/home/users/khuck/src/apex/src/unit_tests/CUDA/apex_cuda.cu:35]\" 4 0 100327 100327 0 GROUP=\"TAU_USER\" \"APEX MAIN\" 1 0 67830.2 354026 0 GROUP=\"TAU_USER\" 0 aggregates 32 userevents # eventname numevents max min mean sumsqr \"status:VmSwap\" 1 0 0 0 0 \"status:VmSize\" 1 652032 652032 652032 4.25146e+11 \"status:Threads\" 1 7 7 7 49 \"status:VmPeak\" 1 717248 717248 717248 5.14445e+11 \"Device 0 GPU Power (W)\" 1 224.057 224.057 224.057 50201.5 \"Device 0 GPU NvLink Speed MB/s\" 1 25781 25781 25781 6.6466e+08 \"status:VmExe\" 1 64 64 64 4096 \"status:nonvoluntary_ctxt_switches\" 1 12 12 12 144 \"Device 0 GPU Memory Utilization %\" 1 73 73 73 5329 
\"status:VmStk\" 1 192 192 192 36864 \"status:VmData\" 1 277120 277120 277120 7.67955e+10 \"status:VmLck\" 1 0 0 0 0 \"status:VmPin\" 1 166528 166528 166528 2.77316e+10 \"status:VmPTE\" 1 35 35 35 1225 \"Device 0 GPU NvLink Utilization C1\" 1 0 0 0 0 \"status:VmHWM\" 1 219008 219008 219008 4.79645e+10 \"status:VmRSS\" 1 219008 219008 219008 4.79645e+10 \"GPU: Bytes Allocated\" 2 16 6 11 292 \"status:VmLib\" 1 87424 87424 87424 7.64296e+09 \"Device 0 GPU Utilization %\" 1 99 99 99 9801 \"status:voluntary_ctxt_switches\" 1 1320 1320 1320 1.7424e+06 \"Device 0 GPU Clock SM (MHz)\" 1 1530 1530 1530 2.3409e+06 \"status:VmPMD\" 1 20 20 20 400 \"1 Minute Load average\" 1 16.43 16.43 16.43 269.945 \"Device 0 GPU Clock Memory (MHz)\" 1 877 877 877 769129 \"Device 0 PCIe TX Throughput (MB/s)\" 1 2 2 2 4 \"Device 0 GPU Temperature (C)\" 1 73 73 73 5329 \"Device 0 PCIe RX Throughput (MB/s)\" 1 6 6 6 36 \"Device 0 GPU Memory Used (MB)\" 1 20679.1 20679.1 20679.1 4.27625e+08 \"Device 0 GPU NvLink Utilization C0\" 1 0 0 0 0 \"Device 0 GPU NvLink Link Count\" 1 6 6 6 36 \"Device 0 GPU Memory Free (MB)\" 1 13410.6 13410.6 13410.6 1.79845e+08 [khuck@cyclops apex]$ which pprof ~/src/tau2/ibm64linux/bin/pprof [khuck@cyclops apex]$ pprof Reading Profile files in profile.* NODE 0;CONTEXT 0;THREAD 0: --------------------------------------------------------------------------------------- %Time Exclusive Inclusive #Call #Subrs Inclusive Name msec total msec usec/call --------------------------------------------------------------------------------------- 100.0 67 354 1 0 354026 APEX MAIN 52.1 184 184 2 0 92218 cudaMallocManaged 28.3 100 100 4 0 25082 launch [/home/users/khuck/src/apex/src/unit_tests/CUDA/apex_cuda.cu:35] 0.2 0.811 0.811 4 0 203 cudaDeviceSynchronize 0.1 0.355 0.355 4 0 89 GPU: Kernel(DataElement*) 0.1 0.193 0.193 2 0 97 cudaFree 0.0 0.0674 0.0674 4 0 17 cudaLaunchKernel 0.0 0.00464 0.00464 1 0 5 GPU: Unified Memcpy HTOD 0.0 0.00266 0.00266 1 0 3 GPU: Unified Memcpy DTOH 
--------------------------------------------------------------------------------------- USER EVENTS Profile :NODE 0, CONTEXT 0, THREAD 0 --------------------------------------------------------------------------------------- NumSamples MaxValue MinValue MeanValue Std. Dev. Event Name --------------------------------------------------------------------------------------- 1 16.43 16.43 16.43 0.01 1 Minute Load average 1 877 877 877 0 Device 0 GPU Clock Memory (MHz) 1 1530 1530 1530 0 Device 0 GPU Clock SM (MHz) 1 1.341E+04 1.341E+04 1.341E+04 28.42 Device 0 GPU Memory Free (MB) 1 2.068E+04 2.068E+04 2.068E+04 13.3 Device 0 GPU Memory Used (MB) 1 73 73 73 0 Device 0 GPU Memory Utilization % 1 6 6 6 0 Device 0 GPU NvLink Link Count 1 2.578E+04 2.578E+04 2.578E+04 6.245 Device 0 GPU NvLink Speed MB/s 1 0 0 0 0 Device 0 GPU NvLink Utilization C0 1 0 0 0 0 Device 0 GPU NvLink Utilization C1 1 224.1 224.1 224.1 0.1981 Device 0 GPU Power (W) 1 73 73 73 0 Device 0 GPU Temperature (C) 1 99 99 99 0 Device 0 GPU Utilization % 1 6 6 6 0 Device 0 PCIe RX Throughput (MB/s) 1 2 2 2 0 Device 0 PCIe TX Throughput (MB/s) 2 16 6 11 5 GPU: Bytes Allocated 1 7 7 7 0 status:Threads 1 2.771E+05 2.771E+05 2.771E+05 74.83 status:VmData 1 64 64 64 0 status:VmExe 1 2.19E+05 2.19E+05 2.19E+05 63.75 status:VmHWM 1 0 0 0 0 status:VmLck 1 8.742E+04 8.742E+04 8.742E+04 64.99 status:VmLib 1 20 20 20 0 status:VmPMD 1 35 35 35 0 status:VmPTE 1 7.172E+05 7.172E+05 7.172E+05 553.6 status:VmPeak 1 1.665E+05 1.665E+05 1.665E+05 158.8 status:VmPin 1 2.19E+05 2.19E+05 2.19E+05 63.75 status:VmRSS 1 6.52E+05 6.52E+05 6.52E+05 520.6 status:VmSize 1 192 192 192 0 status:VmStk 1 0 0 0 0 status:VmSwap 1 12 12 12 0 status:nonvoluntary_ctxt_switches 1 1320 1320 1320 0 status:voluntary_ctxt_switches --------------------------------------------------------------------------------------- Profiling with Taskgraph output \u00b6 APEX can capture the task dependency graph from the application, and output it as a GraphViz 
graph. The graph represents summarized task \"type\" dependencies, not a full dependency graph/tree with every task instance. [khuck@cyclops apex]$ apex_exec --apex:taskgraph --apex:cuda ./build/src/unit_tests/CUDA/apex_cuda_cu [khuck@cyclops apex]$ dot -Tpdf -O taskgraph.0.dot Profiling with Tasktree output \u00b6 APEX can capture the task dependency tree from the application, and output it as a GraphViz graph or ASCII. The graph represents summarized task \"type\" dependencies, not a full dependency graph/tree with every task instance. The difference between the graph and the tree is that in the tree, there are no cycles and child tasks have only one parent. [khuck@cyclops apex]$ apex_exec --apex:tasktree --apex:cuda ./build/src/unit_tests/CUDA/apex_cuda_cu [khuck@cyclops apex]$ apex-treesummary.py apex_tasktree.csv Profiling with Scatterplot output \u00b6 For this example, we are using an HPX quickstart example, the fibonacci example. After execution, APEX writes a sample data file to disk, apex_task_samples.csv . That file is post-processed with the APEX python script task_scatterplot.py . [khuck@cyclops apex]$ export APEX_TASK_SCATTERPLOT=1 [khuck@cyclops build]$ ./bin/fibonacci --n-value=20 [khuck@cyclops build]$ /home/users/khuck/src/apex/install/bin/task_scatterplot.py Parsed 2362 samples Plotting async_launch_policy_dispatch Plotting async_launch_policy_dispatch::call Plotting async Rendering... Profiling with OTF2 Trace output \u00b6 For this example, we are using an APEX unit test that computes the value of PI. OTF2 is the \"Open Trace Format v2\", used for tracing large scale HPC applications. For more information on OTF2 and associated tools, see The VI-HPS Score-P web site . Vampir is a commercial trace viewer that can be used to visualize and analyze OTF2 trace data. Traveler is an open source tool that can be used to visualize and analyze APEX OTF2 trace data. 
[khuck@cyclops apex]$ export APEX_OTF2=1 [khuck@cyclops apex]$ ./build/src/unit_tests/CUDA/pi_cu Found 4 total devices 134217728 num streams 4 making streams starting compute n is 0 num darts in circle 0: 105418094 pi is 3.141704 Closing OTF2 event files... Writing OTF2 definition files... Writing OTF2 Global definition file... Writing OTF2 Node information... Writing OTF2 Communicators... Closing the archive... done. [khuck@eagle apex]$ module load vampir [khuck@eagle apex]$ vampir OTF2_archive/APEX.otf2 Profiling with Google Trace Events Format output \u00b6 For this example, we are using an APEX unit test that computes the value of PI. Google Trace Events is a format developed by Google for tracing activity on devices, but is free and open and JSON based. For more information on Google Trace Events and associated tools, see the Google Trace Event Format document . The Google Chrome Web Browser can be used to visualize and analyze GTE trace data. [khuck@cyclops apex]$ export APEX_TRACE_EVENT=1 [khuck@cyclops apex]$ ./build/src/unit_tests/CUDA/pi_cu","title":"Usage"},{"location":"usage/#usage","text":"","title":"Usage"},{"location":"usage/#tutorial","text":"For an APEX tutorial, please see https://github.com/khuck/apex-tutorial .","title":"Tutorial"},{"location":"usage/#supported_runtime_systems","text":"","title":"Supported Runtime Systems"},{"location":"usage/#hpx_louisiana_state_university","text":"HPX (High Performance ParalleX) is the original implementation of the ParalleX model. Developed and maintained by the Ste||ar Group at Louisiana State University, HPX is implemented in C++. For more information, see http://stellar-group.org/projects/hpx/ . For a tutorial on HPX with APEX (presented at SC'15, Austin TX) see https://github.com/khuck/SC15_APEX_tutorial (somewhat outdated). APEX is configured and built as part of HPX. 
In fact, you don't even need to download it separately - it will be automatically checked out from GitHub as part of the HPX CMake configuration. However, you do need to pass the correct CMake options to the HPX configuration step.","title":"HPX (Louisiana State University)"},{"location":"usage/#configuring_hpx_with_apex","text":"See Installation with HPX .","title":"Configuring HPX with APEX"},{"location":"usage/#running_hpx_with_apex","text":"See APEX Quickstart .","title":"Running HPX with APEX"},{"location":"usage/#openmp","text":"The OpenMP API supports multi-platform shared-memory parallel programming in C/C++ and Fortran. The OpenMP API defines a portable, scalable model with a simple and flexible interface for developing parallel applications on platforms from the desktop to the supercomputer. For more information, see http://openmp.org/ .","title":"OpenMP"},{"location":"usage/#configuring_apex_for_openmp_ompt_support","text":"The CMake process will automatically detect whether your compiler has OpenMP support. If you configure APEX with -DUSE_OMPT=TRUE and have a compiler with full OpenMP 5.0 OMPT support, APEX will detect the support. If your compiler is GCC, Intel or Clang and does not have native OMPT support, APEX can build and use the open source LLVM OpenMP runtime as a drop-in replacement for the compiler's native runtime library, but this approach is deprecated and no longer recommended. APEX uses Binutils to resolve the OpenMP outlined regions from instruction addresses to human-readable names, so also configure APEX with -DUSE_BFD=TRUE (see Other CMake Settings ). The following example was configured and run with Intel 20 compilers. 
The CMake configuration for this example was: cmake -DCMAKE_C_COMPILER=`which icc` -DCMAKE_CXX_COMPILER=`which icpc` -DCMAKE_BUILD_TYPE=Release -DCMAKE_INSTALL_PREFIX=../install -DBUILD_TESTS=TRUE -DUSE_BFD=TRUE -DBFD_ROOT=/usr/local/packages/binutils/2.34 -DUSE_OMPT=TRUE ..","title":"Configuring APEX for OpenMP OMPT support"},{"location":"usage/#running_openmp_applications_with_apex","text":"Using the apex_exec wrapper script, execute the OpenMP program as normal: [khuck@delphi apex]$ ./install/bin/apex_exec --apex:screen --apex:taskgraph --apex:ompt build/src/unit_tests/C++/apex_openmp_cpp Program to run : build/src/unit_tests/C++/apex_openmp_cpp Initializing... No Sharing... Result: 2690568.772590 Elapsed time: 0.0398378 seconds Cores detected: 72 Worker Threads observed: 72 Available CPU time: 2.86832 seconds Counter : #samples | minimum | mean | maximum | stddev ------------------------------------------------------------------------------------------------ Iterations: OpenMP Work Loop: no_shari... : 71 1.05e+06 1.05e+06 1.05e+06 0.000 Iterations: OpenMP Work Loop: my_init(... : 144 1.05e+06 1.05e+06 1.05e+06 0.000 OpenMP Initial Thread : 1 1.000 1.000 1.000 0.000 OpenMP Worker Thread : 71 1.000 1.000 1.000 0.000 Single: OpenMP Work Single Executor: L... : 1 1.000 1.000 1.000 0.000 Single: OpenMP Work Single Executor: L... : 2 1.000 1.000 1.000 0.000 Single: OpenMP Work Single Other: L__Z... : 71 1.000 1.000 1.000 0.000 Single: OpenMP Work Single Other: L__Z... 
: 142 1.000 1.000 1.000 0.000 status:Threads : 1 3.000 3.000 3.000 0.000 status:VmData : 1 1.07e+05 1.07e+05 1.07e+05 0.000 status:VmExe : 1 20.000 20.000 20.000 0.000 status:VmHWM : 1 9356.000 9356.000 9356.000 0.000 status:VmLck : 1 0.000 0.000 0.000 0.000 status:VmLib : 1 4.39e+04 4.39e+04 4.39e+04 0.000 status:VmPTE : 1 128.000 128.000 128.000 0.000 status:VmPeak : 1 2.49e+05 2.49e+05 2.49e+05 0.000 status:VmPin : 1 0.000 0.000 0.000 0.000 status:VmRSS : 1 9356.000 9356.000 9356.000 0.000 status:VmSize : 1 1.84e+05 1.84e+05 1.84e+05 0.000 status:VmStk : 1 136.000 136.000 136.000 0.000 status:VmSwap : 1 0.000 0.000 0.000 0.000 status:nonvoluntary_ctxt_switches : 1 43.000 43.000 43.000 0.000 status:voluntary_ctxt_switches : 1 46.000 46.000 46.000 0.000 ------------------------------------------------------------------------------------------------ Timer : #calls | mean | total | % total ------------------------------------------------------------------------------------------------ APEX MAIN : 1 0.040 0.040 100.000 OpenMP Parallel Region: no_sharing(double*, doubl... : 1 0.006 0.006 0.211 OpenMP Parallel Region: my_init(double*) [{/home/... : 2 0.014 0.028 0.961 OpenMP Work Loop: no_sharing(double*, double*) [{... : 72 0.003 0.195 6.806 OpenMP Work Loop: my_init(double*) [{/home/users/... : 143 0.001 0.161 5.622 OpenMP Work Single Executor: L__Z10no_sharingPdS_... : 1 0.001 0.001 0.028 OpenMP Work Single Executor: L__Z7my_initPd_39__p... : 2 0.000 0.001 0.018 OpenMP Work Single Other: L__Z10no_sharingPdS__20... : 71 0.000 0.029 1.027 OpenMP Work Single Other: L__Z7my_initPd_39__par_... 
: 141 0.001 0.100 3.472 ------------------------------------------------------------------------------------------------ Total timers : 433 If GraphViz is installed on your system, the dot program will generate a taskgraph image based on the taskgraph.0.dot file that was generated by APEX:","title":"Running OpenMP applications with APEX"},{"location":"usage/#openacc","text":"","title":"OpenACC"},{"location":"usage/#configuring_apex_for_openacc_support","text":"Nothing special needs to be done to enable OpenACC support. If your compiler supports OpenACC (PGI, GCC 10+), then CMake will detect it and enable OpenACC support in APEX. In this example, APEX was configured with GCC 10.0.0: cmake -DCMAKE_C_COMPILER=`which gcc` -DCMAKE_CXX_COMPILER=`which g++` -DCMAKE_BUILD_TYPE=Release -DCMAKE_INSTALL_PREFIX=../install -DBUILD_TESTS=TRUE -DUSE_BFD=FALSE -DBFD_ROOT=/usr/local/packages/binutils/2.34 ..","title":"Configuring APEX for OpenACC support"},{"location":"usage/#running_openacc_programs_with_apex","text":"Enabling OpenACC support requires setting the ACC_PROFLIB environment variable with the path to libapex.so , or by using the apex_exec script with the --apex:openacc flag: [khuck@gorgon apex]$ ./install/bin/apex_exec --apex:screen --apex:taskgraph --apex:openacc ./build/src/unit_tests/C/apex_openacc Program to run : ./build/src/unit_tests/C/apex_openacc Jacobi relaxation Calculation: 128 x 128 mesh Device API: none Device type: default Device vendor: -1 Device API: CUDA Device type: nvidia Device vendor: -1 0, 0.250000 Elapsed time: 0.451705 seconds Cores detected: 160 Worker Threads observed: 1 Available CPU time: 0.451705 seconds Counter : #samples | minimum | mean | maximum | stddev ------------------------------------------------------------------------------------------------ OpenACC Gangs : 200 1.000 2560.500 5120.000 2559.500 OpenACC Vector Lanes : 200 32.000 32.000 32.000 0.000 OpenACC Workers : 200 1.000 1.000 1.000 0.000 OpenACC device alloc (implicit) 
parall... : 301 15.000 889.206 2.62e+05 1.51e+04 OpenACC device free (implicit) paralle... : 301 0.000 0.000 0.000 0.000 OpenACC enqueue data transfer (HtoD) (... : 200 16.000 20.000 24.000 4.000 status:Threads : 1 3.000 3.000 3.000 0.000 status:VmData : 1 1.81e+04 1.81e+04 1.81e+04 0.000 status:VmExe : 1 64.000 64.000 64.000 0.000 status:VmHWM : 1 4416.000 4416.000 4416.000 0.000 status:VmLck : 1 0.000 0.000 0.000 0.000 status:VmLib : 1 8640.000 8640.000 8640.000 0.000 status:VmPMD : 1 16.000 16.000 16.000 0.000 status:VmPTE : 1 3.000 3.000 3.000 0.000 status:VmPeak : 1 1.59e+05 1.59e+05 1.59e+05 0.000 status:VmPin : 1 0.000 0.000 0.000 0.000 status:VmRSS : 1 4416.000 4416.000 4416.000 0.000 status:VmSize : 1 9.34e+04 9.34e+04 9.34e+04 0.000 status:VmStk : 1 192.000 192.000 192.000 0.000 status:VmSwap : 1 0.000 0.000 0.000 0.000 status:nonvoluntary_ctxt_switches : 1 0.000 0.000 0.000 0.000 status:voluntary_ctxt_switches : 1 46.000 46.000 46.000 0.000 ------------------------------------------------------------------------------------------------ Timer : #calls | mean | total | % total ------------------------------------------------------------------------------------------------ APEX MAIN : 1 0.452 0.452 100.000 OpenACC compute construct parallel : 200 0.001 0.215 47.492 OpenACC device init (implicit) parallel : 1 0.081 0.081 17.965 OpenACC enqueue data transfer (HtoD) (implicit) p... : 200 0.000 0.002 0.523 OpenACC enqueue launch: main$_omp_fn$0 (implicit)... : 100 0.000 0.001 0.288 OpenACC enqueue launch: main$_omp_fn$1 (implicit)... 
: 100 0.000 0.001 0.267 OpenACC enter data (implicit) parallel : 200 0.000 0.002 0.491 OpenACC enter data data : 1 0.000 0.000 0.078 OpenACC exit data (implicit) parallel : 200 0.000 0.003 0.733 OpenACC exit data data : 1 0.000 0.000 0.043 APEX Idle : 0.145 32.120 ------------------------------------------------------------------------------------------------ Total timers : 1003","title":"Running OpenACC programs with APEX"},{"location":"usage/#cuda","text":"","title":"CUDA"},{"location":"usage/#configuring_apex_for_cuda_support","text":"Enabling CUDA support in APEX requires the -DAPEX_WITH_CUDA=TRUE flag and the -DCUDA_ROOT=/path/to/cuda CMake variables at configuration time. CMake will look for the CUPTI and NVML libraries in the installation, and if found the support will be enabled. cmake -DCMAKE_C_COMPILER=`which gcc` -DCMAKE_CXX_COMPILER=`which g++` -DCMAKE_BUILD_TYPE=Release -DCMAKE_INSTALL_PREFIX=../install -DBUILD_TESTS=TRUE -DUSE_BFD=TRUE -DAPEX_WITH_CUDA=TRUE -DCUDA_ROOT=/usr/local/packages/cuda/10.2 -DBFD_ROOT=/usr/local/packages/binutils/2.34 ..","title":"Configuring APEX for CUDA support"},{"location":"usage/#running_cuda_programs_with_apex","text":"Enabling CUDA support only requires using the apex_exec wrapper script. 
[khuck@gorgon apex]$ ./install/bin/apex_exec --apex:screen --apex:taskgraph --apex:cuda ./build/src/unit_tests/CUDA/apex_cuda_cu Program to run : ./build/src/unit_tests/CUDA/apex_cuda_cu On device: name=hello, value=10 On device: name=dello, value=11 On device: name=dello, value=12 On device: name=dello, value=13 On host: name=dello, value=14 Elapsed time: 0.410402 seconds Cores detected: 160 Worker Threads observed: 1 Available CPU time: 0.410402 seconds Counter : #samples | minimum | mean | maximum | stddev ------------------------------------------------------------------------------------------------ Device 0 GPU Clock Memory (MHz) : 1 877.000 877.000 877.000 0.000 Device 0 GPU Clock SM (MHz) : 1 135.000 135.000 135.000 0.000 Device 0 GPU Memory Free (MB) : 1 3.41e+04 3.41e+04 3.41e+04 0.000 Device 0 GPU Memory Used (MB) : 1 0.197 0.197 0.197 0.000 Device 0 GPU Memory Utilization % : 1 0.000 0.000 0.000 0.000 Device 0 GPU NvLink Link Count : 1 6.000 6.000 6.000 0.000 Device 0 GPU NvLink Speed MB/s : 1 2.58e+04 2.58e+04 2.58e+04 0.000 Device 0 GPU NvLink Utilization C0 : 1 0.000 0.000 0.000 0.000 Device 0 GPU NvLink Utilization C1 : 1 0.000 0.000 0.000 0.000 Device 0 GPU Power (W) : 1 38.912 38.912 38.912 0.000 Device 0 GPU Temperature (C) : 1 33.000 33.000 33.000 0.000 Device 0 GPU Utilization % : 1 0.000 0.000 0.000 0.000 Device 0 PCIe RX Throughput (MB/s) : 1 1.000 1.000 1.000 0.000 Device 0 PCIe TX Throughput (MB/s) : 1 3.000 3.000 3.000 0.000 GPU: Bytes Allocated : 2 6.000 11.000 16.000 5.000 status:Threads : 1 4.000 4.000 4.000 0.000 status:VmData : 1 5.72e+04 5.72e+04 5.72e+04 0.000 status:VmExe : 1 64.000 64.000 64.000 0.000 status:VmHWM : 1 1.77e+04 1.77e+04 1.77e+04 0.000 status:VmLck : 1 0.000 0.000 0.000 0.000 status:VmLib : 1 6.92e+04 6.92e+04 6.92e+04 0.000 status:VmPMD : 1 12.000 12.000 12.000 0.000 status:VmPTE : 1 7.000 7.000 7.000 0.000 status:VmPeak : 1 2.58e+05 2.58e+05 2.58e+05 0.000 status:VmPin : 1 0.000 0.000 0.000 0.000 status:VmRSS : 1 
1.77e+04 1.77e+04 1.77e+04 0.000 status:VmSize : 1 1.93e+05 1.93e+05 1.93e+05 0.000 status:VmStk : 1 192.000 192.000 192.000 0.000 status:VmSwap : 1 0.000 0.000 0.000 0.000 status:nonvoluntary_ctxt_switches : 1 0.000 0.000 0.000 0.000 status:voluntary_ctxt_switches : 1 102.000 102.000 102.000 0.000 ------------------------------------------------------------------------------------------------ Timer : #calls | mean | total | % total ------------------------------------------------------------------------------------------------ APEX MAIN : 1 0.410 0.410 100.000 GPU: Unified Memory copy DTOH : 1 0.000 0.000 0.001 GPU: Unified Memory copy HTOD : 1 0.000 0.000 0.001 GPU: Kernel(DataElement*) : 4 0.000 0.000 0.084 cudaDeviceSynchronize : 4 0.000 0.000 0.092 cudaFree : 2 0.000 0.000 0.045 cudaLaunchKernel : 4 0.000 0.000 0.007 cudaMallocManaged : 2 0.104 0.208 50.601 launch [/home/users/khuck/src/apex/src/unit_tests... : 4 0.001 0.003 0.798 APEX Idle : 0.199 48.371 ------------------------------------------------------------------------------------------------ Total timers : 22 To get additional information you can also enable the --apex:cuda_driver flag to see CUDA driver API calls, or enable the --apex:cuda_counters flag to enable CUDA counters. 
[khuck@gorgon apex]$ ./install/bin/apex_exec --apex:screen --apex:taskgraph --apex:cuda --apex:cuda_counters --apex:cuda_driver ./build/src/unit_tests/CUDA/apex_cuda_cu Program to run : ./build/src/unit_tests/CUDA/apex_cuda_cu On device: name=hello, value=10 On device: name=dello, value=11 On device: name=dello, value=12 On device: name=dello, value=13 On host: name=dello, value=14 Elapsed time: 0.309145 seconds Cores detected: 160 Worker Threads observed: 1 Available CPU time: 0.309145 seconds Counter : #samples | minimum | mean | maximum | stddev ------------------------------------------------------------------------------------------------ Device 0 GPU Clock Memory (MHz) : 1 877.000 877.000 877.000 0.000 Device 0 GPU Clock SM (MHz) : 1 135.000 135.000 135.000 0.000 Device 0 GPU Memory Free (MB) : 1 3.41e+04 3.41e+04 3.41e+04 0.000 Device 0 GPU Memory Used (MB) : 1 0.197 0.197 0.197 0.000 Device 0 GPU Memory Utilization % : 1 0.000 0.000 0.000 0.000 Device 0 GPU NvLink Link Count : 1 6.000 6.000 6.000 0.000 Device 0 GPU NvLink Speed MB/s : 1 2.58e+04 2.58e+04 2.58e+04 0.000 Device 0 GPU NvLink Utilization C0 : 1 0.000 0.000 0.000 0.000 Device 0 GPU NvLink Utilization C1 : 1 0.000 0.000 0.000 0.000 Device 0 GPU Power (W) : 1 38.912 38.912 38.912 0.000 Device 0 GPU Temperature (C) : 1 33.000 33.000 33.000 0.000 Device 0 GPU Utilization % : 1 0.000 0.000 0.000 0.000 Device 0 PCIe RX Throughput (MB/s) : 1 2.000 2.000 2.000 0.000 Device 0 PCIe TX Throughput (MB/s) : 1 3.000 3.000 3.000 0.000 GPU: Bandwith (GB/s) <- Unified Memory... : 1 18.618 18.618 18.618 0.000 GPU: Bandwith (GB/s) <- Unified Memory... 
: 1 11.770 11.770 11.770 0.000 GPU: Bytes <- Unified Memory copy DTOH : 1 6.55e+04 6.55e+04 6.55e+04 0.000 GPU: Bytes <- Unified Memory copy HTOD : 1 6.55e+04 6.55e+04 6.55e+04 0.000 GPU: Bytes Allocated : 3 0.000 7.333 16.000 6.600 GPU: Dynamic Shared Memory (B) : 4 0.000 0.000 0.000 0.000 GPU: Local Memory Per Thread (B) : 4 0.000 0.000 0.000 0.000 GPU: Local Memory Total (B) : 4 1.36e+08 1.36e+08 1.36e+08 0.000 GPU: Registers Per Thread : 4 32.000 32.000 32.000 0.000 GPU: Shared Memory Size (B) : 4 0.000 0.000 0.000 0.000 GPU: Static Shared Memory (B) : 4 0.000 0.000 0.000 0.000 Unified Memory CPU Page Fault Count : 2 1.000 1.000 1.000 0.000 Unified Memory GPU Page Fault Groups : 1 1.000 1.000 1.000 0.000 status:Threads : 1 4.000 4.000 4.000 0.000 status:VmData : 1 5.69e+04 5.69e+04 5.69e+04 0.000 status:VmExe : 1 64.000 64.000 64.000 0.000 status:VmHWM : 1 1.70e+04 1.70e+04 1.70e+04 0.000 status:VmLck : 1 0.000 0.000 0.000 0.000 status:VmLib : 1 6.92e+04 6.92e+04 6.92e+04 0.000 status:VmPMD : 1 16.000 16.000 16.000 0.000 status:VmPTE : 1 7.000 7.000 7.000 0.000 status:VmPeak : 1 2.58e+05 2.58e+05 2.58e+05 0.000 status:VmPin : 1 0.000 0.000 0.000 0.000 status:VmRSS : 1 1.70e+04 1.70e+04 1.70e+04 0.000 status:VmSize : 1 1.93e+05 1.93e+05 1.93e+05 0.000 status:VmStk : 1 192.000 192.000 192.000 0.000 status:VmSwap : 1 0.000 0.000 0.000 0.000 status:nonvoluntary_ctxt_switches : 1 0.000 0.000 0.000 0.000 status:voluntary_ctxt_switches : 1 100.000 100.000 100.000 0.000 ------------------------------------------------------------------------------------------------ Timer : #calls | mean | total | % total ------------------------------------------------------------------------------------------------ APEX MAIN : 1 0.309 0.309 100.000 GPU: Unified Memory copy DTOH : 1 0.000 0.000 0.001 GPU: Unified Memory copy HTOD : 1 0.000 0.000 0.002 GPU: Kernel(DataElement*) : 4 0.000 0.001 0.353 cuCtxGetCurrent : 2 0.000 0.000 0.002 cuCtxGetDevice : 1 0.000 0.000 0.001 
cuCtxSetCurrent : 1 0.000 0.000 0.001 cuCtxSynchronize : 4 0.000 0.001 0.349 cuDeviceGet : 4 0.000 0.000 0.002 cuDeviceGetAttribute : 376 0.000 0.002 0.754 cuDeviceGetCount : 1 0.000 0.000 0.008 cuDeviceGetName : 4 0.000 0.000 0.046 cuDeviceGetUuid : 4 0.000 0.000 0.002 cuDevicePrimaryCtxRetain : 1 0.111 0.111 35.773 cuDeviceTotalMem_v2 : 4 0.002 0.006 2.022 cuLaunchKernel : 4 0.000 0.000 0.005 cuMemAllocManaged : 2 0.012 0.024 7.743 cuMemFree_v2 : 2 0.000 0.000 0.051 cuModuleGetFunction : 1 0.000 0.000 0.005 cudaDeviceSynchronize : 4 0.000 0.001 0.361 cudaFree : 2 0.000 0.000 0.057 cudaLaunchKernel : 4 0.000 0.000 0.051 cudaMallocManaged : 2 0.060 0.120 38.773 launch [/home/users/khuck/src/apex/src/unit_tests... : 4 0.000 0.001 0.442 APEX Idle : 0.041 13.195 ------------------------------------------------------------------------------------------------ Total timers : 433 The following flags will enable different types of CUDA support: --apex:cuda enable CUDA/CUPTI measurement (default: off) --apex:cuda-counters enable CUDA/CUPTI counter support (default: off) --apex:cuda-driver enable CUDA driver API callbacks (default: off) --apex:cuda-details enable per-kernel statistics where available (default: off) --apex:monitor-gpu enable GPU monitoring services (CUDA NVML, ROCm SMI)","title":"Running CUDA programs with APEX"},{"location":"usage/#hiprocm","text":"APEX supports HIP measurement using the Roc* libraries provided by AMD.","title":"HIP/ROCm"},{"location":"usage/#configuring_apex_for_hip_support","text":"Enabling HIP support in APEX requires the -DAPEX_WITH_HIP=TRUE and -DROCM_ROOT=/path/to/rocm CMake variables at configuration time. CMake will look for the profile/trace and smi libraries in the installation and, if they are found, HIP support will be enabled. 
cmake -B build -DCMAKE_C_COMPILER=`which clang` -DCMAKE_CXX_COMPILER=`which hipcc` -DCMAKE_BUILD_TYPE=Release -DCMAKE_INSTALL_PREFIX=./install -DBUILD_TESTS=TRUE -DUSE_BFD=TRUE -DAPEX_WITH_HIP=TRUE -DROCM_ROOT=/opt/rocm-5.7.1 -DBFD_ROOT=/usr/local/packages/binutils/2.34 ..","title":"Configuring APEX for HIP support"},{"location":"usage/#running_hip_programs_with_apex","text":"Enabling HIP support only requires using the apex_exec wrapper script. The following flags will enable additional support: --apex:hip enable HIP/ROCTracer measurement (default: off) --apex:hip-metrics enable HIP/ROCProfiler metric support (default: off) --apex:hip-counters enable HIP/ROCTracer counter support (default: off) --apex:hip-driver enable HIP/ROCTracer KSA driver API callbacks (default: off) --apex:hip-details enable per-kernel statistics where available (default: off) --apex:monitor-gpu enable GPU monitoring services (CUDA NVML, ROCm SMI)","title":"Running HIP programs with APEX"},{"location":"usage/#kokkos","text":"","title":"Kokkos"},{"location":"usage/#configuring_apex_for_kokkos_support","text":"Like OpenACC, nothing special needs to be done to enable Kokkos support.","title":"Configuring APEX for Kokkos support"},{"location":"usage/#running_kokkos_programs_with_apex","text":"Enabling Kokkos support requires either setting the KOKKOS_PROFILE_LIBRARY environment variable to the path to libapex.so, or using the apex_exec script with the --apex:kokkos flag. We also recommend using the --apex:kokkos-fence option, which will time the full kernel execution, not just the time to launch a kernel, if the back-end activity is not measured by some other method (OMPT, CUDA, HIP, SYCL, OpenACC). 
APEX also has experimental autotuning support for Kokkos kernels; see https://github.com/UO-OACISS/apex/wiki/Using-APEX-with-Kokkos#autotuning-support .","title":"Running Kokkos programs with APEX"},{"location":"usage/#configuring_apex_for_raja_support","text":"Like OpenACC, nothing special needs to be done to enable RAJA support.","title":"Configuring APEX for RAJA support"},{"location":"usage/#running_raja_programs_with_apex","text":"Enabling RAJA support requires either setting the RAJA_PLUGINS environment variable to the path to libapex.so, or using the apex_exec script with the --apex:raja flag. The following flags will enable different types of Kokkos support: --apex:kokkos enable Kokkos support --apex:kokkos-tuning enable Kokkos runtime autotuning support --apex:kokkos-fence enable Kokkos fences for async kernels","title":"Running RAJA programs with APEX"},{"location":"usage/#c_threads","text":"APEX supports C++ threads on Linux, with the assumption that they are implemented on top of POSIX threads.","title":"C++ Threads"},{"location":"usage/#configuring_apex_for_c_thread_support","text":"Nothing special needs to be done to enable C++ thread support.","title":"Configuring APEX for C++ Thread support"},{"location":"usage/#running_c_thread_programs_with_apex","text":"Enabling C++ Thread support requires using the apex_exec script with the --apex:pthread flag. That will enable the preloading of a wrapper library to intercept pthread_create() calls. 
A sample program with C++ threads is in the APEX unit tests: khuck@Kevins-MacBook-Air build % ../install/bin/apex_exec --apex:pthread src/unit_tests/C++/apex_fibonacci_std_async_cpp Program to run : src/unit_tests/C++/apex_fibonacci_std_async_cpp usage: apex_fibonacci_std_async_cpp Using default value of 10 fib of 10 is 55 (valid value: 55) Elapsed time: 0.005359 seconds Cores detected: 8 Worker Threads observed: 178 Available CPU time: 0.042872 seconds Timer : #calls | mean | total | % total ------------------------------------------------------------------------------------------------ fib(int, std::__1::shared_ptr) : 177 0.001 0.171 --n/a-- APEX MAIN : 1 0.005 0.005 100.000 ------------------------------------------------------------------------------------------------ Total timers : 177 Note that APEX detected 178 total OS threads. That is because some C++ thread implementations (GCC, Clang, others) implement every std::async() call as a new OS thread, resulting in a pthread_create() call.","title":"Running C++ Thread programs with APEX"},{"location":"usage/#other_runtime_systems","text":"We are currently evaluating support for TBB, OpenCL, SYCL/DPC++/OneAPI, among others.","title":"Other Runtime Systems"},{"location":"usage/#performance_measurement_features","text":"For all the following examples, we will use a simple CUDA program that is in the APEX unit tests.","title":"Performance Measurement Features"},{"location":"usage/#profiling","text":"Profiling with APEX is the usual and simplest mode of operation. To profile an application and get a report at the end of execution, enable screen output (see Environment Variables for details) and run an application linked with the APEX library or with the apex_exec --apex:screen flag (enabled by default). The output should look like the examples shown previously. 
[khuck@cyclops apex]$ export APEX_SCREEN_OUTPUT=1 [khuck@cyclops apex]$ ./build/src/unit_tests/CUDA/apex_cuda_cu Found 4 total devices On device: name=hello, value=10 On device: name=dello, value=11 On device: name=dello, value=12 On device: name=dello, value=13 On host: name=dello, value=14 Elapsed time: 0.46147 seconds Cores detected: 160 Worker Threads observed: 1 Available CPU time: 0.46147 seconds Counter : #samples | minimum | mean | maximum | stddev ------------------------------------------------------------------------------------------------ 1 Minute Load average : 1 13.320 13.320 13.320 0.000 Device 0 GPU Clock Memory (MHz) : 1 877.000 877.000 877.000 0.000 Device 0 GPU Clock SM (MHz) : 1 1530.000 1530.000 1530.000 0.000 Device 0 GPU Memory Free (MB) : 1 1.34e+04 1.34e+04 1.34e+04 0.000 Device 0 GPU Memory Used (MB) : 1 2.07e+04 2.07e+04 2.07e+04 0.000 Device 0 GPU Memory Utilization % : 1 48.000 48.000 48.000 0.000 Device 0 GPU NvLink Link Count : 1 6.000 6.000 6.000 0.000 Device 0 GPU NvLink Speed MB/s : 1 2.58e+04 2.58e+04 2.58e+04 0.000 Device 0 GPU NvLink Utilization C0 : 1 0.000 0.000 0.000 0.000 Device 0 GPU NvLink Utilization C1 : 1 0.000 0.000 0.000 0.000 Device 0 GPU Power (W) : 1 240.573 240.573 240.573 0.000 Device 0 GPU Temperature (C) : 1 73.000 73.000 73.000 0.000 Device 0 GPU Utilization % : 1 95.000 95.000 95.000 0.000 Device 0 PCIe RX Throughput (MB/s) : 1 5.000 5.000 5.000 0.000 Device 0 PCIe TX Throughput (MB/s) : 1 0.000 0.000 0.000 0.000 GPU: Bytes Allocated : 2 6.000 11.000 16.000 5.000 status:Threads : 1 7.000 7.000 7.000 0.000 status:VmData : 1 2.77e+05 2.77e+05 2.77e+05 0.000 status:VmExe : 1 64.000 64.000 64.000 0.000 status:VmHWM : 1 2.19e+05 2.19e+05 2.19e+05 0.000 status:VmLck : 1 0.000 0.000 0.000 0.000 status:VmLib : 1 8.74e+04 8.74e+04 8.74e+04 0.000 status:VmPMD : 1 16.000 16.000 16.000 0.000 status:VmPTE : 1 35.000 35.000 35.000 0.000 status:VmPeak : 1 7.17e+05 7.17e+05 7.17e+05 0.000 status:VmPin : 1 1.67e+05 1.67e+05 
1.67e+05 0.000 status:VmRSS : 1 2.19e+05 2.19e+05 2.19e+05 0.000 status:VmSize : 1 6.52e+05 6.52e+05 6.52e+05 0.000 status:VmStk : 1 192.000 192.000 192.000 0.000 status:VmSwap : 1 0.000 0.000 0.000 0.000 status:nonvoluntary_ctxt_switches : 1 9.000 9.000 9.000 0.000 status:voluntary_ctxt_switches : 1 1331.000 1331.000 1331.000 0.000 ------------------------------------------------------------------------------------------------ Timer : #calls | mean | total | % total ------------------------------------------------------------------------------------------------ APEX MAIN : 1 0.461 0.461 100.000 GPU: Unified Memcpy DTOH : 1 0.000 0.000 0.001 GPU: Unified Memcpy HTOD : 1 0.000 0.000 0.001 GPU: Kernel(DataElement*) : 4 0.000 0.000 0.086 cudaDeviceSynchronize : 4 0.000 0.001 0.169 cudaFree : 2 0.000 0.000 0.052 cudaLaunchKernel : 4 0.000 0.000 0.021 cudaMallocManaged : 2 0.135 0.269 58.397 launch [/home/users/khuck/src/apex/src/unit_tests... : 4 0.028 0.110 23.870 APEX Idle : 0.080 17.403 ------------------------------------------------------------------------------------------------ Total timers : 22","title":"Profiling"},{"location":"usage/#profiling_with_csv_output","text":"To enable CSV output, use one of the methods described in the Environment Variables page, and run as the previous example. 
[khuck@cyclops apex]$ export APEX_CSV_OUTPUT=1 [khuck@cyclops apex]$ ./build/src/unit_tests/CUDA/apex_cuda_cu Found 4 total devices On device: name=hello, value=10 On device: name=dello, value=11 On device: name=dello, value=12 On device: name=dello, value=13 On host: name=dello, value=14 [khuck@cyclops apex]$ cat apex.0.csv \"counter\",\"num samples\",\"minimum\",\"mean\"\"maximum\",\"stddev\" \"1 Minute Load average\",1,22,22,22,0 \"Device 0 GPU Clock Memory (MHz)\",1,877,877,877,0 \"Device 0 GPU Clock SM (MHz)\",1,1530,1530,1530,0 \"Device 0 GPU Memory Free (MB)\",1,13411,13411,13411,0 \"Device 0 GPU Memory Used (MB)\",1,20679,20679,20679,0 \"Device 0 GPU Memory Utilization %\",1,58,58,58,0 \"Device 0 GPU NvLink Link Count\",1,6,6,6,0 \"Device 0 GPU NvLink Speed MB/s\",1,25781,25781,25781,0 \"Device 0 GPU NvLink Utilization C0\",1,0,0,0,0 \"Device 0 GPU NvLink Utilization C1\",1,0,0,0,0 \"Device 0 GPU Power (W)\",1,255,255,255,0 \"Device 0 GPU Temperature (C)\",1,75,75,75,0 \"Device 0 GPU Utilization %\",1,99,99,99,0 \"Device 0 PCIe RX Throughput (MB/s)\",1,7,7,7,0 \"Device 0 PCIe TX Throughput (MB/s)\",1,2,2,2,0 \"GPU: Bytes Allocated\",2,6,11,16,5 \"status:Threads\",1,7,7,7,0 \"status:VmData\",1,277120,277120,277120,0 \"status:VmExe\",1,64,64,64,0 \"status:VmHWM\",1,219008,219008,219008,0 \"status:VmLck\",1,0,0,0,0 \"status:VmLib\",1,87424,87424,87424,0 \"status:VmPMD\",1,16,16,16,0 \"status:VmPTE\",1,36,36,36,0 \"status:VmPeak\",1,717248,717248,717248,0 \"status:VmPin\",1,166528,166528,166528,0 \"status:VmRSS\",1,219008,219008,219008,0 \"status:VmSize\",1,652032,652032,652032,0 \"status:VmStk\",1,192,192,192,0 \"status:VmSwap\",1,0,0,0,0 \"status:nonvoluntary_ctxt_switches\",1,8,8,8,0 \"status:voluntary_ctxt_switches\",1,1276,1276,1276,0 \"task\",\"num calls\",\"total cycles\",\"total microseconds\" \"APEX MAIN\",1,0,431162 \"GPU: Unified Memcpy DTOH\",1,0,3 \"GPU: Unified Memcpy HTOD\",1,0,4 \"GPU: Kernel(DataElement*)\",4,0,1082 
\"cudaDeviceSynchronize\",4,0,9993 \"cudaFree\",2,0,172 \"cudaLaunchKernel\",4,0,66 \"cudaMallocManaged\",2,0,194367 \"launch [/home/users/khuck/src/apex/src/unit_tests/CUDA/apex_cuda.cu:35]\",4,0,164490","title":"Profiling with CSV output"},{"location":"usage/#profiling_with_tau_profile_output","text":"To enable TAU profile output, use one of the methods described in the Environment Variables page, and run as the previous example. The output can be summarized with the TAU pprof command, which is installed with the TAU software. [khuck@cyclops apex]$ export APEX_CSV_OUTPUT=1 [khuck@cyclops apex]$ ./build/src/unit_tests/CUDA/apex_cuda_cu Found 4 total devices On device: name=hello, value=10 On device: name=dello, value=11 On device: name=dello, value=12 On device: name=dello, value=13 On host: name=dello, value=14 [khuck@cyclops apex]$ cat profile.0.0.0 9 templated_functions_MULTI_TIME # Name Calls Subrs Excl Incl ProfileCalls # \"GPU: Unified Memcpy DTOH\" 1 0 2.656 2.656 0 GROUP=\"TAU_USER\" \"cudaFree\" 2 0 193.18 193.18 0 GROUP=\"TAU_USER\" \"cudaMallocManaged\" 2 0 184435 184435 0 GROUP=\"TAU_USER\" \"GPU: Unified Memcpy HTOD\" 1 0 4.64 4.64 0 GROUP=\"TAU_USER\" \"GPU: Kernel(DataElement*)\" 4 0 355.293 355.293 0 GROUP=\"TAU_USER\" \"cudaLaunchKernel\" 4 0 67.4 67.4 0 GROUP=\"TAU_USER\" \"cudaDeviceSynchronize\" 4 0 811.244 811.244 0 GROUP=\"TAU_USER\" \"launch [/home/users/khuck/src/apex/src/unit_tests/CUDA/apex_cuda.cu:35]\" 4 0 100327 100327 0 GROUP=\"TAU_USER\" \"APEX MAIN\" 1 0 67830.2 354026 0 GROUP=\"TAU_USER\" 0 aggregates 32 userevents # eventname numevents max min mean sumsqr \"status:VmSwap\" 1 0 0 0 0 \"status:VmSize\" 1 652032 652032 652032 4.25146e+11 \"status:Threads\" 1 7 7 7 49 \"status:VmPeak\" 1 717248 717248 717248 5.14445e+11 \"Device 0 GPU Power (W)\" 1 224.057 224.057 224.057 50201.5 \"Device 0 GPU NvLink Speed MB/s\" 1 25781 25781 25781 6.6466e+08 \"status:VmExe\" 1 64 64 64 4096 \"status:nonvoluntary_ctxt_switches\" 1 12 12 12 144 
\"Device 0 GPU Memory Utilization %\" 1 73 73 73 5329 \"status:VmStk\" 1 192 192 192 36864 \"status:VmData\" 1 277120 277120 277120 7.67955e+10 \"status:VmLck\" 1 0 0 0 0 \"status:VmPin\" 1 166528 166528 166528 2.77316e+10 \"status:VmPTE\" 1 35 35 35 1225 \"Device 0 GPU NvLink Utilization C1\" 1 0 0 0 0 \"status:VmHWM\" 1 219008 219008 219008 4.79645e+10 \"status:VmRSS\" 1 219008 219008 219008 4.79645e+10 \"GPU: Bytes Allocated\" 2 16 6 11 292 \"status:VmLib\" 1 87424 87424 87424 7.64296e+09 \"Device 0 GPU Utilization %\" 1 99 99 99 9801 \"status:voluntary_ctxt_switches\" 1 1320 1320 1320 1.7424e+06 \"Device 0 GPU Clock SM (MHz)\" 1 1530 1530 1530 2.3409e+06 \"status:VmPMD\" 1 20 20 20 400 \"1 Minute Load average\" 1 16.43 16.43 16.43 269.945 \"Device 0 GPU Clock Memory (MHz)\" 1 877 877 877 769129 \"Device 0 PCIe TX Throughput (MB/s)\" 1 2 2 2 4 \"Device 0 GPU Temperature (C)\" 1 73 73 73 5329 \"Device 0 PCIe RX Throughput (MB/s)\" 1 6 6 6 36 \"Device 0 GPU Memory Used (MB)\" 1 20679.1 20679.1 20679.1 4.27625e+08 \"Device 0 GPU NvLink Utilization C0\" 1 0 0 0 0 \"Device 0 GPU NvLink Link Count\" 1 6 6 6 36 \"Device 0 GPU Memory Free (MB)\" 1 13410.6 13410.6 13410.6 1.79845e+08 [khuck@cyclops apex]$ which pprof ~/src/tau2/ibm64linux/bin/pprof [khuck@cyclops apex]$ pprof Reading Profile files in profile.* NODE 0;CONTEXT 0;THREAD 0: --------------------------------------------------------------------------------------- %Time Exclusive Inclusive #Call #Subrs Inclusive Name msec total msec usec/call --------------------------------------------------------------------------------------- 100.0 67 354 1 0 354026 APEX MAIN 52.1 184 184 2 0 92218 cudaMallocManaged 28.3 100 100 4 0 25082 launch [/home/users/khuck/src/apex/src/unit_tests/CUDA/apex_cuda.cu:35] 0.2 0.811 0.811 4 0 203 cudaDeviceSynchronize 0.1 0.355 0.355 4 0 89 GPU: Kernel(DataElement*) 0.1 0.193 0.193 2 0 97 cudaFree 0.0 0.0674 0.0674 4 0 17 cudaLaunchKernel 0.0 0.00464 0.00464 1 0 5 GPU: Unified Memcpy HTOD 
0.0 0.00266 0.00266 1 0 3 GPU: Unified Memcpy DTOH --------------------------------------------------------------------------------------- USER EVENTS Profile :NODE 0, CONTEXT 0, THREAD 0 --------------------------------------------------------------------------------------- NumSamples MaxValue MinValue MeanValue Std. Dev. Event Name --------------------------------------------------------------------------------------- 1 16.43 16.43 16.43 0.01 1 Minute Load average 1 877 877 877 0 Device 0 GPU Clock Memory (MHz) 1 1530 1530 1530 0 Device 0 GPU Clock SM (MHz) 1 1.341E+04 1.341E+04 1.341E+04 28.42 Device 0 GPU Memory Free (MB) 1 2.068E+04 2.068E+04 2.068E+04 13.3 Device 0 GPU Memory Used (MB) 1 73 73 73 0 Device 0 GPU Memory Utilization % 1 6 6 6 0 Device 0 GPU NvLink Link Count 1 2.578E+04 2.578E+04 2.578E+04 6.245 Device 0 GPU NvLink Speed MB/s 1 0 0 0 0 Device 0 GPU NvLink Utilization C0 1 0 0 0 0 Device 0 GPU NvLink Utilization C1 1 224.1 224.1 224.1 0.1981 Device 0 GPU Power (W) 1 73 73 73 0 Device 0 GPU Temperature (C) 1 99 99 99 0 Device 0 GPU Utilization % 1 6 6 6 0 Device 0 PCIe RX Throughput (MB/s) 1 2 2 2 0 Device 0 PCIe TX Throughput (MB/s) 2 16 6 11 5 GPU: Bytes Allocated 1 7 7 7 0 status:Threads 1 2.771E+05 2.771E+05 2.771E+05 74.83 status:VmData 1 64 64 64 0 status:VmExe 1 2.19E+05 2.19E+05 2.19E+05 63.75 status:VmHWM 1 0 0 0 0 status:VmLck 1 8.742E+04 8.742E+04 8.742E+04 64.99 status:VmLib 1 20 20 20 0 status:VmPMD 1 35 35 35 0 status:VmPTE 1 7.172E+05 7.172E+05 7.172E+05 553.6 status:VmPeak 1 1.665E+05 1.665E+05 1.665E+05 158.8 status:VmPin 1 2.19E+05 2.19E+05 2.19E+05 63.75 status:VmRSS 1 6.52E+05 6.52E+05 6.52E+05 520.6 status:VmSize 1 192 192 192 0 status:VmStk 1 0 0 0 0 status:VmSwap 1 12 12 12 0 status:nonvoluntary_ctxt_switches 1 1320 1320 1320 0 status:voluntary_ctxt_switches ---------------------------------------------------------------------------------------","title":"Profiling with TAU profile 
output"},{"location":"usage/#profiling_with_taskgraph_output","text":"APEX can capture the task dependency graph from the application, and output it as a GraphViz graph. The graph represents summarized task \"type\" dependencies, not a full dependency graph/tree with every task instance. [khuck@cyclops apex]$ apex_exec --apex:taskgraph --apex:cuda ./build/src/unit_tests/CUDA/apex_cuda_cu [khuck@cyclops apex]$ dot -Tpdf -O taskgraph.0.dot","title":"Profiling with Taskgraph output"},{"location":"usage/#profiling_with_tasktree_output","text":"APEX can capture the task dependency tree from the application, and output it as a GraphViz graph or ASCII. The graph represents summarized task \"type\" dependencies, not a full dependency graph/tree with every task instance. The difference between the graph and the tree is that in the tree, there are no cycles and child tasks have only one parent. [khuck@cyclops apex]$ apex_exec --apex:tasktree --apex:cuda ./build/src/unit_tests/CUDA/apex_cuda_cu [khuck@cyclops apex]$ apex-treesummary.py apex_tasktree.csv","title":"Profiling with Tasktree output"},{"location":"usage/#profiling_with_scatterplot_output","text":"For this example, we are using an HPX quickstart example, the fibonacci example. After execution, APEX writes a sample data file to disk, apex_task_samples.csv . That file is post-processed with the APEX python script task_scatterplot.py . [khuck@cyclops apex]$ export APEX_TASK_SCATTERPLOT=1 [khuck@cyclops build]$ ./bin/fibonacci --n-value=20 [khuck@cyclops build]$ /home/users/khuck/src/apex/install/bin/task_scatterplot.py Parsed 2362 samples Plotting async_launch_policy_dispatch Plotting async_launch_policy_dispatch::call Plotting async Rendering...","title":"Profiling with Scatterplot output"},{"location":"usage/#profiling_with_otf2_trace_output","text":"For this example, we are using an APEX unit test that computes the value of PI. OTF2 is the \"Open Trace Format v2\", used for tracing large scale HPC applications. 
For more information on OTF2 and associated tools, see The VI-HPS Score-P web site . Vampir is a commercial trace viewer that can be used to visualize and analyze OTF2 trace data. Traveler is an open source tool that can be used to visualize and analyze APEX OTF2 trace data. [khuck@cyclops apex]$ export APEX_OTF2=1 [khuck@cyclops apex]$ ./build/src/unit_tests/CUDA/pi_cu Found 4 total devices 134217728 num streams 4 making streams starting compute n is 0 num darts in circle 0: 105418094 pi is 3.141704 Closing OTF2 event files... Writing OTF2 definition files... Writing OTF2 Global definition file... Writing OTF2 Node information... Writing OTF2 Communicators... Closing the archive... done. [khuck@eagle apex]$ module load vampir [khuck@eagle apex]$ vampir OTF2_archive/APEX.otf2","title":"Profiling with OTF2 Trace output"},{"location":"usage/#profiling_with_google_trace_events_format_output","text":"For this example, we are using an APEX unit test that computes the value of PI. Google Trace Events is a format developed by Google for tracing activity on devices, but is free and open and JSON based. For more information on Google Trace Events and associated tools, see the Google Trace Event Format document . The Google Chrome Web Browser can be used to visualize and analyze GTE trace data. [khuck@cyclops apex]$ export APEX_TRACE_EVENT=1 [khuck@cyclops apex]$ ./build/src/unit_tests/CUDA/pi_cu","title":"Profiling with Google Trace Events Format output"},{"location":"usecases/","text":"Before you start \u00b6 All examples on this page assume you have downloaded, configured and built APEX. See the Getting Started page for instructions on how to do that. Simple example \u00b6 In the APEX installation directory, there is a bin directory. In the bin directory are a number of examples, one of which is a simple matrix multiplication example, matmult . To run the matmult example, simply type 'matmult'. 
The output should be something like this: khuck@ktau:~/src/apex/install/bin$ ./matmult Spawned thread 1... Spawned thread 2... Spawned thread 3... Done. Not very interesting, eh? To see what APEX measured, set the APEX_SCREEN_OUTPUT environment variable to 1, and run it again: khuck@ktau:~/src/apex/install/bin$ export APEX_SCREEN_OUTPUT=1 khuck@ktau:~/src/apex/install/bin$ ./matmult v0.1-e050e17-master Built on: 14:38:56 Dec 22 2015 C++ Language Standard version : 201402 GCC Compiler version : 5.2.1 20151010 APEX_TAU : 0 APEX_POLICY : 1 APEX_MEASURE_CONCURRENCY : 0 APEX_MEASURE_CONCURRENCY_PERIOD : 1000000 APEX_SCREEN_OUTPUT : 1 APEX_PROFILE_OUTPUT : 0 APEX_CSV_OUTPUT : 1 APEX_TASKGRAPH_OUTPUT : 0 APEX_PROC_CPUINFO : 0 APEX_PROC_MEMINFO : 0 APEX_PROC_NET_DEV : 0 APEX_PROC_SELF_STATUS : 0 APEX_PROC_STAT : 1 APEX_THROTTLE_CONCURRENCY : 1 APEX_THROTTLING_MAX_THREADS : 8 APEX_THROTTLING_MIN_THREADS : 1 APEX_THROTTLE_ENERGY : 0 APEX_THROTTLING_MAX_WATTS : 300 APEX_THROTTLING_MIN_WATTS : 150 APEX_PTHREAD_WRAPPER_STACK_SIZE : 0 APEX_PAPI_METRICS : Spawned thread 1... Spawned thread 2... Spawned thread 3... Done. CPU is 2.66013e+09 Hz. 
Elapsed time: 0.966516 Cores detected: 8 Worker Threads observed: 4 Available CPU time: 3.86607 Action : #calls | minimum | mean | maximum | total | stddev | % total ------------------------------------------------------------------------------------------------------------ allocateMatrix : 12 --n/a-- 1.94e-02 --n/a-- 2.33e-01 --n/a-- 6.014 compute : 4 --n/a-- 6.89e-01 --n/a-- 2.76e+00 --n/a-- 71.279 compute_interchange : 4 --n/a-- 1.85e-01 --n/a-- 7.38e-01 --n/a-- 19.091 do_work : 4 --n/a-- 9.43e-01 --n/a-- 3.77e+00 --n/a-- 97.601 freeMatrix : 12 --n/a-- 2.36e-04 --n/a-- 2.83e-03 --n/a-- 0.073 initialize : 12 --n/a-- 3.56e-03 --n/a-- 4.27e-02 --n/a-- 1.104 main : 1 --n/a-- 9.66e-01 --n/a-- 9.66e-01 --n/a-- 24.983 APEX Idle : --n/a-- --n/a-- --n/a-- --n/a-- --n/a-- --n/a-- --n/a-- ------------------------------------------------------------------------------------------------------------ In this output, we see the status of all of the environment variables (as read by APEX at initialization), the regular program output, and then a summary from APEX at the end. Because APEX captures timestamps using the low-overhead rdtsc function call (where available), the measurements are done in cycles. APEX estimates the Hz rating of the CPU to convert to seconds for output. APEX reports the elapsed wall-clock time, the number of cores detected, the number of worker threads observed, as well as the total available CPU time (wall-clock times workers). OpenMP example \u00b6 In the APEX installation directory, there is a bin directory. In the bin directory are a number of examples, one of which is the OpenMP implementation of LULESH (for details, see the LLNL explanation of LULESH ). When APEX is configured with OpenMP OMPT support (using the -DBUILD_OMPT=TRUE or equivalent CMake configuration settings) it will measure OpenMP events. 
Executing the LULESH example (with APEX_SCREEN_OUTPUT=1) gives the following output: khuck@ktau:~/src/apex$ ./install/bin/lulesh_OpenMP_2.0 v0.1-e050e17-master Built on: 14:38:56 Dec 22 2015 C++ Language Standard version : 201402 GCC Compiler version : 5.2.1 20151010 APEX_TAU : 0 APEX_POLICY : 1 APEX_MEASURE_CONCURRENCY : 0 APEX_MEASURE_CONCURRENCY_PERIOD : 1000000 APEX_SCREEN_OUTPUT : 1 APEX_PROFILE_OUTPUT : 0 APEX_CSV_OUTPUT : 1 APEX_TASKGRAPH_OUTPUT : 0 APEX_PROC_CPUINFO : 0 APEX_PROC_MEMINFO : 0 APEX_PROC_NET_DEV : 0 APEX_PROC_SELF_STATUS : 0 APEX_PROC_STAT : 1 APEX_THROTTLE_CONCURRENCY : 1 APEX_THROTTLING_MAX_THREADS : 8 APEX_THROTTLING_MIN_THREADS : 1 APEX_THROTTLE_ENERGY : 0 APEX_THROTTLING_MAX_WATTS : 300 APEX_THROTTLING_MIN_WATTS : 150 APEX_PTHREAD_WRAPPER_STACK_SIZE : 0 APEX_PAPI_METRICS : Running problem size 30^3 per domain until completion Num processors: 1 Registering OMPT events...done. Num threads: 8 Total number of elements: 27000 To run other sizes, use -s . To run a fixed number of iterations, use -i . To run a more or less balanced region set, use -b . To change the relative costs of regions, use -c . To print out progress, use -p To write an output file for VisIt, use -v See help (-h) for more options APEX: disabling lightweight timer OpenMP_BARRIER: CalcPressur... APEX: disabling lightweight timer OpenMP_BARRIER: CalcPressur... APEX: disabling lightweight timer OpenMP_BARRIER: EvalEOSForE... APEX: disabling lightweight timer OpenMP_BARRIER: CalcEnergyF... APEX: disabling lightweight timer OpenMP_BARRIER: CalcEnergyF... APEX: disabling lightweight timer OpenMP_BARRIER: CalcEnergyF... APEX: disabling lightweight timer OpenMP_BARRIER: CalcEnergyF... APEX: disabling lightweight timer OpenMP_BARRIER: CalcEnergyF... APEX: disabling lightweight timer OpenMP_BARRIER: CalcCourant... APEX: disabling lightweight timer OpenMP_BARRIER: CalcHydroCo... APEX: disabling lightweight timer OpenMP_BARRIER: CalcMonoton... 
APEX: disabling lightweight timer OpenMP_BARRIER: EvalEOSForE... APEX: disabling lightweight timer OpenMP_BARRIER: CalcSoundSp... APEX: disabling lightweight timer OpenMP_BARRIER: InitStressT... APEX: disabling lightweight timer OpenMP_BARRIER: CalcVolumeF... APEX: disabling lightweight timer OpenMP_BARRIER: CalcAcceler... APEX: disabling lightweight timer OpenMP_BARRIER: CalcVelocit... APEX: disabling lightweight timer OpenMP_BARRIER: CalcPositio... APEX: disabling lightweight timer OpenMP_BARRIER: CalcLagrang... APEX: disabling lightweight timer OpenMP_BARRIER: UpdateVolum... APEX: disabling lightweight timer OpenMP_BARRIER: ApplyAccele... APEX: disabling lightweight timer OpenMP_BARRIER: CalcForceFo... Run completed: Problem size = 30 MPI tasks = 1 Iteration count = 932 Final Origin Energy = 2.025075e+05 Testing Plane 0 of Energy Array on rank 0: MaxAbsDiff = 6.548362e-11 TotalAbsDiff = 8.615093e-10 MaxRelDiff = 1.461140e-12 Elapsed time = 55.00 (s) Grind time (us/z/c) = 2.1855548 (per dom) ( 2.1855548 overall) FOM = 457.54973 (z/s) CPU is 2.66013e+09 Hz. Elapsed time: 55.0085 Cores detected: 8 Worker Threads observed: 8 Available CPU time: 440.068 Action : #calls | minimum | mean | maximum | total | stddev | % total ------------------------------------------------------------------------------------------------------------ CPU Guest % : 54 0.000 0.000 0.000 0.000 0.000 --n/a-- CPU I/O Wait % : 54 0.000 0.040 0.714 2.143 0.133 --n/a-- CPU IRQ % : 54 0.000 0.000 0.000 0.000 0.000 --n/a-- CPU Idle % : 54 0.857 1.384 4.857 74.714 0.763 --n/a-- CPU Nice % : 54 0.000 0.000 0.000 0.000 0.000 --n/a-- CPU Steal % : 54 0.000 0.000 0.000 0.000 0.000 --n/a-- CPU System % : 54 15.286 23.339 26.714 1260.286 2.301 --n/a-- CPU User % : 54 84.143 88.373 97.143 4772.143 2.268 --n/a-- CPU soft IRQ % : 54 0.000 0.026 0.286 1.429 0.068 --n/a-- OpenMP_BARRIER: ApplyAccele... : DISABLED (high frequency, short duration) OpenMP_BARRIER: ApplyMateri... 
: 14912 --n/a-- 3.96e-05 --n/a-- 5.91e-01 --n/a-- 0.134 OpenMP_BARRIER: CalcAcceler... : DISABLED (high frequency, short duration) OpenMP_BARRIER: CalcCourant... : DISABLED (high frequency, short duration) OpenMP_BARRIER: CalcEnergyF... : DISABLED (high frequency, short duration) OpenMP_BARRIER: CalcEnergyF... : DISABLED (high frequency, short duration) OpenMP_BARRIER: CalcEnergyF... : DISABLED (high frequency, short duration) OpenMP_BARRIER: CalcEnergyF... : DISABLED (high frequency, short duration) OpenMP_BARRIER: CalcEnergyF... : DISABLED (high frequency, short duration) OpenMP_BARRIER: CalcFBHourg... : 7456 --n/a-- 1.11e-04 --n/a-- 8.27e-01 --n/a-- 0.188 OpenMP_BARRIER: CalcFBHourg... : 7456 --n/a-- 1.49e-04 --n/a-- 1.11e+00 --n/a-- 0.252 OpenMP_BARRIER: CalcForceFo... : DISABLED (high frequency, short duration) OpenMP_BARRIER: CalcHourgla... : 7456 --n/a-- 1.32e-04 --n/a-- 9.84e-01 --n/a-- 0.224 OpenMP_BARRIER: CalcHydroCo... : DISABLED (high frequency, short duration) OpenMP_BARRIER: CalcKinemat... : 7456 --n/a-- 7.88e-05 --n/a-- 5.88e-01 --n/a-- 0.134 OpenMP_BARRIER: CalcLagrang... : DISABLED (high frequency, short duration) OpenMP_BARRIER: CalcMonoton... : 7456 --n/a-- 6.98e-05 --n/a-- 5.21e-01 --n/a-- 0.118 OpenMP_BARRIER: CalcMonoton... : DISABLED (high frequency, short duration) OpenMP_BARRIER: CalcPositio... : DISABLED (high frequency, short duration) OpenMP_BARRIER: CalcPressur... : DISABLED (high frequency, short duration) OpenMP_BARRIER: CalcPressur... : DISABLED (high frequency, short duration) OpenMP_BARRIER: CalcSoundSp... : DISABLED (high frequency, short duration) OpenMP_BARRIER: CalcVelocit... : DISABLED (high frequency, short duration) OpenMP_BARRIER: CalcVolumeF... : DISABLED (high frequency, short duration) OpenMP_BARRIER: EvalEOSForE... : DISABLED (high frequency, short duration) OpenMP_BARRIER: EvalEOSForE... : DISABLED (high frequency, short duration) OpenMP_BARRIER: InitStressT... 
: DISABLED (high frequency, short duration) OpenMP_BARRIER: IntegrateSt... : 7456 --n/a-- 6.66e-05 --n/a-- 4.97e-01 --n/a-- 0.113 OpenMP_BARRIER: IntegrateSt... : 7456 --n/a-- 1.28e-04 --n/a-- 9.54e-01 --n/a-- 0.217 OpenMP_BARRIER: UpdateVolum... : DISABLED (high frequency, short duration) OpenMP_PARALLEL_REGION: App... : 932 --n/a-- 1.09e-04 --n/a-- 1.01e-01 --n/a-- 0.023 OpenMP_PARALLEL_REGION: App... : 932 --n/a-- 2.58e-04 --n/a-- 2.40e-01 --n/a-- 0.055 OpenMP_PARALLEL_REGION: Cal... : 932 --n/a-- 7.83e-04 --n/a-- 7.30e-01 --n/a-- 0.166 OpenMP_PARALLEL_REGION: Cal... : 10252 --n/a-- 7.72e-05 --n/a-- 7.91e-01 --n/a-- 0.180 OpenMP_PARALLEL_REGION: Cal... : 32620 --n/a-- 4.29e-05 --n/a-- 1.40e+00 --n/a-- 0.318 OpenMP_PARALLEL_REGION: Cal... : 32620 --n/a-- 5.07e-05 --n/a-- 1.65e+00 --n/a-- 0.376 OpenMP_PARALLEL_REGION: Cal... : 32620 --n/a-- 3.31e-05 --n/a-- 1.08e+00 --n/a-- 0.245 OpenMP_PARALLEL_REGION: Cal... : 32620 --n/a-- 4.75e-05 --n/a-- 1.55e+00 --n/a-- 0.352 OpenMP_PARALLEL_REGION: Cal... : 32620 --n/a-- 4.09e-05 --n/a-- 1.34e+00 --n/a-- 0.303 OpenMP_PARALLEL_REGION: Cal... : 932 --n/a-- 8.10e-03 --n/a-- 7.55e+00 --n/a-- 1.715 OpenMP_PARALLEL_REGION: Cal... : 932 --n/a-- 3.51e-03 --n/a-- 3.28e+00 --n/a-- 0.744 OpenMP_PARALLEL_REGION: Cal... : 932 --n/a-- 4.34e-04 --n/a-- 4.05e-01 --n/a-- 0.092 OpenMP_PARALLEL_REGION: Cal... : 932 --n/a-- 4.27e-03 --n/a-- 3.98e+00 --n/a-- 0.905 OpenMP_PARALLEL_REGION: Cal... : 10252 --n/a-- 4.72e-05 --n/a-- 4.84e-01 --n/a-- 0.110 OpenMP_PARALLEL_REGION: Cal... : 932 --n/a-- 1.68e-03 --n/a-- 1.57e+00 --n/a-- 0.356 OpenMP_PARALLEL_REGION: Cal... : 932 --n/a-- 2.29e-04 --n/a-- 2.13e-01 --n/a-- 0.048 OpenMP_PARALLEL_REGION: Cal... : 932 --n/a-- 1.15e-03 --n/a-- 1.07e+00 --n/a-- 0.244 OpenMP_PARALLEL_REGION: Cal... : 10252 --n/a-- 2.29e-04 --n/a-- 2.34e+00 --n/a-- 0.533 OpenMP_PARALLEL_REGION: Cal... : 932 --n/a-- 4.98e-04 --n/a-- 4.64e-01 --n/a-- 0.105 OpenMP_PARALLEL_REGION: Cal... 
: 97860 --n/a-- 3.26e-05 --n/a-- 3.19e+00 --n/a-- 0.725 OpenMP_PARALLEL_REGION: Cal... : 97860 --n/a-- 3.20e-05 --n/a-- 3.13e+00 --n/a-- 0.712 OpenMP_PARALLEL_REGION: Cal... : 10252 --n/a-- 4.52e-05 --n/a-- 4.63e-01 --n/a-- 0.105 OpenMP_PARALLEL_REGION: Cal... : 932 --n/a-- 3.39e-04 --n/a-- 3.16e-01 --n/a-- 0.072 OpenMP_PARALLEL_REGION: Cal... : 932 --n/a-- 1.57e-04 --n/a-- 1.47e-01 --n/a-- 0.033 OpenMP_PARALLEL_REGION: Eva... : 32620 --n/a-- 1.07e-04 --n/a-- 3.50e+00 --n/a-- 0.796 OpenMP_PARALLEL_REGION: Eva... : 10252 --n/a-- 2.86e-05 --n/a-- 2.93e-01 --n/a-- 0.067 OpenMP_PARALLEL_REGION: Ini... : 932 --n/a-- 3.52e-04 --n/a-- 3.28e-01 --n/a-- 0.074 OpenMP_PARALLEL_REGION: Int... : 932 --n/a-- 3.14e-03 --n/a-- 2.93e+00 --n/a-- 0.666 OpenMP_PARALLEL_REGION: Int... : 932 --n/a-- 2.18e-03 --n/a-- 2.03e+00 --n/a-- 0.461 OpenMP_PARALLEL_REGION: Upd... : 932 --n/a-- 1.34e-04 --n/a-- 1.25e-01 --n/a-- 0.028 APEX Idle : --n/a-- --n/a-- --n/a-- --n/a-- 3.87e+02 --n/a-- 88.011 ------------------------------------------------------------------------------------------------------------ There are several lightweight events that APEX elects to ignore. The other events are timed by APEX and reported at exit, along with the /proc/stat data (CPU % counters). With PAPI \u00b6 When APEX is configured with PAPI support (using -DPAPI_ROOT=/path/to/papi and -DUSE_PAPI=TRUE), hardware counter data can also be collected by APEX. 
To specify hardware counters of interest, use the APEX_PAPI_METRICS environment variable: khuck@ktau:~/src/apex$ export APEX_PAPI_METRICS=\"PAPI_TOT_INS PAPI_L2_TCM\" ...and then execute as normal: khuck@ktau:~/src/apex$ ./install/bin/matmult v0.1-e050e17-master Built on: 14:38:56 Dec 22 2015 C++ Language Standard version : 201402 GCC Compiler version : 5.2.1 20151010 APEX_TAU : 1 APEX_POLICY : 1 APEX_MEASURE_CONCURRENCY : 0 APEX_MEASURE_CONCURRENCY_PERIOD : 1000000 APEX_SCREEN_OUTPUT : 1 APEX_PROFILE_OUTPUT : 0 APEX_CSV_OUTPUT : 1 APEX_TASKGRAPH_OUTPUT : 0 APEX_PROC_CPUINFO : 0 APEX_PROC_MEMINFO : 0 APEX_PROC_NET_DEV : 0 APEX_PROC_SELF_STATUS : 0 APEX_PROC_STAT : 1 APEX_THROTTLE_CONCURRENCY : 1 APEX_THROTTLING_MAX_THREADS : 8 APEX_THROTTLING_MIN_THREADS : 1 APEX_THROTTLE_ENERGY : 0 APEX_THROTTLING_MAX_WATTS : 300 APEX_THROTTLING_MIN_WATTS : 150 APEX_PTHREAD_WRAPPER_STACK_SIZE : 0 APEX_PAPI_METRICS : PAPI_TOT_INS PAPI_L2_TCM Spawned thread 1... Spawned thread 2... Spawned thread 3... Done. CPU is 2.66019e+09 Hz. 
Elapsed time: 0.954974 Cores detected: 8 Worker Threads observed: 4 Available CPU time: 3.81989 Action : #calls | minimum | mean | maximum | total | stddev | % total PAPI_TOT_INS PAPI_L2_TCM ------------------------------------------------------------------------------------------------------------ allocateMatrix : 12 --n/a-- 2.21e-02 --n/a-- 2.65e-01 --n/a-- 6.930 1.62e+06 9.10e+03 compute : 4 --n/a-- 6.85e-01 --n/a-- 2.74e+00 --n/a-- 71.743 4.31e+09 1.71e+06 compute_interchange : 4 --n/a-- 1.81e-01 --n/a-- 7.23e-01 --n/a-- 18.922 3.77e+09 8.12e+05 do_work : 4 --n/a-- 9.44e-01 --n/a-- 3.78e+00 --n/a-- 98.851 8.10e+09 2.92e+06 freeMatrix : 12 --n/a-- 2.07e-04 --n/a-- 2.49e-03 --n/a-- 0.065 1.13e+06 6.30e+03 initialize : 12 --n/a-- 3.58e-03 --n/a-- 4.29e-02 --n/a-- 1.124 2.21e+07 3.80e+05 main : 1 --n/a-- 9.54e-01 --n/a-- 9.54e-01 --n/a-- 24.978 2.03e+09 7.66e+05 APEX Idle : --n/a-- --n/a-- --n/a-- --n/a-- --n/a-- --n/a-- --n/a-- ------------------------------------------------------------------------------------------------------------ CSV output \u00b6 While APEX is not designed for post-mortem performance analysis, you can export the data that APEX collected. If you set the APEX_CSV_OUTPUT environment variable to 1, APEX will also dump the timer statistics as a CSV file: khuck@ktau:~/src/apex$ cat apex.0.csv \"task\",\"num calls\",\"total cycles\",\"total microseconds\",\"PAPI_TOT_INS\",\"PAPI_L2_TCM\" \"allocateMatrix\",12,704195504,264717,1615804,9100 \"compute\",4,7290209200,2740489,4306522734,1709040 \"compute_interchange\",4,1922797744,722806,3769652571,812196 \"do_work\",4,10044907856,3776018,8101109302,2922142 \"freeMatrix\",12,6613336,2486,1132717,6301 \"initialize\",12,114177592,42921,22093639,379785 \"main\",1,2538202992,954145,2025172707,766218 With TAU \u00b6 If APEX is configured with TAU support, then APEX measurements will be forwarded to TAU and recorded as a TAU profile. 
In addition, all other TAU features are supported, including sampling, MPI measurement, I/O measurement, tracing, etc. To configure APEX with TAU, specify the flags -DUSE_TAU, -DTAU_ROOT, -DTAU_ARCH, and -DTAU_OPTIONS. For example, if TAU was configured with \"./configure -pthread\" on an x86_64 Linux machine, the APEX configuration options would be \"-DUSE_TAU=1 -DTAU_ROOT=/path/to/tau -DTAU_ARCH=x86_64 -DTAU_OPTIONS=-pthread\". If TAU was configured with \"./configure -mpi -pthread\" on an x86_64 Linux machine, the APEX configuration options would be \"-DUSE_TAU=1 -DTAU_ROOT=/path/to/tau -DTAU_ARCH=x86_64 -DTAU_OPTIONS=-mpi-pthread\". Here is a suggested configuration for TAU on x86-Linux to use with APEX (some systems require special flags - please contact the maintainers if you are interested): # download the latest TAU release wget http://www.cs.uoregon.edu/research/paracomp/tau/tauprofile/dist/tau_latest.tar.gz # expand the tar file tar -xvzf tau_latest.tar.gz cd tau-2.25 # configure TAU ./configure -papi=/usr/local/papi/5.3.2 -pthread -prefix=/usr/local/tau/2.25 # build make -j install # set our path to include the new TAU installation export PATH=$PATH:/usr/local/tau/2.25/x86_64/bin Here is a suggested configuration for APEX to use the above TAU installation: cd xpress-apex mkdir build-tau cd build-tau cmake -DBUILD_EXAMPLES=TRUE -DBUILD_TESTS=TRUE -DCMAKE_BUILD_TYPE=RelWithDebInfo \\ -DUSE_TAU=TRUE -DTAU_ROOT=/usr/local/tau/2.25 -DTAU_ARCH=x86_64 -DTAU_OPTIONS=-papi-pthread \\ -DBUILD_BFD=TRUE -DBUILD_ACTIVEHARMONY=TRUE -DCMAKE_INSTALL_PREFIX=../install-tau .. make make tests make install After configuring, building and installing TAU and then configuring, building and installing APEX, the TAU profiling is enabled by setting the environment variable \"APEX_TAU=1\". 
After executing an example (say 'matmult'), there should be profile.* files in the working directory: khuck@ktau:~/src/xpress-apex$ export APEX_TAU=1 khuck@ktau:~/src/xpress-apex$ ./install/bin/matmult Spawned thread 1... Spawned thread 2... Spawned thread 3... Done. khuck@ktau:~/src/xpress-apex$ ls profile.* profile.0.0.0 profile.0.0.1 profile.0.0.2 profile.0.0.3 profile.0.0.4 profile.0.0.5 If the TAU analysis utilities are in your path, you can execute paraprof to view the profiles: khuck@ktau:~/src/xpress-apex$ paraprof ...which should launch the ParaProf profile viewer/analysis program. The profile should look something like the following (for a complete manual on using ParaProf, see the TAU website). If you want to collect a TAU trace, you would enable the appropriate TAU environment variable (TAU_TRACE=1), and then re-run the example. After the execution, the trace files need to be merged (using tau_treemerge.pl) and then converted (with tau2slog2) to be viewed with the Jumpshot trace viewer (included with TAU): khuck@ktau:~/src/xpress-apex$ export APEX_TAU=1 khuck@ktau:~/src/xpress-apex$ export TAU_TRACE=1 khuck@ktau:~/src/xpress-apex$ ./install/bin/matmult Spawned thread 1... Spawned thread 2... Spawned thread 3... Done. khuck@ktau:~/src/xpress-apex$ ls *.edf *.trc events.0.edf tautrace.0.0.1.trc tautrace.0.0.3.trc tautrace.0.0.5.trc tautrace.0.0.0.trc tautrace.0.0.2.trc tautrace.0.0.4.trc # merge the trace khuck@ktau:~/src/xpress-apex$ tau_treemerge.pl /home/khuck/src/tau2/x86_64/bin/tau_merge -m tau.edf -e events.0.edf events.0.edf events.0.edf events.0.edf events.0.edf events.0.edf tautrace.0.0.0.trc tautrace.0.0.1.trc tautrace.0.0.2.trc tautrace.0.0.3.trc tautrace.0.0.4.trc tautrace.0.0.5.trc tau.trc tautrace.0.0.0.trc: 34 records read. tautrace.0.0.1.trc: 8 records read. tautrace.0.0.2.trc: 8 records read. tautrace.0.0.3.trc: 30 records read. tautrace.0.0.4.trc: 30 records read. tautrace.0.0.5.trc: 30 records read. 
# convert the trace khuck@ktau:~/src/xpress-apex$ tau2slog2 tau.trc tau.edf -o tau.slog2 140 records initialized. Processing. 2 Records read. 1% converted 4 Records read. 2% converted 6 Records read. 4% converted 8 Records read. 5% converted 10 Records read. 7% converted 12 Records read. 8% converted 14 Records read. 10% converted 16 Records read. 11% converted 18 Records read. 12% converted 20 Records read. 14% converted 22 Records read. 15% converted 24 Records read. 17% converted 26 Records read. 18% converted 28 Records read. 20% converted 30 Records read. 21% converted 32 Records read. 22% converted 34 Records read. 24% converted 36 Records read. 25% converted 38 Records read. 27% converted 40 Records read. 28% converted 42 Records read. 30% converted 44 Records read. 31% converted 46 Records read. 32% converted 48 Records read. 34% converted 50 Records read. 35% converted 52 Records read. 37% converted 54 Records read. 38% converted 56 Records read. 40% converted 58 Records read. 41% converted 60 Records read. 42% converted 62 Records read. 44% converted 64 Records read. 45% converted 66 Records read. 47% converted 68 Records read. 48% converted 70 Records read. 50% converted 72 Records read. 51% converted 74 Records read. 52% converted 76 Records read. 54% converted 78 Records read. 55% converted 80 Records read. 57% converted 82 Records read. 58% converted 84 Records read. 60% converted 86 Records read. 61% converted 88 Records read. 62% converted 90 Records read. 64% converted 92 Records read. 65% converted 94 Records read. 67% converted 96 Records read. 68% converted 98 Records read. 70% converted 100 Records read. 71% converted 102 Records read. 72% converted 104 Records read. 74% converted 106 Records read. 75% converted 108 Records read. 77% converted 110 Records read. 78% converted 112 Records read. 80% converted 114 Records read. 81% converted 116 Records read. 82% converted 118 Records read. 84% converted 120 Records read. 
85% converted 122 Records read. 87% converted 124 Records read. 88% converted 1521 enters: 0 exits: 0 126 Records read. 90% converted 1521 enters: 0 exits: 0 128 Records read. 91% converted 130 Records read. 92% converted 1521 enters: 0 exits: 0 132 Records read. 94% converted 1521 enters: 0 exits: 0 134 Records read. 95% converted 136 Records read. 97% converted 1521 enters: 0 exits: 0 138 Records read. 98% converted 1521 enters: 0 exits: 0 140 Records read. 100% converted Reached end of trace file. Getting YMap, Maxnode: 0, Maxthread: 5 SLOG-2 Header: version = SLOG 2.0.6 NumOfChildrenPerNode = 2 TreeLeafByteSize = 65536 MaxTreeDepth = 0 MaxBufferByteSize = 1960 Categories is FBinfo(641 @ 2068) MethodDefs is FBinfo(0 @ 0) LineIDMaps is FBinfo(197 @ 2709) TreeRoot is FBinfo(1960 @ 108) TreeDir is FBinfo(38 @ 2906) Annotations is FBinfo(0 @ 0) Postamble is FBinfo(0 @ 0) 1521 enters: 0 exits: 0 Number of Drawables = 58 timeElapsed between 1 & 2 = 67 msec timeElapsed between 2 & 3 = 28 msec # open jumpshot khuck@ktau:~/src/xpress-apex$ jumpshot tau.slog2 Policy Rules and Runtime Adaptation \u00b6 ...Coming soon!","title":"Before you start"},{"location":"usecases/#before_you_start","text":"All examples on this page assume you have downloaded, configured and built APEX. See the Getting Started page for instructions on how to do that.","title":"Before you start"},{"location":"usecases/#simple_example","text":"In the APEX installation directory, there is a bin directory. In the bin directory are a number of examples, one of which is a simple matrix multiplication example, matmult . To run the matmult example, simply type 'matmult'. The output should be something like this: khuck@ktau:~/src/apex/install/bin$ ./matmult Spawned thread 1... Spawned thread 2... Spawned thread 3... Done. Not very interesting, eh? 
To see what APEX measured, set the APEX_SCREEN_OUTPUT environment variable to 1, and run it again: khuck@ktau:~/src/apex/install/bin$ export APEX_SCREEN_OUTPUT=1 khuck@ktau:~/src/apex/install/bin$ ./matmult v0.1-e050e17-master Built on: 14:38:56 Dec 22 2015 C++ Language Standard version : 201402 GCC Compiler version : 5.2.1 20151010 APEX_TAU : 0 APEX_POLICY : 1 APEX_MEASURE_CONCURRENCY : 0 APEX_MEASURE_CONCURRENCY_PERIOD : 1000000 APEX_SCREEN_OUTPUT : 1 APEX_PROFILE_OUTPUT : 0 APEX_CSV_OUTPUT : 1 APEX_TASKGRAPH_OUTPUT : 0 APEX_PROC_CPUINFO : 0 APEX_PROC_MEMINFO : 0 APEX_PROC_NET_DEV : 0 APEX_PROC_SELF_STATUS : 0 APEX_PROC_STAT : 1 APEX_THROTTLE_CONCURRENCY : 1 APEX_THROTTLING_MAX_THREADS : 8 APEX_THROTTLING_MIN_THREADS : 1 APEX_THROTTLE_ENERGY : 0 APEX_THROTTLING_MAX_WATTS : 300 APEX_THROTTLING_MIN_WATTS : 150 APEX_PTHREAD_WRAPPER_STACK_SIZE : 0 APEX_PAPI_METRICS : Spawned thread 1... Spawned thread 2... Spawned thread 3... Done. CPU is 2.66013e+09 Hz. Elapsed time: 0.966516 Cores detected: 8 Worker Threads observed: 4 Available CPU time: 3.86607 Action : #calls | minimum | mean | maximum | total | stddev | % total ------------------------------------------------------------------------------------------------------------ allocateMatrix : 12 --n/a-- 1.94e-02 --n/a-- 2.33e-01 --n/a-- 6.014 compute : 4 --n/a-- 6.89e-01 --n/a-- 2.76e+00 --n/a-- 71.279 compute_interchange : 4 --n/a-- 1.85e-01 --n/a-- 7.38e-01 --n/a-- 19.091 do_work : 4 --n/a-- 9.43e-01 --n/a-- 3.77e+00 --n/a-- 97.601 freeMatrix : 12 --n/a-- 2.36e-04 --n/a-- 2.83e-03 --n/a-- 0.073 initialize : 12 --n/a-- 3.56e-03 --n/a-- 4.27e-02 --n/a-- 1.104 main : 1 --n/a-- 9.66e-01 --n/a-- 9.66e-01 --n/a-- 24.983 APEX Idle : --n/a-- --n/a-- --n/a-- --n/a-- --n/a-- --n/a-- --n/a-- ------------------------------------------------------------------------------------------------------------ In this output, we see the status of all of the environment variables (as read by APEX at initialization), the regular program 
output, and then a summary from APEX at the end. Because APEX captures timestamps using the low-overhead rdtsc function call (where available), the measurements are done in cycles. APEX estimates the Hz rating of the CPU to convert to seconds for output. APEX reports the elapsed wall-clock time, the number of cores detected, the number of worker threads observed, as well as the total available CPU time (wall-clock times workers).","title":"Simple example"},{"location":"usecases/#openmp_example","text":"In the APEX installation directory, there is a bin directory. In the bin directory are a number of examples, one of which is the OpenMP implementation of LULESH (for details, see the LLNL explanation of LULESH ). When APEX is configured with OpenMP OMPT support (using the -DBUILD_OMPT=TRUE or equivalent CMake configuration settings) it will measure OpenMP events. Executing the LULESH example (with APEX_SCREEN_OUTPUT=1) gives the following output: khuck@ktau:~/src/apex$ ./install/bin/lulesh_OpenMP_2.0 v0.1-e050e17-master Built on: 14:38:56 Dec 22 2015 C++ Language Standard version : 201402 GCC Compiler version : 5.2.1 20151010 APEX_TAU : 0 APEX_POLICY : 1 APEX_MEASURE_CONCURRENCY : 0 APEX_MEASURE_CONCURRENCY_PERIOD : 1000000 APEX_SCREEN_OUTPUT : 1 APEX_PROFILE_OUTPUT : 0 APEX_CSV_OUTPUT : 1 APEX_TASKGRAPH_OUTPUT : 0 APEX_PROC_CPUINFO : 0 APEX_PROC_MEMINFO : 0 APEX_PROC_NET_DEV : 0 APEX_PROC_SELF_STATUS : 0 APEX_PROC_STAT : 1 APEX_THROTTLE_CONCURRENCY : 1 APEX_THROTTLING_MAX_THREADS : 8 APEX_THROTTLING_MIN_THREADS : 1 APEX_THROTTLE_ENERGY : 0 APEX_THROTTLING_MAX_WATTS : 300 APEX_THROTTLING_MIN_WATTS : 150 APEX_PTHREAD_WRAPPER_STACK_SIZE : 0 APEX_PAPI_METRICS : Running problem size 30^3 per domain until completion Num processors: 1 Registering OMPT events...done. Num threads: 8 Total number of elements: 27000 To run other sizes, use -s . To run a fixed number of iterations, use -i . To run a more or less balanced region set, use -b . 
To change the relative costs of regions, use -c . To print out progress, use -p To write an output file for VisIt, use -v See help (-h) for more options APEX: disabling lightweight timer OpenMP_BARRIER: CalcPressur... APEX: disabling lightweight timer OpenMP_BARRIER: CalcPressur... APEX: disabling lightweight timer OpenMP_BARRIER: EvalEOSForE... APEX: disabling lightweight timer OpenMP_BARRIER: CalcEnergyF... APEX: disabling lightweight timer OpenMP_BARRIER: CalcEnergyF... APEX: disabling lightweight timer OpenMP_BARRIER: CalcEnergyF... APEX: disabling lightweight timer OpenMP_BARRIER: CalcEnergyF... APEX: disabling lightweight timer OpenMP_BARRIER: CalcEnergyF... APEX: disabling lightweight timer OpenMP_BARRIER: CalcCourant... APEX: disabling lightweight timer OpenMP_BARRIER: CalcHydroCo... APEX: disabling lightweight timer OpenMP_BARRIER: CalcMonoton... APEX: disabling lightweight timer OpenMP_BARRIER: EvalEOSForE... APEX: disabling lightweight timer OpenMP_BARRIER: CalcSoundSp... APEX: disabling lightweight timer OpenMP_BARRIER: InitStressT... APEX: disabling lightweight timer OpenMP_BARRIER: CalcVolumeF... APEX: disabling lightweight timer OpenMP_BARRIER: CalcAcceler... APEX: disabling lightweight timer OpenMP_BARRIER: CalcVelocit... APEX: disabling lightweight timer OpenMP_BARRIER: CalcPositio... APEX: disabling lightweight timer OpenMP_BARRIER: CalcLagrang... APEX: disabling lightweight timer OpenMP_BARRIER: UpdateVolum... APEX: disabling lightweight timer OpenMP_BARRIER: ApplyAccele... APEX: disabling lightweight timer OpenMP_BARRIER: CalcForceFo... Run completed: Problem size = 30 MPI tasks = 1 Iteration count = 932 Final Origin Energy = 2.025075e+05 Testing Plane 0 of Energy Array on rank 0: MaxAbsDiff = 6.548362e-11 TotalAbsDiff = 8.615093e-10 MaxRelDiff = 1.461140e-12 Elapsed time = 55.00 (s) Grind time (us/z/c) = 2.1855548 (per dom) ( 2.1855548 overall) FOM = 457.54973 (z/s) CPU is 2.66013e+09 Hz. 
Elapsed time: 55.0085 Cores detected: 8 Worker Threads observed: 8 Available CPU time: 440.068 Action : #calls | minimum | mean | maximum | total | stddev | % total ------------------------------------------------------------------------------------------------------------ CPU Guest % : 54 0.000 0.000 0.000 0.000 0.000 --n/a-- CPU I/O Wait % : 54 0.000 0.040 0.714 2.143 0.133 --n/a-- CPU IRQ % : 54 0.000 0.000 0.000 0.000 0.000 --n/a-- CPU Idle % : 54 0.857 1.384 4.857 74.714 0.763 --n/a-- CPU Nice % : 54 0.000 0.000 0.000 0.000 0.000 --n/a-- CPU Steal % : 54 0.000 0.000 0.000 0.000 0.000 --n/a-- CPU System % : 54 15.286 23.339 26.714 1260.286 2.301 --n/a-- CPU User % : 54 84.143 88.373 97.143 4772.143 2.268 --n/a-- CPU soft IRQ % : 54 0.000 0.026 0.286 1.429 0.068 --n/a-- OpenMP_BARRIER: ApplyAccele... : DISABLED (high frequency, short duration) OpenMP_BARRIER: ApplyMateri... : 14912 --n/a-- 3.96e-05 --n/a-- 5.91e-01 --n/a-- 0.134 OpenMP_BARRIER: CalcAcceler... : DISABLED (high frequency, short duration) OpenMP_BARRIER: CalcCourant... : DISABLED (high frequency, short duration) OpenMP_BARRIER: CalcEnergyF... : DISABLED (high frequency, short duration) OpenMP_BARRIER: CalcEnergyF... : DISABLED (high frequency, short duration) OpenMP_BARRIER: CalcEnergyF... : DISABLED (high frequency, short duration) OpenMP_BARRIER: CalcEnergyF... : DISABLED (high frequency, short duration) OpenMP_BARRIER: CalcEnergyF... : DISABLED (high frequency, short duration) OpenMP_BARRIER: CalcFBHourg... : 7456 --n/a-- 1.11e-04 --n/a-- 8.27e-01 --n/a-- 0.188 OpenMP_BARRIER: CalcFBHourg... : 7456 --n/a-- 1.49e-04 --n/a-- 1.11e+00 --n/a-- 0.252 OpenMP_BARRIER: CalcForceFo... : DISABLED (high frequency, short duration) OpenMP_BARRIER: CalcHourgla... : 7456 --n/a-- 1.32e-04 --n/a-- 9.84e-01 --n/a-- 0.224 OpenMP_BARRIER: CalcHydroCo... : DISABLED (high frequency, short duration) OpenMP_BARRIER: CalcKinemat... : 7456 --n/a-- 7.88e-05 --n/a-- 5.88e-01 --n/a-- 0.134 OpenMP_BARRIER: CalcLagrang... 
: DISABLED (high frequency, short duration) OpenMP_BARRIER: CalcMonoton... : 7456 --n/a-- 6.98e-05 --n/a-- 5.21e-01 --n/a-- 0.118 OpenMP_BARRIER: CalcMonoton... : DISABLED (high frequency, short duration) OpenMP_BARRIER: CalcPositio... : DISABLED (high frequency, short duration) OpenMP_BARRIER: CalcPressur... : DISABLED (high frequency, short duration) OpenMP_BARRIER: CalcPressur... : DISABLED (high frequency, short duration) OpenMP_BARRIER: CalcSoundSp... : DISABLED (high frequency, short duration) OpenMP_BARRIER: CalcVelocit... : DISABLED (high frequency, short duration) OpenMP_BARRIER: CalcVolumeF... : DISABLED (high frequency, short duration) OpenMP_BARRIER: EvalEOSForE... : DISABLED (high frequency, short duration) OpenMP_BARRIER: EvalEOSForE... : DISABLED (high frequency, short duration) OpenMP_BARRIER: InitStressT... : DISABLED (high frequency, short duration) OpenMP_BARRIER: IntegrateSt... : 7456 --n/a-- 6.66e-05 --n/a-- 4.97e-01 --n/a-- 0.113 OpenMP_BARRIER: IntegrateSt... : 7456 --n/a-- 1.28e-04 --n/a-- 9.54e-01 --n/a-- 0.217 OpenMP_BARRIER: UpdateVolum... : DISABLED (high frequency, short duration) OpenMP_PARALLEL_REGION: App... : 932 --n/a-- 1.09e-04 --n/a-- 1.01e-01 --n/a-- 0.023 OpenMP_PARALLEL_REGION: App... : 932 --n/a-- 2.58e-04 --n/a-- 2.40e-01 --n/a-- 0.055 OpenMP_PARALLEL_REGION: Cal... : 932 --n/a-- 7.83e-04 --n/a-- 7.30e-01 --n/a-- 0.166 OpenMP_PARALLEL_REGION: Cal... : 10252 --n/a-- 7.72e-05 --n/a-- 7.91e-01 --n/a-- 0.180 OpenMP_PARALLEL_REGION: Cal... : 32620 --n/a-- 4.29e-05 --n/a-- 1.40e+00 --n/a-- 0.318 OpenMP_PARALLEL_REGION: Cal... : 32620 --n/a-- 5.07e-05 --n/a-- 1.65e+00 --n/a-- 0.376 OpenMP_PARALLEL_REGION: Cal... : 32620 --n/a-- 3.31e-05 --n/a-- 1.08e+00 --n/a-- 0.245 OpenMP_PARALLEL_REGION: Cal... : 32620 --n/a-- 4.75e-05 --n/a-- 1.55e+00 --n/a-- 0.352 OpenMP_PARALLEL_REGION: Cal... : 32620 --n/a-- 4.09e-05 --n/a-- 1.34e+00 --n/a-- 0.303 OpenMP_PARALLEL_REGION: Cal... 
: 932 --n/a-- 8.10e-03 --n/a-- 7.55e+00 --n/a-- 1.715 OpenMP_PARALLEL_REGION: Cal... : 932 --n/a-- 3.51e-03 --n/a-- 3.28e+00 --n/a-- 0.744 OpenMP_PARALLEL_REGION: Cal... : 932 --n/a-- 4.34e-04 --n/a-- 4.05e-01 --n/a-- 0.092 OpenMP_PARALLEL_REGION: Cal... : 932 --n/a-- 4.27e-03 --n/a-- 3.98e+00 --n/a-- 0.905 OpenMP_PARALLEL_REGION: Cal... : 10252 --n/a-- 4.72e-05 --n/a-- 4.84e-01 --n/a-- 0.110 OpenMP_PARALLEL_REGION: Cal... : 932 --n/a-- 1.68e-03 --n/a-- 1.57e+00 --n/a-- 0.356 OpenMP_PARALLEL_REGION: Cal... : 932 --n/a-- 2.29e-04 --n/a-- 2.13e-01 --n/a-- 0.048 OpenMP_PARALLEL_REGION: Cal... : 932 --n/a-- 1.15e-03 --n/a-- 1.07e+00 --n/a-- 0.244 OpenMP_PARALLEL_REGION: Cal... : 10252 --n/a-- 2.29e-04 --n/a-- 2.34e+00 --n/a-- 0.533 OpenMP_PARALLEL_REGION: Cal... : 932 --n/a-- 4.98e-04 --n/a-- 4.64e-01 --n/a-- 0.105 OpenMP_PARALLEL_REGION: Cal... : 97860 --n/a-- 3.26e-05 --n/a-- 3.19e+00 --n/a-- 0.725 OpenMP_PARALLEL_REGION: Cal... : 97860 --n/a-- 3.20e-05 --n/a-- 3.13e+00 --n/a-- 0.712 OpenMP_PARALLEL_REGION: Cal... : 10252 --n/a-- 4.52e-05 --n/a-- 4.63e-01 --n/a-- 0.105 OpenMP_PARALLEL_REGION: Cal... : 932 --n/a-- 3.39e-04 --n/a-- 3.16e-01 --n/a-- 0.072 OpenMP_PARALLEL_REGION: Cal... : 932 --n/a-- 1.57e-04 --n/a-- 1.47e-01 --n/a-- 0.033 OpenMP_PARALLEL_REGION: Eva... : 32620 --n/a-- 1.07e-04 --n/a-- 3.50e+00 --n/a-- 0.796 OpenMP_PARALLEL_REGION: Eva... : 10252 --n/a-- 2.86e-05 --n/a-- 2.93e-01 --n/a-- 0.067 OpenMP_PARALLEL_REGION: Ini... : 932 --n/a-- 3.52e-04 --n/a-- 3.28e-01 --n/a-- 0.074 OpenMP_PARALLEL_REGION: Int... : 932 --n/a-- 3.14e-03 --n/a-- 2.93e+00 --n/a-- 0.666 OpenMP_PARALLEL_REGION: Int... : 932 --n/a-- 2.18e-03 --n/a-- 2.03e+00 --n/a-- 0.461 OpenMP_PARALLEL_REGION: Upd... 
: 932 --n/a-- 1.34e-04 --n/a-- 1.25e-01 --n/a-- 0.028 APEX Idle : --n/a-- --n/a-- --n/a-- --n/a-- 3.87e+02 --n/a-- 88.011 ------------------------------------------------------------------------------------------------------------ There are several lightweight events that APEX elects to ignore. The other events are timed by APEX and reported at exit, along with the /proc/stat data (CPU % counters).","title":"OpenMP example"},{"location":"usecases/#with_papi","text":"When APEX is configured with PAPI support (using -DPAPI_ROOT=/path/to/papi and -DUSE_PAPI=TRUE), hardware counter data can also be collected by APEX. To specify hardware counters of interest, use the APEX_PAPI_METRICS environment variable: khuck@ktau:~/src/apex$ export APEX_PAPI_METRICS=\"PAPI_TOT_INS PAPI_L2_TCM\" ...and then execute as normal: khuck@ktau:~/src/apex$ ./install/bin/matmult v0.1-e050e17-master Built on: 14:38:56 Dec 22 2015 C++ Language Standard version : 201402 GCC Compiler version : 5.2.1 20151010 APEX_TAU : 1 APEX_POLICY : 1 APEX_MEASURE_CONCURRENCY : 0 APEX_MEASURE_CONCURRENCY_PERIOD : 1000000 APEX_SCREEN_OUTPUT : 1 APEX_PROFILE_OUTPUT : 0 APEX_CSV_OUTPUT : 1 APEX_TASKGRAPH_OUTPUT : 0 APEX_PROC_CPUINFO : 0 APEX_PROC_MEMINFO : 0 APEX_PROC_NET_DEV : 0 APEX_PROC_SELF_STATUS : 0 APEX_PROC_STAT : 1 APEX_THROTTLE_CONCURRENCY : 1 APEX_THROTTLING_MAX_THREADS : 8 APEX_THROTTLING_MIN_THREADS : 1 APEX_THROTTLE_ENERGY : 0 APEX_THROTTLING_MAX_WATTS : 300 APEX_THROTTLING_MIN_WATTS : 150 APEX_PTHREAD_WRAPPER_STACK_SIZE : 0 APEX_PAPI_METRICS : PAPI_TOT_INS PAPI_L2_TCM Spawned thread 1... Spawned thread 2... Spawned thread 3... Done. CPU is 2.66019e+09 Hz. 
Elapsed time: 0.954974 Cores detected: 8 Worker Threads observed: 4 Available CPU time: 3.81989 Action : #calls | minimum | mean | maximum | total | stddev | % total PAPI_TOT_INS PAPI_L2_TCM ------------------------------------------------------------------------------------------------------------ allocateMatrix : 12 --n/a-- 2.21e-02 --n/a-- 2.65e-01 --n/a-- 6.930 1.62e+06 9.10e+03 compute : 4 --n/a-- 6.85e-01 --n/a-- 2.74e+00 --n/a-- 71.743 4.31e+09 1.71e+06 compute_interchange : 4 --n/a-- 1.81e-01 --n/a-- 7.23e-01 --n/a-- 18.922 3.77e+09 8.12e+05 do_work : 4 --n/a-- 9.44e-01 --n/a-- 3.78e+00 --n/a-- 98.851 8.10e+09 2.92e+06 freeMatrix : 12 --n/a-- 2.07e-04 --n/a-- 2.49e-03 --n/a-- 0.065 1.13e+06 6.30e+03 initialize : 12 --n/a-- 3.58e-03 --n/a-- 4.29e-02 --n/a-- 1.124 2.21e+07 3.80e+05 main : 1 --n/a-- 9.54e-01 --n/a-- 9.54e-01 --n/a-- 24.978 2.03e+09 7.66e+05 APEX Idle : --n/a-- --n/a-- --n/a-- --n/a-- --n/a-- --n/a-- --n/a-- ------------------------------------------------------------------------------------------------------------","title":"With PAPI"},{"location":"usecases/#csv_output","text":"While APEX is not designed for post-mortem performance analysis, you can export the data that APEX collected. 
If you set the APEX_CSV_OUTPUT environment variable to 1, APEX will also dump the timer statistics as a CSV file: khuck@ktau:~/src/apex$ cat apex.0.csv \"task\",\"num calls\",\"total cycles\",\"total microseconds\",\"PAPI_TOT_INS\",\"PAPI_L2_TCM\" \"allocateMatrix\",12,704195504,264717,1615804,9100 \"compute\",4,7290209200,2740489,4306522734,1709040 \"compute_interchange\",4,1922797744,722806,3769652571,812196 \"do_work\",4,10044907856,3776018,8101109302,2922142 \"freeMatrix\",12,6613336,2486,1132717,6301 \"initialize\",12,114177592,42921,22093639,379785 \"main\",1,2538202992,954145,2025172707,766218","title":"CSV output"},{"location":"usecases/#with_tau","text":"If APEX is configured with TAU support, then APEX measurements will be forwarded to TAU and recorded as a TAU profile. In addition, all other TAU features are supported, including sampling, MPI measurement, I/O measurement, tracing, etc. To configure APEX with TAU, specify the flags -DUSE_TAU, -DTAU_ROOT, -DTAU_ARCH, and -DTAU_OPTIONS. For example, if TAU was configured with \"./configure -pthread\" on an x86_64 Linux machine, the APEX configuration options would be \"-DUSE_TAU=1 -DTAU_ROOT=/path/to/tau -DTAU_ARCH=x86_64 -DTAU_OPTIONS=-pthread\". If TAU was configured with \"./configure -mpi -pthread\" on an x86_64 Linux machine, the APEX configuration options would be \"-DUSE_TAU=1 -DTAU_ROOT=/path/to/tau -DTAU_ARCH=x86_64 -DTAU_OPTIONS=-mpi-pthread\". 
Here is a suggested configuration for TAU on x86-Linux to use with APEX (some systems require special flags - please contact the maintainers if you are interested): # download the latest TAU release wget http://www.cs.uoregon.edu/research/paracomp/tau/tauprofile/dist/tau_latest.tar.gz # expand the tar file tar -xvzf tau_latest.tar.gz cd tau-2.25 # configure TAU ./configure -papi=/usr/local/papi/5.3.2 -pthread -prefix=/usr/local/tau/2.25 # build make -j install # set our path to include the new TAU installation export PATH=$PATH:/usr/local/tau/2.25/x86_64/bin Here is a suggested configuration for APEX to use the above TAU installation: cd xpress-apex mkdir build-tau cd build-tau cmake -DBUILD_EXAMPLES=TRUE -DBUILD_TESTS=TRUE -DCMAKE_BUILD_TYPE=RelWithDebInfo \\ -DUSE_TAU=TRUE -DTAU_ROOT=/usr/local/tau/2.25 -DTAU_ARCH=x86_64 -DTAU_OPTIONS=-papi-pthread \\ -DBUILD_BFD=TRUE -DBUILD_ACTIVEHARMONY=TRUE -DCMAKE_INSTALL_PREFIX=../install-tau .. make make tests make install After configuring, building and installing TAU and then configuring, building and installing APEX, the TAU profiling is enabled by setting the environment variable \"APEX_TAU=1\". After executing an example (say 'matmult'), there should be profile.* files in the working directory: khuck@ktau:~/src/xpress-apex$ export APEX_TAU=1 khuck@ktau:~/src/xpress-apex$ ./install/bin/matmult Spawned thread 1... Spawned thread 2... Spawned thread 3... Done. khuck@ktau:~/src/xpress-apex$ ls profile.* profile.0.0.0 profile.0.0.1 profile.0.0.2 profile.0.0.3 profile.0.0.4 profile.0.0.5 If the TAU analysis utilities are in your path, you can execute paraprof to view the profiles: khuck@ktau:~/src/xpress-apex$ paraprof ...which should launch the ParaProf profile viewer/analysis program. The profile should look something like the following (for a complete manual on using ParaProf, see the TAU website). 
If you want to collect a TAU trace, you would enable the appropriate TAU environment variable (TAU_TRACE=1), and then re-run the example. After the execution, the trace files need to be merged (using tau_treemerge.pl) and then converted (with tau2slog2) to be viewed with the Jumpshot trace viewer (included with TAU): khuck@ktau:~/src/xpress-apex$ export APEX_TAU=1 khuck@ktau:~/src/xpress-apex$ export TAU_TRACE=1 khuck@ktau:~/src/xpress-apex$ ./install/bin/matmult Spawned thread 1... Spawned thread 2... Spawned thread 3... Done. khuck@ktau:~/src/xpress-apex$ ls *.edf *.trc events.0.edf tautrace.0.0.1.trc tautrace.0.0.3.trc tautrace.0.0.5.trc tautrace.0.0.0.trc tautrace.0.0.2.trc tautrace.0.0.4.trc # merge the trace khuck@ktau:~/src/xpress-apex$ tau_treemerge.pl /home/khuck/src/tau2/x86_64/bin/tau_merge -m tau.edf -e events.0.edf events.0.edf events.0.edf events.0.edf events.0.edf events.0.edf tautrace.0.0.0.trc tautrace.0.0.1.trc tautrace.0.0.2.trc tautrace.0.0.3.trc tautrace.0.0.4.trc tautrace.0.0.5.trc tau.trc tautrace.0.0.0.trc: 34 records read. tautrace.0.0.1.trc: 8 records read. tautrace.0.0.2.trc: 8 records read. tautrace.0.0.3.trc: 30 records read. tautrace.0.0.4.trc: 30 records read. tautrace.0.0.5.trc: 30 records read. # convert the trace khuck@ktau:~/src/xpress-apex$ tau2slog2 tau.trc tau.edf -o tau.slog2 140 records initialized. Processing. 2 Records read. 1% converted 4 Records read. 2% converted 6 Records read. 4% converted 8 Records read. 5% converted 10 Records read. 7% converted 12 Records read. 8% converted 14 Records read. 10% converted 16 Records read. 11% converted 18 Records read. 12% converted 20 Records read. 14% converted 22 Records read. 15% converted 24 Records read. 17% converted 26 Records read. 18% converted 28 Records read. 20% converted 30 Records read. 21% converted 32 Records read. 22% converted 34 Records read. 24% converted 36 Records read. 25% converted 38 Records read. 27% converted 40 Records read. 28% converted 42 Records read. 
30% converted 44 Records read. 31% converted 46 Records read. 32% converted 48 Records read. 34% converted 50 Records read. 35% converted 52 Records read. 37% converted 54 Records read. 38% converted 56 Records read. 40% converted 58 Records read. 41% converted 60 Records read. 42% converted 62 Records read. 44% converted 64 Records read. 45% converted 66 Records read. 47% converted 68 Records read. 48% converted 70 Records read. 50% converted 72 Records read. 51% converted 74 Records read. 52% converted 76 Records read. 54% converted 78 Records read. 55% converted 80 Records read. 57% converted 82 Records read. 58% converted 84 Records read. 60% converted 86 Records read. 61% converted 88 Records read. 62% converted 90 Records read. 64% converted 92 Records read. 65% converted 94 Records read. 67% converted 96 Records read. 68% converted 98 Records read. 70% converted 100 Records read. 71% converted 102 Records read. 72% converted 104 Records read. 74% converted 106 Records read. 75% converted 108 Records read. 77% converted 110 Records read. 78% converted 112 Records read. 80% converted 114 Records read. 81% converted 116 Records read. 82% converted 118 Records read. 84% converted 120 Records read. 85% converted 122 Records read. 87% converted 124 Records read. 88% converted 1521 enters: 0 exits: 0 126 Records read. 90% converted 1521 enters: 0 exits: 0 128 Records read. 91% converted 130 Records read. 92% converted 1521 enters: 0 exits: 0 132 Records read. 94% converted 1521 enters: 0 exits: 0 134 Records read. 95% converted 136 Records read. 97% converted 1521 enters: 0 exits: 0 138 Records read. 98% converted 1521 enters: 0 exits: 0 140 Records read. 100% converted Reached end of trace file. 
Getting YMap, Maxnode: 0, Maxthread: 5 SLOG-2 Header: version = SLOG 2.0.6 NumOfChildrenPerNode = 2 TreeLeafByteSize = 65536 MaxTreeDepth = 0 MaxBufferByteSize = 1960 Categories is FBinfo(641 @ 2068) MethodDefs is FBinfo(0 @ 0) LineIDMaps is FBinfo(197 @ 2709) TreeRoot is FBinfo(1960 @ 108) TreeDir is FBinfo(38 @ 2906) Annotations is FBinfo(0 @ 0) Postamble is FBinfo(0 @ 0) 1521 enters: 0 exits: 0 Number of Drawables = 58 timeElapsed between 1 & 2 = 67 msec timeElapsed between 2 & 3 = 28 msec # open jumpshot khuck@ktau:~/src/xpress-apex$ jumpshot tau.slog2","title":"With TAU"},{"location":"usecases/#policy_rules_and_runtime_adaptation","text":"...Coming soon!","title":"Policy Rules and Runtime Adaptation"}]}
\ No newline at end of file
diff --git a/sitemap.xml b/sitemap.xml
index c5928429..c5606384 100644
--- a/sitemap.xml
+++ b/sitemap.xml
@@ -2,62 +2,62 @@
http://UO-OACISS.github.io/apex/
- 2024-02-07
+ 2024-02-22
daily
http://UO-OACISS.github.io/apex/environment/
- 2024-02-07
+ 2024-02-22
daily
http://UO-OACISS.github.io/apex/examples/
- 2024-02-07
+ 2024-02-22
daily
http://UO-OACISS.github.io/apex/feature/
- 2024-02-07
+ 2024-02-22
daily
http://UO-OACISS.github.io/apex/hpx5/
- 2024-02-07
+ 2024-02-22
daily
http://UO-OACISS.github.io/apex/install/
- 2024-02-07
+ 2024-02-22
daily
http://UO-OACISS.github.io/apex/quickstart/
- 2024-02-07
+ 2024-02-22
daily
http://UO-OACISS.github.io/apex/quickstarthpx/
- 2024-02-07
+ 2024-02-22
daily
http://UO-OACISS.github.io/apex/refman/
- 2024-02-07
+ 2024-02-22
daily
http://UO-OACISS.github.io/apex/spec/
- 2024-02-07
+ 2024-02-22
daily
http://UO-OACISS.github.io/apex/usage/
- 2024-02-07
+ 2024-02-22
daily
http://UO-OACISS.github.io/apex/usecases/
- 2024-02-07
+ 2024-02-22
daily
\ No newline at end of file
diff --git a/sitemap.xml.gz b/sitemap.xml.gz
index 69cfdf804c0efe5d98534ad306632cb2f83896a3..99caf275ffff7e3e9eab4eab570fe4a6b37a8eae 100644
GIT binary patch
delta 291
zcmV+;0o?wl0;d9hABzYGZim-p0{?SqbY*Q}a4vXlYyj2O&uhaV6bJBo|B8@%Gz}YU
zBe9IZV5c6s{sGRsBr<>1XPWH4pKZg)d57V=sE@CY4+vWCX05U_c$C&O`IeM<21eLc
zn!d@8@2|x}zGItu^a>$lF>J8OhY;KorPDMKlmRq%NUZOXe-3{x4Q{LJsFf9L>2ll`
z`{$Rp!-4cN4C4opmU0I(t?4DTKEyF-+p?;x?uxQ1swzg7+A1anZhGiEoLI}H`l{)w
za?{Ws&Tn8~&eB_>!Gwg7GuKW5rYr1#htWg2z`@P;>FT7xz?DiDcp9bnj7A^+gE3xq
px=at9<|bVqIuIM_>PF-}q|5`s@h@KL>lFWS_6-c8^49kU003iwlgt1B
delta 291
zcmV+;0o?wl0;d9hABzYG&&tAO0{?SqbY*Q}a4vXlYyj2O%WA|R6b9h^JVnSpnueBM
zNE}L`(A6xa^8nVIBvNnVF_ZM|>&%3bbr;3DQ4ha|4+O2ZvsT#&JW6Yt{Dzcy21eLc
zn!d>o@2|x}zGa(w^a>$lF>J8OhY;KorPDMKlmRq%NUZOXe-3}n4X&%psFf9L>2lZ=
zyXTj;{hstP4C4opmU0I(t?4Z_)!
z%2h*uIKP2`IZ1Dg1``rS&RjbMn69t`9!3x80tYwWrK^($16L|t;Bl1VGa7yP55{=e
p=`uZZnwxZe=s+8!s~eH`kTMSh$G>=~FH`)-**D=zhS&E8002wzjBEe^