Version 2.2.0

@khuck khuck released this 05 Aug 16:38
· 913 commits to develop since this release

This release contains many updates and fixes. Notable additions are support for CUDA/CUPTI events and the ability to detect MPI applications even when neither HPX nor APEX is configured with MPI support.
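
As a rough illustration of the MPI detection mentioned above, the following sketch checks launcher environment variables to validate a rank that was passed in. It is not the APEX implementation: `env_as_int`, `rank_info`, and `detect_rank` are hypothetical names, and the variable pairs are ones commonly exported by Open MPI (`OMPI_COMM_WORLD_RANK`/`OMPI_COMM_WORLD_SIZE`) and MPICH-derived launchers (`PMI_RANK`/`PMI_SIZE`); other launchers may need other names.

```cpp
#include <cstdlib>

// Return the value of an environment variable as a non-negative integer,
// or -1 if it is unset, empty, or not a plain decimal number.
int env_as_int(const char* name) {
    const char* value = std::getenv(name);
    if (value == nullptr || *value == '\0') return -1;
    char* end = nullptr;
    long parsed = std::strtol(value, &end, 10);
    if (*end != '\0' || parsed < 0) return -1;
    return static_cast<int>(parsed);
}

struct rank_info {
    int rank;
    int size;
};

// Hypothetical helper: prefer a rank/size pair found in the environment,
// otherwise fall back to the values the caller passed in.
rank_info detect_rank(int given_rank, int given_size) {
    // Assumed variable names; extend the table for other launchers.
    static const char* const candidates[][2] = {
        {"OMPI_COMM_WORLD_RANK", "OMPI_COMM_WORLD_SIZE"},  // Open MPI
        {"PMI_RANK", "PMI_SIZE"},                          // MPICH and derivatives
    };
    for (const auto& pair : candidates) {
        int rank = env_as_int(pair[0]);
        int size = env_as_int(pair[1]);
        if (rank >= 0 && size > 0 && rank < size) {
            return rank_info{rank, size};
        }
    }
    return rank_info{given_rank, given_size};
}
```

With no MPI variables set, `detect_rank(0, 1)` simply returns the values it was given, so a non-MPI run is unaffected.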

Changes:

  • Change to personal fork of concurrentqueue for stability
  • Cleaning up clang pedantic errors
  • Tweaking build system to support Windows
  • Merge pull request #122 from STEllAR-GROUP/fixing_windows_support
  • Adding annotation for process_profiles task
  • Cleaning up the dot/graphviz output
  • Adding "untied timers" option. With this option enabled, a profiler can be started on one OS thread and stopped on another. APEX won't keep track of the profiler stack.
  • Fixing unit conversion when writing out TAU profiles
  • Add capture of /proc/self/status Threads value
  • Capture the number of OS context switches
  • Cleaning up thread swap test
  • Adding additional error messages to PAPI component support
  • Debugging PAPI error checking
  • Updating to support binutils 2.34 API changes, adding the pthread.h include where needed
  • Updating deprecated HPX headers
  • First step in adding CUDA support: adding a CUDA example and adding CUDA/CUPTI headers through CMake.
  • Adding another CUDA example
  • Working kernel measurement
  • Basic callback and activity support enabled
  • Done with initial implementation
  • Disable thread affinity for HPX configurations
  • Minor change to support running in an MPI environment when MPI is not used by HPX or the APEX configuration. This happens when HPX is configured without a parcel port, and APEX thinks all ranks are 0. This change adds a check for MPI environment variables to validate the MPI rank that was passed in.
  • Adding MPI rank/size detection support for MPICH, which also covers MVAPICH, Intel, Cray, etc. Also added some PBS/Torque support, but unfortunately PBS/Torque doesn't provide an environment variable that specifies the total number of ranks. In the future, a special APEX environment variable could specify the total number of ranks, if needed.
  • Merge branch 'cuda_support' of github.com:khuck/xpress-apex into cuda_support
  • Adding CUDA task dependency support
  • Task dependency working! When GPU callbacks are made, we map the correlation ID to the task_wrapper associated with the parent. Then the GPU activity can be linked to the parent that launched it. Also added two more examples.
  • Working CUDA support with task graphs and correct annotations. This commit contains a nasty bug in task_identifier, where any identifier string gets modified "in place" when demangled. That can cause problems later if a map of said task_identifiers is modified. This will be merged to develop when the full support with tracing is merged.
  • Adding basic CUDA counters to the support for kernels and memory transfers.
  • Adding HPX config support for CUDA/CUPTI
  • Minor typo in HPX configuration
  • More changes for HPX support
  • Testing with CUDA 10.1 and fixing config. Testing with older CUDA revealed that some installations are different.
  • Fixing bugs in shutdown. During shutdown, the asynchronous buffers were processed but the static strings that some labels depended on went out of scope, so the strings got corrupted. This is fixed by using const char * strings instead of const std::string&. Also, the counters add too much overhead, so they are now optional.
  • Adding Google Chrome trace event support
  • Working (rudimentary) Google Trace Event support. This support only handles timers, no counters (yet).
  • Merge branch 'chrome_trace_event' into develop
  • Fixing implementation of public profile processing function to work with gcc 8
  • Minor change to add cudart to the link
  • Merge branch 'cuda_support' of https://github.com/khuck/xpress-apex into cuda_support
  • Minor changes to CUDA support and Google trace. The Google trace support needs to be refactored, but otherwise this seems to be working.
  • Merge branch 'cuda_support' into develop
  • Fixing time units in trace output
  • Cleaning up trace event output, making it more compact
  • Fixing Demangle/DEMANGLE inconsistency in CMake
  • Fixing DEMANGLE on a real computer
  • Fixing trace_event file creation for MPI runs
  • Fixing OTF2 clock to use new timestamps. Also updated GPU example to create many streams.
  • Adding unified memory support. Needs to be initialized AFTER cuInit.
  • Merge branch 'cuda_support' into cuda_and_trace_event
  • Fixing context for unified memory events
  • Cleaning up CUDA support, removing dead code
  • Forgot to add trace_event_listener.cpp to CMakeLists.hpx
  • Write trace events during shutdown, not destructor
  • Adding clock delta to account for difference between CPU and GPU clocks
  • CUPTI processing should ignore APEX non-worker threads
  • Cleanup parent assignment and finalization. Always assign the parent, if it's available. And during finalization, don't do anything until after we have checked whether APEX is disabled.
  • Added get_num_workers() routine to get just the worker count.
  • Assign a thread ID for all threads. The CUDA/CUPTI asynchronous processing thread needs to be able to generate GUIDs for asynchronous tasks. In order to generate GUIDs, the thread needs an ID. So, always assign an ID, but only increment the number of workers if the new thread is actually a worker.
  • Adding unified memory counter support and fixing parent. Adding two counters for page faults for unified memory. Also, when profilers for async events are created, pass in the task wrapper, which will call the right profiler constructor.
  • Minimizing trace output, adding GUIDs as args. Changing all timers to complete events to be compact. Writing GUID and parent GUID values for all timers. Added metadata tags for processes and threads so that they are sorted correctly.
  • Removing debug message
  • Adding OTF2 support for CUDA! CUDA offloaded events are now supported in APEX when writing out to OTF2. Still to do: the stream "threads" need to be annotated as GPU threads, and given device/context/stream labels.
  • Cleaning up trace event stream names
  • Fixing thread labels for GPU and CPU threads
  • Fixing event unification at the end of OTF2 tracing. When region names have spaces in them, the C++ istringstream parser will split them. Instead, just read the whole line into a string and split on the tab between the region ID and the name.
  • Adding support for cudaMalloc* bytes
  • Fixing race conditions in startup when PAPI NVML module starts making CUDA calls before APEX is ready to profile them.
  • Adding support for CUDA device API. In addition to the CUDA runtime API, the device API can also be wrapped with callbacks. Use APEX_CUDA_DEVICE_API=1 with or without APEX_CUDA_RUNTIME_API=1 (enabled by default) to see the low level CUDA function calls.
  • Fixing MPI thread reduction for OTF2
  • Shortening test that is crashing unexpectedly
  • Removing apex assertions from profiler.hpp. apex_assert.h doesn't get installed when building with HPX, so don't include it in profiler.hpp.
  • Fixing minor bugs, removing printf, and adding utilization per-core. There's a new option, APEX_PROC_STAT_DETAILS, that will show per-core (HW thread, really) utilization percentage. It's a total of all states minus idle, divided by total. Requested by DCA++ performance tests. Also fixing initialization in APEX, but not quite. I don't think we want to automatically call apex::init() and apex::finalize() as global constructors or destructors. But the option is still there.
  • Minor fix to prevent unification crash
  • Removing files
  • Cleaning up CUPTI code and removing debug message from OTF2 listener
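
The task dependency entry above describes mapping a CUPTI correlation ID to the task_wrapper of the parent, so that later GPU activity can be linked to the task that launched it. A minimal sketch of that bookkeeping follows. The `task_wrapper` struct here is a stand-in with a single field, and `correlation_map` and its method names are hypothetical; the real APEX types and callback signatures are not shown in these notes.

```cpp
#include <cstdint>
#include <memory>
#include <mutex>
#include <unordered_map>
#include <utility>

// Stand-in for the APEX task wrapper; only a GUID is modeled here.
struct task_wrapper {
    std::uint64_t guid;
};

// Hypothetical table linking correlation IDs to the launching task.
class correlation_map {
public:
    // Called from the API callback on the launching CPU thread.
    void on_api_callback(std::uint32_t correlation_id,
                         std::shared_ptr<task_wrapper> parent) {
        std::lock_guard<std::mutex> lock(mutex_);
        parents_[correlation_id] = std::move(parent);
    }

    // Called when the asynchronous activity record arrives. Returns the
    // parent and forgets the entry, or nullptr if the ID is unknown.
    std::shared_ptr<task_wrapper> on_activity_record(std::uint32_t correlation_id) {
        std::lock_guard<std::mutex> lock(mutex_);
        auto it = parents_.find(correlation_id);
        if (it == parents_.end()) return nullptr;
        std::shared_ptr<task_wrapper> parent = std::move(it->second);
        parents_.erase(it);
        return parent;
    }

private:
    std::mutex mutex_;
    std::unordered_map<std::uint32_t, std::shared_ptr<task_wrapper>> parents_;
};
```

The mutex is there because, in this sketch, callbacks and activity processing are assumed to run on different threads.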
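
The trace output entry above mentions writing all timers as complete events with GUID and parent GUID arguments. In the Chrome Trace Event format, a complete event uses phase `"X"` with a start timestamp `ts` and a duration `dur`, both in microseconds. The sketch below formats one such record. The function name and the argument keys `GUID` and `Parent GUID` are assumptions for illustration, and the name is assumed to contain no characters that need JSON escaping.

```cpp
#include <cstdint>
#include <sstream>
#include <string>

// Format one Trace Event "complete" record (phase "X").
// Assumes `name` needs no JSON escaping; times are in microseconds.
std::string complete_event(const std::string& name, int pid, int tid,
                           std::uint64_t start_us, std::uint64_t duration_us,
                           std::uint64_t guid, std::uint64_t parent_guid) {
    std::ostringstream out;
    out << "{\"name\":\"" << name << "\""
        << ",\"ph\":\"X\""
        << ",\"pid\":" << pid
        << ",\"tid\":" << tid
        << ",\"ts\":" << start_us
        << ",\"dur\":" << duration_us
        << ",\"args\":{\"GUID\":" << guid          // hypothetical key name
        << ",\"Parent GUID\":" << parent_guid      // hypothetical key name
        << "}}";
    return out.str();
}
```

A single complete event replaces a separate begin/end pair, which is what makes the output more compact.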
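
The OTF2 unification fix above replaces whitespace-based istringstream parsing with reading the whole line and splitting on the tab between the region ID and the name. A sketch of that approach, assuming one `ID<TAB>name` record per line (the function name and the exact record layout are assumptions):

```cpp
#include <cstddef>
#include <cstdint>
#include <istream>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

// Parse "ID<TAB>name" records, one per line. Names may contain spaces.
// Lines without a tab are skipped; a non-numeric ID makes std::stoull throw.
std::vector<std::pair<std::uint64_t, std::string>> parse_regions(std::istream& in) {
    std::vector<std::pair<std::uint64_t, std::string>> regions;
    std::string line;
    while (std::getline(in, line)) {
        std::size_t tab = line.find('\t');
        if (tab == std::string::npos) continue;
        regions.emplace_back(std::stoull(line.substr(0, tab)),
                             line.substr(tab + 1));
    }
    return regions;
}
```

Extracting with `in >> id >> name` would stop the name at the first space, which is the failure the fix describes for names such as `void foo(int, double)`.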
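
The APEX_PROC_STAT_DETAILS entry above defines per-core utilization as the total of all states minus idle, divided by the total. The arithmetic can be sketched as below, assuming the input holds the tick deltas for one `cpuN` line of /proc/stat in file order (user, nice, system, idle, ...), so idle is at index 3. The function name is hypothetical, and how APEX samples the values is not shown here.

```cpp
#include <cstddef>
#include <numeric>
#include <vector>

// Utilization percentage for one core: (total - idle) / total * 100.
// `ticks` holds per-state tick deltas in /proc/stat order; idle is index 3.
double utilization_percent(const std::vector<unsigned long long>& ticks,
                           std::size_t idle_index = 3) {
    unsigned long long total =
        std::accumulate(ticks.begin(), ticks.end(), 0ULL);
    if (total == 0 || idle_index >= ticks.size()) return 0.0;
    return 100.0 * static_cast<double>(total - ticks[idle_index]) /
           static_cast<double>(total);
}
```

For example, deltas of 50 user, 0 nice, 25 system, and 25 idle ticks give 75 busy ticks out of 100, or 75% utilization.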