Releases
v1.14.0
Features:
UCP
Added API for querying transport and device names on endpoint
Added API for querying datatype object
Added API for exporting and importing memory keys (no implementation yet)
Added support for non-persistent active message header
Added infrastructure to print protocols v2 performance
Multiple performance improvements for protocols v2
Added support for non-contiguous datatypes for rendezvous protocols v2
Added support for reset and abort request in protocols v2
Added support for user memory handles in RMA API
Added multi-rail support for RMA API in protocols v2
Added support for up to 16 different lanes per endpoint
Added support for dmabuf memory registration in protocols v2
Added strong fence mode for ucp_worker_fence() API
UCT
Added new uct_md_mem_attach() API to support exported memory handles
Added remote completion mode for endpoint flush (via new flag)
Added support for dmabuf registration
Added new uct_ep_connect_to_ep_v2() API
Added new uct_mem_reg_v2() API
Added new uct_md_query_v2() API
Added support for IPv6 loopback address in TCP transport
RDMA CORE (IB, ROCE, etc.)
Added ECE (enhanced connection establishment) support for RC and DC transports
Added support for hardware DCS in DC transport
Added UD interface and endpoint resource information to VFS
Added CQ creation via DEVX API
Removed support for accelerated IB transports over legacy experimental verbs
UCS
Added support for auto-correction of user environment variables
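To illustrate what auto-correction of environment variables can involve, here is a minimal sketch of suggesting the closest known variable name by edit distance. It is not taken from the UCX source: the helpers edit_distance() and suggest_name(), the MAX_NAME_LEN limit, and the distance threshold are all hypothetical, and UCX's actual matching logic may differ.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MAX_NAME_LEN 63 /* hypothetical limit, chosen for this sketch only */

/* Levenshtein distance between two strings of at most MAX_NAME_LEN chars,
 * computed with a single rolling row of the dynamic-programming table. */
static size_t edit_distance(const char *a, const char *b)
{
    size_t la = strlen(a), lb = strlen(b);
    size_t row[MAX_NAME_LEN + 1];
    size_t i, j;

    assert(la <= MAX_NAME_LEN && lb <= MAX_NAME_LEN);

    for (j = 0; j <= lb; j++) {
        row[j] = j; /* distance from the empty prefix of a */
    }

    for (i = 1; i <= la; i++) {
        size_t diag = row[0]; /* table value at (i-1, j-1) */
        row[0] = i;
        for (j = 1; j <= lb; j++) {
            size_t up   = row[j]; /* table value at (i-1, j) */
            size_t cost = (a[i - 1] == b[j - 1]) ? 0 : 1;
            size_t best = diag + cost;     /* substitute or match */
            if (up + 1 < best) {
                best = up + 1;             /* delete from a */
            }
            if (row[j - 1] + 1 < best) {
                best = row[j - 1] + 1;     /* insert into a */
            }
            row[j] = best;
            diag   = up;
        }
    }
    return row[lb];
}

/* Return the known name closest to 'given' if it is within max_dist edits,
 * or NULL when nothing is close enough to suggest. */
static const char *suggest_name(const char *given, const char *const *known,
                                size_t num_known, size_t max_dist)
{
    const char *best = NULL;
    size_t best_dist = max_dist + 1;
    size_t i;

    for (i = 0; i < num_known; i++) {
        size_t d = edit_distance(given, known[i]);
        if (d < best_dist) {
            best_dist = d;
            best      = known[i];
        }
    }
    return best;
}
```

With a known list of UCX_TLS, UCX_LOG_LEVEL and UCX_NET_DEVICES, calling suggest_name("UCX_LOG_LEVLE", known, 3, 2) would return "UCX_LOG_LEVEL", while a name that is more than two edits away from every entry yields NULL.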
UCM
Implemented CUDA bistro hooks for aarch64 (to enable memory cache on this platform)
Added support for CUDA virtual/stream-ordered memory with cudaMallocAsync
GPU (CUDA, ROCM)
Implemented uct_iface_estimate_perf() function for ROCM
Removed obsolete ROCM gdr transport
Added support for hsa async_copy for short operations in ROCM
Added memory allocation functions in ROCM
Java
Added methods for ucp_worker_arm() and ucp_worker_get_efd()
Documentation
Added FAQ for using pkg-config tool to build applications with UCX
Tests
Added prints of latency per connection in io_demo
Tools
Added runtime library version to the 'ucx_info -v' output
Added support for memory types in ucx_info
Bugfixes
UCP
Multiple fixes in keepalive protocol
Multiple fixes and improvements in UCP rcache flows
Fixed endpoints leak by disabling resolving remote endpoints in certain cases
Multiple fixes and cleanups in wireup protocol and lanes selection flows
Multiple fixes in protocols v2 infrastructure
Fixed worker interface initialization to take atomic capabilities into account
Fixed UCP AM max payload value calculation for protocols v2
Fixed deadlock in rcache when UCX_LOG_LEVEL is set to debug
Fixed lanes weight calculation in rendezvous protocol v2
Fixed user memory handle support in rendezvous protocol
Fixed message split in rendezvous protocol to avoid having very small chunks
Improved performance estimations for protocols v2
Fixed receive descriptors leak in UCP AM rendezvous
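The rendezvous message-split fix listed above concerns avoiding very small chunks. The sketch below shows the general idea only, under simplifying assumptions: it splits a message into equal parts and reduces the number of parts until each one reaches a minimum size. The function split_message() and its parameters max_chunks and min_chunk are hypothetical; the real UCX protocol code selects lanes and sizes by its own rules, which are not reproduced here.

```c
#include <assert.h>
#include <stddef.h>

/* Split 'length' bytes into at most 'max_chunks' near-equal chunks so that
 * no chunk is smaller than 'min_chunk', unless the whole message is shorter
 * than 'min_chunk', in which case it is sent as a single chunk.
 * Chunk sizes are written to 'chunks' (capacity must be >= max_chunks).
 * Returns the number of chunks used. */
static size_t split_message(size_t length, size_t max_chunks,
                            size_t min_chunk, size_t *chunks)
{
    size_t num, base, rem, i;

    assert(max_chunks > 0 && min_chunk > 0);

    if (length == 0) {
        return 0; /* nothing to send */
    }

    /* The largest chunk count for which every chunk is >= min_chunk */
    num = length / min_chunk;
    if (num == 0) {
        num = 1; /* short message: do not split at all */
    }
    if (num > max_chunks) {
        num = max_chunks;
    }

    /* Distribute the remainder one byte at a time over the first chunks */
    base = length / num;
    rem  = length % num;
    for (i = 0; i < num; i++) {
        chunks[i] = base + ((i < rem) ? 1 : 0);
    }
    return num;
}
```

For example, with max_chunks = 4 and min_chunk = 4096, a 10000-byte message is split into two 5000-byte chunks instead of four 2500-byte ones, and a 100-byte message is not split at all.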
UCT
Fixed double free of server endpoint in TCP sockcm
Updated KNEM bandwidth to be dedicated resource rather than shared
Fixed race in CM when listener is destroyed during conn_req_cb invocation
Updated default bandwidth value for memory mapper transports
Disqualified posix transport if /dev/shm size is too small
Disqualified KNEM transport if memory registration fails with it
Fixed cuda detection (when cuda headers are not present, but nvml headers are)
RDMA CORE (IB, ROCE, etc.)
Fixed device error handling (prevent coredump when iface is down/up)
Multiple fixes in DC transport (error flows, flow control, etc.)
Multiple fixes and cleanups in UD transport
Fixed MR registration (avoid atomic offset breaking region alignment)
Fixed indirect key registration (avoid creating atomic KSM on top of relaxed-order key)
Fixed thread domain usage for accelerated verbs transports
Added print of a particular syndrome on DEVX function failures
Fixed DEVX QP creation by setting proper ts_format attribute
Decreased size of DC endpoint
Fixed bandwidth calculation for RoCE LAGs
Fixed port counters setting for DEVX QPs
Fixed compile errors on SLES SP3
Removed errors during md open in case of strict memlock limit
UCS
Removed async_max_events limit (e.g. to support many concurrent TCP connections)
Updated memory wc flush using DGH hint for ARM platform
Fixed deprecation warnings because of <sys/fcntl.h> includes
Added default bandwidth value for ZHAOXIN CPU
UCM
Fixed segfault in malloc when compiled with -flto
GPU (CUDA, ROCM)
Updated cuda_copy transport to use event fd instead of async callback
Fixed ROCM IPC transport (use remote agent if available)
Fixed clang compilation errors in CUDA copy transport
Fixed ROCM memtype detection
Improved performance estimation of CUDA copy transport
Fixed send to self flows in ROCM
Documentation
Updated GPU memory support section in FAQ
Tests
Multiple fixes and improvements in unit tests
Tools
Fixed MPI RTE send deadlock in ucx_perftest
Build
Debian package is now built with multi-thread support
Fixed configure warning by using POSIX compliant sh syntax
Multiple fixes for Debian package build
Dropped support for Ubuntu16