Releases: openucx/ucx
v1.12.0-rc3
1.12.0 RC3 (January 11, 2022)
Bugfixes
- Fixes in tag_send datatype processing
- Fixed keep-alive protocol for intra-node transports (sm, cuda)
v1.12.0-rc2
1.12.0 RC2 (January 8, 2022)
Features:
- Added detection of IB NDR
v1.12.0-rc1
1.12.0 RC1 (December 14, 2021)
Features:
Core
- Added beta-level support for Go language bindings
- Added new objects to VFS (md, component, log_level, etc.)
- Added configuration variable to specify which loadable modules are allowed
- Added build-time configuration to disable sigaction overriding
UCP
- Added client_id to ucp_worker_create() and ucp_conn_request_query() APIs
- Added ucp_worker_address_query() API
- Updated ucp_ep_query() API for getting local and remote addresses
- Added address versioning to correctly preserve wire compatibility starting from version 1.11.0
- Added new client/server connection establishment packet header format
- Enabled rendezvous and tag sync protocols when error handling is enabled on the endpoint
- Added iov zcopy support to RMA operations
- Reduced memory usage of unexpected messages by fitting receive buffer size to packet size
- Added support for modifying UCT and UCS configs by ucp_config_modify() API
- Optimized unpacked rkeys memory consumption
- Added request flag to influence latency vs. bandwidth protocol
- Reduced memory management overhead with new protocols
- Improved performance calculations for new protocols
- Added AMO support with GPU memory target using new protocols
- Added put_zcopy, get_zcopy and pipeline based rendezvous in new protocols
- Added support for user-defined alignment in Active Messages
- Added support for offload tag sync in new protocols
- Updated ucp_atomic_post() to use NBX flow
UCT
- Added API - uct_iface_is_reachable_v2()
- Added IPv6 address support in TCP
- Added latency estimation to uct_iface_estimate_perf()
- Adjusted knem and cma overhead cost
- Increased built-in TCP keep-alive interval to 2 seconds
RDMA CORE (IB, ROCE, etc.)
- Added check for CQ overrun in assert mode
- Added bitmap usage for releasing detached DCIs
- Added configuration for requests ack frequency with DevX
- Added remote QP info to tx error CQE traces
UCS
- Added API for a per-process aggregate-sum statistics report
- Added memory pool set data structure
- Added new ptr_array API for bulk allocation
- Added ucs_string_buffer_append_flags() for string buffer
- Added ucs_ffs32()
- Added ucs_vsnprintf_safe() which always adds '\0'
- Added thread-safe put to ptr_map
- Improved accuracy of the topology distance estimation
- Added prints of leaked callbacks from the callback queue
- Removed a diagnostic message when fuse thread is stopped
- Added configurable limit for the memory consumed by rcache
- Added configuration for VFS(FUSE) thread affinity
- Added memory limit support to memtrack
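
The two small helpers named in the list above, ucs_ffs32() and ucs_vsnprintf_safe(), can be illustrated with standalone stand-ins. This is a minimal sketch, not the UCX source: the `sketch_*` names are hypothetical, and it assumes that find-first-set returns a zero-based bit index and that the input is nonzero.

```c
#include <assert.h>
#include <stdarg.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical stand-in for a 32-bit find-first-set helper: returns the
 * zero-based index of the least significant set bit (assumed convention).
 * The caller must pass a nonzero value. */
static unsigned sketch_ffs32(uint32_t value)
{
    unsigned index = 0;

    assert(value != 0);

    while ((value & 1u) == 0) {
        value >>= 1;
        ++index;
    }
    return index;
}

/* Hypothetical stand-in for a formatting helper that always terminates the
 * output buffer, even when the formatted text is truncated. */
static void sketch_vsnprintf_safe(char *buf, size_t size, const char *fmt,
                                  va_list ap)
{
    if (size == 0) {
        return; /* nothing can be written, not even the terminator */
    }

    vsnprintf(buf, size, fmt, ap);
    buf[size - 1] = '\0';
}

/* Variadic convenience wrapper, so the helper can be called like snprintf */
static void sketch_snprintf_safe(char *buf, size_t size, const char *fmt, ...)
{
    va_list ap;

    va_start(ap, fmt);
    sketch_vsnprintf_safe(buf, size, fmt, ap);
    va_end(ap);
}
```

For example, formatting "abcdef" into a 4-byte buffer with `sketch_snprintf_safe()` leaves the terminated string "abc" in the buffer.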
CUDA
- Added global memtype cache to allow UCT transports to query memory attributes
- Auto-register CUDA whole allocations to avoid repeated registration costs
- Added capability to select CUDA stream based on source and destination memory type (required for device memory based pipelining)
- Added selection of CUDA-IPC capabilities based on NVLINK topology (to prefer writes vs. reads for specific platforms using NVML)
- Added option to set cuda_copy bandwidth
- Added profiling of CUDA runtime function calls
- Added option to limit GPUDirectRDMA size in rendezvous protocol
Java
- Added ucp_listener_reject functionality
- Added support for setting worker id and querying it from the connection request
- Added support to bind on a free port in UcpListener
Packaging
- Added cmake config files for better integration with external cmake based projects
Tests
- Removed memcpy from AM eager flow in io_demo
- Added check_qps.sh script to detect stuck QPs
- Improved diagnostic in test_init_mt
- Added iov support in ucp_client_server
- Added option to use epoll in io_demo
- Added registration of memory allocated by io_demo in memtrack
- Extended statistics in io_demo
- Improved logging in io_demo
- Replaced rand by urand in io_demo
- More improvements in io_demo
- Generalized median calculation to support any percentile in ucx_perftest
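
The generalization from a median to an arbitrary percentile mentioned in the last item can be pictured as follows. This is a minimal sketch under stated assumptions, not the ucx_perftest code: the function name is hypothetical, the input must already be sorted in ascending order, and the percentile is mapped to the nearest array index.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: returns the element at the given percentile (0..100)
 * of an array sorted in ascending order. The percentile is mapped to the
 * nearest index in [0, count - 1]; for an odd number of elements,
 * percentile 50 selects the median. */
static double sketch_percentile(const double *sorted, size_t count,
                                double percentile)
{
    size_t index;

    assert(sorted != NULL);
    assert(count > 0);
    assert((percentile >= 0.0) && (percentile <= 100.0));

    /* Scale to the index range and round to the nearest index */
    index = (size_t)(((percentile / 100.0) * (double)(count - 1)) + 0.5);
    return sorted[index];
}
```

With the sorted samples {1, 2, 3, 4, 5}, percentile 50 selects 3 (the median) and percentile 100 selects 5 (the maximum).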
Tools
- Added loop-back transport support in ucx_perftest
- Split ucx_perftest into separate modules
- Added process placement option for ucx_info
- Extended parameters correctness check in ucx_perftest
- Added support for GPU memory RMA and atomics in ucx_perftest
CI
- Updated gtest 1.7 to 1.10
- Increased uptime in network corrupter (used for io_demo)
- Enabled set of gtests for new protocols
- Added running CI in docker containers
- Increased thresholds for test_ucp_wait_mem
- Added test for ucx binary compatibility between OS versions
- Increased test job timeout to 6 hours
- Reduced testing time under valgrind
- Added suppressions for glibc and libnl leaks
- Relaxed performance requirements in perf test
Bugfixes
Core
- Fixed invalid remote memory access after connection error
- Fixed creating more than 64K endpoints between the same peers
- Fixed simultaneous endpoint close with ucp_hello_world
UCP
- Fixes and improvements in new protocols infrastructure
- Fixes in AM flows
- Fixed tag short threshold selection
- Multiple fixes in keep-alive protocol
- Multiple fixes in wire-up protocol
- Fixes in error flow during rendezvous protocol
- Multiple fixes in general error flow
- Fixed fallback to PUT pipeline in rendezvous protocol
- Reduced default value of keep-alive interval to 20 seconds
UCT
- Fixed deadlock in TCP
- Suppressed EHOSTUNREACH error in TCP sockcm
- Restricted connecting loop-back to other devices in TCP
RDMA CORE (IB, ROCE, etc.)
- Fixed pkey_index initialization when creating RC QP with DEVX
- Disabled MP_SRQ by default
- Fixed TX WQ overflow check
- Fixed dci->pool_index initialization when HAVE_DC_DV is false
- Fixed syndrome value for creating rdmacm reserved qpn
- Fixed error code on rdma_establish failure
- Fixed uct_ep_am_short_iov for UD verbs
- Fixed handling of error CQE after rc_ep is destroyed
- Fixes in flow control when error CQE is polled
- Multiple fixes in RC and DC error flows
- Fixed deadlock between DCIs and RDMA_READ credits
- Removed AM handler invocation for PURE_GRANT messages
- Fixed endpoint arbiter_group leak in DC
- Fixed resource check in flush for DC
UCS
- Fixed segmentation fault for ucs_stats_parser
- Fixed potential crash on cleanup when using UCX profiling
- Fixed read_profile print of new request
- Fixed uninitialized variable access in VFS
- Changed log level of inotify_init failure to diag
- Fixed integer overflow in mpool chunk allocation
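
For context on the last item, an integer overflow in a size computation is commonly guarded by checking each step before performing it. The sketch below shows that general technique only. It is not the actual UCX fix, and the function name and the header/element layout are assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: computes header_size + elem_size * num_elems and
 * reports failure instead of silently wrapping around.
 * Returns 1 on success and 0 if the result would not fit in size_t. */
static int sketch_chunk_size(size_t header_size, size_t elem_size,
                             size_t num_elems, size_t *chunk_size)
{
    assert(chunk_size != NULL);

    /* The product elem_size * num_elems would exceed SIZE_MAX */
    if ((elem_size != 0) && (num_elems > (SIZE_MAX / elem_size))) {
        return 0;
    }

    /* Adding the header to the product would exceed SIZE_MAX */
    if (header_size > (SIZE_MAX - (elem_size * num_elems))) {
        return 0;
    }

    *chunk_size = header_size + (elem_size * num_elems);
    return 1;
}
```

A caller would treat a zero return value as an allocation failure rather than allocating a chunk that is too small.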
Packaging
- Fixed with-fuse arg for RPM build
Documentation
- Fixes in UCP, UCT, UCS, FAQ and README documentation
Tests
- Multiple fixes in io_demo
CI
- Fixed snapshot docker name
- Fixed hipMallocManaged hook gtest
- Fixes in Azure release pipeline
- Fixes in Coverity CI
- Fixed test_uct_query gtest for ROCm
- Fixes in jenkins test script
- Fixed release commit title check
v1.11.2
v1.11.2-rc1
Bugfixes
- Fixes in Java release pipeline
- Fixes in handling large number of devices
- Fixes in UD out-of-order processing
- Fixes in switching transports during client/server connection setup
- Fixes in transport-level error reporting
v1.11.1
Features:
UCS
- Added API to read boot ID value or use machine_guid
Bugfixes:
- Fixes in Cuda memory hooks
- Fixes in setting traffic class for DCT RoCE transport
- Fixes in TCP endpoint flush
- Fixes in TCP pending operations progress
- Fixes in release pipelines
- Fixes in error handling flow
- Fixes in multi-threaded tag probe
- Fixes in TCP disconnect flow
- Fixes in RPM post-install script
- Fixes in UCT common keepalive
v1.11.1-rc3
1.11.1-rc3 (August 26, 2021)
Bugfixes:
- Fixes in RPM post-install script
- Fixes in UCT common keepalive
v1.11.1-rc2
v1.11.1-rc1
1.11.1-rc1 (August 10, 2021)
Features:
UCS
- Added API to read boot ID value or use machine_guid
Bugfixes:
- Fixes in Cuda memory hooks
- Fixes in setting traffic class for DCT RoCE transport
- Fixes in TCP endpoint flush
- Fixes in TCP pending operations progress
- Fixes in release pipelines
- Fixes in error handling flow
v1.11.0
1.11.0 (July 26, 2021)
Features:
Core
- Added support for UCX monitoring using virtual file system (VFS)/FUSE
- Added support for applications with static CUDA runtime linking
- Added support for a configuration file
- Updated clang format configuration
UCP
- Added rendezvous API for active messages
- Added user-defined name to context, worker, and endpoint objects
- Added flag to silence request leak check
- Added API for endpoint performance evaluation
- Added API - ucp_request_query
- Added API - ucp_lib_query
- Ported connection manager to a new UCT API
- Added bandwidth optimizations for new protocols multi-lane
- Added support for multi-rail over lanes with BW ratio >= 1/4
- Added support for tracking outstanding requests and aborting those in case of connection failure
- Refactored keep-alive protocol
- Added device id to wireup protocol
- Added support for up to 128 transport layer resources in UCP context
- Added support for CUDA memory allocations with ucp_mem_map
- Increased UCP_WORKER_MAX_EP_CONFIG to 64
- Adjusted memory type zcopy threshold when UCX_ZCOPY_THRESH set
- Refactored wireup protocols, rendezvous, get, zcopy protocols
- Added put zcopy multi-rail
- Improved logging for new protocols
- Added system topology information
- Added eager offload protocols to the new protocols infrastructure
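
The multi-rail item above gives a bandwidth ratio of 1/4 as the threshold. One way to read that rule is sketched below. It assumes the ratio is measured against the fastest available lane, which the release note does not spell out, and the function and macro names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Threshold taken from the release note: a lane qualifies when its bandwidth
 * is at least 1/4 of the reference bandwidth. Using the fastest lane as the
 * reference is an assumption of this sketch. */
#define SKETCH_MIN_BW_RATIO 0.25

/* Hypothetical helper: sets selected[i] to 1 for every lane that qualifies
 * for multi-rail and to 0 otherwise. Returns the number of qualifying lanes. */
static size_t sketch_select_rails(const double *lane_bw, size_t num_lanes,
                                  int *selected)
{
    double max_bw = 0.0;
    size_t count  = 0;
    size_t i;

    assert(lane_bw != NULL);
    assert(selected != NULL);

    /* Find the fastest lane, which serves as the reference bandwidth */
    for (i = 0; i < num_lanes; ++i) {
        if (lane_bw[i] > max_bw) {
            max_bw = lane_bw[i];
        }
    }

    /* Keep every lane whose bandwidth is at least 1/4 of the reference */
    for (i = 0; i < num_lanes; ++i) {
        selected[i] = (max_bw > 0.0) &&
                      (lane_bw[i] >= (SKETCH_MIN_BW_RATIO * max_bw));
        count      += (size_t)selected[i];
    }
    return count;
}
```

For lane bandwidths of 100, 50, 25 and 10 (in any common unit), the first three lanes qualify and the last one is left out.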
UCT
- Extended connection establishment API
- Added active message (AM) alignment in iface params
- Added active message short IOV API
- Added support for interface query by operation and memory type
- Added API to get allocation base address and length
- Added md_dereg_v2 API
UCS
- Added log filter by source file name
- Added checking for last element in fraglist queue
- Added a method to get IP address from sockaddr
- Added memory usage limits to registration cache
UCM
- Improved x86 parser to recognize some mov flavors
CUDA
- Added registration for whole CUDA allocations
- Added CUDA-IPC keepalive
- Adjusted performance estimations
- Improved logging
- Added allocation methods for CUDA pinned/managed memory
- Added support for a global cuda_ipc cache
RDMA CORE (IB, ROCE, etc.)
- Added report of QP info in case of completion with error
- Refactored FC send operations
- Added support for DevX unique QPN allocation
- Optimized endpoint lookup for DCI
- Added support for RDMA sub-function (SF)
- Added support for DCI via DEVX
- Added DCI pool per LAG port
- Added support for RoCE IP reachability check using a subnet mask
- Added active message short IOV for UD/DC/RC mlx, UD/RC verbs
- Added endpoint keep alive check for UD
- Suppressed warning if device can't be opened
- Added support for multiple flush cancel without completion
- Added ignore for devices with invalid GID
- Added support for SRQ linked list reordering
- Added flush by flow control on old devices
- Added support for configurable rdma_resolve_addr/route timeout
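
The subnet-mask based reachability check listed above can be pictured as a prefix comparison between the local and remote addresses. This is a minimal IPv4-only sketch, not the UCX implementation: the function name is hypothetical and the addresses are assumed to be in host byte order.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: returns nonzero when two IPv4 addresses (host byte
 * order) share the same leading prefix_len bits, which means both addresses
 * belong to the same subnet. */
static int sketch_same_subnet(uint32_t local_addr, uint32_t remote_addr,
                              unsigned prefix_len)
{
    uint32_t mask;

    assert(prefix_len <= 32);

    /* Shifting a 32-bit value by 32 is undefined, so /0 is handled apart */
    mask = (prefix_len == 0) ?
           0 : (uint32_t)(0xffffffffu << (32 - prefix_len));

    return (local_addr & mask) == (remote_addr & mask);
}
```

For example, 192.168.1.10 and 192.168.1.200 match under a /24 mask, while 192.168.1.10 and 192.168.2.1 match only under a /16 or shorter mask.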
Shared memory
- Added active message short IOV support for posix, sysv, and self transports
TCP
- Added support for peer failure in case of CONNECT_TO_EP
- Added support for active message short IOV
Java
- Added full support for UCP Java API
Tests
- Added length/mem_type for UCP client server example
- Added port sockaddr tests for a new API
- Added test send-recv between client/server with diff UCX_IB_NUM_PATHS
- Added support for CUDA and CUDA managed memory in io_demo
- Added support for a custom watchdog timeout from command line
- Extended memtype hook tests
Tools
- Added UCP active message support to perftest
- Added error handling option to perftest
- Added wakeup option
- Added performance tests for am short iov
CI
- Added RHEL 7.6 with MOFED 4.7
- Added Fedora 34, RHEL 7.2, 7.4
- Added PGI support from HPC-SDK module
- Added docker image with CUDA 11.2
- Added IODEMO test
- Added Ubuntu 20.04
- Added test for connection manager fallback in client-server testing
- Added loopback interface for tcp testing
Bugfixes:
Build
- Fixes in libnuma detection macro
- Fixes for cross compilation support
- Fixes for --without-dc compilation
Continuous Integration
- Fixes in Azure pipeline build system
- Fixes in Coverity CI
- Fixes in Azure release pipeline
Packaging
- Fixes in DEB package: added essential system dependencies
Documentation
- Fixes in UCP, UCT, Readme, FAQ, and Read-the-docs documentation
Tests
- Fixes in CMA peer failure test
- Fixes in SRQ tests
- Fixes in the usage of requests_wait
- Fixes in test_uct_query
- Fixes addressing race conditions on client user data in test_uct_sockaddr
- Fixes in IODEMO app
- Fixes in error handling flow for perftest
- Fixes in perftest batch tests
- Fixes addressing hang issues for rendezvous protocol in UCP client server example
UCP
- Fixes in endpoint error handling
- Fixes in error reporting for failed CM lanes
- Fixes in progress worker flush
- Fixes in rendezvous pipeline flow
- Fixes in recursive protocol selection
- Fixes in error handling for AM_ZCOPY
- Fixes in length check condition in RMA PUT short
- Fixes in failure handling for rendezvous offload send
- Fixes in offload completion with inlined data
- Fixes in statistics calculations for rendezvous protocol
- Fixes in ucp_worker_query() thread mode for SERIALIZED
- Fixes preventing leaks of UCP requests
ROCM
- Fixes in device memory registration and de-registration
- Fixes in missing mem_query definition for rocm_copy
- Fixes addressing build failure due to const violation
- Fixes in sockaddr_accessibility test for rocm_copy and rocm_ipc
- Fixes in bandwidth estimation for rocm_ipc
RDMA CORE (IB, ROCE, etc.)
- Fixes addressing deadlock between DCI resources and RDMA_READ credits
- Fixes in DSCP for RoCE DCT
- Fixes in flush(cancel) flow
- Fixes preventing segfault in uct_rdmacm_cm_ep_str
- Fixes in scatter-gather entries logging
- Fixes for compilation with experimental verbs
- Fixes in UD dgid filtering
- Fixes in domain resources destroying
- Fixes in PCIe bandwidth calculation
- Fixes addressing CQ creation failure using legacy ibv API
- Fixes in iov2sge converter
- Fixes in port width check on HDR100
- Fixes in SL selection
- Fixes in hardware tag matching compilation
- Fixes in uct_rdmacm_cm_cqs hash key
- Fixes for compilation with rdma-core 20
Java
- Fixes in tag sender mask
UCT
- Fixes in reachability of loopback ifaces
- Fixes addressing possible uninitialized memory accesses
- Fixes in error flow for endpoints created upon receiving connection request
- Fixes in TCP keepalive to avoid false-positive error detection
UCM
- Fixes addressing heap corruption caused by ucp_set_event_handler()
- Fixes in mmap events test