AWS OFI NCCL v1.10.0
AmedeoSapio released this 06 Aug 21:38
This release is intended only for use on AWS P* instances. A general release that supports other Libfabric networks will be made in the near future. This release requires Libfabric v1.18.0 or later and supports NCCL 2.21.5-1 while maintaining backward compatibility with older NCCL versions (NCCL v2.17.1 and later).
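The version requirements above can be checked before deployment with a small script. The following is a minimal sketch, not part of the plugin: the helper names `parse_version` and `meets_minimum` are hypothetical, and it assumes the installed version strings have already been obtained by some other means.

```python
import re

# Minimum versions stated in these release notes.
MIN_LIBFABRIC = (1, 18, 0)
MIN_NCCL = (2, 17, 1)


def parse_version(text):
    """Parse the leading dotted numeric part of a version string.

    Accepts forms such as "v1.18.0", "1.18" or "2.21.5-1" and returns a
    tuple of at least three integers. Hypothetical helper for illustration.
    """
    match = re.match(r"v?(\d+(?:\.\d+)*)", text.strip())
    if match is None:
        raise ValueError(f"unrecognized version string: {text!r}")
    parts = tuple(int(p) for p in match.group(1).split("."))
    # Pad short versions such as "1.18" so tuple comparison behaves as expected.
    return parts + (0,) * (3 - len(parts))


def meets_minimum(version, minimum):
    """Return True if the version string is at or above the minimum tuple."""
    return parse_version(version) >= minimum
```

For example, `meets_minimum("2.21.5-1", MIN_NCCL)` evaluates to true, while `meets_minimum("1.17.3", MIN_LIBFABRIC)` evaluates to false.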
New Features:
- Replaced the model-based tuner with one based on regions derived from experimental evaluations.
- Changed properties reported to NCCL to signal that registered MRs are global, in order to support user buffer registrations.
- Added the option to use different endpoints for receive communicators connected to the same source endpoint, while using a shared completion queue.
- Updated the plugin to use the zero-copy path in the EFA provider for fi_send/fi_recv operations.
- Shrank the control message to 32 bytes to fit in inline data for EFA.
Bug Fixes:
- Disabled Libfabric shared memory when possible.
- Disabled RDMA eager messages on Neuron by default for better performance.
- Ensured the plugin's multi-rail protocol consistently sorts rails by VF index for better performance.
The plugin has been tested with the following Libfabric providers using the tests bundled in the source code and the nccl-tests suite:
- efa
Checksum (sha512) for the release tarball:
fa296339a7e40fa420e2934c3a44f9a18ad3a9d798b7f129b35f46892f76532b70996fe36f309e3dedd2823ed9a819a4578f7c8241d8549805c49811b38ae14f aws-ofi-nccl-1.10.0-aws.tar.gz
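One way to check a downloaded tarball against the checksum above is sketched below, using only Python's standard `hashlib` module. The function names `sha512_of_file` and `verify_tarball` are hypothetical, and the script assumes the tarball has been saved locally under the file name given above.

```python
import hashlib

# Expected SHA-512 digest, copied from the release notes above.
EXPECTED_SHA512 = "fa296339a7e40fa420e2934c3a44f9a18ad3a9d798b7f129b35f46892f76532b70996fe36f309e3dedd2823ed9a819a4578f7c8241d8549805c49811b38ae14f"


def sha512_of_file(path, chunk_size=1 << 20):
    """Return the hex SHA-512 digest of a file, read in chunks to bound memory use."""
    digest = hashlib.sha512()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_tarball(path, expected=EXPECTED_SHA512):
    """Return True if the file's SHA-512 digest matches the expected value."""
    return sha512_of_file(path) == expected.lower()
```

Calling `verify_tarball("aws-ofi-nccl-1.10.0-aws.tar.gz")` from the download directory should return `True` for an intact file; a `False` result suggests a corrupted or modified download.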