AWS OFI NCCL v1.7.4

@bwbarrett released this 04 Dec 20:44
v1.7.4-aws

This release is intended only for use on AWS P* instances. A general release that supports other Libfabric networks will be made in the near future. This release includes the following changes:

New Features:

  • Hard fail if GPUDirect RDMA initialization fails on an EC2 instance that should support GPUDirect RDMA (such as p4d.24xlarge or p5.48xlarge), rather than falling back to host copy buffers at significantly reduced performance. Setting the environment variable OFI_NCCL_DISABLE_GDR_REQUIRED_CHECK=1 disables this behavior.
  • Raise the message-size threshold at which the rdma transport switches from round robin to striping, from 8 KiB to 256 KiB, improving the efficiency of large message transfers.
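The two changes above can be illustrated with a small Python sketch. It is a toy model, not the plugin's code: only the variable name OFI_NCCL_DISABLE_GDR_REQUIRED_CHECK and the 8 KiB and 256 KiB values come from these notes. The helper `env_without_gdr_check`, the class `ToyScheduler`, the even split across rails, and the choice to stripe a message whose size equals the threshold are all assumptions made for illustration.

```python
import os

# Variable name is taken from the release notes; everything else is illustrative.
GDR_CHECK_VAR = "OFI_NCCL_DISABLE_GDR_REQUIRED_CHECK"

# Old and new switch-over points named in the release notes, in bytes.
OLD_THRESHOLD = 8 * 1024
NEW_THRESHOLD = 256 * 1024


def env_without_gdr_check(base=None):
    """Return a copy of the environment with the GDR hard-fail check disabled."""
    env = dict(os.environ if base is None else base)
    env[GDR_CHECK_VAR] = "1"
    return env


class ToyScheduler:
    """Toy model: small messages use one rail in rotation, large ones are striped."""

    def __init__(self, num_rails, threshold=NEW_THRESHOLD):
        self.num_rails = num_rails
        self.threshold = threshold
        self._next = 0  # rail that the next small message will use

    def plan(self, size):
        """Return a list of (rail, nbytes) pairs for one message of `size` bytes."""
        if size < self.threshold:  # assumption: a message at the threshold is striped
            rail = self._next
            self._next = (self._next + 1) % self.num_rails
            return [(rail, size)]
        # Stripe: split as evenly as possible, giving the remainder to the first rails.
        base, extra = divmod(size, self.num_rails)
        return [(r, base + (1 if r < extra else 0)) for r in range(self.num_rails)]
```

In this model, a 64 KiB message on four rails is striped into four 16 KiB pieces under the old 8 KiB threshold, but travels whole on a single rail under the new 256 KiB threshold. The dictionary returned by `env_without_gdr_check()` could be passed as the `env` argument of `subprocess.run` when launching a job that should skip the GPUDirect RDMA check.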

Bug Fixes:

  • Fix debugging output in some initialization failure cases.
  • Request FI_LOCAL_COMM feature from Libfabric, as flush and eager copies are both implemented via local communication.
  • Fix initialization when using the Libfabric TCP provider.
  • Improve documentation on using the plugin with AWS's Elastic Fabric Adapter (EFA).
  • Improve handling of Neuron device detection when the plugin is used with Trainium instances.
  • Fix segfault in error case of freelist memory growth.
  • The test programs that only support 2 ranks now fail with a useful error message if run with another number of ranks.
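As a rough illustration of the last fix, a rank-count guard of the following shape fails early with a descriptive message instead of misbehaving later. This is a Python sketch only; the names `RankCountError` and `require_rank_count` and the message wording are hypothetical and are not taken from the plugin's test programs.

```python
class RankCountError(RuntimeError):
    """Raised when a test is launched with an unsupported number of ranks."""


def require_rank_count(actual, expected=2, test_name="test"):
    """Raise a descriptive error unless the job was launched with `expected` ranks."""
    if actual != expected:
        raise RankCountError(
            f"{test_name}: expected exactly {expected} ranks, got {actual}; "
            f"relaunch with {expected} ranks"
        )
```

A test would call the guard once at startup, before any communication is set up, so that a misconfigured launch stops with a clear message.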

This release has been tested on P3dn, P4d/P4de, and P5 using the EFA provider in Libfabric.