Sergey Lebedev edited this page May 16, 2023 · 30 revisions

1. What is UCC?
2. What are the important components of UCC reference implementation?
3. How can I participate?
4. How to compile and run UCC with Open MPI?
5. How to compile and run UCC with PyTorch?
6. What is TL scoring and how to select a certain TL?
7. What are the dependencies for UCC?
8. How to compile all TLs?
9. How to compile a specific TL?
10. How to compile and run UCC with OpenSHMEM Applications?
11. How to implement a new TL for UCC?
12. Where can I find a simple UCC example?
13. How to configure UCC components with configuration file and priority?
14. Where can I find more details about the API and more UCC documentation?
15. How to compile UCC for a specific GPU architecture?


1. What is UCC?

UCC is a collective communication operations API and library that is flexible, complete, and feature-rich for current and emerging programming models and runtimes.

2. What are the important components of UCC reference implementation?

Please refer to https://github.com/openucx/ucc/blob/master/docs/images/ucc_components.png

3. How can I participate?

4. How to compile and run UCC with Open MPI?

Please refer to: https://github.com/openucx/ucc#open-mpi-and-ucc-collectives

5. How to compile and run UCC with PyTorch?

UCC is available as an internal ProcessGroup backend starting with the PyTorch 2.0 release. Please refer to PyTorch ProcessGroup UCC backend for details on how to use UCC with earlier releases of PyTorch.

6. What is TL scoring and how to select a certain TL?

Each TL and CL can be tuned through an environment variable of the form UCC_<TL/CL>_<NAME>_TUNE=token1#token2#...#tokenn. The value is a '#'-separated list of tokens, where each token has the form coll_type:msg_range:mem_type:team_size:score:alg, a ':'-separated list of qualifiers. Every qualifier is optional; the only requirement is that either "score" or "alg" is provided.

Qualifiers:

  • coll_type = coll_type_1,coll_type_2,...,coll_type_n - a ',' separated list of coll_types
  • msg_range = m_start_1-m_end_1,m_start_2-m_end_2,..,m_start_n-m_end_n - a ',' separated list of msg ranges, where each range is represented by "start" and "end" values separated by "-". Values can be numbers with "Size" characters, e.g. 128, 256b, 4K, 1M. Special value "inf" means MAX msg size.
  • mem_type = m1,m2,..,mn - ',' separated list of memory types
  • team_size = [t_start_1-t_end_1,t_start_2-t_end_2,...,t_start_n-t_end_n] - a ',' separated list of team size ranges enclosed with [].
  • score = <int> - an int value from 0 to "inf"
  • alg = @<value|str> - character @ followed by either an int number or a string representing the collective algorithm.

Examples:

  • UCC_TL_NCCL_TUNE=0 - disable all the NCCL collectives (score 0 is applied to ALL collectives since qualifier is not specified, similarly to ALL memory types, to default [0-inf] msg range and [0-inf] team size).
  • UCC_TL_NCCL_TUNE=allreduce:cuda:inf#alltoall:0 - force NCCL allreduce for "cuda" buffers and disable alltoall
  • UCC_TL_UCP_TUNE=bcast:0-4K:cuda:0#bcast:65k-1M:[25-100]:cuda:inf - disable UCP bcast on cuda buffers for msg sizes 0-4K and force UCP bcast on cuda buffers for msg sizes 65K-1M only for teams with 25-100 ranks
  • UCC_TL_UCP_TUNE=allreduce:0-4K:@0#allreduce:4K-inf:@sra_knomial - for TL_UCP set allreduce algorithm to 0 for msg range 0-4K and to 1 (sra_knomial) for 4k-inf.
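
The token grammar above can be tried out from a shell before launching a job. The following is a minimal sketch, not part of UCC: the helper names join_tokens and split_tune are hypothetical, and the script only builds and prints strings. It does not check them against what UCC actually accepts.

```shell
#!/bin/sh
# Hypothetical helpers (not part of UCC) for composing a *_TUNE value.

# Join all arguments with '#', the token separator described above.
# The function body runs in a subshell so the IFS change stays local.
join_tokens() (
    IFS='#'
    printf '%s\n' "$*"
)

# Print each token of a TUNE value on its own line for inspection.
split_tune() {
    printf '%s\n' "$1" | tr '#' '\n'
}

# Reproduce the second example from the list above:
# force NCCL allreduce for "cuda" buffers and disable alltoall.
tune=$(join_tokens "allreduce:cuda:inf" "alltoall:0")
export UCC_TL_NCCL_TUNE="$tune"

printf 'UCC_TL_NCCL_TUNE=%s\n' "$UCC_TL_NCCL_TUNE"
split_tune "$UCC_TL_NCCL_TUNE"
```

Keeping one token per argument makes it easier to comment out or reorder individual rules while experimenting with scores.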

7. What are the dependencies for UCC?

It depends on the system configuration, the workload that uses UCC, and TLs/CLs the user wants to enable.

  • UCX
  • NCCL
  • Doxygen

8. How to compile all TLs?

All available TLs are compiled by default (--with-tls=all)

9. How to compile a specific TL?

Users can specify a list of TLs to compile. For example, --with-tls=ucp builds only the "ucp" TL, and --with-tls=sharp,nccl builds tl/sharp and tl/nccl.

10. How to compile and run UCC with OpenSHMEM Applications?

For compilation instructions using OSHMEM with Open MPI, please refer to: https://github.com/openucx/ucc#open-mpi-and-ucc-collectives

To run OpenSHMEM applications:

$ oshrun -np 2 --mca scoll_ucc_enable 1 --mca scoll_ucc_priority 100 ./my_openshmem_app

To run OpenSHMEM applications with one-sided collectives (i.e., Alltoall):

$ oshrun -np 2 --mca scoll_ucc_enable 1 --mca scoll_ucc_priority 100 -x UCC_TL_UCP_TUNE=alltoall:0-inf:@onesided ./my_openshmem_app

13. How to configure UCC components with configuration file and priority?

The UCC configuration file (ucc.conf) provides a unified way of tailoring the behavior of UCC components such as CLs, TLs, and ECs to meet workload needs. The configuration variables are of the format <VAR = VALUE>.

Examples

Selecting a hierarchy CL

UCC_CLS=hier

Selecting a UCP TL

UCC_TLS=ucp

Selecting an algorithm

UCC_TL_SHARP_TUNE=allreduce:inf

Log info

UCC_TL_UCP_LOG_LEVEL=INFO

The VALUE can also specify message size ranges and memory types, e.g. UCC_TL_UCP_ALLREDUCE_KN_RADIX=0-8k:host:8,8k-inf:host:2. Currently, the implementation supports such per-range radices only for the Allreduce collective in TL/UCP. Similar range support can be added for other TLs and collectives as UCC developers or users find the need.

In addition, ucc.conf contains architecture-specific tuning sections for optimal performance. Each section is identified by key-value pairs: vendor, model, team size, processes per node, and number of nodes. For example: [vendor=intel model=skylake team_size=8 ppn=1 nnodes=8]. The tuning parameters for a section follow its section title.
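
As an illustration of the layout described above, the sketch below writes a small ucc.conf to a temporary directory and points UCC_CONFIG_FILE at it. The variables and the section title are taken from the examples on this page. The ordering of global settings before sections and the use of '#' comment lines are assumptions, and the values are illustrative rather than tuned recommendations.

```shell
#!/bin/sh
# Write an illustrative ucc.conf into a scratch directory.
conf_dir=$(mktemp -d)
conf_file="$conf_dir/ucc.conf"

cat > "$conf_file" <<'EOF'
# Global setting in VAR = VALUE form
UCC_TL_UCP_LOG_LEVEL = INFO

# Architecture-specific section: the title lists vendor, model, team size,
# processes per node and number of nodes; its tuning parameters follow.
[vendor=intel model=skylake team_size=8 ppn=1 nnodes=8]
UCC_TL_UCP_ALLREDUCE_KN_RADIX = 0-8k:host:8,8k-inf:host:2
EOF

# Ask UCC to read this file instead of the default locations.
export UCC_CONFIG_FILE="$conf_file"
printf 'UCC_CONFIG_FILE=%s\n' "$UCC_CONFIG_FILE"
```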

Precedence:

Command line: If a user sets a UCC variable both on the command line and in the configuration file, the value provided on the command line takes precedence.

Multiple ucc.conf files: When multiple configuration files are found in the runtime environment, the priority is as follows:

  1. The file available via the environment variable UCC_CONFIG_FILE
  2. ucc.conf in $HOME
  3. ucc.conf in the install directory: <ucc_install_dir>/share/ucc.conf
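
To see which file this priority order would select on a given machine, the order can be mimicked in a few lines of shell. This is a sketch only: ucc_conf_candidate is a hypothetical helper, not a UCC tool, and it assumes that a missing file at one level simply falls through to the next, which may differ from what the library does.

```shell
#!/bin/sh
# Hypothetical helper (not part of UCC): print the first existing ucc.conf
# candidate, following the priority order listed above.
# Argument 1: the UCC install directory.
ucc_conf_candidate() {
    install_dir=$1
    for f in "${UCC_CONFIG_FILE:-}" \
             "${HOME:-}/ucc.conf" \
             "$install_dir/share/ucc.conf"; do
        # Skip empty entries (e.g. UCC_CONFIG_FILE unset) and missing files.
        if [ -n "$f" ] && [ -f "$f" ]; then
            printf '%s\n' "$f"
            return 0
        fi
    done
    return 1
}

# Example call; /opt/ucc is a placeholder install prefix.
ucc_conf_candidate /opt/ucc || echo "no ucc.conf found"
```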

Default ucc.conf files:

A default version of ucc.conf is available with the HPC-X installation in the <ucc_install_dir>/share directory. Default tuning for TL/UCP Allreduce on multiple architectures (Intel Broadwell, Intel Skylake, AMD Rome) has been researched and added by UCC developers.

Users who clone the UCC repo do not get a default ucc.conf installed. However, they can copy the example ucc.conf from ucc/contrib/ucc.conf into the local install/share/ucc.conf.


15. How to compile UCC for a specific GPU architecture?

To compile UCC for a particular GPU architecture, pass the "--with-nvcc-gencode" flag to "./configure". For instance, to compile UCC for the NVIDIA Volta architecture, run the following command:

./configure --with-nvcc-gencode="-gencode=arch=compute_70,code=sm_70"

You can also specify multiple GPU architectures using the "--with-nvcc-gencode" flag, as shown below:

./configure --with-nvcc-gencode="-gencode=arch=compute_70,code=sm_70 -gencode=arch=compute_80,code=sm_80"

For more information on the NVCC code generation options, please refer to the documentation at https://docs.nvidia.com/cuda/cuda-compiler-driver-nvcc/index.html#generate-code-specification-gencode.
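
When several architectures are targeted, the flag value can be generated instead of typed by hand. The sketch below assumes a hypothetical helper named gencode_flags that takes compute capabilities as plain numbers (70 for Volta, 80 for Ampere). It prints the configure command rather than running it, so it can be tried outside a UCC source tree.

```shell
#!/bin/sh
# Hypothetical helper (not part of UCC): build the value for
# --with-nvcc-gencode from a list of compute capabilities.
gencode_flags() {
    flags=""
    for cc in "$@"; do
        # Append one -gencode entry per capability, space separated.
        flags="${flags:+$flags }-gencode=arch=compute_${cc},code=sm_${cc}"
    done
    printf '%s\n' "$flags"
}

# Print the resulting configure invocation for Volta and Ampere.
printf './configure --with-nvcc-gencode="%s"\n' "$(gencode_flags 70 80)"
```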