FAQ
1. What is UCC?
2. What are the important components of UCC reference implementation?
3. How can I participate?
4. How to compile and run UCC with Open MPI?
5. How to compile and run UCC with PyTorch?
6. What is TL scoring and how to select a certain TL?
7. What are the dependencies for UCC?
8. How to compile all TLs?
9. How to compile a specific TL?
10. How to compile and run UCC with OpenSHMEM Applications?
11. How to implement a new TL for UCC?
12. Where can I find a simple UCC example?
13. How to configure UCC components with configuration file and priority?
14. Where can I find more details about the API and more UCC documentation?
15. How to compile UCC for a specific GPU architecture?
1. What is UCC?
UCC is a collective communication operations API and library that is flexible, complete, and feature-rich for current and emerging programming models and runtimes.
2. What are the important components of UCC reference implementation?
Please refer to: https://github.com/openucx/ucc/blob/master/docs/images/ucc_components.png
3. How can I participate?
- Propose features, discuss issues, and review design and code on GitHub
- Participate in the weekly working group meetings
- Mailing list: https://elist.ornl.gov/mailman/listinfo/ucx-group
4. How to compile and run UCC with Open MPI?
Please refer to: https://github.com/openucx/ucc#open-mpi-and-ucc-collectives
5. How to compile and run UCC with PyTorch?
UCC is available as an internal ProcessGroup backend starting with the PyTorch 2.0 release. For earlier PyTorch releases, please refer to the PyTorch ProcessGroup UCC backend for details on how to use UCC.
6. What is TL scoring and how to select a certain TL?
Env var pattern: UCC_<TL/CL>_<NAME>_TUNE=token1#token2#...#tokenn - a '#' separated list of tokens,
where token=coll_type:msg_range:mem_type:team_size:score:alg - a ':' separated list of qualifiers.
Each qualifier is optional. The only requirement is that either "score" or "alg" is provided.
Qualifiers:
- coll_type = coll_type_1,coll_type_2,...,coll_type_n - a ',' separated list of coll_types
- msg_range = m_start_1-m_end_1,m_start_2-m_end_2,...,m_start_n-m_end_n - a ',' separated list of msg ranges, where each range is given by "start" and "end" values separated by "-". Values can be numbers with "Size" characters, e.g. 128, 256b, 4K, 1M. The special value "inf" means MAX msg size.
- mem_type = m1,m2,...,mn - a ',' separated list of memory types
- team_size = [t_start_1-t_end_1,t_start_2-t_end_2,...,t_start_n-t_end_n] - a ',' separated list of team size ranges enclosed in [].
- score = <int> - an int value from 0 to "inf"
- alg = @<value|str> - the character @ followed by either an int number or a string identifying the collective algorithm.
Examples:
- UCC_TL_NCCL_TUNE=0 - disable all NCCL collectives (no qualifiers are specified, so score 0 applies to ALL collectives, ALL memory types, the default [0-inf] msg range, and the default [0-inf] team size).
- UCC_TL_NCCL_TUNE=allreduce:cuda:inf#alltoall:0 - force NCCL allreduce for "cuda" buffers and disable alltoall
- UCC_TL_UCP_TUNE=bcast:0-4K:cuda:0#bcast:65k-1M:[25-100]:cuda:inf - disable UCP bcast on cuda buffers for msg sizes 0-4K and force UCP bcast on cuda buffers for msg sizes 65K-1M only for teams with 25-100 ranks
- UCC_TL_UCP_TUNE=allreduce:0-4K:@0#allreduce:4K-inf:@sra_knomial - for TL_UCP, set the allreduce algorithm to 0 for msg range 0-4K and to 1 (sra_knomial) for msg range 4K-inf.
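The token grammar above can be easier to read when each token is built separately and then joined. The following is a minimal sketch, not part of UCC itself; the variable names `small` and `large` are purely illustrative, and the value it produces matches the last example in the list above.

```shell
#!/bin/sh
# Each token is coll_type:msg_range:mem_type:team_size:score:alg.
# Every qualifier is optional as long as "score" or "alg" is present.
small="allreduce:0-4K:@0"               # algorithm 0 for messages up to 4K
large="allreduce:4K-inf:@sra_knomial"   # sra_knomial for larger messages

# Tokens are joined with '#'. Quoting the value is a defensive habit:
# '#' begins a shell comment when it starts a word.
UCC_TL_UCP_TUNE="${small}#${large}"
export UCC_TL_UCP_TUNE

echo "UCC_TL_UCP_TUNE=${UCC_TL_UCP_TUNE}"
```

Exporting the variable before launching the application is one option; launchers such as oshrun can also forward it with -x, as in the OpenSHMEM example in this FAQ.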
7. What are the dependencies for UCC?
It depends on the system configuration, the workload that uses UCC, and the TLs/CLs the user wants to enable. Possible dependencies:
- UCX
- NCCL
- Doxygen
8. How to compile all TLs?
All available TLs are compiled by default (--with-tls=all).
9. How to compile a specific TL?
Users can pass a list of specific TLs to compile via --with-tls. For example, --with-tls=ucp builds only the "ucp" TL, and --with-tls=sharp,nccl builds tl/sharp and tl/nccl.
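As a rough sketch of how the TL selection fits into a configure invocation, the snippet below only assembles and prints the command line instead of running it. The install prefix is a hypothetical placeholder, and `--prefix` is the standard autoconf option rather than anything specific to this FAQ.

```shell
#!/bin/sh
# Hypothetical install prefix; adjust to your system.
PREFIX="${PREFIX:-$HOME/ucc-install}"
# Comma-separated TL names, or "all" for every available TL.
TLS="sharp,nccl"

CONFIGURE_ARGS="--prefix=${PREFIX} --with-tls=${TLS}"

# Print the command so it can be reviewed before running it
# from the top of a UCC source tree.
echo "./configure ${CONFIGURE_ARGS}"
```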
10. How to compile and run UCC with OpenSHMEM Applications?
For instructions on compiling OSHMEM with Open MPI, please refer to: https://github.com/openucx/ucc#open-mpi-and-ucc-collectives
To run OpenSHMEM applications:
$ oshrun -np 2 --mca scoll_ucc_enable 1 --mca scoll_ucc_priority 100 ./my_openshmem_app
To run OpenSHMEM applications with one-sided collectives (i.e., Alltoall):
$ oshrun -np 2 --mca scoll_ucc_enable 1 --mca scoll_ucc_priority 100 -x UCC_TL_UCP_TUNE=alltoall:0-inf:@onesided ./my_openshmem_app
13. How to configure UCC components with configuration file and priority?
The UCC configuration file (ucc.conf) provides a unified way of tailoring the behavior of UCC components such as CLs, TLs, and ECs to meet workload needs. The configuration variables are of the format <VAR = VALUE>.
Examples
Selecting a hierarchy CL
UCC_CLS=hier
Selecting a UCP TL
UCC_TLS=ucp
Selecting an algorithm
UCC_TL_SHARP_TUNE=allreduce:inf
Log info
UCC_TL_UCP_LOG_LEVEL=INFO
The VALUE can also specify message size ranges and memory types, e.g.: UCC_TL_UCP_ALLREDUCE_KN_RADIX=0-8k:host:8,8k-inf:host:2. Currently, the implementation supports such ranged radices for the Allreduce collective in TL/UCP. Similar range support can be added for other TLs and collectives as UCC developers or users find the need.
In addition, ucc.conf contains architecture-specific tuning sections for optimal performance. Each section is identified by key-value pairs including vendor, model, team size, processes-per-node, and number of nodes. For example: [vendor=intel model=skylake team_size=8 ppn=1 nnodes=8]. The specific tuning parameters for that section follow the section title.
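To make the file layout concrete, here is a minimal sketch that writes an illustrative ucc.conf to a scratch path and points UCC at it through UCC_CONFIG_FILE. The scratch path and the radix values inside the section are made up for illustration, and the sketch assumes that unsectioned variables may precede a tuning section.

```shell
#!/bin/sh
# Hypothetical scratch location for the example file.
CONF="${TMPDIR:-/tmp}/ucc.conf.example"

# Global settings first, then one architecture-specific section.
# The section's radix values are illustrative, not tuned results.
cat > "$CONF" <<'EOF'
UCC_CLS = hier
UCC_TL_UCP_LOG_LEVEL = INFO
UCC_TL_UCP_ALLREDUCE_KN_RADIX = 0-8k:host:8,8k-inf:host:2

[vendor=intel model=skylake team_size=8 ppn=1 nnodes=8]
UCC_TL_UCP_ALLREDUCE_KN_RADIX = 0-8k:host:4,8k-inf:host:2
EOF

# Ask UCC to read this file instead of the default locations.
UCC_CONFIG_FILE="$CONF"
export UCC_CONFIG_FILE
echo "UCC_CONFIG_FILE=$UCC_CONFIG_FILE"
```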
Precedence:
Command line: If a user sets a UCC variable both on the command line and in the configuration file, the VALUE provided on the command line takes precedence.
Multiple ucc.conf files: When multiple configuration files are found in the runtime environment, the priority, from highest to lowest, is as follows:
- The file named by the environment variable UCC_CONFIG_FILE
- The ucc.conf file in $HOME
- The ucc.conf file in the install directory: <ucc_install_dir>/share/ucc.conf
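The lookup order above can be sketched as a small shell helper that reports which file would win. This mirrors the documented priority only and is not UCC's own lookup code. The function name `find_ucc_conf`, its argument, and the `/usr/local` fallback are hypothetical, and the sketch assumes that a UCC_CONFIG_FILE pointing at a missing file falls through to the next location.

```shell
#!/bin/sh
# Print the first ucc.conf found, following the documented priority order.
# $1: UCC install directory (hypothetical argument of this helper).
find_ucc_conf() {
    install_dir="$1"
    if [ -n "${UCC_CONFIG_FILE:-}" ] && [ -f "$UCC_CONFIG_FILE" ]; then
        echo "$UCC_CONFIG_FILE"             # 1. explicit override
    elif [ -n "${HOME:-}" ] && [ -f "$HOME/ucc.conf" ]; then
        echo "$HOME/ucc.conf"               # 2. per-user file
    elif [ -f "$install_dir/share/ucc.conf" ]; then
        echo "$install_dir/share/ucc.conf"  # 3. install default
    else
        return 1                            # nothing found
    fi
}

find_ucc_conf "${UCC_INSTALL_DIR:-/usr/local}" || echo "no ucc.conf found"
```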
Default ucc.conf files:
A default version of ucc.conf is available with the HPC-X installation in the <ucc_install_dir>/share directory. Default tuning for TL/UCP Allreduce on multiple architectures (Intel Broadwell, Intel Skylake, AMD Rome) has been researched and added by UCC developers.
Users who clone the UCC repo do not get a default ucc.conf file installed. However, they can copy the example version from ucc/contrib/ucc.conf into the local install/share/ucc.conf.
The UCC configuration file (ucc.conf) provides a unified way of tailoring the behavior of UCC components - CLs, TLs, and ECs. The configuration file can contain any UCC variables of the format VAR = VALUE
15. How to compile UCC for a specific GPU architecture?
To compile UCC for a particular GPU architecture, pass the "--with-nvcc-gencode" flag to "./configure" along with any other options you need. For instance, to compile UCC for the NVIDIA Volta architecture, run the following command:
./configure --with-nvcc-gencode="-gencode=arch=compute_70,code=sm_70"
You can also specify multiple GPU architectures using the "--with-nvcc-gencode" flag, as shown below:
./configure --with-nvcc-gencode="-gencode=arch=compute_70,code=sm_70 -gencode=arch=compute_80,code=sm_80"
For more information on the NVCC code generation options, please refer to the documentation at https://docs.nvidia.com/cuda/cuda-compiler-driver-nvcc/index.html#generate-code-specification-gencode.
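When targeting several architectures, the flag value can be generated from a list of compute capabilities rather than typed by hand. The following is a minimal sketch with illustrative variable names (`ARCHS`, `GENCODE`); it prints the resulting configure command instead of running it.

```shell
#!/bin/sh
# Compute capabilities to target, written without the dot
# (70 = Volta, 80 = Ampere).
ARCHS="70 80"

GENCODE=""
for cc in $ARCHS; do
    # Append one -gencode entry per architecture, space separated.
    GENCODE="${GENCODE:+$GENCODE }-gencode=arch=compute_${cc},code=sm_${cc}"
done

# Print the command for review before running it in a UCC source tree.
echo "./configure --with-nvcc-gencode=\"${GENCODE}\""
```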