/**
* @copyright (c) 2021, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
*
* See file LICENSE for terms.
*/
## Current
## 1.1.0 (TBD)
### Features
#### API
- Added float128 and complex float32/64/128 data types
- Added Active Set based collectives to support dynamic groups as well as
  point-to-point messaging
- Added ucc_team_get_attr interface
#### Core
- Added config file support
- Fixed component search
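The config file support added above lets UCC tuning knobs be set from a file instead of the environment. A hypothetical fragment is sketched below; the variable names mirror UCC's environment variables, but the file location and the specific values shown are illustrative assumptions, not documented defaults:

```
# ucc.conf (illustrative) - entries mirror UCC_* environment variables
UCC_TLS=ucp,cuda
UCC_LOG_LEVEL=info
```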
#### CL
- Added split-rail allreduce collective implementation
- Enabled hierarchical alltoallv and barrier
- Fixed cleanup bugs
#### TL
- Added SELF TL supporting teams of size one
##### UCP
- Added service broadcast
- Added reduce_scatterv ring algorithm
- Added k-nomial based gather collective implementation
- Added one-sided get based algorithms
##### SHARP
- Fixed SHARP OOB
- Added SHARP broadcast
##### GPU Collectives (CUDA, NCCL, and RCCL TLs)
- Added support for CUDA TL (intranode collectives for NVIDIA GPUs)
- Added multiring allgatherv, alltoall, reduce-scatter, and reduce-scatterv
  algorithms in CUDA TL
- Added topology-based ring construction in CUDA TL to maximize bandwidth
- Added NCCL gather and scatter collectives and their vector variants
- Enabled using multiple streams for collectives
- Added support for RCCL gather(v), scatter(v), broadcast, allgather(v),
  barrier, alltoall(v), and allreduce collectives
- Added ROCm memory component
- Adapted all GPU collectives to executor design
#### Tests
- Added tests for triggered collectives in perftests
- Fixed bugs in multi-threading tests
#### Utils
- Added CPU model and vendor detection
- Several bug fixes in all components
## 1.0.0 (April 19th, 2022)
### Features
#### API
- Added Avg reduce operation
- Added nonblocking team destroy option
- Added user-defined datatype definitions
- Added Bfloat16 type
- Clarified semantics of core abstractions, including teams and contexts
- Added timeout option
#### Core
- Added coll scoring and selection support
- Added support for Triggered collectives
- Added support for timeouts in collectives
- Added support for team create without ep in post
- Added support for multithreaded context progress
- Added support for nonblocking team destroy
#### CL
- Added support for hierarchical collectives
- Added support for hierarchical allreduce collective operation
- Added support for collectives based on one-sided communication routines
#### TL
- Added SHARP TL
##### UCP
- Added Bcast SAG algorithm for large messages
- Added Knomial based reduce algorithm
- Made allgather and alltoall agree with the API
- Added SRA knomial allreduce algorithm
- Added pairwise alltoall and alltoallv algorithms
- Added allgather and allgatherv ring algorithms
- Added support for collective operations based on one-sided semantics
- Added support for alltoall with one-sided transfer semantics
- Bug fixes
##### SHARP
- Added support for switch based hardware collectives (SHARP)
##### NCCL
- Added support for NCCL allreduce, alltoall, alltoallv, barrier, reduce,
  reduce-scatter, bcast, allgather, and allgatherv
#### Tests
- Updated tests to test the newly added algorithms and operations
## 0.1.0 (TBD)
### Features
#### API
- UCC API to support library, contexts, teams, collective operations, execution
engine, memory types, and triggered operations
#### Core
- Added implementation for UCC abstractions - library, context, team,
collective operations, execution engine, memory types, and triggered
operations
- Added support for memory types: CUDA and CPU
- Added support for configuring UCC library and contexts
#### CL
- Added support for collectives with source and destination buffers in either
  CPU or device (GPU) memory
- Added support for UCC_THREAD_MULTIPLE
- Added support for CUDA stream-based collectives
#### TL
- Added support for send/receive-based collectives using UCX/UCP as a transport
  layer
- Added support in the UCP TL for basic collective types, including barrier,
  alltoall, alltoallv, broadcast, allgather, allgatherv, and allreduce
- Added support for using NCCL as a transport layer
- Added support in the NCCL TL for collective types including alltoall,
  alltoallv, allgather, allgatherv, allreduce, and broadcast
#### Tests
- Added support for unit testing (gtest) infrastructure
- Added support for MPI tests