Releases: eth-cscs/COSMA
COSMA-v2.6.6
Fix linking against cray-libsci.
COSMA-v2.6.5
- fix a bug in tiled-mm API
- fix cmake related to nccl/rccl
COSMA-v2.6.4
Update submodules, minor fixes in cmake.
COSMA-v2.6.3
Improvements in cmake config. Update to new tiled-mm API.
COSMA-v2.6.2
This release fixes a bug in find_package(cosma) (cmake).
COSMA-v2.6.1
This release fixes the issues of COSMA-v2.6.0 coming from resizing the memory pool, as reported here.
2.6.0-fixed
Fixed a bug with memory pool resizing.
COSMA-v2.6.0
This release enables COSMA to take advantage of fast GPU-to-GPU interconnects like NVLink, to efficiently utilize modern multi-GPU systems. This is achieved in two ways:
- Using NCCL/RCCL libraries: by specifying the -DCOSMA_WITH_NCCL=ON cmake option.
- Using GPU-aware MPI: by specifying the -DCOSMA_WITH_GPU_AWARE_MPI=ON cmake option, as proposed here.
See README and INSTALL for more info on how to build.
In addition, the following performance improvements have been made:
- Improved Caching:
  - all NCCL buffers, MPI comms and NCCL comms are cached and reused when appropriate.
  - all device memory is cached and reused.
- Reduced Data Transfers: Tiled-MM, the GPU backend of COSMA, is extended to let the user leave the resulting matrix C on the GPU. In that case, matrix C does not need to be transferred from device to host, which not only reduces communication but also speeds up the whole cpu->gpu pipeline, as no additional synchronizations are needed. Furthermore, the reduce_scatter operation does not have to wait for C to be transferred back to the host but is immediately invoked with GPU pointers, thus utilizing fast inter-GPU links. This way, there are no unnecessary data transfers between cpu<->gpu.
- All collectives updated: both all-gather and reduce-scatter collectives are improved.
- Reduced Data Reshuffling: avoids double reshuffling of data, i.e. the data from NCCL/RCCL GPU buffers is immediately copied in the right layout, without additional reshuffling.
- Works for variable blocks: the NCCL/RCCL reduce_scatter operation assumes that all the blocks are of the same size and is hence not completely equivalent to MPI_Reduce_scatterv, which we previously used. We padded all the blocks to overcome this issue.
- Portability: Supports both NVIDIA and AMD GPUs.
- Tiled-MM: Updated submodule
- COSTA: Updated submodule
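The padding idea from the "Works for variable blocks" item above can be illustrated with a small pure-Python simulation. This is only a sketch under stated assumptions, not COSMA's actual implementation: the function names (reduce_scatter_equal, pad_blocks, reduce_scatterv_via_padding) are hypothetical, the ranks are simulated as plain lists, and zero is assumed as the padding value because it is neutral for a sum reduction.

```python
def reduce_scatter_equal(send_buffers, block_size):
    """Simulate an equal-block reduce-scatter with a sum reduction.

    Every simulated rank contributes a buffer of n_ranks * block_size
    elements; rank r receives the element-wise sum of block r.
    """
    n_ranks = len(send_buffers)
    received = []
    for r in range(n_ranks):
        lo, hi = r * block_size, (r + 1) * block_size
        received.append([sum(buf[i] for buf in send_buffers) for i in range(lo, hi)])
    return received


def pad_blocks(buffer, counts):
    """Pad each variable-size block with zeros up to the largest block size."""
    max_count = max(counts)
    padded, offset = [], 0
    for count in counts:
        padded.extend(buffer[offset:offset + count])
        padded.extend([0] * (max_count - count))  # zero is neutral for a sum
        offset += count
    return padded


def reduce_scatterv_via_padding(send_buffers, counts):
    """Emulate a variable-block reduce-scatter on top of the equal-block one."""
    max_count = max(counts)
    padded = [pad_blocks(buf, counts) for buf in send_buffers]
    reduced = reduce_scatter_equal(padded, max_count)
    # Drop the padding again so rank r keeps exactly counts[r] elements.
    return [block[:count] for block, count in zip(reduced, counts)]


if __name__ == "__main__":
    counts = [1, 3, 2]  # hypothetical block sizes for 3 simulated ranks
    buffers = [
        [1, 2, 3, 4, 5, 6],
        [10, 20, 30, 40, 50, 60],
        [100, 200, 300, 400, 500, 600],
    ]
    print(reduce_scatterv_via_padding(buffers, counts))
    # prints [[111], [222, 333, 444], [555, 666]]
```

The cost of this approach is the extra traffic for the padding elements, which grows with the spread between the smallest and the largest block.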
COSMA-v2.5.1
Fixes the building issue with cmake versions prior to 3.12.2.
COSMA-v2.5.0
This version brings the following improvements:
- [feature] Adds the COSMA_DIM_THRESHOLD environment variable to cosma_prefixed_pxgemm.
- [improvements] Fixes the building issues and dependency handling in CMake.
- [bugfix] Fixes OpenMP race conditions.
- [bugfix] Resolves the problem with setting devices when running COSMA on multi-GPU systems.