The release notes for the ROCm platform.
ROCm 5.6 brings several AI software ecosystem improvements for our fast-growing user base. A few examples include:
- New documentation portal at https://rocm.docs.amd.com
- Ongoing software enhancements for LLMs, ensuring full compliance with the HuggingFace unit test suite
- OpenAI Triton, CuPy, HIP Graph support, and many other library performance enhancements
- Improved ROCm deployment and development tools, including the CPU-GPU debugger (ROCgdb), the profiler, and Docker containers
- New pseudorandom generators are available in rocRAND. Added support for half-precision transforms in hipFFT/rocFFT. Added LU refactorization and linear system solver for sparse matrices in rocSOLVER.
- SLES15 SP5 support was added in this release. SLES15 SP3 support was dropped.
- AMD Instinct MI50, Radeon Pro VII, and Radeon VII products (collectively referred to as gfx906 GPUs) will enter maintenance mode starting Q3 2023. This will be aligned with the ROCm 5.7 GA release date.
  - No new features or performance optimizations will be supported for the gfx906 GPUs beyond ROCm 5.7
  - Bug fixes and critical security patches will continue to be supported for the gfx906 GPUs until Q2 2024 (End of Maintenance [EOM]), aligned with the closest ROCm release
  - Bug fixes during the maintenance period will be made to the next ROCm point release
  - Bug fixes will not be backported to older ROCm releases for this SKU
  - Distro / operating system updates will continue as per the ROCm release cadence for gfx906 GPUs until EOM
- AMDSMI CLI tool enabled for Linux Bare Metal & Guest
- Package: amd-smi-lib
- Not all Error Correction Code (ECC) fields are currently supported
- RHEL 8 & SLES 15 have extra install steps
- Stability fix for multi-GPU systems, reproducible via ROCm_Bandwidth_Test as reported in Issue 2198.
- Consolidation of hipamd, rocclr and OpenCL projects in clr
- Optimized lock for graph global capture mode
- Added hipRTC support for amd_hip_fp16
- Added hipStreamGetDevice implementation to get the device associated with the stream
- Added HIP_AD_FORMAT_SIGNED_INT16 in hipArray formats
- hipArrayGetInfo for getting information about the specified array
- hipArrayGetDescriptor for getting 1D or 2D array descriptor
- hipArray3DGetDescriptor to get 3D array descriptor
- hipMallocAsync to return success for zero size allocation to match hipMalloc
- Separation of hipcc perl binaries from HIP project to hipcc project. hip-devel package depends on newly added hipcc package
- Consolidation of hipamd, ROCclr, and OpenCL repositories into a single repository called clr. Instructions are updated to build HIP from sources in the HIP Installation guide
- Removed hipBusBandwidth and hipCommander samples from hip-tests
- Fixed regression in hipMemCpyParam3D when offset is applied
- Limited testing on xnack+ configuration
- Multiple HIP test failures (gpuvm fault or hangs)
- hipSetDevice and hipSetDeviceFlags APIs return hipErrorInvalidDevice instead of hipErrorNoDevice, on a system without GPU
- Known memory leak when code object files are loaded/unloaded via hipModuleLoad/hipModuleUnload APIs. Issue will be fixed in a future ROCm release
- Removal of gcnarch from hipDeviceProp_t structure
- Addition of new fields in hipDeviceProp_t structure
  - maxTexture1D
  - maxTexture2D
  - maxTexture1DLayered
  - maxTexture2DLayered
  - sharedMemPerMultiprocessor
  - deviceOverlap
  - asyncEngineCount
  - surfaceAlignment
  - unifiedAddressing
  - computePreemptionSupported
  - uuid
- Removal of deprecated code
  - hip-hcc codes from hip code tree
- Correct hipArray usage in HIP APIs such as hipMemcpyAtoH and hipMemcpyHtoA
- HIPMEMCPY_3D fields correction (unsigned int -> size_t)
- Renaming of 'memoryType' in hipPointerAttribute_t structure to 'type'
- Improved performance when handling the end of a process with a large number of threads.
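
The HIPMEMCPY_3D field correction (unsigned int -> size_t) listed above is about field width. The following is a minimal sketch in plain C, not HIP code, of why a 32-bit field can be too narrow for a byte offset. The function names are hypothetical, and `uint64_t` stands in for `size_t` on the assumption of a 64-bit Linux target.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper (not a HIP API): byte offset of a row in a pitched
 * allocation, computed in 64 bits. uint64_t stands in for size_t here. */
uint64_t row_offset_64(uint64_t row, uint64_t pitch_bytes)
{
    assert(pitch_bytes != 0);
    return row * pitch_bytes;
}

/* The same computation with 32-bit fields. Unsigned arithmetic wraps
 * modulo 2^32, so offsets of 4 GiB or more silently lose their high bits. */
uint32_t row_offset_32(uint32_t row, uint32_t pitch_bytes)
{
    assert(pitch_bytes != 0);
    return row * pitch_bytes;
}
```

With a 64 KiB pitch, row 70000 lies 4587520000 bytes into the allocation. The 32-bit version returns 292552704 instead, which points at a different location entirely.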
Known Issues
- On certain configurations, ROCgdb can show the following warning message:
  `warning: Probes-based dynamic linker interface failed. Reverting to original interface.`
  This does not affect ROCgdb's functionalities.
In ROCm 5.6 the `rocprofilerv1` and `rocprofilerv2` include and library files of ROCm 5.5 are split into separate files. The `rocmtools` files that were deprecated in ROCm 5.5 have been removed.
ROCm 5.6 | rocprofilerv1 | rocprofilerv2 |
---|---|---|
Tool script | `bin/rocprof` | `bin/rocprofv2` |
API include | `include/rocprofiler/rocprofiler.h` | `include/rocprofiler/v2/rocprofiler.h` |
API library | `lib/librocprofiler.so.1` | `lib/librocprofiler.so.2` |
The ROCm Profiler Tool that uses `rocprofilerV1` can be invoked using the following command:
$ rocprof …
To write a custom tool based on the `rocprofilerV1` API, do the following:
main.c:
#include <rocprofiler/rocprofiler.h> // Use the rocprofilerV1 API
int main() {
// Use the rocprofilerV1 API
return 0;
}
This can be built in the following manner:
$ gcc main.c -I/opt/rocm-5.6.0/include -L/opt/rocm-5.6.0/lib -lrocprofiler64
The resulting `a.out` will depend on `/opt/rocm-5.6.0/lib/librocprofiler64.so.1`.
The ROCm Profiler that uses the `rocprofilerV2` API can be invoked using the following command:
$ rocprofv2 …
To write a custom tool based on the `rocprofilerV2` API, do the following:
main.c:
#include <rocprofiler/v2/rocprofiler.h> // Use the rocprofilerV2 API
int main() {
// Use the rocprofilerV2 API
return 0;
}
This can be built in the following manner:
$ gcc main.c -I/opt/rocm-5.6.0/include -L/opt/rocm-5.6.0/lib -lrocprofiler64-v2
The resulting `a.out` will depend on `/opt/rocm-5.6.0/lib/librocprofiler64.so.2`.
- Improved Test Suite
- 'end_time' needs to be disabled in roctx_trace.txt
- rocprof in ROCm/5.4.0: GPU selector broken.
- rocprof in ROCm/5.4.1 fails to generate kernel info.
- rocprof clobbers LD_PRELOAD.
Library | Version |
---|---|
hipBLAS | ⇒ 1.0.0 |
hipCUB | ⇒ 2.13.1 |
hipFFT | ⇒ 1.0.12 |
hipSOLVER | ⇒ 1.8.0 |
hipSPARSE | ⇒ 2.3.6 |
MIOpen | ⇒ 2.19.0 |
rccl | ⇒ 2.15.5 |
rocALUTION | ⇒ 2.1.9 |
rocBLAS | ⇒ 3.0.0 |
rocFFT | ⇒ 1.0.23 |
rocm-cmake | ⇒ 0.9.0 |
rocPRIM | ⇒ 2.13.0 |
rocRAND | ⇒ 2.10.17 |
rocSOLVER | ⇒ 3.22.0 |
rocSPARSE | ⇒ 2.5.2 |
rocThrust | ⇒ 2.18.0 |
rocWMMA | ⇒ 1.1.0 |
Tensile | ⇒ 4.37.0 |
hipBLAS 1.0.0 for ROCm 5.6.0
- added const qualifier to hipBLAS functions (swap, sbmv, spmv, symv, trsm) where missing
- removed support for deprecated hipblasInt8Datatype_t enum
- removed support for deprecated hipblasSetInt8Datatype and hipblasGetInt8Datatype functions
- in-place trmm is deprecated. It will be replaced by trmm which includes both in-place and out-of-place functionality
hipCUB 2.13.1 for ROCm 5.6.0
- Benchmarks for `BlockShuffle`, `BlockLoad`, and `BlockStore`.
- CUB backend references CUB and Thrust version 1.17.2.
- Improved benchmark coverage of `BlockScan` by adding `ExclusiveScan`, benchmark coverage of `BlockRadixSort` by adding `SortBlockedToStriped`, and benchmark coverage of `WarpScan` by adding `Broadcast`.
- Updated `docs` directory structure to match the standard of rocm-docs-core.
- `BlockRadixRankMatch` is currently broken under the rocPRIM backend.
- `BlockRadixRankMatch` with a warp size that does not exactly divide the block size is broken under the CUB backend.
hipFFT 1.0.12 for ROCm 5.6.0
- Implemented the hipfftXtMakePlanMany, hipfftXtGetSizeMany, hipfftXtExec APIs, to allow requesting half-precision transforms.
- Added --precision argument to benchmark/test clients. --double is still accepted but is deprecated as a method to request a double-precision transform.
hipSOLVER 1.8.0 for ROCm 5.6.0
- Added compatibility API with hipsolverRf prefix
hipSPARSE 2.3.6 for ROCm 5.6.0
- Added SpGEMM algorithms
- For hipsparseXbsr2csr and hipsparseXcsr2bsr, blockDim == 0 now returns HIPSPARSE_STATUS_INVALID_SIZE
MIOpen 2.19.0 for ROCm 5.6.0
- ROCm 5.5 support for gfx1101 (Navi32)
- Tuning results for MLIR on ROCm 5.5
- Bumping MLIR commit to 5.5.0 release tag
- Fix 3d convolution Host API bug
- [HOTFIX][MI200][FP16] Disabled ConvHipImplicitGemmBwdXdlops when FP16_ALT is required.
RCCL 2.15.5 for ROCm 5.6.0
- Compatibility with NCCL 2.15.5
- Unit test executable renamed to rccl-UnitTests
- HW-topology aware binary tree implementation
- Experimental support for MSCCL
- New unit tests for hipGraph support
- NPKit integration
- rocm-smi ID conversion
- Support for HIP_VISIBLE_DEVICES for unit tests
- Support for p2p transfers to non (HIP) visible devices
- Removed TransferBench from tools. Exists in standalone repo: https://github.com/ROCmSoftwarePlatform/TransferBench
rocALUTION 2.1.9 for ROCm 5.6.0
- Fixed synchronization issues in level 1 routines
rocBLAS 3.0.0 for ROCm 5.6.0
- Improved performance of Level 2 rocBLAS GEMV on gfx90a GPU for non-transposed problems having small matrices and larger batch counts. Performance enhanced for problem sizes when m and n <= 32 and batch_count >= 256.
- Improved performance of rocBLAS syr2k for single, double, and double-complex precision, and her2k for double-complex precision. Slightly improved performance for general sizes on gfx90a.
- Added bf16 inputs and f32 compute support to Level 1 rocBLAS Extension functions axpy_ex, scal_ex and nrm2_ex.
- trmm inplace is deprecated. It will be replaced by trmm that has both inplace and out-of-place functionality
- rocblas_query_int8_layout_flag() is deprecated and will be removed in a future release
- rocblas_gemm_flags_pack_int8x4 enum is deprecated and will be removed in a future release
- rocblas_set_device_memory_size() is deprecated and will be replaced by a future function rocblas_increase_device_memory_size()
- rocblas_is_user_managing_device_memory() is deprecated and will be removed in a future release
- is_complex helper was deprecated and is now removed. Use rocblas_is_complex instead.
- The enum truncate_t and the value truncate were deprecated and are now removed. They were replaced by rocblas_truncate_t and rocblas_truncate, respectively.
- rocblas_set_int8_type_for_hipblas was deprecated and is now removed.
- rocblas_get_int8_type_for_hipblas was deprecated and is now removed.
- build only dependency on python joblib added as used by Tensile build
- fix for cmake install on some OS when performed by install.sh -d --cmake_install
- make trsm offset calculations 64 bit safe
- refactor rotg test code
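
To illustrate what "bf16 inputs and f32 compute" means for a routine such as axpy, here is a minimal CPU sketch in plain C. It is not the rocBLAS API or its implementation. The type `bf16_bits`, the function names, and the choice of a `float` alpha are all assumptions made for illustration, and NaN handling is omitted.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

typedef uint16_t bf16_bits; /* hypothetical: raw bfloat16 storage */

/* bfloat16 is the upper 16 bits of an IEEE-754 binary32 value. */
float bf16_to_f32(bf16_bits h)
{
    uint32_t bits = (uint32_t)h << 16;
    float f;
    memcpy(&f, &bits, sizeof f);
    return f;
}

/* Round to nearest, ties to even. NaN inputs are not treated specially. */
bf16_bits f32_to_bf16(float f)
{
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits);
    bits += 0x7FFFu + ((bits >> 16) & 1u);
    return (bf16_bits)(bits >> 16);
}

/* y = alpha * x + y, with bf16 storage and all arithmetic done in f32. */
void axpy_bf16_f32(size_t n, float alpha, const bf16_bits *x, bf16_bits *y)
{
    assert(n == 0 || (x != NULL && y != NULL));
    for (size_t i = 0; i < n; ++i) {
        float acc = alpha * bf16_to_f32(x[i]) + bf16_to_f32(y[i]);
        y[i] = f32_to_bf16(acc);
    }
}
```

Widening each element to f32 before the multiply-add means intermediate values are rounded only once, when the result is stored back as bf16.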
rocFFT 1.0.23 for ROCm 5.6.0
- Implemented half-precision transforms, which can be requested by passing rocfft_precision_half to rocfft_plan_create.
- Implemented a hierarchical solution map which saves how to decompose a problem and the kernels to be used.
- Implemented a first version of offline-tuner to support tuning kernels for C2C/Z2Z problems.
- Replaced std::complex with hipComplex data types for data generator.
- FFT plan dimensions are now sorted to be row-major internally where possible, which produces better plans if the dimensions were accidentally specified in a different order (column-major, for example).
- Added --precision argument to benchmark/test clients. --double is still accepted but is deprecated as a method to request a double-precision transform.
- Fixed over-allocation of LDS in some real-complex kernels, which was resulting in kernel launch failure.
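
One way to picture sorting plan dimensions into row-major order is to order them by stride, with the largest stride first and the unit-stride dimension last. The sketch below is plain C and only illustrates that idea. The `fft_dim` struct and the function names are hypothetical, and this is not rocFFT's internal code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical description of one FFT dimension. */
typedef struct {
    size_t length; /* number of elements along this dimension */
    size_t stride; /* distance in elements between neighbours */
} fft_dim;

/* qsort comparator: larger strides sort first. */
static int by_descending_stride(const void *a, const void *b)
{
    const fft_dim *da = a;
    const fft_dim *db = b;
    if (da->stride < db->stride) return 1;
    if (da->stride > db->stride) return -1;
    return 0;
}

/* Reorder dims so the slowest-varying dimension comes first and the
 * fastest-varying (smallest stride) comes last, as in a C array. */
void sort_dims_row_major(fft_dim *dims, size_t rank)
{
    if (rank < 2) return;
    assert(dims != NULL);
    qsort(dims, rank, sizeof dims[0], by_descending_stride);
}
```

A 4x8x2 problem given fastest-first, with strides 1, 4, and 32, comes out as 2x8x4 with strides 32, 4, and 1.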
rocm-cmake 0.9.0 for ROCm 5.6.0
- Added the option ROCM_HEADER_WRAPPER_WERROR
- Compile-time C macro in the wrapper headers causes errors to be emitted instead of warnings.
- Configure-time CMake option sets the default for the C macro.
rocPRIM 2.13.0 for ROCm 5.6.0
- New block level `radix_rank` primitive.
- New block level `radix_rank_match` primitive.
- Added a stable block sorting implementation. This can be used with `block_sort` by using the `block_sort_algorithm::stable_merge_sort` algorithm.
- Improved the performance of `block_radix_sort` and `device_radix_sort`.
- Improved the performance of `device_merge_sort`.
- Updated `docs` directory structure to match the standard of rocm-docs-core. Contributed by: v01dXYZ.
- Disabled GPU error messages relating to incorrect warp operation usage with Navi GPUs on Windows, due to GPU printf performance issues on Windows.
- When `ROCPRIM_DISABLE_LOOKBACK_SCAN` is set, `device_scan` fails for input sizes bigger than `scan_config::size_limit`, which defaults to `std::numeric_limits<unsigned int>::max()`.
rocRAND 2.10.17 for ROCm 5.6.0
- MT19937 pseudo random number generator based on M. Matsumoto and T. Nishimura, 1998, Mersenne Twister: A 623-dimensionally equidistributed uniform pseudorandom number generator.
- New benchmark for the device API using Google Benchmark, `benchmark_rocrand_device_api`, replacing `benchmark_rocrand_kernel`. `benchmark_rocrand_kernel` is deprecated and will be removed in a future version. Likewise, `benchmark_curand_host_api` is added to replace `benchmark_curand_generate` and `benchmark_curand_device_api` is added to replace `benchmark_curand_kernel`.
- Experimental HIP-CPU feature
- ThreeFry pseudorandom number generator based on Salmon et al., 2011, "Parallel random numbers: as easy as 1, 2, 3".
- Python 2.7 is no longer officially supported.
rocSOLVER 3.22.0 for ROCm 5.6.0
- LU refactorization for sparse matrices
  - CSRRF_ANALYSIS
  - CSRRF_SUMLU
  - CSRRF_SPLITLU
  - CSRRF_REFACTLU
- Linear system solver for sparse matrices
  - CSRRF_SOLVE
- Added type `rocsolver_rfinfo` for use with sparse matrix routines
- Improved the performance of BDSQR and GESVD when singular vectors are requested
- BDSQR and GESVD should no longer hang when the input contains `NaN` or `Inf`
rocSPARSE 2.5.2 for ROCm 5.6.0
- Fixed a memory leak in csritsv
- Fixed a bug in csrsm and bsrsm
rocThrust 2.18.0 for ROCm 5.6.0
- `lower_bound`, `upper_bound`, and `binary_search` failed to compile for certain types.
- Updated `docs` directory structure to match the standard of rocm-docs-core.
rocWMMA 1.1.0 for ROCm 5.6.0
- Added cross-lane operation backends (Blend, Permute, Swizzle and Dpp)
- Added GPU kernels for rocWMMA unit test pre-process and post-process operations (fill, validation)
- Added performance gemm samples for half, single and double precision
- Added rocWMMA cmake versioning
- Added vectorized support in coordinate transforms
- Included ROCm smi for runtime clock rate detection
- Added fragment transforms for transpose and change data layout
- Default to GPU rocBLAS validation against rocWMMA
- Re-enabled int8 gemm tests on gfx9
- Upgraded to C++17
- Restructured unit test folder for consistency
- Consolidated rocWMMA samples common code
Tensile 4.37.0 for ROCm 5.6.0
- Added user driven tuning API
- Added decision tree fallback feature
- Added SingleBuffer + AtomicAdd option for GlobalSplitU
- DirectToVgpr support for fp16 and Int8 with TN orientation
- Added new test cases for various functions
- Added SingleBuffer algorithm for ZGEMM/CGEMM
- Added joblib for parallel map calls
- Added support for MFMA + LocalSplitU + DirectToVgprA+B
- Added asmcap check for MIArchVgpr
- Added support for MFMA + LocalSplitU
- Added frequency, power, and temperature data to the output
- Improved the performance of GlobalSplitU with SingleBuffer algorithm
- Reduced the running time of the extended and pre_checkin tests
- Optimized the Tailloop section of the assembly kernel
- Optimized complex GEMM (fixed vgpr allocation, unified CGEMM and ZGEMM code in MulMIoutAlphaToArch)
- Improved the performance of the second kernel of MultipleBuffer algorithm
- Updated custom kernels with 64-bit offsets
- Adapted 64-bit offset arguments for assembly kernels
- Improved temporary register re-use to reduce max sgpr usage
- Removed some restrictions on VectorWidth and DirectToVgpr
- Updated the dependency requirements for Tensile
- Changed the range of AssertSummationElementMultiple
- Modified the error messages for more clarity
- Changed DivideAndReminder to vectorStaticRemainder in case quotient is not used
- Removed dummy vgpr for vectorStaticRemainder
- Removed tmpVgpr parameter from vectorStaticRemainder/Divide/DivideAndReminder
- Removed qReg parameter from vectorStaticRemainder
- Fixed tmp sgpr allocation to avoid over-writing values (alpha)
- 64-bit offset parameters for post kernels
- Fixed gfx908 CI test failures
- Fixed offset calculation to prevent overflow for large offsets
- Fixed issues when BufferLoad and BufferStore are equal to zero
- Fixed StoreCInUnroll + DirectToVgpr + no useInitAccVgprOpt mismatch
- Fixed DirectToVgpr + LocalSplitU + FractionalLoad mismatch
- Fixed the memory access error related to StaggerU + large stride
- Fixed ZGEMM 4x4 MatrixInst mismatch
- Fixed DGEMM 4x4 MatrixInst mismatch
- Fixed ASEM + GSU + NoTailLoop opt mismatch
- Fixed AssertSummationElementMultiple + GlobalSplitU issues
- Fixed ASEM + GSU + TailLoop inner unroll