Example code for mixed-precision training in TensorFlow and PyTorch.
It's good to have layer dimensions as multiples of 8 to utilize the TensorCores in Volta GPUs (a small sketch follows this list):
- Convolutions: number of input channels, output channels, and batch size should be multiples of 8
- GEMM: the M, N, and K dimensions should be multiples of 8
- Fully connected layers: input features, output features, and batch size should be multiples of 8
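As a rough illustration (not code from this repo), one way to follow the guideline is to round the dimensions you control, such as batch size and hidden width, up to the next multiple of 8; the `round_up_to_multiple_of_8` helper below is hypothetical:

```python
import torch
import torch.nn as nn

def round_up_to_multiple_of_8(n):
    # hypothetical helper: pad a dimension up to the next multiple of 8
    return ((n + 7) // 8) * 8

batch_size = round_up_to_multiple_of_8(100)     # 100 -> 104
hidden_units = round_up_to_multiple_of_8(1000)  # 1000 is already a multiple of 8

# 784 (= 28*28 MNIST pixels) happens to be a multiple of 8 already; the number of
# classes (10) is not, but the large GEMMs are the ones that matter most.
layer = nn.Linear(784, hidden_units).cuda().half()
x = torch.randn(batch_size, 784, device="cuda", dtype=torch.float16)
y = layer(x)  # fp16 GEMM with M = batch_size, N = hidden_units, K = 784
```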
- mnist_softmax.py - the simple softmax MNIST classification example from the TensorFlow sources
- mnist_softmax_fp16_naive.py - naive fp16 implementation - it just works
- mnist_softmax_deep.py - softmax MNIST classification with one hidden layer
- mnist_softmax_deep_fp16_naive.py - naive fp16 implementation of mnist_softmax_deep.py - it doesn't work
- mnist_softmax_deep_fp16_advanced.py - mixed-precision implementation of mnist_softmax_deep.py - trains faster by utilizing the TensorCores in Volta GPUs and uses less memory; experiment with the number of hidden units to see how it affects TensorCore utilization and training speed (the general recipe is sketched after this list)
- mnist_softmax_deep_conv_fp16_advanced.py - mixed-precision implementation of a convolutional neural network for MNIST classification; experiment with the convolutional filter sizes to see whether they affect TensorCore utilization and training speed
- pytorch - corresponding PyTorch implementations of the examples above
- Run the program with nvprof and check the log output - if there are kernel calls whose names contain "884", TensorCores are being used. Example:
```
nvprof python mnist_softmax_deep_conv_fp16_advanced.py
```
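The fp16_advanced examples follow the usual mixed-precision recipe: run the forward and backward passes in fp16 (so the GEMMs and convolutions can use TensorCores), keep an fp32 master copy of the weights for the optimizer update, and scale the loss so that small gradients stay representable in fp16. The PyTorch sketch below is a minimal illustration of that recipe, not the repository's code; the model and hyperparameters are placeholders:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

loss_scale = 128.0  # the "default" value used by the examples here

# fp16 model for the forward/backward pass; dimensions chosen as multiples of 8
model = nn.Sequential(nn.Linear(784, 1024), nn.ReLU(), nn.Linear(1024, 10)).cuda().half()

# fp32 master copy of the weights; this is what the optimizer actually updates
master_params = [p.detach().clone().float().requires_grad_(True) for p in model.parameters()]
optimizer = torch.optim.SGD(master_params, lr=0.01)

def train_step(images, labels):
    model.zero_grad()
    logits = model(images.cuda().half())
    loss = F.cross_entropy(logits.float(), labels.cuda())  # compute the loss in fp32
    (loss * loss_scale).backward()                          # scale the loss before backward
    for master, param in zip(master_params, model.parameters()):
        master.grad = param.grad.detach().float() / loss_scale  # unscale into fp32 grads
    optimizer.step()                                        # fp32 weight update
    with torch.no_grad():
        for master, param in zip(master_params, model.parameters()):
            param.copy_(master.half())                      # copy fp32 weights back into the fp16 model
    return loss.item()
```

Computing the loss in fp32 and dividing the gradients by the scale before the weight update are what keep this numerically close to an fp32 run.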
The "default" loss-scaling value of 128 works for all the examples here. However, in a case it doesn't work, it's advised to choose a large value and gradually decrease it until sucessful. apex is a easy-to-use mixed-precision training utilities for PyTorch, and it's loss-scaler does that.