Algorithm and Protocol Tuner for AWS
With the v1.8.0-aws release, a new tuner component was added to the plugin. Currently, this plugin is only built for the AWS platform.
NCCL uses a combination of empirically derived costs for different aspects of data movement, the latency and bandwidth reported by a particular network plugin, and other algorithm-specific coefficients to compute the cost of a specific algorithm/protocol combination. These costs are used to pick an algorithm and protocol at runtime based on the size of the communicator and the message size of an operation. The default coefficients and empirically derived costs are not optimal for AWS's network.
The nccl_ofi_tuner component replaces NCCL's internal tuning mechanism with a custom tuner that uses a Hockney model (cost = latency + (size/bandwidth)) to compute the costs of the different algorithms and protocols with AWS-specific characteristics as inputs. With this tuner loaded, the plugin is able to select the right combination across message sizes and scales without needing empirical estimation to compute the switch points.
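To make the Hockney model concrete, here is a minimal sketch of how a cost of the form latency + size/bandwidth can drive the choice of an algorithm/protocol combination and yield the switch points analytically. The latency and bandwidth values, the combination names in the table, and the function names are all hypothetical placeholders chosen for illustration; they are not the coefficients or the code used by nccl_ofi_tuner, and the dependence on communicator size is left out for brevity.

```python
from typing import Dict, Tuple

Combo = Tuple[str, str]  # (algorithm, protocol)

# HYPOTHETICAL values for illustration only:
# combo -> (latency in microseconds, bandwidth in bytes per microsecond)
HYPOTHETICAL_PARAMS: Dict[Combo, Tuple[float, float]] = {
    ("tree", "LL"): (10.0, 1_000.0),
    ("tree", "Simple"): (30.0, 10_000.0),
    ("ring", "Simple"): (100.0, 20_000.0),
}


def hockney_cost(size_bytes: float, latency_us: float, bandwidth: float) -> float:
    """Hockney model: cost = latency + size / bandwidth."""
    return latency_us + size_bytes / bandwidth


def pick_combination(size_bytes: float,
                     params: Dict[Combo, Tuple[float, float]] = HYPOTHETICAL_PARAMS) -> Combo:
    """Return the combination with the lowest modeled cost for this message size."""
    return min(params, key=lambda combo: hockney_cost(size_bytes, *params[combo]))


def switch_point(a: Combo, b: Combo,
                 params: Dict[Combo, Tuple[float, float]] = HYPOTHETICAL_PARAMS) -> float:
    """Message size at which combos a and b cost the same.

    Solves lat_a + s / bw_a == lat_b + s / bw_b for s.
    Assumes the two bandwidths differ (otherwise the lines never cross).
    """
    lat_a, bw_a = params[a]
    lat_b, bw_b = params[b]
    return (lat_b - lat_a) / (1.0 / bw_a - 1.0 / bw_b)


if __name__ == "__main__":
    for size in (1024, 1024 ** 2, 64 * 1024 ** 2):
        print(size, pick_combination(size))
    print(switch_point(("tree", "LL"), ("tree", "Simple")))
```

Because each cost is a straight line in the message size, the crossover between two combinations has a closed form, which is why a model like this does not need measured switch points.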
As of NCCL v2.20.3, the tuner can be loaded by setting the following environment variable:
NCCL_TUNER_PLUGIN=$PATH_TO_AWS_OFI_NCCL_PLUGIN_INSTALL/lib/libnccl-ofi-tuner.so
To confirm that the tuner has been loaded, enable tuner logging (the entries are INFO-level, so NCCL_DEBUG=INFO must also be set):
NCCL_DEBUG_SUBSYS=INIT,TUNING
and verify you see the equivalent of the following entries in the log:
hostname:46348 [7] NCCL INFO NCCL_TUNER_PLUGIN set to /home/user/aws-ofi-nccl/install/libnccl-ofi-tuner.so
hostname:46279:46348 [7] NCCL INFO Opened tuner: 'nccl_ofi_tuner'
hostname:46279:46348 [7] NCCL INFO Using tuner plugin: 'nccl_ofi_tuner'
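When running many jobs, this check can be automated. The following is a rough sketch that scans captured log text for the "Using tuner plugin" entry shown above. The function name loaded_tuner is made up for this example, and the pattern is derived only from the sample lines in this section, so it may need adjusting if the wording differs in another NCCL version.

```python
import re
from typing import Optional

# Pattern based on the sample log lines in this section (an assumption:
# other NCCL versions may word this entry differently).
_TUNER_LINE = re.compile(r"NCCL INFO Using tuner plugin: '(?P<name>[^']+)'")


def loaded_tuner(log_text: str) -> Optional[str]:
    """Return the tuner name from the first 'Using tuner plugin' entry, or None."""
    match = _TUNER_LINE.search(log_text)
    return match.group("name") if match else None


if __name__ == "__main__":
    sample = (
        "hostname:46279:46348 [7] NCCL INFO Opened tuner: 'nccl_ofi_tuner'\n"
        "hostname:46279:46348 [7] NCCL INFO Using tuner plugin: 'nccl_ofi_tuner'\n"
    )
    print(loaded_tuner(sample))  # prints: nccl_ofi_tuner
```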