Algorithm and Protocol Tuner for AWS

Raghu Raja edited this page Feb 19, 2024 · 8 revisions

With v1.8.0-aws, a new tuner component was added to the plugin. Currently, this tuner plugin is built only for the AWS platform.

NCCL computes the cost of each algorithm/protocol combination from three inputs: empirically derived costs for different aspects of data movement, the latency and bandwidth reported by the network plugin in use, and other algorithm-specific coefficients. At runtime, it uses these costs to pick an algorithm and protocol based on the size of the communicator and the message size of the operation. The default coefficients and empirically derived costs are not optimal for AWS's network.

The nccl_ofi_tuner component replaces NCCL's internal tuning mechanism with a custom tuner that uses a Hockney model (cost = latency + size / bandwidth) to compute the cost of each algorithm and protocol, with AWS-specific characteristics as inputs. With this tuner loaded, the plugin can select the right combination across message sizes and scales without relying on empirical estimation to find the switch points.
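As a rough illustration of how a Hockney model yields a selection and a switch point, the sketch below compares two hypothetical candidates. The candidate names, latency values, and bandwidth values are invented for this example and are not the coefficients used by nccl_ofi_tuner; the real tuner is implemented inside the plugin, not in Python.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    """A hypothetical algorithm/protocol combination (illustrative only)."""
    name: str
    latency_us: float              # fixed startup cost, in microseconds
    bandwidth_bytes_per_us: float  # sustained throughput, in bytes per microsecond


def hockney_cost(c: Candidate, size_bytes: float) -> float:
    """Hockney model: cost = latency + size / bandwidth."""
    return c.latency_us + size_bytes / c.bandwidth_bytes_per_us


def pick(candidates, size_bytes):
    """Return the candidate with the lowest modeled cost for this message size."""
    return min(candidates, key=lambda c: hockney_cost(c, size_bytes))


def switch_point(a: Candidate, b: Candidate):
    """Message size at which a and b cost the same, or None if none exists.

    Solves a.latency + s / a.bw == b.latency + s / b.bw for s.
    """
    inv_diff = 1.0 / a.bandwidth_bytes_per_us - 1.0 / b.bandwidth_bytes_per_us
    if inv_diff == 0:
        return None  # equal bandwidth: the cost lines are parallel
    s = (b.latency_us - a.latency_us) / inv_diff
    return s if s > 0 else None


# Made-up numbers: one candidate starts quickly, the other sustains more throughput.
LOW_LATENCY = Candidate("low_latency", latency_us=10.0, bandwidth_bytes_per_us=1_000.0)
HIGH_BANDWIDTH = Candidate("high_bandwidth", latency_us=50.0, bandwidth_bytes_per_us=10_000.0)

if __name__ == "__main__":
    for size in (1024, 1024 * 1024):
        best = pick([LOW_LATENCY, HIGH_BANDWIDTH], size)
        print(f"{size} bytes -> {best.name}")
    print("switch point (bytes):", switch_point(LOW_LATENCY, HIGH_BANDWIDTH))
```

With these made-up inputs, the low-latency candidate wins for small messages and the high-bandwidth candidate wins for large ones. The crossover size is computed from the model rather than measured, which is the sense in which no empirical estimation of switch points is needed.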

As of NCCL v2.20.3, the tuner can be loaded by setting the following environment variable:

NCCL_TUNER_PLUGIN=$PATH_TO_AWS_OFI_NCCL_PLUGIN_INSTALL/lib/libnccl-ofi-tuner.so

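To confirm that the tuner has been loaded, enable logging for the INIT and TUNING subsystems (these messages are emitted at the INFO level, so NCCL_DEBUG=INFO must also be set):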

NCCL_DEBUG_SUBSYS=INIT,TUNING

and verify you see the equivalent of the following entries in the log:

hostname:46348 [7] NCCL INFO NCCL_TUNER_PLUGIN set to /home/user/aws-ofi-nccl/install/libnccl-ofi-tuner.so
hostname:46279:46348 [7] NCCL INFO Opened tuner: 'nccl_ofi_tuner'
hostname:46279:46348 [7] NCCL INFO Using tuner plugin: 'nccl_ofi_tuner'
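For multi-node jobs with large logs, the check can be scripted. The following is a minimal sketch that scans captured NCCL output for the "Using tuner plugin" entry. It assumes the log lines match the sample entries above; the exact wording may differ between NCCL versions, and the helper name `loaded_tuners` is hypothetical.

```python
import re

# Pattern taken from the sample log entries above; this wording is an
# assumption and may not hold for every NCCL version.
TUNER_LINE = re.compile(r"NCCL INFO Using tuner plugin: '([^']+)'")


def loaded_tuners(log_text: str) -> list:
    """Return the sorted, de-duplicated tuner names reported as in use."""
    return sorted(set(TUNER_LINE.findall(log_text)))


SAMPLE_LOG = """\
hostname:46348 [7] NCCL INFO NCCL_TUNER_PLUGIN set to /home/user/aws-ofi-nccl/install/libnccl-ofi-tuner.so
hostname:46279:46348 [7] NCCL INFO Opened tuner: 'nccl_ofi_tuner'
hostname:46279:46348 [7] NCCL INFO Using tuner plugin: 'nccl_ofi_tuner'
"""

if __name__ == "__main__":
    tuners = loaded_tuners(SAMPLE_LOG)
    if tuners:
        print("tuner(s) in use:", ", ".join(tuners))
    else:
        print("no tuner plugin reported; check NCCL_TUNER_PLUGIN and the debug settings")
```

An empty result means only that no matching line was found, which can also happen if debug logging was not enabled for the run.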