Flumen Equi - A Characterization of Network Requirements for Distributed Machine Learning in the Cloud
This repository is named "Horses River" in Latin, an allusion to the large number of packets, flows, and processes that make up the communication stream of distributed machine learning training. It contains the results of my characterization of how Distributed Machine Learning (DL) training responds to competing 'cross' traffic.
This repository is broken down into three key sections:
- The processed datasets, which contain the figures from my thesis together with the data and final processing details behind them, plus *.csv files of the data for convenient reuse. Additional information on datasets.
- The code used to perform DL training and create cross traffic, as well as the processing code required to interpret the raw outputs.
- Run scripts for the general case and notes on how to run each type of experiment I have performed.
I use two toy network topologies to collect the data in the datasets folder: A, with egress as the bottleneck, and B, with ingress as the bottleneck, as shown in the Topology figure below. The majority of the data is captured using configuration B, which ensures that the bottleneck sits on the switch (network) side rather than on a server's NIC. This is an important consideration for replicating these results.
I captured these results on a cluster of four ASUS ESC4000A-E10 servers. Each is fitted with an AMD EPYC 7502P 32-core processor and an NVIDIA A100 accelerator with 40 GB of HBM2e memory. Networking is provided by a Mellanox ConnectX-5 100 Gbps NIC on each server, directly connected to a Juniper MX480 SDN switch. The switch is configured in 25 Gbps mode for this cluster, which means each NIC also reports the available link speed as 25 Gbps. I made this choice to explore the impact of high network congestion between reasonably sized DL loads and cross traffic at a significant link capacity, without having to account for additional bottlenecks or confounding factors present at the full 100 Gbps system capability. I used Ubuntu 18.04, CUDA v11.1, cuDNN v8.0.5, NCCL v2.7.8, Horovod v0.21.3, PyTorch v1.7.1, and TensorFlow v2.4.1.
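For anyone replicating this setup, a quick pre-flight check on each server can confirm that the NIC reports the 25 Gbps link speed and that the Python packages match the versions listed above. The following is a minimal sketch and not part of this repository's scripts. It assumes a Linux host (where sysfs exposes the link speed in Mb/s), Python 3.8 or newer for `importlib.metadata`, and that the packages were installed under the PyPI distribution names `horovod`, `torch`, and `tensorflow`. The function names are hypothetical.

```python
from importlib import metadata
from pathlib import Path

# Python package versions taken from the list above.
# Assumption: these are the PyPI distribution names used at install time.
EXPECTED_VERSIONS = {"horovod": "0.21.3", "torch": "1.7.1", "tensorflow": "2.4.1"}
EXPECTED_LINK_GBPS = 25.0


def link_speed_gbps(iface, sysfs_root="/sys/class/net"):
    """Return the link speed the NIC reports for `iface`, in Gbps.

    Linux exposes the value in Mb/s in <sysfs_root>/<iface>/speed.
    The kernel may report -1 (or refuse the read) when the link is down.
    """
    raw = Path(sysfs_root, iface, "speed").read_text().strip()
    return int(raw) / 1000.0


def installed_versions(names):
    """Map each distribution name to its installed version, or None if absent."""
    found = {}
    for name in names:
        try:
            found[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            found[name] = None
    return found


def _base(version):
    """Drop a local build suffix such as '+cu110' before comparing."""
    return None if version is None else version.split("+")[0]


def version_mismatches(expected, actual):
    """Return {name: (expected, actual)} for every package that differs."""
    return {
        name: (want, actual.get(name))
        for name, want in expected.items()
        if _base(actual.get(name)) != want
    }

# Example usage (the interface name below is hypothetical):
#   link_speed_gbps("enp1s0f0") == EXPECTED_LINK_GBPS
#   version_mismatches(EXPECTED_VERSIONS, installed_versions(EXPECTED_VERSIONS))
```

An empty dictionary from `version_mismatches` means the three Python packages match. CUDA, cuDNN, and NCCL are not covered by this sketch, since they are not installed as Python distributions.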
This research was sponsored in part by MIT's Google Cloud Engine Program, the United States Air Force Research Laboratory and the United States Air Force Artificial Intelligence Accelerator and was accomplished under Cooperative Agreement Number FA8750-19-2-1000. The views and conclusions contained in this document are those of the authors and should not be interpreted as representing the official policies, either expressed or implied, of the United States Air Force or the U.S. Government. The U.S. Government is authorized to reproduce and distribute reprints for Government purposes notwithstanding any copyright notation herein.