Qwen-72B on Ali EMR with eRDMA

EMR (5th Gen Intel Xeon, Emerald Rapids) instances are gradually being rolled out on Alibaba Cloud. At the EMR launch media event, Intel presented a live demo of distributed Qwen-72B inference with excellent on-site results. This BKM follows the same approach, 4 EMR instances + an eRDMA network, to guide you through building a distributed Qwen-72B inference demo with XFT (intel/xFasterTransformer).

XFT (xFasterTransformer) is a highly optimized solution for large language models (LLMs) on the x86 platform, similar to FasterTransformer on the GPU platform. xFasterTransformer can operate in distributed mode across multiple sockets and nodes to support inference on larger models. It also provides both C++ and Python APIs, spanning high-level to low-level interfaces, making it easy to adopt and integrate.
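
For a first taste of the high-level Python API, here is a minimal single-node generation sketch (the model and tokenizer paths are placeholders; the distributed 4-node run is the subject of the rest of this BKM):

$ python - <<'EOF'
# Minimal xFT generation sketch; the paths below are placeholders.
import xfastertransformer
from transformers import AutoTokenizer

TOKEN_PATH = "/mnt/Qwen-72B-Chat"      # HuggingFace checkpoint (tokenizer)
MODEL_PATH = "/mnt/Qwen-72B-Chat-xft"  # xFT-converted model directory

tokenizer = AutoTokenizer.from_pretrained(TOKEN_PATH, trust_remote_code=True)
model = xfastertransformer.AutoModel.from_pretrained(MODEL_PATH, dtype="bf16")

input_ids = tokenizer("Hello, my name is", return_tensors="pt").input_ids
generated_ids = model.generate(input_ids, max_length=64)
print(tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0])
EOF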

eRDMA (Elastic Remote Direct Memory Access) is Alibaba Cloud's self-developed elastic RDMA network for the cloud. The underlying link reuses the VPC network and employs a self-developed full-stack Congestion Control (CC) algorithm. While retaining the high-throughput, low-latency characteristics of traditional RDMA networks, eRDMA also supports large-scale RDMA networking that can be provisioned in seconds. It is compatible with traditional HPC applications as well as traditional TCP/IP applications.

Based on eRDMA, you can deploy HPC applications in the cloud and obtain a cost-effective, more elastic high-performance cluster, or replace the VPC network with an eRDMA network to accelerate your other applications.
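
Once the eRDMA driver is installed on an instance, the device should be visible through the standard rdma-core tools (the erdma_0 device name is the one used in the eadm command later in this BKM):

$ ibv_devices    # the eRDMA device, e.g. erdma_0, should be listed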

Hardware Configuration

Purchase cloud instances of the following configuration types on Alibaba Cloud:

All four instances share the same configuration and differ only in the instance name and IP address:

    user                        root
    passwd                      xxxxxx
    OS                          Alibaba Cloud Linux 3.2104 LTS 64-bit
    CPU Model                   INTEL(R) XEON(R) PLATINUM 8575C
    NUMA node0 CPU(s)           0-95
    disk                        200GB
    NFS-disk (mounted at /mnt)  1TB

    Instance |     name | NIC0 (eth0, with eRDMA capability)
           1 | xftest01 | 192.168.0.1
           2 | xftest02 | 192.168.0.2
           3 | xftest03 | 192.168.0.3
           4 | xftest04 | 192.168.0.4

Rank Topology

    ID |               Node name |             IP | Ranks
     0 |          EMR instance 0 |    192.168.0.1 | 0
     1 |          EMR instance 1 |    192.168.0.2 | 1
     2 |          EMR instance 2 |    192.168.0.3 | 2
     3 |          EMR instance 3 |    192.168.0.4 | 3

    ID |           0            1            2            3 
     0 |         shm        eRDMA        eRDMA        eRDMA 
     1 |       eRDMA          shm        eRDMA        eRDMA 
     2 |       eRDMA        eRDMA          shm        eRDMA 
     3 |       eRDMA        eRDMA        eRDMA          shm 
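The rank map and transport matrix above are what mpirun reports via its -print-rank-map and -prot options once all four nodes are reachable. A sketch of the corresponding 4-node probe, using the oneCCL benchmark built in a later section as the payload:

$ mpirun -print-rank-map -prot -n 4 -ppn 1 \
    -hosts 192.168.0.1,192.168.0.2,192.168.0.3,192.168.0.4 \
    /mnt/xFasterTransformer/3rdparty/oneccl/build/_install/examples/benchmark/benchmark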

Important Notes:

1. Use deployment sets (combining deployment-set strategies achieves lower eRDMA latency). Alibaba Cloud ECS provides deployment-set strategies that control the physical distribution of ECS instances. Deployment sets support several strategies:

  • High Availability Strategy: all ECS instances in the deployment set are strictly distributed across different physical servers within the specified region, ensuring high availability of the business on the ECS instances and disaster-recovery capability at the physical-server level.
  • Low Latency Strategy: all ECS instances in the deployment set are deployed as close together as possible within the same network topology of the availability zone, reducing network communication latency.

RDMA itself offers low latency and high throughput, but in practice it is also affected by the physical network distance: the farther apart the nodes, the higher the latency between them. On Alibaba Cloud, deployment-set strategies can therefore be combined with eRDMA-accelerated ECS to keep latency as low as possible, as sketched below.
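
As a sketch, a low-latency deployment set can be created with the Alibaba Cloud CLI and referenced when creating the ECS instances (the region, names, and IDs below are placeholders; verify the parameters against the current ECS API documentation):

$ aliyun ecs CreateDeploymentSet --RegionId cn-beijing --DeploymentSetName xft-demo --Strategy LowLatency
$ aliyun ecs RunInstances --RegionId cn-beijing --DeploymentSetId ds-xxxxxxxx ...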

2. Ensure optimal configuration for eRDMA:

  • Disable Delay ACK: currently this requires Alibaba Cloud support to modify the network parameters manually; once the new version is released, users will be able to configure 'Delay ACK' themselves with the eadm tool.
  • Enable the Congestion Control (CC) algorithm: run $ eadm conf -d erdma_0 -t cc -v 1

3. Ensure the VMs are distributed across different physical machines. Cloud instances do not have isolated memory-bandwidth resources; if several instances land on the same physical machine, they may contend for memory bandwidth.

4. Ensure a consistent code environment: the machines mount a shared volume at /mnt via NFS for file synchronization, for example as shown below.
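
For example, with an Alibaba Cloud NAS file system, the shared volume can be mounted on each node like this (the mount target is a placeholder):

$ mount -t nfs -o vers=3,nolock,proto=tcp,noresvport <mount-target>.nas.aliyuncs.com:/ /mnt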

Software Configuration

Basic Environment

# Initially, we need to perform some basic environment configuration, such as:
# 1. Install system dependencies;
# 2. Install Python environment dependencies;
# 3. Configure passwordless login between machines;
# ...

$ yum install -y htop tmux git git-lfs python38 python38-pip numactl numactl-devel

$ pip3.8 install --upgrade pip

$ cd /mnt/xFasterTransformer/
# install torch / transformers dependencies
$ pip3.8 install -r requirements.txt

# install distributed test dependencies.
#  (optional) install jpeg dependencies:
#  $ yum -y install libjpeg-turbo-devel
$ cd distributed/
$ pip3.8 install -r requirements.txt

# modify host configuration
# The default user is root; you can change it in ansible.cfg.
$ vim hosts
  [all_hosts:vars]
  ansible_ssh_pass=<server password> 
  
  # modify host IP address
  [all_hosts]
  192.168.0.1
  192.168.0.2
  192.168.0.3
  192.168.0.4

# replace the default python + pip version
$ ansible all_hosts -m shell -a "rm -rf /usr/bin/pip /usr/bin/python && ln -s /usr/bin/pip3.8 /usr/bin/pip && ln -s /usr/bin/python3.8 /usr/bin/python"

# use the ansible ping module to check connectivity.
$ ansible all_hosts -m ping
...
192.168.0.1 | SUCCESS => {
    "ansible_facts": {
        "discovered_interpreter_python": "/usr/bin/python3.6"
    },
    "changed": false,
    "ping": "pong"
}
...

# Configure passwordless login between machines
$ ansible-playbook 000-ssh-wo-passwd-update-hosts.xml

oneCCL Configuration

# Clone the code onto the NFS share so that the
# files are synchronized across all machines.
$ cd /mnt
$ git clone -b test/distributed https://github.com/intel/xFasterTransformer.git

$ cd /mnt/xFasterTransformer/

# Compile and install the oneCCL environment.
$ cd 3rdparty/
$ sh prepare_oneccl.sh

# Synchronize the multi-node environment configuration.
$ cd /mnt/xFasterTransformer/distributed/
$ ansible all_hosts -m shell -a "echo 'source /mnt/xFasterTransformer/3rdparty/oneccl/build/_install/env/setvars.sh' >> ~/.bashrc"

# Update environment variables, run benchmarks,
# and if the output is similar to the reference
# content below, then the oneCCL configuration is successful.
$ bash
$ cd /mnt/xFasterTransformer/3rdparty/oneccl/build/
$ mpirun -print-rank-map -prot -n 2 -ppn 1 -hosts 192.168.0.1,192.168.0.2 ./_install/examples/benchmark/benchmark
(192.168.0.1:0)
(192.168.0.2:1)

    ID |               Node name |             IP | Ranks
     0 | iZ2ze2krbyeyuro5gcr1uiZ |    192.168.0.1 | 0
     1 | iZ2zeb3qtvacqss8re3pcrZ |    192.168.0.2 | 1
    ID |             0             1 
     0 | verbs;ofi_rxm verbs;ofi_rxm 
     1 | verbs;ofi_rxm verbs;ofi_rxm 


options:
  processes:      2
  backend:        host
  iters:          16
  warmup_iters:   16
  iter_policy:    auto
  buf_count:      1
  min_elem_count: 1
  max_elem_count: 128
  elem_counts:    [1 2 4 8 16 32 64 128]
  check:          last
  cache:          1
  inplace:        0
  collectives:    allreduce 
  datatypes:      float32 
  reductions:     sum 
  extended info:  auto
  csv_filepath:   

datatype: float32
reduction: sum

#------------------------------------------------------------
# Benchmarking: allreduce 
# #processes: 2
#------------------------------------------------------------

        #bytes  #repetitions   t_min[usec]   t_max[usec]   t_avg[usec]  stddev[%]
             4            16         30.50         30.86         30.68       0.59
             8            16         29.70         29.77         29.73       0.11
            16            16         29.81         30.08         29.95       0.44
            32            16         30.25         30.98         30.62       1.20
            64            16         30.41         31.09         30.75       1.12
           128            16         30.11         31.14         30.62       1.68
           256            16         30.47         30.52         30.49       0.08
           512            16         30.28         30.41         30.34       0.21

# All done

XFT setup

$ cd /mnt/xFasterTransformer/

$ mkdir build && cd build

# Compile and install; if there are any issues, feel free to submit an issue.
# https://github.com/intel/xFasterTransformer/issues/new
# cmake -DPython_EXECUTABLE=/usr/bin/python3.8 ..
$ cmake .. && make -j

Qwen model download
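
As a sketch, the weights can be pulled onto the NFS share with git-lfs and converted into the xFT model format with the convert tool from the xFT Python package (the checkpoint repository and output path below are placeholders):

$ cd /mnt
$ git lfs install
$ git clone https://huggingface.co/Qwen/Qwen-72B-Chat
# convert the HuggingFace checkpoint into the xFT model format
$ python -c 'import xfastertransformer as xft; xft.QwenConvert().convert("/mnt/Qwen-72B-Chat", "/mnt/Qwen-72B-Chat-xft")'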

Demo setup (WIP)

Qwen 72B distributed inference

$ cd /mnt/xFasterTransformer/distributed/
# Modify the distributed script:
#   1. ############# HW configuration #############
#        a. Set IFACE to the designated NIC;
#        b. Set IP_A/IP_B/IP_C/...;
#        c. When testing on the cloud: export is_ali_cloud=1;
#        d. Choose either TCP or eRDMA: uncomment the corresponding
#           "# enable TCP/eRDMA" line;
#   2. ############# XFT configuration #############
#        a. XFT_COMM_TIME=1: log the time of each allreduce call;
#        b. XFT_FAKE_MODEL=1: if the model files are not in place, a fake
#           model can be used; its output is meaningless and is intended
#           for performance testing only;
$ vim run_benchmark.sh
$ bash run_benchmark.sh
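
For reference, the edited HW-configuration block of run_benchmark.sh might look like this for the cluster above (IP_D for the fourth node is an assumed variable name, following the IP_A/IP_B/IP_C pattern in the script's comments):

IFACE=eth0             # eRDMA-capable NIC
IP_A=192.168.0.1
IP_B=192.168.0.2
IP_C=192.168.0.3
IP_D=192.168.0.4
export is_ali_cloud=1  # testing on Alibaba Cloud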
