Top500 Benchmark - HPL Linpack

A common generic benchmark for clusters (or extremely powerful single-node workstations) is Linpack, or HPL (High Performance Linpack), which is famous for its use in ranking the Top500 supercomputer list over the past few decades.

The benchmark solves a random dense linear system in double-precision (64 bits / FP64) arithmetic (source).
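
For context, the Gflops figure HPL prints is an operation count divided by wall-clock time. The sketch below uses the count found in HPL's reporting code, (2/3)N^3 + (3/2)N^2 floating-point operations for an N x N system. The function name and the sample values are illustrative assumptions, not output from a real run:

```python
def hpl_gflops(n: int, seconds: float) -> float:
    """Gflops for an n x n HPL solve that took `seconds` of wall-clock time."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    # Operation count for LU factorization plus the triangular solves.
    flops = (2.0 / 3.0) * n**3 + (3.0 / 2.0) * n**2
    return flops / seconds / 1e9


if __name__ == "__main__":
    # Hypothetical run: N = 30000 solved in 600 seconds.
    print(f"{hpl_gflops(30000, 600.0):.2f} Gflops")
```

Because the cubic term dominates, doubling N multiplies the work by roughly eight, which is why larger memory (and so a larger N) tends to produce longer runs.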

I wanted to see where my various clusters and workstations would rank, historically (you can compare to past lists here), so I built this Ansible playbook which installs all the necessary tooling for HPL to run, connects all the nodes together via SSH, then runs the benchmark and outputs the result.

Why not PTS?

Phoronix Test Suite includes HPL Linpack and HPCC test suites. I may see how they compare in the future.

When I initially started down this journey, the PTS versions didn't play nicely with the Pi, especially when clustered. And the PTS versions don't seem to support clustered usage at all!

Supported OSes

Currently supported OSes:

  • Ubuntu (20.04+)
  • Raspberry Pi OS (11+)
  • Debian (11+)
  • Rocky Linux (9+)
  • AlmaLinux (9+)
  • CentOS Stream (9+)
  • RHEL (9+)
  • Fedora (38+)
  • Arch Linux
  • Manjaro

Other OSes may need a few tweaks to work correctly. You can also run the playbook inside Docker (see the note under 'Benchmarking - Single Node'), but performance will be artificially limited.

Benchmarking - Cluster

Make sure you have Ansible installed (pip3 install ansible), then copy the following files:

  • cp example.hosts.ini hosts.ini: This is an inventory of all the hosts in your cluster (or just a single computer).
  • cp example.config.yml config.yml: This has some configuration options you may need to override, especially the ssh_* and ram_in_gb options (depending on your cluster layout).

Each host should be reachable via SSH using the username set in ansible_user. Other Ansible options can be set under [cluster:vars] to connect in more exotic clustering scenarios (e.g. via bastion/jump-host).

Tweak other settings inside config.yml as desired. The most important is hpl_root, which is where the compiled MPI, ATLAS/OpenBLAS/BLIS, and HPL benchmarking code will live.

Note: The names of the nodes inside hosts.ini must match the hostname of their corresponding node; otherwise, the benchmark will hang when you try to run it in a cluster.

For example, if you have node-01.local in your hosts.ini, that host's hostname should be node-01, and not something else like raspberry-pi.
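
One way to catch a mismatch before a run hangs is to compare each inventory name against the hostname the node reports. This is a minimal sketch and not part of the playbook: the helper names are hypothetical, and it assumes inventory entries are names (like node-01.local) rather than IP addresses.

```python
import socket


def short_name(name: str) -> str:
    """Drop any domain suffix: 'node-01.local' becomes 'node-01'."""
    return name.split(".", 1)[0]


def matches_hostname(inventory_name: str, hostname: str) -> bool:
    """True when the inventory entry and the hostname share a short name."""
    return short_name(inventory_name).lower() == short_name(hostname).lower()


if __name__ == "__main__":
    entry = "node-01.local"  # hypothetical inventory entry for this node
    local = socket.gethostname()
    status = "OK" if matches_hostname(entry, local) else "MISMATCH"
    print(f"{status}: inventory={entry} hostname={local}")
```

Run it on each node with that node's own inventory entry; a MISMATCH line points at the host to rename (or the inventory line to fix).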

If you're testing with .local domains on Ubuntu, and local mDNS resolution isn't working, consider installing the avahi-daemon package:

sudo apt-get install avahi-daemon

Then run the benchmarking playbook inside this directory:

ansible-playbook main.yml

This will run three separate plays:

  1. Setup: downloads and compiles all the code required to run HPL. (This play takes a long time, potentially many hours on a slower Raspberry Pi!)
  2. SSH: configures the nodes to be able to communicate with each other.
  3. Benchmark: creates an HPL.dat file and runs the benchmark, outputting the results in your console.

After the entire playbook is complete, you can also log directly into any of the nodes (though I generally do things on node 1), and run the following commands to kick off a benchmarking run:

cd ~/tmp/hpl-2.3/bin/top500
mpirun -f cluster-hosts ./xhpl

The configuration here was tested on smaller 1-, 4-, and 6-node clusters with 6-64 GB of RAM. Some settings in the config.yml file that affect the generated HPL.dat file may need different tuning for different cluster layouts!
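
If you end up tuning by hand, two widely used HPL rules of thumb are a helpful starting point: size the matrix to fill most of the cluster's memory, and pick a process grid P x Q that is close to square with P no larger than Q. The sketch below illustrates those rules only; it is not necessarily how the playbook derives its values. It assumes ram_in_gb is per node, and the nb block size and 80% memory fraction are illustrative defaults.

```python
import math


def hpl_problem_size(ram_in_gb: float, nodes: int = 1,
                     nb: int = 256, fraction: float = 0.8) -> int:
    """Largest N (a multiple of nb) whose N x N FP64 matrix fits in
    `fraction` of total memory. Assumes ram_in_gb is per node."""
    total_bytes = ram_in_gb * nodes * 1024**3
    elements = int(total_bytes * fraction / 8)  # 8 bytes per FP64 value
    n = math.isqrt(elements)
    return (n // nb) * nb  # round down to a whole number of blocks


def process_grid(ranks: int) -> tuple[int, int]:
    """Most nearly square (P, Q) with P <= Q and P * Q == ranks."""
    p = math.isqrt(ranks)
    while ranks % p:
        p -= 1
    return p, ranks // p


if __name__ == "__main__":
    # Hypothetical cluster: 4 nodes, 8 GB each, 4 MPI ranks per node.
    print("N =", hpl_problem_size(8, nodes=4))
    print("P x Q =", process_grid(16))
```

Leaving some memory free matters: if the matrix pushes a node into swap, the result drops sharply, so it is safer to start with a lower fraction and work upward.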

Benchmarking - Single Node

To run locally on a single node, clone or download this repository to the node where you want to run HPL. Make sure the hosts.ini is set up with the default options (with just one node, 127.0.0.1).

Copy the default configuration from example.config.yml to a config.yml file; all the variables should scale dynamically for your node.

Run the following command so the cluster networking portion of the playbook is not run:

ansible-playbook main.yml --tags "setup,benchmark"

For testing, you can start an Ubuntu Docker container:

docker run --name top500 -it -v $PWD:/code geerlingguy/docker-ubuntu2404-ansible:latest bash

Then go into the code directory (cd /code) and run the playbook using the command above.

Setting performance CPU frequency

If you get an error like CPU Throttling apparently enabled!, you may need to set the CPU frequency to performance (and disable any throttling or performance scaling).

For different OSes and different CPU types, the way you do this could be different. So far the automated performance setting in the main.yml playbook has only been tested on Raspberry Pi OS. You may need to look up how to disable throttling on your own system. Do that, then run the main.yml playbook again.
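
On Linux systems that expose the standard cpufreq sysfs interface, you can check which governor each core is using before rerunning the playbook. This is a read-only sketch with hypothetical helper names; it does not change any settings (writing a governor requires root), and some systems, containers included, expose no cpufreq files at all.

```python
from pathlib import Path


def read_governors(cpu_root: str = "/sys/devices/system/cpu") -> dict[str, str]:
    """Map each CPU directory name (e.g. 'cpu0') to its scaling governor."""
    governors = {}
    pattern = "cpu[0-9]*/cpufreq/scaling_governor"
    for path in sorted(Path(cpu_root).glob(pattern)):
        cpu = path.parent.parent.name
        governors[cpu] = path.read_text().strip()
    return governors


def not_in_performance_mode(governors: dict[str, str]) -> list[str]:
    """CPUs whose governor is something other than 'performance'."""
    return [cpu for cpu, gov in governors.items() if gov != "performance"]


if __name__ == "__main__":
    found = read_governors()
    if not found:
        print("No cpufreq information found on this system.")
    else:
        slow = not_in_performance_mode(found)
        print("All cores set to performance." if not slow
              else f"Not in performance mode: {', '.join(slow)}")
```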

Overclocking

Since I originally built this project for a Raspberry Pi cluster, I include a playbook to set an overclock for all the Raspberry Pis in a given cluster.

You can set a clock speed by changing pi_arm_freq in the overclock-pi.yml playbook, then run the playbook with:

ansible-playbook overclock-pi.yml

Higher clock speeds require more power and thus more cooling, so if you are running a Pi cluster with just heatsinks, you may also require a fan blowing over them if running overclocked.

Results

Here are a few of the results I've acquired in my testing (sorted by efficiency, highest to lowest):

| Configuration | Architecture | Result | Wattage | Gflops/W |
| --- | --- | --- | --- | --- |
| M4 Mac mini (1x M4 @ 4.4 GHz, in Docker) | Arm | 299.93 Gflops | 39.6W | 7.57 Gflops/W |
| Radxa CM5 (RK3588S2 8-core) | Arm | 48.619 Gflops | 10W | 4.86 Gflops/W |
| AmpereOne A192-26X @ 2.6 GHz | Arm | 2,745.1 Gflops | 570W | 4.82 Gflops/W |
| Ampere Altra Q64-22 @ 2.2 GHz | Arm | 655.90 Gflops | 140W | 4.69 Gflops/W |
| Orange Pi 5 (RK3588S 8-core) | Arm | 53.333 Gflops | 11.5W | 4.64 Gflops/W |
| AmpereOne A192-32X @ 3.2 GHz | Arm | 3,026.9 Gflops | 692W | 4.37 Gflops/W |
| Radxa ROCK 5B (RK3588 8-core) | Arm | 51.382 Gflops | 12W | 4.32 Gflops/W |
| Ampere Altra Max M128-28 @ 2.8 GHz | Arm | 1,265.5 Gflops | 296W | 4.27 Gflops/W |
| Orange Pi 5 Max (RK3588 8-core) | Arm | 52.924 Gflops | 12.8W | 4.13 Gflops/W |
| Radxa ROCK 5C (RK3588S2 8-core) | Arm | 49.285 Gflops | 12W | 4.11 Gflops/W |
| Ampere Altra Max M96-28 @ 2.8 GHz | Arm | 1,188.3 Gflops | 295W | 4.01 Gflops/W |
| M1 Max Mac Studio (1x M1 Max @ 3.2 GHz, in Docker) | Arm | 264.32 Gflops | 66W | 4.00 Gflops/W |
| Ampere Altra Max M128-30 @ 3.0 GHz | Arm | 1,652.4 Gflops | 440W | 3.76 Gflops/W |
| Raspberry Pi CM5 (BCM2712 @ 2.4 GHz) | Arm | 32.152 Gflops | 9.2W | 3.49 Gflops/W |
| Ampere Altra Q32-17 @ 1.7 GHz | Arm | 332.07 Gflops | 100W | 3.32 Gflops/W |
| Turing Machines RK1 (RK3588 8-core) | Arm | 59.810 Gflops | 18.1W | 3.30 Gflops/W |
| Raspberry Pi 500 (BCM2712 @ 2.4 GHz) | Arm | 35.586 Gflops | 11W | 3.24 Gflops/W |
| Turing Pi 2 (4x RK1 @ 2.4 GHz) | Arm | 224.60 Gflops | 73W | 3.08 Gflops/W |
| Raspberry Pi 5 (BCM2712 @ 2.4 GHz) | Arm | 35.169 Gflops | 12.7W | 2.77 Gflops/W |
| LattePanda Mu (1x N100 @ 3.4 GHz) | x86 | 62.851 Gflops | 25W | 2.51 Gflops/W |
| Radxa X4 (1x N100 @ 3.4 GHz) | x86 | 37.224 Gflops | 16W | 2.33 Gflops/W |
| Raspberry Pi CM4 (BCM2711 @ 1.5 GHz) | Arm | 11.433 Gflops | 5.2W | 2.20 Gflops/W |
| Ampere Altra Max M128-30 @ 3.0 GHz | Arm | 953.47 Gflops | 500W | 1.91 Gflops/W |
| Turing Pi 2 (4x CM4 @ 1.5 GHz) | Arm | 44.942 Gflops | 24.5W | 1.83 Gflops/W |
| Lenovo M710q Tiny (1x i5-7400T @ 2.4 GHz) | x86 | 72.472 Gflops | 41W | 1.76 Gflops/W |
| Raspberry Pi 400 (BCM2711 @ 1.8 GHz) | Arm | 11.077 Gflops | 6.4W | 1.73 Gflops/W |
| Raspberry Pi 4 (BCM2711 @ 1.8 GHz) | Arm | 11.889 Gflops | 7.2W | 1.65 Gflops/W |
| Turing Pi 2 (4x CM4 @ 2.0 GHz) | Arm | 51.327 Gflops | 33W | 1.54 Gflops/W |
| DeskPi Super6c (6x CM4 @ 1.5 GHz) | Arm | 60.293 Gflops | 40W | 1.50 Gflops/W |
| Orange Pi CM4 (RK3566 4-core) | Arm | 5.604 Gflops | 4.0W | 1.40 Gflops/W |
| DeskPi Super6c (6x CM4 @ 2.0 GHz) | Arm | 70.338 Gflops | 51W | 1.38 Gflops/W |
| AMD Ryzen 5 5600x @ 3.7 GHz | x86 | 229 Gflops | 196W | 1.16 Gflops/W |
| Milk-V Mars CM (JH7110 4-core) | RISC-V | 1.99 Gflops | 3.6W | 0.55 Gflops/W |
| Lichee Console 4A (TH1520 4-core) | RISC-V | 1.99 Gflops | 3.6W | 0.55 Gflops/W |
| Milk-V Jupiter (SpacemiT X60 8-core) | RISC-V | 5.66 Gflops | 10.6W | 0.55 Gflops/W |
| Sipeed Lichee Pi 3A (SpacemiT K1 8-core) | RISC-V | 4.95 Gflops | 9.1W | 0.54 Gflops/W |
| Milk-V Mars (JH7110 4-core) | RISC-V | 2.06 Gflops | 4.7W | 0.44 Gflops/W |
| Raspberry Pi Zero 2 W (RP3A0-AU @ 1.0 GHz) | Arm | 0.370 Gflops | 2.1W | 0.18 Gflops/W |
| M2 Pro MacBook Pro (1x M2 Pro, in Asahi Linux) | Arm | 296.93 Gflops | N/A | N/A |
| M2 MacBook Air (1x M2 @ 3.5 GHz, in Docker) | Arm | 104.68 Gflops | N/A | N/A |

You can enter a Gflops result in this tool to see how it compares to historical Top500 lists.

Note: My current calculation for efficiency uses the average power draw over the course of the benchmark, measured with either a Kill-A-Watt (pre-2024 tests) or a ThirdReality Smart Outlet monitor. The efficiency calculations may vary depending on the specific system under test.
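
The efficiency column is simply the HPL result divided by the average wall power. A minimal sketch of that arithmetic follows; the function names are illustrative, and the power samples in the usage example are made up to show the averaging step, while the 35.169 Gflops and 12.7W figures come from the Raspberry Pi 5 row in the results table.

```python
def average_watts(samples: list[float]) -> float:
    """Mean of power readings taken while the benchmark runs."""
    if not samples:
        raise ValueError("need at least one power sample")
    return sum(samples) / len(samples)


def gflops_per_watt(gflops: float, avg_watts: float) -> float:
    """Efficiency as reported in the results table."""
    if avg_watts <= 0:
        raise ValueError("avg_watts must be positive")
    return gflops / avg_watts


if __name__ == "__main__":
    # Hypothetical readings that average out to about 12.7 W.
    watts = average_watts([12.5, 12.9, 12.7])
    print(f"{gflops_per_watt(35.169, watts):.2f} Gflops/W")
```

Idle or peak power would give a different figure, so results are only comparable when the wattage is averaged over the whole run.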

Other Listings

Over the years, as I find other people's listings of HPL results—especially those with power usage ratings—I will add them here: