
Benchmarking on Komondor (SLURM HPC)

Important

This is not a guide on how to get started with either generic SLURM scheduling or Benchexec benchmarking. For the former, see this quick start guide or the Komondor documentation; for the latter, see this quick start guide. This guide covers setting up an environment on Komondor that can use Benchexec for benchmarking, primarily for verification workflows.

Overview

At the low level, benchmarking with Benchexec usually boils down to running runexec, which enforces resource limits and performs the measurements. Most features of runexec require cgroups. Furthermore, benchmarked tools (such as Theta) often have specific requirements on the operating system and installed packages (documented here). Therefore, we containerize the tasks using Singularity.
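
For intuition, a single measurement inside the container roughly takes the following shape. This is an illustrative sketch only: slurm-benchmark.py generates the real invocations, the limits and output file name are placeholders, the tool command and task file stand in for whatever the benchmark definition specifies, and svcomp.sif refers to the image built in the Containers section below.

singularity exec svcomp.sif \
 runexec --timelimit 900 --memlimit 8GB --output run.log -- \
 <tool-command> <task-file>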

Benchexec collects the tasks of a benchmark and submits them as an array job. The user can aggregate multiple tasks per job, parallelize tasks within a job, and restart failed measurements. It is useful to keep the number of jobs to a minimum, but this comes at the cost of suboptimal scheduling: if we started a separate job for every task, the overall walltime would be the smallest, whereas with N tasks per job, an unfortunate grouping of N slow tasks in the same job can hold up the entire run.
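
As a rough illustration of how these knobs interact (the numbers are made up for the example, not recommendations):

# 1024 tasks with an aggregation factor of 32 yield 1024 / 32 = 32 array jobs;
# with a parallelization of 4, each job works through its 32 tasks in 8 rounds of 4.
TASKS=1024; AGG=32; PAR=4
echo "array jobs:     $(( (TASKS + AGG - 1) / AGG ))"
echo "rounds per job: $(( (AGG + PAR - 1) / PAR ))"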

Simple Benchmarking Setup

To start a simple benchmark, we'll need a tool, some data, and a benchmark definition file. Then we'll pull Benchexec, configure the containers, and finally run the benchmark.

Data

wget https://github.com/ftsrg/theta/releases/download/v6.8.6/Theta.zip
unzip Theta.zip
git clone https://gitlab.com/sosy-lab/benchmarking/sv-benchmarks.git
wget --directory-prefix=xmls https://gitlab.com/sosy-lab/sv-comp/bench-defs/-/raw/main/benchmark-defs/theta.xml 
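
After these commands, a quick sanity check of the working directory can help (the exact sv-benchmarks layout may differ from what is shown here):

ls Theta/            # the unpacked tool, later passed via --tool-directory
ls sv-benchmarks/    # the verification tasks referenced by the benchmark definition
head xmls/theta.xml  # the benchmark definition itself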

Benchexec

We also need the version of Benchexec that supports SLURM Array Jobs (not yet upstreamed at the time of writing, but planned):

git clone https://github.com/leventeBajczi/benchexec -b slurm-array-aggregate benchexec
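
Running the script directly from the clone needs no installation, but depending on the Python setup of the login node it may help to make the cloned sources importable first (this is an assumption about your environment, not a required step):

export PYTHONPATH="$PWD/benchexec:$PYTHONPATH"   # prefer the cloned sources
python3.11 -c 'import benchexec; print(benchexec.__version__)'   # quick import check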

Containers

We can now collect the packages required by the tool and by runexec (benchexec), and create a Singularity container definition file (e.g., svcomp.def):

BootStrap: docker
From: ubuntu:24.04

%post
apt -y update
apt -y install libgomp1 libmpfr-dev openjdk-17-jre-headless # tool dependencies
apt -y install software-properties-common # down from here everything is mandatory
add-apt-repository ppa:sosy-lab/benchmarking
apt -y install benchexec fuse-overlayfs
mkdir /overlay

Then we need to build a SIF file. However, building locally requires privileges we don't have on Komondor, so we make use of a remote builder service:

singularity remote login

Follow the instructions (you will need a user on the builder service), then:

singularity build --remote --fix-perms svcomp.sif svcomp.def
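
An optional sanity check that the image works and runexec is available inside it:

singularity exec svcomp.sif runexec --version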

Benchmarking

Now we can start the benchmark:

python3.11 ./benchexec/contrib/slurm-benchmark.py \
 --no-container \
 --slurm-array \
 --singularity ./svcomp.sif \
 --no-hyperthreading \
 --tool-directory Theta \
 -N 10 \
 -o svcomp25/ \
 --aggregation-factor 32 \
 --parallelization 4 \
 --batch-size 5000 \
 xmls/theta.xml \
 --retry-killed 10

The options mean the following:

  1. --no-container: we don't want to nest containers
  2. --slurm-array: activates the SLURM executor
  3. --singularity ./svcomp.sif: the container to use (relative path)
  4. --no-hyperthreading: hyperthreading is not robustly supported
  5. --tool-directory Theta: the tool directory to use
  6. -N 10: at most this many jobs at the same time
  7. -o svcomp25/: the output directory (has to end with a /, otherwise it is only used as a filename prefix)
  8. --aggregation-factor 32: how many tasks to aggregate into one job
  9. --parallelization 4: how many tasks to run in parallel within a job
  10. --batch-size 5000: how many tasks to handle at the same time (if too many, the array script can become too big for SLURM)
  11. xmls/theta.xml: the benchmark definition to run
  12. --retry-killed 10: how many times to retry a task when it fails due to the infrastructure

This will first parse the benchmark definition, query the version of the tool, query some representative information about the operating system, then do the following:

  1. For an (at most) --batch-size part of the tasks, it creates a temporary directory with the following structure:
.
├── 0/
├── 1/
....
├── <batch_size>/ # <batch_size> many folders, these will hold the outputs
├── bin0.tasks
├── bin1.tasks
....
├── binX.tasks # tasks per job, each holding <aggregation_factor> many tasks
├── logs/ # log folder, one file per job
└── array.sbatch # batch script
  2. It starts the array.sbatch script and waits for it to end. The jobs of the array take their input files and parameters from their respective binX.tasks files, and populate the {0..<batch_size>} folders with output artifacts. The tool output will be in output.log, and the runexec output will be in log.

  3. Benchexec then parses the outputs, sets the verdicts, and returns the result (or continues with the next batch).

  4. The temporary directory is deleted. Results are written to the specified output directory.
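
Once the result .xml(.bz2) files and log archives are in the output directory, they can be post-processed as usual with Benchexec's table-generator. Run from the cloned sources, this might look as follows (the glob is just an example; adjust it to your output directory):

python3.11 ./benchexec/bin/table-generator svcomp25/*.xml.bz2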

If a run was aborted (say, the login node was restarted suddenly), zip up the half-finished .logfiles and .files directories (pay attention to keep the prefixes!). Then, if necessary, fix half-finished ZIP files by running 7za x <zip> && zip <zip> <folder> -rm -q (7za can extract archives that unzip cannot, due to missing end-of-directory signatures). Finally, re-run benchexec with the same parameters and output directory, but add --continue-interrupted at the end. This will find the latest logfiles.zip and attempt to recover results from it, from the corresponding .xml(.bz2) files, and from the files.zip archive. A sketch of this sequence is shown below.
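
For concreteness, the recovery might look like the following (the run prefix is hypothetical; substitute the one Benchexec actually produced in your output directory):

cd svcomp25/                                    # the output directory of the interrupted run
RUN=theta.2024-12-07_12-00-00                   # hypothetical result prefix; check what is actually there
zip -r -q "$RUN.logfiles.zip" "$RUN.logfiles"   # archive the half-finished directories, keeping the prefix
zip -r -q "$RUN.files.zip" "$RUN.files"
# if an archive was left half-written, repair it as described above:
7za x "$RUN.logfiles.zip" && zip "$RUN.logfiles.zip" "$RUN.logfiles" -rm -q
cd ..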

Useful commands

squeue is the fastest way to query running (R), pending (PD), and completing (CG) jobs. Filter using -t <STATE>, e.g., squeue -t CG.

Find out more about a job that has not yet completed using scontrol show job <JOBID>.

Find out more info about any job using seff <JOBID>.

Array jobs have two IDs: an array task ID (containing a _) and a job ID (not containing a _). Sometimes you need one, sometimes the other; seff accepts either and prints both.

Find out about personal limits using:

  1. sacctmgr list assoc: how many jobs a user can run at the same time (under GrpJobs)
  2. squota: how much storage a user has (pay attention to file count quotas, those are much stricter than expected!)
  3. sbalance: how much CPU time a user has left (accounting is not immediate; it can lag by up to a day)

sacct -j <JOBID> is sometimes useful for getting data on jobs.

To get the currently allocated resources: echo "Allocated resources:"; squeue -o "%m" -t RUNNING -h | awk '{s+=$1} END {print "Memory: " s/1000/1000 " TB"}'; squeue -o "%C" -t RUNNING -h | awk '{s+=$1} END {print "CPUs: " s " cores"}'

To cancel a job (for an array job, this cancels every task in the array): scancel <JOBID>

To get current job numbers: printf "R: %d, PD: %d\n" $(squeue -h -t R | wc -l) $(squeue -h -t PD | wc -l)

To attach to a running job: srun --pty --overlap --jobid <JOBID> bash. This is useful for running htop or similar to check on fuse-overlayfs status, among other things.
