
Support for cgroupsv2 #791

Merged
merged 153 commits into from
Oct 20, 2023
Conversation

globin
Contributor

@globin globin commented Nov 23, 2021

This adds support for cgroupsv2 to BenchExec. Support for cgroupsv1 is kept as is.

There are currently open TODOs:

  • Add use of cgroup namespace (allows proper nesting).
  • Better error messages and documentation.
  • Automated testing (CI).

However, testing is already possible and would be highly appreciated. The actual functionality should be working fine.

There are two ways to let BenchExec use cgroupsv2:

  • Start benchexec/runexec inside their own cgroup (as the only process). On systems with systemd, this can be done by prefixing the command line with:

    systemd-run --user --scope --slice=benchexec -p Delegate=yes ...
    
  • Let BenchExec create its own cgroup. This requires systemd and the pystemd Python package to be installed. On Ubuntu, for example, install python3-pystemd.
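To check whether the systemd-run variant actually yields a delegated cgroup, one can inspect /proc/self/cgroup from inside the scope. A minimal sketch (assuming a pure cgroupsv2 "unified" hierarchy, where that file contains a single `0::/path` line; the helper names are made up for illustration):

```python
from pathlib import Path

def cgroup_dir(proc_cgroup_line: str) -> Path:
    """Map a '0::/some/path' line from /proc/self/cgroup to its
    directory in the unified cgroupsv2 hierarchy."""
    path = proc_cgroup_line.strip().split("::", 1)[1]
    return Path("/sys/fs/cgroup") / path.lstrip("/")

def delegated_controllers() -> set[str]:
    """Controllers available in the current process's own cgroup."""
    line = Path("/proc/self/cgroup").read_text()
    return set((cgroup_dir(line) / "cgroup.controllers").read_text().split())
```

Run under `systemd-run --user --scope --slice=benchexec -p Delegate=yes`, `delegated_controllers()` should then list controllers such as cpu and memory.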

Please test and provide information in a comment here:

  • What worked and what did not?
  • What problems did you encounter?
  • What is your system (distribution, kernel version, systemd version)?

By default, even when not explicitly enabled, the cpu controller in v2
provides all the metrics we use. If cpu.{max,uclamp,weight} were to be
used in the future, this would need to be extended and made more specific.

Still not working correctly, as apparently only the parent cgroup
outputs to the io.stat file?
This allows the following to be run without any need for administrative
privileges.
```
$ systemd-run --user --scope -p Delegate=true runexec --read-only-dir / date
```

This also moves the main runexec process into a child cgroup, because
it is not possible to delegate controllers if a process
exists in the parent. This will make it possible to report BenchExec's
own resource usage.
This prevents the benchmarked process from changing the configured
limits.
So far, the cgroup with the limits is the same one where we add the
benchmarked process. But if we then delegate it into the container,
the benchmarked process can access it and change the limits.
So now we create a child cgroup of the cgroup with the limits,
move the benchmarked process into the child,
and make the child the root cgroup of the container.
Then the limits are configured outside of the container and cannot be
changed.

This finishes #436 for runexec.

We just need to take care that for some special operations
we also use the child cgroup instead of the main one.
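The pattern described above — limits in the parent, the benchmarked process in a child that becomes the container's cgroup root — can be sketched as follows (a simplified illustration, not BenchExec's actual code; `delegate_into_child` is a made-up name):

```python
from pathlib import Path

def delegate_into_child(limit_cgroup: Path, child_name: str, pid: int) -> Path:
    """Create a child cgroup, move `pid` into it, and enable all
    controllers for the children of `limit_cgroup`.

    Order matters under cgroupsv2: a cgroup whose subtree_control is
    used may not contain processes itself (the "no internal processes"
    rule), so the process is moved into the child first.
    """
    child = limit_cgroup / child_name
    child.mkdir(exist_ok=True)
    (child / "cgroup.procs").write_text(str(pid))
    controllers = (limit_cgroup / "cgroup.controllers").read_text().split()
    (limit_cgroup / "cgroup.subtree_control").write_text(
        " ".join(f"+{c}" for c in controllers)
    )
    # The child is what becomes the cgroup-namespace root inside the
    # container; the limit files in limit_cgroup stay out of reach.
    return child
```

The limit files (e.g. memory.max) are written to `limit_cgroup` beforehand, so the benchmarked process only ever sees the child and cannot change the limits.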

An alternative would be the "nsdelegate" mount option for the cgroup2 filesystem.
However, this needs to be set in the initial namespace,
so we cannot enforce this. And at least on my Ubuntu system,
it is missing, so we also cannot just declare it as a requirement.
The cgroup namespaces from the previous commit provide better isolation,
so using them also makes sense for containerexec, not just for runexec.
However, containerexec currently does not use or require cgroups at
all, and we do not want to make them mandatory,
so that containerexec stays usable for users without cgroup access.
So we add a new argument --cgroup-access to containerexec
that triggers use of cgroups for the run (without any limits etc.)
and uses cgroup namespaces to provide a usable cgroup for the process
inside the container.
We support this flag only on systems with cgroupsv2
because with cgroupsv1 the cgroup namespaces do not work as well.

runexec and benchexec do not get this argument,
because they use cgroups anyway and thus --cgroup-access is implied.

This closes #436.
This is a kernel problem that will hopefully be fixed in the future,
but until then it is likely to cause problems for users who use BenchExec
on their own machines (and not on dedicated servers).
So we at least give them a quick workaround.
@lorenzleutgeb
Contributor

With cgroups2, the systemd unit benchexec-cgroups is not required anymore, because BenchExec can use pystemd to set things up, right?

# Only necessary on cgroupv1 systems
ConditionControlGroupController=v1

@PhilippWendler
Member

With cgroups2, the systemd unit benchexec-cgroups is not required anymore, because BenchExec can use pystemd to set things up, right?

Yes, exactly.

@lorenzleutgeb
Contributor

lorenzleutgeb commented Oct 3, 2023

I am trying to package BenchExec for NixOS (see #920). I decided to go for the cgroupsv2 branch directly, and would like to report.

runexec --debug --no-container echo Test complains that "Cgroup subsystem cpuset is not available. Please make sure it is supported by your kernel and available." Note that cpuset is listed in /proc/cgroups (see below) and I am not sure what else I have to configure here.
% runexec --debug --no-container echo Test
2023-10-03 12:57:47,760 - DEBUG - This is runexec 3.18-dev.
2023-10-03 12:57:47,765 - DEBUG - Analyzing /proc/mounts and /proc/self/cgroup to determine cgroups.
2023-10-03 12:57:47,766 - DEBUG - Available Cgroups: {
  'memory': PosixPath('/sys/fs/cgroup/user.slice/user-1000.slice/session-4.scope'),
  'pids':   PosixPath('/sys/fs/cgroup/user.slice/user-1000.slice/session-4.scope'),
  'cpu':    PosixPath('/sys/fs/cgroup/user.slice/user-1000.slice/session-4.scope'),
  'io':     PosixPath('/sys/fs/cgroup/user.slice/user-1000.slice/session-4.scope'),
  'kill':   PosixPath('/sys/fs/cgroup/user.slice/user-1000.slice/session-4.scope'),
  'freeze': PosixPath('/sys/fs/cgroup/user.slice/user-1000.slice/session-4.scope')
}
2023-10-03 12:57:47,803 - DEBUG - Process moved to a fresh systemd scope: benchexec_Ay4STzwJJH4.scope
2023-10-03 12:57:47,803 - DEBUG - Analyzing /proc/mounts and /proc/self/cgroup to determine cgroups.
2023-10-03 12:57:47,803 - DEBUG - Available Cgroups: {
  'memory': PosixPath('/sys/fs/cgroup/user.slice/user-1000.slice/user@1000.service/benchexec.slice/benchexec_Ay4STzwJJH4.scope'),
  'pids':   PosixPath('/sys/fs/cgroup/user.slice/user-1000.slice/user@1000.service/benchexec.slice/benchexec_Ay4STzwJJH4.scope'),
  'cpu':    PosixPath('/sys/fs/cgroup/user.slice/user-1000.slice/user@1000.service/benchexec.slice/benchexec_Ay4STzwJJH4.scope'),
  'io':     PosixPath('/sys/fs/cgroup/user.slice/user-1000.slice/user@1000.service/benchexec.slice/benchexec_Ay4STzwJJH4.scope'),
  'kill':   PosixPath('/sys/fs/cgroup/user.slice/user-1000.slice/user@1000.service/benchexec.slice/benchexec_Ay4STzwJJH4.scope'),
  'freeze': PosixPath('/sys/fs/cgroup/user.slice/user-1000.slice/user@1000.service/benchexec.slice/benchexec_Ay4STzwJJH4.scope')
}
2023-10-03 12:57:47,803 - DEBUG - Available Cgroups: {
  'freeze': PosixPath('/sys/fs/cgroup/user.slice/user-1000.slice/user@1000.service/benchexec.slice/benchexec_Ay4STzwJJH4.scope/benchexec_process_33siw073'),
  'kill':   PosixPath('/sys/fs/cgroup/user.slice/user-1000.slice/user@1000.service/benchexec.slice/benchexec_Ay4STzwJJH4.scope/benchexec_process_33siw073'),
  'cpu':    PosixPath('/sys/fs/cgroup/user.slice/user-1000.slice/user@1000.service/benchexec.slice/benchexec_Ay4STzwJJH4.scope/benchexec_process_33siw073')
}
2023-10-03 12:57:47,803 - WARNING - Cgroup subsystem cpuset is not available. Please make sure it is supported by your kernel and available.
2023-10-03 12:57:47,804 - INFO - Starting command echo Test
2023-10-03 12:57:47,804 - INFO - Writing output to output.log
2023-10-03 12:57:47,804 - DEBUG - Setting up cgroups for run.
2023-10-03 12:57:47,804 - DEBUG - Available Cgroups: {
  'kill':   PosixPath('/sys/fs/cgroup/user.slice/user-1000.slice/user@1000.service/benchexec.slice/benchexec_Ay4STzwJJH4.scope/benchmark_fpoc5nqk'),
  'memory': PosixPath('/sys/fs/cgroup/user.slice/user-1000.slice/user@1000.service/benchexec.slice/benchexec_Ay4STzwJJH4.scope/benchmark_fpoc5nqk'),
  'pids':   PosixPath('/sys/fs/cgroup/user.slice/user-1000.slice/user@1000.service/benchexec.slice/benchexec_Ay4STzwJJH4.scope/benchmark_fpoc5nqk'),
  'cpu':    PosixPath('/sys/fs/cgroup/user.slice/user-1000.slice/user@1000.service/benchexec.slice/benchexec_Ay4STzwJJH4.scope/benchmark_fpoc5nqk'),
  'io':     PosixPath('/sys/fs/cgroup/user.slice/user-1000.slice/user@1000.service/benchexec.slice/benchexec_Ay4STzwJJH4.scope/benchmark_fpoc5nqk'),
  'freeze': PosixPath('/sys/fs/cgroup/user.slice/user-1000.slice/user@1000.service/benchexec.slice/benchexec_Ay4STzwJJH4.scope/benchmark_fpoc5nqk')
}
2023-10-03 12:57:47,804 - DEBUG - Created cgroups {
  PosixPath('/sys/fs/cgroup/user.slice/user-1000.slice/user@1000.service/benchexec.slice/benchexec_Ay4STzwJJH4.scope/benchmark_fpoc5nqk')
}.
2023-10-03 12:57:47,804 - DEBUG - Using additional environment {}.
2023-10-03 12:57:47,806 - DEBUG - Starting process.
2023-10-03 12:57:47,806 - DEBUG - Executing run with $HOME and $TMPDIR below /tmp/BenchExec_run_adf9q116.
2023-10-03 12:57:47,809 - DEBUG - Waiting for process echo with pid 15091
2023-10-03 12:57:47,810 - DEBUG - energy measurement output:
2023-10-03 12:57:47,810 - DEBUG - energy measurement output: cpu_count=1
2023-10-03 12:57:47,810 - DEBUG - energy measurement output: duration_seconds=0.001045
2023-10-03 12:57:47,810 - DEBUG - energy measurement output: cpu0_package_joules=0.046875
2023-10-03 12:57:47,810 - DEBUG - energy measurement output: cpu0_core_joules=0.050537
2023-10-03 12:57:47,810 - DEBUG - energy measurement output: cpu0_dram_joules=0.001770
2023-10-03 12:57:47,810 - DEBUG - Process terminated, exit code 0.
2023-10-03 12:57:47,811 - DEBUG - Getting cgroup measurements.
2023-10-03 12:57:47,911 - DEBUG - Resource usage of run: walltime=0.0024268529999744715, cputime=0.001914, cgroup-cputime=0.001432, memory=1032192
2023-10-03 12:57:47,911 - DEBUG - Cleaning up cgroups.
2023-10-03 12:57:47,911 - DEBUG - Cleaning up temporary directory /tmp/BenchExec_run_adf9q116.
2023-10-03 12:57:47,912 - DEBUG - Size of logfile 'output.log' is 100 bytes, size limit disabled.
starttime=2023-10-03T12:57:47.807575+02:00
returnvalue=0
walltime=0.0024268529999744715s
cputime=0.001432s
memory=1032192B
blkio-read=0B
blkio-write=0B
pressure-cpu-some=0.000002s
pressure-io-some=0s
pressure-memory-some=0s
cpuenergy=0.046875J
cpuenergy-pkg0-core=0.050537J
cpuenergy-pkg0-dram=0.001770J
cpuenergy-pkg0-package=0.046875J
containerexec --debug bash complains with "Failed to configure container: [Errno 22] Creating overlay mount for / failed: Invalid argument. Please use other directory modes, for example --read-only-dir /."
% containerexec --debug bash
2023-10-03 13:04:50,824 - DEBUG - This is containerexec 3.18-dev.
2023-10-03 13:04:50,824 - INFO - Starting command bash
2023-10-03 13:04:50,825 - DEBUG - Cannot use overlay mode for /var/lib/lxcfs/proc because it is not a directory. Using read-only mode instead.
2023-10-03 13:04:50,825 - DEBUG - Available Cgroups: {}
2023-10-03 13:04:50,825 - DEBUG - Starting process.
2023-10-03 13:04:50,826 - DEBUG - Parent: child process of RunExecutor with PID 16396 started.
2023-10-03 13:04:50,826 - DEBUG - Child: child process of RunExecutor with PID 16396 started
2023-10-03 13:04:50,827 - DEBUG - Failed to make b'/tmp/BenchExec_run_5qp72igi/mount/home/benchexec' a bind mount: [Errno 2] mount(b'/tmp/BenchExec_run_5qp72igi/mount/home/benchexec', b'/tmp/BenchExec_run_5qp72igi/mount/home/benchexec', None, 4096, None) failed: No such file or directory
2023-10-03 13:04:50,827 - DEBUG - Mounting '/' as overlay
2023-10-03 13:04:50,827 - DEBUG - [Errno 16] umount(b'/tmp/BenchExec_run_5qp72igi/mount/') failed: Device or resource busy
2023-10-03 13:04:50,827 - DEBUG - Creating overlay mount: target=b'/tmp/BenchExec_run_5qp72igi/mount/', lower=b'/', upper=b'/tmp/BenchExec_run_5qp72igi/temp/', work=b'/tmp/BenchExec_run_5qp72igi/overlayfs/1'
2023-10-03 13:04:50,827 - CRITICAL - Failed to configure container: [Errno 22] Creating overlay mount for '/' failed: Invalid argument. Please use other directory modes, for example '--read-only-dir /'.
2023-10-03 13:04:50,828 - DEBUG - Waiting for process bash with pid 16396
2023-10-03 13:04:50,828 - DEBUG - Parent: child process of RunExecutor with PID 16396 terminated with return value 128.
2023-10-03 13:04:50,828 - DEBUG - Process terminated, exit code 0.
2023-10-03 13:04:50,828 - DEBUG - Cleaning up temporary directory.
2023-10-03 13:04:50,828 - ERROR - execution in container failed, check log for details
Traceback (most recent call last):
  File "/nix/store/cmm6inc9n3q8yv23hjpydbhsnr2c45jf-benchexec-unstable-2023-09-05/lib/python3.10/site-packages/benchexec/containerexecutor.py", line 942, in _start_execution_in_container
    grandchild_pid = int(os.read(from_grandchild, 10))
ValueError: invalid literal for int() with base 10: b''

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/nix/store/cmm6inc9n3q8yv23hjpydbhsnr2c45jf-benchexec-unstable-2023-09-05/lib/python3.10/site-packages/benchexec/containerexecutor.py", line 296, in main
result = executor.execute_run(
File "/nix/store/cmm6inc9n3q8yv23hjpydbhsnr2c45jf-benchexec-unstable-2023-09-05/lib/python3.10/site-packages/benchexec/containerexecutor.py", line 469, in execute_run
pid, result_fn = self._start_execution(
File "/nix/store/cmm6inc9n3q8yv23hjpydbhsnr2c45jf-benchexec-unstable-2023-09-05/lib/python3.10/site-packages/benchexec/containerexecutor.py", line 538, in _start_execution
result = self._start_execution_in_container(
File "/nix/store/cmm6inc9n3q8yv23hjpydbhsnr2c45jf-benchexec-unstable-2023-09-05/lib/python3.10/site-packages/benchexec/containerexecutor.py", line 946, in _start_execution_in_container
check_child_exit_code()
File "/nix/store/cmm6inc9n3q8yv23hjpydbhsnr2c45jf-benchexec-unstable-2023-09-05/lib/python3.10/site-packages/benchexec/containerexecutor.py", line 899, in check_child_exit_code
raise BenchExecException(
benchexec.BenchExecException: execution in container failed, check log for details
Cannot execute bash: execution in container failed, check log for details.

Fiddling around with the arguments a bit led me to:

runexec --debug --read-only-dir / --hidden-dir /home --dir /tmp $(readlink -f $(which bash)) -- -c "echo Test" which still complains, but I am not sure how bad the situation is. Maybe you can tell me? Another issue seems to be that my /run/wrappers/bin/cpu-energy-meter is not found because /run is mounted as hidden. Note that it was found with --no-container. How to best resolve that?
2023-10-03 13:14:13,568 - DEBUG - This is runexec 3.18-dev.
2023-10-03 13:14:13,568 - INFO - LXCFS is not available, some host information like the uptime leaks into the container.
2023-10-03 13:14:13,568 - DEBUG - Available Cgroups: {}
2023-10-03 13:14:13,573 - DEBUG - Analyzing /proc/mounts and /proc/self/cgroup to determine cgroups.
2023-10-03 13:14:13,573 - DEBUG - Available Cgroups: {
  'cpu':    PosixPath('/sys/fs/cgroup/user.slice/user-1000.slice/session-4.scope'),
  'io':     PosixPath('/sys/fs/cgroup/user.slice/user-1000.slice/session-4.scope'),
  'freeze': PosixPath('/sys/fs/cgroup/user.slice/user-1000.slice/session-4.scope'),
  'pids':   PosixPath('/sys/fs/cgroup/user.slice/user-1000.slice/session-4.scope'),
  'kill':   PosixPath('/sys/fs/cgroup/user.slice/user-1000.slice/session-4.scope'),
  'memory': PosixPath('/sys/fs/cgroup/user.slice/user-1000.slice/session-4.scope')
}
2023-10-03 13:14:13,601 - DEBUG - Process moved to a fresh systemd scope: benchexec_O5DHPeGtQQ0.scope
2023-10-03 13:14:13,601 - DEBUG - Analyzing /proc/mounts and /proc/self/cgroup to determine cgroups.
2023-10-03 13:14:13,601 - DEBUG - Available Cgroups: {
  'cpu':    PosixPath('/sys/fs/cgroup/user.slice/user-1000.slice/user@1000.service/benchexec.slice/benchexec_O5DHPeGtQQ0.scope'),
  'io':     PosixPath('/sys/fs/cgroup/user.slice/user-1000.slice/user@1000.service/benchexec.slice/benchexec_O5DHPeGtQQ0.scope'),
  'freeze': PosixPath('/sys/fs/cgroup/user.slice/user-1000.slice/user@1000.service/benchexec.slice/benchexec_O5DHPeGtQQ0.scope'),
  'pids':   PosixPath('/sys/fs/cgroup/user.slice/user-1000.slice/user@1000.service/benchexec.slice/benchexec_O5DHPeGtQQ0.scope'),
  'kill':   PosixPath('/sys/fs/cgroup/user.slice/user-1000.slice/user@1000.service/benchexec.slice/benchexec_O5DHPeGtQQ0.scope'),
  'memory': PosixPath('/sys/fs/cgroup/user.slice/user-1000.slice/user@1000.service/benchexec.slice/benchexec_O5DHPeGtQQ0.scope')
}
2023-10-03 13:14:13,601 - DEBUG - Available Cgroups: {
  'cpu':    PosixPath('/sys/fs/cgroup/user.slice/user-1000.slice/user@1000.service/benchexec.slice/benchexec_O5DHPeGtQQ0.scope/benchexec_process_1y4d9uty'),
  'kill':   PosixPath('/sys/fs/cgroup/user.slice/user-1000.slice/user@1000.service/benchexec.slice/benchexec_O5DHPeGtQQ0.scope/benchexec_process_1y4d9uty'),
  'freeze': PosixPath('/sys/fs/cgroup/user.slice/user-1000.slice/user@1000.service/benchexec.slice/benchexec_O5DHPeGtQQ0.scope/benchexec_process_1y4d9uty')
}
2023-10-03 13:14:13,602 - WARNING - Cgroup subsystem cpuset is not available. Please make sure it is supported by your kernel and available.
2023-10-03 13:14:13,602 - INFO - Starting command /nix/store/kxkdrxvc3da2dpsgikn8s2ml97h88m46-bash-interactive-5.2-p15/bin/bash -c 'echo Test'
2023-10-03 13:14:13,602 - INFO - Writing output to output.log
2023-10-03 13:14:13,602 - DEBUG - Setting up cgroups for run.
2023-10-03 13:14:13,602 - DEBUG - Available Cgroups: {
  'cpu':    PosixPath('/sys/fs/cgroup/user.slice/user-1000.slice/user@1000.service/benchexec.slice/benchexec_O5DHPeGtQQ0.scope/benchmark_z9eptppn'),
  'io':     PosixPath('/sys/fs/cgroup/user.slice/user-1000.slice/user@1000.service/benchexec.slice/benchexec_O5DHPeGtQQ0.scope/benchmark_z9eptppn'),
  'kill':   PosixPath('/sys/fs/cgroup/user.slice/user-1000.slice/user@1000.service/benchexec.slice/benchexec_O5DHPeGtQQ0.scope/benchmark_z9eptppn'),
  'memory': PosixPath('/sys/fs/cgroup/user.slice/user-1000.slice/user@1000.service/benchexec.slice/benchexec_O5DHPeGtQQ0.scope/benchmark_z9eptppn'),
  'freeze': PosixPath('/sys/fs/cgroup/user.slice/user-1000.slice/user@1000.service/benchexec.slice/benchexec_O5DHPeGtQQ0.scope/benchmark_z9eptppn'),
  'pids':   PosixPath('/sys/fs/cgroup/user.slice/user-1000.slice/user@1000.service/benchexec.slice/benchexec_O5DHPeGtQQ0.scope/benchmark_z9eptppn')
}
2023-10-03 13:14:13,602 - DEBUG - Created cgroups {PosixPath('/sys/fs/cgroup/user.slice/user-1000.slice/user@1000.service/benchexec.slice/benchexec_O5DHPeGtQQ0.scope/benchmark_z9eptppn')}.
2023-10-03 13:14:13,602 - DEBUG - Using additional environment {}.
2023-10-03 13:14:13,604 - DEBUG - Starting process.
2023-10-03 13:14:13,605 - DEBUG - Parent: child process of RunExecutor with PID 18724 started.
2023-10-03 13:14:13,605 - DEBUG - Child: child process of RunExecutor with PID 18724 started
2023-10-03 13:14:13,606 - DEBUG - Failed to make b'/tmp/BenchExec_run_itv8bmlc/mount/home/benchexec' a bind mount: [Errno 2] mount(b'/tmp/BenchExec_run_itv8bmlc/mount/home/benchexec', b'/tmp/BenchExec_run_itv8bmlc/mount/home/benchexec', None, 4096, None) failed: No such file or directory
2023-10-03 13:14:13,606 - DEBUG - Mounting '/' as read-only
2023-10-03 13:14:13,606 - DEBUG - Mounting '/nix' as read-only
2023-10-03 13:14:13,606 - DEBUG - Mounting '/nix/store' as read-only
2023-10-03 13:14:13,606 - DEBUG - Mounting '/proc' as read-only
2023-10-03 13:14:13,606 - DEBUG - Mounting '/sys' as read-only
2023-10-03 13:14:13,606 - DEBUG - Mounting '/sys/kernel/security' as read-only
2023-10-03 13:14:13,606 - DEBUG - Mounting '/sys/fs/cgroup' as read-only
2023-10-03 13:14:13,606 - DEBUG - Mounting '/sys/fs/pstore' as read-only
2023-10-03 13:14:13,607 - DEBUG - Mounting '/sys/firmware/efi/efivars' as read-only
2023-10-03 13:14:13,607 - DEBUG - Mounting '/sys/fs/bpf' as read-only
2023-10-03 13:14:13,607 - DEBUG - Mounting '/sys/kernel/debug' as read-only
2023-10-03 13:14:13,607 - DEBUG - Mounting '/sys/fs/fuse/connections' as read-only
2023-10-03 13:14:13,607 - DEBUG - Mounting '/sys/kernel/config' as read-only
2023-10-03 13:14:13,607 - DEBUG - Mounting '/dev' as read-only
2023-10-03 13:14:13,607 - DEBUG - Mounting '/dev/pts' as read-only
2023-10-03 13:14:13,607 - DEBUG - Mounting '/dev/shm' as read-only
2023-10-03 13:14:13,607 - DEBUG - Mounting '/dev/mqueue' as read-only
2023-10-03 13:14:13,607 - DEBUG - Mounting '/dev/hugepages' as read-only
2023-10-03 13:14:13,607 - DEBUG - Mounting '/run' as hidden
2023-10-03 13:14:13,607 - DEBUG - [Errno 16] umount(b'/tmp/BenchExec_run_itv8bmlc/mount/run') failed: Device or resource busy
2023-10-03 13:14:13,607 - DEBUG - Mounting '/tmp' as hidden
2023-10-03 13:14:13,607 - DEBUG - Mounting '/home' as hidden
2023-10-03 13:14:13,607 - DEBUG - [Errno 16] umount(b'/tmp/BenchExec_run_itv8bmlc/mount/home') failed: Device or resource busy
2023-10-03 13:14:13,607 - DEBUG - Mounting '/boot' as read-only
2023-10-03 13:14:13,607 - DEBUG - Mounting '/tmp' as hidden
2023-10-03 13:14:13,607 - DEBUG - Mounting '/run' as hidden
2023-10-03 13:14:13,608 - DEBUG - Mounting '/home' as hidden
2023-10-03 13:14:13,609 - DEBUG - Parent: executing /nix/store/kxkdrxvc3da2dpsgikn8s2ml97h88m46-bash-interactive-5.2-p15/bin/bash in grand child with PID 18725 via child with PID 18724.
2023-10-03 13:14:13,610 - DEBUG - Available Cgroups: {
  'cpu': PosixPath('/sys/fs/cgroup/user.slice/user-1000.slice/user@1000.service/benchexec.slice/benchexec_O5DHPeGtQQ0.scope/benchmark_z9eptppn/delegate_hbj_mrht'),
  'io': PosixPath('/sys/fs/cgroup/user.slice/user-1000.slice/user@1000.service/benchexec.slice/benchexec_O5DHPeGtQQ0.scope/benchmark_z9eptppn/delegate_hbj_mrht'),
  'kill': PosixPath('/sys/fs/cgroup/user.slice/user-1000.slice/user@1000.service/benchexec.slice/benchexec_O5DHPeGtQQ0.scope/benchmark_z9eptppn/delegate_hbj_mrht'),
  'memory': PosixPath('/sys/fs/cgroup/user.slice/user-1000.slice/user@1000.service/benchexec.slice/benchexec_O5DHPeGtQQ0.scope/benchmark_z9eptppn/delegate_hbj_mrht'),
  'freeze': PosixPath('/sys/fs/cgroup/user.slice/user-1000.slice/user@1000.service/benchexec.slice/benchexec_O5DHPeGtQQ0.scope/benchmark_z9eptppn/delegate_hbj_mrht'),
  'pids': PosixPath('/sys/fs/cgroup/user.slice/user-1000.slice/user@1000.service/benchexec.slice/benchexec_O5DHPeGtQQ0.scope/benchmark_z9eptppn/delegate_hbj_mrht')
}
2023-10-03 13:14:13,612 - DEBUG - Waiting for signals
2023-10-03 13:14:13,613 - DEBUG - Child: process /nix/store/kxkdrxvc3da2dpsgikn8s2ml97h88m46-bash-interactive-5.2-p15/bin/bash terminated with exit code 0.
2023-10-03 13:14:13,614 - DEBUG - energy measurement output:
2023-10-03 13:14:13,614 - DEBUG - energy measurement output: cpu_count=1
2023-10-03 13:14:13,614 - DEBUG - energy measurement output: duration_seconds=0.000017
2023-10-03 13:14:13,614 - DEBUG - 0 output files matched the patterns and were transferred.
2023-10-03 13:14:13,614 - DEBUG - Waiting for process /nix/store/kxkdrxvc3da2dpsgikn8s2ml97h88m46-bash-interactive-5.2-p15/bin/bash with pid 18724
2023-10-03 13:14:13,615 - DEBUG - Parent: child process of RunExecutor with PID 18724 terminated with return value 0.
2023-10-03 13:14:13,615 - DEBUG - Process terminated, exit code 0.
2023-10-03 13:14:13,615 - DEBUG - Getting cgroup measurements.
2023-10-03 13:14:13,716 - DEBUG - Resource usage of run: walltime=0.002302315000179078, cputime=0.0026539999999999997, cgroup-cputime=0.002127, memory=835584
2023-10-03 13:14:13,716 - DEBUG - Cleaning up cgroups.
2023-10-03 13:14:13,716 - DEBUG - Cleaning up temporary directory /tmp/BenchExec_run_itv8bmlc.
2023-10-03 13:14:13,717 - DEBUG - Size of logfile 'output.log' is 183 bytes, size limit disabled.
starttime=2023-10-03T13:14:13.611688+02:00
returnvalue=0
walltime=0.002302315000179078s
cputime=0.002127s
memory=835584B
blkio-read=0B
blkio-write=0B
pressure-cpu-some=0.000002s
pressure-io-some=0s
pressure-memory-some=0s

The state of the system is as follows:

cat /proc/cgroups
#subsys_name    hierarchy   num_cgroups enabled
cpuset  0   140 1
cpu 0   140 1
cpuacct 0   140 1
blkio   0   140 1
memory  0   140 1
devices 0   140 1
freezer 0   140 1
net_cls 0   140 1
perf_event  0   140 1
net_prio    0   140 1
hugetlb 0   140 1
pids    0   140 1
rdma    0   140 1
misc    0   140 1
debug   0   140 1
grep -E 'cgroup|overlay' /proc/filesystems
nodev   cgroup
nodev   cgroup2
nodev   overlay
gunzip --stdout /proc/config.gz | grep -E 'CONFIG_(OVERLAY_FS|USER_NS)='
CONFIG_USER_NS=y
CONFIG_OVERLAY_FS=m
lsmod | grep -E 'cpuid|msr|overlay'
overlay               163840  0
intel_rapl_msr         20480  0
intel_rapl_common      28672  1 intel_rapl_msr
msr                    16384  0
cpuid                  16384  0
uname -a
Linux [...] 6.1.44 #1-NixOS SMP PREEMPT_DYNAMIC Tue Aug  8 18:03:51 UTC 2023 x86_64 GNU/Linux
systemctl --version
systemd 253 (253.6)
+PAM +AUDIT -SELINUX +APPARMOR +IMA +SMACK +SECCOMP +GCRYPT -GNUTLS +OPENSSL +ACL +BLKID +CURL +ELFUTILS +FIDO2 +IDN2 -IDN +IPTC +KMOD +LIBCRYPTSETUP +LIBFDISK +PCRE2 -PWQUALITY +P11KIT -QRENCODE +TPM2 +BZIP2 +LZ4 +XZ +ZLIB +ZSTD +BPF_FRAMEWORK -XKBCOMMON +UTMP -SYSVINIT default-hierarchy=unified

@PhilippWendler
Member

I am trying to package BenchExec for NixOS (see #920). I decided to go for the cgroupsv2 branch directly, and would like to report.

Thanks a lot!

runexec --debug --no-container echo Test complains that "Cgroup subsystem cpuset is not available. Please make sure it is supported by your kernel and available." Note that cpuset is listed in /proc/cgroups (see below) and I am not sure what else I have to configure here.

This sounds like the kernel problem from 6cbd9fc. If you force runexec to require cpuset, for example with --cores 0, you should see a better error message. Maybe I should also improve the warning that is printed if cpuset is missing but not required.

containerexec --debug bash complains with "Failed to configure container: [Errno 22] Creating overlay mount for / failed: Invalid argument. Please use other directory modes, for example --read-only-dir /."

Yes, this is a well-known kernel regression: #776

Fiddling around with the arguments a bit led me to:
runexec --debug --read-only-dir / --hidden-dir /home --dir /tmp $(readlink -f $(which bash)) -- -c "echo Test" which still complains, but I am not sure how bad the situation is. Maybe you can tell me?

--dir should be optional, though, right? Except if you are currently in a hidden directory, then you have to specify it, of course. Or, if you need the current directory, you can use an overlayfs for it.

I see only one remaining warning, which is the one discussed above.

Another issue seems to be that my /run/wrappers/bin/cpu-energy-meter is not found because /run is mounted as hidden. Note that it was found with --no-container. How to best resolve that?

The last log of runexec shows that it successfully used cpu-energy-meter.
Do you really have binaries in /run? If you need them, you can make them un-hidden with, for example, --overlay-dir /run/wrappers/bin, or we can add such logic to BenchExec if this is some NixOS rule.

@lorenzleutgeb
Contributor

lorenzleutgeb commented Oct 4, 2023

runexec --debug --no-container echo Test complains that "Cgroup subsystem cpuset is not available. Please make sure it is supported by your kernel and available." Note that cpuset is listed in /proc/cgroups (see below) and I am not sure what else I have to configure here.

This sounds like the kernel problem from 6cbd9fc. If you force runexec to require cpuset, for example with --cores 0, you should see a better error message. Maybe I should also improve the warning that is printed if cpuset is missing but not required.

Yes, you're right. I'd prefer the workaround to be printed even for the warning, but that's a matter of taste. Also, I had to run a few commands (see below) in alternation with runexec. It would be helpful to print them all at once, but I don't know whether that's possible:

echo +cpuset | sudo tee /sys/fs/cgroup/user.slice/cgroup.subtree_control
echo +cpuset | sudo tee /sys/fs/cgroup/user.slice/user-1000.slice/cgroup.subtree_control
echo +cpuset | sudo tee /sys/fs/cgroup/user.slice/user-1000.slice/user@1000.service/cgroup.subtree_control
echo +cpuset | sudo tee /sys/fs/cgroup/user.slice/user-1000.slice/user@1000.service/benchexec.slice/cgroup.subtree_control
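The four writes above could also be scripted; a small sketch that enables a controller on every level of such a path at once (must run as root; `enable_along_path` is a made-up helper, not part of BenchExec):

```python
from pathlib import Path

def enable_along_path(base: Path, parts: list[str],
                      controller: str = "cpuset") -> list[Path]:
    """Write '+<controller>' into cgroup.subtree_control of each
    successively nested cgroup under `base`, mirroring the manual
    `echo +cpuset | sudo tee ...` commands."""
    written = []
    cg = base
    for part in parts:
        cg = cg / part
        target = cg / "cgroup.subtree_control"
        target.write_text(f"+{controller}")
        written.append(target)
    return written

# Example corresponding to the commands above:
# enable_along_path(Path("/sys/fs/cgroup"),
#                   ["user.slice", "user-1000.slice",
#                    "user@1000.service", "benchexec.slice"])
```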

Now the warning is gone! And it looks like this is fixed in kernel 6.6.

containerexec --debug bash complains with "Failed to configure container: [Errno 22] Creating overlay mount for / failed: Invalid argument. Please use other directory modes, for example --read-only-dir /."

Yes, this is a well-known kernel regression: #776

Fiddling around with the arguments a bit led me to:
runexec --debug --read-only-dir / --hidden-dir /home --dir /tmp $(readlink -f $(which bash)) -- -c "echo Test" which still complains, but I am not sure how bad the situation is. Maybe you can tell me?

--dir should be optional, though, right? Except if you are currently in a hidden directory, then you have to specify it, of course. Or, if you need the current directory, you can use an overlayfs for it.

I invoked runexec in a subdirectory of /home, so when I added --hidden-dir /home it complained that the directory did not exist. That's why I added --dir /tmp.

I see only one remaining warning, which is the one discussed above.

Another issue seems to be that my /run/wrappers/bin/cpu-energy-meter is not found because /run is mounted as hidden. Note that it was found with --no-container. How to best resolve that?

The last log of runexec shows that it successfully used cpu-energy-meter.

Hmm yeah, sorry. I was confused.

Do you really have binaries in /run?

Yes. Some background: In NixOS, the cpu-energy-meter binary is somewhere in /nix/store/*-cpu-energy-meter-*/bin/cpu-energy-meter and has no capabilities (for sandboxing builds, I believe). So there's another binary at /run/wrappers/bin/cpu-energy-meter configured via security.wrappers.cpu-energy-meter that does have the CAP_SYS_RAWIO and will execute /nix/store/*-cpu-energy-meter-*/bin/cpu-energy-meter.

[...] or we can add such logic to BenchExec if this is some NixOS rule.

I don't think that's necessary.

So, it seems that all warnings are resolved, CPU Energy Meter, LXCFS, and libseccomp are available. One thing that NixOS reviewers will ask is why I want to package an unreleased version. Are there any plans on releasing BenchExec with support for cgroups2? Do you need any more data or reports?

@PhilippWendler
Member

runexec --debug --no-container echo Test complains that "Cgroup subsystem cpuset is not available. Please make sure it is supported by your kernel and available." Note that cpuset is listed in /proc/cgroups (see below) and I am not sure what else I have to configure here.

This sounds like the kernel problem from 6cbd9fc. If you force runexec to require cpuset, for example with --cores 0, you should see a better error message. Maybe I should also improve the warning that is printed if cpuset is missing but not required.

Yes, you're right. I'd prefer the workaround to be printed even for the warning, but that's a matter of taste. Also, I had to run a few commands (see below) in alternation with runexec. It would be helpful to print them all at once, but I don't know whether that's possible:

echo +cpuset | sudo tee /sys/fs/cgroup/user.slice/cgroup.subtree_control
echo +cpuset | sudo tee /sys/fs/cgroup/user.slice/user-1000.slice/cgroup.subtree_control
echo +cpuset | sudo tee /sys/fs/cgroup/user.slice/user-1000.slice/user@1000.service/cgroup.subtree_control
echo +cpuset | sudo tee /sys/fs/cgroup/user.slice/user-1000.slice/user@1000.service/benchexec.slice/cgroup.subtree_control

Interesting. In my tests I think that systemd did that automatically once the initial command was run. But as long as BenchExec at least prints the next step each time it is run, I guess I will keep it that way because detecting what exactly to run upfront could be difficult.

Now the warning is gone! And it looks like this is fixed in kernel 6.6.

Ah, nice! It is really hard to keep track of patches floating around in Linux land. I guess I can add a recommendation for Linux 6.6.

--dir should be optional, though, right? Except if you are currently in a hidden directory, then you have to specify it, of course. Or, if you need the current directory, you can use an overlayfs for it.

I invoked runexec in a subdirectory of /home, so when I added --hidden-dir /home

Which btw. is not strictly necessary, but does improve how the container's system config looks.

it complained that the directory did not exist. That's why I added --dir /tmp.

Right. The typical alternative would be to use something like --overlay-dir . or --overlay-dir ~.

Do you really have binaries in /run?

Yes. Some background: In NixOS, the cpu-energy-meter binary is somewhere in /nix/store/*-cpu-energy-meter-*/bin/cpu-energy-meter and has no capabilities (for sandboxing builds, I believe). So there's another binary at /run/wrappers/bin/cpu-energy-meter configured via security.wrappers.cpu-energy-meter that does have the CAP_SYS_RAWIO and will execute /nix/store/*-cpu-energy-meter-*/bin/cpu-energy-meter.

Ah, thanks for the explanation.

[...] or we can add such logic to BenchExec if this is some NixOS rule.

I don't think that's necessary.

Ok, as you prefer. Maybe add --read-only-dir /run/wrappers/bin to the package documentation or so?

So, it seems that all warnings are resolved, CPU Energy Meter, LXCFS, and libseccomp are available. One thing that NixOS reviewers will ask is why I want to package an unreleased version. Are there any plans on releasing BenchExec with support for cgroups2? Do you need any more data or reports?

The most important reason was that I didn't get any feedback from others so far, so I wanted to do more internal tests. But thanks to you, I think I can just go ahead and release a version with cgroupsv2 and tell people to still treat it as somewhat experimental in the release notes. Ah, and I have to write more docs... So I hope I can do a release this week (but then without the polishings discussed in the last comments).

@lorenzleutgeb
Contributor

lorenzleutgeb commented Oct 4, 2023

[...] I think I can just go ahead and release a version with cgroupsv2 and tell people to still treat it as somewhat experimental in the release notes. Ah, and I have to write more docs... So I hope I can do a release this week (but then without the polishings discussed in the last comments).

OK, that's great news. I am not in a rush, so please take your time and feel free to polish things up before releasing in the coming weeks. I'll wait.

We want to recommend installation of pystemd
and provide appropriate documentation for users.
Instead of crashing, we continue and provide nice error messages
if cgroups are required.
For Podman we can even tell the user exactly what to do.
Mention @globin as contributor as this work is from him.
@PhilippWendler PhilippWendler linked an issue Oct 20, 2023 that may be closed by this pull request
There is this kernel bug for cgroups v2 that prevents us from using
CPUSET but everything else is working.
In this case we provide a nice error message with a workaround (6cbd9fc),
but check_cgroups did not do this so far because it terminated itself
too early. Now we continue and let RunExecutor print the message.
So far, BenchExec prints a warning about every missing cgroup
controller, and later on an error message if that controller is strictly
required. But CPUSET is only required for core limits, and has no other
use. This is different from e.g. MEMORY which is required for memory
limits but also provides memory measurements.
So let's silence the warning about missing CPUSET and keep only the error
message.

check_cgroups is not affected because it forces CPUSET to be required.
@PhilippWendler
Member

runexec --debug --no-container echo Test complains that "Cgroup subsystem cpuset is not available. Please make sure it is supported by your kernel and available." Note that cpuset is listed in /proc/cgroups (see below) and I am not sure what else I have to configure here.

This sounds like the kernel problem from 6cbd9fc. If you force runexec to require cpuset, for example with --cores 0, you should see a better error message. Maybe I should also improve the warning that is printed if cpuset is missing but not required.

Yes, you're right. I'd prefer the workaround to be printed even for the warning, but that's a matter of taste.

I thought about that more now, and decided to completely remove the warning about missing cpuset if it is not required. If there is a core limit, we still print the error message with the workaround, and benchexec.check_cgroups now does it as well.
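For readers following along, the availability check under discussion can be sketched in shell (this is an illustration, not BenchExec's actual code): on a pure cgroup v2 system, a controller is usable only if it is listed in the cgroup.controllers file of the process's own cgroup.

```shell
#!/bin/sh
# Sketch: check whether the cpuset controller is available to the current
# process's cgroup. Assumes cgroup v2 is mounted at /sys/fs/cgroup.
cgroup_path=$(awk -F: '$1 == 0 { print $3 }' /proc/self/cgroup)
controllers_file="/sys/fs/cgroup${cgroup_path}/cgroup.controllers"
if [ -r "$controllers_file" ] && grep -qw cpuset "$controllers_file"; then
  echo "cpuset available"
else
  echo "cpuset missing; core limits (--cores) would fail"
fi
```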

@PhilippWendler PhilippWendler merged commit 6942db1 into main Oct 20, 2023
3 checks passed
@PhilippWendler PhilippWendler deleted the cgroupsv2 branch October 20, 2023 09:39
@PhilippWendler
Member

Also, I had to run a few commands (see below) in alternation with runexec. It would be helpful to print them all at once, but I don't know whether that's possible:

echo +cpuset | sudo tee /sys/fs/cgroup/user.slice/cgroup.subtree_control
echo +cpuset | sudo tee /sys/fs/cgroup/user.slice/user-1000.slice/cgroup.subtree_control
echo +cpuset | sudo tee /sys/fs/cgroup/user.slice/user-1000.slice/user@1000.service/cgroup.subtree_control
echo +cpuset | sudo tee /sys/fs/cgroup/user.slice/user-1000.slice/user@1000.service/benchexec.slice/cgroup.subtree_control

Interesting. In my tests I think that systemd did that automatically once the initial command was run. But as long as BenchExec at least prints the next step each time it is run, I guess I will keep it that way because detecting what exactly to run upfront could be difficult.

This mystery is now also solved. I had previously configured systemd on my machine with Delegate=yes for user@.service. This takes care of the first three commands of the above, which are actually independent of the discussed kernel problem (meaning that even on Linux 6.6 with the fix, one has to configure systemd). And the systemd configuration is strongly preferred over running the first three of the above commands, because the config also takes care of the other cgroups, not just cpuset, and it is a one-time thing instead of after every reboot.

I have now updated the documentation and our Ubuntu package with this systemd configuration: 5b5bc49
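For reference, the systemd configuration described above is a drop-in for user@.service along these lines (the file name is illustrative; the linked documentation and commit have the authoritative version):

```ini
# /etc/systemd/system/user@.service.d/delegate.conf (illustrative path)
[Service]
# Let systemd delegate cgroup management below user@UID.service,
# so unprivileged processes like BenchExec can create child cgroups.
Delegate=yes
```

After adding the drop-in, run `sudo systemctl daemon-reload` for it to take effect.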

@lorenzleutgeb You might want to add this to the Nix package as well.

@lorenzleutgeb
Contributor

lorenzleutgeb commented Oct 20, 2023

@lorenzleutgeb You might want to add this to the Nix package as well.

Instead of modifying user@ (which applies to all users), I believe it should be possible to only modify user@UID for the UIDs of the users wanting to use BenchExec, right? I'd like to minimize the impact of BenchExec on the overall system, that's why I would prefer localizing the change for individual users. I'll test in a VM when I have time.

@PhilippWendler
Member

Yes, this is correct. I decided to not do this in order to not overcomplicate things, because for example when installing a Debian package it is not easy to answer the question "which users should be allowed to do this" and we cannot even use a regular group or a single file, but have to do this individually for every single allowed user.

Furthermore, delegating cgroups v2 should be safe in principle (even Podman considers enabling this by default for containers). Also cf. systemd/systemd@b8df7f8.

But if Nix allows you to do this more easily, you can do so.

@lorenzleutgeb
Contributor

lorenzleutgeb commented Oct 20, 2023

But if Nix allows you to do this more easily, you can do so.

Yes. The following expression, with the free variable username (and config, which refers back to the system being configured via a fixed-point construction), will set Delegate=yes appropriately. The module I wrote for BenchExec can then take a list of usernames or UIDs to be configured.

systemd.slices."user-${builtins.toString config.users.users.${username}.uid}".serviceConfig.Delegate = "yes";

Successfully merging this pull request may close these issues.

Use systemd API to create cgroup
Support for cgroup v2 - testers needed