Capability handling in SingularityCE 4.0 --oci mode #1735

dtrudg (Maintainer) · 2023-06-05T10:51:27Z

SingularityCE, using the existing native runtime, handles capabilities by default as follows:

User - no capabilities in container

$ singularity exec fedora_latest.sif capsh --print
Current: =
Bounding set =
Ambient set =

User + --fakeroot - full capabilities for the fakeroot user inside the user namespace

$ singularity exec --fakeroot fedora_latest.sif capsh --print
INFO:    Converting SIF file to temporary sandbox...
Current: =eip
Bounding set =cap_chown,cap_dac_override,cap_dac_read_search,... <snipped>
Ambient set =cap_chown,cap_dac_override,cap_dac_read_search,... <snipped>

Root

Depends on singularity.conf:

# ROOT DEFAULT CAPABILITIES: [full/file/no]
# DEFAULT: full
# Define default root capability set kept during runtime
# - full: keep all capabilities (same as --keep-privs)
# - file: keep capabilities configured for root in
#         ${prefix}/etc/singularity/capability.json
# - no: no capabilities (same as --no-privs)

This differs quite a bit from OCI rootless runtimes:

podman, rootless userns as user in the container - specific set of bounding caps

$ podman run --userns=keep-id -it --rm fedora capsh --print
Current: =
Bounding set =cap_chown,cap_dac_override,cap_fowner,cap_fsetid,cap_kill,cap_setgid,cap_setuid,cap_setpcap,cap_net_bind_service,cap_sys_chroot,cap_setfcap
Ambient set =

podman, rootless userns as root in the container - specific set of current/bounding caps

$ podman run -it --rm fedora capsh --print
Current: cap_chown,cap_dac_override,cap_fowner,cap_fsetid,cap_kill,cap_setgid,cap_setuid,cap_setpcap,cap_net_bind_service,cap_sys_chroot,cap_setfcap=ep
Bounding set =cap_chown,cap_dac_override,cap_fowner,cap_fsetid,cap_kill,cap_setgid,cap_setuid,cap_setpcap,cap_net_bind_service,cap_sys_chroot,cap_setfcap
Ambient set =

podman, rootfull - specific set of current/bounding caps

$ sudo podman run -it --rm fedora capsh --print
Current: cap_chown,cap_dac_override,cap_fowner,cap_fsetid,cap_kill,cap_setgid,cap_setuid,cap_setpcap,cap_net_bind_service,cap_sys_chroot,cap_setfcap=ep
Bounding set =cap_chown,cap_dac_override,cap_fowner,cap_fsetid,cap_kill,cap_setgid,cap_setuid,cap_setpcap,cap_net_bind_service,cap_sys_chroot,cap_setfcap
Ambient set =
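Where capsh is not installed in an image, the same sets can be read as hex masks from the CapEff, CapBnd and CapAmb fields of /proc/self/status. The following is a minimal sketch of decoding such a mask in POSIX shell; `decode_caps` and `CAP_NAMES` are hypothetical helper names (not part of SingularityCE or podman), and the name table assumes the bit order defined in linux/capability.h.

```shell
#!/bin/sh
# Capability names by bit number (bit 0 = cap_chown ... bit 40 = cap_checkpoint_restore),
# assumed to follow the order in linux/capability.h.
CAP_NAMES="cap_chown cap_dac_override cap_dac_read_search cap_fowner cap_fsetid
cap_kill cap_setgid cap_setuid cap_setpcap cap_linux_immutable cap_net_bind_service
cap_net_broadcast cap_net_admin cap_net_raw cap_ipc_lock cap_ipc_owner cap_sys_module
cap_sys_rawio cap_sys_chroot cap_sys_ptrace cap_sys_pacct cap_sys_admin cap_sys_boot
cap_sys_nice cap_sys_resource cap_sys_time cap_sys_tty_config cap_mknod cap_lease
cap_audit_write cap_audit_control cap_setfcap cap_mac_override cap_mac_admin cap_syslog
cap_wake_alarm cap_block_suspend cap_audit_read cap_perfmon cap_bpf cap_checkpoint_restore"

# decode_caps MASK: print the comma-separated capability names set in a hex mask.
# MASK may be given with or without a leading 0x (the /proc fields omit it).
decode_caps() {
    [ -n "$1" ] || return 0
    mask=$1
    case $mask in
        0x*|0X*) ;;
        *) mask="0x$mask" ;;
    esac
    bit=0
    out=""
    for name in $CAP_NAMES; do
        if [ $(( ($mask >> $bit) & 1 )) -eq 1 ]; then
            out="${out:+$out,}$name"
        fi
        bit=$((bit + 1))
    done
    printf '%s\n' "$out"
}

# Decode the sets of the current process, if /proc is available.
if [ -r /proc/self/status ]; then
    for field in CapEff CapBnd CapAmb; do
        value=$(awk -v f="$field:" '$1 == f { print $2 }' /proc/self/status)
        printf '%s: %s\n' "$field" "$(decode_caps "$value")"
    done
fi
```

Under these assumptions, `decode_caps 00000000800405fb` prints the same eleven capabilities that appear in the podman bounding sets above, and an all-zero mask prints an empty line, matching the empty sets of an unprivileged native-mode container.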

The question, then, is what behaviour makes sense for SingularityCE's new --oci mode?

At present (#1587) no capability sets are wired up. The current, bounding, and ambient sets are always empty in --oci mode.

I believe there are a number of options:

1. Following SingularityCE native behaviour

The capabilities for a non-root user are more restricted - the bounding set stays empty, vs holding a limited set of caps.
In a --fakeroot container, all capabilities are available in the user namespace, vs a limited set of caps.
In a true rootfull container, by default all capabilities are available in the container, vs a limited set of caps. There is an important implication here, w.r.t. Singularity not being intended to secure rootfull containers against container escape etc.
In a true rootfull container, default behaviour can be changed with singularity.conf default root capabilities.
Ambient caps always follow the bounding set.

The consistency gained from following SingularityCE's existing behaviour would be beneficial in a world where we can expect mixed use of the native and oci runtimes for some time.

The main argument against is that rootfull containers are more dangerous, and we have to maintain a clear expectation that a rootfull container in SingularityCE's --oci mode is not intended to give any kind of security boundary. I don't think we have the resources to focus on making --oci mode safe in rootfull operation anyway. There are many other aspects of --oci mode which would have to be addressed, including application of seccomp filtering by default, a full review of mount and image handling, etc.

--fakeroot containers run as a user would be more powerful than when following OCI caps. There is an argument that the large set of fakeroot caps here opens up additional risk, albeit limited to what the user can do anyway in a host shell. It's fairly equivalent to running e.g. rootless podman with --privileged.

2. Follow podman/OCI style in --oci mode only

The bounding capability set for a non-root user, as themselves in the container, is expanded to a specific set of caps.
The capability set for a non-root user in --fakeroot mode is reduced.
The capability set for a rootfull container is reduced.
Ambient capabilities are eliminated.

This option would provide consistency with OCI rootless runtimes. It is likely that some workflows in a user namespace would be prevented, mostly container-in-container flows where --privileged is required with rootless Docker/podman etc. We'd have to provide a way to open things up. Note that we have a --keep-privs flag in native mode at the moment, but it is currently effective only in certain rootfull circumstances.

In rootfull --oci containers, operations would be more restricted. However, as above, rootfull containers would not be 'safe' without a huge amount of additional work that I don't believe we can accomplish for 4.0.
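To see concretely which capabilities a container would lose under this option, the "Bounding set" lines from two `capsh --print` runs can be compared. The following is a rough sketch in POSIX shell; `caps_only_in_first` is a hypothetical helper, and because the --fakeroot list earlier in this post is truncated ("<snipped>"), the example compares only its visible prefix.

```shell
#!/bin/sh
# caps_only_in_first FIRST SECOND
# Print the capabilities present in the comma-separated list FIRST
# but absent from the comma-separated list SECOND.
caps_only_in_first() {
    first=$1
    second=$2
    result=""
    old_ifs=$IFS
    IFS=,
    for cap in $first; do
        case ",$second," in
            *",$cap,"*) ;;                          # also in SECOND: skip
            *) result="${result:+$result,}$cap" ;;  # only in FIRST: keep
        esac
    done
    IFS=$old_ifs
    printf '%s\n' "$result"
}

# Visible prefix of the native --fakeroot bounding set shown earlier (truncated in the post).
fakeroot_prefix="cap_chown,cap_dac_override,cap_dac_read_search"
# Bounding set reported by the rootless podman runs shown earlier.
oci_default="cap_chown,cap_dac_override,cap_fowner,cap_fsetid,cap_kill,cap_setgid,cap_setuid,cap_setpcap,cap_net_bind_service,cap_sys_chroot,cap_setfcap"

caps_only_in_first "$fakeroot_prefix" "$oci_default"
# prints: cap_dac_read_search
```

Feeding the complete bounding set from a real --fakeroot run as the first argument would list everything that option 2 removes from that mode.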

3. Do something in between for --oci mode only

I'm not sure this would make sense. Opening up the --oci mode behaviour a little bit vs podman could ease some workflows, but we are then not consistent with the Singularity native mode, or stock OCI rootless runtimes.

4. Change native mode behaviour, --oci based on podman/OCI rootless runtimes

We could implement capabilities in --oci mode following the podman / rootless OCI patterns, and then move native mode closer to the same behaviour. This would bring internal consistency for 4.0, but make it inconsistent with prior releases.

I don't think we want to do this. In particular, any container-in-container workflow using --fakeroot in the outer layer could be affected.

tri-adam (Maintainer) · 2023-06-20T13:58:30Z

With an admittedly limited knowledge of the full implications, on the surface of it I would guess that following the path of the other rootless OCI runtimes would make sense. While it's less consistent with SingularityCE's existing behaviour, it's more consistent with the rest of the OCI space.

In the event that someone wants the traditional CE behaviour with an OCI image, could they simply run the OCI image in non-OCI mode with 4.0? If so, that would seemingly cover both bases?


dtrudg (Maintainer, Author) · 2023-06-20T14:27:07Z

Thanks @tri-adam

> With an admittedly limited knowledge of the full implications, on the surface of it I would guess that following the path of the other rootless OCI runtimes would make sense. While it's less consistent with SingularityCE's existing behaviour, it's more consistent with the rest of the OCI space.

I think I'm leaning towards this, slightly, too.

> In the event that someone wants the traditional CE behaviour with an OCI image, could they simply run the OCI image in non-OCI mode with 4.0? If so, that would seemingly cover both bases?

Yes, the non-OCI runtime certainly isn't disappearing in 4.0, so the old behaviour will still be easily available.


dtrudg (Maintainer, Author) · 2023-06-21T15:57:13Z

As this has now been implemented (following the OCI runtime approach) in a series of PRs, I'll close the thread. Thanks again.

