
Containers are not secure! I am kidding! They are pretty secure, but they can be dangerous, and one reason is that the container runtime usually requires root privileges to run.

Why?

Okay, okay, let’s take a look at this:

sudo docker run -v /:/hostfs ubuntu rm -rf /hostfs

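Did you figure out why running Docker (the container runtime) as root is dangerous? The command bind-mounts the host’s root filesystem into the container and then deletes it, and nothing stops it because the Docker daemon runs as root.

Running your container runtime as root is the most dangerous thing you can do, because an attacker who escapes the container barrier (through a container runtime vulnerability) becomes root on the host, and then it is game over.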

Then why don’t we do this?

sudo usermod -aG docker devc
docker run ...

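This does nothing but give our user devc root powers whenever it runs docker: members of the docker group can write to the daemon’s socket, and the daemon behind that socket still runs as root.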

srw-rw----  1 root docker      145 May 19 08:14  /var/run/docker.sock
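To make the point concrete, here is a minimal sketch that checks whether the current user is in the docker group, which is what grants write access to the socket shown above. The helper name `in_group` is hypothetical and only used for illustration.

```shell
#!/bin/sh
# in_group GROUP "GROUP LIST"
# Succeeds when GROUP appears in the space-separated GROUP LIST.
in_group() {
    case " $2 " in
        *" $1 "*) return 0 ;;
        *)        return 1 ;;
    esac
}

# id -Gn prints the names of all groups the current user belongs to.
if in_group docker "$(id -Gn 2>/dev/null)"; then
    echo "this user can reach the Docker socket, so it is effectively root"
else
    echo "this user is not in the docker group"
fi
```

Note that a freshly added group membership usually only shows up after logging in again.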

And this?

sudo docker run --user

Although it is safer, this is also not a good idea, since it still requires the Docker daemon to run as root. A privilege escalation inside the container plus a container escape, and here we go: game over again!

Here is a summary of how running the container runtime as root is dangerous, and how rootless saves the day.

summary

Problem

Container runtimes use Linux namespaces, Cgroups, and other Linux technologies to run a container. Most of these, especially namespaces and Cgroups, require root privileges. Try this as a regular user:

unshare --fork --pid ... # PID namespace
unshare --mount ... # mount namespace

You’ll get an “Operation not permitted” error.

Solution

User Namespace 👼 !!

Fortunately, user namespaces can be created without root privileges. Try this:

unshare --map-root-user --user sh -c whoami

No permission error!

User namespaces are the key technology behind the promise of rootless containers. Both rootless Docker and rootless Podman rely mainly on them.

User namespace

User namespaces are a way to map a range of user IDs (UIDs) from the host into the container, e.g. 10000-11000 (host) –> 0-1000 (container). So root (UID 0) inside the container is mapped to UID 10000 on the host, and so forth. [0]
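The arithmetic behind such a mapping can be worked through by hand. The following is a minimal sketch, assuming the kernel's `/proc/<pid>/uid_map` line format (inside start, outside start, length) and the 10000-11000 example range above. `map_uid` is a hypothetical helper, not part of any tool.

```shell
#!/bin/sh
# map_uid CONTAINER_UID INSIDE_START OUTSIDE_START LENGTH
# Prints the host UID that CONTAINER_UID maps to, or fails if it is unmapped.
map_uid() {
    uid=$1 inside=$2 outside=$3 length=$4
    if [ "$uid" -lt "$inside" ] || [ "$uid" -ge $((inside + length)) ]; then
        echo "unmapped" >&2
        return 1
    fi
    echo $((outside + uid - inside))
}

# The example from the text: container UIDs 0-1000 backed by host UIDs 10000-11000.
map_uid 0 0 10000 1001      # root inside the container
map_uid 1000 0 10000 1001   # last UID in the range

# The mapping of the current process, if the kernel exposes it.
if [ -r /proc/self/uid_map ]; then
    cat /proc/self/uid_map
fi
```

A UID outside the mapped range has no counterpart on the host, which is why the helper fails instead of guessing.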

docker rootless

Most container engines (Docker, Podman, LXC, …) rely on user namespaces to provide rootless containers. [1]

Docker, for example, runs dockerd itself inside a user namespace [2], so it has the impression that it is root, while it only has a somewhat fake root that has full capabilities inside the userns only.

The recipe: Podman’s case

The first step Podman takes when we run podman run is to create a user namespace if one does not already exist [3]. It then creates the other namespaces inside it using the new root (which has full capabilities inside the userns).

We said before that we have full root capabilities inside the user namespace. What does this mean?
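As an addition beyond what the text above states: on typical setups the extra UIDs that rootless Podman maps into this user namespace come from the user's entry in /etc/subuid (format `name:start:count`). The sketch below looks that entry up. `subid_range` is a hypothetical helper name.

```shell
#!/bin/sh
# subid_range USER [FILE]
# Prints "START COUNT" for USER's first entry in a subuid-style file
# (format: name:start:count), or fails if there is no such entry.
subid_range() {
    awk -F: -v u="$1" '
        $1 == u { print $2, $3; found = 1; exit }
        END     { exit !found }
    ' "${2:-/etc/subuid}"
}

user=$(id -un 2>/dev/null || echo unknown)
if range=$(subid_range "$user" 2>/dev/null); then
    echo "subordinate UIDs for $user (start count): $range"
else
    echo "no subordinate UID range configured for $user"
fi
```

If Podman is installed, `podman unshare cat /proc/self/uid_map` should show the resulting mapping from inside the user namespace.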

Linux Capabilities

Linux capabilities are a kernel feature that provides fine-grained control over superuser permissions.

So now, instead of running a process with full root privileges (using sudo or setuid/setgid…), we give it only some of root’s capabilities, e.g. CAP_SETUID or CAP_KILL.

You can see the entire list in the capabilities man page. [4]
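To see which capabilities a process actually holds, the kernel exposes a hexadecimal bitmask in /proc/self/status. Here is a minimal sketch that tests single bits of that mask, assuming the bit numbers from <linux/capability.h> (CAP_KILL is bit 5, CAP_SETUID is bit 7). `has_cap` is a hypothetical helper.

```shell
#!/bin/sh
# has_cap HEX_MASK BIT
# Succeeds when bit number BIT is set in the hexadecimal capability mask.
has_cap() {
    [ $(( (0x$1 >> $2) & 1 )) -eq 1 ]
}

# Bit numbers as defined in <linux/capability.h>.
CAP_KILL=5
CAP_SETUID=7

# Effective capability mask as reported by the kernel (0 if unavailable).
mask=$(awk '/^CapEff:/ {print $2}' /proc/self/status 2>/dev/null)
mask=${mask:-0}

if has_cap "$mask" "$CAP_KILL"; then echo "CAP_KILL: yes"; else echo "CAP_KILL: no"; fi
if has_cap "$mask" "$CAP_SETUID"; then echo "CAP_SETUID: yes"; else echo "CAP_SETUID: no"; fi
```

Run as a regular user, both checks should normally print "no". Run through `unshare --map-root-user --user sh`, they should print "yes", since the new root holds the capabilities inside its own user namespace.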

Networking

Although the container (process) has full root capabilities inside the userns, it can only perform operations (that require privileges) on resources owned by the user namespace. [5]

In order to connect to the internet, the container must go through the default (host) netns, and this is usually done using virtual Ethernet device (veth) pairs, which enable communication between namespaces. [6]

veth

As for Podman, the new root can create network namespaces, but it cannot set up a veth pair with one end in the host netns, because it has no privileges there. And thus there is no internet connection for our container without real root.

netns not internet

To solve this problem, Podman uses slirp4netns [7], which crosses the netns boundary by creating a TAP interface inside the netns and using it to read network packets and forward them to and from the internet.

slirp4netns

Storage

Container storage relies mainly on copy-on-write (COW) filesystems, since copying the entire image every time we create a container would make container technology almost unusable.

The problem is that almost every COW filesystem (namely OverlayFS and Btrfs) doesn’t support user namespaces, i.e. it requires true root to mount and manage.

In this case Podman uses fuse-overlayfs (an overlay filesystem built on FUSE, Filesystem in Userspace) to mount and interact with the image/container filesystem.

Resource limits (Cgroups)

With Cgroups V1 there is no safe way to use Cgroups without true root privileges. Fortunately, Cgroups V2 provides a way to delegate a per-user Cgroups subtree to an unprivileged user.

Using Cgroups V2 with Podman requires crun as the container runtime (since runc doesn’t support Cgroups V2 at the time of writing). [8]
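To find out which Cgroups version a host uses, one common check is the filesystem type mounted on /sys/fs/cgroup. This sketch assumes GNU stat and the usual convention that `cgroup2fs` means a unified V2 hierarchy while `tmpfs` means V1 (or a hybrid setup). `cgroup_version` is a hypothetical helper.

```shell
#!/bin/sh
# cgroup_version FSTYPE
# Translates the filesystem type mounted on /sys/fs/cgroup into a Cgroups version.
cgroup_version() {
    case "$1" in
        cgroup2fs) echo "v2" ;;
        tmpfs)     echo "v1" ;;
        *)         echo "unknown" ;;
    esac
}

# stat -f queries the filesystem; %T prints its type in human-readable form.
fstype=$(stat -fc %T /sys/fs/cgroup 2>/dev/null || echo none)
echo "cgroups: $(cgroup_version "$fstype")"
```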

[0] Experimenting with Rootless Docker
[1] The Route To Rootless Containers - Ed King, Pivotal & Julz Friedman, IBM (Any Skill Level)
[2] Hardening Docker daemon with Rootless mode
[3] What happens behind the scenes of a rootless Podman container?
[4] Linux Capabilities manpage
[5] Overview and Recent Developments: Namespaces and Capabilities - Christian Brauner, Canonical Ltd
[6] Basics of Container Networking with Linux
[7] slirp4netns — How does it work
[8] The current adoption status of cgroup v2 in containers
[-] Rootless containers with Podman and fuse-overlayfs
[-] Introducing Linux Network Namespaces
[-] Rootless containers