A Bucket Of Sand

Recently I wanted to deploy a service of marginal trustworthiness, namely synapse, the reference server application for the Matrix protocol. More precisely, I wanted to deploy synapse on boxes I already had, for economic reasons. Since I don’t consider it fully trustworthy (it being network-facing and monolithic and all), some safeguards needed taking. Among other things I wanted to be sure that synapse could not interfere with other services on the same box in ways it could not have done had it lived on a separate machine.

“Great!”, I hear the internet say, “Just dockerize it and everything will be fine!”. If only it were that simple.

Containers, Containers

Docker does work for exactly this purpose, and it works rather well, too. Unfortunately it didn’t quite do everything I wanted it to do in this case. Synapse, being an application that is nowhere near complete, still in active development, and open to the entire internet, cannot (and should not) be considered “secure”. Regular updates are absolutely essential for secure, or at least secure enough, operation of any network-facing service. And regular updates are the first thing docker let me down on. Unless you run your containers from the latest tag on the hub and regularly pull and restart them, you will over time end up with out-of-date images.

Automatic updates of this kind are fine for many things, but they do take control out of your hands. They also require trust in the docker hub, which is another thing I decidedly do not have. (Building images from scratch and deploying those after testing is fine, but building images without the hub is a serious pain with docker. Rkt is a bit easier, but since IPv6 is something rkt has, at best, only heard of, I can’t consider using it.)

What is a container?

With docker gone, rkt not even in the competition, and running it in a VM out of the question for a variety of reasons, another solution had to be found. Docker, for all its faults, does very well on the containerization side of my requirements. So the question that begs to be asked is: what is a container anyway?

Most importantly a container isolates its contents from the host the container is running on. For the sake of argument, let us adopt a non-interference style of definition for what a container is. Docker guarantees a number of interferences the container cannot produce:

  • it cannot write to the filesystem of the host, unless this is explicitly allowed
  • it can’t even read the filesystem of the host. It has its own, entirely separate filesystem to work on.
  • it cannot access the host via localhost networking
  • processes on the host remain hidden from the container
  • shared memory, inter-process communication and other kernel resources are not shared between container and host
  • and a number of other things, mostly resource limits and restrictions on the capabilities of root inside the container.

Ideally we want to keep all of these different isolations, but sadly we can’t. Providing all of them requires a full container solution like docker, rkt, or any of the others. Or does it? Let’s check, one by one.

Protecting the host filesystem

Mount namespaces can be used to give a process or a set of processes an independent view of the filesystem hierarchy. This allows hiding host directories from the container, hiding container directories from the host, even separating host and container filesystems completely. Linux allows us not only to mount directories differently in different views of the filesystem, but individual files as well. Combining these two we can build a completely locked-down filesystem for a container in which, for example, /tmp is not shared with the host, and /etc is empty except for the few files a container needs from the host. We cannot completely isolate host and container without building a completely fresh root filesystem for the container, but we can make a lot of the host filesystem inaccessible or read-only.
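
The mechanism itself is easy to play with from a root shell. A throwaway example, using nothing but util-linux:

# in a new mount namespace, mount an empty tmpfs over /etc;
# the view of /etc on the host is not affected (unshare makes
# mounts private in the new namespace by default)
unshare --mount sh -c 'mount -t tmpfs none /etc && ls /etc'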

All of this can be easily done with systemd. As listed in systemd.exec(5), there are a sizable number of options for manipulating the filesystem hierarchy, most of which are very useful for isolation.

  • ProtectSystem=strict mounts the entire filesystem hierarchy read-only, except for a few API filesystems. Some of these can be made read-only with ProtectKernelTunables, ProtectKernelModules, and ProtectControlGroups.
  • ProtectHome=true makes home directories (/home, /root) empty to the service. It can also make them read-only instead, or mount a tmpfs instance there.
  • PrivateTmp=true creates private instances of the system temporary directories (/tmp, /var/tmp), making communication via tempfiles impossible.
  • PrivateDevices=true hides almost all hardware devices from the service.
  • TemporaryFileSystem=/dev/shm hides all POSIX shared memory. Adding /etc to the list also hides all system configuration.
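
Taken together, these options might end up in a drop-in along these lines. This is only a sketch (the file name fs.conf is arbitrary), and with nothing but the lock-down options the service cannot actually start yet; the paths it needs read or write access to still have to be re-added, which the summary at the end comes back to.

# /etc/systemd/system/synapse.service.d/fs.conf
# a lock-down sketch; nothing is writable yet
[Service]
ProtectSystem = strict
ProtectHome = true
ProtectKernelTunables = true
ProtectKernelModules = true
ProtectControlGroups = true
PrivateTmp = true
PrivateDevices = true
TemporaryFileSystem = /dev/shm /etc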

Blocking localhost networking

Two solutions exist to protect localhost from a rogue service: the one that uses network namespaces, and the one that doesn’t. Network namespaces are the analog of mount namespaces for the network stack: different network namespaces can see a completely different picture of what “network” even means. Among other things this leads to a completely different view of what “localhost” means, since loopback devices (and the localhost addresses with them) are not shared between namespaces. This is the solution commonly used by containers unless told otherwise: create a network namespace, add a virtual ethernet link from the host to the container, and use network address translation on the host.
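
For comparison, this is roughly what that looks like when done by hand with iproute2; the names and addresses here are made up, and the translation rules are left out entirely:

# create a namespace and a virtual ethernet pair, one end on each side
ip netns add sandbox
ip link add veth-host type veth peer name veth-sandbox
ip link set veth-sandbox netns sandbox
# address both ends and bring them up
ip addr add fd00:dead:beef::1/64 dev veth-host
ip link set veth-host up
ip -n sandbox addr add fd00:dead:beef::2/64 dev veth-sandbox
ip -n sandbox link set veth-sandbox up
ip -n sandbox link set lo up
# network address translation / forwarding on the host is still missing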

The other solution involves not network namespaces, but control groups. One very interesting feature of control groups is that the netfilter subsystem (most usefully the iptables family) can filter packets depending on which control group sent them. If we can stick our service into a control group we can thus use iptables filters to drop all packets to localhost we do not want to see.

# assuming that our service lives in cgroup service.group:

# create a new filter chain for just our service
ip6tables -N service-group-filter
# block all packets to localhost, allow all others
ip6tables -A service-group-filter -d ::1 -j REJECT
ip6tables -A service-group-filter -j ACCEPT
# hook the filter into the OUTPUT chain
ip6tables -A OUTPUT -m cgroup --path service.group -j service-group-filter

# the same can be done for legacy IP

Systemd already places each service in a control group, and can be instructed to place it in a special control group as well. (Unfortunately, systemd slice paths are ridiculously verbose.)
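
To see just how verbose: slice units nest at every dash in their name, so placing synapse.service into a slice called system-sandbox-synapse.slice puts its processes into a control group path like the one below, which is (minus the leading slash) what the iptables match later has to spell out in full.

# the control group of the service, as systemd builds it from the slice name:
#   /system.slice/system-sandbox.slice/system-sandbox-synapse.slice/synapse.service
# systemctl can show it for the running unit:
systemctl show -p ControlGroup synapse.service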

The ip6tables commands above can be run from the systemd unit on startup, for example with a drop-in:

# /etc/systemd/system/synapse.service.d/iptables.conf
[Service]
Slice = system-sandbox-synapse.slice

ExecStartPre = -+/usr/bin/ip6tables -N synapse.service-sandbox
ExecStartPre =  +/usr/bin/ip6tables -A synapse.service-sandbox -d "::1" -j REJECT
ExecStartPre =  +/usr/bin/ip6tables -A synapse.service-sandbox -j ACCEPT
ExecStartPre =  +/usr/bin/ip6tables -A OUTPUT -m cgroup --path system.slice/system-sandbox.slice/system-sandbox-synapse.slice/synapse.service -j synapse.service-sandbox
ExecStopPost = -+/usr/bin/ip6tables -D OUTPUT -m cgroup --path system.slice/system-sandbox.slice/system-sandbox-synapse.slice/synapse.service -j synapse.service-sandbox
ExecStopPost = -+/usr/bin/ip6tables -F synapse.service-sandbox
ExecStopPost = -+/usr/bin/ip6tables -X synapse.service-sandbox

# same can be done for legacy IP

Hiding all processes

PID namespaces provide an independent view of the process tree. In a PID namespace all processes see their PIDs as though they were alone on the machine, while the parent namespace sees them as normal processes in its own tree. Each PID namespace must have an init process, i.e. a process with PID 1. This init process is treated specially by the kernel, so it cannot be a part of our service.

Systemd offers nothing to make PID namespaces happen. The rest of the system does though.
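
The effect is easy to see from a root shell with the unshare utility from util-linux: in a fresh PID namespace, ps sees nothing but itself.

# run ps in a new PID namespace, with /proc remounted to match;
# the only visible process is ps itself, running as PID 1
unshare --pid --fork --mount-proc ps -ef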

Using the unshare utility a privileged process can create new namespaces, PID namespaces among them. After creating those namespaces, unshare will call any program with any arguments we provide. In order to use unshare we will need a specific set of capabilities we do not want the service to have. Luckily setpriv also exists, and unshare combined with setpriv lets us create a new PID namespace in which the service we are sandboxing cannot acquire capabilities needed to break out of the sandbox.

# /etc/systemd/system/synapse.service.d/exec.conf
[Service]
Type = simple
Group = synapse
ExecStart = \
  /usr/bin/unshare -pf --mount-proc --kill-child=SIGTERM \
    /usr/bin/setpriv --inh-caps=-all --bounding-set=-all \
      --reuid synapse \
    /usr/bin/python2.7 -m synapse.app.homeserver \
      --config-path=/etc/synapse/homeserver.yaml

The service started by unshare in this manner will be the init process of the PID namespace. As such it must not terminate, lest the namespace be torn down and the remaining service processes killed in the process.

Hiding IPC, shared memory, etc.

POSIX shared memory is already hidden by mounting /dev/shm with a new tmpfs instance using TemporaryFileSystem. SysV IPC facilities and POSIX message queues can also be hidden with IPC namespaces. Doing that is as simple as turning the -pf argument to unshare seen earlier into -ipf.
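
In the drop-in from before that is a one-character change; for completeness, the updated exec.conf:

# /etc/systemd/system/synapse.service.d/exec.conf
[Service]
Type = simple
Group = synapse
ExecStart = \
  /usr/bin/unshare -ipf --mount-proc --kill-child=SIGTERM \
    /usr/bin/setpriv --inh-caps=-all --bounding-set=-all \
      --reuid synapse \
    /usr/bin/python2.7 -m synapse.app.homeserver \
      --config-path=/etc/synapse/homeserver.yaml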

Capabilities and such

Most of the changes we have made so far can be undone by a process with enough privileges. Such privileges can come from being root, or they can be obtained from the filesystem by using the capabilities mechanism. We have already cleared some capability sets while we opened the PID namespace. There is one more set we can clear however, and a single bit that, when set, will lock our service into its sandbox forever. This is the NoNewPrivileges bit, and it does exactly what it says on the tin: once it is set, the process and all its descendants can never gain privileges they do not already have, not even by executing setuid binaries or files with capabilities attached. Ever.

We might also want to ensure that the kernel keyring is not shared between host and sandbox.

# /etc/systemd/system/synapse.service.d/caps.conf
[Service]
AmbientCapabilities =
NoNewPrivileges = true
SecureBits =
KeyringMode = private
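
Whether all of that took effect can be checked in /proc once the service is running: all capability sets should read as zeroes and the no-new-privs flag should be set. The pgrep invocation below is just one way to find the right process.

# CapInh/CapPrm/CapEff/CapBnd/CapAmb should all be 0, NoNewPrivs should be 1
grep -E 'Cap|NoNewPrivs' /proc/"$(pgrep -n -u synapse -f synapse.app.homeserver)"/status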

Putting it all together

Combining all of these, we can write a single drop-in for synapse that applies all of the changes above. So how well have we done on our list?

Pretty well, actually.

  • Synapse cannot write to the filesystem of the host at all now. We must make some paths writable, or at least readable, for synapse to even start. For system directories these can be added back with additional ReadWritePaths and ReadOnlyPaths settings; for directories we have turned into an empty tmpfs we have to use BindPaths and BindReadOnlyPaths instead, to much the same effect (see the sketch after this list).
  • Localhost networking is completely cut off. If we want to use a local proxy to handle TLS this will have to be loosened.
  • Host processes are entirely invisible to synapse.
  • IPC facilities are separated.
  • Synapse retains no capabilities. It can’t send signals to other processes, can’t modify the kernel, can’t trigger reboots, it can’t even write to a tty that might be currently in use, because it can’t see any of them.
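
For the first point, a sketch of what those path settings might look like, assuming synapse keeps its state under /var/lib/synapse and its configuration under /etc/synapse (adjust to where your installation actually keeps its files):

# /etc/systemd/system/synapse.service.d/paths.conf
# illustrative paths only
[Service]
# writable state on the otherwise read-only root
ReadWritePaths = /var/lib/synapse
# /etc is an empty tmpfs here, so bind the needed pieces back into it
BindReadOnlyPaths = /etc/synapse /etc/resolv.conf /etc/ssl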

What we have not (yet!) done is hide the parts of the host filesystem that aren’t needed for synapse to function, e.g. the package manager caches or static websites served by our web server. But we still have the good old UNIX user/group/world permissions (and maybe ACLs) that’ll take care of that.

If a web server can read it, chances are something on the internet could read it anyway. Always check your mode bits.
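
Tightening those is quick. For example, assuming again that synapse’s data lives under /var/lib/synapse and that nothing but the synapse user and group has any business there:

# make the data and configuration unreadable for everyone else
chown -R synapse:synapse /var/lib/synapse
chmod -R o-rwx /var/lib/synapse /etc/synapse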

Further reading

Most systemd options used here are documented (and rather well, too) in systemd.exec(5). Many other options that might be useful can be found in other man pages, e.g. systemd.resource-control(5) for resource limits.

Capabilities are described at length in capabilities(7), including a list of all capabilities currently implemented in Linux and how the capability sets work. The capability system has been described as “confusing”, and that’s a polite description.

Nearly exhaustive descriptions of which namespaces exist, what they do, and how they interact can be found in namespaces(7) and the pages referenced from there.

Looking through the iptables documentation, especially the list of extensions, is usually enlightening. iptables is an incredibly powerful system whose surface has barely been scratched here.