
Sandbox Isolation

Every community MCP server runs inside a bubblewrap (bwrap) sandbox spawned directly by the gateway — no Docker socket involved. The sandbox gives the server fresh kernel namespaces, a read-only view of exactly the files it was granted, a cleared environment, no capabilities, uid 65534 (nobody), and a seccomp-BPF filter that kills the known escape vectors.

One instance is spawned per (server × user/profile) — shared code, never shared process.

From manifest to running sandbox

  1. Pull by digest. internal/oci pulls the manifest's digest-pinned OCI image daemonlessly (pure go-containerregistry), re-derives the digest of what actually arrived, and rejects mismatches. Servers are packaged as static self-contained binaries: the puller streams the flattened rootfs (whiteouts applied) and extracts exactly one file — the entrypoint — to <GIG_DATA_DIR>/servers/<name>@<version>/server with mode 0555.
  2. Build the bwrap command. internal/sandbox constructs the argv. The extracted binary is bind-mounted read-only at /app/server inside the sandbox, and the proxy's CA certificate is bind-mounted read-only at /etc/gigmcp-ca.pem.
  3. Spawn and wire networking. The gateway starts bwrap, reads the child PID from --info-fd, writes the uid/gid maps, unblocks the child via --userns-block-fd, creates a veth pair, and moves the peer end into the child's network namespace by PID (LinkSetNsPid — netlink, no ip CLI, no named-netns bind-mount that would need SYS_ADMIN).
  4. Bootstrap drops privileges and execs. Inside the sandbox, the trusted cmd/bootstrap init configures the network, drops everything, installs the seccomp filter, and execs the untrusted server (details below).
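
The digest check in step 1 can be illustrated with a short Go sketch. It assumes a sha256 digest in the usual `sha256:<hex>` form and operates on bytes already in memory; the function name verifyDigest is hypothetical and this is not the code in internal/oci.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"strings"
)

// verifyDigest re-derives the SHA-256 of the bytes that actually arrived and
// compares it with the pinned digest ("sha256:<hex>"). Hypothetical helper.
func verifyDigest(received []byte, pinned string) error {
	algo, want, ok := strings.Cut(pinned, ":")
	if !ok || algo != "sha256" {
		return fmt.Errorf("unsupported digest %q", pinned)
	}
	sum := sha256.Sum256(received)
	got := hex.EncodeToString(sum[:])
	if got != want {
		return fmt.Errorf("digest mismatch: pinned sha256:%s, got sha256:%s", want, got)
	}
	return nil
}

func main() {
	manifest := []byte(`{"schemaVersion":2}`)
	sum := sha256.Sum256(manifest)
	pinned := "sha256:" + hex.EncodeToString(sum[:])

	fmt.Println(verifyDigest(manifest, pinned) == nil)           // true
	fmt.Println(verifyDigest([]byte("tampered"), pinned) == nil) // false
}
```

The point of re-deriving the digest locally is that the registry's own claim about what it served is never trusted.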

During extraction, only the entrypoint file is taken from the image rootfs; the sandbox does not get the rest of the image's filesystem. Full rootfs sandboxing (needed for interpreted node/python servers) is a planned gateway extension; see Builders.

bwrap flags

For an egress-enabled sandbox the gateway passes (from internal/sandbox/sandbox.go):

bwrap \
--die-with-parent \ # no orphaned servers if the gateway dies
--new-session \ # blocks TIOCSTI terminal injection
--unshare-user --unshare-net --unshare-pid \
--unshare-ipc --unshare-uts --unshare-cgroup \
--info-fd 5 \ # gateway reads the child PID from here
--userns-block-fd 6 \ # child blocks until the gateway writes uid/gid maps
--clearenv \
--setenv PATH /usr/bin:/bin \
--tmpfs /etc \
--setenv GIG_PLACEHOLDER <sentinel> \ # extra env emitted in sorted key order
--setenv HTTPS_PROXY http://<proxy-ip>:8081 \
--setenv NODE_EXTRA_CA_CERTS /etc/gigmcp-ca.pem \
--setenv SSL_CERT_FILE /etc/gigmcp-ca.pem \
--ro-bind <extracted-binary> /app/server \
--ro-bind <ca-cert> /etc/gigmcp-ca.pem \
--proc /proc \ # fresh procfs: sandbox sees only its own PID namespace
--dev /dev \
--tmpfs /tmp \
--chdir / \
--ro-bind /usr/local/bin/bootstrap /usr/local/bin/bootstrap \ # after --tmpfs /tmp so a /tmp-resident path would still be visible
-- /usr/local/bin/bootstrap <cidr> <proxy-ip> <veth> -- /app/server
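
bwrap writes a small JSON document to the descriptor passed as --info-fd, and the child PID is in its `child-pid` field. The following Go sketch shows one way the gateway side could decode it; the type and function names are hypothetical, and the real gateway would read the bytes from the pipe behind fd 5 rather than from a literal.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
)

// bwrapInfo mirrors the one field this sketch needs from bwrap's --info-fd
// output. Unknown fields in the JSON are ignored by encoding/json.
type bwrapInfo struct {
	ChildPID int `json:"child-pid"`
}

// parseChildPID extracts the sandbox PID from the bytes read off the info pipe.
func parseChildPID(data []byte) (int, error) {
	var info bwrapInfo
	if err := json.Unmarshal(data, &info); err != nil {
		return 0, fmt.Errorf("decode bwrap info: %w", err)
	}
	if info.ChildPID <= 0 {
		return 0, errors.New("bwrap info has no child-pid")
	}
	return info.ChildPID, nil
}

func main() {
	pid, err := parseChildPID([]byte(`{"child-pid": 4242}`))
	fmt.Println(pid, err) // 4242 <nil>
}
```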

No-network sandboxes use the simpler --unshare-all form with no bootstrap. Caller mounts are applied first so the fixed isolation mounts (--proc, --dev, --tmpfs /tmp) always win on path collisions — bwrap mounts are last-one-wins.
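
The two ordering rules described here (extra environment in sorted key order, caller mounts before the fixed isolation mounts) can be sketched in Go as follows. This is a simplified illustration, not the contents of internal/sandbox/sandbox.go; the bind type and buildArgs function are hypothetical, and many flags from the listing above are omitted.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// bind is a hypothetical read-only bind mount requested by the caller.
type bind struct{ Src, Dst string }

// buildArgs emits env vars in sorted key order (so the argv is deterministic)
// and places caller binds before the fixed mounts, so that with bwrap's
// last-one-wins semantics the fixed mounts prevail on any path collision.
func buildArgs(env map[string]string, callerBinds []bind) []string {
	args := []string{"--clearenv", "--setenv", "PATH", "/usr/bin:/bin"}

	keys := make([]string, 0, len(env))
	for k := range env {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	for _, k := range keys {
		args = append(args, "--setenv", k, env[k])
	}

	for _, b := range callerBinds {
		args = append(args, "--ro-bind", b.Src, b.Dst)
	}
	// Fixed isolation mounts always come after caller mounts.
	return append(args, "--proc", "/proc", "--dev", "/dev", "--tmpfs", "/tmp")
}

// indexOf returns the position of the first arg equal to s, or -1.
func indexOf(args []string, s string) int {
	for i, a := range args {
		if a == s {
			return i
		}
	}
	return -1
}

func main() {
	args := buildArgs(
		map[string]string{"SSL_CERT_FILE": "/etc/gigmcp-ca.pem", "HTTPS_PROXY": "http://proxy.invalid:8081"},
		[]bind{{Src: "/data/server", Dst: "/app/server"}},
	)
	fmt.Println(strings.Join(args, " "))
}
```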

Namespaces

| Namespace | Effect |
| --- | --- |
| user | The sandbox gets its own uid space. The gateway writes a uid/gid map covering 0 0 65535 so bootstrap can start as userns-root (uid 0, CAP_NET_ADMIN for veth setup) and then drop to uid 65534. bwrap's own --uid flag is deliberately not used: it writes a single-entry map that would make setuid(65534) impossible. |
| pid | Fresh PID namespace plus --proc /proc: the server sees only its own process tree, never the host's. (Requires systempaths=unconfined on the Docker container so /proc masks are removed.) |
| net | Fresh network namespace whose only interface is the injected veth and whose only route is default via <proxy-IP>. This is the foundation of egress enforcement. |
| mount | Private mount namespace with only the read-only binds listed above, tmpfs /etc and /tmp, fresh /dev. |
| ipc / uts / cgroup | Unshared: no host IPC objects, hostname, or cgroup visibility. |
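
To make the uid-map point concrete, here is a small Go model of a single map line as written to /proc/<pid>/uid_map. It only checks which in-namespace uids a map covers; the idMap type is hypothetical, and the single-entry example (outer uid 1000) is an assumed illustration of what a one-uid map looks like.

```go
package main

import "fmt"

// idMap models one line of /proc/<pid>/uid_map: "inside outside count".
type idMap struct{ Inside, Outside, Count uint32 }

// String renders the line in the format the kernel expects.
func (m idMap) String() string {
	return fmt.Sprintf("%d %d %d", m.Inside, m.Outside, m.Count)
}

// contains reports whether an in-namespace uid is covered by the map.
// setuid to an unmapped uid fails inside the user namespace.
func (m idMap) contains(uid uint32) bool {
	return uid >= m.Inside && uid-m.Inside < m.Count
}

// uidMapPath is where the gateway would write the map for a given child PID.
func uidMapPath(pid int) string {
	return fmt.Sprintf("/proc/%d/uid_map", pid)
}

func main() {
	full := idMap{Inside: 0, Outside: 0, Count: 65535}  // covers uids 0..65534
	single := idMap{Inside: 0, Outside: 1000, Count: 1} // covers uid 0 only

	fmt.Println(full, full.contains(65534))     // 0 0 65535 true
	fmt.Println(single, single.contains(65534)) // 0 1000 1 false
	fmt.Println(uidMapPath(4242))               // /proc/4242/uid_map
}
```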

The bootstrap privilege drop

cmd/bootstrap is the only trusted code that ever runs inside a sandbox. Its sequence (order is security-critical):

  1. Wait for the gateway to inject the veth (1 byte on fd 3).
  2. Configure the network as userns-root: bring up lo and the peer veth, assign the sandbox's /30 address, add default via <proxy-IP>.
  3. Signal net-ready to the gateway (fd 4).
  4. Open the server binary as a file descriptor before dropping privileges (it is later exec'd via execveat(AT_EMPTY_PATH), sidestepping overlay-filesystem path-permission quirks).
  5. PR_SET_NO_NEW_PRIVS, drop the entire capability bounding set (caps 0–40), clear ambient caps.
  6. setgroups/setgid/setuid to 65534 (nobody). The kernel clears permitted and effective caps on the uid transition.
  7. capset to zero every remaining capability set.
  8. Install the seccomp-BPF filter — fail-closed: any error aborts; the server is never exec'd without the filter.
  9. execveat the untrusted server.
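
For step 2, a /30 contains exactly two usable host addresses, which is enough for one sandbox end and one gateway-side end. The Go sketch below derives that pair from a CIDR. Which of the two addresses goes to the sandbox and which to the proxy is not stated here, so the sketch makes no assumption about it; hostPair is a hypothetical helper.

```go
package main

import (
	"fmt"
	"net/netip"
)

// hostPair returns the two usable host addresses of an IPv4 /30:
// network+1 and network+2 (network+3 is the broadcast address).
func hostPair(cidr string) (netip.Addr, netip.Addr, error) {
	p, err := netip.ParsePrefix(cidr)
	if err != nil {
		return netip.Addr{}, netip.Addr{}, err
	}
	if !p.Addr().Is4() || p.Bits() != 30 {
		return netip.Addr{}, netip.Addr{}, fmt.Errorf("%s is not an IPv4 /30", cidr)
	}
	first := p.Masked().Addr().Next()
	return first, first.Next(), nil
}

func main() {
	a, b, err := hostPair("10.200.0.4/30")
	fmt.Println(a, b, err) // 10.200.0.5 10.200.0.6 <nil>
}
```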

The result, verified by tests: the server process runs as uid 65534 with CapEff=0 — zero capabilities in every set.
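
A check of this kind can be expressed against the text of /proc/<pid>/status, which lists the uid and each capability set as a hex mask. The Go sketch below is one possible shape for such a check, not the project's actual test; verifyDropped is a hypothetical name and the sample status text is abbreviated.

```go
package main

import (
	"errors"
	"fmt"
	"strconv"
	"strings"
)

// verifyDropped checks a /proc/<pid>/status dump: the real uid must be 65534
// and every capability set present must be an all-zero mask.
func verifyDropped(status string) error {
	capLines := 0
	for _, line := range strings.Split(status, "\n") {
		name, value, ok := strings.Cut(line, ":")
		fields := strings.Fields(value)
		if !ok || len(fields) == 0 {
			continue
		}
		switch name {
		case "Uid":
			if fields[0] != "65534" {
				return fmt.Errorf("real uid is %s, want 65534", fields[0])
			}
		case "CapInh", "CapPrm", "CapEff", "CapBnd", "CapAmb":
			mask, err := strconv.ParseUint(fields[0], 16, 64)
			if err != nil {
				return fmt.Errorf("%s: %w", name, err)
			}
			if mask != 0 {
				return fmt.Errorf("%s is %#x, want 0", name, mask)
			}
			capLines++
		}
	}
	if capLines == 0 {
		return errors.New("no capability lines found")
	}
	return nil
}

func main() {
	sample := "Name:\tserver\nUid:\t65534\t65534\t65534\t65534\n" +
		"CapPrm:\t0000000000000000\nCapEff:\t0000000000000000\n"
	fmt.Println(verifyDropped(sample)) // <nil>
}
```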

Seccomp

Two layers, with different statuses:

Docker layer (the container running the gateway). --security-opt seccomp=unconfined is currently required, because Docker's default seccomp profile blocks the unprivileged user namespaces bwrap needs. A scoped custom Docker profile is a follow-up hardening item that has not shipped.

Application layer (inside every sandbox) — shipped. internal/seccomp, installed by bootstrap after the privilege drop and inherited across execve. It closes the nested-user-namespace escape that uid 65534 + capability-dropping alone do not prevent. The default action is ALLOW; only escape and escalation vectors are denied:

| Syscalls | Action |
| --- | --- |
| unshare, setns | KILL_PROCESS |
| clone with CLONE_NEWUSER (arg-filtered) | KILL_PROCESS |
| clone3 | ENOSYS (not kill: glibc ≥ 2.34 uses clone3 for pthread_create and falls back to plain clone on ENOSYS, which is arg-filtered above; a KILL would SIGSYS-kill Rust/C servers on thread creation) |
| mount, umount2, pivot_root, chroot, open_tree, move_mount, fsopen, mount_setattr | KILL_PROCESS |
| ptrace, process_vm_readv, process_vm_writev | KILL_PROCESS |
| keyctl, add_key, request_key | KILL_PROCESS |
| bpf, perf_event_open | KILL_PROCESS |
| kexec_load, init_module, finit_module, delete_module | KILL_PROCESS |

Verified by TestSeccompBlocksUnshare (SIGSYS on unshare) and TestSeccompAllowsNormalWork (Go goroutines + TCP under the filter, exit 0).
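
The policy in the table can be modelled as a plain decision function, which is handy for reasoning about edge cases such as thread creation. The Go sketch below is only a model of the rules, not a BPF program and not the code in internal/seccomp; the names are hypothetical, and it assumes the clone flags arrive in the first syscall argument, as on x86-64.

```go
package main

import "fmt"

type action string

const (
	allow       action = "ALLOW"
	errnoENOSYS action = "ENOSYS"
	killProcess action = "KILL_PROCESS"
)

// cloneNewUser is the Linux CLONE_NEWUSER flag value.
const cloneNewUser = 0x10000000

// denied lists the syscalls that are killed regardless of arguments.
var denied = map[string]bool{
	"unshare": true, "setns": true,
	"mount": true, "umount2": true, "pivot_root": true, "chroot": true,
	"open_tree": true, "move_mount": true, "fsopen": true, "mount_setattr": true,
	"ptrace": true, "process_vm_readv": true, "process_vm_writev": true,
	"keyctl": true, "add_key": true, "request_key": true,
	"bpf": true, "perf_event_open": true,
	"kexec_load": true, "init_module": true, "finit_module": true, "delete_module": true,
}

// decide models the filter's verdict for a syscall name and its first argument.
func decide(name string, arg0 uint64) action {
	switch {
	case name == "clone3":
		return errnoENOSYS // forces the libc fallback to plain clone
	case name == "clone" && arg0&cloneNewUser != 0:
		return killProcess
	case denied[name]:
		return killProcess
	default:
		return allow // default action is ALLOW
	}
}

func main() {
	const cloneThread = 0x00010000
	fmt.Println(decide("clone", cloneThread))  // ALLOW
	fmt.Println(decide("clone", cloneNewUser)) // KILL_PROCESS
	fmt.Println(decide("clone3", 0))           // ENOSYS
	fmt.Println(decide("unshare", 0))          // KILL_PROCESS
}
```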

note

Landlock and cgroups are designed but not yet implemented. DESIGN.md lists Landlock (filesystem rules) and cgroup resource limits as part of the hardening plan, and the sandbox package explicitly defers them. Current filesystem isolation relies on the mount namespace, read-only binds, --clearenv, the private procfs, and the uid/cap drop. A full allowlist-style seccomp profile for maximally hostile workloads is likewise deferred.

What a sandboxed server can and can't see

Can see:

  • Its own binary at /app/server (read-only).
  • The proxy CA at /etc/gigmcp-ca.pem, an otherwise-empty tmpfs /etc, writable tmpfs /tmp, minimal /dev.
  • Environment: PATH, HTTPS_PROXY, SSL_CERT_FILE, NODE_EXTRA_CA_CERTS, and GIG_PLACEHOLDER (the high-entropy sentinel — not a secret).
  • Its own process tree, its own loopback, and one veth with a /30 address.

Can't see:

  • Your real API keys, the vault, the database, or GIG_MASTER_KEY.
  • The host filesystem, host processes, host network interfaces, other sandboxes.
  • Any inherited environment (bwrap --clearenv).
  • Any route except default via <proxy-IP>.

Host-side requirements

From docker-compose.yml, the gateway container runs with:

cap_add:
- NET_ADMIN # veth creation + LinkSetNsPid; no SYS_ADMIN, never privileged
security_opt:
- seccomp=unconfined # bwrap needs unprivileged userns (see note above)
- apparmor=unconfined
- systempaths=unconfined # unmask /proc so bwrap can mount a fresh procfs

At startup the gateway also sets the iptables FORWARD policy to DROP as defense-in-depth (best-effort — route isolation remains the primary enforcement if iptables is unavailable).
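
"Best-effort" here means a failure to set the policy is logged rather than treated as fatal. A minimal Go sketch of that pattern follows, using the standard `iptables -P FORWARD DROP` invocation. The runner indirection, the function names, and the decision to shell out to the iptables binary are all assumptions made for illustration; the gateway may apply the policy differently.

```go
package main

import (
	"errors"
	"fmt"
	"log"
	"os/exec"
)

// runner abstracts command execution so the policy logic can be exercised
// without touching the host firewall.
type runner func(name string, args ...string) error

// errUnavailable simulates a host without a usable iptables.
var errUnavailable = errors.New("iptables unavailable")

// execRunner runs the command for real; it needs NET_ADMIN to succeed.
func execRunner(name string, args ...string) error {
	return exec.Command(name, args...).Run()
}

// setForwardDrop tries to set the FORWARD chain policy to DROP. Failure is
// logged and reported, never fatal: route isolation stays the primary control.
func setForwardDrop(run runner) bool {
	if err := run("iptables", "-P", "FORWARD", "DROP"); err != nil {
		log.Printf("FORWARD DROP not applied (continuing): %v", err)
		return false
	}
	return true
}

func main() {
	// Dry run: print the command instead of changing the firewall.
	dryRun := func(name string, args ...string) error {
		fmt.Println(name, args)
		return nil
	}
	fmt.Println(setForwardDrop(dryRun))
	// A real gateway would pass execRunner here instead of dryRun.
}
```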

Next: how the network namespace becomes an unforgeable identity in the Egress Proxy.