Sandbox Isolation
Every community MCP server runs inside a bubblewrap (bwrap) sandbox spawned directly by the gateway — no Docker socket involved. The sandbox gives the server fresh kernel namespaces, a read-only view of exactly the files it was granted, a cleared environment, no capabilities, uid 65534 (nobody), and a seccomp-BPF filter that kills the known escape vectors.
One instance is spawned per (server × user/profile) — shared code, never shared process.
From manifest to running sandbox
- Pull by digest. internal/oci pulls the manifest's digest-pinned OCI image daemonlessly (pure go-containerregistry), re-derives the digest of what actually arrived, and rejects mismatches. Servers are packaged as static self-contained binaries: the puller streams the flattened rootfs (whiteouts applied) and extracts exactly one file, the entrypoint, to <GIG_DATA_DIR>/servers/<name>@<version>/server with mode 0555.
- Build the bwrap command. internal/sandbox constructs the argv. The extracted binary is bind-mounted read-only at /app/server inside the sandbox; the proxy's CA certificate at /etc/gigmcp-ca.pem.
- Spawn and wire networking. The gateway starts bwrap, reads the child PID from --info-fd, writes the uid/gid maps, unblocks the child via --userns-block-fd, creates a veth pair, and moves the peer end into the child's network namespace by PID (LinkSetNsPid via netlink, no ip CLI, no named-netns bind-mount that would need SYS_ADMIN).
- Bootstrap drops privileges and execs. Inside the sandbox, the trusted cmd/bootstrap init configures the network, drops everything, installs the seccomp filter, and execs the untrusted server (details below).
When an image is extracted, only the entrypoint file is taken from its rootfs; the sandbox does not get the rest of the image's filesystem. Full rootfs sandboxing (needed for interpreted node/python servers) is a planned gateway extension; see Builders.
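The "re-derive the digest and reject mismatches" step can be illustrated with a minimal sketch. It assumes SHA-256 digests in the usual OCI `sha256:<hex>` form; `digestOf` and `verifyDigest` are hypothetical names, not functions from internal/oci.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// digestOf returns the OCI-style "sha256:<hex>" digest of content.
// Hypothetical helper, not part of internal/oci.
func digestOf(content []byte) string {
	sum := sha256.Sum256(content)
	return "sha256:" + hex.EncodeToString(sum[:])
}

// verifyDigest re-derives the digest of the bytes that actually arrived
// and rejects them if they do not match the pinned reference.
func verifyDigest(pinned string, content []byte) error {
	if got := digestOf(content); got != pinned {
		return fmt.Errorf("digest mismatch: pinned %s, got %s", pinned, got)
	}
	return nil
}

func main() {
	blob := []byte("example blob")
	pinned := digestOf(blob) // stands in for the digest pinned in a manifest

	fmt.Println("intact blob accepted:", verifyDigest(pinned, blob) == nil)
	fmt.Println("tampered blob accepted:", verifyDigest(pinned, []byte("tampered")) == nil)
}
```

The point of the design is that the check runs on the received bytes, so a registry or network path that substitutes content cannot satisfy a pinned digest.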
bwrap flags
For an egress-enabled sandbox the gateway passes (from internal/sandbox/sandbox.go):
bwrap \
--die-with-parent \ # no orphaned servers if the gateway dies
--new-session \ # blocks TIOCSTI terminal injection
--unshare-user --unshare-net --unshare-pid \
--unshare-ipc --unshare-uts --unshare-cgroup \
--info-fd 5 \ # gateway reads the child PID from here
--userns-block-fd 6 \ # child blocks until the gateway writes uid/gid maps
--clearenv \
--setenv PATH /usr/bin:/bin \
--tmpfs /etc \
--setenv GIG_PLACEHOLDER <sentinel> \ # extra env emitted in sorted key order
--setenv HTTPS_PROXY http://<proxy-ip>:8081 \
--setenv NODE_EXTRA_CA_CERTS /etc/gigmcp-ca.pem \
--setenv SSL_CERT_FILE /etc/gigmcp-ca.pem \
--ro-bind <extracted-binary> /app/server \
--ro-bind <ca-cert> /etc/gigmcp-ca.pem \
--proc /proc \ # fresh procfs: sandbox sees only its own PID namespace
--dev /dev \
--tmpfs /tmp \
--chdir / \
--ro-bind /usr/local/bin/bootstrap /usr/local/bin/bootstrap \ # after --tmpfs /tmp so a /tmp-resident path would still be visible
-- /usr/local/bin/bootstrap <cidr> <proxy-ip> <veth> -- /app/server
No-network sandboxes use the simpler --unshare-all form with no bootstrap. Caller mounts are applied first so the fixed isolation mounts (--proc, --dev, --tmpfs /tmp) always win on path collisions — bwrap mounts are last-one-wins.
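The two ordering rules mentioned here (extra env in sorted key order, caller mounts before the fixed isolation mounts) can be sketched as follows. This is an illustrative sketch only: `bind`, `buildArgs`, and `indexOf` are hypothetical names, most flags are omitted, and the real internal/sandbox code may be structured differently.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// bind is a hypothetical read-only bind mount requested by the caller.
type bind struct{ src, dst string }

// buildArgs sketches the ordering rules only; it omits most bwrap flags.
func buildArgs(env map[string]string, callerBinds []bind) []string {
	args := []string{"--die-with-parent", "--new-session", "--clearenv"}

	// Emit extra env in sorted key order so the argv is deterministic.
	keys := make([]string, 0, len(env))
	for k := range env {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	for _, k := range keys {
		args = append(args, "--setenv", k, env[k])
	}

	// Caller mounts go first...
	for _, b := range callerBinds {
		args = append(args, "--ro-bind", b.src, b.dst)
	}
	// ...so the fixed isolation mounts, applied later, win on collisions.
	args = append(args, "--proc", "/proc", "--dev", "/dev", "--tmpfs", "/tmp")
	return args
}

// indexOf returns the position of the first occurrence of s, or -1.
func indexOf(args []string, s string) int {
	for i, a := range args {
		if a == s {
			return i
		}
	}
	return -1
}

func main() {
	args := buildArgs(
		map[string]string{"SSL_CERT_FILE": "/etc/gigmcp-ca.pem", "HTTPS_PROXY": "http://proxy.invalid:8081"},
		[]bind{{"/srv/server", "/app/server"}},
	)
	fmt.Println(strings.Join(args, " "))
}
```

Because later mounts shadow earlier ones, a caller bind targeting /tmp in this sketch would be covered by the tmpfs emitted after it.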
Namespaces
| Namespace | Effect |
|---|---|
| user | The sandbox gets its own uid space. The gateway writes a uid/gid map covering 0 0 65535 so bootstrap can start as userns-root (uid 0, CAP_NET_ADMIN for veth setup) and then drop to uid 65534. bwrap's own --uid flag is deliberately not used — it writes a single-entry map that would make setuid(65534) impossible. |
| pid | Fresh PID namespace plus --proc /proc: the server sees only its own process tree, never the host's. (Requires systempaths=unconfined on the Docker container so /proc masks are removed.) |
| net | Fresh network namespace whose only interface is the injected veth and whose only route is default via <proxy-IP>. This is the foundation of egress enforcement. |
| mount | Private mount namespace with only the read-only binds listed above, tmpfs /etc and /tmp, fresh /dev. |
| ipc / uts / cgroup | Unshared — no host IPC objects, hostname, or cgroup visibility. |
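To make the uid-map reasoning in the user row concrete, here is a small sketch of the map format written to /proc/<pid>/uid_map and gid_map (inside id, outside id, count). The `idMap` type and its methods are hypothetical, and the single-entry values are an illustrative assumption rather than what bwrap's --uid writes in any particular setup.

```go
package main

import "fmt"

// idMap models one line of /proc/<pid>/uid_map or gid_map:
// <first id inside the namespace> <first id outside> <count>.
// Hypothetical type for illustration.
type idMap struct{ inside, outside, count int }

// line renders the map entry in the format the kernel expects.
func (m idMap) line() string {
	return fmt.Sprintf("%d %d %d\n", m.inside, m.outside, m.count)
}

// covers reports whether id is a mapped id inside the namespace.
// Only mapped ids are valid targets for setuid/setgid.
func (m idMap) covers(id int) bool {
	return id >= m.inside && id < m.inside+m.count
}

// mapPaths returns the procfs files the gateway would write for a child PID.
func mapPaths(pid int) (uidMap, gidMap string) {
	return fmt.Sprintf("/proc/%d/uid_map", pid), fmt.Sprintf("/proc/%d/gid_map", pid)
}

func main() {
	full := idMap{0, 0, 65535}  // the range described in the table
	single := idMap{0, 1000, 1} // assumed example of a single-entry map

	uidPath, gidPath := mapPaths(4242) // 4242 is a made-up PID
	fmt.Printf("write %q to %s and %s\n", full.line(), uidPath, gidPath)
	fmt.Println("full map covers 65534:", full.covers(65534))
	fmt.Println("single-entry map covers 65534:", single.covers(65534))
}
```

With the wide map both uid 0 and uid 65534 exist inside the namespace, which is what lets bootstrap start as userns-root and later drop to nobody.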
The bootstrap privilege drop
cmd/bootstrap is the only trusted code that ever runs inside a sandbox. Its sequence (order is security-critical):
- Wait for the gateway to inject the veth (1 byte on fd 3).
- Configure the network as userns-root: bring up lo and the peer veth, assign the sandbox's /30 address, add default via <proxy-IP>.
- Signal net-ready to the gateway (fd 4).
- Open the server binary as a file descriptor before dropping privileges (it is later exec'd via execveat(AT_EMPTY_PATH), sidestepping overlay-filesystem path-permission quirks).
- PR_SET_NO_NEW_PRIVS, drop the entire capability bounding set (caps 0–40), clear ambient caps.
- setgroups / setgid / setuid to 65534 (nobody). The kernel clears permitted and effective caps on the uid transition.
- capset to zero every remaining capability set.
- Install the seccomp-BPF filter, fail-closed: any error aborts; the server is never exec'd without the filter.
- execveat the untrusted server.
The result, verified by tests: the server process runs as uid 65534 with CapEff=0 — zero capabilities in every set.
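One way to check this end state is to parse the Uid and Cap* lines of /proc/<pid>/status. The sketch below is an assumption about how such a check could look, not the project's actual test code; `statusField` and `fullyDropped` are hypothetical names, and the sample text mimics the procfs format.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// statusField returns the value of a "Key:\tvalue" line from the text of
// /proc/<pid>/status. Hypothetical helper.
func statusField(status, key string) (string, bool) {
	for _, line := range strings.Split(status, "\n") {
		if strings.HasPrefix(line, key+":") {
			return strings.TrimSpace(strings.TrimPrefix(line, key+":")), true
		}
	}
	return "", false
}

// fullyDropped reports whether every capability set is zero and all four
// uids (real, effective, saved, filesystem) are 65534.
func fullyDropped(status string) bool {
	for _, key := range []string{"CapInh", "CapPrm", "CapEff", "CapBnd", "CapAmb"} {
		v, ok := statusField(status, key)
		if !ok {
			return false
		}
		mask, err := strconv.ParseUint(v, 16, 64) // masks are printed as hex
		if err != nil || mask != 0 {
			return false
		}
	}
	uids, ok := statusField(status, "Uid")
	fields := strings.Fields(uids)
	if !ok || len(fields) == 0 {
		return false
	}
	for _, f := range fields {
		if f != "65534" {
			return false
		}
	}
	return true
}

// sampleDropped imitates the relevant status lines of a dropped process.
const sampleDropped = "Name:\tserver\nUid:\t65534\t65534\t65534\t65534\n" +
	"CapInh:\t0000000000000000\nCapPrm:\t0000000000000000\n" +
	"CapEff:\t0000000000000000\nCapBnd:\t0000000000000000\n" +
	"CapAmb:\t0000000000000000\n"

func main() {
	fmt.Println("sample is fully dropped:", fullyDropped(sampleDropped))
}
```

On a Linux host the same function could be fed the contents of /proc/<pid>/status for the server process.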
Seccomp
Two layers, with different statuses:
Docker layer (the container running the gateway). --security-opt seccomp=unconfined is currently required, because Docker's default seccomp profile blocks the unprivileged user namespaces bwrap needs. A scoped custom Docker profile is a follow-up hardening item that has not shipped.
Application layer (inside every sandbox) — shipped. internal/seccomp, installed by bootstrap after the privilege drop and inherited across execve. It closes the nested-user-namespace escape that uid 65534 + capability-dropping alone do not prevent. The default action is ALLOW; only escape and escalation vectors are denied:
| Syscalls | Action |
|---|---|
| unshare, setns | KILL_PROCESS |
| clone with CLONE_NEWUSER (arg-filtered) | KILL_PROCESS |
| clone3 | ENOSYS (not kill: glibc ≥ 2.34 uses clone3 for pthread_create and falls back to plain clone on ENOSYS, which is arg-filtered above; a KILL would SIGSYS-kill Rust/C servers on thread creation) |
| mount, umount2, pivot_root, chroot, open_tree, move_mount, fsopen, mount_setattr | KILL_PROCESS |
| ptrace, process_vm_readv, process_vm_writev | KILL_PROCESS |
| keyctl, add_key, request_key | KILL_PROCESS |
| bpf, perf_event_open | KILL_PROCESS |
| kexec_load, init_module, finit_module, delete_module | KILL_PROCESS |
Verified by TestSeccompBlocksUnshare (SIGSYS on unshare) and TestSeccompAllowsNormalWork (Go goroutines + TCP under the filter, exit 0).
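The policy in the table can be modelled as a plain decision function. This is a model for reasoning about the rules, not the BPF program in internal/seccomp; the names are hypothetical, and it assumes the x86-64 convention in which the clone flags are the first syscall argument.

```go
package main

import "fmt"

// action is the outcome the filter assigns to a syscall.
type action string

const (
	allow  action = "ALLOW"
	kill   action = "KILL_PROCESS"
	enosys action = "ENOSYS"
)

// Flag values from <linux/sched.h>.
const (
	cloneVM      = 0x00000100
	cloneThread  = 0x00010000
	cloneNewUser = 0x10000000
)

// killed lists the syscalls the table denies unconditionally.
var killed = []string{
	"unshare", "setns",
	"mount", "umount2", "pivot_root", "chroot",
	"open_tree", "move_mount", "fsopen", "mount_setattr",
	"ptrace", "process_vm_readv", "process_vm_writev",
	"keyctl", "add_key", "request_key",
	"bpf", "perf_event_open",
	"kexec_load", "init_module", "finit_module", "delete_module",
}

// decide models the default-allow policy. arg0 is the first syscall
// argument, which is assumed to carry the flags for clone.
func decide(syscall string, arg0 uint64) action {
	switch syscall {
	case "clone":
		if arg0&cloneNewUser != 0 {
			return kill // nested user namespace attempt
		}
		return allow // ordinary fork or thread creation
	case "clone3":
		return enosys // lets libc fall back to the arg-filtered clone
	}
	for _, name := range killed {
		if name == syscall {
			return kill
		}
	}
	return allow
}

func main() {
	fmt.Println("unshare:", decide("unshare", 0))
	fmt.Println("clone (thread):", decide("clone", cloneVM|cloneThread))
	fmt.Println("clone (new userns):", decide("clone", cloneNewUser))
	fmt.Println("clone3:", decide("clone3", 0))
	fmt.Println("read:", decide("read", 0))
}
```

clone3 passes its flags in a struct behind a pointer, which a seccomp filter cannot inspect; that is why the model cannot arg-filter it and returns ENOSYS instead.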
Landlock and cgroups are designed but not yet implemented. DESIGN.md lists Landlock (filesystem rules) and cgroup resource limits as part of the hardening plan, and the sandbox package explicitly defers them. Current filesystem isolation relies on the mount namespace, read-only binds, --clearenv, the private procfs, and the uid/cap drop. A full allowlist-style seccomp profile for maximally hostile workloads is likewise deferred.
What a sandboxed server can and can't see
Can see:
- Its own binary at /app/server (read-only).
- The proxy CA at /etc/gigmcp-ca.pem, an otherwise-empty tmpfs /etc, a writable tmpfs /tmp, a minimal /dev.
- Environment: PATH, HTTPS_PROXY, SSL_CERT_FILE, NODE_EXTRA_CA_CERTS, and GIG_PLACEHOLDER (the high-entropy sentinel, not a secret).
- Its own process tree, its own loopback, and one veth with a /30 address.
Can't see:
- Your real API keys, the vault, the database, or GIG_MASTER_KEY.
- The host filesystem, host processes, host network interfaces, other sandboxes.
- Any inherited environment (bwrap --clearenv).
- Any route except default via <proxy-IP>.
Host-side requirements
From docker-compose.yml, the gateway container runs with:
cap_add:
- NET_ADMIN # veth creation + LinkSetNsPid; no SYS_ADMIN, never privileged
security_opt:
- seccomp=unconfined # bwrap needs unprivileged userns (see note above)
- apparmor=unconfined
- systempaths=unconfined # unmask /proc so bwrap can mount a fresh procfs
At startup the gateway also sets the iptables FORWARD policy to DROP as defense-in-depth (best-effort — route isolation remains the primary enforcement if iptables is unavailable).
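"Best-effort" here means a failure to set the policy is reported but does not stop startup. A minimal sketch of that pattern, assuming the standard `iptables -P FORWARD DROP` invocation; the `runner` type and `ensureForwardDrop` are hypothetical names, and the gateway's real code may differ.

```go
package main

import (
	"errors"
	"fmt"
	"os"
	"strings"
)

// runner executes a command; injected so the sketch can be dry-run.
// A real runner could wrap exec.Command(name, args...).Run().
type runner func(name string, args ...string) error

// errUnavailable simulates a host without a usable iptables binary.
var errUnavailable = errors.New("iptables unavailable")

// ensureForwardDrop tries to set the FORWARD policy to DROP.
// It warns on failure instead of aborting, because route isolation
// remains the primary enforcement.
func ensureForwardDrop(run runner) bool {
	if err := run("iptables", "-P", "FORWARD", "DROP"); err != nil {
		fmt.Fprintf(os.Stderr, "warning: FORWARD policy not set to DROP: %v\n", err)
		return false
	}
	return true
}

func main() {
	// Dry run: print the command instead of touching the host firewall.
	dryRun := func(name string, args ...string) error {
		fmt.Println("would run:", name, strings.Join(args, " "))
		return nil
	}
	fmt.Println("policy set:", ensureForwardDrop(dryRun))

	failing := func(string, ...string) error { return errUnavailable }
	fmt.Println("policy set:", ensureForwardDrop(failing))
}
```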
Next: how the network namespace becomes an unforgeable identity in the Egress Proxy.