Sovereign AI

Sovereign AI on Metal: Air-Gapped LLM Stack with Ubuntu & vLLM

For when the cloud isn't private enough. How to run a Sovereign Appliance using hardened Ubuntu and open-source models.

·1 min read·
#OnPremise#Ubuntu#vLLM

Some clients — central banks, defence, regulated insurers — cannot use cloud. Full stop. They need a physical appliance that does inference behind their own firewall, with no callback, no telemetry, no licence server.

Here's the stack I ship.

Hardware baseline

  • 2× NVIDIA H100 (80GB) — comfortably fits Llama-3 70B at 4-bit.
  • Hardened Ubuntu 22.04 LTS, kernel locked down with sysctl + AppArmor profiles.
  • Mellanox 100GbE between nodes for tensor parallelism.

The inference layer

vLLM wins on three axes that matter for sovereign deployments:

  1. PagedAttention — squeezes more concurrent users out of fixed VRAM.
  2. OpenAI-compatible REST — drop-in replacement for SDKs the dev team already knows.
  3. No phone-home. Inspect lsof -i after boot; nothing leaves the box.
python -m vllm.entrypoints.openai.api_server \
  --model /opt/models/llama-3-70b-awq \
  --quantization awq \
  --tensor-parallel-size 2 \
  --host 127.0.0.1 \
  --port 8000

Bind to localhost; expose through an Nginx reverse proxy that enforces mTLS from the internal CA.

What the auditor sees

  • Filesystem hash of the model weights, signed at delivery.
  • journalctl export of every inference request (URL only, never prompt body — that's logged separately to an encrypted volume).
  • A documented "kill switch": pull the cable, the model stops. There is no SaaS dependency.

This is what Sovereign actually means: the customer owns every byte that touches the model.

More in Sovereign AI