Sovereign AI on Metal: Air-Gapped LLM Stack with Ubuntu & vLLM
For when the cloud isn't private enough. How to run a Sovereign Appliance using hardened Ubuntu and open-source models.
·1 min read·
#OnPremise#Ubuntu#vLLM
Some clients — central banks, defence, regulated insurers — cannot use cloud. Full stop. They need a physical appliance that does inference behind their own firewall, with no callback, no telemetry, no licence server.
Here's the stack I ship.
Hardware baseline
- 2× NVIDIA H100 (80GB) — comfortably fits Llama-3 70B at 4-bit.
- Hardened Ubuntu 22.04 LTS, kernel locked down with
sysctl+ AppArmor profiles. - Mellanox 100GbE between nodes for tensor parallelism.
The inference layer
vLLM wins on three axes that matter for sovereign deployments:
- PagedAttention — squeezes more concurrent users out of fixed VRAM.
- OpenAI-compatible REST — drop-in replacement for SDKs the dev team already knows.
- No phone-home. Inspect
lsof -iafter boot; nothing leaves the box.
python -m vllm.entrypoints.openai.api_server \
--model /opt/models/llama-3-70b-awq \
--quantization awq \
--tensor-parallel-size 2 \
--host 127.0.0.1 \
--port 8000
Bind to localhost; expose through an Nginx reverse proxy that enforces mTLS from the internal CA.
What the auditor sees
- Filesystem hash of the model weights, signed at delivery.
journalctlexport of every inference request (URL only, never prompt body — that's logged separately to an encrypted volume).- A documented "kill switch": pull the cable, the model stops. There is no SaaS dependency.
This is what Sovereign actually means: the customer owns every byte that touches the model.
More in Sovereign AI
Sovereign AI·2 min read
The Hidden Costs of AI: Preventing Token Shock in AWS Bedrock
#CostOptimization#AWSBedrock
Sovereign AI·2 min read
From Prompt to Production: The Golden Path for Secure GenAI Apps
#SecureGenAI#Lambda
Sovereign AI·2 min read
The Anatomy of a Private GPT: Architecting for SOC2 in Banking
#PrivateGPT#Architecture