Sovereign AI on Metal: Air-Gapped LLM Stack with Ubuntu & vLLM

For when the cloud isn't private enough. How to run a Sovereign Appliance using hardened Ubuntu and open-source models.

March 24, 2026·2 min read·

#OnPremise#Ubuntu#vLLM

Some clients — central banks, defence, regulated insurers — cannot use cloud. Full stop. They need a physical appliance that does inference behind their own firewall, with no callback, no telemetry, no licence server.

Here's the stack I ship.

Hardware baseline

2× NVIDIA H100 (80GB) — comfortably fits Llama-3 70B at 4-bit.
Hardened Ubuntu 22.04 LTS, kernel locked down with sysctl + AppArmor profiles.
Mellanox 100GbE between nodes for tensor parallelism.

The inference layer

vLLM wins on three axes that matter for sovereign deployments:

PagedAttention — squeezes more concurrent users out of fixed VRAM.
OpenAI-compatible REST — drop-in replacement for SDKs the dev team already knows.
No phone-home. Inspect lsof -i after boot; nothing leaves the box.

python -m vllm.entrypoints.openai.api_server \
  --model /opt/models/llama-3-70b-awq \
  --quantization awq \
  --tensor-parallel-size 2 \
  --host 127.0.0.1 \
  --port 8000

Bind to localhost; expose through an Nginx reverse proxy that enforces mTLS from the internal CA.

What the auditor sees

The auditor should see more than a rack of GPUs. They should see a controlled appliance: hardened operating system baseline, model checksum, offline package repository, role-based access, prompt logging, and change records for every model update.

The vLLM documentation is useful for inference setup, but sovereign deployment needs an additional operations layer. Treat the model server like regulated infrastructure: patch windows, access reviews, backup plan, incident procedure, and evidence that the appliance cannot call home. Sovereign AI is not only where the model runs. It is how the model is governed.

Filesystem hash of the model weights, signed at delivery.
journalctl export of every inference request (URL only, never prompt body — that's logged separately to an encrypted volume).
A documented "kill switch": pull the cable, the model stops. There is no SaaS dependency.

This is what Sovereign actually means: the customer owns every byte that touches the model.

The final test is operational: can the customer patch, restart, audit, and disable the stack without calling a vendor SaaS endpoint? If yes, the appliance is sovereign in practice, not only in marketing.

Closing thought

Sovereign AI is not a marketing label, it is an operational property. You either own every byte of inference and can prove it, or you do not. vLLM on metal is one way to get there for teams that genuinely cannot send data to a SaaS endpoint — and a poor choice for teams that can.

Honest pre-requisites before going on-metal

A team that owns hardware patching cadence
Encrypted, signed model weight delivery process
Observability that survives without external SaaS
A 24x7 escalation path that includes physical access

Public profile lookup

Ask AI About the Author

Open this query in ChatGPT, Claude, or Perplexity.

ChatGPT

Best for structured summaries.

Claude

Useful for concise synthesis.

Perplexity

Good for web-backed lookup.

Comments

Comments are open to confirmed email subscribers. Use the email you subscribed with. To edit a comment, delete it and post a new one.

Get new field notes by email

Field notes from someone who ships before they write about it. Sovereign AI, AI-SDLC, DevOps, and what 59 production deployments teach you. No spam. Unsubscribe anytime.

Related field notes

Sovereign AI·5 min read