
Inner Warden for AI Startups: Protecting Inference Servers

April 25, 2026 · 8 min read

Why inference boxes are different

An H100 node with a fine-tuned model on it is one of the highest-value compute units a small startup ever touches. The weights might represent six months of training data work, a proprietary dataset, or a partnership you cannot replace. The hourly cost alone is between four and twelve dollars depending on the provider. Every minute a scraper hammers your endpoint is real money out the door.

And yet most AI startups treat the inference box like any other application server: a Dockerfile, an API key in front, a Cloudflare proxy, done. The security model assumes the network is the perimeter. That assumption was already wrong for normal apps. For inference, the threats look different in specific ways.

Threats that actually hit inference servers

Four patterns we see most:

First, model weight exfiltration. A compromised dependency in your inference container reads /models/ and pipes the weights out over an outbound connection disguised as telemetry. The weights are usually between 2 GB and 70 GB, and a slow trickle over hours blends in with normal traffic to S3 or HuggingFace.

Second, inference budget drain. Scrapers and competitor bots automate request floods to extract your model behavior, generate synthetic training sets, or just burn your GPU minutes. We see campaigns sustained over weeks at 30 to 50 percent of capacity.

Third, prompt injection at the agent layer. If your product has an agentic loop, prompt injection can cause it to call tools it should not, exfil context, or execute MCP commands the user never approved.

Fourth, supply chain. A malicious update to a Python package pulled by your inference container, or a compromised model on a public hub, gets root equivalent inside the container.

eBPF visibility on the host

The sensor runs on the node, not inside the container. It hooks 40 kernel events using eBPF: process exec, file open, network connect, syscall audit, mount, capability use. Containers cannot hide from this. A reverse shell launched inside your inference container is just another execve from the host kernel's point of view.

The detector that matters most for weight exfil is the file-read-then-network-write correlator. It flags any process that reads a large file from a configured sensitive path (your weights directory) and then writes more than a threshold of bytes to a network socket within a configurable window. The defaults are 100 MB read and 50 MB sent within five minutes.

sudo innerwarden detector enable weight_exfil \
  --watch-path /models \
  --read-threshold 100MB \
  --network-threshold 50MB \
  --window 5m \
  --auto-block

agent-guard for the agent layer

The agent-guard module sits between your agent loop and any tool it calls. It inspects MCP traffic, enforces 71 Agent Threat Rules (ATR), and can block or downgrade tool calls that match command injection, path traversal, SSRF, or exfil patterns. It speaks MCP natively, so you do not need to rewrite your tool layer.

The default rule pack covers 29 prompt injection patterns curated from public adversarial corpora plus our own incident data. Each rule has a hash and a reasoning trail, so when something gets blocked you can trace why.

For background, see protecting AI agents on a server and building secure AI agents.

MCP traffic inspection

If your stack uses Model Context Protocol, every tool call your agent makes is JSON over a stdio or HTTP transport. agent-guard can either run as a transparent proxy or be embedded as a library. Either way it logs the full request and response, hashes it into an audit chain, and runs the rule set before forwarding.

The audit chain is hash-linked, so if someone tampers with the log on the box you can detect it. For startups dealing with auditors or enterprise customers, this is the artifact they ask for and you usually do not have.
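The tamper-evidence property is standard hash chaining. A minimal sketch, assuming a JSON log (the real chain format is not specified here): each entry commits to the previous entry's hash, so editing any record breaks every later link.

```python
import hashlib
import json

GENESIS = "0" * 64  # sentinel "previous hash" for the first entry

def append_entry(chain: list[dict], record: dict) -> None:
    # Each entry's hash covers both the record and the previous hash.
    prev = chain[-1]["hash"] if chain else GENESIS
    payload = json.dumps({"prev": prev, "record": record}, sort_keys=True)
    chain.append({
        "prev": prev,
        "record": record,
        "hash": hashlib.sha256(payload.encode()).hexdigest(),
    })

def verify(chain: list[dict]) -> bool:
    # Re-derive every hash; any edited record or broken link fails.
    prev = GENESIS
    for entry in chain:
        payload = json.dumps({"prev": prev, "record": entry["record"]},
                             sort_keys=True)
        expected = hashlib.sha256(payload.encode()).hexdigest()
        if entry["prev"] != prev or entry["hash"] != expected:
            return False
        prev = entry["hash"]
    return True
```

On its own this only detects tampering after the fact; shipping the head hash off-box (or to the customer's SIEM) is what makes the detection trustworthy.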

Scraper and bot drain

The host agent fingerprints clients across IP rotation using behavioural features (request cadence, path distribution, header oddities, TLS fingerprint). Persistent abusers from rotating residential proxies still cluster on the fingerprint.
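A toy version of that clustering, with illustrative features only (real TLS fingerprinting, e.g. JA3-style hashes, and cadence modelling are more involved): the path distribution is coarsened before hashing so small count differences do not split a client across fingerprints, while the source IP never enters the hash at all.

```python
import hashlib

def behaviour_fingerprint(tls_fp: str, header_order: list[str],
                          path_histogram: dict[str, int]) -> str:
    # Coarsen path counts to a rough distribution (tenths of traffic),
    # so 90/10 and 900/100 request mixes fingerprint identically.
    total = sum(path_histogram.values()) or 1
    dist = sorted((p, round(10 * n / total))
                  for p, n in path_histogram.items())
    # The source IP is deliberately absent: rotation does not help.
    material = repr((tls_fp, header_order, dist))
    return hashlib.sha256(material.encode()).hexdigest()[:16]
```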

When the rate detector trips, the auto-response is not a flat block. It is a tiered downgrade: 429 for an hour, then a 24 hour block at the firewall if the same fingerprint reappears from new IPs. For abuse coming from one Cloudflare-fronted competitor we have seen this drop GPU utilisation by 40 percent overnight.
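The escalation ladder above can be sketched as a small state machine keyed by fingerprint, not IP. The class and return values here are hypothetical; the tiers match the text: first trip rate-limits for an hour, a reappearance of the same fingerprint escalates to a 24-hour firewall block.

```python
RATE_LIMIT_SECS = 3600       # tier 1: serve 429s for an hour
BLOCK_SECS = 24 * 3600       # tier 2: firewall block for 24 hours

class TieredResponder:
    def __init__(self):
        self.limited_until: dict[str, float] = {}  # fingerprint -> expiry
        self.blocked_until: dict[str, float] = {}  # fingerprint -> expiry

    def on_detector_trip(self, fp: str, now: float) -> str:
        if now < self.blocked_until.get(fp, 0):
            return "drop"  # already firewalled, nothing more to do
        if fp in self.limited_until:
            # Same fingerprint came back (possibly from new IPs): escalate.
            self.blocked_until[fp] = now + BLOCK_SECS
            return "firewall_block_24h"
        self.limited_until[fp] = now + RATE_LIMIT_SECS
        return "rate_limit_429"
```

Note the sketch remembers a fingerprint past its rate-limit window on purpose: a client that trips the detector twice is treated as a repeat offender even if it waited out the first hour.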

Supply chain inside the container

The host agent does not need to be inside your container to spot a malicious dependency. When a new binary execs that did not exist five minutes ago, when a Python process suddenly opens a connection to an IP it has never contacted before, or when an ld.so.preload file appears, the host sees it. The triage layer scores it. If your inference container normally talks to S3, OpenAI, and your own DB and nowhere else, anything outside that set is a flag.
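That baseline idea reduces to a pair of allowlists learned per workload. A minimal sketch (class and destination names are illustrative, not the product's schema): anything outside the learned set gets flagged for triage rather than silently allowed.

```python
class ContainerBaseline:
    """Per-container baseline of what 'normal' looks like (sketch)."""

    def __init__(self, known_egress: set[str], known_binaries: set[str]):
        self.known_egress = known_egress      # learned destinations
        self.known_binaries = known_binaries  # binaries seen during baseline

    def on_exec(self, path: str) -> str:
        # A binary that never existed during baselining is the classic
        # dropped-payload signal.
        return "ok" if path in self.known_binaries else "flag:new_binary"

    def on_connect(self, dest: str) -> str:
        # New egress from a workload with a stable destination set is
        # exactly the telemetry-disguised exfil pattern described earlier.
        return "ok" if dest in self.known_egress else "flag:new_egress"
```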

For deeper coverage of the docker side, see Docker container security.

Honest scope

Inner Warden does not encrypt your weights at rest; you still want LUKS or provider disk encryption. It does not authenticate your API clients; that is your gateway's job. It does not prevent a determined attacker with valid creds from doing things your application allows. What it does is make the post-exploit phase loud and short, and the agent-side abuse expensive.

Read more: AI agent bodyguard