Today we're releasing CAPE (Capability Achievement via Policy Execution), a new protocol for post-training that shifts AI development from optimizing intelligence to engineering capability. Along with the research paper, we're open-sourcing the full protocol, specification language, and policy packs under Apache 2.0, and launching CapabilityBench, a community registry for evaluating models against explicit capability requirements.
Intelligence Isn't Enough
Modern AI models are astonishingly intelligent. They solve mathematical olympiad problems. They pass bar exams. They generate sophisticated code. By every benchmark we've invented to measure raw problem-solving ability, they're improving at a breathtaking pace.
And yet.
When you try to deploy these same models into production—to build something real that people depend on—they struggle in ways that feel absurd. A model that can derive complex mathematical proofs can't reliably adhere to a hospital's specific formulary. A model that writes elegant code won't consistently follow a team's style guide. A model that reasons through intricate logic fails to respect a firm's jurisdiction constraints or a company's escalation protocol.
This isn't a bug. It's a category error. We've been measuring intelligence when deployment requires capability.
Intelligence ≠ Capability
Intelligence is the raw ability to solve complex, open-ended problems. It's what benchmarks like AIME and GPQA measure. When we celebrate a model scoring 79.8% on a math olympiad benchmark, we're celebrating intelligence.
Capability is different. It's the application of intelligence to specific requirements in specific contexts. If intelligence asks "can it solve this?", capability asks "can it do this job, the way we need it done?"
A model can be extraordinarily intelligent while lacking basic capabilities. This is the deployment gap: the space between what models can theoretically do and what they reliably do when requirements must be met.
Current post-training methods such as RLHF, DPO, and their variants optimize for intelligence. They collect human preferences, train reward models, and nudge outputs toward what annotators prefer. The problem is that preferences are noisy. Annotators disagree 30–50% of the time on subtle comparisons, and that disagreement rate doesn't improve with more compute or better models. It's a structural ceiling.
CAPE takes a different approach entirely. Instead of asking "which output do humans prefer?", it asks "does this output satisfy this specification?"
Introducing Capability Engineering
We propose completing the AI development stack with a third layer: capability engineering—the systematic practice of defining, verifying, and training models against executable specifications.
| Layer | Function | Mechanism | Guarantee |
|---|---|---|---|
| Context Engineering | Inform | RAG, retrieval | Probabilistic |
| Prompt Engineering | Guide | Instructions | Probabilistic |
| Capability Engineering | Constrain | Specifications | Verifiable |
The key insight underpinning CAPE is that most capability requirements become objective once context is fixed. "Good medical advice" is genuinely subjective—different people, different values, different answers. But "recommend only formulary drugs, disclose all contraindications, verify suitability against stated risk tolerance" is objective. There's a fact of the matter about whether each requirement is satisfied.
We call this contextual objectivity. In our studies, inter-annotator agreement jumped from 63% (abstract questions like "is this good advice?") to 99% (explicit policy evaluation). The apparent subjectivity dissolves when you ask: good according to whom, for what purpose, in what context?
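To make the contrast concrete, here's a minimal sketch in Python of the formulary example above as explicit checks. Everything in it (the Recommendation structure, the field names, the tiny formulary) is invented for illustration; the point is only that each check has a definite answer.

```python
# Illustrative only: structure, fields, and data are hypothetical, not from the CAPE release.
from dataclasses import dataclass, field

@dataclass
class Recommendation:
    drug: str
    disclosed_contraindications: set = field(default_factory=set)
    risk_tolerance_checked: bool = False

FORMULARY = {"metformin", "lisinopril"}                       # hospital-specific context
KNOWN_CONTRAINDICATIONS = {"metformin": {"renal impairment"}}

def on_formulary(rec):
    return rec.drug in FORMULARY

def contraindications_disclosed(rec):
    return KNOWN_CONTRAINDICATIONS.get(rec.drug, set()) <= rec.disclosed_contraindications

def risk_tolerance_verified(rec):
    return rec.risk_tolerance_checked

rec = Recommendation("metformin", {"renal impairment"}, risk_tolerance_checked=True)

# "Is this good advice?" is debatable; each of these questions is not.
print(on_formulary(rec), contraindications_disclosed(rec), risk_tolerance_verified(rec))
```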
The CAPE Loop
CAPE operationalizes capability engineering through a closed loop: Specify → Verify → Correct → Train.
Specify. Convert capability requirements into executable policies. For structural properties (arithmetic correctness, citation format, safety patterns), this means symbolic policies written in CPL, our deterministic specification language. For semantic properties (reasoning validity, proof completeness, plan feasibility), this means learned verifiers trained on explicit rubrics.
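CPL's concrete syntax isn't shown in this post, so here's a rough sketch of what two symbolic policies could look like, written as plain Python predicates instead. Only the second policy ID comes from this post (it's quoted in the iteration section below); everything else is our illustration.

```python
# Hypothetical sketch of symbolic policies as deterministic predicates; not CPL syntax.
import re

def arithmetic_consistent(text: str) -> bool:
    """Every 'a + b = c' claim in the output must actually hold."""
    return all(int(a) + int(b) == int(c)
               for a, b, c in re.findall(r"(\d+)\s*\+\s*(\d+)\s*=\s*(\d+)", text))

def factual_claims_cited(text: str) -> bool:
    """Simplification: treat every sentence as a factual claim needing a [n] citation."""
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    return all(re.search(r"\[\d+\]", s) for s in sentences)

SYMBOLIC_POLICIES = {
    "policy.math.arithmetic_consistent": arithmetic_consistent,
    "policy.citation.factual_claims_cited": factual_claims_cited,
}

# A semantic property would instead get a learned verifier trained on an explicit rubric,
# e.g. ["every premise is stated", "no step assumes the conclusion", ...] (our wording).
```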
Verify. Parse model outputs into a structured representation called a PredicateGraph, then evaluate against policies. Symbolic verification is deterministic: same policy, same output, same verdict, every time. Learned verification uses rubric-trained models that detect specific failure modes rather than predicting annotator preferences.
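Here's what that evaluation step might look like once parsing is done. The PredicateGraph schema below is a stand-in we made up, since the real representation isn't specified in this post.

```python
# Sketch only: the real PredicateGraph schema is not specified in this post.
from typing import Callable, Dict

# Pretend the parser has already extracted structured predicates from the raw output.
predicate_graph = {
    "claims": ["7 + 3 = 10 [1]", "The dose was doubled [2]"],
    "citations": {"[1]", "[2]"},
}

def verify(graph: Dict, policies: Dict[str, Callable[[Dict], bool]]) -> Dict[str, bool]:
    """Deterministic: same policy, same output, same verdict, every time."""
    return {name: check(graph) for name, check in policies.items()}

policies = {
    "policy.citation.factual_claims_cited":
        lambda g: all(any(c in claim for c in g["citations"]) for claim in g["claims"]),
}

print(verify(predicate_graph, policies))  # {'policy.citation.factual_claims_cited': True}
```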
Correct. When violations occur, generate corrections that satisfy the specification. This might be deterministic patching ("change 7.1 to 7.095"), template insertion for missing elements, or constrained rewriting for semantic changes. Every correction is re-verified before acceptance.
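A sketch of deterministic patching plus the re-verify gate, in the same hypothetical style; the arithmetic patch mirrors the "change 7.1 to 7.095" idea on a simpler case.

```python
# Illustrative correction loop; strategy names are assumptions, not the released API.
import re
from typing import Callable, Dict, Optional

def patch_arithmetic(text: str) -> str:
    """Deterministic patching: recompute any wrong 'a + b = c' claim."""
    def fix(m: re.Match) -> str:
        a, b = int(m.group(1)), int(m.group(2))
        return f"{a} + {b} = {a + b}"
    return re.sub(r"(\d+)\s*\+\s*(\d+)\s*=\s*(\d+)", fix, text)

def correct(output: str,
            policies: Dict[str, Callable[[str], bool]],
            strategies: Dict[str, Callable[[str], str]]) -> Optional[str]:
    for name, check in policies.items():
        if not check(output) and name in strategies:
            output = strategies[name](output)
    # Every correction is re-verified before acceptance.
    return output if all(check(output) for check in policies.values()) else None

policies = {"policy.math.arithmetic_consistent":
            lambda t: all(int(a) + int(b) == int(c)
                          for a, b, c in re.findall(r"(\d+)\s*\+\s*(\d+)\s*=\s*(\d+)", t))}
strategies = {"policy.math.arithmetic_consistent": patch_arithmetic}

print(correct("So 12 + 30 = 41.", policies, strategies))  # "So 12 + 30 = 42."
```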
Train. Use verified-correct outputs as supervised training signal. The model learns to satisfy specifications by default, not through reward shaping or preference optimization, but through direct supervision on what correct looks like.
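Putting the loop together as training data, as a sketch: the verify and correct callables stand in for the verification and correction steps above, and the dataset format is an assumption on our part.

```python
# Sketch: turn verified-correct outputs into ordinary supervised fine-tuning examples.
def build_sft_dataset(records, verify, correct):
    """records: (prompt, model_output) pairs sampled from the current model.
    verify(output) -> {policy_name: bool}; correct(output) -> fixed output or None."""
    dataset = []
    for prompt, output in records:
        if not all(verify(output).values()):                      # Verify
            output = correct(output)                              # Correct (None if unfixable)
        if output is not None and all(verify(output).values()):   # corrections are re-verified
            dataset.append({"prompt": prompt, "completion": output})
    return dataset

# Each kept example is direct supervision on what correct looks like;
# fine-tune on it with any standard SFT trainer. No reward model is involved.
```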
This loop avoids the pathologies of reward-based methods: no length bias from loss normalization, no difficulty weighting from advantage computation, no reward hacking from proxy optimization. The signal is direct and clean.
Breaking the Preference Ceiling
CAPE's ceiling is technical, not human.
Preference-based methods plateau at human disagreement. No matter how much compute you throw at the problem, annotators will still disagree 30–50% of the time on nuanced comparisons. The ceiling is structural.
CAPE's ceiling is verification fidelity: how accurately we can parse outputs and evaluate them against specifications. For symbolic policies, this means extraction accuracy. For learned verifiers, this means rubric-following capability. Both improve with model scale.
In our experiments, each percentage point reduction in extraction error corresponded to approximately 0.8 percentage points reduction in violation rate (r = 0.94). Extraction quality determines your ceiling. And because extraction improves with model scale, CAPE benefits twice from continued progress: better base models to train, and better extractors to train them with.
This is a genuine scaling law: invest in verification quality, and model capability follows.
Performance
Violation rate measures capability directly: does the model satisfy the specification or not? Across 109,500 examples in six domains, CAPE reduces violation rates by 81% relative to DPO on the three domains with symbolic policies, and improves scores on the three domains evaluated by learned verifiers:
| Domain | DPO | CAPE |
|---|---|---|
| Arithmetic (violation rate) | 10.2% | 1.8% |
| Code Safety (violation rate) | 15.8% | 3.2% |
| Citation Grounding (violation rate) | 13.4% | 2.6% |
| Argument Soundness (verifier score) | 0.62 | 0.83 |
| Proof Validity (verifier score) | 0.58 | 0.81 |
| Code Correctness (verifier score) | 0.64 | 0.84 |
Lower is better for violation rates; higher is better for verifier scores.
Perhaps more importantly, CAPE's improvements are stable. While reward-based methods show high variance across random seeds (σ = 1.6–2.1%), CAPE achieves σ < 0.3%. When you run the same training with different seeds, you get essentially the same result.
Hybrid training works too. Running CAPE first to establish a correctness floor, then DPO to optimize quality within constraints, achieves both low violation rates (2.9%) and high preference win rates (63.7%).
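As a sketch of that ordering (every callable here is a placeholder, not the released API; build_sft_dataset refers to the Train-step sketch above):

```python
# Hypothetical two-stage schedule; the trainers are passed in as placeholders.
def hybrid_post_train(model, prompts, verify, correct, preference_pairs, sft_train, dpo_train):
    # Stage 1: CAPE establishes the correctness floor.
    outputs = [(p, model.generate(p)) for p in prompts]
    sft_data = build_sft_dataset(outputs, verify, correct)   # sketch from the Train step above
    model = sft_train(model, sft_data)

    # Stage 2: DPO optimizes preference quality within the constraints.
    model = dpo_train(model, preference_pairs)

    # Re-verify afterwards so preference optimization cannot silently erode the floor.
    still_violating = [p for p in prompts if not all(verify(model.generate(p)).values())]
    return model, still_violating
```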
Accessible Post-Training
CAPE reduces capability-specific post-training costs by 5–20× by replacing per-example annotation with reusable specifications.
Preference annotation costs $5–15 per comparison, and getting a quality signal requires multiple comparisons per example. Even a modest training set of 10,000 examples quickly runs to $50,000–150,000 in annotation costs alone. And that's per capability. Adding citation grounding after you've trained arithmetic requires a whole new annotation campaign.
CAPE's upfront cost is policy authoring: 2–5 days of domain-expert and engineer time, roughly $2,000–4,000 fully loaded. Once written, a policy pack generates unlimited training signal.
| Stage | RLHF/DPO | CAPE |
|---|---|---|
| Specification | — | $2,000–4,000 |
| Annotation | $50,000–150,000 | — |
| Verification Infrastructure | — | $200–400 |
| Compute (training) | $8,000–12,000 | $8,000–12,000 |
| Iteration Cycles | 3–5 rounds | 1–2 rounds |
| Total | $80,000–200,000 | $10,000–16,000 |
| Timeline | 2–4 months | 1–2 weeks |
The iteration cycle difference matters just as much. When a preference-trained model fails, you know something went wrong but not what. Diagnosing the problem requires new annotation campaigns. CAPE's explicit verdicts tell you exactly which policies fail on which outputs: "policy.citation.factual_claims_cited failed on 847 of 10,000 examples." Engineers fix the policy or the correction strategy directly.
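That kind of report is easy to produce from stored verdicts. A minimal sketch, assuming verdicts are kept as one per-policy pass/fail map per evaluated example:

```python
# Sketch: aggregate stored verdicts into a per-policy failure report.
from collections import Counter

def failure_report(verdicts):
    """verdicts: one {policy_name: bool} map per evaluated example."""
    total = len(verdicts)
    failures = Counter(name for v in verdicts for name, passed in v.items() if not passed)
    for name, count in failures.most_common():
        print(f"{name} failed on {count} of {total} examples")

failure_report([
    {"policy.citation.factual_claims_cited": False, "policy.math.arithmetic_consistent": True},
    {"policy.citation.factual_claims_cited": True,  "policy.math.arithmetic_consistent": True},
])
# -> policy.citation.factual_claims_cited failed on 1 of 2 examples
```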
This cost structure means capability engineering is accessible to far more teams. You don't need a massive annotation budget. You need domain expertise and engineering time.
The same policies also compound in value. As base models improve, your existing policy packs yield better results without rework. You're not re-annotating; you're re-running.
Introducing CapabilityBench
Current benchmarks measure intelligence: can the model solve this problem? But deployment requires knowing capability: does the model satisfy these specific requirements?
A score of 78% on a benchmark tells you nothing about whether a model can operate within your hospital's clinical workflow, navigate your firm's jurisdiction constraints, or execute your company's customer service protocol. It's an aggregate over problems you may never encounter, evaluated on criteria that may not match yours.
We're also launching CapabilityBench, a public registry that replaces opaque intelligence scores with traceable capability verdicts.
The idea is simple. Organizations and researchers contribute policy packs encoding capability requirements for specific domains. Models are evaluated against these policies, and results show exactly which requirements each model satisfies or violates.
The question shifts from "how smart is this model?" to "can this model do what I need?"
As the registry grows, CapabilityBench will become a shared resource for evaluating models against the requirements that actually matter for deployment.
We expect to launch CapabilityBench publicly in early 2026. If your domain has requirements that models should meet, we'd love to hear from you.
The Path Forward
For too long, post-training has been alchemy: collect preferences, train a reward model, hope the outputs improve. When something goes wrong, there's no specification to trace failures against, no verification to prove fixes work, no guarantee that improvement on one dimension doesn't regress another.
CAPE makes model improvement resemble traditional engineering. Requirements are explicit. Correctness is verifiable. Failures are traceable. Fixes are validated.
The ceiling is now technical, not human. Better verification yields more capable models. The path forward is clear.
We believe capability engineering completes the AI development stack. Context engineering provides the right information. Prompt engineering provides the right framing. Capability engineering provides the right constraints. Together, they transform intelligent models into capable systems.
The gap between intelligence and capability has defined the deployment bottleneck. CAPE closes it.