Today we're releasing CAPE (Capability Achievement via Policy Execution), a new protocol for post-training that shifts AI development from optimizing intelligence to engineering capability. Along with the research paper, we're open-sourcing the full protocol, specification language, and policy packs under Apache 2.0, and launching CapabilityBench, a community registry for evaluating models against explicit capability requirements.
Intelligence Isn't Enough
Modern AI models are astonishingly intelligent. They solve mathematical olympiad problems. They pass bar exams. They generate sophisticated code. By every benchmark we've invented to measure raw problem-solving ability, they're improving at a breathtaking pace.
And yet.
When you try to deploy these same models into production—to build something real that people depend on—they struggle in ways that feel absurd. A model that can derive complex mathematical proofs can't reliably adhere to a hospital's specific formulary. A model that writes elegant code won't consistently follow a team's style guide. A model that reasons through intricate logic fails to respect a firm's jurisdiction constraints or a company's escalation protocol.
This isn't a bug. It's a category error. We've been measuring intelligence when deployment requires capability.
Intelligence ≠ Capability
Intelligence is the raw ability to solve complex, open-ended problems. It's what benchmarks like AIME and GPQA measure. When we celebrate a model scoring 79.8% on a math olympiad benchmark, we're celebrating intelligence.
Capability is different. It's the application of intelligence to specific requirements in specific contexts. If intelligence asks "can it solve this?", capability asks "can it do this job, the way we need it done?"
A model can be extraordinarily intelligent while lacking basic capabilities. This is the deployment gap: the space between what models can theoretically do and what they reliably do when requirements must be met.
Current post-training methods such as RLHF, DPO, and their variants optimize for intelligence. They collect human preferences, train reward models, and nudge outputs toward what annotators prefer. The problem is that preferences are noisy. Annotators disagree 30–50% of the time on subtle comparisons, and that disagreement rate doesn't improve with more compute or better models. It's a structural ceiling.
CAPE takes a different approach entirely. Instead of asking "which output do humans prefer?", it asks "does this output satisfy this specification?"
Introducing Capability Engineering
We propose completing the AI development stack with a third layer: capability engineering—the systematic practice of defining, verifying, and training models against executable specifications.
| Layer | Function | Mechanism | Guarantee |
|---|---|---|---|
| Context Engineering | Inform | RAG, retrieval | Probabilistic |
| Prompt Engineering | Guide | Instructions | Probabilistic |
| Capability Engineering | Constrain | Specifications | Verifiable |
The key insight underpinning CAPE is that most capability requirements become objective once context is fixed. "Good medical advice" is genuinely subjective—different people, different values, different answers. But "recommend only formulary drugs, disclose all contraindications, verify suitability against stated risk tolerance" is objective. There's a fact of the matter about whether each requirement is satisfied.
We call this contextual objectivity. In our studies, inter-annotator agreement jumped from 63% (abstract questions like "is this good advice?") to 99% (explicit policy evaluation). The apparent subjectivity dissolves when you ask: good according to whom, for what purpose, in what context?
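To make the contrast concrete, here's a minimal sketch in Python of the formulary example above as explicit checks. Everything in it (the Recommendation structure, the field names, the tiny formulary) is invented for illustration; the point is only that each check has a definite answer.

```python
# Illustrative only: structure, fields, and data are hypothetical, not from the CAPE release.
from dataclasses import dataclass, field

@dataclass
class Recommendation:
    drug: str
    disclosed_contraindications: set = field(default_factory=set)
    risk_tolerance_checked: bool = False

FORMULARY = {"metformin", "lisinopril"}                       # hospital-specific context
KNOWN_CONTRAINDICATIONS = {"metformin": {"renal impairment"}}

def on_formulary(rec):
    return rec.drug in FORMULARY

def contraindications_disclosed(rec):
    return KNOWN_CONTRAINDICATIONS.get(rec.drug, set()) <= rec.disclosed_contraindications

def risk_tolerance_verified(rec):
    return rec.risk_tolerance_checked

rec = Recommendation("metformin", {"renal impairment"}, risk_tolerance_checked=True)

# "Is this good advice?" is debatable; each of these questions is not.
print(on_formulary(rec), contraindications_disclosed(rec), risk_tolerance_verified(rec))
```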
The CAPE Loop
CAPE operationalizes capability engineering through a closed loop: Specify → Verify → Correct → Train.
Specify. Convert capability requirements into executable policies. For structural properties (arithmetic correctness, citation format, safety patterns), this means symbolic policies written in CPL, our deterministic specification language. For semantic properties (reasoning validity, proof completeness, plan feasibility), this means learned verifiers trained on explicit rubrics.
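CPL's concrete syntax isn't shown in this post, so here's a rough sketch of what two symbolic policies could look like, written as plain Python predicates instead. Only the second policy ID comes from this post (it's quoted in the iteration section below); everything else is our illustration.

```python
# Hypothetical sketch of symbolic policies as deterministic predicates; not CPL syntax.
import re

def arithmetic_consistent(text: str) -> bool:
    """Every 'a + b = c' claim in the output must actually hold."""
    return all(int(a) + int(b) == int(c)
               for a, b, c in re.findall(r"(\d+)\s*\+\s*(\d+)\s*=\s*(\d+)", text))

def factual_claims_cited(text: str) -> bool:
    """Simplification: treat every sentence as a factual claim needing a [n] citation."""
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    return all(re.search(r"\[\d+\]", s) for s in sentences)

SYMBOLIC_POLICIES = {
    "policy.math.arithmetic_consistent": arithmetic_consistent,
    "policy.citation.factual_claims_cited": factual_claims_cited,
}

# A semantic property would instead get a learned verifier trained on an explicit rubric,
# e.g. ["every premise is stated", "no step assumes the conclusion", ...] (our wording).
```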
Verify. Parse model outputs into a structured representation called a PredicateGraph, then evaluate against policies. Symbolic verification is deterministic: same policy, same output, same verdict, every time. Learned verification uses rubric-trained models that detect specific failure modes rather than predicting annotator preferences.
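Here's what that evaluation step might look like once parsing is done. The PredicateGraph schema below is a stand-in we made up, since the real representation isn't specified in this post.

```python
# Sketch only: the real PredicateGraph schema is not specified in this post.
from typing import Callable, Dict

# Pretend the parser has already extracted structured predicates from the raw output.
predicate_graph = {
    "claims": ["7 + 3 = 10 [1]", "The dose was doubled [2]"],
    "citations": {"[1]", "[2]"},
}

def verify(graph: Dict, policies: Dict[str, Callable[[Dict], bool]]) -> Dict[str, bool]:
    """Deterministic: same policy, same output, same verdict, every time."""
    return {name: check(graph) for name, check in policies.items()}

policies = {
    "policy.citation.factual_claims_cited":
        lambda g: all(any(c in claim for c in g["citations"]) for claim in g["claims"]),
}

print(verify(predicate_graph, policies))  # {'policy.citation.factual_claims_cited': True}
```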
Correct. When violations occur, generate corrections that satisfy the specification. This might be deterministic patching ("change 7.1 to 7.095"), template insertion for missing elements, or constrained rewriting for semantic changes. Every correction is re-verified before acceptance.
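A sketch of deterministic patching plus the re-verify gate, in the same hypothetical style; the arithmetic patch mirrors the "change 7.1 to 7.095" idea on a simpler case.

```python
# Illustrative correction loop; strategy names are assumptions, not the released API.
import re
from typing import Callable, Dict, Optional

def patch_arithmetic(text: str) -> str:
    """Deterministic patching: recompute any wrong 'a + b = c' claim."""
    def fix(m: re.Match) -> str:
        a, b = int(m.group(1)), int(m.group(2))
        return f"{a} + {b} = {a + b}"
    return re.sub(r"(\d+)\s*\+\s*(\d+)\s*=\s*(\d+)", fix, text)

def correct(output: str,
            policies: Dict[str, Callable[[str], bool]],
            strategies: Dict[str, Callable[[str], str]]) -> Optional[str]:
    for name, check in policies.items():
        if not check(output) and name in strategies:
            output = strategies[name](output)
    # Every correction is re-verified before acceptance.
    return output if all(check(output) for check in policies.values()) else None

policies = {"policy.math.arithmetic_consistent":
            lambda t: all(int(a) + int(b) == int(c)
                          for a, b, c in re.findall(r"(\d+)\s*\+\s*(\d+)\s*=\s*(\d+)", t))}
strategies = {"policy.math.arithmetic_consistent": patch_arithmetic}

print(correct("So 12 + 30 = 41.", policies, strategies))  # "So 12 + 30 = 42."
```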
Train. Use verified-correct outputs as supervised training signal. The model learns to satisfy specifications by default, not through reward shaping or preference optimization, but through direct supervision on what correct looks like.
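Putting the loop together as training data, as a sketch: the verify and correct callables stand in for the verification and correction steps above, and the dataset format is an assumption on our part.

```python
# Sketch: turn verified-correct outputs into ordinary supervised fine-tuning examples.
def build_sft_dataset(records, verify, correct):
    """records: (prompt, model_output) pairs sampled from the current model.
    verify(output) -> {policy_name: bool}; correct(output) -> fixed output or None."""
    dataset = []
    for prompt, output in records:
        if not all(verify(output).values()):                      # Verify
            output = correct(output)                              # Correct (None if unfixable)
        if output is not None and all(verify(output).values()):   # corrections are re-verified
            dataset.append({"prompt": prompt, "completion": output})
    return dataset

# Each kept example is direct supervision on what correct looks like;
# fine-tune on it with any standard SFT trainer. No reward model is involved.
```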
This loop avoids the pathologies of reward-based methods: no length bias from loss normalization, no difficulty weighting from advantage computation, no reward hacking from proxy optimization. The signal is direct and clean.
Breaking the Preference Ceiling
CAPE's ceiling is technical, not human.
Preference-based methods plateau at human disagreement. No matter how much compute you throw at the problem, annotators will still disagree 30–50% of the time on nuanced comparisons. The ceiling is structural.
CAPE's ceiling is verification fidelity: how accurately we can parse outputs and evaluate them against specifications. For symbolic policies, this means extraction accuracy. For learned verifiers, this means rubric-following capability. Both improve with model scale.
In our experiments, each percentage point reduction in extraction error corresponded to approximately 0.8 percentage points reduction in violation rate (r = 0.94). Extraction quality determines your ceiling. And because extraction improves with model scale, CAPE benefits twice from continued progress: better base models to train, and better extractors to train them with.
This is a genuine scaling law: invest in verification quality, and model capability follows.
Performance
Violation rate measures capability directly: does the model satisfy the specification or not? Across 109,500 examples in six domains, CAPE reduces violation rates by 81% relative to DPO on the three domains with symbolic policies, and improves scores on the three domains evaluated by learned verifiers:
| Domain | DPO | CAPE |
|---|---|---|
| Arithmetic (violation rate) | 10.2% | 1.8% |
| Code Safety (violation rate) | 15.8% | 3.2% |
| Citation Grounding (violation rate) | 13.4% | 2.6% |
| Argument Soundness (verifier score) | 0.62 | 0.83 |
| Proof Validity (verifier score) | 0.58 | 0.81 |
| Code Correctness (verifier score) | 0.64 | 0.84 |
Lower is better for violation rates; higher is better for verifier scores.
Perhaps more importantly, CAPE's improvements are stable. While reward-based methods show high variance across random seeds (σ = 1.6–2.1%), CAPE achieves σ < 0.3%. When you run the same training with different seeds, you get essentially the same result.
Hybrid training works too. Running CAPE first to establish a correctness floor, then DPO to optimize quality within constraints, achieves both low violation rates (2.9%) and high preference win rates (63.7%).
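As a sketch of that ordering (every callable here is a placeholder, not the released API; build_sft_dataset refers to the Train-step sketch above):

```python
# Hypothetical two-stage schedule; the trainers are passed in as placeholders.
def hybrid_post_train(model, prompts, verify, correct, preference_pairs, sft_train, dpo_train):
    # Stage 1: CAPE establishes the correctness floor.
    outputs = [(p, model.generate(p)) for p in prompts]
    sft_data = build_sft_dataset(outputs, verify, correct)   # sketch from the Train step above
    model = sft_train(model, sft_data)

    # Stage 2: DPO optimizes preference quality within the constraints.
    model = dpo_train(model, preference_pairs)

    # Re-verify afterwards so preference optimization cannot silently erode the floor.
    still_violating = [p for p in prompts if not all(verify(model.generate(p)).values())]
    return model, still_violating
```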
Accessible Post-Training
CAPE reduces capability-specific post-training costs by 5–20× by replacing per-example annotation with reusable specifications.
Preference annotation costs $5–15 per comparison, and getting a quality signal requires multiple comparisons per example. Even a modest training set of 10,000 examples quickly runs to $50,000–150,000 in annotation costs alone. And that's per capability. Adding citation grounding after you've trained arithmetic requires a whole new annotation campaign.
CAPE's upfront cost is policy authoring: 2–5 days of domain-expert and engineer time, roughly $2,000–4,000 fully loaded. Once written, a policy pack generates unlimited training signal.
| Stage | RLHF/DPO | CAPE |
|---|---|---|
| Specification | — | $2,000–4,000 |
| Annotation | $50,000–150,000 | — |
| Verification Infrastructure | — | $200–400 |
| Compute (training) | $8,000–12,000 | $8,000–12,000 |
| Iteration Cycles | 3–5 rounds | 1–2 rounds |
| Total | $80,000–200,000 | $10,000–16,000 |
| Timeline | 2–4 months | 1–2 weeks |
The iteration cycle difference matters just as much. When a preference-trained model fails, you know something went wrong but not what. Diagnosing the problem requires new annotation campaigns. CAPE's explicit verdicts tell you exactly which policies fail on which outputs: "policy.citation.factual_claims_cited failed on 847 of 10,000 examples." Engineers fix the policy or the correction strategy directly.
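That kind of report is easy to produce from stored verdicts. A minimal sketch, assuming verdicts are kept as one per-policy pass/fail map per evaluated example:

```python
# Sketch: aggregate stored verdicts into a per-policy failure report.
from collections import Counter

def failure_report(verdicts):
    """verdicts: one {policy_name: bool} map per evaluated example."""
    total = len(verdicts)
    failures = Counter(name for v in verdicts for name, passed in v.items() if not passed)
    for name, count in failures.most_common():
        print(f"{name} failed on {count} of {total} examples")

failure_report([
    {"policy.citation.factual_claims_cited": False, "policy.math.arithmetic_consistent": True},
    {"policy.citation.factual_claims_cited": True,  "policy.math.arithmetic_consistent": True},
])
# -> policy.citation.factual_claims_cited failed on 1 of 2 examples
```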
This cost structure means capability engineering is accessible to far more teams. You don't need a massive annotation budget. You need domain expertise and engineering time.
The same policies also compound in value. As base models improve, your existing policy packs yield better results without rework. You're not re-annotating; you're re-running.
Introducing CapabilityBench
Current benchmarks measure intelligence: can the model solve this problem? But deployment requires knowing capability: does the model satisfy these specific requirements?
A score of 78% on a benchmark tells you nothing about whether a model can operate within your hospital's clinical workflow, navigate your firm's jurisdiction constraints, or execute your company's customer service protocol. It's an aggregate over problems you may never encounter, evaluated on criteria that may not match yours.
We're also launching CapabilityBench, a public registry that replaces opaque intelligence scores with traceable capability verdicts.
The idea is simple. Organizations and researchers contribute policy packs encoding capability requirements for specific domains. Models are evaluated against these policies, and results show exactly which requirements each model satisfies or violates.
The question shifts from "how smart is this model?" to "can this model do what I need?"
As the registry grows, CapabilityBench will become a shared resource for evaluating models against the requirements that actually matter for deployment.
We expect to launch CapabilityBench publicly in early 2026. If your domain has requirements that models should meet, we'd love to hear from you.
The Path Forward
For too long, post-training has been alchemy: collect preferences, train a reward model, hope the outputs improve. When something goes wrong, there's no specification to trace failures against, no verification to prove fixes work, no guarantee that improvement on one dimension doesn't regress another.
CAPE makes model improvement resemble traditional engineering. Requirements are explicit. Correctness is verifiable. Failures are traceable. Fixes are validated.
The ceiling is now technical, not human. Better verification yields more capable models. The path forward is clear.
We believe capability engineering completes the AI development stack. Context engineering provides the right information. Prompt engineering provides the right framing. Capability engineering provides the right constraints. Together, they transform intelligent models into capable systems.
The gap between intelligence and capability has defined the deployment bottleneck. CAPE closes it.