Evals and tuning
Superficial delivers the deterministic, claim-level signals you need to actively tune out mistakes, providing correction pairs, preferences, and abstention data for your pre- and post-training pipelines.
Across leading models from OpenAI, Anthropic, xAI, and Google, Superficial increased average claim-level accuracy from 78.56% to 95.16%.
Workflow
Superficial turns model outputs into deterministic signals — extracting, verifying, and labelling claims, and generating correction pairs, preferences, and abstention data that plug directly into your pre- and post-training pipelines, with re-tests to prove accuracy gains.
Extract
Model outputs are decomposed into atomic claims — the smallest verifiable units of fact — moving beyond vague accuracy scores to a precise list of testable statements.
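As a concrete illustration (the `Claim` record and its fields below are assumptions for this sketch, not Superficial's actual schema), an atomic claim can be modelled as a small record that carries the claim text and the span of the output it was extracted from:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Claim:
    """One atomic, independently verifiable statement extracted from a model output."""
    claim_id: str       # stable id for joining with later verification results
    text: str           # the claim itself, phrased so it can be checked in isolation
    source_span: tuple  # (start, end) character offsets into the original output

# One model answer decomposes into several claims, each testable on its own.
output = "The Eiffel Tower is 330 m tall and was completed in 1889."
claims = [
    Claim("c1", "The Eiffel Tower is 330 m tall", (0, 30)),
    Claim("c2", "The Eiffel Tower was completed in 1889", (35, 56)),
]
```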
Verify
Each claim is assessed against trusted references using dynamically generated symbolic programs that enable deterministic checks, with span-linked evidence attached. This replaces subjective grading with traceable true/false/unknown outcomes.
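A minimal sketch of one such deterministic check, assuming a toy reference store and a hypothetical `check_height_claim` rule rather than Superficial's generated programs: the rule compares the number stated in the claim against the reference and returns a true/false/unknown verdict together with the rule fired and the evidence it used.

```python
import re

# Hypothetical reference store: structured facts the generated checks run against.
REFERENCE_FACTS = {"eiffel_tower_height_m": 330}

def check_height_claim(claim_text: str) -> dict:
    """One deterministic rule: compare the height stated in the claim to the reference.

    Returns a traceable true/false/unknown outcome instead of a subjective grade.
    """
    match = re.search(r"(\d+)\s*m\b", claim_text)
    if match is None:
        return {"verdict": "unknown", "rule": "height_check", "evidence": None}
    stated = int(match.group(1))
    expected = REFERENCE_FACTS["eiffel_tower_height_m"]
    return {
        "verdict": "true" if stated == expected else "false",
        "rule": "height_check",
        "evidence": {"reference_key": "eiffel_tower_height_m", "reference_value": expected},
    }

print(check_height_claim("The Eiffel Tower is 330 m tall"))  # verdict: true
print(check_height_claim("The Eiffel Tower is 300 m tall"))  # verdict: false
```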
Label
Verification results are transformed into structured training signals: correction pairs for inaccurate claims, preference data to rank completions, and abstention flags when the appropriate response is “unknown.” These outputs form high-quality inputs for model tuning.
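A sketch of that transformation under the same assumptions (the `to_training_signals` helper and its record shapes are illustrative, not a fixed schema): false claims yield correction pairs and preferences, unknown claims yield abstention targets.

```python
def to_training_signals(prompt, completion, verdicts, corrected_completion=None):
    """Fold claim-level verdicts into correction pairs, preferences, and abstention flags.

    Record shapes here are illustrative, not a fixed schema.
    """
    signals = []
    false_claims = [v for v in verdicts if v["verdict"] == "false"]
    unknown_claims = [v for v in verdicts if v["verdict"] == "unknown"]

    if false_claims and corrected_completion:
        # Correction pair: the flawed completion alongside a corrected rewrite.
        signals.append({"type": "correction_pair", "prompt": prompt,
                        "original": completion, "corrected": corrected_completion})
        # Preference: the corrected completion is ranked above the original.
        signals.append({"type": "preference", "prompt": prompt,
                        "chosen": corrected_completion, "rejected": completion})

    for v in unknown_claims:
        # Abstention flag: the references cannot settle this claim, so the
        # calibrated response is "unknown" rather than a confident guess.
        signals.append({"type": "abstention", "prompt": prompt,
                        "claim": v["claim"], "target": "unknown"})
    return signals


signals = to_training_signals(
    prompt="How tall is the Eiffel Tower?",
    completion="It is 300 m tall.",
    verdicts=[{"claim": "The Eiffel Tower is 300 m tall", "verdict": "false"}],
    corrected_completion="It is 330 m tall.",
)
```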
Tune
The signals generated by Superficial integrate into fine-tuning or post-training pipelines, enabling teams to systematically reduce hallucinations, improve calibration, and teach models when to abstain.
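As one way to wire this up (a sketch assuming the illustrative signal records above and a generic JSONL hand-off, not a specific trainer's format), the signals can be split into a supervised fine-tuning file and a DPO-style preference file:

```python
import json

def write_tuning_files(signals, sft_path="sft.jsonl", pref_path="preferences.jsonl"):
    """Split the signal records into an SFT dataset and a preference dataset.

    The JSONL layouts are generic sketches; adapt the keys to your trainer's schema.
    """
    with open(sft_path, "w") as sft, open(pref_path, "w") as pref:
        for s in signals:
            if s["type"] == "correction_pair":
                # Supervised fine-tuning example: prompt mapped to the corrected completion.
                sft.write(json.dumps({"prompt": s["prompt"],
                                      "completion": s["corrected"]}) + "\n")
            elif s["type"] == "preference":
                # DPO-style preference example: chosen vs rejected completion.
                pref.write(json.dumps({"prompt": s["prompt"],
                                       "chosen": s["chosen"],
                                       "rejected": s["rejected"]}) + "\n")
            elif s["type"] == "abstention":
                # Calibration example: the target behaviour is an explicit refusal.
                sft.write(json.dumps({"prompt": s["prompt"],
                                      "completion": "I don't know."}) + "\n")
```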
Re-test
After tuning is applied, new outputs can be re-evaluated through the workflow, delivering claim-level before-and-after metrics that quantify accuracy gains, reduced confident errors, and stronger abstention behaviour.
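The before-and-after comparison then reduces to simple arithmetic over the verdict lists; a sketch, assuming the same hypothetical true/false/unknown verdicts as above:

```python
def claim_metrics(verdicts):
    """Claim-level accuracy, confident-error rate, and abstention rate for one run."""
    total = len(verdicts)
    return {
        "accuracy": sum(v["verdict"] == "true" for v in verdicts) / total,
        "confident_error_rate": sum(v["verdict"] == "false" for v in verdicts) / total,
        "abstention_rate": sum(v["verdict"] == "unknown" for v in verdicts) / total,
    }

# Toy before/after runs over the same prompts: accuracy rises, confident errors fall.
verdicts_before = [{"verdict": v} for v in ["true", "false", "false", "unknown"]]
verdicts_after = [{"verdict": v} for v in ["true", "true", "true", "unknown"]]

before, after = claim_metrics(verdicts_before), claim_metrics(verdicts_after)
delta = {k: round(after[k] - before[k], 2) for k in before}
print(delta)  # {'accuracy': 0.5, 'confident_error_rate': -0.5, 'abstention_rate': 0.0}
```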
Superficial vs traditional approaches
Most evals stop at measurement. Human labelers are slow and inconsistent. Superficial delivers deterministic claim-level accuracy scoring and training inputs at machine speed.
| | Traditional evals | Traditional labelers | Superficial |
| --- | --- | --- | --- |
| Granularity | Whole outputs | Sentence/segment-level, subjective | Claim-level, deterministic |
| Speed | Automated, fast | Human-in-the-loop, slow | Machine-speed, instant |
| Consistency | Probabilistic | Subjective, variable | Deterministic, repeatable |
| Abstention | Ignored | Rarely captured | Explicit abstention signals |
| Output | Accuracy scores | Labels for training | Correction pairs, preferences, abstention inputs |
| Traceability | Limited | Human rationale notes | Source spans, rules fired |
| Cost | Low per run | High per dataset | Scales with usage, lower unit cost |
Labs
Turn evaluations into training leverage: claim-level breakdowns become correction pairs, preference signals, and abstention data you can feed directly into fine-tuning, making your models more truthful and better calibrated.
Enterprises
Win customer trust at scale: deliver audit-ready outputs with span-linked evidence and omission detection, backed by deterministic checks that give your clients and regulators confidence in every answer.
Startups
Compete on accuracy: show measurable gains against baselines and ship models that outperform rivals, with deterministic labels and proof you can use to build credibility fast.
Regulated industries
Deploy with defensible compliance: deterministic audits and calibration signals enforce abstention, delivering traceable proof and safer rollouts under regulatory scrutiny.
Get started with a free claim-level accuracy audit