OpenAI gpt-oss-120b Superfacts Benchmark results are in.
Highlights:
- Like other OpenAI models, gpt-oss-120b is particularly hallucination-prone, coming in 11th out of the 15 models tracked, though, interestingly, above non-reasoning OpenAI models like GPT-4o and GPT-4.1.
- At the claim level, gpt-oss-120b hallucinates 17.81% of the time, on par with OpenAI's closed reasoning models, o3 and o3 Pro, at around 18%.
- Like other OpenAI models, gpt-oss-120b appears to respond particularly well to Superficial one-shot audits, with 100% enhanced accuracy.
About Superfacts
Superfacts is the first claim-level hallucination benchmark for top AI models. Visit the benchmark at benchmarks.superficial.org.
About gpt-oss-120b
gpt-oss-120b is OpenAI's most powerful open-weight model. It outperforms similarly sized open models on reasoning tasks, demonstrates strong tool-use capabilities, and is optimized for efficient deployment on consumer hardware. Learn more at platform.openai.com/docs/models/gpt-oss-120b.