News · Published 2026.05.29

Claude Opus 4.8 Launches, Leading Most Coding and Agentic Benchmarks

Anthropic has unveiled Claude Opus 4.8, its latest large language model. According to comparison figures released alongside the model, Opus 4.8 posts the highest score on six of the seven listed benchmarks, which span coding, reasoning, computer use, and knowledge work, ahead of its predecessor Opus 4.7 as well as GPT-5.5 and Gemini 3.1 Pro. The release puts particular emphasis on autonomous, agentic behavior in the developer tool Claude Code, where the model carries out long working sessions on its own.

Key points

Leads agentic coding: Per the published figures, Opus 4.8 scored 69.2% on SWE-Bench Pro, ahead of Opus 4.7 (64.3%), GPT-5.5 (58.6%), and Gemini 3.1 Pro (54.2%).
Strong on agentic and knowledge work: It also posted the highest marks among the compared models on computer use (OSWorld-Verified, 83.4%), knowledge work (GDPval-AA, 1890), and financial analysis (Finance Agent v2, 53.9%). One exception: GPT-5.5 led terminal coding (Terminal-Bench 2.1) at 78.2%, ahead of Opus 4.8's 74.6%.
Autonomy in Claude Code: Anthropic says Opus 4.8 makes tool calls like an experienced engineer without constant check-ins, holds its trajectory across long sessions, and follows repository work through to completion, so users can hand off an entire feature or a bug sweep.
'Fast mode' option: Separately, according to its info page, Fast mode is a high-speed configuration that delivers 2.5x faster output token speeds at the same Opus-level intelligence. Note that the page refers to Opus 4.7 rather than Opus 4.8. Fast mode is available as a research preview in Claude Code (with extra usage enabled) and through an API waitlist.

Benchmark comparison

Benchmark	Opus 4.8	Opus 4.7	GPT-5.5	Gemini 3.1 Pro
Agentic coding (SWE-Bench Pro)	69.2%	64.3%	58.6%	54.2%
Agentic terminal coding (Terminal-Bench 2.1)	74.6%	66.1%	78.2%	70.3%
Multidisciplinary reasoning (HLE, no tools)	49.8%	46.9%	41.4%	44.4%
Multidisciplinary reasoning (HLE, with tools)	57.9%	54.7%	52.2%	51.4%
Agentic computer use (OSWorld-Verified)	83.4%	82.8%	78.7%	76.2%
Knowledge work (GDPval-AA)	1890	1753	1769	1314
Agentic financial analysis (Finance Agent v2)	53.9%	51.5%	51.8%	43.0%
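
To make the comparison easier to check, the table can be tallied with a few lines of Python. This is only a sketch: the numbers are copied from the rows above, the names `MODELS`, `SCORES`, `leader`, and `margin_over_runner_up` are illustrative, and it assumes a higher score is better on every row. Margins are in percentage points for the percentage rows and in raw score points for GDPval-AA, whose unit the published figures do not state.

```python
# Scores copied from the comparison table; column order follows the table header.
MODELS = ["Opus 4.8", "Opus 4.7", "GPT-5.5", "Gemini 3.1 Pro"]

SCORES = {
    "SWE-Bench Pro": [69.2, 64.3, 58.6, 54.2],
    "Terminal-Bench 2.1": [74.6, 66.1, 78.2, 70.3],
    "HLE (no tools)": [49.8, 46.9, 41.4, 44.4],
    "HLE (with tools)": [57.9, 54.7, 52.2, 51.4],
    "OSWorld-Verified": [83.4, 82.8, 78.7, 76.2],
    "GDPval-AA": [1890, 1753, 1769, 1314],
    "Finance Agent v2": [53.9, 51.5, 51.8, 43.0],
}


def leader(row):
    """Return (model, score) for the highest score in a row.

    Assumption: higher is better on every benchmark in the table.
    """
    best = max(range(len(row)), key=lambda i: row[i])
    return MODELS[best], row[best]


def margin_over_runner_up(row):
    """Gap between the best and second-best score, rounded to one decimal."""
    ordered = sorted(row, reverse=True)
    return round(ordered[0] - ordered[1], 1)


# One entry per benchmark: (leading model, leading score, margin over runner-up).
summary = {
    name: (*leader(row), margin_over_runner_up(row))
    for name, row in SCORES.items()
}
opus_48_wins = sum(1 for model, _, _ in summary.values() if model == "Opus 4.8")

if __name__ == "__main__":
    for name, (model, score, margin) in summary.items():
        print(f"{name}: {model} leads with {score} (margin {margin})")
    print(f"Opus 4.8 leads {opus_48_wins} of {len(SCORES)} rows")
```

Run as a script, it reports Opus 4.8 leading six of the seven rows, with Terminal-Bench 2.1 as the one row led by GPT-5.5. The narrowest lead is on OSWorld-Verified, where Opus 4.8 is 0.6 points ahead of Opus 4.7.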

Analysis

What stands out in this release is less the individual benchmark wins than the weight placed on "delegability," meaning how much of a task can be handed off to the model. The ability to maintain context across a long session and see a task through to the end is what users actually notice in real-world work. Observers note that the yardstick for evaluating models is shifting from raw accuracy toward how long and how autonomously a model can be trusted with a task. The fact that a rival model still leads on one benchmark, terminal coding, also suggests the field has reached a stage where no single model dominates every task.

Source

Fast mode for Claude Opus — Claude
Benchmark figures: Anthropic's published comparison (Opus 4.8 / Opus 4.7 / GPT-5.5 / Gemini 3.1 Pro)
