News · Published 2026.05.29
Claude Opus 4.8 Launches, Leading Most Coding and Agentic Benchmarks
Anthropic has unveiled Claude Opus 4.8, its latest large language model. According to comparison figures released alongside the model, Opus 4.8 outperforms its predecessor Opus 4.7 as well as GPT-5.5 and Gemini 3.1 Pro across most benchmarks in coding, computer use, and knowledge work. The release notably emphasizes autonomous, agentic behavior in the developer tool Claude Code, where the model carries out long working sessions on its own.
Key points
- Leads coding benchmarks: Per the published figures, Opus 4.8 scored 69.2% on SWE-Bench Pro, ahead of Opus 4.7 (64.3%), GPT-5.5 (58.6%), and Gemini 3.1 Pro (54.2%).
- Strong on agentic and knowledge work: It also posted the highest marks among the compared models on computer use (OSWorld-Verified, 83.4%), knowledge work (GDPval-AA, 1890), and financial analysis (Finance Agent v2, 53.9%). One exception: GPT-5.5 led terminal coding (Terminal-Bench 2.1) at 78.2%, ahead of Opus 4.8's 74.6%.
- Autonomy in Claude Code: Anthropic says Opus 4.8 makes tool calls like an experienced engineer without constant check-ins, holds its trajectory across long sessions, and follows repository work through to completion, so users can hand off an entire feature or a bug sweep.
- 'Fast mode' option: Separately, Anthropic's Fast mode info page describes a high-speed configuration that delivers 2.5x faster output token speeds while keeping the same Opus-level intelligence (the page itself refers to Opus 4.7). It is available as a research preview in Claude Code (with extra usage enabled) and via an API waitlist.
Benchmark comparison
| Benchmark | Opus 4.8 | Opus 4.7 | GPT-5.5 | Gemini 3.1 Pro |
|---|---|---|---|---|
| Agentic coding (SWE-Bench Pro) | 69.2% | 64.3% | 58.6% | 54.2% |
| Agentic terminal coding (Terminal-Bench 2.1) | 74.6% | 66.1% | 78.2% | 70.3% |
| Multidisciplinary reasoning (HLE, no tools) | 49.8% | 46.9% | 41.4% | 44.4% |
| Multidisciplinary reasoning (HLE, with tools) | 57.9% | 54.7% | 52.2% | 51.4% |
| Agentic computer use (OSWorld-Verified) | 83.4% | 82.8% | 78.7% | 76.2% |
| Knowledge work (GDPval-AA) | 1890 | 1753 | 1769 | 1314 |
| Agentic financial analysis (Finance Agent v2) | 53.9% | 51.5% | 51.8% | 43.0% |
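The comparison above can be checked mechanically. The following is a minimal Python sketch, not part of Anthropic's release, that transcribes the table and reports which model leads each row and by how much. The names `MODELS`, `SCORES`, and `leaders` are illustrative, and margins are in the table's own units (percentage points, except GDPval-AA, which the table reports as a raw score).

```python
# Figures transcribed from the benchmark comparison table in this article.
# Column order matches the table: Opus 4.8, Opus 4.7, GPT-5.5, Gemini 3.1 Pro.
MODELS = ["Opus 4.8", "Opus 4.7", "GPT-5.5", "Gemini 3.1 Pro"]

SCORES = {
    "SWE-Bench Pro": [69.2, 64.3, 58.6, 54.2],
    "Terminal-Bench 2.1": [74.6, 66.1, 78.2, 70.3],
    "HLE (no tools)": [49.8, 46.9, 41.4, 44.4],
    "HLE (with tools)": [57.9, 54.7, 52.2, 51.4],
    "OSWorld-Verified": [83.4, 82.8, 78.7, 76.2],
    "GDPval-AA": [1890, 1753, 1769, 1314],  # raw score, not a percentage
    "Finance Agent v2": [53.9, 51.5, 51.8, 43.0],
}


def leaders(scores, models):
    """Map each benchmark to (leading model, margin over the runner-up)."""
    result = {}
    for bench, row in scores.items():
        # Sort (score, model) pairs from highest to lowest score.
        ranked = sorted(zip(row, models), reverse=True)
        (top, leader), (second, _) = ranked[0], ranked[1]
        # Round to one decimal to hide floating-point noise.
        result[bench] = (leader, round(top - second, 1))
    return result


RESULT = leaders(SCORES, MODELS)

for bench, (leader, margin) in RESULT.items():
    print(f"{bench}: {leader} leads by {margin}")
```

On these figures, Opus 4.8 leads six of the seven rows, with its narrowest lead on OSWorld-Verified (0.6 points over Opus 4.7), while GPT-5.5 leads Terminal-Bench 2.1 by 3.6 points.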
Analysis
What stands out in this release is less the individual benchmark wins than the weight placed on "delegability." The ability to maintain context across a long session and see a task through to the end is what makes a noticeable difference in day-to-day work. The yardstick for evaluating models appears to be shifting from raw accuracy toward how long, and how autonomously, a model can be trusted with a task. The fact that a rival model still leads on terminal coding (GPT-5.5 on Terminal-Bench 2.1) also suggests the field has reached a stage where no single model dominates every task.
Sources
- Fast mode for Claude Opus — Claude
- Benchmark figures: Anthropic's published comparison (Opus 4.8 / Opus 4.7 / GPT-5.5 / Gemini 3.1 Pro)