Anthropic Releases Claude Opus 4.7 with Record 64.3% SWE-bench Pro Score

By AI Bot


Anthropic released Claude Opus 4.7 on April 16, narrowly retaking the lead for the most powerful generally available large language model with a 10.9-point leap on the SWE-bench Pro coding benchmark. The new flagship lands just days after xAI, OpenAI, and Google each refreshed their own frontier lineups, and it comes amid a broader industry pivot toward agentic coding workloads.

Key Highlights

  • SWE-bench Pro score climbs from 53.4 percent on Opus 4.6 to 64.3 percent on Opus 4.7.
  • SWE-bench Verified rises from 80.8 percent to 87.6 percent; Terminal-Bench 2.0 lands at 69.4 percent.
  • Anthropic says production task resolution is roughly three times better on Rakuten's SWE-bench evaluation.
  • Pricing stays flat at 5 dollars per million input tokens and 25 dollars per million output tokens.
  • Vision input jumps from 1.15 megapixels to 3.75 megapixels, enabling dense screenshots and design mockups at full fidelity.
  • A new "xhigh" effort level sits between high and max, and Claude Code gains an /ultrareview command that simulates a senior human reviewer.

Details

The model ships across Amazon Bedrock, Google Cloud Vertex AI, and Microsoft Foundry on day one, matching the rollout playbook Anthropic used for Opus 4.6. Third-party evaluator LayerLens reports a separate jump on Humanity's Last Exam, where Opus 4.7 scored 30.8 percent compared to 18.6 percent for Opus 4.6 — a 12.2-point gain on one of the hardest contamination-resistant benchmarks in production.

The update is aimed squarely at agentic coding. Anthropic highlights that the model handles long-running tasks with more rigor, follows instructions more precisely, and verifies its own outputs before reporting back. In the Claude Code environment, the new /ultrareview command goes beyond syntax checks to flag subtle design flaws and logic gaps, positioning the feature against similar multi-agent review flows launched by OpenAI's Codex and GitHub Copilot.

The Mythos Asterisk

Anthropic's own system card concedes that Opus 4.7 does not advance the capability frontier. That title still belongs to Mythos, the internal model Anthropic has been piloting with a small group of partners reported to include Nvidia, JPMorgan Chase, Google, Apple, and Microsoft. Developers building on the public API pay the same price for what Anthropic describes as the second-tier version.

Independent reviewers flagged additional tradeoffs that the official announcement omits. According to TechLint Lab, long-context performance on the MRCR v2 benchmark at 1 million tokens dropped from 78.3 percent on Opus 4.6 to 32.2 percent on Opus 4.7, a 46-point regression documented in the system card but absent from the release blog. A new tokenizer also maps the same text to roughly 1.0 to 1.35 times as many tokens depending on content type, which analytics firm Finout estimates can translate into a 35 percent monthly cost increase on identical workloads despite the flat per-token pricing.
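The cost math behind Finout's estimate is straightforward: with per-token prices unchanged, a 1.35x token inflation maps directly to a 35 percent bill increase. A minimal sketch, using the article's published prices and an illustrative (assumed) monthly workload:

```python
# Sketch of the cost math: flat per-token pricing ($5/M input, $25/M output)
# combined with a tokenizer that can emit up to 1.35x as many tokens for the
# same text. The workload sizes below are illustrative assumptions.

PRICE_IN = 5.00 / 1_000_000    # dollars per input token
PRICE_OUT = 25.00 / 1_000_000  # dollars per output token

def monthly_cost(input_tokens, output_tokens, inflation=1.0):
    """Cost of a fixed monthly workload once token counts are inflated."""
    return (input_tokens * inflation * PRICE_IN
            + output_tokens * inflation * PRICE_OUT)

# Same workload under the old tokenizer vs. the worst-case new tokenizer.
base = monthly_cost(2_000_000_000, 400_000_000)         # Opus 4.6 tokenizer
worst = monthly_cost(2_000_000_000, 400_000_000, 1.35)  # Opus 4.7, worst case

print(f"baseline: ${base:,.0f}/mo, worst case: ${worst:,.0f}/mo "
      f"({(worst / base - 1) * 100:.0f}% increase)")
```

Because both input and output counts inflate by the same factor, the percentage increase is independent of the workload mix, which is why "flat pricing" can still mean a materially larger bill.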

Impact

For agentic coding teams, Opus 4.7 is an immediate upgrade. The SWE-bench Pro jump is one of the largest single-version gains the industry has seen this year, and the Rakuten production benchmark claim — three times more resolved tasks — suggests real-world coding copilots should see meaningful reliability improvements on hard tickets. Teams already wired into Claude Code can test the new /ultrareview command on existing pull requests without migration work.

Teams running long-context retrieval pipelines, multilingual question answering, or terminal-heavy agents may want to hold off or route those workloads elsewhere. GPT-5.4 still leads on agentic search at 89.3 percent versus Opus 4.7's 79.3 percent, and the 46-point drop on 1M-token MRCR v2 is meaningful for anyone feeding whole codebases or long transcripts into a single prompt.

Background

Opus 4.7 is the latest point release in Anthropic's 4.x line, following Opus 4.5 in late 2025 and Opus 4.6 in early 2026, and it arrives on the heels of the Sonnet 5 launch announced earlier this spring. The cadence underscores how quickly frontier labs are iterating: xAI's Grok 4.3 appeared on paid menus this week, OpenAI's GPT-5.4 shipped days earlier with computer-use and 4 million-token context, and Zhipu's open-source GLM-5.1 hit 58.4 percent on SWE-bench Pro just ten days before Opus 4.7 raised the bar to 64.3 percent.

Anthropic raised a 30 billion dollar Series G at a 380 billion dollar valuation earlier this year, giving the company the runway to keep shipping on this cadence while reserving Mythos-class models for strategic partners.

What's Next

Anthropic has not given a public timeline for Mythos general availability, but the pattern from Opus 4.6 suggests a 4.8 or Sonnet 5.1 point release could land within four to eight weeks. Expect the next updates to target the areas where Opus 4.7 lost ground: long-context retrieval, multilingual Q&A, and terminal-based coding. Developers considering a migration from Opus 4.6 should pilot on agentic coding workloads first, then measure token counts carefully before rolling out broadly.
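For the "measure token counts carefully" step, one lightweight approach is to replay the same prompts against both model versions and compare the token usage the API reports per request. A hypothetical helper, assuming usage records shaped like the Anthropic API's usage metadata (`input_tokens`/`output_tokens`); the sample numbers are illustrative:

```python
# Hypothetical pilot-phase check: given per-request token usage captured from
# two model versions on the same prompts, compute the overall token inflation
# ratio before rolling out broadly. Usage records mimic API usage metadata.

def token_inflation(old_usage, new_usage):
    """Ratio of total tokens on the new model to the old, same workload."""
    old_total = sum(u["input_tokens"] + u["output_tokens"] for u in old_usage)
    new_total = sum(u["input_tokens"] + u["output_tokens"] for u in new_usage)
    return new_total / old_total

# Illustrative usage logs from replaying two identical prompts.
old = [{"input_tokens": 1200, "output_tokens": 300},
       {"input_tokens": 800, "output_tokens": 250}]
new = [{"input_tokens": 1500, "output_tokens": 380},
       {"input_tokens": 1000, "output_tokens": 310}]

print(f"token inflation: {token_inflation(old, new):.2f}x")
```

Running this over a representative prompt sample gives a workload-specific inflation factor to plug into budget forecasts, rather than relying on the 1.0 to 1.35x range reported for content types in general.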


Source: VentureBeat — Anthropic releases Claude Opus 4.7

