
Claude Opus 4.7 vs 4.6: Upgrade If You Code, Wait If You Don't

Opus 4.7 scores 87.6% on SWE-bench Verified and triples image resolution, but costs up to 35% more per token than 4.6. Who should upgrade, who should wait.

Alex Chen · 9 min read

Key Takeaways

  • Claude Opus 4.7 scores 87.6% on SWE-bench Verified, a 6.8 point jump over Opus 4.6's 80.8%, and hits 64.3% on SWE-bench Pro versus 53.4% for 4.6.
  • Vision processing tripled: Opus 4.7 handles images up to 2,576 pixels on the long edge (roughly 3.75 megapixels), and XBOW's visual-acuity benchmark jumped from 54.5% on 4.6 to 98.5% on 4.7.
  • Pricing is unchanged at $5 per million input tokens and $25 per million output tokens, but a new tokenizer maps the same input to between 1.0 and 1.35 times as many tokens, depending on content type.
  • Opus 4.7 follows instructions more literally than 4.6. Prompts written for 4.6's looser interpretation can produce different outputs on 4.7 without code changes.
  • Anthropic disclosed that 4.7 is "modestly weaker" than 4.6 on harm-reduction advice around controlled substances.

Opus 4.7 scores 87.6% on SWE-bench Verified, handles images at three times the resolution of Opus 4.6, and fixes a loop problem that was breaking long-running agent workflows. It also uses more tokens per request (up to 35% more for the same input, depending on content type). For most readers, those three sentences are the whole decision.

Anthropic shipped Claude Opus 4.7 on April 16, 2026, ten weeks after Opus 4.6 landed on February 5. Pricing is identical ($5 per million input tokens, $25 per million output tokens), the API is mostly compatible, and Anthropic is explicitly calling 4.7 a direct upgrade path. That framing is true but incomplete. The decision of Claude Opus 4.7 vs 4.6 isn't "upgrade or don't," it's "which workloads benefit and which ones don't care." For agentic coding, vision-heavy tasks, and long-running autonomous work, 4.7 is a meaningful step up. For short conversational workloads and stable production pipelines, the gains are marginal and the migration has real costs.

What actually changed

Start with the coding benchmarks. On SWE-bench Verified, which tests whether a model can resolve real GitHub issues in production codebases, Opus 4.6 scored 80.8% and Opus 4.7 hits 87.6%. That's a 6.8 point jump. On SWE-bench Pro, which is harder and more representative of production engineering, 4.6 scored 53.4% and 4.7 reaches 64.3%. That's an 11-point lift, which isn't as splashy as Anthropic's launch graphics suggest but is still meaningful on a benchmark where top frontier models cluster in the mid-50s. Terminal-Bench 2.0 went from 65.4% to 69.4%.

The launch partner numbers back this up. Cursor reported their internal CursorBench score jumped from 58% on Opus 4.6 to over 70% on Opus 4.7. Rakuten said 4.7 resolves three times more production tasks than 4.6 on their internal SWE benchmark. Hex, which runs a 93-task coding evaluation, saw a 13% lift in task resolution, and 4.7 solved four tasks that neither Opus 4.6 nor Sonnet 4.6 could.

Agentic reliability is where the real-world gains concentrate. Notion reported 4.7 delivered a 14% accuracy improvement over 4.6 "at fewer tokens and a third of the tool errors," and described 4.7 as the first model to keep executing through tool failures that used to stop Opus cold. Genspark's team specifically called out loop resistance: their Super Agent benchmark found Opus 4.6 would loop indefinitely on roughly 1 in 18 queries, and 4.7 substantially reduces that. Factory Droids saw a 10% to 15% lift in task success with fewer tool errors. Vercel noted new behavior the team hadn't seen before: 4.7 "does proofs on systems code before starting work."

The vision upgrade is the only architectural change in the release. Opus 4.7 can process images up to 2,576 pixels on the long edge, which Anthropic describes as roughly 3.75 megapixels and more than three times the resolution prior Claude models could handle. XBOW, which uses Claude for autonomous penetration testing, reported their visual-acuity benchmark jumped from 54.5% on Opus 4.6 to 98.5% on Opus 4.7. That's a new threshold, not a better score on the same benchmark: work that couldn't reliably run on 4.6 (dense screenshots, pixel-perfect diagram interpretation, serious computer-use agents) now runs on 4.7.
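
The long-edge limit has a practical consequence: images larger than the cap get downscaled, so it's worth checking dimensions before sending. Here's a minimal sketch of that check, assuming the 2,576-pixel figure cited in this article (taken from launch coverage, not from API documentation):

```python
# Sketch: check an image against the 2,576 px long-edge limit this
# article cites for Opus 4.7. The constant is an assumption from the
# launch coverage, not a documented API value.

LONG_EDGE_LIMIT = 2576  # pixels on the longer side

def fits_without_resize(width: int, height: int) -> bool:
    """True if the image can be sent at native resolution."""
    return max(width, height) <= LONG_EDGE_LIMIT

def downscale_dims(width: int, height: int) -> tuple[int, int]:
    """Dimensions scaled so the long edge equals the limit, preserving
    aspect ratio. Returns the input unchanged if already within bounds."""
    long_edge = max(width, height)
    if long_edge <= LONG_EDGE_LIMIT:
        return width, height
    scale = LONG_EDGE_LIMIT / long_edge
    return round(width * scale), round(height * scale)
```

A 4K screenshot (3840 x 2160), for example, would come down to 2576 x 1449, while anything already under the limit passes through untouched.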

Not every change is an improvement. Anthropic's own alignment team disclosed in the launch post that 4.7 is "modestly weaker" than 4.6 on harm-reduction advice around controlled substances. For most readers this doesn't matter, but anyone working on health information or substance-related content workflows should know the regression exists.

What stayed the same

Migration isn't architecturally disruptive because most of what made 4.6 useful is still here.

| Feature | Opus 4.6 | Opus 4.7 |
| --- | --- | --- |
| Input pricing | $5 per million tokens | $5 per million tokens |
| Output pricing | $25 per million tokens | $25 per million tokens |
| Context window | 1 million tokens | 1 million tokens |
| Max output | 128,000 tokens | 128,000 tokens |
| Prompt caching discount | Up to 90% | Up to 90% |
| Batch processing discount | 50% | 50% |
| Availability | Claude API, Bedrock, Vertex AI, Foundry | Claude API, Bedrock, Vertex AI, Foundry |
| Model string | claude-opus-4-6 | claude-opus-4-7 |

Both models support the full feature set around tool use, PDF support, vision, code execution, bash, and computer use. If you're already on 4.6, the code you've written doesn't need to change to call 4.7: swap the model string and you're running.
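
A minimal sketch of what the swap looks like, assuming a request body shaped like Anthropic's Messages API; the model strings come from the table above, and `build_request` is a hypothetical helper for illustration:

```python
# Sketch of the drop-in swap: only the model string changes. Request
# shape mirrors Anthropic's Messages API; build_request is a
# hypothetical helper, not part of any SDK.

def build_request(model: str, prompt: str, max_tokens: int = 1024) -> dict:
    """Assemble a request body; everything except `model` is identical
    between Opus 4.6 and 4.7."""
    return {
        "model": model,
        "max_tokens": max_tokens,
        "messages": [{"role": "user", "content": prompt}],
    }

old = build_request("claude-opus-4-6", "Summarize this diff.")
new = build_request("claude-opus-4-7", "Summarize this diff.")
# The two payloads differ only in the model field.
```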

The two catches nobody's mentioning

Catch one: the tokenizer changed. Opus 4.7 uses a new tokenizer that processes text differently than 4.6. Anthropic's migration guide says the same input can map to between 1.0 and 1.35 times as many tokens, depending on content type, and that "token efficiency can vary by workload shape." At the high end of that range, a workload that cost $1,000 a month on Opus 4.6 costs $1,350 a month on Opus 4.7 for identical inputs, before you account for the second catch. Box's head of AI reported the opposite effect for their evaluations: a 56% reduction in model calls and 50% reduction in tool calls on 4.7 compared to 4.6. Net token cost depends on whether 4.7's efficiency gains from fewer calls offset the per-token overhead, which is workload-specific and requires real traffic to measure.

Catch two: 4.7 thinks more. Anthropic's own framing is that "Opus 4.7 thinks more at higher effort levels, particularly on later turns in agentic settings." This produces better results on hard problems, but it generates more output tokens. Output tokens cost five times input tokens ($25 vs $5 per million), so increased reasoning has a real cost signature. In Anthropic's own internal agentic coding evaluation, token usage net-improves across effort levels, but they explicitly recommend measuring on real traffic. Don't assume parity.
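
The two catches can be combined into a rough cost model. This sketch uses the prices quoted in this article ($5 and $25 per million input and output tokens) and treats the tokenizer change as a simple multiplier on input tokens; it deliberately ignores changes in output volume from longer reasoning:

```python
# Back-of-envelope cost sketch using the prices in this article:
# $5 per million input tokens, $25 per million output tokens, and the
# 1.0-1.35x tokenizer multiplier from Anthropic's migration guide.
# Illustrative only; real bills depend on actual traffic.

INPUT_PRICE = 5.0 / 1_000_000    # dollars per input token
OUTPUT_PRICE = 25.0 / 1_000_000  # dollars per output token

def monthly_cost(input_tokens: int, output_tokens: int,
                 tokenizer_multiplier: float = 1.0) -> float:
    """Estimated dollar cost; the multiplier models 4.7's tokenizer
    mapping the same text to more input tokens."""
    return (input_tokens * tokenizer_multiplier * INPUT_PRICE
            + output_tokens * OUTPUT_PRICE)

# The same 100M-input / 20M-output workload at both ends of the range:
base = monthly_cost(100_000_000, 20_000_000)         # 4.6 baseline
worst = monthly_cost(100_000_000, 20_000_000, 1.35)  # 4.7 worst-case input
```

With that split, a $1,000-a-month baseline becomes $1,175 when only the input side inflates at 1.35x, before any increase in reasoning output pushes it higher.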

One lever partially offsets the higher token costs. Hex's team reported that "low-effort Opus 4.7 is roughly equivalent to medium-effort Opus 4.6." If your workload currently runs at a mid-range effort setting on 4.6, you may be able to drop one tier on 4.7 and come out ahead on cost and quality simultaneously. That requires testing, and dropping an effort level isn't a universally applicable strategy.

The third change is a migration warning, not a hidden cost. Opus 4.7 follows instructions more literally than 4.6. Anthropic's launch post flags that previous models interpreted instructions loosely or skipped parts entirely, and 4.7 takes them at face value. Prompts written for 4.6 that relied on the model filling in reasonable interpretations of vague instructions can produce different, sometimes wrong outputs on 4.7. The more ambiguous the prompt, the bigger the migration risk.

Who should switch and who shouldn't

Switch now if any of the following apply.

  • You run agentic coding workflows where the model uses tools, writes multi-file changes, or handles long-running tasks. The 11-point SWE-bench Pro lift and the partner reports from Cursor, Rakuten, Devin, and Factory all point in the same direction: 4.7 handles complex engineering work noticeably better than 4.6.
  • You do vision-heavy work involving dense screenshots, technical diagrams, or high-resolution imagery. The XBOW result (54.5% to 98.5%) is the strongest single data point in the entire launch.
  • You're a Claude Code user and want access to the new /ultrareview slash command and extended auto mode.
  • You need honest uncertainty handling in analytical workflows. Hex's review is specific: 4.7 reports when data is missing instead of generating plausible-but-incorrect fallbacks. For finance, legal, and research work, that behavior is worth more than any benchmark point.

Test carefully before migrating if you're in one of these situations.

  • Your token budget is tight, because the tokenizer change plus increased reasoning can add meaningfully to costs at equivalent effort levels, depending on workload type.
  • Your production prompts are stable, well-tuned, and ambiguous in ways 4.6 handled by interpretation; the literalism change will surface every vague instruction in your prompt library.
  • You rely on the old thinking: {type: "enabled", budget_tokens: N} syntax, which is now deprecated on both 4.6 and 4.7 in favor of thinking: {type: "adaptive"} with the effort parameter.
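
A small sketch of the syntax change described above. The two parameter shapes are copied from the text; treat the helper as illustrative, not validated SDK usage:

```python
# The two thinking-parameter shapes this section quotes: the deprecated
# budget-token form and the new adaptive form. Field names come from
# the article; this is a shape sketch, not verified SDK behavior.

deprecated_thinking = {"type": "enabled", "budget_tokens": 8192}
adaptive_thinking = {"type": "adaptive"}  # paired with an effort setting

def migrate_thinking(params: dict) -> dict:
    """Rewrite a request body that still uses the deprecated syntax;
    requests already on the adaptive form pass through unchanged."""
    thinking = params.get("thinking", {})
    if thinking.get("type") == "enabled":
        params = dict(params)  # shallow copy; don't mutate the caller's dict
        params["thinking"] = {"type": "adaptive"}
    return params
```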

Stay on 4.6 if your workloads are short, conversational, or single-turn. The gains from 4.7 concentrate on long agentic work, hard coding problems, and vision tasks. For chatbots, Q&A, short-form content generation, and extractive work, you won't notice the difference, and you'll pay the token overhead. Also stay on 4.6 if you have workflows involving harm reduction advice around controlled substances; Anthropic's own disclosure on this is unambiguous, and a modestly worse model on a safety-adjacent topic is a bad trade.

For anyone launching a new project today, start on 4.7. The migration friction only applies to existing 4.6 pipelines. Greenfield code written against 4.7's literalism gets the full benefit without the re-tuning cost.

A note on the benchmarks

Anthropic's own memorization screens flag a subset of problems in the SWE-bench Verified, SWE-bench Pro, and SWE-bench Multilingual evaluations, meaning some test questions may have leaked into training data. The company explicitly stated in the launch post that excluding problems showing signs of memorization, Opus 4.7's margin of improvement over Opus 4.6 still holds. That disclosure matters: benchmark numbers, even from first-party sources, should be read as directional rather than definitive. The partner reports from Rakuten, Cursor, Hex, and Devin are in some ways more useful than the headline benchmark table, because they reflect real workloads on proprietary tasks that weren't in any training set.

On safety: Anthropic's automated behavioral audit found Opus 4.7 shows a "similar safety profile" to Opus 4.6, with modest improvements in prompt-injection resistance and honesty, and modest regressions in a few other areas including the controlled-substance harm-reduction disclosure mentioned earlier. Anthropic's own characterization of 4.7's alignment is "largely well-aligned and trustworthy, though not fully ideal in its behavior." That's careful language, not marketing language.

What 4.7 tells you about what's coming

Anthropic's launch post devoted significant space not to Opus 4.7 but to a different model: Claude Mythos Preview. Mythos is a more capable model that Anthropic has already built and is deliberately withholding from public release because of cybersecurity concerns. Anthropic announced Project Glasswing last week as the framework for eventually deploying Mythos-class models safely.

Opus 4.7 is the first public model shipping with the cybersecurity safeguards Anthropic wants to validate before a broader release. The launch post says so directly: "Opus 4.7 is the first such model: its cyber capabilities are not as advanced as those of Mythos Preview (indeed, during its training we experimented with efforts to differentially reduce these capabilities)." Legitimate security researchers can apply to a new Cyber Verification Program to get reduced restrictions for penetration testing and vulnerability research.

For anyone doing six-to-twelve-month planning, the implication is that Opus 4.7 is the current best commercially available model, but it isn't Anthropic's current best model. Reports from The Information suggest a Mythos-class release could surface in May, though Anthropic hasn't confirmed timing. If your workloads would benefit substantially from the jump beyond 4.7, that's a planning window worth tracking rather than an upgrade to make today.

The short version

Opus 4.7 is the right default for new projects and the right upgrade for agentic coding, vision work, and long-running autonomous tasks. Stay on 4.6 if your workloads are short and conversational, your token budget is sensitive, or your prompts are tuned to 4.6's looser instruction-following. Don't make this decision based on the benchmark table alone. Make it based on whether your specific use case sits in the narrow window where 4.7's gains show up.

If you can't tell which category you're in, you're probably in the second one. The gains from 4.7 show up on work that was already hard for 4.6: multi-hour agent runs, dense visual tasks, engineering problems the model was failing at before. If your Claude usage today is mostly short-form chat or single-turn extraction, and it's mostly working, you'll pay the token overhead on 4.7 without seeing the capability payoff. Wait for the next model.

Written by Alex Chen

Technology journalist who has spent over a decade covering AI, cybersecurity, and software development. Former contributor to major tech publications. Writes about the tools, systems, and policies shaping the technology landscape, from machine learning breakthroughs to defense applications of emerging tech.
