Key Takeaway
Claude Mythos Preview leads GPT-5.4 on coding (93.9% vs. ~78-80% on SWE-bench), cybersecurity (100% on Cybench vs. GPT-5.4's "High" classification), and math (97.6% vs. 95.2% on USAMO). GPT-5.4 leads on professional workflows (83% on GDPval, 91% on BigLaw Bench) and is available to anyone with a $20/month subscription. Mythos has no public access. For models you can actually use today: GPT-5.4 for professional work and computer use, Claude Opus 4.6 for coding.
Anthropic's Mythos Preview leads GPT-5.4 on nearly every benchmark that matters. It also doesn't have a price tag, a waitlist, or any path to public access. Here's what the numbers actually mean for the rest of us.
The two most talked-about AI models of 2026 could not have more different origin stories. OpenAI released GPT-5.4 on March 5 to paying ChatGPT subscribers worldwide, positioning it as their most capable frontier model for professional work. Anthropic announced Claude Mythos Preview on April 7 and immediately locked it behind a closed cybersecurity partnership with twelve organizations, declaring it too dangerous for public use. One model you can sign up for right now. The other you may never touch. And the one you can't touch is better at almost everything.
That "almost" is doing real work in the previous sentence, though. The benchmark comparison between these two models tells a more complicated story than "Mythos wins." GPT-5.4 brought capabilities to general availability that didn't exist in any shipping product a year ago: native computer use that exceeds human expert performance, a million-token context window, and professional workflow tools that match industry specialists across 44 occupations. Mythos Preview responded by posting scores that make GPT-5.4 look like a previous generation, then retreated behind a wall of NDAs and cybersecurity disclosures.
Here is every benchmark where we can compare them directly, what each number actually measures, and why the model that "wins" depends entirely on what you're trying to do.
The coding gap is the biggest story
Software engineering benchmarks are where Mythos Preview opens up a lead that's hard to explain away as noise or methodology differences.
On SWE-bench Verified (the standard test where models resolve real GitHub issues end-to-end), Mythos scored 93.9%. GPT-5.4 scored in the range of 78-80% depending on the evaluator and testing setup (Vals.ai measured 78.2% using a standardized configuration; other sources report approximately 80%). Claude Opus 4.6 scored 80.8% per Anthropic's own evaluation pipeline. The important point: Mythos leads GPT-5.4 by roughly 14-16 points regardless of which GPT-5.4 number you use. That gap is the difference between a model that handles virtually any standard debugging task and one that still fails on roughly one in five.
The harder the test, the wider the gap. SWE-bench Pro filters for the most difficult software engineering problems: tasks requiring larger code changes across multiple files in unfamiliar repositories. Mythos scored 77.8%. GPT-5.4 scored 57.7%. That is a 20-point lead on the benchmark most researchers now consider the definitive measure of real coding ability. For context, GPT-5.4 Mini, the free-tier version, scores 54.38% on the same test, just over 3 points behind the full GPT-5.4. The distance between GPT-5.4 Standard and its budget variant is smaller than the distance between GPT-5.4 Standard and Mythos.
SWE-bench Multimodal, which tests code reasoning alongside visual context like screenshots and GUI elements, produced the most dramatic split: Mythos at 59.0%, Opus 4.6 at 27.1%. OpenAI has not published a directly comparable GPT-5.4 score on this specific benchmark.
On Terminal-Bench 2.0, which measures autonomous multi-step terminal operations (the kind of agentic coding work that defines real-world developer workflows), Mythos hit 82.0%. OpenAI has not published a GPT-5.4 Terminal-Bench score, making direct comparison impossible on this benchmark. Given that Opus 4.6 scored 65.4%, the Mythos lead over the publicly available competition is clear regardless.
The bottom line on coding: if you are choosing a model purely for software engineering, Mythos Preview is in a different tier. The problem is that "choosing" requires being one of roughly 50 organizations on Earth that can access it.
GPT-5.4 wins on computer use, but Mythos closed the gap before anyone noticed
The headline feature of GPT-5.4's March launch was native computer use: the ability to interact with desktop software through screenshots, mouse commands, and keyboard inputs without plugins or wrappers. On OSWorld-Verified, the benchmark for autonomous desktop task completion, GPT-5.4 scored 75.0%, surpassing the human expert baseline of 72.4%. No model had crossed that threshold before.
Then Mythos quietly posted 79.6% on the same benchmark.
GPT-5.4 got credit for breaking the human ceiling. Mythos broke it further. But GPT-5.4 did it first, did it publicly, and did it in a product anyone with $20 per month can use. That matters more than the 4.6-point gap on a benchmark. Computer use capabilities are only valuable if they're deployed, and GPT-5.4 is deployed to millions of users right now. Mythos is deployed to cybersecurity researchers scanning FreeBSD kernels.
The math results are closer than you'd expect
On USAMO 2026, the USA Mathematical Olympiad evaluation (proof-based problems designed for the most elite high school mathematicians), Mythos scored 97.6% and GPT-5.4 scored 95.2%. That 2.4-point gap is meaningful at the tail end of a distribution this difficult, but it's also the narrowest gap in the entire comparison. Both models are solving competition-level math problems at rates that would have seemed fictional two years ago.
GPQA Diamond, the graduate-level scientific reasoning benchmark spanning physics, chemistry, and biology, shows a similar story: Mythos at 94.6%, GPT-5.4 at approximately 92.8%. Competitive, not dominant.
On Humanity's Last Exam (a benchmark specifically designed to remain unsolvable by current AI), Mythos scored 64.7% with tools enabled. GPT-5.4's exact HLE score with comparable tool access hasn't been widely reported in the same configuration, making direct comparison unreliable. Anthropic flagged that Mythos may show some memorization on HLE, which should make you skeptical of treating any HLE score as a clean reasoning measurement.
Where GPT-5.4 has no competition
Several benchmarks where GPT-5.4 excels have no published Mythos equivalent, which tells its own story about what each company prioritizes.
GDPval tests an AI agent's ability to complete well-specified professional knowledge work across 44 occupations spanning the top industries contributing to U.S. GDP. GPT-5.4 scored 83.0%, matching or exceeding industry professionals in 83 out of 100 comparisons. Tasks include real deliverables: sales presentations, accounting spreadsheets, urgent care schedules, manufacturing diagrams. GPT-5.2 scored 70.9% on the same benchmark. This 12-point jump represents GPT-5.4's strongest claim to "most useful model for actual work."
BigLaw Bench, testing legal document analysis, showed GPT-5.4 at 91%. On internal spreadsheet modeling tasks comparable to junior investment banking analyst work, GPT-5.4 hit 87.3% versus GPT-5.2's 68.4%.
ARC-AGI-2, the abstract reasoning benchmark designed to resist pattern matching, showed GPT-5.4 Standard at 73.3% and GPT-5.4 Pro at 83.3%. No comparable Mythos score has been published.
These gaps exist because Anthropic built Mythos's evaluation suite around cybersecurity and code, while OpenAI built GPT-5.4's around professional workflows and productivity. The models reflect different theories of what "frontier AI" is for.
Cybersecurity is not even close
This is where the comparison stops being a normal benchmark shootout and becomes something else entirely.
GPT-5.4 was classified as "High" cybersecurity capability under OpenAI's Preparedness Framework, the first general-purpose model to receive that designation. It performs well on capture-the-flag challenges (up from GPT-5's 27% in August 2025 to a significantly higher rate by March 2026) and achieved a 73.33% pass rate on OpenAI's Cyber Range benchmark. OpenAI described it as meeting "canary thresholds" that prevent ruling out the possibility it could scale cyber operations.
Mythos Preview scored 100% on Cybench (35 CTF challenges, perfect success rate across all trials) and 0.83 on CyberGym (targeted vulnerability reproduction in real open-source software). Anthropic described Cybench as "no longer sufficiently informative of current frontier model capabilities" because Mythos saturated it completely.
Then Mythos went beyond benchmarks. It autonomously discovered thousands of zero-day vulnerabilities in every major operating system, every major web browser, and dozens of critical open-source projects. It wrote working exploits for many of those vulnerabilities, including a 17-year-old remote code execution flaw in FreeBSD that grants root access to any unauthenticated user on the internet. We covered the full scope of Mythos's cybersecurity capabilities in our detailed breakdown of the 244-page system card.
GPT-5.4 is classified as potentially capable of scaling cyber operations. Mythos Preview is currently doing it, under controlled conditions, to patch the bugs before models like it become widespread.
The exploit development comparison is the starkest illustration. On the Firefox 147 JavaScript engine benchmark, Claude Opus 4.6 (the previous Anthropic flagship) turned vulnerabilities into working exploits only two times out of several hundred attempts. Mythos developed working exploits 181 times, a 90x improvement. GPT-5.4 has not been publicly evaluated on an equivalent exploit development task, but OpenAI's own system card acknowledges that GPT-5.4 "closely matches" GPT-5.3-Codex on cybersecurity evaluations, and prior Codex models could not autonomously develop sophisticated exploits. Anthropic described the gap between Opus 4.6 and Mythos on exploitation as a "different league." The gap between GPT-5.4 and Mythos on these tasks is likely similar or wider.
The two companies are also approaching cybersecurity release differently. OpenAI classified GPT-5.4 as "High" risk but shipped it anyway with added safeguards, monitoring, and a Trusted Access Pilot for security professionals. Anthropic classified Mythos as too dangerous to ship at all, restricting it to Project Glasswing partners while investing $100 million in usage credits and $4 million in open-source security donations. Neither approach is obviously wrong; they represent genuinely different philosophies about whether frontier AI cybersecurity capabilities should be broadly available with guardrails or narrowly restricted with privileged access.
The access and pricing comparison
This is the comparison that actually affects your life.
GPT-5.4 Standard costs $2.50 per million input tokens and $15.00 per million output tokens through the API. ChatGPT Plus ($20/month) includes GPT-5.4 Thinking. ChatGPT Pro ($200/month) includes GPT-5.4 Pro with dedicated compute. GPT-5.4 Mini is available to free-tier ChatGPT users. Five variants (Standard, Thinking, Pro, Mini, Nano) cover every budget from zero to enterprise. The context window is 1 million tokens (922K input, 128K output).
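To put the API rates above in perspective, here is a minimal back-of-the-envelope sketch in Python using the published per-token prices. The workload figures (requests per day, tokens per request) are hypothetical placeholders, not measured usage; swap in your own estimates.

```python
# Rough cost sketch for GPT-5.4 Standard API usage, based on the rates
# quoted above: $2.50 per 1M input tokens, $15.00 per 1M output tokens.
# The workload numbers below are illustrative assumptions only.

INPUT_PRICE_PER_MTOK = 2.50    # USD per 1M input tokens
OUTPUT_PRICE_PER_MTOK = 15.00  # USD per 1M output tokens

def monthly_cost(requests_per_day: int,
                 avg_input_tokens: int,
                 avg_output_tokens: int,
                 days: int = 30) -> float:
    """Estimate monthly API spend in USD for a steady workload."""
    total_in = requests_per_day * avg_input_tokens * days
    total_out = requests_per_day * avg_output_tokens * days
    return (total_in / 1_000_000) * INPUT_PRICE_PER_MTOK + \
           (total_out / 1_000_000) * OUTPUT_PRICE_PER_MTOK

# Hypothetical example: 500 requests/day, ~4K tokens in and ~1K tokens out
# per request works out to roughly $375/month.
print(f"${monthly_cost(500, 4_000, 1_000):,.2f}")
```

Even modest API traffic can cost an order of magnitude more than the $20/month Plus subscription, which is why the tiered plans matter for light users while the per-token API pricing matters for anyone building on top of the model.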
Claude Mythos Preview has no public pricing, no API access, no waitlist, no published context window specification, and no timeline for general availability. Anthropic stated they "do not plan to make Claude Mythos Preview generally available." Access is restricted to Project Glasswing partners (AWS, Apple, Google, Microsoft, and about 40 others) plus Anthropic's internal teams. The company plans to test new safeguards on an upcoming Claude Opus model first, then potentially apply those safeguards to "Mythos-class models" at some unspecified future date.
For developers, researchers, and businesses making decisions today, this comparison has a clear practical winner: GPT-5.4 is the best model you can actually use if you need computer use and professional workflow capabilities. Claude Opus 4.6 ($5/$25 per MTok) remains the best publicly available model for coding, scoring 80.8% on SWE-bench Verified, still ahead of GPT-5.4's ~78-80%. Mythos Preview is a preview of where the frontier is heading, not a tool you can deploy. For a broader view of what's available right now, our guide to the best AI tools in 2026 covers the full landscape.
The benchmark table
For direct comparison where both models have published scores:
| Benchmark | Mythos Preview | GPT-5.4 Standard | Gap | Measures |
|---|---|---|---|---|
| SWE-bench Verified | 93.9% | ~78-80% | +14-16 Mythos | Real-world code debugging |
| SWE-bench Pro | 77.8% | 57.7% | +20.1 Mythos | Hard multi-file engineering |
| OSWorld-Verified | 79.6% | 75.0% | +4.6 Mythos | Desktop computer use |
| USAMO 2026 | 97.6% | 95.2%* | +2.4 Mythos | Competition math proofs |
| GPQA Diamond | 94.6% | ~92.8% | +1.8 Mythos | PhD-level science reasoning |
| BrowseComp | 86.9% | 82.7% | +4.2 Mythos | Multi-step web research |
*USAMO GPT-5.4 score reported in Anthropic's system card, not OpenAI's self-report.
One important caveat: GPT-5.4 Pro, the premium variant available to ChatGPT Pro and Enterprise users, narrows the gap on several benchmarks. GPT-5.4 Pro scores 89.3% on BrowseComp (beating Mythos's 86.9%) and 83.3% on ARC-AGI-2. The Standard variant that most users access is not GPT-5.4's ceiling.
Benchmarks where only GPT-5.4 has published scores: GDPval (83.0%), BigLaw Bench (91%), ARC-AGI-2 (73.3%/83.3% Pro).
Benchmarks where only Mythos has published scores: CyberGym (0.83), Cybench (100%), Terminal-Bench 2.0 (82.0%), SWE-bench Multimodal (59.0%).
What this comparison actually tells you
The benchmark gap between Mythos Preview and GPT-5.4 is real, consistent, and substantial on coding and cybersecurity. On reasoning and science, the gap narrows to single digits. On professional knowledge work, GPT-5.4 may actually lead, though we can't confirm without comparable Mythos scores.
But the most important number in this entire comparison is zero: the number of people outside Anthropic's partner list who can use Mythos Preview today.
Anthropic's decision to restrict access reflects a genuine belief that Mythos-class cybersecurity capabilities are too dangerous for unrestricted deployment. Reasonable people can disagree about whether that's caution or competitive positioning (Anthropic is reportedly evaluating an IPO for October 2026, and a model too powerful to release makes for one hell of a road show slide). But the practical effect is the same: the best coding model in the world is sitting behind a locked door while GPT-5.4 ships to millions of users and Claude Opus 4.6 quietly holds the coding crown among publicly available models.
The frontier moved. The products didn't move with it. And for anyone choosing an AI model for real work in April 2026, the answer remains the same unsatisfying truth it has been all year: use GPT-5.4 for professional workflows and computer use, use Claude Opus 4.6 for production code and deep reasoning, and check back in six months to see if Anthropic lets anyone else play with the model that beats them both.
Frequently asked questions about Claude Mythos vs. GPT-5.4
Is Claude Mythos better than GPT-5.4?
On benchmarks, yes for most categories. Mythos leads GPT-5.4 by 14-16 points on SWE-bench Verified (coding), 20 points on SWE-bench Pro (hard engineering), and 2.4 points on USAMO (math). However, GPT-5.4 leads on professional knowledge work (83% on GDPval), legal analysis (91% on BigLaw Bench), and is the only one available to the public. The "better" model depends on whether you value benchmark scores or actual access.
Can I use Claude Mythos Preview?
No. Anthropic has stated Claude Mythos Preview will not be made generally available. Access is restricted to Project Glasswing partners (AWS, Apple, Google, Microsoft, and about 40 other organizations). There is no waitlist, no API access, and no announced timeline for public availability. Security professionals may eventually apply to an upcoming Cyber Verification Program.
How much does GPT-5.4 cost?
GPT-5.4 is available through multiple tiers. ChatGPT Plus ($20/month) includes GPT-5.4 Thinking. ChatGPT Pro ($200/month) includes the more powerful GPT-5.4 Pro variant. The API costs $2.50 per million input tokens and $15.00 per million output tokens. GPT-5.4 Mini is available free to all ChatGPT users. The context window supports up to 1 million tokens.
Which AI is better for coding in 2026?
Among models you can access: Claude Opus 4.6 leads at 80.8% on SWE-bench Verified, slightly ahead of GPT-5.4's ~78-80%. For the absolute best coding performance regardless of access, Mythos Preview scores 93.9% but is not publicly available. For most developers, Opus 4.6 is the strongest coding model you can actually use today.
What is the difference between Claude Mythos and GPT-5.4?
Mythos Preview excels at coding (93.9% SWE-bench), cybersecurity (autonomous zero-day discovery), and mathematical reasoning (97.6% USAMO). GPT-5.4 excels at professional workflows (83% GDPval across 44 occupations), computer use (75% OSWorld, surpassing human experts), and has a million-token context window. Mythos is restricted to security partners; GPT-5.4 is available to the public starting at $20/month.
Is Claude Opus 4.6 better than GPT-5.4?
For coding, slightly. Opus 4.6 scores 80.8% on SWE-bench Verified versus GPT-5.4's ~78-80%. For professional knowledge work, GPT-5.4 leads (83% GDPval, 91% BigLaw Bench). For computer use, GPT-5.4 leads at 75% on OSWorld. For context length, GPT-5.4 supports 1 million tokens versus Opus 4.6's 200K. Most developers use both: Opus for code-heavy work, GPT-5.4 for everything else.
