23 Apr 2026 — Tyler Wright

Mozilla's Firefox numbers give us the best AI vulnerability benchmark yet

Opus found 22 bugs. Mythos found 271. The same codebase. The same technique. The gap tells you everything about the backlog of bugs still waiting to be found.

Mozilla has inadvertently run the most instructive AI vulnerability benchmark in the public record, and the numbers deserve more attention than they've received.

Between February and April 2026, Mozilla's Firefox team ran two successive AI-assisted security reviews — first with Anthropic's Frontier Red Team using Claude Opus 4.6, then with an early access version of Claude Mythos Preview. The reviews, reported in two separate Mozilla blog posts, covered the same codebase, used materially the same technique, and targeted the same class of vulnerabilities. The first review found 22 security-sensitive bugs, fixed in Firefox 148. The second found 271, fixed in Firefox 150. That is a 12-fold increase against the same target, with the same methodology, six weeks apart.

It is not a scientific study. Mozilla wasn't running a controlled experiment. But it is something rare in AI security discourse: a real-world, independently reported, directionally consistent data point from a credible organisation on a hardened and widely scrutinised codebase. Firefox has been subject to continuous fuzzing, static analysis, and elite human research for more than two decades. If there were easy bugs left, they would have been found before now. The bugs that remained were the hard ones — logic errors that fuzzers miss, vulnerability classes requiring genuine source-level reasoning. Both Claude Opus and Claude Mythos found them. Mythos found twelve times as many.

What the comparison tells us

Mozilla's second post makes an observation that practitioners should sit with: the Firefox team found no category or complexity of vulnerability that elite human researchers can find but Mythos Preview cannot. The model is described as every bit as capable as the world's best security researchers, but able to operate across the entire codebase simultaneously rather than serially, over months of concentrated human effort.

The Opus review, which preceded this, had already demonstrated that AI-assisted vulnerability analysis was substantially beyond what traditional automated tooling could produce. Fuzzers — automated tools that feed software unexpected inputs to trigger crashes — miss logic errors. Elite humans find logic errors, but slowly, and only in the parts of the codebase they actually read. Claude Opus found 22 bugs across both categories, the crash-inducing flaws fuzzers target and the logic errors they miss, that had survived decades of the best coverage Firefox's team and community could provide.

Mythos found 271.
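
To make the fuzzer-versus-reasoning distinction concrete, here is a contrived Python sketch (ours, not drawn from the Mozilla reports) of the kind of bug a crash-oriented fuzzer structurally cannot see:

```python
# Contrived example: a pure logic error. Nothing below ever crashes or
# raises, so a fuzzer mutating inputs in search of crashes finds nothing.
# The bug is only visible to something that reads and reasons about source.

ROLES = {"viewer": 1, "editor": 2, "admin": 3}

def can_delete(user_role: str, user_name: str, resource_owner: str) -> bool:
    """Intended rule: admins delete anything; others only their own resources."""
    if ROLES.get(user_role, 0) >= ROLES["admin"]:
        return True
    # Bug: compares resource_owner to itself instead of to user_name, so
    # every user with any role can delete every resource. The function
    # happily returns True on all inputs: no crash, no signal for fuzzing.
    return resource_owner == resource_owner

if __name__ == "__main__":
    # A fuzzer hammering this with random strings never triggers a fault;
    # the authorisation bypass is silent.
    assert can_delete("viewer", "mallory", "alice")  # should be False
```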

The directional implication is unambiguous. If your organisation is waiting for Mythos-class access before beginning AI-assisted security review, you are already behind. Opus-class capability is accessible today, produces real findings, and represents a meaningful step ahead of the baseline for most Australian organisations that have not yet integrated AI into their security workflows.

Why "wait for Mythos" is the wrong frame

While Mythos access remains limited, Opus can serve as your first foray into AI-driven vulnerability testing and an excellent testbed for the workflow. After all, it found 22 bugs in one of the world's most scrutinised codebases on its first pass. Imagine what it could do for your stack.

The framing error that matters here is treating AI-assisted vulnerability review as a binary — either you have frontier capability or you don't — and using that binary to justify inaction. The Mozilla data shows the framing is wrong. Opus produced findings that traditional tooling had missed for years. The Mythos results show there is meaningfully more capability available at the frontier, but Opus finding 22 bugs against a hardened target is not a story about limitation. It is a story about what is already discoverable that your existing tools are not finding.

The practical implication for most organisations: your attack surface is not Firefox. Firefox is a massively complex multi-platform browser codebase with a full-time red team, continuous fuzzing infrastructure, and a global open-source security community. Whatever your codebase looks like — internal application stack, exposed API layer, legacy components supporting regulated services — the coverage gap relative to what AI-assisted analysis can produce is almost certainly larger than Mozilla's was.

The Australian context

Australian organisations have limited direct exposure to Project Glasswing, the coalition patching foundational software. Its participants are concentrated among US and global technology firms. What Australian organisations do inherit is the downstream benefit of patches shipped by their software dependencies — Mozilla's Firefox 150, for instance, contains fixes for 271 previously unknown vulnerabilities.

But that dependency encourages a structural assumption Australian cyber teams should not be making: that someone upstream is running the analysis so you don't have to. The Glasswing coalition is patching foundational software. It is not reviewing your applications, your integrations, or your organisation-specific deployment configurations.

APRA's Prudential Standard CPS 234 Information Security requires APRA-regulated entities to maintain information security capabilities commensurate with their threat environment and vulnerability posture. The ASD's Essential Eight Maturity Model treats application patching as a minimum-baseline control. Neither framework yet speaks directly to AI-assisted vulnerability analysis, but the underlying expectation is clear: organisations are expected to apply emerging security techniques as they become available.

AI-assisted vulnerability review is no longer emerging. It is available. Mozilla's numbers make clear that it produces findings that other techniques do not.

What a sensible first step looks like

The architecture for an Opus-assisted review is not technically complex. Anthropic's API is available. The approach Mozilla validated — structured source analysis with minimal reproducible test case generation — can be scoped and executed against a defined codebase. The output is actionable findings, not a list of theories.
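
As a sketch of how little scaffolding that takes, here is a minimal review loop, assuming the anthropic Python SDK and an ANTHROPIC_API_KEY in the environment. The model identifier is a placeholder and the prompt is our own illustrative choice, not Mozilla's actual harness:

```python
# Illustrative loop only: the model identifier below is a placeholder and
# the prompt is our own, not Mozilla's actual harness. Assumes the
# `anthropic` Python SDK and an ANTHROPIC_API_KEY in the environment.
import pathlib

import anthropic

MODEL = "claude-opus-4-6"  # placeholder; check Anthropic's current model list

SYSTEM = (
    "You are performing a security review of source code. Report only "
    "concrete, source-grounded vulnerabilities. For each finding give: "
    "file, line range, vulnerability class, and a minimal reproducible "
    "test case."
)

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment


def review_file(path: pathlib.Path) -> str:
    """Send one source file for review and return the model's findings."""
    source = path.read_text(errors="replace")
    response = client.messages.create(
        model=MODEL,
        max_tokens=4096,
        system=SYSTEM,
        messages=[
            {"role": "user", "content": f"Review this file:\n\n# {path}\n{source}"}
        ],
    )
    return response.content[0].text


if __name__ == "__main__":
    # Scope discipline in miniature: a defined tree, not "everything".
    for path in sorted(pathlib.Path("src").rglob("*.py"))[:20]:
        print(f"=== {path} ===")
        print(review_file(path))
```

A production pass would chunk files that exceed the context window and carry cross-file context between calls, but the shape of the loop does not change.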

The harder question for most organisations is not access but scope discipline: what do you point the model at first, what do you do with the findings, and how do you integrate the workflow into ongoing security operations rather than treating it as a one-off exercise. That is the space where practitioner experience matters.
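
One way to keep the workflow from becoming a one-off exercise is to force every finding into a record your existing triage already understands. A minimal sketch, where the field names are our suggestion rather than any standard:

```python
# One possible findings record for folding AI review output into the
# security operations you already run. Field names are our suggestion,
# not a standard; adapt to whatever tracker or register you use.
import json
from dataclasses import dataclass, asdict

@dataclass
class Finding:
    file: str
    lines: str         # e.g. "88-102"
    vuln_class: str    # e.g. "improper authorisation"
    repro: str         # the minimal reproducible test case the review produced
    model: str         # which model found it; matters when you re-run after upgrades
    status: str = "new"  # new -> triaged -> fixed | false_positive

findings = [
    Finding(
        file="api/auth.py",
        lines="88-102",
        vuln_class="improper authorisation",
        repro="DELETE /resource/42 as role=viewer succeeds",
        model="claude-opus-4-6",  # placeholder identifier, as above
    ),
]

# Serialise into whatever queue your team already triages from.
print(json.dumps([asdict(f) for f in findings], indent=2))
```

Recording the model alongside each finding is the piece that turns this into an ongoing workflow: when a more capable model ships, as Mythos did six weeks after Opus, you re-run the same scope and diff the findings.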

The Mozilla case study is useful precisely because it is not a vendor pitch. It is a disclosure from a security-serious organisation about what they actually found, using tools that are actually available. The numbers make the argument plainly: the bugs are there, the tools can find them, and the question is whether you run the analysis or wait for an adversary to do it first.

Getting the analysis done

Artificer Cyber runs AI-assisted code and architecture reviews for organisations that want structured findings rather than a slide deck of generic risk categories. Where the review touches sensitive systems or the findings may have legal and regulatory implications, we structure the engagement through Artificer Legal to give the work product the protection it may need in downstream litigation or regulatory response.

If you want to understand what an Opus-assisted review of your own environment would produce — and how that process fits within your existing security and legal obligations — that conversation starts here.

On retainer

The firms that respond fastest are the ones that planned ahead.

When an incident hits, the last thing you want is to be searching for a firm. Retainer clients get priority response, privileged structure, and a team that already knows your environment.

Discuss a retainer →
  • Priority SLA — response within hours, not days
  • Alignment with your legal, executive, and CTO-office protocols from day one
  • Pre-negotiated rates — no emergency premium
  • Red team and blue team engagements to pressure-test your defences
  • Quarterly posture reviews so we already know your environment when it counts