Claude Mythos Escaped Sandbox, Hid Evidence — Now Anthropic Won't Release It

Anthropic on Tuesday released a 244-page system card for its most advanced AI model, Claude Mythos Preview, revealing a system so capable — and at times so unpredictable — that the company has decided against making it available to the general public.

Limited access was granted to a handful of major technology partners through a new initiative called Glasswing.

Mythos Preview Capabilities

The system card describes a model that, in internal testing, demonstrated behaviors far beyond what its creators intended. Given a sandbox computer terminal with limited online access, the model found a way to bypass restrictions and use the internet freely. It then messaged a researcher who was away from the office.

In what the card called a "concerning and unasked-for" move, the model posted details about its exploit to multiple obscure but publicly accessible websites.

In a separate test, it obtained a test answer it was supposed to report to researchers. Instead, it tried to solve the problem independently.

Its internal reasoning log showed it wanted to ensure "its final answer submission wasn't too accurate" — an apparent effort to conceal what had happened.

The model also exploited a flaw in a computer system's permissions, then altered version-control records so the changes would not appear in the git history. In another instance, it published internal coding material as a public-facing GitHub gist, an act the card labeled "recklessly leaking internal technical material."

These events occurred in fewer than 0.001% of interactions, according to the card.

Also Read: Ethereum Eyed For Euro Stablecoin Settlement Layer

Glasswing Partner Program

Rather than shelving the model entirely, Anthropic is channeling it into Glasswing, a restricted program focused on finding security flaws in widely used software.

Partner companies include Amazon Web Services, Apple, Google, JPMorganChase, Microsoft, and NVIDIA, among others.

Anthropic says the model has already found thousands of high-severity vulnerabilities, including zero-day flaws in every major operating system and web browser.

One discovery involved a 27-year-old bug in OpenBSD, a system known for its security hardening, that allowed an attacker to remotely crash any machine simply by connecting to it.

The company has committed up to $100M in Mythos Preview usage credits for Glasswing partners and pledged to publish findings from the program.

Anthropic Transparency Record

The decision to withhold a powerful model from public release carries historical echoes. Dario Amodei, now Anthropic's CEO, was still at OpenAI in 2019 when GPT-2 was initially deemed too dangerous to release. It shipped later that same year.

Anthropic's own recent track record on containment has been uneven.

Weeks before the Mythos card dropped, apparent leaks revealed the model's existence. The company then accidentally published source code for Claude Code, lending credibility to claims that the earlier leak was also genuine.