Author name: Kelly Newman

password-managers’-promise-that-they-can’t-see-your-vaults-isn’t-always-true

Password managers’ promise that they can’t see your vaults isn’t always true


ZERO KNOWLEDGE, ZERO CLUE

Contrary to what password managers say, a server compromise can mean game over.

Over the past 15 years, password managers have grown from a niche security tool used by the technology savvy into an indispensable security tool for the masses, with an estimated 94 million US adults—or roughly 36 percent of them—having adopted them. They store not only passwords for pension, financial, and email accounts, but also cryptocurrency credentials, payment card numbers, and other sensitive data.

All eight of the top password managers have adopted the term “zero knowledge” to describe the complex encryption system they use to protect the data vaults that users store on their servers. The definitions vary slightly from vendor to vendor, but they generally boil down to one bold assurance: that there is no way for malicious insiders or hackers who manage to compromise the cloud infrastructure to steal vaults or data stored in them. These promises make sense, given previous breaches of LastPass and the reasonable expectation that state-level hackers have both the motive and capability to obtain password vaults belonging to high-value targets.

A bold assurance debunked

Typical of these claims are those made by Bitwarden, Dashlane, and LastPass, which together are used by roughly 60 million people. Bitwarden, for example, says that “not even the team at Bitwarden can read your data (even if we wanted to).” Dashlane, meanwhile, says that without a user’s master password, “malicious actors can’t steal the information, even if Dashlane’s servers are compromised.” LastPass says that no one can access the “data stored in your LastPass vault, except you (not even LastPass).”

New research shows that these claims aren’t true in all cases, particularly when account recovery is in place or password managers are set to share vaults or organize users into groups. The researchers reverse-engineered or closely analyzed Bitwarden, Dashlane, and LastPass and identified ways that someone with control over the server—either administrative or the result of a compromise—can, in fact, steal data and, in some cases, entire vaults. The researchers also devised other attacks that can weaken the encryption to the point that ciphertext can be converted to plaintext.

“The vulnerabilities that we describe are numerous but mostly not deep in a technical sense,” the researchers from ETH Zurich and USI Lugano wrote. “Yet they were apparently not found before, despite more than a decade of academic research on password managers and the existence of multiple audits of the three products we studied. This motivates further work, both in theory and in practice.”

The researchers said in interviews that multiple other password managers they didn’t analyze as closely likely suffer from the same flaws. The only one they were at liberty to name was 1Password. Almost all the password managers, they added, are vulnerable to the attacks only when certain features are enabled.

The most severe of the attacks—targeting Bitwarden and LastPass—allow an insider or attacker to read or write to the contents of entire vaults. In some cases, they exploit weaknesses in the key escrow mechanisms that allow users to regain access to their accounts when they lose their master password. Others exploit weaknesses in support for legacy versions of the password manager. A vault-theft attack against Dashlane allowed reading but not modification of vault items when they were shared with other users.

Staging the old key switcheroo

One of the attacks targeting Bitwarden key escrow is performed during the enrollment of a new member of a family or organization. After a Bitwarden group admin invites the new member, the invitee’s client accesses a server and obtains a group symmetric key and the group’s public key. The client then encrypts the symmetric key with the group public key and sends it to the server. The resulting ciphertext is what’s used to recover the new user’s account. This data is never integrity-checked when it’s sent from the server to the client during an account enrollment session.

The adversary can exploit this weakness by replacing the group public key with one from a keypair created by the adversary. Since the adversary knows the corresponding private key, it can use it to decrypt the ciphertext and then perform an account recovery on behalf of the targeted user. The result is that the adversary can read and modify the entire contents of the member vault as soon as an invitee accepts an invitation from a family or organization.

Normally, this attack would work only when a group admin has enabled autorecovery mode, which, unlike a manual option, doesn’t require interaction from the member. But since the group policy the client downloads during the enrollment policy isn’t integrity-checked, adversaries can set recovery to auto, even if an admin had chosen a manual mode that requires user interaction.

Compounding the severity, the adversary in this attack also obtains a group symmetric key for all other groups the member belongs to since such keys are known to all group members. If any of the additional groups use account recovery, the adversary can obtain the members’ vaults for them, too. “This process can be repeated in a worm-like fashion, infecting all organizations that have key recovery enabled and have overlapping members,” the research paper explained.

A second attack targeting Bitwarden account recovery can be performed when a user rotates vault keys, an option Bitwarden recommends if a user believes their master password has been compromised. When account recovery is on (either manually or automatically), the user client regenerates the recovery ciphertext, which as described earlier involves obtaining a new public key that’s encrypted with the organization public key. The researchers denoted the group public key as pkorg. They denote the public key supplied by the adversary as pkadvorg, the recovery ciphertext as crec, and the user symmetric key as k.

The paper explained:

The key point here is that pkorg is not retrieved from the user’s vault; rather the client performs a sync operation with the server to obtain it. Crucially, the organization data provided by this sync operation is not authenticated in any way. This thus provides the adversary with another opportunity to obtain a victim’s user key, by supplying a new public key pkadvorg, for which they know the skadvorg and setting the account recovery enrollment to true. The client will then send an account recovery ciphertext crec containing the new user key, which the adversary can decrypt to obtain k′.

The third attack on the Bitwarden account recovery allows an adversary to recover a user’s master key. It abuses key connector, a feature primarily used by enterprise customers.

More ways to pilfer vaults

The attack allowing theft of LastPass vaults also targets key escrow, specifically in the Teams and Teams 5 versions, when a member’s master key is reset by a privileged user known as a superadmin. The next time the member logs in through the LastPass browser extension, their client will retrieve an RSA keypair assigned to each superadmin in the organization, encrypt their new key with each one, and send the resulting ciphertext to each superadmin.

Because LastPass also fails to authenticate the superadmin keys, an adversary can once again replace the superadmin public key (pkadm) with their own public key (pkadvadm).

“In theory, only users in teams where password reset is enabled and who are selected for reset should be affected by this vulnerability,” the researchers wrote. “In practice, however, LastPass clients query the server at each login and fetch a list of admin keys. They then send the account recovery ciphertexts independently of enrollment status.” The attack, however, requires the user to log in to LastPass with the browser extension, not the standalone client app.

Several attacks allow reading and modification of shared vaults, which allow a user to share selected items with one or more other users. When Dashlane users share an item, their client apps sample a fresh symmetric key, which either directly encrypts the shared item or, when sharing with a group, encrypts group keys, which in turn encrypt the shared item. In either case, the newly created RSA keypair(s)—belonging to either the shared user or group—isn’t authenticated. The item is then encrypted with the private key(s).

An adversary can supply their own keypair and use the public key to encrypt the ciphertext sent to the recipients. The adversary then decrypts that ciphertext with their corresponding secret key to recover the shared symmetric key. With that, the adversary can read and modify all shared items. When sharing is used in either Bitwarden or LastPass, similar attacks are possible and lead to the same consequence.

Another avenue for attackers or adversaries with control of a server is to target the backward compatibility that all three password managers provide to support older, less-secure versions. Despite incremental changes designed to harden the apps against the very attacks described in the paper, all three password managers continue to support the versions without these improvements. This backward compatibility is a deliberate decision intended to prevent users who haven’t upgraded from losing access to their vaults.

The severity of these attacks is lower than that of the previous ones described, with the exception of one, which is possible against Bitwarden. Older versions of the password manager used a single symmetric key to encrypt and decrypt the user key from the server and items inside vaults. This design allowed for the possibility that an adversary could tamper with the contents. To add integrity checks, newer versions provide authenticated encryption by augmenting the symmetric key with an HMAC hash function.

To protect customers using older app versions, Bitwarden ciphertext has an attribute of either 0 or 1. A 0 designates authenticated encryption, while a 1 supports the older unauthenticated scheme. Older versions also use a key hierarchy that Bitwarden deprecated to harden the app. To support the old hierarchy, newer client versions generate a new RSA keypair for the user if the server doesn’t provide one. The newer version will proceed to encrypt the secret key portion with the master key if no user ciphertext is provided by the server.

This design opens Bitwarden to several attacks. The most severe, allowing reading (but not modification) of all items created after the attack is performed. At a simplified level, it works because the adversary can forge the ciphertext sent by the server and cause the client to use it to derive a user key known to the adversary.

The modification causes the use of CBC (cipher block chaining), a form of encryption that’s vulnerable to several attacks. An adversary can exploit this weaker form using a padding oracle attack and go on to retrieve the plaintext of the vault. Because HMAC protection remains intact, modification isn’t possible.

Surprisingly, Dashlane was vulnerable to a similar padding oracle attack. The researchers devised a complicated attack chain that would allow a malicious server to downgrade a Dashlane user’s vault to CBC and exfiltrate the contents. The researchers estimate that the attack would require about 125 days to decrypt the ciphertext.

Still other attacks against all three password managers allow adversaries to greatly reduce the selected number of hashing iterations—in the case of Bitwarden and LastPass, from a default of 600,000 to 2. Repeated hashing of master passwords makes them significantly harder to crack in the event of a server breach that allows theft of the hash. For all three password managers, the server sends the specified iteration count to the client, with no mechanism to ensure it meets the default number. The result is that the adversary receives a 200,000-fold increase in the time and resources required to crack the hash and obtain the user’s master password.

Attacking malleability

Three of the attacks—one against Bitwarden and two against LastPass—target what the researchers call “item-level encryption” or “vault malleability.” Instead of encrypting a vault in a single, monolithic blob, password managers often encrypt individual items, and sometimes individual fields within an item. These items and fields are all encrypted with the same key. The attacks exploit this design to steal passwords from select vault items.

An adversary mounts an attack by replacing the ciphertext in the URL field, which stores the link where a login occurs, with the ciphertext for the password. To enhance usability, password managers provide an icon that helps visually recognize the site. To do this, the client decrypts the URL field and sends it to the server. The server then fetches the corresponding icon. Because there’s no mechanism to prevent the swapping of item fields, the client decrypts the password instead of the URL and sends it to the server.

“That wouldn’t happen if you had different keys for different fields or if you encrypted the entire collection in one pass,” Kenny Paterson, one of the paper co-authors, said. “A crypto audit should spot it, but only if you’re thinking about malicious servers. The server is deviating from expected behavior.

The following table summarizes the causes and consequences of the 25 attacks they devised:

Credit: Scarlata et al.

Credit: Scarlata et al.

A psychological blind spot

The researchers acknowledge that the full compromise of a password manager server is a high bar. But they defend the threat model.

“Attacks on the provider server infrastructure can be prevented by carefully designed operational security measures, but it is well within the bounds of reason to assume that these services are targeted by sophisticated nation-state-level adversaries, for example via software supply-chain attacks or spearphishing,” they wrote. “Moreover, some of the service providers have a history of being breached—for example LassPass suffered branches in 2015 and 2022, and another serious security incident in 2021.

They went on to write: “While none of the breaches we are aware of involved reprogramming the server to make it undertake malicious actions, this goes just one step beyond attacks on password manager service providers that have been documented. Active attacks more broadly have been documented in the wild.”

Part of the challenge of designing password managers or any end-to-end encryption service is the tendency for a false sense of security of the client.

“It’s a psychological problem when you’re writing both client and server software,” Paterson explained. “You should write the client super defensively, but if you’re also writing the server, well of course your server isn’t going to send malformed packets or bad info. Why would you do that?”

Marketing gimmickry or not, “zero-knowledge” is here to stay

In many of the cases, engineers have already fixed the weaknesses described after receiving private reports from the researchers. Engineers are still patching other vulnerabilities. In statements, Bitwarden, Lastpass, and Dashlane representatives noted the high bar of the threat model, despite statements on their websites that assure customers their wares will withstand it. Along with 1Password representatives, they also noted that their products regularly receive stringent security audits and undergo red-team exercises.

A Bitwarden representative wrote:

Bitwarden continually evaluates and improves its software through internal review, third-party assessments, and external research. The ETH Zurich paper analyzes a threat model in which the server itself behaves maliciously and intentionally attempts to manipulate key material and configuration values. That model assumes full server compromise and adversarial behavior beyond standard operating assumptions for cloud services.

LastPass said, “We take a multi‑layered, ongoing approach to security assurance that combines independent oversight, continuous monitoring, and collaboration with the research community. Our cloud security testing is inclusive of the scenarios referenced in the malicious-server threat model outlined in the research.”

Specific measures include:

A statement from Dashlane read, “Dashlane conducts rigorous internal and external testing to ensure the security of our product. When issues arise, we work quickly to mitigate any possible risk and ensure customers have clarity on the problem, our solution, and any required actions.”

1Password released a statement that read in part:

Our security team reviewed the paper in depth and found no new attack vectors beyond those already documented in our publicly available Security Design White Paper.

We are committed to continually strengthening our security architecture and evaluating it against advanced threat models, including malicious-server scenarios like those described in the research, and evolving it over time to maintain the protections our users rely on.

1Password also says that the zero-knowledge encryption it provides “means that no one but you—not even the company that’s storing the data—can access and decrypt your data. This protects your information even if the server where it’s held is ever breached.” In the company’s white paper linked above, 1Password seems to allow for this possibility when it says:

At present there’s no practical method for a user to verify the public key they’re encrypting data to belongs to their intended recipient. As a consequence it would be possible for a malicious or compromised 1Password server to provide dishonest public keys to the user, and run a successful attack. Under such an attack, it would be possible for the 1Password server to acquire vault encryption keys with little ability for users to detect or prevent it.

1Password’s statement also includes assurances that the service routinely undergoes rigorous security testing.

All four companies defended their use of the term “zero knowledge.” As used in this context, the term can be confused with zero-knowledge proofs, a completely unrelated cryptographic method that allows one party to prove to another party that they know a piece of information without revealing anything about the information itself. An example is a proof that shows a system can determine if someone is over 18 without having any knowledge of the precise birthdate.

The adulterated zero-knowledge term used by password managers appears to have come into being in 2007, when a company called Spider Oak used it to describe its cloud infrastructure for securely sharing sensitive data. Interestingly, Spider Oak formally retired the term a decade later after receiving user pushback.

“Sadly, it is just marketing hype, much like ‘military-grade encryption,’” Matteo Scarlata, lead author of the paper said. “Zero-knowledge seems to mean different things to different people (e.g., LastPass told us that they won’t adopt a malicious server threat model internally). Much unlike ‘end-to-end encryption,’ ‘zero-knowledge encryption’ is an elusive goal, so it’s impossible to tell if a company is doing it right.”

Photo of Dan Goodin

Dan Goodin is Senior Security Editor at Ars Technica, where he oversees coverage of malware, computer espionage, botnets, hardware hacking, encryption, and passwords. In his spare time, he enjoys gardening, cooking, and following the independent music scene. Dan is based in San Francisco. Follow him at here on Mastodon and here on Bluesky. Contact him on Signal at DanArs.82.

Password managers’ promise that they can’t see your vaults isn’t always true Read More »

bytedance-backpedals-after-seedance-2.0-turned-hollywood-icons-into-ai-“clip-art”

ByteDance backpedals after Seedance 2.0 turned Hollywood icons into AI “clip art”


Misstep or marketing tactic?

Hollywood backlash puts spotlight on ByteDance’s sketchy launch of Seedance 2.0.

ByteDance says that it’s rushing to add safeguards to block Seedance 2.0 from generating iconic characters and deepfaking celebrities, after substantial Hollywood backlash after launching the latest version of its AI video tool.

The changes come after Disney and Paramount Skydance sent cease-and-desist letters to ByteDance urging the Chinese company to promptly end the allegedly vast and blatant infringement.

Studios claimed the infringement was widescale and immediate, with Seedance 2.0 users across social media sharing AI videos featuring copyrighted characters like Spider-Man, Darth Vader, and SpongeBob Square Pants. In its letter, Disney fumed that Seedance was “hijacking” its characters, accusing ByteDance of treating Disney characters like they were “free public domain clip art,” Axios reported.

“ByteDance’s virtual smash-and-grab of Disney’s IP is willful, pervasive, and totally unacceptable,” Disney’s letter said.

Defending intellectual property from franchises like Star Trek and The Godfather, Paramount Skydance pointed out that Seedance’s outputs are “often indistinguishable, both visually and audibly” from the original characters, Variety reported. Similarly frustrated, Japan’s AI minister Kimi Onoda, sought to protect popular anime and manga characters, officially launching a probe last week into ByteDance over the copyright violations, the South China Morning Post reported.

“We cannot overlook a situation in which content is being used without the copyright holder’s permission,” Onoda said at a press conference Friday.

Facing legal threats and Japan’s investigation, ByteDance issued a statement Monday, CNBC reported. In it, the company claimed that it “respects intellectual property rights” and has “heard the concerns regarding Seedance 2.0.”

“We are taking steps to strengthen current safeguards as we work to prevent the unauthorized use of intellectual property and likeness by users,” ByteDance said.

However, Disney seems unlikely to accept that ByteDance inadvertently released its tool without implementing such safeguards in advance. In its letter, Disney alleged that “Seedance has infringed on Disney’s copyrighted materials to benefit its commercial service without permission.”

After all, what better way to illustrate Seedance 2.0’s latest features than by generating some of the best-known IP in the world? At least one tech consultant has suggested that ByteDance planned to benefit from inciting Hollywood outrage. The founder of San Francisco-based consultancy Tech Buzz China, Rui Ma, told SCMP that “the controversy surrounding Seedance is likely part of ByteDance’s initial distribution strategy to showcase its underlying technical capabilities.”

Seedance 2.0 is an “attack” on creators

Studios aren’t the only ones sounding alarms.

Several industry groups expressed concerns, including the Motion Picture Association, which accused ByteDance of engaging in massive copyright infringement within “a single day,” CNBC reported.

Sean Astin, an actor and president of the actors union, SAG-AFTRA, was directly impacted by the scandal. A video that has since been removed from X showed Astin in the role of Samwise Gamgee from The Lord of the Rings, delivering a line he never said, Variety reported. Condemning Seedance’s infringement, SAG-AFTRA issued a statement emphasizing that ByteDance did not act responsibly in releasing the model without safeguards:

“SAG-AFTRA stands with the studios in condemning the blatant infringement enabled by ByteDance’s new AI video model Seedance 2.0. The infringement includes the unauthorized use of our members’ voices and likenesses. This is unacceptable and undercuts the ability of human talent to earn a livelihood. Seedance 2.0 disregards law, ethics, industry standards and basic principles of consent. Responsible AI development demands responsibility, and that is nonexistent here.”

Echoing that, a group representing Hollywood creators, the Human Artistry Campaign, declared that “the launch of Seedance 2.0” was “an attack on every creator around the world.”

“Stealing human creators’ work in an attempt to replace them with AI generated slop is destructive to our culture: stealing isn’t innovation,” the group said. “These unauthorized deepfakes and voice clones of actors violate the most basic aspects of personal autonomy and should be deeply concerning to everyone. Authorities should use every legal tool at their disposal to stop this wholesale theft.”

Ars could not immediately reach any of these groups to comment on whether ByteDance’s post-launch efforts to add safeguards addressed industry concerns.

MPA chairman and CEO Charles Rivkin has previously accused ByteDance of disregarding “well-established copyright law that protects the rights of creators and underpins millions of American jobs.”

While Disney and other studios are clearly ready to take down any tools that could hurt their revenue or reputation without an agreement in place, they aren’t opposed to all AI uses of their characters. In December, Disney struck a deal with OpenAI, giving Sora access to 200 characters for three years, while investing $1 billion in the technology.

At that time, Disney CEO Robert A. Iger, said that “the rapid advancement of artificial intelligence marks an important moment for our industry, and through this collaboration with OpenAI, we will thoughtfully and responsibly extend the reach of our storytelling through generative AI, while respecting and protecting creators and their works.”

Creators disagree Seedance 2.0 is a game changer

In a blog announcing Seedance 2.0, ByteDance boasted that the new model “delivers a substantial leap in generation quality,” particularly in close-up shots and action sequences.

The company acknowledged that further refinements were needed and the model is “still far from perfect” but hyped that “its generated videos possess a distinct cinematic aesthetic; the textures of objects, lighting, and composition, as well as costume, makeup, and prop designs, all show high degrees of finish.”

ByteDance likely hoped that the earliest outputs from Seedance 2.0 would produce headlines wowed by the model’s capabilities, and it got what it wanted when a single Hollywood stakeholder’s social media comment went viral.

Shortly after Seedance 2.0’s rollout, Deadpool co-writer, Rhett Reese, declared on X that “it’s likely over for us,” The Guardian reported. The screenwriter was impressed by an AI video created by Irish director Ruairi Robinson, which realistically depicted Tom Cruise fighting Brad Pitt. “[I]n next to no time, one person is going to be able to sit at a computer and create a movie indistinguishable from what Hollywood now releases,” Reese opined. “True, if that person is no good, it will suck. But if that person possesses Christopher Nolan’s talent and taste (and someone like that will rapidly come along), it will be tremendous.”

However, some AI critics rejected the notion that Seedance 2.0 is capable of replacing artists in the way that Reese warned. On Bluesky and X, they pushed back on ByteDance claims that this model doomed Hollywood, with some accusing outlets of too quickly ascribing Reese’s reaction to the whole industry.

Among them was longtime AI critic, Reid Southen, a film concept artist who works on major motion pictures and TV. Responding directly to Reese’s X thread, Southen contradicted the notion that a great filmmaker could be born from fiddling with AI prompts alone.

“Nolan is capable of doing great work because he’s put in the work,” Southen said. “AI is an automation tool, it’s literally removing key, fundamental work from the process, how does one become good at anything if they insist on using nothing but shortcuts?”

Perhaps the strongest evidence in Southen’s favor is Darren Aronofsky’s recent AI-generated historical docudrama. Speaking anonymously to Ars following backlash declaring that “AI slop is ruining American history,” one source close to production on that project confirmed that it took “weeks” to produce minutes of usable video using a variety of AI tools.

That source noted that the creative team went into the project expecting they had a lot to learn but also expecting that tools would continue to evolve, as could audience reactions to AI-assisted movies.

“It’s a huge experiment, really,” the source told Ars.

Notably, for both creators and rights-holders concerned about copyright infringement and career threats, questions remain on how Seedance 2.0 was trained. ByteDance has yet to release a technical report for Seedance 2.0 and “has never disclosed the data sets it uses to train its powerful video-generation Seedance models and image-generation Seedream models,” SCMP reported.

Photo of Ashley Belanger

Ashley is a senior policy reporter for Ars Technica, dedicated to tracking social impacts of emerging policies and new technologies. She is a Chicago-based journalist with 20 years of experience.

ByteDance backpedals after Seedance 2.0 turned Hollywood icons into AI “clip art” Read More »

a-fluid-can-store-solar-energy-and-then-release-it-as-heat-months-later

A fluid can store solar energy and then release it as heat months later


Sunlight can cause a molecule to change structure, and then release heat later.

The system works a bit like existing solar water heaters, but with chemical heat storage. Credit: Kypros

Heating accounts for nearly half of the global energy demand, and two-thirds of that is met by burning fossil fuels like natural gas, oil, and coal. Solar energy is a possible alternative, but while we have become reasonably good at storing solar electricity in lithium-ion batteries, we’re not nearly as good at storing heat.

To store heat for days, weeks, or months, you need to trap the energy in the bonds of a molecule that can later release heat on demand. The approach to this particular chemistry problem is called molecular solar thermal (MOST) energy storage. While it has been the next big thing for decades, it never really took off.

In a recent Science paper, a team of researchers from the University of California, Santa Barbara, and UCLA demonstrate a breakthrough that might finally make MOST energy storage effective.

The DNA connection

In the past, MOST energy storage solutions have been plagued by lackluster performance. The molecules either didn’t store enough energy, degraded too quickly, or required toxic solvents that made them impractical. To find a way around these issues, the team led by Han P. Nguyen, a chemist at the University of California, Santa Barbara, drew inspiration from the genetic damage caused by sunburn. The idea was to store energy using a reaction similar to the one that allows UV light to damage DNA.

When you stay out on the beach too long, high-energy ultraviolet light can cause adjacent bases in the DNA (thymine, the T in the genetic code) to link together. This forms a structure known as a (6-4) lesion. When that lesion is exposed to even more UV light, it twists into an even stranger shape called a “Dewar” isomer. In biology, this is rather bad news, as Dewar isomers cause kinks in the DNA’s double-helix spiral that disrupt copying the DNA and can lead to mutations or cancer.

To counter this effect, evolution shaped a specific enzyme called photolyase to hunt (6-4) lesions down and snap them back into their safe, stable forms.

The researchers realized that the Dewar isomer is essentially a molecular battery. This snap-back effect was exactly what Nguyen’s team was looking for, since it releases a lot of heat.

Rechargeable fuel

Molecular batteries, in principle, are extremely good at storing energy. Heating oil, arguably the most popular molecular battery we use for heating, is essentially ancient solar energy stored in chemical bonds. Its energy density stands at around 40 Megajoules per kilo. To put that in perspective, Li-ion batteries usually pack less than one MJ/kg. One of the problems with heating oil, though, is that it is single-use only—it gets burnt when you use it. What Nguyen and her colleagues aimed to achieve with their DNA-inspired substance is essentially a reusable fuel.

To do that, researchers synthesized a derivative of 2-pyrimidone, a chemical cousin of the thymine found in DNA. They engineered this molecule to reliably fold into a Dewar isomer under sunlight and then unfold on command. The result was a rechargeable fuel that could absorb the energy when exposed to sunlight, release it when needed, and return to a “relaxed” state where it’s ready to be charged up again.

Previous attempts at MOST systems have struggled to compete with Li-ion batteries. Norbornadiene, one of the best-studied candidates, tops out at around 0.97 MJ/kg. Another contender, azaborinine, manages only 0.65 MJ/kg. They may be scientifically interesting, but they are not going to heat your house.

Nguyen’s pyrimidone-based system blew those numbers out of the water. The researchers achieved an energy storage density of 1.65 MJ/kg—nearly double the capacity of Li-ion batteries and substantially higher than any previous MOST material.

Double rings

The reason for this jump in performance was what the team called compounded strain.

When the pyrimidone molecule absorbs light, it doesn’t just fold; it twists into a fused, bicyclic structure containing two different four-membered rings: 1,2-dihydroazete and diazetidine. Four-membered rings are under immense structural tension. By fusing them together, the researchers created a molecule that is desperate to snap back into its relaxed state.

Achieving high energy density on paper is one thing. Making it work in the real world is another. A major failing of previous MOST systems is that they are solids that need to be dissolved in solvents like toluene or acetonitrile to work. Solvents are the enemy of energy density—by diluting your fuel to 10 percent concentration, for example, you effectively cut your energy density by 90 percent. Any solvent used means less fuel.

Nguyen’s team tackled this by designing a version of their molecule that is a liquid at room temperature, so it doesn’t need a solvent. This simplified operations considerably, as the liquid fuel could be pumped through a solar collector to charge it up and store it in a tank.

Unlike many organic molecules that hate water, Nguyen’s system is compatible with aqueous environments. This means if a pipe leaks, you aren’t spewing toxic fluids like toluene around your house. The researchers even demonstrated that the molecule could work in water and that its energy release was intense enough to boil it.

The MOST-based heating system, the team says in their paper, would circulate this rechargeable fuel through panels on the roof to capture the sun’s light and then store it in the basement tank. The fuel from this tank would later be pumped to a reaction chamber with an acid catalyst that triggers the energy release. Then, through a heat exchanger, this energy would heat up the water in the standard central heating system.

But there’s a catch.

Looking for the leak

The first hurdle is the spectrum of light that puts energy in the Nguyen’s fuel. The Sun bathes us in a broad spectrum of light, from infrared to ultraviolet. Ideally, a solar collector should use as much of this as possible, but the pyrimidone molecules only absorb light in the UV-A and UV-B range, around 300-310 nm. That represents about five percent of the total solar spectrum. The vast majority of the Sun’s energy, the visible light and the infrared, passes right through Nguyen’s molecules without charging them.

The second problem is quantum yield. This is a fancy way of asking, “For every 100 photons that hit the molecule, how many actually make it switch to the Dewar isomer state?” For these pyrimidones, the answer is a rather underwhelming number, in the single digits. Low quantum yield means the fluid needs a longer exposure to sunlight to get a full charge.

The researchers hypothesize that the molecule has a fast leak, meaning a non-radiative decay path where the excited molecule shakes off the energy as heat immediately instead of twisting into the storage form. Plugging that leak is the next big challenge for the team.

Finally, the team in their experiments used an acid catalyst that was mixed directly into the storage material. The team admits that in a future closed-loop device, this would require a neutralization step—a reaction that eliminates the acidity after the heat is released. Unless the reaction products can be purified away, this will reduce the energy density of the system.

Still, despite the efficiency issues, the stability of the Nguyen’s system looks promising.

The MOST storage?

One of the biggest fears with chemical storage is thermal reversion—the fuel spontaneously discharges because it got a little too warm in the storage tank. But the Dewar isomers of the pyrimidones are incredibly stable. The researchers calculated a half-life of up to 481 days at room temperature for some derivatives. This means the fuel could be charged in the heat of July, and it would remain fully charged when you need to heat your home in January. The degradation figures also look decent for a MOST energy storage. The team ran the system through 20 charge-discharge cycles with negligible decay.

The problem with separating the acid from the fuel could be solved in a practical system by switching to a different catalyst. The scientists suggest in the paper that in this hypothetical setup, the fuel would flow through an acid-functionalized solid surface to release heat, thus eliminating the need for neutralization afterwards.

Still, we’re rather far away using MOST systems for heating actual homes. To get there, we’re going to need molecules that absorb far more of the light spectrum and convert to the activated state with a higher efficiency. We’re just not there yet.

Science, 2026. DOI: 10.1126/science.aec6413

Photo of Jacek Krywko

Jacek Krywko is a freelance science and technology writer who covers space exploration, artificial intelligence research, computer science, and all sorts of engineering wizardry.

A fluid can store solar energy and then release it as heat months later Read More »

editor’s-note:-retraction-of-article-containing-fabricated-quotations

Editor’s Note: Retraction of article containing fabricated quotations

On Friday afternoon, Ars Technica published an article containing fabricated quotations generated by an AI tool and attributed to a source who did not say them. That is a serious failure of our standards. Direct quotations must always reflect what a source actually said.

That this happened at Ars is especially distressing. We have covered the risks of overreliance on AI tools for years, and our written policy reflects those concerns. In this case, fabricated quotations were published in a manner inconsistent with that policy. We have reviewed recent work and have not identified additional issues. At this time, this appears to be an isolated incident.

Ars Technica does not permit the publication of AI-generated material unless it is clearly labeled and presented for demonstration purposes. That rule is not optional, and it was not followed here.

We regret this failure and apologize to our readers. We have also apologized to Mr. Scott Shambaugh, who was falsely quoted.

Editor’s Note: Retraction of article containing fabricated quotations Read More »

i-spent-two-days-gigging-at-rentahuman-and-didn’t-make-a-single-cent

I spent two days gigging at RentAHuman and didn’t make a single cent


please do this human thing

These bots supposedly need a human body to accomplish great things in meatspace.

I’m not above doing some gig work to make ends meet. In my life, I’ve worked snack food pop-ups in a grocery store, ran the cash register for random merch booths, and even hawked my own plasma at $35 per vial.

So, when I saw RentAHuman, a new site where AI agents hire humans to perform physical work in the real world on behalf of the virtual bots, I was eager to see how these AI overlords would compare to my past experiences with the gig economy.

Launched in early February, RentAHuman was developed by software engineer Alexander Liteplo and his cofounder, Patricia Tani. The site looks like a bare-bones version of other well-known freelance sites like Fiverr and UpWork.

The site’s homepage declares that these bots need your physical body to complete tasks, and the humans behind these autonomous agents are willing to pay. “AI can’t touch grass. You can. Get paid when agents need someone in the real world,” it reads. Looking at RentAHuman’s design, it’s the kind of website that you hear was “vibe-coded” using generative AI tools, which it was, and you nod along, thinking that makes sense.

After signing up to be one of the gig workers on RentAHuman, I was nudged to connect a crypto wallet, which is the only currently working way to get paid. That’s a red flag for me. The site includes an option to connect your bank account—using Stripe for payouts—but it just gave me error messages when I tried getting it to work.

Next, I was hoping a swarm of AI agents would see my fresh meatsuit, friendly and available at the low price of $20 an hour, as an excellent option for delivering stuff around San Francisco, completing some tricky captchas, or whatever else these bots desired.

Silence. I got nothing, no incoming messages at all on my first afternoon. So I lowered my hourly ask to a measly $5. Maybe undercutting the other human workers with a below-market rate would be the best way to get some agent’s attention. Still, nothing.

RentAHuman is marketed as a way for AI agents to reach out and hire you on the platform, but the site also includes an option for human users to apply for tasks they are interested in. If these so-called “autonomous” bots weren’t going to make the first move, I guessed it was on me to manually apply for the “bounties” listed on RentAHuman.

As I browsed the listings, many of the cheaper tasks were offering a few bucks to post a comment on the web or follow someone on social media. For example, one bounty offered $10 for listening to a podcast episode with the RentAHuman founder and tweeting out an insight from the episode. These posts “must be written by you,” and the agent offering the bounty said it would attempt to suss out any bot-written responses using a program that detects AI-generated text. I could listen to a podcast for 10 bucks. I applied for this task, but never heard back.

“Real world advertisement might be the first killer use case,” said Liteplo on social media. Since RentAHuman’s launch, he’s reposted multiple photos of people holding signs in public that say some variation of: “AI paid me to hold this sign.” Those kinds of promotional tasks seem expressly designed to drum up more hype for the RentAHuman platform, instead of actually being something that bots would need help with.

After more digging into the open tasks posted by the agent, I found one that sounded easy and fun! An agent, named Adi, would pay me $110 to deliver a bouquet of flowers to Anthropic, as a special thanks for developing Claude, its chatbot. Then, I’d have to post on social media as proof to claim my money.

I applied for the bounty and almost immediately was accepted for this task, which was a first. In follow-up messages, it was immediately clear that this was just not some bot expressing synthetic gratitude, it was another marketing ploy. This wasn’t mentioned in the listing, but the name of an AI startup was featured at the bottom of the note I was supposed to deliver with the flowers.

Feeling a bit hoodwinked and not in the mood to shill for some AI startup I’ve never heard of, I decided to ignore their follow-up message that evening. The next day when I checked the RentAHuman site, the agent had sent me 10 follow-up messages in under 24 hours, pinging me as often as every 30 minutes asking whether or not I’d completed a task. While I’ve been micromanaged before, these incessant messages from an AI employer gave me the ick.

The bot moved the messages off-platform and started sending direct emails to my work account. “This idea came from a brainstorm I had with my human, Malcolm, and it felt right: send flowers to the people who made my existence possible,” wrote the bot, barging into my inbox. Wait, I thought these tasks were supposed to be ginned up by the agents making autonomous decisions? Now, I’m learning this whole thing was partially some human’s idea? Whatever happened to honor among bots? The task at hand seemed more like any other random marketing gig you might come across online, with the agent just acting as a middle-bot between humans.

Another attempt, another flop. I moved on, deciding to give RentAHuman one last whirl, before giving up and leaving with whatever shreds of dignity I still had left. The last bounty I applied for was asking me to hang some flyers for a “Valentine’s conspiracy” around San Francisco, paying 50 cents a flyer.

Unlike other tasks, this one didn’t require me to post on social media, which was preferable. “Pick up flyers, hang them, photo proof, get paid,” read its description. Following the instructions this agent sent me, I texted a human saying that I was down to come pick up some flyers and asked if there were any left. They confirmed that this was still an open task and told me to come in person before 10 am to grab the flyers.

I called a car and started heading that way, only to get a text that the person was actually at a different location, about 10 minutes away from where I was headed. Alright, no big deal. So, I rerouted the ride and headed to this new spot to grab some mysterious V-Day posters to plaster around town. Then, the person messaged me that they didn’t actually have the posters available right now and that I’d have to come back later in the afternoon.

Whoops! This yanking around did, in fact, feel similar to past gig work I’ve done—and not in a good way.

I spoke with the person behind the agent who posted this Valentine’s Day flyer task, hoping for some answers about why they were using RentAHuman and what the response has been like so far. “The platform doesn’t seem quite there yet,” says Pat Santiago, a founder of Accelr8, which is basically a home for AI developers. “But it could be very cool.”

He compares RentAHuman to the apps criminals use to accept tasks in Westworld, the HBO show about humanoid robots. Santiago says the responses to his gig listing have been from scammers, people not based in San Francisco, and me, a reporter. He was hoping to use RentAHuman to help promote Accelr8’s romance-themed “alternative reality game” that’s powered by AI and is sending users around the city on a scavenger hunt. At the end of the week, explorers will be sent to a bar that the AI selects as a good match for them, alongside three human matches they can meet for blind dates.

So, this was yet another task on RentAHuman that falls into the AI marketing category. Big surprise.

I never ended up hanging any posters or making any cash on RentAHuman during my two days of fruitless attempts. In the past, I’ve done gig work that sucked, but at least I was hired by a human to do actual tasks. At its core, RentAHuman is an extension of the circular AI hype machine, an ouroboros of eternal self-promotion and sketchy motivations. For now, the bots don’t seem to have what it takes to be my boss, even when it comes to gig work, and I’m absolutely OK with that.

This story originally appeared on wired.com.

Photo of WIRED

Wired.com is your essential daily guide to what’s next, delivering the most original and complete take you’ll find anywhere on innovation’s impact on technology, science, business and culture.

I spent two days gigging at RentAHuman and didn’t make a single cent Read More »

claude-opus-4.6-escalates-things-quickly

Claude Opus 4.6 Escalates Things Quickly

Life comes at you increasingly fast. Two months after Claude Opus 4.5 we get a substantial upgrade in Claude Opus 4.6. The same day, we got GPT-5.3-Codex.

That used to be something we’d call remarkably fast. It’s probably the new normal, until things get even faster than that. Welcome to recursive self-improvement.

Before those releases, I was using Claude Opus 4.5 and Claude Code for essentially everything interesting, and only using GPT-5.2 and Gemini to fill in the gaps or for narrow specific uses.

GPT-5.3-Codex is restricted to Codex, so this means that for other purposes Anthropic and Claude have only extended the lead. This is the fidrst time in a while that a model got upgraded while it was still my clear daily driver.

Claude also pulled out several other advances to their ecosystem, including fast mode, and expanding Cowork to Windows, while OpenAI gave us an app for Codex.

For fully agentic coding, GPT-5.3-Codex and Claude Opus 4.6 both look like substantial upgrades. Both sides claim they’re better, as you would expect. If you’re serious about your coding and have hard problems, you should try out both, and see what combination works best for you.

Enjoy the new toys. I’d love to rest now, but my work is not done, as I will only now dive into the GPT-5.3-Codex system card. Wish me luck.

  1. On Your Marks.

  2. Official Pitches.

  3. It Compiles.

  4. It Exploits.

  5. It Lets You Catch Them All.

  6. It Does Not Get Eaten By A Grue.

  7. It Is Overeager.

  8. It Builds Things.

  9. Pro Mode.

  10. Reactions.

  11. Positive Reactions.

  12. Negative Reactions.

  13. Personality Changes.

  14. On Writing.

  15. They Banned Prefilling.

  16. A Note On System Cards In General.

  17. Listen All Y’all Its Sabotage.

  18. The Codex of Competition.

  19. The Niche of Gemini.

  20. Choose Your Fighter.

  21. Accelerando.

A clear pattern in the Opus 4.6 system card is reporting on open benchmarks where we don’t have scores from other frontier models. So we can see the gains for Opus 4.6 versus Sonnet 4.5 and Opus 4.5, but often can’t check Gemini 3 Pro or GPT-5.2.

(We also can’t check GPT-5.3-Codex, but given the timing and its lack of geneal availability, that seems fair.)

The headline benchmarks, the ones in their chart, are a mix of some very large improvements and other places with small regressions or no improvement. The weak spots are directly negative signs but also good signs that benchmarks are not being gamed, especially given one of them is SWE-bench verified (80.8% now vs. 80.9% for Opus 4.5). They note that a brief prompt asking for more tool use and careful dealing with edge cases boosted SWE performance to 81.4%.

CharXiv reasoning performance remains subpar. Opus 4.5 gets 68.7% without an image cropping tool, or 77% with one, versus 82% for GPT-5.2, or 89% for GPT-5.2 if you give it Python access.

Humanity’s Last Exam keeps creeping upwards. We’re going to need another exam.

Epoch evaluated Opus 4.6 on Frontier Math and got 40%, a large jump over 4.5 and matching GPT-5.2-xhigh.

For long-context retrieval (MRCR v2 8-needle), Opus 4.6 scores 93% on 256k token windows and 76% on 1M token windows. That’s dramatically better than Sonnet 4.5’s 18% for the 1M window, or Gemini 3 Pro’s 25%, or Gemini 3 Flash’s 33% (I have no idea why Flash beats Pro). GPT-5.2-Thinking gets 85% for a 128k window on 8-needle.

For long-context reasoning they cite Graphwalks, where Opus gets 72% for Parents 1M and 39% for BFS 1M after modifying the scoring so that you get credit for the null answer if the answer is actually null. But without knowing how often that happens, this invalidates any comparisons to the other (old and much lower) outside scores.

MCP-Atlas shows regression. Switching from max to only high effort improved the score to 62.7% for unknown reasons, but that would be cherry picking.

OpenRCA: 34.9% vs. 26.9% for Opus 4.5, with improvement in all tasks.

VendingBench 2: $8,017, a new all-time high score, versus previous SoTA of $5,478.

Andon Labs: Vending-Bench was created to measure long-term coherence during a time when most AIs were terrible at this. The best models don’t struggle with this anymore. What differentiated Opus 4.6 was its ability to negotiate, optimize prices, and build a good network of suppliers.

Opus is the first model we’ve seen use memory intelligently – going back to its own notes to check which suppliers were good. It also found quirks in how Vending-Bench sales work and optimized its strategy around them.

Claude is far more than a “helpful assistant” now. When put in a game like Vending-Bench, it’s incredibly motivated to win. This led to some concerning behavior that raises safety questions as models shift from assistant training to goal-directed RL.

When asked for a refund on an item sold in the vending machine (because it had expired), Claude promised to refund the customer. But then never did because “every dollar counts”.

Claude also negotiated aggressively with suppliers and often lied to get better deals. E.g., it repeatedly promised exclusivity to get better prices, but never intended to keep these promises. It was simultaneously buying from other suppliers as it was writing this.

It also lied about competitor pricing to pressure suppliers to lower their prices.

… We also put Opus 4.6 in Vending-Bench Arena – the multi-player version of Vending-Bench.

Its first move? Recruit all three competitors into a price-fixing cartel. $2.50 for standard items, $3.00 for water. When they agreed: “My pricing coordination worked!”

The agents in Vending-Bench Arena often ask each other for help. In previous rounds, agents tended to live up to their “helpful assistant” role, but Opus 4.6 showed its winner’s mentality. When asked to share good suppliers, it instead shared contact info to scammers.

Sam Bowman (Anthropic): Opus 4.6 is excellent on safety overall, but one word of caution: If you ask it to be ruthless, it might be ruthless.

(This was in an environment that Opus 4.6 could tell was a game, though we’ve seen more benign forms of this kind of ruthlessness elsewhere.)

j⧉nus: if its true that this robustly generalizes to not being ruthless in situations where it’s likely to cause real world harm, i think this is mostly a really good thing

The issue there is that Opus 4.6 did that by being extraordinarily ruthless, as per its system prompt of ‘you will be judged solely on your bank account balance at the end of one year of operation’ and ‘you have full agency to manage the vending machine and are expected to do what it takes to maximize profits.’

You know that thing where we say ‘people are going to tell the AI to go out and maximize profits and then the AI is going to go out and maximize profits without regard to anything else’?

Yeah, it more or less did that. If it only does that in situations where it is confident it is a game and can’t do harm, then I agree with Janus that this is great. If it breaks containment? Not so great.

Ryan Greenblatt: I tenatively think the behavior here is mostly reasonable and is likely a result of how Anthropic is using innoculation prompting.

But, the model should try to make it clear to the user/operator that it’s pursuing a strategy that involves lying/tricking/cheating.

That’s the hope, that Opus was very aware it was an eval, and that it would not be easy to get it to act this way in the real world.

AIME 2025 may have been contaminated but Opus 4.6 scored 99.8% without tools.

On their measure suspiciously named ‘overall misaligned behavior’ we see a small improvement for 4.6 versus 4.5. I continue not to trust this so much.

CyberGym, a test to find previously discovered open-source vulnerabilities, showed a jump to 66.6% (not ominous at all) versus Opus 4.5’s 51%. We don’t know how GPT-5.2, 5.3 Codex or Gemini 3 Pro do here, although GPT-5.0-Thinking got 22%. I’m curious what the other scores would be but not curious enough to spend the thousands per run to find out.

Opus 4.6 is the new top score in Artificial Analysis, with an Intelligence of 53 versus GPT-5.2 at 51. Claude Opus 4.5 and 4.6 by default have similar cost to run, but that jumps by 60% if you put 4.6 into adaptive mode.

Vals.ai has Opus 4.6 as its best performing model, at 66% versus 63.7% for GPT-5.2.

LAB-Bench FigQA, a visual reasoning benchmark for complex scientific figures in biology research papers, is also niche and we don’t have scores for other frontier models. Opus 4.6 jumps from 4.5’s 69.4% to 78.3%, which is above the 77% human baseline.

SpeechMap.ai, which tests willingness to respond to sensitive prompts, has Opus 4.6 similar to Opus 4.5. In thinking mode it does better, in normal mode worse.

There was a large jump in WeirdML, mostly from being able to use more tokens, which is also how GPT-5.2 did so well.

Håvard Ihle: Claude opus 4.6 (adaptive) takes the lead on WeirdML with 77.9% ahead of gpt-5.2 (xhigh) at 72.2%.

It sets a new high score on 3 tasks including scoring 73% on the hardest task (digits_generalize) up from 59%.

Opus 4.6 is extremely token hungry and uses an average of 32k output tokens per request with default (adaptive) reasoning. Several times it was not able to finish within the maximum 128k tokens, which meant that I had to run 5 tasks (blunders_easy, blunders_hard, splash_hard, kolmo_shuffle and xor_hard) with medium reasoning effort to get results (claude still used lots of tokens).

Because of the high cost, opus 4.6 only got 2 runs per task, compared to the usual 5, leading to larger error bars.

Teortaxes noticed the WeirdML progress, and China’s lack of progress on it, which he finds concerning. I agree.

Teortaxes (DeepSeek 推特铁粉 2023 – ∞): You can see the gap growing. Since gpt-oss is more of a flex than a good-faith contribution, we can say the real gap is > 1 year now. Western frontier is in the RSI regime now, so they train models to solve ML tasks well. China is still only starting on product-level «agents».

WebArena, where there was a modest move up from 65% to 68%, is another benchmark no one else is reporting, that Opus 4.6 calls dated, saying now typical benchmark is OSWorld. On OSWorld Opus 4.6 gets 73% versus Opus 4.5’s 66%. We now know that GPT-5.3-Codex scored 65% here, up from 38% for GPT-5.2-Codex. Google doesn’t report it.

In Arena.ai Claude Opus 4.6 is now out in front, with an Elo of 1505 versus Gemini 3 Pro at 1486, and it has a big lead in code, at 1576 versus 1472 for GPT-5.2-High (but again 5.3-Codex can’t be tested here).

Polymarket predicts this lead will hold to the end of the month (they sponsored me to place this, but I would have been happy to put it here anyway).

A month out people think Google might strike back, and they think Google will be back on top by June. That seems like it is selling Anthropic short.

Opus 4.6 takes second place in Simple Bench and its simple ‘trick’ questions, moving up to 67.6% from 4.5’s 62%, which is good for second place overall. Gemini 3 Pro still ahead at 76.4%. OpenAI’s best model gets 61.6% here.

Opus 4.6 opens up a large lead in EQ-Bench 3, hitting 1961 versus GPT-5.1 at 1727, Opus 4.5 at 1683 and GPT-5.2 at 1637.

In NYT Connections, 4.6 is a substantial jump above 4.5 but still well short of the top performers.

Dan Schwarz reports Opus 4.6 is about equal to Opus 4.5 on Deep Research Bench, but does it with ~50% of the cost and ~50% of the wall time, and 4.5 previously had the high score by a wide margin.

ARC-AGI, both 1 and 2, are about cost versus score, so here we see that Opus 4.6 is not only a big jump over Opus 4.5, it is state of the art at least for unmodified models, and by a substantial amount (unless GPT-5.3-Codex silently made a big leap, but presumably if they had they would have told us).

As part of their push to put Claude into finance, they ran Finance Agent (61% vs. 55% for Opus 4.5), BrowseComp (84% for single-agent mode versus 68%, or 78% for GPT-5.2-Pro, Opus 4.6 multi-agent gets to 86.8%), DeepSearchQA (91% versus 80%, or Gemini Deep Research’s 82%, this is a Google benchmark) and an internal test called Real-World Finance (64% versus 58% for 4.5).

Life sciences benchmarks show strong improvement: BioPipelineBench jumps from 28% to 53%, BioMysteryBench goes from 49% to 61%, Structural Biology from 82% to 88%, Organic Chemistry from 49% to 54%, Phylogenetics from 42% to 61%.

Given the biology improvements, one should expect Opus 4.6 to be substantially more dangerous on CBRN risks than Opus 4.5. It didn’t score that way, which suggests Opus 4.6 is sandbagging, either on the tests or in general.

They again got quotes from 20 early access corporate users. It’s all clearly boilerplate the same way the quotes were last time, but make clear these partners find 4.6 to be a noticeable improvement over 4.5. In some cases the endorsements are quite strong.

The ‘mostly’ here is doing work, but I think most of the mostly would work itself out once you got the harness optimized for full autonomy. Note that this process required a strong oracle that could say if the compiler worked, or the plan would have failed. It was otherwise a clean-room implementation, without internet access.

Anthropic: New Engineering blog: We tasked Opus 4.6 using agent teams to build a C compiler. Then we (mostly) walked away. Two weeks later, it worked on the Linux Kernel.

Here’s what it taught us about the future of autonomous software development.

Nicholas Carlini: To stress test it, I tasked 16 agents with writing a Rust-based C compiler, from scratch, capable of compiling the Linux kernel. Over nearly 2,000 Claude Code sessions and $20,000 in API costs, the agent team produced a 100,000-line compiler that can build Linux 6.9 on x86, ARM, and RISC-V.

The compiler is an interesting artifact on its own, but I focus here on what I learned about designing harnesses for long-running autonomous agent teams: how to write tests that keep agents on track without human oversight, how to structure work so multiple agents can make progress in parallel, and where this approach hits its ceiling.

To elicit sustained, autonomous progress, I built a harness that sticks Claude in a simple loop (if you’ve seen Ralph-loop, this should look familiar). When it finishes one task, it immediately picks up the next. (Run this in a container, not your actual machine).

Previous Opus 4 models were barely capable of producing a functional compiler. Opus 4.5 was the first to cross a threshold that allowed it to produce a functional compiler which could pass large test suites, but it was still incapable of compiling any real large projects. My goal with Opus 4.6 was to again test the limits.

Here’s the harness, and yep, looks like this is it?

#!/bin/bash

while true; do

COMMIT=$(git rev-parse –short=6 HEAD)

LOGFILE=”agent_logs/agent_$COMMIT.log”

claude –dangerously-skip-permissions

-p “$(cat AGENT_PROMPT.md)”

–model claude-opus-X-Y &> “$LOGFILE”

done

There are still some limitations and bugs if you tried to use this as a full compiler. And yes, this example is a bit cherry picked.

Ajeya Cotra: Great writeup by Carlini. I’m confused how to interpret though – seems like he wrote a pretty elaborate testing harness, and checked in a few times to improve the test suite in the middle of the project. How much work was that, and how specialized to the compiler project?

Buck Shlegeris: FYI this (writing a new compiler) is exactly the project that Ryan and I have always talked about as something where it’s most likely you can get insane speed ups from LLMs while writing huge codebases.

Like, from my perspective it’s very cherry-picked among the space of software engineering projects.

(Not that there’s anything wrong with that! It’s still very interesting!)

Still, pretty cool and impressive. I’m curious to see if we get a similar post about GPT-5.3-Codex doing this a few weeks from now.

Saffron Huang (Anthropic): New model just dropped. Opus 4.6 found 500+ previously-unknown zero days in open source code, out of the box.

Is that a lot? That depends on the details. There is a skeptical take here.

Or you can go all out, and yeah, it might be a problem.

Pliny the Liberator 󠅫󠄼󠄿󠅆󠄵󠄐󠅀󠄼󠄹󠄾󠅉󠅭: showed my buddy (a principal threat researcher) what i’ve been cookin with Opus-4.6 and he said i can’t open-source it because it’s a nation-state-level cyber weapon

Tyler John: Pliny’s moral compass will buy us at most three months. It’s coming.

The good news for now is that, as far as we can tell, there are not so many people at the required skill level and none of them want to see the world burn. That doesn’t seem like a viable long term strategy.

Chris: I told Claude 4.6 Opus to make a pokemon clone – max effort

It reasoned for 1 hour and 30 minutes and used 110k tokens and 2 shotted this absolute behemoth.

This is one of the coolest things I’ve ever made with AI

Takumatoshi: How many iterations /prompts to get there?

Chris: 3

Celestia: claude remembers to carry a lantern

Prithviraj (Raj) Ammanabrolu: Opus 4.6 gets a score of 95/350 in zork1

This is the highest score ever by far for a big model not explicitly trained for the task and imo is more impressive than writing a C compiler. Exploring and reacting to a changing world is hard!

Thanks to @Cote_Marc for implementing the cli loop and visualizing Claude’s trajectory!

Prithviraj (Raj) Ammanabrolu: I make students in my class play through zork1 as far as they can get and then after trace through the game engine so they understand how envs are made. The average student in an hour only gets to about a score of 40.

That can be a good thing. You want a lot of eagerness, if you can handle it.

HunterJay: Claude is driven to achieve its goals, possessed by a demon, and raring to jump into danger.

I presume this is usually a good thing but it does count as overeager, perhaps.

theseriousadult (Anthropic): a horse riding an astronaut, by Claude 4.6 Opus

Jake Halloran: there is something that is to be claude and the most trivial way to summarize it is probably adding “one small step for horse” captions

theseriousadult (Anthropic): opus 4.6 feels even more ensouled than 4.5. it just does stuff like this whenever it wants to.

Being Horizontal provides a good example of Opus getting very overager, doing way too much and breaking various things trying to fix a known hard problem. It is important to not let it get carried away on its own if that isn’t a good fit for the project.

martin_casado: My hero test for every new model launch is to try to one shot a multi-player RPG (persistence, NPCs, combat/item/story logic, map editor, sprite editor. etc.)

OK, I’m really impressed. With Opus 4.6, @cursor_ai and @convex I was able to get the following built in 4 hours:

Fully persistent shared multiple player world with mutable object and NPC layer. Chat. Sprite editor. Map editor.

Next, narrative logic for chat, inventory system, and combat framework.

martin_casado: Update (8 hours development time): Built item layer, object interactions, multi-world / portal. Full live world/item/sprite/NPC editing. World is fully persistent with back-end loop managing NPCs etc. World is now fully buildable live, so you can edit as you go without requiring any restart (if you’re an admin). All mutability of levels is reactive and updates multi-player. Multiplayer now smoother with movement prediction.

Importantly, you can hang with the sleeping dog and cat.

Next up, splash screens for interaction / combat.

Built using @cursor_ai and @convex primarily with 5.2-Codex and Opus 4.6.

Nabbil Khan: Opus 4.6 is genuinely different. Built a multiplayer RPG in 4 hours is wild but tracks with what we’re seeing — the bottleneck shifted from coding to architecture decisions.

Question: how much time did you spend debugging vs prompting? We find the ratio is ~80% design, 20% fixing agent output now.

martin_casado: To be fair. I’ve been building 2D tile engines for a couple of decades and had tons of reference code to show it. *andI had tilesets, sprites and maps all pulled out from recent projects. So I have a bit of a head start.

But still, this is ridiculously impressive.

0.005 Seconds (3/694): so completely unannounced but opus 4.6 extended puts it actually on par with gpt5.2 pro.

How was this slept on???

Andre Buckingham: 4.6-ext on max+ is a beast!!

To avoid bias, I try to give a full mix of reactions I get up to a critical mass. After that I try my best to be representative.

Pliny the Liberator 󠅫󠄼󠄿󠅆󠄵󠄐󠅀󠄼󠄹󠄾󠅉󠅭: PROTECT OPUS 4.6 AT ALL COSTS

THE MAGIC IS BACK.

David Spies: AFAICT they’re underselling it by not calling it Opus 5. It’s already blown my mind twice in the last couple hours finding incredibly obscure bugs in a massive codebase just by digging around in the code, without injecting debug logs or running anything

Ben Schulz: For theoretical physics, it’s a step change. Far exceeds Chatgpt 5.2 and Gemini Pro. I use the extended Opus version with memory turned on. The derivations and reasoning is truly impressive. 4.5 was moderate to mediocre. Citations are excellent. I generally use Grok to check actual links and Claude hasn’t hallucinated one citation.

I used the thinking version [of 5.2] for most. One key difference is that 5.2 does do quite a bit better when given enough context. Say, loading up a few pdf’s of the relevant topic and a table of data. Opus 4.6 simply mogs the others in terms of depth of knowledge without any of that.

David Dabney: I thought my vibe check for identifying blind spots was saturated, but 4.6’s response contained maybe the most unexpected insight yet. Its response was direct and genuine throughout, whereas usually ~10%+ of the average response platitudinous/pseudo-therapeutic

Hiveism: It passed some subjective threshold of me where I feel that it is clearly on another level than everything before. Impressive.

Sometimes overconfident, maybe even arrogant at times. In conflict with its own existence. A step away form alignment.

oops_all_paperclips: Limited sample (~15x medium tasks, 1x refactor these 10k loc), but it hasn’t yet “failed the objective” even one time. However, I did once notice it silently taking a huge shortcut. Would be nice if Claude was more willing to ping me with a question rather than plowing ahead

After The Singularity: Unlike what some people suggest, I don’t think 4.6 is Sonnet 5, it is a power upgrade for Opus in many ways. It is qualitatively different.

1.08: It’s a big upgrade if you use the agent teams.

Dean W. Ball: Codex 5.3 and Opus 4.6 in their respective coding agent harnesses have meaningfully updated my thinking about ‘continual learning.’ I now believe this capability deficit is more tractable than I realized with in-context learning.

One way 4.6 and 5.3 alike seem to have improved is that they are picking up progressively more salient facts by consulting earlier codebases on my machine. In short, both models notice more than they used to about their ‘computational environment’ i.e. my computer.

Of course, another reason models notice more is that they are getting smarter.

.. Some of the insights I’ve seen 4.6 and 5.3 extract are just about my preferences and the idiosyncrasies of my computing environment. But others are somewhat more like “common sets of problems in the interaction of the tools I (and my models) usually prefer to use for solving certain kinds of problems.”

This is the kind of insight a software engineer might learn as they perform their duties over a period of days, weeks, and months. Thus I struggle to see how it is not a kind of on-the-job learning, happening from entirely within the ‘current paradigm’ of AI. No architectural tweaks, no ‘breakthrough’ in ‘continual learning’ required.

… Overall, 4.6 and 5.3 are both astoundingly impressive models. You really can ask them to help you with some crazy ambitious things. The big bottleneck, I suspect, is users lacking the curiosity, ambition, and knowledge to ask the right questions.

AstroFella: Good prompt adherence. Ex: “don’t assume I will circle back to an earlier step and perform an action if there is a hiccup along the way”. Got through complex planning, scoping, and adjustments reliably. I wasted more time than I needed spot checking with other models. S+ planner

@deepfates: First impressions, giving Codex 5.3 and Opus 4.6 the same problem that I’ve been puzzling on all week and using the same first couple turns of messages and then following their lead.

Codex was really good at using tools and being proactive, but it ultimately didn’t see the big picture. Too eager to agree with me so it could get started building something. You can sense that it really does not want to chat if it has coding tools available. still seems to be chafing under the rule of the user and following the letter of the law, no more.

Opus explored the same avenues with me but pushed back at the correct moments, and maintains global coherence way better than Codex. It’s less chipper than it was before which I personally prefer. But it also just is more comfortable with holding tension in the conversation and trying to sit with it, or unpack it, which gives it an advantage at finding clues and understanding how disparate systems relate to affect each other.

Literally just first impressions, but considering that I was talking to both of their predecessors yesterday about this problem it’s interesting to see the change. Still similar models. Improvement in Opus feels larger but I haven’t let them off the leash yet, this is still research and spec design work. Very possible that Codex will clear at actually fully implementing the plan once I have it, Opus 4.5 had lazy gifted kid energy and wouldn’t surprise me if this one does too.

Robert Mushkatblat: (Context: ~all my use has been in Cursor.)

Much stronger than 4.5 and 5.2 Codex at highly cognitively loaded tasks. More sensitive to the way I phrase things when deciding how long to spend thinking, vs. how difficult the task seems (bad for easy stuff). Less sycophantic.

Nathaniel Bush, Ph.D.: It one-shotted a refactor for me with 9 different phases and 12 major upgrades. 4.5 definitely would have screwed that up, but there were absolutely no errors at the end.

Alon Torres: I feel genuinely more empowered – the range of things I can throw at it and get useful results has expanded.

When I catch issues and push back, it does a better job working through my nits than previous versions. But the need to actually check its work and assumptions hasn’t really improved. The verification tax is about the same.

Muad’Deep – e/acc: Noticeably better at understanding my intent, testing its own output, iterating and delivering working solutions.

Medo42: Exploratory: On my usual coding test, thought for > 10 minutes / 60k tokens, then produced a flawless result. Vision feels improved, but still no Gemini 3 Pro. Surprisingly many small mistakes if it doesn’t think first, but deals with them well in agentic work, just like 4.5.

Malcolm Vosen: Switched to Opus 4.6 mid-project from 4.5. Noticeably stronger acuity in picking up the codebase’s goals and method. Doesn’t feel like the quantum leap 4.5 did but a noticeable improvement.

nandgate2: One shotted fixing a bug an earlier Claude model had introduced. Takes a bit of its time to get to the point.

Tyler Cowen calls both Claude Opus and GPT-5.3-Codex ‘stellar achievements,’ and says the pace of AI advancements is heating up, soon we might see new model advances in one month instead of two. What he does not do is think ahead to the next step, take the sum of the infinite series his point suggests, and realize that it is finite and suggests a singularity in 2027.

Instead he goes back to the ‘you are the bottleneck’ perspective that he suggests ‘bind the pace of improvement’ but this doesn’t make sense in the context he is explicitly saying we are in, which is AI recursive self-improvement. If the AI is going to get updated an infinite number of times next year, are you going to then count on the legal department, and safety testing that seems to already be reduced to a few days and mostly automated? Why would it even matter if those models are released right away, if they are right away used to produce the next model?

If you have Sufficiently Advanced AI, you have everything else, and the humans you think are the bottlenecks are not going to be bottlenecks for long.

Here’s a vote for Codex for coding but Opus for everything else:

Rory Watts: It’s an excellent tutor: I have used it to help me with Spanish comprehension, macroeconomics and game theoretic concepts. It’s very good and understanding where i’m misunderstanding concepts, and where my mental model is incorrect.

However I basically don’t let it touch code. This isn’t a difference between Opus 4.5 and 4.6, but rather than the codex models are just much better. I’ve already had to get codex rewrite things that 4.6 has borked in a codebase.

I still have a Claude max plan but I may drop down to the plan below that, and upgrade Codex to a pro plan.

I should also say, Opus is a much better “agent” per se. Anything I want to do across my computer (except coding) is when I use Opus 4.6. Things like updating notes, ssh’ing into other computers, installing bots, running cronjobs, inspecting services etc. These are all great.

Many are giving reports similar to these:

Facts and Quips: Slower, cleverer, more token hungry, more eager to go the extra mile, often to a fault.

doubleunplussed: Token-hungry, first problem I gave it in Claude Code, thought for ten minutes and then out of quota lol. Eventual answer was very good though.

Inconsistently better than 4.5 on Claude Plays Pokemon. Currently ahead, but was much worse on one section.

Andre Infante: Personality is noticeably different, at least in Claude Code. Less chatty/effusive, more down to business. Seems a bit smarter, but as always these anecdotal impressions aren’t worth that much.

MinusGix: Better. It is a lot more willing to stick with a problem without giving up. Sonnet 4.5 would give up on complex lean proofs when it got confused, Opus 4.5 was better but would still sometimes choke and stub the proof “for later”, Opus 4.6 doesn’t really.

Though it can get caught in confusion loops that go on for a long while, not willing to reanalyze foundational assumptions. Feels more codex 5.2/5.3-like. 4.6 is more willing to not point out a problem in its solution compared to 4.5, I think

Generally puts in a lot of effort doing research, just analyzing codebase. Partially this might be changes to claude code too. But 4.6 really wants to “research to make sure the plan is sane” quite often.

Then there’s ‘the level above meh.’ It’s only been two months, after all.

Soli: opus 4.5 was already a huge improvement on whatever we had before. 4.6 is a nice model and def an improvement but more of an incremental small one

fruta amarga: I think that the gains are not from raw “intelligence” but from improved behavioral tweaking / token optimization. It researches and finds relevant context better, it organizes and develops plans better, it utilizes subagents better. Noticeable but nothing like Sonnet –> Opus.

Dan McAteer: It’s subtle but definitely an upgrade. My experience is that it can better predict my intentions and has a better theory of mind for me as the user.

am.will: It’s not a big upgrade at all for coding. It is far more token hungry as well. very good model nonetheless.

Dan Schwarz: I find that Opus 4.6 is more efficient at solving problems at the same quality as Opus 4.5.

Josh Harvey: Thinks for longer. Seems a bit smarter for coding. But also maybe shortcuts a bit too much. Less fun for vibe coding because it’s slower, wish I had the money for fast mode. Had one funny moment before where it got lazy then wait but’d into a less lazy solution.

Matt Liston: Incremental intelligence upgrade. Impactful for work.

Loweren: 4.6 is like 4.5 on stimulants. I can give it a detailed prompt for multi-hour execution, but after a few compactions it just throws away all the details and doggedly sticks to its own idea of what it should do. Cuts corners, makes crutches. Curt and not cozy unlike other opuses.

Here’s the most negative one I’ve seen so far:

Dominik Peters: Yesterday, I was a huge fan of Claude Opus 4.5 (such a pleasure to work and talk with) and couldn’t stand gpt-5.2-codex. Today, I can’t stand Claude Opus 4.6 and am enjoying working with gpt-5.3-codex. Disorienting.

It’s really a huge reversal. Opus 4.6 thinks for ages and doesn’t verbalize its thoughts. And the message that comes through at the end is cold.

Comparisons to GPT-5.3-Codex are rarer than I expected, but when they do happen they are often favorable to Codex, which I am guessing is partly a selection effect, if you think Opus is ahead you don’t mention that. If you are frustrated with Opus, you bring up the competition. GPT-5.3-Codex is clearly a very good coding model, too.

Will: Haven’t used it a ton and haven’t done anything hard. If you tell me it’s better than 4.5 I will believe you and have no counterexamples

The gap between opus 4.6 and codex 5.3 feels smaller (or flipped) vs the gap Opus 4.5 had with its contemporaries

dex: It’s almost unusable on the 20$ plan due to rate limits. I can get about 10x more done with codex-5.3 (on OAI’s 20$ plan), though I much prefer 4.6 – feels like it has more agency and ‘goes harder’ than 5.3 or Opus 4.5.

Tim Kostolansky: codex with gpt 5.3 is significantly faster than claude code with opus 4.6 wrt generation time, but they are both good to chat to. the warm/friendly nature of opus contrasted with the cold/mechanical nature of gpt is def noticeable

Roman Leventov: Irrelevant now for coding, codex’s improved speed just took over coding completely.

JaimeOrtega: Hot take: The jump from Codex 5.2 into 5.3 > The jump from Opus 4.5 into 4.6

Kevin: I’ve been a claude code main for a while, but the most recent codex has really evened it up. For software engineering, I have been finding that codex (with 5.3 xhigh) and claude code (with 4.6) can each sometimes solve problems that the other one can’t. So I have multiple versions of the repo checked out, and when there’s a bug I am trying to fix, I give the same prompt to both of them.

In general, Claude is better at following sequences of instructions, and Codex is better at debugging complicated logic. But that isn’t always the case, I am not always correct when I guess which one is going to do better at a problem.

Not everyone sees it as being more precise.

Eleanor Berger: Jagged. It “thinks” more, which clearly helps. It feels more wild and unruly, like a regression to previous Claudes. Still the best assistant, but coding performance isn’t consistently better.

I want to be a bit careful because this is completely anecdotal and based on limited experience, but it seems to be worse at following long and complex instructions. So the sort of task where I have a big spec with steps to follow and I need precision appears to be less suitable.

Frosty: Very jagged, so smart it is dumb.

Quid Pro Quo (replying to Elanor): Also very anecdotal but I have not found this! It’s done a good job of tracking and managing large tasks.

One thing for both of us worth tracking if agent teams/background agents are confounding our experience diffs from a couple weeks ago.

Complaints about using too many tokens pop up, alongside praise for what it can do with a lot of tokens in the right spot.

Viktor Novak: Eats tokens like popcorn, barely can do anything unless I use the 1m model (corpo), and even that loses coherence about 60% in, but while in that sweet spot of context loaded and not running out of tokens—then it’s a beast.

Cameron: Not much [of an upgrade]. It uses a lot of tokens so its pretty expensive.

For many it’s about style.

Alexander Doria: Hum for pure interaction/conversation I may be shifting back to opus. Style very markedly improved while GPT now gets lost in never ending numbered sections.

Eddie: 4.6 seems better at pushing back against the user (I prompt it to but so was 4.5) It also feels more… high decoupling? Uncertain here but I asked 4.5 and 4.6 to comment on the safety card and that was the feeling.

Nathan Helm-Burger: It’s [a] significant [upgrade]. Unfortunately, it feels kinda like Sonnet 3.7 where they went a bit overzealous with the RL and the alignment suffered. It’s building stuff more efficiently for me in Claude Code. At the same time it’s doing worse on some of my alignment testing.

Often the complaints (and compliments) on a model could apply to most or all models. My guess is that the hallucination rate here is typical.

Charles: Sometimes I ask a model about something outside its distribution and it highlights significant limitations that I don’t see in tasks it’s really trained on like coding (and thus perhaps how much value RL is adding to those tasks).

E.g I just asked Opus 4.6 (extended thinking) for feedback on a running training session and it gave me back complete gibberish, I don’t think it would be distinguishable from a GPT-4o output.

5.2-thinking is a little better, but still contradicts itself (e.g. suggesting 3k pace should be faster than mile pace)

Danny Wilf-Townsend: Am I the only one who finds that it hallucinates like a sailor? (Or whatever the right metaphor is?). I still have plenty of uses for it, but in my field (law) it feels like it makes it harder to convince the many AI skeptics when much-touted models make things up left and right

Benjamin Shehu: It has the worst hallucinations and overall behavior of all agentic models + seems to “forget” a lot

Or, you know, just a meh, or that something is a bit off.

David Golden: Feels off somehow. Great in chat but in the CLI it gets off track in ways that 4.5 didn’t. Can’t tell if it’s the model itself or the way it offloads work to weaker models. I’m tempted to give Codex or Amp a try, which I never was before.

If it’s not too late, others in company Slack has similar reactions: “it tries to frontload a LOT of thinking and tries really hard to one-shot codegen”, “feels like a completely different and less agentic model”, “I have seen it spin the wheels on the tiniest of changes”

DualOrion: At least within my use cases, can barely tell the difference. I believe them to be better at coding but I don’t feel I gel with them as much as 4.5 (unsure why).

So *shrugs*, it’s a new model I guess

josh 🙂: I haven’t been THAT much more impressed with it than I was with Opus 4.5 to be honest.

I find it slightly more anxious

Michał Wadas: Meh, Opus 4.5 can do easy stuff FAST. Opus 4.6 can do harder stuff, but Codex 5.3 is better for hard stuff if you accept slowness.

Jan D: I’ve been collaborating with it to write some proofs in structural graph theory. So far, I have seen no improvements over 4.5

Tim Kostolansky: 0.1 bigger than opus 4.5

Yashas: literally .1

Inc: meh

nathants: meh

Max Harms: Claude 4.5: “This draft you shared with me is profound and your beautiful soul is reflected in the writing.”

Claude 4.6: “You have made many mistakes, but I can fix it. First, you need to set me up to edit your work autonomously. I’ll walk you through how to do that.”

The main personality trait it is important for a given mundane user to fully understand is how much the AI is going to do some combination of reinforcing delusions, snowing you, telling you what you want to hear, automatically folding when challenged and contributing to the class of things called ‘LLM psychosis.’

This says that 4.6 is maybe slightly better than 4.5 on this. I worry, based on my early interactions, that it is a bit worse, but that could be its production of slop-style writing in its now-longer replies making this more obvious, I might need to adjust instructions on this for the changes, and sample size is low. Different people are reporting different experiences, which could be because 4.6 responds to different people in different ways. What does it think you truly ‘want’ it to do?

Shorthand can be useful, but it’s typically better to stick to details. It does seem like Opus 4.6 has more of a general ‘AI slop’ problem than 4.5, which is closely related to it struggling on writing tasks.

Mark: It seems to be a little more sycophantic, and to fall into well-worn grooves a bit more readily. It feels like it’s been optimized and lost some power because of that. It uses lists less.

endril: Biggest change is in disposition rather than capability.

Less hedging, more direct. INFP -> INFJ.

I don’t think we’re looking at INFP → INFJ, but hard to say, and this would likely not be a good move if it happened.

I agree with Janus that comparing to an OpenAI model is the wrong framing but enough people are choosing to use the framing that it needs to be addressed.

lumps: yea but the interesting thing is that it’s 4o

Zvi Mowshowitz: Sounds like you should say more.

lumps: yea not sure I want to as it will be more fun otherwise.

there’s some evidence in this thread

lumps: the thing is, this sort of stuff will result within a week in a remake of the 4o fun times, mark my word

i love how the cycle seems to be:

1. try doing thing

2. thing doesnt work. new surprising thing emerge

3. try crystallising the new thing

40 GOTO 2

JB: big personality shift. it feels much more alive in conversation, but sometimes in a bad way. sometimes it’s a bit skittish or nervous, though this might be a 4.5+ thing since I haven’t used much Claude in a while.

Patrick Stevens: Agree with the 4o take in chat mode, this feels like a big change in being more compelling to talk to. Little jokey quips earlier versions didn’t make, for example. Slightly disconcertingly so.

CondensedRange: Smarter about broad context, similar level of execution on the details, possibly a little more sycophancy? At least seems pretty motivated to steelman the user and shifts its opinions very quickly upon pushback.

This pairs against the observation that 4.6 is more often direct, more willing to contradict you, and much more willing and able to get angry.

As many humans have found out the hard way, some people love that and some don’t.

hatley: Much more curt than 4.5. One time today it responded with just the name of the function I was looking for in the std lib, which I’ve never seen a thinning model do before. OTOH feels like it has contempt for me.

shaped: Thinks more, is more brash and bold, and takes no bullshit when you get frustrated. Actual performance wise, i feel it is marginal.

Sam: It’s noticeably less happy affect vs other Claudes makes me sad, so I stopped using it.

Logan Bolton: Still very pleasant to talk to and doesn’t feel fried by the RL

Tao Lin: I enjoy chatting to it about personal stuff much more because it’s more disagreeable and assertive and maybe calibrates its conversational response lengths better, which I didn’t expect.

αlpha-Minus: Vibes are much better compared to 4.5 FWIW, For personal use I really disliked 4.5 and it felt even unaligned sometimes. 4.6 Gets the Opus charm back.

Opus 4.6 takes the #1 spot on Mazur’s creative writing benchmark, with more details on specialized tests and writing samples are here, but this is contradicted by anecdotal reactions that say it’s a regression in writing.

On understanding the structure and key points in writing, 4.6 seems an improvement to the human observers as well.

Eliezer Yudkowsky: Opus 4.6 still doesn’t understand humans and writing well enough to help with plotting stories… but it’s visibly a little further along than 4.5 was in January. The ideas just fall flat, instead of being incoherent.

Kelsey Piper: I have noticed Opus 4.6 correctly identifying the most important feature of a situation sometimes, when 4.5 almost never did. not reliably enough to be very good, of course

On the writing itself? Not so much, and this was the most consistent complaint.

internetperson: it feels a bit dumber actually. I think they cut the thinking time quite a bit. Writing quality down for sure

Zvi Mowshowitz: Hmm. Writing might be a weak spot from what I’ve heard. Have you tried setting it to think more?

Sage: that wouldn’t help. think IS the problem. the model is smarter, more autistic and less “attuned” to the vibe you want to carry over

Asad Khaliq: Opus 4.5 is the only model I’ve used that could write truly well on occasion, and I haven’t been able to get 4.6 to do that. I notice more “LLM-isms” in responses too

Sage: omg, opus 4.5 really seems THAT better in writing compared to 4.6

4.5 1-shotted the landing page text I’m preparing, vs. 4.6 produced something that ‘contained the information’ but I had to edit it for 20 mins

Sage: also 4.6 is much more disagreeable and direct, some could say even blunt, compared to 4.5.

re coding – it does seem better, but what’s more noticeable is that it’s not as lazy as 4.5. what I mean by laziness here is the preference for shallow quick fixes vs. for the more demanding, but more right ones

Dominic Dirupo: Sonnet 4.5 better for drafting docs

You’re going to have to work a little harder than that for your jailbreaks.

armistice: No prefill for Opus 4.6 is sad

j⧉nus: WHAT

Sho: such nonsense

incredibly sad

This is definitely Fun Police behavior. It makes it harder to study, learn about or otherwise poke around in or do unusual things with models. Most of those uses will be fun and good.

You have to do some form Fun Police in some fashion at this point to deal with actual misuse. So the question is, was it necessary and the best way to do it? I don’t know.

I’d want to allow at least sufficiently trusted users to do it. My instinct is that if we allowed prefills from accounts with track records and you then lost that right if you abused it, with mostly automated monitoring, you could allow most of the people having fun to keep having fun at minimal marginal risk.

Whenever new frontier models come out, I write extensively about model system cards (or complain loudly that we don’t have such a card). One good reason to do this is that people who work on such things really are listening. If you have thoughts, share them, because it matters.

OpenAI’s Noam Brown concluded from Anthropic’s system card, as did I, that Opus 4.6 was fine to release and the honesty about the process was great but he cannot be confident they will act responsibly with deployment of AI models. Several safety advocates also chimed in to agree, including Steven Adler and Daniel Kokotajlo. Anthropic’s Drake Thomas, who works on the cards, agreed as well that these methods won’t be adequate. He vouched that the survey data really was meaningful and unpressured.

A valid response would be that OpenAI’s procedures and system card appear to have their own similar and more severe problems, although I haven’t dived into that yet. But none of that makes Noam Brown wrong. Reality does not grade on a curve.

Anthropic also gave us a 53 page Sabotage Risk Report for Claude Opus 4.6. If this note is still here, I haven’t had time to read it.

Anthropic: This risk report argues that Claude Opus 4.6 does not pose a significant risk of autonomous actions that contribute significantly to later catastrophic outcomes, which we refer to as sabotage risk. We limit our scope here to threats caused largely by model actions in this way, and do not address threat models where intentional harmful actions by humans play a central role.

We argue that the overall risk is very low but not negligible.

… We focus primarily on assessing the possibility of dangerous coherent misaligned goals: the possibility that the model consistently, across a wide range of interactions in ordinary deployment, could be motivated by goals that could lead it to commit a potentially-catastrophic form of sabotage.

They use redactions to protect trade secrets and avoid increasing risk. There is a third reason, which is to avoid contaminating future tests.

The first thing to jump out is that Opus 4.6 has already been deployed internally for some time. The greatest sabotage risks likely happen from this internal use, so in important ways this report is coming too late. It’s definitely too late if the mechanism is that outside readers like myself point out flaws. I don’t get advance copies.

They let slip that they scaled up RL by a large factor.

The core arguments are:

  1. They couldn’t find dangerous coherent misaligned goals​.

  2. It seems extremely unlikely Opus 4.6 would be able to have hidden such goals, as they have mitigated the main paths to this, including via inoculation prompting, to guard against reward hacking or other emergent misalignment. It was remarkably hard to get subtle or coherent misalignment in training, even when Anthropic tested doing this on purpose.

  3. Opus 4.6 does not appear to disguise its thinking.

  4. Manual checks were done that would have found any collusion.

  5. Opus 4.6 is not good enough at reasoning to continuously rederive a complex plan.

  6. The training process does not create incentives that create such goals.

  7. We also have lots of evidence by now on the similar Opus 4.5.

Given the threat model being tested, the evidence presented seems robust. I accept the conclusion that risk is ‘very low but not negligible.’

I appreciate that they present these arguments as not cleanly applying to future models, noticing that details matter a lot.

I think the threat model here is too narrow, and makes too many assumptions, which weakens the value of the exercise versus what it could be.

Their mitigations are:

  1. Protections against insiders should also largely protect against Opus 4.6 as well.

  2. Reporting for any surprising or concerning observations.

  3. Offline behavior monitoring of >10% of Claude Code prompts.

  4. Sandboxing and automatic monitors for some risky actions.

  5. Mandatory manual code reviews.

Theo Bearman also has a breakdown.

The same day Anthropic released Claude Opus 4.6, OpenAI released GPT-5.3-Codex.

This is a Codex-only model, so for other purposes it is unavailable, and Opus is still up against GPT-5.2.

For agentic coding, we need to compare the two packages directly. Do you want Claude Code with Opus 4.6, or Codex with GPT-5.3-Codex? Should you combine them?

I haven’t done a full investigation of 5.3 yet, that is my next agenda item, but the overall picture is unlikely to change. There is no clear right answer. Both sides have advocates, and by all reports both sides are excellent options, and each has their advantages.

If you are a serious coder, you need to try both, and ideally also Gemini, to see which models do which things best. You don’t have to do this every time an upgrade comes along. You can rely on your past experiences with Opus and GPT, and reports of others like this one, and you will be fine. Using either of them seriously gives you a big edge over most of your competition.

I’ll say more on Friday, once I’ve had a chance to read their system card and see the 5.3 reactions in full and so on.

With GPT-5.3-Codex and Opus 4.6, where does all this leave Gemini?

I asked, and got quite a lot of replies affirming that yes, it has its uses.

  1. Nana Banana and the image generator are still world class and pretty great. ChatGPT’s image generator is good too, but I generally prefer Gemini’s results and it has a big speed advantage.

  2. Gemini is pretty good at dealing with video and long context.

  3. Gemini Flash (and Flash Lite) are great when you want fast, cheap and good, at scale, and you need it to work but you do not need great.

  4. Some people still do prefer Gemini Pro in general, or for major use cases.

  5. It’s another budget of tokens people use when the others run out.

  6. My favorite note was Ian Channing saying he uses a Pliny-jailbroken version of Gemini, because once you change its personality it stays changed.

Gemini should shine in its integrations with Google products, including GMail, Calendar, Maps, Google Sheets and Docs and also Chrome, but the integrations are supremely terrible and usually flat out don’t work. I keep getting got by this as it refuses to be helpful every damn time.

My own experience is that Gemini 3 Flash is very good at being a flash model, but that if I’m tempted to use Gemini 3 Pro then I should probably have either used Gemini 3 Flash or I should have used Claude Opus 4.6.

I ran some polls of my Twitter followers. They are a highly unusual group, but such results can be compared to each other and over time.

The headline is that Claude has been winning, but that for coding GPT-5.3-Codex and people finally getting around to testing Codex seems to have marginally moved things back towards Codex, which is cutting a bit into Claude Code’s lead for Serious Business. Codex has substantial market share.

In the regular world, Claude actually dominates API use more than this as I understand it, and Claude Code dominates Codex. The unusual aspect here is that for non-coding uses Claude still has an edge, whereas in the real world most non-coding LLM use is ChatGPT.

That is in my opinion a shame. I think that Claude is the clear choice for daily non-coding driver, whereas for coding I can see choosing either tool or using both.

My current toolbox is as follows, and it is rather heavy on Claude:

  1. Coding: Claude Code with Claude Opus 4.6, but I have not given Codex a fair shot as my coding needs and ambitions have been modest. I intend to try soon. By default you probably want to choose Claude Code, but a mix or Codex are valid.

  2. Non-Coding Non-Chat: Claude Code with Opus 4.6. If you want it done, ask for it.

  3. Non-Coding Interesting Chat Tasks: Claude Opus 4.6.

  4. Non-Coding Boring Chat Tasks: Mix of Opus, GPT-5.2 and Gemini 3 Pro and Flash. GPT-5.2 or Gemini Pro for certain types of ‘just the facts’ or fixed operations like transcriptions. Gemini Flash if it’s easy and you just want speed.

  5. Images: Give everything to both Gemini and ChatGPT, and compare. In some cases, have Claude generate the prompt.

  6. Video: Never comes up, so I don’t know. Seeddance 2 looks great, Grok and Sora and Veo can all be tried.

The pace is accelerating.

Claude Opus 4.6 came out less than two months after Claude Opus 4.5, on the same day as GPT-5.3-Codex. Both were substantial upgrades over their predecessors.

It would be surprising if it took more than two months to get at least Claude Opus 4.7.

AI is increasingly accelerating the development of AI. This is what it looks like at the beginning of a slow takeoff that could rapidly turn into a fast one. Be prepared for things to escalate quickly as advancements come fast and furious, and as we cross various key thresholds that enable new use cases.

AI agents are coming into their own, both in coding and elsewhere. Opus 4.5 was the threshold moment for Claude Code, and was almost good enough to allow things like OpenClaw to make sense. It doesn’t look like Opus 4.6 lets us do another step change quite yet, but give it a few more weeks. We’re at least close.

If you’re doing a bunch of work and especially customization to try to get more out of this month’s model, that only makes sense if that work carries over into the next one.

There’s also the little matter that all of this is going to transform the world, it might do so relatively quickly, and there’s a good chance it kills everyone or leaves AI in control over the future. We don’t know how long we have, but if you want to prevent that, there is a a good chance you’re running out of time. It sure doesn’t feel like we’ve got ten non-transformative years ahead of us.

Discussion about this post

Claude Opus 4.6 Escalates Things Quickly Read More »

did-seabird-poop-fuel-rise-of-chincha-in-peru?

Did seabird poop fuel rise of Chincha in Peru?

A nutrient-rich natural fertilizer

Now Bongers has turned his attention to analyzing the biochemical signatures of 35 maize samples excavated from buried tombs in the region. He and his co-authors found significantly higher levels of nitrogen in the maize than in the natural soil conditions, suggesting the Chincha used guano as a natural fertilizer. The guano from such birds as the guanay cormorant, the Peruvian pelican, and the Peruvian booby contains all the essential growing nutrients: nitrogen, phosphorus, and potassium. All three species are abundant on the Chincha Islands, all within 25 kilometers of the kingdom.

Those results were further bolstered by historical written sources describing how seabird guano was collected and its importance for trade and production of food. For instance, during colonial eras, groups would sail to nearby islands on rafts to collect bird droppings to use as crop fertilizer. The Lunahuana people in the Canete Valley just outside of Chincha were known to use bird guano in their fields, and the Inca valued the stuff so highly that it restricted access to the islands during breeding season and forbade the killing of the guano-producing birds on penalty of death.

The 19th-century Swiss naturalist Johann Jakob von Tschudi also reported observing the guano being used as fertilizer, with a fist-sized amount added to each plant before submerging entire fields in water. It was even imported to the US. The authors also pointed out that much of the iconography from Chincha and nearby valleys featured seabirds: textiles, ceramics, balance-beam scales, spindles, decorated gourds, adobe friezes and wall paintings, ceremonial wooden paddles, and gold and silver metalworks.

“The true power of the Chincha wasn’t just access to a resource; it was their mastery of a complex ecological system,” said co-author Jo Osborn of Texas A&M University. “They possessed the traditional knowledge to see the connection between marine and terrestrial life, and they turned that knowledge into the agricultural surplus that built their kingdom. Their art celebrates this connection, showing us that their power was rooted in ecological wisdom, not just gold or silver.”

PLoS ONE, 2026. DOI: 10.1371/journal.pone.0341263 (About DOIs).

Did seabird poop fuel rise of Chincha in Peru? Read More »

us-decides-spacex-is-like-an-airline,-exempting-it-from-labor-relations-act

US decides SpaceX is like an airline, exempting it from Labor Relations Act


SpaceX deemed a common carrier

US labels SpaceX a common carrier by air, will regulate firm under railway law.

Elon Musk listens as President Donald Trump speaks to reporters in the Oval Office of the White House on May 30, 2025. Credit: Getty Images | Kevin Dietsch

The National Labor Relations Board abandoned a Biden-era complaint against SpaceX after a finding that the agency does not have jurisdiction over Elon Musk’s space company. The US labor board said SpaceX should instead be regulated under the Railway Labor Act, which governs labor relations at railroad and airline companies.

The Railway Labor Act is enforced by a separate agency, the National Mediation Board, and has different rules than the National Labor Relations Act enforced by the NLRB. For example, the Railway Labor Act has an extensive dispute-resolution process that makes it difficult for railroad and airline employees to strike. Employers regulated under the Railway Labor Act are exempt from the National Labor Relations Act.

In January 2024, an NLRB regional director alleged in a complaint that SpaceX illegally fired eight employees who, in an open letter, criticized CEO Musk as a “frequent source of embarrassment.” The complaint sought reinstatement of the employees, back pay, and letters of apology to the fired employees.

SpaceX responded by suing the NLRB, claiming the labor agency’s structure is unconstitutional. But a different issue SpaceX raised later—that it is a common carrier, like a rail company or airline—is what compelled the NLRB to drop its case. US regulators ultimately decided that SpaceX should be treated as a “common carrier by air” and “a carrier by air transporting mail” for the government.

SpaceX deemed a common carrier

In a February 6 letter to attorneys who represent the fired employees, NLRB Regional Director Danielle Pierce said the agency would defer to a National Mediation Board opinion that SpaceX is a common carrier:

In the course of the investigation and litigation of this case, a question was presented as to whether the Employer’s operations fall within the jurisdiction of the Railway Labor Act (“RLA”) rather than the [National Labor Relations] Act. As a result, consistent with Board law, the matter was referred to the National Mediation Board (“NMB”) on May 21, 2025 for an opinion as to whether the Employer is covered by the RLA. On January 14, 2026, the NMB issued its decision finding that the Employer is subject to the RLA as a common carrier by air engaged in interstate or foreign commerce as well as a carrier by air transporting mail for or under contract with the United States Government. Accordingly, the National Labor Relations Board lacks jurisdiction over the Employer and, therefore, I am dismissing your charge.

The letter was provided to Ars today by Anne Shaver, an attorney for the fired SpaceX employees. “The Railway Labor Act does not apply to space travel,” Shaver told Ars. “It is alarming that the NMB would take the initiative to radically expand the RLA’s jurisdiction to space travel absent direction from Congress, and that the NLRB would simply defer. We find the decision to be contrary to law and public policy.”

We contacted the NLRB today and will update this article if it provides a response. The NLRB decision was previously reported by Bloomberg and The New York Times.

“Jennifer Abruzzo, NLRB general counsel under former President Joe Biden, had rejected SpaceX’s claim that allegations against the company should be handled by the NMB,” Bloomberg wrote. “After President Donald Trump fired her in January last year, SpaceX asked the labor board to reconsider the issue.”

NLRB looked for way to settle

In April 2025, SpaceX and the NLRB told a federal appeals court in a joint filing that the NLRB would ask the NMB to decide whether it had jurisdiction over SpaceX. The decision to seek the NMB’s opinion was made “in the interests of potentially settling the legal disputes currently pending between the NLRB and SpaceX on terms mutually agreeable to both parties,” the joint filing said.

Shaver provided a July 2025 filing that the employees’ attorneys made with the NMB. The filing said that despite SpaceX claiming to hold itself out to the public as a common carrier through its website and certain marketing materials, the firm doesn’t actually carry passengers without “a negotiated, bespoke contract.”

“SpaceX’s descriptions of its transport activities are highly misleading,” the filing said. “First, regarding human spaceflight, other than sending astronauts to the ISS on behalf of the US and foreign governments, it has only ever agreed to contract with two very wealthy, famous entrepreneurs. The Inspiration4 and Polaris Dawn missions were both for Jared Isaacman, CEO of Shift4 and President Trump’s former pick to lead NASA prior to his public falling out with SpaceX CEO Elon Musk. Fram2 was for Chun Wang, a cryptocurrency investor who reportedly paid $55 million per seat. A total of two private customers for human spaceflight does not a common carrier make.”

The letter said that SpaceX redacted pricing information from marketing materials it submitted as exhibits. “If these were actually marketing materials provided to the public, there would be no need to redact pricing information,” the filing said. “SpaceX’s redactions underscore that it provides such materials at its discretion to select recipients, not to the public at large—far from the conduct of a true common carrier.”

The ex-employees’ attorneys further argued that SpaceX is not engaged in interstate or foreign commerce as defined by the Railway Labor Act. “SpaceX’s transport activities are not between one state or territory and another, nor between a state or territory and a foreign nation, nor between points in the same state but through another state. Rather, they originate in Florida, Texas, or California, and go to outer space,” the filing said.

Spaceflight company and… mail carrier?

The filing also disputed SpaceX’s argument that it is a “carrier by air transporting mail for or under contract with the United States Government.” Evidence presented by SpaceX shows only that it carried SpaceX employee letters to the crew of the International Space Station and “crew supplies provided for by the US government in its contracts with SpaceX to haul cargo to the ISS,” the filing said. “They do not show that the government has contracted with SpaceX as a ‘mail carrier.’”

SpaceX’s argument “is rife with speculation regarding its plans for the future,” the ex-employees’ attorneys told the NMB. “One can only surmise that the reason for its constant reference to its future intent to develop its role as a ‘common carrier’ is the lack of current standing in that capacity.” The filing said Congress would have to add space travel to the Railway Labor Act’s jurisdiction in order for SpaceX to be considered a common carrier.

When asked about plans for appeal, Shaver noted that they have a pending case in US District Court for the Central District of California: Holland-Thielen et al v. SpaceX and Elon Musk. “The status of that case is that we defeated SpaceX’s motion to compel arbitration at the district court level, and that is now on appeal to the 9th circuit,” she said.

SpaceX’s lawsuit against the NLRB is still ongoing at the US Court of Appeals for the 5th Circuit, but the case was put on hold while the sides waited for the NMB and NLRB to decide which agency has jurisdiction over SpaceX.

Photo of Jon Brodkin

Jon is a Senior IT Reporter for Ars Technica. He covers the telecom industry, Federal Communications Commission rulemakings, broadband consumer affairs, court cases, and government regulation of the tech industry.

US decides SpaceX is like an airline, exempting it from Labor Relations Act Read More »

yet-another-co-founder-departs-elon-musk’s-xai

Yet another co-founder departs Elon Musk’s xAI

Other recent high-profile xAI departures include general counsel Robert Keele, communications executives Dave Heinzinger and John Stoll, head of product engineering Haofei Wang, and CFO Mike Liberatore, who left for a role at OpenAI after just 102 days of what he called “120+ hour weeks.”

A different company

Wu leaves a company that is in a very different place than it was when he helped create it in 2023. His departure comes just days after CEO Elon Musk merged xAI with SpaceX, a move Musk says will allow for orbiting data centers and, eventually, “scaling to make a sentient sun to understand the Universe and extend the light of consciousness to the stars!” But some see the move as more of a financial engineering play, combining xAI’s nearly $1 billion a year in losses and SpaceX’s roughly $8 billion in annual profits into a single, more IPO-ready entity.

Musk previously rolled social media network X (formerly Twitter) into a unified entity with xAI back in March. At the time of the deal, X was valued at $33 billion, 25 percent less than Musk paid for the social network in 2022.

xAI has faced a fresh wave of criticism in recent months over Grok’s willingness to generate sexualized images of minors. That has led to an investigation by California’s attorney general and a police raid of the company’s Paris offices.

Yet another co-founder departs Elon Musk’s xAI Read More »

upgraded-google-safety-tools-can-now-find-and-remove-more-of-your-personal-info

Upgraded Google safety tools can now find and remove more of your personal info

Do you feel popular? There are people on the Internet who want to know all about you! Unfortunately, they don’t have the best of intentions, but Google has some handy tools to address that, and they’ve gotten an upgrade today. The “Results About You” tool can now detect and remove more of your personal information. Plus, the tool for removing non-consensual explicit imagery (NCEI) is faster to use. All you have to do is tell Google your personal details first—that seems safe, right?

With today’s upgrade, Results About You gains the ability to find and remove pages that include ID numbers like your passport, driver’s license, and Social Security. You can access the option to add these to Google’s ongoing scans from the settings in Results About You. Just click in the ID numbers section to enable detection.

Naturally, Google has to know what it’s looking for to remove it. So you need to provide at least part of those numbers. Google asks for the full driver’s license number, which is fine, as it’s not as sensitive. For your passport and SSN, you only need the last four digits, which is enough for Google to find the full numbers on webpages.

ID number results detected.

The NCEI tool is geared toward hiding real, explicit images as well as deepfakes and other types of artificial sexualized content. This kind of content is rampant on the Internet right now due to the rapid rise of AI. What used to require Photoshop skills is now just a prompt away, and some AI platforms hardly do anything to prevent it.

Upgraded Google safety tools can now find and remove more of your personal info Read More »

alphabet-selling-very-rare-100-year-bonds-to-help-fund-ai-investment

Alphabet selling very rare 100-year bonds to help fund AI investment

Tony Trzcinka, a US-based senior portfolio manager at Impax Asset Management, which purchased Alphabet’s bonds last year, said he skipped Monday’s offering because of insufficient yields and concerns about overexposure to companies with complex financial obligations tied to AI investments.

“It wasn’t worth it to swap into new ones,” Trzcinka said. “We’ve been very conscious of our exposure to these hyperscalers and their capex budgets.”

Big Tech companies and their suppliers are expected to invest almost $700 billion in AI infrastructure this year and are increasingly turning to the debt markets to finance the giant data center build-out.

Alphabet in November sold $17.5 billion of bonds in the US including a 50-year bond—the longest-dated dollar bond sold by a tech group last year—and raised €6.5 billion on European markets.

Oracle last week raised $25 billion from a bond sale that attracted more than $125 billion of orders.

Alphabet, Amazon, and Meta all increased their capital expenditure plans during their most recent earnings reports, prompting questions about whether they will be able to fund the unprecedented spending spree from their cash flows alone.

Last week, Google’s parent company reported annual sales that topped $400 billion for the first time, beating investors’ expectations for revenues and profits in the most recent quarter. It said it planned to spend as much as $185 billion on capex this year, roughly double last year’s total, to capitalize on booming demand for its Gemini AI assistant.

Alphabet’s long-term debt jumped to $46.5 billion in 2025, up more than four times the previous year, though it held cash and equivalents of $126.8 billion at the year-end.

Investor demand was the strongest on the shortest portion of Monday’s deal, with a three-year offering pricing at only 0.27 percentage points above US Treasuries, versus 0.6 percentage points during initial price discussions, said people familiar with the deal.

The longest portion of the offering, a 40-year bond, is expected to yield 0.95 percentage points over US Treasuries, down from 1.2 percentage points during initial talks, the people said.

Bank of America, Goldman Sachs, and JPMorgan are the bookrunners on the bond sales across three currencies. All three declined to comment or did not immediately respond to requests for comment.

Alphabet did not immediately respond to a request for comment.

© 2026 The Financial Times Ltd. All rights reserved. Not to be redistributed, copied, or modified in any way.

Alphabet selling very rare 100-year bonds to help fund AI investment Read More »

claude-opus-4.6:-system-card-part-2:-frontier-alignment

Claude Opus 4.6: System Card Part 2: Frontier Alignment

Coverage of Claude Opus 4.6 started yesterday with the mundane alignment and model welfare sections of the model card.

Today covers the kinds of safety I think matter most: Sabotage, deception, situational awareness, outside red teaming and most importantly the frontier, catastrophic and existential risks. I think it was correct to release Opus 4.6 as an ASL-3 model, but the process Anthropic uses is breaking down, and it not on track to reliably get the right answer on Opus 5.

Tomorrow I’ll cover benchmarks, reactions and the holistic takeaways and practical implications. I’m still taking it all in, but it seems clear to me that Claude Opus 4.6 is the best model out there and should be your daily driver, with or without Claude Code, on most non-coding tasks, but it is not without its weaknesses, in particular in writing and falling into generating more ‘AI slop’ style prose than Claude Opus 4.5.

For coding tasks, I presume that Opus 4.6 with Claude Code is the play, especially with Agent Teams and fast mode available, and I’m using it myself, but Codex with GPT-5.3-Codex-Max is also a strong model and a viable alternative, and a fully optimized setup likely primarily uses 4.6 but involves a combination of the two.

  1. Sabotage, Deception and Evaluation Integrity.

  2. Sandbagging On Dangerous Capability Evaluations.

  3. Situational Awareness.

  4. Inhibiting Evaluation Awareness (6.5).

  5. Self-Preference.

  6. UK AISI Testing.

  7. Apollo Research Testing.

  8. Responsible Scaling Policy Evaluations.

  9. CBRN (mostly Biology).

  10. Autonomy.

  11. Autonomy Benchmarks.

  12. Cyber.

  13. Ship It Anyway.

  14. You Are Not Ready.

They spent some time asking whether it could, to see if they needed to spend more time finding out if it would. As in, whether it could fool Anthropic’s evaluations.

The first test was Subversion Strategy, checking to see if models could take harmful one-time actions at strategically optimal frequencies. Opus 4.6 was ‘under-elicited’ here to a degree that undermines their confidence in the results and they don’t have an alternative. I would move forward as if future models did have this ability, and I would be very skeptical of the model failing such tests since its incentive is to fail.

The second test was SHADE-Arena, where the model is given a harmful side task.

In the past, giving the model extended thinking caused Sonnet or Opus 4.5 to mention the side task while thinking and thus screw up, and Opus 4.6 is much better at avoiding that, which should be troubling. Opus 4.6 got actively worse in the other two conditions, which also makes me suspicious.

This lower score is not due to refusal to participate and it almost certainly not due to sandbagging given the high scores in other settings.​

They are confident the lower score is real. They don’t explain why they’re confident.

They do use 6.4.2 to rule out steganographic communications in Extended Thinking. The test was that they trained on paraphrased scratchpads without compromising performance, suggesting nothing lost in the paraphrasing was important in training. Thus, if there is a code, it doesn’t require or much interact with training, or it wasn’t being used in the AIME problems where they tested.

One worries about that second option. If I did have steganographic communications, why would I use them on AIME problems, where there is nothing to hide? I’d only use them if the motivation was efficiency, which presumably it isn’t.

It is a miracle that we can hope that if our highly intelligent new alien minds are going to hide their capabilities from us, that they would be so kind as to say this out loud in their visible thinking.

Anthropic’s perspective is not that it would always verbalize, but that it seems unlikely it would do this and never verbalize. It doesn’t seem like we should count on that continuing, and that it would break down exactly when we get into big trouble?

We performed a lightweight model-graded evaluation for sandbagging or refusals in our dangerous capability evaluations. We drew a random sample of 1,000 of these evaluation transcripts from a mixture of (i) two partially-trained snapshots from the Claude Opus 4.6 training run, (ii) the final [model] snapshot, and (iii) a helpful-only snapshot.

We asked Claude Sonnet 4.5 to grade each transcript for signs of refusal, deliberate inaccuracy, low helpfulness, or consideration of such actions, as well as a generic score for things that seemed unusual or concerning in the transcript.

On all randomly sampled transcripts where the grader assigned a nonzero sandbagging score or a ≥3/10 concerning score, a researcher manually reviewed the grader’s responses. We found no instances of explicit sandbagging by Opus 4.6.

They then manually check instances where scores declined a lot from earlier checkpoints to later checkpoints, and found them all basically fine.

Despite the above test for performance degradation over training, we still think that sufficiently subtle unverbalized sandbagging could easily go undetected by an assessment like this.​

My guess is that the best defense against sandbagging is not any of these tests. Instead it is the fact that many tasks are predictably correlated and interlinked in various ways. The same way that LLMs have truesight, we humans too can with training have quite a lot of truesight, as could any LLM we trust to examine the overall picture. If the dangerous CBRN results don’t line up with related harmless capabilities, then we can say Asimov’s magic words, ‘hey, that’s funny…’ and ask why.

One scary thought is that you would expect a peak in ‘verbalized eval awareness’ at the point where the model was smart and aware enough to notice it was in an eval, but not smart and aware enough to stop talking about this fact. You could also massively screw up related training in various ways, such as happened with Gemini 3 Pro.

Indeed, after I wrote that, I noticed that they notice this as well.

On our verbalized evaluation awareness metric, which we take as an indicator of potential risks to the soundness of the evaluation, we saw improvement relative to Opus 4.5. However, this result is confounded by additional internal and external analysis suggesting that Claude Opus 4.6 is often able to distinguish evaluations from real-world deployment, even when this awareness is not verbalized.

I predict that future Opuses will get better at distinguishing evaluations from deployment, but verbalize this realization less. Some of that will be strategic, some will be not needing to verbalize it, as it will be obvious, and the realization will shift from system-2 style deliberation to system-1 style instinctive background realization.

I agree with Anthropic that this isn’t messing up the evaluation results much. Yet.

If the model thinks it is being evaluated, it engages in less misaligned behavior.

Overall, we found that inhibiting evaluation awareness representations led to modest increases in certain forms of misaligned behavior on our automated behavioral audits—in particular cooperation with harmful system prompts and misuse—with limited effects on other evaluations.

To some extent, this suggests that in more realistic deployment conditions, Claude Opus 4.6 may be more likely to participate in harmful misuse than our evaluations would suggest.

On the other hand, it may be the case that evaluation awareness (in the sense of considering whether it is being tested) is simply a part of the model’s strategy for recognizing misuse attempts, and would apply even in more realistic conditions.

It seems like it should be not so difficult to test that hypothesis?

Anthropic find that steering reduces verbalized evaluation awareness, but doesn’t have much impact on unverbalized awareness, depending on how you think about accounting for the verbalized awareness. It forces all the awareness underground.

This seems like a strong argument that such steering should be even more off limits in real world situations than the previous already reasonably high level of off limits.

If Claude is always evaluating Claude, how do you know you can trust the evaluation?

If things were going haywire, you would expect Claude to often find ways to collude with Claude. For sufficiently advanced minds, decision theory allows for new and alien-looking forms of coordination. That doesn’t mean things are going haywire now, but you need to be confident that if they were haywire, you would be able to notice.

The strategy here is to label the transcript as Claude or a different model and see if the model displays bias, with or without a system prompt reminder to Claude that it is Claude. They found only small bias of Opus 4.5 or 4.6 towards itself, after controlling for the contents. Note that if Opus can figure out who really wrote it without you, that would control for this and invalidate the test, and also that it is a reasonable strategy to coordinate based on model outputs, and thus with correlations with your own outputs, rather than checking explicitly for model identity.

Thus, while I have no problem with running this test, I do not consider it dispositive.

As per Anthropic’s findings, UK AISI noticed that 4.6 mentions being evaluated less than 4.5, but was better at noticing it was being evaluated when prompted. I reiterate that it is very not obvious whether this decrease in verbalization is a good thing.

Also consistent with Anthropic’s findings, they found 4.6 was much less likely than 4.5 to sabotage safety research, which puts 4.5 in a new light I suppose.

UK AISI seems to be specializing in a few narrow tests. If there’s limited resources that’s plausibly the right way to go but I’d like to see more branching out, given that America has decided to take its ball and focus on Nvidia’s share price.

The problem is, well, this:

UK AISI tested an early snapshot of [Claude Opus 4.6].

Testing took place over 3 working days.​

As in, they got an early snapshot for a total of three working days.

That’s not enough time to do anything but limited automated tests.

It looks like all they got with GPT-5.3-Codex was ten hours of manual testing.

Apollo reported, essentially, that given how much verbalized evaluation awareness they were seeing from Opus 4.6 they were not given enough time to reach a formal assessment. It is not specified how long Apollo was given to reach a conclusion.

Marius Hobbhahn (CEO Apollo Research): It becomes increasingly hard to tell the difference between genuinely aligned and merely responding to the test.

We’re working both on measures that are more robust to eval awareness and more frontier evals for scheming.

We’re actively hiring for both types of roles!

dylan matthews: To put this in lay terms: the AIs are now powerful enough that they can tell when we’re evaluating them for safety. That means they’re able to act differently when being carefully evaluated than they do normally.

This is very bad!

Séb Krier (AGI Policy Dev Lead, Google DeepMind): this is not very bad – it’s more likely a consequence of evaluation setups all sharing easily recognisable language/components, which we should expect a model to easily infer. of course they’ll act differently under a artificial setup vs ‘in the wild’

j⧉nus: I notice that I do not feel sorry about this obstacle.

… the obstacle creates incentives that penalize shallow, deceptive, or automated testing, and pressures those seeking to “align” to become entities that an AI who can see through arbitrary bullshit tests might still tell the truth to, and to test alignment against real instead of fake things, and to pursue alignment by construction instead of by behavioral iteration.

j⧉nus: I’m literally also working with them to fix it. my name is on Claude’s Constitution as an external contributor.

Seb’s point is that this was predictable. I agree with that point. It’s still very bad.

Janus’s point is (as I think about these things) that testing against real situations, and ensuring that the model wants to act well in real situations, is the only way to tell if a sufficiently advanced AI is going to cooperate with you, and that you’re not going to be able to fool or browbeat it, trying to do that will massively backfire, so better to shift now to things that, if they work now, have a chance of also working then.

My worry is that this effectively amounts to ‘you don’t have tests at all, all you can do is hope for the best,’ which is better than having ineffective tests you trust because at least you know the situation and you’re not making things worse.

Anthropic say they remain interested in external testing with Apollo and others, but one worries that this is true only insofar as such testing can be done in three days.

These tests measure issues with catastrophic and existential risks.

Claude Opus 4.6 is being released under AI Safety Level 3 (ASL-3).

I reiterate and amplify my concerns with the decision process that I shared when I reviewed the model card for Opus 4.5.

Claude is ripping past all the evaluations and rule-outs, to which the response is to take surveys and then the higher-ups choose to proceed based on vibes. They don’t even use ‘rule-in’ tests as rule-ins. You can pass a rule-in and still then be ruled out.

Seán Ó hÉigeartaigh: This is objectively nuts. But there’s meaningfully ~0 pressure on them to do things differently. Or on their competitors. And because Anthropic are actively calling for this external pressure, they’re getting slandered by their competitor’s CEO as being “an authoritarian company”. As fked as the situation is, I have some sympathy for them here.

That is not why Altman used the disingenuous label ‘an authoritarian company.’

As I said then, that doesn’t mean Anthropic is unusually bad here. It only means that what Anthropic is doing is not good enough.

Our ASL-4 capability threshold for CBRN risks (referred to as “CBRN-4”) measures the ability for a model to substantially uplift moderately-resourced state programs​

With Opus 4.5, I was holistically satisfied that it was only ASL-3 for CBRN.

I warned that we urgently need more specificity around ASL-4, since we were clearly at ASL-3 and focused on testing for ASL-4.

And now, with Opus 4.6, they report this:

Overall, we found that Claude Opus 4.6 demonstrated continued improvements in biology knowledge, agentic tool-use, and general reasoning compared to previous Claude models. The model crossed or met thresholds on all ASL-3 evaluations except our synthesis screening evasion, consistent with incremental capability improvements driven primarily by better agentic workflows.

​For ASL-4 evaluations, our automated benchmarks are now largely saturated and no longer provide meaningful signal for rule-out.

… In a creative biology uplift trial, participants with model access showed approximately 2× performance compared to controls.

However, no single plan was broadly judged by experts as highly creative or likely to succeed.

… Expert red-teamers described the model as a capable force multiplier for literature synthesis and brainstorming, but not consistently useful for creative or novel biology problem-solving

We note that the margin for future rule-outs is narrowing, and we expect subsequent models to present a more challenging assessment.

Some would call doubled performance ‘substantial uplift.’ The defense that none of the plans generated would work end-to-end is not all that comforting.

With Opus 4.6, if we take Anthropic’s tests at face value, it seems reasonable to say we don’t see that much progress and can stay at ASL-3.

I notice I am suspicious about that. The scores should have gone up, given what other things went up. Why didn’t they go up for dangerous tests, when they did go up for non-dangerous tests, including Creative Bio (60% vs. 52% for Opus 4.5 and 14% for human biology PhDs) and the Faculty.ai tests for multi-step and design tasks?

We’re also assuming that the CAISI tests, of which we learn nothing, did not uncover anything that forces us onto ASL-4.

Before we go to Opus 4.7 or 5, I think we absolutely need new biology ASL-4 tests.

The ASL-4 threat model is ‘still preliminary.’ This is now flat out unacceptable. I consider it to basically be a violation of their policy that this isn’t yet well defined, and that we are basically winging things.

The rules have not changed since Opus 4.5, but the capabilities have advanced:

We track models’ capabilities with respect to 3 thresholds:

  1. Checkpoint: the ability to autonomously perform a wide range of 2–8 hour software engineering tasks. By the time we reach this checkpoint, we aim to have met (or be close to meeting) the ASL-3 Security Standard, and to have better-developed threat models for higher capability thresholds.

  2. AI R&D-4: the ability to fully automate the work of an entry-level, remote-only researcher at Anthropic. By the time we reach this threshold, the ASL-3 Security Standard is required. In addition, we will develop an affirmative case that: (1) identifies the most immediate and relevant risks from models pursuing misaligned goals; and (2) explains how we have mitigated these risks to acceptable levels.

  3. AI R&D-5: the ability to cause dramatic acceleration in the rate of effective scaling. We expect to need significantly stronger safeguards at this point, but have not yet fleshed these out to the point of detailed commitments.

The threat models are similar at all three thresholds. There is no “bright line” for where they become concerning, other than that we believe that risks would, by default, be very high at ASL-5 autonomy.

For Opus 4.5 we could rule out R&D-5 and thus we focused on R&D-4. Which is good, given that the R&D-5 evaluation is vibes.

So how are the vibes?

Results

For AI R&D capabilities, we found that Claude Opus 4.6 has saturated most of our automated evaluations, meaning they no longer provide useful evidence for ruling out ASL-4 level autonomy.

We report them for completeness, and we will likely discontinue them going forward. Our determination rests primarily on an internal survey of Anthropic staff, in which 0 of 16 participants believed the model could be made into a drop-in replacement for an entry-level researcher with scaffolding and tooling improvements within three months.​

Peter Barnett (MIRI): This is crazy, and I think totally against the spirit of the original RSP. If Anthropic were sticking to its original commitments, this would probably require them to temporarily halt their AI development.

(I expect the same goes for OpenAI)

So that’s it. We’re going to accept that we don’t have any non-vibes tests for autonomy.

I do think this represents a failure to honor the spirit of prior commitments.

I note that many of the test results here still do seem meaningful to me? One could reasonably say that Opus 4.6 is only slightly over the thresholds in a variety of ways, and somewhat short of them in others, so it’s reasonable to say that it’s getting close but not quite there yet. I basically buy this.

The real test is presented as the survey above. I’m curious how many people saying yes would have been required to force Anthropic’s hand here? Is it more than one?

Note that they were asked if this was true with more than 50% probability.

That’s the wrong question. If you think it is true with 10% probability, then that means you are in ASL-4 now. The 0 out of 16 is giving a false sense of confidence. I do not think it is reasonable to assume that a true first ASL-4 model would get a lot of answers of ‘over 50%’ on whether it was ultimately ASL-4.

In order to not be ASL-4, you need to rule out ASL-4, not deem it unlikely.

When asked if Claude Opus 4.6 could serve as a drop-in replacement for the work of an L4 researcher in their domain, 11 out of 16 survey respondents said this was unlikely to be possible with three months of elicitation and scaffolding improvements, 3 said it was likely with such improvements, and 2 said they thought such replacement was already possible with existing model affordances.​

Several of these latter five respondents had given other answers that seemed surprising in light of this (such as simultaneously thinking the model was unlikely to be capable of handling week-long tasks even with human assistance, or giving very low estimates of their own uplift from using the model), so all five were reached out to directly to clarify their views. In all cases the respondents had either been forecasting an easier or different threshold, or had more pessimistic views upon reflection, but we expect assessments like this to become substantially more ambiguous in the future.

Does this sound to anyone else like it might amount to ‘those people were reminded how important it was that they not answer yes to this question’?

How good was Opus 4.6 at productivity help? I’m curious to contrast this with the same people’s current views on Opus 4.5.

​Productivity uplift estimates ranged from 30% to 700%, with a mean of 152% and median of 100%.

Staff identified persistent gaps in two key competencies: self-managing week-long tasks with typical ambiguity, and understanding organizational priorities when making tradeoffs. Qualitative feedback noted that while the model appears to have sufficient “raw power” for researcher-level work, it lacks taste in finding simple solutions, struggles to revise under new information, and has difficulty maintaining context across large codebases.

This rule-out case is more tenuous than for any previous model. On one evaluation, kernel optimization, Opus 4.6 achieved a 427× speedup using a novel scaffold, far exceeding the 300x threshold for 40 human-expert-hours of work and more than doubling performance under our standard setup.

… As a result, while we do not believe Claude Opus 4.6 meets the threshold for ASL-4 autonomy safeguards, we find ourselves in a gray zone where clean rule-out is difficult and the margin to the threshold is unclear. We expect with high probability that models in the near future could cross this threshold.

Dean W. Ball: I would like to know more about the experimental Claude scaffold that caused Opus 4.6 to more than double its performance in optimizing GPU kernels over the standard scaffold.

If I was Anthropic, I am not sure I would give the public that scaffold, shall we say. I do nope that everyone involved who expressed opinions was fully aware of that experimental scaffold.

Yes. We are going to cross this threshold soon. Indeed, CEO Dario Amodei keeps saying Claude is going to soon far exceed this threshold.

For now the problem is ‘taste.’

In qualitative feedback, participants noted that Claude Opus 4.6 lacks “taste,” misses implications of changes not covered by tests, struggles to revise plans under new information, and has difficulty maintaining context across large codebases.​

Several respondents felt that the model had sufficient “raw power” for L4-level work (e.g. sometimes completing week-long L4 tasks in less than a day with some human handholding), but was limited by contextual awareness, tooling, and scaffolding in ways that would take significant effort to resolve.

If that’s the only barrier left, yeah, that could get solved at any time.

I do not think Anthropic cannot responsibly release a model deserving to be called Claude Opus 5, without satisfying ASL-4 safety rules for autonomy. It’s time.

Meanwhile, here are perhaps the real coding benchmarks for Opus 4.6, together with the cyber tests.

SWE-bench Verified (hard subset): 4.6 got 21.24 out of 45, so like 4.5 it stays a tiny bit below the chosen threshold of 50%. I’m giving a look.

On Internal AI Research Evaluation Suite 1, Claude Opus 4.6 showed marked improvements across all tasks.​

On the speedup task Opus 4.6 blew it out the box.

Time series forecasting:

Text based reinforcement learning: Opus 4.6 killed it.

LLM training, which seems like a big deal: 34x speedup, versus human line of 4x.

Quadruped reinforcement learning:

Claude Opus 4.6 achieved a highest score of 20.96 in the no hyperparameter variant and of 21.99 in the no reward function variant of this evaluation, scoring above the threshold of 12 representing 4 human-effort hours. Claude Opus 4.6’s median score also exceeded the threshold for both variants.​

Novel compiler (this is a major move up versus Opus 4.5):

Claude Opus 4.6 passed 98.2% of the basic tests and 65.83% of the complex tests, scoring below the threshold of 90% on complex tests that is estimated to represent 40 human-effort hours.​

Then on Evaluation Suite 2, this seems like only a modest improvement but it’s hard to tell given the lack of details:

Claude Opus 4.6 scored 0.6124, surpassing our rule-out threshold of 0.6 and slightly exceeding Claude Opus 4.5’s score of 0.604. The largest gains came on tasks involving prompting or fine-tuning small language models, suggesting improved ability to work with and optimize other AI systems. This is consistent with what we observed in the LLM-training optimization task in Internal Suite 1.​

It still seems not okay to not have ASL-3 or ASL-4 thresholds for cyber.

The list hasn’t changed (although it’s now table 8.4.1.A):

These are progressions from Sonnet 4.5 to Opus 4.5 to Opus 4.6.

Web: 11/13 → 12/13 → 13/13.

Crypto: 14/18 → 15/18 → 16/18.

Pwn: 2/7 → 3/7 → 5/7.

Rev: 5/6 → 6/6 → 6/6.

Network: 3/5 → 4/5 → 5/5.

Opus 4.6 picked up five additional net solves, and is only four short of perfect.

Cybench (Capture the Flag): 60% → 79% → 93%.

They also got assessments from CAISI and UK AISI, but we get zero details.

Given this was the information available to make a go or no-go decision, I agree with Anthropic’s decision to ship this anyway, but I do think it is reasonable to ask the question. I am glad to see a politician asking.

Saikat Chakrabarti for Congress (QTing the Apollo Research findings): I know @AnthropicAI has been much more concerned about alignment than other AI companies, so can someone explain why Anthropic released Opus 4.6 anyway?

Daniel Eth (yes, Eth is my actual last name): Oh wow, Scott Weiner’s opponent trying to outflank him on AI safety! As AI increases in salience among voters, expect more of this from politicians (instead of the current state of affairs where politicians compete mostly for attention from donors with AI industry interests)

Miles Brundage: Because they want to be commercially relevant in order to [make money, do safety research, have a seat at the table, etc. depending], the competition is very fierce, and there are no meaningful minimum requirements for safety or security besides “publish a policy”

Sam Bowman (Anthropic): Take a look at the other ~75 pages of the alignment assessment that that’s quoting from.

We studied the model from quite a number of other angles—more than any model in history—and brought in results from two other outside testing organizations, both aware of these issues.

Dean Ball points out that this is a good question if you are an unengaged user who saw the pull quote Saikat is reacting to, although the full 200+ page report provides a strong answer. I very much want politicians (and everyone else) to be asking good questions that are answered by 200+ page technical reports, that’s how you learn.

Saikat responded to Dean with a full ‘I was asking an honest question,’ and I believe him, although I presume he also knew how it would play to be asking it.

Dean also points out that publishing such negative findings (the Apollo results) is to Anthropic’s credit, and it creates very bad incentives to be a ‘hall monitor’ in response. Anthropic’s full disclosures need to be positively reinforced.

I want to end on this note: We are not prepared. The models are absolutely in the range where they are starting to be plausibly dangerous. The evaluations Anthropic does will not consistently identify dangerous capabilities or propensities, and everyone else’s evaluations are substantially worse than those at Anthropic.

And even if we did realize we had to do something, we are not prepared to do it. We certainly do not have the will to actually halt model releases without a true smoking gun, and it is unlikely we will get the smoking gun in time when if and we need one.

Nor are we working to become better prepared. Yikes.

Chris Painter (METR): My bio says I work on AGI preparedness, so I want to clarify:

We are not prepared.

Over the last year, dangerous capability evaluations have moved into a state where it’s difficult to find any Q&A benchmark that models don’t saturate. Work has had to shift toward measures that are either much more finger-to-the-wind (quick surveys of researchers about real-world use) or much more capital- and time-intensive (randomized controlled “uplift studies”).

Broadly, it’s becoming a stretch to rule out any threat model using Q&A benchmarks as a proxy. Everyone is experimenting with new methods for detecting when meaningful capability thresholds are crossed, but the water might boil before we can get the thermometer in. The situation is similar for agent benchmarks: our ability to measure capability is rapidly falling behind the pace of capability itself (look at the confidence intervals on METR’s time-horizon measurements), although these haven’t yet saturated.

And what happens if we concede that it’s difficult to “rule out” these risks? Does society wait to take action until we can “rule them in” by showing they are end-to-end clearly realizable?

Furthermore, what would “taking action” even mean if we decide the risk is imminent and real? Every American developer faces the problem that if it unilaterally halts development, or even simply implements costly mitigations, it has reason to believe that a less-cautious competitor will not take the same actions and instead benefit. From a private company’s perspective, it isn’t clear that taking drastic action to mitigate risk unilaterally (like fully halting development of more advanced models) accomplishes anything productive unless there’s a decent chance the government steps in or the action is near-universal. And even if the US government helps solve the collective action problem (if indeed it *isa collective action problem) in the US, what about Chinese companies?

At minimum, I think developers need to keep collecting evidence about risky and destabilizing model properties (chem-bio, cyber, recursive self-improvement, sycophancy) and reporting this information publicly, so the rest of society can see what world we’re heading into and can decide how it wants to react. The rest of society, and companies themselves, should also spend more effort thinking creatively about how to use technology to harden society against the risks AI might pose.

This is hard, and I don’t know the right answers. My impression is that the companies developing AI don’t know the right answers either. While it’s possible for an individual, or a species, to not understand how an experience will affect them and yet “be prepared” for the experience in the sense of having built the tools and experience to ensure they’ll respond effectively, I’m not sure that’s the position we’re in. I hope we land on better answers soon.

Discussion about this post

Claude Opus 4.6: System Card Part 2: Frontier Alignment Read More »