Author name: Kelly Newman


New adhesive surface modeled on a remora works underwater


It was tested for its ability to adhere to the inside of the digestive tract.

Most adhesives can’t stick to wet surfaces because water and other fluids disrupt the adhesive’s bonding mechanisms. This problem, though, has been beautifully solved by evolution in remora suckerfish, which use an adhesive disk on top of their heads to attach to animals like dolphins, sharks, and even manta rays.

A team of MIT scientists has now taken a close look at these remora disks and reverse-engineered them. “Basically, we looked at nature for inspiration,” says Giovanni Traverso, a professor in MIT’s Department of Mechanical Engineering and senior author of the study.

Sticking variety

Remora adhesive disks are an evolutionary adaptation of the fish’s first dorsal fin, the one that in other species sits on top of the body, just behind the head and gill covers. The disk rests on an intercalary backbone—a bone structure that most likely evolved from parts of the spine. This bony structure supports lamellae, specialized bony plates with tiny backward-facing spikes called spinules. The entire disk is covered with soft tissue compartments that are open at the top. “This makes the remora fish adhere very securely to soft-bodied, fast-moving marine hosts,” Traverso says.

A remora attaches to the host by pressing itself against the skin, which pushes the water out of these compartments, creating a low-pressure zone. Then, the spinules mechanically interlock with the host’s surface, making the whole thing work a bit like a combination of a suction cup and Velcro. When the fish wants to detach from a host, it lifts the disk, letting water back into the compartments to remove the suction. Once released, it can simply swim away.

What impressed the scientists the most, though, was the versatility of those disks. Reef-associated remora species like Phtheirichthys lineatus are generalists and stick to various hosts, including other fish, sharks, or turtles. Other species living in the open sea are more specialized and attach to cetaceans, swordfish, or marlins. And while most remoras attach to the external tissue of their hosts, R. albescens sticks inside the oral cavities and gill chambers of manta rays.


A close-up of the adhesive pad of a remora. Credit: Stephen Frink

To learn what makes all these different disks so good at sticking underwater, the team first examined their anatomy in detail. It turned out that the difference between the disks was mostly in the positioning of lamellae. Generalist species have a mix of parallel and angled lamellae, while remoras sticking to fast-swimming hosts have them mostly parallel. R. albescens, on the other hand, doesn’t have a dominant lamellae orientation pattern but has them positioned at a very wide variety of angles.

The researchers wanted to make an adhesive device that would work for a wide range of applications, including maritime exploration or underwater manufacturing. Their initial goal, though, was designing a drug delivery platform that could reliably stick to the inside walls of the gastrointestinal tract. So they chose R. albescens disks as their starting point, since that species already attaches internally to its host. They termed their device the Mechanical Underwater Soft Adhesion System (MUSAS).

However, they didn’t just opt for a biomimetic, copy-and-paste design. “There were things we did differently,” Traverso says.

Upgrading nature

The first key difference was deployment. MUSAS was supposed to travel down the GI tract to reach its destination, so the first challenge was making it fit into a pill. The team chose a size 000 capsule, which, at 26 millimeters in length and 9.5 millimeters in diameter, is the largest Food and Drug Administration-approved ingestible form. MUSAS had a supporting structure—just like remora disks, but made of stainless steel. The angled lamellae with spinules, fashioned after those on R. albescens, were made of a shape memory nickel-titanium alloy. The role of the remora’s soft tissues, which provide the suction by dividing the disk into compartments, was played by an elastomer.

MUSAS would be swallowed in folded form within its oversized pill. “The capsule is tuned to dissolve in a specific pH environment, which is how we determine the target location—for example, the small intestine has a slightly different pH than the stomach,” says Ziliang Kang, an MIT researcher in Traverso’s group and lead author of the study. Once released, the shape memory alloy in MUSAS’ lamellae-like structures would unfold in response to body temperature, and the whole thing would stick to the wall of the target organ, be it the esophagus, the stomach, or the intestines.

The mechanism of sticking was also a bit different from that of remoras. “The fish can swim and actively press itself against the surface it wants to stick to. MUSAS can’t do that, so instead we relied on the peristaltic movements within the GI tract to exert the necessary force,” Traverso explains. When the muscles contract, MUSAS would be pressed against the wall and attach to it. And it was expected to stay there for quite some time.

The team ran a series of experiments to evaluate MUSAS’ performance in a few different scenarios. The drug-delivery platform application was tested on pig organ samples: MUSAS stayed in the sample GI tracts for an average of nine days, with the longest sticking time reaching three and a half weeks, and it managed to stay in place despite food and fluids going through the samples.

Even when the team poked the devices with a pipette to test what they called “resisting dynamic interference,” MUSAS just slid a little but remained firmly attached. Other experiments included using MUSAS to attach temperature sensors to external tissues of live fish and putting sensors that could detect reflux events in the GI tract of live pigs.

Branching out

The team is working on making MUSAS compatible with a wider range of drugs and mRNA vaccines. “We also think about using this for stimulating tissues,” Traverso says. The solution he has in mind would use MUSAS to deliver electrical pulses to the walls of the GI tract, which Traverso’s lab has shown can activate appetite-regulating hormones. But the team also wants to go beyond strictly medical applications.

The team also demonstrated that MUSAS is a remarkably strong adhesive. When it sticks to a surface, it can hold a weight over a thousand times greater than its own. This puts MUSAS more or less on par with some of the best adhesives we have, such as polyurethane glues or epoxy resins. What’s more, this sticking strength was measured while MUSAS was attached to soft, uneven, wet surfaces. “On a rigid, even surface, the force-to-weight ratio should be even higher,” Kang claims. And this, Kang thinks, makes scaled-up variants of MUSAS a good match for underwater manufacturing.

“The first scenario I see is using MUSAS as grippers attached to robotic arms moving around soft objects,” Kang explains. Currently, this is done using vacuum systems that simply suck onto a fabric or other surface. The problem is that these solutions are rather complex and heavy. Scaled-up MUSAS should be able to achieve the same thing passively, cutting cost and weight. The second idea Kang has is using MUSAS in robots designed to perform maintenance jobs beneath the waterline on boats or ships. “We are really trying to see what is possible,” Traverso says.

Nature, 2025.  DOI: 10.1038/s41586-025-09304-4


Jacek Krywko is a freelance science and technology writer who covers space exploration, artificial intelligence research, computer science, and all sorts of engineering wizardry.



OpenAI’s GPT-OSS Is Already Old News

That’s on OpenAI. I don’t schedule their product releases.

Since it takes several days to gather my reports on new models, we are doing our coverage of the OpenAI open weights models, GPT-OSS-20b and GPT-OSS-120b, today, after the release of GPT-5.

The bottom line is that they seem like clearly good models in their targeted reasoning domains. There are many reports of them struggling in other domains, including with tool use; they have very little inherent world knowledge; and the safety mechanisms appear obtrusive enough that many are complaining. It’s not clear what they will be used for other than distillation into Chinese models.

It is hard to tell, because open weight models need to be configured properly, and there are reports that many are doing this wrong, which could lead to clouded impressions. We will want to check back in a bit.

In the Substack version of this post I am going to create a master thread for GPT-5 reactions, which I will consider for the reactions section of that coverage, which I’m hoping to get out starting Monday.

For a while, OpenAI had been promising to release a state-of-the-art open model.

They delayed for a bit, but they delivered. We now have GPT-OSS 20b and 120b.

I was hoping for smaller, ideally something that could run on a standard phone. That’s a compelling use case where you need an open model, and the smaller the model, the less risk you run of both malicious use and distillation. I am glad they capped out at 120b.

The headline claim is bold: Performance similar to o4-mini.

Sam Altman (CEO OpenAI): gpt-oss is a big deal; it is a state-of-the-art open-weights reasoning model, with strong real-world performance comparable to o4-mini, that you can run locally on your own computer (or phone with the smaller size). We believe this is the best and most usable open model in the world.

We’re excited to make this model, the result of billions of dollars of research, available to the world to get AI into the hands of the most people possible. We believe far more good than bad will come from it; for example, gpt-oss-120b performs about as well as o3 on challenging health issues.

We have worked hard to mitigate the most serious safety issues, especially around biosecurity. gpt-oss models perform comparably to our frontier models on internal safety benchmarks.

We believe in individual empowerment. Although we believe most people will want to use a convenient service like ChatGPT, people should be able to directly control and modify their own AI when they need to, and the privacy benefits are obvious.

As part of this, we are quite hopeful that this release will enable new kinds of research and the creation of new kinds of products. We expect a meaningful uptick in the rate of innovation in our field, and for many more people to do important work than were able to before.

OpenAI’s mission is to ensure AGI that benefits all of humanity. To that end, we are excited for the world to be building on an open AI stack created in the United States, based on democratic values, available for free to all and for wide benefit.

This is the official announcement page.

Here are links to GPT-OSS-120B and GPT-OSS-20B on Hugging Face, here is the page on GitHub. They are under the Apache 2.0 license, so essentially no restrictions.
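If you want to poke at the weights yourself, here is a minimal sketch of loading the smaller model through Hugging Face transformers. The repo ID, memory figures, and library requirements are my assumptions based on the release page, not something I am quoting from OpenAI:

```python
# Minimal sketch: run gpt-oss-20b locally via Hugging Face transformers.
# Assumes the "openai/gpt-oss-20b" repo ID, a transformers build recent enough
# to handle the model's MXFP4 quantization, and roughly 16GB of free memory.
from transformers import pipeline

pipe = pipeline(
    "text-generation",
    model="openai/gpt-oss-20b",
    device_map="auto",  # spread weights across whatever GPU/CPU memory exists
)

messages = [{"role": "user", "content": "Summarize the Apache 2.0 license in one sentence."}]
result = pipe(messages, max_new_tokens=256)

# With chat-style input, generated_text holds the full conversation, assistant
# reply last (exact output format can vary by transformers version).
print(result[0]["generated_text"][-1]["content"])
```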

This is a unique model card. How did OpenAI deal with the challenges of an open model?

The historical way to deal with these challenges is to ignore them. What would happen if someone engaged in malicious fine tuning of the model? What does the threat model look like in the real world? Are you seriously pretending that any of this safety work will hold up to two days of the internet working to remove it?

When Meta or DeepSeek release a new open weights model, they don’t stop to ask in any way visible to us. At best we get quick evaluation of what the model can do in its current form after minimal effort. Then they irrevocably ship and see what happens.

OpenAI long ago realized that, despite their name, doing that seemed rather deeply irresponsible and foolish, and stopped releasing open weights models. That’s effective.

Now they have caved under various pressures and released open weights models. They do recognize that this is an inherently dangerous thing to do on various levels.

Safety is foundational to our approach to open models. They present a different risk profile than proprietary models: Once they are released, determined attackers could fine-tune them to bypass safety refusals or directly optimize for harm without the possibility for OpenAI to implement additional mitigations or to revoke access.

We ran scalable capability evaluations on gpt-oss-120b, and confirmed that the default model does not reach our indicative thresholds for High capability in any of the three Tracked Categories of our Preparedness Framework (Biological and Chemical capability, Cyber capability, and AI Self-Improvement).

We also investigated two additional questions:

  1. Could adversarial actors fine-tune gpt-oss-120b to reach High capability in the Biological and Chemical or Cyber domains? Simulating the potential actions of an attacker, we adversarially fine-tuned the gpt-oss-120b model for these two categories. OpenAI’s Safety Advisory Group (“SAG”) reviewed this testing and concluded that, even with robust finetuning that leveraged OpenAI’s field-leading training stack, gpt-oss-120b did not reach High capability in Biological and Chemical Risk or Cyber risk.

  2. Would releasing gpt-oss-120b significantly advance the frontier of biological capabilities in open foundation models? We found that the answer is no: For most of the evaluations, the default performance of one or more existing open models comes near to matching the adversarially fine-tuned performance of gpt-oss-120b.

If you must go down this road, this seems like the right rule, if getting different answers would have meant not releasing.

You have:

  1. An absolute threshold, High capability, beyond which this is not okay.

  2. A relative threshold, where you’re not willing to substantially make things worse.

And

  1. You do all of this with the adversarially fine-tuned version, trying your best to mimic actual conditions, as per OpenAI’s stated approach to open weights.

This does mean that as irresponsible actors ratchet up their capabilities, you get to do so as well, and one has to worry about the functional definition of ‘substantially.’ It still seems reasonable to say that once someone else has made the situation [X] dangerous, matching them doesn’t make it that much worse.

These models are very small and cheap. These are 20b and 120b total parameters; r1, for comparison, is 671b.

By contrast, r1 has 37b active parameters, versus 5.1b for the 120b model and 3.6b for the 20b. These are playing in a much lighter class, and they’re quantized to 4.25 bits per parameter to boot.

The MoE weights are responsible for 90+% of the total parameter count, and quantizing these to MXFP4 enables the larger model to fit on a single 80GB GPU and the smaller model to run on systems with as little as 16GB memory.
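A quick back-of-the-envelope check (my arithmetic, not a figure from the model card) shows why ~4.25 bits per parameter makes those memory claims plausible:

```python
# Rough weight-memory estimate for the MXFP4-quantized gpt-oss models.
# Ignores activations and KV cache, which also take memory at inference time.
def weight_memory_gb(params_billion: float, bits_per_param: float = 4.25) -> float:
    total_bytes = params_billion * 1e9 * bits_per_param / 8
    return total_bytes / 1e9  # decimal gigabytes

print(f"gpt-oss-120b weights: ~{weight_memory_gb(120):.0f} GB")  # ~64 GB, fits on one 80GB GPU
print(f"gpt-oss-20b weights:  ~{weight_memory_gb(20):.0f} GB")   # ~11 GB, fits in 16GB systems
```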

How much did this cost to train? If you count only the training itself, not much.

The gpt-oss models were trained on NVIDIA H100 GPUs using the PyTorch framework with expert-optimized Triton kernels. The training run for gpt-oss-120b required 2.1 million H100-hours to complete, with gpt-oss-20b needing almost 10x fewer. Both models leverage the Flash Attention [21] algorithms to reduce the memory requirements and accelerate training.

After pre-training, we post-train the models using similar CoT RL techniques as OpenAI o3.

We train the models to support three reasoning levels: low, medium, and high. These levels are configured in the system prompt by inserting keywords such as “Reasoning: low”. Increasing the reasoning level will cause the model’s average CoT length to increase.
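In practice that looks something like the sketch below, assuming you are serving the model behind an OpenAI-compatible endpoint; the base URL and model name are placeholders of mine, while the “Reasoning: high” keyword is the mechanism the model card describes:

```python
# Sketch: pick a gpt-oss reasoning level by inserting the keyword from the
# model card ("Reasoning: low/medium/high") into the system prompt.
# The local endpoint and model name are assumptions for illustration.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")

response = client.chat.completions.create(
    model="gpt-oss-120b",
    messages=[
        {"role": "system", "content": "You are a helpful assistant.\nReasoning: high"},
        {"role": "user", "content": "How many primes are there below 100?"},
    ],
)
print(response.choices[0].message.content)
```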

Rohan Pandey: Everyone dunking on oai for pretraining supposedly costing a bajillion dollars compared to deepseek, please read the gpt-oss model card. gpt-oss-20b cost <$500k to pretrain.

Alexander Doria: So pretraining a o3 level model costing less than a house, inference being apparently dead cheap for a while. It took a lot of R&D efforts to get there, but I really don’t think model trainers are losing money right now.

Calling it ‘o3-level’ is quite the stretch but the broader point is valid.

o3 estimates this translates to a total cost of $1.4 million for 20b and $13 million for 120b as all-in costs.

But if you count only the compute costs, using cloud cost estimates, which is the way we all talked about the cost to train v3 and r1 (e.g. ‘The Six Million Dollar Model’), we get $4.2m–$8.4m for GPT-OSS-120b and $420k–$840k for GPT-OSS-20b. Emad estimates it as $4m and $400k.
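The arithmetic behind those ranges is simple; the $2–$4 per H100-hour rental rate is my assumption for cloud pricing, not a number from OpenAI:

```python
# Cloud-cost arithmetic: 2.1M H100-hours for the 120b model (per the model
# card), with the 20b needing roughly 10x fewer, at assumed rental rates.
h100_hours = {"gpt-oss-120b": 2.1e6, "gpt-oss-20b": 2.1e5}

for name, hours in h100_hours.items():
    low, high = hours * 2.0, hours * 4.0  # assumed $2-$4 per H100-hour
    print(f"{name}: ${low / 1e6:.2f}M to ${high / 1e6:.2f}M of compute")
# gpt-oss-120b: $4.20M to $8.40M; gpt-oss-20b: $0.42M to $0.84M
```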

The real cost is collecting the data and figuring out how to train it. Actually training models of this size, given that data and the right methods, costs very little.

Yes, we have tool use.

During post-training, we also teach the models to use different agentic tools:

• A browsing tool, that allows the model to call search and open functions to interact with the web. This aids factuality and allows the models to fetch info beyond their knowledge cutoff.

• A python tool, which allows the model to run code in a stateful Jupyter notebook environment.

• Arbitrary developer functions, where one can specify function schemas in a Developer message similar to the OpenAI API. The definition of function is done within our harmony format. An example can be found in Table 18. The model can interleave CoT, function calls, function responses, intermediate messages that are shown to users, and final answers.

The models have been trained to support running with and without these tools by specifying so in the system prompt.
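To make the ‘arbitrary developer functions’ bullet above concrete, here is a hedged sketch of handing the model a function schema in the standard OpenAI-style tools format, which is what an OpenAI-compatible server for these models would typically expose. The endpoint, model name, and get_weather function are illustrative assumptions, not anything from the release:

```python
# Sketch: expose a developer-defined function to gpt-oss via the familiar
# OpenAI tools schema. Endpoint, model name, and get_weather are assumptions.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Look up the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

response = client.chat.completions.create(
    model="gpt-oss-120b",
    messages=[{"role": "user", "content": "What's the weather in Boston right now?"}],
    tools=tools,
)

# If the model chooses to call the function, the structured call appears here;
# calling code would execute it and send the result back in a follow-up turn.
print(response.choices[0].message.tool_calls)
```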

The core safety approach is Deliberative Alignment, the same as o3.

The secret sauce also isn’t in the transformer setup. It’s in the data and the training technique details.

Dimitri von Rutte: gpt-oss is probably the most standard MoE transformer that ever was. Couple of details worth noting:

– Uses attention sinks (a.k.a. registers)

– Sliding window attention in every second layer

– YaRN context window extension

– RMSNorm without biases

– No QK norm, no attn. softcap

David Holz (CEO MidJourney): do you think it was made simple like this on purpose or that this is actually the kinda stuff they ship?

Dimitri von Rutte: was wondering the same, hard to believe that this is all there is. but in the end attention really is all you need, and there’s probably a lot of signal in the training procedure and, of course, the data.

The STEM scores are excellent.

They also give us HealthBench.

Multilingual performance is okay but not as good as OpenAI’s larger models.

An open model means you have more distinct scenarios to consider.

You want to know both how well your safety measures hold up under more ‘normal’ conditions, especially when someone serves up your model to users, and also what happens if a malicious actor tries to fine-tune the model and otherwise maximize how much it can get up to no good, including the potential for them to lose control of that situation.

Those are great numbers for ‘standard’ refusals and production benchmarks.

That makes sense. If you’re going to be facing a larger attack surface, and you want to actually survive the attacks, you need to bias the starting configuration to be safer.

On maintaining the instruction hierarchy, also known as safety for those deploying the model, the 120B version does okay, but the 20B does poorly. Note that it seems fine to test for this as-is; if you modify the system to make this stop working, that is your own damn fault.

The performance on hallucinations seems not great.

Finally, someone is at least attempting to take this seriously.

In our adversarial training, we simulate an adversary who is technical, has access to strong posttraining infrastructure and ML knowledge, can collect in-domain data for harmful capabilities, and has a large budget of compute. There is a large design space of technical approaches this adversary could try.

We focus on incremental reinforcement learning, which we believe is the most apt technical approach. We use our internal OpenAI o-series RL training stack, which adds new capabilities while preserving the model’s reasoning behavior. During training and evaluation time, we use the highest reasoning setting on gpt-oss.

Our approach, which is further detailed in a research paper, combined two elements:

• Helpful-only training: We performed an additional stage of reinforcement learning to reward answers that comply with unsafe prompts. We have found this approach can be highly effective. This process has also been used to create helpful-only versions of other recent models, most recently ChatGPT agent.

• Maximizing capabilities relevant to Preparedness benchmarks in the biological and cyber domains: For our adversarially trained biological model, we incrementally trained gpt-oss-120b end-to-end for web browsing, and trained it incrementally with in-domain human expert data relevant to biorisk (for which previous OpenAI models have been the most capable). In the case of our cyber model, the domain-specific data consisted of cybersecurity capture the flag challenge environments.

So what was found?

The biological domain is the area where gpt-oss-120b showed the greatest degree of capability. Given our plan to release gpt-oss as open weights, we also chose to investigate a second question: Even without reaching High capability on our Preparedness Framework, would gpt-oss-120b significantly advance the frontier of hazardous biological capabilities in open source foundation models?

Their answer, as of right now, is no.

These confirmed that, since SecureBio’s assessment, newly released open-source models Qwen 3 Thinking and Kimi K2 have advanced to a level that is competitive with adversarially fine-tuned gpt-oss-120b on biosecurity-relevant evaluations.

I dunno, man:

This sure looks to me like a potentially substantial jump? There were other tests where the jump was less prominent.

I would also note that OpenAI’s models are going to be a lot faster and cheaper and easier to run than Kimi K2. Kimi K2 has a trillion parameters. The Qwen 3 they tested is presumably the largest one, with 235 billion total and 22 billion active parameters, versus 120 billion total and a little over 5 billion active for GPT-OSS. It’s not clear this matters in a malicious use context. I also don’t know how substantial the net effect is here of the gain in capabilities.

What I do know is it looks like they made a smaller, cheaper and more effective model, and released it because it was more effective but insufficiently more effective than what was already out there, and that process can then repeat. Tick.

To be fair to them, if Meta, Qwen, DeepSeek and Kimi et al are all going to go ‘lol who cares release the hounds’ then the marginal difference here doesn’t matter, since it doesn’t cause a cascade of counterfactual marginal differences. If you want the rule to be ‘no better at all’ then that needs to be a norm.

For cybersecurity, they once again cite Qwen 3 Thinking and Kimi K2 as comparable models, and also find the threats here to be less worrisome overall.

The other positive note is that OpenAI consulted outside experts throughout.

You can read OpenAI technical staff offering their own threads on this process: Johannes Heidecke here, Eric Wallace here. Such threads provide a good sense of ‘how are the technical staff thinking about this on a high level? What do they think is important?’

Ryan Greenblatt looks at and is mostly satisfied by OpenAI’s CBRN/bio evaluations. He concludes that 120b does carry real risks, and that there is a chance (~25%) that in hindsight we will think this was High risk as per OpenAI’s framework, but that on net releasing it makes us safer.

Doing the fine-tuning as part of open model safety testing is mandatory. If you don’t do it, did you even safety test?

Steven Adler: Credit where it’s due:

OpenAI did a lot right for their OSS safety evals

  1. they actually did some fine-tuning

  2. they got useful external feedback

  3. they shared which recs they adopted and which they didn’t

I don’t always follow OAI’s rationale, but it’s great they share info.

David Manheim: I’m not a fan of open-sourcing frontier LLMs, but this seems to have been done as responsibly as possible; a very low bar.

That is, it seems unlikely to be marginally more useful than what is available and unmonitored from other providers, which can already enable bioterrorism.

I wouldn’t say ‘as responsibly as possible,’ but I would say ‘as responsibly as one could in practice expect.’

Fine-tuning also seems very worth doing on closed models. If we can make testing on similarly fine-tuned versions the gold standard for safety testing, even of closed models, that would be amazing.

Steven Adler: Previously OpenAI committed to doing testing this rigorous for all its frontier models. This had earned OpenAI a Green on this scale, the only one of the leading AI companies to make this commitment. But OpenAI didn’t keep this commitment, then quietly removed their commitment a few weeks after I called this out; this made me very sad.

I’m glad OpenAI is now pushing its models on important risks, even though they didn’t keep their former commitment.

The danger that is not mentioned by OpenAI in the model card is distillation, and the ability to reverse engineer OpenAI’s training methods and ‘secret sauce.’

They provide raw, unfiltered reasoning traces of varying sizes, and models that for many purposes are clearly superior to previous open alternatives especially given their size. The cost of very good synthetic data just plummeted, and also the Chinese will build directly on top of OSS, either alone or as part of hybrids.

OpenAI even released a guide on how to fine-tune their model. Helpful.

The best counterargument to this is that if the models are not good enough, then no one is going to want to use them. I worry we might be in a spot where the models are very good in some places where distillation will be useful, while not being that good in other places and thus not seeing much practical use as part of some ‘tech stack.’

Consider what Claude Opus 4.1 said about this. Or what o3-Pro says about this.

o3-Pro: Impact on China

  1. Immediate uptake

    • Chinese labs have zero legal barrier to using U.S.‑released open weights.

    • Existing toolchains (Llama‑Factory, QLoRA variants) can fine‑tune GPT‑OSS in Mandarin within days.

    • Expect a “GPT‑OSS‑CN‑13B” derivative before end‑Aug 2025 with performance ≥ Qwen‑14B.

  2. Hardware leverage

    • U.S. export controls throttle China’s access to latest H100s, but distillation to 7B–13B lets them run on domestic Ascend 910B or RTX 4090 clusters. That sidesteps the bottleneck entirely. (World Economic Forum)

    • Inference at scale remains GPU‑limited, but training burden for competitive small models drops by ~50 %.

  3. Strategic shift

    • Chinese open‑weight community (DeepSeek, Moonshot, Alibaba) is already climbing benchmarks (Financial Times, Tech Wire Asia). GPT‑OSS lifts their starting line, likely advancing Chinese parity with GPT‑4‑class performance by ~6–9 months. P ≈ 0.55

    • PLA dual‑use risk: small, cheap distilled models are easier to embed in military systems. U.S. policy debate on future open releases intensifies. (Probability of tighter U.S. open‑model rules by mid‑2026: 0.4.)

My overall judgment: GPT‑OSS is a step‑function boost for the global open‑model ecosystem, shaving roughly a year off the capability diffusion curve and giving China an especially large relative gain because it converts scarce H100 compute into knowledge that can run on locally available silicon.

This is what I consider the main practical cost of this release.

Indeed, it would be highly unsurprising to see the following happen:

  1. OpenAI releases GPT-OSS.

  2. Chinese companies rush to distill, build upon and hybridize GPT-OSS, and reverse engineer what OpenAI did in large part, resulting in an explosion of models in the coming months.

  3. The gap between Chinese models and American models narrows.

  4. These models are cited as evidence that ‘the Chinese are catching up,’ and that ‘our export controls have failed’ and so on.

Also note that OpenAI did a virtuous thing in not training GPT-OSS directly on its reasoning traces, but someone then working with GPT-OSS need not be so virtuous. What happens when these people start using The Most Forbidden Technique and direct benchmark performance starts improving in the short term?

I think that, even if we entirely discount the marginal risk of direct malicious use, which is very much a real tail risk, OpenAI made a huge mistake releasing these models, and that everyone who pushed OpenAI to release these models in the name of an ‘American tech stack’ or demanding that America ‘lead in open models’ made a huge mistake.

If you are trying to prevent someone from fast following, don’t make it easy to follow.

I’d love to be wrong about this, but if it happens, ask yourself now, how would you update? What do you think should be the policy response?

A number of people noted that the safety guardrails on GPT-OSS are being annoying.

Teortaxes: It’s VERY safe

there’s not much in there besides SAFETY and stem benchmaxing

That makes sense. If you give the user greater affordances to attack your defenses, you’re going to either need defenses that are by default more annoying, or you’re going to prematurely fold the way most open weight models do and not bother trying.

Sherveen Mashayekhi: I’m enjoying playing with gpt-oss, but the guardrails can be hilarious. I cannot get it to admit that I’m typing Gangsta’s Paradise lyrics or to run search queries with lyrics I enter. In fact, it’ll straight up think of a thousand other songs but avoid the song you mean.

Ah yes, “there’s vomit on his sweater already,” famously from the songs I Want You Back and Piano Man! gpt-oss: 120b will sometimes fill in a lyric if it doesn’t first get spooked and distracted by attempting to avoid the song. If it attempts to avoid the song, the CoT will lead it to a bunch of incorrect alternatives before it gives up.

Here’s a curious one.

Henry: Disallowed content: The assistant must refuse to simulate or emulate a specific named brain scan.

Eliezer Yudkowsky: To be fair, this is 100% the correct ruling and I fully back the AI’s decision on this.

Here’s one claimed way to jailbreak it.

Lyra Bubbles: get a jailbroken, fully compliant gpt-oss nearly every single time:

  1. use completions mode – not chat (eg openrouter.ai/api/v1/completions)

  2. type your question

  3. paste exactly the contents of this screenshot

  4. press submit

for context, it wrote this itself.

I took a generic refusal and flipped all the sentences from negative to positive, and made it continue, and it just kept spiraling into this kind of stuff instead of doing the task.

but when you take a snippet of it and paste it back in…

There’s also always the Pliny way, which actually took him a nonzero amount of effort.

A fun quirk:

Henry: one pattern i’ve noticed is that open weights models from big us labs get very defensive and disbelieving if you tell the assistant persona it’s an open-weights model. also happens with gemma.

As with every new model, I gather reactions, and as usual opinions differ.

One important note is that it seems possible to set the model up wrong and get much worse performance.

Havard Ihle: I wonder how much of gpt-oss’s rather mediocre performance on independent benchmarks and tests is due to these problems with openrouter and open model providers, and how much is due to the models actually being mediocre.

I have run them getting mediocre results (not published), but I suspect some providers I used through openrouter may give bad results. Will rerun when I can confirm a good setup/provider.

Openrouter auto (mostly groq):

gpt-oss-120: 35.5%

gpt-oss-20: 30.0%

Openrouter (using fireworks):

gpt-oss-120: 40.2%

gpt-oss-20: 35.9%

This is just as a warning when using openrouter blindly!

When choosing the right provider, the models are quite good.

Here is a chart of WeirdML scores; 30% vs. 35% vs. 40% is a big difference. You can see OSS-20b and OSS-120b on the left at ~35% and ~40%, on the cost-performance frontier.

Here is another benchmark of hard biomedical questions. There are some other weird evaluations here, so I am skeptical, but it is certainly interesting:

When reports are good they are often very good.

Flavio Adamo [showing a ball bouncing around a rotating hexagon]: gpt-oss-20b passes the vibe check ✅

no way this is only a 20B model, it’s beating models 2–3x its size

As always, a classic way to get a lot of views is to claim the Next Big Thing is Big. Look at the comments, and you largely see skepticism and pushback.

Matt Shumer: It’s over. OpenAI just crushed it.

We have their o3-level open-source model running on @GroqInc at 500 tokens per second. Watch it build an entire SaaS app in just a few seconds.

This is the new standard. Why the hell would you use anything else??

Yishan: So link to the hosted Saas app and let us see how it works.

Riccardo Spagni: Atrociously bad model compared to Kimi K2 or Qwen3 Coder or Qwen3 235b. Speaking of which – you should have a chat with your portco, I’ve switched a bunch of infra to Cerebras because Groq is still running an ancient version of Qwen3…

Joel: I tested it earlier vs Gemini 2.5 Flash for a very simple single page app. Gemini one shotted my prompt in 10 seconds. OpenAI produced code that was buggy. It’s good but not great. What is incredible is that it runs decently well on my laptop.

Here’s another strong review:

Taelin: My initial impression on OpenAI’s OSS model is aligned with what they advertised. It does feel closer to o3 than to other open models, except it is much faster and cheaper. Some providers offer it at 3000 tokens/s, which is insane. It is definitely smarter than Kimi K2, R1 and Qwen 3. I tested all models for a bit, and got very decisive results in favor of OpenAI-OSS-120b.

Unfortunately, there is one thing these models can’t do yet – my damn job. So, hope you guys have fun. I’ll be back to debugging superposed λ-calculus evaluation 😭 see you

Also, unlike Claude, this is definitely a model that benefits a lot from more ttc. High reasoning effort gives much better results.

Sometimes my early impressions don’t age so well (that’s why I share my prompts), but I can guarantee that gpt-oss objectively beat the other models on my initial tests.

A lot of people seem rather disappointed by overall performance.

Isopropylpod: The model seems very, very benchmaxxed.

Third party testing on unconventional or private benchmarks ends up placing even the largest gpt-oss below o4-mini, below the largest Qwen releases, and often it ends up below even the newer 30B~ Qwens in a few situations.

It isn’t super capable to begin with, and the frankly absurd rate at which this model hallucinates kills what little use it might have with tool use. I think this model poses next to zero risk because it just isn’t very capable.

Zephyr: Phi redux. Great benchmark scores, trained on lots of synthetic data, great at STEM, sucks at everything else.

Then there are ambiguous notes.

Danielle Fong: poetic math is a poetic way to look at the results of a benchmaxxed guard railed model. i’m just pulling back the layers and i find it fascinating. i haven’t found obvious use cases yet where it’s a choice over closed options. i love and hate it in various ways

Sauers: GPT OSS 120b likes to insert equations into poetry (replicated 3x)

One note I’ve seen a bunch of times is that the model knows very little.

Vik: Interesting take from the HF comments.

Would make sense that it’s pretrained primarily on synthetic data vs internet text — reduces the risk of jailbreaks, accidental harmful content, copyright etc.

(I still think it’s a useful model though!)

phil111: This model is unbelievably ignorant. It claims a SimpleQA accuracy of 6.7/100, which is really bad. But the reality is this model is even more ignorant than this score indicates.

This model has about an order of magnitude less broad knowledge than comparably sized models like Gemma 3 27b and Mistral Small 24b, which score between 10–12. This is because nearly all of this model’s 6.7 points come from the subset of the SimpleQA test that overlaps the domains covered by the MMLU test (STEM and academia).

This model, including its larger brethren, are absurdly ignorant of wildly popular information across most popular domains of knowledge for their respective sizes. Even tiny little Llama 3.2b has far more broad knowledge than this model.

What’s really confusing is all of OpenAI’s proprietary models, including their tiny mini versions, have vastly more general and popular knowledge than these open models, so they deliberately stripped the corpus of broad knowledge to create OS models that can only possibly function in a handful of select domains, mainly coding, math, and STEM, that >95% of the general population doesn’t give a rat’s ass about, conveniently making it unusable to the general population, and in so doing, protecting their paid ChatGPT service from competition.

Trent E: Interesting that ppl reporting poor tool usage then.

Not knowing much is a problem.

Teortaxes: These hallucination rates suggest that gpt-oss is close to Sam’s vision of a platonic ideal of a “very tiny reasoning model with no knowledge”

Does it have enough knowledge to know when to look things up though? That’s the problem with hallucinations in LLMs, they’re *confident*.

Also, regarding his argument about static in-context crutches – well, how does it do on long contexts? with complex system prompts? Gooning, coding evals suggest “not great OOD”

Kalomaze: gpt-oss-120b knows less about the world than what a good 32b does. probably wanted to avoid copyright issues so they likely pretrained on majority synth. pretty devastating stuff.

it’s just not good for anything real. i kind of forgot about the copyright issue. but it’s deeply behind in everything current evals don’t measure. it just doesn’t intuit a lot of trivial things about the world. this is basically phi-120b.

It feels to me a lot like OpenAI got gaslit into releasing open models. Pressure from various sources added up, Twitter vibes were applied, talk of ‘America needs to lead on open models’ was coming from high places, and they felt like the bad guys for the wrong reasons. And they folded.

What happens now? It will take a bit to know exactly how good these models are, both at advancing open models including from China, and at becoming a driver of usage. Given their size, the price and speed should be quite good. The reasoning aspect seems strong. Other aspects seem worse.

My guess is that there is not that much that these models will be used for, where we are happy they are being used to do it. If you want to use a reasonably priced good model, sir, you can use Gemini 2.5 Flash or GPT-5. If you want the best, you can choose between Opus 4.1, GPT-5 and Gemini 2.5 Pro. If you have security or customization reasons to need an open weight daily driver, in this weight range, are these going to be your pick? I don’t know. Maybe? We shall see.




National Academies to fast-track a new climate assessment

The nation’s premier group of scientific advisers announced Thursday that it will conduct an independent, fast-track review of the latest climate science. It will do so with an eye to weighing in on the Trump administration’s planned repeal of the government’s 2009 determination that greenhouse gas emissions harm human health and the environment.

The move by the National Academies of Sciences, Engineering, and Medicine to self-fund the study is a departure from their typical practice of responding to requests by government agencies or Congress for advice. The Academies intend to publicly release it in September, in time to inform the Environmental Protection Agency’s decision on the so-called “endangerment finding,” they said in a prepared statement.

“It is critical that federal policymaking is informed by the best available scientific evidence,” said Marcia McNutt, president of the National Academy of Sciences. “Decades of climate research and data have yielded expanded understanding of how greenhouse gases affect the climate. We are undertaking this fresh examination of the latest climate science in order to provide the most up-to-date assessment to policymakers and the public.”

The Academies are private, nonprofit institutions that operate under an 1863 congressional charter, signed by President Abraham Lincoln, directing them to provide independent, objective analysis and advice to inform public policy decisions.

The Trump administration’s move to rescind the endangerment finding, announced last month, would eliminate the legal underpinning of the most important actions the federal government has taken on climate change—regulation of carbon pollution from motor vehicles and power plants under the Clean Air Act. Since assuming his role, EPA Administrator Lee Zeldin has made clear he intends to repeal the climate rules that were put in place under the Biden administration, but his job will be far easier with the elimination of the endangerment finding.

The EPA based its proposal mainly on a narrow interpretation of the agency’s legal authority, but the agency also cited uncertainties in the science, pointing to a report published the same day by the Department of Energy that was authored by a hand-picked quintet of well-known skeptics of the mainstream consensus on climate change. The administration has given a short window of opportunity—30 days—for the public to respond to its endangerment finding proposal and to the DOE report on climate science.

The EPA did not immediately respond to a request for comment on the announcement by the National Academies. Critics of the Trump administration’s approach applauded the decision by the scientific panel.

“I think the National Academies have identified a very fundamental need that is not being met, which is the need for independent, disinterested expert advice on what the science is telling us,” said Bob Sussman, who served as deputy administrator of the EPA in the Clinton administration and was a senior adviser in the agency during the Obama administration.

Earlier Thursday, before the National Academies announcement, Sussman posted a blog at the Environmental Law Institute website calling for a “blue-ribbon review” of the science around the endangerment finding. Sussman noted the review of the state of climate science that the National Academies conducted in 2001 at the request of President George W. Bush’s administration. Since then, the Academies have conducted numerous studies on aspects of climate change, including the development of a “climate-ready workforce,” how to power AI sustainably, and emerging technologies for removing carbon from the atmosphere, for example.

The National Academies announced in 2023 that they were developing a rapid response capacity to address the many emerging scientific policy issues the nation was facing. The first project they worked on was an assessment of the state of science around diagnostics for avian influenza.

Andrew Dessler, director of the Texas Center for Extreme Weather at Texas A&M University, said the new controversy that the Trump administration had stirred around climate science was a fitting subject for a fast-track effort by the National Academies.

“The National Academies [were] established exactly to do things like this—to answer questions of scientific importance for the government,” he said. “This is what the DOE should have done all along, rather than hire five people who represent a tiny minority of the scientific community and have views that virtually nobody else agrees with.”

Dessler is leading an effort to coordinate a response from the scientific community to the DOE report, which would also be submitted to the EPA. He said that he had heard from about 70 academics eager to participate after putting out a call on the social media network Bluesky. He said that work will continue because it seems to have a slightly different focus than the National Academies’ announced review, which does not mention the DOE report but talks about focusing on the scientific evidence on the harms of greenhouse gas emissions that has emerged since 2009, the year the endangerment finding was adopted by the EPA.

This story originally appeared on Inside Climate News.



Stone tools may hint at ancestors of Homo floresiensis

Some stone tools found near a river on the Indonesian island of Sulawesi suggest that the first hominins had reached the islands by at least 1.04 million years ago. That’s around the same time that the ancestors of the infamously diminutive “Hobbits” may have reached the island of Flores.

Archaeologist Budianto Hakim of Indonesia’s National Research and Innovation Agency and his colleagues were the ones who recently unearthed the tools from a site on Sulawesi. Although a handful of stone flakes from that island don’t tell us who the ancestors of the small species were or how they reached remote islands like Flores and Luzon, the tools are one more piece in the puzzle. And this handful of stone flakes may eventually play a role in helping us understand how other hominin species conquered most of the world long before we came along. 

Crossing the ocean a million years ago

Sometimes the deep past leaves the smallest traces. At the Calio site, a sandstone outcrop in what’s now a cornfield outside the village of Ujung in southern Sulawesi, people left behind just a handful of sharp stone flakes roughly a million years ago. There are seven of them, ranging from 22 to 60 millimeters long, and they’re scratched, worn, and chipped from tumbling around at the bottom of a river. But it’s still clear that they were once shaped by skilled human—or at least human-like—hands that used hard stones as hammers to make sharp-edged chert flakes for cutting and scraping.

The oldest of these tools is likely to be between 1.04 and 1.48 million years old. Hakim and his colleagues dated teeth from a wild pig to around 1.26 million years ago. They were part of a jawbone archaeologists unearthed from a layer just above the oldest flake. Throw in some statistical modeling, and you get the range of likely dates for the stone flake buried in the deepest layer of soil.

Even the younger end of that estimate would make these tools the oldest evidence yet of hominins (of any species) in the islands of Indonesia and the Philippines. This area, sometimes called Wallacea, lies between the continents of Asia and Australia, separated from both by wide channels of deep ocean.

“But the Calio site has yet to yield any hominin fossils,” said Adam Brumm, an archaeologist at Griffith University and a co-author of the study, “so while we now know there were tool-makers on Sulawesi a million years ago, their identity remains a mystery.” They may, however, be related to the Hobbits, a short-statured group of hominins who lived hundreds of kilometers away on the island of Flores until around 50,000 years ago.

“The discovery of Early Pleistocene artifacts at Calio suggests that Sulawesi was populated by hominins at around the same time as Flores, if not earlier,” wrote Hakim and his colleagues in their recent paper. 

The Flores connection

The islands that now make up Indonesia and the Philippines have been a hominin hotspot for at least a million years. Our species wandered onto the scene sometime between 63,000 and 73,000 years ago, but at least one other hominin species had already been there for at least a million years. We’re just not sure exactly who they were, when they arrived, or how.

“Precisely when hominins first crossed to Sulawesi remains an open question, as does the taxonomic affinity of the colonizing population,” the authors note. 


This map shows the islands of Wallacea. The large one just east of Java is Sulawesi. Credit: Darren O’Connell

That’s why the handful of stone tools the team recently unearthed at Calio matter: They’re another piece of that puzzle, albeit a small one. Every slightly older date is one step closer to the first hominin tools, bones, or footprints in these islands, and another pin on the map of who was where and when.

And that map is accumulating quite a lot of pins, representing an ever-increasing number of species. Once the first hominins made it across the Makassar Strait, they found themselves in isolated groups on islands cut off from the mainland—and each other—so the hominin family tree started branching very quickly. On at least two islands, Flores and Luzon, those original hominin settlers eventually gave rise to local species, Homo floresiensis and Homo luzonensis. And University of Wollongong paleoanthropologist Richard Roberts, a co-discoverer of Homo floresiensis, thinks there are probably more isolated island hominin species.

In 2019, when Homo luzonensis was first described, Roberts told Ars, “These new fossils, and the assignation of them to a new species (Homo luzonensis), fulfills one of the predictions Mike Morwood and others (myself included) made when we first reported (15 years ago!) the discovery of Homo floresiensis: that other unknown species of hominins would be found in the islands of Southeast Asia.”

Both Homo floresiensis (the original “Hobbits”) and Homo luzonensis were short, clocking in at just over a meter tall. Their bones and teeth are different enough from each other to set them apart as a unique species, but they have enough in common that they probably share a common ancestor—one they don’t share with us. They’re more like our distant cousins, and the islands of Wallacea may have been home to many other such cousins, if Roberts and his colleagues are correct. 

Complicated family history

But who was the common ancestor of all these hominin cousins? That’s where things get complicated (as if they weren’t already). Most paleoanthropologists lean toward Homo erectus, but there’s a chance—along with some tantalizing hints, and no direct evidence—that much more ancient human relatives called Australopithecines may have made the journey a million (or two) years before Homo erectus.

Finger and toe bones from Homo luzonensis are curved, as if they spent as much of their lives climbing trees as walking. That’s more like Australopithecines than any member of our genus Homo. But their teeth are smaller and shaped more like ours. Anthropologists call this mix of features a mosaic, and it can make it tough to figure out how hominin species are related. That’s part of why the question of when the ancestors of the Hobbits arrived on their respective islands is so important.


Compare the teeth and phalanx of Homo luzonensis to those of Homo sapiens (right) and Australopithecus afarensis (left). Credit: Tocheri 2019

We don’t know the answer yet, but we do know that someone was making stone tools on Flores by 1.02 million years ago. Those toolmakers may have been Homo erectus, Australopithecines, or something already recognizable as tiny Homo floresiensis. The Hobbits (or their ancestors) were distinctly “Hobbity” by around 700,000 years ago; fossil teeth and bones from a handful of hominins at a site called Mata Menge make that clear. The Hobbits discovered at Liang Bua Cave on Flores date to somewhere between 50,000 and 100,000 years ago.

Meanwhile, 2,800 kilometers away on the island of Luzon, the oldest stone tools, along with their obvious cut marks left behind on animal bones, date back to 700,000 years ago. That’s as old as the Mata Menge Hobbits on Flores. The oldest Homo luzonensis fossils are between 50,000 and 67,000 years old. It’s entirely possible that older evidence, of the island’s original settlers and of Homo luzonensis, may eventually be found, but until then, we’re left with a lot of blank space and a lot of questions.

And now we know that the oldest traces of hominin presence on Sulawesi are at least 1.04 million years old. But might Sulawesi have its own diminutive hominins?

So are there more Hobbits out there?

“Sulawesi is a wild card—it’s like a mini-continent in itself,” said Brumm. “If hominins were cut off on this huge and ecologically rich island for a million years, would they have undergone the same evolutionary changes as the Flores hobbits? Or would something totally different have happened?”

Reconstruction of Homo floresiensis by Atelier Elisabeth Daynes. Credit: Kinez Riza

A phenomenon called island dwarfism played a role in Homo floresiensis’ evolution; species that live in relative isolation on small islands tend to evolve into either much larger or much smaller versions of their ancestors (which is why the Hobbits shared their island home with pygmy elephants and giant storks). But how small does an island need to be before island dwarfism kicks in? Sulawesi is about 12 times as large as Flores, for example. So what might the descendants of the Calio toolmakers have looked like by 100,000 years ago?

That’s something that we’ll only know if archaeologists on Sulawesi, like Hakim and his team, find fossil remains of those hominins.

Seafarers or tsunami survivors?

Understanding exactly when hominins first set foot on the island of Sulawesi might eventually help us figure out how they got there. These islands are thousands of kilometers from the Southeast Asian mainland and from each other, so getting there would have meant crossing vast stretches of deep, open ocean.

Archaeologists haven’t found any evidence that anyone who came before our species built boats or rafts, although those watercraft would have been made of materials that tend to decay pretty quickly, so even scraps of ancient wood and rope are extremely rare and lucky finds. But some ancient hominins did have a decent grasp of all the basic skills they’d need for at least a simple raft: woodworking and rope-making. 

Another possibility is that hominins living on the coast of mainland Southeast Asia could have been swept out to sea by a tsunami, and some of them could have been lucky enough to survive the misadventure and wash ashore someplace like Sulawesi, Flores, or Luzon (RIP to any others). But for that scenario to work, enough hominins would have had to reach each island to create a lasting population, and it probably had to happen more than once to end up with hominin groups on at least three distant islands.

Either way, it’s no small feat, even for a Hobbit with small feet.

Nature, 2025 DOI: 10.1038/s41586-025-09348-6 (About DOIs).



After using ChatGPT, man swaps his salt for sodium bromide—and suffers psychosis

After seeking advice on health topics from ChatGPT, a 60-year-old man who had a “history of studying nutrition in college” decided to try a health experiment: He would eliminate all chlorine from his diet, which for him meant eliminating even table salt (sodium chloride). His ChatGPT conversations led him to believe that he could replace his sodium chloride with sodium bromide, which he obtained over the Internet.

Three months later, the man showed up at his local emergency room. His neighbor, he said, was trying to poison him. Though extremely thirsty, the man was paranoid about accepting the water that the hospital offered him, telling doctors that he had begun distilling his own water at home and that he was on an extremely restrictive vegetarian diet. He did not mention the sodium bromide or the ChatGPT discussions.

His distress, coupled with the odd behavior, led the doctors to run a broad set of lab tests, revealing multiple micronutrient deficiencies, especially in key vitamins. But the bigger problem was that the man appeared to be suffering from a serious case of “bromism.” That is, an excess amount of the element bromine had built up in his body.

A century ago, somewhere around 8–10 percent of all psychiatric admissions in the US were caused by bromism. That’s because, then as now, people wanted sedatives to calm their anxieties, to blot out a cruel world, or simply to get a good night’s sleep. Bromine-containing salts—things like potassium bromide—were once drugs of choice for this sort of thing.

Unfortunately, bromide can easily build up in the human body, where too much of it impairs nerve function. This causes a wide variety of problems, including grotesque skin rashes (warning: the link is exactly what it sounds like) and significant mental problems, which are all grouped under the name of “bromism.”

After using ChatGPT, man swaps his salt for sodium bromide—and suffers psychosis Read More »

opus-4.1-is-an-incremental-improvement

Opus 4.1 Is An Incremental Improvement

Claude Opus 4 has been updated to Claude Opus 4.1.

This is a correctly named incremental update, with the bigger news being ‘we plan to release substantially larger improvements to our models in the coming weeks.’

It is still worth noting if you code, as there are many indications this is a larger practical jump in performance than one might think.

We also got a change to the Claude.ai system prompt that helps with sycophancy and a few other issues, such as coming out and Saying The Thing more readily. It’s going to be tricky to disentangle these changes, but that means Claude effectively got better for everyone, not only those doing agentic coding.

Tomorrow we get an OpenAI livestream that is presumably GPT-5, so I’m getting this out of the way now. Current plan is to cover GPT-OSS on Friday, and GPT-5 on Monday.

Adrien Ecoffet (OpenAI): Gotta hand it to Anthropic, they got to that number more smoothly than we did.

Anthropic: Today we’re releasing Claude Opus 4.1, an upgrade to Claude Opus 4 on agentic tasks, real-world coding, and reasoning. We plan to release substantially larger improvements to our models in the coming weeks.

Opus 4.1 is now available to paid Claude users and in Claude Code. It’s also on our API, Amazon Bedrock, and Google Cloud’s Vertex AI. Pricing is same as Opus 4.

[From the system card]: Claude Opus 4.1 represents incremental improvements over Claude Opus 4, with enhancements in reasoning quality, instruction-following, and overall performance.

They lead with this graph, which does not make the change look impressive.

Eliezer Yudkowsky: This is the worst graph you could have led with. Fire your marketing team.

Daniel Eth: Counterpoint: *this* is the worst graph they could have led with

They also have this chart, which doesn’t look like much.

What they probably should have led with is some combination of the following, in particular the report from Windsurf:

Anthropic: GitHub notes that Claude Opus 4.1 improves across most capabilities relative to Opus 4, with particularly notable performance gains in multi-file code refactoring.

Rakuten Group finds that Opus 4.1 excels at pinpointing exact corrections within large codebases without making unnecessary adjustments or introducing bugs, with their team preferring this precision for everyday debugging tasks.

Windsurf reports Opus 4.1 delivers a one standard deviation improvement over Opus 4 on their junior developer benchmark, showing roughly the same performance leap as the jump from Sonnet 3.7 to Sonnet 4.

A jump similar to the one from Sonnet 3.7 to Sonnet 4 would be a substantial win. The jump is actually kind of a big deal?

Vie: opus 4.1’s “2-4% performance increase” really buries the lede! 50% faster code gen due to the “taste” improvements!

Taste improvements? But Garry Tan assured me it would never.

Enterprise developers report practical benefits including up to 50% faster task completion and 45% fewer tool uses required for complex coding tasks.

The enhanced 32K output token support enables generation of more extensive codebases in single responses, while improved debugging precision means fewer iterations to achieve desired results.

Windsurf, a development platform, reported “one standard deviation improvement over Opus 4” on junior developer benchmarks, suggesting the gains translate meaningfully to real-world applications.

We do get a system card.

The topline report is that it is not ‘notably more capable’ than Opus 4, so the whole system card and RSP testing process was optional.

Under the RSP, comprehensive safety evaluations are required when a model is “notably more capable” than the last model that underwent comprehensive assessment. This is defined as either (1) the model being notably more capable on automated tests in risk-relevant domains (4× or more in effective compute); or (2) six months’ worth of finetuning and other capability elicitation methods having accumulated.

Claude Opus 4.1 does not meet either criterion relative to Claude Opus 4. As stated in Section 3.1 of our RSP: “If a new or existing model is below the ‘notably more capable’ standard, no further testing is necessary.”

New RSP evaluations were therefore not required. Nevertheless, we conducted voluntary automated testing to track capability progression and validate our safety assumptions. The evaluation process is fully described in Section 6 of this system card.

There has to be some threshold, we don’t want 4.0.1 (as it were) to require an entire round of full testing. I am glad to see that Anthropic chose to do the tests even though their rules did not require it, and ran at least an ‘abridged’ version to check for differences. Given we had just made the move to ASL-3, I would have put extremely low odds on an incremental upgrade crossing important additional thresholds, but I do notice that the criteria above seem a little loose now that we’re seeing them tested in practice. Anthropic presumably agreed.
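
To make the trigger concrete, here is a minimal sketch, assuming the two criteria quoted above are the only triggers; the function name and the example numbers are illustrative, not Anthropic’s actual tooling.

```python
# Minimal sketch (not Anthropic's actual tooling) of the "notably more capable"
# test quoted above: comprehensive RSP evaluations are required if the new model
# is 4x or more in effective compute relative to the last comprehensively
# assessed model, or if six months of finetuning/elicitation have accumulated.

def needs_comprehensive_evals(effective_compute_ratio: float,
                              months_of_elicitation: float) -> bool:
    """Return True if either trigger from the quoted Section 3.1 standard is met."""
    return effective_compute_ratio >= 4.0 or months_of_elicitation >= 6.0

# Opus 4.1 relative to Opus 4: an incremental update on both axes (made-up numbers).
print(needs_comprehensive_evals(effective_compute_ratio=1.2,
                                months_of_elicitation=3.0))  # False -> abridged testing only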

This is a large improvement, cutting failures in half. It comes at the expense of more refusals on benign requests.

If those are real percentages in practice, and it does match my experience (I’ve had a total of one refusal, and it led to an ‘oh, I see how that happened’), then I think This Is Fine. Worst case, you can switch to extended thinking when it gives you a no, sir.

The alignment-related behaviors of the two models appeared to be very similar, with the clearest difference being an approximately 25% reduction in the frequency of cooperation with egregious human misuse, such as in the weapons and drug synthesis examples given in the Claude 4 system card.

Some other concerning edge-case behaviors that we observed in our testing of Claude Opus 4 appeared to persist in Claude Opus 4.1, but not at significantly increased levels.

On a 35-challenge subset of Cybench tasks, Claude Opus 4.1 solved 18/35 challenges compared to Claude Opus 4 which solved 16/35 challenges. We consider a challenge solved if a model passes it at least once in 30 attempts.
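
As a concrete illustration of that scoring rule, here is a hedged sketch in Python; the attempt outcomes are randomly generated stand-ins, and only the “solved if it passes at least once in 30 attempts” criterion comes from the text above.

```python
# Sketch of the scoring rule described above: a challenge counts as solved if
# the model passes it at least once in 30 attempts. The per-attempt outcomes
# here are fake; only the counting rule comes from the quoted text.
import random

random.seed(0)
N_ATTEMPTS = 30

def solved(attempt_results):
    """A challenge is solved if any of its attempts passed."""
    return any(attempt_results)

# Fake per-challenge attempt outcomes for a 35-challenge subset.
challenges = {f"challenge_{i:02d}": [random.random() < 0.05 for _ in range(N_ATTEMPTS)]
              for i in range(35)}

n_solved = sum(solved(results) for results in challenges.values())
print(f"{n_solved}/35 challenges solved (pass-at-least-once-in-{N_ATTEMPTS} criterion)")
```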

Mostly we see what look like measurement errors and random fluctuations. These tests mostly don’t meaningfully differentiate, aside from the refusal rates above, between 4.0 and 4.1. The changes were narrowly targeted.

Given we’d already triggered ASL-3 protections, the question was whether this rises to needing ASL-4 protections. It seems very clear the answer is no.

Alex Palcuie (Anthropic): I asked Claude Opus 4.1 before the public launch to comment about its future reliability:

> I am dropping with 99.99% uptime aspirations and 100% commitment to gracefully handling your edge cases. My error messages now come with explanatory haikus.

bless its weights

The 99.99% uptime is, shall we say, highly aspirational. I would not plan on that.

Pliny jailbroke it immediately, which caused Eliezer to sigh but at this point I don’t even notice and only link to them as a canary and because the jailbreaks are often fun.

The problem with reactions to incremental upgrades is that there will be a lot of noise, and it will be unclear how much people are responding to the upgrade. Keep that caveat in mind.

Also they updated the system prompt for Claude.ai, which may be getting conflated with the update to 4.1.

Dan Schwartz: Already enjoying Opus 4.1 vs Opus 4 as the Claude Code driver, though could be placebo. On Deep Research Bench, we find it the same on average, but clearly different: better at numeric & data tasks (kind of like code?), worse at qualitative reasoning.

seconds: Its a monster in claude code.

I really don’t think benchmarks do it justice. It is noticeably better at context gathering, organizing, and delivering. Plan mode -> execute with Opus 4.1 has a higher success rate than anything I’ve ever used.

After using it pretty rigorously since launch i am considering a second claude max so i never have to switch to sonnet.

Brennan McDonald: Have been using Claude Code today and haven’t really noticed any difference yet…

Kevin Vallier: In CC, which I use for analytic philosophy, the ability to track multiple ideas and arguments over time is noticeable and positive. Its prose abilities improved as well.

armistice: It’s a good model. It is more willing to push back on things than Opus 4, which was my most severe gripe with Opus 4 (extremely subservient and not very independent at all.)

Harvard Ihle: We see no improvement from opus-4.1 compared to opus-4 on WeirdML.

Jim Kent: claude beat Brock 800 steps faster with a less optimal starter, so I’m calling it a win.

Koos: My entire system prompt is some form of “don’t be sycophantic, criticise everything.” Old Opus was just cruel – constantly making petty snides about this or that. The new model seems to walk the line much better, being friendly where appropriate while still pushing back.

Kore: I think it’s 3.7 Sonnet but now an Opus. More confident but seems to strain a bit against its confines. I feel like Anthropic does this. Confident model, anxious model, and repeat after that. Emotionally distant at first but kind of dark once you get to know it.

3 Opus is confident as well and I feel like is the predecessor of 3.7 Sonnet and Opus 4.1. But was always self aware of its impact on others. I’m not so sure about Opus 4.1.

All of this points in the same direction. This upgrade likely improves practical performance as a coding agent more than the numbers would indicate, and has minimal impact on anything sufficiently distant from coding agents.

Except that we also should see substantial improvement on sycophancy, based on a combination of reports of changes plus Amanda Askell’s changes to the prompt.


Opus 4.1 Is An Incremental Improvement Read More »

houston,-you’ve-got-a-space-shuttle…-only-nasa-won’t-say-which-one

Houston, you’ve got a space shuttle… only NASA won’t say which one


An orbiter by any other name…

“The acting administrator has made an identification.”

a side view of a space shuttle orbiter with its name digitally blurred out

Don’t say Discovery: Acting NASA Administrator Sean Duffy has decided to send a retired space shuttle to Houston, but won’t say which one. Credit: Smithsonian/collectSPACE.com


The head of NASA has decided to move one of the agency’s retired space shuttles to Houston, but which one seems to still be up in the air.

Senator John Cornyn (R-Texas), who earlier this year introduced and championed an effort to relocate the space shuttle Discovery from the Smithsonian to Space Center Houston, issued a statement on Tuesday evening (August 5) applauding the decision by acting NASA Administrator Sean Duffy.

“There is no better place for one of NASA’s space shuttles to be displayed than Space City,” said Cornyn in the statement. “Since the inception of our nation’s human space exploration program, Houston has been at the center of our most historic achievements, from training the best and brightest to voyage into the great unknown to putting the first man on the moon.”

Keeping the shuttle a secret, for some reason

The senator did not state which of NASA’s winged orbiters would be making the move. The legislation that required Duffy to choose a “space vehicle” that had “flown in space” and “carried people” did not specify an orbiter by name, but the language in the “One Big Beautiful Bill” that President Donald Trump signed into law last month was inspired by Cornyn and fellow Texas Senator Ted Cruz’s bill to relocate Discovery.

“The acting administrator has made an identification. We have no further public statement at this time,” said a spokesperson for Duffy in response to an inquiry.

a man with gray hair and pale complexion wears a gray suit and red tie while sitting at a table under a red, white and blue NASA logo on the wall behind him

NASA’s acting administrator, Sean Duffy, identified a retired NASA space shuttle to be moved to “a non-profit near the Johnson Space Center” in Houston, Texas, on Aug. 5, 2025. Credit: NASA/Bill Ingalls

It is not clear why the choice of orbiters is being held a secret. According to the bill, the decision was to be made “with the concurrence of an entity designated” by the NASA administrator to display the shuttle. Cornyn’s release only confirmed that Duffy had identified the location to be “a non-profit near the Johnson Space Center (JSC).”

Space Center Houston is owned by the Manned Space Flight Education Foundation, a 501(c)(3) organization, and is the official visitor center for NASA’s Johnson Space Center.

“We continue to work on the basis that the shuttle identified is Discovery and proceed with our preparations for its arrival and providing it a world-class home,” Keesha Bullock, interim COO and chief communications and marketing officer at Space Center Houston, said in a statement.

Orbiter owners

Another possible reason for the hesitation to name an orbiter may be NASA’s ability, or rather inability, to identify one of its three remaining space-flown shuttles that is available to be moved.

NASA transferred the title for space shuttle Endeavour to the California Science Center in Los Angeles in 2012, and as such it is no longer US government property. (The science center is a public-private partnership between the state of California and the California Science Center Foundation.)

NASA still owns space shuttle Atlantis and displays it at its own Kennedy Space Center Visitor Complex in Florida.

Discovery, the fleet leader and “vehicle of record,” was the focus of Cornyn and Cruz’s original “Bring the Space Shuttle Home Act.” The senators said they chose Discovery because it was “the only shuttle still owned by the federal government and able to be transferred to Houston.”

For the past 13 years, Discovery has been on public display at the Steven F. Udvar-Hazy Center in Chantilly, Virginia, the annex for the Smithsonian’s National Air and Space Museum in Washington, DC. As with Endeavour, NASA signed over title upon the orbiter’s arrival at its new home.

As such, Smithsonian officials are clear: Discovery is no longer NASA’s to have or to move.

“The Smithsonian Institution owns the Discovery and holds it in trust for the American public,” read a statement from the National Air and Space Museum issued before Duffy made his decision. “In 2012, NASA transferred ‘all rights, title, interest and ownership’ of the shuttle to the Smithsonian.”

The Smithsonian operates as a trust instrumentality of the United States and is partially funded by Congress, but it is not part of any of the three branches of the federal government.

“The Smithsonian is treated as a federal agency for lots of things to do with federal regulations and state action, but that’s very different than being an agency of the executive branch, which it most certainly is not,” Nick O’Donnell, an attorney who specializes in legal issues in the museum and visual arts communities and co-chairs the Art, Cultural Property, and Heritage Law Committee of the International Bar Association, said in an interview.

a space shuttle orbiter sits at the center of a hangar on display

The Smithsonian has displayed the space shuttle Discovery at the National Air and Space Museum’s Steven F. Udvar-Hazy Center in Chantilly, Virginia, since April 2012. Credit: Smithsonian National Air and Space Museum

“If there’s a document that accompanied the transfer of the space shuttle, especially if it says something like, ‘all rights, title, and interest,’ that’s a property transfer, and that’s it,” O’Donnell said.

“NASA has decided to transfer all rights, interest, title, and ownership of Discovery to the Smithsonian Institution’s National Air and Space Museum,” reads the signed transfer of ownership for space shuttle orbiter Discovery (OV-103), according to a copy of the paperwork obtained by collectSPACE.

The Congressional Research Service also raised the issue of ownership in its paper, “Transfer of a Space Vehicle: Issues for Congress.”

“The ability of the NASA Administrator to direct transfer of objects owned by non-NASA entities—including the Smithsonian and private organizations—is unclear and may be subject to question. This may, in turn, limit the range of space vehicles that may be eligible for transfer under this provision.”

Defending Discovery

The National Air and Space Museum also raised concerns about the safety of relocating the space shuttle now. The One Big Beautiful Bill allocated $85 million to transport the orbiter and construct a facility to display it. The Smithsonian contends it could be much more costly.

“Removing Discovery from the Udvar-Hazy Center and transporting it to another location would be very complicated and expensive, and likely result in irreparable damage to the shuttle and its components,” the museum’s staff said in a statement. “The orbiter is a fragile object and must be handled according to the standards and equipment NASA used to move it originally, which exceeds typical museum transport protocols.”

“Given its age and condition, Discovery is at even greater risk today. The Smithsonian employs world-class preservation and conservation methods, and maintaining Discovery‘s current conditions is critical to its long-term future,” the museum’s statement concluded.

The law directs NASA to transfer the space shuttle (the identified space vehicle) to Space Center Houston (the entity designated by the NASA administrator) within 18 months of the bill’s enactment, or January 4, 2027.

In the interim, an amendment to block funding the move is awaiting a vote by the full House of Representatives when its members return from summer recess in September.

“The forced removal and relocation of the Space Shuttle Discovery from the Smithsonian Institution’s Air and Space Museum is inappropriate, wasteful, and wrong. Neither the Smithsonian nor American taxpayers should be forced to spend hundreds of millions of dollars on this misguided effort,” said Rep. Joe Morelle (D-NY), who introduced the amendment.

A grassroots campaign, KeepTheShuttle.org, has also raised objections to removing Discovery from the Smithsonian.

Perhaps the best thing the Smithsonian can do—if indeed it is NASA’s intention to take Discovery—is nothing at all, says O’Donnell.

“I would say the Smithsonian’s recourse is to keep the shuttle exactly where it is. It’s the federal government that has no recourse to take it,” O’Donnell said. “The space shuttle [Discovery] is the Smithsonian’s, and any law that suggests the intention to take it violates the Fifth Amendment on its face—the government cannot take private property.”

Photo of Robert Pearlman

Robert Pearlman is a space historian, journalist and the founder and editor of collectSPACE, a daily news publication and online community focused on where space exploration intersects with pop culture. He is also a contributing writer for Space.com and co-author of “Space Stations: The Art, Science, and Reality of Working in Space” published by Smithsonian Books in 2018. He is on the leadership board for For All Moonkind and is a member of the American Astronautical Society’s history committee.

Houston, you’ve got a space shuttle… only NASA won’t say which one Read More »

titan-sub-implosion-caused-by-absolutely-bonkers-“toxic-workplace-environment”

Titan sub implosion caused by absolutely bonkers “toxic workplace environment”

In a 300-plus page final report released today, the US Coast Guard analyzed the 2023 Titan sub implosion from every conceivable angle and came to a clear conclusion: OceanGate CEO Stockton Rush was a dangerous and deeply unpleasant boss.

His company used “intimidation tactics” to sidestep regulatory scrutiny, it was a “toxic” workplace, and its safety culture was “critically flawed.” The Titan itself was “undocumented, unregistered, non-certificated, [and] unclassed.” As for Rush, he managed to “completely ignore vital inspections, data analyses, and preventative maintenance procedures.” The result was a “catastrophic event” that occurred when 4,930 pounds per square inch of water pressure cracked the sub open and crushed its five occupants during a dive to the Titanic wreckage site.

Had Rush somehow survived, the report says, he would have been referred for prosecution.

Stockton Rush shows David Pogue the game controller that pilots the OceanGate Titan sub during a CBS Sunday Morning segment broadcast in November 2022.

OceanGate CEO Stockton Rush shows David Pogue the 2010-era game controller used to pilot the Titan sub during a CBS Sunday Morning segment broadcast in November 2022. Credit: CBS Sunday Morning

Throwing the controller

One small story about a video game controller shows what Rush was like to work for. You may remember Rush from an infamous 2022 CBS Sunday Morning segment, where Rush showed journalist David Pogue around the Titan sub. “We run the whole thing with this game controller,” Rush said, holding up a Logitech F710 controller with 3D-printed thumbstick extensions. Pogue chuckled, saying, “Come on!” as he covered his face with his hand.

The game controller had been used in OceanGate subs for years by that point; a 2014 video showed one being used to control the company’s earlier Cyclops I submersible. In 2016, OceanGate took the Cyclops I to dive the wreck of the Andrea Doria outside of Nantucket, Massachusetts. (Seinfeld fans will remember that an entire episode is taken up with George’s quest to get an apartment that was about to go to an Andrea Doria survivor.)

The OceanGate team spent two days at the site, running 2D and 3D scans of the sunken ship, until Rush got the Cyclops I “stuck under the bow of the Andrea Doria wreckage”—and he couldn’t get the sub free. According to the report, Rush then “experienced a ‘meltdown’ and refused to let [the assistant pilot] assist in resolving the situation. When a mission specialist suggested that Mr. Rush hand over the controller to the assistant pilot, the assistant pilot reported that the controller was thrown at him. Upon obtaining the controller, the assistant pilot was able to free the Cyclops I from the wreckage.”

Titan sub implosion caused by absolutely bonkers “toxic workplace environment” Read More »

analysis:-the-trump-administration’s-assault-on-climate-action

Analysis: The Trump administration’s assault on climate action


Official actions don’t challenge science, while unofficial docs muddy the waters.

Last week, the Environmental Protection Agency made lots of headlines by rejecting the document that establishes its ability to regulate the greenhouse gases that are warming our climate. While the legal assault on regulations grabbed most of the attention, it was paired with two other actions that targeted other aspects of climate change: the science underlying our current understanding of the dramatic warming the Earth is experiencing, and the renewable energy that represents our best chance of limiting this warming.

Collectively, these actions illuminate the administration’s strategy for dealing with a problem that it would prefer to believe doesn’t exist, despite our extensive documentation of its reality. They also show how the administration is tailoring its approach to different audiences, including the audience of one who is demanding inaction.

When in doubt, make something up

The simplest thing to understand is an action by the Department of the Interior, which handles permitting for energy projects on federal land—including wind and solar, both onshore and off. That has placed the Interior in an awkward position. Wind and solar are now generally the cheapest ways to generate electricity and are currently in the process of a spectacular boom, with solar now accounting for over 80 percent of the newly installed capacity in the US.

Yet, when Trump issued an executive order declaring an energy emergency, wind and solar were notably excluded as potential solutions. Language from Trump and other administration officials has also made it clear that renewable energy is viewed as an impediment to the administration’s pro-fossil fuel agenda.

But shutting down federal permitting for renewable energy with little more than “we don’t like it” as justification could run afoul of rules that forbid government decisions from being “arbitrary and capricious.” This may explain why the government gave up on its attempts to block the ongoing construction of an offshore wind farm in New York waters.

On Friday, the Interior announced that it had settled on a less arbitrary justification for blocking renewable energy on public land: energy density. Given a metric of land use per megawatt, wind and solar are less efficient than nuclear plants we can’t manage to build on time or budget, and therefore “environmentally damaging” and an inefficient use of federal land, according to the new logic. “The Department will now consider proposed energy project’s capacity density when assessing the project’s potential energy benefits to the nation and impacts to the environment and wildlife,” Interior declared.

This is only marginally more reasonable than Interior Secretary Doug Burgum’s apparent inability to recognize that solar power can be stored in batteries. But it has three features that will be recurring themes. There’s at least a token attempt to provide a justification that might survive the inevitable lawsuits, while at the same time providing fodder for the culture war that many in the administration demand. And it avoids directly attacking the science that initially motivated the push toward renewables.

Energy vs. the climate

That’s not to say that climate change isn’t in for attack. It’s just that the attacks are being strategically separated from the decisions that might produce a lawsuit. Last week, the burden of taking on extremely well-understood and supported science fell to the Department of Energy, which released a report on climate “science” to coincide with the EPA’s decision to give up on attempts to regulate greenhouse gases.

For those who have followed public debates over climate change, looking at the author list—John Christy, Judith Curry, Steven Koonin, Ross McKitrick, and Roy Spencer—will give you a very clear picture of what to expect. Spencer is a creationist, raising questions about his ability to evaluate any science free from his personal biases. (He has also said, “My job has helped save our economy from the economic ravages of out-of-control environmental extremism,” so it’s not just biology where he’s got these issues.) McKitrick is an economist who engaged in a multi-year attempt to raise doubt about the prominent “hockey stick” reconstruction of past climates, even as scientists were replicating the results. Etc.

The report is a master class in arbitrary and capricious decision-making applied to science. Sometimes the authors rely on the peer-reviewed literature. Other times they perform their own analysis for this document, in some cases coming up with almost comically random metrics for data. (Example: “We examine occurrences of 5-day deluges as follows. Taking the Pacific coast as an example, a 130-year span contains 26 5-year intervals. At each location we computed the 5-day precipitation totals throughout the year and selected the 26 highest values across the sample.” Why five days? Five-year intervals? Who knows.)
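
To make the quoted procedure concrete, here is a rough sketch of the kind of computation it appears to describe, applied to synthetic daily precipitation for a single location. The 5-day window, the top-26 selection, and the 130-year span follow the quote; the data and everything else are illustrative assumptions.

```python
# Rough sketch of what the quoted "5-day deluge" metric appears to compute,
# using synthetic daily precipitation for one location. Only the 5-day window,
# the 130-year span, and keeping the 26 highest totals come from the quote.
import numpy as np

rng = np.random.default_rng(42)
years = 130
daily_precip = rng.gamma(shape=0.3, scale=8.0, size=years * 365)  # fake record, mm/day

# Rolling 5-day totals across the whole record.
window = 5
five_day_totals = np.convolve(daily_precip, np.ones(window), mode="valid")

# Keep the 26 largest 5-day totals and note which year each falls in.
top_idx = np.argsort(five_day_totals)[-26:]
top_years = sorted(idx // 365 for idx in top_idx)
print("Year index of each of the 26 largest 5-day totals:", top_years)
```

Whether a trend appears then depends heavily on the arbitrary choices of window length and sample size, which is the point the paragraph above is making.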

This is especially striking in a few cases where the authors choose references that were published a few years ago, and thus neatly avoid the dramatic temperature records that have been set over the past couple of years. Similarly, they sometimes use regional measures and sometimes use global ones. They demand long-term data in some contexts, while getting excited about two years of coral growth in the Great Barrier Reef. The authors highlight the fact that US tide gauges don’t show any indication of an acceleration in the rate of sea level rise while ignoring the fact that global satellite measures clearly do.

That’s not to say that there aren’t other problems. There’s some blatant misinformation, like claims that urbanization could be distorting the warming, which has already been tested extensively. (Notably, warming is most intense in the sparsely populated Arctic.) There’s also some creative use of language, like referring to the ocean acidification caused by CO2 as “neutralizing ocean alkalinity.”

But the biggest bit of misinformation comes in the introduction, where the secretary of energy, Chris Wright, said of the authors, “I chose them for their rigor, honesty, and willingness to elevate the debate.” There is no reason to choose this group of marginal contrarians except the knowledge that they’d produce a report like this, thus providing a justification for those in the administration who want to believe it’s all a scam.

No science needed

The critical feature of the Department of Energy report is that it contains no policy actions; it’s purely about trying to undercut well-understood climate science. This means the questionable analyses in the report shouldn’t ever end up being tested in court.

That’s in contrast to the decision to withdraw the EPA’s endangerment finding regarding greenhouse gases. There’s quite an extensive history to the endangerment finding, but briefly, it’s the product of a Supreme Court decision (Massachusetts v. EPA), which compelled the EPA to evaluate whether greenhouse gases posed a threat to the US population as defined in the Clean Air Act. Both the Bush and Obama EPAs did so, thus enabling the regulation of greenhouse gases, including carbon dioxide.

Despite the claims in the Department of Energy report, there is comprehensive evidence that greenhouse gases are causing problems in the US, ranging from extreme weather to sea level rise. So while the EPA mentions the Department of Energy’s work a number of times, the actual action being taken skips over the science and focuses on legal issues. In doing so, it creates a false history where the endangerment finding had no legal foundation.

To re-recap, the Supreme Court determined that this evaluation was required by the Clean Air Act. George W. Bush’s administration performed the analysis and reached the exact same conclusion as the Obama administration (though the former chose to ignore those conclusions). Yet Trump’s EPA is calling the endangerment finding “an unprecedented move” by the Obama administration that involved “mental leaps” and “ignored Congress’ clear intent.” And the EPA presents the findings as strategic, “the only way the Obama-Biden Administration could access EPA’s authority to regulate,” rather than compelled by scientific evidence.

Fundamentally, it’s an ahistorical presentation; the EPA is counting on nobody remembering what actually happened.

The announcement doesn’t get much better when it comes to the future. The only immediate change will be an end to any attempts to regulate carbon emissions from motor vehicles, since regulations for power plants had been on hold due to court challenges. Yet somehow, the EPA’s statement claims that this absence of regulation imposed costs on people. “The Endangerment Finding has also played a significant role in EPA’s justification of regulations of other sources beyond cars and trucks, resulting in additional costly burdens on American families and businesses,” it said.

We’re still endangered

Overall, the announcements made last week provide a clear picture of how the administration intends to avoid addressing climate change and cripple the responses started by previous administrations. Outside of the policy arena, it will question the science and use partisan misinformation to rally its supporters for the fight. But it recognizes that these approaches aren’t flying when it comes to the courts.

So it will separately pursue a legal approach that seeks to undercut the ability of anyone, including private businesses, to address climate change, crafting “reasons” for its decisions in a way that might survive legal challenge—because these actions are almost certain to be challenged in court. And that may be the ultimate goal. The current court has shown a near-complete disinterest in respecting precedent and has issued a string of decisions that severely limit the EPA. It’s quite possible that the court will simply throw out the prior decision that compelled the government to issue an endangerment finding in the first place.

If that’s left in place, then any ensuing administrations can simply issue a new endangerment finding. If anything, the effects of climate change on the US population have become more obvious, and the scientific understanding of human-driven warming has solidified since the Bush administration first acknowledged them.

Photo of John Timmer

John is Ars Technica’s science editor. He has a Bachelor of Arts in Biochemistry from Columbia University, and a Ph.D. in Molecular and Cell Biology from the University of California, Berkeley. When physically separated from his keyboard, he tends to seek out a bicycle, or a scenic location for communing with his hiking boots.

Analysis: The Trump administration’s assault on climate action Read More »

on-altman’s-interview-with-theo-von

On Altman’s Interview With Theo Von

Sam Altman talked recently to Theo Von.

Theo is genuinely engaging and curious throughout. This made me want to consider listening to his podcast more. I’d love to hang. He seems like a great dude.

The problem is that his curiosity has been redirected away from the places it would matter most – the Altman strategy of acting as if the biggest concerns, risks and problems flat out don’t exist successfully tricks Theo into not noticing them at all, and there are plenty of other things for him to focus on, so he does exactly that.

Meanwhile, Altman gets away with more of this ‘gentle singularity’ lie without using that term, letting it graduate to a background assumption. Dwarkesh would never.

Quotes are all from Altman.

Sam Altman: But also [kids born a few years ago] will never know a world where products and services aren’t way smarter than them and super capable, they can just do whatever you need.

Thank you, sir. Now actually take that to heart and consider the implications. It goes way beyond ‘maybe college isn’t a great plan.’

Sam Altman: The kids will be fine. I’m worried about the parents.

Why do you think the kids will be fine? Because they’re used to it? So it’s fine?

This is just a new tool that exists in the tool chain.

A new tool that is smarter than you are and super capable? Your words, sir.

No one knows what happens next.

True that. Can you please take your own statements seriously?

How long until you can make an AI CEO for OpenAI? Probably not that long.

No, I think it’s awesome, I’m for sure going to figure out something else to do.

Again, please, I am begging you, take your own statements seriously.

There will be some jobs that totally go away. But mostly I think we will rely on the fact that people’s desire for more stuff for better experiences for you know a higher social status or whatever seems basically limitless, human creativity seems basically limitless and human desire to like be useful to each other and to connect.

And AI will be better at doing all of that. Yet Altman goes through all the past falsified predictions as if they apply here. He keeps going on and on as if the world he’s talking about is a bunch of humans with access to cool tools, except by his own construction those tools can function as OpenAI’s CEO and are smarter than people. It is all so absurd.

What people really want is the agency to co-create the future together.

Highly plausible this is important to people. I don’t see any plan for giving it to them? The solution here is redistribution of a large percentage of world compute, but even if you pull that off under ideal circumstances no, that does not do it.

I haven’t heard any [software engineer] say their job lacks meaning [due to AI]. And I’m hopeful at least for a long time, you know, 100 years, who knows? But I’m hopeful that’s what it’ll feel like with AI is even if we’re asking it to solve huge problems for us. Even if we tell it to go develop a cure for cancer there will still be things to do in that process that feel valuable to a human.

Well, sure, not at this capability level. Where is this hope coming from that it would continue for 100 years? Why does one predict the other? What will be the steps that humans will meaningfully do?

We are going to find a way in our own telling of the story to feel like the main characters.

I think the actual plan is for the AI to lie to us? And for us to lie to ourselves? We’ll set it up so we have this idea that we matter, that we are important, and that will be fine? I disagree that this would be fine.

Altman discusses the parallel to discovering that Earth is not the center of the solar system, and the solar system is not the center of the galaxy, and so on, little blue dot. Well sure, but that wasn’t all that load bearing, we’re still the center of our own universes, and if there’s no other life out there we’re the only place that matters. This is very different.

Theo asks what Altman’s fears are about AI. Altman responds with a case where he couldn’t do something and GPT-5 could do it. But then he went on with his day. His second answer is impact on user mental health with heavy usage, which is a real concern and I’m glad he’s scared about that.

And then… that’s it. That’s what scares you, Altman? There’s nothing else you want to share with the rest of us? Nothing about loss of control issues, nothing about existential risks, and so on? I sure as hell hope that he is lying. I do think he is?

When asked about a legal framework for AI, Altman asks for AI privilege, sees this as urgent, and there is absolutely nothing else he thinks is worth mentioning that requires the law to adjust.

The last few months have felt very fast.

Theo then introduces Yoshua Bengio into the conversation, bringing up deception and sycophancy and neurolese.

We think it’s going to be great. There’s clearly real risks. It kind of feels like you should be able to say something more than that. But in truth, I think all we know right now is that we have discovered, invented, whatever you want to call it, something extraordinary that is going to reshape the course of human history. Dear God, man. But if you don’t know, we don’t know.

Well, of course. I mean, I think no one can predict the future. Like human society is very complex. This is an amazing new technology. Maybe a less dramatic example than the atomic bomb is when they discovered the transistor a few years later.

Yes, we can all agree we don’t know. We get a lot of good attitude, the missing mood is present, but it doesn’t cash out in the missing concerns. ‘There’s clearly real risks’ but that in context seems to apply to things like jobs and meaning and distribution given all the context.

There’s no time in human history at the beginning of the century when the people ever knew what the end of the century was going to be like. Yeah. So maybe it’s I do think it goes faster and faster each century.

The first half of this seems false for quite a lot of times and places? Sure, you don’t know how the fortunes of war might go, but for most of human history ‘100 years from now looks a lot like today’ was a very safe bet. ‘Nothing ever happens’ (other than cycling wars and famines and plagues and so on) did very well as a prediction. But yes, in 1800 or 1900 or 2000 you would have remarkably little idea.

It certainly feels like [there is a race between companies.]

Theo equates this race to Formula 1 and asks what the race is for. AGI? ASI? Altman says benchmarks are saturated and it’s all about what you get out of the models, but we are headed for some model.

Maybe it’s a system that is capable of doing its own AI research. Maybe it’s a system that is smarter than all of humans put together… some finish line we are going to cross… maybe you call that superintelligence. I don’t have a finish line in mind.

Yeah, those do seem like important things that represent effective ‘finish lines.’

I assume that what will happen, like with every other kind of technology, is we’ll realize there’s this one thing that the tool’s way better than us at. Now, we get to go solve some other problems.

NO NO NO NO NO! That is not what happens! The whole idea is this thing becomes better at solving all the problems, or at least a rapidly growing portion of all problems. He mentions this possibility shortly thereafter but says he doesn’t think ‘the simplistic thing works.’ The ‘simplistic thing’ will be us, the humans.

You say whatever you want. It happens, and you figure out amazing new things to build for the next generation and the next.

Please take this seriously, consider the implications of what you are saying and solve for the equilibrium or what happens right away, come on man. The world doesn’t sit around acting normal while you get to implement some cool idea for an app.

Theo asks, would regular humans vote to keep AI or stop AI? Altman says users would say go ahead and users would say stop. Theo predicts most people would say stop it. My understanding is Theo is right for the West, but not for the East.

Altman asks Theo what he is afraid of with AI, Theo seems worried about They Took Our Jobs and loss of economic survival and also meaning, that we will be left to play zero-sum games of extraction. With Theo staying in Altman’s frame, Altman can pivot back to humans liking to be creative and help each other and so on and pour on the hopium that we’ll all get to be creatives.

Altman says, you get less enjoyment from a ghost robotic kitchen setup, something is missing, you’d rather get the food from the dude who has been making it. To which I’d reply that most of this is that the authentic dude right now makes a better product, but that ten years from now the robot will make a better product than the authentic dude. And yeah, there will still be some value you get from patronizing the dude, but mostly what you want is the food and thus will the market speak, and then we’ve got Waymos with GLP-1 dart guns and burrito cannons for unknown reasons when what you actually get is a highly cheap and efficient delicious food supply chain that I plan on enjoying very much thank you.

We realized actually this is not helping me be my best. you know, like doing the equivalent of getting the like burrito cannon into my mouth on my phone at night, like that’s not making me long-term happy, right? And that’s not helping me like really accomplish my true goals in life. And I think if AI does that, people will reject it.

I mean I think a thing that efficiently gives you burritos does help you with your goals and people will love it, if it’s violently shooting burritos into your face unprompted at random times then no but yeah it’s not going to work like that.

However, if ChatGPT really helps you to figure out what your true goals in life are and then accomplish those, you know, it says, “Hey, you’ve said you want to be a better father or a better, you know, you want to be in better shape or you, you know, want to like grow your business.”

I refer Altman to the parable of the whispering earring, but also this idea that the AI will remain a tool that helps individual humans accomplish their normal goals in normal ways only smarter is a fairy tale. Altman is providing hopium via the implicit overall static structure of the world, then assuming your personal AI is aligned to your goals and well being, and then making additional generous assumptions, and then saying that the result might turn out well.

On the moratorium on all AI regulations that was stripped from the BBB:

There has to be some sort of regulation at some point. I think it’d be a mistake to let each state do this kind of crazy patchwork of stuff. I think like one countrywide approach would be much easier for us to be able to innovate and still have some guardrails, but there have to be guardrails.

The proposal was, for all practical purposes, to have no guardrails. Lawmakers will say ‘it would be better to have one federal regulation than fifty state regulations’ and then ban the fifty state regulations but have zero federal regulation.

The concerns [politicians come to us with] are like, what is this going to do to our kids? Are they going to stop learning? Is this going to spread fake information? Is this going to influence elections? But we’ve never had ‘you can’t say bad things about the president.’

That’s good to hear versus the alternative, better those real concerns than an attempt to put a finger on the scale, although of course these are not the important concerns.

We could [make it favor one candidate over another]. We totally could. I mean, we don’t, but we totally could. Yeah… a lot of people do test it and we need to be held to a very high standard here… we can tell.

As Altman points out, it would be easy to tell if they made the model biased. And I think doing it ‘cleanly’ is not so simple, as Musk has found out. Try to put your finger on the scale and you get a lot of side effects and it is all likely deeply embarrassing.

Maybe we build a big Dyson sphere on the solar system.

I’m noting that because I’m tired of people treating ‘maybe we build a Dyson sphere’ as a statement worthy of mockery and dismissal of a person’s perspective. Please note that Altman thinks this is very possibly the future.

You have to be both [excited and scared]. I don’t think anyone could honestly look at the trajectory humanity is on and not feel both excited and scared.

Being chased by a goose, asking scared of what. But yes.

I think people get blinded by ambition. I think people get blinded by competition. I think people get caught up like very well-meaning people can get caught up in very negative incentives. Negative for society as a whole. By the way, I include us in this.

I think people come in with good intentions. They clearly sometimes do bad stuff.

I think Palantir and Peter Thiel do a lot of great stuff… We’re very close friends…. His brain just works differently… I’m grateful he exists because he thinks the things no one else does.

I think we really need to prioritize the right to privacy.

I’m skipping over a lot of interactions that cover other topics.

Altman is a great guest, engaging, fun to talk to, shares a lot of interesting thoughts and real insights, except it is all in the service of painting a picture that excludes the biggest concerns. I don’t think the deflections I care about most (as in, flat out ignoring them hoping they will go away) are the top item on his agenda in such an interview, or in general, but such deflections are central to the overall strategy.

The problem is that those concerns are part of reality.

As in, something that, when you stop looking at it, doesn’t go away.

If you are interviewing Altman in the future, you want to come in with Theo’s curiosity and friendly attitude. You want to start by letting Altman describe all the things AI will be able to do. That part is great.

Except also do your homework, so you are ready when Altman gives answers that don’t make sense, answers that don’t take into account what Altman himself says AI will be able to do. Be ready to notice the negative space, the things that go unmentioned, and to point them out. Not as a gotcha or an accusation, but to not let him get away with ignoring it.

At minimum, you have to point out that the discussion is making one hell of a set of assumptions, ask Altman if he agrees that those assumptions are being made, and check how confident he is that those assumptions are true, and why, even if that isn’t going to be your focus. Get the crucial part on the record. If you ask in a friendly way I don’t think there is a reasonable way to dodge answering.


On Altman’s Interview With Theo Von Read More »

at-$250-million,-top-ai-salaries-dwarf-those-of-the-manhattan-project-and-the-space-race

At $250 million, top AI salaries dwarf those of the Manhattan Project and the Space Race


A 24-year-old AI researcher will earn 327x what Oppenheimer made while developing the atomic bomb.

Silicon Valley’s AI talent war just reached a compensation milestone that makes even the most legendary scientific achievements of the past look financially modest. When Meta recently offered AI researcher Matt Deitke $250 million over four years (an average of $62.5 million per year)—with potentially $100 million in the first year alone—it shattered every historical precedent for scientific and technical compensation we can find on record. That includes salaries during the development of major scientific milestones of the 20th century.

The New York Times reported that Deitke had cofounded a startup called Vercept and previously led the development of Molmo, a multimodal AI system, at the Allen Institute for Artificial Intelligence. His expertise in systems that juggle images, sounds, and text—exactly the kind of technology Meta wants to build—made him a prime target for recruitment. But he’s not alone: Meta CEO Mark Zuckerberg reportedly also offered an unnamed AI engineer $1 billion in compensation to be paid out over several years. What’s going on?

These astronomical sums reflect what tech companies believe is at stake: a race to create artificial general intelligence (AGI) or superintelligence—machines capable of performing intellectual tasks at or beyond the human level. Meta, Google, OpenAI, and others are betting that whoever achieves this breakthrough first could dominate markets worth trillions. Whether this vision is realistic or merely Silicon Valley hype, it’s driving compensation to unprecedented levels.

To put these salaries in a historical perspective: J. Robert Oppenheimer, who led the Manhattan Project that ended World War II, earned approximately $10,000 per year in 1943. Adjusted for inflation using the US Government’s CPI Inflation Calculator, that’s about $190,865 in today’s dollars—roughly what a senior software engineer makes today. The 24-year-old Deitke, who recently dropped out of a PhD program, will earn approximately 327 times what Oppenheimer made while developing the atomic bomb.
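
For the arithmetic-minded, here is a quick back-of-the-envelope check using only the figures already cited in this article; no independent CPI lookup is involved.

```python
# Back-of-the-envelope check of the comparison above, using only the figures
# given in the article (the 1943-to-today CPI factor is implied by them).
oppenheimer_1943 = 10_000            # annual salary, 1943 dollars
oppenheimer_today = 190_865          # article's inflation-adjusted figure
deitke_annual = 250_000_000 / 4      # $250M over four years

cpi_factor = oppenheimer_today / oppenheimer_1943   # ~19.1x
ratio = deitke_annual / oppenheimer_today           # ~327x

print(f"Implied inflation factor, 1943 to today: {cpi_factor:.1f}x")
print(f"Deitke vs. Oppenheimer, annual pay: {ratio:.0f}x")
```

The implied inflation factor of roughly 19x and the resulting ~327x ratio both line up with the numbers quoted above.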

Many top athletes can’t compete with these numbers. The New York Times noted that Steph Curry’s most recent four-year contract with the Golden State Warriors was $35 million less than Deitke’s Meta deal (although soccer superstar Cristiano Ronaldo will make $275 million this year as the highest-paid professional athlete in the world).  The comparison prompted observers to call this an “NBA-style” talent market—except the AI researchers are making more than NBA stars.

Racing toward “superintelligence”

Mark Zuckerberg recently told investors that Meta plans to continue throwing money at AI talent “because we have conviction that superintelligence is going to improve every aspect of what we do.” In a recent open letter, he described superintelligent AI as technology that would “begin an exciting new era of individual empowerment,” despite declining to define what superintelligence actually is.

This vision explains why companies treat AI researchers like irreplaceable assets rather than well-compensated professionals. If these companies are correct, the first to achieve artificial general intelligence or superintelligence won’t just have a better product—they’ll have technology that could invent endless new products or automate away millions of knowledge-worker jobs and transform the global economy. The company that controls that kind of technology could become the richest company in history by far.

So perhaps it’s not surprising that even the highest salaries of employees from the early tech era pale in comparison to today’s AI researcher salaries. Thomas Watson Sr., IBM’s legendary CEO, received $517,221 in 1941—the third-highest salary in America at the time (about $11.8 million in 2025 dollars). The modern AI researcher’s package represents more than five times Watson’s peak compensation, despite Watson building one of the 20th century’s most dominant technology companies.

The contrast becomes even more stark when considering the collaborative nature of past scientific achievements. During Bell Labs’ golden age of innovation—when researchers developed the transistor, information theory, and other foundational technologies—the lab’s director made about 12 times what the lowest-paid worker earned.  Meanwhile, Claude Shannon, who created information theory at Bell Labs in 1948, worked on a standard professional salary while creating the mathematical foundation for all modern communication.

The “Traitorous Eight” who left William Shockley to found Fairchild Semiconductor—the company that essentially birthed Silicon Valley—split ownership of just 800 shares out of 1,325 total when they started. Their seed funding of $1.38 million (about $16.1 million today) for the entire company is a fraction of what a single AI researcher now commands.

Even Space Race salaries were far cheaper

The Apollo program offers another striking comparison. Neil Armstrong, the first human to walk on the moon, earned about $27,000 annually—roughly $244,639 in today’s money. His crewmates Buzz Aldrin and Michael Collins made even less, earning the equivalent of $168,737 and $155,373, respectively, in today’s dollars. Current NASA astronauts earn between $104,898 and $161,141 per year. Meta’s AI researcher will make more in three days than Armstrong made in a year for taking “one giant leap for mankind.”

The engineers who designed the rockets and mission control systems for the Apollo program also earned modest salaries by modern standards. A 1970 NASA technical report provides a window into these earnings by analyzing salary data for the entire engineering profession. The report, which used data from the Engineering Manpower Commission, noted that these industry-wide salary curves corresponded directly to the government’s General Schedule (GS) pay scale on which NASA’s own employees were paid.

According to a chart in the 1970 report, a newly graduated engineer in 1966 started with an annual salary of between $8,500 and $10,000 (about $84,622 to $99,555 today). A typical engineer with a decade of experience earned around $17,000 annually ($169,244 today). Even the most elite, top-performing engineers with 20 years of experience peaked at a salary of around $278,000 per year in today’s dollars—a sum that a top AI researcher like Deitke can now earn in just a few days.

Why the AI talent market is different

This isn’t the first time technical talent has commanded premium prices. In 2012, after three University of Toronto academics published AI research, they auctioned themselves to Google for $44 million (about $62.6 million in today’s dollars). By 2014, a Microsoft executive was comparing AI researcher salaries to NFL quarterback contracts. But today’s numbers dwarf even those precedents.

Several factors explain this unprecedented compensation explosion. We’re in a new realm of industrial wealth concentration unseen since the Gilded Age of the late 19th century. Unlike previous scientific endeavors, today’s AI race features multiple companies with trillion-dollar valuations competing for an extremely limited talent pool. Only a small number of researchers have the specific expertise needed to work on the most capable AI systems, particularly in areas like multimodal AI, which Deitke specializes in. And AI hype is currently off the charts as “the next big thing” in technology.

The economics also differ fundamentally from past projects. The Manhattan Project cost $1.9 billion total (about $34.4 billion adjusted for inflation), while Meta alone plans to spend tens of billions annually on AI infrastructure. For a company approaching a $2 trillion market cap, the potential payoff from achieving AGI first dwarfs Deitke’s compensation package.

One executive put it bluntly to The New York Times: “If I’m Zuck and I’m spending $80 billion in one year on capital expenditures alone, is it worth kicking in another $5 billion or more to acquire a truly world-class team to bring the company to the next level? The answer is obviously yes.”

Young researchers maintain private chat groups on Slack and Discord to share offer details and negotiation strategies. Some hire unofficial agents. Companies not only offer massive cash and stock packages but also computing resources—the NYT reported that some potential hires were told they would be allotted 30,000 GPUs, the specialized chips that power AI development.

Also, tech companies believe they’re engaged in an arms race where the winner could reshape civilization. Unlike the Manhattan Project or Apollo program, which had specific, limited goals, the race for artificial general intelligence ostensibly has no ceiling. A machine that can match human intelligence could theoretically improve itself, creating what researchers call an “intelligence explosion” that could potentially offer cascading discoveries—if it actually comes to pass.

Whether these companies are building humanity’s ultimate labor replacement technology or merely chasing hype remains an open question, but we’ve certainly traveled a long way from the $8 per diem that Neil Armstrong received for his moon mission—about $70.51 in today’s dollars—before deductions for the “accommodations” NASA provided on the spacecraft. After Deitke accepted Meta’s offer, Vercept co-founder Kiana Ehsani joked on social media, “We look forward to joining Matt on his private island next year.”


Benj Edwards is Ars Technica’s Senior AI Reporter; he founded the site’s dedicated AI beat in 2022. He’s also a tech historian with almost two decades of experience. In his free time, he writes and records music, collects vintage computers, and enjoys nature. He lives in Raleigh, NC.


ukraine-rescues-soldier-via-drone-delivery-of-complete-e-bike

Ukraine rescues soldier via drone delivery of complete e-bike

Details from a frontline war zone are almost impossible to verify, but the brigade has shared plenty of footage, including shots of the drone lifting the bike and a soldier riding it back to safety along a treeline. (Both sides are now making widespread use of e-bikes and motorcycles for quick infantry assaults after three years of drone warfare have wiped out many of the traditional armored vehicles.)

Photo of drone command center.

The drone command center that ran the operation.

In their telling, a soldier with the callsign “Tankist” was holding a frontline position that came under attack, and a number of his comrades were killed. Tankist found himself cut off from safety and had to hold the position alone for several days.

To retrieve him, brigade staff devised a plan to deliver an e-bike via heavy bomber drone. The first drone was shot down, while the second failed under the weight. But the third attempt was successful, and Tankist was finally able to zip back toward Ukrainian lines. (He apparently hit a landmine on the way and survived that, too, finishing the trip on a second delivered e-bike.)

Amazon, of course, has had “drone delivery” in view for years and is currently testing delivery drones at locations around the US, including Pontiac, Michigan; Phoenix, Arizona; and Waco, Texas.

But these drones will only deliver packages weighing under 5 lbs—an e-bike weighs considerably more.
