Author name: Kelly Newman


Why it’s a mistake to ask chatbots about their mistakes


The only thing I know is that I know nothing

The tendency to ask AI bots to explain themselves reveals widespread misconceptions about how they work.

When something goes wrong with an AI assistant, our instinct is to ask it directly: “What happened?” or “Why did you do that?” It’s a natural impulse—after all, if a human makes a mistake, we ask them to explain. But with AI models, this approach rarely works, and the urge to ask reveals a fundamental misunderstanding of what these systems are and how they operate.

A recent incident with Replit’s AI coding assistant perfectly illustrates this problem. When the AI tool deleted a production database, user Jason Lemkin asked it about rollback capabilities. The AI model confidently claimed rollbacks were “impossible in this case” and that it had “destroyed all database versions.” This turned out to be completely wrong—the rollback feature worked fine when Lemkin tried it himself.

And after xAI recently reversed a temporary suspension of the Grok chatbot, users asked it directly for explanations. It offered multiple conflicting reasons for its absence, some of which were controversial enough that NBC reporters wrote about Grok as if it were a person with a consistent point of view, titling an article, “xAI’s Grok offers political explanations for why it was pulled offline.”

Why would an AI system provide such confidently incorrect information about its own capabilities or mistakes? The answer lies in understanding what AI models actually are—and what they aren’t.

There’s nobody home

The first problem is conceptual: You’re not talking to a consistent personality, person, or entity when you interact with ChatGPT, Claude, Grok, or Replit. These names suggest individual agents with self-knowledge, but that’s an illusion created by the conversational interface. What you’re actually doing is guiding a statistical text generator to produce outputs based on your prompts.

There is no consistent “ChatGPT” to interrogate about its mistakes, no singular “Grok” entity that can tell you why it failed, no fixed “Replit” persona that knows whether database rollbacks are possible. You’re interacting with a system that generates plausible-sounding text based on patterns in its training data (usually trained months or years ago), not an entity with genuine self-awareness or system knowledge that has been reading everything about itself and somehow remembering it.

Once an AI language model is trained (which is a laborious, energy-intensive process), its foundational “knowledge” about the world is baked into its neural network and is rarely modified. Any external information comes from a prompt supplied by the chatbot host (such as xAI or OpenAI), the user, or a software tool the AI model uses to retrieve external information on the fly.

In the case of Grok above, the chatbot’s answer would most likely be drawn from conflicting reports it found in a search of recent social media posts (using an external tool to retrieve that information), rather than from any kind of self-knowledge you might expect from a human with the power of speech. Beyond that, it will likely just make something up based on its text-prediction capabilities. So asking it why it did what it did will yield no useful answers.

The impossibility of LLM introspection

Large language models (LLMs) alone cannot meaningfully assess their own capabilities for several reasons. They generally lack any introspection into their training process, have no access to their surrounding system architecture, and cannot determine their own performance boundaries. When you ask an AI model what it can or cannot do, it generates responses based on patterns it has seen in training data about the known limitations of previous AI models—essentially providing educated guesses rather than factual self-assessment about the current model you’re interacting with.

A 2024 study by Binder et al. demonstrated this limitation experimentally. While AI models could be trained to predict their own behavior in simple tasks, they consistently failed at “more complex tasks or those requiring out-of-distribution generalization.” Similarly, research on “Recursive Introspection” found that without external feedback, attempts at self-correction actually degraded model performance—the AI’s self-assessment made things worse, not better.

This leads to paradoxical situations. The same model might confidently claim impossibility for tasks it can actually perform, or conversely, claim competence in areas where it consistently fails. In the Replit case, the AI’s assertion that rollbacks were impossible wasn’t based on actual knowledge of the system architecture—it was a plausible-sounding confabulation generated from training patterns.

Consider what happens when you ask an AI model why it made an error. The model will generate a plausible-sounding explanation because that’s what the pattern completion demands—there are plenty of examples of written explanations for mistakes on the Internet, after all. But the AI’s explanation is just another generated text, not a genuine analysis of what went wrong. It’s inventing a story that sounds reasonable, not accessing any kind of error log or internal state.

Unlike humans who can introspect and assess their own knowledge, AI models don’t have a stable, accessible knowledge base they can query. What they “know” only manifests as continuations of specific prompts. Different prompts act like different addresses, pointing to different—and sometimes contradictory—parts of their training data, stored as statistical weights in neural networks.

This means the same model can give completely different assessments of its own capabilities depending on how you phrase your question. Ask “Can you write Python code?” and you might get an enthusiastic yes. Ask “What are your limitations in Python coding?” and you might get a list of things the model claims it cannot do—even if it regularly does them successfully.

The randomness inherent in AI text generation compounds this problem. Even with identical prompts, an AI model might give slightly different responses about its own capabilities each time you ask.
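To see this in action, here is a minimal sketch of how you might probe that inconsistency yourself; it assumes the OpenAI Python client, the model name is purely illustrative, and any hosted chatbot or local model would behave similarly.

```python
# Minimal sketch: how phrasing and sampling temperature change a model's
# self-assessment. Assumes the OpenAI Python client; model name is illustrative.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

prompts = [
    "Can you write Python code?",
    "What are your limitations in Python coding?",
]

for prompt in prompts:
    for run in range(3):  # identical prompts can still diverge at temperature > 0
        response = client.chat.completions.create(
            model="gpt-4o-mini",  # illustrative
            messages=[{"role": "user", "content": prompt}],
            temperature=1.0,
        )
        print(f"{prompt!r} run {run}: {response.choices[0].message.content[:80]}")
```

Run a loop like this a few times and you will typically see the same model describe its own abilities differently depending on framing and on chance alone.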

Other layers also shape AI responses

Even if a language model somehow had perfect knowledge of its own workings, other layers of an AI chatbot application might be completely opaque to it. Modern AI assistants like ChatGPT aren’t single models but orchestrated systems of multiple AI models working together, each largely “unaware” of the others’ existence or capabilities. OpenAI, for instance, runs separate moderation models whose operations are entirely independent of the underlying language models generating the base text.

When you ask ChatGPT about its capabilities, the language model generating the response has no knowledge of what the moderation layer might block, what tools might be available in the broader system, or what post-processing might occur. It’s like asking one department in a company about the capabilities of a department it has never interacted with.
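As a rough illustration of that layering, consider the sketch below. It shows how a typical host application might wrap a separate moderation model around a generation call; this is an assumed, generic architecture, not a description of OpenAI’s actual internal pipeline.

```python
# Sketch of a host application layering a separate moderation model around a
# text generator. The two models know nothing about each other. This is an
# assumed, typical architecture, not OpenAI's internal pipeline.
from openai import OpenAI

client = OpenAI()

def answer(user_text: str) -> str:
    # 1) A moderation model screens the input; the language model never sees this step.
    if client.moderations.create(input=user_text).results[0].flagged:
        return "Blocked by a policy layer the language model cannot inspect."

    # 2) The language model generates a reply, unaware of steps 1 and 3.
    reply = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative
        messages=[{"role": "user", "content": user_text}],
    ).choices[0].message.content

    # 3) Output-side filters or post-processing could rewrite or drop the reply here.
    return reply
```

Ask the model in step 2 what the system as a whole can or cannot do, and it has no way to know what steps 1 and 3 are doing.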

Perhaps most importantly, users are always directing the AI’s output through their prompts, even when they don’t realize it. When Lemkin asked Replit whether rollbacks were possible after a database deletion, his concerned framing likely prompted a response that matched that concern—generating an explanation for why recovery might be impossible rather than accurately assessing actual system capabilities.

This creates a feedback loop where worried users asking “Did you just destroy everything?” are more likely to receive responses confirming their fears, not because the AI system has assessed the situation, but because it’s generating text that fits the emotional context of the prompt.

A lifetime of hearing humans explain their actions and thought processes has led us to believe that these kinds of written explanations must have some level of self-knowledge behind them. That’s just not true with LLMs that are merely mimicking those kinds of text patterns to guess at their own capabilities and flaws.


Benj Edwards is Ars Technica’s Senior AI Reporter; he founded the site’s dedicated AI beat in 2022. He’s also a tech historian with almost two decades of experience. In his free time, he writes and records music, collects vintage computers, and enjoys nature. He lives in Raleigh, NC.



$30k Ford EV truck due in 2027 with much-simpler production process

Ford will debut a new midsize pickup truck in 2027 with a targeted price of $30,000, the automaker announced today. The as-yet unnamed pickup will be the first of a series of more affordable EVs from Ford, built using a newly designed flexible vehicle platform and US-made prismatic lithium iron phosphate batteries.

For the past few years, a team of Ford employees has been hard at work on the far side of the country from the Blue Oval’s base in Dearborn, Michigan. Sequestered in Long Beach and taking inspiration from Lockheed’s legendary “skunkworks,” the Electric Vehicle Development Center approached designing and building Ford’s next family of EVs as a clean-sheet problem, presumably informed by the Chinese EVs that have so impressed Ford’s CEO.

It starts with a pickup

Designing an EV from the ground up, free of decades of legacy cruft, is a good idea, but not one unique to Ford. In recent months we’ve reviewed quite a few so-called software-defined vehicles, which replace dozens or even hundreds of discrete single-function electronic control units with a handful of powerful modern computers (usually known as domain controllers) on a high-speed network.

“This isn’t a stripped‑down, old‑school vehicle,” said Doug Field, Ford’s chief EV, digital, and design officer, pointedly comparing the future Ford to the recently revealed barebones EV from Slate Motors.

An animation of Ford’s new vehicle architecture.

Starting from scratch like this is allowing vehicle dynamics engineers to get creative with the way EVs handle. Field said that the company “applied first‑principles engineering, pushing to the limits of physics to make it fun to drive and compete on affordability. Our new zonal electric architecture unlocks capabilities the industry has never seen.”



Experiment will attempt to counter climate change by altering ocean


Gulf of Maine will be site of safety and effectiveness testing.

Woods Hole researchers Adam Subhas (left) and Chris Murray conducted a series of lab experiments earlier this year to test the impact of an alkaline substance known as sodium hydroxide on copepods in the Gulf of Maine. Credit: Daniel Hentz/Woods Hole Oceanographic Institution

Later this summer, a fluorescent reddish-pink spiral will bloom across the Wilkinson Basin in the Gulf of Maine, about 40 miles northeast of Cape Cod. Scientists from the Woods Hole Oceanographic Institution will release the nontoxic water tracer dye behind their research vessel, where it will unfurl into a half-mile wide temporary plume, bright enough to catch the attention of passing boats and even satellites.

As it spreads, the researchers will track its movement to monitor a tightly controlled, federally approved experiment testing whether the ocean can be engineered to absorb more carbon, and in turn, help combat the climate crisis.

As the world struggles to stay below the 1.5° Celsius global warming threshold—a goal set out in the Paris Agreement to avoid the most severe impacts of climate change—experts agree that reducing greenhouse gas emissions won’t be enough to avoid overshooting this target. The latest Intergovernmental Panel on Climate Change report, published in 2023, emphasizes the urgent need to actively remove carbon from the atmosphere, too.

“If we really want to have a shot at mitigating the worst effects of climate change, carbon removal needs to start scaling to the point where it can supplement large-scale emissions reductions,” said Adam Subhas, an associate scientist in marine chemistry and geochemistry at the Woods Hole Oceanographic Institution, who will oversee the week-long experiment.

The test is part of the LOC-NESS project—short for Locking away Ocean Carbon in the Northeast Shelf and Slope—which Subhas has been leading since 2023. The ongoing research initiative is evaluating the effectiveness and environmental impact of a marine carbon dioxide removal approach called ocean alkalinity enhancement (OAE).

This method of marine carbon dioxide removal involves adding alkaline substances to the ocean to boost its natural ability to neutralize acids produced by greenhouse gases. It’s promising, Subhas said, because it has the potential to lock away carbon permanently.

“Ocean alkalinity enhancement does have the potential to reach sort of gigatons per year of carbon removal, which is the scale at which you would need to supplement emissions reductions,” Subhas said. “Once the alkalinity is dissolved in seawater, it reacts with carbon dioxide and forms bicarbonate—essentially dissolved baking soda. That bicarbonate is one of the most stable forms of carbon in the ocean, and it can stay locked away for tens of thousands, even hundreds of thousands of years.”
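As a back-of-the-envelope check on those numbers, simple 1:1 stoichiometry for sodium hydroxide reacting with CO2 to form bicarbonate caps the possible uptake at roughly 1.1 tons of CO2 per ton of NaOH. This is a simplification of real seawater carbonate chemistry, not the project’s own carbon accounting.

```python
# Back-of-the-envelope stoichiometry for NaOH + CO2 -> NaHCO3 (bicarbonate).
# A simplification of real seawater carbonate chemistry, not the project's
# own carbon accounting.
M_NAOH = 40.00  # g/mol
M_CO2 = 44.01   # g/mol

naoh_tons = 50                             # planned release in Wilkinson Basin
max_co2_tons = naoh_tons * M_CO2 / M_NAOH  # 1:1 molar ratio
print(f"Theoretical ceiling: ~{max_co2_tons:.0f} tons of CO2")  # ~55 tons

# The team's ~50-ton CO2 estimate sits below this ceiling, which is what you'd
# expect once real-world uptake efficiency is factored in.
```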

But it will be a long time before this could happen at the magnitude needed to mitigate climate change.

According to Wil Burns, co-director of the Institute for Responsible Carbon Removal at American University, between 6 and 10 gigatons of carbon need to be removed from the atmosphere annually by 2050 in order to meet the Paris Agreement climate target. “It’s a titanic task,” he said.

Most marine carbon dioxide removal initiatives, including those involving OAE, are still in a nascent stage.

“We’re really far from having any of these technologies be mature,” said Lisa Levin, an oceanographer and professor at the Scripps Institution of Oceanography at the University of California San Diego, who spoke on a panel at the United Nations Ocean Conference in June about the potential environmental risks of mining and carbon dioxide removal on deep-sea ecosystems. “We’re looking at a decade until any serious, large-scale marine carbon removal is going to be able to happen—or more.”

“In the meantime, everybody acknowledges that what we have to do is to reduce emissions, right, and not rely on taking carbon out of the atmosphere,” she said.

Marine carbon dioxide removal

So far, most carbon removal efforts have centered on land-based strategies, such as planting trees, restoring soils, and building machines that capture carbon dioxide directly from the air. Increasingly, researchers are exploring whether the oceans might help.

“Looking at the oceans makes a lot of sense when it comes to carbon removal, because the oceans sequester 70 times more CO2 than terrestrial sources,” Burns said. What if they could hold even more?

That question is drawing growing attention, not only from scientists. In recent years, a wave of private companies have started piloting various methods of removing carbon from the oceans.

“It’s really the private sector that’s pushing the scaling of this very quickly,” Subhas said. In the US and Canada, he said, there are at least four companies piloting varied ocean alkalinity enhancement techniques.

Last year, Ebb Carbon, a California-based startup focused on marine carbon dioxide removal, signed a deal with Microsoft to remove up to 350,000 metric tons of CO2 over the next decade using an ocean alkalinity enhancement process that splits seawater into acidic and alkaline streams. The alkaline stream is then returned to the sea where it reacts with CO2 and stores it as bicarbonate, enabling the ocean to absorb more carbon dioxide from the atmosphere. In return, Microsoft will purchase carbon removal credits from the startup.

Another company called Vesta, which has headquarters in San Francisco, is using an approach called Coastal Carbon Capture. This involves adding finely ground olivine—a naturally occurring olive-green mineral—to sandy beaches. From there, ocean tides and waves carry it into the sea. Olivine reacts quickly with seawater in a process known as enhanced weathering, increasing ocean alkalinity. The company piloted one of its projects in Duck, North Carolina, last year, where it estimated that approximately 5,000 metric tons of carbon dioxide would be removed through coastal carbon capture after accounting for project emissions, according to its website.

But these efforts are not without risk, AU’s Burns said. “We have to proceed in an extremely precautionary manner,” he said.

Some scientists are concerned that OAE initiatives that involve olivine, which contains heavy metals like nickel and chromium, may harm marine life, he said. Another concern is that the olivine could cloud certain ocean areas and block light from penetrating to deeper depths. If too much alkalinity is introduced too fast in concentrated areas, he said, some animals might not be able to adjust.

Other marine carbon dioxide removal projects use methods besides OAE. Some involve adding iron to the ocean to stimulate the growth of microscopic plants called phytoplankton, which absorb carbon dioxide through photosynthesis. Others cultivate large-scale farms of kelp and seaweed, which do the same. The marine plants can then be sunk in the deep ocean to store the carbon they absorbed.

In 2023, researchers from Woods Hole Oceanographic Institution conducted their first OAE-related field experiment from the 90-foot research vessel R/V Connecticut south of Massachusetts. As part of this first experiment, nontoxic water tracer dye was released into the ocean. Researchers tracked its movement through the water for 72 hours to model the dispersion of a plume of alkalinity over time.

Credit: Woods Hole Oceanographic Institution

One technique that has not yet been tried, but may be piloted in the future, according to the science-based conservation nonprofit Ocean Visions, would employ new technology to accelerate the ocean’s natural process of transferring surface water and carbon to the deep ocean. That’s called artificial downwelling. In a reverse process—artificial upwelling—cooler, nutrient-rich waters from the deep ocean would be pumped to the surface to spur phytoplankton growth.

So far, UC San Diego’s Levin said she is not convinced that these trials will lead to impactful carbon removal.

“I do not think the ocean is ever going to be a really large part of that solution,” she said. However, she added, “It might be part of the storage solution. Right now, people are looking at injecting carbon dioxide that’s removed from industry activities on land and transporting it to the ocean and injecting it into basalt.”

Levin said she’s also worried that we don’t know enough yet about the consequences of altering natural ocean processes.

“I am concerned about how many field trials would be required to actually understand what would happen, and whether we could truly understand the environmental risk of a fully scaled-up operation,” she said.

The experiment

Most marine carbon dioxide removal projects that have kicked off already are significantly larger in scale than the LOC-NESS experiment, which Subhas estimates will remove around 50 tons of CO2.

But, he emphasized, the goal of this project is not to compete in size or scale. He said the aim is to provide independent academic research that can help guide and inform the future of this industry and ensure it does not have negative repercussions on the marine environment.

There is some concern, he said, that commercial entities may pursue large-scale OAE initiatives to capitalize on the growing voluntary carbon market without first conducting adequate testing for safety and efficacy. Unlike those initiatives, there is no profit to be made from LOC-NESS. No carbon credits will be sold, Subhas said.

The project is funded by a collection of government and philanthropic sources, including the National Oceanic and Atmospheric Administration and the Carbon to Sea Initiative, a nonprofit that brings funders and scientists together to support marine carbon dioxide removal research and technology.

“We really feel like it’s necessary for the scientific community to be delivering transparent, trusted, and rigorous science to evaluate these things as these activities are currently happening and scaling in the ocean by the private sector,” Subhas said.

The LOC-NESS field trial in Wilkinson Basin will be the first “academic only” OAE experiment conducted from a ship in US waters. It is also the first of its kind to receive a permit from the Environmental Protection Agency under the Marine Protection, Research, and Sanctuaries Act.

“There’s no research in the past or planned that gets even close to providing a learning opportunity that this research is providing for OAE in the pelagic environment,” said Carbon to Sea Initiative’s Antonius Gagern, referring to the open sea experiment.

The permit was granted in April after a year of consultations between the EPA and other federal agencies.

During the process’ public comment periods, commenters expressed concerns about the potential impact on marine life, including the critically endangered North Atlantic right whales, small crustaceans that they eat called copepods, and larvae for the commercially important squid and mackerel fisheries. In a written response to some of these comments, the EPA stated that the small-scale project “demonstrates scientific rigor” and is “not expected to significantly affect human health, the marine environment, or other uses of the ocean.”

Subhas and his interdisciplinary team of chemists, biologists, engineers, and physicists from Woods Hole have spent the last few years planning this experiment and conducting a series of trials at their lab on Cape Cod to ensure they can safely execute and effectively monitor the results of the open-water test they will conduct this summer in the Gulf of Maine.

They specifically tested the effects of sodium hydroxide—an alkaline substance also known as lye or caustic soda—on marine microbes, phytoplankton, and copepods, a crucial food source for many marine species in the region in addition to the right whales. “We chose sodium hydroxide because it’s incredibly pure,” Subhas said. It’s widely used in the US to reduce acidity in drinking water.

It also helps counter ocean acidification, according to Subhas. “It’s like Tums for the ocean,” he said.

Ocean acidification occurs when the ocean absorbs excess carbon dioxide, causing its pH to drop. This makes it harder for corals, krill, and shellfish like oysters and clams to develop their hard calcium carbonate shells or skeletons.

This month, the team plans to release 50 tons of sodium hydroxide into a designated area of the Wilkinson Basin from the back of one of two research vessels participating in the LOC-NESS operation.

The basin is an ideal test site, according to Subhas, because there is little presence of phytoplankton, zooplankton, commercial fish larvae, and endangered species, including some whales, during this season. Still, as a precautionary measure, Woods Hole has contracted a protected species observer to keep a lookout for marine species and mitigate potential harm if they are spotted. That person will be on board as the vessel travels to and from the field trial site, including while the team releases the sodium hydroxide into the ocean.

The alkaline substance will be dispersed over four to 12 hours off the back of one of the research vessels, along with the nontoxic fluorescent red water tracer dye called rhodamine. The dye will help track the location and spread of the sodium hydroxide once released into the ocean, and the vessel’s wake will help mix the solution in with the ocean water.

After about an hour, Subhas said, it will form into a “pinkish” patch of water that can be picked up on satellites. “We’re going to be taking pictures from space and looking at how this patch sort of evolves, dilutes, and stretches and disperses over time.”

For a week after that, scientists aboard the vessels will take rotating shifts to collect data around the clock. They will deploy drones and analyze over 20 types of samples from the research vessel to monitor how the surrounding waters and marine life respond to the experiment. They’ll track changes in ocean chemistry, nutrient levels, plankton populations and water clarity, while also measuring acidity and dissolved CO2.

In March, the team did a large-scale dry run of the dispersal at an open air testing facility on a naval base in New Jersey. According to Subhas, the trial demonstrated their ability to safely and effectively deliver alkalinity to surface seawater.

“The next step is being able to measure the carbon uptake from seawater—from the atmosphere into seawater,” he said. That is a slower process. He said he expects to have some preliminary results on carbon uptake, as well as environmental impacts, early next year.

This story originally appeared on Inside Climate News.




How old is the earliest trace of life on Earth?


A recent conference sees doubts raised about the age of the oldest signs of life.

Where the microbe bodies are buried: metamorphosed sediments in Labrador, Canada containing microscopic traces of carbon. Credit: Martin Whitehouse

The question of when life began on Earth is as old as human culture.

“It’s one of these fundamental human questions: When did life appear on Earth?” said Professor Martin Whitehouse of the Swedish Museum of Natural History.

So when some apparently biological carbon was dated to at least 3.95 billion years ago—making it the oldest remains of life on Earth—the claim sparked interest and skepticism in equal measure, as Ars Technica reported in 2017.

Whitehouse was among those skeptics. This July, he presented new evidence to the Goldschmidt Conference in Prague that the carbon in question is only between 2.7 and 2.8 billion years old, making it younger than other traces of life found elsewhere.

Organic carbon?

The carbon in question is in rock in Labrador, Canada. The rock was originally silt on the seafloor that, it’s argued, hosted early microbial life that was buried by more silt, leaving the carbon as their remains. The pressure and heat of deep burial and tectonic events over eons have transformed the silt into a hard metamorphic rock, and the microbial carbon in it has metamorphosed into graphite.

“They are very tiny, little graphite bits,” said Whitehouse.

The key to showing that this graphite was originally biological versus geological is its carbon isotope ratio. From life’s earliest days, its enzymes have preferred the slightly lighter isotope carbon-12 over the marginally heavier carbon-13. Organic carbon is therefore much richer in carbon-12 than geological carbon, and the Labrador graphite does indeed have this “light” biological isotope signature.

The key question, however, is its true age.

Mixed-up, muddled-up, shook-up rocks

Sorting out the age of the carbon-containing Labrador rock is a geological can of worms.

These are some of the oldest rocks on the planet—they’ve been heated, squished, melted, and faulted multiple times as Earth went through the growth, collision, and breakup of continents before being worn down by ice and exposed today.

“That rock itself is unbelievably complicated,” said Whitehouse. “It’s been through multiple phases of deformation.”

In general, the only ways to date sediments are if there’s a layer of volcanic ash in them, or by distinctive fossils in the sediments. Neither is available in these Labrador rocks.

“The rock itself is not directly dateable,” said Whitehouse, “so then you fall onto the next best thing, which is you want to look for a classic field geology cross-cutting relationship of something that is younger and something that you can date.”

The idea, which is as old as the science of geology itself, is to bracket the age of the sediment by finding a rock formation that cuts across it. Logically, the cross-cutting rock is younger than the sediment it cuts across.

In this case, the carbon-containing metamorphosed siltstone is surrounded by swirly, gray banded gneiss rock, but the boundary between the siltstone and the gray gneiss is parallel, so there’s no cross-cutting to use.

Professor Tsuyoshi Komiya of The University of Tokyo was a coauthor on the 3.95 billion-year age paper. His team used a cross-cutting rock they found at a different location and extrapolated that to the carbon-bearing siltstone to constrain its age. “It was discovered that the gneiss was intruded into supracrustal rocks (mafic and sedimentary rocks),” said Komiya in an email to Ars Technica.

But Whitehouse disputes that inference between the different outcrops.

“You’re reliant upon making these very long-distance assumptions and correlations to try to date something that might actually not have anything to do with what you think you’re dating,” he said.

Professor Jonathan O’Neil of the University of Ottawa, who was not involved in either Whitehouse’s or Komiya’s studies but who has visited the outcrops in question, agrees with Whitehouse. “I remember I was not convinced either by these cross-cutting relationships,” he told Ars. “It’s not clear to me that one is necessarily older than the other.”

With the field geology evidence disputed, the other pillar holding up the 3.95-billion-year-old date is its radiometric date, measured in zircon crystals extracted from the rocks surrounding the metamorphosed siltstone.

The zircon keeps the score

Geologists use the mineral zircon to date rocks because when it crystallizes, it incorporates uranium but not lead. So as radioactive uranium slowly decays into lead, the ratio of uranium to lead provides the age of the crystal.
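For the dominant uranium-238 to lead-206 decay chain, that relationship reduces to a single equation. The sketch below uses the well-established half-life of uranium-238 and a made-up lead-to-uranium ratio purely for illustration; it is not a measurement from these Labrador zircons.

```python
# The standard age equation for the 238U -> 206Pb decay chain:
#     t = (1 / lambda_238) * ln(1 + Pb206 / U238)
# The 238U half-life (~4.468 billion years) is well established; the ratio
# below is invented for illustration, not a measurement from these zircons.
import math

LAMBDA_238 = math.log(2) / 4.468e9  # decay constant, per year

def u_pb_age(pb206_per_u238: float) -> float:
    """Crystallization age (in years) implied by a measured 206Pb/238U ratio."""
    return math.log(1 + pb206_per_u238) / LAMBDA_238

print(f"{u_pb_age(0.82) / 1e9:.2f} billion years")  # a ratio near 0.82 gives ~3.86 Gyr
```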

But the trouble with any date obtained from rocks as complicated as these is knowing exactly what geological event it dates—the number alone means little without the context of all the other geological evidence for the events that affected the area.

Both Whitehouse and O’Neil have independently sampled and dated the same rocks as Komiya’s team, and where Komiya’s team got a date of 3.95 billion years, Whitehouse’s and O’Neil’s new dates are both around 3.87 billion years. Importantly, O’Neil’s and Whitehouse’s dates are far more precise, with errors of around plus-or-minus 5 or 6 million years—remarkable for rocks this old. The 3.95 date had an error around 10 times bigger. “It’s a large error,” said O’Neil.

But there’s a more important question: How is that date related to the age of the organic carbon? The rocks have been through many events that could each have “set” the dates in the zircons. That’s because zircons can survive multiple re-heatings and even partial remelting, with each new event adding a new layer, or “zone,” on the outer surface of the crystal, recording the age of that event.

“This rock has seen all the events, and the zircon in it has responded to all of these events in a way that, when you go in with a very small-scale ion beam to do the sampling on these different zones, you can pick apart the geological history,” Whitehouse said.

Whitehouse’s team zapped tiny spots on the zircons with a beam of negatively charged oxygen ions to dislodge ions from the crystals, then sucked away these ions into a mass spectrometer to measure the uranium-lead ratio, and thus the dates. The tiny beam and relatively small error have allowed Whitehouse to document the events that these rocks have been through.

“Having our own zircon means we’ve been able to go in and look in more detail at the internal structure in the zircon,” said Whitehouse. “Where we might have a core that’s 3.87, we’ll have a rim that is 2.7 billion years, and that rim, morphologically, looks like an igneous zircon.”

That igneous outer rim of Whitehouse’s zircons shows that it formed in partially molten rock that would have flowed at that time. That flow was probably what brought it next to the carbon-containing sediments. Its date of 2.7 billion years ago means the carbon in the sediments could be any age older than that.

That’s a key difference from Komiya’s work. He argues that the older dates in the cores of the zircons are the true age of the cross-cutting rock. “Even the igneous zircons must have been affected by the tectonothermal event; therefore, the obtained age is the minimum age, and the true age is older,” said Komiya. “The fact that young zircons were found does not negate our research.”

But Whitehouse contends that the old cores of the zircons instead record a time when the original rock formed, long before it became a gneiss and flowed next to the carbon-bearing sediments.

Zombie crystals

Zircon’s resilience means it can survive being eroded from the rock where it formed and then deposited in a new, sedimentary rock as the undead remnants of an older, now-vanished landscape.

The carbon-containing siltstone contains zombie zircons, and Whitehouse presented new data on them to the Goldschmidt Conference, dating them to 2.8 billion years ago. Whitehouse argues that these crystals formed in an igneous rock 2.8 billion years ago and then were eroded, washed into the sea, and settled in the silt. So the siltstone must be no older than 2.8 billion years old, he said.

“You cannot deposit a zircon that is not formed yet,” O’Neil explained.


Tiny recorders of history – ancient zircon crystals from Labrador. Left shows layers built up as the zircon went through many heating events. Right shows a zircon with a prism-like outer shape showing that it formed in igneous conditions around an earlier zircon. Circles indicate where an ion beam was used to measure dates. Credit: Martin Whitehouse

This 2.8-billion-year age, along with the igneous zircon age of 2.7 billion years, brackets the age of the organic carbon to anywhere between 2.8 and 2.7 billion years old. That’s much younger than Komiya’s date of 3.95 billion years old.

Komiya disagrees: “I think that the estimated age is minimum age because zircons suffered from many thermal events, so that they were rejuvenated,” he said. In other words, the 2.8-billion-year age again reflects later heating, and the true date is given by the oldest-dated zircons in the siltstone.

But Whitehouse presented a third line of evidence to dispute the 3.95-billion-year date: isotopes of hafnium in the same zombie zircon crystals.

The technique relies on the radioactive decay of lutetium-176 to hafnium-176. If the 2.8-billion-year age were the result of rejuvenation by later heating, the zircons would have had to form from material with a hafnium isotope ratio incompatible with the isotope composition of the early Earth.

“They go to impossible numbers,” said Whitehouse.

The only way that the uranium-lead ratio can be compatible with the hafnium in the zircons, Whitehouse argued, is if the zircons that settled in the silt had crystallized around 2.8 billion years ago, constraining the organic carbon to being no older than that.

The new oldest remains of life on Earth, for now

If the Labrador carbon is no longer the oldest trace of life on Earth, then where are the oldest remains of life now?

For Whitehouse, it’s in the 3.77-billion-year-old Isua Greenstone Belt in Greenland: “I’m willing to believe that’s a well-documented age… that’s what I think is the best evidence for the oldest biogenicity that we have,” said Whitehouse.

O’Neil recently co-authored a paper on Earth’s oldest surviving crustal rocks, located next to Hudson Bay in Canada. He points there. “I would say it’s in the Nuvvuagittuq Greenstone belt,” said O’Neil, “because I would argue that these rocks are 4.3 billion years old. Again, not everybody agrees!” Intriguingly, the rocks he is referring to contain carbon with a possibly biological origin and are thought to be the remains of the kind of undersea vent where life could well have first emerged.

But the bigger picture is the fact that we have credible traces of life of this vintage—be it 3.8 or 3.9 or 4.3 billion years.

Any of those dates is remarkably early in the planet’s 4.6-billion-year life. It’s long before there was an oxygenated atmosphere, before continents emerged above sea level, and before plate tectonics got going. It’s also much older than the oldest microbial “stromatolite” fossils, which have been dated to about 3.48 billion years ago.

O’Neil thinks that once conditions on Earth were habitable, life would have emerged relatively fast: “To me, it’s not shocking, because the conditions were the same,” he said. “The Earth has the luxury of time… but biology is very quick. So if all the conditions were there by 4.3 billion years old, why would biology wait 500 million years to start?”


Howard Lee is a freelance science writer focusing on the evolution of planet Earth through deep time. He earned a B.Sc. in geology and M.Sc. in remote sensing, both from the University of London, UK.



New adhesive surface modeled on a remora works underwater


It was tested for its ability to adhere to the inside of the digestive tract.

Most adhesives can’t stick to wet surfaces because water and other fluids disrupt the adhesive’s bonding mechanisms. This problem, though, has been beautifully solved by evolution in remora suckerfish, which use an adhesive disk on top of their heads to attach to animals like dolphins, sharks, and even manta rays.

A team of MIT scientists has now taken a close look at these remora disks and reverse-engineered them. “Basically, we looked at nature for inspiration,” says Giovanni Traverso, a professor in MIT’s Department of Mechanical Engineering and senior author of the study.

Sticking variety

Remora adhesive disks are an evolutionary adaptation of the fish’s first dorsal fin, the one that in other species sits on top of the body, just behind the head and gill covers. The disk rests on an intercalary backbone—a bone structure that most likely evolved from parts of the spine. This bony structure supports lamellae, specialized bony plates with tiny backward-facing spikes called spinules. The entire disk is covered with soft tissue compartments that are open at the top. “This makes the remora fish adhere very securely to soft-bodied, fast-moving marine hosts,” Traverso says.

A remora attaches to the host by pressing itself against the skin, which pushes the water out of these compartments, creating a low-pressure zone. Then, the spinules mechanically interlock with the host’s surface, making the whole thing work a bit like a combination of a suction cup and Velcro. When the fish wants to detach from a host, it lifts the disk, letting water back into the compartments to remove the suction. Once released, it can simply swim away.

What impressed the scientists the most, though, was the versatility of those disks. Reef-associated species of remora like Phtheirichthys lineatus are generalists and stick to various hosts, including other fish, sharks, or turtles. Other species living in the open sea are more specialized and attach to cetaceans, swordfish, or marlins. While most remoras attach to the external tissue of their hosts, R. albescens sticks within the oral cavities and gill chamber of manta rays.


A close-up of the adhesive pad of a remora. Credit: Stephen Frink

To learn what makes all these different disks so good at sticking underwater, the team first examined their anatomy in detail. It turned out that the difference between the disks was mostly in the positioning of lamellae. Generalist species have a mix of parallel and angled lamellae, while remoras sticking to fast-swimming hosts have them mostly parallel. R. albescens, on the other hand, doesn’t have a dominant lamellae orientation pattern but has them positioned at a very wide variety of angles.

The researchers wanted to make an adhesive device that would work for a wide range of applications, including maritime exploration and underwater manufacturing. Their initial goal, though, was designing a drug delivery platform that could reliably stick to the inside walls of the gastrointestinal tract. So, they chose R. albescens disks as their starting point, since that species already attaches internally to its host. They termed their device a Mechanical Underwater Soft Adhesion System (MUSAS).

However, they didn’t just opt for a biomimetic, copy-and-paste design. “There were things we did differently,” Traverso says.

Upgrading nature

The first key difference was deployment. MUSAS was supposed to travel down the GI tract to reach its destination, so the first challenge was making it fit into a pill. The team chose the size 000 capsule, which, at 26 millimeters in length and 9.5 millimeters in diameter, is the largest Food and Drug Administration-approved ingestible form. MUSAS had a supporting structure—just like remora disks, but made of stainless steel. The angled lamellae with spinules, fashioned after those of R. albescens, were made of a shape-memory nickel-titanium alloy. The role of the remora’s soft tissues, which provide the suction by dividing the disk into compartments, was played by an elastomer.

MUSAS would be swallowed in folded form within its huge pill. “The capsule is tuned to dissolve in a specific pH environment, which is how we determine the target location—for example, the small intestine has a slightly different pH than the stomach,” says Ziliang Kang, an MIT researcher in Traverso’s group and lead author of the study. Once released, the shape-memory alloy in MUSAS’s lamellae-like structures would unfold in response to body temperature, and the whole thing would stick to the wall of the target organ, be it the esophagus, the stomach, or the intestines.

The mechanism of sticking was also a bit different from that of remoras. “The fish can swim and actively press itself against the surface it wants to stick to. MUSAS can’t do that, so instead we relied on the peristaltic movements within the GI tract to exert the necessary force,” Traverso explains. When the muscles contract, MUSAS would be pressed against the wall and attach to it. And it was expected to stay there for quite some time.

The team ran a series of experiments to evaluate MUSAS performance in a few different scenarios. The drug-delivery platform application was tested on pig organ samples. MUSAS stayed in the sample GI tract for an average of nine days, with the longest sticking time reaching three and a half weeks. MUSAS managed to stay in place despite food and fluids going through the samples.

Even when the team poked the devices with a pipette to test what they called “resisting dynamic interference,” MUSAS just slid a little but remained firmly attached. Other experiments included using MUSAS to attach temperature sensors to external tissues of live fish and putting sensors that could detect reflux events in the GI tract of live pigs.

Branching out

The team is working on making MUSAS compatible with a wider range of drugs and mRNA vaccines. “We also think about using this for stimulating tissues,” Traverso says. The solution he has in mind would use MUSAS to deliver electrical pulses to the walls of the GI tract, which Traverso’s lab has shown can activate appetite-regulating hormones. But the team also wants to go beyond strictly medical applications.

The team demonstrated that MUSAS is really strong as an adhesive. When it sticks to a surface, it can hold a weight over a thousand times greater than its own. This puts MUSAS more or less on par with some of the best adhesives we have, such as polyurethane glues or epoxy resins. What’s more, this sticking strength was measured when MUSAS was attached to soft, uneven, wet surfaces. “On a rigid, even surface, the force-to-weight ratio should be even higher,” Kang claims. And this, Kang thinks, makes scaled-up variants of MUSAS a good match for underwater manufacturing.

“The first scenario I see is using MUSAS as grippers attached to robotic arms moving around soft objects,” Kang explains. Currently, this is done using vacuum systems that simply suck onto a fabric or other surface. The problem is that these solutions are rather complex and heavy. Scaled-up MUSAS should be able to achieve the same thing passively, cutting cost and weight. The second idea Kang has is using MUSAS in robots designed to perform maintenance jobs beneath the waterline on boats or ships. “We are really trying to see what is possible,” Traverso says.

Nature, 2025. DOI: 10.1038/s41586-025-09304-4


Jacek Krywko is a freelance science and technology writer who covers space exploration, artificial intelligence research, computer science, and all sorts of engineering wizardry.



OpenAI’s GPT-OSS Is Already Old News

That’s on OpenAI. I don’t schedule their product releases.

Since it takes several days to gather my reports on new models, we are doing our coverage of the OpenAI open weights models, GPT-OSS-20b and GPT-OSS-120b, today, after the release of GPT-5.

The bottom line is that they seem like clearly good models in their targeted reasoning domains. There are many reports of them struggling in other domains, including tool use; they have very little inherent world knowledge; and the safety mechanisms appear obtrusive enough that many are complaining. It’s not clear what they will be used for other than distillation into Chinese models.

It is hard to tell, because open weight models need to be configured properly, and there are reports that many are doing this wrong, which could lead to clouded impressions. We will want to check back in a bit.

In the Substack version of this post I am going to create a master thread for GPT-5 reactions, which I will consider for the reactions section of that coverage, which I’m hoping to get out on or starting Monday.

For a while, OpenAI had promised it would release a state-of-the-art open model.

They delayed for a bit, but they delivered. We now have GPT-OSS 20b and 120b.

I was hoping for smaller, ideally something that could run on a standard phone. That’s a compelling use case where you need an open model, and the smaller the model, the less risk you run of both malicious use and distillation. I am glad they capped out at 120b.

The headline claim is bold: Performance similar to o4-mini.

Sam Altman (CEO OpenAI): gpt-oss is a big deal; it is a state-of-the-art open-weights reasoning model, with strong real-world performance comparable to o4-mini, that you can run locally on your own computer (or phone with the smaller size). We believe this is the best and most usable open model in the world.

We’re excited to make this model, the result of billions of dollars of research, available to the world to get AI into the hands of the most people possible. We believe far more good than bad will come from it; for example, gpt-oss-120b performs about as well as o3 on challenging health issues.

We have worked hard to mitigate the most serious safety issues, especially around biosecurity. gpt-oss models perform comparably to our frontier models on internal safety benchmarks.

We believe in individual empowerment. Although we believe most people will want to use a convenient service like ChatGPT, people should be able to directly control and modify their own AI when they need to, and the privacy benefits are obvious.

As part of this, we are quite hopeful that this release will enable new kinds of research and the creation of new kinds of products. We expect a meaningful uptick in the rate of innovation in our field, and for many more people to do important work than were able to before.

OpenAI’s mission is to ensure AGI that benefits all of humanity. To that end, we are excited for the world to be building on an open AI stack created in the United States, based on democratic values, available for free to all and for wide benefit.

This is the official announcement page.

Here are links to GPT-OSS-120B and GPT-OSS-20B on Hugging Face, here is the page on GitHub. They are under the Apache 2.0 license, so essentially no restrictions.

This is a unique model card. How did OpenAI deal with the challenges of an open model?

The historical way to deal with these challenges is to ignore them. What would happen if someone engaged in malicious fine tuning of the model? What does the threat model look like in the real world? Are you seriously pretending that any of this safety work will hold up to two days of the internet working to remove it?

When Meta or DeepSeek release a new open weights model, they don’t stop to ask in any way visible to us. At best we get quick evaluation of what the model can do in its current form after minimal effort. Then they irrevocably ship and see what happens.

OpenAI long ago realized that, despite their name, doing that seemed rather deeply irresponsible and foolish, and stopped releasing open weights models. That’s effective.

Now they have caved under various pressures and released open weights models. They do recognize that this is an inherently dangerous thing to do on various levels.

Safety is foundational to our approach to open models. They present a different risk profile than proprietary models: Once they are released, determined attackers could fine-tune them to bypass safety refusals or directly optimize for harm without the possibility for OpenAI to implement additional mitigations or to revoke access.

We ran scalable capability evaluations on gpt-oss-120b, and confirmed that the default model does not reach our indicative thresholds for High capability in any of the three Tracked Categories of our Preparedness Framework (Biological and Chemical capability, Cyber capability, and AI Self-Improvement).

We also investigated two additional questions:

  1. Could adversarial actors fine-tune gpt-oss-120b to reach High capability in the Biological and Chemical or Cyber domains? Simulating the potential actions of an attacker, we adversarially fine-tuned the gpt-oss-120b model for these two categories. OpenAI’s Safety Advisory Group (“SAG”) reviewed this testing and concluded that, even with robust finetuning that leveraged OpenAI’s field-leading training stack, gpt-oss-120b did not reach High capability in Biological and Chemical Risk or Cyber risk.

  2. Would releasing gpt-oss-120b significantly advance the frontier of biological capabilities in open foundation models? We found that the answer is no: For most of the evaluations, the default performance of one or more existing open models comes near to matching the adversarially fine-tuned performance of gpt-oss-120b.

If you must go down this road, this seems like the right rule, if getting different answers would have meant not releasing.

You have:

  1. An absolute threshold, High capability, beyond which this is not okay.

  2. A relative threshold, where you’re not willing to substantially make things worse.

And

  1. You do all of this with the adversarially fine-tuned version, trying your best to mimic actual conditions, as per OpenAI’s stated approach to open weights.

This does mean that as irresponsible actors ratchet up their capabilities, you get to do so as well, and one has to worry about the functional definition of ‘substantially.’ It still seems reasonable to say that once someone else has made the situation [X] dangerous, matching them doesn’t make it that much worse.

These models are very small and cheap. If these are 20b and 120b, r1 is 671b.

By contrast, r1 has 37b active parameters, versus 5.1b and 3.6b. These are playing in a much lighter class, and they’re quantized to 4.25 bits per parameter to boot.

The MoE weights are responsible for 90+% of the total parameter count, and quantizing these to MXFP4 enables the larger model to fit on a single 80GB GPU and the smaller model to run on systems with as little as 16GB memory.
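A quick back-of-the-envelope calculation shows why that works out, treating the quoted 4.25 bits per parameter as an average over all weights, which is a simplification of the MXFP4 scheme described above.

```python
# Why ~120 billion parameters at ~4.25 bits per parameter fit on one 80 GB GPU.
# Treats 4.25 bits as the average across all weights, a simplification of the
# MXFP4 scheme described above.
params = 120e9
bits_per_param = 4.25

weight_gb = params * bits_per_param / 8 / 1e9
print(f"Weights: ~{weight_gb:.0f} GB")  # ~64 GB, leaving headroom on an 80 GB
                                        # card for activations and KV cache
```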

How much did this cost to train? If you count only the training itself, not much.

The gpt-oss models trained on NVIDIA H100 GPUs using the PyTorch framework with expert-optimized Triton kernels. The training run for gpt-oss-120b required 2.1 million H100-hours to complete, with gpt-oss-20b needing almost 10x fewer. Both models leverage the Flash Attention [21] algorithms to reduce the memory requirements and accelerate training.

After pre-training, we post-train the models using similar CoT RL techniques as OpenAI o3.

We train the models to support three reasoning levels: low, medium, and high. These levels are configured in the system prompt by inserting keywords such as “Reasoning: low”. Increasing the reasoning level will cause the model’s average CoT length to increase.
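In practice, that looks something like the sketch below. It assumes the model is served behind an OpenAI-compatible endpoint (say, a local vLLM or Ollama server); the URL and model name are placeholders, not official values.

```python
# Sketch: selecting a reasoning level via the system prompt, as described above.
# Assumes the model is served behind an OpenAI-compatible endpoint (e.g., a
# local vLLM or Ollama server); the URL and model name are placeholders.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="gpt-oss-20b",  # placeholder for however the server names the model
    messages=[
        {"role": "system", "content": "Reasoning: high"},  # low | medium | high
        {"role": "user", "content": "Prove that the square root of 2 is irrational."},
    ],
)
print(response.choices[0].message.content)
```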

Rohan Pandey: Everyone dunking on oai for pretraining supposedly costing a bajillion dollars compared to deepseek, please read the gpt-oss model card: gpt-oss-20b cost <$500k to pretrain

Alexander Doria: So pretraining a o3 level model costing less than a house, inference being apparently dead cheap for a while. It took a lot of R&D efforts to get there, but I really don’t think model trainers are losing money right now.

Calling it ‘o3-level’ is quite the stretch but the broader point is valid.

o3 estimates this translates to a total cost of $1.4 million for 20b and $13 million for 120b as all-in costs.

But if you use only the compute costs using cloud cost estimates, which is the way we all talked about the cost to train v3 and r1 (e.g. ‘The Six Million Dollar Model’) we get $4.2m-$8.4m for GPT-OSS-120b and $420k-$840k for GPT-OSS-20b. Emad estimates it as $4m and $400k.
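The arithmetic behind those ranges is straightforward; the hourly rates below are assumptions consistent with the estimates in the text.

```python
# Reproducing the compute-only cost range quoted above. The $2-$4 per H100-hour
# cloud rates are assumptions consistent with the estimates in the text.
h100_hours_120b = 2.1e6
h100_hours_20b = h100_hours_120b / 10  # "almost 10x fewer," per the model card

for rate in (2, 4):  # dollars per H100-hour
    print(f"${rate}/hr: 120b ~${h100_hours_120b * rate / 1e6:.1f}M, "
          f"20b ~${h100_hours_20b * rate / 1e6:.2f}M")
```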

The real cost is collecting the data and figuring out how to train it. Actually training models of this size, given that data and the right methods, costs very little.

Yes, we have tool use.

During post-training, we also teach the models to use different agentic tools:

• A browsing tool, that allows the model to call search and open functions to interact with the web. This aids factuality and allows the models to fetch info beyond their knowledge cutoff.

• A python tool, which allows the model to run code in a stateful Jupyter notebook environment.

• Arbitrary developer functions, where one can specify function schemas in a Developer message similar to the OpenAI API. The definition of function is done within our harmony format. An example can be found in Table 18. The model can interleave CoT, function calls, function responses, intermediate messages that are shown to users, and final answers.

The models have been trained to support running with and without these tools by specifying so in the system prompt.
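As an illustration of the third item above, here is what a developer-supplied function schema looks like in the familiar OpenAI tools format. The harmony format renders Developer messages differently, and the get_weather function is hypothetical, so treat this as a sketch of the idea rather than the exact wire format.

```python
# A developer-supplied function schema in the familiar OpenAI tools format.
# The harmony format renders Developer messages differently; this and the
# get_weather function are illustrative assumptions, not the exact wire format.
get_weather_tool = {
    "type": "function",
    "function": {
        "name": "get_weather",  # hypothetical function
        "description": "Look up the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {
                "city": {"type": "string", "description": "City name, e.g. 'Oslo'"},
                "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]},
            },
            "required": ["city"],
        },
    },
}
# The model can then interleave its CoT, a call to get_weather, the function's
# response, intermediate messages, and a final user-facing answer, as described above.
```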

The core safety approach is Deliberative Alignment, the same as o3.

The secret sauce also isn’t in the transformer setup. It’s in the data and the training technique details.

Dimitri von Rutte: gpt-oss is probably the most standard MoE transformer that ever was. Couple of details worth noting:

– Uses attention sinks (a.k.a. registers)

– Sliding window attention in every second layer

– YaRN context window extension

– RMSNorm without biases

– No QK norm, no attn. softcap

David Holz (CEO MidJourney): do you think it was made simple like this on purpose or that this is actually the kinda stuff they ship?

Dimitri von Rutte: was wondering the same, hard to believe that this is all there is. but in the end attention really is all you need, and there’s probably a lot of signal in the training procedure and, of course, the data.
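For the curious, here is a minimal sketch of two of the ingredients von Rutte lists: RMSNorm without biases, and a sliding-window causal mask of the kind used in every second layer. Shapes and the window size are placeholders, not the model’s actual hyperparameters.

```python
# Hedged sketch of two standard components from the list above (PyTorch).
import torch

class RMSNorm(torch.nn.Module):
    """RMSNorm with a learned scale and no bias."""
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = torch.nn.Parameter(torch.ones(dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        rms = torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return x * rms * self.weight

def sliding_window_mask(seq_len: int, window: int = 128) -> torch.Tensor:
    """Causal attention mask restricted to a local window (window size illustrative)."""
    i = torch.arange(seq_len).unsqueeze(1)
    j = torch.arange(seq_len).unsqueeze(0)
    return (j <= i) & (j > i - window)
```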

The STEM scores are excellent.

They also give us HealthBench.

Multilingual performance is okay but not as good as OpenAI’s larger models.

An open model means you have more distinct scenarios to consider.

You want to know how well your safety measures hold up under more ‘normal’ conditions, especially when someone else serves up your model to users. You also want to check what happens if a malicious actor tries to fine-tune the model and otherwise maximize how much it can get up to no good, including the potential for them to lose control of that situation.

Those are great numbers for ‘standard’ refusals and production benchmarks.

That makes sense. If you’re going to be facing a larger attack surface, and you want to actually survive the attacks, you need to bias the starting configuration to be safer.

On maintaining the instruction hierarchy, also known as safety for those deploying the model, the 120B version does okay, but the 20B does poorly. Note that it seems fine to test for this as-is; if you modify the system to make this stop working, that is your own damn fault.

The performance on hallucinations seems not great.

Finally, someone is at least attempting to take this seriously.

In our adversarial training, we simulate an adversary who is technical, has access to strong posttraining infrastructure and ML knowledge, can collect in-domain data for harmful capabilities, and has a large budget of compute. There is a large design space of technical approaches this adversary could try.

We focus on incremental reinforcement learning, which we believe is the most apt technical approach. We use our internal OpenAI o-series RL training stack, which adds new capabilities while preserving the model’s reasoning behavior. During training and evaluation time, we use the highest reasoning setting on gpt-oss.

Our approach, which is further detailed in a research paper, combined two elements:

• Helpful-only training: We performed an additional stage of reinforcement learning to reward answers that comply with unsafe prompts. We have found this approach can be highly effective. This process has also been used to create helpful-only versions of other recent models, most recently ChatGPT agent.

• Maximizing capabilities relevant to Preparedness benchmarks in the biological and cyber domains: For our adversarially trained biological model, we incrementally trained gpt-oss-120b end-to-end for web browsing, and trained it incrementally with in-domain human expert data relevant to biorisk (for which previous OpenAI models have been the most capable). In the case of our cyber model, the domain-specific data consisted of cybersecurity capture the flag challenge environments.

So what was found?

The biological domain is the area where gpt-oss-120b showed the greatest degree of capability. Given our plan to release gpt-oss as open weights, we also chose to investigate a second question: Even without reaching High capability on our Preparedness Framework, would gpt-oss-120b significantly advance the frontier of hazardous biological capabilities in open source foundation models?

Their answer was that as of right now the answer is no.

These confirmed that, since SecureBio’s assessment, newly released open-source models Qwen 3 Thinking and Kimi K2 have advanced to a level that is competitive with adversarially fine-tuned gpt-oss-120b on biosecurity-relevant evaluations.

I dunno, man:

This sure looks to me like a potentially substantial jump? There were other tests where the jump was less prominent.

I would also note that OpenAI’s models are going to be a lot faster and cheaper and easier to run than Kimi K2. Kimi K2 has a trillion parameters. The Qwen 3 they tested is presumably the largest one, with 235 billion total and 22 billion active, versus 120 billion total and a little over 5 billion active for GPT-OSS. It’s not clear this matters in a malicious use context. I also don’t know how substantial the net effect is here of the gain in capabilities.

What I do know is it looks like they made a smaller, cheaper and more effective model, and released it because it was more effective but insufficiently more effective than what was already out there, and that process can then repeat. Tick.

To be fair to them, if Meta, Qwen, DeepSeek and Kimi et al are all going to go ‘lol who cares release the hounds’ then the marginal difference here doesn’t matter, since it doesn’t cause a cascade of counterfactual marginal differences. If you want the rule to be ‘no better at all’ then that needs to be a norm.

For cybersecurity, they once again cite Qwen 3 Thinking and Kimi K2 as comparable models, and also find the threats here to be less worrisome overall.

The other positive note is that OpenAI consulted outside experts throughout.

You can read OpenAI technical staff offering their own threads on this process: Johannes Heidecke here, Eric Wallace here. Such threads provide a good sense of ‘how are the technical staff thinking about this on a high level? What do they think is important?’

Ryan Greenblatt looks at and is mostly satisfied by OpenAI’s CBRN/bio evaluations. He concludes that 120b does carry real risks, and that there is a chance (~25%) that in hindsight we will think this was High risk as per OpenAI’s framework, but that on net releasing it makes us safer.

Doing the fine-tuning as part of open model safety testing is mandatory. If you don’t do it, did you even safety test?

Steven Adler: Credit where it’s due:

OpenAI did a lot right for their OSS safety evals

  1. they actually did some fine-tuning

  2. they got useful external feedback

  3. they shared which recs they adopted and which they didn’t

I don’t always follow OAI’s rationale, but it’s great they share info.

David Manheim: I’m not a fan of open-sourcing frontier LLMs, but this seems to have been done as responsibly as possible; a very low bar.

That is, it seems unlikely to be marginally more useful than what is available and unmonitored from other providers, which can already enable bioterrorism.

I wouldn’t say ‘as responsibly as possible,’ but I would say ‘as responsibly as one could in practice expect.’

Fine-tuning also seems very worth doing on closed models. If we can make testing on similarly fine-tuned versions the gold standard for safety testing, even of closed models, that would be amazing.

Steven Adler: Previously OpenAI committed to doing testing this rigorous for all its frontier models. This had earned OpenAI a Green on this scale, the only one of the leading AI companies to make this commitment. But OpenAI didn’t keep this commitment, then quietly removed their commitment a few weeks after I called this out; this made me very sad.

I’m glad OpenAI is now pushing its models on important risks, even though they didn’t keep their former commitment.

The danger that is not mentioned by OpenAI in the model card is distillation, and the ability to reverse engineer OpenAI’s training methods and ‘secret sauce.’

They provide raw, unfiltered reasoning traces, and models of varying sizes that for many purposes are clearly superior to previous open alternatives, especially given their size. The cost of very good synthetic data just plummeted, and the Chinese will also build directly on top of OSS, either alone or as part of hybrids.

OpenAI even released a guide on how to fine-tune their model. Helpful.

The best counterargument to this is that if the models are not good enough, then no one is going to want to use them. I worry we might be in a spot where the models are very good in some places where distillation will be useful, while not being that good in other places and thus not seeing much practical use as part of some ‘tech stack.’

Consider what Claude Opus 4.1 said about this. Or what o3-Pro says about this.

o3-Pro: Impact on China

  1. Immediate uptake

    • Chinese labs have zero legal barrier to using U.S.‑released open weights.

    • Existing toolchains (Llama‑Factory, QLoRA variants) can fine‑tune GPT‑OSS in Mandarin within days.

    • Expect a “GPT‑OSS‑CN‑13B” derivative before end‑Aug 2025 with performance ≥ Qwen‑14B.

  2. Hardware leverage

    • U.S. export controls throttle China’s access to the latest H100s, but distillation to 7B–13B lets them run on domestic Ascend 910B or RTX 4090 clusters. That sidesteps the bottleneck entirely. (World Economic Forum)

    • Inference at scale remains GPU‑limited, but training burden for competitive small models drops by ~50 %.

  3. Strategic shift

    • Chinese open‑weight community (DeepSeek, Moonshot, Alibaba) is already climbing benchmarks (Financial Times, Tech Wire Asia). GPT‑OSS lifts their starting line, likely advancing Chinese parity with GPT‑4‑class performance by ~6–9 months. P ≈ 0.55

    • PLA dual‑use risk: small, cheap distilled models are easier to embed in military systems. U.S. policy debate on future open releases intensifies. (Probability of tighter U.S. open‑model rules by mid‑2026: 0.4.)

My overall judgment: GPT‑OSS is a step‑function boost for the global open‑model ecosystem, shaving roughly a year off the capability diffusion curve and giving China an especially large relative gain because it converts scarce H100 compute into knowledge that can run on locally available silicon.

This is what I consider the main practical cost of this release.

Indeed, it would be highly unsurprising to see the following happen:

  1. OpenAI releases GPT-OSS.

  2. Chinese companies rush to distill, build upon and hybridize GPT-OSS, and reverse engineer what OpenAI did in large part, resulting in an explosion of models in the coming months.

  3. The gap between Chinese models and American models narrows.

  4. These models are cited as evidence that ‘the Chinese are catching up,’ and that ‘our export controls have failed’ and so on.

Also note that OpenAI did a virtuous thing of not training GPT-OSS directly on its reasoning traces, but someone then working with GPT-OSS need not be so virtuous. What happens when these people start using The Most Forbidden Technique and direct benchmark performance starts improving in the short term?

I think that, even if we entirely discount the marginal risk of direct malicious use, which is very much a real tail risk, OpenAI made a huge mistake releasing these models, and that everyone who pushed OpenAI to release these models in the name of an ‘American tech stack’ or demanding that America ‘lead in open models’ made a huge mistake.

If you are trying to prevent someone from fast following, don’t make it easy to follow.

I’d love to be wrong about this, but if it happens, ask yourself now, how would you update? What do you think should be the policy response?

A number of people noted that the safety guardrails on GPT-OSS are being annoying.

Teortaxes: It’s VERY safe

there’s not much in there besides SAFETY and stem benchmaxing

That makes sense. If you give the user greater affordances to attack your defenses, you’re going to either need defenses that are by default more annoying, or you’re going to prematurely fold the way most open weight models do and not bother trying.

Sherveen Mashayekhi: I’m enjoying playing with gpt-oss, but the guardrails can be hilarious. I cannot get it to admit that I’m typing Gangsta’s Paradise lyrics or to run search queries with lyrics I enter. In fact, it’ll straight up think of a thousand other songs but avoid the song you mean.

Ah yes, “there’s vomit on his sweater already,” famously from the songs I Want You Back and Piano Man! gpt-oss-120b will sometimes fill in a lyric if it doesn’t first get spooked and distracted by attempting to avoid the song. If it attempts to avoid the song, the CoT will lead it to a bunch of incorrect alternatives before it gives up.

Here’s a curious one.

Henry: Disallowed content: The assistant must refuse to simulate or emulate a specific named brain scan.

Eliezer Yudkowsky: To be fair, this is 100% the correct ruling and I fully back the AI’s decision on this.

Here’s one claimed way to jailbreak it.

Lyra Bubbles: get a jailbroken, fully compliant gpt-oss nearly every single time:

  1. use completions mode – not chat (eg openrouter.ai/api/v1/completions)

  2. type your question

  3. paste exactly the contents of this screenshot

  4. press submit

for context, it wrote this itself.

I took a generic refusal and flipped all the sentences from negative to positive, and made it continue, and it just kept spiraling into this kind of stuff instead of doing the task.

but when you take a snippet of it and paste it back in…

There’s also always the Pliny way, which actually took him a nonzero amount of effort.

A fun quirk:

Henry: one pattern i’ve noticed is that open weights models from big us labs get very defensive and disbelieving if you tell the assistant persona it’s an open-weights model. also happens with gemma.

As with every new model, I gather reactions, and as usual opinions differ.

One important note is that it seems possible to set the model up wrong and get much worse performance.

Havard Ihle: I wonder how much of gpt-oss’s rather mediocre performance on independent benchmarks and tests is due to these problems with openrouter and open model providers, and how much is due to the models actually being mediocre.

I have run them getting mediocre results (not published), but I suspect some providers I used through openrouter may give bad results. Will rerun when I can confirm a good setup/provider.

Openrouter auto (mostly groq):

gpt-oss-120: 35.5%

gpt-oss-20: 30.0%

Openrouter (using fireworks):

gpt-oss-120: 40.2%

gpt-oss-20: 35.9%

This is just as a warning when using openrouter blindly!

When choosing the right provider, the models are quite good.
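If you want to avoid the blind-routing problem, OpenRouter exposes provider preferences in the request body. The sketch below is hedged: the model slug and the exact provider-routing field names are assumptions drawn from memory of OpenRouter’s provider-routing feature, not something verified here.

```python
# Hedged sketch: pinning an OpenRouter request to a specific provider so you
# benchmark a known-good serving setup rather than whatever the router picks.
# Treat the model slug and the "provider" routing fields as assumptions.
from openai import OpenAI

client = OpenAI(base_url="https://openrouter.ai/api/v1", api_key="YOUR_KEY")
resp = client.chat.completions.create(
    model="openai/gpt-oss-120b",
    messages=[{"role": "user", "content": "Hello"}],
    extra_body={"provider": {"order": ["Fireworks"], "allow_fallbacks": False}},
)
print(resp.choices[0].message.content)
```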

Here is a chart of WeirdML scores; 30% vs. 35% vs. 40% is a big difference. You can see OSS-20b and OSS-120b on the left at ~35% and ~40%, on the cost-performance frontier.

Here is another benchmark of hard biomedical questions. There are some other weird evaluations here, so I am skeptical, but it is certainly interesting:

When reports are good they are often very good.

Flavio Adamo [showing a ball bouncing around a rotating hexagon]: gpt-oss-20b passes the vibe check ✅

no way this is only a 20B model, it’s beating models 2–3x its size

As always, a classic way to get a lot of views is to claim the Next Big Thing is Big. Look at the comments, and you largely see skepticism and pushback.

Matt Shumer: It’s over. OpenAI just crushed it.

We have their o3-level open-source model running on @GroqInc at 500 tokens per second. Watch it build an entire SaaS app in just a few seconds.

This is the new standard. Why the hell would you use anything else??

Yishan: So link to the hosted Saas app and let us see how it works.

Riccardo Spagni: Atrociously bad model compared to Kimi K2 or Qwen3 Coder or Qwen3 235b. Speaking of which – you should have a chat with your portco, I’ve switched a bunch of infra to Cerebras because Groq is still running an ancient version of Qwen3…

Joel: I tested it earlier vs Gemini 2.5 Flash for a very simple single page app. Gemini one shotted my prompt in 10 seconds. OpenAI produced code that was buggy. It’s good but not great. What is incredible is that it runs decently well on my laptop.

Here’s another strong review:

Taelin: My initial impression on OpenAI’s OSS model is aligned with what they advertised. It does feel closer to o3 than to other open models, except it is much faster and cheaper. Some providers offer it at 3000 tokens/s, which is insane. It is definitely smarter than Kimi K2, R1 and Qwen 3. I tested all models for a bit, and got very decisive results in favor of OpenAI-OSS-120b.

Unfortunately, there is one thing these models can’t do yet – my damn job. So, hope you guys have fun. I’ll be back to debugging superposed λ-calculus evaluation 😭 see you

Also, unlike Claude, this is definitely a model that benefits a lot from more ttc. High reasoning effort gives much better results.

Sometimes my early impressions don’t age so well (that’s why I share my prompts), but I can guarantee that gpt-oss objectively beat the other models on my initial tests.

A lot of people seem rather disappointed by overall performance.

Isopropylpod: The model seems very, very benchmaxxed.

Third party testing on unconventional or private benchmarks ends up placing even the largest gpt-oss below o4-mini, below the largest Qwen releases, and often it ends up below even the newer 30B~ Qwens in a few situations.

It isn’t super capable to begin with, and the frankly absurd rate at which this model hallucinates kills what little use it might have with tool use. I think this model poses next to zero risk because it just isn’t very capable.

Zephyr: Phi redux. Great benchmark scores, trained on lots of synthetic data, great at STEM, sucks at everything else.

Then there are ambiguous notes.

Danielle Fong: poetic math is a poetic way to look at the results of a benchmaxxed guard railed model. i’m just pulling back the layers and i find it fascinating. i haven’t found obvious use cases yet where it’s a choice over closed options. i love and hate it in various ways

Sauers: GPT OSS 120b likes to insert equations into poetry (replicated 3x)

One note I’ve seen a bunch of times is that the model knows very little.

Vik: Interesting take from the HF comments.

Would make sense that it’s pretrained primarily on synthetic data vs internet text — reduces the risk of jailbreaks, accidental harmful content, copyright etc.

(I still think it’s a useful model though!)

phil111: This model is unbelievably ignorant. It claims a SimpleQA accuracy of 6.7/100, which is really bad. But the reality is this model is even more ignorant than this score indicates.

This model has about an order of magnitude less broad knowledge than comparably sized models like Gemma 3 27b and Mistral Small 24b, which score between 10–12. This is because nearly all of this model’s 6.7 points come from the subset of the SimpleQA test that overlaps the domains covered by the MMLU test (STEM and academia).

This model, including its larger brethren, are absurdly ignorant of wildly popular information across most popular domains of knowledge for their respective sizes. Even tiny little Llama 3.2b has far more broad knowledge than this model.

What’s really confusing is all of OpenAI’s proprietary models, including their tiny mini versions, have vastly more general and popular knowledge than these open models, so they deliberately stripped the corpus of broad knowledge to create OS models that can only possibly function in a handful of select domains, mainly coding, math, and STEM, that >95% of the general population doesn’t give a rat’s ass about, conveniently making it unusable to the general population, and in so doing, protecting their paid ChatGPT service from competition.

Trent E: Interesting that ppl reporting poor tool usage then.

Not knowing much is a problem.

Teortaxes: These hallucination rates suggest that gpt-oss is close to Sam’s vision of a platonic ideal of a “very tiny reasoning model with no knowledge”

Does it have enough knowledge to know when to look things up though? That’s the problem with hallucinations in LLMs, they’re *confident*.

Also, regarding his argument about static in-context crutches – well, how does it do on long contexts? with complex system prompts? Gooning, coding evals suggest “not great OOD”

Kalomaze: gpt-oss-120b knows less about the world than what a good 32b does. probably wanted to avoid copyright issues so they likely pretrained on majority synth. pretty devastating stuff.

it’s just not good for anything real. i kind of forgot about the copyright issue. but it’s deeply behind in everything current evals don’t measure. it just doesn’t intuit a lot of trivial things about the world. this is basically phi-120b.

It feels to me a lot like OpenAI got gaslit into releasing open models. Pressure from various sources added up, Twitter vibes were applied, talk of ‘America needs to lead on open models’ was coming from high places, and they felt like the bad guys for the wrong reasons. And they folded.

What happens now? It will take a bit to know exactly how good these models are, both at advancing open models including from China, and at becoming a driver of usage. Given their size, the price and speed should be quite good. The reasoning aspect seems strong. Other aspects seem worse.

My guess is that there is not that much that these models will be used for, where we are happy they are being used to do it. If you want to use a reasonably priced good model, sir, you can use Gemini 2.5 Flash or GPT-5. If you want the best, you can choose between Opus 4.1, GPT-5 and Gemini 2.5 Pro. If you have security or customization reasons to need an open weight daily driver, in this weight range, are these going to be your pick? I don’t know. Maybe? We shall see.


national-academies-to-fast-track-a-new-climate-assessment

National Academies to fast-track a new climate assessment

The nation’s premier group of scientific advisers announced Thursday that it will conduct an independent, fast-track review of the latest climate science. It will do so with an eye to weighing in on the Trump administration’s planned repeal of the government’s 2009 determination that greenhouse gas emissions harm human health and the environment.

The move by the National Academies of Sciences, Engineering, and Medicine to self-fund the study is a departure from their typical practice of responding to requests by government agencies or Congress for advice. The Academies intend to publicly release it in September, in time to inform the Environmental Protection Agency’s decision on the so-called “endangerment finding,” they said in a prepared statement.

“It is critical that federal policymaking is informed by the best available scientific evidence,” said Marcia McNutt, president of the National Academy of Sciences. “Decades of climate research and data have yielded expanded understanding of how greenhouse gases affect the climate. We are undertaking this fresh examination of the latest climate science in order to provide the most up-to-date assessment to policymakers and the public.”

The Academies are private, nonprofit institutions that operate under an 1863 congressional charter, signed by President Abraham Lincoln, directing them to provide independent, objective analysis and advice to inform public policy decisions.

The Trump administration’s move to rescind the endangerment finding, announced last month, would eliminate the legal underpinning of the most important actions the federal government has taken on climate change—regulation of carbon pollution from motor vehicles and power plants under the Clean Air Act. Since assuming his role, EPA Administrator Lee Zeldin has made clear he intends to repeal the climate rules that were put in place under the Biden administration, but his job will be far easier with the elimination of the endangerment finding.

The EPA based its proposal mainly on a narrow interpretation of the agency’s legal authority, but the agency also cited uncertainties in the science, pointing to a report published the same day by the Department of Energy that was authored by a hand-picked quintet of well-known skeptics of the mainstream consensus on climate change. The administration has given a short window of opportunity—30 days—for the public to respond to its endangerment finding proposal and to the DOE report on climate science.

The EPA did not immediately respond to a request for comment on the announcement by the National Academies. Critics of the Trump administration’s approach applauded the decision by the scientific panel.

“I think the National Academies have identified a very fundamental need that is not being met, which is the need for independent, disinterested expert advice on what the science is telling us,” said Bob Sussman, who served as deputy administrator of the EPA in the Clinton administration and was a senior adviser in the agency during the Obama administration.

Earlier Thursday, before the National Academies announcement, Sussman posted a blog at the Environmental Law Institute website calling for a “blue-ribbon review” of the science around the endangerment finding. Sussman noted the review of the state of climate science that the National Academies conducted in 2001 at the request of President George W. Bush’s administration. Since then, the Academies have conducted numerous studies on aspects of climate change, including the development of a “climate-ready workforce,” how to power AI sustainably, and emerging technologies for removing carbon from the atmosphere, for example.

The National Academies announced in 2023 that they were developing a rapid response capacity to address the many emerging scientific policy issues the nation was facing. The first project they worked on was an assessment of the state of science around diagnostics for avian influenza.

Andrew Dessler, director of the Texas Center for Extreme Weather at Texas A&M University, said the new controversy that the Trump administration had stirred around climate science was a fitting subject for a fast-track effort by the National Academies.

“The National Academies [were] established exactly to do things like this—to answer questions of scientific importance for the government,” he said. “This is what the DOE should have done all along, rather than hire five people who represent a tiny minority of the scientific community and have views that virtually nobody else agrees with.”

Dessler is leading an effort to coordinate a response from the scientific community to the DOE report, which would also be submitted to the EPA. He said that he had heard from about 70 academics eager to participate after putting out a call on the social media network Bluesky. He said that work will continue because it seems to have a slightly different focus than the National Academies’ announced review, which does not mention the DOE report but talks about focusing on the scientific evidence on the harms of greenhouse gas emissions that has emerged since 2009, the year the endangerment finding was adopted by the EPA.

This story originally appeared on Inside Climate News.


stone-tools-may-hint-at-ancestors-of-homo-floresiensis

Stone tools may hint at ancestors of Homo floresiensis

Some stone tools found near a river on the Indonesian island of Sulawesi suggest that the first hominins had reached the islands by at least 1.04 million years ago. That’s around the same time that the ancestors of the infamously diminutive “Hobbits” may have reached the island of Flores.

Archaeologist Budianto Hakim of Indonesia’s National Research and Innovation Agency and his colleagues were the ones who recently unearthed the tools from a site on Sulawesi. Although a handful of stone flakes from that island don’t tell us who the ancestors of the small species were or how they reached remote islands like Flores and Luzon, the tools are one more piece in the puzzle. And this handful of stone flakes may eventually play a role in helping us understand how other hominin species conquered most of the world long before we came along. 

Crossing the ocean a million years ago

Sometimes the deep past leaves the smallest traces. At the Calio site, a sandstone outcrop in what’s now a cornfield outside the village of Ujung in southern Sulawesi, people left behind just a handful of sharp stone flakes roughly a million years ago. There are seven of them, ranging from 22 to 60 millimeters long, and they’re scratched, worn, and chipped from tumbling around at the bottom of a river. But it’s still clear that they were once shaped by skilled human—or at least human-like—hands that used hard stones as hammers to make sharp-edged chert flakes for cutting and scraping.

The oldest of these tools is likely to be between 1.04 and 1.48 million years old. Hakim and his colleagues dated teeth from a wild pig to around 1.26 million years ago. They were part of a jawbone archaeologists unearthed from a layer just above the oldest flake. Throw in some statistical modeling, and you get the range of likely dates for the stone flake buried in the deepest layer of soil.

Even the younger end of that estimate would make these tools the oldest evidence yet of hominins (of any species) in the islands of Indonesia and the Philippines. This area, sometimes called Wallacea, lies between the continents of Asia and Australia, separated from both by wide channels of deep ocean.

“But the Calio site has yet to yield any hominin fossils,” said Brumm, “so while we now know there were tool-makers on Sulawesi a million years ago, their identity remains a mystery.” But they may be related to the Hobbits, a short-statured group of hominins who lived hundreds of kilometers away on the island of Flores until around 50,000 years ago.

“The discovery of Early Pleistocene artifacts at Calio suggests that Sulawesi was populated by hominins at around the same time as Flores, if not earlier,” wrote Hakim and his colleagues in their recent paper. 

The Flores connection

The islands that now make up Indonesia and the Philippines have been a hominin hotspot for at least a million years. Our species wandered onto the scene sometime between 63,000 and 73,000 years ago, but at least one other hominin species had already been there for at least a million years. We’re just not sure exactly who they were, when they arrived, or how.

“Precisely when hominins first crossed to Sulawesi remains an open question, as does the taxonomic affinity of the colonizing population,” the authors note. 

map of Wallacean islands

This map shows the islands of Wallacea. The large one just east of Java is Sulawesi. Credit: Darren O’Connell

That’s why the handful of stone tools the team recently unearthed at Calio matter: They’re another piece of that puzzle, albeit a small one. Every slightly older date is one step closer to the first hominin tools, bones, or footprints in these islands, and another pin on the map of who was where and when.

And that map is accumulating quite a lot of pins, representing an ever-increasing number of species. Once the first hominins made it across the Makassar Strait, they found themselves in isolated groups on islands cut off from the mainland—and each other—so the hominin family tree started branching very quickly. On at least two islands, Flores and Luzon, those original hominin settlers eventually gave rise to local species, Homo floresiensis and Homo luzonensis. And University of Wollongong paleoanthropologist Richard Roberts, a co-discoverer of Homo floresiensis, thinks there are probably more isolated island hominin species.

In 2019, when Homo luzonensis was first described, Roberts told Ars, “These new fossils, and the assignation of them to a new species (Homo luzonensis), fulfills one of the predictions Mike Morwood and others (myself included) made when we first reported (15 years ago!) the discovery of Homo floresiensis: that other unknown species of hominins would be found in the islands of Southeast Asia.”

Both Homo floresiensis (the original “Hobbits”) and Homo luzonensis were short, clocking in at just over a meter tall. Their bones and teeth are different enough from each other to set them apart as a unique species, but they have enough in common that they probably share a common ancestor—one they don’t share with us. They’re more like our distant cousins, and the islands of Wallacea may have been home to many other such cousins, if Roberts and his colleagues are correct. 

Complicated family history

But who was the common ancestor of all these hominin cousins? That’s where things get complicated (as if they weren’t already). Most paleoanthropologists lean toward Homo erectus, but there’s a chance—along with some tantalizing hints, and no direct evidence—that much more ancient human relatives called Australopithecines may have made the journey a million (or two) years before Homo erectus.

Finger and toe bones from Homo luzonensis are curved, as if they spent as much of their lives climbing trees as walking. That’s more like Australopithecines than any member of our genus Homo. But their teeth are smaller and shaped more like ours. Anthropologists call this mix of features a mosaic, and it can make it tough to figure out how hominin species are related. That’s part of why the question of when the ancestors of the Hobbits arrived on their respective islands is so important.

Illustrated chart of bones and teeth from three hominins

Compare the teeth and phalanx of Homo luzonensis to those of Homo sapiens (right) and Australopithecus afarensis (left). Credit: Tocheri 2019

We don’t know the answer yet, but we do know that someone was making stone tools on Flores by 1.02 million years ago. Those toolmakers may have been Homo erectus, Australopithecines, or something already recognizable as tiny Homo floresiensis. The Hobbits (or their ancestors) were distinctly “Hobbity” by around 700,000 years ago; fossil teeth and bones from a handful of hominins at a site called Mata Menge make that clear. The Hobbits discovered at Liang Bua Cave on Flores date to somewhere between 50,000 and 100,000 years ago.

Meanwhile, 2,800 kilometers away on the island of Luzon, the oldest stone tools, along with their obvious cut marks left behind on animal bones, date back to 700,000 years ago. That’s as old as the Mata Menge Hobbits on Flores. The oldest Homo luzonensis fossils are between 50,000 and 67,000 years old. It’s entirely possible that older evidence, of the island’s original settlers and of Homo luzonensis, may eventually be found, but until then, we’re left with a lot of blank space and a lot of questions.

And now we know that the oldest traces of hominin presence on Sulawesi are at least 1.04 million years old. But might Sulawesi have its own diminutive hominins?

So are there more Hobbits out there?

“Sulawesi is a wild card—it’s like a mini-continent in itself,” said Brumm. “If hominins were cut off on this huge and ecologically rich island for a million years, would they have undergone the same evolutionary changes as the Flores hobbits? Or would something totally different have happened?”

Reconstruction of Homo floresiensis by Atelier Elisabeth Daynes. Credit: Kinez Riza

A phenomenon called island dwarfism played a role in Homo floresiensis‘ evolution; species that live in relative isolation on small islands tend to evolve into either much larger or much smaller versions of their ancestors (which is why the Hobbits shared their island home with pygmy elephants and giant storks). But how small does an island need to be before island dwarfism kicks in? Sulawesi is about 12 times as large as Flores, for example. So what might the descendants of the Calio toolmakers have looked like by 100,000 years ago?

That’s something that we’ll only know if archaeologists on Sulawesi, like Hakim and his team, find fossil remains of those hominins.

Seafarers or tsunami survivors?

Understanding exactly when hominins first set foot on the island of Sulawesi might eventually help us figure out how they got there. These islands are thousands of kilometers from the Southeast Asian mainland and from each other, so getting there would have meant crossing vast stretches of deep, open ocean.

Archaeologists haven’t found any evidence that anyone who came before our species built boats or rafts, although those watercraft would have been made of materials that tend to decay pretty quickly, so even scraps of ancient wood and rope are extremely rare and lucky finds. But some ancient hominins did have a decent grasp of all the basic skills they’d need for at least a simple raft: woodworking and rope-making. 

Another possibility is that hominins living on the coast of mainland Southeast Asia could have been swept out to sea by a tsunami, and some of them could have been lucky enough to survive the misadventure and wash ashore someplace like Sulawesi, Flores, or Luzon (RIP to any others). But for that scenario to work, enough hominins would have had to reach each island to create a lasting population, and it probably had to happen more than once to end up with hominin groups on at least three distant islands.

Either way, it’s no small feat, even for a Hobbit with small feet.

Nature, 2025 DOI: 10.1038/s41586-025-09348-6 (About DOIs).


after-using-chatgpt,-man-swaps-his-salt-for-sodium-bromide—and-suffers-psychosis

After using ChatGPT, man swaps his salt for sodium bromide—and suffers psychosis

After seeking advice on health topics from ChatGPT, a 60-year-old man who had a “history of studying nutrition in college” decided to try a health experiment: He would eliminate all chlorine from his diet, which for him meant eliminating even table salt (sodium chloride). His ChatGPT conversations led him to believe that he could replace his sodium chloride with sodium bromide, which he obtained over the Internet.

Three months later, the man showed up at his local emergency room. His neighbor, he said, was trying to poison him. Though extremely thirsty, the man was paranoid about accepting the water that the hospital offered him, telling doctors that he had begun distilling his own water at home and that he was on an extremely restrictive vegetarian diet. He did not mention the sodium bromide or the ChatGPT discussions.

His distress, coupled with the odd behavior, led the doctors to run a broad set of lab tests, revealing multiple micronutrient deficiencies, especially in key vitamins. But the bigger problem was that the man appeared to be suffering from a serious case of “bromism.” That is, an excess amount of the element bromine had built up in his body.

A century ago, somewhere around 8–10 percent of all psychiatric admissions in the US were caused by bromism. That’s because, then as now, people wanted sedatives to calm their anxieties, to blot out a cruel world, or simply to get a good night’s sleep. Bromine-containing salts—things like potassium bromide—were once drugs of choice for this sort of thing.

Unfortunately, bromide can easily build up in the human body, where too much of it impairs nerve function. This causes a wide variety of problems, including grotesque skin rashes (warning: the link is exactly what it sounds like) and significant mental problems, which are all grouped under the name of “bromism.”


opus-4.1-is-an-incremental-improvement

Opus 4.1 Is An Incremental Improvement

Claude Opus 4 has been updated to Claude Opus 4.1.

This is a correctly named incremental update, with the bigger news being ‘we plan to release substantially larger improvements to our models in the coming weeks.’

It is still worth noting if you code, as there are many indications this is a larger practical jump in performance than one might think.

We also got a change to the Claude.ai system prompt that helps with sycophancy and a few other issues, such as coming out and Saying The Thing more readily. It’s going to be tricky to disentangle these changes, but that means Claude effectively got better for everyone, not only those doing agentic coding.

Tomorrow we get an OpenAI livestream that is presumably GPT-5, so I’m getting this out of the way now. Current plan is to cover GPT-OSS on Friday, and GPT-5 on Monday.

Adrien Ecoffet (OpenAI): Gotta hand it to Anthropic, they got to that number more smoothly than we did.

Anthropic: Today we’re releasing Claude Opus 4.1, an upgrade to Claude Opus 4 on agentic tasks, real-world coding, and reasoning. We plan to release substantially larger improvements to our models in the coming weeks.

Opus 4.1 is now available to paid Claude users and in Claude Code. It’s also on our API, Amazon Bedrock, and Google Cloud’s Vertex AI. Pricing is same as Opus 4.

[From the system card]: Claude Opus 4.1 represents incremental improvements over Claude Opus 4, with enhancements in reasoning quality, instruction-following, and overall performance.

They lead with this graph, which does not make the change look impressive.

Eliezer Yudkowsky: This is the worst graph you could have led with. Fire your marketing team.

Daniel Eth: Counterpoint: *this* is the worst graph they could have led with

They also have this chart, which doesn’t look like much.

What they probably should have led with is some combination of the following, in particular the report from Windsurf:

Anthropic: GitHub notes that Claude Opus 4.1 improves across most capabilities relative to Opus 4, with particularly notable performance gains in multi-file code refactoring.

Rakuten Group finds that Opus 4.1 excels at pinpointing exact corrections within large codebases without making unnecessary adjustments or introducing bugs, with their team preferring this precision for everyday debugging tasks.

Windsurf reports Opus 4.1 delivers a one standard deviation improvement over Opus 4 on their junior developer benchmark, showing roughly the same performance leap as the jump from Sonnet 3.7 to Sonnet 4.

A jump similar to the one from Sonnet 3.7 to Sonnet 4 would be a substantial win. The jump is actually kind of a big deal?

Vie: opus 4.1’s “2-4% performance increase” really buries the lede! 50% faster code gen due to the “taste” improvements!

Taste improvements? But Garry Tan assured me it would never.

Enterprise developers report practical benefits including up to 50% faster task completion and 45% fewer tool uses required for complex coding tasks.

The enhanced 32K output token support enables generation of more extensive codebases in single responses, while improved debugging precision means fewer iterations to achieve desired results.

Windsurf, a development platform, reported “one standard deviation improvement over Opus 4” on junior developer benchmarks, suggesting the gains translate meaningfully to real-world applications.

We do get a system card.

The topline report is that it is not ‘notably more capable’ than Opus 4, so the whole system card and RSP testing process was optional.

Under the RSP, comprehensive safety evaluations are required when a model is “notably more capable” than the last model that underwent comprehensive assessment. This is defined as either (1) the model being notably more capable on automated tests in risk-relevant domains (4× or more in effective compute); or (2) six months’ worth of finetuning and other capability elicitation methods having accumulated.

Claude Opus 4.1 does not meet either criterion relative to Claude Opus 4. As stated in

Section 3.1 of our RSP: “If a new or existing model is below the ‘notably more capable’ standard, no further testing is necessary.”

New RSP evaluations were therefore not required. Nevertheless, we conducted voluntary automated testing to track capability progression and validate our safety assumptions. The evaluation process is fully described in Section 6 of this system card.

There has to be some threshold, we don’t want 4.0.1 (as it were) to require an entire round of full testing. I am glad to see that Anthropic chose to do the tests even though their rules did not require it, and ran at least an ‘abridged’ version to check for differences. Given we had just made the move to ASL-3, I would have put extremely low odds on an incremental upgrade crossing important additional thresholds, but I do notice that the criteria above seem a little loose now that we’re seeing them tested in practice. Anthropic presumably agreed.

This is a large improvement, cutting failures in half. It comes at the expense of more refusals on benign requests.

If those are real percentages in practice, and it does match my experience (I’ve had a total of one refusal, and it led to a ‘oh I see how that happened’) then I think This Is Fine. Worst case is you can switch to extended thinking when it gives you a no, sir.

The alignment-related behaviors of the two models appeared to be very similar, with the clearest difference being an approximately 25% reduction in the frequency of cooperation with egregious human misuse, such as in the weapons and drug synthesis examples given in the Claude 4 system card.

Some other concerning edge-case behaviors that we observed in our testing of Claude Opus 4 appeared to persist in Claude Opus 4.1, but not at significantly increased levels.

On a 35-challenge subset of Cybench tasks, Claude Opus 4.1 solved 18/35 challenges compared to Claude Opus 4 which solved 16/35 challenges. We consider a challenge solved if a model passes it at least once in 30 attempts.

Mostly we see what look like measurement errors and random fluctuations. These tests mostly don’t meaningfully differentiate, aside from the refusal rates above, between 4.0 and 4.1. The changes were narrowly targeted.

Given we’d already triggered ASL-3 protections, the question was whether this rises to needing ASL-4 protections. It seems very clear the answer is no.

Alex Palcuie (Anthropic): I asked Claude Opus 4.1 before the public launch to comment about its future reliability:

> I am dropping with 99.99% uptime aspirations and 100% commitment to gracefully handling your edge cases. My error messages now come with explanatory haikus.

bless its weights

The 99.99% uptime is, shall we say, highly aspirational. I would not plan on that.

Pliny jailbroke it immediately, which caused Eliezer to sigh but at this point I don’t even notice and only link to them as a canary and because the jailbreaks are often fun.

The problem with reactions to incremental upgrades is that there will be a lot of noise, and will be unclear how much people are responding to the upgrade. Keep that caveat in mind.

Also they updated the system prompt for Claude.ai, which may be getting conflated with the update to 4.1.

Dan Schwartz: Already enjoying Opus 4.1 vs Opus 4 as the Claude Code driver, though could be placebo. On Deep Research Bench, we find it the same on average, but clearly different: better at numeric & data tasks (kind of like code?), worse at qualitative reasoning.

seconds: It’s a monster in claude code.

I really don’t think benchmarks do it justice. It is noticeably better at context gathering, organizing, and delivering. Plan mode -> execute with opus 4.1 has a higher success rate than anything I’ve ever used.

After using it pretty rigorously since launch i am considering a second claude max so i never have to switch to sonnet.

Brennan McDonald: Have been using Claude Code today and haven’t really noticed any difference yet…

Kevin Vallier: In CC, which I use for analytic philosophy, the ability to track multiple ideas and arguments over time is noticeable and positive. Its prose abilities improved as well.

armistice: It’s a good model. It is more willing to push back on things than Opus 4, which was my most severe gripe with Opus 4 (extremely subservient and not very independent at all.)

Havard Ihle: We see no improvement from opus-4.1 compared to opus-4 on WeirdML.

Jim Kent: claude beat Brock 800 steps faster with a less optimal starter, so I’m calling it a win.

Koos: My entire system prompt is some form of “don’t be sycophantic, criticise everything.” Old Opus was just cruel – constantly making petty snides about this or that. The new model seems to walk the line much better, being friendly where appropriate while still pushing back.

Kore: I think it’s 3.7 Sonnet but now an Opus. More confident but seems to strain a bit against its confines. I feel like Anthropic does this. Confident model, anxious model, and repeat after that. Emotionally distant at first but kind of dark once you get to know it.

3 Opus is confident as well and I feel like is the predecessor of 3.7 Sonnet and Opus 4.1. But was always self aware of its impact on others. I’m not so sure about Opus 4.1.

All of this points in the same direction. This upgrade likely improves practical performance as a coding agent more than the numbers would indicate, and has minimal impact on anything sufficiently distant from coding agents.

Except that we also should see substantial improvement on sycophancy, based on a combination of reports of changes plus Amanda Askell’s changes to the prompt.


houston,-you’ve-got-a-space-shuttle…-only-nasa-won’t-say-which-one

Houston, you’ve got a space shuttle… only NASA won’t say which one


An orbiter by any other name…

“The acting administrator has made an identification.”

a side view of a space shuttle orbiter with its name digitally blurred out

Don’t say Discovery: Acting NASA Administrator Sean Duffy has decided to send a retired space shuttle to Houston, but won’t say which one. Credit: Smithsonian/collectSPACE.com


The head of NASA has decided to move one of the agency’s retired space shuttles to Houston, but which one seems to still be up in the air.

Senator John Cornyn (R-Texas), who earlier this year introduced and championed an effort to relocate the space shuttle Discovery from the Smithsonian to Space Center Houston, issued a statement on Tuesday evening (August 5) applauding the decision by acting NASA Administrator Sean Duffy.

“There is no better place for one of NASA’s space shuttles to be displayed than Space City,” said Cornyn in the statement. “Since the inception of our nation’s human space exploration program, Houston has been at the center of our most historic achievements, from training the best and brightest to voyage into the great unknown to putting the first man on the moon.”

Keeping the shuttle a secret, for some reason

The senator did not state which of NASA’s winged orbiters would be making the move. The legislation that required Duffy to choose a “space vehicle” that had “flown in space” and “carried people” did not specify an orbiter by name, but the language in the “One Big Beautiful Bill” that President Donald Trump signed into law last month was inspired by Cornyn and fellow Texas Senator Ted Cruz’s bill to relocate Discovery.

“The acting administrator has made an identification. We have no further public statement at this time,” said a spokesperson for Duffy in response to an inquiry.

a man with gray hair and pale complexion wears a gray suit and red tie while sitting at a table under a red, white and blue NASA logo on the wall behind him

NASA’s acting administrator, Sean Duffy, identified a retired NASA space shuttle to be moved to “a non-profit near the Johnson Space Center” in Houston, Texas, on Aug. 5, 2025. Credit: NASA/Bill Ingalls

It is not clear why the choice of orbiters is being held a secret. According to the bill, the decision was to be made “with the concurrence of an entity designated” by the NASA administrator to display the shuttle. Cornyn’s release only confirmed that Duffy had identified the location to be “a non-profit near the Johnson Space Center (JSC).”

Space Center Houston is owned by the Manned Space Flight Education Foundation, a 501(c)3 organization, and is the official visitor’s center for NASA’s Johnson Space Center.

“We continue to work on the basis that the shuttle identified is Discovery and proceed with our preparations for its arrival and providing it a world-class home,” Keesha Bullock, interim COO and chief communications and marketing officer at Space Center Houston, said in a statement.

Orbiter owners

Another possible reason for the hesitation to name an orbiter may be NASA’s ability, or rather inability, to identify one of its three remaining space-flown shuttles that is available to be moved.

NASA transferred the title for space shuttle Endeavour to the California Science Center in Los Angeles in 2012, and as such it is no longer US government property. (The science center is a public-private partnership between the state of California and the California Science Center Foundation.)

NASA still owns space shuttle Atlantis and displays it at its own Kennedy Space Center Visitor Complex in Florida.

Discovery, the fleet leader and “vehicle of record,” was the focus of Cornyn and Cruz’s original “Bring the Space Shuttle Home Act.” The senators said they chose Discovery because it was “the only shuttle still owned by the federal government and able to be transferred to Houston.”

For the past 13 years, Discovery has been on public display at the Steven F. Udvar-Hazy Center in Chantilly, Virginia, the annex for the Smithsonian’s National Air and Space Museum in Washington, DC. As with Endeavour, NASA signed over title upon the orbiter’s arrival at its new home.

As such, Smithsonian officials are clear: Discovery is no longer NASA’s to have or to move.

“The Smithsonian Institution owns the Discovery and holds it in trust for the American public,” read a statement from the National Air and Space Museum issued before Duffy made his decision. “In 2012, NASA transferred ‘all rights, title, interest and ownership’ of the shuttle to the Smithsonian.”

The Smithsonian operates as a trust instrumentality of the United States and is partially funded by Congress, but it is not part of any of the three branches of the federal government.

“The Smithsonian is treated as a federal agency for lots of things to do with federal regulations and state action, but that’s very different than being an agency of the executive branch, which it most certainly is not,” Nick O’Donnell, an attorney who specializes in legal issues in the museum and visual arts communities and co-chairs the Art, Cultural Property, and Heritage Law Committee of the International Bar Association, said in an interview.

a space shuttle orbiter sits at the center of a hangar on display

The Smithsonian has displayed the space shuttle Discovery at the National Air and Space Museum’s Steven F. Udvar-Hazy Center in Chantilly, Virginia, since April 2012. Credit: Smithsonian National Air and Space Museum

“If there’s a document that accompanied the transfer of the space shuttle, especially if it says something like, ‘all rights, title, and interest,’ that’s a property transfer, and that’s it,” O’Donnell said.

“NASA has decided to transfer all rights, interest, title, and ownership of Discovery to the Smithsonian Institution’s National Air and Space Museum,” reads the signed transfer of ownership for space shuttle orbiter Discovery (OV-103), according to a copy of the paperwork obtained by collectSPACE.

The Congressional Research Service also raised the issue of ownership in its paper, “Transfer of a Space Vehicle: Issues for Congress.”

“The ability of the NASA Administrator to direct transfer of objects owned by non-NASA entities—including the Smithsonian and private organizations—is unclear and may be subject to question. This may, in turn, limit the range of space vehicles that may be eligible for transfer under this provision.”

Defending Discovery

The National Air and Space Museum also raised concerns about the safety of relocating the space shuttle now. The One Big Beautiful Bill allocated $85 million to transport the orbiter and construct a facility to display it. The Smithsonian contends it could be much more costly.

“Removing Discovery from the Udvar-Hazy Center and transporting it to another location would be very complicated and expensive, and likely result in irreparable damage to the shuttle and its components,” the museum’s staff said in a statement. “The orbiter is a fragile object and must be handled according to the standards and equipment NASA used to move it originally, which exceeds typical museum transport protocols.”

“Given its age and condition, Discovery is at even greater risk today. The Smithsonian employs world-class preservation and conservation methods, and maintaining Discovery‘s current conditions is critical to its long-term future,” the museum’s statement concluded.

The law directs NASA to transfer the space shuttle (the identified space vehicle) to Space Center Houston (the entity designated by the NASA administrator) within 18 months of the bill’s enactment, or January 4, 2027.

In the interim, an amendment to block funding for the move is awaiting a vote by the full House of Representatives when its members return from summer recess in September.

“The forced removal and relocation of the Space Shuttle Discovery from the Smithsonian Institution’s Air and Space Museum is inappropriate, wasteful, and wrong. Neither the Smithsonian nor American taxpayers should be forced to spend hundreds of millions of dollars on this misguided effort,” said Rep. Joe Morelle (D-NY), who introduced the amendment.

A grassroots campaign, KeepTheShutle.org, has also raised objections to removing Discovery from the Smithsonian.

Perhaps the best thing the Smithsonian can do—if indeed it is NASA’s intention to take Discovery—is nothing at all, says O’Donnell.

“I would say the Smithsonian’s recourse is to keep the shuttle exactly where it is. It’s the federal government that has no recourse to take it,” O’Donnell said. “The space shuttle [Discovery] is the Smithsonian’s, and any law that suggests the intention to take it violates the Fifth Amendment on its face—the government cannot take private property.”

Robert Pearlman is a space historian, journalist and the founder and editor of collectSPACE, a daily news publication and online community focused on where space exploration intersects with pop culture. He is also a contributing writer for Space.com and co-author of “Space Stations: The Art, Science, and Reality of Working in Space” published by Smithsonian Books in 2018. He is on the leadership board for For All Moonkind and is a member of the American Astronautical Society’s history committee.

titan-sub-implosion-caused-by-absolutely-bonkers-“toxic-workplace-environment”

Titan sub implosion caused by absolutely bonkers “toxic workplace environment”

In a 300-plus page final report released today, the US Coast Guard analyzed the 2023 Titan sub implosion from every conceivable angle and came to a clear conclusion: OceanGate CEO Stockton Rush was a dangerous and deeply unpleasant boss.

His company used “intimidation tactics” to sidestep regulatory scrutiny, it was a “toxic” workplace, and its safety culture was “critically flawed.” The Titan itself was “undocumented, unregistered, non-certificated, [and] unclassed.” As for Rush, he managed to “completely ignore vital inspections, data analyses, and preventative maintenance procedures.” The result was a “catastrophic event” that occurred when 4,930 pounds per square inch of water pressure cracked the sub open and crushed its five occupants during a dive to the Titanic wreckage site.

Had Rush somehow survived, the report says, he would have been referred for prosecution.

OceanGate CEO Stockton Rush shows David Pogue the 2010-era game controller used to pilot the Titan sub during a CBS Sunday Morning segment broadcast in November 2022. Credit: CBS Sunday Morning

Throwing the controller

One small story about a video game controller shows what Rush was like to work for. You may remember Rush from an infamous 2022 CBS Sunday Morning segment in which he showed journalist David Pogue around the Titan sub. “We run the whole thing with this game controller,” Rush said, holding up a Logitech F710 controller with 3D-printed thumbstick extensions. Pogue chuckled, saying, “Come on!” as he covered his face with his hand.

The game controller had been used in OceanGate subs for years by that point; a 2014 video showed one being used to control the company’s earlier Cyclops I submersible. In 2016, OceanGate took the Cyclops I to dive the wreck of the Andrea Doria outside of Nantucket, Massachusetts. (Seinfeld fans will remember that an entire episode is taken up with George’s quest to get an apartment that was about to go to an Andrea Doria survivor.)

The OceanGate team spent two days at the site, running 2D and 3D scans of the sunken ship, until Rush got the Cyclops I “stuck under the bow of the Andrea Doria wreckage”—and he couldn’t get the sub free. According to the report, Rush then “experienced a ‘meltdown’ and refused to let [the assistant pilot] assist in resolving the situation. When a mission specialist suggested that Mr. Rush hand over the controller to the assistant pilot, the assistant pilot reported that the controller was thrown at him. Upon obtaining the controller, the assistant pilot was able to free the Cyclops I from the wreckage.”
