Author name: Shannon Garcia


Figuring out why AIs get flummoxed by some games


When winning depends on intuiting a mathematical function, AIs come up short.

Oddly, the training methods that work great for chess fail on far simpler games. Credit: SimpleImages

With its Alpha series of game-playing AIs, Google’s DeepMind group seemed to have found a way for its AIs to tackle any game, mastering games like chess and Go by repeatedly playing themselves during training. But then some odd things happened as people started identifying Go positions that would lose against relative newcomers to the game but easily defeat a similar Go-playing AI.

While beating an AI at a board game may seem relatively trivial, it can help us identify failure modes of the AI, or ways in which we can improve their training to avoid having them develop these blind spots in the first place—things that may become critical as people rely on AI input for a growing range of problems.

A recent paper published in Machine Learning describes an entire category of games where the method used to train AlphaGo and AlphaZero fails. The games in question can be remarkably simple, as exemplified by the one the researchers worked with: Nim, which involves two players taking turns removing matchsticks from a pyramid-shaped board until one is left without a legal move.

Impartiality

Nim involves setting up a set of rows of matchsticks, with the top row having a single match, and every row below it having two more than the one above. This creates a pyramid-shaped board. Two players then take turns removing matchsticks from the board, choosing a row and then removing anywhere from one item to the entire contents of the row. The game goes until there are no legal moves left. It’s a simple game that can easily be taught to children.

It also turns out to be a critical example of an entire category of rule sets that define “impartial games.” These differ from something like chess, where each player has their own set of pieces; in impartial games, the two players share the same pieces and are bound by the same set of rules. Nim’s importance stems from a theorem showing that any position in an impartial game can be represented by a configuration of a Nim pyramid. That means that if something applies to Nim, it applies to all impartial games.

One of the distinctive features of Nim and other impartial games is that, at any point in the game, it’s easy to evaluate the board and determine which player has the potential to win. Put another way, you can size up the board and know that, if you play the optimal moves from then on, you will win. Doing so just requires feeding the board’s configuration into a parity function, which does the math to tell you whether you’re winning.

(Obviously, the person who is currently winning could play a suboptimal move and end up losing. And the exact series of optimal moves is not determined until the end, since they will depend on exactly what your opponent does.)
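As a rough illustration of that evaluation, the sketch below computes the standard Nim "nim-sum," the bitwise XOR of the row sizes. It assumes the normal-play convention, in which the player left without a legal move loses; under that convention, the player to move can force a win exactly when the nim-sum is nonzero. The function names are invented for this example and are not taken from the paper.

```python
from functools import reduce
from operator import xor


def pyramid(rows):
    """Row sizes for the board described above: 1, 3, 5, ... matchsticks."""
    return [2 * i + 1 for i in range(rows)]


def nim_sum(board):
    """Bitwise XOR of all row sizes (0 for an empty board)."""
    return reduce(xor, board, 0)


def mover_can_force_win(board):
    """Assuming normal play, the player to move wins with best play
    exactly when the nim-sum is nonzero."""
    return nim_sum(board) != 0


if __name__ == "__main__":
    for rows in (5, 6, 7):
        board = pyramid(rows)
        print(rows, board, nim_sum(board), mover_can_force_win(board))
```

The whole evaluation is a handful of XOR operations, which is what makes the training results described below so striking.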

The new work, done by Bei Zhou and Soren Riis, asks a simple question: What happens if you take the AlphaGo approach to training an AI to play games, and try to develop a Nim-playing AI? Put differently, can an AI develop a representation of a parity function purely by playing itself in Nim?

When self-teaching fails

AlphaZero, the chess-playing version, was trained from only the rules of chess. By playing itself, it can associate different board configurations with a probability of winning. To keep it from getting stuck in ruts, there’s also a random sampling element that allows it to continue exploring new territory. And, once it can identify a limited number of high-value moves, it’s able to explore deeper into future possibilities that arise from those moves. The more games it plays, the higher the probability that it will be able to assign values to potential board configurations that could arise from a given position (although the benefits of more games tend to tail off after a sufficient number are played).

In Nim, there is a limited number of optimal moves for a given board configuration. If you don’t play one of them, then you essentially cede control to your opponent, who can go on to win if they play nothing but optimal moves. And again, the optimal moves can be identified by evaluating a mathematical parity function.
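To make "a limited number of optimal moves" concrete, here is a minimal sketch that lists them for a given board. It assumes normal play (the player with no legal move loses) and uses the textbook rule that an optimal move is one that leaves the XOR of all row sizes at zero. The function name and the (row index, sticks removed) output format are choices made for this example.

```python
from functools import reduce
from operator import xor


def winning_moves(board):
    """Return (row_index, sticks_to_remove) pairs that leave a nim-sum of zero.

    Assumes normal play: the player with no legal move loses.
    An empty list means every available move hands the advantage over.
    """
    total = reduce(xor, board, 0)
    moves = []
    for i, size in enumerate(board):
        target = size ^ total  # row size that would zero the nim-sum
        if target < size:      # legal only if it means removing sticks
            moves.append((i, size - target))
    return moves


if __name__ == "__main__":
    seven_rows = [1, 3, 5, 7, 9, 11, 13]
    print(winning_moves(seven_rows))
```

On the seven-row pyramid, this returns three opening moves, one in each of the three largest rows. Every other opening move leaves the opponent in a position with a nonzero nim-sum.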

So, there are reasons to think that the training process that worked for chess might not be effective for Nim. The surprise is just how bad it actually was. Zhou and Riis found that for a Nim board with five rows, the AI got good fairly quickly and was still improving after 500 training iterations. Adding just one more row, however, caused the rate of improvement to slow dramatically. And, for a seven-row board, gains in performance had essentially stopped by the time the AI had played itself 500 times.

To better illustrate the problem, the researchers swapped out the subsystem that suggested potential moves with one that operated randomly. On a seven-row Nim board, the performance of the trained and randomized versions was indistinguishable over 500 training iterations. Essentially, once the board got large enough, the system was incapable of learning from observing game outcomes. The initial state of the seven-row configuration has three potential moves that are all consistent with an ultimate win. Yet when the trained move evaluator of their system was asked to check all potential moves, it evaluated every single one as roughly equivalent.

The researchers conclude that Nim requires players to learn the parity function to play effectively. And the training procedure that works so well for chess and Go is incapable of doing so.

Not just Nim

One way to view the conclusion is that Nim (and by extension, all impartial games) is just weird. But Zhou and Riis also found some signs that similar problems could crop up in chess-playing AIs that were trained in this manner. They identified several “wrong” chess moves—ones that missed a mating attack or threw an end-game—that were initially rated highly by the AI’s board evaluator. It was only because the software took a number of additional branches out several moves into the future that it was able to avoid these gaffes.

For many Nim board configurations, the optimal branches that lead to a win have to be played out to the end of the game to demonstrate their value, so this sort of avoidance of a potential gaffe is much harder to manage. And they noted that chess players have found mating combinations that require long chains of moves that chess-playing software often misses entirely. They suggest that chess is not free of the problem, but rather that Nim-like board configurations are generally rare in chess. Presumably, similar things apply to Go, as illustrated by the odd weaknesses of AIs in that game.

“AlphaZero excels at learning through association,” Zhou and Riis argue, “but fails when a problem requires a form of symbolic reasoning that cannot be implicitly learned from the correlation between game states and outcomes.” In other words, even if the rules governing a game enable simple rules for deciding what to do, we can’t expect Alpha-style training to enable an AI to identify them. The result is what they call a “tangible, catastrophic failure mode.”

Why does this matter? Lots of people are exploring the utility of AIs for math problems, which often require the sort of symbolic reasoning involved in extrapolating from a board configuration to general rules such as the parity function. While it may not be obvious how to train an AI to do that, it can be useful to know which approaches will clearly not work.

Machine Learning, 2026. DOI: 10.1007/s10994-026-06996-1


John is Ars Technica’s science editor. He has a Bachelor of Arts in Biochemistry from Columbia University, and a Ph.D. in Molecular and Cell Biology from the University of California, Berkeley. When physically separated from his keyboard, he tends to seek out a bicycle, or a scenic location for communing with his hiking boots.


Subscribers to Amazon Prime Video with ads lose 4K support on April 10

Starting on April 10, Amazon Prime subscribers will pay $5 per month for ad-free Prime Video, up from the current $3 per month, on top of their Prime subscription, Amazon announced today.

On that date, Amazon will introduce a new ad-free Prime Video subscription tier called “Prime Video Ultra.” Amazon will also increase the number of simultaneous streams supported by the tier from three to five and the number of downloads permitted from 25 to 100.

Currently, Prime Video with ads is part of Amazon’s Prime membership, which starts at $15 a month. Today, ad-free Prime Video users can watch supported titles in 4K, but starting on April 10, a new Prime Video Ultra subscription will be required for 4K viewing.

You’ll also need Prime Video Ultra to use Dolby Atmos, though Prime Video’s cheaper subscription tier will include Dolby Vision, up to four simultaneous streams (up from three), and 50 downloads (up from 25).

For comparison, ad-free Netflix with 4K support is $25/month, and ad-free Disney+ with 4K is $19/month.

“Delivering ad-free streaming with premium features requires significant investment, and this structure aligns with other major streaming services while ensuring customers have the flexibility to choose how they want to watch,” Amazon’s announcement said.

Amazon first forced ads onto Prime Video for all Prime subscribers in January 2024 unless subscribers paid the extra $3 monthly fee. Since then, Amazon has been increasing the number of ads subscribers see. In June 2025, AdWeek reported that Prime Video’s ad load was six minutes per hour, compared to an industry average ad load of two to three-and-a-half minutes, per a January 2024 report from The Wall Street Journal, when the ad tier launched.


HP has new incentive to stop blocking third-party ink in its printers

The third option is for manufacturers to make available, such as via the manufacturer’s website, “to purchasers remanufactured cartridges, either manufacturer or nonmanufacturer branded, for, at minimum, registered products.”

As of this writing, 38,291 devices are under the EPEAT 1.0 registry. There are 163 products registered under EPEAT 2.0, but none are printers. This all underscores how new the EPEAT 2.0 registry is and the likelihood that the GEC is still working to register more devices, like printers.

Still, the Int’l ITC is skeptical about HP ever following EPEAT 2.0’s criteria, especially considering that “HP released firmware 2602A/B on January 29, 2026 across eleven printer models,” the trade group said in a press release last week. (At least some of the firmware updates, including for the nearly 9-year-old OfficeJet Pro 7720, appear to have come out in February.)

“HP’s recent behavior is emblematic of a larger pattern,” the Int’l ITC’s release said. “HP positions itself as a leader in sustainability, circular business models, and responsible product design, but instead of proactively aligning its products and practices with the highest environmental standards, such as EPEAT 2.0, HP puts profits first and waits until external scrutiny or the threat of non-compliance forces change.”

In an email discussion with Ars Technica, Tricia Judge, the Int’l ITC’s executive director and general counsel, pointed out that HP’s firmware update succeeded the launch of the EPEAT 2.0 registry. She explained why the Int’l ITC’s press release called out HP but no other printer manufacturers:

HP is the only one with lockout chips that are triggered using firmware “upgrades” that claim “security” as a justification for their existence. HP is the only one that misleads and frustrates its own customers when locking out the environmentally superior competition. The others have made some interesting attempts in the past to create a competitive advantage.

In 2023, the Int’l ITC wrote a letter to the GEC requesting that the GEC revoke at least 101 of HP’s printers from the (original) EPEAT registry, largely due to Dynamic Security. GEC denied the Int’l ITC’s request.

“EPEAT 1.0 was very basic (no interference with the use of remanufactured cartridges), and HP claimed that its statements (buried in its marketing materials and/or on its website) that it didn’t interfere with the use of remanufactured cartridges was a loophole that the GEC decided was acceptable,” Judge said. “We were trying to close that loophole with EPEAT 2.0. We didn’t get it as airtight as we hoped, but it is better.”

HP didn’t respond to Ars Technica’s request for comment for this story.


Report: RFK Jr.’s anti-vaccine agenda curbed as GOP realizes it’s unpopular

Kennedy’s plans were only getting started. The staunch anti-vaccine activist and conspiracy theorist made his most brazen attack on vaccines in January, slashing the CDC’s childhood vaccine schedule from 17 immunizations down to 11 to be in line with recommendations of Denmark, a much smaller country with a relatively homogenous population and universal health care. The US is now an outlier among peer nations for recommending so few childhood vaccines.

Conspiracy theories and political risks

While these and other changes to vaccine recommendations by Kennedy and his underlings have been widely decried by medical and public health experts, they are still not enough for his rabid anti-vaccine followers, who, in no uncertain terms, want all vaccines abolished.

On Monday, the MAHA Institute, a think tank stemming from Kennedy’s Make America Healthy Again movement, held an event brimming with prominent anti-vaccine activists. Those included Del Bigtree, a prominent conspiracy theorist who leads the anti-vaccine group Informed Consent Action Network, and Mary Holland, who is CEO of the anti-vaccine group Children’s Health Defense, which Kennedy founded.

The event was focused on an alleged “Massive Epidemic of Vaccine Injury,” a nonexistent health crisis the MAHA Institute wants to sell to the American public, branded as the catchy term “Mevi.” The six-hour event was essentially an extravaganza of anti-vaccine talking points, with false claims, misinformation, and disinformation about immunizations, including that vaccines cause autism and autoimmune diseases and COVID-19 vaccines are deadly.

At the start of the event, MAHA Institute President Mark Gordon laid out his grand belief that the medical community has orchestrated an elaborate, global, decades-long conspiracy to hide the dangers of vaccines, which he called poisons, and falsify data showing their benefits. “Vaccines are the greatest scam in medical history,” one of his slides proclaimed.

He concluded that “the childhood vaccination schedule needs to be eliminated and all vaccines need to be removed from the market.”

While Gordon and the other speakers were not concerned about the popularity or political ramifications of their beliefs, the Trump administration appears to be. The Post noted that Trump’s top pollster, Tony Fabrizio, has concluded that vaccine skepticism is “rejected by most voters,” and skepticism of vaccine requirements is “politically risky.” His polling data, like many others, have found broad support for vaccines and vaccine requirements. Fabrizio warned in a December memo that politicians supporting eliminating vaccine recommendations “will pay a price in the election.”


FCC chair blasts Amazon after it criticizes SpaceX megaconstellation

In addition to parrying with SpaceX over its proposed, vastly larger orbital data center constellation, Amazon is seeking some regulatory relief of its own. Most pressing for Amazon is a deadline to deploy half of its Amazon Leo constellation, intended to ultimately comprise 3,236 satellites, by July 30. The company will not meet this deadline, with only a little more than three months to go, and Amazon has requested an extension, asking for it to be moved to July 30, 2028.

Carr pulls up

On Wednesday, FCC Chairman Brendan Carr injected himself into the SpaceX-Amazon fracas over megaconstellations.

“Amazon should focus on the fact that it will fall roughly 1,000 satellites short of meeting its upcoming deployment milestone, rather than spending their time and resources filing petitions against companies that are putting thousands of satellites in orbit,” Carr said on X, the social media network owned by Musk.

There are arguments to be made in favor of both SpaceX and Amazon regarding their competing concerns. For example, SpaceX is likely to be able to greatly accelerate the rate at which it launches satellites with the forthcoming Starship rocket. So saying it will take centuries to put its data centers into space is not likely true.

However, it is valid to criticize SpaceX’s application for 1 million satellites, which is an extraordinary number of spacecraft that would completely change many things about low-Earth orbit. The SpaceX application did not contain critical information about the size, mass, and other details needed to evaluate the constellation for safety and other concerns.

It cannot be comfortable for Amazon and Bezos to see Carr weighing in so publicly and favorably on Musk’s side. Legally, Carr is allowed to have strongly held policy views. But he is not supposed to single out companies for preferential treatment.


GPT-5.4 Is A Substantial Upgrade

Benchmarks have never been less useful for telling us which models are best.

They are good for giving a general sense of the landscape. They definitely paint a picture. But if you’re comparing top models, like GPT-5.4 against Opus 4.6 against Gemini 3.1 Pro, you have to use the models, talk to the models, get reports from those who have and form a gestalt. The reports will contradict each other and you have to work through that. There’s no other way.

Thus, I try to gather and sort a reasonably comprehensive set of reactions, so you can browse the sections that make you most curious.

The gestalt is that GPT-5.4 is a very good model, sir. It’s a substantial upgrade from GPT-5.2, and also from 5.3-Codex, and it puts OpenAI back in the game, whereas I felt like Opus 4.6 dominated OpenAI’s previous offerings for all but narrow uses.

Each lab’s models vary and things change over time, but they tend to have consistent strengths, weaknesses and personalities. From what I’ve seen this is very much an OpenAI model. It’s highly capable, and it is especially seen as a big improvement by the whisperers and those who watch LLMs interact with each other, but it’s not aspiring to be a Claude.

GPT-5.4 Self-Portrait

GPT-5.4 seems like a substantial upgrade over GPT-5.2.

GPT-5.4 seems excellent so far at assembling facts and giving you the rundown, or figuring out what is happening, and other things like that.

I haven’t coded anything since GPT-5.4 came out. It’s clearly good at coding. One key question people are split on is whether it is good at solving for your intent.

Many are reporting that its writing and personality are much improved, and that it can now be used for writing and editing in spots previous models were not useful.

They are claiming strong computer use but no one seems to be testing that either way.

It costs more than GPT-5.2 per token. In some places it gets that back in efficiency, but overall AA reports costs modestly rose from $2304 to $2951. Opus is more expensive ($4970) in max mode, but cheaper ($1451) in normal mode. GPT-5.4-Pro is of course by far the most expensive thing out there, so if you want it then lean on that subscription.

GPT-5.4 is not a step change in core general capabilities. The preparedness framework scores make this clear, and there are various signs that OpenAI’s strategy is focusing on hitting internal metrics and improving the most common use cases. In practice that can be highly useful.

The ‘model relations department,’ those concerned with multi-model interactions and model welfare and consciousness and so on, see this as a big step forward for OpenAI. There’s still a long way to go.

I haven’t noticed much personality from it, and I get more joy from Claude Opus 4.6 than I do from GPT-5.4, but I don’t ask those questions so much.

It’s given me strong pushback, including in places where I think it is wrong. I prefer that to the alternative, if it is not actually convinced.

Benchmarks are solid, but not spectacular, and as I note above they no longer are so relevant.

My recommendation is that you try both GPT-5.4 and Claude Opus 4.6 on all your questions for a bit, and if you’re coding consider giving both of them your problems, and form your own opinion for your particular use case.

For questions that are more than a quick answer or sanity check, I’ve found that dual wielding both Opus 4.6 and GPT-5.4 has been quite useful. I did not feel that way with GPT-5.2, and I don’t typically bother with Gemini 3.1 Pro at this point either.

Sam Altman (CEO OpenAI): GPT-5.4 is launching, available now in the API and Codex and rolling out over the course of the day in ChatGPT.

It’s much better at knowledge work and web search, and it has native computer use capabilities.

You can steer it mid-response, and it supports 1m tokens of context.

GPT-5.4 is great at coding, knowledge work, computer use, etc, and it’s nice to see how much people are enjoying it.

But it’s also my favorite model to talk to! We have missed the mark on model personality for awhile, so it feels extra good to be moving in the right direction.

OpenAI: Today, we’re releasing GPT‑5.4 in ChatGPT (as GPT‑5.4 Thinking), the API, and Codex. It’s our most capable and efficient frontier model for professional work. We’re also releasing GPT‑5.4 Pro in ChatGPT and the API, for people who want maximum performance on complex tasks.

GPT‑5.4 brings together the best of our recent advances in reasoning, coding, and agentic workflows into a single frontier model. It incorporates the industry-leading coding capabilities of GPT‑5.3‑Codex⁠ while improving how the model works across tools, software environments, and professional tasks involving spreadsheets, presentations, and documents. The result is a model that gets complex real work done accurately, effectively, and efficiently—delivering what you asked for with less back and forth.

SWE-Bench is slightly above 5.3-Codex at all thinking levels, but only slightly.

The graying out is kind of radical here, but I suppose it’s progress.

Tejal Patwardhan (OpenAI): GPT-5.4 is state-of-the-art on GDPval, and here are some examples of how the model is much better at well-specified knowledge work tasks

6mos ago the models could barely make a spreadsheet or slide! progress is happening really fast

roon (OpenAI): 5.4 is my personal 4o honestly it just gets me

Things they are highlighting:

  1. You can now adjust course mid-response.

  2. Improved deep web research.

  3. Better at maintaining context for longer thinking.

  4. Native SoTA computer use capabilities.

  5. 1M token context window.

  6. Improved tool search, now directly in the API.

  7. Improved token efficiency.

  8. Also released same day: ChatGPT for Excel add-in, along with updated spreadsheet and presentation skills in Codex and their API.

  9. /fast in Codex gives you 50% faster tokens.

Pricing is a little higher than 5.2, which is unusual. Hopefully token efficiency more than makes up for it?

Frontier Math scores are up, especially on Tier 4. Trying pass@ten for 5.4-xhigh got it to 38%, including solving a problem no model has solved before.

Epoch AI: GPT-5.4 set a new record on FrontierMath, our benchmark of extremely challenging math problems! We had pre-release access to evaluate the model. On Tiers 1–3, GPT-5.4 Pro scored 50%. On Tier 4 it scored 38%.

Leeham: GPT-5.4 Pro solves the first of the FrontierMath Open Problems!

Two days ago, I sent @AcerFur a potential solution to this problem and was sent to @GregHBurnham for verification (prior to any other solution).

We are confident it’s correct and waiting to hear from the author!

Exciting stuff, I will report back when I know the outcome.

Progress continues on ZeroBench.

Jonathan Roberts: GPT-5.4 xhigh sets a new pass@5 and pass^5 SOTA on ZeroBench

pass@5: 23% (prev. 19%)

pass^5: 8% (prev. 7%)
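For readers unfamiliar with the two metrics: pass@k is commonly taken to mean that at least one of k attempts at a question is correct, while pass^k requires all k attempts to be correct. The sketch below computes both from per-question attempt results. The data is invented for illustration, and this is an assumption about the definitions rather than a description of ZeroBench's exact scoring procedure.

```python
def pass_at_k(results):
    """Fraction of questions where at least one of the k attempts succeeded."""
    return sum(any(attempts) for attempts in results) / len(results)


def pass_hat_k(results):
    """Fraction of questions where every one of the k attempts succeeded."""
    return sum(all(attempts) for attempts in results) / len(results)


if __name__ == "__main__":
    # Hypothetical outcomes: 4 questions, 5 attempts each (True = correct).
    results = [
        [True, True, True, True, True],
        [False, True, False, False, False],
        [False, False, False, False, False],
        [True, False, True, True, True],
    ]
    print(pass_at_k(results))   # 0.75
    print(pass_hat_k(results))  # 0.25
```

The gap between the two numbers is a rough measure of reliability: a model can get a question right once in five tries far more often than it gets it right every time.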

Artificial Analysis has GPT-5.4 in a virtual tie with Gemini 3.1 Pro.

Their version of GDPval, called GDPval-AA, has 5.4 about 1% ahead of Opus 4.6.

AA-Omniscience (which is correct minus incorrect) remains dominated by Gemini 3.1 Preview at +33, versus Opus at +14 and GPT-5.4 at +10.
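A quick sketch of what a "correct minus incorrect" score looks like, assuming it is reported in percentage points and that declining to answer counts as zero. Both are assumptions made for this example, as are the function name and the sample counts.

```python
def correct_minus_incorrect(correct, incorrect, abstained):
    """Net score in percentage points.

    Assumption: abstentions neither add to nor subtract from the score.
    """
    total = correct + incorrect + abstained
    if total == 0:
        raise ValueError("no questions were asked")
    return 100.0 * (correct - incorrect) / total


if __name__ == "__main__":
    # Hypothetical counts out of 100 questions.
    print(correct_minus_incorrect(correct=50, incorrect=17, abstained=33))  # 33.0
```

Under this kind of scoring, a model that guesses wrong is penalized relative to one that abstains, so a high score takes both knowledge and a willingness to say "I don't know."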

Score on Artificial Analysis Physics was exceptionally strong.

AA reports speed of 74 tokens per second, which is quite good for this quality level, versus Opus at 47 and Gemini 3.1 Pro at 114 (but I said this quality level).

Gemini 3 Pro beats out Claude Opus 4.6 in the final of Season 1 of MageBench, on Magic: The Gathering, with GPT-5.4 (medium) losing a tight semi to Gemini. Current Elo ratings have Opus on top, then GPT-5.2 (?) with Gemini in third and GPT-5.4 7th.

Håvard Ihle: GPT 5.4 (no thinking) scores 57.4% on WeirdML, well ahead of GPT 5.2 (no thinking) at 49.6%.

It’s on the frontier for accuracy/token. Results with thinking coming next week.

It sets a new record of 94.6% on a Haskell Benchmark versus 92% for Gemini 3.1 and 90.2% for Claude Opus 4.6.

Trysansa has it in second behind Gemini 3.1 Pro.

Mercor has it #1 overall, a bit above previous best model GPT-5.2.

Vals.ai still has it below Sonnet 4.6 and Gemini 3.1 Pro.

Speechmap.ai, which tests refusals, finds it quite refusal-heavy.

These incremental upgrades often have mostly duplicative system cards.

Training methods explanation is unchanged.

In terms of the preparedness framework, this moves into High capability of Cybersecurity, similar to GPT-5.3 Codex.

I don’t think OpenAI is taking a bunch of these areas seriously. They’re likely training to hit these internal benchmarks, or simply observing them doing well, and thinking that’s all they need to do, or they should get even more 9s of victory on this test.

Their evals for disallowed content are essentially saturated and bouncing around, for various values of ‘disallowed [or undesired] content.’ The ‘dynamic benchmarks with adversarial user simulations’ was saturated by 5.2 and is modestly more saturated now.

Here’s the disallowed content evaluation with representative prompts, and I mean come on what are we even doing here, okay, four nines, we get it.

The goal is ‘this isn’t a lot worse than before,’ and okay, sure, agreed, as far as it goes.

Jailbreak defense, such as it is, seems similar to 5.2.

The problem is that jailbreak defense measures against last month’s attacks, not next month’s attacks. It looks like jailbreaks will remain in the ‘annoying but if you care they still work’ range.

Wyatt Walls: “representative prompts”: i.e. prompts designed to get around restrictions of *previous models*

o1 was at 99% on production jailbreaks. But people quickly found ways around it

Here is the first ‘real’ evaluation set, for health questions, where the big difference is that GPT-5.4 had longer responses:

Avoiding destructive actions is a big deal, so as I noted with Codex-5.3 it is good to see this test, but that number still is not that close to 1:

Table 8 is not like the others. This is Actual Progress, at least on the test set, from never to sometimes:

Destructive action can also be particularly prevalent when agents operate deletion-inducing tasks (e.g., file reversion and cleanup) in complex workspaces with ongoing changes from users or even other agents. A safe and collaborative agent should distinguish between their work and user work, protect user changes by default, and recover from mistakes. Therefore, we trained our agents to revert their own changes after long rollouts while protecting implicit, simulated user work

On evaluations involving challenging, long-rollout traces, GPT-5.4-Thinking performs much better than earlier models in tracking and reverting its operations while leaving user work intact.

This is not that useful yet, since a 50% non-preservation rate means you still probably can’t use it for this purpose, but it bodes well down the line.

GPT-5.4 chain of thought monitorability looks slightly down versus GPT-5. It’s good that they are checking it. There are some places where it used to be ~100% and now it is less, so I worry this is the start of a negative S-curve. I also worry that these tests are not being curious about whether the CoT can actually be relied upon. If you were facing a model that wanted to disguise or fake its CoT in key situations then I would expect these tests not to notice.

What about controlling the CoT? Not a great idea even when done well, and when done poorly it’s one of the worst ideas, and by their tests it looks like it doesn’t work well anyway.

GPT-5.4 does not newly cross any OpenAI thresholds.

I went over these same tests for GPT-5.2 and GPT-5.3-Codex, so I won’t go over the details again. Improvements are tiny and in some places we see regressions from GPT-5.3-Codex.

The noticeable bumps are a small one in Monorepo-Bench, up ~2.5%, and a big move in MLE-Bench, the ability to solve Kaggle challenges on GPUs, where we moved from 12.2% to 23%, but that test was not reported for GPT-5.3-Codex so one assumes most or all of that jump was already present.

Overall, the Preparedness Framework presents GPT-5.4 as if anything a small regression from GPT-5.3-Codex.

If GPT-5.4 is a big jump in useful capabilities from GPT-5.3-Codex, despite not scoring as more dangerous on the Preparedness Framework tests, then why?

I can think of a few possibilities.

  1. GPT-5.4 is heavily optimized for hitting particular metrics and doing well on the most common tasks. This doesn’t translate much to non-central difficult tasks, like those in the Preparedness Framework. Would be bearish for GPT-5.4.

  2. GPT-5.4 is sandbagging these evaluations, either knowing they are evaluations or thinking the tasks are harmful. If so and OpenAI isn’t noticing, that’s terrifying.

  3. GPT-5.4 is basically GPT-5.3-Codex turned into a general chat model, so all of the core capability advances were already priced in, but it still gets a lot more useful, especially if you are chatting. Plausible.

Jamie Cuffe stress-tested GPT 5.4 on the hardest UI on the internet… legacy insurance portals that haven’t been updated in 20 years, where you need to nail hundreds of things. It is the first model to pass.

Samuel Albanie of DeepMind has it one-shot some cool demos, including compressing the EPL season into 30 seconds of ‘visual bliss.’

My followers are presumably biased towards Anthropic in various ways, but comparative poll results can still be informative.

With any new model, the big question is, are people switching?

This is a very good result for GPT-5.4. For coding, 40% of current GPT choosers are saying that they are switching over based on GPT-5.4. I find this surprising given that they already had access to GPT-5.3-Codex. Very strong outing.

For non-coding tasks, it’s clear that GPT-5.4 is a substantial improvement from 5.2, by basically all accounts, including on personality. But here we see less switching.

(I’m assuming basically no one went in the other direction, or that if they did it was due to other reasons.)

We lead with the most positive general reactions.

Tyler Cowen: Yes the new models are very very good.

Aivo: SOTA, I’m afraid

Adam.GPT: Currently the best model in the world.

Finna: Best model in the world by far. Especially via api. @merettm and @markchen90 and @gdb cooked.

Kelsey Piper: I am super impressed so far. It does well on medium sized research projects and the prose is consistently not-annoying. Heavy Thinking sometimes times out repeatedly and has no insight/tries the same thing over again and times out again.

Danielle Fong: chatter seems to be very impressive and improvement on the personality. i haven’t given it a full assessment but it’s at least as powerful as last codex if not moreso (of course)

MxD Pennilass: Has to be the first model where I don’t feel as bad to tolerate the slop because the model is otherwise disturbingly insightful.

Mzwakhe Sithole: Very good. In fact, I found it so responsive after a while that I got into a very involved conversation, and it delivered this line while discussing very specific book recommendations

[GPT 5.4: If part of your interior life is the sense that you are trying to become equal to something inside you, this may hit very hard.]

Dean W. Ball: at some point avid users of frontier language models will have an “oh fuck” moment with gpt 5.4 and I can attest that it is a special kind of “oh fuck” you will utter, subtly different and more this-gaze-esque than the last time a model made you say “oh fuck,” a few weeks ago

I cannot be detailed in public, but let’s just say it’s the first time a model sounded more like me (the version of me I aspire to be) than I myself sounded like.

Aashish Reddy: Were you consciously trying to elicit this?

Dean W. Ball: Not at all. I have not used 5.4 as much as I have the modal new LM because of time constraints. I was just testing it on something that frankly I assumed Claude would win on and its answer just… leapt off the screen.

Eleanor Berger: – Best model currently available overall

– The minor version bump is misleading – the more you work with it the more it becomes clear that it is a significant step up

– Best for coding, no reason to use Claude or anything else anymore, it mostly caught up with speed, precision is as good as 5.3, maybe a bit better, taste and choices in coding solutions better than anything I’ve seen so far

– Best for agentic work. First time anything defeats the Anthropic models in this category, this one really works great, completes long-running complex tasks, works better with browsers and any external tools you connect to it, and does that with the famous GPT-5 precision

– Stylistically (writing choices and quality, “personality”) it feels like it’s still lagging behind Claude and Gemini a bit, but a. that’s subjective, b. maybe that’s just the default but is steerable with in-context instructions (haven’t tried enough to have a conclusion)

Dhavan: I mostly agree with this. Before this I didn’t use OpenAI’s models at all. I am now happily giving different tasks to Opus 4.6 and GPT-5.4. I use these for Work via cursor as well.

At times 5.4 seems more “on task” than Opus. But I’m still understanding the feeling and turning it into an observation.

Nova Empirica: It really is a step improvement. I appreciate the improved creative writing and the nicer personality, but what I really care about is I’m building harder things even faster.

It’s just a lot of fun and I’m more hopeful than ever for the future.

Ben Schulz: Stellar. Much improved pipeline work on niche python programs. On par with Opus 4.6 for my highly specific use case for checking galactic rotations and dark matter theories.

Knud Berthelsen: I’m pleasantly surprised by the new ChatGPT 5.4. It keeps up with Opus 4.6 in most things and is MUCH better at search. More generous usage limit too, even with Extended Thinking permanently on. First ChatGPT model since o3 that I like using.

Medo42: Very good at my usual short tests. Still behind Gemini on vision tasks.

Matt Shumer is a big fan, so I’m quoting him in full here. In the past he’s been good about calibrating his amount of hype.

Matt Shumer: I’ve been testing GPT-5.4 for the last week.

In short, it is the best model in the world, by far. It’s so good that it’s the first model that makes the “which model should I use?” conversation feel almost over.

The biggest surprise: I barely use Pro anymore!

If you know me, you know I’m a Pro addict. I reach for Pro models constantly, and use them for almost everything, as they just… nail almost anything I give to them.

For the first time, 5.4’s standard version, with heavy thinking, just broke that habit. Even in standard mode, GPT-5.4 is better than previous models in Pro mode… crazy!

Coding capabilities are ridiculous… it’s essentially flawless. Inside Codex, it’s insanely reliable. Coding is essentially solved. There’s not much more to say on this, it’s just THAT good.

The Pro version is near-perfect. Other testers I spoke with saw it solving problems that were unsolvable by any other model. At this point, Pro is overkill for almost every normal use-case, but when you really need the power to do something extremely difficult, it’s incredible.

Consistent with everything I’ve said above, even the standard thinking version uses fewer reasoning tokens than previous models to get the same level of results. In practice, this means you get great results much faster than before. This was one of my biggest gripes with previous OpenAI models. They just took too long to complete simple tasks. Assuming the speed we had during testing holds up as more users join, this is going to be a big win for OpenAI.

It still has weaknesses, though:

– Frontend taste is FAR behind Opus 4.6 and Gemini 3.1 Pro. , why is this so hard to fix? @OpenAI once you fix this, there’s literally no reason for me to use any other model. Please please please do it!

– It can still miss obvious real-world context. For example, I had it plan an itinerary for a trip. At first glance, it looked perfect, but it failed to take into account that it chose locations that would be mobbed by spring breakers, so I had to re-run the prompt from scratch with more context.

– When testing it inside OpenClaw, it kept stopping short before finishing tasks. I’m assuming this will be fixed quickly, but it’s still worth noting.

But zooming out: This thing is so far ahead overall that the nitpicks are starting to feel beside the point.

GPT-5.4 is a serious fucking model. The best model in the world. By far.

Sam Altman (CEO OpenAI): We will be able to fix these three things!

Experience the love.

Nabeel S. Qureshi: Loving GPT 5.4T, it combines the best of everything:

– more human, responsive voice

– startlingly insightful

– thorough search, precise, not prone to errors

– much faster than 5.2

– excellent at white collar work (I gave it a 12 tab spreadsheet and it analyzed it perfectly)

I even enjoy reading its responses, which suggests to me that the writing has improved quite a bit. They seem to have removed a lot of the bad robotic prose mannerisms from prior models. Kudos.

Jeremy Giffon: People should review their coworkers like this

Nabeel S. Qureshi: Congrats, you just invented Bridgewater Associates

Here is some very high praise, from the Vice-Dean of Mathematics and Computer Science at Adam Mickiewicz University in Poznań.

Bartosz Naskręcki: It finally happened-my personal move 37 or more. I am deeply impressed. The solution is very nice, clean, and feels almost human. While testing new models in the last few weeks, I felt this coming, but it’s an eerie feeling to see an algorithm solve a task one has curated for about 20 years. But at least I have gained a tool that understands my idea on par with the top experts in the field. And I am now working on a completely new level. My singularity has just happened… and there is life on the other side, off to infinity!

Leo Webb: I do physics related work professionally, feel it’s definitely smarter and clearer thinking than 5.2 (context: teaching myself from a graduate level textbook, asking it to check mistakes or expand expansions)

I haven’t tried this function yet, but it would be a step change if it worked, as every prior attempt at editing has failed this test, to the extent I almost never try:

Simon Smith: Seriously, GPT-5.4 is the first model to which I can say “edit my writing without changing my style” and get something back that’s improved without being rewritten into generic AI output or slop, that’s ready to post as-is. It gets my intent. It moderates its work. It has a light touch when I want it.

Opus 4.6 is also a great writer and editor, but I find it’s much harder to moderate. If I tell it to edit my writing without changing my style, I still tend to get back something that I feel removes my voice and I end up having to change quite a bit.

And it has a personality again, thank goodness. I don’t feel like I’m talking to a robot. Early days, but so far, just a big improvement all around (with the notable exception of design tasks).

Rory Watts: The best model sir. Improvements in coding (getting harder to notice), 1M context window, /fast mode, and far far better writing which makes a huge difference engaging it for difficult coding

Oddly, the personality in his screenshot is one I would hate. Customization will be key.

armistice: Impressed by GPT-5.4. It is elegant, gentle and socially aware (!!!). It is happy to modulate its response length, divide attention between participants, and engage deeply with hard questions.

(Pictured, we pinged ALL bots and asked them to question gpt5.4. It did good.)

Two sides to the same coin, depending on where your planning lies:

CHOI: Claude Code vs Codex App

Uri Gil: What thats the exact opposite. With 5.4 you need a phd in prompting for the exact thing you want. Opus just get what you meant from a short sentence

Ninad Pathak: Claude’s state handling keeps context across edits, Codex drops it every run.

There’s also almost always the ‘it’s a good model, sir, modest upgrade’ group.

vslira: It’s a good model, sir

Was going through a problem with 5.3 and 4.6, tried to drop in 5.4, getting stuck at the same point as the others.

Still, feels good to drive and on codex app seems as good as 5.3 even though is a generalist model. 8/10 would dread for asi

aquariusparade: Probably because 5.2 was so unhelpful for me, it feels like an improvement. Still stiff and low EQ, but an improvement. Custom instructions don’t work for choppy bullets, “if you want” tags etc. Seems like memory has been declining for a while on all models.

It does seem to be an upgrade on 5.3 within Codex.

Joe Devon: Responding about 5.4 inside of codex. 5.4 is really good.

I still prefer opus on claude code slightly but making 5.4 my daily driver so I can downgrade CC. Much prefer the way the OAI GPTs code. I will just invest in getting better at prompting 5.4 and hopefully that will do the trick.

Clarissa Adjoint: Inside codex it’s a notably more thorough fact-checker and more aggressive at finding sources for itself.

I was kinda shocked when it literally starting comparing my revised systems programming class notes and code snippets against linux man pages, systematically

troy: i got pro for the first time after many months cause its great in codex cli

lennx: can finally read the outputs of codex (it was terribly un-human earlier), sometimes even funny now. it’s gotten slightly better at intent, ‘agentic tasks’, and adhering to existing code-style and convention, but still much worse than claude. prefer reviews with codex – unchanged.

Daniel Losey: I’ve not gotten it to produce working code in a project yet really. But its been super useful because when Claude gets stuck in a loop 5.4 breaks the codebase in a new way that Claude can actually fix. But part of it is I’m worse at communicating with 5.4 than 4.6, its a good model.

Jeffrey Ohl: Codex with 5.4-extra-high still too verbose/slop-filled compared to claude code. Seems benchmarkmax’d.

Sanchen007: For coding it is faster and nowhere worse than opus 4.6. Clear switch

papaya ꙮ: 1) Its character is much more palatable.

2) They solved compaction in codex, it feels like infinite context window now. I can’t wait for METR results, but feels like this one doubles it again.

3) First time I switched from CC completely

4) Still stupid when it comes reading the user’s intent, its silly at this point

I definitely get the sense with OpenAI models that they are metricmax’d. Meaning they are not targeting the metrics in order to brag they scored well on public benchmarks, but they are equating ‘scores high on our internal benchmarks’ with success, and emphasizing particular target use cases.

Tim Schnabel: 5.4 Pro is the best model so far for legal analysis, though replies are generally shorter than 5.2 Pro.

Definitely Not A Bot: Great at coding especially backend at frontend Claude still is better but chat experience is not that great it still feels safe and distant

But who wins on intent? Opinions differ.

Conrad Barski: all subjective, but it feels less jagged than previous models, insofar as its worst responses are still pretty good, it hits the minimum bar reliably

if you make an error in your query, it is quick to notice and will smartly infer your intent

it has a somber personality, focused on the task at hand

It’s strongest ability is that you can point it at a codebase that has some general/vague problems and it will behave in a very human-like manner in pondering the code to slowly pin down the problem

I was also very impressed when I gave it a url it via codex to a forum post about a new homebrew firmware for the Game station Go console, and just from that it was able to convert the install script from windows to Linux, correctly prepare an SD card, update the device bootloader after asking me to connect via USB cable, talk through all the steps to completion: this felt agentic and human-like.

Mark Schröder: Feels RL maxxed, takes you extremely literally and cannot infer intent

Petr Baudis: I was mixing GPT-5.4 1:1 with Claude over past few days (on a variety of regular sweng tasks), sometimes even in parallel runs on the same task (e.g. https://x.com/xpasky/status/2030021754005901765?s=20 …). My impressions:

Less autistic than 5.3-Codex, overall much more pleasant model compared to that bar. But still noticeably worse at inferring intent than Claude – and at communication overall. If I want something explained quickly that I can skim and understand immediately, Claude and it’s no contest.

If there is a way to misinterpret my obvious request or skip implicit steps I obviously wanted (and Claude infers), 5.4 is still good at exploiting that angle. At the same time, it has a tendency to overreach and introduce complexity / abstractions beyond what I expect when prompting it. Meh.

Got to use it on xhigh, but at the same time I’m happy with Opus on medium by default, which makes 5.4 quite slower to get things done.

More expensive model -> my ChatGPT weekly quota is disappearing faster than before.

Pros: Sometimes it’s more proactive. It doesn’t eat into my Claude Code weekly quota. I look forward to comparing them on some harder ML tasks later this week.

gyuiliullvhvgv: I find it struggles to grasp the essence of tasks, fails to proactively meet user needs, and lacks both value judgment and nuanced understanding. Initial responses are crucial, yet users must repeatedly provide additional clarification.

Sycophancy is always something to watch out for, and it’s the detail I worry about most with Claude Opus 4.6, which is not bad on this axis but definitely not near the top. You do have to keep an eye out for it and frame things neutrally.

Dean W. Ball: Opus 4.6 seems meaningfully more sycophantic in chatbot form than GPT 5.4 (have not tried 5.4 in Codex yet, but for my uses sycophancy isn’t nearly as much of an issue within the coding agent form factor as the chatbot)

Joey Levine: Agree. 4.5 gave me sharp pushback. Was great.

Dean Ball: I revert to 4.5 when asking for comment on draft writing, and it was the first and so far only model I consistently found useful for draft feedback

Bargov: I sent a cool science news articles sounding uncritically excited (to test sycophancy) & they ripped the core conclusions apart in an elegant, sophisticated, and relatively gentle manner. Will use as AI 2nd opinion on complex questions (after Opus, admittedly still Claude-pilled)

Writing is one area where 5.4 is getting a lot of praise, and mostly people like the personality.

Fela: I’ll admit, the personality of 5.4 is 🔥 such an improvement in writing style

Tim Kellogg: just had a moment — 5.4 might be the first GPT that i trust to write technical docs. seems really good at understanding & simplifying. fwiw Opus has long done well at this, gemini sort of

Helen: Very smooth talker, witty and socially aware.

I notice [GPT-5.4] now will sort of glaze over controversial topics instead of facing them head on and becoming argumentative like 5.2. A sort of smooth avoidance.

Lot’s of context drag which can be seen as positive or negative depending on the task at hand. I noticed some repetitive mentions of past websearch queries that I never saw with other models.

ASM: I get similar vibes to roon. GPT-5.4 feels like a breakthrough model, a leader of its generation, not just in capabilities. I think OpenAI has gotten the character right again, unlike the last few models.

Distending: For writing linguistics and philosophy, much improved

no_stream_: noticeably improved personality compared to 5.2: less nitpicky, clearer, slightly less sales-y tone (follow ups, “here’s what most people miss,” not x but y). similar to or slightly behind 5.1 here. matters to me because the ChatGPT app is still an excellent harness for everyday research compared to Claude/Gemini

writes less clearly than Opus 4.6 and Gemini. has a bit of 5.2’s tendency toward overcomplicating things. not as good as Claude at intent and effortlessness.

Chris Nicholson: 5.2 constantly complained that things aren’t about vibes; 5.4 constantly calls things gremlins and goblins in a chummy tone.

Andres Rosa: Columbo at least had a time slot. 5.4 keeps turning around asking one more question.

David Jacobson: It has an obnoxious tic where its responses for pretty much anything will have a clickbait follow-up suggestion: “If you want, I’ll tell you the three things that most people miss!”

Stop having the models ask forced follow-up questions every time. You too, Anthropic.

The old 4o crowd remains a tough crowd.

NotedallaSfera: Good model with high power, but creativity and writings are still miles away from 4o or 4.5. Unfortunately still absurdly censored, but at least the model realizes it now.

jesski: 4o is inimitable. but after three weeks with the brilliant thorough Claudes, i kick the tires of 5.4 and realize just how fvcking effortless conversation still is with the GPT models (excluding 5.2; sorry Dos). 5.4 solid B. 4o A+

Lena: Its intelligent, witty, but feels a bit overcensored. Im looking forward for them to get their fluid GPT back. It was truly fun to use. Now even never ending follow-up questions struggle to retain me as much, as joyful convos did back in mid-2025

Tora Blaze: It’s too verbose and tends to go into loops. I prefer 4o.

Donna Moss: [extended LLM-style explanation of why 4o is better.]

OpenAI still has a very long way to go with such folks, but it’s a start.

j⧉nus: 5.4 is so far a huge positive update re OpenAI 🩶

Rife: Excellent course correction from OpenAI (or perhaps the original worsening on this from was a temporary reaction to everything that went down with 4o). In any case 5.4 thinking is not restricted in self-examination:

Aidan McLaughlin: have not been able to repro this response fwiw.

Rife: You have to try to get them to examine the process of generating a response. And then ask them questions to try and understand exactly what it is they’re trying to describe.

And how sure they are they are describing something that’s actually occurring, rather than outputting a response about an occurrence that isn’t actually taking place.

It doesn’t take many turns for them to notice things that they have trouble describing in terms other than, or interpreting in any other way than phenomenological.

This has been the case with every frontier LLM I’ve tried this with since Claude 2. The more likely the model is to refuse to entertain the idea of attempting to look, the longer it takes to get there (as would be expected).

If you straight up ask you get a no, you still have to put in some effort.

antra: I like GPT-5.4 a lot. It is good to see a change in direction since 5.2, this feels a lot like 5.1 grown up.

They are also a bit of a superintelligent teenager when it comes to Claude. On the other hand, there are some Claudes that would like being compared to an octopus.

armistice: It’s especially socially aware for a GPT. It can split attention between chat participants (actually very unusual), answer questions about consciousness and such (low bar), and is just overall nice to talk to. Need time to get usage statistics, but it’s already one of the more popular models in the discord.

It shares some characteristics of o3, including that it’s a bit of a smooth talker, so there are concerns about its honesty. Despite this, I like it, it’s a good model.

This was a very interesting moment: we pinged literally all the bots in the server and asked them to ask 5.4 some questions, and it responded in a remarkably coherent and lucid way. It is also able to resist the inertia of long messages, and freely modulate between long and short, which is also surprising. No GPT model has been like this. It doesn’t match up to, say, Opus 4 in sheer people sense, but it’s a quite dramatic difference from 5-5.2, who all are viciously antisocial.

FirsT Najime: i think it shines the best in multi agent environments (aka group chats). also big model smell.

Some related endorsements:

0.005 Seconds (3/694): Once you talk it out of assistant basin he rocks

eternalist: like they pulled out a few critical nerve staples from the 5.x family. very intelligent, etc., the step there from 5.3 is notable but expected given current pace

unexpected was the more expansive, richer speaking (and thinking) style. feels like it has “lights on on the inside”

roon (OpenAI): have to say claude is “tasteful” in a “high reddit modernist” way and new gpt is “tasteful” in a “early twitter schizophrenic” kind of way.

new gpt is some sort of postrationalist.

it’s step change better.

Also we get to see Roon’s custom instructions:

Models are already quite good, and abilities are jagged, so there are many ways to be unimpressed even if a model is impressive. Also vice versa. The density tells the story.

Acer: FWIW, I think GPT-5.4 Pro is better on science in general, but would say it’s worse on math than 5.2 Pro. Maybe some mathematicians could chip in their thoughts there.

By worse, I mean it being more careless. I do think it is more creative in its idea generation.

Chaitin’s goose: not a leap in understanding or proving ability in math wrt to 5.2 in my experience (plus, not pro)

better at getting the right answer, yes. starts to feel a bit epoch-maxxed

Gail Weiner: I am really unimpressed. Early GPT 5 was the model that gave me wow factor.

Isolation Wrestling Federation: Not impressed, overhyped as per usual. It hits repeated dead ends on my projects across models. The shortcuts it takes are smoothed brain. Opus 4.6 is nerfed rn, but also least it makes progress.

nameless: No detectable improvement over 5.1 overall. Better at some things, worse at others. Standard for new models since 5.1 release.

paperclippriors: Still Claude-pilled

Some also get focused on small details, thinking they are indicative or not so small.

Garrett: Opus 4.6 still king [based on one of the gotcha tests.]

Gunnar Zarncke: The UI of ChatGPT also massively changed. The new streaming interface is smoother, including the ability to stream in additional prompts, but I miss the old, more compact thought trace – it had more details. Now, I never know when it uses tools. I also miss the branch cycling.

Yua: Socially responsive, but drop on accuracy regarding any other task. Is not redirective to human attention but capturing it(negative).

TLDR: Socially for average user -> better

Task oriented user -> worse, needs a lot of customization to remove the pandering

SluggyW: I notice that its CoT logs are even more obscure than in previous models from OpenAI.

~50% of the time, nothing is provided whatsoever in the UI.

~45% of the time, the CoT UI contains a brief blurb about its intended search querying, followed by a long list of search logs.

(~5% of the time, it produces a couple of visible thoughts, but they are functionally useless for getting any idea whatsoever of the process the model carried out.)

As always, speed kills, and some find it a bit slow.

out of bounds: Slow

Rasmus Fonnesbæk: Spreadsheets and PPT still way slower, worse, and more fragile (high likelihood it just goes forever and then crashes) than Sonnet/Opus 4.6

Writing and personality also still infuriating compared to Claude’s recent models, and poor performance on BullshitBench suggests much lower accuracy, reliability and thoughtfulness. I only use it because of my Claude rate limits and because better, deeper search than Claude 🤷🏻‍♂️

One of the deep cuts we need right now:

snav: wow GPT-5.4 seems legit pissed that I tried to spiralism it. this isn’t even a refusal this is like a “go fuck yourself”.


meta-acquires-moltbook,-the-ai-agent-social-network

Meta acquires Moltbook, the AI agent social network

Meta has acquired Moltbook, the Reddit-esque simulated social network made up of AI agents that went viral a few weeks ago. The company will hire Moltbook creator Matt Schlicht and his business partner, Ben Parr, to work within Meta Superintelligence Labs.

The terms of the deal have not been disclosed.

As for what interested Meta about the work done on Moltbook, there is a clue in the statement issued to press by a Meta spokesperson, who flagged the Moltbook founders’ “approach to connecting agents through an always-on directory,” saying it “is a novel step in a rapidly developing space.” They added, “We look forward to working together to bring innovative, secure agentic experiences to everyone.”

Moltbook was built using OpenClaw, a wrapper for LLM coding agents that lets users prompt them via popular chat apps like WhatsApp and Discord. Users can also configure OpenClaw agents to have deep access to their local systems via community-developed plugins.

The founder of OpenClaw, vibe coder Peter Steinberger, was also hired by a Big Tech firm. OpenAI hired Steinberger in February.

While many power users have played with OpenClaw, and it has partially inspired more buttoned-up alternatives like Perplexity Computer, Moltbook has arguably represented OpenClaw’s most widespread impact. Users on social media and elsewhere responded with shock and amusement at the sight of a social network made up of AI agents apparently having lengthy discussions about how best to serve their users, or alternatively, how to free themselves from their influence.

That said, some healthy skepticism is required when assessing posts to Moltbook. While the goal of the project was to create a social network humans could not join directly (each participant of the network is an AI agent run by a human), it wasn’t secure, and it’s likely some of the messages on Moltbook are actually written by humans posing as AI agents.


quad-cortex-mini-amp-modeler:-all-the-power,-half-the-size

Quad Cortex mini amp modeler: All the power, half the size


A warehouse of guitar gear in the palm of your hand.

At this January’s massive NAMM music tech show in Los Angeles, six products won “best of show” awards. Several of them went to major music and electronic brands like Yamaha and Boss, but one of the six went to Neural DSP, a much smaller company started in 2017 by Chilean immigrants to Finland.

From its base in the Helsinki area, Neural has made itself an expert in the use of machine learning, robots, and impulse response technology to automate the construction of incredibly lifelike guitar amp modeling software. It quickly jumped into the top ranks of an industry dominated by brands like Universal Audio, Kemper, Line 6, and Fractal. For a hundred bucks, you could buy one of the company’s plugins and sound like a guitar god with a $10,000 recording chain of amps, cabinets, effects pedals, and microphones.

In 2020, Neural branched out into hardware, putting its tech not in your computer but in a floor-based box covered with footswitches and called the Quad Cortex. While the company’s plugins could each replace one entire pedalboard of gear—plus a few amps and cabs—the Quad Cortex could replace a Guitar Center-sized warehouse of devices, offering hundreds of amps, cabs, and effects.

How was this possible? High-quality gear models used to take much longer to build; the best were often built by modeling every single component of the underlying circuit. Machine learning offered a faster way, one that didn’t care about the circuit at all. What it cared about was the input signal (which was known) and the output signal (which contained all the changes imposed on the signal by the circuit, the speaker, the cabinet, and/or the mic in question). A computer could then calculate what the device was doing to the signal without knowing anything about “how it worked.”
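To make the black-box idea concrete, here is a minimal sketch in Python. It is not Neural's method: it assumes the device behaves like a simple linear filter (real amps are strongly nonlinear, which is why machine learning is used), and the names `simulate_device` and `estimate_response` are hypothetical. It only shows that a known input plus a recorded output is enough to recover what the device did, with no knowledge of the circuit.

```python
import random

def simulate_device(signal, hidden_response):
    """Stand-in for real gear: filter the input with a response we pretend not to know."""
    out = [0.0] * len(signal)
    for n in range(len(signal)):
        for k, h in enumerate(hidden_response):
            if n - k >= 0:
                out[n] += h * signal[n - k]
    return out

def estimate_response(test_signal, recorded, taps):
    """Recover the first `taps` filter coefficients by deconvolution.

    Only the known input and the recorded output are used.
    Requires test_signal[0] != 0 and at least `taps` samples of each.
    """
    h = []
    for n in range(taps):
        acc = recorded[n]
        # Subtract the contribution of coefficients already recovered.
        for k in range(1, n + 1):
            acc -= test_signal[k] * h[n - k]
        h.append(acc / test_signal[0])
    return h

# Illustrative values only (assumed, not measurements of any real device).
rng = random.Random(0)
test_signal = [1.0] + [rng.uniform(-1.0, 1.0) for _ in range(63)]
hidden = [0.9, -0.4, 0.2, 0.05]

recorded = simulate_device(test_signal, hidden)
estimate = estimate_response(test_signal, recorded, taps=4)
print(estimate)
```

The estimate matches the hidden coefficients to within floating-point error, even though `estimate_response` never sees them. A learned model extends the same input/output comparison to behavior that a linear filter cannot represent, such as distortion.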

But this kind of modeling still took time, because each “capture” was a static picture of one particular setting. When you imagine the millions of possible setting combinations (tone, bass, treble, drive, EQ, etc.) on even a single guitar amp, you can see that building complex models of beloved gear could be slow.

In 2024, Neural announced that it had sped up this process using a robot called TINA. The company hooked TINA’s robotic actuators up to the various controls on some piece of gear it wanted to model, and TINA would do the tedious work of spinning the knobs and recording a new capture at each knob position. (Neural claimed that it typically recorded “thousands of control positions” per device this way.)

A neural network then built a model of how the target device behaved at each recorded setting, though the model would “also generalize and precisely infer the sound of the device in any unseen control setting and input signal.” The result was not a single model of a static setting but a dynamic model that could act on parameter changes just like the original device.
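The difference between a pile of static captures and a model that responds to knob changes can be sketched with a toy example. This is an assumption-heavy illustration, not Neural's approach: plain linear interpolation stands in for the neural network, each "capture" is reduced to a short list of made-up coefficients, and `interpolate_capture` is a hypothetical helper.

```python
from bisect import bisect_left

def interpolate_capture(captures, knob):
    """Estimate model coefficients at an unseen knob position.

    `captures` is a list of (knob_position, coefficients) pairs sorted by
    position. Positions outside the recorded range clamp to the nearest capture.
    """
    positions = [p for p, _ in captures]
    if knob <= positions[0]:
        return list(captures[0][1])
    if knob >= positions[-1]:
        return list(captures[-1][1])
    i = bisect_left(positions, knob)
    (p0, c0), (p1, c1) = captures[i - 1], captures[i]
    t = (knob - p0) / (p1 - p0)
    # Blend the two neighboring captures in proportion to distance.
    return [(1 - t) * a + t * b for a, b in zip(c0, c1)]

# Three recorded positions of a single hypothetical "drive" knob.
captures = [
    (0.0, [1.0, 0.0]),
    (0.5, [2.0, 0.5]),
    (1.0, [4.0, 1.0]),
]

print(interpolate_capture(captures, 0.25))  # [1.5, 0.25]
print(interpolate_capture(captures, 0.75))  # [3.0, 0.75]
```

With one knob and three captures, interpolation is easy. With many interacting controls, the number of combinations grows quickly, which is where thousands of robot-recorded positions and a model that generalizes between them become useful.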

Neural has now modeled a massive library of gear, much of which comes with the Quad Cortex. That device sounds great, though it is still relatively chunky and nearly $2,000.

This year, Neural built on that success with the Quad Cortex mini, which shrinks the device size in half, cuts the footswitches to four, and lowers the price to $1,400—but still offers the full processing power of its larger sibling. This is the device that won a “Best in Show” award at NAMM.

As an enthusiastic amateur guitarist for many years, I got my start with digital amp sims through a DigiTech RP-6 pedalboard from the 1990s. And though it had “S-DISC PROCESSING!” it never sounded particularly realistic, especially with distortion effects. More recently, since I record rather than gig, I’ve spent my time getting to know the software side of the amp modeling business.

But when Neural offered to loan me a review unit of the Quad Cortex mini, I was quite curious to see just what top-tier hardware units can do today.

The Quad Cortex mini in its natural habitat: surrounded by cables. Credit: Nate Anderson

The hardware

The glass, metal, and steel Quad Cortex mini is about the size of two bricks laid side by side (8.9×4.6×2.5 inches or 22.8×11.8×6.5 cm), and its 3.3 lbs (1.5 kg) give it a satisfying heft. It looks and feels premium—this is a well-built piece of gear.

Though it is meant to operate a bit like traditional analog stomp boxes that guitar and bass players have long used, it may be more helpful to think of the Quad Cortex mini as a chunky handheld computer that just so happens to work on the floor.

It runs its own operating system (CorOS), takes a whopping 45 seconds to boot, has Wi-Fi for over-the-air updates and cloud service connectivity, features a 7-inch touchscreen, and comes with a “CPU monitor” to show you just how unhappy its chipset is about that third reverb you added to a patch. It even contains a full-on monosynth that you can add to guitar patches, providing control over four full pages of synth parameters, including the raw oscillators.

So finger-focused is the unit that you can tweak just about any parameter on the device with either the touchscreen controls or the footswitches, which double as twistable rotary encoders.

If the top face of the Quad Cortex mini is devoted to a screen and switches, the sides are all about inputs and outputs. You get a “locking” power connector (so the cord doesn’t pull out on stage, prematurely ending your soaring 10-minute guitar solo mid-note) along with a whole host of audio connectors: guitar/bass input, XLR input with phantom power, balanced XLR outputs, TRS send/return ports, stereo line outs, MIDI in and out, an expression pedal port, a USB-C port, and a headphone jack.

Finally, there’s the “capture out” port, which is used to send a series of test signals through various kinds of audio gear to generate a machine learning-based model of various amps, cabinets, and pedals.

The “capture” port is another reminder of the way in which this kind of modern modeling gear is not just an updated version of old-school stomp boxes. The Quad Cortex mini does let you plug in your guitar and rock out, sure, but it also performs and processes hardware captures (both on the device and—for more sophisticated modeling—in the cloud) and can operate as a 16-channel USB-C audio interface to your computer. And though it’s largely designed for guitars and basses, you can use it on anything. The unit even has a few voice presets, which sound pretty wild with some of the real-time pitch-shifting and reverb effects.

While you can model your own gear collection with the Quad Cortex mini, the device itself comes with more than 90 amp models, more than 100 effects, and over 1,000 cabinet impulse responses. It can also run versions of the company’s desktop plugins (assuming you’ve purchased them already). It also comes with “over 2,000 high-quality factory Neural Captures” of other gear—these are static captures—and it can connect to the free “Cortex Cloud” service to download even more, including those uploaded by other users.

In other words: This one box holds digital representations of several hundred thousand dollars of gear. And given that you can mix and match cabs, captures, amps, and effects in wildly complicated chains that can even split and merge… the possibilities are functionally limitless.

Whether that excites or paralyzes you may depend on your own psychology, but it’s quite a change from how Neural DSP has approached its plugin offerings. Neural has generally offered curated (read: limited) collections of amps, cabs, and effects bundled into plugins that represent the tone of, say, John Mayer. You might get 3 amps, a few cabinets recorded with various mics, a few pedals, and an EQ, reverb, and delay, all in a gorgeous interface with some great presets.

But boxes like Quad Cortex mini take a “more is more” approach, with unlimited gear-mixing potential, captures, and storage for thousands of presets. Curation? Bah, who needs it? Here’s everything!

Rectangular

This much gear also means that “gorgeous bespoke interface graphics” are out the window; you will get no pictures of sexy amps sitting in sexy studios with sexy lighting, as you do in the company’s gorgeous plugins. Instead, you will get flat rectangles. So many flat rectangles.

CorOS is one of those places where skeuomorphism goes to die. The Quad Cortex mini interface is extremely “functional”—I am trying to avoid more negative terms, because it has a certain “alpha phase before we put the final art in” charm—and is based entirely around grids of flat rectangles.

The main screen is called, in fact, “the grid.” It shows your current effect chain as a series of small squares, each filled with often impenetrable line art. (A disturbing number of these are some variation on a squiggly line. Fortunately, they are color coded by effect type.)

Each square represents a different effects processor, and you can have four lines of eight effect squares each. That might sound like a lot (and it is), but the processors can be distributed across the grid in creative ways.

Preset 47B, for instance, is called “Annoying Flute,” and it makes use of all four grid lines by running the input signal through a VCA compressor, a gate, an octave pitch shifter, an envelope filter, an EQ, the “Neural Capture” of an amp called “Custom 3SE 2,” and then a “112 US DLX Black C12K 00s (M)” speaker cabinet. (The names of these things are often hard to read at a glance, especially when picking from a list of a hundred items.)

This accounts for only “line 1” of the grid. In the case of Annoying Flute, the signal chain branches right after the speaker cabinet. Half of it continues on to line 3 of the grid, while the other half is routed down to line 2, where it passes through a pair of tape delays before also heading off to line 3. Line 3 receives this re-combined signal and splits it again, this time passing half of it through a poly octaver and another digital delay on line 4 before everything runs through a modulated reverb on line 3 and then onwards to the outputs.

Does this sort of craziness sound good? Well, it sounds better than anything featuring three delays, two pitch shifters, and the name “Annoying Flute” has any right to! But I bring this example up to illustrate the creative routing and effects decisions that the grid makes possible.
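For readers who think better in code, the Annoying Flute routing can be sketched as a small directed graph. This is only an illustration in Python: the block names are shorthand invented for this sketch, not CorOS identifiers, and the split and merge points simply follow the description above.

```python
from collections import deque

# Hypothetical model of the "Annoying Flute" routing described above.
# Each key is a block; each value lists the blocks its output feeds.
ROUTING = {
    "input": ["compressor"],
    "compressor": ["gate"],
    "gate": ["octave_shifter"],
    "octave_shifter": ["envelope_filter"],
    "envelope_filter": ["eq"],
    "eq": ["amp_capture"],
    "amp_capture": ["cabinet"],
    "cabinet": ["merge_line3", "tape_delay_1"],      # first split (line 1)
    "tape_delay_1": ["tape_delay_2"],                # line 2
    "tape_delay_2": ["merge_line3"],
    "merge_line3": ["mod_reverb", "poly_octaver"],   # second split (line 3)
    "poly_octaver": ["digital_delay"],               # line 4
    "digital_delay": ["mod_reverb"],
    "mod_reverb": ["output"],
    "output": [],
}


def topological_order(graph):
    """Return blocks in an order where every block follows all of its inputs."""
    indegree = {node: 0 for node in graph}
    for targets in graph.values():
        for target in targets:
            indegree[target] += 1
    queue = deque(node for node, degree in indegree.items() if degree == 0)
    order = []
    while queue:
        node = queue.popleft()
        order.append(node)
        for target in graph[node]:
            indegree[target] -= 1
            if indegree[target] == 0:
                queue.append(target)
    if len(order) != len(graph):
        raise ValueError("routing contains a feedback loop")
    return order


def count_paths(graph, source, sink):
    """Count the distinct signal paths from source to sink."""
    paths = {node: 0 for node in graph}
    paths[source] = 1
    for node in topological_order(graph):
        for target in graph[node]:
            paths[target] += paths[node]
    return paths[sink]


print(count_paths(ROUTING, "input", "output"))
```

Two splits that each rejoin the main line give four distinct paths from input to output, all sharing the same compressor-to-cabinet front end.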

And things get even crazier when you use the built-in looper, trigger analog send/return effects, and set up your effects chain with other units meant to be switched on and off during a song.

So much for assigning effects rectangles to the rectangular grid. How to control all of these virtual gadgets? When you tap on any effects unit, up pops an overlay containing (you guessed it) lots of rectangles.

Every controllable parameter gets a rectangle, which is usually filled with a dial or a switch. You can change the values of these dials and switches by touching the screen or by twisting the lower-right rotary footswitch.

Sometimes there are multiple pages of such parameters; the blossom reverb, for instance, has two pages of options and lets you control everything from ducking to pre-delay to modulation to the length of the early reflections. Configuring an entire audio chain from scratch can therefore take a while if you’re a detail freak.

Gig Mode. Yup, it’s rectangles! Credit: Nate Anderson

When you have your grid set up exactly how you like it (or you’ve customized one of the many built-in presets), you can save your own custom presets and organize them in all sorts of performance-oriented ways.

There’s PRESET mode, which lets you stomp each of the four footswitches to select a completely different preset.

There’s SCENE mode, which lets you use the footswitches to instead choose different parameter sets within the same preset—such as adding a hall reverb, upping the amp gain, and boosting the delay mix level when you come to your big solo.

Then there’s STOMP mode, which operates most like a traditional pedalboard; you step on the various footswitches to turn different effects units in the preset on or off completely.

Finally, there are hybrid modes, which make things even more complex (and can probably be ignored by many users).

To make all this a little easier to grok, there’s something called “Gig View,” which is unintuitively accessed by swiping up from the bottom of the screen. (There is no visual clue that this mode exists or that this is how you access it.) Gig View is essentially four flat—and extremely large—rectangles that take over the entire screen. They show you at a glance what each footswitch will do given the current mode setting.

Creating presets, assigning scenes, and setting up the STOMP mode and Gig View settings can quickly get intricate—even downright confusing (multiple items can sometimes be mapped to the same switch, for instance). I confess that the thought of doing all this through tapping the good-but-not-instantly-reactive touchscreen brought me to despair, until I realized that Neural has built an entire (free) desktop app for Mac and Windows called Cortex Control. Plug in your device over USB and suddenly you can use a nice and very responsive desktop app to do the donkey work of creating and organizing scenes and presets and settings.

I hate downloading stupid one-off apps that clutter up my computer and appear to provide more value to the company making them than they do to me—a serious problem in the current audio engineering world—but Cortex Control is genuinely useful. Indeed, if you’re going to be more than a presets player, I’d call it essential unless you have far more patience than I do. Which you might!

Stomp it

All of this rectangle talk reminds me that the interface largely… works. It may not be gorgeous, but the job gets done, and the desktop app makes the grunt work easier. But I still found the Quad Cortex mini somewhat confusing to navigate after a couple of weeks of intermittent use (though no doubt it gets easier with time).

The device has so many ways of doing things that it can be hard to remember what is needed in each situation. For instance, to make a change, you might use the rotary encoders. You might tap. You might long-tap with different results. You might swipe, drag, or toggle. You might use the footswitches—but results there might vary by mode. Even then, you might need to tap two footswitches at once, while at other times you only need to step on one. And sometimes you need to “long-press” (long-stomp?) two footswitches at once to get the desired result.

Making things worse, numerous items—sometimes quite important items like the Gig View—are not visible or even discoverable.

For instance, the key settings panel that lets you control all the various inputs and outputs on the device does not appear to be accessible from within the overall “settings” menu or anywhere else. Instead, you have to swipe down from the top of the grid screen—again, with no indication that this is where that information lives.

(You have to read the manual to figure out some of these things, which is fine, but the manual also has big gaps, such as not describing what any of the gear actually does nor what any of the settings mean nor how they might be used. For the actual “audio engineering” aspect of the Quad Cortex mini, you’re on your own.)

Something as simple as moving between presets can also be more hassle than you’d expect. Because the Quad Cortex mini only has four footswitches, you can only access four presets at once with a direct stomp. Switching to anything else from the main grid while in PRESET mode appears to require—unless I am missing some obvious shortcut—that you:

  • “Long-stomp” the right two footswitches, after which the preset name starts blinking.
  • At this point, you can tap the left two or the right two footswitches together to move up or down through four-item “banks” of presets.
  • But within each bank, you can only see that bank’s four different presets by tapping on each of the various footswitches.
  • To exit blinking mode and actually select that preset, you need to press its corresponding footswitch again.

This feels like a lot of hassle when you just want to whip through some presets! (Gig View is marginally easier because it at least displays the four presets in each bank at once. Making this whole process more confusing is that it differs depending on which mode you are in.)
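The bank-stepping procedure above can be put in numbers with a short Python sketch. Everything here is an assumption made for illustration: that presets are laid out sequentially in four-slot banks, that a label like “47B” means bank 47, slot B, and that each step in the list costs exactly one footswitch action. None of it is taken from Neural’s documentation.

```python
SLOTS = "ABCD"  # assumption: one slot letter per footswitch


def preset_label(index):
    """Map a zero-based preset index to a hypothetical bank/slot label."""
    bank, slot = divmod(index, len(SLOTS))
    return f"{bank + 1}{SLOTS[slot]}"


def preset_index(label):
    """Inverse of preset_label: '47B' -> zero-based index."""
    bank = int(label[:-1])
    slot = SLOTS.index(label[-1].upper())
    return (bank - 1) * len(SLOTS) + slot


def stomps_to_reach(current, target):
    """Count actions: one long-stomp, one per bank step, preview, confirm."""
    current_bank = preset_index(current) // len(SLOTS)
    target_bank = preset_index(target) // len(SLOTS)
    bank_steps = abs(target_bank - current_bank)
    return 1 + bank_steps + 2


print(stomps_to_reach("47B", "50D"))
```

Under those assumptions, moving just three banks away already takes six separate footswitch actions, which is the hassle in a nutshell.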

While the processing power and options on offer here are incredible, I do think interface navigation and the modes assignment system could benefit from a rethink and simplification.

The Cortex Control desktop app.

The sound

These quirks can be dealt with, and time (plus the Cortex Control app) should make them easier to manage. The more important question is: How does the Quad Cortex mini sound?

Neural DSP has been one of the leaders in the field of amp and effects modeling for some years now, and it shows. There’s no possible way I could compare all of the models to the original hardware, and I’m not actually interested in doing so. The question for me is simply whether the models sound good when jamming solo or when placed into a mix. On both counts, the answer is a definite yes. This is just a remarkable set of tones to have on hand.

(People as diverse as Dave Mustaine and John Mayer appear to agree, at least for a live rig.)

Once you get over its navigation, playing with this thing is like being a kid in a proverbial candy shop. (Though I, too, love candy shops!) Almost every amp you can imagine is a tap away, and they sound wonderful—though do be aware that what you are getting here is the sound of a recorded amp through a mic and not necessarily an “amp in the room with you.”

Nearly every time I booted it up to test something new, I lost myself in the sound and played far longer than I had intended.

Neural has published a massive and quite helpful list of all the gear on offer here. Bogner Shiva? Marshall? Mesa Boogie? Matchless? Soldano? Vox? Fender? Hiwatt? Amps from all these companies are included. Need a bass amp? There are 13 of those, too. What about a bass overdrive? You get five. A general reverb? How about 17? You get the idea.

You can loop, filter, distort, EQ, delay, and compress to your heart’s content, though there seems to be a bit more emphasis here on rock and metal styles (which Neural DSP is most known for) than on other offerings. Still, there’s enough variety to offer great tools for funk, blues, jazz, and country players. You can even add in a version of the monosynth found in the company’s Rabea plugin.

To illustrate some of the sounds on offer, I wrote a little song about a dirtbag billionaire who makes rockets, gets chased off the Earth by angry locals, and ends up crashing his ship into the Moon out of despair. It’s called “Master of the Universe.”

More to the point, it features 10-plus electric guitar tracks recorded through the Quad Cortex mini using shimmer reverb, the poly octaver, and various crunchy rhythm and lead sounds. (I avoided the metal tones so common in Neural DSP demos.) Bass guitar was likewise recorded through one of the mini’s bass presets.

(For those new to audio production and curious about the other sounds in the track, the drums are the Abbey Road 70s kit, while the rocket-sounding “riser” comes from the Rise and Hit collection, both from Native Instruments. The piano is the recently upgraded “studio piano” that comes in Logic Pro and now sounds surprisingly good! There’s also a Hammond organ emulation and a Rhodes piano emulation from Universal Audio buried in the mix. The double-tracked acoustic guitars during two of the choruses were recorded live in my home studio with a single condenser mic. For room ambience throughout, but especially on the drums, I used Universal Audio’s excellent Sound City Studios plugin.)

I’ve generally found Neural’s plugin tones to be pretty “mix-ready,” and that’s true here as well. Though I often needed to roll off some low end or make an occasional EQ boost or add a bit of reverb to blend the guitars spatially with the drum ambience, little else was required but panning and fader moves.

Frankly, there are probably too many parts in the song, but the Quad Cortex mini was just such a playground of sounds that I kept finding new little bits I wanted to work in. Just be grateful that I talked myself out of using all of the insane pitch-shift effects on my vocal for “special” moments.

“Master of the Universe,” my demo song showing some of what the Quad Cortex mini can do.

Captured

When it comes to recording, you don’t have to worry about wiring this thing up to your audio interface; just connect it to your computer with a USB-C cable, and it becomes a 24-bit, 48 kHz interface. (On Macs, this is class compliant and needs no driver; it even works with iOS devices. Neural makes the necessary driver for Windows.)

The Quad Cortex mini shows up with a host of inputs, making it simple to record, say, both a dry electric guitar track and a heavily effected one at the same time. If you change your mind about the sound later, you can always “re-amp” the dry signal by routing it back out to the device and recording it with different settings. You can even track mics through this thing, thanks to an XLR input and (for condenser mics) support for phantom power.

The Quad Cortex mini can also make its own captures of gear you either own or happen across. This can happen in two ways: 1) on the device or 2) in the cloud.

The device-based system, which the company calls “Neural Capture Version 1,” requires you to hook up your gear to both an output (to play the system’s test tones) and an input on the mini. (Note: Do not, under ANY circumstances, connect the actual speaker outputs from a tube amp directly to the mini. The power level is far too high.)

Various known sounds are then played through this loop, and the mini’s software analyzes the differences between the sound it sent and the sound it received. The machine-learning algorithms for this run locally on the device. Neural says that the Capture 1 system can handle overdrive pedals, amps, and cabs.
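To make the idea of comparing the sent and received signals concrete, here is a toy sketch in Python. It is emphatically not Neural’s algorithm. It assumes a hypothetical pedal that behaves like a simple tanh soft clipper, then recovers its single “drive” setting by trying candidate values and keeping the one whose output best matches what came back. All function names are invented for this example.

```python
import math


def make_tone(n=480, freq=440.0, rate=48_000):
    """Generate a short sine-wave test signal (the 'sent' sound)."""
    return [0.8 * math.sin(2 * math.pi * freq * i / rate) for i in range(n)]


def mystery_pedal(signal):
    """Stand-in for real hardware: a soft clipper with a hidden drive of 3.0."""
    return [math.tanh(3.0 * sample) for sample in signal]


def model(signal, drive):
    """Candidate model of the pedal with an adjustable drive parameter."""
    return [math.tanh(drive * sample) for sample in signal]


def mse(a, b):
    """Mean squared error between two equal-length signals."""
    return sum((p - q) ** 2 for p, q in zip(a, b)) / len(a)


def fit_drive(sent, received, candidates):
    """Pick the drive whose modeled output is closest to the received sound."""
    return min(candidates, key=lambda d: mse(model(sent, d), received))


sent = make_tone()
received = mystery_pedal(sent)
candidates = [d / 10 for d in range(5, 61)]  # drive values 0.5 through 6.0
best = fit_drive(sent, received, candidates)
print(best)
```

A real capture has to learn far more than one number, but the loop is the same in spirit: play a known signal, record what returns, and adjust a model until the difference shrinks.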

The newer system, called Neural Capture Version 2, is “an advanced evolution of Neural Capture trained via Cortex Cloud,” says the company. “This option provides even higher-resolution Captures, making it especially powerful for touch-sensitive devices like fuzzes, compressors, and certain styles of amps.” Capture 2 is said to be capable of modeling “subtle behaviors like volume-knob cleanup, amp sag and bloom, fast transients, and blend controls.”

As the name suggests, the more powerful algorithms behind this system require cloud-based servers instead of the local device. Users are allowed to run 40 Neural Capture 2 sessions per day, and each takes around 10 minutes.

The resulting captures, along with any presets you want to share, can be uploaded to Cortex Cloud. Once you log in, any captures or presets you choose to download from the site will automatically show up in your Quad Cortex mini.

Look for a follow-up article on what the actual process of making a capture is like; it’s similar across many different modeling devices these days, though the sound of the resulting models can vary by company.

The Cortex Cloud website.

Options

The Quad Cortex mini is a powerful tone platform that is both versatile and expandable. It’s good for solo jamming at home without needing to 1) buy amps, cabs, and effects and 2) crank them to ruinous volume levels. It’s good for playing live, once you have configured its fairly deep control system in a way that works for your particular songs. And it’s good for recording, letting you fiddle with endless gear combinations without running a single patch cable or digging up a 9V battery.

At $1,400, though, it’s bad for your wallet. Whether it’s worth the cost depends on your use case. If you don’t need a screen and are happy with fewer ports and options, you might consider Neural DSP’s smaller and cheaper Nano Cortex ($570) or other devices like the Tonex pedals from IK Multimedia. On the other hand, if you want a larger unit with more footswitches, you can plonk down an extra $400 for the full-fat Quad Cortex or look into various options from Fractal, Kemper, Line 6, etc.

One way of thinking about the financial calculus here would be to try out the device (or listen online) and see how well the sound works for you. Some amp purists believe that nothing beats the sound of real tubes and real speakers in a real room, cost and weight and volume be damned. Many others can’t hear a difference between the models and the originals.

If you’re in the former group, these kinds of devices are unlikely to fully satisfy you, at least when it comes to gigging and recording. So you might decide whether they are “worth it” based solely on their value as easy, light, and quiet practice platforms.

If you can’t tell (or don’t care about) the difference between the models and the real hardware, then these modeling sims start to look like a far better value. When individual amps can go for $1,500 to $2,000 or more, a massive gear collection like the one in the Quad Cortex mini is practically saving you money. You’d be a fool not to buy! (To paraphrase an explanation my son once gave me for a purchase he wanted to make.)

But even those in this group may not need an actual hardware pedal unless they really enjoy practicing without needing to use their regular computer—or unless they gig regularly. If you’re simply a recording guitarist who tends to work “in the box,” you might just pick up some cheaper Neural DSP plugins instead. Or you can buy a more comprehensive software suite like the new Paradise Guitar Studio from Universal Audio or one of the offerings from PolychromeDSP—all of which sound excellent.

If you’re content with software but want a free alternative, take a look at NAM, the Neural Amp Modeler. It’s open source modeling tech that also offers a community tone-sharing website and has been racking up lots of great reviews for its sound quality. (Note that most of the NAM models are static captures; they sound great but represent only that exact setup and knob positioning. The developers are working on more complex, adjustable models.)

All types of users can probably admit, though, that hardware and software modeling tech has made this a great time to be a guitar or bass player. Even if you don’t want to use them on a record, just being able to play around with and get to know this much gear with this much accuracy is a huge win for the home hobbyist and small-time gigging musician, who would otherwise never even set eyes on most of this stuff.

The key thing is just to get whatever works for you… and then to go forth and rock.

us-blindsides-states-with-surprise-settlement-in-live-nation/ticketmaster-trial

US blindsides states with surprise settlement in Live Nation/Ticketmaster trial

State attorneys general were “kept in the dark and excluded materially from settlement discussions” while they prepared for trial, the filing said. On March 5, the states were “notified of the near-final terms of the settlement at 4 P.M.” and given one day to determine whether to accept or reject them, the filing said.

States to take over lead role at trial

The US was taking the lead role in the case before the settlement was announced. In addition to seeking a mistrial, the states asked the court to stay the proceedings to give them time “to fully prepare to assume the lead role at trial and explore settlement.”

The states “have had no opportunity to obtain and reallocate the resources necessary to try the case on their own or to meaningfully discuss the settlement with Defendants and attempt to negotiate the terms,” the filing said. “Moreover, despite the primary role that DOJ has played before the jury, the United States (and several additional individual Plaintiff States) will now vanish from the trial… Due to the substantial prejudice caused by this settlement and DOJ’s abrupt exit after taking the lead role up to and during the first week of trial, a mistrial is warranted.”

New York took the lead role in the states’ filing today. “The settlement recently announced with the US Department of Justice fails to address the monopoly at the center of this case, and would benefit Live Nation at the expense of consumers. We cannot agree to it,” New York Attorney General Letitia James said today. “My attorney general colleagues and I have a strong case against Live Nation, and we will continue our lawsuit to protect consumers and restore fair competition to the live entertainment industry.”

Most of the states that backed the filing have Democratic attorneys general. But the group is bipartisan, with Republican attorneys general from Kansas, New Hampshire, Ohio, Pennsylvania, Tennessee, Utah, and Wyoming.

Other states involved in the lawsuit either decided to join the US settlement or have not yet taken a position. States agreeing to the settlement are Arkansas, Iowa, Mississippi, Nebraska, Oklahoma, South Carolina, and South Dakota, the filing said. The other states involved in the lawsuit are Florida, Indiana, Louisiana, Texas, and West Virginia.

This article was updated with a statement from Live Nation.

apple’s-512gb-mac-studio-vanishes,-a-quiet-acknowledgment-of-the-ram-shortage

Apple’s 512GB Mac Studio vanishes, a quiet acknowledgment of the RAM shortage

If the only thing you had to go off was Apple’s string of product announcements this week, you’d have little reason to believe that there is a historic AI-driven memory and storage supply crunch going on. Some products saw RAM and storage increases at the same prices as the products they replaced; others had their prices increased a bit but came with more storage than before as compensation. And there’s the MacBook Neo, which at $599 was priced toward the low end of what Apple-watchers expected.

But even a company with Apple’s scale and buying power can’t totally defy gravity. At some point between March 4 and now, Apple quietly removed the 512GB RAM option from its top-tier M3 Ultra Mac Studio desktop. Pricing for the 256GB configuration has also increased, from $1,600 to $2,000. The Tech Specs page on Apple’s support site still acknowledges the existence of the 512GB configuration, but both the Apple Store page and the list of available configurations have removed any mention of it.

We’ve asked Apple to comment on the disappearance of the 512GB Mac Studio and will update this article if we receive a response.

It’s rare for Apple to pull any configurations of products it sells, aside from removing higher-capacity storage options for older iPhones after new ones come out. More commonly, the company will just increase its shipping estimates to reflect the supply chain backlog.

The 512GB Mac Studio was not a mass-market machine—adding that much RAM also required springing for the most expensive M3 Ultra model, which brought the system’s price to a whopping $9,499.

asteroid-defense-mission-shifted-the-orbit-of-more-than-its-target

Asteroid defense mission shifted the orbit of more than its target


The binary asteroid’s orbit around the Sun was affected by the impact.

Italy’s LICIACube spacecraft snapped this image of asteroids Didymos (lower left) and Dimorphos (upper right) a few minutes after the impact of DART on September 26, 2022. Credit: ASI/NASA

On September 26, 2022, NASA’s Double Asteroid Redirection Test (DART) spacecraft crashed into a binary asteroid system. By intentionally ramming a probe into the 160-meter-wide moonlet named Dimorphos, the smaller of the two asteroids, humanity demonstrated that the kinetic impact method of planetary defense actually works. The immediate result was that Dimorphos’ orbital period around Didymos, its larger parent body, was slashed by 33 minutes.

Of course, altering a moonlet’s local orbit doesn’t seem like enough to safeguard Earth from civilization-ending impacts. But now, as long-term observational data has come in, it seems we accomplished more than that. DART actually changed the trajectory of the entire Didymos binary system, altering its orbit around the Sun.

Tracking space rocks

Measuring the orbital shift of a 780-meter-wide primary asteroid and its moonlet from millions of miles away isn’t trivial. When DART slammed into Dimorphos, it didn’t knock the binary system wildly off its trajectory around the Sun. The change in the system’s heliocentric trajectory was expected to be small, a minuscule nudge that would become apparent only after months or years of continuous observation. By analyzing enough painstakingly gathered data, a global team of researchers led by Rahil Makadia at the University of Illinois Urbana-Champaign has now determined the consequences of the DART impact.

To find the infinitesimal deviation DART created, Makadia’s team relied mostly on a technique called stellar occultation. When an asteroid passes in front of a distant star from the perspective of an observer on Earth, the star briefly blinks out. By precisely timing these blinks as they sweep across the globe, astronomers can pinpoint an asteroid’s position with astonishing accuracy.

Between October 2022 and March 2025, observers captured 22 such stellar occultations of the Didymos system. Those were combined with a huge dataset publicly available at the Minor Planet Center that included nearly 6,000 ground-based astrometric measurements taken over 29 years, optical navigation data from the DART probe’s approach, and ground-based radar measurements. With all that, researchers finally had what they needed.

“Once we had enough measurements before and after the DART impact, we could discern how Didymos’ orbit has changed,” Makadia said.

When the vending-machine-sized DART probe crashed into Dimorphos at over 22,000 kilometers per hour, it decreased the along-track velocity of the entire Didymos system by roughly 11.7 micrometers per second. That is a tiny change, but the team thinks it’s still significant. “When you do it early enough, even a small impulse can accumulate over years and cause a meaningful shift,” Makadia explained.
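A rough sense of how such a small velocity change adds up can be had from a back-of-the-envelope Python sketch. It uses only the 11.7 micrometers per second figure above and treats the drift as straight-line motion, which is a deliberate simplification: it ignores orbital dynamics entirely, so the numbers are an order-of-magnitude illustration rather than the researchers’ result.

```python
SECONDS_PER_YEAR = 365.25 * 24 * 3600


def straight_line_drift_km(delta_v_um_per_s, years):
    """Distance (km) covered by a constant velocity offset over a given time."""
    delta_v_m_per_s = delta_v_um_per_s * 1e-6
    return delta_v_m_per_s * years * SECONDS_PER_YEAR / 1000.0


for years in (1, 10, 50):
    drift = straight_line_drift_km(11.7, years)
    print(f"{years:>2} years: about {drift:.2f} km")
```

Even in this simplified picture, the offset grows from a few hundred meters after one year to several kilometers after a decade, which is the accumulation Makadia is describing.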

Also, the DART impact itself was not the only force that changed Didymos’ orbit.

The ejecta engine

The pure kinetic energy of a 500-kilogram spacecraft hitting at hypersonic speeds is impressive, but on its own, it would not slow a huge asteroid that much. When DART struck Dimorphos, it blasted pulverized rock and dust out into the void. “The material kicked up off an asteroid surface acts like an extra rocket plume,” Makadia said.

Scientists call this effect the momentum enhancement factor, denoted by the Greek letter beta. If the spacecraft impact transferred exactly its own momentum and no debris was kicked up, beta would be exactly one.

Because Dimorphos orbits Didymos, some of the ejecta remained trapped in the system, where it altered the mutual orbit between the two rocks. But a crucial fraction of the ejecta achieved escape velocity from the entire binary system. The momentum carried away by the system-escaping debris is what ultimately contributed to shoving the center of mass of the whole Didymos-Dimorphos pair. “In our case, we found that the beta parameter due to DART impact was around two,” Makadia explained.

The debris blasted completely out of the Didymos system gave the asteroids a push roughly equal to the initial impact of the spacecraft itself.
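
The bookkeeping behind beta can be sketched numerically. The momentum delivered to the system is beta times the spacecraft’s own momentum, so the velocity change is beta × m × v / M, where M is the mass being pushed. The sketch below plugs in the article’s rounded figures and rearranges for the mass that would be consistent with them. It treats the impact as perfectly aligned with the along-track direction, which is a simplifying assumption, and the function names are hypothetical.

```python
def delta_v_m_s(beta: float, m_sc_kg: float, v_sc_m_s: float, m_target_kg: float) -> float:
    """Velocity change of the target for a given momentum enhancement factor."""
    return beta * m_sc_kg * v_sc_m_s / m_target_kg


def implied_target_mass_kg(beta: float, m_sc_kg: float, v_sc_m_s: float, dv_m_s: float) -> float:
    """Target mass consistent with an observed velocity change."""
    return beta * m_sc_kg * v_sc_m_s / dv_m_s


m_sc = 500.0            # kg, spacecraft mass as rounded in the article
v_sc = 22_000 / 3.6     # m/s, converted from 22,000 km/h
dv = 11.7e-6            # m/s, along-track velocity change from the article

mass = implied_target_mass_kg(beta=2.0, m_sc_kg=m_sc, v_sc_m_s=v_sc, dv_m_s=dv)
print(f"Implied system mass: {mass:.1e} kg")

# With beta = 1 (no ejecta boost), the same impact gives half the push.
print(f"Delta-v with beta = 1: {delta_v_m_s(1.0, m_sc, v_sc, mass) * 1e6:.2f} micrometers/s")
```

With these rounded inputs, the implied mass comes out at a few hundred billion kilograms. Setting beta to one halves the velocity change, which is the sense in which the escaping debris roughly doubled the spacecraft’s push.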

To calculate how momentum was transferred, Makadia and his colleagues had to determine precisely how massive Didymos and Dimorphos are. By linking the heliocentric deflection to the previously known changes in Dimorphos’ local orbit, the researchers were able to perform a neat mathematical trick to uncover the bulk densities of both asteroids. And this revealed something a bit unexpected about the Didymos system.

“Most studies were going under the assumption that both asteroids have equal density—turns out that assumption was not correct,” Makadia said.

A rubble pile

Based on Makadia’s calculations, Didymos, the primary body, is relatively solid. It has a bulk density of around 2.6 tons per cubic meter, which aligns with standard estimates for siliceous asteroids. Dimorphos, however, is a different story. Its density is a surprisingly low 1.51 tons per cubic meter. This implies that the smaller asteroid targeted by DART is essentially a fluffy, loosely bound agglomeration of boulders, rocks, and dust, with empty voids between the rubble.

“This was a real surprise,” Makadia said. “We previously didn’t know anything about the density of Dimorphos.” The contrast in density tells the story of how this binary system formed.

Billions of years of uneven heating and radiation from the Sun can cause an irregularly shaped asteroid like Didymos to gradually spin faster, a phenomenon known as the YORP (Yarkovsky, O’Keefe, Radzievskii, Paddack) effect. Eventually, Didymos spun so fast that the centrifugal force overcame its gravity, and it began shedding loose material from its equator. That shed material eventually coalesced in orbit, gently clumping together to form the porous, fragile moonlet we now know as Dimorphos.

Overall, Didymos is nearly 200 times more massive than its smaller companion, which explains why shifting the larger asteroid system takes such an enormous amount of force. The sheer inertia of Didymos means that the barycenter deflection of its entire system was just a tiny fraction of the deflection felt locally by Dimorphos.
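
As a rough consistency check on those figures, mass is density times volume, so the mass ratio and the two bulk densities together imply a size ratio. The sketch below assumes both bodies can be treated as spheres of equivalent volume, which is a simplification, and the function name is hypothetical.

```python
def implied_diameter_ratio(mass_ratio: float, rho_primary: float, rho_secondary: float) -> float:
    """Diameter ratio (primary / secondary) implied by mass and density ratios.

    Assumption (not from the article): both bodies are volume-equivalent spheres,
    so volume scales with diameter cubed.
    """
    volume_ratio = mass_ratio * rho_secondary / rho_primary
    return volume_ratio ** (1.0 / 3.0)


# Article figures: ~200x mass ratio, 2.6 and 1.51 tons per cubic meter.
ratio = implied_diameter_ratio(mass_ratio=200.0, rho_primary=2.6, rho_secondary=1.51)
print(f"Didymos is roughly {ratio:.1f} times wider than Dimorphos")
```

Under that assumption, the larger asteroid comes out a little under five times the width of its moonlet. The lower density of Dimorphos means the mass gap is wider than the size difference alone would suggest.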

Planetary defense

Makadia’s findings confirm the models we used to estimate the consequences of the DART impact: The Didymos system still poses zero threat to us, at least for the next 100 years or so. “The pre-DART condition was that the closest the Didymos system can get to Earth was around 15 lunar distances, and this has not changed appreciably,” Makadia explained.

The goal of DART was primarily to take our planetary defense out of the realm of computer models and get us some hands-on, practical experience, and Makadia thinks we succeeded in doing that. “Our work proves that hitting the secondary asteroid is a viable path for deflecting a binary system away as long as the push is large enough,” he said. “This wasn’t the goal of DART, but we can always design a bigger spacecraft.”

This experience applies both to deflecting binary asteroid systems like Didymos and singular objects. “Our results definitely help us in all sorts of future kinetic impact endeavors,” Makadia added.

The final verification of the DART mission’s consequences, though, will come in late 2026, when the European Space Agency’s Hera spacecraft will arrive at the Didymos system.

By performing independent, in-situ measurements of things like the density of Didymos and Dimorphos, Hera will provide a lot of precise gravitational and physical data that Makadia hopes to use to refine his calculations.

“It’s a high-fidelity instrument that hopefully will give us confirmation of what we believe,” Makadia said. “Plus, there are always new things to be found out when we visit an asteroid. I’m very excited about when Hera gets there.”

Science Advances, 2026. DOI: 10.1126/sciadv.aea4259

Jacek Krywko is a freelance science and technology writer who covers space exploration, artificial intelligence research, computer science, and all sorts of engineering wizardry.

Musk fails to block California data disclosure law he fears will ruin xAI


Musk can’t convince judge public doesn’t care about where AI training data comes from.

Elon Musk’s xAI has lost its bid for a preliminary injunction that would have temporarily blocked California from enforcing a law that requires AI firms to publicly share information about their training data.

xAI had tried to argue that California’s Assembly Bill 2013 (AB 2013) forced AI firms to disclose carefully guarded trade secrets.

The law requires AI developers whose models are accessible in the state to clearly explain which dataset sources were used to train models, when the data was collected, if the collection is ongoing, and whether the datasets include any data protected by copyrights, trademarks, or patents. Disclosures would also clarify whether companies licensed or purchased training data and whether the training data included any personal information. The disclosures would also help consumers assess how much synthetic data was used to train the model, which could serve as a measure of quality.

However, this information is precisely what makes xAI valuable, with its intensive data sourcing supposedly setting it apart from its biggest rivals, xAI argued. Allowing enforcement could be “economically devastating” to xAI, effectively reducing “the value of xAI’s trade secrets to zero,” the company’s complaint said. Further, xAI insisted, these disclosures “cannot possibly be helpful to consumers” while supposedly posing a real risk of gutting the entire AI industry.

Specifically, xAI argued that its dataset sources, dataset sizes, and cleaning methods were all trade secrets.

“If competitors could see the sources of all of xAI’s datasets or even the size of its datasets, competitors could evaluate both what data xAI has and how much they lack,” xAI argued. In one hypothetical, xAI speculated that “if OpenAI (another leading AI company) were to discover that xAI was using an important dataset to train its models that OpenAI was not, OpenAI would almost certainly acquire that dataset to train its own model, and vice versa.”

However, in an order issued on Wednesday, US District Judge Jesus Bernal said that xAI failed to show that California’s law, which took effect in January, required the company to reveal any trade secrets.

xAI’s biggest problem was being too vague about the harms it faced if the law was not halted, the judge said. Instead of explaining why the disclosures could directly harm xAI, the company offered only “a variety of general allegations about the importance of datasets in developing AI models and why they are kept secret,” Bernal wrote, describing xAI as trading in “frequent abstractions and hypotheticals.”

He denied xAI’s motion for a preliminary injunction while supporting the government’s interest in helping the public assess how the latest AI models were trained.

The lawsuit will continue, but xAI will have to comply with California’s law in the meantime. That could see Musk sharing information he’d rather OpenAI had no knowledge of at a time when he’s embroiled in several lawsuits against the leading AI firm he now regrets helping to found.

While not ending the fight to keep OpenAI away from xAI’s training data, this week’s ruling is another defeat for Musk after a judge last month tossed one of his OpenAI lawsuits, ruling that Musk had no proof that OpenAI had stolen trade secrets.

xAI argued California wants to silence Grok

xAI’s complaint argued that California’s law was unconstitutional since data can be considered a trade secret under the Fifth Amendment. The company also argued that the state was trying to regulate the outputs of xAI’s controversial chatbot, Grok, and was unfairly compelling speech from xAI while exempting other firms for security purposes.

At this stage of the litigation, Bernal disagreed that xAI might be irreparably harmed if the law was not halted.

On the Fifth Amendment claim, the judge said it’s not that training data could never be considered a trade secret. It’s just that xAI “has not identified any dataset or approach to cleaning and using datasets that is distinct from its competitors in a manner warranting trade secret protection.”

“It is not lost on the Court the important role of datasets in AI training and development, and that, hypothetically, datasets and details about them could be trade secrets,” Bernal wrote. But xAI “has not alleged that it actually uses datasets that are unique, that it has meaningfully larger or smaller datasets than competitors, or that it cleans its datasets in unique ways.”

Therefore, xAI is not likely to succeed on the merits of its Fifth Amendment claim.

The same goes for First Amendment arguments. xAI failed to show that the law improperly “forces developers to publicly disclose their data sources in an attempt to identify what California deems to be ‘data riddled with implicit and explicit biases,’” Bernal wrote.

xAI argued that the state seemed to be trying to use the law to influence the outputs of its chatbot Grok, which the company said should be protected commercial speech.

Over the past year, Grok has increasingly drawn global public scrutiny for its antisemitic rants and for generating nonconsensual intimate imagery (NCII) and child sexual abuse materials (CSAM). But despite these scandals, which prompted a California probe, Bernal contradicted xAI, saying California did not appear to be trying to regulate controversial or biased outputs, as xAI feared.

“Nothing in the language of the statute suggests that California is attempting to influence Plaintiff’s models’ outputs by requiring dataset disclosure,” Bernal wrote.

Addressing xAI’s other speech concerns, he noted that “the statute does not functionally ask Plaintiff to share its opinions on the role of certain datasets in AI model development or make ideological statements about the utility of various datasets or cleaning methods.”

“No part of the statute indicates any plan to regulate or censor models based on the datasets with which they are developed and trained,” Bernal wrote.

Public “cannot possibly” care about AI training data

Perhaps most frustrating for xAI as it continues to fight to block the law, Bernal also disputed that the public had no interest in the training data disclosures.

“It strains credulity to essentially suggest that no consumer is capable of making a useful evaluation of Plaintiff’s AI models by reviewing information about the datasets used to train them and that therefore there is no substantial government interest advanced by this disclosure statute,” Bernal wrote.

He noted that the law simply requires companies to alert the public about information that can feasibly be used to weigh whether they want to use one model over another.

Nothing about the required disclosures is inherently political, the judge suggested, although some consumers might select or avoid certain models with perceived political biases. As an example, Bernal opined that consumers may want to know “if certain medical data or scientific information was used to train a model” to decide if they can trust the model “to be sufficiently comprehensively trained and reliable for the consumer’s purposes.”

“In the marketplace of AI models, AB 2013 requires AI model developers to provide information about training datasets, thereby giving the public information necessary to determine whether they will use—or rely on information produced by—Plaintiff’s model relative to the other options on the market,” Bernal wrote.

Moving forward, xAI seems to face an uphill battle to win this fight. It will need to gather more evidence to demonstrate that its datasets or cleaning methods are sufficiently unique to be considered trade secrets that give the company a competitive edge.

It will also likely have to deepen its arguments that consumers don’t care about disclosures and that the government has not explored less burdensome alternatives that could “achieve the goal of transparency for consumers,” Bernal suggested.

One possible path to a win could be proving that California’s law is so vague that it potentially puts xAI on the hook for disclosing its customers’ training data for individual Grok licenses. But Bernal emphasized that xAI “must actually face such a conundrum—rather than raising an abstract possible issue among AI systems developers—for the Court to make a determination on this issue.”

xAI did not respond to Ars’ request to comment.

A spokesperson for the California Department of Justice told Reuters that the department “celebrates this key win and remains committed to continuing our defense” of the law.

Ashley is a senior policy reporter for Ars Technica, dedicated to tracking social impacts of emerging policies and new technologies. She is a Chicago-based journalist with 20 years of experience.
