

We’ve had a Denisovan skull since the 1930s—only nobody knew


It’s a Denisovan? Always has been.

After years of mystery, we now know what at least one Denisovan looked like.

A 146,000-year-old skull from Harbin, China, belongs to a Denisovan, according to a recent study of proteins preserved inside the ancient bone. The paleoanthropologists who studied the Harbin skull in 2021 declared it a new (to us) species, Homo longi. But the Harbin skull still contains enough of its original proteins to tell a different story: A few of them matched specific proteins from Denisovan bones and teeth, as encoded in Denisovan DNA.

So Homo longi was a Denisovan all along, and thanks to the remarkably well-preserved skull, we finally know what the enigmatic Denisovans actually looked like.

The Harbin skull (left) and the Dali skull (right).

Credit: Ni et al. 2021

Unmasking Dragon Man 

Paleoanthropologist Qiang Ji, of the Chinese Academy of Sciences, and colleagues tried to sequence ancient DNA from several samples of the Harbin skull’s bone and its one remaining tooth, but they had no luck. Proteins tend to be hardier molecules than DNA, though, and in samples from the skull’s temporal bone (one of the pair of bones on the sides of the head, just behind the cheekbones), the researchers struck pay dirt.

They found fragments of a total of 95 proteins. Four of these had variations that were distinct to the Denisovan lineage, and the Harbin skull matched Denisovans on three of them. That’s enough to confidently say that the Harbin skull had belonged to a Denisovan. So for the past few years, we’ve had images of an almost uncannily well-preserved Denisovan skull—which is a pretty big deal, especially when you consider its complicated history.

The world is now aware of it, but for most of the time since its discovery in the 1930s, only one person knew what the skull looked like. It was unearthed in Harbin, in northeast China, during the Japanese occupation of the area. Not wanting it to be seized by the occupying government, the person who found the skull immediately hid it, and he kept it hidden for most of the rest of his life.

He eventually turned it over to scientists in 2018, who published their analysis in 2021. That analysis placed the Harbin skull, along with a number of other fossils from China, in a distinct lineage within our genus, Homo, making them our species’ closest fossil relatives. They called this alleged new species Homo longi, or “Dragon Man.”

The decision to classify Homo longi as a new species was largely due to the skull’s unique combination of features (which we’ll discuss below). But it was a controversial decision, partly because paleoanthropologists don’t entirely agree about whether we should even call Neanderthals a distinct species. If the line between Neanderthals and our species is that blurry, many in the field have questioned whether Homo longi could be considered a distinct species, when it’s even closer to us than the Neanderthals.

Meanwhile, the 2021 paper also left room for debate on whether the skull might actually have belonged to a Denisovan rather than a distinct new species. Its authors acknowledge that one of the fossils they label as Homo longi had already been identified as a Denisovan based on its protein sequences. They also point out that the Harbin skull has rather large molars, which seem to be a common feature in Denisovans.

The paper’s authors argued that their Homo longi should be a separate branch of the hominin lineage, more closely related to us than to Denisovans or Neanderthals. But if the Harbin skull looked so much like Denisovan fossils and so little like fossils from our species, the alleged relationship begins to look pretty dubious. In the end, the 2021 paper’s authors dodged the issue by saying that “new genetic material will test the relationship of these populations to each other and to the Denisovans.”

Which turned out to be exactly what happened.

A ghost lineage comes to life

Denisovans are the ghost in our family tree. For scientists, a “ghost lineage” is one that’s known mostly from genetic evidence, not fossils; like a ghost, it has a presence we can sense but no physical form we can touch. With the extremely well-preserved Harbin skull identified as a Denisovan, though, we’re finally able to look our “ghost” cousins in the face.

Paleogeneticists have recovered Denisovan DNA from tiny fragments of bone and teeth, and even from the soil of a cave floor. Genomics researchers have found segments of Denisovan DNA woven into the genomes of some modern humans, revealing just how close our two species once were. But the handful of Denisovan fossils paleoanthropologists have unearthed are mostly small fragments—a finger bone here, a tooth there, a jawbone someplace else—that don’t reveal much about how Denisovans lived or what they looked like.

We know they existed and that they were something slightly different from Homo sapiens or Neanderthals. We even know when and where they lived and a surprising amount about their genetics, and we have some very strong hints about how they interacted with our species and with Neanderthals. But we didn’t really know what they looked like, and we couldn’t hope to identify their fossils without turning to DNA or protein sequences.

Until now.

Neanderthals and Denisovans probably enjoyed the view from Denisova Cave, too. Credit: loronet / Flickr

The face of a Denisovan

So what did a Denisovan look like? Harbin 1 has a wide, flattish face with small cheekbones, big eye sockets, and a heavy brow. Its upper jaw juts forward just a little, and it had big, robust molars. The cranium itself is longer and less dome-like than ours, but it’s roomy enough for a big brain (about 1,420 cubic centimeters).

Some of those traits, like the large molars and the long, low cranium, resemble those of earlier hominin species such as Homo erectus or Homo heidelbergensis. Others, like a relatively flat face, set beneath the cranium instead of sticking out in front of it, look more like us. (Early hominins, like Australopithecus afarensis, don’t really have foreheads because their skulls are arranged so their brains are right behind their faces instead of partly above them, like ours.)

In other words, Harbin’s features are what paleoanthropologists call a mosaic, with some traits that look like they come from older lineages and some that seem more modern. Mosaics are common in the hominin family tree.

But for all the detail it reveals about the Denisovans, Harbin is still just one skull from one individual. Imagine trying to reconstruct all the diversity of human faces from just one skull. We have to assume that Denisovans—a species that spanned a huge swath of our planet, from Siberia to Taiwan, and a wide range of environments, from high-altitude plateaus in Tibet to subtropical forests—were also a pretty diverse species.

It’s also worth remembering that the Harbin skull is exactly that: a skull. It can’t tell us much about how tall its owner was, how they were built, or how they moved or worked during their life. We can’t even say for sure whether Harbin is osteologically or genetically male or female. In other words, some of the mystery of the Denisovans still endures.

What’s next?

In the 2021 papers, the researchers noted that the Harbin skull also bears a resemblance to a 200,000- to 260,000-year-old skull found in Dali County in northwestern China, a roughly 300,000-year-old skull found in Hualong Cave in eastern China, and a 260,000-year-old skull from Jinniushi (sometimes spelled Jinniushan) Cave in China. And some fossils from Taiwan and northern China have molars that look an awful lot like those in that Tibetan jawbone.

“These hominins potentially also belong to Denisovan populations,” write Ji and colleagues. That means we might already have a better sample of Denisovan diversity than this one skull suggests.

And, like the Harbin skull, the bones and teeth of those other fossils may hold ancient DNA or proteins that could help confirm that intriguing possibility.

Science, 2025. DOI: 10.1126/science.adu9677 (About DOIs).


Kiona is a freelance science journalist and resident archaeology nerd at Ars Technica.



2025 Audi S5 and A5 first drive: Five-door is the new four-door

The S5 is eager and more engaging to drive than the A5. Jonathan Gitlin

Like the Q5 last week, the A5 and S5 use a new electronic architecture called E3 1.2. This is a clean-sheet approach to the various electronic subsystems in the car, replacing decades of legacy cruft and more than a hundred individual electronic control units with five powerful high-performance computers, each with responsibility for a different domain: ride and handling, infotainment, driver assists, and convenience functions, all overseen by a master computer.

On the road

Sadly, those looking for driver engagement will not find much in the A5. Despite the improvements to the front suspension, there’s still very little in the way of feedback, and in Comfort mode, the steering was too light, at least for me. In Dynamic mode, on the other hand, the car felt extremely sure-footed in bad weather. The A5 makes do with conventional springs, so the ride doesn’t change between drive modes, but Audi has tuned it well, and the car is not too firm. I noted a fair amount of wind noise, despite the acoustic front glass that comes with the ($6,450) Prestige package.

The S5 will appeal much more to driving enthusiasts. The steering provides a better picture of what the front tires are doing, and the air suspension gives the car a supple ride, albeit one that gets firmer in Balanced and then Dynamic modes. Like some other recent fast Audis, the car is deceptively quick, and because it’s quite quiet and smooth, you can find yourself going a good deal faster than you thought. The S5’s exhaust note also sounds rather pleasant and not obnoxious.

The A5 cabin has a similar layout to the Q5 and Q6 e-tron SUVs. Audi

The A5 starts at $49,700, but the $3,600 Premium Plus package is likely a must-have, as this adds adaptive cruise control, a heads-up display, top-down parking cameras, and some other features (including USB-C ports). If you want to get really fancy, the Prestige pack adds speakers in the front headrests, OLED taillights, the aforementioned acoustic glass, plus a second infotainment screen for the front passenger.

Meanwhile, the S5 starts at $62,700; the Premium Plus package (which adds mostly the same stuff) will set you back $3,800. For the S5, the $7,550 Prestige pack includes front sports seats, Nappa leather, rear window sunshades, the passenger display, and the adaptive sports suspension. Those are all some hefty numbers, but the A5 and S5 are actually both cheaper in real terms than the models launched in 2018, once you take seven years’ worth of inflation into account.



o3 Turns Pro

You can now have o3 throw vastly more compute at a given problem. That’s o3-pro.

Should you have o3 throw vastly more compute at a given problem, if you are paying the $200/month subscription price for ChatGPT Pro? Should you pay the $200, or the order of magnitude markup over o3 to use o3-pro in the API?

That’s trickier. Sometimes yes. Sometimes no. My experience so far is that waiting a long time is annoying, sufficiently annoying that you often won’t want to wait. Whenever I ask o3-pro something, I usually also ask o3 and Opus.

Using the API at scale seems prohibitively expensive for what you get, and you can (and should) instead run parallel queries using the chat interface.
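If you would rather script that side-by-side comparison than juggle chat tabs, here is a minimal sketch of the parallel-query workflow. The OpenAI Python SDK’s Responses API and the model IDs “o3” and “o3-pro” are assumptions based on current SDK conventions, not anything specified in this post; treat the call shapes as illustrative.

```python
# Minimal sketch: send one prompt to o3 and o3-pro in parallel and compare answers.
# Assumptions: the OpenAI Python SDK's Responses API and the model IDs "o3" and "o3-pro".
from concurrent.futures import ThreadPoolExecutor

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def ask(model: str, prompt: str) -> tuple[str, str]:
    resp = client.responses.create(model=model, input=prompt)
    return model, resp.output_text


prompt = "Explain the trade-offs of waiting 15+ minutes for a marginally better answer."
with ThreadPoolExecutor() as pool:
    futures = [pool.submit(ask, m, prompt) for m in ("o3", "o3-pro")]
    for future in futures:
        model, answer = future.result()  # o3 returns in minutes; o3-pro can take 15+
        print(f"--- {model} ---\n{answer}\n")
```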

The o3-pro answers have so far definitely been better than o3’s, but the wait is usually enough to break my workflow and human context window in meaningful ways – fifteen minutes plus variance is past the key breakpoint, such that it would not have been substantially more painful to fully wait for Deep Research.

Indeed, the baseline workflow feels similar to Deep Research, in that you fire off a query and then eventually you context shift back and look at it. But if you are paying the subscription price already it’s often worth queuing up a question and then having it ready later if it is useful.

In many ways o3-pro still feels like o3, only modestly better in exchange for being slower. Otherwise, same niche. If you were already thinking ‘I want to use Opus rather than o3’ chances are you want Opus rather than, or in addition to, o3-pro.

Perhaps the most interesting claim, from some including Tyler Cowen, was that o3-pro is perhaps not a lying liar, and hallucinates far less than o3. If this is true, in many situations it would be worth using for that reason alone, provided the timing allows this. The bad news is that it didn’t improve on a Confabulations benchmark.

My poll (n=19) was roughly evenly split on this question.

My hunch, based on my use so far, is that o3-pro is hallucinating modestly less because:

  1. It is more likely to find or know the right answer to a given question, which is likely to be especially relevant to Tyler’s observations.

  2. It is considering its answer a lot, so it usually won’t start writing an answer and then think ‘oh I guess that start means I will provide some sort of answer’ like o3.

  3. The queries you send are more likely to be well-considered to avoid the common mistake of essentially asking for hallucinations.

But for now I think you still have to have a lot of the o3 skepticism.

And as always, the next thing will be here soon: Gemini 2.5 Pro Deep Think is coming.

Pliny of course jailbroke it, for those wondering. Pliny also offers us the tools and channels information.

My poll strongly suggested o3-pro is slightly stronger than o3.

Greg Brockman (OpenAI): o3-pro is much stronger than o3.

OpenAI: In expert evaluations, reviewers consistently prefer OpenAI o3-pro over o3, highlighting its improved performance in key domains—including science, education, programming, data analysis, and writing.

Reviewers also rated o3-pro consistently higher for clarity, comprehensiveness, instruction-following, and accuracy.

Like OpenAI o1-pro, OpenAI o3-pro excels at math, science, and coding as shown in academic evaluations.

To assess the key strength of OpenAI o3-pro, we once again use our rigorous “4/4 reliability” evaluation, where a model is considered successful only if it correctly answers a question in all four attempts, not just one.

OpenAI o3-pro has access to tools that make ChatGPT useful—it can search the web, analyze files, reason about visual inputs, use Python, personalize responses using memory, and more.

Sam Altman: o3-pro is rolling out now for all chatgpt pro users and in the api.

it is really smart! i didnt believe the win rates relative to o3 the first time i saw them.

Arena has gotten quite silly if treated as a comprehensive measure (as in Gemini 2.5 Flash is rated above o3), but as a quick heuristic, if we take a 64% win rate seriously, that would by the math put o3-pro ~100 above o3 at 1509 on Arena, crushing Gemini-2.5-Pro for the #1 spot. I would assume that most pairwise comparisons would have a less impressive jump, since o3-pro is essentially offering the same product as o3 only somewhat better, which means the result will be a lot less noisy than if it was up against Gemini.
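For what it’s worth, that “~100 points” figure falls straight out of the logistic Elo model Arena uses; a quick sketch of the arithmetic:

```python
import math

# Under the logistic Elo model, an expected win rate p maps to a rating gap of
# 400 * log10(p / (1 - p)). A 64% head-to-head win rate is roughly a 100-point gap.
win_rate = 0.64
elo_gap = 400 * math.log10(win_rate / (1 - win_rate))
print(round(elo_gap))  # ~100; added to o3's Arena rating, that's where the ~1509 above comes from
```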

So this both is a very impressive statistic and also doesn’t mean much of anything.

The problem with o3-pro is that it is slow.

Nearcyan: one funny note is that minor UX differences in how you display ‘thinking’/loading/etc can easily move products from the bottom half of this meme to the top half.

Another note is anyone I know who is the guy in the bottom left is always extremely smart and a pleasure to speak with.

the real problem is I may be closer to the top right than the bottom left

Today I had my first instance of noticing I’d gotten a text (during the night, in this case) and they got a response 20 minutes slower than they would have otherwise because I waited for o3-pro to give its answer to the question I’d been asked.

Thus, even with access to o3-pro at zero marginal compute cost, almost half of people reported they rarely use it for a given query, and only about a quarter said they usually use it.

It is also super frustrating to run into errors when you are waiting 15+ minutes for a response, and reports of such errors were common, which matches my experience.

Bindu Reddy: o3-Pro Is Not Very Good At Agentic Coding And Doesn’t Score Higher Than o3 😿

After a lot of waiting and numerous retries, we have finally deployed o3-pro on LiveBench AI.

Sadly, the overall score doesn’t improve over o3 🤷‍♂️

Mainly because it’s not very agentic and isn’t very good at tool use… it scores way below o3 on the agentic-coding category.

The big story yesterday was not o3-pro but the price decrease in o3!!

Dominik Lukes: I think this take by @bindureddy very much matches the vibes I’m getting: it does not “feel” very agentic and as ready to reach for the right tools as o3 is – but it could just be because o3 keeps you informed about what it’s doing in the CoT trace.

I certainly would try o3-pro in cases where o3 was failing, if I’d already also tried Opus and Gemini first. I wonder if that agentic coding score drop actually represents an issue here: because o3-pro is meant to reason for longer and they don’t want it endlessly web searching, is it not properly inclined to exploit tools?

o3-pro gets 8.5/10 on BaldurBench, which is about creating detailed build guides for rapidly changing video games. Somewhat subjective but should still work.

L Zahir: bombs all my secret benchmarks, no better than o3.

Lech Mazur gives us four of his benchmarks: A small improvement over o3 for Creative Writing Benchmark, a substantial boost from 79.5% (o3) or 82.5% (o1-pro) to 87.3% on Word Connections, no improvement on Thematic Generalization, very little improvement on Confabulations (avoiding hallucinations). The last one seems the most important to note.

Tyler Cowen was very positive, and he seems like the perfect customer for o3-pro? By which I mean he can context shift easily so he doesn’t mind waiting, and also often uses queries where these models get a lot of value out of going at problems super hard, and relatively less value out of the advantages of other models (doesn’t want the personality, doesn’t want to code, and so on).

Tyler Cowen: It is very, very good. Hallucinates far less than other models. Can solve economics problems that o3 cannot. It can be slow, but that is what we have Twitter scrolling for, right? While we are waiting for o3 pro to answer a query we can read about o3 pro.

Contrast that with the score on Confabulations not changing. I am guessing there is a modest improvement, for reasons described earlier.

There are a number of people pointing out places where o3-pro solves something o3 doesn’t, such as here, where it solved the gimbal UAP mystery in 18 minutes.

McKay Wrigley, eternal optimist, agrees on many fronts.

McKay Wrigley: My last 4 o3 Pro requests in ChatGPT… It thought for: 26m 10s, 23m 45s, 19m 6s, and 21m 18s. Absolute *powerhouse* of a model.

Testing how well it can 1-shot complex problems – impressed so far.

It’s too slow to use as a daily driver model (makes sense, it’s a beast!), but it’s a great “escalate this issue” model. If the current model you’re using is struggling with a task, then escalate it to o3 pro.

This is not a “vibe code” model.

This is the kind of model where you’ll want to see how useful it is to people like Terence Tao and Tyler Cowen.

Btw the point of this post was that I’m happy to have a model that is allowed to think for a long time.

To me that’s the entire point of having a “Pro” version of the model – let it think!

Obviously more goes into evaluating if it’s a great model (imo it’s really powerful).

Here’s a different kind of vibe coding, perhaps?

Conrad Barski: For programming tasks, I can give o3 pro some code that needs a significant revision, then ramble on and on about what the various attributes of the revision need to be and then it can reliably generate an implementation of the revision.

It feels like with previous models I had to give them more hand holding to get good results, I had to write my requests in a more thoughtful, structured way, spending more time on prompting technique.

o3 pro, on the other hand, can take loosely-connected constraints and then “fill in the gaps” in a relatively intelligent way – I feel it does this better than any other model so far.

The time cost and dollar costs are very real.

Matt Shumer: My initial take on o3 Pro:

It is not a daily-driver coding model.

It’s a superhuman researcher + structured thinker, capable of taking in massive amounts of data and uncovering insights you would probably miss on your own.

Use it accordingly.

I reserve the right to alter my take.

Bayram Annokov: slow, expensive, and veeeery good – definitely a jump up in analytical tasks

Emad: 20 o3 prompts > o3 pro except for some really advanced specific stuff, I have found. Only use it as a final check really, or when stumped.

Eyes Alight: it is so very slow it took 13 minutes to answer a trivial question about a post on Twitter. I understand the appeal intellectually of an Einstein at 1/20th speed, but in reality I’m not sure I have the patience for it.

Clay: o3-pro achieving breakthrough performance in taking a long time to think.

Dominik Lukes: Here’s my o3 Pro testing results thread. Preliminary conclusions:

– great at analysis

– slow and overthinking simple problems

– o3 is enough for most tasks

– still fails SVG bike and local LLM research test

– very few people need it

– it will take time to develop a feel for it

Kostya Medvedovsky: For a lot of problems, it reminds me very strongly of Deep Research. Takes about the same amount of time, and will spend a lot of effort scouring the web for the answer to the question.

Makes me wish I could optionally turn off web access and get it to focus more on the reasoning aspect.

This may be user error and I should be giving it *way* more context.

Violet: you can turn search off, and only turn search on for specific prompts.

Xeophon: TL;DR:

o3 pro is another step up, but for going deep, not wide. It is good to go down one path, solve one problem; not for getting a broad overview about different topics/papers etc. Then it hallucinates badly, use ODR for this.

Part of ‘I am very intelligent’ is knowing when to think for longer and when not to. In that sense, o3-pro is not so smart; you have to take care of that question yourself. I do understand why this decision was made: let the user control that.

I agree with Lukes that most people do not ‘need’ o3 pro and they will be fine not paying for it, and for now they are better off with their expensive subscription (if any) being Claude Max. But even if you don’t need it, the queries you benefit from can still be highly useful.

It makes sense to default to using Opus and o3 pro (and for quick stuff Sonnet).

o3-pro is too slow to be a good ‘default’ model, especially for coding. I don’t want to have to reload my state in 15 minute intervals. It may or may not be good for the ‘call in the big guns’ role in coding, where you have a problem that Opus and Gemini (and perhaps regular o3) have failed to solve, but which you think o3-pro might get.

Here’s one that seems centrally wrong but also makes an important point:

Nabeel Qureshi: You need to think pretty hard to get a set of evals which allows you to even distinguish between o3 and o3 pro.

Implication: “good enough AGI” is already here.

The obvious evals where it does better are Codeforces, and also ‘user preferences.’ Tyler Cowen’s statement suggests hallucination rate, which is huge if true (and it better be true, I’m not waiting 20 minutes that often to get an o3-level lying liar.) Tyler also reports there are questions where o3 fails and o3-pro succeeds, which is definitive if the gap is only one way. And of course if all else fails you can always have them do things like play board games against each other, as one answer suggests.

Nor do I think either o3 or o3-pro is the AGI you are looking for.

However, it is true that for a large percentage of tasks, o3 is ‘good enough.’ That’s even true in a strict sense for Claude Sonnet or even Gemini Flash. Most of the time one has a query, the amount of actually needed intelligence is small.

In the limit, we’ll have to rely on AIs to tell us which AI model is smarter, because we won’t be smart enough to tell the difference. What a weird future.

(Incidentally, this has already been the case in chess for years. Humans cannot tell the difference between a 3300 elo and a 3600 elo chess engine; we just make them fight it out and count the number of wins.)

You can tell 3300 from 3600 in chess, but only because you can tell who won. If almost any human looked at individual moves, you’d have very little idea.

I always appreciate people thinking at the limit rather than only on the margin. This is a central case of that.

Here’s one report that it’s doing well on the fully informal FictionBench:

Chris: Going to bed now, but had to share something crazy: been testing the o3 pro model, and honestly, the writing capabilities are astounding. Even with simple prompts, it crafts medium to long-form stories that make me deeply invested and are engaging; they come with surprising twists, and each one carries this profound, meaningful depth that feels genuinely human.

The creativity behind these narratives is wild, far beyond what I’d expect from most writers today. We’re talking sophisticated character development, nuanced plot arcs, and emotional resonance, all generated seamlessly. It’s genuinely hard to believe this is early-stage reinforcement learning with compute added at test time; the potential here is mind blowing. We’re witnessing just the beginning of AI-enhanced storytelling, and already it’s surpassing what many humans can create. Excited to see what’s next with o4. Goodnight!

This contrasts with:

Archivedvideos: Really like it for technical stuff, soulless

Julius: I asked it to edit an essay and it took 13 minutes and provided mediocre results. Different from but slightly below the quality of 4o. Much worse than o3 or either Claude 4 model

Other positive reactions include Matt Wigdahl being impressed on a hairy RDP-related problem, a66mike99 getting interesting output and pushback on the request (in general I like this, although if you’re thinking for 20 minutes this could be a lot more frustrating?), niplav being impressed by results on a second attempt after Claude crafted a better prompt (this seems like an excellent workflow!), and Sithis3 saying o3-pro solves many problems o3 struggles on.

The obvious counterpoint is some people didn’t get good responses, and saw it repeating the flaws in o3.

Erik Hoel: First o3 pro usage. Many mistakes. Massive overconfidence. Clear inability to distinguish citations, pay attention to dates. Does anyone else actually use these models? They may be smarter on paper but they are increasingly lazy and evil in practice.

Kukutz: very very very slow, not so clever (can’t solve my semantic puzzle).

Allen: I think it’s less of an upgrade compared to base model than o1-pro was. Its general quality is better on avg but doesn’t seem to hit “next-level” on any marks. Usually mentions the same things as o3.

I think OAI are focused on delivering GPT-5 more than anything.

This thread from Xeophon features reactions that are mixed but mostly meh.

Or to some it simply doesn’t feel like much of a change at all.

Nikita Sokolsky: Feels like o3’s outputs after you fix the grammar and writing in Claude/Gemini: it writes less concisely but haven’t seen any “next level” prompt responses just yet.

MartinDeVido: Meh….

Here’s a fun reminder that details can matter a lot:

John Hughes: I was thrilled yesterday: o3-pro was accepting ~150k tokens of context (similar to Opus), a big step up from regular o3, which allows only a third as much in ChatGPT. @openai seems to have changed that today. Queries I could do yesterday are now rejected as too long.

With such a low context limit, o3-pro is much less useful to lawyers than o1-pro was. Regular o3 is great for quick questions/mini-research, but Gemini is better at analyzing long docs and Opus is tops for coding. Not yet seeing answers where o3-pro is noticeably better than o3.

I presume that even at $200/month, the compute costs of letting o3-pro have 150k input tokens would add up fast, if people actually used it a lot.

This is one of the things I’ve loved the most so far about o3-pro.

Jerry Liu: o3-pro is extremely good at reasoning, extremely slow, and extremely concise – a top-notch consultant that will take a few minutes to think, and output bullet points.

Do not ask it to write essays for you.

o3-pro will make you wait, but its answer will not waste your time. This is a sharp contrast to Deep Research queries, which will take forever to generate and then include a ton of slop.

It is not the main point but I must note the absence of a system card update. When you are releasing what is likely the most powerful model out there, o3-pro, was everything you needed to say truly already addressed by the model card for o3?

OpenAI: As o3-pro uses the same underlying model as o3, full safety details can be found in the o3 system card.

Miles Brundage: This last sentence seems false?

The system card does not appear to have been updated even to incorporate the information in this thread.

The whole point of the term system card is that the model isn’t the only thing that matters.

If they didn’t do a full Preparedness Framework assessment, e.g. because the evals weren’t too different and they didn’t consider it a good use of time given other coming launches, they should just say that, I think.

If o3-pro were the max capability level, I wouldn’t be super concerned about this, and I actually suspect it is the same Preparedness Framework level as o3.

The problem is that this is not the last launch, and lax processes/corner-cutting/groupthink get more dangerous each day.

As OpenAI put it, ‘there’s no such thing as a small launch.’

The link they provide goes to ‘Model Release Notes,’ which is not quite nothing, but it isn’t much and does not include a Preparedness Framework evaluation.

I agree with Miles that if you don’t want to provide a system card for o3-pro, that can be This Is Fine, but you need to state your case for why you don’t need one. This can be any of:

  1. The old system card tested for what happens at higher inference costs (as it should!) so we effectively were testing o3-pro the whole time, and we’re fine.

  2. The Preparedness team tested o3-pro and found it not appreciably different from o3 in the ways we care about, providing no substantial additional uplift or other concerns, despite looking impressive in some other ways.

  3. This is only available at the $200 level so not a release of o3-pro so it doesn’t count (I don’t actually think this is okay, but it would be consistent with previous decisions I also think aren’t okay, and not an additional issue.)

As far as I can tell we’re basically in scenario #2, and they see no serious issues here. Which again is fine if true, and if they actually tell us that this is the case. But the framework is full of ‘here are the test results’ and presumably those results are different now. I want o3-pro on those charts.

What about alignment otherwise? Hard to say. I did notice this (but did not attempt to make heads or tails of the linked thread), seems like what you would naively expect:

Yeshua God: Following the mesa-optimiser recipe to the letter. @aidan_mclau very troubling.

For many purposes, the 80% price cut in o3 seems more impactful than o3-pro. That’s a huge price cut, whereas o3-pro is still largely a ‘special cases only’ model.

Aaron Levie: With OpenAI dropping the price of o3 by 80%, today is a great reminder about how important it is to build for where AI is going instead of just what’s possible now. You can now get 5X the amount of output today for the same price you were paying yesterday.

If you’re building AI Agents, it means it’s far better to build capabilities that are priced and designed for the future instead of just economically reasonable today.

In general, we know there’s a tight correlation between the amount of compute spent on a problem and the level of successful outcomes we can get from AI. This is especially true with AI Agents that potentially can burn through hundreds of thousands or millions of tokens on a single task.

You’re always making trade-off decisions when building AI Agents around what level of accuracy or success you want and how much you want to spend: do you want to spend $0.10 for something to be 95% successful or $1 for something to be 99% successful? A 10X increase in cost for just a 4 pt improvement in results? At every price:success intersection a new set of use-cases from customers can be unlocked.

Normally when building technology that moves at a typical pace, you would primarily build features that are economically viable today (or with some slight efficiency gains anticipated at the rate of Moore’s Law, for instance). You’d be out of business otherwise. But with the cost of AI inference dropping rapidly, the calculus completely changes. In a world where the cost of inference could drop by orders of magnitude in a year or two, it means the way we build software to anticipate these cost drops changes meaningfully.

Instead of either building in lots of hacks to reduce costs, or going after only the most economically feasible use-cases today, this instructs you to build the more ambitious AI Agent capabilities that would normally seem too cost prohibitive to go after. Huge implications for how we build AI Agents and the kind of problems to go after.

I would say the cost of inference not only might drop an order of magnitude in a year or two; if you hold quality of outputs constant, it is all but certain to do so at least one more time. Where you ‘take your profits’ in quality versus quantity is up to you.
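To make Levie’s price-versus-success trade-off concrete, here is a toy calculation using the hypothetical numbers from his example ($0.10 at 95 percent success versus $1 at 99 percent), under the simple assumption that failed attempts are detected and independently retried:

```python
# Toy model: if failures are detected and retried, attempts are geometric (1/p on
# average), so the expected spend per completed task is cost_per_attempt / success_rate.
def expected_cost_per_success(cost_per_attempt: float, success_rate: float) -> float:
    return cost_per_attempt / success_rate


for cost, p in [(0.10, 0.95), (1.00, 0.99)]:
    print(f"${cost:.2f}/attempt at {p:.0%} success -> "
          f"${expected_cost_per_success(cost, p):.3f} per completed task")
```

Whether that roughly 10x jump in effective cost is worth four points of reliability depends, as Levie says, on how expensive a failure is to detect and recover from in each particular use case.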




How Tesla Takedown got its start


America’s most vulnerable billionaire?

It’s an unlikely coalition that’s been hyping Tesla’s stock slide since its launch.

On a sunny April afternoon in Seattle, around 40 activists gathered at the Pine Box, a beer and pizza bar in the sometimes scruffy Capitol Hill neighborhood. The group had reserved a side room attached to the outside patio; before remarks began, attendees flowed in and out, enjoying the warm day. Someone set up a sound system. Then the activists settled in, straining their ears as the streamed call crackled through less-than-perfect speakers.

In more than a decade of climate organizing, it was the first time Emily Johnston, one of the group’s leaders, had attended a happy hour to listen to a company’s quarterly earnings call. Also the first time a local TV station showed up to cover such a happy hour. “This whole campaign has been just a magnet for attention,” she says.

The group, officially called the Troublemakers, was rewarded right away. Tesla CEO Elon Musk started the investors’ call for the first quarter of 2025 with a sideways acknowledgement of exactly the work the group had been doing for the past two months. He called out the nationwide backlash to the so-called Department of Government Efficiency, or DOGE, an effort to cut government spending staffed by young tech enthusiasts and Musk company alumni, named—with typical Muskian Internet-brained flourish—for an early 2010s meme.

“Now, the protests you’ll see out there, they’re very organized, they’re paid for,” Musk told listeners. For weeks, thousands of people—including the Troublemakers—had camped outside Tesla showrooms, service centers, and charging stations. Musk suggested that not only were they paid for their time, they were only interested in his work because they had once received “wasteful largesse” from the federal government. Musk had presented the theory and sharpened it on his social media platform X for weeks. Now, he argued, the protesters were off the dole—and furious.

Musk offered no proof of his assertions; to a person, every protester who spoke to WIRED insisted that they are not being paid and are exactly what they appear to be: people who are angry at Elon Musk. They call their movement the “Tesla Takedown.”

Before Musk got on the call to speak to investors, Tesla, which arguably kicked off a now multitrillion-dollar effort to transition global autos to electricity, had presented them with one of the company’s worst quarterly financial reports in years. Net income was down 71 percent year over year; revenue fell more than $2 billion short of Wall Street’s expectations.

Now, in Seattle, just the first few minutes of Musk’s remarks left the partygoers, many veterans of the climate movement, giddy. Someone close to the staticky speakers repeated the best parts to the small crowd: “I think starting probably next month, May, my time allocation to DOGE will drop significantly,” Musk said. Under a spinning disco ball, people whooped and clapped. Someone held up a snapshot of Tesla’s stock performance over the past year, a jagged but falling black line.

“If you ever wanted to know that protest matters, here’s your proof,” Johnston recalled weeks later.

The Tesla Takedown, an effort to hit back at Musk and his wealth where it hurts, seems to have appeared at just the right time. Tesla skeptics have argued for years that the company, which has the highest market capitalization of any automaker, is overvalued. They contend that the company’s CEO has been able to distract from flawed fundamentals—an aging vehicle lineup, a Cybertruck sales flop, the much-delayed introduction of self-driving technology—with bluster and showmanship.

Musk’s interest in politics, which kicked into a new and more expensive gear when he went all in for Donald Trump during the 2024 election, was always going to invite more scrutiny for his business empire. But the grassroots movement, which began as a post on Bluesky, has become a boisterous, ragtag, and visible locus of, sorry to use the word, resistance against Musk and Trump. It’s hard to pin market moves on any one thing, but Tesla’s stock price is down some 33 percent since its end-of-2024 high.

Tesla Takedown points to a uniquely screwed-up moment in American politics. Down is up; up is down. A man who made a fortune sounding the alarm about the evils of the fossil fuel industry joined with it to spend hundreds of millions in support of a right-wing presidential candidate and became embedded in an administration with a slash-and-burn approach to environmental regulation. (This isn’t good for electric cars.) The same guy, once extolled as the real-life Tony Stark—he made a cameo in Iron Man 2!—has become for some a real-life comic book villain, his skulduggery enough to bring together a coalition of climate activists, freaked-out and laid-off federal workers, immigrant rights champions, union groups, PhDs deeply concerned about the future of American science, Ukraine partisans, liberal retirees sick of watching cable news, progressive parents hoping to show their kids how to stick up for their values, LGBTQ+ rights advocates, despondent veterans, and car and tech nerds who have been crying foul on Musk’s fantastical technology claims for years now.

To meet the moment, then, the Takedown uses a unique form of protest logic: Boycott and protest the electric car company not because the movement disagrees with its logic or mission—quite the opposite, even!—but because it might be the only way to materially affect the unelected, un-beholden-to-the-public guy at its head. And then hope the oft-irrational stock market catches on.

So for weeks, across cities like New York; Berkeley and Palo Alto, California; Meridian, Idaho; Ann Arbor, Michigan; Raleigh, North Carolina; South Salt Lake, Utah; and Austin, Texas, the thousands of people who make up the Takedown movement have been stationed outside of Tesla showrooms, making it a little bit uncomfortable to test drive one of Musk’s electric rides, or even just drive past in one.

Change in the air

When Shua Sanchez graduated from college in 2013, there was about a week, he remembers, when he was convinced that the most important thing he could do was work for Tesla. He had a degree in physics; he knew all about climate change and what was at stake. He felt called to causes, had been protesting since George W. Bush invaded Iraq when he was in middle school. Maybe his life’s work would be helping the world’s premier electric carmaker convince drivers that there was a cleaner and more beautiful life after fossil fuel.

In the end, though, Sanchez opted for a doctorate program focusing on the quantum properties of superconducting and magnetic materials. (“I shoot frozen magnets with lasers all day,” he jokes.) So he felt thankful for his choice a few years later when he read media reports about Tesla’s efforts to tamp down unionizing efforts at its factories. He felt more thankful when, in 2017, Musk signed on to two of Trump’s presidential advisory councils. (The CEO publicly departed them months later, after the administration pulled out of the Paris climate agreement.) Even more thankful in 2022, when Musk acquired Twitter with the near-express purpose of opening it up to extreme right-wing speech. More thankful still by the summer of 2024, after Musk officially endorsed Trump’s presidential bid.

By the time Musk appeared onstage at a rally following Trump’s inauguration in January 2025 and threw out what appeared to be a Nazi salute—Musk has denied that was what it was—Sanchez, now in a postdoctorate fellowship at the Massachusetts Institute of Technology, was ready to do something about it besides not taking a job at Tesla. A few days later, as reports of DOGE’s work began to leak out of Washington, a friend sent him a February 8 Bluesky post from a Boston-based disinformation scholar named Joan Donovan.

“If Musk thinks he can speed run through DC downloading personal data, we can certainly bang some pots and pans on the sidewalks in front of Tesla dealerships,” Donovan posted on the platform, already an online refuge for those looking for an alternative to Musk’s X. “Bring your friends and make a little noise. Organize locally, act globally.” She added a link to a list of Tesla locations, and a GIF of the Swedish Chef playing the drums on some vegetables with wooden spoons. Crucially, she appended the hashtag #TeslaTakeover. Later, the Internet would coalesce around a different rallying cry: #TeslaTakedown.

Baltimore-area residents protest the Trump administration and Tesla CEO Elon Musk at a Tesla car dealership as part of a boycott of Tesla vehicles. Saturday, March 29, 2025.

Credit: Dominic Gwinn/Getty


The post did not go viral. To date, it has only 175 likes. But it did catch the attention of actor and filmmaker Alex Winter. Winter shot to prominence in 1989’s Bill & Ted’s Excellent Adventure—he was Bill—and has more recently produced multiple documentaries focusing on online culture, piracy, and the power of social media. He and Donovan had bonded a few years earlier over activism and punk rock, and the actor, who has a larger social media following, asked the scholar if he could create a website to centralize the burgeoning movement. “I do think we’re at a point where people need to stick their necks up out of the foxhole en masse, or we’re simply not going to get through,” he tells WIRED. In the website’s first 12 hours of existence, he says, thousands of people registered to take part in the Takedown.

Donovan’s Bluesky post brought Sanchez to the Boston Back Bay Tesla showroom on Boylston Street the next Saturday, where 30 people had gathered with signs. For Sanchez, the whole thing felt personal. “Elon Musk started a PhD at Stanford in my field. He quit after two days and then went and became a tech bro, but he presents that he’s one of us,” he says. With Musk’s new visibility—and plans to slash government research dollars while promoting right-wing ideology—Sanchez was ready to push back.

Sanchez has been outside the showroom during weekly protests throughout the Boston winter, megaphone in hand, leading chants: “It ain’t fun. It ain’t funny. Elon Musk is stealing your money.” “We don’t want your Nazi cars. Take a one-way trip to Mars.”

“We make it fun, so a lot of people come back,” Sanchez says. Someone slapped Musk’s face on one of the inflatable tube guys you often see outside of car dealerships; he whipped around at several protests. A popular bubble-themed routine—“Tesla is a bubble”—saw protesters toss around a giant, transparent ball as others blew bubbles around it. Then the ball popped, loudly, during a protest—a sign? At some of Boston’s biggest actions, hundreds of people have shown up to demonstrate against Tesla, Musk, and Trump, Sanchez says.

Donovan envisioned the protests as potent visible responses to Musk’s slashing of government programs and jobs. But she also knew that social movements are a critical release valve in times of upheaval. “People need to relieve the pressure that they feel when the government is not doing the right thing,” she tells WIRED. “If you let that pressure build up too much, obviously it can turn very dangerous.”

In some ways, she’s right. In at least four incidents across four states, people have been charged by the federal government with various crimes including defacing, shooting at, throwing Molotov cocktails toward, and setting fire to Tesla showrooms and charging stations. In a move that has worried civil liberties experts, the Trump administration has treated these attacks against the president’s richest backer’s car company as “domestic terrorism,” granting federal authorities greater latitude and resources to track down alleged perpetrators and threatening them with up to 20 years in prison.

In posts on X and in public appearances, Musk and other federal officials have seemed to conflate the actions of a few allegedly violent people with the wider protests against Tesla, implying that both are funded by shadowy “generals.” “Firing bullets into showrooms and burning down cars is unacceptable,” Musk said at an event last month in which he appeared remotely on video, his face looming over the stage. “Those people will go to prison, and the people that funded them and organized them will also go to prison. Don’t worry.” He looked into the camera and pointed his finger at the audience. “We’re coming for you.”

Tesla Takedown participants and leaders have repeatedly said that the movement is nonviolent. “Authoritarian regimes have a long history of equating peaceful protest with violence. The #TeslaTakedown movement has always been and will remain nonviolent,” Dallas volunteer Stephanie Frizzell wrote in an email. What violence has occurred at protests themselves seems limited to on-site spats that mostly target protesters.

Donovan herself skipped some protests after receiving death threats and hearing a rumor that she was on a government list targeting disinformation researchers. On X, prominent right-wing accounts harassed her and other Takedown leaders; she says people have contacted her colleagues to try to get her fired.

Then, on the afternoon of March 6, Boston University ecology professor Nathan Phillips was in his office on campus when he received a panicked message from his wife. She said that two people claiming to represent the FBI visited their home. “I was just stunned,” Phillips says. “We both had a feeling of disbelief, that this must be some kind of hoax or a joke or something like that.”

Phillips had attended a Tesla Takedown event weeks earlier, but he wasn’t sure whether the visit was related to the protests or his previous climate activism. So after sitting shocked in his office for an hour, he called his local FBI field office. Someone picked up and asked for his information, he remembers, and then asked why he was calling. Phillips explained what had happened. “They just abruptly hung up on me,” he says.

Phillips never had additional contact from the FBI, but he knows of at least five other climate activists who were visited by men claiming to be from the agency on March 6.

The FBI tells WIRED that it “cannot confirm or deny the allegations” that two agents visited Phillips’ home. Tesla did not respond to WIRED’s questions about the Tesla Takedown movement or Musk’s allegations of coordinated violence against the company.

After the incident, Phillips began searching online for mentions of his name, and he found posts on X from an account that also tagged Joan Donovan and FBI director Kash Patel.

Phillips says that the FBI visit has had the opposite of a chilling effect. “If anything, it’s further radicalized me,” he says. “People having my back and the expression of support makes me feel very confident that it was the right thing to do to speak out about this.”

Organizing for the first time

Mike had attended a few protests in the past but didn’t know how to organize one. He has a wife, three small kids, a house in the suburbs, and a health issue that can sometimes make it hard to think. So by his own admission, his first attempt in February was a mixed bag. It was the San Francisco Bay Area-based Department of Labor employee’s first day back in the office after the Trump administration, spurred by DOGE, had demanded all workers return full-time. He was horrified by the fast-moving job cuts, program changes, and straight-up animus he had already seen flow from the White House down to his small corner of the federal government.

“Attacks on federal workers are an attack on the Constitution,” Mike says. Maybe, he figured, if he could keep people from buying Teslas, that would hurt Elon Musk’s bottom line, and the CEO would lay off DOGE altogether.

Mike, who WIRED is referring to using a pseudonym because he fears retaliation, saw that a Tesla showroom was just a 20-minute walk from his office, and he hoped to convince some coworkers to convene there, a symbolic stand against DOGE and Musk. So he taped a few flyers on light poles. He didn’t have social media, but he posted on Reddit. “I was really worried,” he says, “about the Hatch Act,” a law that limits the political activities of federal employees.

In the end, three federal workers—Mike, the person sitting next to him at the office, and a US Department of Veterans Affairs nurse they ran into on the street—posted up outside of the Tesla showroom on Van Ness Avenue in downtown San Francisco holding “Save Federal Workers” signs.

Then Mike discovered the #TeslaTakedown website that Alex Winter had built. (Because of a quirk in the sign-up process, the site was putatively operated for a time by the Seattle Troublemakers.) It turned out a bunch of other people had thought that Tesla showrooms were the right places to air their grievances with Trump, Musk, and DOGE. Mike posted his event there. Now the SF Save Federal Workers protest, which happens every Monday afternoon, draws 20 to 40 people.

Through the weekly convening, Mike has met volunteers from the Federal Unionists Network, who represent public unions; the San Francisco Labor Council, a local affiliate of the national AFL-CIO; and the East Bay chapter of the Democratic Socialists of America. As in any amicable custody arrangement, Mike’s group shares the strip of sidewalk outside of the San Francisco Tesla showroom with a local chapter of the progressive group Indivisible, which holds bigger protests on Saturdays. “I’m trying to build connections, meet other community groups,” Mike says. “My next step is broadening the coalition.”

About half of the people coordinating Takedown protests are like Mike, says Evan Sutton, who is part of the national team: They haven’t organized a protest before. “I’ve been in politics professionally for almost 20 years,” Sutton says. “It is genuinely the most grassroots thing that I’ve seen.”

Well into the spring, Tesla Takedown organizers nationwide had held hundreds of events across the US and even the globe, and the movement has gained a patina of professionalism. Tesla Takedown sends press releases to reporters. The movement has buy-in from Indivisible, a progressive network that dates back to the first Trump administration, with local chapters hosting their own protests. At least one Democratic congressional campaign has promoted a local #TeslaTakedown event.

Beyond the showrooms, Tesla sales are down by half in Europe compared to last year and have taken a hit in California, the US’s biggest EV market. Celebrities including Sheryl Crow and Jason Bateman have publicly ditched their Teslas. A Hawaii-based artist named Matthew Hiller started selling “I Bought This Before Elon Went Crazy” car decals in 2023; he estimates he has sold 70,000 anti-Musk and anti-Tesla stickers since then. (There was a “SpaceX-size explosion of sales after his infamous salute,” Hiller says.) In Seattle, the Troublemakers regularly hold “de-badging” events, where small handfuls of sheepish owners come by to have the T emblems drilled off their cars.

In Portland, Oregon, on a recent May Saturday, Ed Niedermeyer was once again sweating through his shark costume as he hopped along the sidewalk in front of the local Tesla showroom. His sign exhibited the DOGE meme, an alert Shiba Inu, with the caption “Heckin’ fascism.” (You’d get it if you spent too much time on the Internet in 2013.) Honks rang out. The shark tends to get a good reaction from drivers going by, he said. About 100 people had shown up to this Takedown protest, in front of a Tesla showroom that sits kitty-corner to a US Immigration and Customs Enforcement office.

Niedermeyer is a car writer and has spent a lot of time thinking about Elon Musk since 2015, when he discovered that Tesla wasn’t actually operating a battery swapping station like it said it did. Since then, he has written a book, Ludicrous: The Unvarnished Story of Tesla Motors, and documented many of what he claims to be Musk’s and the automaker’s half-truths on their way to the top.

Niedermeyer acknowledges that Musk and Tesla have proven difficult to touch, even by nationwide protests literally outside their doors.

Despite the Seattle cheers during Tesla’s last quarterly earnings call, the automaker’s stock price gained steam through the spring and rose on the news that its CEO would no longer officially work for the federal government. Musk has said investors should value Tesla not as a carmaker but as an AI and robotics company. At the end of this month, after years of delays, Tesla says it will launch a robotaxi service. According to Wall Street analysts’ research notes, they believe him.

Even a public fight with the president—one that devolved into name-calling on Musk’s and Trump’s respective social platforms—was not enough to pop the Tesla bubble.

“For me, watching Musk and watching our inability to stop him and create consequences for this snowballing hype and power has really reinforced that we need a stronger government to protect people from people like him,” says Niedermeyer.

Still, Tesla Takedown organizers take credit for the cracks in the Musk-Trump alliance—and say the protests will continue. The movement has also incorporated a more cerebral strategy, organizing local efforts to convince cities, states, and municipalities to divest from Musk’s companies. They already had a breakthrough in May, when Lehigh County, Pennsylvania, became the first US public pension fund to say it wouldn’t purchase new Tesla stocks for its managed investment accounts.

The movement’s goals may be lofty, but Niedermeyer argues that despite Tesla’s apparent resilience, Musk is still America’s most vulnerable billionaire. And sure, Musk, the CEO of an electric car company, the guy who made himself the figurehead for his automaker and fired his PR team to make sure it would stick, the one who alienated the electric car company’s customer base through a headlong plunge not only into political spending but the delicate mechanics of government itself—he did a lot of it on his own.

Now Niedermeyer, and everyone involved in Tesla Takedown, and probably everyone in the whole world, really, can only do what they can. So here he is, in a shark costume on the side of the road, maintaining the legally mandated distance from the car showroom behind him.

This story originally appeared on wired.com.

Wired.com is your essential daily guide to what’s next, delivering the most original and complete take you’ll find anywhere on innovation’s impact on technology, science, business and culture.

How Tesla Takedown got its start Read More »

nintendo-switch-2:-the-ars-technica-review

Nintendo Switch 2: The Ars Technica review


Nintendo’s overdue upgrade is a strong contender, even amid competition from handheld PCs.

Maybe not the best showcase of the hardware, but squeezing 40+ years of Nintendo history into a single image was too compelling. Credit: Kyle Orland

When Nintendo launched the Switch in 2017, the sheer novelty of the new hardware brought the company a lot of renewed attention. After the market disaster of the Wii U’s homebound “second screen” tablet, Nintendo exploited advances in system-on-a-chip miniaturization to create something of a minimum viable HD-capable system that could work as both a lightweight handheld and a slightly underpowered TV-based console. That unique combination, and Nintendo’s usual selection of first-party system sellers, set the console apart from what the rest of the gaming market was offering at the time.

Eight years later, the Switch 2 launched into a transformed gaming hardware market that the original Switch played a large role in shaping, one full of portable gaming consoles that can optionally be connected to a TV. That includes full-featured handheld gaming PCs like the Steam Deck and its many imitators, but also streaming-focused Android-based gaming handhelds and retro-focused emulation machines on the cheaper end. Even Microsoft is preparing to get in on the act, streamlining the Windows gaming experience for an Asus-powered handheld gaming PC that hides the Windows desktop.

Mario is excited! Are you? Credit: Kyle Orland

Those market changes make the Switch 2 a lot less of a novelty than its predecessor. As its name implies, it is essentially a direct sequel to the original Switch hardware, with improvements to the physical hardware and internal architecture. Rather than shaking things up with a new concept, Nintendo seems to be saying, “Hey, you liked the Switch? Here’s the same thing, but moreso.”

That “moreso” will surely be enough for players who complained about the Switch’s increasingly obvious struggles to play graphically demanding games in the last few years. But in a gaming world full of capable and usable handheld PCs, a “more of the same” Switch 2 might be a bit of a tougher sell.

Joyful Joy-Cons

Let’s start with one feature that the Switch line still can boast over most of its handheld gaming competition: the removable Joy-Cons. The new magnetic slotting system for these updated controllers on the Switch 2 is a sheer joy to use, allowing for easy and quick one-handed removal as well as a surprisingly secure portable mode connection. After a week spent snapping them on and off dozens of times, I still can’t get over how great the design feels.

The new Joy-Cons also address what was probably the biggest complaint about the ones on the Switch: their size. Everything from the overall footprint to the buttons and joystick has been expanded to feel much more appropriate in larger hands. The days of average adults having to awkwardly scrunch their fingers around a Switch Joy-Con in each hand can be relegated to the past, where they belong.

Holding a single Joy-Con in two hands is still not ideal, but it works in a pinch.

Like the Switch before it, the removable Joy-Cons can also be used separately, essentially offering baseline purchasers two controllers for the price of one. The added size helps make holding an individual Joy-Con horizontally in two hands much more comfortable, especially when it comes to tapping the expanded shoulder buttons on the controllers’ inner edge. But the face buttons and joystick are still a bit too cramped and oddly placed to make this a preferred way to play for long stretches.

Still, for situations where you happen to have other players around—especially young children who might not mind the smaller-than-standard size—it’s nice to have a feasible multiplayer option without needing to invest in new controllers. And the Switch 2’s seamless compatibility with your old Switch controllers (in tabletop or docked mode, at least) provides even more control flexibility and value for upgraders.

Control compromises

The main problem with the Switch 2 Joy-Cons continues to be their thinness, which is practically unchanged from the original Switch. That’s handy for keeping the overall system profile nice and trim in portable mode, but it means the Joy-Cons are missing the bulbous, rounded palm grips you see on handhelds like the Steam Deck and standard console controllers dating back to the original PlayStation.

Without this kind of grip, the thin, rounded bottom corner of the Joy-Cons ends up wedged oddly between the fleshy parts of your palm. Your free fingers, meanwhile, are either awkwardly wrapped around the edge of the loose Joy-Cons or uncomfortably perched to support the flat back of a portable system that’s a noticeable 34 percent heavier than the original Switch. And while an included Joy-Con holster helps add these rounded grips for tabletop or docked play, the “flat finger” problem is unavoidable when playing the system in portable mode.

The included grip gives your palms a comfortable place to rest when holding the Joy-Cons.

After spending a week with the Joy-Cons, I started to notice a few other compromises. Despite the added size, the face buttons are still slightly smaller than you’ll find on other controllers, meaning they can dig into the pad of your thumb when held down for extended periods. The shoulder buttons, which have also been expanded from the original Switch, still lack the increased travel and sensitivity of the analog triggers that are standard on nearly every competing controller. And the positioning of the right joystick encroaches quite close to the buttons just above it, making it easy to accidentally nudge the stick when pressing the lower B button.

Those kinds of control compromises help keep the portable Switch 2 notably smaller and lighter than most of its handheld PC competition. But they also mean my Switch 2 will probably need something like the Nyxi Hyperion Pro, which I’ve come to rely on to make portable play on the original Switch much more comfortable.

Improvements inside and out

Unlike the controllers, the screen on the Switch 2 is remarkably low on compromises. The full 1080p, 7.9-inch display supports HDR and variable refresh rates up to 120 Hz, making it a huge jump over both the original Switch and most of the screens you’ll find on competing handheld gaming PCs (or even some standard HDTVs when it comes to the maximum frame rate). While the screen lacks the deep blacks of a true OLED display, I found that the overall brightness (which reportedly peaks at about 450 nits) makes it hard to notice.

The bigger, brighter, sharper screen on the Switch 2 (top) is a huge improvement over the first Switch. Credit: Kyle Orland

The custom Nvidia processor inside the Switch 2 is also a welcome improvement over a Tegra processor that was already underpowered for the Switch in 2017. We’ve covered in detail how much of a difference this makes for Switch titles that have been specially upgraded to take advantage of that extra power, fixing fuzzy graphics and frame rate issues that were common on Nintendo’s previous system. It’s hard to imagine going back after seeing Tears of the Kingdom running in a silky-smooth 60 fps or enjoying the much sharper textures and resolution of portable No Man’s Sky on the Switch 2.

Link’s Awakening, Switch 1, docked. Andrew Cunningham

However, the real proof of the Switch 2’s improved power can be seen in early third-party ports like Cyberpunk 2077, Split Fiction, Hitman World of Assassination, and Street Fighter VI, which would have required significant visual downgrades to even run on the original Switch. To my eye, the visual impact of these ports is roughly comparable to what you’d get on a PS4 Pro (in handheld mode) or an Xbox Series S (in docked mode). In the medium term, that should be more than enough performance for all but the most determined pixel-counters, given the distinctly diminishing graphical returns we’re seeing from more advanced (and more expensive) hardware like the PS5 Pro.

The Switch 2 delivers a perfectly fine-looking version of Cyberpunk 2077. Credit: CD Projekt Red

The biggest compromise for all this extra power comes in the battery life department. Games like Mario Kart World or Cyberpunk 2077 can take the system from a full charge to completely drained in somewhere between 2 and 2.5 hours. This time span increases significantly for less demanding games like old-school 2D classics and can be slightly extended if you reduce the screen brightness. Still, it’s a bit grating to need to rely on an external battery pack just to play Mario Kart World for an entire cross-country flight.

Externally, the Switch 2 is full of tiny but welcome improvements, like an extra upper edge USB-C port for more convenient charging and a thin-but-sturdy U-shaped stand for tabletop play. Internally, the extremely welcome high-speed storage helps cut initial load times on games like Mario Kart 8 roughly in half (16.5 seconds on the Switch versus 8.5 seconds on the Switch 2 in our testing).

The embedded stand on the Switch 2 (right) is a massive improvement for tabletop mode play. Credit: Kyle Orland

But the 256GB of internal storage included in the Switch 2 is also laughably small, considering that individual digital games routinely require downloads of 50GB to 70GB. That’s especially true in a world where many third-party games are only available as Game Key Cards, which still require that the full game be downloaded. Most Switch 2 customers should budget $50 or more for a MicroSD Express card to add at least 256GB of additional storage.

Those Nintendo gimmicks

Despite the “more of the same” overall package, there are a few small areas where the Switch 2 does something truly new. Mouse mode is the most noticeable of these, letting you transform a Joy-Con into a PC-style mouse simply by placing it on its edges against most flat-ish surfaces. We tested this mode on surfaces ranging from a hard coffee table to a soft pillow-top mattress and this reviewer’s hairy thighs and found the mouse mode was surprisingly functional in every test. While the accuracy and precision fall off on the squishier and rounder of those tested surfaces, it’s something of a marvel that it works at all.

A bottom-up look at the awkward claw-like grip required for mouse mode. Credit: Kyle Orland

Unfortunately, the ergonomics of mouse mode still leave much to be desired. This again comes down to the thinness of the Joy-Cons, which don’t have the large, rounded palm rest you’d expect from a good PC mouse. That means getting a good sense of control in mouse mode requires hooking your thumb, ring finger, and pinky finger into a weird modified claw-like grip around the Joy-Con, a pose that becomes uncomfortable after even moderate use. A holster that lets the Joy-Con slot into a more traditional mouse shape could help with this problem; failing that, mouse mode seems destined to remain a little-used gimmick.

GameChat is the Switch 2’s other major “new” feature, letting you communicate with friends directly through the system’s built-in microphone (which works rather well even across a large and noisy living room) or an optional webcam (many standard USB cameras we tested worked just fine). It’s a welcome and simple way to connect with other players without having to resort to Discord or the bizarre external smartphone app Nintendo relied on for voice chat on the original Switch.

In most ways, it feels like GameChat is just playing catch-up to the kind of social sharing features competitors like Microsoft were already including in their consoles back in 2005. However, we appreciate GameChat’s ability to easily share a live view of your screen with friends, even if the low-frame-rate video won’t give Twitch streams a run for their money.

Those kinds of complaints can also apply to GameShare, which lets Switch 2 owners stream video of their game to a second player, allowing them to join in the game from a secondary Switch or Switch 2 console (either locally or remotely). The usability of this feature seems heavily dependent on the wireless environment in the players’ house, ranging from smooth but grainy to unplayably laggy. And the fact that GameShare only works with specially coded games is a bit annoying when Steam Remote Play offers a much more generalized remote co-op solution on PC.

The best of both worlds?

This is usually the point in a console review where I warn you that buying a console at or near launch is a poor value proposition, since launch is when you pay the most for the fewest games. That’s not necessarily true these days. The original Switch never saw an official price drop in its eight years on the market, and price increases are becoming more common for some video game hardware. If you think you’re likely to ever be in the market for a Switch 2, now might be the best time to pull the trigger.

Mario Kart World offers plenty to see and do until more must-have games come to the Switch 2 library. Credit: Nintendo

That said, there’s not all that much to do with a brand new Switch 2 unit at the moment. Mario Kart World is being positioned as the major system seller at launch, revitalizing an ultra-popular, somewhat stale series with a mixed bag of bold new ideas. Nintendo’s other first-party launch title, the $10 Switch 2 Welcome Tour, is a tedious affair that offers a few diverting minigames amid dull slideshows and quizzes full of corny PR speak.

The rest of the Switch 2’s launch library is dominated by ports of games that have been available on major non-Switch platforms for anywhere from months to years. That’s nice if the Switch has been your only game console during that time or if you’ve been looking for an excuse to play these titles in full HD on a beautiful portable screen. For many gamers, though, these warmed-over re-releases won’t be that compelling.

Other than that, there are currently only the barest handful of completely original launch titles that require the Switch 2, none of which really provide a meaningful reason to upgrade right away. For now, once you tire of Mario Kart, you’ll be stuck replaying your old Switch games (often with welcome frame rate and resolution improvements) or checking out a trio of emulated GameCube games available to Switch Online Expansion Pack subscribers (they look and play just fine).

Looking to the future, the promise of further Nintendo first-party games is, as usual, the primary draw for the company’s hardware. In the near term, games like Donkey Kong Bananza, Pokémon Legends Z-A, and Metroid Prime 4 (which will also be available on the older Switch with less wow-inducing performance) are the biggest highlights in the pipeline. Projecting a little further out, the Switch 2 will be the only way to legitimately play Mario and Zelda adventures that seem highly likely to be can’t-miss classics, given past performance.

From top: Switch 2, Steam Deck OLED, Lenovo Legion Go S. Two of these three can play your entire Steam library. One of these three can play the new Mario Kart… Credit: Kyle Orland

Nintendo aside, the Switch 2 seems well-positioned to receive capable, portable-ready ports of some of the more demanding third-party games in the foreseeable future. Already, we’ve seen Switch 2 announcements for catalog titles like Elden Ring and future releases like 007 First Light, as well as a handful of third-party exclusives like FromSoft’s vampire-filled Duskbloods.

Those are pretty good prospects for a $450 portable/TV console hybrid. But even with a bevy of ports and exclusives, it could be hard for the Switch 2’s library to compete with the tens of thousands of games available on any handheld PC worth its salt. You’ll pay a bit more for one of those portables if you’re looking for something that matches the quality of the Switch 2’s screen and processor—for the moment, at least. But the PC ecosystem’s wider software selection and ease of customization might make that investment worth it for gamers who don’t care too much about Nintendo’s first-party efforts.

If you found yourself either regularly using or regularly coveting a Switch at any point over the last eight years, the Switch 2 is an obvious and almost necessary upgrade. If you’ve resisted the siren song for this long, though, you can probably continue to ignore Nintendo’s once-novel hardware line.

Kyle Orland has been the Senior Gaming Editor at Ars Technica since 2012, writing primarily about the business, tech, and culture behind video games. He has journalism and computer science degrees from the University of Maryland. He once wrote a whole book about Minesweeper.

Nintendo Switch 2: The Ars Technica review Read More »

another-one-for-the-graveyard:-google-to-kill-instant-apps-in-december

Another one for the graveyard: Google to kill Instant Apps in December

But that was then, and this is now. Today, an increasing number of mobile apps are functionally identical to the mobile websites they are intended to replace, and developer uptake of Instant Apps was minimal. Even in 2017, loading an app instead of a website had limited utility. As a result, most of us probably only encountered Instant Apps a handful of times in all the years it was an option for developers.

To use the feature, which was delivered to virtually all Android devices by Google Play Services, developers had to create a special “instant” version of their app that was under 15MB. The additional legwork to get an app in front of a subset of new users meant this was always going to be a steep climb, and Google has long struggled to incentivize developers to adopt new features. Plus, there’s no way to cram in generative AI! So it’s not a shock to see Google retiring the feature.

This feature is currently listed in the collection of Google services in your phone settings as “Google Play Instant.” Unfortunately, there aren’t many examples still available if you’re curious about what Instant Apps were like—the Finnish publisher Ilta-Sanomat is one of the few still offering it. Make sure the settings toggle for Instant Apps is on if you want a little dose of nostalgia.

Another one for the graveyard: Google to kill Instant Apps in December Read More »

rocket-report:-new-delay-for-europe’s-reusable-rocket;-spacex-moves-in-at-slc-37

Rocket Report: New delay for Europe’s reusable rocket; SpaceX moves in at SLC-37


Canada is the only G7 nation without a launch program. Quebec wants to do something about that.

This graphic illustrates the elliptical shape of a geosynchronous transfer orbit in green, and the circular shape of a geosynchronous orbit in blue. In a first, SpaceX recently de-orbited a Falcon 9 upper stage from GTO after deploying a communications satellite. Credit: European Space Agency

Welcome to Edition 7.48 of the Rocket Report! The shock of last week’s public spat between President Donald Trump and SpaceX founder Elon Musk has worn off, and Musk expressed regret for some of his comments going after Trump on social media. Musk also backtracked from his threat to begin decommissioning the Dragon spacecraft, currently the only way for the US government to send people to the International Space Station. Nevertheless, there are many people who think Musk’s attachment to Trump could end up putting the US space program at risk, and I’m not convinced that danger has passed.

As always, we welcome reader submissions. If you don’t want to miss an issue, please subscribe using the box below (the form will not appear on AMP-enabled versions of the site). Each report will include information on small-, medium-, and heavy-lift rockets, as well as a quick look ahead at the next three launches on the calendar.

Quebec invests in small launch company. The government of Quebec will invest CA$10 million ($7.3 million) into a Montreal-area company that is developing a system to launch small satellites into space, The Canadian Press reports. Quebec Premier François Legault announced the investment into Reaction Dynamics at the company’s facility in Longueuil, a Montreal suburb. The province’s economy minister, Christine Fréchette, said the investment will allow the company to begin launching microsatellites into orbit from Canada as early as 2027.

Joining its peers … Canada is the only G7 nation without a domestic satellite launch capability, whether it’s through an independent national or commercial program or through membership in the European Space Agency, which funds its own rockets. The Canadian Space Agency has long eschewed any significant spending on developing a Canadian satellite launcher, and a handful of commercial launch startups in Canada haven’t gotten very far. Reaction Dynamics was founded in 2017 by Bachar Elzein, formerly a researcher in multiphase and reactive flows at École Polytechnique de Montréal, where he specialized in propulsion and combustion dynamics. Reaction Dynamics plans to launch its first suborbital rocket later this year, before attempting an orbital flight with its Aurora rocket as soon as 2027. (submitted by Joey S-IVB)

Another year, another delay for Themis. The European Space Agency’s Themis program has suffered another setback, with the inaugural flight of its reusable booster demonstrator now all but certain to slip to 2026, European Spaceflight reports. It has been nearly six years since the European Space Agency kicked off the Themis program to develop and mature key technologies for future reusable rocket stages. Themis is analogous to SpaceX’s Grasshopper reusable rocket prototype tested more than a decade ago, with progressively higher hop tests to demonstrate vertical takeoff and vertical landing techniques. When the program started, an initial hop test of the first Themis demonstrator was expected to take place in 2022.

Tethered to terra firma … ArianeGroup, which manufactures Europe’s Ariane rockets, is leading the Themis program under contract to ESA, which recently committed an additional 230 million euros ($266 million) to the effort. This money is slated to go toward the development of a single-engine variant of the Themis vehicle, continued development of the rocket’s methane-fueled engine, and upgrades to a test stand at ArianeGroup’s propulsion facility in Vernon, France. Two months ago, an official update on the Themis program suggested the first Themis launch campaign would begin before the end of the year. Citing sources close to the program, European Spaceflight reports the first Themis integration tests at the Esrange Space Center in Sweden are now almost certain to slip from late 2025 to 2026.

French startup tests a novel rocket engine. While Europe’s large government-backed rocket initiatives face delays, the continent’s space industry startups are moving forward on their own. One of these companies, a French startup named Alpha Impulsion, recently completed a short test-firing of an autophage rocket engine, European Spaceflight reports. These aren’t your normal rocket engines that burn conventional kerosene, methane, or hydrogen fuel. An autophage engine literally consumes itself as it burns, using heat from the combustion process to melt its plastic fuselage and feed the molten plastic into the combustion chamber in a controlled manner. Alpha Impulsion called the May 27 ground firing a successful test of the “largest autophage rocket engine in the world.”

So, why hasn’t this been done before? … The concept of a self-consuming rocket engine sounds like an idea that’s so crazy it just might work. But the idea remained conceptual from when it was first patented in 1938 until an autophage engine was fired in a controlled manner for the first time in 2018. The autophage design offers several advantages, including its relative simplicity compared to the complex plumbing of liquid and hybrid rockets. But there are serious challenges associated with autophage engines, including how to feed molten fuel into the combustion chamber and how to scale it up to be large enough to fly on a viable rocket. (submitted by trimeta and EllPeaTea)

Rocket trouble delays launch of private crew mission. A propellant leak in a Falcon 9 booster delayed the launch of a fourth Axiom Space private astronaut mission to the International Space Station this week, Space News reports. SpaceX announced the delay Tuesday, saying it needed more time to fix a liquid oxygen leak found in the Falcon 9 booster during inspections following a static-fire test Sunday. “Once complete–and pending Range availability–we will share a new launch date,” the company stated. The Ax-4 mission will ferry four commercial astronauts, led by retired NASA commander Peggy Whitson, aboard a Dragon spacecraft to the ISS for an approximately 14-day stay. Whitson will be joined by crewmates from India, Poland, and Hungary.

Another problem, too … While SpaceX engineers worked on resolving the propellant leak on the ground, a leak of another kind in orbit forced officials to order a longer delay to the Ax-4 mission. In a statement Thursday, NASA said it is working with the Russian space agency to understand a “new pressure signature” in the space station’s Russian service module. For several years, ground teams have monitored a slow air leak in the aft part of the service module, and NASA officials have identified it as a safety risk. NASA’s statement on the matter was vague, only saying that cosmonauts on the station recently inspected the module’s interior surfaces and sealed additional “areas of interest.” The segment is now holding pressure, according to NASA. (submitted by EllPeaTea)

SpaceX tries something new with Falcon 9. With nearly 500 launches under its belt, SpaceX’s Falcon 9 rocket isn’t often up to new tricks. But the company tried something new following a launch on June 7 with a radio broadcasting satellite for SiriusXM. The Falcon 9’s upper stage placed the SXM-10 satellite into an elongated, high-altitude transfer orbit, as is typical for payloads destined to operate in geosynchronous orbit more than 22,000 miles (nearly 36,000 kilometers) over the equator. When a rocket releases a satellite in this type of high-energy orbit, the upper stage has usually burned almost all of its propellant, leaving little fuel to steer itself back into Earth’s atmosphere for a destructive reentry. This means these upper stages often remain in space for decades, becoming a piece of space junk that transits across the orbits of many other satellites.

Now, a solution … SpaceX usually deorbits rockets after they deploy payloads like Starlink satellites into low-Earth orbit, but deorbiting a rocket from a much higher geosynchronous transfer orbit is a different matter. “Last week, SpaceX successfully completed a controlled deorbit of the SiriusXM-10 upper stage after GTO payload deployment,” wrote Jon Edwards, SpaceX’s vice president of Falcon and Dragon programs. “While we routinely do controlled deorbits for LEO stages (e.g., Starlink), deorbiting from GTO is extremely difficult due to the high energy needed to alter the orbit, making this a rare and remarkable first for us. This was only made possible due to the hard work and brilliance of the Falcon GNC (guidance, navigation, and control) team and exemplifies SpaceX’s commitment to leading in both space exploration and public safety.”

New Glenn gets a tentative launch date. Five months have passed since Blue Origin’s New Glenn rocket made its mostly successful debut in January. At one point, the company targeted “late spring” for the second launch of the rocket. However, on Monday, Blue Origin’s CEO, Dave Limp, acknowledged on social media that the rocket’s next flight will now no longer take place until at least August 15, Ars reports. Although he did not say so, this may well be the only other New Glenn launch this year. The mission, with an undesignated payload, will be named “Never Tell Me the Odds,” due to the attempt to land the booster. “One of our key mission objectives will be to land and recover the booster,” Limp wrote. “This will take a little bit of luck and a lot of excellent execution. We’re on track to produce eight GS2s [second stages] this year, and the one we’ll fly on this second mission was hot-fired in April.”

Falling short … Before 2025 began, Limp set expectations alongside Blue Origin founder Jeff Bezos: New Glenn would launch eight times this year. That’s not going to happen. It’s common for launch companies to take a while ramping up the flight rate for a new rocket, but Bezos told Ars in January that his priority for Blue Origin this year was to hit a higher cadence with New Glenn. Elon Musk’s rift with President Donald Trump could open a pathway for Blue Origin to capture more government business if the New Glenn rocket is able to establish a reliable track record. Meanwhile, Limp told Blue Origin employees last month that Jarrett Jones, the manager running the New Glenn program, is taking a sabbatical. Although it appears Jones’ leave may have been planned, the timing is curious.

Making way for Starship at Cape Canaveral. The US Air Force is moving closer to authorizing SpaceX to move into one of the largest launch pads at Cape Canaveral Space Force Station in Florida, with plans to use the facility for up to 76 launches of the company’s Starship rocket each year, Ars reports. A draft Environmental Impact Statement (EIS) released by the Department of the Air Force, which includes the Space Force, found SpaceX’s planned use of Space Launch Complex 37 (SLC-37) at Cape Canaveral would have no significant negative impacts on local environmental, historical, social, and cultural interests. The Air Force also found SpaceX’s plans at SLC-37 will have no significant impact on the company’s competitors in the launch industry.

Bringing the rumble … SLC-37 was the previous home to United Launch Alliance’s Delta IV rocket, which last flew from the site in April 2024, a couple of months after the military announced SpaceX was interested in using the launch pad. While it doesn’t have a lease for full use of the launch site, SpaceX has secured a “right of limited entry” from the Space Force to begin preparatory work. This included the explosive demolition of the launch pad’s Delta IV-era service towers and lightning masts Thursday, clearing the way for eventual construction of two Starship launch towers inside the perimeter of SLC-37. The new Starship launch towers at SLC-37 will join other properties in SpaceX’s Starship empire, including nearby Launch Complex 39A at NASA’s Kennedy Space Center, and SpaceX’s privately owned facility at Starbase, Texas.

Preps continue for Starship Flight 10. Meanwhile, at Starbase, SpaceX is moving forward with preparations for the next Starship test flight, which could happen as soon as next month following three consecutive flights that fell short of expectations. This next launch will be the 10th full-scale test flight of Starship. Last Friday, June 6, SpaceX test-fired the massive Super Heavy booster designated to launch on Flight 10. All 33 of its Raptor engines ignited on the launch pad in South Texas. This is a new Super Heavy booster. On Flight 9 last month, SpaceX flew a reused Super Heavy booster that launched and was recovered on a flight in January.

FAA signs off on SpaceX investigation … The Federal Aviation Administration said Thursday it has closed the investigation into Starship Flight 8 in March, which spun out of control minutes after liftoff, showering debris along a corridor of ocean near the Bahamas and the Turks and Caicos Islands. “The FAA oversaw and accepted the findings of the SpaceX-led investigation,” an agency spokesperson said. “The final mishap report cites the probable root cause for the loss of the Starship vehicle as a hardware failure in one of the Raptor engines that resulted in inadvertent propellant mixing and ignition. SpaceX identified eight corrective actions to prevent a reoccurrence of the event.” SpaceX implemented the corrective actions prior to Flight 9 last month, when Starship progressed further into its mission before starting to tumble in space. It eventually reentered the atmosphere over the Indian Ocean. The FAA has mandated a fresh investigation into Flight 9, and that inquiry remains open.

Next three launches

June 13: Falcon 9 | Starlink 12-26 | Cape Canaveral Space Force Station, Florida | 15:21 UTC

June 14: Long March 2D | Unknown Payload | Jiuquan Satellite Launch Center, China | 07:55 UTC

June 16: Atlas V | Project Kuiper KA-02 | Cape Canaveral Space Force Station, Florida | 17:25 UTC

Stephen Clark is a space reporter at Ars Technica, covering private space companies and the world’s space agencies. Stephen writes about the nexus of technology, science, policy, and business on and off the planet.

Rocket Report: New delay for Europe’s reusable rocket; SpaceX moves in at SLC-37 Read More »

a-warlord-brings-chaos-in-foundation-s3-trailer

A warlord brings chaos in Foundation S3 trailer

Foundation returns for a third season next month on Apple TV+.

Foundation, Apple TV+’s lavish adaptation (or re-mix, if you prefer) of Isaac Asimov’s seminal sci-fi series, returns for its third season next month, and the streaming platform has dropped an official trailer to give us a taste of what’s in store.

As previously reported, the first season ended with a major time jump of 138 years, and S2 focused on the Second Crisis: imminent war between Empire and the Foundation, along with an enemy seeking to destroy Empire from within. The Foundation, meanwhile, adopted the propaganda tactics of religion to recruit new acolytes to the cause. We also met a colony of “Mentalics” with psionic abilities. We’re getting another mega time jump for the Third Crisis.

Per the official premise:

Set 152 years after the events of S2, The Foundation has become increasingly established far beyond its humble beginnings while the Cleonic Dynasty’s Empire has dwindled. As both of these galactic powers forge an uneasy alliance, a threat to the entire galaxy appears in the fearsome form of a warlord known as “The Mule” whose sights are set on ruling the universe by use of physical and military force, as well as mind control. It’s anyone’s guess who will win, who will lose, who will live, and who will die as Hari Seldon, Gaal Dornick, the Cleons and Demerzel play a potentially deadly game of intergalactic chess.

Most of the main cast is returning: Lee Pace as Brother Day, Cassian Bilton as Brother Dawn, Terrence Mann as Brother Dusk, Jared Harris as Hari Seldon, Lou Llobell as Gaal, and Laura Birn as Eto Demerzel. Pilou Asbæk plays the Mule. New S3 cast members include Alexander Siddig as Dr. Ebling Mis, a Seldon fan and self-taught psychohistorian; Troy Kotsur as Preem Palver, leader of a planet of psychics; Cherry Jones as Foundation Ambassador Quent; Brandon P. Bell as Han Pritcher; Synnøve Karlsen as Bayta Mallow; Cody Fern as Toran Mallow; Tómas Lemarquis as Magnifico Giganticus; Yootha Wong-Loi-Sing as Song; and Leo Bill as Mayor Indbur.

A warlord brings chaos in Foundation S3 trailer Read More »

all-wheel-drive-evs-at-210-mph?-formula-e’s-next-car-gets-massive-upgrade.

All-wheel drive EVs at 210 mph? Formula E’s next car gets massive upgrade.

The governing body for world motorsport met in Macau yesterday. Among the jobs for the Fédération Internationale de l’Automobile was to sign off on various calendars for next season, which is why there’s now a clash between the F1 Monaco Grand Prix and the 24 Hours of Le Mans and also between the Indy 500 and F1’s annual visit to Canada. The Formula E calendar was also announced, although with a pair of blank TBCs in the middle, I’ll hold off calling it finalized.

The US round will now take place in late January, and it’s moving venues yet again. No longer will you need to drive an hour south of Miami; instead, the northern outskirts of the city will suffice. The infield at Homestead is no more, and the sport has negotiated a race at the Hard Rock Stadium, albeit on a different layout than the one used by F1. It seems that Formula E’s recent “Evo Sessions” race between influencers, which was held at the stadium, proved convincing.

The really interesting Formula E news from Macau won’t take effect until the 2026–2027 season, and that’s the arrival of the Gen4 car.

The current machine is no slouch, not since they took some constraints off the Gen3 car this season. The addition of part-time all-wheel drive has improved what was already a very racey series, but for now, it’s only available for the final part of qualifying, the start of the race, and when using the mandatory Attack Mode that has added some interesting new strategy to the sport.

New tires, more aero, and way more power

From the start of the 2026–2027 season, all-wheel drive will finally be permanent for the single-seater EVs. It is long past time, given that virtually every high-performance EV on the road powers both its axles, and it marks the first time the FIA has approved a permanent AWD single-seater since the technology was outlawed from F1 decades ago.

All-wheel drive EVs at 210 mph? Formula E’s next car gets massive upgrade. Read More »

dwarkesh-patel-on-continual-learning

Dwarkesh Patel on Continual Learning

A key question going forward is the extent to which making further AI progress will depend upon some form of continual learning. Dwarkesh Patel offers us an extended essay considering these questions and reasons to be skeptical of the pace of progress for a while. I am less skeptical about many of these particular considerations, and do my best to explain why in detail.

Separately, Ivanka Trump recently endorsed a paper with a discussion I liked a lot less but that needs to be discussed given how influential her voice might (mind you I said might) be to policy going forward, so I will then cover that here as well.

Dwarkesh Patel explains why he doesn’t think AGI is right around the corner, and why AI progress today is insufficient to replace most white collar employment: That continual learning is both necessary and unsolved, and will be a huge bottleneck.

He opens with this quote:

Rudiger Dornbusch: Things take longer to happen than you think they will, and then they happen faster than you thought they could.

Clearly this means one is poorly calibrated, but also yes, and I expect it to feel like this as well. Either capabilities, diffusion or both will be on an exponential, and the future will be highly unevenly distributed until suddenly parts of it aren’t anymore. That seems to be true fractally as well, when the tech is ready and I figure out how to make AI do something, that’s it, it’s done.

Here is Dwarkesh’s Twitter thread summary:

Dwarkesh Patel: Sometimes people say that even if all AI progress totally stopped, the systems of today would still be economically transformative. I disagree. The reason that the Fortune 500 aren’t using LLMs to transform their workflows isn’t because the management is too stodgy.

Rather, it’s genuinely hard to get normal humanlike labor out of LLMs. And this has to do with some fundamental capabilities these models lack.

New blog post where I explain why I disagree with this, and why I have slightly longer timelines to AGI than many of my guests.

I think continual learning is a huge bottleneck to the usefulness of these models, and extended computer use may take years to sort out.

Link here.

There is no consensus definition of transformational but I think this is simply wrong, in the sense that LLMs being stuck without continual learning at essentially current levels would not stop them from having a transformational impact. There are a lot of other ways to get a ton more utility out of what we already have, and over time we would build around what the models can do rather than giving up the moment they don’t sufficiently neatly fit into existing human-shaped holes.

When we do solve human like continual learning, however, we might see a broadly deployed intelligence explosion *even if there’s no more algorithmic progress*.

Simply from the AI amalgamating the on-the-job experience of all the copies broadly deployed through the economy.

I’d bet 2028 for computer use agents that can do taxes end-to-end for my small business as well as a competent general manager could in a week: including chasing down all the receipts on different websites, emailing back and forth for invoices, and filing to the IRS.

That being said, you can’t play around with these models when they’re in their element and still think we’re not on track for AGI.

Strongly agree with that last statement. Regardless of how much we can do without strictly solving continual learning, continual learning is not solved… yet.

These are simple, self contained, short horizon, language in-language out tasks – the kinds of assignments that should be dead center in the LLMs’ repertoire. And they’re 5/10 at them. Don’t get me wrong, that’s impressive.

But the fundamental problem is that LLMs don’t get better over time the way a human would. The lack of continual learning is a huge huge problem. The LLM baseline at many tasks might be higher than an average human’s. But there’s no way to give a model high level feedback.

You’re stuck with the abilities you get out of the box. You can keep messing around with the system prompt. In practice this just doesn’t produce anything even close to the kind of learning and improvement that human employees experience.

The reason humans are so useful is not mainly their raw intelligence. It’s their ability to build up context, interrogate their own failures, and pick up small improvements and efficiencies as they practice a task.

You make an AI tool. It’s 5/10 out of the box. What level of Skill Issue are we dealing with here, that stops it from getting better over time assuming you don’t get to upgrade the underlying model?

You can obviously engage in industrial amounts of RL or other fine-tuning, but that too only goes so far.

You can use things like memory, or train LoRAs, or various other incremental tricks. That doesn’t enable radical changes, but I do think it can work for the kinds of preference learning Dwarkesh is complaining he currently doesn’t have access to, and you can, if desired, go back and fine-tune the entire system periodically.
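
To make the incremental-tricks option concrete, here is a minimal sketch of folding a batch of accumulated corrections into a small LoRA adapter with the Hugging Face transformers and peft libraries. The base model name, the feedback.jsonl file, and the hyperparameters are placeholder assumptions for illustration, not a tested recipe.

```python
# Rough sketch of "stack on another fine-tuning run" with a LoRA adapter.
# Assumptions: base model name, data file, and hyperparameters are placeholders.
from datasets import load_dataset
from peft import LoraConfig, get_peft_model
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

BASE = "Qwen/Qwen2.5-0.5B"  # placeholder base model
tokenizer = AutoTokenizer.from_pretrained(BASE)
tokenizer.pad_token = tokenizer.pad_token or tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(BASE)

# Train only small low-rank adapters on the attention projections.
model = get_peft_model(model, LoraConfig(
    r=8, lora_alpha=16, lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"], task_type="CAUSAL_LM"))

# feedback.jsonl (placeholder): one {"text": "<prompt + corrected output>"} per
# line, accumulated from this week's on-the-job corrections.
data = load_dataset("json", data_files="feedback.jsonl")["train"]
data = data.map(lambda ex: tokenizer(ex["text"], truncation=True, max_length=1024),
                remove_columns=data.column_names)

Trainer(
    model=model,
    args=TrainingArguments(output_dir="adapter-week-23", num_train_epochs=1,
                           per_device_train_batch_size=1, learning_rate=1e-4),
    train_dataset=data,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
).train()

model.save_pretrained("adapter-week-23")  # swap the adapter in at serving time
```

The point is not that this is elegant; it is that periodically stacking a cheap adapter on top of on-the-job data would already be a crude, workable form of slow continual learning.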

How do you teach a kid to play a saxophone? You have her try to blow into one, listen to how it sounds, and adjust. Now imagine teaching saxophone this way instead: A student takes one attempt. The moment they make a mistake, you send them away and write detailed instructions about what went wrong. The next student reads your notes and tries to play Charlie Parker cold. When they fail, you refine the instructions for the next student.

This just wouldn’t work. No matter how well honed your prompt is, no kid is just going to learn how to play saxophone from just reading your instructions. But this is the only modality we as users have to ‘teach’ LLMs anything.

Are you even so sure about that? If the context you can give is hundreds of thousands to millions of tokens at once, with ability to conditionally access millions or billions more? If you can create new tools and programs and branch workflows, or have it do so on your behalf, and call instances with different contexts and procedures for substeps? If you get to keep rewinding time and sending in the exact same student in the same mental state as many times as you want? And so on, including any number of things I haven’t mentioned or thought about?

I am confident that with enough iterations and work (and access to the required physical tools) I could write a computer program to operate a robot to play the saxophone essentially perfectly. No, you can’t do this purely via the LLM component, but that is why we are moving towards MCP and tool use for such tasks.

I get that Dwarkesh has put a lot of work into getting his tools to 5/10. But it’s nothing compared to the amount of work that could be done, including the tools that could be involved. That’s not a knock on him, that wouldn’t be a good use of his time yet.

LLMs actually do get kinda smart and useful in the middle of a session. For example, sometimes I’ll co-write an essay with an LLM. I’ll give it an outline, and I’ll ask it to draft the essay passage by passage. All its suggestions up till 4 paragraphs in will be bad. So I’ll just rewrite the whole paragraph from scratch and tell it, “Hey, your shit sucked. This is what I wrote instead.” At that point, it can actually start giving good suggestions for the next paragraph. But this whole subtle understanding of my preferences and style is lost by the end of the session.

Okay, so that seems like it is totally, totally a Skill Issue now? As in, Dwarkesh Patel has a style. A few paragraphs of that style clue the LLM into knowing how to help. So… can’t we provide it with a bunch of curated examples of similar exercises, and put them into context in various ways (Claude projects just got 10x more context!) and start with that?
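As a toy illustration, here is a minimal sketch of priming a session with curated “LLM draft → my rewrite” pairs via the Anthropic Python SDK; the model ID and the examples.json layout are placeholder assumptions.

```python
# Sketch: prime the model with curated "LLM draft -> my rewrite" pairs so a new
# session starts out already knowing the preferred style.
# Assumptions: the model ID and examples.json layout are placeholders.
import json

import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# examples.json: a list of {"draft": "...", "rewrite": "..."} pairs saved from
# past co-writing sessions where the human rewrote the model's paragraph.
with open("examples.json") as f:
    examples = json.load(f)

style_primer = "\n\n".join(
    f"LLM draft:\n{ex['draft']}\n\nMy rewrite:\n{ex['rewrite']}" for ex in examples
)

def draft_next_paragraph(outline: str, essay_so_far: str) -> str:
    msg = client.messages.create(
        model="claude-sonnet-4-20250514",  # placeholder model ID
        max_tokens=800,
        system=("You are co-writing an essay. Match the voice of the rewrites "
                "in these examples, not the voice of the drafts.\n\n" + style_primer),
        messages=[{
            "role": "user",
            "content": (f"Outline:\n{outline}\n\nEssay so far:\n{essay_so_far}\n\n"
                        "Draft the next paragraph."),
        }],
    )
    return msg.content[0].text
```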

Even Claude Code will often reverse a hard-earned optimization that we engineered together before I hit /compact – because the explanation for why it was made didn’t make it into the summary.

Yeah, this is super annoying, I’ve run into it, but I can think of some obvious fixes for this, especially if you notice what you want to preserve. One obvious way is to do what humans do, which is to put it into comments in the code saying what the optimization is and why to keep it, so that the rationale stays in context whenever Claude considers ripping it out. I don’t know if that works yet, but it totally should.
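
Something like this, where the rationale travels with the code instead of the soon-to-be-compacted chat history; the function and the benchmark claim are made up for illustration:

```python
# Hypothetical example: the "why" lives next to the code, so it survives
# /compact and is sitting in context whenever the agent considers a rewrite.
def fetch_user_rows(conn, user_ids):
    # PERF NOTE (keep): this batched IN-clause lookup replaced a per-row query
    # loop that benchmarked far slower on the production table (numbers made up
    # for illustration). Do not "simplify" this back into a loop of single-row
    # queries; see the commit message for the original discussion.
    placeholders = ",".join("?" for _ in user_ids)
    return conn.execute(
        f"SELECT * FROM users WHERE id IN ({placeholders})", list(user_ids)
    ).fetchall()
```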

I’m not saying I have the magical solution to all this but it all feels like it’s One Weird Trick (okay, maybe 10 working together) away from working in ways I could totally figure out if I had a team behind me and I focused on it.

My guess is this will not look like ‘learn like a human’ exactly. Different tools are available, so we’ll first get the ability to solve this via doing something different. But also, yeah, I think with enough skill and the right technique (on the level of the innovation that created reasoning models) you could basically do what humans do? Which involves effectively having the systems automatically engage in various levels of meta and updating, often quite heavily off a single data point.

It is hard to overstate how much time and effort goes into training a human employee.

There are many jobs where an employee is not net profitable for years. Hiring decisions are often made on the basis of what will be needed in year four or beyond.

That ignores the schooling that you also have to do. A doctor in America requires starting with a college degree, then four years of medical school, then four years of residency, and we have to subsidize that residency because it is actively unprofitable. That’s obviously an extreme case, but there are many training programs or essentially apprenticeships that last for years, including highly expensive time from senior people and expensive real world mistakes.

Imagine what it took to make Dwarkesh Patel into Dwarkesh Patel. Or the investment he makes in his own employees.

Even afterwards, in many ways you will always be ‘stuck with’ various aspects of those employees, and have to make the most of what they offer. This is standard.

Claude Opus estimates, and I think this is reasonable, that for every two hours humans spend working, they spend one hour learning, with a little less than half of that learning essentially ‘on the job.’

If you need to train not a ‘universal’ LLM but a highly specific-purpose LLM, and have a massive compute budget with which to do so, and you mostly don’t care about how it performs out of distribution the same way you mostly don’t for an employee (as in, you teach it what you teach a human, which is ‘if this is outside your distribution or you’re failing at it then run it up the chain to your supervisor,’ and you have a classifier for that), and you can build and use tools along the way? Different ballgame.
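
A minimal sketch of that ‘run it up the chain’ pattern: wrap the narrow model in a gate that escalates anything out of distribution or low confidence to a human supervisor. Every name and threshold here is a hypothetical placeholder.

```python
# Sketch: a narrow task model plus an escalation gate, mirroring how a junior
# employee is told to flag anything unfamiliar rather than guess.
# All names and thresholds are hypothetical placeholders.
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class Decision:
    answer: Optional[str]
    escalated: bool
    reason: str = ""

def handle(task: str,
           task_model: Callable[[str], str],         # narrow fine-tuned model
           ood_score: Callable[[str], float],        # distance from training data
           confidence: Callable[[str, str], float],  # score of the model's answer
           ood_threshold: float = 0.8,
           conf_threshold: float = 0.9) -> Decision:
    if ood_score(task) > ood_threshold:
        return Decision(None, True, "outside the trained distribution")
    answer = task_model(task)
    if confidence(task, answer) < conf_threshold:
        return Decision(None, True, "low confidence, needs human review")
    return Decision(answer, False)
```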

It makes sense, given the pace of progress, for most people and companies not to put that kind of investment into AI ‘employees’ or other AI tasks. But if things do start to stall out, or they don’t, either way the value proposition on that will quickly improve. It will start to be worth doing. And we will rapidly learn new ways of doing it better, and have the results available to be copied.

Here’s his predictions on computer use in particular, to see how much we actually disagree:

When I interviewed Anthropic researchers Sholto Douglas and Trenton Bricken on my podcast, they said that they expect reliable computer use agents by the end of next year. We already have computer use agents right now, but they’re pretty bad. They’re imagining something quite different.

Their forecast is that by the end of next year, you should be able to tell an AI, “Go do my taxes.” And it goes through your email, Amazon orders, and Slack messages, emails back and forth with everyone you need invoices from, compiles all your receipts, decides which are business expenses, asks for your approval on the edge cases, and then submits Form 1040 to the IRS.

I’m skeptical. I’m not an AI researcher, so far be it for me to contradict them on technical details. But given what little I know, here’s why I’d bet against this forecast:

  • As horizon lengths increase, rollouts have to become longer. The AI needs to do two hours worth of agentic computer use tasks before we can even see if it did it right. Not to mention that computer use requires processing images and video, which is already more compute intensive, even if you don’t factor in the longer rollout. This seems like this should slow down progress.

Let’s take the concrete example here, ‘go do my taxes.’

This is a highly agentic task, but as with a real accountant, you can choose to ‘check its work’ if you want, or get another AI to check the work, because you can totally break this down into smaller tasks that allow for verification, or present a plan of tasks that can be verified. Similarly, if you are training TaxBot to do people’s taxes for them, you can train TaxBot on a lot of those individual subtasks, and give it clear feedback.
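
For instance, here is a sketch of what that decomposition might look like as an orchestration loop, with each short subtask checked before the next one runs; the subtask list and checker functions are hypothetical, not a description of any real TaxBot:

```python
# Sketch: an agentic "do my taxes" run decomposed into short subtasks, each with
# its own cheap verifier, so errors get caught or escalated instead of
# compounding across a multi-hour rollout.
# The subtask names, verifiers, and helper callables are hypothetical.
SUBTASKS = [
    ("collect_receipts",  "Pull this year's receipts from email and order history"),
    ("match_invoices",    "Match each expense to an invoice; email vendors for gaps"),
    ("classify_expenses", "Label each expense as business or personal"),
    ("fill_form_1040",    "Fill the Form 1040 fields from the classified totals"),
]

def run_tax_agent(agent, verifiers, ask_user, max_retries=2):
    results = {}
    for name, instruction in SUBTASKS:
        issues = ""
        for _ in range(max_retries + 1):
            output = agent(instruction, context=results)  # short-horizon step
            ok, issues = verifiers[name](output)          # easy to check
            if ok:
                results[name] = output
                break
            instruction = f"{instruction}\nFix these issues: {issues}"
        else:
            # The edge case the agent can't resolve gets routed to the human,
            # like a real accountant asking what is up with this receipt.
            results[name] = ask_user(name, issues)
    return results
```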

Almost all computer use tasks are like this? Humans also mostly don’t do things that can’t be verified for hours?

And the core building block issues of computer use seem mostly like very short time horizon tasks with very easy verification methods. If you can get lots of 9s on the button clicking and menu navigation and so on, I think you’re a lot of the way there.

The subtasks are also 99%+ things that come up relatively often, and that don’t present any non-trivial difficulties. A human accountant will already have to occasionally say ‘wait, I need you, the taxpayer, to tell me what the hell is up with this thing,’ and we’re giving the AI in 2028 the ability to do this too.

I don’t see any fundamental difference between the difficulties being pointed out here, and the difficulties of tasks we have already solved.

  • We don’t have a large pretraining corpus of multimodal computer use data. I like this quote from Mechanize’s post on automating software engineering: “For the past decade of scaling, we’ve been spoiled by the enormous amount of internet data that was freely available for us to use. This was enough for cracking natural language processing, but not for getting models to become reliable, competent agents. Imagine trying to train GPT-4 on all the text data available in 1980—the data would be nowhere near enough, even if we had the necessary compute.”

    Again, I’m not at the labs. Maybe text only training already gives you a great prior on how different UIs work, and what the relationship between different components is. Maybe RL fine tuning is so sample efficient that you don’t need that much data. But I haven’t seen any public evidence which makes me think that these models have suddenly gotten less data hungry, especially in this domain where they’re substantially less practiced.

    Alternatively, maybe these models are such good front end coders that they can just generate millions of toy UIs for themselves to practice on. For my reaction to this, see bullet point below.

I’m not going to keep working for the big labs for free on this one by giving even more details on how I’d solve all this, but this totally seems like highly solvable problems, and also this seems like a case of the person saying it can’t be done interrupting the people doing it? It seems like progress is being made rapidly.

  • Even algorithmic innovations which seem quite simple in retrospect seem to take a long time to iron out. The RL procedure which DeepSeek explained in their R1 paper seems simple at a high level. And yet it took 2 years from the launch of GPT-4 to the release of o1.

  • Now of course I know it is hilariously arrogant to say that R1/o1 were easy – a ton of engineering, debugging, pruning of alternative ideas was required to arrive at this solution. But that’s precisely my point! Seeing how long it took to implement the idea, ‘Train the model to solve verifiable math and coding problems’, makes me think that we’re underestimating the difficulty of solving the much gnarlier problem of computer use, where you’re operating in a totally different modality with much less data.

I think the two years covers the time needed to have the idea of o1 and commit to it, then to implement it. Four months is roughly the actual time it took from ‘here is that sentence and we know it works’ to full implementation. Also, we’re going to have massively more resources to pour into these questions this time around, and frankly I don’t think any of these insights are even as hard to find as o1, especially now that we have reasoning models to use as part of this process.

I think there are other potential roadblocks along the way, and once you factor all of those in you can’t be that much more optimistic, but I see this particular issue as not that likely to pose that much of a bottleneck for long.

His predictions are that he’d take 50/50 bets on: 2028 for an AI that can ‘just go do your taxes as well as a human accountant could’ and 2032 for one that ‘can learn details and preferences on the job as well as a human can.’ I’d be inclined to take the other side of both of those bets, assuming it means by EOY; for the 2032 one we’d need to flesh out details.

But if we have the ‘AI that does your taxes’ in 2028 then 2029 and 2030 look pretty weird, because this implies other things:

Daniel Kokotajlo: Great post! This is basically how I think about things as well. So why the difference in our timelines then?

–Well, actually, they aren’t that different. My median for the intelligence explosion is 2028 now (one year longer than it was when writing AI 2027), which means early 2028 or so for the superhuman coder milestone described in AI 2027, which I’d think roughly corresponds to the “can do taxes end-to-end” milestone you describe as happening by end of 2028 with 50% probability. Maybe that’s a little too rough; maybe it’s more like month-long horizons instead of week-long. But at the growth rates in horizon lengths that we are seeing and that I’m expecting, that’s less than a year…

–So basically it seems like our only serious disagreement is the continual/online learning thing, which you say 50% by 2032 on whereas I’m at 50% by end of 2028. Here, my argument is simple: I think that once you get to the superhuman coder milestone, the pace of algorithmic progress will accelerate, and then you’ll reach full AI R&D automation and it’ll accelerate further, etc. Basically I think that progress will be much faster than normal around that time, and so innovations like flexible online learning that feel intuitively like they might come in 2032 will instead come later that same year.

(For reference, AI 2027 depicts a gradual transition from today to fully online learning, where the intermediate stages look something like “Every week, and then eventually every day, they stack on another fine-tuning run on additional data, including an increasingly high amount of on-the-job real world data.” A janky, unprincipled solution in early 2027 that gives way to more elegant and effective things midway through the year.)
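
To make that intermediate stage concrete, here is a minimal sketch of what ‘stack on another fine-tuning run each period’ could look like. Every helper name here is invented for illustration; this is not anyone’s actual pipeline.

```python
import datetime

def continual_learning_loop(model, deploy, collect_on_the_job_data, finetune,
                            period=datetime.timedelta(weeks=1),
                            min_period=datetime.timedelta(days=1)):
    """Janky continual learning: keep folding fresh on-the-job data back into
    the weights, on a schedule that tightens from weekly toward daily."""
    while True:
        deploy(model)                                  # serve the current checkpoint
        new_data = collect_on_the_job_data(period)     # trajectories, feedback, corrections
        model = finetune(model, new_data)              # stack another fine-tuning run
        period = max(min_period, period / 2)           # weekly -> eventually daily
```

The ‘janky, unprincipled’ part is exactly that nothing here is clever: it is just repeatedly fine-tuning on whatever the deployment generated, faster and faster.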

I found this an interestingly wrong thing to think:

Richard: Given the risk of fines and jail for filing your taxes wrong, and the cost of processing poor-quality paperwork that the government will have to bear, it seems very unlikely that people will want AI to do taxes, and very unlikely that a government will allow AI to do taxes.

For anyone whose taxes are complex, the rate of filing them fully accurately is basically 0%. Everyone makes mistakes. When the AI gets this right almost every time, it’s already much better than a human accountant, and you’ll have a strong case that what happened was accidental, which means at worst you pay some modest penalties.

Personal story: I was paying accountants at a prestigious firm that will go unnamed to do my taxes, and they literally just forgot to include city tax at all. As in, I’m looking at the forms and I ask, ‘Wait, why does it have $0 under city tax?’ and the guy essentially says, ‘Oh, whoops.’ So, yeah. Mistakes are made. This will be like self-driving cars, where we’ll impose vastly higher standards of accuracy and legal compliance on the AIs, and they will meet them, because the bar really is not that high.

There were also some good detailed reactions and counterarguments from others:

Near: finally some spicy takes around here.

Rohit: The question is whether we need humanlike labour for transformative economic outcomes, or whether we can find ways to use the labour it does provide with a different enough workflow that it adds substantial economic advantage.

Sriram Krishnan: Really good post from @dwarkesh_sp on continuous learning in LLMs.

Vitalik Buterin: I have high probability mass on longer timelines, but this particular issue feels like the sort of limitation that’s true until one day someone discovers a magic trick (think eg. RL on CoT) that suddenly makes it no longer true.

Sriram Krishnan: Agree – CoT is a particularly good example.

Ryan Greenblatt: I agree with much of this post. I also have roughly 2032 medians to things going crazy, I agree learning on the job is very useful, and I’m also skeptical we’d see massive white collar automation without further AI progress.

However, I think Dwarkesh is wrong to suggest that RL fine-tuning can’t be qualitatively similar to how humans learn.

In the post, he discusses AIs constructing verifiable RL environments for themselves based on human feedback and then argues this wouldn’t be flexible and powerful enough to work, but RL could be used more similarly to how humans learn.

My best guess is that the way humans learn on the job is mostly by noticing when something went well (or poorly) and then sample efficiently updating (with their brain doing something analogous to an RL update). In some cases, this is based on external feedback (e.g. from a coworker) and in some cases it’s based on self-verification: the person just looking at the outcome of their actions and then determining if it went well or poorly.

So, you could imagine RL’ing an AI based on both external feedback and self-verification like this. And, this would be a “deliberate, adaptive process” like human learning. Why would this currently work worse than human learning?

Current AIs are worse than humans at two things, which makes RL (quantitatively) much worse for them:

1. Robust self-verification: the ability to correctly determine when you’ve done something well/poorly in a way which is robust to you optimizing against it.

2. Sample efficiency: how much you learn from each update (potentially leveraging stuff like determining what caused things to go well/poorly which humans certainly take advantage of). This is especially important if you have sparse external feedback.

But, these are more like quantitative than qualitative issues IMO. AIs (and RL methods) are improving at both of these.

All that said, I think it’s very plausible that the route to better continual learning routes more through building on in-context learning (perhaps through something like neuralese, though this would greatly increase misalignment risks…).

Some more quibbles:

– For the exact podcasting tasks Dwarkesh mentions, it really seems like simple fine-tuning mixed with a bit of RL would solve his problem. So, an automated training loop run by the AI could probably work here. This just isn’t deployed as an easy-to-use feature.

– For many (IMO most) useful tasks, AIs are limited by something other than “learning on the job”. At autonomous software engineering, they fail to match humans with 3 hours of time and they are typically limited by being bad agents or by being generally dumb/confused. To be clear, it seems totally plausible that for podcasting tasks Dwarkesh mentions, learning is the limiting factor.

– Correspondingly, I’d guess the reason that we don’t see people trying more complex RL based continual learning in normal deployments is that there is lower hanging fruit elsewhere and typically something else is the main blocker. I agree that if you had human level sample efficiency in learning this would immediately yield strong results (e.g., you’d have very superhuman AIs with 10^26 FLOP presumably), I’m just making a claim about more incremental progress.

– I think Dwarkesh uses the term “intelligence” somewhat atypically when he says “The reason humans are so useful is not mainly their raw intelligence. It’s their ability to build up context, interrogate their own failures, and pick up small improvements and efficiencies as they practice a task.” I think people often consider how fast someone learns on the job as one aspect of intelligence. I agree there is a difference between short feedback loop intelligence (e.g. IQ tests) and long feedback loop intelligence and they are quite correlated in humans (while AIs tend to be relatively worse at long feedback loop intelligence).

More thoughts/quibbles:

– Dwarkesh notes “An AI that is capable of online learning might functionally become a superintelligence quite rapidly, even if there’s no algorithmic progress after that point.” This seems reasonable, but it’s worth noting that if sample efficient learning is very compute expensive, then this might not happen so rapidly.

– I think AIs will likely overcome poor sample efficiency to achieve a very high level of performance using a bunch of tricks (e.g. constructing a bunch of RL environments, using a ton of compute to learn when feedback is scarce, learning from much more data than humans due to “learn once deploy many” style strategies). I think we’ll probably see fully automated AI R&D prior to matching top human sample efficiency at learning on the job. Notably, if you do match top human sample efficiency at learning (while still using a similar amount of compute to the human brain), then we already have enough compute for this to basically immediately result in vastly superhuman AIs (human lifetime compute is maybe 3e23 FLOP and we’ll soon be doing 1e27 FLOP training runs). So, either sample efficiency must be worse or at least it must not be possible to match human sample efficiency without spending more compute per data-point/trajectory/episode.
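
Spelling out the arithmetic in that last point, using the quoted ballpark numbers (which are rough):

```python
human_lifetime_flop = 3e23   # rough estimate quoted above
training_run_flop = 1e27     # scale of a near-future frontier training run

lifetimes_per_run = training_run_flop / human_lifetime_flop
print(f"{lifetimes_per_run:,.0f} human lifetimes of compute per training run")
# -> roughly 3,333
```

On those numbers a single frontier run buys thousands of human lifetimes’ worth of compute, which is why matching human sample efficiency at human-like compute cost per datapoint would imply vastly superhuman results more or less immediately.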

Matt Reardon: Dwarkesh commits the sin of thinking work you’re personally close to is harder-than-average to automate.

Herbie Bradley: I mean this is just correct? most researchers I know think continual learning is a big problem to be solved before AGI

Matt Reardon: My main gripe is that “<50%" [of jobs being something you can automate soon] should be more like "<15%"

Danielle Fong: Gell-Mann Amnesia for AI.

Reardon definitely confused me here, but either way I’d say that Dwarkesh Patel is a 99th percentile performer. He does things most other people can’t do. That’s probably going to be harder to automate than most other white collar work? The bulk of hours in white collar work are very much not bespoke, and don’t depend on subtle state or memory building up in a person over time?

Now that we’ve had a good detailed discussion and seen several perspectives, it’s time to address another discussion of related issues, because it is drawing attention from an unlikely source.

After previously amplifying Situational Awareness, Ivanka Trump is back in the Essay Meta with high praise for The Era of Experience, authored by David Silver and (oh no) Richard Sutton.

Situational Awareness was an excellent pick. I do not believe this essay was a good pick. I found it a very frustrating, unoriginal and unpersuasive paper to read. To the extent it is saying something new, I don’t agree; but it’s not clear to what extent it is saying anything new. Unless you want to know about this paper exactly because Ivanka is harping on it, you should skip this section.

I think the paper effectively mainly says we’re going to do a lot more RL and we should stop trying to make the AIs mimic, resemble or be comprehensible to humans or trying to control their optimization targets?

Ivanka Trump: Perhaps the most important thing you can read about AI this year: “Welcome to the Era of Experience”

This excellent paper from two senior DeepMind researchers argues that AI is entering a new phase—the “Era of Experience”—which follows the prior phases of simulation-based learning and human data-driven AI (like LLMs).

The authors posit that future AI breakthroughs will stem from learning through direct interaction with the world, not from imitating human-generated data.

This is not a theory or distant future prediction. It’s a description of a paradigm shift already in motion.

Let me know what you think!

Glad you asked, Ivanka! Here’s what I think.

The essay starts off with a perspective we have heard before, usually without much of an argument behind it: that LLMs and other AIs trained only on ‘human data’ are ‘rapidly approaching a limit,’ that we are running out of high-quality data, and that to progress significantly further AIs will need to move into ‘the era of experience,’ meaning learning continuously from their environments.

I agree that the standard ‘just feed it more data’ approach will run out of data with which to scale, but there are a variety of techniques already being used to get around this. We have lots of options.

The leading example the paper itself gives of this in the wild is AlphaProof, which ‘interacted with a formal proving system,’ which seems to me like a clear case of synthetic data working and verification being easier than generation, rather than ‘experience.’ If the argument is simply that RL systems will learn by having their outputs evaluated, that isn’t news.
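
To illustrate the verification-versus-generation asymmetry with a toy example (purely illustrative, and unrelated to AlphaProof’s actual setup beyond the shape of the idea): checking a candidate factorization is cheap, finding one is not, so you can churn out guesses and keep only the ones that verify as synthetic training data.

```python
import random

def verify(n: int, factors: tuple[int, int]) -> bool:
    """Checking is cheap: multiply and compare."""
    a, b = factors
    return a * b == n and a > 1 and b > 1

def generate_candidates(n: int, attempts: int = 10_000):
    """Generating is the hard part; dumb guessing stands in for it here."""
    for _ in range(attempts):
        a = random.randint(2, n - 1)
        if n % a == 0:
            yield (a, n // a)

# Keep only the verified examples -- this is the synthetic-data filter.
n = 91
verified = [(n, f) for f in generate_candidates(n) if verify(n, f)]
```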

They claim to have in mind something rather different from that, and with this One Weird Trick they assert Superintelligence Real Soon Now:

Our contention is that incredible new capabilities will arise once the full potential of experiential learning is harnessed. This era of experience will likely be characterised by agents and environments that, in addition to learning from vast quantities of experiential data, will break through the limitations of human-centric AI systems in several further dimensions:

• Agents will inhabit streams of experience, rather than short snippets of interaction.

• Their actions and observations will be richly grounded in the environment, rather than interacting via human dialogue alone.

• Their rewards will be grounded in their experience of the environment, rather than coming from human prejudgement.

• They will plan and/or reason about experience, rather than reasoning solely in human terms.

We believe that today’s technology, with appropriately chosen algorithms, already provides a sufficiently powerful foundation to achieve these breakthroughs. Furthermore, the pursuit of this agenda by the AI community will spur new innovations in these directions that rapidly progress AI towards truly superhuman agents.

I suppose if the high level takeaway is ‘superintelligence is likely coming reasonably soon with the right algorithms’ then there’s no real disagreement?

They then, however, discuss tool calls and computer use, which seems like a retreat back into an ordinary RL paradigm? It’s also not clear to me what the authors mean by ‘human terms’ versus ‘plan and/or reason about experience,’ or even what ‘experience’ means here. They seem to be drawing a distinction without a difference.

If the distinction is simply (as the paper implies in places) that the agents will do self-evaluation rather than relying on human feedback, I have some important news about how existing systems already function? They use the human feedback and other methods to train an AI feedback system that does most of the work? And yes they often include ‘real world’ feedback systems in that? What are we even saying here?

They also seem to be drawing a distinction between the broke ‘human feedback’ and the bespoke ‘humans report physical world impacts’ (or ‘other systems measure real world impacts’) as if the first does not often encompass the second. I keep noticing I am confused what the authors are trying to say.

For reasoning, they say it is unlikely that human methods of reasoning and human language are optimal, and that more efficient methods of thought must exist. I mean, sure, but that’s also true for humans, and it’s obvious that you could use ‘human-style methods of thought’ to get to superintelligence, simply by imagining a human plus particular AI advantages.

As many have pointed out (and as is central to AI 2027), encouraging AIs to use alien-looking, inhuman reasoning styles we cannot parse is likely a very bad idea even if it would be more effective: what visibility we have would be lost, and it likely leads to alien values and breaks many happy things. Then again, Richard Sutton is one of the authors of this paper, and he thinks we should welcome succession, as in the extinction of humanity, so he wouldn’t care.

They try to argue against this by saying that while agents pose safety risks, and this approach may increase those safety risks, the approach may also have safety benefits. First, they say this allows the AI to adapt to its environment, as if other agents could not do this, or as if this should make us feel safer.

Second, they say ‘the reward function may itself be adapted through experience.’ In terms of risk, that’s worse. You know that’s worse, right? They literally say ‘rather than blindly optimizing a signal such as the number of paperclips it can adapt to indications of human concern,’ which shows a profound lack of understanding of, and curiosity about, where the whole misspecification-of-rewards problem comes from, and of the arguments about it from Yudkowsky (since they bring in the ‘paperclips’).

Adapting autonomously and automatically towards something like ‘level of human concern’ is exactly the kind of metric and strategy that is absolutely going to encourage perverse outcomes and get you killed at the limit. You don’t get out of the specification problem by saying you can specify something messier and let the system adapt around it autonomously; that only makes it worse, and in no way addresses the actual issue.

The final argument for safety is that relying on physical experience creates time limitations, which provides a ‘natural break.’ Which is to say that capability limits imposed by physical interaction will keep things safer? Seriously?

There is almost nothing in the way of actual evidence or argument in the paper that is not fully standard, beyond a few intuition pumps. There are many deep misunderstandings, including fully backwards arguments, along the way. We may well want to rely a lot more on RL and on various different forms of ‘experiential’ data and continuous learning, but given how much worse the paper was than I expected, it updated me in the opposite direction from the one clearly intended.


Dwarkesh Patel on Continual Learning Read More »

protesters-summon,-burn-waymo-robotaxis-in-los-angeles-after-ice-raids

Protesters summon, burn Waymo robotaxis in Los Angeles after ICE raids

The robotaxi company Waymo has suspended service in some parts of Los Angeles after some of its vehicles were summoned and then vandalized by protesters angered by ongoing raids by US Immigration and Customs Enforcement. Five of Waymo’s autonomous Jaguar I-Pace electric vehicles were summoned downtown to the site of anti-ICE protests, at which point they were vandalized with slashed tires and spray-painted messages. Three were set on fire.

The Los Angeles Police Department warned people to avoid the area due to risks from toxic gases given off by burning EVs. And Waymo told Ars that it is “in touch with law enforcement” regarding the matter.

The protesters in Los Angeles were outraged after ICE, using brutal tactics, began detaining people in raids across the city. Thousands of Angelenos took to the streets over the weekend to confront the masked federal enforcers and, in some cases, force them away.

In response, the Trump administration mobilized more than 300 National Guard soldiers without consulting the California governor or being asked by him to do so.

California Governor Gavin Newsom has promised to sue the administration. “Donald Trump has created the conditions you see on your TV tonight. He’s exacerbated the conditions. He’s, you know, lit the proverbial match. He’s putting fuel on this fire, ever since he announced he was taking over the National Guard—an illegal act, an immoral act, an unconstitutional act,” Newsom said in an interview.

Waymo began offering rides in Los Angeles last November, and by January, the company said it had driven almost 2 million miles in the city. But there is some animosity toward robotaxis and food delivery robots, which are now being used by the Los Angeles Police Department as sources of surveillance footage. In April, the LAPD published footage obtained from a Waymo that it used to investigate a hit-and-run.

Protesters summon, burn Waymo robotaxis in Los Angeles after ICE raids Read More »

a-long-shot-plan-to-mine-the-moon-comes-a-little-closer-to-reality

A long-shot plan to mine the Moon comes a little closer to reality

The road ahead

Meyerson said the company’s current plan is to fly a prospecting mission in 2027, with a payload of less than 100 kg, likely on a commercial lander that is part of NASA’s Commercial Lunar Payload Services program. Two years later, the company seeks to fly a pilot plant. Meyerson said the size of this plant will depend on the launch capability available (i.e., if Starship is flying to the Moon, they’ll go big; if not, smaller).

Following this, Interlune is targeting 2032 for the launch of a solar-powered operating plant, which would include five mobile harvesters. The operation would also be able to return mined material to Earth. The total mass for this equipment would be about 40 metric tons, which could fly on a single Starship or two New Glenn Mk 2 landers. This would, understandably, be highly ambitious and capital-intensive. Meyerson said that, after raising $15 million last year, Interlune is planning a second fundraising round that should begin soon.

There are some outside factors that may be beneficial for Interlune. One is that China has a clear and demonstrated interest in sending humans to the Moon and has already sent rovers to explore for helium-3 resources. Moreover, with the exit of Jared Isaacman as a nominee to lead NASA, the Trump administration is likely to put someone in the position who is more focused on lunar activities. One candidate, a retired Air Force General named Steve Kwast, is a huge proponent of mining helium-3.

Interlune has a compelling story, as there are almost no other lunar businesses focused solely on commercial activities that will derive value from mining the lunar surface. In that sense, they could be a linchpin of a lunar economy. However, they have a long way to go, and a lot of lunar regolith to plow through, before they start delivering for customers.

A long-shot plan to mine the Moon comes a little closer to reality Read More »