Author name: Shannon Garcia


Analyzing A Critique Of The AI 2027 Timeline Forecasts

There was what everyone agrees was a high quality critique of the timelines component of AI 2027, by the LessWrong user and Substack writer Titotal.

It is great to have thoughtful critiques like this. The way you get actual thoughtful critiques like this, of course, is to post the wrong answer (at length) on the internet, and then respond by listening to the feedback and by making your model less wrong.

This is a high-effort, highly detailed, real engagement with this section. It gives the original authors the opportunity to critique the critique and time to respond, warns readers to beware of errors, shares the code used to generate the graphs, engages in detail, does a bunch of math work, and so on. That is The Way.

So, Titotal: Thank you.

I note up front that at least Daniel Kokotajlo has indeed adjusted his estimates, and has moved his median from ‘AI 2027’ to ‘AI 2028’ based on events since publication, and Eli’s revisions also push the estimates back a bit.

I also note up front that if you evaluated most statements made in the discourse (either non-worried AI forecasting, or AI in general, or more broadly) with this level of rigor, mostly you couldn’t because you’d hit ‘I made it up’ very quickly, but in other cases where someone is trying at least a little, in my experience the models fall apart a lot worse and a lot faster. No one has suggested ‘here is a better attempt to forecast the future and take the whole thing seriously’ that I consider to have a reasonable claim to that.

A lot of the disagreements come down to how much one should care, in different contexts, about how closely which calculations and graphs match past data. Titotal demands very strong adherence throughout. I think it’s good to challenge and poke at the gaps, but this seems in several places to go too far.

  1. The Headline Message Is Not Ideal.

  2. An Explanation of Where Superexponentiality Is Coming From.

  3. Three Methods.

  4. Time Horizon Extension Method.

  5. The Public Versus Internal Gap.

  6. The Difficulty Gap.

  7. Recent Progress.

  8. Infinite Time Horizons.

  9. Intermediate Speedups.

  10. Is There A Flawed Graph Still Up?

  11. Some Skepticism About Projection.

  12. Part 2: Benchmarks and Gaps and Beyond.

  13. Benchmarks.

  14. The Time Horizon Part of the Second Model.

  15. Why The Thresholds?

  16. The Gap Model.

  17. Eli Responds On LessWrong.

  18. On Eli’s Recent Update.

  19. Conclusion.

  20. Perhaps The Most Important Disagreement.

Note that this section is about discourse rather than the model, so many of you can skip it.

While I once again want to say up front that I am very much thankful for the substance of this critique, it would also be great to have an equally thoughtful headline presentation of such critiques. That, alas, (although again, thanks for writing this!) we did not get.

It is called ‘A deep critique of AI 2027’s bad timeline model.’ One could simply not use the word ‘bad’ here and we would still know you have strong disagreements with it, and there is much similar talk throughout, starting with the title and then this, the first use of bold:

Titotal (formatting in original): The article is huge, so I focussed on one section alone: their “timelines forecast” code and accompanying methodology section. Not to mince words, I think it’s pretty bad.

I’m not full on ‘please reconsider your use of adjectives’ but, well, maybe? Here is an active defense of the use of the word ‘bad’ here:

Neel Nanda: I agree in general [to try and not call things bad], but think that titotal’s specific use was fine. In my opinion, the main goal of that post was not to engage with AI 2027, which had already been done extensively in private, but rather to communicate their views to the broader community.

Titles in particular are extremely limited, many people only read the title, and titles are a key way people decide whether to read on, and efficiency of communication is extremely important.

The point they were trying to convey was these models that are treated as high status and prestigious should not be and I disagree that non-violent communication could have achieved a similar effect to that title (note, I don’t particularly like how they framed the post, but I think this was perfectly reasonable from their perspective.)

I mean, yes, if the goal of the post was to lower the status and prestige of AI 2027 and to do so through people reading the title and updating in that way, rather than to offer a helpful critique, then it is true that the title was the best local way to achieve that objective, epistemic commons be damned. I would hope for a different goal?

There are more of these jabs, and a matching persistent attitude and framing, sprinkled throughout what is in its actual content an excellent set of critiques – I find much that I object to, but I think a good critique here should look like that. Most of your objections should be successfully answered. Others can be improved. This is all the system working as designed, and the assessments don’t match the content.

To skip ahead, the author is a physicist, which is great except that they are effectively holding AI 2027 largely to the standards of a physics model before they would deem it fit for anyone to use it to make life decisions, even if this is ‘what peak modeling performance looks like.’

Except that you don’t get to punt the decisions, and Bayes Rule is real. Sharing one’s probability estimates and the reasons behind them is highly useful, and you can and should use that to help you make better decisions.

Tyler Cowen’s presentation of the criticism then compounds this, entitled ‘Modeling errors in AI doom circles’ (which is pejorative on multiple levels), calling the critique ‘excellent’ (the critique in its title calls the original ‘bad’), then presenting this as an argument for why this proves they should have… submitted AI 2027 to a journal? Huh?

Tyler Cowen: There is much more detail (and additional scenarios) at the link. For years now, I have been pushing the line of “AI doom talk needs traditional peer review and formal modeling,” and I view this episode as vindication of that view.

That was absurd years ago. It is equally absurd now, unless the goal of this communication is to lower the status of its subject.

This is the peer! This is the review! That is how all of this works! This is it working!

Classic ‘if you want the right answer, post the (ideally less) wrong one on the internet.’ The system works. Whereas traditional peer review is completely broken here.

Indeed, Titotal says it themselves.

Titotal: What makes AI 2027 different from other similar short stories is that it is presented as a forecast based on rigorous modelling and data analysis from forecasting experts. It is accompanied by five appendices of “detailed research supporting these predictions” and a codebase for simulations.

Now, I was originally happy to dismiss this work and just wait for their predictions to fail, but this thing just keeps spreading, including a youtube video with millions of views.

As in: I wasn’t going to engage with any of this until I saw it getting those millions of views, only then did I actually look at any of it.

Which is tough but totally fair, a highly sensible decision algorithm, except for the part where Titotal dismissed the whole thing as bogus before actually looking.

The implications are clear. You want peer review? Earn it with views. Get peers.

It is strange to see these two juxtaposed together. You get the detailed thoughtful critique for those who Read the Whole Thing. For those who don’t, at the beginning and conclusion, you get vibes.

Also (I discovered this after I’d finished analyzing the post) it turns out this person’s substack (called Timeline Topography Tales) is focused on, well, I’ll let Titotal explain, by sharing the most recent headlines and the relevant taglines in order, that appear before you click ‘see all’:

15 Simple AI Image prompts that stump ChatGPT

Slopworld 2035: The dangers of mediocre AI. None of this was written with AI assistance.

AI is not taking over material science (for now): an analysis and conference report. Confidence check: This is my field of expertise, I work in the field and I have a PhD in the subject.

A nerds guide to dating: Disclaimer: this blog is usually about debunking singularity nerds. This is not a typical article, nor is it my area of expertise.

The walled marketplace of ideas: A statistical critique of SSC book reviews.

Is ‘superhuman’ AI forecasting BS? Some experiments on the “539” bot from the Centre for AI Safety.

Most smart and skilled people are outside of the EA/rationalist community: An analysis.

I’m not saying this is someone who has an axe and is grinding it, but it is what it is.

Despite this, it is indeed a substantively excellent post, so LessWrong has awarded this post 273 karma as of this writing, very high and more than I’ve ever gotten in a single post, and 213 on the EA forum, also more than I’ve ever gotten in a single post.

Okay, with that out of the way up top, who wants to stay and Do Forecasting?

This tripped me up initially, so it’s worth clarifying up front.

The AI 2027 model has two distinct sources of superexponentiality. That is why Titotal will later talk about there being an exponential model and a superexponential model, and then about there being a superexponential effect applied to both.

The first source is AI automation of AI R&D. It should be clear why this effect is present.

The second source is a reduction in difficulty of doubling the length or reliability of tasks, once the lengths in question pass basic thresholds. As in, at some point, it is a lot easier to go from reliably doing one year tasks to two year tasks, than it is to go from one hour to two hours, or from one minute to two minutes. I think this is true in humans, and likely true for AIs in the circumstances in question, as well. But you certainly could challenge this claim.

Okay, that’s out of the way, on to the mainline explanation.

Summarizing the breakdown of the AI 2027 model:

  1. The headline number is the time until development of ‘superhuman coders’ (SC), that can do an AI researcher job 30x as fast and 30x cheaper than a human.

  2. Two methods are used, ‘time horizon extension’ and ‘benchmarks and gaps.’

  3. There is also a general subjective ‘all things considered.’

Titotal (matching my understanding): The time horizon method is based on 80% time horizons from this report, where the team at METR tried to compare the performance of AI on various AI R&D tasks and quantify how difficult they are by comparing to human researchers. An 80% “time horizon” of 1 hour would mean that an AI has an overall success rate of 80% on a variety of selected tasks that would take a human AI researcher 1 hour to complete, presumably taking much less time than the humans (although I couldn’t find this statement explicitly).

The claim of the METR report is that the time horizon of tasks that AI can do has been increasing at an exponential rate. The following is one of the graphs showing this progress: note the logarithmic scale on the y-axis:

Titotal warns that this report is ‘quite recent, not peer-reviewed and not replicated.’ Okay. Sure. AI comes at you fast, the above graph is already out of date and the o3 and Opus 4 (or even Sonnet 4) data points should further support the ‘faster progress recently’ hypothesis.

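For orientation, the purely exponential version of the extrapolation is simple: count how many doublings separate the current time horizon from the threshold you care about, then multiply by the doubling time. Here is a minimal sketch, with illustrative numbers rather than METR’s or AI 2027’s actual parameters:

```python
import math

def months_until(target_hours, h0_hours=1.0, doubling_months=4.0):
    """Count the doublings separating the current 80% time horizon from a
    target horizon, then convert to calendar months at a fixed doubling
    time. All values here are illustrative, not METR's or AI 2027's."""
    doublings_needed = math.log2(target_hours / h0_hours)
    return doublings_needed * doubling_months

# e.g. from a 1-hour horizon to roughly one month of work (~167 hours),
# at a fixed 4-month doubling time:
print(months_until(167))  # ~29.5 months
```
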
The first complaint is that they don’t include uncertainty in current estimates, and this is framed (you see this a lot) as one-directional uncertainty: Maybe the result is accurate, maybe it’s too aggressive.

But we don’t know whether or not this is the new normal or just noise or temporary bump where we’ll go back to the long term trend at some point. If you look at a graph of Moore’s law, for example, there are many points where growth is temporarily higher or lower than the long term trend. It’s the long term curve you are trying to estimate, you should be estimating the long term curve parameters, not the current day parameters.

This is already dangerously close to assuming the conclusion that there is a long term trend line (a ‘normal’), and we only have to find out what it is. This goes directly up against the central thesis being critiqued, which is that the curve bends when AI speeds up coding and AI R&D in a positive feedback loop.

There are three possibilities here:

  1. We have a recent blip of faster than ‘normal’ progress and will go back to trend.

    1. You could even suggest, this is a last gasp of reasoning models and inference scaling, and soon we’ll stall out entirely. You never know.

  2. We have a ‘new normal’ and will continue on the new trend.

  3. We have a pattern of things accelerating, and they will keep accelerating.

That’s where the whole ‘super exponential’ part comes in. I think the good critique here is that we should have a lot of uncertainty regarding which of these is true.

So what’s up with that ‘super exponential’ curve? They choose to model this as ‘each subsequent doubling time is 10% shorter than the one before.’ Titotal does some transformational math (which I won’t check) and draws curves.

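To make the shape concrete before getting into the objections, here is a minimal sketch of what ‘each doubling is 10% shorter’ implies, with illustrative starting values rather than the model’s own parameters. Because the doubling times shrink geometrically, the curve reaches arbitrarily long horizons in finite time (which is also why it produces crazy numbers if you push it far enough, as discussed below), and setting the shrink factor to zero recovers the plain exponential:

```python
def horizon_after(months, h0_minutes=15.0, doubling_months=4.0, shrink=0.10):
    """Toy version of the 'each doubling is 10% shorter' curve: step
    through doublings, shrinking the doubling time each step. Starting
    values are illustrative, not the AI 2027 parameters."""
    h, t, dt = h0_minutes, 0.0, doubling_months
    while t + dt <= months:
        t += dt
        h *= 2
        dt *= (1.0 - shrink)  # the next doubling takes 10% less time
    return h

# The doubling times form a geometric series, so the total time to run
# through every doubling is finite: doubling_months / shrink.
print(4.0 / 0.10)  # this toy curve blows up ~40 months after the start
```
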
Just like before, the initial time horizon H0 parameter is not subject to uncertainty analysis. What’s much more crazy here is that the rate of doubling growth, which we’ll call alpha, wasn’t subject to uncertainty either! (Note that this has been updated in Eli’s newest version). As we’ll see, the value of this alpha parameter is one of the most impactful parameters in the whole model, so it’s crazy that they didn’t model any uncertainty on it, and just pick a seemingly arbitrary value of 10% without explaining why they did so.

The central criticism here seems to be that there isn’t enough uncertainty, that essentially all the parameters here should be uncertain. I think that’s correct. I think it’s also a correct general critique of most timeline predictions, that people are acting far more certain than they should be. Note that this goes both ways – it makes it more likely things could be a lot slower, but also they could be faster.

What the AI 2027 forecast is doing is using the combination of different curve types to embody the uncertainty in general, rather than also trying to fully incorporate uncertainty in all individual parameters.

I also agree that this experiment shows something was wrong, and a great way to fix a model is to play with it until it produces a stupid result in some hypothetical world, then figure out why that happened:

Very obviously, having to go through a bunch more doublings should matter more than this. You wouldn’t put p(SC in 2025) at 5.8% if we were currently at fifteen nanoseconds. Changing the initial conditions a lot seems to break the model.

If you think about why the model sets up the way it does, you can see why it breaks. The hypothesis is that as AI improves, it gains the ability to accelerate further AI R&D progress, and that this may be starting to happen, or things might otherwise still go superexponential.

Those probabilities are supposed to be forward looking from this point, whereas we know the superexponential effects did not kick in before this point. It’s not obvious when we should have had this effect kick in if we were modeling this ‘in the past’ without knowing what we know now, but it obviously shouldn’t kick in before horizons of at least several minutes (as in, before the recent potential trend line changes), because the human has to be in the loop and you don’t save much time.

Thus, yes, the model breaks if you start it before that point, and ideally you would force the super exponential effects to not kick in until H is at least minutes long (with some sort of gradual phase in, presumably). Given that we were using a fixed H0, this wasn’t relevant, but if you wanted to use the model on situations with lower H0s you would have to fix that.

How much uncertainty do we have about current H0, at this point? I think it’s reasonable to argue something on the order of a few minutes is on the table if you hold high standards for what that means, but I think 15 seconds is very clearly off the table purely on the eyeball test.

Similarly, there is the argument that these equations start giving you crazy numbers if you extend them past some point. And I’d say, well, yeah, if you hit a singularity then your model outputting Obvious Nonsense is an acceptable failure mode. Fitting, even.

The next section asks both why we are using superexponential curves in general, and why we are using this ‘super exponential’ curve in particular.

So, what arguments do they provide for superexponentiality? Let’s take a look, in no particular order:

Argument 1: public vs internal:

“The trend would likely further tilt toward superexponentiality if we took into account that the public vs. internal gap has seemed to decrease over time.”

But even if we do accept this argument, this effect points to a slower growth rate, not a faster one.

I do think we should accept this argument, and also Titotal is correct on this one. The new curve suggests modestly slower progress.

The counterargument is that we used to be slowed down by this wait between models, in two ways.

  1. Others couldn’t know about, see, access, distill, or otherwise follow your model while it wasn’t released, which previously slowed down progress.

  2. No one could use the model to directly accelerate progress during the wait.

The counterargument to the counterargument is that until recently direct acceleration via using the model wasn’t a thing, so that effect shouldn’t matter, and mostly the trendline is OpenAI models so that effect shouldn’t matter much either.

I can see effects in both directions, but overall I do think within this particular context the slower direction arguments are stronger. We only get to accelerate via recklessly releasing new models once, and we’ve used that up now.

Slightly off topic, but it is worth noting that in AI 2027, this gap opens up again. The top lab knows that its top model accelerates AI R&D, so it does not release an up-to-date version, not for safety reasons but to race ahead of the competition and to direct more compute towards further R&D.

The next argument is the difficulty gap: that time horizon doublings get easier. Going from being able to consistently string together an hour to a week is claimed to be a larger conceptual gap than going from a week to a year.

Titotal is skeptical of this for both AIs and humans, especially because we have a lot of short term tutorials and few long term ones.

I would say that learning how to do fixed short term tasks, where you follow directions, is indeed far easier than general ‘do tasks that are assigned’ but once you are past that phase I don’t think the counterargument does much.

I agree with the generic ‘more research is needed’ style call here. Basically everywhere, more research is needed and better understanding would be good. Until then, better to go with what you have than to throw up one’s hands and say variations on ‘no evidence.’ Of course, one is free to disagree with the magnitudes chosen.

In humans, I think the difficulty gap is clearly real if you were able to hold yourself intact, once you are past the ‘learn the basic components’ stage. You can see it in the extremes. If you can sustain an effort reliably for a year, you’ve solved most of the inherent difficulties of sustaining it for ten.

The main reasons ten is harder (and a hundred is much, much harder!) are that life gets in the way, you age and change, and this alters your priorities and capabilities. At some point you’re handing off to successors. There are a lot of tasks where humans would essentially get to infinite task length if the human were an em that didn’t age.

With AIs in this context, aging and related concepts are not an issue. If you can sustain a year, why couldn’t you sustain two? The answer presumably is ‘compounding error rates’ plus longer planning horizons, but if you can use system designs that recover from failures, that solves itself, and if you get non-recoverable error rates either down to zero or get them to correlate enough, you’re done.

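To put a toy number on the compounding point (the step count and reliability figures here are made up for illustration):

```python
# Illustrative only: with some per-step probability of unrecoverable
# failure, success over a long task compounds multiplicatively. The
# step count and reliabilities below are made up.
steps_in_long_task = 50_000
for p_step in (0.99, 0.999, 0.9999):
    print(p_step, p_step ** steps_in_long_task)
```

Either the unrecoverable-error rate per step gets driven extremely low, or the system recovers from failures; otherwise long horizons are simply out of reach.
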
A recent speedup is quite weak evidence for this specific type of super exponential curve. As I will show later, you can come up with lots of different superexponential equations, you have to argue for your specific one.

That leaves the “scaling up agency training”. The METR report does say that this might be a cause for the recent speedup, but it doesn’t say anything about “scaling up agency training” being a superexponential factor. If agency training only started recently, this could instead be evidence that the recent advances have just bumped us into a faster exponential regime.

Or, as the METR report notes, it could just be a blip as a result of recent advances: “But 2024–2025 agency training could also be a one-time boost from picking low-hanging fruit, in which case horizon growth will slow once these gains are exhausted”.

This seems like an argument that strictly exponential curves should have a very strong prior? So you need to argue hard if you want to claim more than that?

The argument that ‘agency training’ has led to a faster doubling curve seems strong. Of course we can’t ‘prove’ it, but the point of forecasting is to figure out our best projections and models in practice, not to pass some sort of theoretical robustness check, or to show strongly why things must be this exact curve.

Is it possible that this has ‘only’ kicked us into a new faster exponential? Absolutely, but that possibility is explicitly part of AI 2027’s model, and indeed earlier Titotal was arguing that we shouldn’t think the exponential rate was even likely to have permanently shifted, and here they are not admitting that the mechanisms involved make this shift likely to be real.

I mention the ‘one time blip’ possibility above, as well, but it seems to me highly implausible that, if this is a ‘blip,’ we are close to done with it. There is obviously quite a lot of unhobbling left to do related to agency.

Should superhuman AGIs have infinite time horizons? AI 2027 doesn’t fully endorse their argument on this, but I think it is rather obvious that at some point doublings are essentially free.

Titotal responds to say that an AI that could do extremely long time horizon CS tasks would be a superintelligence, to which I would tap the sign that says we are explicitly considering what would be true about a superintelligence. That’s the modeling task.

The other argument here, that given a Graham’s number of years (and presumably immortality of some kind, as discussed earlier) a human can accomplish quite an awful lot, well, yes, even if you force them not to do the obviously correct path of first constructing a superintelligence to do it for them. But I do think there’s an actual limit here if the human has to do all the verification too, an infinite number of monkeys on typewriters can write Shakespeare but they can’t figure out where they put it afterwards, and their fastest solution to this is essentially to evolve into humans.

Alternatively, all we’re saying is ‘the AI can complete arbitrary tasks so long as they are physically possible’ and at that point it doesn’t matter if humans can do them too, the metric is obviously not mapping to Reality in a useful way and the point is made.

Now if you read the justifications in the section above, you might be a little confused as to why they didn’t raise the most obvious justification for superexponentiality: the justification that as AI gets better, people will be able to use the AI for r&d research, thus leading to a feedback loop of faster AI development.

The reason for this is that they explicitly assume this is true and apply it to every model, including the “exponential” and “subexponential” ones. The “exponential” model is, in fact, also superexponential in their model.

(Note: in Eli’s newest model this is substantially more complicated, I will touch on this later)

Titotal walks us through the calculation, which is essentially a smooth curve that speeds up progress based on feedback loops proportional to progress made towards a fully superhuman coder, implemented in a way to make it easily calculable and so it doesn’t go haywire on parameter changes.

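The rough shape of that kind of calculation, as I understand it (a hedged sketch; the functional form and the numbers here are assumptions for illustration, not the model’s own):

```python
def rnd_speedup(progress_fraction, v_now=1.1, v_sc=5.0):
    """Sketch of the idea only: interpolate the AI R&D progress
    multiplier between an assumed present-day value and an assumed
    value at the superhuman coder level, as a function of the fraction
    of progress made toward SC. The model's actual functional form and
    parameter values differ."""
    return v_now * (v_sc / v_now) ** progress_fraction

print(rnd_speedup(0.0))  # assumed ~1.1x today
print(rnd_speedup(0.5))  # ~2.3x halfway to SC in this toy version
print(rnd_speedup(1.0))  # assumed 5x at SC
```

Running a curve like this backwards past the starting point is what generates the implied 2022 speedup that Titotal checks against the forecasters’ own estimates.
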
Titotal’s first objection is that this projection implies (if you run the calculation backwards) AI algorithmic progress is currently 66% faster than it was in 2022, whereas Nikola (one of the forecasters) estimates current algorithmic progress is only 3%-30% faster, and the attempt to hardcode a different answer in doesn’t work, because relative speeds are what matters and they tried to change absolute speeds instead. That seems technically correct.

The question is, how much does this mismatch ultimately matter? It is certainly possible for the speedup factor from 2022 to 2025 to be 10% (1 → 1.1) and for progress to then accelerate far faster going forward as AI crosses into more universally useful territory.

As in, if you have an agent or virtual employee, it needs to cross some threshold to be useful at all, but after that it rapidly gets a lot more useful. But that’s not the way the model works here, so it needs to be reworked, and also yes, I think we should be more skeptical about the amount of algorithmic progress speedup we can get in the transitional stages here, or about the amount of progress required to get to SC, or both.

After walking through the curves in detail, this summarizes the objection to the lack of good fit for the past parts of the curve:

I assume the real data would mostly be within the 80% CI of these curves, but I don’t think the actual data should be an edge case of your model.

So, to finish off the “superexponential” the particular curve in their model does not match empirically with data, and as I argued earlier, it has very little conceptual justification either. I do not see the justification for assigning this curve 40% of the probability space.

I don’t think 75th percentile is an ‘edge case’ but I do agree that it is suspicious.

I think that the ‘super exponential’ curves are describing a future phenomenon, for reasons that everyone involved understands, that one would not expect to match backwards in time unless you went to the effort of designing equations to do that, which doesn’t seem worthwhile here.

This is the graph in question, the issues with it are in the process of being addressed.

I agree that various aspects of this graph and how it was presented weren’t great, especially using a 15% easier-each-time doubling curve rather than the 10% that AI 2027 actually uses, and calling it ‘our projection.’ I do think it mostly serves the purpose of giving a rough idea what is being discussed, but more precision would have been better, and I am glad this is being fixed.

This objection is largely that there are only 11 data points (there are now a few more) on the METR curve, and you can fit it with curves that look essentially the same now but give radically different future outcomes. And yes, I agree, that is kind of the point, and if anything we are underrepresenting the uncertainty here, we can agree that even if we commit to using fully simplified and fully best-fit-to-the-past models we get a range of outcomes that prominently include 2028-2029 SCs.

I do think it is reasonable to say that the super exponential curve, the way AI 2027 set it up, has more free variables than you would like when fitting 11 data points, if that’s all you were looking to do, but a lot of these parameters are far from free and are not being chosen in order to fit the past curve data.

We now move on to the second more complex model, which Titotal says in many ways is worse, because if you use a complicated model you have to justify the complications, and it doesn’t.

I think a better way to describe the 2nd model is, it predicts a transition in rate of progress around capabilities similar to saturation of re-bench, after which things will move at a faster pace, and uses the re-bench point as a practical way of simulating this.

Method 2 starts by predicting how long it would take to achieve a particular score (referred to as “saturation”) on Re-bench, a benchmark of AI skill on a group of ML research engineering tasks, also prepared by METR. After that, the time horizon extension model is used as with method 1, except that it starts later (when Re-bench saturates), and that it stops earlier (when a certain convoluted threshold is reached).

After that stopping point, 5 new gaps are estimated, which are just constants (as always, sampled from lognormal), and then the whole thing is run through an intermediate speedup model. So any critiques of model 1 will also apply to model 2, there will just be some dilution with all the constant gap estimates and the “re-bench” section.

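For concreteness, ‘constants sampled from lognormal’ amounts to something like the following (a minimal sketch; the gap CIs and the exact parameterization here are made-up illustrations, not the forecasters’ numbers):

```python
import numpy as np

rng = np.random.default_rng(0)

def lognormal_from_80ci(lo, hi, n):
    """Sample a positive quantity whose 80% CI is (lo, hi), treating it
    as lognormal. The 90th-percentile z-score of a standard normal is
    about 1.2816."""
    mu = (np.log(lo) + np.log(hi)) / 2.0
    sigma = (np.log(hi) - np.log(lo)) / (2.0 * 1.2816)
    return rng.lognormal(mu, sigma, n)

# Hypothetical gap CIs in months (made up for illustration, not the
# forecasters' numbers), summed per Monte Carlo draw:
gap_cis = [(1, 12), (2, 18), (1, 6), (3, 24), (1, 9)]
total_months = sum(lognormal_from_80ci(lo, hi, 10_000) for lo, hi in gap_cis)
print(np.percentile(total_months, [10, 50, 90]))
```
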
The reason to start later is obvious, you can’t start actually using AI skill for ML research tasks until it can beat not using it. So what you actually have is a kind of ‘shadow curve’ that starts out super negative – if you tried to use AI to do your ML tasks in 2017 you’d very obviously do way worse than doing it yourself. Then at some point in the 2020s you cross that threshold.

We also need a top of the curve, because this is a benchmark and by its nature it saturates even if the underlying skills don’t. In some senses the top of the S-curve is artificial, in some it isn’t.

Titotal points out that you can’t meaningfully best-fit an S-curve until you know you’ve already hit the top, because you won’t know where the top is. The claim is that we have no idea where the benchmark saturates, that projecting it to be 2 is arbitrary. To which I’d say, I mean, okay, weird but if true who cares? If the maximum is 3 and we approach that a bit after we hit 2, then that’s a truth about the benchmark, not about Reality, and nothing important changes. Then I realized that Titotal noticed this too, that as long as you’re above human performance it doesn’t change things substantially, so why are we having this conversation?

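For what it’s worth, fitting a logistic to an unsaturated benchmark looks something like this (a sketch with made-up scores, not RE-bench data), and it does illustrate why the ceiling is poorly pinned down before the curve bends, even if, as above, the exact ceiling doesn’t end up mattering much:

```python
import numpy as np
from scipy.optimize import curve_fit

def logistic(t, ceiling, midpoint, rate):
    """Generic S-curve: benchmark score rises toward `ceiling`."""
    return ceiling / (1.0 + np.exp(-rate * (t - midpoint)))

# Made-up scores by years-since-2022 (for illustration, not RE-bench data).
t = np.array([0.5, 1.0, 1.5, 2.0, 2.5, 3.0])
score = np.array([0.05, 0.10, 0.25, 0.50, 0.80, 1.10])

# With no data near the top, the fitted ceiling is poorly constrained,
# which is the core of the objection to fitting S-curves early.
params, _ = curve_fit(logistic, t, score, p0=[2.0, 2.5, 2.0], maxfev=20000)
print(params)  # fitted ceiling, midpoint, rate
```
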
This is a general pattern here. It’s virtuous to nitpick, but you should know when you’re nitpicking and when you’re not.

When you’re doing forecasting or modeling, you have to justify your decisions if and only if those decisions matter to the outcome. If it does not matter, it does not matter.

Speaking of doesn’t matter, oh boy does it not matter.

Step 2 is to throw this calculation in the trash.

I’m serious here. Look at the code. The variable t_sat_ci, the “CI for date when capability saturates”, is set by the forecaster, not calculated. There is no function related to the RE-bench data at all in the code. Feel free to look! It’s not in the updated code either.

Eli gives an 80% CI of saturation between september 2025 to january 2031, and Nikola gives an 80% CI of saturation between august 2025 and november 2026. Neither of these are the same as the 80% CI in the first of the two graphs, which is early 2026 to early 2027. Both distributions peak like half a year earlier than the actual Re-bench calculation, although Eli’s median value is substantially later.

Eli has told me that the final estimates for saturation time are “informed” by the logistic curve fitting, but if you look above they are very different estimates.

Those are indeed three very different curves. It seems that the calculation above is an intuition pump or baseline, and they instead go with the forecasters predictions, with Nikola expecting it to happen faster than the projection, and Eli having more uncertainty. I do think Nikola’s projection here seems unreasonably fast and I’d be surprised if he hasn’t updated by now?

Eli admits the website should have made the situation clear and he will fix it.

Titotal says we’ve ‘thrown out’ the re-bench part of the appendix. I say no, that’s not how this works, yes we’re not directly doing math with the output of the model above, but we are still projecting the re-bench results and using that to inform the broader model. That should have been made clear, and I am skeptical of Eli and Nikola’s graphs on this, especially the rapid sudden peak in Nikola’s, but the technique used is a thing you sometimes will want to do.

So basically we now do the same thing we did before, except a lot of it starts in the future.

Titotal: Okay, so we’ve just thrown out the re-bench part of the appendix. What happens next? Well, next, we do another time horizons calculation, using basically the same methodology as in method 1. Except we are starting later now, so:

They guess the year that we hit re-bench saturation.

They guess the time horizon at the point we hit re-bench saturation.

They guess the doubling time at the point when we hit re-bench saturation.

They guess the velocity of R&D speedup at the point when we hit re-bench saturation.

Then, they use these parameters to do the time horizons calculation from part 1, with a lower cut-off threshold I will discuss in a minute.

And they don’t have a good basis for these guesses, either. I can see how saturating RE-bench could give you some information about the time horizon, but not things like the doubling time, which is one of the most crucial parameters that is inextricably tied to long term trends.

Setting aside the cutoff, yes, this is obviously how you would do it. Before, we estimated those variables as of now; if you start in the future, you want to know what they will look like as you reach the pivot point.

Presumably you would solve this by running your model forward in the previous period, the same way you did in the first case? Except that this is correlated with the pace of re-bench progress, so that doesn’t work on its own. My guess is you would want to assign some percentage weight to the date and some percentage to what things would look like on your median pivot date.

And the estimation of doubling time is weird. The median estimate for doubling time at re-bench saturation is around 3 months, which is 33% lower than their current estimate for doubling time. Why do they lower it?

Well, partly because under the superexponential model there would have been speedups during the re-bench saturation period.

Titotal then repeats the concern about everything being super exponential, but I don’t see the issue on this one, although I would do a different calculation to decide on my expectation here.

I also don’t understand the ‘this simulation predicts AI progress to freeze in place for two years’ comment, as in I can’t parse why one would say that there.

And now here’s where we come to a place where I actually am more concerned than Titotal is:

The other main difference is that this time horizons model only goes to a lower threshold, corresponding to when AI hits the following requirement:

“Ability to develop a wide variety of software projects involved in the AI R&D process which involve modifying a maximum of 10,000 lines of code across files totaling up to 20,000 lines. Clear instructions, unit tests, and other forms of ground-truth feedback are provided. Do this for tasks that take humans about 1 month (as controlled by the “initial time horizon” parameter) with 80% reliability, and at the same cost and speed as humans.”

Despite differing by 2 orders of magnitude on the time horizon required for SC in the first method, when it comes to meeting this benchmark they are both in exact agreement for this threshold, which they both put as a median of half a month.

This is weird to me, but I won’t dwell on it.

I kind of want to dwell on this, and how they are selecting the first set of thresholds, somewhat more, since it seems rather important. I want to understand how these various disagreements interplay, and how they make sense together.

That’s central to how I look at things like this. You find something suspicious that looks like it won’t add up right. You challenge. They address it. Repeat.

I think I basically agree with the core criticism here, that this consists of guessing things about future technologies in a way that seems hard to get usefully right. It really is mostly a bunch of guessing, and it’s not clear that this complexity is helping the model do better than a more generalized guess would, perhaps using this as an intuition pump. I’m not sure. I don’t think this is causing a major disagreement in the mainline results, though?

In addition to updating the model, Eli responds with this comment.

I don’t understand the perspective that this is a ‘bad response.’ It seems like exactly how all of this should work, they are fixing mistakes and addressing communication issues, responding to the rest, and even unprompted offer a $500 bounty payment.

Eli starts off linking to the update to the model from May 7.

Here is Eli’s response on the ‘most important disagreements’:

  1. Whether to estimate and model dynamics for which we don’t have empirical data. e.g. titotal says there is “very little empirical validation of the model,” and especially criticizes the modeling of superexponentiality as having no empirical backing. We agree that it would be great to have more empirical validation of more of the model components, but unfortunately that’s not feasible at the moment while incorporating all of the highly relevant factors.[1]

    a. Whether to adjust our estimates based on factors outside the data. For example, titotal criticizes us for making judgmental forecasts for the date of RE-Bench saturation, rather than plugging in the logistic fit. I’m strongly in favor of allowing intuitive adjustments on top of quantitative modeling when estimating parameters.

  2. [Unsure about level of disagreement] The value of a “least bad” timelines model. While the model is certainly imperfect due to limited time and the inherent difficulties around forecasting AGI timelines, we still think overall it’s the “least bad” timelines model out there and it’s the model that features most prominently in my overall timelines views. I think titotal disagrees, though I’m not sure which one they consider least bad (perhaps METR’s simpler one in their time horizon paper?). But even if titotal agreed that ours was “least bad,” my sense is that they might still be much more negative on it than us. Some reasons I’m excited about publishing a least bad model:

    a. Reasoning transparency. We wanted to justify the timelines in AI 2027, given limited time. We think it’s valuable to be transparent about where our estimates come from even if the modeling is flawed in significant ways. Additionally, it allows others like titotal to critique it.

    b. Advancing the state of the art. Even if a model is flawed, it seems best to publish to inform others’ opinions and to allow others to build on top of it.

My read, as above, is that titotal indeed objects to a ‘least bad’ model if it is presented in a way that doesn’t have ‘bad’ stamped all over it with a warning not to use it for anything. I am strongly with Eli here. I am also with Thane that being ‘least bad’ is not on its own enough, reality does not grade on a curve and you have to hit a minimum quality threshold to be useful, but I do think they hit that.

As discussed earlier, I think #1 is also an entirely fair response, although there are other issues to dig into on those estimates and where they come from.

  3. The likelihood of time horizon growth being superexponential, before accounting for AI R&D automation. See this section for our arguments in favor of superexponentiality being plausible, and titotal’s responses (I put it at 45% in our original model). This comment thread has further discussion. If you are very confident in no inherent superexponentiality, superhuman coders by end of 2027 become significantly less likely, though are still >10% if you agree with the rest of our modeling choices (see here for a side-by-side graph generated from my latest model).

    a. How strongly superexponential the progress would be. This section argues that our choice of superexponential function is arbitrary. While we agree that the choice is fairly arbitrary and ideally we would have uncertainty over the best function, my intuition is that titotal’s proposed alternative curve feels less plausible than the one we use in the report, conditional on some level of superexponentiality.

    b. Whether the argument for superexponentiality is stronger at higher time horizons. titotal is confused about why there would sometimes be a delayed superexponential rather than starting at the simulation starting point. The reasoning here is that the conceptual argument for superexponentiality is much stronger at higher time horizons (e.g. going from 100 to 1,000 years feels likely much easier than going from 1 to 10 days, while it’s less clear for 1 to 10 weeks vs. 1 to 10 days). It’s unclear that the delayed superexponential is the exact right way to model that, but it’s what I came up with for now.

I don’t think 3b here is a great explanation, as I initially misunderstood it, but Eli has clarified that its intent matches my earlier statements about the ease of shifting to longer tasks being clearly greater at some point past the ‘learn the basic components’ stage. Also I worry this does drop out a bunch of the true objections, especially the pointing towards multiple different sources of superexponentiality (we have both automation of AI R&D and a potential future drop in the difficulty curve of tasks), which he lists under ‘other disagreements’ and says he hasn’t looked into yet – I think that’s probably the top priority to look at here at this point. I find the ‘you have to choose a curve and this seemed like the most reasonable one’ response to be, while obviously not the ideal world state, in context highly reasonable.

He then notes two other disagreements and acknowledges three mistakes.

Eli released an update in response to a draft of the Titotal critiques.

The new estimates are generally a year or two later, which mostly matches the updates I’d previously seen from Daniel Kokotajlo. This seems like a mix of model tweaks and adjusting for somewhat disappointing model releases over the last few months.

Overall Titotal is withholding judgment until Eli writes up more about it, which seems great, and also offers initial thoughts. Mostly he sees a few improvements but doesn’t believe his core objections are addressed.

Titotal challenges the move from 40% chance of super exponential curves to a 90% chance of an eventual such curve, although Eli notes that the 90% includes a lot of probability put into very large time horizon levels and thus doesn’t impact the answer that much. I see why one would generally be concerned about double counting, but I believe that I understand this better now and they are not double counting.

Titotal wraps up by showing you could draw a lot of very distinct graphs that ‘fit the data’ where ‘the data’ is METR’s results. And yes, of course, we know this, but that’s not the point of the exercise. No, reality doesn’t ‘follow neat curves’ all that often, but AI progress remarkably often has so far, and also we are trying to create approximations and we are all incorporating a lot more than the METR data points.

If you want to look at Titotal’s summary of why bad thing is bad, it’s at this link. I’ve already addressed each of these bullet points in detail. Some I consider to point to real issues, some not so much.

What is my overall take on the right modeling choices?

Simplicity is highly valuable. As the saying goes, make everything as simple as possible, but no simpler. There’s a lot to be said for mostly relying on something that has the shape of the first model, with the caveat of more uncertainty in various places, and that the ‘superexponential’ effects have an uncertain magnitude and onset point. There are a few different ways you could represent this. If I was doing this kind of modeling I’d put a lot more thought into the details than I have had the chance to do.

I would probably drop the detailed considerations of future bottlenecks and steps from the ultimate calculation, using them more as an intuition pump, the same way they currently calculate re-bench times and then put the calculation in the trash (see: plans are worthless, planning is essential.)

If I was going to do a deep dive, I would worry about whether we are right to combine these different arguments for superexponential progress, as in both AI R&D feedback loops and ease of future improvements, and whether either or both of them should be incorporated into the preset trend line or whether they have other issues.

The final output is then of course only one part of your full model of Reality.

At core, I buy the important concepts as the important concepts. As in, if I was using my own words for all this:

  1. AI progress continues, although a bit slower than we would have expected six months ago – progress since then has made a big practical difference, it’s kind of hard to imagine going back to models of even six months ago, but proper calibration means that can still be disappointing.

  2. In addition to scaling compute and data, AI itself is starting to accelerate the pace at which we can make algorithmic progress in AI. Right now that effect is real but modest, but we’re crossing critical thresholds where it starts to make a big difference, and this effect probably shouldn’t be considered part of the previous exponentials.

  3. The benefit of assigning tasks to AI starts to take off when you can reliably assign tasks for the AI without needing continuous human supervision, and now can treat those tasks as atomic actions not requiring state.

  4. If AI can take humans out of the effective loops in this research and work for more extended periods, watch the hell out (on many levels, but certainly in terms of capabilities and algorithmic progress.)

  5. Past a certain point where you can reliably do what one might call in-context atomic components, gaining the robustness and covering the gaps necessary to do this more reliably starts to get easier rather than harder, relative to the standard exponential curves.

  6. This could easily ‘go all the way’ to SC (and then quickly to full ASI) although we don’t know that it does. This is another uncertainty point, also note that AI 2027 as written very much involves waiting for various physical development steps.

  7. Thus, without making any claims about what the pace of all this is (and my guess is it is slower than they think it is, and also highly uncertain), the Baseline Scenario very much looks like AI 2027, but there’s a lot of probability mass also on other scenarios.

  8. One then has to ask what happens after you get this ‘superhuman coder’ or otherwise get ASI-like things of various types.

Which all adds up to me saying that I agree with Eli that none of the criticisms raised here challenges, to me, the ultimate or fundamental findings, only the price. The price is of course what we are here to talk about, so that is highly valuable even within relatively narrow bands (2028 is very different from 2029 because of reasons, and 2035 is rather different from that, and so on).

I realize that none of this is the kind of precision that lets you land on the moon.

The explanation for all this is right there: This is a physicist, holding forecasting of AI timelines to the standards of physics models. Well, yeah, you’re not going to be happy. If you try to use this to land on the moon, you will almost certainly miss the moon, the same way that if you try to use current alignment techniques on a superintelligence, you will almost certainly miss and then you will die.

One of the AI 2027 authors joked to me in the comments on a recent article that “you may not like it but it’s what peak AI forecasting performance looks like”.

Well, I don’t like it, and if this truly is “peak forecasting”, then perhaps forecasting should not be taken very seriously.

Maybe this is because I am a physicist, not a Rationalist. In my world, you generally want models to have strong conceptual justifications or empirical validation with existing data before you go making decisions based off their predictions: this fails at both.

Yes, in the world of physics, things work very differently, and we have much more accurate and better models. If you want physics-level accuracy in your predictions of anything that involves interactions of humans, well, sorry, tough luck. And presumably everyone agrees that you can’t have a physics-quality model here and that no one is claiming to have one? So what’s the issue?

The issue is whether basing decisions on modeling attempts like this is better than basing them on ‘I made it up’ or not having probabilities and projections at all and vibing the damn thing.

What I’m most against is people taking shoddy toy models seriously and basing life decisions on them, as I have seen happen for AI 2027.

I am not going to propose an alternate model. If I tried to read the tea leaves of the AI future, it would probably also be very shaky. There are a few things I am confident of, such as a software-only singularity not working and that there will be no diamondoid bacteria anytime soon. But these beliefs are hard to turn into precise yearly forecasts, and I think doing so will only cement overconfidence and leave people blindsided when reality turns out even weirder than you imagined.

Why is this person confident the software-only singularity won’t work? This post does not say. You’d have to read their substack, I assume it’s there.

The forecast here is ‘precise’ in the sense that it has a median, and we have informed people of that median. It is not ‘precise’ in the sense of putting a lot of probability mass on that particular median, even as an entire year, or even in the sense that the estimate wouldn’t change with more work or better data. It is precise in the sense that, yes, Bayes Rule is a thing, and you have to have a probability distribution, and it’s a lot more useful to share it than not share it.

I do find that the AI 2027 arguments updated me modestly towards a faster distribution of potential outcomes. I find 2027 to be a totally plausible time for SC to happen, although my median would be substantially longer.

You can’t ‘not base life decisions’ on information until it crosses some (higher than this) robustness threshold. Or I mean you can, but it will not go great.

In conclusion, I once again thank Titotal for the excellent substance of this critique, and wish it had come with better overall framing.




Google’s new robotics AI can run without the cloud and still tie your shoes

We sometimes call chatbots like Gemini and ChatGPT “robots,” but generative AI is also playing a growing role in real, physical robots. After announcing Gemini Robotics earlier this year, Google DeepMind has now revealed a new on-device VLA (vision language action) model to control robots. Unlike the previous release, there’s no cloud component, allowing robots to operate with full autonomy.

Carolina Parada, head of robotics at Google DeepMind, says this approach to AI robotics could make robots more reliable in challenging situations. This is also the first version of Google’s robotics model that developers can tune for their specific uses.

Robotics is a unique problem for AI because the robot not only exists in the physical world, it also changes its environment. Whether you’re having it move blocks around or tie your shoes, it’s hard to predict every eventuality a robot might encounter. The traditional approach of training a robot on actions with reinforcement learning was very slow, but generative AI allows for much greater generalization.

“It’s drawing from Gemini’s multimodal world understanding in order to do a completely new task,” explains Carolina Parada. “What that enables is in that same way Gemini can produce text, write poetry, just summarize an article, you can also write code, and you can also generate images. It also can generate robot actions.”

General robots, no cloud needed

In the previous Gemini Robotics release (which is still the “best” version of Google’s robotics tech), the platforms ran a hybrid system with a small model on the robot and a larger one running in the cloud. You’ve probably watched chatbots “think” for measurable seconds as they generate an output, but robots need to react quickly. If you tell the robot to pick up and move an object, you don’t want it to pause while each step is generated. The local model allows quick adaptation, while the server-based model can help with complex reasoning tasks. Google DeepMind is now unleashing the local model as a standalone VLA, and it’s surprisingly robust.



Sailing the fjords like the Vikings yields unexpected insights


“On we sweep with threshing oar”

Greer Jarrett has identified four possible small ports, or “havens,” used by Vikings along the Norwegian coast.

Experimental archaeologist Greer Jarrett of Lund University in Sweden has been sailing in the footsteps of Vikings for the last three years.

If you want to learn more about how and where the Vikings sailed, making the journey through the fjords yourself in replica boats is a practical, hands-on approach to achieving that end. Greer Jarrett, an archaeologist at Lund University in Sweden, has spent the last three years doing just that, sailing more than 5,000 kilometers along known Viking trade routes in open, spare-rigged clinker boats similar to those used by the Vikings.

Not only has Jarrett learned a great deal about the boats themselves, he also identified four possible havens along the Norwegian coast, part of what may have been a decentralized network that played a crucial role in trade and travel during that period. And those ports are located farther out to sea than other major ports and hubs known to date, according to a paper he published in the Journal of Archaeological Method and Theory.

It’s just the latest intriguing discovery enabled by the growing field of experimental archaeology, whereby researchers seek to reverse-engineer all manner of ancient technologies. Experimental archaeologists have, for instance, built their own versions of Early Upper Paleolithic adzes, axes, and chisels. The resulting fractures and wear enabled them to develop new criteria for identifying the likely functions of ancient tools. Others have tried to cook like the Neanderthals, concluding that flint flakes were surprisingly effective for butchering birds, and that roasting the birds damages the bones to such an extent that it’s unlikely they would be preserved in the archaeological record.

Kent State University’s Metin Eren has done practical experiments to study, for instance, the trajectories of atlatls attached to spears tipped with replica Clovis points, and how their performance compares to javelins used by Neanderthals. He even fashioned rudimentary blades out of his own frozen feces to test whether they could cut through pig hide, muscle, and tendon—solely to test a famous anthropological legend about an elderly Inuit man in the 1950s who purportedly did the same to kill and skin a dog, using its rib cage as a makeshift sled to venture off into the Arctic. (It did not work, so myth: busted. But it did snag Eren an Ig Nobel prize.)

Taking a hands-on, experimental archaeological approach to studying the Vikings makes sense in light of the dearth of contemporary written sources. “We have a few things written by outsiders, but there’s very, very few accounts written or delivered by people from Scandinavia during that period,” Jarrett told Ars. “We normally rely on indirect forms of evidence, be that genetics or archaeology or linguistics, which show strong, very frequent connections across maritime areas in the North Atlantic. But because traveling by boat is kind of an archaeologically invisible act, you don’t leave any footprints. So we have very little information about the voyages between these points.”


The sailing voyages made by Greer Jarrett during the research project, as well as the four possible Viking harbors he identified. Credit: Greer Jarrett

Jarrett and his crew used four or five different replica boats for their test voyages. Most were built by volunteers, enthusiasts, or students Jarrett had met during his considerable time in the field. They then sailed along the west coast of the Scandinavian Peninsula, a core area of Viking seafaring.

“These are reconstructions of traditional Norwegian boats from the 1800s and early 1900s,” said Jarrett. “My idea was, because of this really long-term continuity in traditional boat building practices, especially in Norway, it might be possible to use these later boats which have lots of similarities to try and work out the potentials of where people might have gotten out. It’s the idea of suggesting potentials based on practical experience to try and join those dots between the different evidence we have across the Viking world.”

That decision has led to some criticism from colleagues because of the enormous gap in time, but Jarrett defends his choice. “The Viking Age ends in the 11th century, and we’re talking about boats from 800 years later,” he said. “But the construction techniques and the way they are rigged and their general performance characteristics are similar enough. Because this is a project about voyages and not a project about boat building, it seemed like a defensible analogy.”

Seeking safe harbor

“On the long-range voyages, we worked in watches of four hours on and four hours off, and that is just about long enough to get some sleep on your off watch, but also just about short enough that you don’t get really, really, really cold, which is obviously a risk,” said Jarrett. “It was manageable, but we looked like penguins. I mean, we’re wearing six layers of wool at any time and sleeping all stacked together for warmth. But other times it’s really nice. The spring and the autumn in Scandinavia, there’s much more likelihood of high-pressure cycles, which means that it’s clearer and sunnier than in the summer itself.”

Nonetheless, there were some rough moments, such as when the mast spar holding up the mainsail snapped, forcing the crew to improvise and lash two oars together to hold the sail so they could continue their journey. It took several days to repair the boat so it could sail again. There was no safety boat following along in case the crew got into trouble, and no engine, although they did have a life raft, which the crew has yet to use.

Based on his sailing trials, Jarrett believes that the Vikings had no need for navigational tools like maps, a compass, or a sextant, relying instead on what he calls “mental maps”—or a “maritime cultural mindscape”—based on sailors’ memories and experiences passed down orally through generations. Those maps might also be informed by the myths linked to well-known coastal landmarks, such as skerries, small islets, or reefs.

“People had been moving by boat along the west coast of Scandinavia for a really, really, really long time, probably since the late Neolithic, if not earlier—thousands of years before the Viking age,” said Jarrett. “There are big trading networks in place beforehand, and that is reflected in the names, place names along the west coast. My primary argument is if you spend 3,000 years traveling up and down a coastline in which you can use the coast at all times for navigation, then it’s unnecessary to develop instrumentation.”

“Instruments are used when you are in a place out in the open sea that you don’t know,” Jarrett continued. “We definitely know they didn’t have compasses because those don’t arrive from China until the 1200s. There are these ideas about sunstones and sundials, or little sun compasses, which are entirely possible. But there’s no legitimate proof of either of them archaeologically yet. I may well be proved wrong if we find them at some point, but I don’t think they’re necessary for this at all.”

Based on the sailing trials, archaeological and documentary evidence of Viking Age maritime centers, and digital reconstructions of past sea levels, Jarrett was able to develop a useful set of criteria for evaluating potential havens. For instance, the site should be reachable in low visibility, with land or sea marks that sailors could use as bearings; be large enough to accommodate multiple vessels of at least the size of a fyring (which can house a crew of four to 10 people); offer good protection from sea swell and storm surges; and have access to fresh water, among other criteria. Four sites scored sufficiently high by those criteria to qualify as possible Viking havens.

The four sites are Smørhamn, located at the confluence of Oldersund and the Frøysjø, where an inn and trading post are known to have existed since at least the late 17th century; the archipelago of Sørøyane between Stad and Ålesund, near where the sea battle of Hjörungavágr was fought circa 986 CE; Bjørnsund, a number of small islands off the southwestern tip of Hustadvika; and the island of Storfosna, which appears on 16th and 17th century charts.
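As a purely hypothetical illustration of how a checklist like the one above can be turned into a score, the sketch below encodes a few of the criteria and counts how many a candidate site satisfies. The criterion names are paraphrased from the article; the example site, the equal weighting, and the qualifying threshold are my own assumptions and do not come from Jarrett's paper.

```python
# Hypothetical scoring sketch for candidate Viking havens. Criterion names are
# paraphrased from the article; the example site and threshold are invented.
CRITERIA = [
    "reachable_in_low_visibility",    # land or sea marks usable as bearings
    "room_for_several_fyring_boats",  # multiple vessels, crews of 4-10 each
    "sheltered_from_swell_and_surge",
    "fresh_water_available",
]

def haven_score(site: dict) -> int:
    """Count how many of the listed criteria a candidate site satisfies."""
    return sum(1 for criterion in CRITERIA if site.get(criterion, False))

candidate = {
    "name": "example skerry anchorage",
    "reachable_in_low_visibility": True,
    "room_for_several_fyring_boats": True,
    "sheltered_from_swell_and_surge": True,
    "fresh_water_available": False,
}

# A site would count as a "possible haven" only above some threshold, chosen
# arbitrarily here as satisfying at least three of the four criteria.
print(candidate["name"], haven_score(candidate), haven_score(candidate) >= 3)
```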

“I’m not saying, ‘This is where they went,'” said Jarrett. “I’m saying that, with these kinds of boats under these conditions, it would be possible to go to these places. And it’s much more difficult—not impossible, but much more difficult—to go to these other places or to sail in these other conditions.”

Pining for the fjords

The next step is for Jarrett and other archaeologists to hunt for evidence in support of his hypothesis. “Most of these sites have never been excavated,” said Jarrett. “There’s been a long assumption that these are landing places with the idea that you are dragging your boat ashore. I’m very opposed to that idea because these are two-and-a-half-ton boats, let alone the cargo. Unless you have a team of oxen and 20 people at your command, there is no way you’re getting them on the beach. I’m very convinced that these places have jetties and mooring posts likely preserved underwater. All of that organic material survives much better underwater than it does on land. So I think that’s very possible.”

They might also find smaller items suggestive of a thriving harbor community. “Whenever you go into land, you’ve got something that’s broken, so you need to do repairs,” said Jarrett. “So things like clink nails or piles of ballast stones or signs of smithing—the typical kind of things you’d use for repairing your ship, I think are possible to find.” Jarrett’s methodology might also prove useful for studying other seafaring communities.

The practical experience of sailing the same seas as the Vikings naturally led to some surprising insights. “You are able to ask very different questions the minute you walk away from your desk and get on a boat,” said Jarrett. “I think it’s essential to do that because you think in new ways. In terms of the results themselves, the boats are extremely seaworthy crafts. When you get in them for the first time, you don’t think that, because they’re very, very light. They feel very flimsy, and they’re very low in the water compared to a modern sailing boat. So you feel really in touch with the wave, which is kind of scary. But because they’re so flexible and because of the way they’re rigged, they’re actually really stable, even in big waves.”

“We kept going out thinking, ‘Oh, this is maybe the limit of what this boat can tolerate,’ and then it would be fine, and we’d be, ‘Okay, let’s go a little bit in slightly bigger waves with slightly stronger wind,'” Jarrett continued. “So I think our comfort zones definitely visibly expanded during that period. And I had the chance to work with the same crews over three years. By the end of those three years, we were doing stuff that we would never have been able to do at the beginning.”

Another big difference from modern boats, Jarrett discovered, is that one cannot sail a traditional Viking craft alone. “It has to be a collaborative effort because of how you need a person at the front and the back of the boat basically at all times,” he said. “So developing the crew together and gaining not only skills, but also trust between us meant that we could do things in 2024 that seemed completely insane just a couple of years earlier. I cannot imagine what that is like if you have an entire lifetime of Viking sailors working together for 30 years. It must be an incredible way of creating social bonds.”

DOI: Journal of Archaeological Method and Theory, 2025. 10.1007/s10816-025-09708-6.


Jennifer is a senior writer at Ars Technica with a particular focus on where science meets culture, covering everything from physics and related interdisciplinary topics to her favorite films and TV series. Jennifer lives in Baltimore with her spouse, physicist Sean M. Carroll, and their two cats, Ariel and Caliban.


how-a-grad-student-got-lhc-data-to-play-nice-with-quantum-interference

How a grad student got LHC data to play nice with quantum interference


New approach is already having an impact on the experiment’s plans for future work.

The ATLAS particle detector of the Large Hadron Collider (LHC) at the European Nuclear Research Center (CERN) in Geneva, Switzerland. Credit: EThamPhoto/Getty Images


Measurements at the Large Hadron Collider have been stymied by one of the most central phenomena of the quantum world. But now, a young researcher has championed a new method to solve the problem using deep neural networks.

The Large Hadron Collider is one of the biggest experiments in history, but it’s also one of the hardest to interpret. Unlike seeing an image of a star in a telescope, saying anything at all about the data that comes out of the LHC requires careful statistical modeling.

“If you gave me a theory [that] the Higgs boson is this way or that way, I think people imagine, ‘Hey, you built the experiment, you should be able to tell me what you’re going to see under various hypotheses!’” said Daniel Whiteson, a professor at the University of California, Irvine. “But we don’t.”

One challenge with interpreting LHC data is interference, a core implication of quantum mechanics. Interference allows two possible events to inhibit each other, weakening the likelihood of seeing the result of either. In the presence of interference, physicists needed to use a fuzzier statistical method to analyze data, losing the data’s full power and increasing its uncertainty.

However, a recent breakthrough suggests a different way to tackle the problem. The ATLAS collaboration, one of two groups studying proton collisions at the LHC, released two papers last December that describe new ways of exploring data from their detector. One describes how to use a machine learning technique called Neural Simulation-Based Inference to maximize the potential of particle physics data. The other demonstrates its effectiveness with the ultimate test: re-doing a previous analysis with the new technique and seeing dramatic improvement.

The papers are the culmination of a young researcher’s six-year quest to convince the collaboration of the value of the new technique. Its success is already having an impact on the experiment’s plans for future work.

Making sense out of fusing bosons

Each particle collision at the LHC involves many possible pathways in which different particles combine to give rise to the spray of debris that experimenters see. In 2017, David Rousseau at IJCLab in Orsay, a member of the ATLAS collaboration, asked one of his students, Aishik Ghosh, to improve his team’s ability to detect a specific pathway. That particular pathway is quite important since it’s used to measure properties of the Higgs boson, a particle (first measured in 2012) that helps explain the mass of all other fundamental particles.

It was a pretty big ask. “When a grad student gets started in ATLAS, they’re a tiny cog in a giant, well-oiled machine of 3,500 physicists, who all seem to know exactly what they’re doing,” said Ghosh.

The pathway Ghosh was asked to study occurs via several steps. First, the two colliding protons each emit a W boson, a particle associated with the weak nuclear force. These two bosons fuse together, changing their identity to form a Higgs boson. The Higgs boson then decays, forming a pair of Z bosons, another particle associated with the weak force. Finally, those Z bosons themselves each decay into a lepton, like an electron, and its antimatter partner, like a positron.

A Feynman diagram for the pathway studied by Aishik Ghosh. Credit: ATLAS

Measurements like the one Ghosh was studying are a key way of investigating the properties of the Higgs boson. By precisely measuring how long it takes the Higgs boson to decay, physicists could find evidence of it interacting with new, undiscovered particles that are too massive for the LHC to produce directly.

Ghosh started on the project, hoping to find a small improvement in the collaboration’s well-tested methods. Instead, he noticed a larger issue. The goal he was given, of detecting a single pathway by itself, didn’t actually make sense.

“I was doing that and I realized, ‘What am I doing?’ There’s no clear objective,” said Ghosh.

The problem was quantum interference.

How quantum histories interfere

One of the most famous demonstrations of the mysterious nature of quantum mechanics is called the double-slit experiment. In this demonstration, electrons are shot through a screen with two slits that allow them to pass through to a photographic plate on the other side. With one slit covered, the electrons form a pattern centered on the opening: the photographic plate lights up brightest directly across from the slit and dims farther away from it.

With both slits open, you would expect the pattern to get brighter as more electrons reach the photographic plate. Instead, the effect varies. The two slits do not give rise to two nice bright peaks; instead, you see a rippling pattern in which some areas get brighter while others get dimmer, even though the dimmer areas should, in principle, be easier for electrons to reach.

The effect happens even if the electrons are shot at the screen one by one to stop them from influencing each other directly. It’s as if each electron carries with it two possible histories, one in which it goes through one slit and another where it goes through the other before both end up at the same place. These two histories interfere with each other so that some destinations become less likely instead of more likely.

Results of the double-slit experiment. Credit: Jordgette (CC BY-SA 3.0)

For electrons in the double-slit experiment, the two different histories are two different paths through space. For a measurement at the Large Hadron Collider, the histories are more abstract—paths that lead through transformations of fields. One history might be like the pathway Ghosh was asked to study, in which two W bosons fuse to form a Higgs boson before the Higgs boson splits into two Z bosons. But in another history, the two W bosons might fuse and immediately split into two Z bosons without ever producing a Higgs.

Both histories have the same beginning, with two W bosons, and the same end, with two Z bosons. And just as the two histories of electrons in the double-slit experiment can interfere, so can the two histories for these particles.

Another possible history for colliding particles at the Large Hadron Collider, which interferes with the measurement Ghosh was asked to do. Credit: ATLAS

That interference makes the effect of the Higgs boson much more challenging to spot. ATLAS scientists wanted to look for two pairs of electrons and positrons, which would provide evidence that two Z bosons were produced. They would classify their observations into two types: observations that are evidence for the signal they were looking for (that of a decaying Higgs boson) and observations of events that generate this pattern of particles without the Higgs boson acting as an intermediate (the latter are called the background). But the two types of observations, signal and background, interfere. With a stronger signal, corresponding to more Higgs bosons decaying, you might observe more pairs of electrons and positrons… but if these events interfere, you also might see those pairs disappear.
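To make the problem concrete in the simplest possible notation (my shorthand, not anything from the ATLAS papers): quantum mechanics adds amplitudes before squaring, so the probability of seeing a given final state $x$ goes as

$$P(x) \propto \left|A_{\text{signal}}(x) + A_{\text{background}}(x)\right|^{2} = |A_{\text{signal}}(x)|^{2} + |A_{\text{background}}(x)|^{2} + 2\,\mathrm{Re}\!\left[A_{\text{signal}}(x)\,A_{\text{background}}^{*}(x)\right].$$

The cross term can be negative, which is why strengthening the signal can make some combinations of electrons and positrons rarer rather than more common.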

Learning to infer

In traditional approaches, those disappearances are hard to cope with, even when using methods that already incorporate machine learning.

One of the most common uses of machine learning is classification—for example, distinguishing between pictures of dogs and cats. You train the machine on pictures of cats and pictures of dogs, and it tells you, given a picture, which animal is the most likely match. Physicists at the LHC were already using this kind of classification method to characterize the products of collisions, but it functions much worse when interference is involved.

“If you have something that disappears, you don’t quite know what to train on,” said David Rousseau. “Usually, you’re training signal versus background, exactly like you’re training cats versus dogs. When there is something that disappears, you don’t see what you trained on.”

At first, Ghosh tried a few simple tricks, but as time went on, he realized he needed to make a more fundamental change. He reached out to others in the community and learned about a method called Neural Simulation-Based Inference, or NSBI.

In older approaches, people had trained machine learning models to classify observations into signal and background, using simulations of particle collisions to make the training data. Then they used that classification to infer the most likely value of a number, like the amount of time it takes a Higgs boson to decay, based on data from an actual experiment. Neural Simulation-Based Inference skips the classification and goes directly to the inference.

Instead of trying to classify observations into signal and background, NSBI uses simulations to teach an artificial neural network to guess a formula called a likelihood ratio. Someone using NSBI would run several simulations that describe different situations, such as letting the Higgs boson decay at different rates, and then check how many of each type of simulation yielded a specific observation. The fraction of these simulations with a certain decay rate would provide the likelihood ratio, a method for inferring which decay rate is more likely given experimental evidence. If the neural network is good at guessing this ratio, it will be good at finding how long the Higgs takes to decay.

Because NSBI doesn’t try to classify observations into different categories, it handles quantum interference more effectively. Instead of trying to find the Higgs based on a signal that disappears, it examines all the data, trying to guess which decay time is the most likely.
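For readers who want to see the core trick in miniature, here is a minimal sketch of the likelihood-ratio idea behind simulation-based inference, with a toy one-dimensional “simulator” and an off-the-shelf logistic regression standing in for ATLAS’s neural networks. The toy distribution, the two hypothesis values, and the sample sizes are illustrative assumptions, not anything from the collaboration’s analysis.

```python
# Minimal sketch of the likelihood-ratio trick behind neural simulation-based
# inference, on a toy problem. Nothing here is ATLAS code; the "simulator" is a
# one-parameter Gaussian chosen purely for illustration.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

def simulate(theta, n):
    """Toy simulator: an observable whose distribution shifts with theta."""
    return rng.normal(loc=theta, scale=1.0, size=(n, 1))

theta0, theta1 = 0.0, 0.5          # two hypotheses (e.g., two decay-rate values)
x0 = simulate(theta0, 50_000)      # simulated events under hypothesis 0
x1 = simulate(theta1, 50_000)      # simulated events under hypothesis 1

# Train a classifier to tell the two simulated datasets apart.
X = np.vstack([x0, x1])
y = np.concatenate([np.zeros(len(x0)), np.ones(len(x1))])
clf = LogisticRegression().fit(X, y)

def log_likelihood_ratio(x):
    """For a calibrated classifier, s/(1-s) approximates p(x|theta1)/p(x|theta0)."""
    s = clf.predict_proba(x)[:, 1]
    return np.log(s) - np.log(1.0 - s)

# "Observed" data (here secretly drawn under theta1) is scored by summing the
# per-event log likelihood ratios; a positive total favors hypothesis 1.
x_obs = simulate(theta1, 1_000)
print(log_likelihood_ratio(x_obs).sum())
```

The point is that the network is never asked to label any event as signal or background; it only learns how relatively likely each hypothesis is to have produced what was seen, which is exactly the quantity that survives when interference makes those categories blur.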

Ghosh tested the method, which showed promising results on test data, and presented the results at a conference in 2019. But if he was going to convince the ATLAS collaboration that the method was safe to use, he still had a lot of work ahead of him.

Shifting the weight on ATLAS’ shoulders

Experiments like ATLAS have high expectations attached to them. A collaboration of thousands of scientists, ATLAS needs to not only estimate the laws of physics but also have a clear idea of just how uncertain those estimates are. At the time, NSBI hadn’t been tested in that way.

“None of this has actually been used on data,” said Ghosh. “Nobody knew how to quantify the uncertainties. So you have a neural network that gives you a likelihood. You don’t know how good the likelihood is. Is it well-estimated? What if it’s wrongly estimated just in some weird corner? That would completely bias your results.”

Checking those corners was too big a job for a single PhD student and too complex to complete within a single PhD degree. Ghosh would have to build a team, and he would need time to build that team. That’s tricky in the academic world, where students go on to short-term postdoc jobs with the expectation that they quickly publish new results to improve their CV for the next position.

“We’re usually looking to publish the next paper within two to three years—no time to overhaul our methods,” said Ghosh. Fortunately, Ghosh had support. He received his PhD alongside Rousseau and went to work with Daniel Whiteson, who encouraged him to pursue his ambitious project.

“I think it’s really important that postdocs learn to take those risks because that’s what science is,” Whiteson said.

Ghosh gathered his team. Another student of Rousseau’s, Arnaud Maury, worked to calibrate the machine’s confidence in its answers. A professor at the University of Massachusetts, Rafael Coelho Lopes de Sa, joined the project. His student Jay Sandesara would have a key role in getting the calculation to work at full scale on a computer cluster. IJCLab emeritus RD Schaffer and University of Liège professor Gilles Loupe provided cross-checks and advice.

The team wanted a clear demonstration that their method worked, so they took an unusual step. They took data that ATLAS had already analyzed and performed a full analysis using their method instead, showing that it could pass every check the collaboration could think of. They would publish two papers, one describing the method and the other giving the results of their upgraded analysis. Zach Marshall, who was the computing coordinator for ATLAS at the time, helped get the papers through, ensuring that they were vetted by experts in multiple areas.

“It was a very small subset of our community that had that overlap between this technical understanding and the physics analysis experience and understanding that were capable of really speaking to whether that paper was sufficient and intelligible and useful. So we really had to make sure that we engaged that little group of humans by name,” said Marshall.

The new method showed significant improvements, getting a much more precise result than the collaboration’s previous analysis. That improvement, and the thorough checks, persuaded ATLAS to use NSBI more broadly going forward. It will give them much more precision than they expected, using the Higgs boson to search for new particles and clarify our understanding of the quantum world. When ATLAS discusses its future plans, it makes projections of the precision it expects to reach in the future. But those plans are now being upended.

“One of the fun things about this method that Aishik pushed hard is each time it feels like now we do that projection—here’s how well we’ll do in 15 years—we absolutely crush those projections,” said Marshall. “So we are just now having to redo a set of projections because we matched our old projections for 15 years out already today. It’s a very fun problem to have.”


psyche-keeps-its-date-with-an-asteroid,-but-now-it’s-running-in-backup-mode

Psyche keeps its date with an asteroid, but now it’s running in backup mode

The spacecraft, built by Maxar Space Systems, will operate its electric thrusters for the equivalent of three months between now and November to keep the mission on track for arrival at asteroid Psyche in 2029.

“Through comprehensive testing and analysis, the team narrowed down the potential causes to a valve that may have malfunctioned in the primary line,” NASA said in a statement Friday. “The switch to the identical backup propellant line in late May restored full functionality to the propulsion system.”

The next waypoint on Psyche’s voyage will be a flyby of Mars in May 2026. Officials expect Psyche to keep that date, which is critical for using Mars’ gravity to slingshot the spacecraft deeper into the Solar System, eventually reaching the asteroid belt about four years from now.

NASA’s Psyche spacecraft takes a spiral path to the asteroid Psyche, as depicted in this graphic that shows the path from above the plane of the planets, labeled with key milestones of the prime mission. Credit: NASA/JPL-Caltech

At Psyche, the spacecraft will enter orbit and progressively move closer to the asteroid, using a suite of sensors to map its surface, measure its shape, mass, and gravity field, and determine its elemental composition. Observations through telescopes suggest Psyche is roughly 140 miles (226 kilometers) in diameter, or about the width of Massachusetts. But it’s likely not spherical in shape. Scientists describe its shape as more akin to a potato.

Potatoes come in lots of shapes, and researchers won’t know exactly what Psyche looks like until NASA’s asteroid explorer arrives in 2029. Psyche will be the first metallic, or M-type, asteroid visited by any spacecraft, and scientists are eager to study an object that’s largely made of metals—probably iron, nickel, and perhaps some rarer elements instead of rocky minerals.

With the Psyche spacecraft’s plasma thrusters back in action, these goals of NASA’s billion-dollar science mission remain achievable.

“The mission team’s dedication and systematic approach to this investigation exemplifies the best of NASA engineering,” said Bob Mase, Psyche project manager at  JPL, in a statement. “Their thorough diagnosis and recovery, using the backup system, demonstrates the value of robust spacecraft design and exceptional teamwork.”

But there’s still a lingering concern that whatever problem caused the valve to malfunction in the primary fuel line might also eventually affect the same kind of valve in the backup line.

“We are doing a lot of good proactive work around that possible issue,” wrote Lindy Elkins-Tanton, Psyche’s principal investigator at Arizona State University, in a post on X.


spacex’s-next-starship-just-blew-up-on-its-test-stand-in-south-texas

SpaceX’s next Starship just blew up on its test stand in South Texas


SpaceX had high hopes for Starship in 2025, but it’s been one setback after another.

A fireball erupts around SpaceX’s Starship rocket in South Texas late Wednesday night. Credit: LabPadre

SpaceX’s next Starship rocket exploded during a ground test in South Texas late Wednesday, dealing another blow to a program already struggling to overcome three consecutive failures in recent months.

The late-night explosion at SpaceX’s rocket development complex in Starbase, Texas, destroyed the bullet-shaped upper stage that was slated to launch on the next Starship test flight. The powerful blast set off fires around SpaceX’s Massey’s Test Site, located a few miles from the company’s Starship factory and launch pads.

Live streaming video from NASASpaceflight.com and LabPadre, media organizations with cameras positioned around Starbase, showed the 15-story-tall rocket burst into flames shortly after 11:00 pm local time (12:00 am EDT; 04:00 UTC). Local residents as far as 30 miles away reported seeing and feeling the blast.

SpaceX confirmed the Starship, numbered Ship 36 in the company’s inventory, “experienced a major anomaly” on a test stand as the vehicle prepared to ignite its six Raptor engines for a static fire test. These hold-down test-firings are typically one of the final milestones in a Starship launch campaign before SpaceX moves the rocket to the launch pad.

The explosion occurred as SpaceX finished up loading super-cold methane and liquid oxygen propellants into Starship in preparation for the static fire test. The company said the area around the test site was evacuated of all personnel, and everyone was safe and accounted for after the incident. Firefighters from the Brownsville Fire Department were dispatched to the scene.

“Our Starbase team is actively working to safe the test site and the immediate surrounding area in conjunction with local officials,” SpaceX posted on X. “There are no hazards to residents in surrounding communities, and we ask that individuals do not attempt to approach the area while safing operations continue.”

Picking up the pieces

Earlier Wednesday, just hours before the late-night explosion at Starbase, an advisory released by the Federal Aviation Administration showed SpaceX had set June 29 as a tentative launch date for the next Starship test flight. That won’t happen now, and it’s anyone’s guess when SpaceX will have another Starship ready to fly.

Massey’s Test Site, named for a gun range that once occupied the property, is situated on a bend in the Rio Grande River, just a few hundred feet from the Mexican border. The test site is currently the only place where SpaceX can put Starships through proof testing and static fire tests before declaring the rockets are ready to fly.

The extent of the damage to ground equipment at Massey’s was not immediately clear, so it’s too soon to say how long the test site will be out of commission. For now, though, the explosion leaves SpaceX without a facility to support preflight testing on Starships.

The videos embedded below come from NASASpaceflight.com and LabPadre, showing multiple angles of the Starship blast.

The explosion at Massey’s is a reminder of SpaceX’s rocky path to get Starship to this point in its development. In 2020 and 2021, SpaceX lost several Starship prototypes to problems during ground and flight testing. The visual of Ship 36 going up in flames harkens back to those previous explosions, along with the fiery demise of a Falcon 9 rocket on its launch pad in 2016 under circumstances similar to Wednesday night’s incident.

SpaceX has now launched nine full-scale Starship rockets since April 2023, and before the explosion, the company hoped to launch the 10th test flight later this month. Starship’s track record has been dreadful so far this year, with the rocket’s three most recent test flights ending prematurely. These setbacks followed a triumphant 2024, when SpaceX made clear progress on each successive Starship suborbital test flight, culminating in the first catch of the rocket’s massive Super Heavy booster with giant robotic arms on the launch pad tower.

Stacked together, the Super Heavy booster stage and Starship upper stage stand more than 400 feet tall, creating the largest rocket ever built. SpaceX has already flown a reused Super Heavy booster, and the company has designed Starship itself to be recoverable and reusable, too.

After last year’s accomplishments, SpaceX appeared to be on track for a full orbital flight, an attempt to catch and recover Starship itself, and an important in-space refueling demonstration in 2025. The refueling demo has officially slipped into 2026, and it’s questionable whether SpaceX will make enough progress in the coming months to attempt recovery of a ship before the end of this year.

A Super Heavy booster and Starship upper stage are seen in March at SpaceX’s launch pad in South Texas, before the ship was stacked atop the booster for flight. The Super Heavy booster for the next Starship flight completed its static fire test earlier this month. Credit: Brandon Bell/Getty Images

Ambition meets reality

SpaceX debuted an upgraded Starship design, called Version 2 or Block 2, on a test flight in January. It’s been one setback after another since then.

The new Starship design is slightly taller than the version of Starship that SpaceX flew in 2023 and 2024. It has an improved heat shield to better withstand the extreme heat of atmospheric reentry. SpaceX also installed a new fuel feed line system to route methane fuel to the ship’s Raptor engines, and an improved propulsion avionics module controlling the vehicle’s valves and reading sensors.

Despite, or perhaps because of, all of these changes for Starship Version 2, SpaceX has been unable to replicate the successes it achieved with Starship in the last two years. Ships launched on test flights in January and March spun out of control minutes after liftoff, scattering debris over the sea, and in at least one case, onto a car in the Turks and Caicos Islands.

SpaceX engineers concluded the January failure was likely caused by intense vibrations that triggered fuel leaks and fires in the ship’s engine compartment, causing an early shutdown of the rocket’s engines. Engineers said the vibrations were likely in resonance with the vehicle’s natural frequency, intensifying the shaking beyond the levels SpaceX predicted.

The March flight failed in similar fashion, but SpaceX’s investigators determined the most probable root cause was a hardware failure in one of the ship’s engines, a different failure mode than two months before.

During SpaceX’s most recent Starship test flight last month, the rocket completed the ascent phase of the mission as planned, seemingly overcoming the problems that plagued the prior two launches. But soon after the Raptor engines shut down, a fuel leak caused the ship to begin tumbling in space, preventing the vehicle from completing a guided reentry to test the performance of new heat shield materials.

File photo of a Starship static fire in May at Massey’s Test Site.

SpaceX is working on a third-generation Starship design, called Version 3, that the company says could be ready to fly by the end of this year. The upgraded Starship Version 3 design will be able to lift heavier cargo, up to 200 metric tons, into orbit thanks to larger propellant tanks and more powerful Raptor engines. Version 3 will also have the ability to refuel in low-Earth orbit.

Version 3 will presumably have permanent fixes to the problems currently slowing SpaceX’s pace of Starship development. And there are myriad issues for SpaceX’s engineers to solve, from engine reliability and the ship’s resonant frequency, to beefing up the ship’s heat shield and fixing its balky payload bay door.

Once officials solve these problems, it will be time for SpaceX to bring a Starship from low-Earth orbit back to the ground. Then, there’s more cool stuff on the books, like orbital refueling and missions to the Moon in partnership with NASA’s Artemis program. NASA has contracts worth more than $4 billion with SpaceX to develop a human-rated Starship that can land astronauts on the Moon and launch them safely back into space.

The Trump administration’s proposed budget for NASA would cancel the Artemis program’s ultra-expensive Space Launch System rocket and Orion crew capsule after two more flights, leaving commercial heavy-lifters to take over launching astronauts from the Earth to the Moon. SpaceX’s Starship, already on contract with NASA as a human-rated lander, may eventually win more government contracts to fill the role of SLS and Orion under Trump’s proposed budget. Other rockets, such as Blue Origin’s New Glenn, are also well-positioned to play a larger role in human space exploration.

NASA’s official schedule for the first Artemis crew landing on the Moon puts the mission some time in 2027, using SLS and Orion to transport astronauts out to the vicinity of the Moon to meet up with SpaceX’s Starship lunar lander. After that mission, known as Artemis III, NASA would pivot to using commercial rockets from Elon Musk’s SpaceX and Jeff Bezos’ Blue Origin to replace the Space Launch System.

Meanwhile, SpaceX’s founder and CEO has his sights set on Mars. Last month, Musk told his employees he wants to launch the first Starships toward the Red Planet in late 2026, when the positions of Earth and Mars in the Solar System make a direct journey possible. Optimistically, he would like to send people to Mars on Starships beginning in 2028.

All of these missions are predicated on SpaceX mastering routine Starship launch operations, rapid reuse of the ship and booster, and cryogenic refueling in orbit, along with adapting systems such as life support, communications, and deep space navigation for an interplanetary journey.

The to-do list is long for SpaceX’s Starship program—too long for Mars landings to seem realistic any time in the next few years. NASA’s schedule for the Artemis III lunar landing mission in 2027 is also tight, and not only because of Starship’s delays. The development of new spacesuits for astronauts to wear on the Moon may also put the Artemis III schedule at risk. NASA’s SLS rocket and Orion spacecraft have had significant delays throughout their history, so it’s not a sure thing they will be ready in 2027.

While it’s too soon to know the precise impact of Wednesday night’s explosion, we can say with some confidence that the chances of Starship meeting these audacious schedules are lower today than they were yesterday.


Stephen Clark is a space reporter at Ars Technica, covering private space companies and the world’s space agencies. Stephen writes about the nexus of technology, science, policy, and business on and off the planet.


senate-passes-genius-act—criticized-as-gifting-trump-ample-opportunity-to-grift

Senate passes GENIUS Act—criticized as gifting Trump ample opportunity to grift

“Why—beyond the obvious benefit of gaining favor, directly or indirectly, with the Trump administration—did you select USD1, a newly launched, untested cryptocurrency with no track record?” the senators asked.

Responding, World Liberty Financial’s lawyers claimed MGX was simply investing in “legitimate financial innovation,” CBS News reported, noting a Trump family-affiliated entity owns a 60 percent stake in the company.

Trump has denied any wrongdoing in the MGX deal, ABC News reported. However, Warren fears the GENIUS Act will provide “even more opportunities to reward buyers of Trump’s coins with favors like tariff exemptions, pardons, and government appointments” if it becomes law.

Although House supporters of the bill have reportedly promised to push it through so that Trump can sign it into law by July, the GENIUS Act is likely to face hurdles. Resistance may come not just from Democrats with ongoing concerns about Trump’s and future presidents’ potential conflicts of interest, but also from Republicans who think passing the bill is pointless without additional market regulations to drive more stablecoin adoption.

Dems: Opportunities for Trump grifts are “mind-boggling”

Although 18 Democrats helped the GENIUS Act pass in the Senate, most Democrats opposed the bill over concerns about Trump’s potential conflicts of interest, PBS News reported.

Merkley remains one of the staunchest opponents of the GENIUS Act. In a statement, he alleged that in passing the bill, the Senate was essentially “rubberstamping Trump’s crypto corruption.”

According to Merkley, he and other Democrats pushed to remove the exemption from the GENIUS Act before the Senate vote—hoping to add “strong anti-corruption measures.” But Senate Republicans “repeatedly blocked” his efforts to hold votes on anti-corruption measures. Instead, they “rammed through this fatally flawed legislation without considering any amendments on the Senate floor—despite promises of an open amendment process and debate before the American people,” Merkley said.

Ultimately, the bill passed with the exemption intact, which Merkley considered “profoundly corrupt.” He promised, “I will keep fighting to ban Trump-style crypto corruption to prevent the sale of government policy by elected federal officials in Congress and the White House.”


openai-weighs-“nuclear-option”-of-antitrust-complaint-against-microsoft

OpenAI weighs “nuclear option” of antitrust complaint against Microsoft

OpenAI executives have discussed filing an antitrust complaint with US regulators against Microsoft, the company’s largest investor, The Wall Street Journal reported Monday, marking a dramatic escalation in tensions between the two long-term AI partners. OpenAI, which develops ChatGPT, has reportedly considered seeking a federal regulatory review of the terms of its contract with Microsoft for potential antitrust law violations, according to people familiar with the matter.

The potential antitrust complaint would likely argue that Microsoft is using its dominant position in cloud services and contractual leverage to suppress competition, according to insiders who described it as a “nuclear option,” the WSJ reports.

The move could unravel one of the most important business partnerships in the AI industry—a relationship that started with a $1 billion investment by Microsoft in 2019 and has grown to include billions more in funding, along with Microsoft’s exclusive rights to host OpenAI models on its Azure cloud platform.

The friction centers on OpenAI’s efforts to transition from its current nonprofit structure into a public benefit corporation, a conversion that needs Microsoft’s approval to complete. The two companies have not been able to agree on details after months of negotiations, sources told Reuters. OpenAI’s existing for-profit arm would become a Delaware-based public benefit corporation under the proposed restructuring.

The companies are discussing revising the terms of Microsoft’s investment, including the future equity stake it will hold in OpenAI. According to The Information, OpenAI wants Microsoft to hold a 33 percent stake in a restructured unit in exchange for foregoing rights to future profits. The AI company also wants to modify existing clauses that give Microsoft exclusive rights to host OpenAI models in its cloud.


paramount-drops-trailer-for-the-naked-gun-reboot

Paramount drops trailer for The Naked Gun reboot

Liam Neeson stars as Lt. Frank Drebin Jr. in The Naked Gun.

Thirty years after the last film in The Naked Gun crime-spoof comedy franchise, we’re finally getting a new installment, The Naked Gun, described as a “legacy sequel.” And it’s Liam Neeson stepping into Leslie Nielsen’s fumbling shoes, playing that character’s son. Judging by the official trailer, Neeson is up to the task, showcasing his screwball comedy chops.

(Some spoilers for the first three films in the franchise below.)

The original Naked Gun: From the Files of Police Squad! debuted in 1988, with Leslie Nielsen starring as Detective Frank Drebin, trying to foil an assassination attempt on Queen Elizabeth II during her visit to the US. It proved successful enough to launch two sequels. Naked Gun 2-1/2: The Smell of Fear (1991) found Drebin battling an evil plan to kidnap a prominent nuclear scientist. Naked Gun 33-1/3: The Final Insult (1994) found Drebin coming out of retirement and going undercover to take down a crime syndicate planning to blow up the Academy Awards.

The franchise rather lost steam after that, but by 2013, Paramount was planning a reboot starring Ed Helms as “Frank Drebin, no relation.” David Zucker, who produced the prior Naked Gun films and directed the first two, declined to be involved, feeling it could only be “inferior” to his originals. He was briefly involved in the 2017 rewrites, which featured Frank’s son as a secret agent rather than a policeman, but that film never materialized either. The project was revived again in 2021 by Seth MacFarlane (without Zucker’s involvement), and Neeson was cast as Frank Drebin Jr.—a police lieutenant in this incarnation.

In addition to Neeson, the film stars Paul Walter Hauser as Captain Ed Hocken, Jr.—Hauser will also appear as Mole Man in the forthcoming Fantastic Four: First Steps—and Pamela Anderson as a sultry femme fatale named Beth. The cast also includes Kevin Durand, Danny Huston, Liza Koshy, Cody Rhodes, CCH Pounder, Busta Rhymes, and Eddy Yu.


founder-of-23andme-buys-back-company-out-of-bankruptcy-auction

Founder of 23andMe buys back company out of bankruptcy auction

TTAM’s winning offer requires judicial approval, and a court hearing to approve the bid is set for next week.

Several US states have filed objections or lawsuits with the court expressing concerns about the transfer of customers’ genetic data to a new company, though those may now be moot because of Wojcicki’s continued involvement.

An expert hired by the court to review data privacy concerns over a sale of 23andMe submitted a report on Wednesday that noted Wojcicki had been chief executive when a 2023 data breach compromised 7 million customer accounts. Litigation over the breach continues, although that liability remains with the bankruptcy estate to be paid off with the proceeds from the winning bid.

Wojcicki was once married to Google co-founder Sergey Brin. 23andMe went public in 2021 through a merger with a blank cheque vehicle sponsored by Richard Branson, quickly reaching a market cap of nearly $6 billion.

The company has been plagued by years of falling revenue as it was unable to grow beyond its genetic testing business, in which customers sent saliva samples in to be analyzed for medical conditions and family genealogy.

Wojcicki had bid 40 cents a share to acquire the company prior to the bankruptcy filing.

Shares of 23andMe, which now trade over the counter, have rocketed to $5.49 on the belief the company will stage a recovery after settling the litigation.

© 2025 The Financial Times Ltd. All rights reserved. Not to be redistributed, copied, or modified in any way.


trump’s-ftc-may-impose-merger-condition-that-forbids-advertising-boycotts

Trump’s FTC may impose merger condition that forbids advertising boycotts

FTC chair alleged “serious risk” from ad boycotts

After Musk’s purchase of Twitter, the social network lost advertisers for various reasons, including changes to content moderation and an incident in which Musk posted a favorable response to an antisemitic tweet and then told concerned advertisers to “go fuck yourself.”

FTC Chairman Andrew Ferguson said at a conference in April that “the risk of an advertiser boycott is a pretty serious risk to the free exchange of ideas.”

“If advertisers get into a back room and agree, ‘We aren’t going to put our stuff next to this guy or woman or his or her ideas,’ that is a form of concerted refusal to deal,” Ferguson said. “The antitrust laws condemn concerted refusals to deal. Now, of course, because of the First Amendment, we don’t have a categorical antitrust prohibition on boycotts. When a boycott ceases to be economic for purposes of the antitrust laws and becomes purely First Amendment activity, the courts have not been super clear—[it’s] sort of a ‘we know it when we see it’ type of thing.”

The FTC website says that any individual company acting on its own may “refuse to do business with another firm, but an agreement among competitors not to do business with targeted individuals or businesses may be an illegal boycott, especially if the group of competitors working together has market power.” The examples given on the FTC webpage are mostly about price competition and do not address the widespread practice of companies choosing where to place advertising based on concerns about their brands.

We contacted the FTC about the merger review today and will update this article if it provides any comment.

X’s ad lawsuit

X’s lawsuit targets a World Federation of Advertisers initiative called the Global Alliance for Responsible Media (GARM), a now-defunct program that Omnicom and Interpublic participated in. X itself was part of the GARM initiative, which shut down after X filed the lawsuit. X alleged that the defendants conspired “to collectively withhold billions of dollars in advertising revenue.”

The World Federation of Advertisers said in a court filing last month that GARM was founded “to bring clarity and transparency to disparate definitions and understandings in advertising and brand safety in the context of social media. For example, certain advertisers did not want platforms to advertise their brands alongside content that could negatively impact their brands.”


there’s-another-leak-on-the-iss,-but-nasa-is-not-saying-much-about-it

There’s another leak on the ISS, but NASA is not saying much about it

No one is certain. The best guess is that the seals on the hatch leading to the PrK module are, in some way, leaking. In this scenario, pressure from the station is feeding the leak inside the PrK module through these seals, leading to a stable pressure inside—making it appear as though the PrK module leaks are fully repaired.

At this point, NASA is monitoring the ongoing leak and preparing for any possibility. A senior industry source told Ars that the NASA leadership of the space station program is “worried” about the leak and its implications.

This is one reason the space agency delayed the launch of a commercial mission carrying four astronauts to the space station, Axiom-4, on Thursday.

“The postponement of Axiom Mission 4 provides additional time for NASA and Roscosmos to evaluate the situation and determine whether any additional troubleshooting is necessary,” NASA said in a statement. “A new launch date for the fourth private astronaut mission will be provided once available.”

One source indicated that the new tentative launch date is now June 18. However, this will depend on whatever resolution there is to the leak issue.

What’s the worst that could happen?

The worst-case scenario for the space station is that the ongoing leaks are a harbinger of a phenomenon known as “high cycle fatigue,” which affects metal, including aluminum. Consider that if you bend a metal clothes hanger once, it bends. But if you bend it back and forth multiple times, it will snap. This is because, as the metal fatigues, it hardens and eventually snaps. This happens suddenly and without warning, as was the case with an Aloha Airlines flight in 1988.

The concern is that some of these metal structures on board the station could fail quickly and catastrophically. Accordingly, in its previous assessments, NASA has classified the structural cracking issue on the space station as the highest level of concern on its 5×5 risk matrix, which gauges the likelihood and severity of risks to the station.
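For readers unfamiliar with the format, a 5×5 risk matrix scores an issue on likelihood and on severity of consequence, each from 1 to 5, and maps the pair to a risk band. The sketch below is a generic, hypothetical version of such a mapping; the band thresholds and the example scores are my assumptions, not NASA’s actual matrix or its actual scoring of the cracking issue.

```python
# Generic sketch of a 5x5 likelihood-by-consequence risk matrix. The band
# thresholds and example scores are illustrative assumptions, not NASA's.
def risk_band(likelihood: int, consequence: int) -> str:
    """Map a (likelihood, consequence) pair, each scored 1-5, to a coarse band."""
    if not (1 <= likelihood <= 5 and 1 <= consequence <= 5):
        raise ValueError("scores must be between 1 and 5")
    score = likelihood * consequence
    if score >= 15:
        return "highest concern"
    if score >= 6:
        return "elevated concern"
    return "acceptable"

# An issue judged both likely and catastrophic lands in the top band, which is
# where the article says NASA has placed the structural cracking risk.
print(risk_band(likelihood=4, consequence=5))  # -> "highest concern"
```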

In the meantime, the space agency has not been forthcoming with any additional information. Despite many questions from Ars Technica and other publications, NASA has not scheduled a press conference or said anything else publicly about the leaks beyond stating, “The crew aboard the International Space Station is safely conducting normal operations.”
