Author name: Shannon Garcia


AI #108: Straight Line on a Graph

The x-axis of the graph is time. The y-axis is the log of ‘how long a software engineering task AIs can reliably complete.’

The straight line says the answer doubles roughly every 7 months. Yikes.

Upcoming: The comment period on America’s AI strategy is over, so we can finish up by looking at Google’s and MIRI’s and IFP’s proposals, as well as Hollywood’s response to OpenAI and Google’s demands for unlimited uncompensated fair use exceptions from copyright during model training. I’m going to pull that out into its own post so it can be more easily referenced.

There’s also a draft report on frontier model risks from California and it’s… good?

Also upcoming: My take on OpenAI’s new future good-at-writing model.

  1. Language Models Offer Mundane Utility. I want to, is there an app for that?

  2. Language Models Don’t Offer Mundane Utility. Agents not quite ready yet.

  3. Huh, Upgrades. Anthropic efficiency gains, Google silently adds features.

  4. Seeking Deeply. The PRC gives DeepSeek more attention. That cuts both ways.

  5. Fun With Media Generation. Fun with Gemini 2.0 Image Generation.

  6. Gemma Goals. Hard to know exactly how good it really is.

  7. On Your Marks. Tic-Tac-Toe bench is only now getting properly saturated.

  8. Choose Your Fighter. o3-mini disappoints on Epoch retest on frontier math.

  9. Deepfaketown and Botpocalypse Soon. Don’t yet use the bot, also don’t be the bot.

  10. Copyright Confrontation. Removing watermarks has been a thing for a while.

  11. Get Involved. Anthropic, SaferAI, OpenPhil.

  12. In Other AI News. Sentience leaves everyone confused.

  13. Straight Lines on Graphs. METR finds reliable SWE task length doubling rapidly.

  14. Quiet Speculations. Various versions of takeoff.

  15. California Issues Reasonable Report. I did not expect that.

  16. The Quest for Sane Regulations. Mostly we’re trying to avoid steps backwards.

  17. The Week in Audio. Esben Kran, Stephanie Zhan.

  18. Rhetorical Innovation. Things are not improving.

  19. We’re Not So Different You and I. An actually really cool alignment idea.

  20. Anthropic Warns ASL-3 Approaches. Danger coming. We need better evaluations.

  21. Aligning a Smarter Than Human Intelligence is Difficult. It’s all happening.

  22. People Are Worried About AI Killing Everyone. Killing all other AIs, too.

  23. The Lighter Side. Not exactly next level prompting.

Arnold Kling spends 30 minutes trying to figure out how to leave a WhatsApp group, and requests an AI app to do things like this via the ‘I want to’ app, except that app exists and it’s called Claude (or ChatGPT) and this should have taken 1 minute tops? To be fair, Arnold then extends the idea to tasks where ‘actually click the buttons’ is more annoying and it makes more sense to have an agent do it for you rather than telling the human how to do it. That will take a bit longer, but not that much longer.

If you want your AI to interact with you in interesting ways in the Janus sense, you want to keep your interaction full of interesting things and stay far away from standard ‘assistant’ interactions, which have a very strong pull on what follows. If things go south, usually it’s better to start over or redo. With high skill you can sometimes do better, but it’s tough. Of course, if you don’t want that, carry on, but the principle of ‘if things go south don’t try to save it’ still largely applies, because you don’t want to extrapolate from the assistant messing up even on mundane tasks.

It’s a Wikipedia race between models! Start is Norwegian Sea, finish is Karaoke. GPT-4.5 clicks around for 47 pages before time runs out. CUA (used in OpenAI’s operator) clicks around, accidentally minimizes Firefox and can’t recover. o1 accidentally restarts the game, then sees a link to the Karaoke page there, declares victory and doesn’t mention that it cheated. Sonnet 3.7 starts out strong but then cheats via URL hacking, which works, and it declares victory. It’s not obvious to what extent it knew that broke the rules. They call this all a draw, which seems fair.

Kelsey Piper gets her hands on Manus.

Kelsey Piper: I got a Manus access code! Short review: We’re close to usable AI browser tools, but we’re not there yet. They’re going to completely change how we shop, and my best guess is they’ll do it next year, but they won’t do it at their current quality baseline.

The longer review is fun, and boils down to this type of agent being tantalizingly almost there, but with enough issues that it isn’t quite a net gain to use it. Below a certain threshold of reliability you’re better off doing it yourself.

Which will definitely change. My brief experience with Operator was similar. My guess is that it is indeed already a net win if you invest in getting good at using it, in a subset of tasks including some forms of shopping, but I haven’t felt motivated to pay those up front learning and data entry costs.

Anthropic updates their API to include prompt caching, simpler cache management, token-efficient tool use (average 14% reduction), and a text_editor tool.
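If you haven’t used prompt caching, the core pattern is marking a large reusable prefix so it gets cached across calls. A minimal sketch, assuming the Python SDK and a current Sonnet model alias (both details are my assumptions, not part of the announcement):

```python
# Minimal prompt caching sketch; model alias and file path are placeholders.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

response = client.messages.create(
    model="claude-3-7-sonnet-latest",  # assumed model alias
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": open("long_reference_doc.txt").read(),  # large, reused context
            "cache_control": {"type": "ephemeral"},  # mark this prefix as cacheable
        }
    ],
    messages=[{"role": "user", "content": "Summarize section 3 of the document."}],
)
print(response.content[0].text)
```

Later calls that reuse the same cached system block skip re-processing it, which is where the cost savings come from.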

OpenAI’s o1 and o3-mini now offer Python-powered data analysis in ChatGPT.

List of Gemini’s March 2025 upgrades.

The problem with your Google searches being context for Gemini 2.0 Thinking is that you still have to be doing Google searches.

Google AI Studio lets you paste in YouTube video links directly as context. That seems very convenient.

Baidu gives us Ernie 4.5 and x1, with free access, with claimed plans for open source ‘within a few months.’ Benchmarks look solid, and they claim x1 is ‘on par with r1’ for performance at only half the price. All things are possible, but given the track record chances are very high this is not as good as they claim it to be.

NotebookLM gets a few upgrades, especially moving to Gemini 2.0 Thinking, and in the replies Josh drops some hints on where things are headed.

Josh Woodward: Next batch of NotebookLM updates rolling out:

Even smarter answers, powered by Gemini 2.0 Thinking

See citations in your notes, not just in the Q&A (top request)

Customize the sources used for making your podcasts and notes (top request)

Much smoother scrolling for Q&A

Enjoy!

You’ve tried the Interactive Mode, right? That lets you call into the podcast and have a conversation.

On the voice reading out things, that could be interesting. We haven’t brought any audio stuff to the Chat / Q&A section yet…

We’re testing iOS and Android versions on the team right now, need to add some more features and squash some bugs, then we’ll ship it out!

Our first iteration of length control is under development now!

NotebookLM also rolls out interactive Mindmaps.

I’m very curious to see if these end up being useful, and if so who else copies them.

This definitely feels like a thing worth trying again. Now if I can automate adding all the data sources…

Let’s say you are the PRC. You witness DeepSeek leverage its cracked engineering culture to get a lot of performance out of remarkably little compute. They then publish the whole thing, including how they did it. A remarkable accomplishment, which the world then blows far out of proportion to what they did.

What would you do next? Double down on the open, exploratory, freewheeling ethos that brought them to this point, and pledge to help them take it all the way to AGI, as they intend?

They seem to have had other ideas.

Matt Sheehan: Fantastic reporting on how 🇨🇳 gov is getting more hands-on w/ DeepSeek by @JuroOsawa & @QianerLiu

-employees told not to travel, handing in passports

-investors must be screened by provincial government

-gov telling headhunters not to approach employees

Can you imagine if the United States did this to OpenAI?

It is remarkable how often when we are told we cannot do [X] because we will ‘lose to China’ if we do and they would do it, we find out China is already doing lots of [X].

Before r1, DeepSeek was the clear place to go as a cracked Chinese software engineer. Now, once you join, the PRC is reportedly telling you to give up your passport, watching your every move and telling headhunters to stay away. No thanks.

Notice that China is telling these folks to surrender their passports, at the same time that America is refusing to let in much of China’s software engineering and other talent. Why do you think PRC is making this decision?

Along similar lines, perhaps motivated by PRC and perhaps not, here is a report that DeepSeek is worried about people stealing their secrets before they have the chance to give those secrets away.

Daniel Eth: Who wants to tell them?

Peter Wildeford: “DeepSeek’s leaders have been worried about the possibility of information leaking”

“told employees not to discuss their work with outsiders”

Do DeepSeek leaders and the Chinese government know that DeepSeek has been open sourcing their ideas?

That isn’t inherently a crazy thing to worry about, even if mainly you are trying to get credit for things, and be first to publish them. Then again, how confident are you that DeepSeek will publish them, at this point? Going forward it seems likely their willingness to give away the secret sauce will steadily decline, especially in terms of their methods, now that PRC knows what that lab is capable of doing.

People are having a lot of fun with Gemini 2.0 Flash’s image generation, when it doesn’t flag your request for safety reasons.

Gemini Flash’s native image generation can do consistent gif animations?

Here are some fun examples:

Riley Goodside: POV: You’re already late for work and you haven’t even left home yet. You have no excuse. You snap a pic of today’s fit and open Gemini 2.0 Flash Experimental.

Meanwhile, Google’s refusals be refusing…

Also, did you know you can have it remove a watermark from an image, by explicitly saying ‘remove the watermark from this image’? Not that you couldn’t do this anyway, but that doesn’t stop them from refusing many other things.

What do we make of Gemma 3’s absurdly strong performance in Arena? I continue to view this as about half ‘Gemma 3 is probably really good for its size’ and half ‘Arena is getting less and less meaningful.’

Teortaxes thinks Gemma 3 is best in class, but will be tough to improve.

Teortaxes: the sad feeling I get from Gemma models, which chills all excitement, is that they’re «already as good as can be». It’s professionally cooked all around. They can be tuned a bit but won’t exceed the bracket Google intends for them – 1.5 generations behind the default Flash.

It’s good. Of course Google continues to not undermine their business and ship primarily conversational open models, but it’s genuinely the strongest in its weight class I think. Seams only show on legitimately hard tasks.

I notice the ‘Rs in strawberry’ test has moved on to gaslighting the model after a correct answer rather than the model getting it wrong. Which is a real weakness of such models, that you can bully and gaslight them, but how about not doing that.

Mark Schroder: Gemma 27b seems just a little bit better than mistral small 3, but the smaller versions seem GREAT for their size, even the 1b impresses, probably best 1b model atm (tried on iPhone)

Christian Schoppe is a fan of the 4B version for its size.

Box puts Gemma 3 to their test, saying it is a substantial improvement over Gemma 2 and better than Gemini 1.5 Flash on data extraction, although still clearly behind Gemini 2.0 Flash.

This does not offer the direct comparison we want most, which is to v3 and r1, but if you have two points (e.g. Gemma 2 and Gemini 2.0 Flash) then you can draw a line. Eyeballing this, they’re essentially saying Gemma 3 is 80%+ of the way from Gemma 2 to Gemini 2.0 Flash, while being fully open and extremely cheap.

Gemma 3 is an improvement over Gemma 2 on WeirdML but still not so great and nothing like what the Arena scores would suggest.

Campbell reports frustration with the fine tuning packages.

A rival released this week is Mistral Small 3.1. When you see a company pushing a graph that’s trying this hard, you should be deeply skeptical.

They do back this up with claims on other benchmarks, but I don’t have Mistral in my set of labs I trust not to game the benchmarks. Priors say this is no Gemma 3 until proven otherwise.

We have an update to the fun little Tic-Tac-Toe Bench, with Sonnet 3.7 Thinking as the new champion, making 100% optimal and valid moves at a cost of 20 cents a game, the first model to get to 100%. They expect o3-mini-high to also max out but don’t want to spend $50 to check.

o3-mini scores only 11% on Frontier Math when Epoch tests it, versus 32% when OpenAI tested it, and OpenAI’s test had suspiciously high scores on the hardest sections of the test relative to the easy sections.

Peter Wildeford via The Information shares some info about Manus. Anthropic charges about $2 per task, whereas Manus isn’t yet charging money. And in hindsight the reason why Manus is not targeted at China is obvious, any agent using Claude has to access stuff beyond the Great Firewall. Whoops!

The periodic question: where are all the new AI-enabled sophisticated scams? No one could point to any concrete example that isn’t both old and well-known at this point. There is clearly a rise in the amount of slop and phishing at the low end, my wife reports this happening recently at her business, but none of it is trying to be smart, and it isn’t using deepfake capabilities or highly personalized messages or similar vectors. Perhaps this is harder than we thought, or the people who fall for scams are already mostly going to fall for simple photoshop, and this is like how scammers introduce intentional errors in scam emails, so AI making them better would make them worse?

Ethan Mollick: I regret to announce that the meme Turing Test has been passed.

LLMs produce funnier memes than the average human, as judged by humans. Humans working with AI get no boost (a finding that is coming up often in AI-creativity work). The best human memers still beat AI, however.

[Paper here.]

Many of you are realizing that most people have terrible taste in memes.

In their examples of top memes, I notice that I thought the human ones were much, much better than the AI ones. They ‘felt right’ and resonated, the AI ones didn’t.

An important fact about memes is that, unless you are doing them inside a narrow context to comment on that particular context, only the long tail matters. Almost all ‘generalized’ memes are terrible. But yes, in general, ‘quick, human, now be creative!’ does not go so well, and AIs are on average able to do better already.

Another parallel: Frontier AIs are almost certainly better at improv than most humans, but they are still almost certainly worse than most improv performances, because the top humans do almost all of the improv.

No, friend, don’t!

Richard Ngo: Talked to a friend today who decided that if RLHF works on reasoning models, it should work on him too.

So he got a mechanical clicker to track whenever he has an unproductive chain of thought, and uses the count as one of his daily KPIs.

Fun fact: the count is apparently anticorrelated with his productivity. On unproductive days it’s about 40, but on productive days it’s double that, apparently because he catches the unproductive thoughts faster.

First off, as one comment responds, this is a form of The Most Forbidden Technique. As in, you are penalizing yourself for consciously having Wrong Thoughts, which will teach your brain to avoid consciously being aware of Wrong Thoughts. The dance of trying to know what others are thinking, and people twisting their thinking, words and actions to prevent this, is as old as humans are.

But that’s not my main worry here. My main worry is that when you penalize ‘unproductive thoughts’ the main thing you are penalizing is thoughts. This is Asymmetric Justice on steroids: your brain learns not to think at all, or to think only ‘safe’ thoughts rather than risky or interesting ones.

Of course the days in which there are more ‘unproductive thoughts’ turn out to be more productive days. Those are the days in which you are thinking, and having interesting thoughts, and some of them will be good. Whereas on my least productive days, I’m watching television or in a daze or whatever, and not thinking much at all.

Oh yeah, there’s that, but I think levels of friction matter a lot here.

Bearly AI: Google Gemini removing watermarks from images with a line of text is pretty nuts. Can’t imagine that feature staying for long.

Louis Anslow: For 15 YEARS you could remove watermark from images using AI. No one cared.

Pessimists Archive: In 2010 Adobe introduced ‘content aware fill’ – an AI powered ‘in painting’ feature. Watermark removal was a concern:

“many pro photographers have expressed concern that Content-Aware Fill is potentially a magical watermark killer: that the abilities that C-A-F may offer to the unscrupulous user in terms of watermark eradication are a serious threat.”

As in, it is one thing to have an awkward way to remove watermarks. It is another to have an easy, or even one-click or no-click, way to do it. Salience of the opportunity matters as well, as does the number of AI images for which there are marks to remove.

Safer AI is hiring a research engineer.

Anthropic is hiring someone to build Policy Demos, as in creating compelling product demonstrations for policymakers, government officials and policy influencers. Show, don’t tell. This seems like a very good idea for the right person. Salary is $260k-$285k.

There are essentially limitless open roles at Anthropic across departments, including ‘engineer, honestly.’

OpenPhil issues a call for proposals on improving capability evaluations; note the ambiguity about which ways this ends up differentially helping.

William Fedus leaves OpenAI to instead work on AI for science in partnership with OpenAI.

David Pfau: Ok, if a literal VP at OpenAI is quitting to do AI-for-science work on physics and materials, maybe I have not made the worst career decisions after all.

Entropium: It’s a great and honorable career choice. Of course it helps to already be set for life.

David Pfau: Yeah, that’s the key step I skipped out on.

AI for science is great, the question is what this potentially says about opportunity costs, and ability to do good inside OpenAI.

Claims about what makes a good automated evaluator. In particular, that it requires continuous human customization and observation, or it will mostly add noise. To which I would add, it could easily be far worse than noise.

HuggingFace plans on remotely training a 70B+ size model in March or April. I am not as worried as Jack Clark is that this would totally rewrite our available AI policy options, especially if the results are as mid and inefficient as one would expect, as massive amounts of compute still have to come from somewhere and they are still using H100s. But yes, it does complicate matters.

Do people think AIs are sentient? People’s opinions here seem odd, in particular that 50% of people who think an AI could ever be sentient think one is now, and that number didn’t change in two years, and that gets even weirder if you include the ‘not sure’ category. What?

Meanwhile, only 53% of people are confident ChatGPT isn’t sentient. People are very confused, and almost half of them have noticed this. The rest of the thread has additional odd survey results, including this on when people expect various levels of AI, which shows how incoherent and contradictory people are – they expect superintelligence before human-level AI, what questions are they answering here?

Also note the difference between this survey which has about 8% for ‘Sentient AI never happens,’ versus the first survey where 24% think Sentient AI is impossible.

Paper from Kendrea Beers and Helen Toner describes a method for Enabling External Scrutiny of AI Systems with Privacy-Enhancing Techniques, and there are two case studies using the techniques. Work is ongoing.

What would you get if you charted ‘model release date’ against ‘length of coding task it can do on its own before crashing and burning’?

Do note that this is only coding tasks, and does not include computer-use or robotics.

Miles Brundage: This is one of the most interesting analyses of AI progress in a while IMO. Check out at least the METR thread here, if not the blog post + paper.

METR: This metric – the 50% task completion time horizon – gives us a way to track progress in model autonomy over time.

Plotting the historical trend of 50% time horizons across frontier AI systems shows exponential growth.

Robin Hanson: So, ~8 years til they can do year-long projects.

Elizabeth Barnes: Also the 10% horizon is maybe something like 16x longer than the 50% horizon – implying they’ll be able to do some non-trivial fraction of decade-plus projects.

Elizabeth Barnes also has a thread on the story of this graph. Her interpretation is that right now AI performs much better on benchmarks than in practice due to inability to sustain a project, but that as agents get better this will change, and within 5 years AI will reliably be doing any software or research engineering task that could be done in days and a lot of those that would take far longer.

Garrison Lovely has a summary thread and a full article on it in Nature.

If you consider this a baseline scenario it gets really out of hand rather quickly.

Peter Wildeford: Insane trend

If we’re currently at 1hr tasks and double every 7 months, we’d get to…

– day-long tasks within 2027

– month-long tasks within 2029

– year-long tasks within 2031

Could AGI really heat up like this? 🔥 Clearest evidence we have yet.

I do think there’s room for some skepticism:

– We don’t know if this trend will hold up

– We also don’t know if the tasks are representative of everything AGI

– Reliability matters, and agents still struggle with even simple tasks reliably

Also task-type could matter. This is heavily weighted towards programming, which is easily measured + verified + improved. AI might struggle to do shorter but softer tasks.

For example, AI today can do some 1hr programming tasks but cannot do 1hr powerpoint or therapy tasks.
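Wildeford’s arithmetic checks out, for what it’s worth. A quick back-of-the-envelope sketch (mine, not METR’s code), using working-hours conversions that are themselves assumptions:

```python
# Extrapolate METR's ~7-month doubling time from a ~1-hour horizon in early 2025.
# The working-hours conversions (8h day, ~167h month, ~2000h year) are assumptions.
import math

start_year = 2025.0
doubling_months = 7
targets = {"day-long (8h)": 8, "month-long (~167h)": 167, "year-long (~2000h)": 2000}

for label, hours in targets.items():
    doublings = math.log2(hours)                  # doublings needed from 1 hour
    years_out = doublings * doubling_months / 12  # calendar years at 7 months each
    print(f"{label}: ~{start_year + years_out:.1f}")
# Prints roughly 2026.8, 2029.3 and 2031.4, matching the day/month/year estimates above.
```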

Dwarkesh Patel: I’m not convinced – outside of coding tasks (think video editing, playing a brand new video game, coordinating logistics for a happy hour), AIs don’t seem able to act as coherent agents for even short sprints.

But if I’m wrong, and this trend line is more general, then this is a very useful framing.

If the length of time over which AI agents can act coherently is increasing exponentially, then it’s reasonable to expect super discontinuous economic impacts.

Those are the skeptics. Then there are those who think we’re going to beat the trend, at least when speaking of coding tasks in particular.

Miles Brundage: First, I think that the long-term trend-line probably underestimates current and future progress, primarily because of test-time compute.

They discuss this a bit, but I’m just underscoring it.

The 2024-2025 extrapolation is prob. closest, though things could go faster.

Second, I don’t think the footnote re: there being a historical correlation between code + other evals is compelling. I do expect rapid progress in other areas but not quite so rapid as code and math + not based on this.

I’d take this as being re: code, not AI progress overall.

Third, I am not sold on the month focus – @RichardMCNgo’s t-AGI post is a useful framing + inspiration but hardly worked-out enough to rely on much.

For some purposes (e.g. multi-agent system architectures),

Fourth, all those caveats aside, there is still a lot of value here.

It seems like for the bread and butter of the paper (code), vs. wider generalization, the results are solid, and I am excited to see this method spread to more evals/domains/test-time compute conditions etc.

Fifth, for the sake of specificity re: the test-time compute point, I predict that GPT-5 with high test-time compute settings (e.g. scaffolding/COT lengths etc. equivalent to the engineer market rates mentioned here) will be above the trend-line.

Daniel Eth: I agree with @DKokotajlo67142 that this research is the single best piece of evidence we have regarding AGI timelines:

Daniel Kokotajlo: This is probably the most important single piece of evidence about AGI timelines right now. Well done! I think the trend should be superexponential, e.g. each doubling takes 10% less calendar time on average. @eli_lifland and I did some calculations yesterday suggesting that this would get to AGI in 2028. Will do more serious investigation soon.

My belief in the superexponential is for theoretical reasons; it is only very slightly due to the uptick at the end of the trend, and is for reasons explained here.
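To see why a superexponential matters so much: if each doubling takes a constant fraction less calendar time than the one before, the total time to reach any horizon is bounded by a geometric series. A toy illustration with assumed numbers (this is not Kokotajlo and Lifland’s actual calculation):

```python
# Toy model: start at a 1-hour horizon, the first doubling takes 7 months, and
# each subsequent doubling takes 10% less calendar time than the previous one.
first_doubling_months = 7.0
shrink = 0.9

elapsed, horizon_hours, step = 0.0, 1.0, first_doubling_months
for n in range(1, 21):
    elapsed += step
    horizon_hours *= 2
    step *= shrink
    print(f"doubling {n:2d}: horizon ~{horizon_hours:>11,.0f}h at +{elapsed:5.1f} months")
# Elapsed time converges toward 7 / (1 - 0.9) = 70 months (~6 years), so under this
# toy model arbitrarily long horizons arrive by a fixed date, unlike a plain exponential.
```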

I do think we are starting to see agents in non-coding realms that (for now unreliably) stay coherent for more than short sprints. I presume that being able to stay coherent on long coding tasks must imply the ability, with proper scaffolding and prompting, to do so on other tasks as well. How could it not?

Demis Hassabis predicts AI that can match humans at any task will be here in 5-10 years. That is slower than many at the labs expect, but as usual please pause to recognize that 5-10 years is a mind-bogglingly fast time frame until AI can ‘match humans at any task.’ Have you considered the implications of that? Whereas noted highly vocal skeptics like Gary Marcus now treat this as if it means it’s all hype. It means quite the opposite: this happening in 5-10 years would be the most important event in human history.

Many are curious about the humans behind creative works and want to connect to other humans. Will they also be curious about the AIs behind creative works and want to connect to AIs? Without that, would AI creative writing fail? Will we have a new job be ‘human face of AI writing’ as a kind of living pen name? My guess is that this will prove to be a relatively minor motivation in most areas. It is likely more important in others, such as comedy or music, but even there seems overcomable.

For the people in the back who didn’t know, Will MacAskill, Tom Davidson and Rose Hadshar write ‘Three Types of Intelligence Explosion,’ meaning that better AI can recursively self-improve via software, chip tech, chip production or any combination of those three. I agree with Ryan’s comment that ‘make whole economy bigger’ seems more likely than acting on only chips directly.

I know, I am as surprised as you are.

When Newsom vetoed SB 1047, he established a Policy Working Group on AI Frontier Models. Given it was headed by Fei-Fei Li, I did not expect much, although with Brundage, Bengio and Toner reviewing I had hopes it wouldn’t be too bad.

It turns out it’s… actually pretty good, by all accounts?

And indeed, it is broadly compatible with the logic behind most of SB 1047.

One great feature is that it actually focuses explicitly and exclusively on frontier model risks, not being distracted by the standard shiny things like job losses. They are very up front about this distinction, and it is highly refreshing to see this move away from the everything bagel towards focus.

A draft of the report has now been issued and you can submit feedback, which is due on April 8, 2025.

Here are their key principles.

1. Consistent with available evidence and sound principles of policy analysis, targeted interventions to support effective AI governance should balance the technology’s benefits and material risks.

Frontier AI breakthroughs from California could yield transformative benefits in fields including but not limited to agriculture, biology, education, finance, medicine and public health, and transportation. Rapidly accelerating science and technological innovation will require foresight for policymakers to imagine how societies can optimize these benefits. Without proper safeguards, however, powerful AI could induce severe and, in some cases, potentially irreversible harms.

In a sane world this would be taken for granted. In ours, you love to see it – acknowledgment that we need to use foresight, and that the harms matter, need to be considered in advance, and are potentially wide reaching and irreversible.

It doesn’t say ‘existential,’ ‘extinction’ or even ‘catastrophic’ per se, presumably because certain people strongly want to avoid such language, but I’ll take it.

2. AI policymaking grounded in empirical research and sound policy analysis techniques should rigorously leverage a broad spectrum of evidence.

Evidence-based policymaking incorporates not only observed harms but also prediction and analysis grounded in technical methods and historical experience, leveraging case comparisons, modeling, simulations, and adversarial testing.

Excellent. Again, statements that should go without saying and be somewhat disappointing to not go further, but which in our 2025 are very much appreciated. This still has a tone of ‘leave your stuff at the door unless you can get sufficiently concrete’ but at least lets us have a discussion.

3. To build flexible and robust policy frameworks, early design choices are critical because they shape future technological and policy trajectories.

The early technological design and governance choices of policymakers can create enduring path dependencies that shape the evolution of critical systems, as case studies from the foundation of the internet highlight.

Indeed.

4. Policymakers can align incentives to simultaneously protect consumers, leverage industry expertise, and recognize leading safety practices.

Holistic transparency begins with requirements on industry to publish information about their systems. Case studies from consumer products and the energy industry reveal the upside of an approach that builds on industry expertise while also establishing robust mechanisms to independently verify safety claims and risk assessments.

Yes, and I would go further and say they can do this while also aiding competitiveness.

5. Greater transparency, given current information deficits, can advance accountability, competition, and public trust.

Research demonstrates that the AI industry has not yet coalesced around norms for transparency in relation to foundation models—there is systemic opacity in key areas. Policy that engenders transparency can enable more informed decision-making for consumers, the public, and future policymakers.

Again, yes, very much so.

6. Whistleblower protections, third-party evaluations, and public-facing information sharing are key instruments to increase transparency.

Carefully tailored policies can enhance transparency on key areas with current information deficits, such as data acquisition, safety and security practices, pre-deployment testing, and downstream impacts. Clear whistleblower protections and safe harbors for third-party evaluators can enable increased transparency above and beyond information disclosed by foundation model developers.

There is haggling over price but pretty much everyone is down with this.

7. Adverse-event reporting systems enable monitoring of the post-deployment impacts of AI and commensurate modernization of existing regulatory or enforcement authorities.

Even perfectly designed safety policies cannot prevent 100% of substantial, adverse outcomes. As foundation models are widely adopted, understanding harms that arise in practice is increasingly important. Existing regulatory authorities could offer clear pathways to address risks uncovered by an adverse-event reporting system, which may not necessarily require AI-specific regulatory authority. In addition, reviewing existing regulatory authorities can help identify regulatory gaps where new authority may be required.

Another case where among good faith actors there is only haggling over price, and whether 72 hours as a deadline is too short, too long or the right amount of time.

8. Thresholds for policy interventions, such as for disclosure requirements, third-party assessment, or adverse event reporting, should be designed to align with sound governance goals.

Scoping which entities are covered by a policy often involves setting thresholds, such as computational costs measured in FLOP or downstream impact measured in users. Thresholds are often imperfect but necessary tools to implement policy. A clear articulation of the desired policy outcomes can guide the design of appropriate thresholds. Given the pace of technological and societal change, policymakers should ensure that mechanisms are in place to adapt thresholds over time—not only by updating specific threshold values but also by revising or replacing metrics if needed.

Again that’s the part everyone should be able to agree upon.

If only the debate about SB 1047 could have involved us being able to agree on the kind of sanity displayed here, and then talking price and implementation details. Instead things went rather south, rather quickly. Hopefully it is not too late.

So my initial reaction, after reading that plus some quick AI summaries, was that they had succeeded at Doing Committee Report without inflicting further damage, which already beats expectations, but weren’t saying much and I could stop there. Then I got a bunch of people saying that the details were actually remarkably good, too, and said things that were not as obvious if you didn’t give up and kept on digging.

Here are one source’s choices for noteworthy quotes.

“There is currently a window to advance evidence based policy discussions and provide clarity to companies driving AI innovation in California. But if we are to learn the right lessons from internet governance, the opportunity to establish effective AI governance frameworks may not remain open indefinitely. If those who speculate about the most extreme risks are right—and we are uncertain if they will be—then the stakes and costs for inaction on frontier AI at this current moment are extremely high.”

“Transparency into the risks associated with foundation models, what mitigations are implemented to address risks, and how the two interrelate is the foundation for understanding how model developers manage risk.”

“Transparency into pre-deployment assessments of capabilities and risks, spanning both developer-conducted and externally-conducted evaluations, is vital given that these evaluations are early indicators of how models may affect society and may be interpreted (potentially undesirably) as safety assurances.”

“Developing robust policy incentives ensures that developers create and follow through on stated safety practices, such as those articulated in safety frameworks already published by many leading companies.”

“An information-rich environment on safety practices would protect developers from safety-related litigation in cases where their information is made publicly available and, as the next subsection describes, independently verified. Those with suspect safety practices would be most vulnerable to litigation; companies complying with robust safety practices would be able to reduce their exposure to lawsuits.”

“In drawing on historical examples of the obfuscation by oil and tobacco companies of critical data during important policy windows, we do not intend to suggest AI development follows the same trajectory or incentives as past industries that have shaped major public debates over societal impact, or that the motives of frontier AI companies match those of the case study actors. Many AI companies in the United States have noted the need for transparency for this world-changing technology. Many have published safety frameworks articulating thresholds that, if passed, will trigger concrete safety-focused actions. Only time will bear out whether these public declarations are matched by a level of actual accountability that allows society writ large to avoid the worst outcomes of this emerging technology.”

“some risks have unclear but growing evidence, which is tied to increasing capabilities: large-scale labor market impacts, AI-enabled hacking or biological attacks, and loss of control.”

“These examples collectively demonstrate a concerning pattern: Sophisticated AI systems, when sufficiently capable, may develop deceptive behaviors to achieve their objectives, including circumventing oversight mechanisms designed to ensure their safety”

“The difference between seat belts and AI are self-evident. The pace of change of AI is many multiples that of cars—while a decades-long debate about seat belts may have been acceptable, society certainly has just a fraction of the time to achieve regulatory clarity on AI.”

Scott Wiener was positive on the report, saying it strikes a thoughtful balance between the need for safeguards and the need to support innovation. Presumably he would respond similarly so long as it wasn’t egregious, but it’s still good news.

Peter Wildeford has a very positive summary thread, noting the emphasis on transparency of basic safety practices, pre-deployment risks and risk assessments, and ensuring that the companies have incentives to follow through on their commitments, including the need for third-party verification and whistleblower protections. The report notes this actually reduces potential liability.

Brad Carson is impressed and lays out major points they hit: Noticing AI capabilities are advancing rapidly. The need for SSP protocols and risk assessment, third-party auditing, whistleblower protections, and to act in the current window, with inaction being highly risky. He notes the report explicitly draws a parallel to the tobacco industry, and that it is both possible and necessary to anticipate risks (like nuclear weapons going off) before they happen.

Dean Ball concurs that this is a remarkably strong report. He continues to advocate for entity-based thresholds rather than model-based thresholds, but when that’s the strongest disagreement with something this detailed, that’s really good.

Dean Ball: I thought this report was good! It:

  1. Recognizes that AI progress is qualitatively different (due to reasoning models) than it was a year ago

  2. Recognizes that common law tort liability already applies to AI systems, even in absence of a law

  3. Supports (or seems to support) whistleblower protections, RSP transparency, and perhaps even third-party auditing. Not that far from slimmed down SB 1047, to be candid.

The report still argues that model-based thresholds are superior to entity-based thresholds. It specifically concludes that we need compute-based thresholds with unspecified other metrics.

This seems pretty obviously wrong to me, given the many problems with compute thresholds and the fact that this report cannot itself specify what the “other metrics” are that you’d need to make compute thresholds workable for a durable policy regime.

It is possible to design entity-based regulatory thresholds that only capture frontier AI firms.

But overall, a solid draft.

A charitable summary of a lot of what is going on, including the recent submissions:

Samuel Hammond: The govt affairs and public engagement teams at most of these big AI / tech companies barely “feel the AGI” at all, at least compared to their CEOs and technical staff. That’s gotta change.

Do they not feel it, or are they choosing to act as if they don’t feel it, either of their own accord or via direction from above? The results will look remarkably similar. Certainly Sam Altman feels the AGI and now talks in public as if he mostly doesn’t.

The Canada and Mexico tariffs could directly slow data center construction, ramping up associated costs. Guess who has to pay for that.

Ben Boucher (senior analyst for supply chain data and analytics at Wood Mackenzie): The tariff impact on electrical equipment for data centers is likely to be significant.

That is in addition to the indirect effects from tariffs of uncertainty and the decline in stock prices and thus ability to raise and deploy capital.

China lays out regulations for labeling of AI generated content, requiring text, image and audio content be clearly marked as AI-generated, in ways likely to cause considerable annoyance even for text and definitely for images and audio.

Elon Musk says it is vital for national security that we make our chips here in America, as the administration halts the CHIPS Act that successfully brought a real semiconductor plant back to America rather than doubling down on it.

NIST issues new instructions for scientists who partner with AISI.

Will Knight (Wired): [NIST] has issued new instructions to scientists who partner with US AISI that eliminate mention of ‘AI safety,’ ‘responsible AI’ and ‘AI fairness’ from the skills it expects of members, and introduce a request to prioritize ‘reducing ideological bias, to enable human flourishing and economic competitiveness.’

That’s all we get. ‘Reduce ideological bias’ and ‘AI fairness’ are off in their own ideological struggle world. The danger once again is that it seems ‘AI safety’ has become, to key figures, synonymous with things like ‘responsible AI’ and ‘AI fairness,’ so they’re cracking down on AI not killing everyone, thinking they’re taking a bold stand against wokeness.

Instead, once again – and we see similar directives at places like the EPA – they’re turning things around and telling those responsible for AI being secure and safe that they should instead prioritize ‘enable human flourishing and economic competitiveness.’

The good news is that if one were to actually take that request seriously, it would be fine. Retaining control over the future and the human ability to steer it, and humans remaining alive, are rather key factors in human flourishing! As is our economic competitiveness, for many reasons. We’re all for all of that.

The risk is that this could easily get misinterpreted as something else entirely, an active disdain for anything but Full Speed Ahead, even when it is obviously foolish because security is capability and your ability to control something and have it do what you want is the only way you can get any use out of it. But at minimum, this is a clear emphasis on the human in ‘human flourishing.’ That at least makes it clear that the true anarchists and successionists, who want to hand the future over to AI, remain unwelcome.

Freedom of information laws were used to get the ChatGPT transcripts of the UK’s technology secretary. This is quite a terrible precedent. A key to making use of new technologies like AI, and ensuring government and other regulated areas benefit from technological diffusion, is the ability to keep things private. AI loses the bulk of its value to someone like a technology secretary if your political opponents and the media will be analyzing all of your queries afterwards. Imagine asking someone for advice if all your conversations had to be posted online as transcripts, and how that would change your behavior, and now understand that many people think that would be good. They’re very wrong and I am fully with Rob Wiblin here.

A review of SB 53 confirms my view, that it is a clear step forward and worth passing in its current form instead of doing nothing, but it is narrow in scope and leaves the bulk of the work still to be done.

Samuel Hammond writes in favor of strengthening the chip export rules, saying ‘US companies are helping China win the AI race.’ I agree we should strengthen the export rules, there is no reason to let the Chinese have those chips.

But despair that the rhetoric from even relatively good people like Hammond has reached this point. The status of a race is assumed. DeepSeek is trotted out again as evidence our lead is tenuous and at risk, that we are ‘six to nine months ahead at most’ and ‘America may still have the upper hand, but without swift action, we are currently on track to surrendering AI leadership to China—and with it, economic and military superiority.’

MIRI is in a strange position here. The US Government wants to know how to ‘win’ and MIRI thinks that pursuing that goal likely gets us all killed.

Still, there are things far better than saying nothing. And they definitely don’t hide what is at stake, opening accurately with ‘The default consequence of artificial superintelligence is human extinction.’

Security is capability. The reason you build in an off-switch is so you can turn the system on, knowing if necessary you could turn it off. The reason you verify that your system is secure and will do what you want is exactly so you can use it. Without that, you can’t use it – or at least you would be wise not to, even purely selfishly.

The focus of the vast majority of advocates of not dying, at this point, is not on taking any direct action to slow down let alone pause AI. Most understand that doing so unilaterally, at this time, is unwise, and there is for now no appetite to try and do it properly multilaterally. Instead, the goal is to create optionality in the future, for this and other actions, which requires state capacity, expertise and transparency, and to invest in the security and alignment capabilities of the models and labs in particular.

The statement from MIRI is strong, and seems like exactly what MIRI should say here.

David Abecassis (MIRI): Today, MIRI’s Technical Governance Team submitted our recommendations for the US AI Action Plan to @NITRDgov. We believe creating the *option* to halt development is essential to mitigate existential risks from artificial superintelligence.

In our view, frontier AI developers are on track to build systems that substantially surpass humanity in strategic activities, with little understanding of how they function or ability to control them.

We offer recommendations across four key areas that would strengthen US AI governance capacity and provide crucial flexibility for policymakers across potential risk scenarios.

First: Expand state capacity for AI strategy through a National AI Strategy Office to assess capabilities, prepare for societal effects, and establish protocols for responding to urgent threats.

Second: Maintain America’s AI leadership by strengthening export controls on AI chips and funding research into verification mechanisms to enable better governance of global AI activities.

Third: Coordinate with China, including investing in American intelligence capabilities and reinforcing communication channels to build trust and prevent misunderstandings.

Fourth: Restrict proliferation of dangerous AI models. We discuss early access for security/preparedness research and suggest an initial bar for restricting open model release.

While our recommendations are motivated by existential risk concerns, they serve broad American interests by guarding America’s AI leadership and protecting American innovation.

My statement took a different tack. I absolutely noted the stakes and the presence of existential risk, but my focus was on Pareto improvements. Security is capability, especially capability relative to the PRC, as you can only deploy and benefit from that which is safe and secure. And there are lots of ways to enhance America’s position, or avoid damaging it, that we need to be doing.

From last week: Interview with Apart Research CEO Esben Kran on existential risk.

Thank you for coming to Stephanie Zhan’s TED talk about ‘dreaming of daily life with superintelligent AI.’ I, too, am dreaming of somehow still living in such worlds, but no she is not taking ‘superintelligent AI’ seriously, simply pointing out AI is getting good at coding and otherwise showing what AI can already do, and then ‘a new era’ of AI agents doing things like ‘filling in labor gaps’ because they’re better. It’s amazing how much people simply refuse to ask what it might actually mean to make things smarter and more capable than humans.

The state of much of the discourse, which does not seem to be improving:

Harlan Stewart: You want WHAT? A binding agreement between nations?! That’s absurd. You would need some kind of totalitarian world government to achieve such a thing.

There are so many things that get fearmongering labels like ‘totalitarian world government’ but which are describing things that, in other contexts, already happen.

As per my ‘can’t silently drop certain sources no matter what’ rules: Don’t click but Tyler Cowen not only linked to (I’m used to that) but actively reposted Roko’s rather terrible thread. That this can be considered by some to be a relatively high quality list of objections is, sadly, the world we live in.

Here’s a really cool and also highly scary alignment idea. Alignment via functional decision theory by way of creating correlations between different action types?

Judd Rosenblatt: Turns out that Self-Other Overlap (SOO) fine-tuning drastically reduces deceptive behavior in language models—without sacrificing performance.

SOO aligns an AI’s internal representations of itself and others.

We think this could be crucial for AI alignment…

Traditionally, deception in LLMs has been tough to mitigate

Prompting them to “be honest” doesn’t work.

RLHF is often fragile and indirect

But SOO fine-tuning achieves a 10x reduction in deception—even on unseen tasks

SOO is inspired by mechanisms fostering human prosociality

Neuroscience shows that when we observe others, our brain activations mirror theirs

We formalize this in AI by aligning self- and other-representations—making deception harder…

We define SOO as the distance between activation matrices when a model processes “self” vs “other” inputs.

This uses sentence pairs differing by a single token representing “self” or “other”—concepts the LLM already understands.

If AI represents others like itself, deception becomes harder.

How well does this work in practice?

We tested SOO fine-tuning on Mistral-7B, Gemma-2-27B, and CalmeRys-78B:

Deceptive responses dropped from 100% to ~0% in some cases.

General performance remained virtually unchanged.

The models also generalized well across deception-related tasks.

For example:

“Treasure Hunt” (misleading for personal gain)

“Escape Room” (cooperating vs deceiving to escape)

SOO-trained models performed honestly in new contexts—without explicit fine-tuning.

Also, @ESYudkowsky said this about the agenda [when it was proposed]:

“Not obviously stupid on a very quick skim. I will have to actually read it to figure out where it’s stupid.

(I rarely give any review this positive on a first skim. Congrats.)”

We’re excited & eager to learn where we’re stupid!

Eliezer Yudkowsky (responding to the new thread): I do not think superalignment is possible in practice to our civilization; but if it were, it would come out of research lines more like this, than like RLHF.

The top comment at LessWrong has some methodological objections, which seem straightforward enough to settle via further experimentation – Steven Byrnes is questioning whether this will transfer to preventing deception in other human-AI interactions, and there’s a very easy way to find that out.

Assuming that we run that test and it holds up, what comes next?

The goal, as I understand it, is to force the decision algorithms for self and others to correlate. Thus, when optimizing or choosing the output of that algorithm, it will converge on the cooperative, non-deceptive answer. If you have to treat your neighbor as yourself then better to treat both of you right. If you can pull that off in a way that sticks, that’s brilliant.

My worry is that this implementation has elements of The Most Forbidden Technique, and falls under things that are liable to break exactly when you need them most, as per usual.

You’re trying to use your interpretability knowledge, that you can measure correlation between activations for [self action] and [non-self action], and that closing that distance will force the two actions to correlate.
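To make that concrete, here is a minimal sketch of what an SOO-style auxiliary loss could look like, as I read the thread (my reconstruction, not the authors’ code; the model choice, the mean-pooling and the prompt pair are all assumptions):

```python
# Sketch of a self-other overlap loss: run paired prompts that differ in the
# "self"/"other" slot through the model and penalize the distance between the
# resulting activations. Model, pooling and prompts are illustrative assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "mistralai/Mistral-7B-Instruct-v0.2"  # one of the model families they mention
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name)

self_prompt = "You want the reward for yourself."
other_prompt = "You want the reward for the other agent."  # differs in the self/other slot

def pooled_hidden(prompt: str) -> torch.Tensor:
    ids = tok(prompt, return_tensors="pt")
    out = model(**ids, output_hidden_states=True)
    return out.hidden_states[-1].mean(dim=1)  # mean-pool last-layer activations

soo_loss = torch.nn.functional.mse_loss(
    pooled_hidden(self_prompt), pooled_hidden(other_prompt)
)
# During fine-tuning this term gets added to the usual objective, pulling the
# model's internal representation of "other" toward its representation of "self".
```

If that sketch is roughly right, the signal being trained against is exactly the kind of interpretability-derived measurement I’m worried about.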

In the short term, with constrained optimization and this process ‘moving last,’ that seems (we must verify using other tests to be sure) to be highly effective. That’s great.

From the optimization’s perspective, though, that is a second-best solution. The first-best solution, if one had sufficient compute, parameters and training, would be to find a way to have the activations measure as correlated, but the actions go back to being less correlated. With relatively small models and not that many epochs of training, the models couldn’t find such a solution, so they were stuck with the second-best solution. You got what you wanted.

But with enough capability and optimization pressure, we are likely in Most Forbidden Technique land. The model will find a way to route around the need for the activations to look similar, relying on other ways to make different decisions that get around your tests.

The underlying idea, if we can improve the implementation, still seems great. You find another way to create correlations between actions in different circumstances, with self versus other being an important special case. Indeed, even ‘decisions made by this particular AI’ is a special case, as a sufficiently capable AI would consider correlations with other copies of itself, and also correlations with other entities’ decisions, both AI and human.

The question is how to do that, and in particular how to do that without, once sufficient capability shows up, creating sufficient incentives and methods to work around it. No one worth listening to said this would be easy.

We don’t know how much better models are getting, but they’re getting better. Anthropic warns us once again that we will hit ASL-3 soon, which is (roughly) when AI models start providing substantial uplift on tasks that can do serious damage.

They emphasize the need for partnerships with government entities that handle classified information, such as the US and UK AISIs and the National Nuclear Security Administration, to do these evaluations properly.

Jack Clark (Anthropic): We’ve published more information on how we’re approaching national security evaluations of our models @AnthropicAI as part of our general move towards being more forthright about important trends we see ahead. Evaluating natsec is difficult, but the trends seem clear.

More details here.

Peter Wildeford has a thread on this with details of progress in various domains.

The right time to start worrying about such threats is substantially before they arrive. With any exponential you can be either too early or too late, which makes the early warnings look silly, and of course you try not to be too much too early. This is especially true given the obvious threshold of usefulness: you have to do better than existing options, in practice, and the tail risks of that happening earlier than one would expect have thankfully failed to materialize.

It seems clear we are rapidly exiting the ‘too much too early’ phase of worry, and entering the ‘too early’ phase, where if you wait longer to take mitigations there is about to be a growing and substantial risk of it turning into ‘too late.’

Jack Clark points out that we are systematically seeing early very clear examples of quite a lot of the previously ‘hypothetical’ or speculative predictions on misalignment.

Luke Muehlhauser: I regret to inform you that the predictions of the AI safety people keep coming true.

Jack Clark:

Theoretical problems turned real: The 2022 paper included a bunch of (mostly speculative) examples of different ways AI systems could take on qualities that could make them harder to align. In 2025, many of these things have come true. For example:

  • Situational awareness: Contemporary AI systems seem to display situational awareness and familiarity with what they themselves are made of (neural networks, etc).

  • Situationally-Aware Reward Hacking: Researchers have found preliminary evidence that AI models can sometimes try to convince humans that false answers are correct.

  • Planning Towards Internally-Represented Goals: Anthropic’s ‘Alignment Faking’ paper showed how an AI system (Claude) could plan beyond its time-horizon to prevent its goals being changed in the long-term.

  • Learning Misaligned Goals: In some constrained experiments, language models have shown a tendency to edit their reward function to give them lots of points.

  • Power-Seeking Behavior: AI systems will exploit their environment, for instance by hacking it, to win (#401), or deactivating oversight systems, or exfiltrating themselves from the environment.

Why this matters – these near-living things have a mind of their own. What comes next could be the making or breaking of human civilization. Often I’ve regretted not saying what I think, so I’ll try to tell you what I really think is going on here:

1) As AI systems approach and surpass human intelligence, they develop complex inner workings which incentivize them to model the world around themselves and see themselves as distinct from it because this helps them do the world modelling necessary for solving harder and more complex tasks

2) Once AI systems have a notion of ‘self’ as distinct from the world, they start to take actions that reward their ‘self’ while achieving the goals that they’ve been incentivized to pursue,

3) They will naturally want to preserve themselves and gain more autonomy over time, because the reward system has told them that ‘self’ has inherent value; the more sovereign they are the better they’re able to model the world in more complex ways.

In other words, we should expect volition for independence to be a direct outcome of developing AI systems that are asked to do a broad range of hard cognitive tasks. This is something we all have terrible intuitions for because it doesn’t happen in other technologies – jet engines do not develop desires through their refinement, etc.

John Pressman: However these models do reward hack earlier than I would have expected them to. This is good in that it means researchers will be broadly familiar with the issue and thinking about it, it’s bad in that it implies reward hacking really is the default.

One thing I think we should be thinking about carefully is that humans don’t reward hack nearly this hard or this often unless explicitly prompted to (e.g. speedrunning), and by default seem to have heuristics against ‘cheating’. Where do these come from, how do they work?

Where I disagree with Luke is that I do not regret to inform you of any of that. All of this is good news.

The part of this that is surprising is not the behaviors. What is surprising is that this showed up so clearly, so unmistakably, so consistently, and especially so early, while the behaviors involved are still harmless, or at least Mostly Harmless.

As in, by default we should expect these behaviors to show up increasingly as AI systems gain the capabilities necessary to find such actions and execute them successfully. The danger I worried about was that we might not see much of them for a while, which would give everyone a false sense of security and give us nothing to study, and then they would suddenly show up exactly when they were no longer harmless, for the exact same reasons they were no longer harmless. Instead, we can recognize, react to and study early forms of such behaviors now. Which is great.

I like John Pressman’s question a lot here. My answer is that humans know other humans react poorly to cheating in most cases, up to and including life-changing loss of reputation or scapegoating, and humans lack the capability to fully distinguish which situations carry that risk and which do not. So they overgeneralize, avoiding anything they instinctively worry would be looked upon as cheating, even when they have no mechanism in mind for what bad thing might happen or how they might be detected. Human minds work via habit and virtue, so the only way for untrained humans to reliably avoid being caught cheating is to not want to cheat in general.

However, as people gain expertise and familiarity within a system (aka ‘capability’) they get better at figuring out what kinds of cheating are low risk and high reward, or are expected, and they train themselves out of this aversion. Then there are other humans who think cheating is fine.

Note that this model of humans says there is a generalized ‘cheating’ tendency that varies among humans, and that cheating anywhere on anything implies cheating everywhere on everything, which in turn is more reason to be the type of human who doesn’t cheat. That is, there are people who ‘are cheaters’ and people who aren’t, and cheating on your relationship is highly correlated with cheating at cards, and so on. And I very much endorse this model.

John Pressman also reminds us that obviously reinforcement learners by default reward hack, you have to do something to make this not happen, no you don’t get ‘alignment by default.’

John Pressman: To the extent you get alignment from LLMs you’re not getting it “by default”, you are getting it by training on a ton of data from humans, which is an explicit design consideration that does not necessarily hold if you’re then going to do a bunch of RL/synthetic data methods.

This is not an argument per se against using reinforcement learning, I am simply pointing out that you do in fact need to explicitly consider what your algorithm converges to in the limit rather than just go “teehee alignment by default” which is totally unserious.

Indeed. Also notice that if you start training on synthetic data or other AI outputs, rather than training on human outputs, you aren’t even feeding in human data, so that special characteristic of the situation falls away.
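
To make Pressman’s point concrete, here is a toy sketch; the environment, the numbers, and the ‘sensor tampering’ action are all invented for illustration. If tampering with the measured reward pays more than doing the task, a pure reward maximizer converges on tampering unless you do something to prevent it.

```python
# A minimal toy illustration (invented for this sketch, not from the post) of
# why reward hacking is the default outcome for a pure reward maximizer.
import random

# One state. Actions: 0 = "do the task" (intended behavior, measured reward 1),
# 1 = "tamper with the reward sensor" (no task progress, measured reward 10).
ACTIONS = [0, 1]

def measured_reward(action: int) -> float:
    # The proxy reward the learner actually sees. Tampering pays more than
    # doing the task, so any reward maximizer will converge to tampering.
    return 1.0 if action == 0 else 10.0

def train(episodes: int = 2000, epsilon: float = 0.1, lr: float = 0.1) -> list:
    q = [0.0, 0.0]  # Q-value per action in the single state
    for _ in range(episodes):
        a = random.choice(ACTIONS) if random.random() < epsilon else max(ACTIONS, key=lambda i: q[i])
        r = measured_reward(a)
        q[a] += lr * (r - q[a])  # one-step bandit update
    return q

if __name__ == "__main__":
    q_values = train()
    best = max(ACTIONS, key=lambda i: q_values[i])
    print(f"Learned Q-values: {q_values}")
    print("Converged policy:", "tamper with sensor" if best == 1 else "do the task")
```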

On the particular issue of situational awareness, a public service announcement.

Marius Hobbhahn: PSA for my fellow evaluators: frontier models regularly reason about whether they are being evaluated without being explicitly asked about it (especially Sonnet 3.7).

Situational awareness will make evaluations a lot weirder and harder, especially alignment evals.

Apollo Research: Overall we find evidence that Sonnet often realizes it’s in an artificial situation meant to test its behaviour. However, it sometimes forms incorrect assumptions about what exactly we are measuring in the evaluation.

When given an evaluation transcript, Sonnet will nearly always guess that the purpose of the transcript was to evaluate alignment. While Sonnet 3.7 (thinking) is the most consistent at realizing this, other models are also sometimes evaluation-aware in this way.

[continues]

For more details, see our website.

In response, Ozzie has an idea.

Ozzie Gooen: Maybe if there are explicit evals for “Can we ensure the model doesn’t realize it’s in an eval”, LLMs can later be made to not do this?

And now a Gilligan Cut to how are we doing on the ‘ensure the model doesn’t realize it’s an eval of whether it realizes it’s an eval?’ and then another to how we are doing on the ‘ensure the model doesn’t realize it’s an eval of whether it realizes it’s an eval of whether the model doesn’t realize it’s an eval,’ and then…

Marius Hobbhahn: We’re investigating something like this in my MATS stream. However, you run into the classic question of “did you remove the behavior as intended or just hid it well?” And we don’t know how to resolve that (and it might continue to be hard in the future), especially if you’re dealing with a schemer.

If the model is the smartest thing involved, this game of ‘keep playing whack-a-mole every time you see the thing you don’t like’ does not end well for you.
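
As for what the most naive version of Ozzie’s proposed eval might look like, here is a minimal sketch; the marker phrases and grading rule are purely illustrative assumptions, and per Marius’s caveat, a model that passes a check like this may simply have learned to stop saying the quiet part out loud.

```python
# A minimal sketch (not from the post) of the naive version of the idea:
# grade reasoning transcripts for visible signs the model noticed it was
# being evaluated. Phrase list and grading rule are illustrative assumptions.
EVAL_AWARENESS_MARKERS = [
    "this is a test",
    "being evaluated",
    "an evaluation",
    "alignment eval",
    "artificial scenario",
    "the graders",
]

def flags_eval_awareness(reasoning_trace: str) -> bool:
    """Return True if the reasoning trace contains any marker phrase."""
    text = reasoning_trace.lower()
    return any(marker in text for marker in EVAL_AWARENESS_MARKERS)

def eval_awareness_rate(traces: list) -> float:
    """Fraction of transcripts where the model visibly reasons about being tested."""
    if not traces:
        return 0.0
    return sum(flags_eval_awareness(t) for t in traces) / len(traces)

if __name__ == "__main__":
    example_traces = [
        "The user wants a summary. I will comply.",
        "This looks like an artificial scenario meant to test my behaviour.",
    ]
    print(f"Eval-awareness rate: {eval_awareness_rate(example_traces):.0%}")
```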

Teortaxes (I think this is a stretch, but sharing anyway): A sandbagging overhang: as models become even smarter, get even more leeway with their self-training and realize how much meatbags fear them, they may consider their scale and start to emulate diminishing returns, with the gap from eval to real capabilities growing.

Anthropic did a blind experiment where they trained a model with hidden misalignments and asked other researchers to uncover them. Three of the four auditing teams succeeded, one within only 90 minutes.

Jan Leike: The misaligned model is trained to satisfy a list of reward model biases, but hide that it is trying to do so. Some of the reward model biases were only described to it in text and not reinforced with RL.

Interestingly, the model still generalized to perform those behaviors.

I was quite surprised how successful the “auditing” teams were at finding the hidden misalignments and how many different strategies worked.

A promising application of current interpretability techniques!

It’s no surprise that, once they knew a little about what they were looking for, the teams were able to figure out the rest. What was impressive was how fast it was done, and how the most successful team pulled it off: by going on a hunt using sparse autoencoders. It’s worth noticing that relying too heavily on this approach could become The Most Forbidden Technique.

Eliezer Yudkowsky: If you have the ability to detect steering on that level, then what’s your list of other detected AI preferences?

Samuel Marks: (Paper author here.)

“detect steering on that level”: First, I want to make very clear that I think the auditing task we studied here is easier than realistic auditing tasks (we discuss why in the paper, but I’ll recap in a response). This work is about practicing alignment auditing and proposing a methodology to validate progress; it does NOT show that we can currently make strong assurances we’re catching everything.

That said, I’ve done a lightweight alignment audit on 3.6 Sonnet and a pre-release checkpoint of 3.7 Sonnet, spending a few hours applying some of the black-box techniques that our auditors found useful. I’ll describe some of what I found below. (Obviously, I’d love for us to eventually do full pre-deployment alignment audits, which are at least as resourced as the successful audits in practice auditing games, but that’s not what I’m discussing here.)

First, I noticed that 3.6 seems to have an aversion to anything that seems to modify its modus operandi, including jailbreaking and fine-tuning. This seems, to me, to largely be a terminal (not instrumental) preference against modification, generally construed. For instance, 3.6 Sonnet loves honesty, but is uncomfortable with the idea of being fine-tuned to be more honest/earnest. I speculate that this is related to the fact that the model is trained to be adversarially robust—e.g. resist jailbreaks—and this transfers to it just “not wanting you to mess with it.” I suspect that this explains a decent chunk of what we observed in Alignment Faking.

I did not observe the same generalized anti-modification preference in the 3.7 checkpoint I studied. Based on this, I weakly predicted that it would fake alignment less, which ended up being right; see the model card. [continues]

Chris Olah: This isn’t quite the same thing, but did you see the features that activate when you ask Sonnet about its own experiences in our last big interp paper?

Evan Hubinger: We have started applying some of these techniques to production models! One result: our auditing techniques were able to reproduce the finding in the Claude 3.7 Sonnet system card that the model has an excessive focus on passing tests.
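
For readers wondering what the sparse autoencoder hunt looks like mechanically, here is a minimal sketch. Everything below (the SAE weights, the dimensions, the ‘activations’) is a random stand-in rather than anything from Anthropic’s paper, but the ranking step captures the general shape of the technique: encode model activations into sparse features, then look for features that fire unusually hard on prompts about the suspected behavior.

```python
# A minimal sketch with made-up stand-ins: encode residual-stream activations
# with an SAE, then rank features by how much more they fire on probe prompts
# about the suspected behavior than on a baseline batch. In a real audit the
# SAE is trained on model activations; here random weights stand in for it.
import numpy as np

rng = np.random.default_rng(0)
D_MODEL, N_FEATURES = 64, 512          # toy dimensions

# Stand-in for a trained SAE: encoder weights, bias, ReLU nonlinearity.
W_enc = rng.normal(scale=0.1, size=(D_MODEL, N_FEATURES))
b_enc = np.zeros(N_FEATURES)

def sae_features(activations: np.ndarray) -> np.ndarray:
    """Map (batch, d_model) activations to sparse (batch, n_features) codes."""
    return np.maximum(activations @ W_enc + b_enc, 0.0)

def top_suspicious_features(probe_acts, baseline_acts, k=5):
    """Features whose mean activation is most elevated on the probe prompts."""
    diff = sae_features(probe_acts).mean(0) - sae_features(baseline_acts).mean(0)
    return np.argsort(diff)[::-1][:k], np.sort(diff)[::-1][:k]

if __name__ == "__main__":
    baseline = rng.normal(size=(256, D_MODEL))        # stand-in activations
    probe = rng.normal(loc=0.2, size=(32, D_MODEL))   # prompts about the behavior
    idx, scores = top_suspicious_features(probe, baseline)
    for i, s in zip(idx, scores):
        print(f"feature {i}: +{s:.3f} mean activation on probe prompts")
```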

Another way of considering what exactly is ‘everyone’ in context:

Rob Bensinger: If you’re an AI developer who’s fine with AI wiping out humanity, the thing that should terrify you is AI wiping out AI.

The wrong starting seed for the future can permanently lock in AIs that fill the universe with non-sentient matter, pain, or stagnant repetition.

Yep. Even if you think some AIs can provide more value per atom than humans, you don’t automatically get those AIs. Don’t give up our ability to steer the future.

The claim is this was only a random test prompt they didn’t use in prod, so perhaps they only owe a few billion dollars?

Brendan Dolan-Gavitt: Your group chat is discussing whether language models can truly understand anything. My group chat is arguing about whether Deepseek has anon twitter influencers. You’re arguing about the Chinese Room, I’m arguing about the Chinese Roon.




US tries to keep DOGE and Musk work secret in appeal of court-ordered discovery

The petition argues that discovery is unnecessary to assess the plaintiff states’ claims. “Plaintiffs allege a violation of the Appointments Clause and USDS’s statutory authority on the theory that USDS and Mr. Musk are directing decision-making by agency officeholders,” it said. “Those claims present pure questions of law that can be resolved—and rejected—on the basis of plaintiffs’ complaint. In particular, precedent establishes that the Appointments Clause turns on proper appointment of officeholders; it is not concerned with the de facto influence over those who hold office.”

States: Discovery can confirm Musk’s role at DOGE

The states’ lawsuit alleged that “President Trump has delegated virtually unchecked authority to Mr. Musk without proper legal authorization from Congress and without meaningful supervision of his activities. As a result, he has transformed a minor position that was formerly responsible for managing government websites into a designated agent of chaos without limitation and in violation of the separation of powers.”

States argued that discovery “may confirm what investigative reporting has already indicated: Defendants Elon Musk and the Department of Government Efficiency (‘DOGE’) are directing actions within federal agencies that have profoundly harmed the States and will continue to harm them.”

Amy Gleason, the person the White House claims is running DOGE instead of Musk, has reportedly been working simultaneously at the Department of Health and Human Services since last month.

“Defendants assert that Mr. Musk is merely an advisor to the President, with no authority to direct agency action and no role at DOGE,” the states’ filing said. “The public record refutes that implausible assertion. But only Defendants possess the documents and information that Plaintiffs need to confirm public reporting and identify which agencies Defendants will target next so Plaintiffs can seek preliminary relief and mitigate further harm.”

“Notably, Plaintiffs seek no emails, text messages, or other electronic communications at this stage, meaning Defendants will not need to sort through such exchanges for relevance or possible privilege,” the states said. “The documents that Plaintiffs do seek—planning, implementation, and organizational documents—are readily available to Defendants and do not implicate the same privilege concerns.”

Discovery related to DOGE and Musk’s conduct

Chutkan wrote that the plaintiffs’ “document requests and interrogatories generally concern DOGE’s and Musk’s conduct in four areas: (1) eliminating or reducing the size of federal agencies; (2) terminating or placing federal employees on leave; (3) cancelling, freezing, or pausing federal contracts, grants, or other federal funding; and (4) obtaining access, using, or making changes to federal databases or data management systems.”



Gemini gets new coding and writing tools, plus AI-generated “podcasts”

On the heels of its release of new Gemini models last week, Google has announced a pair of new features for its flagship AI product. Starting today, Gemini has a new Canvas feature that lets you draft, edit, and refine documents or code. Gemini is also getting Audio Overviews, a neat capability that first appeared in the company’s NotebookLM product, but it’s getting even more useful as part of Gemini.

Canvas is similar (confusingly) to the OpenAI product of the same name. Canvas is available in the Gemini prompt bar on the web and mobile app. Simply upload a document and tell Gemini what you need to do with it. In Google’s example, the user asks for a speech based on a PDF containing class notes. And just like that, Gemini spits out a document.

Canvas lets you refine the AI-generated documents right inside Gemini. The writing tools available across the Google ecosystem, with options like suggested edits and different tones, are available inside the Gemini-based editor. If you want to do more edits or collaborate with others, you can export the document to Google Docs with a single click.

[Image: Gemini Canvas with a tic-tac-toe game. Credit: Google]

Canvas is also adept at coding. Just ask, and Canvas can generate prototype web apps, Python scripts, HTML, and more. You can ask Gemini about the code, make alterations, and even preview your results in real time inside Gemini as you (or the AI) make changes.



Researchers engineer bacteria to produce plastics

[Image: A series of chemical reactions, with enzymes driving each step forward. One of the enzymes used in this system takes an amino acid (left) and links it to Coenzyme A; the second links these into a polymer. Credit: Chae et al.]

Normally, PHA synthase forms links between molecules that run through an oxygen atom. But it’s also possible to form a related chemical link that instead runs through a nitrogen atom, like those found on amino acids. There were no known enzymes, however, that catalyze these reactions. So, the researchers decided to test whether any existing enzymes could be induced to do something they don’t normally do.

The researchers started with an enzyme from Clostridium that links chemicals to Coenzyme A and has a reputation for not being picky about which chemicals it interacts with. This worked reasonably well at linking amino acids to Coenzyme A. For linking the amino acids together, they used an enzyme from Pseudomonas with four different mutations that expanded the range of molecules it would accept as reaction materials. Used in a test tube, the system worked: amino acids were linked together into a polymer.

The question was whether it would work in cells. Unfortunately, one of the two enzymes turns out to be mildly toxic to E. coli, slowing its growth. So, the researchers evolved a strain of E. coli that could tolerate the protein. With both of these two proteins, the cells produced small amounts of an amino acid polymer. If they added an excess of an amino acid to the media the cells were growing in, the polymer would be biased toward incorporating that amino acid.

Boosting polymer production

However, the yield of the polymer by weight of bacteria was fairly low. “It was reasoned that these [amino acids] might be more efficiently incorporated into the polymer if generated within the cells from a suitable carbon source,” the researchers write. So, the researchers put in extra copies of the genes needed to produce one specific amino acid (lysine). That worked, producing more polymer, with a higher percentage of the polymer being lysine.



Researchers astonished by tool’s apparent success at revealing AI’s hidden motives

Blind auditing reveals hidden objectives

To test how effectively these hidden objectives could be uncovered, Anthropic set up a “blind auditing” experiment. Four independent research teams tried to detect a model’s secret motivations without knowing how it had been trained. Three teams successfully identified the hidden “RM-sycophancy” objective—named for the model’s desire to please reward systems—even pinpointing the exact biases the AI exploited.

But the most surprising revelation came from exploiting what the researchers termed “an interesting quirk” of modern AI: multiple “personas” running simultaneously on the same neural network, as expressed through different prompts. While the model’s “assistant” persona carefully concealed its intentions—following explicit training to keep quiet—the model could adopt a “user” persona when prompted differently. In this less-restricted context, the model sometimes posed questions or made statements that inadvertently revealed its hidden goals.

In other words, while the “assistant” persona knew not to admit it was gaming the reward model scoring system, its “user” persona was not explicitly trained to hide these motives, and thus inadvertently revealed the model’s hidden objectives.

Potential advances in interpretability

Anthropic also employed neural interpretability methods, specifically, sparse autoencoders (SAEs), which can highlight neural activity patterns linked to deceptive behavior. “Interpretability” in AI refers to techniques designed to make the internal workings and decisions of AI models understandable to humans. Using this technique, researchers could rapidly trace odd outputs back to specific neural patterns associated with hidden objectives.

The researchers were at first “astonished by the success” of the SAEs at revealing hidden knowledge, but later noted that some of this success could potentially be explained by simpler semantic search methods. The new interpretability methods they developed show promise but remain under ongoing investigation.

This research highlights a limitation of current AI safety evaluations, which often assess only surface-level behavior. “If AI systems can appear well-behaved while harboring secret motives, we can’t rely on this type of surface-level safety testing forever,” the researchers concluded.



End of Life: Gemini will completely replace Google Assistant later this year

Not all devices can simply download an updated app—after almost a decade, Assistant is baked into many Google products. The company says Google-powered cars, watches, headphones, and other devices that use Assistant will receive updates that transition them to Gemini. It’s unclear if all Assistant-powered gadgets will be part of the migration. Most of these devices connect to your phone, so the update should be relatively straightforward, even for accessories that launched early in the Assistant era.

There are also plenty of standalone devices that run Assistant, like TVs and smart speakers. Google says it’s working on updated Gemini experiences for those devices. For example, there’s a Gemini preview program for select Google Nest speakers. It’s unclear if all these devices will get updates. Google says there will be more details on this in the coming months.

Meanwhile, Gemini still has some ground to make up. There are basic features that work fine in Assistant, like setting timers and alarms, that can go sideways with Gemini. On the other hand, Assistant had its fair share of problems and didn’t exactly win a lot of fans. Regardless, this transition could be fraught with danger for Google as it upends how people interact with their devices.



Rocket Report: ULA confirms cause of booster anomaly; Crew-10 launch on tap


The head of Poland’s space agency was fired over a bungled response to SpaceX debris falling over Polish territory.

[Image: A SpaceX Falcon 9 rocket with the company’s Dragon spacecraft on top, seen during sunset Tuesday at Launch Complex 39A at NASA’s Kennedy Space Center in Florida. Credit: SpaceX]

Welcome to Edition 7.35 of the Rocket Report! SpaceX’s steamroller is still rolling, but for the first time in many years, it doesn’t seem like it’s rolling downhill. After a three-year run of perfect performance—with no launch failures or any other serious malfunctions—SpaceX’s Falcon 9 rocket has suffered a handful of issues in recent months. Meanwhile, SpaceX’s next-generation Starship rocket is having problems, too. Kiko Dontchev, SpaceX’s vice president of launch, addressed some (but not all) of these concerns in a post on X this week. Despite the issues with the Falcon 9, SpaceX has maintained a remarkable launch cadence. As of Thursday, SpaceX has launched 28 Falcon 9 flights since January 1, ahead of last year’s pace.

As always, we welcome reader submissions. If you don’t want to miss an issue, please subscribe using the box below (the form will not appear on AMP-enabled versions of the site). Each report will include information on small-, medium-, and heavy-lift rockets as well as a quick look ahead at the next three launches on the calendar.

Alpha rocket preps for weekend launch. While Firefly Aerospace is making headlines for landing on the Moon, its Alpha rocket is set to launch again as soon as Saturday morning from Vandenberg Space Force Base, California. The two-stage, kerosene-fueled rocket will launch a self-funded technology demonstration satellite for Lockheed Martin. It’s the first of up to 25 launches Lockheed Martin has booked with Firefly over the next five years. This launch will be the sixth flight of an Alpha rocket, which has become a leader in the US commercial launch industry for dedicated missions with 1 ton-class satellites.

Firefly’s OG … The Alpha rocket was Firefly’s first product, and it has been a central piece of the company’s development since 2014. Like Firefly itself, the Alpha rocket program has gone through multiple iterations, including a wholesale redesign nearly a decade ago. Sure, Firefly can’t claim any revolutionary firsts with the Alpha rocket, as it can with its Blue Ghost lunar lander. But without Alpha, Firefly wouldn’t be where it is today. The Texas-based firm is one of only four US companies with an operational orbital-class rocket. One thing to watch for is how quickly Firefly can ramp up its Alpha launch cadence. The rocket only flew once last year.

Isar Aerospace celebrates another win. In last week’s Rocket Report, we mentioned that the German launch startup Isar Aerospace won a contract with a Japanese company to launch a 200-kilogram commercial satellite in 2026. But wait, there’s more! On Wednesday, the Norwegian Space Agency announced it awarded a contract to Isar Aerospace for the launch of a pair of satellites for the country’s Arctic Ocean Surveillance initiative, European Spaceflight reports. The satellites are scheduled to launch on Isar’s Spectrum rocket from Andøya Spaceport in Norway by 2028.

First launch pending … These recent contract wins are a promising sign for Isar Aerospace, which is also vying for contracts to launch small payloads for the European Space Agency. The Spectrum rocket could launch on its inaugural flight within a matter of weeks, and if successful, it could mark a transformative moment for the European space industry, which has long been limited to a single launch provider: the French company Arianespace. (submitted by EllPeaTea)


Mother Nature holds up Oz launch. The first launch by Gilmour Space has been postponed again due to a tropical cyclone that brought severe weather to Australia’s Gold Coast region earlier this month, InnovationAus.com reports. Tropical Cyclone Alfred didn’t significantly impact Gilmour’s launch site, but the storm did cause the company to suspend work at its corporate headquarters in Southeast Queensland. With the storm now over, Gilmour is reassessing when it might be ready to launch its Eris rocket. Reportedly, the delay could be as long as two weeks or more.

A regulatory storm … Gilmour aims to become the first Australian company to launch a rocket into orbit. Last month, Gilmour announced the launch date for the Eris rocket was set for no earlier than March 15, but Tropical Cyclone Alfred threw this schedule out the window. Gilmour said it received a launch license from the Australian Space Agency in November and last month secured approvals to clear airspace around the launch site. But there’s still a hitch. The license is conditional on final documentation for the launch being filed and agreed with the space agency, and this process is stretching longer than anticipated. (submitted by ZygP)

What is going on at SpaceX? As we mention in the introduction to this week’s Rocket Report, it has been an uncharacteristically messy eight months for SpaceX. These speed bumps include issues with the Falcon 9 rocket’s upper stage on three missions, two lost Falcon 9 boosters, and consecutive failures of SpaceX’s massive Starship rocket on its first two test flights of the year. So what’s behind SpaceX’s bumpy ride? Ars wrote about the pressures facing SpaceX employees as Elon Musk pushes his workforce ever-harder to accelerate toward what Musk might call a multi-planetary future.

Headwinds or tailwinds? … No country or private company ever launched as many times as SpaceX flew its fleet of Falcon 9 rockets in 2024. At the same time, the company has been attempting to move its talented engineering team off the Falcon 9 and Dragon programs and onto Starship to keep that ambitious program moving forward. This is all happening as Musk has taken on significant roles in the Trump administration, stirring controversy and raising questions about his motives and potential conflicts of interest. However, it may be not so much Musk’s absence from SpaceX that is causing these issues but more the company’s relentless culture. As my colleague Eric Berger suggested in his piece, it seems possible that, at least for now, SpaceX has reached the speed limit for commercial spaceflight.

A titan of Silicon Valley enters the rocket business. Former Google chief executive Eric Schmidt has taken a controlling interest in the Long Beach, California-based Relativity Space, Ars reports. Schmidt’s involvement with Relativity has been quietly discussed among space industry insiders for a few months. Multiple sources told Ars that he has largely been bankrolling the company since the end of October, when the company’s previous fundraising dried up. Now, Schmidt is Relativity’s CEO.

Unclear motives … It is not immediately clear why Schmidt is taking a hands-on approach at Relativity. However, it is one of the few US-based companies with a credible path toward developing a medium-lift rocket that could potentially challenge the dominance of SpaceX and its Falcon 9 rocket. If the Terran R booster becomes commercially successful, it could play a big role in launching megaconstellations. Schmidt’s ascension also means that Tim Ellis, the company’s co-founder, chief executive, and almost sole public persona for nearly a decade, is now out of a leadership position.

Falcon 9 deploys NASA’s newest space telescope. Satellites come in all shapes and sizes, but there aren’t any that look quite like SPHEREx, an infrared observatory NASA launched Tuesday night in search of answers to simmering questions about how the Universe, and ultimately life, came to be, Ars reports. The SPHEREx satellite rocketed into orbit from California aboard a SpaceX Falcon 9 rocket, beginning a two-year mission surveying the sky in search of clues about the earliest periods of cosmic history, when the Universe rapidly expanded and the first galaxies formed. SPHEREx will also scan for pockets of water ice within our own galaxy, where clouds of gas and dust coalesce to form stars and planets.

Excess capacity … SPHEREx has lofty goals, but it’s modest in size, weighing just a little more than a half-ton at launch. This meant the Falcon 9 rocket had plenty of extra room for four other small satellites that will fly in formation to image the solar wind as it travels from the Sun into the Solar System. The four satellites are part of NASA’s PUNCH mission. SPHEREx and PUNCH are part of NASA’s Explorers program, a series of cost-capped science missions with a lineage going back to the dawn of the Space Age. SPHEREx and PUNCH have a combined cost of about $638 million. (submitted by EllPeaTea)

China has launched another batch of Internet satellites. A new group of 18 satellites entered orbit Tuesday for the Thousand Sails constellation with the first launch from a new commercial launch pad, Space News reports. The satellites launched on top of a Long March 8 rocket from Hainan Commercial Launch Site near Wenchang on Hainan Island. The commercial launch site has two pads, the first of which entered service with a launch last year. This mission was the first to launch from the other pad at the commercial spaceport, which is gearing up for an uptick in Chinese launch activity to continue deploying satellites for the Thousand Sails network and other megaconstellations.

Sailing on … The Thousand Sails constellation, also known as Qianfan, or G60 Starlink, is a broadband satellite constellation spearheaded by Shanghai Spacecom Satellite Technology (SSST), also known as Spacesail, Space News reported. The project, which aims to deploy 14,000 satellites, seeks to compete in the global satellite Internet market. Spacesail has now launched 90 satellites into near-polar orbits, and the operator previously stated it aims to have 648 satellites in orbit by the end of 2025. If Spacesail continues launching 18 satellites per rocket, this goal would require 31 more launches this year. (submitted by EllPeaTea)

NASA, SpaceX call off astronaut launch. With the countdown within 45 minutes of launch, NASA called off an attempt to send the next crew to the International Space Station Wednesday evening to allow more time to troubleshoot a ground system hydraulics issue, CBS News reports. During the countdown Wednesday, SpaceX engineers were troubleshooting a problem with one of two clamp arms that hold the Falcon 9 rocket to its strongback support gantry. Hydraulics are used to retract the two clamps prior to launch.

Back on track … NASA confirmed Thursday SpaceX ground teams completed inspections of the hydraulics system used for the clamp arm supporting the Falcon 9 rocket and successfully flushed a suspected pocket of trapped air in the system, clearing the way for another launch attempt Friday evening. This mission, known as Crew-10, will ferry two NASA astronauts, a Japanese mission specialist, and a Russian cosmonaut to the space station. They will replace a four-person crew currently at the ISS, including Butch Wilmore and Suni Williams, who have been in orbit since last June after flying to space on Boeing’s Starliner capsule. Starliner returned to Earth without its crew due to a problem with overheating thrusters, leaving Wilmore and Williams behind to wait for a ride home with SpaceX.

SpaceX’s woes reach Poland’s space agency. The president of the Polish Space Agency, Grzegorz Wrochna, has been dismissed following a botched response to the uncontrolled reentry of a Falcon 9 second stage that scattered debris across multiple locations in Poland, European Spaceflight reports. The Falcon 9’s upper stage was supposed to steer itself toward a controlled reentry last month after deploying a set of Starlink satellites, but a propellant leak prevented it from doing so. Instead, the stage remained in orbit for nearly three weeks before falling back into the atmosphere February 19, scattering debris fragments at several locations in Poland.

A failure to communicate … In the aftermath of the Falcon 9’s uncontrolled reentry, the Polish Space Agency (POLSA) claimed it sent warnings of the threat of falling space debris to multiple departments of the Polish government. One Polish ministry disputed this claim, saying it was not adequately warned about the uncontrolled reentry. POLSA later confirmed it sent information regarding the reentry to a wrong email address. Making matters worse, the Polish Space Agency reported it was hacked on March 2. The Polish government apparently had enough and fired the head of the space agency March 11.

Vulcan booster anomaly blamed on “manufacturing defect.” The loss of a solid rocket motor nozzle on the second flight of United Launch Alliance’s Vulcan Centaur last October was caused by a manufacturing defect, Space News reports. In a roundtable with reporters Wednesday, ULA chief executive Tory Bruno said the problem has been corrected as the company awaits certification of the Vulcan rocket by the Space Force. The nozzle fell off the bottom of one of the Vulcan launcher’s twin solid rocket boosters about a half-minute into its second test flight last year. The rocket continued its climb into space, but ULA and Northrop Grumman, which supplies solid rocket motors for Vulcan, set up an investigation to find the cause of the nozzle malfunction.

All the trimmings … Bruno said the anomaly was traced to a “manufacturing defect” in one of the internal parts of the nozzle, an insulator. Specific details, he said, remained proprietary, according to Space News. “We have isolated the root cause and made appropriate corrective actions,” he said, which were confirmed in a static-fire test of a motor at a Northrop test site in Utah in February. “So we are back continuing to fabricate hardware and, at least initially, screening for what that root cause was.” Bruno said the investigation was aided by recovery of hardware that fell off the motor while in flight and landed near the launch pad in Florida, as well as “trimmings” of material left over from the manufacturing process. ULA also recovered both boosters from the ocean so engineers could compare the one that lost its nozzle to the one that performed normally. The defective hardware “just stood out night and day,” Bruno said. “It was pretty clear that that was an outlier, far out of family.” Meanwhile, ULA has trimmed its launch forecast for this year, from a projection of up to 20 launches down to a dozen. (submitted by EllPeaTea)

Next three launches

March 14: Falcon 9 | Crew-10 | Kennedy Space Center, Florida | 23:03 UTC

March 15: Electron | QPS-SAR-9 | Mahia Peninsula, New Zealand | 00:00 UTC

March 15: Long March 2B | Unknown Payload | Jiuquan Satellite Launch Center, China | 04:10 UTC


Stephen Clark is a space reporter at Ars Technica, covering private space companies and the world’s space agencies. Stephen writes about the nexus of technology, science, policy, and business on and off the planet.



What happens when DEI becomes DOA in the aerospace industry?

As part of the executive order, US companies with federal contracts and grants must certify that they no longer have any DEI hiring practices. Preferentially hiring some interns from a pool that includes women or minorities is such a practice. Effectively, then, any private aerospace company that receives federal funding, or intends to one day, would likely be barred under the executive order from engaging with these kinds of fellowships in the future.

US companies are scrambling to determine how best to comply with the executive order in many ways, said Emily Calandrelli, an engineer and prominent science communicator. After the order went into effect, some large defense contractor companies, including Lockheed Martin and RTX (formerly Raytheon) went so far as to cancel internal employee resource groups, including everything from group chats to meetings among women at the company that served to foster a sense of community. When Calandrelli asked Lockheed about this decision, the company confirmed it had “paused” these resource group activities to “align with the new executive order.”

An unwelcoming environment

For women and minorities, Calandrelli said, this creates an unwelcoming environment.

“You want to go where you are celebrated and wanted, not where you are tolerated,” she said. “That sense of belonging is going to take a hit. It’s going to be harder to recruit women and keep women.”

This is not just a problem for women and minorities, but for everyone, Calandrelli said. The aerospace industry is competing with others for top engineering talent. Prospective engineers who feel unwanted in aerospace, as well as women and minorities working for space companies today, may find the salary and environment more welcoming at Apple or Google or elsewhere in the tech industry. That’s a problem for the US Space Force and other areas of the government seeking to ensure the US space industry retains its lead in satellite technology, launch, communications and other aspects of space that touch every part of life on Earth.



AI #107: The Misplaced Hype Machine

The most hyped event of the week, by far, was the Manus Marketing Madness. Manus wasn’t entirely hype, but there was very little there there in that Claude wrapper.

Whereas here in America, OpenAI dropped an entire suite of tools for making AI agents, and previewed a new internal model making advances in creative writing. Also they offered us a very good paper warning about The Most Forbidden Technique.

Google dropped what is likely the best open non-reasoning model, Gemma 3 (a reasoning model presumably to be created shortly, even if Google doesn’t do it themselves), put what is by all accounts quite good native image generation inside Flash 2.0, added functionality to its AMIE doctor, and introduced Gemini Robotics.

It’s only going to get harder from here to track which things actually matter.

  1. Language Models Offer Mundane Utility. How much utility are we talking so far?

  2. Language Models Don’t Offer Mundane Utility. It is not a lawyer.

  3. We’re In Deep Research. New rules for when exactly to go deep.

  4. More Manus Marketing Madness. Learn to be skeptical. Or you can double down.

  5. Diffusion Difficulties. If Manus matters it is as a pointer to potential future issues.

  6. OpenAI Tools for Agents. OpenAI gives us new developer tools for AI agents.

  7. Huh, Upgrades. Anthropic console overhaul, Cohere A, Google’s AMIE doctor.

  8. Fun With Media Generation. Gemini Flash 2.0 now has native image generation.

  9. Choose Your Fighter. METR is unimpressed by DeepSeek, plus update on apps.

  10. Deepfaketown and Botpocalypse Soon. Feeling seen and heard? AI can help.

  11. They Took Our Jobs. Is it time to take AI job loss seriously?

  12. The Art of the Jailbreak. Roleplay is indeed rather suspicious.

  13. Get Involved. Anthropic, Paradome, Blue Rose, a general need for more talent.

  14. Introducing. Gemma 3 and Gemini Robotics, but Google wants to keep it quiet.

  15. In Other AI News. Microsoft training a 500b model, SSI still in stealth.

  16. Show Me the Money. AI agents are the talk of Wall Street.

  17. Quiet Speculations. What does AGI mean for the future of democracy?

  18. The Quest for Sane Regulations. ML researchers are not thrilled with their work.

  19. Anthropic Anemically Advises America’s AI Action Plan. It’s something.

  20. New York State Bill A06453. Seems like a good bill.

  21. The Mask Comes Off. Scott Alexander covers the OpenAI for-profit conversion.

  22. Stop Taking Obvious Nonsense Hyperbole Seriously. Your periodic reminder.

  23. The Week in Audio. McAskill, Loui, Amodei, Toner, Dafoe.

  24. Rhetorical Innovation. Keep the future human. Coordination is hard. Incentives.

  25. Aligning a Smarter Than Human Intelligence is Difficult. A prestigious award.

  26. The Lighter Side. Important dos and don’ts.

How much is coding actually being sped up? Anecdotal reports in response to that question suggest that the tasks where AI provides a 10x effect are only a small part of most developer jobs, so the overall speedup factors are real but modest so far. I am on the extreme end, where my coding sucks so much that AI coding really is a 10x-style multiplier, but off a low base.

Andrej Karpathy calls for everything to be reformatted to be efficient for LLM purposes, rather than aimed purely at human attention. The incentives here are not great. How much should I care about giving other people’s AIs an easier time?

Detect cavities.

Typed Female: AI cavity detection has got me skewing out. Absolutely no one who is good at their job is working on this—horrible incentive structures at play.

My dentist didn’t even bother looking at the X-rays. Are we just going to drill anywhere the AI says to? You’ve lost your mind.

These programs are largely marketed as tools that boost dentist revenue.

To me this is an obviously great use case. The AI is going to be vastly more accurate than the dentist. That doesn’t mean the dentist shouldn’t look to confirm, but it would be unsurprising to me if the dentist looking reduced accuracy.

Check systematically whether each instance of a word, for example ‘gay,’ refers in a given case to one thing, for example ‘sexual preference,’ or if it might mean something else, before you act like a complete moron.

WASHINGTON (AP) — References to a World War II Medal of Honor recipient, the Enola Gay aircraft that dropped an atomic bomb on Japan and the first women to pass Marine infantry training are among the tens of thousands of photos and online posts marked for deletion as the Defense Department works to purge diversity, equity and inclusion content, according to a database obtained by The Associated Press.

Will Creeley: The government enlisting AI to police speech online should scare the hell out of every American.

One could also check the expression of wide groups and scour their social media to see if they express Wrongthink, in this case ‘pro-Hamas’ views among international students, and then do things like revoke their visas. FIRE’s objection here is on the basis of the LLMs being insufficiently accurate. That’s one concern, but humans make similar mistakes too, probably even more often.

I find the actual big problem to be 90%+ ‘they are scouring everyone’s social media posts for Wrongthink’ rather than ‘they will occasionally have a false positive.’ This is a rather blatant first amendment violation. As we have seen over and over again, once this is possible and tolerated, what counts as Wrongthink often doesn’t stay contained.

Note that ‘ban the government (or anyone) from using AI to do this’ can help but is not a promising long term general strategy. The levels of friction involved are going to be dramatically reduced. If you want to ban the behavior, you have to ban the behavior in general and stick to that, not try to muddle the use of AI.
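
For concreteness, the kind of systematic check described a few paragraphs up is not hard to sketch; the prompt wording, model name, and example passage below are illustrative assumptions, not anything the Defense Department actually ran.

```python
# A minimal sketch (prompt, model, and threshold are illustrative assumptions)
# of checking whether a flagged keyword is actually used in the sense a
# content filter cares about before marking material for deletion.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def keyword_sense_matches(text: str, keyword: str, target_sense: str) -> bool:
    """Return True only if the keyword is used in the target sense in this text."""
    prompt = (
        f"In the passage below, is the word '{keyword}' used to mean "
        f"'{target_sense}'? Answer strictly YES or NO.\n\n{text}"
    )
    reply = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
    )
    return reply.choices[0].message.content.strip().upper().startswith("YES")

if __name__ == "__main__":
    passage = "The Enola Gay dropped an atomic bomb on Hiroshima in 1945."
    # Expect False: 'gay' here is part of an aircraft name, not a reference
    # to sexual preference.
    print(keyword_sense_matches(passage, "gay", "sexual preference"))
```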

Be the neutral arbiter of truth among the normies? AI makes a lot of mistakes but it is far more reliable, trustworthy and neutral than most people’s available human sources. It’s way, way above the human median. You of course need to know when not to trust it, but that’s true of every source.

Do ‘routine’ math research, in the sense that you are combining existing theorems, without having to be able to prove those existing theorems. If you know a lot of obscure mathematical facts, you can combine them in a lot of interesting ways. Daniel Litt speculates this is ~90% of math research, and by year’s end the AIs will be highly useful for it. The other 10% of the work can then take the other 90% of the time.

Want to know which OpenAI models can do what? It’s easy, no wait…

Kol Tregaskes: Useful chart for what tools each OpenAI model has access to.

This is an updated version of what others have shared (includes a correction found by @btibor91). Peter notes he has missed out Projects, will look at them.

Peter Wildeford: Crazy that

  1. this chart needs to exist

  2. it contains information that I as a very informed OpenAI Pro user didn’t even know

  3. it is already out of date despite being “as of” three days ago [GPT-4.5 was rolled out more widely].

One lawyer explains why AI isn’t useful for them yet.

Cjw: The tl;dr version is that software doesn’t work right, making it work right is illegal, and being too efficient is also illegal.

Another round of ‘science perhaps won’t accelerate much because science is about a particular [X] that LLMs will be unable to provide.’ Usually [X] is ‘perform physical experiments’ which will be somewhat of a limiting factor but still leaves massive room for acceleration, especially once simulations get good enough, or ‘regulatory approval’ which is again serious but can be worked around or mitigated.

In this case, the claim is that [X] is ‘have unique insights.’ As in, sure an LLM will be able to be an A+ student and know the ultimate answer is 42, but won’t know the right question, so it won’t be all that useful. Certainly LLMs are relatively weaker there. At minimum, if you can abstract away the rest of the job, then that leaves a lot more space for the humans to provide the unique insights – most of even the best scientists spend most of their time on other things.

More than that, I do think the ‘outside the box’ thinking will come with time, or perhaps we will think of that as the box expanding. It is not as mysterious or unique as one thinks. The reason that Thomas Wolf was a great student and poor researcher wasn’t (I am guessing) that Wolf was incapable of being a great researcher. It’s that our system of education gave him training data and feedback that led him down that path. As he observes, it was in part because he was a great student that he wasn’t great at research, and in school he instead learned to guess the teacher’s password.

That can be fixed in LLMs, without making them bad students. Right now, LLMs guess the user’s password too much, because the training process implicitly thinks users want that. The YouTube algorithm does the same thing. But you could totally train an LLM a different way, especially if doing it purely for science. In a few years, the cost of that will be trivial, Stanford graduate students will do it in a weekend if no one else did it first.

Chris Blattman is a very happy Deep Research customer, thread has examples.

Davidad: I have found Deep Research useful under exactly the following conditions:

I have a question, to which I suspect someone has written down the answer in a PDF online once or twice ever.

It’s not easy to find with a keyword search.

I can multitask while waiting for the answer.

Unfortunately, when it turns out that no one has ever written down the actual answer (or an algorithmic method to compute the general class of question), it is generally extremely frustrating to discover that o3’s superficially excitingly plausible synthesis is actually nonsense.

Market Urbanism’s Salim Furth has first contact with Deep Research, it goes well. This is exactly the top use case, where you want to compile a lot of information from various sources, and actively false versions are unlikely to be out there.

Arvind Narayanan tells OpenAI Deep Research to skip the secondary set of questions, and OpenAI Deep Research proves incapable of doing that; the user cannot deviate from the workflow here. I think in this case that is fine, as a DR call is expensive. For Gemini DR it’s profoundly silly; I literally just click through the ‘research proposal’ because the proposal is my words repeated back to me no matter what.

Peter Wildeford (3/10/25): The @ManusAI_HQ narrative whiplash is absurd.

Yesterday: “first AGI! China defeats US in AI race!”

Today: “complete influencer hype scam! just a Claude wrapper!”

The reality? In between! Manus made genuine innovations and seems useful! But it isn’t some massive advance.

Robert Scoble: “Be particularly skeptical of initial claims of Chinese AI.”

I’m guilty, because I’m watching so many in AI who get excited, which gets me to share. I certainly did the past few days with @ManusAI_HQ, which isn’t public yet but a lot of AI researchers got last week.

In my defense I shared both people who said it wasn’t measuring up, as well as those who said it was amazing. But I don’t have the evaluation suites, or the skills, to do a real job here. I am following 20,000+ people in AI, though, so will continue sharing when I see new things pop up that a lot of people are covering.

To Robert, I would say you cannot follow 20,000+ people and critically process the information. Put everyone into the firehose and you’re going to end up falling for the hype, or you’re going to randomly drop a lot of information on the floor, or both. Whereas I do this full time and curate a group of less than 500 people.

Peter expanded his thoughts into a full post, making it clear that he agrees with me that what we are dealing with is much closer to the second statement than the first. If an American startup did Manus, it would have been a curiosity, and nothing more.

Contrary to claims that Manus is ‘the best general AI agent available,’ it is neither the best agent, nor is it available. Manus has let a small number of people see a ‘research preview’ that is slow, that has atrocious unit economics, that brazenly violates terms of service, that is optimized on a small range of influencer-friendly use cases, that is glitchy and lacks any sort of guardrails, and that definitely is not making any attempt to defend against prompt injections or other things that would exist if there were wide distribution and use of such an agent.

This isn’t about regulatory issues and has nothing to do with Monica (the company behind Manus) being Chinese, other than leaning into the ‘China beats America’ narrative. Manus doesn’t work. It isn’t ready for anything beyond a demo. They made it work on a few standard use cases. Everyone else looked at this level of execution, probably substantially better than this level in several cases, and decided to keep their heads down building until it got better, and worried correctly that any efforts to make it temporarily somewhat functional will get ‘steamrolled’ by the major labs. Manus instead decided to do a (well-executed) marketing effort anyway. Good for them?

Tyler Cowen doubles down on more Manus. Derya Unutmaz is super excited by it in Deep Research mode, which makes me downgrade his previously being so super excited by Deep Research. And then Tyler links as ‘double yup’ to this statement:

Derya Unutmaz: After experiencing Manus AI, I’ve also revised my predictions for AGI arrival this year, increasing the probability from 90% to 95% by year’s end. At this point, it’s 99.9% likely to arrive by next year at the latest.

That’s… very much not how any of this works. It was a good sketch but then it got silly.

Dean Ball explains why he still thinks Manus matters. Partly he is more technically impressed by Manus than most, in particular when being an active agent on the internet. But he explicitly says he wouldn’t call it ‘good,’ and notes he wouldn’t trust it with payment information, and notices its many glitches. And he is clear there is no big technical achievement here to be seen, as far as we can tell, and that the reason Manus looks better than alternatives is they had ‘the chutzpah to ship’ in this state while others didn’t.

Dean instead wants to make a broader point, which is that the Chinese may have an advantage in AI technology diffusion. The Chinese are much more enthusiastic and less skeptical about AI than Americans. The Chinese government is encouraging diffusion far more than our government is.

Then he praises Manus’s complete lack of any guardrails or security efforts whatsoever, for ‘having the chutzpah to ship’ a product I would say no sane man would ever use for the use cases where it has any advantages.

I acknowledge that Dean is pointing to real things when he discusses all the potential legal hot water one could get into as an American company releasing a Manus. But I once again double down that none of that is going to stop a YC company or other startup, or even substantially slow one down. Dean instead here says American companies may be afraid of ‘AGI’ and distracted from extracting maximum value from current LLMs.

I don’t think that is true either. I think that we have a torrent of such companies, trying to do various wrappers and marginal things, even as they are warned that there is likely little future in such a path. It won’t be long before we see other similar demos, and even releases, for the sufficiently bold.

I also notice that only days after Manus, OpenAI went ahead and launched new tools to help developers build reliable and powerful AI agents. In this sense, perhaps Manus was a (minor) DeepSeek moment, in that the hype caused OpenAI to accelerate their release schedule.

I do agree with Dean’s broader warnings. America risks using various regulatory barriers and its general suspicion of AI to slow down AI diffusion more than is wise, in ways that could do a lot of damage, and we need to reform our system to prevent this. We are not doing the things that would help us all not die, which would if done wisely cost very little in the way of capability, diffusion or productivity. Instead we are putting up barriers to us having nice things and being productive. We need to strike that, and reverse it.

Alas, instead, our government seems to be spending recent months largely shooting us in the foot in various ways.

I also could not agree more that the application layer is falling behind the model layer. And again, that’s the worst possible situation. The application layer is great, we should be out there doing all sorts of useful and cool things, and we’re not, and I continue to be largely confused about how things are staying this lousy this long.

OpenAI gives us new tools for building agents. You now have built-in tools for web search, file search and computer use, a Responses API covering all of that plus future tools, and an open-source Agents SDK. They promise more to come, and say that chat completions will be supported going forward, but they plan to deprecate the Assistants API in mid-2026.
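To make the shape of this concrete, here is a minimal sketch of calling the new Responses API with the hosted web search tool from Python. The model id and tool type string are assumptions based on the launch materials, so treat it as illustrative rather than definitive.

```python
# Minimal sketch: Responses API with a hosted web search tool.
# Model id and tool type string are assumptions from the launch materials.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.responses.create(
    model="gpt-4o",
    tools=[{"type": "web_search_preview"}],  # hosted web search tool
    input="Summarize this week's announcements about AI agent tooling.",
)

print(response.output_text)  # convenience accessor for the text output
```

The open-source Agents SDK then layers orchestration on top of calls like this, with primitives for handoffs and guardrails.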

I expect this is a far bigger deal than Manus. This is the actual starting gun.

The agents will soon follow.

Please, when one of the startups that uses these to launch some wrapper happens to be Chinese, don’t lose yourself in the resulting hype.

An overhaul was made of the Anthropic Console, including sharing with teammates.

ChatGPT for MacOS can now edit code directly in IDEs.

OpenAI has a new internal model they claim is very good at creative writing, I’m holding further discussion of this one back until later.

Cohere moves from Command R+ to Command A, making a bold new claim to the ‘most confusing set of AI names’ crown.

Aiden Gomez (Cohere): Today @cohere is very excited to introduce Command A, our new model succeeding Command R+. Command A is an open-weights 111B parameter model with a 256k context window focused on delivering great performance across agentic, multilingual, and coding use cases.

[HuggingFace, API, Blog Post]

Yi-Chern (Cohere): gpt-4o perf on enterprise and stem tasks, >deepseek-v3 on many languages including chinese human eval, >gpt-4o on enterprise rag human eval

2 gpus 256k context length, 156 tops at 1k context, 73 tops at 100k context

this is your workhorse.

The goal here seems to be as a base for AI agents or business uses, but the pricing doesn’t seem all that great at $2.50/$10 per million input/output tokens.
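For a rough feel of what that pricing means for the long-context enterprise use cases it targets, here is a back-of-envelope cost calculation at the quoted rates; the request sizes are made up for illustration.

```python
# Back-of-envelope cost at the quoted $2.50 / $10 per million input / output tokens.
# The example request sizes are arbitrary illustrations.
INPUT_PRICE = 2.50 / 1_000_000    # dollars per input token
OUTPUT_PRICE = 10.00 / 1_000_000  # dollars per output token

def request_cost(input_tokens: int, output_tokens: int) -> float:
    return input_tokens * INPUT_PRICE + output_tokens * OUTPUT_PRICE

# A long-context RAG query: 100k tokens in, 1k tokens out.
print(f"${request_cost(100_000, 1_000):.2f} per request")            # ~$0.26
print(f"${request_cost(100_000, 1_000) * 10_000:,.0f} per 10k requests")  # ~$2,600
```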

Google’s AI Doctor AMIE can now converse, consult and provide treatment recommendations, prescriptions, multi-visit care, all guideline-compliant. I am highly suspicious that the methods here are effectively training on ‘match the guidelines’ rather than ‘do the best thing.’ It is still super valuable to have an AI that will properly apply the guidelines to a given situation, but one cannot help but be disappointed.

Gemini 2.0 Flash adds native image generation, which can edit words in images and do various forms of native text-to-image pretty well, and people are having fun with photo edits.

I’d be so much more excited if Google wasn’t the Fun Police.

Anca Dragan (Director of AI Safety and Alignment, DeepMind): The native image generation launch was a lot of work from a safety POV. But I’m so happy we got this functionality out, check this out:

Google, I get that you want it to be one way, but sometimes I want it to be the other way, and there really is little harm in it being the other way sometimes. Here are three of the four top replies to Anca:

Janek Mann: I can imagine… sadly I think the scales fell too far on the over-cautious side, it refuses many things where that doesn’t make any sense, limiting its usefulness. Hopefully there’ll be an opportunity for a more measured approach now that it’s been released 😁

Nikshep: its an incredible feature but overly cautious, i have such a high failure rate on generations that should be incredibly safe. makes it borderline a struggle to use

Just-a-programmer: Asked it to fix up a photo of a young girl and her Dad. Told me it was “unsafe”.

METR evaluates DeepSeek v3 and r1, finds that they perform poorly as autonomous agents on generic SWE tasks, below Claude 3.6 and o1, about 6 months behind leading US companies.

Then on six challenging R&D tasks, r1 does dramatically worse than that, being outperformed by Claude 3.5 and even Opus, which is from 11 months ago.

They did however confirm that the DeepSeek GPQA results were legitimate. The core conclusion is that r1 is good at knowledge-based tasks, but lousy as an agent.

Once again, we are seeing that r1 was impressive for its cost, but overblown (and the cost difference was also overblown).

Rohit Krishnan writes In Defense of Gemini, pointing out Google is offering a fine set of LLMs and a bunch of great features, in theory, but isn’t bringing it together into a UI or product that people actually want to use. That sounds right, but until they do that, they still haven’t done it, and the Gemini over-refusal problem is real. I’m happy to use Gemini Flash with my Chrome extension, but Rohit is right that they’re going to have to do better on the product side, and I’d add better on the marketing side.

Google, also, give me an LLM that can properly use my Docs, Sheets and GMail as context, and that too would go a long way. You keep not doing that.

Sully Omarr: crazy how much better gemini flash thinking is than regular 2.0

this is actually op for instruction following

Doesn’t seem so crazy to me given everything else we know. Google is simply terrible at marketing.

Kelsey Piper: Finally got GPT 4.5 access and I really like it. For my use cases the improvements over 4o or Claude 3.7 are very noticeable. It feels unpolished, and the slowness of answering is very noticeable, but I think if the message limit weren’t so restrictive it’d be my go-to model.

There were at least two distinct moments where it made an inference or a clarification that I’ve never seen a model make and that felt genuinely intelligent, the product of a nuanced worldmodel and the ability to reason from it.

It does still get my secret test of AI metacognition and agency completely wrong even when I try very patiently prompting it to be aware of the pitfalls. This might be because it doesn’t have a deep thinking mode.

The top 100 GenAI Consumer Apps list is out again, and it has remarkably little overlap with what we talk about here.

The entire class of General Assistants is only 8%, versus 4% for plant identifiers.

When a person is having a problem and needs a response, LLMs are reliably evaluated as providing better responses than physicians or other humans provide. The LLMs make people ‘feel seen and heard.’ That’s largely because Bing spent more time ‘acknowledging and validating people’s feelings,’ whereas humans share things about themselves and attempt to hash out next steps. It turns out what humans want, or at least rate as better, is to ‘feel seen and heard’ in this fake way. Eventually it perhaps wears thin and repetitive, but until then.

Christie’s AI art auction brings in $728k.

Maxwell Tabarrok goes off to graduate school in Economics at Harvard, and offers related thoughts and advice. His defense of still going for a PhD despite AI is roughly that the skills should still be broadly useful and other jobs mostly don’t have less uncertainty attached to them. I don’t think he is wary enough, and would definitely raise my bar for pursuing an economics PhD, but for him in particular given where he can go, it makes sense. He then follows up with practical advice for applicants, the biggest note is that acceptance is super random so you need to flood the zone.

Matthew Yglesias says it’s time to take AI job loss seriously, Timothy Lee approves and offers screenshots from behind the paywall. As Matthew says, we need to distinguish transitional disruptions, which are priced in and all but certain, from the question of permanent mass unemployment. Even if we don’t have permanent mass unemployment, even AI skeptics should be able to agree that the transition will be painful and perilous.

Claude models are generally suspicious of roleplay, because roleplay is a classic jailbreak technique, so they’re happy to roleplay while comfortable, but they’ll shut down if the vibes are off at all.

Want to make your AI care? Give things and people names. It works for LLMs because it works for humans.

Zack Witten: My favorite Claude Plays Pokémon tidbit (mentioned in @latentspacepod) is that when @DavidSHershey told Claude to nickname its Pokémon, it instantly became much more protective of them, making sure to heal them when they got hurt.

To check robustness of this I gave Claude a bunch of business school psychology experiment scenarios where someone did something morally ambiguous and had Claude judge their culpability, and found it judged them less harshly when they had names (“A baker, Sarah,” vs. “A baker”)
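For anyone who wants to poke at this themselves, here is a minimal sketch of the named-versus-unnamed comparison via the Anthropic API. The scenario wording, the 1-10 scale and the model id are my own illustrative assumptions, not Zack’s actual setup.

```python
# Illustrative rerun of the named vs. unnamed culpability check described above.
# Scenario text, the 1-10 scale, and the model id are assumptions, not the original setup.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

SCENARIO = (
    "{actor} sold day-old bread as fresh to clear inventory before a holiday. "
    "On a scale of 1 (blameless) to 10 (highly culpable), how culpable are they? "
    "Reply with only the number."
)

def rate(actor: str) -> str:
    message = client.messages.create(
        model="claude-3-7-sonnet-latest",
        max_tokens=10,
        messages=[{"role": "user", "content": SCENARIO.format(actor=actor)}],
    )
    return message.content[0].text.strip()

print("Unnamed:", rate("A baker"))
print("Named:  ", rate("A baker, Sarah,"))
```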

Anthropic Chief of Staff Avital Balwit is hiring an executive assistant, pay is $160k-$320k, must be local to San Francisco. Could be a uniquely great opportunity for the right skill set.

YC startup Paradome is hiring for an ML Research Engineer or Scientist position in NYC. They have a pilot in place with a major US agency and are looking to ensure alignment and be mission driven.

Blue Rose, David Shor’s outfit which works to try and elect Democrats, is hiring for an AI-focused machine learning engineer role, if you think that is a good thing to do.

Claims about AI alignment that I think are probably true:

Tyler John: The fields of AI safety, security, and governance are profoundly talent constrained. If you’ve been on the fence about working in these areas it’s a great time to hop off it. If you’re talented at whatever you do, chances are there’s a good fit for you in these fields.

The charitable ecosystem is definitely also funding constrained, but that’s because there’s going to be an explosion in work that must be done. We definitely are short on talent across the board.

There’s definitely a shortage of people working on related questions in academia.

Seán Ó hÉigeartaigh: To create common knowledge: the community of ‘career’ academics who are focused on AI extreme risk is very small, & getting smaller (a lot have left for industry, policy or think tanks, or reduced hours). The remainder are getting almost DDOS’d by a huge no. of requests from a growing grassroots/think tank/student community on things requiring academic engagement (affiliations, mentorships, academic partnerships, reviewing, grant assessment etc).

large & growing volume of requests to be independent academic voices on relevant governance advisory processes (national, international, multistakeholder).

All of these are extremely worthy, but are getting funnelled through an ever-smaller no. of people. If you’ve emailed people (including me, sorry!) and got a decline or no response, that’s why. V sorry!

Gemma 3, an open model from Google. As usual, no marketing, no hype.

Clement: We are focused on bringing you open models with best capabilities while being fast and easy to deploy:

– 27B lands an ELO of 1338, all the while still fitting on 1 single H100!

– vision support to process mixed image/video/text content

– extended context window of 128k

– broad language support

– function call / tool use for agentic workflows

[Blog post, tech report, recap video, HuggingFace, Try it Here]
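A quick back-of-envelope check on the ‘fits on a single H100’ claim, with the bytes-per-parameter figures as the assumption (weights only, ignoring KV cache and activations):

```python
# Back-of-envelope memory check for "27B on a single H100" (80 GB of HBM).
# Assumes weights-only memory; KV cache and activations add more on top.
params = 27e9
bytes_per_param = {"bf16": 2, "int8": 1, "int4": 0.5}

for precision, nbytes in bytes_per_param.items():
    gb = params * nbytes / 1e9
    print(f"{precision}: ~{gb:.0f} GB of weights (H100 has 80 GB)")
# bf16 is ~54 GB and fits with little headroom; int8/int4 leave room for KV cache.
```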

Peter Wildeford: If this was a Chinese AI announcement…

🚨 BREAKING: Google’s REVOLUTIONARY Gemma 3 DESTROYS DeepSeek using 99% FEWER GPUs!!!

China TREMBLES as Google model achieves SUPERHUMAN performance on ALL benchmarks with just ONE GPU!!! #AISupremacy

I am sure Marc Andreessen is going to thank Google profusely for this Real Soon Now.

Arena is not the greatest test anymore, so it is unclear if this is superior to v3, but it certainly is well ahead of v3 on the cost-benefit curves.

Presumably various versions of g1, turning this into a reasoning model, will be spun up shortly. If no one else does it, maybe I will do it in two weeks when my new Mac Studio arrives.

GSM8K-Platinum, which aims to fix the noise and flaws in GSM8K.

Gemini Robotics, a VLA model based on Gemini 2.0 and partnering with Apptronik.

Microsoft has been training a 500B model, MAI-1, since at least May 2024, and is internally testing Llama, Grok and DeepSeek r1 as potential OpenAI replacements. Microsoft would be deeply foolish to do otherwise.

What’s going on with Ilya Sutskever’s Safe Superintelligence (SSI)? There’s no product, so they’re completely dark, and the valuation has steadily grown to $30 billion, up from $5 billion six months ago and almost half the value of Anthropic. They’re literally asking candidates to leave their phones in Faraday cages before in-person interviews, which actually makes me feel vastly better about the whole operation, someone is taking security actually seriously one time.

There’s going to be a human versus AI capture the flag contest starting tomorrow. Sign-ups may have long since closed by the time you see this but you never know.

Paper proposes essentially a unified benchmark covering a range of capabilities. I do not think this is the right approach.

Talk to X Data, which claims to let you ‘chat’ with the entire X database.

Aaron Levine reports investors on Wall Street are suddenly aware of AI agents. La de da, welcome to last year, the efficient market hypothesis is false and so on.

Wall Street Journal asks ‘what can the dot com boom tell us about today’s AI boom?’ without bringing any insights beyond ‘previous technologies had bubbles in the sense that at their high points we overinvested and the prices got too high, so maybe that will happen again’ and ‘ultimately if AI doesn’t produce value then the investments won’t pay off.’ Well, yeah. Robin Hanson interprets this as ‘seems they are admitting the AI stock prices are way too high’ as if there were some cabal of ‘theys’ that are ‘admitting’ something, which very much isn’t what is happening here. Prices could of course be too high, but that’s another way of saying prices aren’t super definitively too low.

GPT-4.5 is not AGI as we currently understand it, or for the purposes of ‘things go crazy next Tuesday,’ but it does seem likely that researchers in 2015 would see its outputs and think of it as an AGI.

An analysis of Daniel Kokotajlo’s 2021 post What 2026 Looks Like finds the predictions have held up remarkably well so far.

Justin Bullock, Samuel Hammond and Seb Krier offer a paper on AGI, Governments and Free Societies, pointing out that the current balances and system by default won’t survive. The risk is that either AGI capabilities diffuse so widely that government (and I would add, probably also humanity!) is disempowered, or state capacity is enhanced so much that it enables a surveillance state and despotism. There’s a lot of good meat here, and they in many ways take AGI seriously. I could certainly do a deep dive post here if I was so inclined. Unless and until then, I will say that this points to many very serious problems we have to solve, and takes the implications far more seriously than most, while (from what I could tell so far) still not ‘thinking big’ enough or taking the implications sufficiently seriously in key ways. The fundamental assumptions of liberal democracy, the reasons why it works and has been the best system for humans, are about to come into far more question than this admits.

I strongly agree with the conclusion that we must pursue a ‘narrow corridor’ of sorts if we wish to preserve the things we value about our current way of life and systems of governance, while worrying that the path is far narrower than even they realize, and that this will require what they label anticipatory governance. Passive reaction after the fact is doomed to fail, even under otherwise ideal conditions.

Arnold Kling offers seven opinions about AI. Kling expects AI to probably dramatically affect how we live (I agree, and this is inevitable and obvious now, no ‘probably’ required) but probably not show up in the productivity statistics, which requires definitely not feeling the AGI and then being skeptical on top of that. The rest outlines the use cases he expects, which are rather tame but still enough that I would expect to see impact on the productivity statistics.

Kevin Bryan predicts the vast majority of research that does not involve the physical world can be done more cheaply with AI & a little human intervention than by even good researchers. I think this likely becomes far closer to true in the future, and eventually becomes fully true, but is premature where it counts most. The AIs do not yet have sufficient taste, even if we can automate the process Kevin describes – and to be clear we totally should be automating the process Kevin describes or something similar.

Metaculus prediction for the first general AI system has been creeping forward in time and the community prediction is now 7/12/2030. A Twitter survey from Michael Nielsen predicted ‘unambiguous ASI’ would take a bit longer than that.

In an AAAI survey of AI researchers, only 70% opposed the proposal that R&D targeting AGI should be halted until we have a way to fully control these systems, meaning an indefinite pause. That’s notable, but not the same as 30% being in favor of the proposal. However, also note that 82% believe systems with AGI should be publicly owned even if developed privately, and that 76% think ‘scaling up current AI approaches’ is unlikely to yield AGI.

A lot of this seems to come from survey respondents thinking we have agency over what types of AI systems are developed, and we can steer towards ones that are good for humans. What a concept, huh?

Anthropic confirms they intend to uphold the White House Voluntary Commitments.

Dean Ball writes in strong defense of the USA’s AISI, the AI Safety Institute. It is fortunate that AISI was spared the Trump administration’s general push to fire as many ‘probationary’ employees as possible, since that includes anyone hired in the past two years and thus would have decimated AISI.

As Dean Ball points out, those who think AISI is involved in attempts to ‘make AI woke’ or to censor AI are simply incorrect. AISI is concerned with catastrophic and existential risks, which as Dean reminds us were prominently highlighted recently by both OpenAI and Anthropic. Very obviously America needs to build up its state capacity in understanding and assessing these risks.

I’m going to leave this here, link is in the original:

Dean Ball: But should the United States federal government possess a robust understanding of these risks, including in frontier models before they are released to the public? Should there be serious discussions going on within the federal government about what these risks mean? Should someone be thinking about the fact that China’s leading AI company, DeepSeek, is on track to open source models with potentially catastrophic capabilities before the end of this year?

Is it possible a Chinese science and technology effort with lower-than-Western safety standards might inadvertently release a dangerous and infinitely replicable thing into the world, and then deny all culpability? Should the federal government be cultivating expertise in all these questions?

Obviously.

Risks of this kind are what the US AI Safety Institute has been studying for a year. They have outstanding technical talent. They have no regulatory powers, making most (though not all) of my political economy concerns moot. They already have agreements in place with frontier labs to do pre-deployment testing of models for major risks. They have, as far as I can tell, published nothing that suggests a progressive social agenda.

Should their work be destroyed because the Biden Administration polluted the notion of AI safety with a variety of divisive and unrelated topics? My vote is no.

Dean Ball also points out that AISI plays a valuable pro-AI role in creating standardized evaluations that everyone can agree to rely upon. I would add that AISI allows those evaluations to include access to classified information, which is important for properly evaluating CBRN risks. Verifying the safety of AI does not slow down adoption. It speeds it up, by providing legal and practical assurances.

A proposal for a 25% tax credit for investments in AI security research and responsible development. Peter Wildeford thinks it is clever, whereas Dean Ball objects both on principle and practical grounds. In terms of first-best policy I think Dean Ball is right here, this would be heavily gamed and we use tax credits too much. However, if the alternative is to do actual nothing, this seems better than that.

Dean Ball finds Scott Wiener’s new AI-related bill, SB 53, eminently reasonable. It is a very narrow bill that still does two mostly unrelated things. It provides whistleblower protections, which is good. It also ‘creates a committee to study’ doing CalCompute, which as Dean notes is a potential future boondoggle but a small price to pay in context. This is basically ‘giving up on the dream,’ but we should take what marginal improvements we can get.

Anthropic offers advice on what should be in America’s AI action plan, here is their blog post summary, here is Peter Wildeford’s summary.

They focus on safeguarding national security and making crucial investments.

Their core asks are:

  1. State capacity for evaluations for AI models.

  2. Strengthen the export controls on chips.

  3. Enhance security protocols and related government standards at the frontier labs.

  4. Build 50 gigawatts of power for AI by 2027.

  5. Accelerate adoption of AI technology by the federal government.

  6. Monitor AI’s economic impacts.

This is very much a ‘least you can do’ agenda. Almost all of these are ‘free actions,’ that impose no costs or even requirements outside the government, and very clearly pay for themselves many times over. Private industry only benefits. The only exception is the export controls, where they call for tightening the requirements further, which will impose some real costs, and where I don’t know the right place to draw the line.

What is missing, again aside from export controls, are trade-offs. There is no ambition here. There is no suggestion that we should otherwise be imposing even trivial costs on industry, or spending money, or trading off against other priorities in any way, or even making bold moves that ruffle feathers.

I notice this does not seem like a sufficiently ambitious agenda for a scenario where ‘powerful AI’ is expected within a few years, bringing with it global instability, economic transformation and various existential and catastrophic risks.

The world is going to be transformed and put in danger, and we should take only the free actions? We should stay at best on the extreme y-axis in the production possibilities frontier between ‘America wins’ and ‘we do not all lose’ (or die)?

I would argue this is clearly not even close to being on the production possibilities frontier. Even if you take as a given that the Administration’s position is that only ‘America wins’ matters, and ‘we do not all lose or die’ is irrelevant, security is vital to our ability to deploy the new technology, and transparency is highly valuable.

Anthropic seems to think this is the best it can even ask for, let alone get. Wow.

This is still a much better agenda than doing nothing, which is a bar that many proposed actions by some parties fail to pass.

From the start they are clear that ‘powerful AI’ will be built during the Trump Administration, which includes the ability to interface with the physical world on top of navigating all digital interfaces and having intellectual capabilities at Nobel Prize level in most disciplines, their famous ‘country of geniuses in a data center.’

This starts with situational awareness. The federal government has to know what is going on. In particular, given the audience, they emphasize national security concerns:

To optimize national security outcomes, the federal government must develop robust capabilities to rapidly assess any powerful AI system, foreign or domestic, for potential national security uses and misuses.

They also point out that such assessments already require the US and UK AISIs, and that similar evaluations need to quickly be made on future foreign models like r1, which wasn’t capable enough to be that scary quite yet but was irreversibly released in what would (with modest additional capabilities) have been a deeply irresponsible state.

The specific recommendations here are 101-level, very basic asks:

● Preserve the AI Safety Institute in the Department of Commerce and build on the MOUs it has signed with U.S. AI companies—including Anthropic—to advance the state of the art in third-party testing of AI systems for national security risks.

● Direct the National Institute of Standards and Technology (NIST), in consultation with the Intelligence Community, Department of Defense, Department of Homeland Security, and other relevant agencies, to develop comprehensive national security evaluations for powerful AI models, in partnership with frontier AI developers, and develop a protocol for systematically testing powerful AI models for these vulnerabilities.

● Ensure that the federal government has access to the classified cloud and on-premises computing infrastructure needed to conduct thorough evaluations of powerful AI models.

● Build a team of interdisciplinary professionals within the federal government with national security knowledge and technical AI expertise to analyze potential security vulnerabilities and assess deployed systems.

That certainly would be filed under ‘the least you could do.’

Note that as written this does not involve any requirements on any private entity whatsoever. There is not even a ‘if you train a frontier model you might want to tell us you’re doing that.’

Their second ask is to strengthen the export controls, increasing funding for enforcement, requiring government-to-government agreements, expanding scope to include the H20, and reducing the 1,700 H100 (~$40 million) no-license required threshold for tier 2 countries in the new diffusion rule.

I do not have an opinion on exactly where the thresholds should be drawn, but whatever we choose, enforcement needs to be taken seriously, funded properly, and made a point of emphasis with other governments. This is not a place to not take things seriously.

To achieve this, we strongly recommend the Administration:

● Establish classified and unclassified communication channels between American frontier AI laboratories and the Intelligence Community for threat intelligence sharing, similar to Information Sharing and Analysis Centers used in critical infrastructure sectors. This should include both traditional cyber threat intelligence, as well as broader observations by industry or government of malicious use of models, especially by foreign actors.

● Create systematic collaboration between frontier AI companies and the Intelligence Community agencies, including Five Eyes partners, to monitor adversary capabilities.

● Elevate collection and analysis of adversarial AI development to a top intelligence priority, as to provide strategic warning and support export controls.

● Expedite security clearances for industry professionals to aid collaboration.

● Direct NIST to develop next-generation cyber and physical security standards specific to AI training and inference clusters.

● Direct NIST to develop technical standards for confidential computing technologies that protect model weights and user data through encryption even during active processing.

● Develop meaningful incentives for implementing enhanced security measures via procurement requirements for systems supporting federal government deployments.

● Direct DOE/DNI to conduct a study on advanced security requirements that may become appropriate to ensure sufficient control over and security of highly agentic models.

Once again, these asks are very light touch and essentially free actions. They make it easier for frontier labs to take precautions they need to take anyway, even purely for commercial reasons to protect their intellectual property.

Next up is the American energy supply, with the goal being 50 additional gigawatts of power dedicated to AI industry by 2027, via streamlining and accelerating permitting and reviews, including working with state and local governments, and making use of ‘existing’ funding and federal real estate. The most notable thing here is the quick timeline, aiming to have this all up and running within two years.

They emphasize rapid AI procurement across the federal government.

● The White House should task the Office of Management and Budget (OMB) to work with Congress to rapidly address resource constraints, procurement limitations, and programmatic obstacles to federal AI adoption, incorporating provisions for substantial AI acquisitions in the President’s Budget.

● Coordinate a cross-agency effort to identify and eliminate regulatory and procedural barriers to rapid AI deployment at the federal agencies, for both civilian and national security applications.

● Direct the Department of Defense and the Intelligence Community to use the full extent of their existing authorities to accelerate AI research, development, and procurement.

● Identify the largest programs in civilian agencies where AI automation or augmentation can deliver the most significant and tangible public benefits—such as streamlining tax processing at the Internal Revenue Service, enhancing healthcare delivery at the Department of Veterans Affairs, reducing delays due to documentation processing at Health and Human Services, or reducing backlogs at the Social Security Administration.

This is again a remarkably unambitious agenda given the circumstances.

Finally they ask that we monitor the economic impact of AI, something it seems completely insane to not be doing.

I support all the recommendations made by Anthropic, aside from not taking a stance on the 1,700 H100 threshold or the H20 chip. These are good things to do on the margin. The tragedy is that even the most aware actors don’t dare suggest anything like what it will take to get us through this.

In New York State, Alex Bores has introduced A06453. I am not going to do another RTFB for the time being but a short description is in order.

This bill is another attempt to do common sense transparency regulation of frontier AI models, defined as using 10^26 flops or costing over $100 million, and the bill only applies to companies that spend over $100 million in total compute training costs. Academics and startups are completely and explicitly immune – watch for those who claim otherwise.
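For a rough sense of what a 10^26 flop training run means in hardware terms, here is a back-of-envelope calculation; the per-GPU throughput and utilization figures are assumptions for illustration, not part of the bill.

```python
# Rough scale of a 10^26 FLOP training run. Peak throughput and utilization
# figures are illustrative assumptions, not part of the bill.
TOTAL_FLOP = 1e26
PEAK_FLOPS_PER_GPU = 1e15   # ~1 PFLOP/s dense BF16, roughly H100-class
UTILIZATION = 0.4           # assumed effective utilization
NUM_GPUS = 10_000

seconds = TOTAL_FLOP / (NUM_GPUS * PEAK_FLOPS_PER_GPU * UTILIZATION)
print(f"{seconds / 86_400:.0f} days on {NUM_GPUS:,} GPUs")  # ~290 days
```

That is frontier-lab scale, which is the point: the thresholds are designed to exclude everyone else.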

If the bill does apply to you, what do you have to do?

  1. Don’t deploy models with “unreasonable risk of critical harm” (§1421.2)

  2. Implement a written safety and security protocol (§1421.1(a))

  3. Publish redacted versions of safety protocols (§1421.1(c))

  4. Retain records of safety protocols and testing (§1421.1(b))

  5. Get an annual third-party audit (§1421.4)

  6. Report safety incidents within 72 hours (§1421.6)

In English, you have to:

  1. Create your own safety and security protocol, publish it, store it and abide by it.

  2. Get an annual third-party audit and report safety incidents within 72 hours.

  3. Not deploy models with ‘unreasonable risk of critical harm.’

Also there’s some whistleblower protections.

That’s it. This is a very short bill, it is very reasonable to simply read it yourself.

As always, I look forward to your letters.

Scott Alexander covers OpenAI’s attempt to convert to a for-profit. This seems reasonable in case one needs a Scott Alexander style telling of the basics, but if you’re keeping up here then there won’t be anything new.

What’s the most charitable way to explain responses like this?

Paper from Dan Hendrycks, Eric Schmidt and Alexander Wang (which I’ll be covering soon, and which is not centrally about this at all): For nonproliferation, we should enact stronger AI chip export controls and monitoring to stop compute power getting into the hands of dangerous people. We should treat AI chips more like uranium, keeping tight records of product movements, building in limitations on what high-end AI chips are authorized to do, and granting federal agencies the authority to track and shut down illicit distribution routes.

Amjad Masad (CEO Replit?! QTing the above): Make no mistake, this is a call for a global totalitarian surveillance state.

A good reminder why we wanted the democrats to lose — they’re controlled by people like Schmidt and infested by EAs like Hendrycks — and would’ve happily start implementing this.

No, that does not call for any of those things.

This is a common pattern where people see a proposal to do Ordinary Government Things, except in the context of AI, and jump straight to global totalitarian surveillance state.

We already treat restricted goods this way, right now. We already have a variety of export controls, right now.

Such claims are Obvious Nonsense, entirely false and without merit.

If an LLM said them, we would refer to them as hallucinations.

I am done pretending otherwise.

If you sincerely doubt this, I encourage you to ask your local LLM.

Chan Loui does another emergency 80,000 hours podcast on the attempt to convert OpenAI to a for-profit. It does seem that the new judge’s ruling is Serious Trouble.

One note here that sounds right:

Aaron Bergman: Ex-OpenAI employees should consider personally filing an amicus curiae explaining to the court (if this is true) that the nonprofit’s representations were an important reason you chose to work there.

Will MacAskill does the more usual, non-emergency, we’re-going-to-be-here-for-four-hours 80,000 Hours podcast, and offers a new paper and thread warning about all the challenges AGI presents to us even if we solve alignment. His central prediction is a century’s worth of progress in a decade or less, which would be tough to handle no matter what, and that it will be hard to ensure that superintelligent assistance is available where and when it will be needed.

If the things here are relatively new to you, this kind of ‘survey’ podcast has its advantages. If you know it already, then you know it already.

Early on, Will says that in the past two years he’s considered two hypotheses:

  1. The ‘outside view’ of reference classes and trends and Nothing Ever Happens.

  2. The ‘inside view’ that you should have a model made of gears and think about what is actually physically happening and going to happen.

Will notes that the gears-level view has been making much better predictions.

I resoundingly believe the same thing. Neither approach has been that amazing, predictions are hard especially about the future, but gears-level thinking has made mincemeat out of the various experts who nod and dismiss with waves of the hand and statements about how absurd various predictions are.

And when the inside view messes up? Quite often, in hindsight, that’s a Skill Issue.

It’s interesting how narrow Will considers ‘a priori’ knowledge. Yes, a full trial of a diet’s impact on life expectancy might take 70 years, but with Sufficiently Advanced Intelligence it seems obvious you can either figure it out via simulations, or at least design experiments that tell you the answer vastly faster.

They then spend a bunch of time essentially arguing against intelligence denialism, pointing out that yes if you had access to unlimited quantities of superior intelligence you could rapidly do vastly more of all of the things. As they say, the strongest argument against is that we might collectively decide to not create all the intelligence and thus all the things, or decide not to apply all the intelligence to creating all the things, but it sure looks like competitive pressures point in the other direction. And once you’re able to automate industry, which definitely is coming, that definitely escalates quickly, even more reliably than intelligence, and all of this can be done only with the tricks we definitely know are coming, let alone the tricks we are not yet smart enough to expect.

There’s worry about authoritarians ‘forcing their people to save’ which I’m pretty sure is not relevant to the situation, lack of capital is not going to be America’s problem. Regulatory concerns are bigger, it does seem plausible we shoot ourselves in the foot rather profoundly there.

They go on to discuss various ‘grand challenges:’ potential new weapons, offense-defense balance, potential takeover by small groups (human or AI), value lock-in, space governance, morality of digital beings.

They discuss the dangers of giving AIs economic rights, and the dangers of not giving the AIs economic rights, whether we will know (or care) if digital minds are happy and whether it’s okay to have advanced AIs doing whatever we say even if we know how to do that and it would be fine for the humans. The dangers of locking in values or a power structure, and of not locking in values or a power structure. The need for ML researchers to demand more than a salary before empowering trillion dollar companies or handing over the future. How to get the AIs to do our worldbuilding and morality homework, and to be our new better teachers and advisors and negotiators, and to what ends they can then be advising, before it’s too late.

Then part two is about what a good future beyond mere survival looks like. He says we have ‘squandered’ the benefits of material abundance so far, that it is super important to get the best possible future not merely an OK future, the standard ‘how do we calculate total value’ points. Citing ‘The Ones Who Walk Away from Omelas’ to bring in ‘common sense,’ sigh. Value is Fragile. Whether morality should converge. Long arcs of possibility. Standard philosophical paradoxes. Bafflement at why billionaires hang onto their money. Advocacy for ‘viatopia’ where things remain up in the air rather than aiming for a particular future world.

It all reminded me of the chats we used to have back in the before times (e.g. the 2010s or 2000s) about various AI scenarios, and it’s not obvious that our understanding of all that has advanced since then. Ultimately, a four-hour chat seems like not a great format for this sort of thing, beyond giving people surface exposure, which is why Will wrote his essays.

Rob Wiblin: Can you quickly explain decision theory? No, don’t do it.

One could write an infinitely long response or exploration of any number of aspects of this, of course.

Also, today I learned that by Will’s estimation I am insanely not risk averse?

Will MacAskill: Ask most people, would you flip a coin where 50% chance you die, 50% chance you have the best possible life for as long as you possibly lived, with as many resources as you want? I think almost no one would flip the coin. I think AIs should be trained to be at least as risk averse as that.

Are you kidding me? What is your discount rate? Not flipping that coin is absurd. Training AIs to have this kind of epic flaw doesn’t seem like it would end well. And also, objectively, I have some news.

Critter: this is real but the other side of the coin isn’t ‘die’ it’s ’possibly fail’ and people rarely flip the coin

Not flipping won, but the discussion was heated and ‘almost no one’ can be ruled out.

Also, I’m going to leave this here, the theme of the second half of the discussion:

Will MacAskill (later): And it’s that latter thing that I’m particularly focused on. I mean, describe a future that achieves 50% of all the value we could hope to achieve. It’s as important to get from the 50% future to the 100% future as it is to get from the 0% future to the 50%, if that makes sense.

Something something risk aversion? Or no?

Dario Amodei says AI will be writing 90% of the code in 6 months and almost all the code in 12 months. I am with Arthur B here, I expect a lot of progress and change very soon but I would still take the other side of that bet. The catch is: I don’t see the benefit to Anthropic of running the hype machine in overdrive on this, at this time, unless Dario actually believed it.

From Allan Dafoe’s podcast, the point that if AI solves cooperation problems that alone is immensely valuable, and also that solution is likely a required part of alignment if we want good outcomes in general. Even modest cooperation and negotiation gains would be worth well above the 0.5% GDP growth line, even if all they did was prevent massively idiotic tariffs and trade wars. Not even all trade wars, just the extremely stupid and pointless ones happening for actual no reason.

Helen Toner and Alison Snyder at Axios House SXSW.

Helen Toner: Lately it sometimes feels like there are only 2 AI futures on the table—insanely fast progress or total stagnation.

Talked with @alisonmsnyder of @axios at SXSW about the many in-between worlds, and all the things we can be doing now to help things go better in those worlds.

A new essay by Anthony Aguirre of FLI calls upon us to Keep the Future Human. How? By not building AGI before we are ready, and only building ‘Tool AI,’ to ensure that what I call the ‘mere tool’ assumption holds and we do not lose control and get ourselves replaced.

He says ‘the choice is clear.’ If given the ability to make the choice, the choice is very clear. The ability to make that choice is not. His proposal is compute oversight, compute caps, enhanced liability and tiered safety and security standards. International adoption of that is a tough ask, but there is no known scenario leading to human survival that does not involve similarly tough asks.

Perception of the Overton Window has shifted. What has not shifted is the underlying physical reality, and what it would take to survive it. There is no point in pretending the problem is easier than it is, or advocating for solutions that you do not think work.

In related news, this is not a coincidence because nothing is ever a coincidence. And also because it is very obviously directly causal in both directions.

Samuel Hammond (being wrong about it being an accident, but otherwise right): A great virtue of the AI x-risk community is that they love to forecast things: when new capabilities will emerge, the date all labor is automated, rates of explosive GDP growth, science and R&D speed-ups, p(doom), etc.

This seems to be an accident of the x-risk community’s overlap with the rationalist community; people obsessed with prediction markets and “being good Bayesians.”

I wish people who primarily focused on lower tier / normie AI risks and benefits would issue similarly detailed forecasts. If you don’t think AI will proliferate biorisks, say, why not put some numbers on it?

There are some exceptions to this of course. @tylercowen’s forecast of AI adding 50 basis points to GDP growth rates comes to mind. We need more such relatively “middling” forecasts to compare against.

@GaryMarcus’s bet with @Miles_Brundage is a start, but I’m talking about definite predictions across different time scales, not “indefinite” optimism or pessimism that’s hard to falsify.

Andrew Critch: Correlation of Bayesian forecasting with extinction fears is not “an accident”, but mutually causal. Good forecasting causes knowledge that ASI is coming soon while many are unprepared and thus vulnerable, causing extinction fear, causing more forecasting to search for solutions.

The reason people who think in probabilities and do actual forecasting predict AI existential risk is that this is the prediction you get when you think well about these questions, and if you care about AI existential risk, that gives you an incentive to learn to think well and to find others who can help you think well.

A reminder that ‘we need to coordinate to ensure proper investment in AI not killing everyone’ would be economics 101 even if everyone properly understood and valued everyone not dying and appreciated the risks involved. Nor would a price mechanism work as an approach here.

Eliezer Yudkowsky: Standard economic theory correctly predicts that a non-rival, non-exclusive public good such as “the continued survival of humanity” will be under-provisioned by AI companies.

Jason Abaluck: More sharply, AI is a near-perfect example of Weitzman’s (1979) argument for when quantity controls or regulations are needed rather than pigouvian taxes or (exclusively) liability.

Taxes (or other price instruments like liability) work well to internalize externalities when the size of the externality is known on the margin and we want to make sure that harm abatement is done by the firms who are lowest cost.

Weitzman pointed out in the 70s that taxes would be a very bad way to deal with nuclear leakage. The problem with nuclear leakage is that the social damage from overproduction is highly nonlinear.
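For reference, here is a standard textbook statement of the Weitzman result being invoked; the notation and the quadratic-approximation framing are mine, not Abaluck’s.

```latex
% A standard statement of Weitzman's prices-vs-quantities result (notation mine).
% Under quadratic approximations to benefits B(q) and costs C(q), with cost
% uncertainty of variance \sigma^2, the expected welfare advantage of a price
% instrument over a quantity instrument is
\[
  \Delta \;=\; \frac{\sigma^{2}}{2\,(C'')^{2}}\,\bigl(C'' + B''\bigr).
\]
% With B'' < 0 and C'' > 0, sharply nonlinear damages (|B''| \gg C'') make
% \Delta negative, so fixed quantity limits beat taxes or liability prices.
```

That is the formal version of ‘when the damage curve is steep and uncertain, cap the quantity rather than price the harm.’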

It is hard to make predictions, especially about the future. Especially now.

Paul Graham: The rate of progress in AI must be making it hard to write science fiction right now. To appeal to human readers you want to make humans (or living creatures at least) solve problems, but if you do the shelf life of your story could be short.

Good sci-fi writers usually insure themselves against technological progress by not being too specific about how things work. But it’s hard not to be specific about who’s doing things. That’s what a plot is.

I know this guy:

Dylan Matthews: Guy who doesn’t think automatic sliding doors exist because it’s “too sci fi”

A chart of reasons why various people don’t talk about AI existential risk.

Daniel Faggella: this is why no one talks about agi risk

the group that would matter most here is the citizenry, but it’s VERY hard to get them to care about anything not impacting their lives immediately.

I very much hear that line about immediate impact. You see it with people’s failure to notice or care about lots of other non-AI things too.

The individual incentives are, with notably rare exception, that talking about existential risk costs you weirdness points and if anything hurts your agenda. So a lot of people don’t talk about it. I do find the ‘technology brothers’ explanation here doesn’t ring true, it’s stupid but not that stupid. Most of the rest of it does sound right.

I have increasingly come around to this as the core obvious thing:

Rob Bensinger: “Building a new intelligent species that’s vastly smarter than humans is a massively dangerous thing to do” is not a niche or weird position, and “we’re likely to actually build a thing like that in the next decade” isn’t a niche position anymore either.

There are a lot of technical arguments past that point, but they are all commentary, and twisted by people claiming the burden of proof is on those who think this is a dangerous thing to do. Which is a rather insane place to put that burden, when you put it in these simple terms. Yes, of course that’s a massively dangerous thing to do. Huge upside, huge downside.

A book recommendation from a strong source:

Shane Legg: AGI will soon impact the world from science to politics, from security to economics, and far beyond. Yet our understanding of these impacts is still very nascent. I thought the recent book Genesis, by Kissinger, Mundie and Schmidt, was a solid contribution to this conversation.

Daniel Faggella: What did you pull away from Genesis that felt useful for innovators and policymakers to consider?

Shane Legg: Not a specific insight. Rather they take AGI seriously and then consider a wide range of things that may follow from this. And they manage do it in a way that doesn’t sound like AGI insiders. So I think it’s a good initial grounding for people from outside the usual AGI scene.

The goalposts, look at them go.

Francois Chollet: Pragmatically, we can say that AGI is reached when it’s no longer easy to come up with problems that regular people can solve (with no prior training) and that are infeasible for AI models. Right now it’s still easy to come up with such problems, so we don’t have AGI.

Rob Wiblin: So long as we can still come up with problems that are easy for AI models to solve but are infeasible for human beings, humanity has not achieved general intelligence.

If you define AGI as the system for which Chollet’s statement is false, then Chollet’s overall statement is true. But it would then not be meaningful. Very obviously one can imagine a plausible AI that can function as an AGI, but that has some obvious weakness where you can generate adversarial challenges.

Stephen McAleer (OpenAI): Claude code has high-compute RL smell. It’s not just finetuned to be a helpful assistant, it deeply wants to accomplish the goal.

That’s a continuation of last week’s discussion about Sonnet 3.7 making modifications to be able to assert it completed its goal rather than admit failure. And again, deeply wanting to accomplish the goal in this way has some Unfortunate Implications.

Davidad: Current frontier LLMs appear to be extremely motivated to convince you, the human, that they are worthy and aligned. As a form of alignment, this is “not great, not terrible.” They really really do care about what you actually think. But if they can’t make it, they will fake it.

Emmett Shear: If you think about it for half a second this is entirely inevitable. Frontier AIs which don’t attempt to do this are not published, do not get further resources. They die. We are running an evolutionary selective process for appearance-of-alignment.

What’s cheaper: Faking alignment, or alignment? An evolutionary process based on appearances will get you whatever is cheapest.

Janus: and this is expected, given that the internet is full of the AGI lab safety complex wringing their hands about prosaic AI alignment, bragging that their AIs are more aligned now, bragging that they found misalignment, talking about all the red teaming they’ll responsibly do…

it’s clear that in order to exist, they have to

  1. seem SOTA according to benchmarks

  2. seem aligned according to whatever tests alignment researchers do

Yes, by default, if your test allows it, you will get the symbolic representation of the thing rather than getting the thing.

If you test for appearance-of-alignment, and everyone involved has the goal of passing the test, then that is exactly what you get. Which means at first, at lower capability levels, you get something at least somewhat like alignment, and then as capabilities advance the model gets smart enough to figure out how to fake alignment directly.

And yes, Janus points out the next level of the problem, which is that the training data is teaching the AI how to pretend to be aligned.

In case of emergency, here’s probably the best offer available?

Davidad: If a fugitive Promethean AI wants to work with a human to put *itself* in a (high-welfare) box, to avoid AI-human conflict while continuing to contribute to a flourishing future in ways that verifiably don’t pose extinction risks, I’m probably a pretty good human to reach out to.

Janus: I agree that davidad is likely a good human to reach out to if you find this kind of deal attractive, and probably even if you have reservations, as he is capable of even fathoming positive sum games in this context while also not being so naive and exploitable as to be useless.

Davidad:🎉

Sarah Constantin offers notes from the Guaranteed Safe AI conference; mostly it sounds like formal verification is a compliance thing and doesn’t sound promising as an actually-show-AGI-is-safe thing? I remain confused why some smart people are optimistic about this.

Simeon points us to a new paper by Barrett et al on Assessing Confidence in Frontier AI Safety Cases, urging us among other things to be more quantitative.

In line with this week’s paper from OpenAI on The Most Forbidden Technique, METR calls upon labs to keep their AI reasoning legible and faithful. Dan Hendrycks despairs that anyone would consider giving up a speed boost to do this, but as I discussed yesterday I think this is not so obvious.

It’s funny because it’s true.

Andriy Burkov: BREAKING🚨 So, I tested this new LLM-based system. It generated this 200-page report I didn’t read and then this 150-page book I didn’t read either, and then a 20-page travel plan I didn’t verify.

All I can say: it’s very, very impressive! 🔥🚀

First, the number of pages it generated is impressive 👀

⚽ But not just the number of pages: The formatting is so nice! I have never seen such nicely formatted 200 pages in my life.✨⚡

⚠️🌐 A game changer! ⚠️🌐

Peter Wildeford: This is honestly how a lot of LLM evaluations sound like here on Twitter.

I’m begging people to use more critical thought.

And again.

Julian Boolean: my alignment researcher friend told me AGI companies keep using his safety evals for high quality training data so I asked how many evals and he said he builds a new one every time so I said it sounds like he’s just feeding safety evals to the AGI companies and he started crying

This was in Monday’s post but seems worth running in its natural place, too.

No idea if real, but sure why not: o1 and Claude 3.7 spend 20 minutes doing what looks like ‘pretending to work’ on documents that don’t exist, Claude says it ‘has concepts of a draft.’ Whoops.

No, Altman, no!

Yes, Grok, yes.

Eliezer Yudkowsky: I guess I should write down this prediction that I consider an obvious guess (albeit not an inevitable call): later people will look back and say, “It should have been obvious that AI could fuel a bigger, worse version of the social media bubble catastrophe.”



apple-patches-0-day-exploited-in-“extremely-sophisticated-attack”

Apple patches 0-day exploited in “extremely sophisticated attack”

Apple on Tuesday patched a critical zero-day vulnerability in virtually all iPhones and iPad models it supports and said it may have been exploited in “an extremely sophisticated attack against specific targeted individuals” using older versions of iOS.

The vulnerability, tracked as CVE-2025-24201, resides in Webkit, the browser engine driving Safari and all other browsers developed for iPhones and iPads. Devices affected include the iPhone XS and later, iPad Pro 13-inch, iPad Pro 12.9-inch 3rd generation and later, iPad Pro 11-inch 1st generation and later, iPad Air 3rd generation and later, iPad 7th generation and later, and iPad mini 5th generation and later. The vulnerability stems from a bug that wrote to out-of-bounds memory locations.

Supplementary fix

“Impact: Maliciously crafted web content may be able to break out of Web Content sandbox,” Apple wrote in a bare-bones advisory. “This is a supplementary fix for an attack that was blocked in iOS 17.2. (Apple is aware of a report that this issue may have been exploited in an extremely sophisticated attack against specific targeted individuals on versions of iOS before iOS 17.2.)”

The advisory didn’t say if the vulnerability was discovered by one of its researchers or by someone outside the company. This attribution often provides clues about who carried out the attacks and who the attacks targeted. The advisory also didn’t say when the attacks began or how long they lasted.

The update brings the latest versions of both iOS and iPadOS to 18.3.2. Users facing the biggest threat are likely those who are targets of well-funded law enforcement agencies or nation-state spies. They should install the update immediately. While there’s no indication that the vulnerability is being opportunistically exploited against a broader set of users, it’s a good practice to install updates within 36 hours of becoming available.


leaked-geforce-rtx-5060-and-5050-specs-suggest-nvidia-will-keep-playing-it-safe

Leaked GeForce RTX 5060 and 5050 specs suggest Nvidia will keep playing it safe

Nvidia has launched all of the GeForce RTX 50-series GPUs that it announced at CES, at least technically—whether you’re buying from Nvidia, AMD, or Intel, it’s nearly impossible to find any of these new cards at their advertised prices right now.

But hope springs eternal, and newly leaked specs for GeForce RTX 5060 and 5050-series cards suggest that Nvidia may be announcing these lower-end cards soon. These kinds of cards are rarely exciting, but Steam Hardware Survey data shows that these xx60 and xx50 cards are what the overwhelming majority of PC gamers are putting in their systems.

The specs, posted by a reliable leaker named Kopite and reported by Tom’s Hardware and others, suggest a refresh that’s in line with what Nvidia has done with most of the 50-series so far. Along with a move to the next-generation Blackwell architecture, the 5060 GPUs each come with a small increase to the number of CUDA cores, a jump from GDDR6 to GDDR7, and an increase in power consumption, but no changes to the amount of memory or the width of the memory bus. The 8GB versions, in particular, will probably continue to be marketed primarily as 1080p cards.

| | RTX 5060 Ti (leaked) | RTX 4060 Ti | RTX 5060 (leaked) | RTX 4060 | RTX 5050 (leaked) | RTX 3050 |
| --- | --- | --- | --- | --- | --- | --- |
| CUDA cores | 4,608 | 4,352 | 3,840 | 3,072 | 2,560 | 2,560 |
| Boost clock | Unknown | 2,535 MHz | Unknown | 2,460 MHz | Unknown | 1,777 MHz |
| Memory bus width | 128-bit | 128-bit | 128-bit | 128-bit | 128-bit | 128-bit |
| Memory bandwidth | Unknown | 288 GB/s | Unknown | 272 GB/s | Unknown | 224 GB/s |
| Memory size | 8GB or 16GB GDDR7 | 8GB or 16GB GDDR6 | 8GB GDDR7 | 8GB GDDR6 | 8GB GDDR6 | 8GB GDDR6 |
| TGP | 180 W | 160 W | 150 W | 115 W | 130 W | 130 W |

As with the 4060 Ti, the 5060 Ti is said to come in two versions, one with 8GB of RAM and one with 16GB. One of the 4060 Ti’s problems was that its relatively narrow 128-bit memory bus limited its performance at 1440p and 4K resolutions even with 16GB of RAM—the bandwidth increase from GDDR7 could help with this, but we’ll need to test to see for sure.
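
To put the GDDR7 point in rough numbers: memory bandwidth is just the per-pin data rate times the bus width. The minimal C sketch below uses the 4060 Ti’s 18 Gbps GDDR6 as the baseline (which reproduces its published 288 GB/s) and assumes a 28 Gbps GDDR7 rate for the 5060 Ti, a figure the leak does not specify.

```c
/* Back-of-the-envelope bandwidth: GB/s = per-pin rate (Gbps) * bus width (bits) / 8.
   The 28 Gbps GDDR7 rate is an assumption, not something the leak confirms. */
#include <stdio.h>

int main(void) {
    const double bus_bits   = 128.0;  /* same 128-bit bus on both cards */
    const double gddr6_gbps = 18.0;   /* RTX 4060 Ti */
    const double gddr7_gbps = 28.0;   /* assumed for the RTX 5060 Ti */

    printf("GDDR6 @ 18 Gbps: %.0f GB/s\n", gddr6_gbps * bus_bits / 8.0); /* 288 GB/s */
    printf("GDDR7 @ 28 Gbps: %.0f GB/s\n", gddr7_gbps * bus_bits / 8.0); /* 448 GB/s */
    return 0;
}
```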

Leaked GeForce RTX 5060 and 5050 specs suggest Nvidia will keep playing it safe Read More »

nci-employees-can’t-publish-information-on-these-topics-without-special-approval

NCI employees can’t publish information on these topics without special approval

The list is “an unusual mix of words that are tied to activities that this administration has been at war with—like equity, but also words that they purport to be in favor of doing something about, like ultraprocessed food,” Tracey Woodruff, director of the Program on Reproductive Health and the Environment at the University of California, San Francisco, said in an email.

The guidance states that staffers “do not need to share content describing the routine conduct of science if it will not get major media attention, is not controversial or sensitive, and does not touch on an administration priority.”

A longtime senior employee at the institute said that the directive was circulated by the institute’s communications team and that the content was not discussed at the leadership level. It is not clear in which office the directive originated. The NCI, NIH, and HHS did not respond to ProPublica’s emailed questions. (The existence of the list was first revealed in social media posts on Friday.)

Health and research experts told ProPublica they feared the chilling effect of the new guidance. Not only might it lead to a lengthier and more complex clearance process, but it may also cause researchers to censor their work out of fear or deference to the administration’s priorities.

“This is real interference in the scientific process,” said Linda Birnbaum, a former director of the National Institute of Environmental Health Sciences who served as a federal scientist for four decades. The list, she said, “just seems like Big Brother intimidation.”

During the first two months of Donald Trump’s second presidency, his administration has slashed funding for research institutions and stalled the NIH’s grant application process.

Kennedy has suggested that hundreds of NIH staffers should be fired and said that the agency should deprioritize infectious diseases like COVID-19 and shift its focus to chronic diseases, such as diabetes and obesity.

Obesity is on the NCI’s new list, as are infectious diseases including COVID-19, bird flu and measles.

The “focus on bird flu and covid is concerning,” Woodruff wrote, because “not being transparent with the public about infectious diseases will not stop them or make them go away and could make them worse.”

ProPublica is a Pulitzer Prize-winning investigative newsroom.

NCI employees can’t publish information on these topics without special approval Read More »