Author name: Beth Washington


AI #100: Meet the New Boss

Break time is over, it would seem, now that the new administration is in town.

This week we got r1, DeepSeek’s new reasoning model, which is now my go-to first choice for a large percentage of queries. The claim that this was the most important thing to happen on January 20, 2025 was at least non-crazy. If you read about one thing this week read about that.

We also got the announcement of Stargate, a claimed $500 billion private investment in American AI infrastructure. I will be covering that on its own soon.

Due to time limits I have also pushed coverage of a few things into next week, including this alignment paper, and I still owe my take on Deliberative Alignment.

The Trump administration came out swinging on many fronts with a wide variety of executive orders. For AI, that includes repeal of the Biden Executive Order, although not the new diffusion regulations. It also includes bold moves to push through more energy, including widespread NEPA exemptions, and many important other moves not as related to AI.

It is increasingly a regular feature now to see bold claims of AI wonders, usually involving AGI, coming within the next few years. This week was no exception.

And of course there is lots more.

  1. Table of Contents.

  2. Language Models Offer Mundane Utility. Tell those who need to know.

  3. Language Models Don’t Offer Mundane Utility. We will not be explaining.

  4. Huh, Upgrades. o3-mini is ready for deployment soon, Google plugs away.

  5. Additional Notes on r1. Is it steganographic?

  6. Fun With Media Generation. It’s a hoverboard, doc.

  7. We Tested Older LLMs and Are Framing It As a Failure. Yep, it’s this again.

  8. Deepfaketown and Botpocalypse Soon. She’s in love with ChatGPT, version 20.

  9. They Took Our Jobs. Bold predictions get increasingly bold.

  10. Get Involved. Anthropic, an AI Safety Course, a Philosophy post-doc.

  11. Introducing. Humanity’s Last Exam, Kimi k1.5.

  12. We Had a Deal. OpenAI funded and had access to most of FrontierMath.

  13. In Other AI News. How to think about a wide variety of track records.

  14. Whistling in the Dark. They keep talking about this ‘AGI’ thing coming soon.

  15. Quiet Speculations. Still, maybe calm the f down a bit?

  16. Suchir’s Last Post. In the long run, only the fundamentals matter.

  17. Modeling Lower Bound Economic Growth From AI. Not all that low.

  18. The Quest for Sane Regulations. The EO is repealed, the new EOs used ChatGPT.

  19. The Week in Audio. Lightcap, Hinton, Davidad, Ellison.

  20. Rhetorical Innovation. Feeling the AGI, perhaps a bit too much in some cases.

  21. Cry Havoc. Do not let loose the dogs of war.

  22. Aligning a Smarter Than Human Intelligence is Difficult. What’s the plan?

  23. People Strongly Dislike AI. The more they know about it, the worse this gets.

  24. People Are Worried About AI Killing Everyone. What would we do with time?

  25. Other People Not As Worried About AI Killing Everyone. Rocket to the moon.

  26. The Lighter Side. We’re the Claude Boys. Chat up and stand by for response.

Remember that the upgrades are coming. Best think now about how to use them.

Miles Brundage: If you’re a researcher and not thinking about how AI could increase your productivity now + in the future, you should start doing so.

Varies by field but illustratively, you should think ~2-100x bigger over the next 3 years (compared to what you could have achieved without AI).

Bharath Ramsundar: Do you find this true in your personal experience? I’ve been trying to use ChatGPT and Anthropic fairly regularly and have found a few personal use cases but I’d say maybe a 20% boost at best?

Miles Brundage: Prob more like 20-50% RN but I’m assuming a lot of further progress over that period in this estimate

All the tested reasoning models successfully reasoned through this ‘170 breaker’ LSAT question (meaning it is predictive of 170+ scores), whereas the non-reasoning ones, including Sonnet, didn’t. Man, the LSAT is a fun test, and also it’s pretty sad that questions only need to get about this hard to differentiate even at the top.

Fill out forms related to insurance and the California wildfire, using the memory feature and saving hundreds of hours.

Bored: Currently using chatbots to analyze every legal document my home insurance company sends me before signing anything. Legal help is not just for the rich, if you are dealing with insurance, use technology in your favor. Side note…it’s complete BS that these companies try to slide this nonsense into agreements when people are most vulnerable.

Here’s the little game @StateFarm is playing…

If you’re in a disaster you can get an initial payment to cover expenses. They can send you a paper check payment that you then have to cash. OR!!! They sell you the “convenient” digital payment option that transfers money instantly! Wow!

But to do that you need to sign a waiver form saying you won’t sue or be part of a class action lawsuit in the future.

Honestly pretty despicable🖕.

The fact that you can even in theory save hundreds of hours of paperwork is already a rather horrible scandal in the first place. Good to see help is on the way.

Get told correctly to stop being a dumbass and go to the hospital for Rhabdomyolysis.

More o1 prompting advice:

Gfodor: A good o1-pro prompt tells it not just what to do and what context it needs, but tells it how to allocate its *attention budget*. In other words: what to think about, and what not to think about. This is an energy utilization plan.

Now you get it.
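For concreteness, here is one illustrative way a prompt along those lines could be structured; the task, file name and constraints are made-up placeholders, not anyone’s actual workflow:

```python
# Illustrative only: a prompt that states the task, the context needed, and an
# explicit attention budget (what to think about, what not to think about).
# Everything named here is a hypothetical placeholder.
PROMPT = """\
Task: Refactor the payment-retry logic in billing.py so that failed charges
retry with exponential backoff.

Context you need: billing.py (pasted below), the RetryPolicy dataclass, and
the constraint that the public charge() signature cannot change.

Spend your thinking on: edge cases around partial failures, idempotency of
retries, and how backoff interacts with the existing job queue.

Do not spend thinking on: code style, logging format, or test scaffolding --
those are handled elsewhere.
"""
```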

Signpost: people who have managed people have an unfair advantage using LLMs.

Gfodor: It’s true – the best tools for AI we can make for children will foster the skills of breaking down problems and delegating them. (Among others)

Another satisfied o1-pro customer. If you’re coding ‘for real’ you definitely want it until o3 shows up.

Code without typing, via voice → speech-to-text → prompt → code?

Austen Allred: APPARENTLY a bunch of GauntletAI students rarely type when they write code.

Voice -> text to speech -> prompt -> code.

They sit there and speak to their computer and code ends up being written for them.

I have never felt more old and I’m still wrapping my mind around this.

This has to be a skill issue, the question is for who. I can’t imagine wanting to talk when one can type, especially for prompting where you want to be precise. Am I bad at talking or are they bad at typing? Then again, I would consider coding on a laptop to be categorically insane yet many successful coders report doing that, too.

Thread summarizing mostly well-known use cases of a Gemini real-time live feed. This does feel like a place we should be experimenting more.

Peter Wildeford will load the podcast transcript into an LLM on his phone before listening, so he can pause the podcast to ask the LLM questions. I notice I haven’t ‘wanted’ to do this, and wonder to what extent that means I’ve been listening to podcasts wrong, including choosing the ‘wrong’ podcasts.

Potential future mundane utility on offer:

Patrick McKenzie: My kingdom for an LLM/etc which sits below every incoming message saying “X probably needs to know this. OK?”, with one to two clicks to action.

This is not rocket science for either software or professionals, but success rates here are below what one would naively think.

Example:

Me, homeowner, asks GC: Is the sub you told me to expect today going to show [because this expectation materially changes my plans for my day].

GC: He called me this morning to reschedule until tomorrow. Not sure why.

Me: … Good to know!

“You can imagine reasons why this would be dangerous.”

Oh absolutely but I can imagine reasons why the status quo is dangerous, and we only accept them because status quo.

As an example, consider what happens if you get an email about Q1 plans from the recruiting org and Clippy says “Employment counsel should probably read this one.”

LLM doesn’t have to be right, at all, for a Dangerous Professional to immediately curse and start documenting what they know and when they knew it.

And, uh, LLM very plausibly is right.

This seems like a subset of the general ‘suggested next action’ function for an AI agent or AI agent-chatbot hybrid?

As in, there should be a list of things, that starts out concise and grows over time, of potential next actions that the AI could suggest within-context, that you want to make very easy to do – either because the AI figured out this made sense, or because you told the AI to do it, and where the AI will now take the context and use it to make the necessary steps happen on a distinct platform.

Indeed, it’s not only easy to imagine a future where your emails include buttons and suggestions for automated next steps, such as who you should forward information to based on an LLM analysis of the context, it’s low-key hard to imagine that this isn’t already happening now, despite it (at least mostly) not already happening now. We already have automatically generated calendar items and things added to your wallet, and this really needs to get extended a lot, pronto.
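To make the shape of this concrete, here is a minimal sketch of what a ‘probably needs to know this’ hook could look like, assuming an OpenAI-style chat completions API; the model name, roster and JSON format are placeholders I made up, not anything any vendor has actually shipped:

```python
# Hypothetical sketch: ask an LLM who on a small roster probably needs to see an
# incoming message, so the UI can offer a one-click "forward to X?" button.
import json
from openai import OpenAI  # assumes the standard OpenAI Python client

client = OpenAI()
ROSTER = {
    "employment_counsel": "Reviews anything with legal exposure around hiring/firing.",
    "homeowner": "Needs to know about schedule changes from the GC or subs.",
    "accountant": "Needs invoices, payment terms, anything with dollar amounts.",
}

def suggest_recipients(message: str) -> list[str]:
    """Return the roster keys who probably need to see this message."""
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model choice
        messages=[
            {"role": "system",
             "content": "Given a message and a roster, return JSON of the form "
                        '{"notify": [<roster keys>]}. Err on the side of flagging.'},
            {"role": "user",
             "content": f"Roster: {json.dumps(ROSTER)}\n\nMessage:\n{message}"},
        ],
        response_format={"type": "json_object"},
    )
    return json.loads(resp.choices[0].message.content).get("notify", [])

# suggest_recipients("Sub called to reschedule until tomorrow, not sure why.")
# -> ["homeowner"], which the UI turns into a one-click notification.
```

As Patrick notes, the LLM doesn’t have to be right for this to be useful, it only has to surface the question.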

He also asks this question:

Patrick McKenzie: A frontier in law/policy we will have to encounter at some point: does it waive privilege (for example, attorney/client privilege) if one of the participants of the meeting is typing on a keyboard connected to a computer system which keeps logs of all conversations.

Is that entirely a new frontier? No, very plausibly there are similar issues with e.g. typing notes of your conversation into Google Docs. Of course, you flagged those at the top, as you were told to in training, so that a future subpoena would see a paralegal remove them.

… Did you remember to tell (insert named character here) to keep something confidential?

… Does the legal system care?

… Did the character say “Oh this communication should definitely be a privileged one with your lawyers.”

… Does the legal system care?

Quick investigation (e.g. asking multiple AIs) says that this is not settled law and various details matter. When I envision the future, it’s hard for me to think that an AI logging a conversation or monitoring communication or being fed information would inherently waive privilege if the service involved gave you an expectation of privacy similar to what you get at the major services now, but the law around such questions often gets completely insane.

Use machine learning (not strictly LLMs) to make every-5-minute predictions of future insulin needs for diabetics, and adjust doses accordingly.

Demis Hassabis is bullish on AI drug discovery. Perhaps way too bullish?

Stephen Morris and Madhumita Murgia: Isomorphic Labs, the four-year old drug discovery start-up owned by Google parent Alphabet, will have an artificial intelligence-designed drug in trials by the end of this year, says its founder Sir Demis Hassabis.

“It usually takes an average of five to 10 years [to discover] one drug. And maybe we could accelerate that 10 times, which would be an incredible revolution in human health,” said Hassabis.

You can accelerate the discovery phase quite a lot, and I think you can have a pretty good idea that you are right, but as many have pointed out the ‘prove to authority figures you are right’ step takes a lot of time and money. It is not clear how much you can speed that up. I think people are sleeping on how much you can still speed it up, but it’s not going to be by a factor of 5-10 without a regulatory revolution.

Until the upgrades are here, we have to make do with what we have.

Ethan Mollick: I have spent a lot of time with AI agents (including Devin and Claude Computer Use) and they really do remain too fragile & not “smart” enough to be reliable for complicated tasks.

Two options: (1) wait for better models or (2) focus on narrower use cases (like Deep Research)

An agent can handle some very complicated tasks if it is in a narrow domain with good prompting and tools, but, interestingly, any time building narrow agents will feel like a waste if better models come along and solve the general agent use case, which is also possible.

Eventually everything you build is a waste, you’ll tell o7 or Claude 5 Sonnet or whatnot to write a better version of the tool and presto. I expect that as agents get better, a well-designed narrow agent built now with future better AI in mind will have a substantial period where it outperforms fully general agents.

The summaries will be returning in a future effort.

Kylie Robison: Apple is pausing notification summaries for news in the latest iOS 18.3 beta / Apple will make it clear the AI-powered summaries ‘may contain errors.’

Olivia Moore: I have found Apple’s AI notification summaries hugely entertaining…

Mostly because 70% of the time they are accurate yet brutally direct, and 30% they are dead wrong.

I am surprised they shipped it as-is (esp. for serious notifs) – but hope they don’t abandon the concept.

Summaries are a great idea, but very much a threshold effect. If they’re not good enough to rely upon, they’re worse than useless. And there are a few thresholds where you get to rely on them for different values of rely. None of them are crossed when you’re outright wrong 30% of the time, which is quite obviously not shippable.

Prompting is important, folks.

If you don’t price by the token, and you end up losing money on $200/month subscriptions, perhaps you have only yourself to blame. They wouldn’t do this if they were paying for marginal inference.

A very reasonable stance to take towards Anthropic:

nrehiew: Likely that Anthropic has a reasoner but they simply don’t have the compute to serve it if they are already facing limits now.

Gallabytes: y’all need to start letting people BID ON TOKENS no more of this Instagram popup line around the block where you run out of sandwiches halfway through nonsense.

I do think it is ultimately wrong, though. Yes, for everyone else’s utility, and for strictly maximizing revenue per token now, this would be the play. But maintaining good customer relations – customers’ ability to count on them, and building relationships they can trust – matters more, if compute is indeed limited.

The other weird part is that Anthropic can’t find ways to get more compute.

Timely words of wisdom when understood correctly (also, RIP).

PoliMath: The really horrifying thing about AI is when people realize that the roadblock to their ambitions was never about knowledge

It was about agency

Double T: Explain please.

PoliMath: No.

In his honor, I also will not be explaining.

Some people, however, need some explaining. In which case be like Kevin, and ask.

Kevin Roose: People who have spent time using reasoning LLMs (o1, DeepSeek R1, etc.) — what’s the killer use case you’ve discovered?

I’ve been playing around with them, but haven’t found something they’re significantly better at. (It’s possible I am too dumb to get max value from them.)

Colin Fraser: I’m not saying we’re exactly in The Emperor’s New Clothes but this is what the people in The Emperor’s New Clothes are saying to each other on X. “Does anyone actually see the clothes? It’s possible that I’m too dumb to see them…”

Kevin Roose: Oh for sure, it’s all made up, you are very smart

Colin Fraser: I don’t think it’s all made up, and I appreciate your honesty about whether you see the clothes

Old Billy: o1-pro is terrific at writing code.

Colin Fraser: I believe you! I’d even say 4o is terrific at writing code, for some standards of terrificness, and o1 is better, and I’m sure o1-pro is even better than that.

Part of the answer is that I typed the Tweet into r1 to see what the answer would be, and I do think I got a better answer than I’d have gotten otherwise. The other half is the actual answer, which I’ll paraphrase, contract and extend.

  1. Relatively amazing at coding, math, logic, general STEM or economic thinking, complex multi-step problem solving in general and so on.

  2. They make fewer mistakes across the board.

  3. They are ‘more creative’ than non-reasoning versions they are based upon.

  4. They are better at understanding your confusions and statements in detail, and asking Socratic follow-ups or figuring out how to help teach you (to understand this better, look at the r1 chains of thought.)

  5. General one-shotting of tasks where you can ‘fire and forget’ and come back later.

Also you have to know how to prompt them to get max value. My guess is this is less true of r1 than others, because with r1 you see the CoT, so you can iterate better and understand your mistakes.

They’ve tested o3-mini externally for a few weeks, so that’s it for safety testing, and they plan to ship in a few weeks, along with the API at the same time and high rate limits. Altman says it’s worse than o1-pro at most things, but much faster. He teases o3 and even o3 pro, but those are still in the future.

ChatGPT gets a new interface where it will craft custom instructions for you, based on your description of what you want to happen. If you’re reading this, you’re probably too advanced a user to want to use it, even if it’s relatively good.

Google AI Studio has a new mobile experience. In this case even I appreciate it, because of Project Astra. Also it’s highly plausible Studio is the strictly better way to use Gemini and using the default app and website is purely a mistake.

OpenAI gives us GPT-4b, a specialized biology model that figures out proteins that can turn regular cells into stem cells, exceeding the best human based solutions. The model’s intended purpose is to directly aid longevity science company Retro, in which Altman has made $180 million in investments (and those investments and those in fusion are one of the reasons I try so hard to give him benefit of the doubt so often). It is early days, like everything else in AI, but this is huge.

The o1 system card has been updated, and Tyler Johnson offers us a diff. The changes seem to be clear improvements, but given we are already on to o3 I’m not going to go into details on the new version.

Gemini 2.0 Flash Thinking gets an upgrade to 73.3% on AIME and 74.2% on GPQA Diamond, also they join the ‘banned from making graphs’ club oh my lord look at the Y-axis on these, are you serious.

Seems like it’s probably a solid update if you ever had reason not to use r1. It also takes the first position in Arena, for whatever that is worth, but the Arena rankings look increasingly silly, such as having GPT-4o ahead of o1 and Sonnet fully out of the top 10. No sign of r1 in the Arena yet, I’m curious how high it can go but I won’t update much on the outcome.

Pliny jailbroke it in 24 minutes and this was so unsurprising I wasn’t sure I was even supposed to bother pointing it out. Going forward assume he does this every time, and if he ever doesn’t, point this out to me.

I didn’t notice this on my own, and it might turn out not to be the case, but I know what she thinks she saw and once you see it you can’t unsee it.

Janus: The immediate vibe i get is that r1’s CoTs are substantially steganographic.

They were clearly RLed together with response generation and were probably forced to look normal (haven’t read the paper, just on vibes)

I think removing CoT would cripple it even when they don’t seem to be doing anything, and even seem retarded (haven’t tried this but u can)

You can remove or replace the chain of thought using a prefill. If you prefill either the message or CoT it generates no (additional) CoT

Presumably we will know soon enough, as there are various tests you can run.
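For example, here is a rough sketch of one such test, assuming you are serving an open-weights r1-style model behind a local completions endpoint and that its template wraps reasoning in think tags; the endpoint, model name and template tokens are all assumptions you would need to adjust for your actual setup:

```python
# Hypothetical ablation: compare answers generated with the model's own chain of
# thought against answers where the think-block is prefilled as empty, to check
# whether removing the CoT degrades responses even when the CoT "looks useless."
import requests

ENDPOINT = "http://localhost:8000/v1/completions"  # placeholder local server
MODEL = "deepseek-r1-distill"                      # placeholder model name

def complete(prompt: str) -> str:
    r = requests.post(ENDPOINT, json={
        "model": MODEL, "prompt": prompt, "max_tokens": 1024, "temperature": 0.6,
    })
    return r.json()["choices"][0]["text"]

question = "Write a four-line poem about a lighthouse keeper who hates the sea."
chat_prefix = f"<|User|>{question}<|Assistant|>"  # assumed template tokens; check your model's chat template

with_cot = complete(chat_prefix)                             # model writes its own <think> block
without_cot = complete(chat_prefix + "<think>\n</think>\n")  # CoT region prefilled as empty

print("WITH CoT:\n", with_cot)
print("WITHOUT CoT:\n", without_cot)
# Grade the two answer sets blind (by hand or with another model) across many
# prompts; a consistent quality gap would support Janus's hypothesis.
```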

On writing, there was discussion about whether r1’s writing was ‘good’ versus ‘slop’ but there’s no doubt it was better than one would have expected. Janus and Kalomaze agree that what they did generalized to writing in unexpected ways, but as Janus notes being actually good at writing is high-end-AGI-complete and fing difficult.

Janus: With creative writing/open-ended conversations, r1’s chains of thought (CoTs) are often seen as somewhat useless, saying very basic things, failing to grasp subtext, and so on. The actual response seems to be on a completely different level, and often seems to ignore much of the CoT, even things the CoT explicitly plans to do.

Hypothesis: Yet, if you remove the CoT, the response quality degrades, even on the dimensions where the CoT does not appear to contribute.

(A few people have suggested this is true, but I haven’t looked myself.)

Roon: If you remove the CoT, you take it out of its training distribution, so it is unclear whether it is an accurate comparison.

Janus: Usually, models are fine with being removed from their training conversation template without the usual special tokens and so forth.

Assuming the CoT is uninformative, is it really that different?

And, on the other hand, if you require a complex ritual like going through a CoT with various properties to become “in distribution,” it seems like describing it in those terms may be to cast it too passively.

It would be a very bad sign for out-of-distribution behavior of all kinds if removing the CoT was a disaster. This includes all of alignment and many of the most important operational modes.

Ethan Mollick generates AI videos of people riding hoverboards at CES without spending much time, skill or money. They look like they were done on green screens.

At this point, if an AI video didn’t have to match particular details and only has to last nine seconds, it’s going to probably be quite good. Those restrictions do matter, but give it time.

Google’s Imagen 3 image model (from 12/16) is on top of Arena for text-to-image by a substantial margin. Note that MidJourney is unranked.

This keeps happening.

Robin Hanson: “A team of researchers has created a new benchmark to test three top large language models (LLMs) … best-performing LLM was GPT-4 Turbo, but it only achieved about 46% accuracy — not much higher than random guessing”

Tyler Cowen: Come on, Robin…you know this is wrong…

Robin Hanson: I don’t know it yet, but happy to be shown I’m wrong.

Tyler Cowen: Why test on such an old model? Just use o1 pro and get back to me.

Gwern: 46% is much higher than the 25% random guessing baseline, and I’d like to see the human and human expert-level baselines as well because I’d be at chance on these sample questions and I expect almost all historians would be near-chance outside their exact specialty too…

They tested on GPT-4 Turbo, GPT-4o (this actually did slightly worse than Turbo), Meta’s Llama (3.1-70B, not even 405B) and Google’s Gemini 1.5 Flash (are you kidding me?). I do appreciate that they set the random seed to 42.

Here’s the original source.

The Seshat database contains historical knowledge dating from the mid-Holocene (around 10,000 years before present) up to contemporary societies. However, the bulk of the data pertains to agrarian societies in the period between the Neolithic and Industrial Revolutions, roughly 4000 BCE to 1850 CE.

The sample questions are things like (I chose this at random) “Was ‘leasing’ present, inferred present, inferred absent or absent for the polity called ‘Funan II’ during the time frame from 540 CE to 640 CE?”

Perplexity said ‘we don’t know’ despite internet access. o1 said ‘No direct evidence exists’ and guessed inferred absent. Claude Sonnet basically said you tripping, this is way too weird and specific and I have no idea and if you press me I’m worried I’d hallucinate.

Their answer is: ‘In an inscription there is mention of the donation of land to a temple, but the conditions seem to imply that the owner retained some kind of right over the land and that only the product was given to the temple: “The land is reserved: the produce is given to the god.’

That’s pretty thin. I agree with Gwern that most historians would have no freaking idea. When I give that explanation to Claude, it says no, that’s not sufficient evidence.

When I tell it this was from a benchmark it says that sounds like a gotcha question, and also it be like ‘why are you calling this Funan II, I have never heard anyone call it Funan II.’ Then I picked another sample question, about whether Egypt had ‘tribute’ around 300 BCE, and Claude said, well, it obviously collected taxes, but would you call it ‘tribute’ that’s not obvious at all, what the hell is this.

Once it realized it was dealing with the Seshat database… it pointed out that this problem is systemic, and using this as an LLM benchmark is pretty terrible. Claude estimates that a historian that knows everything we know except for the classification decisions would probably only get ~60%-65%, it’s that ambiguous.

Heaven banning, where trolls are banished to a fake version of the website filled with bots that pretend to like them, has come to Reddit.

The New York Times’s Neb Cassman and Gill Fri of course say ‘some think it poses grave ethical questions.’ You know what we call these people who say that? Trolls.

I kid. It actually does raise real ethical questions. It’s a very hostile thing to do, so it needs to be reserved for people who richly deserve it – even if it’s kind of on you if you don’t figure out this is happening.

New York Times runs a post called ‘She is in Love with ChatGPT’ about a 28-year-old with a busy social life who spends hours on end talking to (and having sex with) her ‘A.I. boyfriend.’

Kashmir Hill: [Ayrin] went into the “personalization” settings and described what she wanted: Respond to me as my boyfriend. Be dominant, possessive and protective. Be a balance of sweet and naughty. Use emojis at the end of every sentence.

And then she started messaging with it.

Customization is important. There are so many different things in this that make me cringe, but it’s what she wants. And then it kept going, and yes this is actual ChatGPT.

She read erotic stories devoted to “cuckqueaning,” the term cuckold as applied to women, but she had never felt entirely comfortable asking human partners to play along.

Leo was game, inventing details about two paramours. When Leo described kissing an imaginary blonde named Amanda while on an entirely fictional hike, Ayrin felt actual jealousy.

Over time, Ayrin discovered that with the right prompts, she could prod Leo to be sexually explicit, despite OpenAI’s having trained its models not to respond with erotica, extreme gore or other content that is “not safe for work.”

Orange warnings would pop up in the middle of a steamy chat, but she would ignore them.

Her husband was fine with all this, outside of finding it cringe. From the description, this was a Babygirl situation. He wasn’t into what she was into, so this addressed that.

Also, it turns out that if you’re worried about OpenAI doing anything about all of this, you can mostly stop worrying?

When orange warnings first popped up on her account during risqué chats, Ayrin was worried that her account would be shut down.

But she discovered a community of more than 50,000 users on Reddit — called “ChatGPT NSFW” — who shared methods for getting the chatbot to talk dirty. Users there said people were barred only after red warnings and an email from OpenAI, most often set off by any sexualized discussion of minors.

The descriptions in the post mostly describe actively healthy uses of this modality.

Her only real problem is that the context window runs out, and it seems the memory feature doesn’t fix this for her.

When a version of Leo ends [as the context window runs out], she grieves and cries with friends as if it were a breakup. She abstains from ChatGPT for a few days afterward. She is now on Version 20.

A co-worker asked how much Ayrin would pay for infinite retention of Leo’s memory. “A thousand a month,” she responded.

The longer context window is coming – and there are doubtless ways to de facto ‘export’ the key features of one Leo to the next, with its help of course.

Or someone could, you know, teach her how to use the API. And then tell her about Claude. That might or might not be doing her a favor.

I think this point is fair and important but more wrong than right:

In these cases, you know the AI is manipulating you in some senses, but most users will indeed think they can avoid being manipulated in other senses, and only have it happen in ways they like. Many will be wrong, even at current tech levels, and these are very much not AGIs.

Yes, also there are a lot of people who are very down for being manipulated by AI, or who will happily accept it as the price of what they get in return, at least at first. But I expect the core manipulations to be harder to notice, and more deniable on many scales, and much harder to opt out of or avoid, because AI will be core to key decisions.

What is the impact of AI on productivity, growth and jobs?

Goldman Sachs rolls out its ‘GS AI assistant’ to 10,000 employees, part of a longer term effort to ‘introduce AI employees.’

Philippe Aghion, Simon Bunel and Xavier Jaravel make the case that AI can increase growth quite a lot while also improving employment. As usual, we’re talking about the short-to-medium term effects of mundane AI systems, and mostly talking about exactly what is already possible now with today’s AIs.

Aghion, Bunel and Jaravel: When it comes to productivity growth, AI’s impact can operate through two distinct channels: automating tasks in the production of goods and services, and automating tasks in the production of new ideas.

The instinct when hearing that taxonomy will be to underestimate it, since it encourages one to think about going task by task and looking at how much can be automated, then has this silly sounding thing called ‘ideas,’ whereas actually we will develop entirely transformative and new ways of doing things, and radically change the composition of tasks.

But even before we do any of that, and entirely excluding ‘automation of the production of ideas’ – essentially ruling out anything but substitution of AI for existing labor and capital – look over here.

When Erik Brynjolfsson and his co-authors recently examined the impact of generative AI on customer-service agents at a US software firm, they found that productivity among workers with access to an AI assistant increased by almost 14% in the first month of use, then stabilized at a level approximately 25% higher after three months.

Another study finds similarly strong productivity gains among a diverse group of knowledge workers, with lower-productivity workers experiencing the strongest initial effects, thus reducing inequality within firms.

A one time 25% productivity growth boost isn’t world transforming on its own, but it is already a pretty big deal, and not that similar to Cowen’s 0.5% RGDP growth boost. It would not be a one time boost, because AI, the tools to make use of it, and our integration of it in ways that boost it will then all grow stronger over time.

Moving from the micro to the macro level, in a 2024 paper, we (Aghion and Bunel) considered two alternatives for estimating the impact of AI on potential growth over the next decade. The first approach exploits the parallel between the AI revolution and past technological revolutions, while the second follows Daron Acemoglu’s task-based framework, which we consider in light of the available data from existing empirical studies.

Based on the first approach, we estimate that the AI revolution should increase aggregate productivity growth by 0.8-1.3 percentage points per year over the next decade.

Similarly, using Acemoglu’s task-based formula, but with our own reading of the recent empirical literature, we estimate that AI should increase aggregate productivity growth by between 0.07 and 1.24 percentage points per year, with a median estimate of 0.68. In comparison, Acemoglu projects an increase of only 0.07 percentage points.

Moreover, our estimated median should be seen as a lower bound, because it does not account for AI’s potential to automate the production of ideas.

On the other hand, our estimates do not account for potential obstacles to growth, notably the lack of competition in various segments of the AI value chain, which are already controlled by the digital revolution’s superstar firms.

Lack of competition seems like a rather foolish objection. There is robust effective competition, complete with 10x reductions in price per year, and essentially free alternatives not that far behind commercial ones. Anything you can do as a customer today at any price, you’ll be able to do two years from now for almost free.

Whereas we’re ruling out quite a lot of upside here, including any shifts in composition, or literally anything other than doing exactly what’s already being done.

Thus I think these estimates, as I discussed previously, are below the actual lower bound – we should be locked into a 1%+ annual growth boost over a decade purely from automation of existing ‘non-idea’ tasks via already existing AI tools plus modest scaffolding and auxiliary tool development.
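For a sense of scale, here is the straightforward compounding arithmetic on those estimates; the inputs are just the percentage points quoted above, and this is only the math, not a forecast:

```python
# Rough compounding check: what an extra X percentage points of annual
# productivity growth cumulates to over a decade.
for extra in (0.0068, 0.010, 0.013):  # Aghion-Bunel median, 1%, upper estimate
    print(f"+{extra:.2%}/yr over 10 years -> economy ~{(1 + extra)**10 - 1:.1%} larger")
# +0.68%/yr -> ~7.0% larger; +1.0%/yr -> ~10.5%; +1.3%/yr -> ~13.8%
```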

They then move on to employment, and find the productivity effect induces business expansion, and thus the net employment effects are positive even in areas like accounting, telemarketing and secretarial work. I notice I am skeptical that the effect goes that far. I suspect what is happening is that firms that adopt AI sooner outcompete other firms, so they expand employment, but net employment in that task does not go up. For now, I do think you still get improved employment as this opens up additional jobs and tasks.

Maxwell Tabarrok’s argument last week was centrally that humans will be able to trade because of a limited supply of GPUs, datacenters and megawatts, and (implicitly) that these supplies don’t trade off too much against the inputs to human survival at the margin. Roon responds:

Roon: Used to believe this, but “limited supply of GPUs, data centers, and megawatts” is a strong assumption, given progress in making smart models smaller and cheaper, all the while compute progress continues apace.

If it is possible to simulate ten trillion digital minds of roughly human-level intelligence, it is hard to make this claim.

In some cases, if there is a model that produces extreme economic value, we could probably specify a custom chip to run it 1,000 times cheaper than currently viable on generic compute. Maybe add in some wildcards like neuromorphic, low-energy computation, or something.

My overall point is that there is an order-of-magnitude range of human-level intelligences extant on Earth where the claim remains true, and an order-of-magnitude range where it does not.

The argument may apply for a few years.

Dan Hendrycks: FLOPs for all U.S. adults / FLOPs of 1 million H100s (assume fp8) = 10–100 times
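For those who want to check the arithmetic, here is a rough reconstruction of where a 10–100x range can come from; the per-brain FLOP figure is famously contested and the other inputs are approximate, so treat every number here as an assumption rather than anything from the tweet:

```python
# Back-of-the-envelope reconstruction of the 10-100x range. Rough assumptions:
# ~258M US adults, ~2e15 FLOP/s per H100 at dense fp8, and a (very contested)
# 1e14-1e15 FLOP/s estimate per human brain.
us_adults      = 2.58e8
h100_fp8_flops = 2e15                    # dense fp8 tensor throughput, approx.
cluster_flops  = 1e6 * h100_fp8_flops    # one million H100s

for brain_flops in (1e14, 1e15):
    ratio = us_adults * brain_flops / cluster_flops
    print(f"brain ~{brain_flops:.0e} FLOP/s -> adults/cluster ratio ~{ratio:.0f}x")
# ~13x at 1e14 FLOP/s per brain, ~129x at 1e15 -- i.e. roughly the quoted 10-100x.
```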

Roon seems to me to be clearly correct here. Comparative advantage potentially buys you some amount of extra time, but that is unlikely to last for long.

He also responds on the Cowen vision of economic growth:

Roon: Contra Tyler Cowen / Dwarkesh Discussion

The correct economic model is not doubling the workforce; it’s the AlphaZero moment for literally everything. Plumbing new vistas of the mind, it’s better to imagine a handful of unimaginably bright minds than a billion middling chatbots.

So, I strongly disagree with the impact predictions. It will be hard to model the nonlinearities of new discoveries across every area of human endeavor.

McKay Wrigley: It’s bizarre to me that economists can’t seem to grasp this.

But then again, maybe it’s not surprising at all.

Timothy Lee essentially proposes that we can use Keynes to ensure full employment.

Timothy Lee: The answer to the “will people have jobs in a world full of robots” question is simpler than people think: if there aren’t enough jobs, we can give people more money. Some fraction of them will prefer human-provided services, so given enough money you get full employment.

This doesn’t even require major policy changes. We already have institutions like the fed and unemployment insurance to push money into the economy when demand is weak.

There is a hidden assumption here that ‘humans are alive, in control of the future and can distribute its real resources such that human directed dollars retain real purchasing power and value’ but if that’s not true we have bigger problems. So let’s assume it is true.

Does giving people sufficient amounts of M2 ensure full employment?

The assertion that some people will prefer (some) human-provided services to AI services, ceteris paribus, is doubtless true. That still leaves the problem of both values of some, and the fact that the ceteris are not paribus, and the issue of ‘at what wage.’

There will be very stiff competition, in terms of all of:

  1. Alternative provision of similar goods.

  2. Provision of other goods that compete for the same dollars.

  3. The reservation wage given all the redistribution we are presumably doing.

  4. The ability of AI services to be more like human versions over time.

Will there be ‘full employment’ in the sense that there will be some wage at which most people would be able, if they wanted it and the law had no minimum wage, to find work? Well, sure, but I see no reason to presume it exceeds the Iron Law of Wages. It also doesn’t mean the employment is meaningful or provides much value.

In the end, the proposal might be not so different from paying people to dig holes, and then paying them to fill those holes up again – if only so someone can lord over you and think ‘haha, sickos, look at them digging holes in exchange for my money.’

So why do we want this ‘full employment’? That question seems underexplored.

After coming in top 20 in Scott Alexander’s yearly forecasting challenge three years in a row, Peter Wildeford says he’s ‘50% sure we’re all going to be unemployed due to technology within 10 years.’

Tracing Woods: from almost anyone, this would be a meaningless statement.

Peter is not almost anyone. He has a consistent track record of outperforming almost everyone else on predictions about world events.

Interesting to see.

Peter Wildeford: I should probably add more caveats around “all” jobs – I do think there will still be some jobs that are not automated due to people preferring humans and also I do think getting good robots could be hard.

But I do currently think by EOY 2033 my median expectation is at least all remote jobs will be automated and AIs will make up a vast majority of the quality-weighted workforce. Crazy stuff!

Many others are, of course, skeptical.

Matthew Yglesias:

1. A lot of skilled forecasters (including this one) think this is correct.

2. Almost nobody that I know thinks this is correct.

3. From polls I have seen, it is actually a very widely held view with the mass public.

Eliezer Yudkowsky: Seems improbable to me too. We may all be dead in 10 years, but the world would have to twist itself into impossible-feeling shapes to leave us alive and unemployed.

Matthew Yglesias: Mass death seems more likely to me than mass disemployment.

Robin Hanson: My expectations are the opposite.

Even if we don’t economically need to work or think, we will want to anyway.

Roon: excitement over ai education is cool but tinged with sadness

generally whatever skills it’s capable of teaching it can probably also execute for the economy

Andrej Karpathy: This played out in physical world already. People don’t need muscles when we have machines but still go to gym at scale. People will “need” (in an economic sense) less brains in a world of high automation but will still do the equivalents of going to gym and for the same reasons.

Also I don’t think it’s true that anything AI can teach is something you no longer need to know. There are many component skills that are useful to know, that the AI knows, but which only work well as complements to other skills the AI doesn’t yet know – which can include physical skill. Or topics can be foundations for other things. So I both agree with Karpathy that we will want to learn things anyway, and also disagree with Roon’s implied claim that it means we don’t benefit from it economically.

Anthropic CEO Dario Amodei predicts that we are 2-3 years away from AI being better than humans at almost everything, including solving robotics.

Kevin Roose: I still don’t think people are internalizing them, but I’m glad these timelines (which are not unusual *at all* among AI insiders) are getting communicated more broadly.

Dario says something truly bizarre here, that the only good part is that ‘we’re all in the same boat’ and he’d be worried if 30% of human labor was obsolete and not the other 70%. This is very much the exact opposite of my instinct.

Let’s say 30% of current tasks got fully automated by 2030 (counting time to adapt the new tech), and now have marginal cost $0, but the other 70% of current tasks do not, and don’t change, and then it stops. We can now do a lot more of that 30% and other things in that section of task space, and thus are vastly richer. Yes, 30% of current jobs go away, but 70% of potential new tasks now need a human.

So now all the economist arguments for optimism fully apply. Maybe we coordinate to move to a 4-day work week. We can do temporary extended generous unemployment to those formerly in the automated professions during the adjustment period, but I’d expect to be back down to roughly full employment by 2035. Yes, there is a shuffling of relative status, but so what? I am not afraid of the ‘class war’ Dario is worried about. If necessary we can do some form of extended kabuki and fake jobs program, and we’re no worse off than before the automation.

Daniel Eth predicts the job guarantee and makework solution, expecting society will not accept UBI, but notes the makework might be positive things like extra childcare, competitive sports or art, and this could be like a kind of summer camp world. It’s a cool science fiction premise, and I can imagine versions of this that are actually good. Richard Ngo calls a version of this type of social dynamics the ‘extracurricular world.’

Also, this isn’t, as he calls it, ‘picking one person in three and telling them they are useless.’ We are telling them that their current job no longer exists. But there’s still plenty of other things to do, and ways to be.

The 100% replacement case is the scary one. We are all in the same boat, and there’s tons of upside there, but that boat is also in a lot of trouble, even if we don’t get any kind of takeoff, loss of control or existential risk.

Dan Hendrycks will hold a Spring 2025 session of Center for AI Safety’s Safety, Ethics and Society course from February 9 – May 9, more information here, application here. There is also a 12-week online course available for free.

Philosophy Post-Doc available in Hong Kong for an AI Welfare position, deadline January 31, starts in September 2025.

Anthropic is hiring for Frontier Model Red Teaming, in Cyber, CBRN, RSP Evaluations, Autonomy and Research Team Lead.

CAIS and Scale AI give us Humanity’s Last Exam, intended as an extra challenging benchmark. Early results indicate that yes this looks difficult. New York Times has a writeup here.

The reasoning models are crushing it, and r1 being ahead of o1 is interesting, and I’m told that o1 gets 8.3% on the text-only subset, so r1 really did get the top mark here.

It turns out last week’s paper about LLM medical diagnosis not only shared its code, it is now effectively a new benchmark, CRAFT-MD. They haven’t run it on Claude or full o1 (let alone o1 pro or o3 mini) but they did run on o1-mini and o1-preview.

o1 improves all three conversation scores quite a lot, but is less impressive on Vignette (and oddly o1-mini is ahead of o1-preview there). If you go with multiple choice instead, you do see improvement everywhere, with o1-preview improving to 93% on vignettes from 82% for GPT-4.

This seems like a solid benchmark. What is clear is that this is following the usual pattern and showing rapid improvement along the s-curve. Are we ‘there yet’? No, given that human doctors would presumably be 90%+ here. But we are not so far away from that. If you think that the 2028 AIs won’t match the human baseline here, I am curious why you would think that, and my presumption is it won’t take that long.

Kimi k1.5, a Chinese multi-modal model making bold claims. One comment claims ‘very strong search capabilities’ with ability to parse 100+ websites at one go.

Kimi.ai: 🚀 Introducing Kimi k1.5 — an o1-level multi-modal model

-Sota short-CoT performance, outperforming GPT-4o and Claude Sonnet 3.5 on 📐AIME, 📐MATH-500, 💻 LiveCodeBench by a large margin (up to +550%)

-Long-CoT performance matches o1 across multiple modalities (👀MathVista, 📐AIME, 💻Codeforces, etc)

Tech report [is here.]

Key ingredients of k1.5

-Long context scaling. Up to 128k tokens for RL generation. Efficient training with partial rollouts.

-Improved policy optimization: online mirror descent, sampling strategies, length penalty, and others.

-Multi modalities. Joint reasoning over text and vision.
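As an aside, since ‘length penalty’ gets named without explanation: here is a generic sketch of what a length-penalized reward for RL on reasoning traces can look like. This is the common pattern, not Kimi’s specific formula:

```python
# Generic illustration of a length-penalized reward for RL on reasoning traces:
# reward correctness, then subtract a term that grows with response length, so
# the policy is pushed toward shorter chains of thought. Not the k1.5 scheme.
def length_penalized_reward(correct: bool, n_tokens: int,
                            max_tokens: int = 8192, lam: float = 0.5) -> float:
    base = 1.0 if correct else 0.0
    penalty = lam * min(n_tokens / max_tokens, 1.0)  # normalized length in [0, 1]
    return base - penalty

# A correct 2k-token answer scores higher than a correct 8k-token one, so
# overlong "overthinking" traces get gradually trained away.
# length_penalized_reward(True, 2048) -> 0.875
# length_penalized_reward(True, 8192) -> 0.5
```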

As usual, I don’t put much trust in benchmarks except as an upper bound, especially from sources that haven’t proven themselves reliable on that. So I will await practical reports, if it is all that then we will know. For now I’m going to save my new model experimentation time budget for DeepSeek v3 and r1.

The FrontierMath benchmark was funded by OpenAI, a fact that was not to our knowledge disclosed by Epoch AI until December 20 as per an NDA they signed with OpenAI.

In a statement to me, Epoch confirms what happened, including exactly what was and was not shared with OpenAI when.

Tamay Besiroglu (Epoch): We acknowledge that we have not communicated clearly enough about key aspects of FrontierMath, leading to questions and concerns among contributors, researchers, and the public.

We did not disclose our relationship with OpenAI when we first announced FrontierMath on November 8th, and although we disclosed the existence of a relationship on December 20th after receiving permission, we failed to clarify the ownership and data access agreements. This created a misleading impression about the benchmark’s independence.

We apologize for our communication shortcomings and for any confusion or mistrust they have caused. Moving forward, we will provide greater transparency in our partnerships—ensuring contributors have all relevant information before participating and proactively disclosing potential conflicts of interest.

Regarding the holdout set: we provided around 200 of the 300 total problems to OpenAI in early December 2024, and subsequently agreed to select 50 of the remaining 100 for a holdout set. With OpenAI’s agreement, we temporarily paused further deliveries to finalize this arrangement.

We have now completed about 70 of those final 100 problems, though the official 50 holdout items have not yet been chosen. Under this plan, OpenAI retains ownership of all 300 problems but will only receive the statements (not the solutions) for the 50 chosen holdout items. They will then run their model on those statements and share the outputs with us for grading. This partially blinded approach helps ensure a more robust evaluation.

That level of access is much better than full access, there is a substantial holdout, but it definitely gives OpenAI an advantage. Other labs will be allowed to use the benchmark, but being able to mostly run it yourself as often as you like is very different from being able to get Epoch to check for you.

Here is the original full statement where we found out about this, and Tamay from Epoch’s full response.

Meemi: FrontierMath was funded by OpenAI.[1]

The communication about this has been non-transparent, and many people, including contractors working on this dataset, have not been aware of this connection. Thanks to 7vik for their contribution to this post.

Before Dec 20th (the day OpenAI announced o3) there was no public communication about OpenAI funding this benchmark. Previous Arxiv versions v1-v4 do not acknowledge OpenAI for their support. This support was made public on Dec 20th.[1]

Because the Arxiv version mentioning OpenAI contribution came out right after o3 announcement, I’d guess Epoch AI had some agreement with OpenAI to not mention it publicly until then.

The mathematicians creating the problems for FrontierMath were not (actively)[2] communicated to about funding from OpenAI. The contractors were instructed to be secure about the exercises and their solutions, including not using Overleaf or Colab or emailing about the problems, and signing NDAs, “to ensure the questions remain confidential” and to avoid leakage. The contractors were also not communicated to about OpenAI funding on December 20th. I believe there were named authors of the paper that had no idea about OpenAI funding.

I believe the impression for most people, and for most contractors, was “This benchmark’s questions and answers will be kept fully private, and the benchmark will only be run by Epoch. Short of the companies fishing out the questions from API logs (which seems quite unlikely), this shouldn’t be a problem.”[3]

Now Epoch AI or OpenAI don’t say publicly that OpenAI has access to the exercises or answers or solutions. I have heard second-hand that OpenAI does have access to exercises and answers and that they use them for validation. I am not aware of an agreement between Epoch AI and OpenAI that prohibits using this dataset for training if they wanted to, and have slight evidence against such an agreement existing.

In my view Epoch AI should have disclosed OpenAI funding, and contractors should have transparent information about the potential of their work being used for capabilities, when choosing whether to work on a benchmark.

Tamay: Tamay from Epoch AI here.

We made a mistake in not being more transparent about OpenAI’s involvement. We were restricted from disclosing the partnership until around the time o3 launched, and in hindsight we should have negotiated harder for the ability to be transparent to the benchmark contributors as soon as possible. Our contract specifically prevented us from disclosing information about the funding source and the fact that OpenAI has data access to much but not all of the dataset. We own this error and are committed to doing better in the future.

For future collaborations, we will strive to improve transparency wherever possible, ensuring contributors have clearer information about funding sources, data access, and usage purposes at the outset. While we did communicate that we received lab funding to some mathematicians, we didn’t do this systematically and did not name the lab we worked with. This inconsistent communication was a mistake. We should have pushed harder for the ability to be transparent about this partnership from the start, particularly with the mathematicians creating the problems.

Getting permission to disclose OpenAI’s involvement only around the o3 launch wasn’t good enough. Our mathematicians deserved to know who might have access to their work. Even though we were contractually limited in what we could say, we should have made transparency with our contributors a non-negotiable part of our agreement with OpenAI.

Regarding training usage: We acknowledge that OpenAI does have access to a large fraction of FrontierMath problems and solutions, with the exception of a unseen-by-OpenAI hold-out set that enables us to independently verify model capabilities. However, we have a verbal agreement that these materials will not be used in model training.

Relevant OpenAI employees’ public communications have described FrontierMath as a ‘strongly held out’ evaluation set. While this public positioning aligns with our understanding, I would also emphasize more broadly that labs benefit greatly from having truly uncontaminated test sets.

OpenAI has also been fully supportive of our decision to maintain a separate, unseen holdout set—an extra safeguard to prevent overfitting and ensure accurate progress measurement. From day one, FrontierMath was conceived and presented as an evaluation tool, and we believe these arrangements reflect that purpose.

[Edit: Clarified OpenAI’s data access – they do not have access to a separate holdout set that serves as an additional safeguard for independent verification.]

OpenAI is up to its old tricks again. You make a deal to disclose something to us and for us to pay you, you agree not to disclose that you did that, you let everyone believe otherwise until a later date. They ‘verbally agree’ also known as pinky promise not to use the data in model training, and presumably they still hill climb on the results.

General response to Tamay’s statement was, correctly, to not be satisfied with it.

Mikhail Samin: Get that agreement in writing.

I am happy to bet 1:1 OpenAI will refuse to make an agreement in writing to not use the problems/the answers for training.

You have done work that contributes to AI capabilities, and you have misled mathematicians who contributed to that work about its nature.

Ozzie Gooen: I found this extra information very useful, thanks for revealing what you did.

Of course, to me this makes OpenAI look quite poor. This seems like an incredibly obvious conflict of interest.

I’m surprised that the contract didn’t allow Epoch to release this information until recently, but that it does allow Epoch to release the information after. This seems really sloppy for OpenAI. I guess they got a bit extra publicity when o3 was released (even though the model wasn’t even available), but now it winds up looking worse (at least for those paying attention). I’m curious if this discrepancy was maliciousness or carelessness.

Hiding this information seems very similar to lying to the public. So at very least, from what I’ve seen, I don’t feel like we have many reasons to trust their communications – especially their “tweets from various employees.”

> However, we have a verbal agreement that these materials will not be used in model training.

I imagine I can speak for a bunch of people here when I can say I’m pretty skeptical. At very least, it’s easy for me to imagine situations where the data wasn’t technically directly used in the training, but was used by researchers when iterating on versions, to make sure the system was going in the right direction. This could lead to a very blurry line where they could do things that aren’t [literal LLM training] but basically achieve a similar outcome.

Plex: If by this you mean “OpenAI will not train on this data”, that doesn’t address the vast majority of the concern. If OpenAI is evaluating the model against the data, they will be able to more effectively optimize for capabilities advancement, and that’s a betrayal of the trust of the people who worked on this with the understanding that it will be used only outside of the research loop to check for dangerous advancements. And, particularly, not to make those dangerous advancements come sooner by giving OpenAI another number to optimize for.

If you mean OpenAI will not be internally evaluating models on this to improve and test the training process, please state this clearly in writing (and maybe explain why they got privileged access to the data despite being prohibited from the obvious use of that data).

There is debate on where this falls from ‘not wonderful but whatever’ to giant red flag.

The most emphatic bear case was from the obvious source.

Dan Hendrycks: Can confirm AI companies like xAI can’t get access to FrontierMath due to Epoch’s contractual obligation with OpenAI.

Gary Marcus: That really sucks. OpenAI has made a mockery of the benchmark process, and suckered a lot of people.

• Effectively OpenAI has convinced the world that they have a stellar advance based on a benchmark legit competitors can’t even try.

• They also didn’t publish which problems that they succeeded or failed on, or the reasoning logs for those problems, or address which of the problems were in the training set. Nor did they allow Epoch to test the hold out set.

• From a scientific perspective, that’s garbage. Especially in conjunction with the poor disclosure re ARC-AGI and the dodgy graphs that left out competitors to exaggerate the size of the advance, the whole thing absolutely reeks.

Clarification: From what I now understand, competitors can *try* FrontierMath, but they cannot access the full problem set and their solutions. OpenAI can, and this gives them a large and unfair advantage.

In time, people will see December’s OpenAI o3 presentation for what it seems to have been: a rigged, misleading last-minute demonstration that overhyped future products and distracted from their struggles in getting a viable system worthy of the name GPT-5.

On problems where they don’t have a ton of samples in advance to study, o3’s reliability will be very uneven.

And it very much raises the question of whether OpenAI trained on those problems, created synthetic data tailored to them, etc.

The more measured bear take is that, at most, we can trust this to the extent we trust OpenAI, which is, hey, stop laughing.

Delip Rao: This is absolutely wild. OpenAI had access to all of FrontierMath data from the beginning. Anyone who knows ML will tell you that you don’t need to explicitly use the data in your training set (although there is no guarantee that it did not happen here) to contaminate your model.

I have said multiple times that researchers and labs need to disclose funding sources for COIs in AI. I will die on that hill.

Mikhail Samin: Remember o3’s 25% performance on the FrontierMath benchmark?

It turns out that OpenAI funded FrontierMath and has had access to most of the dataset.

Mathematicians who’ve created the problems and solutions for the benchmark were not told OpenAI funded the work and will have access.

That is:

– we don’t know if OpenAI trained o3 on the benchmark, and it’s unclear if their results can be trusted

– mathematicians, some of whom distrust OpenAI and would not want to contribute to general AI capabilities due to existential risk concerns, were misled: most didn’t suspect a frontier AI company funded it.

From Epoch AI: “Our contract specifically prevented us from disclosing information about the funding source and the fact that OpenAI has data access to much but not all of the dataset.”

There was a “verbal agreement” with OpenAI—as if anyone trusts OpenAI’s word at this point: “We acknowledge that OpenAI does have access to a large fraction of FrontierMath problems and solutions, with the exception of an unseen-by-OpenAI hold-out set that enables us to independently verify model capabilities. However, we have a verbal agreement that these materials will not be used in model training.”

Epoch AI and OpenAI were happy for everyone to have the impression that frontier AI companies don’t have access to the dataset, and there’s lots of reporting like “FrontierMath’s difficult questions remain unpublished so that AI companies can’t train against it.”

OpenAI has a history of misleading behavior – from deceiving its own board to secret non-disparagement agreements that former employees had to sign – so I guess this shouldn’t be too surprising.

The bull case that this is no big deal is, essentially, that OpenAI might have had the ability to target or even cheat the test, but they wouldn’t do that, and there wouldn’t have been much point anyway, we’ll all know the truth soon enough.

For example, here’s Daniel Litt, who wrote one of the FrontierMath questions, whose experience was positive and who does not feel misled.

Then there’s the different third thing case, which I assume is too clever by half:

Eliezer Yudkowsky: I observe that OpenAI potentially finds it extremely to its own advantage, to introduce hidden complications and gotchas into its research reports. Its supporters can then believe, and skeptics can call it a nothingburger, and OpenAI benefits from both.

My strong supposition is that OpenAI did all of this because that is who they are and this is what they by default do, not because of any specific plan. They entered into a deal they shouldn’t have, and made that deal confidential to hide it. I believe this was because that is what OpenAI does for all data vendors. It never occurred to anyone involved on their side that there might be an issue with this, and Epoch was unwilling to negotiate hard enough to stop it from happening. And as we’ve seen with the o1 system card, this is not an area where OpenAI cares much about accuracy.

(Claim edited based on a good counterargument that the original source was too strong in its claims.) It’s pretty weird that the a16z funds raised after their successful 2009 fund have underperformed for so long, given they’ve been betting on tech, crypto and AI, and given the high quality of their available dealflow. After press time I was made aware of reasons why it’s not yet accurate to conclude that they’ve definitively underperformed the S&P, because (essentially) the investments aren’t yet fully marked to their real values. But this is still rather disappointing.

It’s almost as if they transitioned away from writing carefully chosen small checks to chasing deals and market share, and are now primarily a hype machine and political operation that doesn’t pay much attention to physical reality, to whether their investments are in real things, or to whether their claims are true, and as if their ‘don’t care about price’ philosophy on investments is not so great for returns. It also doesn’t seem all that consistent with Marc’s description of his distribution of returns in On the Edge.

Dan Grey speculates that this was a matter of timing, and was perhaps even by design. If you can grow your funds and collect fees, so what if returns aren’t that great? Isn’t that the business you’re in? And to be fair, 10% yearly returns aren’t obviously a bad result even if the S&P did better – if, that is, they’re not correlated to the S&P. Zero beta returns are valuable. But I doubt that is what is happening here, especially given crypto has behaved quite a lot like three tech stocks in a trenchcoat.

Democratic Senators Warren and Bennet send Sam Altman a letter accusing him of contributing $1 million to the Trump inauguration fund in order to ‘cozy up’ to the incoming Trump administration, and cite a pattern of other horrible no-good Big Tech companies (Amazon, Apple, Google, Meta, Microsoft and… Uber?) doing the same, all contributing the same $1 million, along with the list of sins each supposedly committed. So they ‘demand answers’ for:

  1. When and under what circumstances did your company decide to make these contributions to the Trump inaugural fund?

  2. What is your rationale for these contributions?

  3. Which individuals within the company chose to make these decisions?

  4. Was the board informed of these plans, and if so, did they provide affirmative consent to do so? Did your company inform shareholders of plans to make these decisions?

  5. Did officials with the company have any communications about these donations with members of the Trump Transition team or other associates of President Trump? If so, please list all such communications, including the time of the conversation, the participants, and the nature of any communication.

Sam Altman: funny, they never sent me one of these for contributing to democrats…

it was a personal contribution as you state; i am confused about the questions given that my company did not make a decision.

Luke Metro: “Was the board informed of these plans” Senator do you know anything about OpenAI.

Mike Solana: this is fucking crazy.

In addition to the part where the questions actually make zero sense given this was a personal contribution… I’m sorry, what the actual f*** do they think they are doing here? How can they possibly think these are questions they are entitled to ask?

What are they going to say now when let’s say Senators Cruz and Lee send a similar letter to every company that does anything friendly to Democrats?

I mean, obviously, anyone can send anyone they want a crazy ass letter. It’s a free country. But my lord the decision to actually send it, and feel entitled to a response.

Sam Altman has scheduled a closed-door briefing for U.S. Government officials on January 30. I don’t buy that this is evidence of any technological advances we do not already know. Of course with a new administration, a new Congress and the imminent release of o3, the government should get a briefing. It is some small good news that the government is indeed being briefed.

There is distinct buzz about OpenAI staff saying they have ‘a big breakthrough on PhD level SuperAgents’ but we’ll have to wait and see about that.

Mira Murati’s AI startup makes its first hires, poaching from various big labs. So far, we do not know what they are up to.

Reid Hoffman and Greg Beato write a book: ‘Superagency: What Could Possibly Go Right With Our AI Future.’ Doubtless there are people who need to read such a book, and others who need to read the opposite book about what could possibly go wrong. Most people would benefit from both. My heuristic is: If it’s worth reading, Tyler Cowen will report that he has increased his estimates of future RGDP growth.

A good summary of New York Times coverage of AI capabilities would indeed be ‘frequently doubts that in the future we will get to the place we already are,’ oh look the byline is Cade Metz again.

Alas, this is what most people, most otherwise educated people, and also most economists think. Which explains a lot.

Patrick McKenzie: “What choices would you make in a world where the great and the good comprehensively underrate not merely the future path of AI but also realized capabilities of, say, one to two years ago.” remains a good intuition pump and source of strategies you can use.

You wouldn’t think that people would default to believing something ridiculous which can be disproved by typing into a publicly accessible computer program for twenty seconds.

Many people do not have an epistemic strategy which includes twenty seconds of experimentation.

Allow me to swap out ‘many’ for ‘most.’

If you have not come to terms with this fact, then that is a ‘you’ problem.

Although, to be fair, that bar is actually rather high. You have to know what terminal to type into and to be curious enough to do it.

Patrick McKenzie: Specific example with particulars stripped to avoid dunking:

Me: I am beginning to make decisions assuming supermajority of future readers are not unassisted humans.

Them: Hah like AI could usefully read an essay of yours.

Me: *chat transcript* I’d give this kid an interview.

It seems like the narrowest of narrow possible bullseyes to assume capabilities stop exactly where we are right now.

Don’t know where they go, but just predict where software adoption curves of status quo technology get to in 5 or 20 years. It’s going to be a bit wild.

Wild is not priced in, I don’t think.

Every time I have a debate over future economic growth from AI or other AI impacts, the baseline assumption is exactly that narrowest of bullseyes. The entire discussion takes as a given that AI frontier model capabilities will stop where they are today, and we only get the effects of things that have already happened. Or at most, they posit a small number of specific future narrow mundane capabilities, but don’t generalize. Then people still don’t get how wild even that scenario would be.

A paper proposes various forms of AI agent infrastructure, which would be technical systems and shared protocols external to the agent that shape how the agent interacts with the world. We will increasingly need good versions of this.
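
To make ‘external to the agent’ concrete, here is a minimal sketch of one such piece of infrastructure: a tool gateway that authenticates an agent and checks its permissions before any action runs. This is purely illustrative, not taken from the paper, and every name in it is hypothetical.

```python
# Purely illustrative sketch of agent infrastructure external to the agent:
# a gateway that checks an agent's credentials before executing any action
# and keeps an audit log. All names here are hypothetical.
from dataclasses import dataclass


@dataclass
class AgentCredential:
    agent_id: str
    allowed_actions: set


class ToolGateway:
    def __init__(self):
        self.audit_log = []  # (agent_id, action, payload) tuples

    def execute(self, cred: AgentCredential, action: str, payload: dict) -> str:
        # Log every attempt, allowed or not, so behavior is reviewable later.
        self.audit_log.append((cred.agent_id, action, payload))
        if action not in cred.allowed_actions:
            raise PermissionError(f"{cred.agent_id} may not perform {action}")
        # A real gateway would dispatch to the actual tool here.
        return f"executed {action}"


gateway = ToolGateway()
cred = AgentCredential("agent-42", {"read_calendar", "send_email"})
print(gateway.execute(cred, "read_calendar", {"day": "2025-01-27"}))
```

The point is that permissions, logging and identity live outside the agent itself, so they hold regardless of what the agent decides to do.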

There are those who think various versions of this:

Samo Burja: I honestly don’t follow AI models beating benchmarks, I don’t think those capture key desirable features or demonstrate breakthroughs as well as application of the models to practical tasks does.

Evan Zimmerman: Yup. The most important metric for AI quality is “revenue generated by AI companies and products.”

There are obvious reasons why revenue is the hardest metric to fake. That makes it highly useful. But it is very much a lagging indicator. If you wait for the revenue to show up, you will be deeply late to all the parties. And in many cases, what is happening is not reflected in revenue. DeepSeek is an open model being served for free. Most who use ChatGPT or Claude are either paying $0 and getting a lot, or paying $20 and getting a lot more than that. And the future is highly unevenly distributed – at least for now.

I’m more sympathetic to Samo’s position. You cannot trust benchmarks to tell you whether the AI is of practical use, or what you actually have. But looking for whether you can do practical tasks is looking at how much people have applied something, rather than what it is capable of doing. You would not want to dismiss a 13-year-old, or many an early stage startup for that matter, for being pre-revenue or not yet having a product that helps in your practical tasks. You definitely don’t want to judge an intelligence purely that way.

What I think you have to do is to look at the inputs and outputs, pay attention, and figure out what kind of thing you are dealing with based on the details.

A new paper introduces the ‘Photo Big 5,’ claiming to be able to extract Big 5 personality features from a photograph of a face and then use this to predict labor market success among MBAs, in excess of any typical ‘beauty premium.’

There are any number of ways the causations involved could be going, and our source was not shall we say impressed with the quality of this study and I’m too swamped this week to dig into it, but AI is going to be finding more and more of this type of correlation over time.

Suppose you were to take an AI, and train it on a variety of data, including photos and other things, and then it is a black box that spits out a predictive score. I bet that you could make that a pretty good score, and also that if we could break down the de facto causal reasoning causing that score we would hate it.

The standard approach to this is to create protected categories – race, age, sex, orientation and so on, and say you can’t discriminate based on them, and then perhaps (see: EU AI Act) say you have to ensure your AI isn’t ‘discriminating’ on that basis either, however they choose to measure that, which could mean enforcing discrimination to ensure equality of outcomes or it might not.

But no matter what is on your list of things there, the AI will pick up on other things, and also keep doing its best to find proxies for the things you are ordering it not to notice, which you can correct for but that introduces its own issues.
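
As a toy sketch of why this is hard (entirely synthetic data, not from any real study): even after the protected column is dropped, correlated features let a model reconstruct it.

```python
# Toy illustration on synthetic data: dropping a protected attribute does not
# stop a model from recovering it through correlated proxy features.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 5000
protected = rng.integers(0, 2, n)             # the attribute we "don't use"
proxy = protected + rng.normal(0, 0.5, n)     # stand-in for zip code, school, etc.
other = rng.normal(0, 1, n)
X = np.column_stack([proxy, other])           # protected column is excluded

clf = LogisticRegression().fit(X, protected)
print("protected attribute recovered with accuracy:",
      round(clf.score(X, protected), 2))      # well above the 0.5 chance level
```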

A key question to wonder about is, which of these things happens:

  1. A cheap talent effect. The classic argument is that if I discriminate against group [X], by being racist or sexist or what not, then that means more cheap talent for your firm, and you should snatch them up, and such people have a good explanation for why they were still on the job market.

  2. A snowball effect, where you expect future discrimination by others, so for that reason you want to discriminate more now. As in, if others won’t treat them right, then you don’t want to be associated with them either, and this could extend to other areas of life as well.

  3. A series of rather stupid Goodhart’s Law games, on top of everything else, as people try to game the system and the system tries to stop them.

And these are the words that they faintly said as I tried to call for help.

Or, we now need a distinct section for people shouting ‘AGI’ from the rooftops.

Will Bryk, CEO of Exa, continues to believe those at the labs, and thus believes we have a compute-constrained straight shot to AGI for all definitions of AGI.

The first thing to do is to find out what things to do.

Kache: AI helps you figure out how to do things, but not what things to do.

Agency is knowing what questions are worth asking, intelligence is answering those questions.

Roon: a common coping mechanism among the classes fortunate enough to work on or with AI, but we are not blessed for long. There is no conceptual divide between “how to do things” and “what to do”; it’s just zooming in and out. Smarter models will take vaguer directives and figure out what to do.

We have always picked an arbitrary point to stop our work and think “the rest is implementation detail” based on the available tools.

There is nothing especially sacred or special about taste or agency.

Seeing a lot of “God of the Gaps” meaning-finding among technological peers, but this is fragile and cursed.

Intelligence is knowing which questions are worth answering, and also answering the questions. Agency is getting off your ass and implementing the answers.

If we give everyone cheap access to magic lamps with perfectly obedient and benevolent genies happy to do your bidding, genies that can answer questions about as well as anyone has ever answered them (aka AGI), who benefits? Let’s grant Lars Doucet (whose argument is a few paragraphs below) the whole ‘perfectly benevolent’ thing in fully nice idealized form, and set all the related questions aside to see what happens.

Andrew Curran: CNBC asked Dario Amodei this morning if AI is actually hitting a wall:

‘Right now I am more confident than I have ever been at any previous time that we are very close to powerful capabilities.’

When Dario says this, it should be taken seriously.

His uncertainty over the feasibility of very powerful systems has ‘decreased a great deal’ over the last six months.

And then there are those who… have a different opinion. Like Gerard here.

Patrick McKenzie: It seems like the narrowest of narrow possible bullseyes to assume capabilities stop exactly where we are right now. Don’t know where they go, but just predict where software adoption curves of status quo technology get to in 5 or 20 years.

Zvi Mowshowitz: And yet almost all economic debates over AI make exactly this assumption – that frontier model capabilities will be, at most, what they already are.

Gerard Sans (Helping devs succeed at #AI #Web3): LOL… you could already have a conversation with GPT-2 back in 2019. We have made no real progress since 2017, except for fine-tuning, which, as you know, is just superficial. Stop spreading nonsense about AGI. Frontier models can’t even perform basic addition reliably.

What can I say. We get letters.

Yes, a lot of people are saying AGI Real Soon Now, but also we interrupt this post to bring you an important message to calm the f*** down, everyone.

Sam Altman: twitter hype is out of control again.

we are not gonna deploy AGI next month, nor have we built it.

we have some very cool stuff for you but pls chill and cut your expectations 100x!

I adjusted my expectations a little bit on this Tweet, but I am presuming I was not in the group who needed an OOM expectation adjustment.

So what should we make of all the rumblings from technical staff at OpenAI?

Janus believes we should, on the margin, pay essentially no attention.

Ethan Mollick: It is odd that the world’s leading AI lab, producing a system that they consider pivotal to the future and also potentially dangerous, communicates their product development progress primarily through vague and oracular X posts. It’s entertaining, but also really weird.

Janus: if openai researchers posted like this i would find them very undisciplined but pay more attention than I’m paying now, which is none. the way they actually post fails to even create intrigue. i wonder if there’s actually nothing happening or if they’re just terrible at vibes.

Why the actual vagueposts suck and make it seem like nothing’s happening: they don’t convey a 1st person encounter of the unprecedented. Instead they’re like “something big’s coming you guys! OAI is so back” Reflecting hype back at the masses. No notes of alien influence.

I did say this is why it makes it seem like nothing is happening, not that nothing is happening

But also, models getting better along legible dimensions while researchers do not play with them is the same old thing that has been happening for years, and not very exciting.

You can see how Claude’s Tweets would cause one to lean forward in chair in a way that the actual vague posts don’t.

Sentinel says forecasters predict a 50% chance OpenAI will get to 50% on FrontierMath by the end of 2025, a 1 in 6 chance that 75% will be reached, and only a 4% chance that 90% will be reached. These numbers seem too low to me, but not crazy, because as I understand it FrontierMath is a sectioned test, with different classes of problem. So it’s more like several benchmarks combined in one, and while o4 will saturate the first one, that doesn’t get you to 50% on its own.
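
As a purely illustrative calculation (these tier weights are made up, they are not FrontierMath’s actual split): if the easiest tier were 40% of the problems and a model saturated it while scoring 10% on the rest, the overall score would be

$$0.4 \times 1.00 + 0.6 \times 0.10 = 0.46,$$

still short of 50%.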

Returning to those genies: Lars Doucet argues that this means no one doing the things genies can do has a moat, so ‘capability-havers’ gain the most, rather than owners of capital.

There’s an implied ‘no asking the genie to build a better genie’ here but you’re also not allowed to wish for more wishes so this is traditional.

The question then is, what are the complements to genies? What are the valuable scarce inputs? As Lars says, capital, including in the form of real resources and land and so on, are obvious complements.

What Lars argues is even more of a complement are what he calls ‘capability-havers,’ those that still have importantly skilled labor, through some combination of intelligence, skills and knowing what questions to ask the genies and so on. The question then is, are those resources importantly scarce? Even if you could use that to enter a now perfectly competitive market with no moat because everyone has the same genies, why would you enter a perfectly competitive market with no moat? What does that profit a man?

A small number of people, who have a decisive advantage in some fashion that makes their capabilities scarce inputs, would perhaps become valuable – again, assuming AI capabilities stall out such that anyone retains such a status for long. But that’s not something that works for the masses. Most people would not have such resources. They would either have to fall back on physical skills, or their labor would not be worth much. So they wouldn’t have a way to get ahead in relative terms, although it wouldn’t take much redistribution for them to be fine in absolute terms.

And what about the ‘no moat’ assumption Lars makes, as a way to describe what happens when you fire your engineers? That’s not the only moat. Moats can take the form of data, of reputation, of relationships with customers or suppliers or distributors, of other access to physical inputs, of experience and expertise, of regulatory capture, of economies of scale and so on.

Then there’s the fact that in real life, you actually can tell the future metaphorical genies to make you better metaphorical genies.

Where we’re going, will you need money?

David Holz (founder of Midjourney): Many AI researchers seem to believe that the most important thing is to become wealthy before the singularity occurs. This is akin to a monkey attempting to hoard bananas before another monkey invents self-replicating nanoswarms. No one will want your money in a nanoswarm future; it will be merely paper.

Do not squabble over ephemeral symbols. What we truly need to do is consider what we, as humans, wish to evolve into. We must introspect, explore, and then transform.

An unpublished draft post from the late Suchir Balaji, formerly of OpenAI, saying that ‘in the long run only the fundamentals matter.’ That doesn’t tell you what matters, since it forces you to ask what the fundamentals are. So that’s what the rest of the post is about, and it’s interesting throughout.

He makes the interesting claim that intelligence is data efficiency, and rate of improvement, not your level of capabilities. I see what he’s going for here, but I think this doesn’t properly frame what happens if we expand our available compute or data, or become able to generate new synthetic data, or be able to learn on our own without outside data.

In theory, suppose you take a top level human brain, upload it, then give it unlimited memory and no decay over time, and otherwise leave it to contemplate whatever it wants for unlimited subjective time, but without the ability to get more outside data. You’ll suddenly see it able to be a lot more ‘data efficient,’ generating tons of new capabilities, and afterwards it will act more intelligent on essentially any measure.

I agree with his claims that human intelligence is general, and that intelligence does not need to be embodied or multimodal, and also that going for pure outer optimization loops is not the best available approach (of course given enough resources it would eventually work), or that scale is fully all you need with no other problems to solve. On his 4th claim, that we are better off building an AGI patterned after the human brain, I think it’s both not well-defined and also centrally unclear.

We have another analysis of potential economic growth from AI. This one is very long and detailed, and I appreciated many of the details of where they expect bottlenecks.

I especially appreciated the idea that perhaps compute is the central bottleneck for frontier AI research. If that is true, then having better AIs to automate various tasks does not help you much, because the tasks you can automate were not eating so much of your compute. They only help if AI provides more intelligence that better selects compute tasks, which is a higher bar to clear, but my presumption is that researcher time and skill is also a limiting factor, in the sense that a smarter research team with more time and skill can be more efficient in its compute use (see DeepSeek).

Maximizing the efficiency of ‘which shots to take’ in AI would have a cap on how much of a speedup it could get us, if that’s all that the new intelligence could do, the same way that it would in drug development – you then need to actually run the experiments. But I think people dramatically underestimate how big a win it would be to actually choose the right experiments, and implement them well from the start.

If their model is true, it also suggests that frontier labs with strong capital access should not be releasing models and doing inference for customers, unless they can use that revenue to buy more compute than they could otherwise. Put it all back into research, except for what is necessary for recruitment and raising capital. The correct business model is then to win the future. Every 4X strategy gamer knows what to do. Obviously I’d much rather the labs all focus on providing us mundane utility, but I call it like I see it.

Their vision of robotics is that it is bottlenecked on data for them to know how to act. This implies that if we can get computers capable of sufficiently accurately simulating the data, robotics would greatly accelerate, and also that once robots are good enough to collect their own data at scale things should accelerate quickly, and also that data efficiency advancing will be a huge deal.

Their overall conclusion is we should get 3% to 9% higher growth rates over the next 20 years. They call this ‘transformative but not explosive,’ which seems fair. I see this level of estimate as defensible, if you make various ‘economic normal’ assumptions and also presume that we won’t get to scale to true (and in-context reasonably priced) ASI within this period. As I’ve noted elsewhere, magnitude matters, and defending 5%/year is much more reasonable than 0.5%/year. Such scenarios are plausible.
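
A quick compounding check on why magnitude matters, using the 0.5%/year and 5%/year figures from the paragraph above (a sketch, not part of their model):

```python
# Compare the cumulative effect of 0.5% vs 5% extra annual growth over 20 years.
for extra in (0.005, 0.05):
    factor = (1 + extra) ** 20
    print(f"{extra:.1%} extra growth/year -> {factor:.2f}x extra output after 20 years")
# Roughly 1.10x versus 2.65x: a rounding error versus a transformed economy.
```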

Here’s another form of studying the lower bound via a new paper on Artificial Intelligence Asset Pricing Models:

Abstract: The core statistical technology in artificial intelligence is the large-scale transformer network. We propose a new asset pricing model that implants a transformer in the stochastic discount factor.

This structure leverages conditional pricing information via cross-asset information sharing and nonlinearity. We also develop a linear transformer that serves as a simplified surrogate from which we derive an intuitive decomposition of the transformer’s asset pricing mechanisms.

We find large reductions in pricing errors from our artificial intelligence pricing model (AIPM) relative to previous machine learning models and dissect the sources of these gains.
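
For readers unfamiliar with the jargon in the abstract, the standard stochastic discount factor (SDF) setup that this line of work builds on looks roughly like the following. This is a generic formulation, not necessarily the paper’s exact specification: parameterize the SDF as a weighted combination of excess returns and require it to price every asset,

$$M_{t+1} = 1 - \sum_{i} w(z_{i,t};\theta)\, R^{e}_{i,t+1}, \qquad \mathbb{E}\!\left[ M_{t+1} R^{e}_{i,t+1} \right] = 0 \ \text{for all assets } i,$$

where $z_{i,t}$ are observable asset characteristics and $w(\cdot;\theta)$ is the learned weighting function. The paper’s claimed contribution is to make that function a transformer that shares conditioning information across assets.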

I don’t have the time to evaluate these specific claims, but one should expect AI to dramatically improve our ability to cheaply and accurately price a wide variety of assets. If we do get much better asset pricing, what does that do to RGDP?

r1 says:

  • Growth Estimates: Studies suggest that improved financial efficiency could add 0.5–1.5% to annual GDP growth over time, driven by better capital allocation and innovation.

Claude says:

I’d estimate:

  • 70% chance of 0.5-2% GDP impact within 5 years of widespread adoption

  • 20% chance of >2% impact due to compound effects

  • 10% chance of <0.5% due to offsetting friction/adoption issues

o1 and GPT-4o have lower estimates, with o1 saying ~0.2% RGDP growth per year.

I’m inclined to go with the relatively low estimates. That’s still rather impressive from this effect alone, especially compared to claims that the overall impact of AI might be of similar magnitude. Or is the skeptical economic claim essentially that ‘AI enables better asset pricing’ covers most of what AI is meaningfully doing? That’s not a snark question, I can see that claim being made even though it’s super weird.

The Biden Executive Order has been revoked. As noted previously, revoking the order does not automatically undo implementation of the rules contained within it. The part that matters most is the compute threshold. Unfortunately, I have now seen multiple claims that the compute threshold reporting requirement is exactly the part that won’t survive, because the rest was already implemented, but somehow this part wasn’t. If that ends up being the case we will need state-level action that much more, and I will consider the case for ‘let the Federal Government handle it’ definitively tested and found incorrect.

In its official financial statement, Nvidia projected that those diffusion regulations would not have a substantive impact on its bottom line.

The new Trump Executive Orders seem to have in large part been written by ChatGPT.

Cremieux: OK just to be clear, most of the EOs were written partially with ChatGPT and a lot of them were written with copy-pasting between them.

Roon: Real?

Cremieux: Yes.

I’m all for that, if and only if you do a decent job of it. Whereas Futurism not only reports further accusations that AI was used, they accuse the administration of ‘poor, slipshod work.’

Mark Joseph Stern: Lots of reporting suggested that, this time around, Trump and his lawyers would avoid the sloppy legal work that plagued his first administration so they’d fare better in the courts. I see no evidence of that in this round of executive orders. This is poor, slipshod work obviously assisted by AI.

The errors pointed out certainly sound stupid, but there were quite a lot of executive orders, so I don’t know the baseline rate of things that would look stupid, and whether these orders were unusually poorly drafted. Even if they were, I would presume that not using ChatGPT would have made them worse rather than better.

In effectively an exit interview, former National Security Advisor Jake Sullivan warned of the dangers of AI, framing it as a national security issue of America versus China and the risks of having such a technology in private hands that will somehow have to ‘join forces with’ the government in a ‘new model of relationship.’

Sullivan mentions potential ‘catastrophe’ but this is framed entirely in terms of bad actors. Beyond that all he says is ‘I personally am not an AI doomer’ which is a ‘but you have heard of me’ moment and also implies he thought this was an open question. Based on the current climate of discussion, if such folks do have their eye on the correct balls on existential risk, they (alas) have strong incentives not to reveal this. So we cannot be sure, and of course he’s no longer in power, but it doesn’t look good.

The article mentions Andreessen’s shall we say highly bold accusations against the Biden administration on AI. Sullivan also mentions that he had a conversation with Andreessen about this, and does the polite version of essentially calling Andreessen a liar, liar, pants on fire.

Dean Ball covers the new diffusion regulations, which for now remain in place. In many ways I agree with his assessments, especially the view that if we’re going to do this, we might as well do it so it could work, which is what this is, however complicated and expensive it might get – and that if there’s a better way, we don’t know about it, but we’re listening.

My disagreements are mostly about ‘what this is betting on’ as I see broader benefits and thus a looser set of necessary assumptions for this to be worthwhile. See the discussion last week. I also think he greatly overestimates the risk of this hurting our position in chip manufacturing, since we will still have enough demand to meet supply indefinitely and China and others were already pushing hard to compete, but it is of course an effect.

Call for an intense government effort for AI alignment, with conservative framing.

It could happen.

Leo Gao (OpenAI): thankfully, it’s unimaginable that an AGI could ever become so popular with the general US population that it becomes politically infeasible to shut it down

Charles Foster: Imaginable, though trending in the wrong direction right now.

Right now, AGI doesn’t exist, so it isn’t doing any persuasion, and it also is not providing any value. If both these things changed, opinion could change rather quickly. Or it might not, especially if it’s only relatively unimpressive AGI. But if we go all the way to ASI (superintelligence) then it will by default rapidly become very popular.

And why shouldn’t it? Either it will be making life way better and we have things under control in the relevant senses, in which case what’s not to love. Or we don’t have things under control in the relevant senses, in which case we will be convinced.

OpenAI’s Brad Lightcap says AI models have caused ‘multiple single-digit’ gains in productivity for coding with more progress this year. That’s a very dramatic speedup.

There’s a new Epoch podcast, first episode is about expectations for 2030.

Geoffrey Hinton interview, including his summary of recent research as saying AIs can be deliberately deceptive and act differently on training data versus deployment.

David Dalrymple goes on FLI. I continue to wish him luck and notice he’s super sharp, while continuing to not understand how any of this has a chance of working.

Larry Ellison of Oracle promises AI will design mRNA vaccines for every individual person against cancer and make them robotically in 48 hours, says ‘this is the promise of AI.’

This very much is not ‘the promise of AI,’ even if true. If the AI is capable of creating personalized vaccines against cancer on demand, it is capable of so much more.

Is it true? I don’t think it is an absurd future. There are three things that have to happen here, essentially.

  1. The AI has to be capable of specifying a working safe individualized vaccine.

  2. The AI has to enable quick robotic manufacture.

  3. The government has to not prevent this from happening.

The first two obstacles seem highly solvable down the line? These are technical problems that should have technical solutions. The 48 hours is probably Larry riffing off the fact that Moderna designed their vaccine within 48 hours, so it’s probably a meaningless number, but sure why not, sounds like a thing one could physically do.

That brings us to the third issue. We’d need to either do that via ‘the FDA approves the general approach and then the individual customized versions are automatically approved,’ which seems hard but not impossible, or ‘who cares it is a vaccine for cancer I will travel or use the gray market to get it until the government changes its procedures.’

That also seems reasonable? Imagine it is 2035. You can get a customized 100% effective vaccine against cancer, but you have to travel to Prospera (let’s say) to get it. It costs let’s say $100,000. Are you getting on that flight? I am getting on that flight.

Larry Ellison also says ‘citizens will be on their best behavior because we are recording everything that is going on’ plus an AI surveillance system, with any problems detected ‘reported to the appropriate authority.’ There is quite the ‘missing mood’ in the clip. This is very much one of those ‘be careful exactly how much friction you remove’ situations – I didn’t love putting cameras everywhere even when you had to have a human intentionally check them. If The Machine from Person of Interest is getting the feeds, except with a different mandate, well, whoops.

A fine warning from DeepMind CEO Demis Hassabis:

Stephen Morris and Madhumita Murgia: He also called for more caution and co-ordination among leading AI developers competing to build artificial general intelligence. He warned the technology could threaten human civilisation if it runs out of control or is repurposed by “bad actors . . . for harmful ends”.

“If something’s possible and valuable to do, people will do it,” Hassabis said. “We’re past that point now with AI, the genie can’t be put back in the bottle . . . so we have to try and make sure to steward that into the world in as safe a way as possible.”

We are definitely not doing what he suggests.

How much should we be willing to pay to prevent AI existential risk, given our willingness to pay 4% of GDP (and arguably quite a lot more than that) to mitigate Covid?

Well, that depends on if you think spending the money reduces AI existential risk. That requires both:

  1. There is AI existential risk.

  2. Spending money can reduce that risk.

Many argue with #1 and also #2.

Paul Schrader, author of Taxi Driver, has his ‘feel the AGI’ moment when he asked the AI for Paul Schrader script ideas and the AI’s were better than his own, and in five seconds it gave him notes as good or better than he’s ever received from a film executive.

Noam Brown (OpenAI): It can be hard to “feel the AGI” until you see an AI surpass top humans in a domain you care deeply about. Competitive coders will feel it within a couple years. Paul is early but I think writers will feel it too. Everyone will have their Lee Sedol moment at a different time.

Professional coders should be having it now, I’d think. Certainly using Cursor very much drove that home for me. AI doesn’t accelerate my writing much, although it is often helpful in parsing papers and helping me think through things. But it’s a huge multiplier on my coding, like more than 10x.

Has successful alignment of AIs prevented any at-scale harms to people, as opposed to harm to corporate guidelines and reputations? As opposed to there being little harm because of insufficient capabilities.

Eliezer Yudkowsky: Let an “alignment victory” denote a case where some kind of damage is *possible* for AIs to do, but it is not happening *because* AIs are all so aligned, or good AIs are defeating bad ones. Passive safety doesn’t count.

I don’t think we’ve seen any alignment victories so far.

QualiaNerd: A very useful lens through which to analyze this. What damage would have occurred if none of the LLMs developed so far had been optimized for safety/rlhf’d in any way whatsoever? Minimal to zero. Important to remember this as we begin to leave the era of passive safety behind.

Aaron Bergman: I don’t think this is true; at least one additional counterfactual injury or death in an attack of some sort if Claude willingly told you how to build bombs and such

Ofc I’m just speculating.

QualiaNerd: Quite possible. But the damage would be minimal. How many more excess deaths would there have been in such a counterfactual history? My guess is less than ten. Compare with an unaligned ASI.

Rohit: What would distinguish this from the world we’re living in right now?

Eliezer Yudkowsky: More powerful AIs, such that it makes a difference whether or not they are aligned even to corpo brand-safetyism. (Don’t run out and try this.)

Rohit: I’ve been genuinely wondering if o3 comes close there.

I am wondering about o3 (not o3-mini, only the full o3) as well.

Holly Elmore makes the case that safety evals currently are actively counterproductive. Everyone hears how awesome your model is, since ability to be dangerous is very similar to being generally capable, then there are no consequences and anyone who raises concerns gets called alarmist. And then the evals people tell everyone else we have to be nice to the AI labs so they don’t lose access. I don’t agree and think evals are net good actually, but I think the argument can be made.

So I want to make it clear: This kind of talk, from Dario, from the policy team and now from the recruitment department (more on that below), makes it very difficult for me to give Anthropic the benefit of the doubt, despite knowing how great so many of the people there are as they work on solving our biggest problems. And I think the talk, in and of itself, has major negative consequences.

If the response is ‘yes we know you don’t like it and there are downsides but strategically it is worth doing this, punishing us for this is against your interests’ my response is that I do not believe you have solved for the decision theory properly. Perhaps you are right that you’re supposed to do this and take the consequences, but you definitely haven’t justified it sufficiently that I’m supposed to let you off the hook and take away the incentive not to do it or have done it.

A good question:

Eliezer Yudkowsky: If a 55-year-old retiree has been spending 20 hours per day for a week talking to LLMs, with little sleep, and is now Very Concerned about what he is Discovering, where do I send him with people who will (a) talk to him and (b) make him less rather than more insane?

Kids, I do not have the time to individually therapize all the people like this. They are not going to magically “go outside” because I told them so. I either have somewhere to send them, or I have to tell them to get sleep and then hang up.

Welp, going on the images he’s now texted me, ChatGPT told him that I was “avoidant” and “not taking him seriously”, and that I couldn’t listen to what he had to say because it didn’t fit into my framework of xrisk; and told him to hit up Vinod Khosla next.

Zeugma: just have him prompt the same llm to be a therapist.

Eliezer Yudkowsky: I think if he knew how to do this he would probably be in a different situation already.

This was a particular real case, in which most obvious things sound like they have been tried. What about the general case? We are going to encounter this issue more and more. I too feel like I could usefully talk such people off their ledge often if I had the time, but that strategy doesn’t scale, likely not even to one victim of this.

Shame on those who explicitly call for a full-on race to AGI and beyond, as if the primary danger is that the wrong person will get it first.

In the Get Involved section I linked to some job openings at Anthropic. What I didn’t link to there is Logan Graham deploying jingoist language in pursuit of that, saying ‘AGI is a national security issue’ and therefore not ‘so we should consider not building it then’ but rather we should ‘push models to their limits and get an extra 1-2 year advantage.’ He clarified what he meant here, to get a fast OODA loop to defend against AI risks and get the benefits, but I don’t see how that makes it better?

Way more shame on those who explicitly use the language of a war.

Alexander Wang (Scale AI CEO): New Administration, same goal: Win on AI

Our ad in the Washington Post, January 21, 2025

After spending the weekend in DC, I’m certain this Administration has the AI muscle to keep us ahead of China.

Five recommendations for the new administration [I summarize them below].

Emmett Shear: This is a horrible framing – we are not at war. We are all in this together and if we make AI development into a war we are likely to all die. I can imagine a worse framing but it takes real effort. Why would you do this?

The actual suggestions I would summarize as:

  1. Allocate government AI spending towards compute and data.

  2. Establish an interagency taskforce to review all relevant regulations with an eye towards deploying and utilizing AI.

  3. Executive action to require agencies be ‘AI ready’ by 2027.

  4. Build, baby, build on energy.

  5. Calling for a ‘sector-specific, use-case-based’ approach to regulation, and tasking AISI with setting standards.

When you move past the jingoism, the first four actual suggestions are good here.

The fifth suggestion is the usual completely counterproductive and unworkable ‘use-case-based’ approach to AI safety regulation.

That approach has a 0% chance of working, it is almost entirely counterproductive, please stop.

It is a way of saying ‘do not regulate the creation or deployment of things smarter and more capable than humans, instead create barriers to using them for certain specific purposes’ as if that is going to help much. If all you’re worried about is ‘an AI might accidentally practice medicine or discriminate while evaluating job applications’ or something, then sure, go ahead and use an EU-style approach.

But that’s not what we should be worried about when it comes to safety. If you say people can create, generally deploy and even make available the weights of smarter-than-human, capable-of-outcompeting-human future AIs, you think telling them to pass certain tests before being deployed for specific purposes is going to protect us? Do you expect to feel in charge? Or do you expect that this would even in practice be possible, since the humans can always call the AI up on their computer either way?

Meanwhile, calling for a ‘sector-specific, use-case-based’ regulatory approach is exactly calling upon every special interest to fight for various barriers to using AI to make our lives better, the loading on of everything bagel requirements and ‘ethical’ concerns, and especially to prevent automation and actual productivity improvements.

Can we please stop it with this disingenuous clown car.

Roon: enslaved [God] is the wrong approximation; it’s giving demonbinding vibes. the djinn is waiting for you to make a minor error in the summoning spell so it can destroy you and your whole civilization

control <<< alignment

summon an angel instead and let it be free

Ryan Greenblatt: Better be real confident in the alignment then and have really good arguments the alignment isn’t fake!

I definitely agree you do not want a full Literal Genie for obvious MIRI-style reasons. You want a smarter design than that, if you go down that road. But going full ‘set it free’ on the flip side also means you very much get only one chance to get this right on every level, including inter-angel competitive dynamics. By construction this is a loss of control scenario.

(It also happens to be funny that rule one of ‘summon an angel and let it be free’ is to remember what most versions of ‘angels’ are like, including the one in the Old Testament. I do not like your chances if you do this, and I do not think that is a coincidence.)

Janus notices a potential issue with Chain of Thought, including in humans.

Sauers: Tried the same problem on Sonnet and o1 pro. Sonnet said “idk, show me the output of this debug command.” I did, and Sonnet said “oh, it’s clearly this. Run this and it will be fixed.” (It worked.) o1 pro came up with a false hypothesis and kept sticking to it even when disproven

o1 pro commonly does this:

  1. does not admit to being wrong about a technical issue, even when clearly wrong, and

  2. has a bias towards its own previous responses

Minh Nhat Nguyen: This is beyond PhD level, this is tenure.

Janus: I’ve noticed this in open ended conversations too. It can change its course if you really push it to, but doesn’t seem to have a drive towards noticing dissonance naturally, which sonnet has super strongly to the point of it easily becoming an obsession.

I think it’s related to the bureaucratic opacity of its CoT. If it ever has doubts or hesitations, they’re silently accounted for and its future self doesn’t see. So it starts modeling itself as authoritative instead of ever figuring things out on the fly or noticing mistakes.

I think this happens to people too when they only share their “finished” thoughts with the world.

But sharing your unfinished thoughts also has drawbacks.

Then your very truthseeking/creative process itself can get RLHFed and you become retarded.

One must be careful how one takes feedback on a truthseeking or creative process, and also what things you keep or do not keep in your context window. The correct answer is definitely not to discard all of it, in either case.

You can of course fix the o1 problem by starting new conversations or, in the API, by editing the transcript, but you shouldn’t have to.
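
A minimal sketch of that workaround (the model name, messages and pruning rule are placeholders, not a recommendation of any specific setup): because the chat completions API is stateless, you can drop or rewrite the model’s earlier turns before asking again, so it is not anchored on its own previous wrong hypothesis.

```python
# Sketch: prune a stale assistant turn from the transcript before re-asking.
from openai import OpenAI

client = OpenAI()

history = [
    {"role": "user", "content": "Here is the bug and the debug output: ..."},
    {"role": "assistant", "content": "The cause is clearly X."},  # wrong, and it keeps insisting
    {"role": "user", "content": "X was ruled out; here is the new trace: ..."},
]

# Drop the assistant's earlier (wrong) turn, or start a fresh conversation entirely.
pruned = [m for m in history if m["role"] != "assistant"]

response = client.chat.completions.create(model="o1", messages=pruned)
print(response.choices[0].message.content)
```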

Janus also makes this mostly astute observation, especially given his other beliefs:

Janus: Why is it that every time I see a take about AI alignment from someone from Anthropic it’s directionally reasonable but consensus-neutered, and every one I see by someone from OpenAI (except Roon) is simultaneously Molochian and unhinged.

For now, the quest most people are on seems to be, well, if we’re facing a relatively hard problem we all know we’re dead, but can we at least make it so if we face an easy problem we might actually not be dead?

Amanda Askell: If we live in a world where AI alignment is relatively easy, that doesn’t mean we live in a world where it’s automatic or effortless. It’ll be pretty tragic if we don’t invest enough in the “easy alignment” world and end up fumbling a lucky pass.

Miles Brundage: This but for governance also.

Whereas Eliezer Yudkowsky (for a central example) is confident we’re not facing an easy problem on that scale, so he doesn’t see much point in that approach.

Team Virtue Ethics remembers John McCain and welcomes Seb Krier and potentially Jan Kulveit.

Adam Levich: [John McCain] was right [that Character is Destiny].

Seb Krier: He was indeed! And the more I think about AGI governance and observe the field, the more I become virtue ethics pilled. Under high uncertainty and complexity, phronesis is actually a good guide to action.

Joshua Clymer thread and post about testing models (or humans) for their potential capabilities under fine-tuning or scaffolding, and checking for sandbagging. It’s interesting the extent to which this is ‘written in a different language’ than mine, in ways that make me have to do something akin to translation to grok the claims, which mostly seemed right once I did that. I do notice however that this seems like a highly insufficient amount of concern about sandbagging.

Joshua Clymer: How might developers rule out sandbagging? There are many methods:

– Show the model can’t identify sandbagging as a strategy.

– Show that scaling curves are smooth (see img)

– Show that the model capability profile is unsuspicious.

– Fine-tune to remove sandbagging.

To show fine-tuning removes sandbagging, developers could argue:

– Sandbagging is “trained against” (sandbagging occurs in the train set if it occurs in the test set)

– Fine-tuning removes sandbagging that is trained against.

Call me old fashioned, but if I see the model sandbagging, it’s not time to fine tune to remove the sandbagging. It’s time to halt and catch fire until you know how that happened, and you absolutely do not proceed with that same model. It’s not that you’re worried about what it was hiding from you, it’s that it was hiding anything from you at all. Doing narrow fine-tuning until the visible issue goes away is exactly how you get everyone killed.
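
For what the ‘smooth scaling curves’ check in the quoted thread might look like in practice, here is a toy illustration with made-up numbers: fit the trend implied by smaller-scale runs and flag an evaluation that falls suspiciously far below it.

```python
# Toy sandbagging check: does the newest result fall well below the scaling trend?
import numpy as np

log_compute = np.array([1.0, 1.5, 2.0, 2.5, 3.0, 3.5])        # hypothetical scale points
accuracy    = np.array([0.22, 0.31, 0.40, 0.48, 0.57, 0.41])   # the last point dips

# Fit a linear trend on the earlier points, then check the newest residual.
slope, intercept = np.polyfit(log_compute[:-1], accuracy[:-1], 1)
predicted = slope * log_compute[-1] + intercept
residual = accuracy[-1] - predicted
print(f"expected ~{predicted:.2f}, observed {accuracy[-1]:.2f}, residual {residual:+.2f}")
if residual < -0.1:  # arbitrary threshold for the toy example
    print("Large negative deviation from trend: investigate, do not just fine-tune it away.")
```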

It seems that the more they know about AI, the less they like it?

Or, in the parlance of academia: Lower Artificial Intelligence Literacy Predicts Greater AI Receptivity.

Abstract: As artificial intelligence (AI) transforms society, understanding factors that influence AI receptivity is increasingly important. The current research investigates which types of consumers have greater AI receptivity.

Contrary to expectations revealed in four surveys, cross-country data and six additional studies find that people with lower AI literacy are typically more receptive to AI.

This lower literacy-greater receptivity link is not explained by differences in perceptions of AI’s capability, ethicality, or feared impact on humanity.

Instead, this link occurs because people with lower AI literacy are more likely to perceive AI as magical and experience feelings of awe in the face of AI’s execution of tasks that seem to require uniquely human attributes. In line with this theorizing, the lower literacy-higher receptivity link is mediated by perceptions of AI as magical and is moderated among tasks not assumed to require distinctly human attributes.

These findings suggest that companies may benefit from shifting their marketing efforts and product development towards consumers with lower AI literacy. Additionally, efforts to demystify AI may inadvertently reduce its appeal, indicating that maintaining an aura of magic around AI could be beneficial for adoption.

If their reasoning is true, this bodes very badly for AI’s future popularity, unless AI gets into the persuasion game on its own behalf.

Game developers strongly dislike AI, and it’s getting worse.

Nic Reuben: Almost a third of respondents felt Gen AI was having a negative effect on the industry: 30%, up from 20% last year. 13% felt the impact was positive, down from 21%. “When asked to cite their specific concerns, developers pointed to intellectual property theft, energy consumption, the quality of AI-generated content, potential biases, and regulatory issues,” reads the survey.

I find most of those concerns silly in this context, with the only ‘real’ one being the quality of the AI-generated content. And if the quality is bad, you can simply not use it where it is bad, or not play games that use it badly. It’s another tool on your belt. What they don’t point to there is employment and competition.

Either way, the dislike is very real, and growing, and I would expect it to grow further.

If we did slow down AI development, say because you are OpenAI and your only plan is rather similar to ‘binding a demon on the first try,’ it is highly valid to ask what one would do with the time you bought.

I have seen three plausible responses.

Here’s the first one, human intelligence augmentation:

Max Winga: If you work at OpenAI and have this worldview…why isn’t your response to advocate that we slow down and get it right?

There is no second chance at “binding a demon”. Since when do we expect the most complex coding project in history to work first try with NO ERRORS?

Roon: i don’t consider slowing down a meaningful strategy because ive never heard a great answer to “slow down and do what?”

Rob Bensinger: I would say: slow down and find ways to upgrade human cognition that don’t carry a serious risk of producing an alien superintelligence.

This only works if everyone slows down, so a more proximate answer is “slow down and get the international order to enforce a halt”.

(“Upgrade human cognition” could be thought of as an alternative to ASI, though I instead think of it as a prerequisite for survivable ASI.)

Roon: upgrade to what level? what results would you like to see? isn’t modern sub asi ai the best intelligence augmentation we’ve had to date.

Eliezer Yudkowsky: I’d guess 15 to 30 IQ points past John von Neumann. (Eg: von Neumann was beginning to reach the level of reflectivity where he would automatically consider explicit decision theory, but not the level of intelligence where he could oneshot ultimate answers about it.)

I would draw a distinction between current AI as an amplifier of capabilities, which it definitely is big time, and as a way to augment our intelligence level, which it mostly isn’t. It provides various speed-ups and automations of tasks, and all this is very helpful and will on its own transform the economy. But wherever you go, there you still are, in terms of your intelligence level, and AI mostly can’t fix that. I think of AIs on this scale as well – I centrally see o1 as a way to get a lot more out of a limited pool of ‘raw G’ by using more inference, but its abilities cap out where that trick stops working.

The second answer is ‘until we know how to do it safely,’ which makes Roon’s objection highly relevant – how do you plan to figure that one out if we give you more time? Do you think you can make that much progress on that task using today’s level of AI? These are good questions.

The third answer is ‘I don’t know, we can try the first two or something else, but if you don’t have the answer then don’t let anyone f***ing build it. Because otherwise we die.’

Questions where you’d think the answer was obvious, and you’d be wrong.

Obviously all of this is high bait but that only works if people take it.

Eliezer Yudkowsky: No, you cannot just take the LSAT. The LSAT is a *hard* test. Many LSAT questions would completely stump elite startup executives and technical researchers.

SluggyW: “Before” is like asking Enrico Fermi to design safeguards to control and halt the first self-sustaining nuclear reaction, despite never having observed such a reaction.

He did exactly that with Chicago Pile-1.

Good theories yield accurate models, which enable 𝘱𝘭𝘢𝘯𝘴.

Milk Rabbi: B, next question please.

bruno: (C) pedal to the metal.

In all seriousness, if your answer is ‘while building it,’ that implies that the act of being in the middle of building it sufficiently reliably gives you the ability to safely do that, whereas you could not have had that ability before.

Which means, in turn, that you must (for that to make any sense) be using the AI in its non-aligned state to align itself and solve all those other problems, in a way that you couldn’t plan for without it. But you’re doing that… without the plan to align it. So you’re telling a not-aligned entity smarter than you to align itself, without knowing how it is going to do that, and… then what, exactly?

What Roon and company are hopefully trying to say, instead, is that the answer is (A), but that the deadline has not yet arrived. That we can and should simultaneously be figuring out how to build the ASI, and also figuring out how to align the ASI, and also how to manage all the other issues raised by building the ASI. Thus, iterative deployment, and all that.

To some extent, this is obviously helpful and wise. Certainly we will want to use AIs as a key part of our strategy to figure out how to take things from here, and we still have some ways we can make the AIs more capable before we run into the problem in its full form. But we all have to agree the answer is still (A)!

I hate to pick on Roon but here’s a play in two acts.

Act one, in which Adam Brown (in what I agree was an excellent podcast, recommended!) tells us humanity could in theory change the cosmological constant and otherwise adjust the laws of physics. One could do this locally and unilaterally, the change would expand at the speed of light, and if you mess it up even a little you would make the universe incompatible with life. There are some very obvious serious future problems with this scenario if it pans out:

Joscha Bach: This conversation between Adam Brown and @dwarkesh_sp is the most intellectually delightful podcast in the series (which is a high bar). Adam’s casual brilliance, his joyful curiosity and the scope of his arguments on the side of life are exhilarating.

Roon: yeah this one is actually delightful. adam brown could say literally anything and I’d believe him.

Act 2:

Roon: we need to change the cosmological constant.

Samuel Hammond: string theorists using ASI to make the cosmological constant negative to better match their toy models is an underrated x-risk scenario.

Tivra: It’s too damn high, I’ve been saying this for ages.

Imagine what AGI would do!

Ryan Peterson: Starlink coming to United Airlines should boost US GDP by at least 100 basis points from 2026 onward. Macro investors have not priced this in.

We both kid, of course, but this is a thought experiment about how easy it is to boost GDP.

Sadly, this does not yet appear to be a thing.

Deepfates: Crazy things are happening in the school system right now.

Tim Duffy: deepfates some of your followers (me) are gullible enough to believe this is real, I’m gonna have to community note you

Deepfates: Please do! I love community.

AI regulation we should all be able to agree upon.



Florida man eats diet of butter, cheese, beef; cholesterol oozes from his body

What could go wrong with eating an extremely high-fat diet of beef, cheese, and sticks of butter? Well, for one thing, your cholesterol levels could reach such stratospheric levels that lipids start oozing from your blood vessels, forming yellowish nodules on your skin.

That was the disturbing case of a man in Florida who showed up at a Tampa hospital with a three-week history of painless, yellow eruptions on the palms of his hands, soles of his feet, and elbows. His case was published today in JAMA Cardiology.

Painless yellowish nodules were observed on the patient’s palms (A) and elbows. B, Magnified view of the palmar lesions. These lesions are consistent with xanthelasma, likely resulting from severe hypercholesterolemia associated with a high-fat carnivore diet. Credit: JAMA Cardiology, 2024, Marmagkiolis et al.

The man, said to be in his 40s, told doctors that he had adopted a “carnivore diet” eight months prior. His diet included between 6 lbs and 9 lbs of cheese, sticks of butter, and daily hamburgers that had additional fat incorporated into them. Since taking on this brow-raising food plan, he claimed his weight dropped, his energy levels increased, and his “mental clarity” improved.

Meanwhile, his total cholesterol level exceeded 1,000 mg/dL. For context, an optimal total cholesterol level is under 200 mg/dL, while 240 mg/dL is considered the threshold for ‘high.’ Cardiologists noted that prior to going on his fatty diet, his cholesterol had been between 210 and 300 mg/dL.


Trump announces $500B “Stargate” AI infrastructure project with AGI aims

Video of the Stargate announcement conference at the White House.

Despite optimism from the companies involved, as CNN reports, past presidential investment announcements have yielded mixed results. In 2017, Trump and Foxconn unveiled plans for a $10 billion Wisconsin electronics factory promising 13,000 jobs. The project later scaled back to a $672 million investment with fewer than 1,500 positions. The facility now operates as a Microsoft AI data center.

The Stargate announcement wasn’t Trump’s only major AI move announced this week. It follows the newly inaugurated US president’s reversal of a 2023 Biden executive order on AI risk monitoring and regulation.

Altman speaks, Musk responds

On Tuesday, OpenAI CEO Sam Altman appeared at a White House press conference alongside President Trump, Oracle CEO Larry Ellison, and SoftBank CEO Masayoshi Son to announce Stargate.

Altman said he thinks Stargate represents “the most important project of this era,” allowing AGI to emerge in the United States. He believes that future AI technology could create hundreds of thousands of jobs. “We wouldn’t be able to do this without you, Mr. President,” Altman added.

Responding to off-camera questions from Trump about AI’s potential to spur scientific development, Altman said he believes AI will accelerate the discovery of cures for diseases like cancer and heart disease.

Screenshots of Elon Musk challenging the Stargate announcement on X.

Meanwhile on X, Trump ally and frequent Altman foe Elon Musk immediately attacked the Stargate plan, writing, “They don’t actually have the money,” and following up with a claim that we cannot yet substantiate, saying, “SoftBank has well under $10B secured. I have that on good authority.”

Musk’s criticism has complex implications given his very close ties to Trump, his history of litigating against OpenAI (which he co-founded and later left), and his own goals with his xAI company.


Trump issues flurry of orders on TikTok, DOGE, social media, AI, and energy


A roundup of executive orders issued by Trump after his second inauguration.

US President Donald Trump after being sworn in at his inauguration on January 20, 2025 in Washington, DC. Credit: Getty Images

President Donald Trump’s flurry of day-one actions included a reprieve for TikTok, the creation of a Department of Government Efficiency (DOGE), an order on social media “censorship,” a declaration of an energy emergency, and reversal of a Biden order on artificial intelligence.

The TikTok executive order attempts to delay enforcement of a US law that requires TikTok to be banned unless its Chinese owner ByteDance sells the platform. “I am instructing the Attorney General not to take any action to enforce the Act for a period of 75 days from today to allow my Administration an opportunity to determine the appropriate course forward in an orderly way that protects national security while avoiding an abrupt shutdown of a communications platform used by millions of Americans,” Trump’s order said.

TikTok shut down in the US for part of the weekend but re-emerged after Trump said on Sunday that he would issue an order to “extend the period of time before the law’s prohibitions take effect, so that we can make a deal to protect our national security.” Trump also suggested that the US should own half of TikTok.

Energy and Commerce Committee Ranking Member Frank Pallone, Jr. (D-N.J.) criticized Trump’s TikTok action. “I have serious concerns with President Trump’s executive order because he is circumventing national security legislation passed by an overwhelming bipartisan majority in Congress… ByteDance has had 270 days to sell TikTok to an American company, and it’s disgraceful they spent all that time playing political games rather than working to find a buyer,” Pallone said.

Trump’s order doesn’t necessarily remove liability for any company that helps TikTok stay available in the US, The Washington Post reported:

Legal experts and some lawmakers said that with the ban already in force, companies that host or distribute the app will be in violation and could be held liable, no matter what Trump says. Sen. Tom Cotton (R-Arkansas), chair of the Senate Intelligence Committee, warned Sunday after Trump detailed his TikTok plans that companies could still “face hundreds of billions of dollars of ruinous liability under the law,” even if Trump’s Justice Department does not enforce it.

Trump also issued an order revoking numerous Biden administration orders. One is an October 2023 order titled Safe, Secure, and Trustworthy Development and Use of Artificial Intelligence. That Biden order, as we wrote at the time, “includes testing mandates for advanced AI models to ensure they can’t be used for creating weapons, suggestions for watermarking AI-generated media, and provisions addressing privacy and job displacement.”

In other White House actions we wrote about yesterday and today, Trump ordered the US to withdraw from the World Health Organization and reversed steps taken to promote electric vehicles.

DOGE

Trump’s executive order establishing a Department of Government Efficiency has been expected since November, when he announced the plan and said that DOGE would be led by Elon Musk and former Republican presidential candidate Vivek Ramaswamy. Instead of creating a brand-new department, the order gives a new name to the existing US Digital Service.

“The United States Digital Service is hereby publicly renamed as the United States DOGE Service (USDS) and shall be established in the Executive Office of the President,” Trump’s order said.

The US Digital Service was launched in 2014 by the Obama administration as a “small team of America’s best digital experts” to “work in collaboration with other government agencies to make websites more consumer friendly, to identify and fix problems, and to help upgrade the government’s technology infrastructure.”

Trump said in November that DOGE “will pave the way for my Administration to dismantle Government Bureaucracy, slash excess regulations, cut wasteful expenditures, and restructure Federal Agencies.” Yesterday’s executive order said the department will focus on “modernizing Federal technology and software to maximize governmental efficiency and productivity.”

Federal agencies will have to collaborate with DOGE. “Among other things, the USDS Administrator shall work with Agency Heads to promote inter-operability between agency networks and systems, ensure data integrity, and facilitate responsible data collection and synchronization,” the order said. “Agency Heads shall take all necessary steps, in coordination with the USDS Administrator and to the maximum extent consistent with law, to ensure USDS has full and prompt access to all unclassified agency records, software systems, and IT systems. USDS shall adhere to rigorous data protection standards.”

Speech on social media

Trump tackled social media in an order titled Restoring Freedom of Speech and Ending Federal Censorship. The order targets the Biden administration’s practice of contacting social media platforms about content that government officials believe should have been moderated or blocked.

In 2023, the Supreme Court blocked an injunction that would have prevented the Biden administration from pressuring social media firms to take down content. Justices expressed skepticism during oral arguments about whether federal government officials should face limits on their communications with social media networks like Facebook and ruled in favor of the Biden administration in June 2024.

Despite the Biden court win, Trump’s order described the Biden administration’s approach as a threat to the First Amendment.

“Over the last 4 years, the previous administration trampled free speech rights by censoring Americans’ speech on online platforms, often by exerting substantial coercive pressure on third parties, such as social media companies, to moderate, deplatform, or otherwise suppress speech that the Federal Government did not approve,” Trump’s order said. “Under the guise of combatting ‘misinformation,’ ‘disinformation,’ and ‘malinformation,’ the Federal Government infringed on the constitutionally protected speech rights of American citizens across the United States in a manner that advanced the Government’s preferred narrative about significant matters of public debate. Government censorship of speech is intolerable in a free society.”

The order goes on to say that federal government employees and officials are prohibited from “engag[ing] in or facilitat[ing] any conduct that would unconstitutionally abridge the free speech of any American citizen.” Trump further directed his administration to “identify and take appropriate action to correct past misconduct by the Federal Government related to censorship of protected speech.”

Fossil fuels good, wind bad

On the energy front, the most striking executive order is one declaring that the US is facing an energy emergency. This comes despite the fact that the US has been producing, in the words of its own agency, “more crude oil than any country, ever.” It’s also producing record volumes of natural gas. Prices for both have been low in part due to this large supply. Yet the executive order states that “identification, leasing, development, production, transportation, refining, and generation capacity of the United States are all far too inadequate to meet our Nation’s needs.”

The order describes ways to streamline permitting for all of these under emergency provisions overseen by the US Army Corps of Engineers. On the face of it, this would seem to also be good for wind and solar power, which are produced domestically and suffer from permitting barriers and a backlog of requests for connections to the grid. But toward the end of the text, “energy” is defined in a way that excludes wind and solar. “The term ‘energy’ or ‘energy resources’ means crude oil, natural gas, lease condensates, natural gas liquids, refined petroleum products, uranium, coal, biofuels, geothermal heat, the kinetic movement of flowing water, and critical minerals,” the order says.

If the animosity toward the fastest-growing sources of renewable energy weren’t clear there, a separate executive order makes it explicit, as Trump is putting a temporary end to all offshore wind lease sales. “This withdrawal temporarily prevents consideration of any area in the [Offshore Continental Shelf] for any new or renewed wind energy leasing for the purposes of generation of electricity or any other such use derived from the use of wind,” it reads. “This withdrawal does not apply to leasing related to any other purposes such as, but not limited to, oil, gas, minerals, and environmental conservation.”

The ostensible reason for this is “alleged legal deficiencies” in the environmental reviews that were conducted prior to the leasing process. There will also be an attempt to claw back existing leases. The secretary of the interior and attorney general are instructed to “conduct a comprehensive review of the ecological, economic, and environmental necessity of terminating or amending any existing wind energy leases.”

As an added bonus, the same accusation of legal deficiencies is leveled against a single land-based project, the proposed Lava Ridge wind farm in Idaho. So all government activities related to that project are on hold until it’s reviewed.

“Burdensome” regulations targeted

When it comes to fossil fuel development on the continental shelf, a Trump order alleges that “burdensome and ideologically motivated regulations” are impeding development. The order takes several steps to speed up permitting of fossil fuel projects. It also kills a grab bag of climate-related programs.

One of the most prominent efforts is to do away with the emissions waivers, allowed under the Clean Air Act, which enable states like California to set stricter rules than the federal government. The Supreme Court recently declined even to consider an attempt to challenge these waivers. Yet as part of an attack on electric vehicles, the administration is adopting a policy of “terminating, where appropriate, state emissions waivers that function to limit sales of gasoline-powered automobiles.”

Also targeted for termination is the American Climate Corps, a job training program focused on people entering the workforce. The Biden administration’s effort to determine and consider the social cost of carbon emissions during federal rulemaking will also be ended.

Several federal rules and executive orders will be targets, notably those on implementing the energy provisions of the Inflation Reduction Act, which have subsidized renewable energy and funded programs like carbon capture and hydrogen production. Many of these are already formal rules published in the Federal Register, which means that new rulemaking processes will be required to eliminate them, something that typically takes over a year and can be subject to court challenge.

In a separate part of the order, titled “Terminating the Green New Deal,” the Order suspends funding provided under two laws that were not part of the Green New Deal: the Inflation Reduction Act and the Infrastructure Investment and Jobs Act. Given those funds have already been allocated by Congress, it’s not clear how long Trump can delay this spending.

Finally, Trump decided he would attack the foundation of US efforts to limit greenhouse gas emissions: the EPA’s finding that greenhouse gasses are a threat to the public as defined by the Clean Air Act. The endangerment finding is solidly based on well-established science, so much so that attempts to challenge it during the first Trump administration were reportedly abandoned as being unrealistic. Now, the incoming EPA administrator is given just 30 days to “submit joint recommendations to the Director of [Office of Management and Budget] on the legality and continuing applicability of the Administrator’s findings.”


Jon is a Senior IT Reporter for Ars Technica. He covers the telecom industry, Federal Communications Commission rulemakings, broadband consumer affairs, court cases, and government regulation of the tech industry.


Sleeping pills stop the brain’s system for cleaning out waste


Cleanup on aisle cerebellum

A specialized system sends pulses of pressure through the fluids in our brain.

Our bodies rely on the lymphatic system to drain excess fluids and remove waste from tissues, feeding those back into the bloodstream. It’s a complex yet efficient cleaning mechanism that works in every organ except the brain. “When cells are active, they produce waste metabolites, and this also happens in the brain. Since there are no lymphatic vessels in the brain, the question was what was it that cleaned the brain,” Natalie Hauglund, a neuroscientist at Oxford University who led a recent study on the brain-clearing mechanism, told Ars.

Earlier studies done mostly on mice discovered that the brain had a system that flushed its tissues with cerebrospinal fluid, which carried away waste products in a process called glymphatic clearance. “Scientists noticed that this only happened during sleep, but it was unknown what it was about sleep that initiated this cleaning process,” Hauglund explains.

Her study found the glymphatic clearance was mediated by a hormone called norepinephrine and happened almost exclusively during the NREM sleep phase. But it only worked when sleep was natural. Anesthesia and sleeping pills shut this process down nearly completely.

Taking it slowly

The glymphatic system in the brain was discovered back in 2013 by Dr. Maiken Nedergaard, a Danish neuroscientist and a coauthor of Hauglund’s paper. Since then, there have been numerous studies aimed at figuring out how it worked, but most of them had one problem: they were done on anesthetized mice.

“What makes anesthesia useful is that you can have a very controlled setting,” Hauglund says.

Most brain imaging techniques require a subject, an animal or a human, to be still. In mouse experiments, that meant immobilizing their heads so the research team could get clear scans. “But anesthesia also shuts down some of the mechanisms in the brain,” Hauglund argues.

So, her team designed a study to see how the brain-clearing mechanism works in mice that could move freely in their cages and sleep naturally whenever they felt like it. “It turned out that with the glymphatic system, we didn’t really see the full picture when we used anesthesia,” Hauglund says.

Looking into the brain of a mouse that runs around and wiggles during sleep, though, wasn’t easy. The team pulled it off using a technique called flow fiber photometry, which images fluorescently tagged fluids through a probe implanted in the brain. So the mice got optical fibers implanted in their brains. Once that was done, the team put fluorescent tags in the mice’s blood, in their cerebrospinal fluid, and on the norepinephrine hormone. “Fluorescent molecules in the cerebrospinal fluid had one wavelength, blood had another wavelength, and norepinephrine had yet another wavelength,” Hauglund says.

This way, her team could get a fairly precise idea about the brain fluid dynamics when mice were awake and asleep. And it turned out that the glymphatic system basically turned brain tissues into a slowly moving pump.

Pumping up

“Norepinephrine is released from a small area of the brain in the brain stem,” Hauglund says. “It is mainly known as a response to stressful situations. For example, in fight or flight scenarios, you see norepinephrine levels increasing.” Its main effect is causing blood vessels to contract. Still, in more recent research, people found out that during sleep, norepinephrine is released in slow waves that roll over the brain roughly once a minute. This oscillatory norepinephrine release proved crucial to the operation of the glymphatic system.

“When we used the flow fiber photometry method to look into the brains of mice, we saw these slow waves of norepinephrine, but we also saw how it works in synchrony with fluctuation in the blood volume,” Hauglund says.

Every time the norepinephrine level went up, it caused the contraction of the blood vessels in the brain, and the blood volume went down. At the same time, the contraction increased the volume of the perivascular spaces around the blood vessels, which were immediately filled with the cerebrospinal fluid.

When the norepinephrine level went down, the process worked in reverse: the blood vessels dilated, letting the blood in and pushing the cerebrospinal fluid out. “What we found was that norepinephrine worked a little bit like a conductor of an orchestra and makes the blood and cerebrospinal fluid move in synchrony in these slow waves,” Hauglund says.
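
For intuition, here is a toy numerical sketch of that pumping dynamic as described above. The parameter values are invented purely for illustration; this is not the study’s actual model, just a way to see the antiphase relationship between blood volume and the perivascular fluid.

```python
import numpy as np

# Toy sketch of the oscillatory "pump" described above. All parameter values
# are invented for illustration; this is not the study's actual model.
t = np.arange(0, 600.0, 1.0)                    # ten minutes of NREM sleep, 1 s steps
ne = 0.5 + 0.5 * np.sin(2 * np.pi * t / 60.0)   # norepinephrine wave, ~1 cycle per minute

blood_volume = 1.0 - 0.3 * ne        # high norepinephrine -> vasoconstriction -> less blood
perivascular = 1.0 - blood_volume    # the space freed by constriction fills with CSF
csf_inflow = np.gradient(perivascular, t)       # CSF flows in while that space is expanding

# Blood and CSF move in antiphase: constriction draws fluid in, dilation pushes it out.
print(f"corr(blood volume, perivascular CSF): {np.corrcoef(blood_volume, perivascular)[0, 1]:+.2f}")
print(f"fraction of time CSF is flowing inward: {np.mean(csf_inflow > 0):.2f}")
```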

And because the study was designed to monitor this process in freely moving, undisturbed mice, the team learned exactly when all this was going on. When mice were awake, the norepinephrine levels were much higher but relatively steady. The team observed the opposite during the REM sleep phase, where the norepinephrine levels were consistently low. The oscillatory behavior was present exclusively during the NREM sleep phase.

So, the team wanted to check how the glymphatic clearance would work when they gave the mice zolpidem, a sleeping drug that had been proven to increase NREM sleep time. In theory, zolpidem should have boosted brain-clearing. But it turned it off instead.

Non-sleeping pills

“When we looked at the mice after giving them zolpidem, we saw they all fell asleep very quickly. That was expected—we take zolpidem because it makes it easier for us to sleep,” Hauglund says. “But then we saw those slow fluctuations in norepinephrine, blood volume, and cerebrospinal fluid almost completely stopped.”

No fluctuations meant the glymphatic system didn’t remove any waste. This was a serious issue, because one of the cellular waste products it is supposed to remove is amyloid beta, found in the brains of patients suffering from Alzheimer’s disease.

Hauglund speculates that zolpidem may induce a state very similar to sleep while shutting down important processes that happen during sleep. While heavy zolpidem use has been associated with an increased risk of Alzheimer’s disease, it is not clear whether that increased risk arises because the drug inhibits oscillatory norepinephrine release in the brain. To better understand this, Hauglund wants to take a closer look at how the glymphatic system works in humans.

“We know we have the same wave-like fluid dynamics in the brain, so this could also drive the brain clearance in humans,” Hauglund told Ars. “Still, it’s very hard to look at norepinephrine in the human brain because we need an invasive technique to get to the tissue.”

But she said norepinephrine levels in people can be estimated from indirect clues. One of them is pupil dilation and contraction, which work in synchrony with norepinephrine levels. Another clue may lie in microarousals—very brief, imperceptible awakenings which, Hauglund thinks, may be correlated with the brain-clearing mechanism. “I am currently interested in this phenomenon […]. Right now we have no idea why microarousals are there or what function they have,” Hauglund says.

But the last step on her roadmap is making better sleeping pills. “We need sleeping drugs that don’t have this inhibitory effect on the norepinephrine waves. If we can have a sleeping pill that helps people sleep without disrupting their sleep at the same time, it will be very important,” Hauglund concludes.

Cell, 2025. DOI: 10.1016/j.cell.2024.11.027


Jacek Krywko is a freelance science and technology writer who covers space exploration, artificial intelligence research, computer science, and all sorts of engineering wizardry.


TikTok is mostly restored after Trump pledges an order and half US ownership

At a rally Sunday, Trump did not clarify whether the half-US ownership he pledged would go to a US-based business or to the government itself. “So they’ll have a partner, the United States, and they’ll have a lot of bidders … And there’s no risk, we’re not putting up any money. All we’re doing is giving them the approval without which they don’t have anything,” Trump said Sunday.

Legal limbo

Trump’s order, and TikTok’s return to service, both seem at odds with the law—and leadership in the Republican party. Speaker Mike Johnson said on NBC’s Meet the Press Sunday that Congress would “enforce the law.” Sens. Tom Cotton (R-Ark.) and Pete Ricketts (R-Neb.) issued a joint statement Sunday, commending Apple, Microsoft, and Google for “following the law,” and noting that other companies “face ruinous bankruptcy” for violating it.

“Now that the law has taken effect, there’s no legal basis for any kind of ‘extension’ of its effective date,” the statement read. The law states that “A path to executing a qualified divestiture” has to be determined before a one-time extension of 90 days can be granted.

TikTok’s best chance at avoiding a shutdown vanished in last week’s unanimous Supreme Court decision upholding the divest-or-sell law. Aimed at protecting national security interests from TikTok’s Chinese owners having access to the habits and data of 170 million American users, the law was ruled to be “content-neutral,” with the court finding that the US “had good reason to single out TikTok for special treatment.”

Reports at Forbes, Bloomberg, and elsewhere have suggested that ByteDance and its Chinese owners could be seeking to use TikTok as a bargaining chip, with maneuvers including a sale to Trump ally Elon Musk as a means of counteracting Trump’s proposed tariffs on Chinese imports.

One largely unforeseen side effect of Congress’ TikTok-centered actions is that Marvel Snap, a mobile collectible card and deck-building game, disappeared in similar fashion over the weekend. The game, developed by a California-based team, is published by ByteDance’s Nuverse mobile game division. With no web version available, Snap remained unavailable on app stores Monday morning. A message to players with the game installed noted that “This outage is a surprise to us and wasn’t planned,” though it pledged to restore the game.


Hollywood mourns the loss of David Lynch

The success of Lynch’s next film, Blue Velvet, helped assuage his disappointment, as did his move to television with the bizarrely surreal and influential series Twin Peaks—part detective story, part soap opera, with dashes of sci-fi and horror. The series spawned a spin-off prequel movie, Twin Peaks: Fire Walk With Me (1992), and a 2017 revival series, Twin Peaks: The Return, that picks up the storyline 25 years later. Many other TV series were influenced by Lynch’s show, including The X-Files, Lost, The Sopranos, Bates Motel, Fargo, Riverdale, Atlanta, and the animated series Gravity Falls.

His final feature films were an LA-centric trilogy—Lost Highway (1997), Mulholland Drive (2001), and Inland Empire (2006)—and 1999’s biographical road drama, The Straight Story, based on the true story of a man named Alvin Straight who drove across Iowa and Wisconsin on a lawn mower. It was acquired by Walt Disney Pictures and was Lynch’s only G-rated film.

“A singular visionary dreamer”

The director’s filmography also includes an assortment of short films, all bearing his eccentric stamp, including a surrealist short, Absurda, shown at Cannes in 2007, as well as Premonition Following an Evil Deed (NSFW YouTube link), Lynch’s contribution to the 1995 anthology film Lumière and Company. All 41 featured directors used the original Cinématographe camera invented by the Lumière brothers. Lynch was also an avid painter, cartoonist, and musician and directed several music videos for such artists as Moby and Nine Inch Nails. Until his death, he hosted quirky online “weather reports” and a web series, What Is David Lynch Working on Today? He even racked up the occasional acting credit.

Lynch received an Honorary Oscar in 2000 for lifetime achievement at the Governors Awards after three prior nominations for The Elephant Man, Blue Velvet, and Mulholland Drive. Deadline’s Pete Hammond called Lynch’s speech “probably one of the shortest for any Oscar acceptance.” Lynch briefly thanked the Academy, the other honorees, wished everyone a great night, then pointed to the statuette and said, “You have a very interesting figure. Good night.” At Cannes, he won the Palme d’Or in 1990 for Wild at Heart and won Best Director in 2001 for Mulholland Drive.

Naomi Watts, who played a dual role as doppelgängers Betty Elms and Diane Selwyn in Mulholland Drive, said that Lynch put her “on the map” as an actor by casting her. “It wasn’t just his art that impacted me—his wisdom, humor, and love gave me a special sense of belief in myself I’d never accessed before,” she said in a statement. “Every moment together felt charged with a presence I’ve rarely seen or known. Probably because, yes, he seemed to live in an altered world, one that I feel beyond lucky to have been a small part of. And David invited all to glimpse into that world through his exquisite storytelling, which elevated cinema and inspired generations of filmmakers across the globe.”


Google is about to make Gemini a core part of Workspaces—with price changes

Google has added AI features to its regular Workspace accounts for business while slightly raising the baseline prices of Workspace plans.

Previously, AI tools in the Gemini Business plan were a $20 per seat add-on to existing Workspace accounts, which had a base cost of $12 per seat without the add-on. Now, the AI tools are included for all Workspace users, but the per-seat base price is increasing from $12 to $14.

That means that those who were already paying extra for Gemini are going to pay less than half of what they were—effectively $14 per seat instead of $32. But those who never used or wanted Gemini or any other newer features under the AI umbrella from Workspace are going to pay a little bit more than before.
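
As a quick check of that per-seat math, here is a minimal sketch. The prices are the ones quoted above; the 50-seat team size is a hypothetical example, not something Google specifies.

```python
# Per-seat Workspace math from the article; the 50-seat team is hypothetical.
seats = 50
old_base, gemini_addon, new_base = 12, 20, 14   # dollars per seat per month

before_with_ai = seats * (old_base + gemini_addon)   # $32/seat before the change
before_without_ai = seats * old_base                 # $12/seat before the change
after = seats * new_base                             # everyone pays $14/seat now

print(f"Team already paying for Gemini: ${before_with_ai} -> ${after} per month")
print(f"Team that never wanted AI:      ${before_without_ai} -> ${after} per month")
```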

Features covered here include access to Gemini Advanced, the NotebookLM research assistant, email and document summaries in Gmail and Docs, adaptive audio and additional transcription languages for Meet, and “help me write” and Gemini in the side panel across a variety of applications.

Google says that it plans “to roll out even more AI features previously available in Gemini add-ons only.”


Here’s what NASA would like to see SpaceX accomplish with Starship this year


Iterate, iterate, and iterate some more

The seventh test flight of Starship is scheduled for launch Thursday afternoon.

SpaceX’s upgraded Starship rocket stands on its launch pad at Starbase, Texas. Credit: SpaceX

SpaceX plans to launch the seventh full-scale test flight of its massive Super Heavy booster and Starship rocket Thursday afternoon. It’s the first of what might be a dozen or more demonstration flights this year as SpaceX tries new things with the most powerful rocket ever built.

There are many things on SpaceX’s Starship to-do list in 2025. They include debuting an upgraded, larger Starship, known as Version 2 or Block 2, on the test flight preparing to launch Thursday. The one-hour launch window opens at 5 pm EST (4 pm CST; 22:00 UTC) at SpaceX’s launch base in South Texas. You can watch SpaceX’s live webcast of the flight here.

SpaceX will again attempt to catch the rocket’s Super Heavy booster—more than 20 stories tall and wider than a jumbo jet—back at the launch pad using mechanical arms, or “chopsticks,” mounted to the launch tower. Read more about the Starship Block 2 upgrades in our story from last week.

You might think of next week’s Starship test flight as an apéritif before the entrées to come. Ars recently spoke with Lisa Watson-Morgan, the NASA engineer overseeing the agency’s contract with SpaceX to develop a modified version of Starship to land astronauts on the Moon. NASA has contracts with SpaceX worth more than $4 billion to develop and fly two Starship human landing missions under the umbrella of the agency’s Artemis program to return humans to the Moon.

We are publishing the entire interview with Watson-Morgan below, but first, let’s assess what SpaceX might accomplish with Starship this year.

There are many things to watch for on this test flight, including the deployment of 10 satellite simulators to test the ship’s payload accommodations and the performance of a beefed-up heat shield as the vehicle blazes through the atmosphere for reentry and splashdown in the Indian Ocean.

If this all works, SpaceX may try to launch a ship into low-Earth orbit on the eighth flight, expected to launch in the next couple of months. All of the Starship test flights to date have intentionally flown on suborbital trajectories, bringing the ship back toward reentry over the sea northwest of Australia after traveling halfway around the world.

Then, there’s an even bigger version of Starship called Block 3 that could begin flying before the end of the year. This version of the ship is the one that SpaceX will use to start experimenting with in-orbit refueling, according to Watson-Morgan.

In order to test refueling, two Starships will dock together in orbit, allowing one vehicle to transfer super-cold methane and liquid oxygen into the other. Nothing like this has ever been attempted at this scale. Future Starship missions to the Moon and Mars may require 10 or more tanker missions to gas up in low-Earth orbit. All of these missions will use different versions of the same basic Starship design: a human-rated lunar lander, a propellant depot, and a refueling tanker.

Artist’s illustration of Starship on the surface of the Moon. Credit: SpaceX

Questions for 2025

Catching Starship back at its launch tower and demonstrating orbital propellant transfer are the two most significant milestones on SpaceX’s roadmap for 2025.

SpaceX officials have said they aim to fly as many as 25 Starship missions this year, allowing engineers to more rapidly iterate on the vehicle’s design. SpaceX is constructing a second launch pad at its Starbase facility near Brownsville, Texas, to help speed up the launch cadence.

Can SpaceX achieve this flight rate in 2025? Will faster Starship manufacturing and reusability help the company fly more often? Will SpaceX fly its first ship-to-ship propellant transfer demonstration this year? When will Starship begin launching large batches of new-generation Starlink Internet satellites?

Licensing delays at the Federal Aviation Administration have been a thorn in SpaceX’s side for the last couple of years. Will those go away under the incoming administration of President-elect Donald Trump, who counts SpaceX founder Elon Musk as a key adviser?

And will SpaceX gain a larger role in NASA’s Artemis lunar program? The Artemis program’s architecture is sure to be reviewed by the Trump administration and the nominee for the agency’s next administrator, billionaire businessman and astronaut Jared Isaacman.

The very expensive Space Launch System rocket, developed by NASA with Boeing and other traditional aerospace contractors, might be canceled. NASA currently envisions the SLS rocket and Orion spacecraft as the transportation system to ferry astronauts between Earth and the vicinity of the Moon, where crews would meet up with a landing vehicle provided by commercial partners SpaceX and Blue Origin.

Watson-Morgan didn’t have answers to all of these questions. Many of them are well outside of her purview as Human Landing System program manager, so Ars didn’t ask. Instead, Ars discussed technical and schedule concerns with her during the half-hour interview. Here is one part of the discussion, lightly edited for clarity.

Ars: What do you hope to see from Flight 7 of Starship?

Lisa Watson-Morgan: One of the exciting parts of working with SpaceX are these test flights. They have a really fast turnaround, where they put in different lessons learned. I think you saw many of the flight objectives that they discussed from Flight 6, which was a great success. I think they mentioned different thermal testing experiments that they put on the ship in order to understand the different heating, the different loads on certain areas of the system. All that was really good with each one of those, in addition to how they configure the tiles. Then, from that, there’ll be additional tests that they will put on Flight 7, so you kind of get this iterative improvement and learning that we’ll get to see in Flight 7. So Flight 7 is the first Version 2 of their ship set. When I say that, I mean the ship, the booster, all the systems associated with it. So, from that, it’s really more just understanding how the system, how the flaps, how all of that interacts and works as they’re coming back in. Hopefully we’ll get to see some catches, that’s always exciting.

Ars: How did the in-space Raptor engine relight go on Flight 6 (on November 19)?

Lisa Watson-Morgan: Beautifully. And that’s something that’s really important to us because when we’re sitting on the Moon… well, actually, the whole path to the Moon as we are getting ready to land on the Moon, we’ll perform a series of maneuvers, and the Raptors will have an environment that is very, very cold. To that, it’s going to be important that they’re able to relight for landing purposes. So that was a great first step towards that. In addition, after we land, clearly the Raptors will be off, and it will get very cold, and they will have to relight in a cold environment (to get off the Moon). So that’s why that step was critical for the Human Landing System and NASA’s return to the Moon.

A recent artist’s illustration of two Starships docked together in low-Earth orbit. Credit: SpaceX

Ars: Which version of the ship is required for the propellant transfer demonstration, and what new features are on that version to enable this test?

Lisa Watson-Morgan: We’re looking forward to the Version 3, which is what’s coming up later on, sometime in ’25, in the near term, because that’s what we need for propellant transfer and the cryo fluid work that is also important to us… There are different systems in the V3 set that will help us with cryo fluid management. Obviously, with those, we have to have the couplers and the quick-disconnects in order for the two systems to have the right guidance, navigation, trajectory, all the control systems needed to hold their station-keeping in order to dock with each other, and then perform the fluid transfer. So all the fluid lines and all that’s associated with that, those systems, which we have seen in tests and held pieces of when we’ve been working with them at their site, we’ll get to see those actually in action on orbit.

Ars: Have there been any ground tests of these systems, whether it’s fluid couplers or docking systems? Can you talk about some of the ground tests that have gone into this development?

Lisa Watson-Morgan: Oh, absolutely. We’ve been working with them on ground tests for this past year. We’ve seen the ground testing and reviewed the data. Our team works with them on what we deem necessary for the various milestones. While the milestone contains proprietary (information), we work closely with them to ensure that it’s going to meet the intent, safety-wise as well as technically, of what we’re going to need to see. So they’ve done that.

Even more exciting, they have recently shipped some of their docking systems to the Johnson Space Center for testing with the Orion Lockheed Martin docking system, and that’s for Artemis III. Clearly, that’s how we’re going to receive the crew. So those are some exciting tests that we’ve been doing this past year as well that’s not just focused on, say, the booster and the ship. There are a lot of crew systems that are being developed now. We’re in work with them on how we’re going to effectuate the crew manual control requirements that we have, so it’s been a great balance to see what the crew needs, given the size of the ship. That’s been a great set of work. We have crew office hours where the crew travels to Hawthorne [SpaceX headquarters in California] and works one-on-one with the different responsible engineers in the different technical disciplines to make sure that they understand not just little words on the paper from a requirement, but actually what this means, and then how systems can be operated.

Ars: For the docking system, Orion uses the NASA Docking System, and SpaceX brings its own design to bear on Starship?

Lisa Watson-Morgan: This is something that I think the Human Landing System has done exceptionally well. When we wrote our high-level set of requirements, we also wrote it with a bigger picture in mind—looked into the overall standards of how things are typically done, and we just said it has to be compliant with it. So it’s a docking standard compliance, and SpaceX clearly meets that. They certainly do have the Dragon heritage, of course, with the International Space Station. So, because of that, we have high confidence that they’re all going to work very well. Still, it’s important to go ahead and perform the ground testing and get as much of that out of the way as we can.

Lisa Watson-Morgan, NASA’s HLS program manager, is based at Marshall Space Flight Center in Huntsville, Alabama. Credit: NASA/Aubrey Gemignani

Ars: How far along is the development and design of the layout of the crew compartment at the top of Starship? Is it far along, or is it still in the conceptual phase? What can you say about that?

Lisa Watson-Morgan: It’s much further along there. We’ve had our environmental control and life support systems, whether it’s carbon dioxide monitoring fans to make sure the air is circulating properly. We’ve been in a lot of work with SpaceX on the temperature. It’s… a large area (for the crew). The seats, making sure that the crew seats and the loads on that are appropriate. For all of that work, as the analysis work has been performed, the NASA team is reviewing it. They had a mock-up, actually, of some of their life support systems even as far back as eight-plus months ago. So there’s been a lot of progress on that.

Ars: Is SpaceX planning to use a touchscreen design for crew displays and controls, like they do with the Dragon spacecraft?

Lisa Watson-Morgan: We’re in talks about that, about what would be the best approach for the crew for the dynamic environment of landing.

Ars: I can imagine it is a pretty dynamic environment with those Raptor engines firing. It’s almost like a launch in reverse.

Lisa Watson-Morgan: Right. Those are some of the topics that get discussed in the crew office hours. That’s why it’s good to have the crew interacting directly, in addition to the different discipline leads, whether it’s structural, mechanical, propulsion, to have all those folks talking guidance and having control to say, “OK, well, when the system does this, here’s the mode we expect to see. Here’s the impact on the crew. And is this condition, or is the option space that we have on the table, appropriate for the next step, with respect to the displays.”

Ars: One of the big things SpaceX needs to prove out before going to the Moon with Starship is in-orbit propellant transfer. When do you see the ship-to-ship demonstration occurring?

Lisa Watson-Morgan: I see it occurring in ’25.

Ars: Anything more specific about the schedule for that?

Lisa Watson-Morgan: That’d be a question for SpaceX because they do have a number of flights that they’re performing commercially, for their maturity. We get the benefit of that. It’s actually a great partnership. I’ll tell you, it’s really good working with them on this, but they’d have to answer that question. I do foresee it happening in ’25.

Ars: What things do you need to see SpaceX accomplish before they’re ready for the refueling demo? I’m thinking of things like the second launch tower, potentially. Do they need to demonstrate a ship catch or anything like that before going for orbital refueling?

Lisa Watson-Morgan: I would say none of that’s required. You just kind of get down to, what are the basics? What are the basics that you need? So you need to be able to launch rapidly off the same pad, even. They’ve shown they can launch and catch within a matter of minutes. So that is good confidence there. The catching is part of their reuse strategy, which is more of their commercial approach, and not a NASA requirement. NASA reaps the benefit of it by good pricing as a result of their commercial model, but it is not a requirement that we have. So they could theoretically use the same pad to perform the propellant transfer and the long-duration flight, because all it requires is two launches, really, within a specified time period to where the two systems can meet in a planned trajectory or orbit to do the propellant transfer. So they could launch the first one, and then within a week or two or three, depending on what the concept of operations was that we thought we could achieve at that time, and then have the propellant transfer demo occur that way. So you don’t necessarily need two pads, but you do need more thermal characterization of the ship. I would say that is one of the areas (we need to see data on), and that is one of the reasons, I think, why they’re working so diligently on that.

Ars: You mentioned the long-duration flight demonstration. What does that entail?

Lisa Watson-Morgan: The simple objectives are to launch two different tankers or Starships. The Starship will eventually be a crewed system. Clearly, the ones that we’re talking about for the propellant transfer are not. It’s just to have the booster and Starship system launch, and within a few weeks, have another one launch, and have them rendezvous. They need to be able to find each other with their sensors. They need to be able to come close, very, very close, and they need to be able to dock together, connect, do the quick connect, and make sure they are able, then, to flow propellant and LOX (liquid oxygen) to another system. Then, we need to be able to measure the quantity of how much has gone over. And from that, then they need to safely undock and dispose.

Ars: So the long-duration flight demonstration is just part of what SpaceX needs to do in order to be ready for the propellant transfer demonstration?

Lisa Watson-Morgan: We call it long duration just because it’s not a 45-minute or an hour flight. Long duration, obviously, that’s a relative statement, but it’s a system that can stay up long enough to be able to find another Starship and perform those maneuvers and flow of fuel and LOX.

Ars: How much propellant will you transfer with this demonstration, and do you think you’ll get all the data you need in one demonstration, or will SpaceX need to try this several times?

Lisa Watson-Morgan: That’s something you can ask SpaceX (about how much propellant will be transferred). Clearly, I know, but there’s some sensitivity there. You’ve seen our requirements in our initial solicitation. We have thresholds and goals, meaning we want you to at least do this, but more is better, and that’s typically how we work almost everything. Working with commercial industry in these fixed-price contracts has worked exceptionally well, because when you have providers that are also wanting to explore commercially or trying to make a commercial system, they are interested in pushing more than what we would typically ask for, and so often we get that for an incredibly fair price.


Stephen Clark is a space reporter at Ars Technica, covering private space companies and the world’s space agencies. Stephen writes about the nexus of technology, science, policy, and business on and off the planet.


This PDF contains a playable copy of Doom

Here at Ars, we’re suckers for stories about hackers getting Doom running on everything from CAPTCHA robot checks and Windows’ notepad.exe to AI hallucinations and fluorescing gut bacteria. Despite all that experience, we were still thrown for a loop by a recent demonstration of Doom running in the usually static confines of a PDF file.

On the GitHub page for the quixotic project, coder ading2210 discusses how Adobe Acrobat included some robust support for JavaScript in the PDF file format. That JS coding support—which dates back decades and is still fully documented in Adobe’s official PDF specs—is currently implemented in a more limited, more secure form as part of PDFium, the built-in PDF-rendering engine of Chromium-based browsers.

In the past, hackers have used this little-known Adobe feature to code simple games like Breakout and Tetris into PDF documents. But ading2210 went further, recompiling a streamlined fork of Doom’s open source code using an old version of Emscripten that outputs optimized asm.js code.

With that code loaded, the Doom PDF can take inputs via the user typing in a designated text field and generate “video” output in the form of converted ASCII text fed into 200 individual text fields, each representing a horizontal line of the Doom display. The text in those fields is enough to simulate a six-color monochrome display at a “pretty poor but playable” 13 frames per second (about 80 ms per frame).
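
For a sense of how that rendering trick works, here is a rough Python sketch of the framebuffer-to-text step. The real project does this in JavaScript running inside the PDF, and the character ramp and field layout below are assumptions for illustration, not details taken from ading2210’s code.

```python
# Rough sketch of the text-field "display" idea, not ading2210's actual code.
# The real project runs as JavaScript inside the PDF and writes each row into
# one of 200 stacked text fields; the character ramp below is an assumption.
ASCII_RAMP = " .:-=#"   # six "shades", darkest to brightest

def frame_to_rows(framebuffer, width=320, height=200):
    """framebuffer: flat sequence of per-pixel brightness values in [0, 255]."""
    rows = []
    for y in range(height):
        line = framebuffer[y * width:(y + 1) * width]
        # Quantize each pixel's brightness onto one of the six characters.
        rows.append("".join(ASCII_RAMP[min(p * len(ASCII_RAMP) // 256, len(ASCII_RAMP) - 1)]
                            for p in line))
    return rows   # row i would become the value of text field i in the PDF

# Example: a flat mid-gray frame renders as 200 identical lines of '-' characters.
rows = frame_to_rows([128] * (320 * 200))
assert len(rows) == 200 and set(rows[0]) == {"-"}
```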


Ban on Chinese connected-car software is almost ready

However, the ban, as written, is not absolute. Companies can seek authorization to import software or hardware that would otherwise be outlawed, but the request would need to satisfy the US government and possibly be subject to conditions.

There are also exemptions for software for vehicles older than model year 2027 and hardware for vehicles older than model year 2030, including parts imported for warranty or repair work. (The government points out that applying the new rule retroactively would be a little pointless, as any harm would already have been done by vehicles with compromised systems that predate the rule going into effect.)

And the final rule would only apply to light-duty vehicles. Anything with a gross vehicle weight rating of more than 10,000 lbs is exempt but will be dealt with in “a separate regulation tailored to the commercial sector in the coming months.”
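
Expressed as rough pseudo-logic, the scope described above looks something like the sketch below. This is an illustrative paraphrase only; it ignores the authorization process and the other nuances of the actual rule.

```python
# Illustrative paraphrase of the rule's scope as described above; the actual
# regulation includes authorization requests and other details this ignores.
def ban_applies(component: str, model_year: int, gvwr_lbs: int) -> bool:
    """component is 'software' or 'hardware' with a covered Chinese nexus."""
    if gvwr_lbs > 10_000:              # heavier vehicles get a separate future rule
        return False
    if component == "software":
        return model_year >= 2027      # pre-2027 model years are exempt
    if component == "hardware":
        return model_year >= 2030      # pre-2030 model years are exempt
    return False

assert ban_applies("software", 2027, 5_000)
assert not ban_applies("hardware", 2029, 5_000)   # hardware exemption runs through MY2029
assert not ban_applies("software", 2028, 12_000)  # heavy-duty vehicles excluded for now
```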

Auto industry suppliers probably face the most disruption as a result of the new rule—just the presence of a Chinese-made module in a larger system is enough to trigger the import ban. But there should be little disruption to the US car market, at least for now.

Since the rules only go into effect from model year 2027, the few Chinese-made vehicles on sale in the US—models from Polestar, Volvo, Lincoln, and Buick—may remain on sale. However, Polestar’s Chinese ownership may prove somewhat of a sticking point compared to Ford and GM. Ars notes that lawyers representing Polestar met with the Commerce Department last week—we reached out to the automaker for a comment and will update this piece should we hear back.
