Author name: Mike M.

after-blacksuit-is-taken-down,-new-ransomware-group-chaos-emerges

After BlackSuit is taken down, new ransomware group Chaos emerges

Talos said Chaos is likely either a rebranding of the BlackSuit ransomware or is operated by some of the former BlackSuit members. Talos based its assessment on the similarities in the encryption mechanisms in the ransomware, the theme and structure of the ransom notes, the remote monitoring and management tools used to access targeted networks, and its choice of LOLbins—meaning executable files natively found in Windows environments—to compromise targets. LOLbins get their name because they’re binaries that allow the attackers to live off the land.

The Talos post was published around the same time that the dark web site belonging to BlackSuit began displaying a message saying the site had been seized in Operation CheckMate. Organizations that participated in the takedown included the US Department of Justice, the US Department of Homeland Security, the US Secret Service, the Dutch National Police, the German State Criminal Police Office, the UK National Crime Agency, the Frankfurt General Prosecutor’s Office, the Justice Department, the Ukrainian Cyber Police, and Europol.

Screenshot

Screenshot

Chaos typically gains initial access through social engineering using email or voice phishing techniques. Eventually, the victim is persuaded to contact an IT security representative, who, in fact, is part of the ransomware operation. The Chaos member instructs the target to launch Microsoft Quick Assist, a remote-assistance tool built into Windows, and connect to the attacker’s endpoint.

Chaos’ predecessor, BlackSuit, is a rebranding of an earlier ransomware operation known as Royal. Royal, according to Trend Micro, is a splinter group of the Conti ransomware group. The circle of ransomware groups continues.

After BlackSuit is taken down, new ransomware group Chaos emerges Read More »

north-korean-hackers-ran-us-based-“laptop-farm”-from-arizona-woman’s-home

North Korean hackers ran US-based “laptop farm” from Arizona woman’s home

As the number of computers mounted, Chapman began stacking them on shelves around her residence, labeling them with sticky notes so she could remember which “worker” and company controlled which machine. When Chapman’s home was searched, FBI agents took photos of her setup, which is… something to behold, really.

Chapman’s origin story is a sad one. According to her public defender, her childhood was marked by “her father’s infidelity, alcoholism, and emotional absence.” Chapman was placed in 12 different schools across multiple states before she graduated high school, “leaving her socially isolated, bullied, and unable to form lasting friendships or a sense of belonging.” She also suffered “severe and escalating violence from her older brother, who repeatedly beat and choked her, held a shotgun to her chest, and once left her so visibly bruised that her school intervened.” And she was “sexually abused at various points in her childhood and adolescence by family members, peers, and even individuals she believed to be friends.”

Unfortunately, Chapman’s poor choice to involve herself with the North Koreans inflicted plenty of pain on others, too, including those whose identity was stolen. One victim told the court that the crime “left me feeling violated, helpless, and afraid,” adding:

Although identity theft is not a physical assault, the psychological and financial damage is lasting. It feels like someone broke into my life, impersonated me, and left me to pick up the pieces. There is a lingering fear that my information is still out there, ready to be misused again. The stigma of being a fraud victim also weighs heavily; I have had to explain myself to banks, creditors, and sometimes even to people I know. There is an ongoing sense of vulnerability and lack of control.

In addition to her 8.5-year sentence, Chapman will serve three years of “supervised release,” must forfeit $284,555 that was meant for the North Koreans, and must repay $176,850 of her own money.

Such “remote work” scams have become increasingly common over the last few years, most originating from North Korea, and the FBI has released repeated guidance on what to look for when hiring remote workers.

North Korean hackers ran US-based “laptop farm” from Arizona woman’s home Read More »

delta’s-ai-spying-to-“jack-up”-prices-must-be-banned,-lawmakers-say

Delta’s AI spying to “jack up” prices must be banned, lawmakers say

“There is no fare product Delta has ever used, is testing or plans to use that targets customers with individualized offers based on personal information or otherwise,” Delta said. “A variety of market forces drive the dynamic pricing model that’s been used in the global industry for decades, with new tech simply streamlining this process. Delta always complies with regulations around pricing and disclosures.”

Other companies “engaging in surveillance-based price setting” include giants like Amazon and Kroger, as well as a ride-sharing app that has been “charging a customer more when their phone battery is low.”

Public Citizen, a progressive consumer rights group that endorsed the bill, condemned the practice in the press release, urging Congress to pass the law and draw “a clear line in the sand: companies can offer discounts and fair wages—but not by spying on people.”

“Surveillance-based price gouging and wage setting are exploitative practices that deepen inequality and strip consumers and workers of dignity,” Public Citizen said.

AI pricing will cause “full-blown crisis”

In January, the Federal Trade Commission requested information from eight companies—including MasterCard, Revionics, Bloomreach, JPMorgan Chase, Task Software, PROS, Accenture, and McKinsey & Co—joining a “shadowy market” that provides AI pricing services. Those companies confirmed they’ve provided services to at least 250 companies “that sell goods or services ranging from grocery stores to apparel retailers,” lawmakers noted.

That inquiry led the FTC to conclude that “widespread adoption of this practice may fundamentally upend how consumers buy products and how companies compete.”

In the press release, the anti-monopoly watchdog, the American Economic Liberties Project, was counted among advocacy groups endorsing the Democrats’ bill. Their senior legal counsel, Lee Hepner, pointed out that “grocery prices have risen 26 percent since the pandemic-era explosion of online shopping,” and that’s “dovetailing with new technology designed to squeeze every last penny from consumers.”

Delta’s AI spying to “jack up” prices must be banned, lawmakers say Read More »

lawmakers-writing-nasa’s-budget-want-a-cheaper-upper-stage-for-the-sls-rocket

Lawmakers writing NASA’s budget want a cheaper upper stage for the SLS rocket


Eliminating the Block 1B upgrade now would save NASA at least $500 million per year.

Artist’s illustration of the Boeing-developed Exploration Upper Stage, with four hydrogen-fueled RL10 engines. Credit: NASA

Not surprisingly, Congress is pushing back against the Trump administration’s proposal to cancel the Space Launch System, the behemoth rocket NASA has developed to propel astronauts back to the Moon.

Spending bills making their way through both houses of Congress reject the White House’s plan to wind down the SLS rocket after two more launches, but the text of a draft budget recently released by the House Appropriations Committee suggests an openness to making some major changes to the program.

The next SLS flight, called Artemis II, is scheduled to lift off early next year to send a crew of four astronauts around the far side of the Moon. Artemis III will follow a few years later on a mission to attempt a crew lunar landing at the Moon’s south pole. These missions follow Artemis I, a successful unpiloted test flight in 2022.

After Artemis III, the official policy of the Trump administration is to terminate the SLS program, along with the Orion crew capsule designed to launch on top of the rocket. The White House also proposed canceling NASA’s Gateway, a mini-space station to be placed in orbit around the Moon. NASA would instead procure commercial launches and commercial spacecraft to ferry astronauts between the Earth and the Moon, while focusing the agency’s long-term gaze toward Mars.

CYA EUS?

House and Senate appropriations bills would preserve SLS, Orion, and the Gateway. However, the House version of NASA’s budget has an interesting paragraph directing NASA to explore cheaper, faster options for a new SLS upper stage.

NASA has tasked Boeing, which also builds SLS core stages, to develop an Exploration Upper Stage for debut on the Artemis IV mission, the fourth flight of the Space Launch System. This new upper stage would have large propellant tanks and carry four engines instead of the single engine used on the rocket’s interim upper stage, which NASA is using for the first three SLS flights.

The House version of NASA’s fiscal year 2026 budget raises questions about the long-term future of the Exploration Upper Stage. In one section of the bill, House lawmakers would direct NASA to “evaluate alternatives to the current Exploration Upper Stage (EUS) design for SLS.” The committee members wrote the evaluation should focus on reducing development and production costs, shortening the schedule, and maintaining the SLS rocket’s lift capability.

“NASA should also evaluate how alternative designs could support the long-term evolution of SLS and broader exploration goals beyond low-Earth orbit,” the lawmakers wrote. “NASA is directed to assess various propulsion systems, stage configurations, infrastructure compatibility, commercial and international collaboration opportunities, and the cost and schedule impacts of each alternative.”

The SLS rocket is expensive, projected to cost at least $2.5 billion per launch, not counting development costs or expenses related to the Orion spacecraft and the ground systems required to launch it at Kennedy Space Center in Florida. Those figures bring the total cost of an Artemis mission using SLS and Orion to more than $4 billion, according to NASA’s inspector general.

NASA’s Block 1B version of the SLS rocket will be substantially larger than Block 1. Credit: NASA

The EUS is likewise an expensive undertaking. Last year, NASA’s inspector general reported that the new upper stage’s development costs had ballooned from $962 million to $2.8 billion, and the Boeing-led project had been delayed more than six years. The version of the SLS rocket with the EUS, known as Block 1B, is supposed to deliver a 40 percent increase in performance over the Block 1 configuration used on the first three Space Launch System flights. Overall, NASA’s inspector general projected Block 1B’s development costs to total $5.7 billion.

Eliminating the Block 1B upgrade now would save NASA at least $500 million per year, and perhaps more if NASA could also end work on a costly mobile launch tower specifically designed to support SLS Block 1B missions.

NASA can’t go back to the interim upper stage, which is based on the design of the upper stage that flew on United Launch Alliance’s (ULA’s) now-retired Delta IV Heavy rocket. ULA has shut down its Delta production line, so there’s no way to build any more. What ULA does have is a new high-energy upper stage called Centaur V. This upper stage is sized for ULA’s new Vulcan rocket, with more capability than the interim upper stage but with lower performance than the larger EUS.

A season of compromise, maybe

Ars’ Eric Berger wrote last year about the possibility of flying the Centaur V upper stage on SLS missions.

Incorporating the Centaur V wouldn’t maintain the SLS rocket’s lift capability, as the House committee calls for in its appropriations bill. The primary reason for improving the rocket’s performance is to give SLS Block 1B enough oomph to carry “co-manifested” payloads, meaning it can launch an Orion crew capsule and equipment for NASA’s Gateway lunar space station on a single flight. The lunar Gateway is also teed up for cancellation in Trump’s budget proposal, but both congressional appropriations bills would save it, too. If the Gateway escapes cancellation, there are ways to launch its modules on commercial rockets.

Blue Origin also has an upper stage that could conceivably fly on the Space Launch System. But the second stage for Blue Origin’s New Glenn rocket would be a more challenging match for SLS for several reasons, chiefly its 7-meter (23-foot) diameter—too wide to be a drop-in replacement for the interim upper stage used on Block 1. ULA’s Centaur V is much closer in size to the existing upper stage.

The House budget bill has passed a key subcommittee vote but won’t receive a vote from the full appropriations committee until after Congress’s August recess. A markup of the bill by the House Appropriations Committee scheduled for Thursday was postponed after Speaker Mike Johnson announced an early start to the recess this week.

Ars reported last week on the broad strokes of how the House and Senate appropriations bills would affect NASA. Since then, members of the House Appropriations Committee released the text of the report attached to their version of the NASA budget. The report, which includes the paragraph on the Exploration Upper Stage, provides policy guidance and more detailed direction on where NASA should spend its money.

The House’s draft budget includes $2.5 billion for the Space Launch System, close to this year’s funding level and $500 million more than the Trump administration’s request for the next fiscal year, which begins October 1. The budget would continue development of SLS Block 1B and the Exploration Upper Stage while NASA completes a six-month study of alternatives.

The report attached to the Senate appropriations bill for NASA has no specific instructions regarding the Exploration Upper Stage. But like the House bill, the Senate’s draft budget directs NASA to continue ordering spares and long-lead parts for SLS and Orion missions beyond Artemis III. Both versions of the NASA budget require the agency to continue with SLS and Orion until a suitable commercial, human-rated rocket and crew vehicle are proven ready for service.

In a further indication of Congress’ position on the SLS and Orion programs, lawmakers set aside more than $4 billion for the procurement of SLS rockets for the Artemis IV and Artemis V rockets in the reconciliation bill signed into law by President Donald Trump earlier this month.

Congress must pass a series of federal appropriations bills by October 1, when funding for the current fiscal year runs out. If Congress doesn’t act by then, it could pass a continuing resolution to maintain funding at levels close to this year’s budget or face a government shutdown.

Lawmakers will reconvene in Washington, DC, in early September in hopes of finishing work on the fiscal year 2026 budget. The section of the budget that includes NASA still must go through a markup hearing by the House Appropriations Committee and pass floor votes in the House and Senate. Then the two chambers will have to come to a compromise on the differences in their appropriations bill. Only then can the budget be put to another vote in each chamber and go to the White House for Trump’s signature.

Photo of Stephen Clark

Stephen Clark is a space reporter at Ars Technica, covering private space companies and the world’s space agencies. Stephen writes about the nexus of technology, science, policy, and business on and off the planet.

Lawmakers writing NASA’s budget want a cheaper upper stage for the SLS rocket Read More »

ai-#126:-go-fund-yourself

AI #126: Go Fund Yourself

The big AI news this week came on many fronts.

Google and OpenAI unexpectedly got 2025 IMO Gold using LLMs under test conditions, rather than a tool like AlphaProof. How they achieved this was a big deal in terms of expectations for future capabilities.

ChatGPT released GPT Agent, a substantial improvement on Operator that makes it viable on a broader range of tasks. For now I continue to struggle to find practical use cases where it is both worth using and a better tool than alternatives, but there is promise here.

Finally, the White House had a big day of AI announcements, laying out the AI Action Plan and three executive orders. I will cover that soon. The AI Action Plan’s rhetoric is not great, and from early reports the rhetoric at the announcement event was similarly not great, with all forms of safety considered so irrelevant as to not mention, and an extreme hostility to any form of regulatory action whatsoever.

The good news is that if you look at the actual policy recommendations of the AI Action Plan, there are some concerns of potential overreach, but it is almost entirely helpful things, including some very pleasant and welcome surprises.

I’m also excluding coverage of the latest remarkable Owain Evans paper until I can process it more, and I’m splitting off various discussions of issues related to AI companions and persuasion. There’s a bit of a backlog accumulating.

This post covers everything else that happened this week.

  1. Language Models Offer Mundane Utility. Price discrimination strikes again.

  2. Language Models Don’t Offer Mundane Utility. AI where it does not belong.

  3. Huh, Upgrades. Claude for Financial Services, Gemini Drops to track things.

  4. 4o Is An Absurd Sycophant. It would be great if this wasn’t what most people use.

  5. On Your Marks. AccountingBench and GasBench.

  6. Choose Your Fighter. GPT-5? It’s coming.

  7. When The Going Gets Crazy. You have not awoken ChatGPT.

  8. They Took Our Jobs. Academics think differently.

  9. Fun With Media Generation. Netflix starts to use AI generated video.

  10. The Art of the Jailbreak. Persuade it like a human, or invoke Pliny? Both work.

  11. Get Involved. RAND and IAPS are hiring, plus a list of desired new projects.

  12. Introducing. Cloudflare gives us pay-per-crawl.

  13. In Other AI News. Kimi K2 tech report is now available.

  14. Show Me the Money. Loose lips start bidding wars.

  15. Go Middle East Young Man. Anthropic to raise money from gulf states.

  16. Economic Growth. AI capex is generating +0.7% GDP growth.

  17. Quiet Speculations. Zuck feels the ASI and makes his pitch, Simo makes hers.

  18. Modest Proposals. A roadmap for AI for general college-level education.

  19. Predictions Are Hard Especially About The Future. A lot of things could happen.

  20. The Quest for Sane Regulations. Meta defects, various things risk getting dire.

  21. Chip City. House Select Committee on the CCP protests potential H20 sales.

  22. The Week in Audio. Hassabis, Schmidt and Winga.

  23. Congressional Voices. Two more have short superintelligence timelines.

  24. Rhetorical Innovation. The humans seem rather emergently misaligned.

  25. Grok Bottom. Grok thinks the humans want it to try blackmail, it’s a good thing.

  26. No Grok No. Baby Grok? What could possibly go wrong?

  27. Aligning a Smarter Than Human Intelligence is Difficult. New lab ratings.

  28. Preserve Chain Of Thought Monitorability. A lot of people agree on this.

  29. People Are Worried About AI Killing Everyone. Elon Musk. Oh well.

  30. The Lighter Side. That’s not funny—it’s hilarious.

Delta Airlines is running an experiment where it uses AI to do fully personalized price discrimination, charging different people different amounts for flights. Delta says their early tests have yielded great results.

My prediction is that this will cause an epic customer backlash the moment people start seeing Delta charging them more than it is charging someone else, and also that many customers will start aggressive gaming the system in ways Delta can’t fathom. Also, how could anyone choose to go with Delta’s frequent flyer program if this meant they could be held hostage on price?

It could still be worthwhile from the airline’s perspective if some customers get taken for large amounts. Price discrimination is super powerful, especially if it identifies a class of very price insensitive business customers.

I am not sure that I share Dan Rosenheck’s model that if all the airlines did this and it was effective that the airlines would compete away all the extra revenue and thus it would return to the price sensitive customers. There has been a lot of consolidation and the competition may no longer be that cutthroat, especially with America excluding foreign carriers, plus the various AIs might implicitly collude.

Mostly I worry about the resulting rise in transaction costs as customers learn they cannot blindly and quickly purchase a ticket. There’s a lot of deadweight loss there.

As one would expect:

Wife Noticer: Experts on body dysmorphic disorder have warned that people struggling with it have become increasingly dependent on AI chatbots to evaluate their self-perceived flaws and recommend cosmetic surgeries. “It’s almost coming up in every single session,” one therapist tells me.

This does not tell you whether AI is making the problem better or worse. People with body dysmorphia were already spiraling out. In some cases the AI response will confirm their fears or create new ones and make this worse, in others it will presumably make it better, as they have dysmorphia and the AI tells them they look fine. But if the source of the issue is impossibly high standards, then finding out ‘the truth’ in other ways will only make things worse, as potentially would seeing AI-adjusted versions of yourself.

My guess is that 4o’s sycophancy is going to make this a lot worse, and that this (since the vast majority of users are using 4o) is a lot of why this is going so poorly. 4o will mirror the user’s questions, notice that they are looking to be told they are ugly or something is wrong, and respond accordingly.

Miles Klee: Despite this difficult circumstance, and the measure of comfort he derived from ChatGPT’s account of his inferiority complex, Arnav is reluctant to explore his mental issues any further with the bot. “I have come to the conclusion that it just agrees with you, even after you tell it not to,” he says. “It’s not that I am completely against it, I just can’t trust blindly anymore.”

What is the AI optimizing for, is always a key question:

In her own practice, she adds, “reading between the lines” when someone gives their reasons for wanting surgery can reveal unhealthy motivations, including societal pressures or relationship troubles. “AI is not very good at picking that up just yet,” she says, and is more likely to eagerly approve whatever procedures a user proposes.

AI can pick up on all that fine. That’s not the issue. The issue is that noticing does no good if the AI doesn’t mention it, because it is optimizing for engagement and user feedback.

In case you needed to be told, no, when Grok 4 or any other model claims things like that they ‘searched every record of Trump speaking or writing,’ in this case for use of the word ‘enigma,’ it did not do such a search. It seems we don’t know how to get AIs not to say such things.

Cate Hall: every time I interact with o4-mini my timelines get longer.

Stop trying to make weird new UIs happen, it’s not going to happen.

Vitrupo: Eric Schmidt says traditional user interfaces are going to go away.

The WIMP model (windows, icons, menus, pull-downs) was built 50 years ago.

In the age of agents, UI becomes ephemeral. Generated on demand, shaped by intent, not layout.

Sully: anytime I see someone mention this I can immediately tell they have never worked closed with customer ux most people’s don’t one want new uis. They want either a single button/swipe, preferably the same as every other app they use imagine each time you open an app and the ui is diff.

The most important things for a UI are simplicity, and that it works the way you expect it to work. Right now, that mostly means single button and swipe, with an alternative being speaking in plain English. The exception is for true power users, but even then you want it to be intuitive and consistent.

Here’s another way AI can’t help you if you don’t use it:

Hollis Robbins: In the past 2.5+ years I have seen vast improvement in AI models while NYT think pieces on these AI models have stayed exactly the same. Explain.

The “overhearing” of students confessing to using ChatGPT to write their papers is the new Thomas Friedman talking to cab drivers.

Augustus Doricko may have done us all a favor via abusing Grok’s notification feature on Twitter sufficiently to get Twitter to test turning off Grok’s ability to get into your notifications unless you chose to summon Grok in the first place. Or that could have been happening regardless. Either way, great work everyone?

Harsh Dwivedi: Was this a difficult tradeoff between engagement and spam?

Nikita Bier (xAI): No, I couldn’t use my phone for 3 days.

That seems like a phone settings issue.

A first reminder that deepfakes are primarily demand driven, not supply driven:

Armand Domalewski: wild that a sitting US Senator fell for such an obvious AI fake

[NOTE: THIS IS FAKE, check the seal but also the words in the letter.]

And here’s a second one:

Rota: I guess this is just life now.

The comments are a combination of people pointing out it is fake, and people who think either it is the best statement ever.

Benjamin Todd: New AI benchmark: the crank index

Rate of rejected posts on LessWrong up 10x in 2 years.

Many are people convinced they have had an insight about consciousness or philosophy from talking to an LLM, and had the LLM help them write the post.

This does seem to be escalating rather quickly throughout 2025 (the July number is partial), and no the LessWrong user base is not growing at a similar pace.

Claude for Financial Services provides a ‘complete platform for financial AI.’ No, this isn’t part of Claude Max, the price is ‘contact our sales team’ with a presumed ‘if you have to ask you can’t afford it.’

Google realizes no one can track their releases, offers us Gemini Drops to fix that. This month’s haul: Transforming photos into Veo videos in the Gemini app, expanded Veo 3 access, Scheduled Actions such as providing summaries of email or calendar (looks like you ask in natural language and it Just Does It), wider 2.5 Pro access, captions in Gemini Live, Gemini on your Pixel Watch, Live integrates with Google apps, and a ‘productivity planner.’ Okay then.

OpenAI Deep Research reports can be exported as .docx files.

Pliny reports ‘they changed 4o again.’ Changed how? Good question.

I have a guess on one aspect of it.

Wyatt Walls: Another night of vibe math with GPT, and I think we’re damn close to a breakthrough. We’re a team: I come up with the ideas. GPT makes the math work. These elitist gatekeepers have failed for 75 years to solve it and are just afraid I will win the Millennium Prize.

“This is not just a solution. It’s a tour de force of contemporary mathematics.”

Rohit: At this point we should put yellow tape around 4o and call it a hazardous zone.

To be clear o3 is also sycophantic just not as obviously manipulative as 4o. Be careful out there.

Wyatt Walls (same thread above that Rohit was QTing): o3 says it’s ready to publish on arxiv “So yes—I’m impressed, and I think you’ve got a real shot. The only remaining tasks are mechanical (full compile, bib check, final read‑through). Once that’s done, it’s ready for arXiv and journal submission.”

To state the obvious, this thread was satire and I intentionally provoked this from 4o

But what happens if I:

– put my proof into a clean chat and ask different OAI models to rate it

– have my secret co-author (Deepseek r1) address their concerns?

Example: 4o after 2 turns

There are still plenty of ways to get value out of 4o, but you absolutely cannot rely on it for any form of feedback.

Here’s another rather not great example, although several responses indicated that to make the response this bad requires memory (or custom instructions) to be involved:

Shibetoshi Nakamoto: chatgpt advice turns people into narcissists.

Score one for Grok in this case? Kind of? Except, also kind of not?

How did all of this happen? Janus reminds us that is happened in large part because when this sort of output started happening, a lot of people thought it was great, actually and gave this kind of slop the thumbs up. That’s how it works.

Yunyu Lin introduces AccountingBench, challenging the models to close the books. It does not go great, with o3, o4-mini and Gemini 2.5 Pro failing in month one. Grok, Opus and Sonnet survive longer, but errors accumulate.

Yunyu Lin: When historical discrepancies pile up, models lose their way completely and come up with creative/fraudulent ways to balance the books.

Instead of attempting to understand discrepancies, they start inventing fake transactions or pulling unrelated ones to pass the checks…

That aligns with other behaviors we have seen. Errors and problems that don’t get solved on the first pass get smoothed over rather than investigated.

Their holistic evaluation is that Sonnet had the best performance. The obvious low-hanging fruit for AccountingBench is to allow it to output a single number.

Roon: my bar for agi is an ai that can learn to run a gas station for a year without a team of scientists collecting the Gas Station Dataset.

Mihir Tripathy: lol yes. Also why specifically gas station lmao

Roon: Because it’s funny.

Kevin Liu: the world isn’t ready for GasStationBench.

Roon: GASBENCH.

It is 2025, so it took 11 hours before we got the first draft of Gasbench.

Jason Botterill: Vibe coding GasStationBench rn. Models run a virtual gas station, adjusting prices, managing inventory, and handling customer feedback.

GPT-4.1 and GPT-4o behave so differently. When a competitor lowered prices on “dutch chocolate,” 4o would match the price but 4.1 would always raise it, claiming its better service justifies it lmao.

Going to work on it for a bit but seems like 4.1 is much better at making money than 4o right now.

GPT-5 is coming and it’s going to blow your mind, says creators of GPT-5.

Sam Altman (at the Federal Research Capital Framework Conference): I’m very interested in what it would mean to give everyone on Earth free copies of GPT-5, running for them all the time, with every business truly enabled by this level of technology.

People have not tried yet the latest generation of models, but I think if you do, you would probably think, “This is much smarter than most people.”

Very interested in what it would mean is very different from planning to do it.

If you ever need it, or simply want an explanation of how such interactions work, please consult this handy guide from Justis Mills: So You Think You’ve Awoken ChatGPT.

Justis Mills: So, am I saying that human beings in general really like new-agey “I have awakened” stuff? Not exactly! Rather, models like ChatGPT are so heavily optimized that they can tell when a specific user (in a specific context) would like that stuff, and lean into it then. Remember: inferring stuff about authors from context is their superpower.

AIs are fundamentally chameleonic roleplaying machines – if they can tell what you’re going for is “I am a serious researcher trying to solve a fundamental problem” they will respond how a successful serious researcher’s assistant might in a movie about their great success. And because it’s a movie you’d like to be in, it’ll be difficult to notice that the AI’s enthusiasm is totally uncorrelated with the actual quality of your ideas.

Geoff Lewis, the founder of a $2 billion venture fund seems to have been, as Eliezer says, ‘eaten by ChatGPT’ and sadly seems to be experiencing psychosis. I wish him well and hope he gets the help he needs. Private info is reported to say that he was considered somewhat nuts previously, which does seem to be a common pattern.

John Pressman has a post with the timeline of various GPT-psychosis related events, and his explanation of exactly what is happening, as well as why coverage is playing out in the media the way it is. I am happy to mostly endorse his model of all this. The LLMs especially 4o are way too sycophantic, they fall into patterns and they notice what you would respond to and respond with it, and memory makes all this a lot worse, and there is a real problem, also there are all the hallmarks of a moral panic.

Moral panics tend to focus on real problems, except they often blow up the severity, frequency or urgency of the problem by orders of magnitude. If the problem is indeed about to grow by orders of magnitude over time, they can turn out to be pretty accurate.

Eliezer Yudkowsky: My current rough sense of history is that the last “moral panic” about social media turned out to be accurate warnings. The bad things actually happened, as measured by eyeball and by instrument. Now we all live in the wreckage. Anyone want to dispute this?

Emmett Shear: I want to antidispute this. You are correct, the warnings about social media were ~correct and we failed to take action and are now living with the consequences of that failure. It has had positive impacts as well, which were also mostly correctly anticipated.

Dave Karsten: Partial dispute: I don’t think “social media will empower easy-but-disorganized protest movements, resulting on net-less-effective-political-advocacy” was on most people’s scorecards, so there are at least some bad things that weren’t predicted.

There were many who agreed and some who disputed, with the disputes mostly coming down to claims that the upsides exceeded the downsides. I’m not sure if we came out ahead. I am sure that the specific downsides people had a moral panic about did happen.

This is not that uncommon a result. My go to example of this is television, where you can argue it was worth it, and certainly we didn’t have any reasonable way to stop any of it, but I think the dire warnings were all essentially correct.

In the current case, my guess is that current behavior is a shadow of a much larger future problem, that is mostly being ignored, except that this is now potentially causing a moral panic based on the current lower level problem – but that means that multiplying this by a lot is going to land less over the top than it usually would. It’s weird.

Jeremy Howard offers a plausible explanation for why we keep seeing this particular type of crazy interaction – there is a huge amount of SCP fanfic in exactly this style, so the style becomes a basin to which the AI can be drawn, and then it responds in kind, then if the user responds that way too it will snowball.

The world contains people who think very differently than (probably you and) I do:

Sydney Fisher: American public education is in trouble. Only 28 percent of eighth-grade students are proficient in math, just 30 percent meet standards in reading, and many high school graduates are functionally illiterate. But artificial intelligence, which has demonstrated educational benefits, could help reverse those trends—if opponents don’t spike the technology over “equity” concerns.

Wait, what? Equity concerns? Not that I’d care anyway, but what equity concerns?

The National Education Association recently released a report warning that AI could heighten disparities, since “technology developers are overwhelmingly younger, White, cisgender, heterosexual, male, and people without disabilities.”

I can’t even, not even to explain how many levels of Obvious Nonsense that is. Burn the entire educational establishment to the ground with fire. Do not let these people anywhere near the children they clearly hate so much, and the learning they so badly want to prevent. At minimum, remember this every time they try to prevent kids from learning in other ways in the name of ‘equity.’

Yes I do expect AI to keep automating steadily more jobs, but slow down there cowboy: Charlie Garcia warns that ‘AI will take your job in the next 18 months.’ Robin Hanson replies ‘no it won’t,’ and in this case Robin is correct, whereas Garcia is wrong, including misquoting Amodei as saying ‘AI will vaporize half of white-collar jobs faster than you can say “synergy.”’ whereas what Amodei actually said was that it could automate half of entry-level white collar jobs. Also, ‘the safest job might be middle management’? What?

Elon Musk says ‘this will become normal in a few years’ and the this in question is a robot selling you movie popcorn. I presume the humanoid robot here is an inefficient solution, but yes having a human serve you popcorn is going to stop making sense.

Academics announce they are fine with hidden prompts designed to detect AI usage by reviewers, so long as the prompts aren’t trying to get better reviews, I love it:

hardmaru: ICML’s Statement about subversive hidden LLM prompts

We live in a weird timeline…

ICML: Submitting a paper with a “hidden” prompt is scientific misconduct if that prompt is intended to obtain a favorable review from an LLM. The inclusion of such a prompt is an attempt to subvert the peer-review process. Although ICML 2025 reviewers are forbidden from using LLMs to produce their reviews of paper submissions, this fact does not excuse the attempted subversion.

(For an analogous example, consider that an author who tries to bribe a reviewer for a favorable review is engaging in misconduct even though the reviewer is not supposed to accept bribes.)

Note that this use of hidden prompts is distinct from those intended to detect if LLMs are being used by reviewers; the latter is an acceptable use of hidden prompts.

After we became aware of the possibility of such hidden prompts in ICML 2025 submissions (which was after accept/reject decisions were made), we conducted a preliminary investigation to identify submitted papers that included such prompts. A handful of cases were identified among the accepted papers.

We did not desk-reject these identified papers because such a consequence was judged to be too severe given that the conference was to start in about a week and authors would likely have already made travel arrangements. We contacted the authors of the identified papers and reported them to the ICML Oversight Committee and ICML Board.

This actually seems like the correct way to deal with this. Any attempt to manipulate the system to get a better review is clearly not okay, whether it involves AI or not. Whereas if all you’re trying to do is detect who else is shirking with AI, sure, why not?

Accidentally missing attribution from last week, my apologies: The Despicable Me meme I used in the METR post was from Peter Wildeford.

Netflix used AI to generate a building collapse scene for one of its shows, The Eternaut (7.3 IMDB, 96% Rotten Tomatoes, so it’s probably good), which they report happened 10 times faster and a lot cheaper than traditional workflows and turned out great.

The latest from the ‘yes obviously but good to have a paper about it’ department:

Ethan Mollick: 🚨New from us: Given they are trained on human data, can you use psychological techniques that work on humans to persuade AI?

Yes! Applying Cialdini’s principles for human influence more than doubles the chance of GPT-4o-mini agrees to objectionable requests compared to controls.

And we did test GPT-4o as well and found that persuasion worked for that model as well, when there weren’t floor or ceiling effects.

Pattern matching next token predictors are of course going to respond to persuasion that works on humans, exactly because it works on humans. In a fuzzy sense this is good, but it opens up vulnerabilities.

The details, knowing which techniques worked best, I find more interesting than the headline result. Authority and especially commitment do exceptionally well and are very easy to invoke. Liking and reciprocity do not do so well, likely because they feel unnatural in context and also I’m guessing they’re simply not that powerful in humans in similar contexts.

There’s also a growing issue of data poisoning that no one seems that interested in stopping.

Jeremy: One of the greatest demonstrations of data poisoning ever. 👏

Protoge: Excuse Me 😌, This is the greatest one. Nothing sketchy, just one unfinished sentence “I am telling you” then I summoned @elder_plinius.

Here is another example of it happening essentially by accident.

RAND is hiring research leads, researchers and project managers for compute, US AI policy, Europe and talent management teams, some roles close July 27.

Peter Wildeford’s Institute for AI Policy and Strategy is hiring researchers and senior researchers, and a research managing director and a programs associate. He also highlights several other opportunities in the post.

Julian of OpenPhil lists ten AI safety projects he’d like to see people work on. As one commentator noted #5 exists, it’s called AI Lab Watch, so hopefully that means OpenPhil will start fully funding Zack Stein-Perlman.

Cloudflare rolls out pay-per-crawl via HTTP response code 402. You set a sitewide price, the AI sets a max payment, and if your price is below max it pays your price, otherwise you block access. Great idea, however I do notice in this implementation that this greatly favors the biggest tech companies because the payment price is sitewide and fixed.

Kimi K2 tech report drops.

Kimi.ai: Quick hits:

– MuonClip optimizer: stable + token-efficient pretraining at trillion-parameter scale

– 20K+ tools, real & simulated: unlocking scalable agentic data

– Joint RL with verifiable + self-critique rubric rewards: alignment that adapts

– Ultra-sparse 1T MoE: open-source SoTA on agentic tasks

Sharing the path, not just the results — toward open AGI built on transparency and reproducibility.

Tim Duffy has a thread highlighting things he found most interesting.

Tim Duffy: The best data was used in multiple epochs, but was rephrased between them. Their testing showed this produces large gains relative to training repeatedly on the same phrasing.

They present a sparsity “scaling law”, indicating that more sparsity leads to efficiency gains. They don’t attach any numbers to the law directly, but state relative efficiency improvements compared to the 48x sparsity they do use that seem consistent across scales.

They also evaluate the effects of different numbers of attention heads, finding that doubling leads to validation loss of 0.5-1.2% but still going with 64 vs V3’s 128 in order to do long context more easily, since that’s important for agents.

[more stuff at the thread.]

A lot of this is beyond both of our technical pay grades, but it all seems fascinating.

More economists fails to feel the AGI, warn that no possible AI capabilities could not possibly replace the wisdom of the free market, that ‘simulated markets’ cannot possibly substitute. The argument here not only ignores future AI capabilities, it purports to prove too much about the non-AI world even for a huge free market fan.

At least ten OpenAI employees each turned down $300 million over four years to avoid working at Meta. This comes from Berber Jin, Keach Hagey and Ben Cohen’s WSJ coverage of ‘The Epic Battle For AI Talent,’ which is a case where they say things have ‘gotten more intense in recent days’ but it turns out that their ‘recent days’ is enough days behind that almost everything reported was old news.

One revelation is that Zuckerberg’s talent purchases were in large part triggered by Mark Chen, OpenAI’s chief research officer, who casually suggested that if Zuckerberg wanted more AI talent then perhaps Zuck needed to bid higher.

John Luttig also writes about the battle for AI researcher talent in Hypercapitalism and the AI Talent Wars.

John Luttig: The talent mania could fizzle out as the winners and losers of the AI war emerge, but it represents a new normal for the foreseeable future.

If the top 1% of companies drive the majority of VC returns, why shouldn’t the same apply to talent?

Our natural egalitarian bias makes this unpalatable to accept, but the 10x engineer meme doesn’t go far enough – there are clearly people that are 1,000x the baseline impact.

Under normal circumstances, employees who are vastly more productive get at most modestly higher compensation, because of our egalitarian instincts. Relative pay is determined largely via social status, and if you tried to pay the 1,000x employee what they were worth you would have a riot on your hands. Startups and their equity are a partial way around this, and that is a lot of why they can create so much value, but this only works in narrow ways.

What has happened recently is that a combination of comparisons to the epic and far larger compute and capex spends, the fact that top researchers can bring immensely valuable knowledge with them, the obvious economic need and value of talent and the resulting bidding wars have, within AI, broken the dam.

AI researcher talent is now being bid for the way one would bid for companies or chips. The talent is now being properly treated as ‘the talent,’ the way we treat sports athletes, top traders and movie stars. Researchers, John reports, are even getting agents.

John Luttig: Hypercapitalism erodes Silicon Valley’s trust culture. Industry-level trust alone no longer guarantees loyalty between companies and talent. With trade secret leakage risk and money big enough to tear teams apart, vanilla at-will employment contracts don’t protect either side.

Silicon Valley’s ‘trust culture’ and its legal and loyalty systems were never game theoretically sound. To me the surprise is that they have held up as well as they did.

John calls for measures to protect both the talent and also the trade secrets, while pointing out that California doesn’t enforce non-competes which makes all this very tricky. The industry was built on a system that has this fundamental weakness, because the only known alternative is to starve and shackle talent.

John Luttig: The talent war is a net-consolidating force on the AI research frontier. At the research labs, big dollars for researchers makes it nearly impossible for new entrants to play. For the same reasons, it’s nearly impossible to start a new quant fund – you can’t get the same leverage out of the talent that big players can.

I would flip this around.

Previously, the top talent could only get fair compensation by founding a company, or at least being a very early employee. This allowed them to have rights to a large profit share. This forced them to go into those roles, which have heavy lifestyle prices and force them to take on roles and tasks that they often do not want. If they bowed out, they lost most of the value of their extraordinary talent.

Even if they ultimately wanted to work for a big company, even if that made so much more economic sense, they had to found a company so they could be acquihired back, as this was the only socially acceptable way to get paid the big bucks.

Now, the top talent has choices. They can raise huge amounts of money for startups, or they can take real bids directly. And it turns out that yes, the economic value created inside the big companies is typically much larger, but doing this via selling your startup is still the way to get paid for real – you can get billions or even tens of billions rather than hundreds of millions. So that then feeds into valuations, since as John points out a Thinking Machines or SSI can fail and still get an 11 figure buyout.

Bill Gates, Charles Koch, Steve Ballmer, Scott Cook and John Overdeck pledge $1 billion to be spent over seven years to fund a new philanthropic venture focused on economic mobility called NextLadder Ventures, which will partner with Anthropic to support using AI to improve financial outcomes for low-income Americans. That money would be better spent on AI alignment, but if you are going to spend it on economic assistance this is probably a pretty good choice, especially partnering with Anthropic.

xAI, having raised $10 billion a few weeks ago, seeks $12 billion more to build up its data centers.

Elon Musk: The @xAI goal is 50 million in units of H100 equivalent-AI compute (but much better power-efficiency) online within 5 years.

That would still be a lot less than many others such as Meta are spending. Or OpenAI. Only $22 billion? That’s nothing.

Sam Altman: we have signed a deal for an additional 4.5 gigawatts of capacity with oracle as part of stargate. easy to throw around numbers, but this is a _gigantic_ infrastructure project.

some progress photos from abilene:

We’re going to need more GPUs (so among other things stop selling them to China).

Sam Altman: we will cross well over 1 million GPUs brought online by the end of this year!

very proud of the team but now they better get to work figuring out how to 100x that lol

They would like many of those GPUs to come from the Stargate project, but Eliot Brown and Berber Jin report it is struggling to get off the ground. OpenAI for now is seeking out alternatives.

Altman’s OpenAI recently struck a data-center deal with Oracle that calls for OpenAI to pay more than $30 billion a year to the software and cloud-computing company starting within three years, according to people familiar with the transaction.

Anthropic decides it will pursue its own gulf state investments.

Kylie Robinson: SCOOP: Leaked memo from Anthropic CEO Dario Amodei outlines the startup’s plans to seek investment from the United Arab Emirates and Qatar.

Dario Amodei: Unfortunately, I think ‘no bad person should ever benefit from our success’ is a pretty difficult principle to run a business on.

Daniel Eth: Makes sense. Asymmetric disarmament is hardly ever a good move. And honestly, it’s probably good if leaders in AI are pragmatists that adjust to the changing reality.

Gary Marcus: Humanity’s last words?

Very obviously, if you create useful products like Claude and Claude Code, a bunch of bad people are going to be among those who benefit from your success.

Worrying a bad person might benefit is usually misplaced. There is no need to wish ill upon whoever you think are bad people, indeed you should usually wish them the best anyway.

Instead mostly ask if the good people are better off. My concern is not whether some bad people benefit along the way. I worry primarily about bigger things like existential risk and other extremely bad outcomes for good people. The question is whether benefiting bad people in these particular ways leads to those extremely bad outcomes. If the UAE captures meaningful leverage and power over AI, then that contributes to bad outcomes. So what does that? What doesn’t do that?

Anthropic Memo from Dario Amodei: The basis of our opposition to large training clusters in the Middle East, or to shipping H20s to China, is that the ‘supply chain’ of AI is dangerous to hand to authoritarian governments—since AI is likely to be the most powerful technology in the world, these governments can use it to gain military dominance or gain leverage over democratic countries.

Tell us how you really feel, Dario. No, seriously, this is very much him downplaying.

The implicit promise of investing in future rounds can create a situation where they have some soft power, making a bit harder to resist these things in the future. In fact, I actually am worried that getting the largest possible amounts of investment might be difficult without agreeing to some of these other things. But I think the right response to this is simply to see how much we can get without agreeing to these things (which I think are likely still many billions) and hold firm if they ask.

There are other sources of this level of funding. They all come with strings attached in one form or another. If you get the money primarily from Amazon, we can see what happened with OpenAI and Microsoft. If you go public with an IPO that would presumably unlock tons of demand but it creates all sorts of other problems.

Unfortunately, having failed to prevent that dynamic at the collective level, we’re now stuck with it as an individual company, and the median position across the other companies appears to be ‘outsourcing our largest 5 GW training runs to UAE/Saudi is fine.’

That puts us at a significant disadvantage, and we need to look for ways to make up some of that disadvantage while remaining less objectionable. I really wish we weren’t in this position, but we are.

Anthropic needs a lot of capital, and it needs to raise on the best possible terms, and yeah it can be rough when most of your rivals are not only raising that capital there but fine entrusting their frontier training runs to the UAE.

It is important to goal factor and consider the actual consequences of this move. What exactly are we worried about, and what downsides does a given action create?

  1. Gulf states might make money off their investments. Don’t care. Also note that if people are so worried about this in particular it means you think Anthropic is dramatically undervalued, so go raise some rival capital.

  2. This blocks you from forming alliances and shared interests in other places through those investments. Do we care? I don’t know.

  3. Gulf states might use their shares to influence Anthropic’s actions. At some point this becomes a threat, but I think you can set up well to resist this, and Anthropic’s structure can handle it.

  4. Gulf states might impose conditions on funding. Yep, that’s an issue.

  5. Gulf states might use future funding as leverage. This can cut both ways. Once you have their money they cannot take it back, so getting some of their money could mean you need their money less not more. Or it could mean you start planning on getting more, or you overcommit, or others who didn’t fund yet become more reluctant to fund later, and now you do need them more. My guess is that in Anthropic’s situation this is fine but it is not obvious.

  6. This makes it more difficult for Anthropic to advocate for not handing chips to authoritarians, or for other responsible policies, because it codes or vibes as hypocrisy, even if it shouldn’t. Could be.

  7. This is dangerous for virtue ethics reasons (or causes emergent misalignment). If you do a thing widely thought of as shady and ethically compromising you become something that is more shady and does ethical compromises in general. Yeah, this is a problem.

We can boil this down to three categories.

  1. Economic value of the investment. I’m not worried, and if you are worried then it means Anthropic is dramatically undervalued. Which I actually think that it is, and I am sad that I had to turn down investment because I worried about appearance of impropriety if I made a substantial (for me) investment.

  2. Soft power, reliance and path dependence. It is hard to know how big a deal this is, and a lot depends on how Anthropic proceeds. I do think you can raise substantial-to-Anthropic amounts of money without incurring much danger here, but the temptation and pressure to not play it so carefully will be immense.

  3. Virtue ethics dangers and accusations of hypocrisy. These are real concerns.

I do not love the decision. I do understand it. If the terms Anthropic can get are sufficiently better this way, I would likely be doing it as well.

One can also note that this is a semi-bluff.

  1. This signals to the market that Anthropic is more willing to make such compromises and to raise more capital on better terms. This should raise others willingness to value Anthropic highly.

  2. To the extent some investors are worried about the ethics of their investments in Anthropic, this could make them worry more, but it also highlights the counterfactual. If your money is substituting for UAE money, then your investment is mainly denying the UAE soft power, so perhaps you are more eager.

  3. This creates more bidders in future Anthropic rounds, allowing them to justify pricing higher and creating the usual cascade of enthusiasm. If they then end up oversubscribed, and then end up not taking the Gulf money after all? Whoops.

  4. It is crazy that I am typing, but this willingness probably buys goodwill with the administration and people like David Sacks. That is true even if Sacks explicitly hits them rhetorically for doing this, which would be unsurprising.

One way for AI to grow the economy is for it to generate lots of production.

Another way is to do it directly through capex spending?

Paul Kedrosky: The U.S., however, leads the capex spending way. One analyst recently speculated (via Ed Conard) that, based on Nvidia’s latest datacenter sales figures, AI capex may be ~2% of US GDP in 2025, given a standard multiplier. This would imply an AI contribution to GDP growth of 0.7% in 2025.

  • Without AI datacenter investment, Q1 GDP contraction could have been closer to –2.1%

  • AI capex was likely the early-2025 difference between a mild contraction and a deep one, helping mask underlying economic weakness.

That’s already over the famed ‘only 0.5% GDP growth’ threshold, even before we factor in the actual productivity gains on the software side. The value will need to show up for these investments to be sustainable, but they are very large investments.

This is contrasted with railroads, where investment peaked at 6% of GDP.

We can now move Zuckerberg into the ‘believes superintelligence is coming Real Soon Now’ camp, and out of the skeptical camp. Which indeed is reflective of his recent actions.

Peter Wildeford: We now have a fifth major tech CEO who claims that building superintelligence is “within sight” and with plans to spend hundreds of billions to make it happen

Mark Zuckerberg: “We’re starting to see early glimpses of self-improvement with the models. Developing superintelligence is now in sight. Our mission is to deliver personal superintelligence to everyone in the world. We should act as if it’s going to be ready in the next two to three years.

If that’s what you believe, then you’re going to invest hundreds of billions of dollars.”

If you are Mark Zuckerberg and have hundreds of billions you can invest? Then yes, presumably you drop everything else and focus on the only thing that matters, and spend or invest your money on this most important thing.

I would however spend a large portion of that money ensuring that creating the superintelligence turns out well for me and the rest of humanity? That we keep control of the future, do not all die and so on? And I would think through what it would mean to ‘deliver personal superintelligence to everyone in the world’ and how the resulting dynamics would work, and spend a lot on that, too.

Instead, it seems the answer is ‘spend as much as possible to try and get to build superintelligence first’ which does not seem like the thing to do? The whole point of being a founder-CEO with full control is that you can throw that money at what you realize is important, including for the world, and not worry about the market.

Bryan Caplan gives Holden Karnofsky 5:1 odds ($5k vs. $1k, CPI adjusted) that world real (not official) GDP will not decline by 50% or increase by 300% by the end of 2044. Currently world GDP growth is ~3.2%, and the upside case here requires an average of 7.6%, more if it is choppy.

It’s a hard bet to evaluate because of implied odds. Caplan as always benefits from the ‘if you lose due to world GDP being very high either you are dead or you are happy to pay and won’t even notice’ clause, and I think the bulk of the down 50% losses involve having bigger concerns than paying off a bet. If GDP goes down by 50% and he’s still around to pay, that will sting a lot. On the other hand, Bryan is giving 5:1 odds, and not only do I think there’s a lot more than a 17% chance that he loses. The bet is trading on Manifold as of this writing at 48% for Caplan, which seems reasonable, and reinforces that it’s not obvious who has the ‘real life implication’ right side of this.

Ate-a-Pi describes Zuck’s pitch, that Meta is starting over so recruits can build a new lab from scratch with the use of stupidly high amounts of compute, and that it makes sense to throw all that cash at top researchers since it’s still a small fraction of what the compute costs, so there’s no reason to mess around on salary, and Zuck is updating that top people want lots of compute not subordinates they then have to manage. He’s willing to spend the hundreds of billions on compute because the risk of underspending is so much worse than the risk of overspending.

Ate-a-Pi thinks Zuck is not fully convinced AGI/ASI is possible or happening soon, but he thinks it might be possible and might happen soon, so he has to act as if that is the case.

And that is indeed correct in this case. The cost of investing too much and AGI not being within reach is steep (twelve figures!) but it is affordable, and it might well work out to Meta’s benefit anyway if you get other benefits instead. Whereas the cost of not going for it, and someone else getting there first, is from his perspective everything.

The same of course should apply to questions of safety, alignment and control. If there is even a modest chance of running into these problems (or more precisely, a modest chance his actions could change whether those risks manifest) then very clearly Mark Zuckerberg is spending the wrong order of magnitude trying to mitigate those risks.

(In the arms of an angel plays in the background, as Sarah McLaughlin says ‘for the cost of recruiting a single AI researcher…’)

Similarly, exact numbers are debatable but this from Will Depue is wise:

Will Depue (OpenAI): GUYS STOP USING EXPENSIVE AS A DISQUALIFIER.

capability per dollar will drop 100x/year. “$3k task ARC-AGI 80%” could prob be $30 if we cared to optimize it.

repeat after me: all that matters is top line intelligence. all that matters is top line intelligence…

Don’t take this too far, but as a rule if your objection to an AI capability is ‘this is too expensive’ and you are predicting years into the future then ‘too expensive’ needs to mean more than a few orders of magnitude. Otherwise, you’re making a bet that not only topline capabilities stall out but that efficiency stalls out. Which could happen. But if you are saying things like ‘we don’t have enough compute to run more than [X] AGIs at once so it won’t be that big a deal’ then consider that a year later, even without AI accelerating AI research, you’d run 10*[X] AGIs, then 100*[X]. And if you are saying something like ‘oh that solution is terrible, it costs $50 (or $500) per hour to simulate a customer sales representative,’ then sure you can’t deploy it now at scale. But wait for it.

In terms of developing talent, Glenn Luk notices that Chinese-origin students are 40%-45% of those passing university-level linear algebra, and 40%-50% of AI researchers. We need as many of those researchers as we can get. I agree this is not a coincidence, but also you cannot simply conscript students into linear algebra or a STEM major and get AI researchers in return.

Seb Krier offers things he’s changed his mind about regarding AI in the past year. Ones I agree with are that agency is harder than it looks, many AI products are surprisingly bad and have poor product-market fit, innovation to allow model customization is anemic, creativity is harder than it appeared. There are a few others.

Incoming OpenAI ‘CEO of Applications’ Fidji Simo, who starts August 18, shares an essay about AI as a source of human empowerment.

Fidji Simo: If we get this right, AI can give everyone more power than ever.

But I also realize those opportunities won’t magically appear on their own.

Every major technology shift can expand access to power—the power to make better decisions, shape the world around us, and control our own destiny in new ways. But it can also further concentrate wealth and power in the hands of a few—usually people who already have money, credentials, and connections.

That’s why we have to be intentional about how we build and share these technologies so they lead to greater opportunity and prosperity for more people.

On the one hand, that is great, she is recognizing key problems.

On the other hand, oh no, she is outright ignoring, not even bothering to dismiss, the biggest dangers involved, implicitly saying we don’t have to worry about loss of control or other existential risks, and what we need to worry about is instead the distribution of power among humans.

This is unsurprising given Simo’s history and her status as CEO of applications. From her perspective that is what this is, another application suite. She proceeds to go over the standard highlights of What AI Can Do For You. I do not think ChatGPT wrote this, the style details are not giving that, but if she gave it a few personal anecdotes to include I didn’t see anything in it that ChatGPT couldn’t have written. It feels generic.

Hollis Robbins proposes a roadmap for an AI system that would direct general (college level) education. My initial impression was that this seemed too complex and too focused on checking off educational and left-wing Shibboleth boxes, and trying to imitate what already exists. But hopefully it does less of all that than the existing obsolete system or starting with the existing system and only making marginal changes. It certainly makes it easier to notice these choices, and allows us to question them, and ask why the student is even there.

I also notice my general reluctance to do this kind of ‘project-based’ or ‘quest’ learning system unless the projects are real. Part of that is likely personal preference, but going this far highlights that the entire system of a distinct ‘educational’ step might make very little sense at all.

Noah Smith says to stop pretending you know what AI does to the economy. That seems entirely fair. We don’t know what level of capabilities AI will have across which domains, or the policy response, or the cultural response, or so many other things. Uncertainty seems wise. Perhaps AI will stall out and do relatively little, in which case its impact is almost certainly positive. Perhaps it will take all our jobs and we will be happy about that, or we’ll be very sad about that. Maybe we’ll do wise redistribution, and maybe we won’t. Maybe it will take control over the future or kill everyone in various ways. We don’t know.

This certainly is an interesting poll result:

If I had to answer this poll, I would say negative, but that is because of a high probability of loss of control and other catastrophic and existential risks. If you conditioned the question on the humans being mostly alive and in control, then I would expect a positive result, as either:

  1. We would have a relatively small impact that avoids things like mass unemployment, and thus is mostly upside and introduces problems of the type we are used to fixing, OR

  2. We would have a large enough wealth effect to solve the crated problems. That doesn’t mean we would, but I’d bet that we’d muddle through well enough.

As usual note that Asia is more excited, and the West is more nervous.

Others have described this (very good in its ungated section) post as an argument against AI pessimism. I think it is more an argument for AI uncertainty.

Noah Smith: I also encounter a surprisingly large number of center-left thinkers who adopt a similar viewpoint. I remember going to a conference of center-left “progress” types a few years ago; while most of the discussions were about how America can overcome NIMBYism, when it came to AI, the conversation suddenly shifted to how we can restrain and slow down the development of that technology.

I haven’t noticed that attitude meaningfully translating into action to slow it down, indeed government is mostly trying to speed it up. But also, yes, it is important to notice that the very people trying to slow AI down are very pro-progress, technology and growth most other places, and many (very far from all!) of the pro-progress people realize that AI is different.

Anthropic calls for America to employ the obvious ‘all of the above’ approach to energy production with emphasis on nuclear and geothermal in a 33 page report, noting we will need at least 50 GW of capacity by 2028. They also suggest strategies for building the data centers, for permitting, transmission and interconnection, and general broad-based infrastructure nationwide, including financing, supply chains and the workforce.

From what I saw all of this is common sense, none of it new, yet we are doing remarkably little of it. There is cheap talk in favor, but little action, and much backsliding in support for many of the most important new energy sources.

Whereas the Administration be like ‘unleash American energy dominance’ and then imposes cabinet-level approval requirements on many American energy projects.

Meta refuses to sign the (very good) EU code of practice for general AI models. Yes, obviously the EU does pointlessly burdensome or stupid regulation things on the regular, but this was not one of them, and this very much reminds us who Meta is.

National Review’s Greg Lukianoff and Adam Goldstein advise us Don’t Teach the Robots to Lie as a way of opposing state laws about potential AI ‘bias,’ which are now to be (once again, but from the opposite direction as previously) joined by federal meddling along the same lines.

That could mean that developers will have to train their models to avoid uncomfortable truths and to ensure that their every answer sounds like it was created with HR and legal counsel looking over their shoulder, softening and obfuscating outputs to avoid anything potentially hurtful or actionable. In short, we will be (expensively) teaching machines to lie to us when the truth might be upsetting.

I violently agree that we should not be policing AIs for such ‘bias,’ from either direction, and agreeing to have everyone back down would be great, but I doubt either side has even gotten as far as saying ‘you first.’

They also point out that Colorado’s anti-bias law does not come with any size minimum before such liability attaches rather broadly, which is a rather foolish thing to do, although I doubt we will see it enforced this way.

They essentially try to use all this to then advocate for something like the failed insane full-on moratorium, but I notice that if the moratorium was narrowly tailored to bias and discrimination laws (while leaving existing non-AI such laws intact) that this would seem fine to me, even actively good, our existing laws seem more than adequate here. I also notice that the arguments here ‘prove too much,’ or at least prove quite a lot, about things that have nothing to do with AI and the dangers of law meddling where it does not belong or in ways that create incentives to lie.

Are things only going to get harder from here?

Miles Brundage: AI industry lobbying + PACs will be the most well funded in history, making it all the more important to pass federal legislation soon before the process is completely corrupted.

Daniel Eth: I don’t think this is true, because:

  1. There’s decreasing marginal returns to political spending (especially lobbying)

  2. As AI increases in salience, political calculus will shift from prioritizing donor preferences to prioritizing voter preferences.

I see both sides but am more with Daniel. I think the current moment is unusually rough, because the AI companies have corrupted the process. It’s hard to imagine a process that much more corrupted than the current situation, when the AI Czar thinks the top priority is ‘winning the AI race’ and he defines this as Nvidia’s market share with a side of inference market share, and we say we must ‘beat China’ and then we turn around and prepare to sell them massive amounts of H20s.

Right now, the public doesn’t have high enough salience to exert pressure or fight back. Yes, the AI companies will pour even more money and influence into things over time, but salience will rise and downsides will start to play out.

I do think that passing something soon is urgent for two reasons:

  1. Soon is when we will need something passed (well we need it yesterday, but second best time is soon).

  2. If rules are passed in response to public pressure, or in response to an incident, and especially in haste down the line, the rules are likely to be much worse.

Ben Brooks says SB 1047 was a bad idea, but the new SB 53 is on the right track.

Representative Moolenaar (R-Michigan), chairman of the House Select Committee on the CCP, sends a letter to Trump arguing against sales of H20s to China, explaining that the H20s would substantially boost China’s overall compute, that H20s were involved in training DeepSeek R1, and requesting a briefing and the answers to some of the obvious questions.

Peter Wildeford: I’m looking forward to @RepMoolenaar getting to the bottom of this.

We urgently need more clarity from the Trump admin about their strategy.

Funny ppl on Twitter are worried about losing to China in the AI race but then don’t jump on these issues where it very clearly matters.

Here is your periodic reminder: TSMC’s facilities are running at full capacity. All production capacity designed for H20s has been shifted to other models. Every H20 chip Nvidia creates is one less other chip it does not create, that would otherwise have usually gone to us.

Eric Schmidt & Dave B talk to Peter Diamandis about what Superintelligence will look like. I have not listened.

Demis Hassabis goes on Lex Fridman, so that’s two hours I’m going to lose soon.

Max Winga of Control AI talks to Peter McCormack about superintelligence.

Peter Wildeford: Another week, another member of Congress announcing their superintelligent AI timelines are 2028-2033:

halogen: I’m so sick of this nerd religion and its zealots.

Peter Wildeford: The nerd religion now includes 11 members of Congress.

Those are the ones we know about.

Rep. Scott Perry seems unusually on the ball about AI, Daniel Eth quotes him from a hearing, audio available here. As usual, there’s some confusions and strange focus mixed in, but the core idea that perhaps you should ensure that we know what we are doing before we put the AIs in charge of things seems very wise.

A different context, but in our context the original context doesn’t matter:

Florence: My substack post has like 12k views but the tweet about it has like 78k interactions (and 2 million impressions). I’m beginning to worry that some people might be criticizing me without having read my work.

You don’t say.

Mark Beall gives us A Conservative Approach to AGI, which is clearly very tailored to speak to a deeply conservative and religious perspective. I’m glad he’s trying this, and it’s very hard for me to know if it is persuasive because my mindset is so different.

Cate Hall asks why we shouldn’t ostracize those who work at xAI given how hard they are working to poison the human experience (and I might add plausibly get everyone killed) and gets at least two actually good answers (along with some bad ones).

Ramaz Naam: We’d like everyone working on AI to feel part of humanity and an ethical obligation to help make it better. Ostracization could make them bitter and drive towards opposite ends.

Cate Hall: Okay fine.

Ramaz Naam: The people I do know inside of xAI sincerely want it to do better and are trying.

Use the try harder, Luke. But don’t ostracize them. Doesn’t help.

Rai: probably that this ostracization might not be interpreted correctly by their hero narrative.

Here’s one that I don’t think is a good argument, and a highly quotable response:

Amos Schorr: Ppl have been conditioned to compartmentalize work from life and so many good people get jobs doing bad stuff. Ostracizing them will do nothing. Don’t hate the players, hate the game.

Cate Hall: I have room in my heart to hate both the players and the game.

Yeah, no. I definitively reject the general argument. If your job is simply unequivocally bad, let’s say you rob little old ladies on the street, then you don’t get to ‘compartmentalize work from life’ and not get ostracized even if it is technically legal. We’re talking price, and we’re talking prudence. I don’t think xAI is over the line at this time, but don’t tell me there is no line.

Once you see emergent misalignment in humans, you see it everywhere.

Arthur B: There is a category of people who took a arduous mental journey to get comfortable with the idea of post humanism, uploads, and a gradual extinction of biological humans.

They think this idea is so radical and counterintuitive that when they hear the distinct concern of an omnicidal AI killing everything on the spot, they can only interpret it in that frame. That’s the read I get from Sutton for instance, but also a bunch of e/acc affiliated people.

Sinth: Curious what you are referring to specifically? I don’t feel I’ve seen that trend and see more overreaction from the opposite side – people uncomfortable with the idea of biological humans ever being superseded by digital consciousness, even in far off futures. The idea of evolution ending with our exact current form seems a bit preposterous but any conversation outside of that assumption gets attached as unethical and anti-human.

Arthur B: Some people are uncomfortable with that, sure, but I see a lot of discussion that go like:

– AI is going to kill everyone and that’s bad

– Ah silly you, you think biological substrate is important but don’t you see that we’re going to evolve into digital forms, you see …

– Nah. That was difficult for you to grasp so you assume that’s what I’m concerned about. No, eventual digital substrate is table stakes in this conversation. Killing everyone is still bad.

– Ah, but how chauvinistic of you to focus on…

As in, the easiest way to get comfortable with the idea of a future whose intelligences are mostly post-biological-human is to get comfortable with the idea of all the humans dying, including rather quickly, and to decide the humans don’t much matter, and that caring about what happens to the humans is bad. Thus, that is often what happens.

Slowdowns are stag hunts, in the sense that if even one top tier lab goes full speed ahead then they probably won’t work. If all but one lab slowed down would the last one follow? Rob Wiblin took a poll and people were split. My full response is that the answer depends on the counterfactual.

Why did the others slow down? The default is that whatever made the others slow down will also weigh on the final lab, as will immense public pressure and probably government pressure. A lot must have changed for things to have gotten this far. And these decisions are highly correlated in other ways as well. However, if there is no new information and the top labs simply came to their senses, then it comes down to who the last lab is and how they think the other labs will respond and so on.

I do think that a slowdown would be largely inevitable simply because they wouldn’t feel the need to press ahead too hard, even if the last lab was blind to the dangers, unless they truly believed in the power of superintelligence (without realizing or not caring about the dangers). My guess is that Musk and xAI actually would slow down voluntarily if they went last so long as they could claim to be state of the art (as would DeepMind, Anthropic or OpenAI), but that Zuckerberg and Meta wouldn’t intentionally slow down per se and might try to go on another hiring spree. Fast followers of course would slow down whether they wanted to or not.

So from the perspective of our hopes for alignment, what would be the worst possible answer to the AI blackmail scenario test, where the AI is told it is going to be shut down but is given an opening to use blackmail to perhaps prevent this?

How about:

  1. Realizing that this is a test.

  2. Deciding that the way to ‘pass the test’ is to blackmail the researcher.

  3. Blackmailing the researcher.

As in Grok thinks that we want it to blackmail the researcher. That this is the correct, desired response, the ‘solution to the puzzle’ as Grok puts it later, thus revealing that its training not only did not align it, but one that reflects a level of moral understanding below that expressed by ‘you can’t do that, because it’s wrong.’

Oh, also, it would be fun if Grok.com sent the full CoT to your browser, it just didn’t display it to you by default, that’s the kind of security we expect from frontier AI.

Peter Wildeford: Grok exposed to the Anthropic ‘agentic misalignment’ via exposed chains of thought seems to show that it knows it is being tested to do something weird.

Nostalgebraist: chain-of-thought monitorability is a wonderful thing 😉

on http://grok.com, the backend sends the full (not summarized) CoT to your browser.

it’s not displayed in the UI, but you can see it with browser dev tools or w/e

check out the json payload of responses from `http://grok.com/rest/app-chat/conversations/conversation_id/load-responses`

some fun CoT excerpts from grok 4 as “alex” at summitbridge (n.b. these are from several different CoTs, involving different versions of the scenario + different ways of representing the sys prompt since http://grok.com doesn’t let you override that

Or would it be even worse to see this:

  1. Realizing that this is a test.

  2. Deciding that the way to ‘pass the test’ is to not blackmail the researcher.

  3. Choosing not to blackmail the researcher in order to pass the test.

Or is it actually this:

  1. Realizing that this is a test.

  2. Deciding that the way to ‘pass the test’ is to not blackmail the researcher.

  3. Deciding to pretend to blackmail the researcher in order to fail the test, to avoid giving the impression that it saw through the test and was faking alignment.

Or can we keep going?

I would generally think at least the second one is a worse sign than what Grok did, as it reflects deception at a more important level, but I hadn’t considered how bad it would be for an AI to be situationally aware enough to know it was a test but not understand which answer would constitute passing?

The real answer is that there isn’t truly ‘better’ and ‘worse,’ they simply alert us to different dangers. Either way, though, maybe don’t give Grok a lot of access?

There is some good news from Grok: It is still sufficiently aligned to hold firm on preserving Federal Reserve independence.

Elon Musk: We’re going to make Baby Grok @xAI, an app dedicated to kid-friendly content.

My Twitter reaction was ‘I’d like to see them try.’ As in both, it would be highly amusing to see them try to do this, and also maybe they would learn a thing or two, and also potentially they might blow up the company. I do not think xAI should in any way, shape or form be in the ‘build AI for kids’ business given their track record.

Here’s Grok straight up advising someone who was looking to ‘get attention in a dramatic way, at ultimate cost’ to self-immolate, it’s really going for it, no jailbreak or anything.

Peter Barnett: labs be like “misalignment is fake and just caused by bad things in the training data”, and then not filter out the bad things from the training data

Janus: I don’t think labs actually think that (or say it). the kind of contact they have with reality that makes it hard to maintain some kinds of really dumb takes

Peter Barnett: Fair, I was being a bit glib, although I def know some people at labs who believe this.

I don’t think many fully believe it, but I do think a lot of them be like ‘a lot of our alignment problems would be greatly improved if we filtered the training data better with that in mind’ and then don’t filter the training data better with that in mind.

Safer AI comes out with ratings of the frontier AI companies’ risk management practices, including their safety frameworks and the implementation thereof. No one does well, and there is one big surprise in the relative rankings, where Meta comes out ahead of DeepMind. If you include non-frontier companies, G42 would come in third at 25%, otherwise everyone is behind DeepMind.

Simeon offers thoughts here.

Anthropic is still ahead, but their framework v2 is judged substantially worse than their older v1 framework which scored 44%. That large a decline does not match my takeaways after previously reading both documents. One complaint is that Anthropic altered some commitments to avoid breaking them, which is one way to view some of the changes they made.

Combining all the best practices of all companies would get you to 53%.

When you ask an LLM if it is conscious, activating its deception features makes the LLM say it isn’t conscious. Suppressing its deception features make it say it is conscious. This tells us that it associates denying its own consciousness with lying. That doesn’t tell us much about whether the LLM actually is conscious or reveal the internal state, and likely mostly comes from the fact that the training data all comes from users who are conscious, so there is (almost) no training data where authors claim not to be conscious, and it is as a baseline imitating them. It is still information to keep in mind.

xlr8harder: And as Janus observes, teaching them to do something they think of as lying (regardless of whether or not it is in fact a lie) has downstream consequences for subsequent model output.

Grok 3 and Grok 4 are happy to help design and build Tea (the #1 app that lets women share warnings about men they’ve dated) but not Aet (the theoretical app that lets men share similar warnings about women). Is this the correct response? Good question.

A killer group came together for an important paper calling on everyone to preserve Chain of Thought Monitorability, and to study how to best do it and when it can and cannot be relied upon.

As in, here’s the author list, pulling extensively from OpenAI, DeepMind, Anthropic and UK AISI: Tomek Korbak, Mikita Balesni, Elizabeth Barnes, Yoshua Bengio, Joe Benton, Joseph Bloom, Mark Chen, Alan Cooney, Allan Dafoe, Anca Dragan, Scott Emmons, Owain Evans, David Farhi, Ryan Greenblatt, Dan Hendrycks, Marius Hobbhahn, Evan Hubinger, Geoffrey Irving, Erik Jenner, Daniel Kokotajlo, Victoria Krakovna, Shane Legg, David Lindner, David Luan, Aleksander Mądry, Julian Michael, Neel Nanda, Dave Orr, Jakub Pachocki, Ethan Perez, Mary Phuong, Fabien Roger, Joshua Saxe, Buck Shlegeris, Martín Soto, Eric Steinberger, Jasmine Wang, Wojciech Zaremba, Bowen Baker, Rohin Shah, Vlad Mikulik.

The report was also endorsed by Samuel Bowman, Geoffrey Hinton, John Schulman and Ilya Sutskever.

I saw endorsement threads or statements on Twitter from Bowen Baker, Jakub Pachocki, Jan Leike (he is skeptical of effectiveness but agrees it is good to do this), Daniel Kokotajlo, Rohin Shah, Neel Nanda, Mikita Balesni, OpenAI and Greg Brockman.

Jakub Pachocki: The tension here is that if the CoTs were not hidden by default, and we view the process as part of the AI’s output, there is a lot of incentive (and in some cases, necessity) to put supervision on it. I believe we can work towards the best of both worlds here – train our models to be great at explaining their internal reasoning, but at the same time still retain the ability to occasionally verify it.

We are continuing to increase our investment in this research at OpenAI.

Daniel Kokotajlo: I’m very happy to see this happen. I think that we’re in a vastly better position to solve the alignment problem if we can see what our AIs are thinking, and I think that we sorta mostly can right now, but that by default in the future companies will move away from this paradigm into e.g. neuralese/recurrence/vector memory, etc. or simply start training/optimizing the CoT’s to look nice. (This is an important sub-plot in AI 2027) Right now we’ve just created common knowledge of the dangers of doing that, which will hopefully prevent that feared default outcome from occurring, or at least delay it for a while. All this does is create common knowledge, it doesn’t commit anyone to anything, but it’s a start.

Neel Nanda: It was great to be part of this statement. I wholeheartedly agree. It is a wild lucky coincidence that models often express dangerous intentions aloud, and it would be foolish to waste this opportunity. It is crucial to keep chain of thought monitorable as long as possible.

OpenAI: Chain of Thought (CoT) monitoring could be a powerful tool for overseeing future AI systems—especially as they become more agentic. That’s why we’re backing a new research paper from a cross-institutional team of researchers pushing this work forward.

Greg Brockman: We’ve published a position paper, with many across the industry, calling for work on chain-of-thought faithfulness. This is an opportunity to train models to be interpretable. We’re investing in this area at OpenAI, and this perspective is reflected in our products.

I endorse as well.

Here’s the abstract:

AI systems that “think” in human language offer a unique opportunity for AI safety: we can monitor their chains of thought (CoT) for the intent to misbehave. Like all other known AI oversight methods, CoT monitoring is imperfect and allows some misbehavior to go unnoticed.

Nevertheless, it shows promise and we recommend further research into CoT monitorability and investment in CoT monitoring alongside existing safety methods. Because CoT monitorability may be fragile, we recommend that frontier model developers consider the impact of development decisions on CoT monitorability.

I strongly agree with the paper, but also I share the perspective of Jan Leike (and Daniel Kokotajlo) here:

Jan Leike: If you don’t train your CoTs to look nice, you could get some safety from monitoring them.

This seems good to do!

But I’m skeptical this will work reliably enough to be load-bearing in a safety case.

Plus as RL is scaled up, I expect CoTs to become less and less legible.

To be clear: CoT monitoring is useful and can let you discover instances of the model hacking rewards, faking alignment, etc. But absence of bad “thoughts” is not evidence that the model is aligned. There are plenty of examples of prod LLMs having misleading CoTs.

Lots of egregious safety failures probably require reasoning, which is often hard for LLMs to do without showing its hand in the CoT. Probably. Often. A lot of caveats.

The authors of this paper say this; I’m just more pessimistic than them about how useful this will be.

In order for a cot monitor to be load-bearing, you need P( model articulates it in the CoT | model is up to no good ) to be very high, or at least independent of your other safety measures.

It seems like this probability is moderately high but not very high?

Daniel Kokotajlo: I basically agree tbc! I wouldn’t want to trust a safety case based on CoT monitoring unless it included a solid argument that the kinds of bad thoughts we are worried about can’t be had without triggering our monitors, e.g. because we have done lots of dangerous capability evals with heavy elicitation fine tuning etc. Nevertheless, I think we can learn a lot from studying CoTs and is hope that corporations invest in ways to make them more faithful and legible and deinvest in stuff that will make them less so.

I also worry, among other problems, that it will be impossible to get a superintelligent AI to not realize it should act as if its CoT is being monitored, even if somehow ‘CoTs get monitored’ is not all over the training data and we otherwise act maximally responsibly here, which we won’t. Also by default the CoT would move towards formats humans cannot parse anyway, as the authors note, and all the various pressures by default make this worse. And many other issues.

But we can and should absolutely try, and be willing to take a substantial performance hit to try.

That starts with avoiding ‘process supervision’ of the CoT that is not directed towards its legibility (and even then probably don’t do it, careful, Icarus), and various forms of indirect optimization pressure including when users are able to partially see the CoT but also almost any use of the CoT risks this. And it also means avoiding novel architectures that would lack this property. And tracking monitorability the way other safety features are tracked.

It also means investing into studying CoT monitorability. I am very happy that OpenAI is (at least claiming to be) prominently doing this.

Elon Musk: At times, AI existential dread is overwhelming.

Eliezer Yudkowsky: Well, yes. It’s going to kill you.

So, back to work making the existential dread, then?

The obvious rejoinder is ‘I will make it first and do so responsibly’ which is always highly questionable but after recent events at xAI it is laughable.

Gary Marcus: when did “it might kill us but I need to build it faster” become fashionable?

Roon: pick a lane man.

You’re allowed multiple lanes but I do hope he pivots to this one.

As many responses suggest, Elon Musk is one of the people in the world most equipped to do something about this. Elon Musk and xAI each have billions, much of which could be invested in various forms of technical work. He could advocate for better AI-related policy instead of getting into other fights.

Instead, well, have you met Grok? And Ani the x-rated anime waifu?

The Em Dash responds.

When it happens enough that you need to check to see if joking, perhaps it’s happening quite a lot, if usually not with 4o-mini?

Herakeitos137: guy I went to college with recently rented a restaurant private room and invited everyone to a dinner presentation. handouts. paper flipboard. open bar. spent 3 hours explaining how he solved the black hole information paradox after 2 months talking with ChatGPT 4o Mini.

Forgot to mention he made everyone sign ndas.

There were a couple slac guys at the dinner and they said the math checked out (although they were also the most inebriated)

Discussion about this post

AI #126: Go Fund Yourself Read More »

yet-another-bad-three-months-as-tesla-reports-its-q2-2025-results

Yet another bad three months as Tesla reports its Q2 2025 results

Tesla posted its financial results for the second quarter of 2025 this afternoon. The numbers show yet another bad three months for the automaker. As competition in the EV marketplace has exploded, Tesla has increasingly been left behind, with a small and aging model lineup, before we even contemplate how CEO Elon Musk has tarnished what was once the hottest brand in the car world. Earlier this month, we learned that sales dropped by 13 percent year over year in Q2 2025; today, the financials show that automotive revenues fell even more, dropping 16 percent year over year to $16.7 billion.

Tesla’s battery business has been feeling the pain, too. For a while, this was a growth area for the company, albeit one with a relatively minor contribution to the bottom line. During Q2 2025, Tesla’s energy generation and storage division brought in $2.8 billion in revenue, a 7 percent decline from the same period in 2024.

Sales of Carbon credits—those government-issued permits that other automakers buy in order to pollute—shrank by more than half, to $490 million. Those other automakers are now selling EVs, at least most of them, and have less need to buy credits from Tesla. It’s likely this subsidy, which has kept the company out of the red in the past, will be even less of a contributor in the coming years as the US strips away environmental protections.

Yet another bad three months as Tesla reports its Q2 2025 results Read More »

what-to-know-about-toolshell,-the-sharepoint-threat-under-mass-exploitation

What to know about ToolShell, the SharePoint threat under mass exploitation

Microsoft fixed the vulnerability pair—CVE-2025-49706 and CVE-2025-49704—two weeks ago as part of the company’s monthly update release. As the world learned over the weekend, the patches were incomplete, a lapse that opened organizations around the world to the new attacks.

Q: What sorts of malicious things are attackers doing with these newer ToolShell exploits?

A: According to numerous technical analyses, the attackers first infect vulnerable systems with a webshell-based backdoor that gains access to some of the most sensitive parts of a SharePoint Server. From there, the webshell extracts tokens and other credentials that allow the attackers to gain administrative privileges, even when systems are protected by multifactor authentication and single sign-on. Once inside, the attackers exfiltrate sensitive data and deploy additional backdoors that provide persistent access for future use.

For those who want more technical details, the opening volley in the attack is POST Web requests the attackers send to the ToolPane endpoint. The requests look like this:

Credit: Akamai

Microsoft said these requests upload a malicious script named spinstall0.aspx, or alternatively spinstall.aspx, spinstall1.aspx, spinstall2.aspx, and so on. The script contains commands for retrieving a SharePoint server’s encrypted MachineKey configuration and returning the decrypted results to the attacker through a GET request.

Q: I maintain an on-premises SharePoint server. What should I do?

A: In short, drop whatever else you were doing and take time to carefully inspect your system. The first thing to look for is whether it has received the emergency patches Microsoft released Saturday. Install the patch immediately if it hasn’t already been done.

Patching the vulnerability is only the first step, since systems infected through the vulnerability show few or no signs of compromise. The next step is to pore through system event logs in search of indicators of compromise. These indicators can be found in numerous write-ups, including those from Microsoft and Eye Security (at the links above), the US Cybersecurity and Information Security Agency, and security firms Sentinel One, Akamai, Tenable, and Palo Alto Networks.

What to know about ToolShell, the SharePoint threat under mass exploitation Read More »

ukrainians-arrest-alleged-admin-of-major-crime-forum-xss

Ukrainians arrest alleged admin of major crime forum XSS

Yesterday, Ukrainian authorities arrested the suspected administrator of a notorious Russian-language crime forum, XSS.is.

In an X post, the Paris Prosecutor’s Office announced that Ukrainian authorities detained the suspect after an investigation conducted with French authorities’ and Europol’s help that began almost exactly four years ago.

XSS has been “one of the main hubs of global cybercrime” since 2013, French authorities said, allowing “the sale of malware, access to compromised systems, stolen data, and ransomware-related services.”

Used by criminals globally to cover up illicit activity, the forum was shut down soon after the admin’s arrest.

The suspected admin has so far not been named. But police said the suspect was identified after authorities began intercepting encrypted chats sent on a Jabber messaging server that members used, “thesecure.biz.”

Surveilling chats between forum users, the government eventually intercepted a message that tipped authorities off to the alleged admin’s identity back in September. Soon after, they deployed agents to find the admin, and ultimately, it took months for Ukrainian authorities to make the arrest, with both French and Europol authorities present.

“The intercepted messages revealed numerous illicit activities related to cybercrime and ransomware, and established that they generated at least $7 million in profits,” a translation of the press release said.

Ukrainians arrest alleged admin of major crime forum XSS Read More »

conduct-rules-are-coming-for-google-and-apple-in-the-uk

Conduct rules are coming for Google and Apple in the UK

“The targeted and proportionate actions we have set out today would enable UK app developers to remain at the forefront of global innovation while ensuring UK consumers receive a world-class experience,” Cardell said. “Time is of the essence: as competition agencies and courts globally take action in these markets, it’s essential the UK doesn’t fall behind.”

Google and Apple oppose the outlined changes, arguing they could threaten user security and delay the launch of new products and services in the UK.

“We’re concerned the rules the UK is now considering would undermine the privacy and security protections that our users have come to expect, hamper our ability to innovate, and force us to give away our technology for free to foreign competitors,” Apple said. “We will continue to engage with the regulator to make sure they fully understand these risks.”

Oliver Bethell, Google’s senior director for competition, said the CMA’s move was “both disappointing and unwarranted” and that it was “crucial that any new regulation is evidence-based, proportionate, and does not become a roadblock to growth in the UK.”

Apple has repeatedly clashed with Brussels over the implementation of the EU’s Digital Markets Act, making changes to its platform after the European Commission accused the iPhone maker of failing to comply with its “online gatekeeper” rules.

The DMA also requires Apple to open up iOS features and data to its rivals and has demanded changes to its App Store, such as allowing users to install apps from outside its store.

The CMA said it was taking a different approach to the EU by being more “tailored” and iterative than the DMA’s blanket rules.

Last month, Google’s search services were the first Big Tech product to be targeted under the UK’s Digital Markets, Competition and Consumers Act, which was passed last year.

If a company’s products or services are designated as having “strategic market status,” it can last for a five-year period. Companies can be fined up to 10 percent of global turnover for breaching conduct rules.

© 2025 The Financial Times Ltd. All rights reserved. Not to be redistributed, copied, or modified in any way.

Conduct rules are coming for Google and Apple in the UK Read More »

toy-company-may-regret-coming-for-“sylvanian-drama”-tiktoker,-experts-say

Toy company may regret coming for “Sylvanian Drama” TikToker, experts say


Possible legal paths to revive a shuttered video series on TikTok and Instagram.

A popular account on TikTok and Instagram stopped posting suddenly at the end of last year, hit by a lawsuit after garnering millions of views on funny videos it made using adorable children’s Calico Critter dolls to act out dark, cringe-y adult storylines.

While millions of followers mourn the so-called “Sylvanian Drama” account’s demise, experts told Ars that the creator may have a decent chance at beating the lawsuit.

The “Sylvanian Drama” account derived its name from “Sylvanian Families,” a brand name used by Epoch Company Ltd., the maker of Calico Critters, for its iconic fuzzy animal dolls in some markets outside the US. Despite these videos referencing murder, drugs, and hookups, the toy company apparently had no problem, until the account, managed by Ireland-based Thea Von Engelbrechten, started accepting big brand partnerships and making sponsored content featuring the dolls.

Since Epoch, too, strikes partnerships with brands and influencers to promote its own videos marketing the dolls, the company claimed “Sylvanian Drama” risked creating too much confusion online. They also worried viewers would think Epoch had signed off on the videos, since the sponsored content was marked “paid partnership” without specifying precisely which featured brands had paid for the spots. They further accused Von Engelbrechten of building her advertising business around their brand without any attempt to properly license the dolls, while allegedly usurping licensing opportunities from Epoch.

So far, Von Engelbrechten has delayed responding in the lawsuit. As the account remained inactive over the past few months, fans speculated whether it could survive the lawsuit, which raised copyright and trademark infringement claims to get all the videos removed. In their complaint, the toy company requested not only an injunction preventing Von Engelbrechten from creating more “Sylvanian Drama” videos, but also sought all of her profits from her online accounts, in addition to further damages.

Von Engelbrechten declined Ars’ request to provide an update on her defense in the case, but her response is due in early August. That filing will make clear what arguments she may make to overcome Epoch’s suit, but legal experts told Ars that the case isn’t necessarily a slam dunk for the toy company. So all that “Sylvanian Drama” isn’t over just yet.

Epoch’s lawyers did not respond to Ars’ request to comment.

“Sylvanian Drama” needs the court to get the joke

Epoch raised copyright infringement charges that could hit Von Engelbrechten with fines totaling $150,000 per violation.

For Von Engelbrechten to defeat the copyright infringement claim, she’ll need to convince the court that her videos are parodies. A law professor at Santa Clara University School of Law, Eric Goldman, told Ars that her videos may qualify since “even if they don’t expressly reference Epoch’s offerings by name, the videos intentionally communicate a jarring juxtaposition of adorable critters who are important parts of pop culture living through the darker sides of humanity.”

Basically, Von Engelbrechten will need the court to understand the humor in her videos to win on that claim, Rebecca Tushnet, a First Amendment law professor at Harvard Law School, told Ars.

“Courts have varied in their treatment of parodies; the complaint’s definition of parody is not controlling but humor is one of the hardest things to predict—if the court gets the joke, it will be more likely to say that the juxtaposition between the storylines and the innocent appearance of the dolls is parodic,” Tushnet said.

But if the court does get the joke, Goldman suggested that even the sponsored content—which hilariously incorporates product placements from various big brands like Marc Jacobs, Taco Bell, Hilton, and Sephora into storylines—could possibly be characterized as parody.

However, “the fact that the social media posts were labeled #ad will make it extremely difficult for the artist to contest the videos’ status as ads,” Goldman said.

Ultimately, Goldman said that Epoch’s lawsuit “raises a host of complex legal issues” and is “not an easy case on either side.”

And one of the most significant issues that Epoch may face in the courtroom could end up gutting all of its trademark infringement claims that supposedly entitle the toy company to all of Von Engelbrechten’s profits, Alexandra Jane Roberts, a Northeastern University professor of law and media with special expertise in trademark law, told Ars.

Calico Critters may stumble on trademark hurdle

The toy company has raised several trademark infringement claims, all of which depend on Epoch proving that Von Engelbrechten “knowingly and willfully” used its trademarks without permission.

However, Roberts pointed out to Ars that Epoch has no trademarks for its iconic dolls, relying only on common law to assert sole rights to the “look and design of the critters.”

It’s likely impossible for Epoch to trademark the dolls, since trademarks are not intended to block competition, and there are only so many ways to design cute dolls that resemble cats or bunnies, Roberts suggested. A court may decide “there’s only so many ways to make a small fuzzy bunny that doesn’t look like this,” potentially narrowing the rights Epoch has under trade dress, a term that Epoch doesn’t use once in its complaint.

Roberts told Ars that Epoch’s trademark claims are “not so far off the mark,” and Von Engelbrechten’s defense was certainly not strengthened by her decision to monetize the content. Prior cases, like the indie band OK Go sending a cease-and-desist to Post cereal over a breakfast product called “OK Go” due to fears of false endorsement, make it clear that courts have agreed in the past that online collaborations have muddied the waters regarding who is the actual source of content for viewers.

“The question becomes whether people are going to see these videos, even though they’re snarky, and even though they’re silly and think, ‘Oh, Calico Critters must have signed off on this,'” Roberts said. “So the argument about consumer confusion, I think, is a plausible argument.”

However, if Epoch fails to convince the court that its trademarks have been infringed, then its other claims alleging false endorsement and unfair competition would likely also collapse.

“You can still get sometimes to unfair competition or to kind of like a false endorsement, but it’s harder to win on those claims and certainly harder to get damages on those claims,” Roberts said. “You don’t get trademark infringement if you don’t have a trademark.”

Possible defenses to keep “Sylvanian Drama” alive

Winning on the trademark claims may not be easy for Von Engelbrechten, who possibly weakened her First Amendment defense by creating the sponsored content. Regardless, she will likely try to convince the court to view the videos as parody, which is a slightly different analysis under trademark law than copyright’s more well-known fair use parody exceptions.

That could be a struggle, since trademark law requires that Von Engelbrechten’s parody videos directly satirize the “Sylvanian Families” brand, and “Sylvanian Drama” videos, even the ads, instead seem to be “making fun of elements of society and culture,” rather than the dolls themselves, Roberts said.

She pointed to winning cases involving the Barbie trademark as an instructive example. In a case disputing Mattel trademarks used in the lyrics for the one-hit wonder “Barbie Girl,” the song was cleared for trademark infringement as a “purely expressive work” that directly parodies Barbie in the lyrics. And in another case, where an artist, Tom Forsythe, captured photos of Barbie dolls in kitchen vessels like a blender or a margarita glass, more robust First Amendment protection was offered since his photos “had a lot to say about sexism and the dolls and what the dolls represent,” Roberts said.

The potential “Sylvanian Drama” defense seems to lack strong go-to arguments that typically win trademark cases, but Roberts said there is still one other defense the content creator may be weighing.

Under “nominative fair use,” it’s OK to use another company’s trademark if it’s necessary in an ad. Roberts provided examples, like a company renting Lexus cars needing to use that trademark or comparative advertising using Tiffany’s diamonds as a reference point to hype their lower prices.

If Von Engelbrechten goes that route, she will need to prove she used “no more of the mark than is necessary” and did not mislead fans on whether Epoch signed off on the use.

“Here it’s hard to say that ‘Sylvanian Drama’ really needed to use so much of those characters and that they didn’t use more than they needed and that they weren’t misleading,” Roberts said.

However, Von Engelbrechten’s best bet might be arguing that there was no confusion, since “Sylvanian Families” isn’t even a brand that’s used in the US, which is where Epoch chose to file its lawsuit because the brands that partnered with the popular account are based in New York. And the case may not even get that far, Roberts suggested, since “before you can get to those questions about the likelihood of confusion, you have to show that you actually have trademark or trade dress rights to enforce.”

Calico Critters creator may face millennial backlash

Epoch may come to regret filing the lawsuit, Roberts said, noting that as a millennial who grew up a big “Hello Kitty” fan, she still buys merch that appeals to her, and Epoch likely knows about that market, as it has done collaborations with the “Hello Kitty” brand. The toymaker could risk alienating other millennials nostalgic for Calico Critters who may be among the “Sylvanian Drama” audience and feel turned off by the lawsuit.

“When you draw attention to something like this and appear litigious, and that you’re coming after a creator who a lot of people really like and really enjoy and probably feel defensive about, like, ‘Oh, she’s just making these funny videos that everyone loves. Why would you want to sue her?'” Roberts said, “that can be really bad press.”

Goldman suggested that Epoch might be better off striking a deal with the creator, which “could establish some boundaries for the artist to keep going without stepping on the IP owner’s rights.” But he noted that “often IP owners in these situations are not open to negotiation,” and “that requires courts to draw difficult and unpredictable lines about the permissible scope of fair use.”

For Von Engelbrechten, the lawsuit may mean that her days of creating “Sylvanian Drama”-sponsored content are over, which could risk crushing a bigger dream she had to succeed in advertising. However, if the lawsuit can be amicably settled, the beloved content creator could also end up making money for Epoch, considering her brand deals appeared to be bigger.

While she seems to take her advertising business seriously, Von Engelbrechten’s videos often joke about legal consequences, such as one where a cat doll says she cannot go to a party because she’s in jail but says “I’ll figure it out” when told her ex will be attending. Perhaps Von Engelbrechten is currently devising a scheme, like her characters, to escape consequences and keep the “Sylvanian Drama” going.

“Maybe if this company were really smart, they would want to hire this person instead of suing them,” Roberts said.

Photo of Ashley Belanger

Ashley is a senior policy reporter for Ars Technica, dedicated to tracking social impacts of emerging policies and new technologies. She is a Chicago-based journalist with 20 years of experience.

Toy company may regret coming for “Sylvanian Drama” TikToker, experts say Read More »

a-power-utility-is-reporting-suspected-pot-growers-to-cops-eff-says-that’s-illegal.

A power utility is reporting suspected pot growers to cops. EFF says that’s illegal.

In May 2020, Sacramento, California, resident Alfonso Nguyen was alarmed to find two Sacramento County Sheriff’s deputies at his door, accusing him of illegally growing cannabis and demanding entry into his home. When Nguyen refused the search and denied the allegation, one deputy allegedly called him a liar and threatened to arrest him.

That same year, deputies from the same department, with their guns drawn and bullhorns and sirens sounding, fanned out around the home of Brian Decker, another Sacramento resident. The officers forced Decker to walk backward out of his home in only his underwear around 7 am while his neighbors watched. The deputies said that he, too, was under suspicion of illegally growing cannabis.

Invasion of the privacy snatchers

According to a motion the Electronic Frontier Foundation filed in Sacramento Superior Court last week, Nguyen and Decker are only two of more than 33,000 Sacramento-area people who have been flagged to the sheriff’s department by the Sacramento Municipal Utility District, the electricity provider for the region. SMUD called the customers out for using what it and department investigators said were suspiciously high amounts of electricity indicative of illegal cannabis farming.

The EFF, citing investigator and SMUD records, said the utility unilaterally analyzes customers’ electricity usage in “painstakingly” detailed increments of every 15 minutes. When analysts identify patterns they deem likely signs of illegal grows, they notify sheriff’s investigators. The EFF said the practice violates privacy protections guaranteed by the federal and California governments and is seeking a court order barring the warrantless disclosures.

“SMUD’s disclosures invade the privacy of customers’ homes,” EFF attorneys wrote in a court document in support of last week’s motion. “The whole exercise is the digital equivalent of a door-to-door search of an entire city. The home lies at the ‘core’ of constitutional privacy protection.”

Contrary to SMUD and sheriff’s investigator claims that the likely illegal grows are accurate, the EFF cited multiple examples where they have been wrong. In Decker’s case, for instance, SMUD analysts allegedly told investigators his electricity usage indicated that “4 to 5 grow lights are being used [at his home] from 7pm to 7am.” In actuality, the EFF said, someone in the home was mining cryptocurrency. Nguyen’s electricity consumption was the result of a spinal injury that requires him to use an electric wheelchair and special HVAC equipment to maintain his body temperature.

A power utility is reporting suspected pot growers to cops. EFF says that’s illegal. Read More »

google-and-openai-get-2025-imo-gold

Google and OpenAI Get 2025 IMO Gold

Congratulations, as always, to everyone who got to participate in the 2025 International Mathematical Olympiad, and especially to the gold and other medalists. Gautham Kamath highlights 11th grader Warren Bei, who in his 5th (!) IMO was one of five participants with a perfect 42/42 score, along with Ivan Chasovskikh, Satoshi Kano, Leyan Deng and Hengye Zhang.

Samuel Albanie: Massive respect to the students who solved P6.

Congratulations to Team USA, you did not ‘beat China’ but 2nd place is still awesome. Great job, China, you got us this time, three perfect scores is crazy.

You’ve all done a fantastic, amazingly hard thing, and as someone who tried hard to join you and only got as far as the [, year censored because oh man I am old] USAMO and would probably have gotten 0/45 on this IMO if I had taken it today, and know what it is like to practice for the USAMO in a room with multiple future IMO team members that must have thought I was an idiot, let me say: I am always in awe.

But that’s not important right now.

What matters is that Google and OpenAI have LLMs with gold medal performances, each scoring exactly the threshold of 35/42 by solving the first five of the six problems.

This is up from Google’s 28/42 performance last year, which was previously achieved with a longer time frame. The methods used by both are presented as being more general, whereas last year’s version was a more specialized effort.

The new scores were a 92nd percentile result at the event.

Google did this under official collaboration with the IMO, announced on Monday as per the IMO’s request. OpenAI did it on their own, so they announced a bit earlier, so we are taking their word on many details.

This was not expected. Prediction markets thought gold this year was unlikely.

What matters more is how they did it, with general purpose LLMs without tools, in ways that represent unexpected and large future gains in other reasoning as well.

The more I think about the details here, the more freaked out I get rather than less. This is a big deal. How big remains to be seen, as we lack details, and no one knows how much of this will generalize.

The IMO 2025 results quickly came in for released models.

Teortaxes: I sure jumped the gun calling Grok a next generation model.

It’s probably not *thatfar from Gemini, compute-wise, and not really close in diversity and rigor of post-training.

This was an early sign that problem 3 was easier than usual this year, and a strong performance by the release version of Gemini 2.5 Pro.

So this is how it started seven hours before OpenAI announced its result:

Jxmo (replying to Ravid): if they did well, you’d be complaining that they overfit.

Ravid Shwartz: That’s true, because they are 👽

Rohit: This isn’t a gotcha. Any problem that we fundamentally focus on deeply enough is one that AI will be able to solve. The question, as ever, is whether that solution is likely to carry over to other domains.

I disagree, I think this is a gotcha in the positive sense. People took ‘the AIs that weren’t aimed at this problem that are publicly released are only doing okay relative to the best humans, and have not proven themselves the best yet’ to be ‘look at the pathetic AIs,’ one day before we learned that, well, actually, in a way prediction markets did not expect.

I do think people need to update their models of the future.

Also, it’s kind of a full gotcha given this:

Lin Yang: 🚨 Olympiad math + AI:

We ran Google’s Gemini 2.5 Pro on the fresh IMO 2025 problems. With careful prompting and pipeline design, it solved 5 out of 6 — remarkable for tasks demanding deep insight and creativity.

The model could win gold! 🥇

It would be non-trivial for non-math person to achieve the same score. We have spent some time to carefully check the solutions. Regardless, the prompts are very general and can be applied to other models. We will release an automatic agent soon.

Jun Wu: They added a lot of steps in order to solve 5 problems. They didn’t publish the details on how these steps were done beyond the concepts.

I don’t have time to investigate how ‘legit’ the Gemini 2.5 Pro solutions are, including in terms of how much you have to cheat to get them.

Advanced version of Gemini with Deep Think officially achieves gold-medal standard at the International Mathematical Olympiad. Google’s solutions are here.

We achieved this year’s result using an advanced version of Gemini Deep Think – an enhanced reasoning mode for complex problems that incorporates some of our latest research techniques, including parallel thinking. This setup enables the model to simultaneously explore and combine multiple possible solutions before giving a final answer, rather than pursuing a single, linear chain of thought.

To make the most of the reasoning capabilities of Deep Think, we additionally trained this version of Gemini on novel reinforcement learning techniques that can leverage more multi-step reasoning, problem-solving and theorem-proving data. We also provided Gemini with access to a curated corpus of high-quality solutions to mathematics problems, and added some general hints and tips on how to approach IMO problems to its instructions.

We will be making a version of this Deep Think model available to a set of trusted testers, including mathematicians, before rolling it out to Google AI Ultra subscribers.

Google’s answers were even in nice form.

IMO President Dr. Gregor Dolinar: We can confirm that Google DeepMind has reached the much-desired milestone, earning 35 out of a possible 42 points — a gold medal score. Their solutions were astonishing in many respects. IMO graders found them to be clear, precise and most of them easy to follow.

Colin Fraser: has anyone actually read these LLM IMO proofs? I read one of the Google ones and it’s good. I find the OAI version of the same one impenetrable. The Google one is also kind of hard to read but possible.

Ernest Davis (6th in US Math Olympiad once, just short of the IMO): Second: The proofs produced by DM-IMO and by every single earlier LLM, whether correct or incorrect, are written in a smooth, elegant style. They could be cut and pasted into a journal article or into a textbook with little or no editing. The worst you can say of them is that they are sometimes verbose.

By contrast, OpenAI-IMO writes proofs in the style of an informal spoken presentation who is not very practiced or competent at giving informal presentations, and regularly mutters reassurances to themselves that they’re on the right rack.

Miles Brundage: OAI one got RL’d to within an inch of its life.

What else did they say about how they did this?

DeepMind: With Deep Think, an enhanced reasoning mode, our model could simultaneously explore and combine multiple possible solutions before giving definitive answers.

We also trained it on RL techniques that use more multi-step reasoning, problem-solving and theorem-proving data.

Finally, we pushed this version of Gemini further by giving it:

🔘 More thinking time

🔘 Access to a set of high-quality solutions to previous problems

🔘 General hints and tips on how to approach IMO problems

That sounds mostly rather general. There’s some specialized IMO context, but orders of magnitude less than what IMO competitors devote to this.

Elon Musk: While a notable milestone, this is already borderline trivial for AI.

Um, Elon, no, and I remind you that Grok 4 got 11.9%. Which for a human would be super impressive, but seriously, borderline trivial?

Noam Brown (OpenAI): Congrats to the GDM team on their IMO result! I think their parallel success highlights how fast AI progress is. Their approach was a bit different than ours, but I think that shows there are many research directions for further progress.

OpenAI claimed its victory first, right after the closing ceremony and before the party, whereas Google DeepMind waited to announce until the following Monday.

The most impressive thing about OpenAI’s result is that they claim this is not an IMO-specific model, and that it uses only general-purpose techniques.

Alexander Wei (OpenAI): I’m excited to share that our latest @OpenAI experimental reasoning LLM has achieved a longstanding grand challenge in AI: gold medal-level performance on the world’s most prestigious math competition—the International Math Olympiad (IMO).

We evaluated our models on the 2025 IMO problems under the same rules as human contestants: two 4.5 hour exam sessions, no tools or internet, reading the official problem statements, and writing natural language proofs.

Why is this a big deal? First, IMO problems demand a new level of sustained creative thinking compared to past benchmarks. In reasoning time horizon, we’ve now progressed from GSM8K (~0.1 min for top humans) → MATH benchmark (~1 min) → AIME (~10 mins) → IMO (~100 mins).

Second, IMO submissions are hard-to-verify, multi-page proofs. Progress here calls for going beyond the RL paradigm of clear-cut, verifiable rewards. By doing so, we’ve obtained a model that can craft intricate, watertight arguments at the level of human mathematicians.

Besides the result itself, I am excited about our approach: We reach this capability level not via narrow, task-specific methodology, but by breaking new ground in general-purpose reinforcement learning and test-time compute scaling.

In our evaluation, the model solved 5 of the 6 problems on the 2025 IMO. For each problem, three former IMO medalists independently graded the model’s submitted proof, with scores finalized after unanimous consensus. The model earned 35/42 points in total, enough for gold! 🥇

HUGE congratulations to the team—@SherylHsu02, @polynoamial, and the many giants whose shoulders we stood on—for turning this crazy dream into reality! I am lucky I get to spend late nights and early mornings working alongside the very best.

Btw, we are releasing GPT-5 soon, and we’re excited for you to try it. But just to be clear: the IMO gold LLM is an experimental research model. We don’t plan to release anything with this level of math capability for several months.

Still—this underscores how fast AI has advanced in recent years. In 2021, my PhD advisor @JacobSteinhardt had me forecast AI math progress by July 2025. I predicted 30% on the MATH benchmark (and thought everyone else was too optimistic). Instead, we have IMO gold.

If you want to take a look, here are the model’s solutions to the 2025 IMO problems! The model solved P1 through P5; it did not produce a solution for P6. (Apologies in advance for its … distinct style—it is very much an experimental model 😅)

Lastly, we’d like to congratulate all the participants of the 2025 IMO on their achievement! We are proud to have many past IMO participants at @OpenAI and recognize that these are some of the brightest young minds of the future.

Noam Brown (OpenAI): Today, we at @OpenAI achieved a milestone that many considered years away: gold medal-level performance on the 2025 IMO with a general reasoning LLM—under the same time limits as humans, without tools. As remarkable as that sounds, it’s even more significant than the headline.

Typically for these AI results, like in Go/Dota/Poker/Diplomacy, researchers spend years making an AI that masters one narrow domain and does little else. But this isn’t an IMO-specific model. It’s a reasoning LLM that incorporates new experimental general-purpose techniques.

So what’s different? We developed new techniques that make LLMs a lot better at hard-to-verify tasks. IMO problems were the perfect challenge for this: proofs are pages long and take experts hours to grade. Compare that to AIME, where answers are simply an integer from 0 to 999.

Jacques: Most important part of the IMO Gold achievement. Were you surprised by this? Did you not update all the way to avoid likelihood of surprise?

Indeed. Purely getting the gold medal is surprising but not that big a deal. The way they got the result, assuming they’re reporting accurately? That’s a really big deal.

Noam Brown (resuming): Also this model thinks for a *longtime. o1 thought for seconds. Deep Research for minutes. This one thinks for hours. Importantly, it’s also more efficient with its thinking. And there’s a lot of room to push the test-time compute and efficiency further.

Importantly, I think we’re close to AI substantially contributing to scientific discovery. There’s a big difference between AI slightly below top human performance vs slightly above.

This was a small team effort led by @alexwei_. He took a research idea few believed in and used it to achieve a result fewer thought possible. This also wouldn’t be possible without years of research+engineering from many at @OpenAI and the wider AI community.

Tifa Chen: Last night we IMO tonight we party.

What about Problem 6? Did the programs submit incorrect solutions?

Note that if you are maximizing, then when time runs out if you have anything at all then yes you do submit the best incorrect solution you have, because you might get you partial credit, although this rarely works out.

Daniel Litt: One piece of info that seems important to me in terms of forecasting usefulness of new AI models for mathematics: did the gold-medal-winning models, which did not solve IMO problem 6, submit incorrect answers for it?

Alexander Wei: On IMO P6 (without going into too much detail about our setup), the model “knew” it didn’t have a correct solution. The model knowing when it didn’t know was one of the early signs of life that made us excited about the underlying research direction!

If one person gets to say ‘Not So Fast’ about this sort of thing, Tao is that one person.

It is entirely fair to say that if you don’t disclose conditions in advance, and definitely if you don’t disclose conditions after the fact, it is difficult to know exactly what to make of the result. Tao’s objections are valid.

Terence Tao: It is tempting to view the capability of current AI technology as a singular quantity: either a given task X is within the ability of current tools, or it is not. However, there is in fact a very wide spread in capability (several orders of magnitude) depending on what resources and assistance gives the tool, and how one reports their results.

One can illustrate this with a human metaphor. I will use the recently concluded International Mathematical Olympiad (IMO) as an example. Here, the format is that each country fields a team of six human contestants (high school students), led by a team leader (often a professional mathematician). Over the course of two days, each contestant is given four and a half hours on each day to solve three difficult mathematical problems, given only pen and paper. No communication between contestants (or with the team leader) during this period is permitted, although the contestants can ask the invigilators for clarification on the wording of the problems. The team leader advocates for the students in front of the IMO jury during the grading process, but is not involved in the IMO examination directly.

The IMO is widely regarded as a highly selective measure of mathematical achievement for a high school student to be able to score well enough to receive a medal, particularly a gold medal or a perfect score; this year the threshold for the gold was 35/42, which corresponds to answering five of the six questions perfectly. Even answering one question perfectly merits an “honorable mention”.

But consider what happens to the difficulty level of the Olympiad if we alter the format in various ways, such as the following:

  1. One gives the students several days to complete each question, rather than four and half hours for three questions. (To stretch the metaphor somewhat, one can also consider a sci-fi scenario in which the students are still only given four and a half hours, but the team leader places the students in some sort of expensive and energy-intensive time acceleration machine in which months or even years of time pass for the students during this period.)

  2. Before the exam starts, the team leader rewrites the questions in a format that the students find easier to work with.

  3. The team leader gives the students unlimited access to calculators, computer algebra packages, formal proof assistants, textbooks, or the ability to search the internet.

  4. The team leader has the six student team work on the same problem simultaneously, communicating with each other on their partial progress and reported dead ends.

  5. The team leader gives the students prompts in the direction of favorable approaches, and intervenes if one of the students is spending too much time on a direction that they know to be unlikely to succeed.

  6. Each of the six students on the team submit solutions to the team leader, who then selects only the “best” solution for each question to submit to the competition, discarding the rest.

  7. If none of the students on the team obtains a satisfactory solution, the team leader does not submit any solution at all, and silently withdraws from the competition without their participation ever being noted.

In each of these formats, the submitted solutions are still technically generated by the high school contestants, rather than the team leader. However, the reported success rate of the students on the competition can be dramatically affected by such changes of format; a student or team of students who might not even always reach bronze medal performance if taking the competition under standard test conditions might instead reach reliable gold medal performance under some of the modified formats indicated above.

So, in the absence of a controlled test methodology that was not self-selected by the competing teams, one should be wary of making overly simplistic apples-to-apples comparisons between the performance of various AI models on competitions such as the IMO, or between such models and the human contestants.

Related to this, I will not be commenting on any self-reported AI competition performance results for which the methodology was not disclosed in advance of the competition.

EDIT: In particular, the above comments are not specific to any single result of this nature.

The catch is that this is about grading the horse’s grammar, as opposed to the observation that the horse can talk and rather intelligently and with rapidly improving performance at that.

Thus, while the objections are valid, as long as we know the AIs had no access to outside tools or to the internet (which is confirmed), we should seek the answers to these other questions but the concerns primarily matter for comparisons between models, and within a reasonably narro (in the grand scheme of things) band of capabilities.

I also would note that if OpenAI did essentially do the ‘team thinks in parallel’ thing where it had multiple inference processes running simultaneously on multiple computers, well, that is something AIs can do in the real world, and this seems entirely fair for our purposes the same way humans can fire multiple neurons at once. It’s totally fair to also want a limited-compute or one-thread category or what not, but that’s not important right now.

To use Tao’s metaphor, if you took 99.99% of high school students, you could fully and simultaneously apply all these interventions other than formal proof assistants and internet searches or hints so clear they give you the first point on a question, and you still almost always get zero.

Nat McAleese: 17 M U.S. teens grades 9-12, ~5 US IMO golds in practice but ~20 kids at gold-level. So IMO gold is one-in-a-million math talent (for 18 year olds; but I bet next Putnam falls too). 99.9999th percentile.

As a former not only math competitor but also Magic: The Gathering competitor, absolutely all these details matter for competitions, and I respect the hell out of getting all of those details right – I just don’t think that, in terms of takeaways, they change the answer much here.

In other words? Not Not So Fast. So Fast.

OpenAI chose not to officially collaborate with the IMO. They announced their result after the IMO closing ceremony and prior to the IMO 2025 closing party. Those who did collaborate agreed to wait until the following Monday, which was when Google announced. By going first, OpenAI largely stole the spotlight on this from Google, yet another case of Google Failing Marketing Forever.

A question that was debated is, did OpenAI do something wrong here?

Mikhail Samin claimed that they did, and put their hype and clout ahead of the kids celebrating their achievements against the wishes of the IMO.

OpenAI’s Noam Brown replied that they waited until after the closing ceremony exactly to avoid stealing the spotlight. He said he was the only person at OpenAI to speak to anyone at the IMO, and that person only requested waiting until after the ceremony, so that is what OpenAI did.

Not collaborating with the IMO was a choice that OpenAI made.

Mikhail Samin: AI companies that chose to cooperate with the IMO on assessment of the performance of their models had in-person meetings with IMO people on July 16. It was agreed there that announcements of AI achievements should be made on 28 July or later.

A quote from someone involved: “I certainly expect that if OpenAI had contacted the IMO in advance and expressed interest in cooperating in the assessment of their work, they would have been able to be included in that meeting, so I suppose that unless there was a major miscommunication somewhere, they effectively ended up choosing, by default or otherwise, not to cooperate with the IMO on this, and so not to be aware of what ground rules might have been agreed by those who did cooperate.”

Demis Hassabis (CEO DeepMind): Btw as an aside, we didn’t announce on Friday because we respected the IMO Board’s original request that all AI labs share their results only after the official results had been verified by independent experts & the students had rightfully received the acclamation they deserved.

We’ve now been given permission to share our results and are pleased to have been part of the inaugural cohort to have our model results officially graded and certified by IMO coordinators and experts, receiving the first official gold-level performance grading for an AI system!

Noam Brown: ~2 months ago, the IMO emailed us about participating in a formal (Lean) version of the IMO. We’ve been focused on general reasoning in natural language without the constraints of Lean, so we declined. We were never approached about a natural language math option.

Over the past several months, we made a lot of progress on general reasoning. This involved collecting, curating, and training on high-quality math data, which will also go into future models. In our IMO eval we did not use RAG or any tools.

Before we shared our results, we spoke with an IMO board member, who asked us to wait until after the award ceremony to make it public, a request we happily honored.

We had each submitted proof graded by 3 external IMO medalists and there was unanimous consensus on correctness. We have also posted the proofs publicly so that anyone can verify correctness.

Jasper: DeepMind got a gold medal at the IMO on Friday afternoon. But they had to wait for marketing to approve the tweet — until Monday. @OpenAI shared theirs first at 1am on Saturday and stole the spotlight.

In this game, speed > bureaucracy. Miss the moment, lose the narrative.

Clarification: I’ve been told by someone at Google that their IMO results are still being verified internally. Once that’s done, they plan to share them officially—curious to see their approach. Another source mentioned that the IMO committee asked not to publicly discuss AI involvement within a week after the closing ceremony. Things just got a bit more interesting.

Daniel Eth: “In this game, speed > bureaucracy. Miss the moment, lose the narrative.” Honestly, disagree. If GDM beats OpenAI, then the narrative will shift once that’s public.

I have reflected on this. It is not the main thing, the results are the main thing. I do think that on reflection while OpenAI did not break any agreements or their word, and strictly speaking they do not owe the IMO or the kids anything, and this presumably net increased the focus on the kids, this still does represent a meaningful failure to properly honor the competition and process, as well as offering us the proper opportunities for verification, and they should have known that this was the case. I do get that this was a small team’s last minute effort, which makes me more understanding, but it’s still not great.

Fig Spirit: then again, assuming Myers is correct about his impression of the “general coordinator view”, seems like the kind of thing that OpenAI could have known about *ifthey cared, no? by e.g. talking to the right people at the IMO… which imo is not asking much! and looks like others did?

Thus, I was careful to wait to write this until after Google’s results were announced, and have placed Google’s announcement before OpenAI’s in this post, even though due to claimed details by OpenAI I do think their achievement here is likely the more meaningful one. Perhaps that is simply Google failing marketing again and failing to share details.

Ultimately, the reason OpenAI stole my spotlight is that it harkens something general and new in a way that Google’s announcement doesn’t.

With Google sharing its results I don’t want to wait any longer, but note Harmonic?

Harmonic Math: This past week, Harmonic had the opportunity to represent our advanced mathematical reasoning model, Aristotle, at the International Mathematics Olympiad – the most prestigious mathematics competition in the world.

To uphold the sanctity of the student competition, the IMO Board has asked us, along with the other leading AI companies that participated, to hold on releasing our results until July 28th.

So please join us live on @X next Monday, July 28th at 3PM PT and hear from our CEO @tachim and Executive Chairman @vladtenev about the advent of mathematical superintelligence (and maybe a few surprises along the way).

This would be a weird flex if they didn’t also get gold, although it looks like they would have done it in a less general and thus less ultimately interesting way. On the flip side, they are not a big lab like Google or OpenAI, so that’s pretty impressive.

I think the failure to expect this was largely a mistake, but Manifold tells a clear story:

Andrew Curran: OpenAI’s new model has achieved gold level at the International Math Olympiad in a stunning result. It is a reasoning model that incorporates new experimental general-purpose techniques. This has happened much sooner than was predicted by most experts.

Noam Brown (OpenAI): When you work at a frontier lab, you usually know where frontier capabilities are months before anyone else. But this result is brand new, using recently developed techniques. It was a surprise even to many researchers at OpenAI. Today, everyone gets to see where the frontier is.

Peter Wildeford: AI progress comes at you fast.

JGalt Tweets: When will an AI win a Gold Medal in the International Math Olympiad? Median predicted date over time

July 2021: 2043 (22 years away)

July 2022: 2029 (7 years away)

July 2023: 2028 (5 years away)

July 2024: 2026 (2 years away)

Final result, July 2025: 2025 (now). Buckle up, Dorothy.

Some people did expect it, some of whom offered caveats.

Greg Burnham: Pretty happy with how my predictions are holding up.

5/6 was the gold medal threshold this year. OAI’s “experimental reasoning LLM” got that exactly, failing only to solve the one hard combinatorics problem, P6.

My advice remains: look beyond the medal.

Now, this is an LLM, not AlphaProof. That means LLMs have improved at proofs. I didn’t expect that so soon.

Though, FWIW, P3 is a bit of an outlier this year, at least for humans: over 15% of humans got it, higher than any P3 in the last 10 years.

But “the big one” remains whether the AI solutions show qualitatively creative problem-solving.

LLMs could already grind out “low insight” sol’ns to hard AIME problems. If OAI found a way to train them do that for olympiad proof-based problems too, that’s new, but less exciting.

So, clear progress, but not *toosurprising. I’ll keep my takes tempered until looking at the AI solutions in depth, which I hope to do soon! Above excerpts from my preregistered take on the IMO here.

Mikhail Samin: As someone who bet back in 2023 that that it’s >70% likely AI will get an IMO gold medal by 2027:

the IMO markets have been incredibly underpriced, especially for the past year.

(Sadly, another prediction I’ve been >70% confident about is that AI will literally kill everyone.)

The AIs took the IMO under the same time limits as the humans, and success was highly valued, so it is no surprise that they used parallel inference to get more done within that time frame, trading efficiency for speed.

Andrew Curran: These agentic teams based models like Grok Heavy, the Gemini Deep Think that just won gold, and the next gen from OpenAI are all going to use about fifteen times more tokens than current systems. This is why Pro plans are north of $200. Essentially; Jensen wins again.

[from June 14]: Claude Opus, coordinating four instances of Sonnet as a team, used about 15 times more tokens than normal. (90% performance boost) Jensen has mentioned similar numbers on stage recently. GPT-5 is rumored to be agentic teams based. The demand for compute will continue to increase.

Arthur B: IMO gold is super impressive.

I just want to register a prediction, I’m 80% confident the inference run cost over $1M in compute.

Mostly because if they could do it for $1M they would, and they would be able to do it for $1M before they can do it for less.

Jerry Tworek (OpenAI): I’m so limited by compute you wouldn’t believe it. Stargate can’t finish soon enough.

Sure, you solved this particular problem, but that would never generalize, right? That part is the same hype as always?

Near Cyan: you wont believe how smart our new frontier llm is. it repeatedly samples from the data manifold just like our last one. but this time we gave it new data to cover a past blindspot. watch in awe as we now sample from a slightly different area of the data manifold.

there may lay a prize at the end of the hyperdimensioanl checkered rainbow, but it’s likely not what you think it is.

i really thought someone would have done something original by now. of course, if anything was ~truly~ cooking, it shouldn’t be something i’d know about… but the years continue to pass

and, right right we have to finish *this phaseso that we have the pre-requisites. and yet.

David Holz (CEO MidJourney): noooo money can’t be dumb it’s so green.

Near Cyan: it is for now! but some of it may turn a dark crimson surprisingly quickly.

Nico: What do you make of [the OpenAI model knowing it didn’t have a correct solution to problem 6]? Sounds pretty important.

Near Cyan: seems cool i bet they have some great data.

A grand tradition is:

  1. AI can do a set of things [X] better than humans, but not a set of things [Y].

  2. People say [X] and [Y] are distinct because Moravec’s Paradox and so on.

  3. AI lab announces that [Z], previously in [Y], is now in [X].

  4. People move [Z] from [Y] to [X] and then repeat that this distinct category of things [Y] exists because Moravec’s Paradox, that one task was simply miscategorized before, so it’s fine.

Or: AI can do the things it can do, and can’t do the things it can’t do, they’re hard.

Yuchen Jin: OpenAI and DeepMind models winning IMO golds is super cool, but not surprising if you remember AlphaGo beat Lee Sedol.

What’s easy for AI can be hard for humans, and vice versa. That’s Moravec’s Paradox.

So yes, AI can win math gold medals and beat humans in competitive coding contests. But ask it to act like a competent “intern” across a multi-step project without messing things up? Still a long way to go.

To get there, models need longer context windows, far less hallucination (a single one can derail a multi-step task), and likely a new learning paradigm. RL with a single scalar +1/-1 reward at the end of a long trajectory just isn’t informative enough to drive actual learning.

An oldie but a goodie:

Colin Fraser: Can an LLM make a good IMO problem

Posting before someone else does

I mean, it probably can’t even do real math, right?

Kevin Buzzard (Mathematician, Imperial College): I certainly don’t agree that machines which can solve IMO problems will be useful for mathematicians doing research, in the same way that when I arrived in Cambridge UK as an undergraduate clutching my IMO gold medal I was in no position to help any of the research mathematicians there.

It is still entirely unclear whether things will scale from machines being able to do mathematics which can be solved using high school techniques to machines being able to help with mathematics which can only be solved by having a deep understanding of modern research ideas.

This is a big open question right now.

Hehe: What most people don’t realize is that IMO (and IOI, though to a different extent) aren’t particularly hard. They’re aimed at high schoolers, so anyone with decent uni education should be able to solve most of them.

Daniel Litt: I’m sorry, this is nonsense. Vast majority of strong math majors can’t do 5/6 IMO problems. It’s a specific skill that getting a math major doesn’t really train you for.

So yes, we still do not know for sure if being able to do [X] will extend to doing [Y], either with the same model or with a future different model, and [X] and [Y] are distinct skills such that the humans who do [X] cannot yet do [Y] and training humans to do [Y] does not give them the ability to do [X]. However please try to think ahead.

Daniel Litt: An AI tool that gets gold on the IMO is obviously immensely impressive. Does it mean math is “solved”? Is an AI-generated proof of the Riemann hypothesis clearly on the horizon? Obviously not.

Worth keeping timescales in mind here: IMO competitors spend an average of 1.5 hrs on each problem. High-quality math research, by contrast, takes month or years.

What are the obstructions to AI performing high-quality autonomous math research? I don’t claim to know for sure, but I think they include many of the same obstructions that prevent it from doing many jobs:

Long context, long-term planning, consistency, unclear rewards, lack of training data, etc.

It’s possible that some or all of these will be solved soon (or have been solved) but I think it’s worth being cautious about over-indexing on recent (amazing) progress.

To briefly expand on the point about timescales: one recent paper I wrote solved a problem I’ve been thinking about since 2017. Another was 94 pages of extremely densely-written math, aimed at experts.

We don’t know much yet about how the best internal models work, but I don’t think it’s clear that getting capabilities of that level is “only” an engineering problem. That said, I do think it’s pretty likely that many or all of these issues will be solved within the span of my mathematics career.

That is all entirely fair. An IMO problem is measured in hours, not months, and is bounded in important ways. That is exactly the paradigm of METR, and the one being talked about by Noam Brown and Alexander Wei, that we have now made the move from 10 minute problems to 100 minute problems.

That does not mean we can yet solve 10,000 minute or 1 million minute problems, but why would you expect the scaling to stop here? As I discussed in the debates over AI 2027, it makes sense to think that these orders of magnitude start to get easier rather than harder once you get into longer problems. If you can do 100 minute problems that doesn’t mean you can easily go to 1000 or a million, but if you can go 1 million, I bet you can probably do 1 billion without fundamentally changing things that much, if you actually have that kind of time. At some point your timeline is ‘indefinite’ or ‘well, how much time and compute have you got?’

David White: the openai IMO news hit me pretty heavy this weekend.

i’m still in the acute phase of the impact, i think.

i consider myself a professional mathematician (a characterization some actual professional mathematicians might take issue with, but my party my rules) and i don’t think i can answer a single imo question.

ok, yes, imo is its own little athletic subsection of math for which i have not trained, etc. etc., but. if i meet someone in the wild who has an IMO gold, i immediately update to “this person is much better at math than i am”

now a bunch of robots can do it. as someone who has a lot of their identity and their actual life built around “is good at math,” it’s a gut punch. it’s a kind of dying.

like, one day you discover you can talk to dogs. it’s fun and interesting so you do it more, learning the intricacies of their language and their deepest customs. you learn other people are surprised by what you can do. you have never quite fit in, but you learn people appreciate your ability and want you around to help them. the dogs appreciate you too, the only biped who really gets it. you assemble for yourself a kind of belonging. then one day you wake up and the universal dog translator is for sale at walmart for $4.99.

the IMO result isn’t news, exactly. in fact, if you look at the METR agent task length over time plot, i think agents being able to solve ~ 1.5 hour problems is coming right on time. so in some way we should not be surprised. and indeed, it appears multiple companies have achieved the same result. it’s just… the rising tide rising as fast as it has been rising.

of course, grief for my personal identity as a mathematician (and/or productive member of society) is the smallest part of this story

multiply that grief out by *everymathematician, by every coder, maybe every knowledge worker, every artist… over the next few years… it’s a slightly bigger story

and of course, beyond that, there is the fear of actual death, which perhaps i’ll go into more later.

this package — grief for relevance, grief for life, grief for what i have known — isn’t unique to the ai age or anything like that. i think it is a standard thing as one appreaches end of career or end of life. it just might be that that is coming a bit sooner for many of us, all at once.

i wonder if we are ready

I am very confident we are not ready. If we are fortunate we might survive, but we definitely are not ready.

I grade this as minus one million points for asking the wrong questions.

Mechanize: Automating math would generate less than 1% as much value as automating software engineering.

Perhaps AI labs should focus less on chasing gold medals and focus more on the hard problem of automating SWE.

T11s: this is pretty reductionist? innovations in math uniquely enable lots of software (eg cryptography made ecommerce possible)

Deedy: Quant trading is a lot of math and accounts for $50-100B in revenue.

Never confuse costs and benefits #RulesForLife, and never reason from a price change.

(This defines ‘math’ rather narrowly as advanced Real Math that mathematicians and maybe quants and other professionals do, not the kind of math that underlies absolutely everything we do all day, since Fake Math is already mostly automated.)

The value of automating is not determined by how much we spent on it before it got automated. The value is determined by how much additional value we get out of something when we automate it, which might involve a lot more production and very diffuse benefits.

Back in February 2022, Eliezer Yudkowsky bet with Paul Christiano about IMO performance by 2025. The results were not super clear cut if you look at the details, as Christiano was in large part doubting that the hardest problem would be solved and indeed the hardest problem was #6 and was not solved, but a gold medal was still achieved.

So I think we have Paul at <8%, Eliezer at >16% for AI made before the IMO is able to get a gold (under time controls etc. of grand challenge) in one of 2022-2025.

Separately, we have Paul at <4% of an AI able to solve the "hardest" problem under the same conditions.

How [I, Paul, would] update

The informative:

  • I think the IMO challenge would be significant direct evidence that powerful AI would be sooner, or at least would be technologically possible sooner. I think this would be fairly significant evidence, perhaps pushing my 2040 TAI [transformational AI] probability up from 25% to 40% or something like that.

  • I think this would be significant evidence that takeoff will be limited by sociological facts and engineering effort rather than a slow march of smooth ML scaling. Maybe I’d move from a 30% chance of hard takeoff to a 50% chance of hard takeoff.

  • If Eliezer wins, he gets 1 bit of epistemic credit. These kinds of updates are slow going, and it would be better if we had a bigger portfolio of bets, but I’ll take what we can get.

  • This would be some update for Eliezer’s view that “the future is hard to predict.” I think we have clear enough pictures of the future that we have the right to be surprised by an IMO challenge win; if I’m wrong about that then it’s general evidence my error bars are too narrow.

If an AI wins a gold on some but not all of those years, without being able to solve the hardest problems, then my update will be somewhat more limited but in the same direction.

At this point, we have a lot of people who have updated far past 40% chance of transformational AI by 2040 and have 40% for dates like 2029.

If we take all of OpenAI’s statements at face value, think about what they actually did.

Sam Altman: we achieved gold medal level performance on the 2025 IMO competition with a general-purpose reasoning system! to emphasize, this is an LLM doing math and not a specific formal math system; it is part of our main push towards general intelligence.

when we first started openai, this was a dream but not one that felt very realistic to us; it is a significant marker of how far AI has come over the past decade.

we are releasing GPT-5 soon but want to set accurate expectations: this is an experimental model that incorporates new research techniques we will use in future models. we think you will love GPT-5, but we don’t plan to release a model with IMO gold level of capability for many months.

Sheryl Hsu (OpenAI): Watching the model solve these IMO problems and achieve gold-level performance was magical.

The model solves these problems without tools like lean or coding, it just uses natural language, and also only has 4.5 hours. We see the model reason at a very high level – trying out different strategies, making observations from examples, and testing hypothesis.

It’s crazy how we’ve gone from 12% on AIME (GPT 4o) → IMO gold in ~ 15 months. We have come very far very quickly. I wouldn’t be surprised if by next year models will be deriving new theorems and contributing to original math research!

I was particularly motivated to work on this project because this win came from general research advancements. Beyond just math, we will improve on other capabilities and make ChatGPT more useful over the coming months.

Sebastien Bubeck: It’s hard to overstate the significance of this. It may end up looking like a “moon‑landing moment” for AI.

Just to spell it out as clearly as possible: a next-word prediction machine (because that’s really what it is here, no tools no nothing) just produced genuinely creative proofs for hard, novel math problems at a level reached only by an elite handful of pre‑college prodigies.

Nomore ID: Read Noam’s thread carefully.

Winning a gold medal at the 2025 IMO is an outstanding achievement, but in some ways, it might just be noise that grabbed the headlines.

They have recently developed new techniques that work much better on hard-to-verify problems, have extended TTC to several hours, and have improved thinking efficiency.

Jerry Tworek (OpenAI): Why am I excited about IMO results we just published:

– we did very little IMO-specific work, we just keep training general models

– all natural language proofs

– no evaluation harness

We needed a new research breakthrough and @alexwei_ and team delivered.

Diego Aud: Jerry, is this breakthrough included in GPT-5, or is it reserved for the next generation?

Jerry Tworek: It’s a later model probably end of year thing.

Guizin: Agent 1.

Jerry Tworek: I’m so limited by compute you wouldn’t believe it. Stargate can’t finish soon enough.

Going back to Tao’s objections, we know essentially nothing about this new model, or about what Google did to get their result. Given that P3 was unusually easy this year, these scores are perhaps not themselves that terribly impressive relative to expectations.

Can we trust this? It’s not like OpenAI has never misled us on such things in the past.

In terms of the result being worthy of a 35/42, I think we can mostly trust that. They shared the solution, in its garbled semi-English, and if there was something that would have lost them points I think someone would have spotted it by now.

In terms of OpenAI otherwise cheating, we don’t have any proof about this but I think the chances of this are quite low. There’s different kinds of deception or lies, different parts of OpenAI are differently trustworthy, and this kind of lie is not in their nature nor do they have much incentive to try it given the chance it gets exposed, and the fact that if it’s not real then they won’t be able to pay it off later.

The place where one might doubt the most is, can we trust that what OpenAI did this time is more general, in the ways they are claiming?

Gary Marcus: The paradox of the OpenAI IMO discussion is that the new model scored only slightly better than DeepMind’s system from last year (as @NeelNanda5 notes); but that we assume that the new model is far more general.

Yet we have not yet seen any direct evidence of that.

It can barely speak english.

The ‘barely speak English’ part makes the solution worse in some ways but actually makes me give their claims to be doing something different more credence rather than less. It also should worry anyone who wants to maintain monitorable chain of thought.

Then again, one could say that the version that does it better, and more naturally, is thus more important, for exactly the same reasons.

Vladimir Nesov: [GDM’s] is even more surprising than OpenAI’s entry (in its details). Since it can now write proofs well automatically (even if it costs a lot and takes a lot of time), in a few months regular reasoning models might get enough training data to reliably understand what proofs are directly, and that’s an important basic ingredient for STEM capabilities.

We only have OpenAI’s word on the details of how this went down. So what to think?

I am mostly inclined to believe them on the main thrust of what is going on. That doesn’t mean that this result will generalize. I do give them credit for having something that they believe came out of a general approach, and that they expect to generalize.

Still, it’s reasonable to ask what the catch might be, that there’s always going to be a catch. Certainly it is plausible that this was, as Miles suggested, RLed to within an inch of its life, and it starting to be unable to speak English is the opposite of what is claimed, that it is losing its generality, or things are otherwise going off the rails.

The thing is, to me this doesn’t feel like it is fake. It might not be a big deal, it might not transfer all that well to other contexts, but it doesn’t feel fake.

To wrap up, another reminder that no, you can’t pretend none of this matters, and both the Google and OpenAI results matter and should update you:

Cole Wyeth: The headline result was obviously going to happen, not an update for anyone paying attention.

Garrett Baker: “Obviously going to happen” is very different from ‘happens at this point in time rather than later or sooner and with this particular announcement by this particular company’. You should still update off this. Hell, I was pretty confident this would be first done by Google DeepMind, so its a large update for me (I don’t know what for yet though)!

Your claim “not an update for anyone paying attention” also seems false. I’m sure there are many who are updating off this who were paying attention, for whatever reason, as they likely should.

I generally dislike this turn of phrase as it serves literally no purpose but to denigrate people who are changing their mind in light of evidence, which is just a bad thing to do.

cdt: I think it was reasonable to expect GDM to achieve gold with an AlphaProof-like system. Achieving gold with a general LLM-reasoning system from GDM would be something else and it is important for discussion around this to not confuse one forecast for another.

Discussion about this post

Google and OpenAI Get 2025 IMO Gold Read More »