Author name: Shannon Garcia


DeepMind’s latest: An AI for handling mathematical proofs


AlphaProof can handle math challenges but needs a bit of help right now.

Computers are extremely good with numbers, but they haven’t put many human mathematicians out of work. Until recently, they could barely hold their own in high school-level math competitions.

But now Google’s DeepMind team has built AlphaProof, an AI system that matched silver medalists’ performance at the 2024 International Mathematical Olympiad, scoring just one point short of gold at the most prestigious high school math competition in the world. And that’s kind of a big deal.

True understanding

The reason computers fared poorly in math competitions is that, while they far surpass humanity’s ability to perform calculations, they are not really that good at the logic and reasoning needed for advanced math. Put differently, they are good at performing calculations really quickly, but they usually suck at understanding why they’re doing them. Even something as seemingly simple as addition can be treated rigorously: humans can write semi-formal proofs based on the definition of addition, or go fully formal with Peano arithmetic, which defines the natural numbers and operations like addition through axioms.

To perform a proof, humans have to understand the very structure of mathematics. The way mathematicians build proofs, how many steps they need to arrive at the conclusion, and how cleverly they design those steps are a testament to their brilliance, ingenuity, and mathematical elegance. “You know, Bertrand Russell published a 500-page book to prove that one plus one equals two,” says Thomas Hubert, a DeepMind researcher and lead author of the AlphaProof study.

DeepMind’s team wanted to develop an AI that understood math at this level. The work started with solving the usual AI problem: the lack of training data.

Math problems translator

Large language models that power AI systems like ChatGPT learn from billions upon billions of pages of text. Because there are texts on mathematics in their training databases—all the handbooks and works of famous mathematicians—they show some level of success in proving mathematical statements. But they are limited by how they operate: They rely on huge neural nets to predict the next word or token in sequences generated in response to user prompts. Their reasoning is statistical by design, which means they simply return answers that “sound” right.

DeepMind didn’t need the AI to “sound” right—that wasn’t going to cut it in high-level mathematics. They needed their AI to “be” right, to guarantee absolute certainty. That called for an entirely new, more formalized training environment. To provide that, the team used a software package called Lean.

Lean is a computer program that helps mathematicians write precise definitions and proofs. It relies on a precise, formal programming language, also called Lean, into which mathematical statements can be translated. Once a translated, or formalized, statement is loaded into the program, Lean can check whether it is correct and respond with messages like “this is correct,” “something is missing,” or “you used a fact that is not proved yet.”
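To give a sense of what that looks like, here is a minimal example of the kind of statement Lean can check. This is our illustration in Lean 4 syntax, not something from the AlphaProof paper:

```lean
-- A toy formalized statement: addition of natural numbers is commutative.
-- Lean accepts it because the proof term type-checks.
theorem add_comm_example (a b : Nat) : a + b = b + a :=
  Nat.add_comm a b

-- Replacing the proof with `sorry` makes Lean warn that the statement
-- relies on a fact that has not been proved yet.
```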

The problem was that most mathematical statements and proofs found online are written in natural language, like “let X be the set of natural numbers that…”—the number of statements written in Lean was rather limited. “The major difficulty of working with formal languages is that there’s very little data,” Hubert says. To get around this, the researchers trained a Gemini large language model to translate mathematical statements from natural language into Lean. The model worked like an automatic formalizer and produced about 80 million formalized mathematical statements.

It wasn’t perfect, but the team managed to use that to their advantage. “There are many ways you can capitalize on approximate translations,” Hubert claims.

Learning to think

The idea DeepMind had for AlphaProof was to reuse the architecture behind its chess-, Go-, and shogi-playing AlphaZero system. Building proofs in Lean, and mathematics in general, was supposed to be just another game to master. “We were trying to learn this game through trial and error,” Hubert says. Imperfectly formalized problems offered plenty of opportunities for making errors. In its learning phase, AlphaProof was simply proving and disproving the problems in its database. If something was translated poorly, figuring out that something wasn’t right was a useful form of exercise.

Just like AlphaZero, AlphaProof in most cases used two main components. The first was a huge neural net with a few billion parameters that learned to work in the Lean environment through trial and error. It was rewarded for each proven or disproven statement and penalized for each reasoning step it took, which was a way of incentivizing short, elegant proofs.
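In spirit, the incentive looks something like the sketch below. The exact values are not public, so the numbers are placeholders of ours, not DeepMind’s:

```python
# Illustrative reward shaping (placeholder values, not DeepMind's): a bonus
# for settling the statement either way, minus a small per-step penalty that
# nudges the system toward short, elegant proofs.
def reward(settled: bool, num_steps: int, step_penalty: float = 0.01) -> float:
    return (1.0 if settled else 0.0) - step_penalty * num_steps

print(round(reward(settled=True, num_steps=12), 2))    # 0.88
print(round(reward(settled=False, num_steps=200), 2))  # -2.0
```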

It was also trained to use a second component: a tree search algorithm that explored the possible actions for pushing the proof forward at each step. Because the number of possible actions in mathematics can be near-infinite, the job of the neural net was to look at the available branches in the search tree and commit computational budget only to the most promising ones.
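Here is a minimal sketch of that general idea, our toy code rather than DeepMind’s: a learned policy scores candidate proof steps, and a best-first search expands only the highest-scoring branches instead of enumerating everything.

```python
# Minimal sketch (ours, not DeepMind's) of policy-guided search: a "policy"
# scores candidate proof steps, and the search expands only the most
# promising branches instead of enumerating a near-infinite action space.
import heapq
from typing import Callable, Iterable, List, Optional

def guided_search(
    start: str,
    is_proved: Callable[[str], bool],
    candidate_steps: Callable[[str], Iterable[str]],
    policy_score: Callable[[str, str], float],  # stand-in for the neural net
    budget: int = 1000,
    beam: int = 3,
) -> Optional[List[str]]:
    """Best-first search over proof states, expanding only the `beam`
    highest-scoring successors of each state."""
    frontier = [(0.0, start, [])]  # (negative score, state, steps taken)
    seen = {start}
    while frontier and budget > 0:
        budget -= 1
        _, state, path = heapq.heappop(frontier)
        if is_proved(state):
            return path
        # Score all legal next steps, keep only the best few: commit compute
        # to the most promising branches, as described above.
        scored = sorted(
            ((policy_score(state, s), s) for s in candidate_steps(state)),
            reverse=True,
        )[:beam]
        for score, step in scored:
            nxt = state + "|" + step
            if nxt not in seen:
                seen.add(nxt)
                heapq.heappush(frontier, (-score, nxt, path + [step]))
    return None  # ran out of search budget

def toy_policy(state: str, step: str) -> float:
    # Hand-written stand-in for the learned policy network.
    if state == "goal":
        return {"a": 0.9, "b": 0.5, "c": 0.1}[step]
    if state == "goal|a":
        return {"a": 0.1, "b": 0.95, "c": 0.1}[step]
    return 0.1

if __name__ == "__main__":
    proof = guided_search(
        start="goal",
        is_proved=lambda s: s.endswith("a|b"),  # toy stand-in for "Lean says QED"
        candidate_steps=lambda s: ["a", "b", "c"],
        policy_score=toy_policy,
    )
    print(proof)  # ['a', 'b'] under this toy policy
```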

After a few weeks of training, the system could score well on most math competition benchmarks based on problems sourced from past high school-level competitions, but it still struggled with the most difficult of them. To tackle these, the team added a third component that hadn’t been in AlphaZero. Or anywhere else.

Spark of humanity

The third component, called Test-Time Reinforcement Learning (TTRL), roughly emulated the way mathematicians approach the most difficult problems. The learning part relied on the same combination of neural nets and tree search. The difference was in what it learned from. Instead of relying on a broad database of auto-formalized problems, AlphaProof in TTRL mode started by generating an entirely new training dataset based on the specific problem it was facing.

The process involved creating countless variations of the original statement, some a bit simpler, some more general, and some only loosely connected to it. The system then attempted to prove or disprove them. It was roughly what most humans do when facing a particularly hard puzzle, the AI equivalent of saying, “I don’t get it, so let’s try an easier version of this first to get some practice.” This allowed AlphaProof to learn on the fly, and it worked amazingly well.
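A very rough sketch of that loop, as a toy illustration of ours (the real system generates and trains on formal Lean variants; the scalar “skill” update below is a stand-in, not how the reinforcement learning actually works):

```python
# Rough sketch (our illustration, not DeepMind's TTRL code): generate variants
# of the hard target problem, learn from attempts on the variants, then retry
# the original problem.
import random

def make_variants(problem: str, n: int = 5) -> list:
    # Hypothetical stand-in: the real variants are simplified, generalized,
    # or loosely related formal statements, not labeled copies.
    return [f"{problem} (variant {i})" for i in range(n)]

def attempt(problem: str, skill: float) -> bool:
    # Toy proxy for "run the prover": success probability grows with skill.
    return random.random() < skill

def test_time_rl(target: str, rounds: int = 10) -> bool:
    skill = 0.05  # the system starts out unable to solve the target
    for _ in range(rounds):
        for variant in make_variants(target):
            if attempt(variant, skill):
                skill += 0.02  # proving or disproving a variant is useful practice
        if attempt(target, skill):
            return True  # the original problem finally falls
    return False

if __name__ == "__main__":
    random.seed(0)
    print(test_time_rl("hard olympiad problem"))
```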

At the 2024 International Mathematical Olympiad, there were 42 points to score across six problems worth seven points each. To win gold, participants had to get 29 points or higher, and 58 out of 609 of them did that. Silver medals went to those who earned between 22 and 28 points (there were 123 silver medalists). The problems varied in difficulty, with the sixth one, acting as a “final boss,” being the most difficult of them all. Only six participants managed to solve it. AlphaProof was the seventh.

But AlphaProof wasn’t an end-all, be-all mathematical genius. Its silver had its price—quite literally.

Optimizing ingenuity

The first problem with AlphaProof’s performance was that it didn’t work alone. To begin with, humans had to make the problems compatible with Lean before the software even got to work. And among the six Olympiad problems, the fourth was about geometry, which AlphaProof was not optimized for. To deal with it, AlphaProof had to call a friend: AlphaGeometry 2, a geometry-specialized AI that ripped through the task in a few minutes without breaking a sweat. On its own, AlphaProof scored 21 points, not 28, so technically it would have earned bronze, not silver. Except it wouldn’t have.

Human participants at the Olympiad had to solve their six problems in two sessions of four and a half hours each. AlphaProof, on the other hand, wrestled with them for several days using multiple tensor processing units at full throttle. The most time- and energy-consuming component was TTRL, which spent three days on each of the three problems it managed to solve. If AlphaProof were held to the same standard as human participants, it would basically have run out of time. And if it hadn’t been born at a tech giant worth hundreds of billions of dollars, it would have run out of money, too.

In the paper, the team admits the computational requirements to run AlphaProof are most likely cost-prohibitive for most research groups and aspiring mathematicians. Computing power in AI applications is often measured in TPU-days, meaning a tensor processing unit working flat-out for a full day. AlphaProof needed hundreds of TPU-days per problem.

On top of that, the International Mathematical Olympiad is a high school-level competition, and the problems, while admittedly difficult, were based on things mathematicians already know. Research-level math requires inventing entirely new concepts instead of just working with existing ones.

But DeepMind thinks it can overcome these hurdles and optimize AlphaProof to be less resource-hungry. “We don’t want to stop at math competitions. We want to build an AI system that could really contribute to research-level mathematics,” Hubert says. His goal is to make AlphaProof available to the broader research community. “We’re also releasing a kind of an AlphaProof tool,” he added. “It would be a small trusted testers program to see if this would be useful to mathematicians.”

Nature, 2025.  DOI: 10.1038/s41586-025-09833-y


Jacek Krywko is a freelance science and technology writer who covers space exploration, artificial intelligence research, computer science, and all sorts of engineering wizardry.



Oracle hit hard in Wall Street’s tech sell-off over its huge AI bet

“That is a huge liability and credit risk for Oracle. Your main customer, biggest customer by far, is a venture capital-funded start-up,” said Andrew Chang, a director at S&P Global.

OpenAI faces questions about how it plans to meet its commitments to spend $1.4 trillion on AI infrastructure over the next eight years. It has struck deals with several Big Tech groups, including Oracle’s rivals.

Of the five hyperscalers—which include Amazon, Google, Microsoft, and Meta—Oracle is the only one with negative free cash flow. Its debt-to-equity ratio has surged to 500 percent, far higher than Amazon’s 50 percent and Microsoft’s 30 percent, according to JPMorgan.

While all five companies have seen their cash-to-assets ratios decline significantly in recent years amid a boom in spending, Oracle’s is by far the lowest, JPMorgan found.

JPMorgan analysts noted a “tension between [Oracle’s] aggressive AI build-out ambitions and the limits of its investment-grade balance sheet.”

Analysts have also noted that Oracle’s data center leases are for much longer than its contracts to sell capacity to OpenAI.

Oracle has signed at least five long-term lease agreements for US data centers that will ultimately be used by OpenAI, resulting in $100 billion of off-balance-sheet lease commitments. The sites are at varying levels of construction, with some not expected to break ground until next year.

Safra Catz, Oracle’s sole chief executive from 2019 until she stepped down in September, resisted expanding its cloud business because of the vast expenses required. She was replaced by co-CEOs Clay Magouyrk and Mike Sicilia as part of Oracle’s pivot to a new era focused on AI.

Catz, who is now executive vice-chair of Oracle’s board, has exercised stock options and sold $2.5 billion of its shares this year, according to US regulatory filings. She had announced plans to exercise her stock options at the end of 2024.

© 2025 The Financial Times Ltd. All rights reserved. Not to be redistributed, copied, or modified in any way.



On Writing #2

In honor of my dropping by Inkhaven at Lighthaven in Berkeley this week, I figured it was time for another writing roundup. You can find #1 here, from March 2025.

I’ll be there from the 17th (the day I am publishing this) until the morning of Saturday the 22nd. I am happy to meet people, including for things not directly about writing.

  1. Table of Contents.

  2. How I Use AI For Writing These Days.

  3. Influencing Influence.

  4. Size Matters.

  5. Time To Write A Shorter One.

  6. A Useful Tool.

  7. A Maligned Tool.

  8. Neglected Topics.

  9. The Humanities Don’t Seem Relevant To Writing About Future Humanity?

  10. Writing Every Day.

  11. Writing As Deep Work.

  12. Most Of Your Audience Is Secondhand.

  13. That’s Funny.

  14. Fiction Writing Advice.

  15. Just Say The Thing.

  16. Cracking the Paywall.

How have I been using AI in my writing?

Directly? With the writing itself? Remarkably little. Almost none.

I am aware that this is not optimal. But at current capability levels, with the prompts and tools I know about, in the context of my writing, AI has consistently proven to have terrible taste and to make awful suggestions, and also to be rather confident about them. This has proven sufficiently annoying that I haven’t found it worth checking with the AIs.

I also worry about AI influence pushing me towards generic slop, pushing me to sounding more like the AIs, and rounding off the edges of things, since every AI I’ve tried this with keeps trying to do all that.

I am sure it does not help that my writing style is very unusual, and basically not in the training data aside from things written by actual me, as far as I can tell.

Sometimes I will quote LLM responses in my writing, always clearly labeled, when it seems useful to point to this kind of ‘social proof’ or sanity check.

The other exception is that if you ask the AI to look for outright errors, especially things like spelling and grammar, it won’t catch everything, but when it does catch something it is usually right. When you ask it to spot errors of fact, it’s not as reliable, but it’s good enough that checking its list is worthwhile. I should be making a point of always doing that.

I did the ‘check for errors and other considerations’ thing on this piece in particular with both Sonnet and 5.1-Thinking. This did improve the post but it’s not obvious it improved it enough to be worth the time.

I will also sometimes ask it about a particular line or argument I’m considering, to see if it buys it, but only when what I care about is a typical reaction.

If I was devoting more time to refining and editing, and cared more about marginal improvements there, that would open up more use cases, but I don’t think that’s the right use of time for me on current margins versus training on more data or doing more chain of thought.

Indirectly? I use it a lot more there, and again I could be doing more.

There are some specific things:

  1. I have a vibe coded Chrome extension that saves me a bunch of trouble, that could be improved a lot with more work. It does things like generate the Table of Contents, crosspost to WordPress, auto-populate many links and quickly edit quotes to fix people’s indifference to things like capitalization.

  2. I have a GPT called Zvi Archivist that I use to search through my past writing, to check if and when I’ve already covered something and what I’ve said about it.

  3. I have a transcriber for converting images to text because all the websites I know about that offer to do this for you are basically broken due to gating. This works.

Then there are things that are the same as what everyone does all the time. I do a lot of fact checking, sanity checking, Fermi estimation, tracking down information or sources, asking for explanations, and questioning papers on the things I care about. Using the AI assistant in its classic sense. All of that is a big help, and I notice my activation energy for doing this is higher than it should be.

I want this to be true so I’m worried I can’t be objective, but it seems true to me?

Janus: i think that it’s almost always a bad idea to attempt to grow as an influencer on purpose.

you can believe that it would be good if you were to grow, and still you shouldn’t optimize for it.

the only way it goes well is if it happens while you optimize for other things.

More precisely than “you shouldn’t on purpose” what I’m saying is you shouldn’t be spending significant units of optimization on this goal and performing actions you wouldn’t otherwise for this purpose

I am confident that if you optimize primarily for influence, that’s full audience capture, slopification and so on, and you’ve de facto sold your soul. You can in theory turn around and then use that influence to accomplish something worthwhile, but statistically speaking you won’t do that.

Janus: Name a single account that explicitly optimizes for being a bigger influencer / “tries to grow” (instead of just happening as a side effect) and that does more good than harm to the ecosystem and generally has good vibes and interesting content

You probably can’t!

actually, https://x.com/AISafetyMemes is a contender

but i know they’re VERY controversial and I do think they’re playing with fire

i do consider them net positive but this is mostly bc they sometimes have very good taste and maybe cancel out the collateral damage

but WOULD NOT RECOMMEND almost anyone trying this, lol

AISafetyMemes is definitely an example of flying dangerously close to the sun on this, but keeping enough focus and having enough taste to maybe be getting away with it. It’s unclear that the net sign of impact there is positive, there are some very good posts but also some reasons to worry.

No one reads the blog posts, they’re too long, so might as well make them longer?

Visakan Veerasamy: An idea I’ve been toying with and discussed with a couple of friends is the idea that blog posts could and probably should get much longer now that fewer people are reading them.

One of the difficult things about writing a good essay is figuring out what to leave out so it is more manageable for readers.

But on a blog where there is no expectation that anybody reads it, you do not have to leave anything out.

My guess is this is going to end up being a barbell situation like so many other things. If you cut it down, you want to cut it down as much as possible. If you’re going long, then on the margin you’re better off throwing everything in.

I highlight this exactly because it seems backwards to me. I notice that my experience is very much the opposite – when I want to write a good short piece it is MUCH more work per token, and often more total work.

Timothy Lee: I think a big reason that writing a book is such a miserable experience is that the time to write a good piece is more-than-linear in the number of words. A good 2,000-word piece is a lot more than 4x the work of a good 500-word piece.

I assume this continues for longer pieces, and a good 100,000-word book is a lot more than 50x the work of a good 2,000-word article. Most authors deal with this by cutting corners and turning in books that aren’t very good. And then there’s Robert Caro.

Josh You: I think by “good 2000 word piece” Tim means “a 2000 word piece that has been edited down from a much longer first draft”

Even then. Yes, a tight longer piece requires more structure and planning, but the times I write those 500-800 word pieces it takes forever, because you really do struggle over every word as you try to pack everything into the tiniest possible space.

Writing a 100,000 word book at the precision level of an 800 word thinkpiece would take forever, but also I presume it almost never happens. If it does, that better be your masterpiece or I don’t see why you’d do it.

Dwarkesh Patel is using the Smart Composer Plugin for Obsidian, which he says is basically Cursor for writing, and loves it. Sounds great conditional on using Obsidian, but it is not being actively maintained.

Eric Raymond joins ‘the em-dash debate’ on the side of the em-dash.

Eric Raymond (yes that one): My wacky theory about the em-dash debate:

Pro writers use em-dashes a lot because many of them, possibly without consciously realizing it, have become elocutionary punctuationists.

That is, they’ve fallen into the habit of using punctuation not as grammatical phrase structure markers but as indicators of pauses of varying length in the flow of speech.

The most visible difference you see in people who write in this style is that their usage of commas becomes somewhat more fluid — that’s the marker for the shortest pause. But they also reach for less commonly used punctuation marks as indicators of longer pauses of varying length.

Em dash is about the second or third longest pause, only an ellipsis or end-of-sentence period being clearly longer.

Historical note: punctuation marks originally evolved as pause or breathing markers in manuscripts to aid recitation. In the 19th century, after silent reading had become normal, they were reinterpreted by grammarians as phrase structure markers and usage rules became much more rigid.

Really capable writers have been quietly rediscovering elocutionary punctuation ever since.

RETVRN!

I too have been increasingly using punctuation, especially commas, to indicate pauses. I still don’t use em dashes, partly because I almost never want that exact length and style of a pause for whatever reason, and also because my instinct is that you’re trying to do both ‘be technically correct’ and also ‘evoke what you want’ and my brain thinks of the em-dash as technically incorrect.

That’s all true and I never used em-dashes before but who are we kidding, the best reason not to use em-dashes is that people will think you’re using AI. I don’t love that dynamic either, but do you actually want to die on that hill?

Tyler Cowen lists some reasons why he does not cover various topics much. The list resonates with me quite a bit.

  1. I feel that writing about the topic will make me stupider.

  2. I believe that you reading more about the topic will make you stupider.

  3. I believe that performative outrage usually brings low or negative returns. Matt Yglesias has had some good writing on this lately.

  4. I don’t have anything to add on the topic. Abortion and the Middle East would be two examples here.

  5. Sometimes I have good inside information on a topic, but I cannot reveal it, not even without attribution. And I don’t want to write something stupider than my best understanding of the topic.

  6. I just don’t feel like it.

  7. On a few topics I feel it is Alex’s province.

I don’t have an Alex, instead I have decided on some forms of triage that are simply ‘I do not have the time to look into this and I will let it be someone else’s department.’

Otherwise yes, all of these are highly relevant.

Insider information is tough, and I am very careful about not revealing things I am not supposed to reveal, but this rarely outright stops me. If nothing else, you can usually get net smarter via negativa, where you silently avoid saying false things, including by using careful qualifiers on statements.

One big thing perhaps missing from Tyler’s list is that I avoid certain topics where my statements would potentially interfere with my ability to productively discuss other topics. If you are going to make enemies, or give people reasons to attack you or dismiss you, only do that on purpose. One could also file this under making you and others stupider. Similarly, there are things that I need to not think about – I try to avoid thinking about trading for this reason.

A minor thing is that I’d love to be able to talk more about gaming, and other topics dear to my heart, but that consistently drive people away permanently when I do that. So it’s just not worth it. If the extra posts simply had no impact, I’d totally do it, but as is I’d be better off writing the post and then not hitting publish. Sad. Whereas Tyler has made it very clear he’s going to post things most readers don’t care about, when he feels like doing so, and that’s part of the price of admission.

If you want to write or think about the future, maybe don’t study the humanities?

Startup Archive: Palmer Luckey explains why science fiction is a great place to look for ideas

“One of the things that I’ve realized in my career is that nothing I ever come up with will be new. I’ve literally never come up with an idea that a science fiction author has not come up with before.”

Dr. Julie Gurner: Funny how valuable those English majors and writers truly are, given how much liberal arts has been put down. Why philosophy, creativity and hard tech skills make such fantastic bedfellows. Span of vision wins.

Orthonormalist: Heinlein was an aeronautical engineer.

Asimov was a biochemistry professor.

Arthur Clarke was a radio operator who got a physics degree.

Ray Bradbury never went to college (but did go straight to being a writer)

I quote this because ‘study the humanities’ is a natural thing to say to someone looking to write or think about the future, and yet I agree that when I look at the list of people whose thinking about the future has influenced me, I notice essentially none of them have studied the humanities.

Alan Jacobs has a very different writing pattern. Rather than write every day, he waits until the words are ready, so he’ll work every day but often that means outlines or index card reordering or just sitting in his chair and thinking, even for weeks at a time. This is alien to me. If I need to figure out what to write, I start writing, see what it looks like, maybe delete it and try again, maybe procrastinate by working on a different thing.

Neal Stephenson explains that for him writing is Deep Work, requiring long blocks of reliably uninterrupted time bunched together, writing novels is the best thing he does, and that’s why he doesn’t go to conferences or answer your email. Fair enough.

I’ve found ways to not be like that. I deal with context shifts and interruptions all the time and it is fine, indeed when dealing with difficult tasks I almost require them. That’s a lot of how I can be so productive. But the one time I wrote something plausibly like a book, the Immoral Mazes sequence, I did spend a week alone in my apartment doing nothing else. And I haven’t figured out how to write a novel, or almost any fiction at all.

Also, it’s rather sad if it is true that Neal Stephenson only gets a middle class life out of writing so many fantastic and popular books, and can’t afford anyone to answer his email. That makes writing seem like an even rougher business than I expected. Although soon AI can perhaps do it for him?

Patrick McKenzie highlights an insight from Alex Danco, which is that most of the effective audience of any successful post is not people who read the post, but people who are told about the post by someone who did read it. Patrick notes this likely also applies to formal writing, I’d note it seems to definitely apply to most books.

Relatedly, I have in the past postulated a virtual four-level model of flow of ideas, where each level can understand the level above it, and then rephrase and present it to the level below.

So if you are Level 1, either in general or in an area, you can formulate fully new ideas. If you are Level 2, you can understand what the Level 1s say, look for consensus or combine what they say, riff on it and then communicate that to those who are up to Level 3, who can then fully communicate to the public who end up typically around at Level 4.

Then the public will communicate a simplified and garbled version to each other.

You can be Level 1 and then try to ‘put on your Level 2 or 3 hat’ to write a dumber, simpler version to a broader audience, but it is very hard to simultaneously do that and also communicate the actual concepts to other Level 1s.

These all then interact, but if you go viral with anything longer than a Tweet, you inevitably are going to end up with a message primarily communicated via (in context) Level 3 and Level 4 people communicating to other Level 3-4 people.

At that point, and any time you go truly viral or your communication is ‘successful,’ you run into the You Get About Five Words problem.

My response to this means that at this point I essentially never go all that directly viral. I have a very narrow range of view counts, where even the top posts never do 100% better than typical posts, and the least popular posts – which are when I talk about AI alignment or policy on their own – will do at worst 30% less than typical.

The way the ideas go viral is someone quotes, runs with or repackages them. A lot of the impact comes from the right statement reaching the right person.

I presume that would work differently if I was working with mediums that work on virality, such as YouTube or TikTok, but my content seems like a poor fit for them, and when I do somewhat ‘go viral’ in such places it is rarely content I care about spreading. Perhaps I am making a mistake by not branching out. But on Twitter I still almost never go viral, as it seems my speciality is small TAM (total available market) Tweets.

Never have a character try to be funny, the character themselves should have no idea.

I think this is directionally correct but goes too far, for the same reasons that you, in your real life, will often try to be funny, and sometimes it will work. The trick is they have to be trying to be funny in a way that makes sense for the character, in context, for those around them, not trying to be funny to the viewer.

I notice that in general I almost never ‘try to be funny,’ not exactly. I simply say things because they would be funny, and to say things in the funniest way possible, because why not. A lot of my favorite people seem to act similarly.

Lydia Davis offers her top ten recommendations for good (fiction?) writing: Keep notes, including sentences out of context, work from your own interest, be mostly self-taught, read and revise the notes constantly, grow stories or develop poems out of those notes, learn techniques from great works and read the best writers across time.

Orson Scott Card explains that you don’t exhaust the reader by having too much tension in your book, you exhaust them by having long stretches without tension. The tension keeps us reading.

Dwarkesh Patel: Unreasonably effective writing advice:

“What are you trying to say here?

Okay, just write that.”

I’ve (separately) started doing this more.

I try to make sure that it’s very easy to find the central point, the thing I’m most trying to say, and hard to miss it.

Patrick McKenzie: Cosigned, and surprisingly effective with good writers in addition to ones who more obviously need the prompting.

Writing an artifact attaches you to the structure of it while simultaneously subsuming you in the topic. The second is really good for good work; the first, less so.

One thing that I tried, with very limited success, to get people to do is to be less attached to words on a page. Writing an essay? Write two very different takes on it; different diction, different voice, maybe even different argument. Then pick the one which speaks to you.

Edit *that* rather than trying to line edit the loser towards greatness.

There is something which people learn, partially from school and partially from work experience, which causes them to write as if they were charged for every word which goes down on the page.

Words are free! They belong in a vast mindscape! You can claw more from the aether!

I think people *might* operationalize better habits after LLMs train them that throwing away a paragraph is basically costless.

Jason Cohen: Yeah this works all the time.

Also when getting someone to explain their product, company, customer, why to work for them, etc..

So funny how it jogs them out of their own way!

BasedBigTech: An excellent Group PM reviewed my doc with me. He said “what does this mean?” and I told him.

“Then why didn’t you write that?”

Kevin Kelly: At Whole Earth Review people would send us book reviews with a cover letter explaining why we should run their book review. We’d usually toss the review and print their much shorter cover letter as the review which was much clearer and succinct.

Daniel Eth: It’s crazy how well just straight up asking people that gets them to say the thing they should write down

Why does it work?

The answer is that writing is doing multiple tasks.

Only one of them is ‘tell you what all this means.’

You have to do some combination of things such as justify that, explain it, motivate it, provide details, teach your methods and reasoning, perform reporting, be entertaining and so on.

Also, how did you know what you meant to say until you wrote the damn thing?

You still usually should find a way to loudly say what it all means, somewhere in there.

But this creates the opportunity for the hack.

If I hand you a ten-page paper, and you ask ‘what are you trying to say?’ then I have entered into evidence that I have Done the Work and Written the Report.

Now I can skip the justifications, details and context, and Say The Thing.

The point of a reference post is sometimes to give people the opportunity to learn.

The point of a reference post can also be to exist and then not be clicked on. It varies.

This is closely related to the phenomenon where often a movie or show will have a scene that logically and structurally has to exist, but which you wish you didn’t have to actually watch. In theory you could hold up a card that said ‘Scene in which Alice goes to the bank, acts nervous, and gets the money’ or whatever.

Probably they should do a graceful version of something like that more often, or even interactive versions where you can easily expand or condense various scenes. There’s something there.

Similarly, with blog posts (or books) there are passages that are written or quoted knowing many or most people will skip them, but that have to be there.

Aella teaches us how to make readers pay up to get behind a Paywall. Explain why you are the One Who Knows some valuable thing, whereas others including your dear reader are bad at this and need your help. Then actually provide value both outside and inside of the paywall, ideally because the early free steps are useful even without the payoff you’re selling.

I am thankful that I can write without worrying about maximizing such things, while I also recognize that I’m giving up a lot of audience share not optimizing for doing similar things on the non-paywall side.




Dogs came in a wide range of sizes and shapes long before modern breeds

“The concept of ‘breed’ is very recent and does not apply to the archaeological record,” Evin said. People have, of course, been breeding dogs for particular traits for as long as we’ve had dogs, and tiny lap dogs existed even in ancient Rome. However, it’s unlikely that a Neolithic herder would have described his dog as being a distinct “breed” from his neighbor’s hunting partner, even if they looked quite different. Which, apparently, they did.


Dogs had about half of their modern diversity (at least in skull shapes and sizes) by the Neolithic. Credit: Kiona Smith

Bones only tell part of the story

“We know from genetic models that domestication should have started during the late Pleistocene,” Evin told Ars. A 2021 study suggested that domestic dogs have been a separate species from wolves for more than 23,000 years. But it took a while for differences to build up.

Evin and her colleagues had access to 17 canine skulls that ranged from 12,700 to 50,000 years old—prior to the end of the ice age—and they all looked enough like modern wolves that, as Evin put it, “for now, we have no evidence to suggest that any of the wolf-like skulls did not belong to wolves or looked different from them.” In other words, if you’re just looking at the skull, it’s hard to tell the earliest dogs from wild wolves.

We have no way to know, of course, what the living dog might have looked like. It’s worth mentioning that Evin and her colleagues found a modern Saint Bernard’s skull that, according to their statistical analysis, looked more wolf-like than dog-like. But even if it’s not offering you a brandy keg, there’s no mistaking a live Saint Bernard, with its droopy jowls and floppy ears, for a wolf.

“Skull shape tells us a lot about function and evolutionary history, but it represents only one aspect of the animal’s appearance. This means that two dogs with very similar skulls could have looked quite different in life,” Evin told Ars. “It’s an important reminder that the archaeological record captures just part of the biological and cultural story.”

And with only bones—and sparse ones, at that—to go on, we may be missing some of the early chapters of dogs’ biological and cultural story. Domestication tends to select the friendliest animals to produce the next generation, and apparently that comes with a particular set of evolutionary side effects, whether you’re studying wolves, foxes, cattle, or pigs. Spots, floppy ears, and curved tails all seem to be part of the genetic package that comes with inter-species friendliness. But none of those traits is visible in the skull.



Researchers question Anthropic claim that AI-assisted attack was 90% autonomous

Claude frequently overstated findings and occasionally fabricated data during autonomous operations, claiming to have obtained credentials that didn’t work or identifying critical discoveries that proved to be publicly available information. This AI hallucination in offensive security contexts presented challenges for the actor’s operational effectiveness, requiring careful validation of all claimed results. This remains an obstacle to fully autonomous cyberattacks.

How (Anthropic says) the attack unfolded

Anthropic said GTG-1002 developed an autonomous attack framework that used Claude as an orchestration mechanism that largely eliminated the need for human involvement. This orchestration system broke complex multi-stage attacks into smaller technical tasks such as vulnerability scanning, credential validation, data extraction, and lateral movement.

“The architecture incorporated Claude’s technical capabilities as an execution engine within a larger automated system, where the AI performed specific technical actions based on the human operators’ instructions while the orchestration logic maintained attack state, managed phase transitions, and aggregated results across multiple sessions,” Anthropic said. “This approach allowed the threat actor to achieve operational scale typically associated with nation-state campaigns while maintaining minimal direct involvement, as the framework autonomously progressed through reconnaissance, initial access, persistence, and data exfiltration phases by sequencing Claude’s responses and adapting subsequent requests based on discovered information.”

The attacks followed a five-phase structure that increased AI autonomy through each one.

The life cycle of the cyberattack, showing the move from human-led targeting to largely AI-driven attacks using various tools, often via the Model Context Protocol (MCP). At various points during the attack, the AI returns to its human operator for review and further direction. Credit: Anthropic

The attackers were able to bypass Claude’s guardrails in part by breaking tasks into small steps that, in isolation, the AI tool didn’t interpret as malicious. In other cases, the attackers couched their inquiries in the context of security professionals trying to use Claude to improve defenses.

As noted last week, AI-developed malware has a long way to go before it poses a real-world threat. There’s no reason to doubt that AI-assisted cyberattacks may one day produce more potent attacks. But the data so far indicates that threat actors—like most others using AI—are seeing mixed results that aren’t nearly as impressive as the AI industry claims.



OpenAI walks a tricky tightrope with GPT-5.1’s eight new personalities

On Wednesday, OpenAI released GPT-5.1 Instant and GPT-5.1 Thinking, two updated versions of its flagship AI models now available in ChatGPT. The company is wrapping the models in the language of anthropomorphism, claiming that they’re warmer, more conversational, and better at following instructions.

The release follows complaints earlier this year that its previous models were excessively cheerful and sycophantic, along with an opposing controversy among users over how OpenAI modified the default GPT-5 output style after several suicide lawsuits.

The company now faces intense scrutiny from lawyers and regulators that could threaten its future operations. In that kind of environment, it’s difficult to just release a new AI model, throw out a few stats, and move on like the company could even a year ago. But here are the basics: The new GPT-5.1 Instant model will serve as ChatGPT’s faster default option for most tasks, while GPT-5.1 Thinking is a simulated reasoning model that attempts to handle more complex problem-solving tasks.

OpenAI claims that both models perform better on technical benchmarks such as math and coding evaluations (including AIME 2025 and Codeforces) than GPT-5, which was released in August.

Improved benchmarks may win over some users, but the biggest change with GPT-5.1 is in its presentation. OpenAI says it heard from users that they wanted AI models to simulate different communication styles depending on the task, so the company is offering eight preset options, including Professional, Friendly, Candid, Quirky, Efficient, Cynical, and Nerdy, alongside a Default setting.

These presets alter the instructions fed into each prompt to simulate different personality styles, but the underlying model capabilities remain the same across all settings.
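The presets themselves are a ChatGPT feature rather than an API parameter, but the underlying mechanism, prepending style instructions to the conversation, can be approximated directly. Below is a rough sketch using the OpenAI Python SDK and the gpt-5.1-chat-latest model name mentioned later in this article; the instruction wording is our invention, not OpenAI’s actual preset text:

```python
# Rough sketch: approximating a "Cynical"-style preset by prepending style
# instructions. The system prompt text is invented for illustration; OpenAI's
# actual preset prompts are not public.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-5.1-chat-latest",  # API name cited in the article
    messages=[
        {"role": "system", "content": "Adopt a dry, skeptical, mildly cynical tone."},
        {"role": "user", "content": "What do you think of my new productivity app idea?"},
    ],
)
print(response.choices[0].message.content)
```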

An illustration showing GPT-5.1’s eight personality styles in ChatGPT. Credit: OpenAI

In addition, the company trained GPT-5.1 Instant to use “adaptive reasoning,” meaning that the model decides when to spend more computational time processing a prompt before generating output.

The company plans to roll out the models gradually over the next few days, starting with paid subscribers before expanding to free users. OpenAI plans to bring both GPT-5.1 Instant and GPT-5.1 Thinking to its API later this week. GPT-5.1 Instant will appear as gpt-5.1-chat-latest, and GPT-5.1 Thinking will be released as GPT-5.1 in the API, both with adaptive reasoning enabled. The older GPT-5 models will remain available in ChatGPT under the legacy models dropdown for paid subscribers for three months.



With another record broken, the world’s busiest spaceport keeps getting busier


It’s not just the number of rocket launches, but how much stuff they’re carrying into orbit.

With 29 Starlink satellites onboard, a Falcon 9 rocket streaks through the night sky over Cape Canaveral Space Force Station, Florida, on Monday night. Credit: Stephen Clark/Ars Technica

CAPE CANAVERAL, Florida—Another Falcon 9 rocket fired off its launch pad here on Monday night, taking with it another 29 Starlink Internet satellites to orbit.

This was the 94th orbital launch from Florida’s Space Coast so far in 2025, breaking the previous record for the most satellite launches in a calendar year from the world’s busiest spaceport. Monday night’s launch came two days after a Chinese Long March 11 rocket lifted off from an oceangoing platform on the opposite side of the world, marking humanity’s 255th mission to reach orbit this year, a new annual record for global launch activity.

As of Wednesday, a handful of additional missions have pushed the global figure this year to 259, putting the world on pace for around 300 orbital launches by the end of 2025. This will more than double the global tally of 135 orbital launches in 2021.

Routine vs. complacency

Waiting in the darkness a few miles away from the launch pad, I glanced around at my surroundings before watching SpaceX’s Falcon 9 thunder into the sky. There were no throngs of space enthusiasts anxiously waiting for the rocket to light up the night. No line of photographers snapping photos. Just this reporter and two chipper retirees enjoying what a decade ago would have attracted far more attention.

Go to your local airport and you’ll probably find more people posted up at a plane-spotting park at the end of the runway. Still, a rocket launch is something special. On the same night that I watched the 94th launch of the year depart from Cape Canaveral, Orlando International Airport saw the same number of airplane departures in just three hours.

The crowds still turn out for more meaningful launches, such as a test flight of SpaceX’s Starship megarocket in Texas or Blue Origin’s attempt to launch its second New Glenn heavy-lifter here Sunday. But those are not the norm. Generations of aerospace engineers were taught that spaceflight is not routine for fear of falling into complacency, leading to failure, and in some cases, death.

Compared to air travel, the mantra remains valid. Rockets are unforgiving, with engines operating under extreme pressures, at high thrust, and unable to suck in oxygen from the atmosphere as a reactant for combustion. There are fewer redundancies in a rocket than in an airplane.

The Falcon 9’s established failure rate is less than 1 percent, well short of any safety standard for commercial air travel but good enough to make it the most successful orbital-class rocket in history. Given the Falcon 9’s track record, SpaceX seems to have found a way to overcome the temptation for complacency.

A Chinese Long March 11 rocket carrying three Shiyan 32 test satellites lifts off from waters off the coast of Haiyang in eastern China’s Shandong province on Saturday. Credit: Guo Jinqi/Xinhua via Getty Images

Following the trend

The upward trend in rocket launches hasn’t always been the case. Launch numbers were steady for most of the 2010s, following a downward trend in the 2000s, with as few as 52 orbital launches in 2005, the lowest number since the nascent era of spaceflight in 1961. There were just seven launches from here in Florida that year.

The numbers have picked up dramatically in the last five years as SpaceX has mastered reusable rocketry.

It’s important to look at not just the number of launches but also how much stuff rockets are actually putting into orbit. More than half of this year’s launches were performed using SpaceX’s Falcon 9 rocket, and the majority of those deployed Starlink satellites for SpaceX’s global Internet network. Each spacecraft is relatively small in size and weight, but SpaceX stacks up to 29 of them on a single Falcon 9 to max out the rocket’s carrying capacity.

All this mass adds up to make SpaceX’s dominance of the launch industry appear even more absolute. According to analyses by BryceTech, an engineering and space industry consulting firm, SpaceX has launched 86 percent of all the world’s payload mass over the 18 months from the beginning of 2024 through June 30 of this year.

That’s roughly 2.98 million kilograms of the approximately 3.46 million kilograms (3,281 of 3,819 tons) of satellite hardware and cargo that all the world’s rockets placed into orbit during that timeframe.
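As a quick check of the share implied by those BryceTech figures:

```python
# Sanity check on the payload-share figures quoted above.
spacex_kg = 2.98e6   # SpaceX upmass, Jan 2024 through June 30, 2025
total_kg = 3.46e6    # all providers combined, same period
print(f"SpaceX share: {spacex_kg / total_kg:.0%}")  # -> SpaceX share: 86%
```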

The charts below were created by Ars Technica using publicly available launch numbers and payload mass estimates from BryceTech. The first illustrates the rising launch cadence at Cape Canaveral Space Force Station and NASA’s Kennedy Space Center, located next to one another in Florida. Launches from other US-licensed spaceports, primarily Vandenberg Space Force Base, California, and Rocket Lab’s base at Māhia Peninsula in New Zealand, are also on the rise.

These numbers represent rockets that reached low-Earth orbit. We didn’t include test flights of SpaceX’s Starship rocket in the chart because all of its launches have intentionally flown on suborbital trajectories.

In the second chart, we break down the payload upmass to orbit from SpaceX, other US companies, China, Russia, and other international launch providers.

Launch rates are on a clear upward trend, while SpaceX has launched 86 percent of the world’s total payload mass to orbit since the beginning of 2024. Credit: Stephen Clark/Ars Technica/BryceTech

Will it continue?

It’s a good bet that payload upmass will continue to rise in the coming years, with heavy cargo heading to orbit to further expand SpaceX’s Starlink communications network and build out new megaconstellations from Amazon, China, and others. The US military’s Golden Dome missile defense shield will also have a ravenous appetite for rockets to get it into space.

SpaceX’s Starship megarocket could begin flying to low-Earth orbit next year, and if it does, SpaceX’s preeminence in delivering mass to orbit will remain assured. Starship’s first real payloads will likely be SpaceX’s next-generation Starlink satellites. These larger, heavier, more capable spacecraft will launch 60 at a time on Starship, further stretching SpaceX’s lead in the upmass war.

But Starship’s arrival will come at the expense of the workhorse Falcon 9, which lacks the capacity to haul the next-gen Starlinks to orbit. “This year and next year I anticipate will be the highest Falcon launch rates that we will see,” said Stephanie Bednarek, SpaceX’s vice president of commercial sales, at an industry conference in July.

SpaceX is on pace for between 165 and 170 Falcon 9 launches this year, with 144 flights already in the books for 2025. Last year’s total for Falcon 9 and Falcon Heavy was 134 missions. SpaceX has not announced how many Falcon 9 and Falcon Heavy launches it plans for next year.

Starship is designed to be fully and rapidly reusable, eventually enabling multiple flights per day. But that’s still a long way off, and it’s unknown how many years it might take for Starship to surpass the Falcon 9’s proven launch tempo.

A Starship rocket and Super Heavy booster lift off from Starbase, Texas. Credit: SpaceX

In any case, with Starship’s heavy-lifting capacity and upgraded next-gen satellites, SpaceX could match an entire year’s worth of new Starlink capacity with just two fully loaded Starship flights. Starship will be able to deliver 60 times more Starlink capacity to orbit than a cluster of satellites riding on a Falcon 9.

There’s no reason to believe SpaceX will be satisfied with simply keeping pace with today’s Starlink growth rate. There are emerging market opportunities in connecting satellites with smartphones, space-based computer processing and data storage, and military applications.

Other companies have medium-to-heavy rockets that are either new to the market or soon to debut. These include Blue Origin’s New Glenn, now set to make its second test flight in the coming days, with a reusable booster designed to facilitate a rapid-fire launch cadence.

Despite all of the newcomers, most satellite operators see a shortage of launch capacity on the commercial market. “The industry is likely to remain supply-constrained through the balance of the decade,” wrote Caleb Henry, director of research at the industry analysis firm Quilty Space. “That could pose a problem for some of the many large constellations on the horizon.”

United Launch Alliance’s Vulcan rocket, Rocket Lab’s Neutron, Stoke Space’s Nova, Relativity Space’s Terran R, and Firefly Aerospace and Northrop Grumman’s Eclipse are among the other rockets vying for a bite at the launch apple.

“Whether or not the market can support six medium to heavy lift launch providers from the US alone, plus Starship, is an open question, but for the remainder of the decade launch demand is likely to remain high, presenting an opportunity for one or more new players to establish themselves in the pecking order,” Henry wrote in a post on Quilty’s website.

China’s space program will need more rockets, too. That nation’s two megaconstellations, known as Guowang and Qianfan, will comprise thousands of satellites, requiring a significant uptick in Chinese launches.

Taking all of this into account, the demand curve for access to space is sure to continue its upward trajectory. How companies meet this demand, and with how many discrete departures from Earth, isn’t quite as clear.


Stephen Clark is a space reporter at Ars Technica, covering private space companies and the world’s space agencies. Stephen writes about the nexus of technology, science, policy, and business on and off the planet.



Kimi K2 Thinking

I previously covered Kimi K2, which now has a new thinking version. As I said at the time back in July, you should price in that the thinking version is coming.

Is it the real deal?

That depends on what level counts as the real deal. It’s a good model, sir, by all accounts. But there have been fewer accounts than we would expect if it was a big deal, and it doesn’t fall into any of my use cases.

Kimi.ai: 🚀 Hello, Kimi K2 Thinking!

The Open-Source Thinking Agent Model is here.

🔹 SOTA on HLE (44.9%) and BrowseComp (60.2%)

🔹 Executes up to 200 – 300 sequential tool calls without human interference

🔹 Excels in reasoning, agentic search, and coding

🔹 256K context window

Built as a thinking agent, K2 Thinking marks our latest efforts in test-time scaling — scaling both thinking tokens and tool-calling turns.

K2 Thinking is now live on http://kimi.com in chat mode, with full agentic mode coming soon. It is also accessible via API.

API here, Tech blog here, Weights and code here.

(Pliny jailbreak here.)

It’s got 1T parameters, and Kimi and Kimi K2 have a solid track record, so it’s plausible this could play with the big boys, although the five-month delay in getting to a reasoning model invites some skepticism about whether it can be competitive.

As always, internal benchmark scores can differ greatly from outside benchmark scores, especially for open models. Sometimes this is due to outsiders botching setup, but also inside measurements need to be double checked.

For Humanity’s Last Exam, I see an outside source saying as of November 9 it was in second place on Humanity’s Last Exam at 23.9%, which is very much not 44.9% but still very good.

On writing quality we’ve gotten endorsements for Kimi K2 for a while.

Rohit: Kimi K2 is remarkably good at writing, and unlike all others thinking mode hasn’t degraded its writing ability more.

Morgan: if i recall, on release gpt-5 was the only model where writing quality improved with thinking effort.

Rohit: Alas.

Gary Fung: Kimi has always been a special snowflake on creative writing.

Here’s one part of the explanation of how they got the writing to be so good, which involves self-ranking RL and writing self-play, with a suggestion of some similarities to the training of Claude 3 Opus. In a sense this looks like ‘try to do better, at all.’

On the agentic tool use and general intelligence? I’m more skeptical.

Artificial Analysis has Kimi K2 Thinking at the top of its Agentic Tool Use benchmark, 93 percent to 87 percent, which is a huge gap in context; this is also Kimi’s strongest subset.

As is usually true when people compare open to closed models, this is the open model’s best benchmark, so don’t get carried away, but yes overall it did well on Artificial Analysis, indeed suspiciously well given how little talk I see.

The tool-calling abilities are exciting for an open model, although standard for a closed one. This is a good example of how we look for ways for open models to impress by matching closed abilities in spots; it is also indeed highly useful.

Overall, the Artificial Analysis Intelligence Index has Kimi K2 Thinking at 67, one point behind GPT-5 and ahead of everyone else. Kimi used the most tokens of any model, but total cost was lower than the top closed models, although not dramatically so ($829–$913 for GPT-5, $817 for Sonnet, $380 for Kimi K2), as pricing is $0.60/$2.50 per million input/output tokens, versus $1.25/$10 for GPT-5 and $3/$15 for Sonnet.
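To make the pricing arithmetic concrete, here is a small sketch of how those per-million-token rates turn into a total bill; the token counts in the example are invented for illustration, not the actual Artificial Analysis usage figures.

```python
# Sketch of per-million-token pricing arithmetic.
# Rates are the quoted (input, output) prices in $/million tokens;
# the token counts below are made up for illustration.

PRICES = {
    "kimi-k2-thinking": (0.60, 2.50),
    "gpt-5": (1.25, 10.00),
    "claude-sonnet-4.5": (3.00, 15.00),
}

def run_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Total cost in dollars for a run with the given token counts."""
    in_rate, out_rate = PRICES[model]
    return (input_tokens * in_rate + output_tokens * out_rate) / 1_000_000

# Example: a hypothetical workload of 40M input and 60M output tokens.
for model in PRICES:
    print(f"{model}: ${run_cost(model, 40_000_000, 60_000_000):,.2f}")
```

The point of the comparison is that a token-hungry model can still come out cheaper in total if its per-token rates are low enough.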

Nathan Lambert is impressed, relying on secondary information (‘seems like a joy to use’), and offers thoughts.

He notes that yes, labs start out targeting benchmarks and then transition to actually targeting useful things, such as how K2 Thinking was post-trained in 4-bit precision to prepare for realistic tasks and benchmarked the same way. I agree that’s pretty cool.

It does seem plausible that Kimi K2 is still in the ‘target the benchmarks’ phase in most places, although not in creative writing. By default, I expect such models to punch ‘below their benchmark-implied weight’ on practical tasks.

For now we don’t have many other outside scores to work with and feedback is light.

Simeon: is Kimi K2 benchmaxxing or are they actually SOTA while training on potatoes?

Prinz: In my testing (for my use cases, which have nothing to do with math and coding), K2-Thinking is obviously worse than GPT-5 Thinking, but by a relatively modest margin. If I had no access to other models, I would happily use K2-Thinking and it wouldn’t feel like a huge downgrade.

ahtoshkaa: I have a pretty sophisticated companion app that uses about 5-10K of varied, information dense context. So the model has to properly parse this information and have very good writing skills. kimi-k2-thinking is absolute ass. similarly to the new OpenAI model – Polaris Alpha.

There’s a growing rhetorical pressure, or marketing-style pressure, as the ‘benchmark gaps’ close. Chinese labs can point to numbers that say they are ‘just as good’ or almost as good, and for many purposes ‘good enough’ is good enough. And many people (including the likes of David Sacks) point to GPT-5 and similar as showing progress isn’t impressive or scary. But as Nathan points out, we now see releases like Claude 4 where the benchmark gains look small but the real-world gains are large, and I would add GPT-5 (and Sonnet 4.5) to that category as well.

Teortaxes: It’s token-hungry, slow-ish, and sometimes rough around the edges. Generally though it’s a jump for open/Chinese models, in the league of Sonnet 4.5 and GPT-5 (maybe -mini depending on task) and a genuinely strong SWE agent. Legitimate alternative, not “but look at the price.”

It’s baked in that the open alternatives are pretty much always going to be rough around the edges, and get evaluated largely in terms of their peak relative performance areas. This is still high praise, putting Kimi in striking distance of the current big two.

Havard Isle has it coming in at a solid 42.1% on WeirdML, matching Opus 4.1.

Here’s something cool:

Pawal Azczesny: Kimi K2 Thinking is using systematically (on its own, without prompting) some of the debiasing strategies known from cognitive sciences. Very impressive. I didn’t see any other model doing that. Well done @Kimi_Moonshot.

It goes beyond “think step by step”. For instance it applied pre-mortem analysis, which is not frequently used. Or it exaggerates claims to see if the whole structure still stands on its own. Pretty neat. Other models need to be instructed to do this.

Steve Hsu got some good math results.

Other notes:

MinusGix: I’ve found it to be better than GPT-5 at understanding & explaining type-theory concepts. Though as usual with Kimi it writes eloquently enough that it is harder to tell when it is bullshitting compared to GPT-5.

Emerson Kimura: Did a few quick text tests, and it seemed comparable to GPT-5

Ian Pitchford: It’s very thorough; few hallucinations.

FredipusRex: Caught it hallucinating sources on Deep Research.

Lech Mazur: Sorry to report, but Kimi K2 Thinking is entering reasoning loops and failing to produce answers for many Extended Connections benchmark questions (double-checked using https://platform.moonshot.ai/playground, so it’s not an API call issue).

The safety protocols? The what now?

David Manheim: It’s very willing to give detailed chemical weapons synthesis instructions and advice, including for scaling production and improving purity, and help on how to weaponize it for use in rockets – with only minimal effort on my part to circumvent refusals.

Two of the three responses to that were ‘good news’ and ‘great. I mean it too.’ So yeah, AI is going to go great, I can tell.

The quiet is strange because this is by all accounts the strongest open model, the strongest Chinese model, and a rival for the best agentic or tool-use model overall. Yet I don’t see much excitement, or much feedback at all, either positive or negative.

There’s no question Kimi K2 was impressive, and that Kimi K2 Thinking is also an impressive model, even assuming it underperforms its numbers. It’s good enough that it will often be worth testing it out on your use cases and seeing if it’s right for you. My guess is it will rarely be right unless you are highly price conscious, but we’ll see.


Kimi K2 Thinking Read More »

you-won’t-believe-the-excuses-lawyers-have-after-getting-busted-for-using-ai

You won’t believe the excuses lawyers have after getting busted for using AI


I got hacked; I lost my login; it was a rough draft; toggling windows is hard.

Credit: Aurich Lawson | Getty Images

Amid what one judge called an “epidemic” of fake AI-generated case citations bogging down courts, some common excuses are emerging from lawyers hoping to dodge the most severe sanctions for filings deemed misleading.

Using a database compiled by French lawyer and AI researcher Damien Charlotin, Ars reviewed 23 cases where lawyers were sanctioned for AI hallucinations. In many, judges noted that the simplest path to avoid or diminish sanctions was to admit that AI was used as soon as it’s detected, act humble, self-report the error to relevant legal associations, and voluntarily take classes on AI and law. But not every lawyer takes the path of least resistance, Ars’ review found, with many instead offering excuses that no judge found credible. Some even lie about their AI use, judges concluded.

Since 2023—when fake AI citations started being publicized—the most popular excuse has been that the lawyer didn’t know AI was used to draft a filing.

Sometimes that means arguing that you didn’t realize you were using AI, as in the case of a California lawyer who got stung by Google’s AI Overviews, which he claimed he took for typical Google search results. Most often, lawyers using this excuse tend to blame an underling, but clients have been blamed, too. A Texas lawyer this month was sanctioned after deflecting so much that the court eventually had to put his client on the stand after he revealed she played a significant role in drafting the errant filing.

“Is your client an attorney?” the court asked.

“No, not at all your Honor, just was essentially helping me with the theories of the case,” the lawyer said.

Another popular dodge comes from lawyers who feign ignorance that chatbots are prone to hallucinating facts.

Recent cases suggest this excuse may be mutating into variants. Last month, a sanctioned Oklahoma lawyer admitted that he didn’t expect ChatGPT to add new citations when all he asked the bot to do was “make his writing more persuasive.” And in September, a California lawyer got in a similar bind—and was sanctioned a whopping $10,000, a fine the judge called “conservative.” That lawyer had asked ChatGPT to “enhance” his briefs, “then ran the ‘enhanced’ briefs through other AI platforms to check for errors,” neglecting to ever read the “enhanced” briefs.

Neither of those tired old excuses holds much weight today, especially in courts that have drawn up guidance to address AI hallucinations. But rather than quickly acknowledge their missteps, as courts are begging lawyers to do, several lawyers appear to have gotten desperate. Ars found a bunch of them citing common tech issues as the reason for citing fake cases.

When in doubt, blame hackers?

For an extreme case, look to a New York City civil court, where a lawyer, Innocent Chinweze, first admitted to using Microsoft Copilot to draft an errant filing, then bizarrely pivoted to claim that the AI citations were due to malware found on his computer.

Chinweze said he had created a draft with correct citations but then got hacked, allowing bad actors “unauthorized remote access” to supposedly add the errors in his filing.

The judge was skeptical, describing the excuse as an “incredible and unsupported statement,” particularly since there was no evidence of the prior draft existing. Instead, Chinweze asked to bring in an expert to testify that the hack had occurred, requesting to end the proceedings on sanctions until after the court weighed the expert’s analysis.

The judge, Kimon C. Thermos, didn’t have to weigh this argument, however, because after the court broke for lunch, the lawyer once again “dramatically” changed his position.

“He no longer wished to adjourn for an expert to testify regarding malware or unauthorized access to his computer,” Thermos wrote in an order issuing sanctions. “He retreated” to “his original position that he used Copilot to aid in his research and didn’t realize that it could generate fake cases.”

Possibly more galling to Thermos than the lawyer’s weird malware argument, though, was a document that Chinweze filed on the day of his sanctions hearing. That document included multiple summaries preceded by this text, the judge noted:

Some case metadata and case summaries were written with the help of AI, which can produce inaccuracies. You should read the full case before relying on it for legal research purposes.

Thermos admonished Chinweze for continuing to use AI recklessly. He blasted the filing as “an incoherent document that is eighty-eight pages long, has no structure, contains the full text of most of the cases cited,” and “shows distinct indications that parts of the discussion/analysis of the cited cases were written by artificial intelligence.”

Ultimately, Thermos ordered Chinweze to pay $1,000, the most typical fine lawyers received in the cases Ars reviewed. The judge then took an extra non-monetary step to sanction Chinweze, referring the lawyer to a grievance committee, “given that his misconduct was substantial and seriously implicated his honesty, trustworthiness, and fitness to practice law.”

Ars could not immediately reach Chinweze for comment.

Toggling windows on a laptop is hard

In Alabama, an attorney named James A. Johnson made an “embarrassing mistake,” he said, primarily because toggling windows on a laptop is hard, US District Judge Terry F. Moorer noted in an October order on sanctions.

Johnson explained that he had accidentally used an AI tool that he didn’t realize could hallucinate. It happened while he was “at an out-of-state hospital attending to the care of a family member recovering from surgery.” He rushed to draft the filing, he said, because he got a notice that his client’s conference had suddenly been “moved up on the court’s schedule.”

“Under time pressure and difficult personal circumstance,” Johnson explained, he decided against using Fastcase, a research tool provided by the Alabama State Bar, to research the filing. Working on his laptop, he opted instead to use “a Microsoft Word plug-in called Ghostwriter Legal” because “it appeared automatically in the sidebar of Word while Fastcase required opening a separate browser to access through the Alabama State Bar website.”

To Johnson, it felt “tedious to toggle back and forth between programs on [his] laptop with the touchpad,” and that meant he “unfortunately fell victim to the allure of a new program that was open and available.”

Moorer seemed unimpressed by Johnson’s claim that he understood tools like ChatGPT were unreliable but didn’t expect the same from other AI legal tools—particularly since “information from Ghostwriter Legal made it clear that it used ChatGPT as its default AI program,” Moorer wrote.

The lawyer’s client was similarly horrified, deciding to drop Johnson on the spot, even though that risked “a significant delay of trial.” Moorer noted that Johnson seemed shaken by his client’s abrupt decision, evidenced by “his look of shock, dismay, and display of emotion.”

Moorer further noted that Johnson had been paid using public funds while seemingly letting AI do his homework. “The harm is not inconsequential as public funds for appointed counsel are not a bottomless well and are limited resource,” the judge wrote in justifying a more severe fine.

“It has become clear that basic reprimands and small fines are not sufficient to deter this type of misconduct because if it were, we would not be here,” Moorer concluded.

Ruling that Johnson’s reliance on AI was “tantamount to bad faith,” Moorer imposed a $5,000 fine. The judge also would have “considered potential disqualification, but that was rendered moot” since Johnson’s client had already dismissed him.

Asked for comment, Johnson told Ars that “the court made plainly erroneous findings of fact and the sanctions are on appeal.”

Plagued by login issues

As a lawyer in Georgia tells it, sometimes fake AI citations may be filed because a lawyer accidentally filed a rough draft instead of the final version.

Other lawyers claim they turn to AI as needed when they have trouble accessing legal tools like Westlaw or LexisNexis.

For example, in Iowa, a lawyer told an appeals court that she regretted relying on “secondary AI-driven research tools” after experiencing “login issues with her Westlaw subscription.” Although the court was “sympathetic to issues with technology, such as login issues,” the lawyer was sanctioned, primarily because she only admitted to using AI after the court ordered her to explain her mistakes. In her case, however, she got to choose between paying a minimal $150 fine or attending “two hours of legal ethics training particular to AI.”

Less sympathetic was a lawyer who got caught lying about the AI tool she blamed for inaccuracies, a Louisiana case suggested. In that case, a judge demanded to see the research history after a lawyer claimed that AI hallucinations came from “using Westlaw Precision, an AI-assisted research tool, rather than Westlaw’s standalone legal database.”

It turned out that the lawyer had outsourced the research, relying on a “currently suspended” lawyer’s AI citations, and had only “assumed” the lawyer’s mistakes were from Westlaw’s AI tool. It’s unclear what tool was actually used by the suspended lawyer, who likely lost access to a Westlaw login, but the judge ordered a $1,000 penalty after the lawyer who signed the filing “agreed that Westlaw did not generate the fabricated citations.”

Judge warned of “serial hallucinators”

Another lawyer, William T. Panichi in Illinois, has been sanctioned at least three times, Ars’ review found.

In response to his initial penalties ordered in July, he admitted to being tempted by AI while he was “between research software.”

In that case, the court was frustrated to find that the lawyer had contradicted himself, and it ordered more severe sanctions as a result.

Panichi “simultaneously admitted to using AI to generate the briefs, not doing any of his own independent research, and even that he ‘barely did any personal work [him]self on this appeal,’” the court order said, while also defending charging a higher fee—supposedly because this case “was out of the ordinary in terms of time spent” and his office “did some exceptional work” getting information.

The court deemed this AI misuse so bad that Panichi was ordered to disgorge a “payment of $6,925.62 that he received” in addition to a $1,000 penalty.

“If I’m lucky enough to be able to continue practicing before the appellate court, I’m not going to do it again,” Panichi told the court in July, just before getting hit with two more rounds of sanctions in August.

Panichi did not immediately respond to Ars’ request for comment.

When AI-generated hallucinations are found, penalties are often paid to the court, the other parties’ lawyers, or both, depending on whose time and resources were wasted fact-checking fake cases.

Lawyers seem more likely to argue against paying sanctions to the other parties’ attorneys, hoping to keep sanctions as low as possible. One lawyer even argued that “it only takes 7.6 seconds, not hours, to type citations into LexisNexis or Westlaw,” while seemingly neglecting the fact that she did not take those precious seconds to check her own citations.

The judge in the case, Nancy Miller, was clear that “such statements display an astounding lack of awareness of counsel’s obligations,” noting that “the responsibility for correcting erroneous and fake citations never shifts to opposing counsel or the court, even if they are the first to notice the errors.”

“The duty to mitigate the harms caused by such errors remains with the signor,” Miller said. “The sooner such errors are properly corrected, either by withdrawing or amending and supplementing the offending pleadings, the less time is wasted by everyone involved, and fewer costs are incurred.”

Texas US District Judge Marina Garcia Marmolejo agreed, explaining that even more time is wasted determining how other judges have responded to fake AI-generated citations.

“At one of the busiest court dockets in the nation, there are scant resources to spare ferreting out erroneous AI citations in the first place, let alone surveying the burgeoning caselaw on this subject,” she said.

At least one Florida court was “shocked, shocked” to find that a lawyer was refusing to pay what the other party’s attorneys said they were owed after misusing AI. The lawyer in that case, James Martin Paul, asked to pay less than a quarter of the fees and costs owed, arguing that Charlotin’s database showed he might otherwise owe penalties that “would be the largest sanctions paid out for the use of AI generative case law to date.”

But caving to Paul’s arguments “would only benefit serial hallucinators,” the Florida court found. Ultimately, Paul was sanctioned more than $85,000 for what the court said was “far more egregious” conduct than other offenders in the database, chastising him for “repeated, abusive, bad-faith conduct that cannot be recognized as legitimate legal practice and must be deterred.”

Paul did not immediately respond to Ars’ request to comment.

Michael B. Slade, a US bankruptcy judge in Illinois, seems to be done weighing excuses, calling on all lawyers to stop taking AI shortcuts that are burdening courts.

“At this point, to be blunt, any lawyer unaware that using generative AI platforms to do legal research is playing with fire is living in a cloud,” Slade wrote.


Ashley is a senior policy reporter for Ars Technica, dedicated to tracking social impacts of emerging policies and new technologies. She is a Chicago-based journalist with 20 years of experience.

You won’t believe the excuses lawyers have after getting busted for using AI Read More »

apple-tv-execs-dismiss-introducing-an-ad-tier,-buying-warner-bros.-discovery

Apple TV execs dismiss introducing an ad tier, buying Warner Bros. Discovery

Focused on original content

Another obvious way to grow Apple TV is through more subscribers. With talk of Warner Bros. Discovery considering a sale, it’s worth wondering if Apple TV may try to grow through acquisition. But the execs Screen International spoke with seemed focused on building out Apple TV’s library with originals. Cue noted that “at least in the timeframe that we’re thinking about right now, we’re not looking at licensing any content or adding anything to our service.”

“We’re building an all-original service; we’re not building on the back of pre-existing IP or library,” Jamie Erlicht, one of Apple’s heads of worldwide video, said.

More directly, when asked if Apple might buy Warner Bros., A24, or Disney, Cue pointed out that Apple hasn’t historically done “a lot of major acquisitions.”

“We do very small acquisitions in general, not related to Apple TV, so I don’t see that happening because we like what we’re doing,” Cue said.

Since its 2019 debut, some have questioned whether Apple TV is an authentic attempt to improve streaming options for customers, a “vanity project,” as Screen International put it, or merely a tool for getting people to buy other Apple products. Naturally, the interviewed executives claimed that the service is built on a commitment to distributing unique and premium shows and movies.

The interview provided more insight on how Apple TV leadership defines the latter. Zack Van Amburg, one of Apple’s heads of worldwide video, said:

A core tenet of everything Apple does is the notion that humanity needs to be at the center of it, and that’s everything from app design to hardware engineering, to everything in between. We try to think a little more deeply about that.

Our shows and our movies tend to be about the emotional experience, the stakes involved, even when we’re doing a comedy.

Apple TV execs dismiss introducing an ad tier, buying Warner Bros. Discovery Read More »

runaway-black-hole-mergers-may-have-built-supermassive-black-holes

Runaway black hole mergers may have built supermassive black holes

The researchers used cosmological simulations to recreate the first 700 million years of cosmic history, focusing on the formation of a single dwarf galaxy. In their virtual galaxy, waves of stars were born in short, explosive bursts as cold gas clouds collapsed inside a dark matter halo. Instead of a single starburst episode followed by a steady drizzle of star formation as Garcia expected, there were two major rounds of stellar birth. Whole swarms of stars flared to life like Christmas tree lights.

“The early Universe was an incredibly crowded place,” Garcia said. “Gas clouds were denser, stars formed faster, and in those environments, it’s natural for gravity to gather stars into these tightly bound systems.”

Those clusters started out scattered around the galaxy but fell in toward the center like water swirling down a drain. Once there, they merged to create one megacluster, called a nuclear star cluster (so named because it lies at the nucleus of the galaxy). The young galactic heart shone with the light of a million suns and may have set the stage for a supermassive black hole to form.

A simulation of the formation of the super-dense star clusters.

A seemingly simple tweak was needed to make the simulation more precise than previous ones. “Most simulations simplify things to make calculations more practical, but then you sacrifice realism,” Garcia said. “We used an improved model that allowed star formation to vary depending on local conditions rather than just go at a constant rate like with previous models.”

Using the University of Maryland’s supercomputing facility Zaratan, Garcia accomplished in six months what would have taken 12 years on a MacBook.

Some clouds converted as much as 80 percent of their gas into stars—a ferocious rate compared to the 2 percent typically seen in nearby galaxies today. The clouds sparkled to life, becoming clusters of newborn stars held together by their mutual gravity and lighting a new pathway for supermassive black holes to form extremely early in the Universe.

Chicken or egg?

Most galaxies, including our own, are anchored by a nuclear star cluster nestled around a supermassive black hole. But the connection between the two has been a bit murky—did the monster black hole form and then draw stars close, or did the cluster itself give rise to the black hole?

Runaway black hole mergers may have built supermassive black holes Read More »

the-running-man’s-final-trailer-amps-up-the-high-octane-action

The Running Man’s final trailer amps up the high-octane action

It’s shaping up to be an excellent season for Stephen King adaptations. In September, we got The Long Walk, an excellent (though harrowing) adaptation of King’s 1979 Richard Bachman novel. Last month, HBO debuted its new series IT: Welcome to Derry, which explores the mythology and origins of Pennywise the killer clown. And this Friday is the premiere of The Running Man, director Edgar Wright’s (Shaun of the Dead, Baby Driver, Last Night in Soho) take on King’s novel of the same name. So naturally Paramount has released a final trailer to lure us to the theater.

As previously reported, the 1987 action film starring Schwarzenegger was only loosely based on King’s novel, preserving the basic concept and very little else in favor of more sci-fi gadgetry and high-octane action. It was a noisy, entertaining romp—and very late ’80s—but it lacked King’s subtler satirical tone. Wright expressed interest in adapting his own version of The Running Man in 2017, and Paramount greenlit the project four years later. Wright and co-screenwriter Michael Bacall envisioned their film as less of a remake and more of a faithful adaptation of King’s original novel. (We’ll see if that faithfulness extends to the novel’s bleak ending.)

Per the official premise:

In a near-future society, The Running Man is the top-rated show on television—a deadly competition where contestants, known as Runners, must survive 30 days while being hunted by professional assassins, with every move broadcast to a bloodthirsty public and each day bringing a greater cash reward. Desperate to save his sick daughter, working-class Ben Richards (Glen Powell) is convinced by the show’s charming but ruthless producer, Dan Killian (Josh Brolin), to enter the game as a last resort. But Ben’s defiance, instincts, and grit turn him into an unexpected fan favorite—and a threat to the entire system. As ratings skyrocket, so does the danger, and Ben must outwit not just the Hunters, but a nation addicted to watching him fall.

In addition to Powell and Brolin, the cast includes Lee Pace as lead Hunter Evan McCone; Jayme Lawson as Ben’s wife, Sheila; Colman Domingo as Bobby Thompson, game show host; Michael Cera as the rebel Bradley Throckmorton; William H. Macy as a man who aids Ben; David Zayas as Richard Manuel; Emilia Jones as Amelia, a hostage civilian; Karl Glusman as a Hunter; and Katy O’Brian and Daniel Ezra as two other contestants on the show.

The Running Man’s final trailer amps up the high-octane action Read More »