Figuring out why AIs get flummoxed by some games


When winning depends on intuiting a mathematical function, AIs come up short.

Oddly, the training methods that work great for chess fail on far simpler games. Credit: SimpleImages

With its Alpha series of game-playing AIs, Google’s DeepMind group seemed to have found a way for its AIs to tackle any game, mastering games like chess and Go by repeatedly playing against themselves during training. But then some odd things happened: people started identifying Go positions that would lose against relative newcomers to the game but easily defeat a similar Go-playing AI.

While beating an AI at a board game may seem relatively trivial, it can help us identify failure modes of AIs, or ways in which we can improve their training to avoid having them develop these blind spots in the first place. Both may become critical as people rely on AI input for a growing range of problems.

A recent paper published in Machine Learning describes an entire category of games where the method used to train AlphaGo and AlphaZero fails. The games in question can be remarkably simple, as exemplified by the one the researchers worked with: Nim, which involves two players taking turns removing matchsticks from a pyramid-shaped board until one is left without a legal move.

Impartiality

Nim involves setting up a set of rows of matchsticks, with the top row having a single match, and every row below it having two more than the one above. This creates a pyramid-shaped board. Two players then take turns removing matchsticks from the board, choosing a row and then removing anywhere from one item to the entire contents of the row. The game goes until there are no legal moves left. It’s a simple game that can easily be taught to children.
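The setup described above is small enough to write down directly. The following is a minimal sketch, not code from the paper; the function names (`make_board`, `legal_moves`, `apply_move`) are made up for illustration.

```python
from typing import List, Tuple

def make_board(rows: int) -> List[int]:
    """Row i (counting from 0) holds 2*i + 1 matchsticks: 1, 3, 5, ..."""
    return [2 * i + 1 for i in range(rows)]

def legal_moves(board: List[int]) -> List[Tuple[int, int]]:
    """Every (row, count) pair: take `count` sticks, from 1 up to the whole row."""
    return [(row, count)
            for row, size in enumerate(board)
            for count in range(1, size + 1)]

def apply_move(board: List[int], move: Tuple[int, int]) -> List[int]:
    """Return a new board with `count` sticks removed from `row`."""
    row, count = move
    if not 0 <= row < len(board) or not 1 <= count <= board[row]:
        raise ValueError("illegal move")
    new_board = list(board)
    new_board[row] -= count
    return new_board

board = make_board(3)                    # [1, 3, 5]
print(board, len(legal_moves(board)))    # 1 + 3 + 5 = 9 possible opening moves
print(apply_move(board, (2, 5)))         # clearing the bottom row leaves [1, 3, 0]
```

A board with no sticks left has an empty list of legal moves, which is the end-of-game condition.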

It also turns out to be a critical example of an entire category of rule sets that define “impartial games.” These differ from something like chess, where each player has their own set of pieces; in impartial games, the two players share the same pieces and are bound by the same set of rules. Nim’s importance stems from a theorem showing that any position in an impartial game can be represented by a configuration of a Nim pyramid. That means that if something applies to Nim, it applies to all impartial games.
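The theorem in question is usually known as the Sprague-Grundy theorem, and the translation into Nim can be computed mechanically. As a rough illustration, here is a sketch for a toy impartial game that is an assumption of this example, not something from the paper: a single pile from which a player may take one, two, or three items. Each pile size gets a "Grundy value," which is the size of the Nim row it behaves like.

```python
from functools import lru_cache

def mex(values):
    """Minimum excluded value: the smallest non-negative integer not in `values`."""
    m = 0
    while m in values:
        m += 1
    return m

@lru_cache(maxsize=None)
def grundy(pile: int) -> int:
    """Size of the Nim row equivalent to `pile` in the toy take-1-2-or-3 game."""
    reachable = {grundy(pile - take) for take in (1, 2, 3) if pile - take >= 0}
    return mex(reachable)

print([grundy(n) for n in range(9)])  # [0, 1, 2, 3, 0, 1, 2, 3, 0]
```

A value of zero marks a position where the player to move loses against perfect play, so in this toy game the piles to avoid being handed are the multiples of four.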

One of the distinctive features of Nim and other impartial games is that, at any point in the game, it’s easy to evaluate the board and determine which player has the potential to win. Put another way, you can size up the board and know that, if you play the optimal moves from then on, you will win. Doing so just requires feeding the board’s configuration into a parity function, which does the math to tell you whether you’re winning.

(Obviously, the person who is currently winning could play a suboptimal move and end up losing. And the exact series of optimal moves is not determined until the end, since they will depend on exactly what your opponent does.)
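For Nim, the standard form of that calculation is the "nim-sum": write each row's size in binary and XOR them together, which amounts to checking the parity of every binary digit column. The sketch below assumes the rule described earlier, where the player left without a legal move loses; the function names are illustrative.

```python
from functools import reduce
from operator import xor

def nim_sum(board):
    """XOR of all row sizes, i.e. the parity of each binary digit column."""
    return reduce(xor, board, 0)

def player_to_move_wins(board):
    """With perfect play, the player to move wins exactly when the nim-sum is nonzero."""
    return nim_sum(board) != 0

print(nim_sum([1, 3, 5]), player_to_move_wins([1, 3, 5]))  # 7 True
print(nim_sum([1, 2, 3]), player_to_move_wins([1, 2, 3]))  # 0 False
```

The whole evaluation is a handful of bit operations, which is what makes the training failure described below stand out.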

The new work, done by Bei Zhou and Soren Riis, asks a simple question: What happens if you take the AlphaGo approach to training an AI to play games, and try to develop a Nim-playing AI? Put differently: They asked whether an AI could develop a representation of a parity function purely by playing itself in Nim.

When self-teaching fails

AlphaZero, the chess-playing version, was trained from only the rules of chess. By playing itself, it can associate different board configurations with a probability of winning. To keep it from getting stuck in ruts, there’s also a random sampling element that allows it to continue exploring new territory. And, once it can identify a limited number of high-value moves, it’s able to explore deeper into future possibilities that arise from those moves. The more games it plays, the higher the probability that it will be able to assign values to potential board configurations that could arise from a given position (although the benefits of more games tend to tail off after a sufficient number are played).
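The balance between trusting known good moves and trying unfamiliar ones is often written as a "PUCT" score in descriptions of AlphaZero-style search. The sketch below follows that commonly described form; the class name, the sample numbers, and the exploration constant `c_puct=1.5` are assumptions made for illustration, not values from the paper.

```python
import math
from dataclasses import dataclass

@dataclass
class MoveStats:
    prior: float             # the network's initial probability for this move
    visits: int = 0          # how many times search has tried the move
    value_sum: float = 0.0   # sum of the outcomes seen after the move

    @property
    def mean_value(self) -> float:
        return self.value_sum / self.visits if self.visits else 0.0

def puct_score(stats: MoveStats, parent_visits: int, c_puct: float = 1.5) -> float:
    """Average outcome so far, plus a bonus that shrinks as the move gets visited."""
    bonus = c_puct * stats.prior * math.sqrt(parent_visits) / (1 + stats.visits)
    return stats.mean_value + bonus

def select_move(children: dict, c_puct: float = 1.5):
    """Pick the move with the highest PUCT score among a position's children."""
    parent_visits = sum(s.visits for s in children.values())
    return max(children, key=lambda m: puct_score(children[m], parent_visits, c_puct))

# An untried move gets a large bonus, so it is explored before a decent known one.
print(select_move({"known": MoveStats(0.5, 10, 6.0), "untried": MoveStats(0.5)}))
# Once both are well explored, the better average outcome wins.
print(select_move({"good": MoveStats(0.5, 100, 90.0), "bad": MoveStats(0.5, 100, 10.0)}))
```

The key point for what follows is that both terms depend on the network's estimates being informative. If every move looks about equally good, the search has nothing to steer by.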

In Nim, there is a limited number of optimal moves for a given board configuration. If you don’t play one of them, then you essentially cede control to your opponent, who can go on to win if they play nothing but optimal moves. And again, the optimal moves can be identified by evaluating a mathematical parity function.

So, there are reasons to think that the training process that worked for chess might not be effective for Nim. The surprise is just how bad it actually was. Zhou and Riis found that for a Nim board with five rows, the AI got good fairly quickly and was still improving after 500 training iterations. Adding just one more row, however, caused the rate of improvement to slow dramatically. And, for a seven-row board, gains in performance had essentially stopped by the time the AI had played itself 500 times.

To better illustrate the problem, the researchers swapped out the subsystem that suggested potential moves with one that operated randomly. On a seven-row Nim board, the performance of the trained and randomized versions was indistinguishable over 500 training iterations. Essentially, once the board got large enough, the system was incapable of learning from observing game outcomes. The initial state of the seven-row configuration has three potential moves that are all consistent with an ultimate win. Yet when the trained move evaluator of their system was asked to check all potential moves, it evaluated every single one as roughly equivalent.
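Those three opening moves can be checked by hand with the nim-sum calculation. The sketch below is illustrative rather than the researchers' code: it lists every move that leaves the opponent facing a nim-sum of zero, assuming the seven-row board holds 1, 3, 5, 7, 9, 11, and 13 matchsticks.

```python
from functools import reduce
from operator import xor

def winning_moves(board):
    """Moves (row, sticks_to_take) that leave the opponent a nim-sum of zero."""
    total = reduce(xor, board, 0)
    moves = []
    for row, size in enumerate(board):
        target = size ^ total          # row size that would zero out the nim-sum
        if target < size:              # only reachable by removing sticks
            moves.append((row, size - target))
    return moves

seven_rows = [2 * i + 1 for i in range(7)]    # 1, 3, 5, 7, 9, 11, 13
print(sum(seven_rows))                        # 49 legal opening moves in total
print(winning_moves(seven_rows))              # [(4, 3), (5, 7), (6, 11)]
```

Only three of the 49 possible openings keep the first player on a winning path, and each involves one of the three largest rows.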

The researchers conclude that Nim requires players to learn the parity function to play effectively. And the training procedure that works so well for chess and Go is incapable of doing so.

Not just Nim

One way to view the conclusion is that Nim (and by extension, all impartial games) is just weird. But Zhou and Riis also found some signs that similar problems could also crop up in chess-playing AIs that were trained in this manner. They identified several “wrong” chess moves—ones that missed a mating attack or threw an end-game—that were initially rated highly by the AI’s board evaluator. It was only because the software took a number of additional branches out several moves into the future that it was able to avoid these gaffes.

For many Nim board configurations, the optimal branches that lead to a win have to be played out to the end of the game to demonstrate their value, so this sort of avoidance of a potential gaffe is much harder to manage. And they noted that chess players have found mating combinations that require long chains of moves that chess-playing software often misses entirely. They suggest that the issue isn’t that chess doesn’t have the same issues, but rather that Nim-like board configurations are generally rare in chess. Presumably, similar things apply to Go, as illustrated by the odd weaknesses of AIs in that game.

“AlphaZero excels at learning through association,” Zhou and Riis argue, “but fails when a problem requires a form of symbolic reasoning that cannot be implicitly learned from the correlation between game states and outcomes.” In other words, even if the rules governing a game enable simple rules for deciding what to do, we can’t expect Alpha-style training to enable an AI to identify them. The result is what they call a “tangible, catastrophic failure mode.”

Why does this matter? Lots of people are exploring the utility of AIs for math problems, which often require the sort of symbolic reasoning involved in extrapolating from a board configuration to general rules such as the parity function. While it may not be obvious how to train an AI to do that, it can be useful to know which approaches will clearly not work.

Machine Learning, 2026. DOI: 10.1007/s10994-026-06996-1

John Timmer is Ars Technica’s science editor. He has a Bachelor of Arts in Biochemistry from Columbia University, and a Ph.D. in Molecular and Cell Biology from the University of California, Berkeley. When physically separated from his keyboard, he tends to seek out a bicycle, or a scenic location for communing with his hiking boots.

DeepMind’s latest: An AI for handling mathematical proofs


AlphaProof can handle math challenges but needs a bit of help right now.

Computers are extremely good with numbers, but they haven’t gotten many human mathematicians fired. Until recently, they could barely hold their own in high school-level math competitions.

But now Google’s DeepMind team has built AlphaProof, an AI system that matched silver medalists’ performance at the 2024 International Mathematical Olympiad, scoring just one point short of gold at the most prestigious high school math competition in the world. And that’s kind of a big deal.

True understanding

The reason computers fared poorly in math competitions is that, while they far surpass humanity’s ability to perform calculations, they are not really that good at the logic and reasoning that is needed for advanced math. Put differently, they are good at performing calculations really quickly, but they usually suck at understanding why they’re doing them. While something like addition seems simple, humans can do semi-formal proofs based on definitions of addition or go for fully formal Peano arithmetic that defines the properties of natural numbers and operations like addition through axioms.

To perform a proof, humans have to understand the very structure of mathematics. The way mathematicians build proofs, how many steps they need to arrive at the conclusion, and how cleverly they design those steps are a testament to their brilliance, ingenuity, and mathematical elegance. “You know, Bertrand Russell published a 500-page book to prove that one plus one equals two,” says Thomas Hubert, a DeepMind researcher and lead author of the AlphaProof study.

DeepMind’s team wanted to develop an AI that understood math at this level. The work started with solving the usual AI problem: the lack of training data.

Math problems translator

Large language models that power AI systems like ChatGPT learn from billions upon billions of pages of text. Because there are texts on mathematics in their training databases (all the handbooks and works of famous mathematicians), they show some level of success in proving mathematical statements. But they are limited by how they operate: They rely on using huge neural nets to predict the next word or token in sequences generated in response to user prompts. Their reasoning is statistical by design, which means they simply return answers that “sound” right.

DeepMind didn’t need the AI to “sound” right—that wasn’t going to cut it in high-level mathematics. They needed their AI to “be” right, to guarantee absolute certainty. That called for an entirely new, more formalized training environment. To provide that, the team used a software package called Lean.

Lean is a computer program that helps mathematicians write precise definitions and proofs. It relies on a precise, formal programming language that’s also called Lean, which mathematical statements can be translated into. Once the translated or formalized statement is uploaded to the program, it can check if it is correct and get back with responses like “this is correct,” “something is missing,” or “you used a fact that is not proved yet.”
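To give a sense of what that looks like in practice, here is a small, self-contained Lean 4 sketch. It is not taken from AlphaProof or its training data. It defines natural numbers and addition from scratch in the Peano style (the names `MyNat`, `add`, `one`, and `two` are invented for this example) and then states two facts for Lean to check.

```lean
-- Natural numbers built Peano-style: zero, and the successor of any number.
inductive MyNat where
  | zero : MyNat
  | succ : MyNat → MyNat

namespace MyNat

-- Addition defined by recursion on the second argument.
def add : MyNat → MyNat → MyNat
  | a, .zero   => a
  | a, .succ b => .succ (add a b)

def one : MyNat := .succ .zero
def two : MyNat := .succ one

-- Accepted as-is: both sides compute to the same term.
theorem one_add_one : add one one = two := rfl

-- Needs a real argument: induction on n, using the definition of add.
theorem zero_add (n : MyNat) : add .zero n = n := by
  induction n with
  | zero => rfl
  | succ k ih => simp [add, ih]

end MyNat
```

If the induction step were left out or replaced with a placeholder, Lean would flag the proof as incomplete rather than accept it, which is the kind of strict feedback the team was after.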

The problem was, most mathematical statements and proofs that can be found online are written in natural language, like “let X be the set of natural numbers that…”, and the number of statements written in Lean was rather limited. “The major difficulty of working with formal languages is that there’s very little data,” Hubert says. To get around this, the researchers trained a Gemini large language model to translate mathematical statements from natural language to Lean. The model worked like an automatic formalizer and produced about 80 million formalized mathematical statements.

It wasn’t perfect, but the team managed to use that to their advantage. “There are many ways you can capitalize on approximate translations,” Hubert claims.

Learning to think

The idea DeepMind had for AlphaProof was to use the architecture the team used in their chess-, Go-, and shogi-playing AlphaZero AI system. Building proofs in Lean, and mathematics in general, was supposed to be just another game to master. “We were trying to learn this game through trial and error,” Hubert says. Imperfectly formalized problems offered a great opportunity for making errors. In its learning phase, AlphaProof was simply proving and disproving the problems it had in its database. If something was translated poorly, figuring out that something wasn’t right was a useful form of exercise.

Just like AlphaZero, AlphaProof in most cases used two main components. The first was a huge neural net with a few billion parameters that learned to work in the Lean environment through trial and error. It was rewarded for each proven or disproven statement and penalized for each reasoning step it took, which was a way of incentivizing short, elegant proofs.
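The incentive can be illustrated with a toy scoring rule. The numbers here are assumptions chosen for the example (a reward of 1 for settling a statement and a penalty of 0.01 per step), not the values DeepMind used.

```python
def episode_return(resolved: bool, steps: int, step_penalty: float = 0.01) -> float:
    """Reward for a proven or disproven statement, minus a small cost per reasoning step."""
    return (1.0 if resolved else 0.0) - step_penalty * steps

short_proof = episode_return(resolved=True, steps=10)
long_proof = episode_return(resolved=True, steps=50)
gave_up = episode_return(resolved=False, steps=50)

# A shorter proof of the same statement scores higher; failing scores lowest.
print(short_proof > long_proof > gave_up)  # True
```

Under a rule like this, two correct proofs of the same statement are not equally good: the one that gets there in fewer steps earns more.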

It was also trained to use a second component, which was a tree search algorithm. This explored all possible actions that could be taken to push the proof forward at each step. Because the number of possible actions in mathematics can be near infinite, the job of the neural net was to look at the available branches in the search tree and commit computational budget only to the most promising ones.
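A much simpler relative of this idea is best-first search with a fixed budget, sketched below. This is not AlphaProof's algorithm: the "proof" is a toy puzzle (reach a target number using two made-up steps, `add1` and `double`), and a hand-written `score` function stands in for the neural net's judgment of which branch looks promising.

```python
import heapq

def guided_search(start, goal, actions, score, budget=1000):
    """Best-first search: always expand the state the scorer rates highest."""
    frontier = [(-score(start), 0, start, [])]   # negated score turns the heap into a max-heap
    seen = {start}
    counter = 1                                  # tie-breaker so the heap never compares states
    expansions = 0
    while frontier and expansions < budget:
        _, _, state, path = heapq.heappop(frontier)
        if state == goal:
            return path
        expansions += 1
        for name, step in actions.items():
            nxt = step(state)
            if nxt not in seen:
                seen.add(nxt)
                heapq.heappush(frontier, (-score(nxt), counter, nxt, path + [name]))
                counter += 1
    return None                                  # budget spent without reaching the goal

goal = 10
actions = {"add1": lambda x: x + 1, "double": lambda x: x * 2}
score = lambda x: -abs(goal - x)                 # stand-in for a learned evaluator

print(guided_search(1, goal, actions, score))            # a list of step names leading to 10
print(guided_search(1, goal, actions, score, budget=2))  # None: budget too small
```

With a sensible scorer, the search reaches the goal after expanding only a few states. With a tiny budget, or a scorer that rates everything the same, it wanders or gives up.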

After a few weeks of training, the system could score well on most math competition benchmarks based on problems sourced from past high school-level competitions, but it still struggled with the most difficult of them. To tackle these, the team added a third component that hadn’t been in AlphaZero. Or anywhere else.

Spark of humanity

The third component, called Test-Time Reinforcement Learning (TTRL), roughly emulated the way mathematicians approach the most difficult problems. The learning part relied on the same combination of neural nets with search tree algorithms. The difference came in what it learned from. Instead of relying on a broad database of auto-formalized problems, AlphaProof working in the TTRL mode started its work by generating an entirely new training dataset based on the problem it was dealing with.

The process involved creating countless variations of the original statement, some simplified a little bit more, some more general, and some only loosely connected to it. The system then attempted to prove or disprove them. It was roughly what most humans do when they’re facing a particularly hard puzzle, the AI equivalent of saying, “I don’t get it, so let’s try an easier version of this first to get some practice.” This allowed AlphaProof to learn on the fly, and it worked amazingly well.

At the 2024 International Mathematical Olympiad, there were 42 points to score for solving six different problems worth seven points each. To win gold, participants had to get 29 points or higher, and 58 out of 609 of them did that. Silver medals were awarded to people who earned between 22 and 28 points (there were 123 silver medalists). The problems varied in difficulty, with the sixth one, acting as a “final boss,” being the most difficult of them all. Only six participants managed to solve it. AlphaProof was the seventh.
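The arithmetic behind those thresholds is easy to tabulate. The sketch below uses only the cutoffs quoted above and makes one simplifying assumption: each problem is either fully solved for seven points or scores nothing, whereas human contestants can also earn partial credit.

```python
POINTS_PER_PROBLEM = 7
GOLD_CUTOFF = 29      # 2024 gold threshold, as quoted above
SILVER_CUTOFF = 22    # 2024 silver threshold, as quoted above

def medal(points: int) -> str:
    """Classify a score using only the two cutoffs given in the text."""
    if points >= GOLD_CUTOFF:
        return "gold"
    if points >= SILVER_CUTOFF:
        return "silver"
    return "below silver"

for solved in range(7):
    points = solved * POINTS_PER_PROBLEM
    print(f"{solved} solved -> {points:2d} points -> {medal(points)}")
```

Four complete solutions give 28 points, one short of gold, while three give 21, one short of silver. A fifth solved problem would be needed to clear the gold line.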

But AlphaProof wasn’t an end-all, be-all mathematical genius. Its silver had its price—quite literally.

Optimizing ingenuity

The first problem with AlphaProof’s performance was that it didn’t work alone. To begin with, humans had to make the problems compatible with Lean before the software even got to work. And, among the six Olympiad problems, the fourth one was about geometry, and the AI was not optimized for that. To deal with it, AlphaProof had to call a friend called AlphaGeometry 2, a geometry-specialized AI that ripped through the task in a few minutes without breaking a sweat. On its own, AlphaProof scored 21 points, not 28, so technically it would win bronze, not silver. Except it wouldn’t.

Human participants of the Olympiad had to solve their six problems in two sessions, each four-and-a-half hours long. AlphaProof, on the other hand, wrestled with them for several days using multiple tensor processing units at full throttle. The most time- and energy-consuming component was TTRL, which battled with the three problems it managed to solve for three days each. If AlphaProof were held to the same standard as human participants, it would basically run out of time. And if it hadn’t been born at a tech giant worth hundreds of billions of dollars, it would run out of money, too.

In the paper, the team admits the computational requirements to run AlphaProof are most likely cost-prohibitive for most research groups and aspiring mathematicians. Computing power in AI applications is often measured in TPU-days, meaning a tensor processing unit working flat-out for a full day. AlphaProof needed hundreds of TPU-days per problem.

On top of that, the International Mathematical Olympiad is a high school-level competition, and the problems, while admittedly difficult, were based on things mathematicians already know. Research-level math requires inventing entirely new concepts instead of just working with existing ones.

But DeepMind thinks it can overcome these hurdles and optimize AlphaProof to be less resource-hungry. “We don’t want to stop at math competitions. We want to build an AI system that could really contribute to research-level mathematics,” Hubert says. His goal is to make AlphaProof available to the broader research community. “We’re also releasing a kind of an AlphaProof tool,” he added. “It would be a small trusted testers program to see if this would be useful to mathematicians.”

Nature, 2025. DOI: 10.1038/s41586-025-09833-y

Jacek Krywko is a freelance science and technology writer who covers space exploration, artificial intelligence research, computer science, and all sorts of engineering wizardry.
