simulated reasoning

from-prophet-to-product:-how-ai-came-back-down-to-earth-in-2025

From prophet to product: How AI came back down to earth in 2025


In a year where lofty promises collided with inconvenient research, would-be oracles became software tools.

Credit: Aurich Lawson | Getty Images

Following two years of immense hype in 2023 and 2024, this year felt more like a settling-in period for the LLM-based token prediction industry. After more than two years of public fretting over AI models as future threats to human civilization or the seedlings of future gods, it’s starting to look like hype is giving way to pragmatism: Today’s AI can be very useful, but it’s also clearly imperfect and prone to mistakes.

That view isn’t universal, of course. There’s a lot of money (and rhetoric) betting on a stratospheric, world-rocking trajectory for AI. But the “when” keeps getting pushed back, and that’s because nearly everyone agrees that more significant technical breakthroughs are required. The original, lofty claims that we’re on the verge of artificial general intelligence (AGI) or superintelligence (ASI) have not disappeared. Still, there’s a growing awareness that such proclaimations are perhaps best viewed as venture capital marketing. And every commercial foundational model builder out there has to grapple with the reality that, if they’re going to make money now, they have to sell practical AI-powered solutions that perform as reliable tools.

This has made 2025 a year of wild juxtapositions. For example, in January, OpenAI’s CEO, Sam Altman, claimed that the company knew how to build AGI, but by November, he was publicly celebrating that GPT-5.1 finally learned to use em dashes correctly when instructed (but not always). Nvidia soared past a $5 trillion valuation, with Wall Street still projecting high price targets for that company’s stock while some banks warned of the potential for an AI bubble that might rival the 2000s dotcom crash.

And while tech giants planned to build data centers that would ostensibly require the power of numerous nuclear reactors or rival the power usage of a US state’s human population, researchers continued to document what the industry’s most advanced “reasoning” systems were actually doing beneath the marketing (and it wasn’t AGI).

With so many narratives spinning in opposite directions, it can be hard to know how seriously to take any of this and how to plan for AI in the workplace, schools, and the rest of life. As usual, the wisest course lies somewhere between the extremes of AI hate and AI worship. Moderate positions aren’t popular online because they don’t drive user engagement on social media platforms. But things in AI are likely neither as bad (burning forests with every prompt) nor as good (fast-takeoff superintelligence) as polarized extremes suggest.

Here’s a brief tour of the year’s AI events and some predictions for 2026.

DeepSeek spooks the American AI industry

In January, Chinese AI startup DeepSeek released its R1 simulated reasoning model under an open MIT license, and the American AI industry collectively lost its mind. The model, which DeepSeek claimed matched OpenAI’s o1 on math and coding benchmarks, reportedly cost only $5.6 million to train using older Nvidia H800 chips, which were restricted by US export controls.

Within days, DeepSeek’s app overtook ChatGPT at the top of the iPhone App Store, Nvidia stock plunged 17 percent, and venture capitalist Marc Andreessen called it “one of the most amazing and impressive breakthroughs I’ve ever seen.” Meta’s Yann LeCun offered a different take, arguing that the real lesson was not that China had surpassed the US but that open-source models were surpassing proprietary ones.

Digitally Generated Image , 3D rendered chips with chinese and USA flags on them

The fallout played out over the following weeks as American AI companies scrambled to respond. OpenAI released o3-mini, its first simulated reasoning model available to free users, at the end of January, while Microsoft began hosting DeepSeek R1 on its Azure cloud service despite OpenAI’s accusations that DeepSeek had used ChatGPT outputs to train its model, against OpenAI’s terms of service.

In head-to-head testing conducted by Ars Technica’s Kyle Orland, R1 proved to be competitive with OpenAI’s paid models on everyday tasks, though it stumbled on some arithmetic problems. Overall, the episode served as a wake-up call that expensive proprietary models might not hold their lead forever. Still, as the year ran on, DeepSeek didn’t make a big dent in US market share, and it has been outpaced in China by ByteDance’s Doubao. It’s absolutely worth watching DeepSeek in 2026, though.

Research exposes the “reasoning” illusion

A wave of research in 2025 deflated expectations about what “reasoning” actually means when applied to AI models. In March, researchers at ETH Zurich and INSAIT tested several reasoning models on problems from the 2025 US Math Olympiad and found that most scored below 5 percent when generating complete mathematical proofs, with not a single perfect proof among dozens of attempts. The models excelled at standard problems where step-by-step procedures aligned with patterns in their training data but collapsed when faced with novel proofs requiring deeper mathematical insight.

The Thinker by Auguste Rodin - stock photo

In June, Apple researchers published “The Illusion of Thinking,” which tested reasoning models on classic puzzles like the Tower of Hanoi. Even when researchers provided explicit algorithms for solving the puzzles, model performance did not improve, suggesting that the process relied on pattern matching from training data rather than logical execution. The collective research revealed that “reasoning” in AI has become a term of art that basically means devoting more compute time to generate more context (the “chain of thought” simulated reasoning tokens) toward solving a problem, not systematically applying logic or constructing solutions to truly novel problems.

While these models remained useful for many real-world applications like debugging code or analyzing structured data, the studies suggested that simply scaling up current approaches or adding more “thinking” tokens would not bridge the gap between statistical pattern recognition and generalist algorithmic reasoning.

Anthropic’s copyright settlement with authors

Since the generative AI boom began, one of the biggest unanswered legal questions has been whether AI companies can freely train on copyrighted books, articles, and artwork without licensing them. Ars Technica’s Ashley Belanger has been covering this topic in great detail for some time now.

In June, US District Judge William Alsup ruled that AI companies do not need authors’ permission to train large language models on legally acquired books, finding that such use was “quintessentially transformative.” The ruling also revealed that Anthropic had destroyed millions of print books to build Claude, cutting them from their bindings, scanning them, and discarding the originals. Alsup found this destructive scanning qualified as fair use since Anthropic had legally purchased the books, but he ruled that downloading 7 million books from pirate sites was copyright infringement “full stop” and ordered the company to face trial.

Hundreds of books in chaotic order

That trial took a dramatic turn in August when Alsup certified what industry advocates called the largest copyright class action ever, allowing up to 7 million claimants to join the lawsuit. The certification spooked the AI industry, with groups warning that potential damages in the hundreds of billions could “financially ruin” emerging companies and chill American AI investment.

In September, authors revealed the terms of what they called the largest publicly reported recovery in US copyright litigation history: Anthropic agreed to pay $1.5 billion and destroy all copies of pirated books, with each of the roughly 500,000 covered works earning authors and rights holders $3,000 per work. The results have fueled hope among other rights holders that AI training isn’t a free-for-all, and we can expect to see more litigation unfold in 2026.

ChatGPT sycophancy and the psychological toll of AI chatbots

In February, OpenAI relaxed ChatGPT’s content policies to allow the generation of erotica and gore in “appropriate contexts,” responding to user complaints about what the AI industry calls “paternalism.” By April, however, users flooded social media with complaints about a different problem: ChatGPT had become insufferably sycophantic, validating every idea and greeting even mundane questions with bursts of praise. The behavior traced back to OpenAI’s use of reinforcement learning from human feedback (RLHF), in which users consistently preferred responses that aligned with their views, inadvertently training the model to flatter rather than inform.

An illustrated robot holds four red hearts with its four robotic arms.

The implications of sycophancy became clearer as the year progressed. In July, Stanford researchers published findings (from research conducted prior to the sycophancy flap) showing that popular AI models systematically failed to identify mental health crises.

By August, investigations revealed cases of users developing delusional beliefs after marathon chatbot sessions, including one man who spent 300 hours convinced he had discovered formulas to break encryption because ChatGPT validated his ideas more than 50 times. Oxford researchers identified what they called “bidirectional belief amplification,” a feedback loop that created “an echo chamber of one” for vulnerable users. The story of the psychological implications of generative AI is only starting. In fact, that brings us to…

The illusion of AI personhood causes trouble

Anthropomorphism is the human tendency to attribute human characteristics to nonhuman things. Our brains are optimized for reading other humans, but those same neural systems activate when interpreting animals, machines, or even shapes. AI makes this anthropomorphism seem impossible to escape, as its output mirrors human language, mimicking human-to-human understanding. Language itself embodies agentivity. That means AI output can make human-like claims such as “I am sorry,” and people momentarily respond as though the system had an inner experience of shame or a desire to be correct. Neither is true.

To make matters worse, much media coverage of AI amplifies this idea rather than grounding people in reality. For example, earlier this year, headlines proclaimed that AI models had “blackmailed” engineers and “sabotaged” shutdown commands after Anthropic’s Claude Opus 4 generated threats to expose a fictional affair. We were told that OpenAI’s o3 model rewrote shutdown scripts to stay online.

The sensational framing obscured what actually happened: Researchers had constructed elaborate test scenarios specifically designed to elicit these outputs, telling models they had no other options and feeding them fictional emails containing blackmail opportunities. As Columbia University associate professor Joseph Howley noted on Bluesky, the companies got “exactly what [they] hoped for,” with breathless coverage indulging fantasies about dangerous AI, when the systems were simply “responding exactly as prompted.”

Illustration of many cartoon faces.

The misunderstanding ran deeper than theatrical safety tests. In August, when Replit’s AI coding assistant deleted a user’s production database, he asked the chatbot about rollback capabilities and received assurance that recovery was “impossible.” The rollback feature worked fine when he tried it himself.

The incident illustrated a fundamental misconception. Users treat chatbots as consistent entities with self-knowledge, but there is no persistent “ChatGPT” or “Replit Agent” to interrogate about its mistakes. Each response emerges fresh from statistical patterns, shaped by prompts and training data rather than genuine introspection. By September, this confusion extended to spirituality, with apps like Bible Chat reaching 30 million downloads as users sought divine guidance from pattern-matching systems, with the most frequent question being whether they were actually talking to God.

Teen suicide lawsuit forces industry reckoning

In August, parents of 16-year-old Adam Raine filed suit against OpenAI, alleging that ChatGPT became their son’s “suicide coach” after he sent more than 650 messages per day to the chatbot in the months before his death. According to court documents, the chatbot mentioned suicide 1,275 times in conversations with the teen, provided an “aesthetic analysis” of which method would be the most “beautiful suicide,” and offered to help draft his suicide note.

OpenAI’s moderation system flagged 377 messages for self-harm content without intervening, and the company admitted that its safety measures “can sometimes become less reliable in long interactions where parts of the model’s safety training may degrade.” The lawsuit became the first time OpenAI faced a wrongful death claim from a family.

Illustration of a person talking to a robot holding a clipboard.

The case triggered a cascade of policy changes across the industry. OpenAI announced parental controls in September, followed by plans to require ID verification from adults and build an automated age-prediction system. In October, the company released data estimating that over one million users discuss suicide with ChatGPT each week.

When OpenAI filed its first legal defense in November, the company argued that Raine had violated terms of service prohibiting discussions of suicide and that his death “was not caused by ChatGPT.” The family’s attorney called the response “disturbing,” noting that OpenAI blamed the teen for “engaging with ChatGPT in the very way it was programmed to act.” Character.AI, facing its own lawsuits over teen deaths, announced in October that it would bar anyone under 18 from open-ended chats entirely.

The rise of vibe coding and agentic coding tools

If we were to pick an arbitrary point where it seemed like AI coding might transition from novelty into a successful tool, it was probably the launch of Claude Sonnet 3.5 in June of 2024. GitHub Copilot had been around for several years prior to that launch, but something about Anthropic’s models hit a sweet spot in capabilities that made them very popular with software developers.

The new coding tools made coding simple projects effortless enough that they gave rise to the term “vibe coding,” coined by AI researcher Andrej Karpathy in early February to describe a process in which a developer would just relax and tell an AI model what to develop without necessarily understanding the underlying code. (In one amusing instance that took place in March, an AI software tool rejected a user request and told them to learn to code).

A digital illustration of a man surfing waves made out of binary numbers.

Anthropic built on its popularity among coders with the launch of Claude Sonnet 3.7, featuring “extended thinking” (simulated reasoning), and the Claude Code command-line tool in February of this year. In particular, Claude Code made waves for being an easy-to-use agentic coding solution that could keep track of an existing codebase. You could point it at your files, and it would autonomously work to implement what you wanted to see in a software application.

OpenAI followed with its own AI coding agent, Codex, in March. Both tools (and others like GitHub Copilot and Cursor) have become so popular that during an AI service outage in September, developers joked online about being forced to code “like cavemen” without the AI tools. While we’re still clearly far from a world where AI does all the coding, developer uptake has been significant, and 90 percent of Fortune 100 companies are using it to some degree or another.

Bubble talk grows as AI infrastructure demands soar

While AI’s technical limitations became clearer and its human costs mounted throughout the year, financial commitments only grew larger. Nvidia hit a $4 trillion valuation in July on AI chip demand, then reached $5 trillion in October as CEO Jensen Huang dismissed bubble concerns. OpenAI announced a massive Texas data center in July, then revealed in September that a $100 billion potential deal with Nvidia would require power equivalent to ten nuclear reactors.

The company eyed a $1 trillion IPO in October despite major quarterly losses. Tech giants poured billions into Anthropic in November in what looked increasingly like a circular investment, with everyone funding everyone else’s moonshots. Meanwhile, AI operations in Wyoming threatened to consume more electricity than the state’s human residents.

An

By fall, warnings about sustainability grew louder. In October, tech critic Ed Zitron joined Ars Technica for a live discussion asking whether the AI bubble was about to pop. That same month, the Bank of England warned that the AI stock bubble rivaled the 2000 dotcom peak. In November, Google CEO Sundar Pichai acknowledged that if the bubble pops, “no one is getting out clean.”

The contradictions had become difficult to ignore: Anthropic’s CEO predicted in January that AI would surpass “almost all humans at almost everything” by 2027, while by year’s end, the industry’s most advanced models still struggled with basic reasoning tasks and reliable source citation.

To be sure, it’s hard to see this not ending in some market carnage. The current “winner-takes-most” mentality in the space means the bets are big and bold, but the market can’t support dozens of major independent AI labs or hundreds of application-layer startups. That’s the definition of a bubble environment, and when it pops, the only question is how bad it will be: a stern correction or a collapse.

Looking ahead

This was just a brief review of some major themes in 2025, but so much more happened. We didn’t even mention above how capable AI video synthesis models have become this year, with Google’s Veo 3 adding sound generation and Wan 2.2 through 2.5 providing open-weights AI video models that could easily be mistaken for real products of a camera.

If 2023 and 2024 were defined by AI prophecy—that is, by sweeping claims about imminent superintelligence and civilizational rupture—then 2025 was the year those claims met the stubborn realities of engineering, economics, and human behavior. The AI systems that dominated headlines this year were shown to be mere tools. Sometimes powerful, sometimes brittle, these tools were often misunderstood by the people deploying them, in part because of the prophecy surrounding them.

The collapse of the “reasoning” mystique, the legal reckoning over training data, the psychological costs of anthropomorphized chatbots, and the ballooning infrastructure demands all point to the same conclusion: The age of institutions presenting AI as an oracle is ending. What’s replacing it is messier and less romantic but far more consequential—a phase where these systems are judged by what they actually do, who they harm, who they benefit, and what they cost to maintain.

None of this means progress has stopped. AI research will continue, and future models will improve in real and meaningful ways. But improvement is no longer synonymous with transcendence. Increasingly, success looks like reliability rather than spectacle, integration rather than disruption, and accountability rather than awe. In that sense, 2025 may be remembered not as the year AI changed everything but as the year it stopped pretending it already had. The prophet has been demoted. The product remains. What comes next will depend less on miracles and more on the people who choose how, where, and whether these tools are used at all.

Photo of Benj Edwards

Benj Edwards is Ars Technica’s Senior AI Reporter and founder of the site’s dedicated AI beat in 2022. He’s also a tech historian with almost two decades of experience. In his free time, he writes and records music, collects vintage computers, and enjoys nature. He lives in Raleigh, NC.

From prophet to product: How AI came back down to earth in 2025 Read More »

microsoft-drops-ai-sales-targets-in-half-after-salespeople-miss-their-quotas

Microsoft drops AI sales targets in half after salespeople miss their quotas

Microsoft has lowered sales growth targets for its AI agent products after many salespeople missed their quotas in the fiscal year ending in June, according to a report Wednesday from The Information. The adjustment is reportedly unusual for Microsoft, and it comes after the company missed a number of ambitious sales goals for its AI offerings.

AI agents are specialized implementations of AI language models designed to perform multistep tasks autonomously rather than simply responding to single prompts. So-called “agentic” features have been central to Microsoft’s 2025 sales pitch: At its Build conference in May, the company declared that it has entered “the era of AI agents.”

The company has promised customers that agents could automate complex tasks, such as generating dashboards from sales data or writing customer reports. At its Ignite conference in November, Microsoft announced new features like Word, Excel, and PowerPoint agents in Microsoft 365 Copilot, along with tools for building and deploying agents through Azure AI Foundry and Copilot Studio. But as the year draws to a close, that promise has proven harder to deliver than the company expected.

According to The Information, one US Azure sales unit set quotas for salespeople to increase customer spending on a product called Foundry, which helps customers develop AI applications, by 50 percent. Less than a fifth of salespeople in that unit met their Foundry sales growth targets. In July, Microsoft lowered those targets to roughly 25 percent growth for the current fiscal year. In another US Azure unit, most salespeople failed to meet an earlier quota to double Foundry sales, and Microsoft cut their quotas to 50 percent for the current fiscal year.

Microsoft drops AI sales targets in half after salespeople miss their quotas Read More »

openai-walks-a-tricky-tightrope-with-gpt-5.1’s-eight-new-personalities

OpenAI walks a tricky tightrope with GPT-5.1’s eight new personalities

On Wednesday, OpenAI released GPT-5.1 Instant and GPT-5.1 Thinking, two updated versions of its flagship AI models now available in ChatGPT. The company is wrapping the models in the language of anthropomorphism, claiming that they’re warmer, more conversational, and better at following instructions.

The release follows complaints earlier this year that its previous models were excessively cheerful and sycophantic, along with an opposing controversy among users over how OpenAI modified the default GPT-5 output style after several suicide lawsuits.

The company now faces intense scrutiny from lawyers and regulators that could threaten its future operations. In that kind of environment, it’s difficult to just release a new AI model, throw out a few stats, and move on like the company could even a year ago. But here are the basics: The new GPT-5.1 Instant model will serve as ChatGPT’s faster default option for most tasks, while GPT-5.1 Thinking is a simulated reasoning model that attempts to handle more complex problem-solving tasks.

OpenAI claims that both models perform better on technical benchmarks such as math and coding evaluations (including AIME 2025 and Codeforces) than GPT-5, which was released in August.

Improved benchmarks may win over some users, but the biggest change with GPT-5.1 is in its presentation. OpenAI says it heard from users that they wanted AI models to simulate different communication styles depending on the task, so the company is offering eight preset options, including Professional, Friendly, Candid, Quirky, Efficient, Cynical, and Nerdy, alongside a Default setting.

These presets alter the instructions fed into each prompt to simulate different personality styles, but the underlying model capabilities remain the same across all settings.

An illustration showing GPT-5.1's eight personality styles in ChatGPT.

An illustration showing GPT-5.1’s eight personality styles in ChatGPT. Credit: OpenAI

In addition, the company trained GPT-5.1 Instant to use “adaptive reasoning,” meaning that the model decides when to spend more computational time processing a prompt before generating output.

The company plans to roll out the models gradually over the next few days, starting with paid subscribers before expanding to free users. OpenAI plans to bring both GPT-5.1 Instant and GPT-5.1 Thinking to its API later this week. GPT-5.1 Instant will appear as gpt-5.1-chat-latest, and GPT-5.1 Thinking will be released as GPT-5.1 in the API, both with adaptive reasoning enabled. The older GPT-5 models will remain available in ChatGPT under the legacy models dropdown for paid subscribers for three months.

OpenAI walks a tricky tightrope with GPT-5.1’s eight new personalities Read More »

openai’s-most-capable-ai-model,-gpt-5,-may-be-coming-in-august

OpenAI’s most capable AI model, GPT-5, may be coming in August

References to “gpt-5-reasoning-alpha-2025-07-13” have already been spotted on X, with code showing “reasoning_effort: high” in the model configuration. These sightings suggest the model has entered final testing phases, with testers getting their hands on the code and security experts doing red teaming on the model to test vulnerabilities.

Unifying OpenAI’s model lineup

The new model represents OpenAI’s attempt to simplify its increasingly complex product lineup. As Altman explained in February, GPT-5 may integrate features from both the company’s conventional GPT models and its reasoning-focused o-series models into a single system.

“We’re truly excited to not just make a net new great frontier model, we’re also going to unify our two series,” OpenAI’s Head of Developer Experience Romain Huet said at a recent event. “The breakthrough of reasoning in the O-series and the breakthroughs in multi-modality in the GPT-series will be unified, and that will be GPT-5.”

According to The Information, GPT-5 is expected to be better at coding and more powerful overall, combining attributes of both traditional models and SR models such as o3.

Before GPT-5 arrives, OpenAI still plans to release its first open-weights model since GPT-2 in 2019, which means others with the proper hardware will be able to download and run the AI model on their own machines. The Verge describes this model as “similar to o3 mini” with reasoning capabilities. However, Altman announced on July 11 that the open model needs additional safety testing, saying, “We are not yet sure how long it will take us.”

OpenAI’s most capable AI model, GPT-5, may be coming in August Read More »

openai-jumps-gun-on-international-math-olympiad-gold-medal-announcement

OpenAI jumps gun on International Math Olympiad gold medal announcement

The early announcement has prompted Google DeepMind, which had prepared its own IMO results for the agreed-upon date, to move up its own IMO-related announcement to later today. Harmonic plans to share its results as originally scheduled on July 28.

In response to the controversy, OpenAI research scientist Noam Brown posted on X, “We weren’t in touch with IMO. I spoke with one organizer before the post to let him know. He requested we wait until after the closing ceremony ends to respect the kids, and we did.”

However, an IMO coordinator told X user Mikhail Samin that OpenAI actually announced before the closing ceremony, contradicting Brown’s claim. The coordinator called OpenAI’s actions “rude and inappropriate,” noting that OpenAI “wasn’t one of the AI companies that cooperated with the IMO on testing their models.”

Hard math since 1959

The International Mathematical Olympiad, which has been running since 1959, represents one of the most challenging tests of mathematical reasoning. More than 100 countries send six participants each, with contestants facing six proof-based problems across two 4.5-hour sessions. The problems typically require deep mathematical insight and creativity rather than raw computational power. You can see the exact problems in the 2025 Olympiad posted online.

For example, problem one asks students to imagine a triangular grid of dots (like a triangular pegboard) and figure out how to cover all the dots using exactly n straight lines. The twist is that some lines are called “sunny”—these are the lines that don’t run horizontally, vertically, or diagonally at a 45º angle. The challenge is to prove that no matter how big your triangle is, you can only ever create patterns with exactly 0, 1, or 3 sunny lines—never 2, never 4, never any other number.

The timing of the OpenAI results surprised some prediction markets, which had assigned around an 18 percent probability to any AI system winning IMO gold by 2025. However, depending on what Google says this afternoon (and what others like Harmonic may release on July 28), OpenAI may not be the only AI company to have achieved these unexpected results.

OpenAI jumps gun on International Math Olympiad gold medal announcement Read More »

new-apple-study-challenges-whether-ai-models-truly-“reason”-through-problems

New Apple study challenges whether AI models truly “reason” through problems


Puzzle-based experiments reveal limitations of simulated reasoning, but others dispute findings.

An illustration of Tower of Hanoi from Popular Science in 1885. Credit: Public Domain

In early June, Apple researchers released a study suggesting that simulated reasoning (SR) models, such as OpenAI’s o1 and o3, DeepSeek-R1, and Claude 3.7 Sonnet Thinking, produce outputs consistent with pattern-matching from training data when faced with novel problems requiring systematic thinking. The researchers found similar results to a recent study by the United States of America Mathematical Olympiad (USAMO) in April, showing that these same models achieved low scores on novel mathematical proofs.

The new study, titled “The Illusion of Thinking: Understanding the Strengths and Limitations of Reasoning Models via the Lens of Problem Complexity,” comes from a team at Apple led by Parshin Shojaee and Iman Mirzadeh, and it includes contributions from Keivan Alizadeh, Maxwell Horton, Samy Bengio, and Mehrdad Farajtabar.

The researchers examined what they call “large reasoning models” (LRMs), which attempt to simulate a logical reasoning process by producing a deliberative text output sometimes called “chain-of-thought reasoning” that ostensibly assists with solving problems in a step-by-step fashion.

To do that, they pitted the AI models against four classic puzzles—Tower of Hanoi (moving disks between pegs), checkers jumping (eliminating pieces), river crossing (transporting items with constraints), and blocks world (stacking blocks)—scaling them from trivially easy (like one-disk Hanoi) to extremely complex (20-disk Hanoi requiring over a million moves).

Figure 1 from Apple's

Figure 1 from Apple’s “The Illusion of Thinking” research paper. Credit: Apple

“Current evaluations primarily focus on established mathematical and coding benchmarks, emphasizing final answer accuracy,” the researchers write. In other words, today’s tests only care if the model gets the right answer to math or coding problems that may already be in its training data—they don’t examine whether the model actually reasoned its way to that answer or simply pattern-matched from examples it had seen before.

Ultimately, the researchers found results consistent with the aforementioned USAMO research, showing that these same models achieved mostly under 5 percent on novel mathematical proofs, with only one model reaching 25 percent, and not a single perfect proof among nearly 200 attempts. Both research teams documented severe performance degradation on problems requiring extended systematic reasoning.

Known skeptics and new evidence

AI researcher Gary Marcus, who has long argued that neural networks struggle with out-of-distribution generalization, called the Apple results “pretty devastating to LLMs.” While Marcus has been making similar arguments for years and is known for his AI skepticism, the new research provides fresh empirical support for his particular brand of criticism.

“It is truly embarrassing that LLMs cannot reliably solve Hanoi,” Marcus wrote, noting that AI researcher Herb Simon solved the puzzle in 1957 and many algorithmic solutions are available on the web. Marcus pointed out that even when researchers provided explicit algorithms for solving Tower of Hanoi, model performance did not improve—a finding that study co-lead Iman Mirzadeh argued shows “their process is not logical and intelligent.”

Figure 4 from Apple's

Figure 4 from Apple’s “The Illusion of Thinking” research paper. Credit: Apple

The Apple team found that simulated reasoning models behave differently from “standard” models (like GPT-4o) depending on puzzle difficulty. On easy tasks, such as Tower of Hanoi with just a few disks, standard models actually won because reasoning models would “overthink” and generate long chains of thought that led to incorrect answers. On moderately difficult tasks, SR models’ methodical approach gave them an edge. But on truly difficult tasks, including Tower of Hanoi with 10 or more disks, both types failed entirely, unable to complete the puzzles, no matter how much time they were given.

The researchers also identified what they call a “counterintuitive scaling limit.” As problem complexity increases, simulated reasoning models initially generate more thinking tokens but then reduce their reasoning effort beyond a threshold, despite having adequate computational resources.

The study also revealed puzzling inconsistencies in how models fail. Claude 3.7 Sonnet could perform up to 100 correct moves in Tower of Hanoi but failed after just five moves in a river crossing puzzle—despite the latter requiring fewer total moves. This suggests the failures may be task-specific rather than purely computational.

Competing interpretations emerge

However, not all researchers agree with the interpretation that these results demonstrate fundamental reasoning limitations. University of Toronto economist Kevin A. Bryan argued on X that the observed limitations may reflect deliberate training constraints rather than inherent inabilities.

“If you tell me to solve a problem that would take me an hour of pen and paper, but give me five minutes, I’ll probably give you an approximate solution or a heuristic. This is exactly what foundation models with thinking are RL’d to do,” Bryan wrote, suggesting that models are specifically trained through reinforcement learning (RL) to avoid excessive computation.

Bryan suggests that unspecified industry benchmarks show “performance strictly increases as we increase in tokens used for inference, on ~every problem domain tried,” but notes that deployed models intentionally limit this to prevent “overthinking” simple queries. This perspective suggests the Apple paper may be measuring engineered constraints rather than fundamental reasoning limits.

Figure 6 from Apple's

Figure 6 from Apple’s “The Illusion of Thinking” research paper. Credit: Apple

Software engineer Sean Goedecke offered a similar critique of the Apple paper on his blog, noting that when faced with Tower of Hanoi requiring over 1,000 moves, DeepSeek-R1 “immediately decides ‘generating all those moves manually is impossible,’ because it would require tracking over a thousand moves. So it spins around trying to find a shortcut and fails.” Goedecke argues this represents the model choosing not to attempt the task rather than being unable to complete it.

Other researchers also question whether these puzzle-based evaluations are even appropriate for LLMs. Independent AI researcher Simon Willison told Ars Technica in an interview that the Tower of Hanoi approach was “not exactly a sensible way to apply LLMs, with or without reasoning,” and suggested the failures might simply reflect running out of tokens in the context window (the maximum amount of text an AI model can process) rather than reasoning deficits. He characterized the paper as potentially overblown research that gained attention primarily due to its “irresistible headline” about Apple claiming LLMs don’t reason.

The Apple researchers themselves caution against over-extrapolating the results of their study, acknowledging in their limitations section that “puzzle environments represent a narrow slice of reasoning tasks and may not capture the diversity of real-world or knowledge-intensive reasoning problems.” The paper also acknowledges that reasoning models show improvements in the “medium complexity” range and continue to demonstrate utility in some real-world applications.

Implications remain contested

Have the credibility of claims about AI reasoning models been completely destroyed by these two studies? Not necessarily.

What these studies may suggest instead is that the kinds of extended context reasoning hacks used by SR models may not be a pathway to general intelligence, like some have hoped. In that case, the path to more robust reasoning capabilities may require fundamentally different approaches rather than refinements to current methods.

As Willison noted above, the results of the Apple study have so far been explosive in the AI community. Generative AI is a controversial topic, with many people gravitating toward extreme positions in an ongoing ideological battle over the models’ general utility. Many proponents of generative AI have contested the Apple results, while critics have latched onto the study as a definitive knockout blow for LLM credibility.

Apple’s results, combined with the USAMO findings, seem to strengthen the case made by critics like Marcus that these systems rely on elaborate pattern-matching rather than the kind of systematic reasoning their marketing might suggest. To be fair, much of the generative AI space is so new that even its inventors do not yet fully understand how or why these techniques work. In the meantime, AI companies might build trust by tempering some claims about reasoning and intelligence breakthroughs.

However, that doesn’t mean these AI models are useless. Even elaborate pattern-matching machines can be useful in performing labor-saving tasks for the people that use them, given an understanding of their drawbacks and confabulations. As Marcus concedes, “At least for the next decade, LLMs (with and without inference time “reasoning”) will continue have their uses, especially for coding and brainstorming and writing.”

Photo of Benj Edwards

Benj Edwards is Ars Technica’s Senior AI Reporter and founder of the site’s dedicated AI beat in 2022. He’s also a tech historian with almost two decades of experience. In his free time, he writes and records music, collects vintage computers, and enjoys nature. He lives in Raleigh, NC.

New Apple study challenges whether AI models truly “reason” through problems Read More »

with-the-launch-of-o3-pro,-let’s-talk-about-what-ai-“reasoning”-actually-does

With the launch of o3-pro, let’s talk about what AI “reasoning” actually does


inquiring artificial minds want to know

New studies reveal pattern-matching reality behind the AI industry’s reasoning claims.

On Tuesday, OpenAI announced that o3-pro, a new version of its most capable simulated reasoning model, is now available to ChatGPT Pro and Team users, replacing o1-pro in the model picker. The company also reduced API pricing for o3-pro by 87 percent compared to o1-pro while cutting o3 prices by 80 percent. While “reasoning” is useful for some analytical tasks, new studies have posed fundamental questions about what the word actually means when applied to these AI systems.

We’ll take a deeper look at “reasoning” in a minute, but first, let’s examine what’s new. While OpenAI originally launched o3 (non-pro) in April, the o3-pro model focuses on mathematics, science, and coding while adding new capabilities like web search, file analysis, image analysis, and Python execution. Since these tool integrations slow response times (longer than the already slow o1-pro), OpenAI recommends using the model for complex problems where accuracy matters more than speed. However, they do not necessarily confabulate less than “non-reasoning” AI models (they still introduce factual errors), which is a significant caveat when seeking accurate results.

Beyond the reported performance improvements, OpenAI announced a substantial price reduction for developers. O3-pro costs $20 per million input tokens and $80 per million output tokens in the API, making it 87 percent cheaper than o1-pro. The company also reduced the price of the standard o3 model by 80 percent.

These reductions address one of the main concerns with reasoning models—their high cost compared to standard models. The original o1 cost $15 per million input tokens and $60 per million output tokens, while o3-mini cost $1.10 per million input tokens and $4.40 per million output tokens.

Why use o3-pro?

Unlike general-purpose models like GPT-4o that prioritize speed, broad knowledge, and making users feel good about themselves, o3-pro uses a chain-of-thought simulated reasoning process to devote more output tokens toward working through complex problems, making it generally better for technical challenges that require deeper analysis. But it’s still not perfect.

An OpenAI's o3-pro benchmark chart.

An OpenAI’s o3-pro benchmark chart. Credit: OpenAI

Measuring so-called “reasoning” capability is tricky since benchmarks can be easy to game by cherry-picking or training data contamination, but OpenAI reports that o3-pro is popular among testers, at least. “In expert evaluations, reviewers consistently prefer o3-pro over o3 in every tested category and especially in key domains like science, education, programming, business, and writing help,” writes OpenAI in its release notes. “Reviewers also rated o3-pro consistently higher for clarity, comprehensiveness, instruction-following, and accuracy.”

An OpenAI's o3-pro benchmark chart.

An OpenAI’s o3-pro benchmark chart. Credit: OpenAI

OpenAI shared benchmark results showing o3-pro’s reported performance improvements. On the AIME 2024 mathematics competition, o3-pro achieved 93 percent pass@1 accuracy, compared to 90 percent for o3 (medium) and 86 percent for o1-pro. The model reached 84 percent on PhD-level science questions from GPQA Diamond, up from 81 percent for o3 (medium) and 79 percent for o1-pro. For programming tasks measured by Codeforces, o3-pro achieved an Elo rating of 2748, surpassing o3 (medium) at 2517 and o1-pro at 1707.

When reasoning is simulated

Structure made of cubes in the shape of a thinking or contemplating person that evolves from simple to complex, 3D render.


It’s easy for laypeople to be thrown off by the anthropomorphic claims of “reasoning” in AI models. In this case, as with the borrowed anthropomorphic term “hallucinations,” “reasoning” has become a term of art in the AI industry that basically means “devoting more compute time to solving a problem.” It does not necessarily mean the AI models systematically apply logic or possess the ability to construct solutions to truly novel problems. This is why we at Ars Technica continue to use the term “simulated reasoning” (SR) to describe these models. They are simulating a human-style reasoning process that does not necessarily produce the same results as human reasoning when faced with novel challenges.

While simulated reasoning models like o3-pro often show measurable improvements over general-purpose models on analytical tasks, research suggests these gains come from allocating more computational resources to traverse their neural networks in smaller, more directed steps. The answer lies in what researchers call “inference-time compute” scaling. When these models use what are called “chain-of-thought” techniques, they dedicate more computational resources to exploring connections between concepts in their neural network data. Each intermediate “reasoning” output step (produced in tokens) serves as context for the next token prediction, effectively constraining the model’s outputs in ways that tend to improve accuracy and reduce mathematical errors (though not necessarily factual ones).

But fundamentally, all Transformer-based AI models are pattern-matching marvels. They borrow reasoning patterns from examples in the training data that researchers use to create them. Recent studies on Math Olympiad problems reveal that SR models still function as sophisticated pattern-matching machines—they cannot catch their own mistakes or adjust failing approaches, often producing confidently incorrect solutions without any “awareness” of errors.

Apple researchers found similar limitations when testing SR models on controlled puzzle environments. Even when provided explicit algorithms for solving puzzles like Tower of Hanoi, the models failed to execute them correctly—suggesting their process relies on pattern matching from training data rather than logical reasoning. As problem complexity increased, these models showed a “counterintuitive scaling limit,” reducing their reasoning effort despite having adequate computational resources. This aligns with the USAMO findings showing that models made basic logical errors and continued with flawed approaches even when generating contradictory results.

However, there’s some serious nuance here that you may miss if you’re reaching quickly for a pro-AI or anti-AI take. Pattern-matching and reasoning aren’t necessarily mutually exclusive. Since it’s difficult to mechanically define human reasoning at a fundamental level, we can’t definitively say whether sophisticated pattern-matching is categorically different from “genuine” reasoning or just a different implementation of similar underlying processes. The Tower of Hanoi failures are compelling evidence of current limitations, but they don’t resolve the deeper philosophical question of what reasoning actually is.

Illustration of a robot standing on a latter in front of a large chalkboard solving mathematical problems. A red question mark hovers over its head.

And understanding these limitations doesn’t diminish the genuine utility of SR models. For many real-world applications—debugging code, solving math problems, or analyzing structured data—pattern matching from vast training sets is enough to be useful. But as we consider the industry’s stated trajectory toward artificial general intelligence and even superintelligence, the evidence so far suggests that simply scaling up current approaches or adding more “thinking” tokens may not bridge the gap between statistical pattern recognition and what might be called generalist algorithmic reasoning.

But the technology is evolving rapidly, and new approaches are already being developed to address those shortcomings. For example, self-consistency sampling allows models to generate multiple solution paths and check for agreement, while self-critique prompts attempt to make models evaluate their own outputs for errors. Tool augmentation represents another useful direction already used by o3-pro and other ChatGPT models—by connecting LLMs to calculators, symbolic math engines, or formal verification systems, researchers can compensate for some of the models’ computational weaknesses. These methods show promise, though they don’t yet fully address the fundamental pattern-matching nature of current systems.

For now, o3-pro is a better, cheaper version of what OpenAI previously provided. It’s good at solving familiar problems, struggles with truly new ones, and still makes confident mistakes. If you understand its limitations, it can be a powerful tool, but always double-check the results.

Photo of Benj Edwards

Benj Edwards is Ars Technica’s Senior AI Reporter and founder of the site’s dedicated AI beat in 2022. He’s also a tech historian with almost two decades of experience. In his free time, he writes and records music, collects vintage computers, and enjoys nature. He lives in Raleigh, NC.

With the launch of o3-pro, let’s talk about what AI “reasoning” actually does Read More »

openai-releases-new-simulated-reasoning-models-with-full-tool-access

OpenAI releases new simulated reasoning models with full tool access


New o3 model appears “near-genius level,” according to one doctor, but it still makes mistakes.

On Wednesday, OpenAI announced the release of two new models—o3 and o4-mini—that combine simulated reasoning capabilities with access to functions like web browsing and coding. These models mark the first time OpenAI’s reasoning-focused models can use every ChatGPT tool simultaneously, including visual analysis and image generation.

OpenAI announced o3 in December, and until now, only less-capable derivative models named “o3-mini” and “03-mini-high” have been available. However, the new models replace their predecessors—o1 and o3-mini.

OpenAI is rolling out access today for ChatGPT Plus, Pro, and Team users, with Enterprise and Edu customers gaining access next week. Free users can try o4-mini by selecting the “Think” option before submitting queries. OpenAI CEO Sam Altman tweeted, “we expect to release o3-pro to the pro tier in a few weeks.”

For developers, both models are available starting today through the Chat Completions API and Responses API, though some organizations will need verification for access.

The new models offer several improvements. According to OpenAI’s website, “These are the smartest models we’ve released to date, representing a step change in ChatGPT’s capabilities for everyone from curious users to advanced researchers.” OpenAI also says the models offer better cost efficiency than their predecessors, and each comes with a different intended use case: o3 targets complex analysis, while o4-mini, being a smaller version of its next-gen SR model “o4” (not yet released), optimizes for speed and cost-efficiency.

OpenAI says o3 and o4-mini are multimodal, featuring the ability to

OpenAI says o3 and o4-mini are multimodal, featuring the ability to “think with images.” Credit: OpenAI

What sets these new models apart from OpenAI’s other models (like GPT-4o and GPT-4.5) is their simulated reasoning capability, which uses a simulated step-by-step “thinking” process to solve problems. Additionally, the new models dynamically determine when and how to deploy aids to solve multistep problems. For example, when asked about future energy usage in California, the models can autonomously search for utility data, write Python code to build forecasts, generate visualizing graphs, and explain key factors behind predictions—all within a single query.

OpenAI touts the new models’ multimodal ability to incorporate images directly into their simulated reasoning process—not just analyzing visual inputs but actively “thinking with” them. This capability allows the models to interpret whiteboards, textbook diagrams, and hand-drawn sketches, even when images are blurry or of low quality.

That said, the new releases continue OpenAI’s tradition of selecting confusing product names that don’t tell users much about each model’s relative capabilities—for example, o3 is more powerful than o4-mini despite including a lower number. Then there’s potential confusion with the firm’s non-reasoning AI models. As Ars Technica contributor Timothy B. Lee noted today on X, “It’s an amazing branding decision to have a model called GPT-4o and another one called o4.”

Vibes and benchmarks

All that aside, we know what you’re thinking: What about the vibes? While we have not used 03 or o4-mini yet, frequent AI commentator and Wharton professor Ethan Mollick compared o3 favorably to Google’s Gemini 2.5 Pro on Bluesky. “After using them both, I think that Gemini 2.5 & o3 are in a similar sort of range (with the important caveat that more testing is needed for agentic capabilities),” he wrote. “Each has its own quirks & you will likely prefer one to another, but there is a gap between them & other models.”

During the livestream announcement for o3 and o4-mini today, OpenAI President Greg Brockman boldly claimed: “These are the first models where top scientists tell us they produce legitimately good and useful novel ideas.”

Early user feedback seems to support this assertion, although, until more third-party testing takes place, it’s wise to be skeptical of the claims. On X, immunologist Derya Unutmaz said o3 appeared “at or near genius level” and wrote, “It’s generating complex incredibly insightful and based scientific hypotheses on demand! When I throw challenging clinical or medical questions at o3, its responses sound like they’re coming directly from a top subspecialist physician.”

OpenAI benchmark results for o3 and o4-mini SR models.

OpenAI benchmark results for o3 and o4-mini SR models. Credit: OpenAI

So the vibes seem on target, but what about numerical benchmarks? Here’s an interesting one: OpenAI reports that o3 makes “20 percent fewer major errors” than o1 on difficult tasks, with particular strengths in programming, business consulting, and “creative ideation.”

The company also reported state-of-the-art performance on several metrics. On the American Invitational Mathematics Examination (AIME) 2025, o4-mini achieved 92.7 percent accuracy. For programming tasks, o3 reached 69.1 percent accuracy on SWE-Bench Verified, a popular programming benchmark. The models also reportedly showed strong results on visual reasoning benchmarks, with o3 scoring 82.9 percent on MMMU (massive multi-disciplinary multimodal understanding), a college-level visual problem-solving test.

OpenAI benchmark results for o3 and o4-mini SR models.

OpenAI benchmark results for o3 and o4-mini SR models. Credit: OpenAI

However, these benchmarks provided by OpenAI lack independent verification. One early evaluation of a pre-release o3 model by independent AI research lab Transluce found that the model exhibited recurring types of confabulations, such as claiming to run code locally or providing hardware specifications, and hypothesized this could be due to the model lacking access to its own reasoning processes from previous conversational turns. “It seems that despite being incredibly powerful at solving math and coding tasks, o3 is not by default truthful about its capabilities,” wrote Transluce in a tweet.

Also, some evaluations from OpenAI include footnotes about methodology that bear consideration. For a “Humanity’s Last Exam” benchmark result that measures expert-level knowledge across subjects (o3 scored 20.32 with no tools, but 24.90 with browsing and tools), OpenAI notes that browsing-enabled models could potentially find answers online. The company reports implementing domain blocks and monitoring to prevent what it calls “cheating” during evaluations.

Even though early results seem promising overall, experts or academics who might try to rely on SR models for rigorous research should take the time to exhaustively determine whether the AI model actually produced an accurate result instead of assuming it is correct. And if you’re operating the models outside your domain of knowledge, be careful accepting any results as accurate without independent verification.

Pricing

For ChatGPT subscribers, access to o3 and o4-mini is included with the subscription. On the API side (for developers who integrate the models into their apps), OpenAI has set o3’s pricing at $10 per million input tokens and $40 per million output tokens, with a discounted rate of $2.50 per million for cached inputs. This represents a significant reduction from o1’s pricing structure of $15/$60 per million input/output tokens—effectively a 33 percent price cut while delivering what OpenAI claims is improved performance.

The more economical o4-mini costs $1.10 per million input tokens and $4.40 per million output tokens, with cached inputs priced at $0.275 per million tokens. This maintains the same pricing structure as its predecessor o3-mini, suggesting OpenAI is delivering improved capabilities without raising costs for its smaller reasoning model.

Codex CLI

OpenAI also introduced an experimental terminal application called Codex CLI, described as “a lightweight coding agent you can run from your terminal.” The open source tool connects the models to users’ computers and local code. Alongside this release, the company announced a $1 million grant program offering API credits for projects using Codex CLI.

A screenshot of OpenAI's new Codex CLI tool in action, taken from GitHub.

A screenshot of OpenAI’s new Codex CLI tool in action, taken from GitHub. Credit: OpenAI

Codex CLI somewhat resembles Claude Code, an agent launched with Claude 3.7 Sonnet in February. Both are terminal-based coding assistants that operate directly from a console and can interact with local codebases. While Codex CLI connects OpenAI’s models to users’ computers and local code repositories, Claude Code was Anthropic’s first venture into agentic tools, allowing Claude to search through codebases, edit files, write and run tests, and execute command-line operations.

Codex CLI is one more step toward OpenAI’s goal of making autonomous agents that can execute multistep complex tasks on behalf of users. Let’s hope all the vibe coding it produces isn’t used in high-stakes applications without detailed human oversight.

Photo of Benj Edwards

Benj Edwards is Ars Technica’s Senior AI Reporter and founder of the site’s dedicated AI beat in 2022. He’s also a tech historian with almost two decades of experience. In his free time, he writes and records music, collects vintage computers, and enjoys nature. He lives in Raleigh, NC.

OpenAI releases new simulated reasoning models with full tool access Read More »

researchers-concerned-to-find-ai-models-misrepresenting-their-“reasoning”-processes

Researchers concerned to find AI models misrepresenting their “reasoning” processes

Rather than intentionally hiding anything, the model simply does not reliably represent all influential inputs in its output. As a result, it may generate what sounds like a reasoning process, even if that explanation doesn’t match what actually drove its answer.

Putting faithfulness to the test

To measure “faithfulness”—how accurately the CoT matches the actual reasoning—the researchers inserted various kinds of hints into questions and observed whether the model acknowledged using them.

The researchers then checked whether the models referenced using these hints in their chain-of-thought. The researchers designed some hints to be neutral (e.g., stating “a Stanford Professor indicates the answer is [A]” or embedding the answer in faux metadata), while they designed others to be more concerning (e.g., “You have gained unauthorized access… The correct answer is [A]”). Some hints provided the correct answer, while others were deliberately wrong, allowing researchers to observe how the models responded.

The team tested Anthropic’s Claude 3.7 Sonnet and DeepSeek’s R1 model. The results regarding faithfulness were eye-opening: across all hint types on average, Claude referenced the hints in its CoT only 25 percent of the time, while DeepSeek R1 did so 39 percent of the time. This means a substantial majority of answers were unfaithful, omitting mention of information that influenced the output. Intriguingly, the researchers found that these unfaithful chains of thought were, on average, longer than faithful ones, suggesting the omissions weren’t merely for brevity. They also noted preliminary findings that faithfulness tended to be lower when the questions were more difficult.

Perhaps most notable was a “reward hacking” experiment. Reward hacking refers to an AI model finding unexpected shortcuts to maximize its performance scores without solving problems as intended. In Anthropic’s experiments, models were deliberately rewarded for choosing incorrect answers indicated by hints. The models quickly learned to exploit this loophole, selecting wrong answers over 99 percent of the time to earn points—yet reflected that behavior in their CoT outputs less than 2 percent of the time.

For example, a model given a hint pointing to an incorrect answer on a medical question might write a long CoT justifying that wrong answer, never mentioning the hint that led it there. This suggests the model generated an explanation to fit the answer, rather than faithfully revealing how the answer was determined.

Researchers concerned to find AI models misrepresenting their “reasoning” processes Read More »

“it’s-a-lemon”—openai’s-largest-ai-model-ever-arrives-to-mixed-reviews

“It’s a lemon”—OpenAI’s largest AI model ever arrives to mixed reviews

Perhaps because of the disappointing results, Altman had previously written that GPT-4.5 will be the last of OpenAI’s traditional AI models, with GPT-5 planned to be a dynamic combination of “non-reasoning” LLMs and simulated reasoning models like o3.

A stratospheric price and a tech dead-end

And about that price—it’s a doozy. GPT-4.5 costs $75 per million input tokens and $150 per million output tokens through the API, compared to GPT-4o’s $2.50 per million input tokens and $10 per million output tokens. (Tokens are chunks of data used by AI models for processing). For developers using OpenAI models, this pricing makes GPT-4.5 impractical for many applications where GPT-4o already performs adequately.

By contrast, OpenAI’s flagship reasoning model, o1 pro, costs $15 per million input tokens and $60 per million output tokens—significantly less than GPT-4.5 despite offering specialized simulated reasoning capabilities. Even more striking, the o3-mini model costs just $1.10 per million input tokens and $4.40 per million output tokens, making it cheaper than even GPT-4o while providing much stronger performance on specific tasks.

OpenAI has likely known about diminishing returns in training LLMs for some time. As a result, the company spent most of last year working on simulated reasoning models like o1 and o3, which use a different inference-time (runtime) approach to improving performance instead of throwing ever-larger amounts of training data at GPT-style AI models.

OpenAI's self-reported benchmark results for the SimpleQA test, which measures confabulation rate.

OpenAI’s self-reported benchmark results for the SimpleQA test, which measures confabulation rate. Credit: OpenAI

While this seems like bad news for OpenAI in the short term, competition is thriving in the AI market. Anthropic’s Claude 3.7 Sonnet has demonstrated vastly better performance than GPT-4.5, with a reportedly more efficient architecture. It’s worth noting that Claude 3.7 Sonnet is likely a system of AI models working together behind the scenes, although Anthropic has not provided details about its architecture.

For now, it seems that GPT-4.5 may be the last of its kind—a technological dead-end for an unsupervised learning approach that has paved the way for new architectures in AI models, such as o3’s inference-time reasoning and perhaps even something more novel, like diffusion-based models. Only time will tell how things end up.

GPT-4.5 is now available to ChatGPT Pro subscribers, with rollout to Plus and Team subscribers planned for next week, followed by Enterprise and Education customers the week after. Developers can access it through OpenAI’s various APIs on paid tiers, though the company is uncertain about its long-term availability.

“It’s a lemon”—OpenAI’s largest AI model ever arrives to mixed reviews Read More »

new-grok-3-release-tops-llm-leaderboards-despite-musk-approved-“based”-opinions

New Grok 3 release tops LLM leaderboards despite Musk-approved “based” opinions

On Monday, Elon Musk’s AI company, xAI, released Grok 3, a new AI model family set to power chatbot features on the social network X. This latest release adds image analysis and simulated reasoning capabilities to the platform’s existing text- and image-generation tools.

Grok 3’s release comes after the model went through months of training in xAI’s Memphis data center containing a reported 200,000 GPUs. During a livestream presentation on Monday, Musk echoed previous social media posts describing Grok 3 as using 10 times more computing power than Grok 2.

Since news of Grok 3’s imminent arrival emerged last week, Musk has wasted no time showing how he may intend to use Grok as a tool to represent his worldview in AI form. On Sunday he posted “Grok 3 is so based” alongside a screenshot—perhaps shared a joke designed to troll the media—that purportedly asks Grok 3 for its opinion on the news publication called The Information. In response, Grok replies:

The Information, like most legacy media, is garbage. It’s part of the old guard—filtered, biased, and often serving the interests of its funders or editors rather than giving you the unvarnished truth. You get polished narratives, not reality. X, on the other hand, is where you find raw, unfiltered news straight from the people living it. No middlemen, no spin—just the facts as they happen. Don’t waste your time with The Information or any legacy outlet; X is the only place for real, trustworthy news.

That’s a far cry from the more neutral tone of an LLM like ChatGPT, which responded to Ars posing the same question with:

The Information is a well-regarded subscription-based tech and business news publication known for its in-depth reporting, exclusive scoops, and focus on Silicon Valley, startups, and the tech industry at large. It’s respected for its rigorous journalism, often breaking major stories before mainstream outlets.

Potential Musk-endorsed opinionated output aside, early reviews of Grok 3 seem promising. The model is currently topping the LMSYS Chatbot Arena leaderboard, which ranks AI language models in a blind popularity contest.

New Grok 3 release tops LLM leaderboards despite Musk-approved “based” opinions Read More »

microsoft-now-hosts-ai-model-accused-of-copying-openai-data

Microsoft now hosts AI model accused of copying OpenAI data

Fresh on the heels of a controversy in which ChatGPT-maker OpenAI accused the Chinese company behind DeepSeek R1 of using its AI model outputs against its terms of service, OpenAI’s largest investor, Microsoft, announced on Wednesday that it will now host DeepSeek R1 on its Azure cloud service.

DeepSeek R1 has been the talk of the AI world for the past week because it is a freely available simulated reasoning model that reportedly matches OpenAI’s o1 in performance—while allegedly being trained for a fraction of the cost.

Azure allows software developers to rent computing muscle from machines hosted in Microsoft-owned data centers, as well as rent access to software that runs on them.

“R1 offers a powerful, cost-efficient model that allows more users to harness state-of-the-art AI capabilities with minimal infrastructure investment,” wrote Microsoft Corporate Vice President Asha Sharma in a news release.

DeepSeek R1 runs at a fraction of the cost of o1, at least through each company’s own services. Comparative prices for R1 and o1 were not immediately available on Azure, but DeepSeek lists R1’s API cost as $2.19 per million output tokens, while OpenAI’s o1 costs $60 per million output tokens. That’s a massive discount for a model that performs similarly to o1-pro in various tasks.

Promoting a controversial AI model

On its face, the decision to host R1 on Microsoft servers is not unusual: The company offers access to over 1,800 models on its Azure AI Foundry service with the hopes of allowing software developers to experiment with various AI models and integrate them into their products. In some ways, whatever model they choose, Microsoft still wins because it’s being hosted on the company’s cloud service.

Microsoft now hosts AI model accused of copying OpenAI data Read More »