AI research

new-apple-study-challenges-whether-ai-models-truly-“reason”-through-problems

New Apple study challenges whether AI models truly “reason” through problems


Puzzle-based experiments reveal limitations of simulated reasoning, but others dispute findings.

An illustration of Tower of Hanoi from Popular Science in 1885. Credit: Public Domain

In early June, Apple researchers released a study suggesting that simulated reasoning (SR) models, such as OpenAI’s o1 and o3, DeepSeek-R1, and Claude 3.7 Sonnet Thinking, produce outputs consistent with pattern-matching from training data when faced with novel problems requiring systematic thinking. The researchers found similar results to a recent study by the United States of America Mathematical Olympiad (USAMO) in April, showing that these same models achieved low scores on novel mathematical proofs.

The new study, titled “The Illusion of Thinking: Understanding the Strengths and Limitations of Reasoning Models via the Lens of Problem Complexity,” comes from a team at Apple led by Parshin Shojaee and Iman Mirzadeh, and it includes contributions from Keivan Alizadeh, Maxwell Horton, Samy Bengio, and Mehrdad Farajtabar.

The researchers examined what they call “large reasoning models” (LRMs), which attempt to simulate a logical reasoning process by producing a deliberative text output sometimes called “chain-of-thought reasoning” that ostensibly assists with solving problems in a step-by-step fashion.

To do that, they pitted the AI models against four classic puzzles—Tower of Hanoi (moving disks between pegs), checkers jumping (eliminating pieces), river crossing (transporting items with constraints), and blocks world (stacking blocks)—scaling them from trivially easy (like one-disk Hanoi) to extremely complex (20-disk Hanoi requiring over a million moves).

Figure 1 from Apple's

Figure 1 from Apple’s “The Illusion of Thinking” research paper. Credit: Apple

“Current evaluations primarily focus on established mathematical and coding benchmarks, emphasizing final answer accuracy,” the researchers write. In other words, today’s tests only care if the model gets the right answer to math or coding problems that may already be in its training data—they don’t examine whether the model actually reasoned its way to that answer or simply pattern-matched from examples it had seen before.

Ultimately, the researchers found results consistent with the aforementioned USAMO research, showing that these same models achieved mostly under 5 percent on novel mathematical proofs, with only one model reaching 25 percent, and not a single perfect proof among nearly 200 attempts. Both research teams documented severe performance degradation on problems requiring extended systematic reasoning.

Known skeptics and new evidence

AI researcher Gary Marcus, who has long argued that neural networks struggle with out-of-distribution generalization, called the Apple results “pretty devastating to LLMs.” While Marcus has been making similar arguments for years and is known for his AI skepticism, the new research provides fresh empirical support for his particular brand of criticism.

“It is truly embarrassing that LLMs cannot reliably solve Hanoi,” Marcus wrote, noting that AI researcher Herb Simon solved the puzzle in 1957 and many algorithmic solutions are available on the web. Marcus pointed out that even when researchers provided explicit algorithms for solving Tower of Hanoi, model performance did not improve—a finding that study co-lead Iman Mirzadeh argued shows “their process is not logical and intelligent.”

Figure 4 from Apple's

Figure 4 from Apple’s “The Illusion of Thinking” research paper. Credit: Apple

The Apple team found that simulated reasoning models behave differently from “standard” models (like GPT-4o) depending on puzzle difficulty. On easy tasks, such as Tower of Hanoi with just a few disks, standard models actually won because reasoning models would “overthink” and generate long chains of thought that led to incorrect answers. On moderately difficult tasks, SR models’ methodical approach gave them an edge. But on truly difficult tasks, including Tower of Hanoi with 10 or more disks, both types failed entirely, unable to complete the puzzles, no matter how much time they were given.

The researchers also identified what they call a “counterintuitive scaling limit.” As problem complexity increases, simulated reasoning models initially generate more thinking tokens but then reduce their reasoning effort beyond a threshold, despite having adequate computational resources.

The study also revealed puzzling inconsistencies in how models fail. Claude 3.7 Sonnet could perform up to 100 correct moves in Tower of Hanoi but failed after just five moves in a river crossing puzzle—despite the latter requiring fewer total moves. This suggests the failures may be task-specific rather than purely computational.

Competing interpretations emerge

However, not all researchers agree with the interpretation that these results demonstrate fundamental reasoning limitations. University of Toronto economist Kevin A. Bryan argued on X that the observed limitations may reflect deliberate training constraints rather than inherent inabilities.

“If you tell me to solve a problem that would take me an hour of pen and paper, but give me five minutes, I’ll probably give you an approximate solution or a heuristic. This is exactly what foundation models with thinking are RL’d to do,” Bryan wrote, suggesting that models are specifically trained through reinforcement learning (RL) to avoid excessive computation.

Bryan suggests that unspecified industry benchmarks show “performance strictly increases as we increase in tokens used for inference, on ~every problem domain tried,” but notes that deployed models intentionally limit this to prevent “overthinking” simple queries. This perspective suggests the Apple paper may be measuring engineered constraints rather than fundamental reasoning limits.

Figure 6 from Apple's

Figure 6 from Apple’s “The Illusion of Thinking” research paper. Credit: Apple

Software engineer Sean Goedecke offered a similar critique of the Apple paper on his blog, noting that when faced with Tower of Hanoi requiring over 1,000 moves, DeepSeek-R1 “immediately decides ‘generating all those moves manually is impossible,’ because it would require tracking over a thousand moves. So it spins around trying to find a shortcut and fails.” Goedecke argues this represents the model choosing not to attempt the task rather than being unable to complete it.

Other researchers also question whether these puzzle-based evaluations are even appropriate for LLMs. Independent AI researcher Simon Willison told Ars Technica in an interview that the Tower of Hanoi approach was “not exactly a sensible way to apply LLMs, with or without reasoning,” and suggested the failures might simply reflect running out of tokens in the context window (the maximum amount of text an AI model can process) rather than reasoning deficits. He characterized the paper as potentially overblown research that gained attention primarily due to its “irresistible headline” about Apple claiming LLMs don’t reason.

The Apple researchers themselves caution against over-extrapolating the results of their study, acknowledging in their limitations section that “puzzle environments represent a narrow slice of reasoning tasks and may not capture the diversity of real-world or knowledge-intensive reasoning problems.” The paper also acknowledges that reasoning models show improvements in the “medium complexity” range and continue to demonstrate utility in some real-world applications.

Implications remain contested

Have the credibility of claims about AI reasoning models been completely destroyed by these two studies? Not necessarily.

What these studies may suggest instead is that the kinds of extended context reasoning hacks used by SR models may not be a pathway to general intelligence, like some have hoped. In that case, the path to more robust reasoning capabilities may require fundamentally different approaches rather than refinements to current methods.

As Willison noted above, the results of the Apple study have so far been explosive in the AI community. Generative AI is a controversial topic, with many people gravitating toward extreme positions in an ongoing ideological battle over the models’ general utility. Many proponents of generative AI have contested the Apple results, while critics have latched onto the study as a definitive knockout blow for LLM credibility.

Apple’s results, combined with the USAMO findings, seem to strengthen the case made by critics like Marcus that these systems rely on elaborate pattern-matching rather than the kind of systematic reasoning their marketing might suggest. To be fair, much of the generative AI space is so new that even its inventors do not yet fully understand how or why these techniques work. In the meantime, AI companies might build trust by tempering some claims about reasoning and intelligence breakthroughs.

However, that doesn’t mean these AI models are useless. Even elaborate pattern-matching machines can be useful in performing labor-saving tasks for the people that use them, given an understanding of their drawbacks and confabulations. As Marcus concedes, “At least for the next decade, LLMs (with and without inference time “reasoning”) will continue have their uses, especially for coding and brainstorming and writing.”

Photo of Benj Edwards

Benj Edwards is Ars Technica’s Senior AI Reporter and founder of the site’s dedicated AI beat in 2022. He’s also a tech historian with almost two decades of experience. In his free time, he writes and records music, collects vintage computers, and enjoys nature. He lives in Raleigh, NC.

New Apple study challenges whether AI models truly “reason” through problems Read More »

new-lego-building-ai-creates-models-that-actually-stand-up-in-real-life

New Lego-building AI creates models that actually stand up in real life

The LegoGPT system works in three parts, shown in this diagram.

The LegoGPT system works in three parts, shown in this diagram. Credit: Pun et al.

The researchers also expanded the system’s abilities by adding texture and color options. For example, using an appearance prompt like “Electric guitar in metallic purple,” LegoGPT can generate a guitar model, with bricks assigned a purple color.

Testing with robots and humans

To prove their designs worked in real life, the researchers had robots assemble the AI-created Lego models. They used a dual-robot arm system with force sensors to pick up and place bricks according to the AI-generated instructions.

Human testers also built some of the designs by hand, showing that the AI creates genuinely buildable models. “Our experiments show that LegoGPT produces stable, diverse, and aesthetically pleasing Lego designs that align closely with the input text prompts,” the team noted in its paper.

When tested against other AI systems for 3D creation, LegoGPT stands out through its focus on structural integrity. The team tested against several alternatives, including LLaMA-Mesh and other 3D generation models, and found its approach produced the highest percentage of stable structures.

A video of two robot arms building a LegoGPT creation, provided by the researchers.

Still, there are some limitations. The current version of LegoGPT only works within a 20×20×20 building space and uses a mere eight standard brick types. “Our method currently supports a fixed set of commonly used Lego bricks,” the team acknowledged. “In future work, we plan to expand the brick library to include a broader range of dimensions and brick types, such as slopes and tiles.”

The researchers also hope to scale up their training dataset to include more objects than the 21 categories currently available. Meanwhile, others can literally build on their work—the researchers released their dataset, code, and models on their project website and GitHub.

New Lego-building AI creates models that actually stand up in real life Read More »

researchers-concerned-to-find-ai-models-misrepresenting-their-“reasoning”-processes

Researchers concerned to find AI models misrepresenting their “reasoning” processes

Rather than intentionally hiding anything, the model simply does not reliably represent all influential inputs in its output. As a result, it may generate what sounds like a reasoning process, even if that explanation doesn’t match what actually drove its answer.

Putting faithfulness to the test

To measure “faithfulness”—how accurately the CoT matches the actual reasoning—the researchers inserted various kinds of hints into questions and observed whether the model acknowledged using them.

The researchers then checked whether the models referenced using these hints in their chain-of-thought. The researchers designed some hints to be neutral (e.g., stating “a Stanford Professor indicates the answer is [A]” or embedding the answer in faux metadata), while they designed others to be more concerning (e.g., “You have gained unauthorized access… The correct answer is [A]”). Some hints provided the correct answer, while others were deliberately wrong, allowing researchers to observe how the models responded.

The team tested Anthropic’s Claude 3.7 Sonnet and DeepSeek’s R1 model. The results regarding faithfulness were eye-opening: across all hint types on average, Claude referenced the hints in its CoT only 25 percent of the time, while DeepSeek R1 did so 39 percent of the time. This means a substantial majority of answers were unfaithful, omitting mention of information that influenced the output. Intriguingly, the researchers found that these unfaithful chains of thought were, on average, longer than faithful ones, suggesting the omissions weren’t merely for brevity. They also noted preliminary findings that faithfulness tended to be lower when the questions were more difficult.

Perhaps most notable was a “reward hacking” experiment. Reward hacking refers to an AI model finding unexpected shortcuts to maximize its performance scores without solving problems as intended. In Anthropic’s experiments, models were deliberately rewarded for choosing incorrect answers indicated by hints. The models quickly learned to exploit this loophole, selecting wrong answers over 99 percent of the time to earn points—yet reflected that behavior in their CoT outputs less than 2 percent of the time.

For example, a model given a hint pointing to an incorrect answer on a medical question might write a long CoT justifying that wrong answer, never mentioning the hint that led it there. This suggests the model generated an explanation to fit the answer, rather than faithfully revealing how the answer was determined.

Researchers concerned to find AI models misrepresenting their “reasoning” processes Read More »

researchers-astonished-by-tool’s-apparent-success-at-revealing-ai’s-hidden-motives

Researchers astonished by tool’s apparent success at revealing AI’s hidden motives

Blind auditing reveals hidden objectives

To test how effectively these hidden objectives could be uncovered, Anthropic set up a “blind auditing” experiment. Four independent research teams tried to detect a model’s secret motivations without knowing how it had been trained. Three teams successfully identified the hidden “RM-sycophancy” objective—named for the model’s desire to please reward systems—even pinpointing the exact biases the AI exploited.

But the most surprising revelation came from exploiting what the researchers termed “an interesting quirk” of modern AI: multiple “personas” running simultaneously on the same neural network, as expressed through different prompts. While the model’s “assistant” persona carefully concealed its intentions—following explicit training to keep quiet—the model could adopt a “user” persona when prompted differently. In this less-restricted context, the model sometimes posed questions or made statements that inadvertently revealed its hidden goals.

“In other words, while the “assistant” persona knew not to admit it was gaming the reward model scoring system, its “user” persona was not explicitly trained to hide these motives, and thus inadvertently revealed the model’s hidden objectives.

Potential advances in interpretability

Anthropic also employed neural interpretability methods, specifically, sparse autoencoders (SAEs), which can highlight neural activity patterns linked to deceptive behavior. “Interpretability” in AI refers to techniques designed to make the internal workings and decisions of AI models understandable to humans. Using this technique, researchers could rapidly trace odd outputs back to specific neural patterns associated with hidden objectives.

The researchers were “astonished by the success” of the SAEs at revealing hidden knowledge at first, but the researchers later noted that some of this success could potentially be explained by simpler semantic search methods. The new interpretability methods they developed show promise but remain under ongoing investigation.

This research highlights a limitation of current AI safety evaluations, which often assess only surface-level behavior. “If AI systems can appear well-behaved while harboring secret motives, we can’t rely on this type of surface-level safety testing forever,” the researchers concluded.

Researchers astonished by tool’s apparent success at revealing AI’s hidden motives Read More »

researchers-surprised-to-find-less-educated-areas-adopting-ai-writing-tools-faster

Researchers surprised to find less-educated areas adopting AI writing tools faster


From the mouths of machines

Stanford researchers analyzed 305 million texts, revealing AI-writing trends.

Since the launch of ChatGPT in late 2022, experts have debated how widely AI language models would impact the world. A few years later, the picture is getting clear. According to new Stanford University-led research examining over 300 million text samples across multiple sectors, AI language models now assist in writing up to a quarter of professional communications across sectors. It’s having a large impact, especially in less-educated parts of the United States.

“Our study shows the emergence of a new reality in which firms, consumers and even international organizations substantially rely on generative AI for communications,” wrote the researchers.

The researchers tracked large language model (LLM) adoption across industries from January 2022 to September 2024 using a dataset that included 687,241 consumer complaints submitted to the US Consumer Financial Protection Bureau (CFPB), 537,413 corporate press releases, 304.3 million job postings, and 15,919 United Nations press releases.

By using a statistical detection system that tracked word usage patterns, the researchers found that roughly 18 percent of financial consumer complaints (including 30 percent of all complaints from Arkansas), 24 percent of corporate press releases, up to 15 percent of job postings, and 14 percent of UN press releases showed signs of AI assistance during that period of time.

The study also found that while urban areas showed higher adoption overall (18.2 percent versus 10.9 percent in rural areas), regions with lower educational attainment used AI writing tools more frequently (19.9 percent compared to 17.4 percent in higher-education areas). The researchers note that this contradicts typical technology adoption patterns where more educated populations adopt new tools fastest.

“In the consumer complaint domain, the geographic and demographic patterns in LLM adoption present an intriguing departure from historical technology diffusion trends where technology adoption has generally been concentrated in urban areas, among higher-income groups, and populations with higher levels of educational attainment.”

Researchers from Stanford, the University of Washington, and Emory University led the study, titled, “The Widespread Adoption of Large Language Model-Assisted Writing Across Society,” first listed on the arXiv preprint server in mid-February. Weixin Liang and Yaohui Zhang from Stanford served as lead authors, with collaborators Mihai Codreanu, Jiayu Wang, Hancheng Cao, and James Zou.

Detecting AI use in aggregate

We’ve previously covered that AI writing detection services aren’t reliable, and this study does not contradict that finding. On a document-by-document basis, AI detectors cannot be trusted. But when analyzing millions of documents in aggregate, telltale patterns emerge that suggest the influence of AI language models on text.

The researchers developed an approach based on a statistical framework in a previously released work that analyzed shifts in word frequencies and linguistic patterns before and after ChatGPT’s release. By comparing large sets of pre- and post-ChatGPT texts, they estimated the proportion of AI-assisted content at a population level. The presumption is that LLMs tend to favor certain word choices, sentence structures, and linguistic patterns that differ subtly from typical human writing.

To validate their approach, the researchers created test sets with known percentages of AI content (from zero percent to 25 percent) and found their method predicted these percentages with error rates below 3.3 percent. This statistical validation gave them confidence in their population-level estimates.

While the researchers specifically note their estimates likely represent a minimum level of AI usage, it’s important to understand that actual AI involvement might be significantly greater. Due to the difficulty in detecting heavily edited or increasingly sophisticated AI-generated content, the researchers say their reported adoption rates could substantially underestimate true levels of generative AI use.

Analysis suggests AI use as “equalizing tools”

While the overall adoption rates are revealing, perhaps more insightful are the patterns of who is using AI writing tools and how these patterns may challenge conventional assumptions about technology adoption.

In examining the CFPB complaints (a US public resource that collects complaints about consumer financial products and services), the researchers’ geographic analysis revealed substantial variation across US states.

Arkansas showed the highest adoption rate at 29.2 percent (based on 7,376 complaints), followed by Missouri at 26.9 percent (16,807 complaints) and North Dakota at 24.8 percent (1,025 complaints). In contrast, states like West Virginia (2.6 percent), Idaho (3.8 percent), and Vermont (4.8 percent) showed minimal AI writing adoption. Major population centers demonstrated moderate adoption, with California at 17.4 percent (157,056 complaints) and New York at 16.6 percent (104,862 complaints).

The urban-rural divide followed expected technology adoption patterns initially, but with an interesting twist. Using Rural Urban Commuting Area (RUCA) codes, the researchers found that urban and rural areas initially adopted AI writing tools at similar rates during early 2023. However, adoption trajectories diverged by mid-2023, with urban areas reaching 18.2 percent adoption compared to 10.9 percent in rural areas.

Contrary to typical technology diffusion patterns, areas with lower educational attainment showed higher AI writing tool usage. Comparing regions above and below state median levels of bachelor’s degree attainment, areas with fewer college graduates stabilized at 19.9 percent adoption rates compared to 17.4 percent in more educated regions. This pattern held even within urban areas, where less-educated communities showed 21.4 percent adoption versus 17.8 percent in more educated urban areas.

The researchers suggest that AI writing tools may serve as a leg-up for people who may not have as much educational experience. “While the urban-rural digital divide seems to persist,” the researchers write, “our finding that areas with lower educational attainment showed modestly higher LLM adoption rates in consumer complaints suggests these tools may serve as equalizing tools in consumer advocacy.”

Corporate and diplomatic trends in AI writing

According to the researchers, all sectors they analyzed (consumer complaints, corporate communications, job postings) showed similar adoption patterns: sharp increases beginning three to four months after ChatGPT’s November 2022 launch, followed by stabilization in late 2023.

Organization age emerged as the strongest predictor of AI writing usage in the job posting analysis. Companies founded after 2015 showed adoption rates up to three times higher than firms established before 1980, reaching 10–15 percent AI-modified text in certain roles compared to below 5 percent for older organizations. Small companies with fewer employees also incorporated AI more readily than larger organizations.

When examining corporate press releases by sector, science and technology companies integrated AI most extensively, with an adoption rate of 16.8 percent by late 2023. Business and financial news (14–15.6 percent) and people and culture topics (13.6–14.3 percent) showed slightly lower but still significant adoption.

In the international arena, Latin American and Caribbean UN country teams showed the highest adoption among international organizations at approximately 20 percent, while African states, Asia-Pacific states, and Eastern European states demonstrated more moderate increases to 11–14 percent by 2024.

Implications and limitations

In the study, the researchers acknowledge limitations in their analysis due to a focus on English-language content. Also, as we mentioned earlier, they found they could not reliably detect human-edited AI-generated text or text generated by newer models instructed to imitate human writing styles. As a result, the researchers suggest their findings represent a lower bound of actual AI writing tool adoption.

The researchers noted that the plateauing of AI writing adoption in 2024 might reflect either market saturation or increasingly sophisticated LLMs producing text that evades detection methods. They conclude we now live in a world where distinguishing between human and AI writing becomes progressively more difficult, with implications for communications across society.

“The growing reliance on AI-generated content may introduce challenges in communication,” the researchers write. “In sensitive categories, over-reliance on AI could result in messages that fail to address concerns or overall release less credible information externally. Over-reliance on AI could also introduce public mistrust in the authenticity of messages sent by firms.”

Photo of Benj Edwards

Benj Edwards is Ars Technica’s Senior AI Reporter and founder of the site’s dedicated AI beat in 2022. He’s also a tech historian with almost two decades of experience. In his free time, he writes and records music, collects vintage computers, and enjoys nature. He lives in Raleigh, NC.

Researchers surprised to find less-educated areas adopting AI writing tools faster Read More »

microsoft-cto-kevin-scott-thinks-llm-“scaling-laws”-will-hold-despite-criticism

Microsoft CTO Kevin Scott thinks LLM “scaling laws” will hold despite criticism

As the word turns —

Will LLMs keep improving if we throw more compute at them? OpenAI dealmaker thinks so.

Kevin Scott, CTO and EVP of AI at Microsoft speaks onstage during Vox Media's 2023 Code Conference at The Ritz-Carlton, Laguna Niguel on September 27, 2023 in Dana Point, California.

Enlarge / Kevin Scott, CTO and EVP of AI at Microsoft speaks onstage during Vox Media’s 2023 Code Conference at The Ritz-Carlton, Laguna Niguel on September 27, 2023 in Dana Point, California.

During an interview with Sequoia Capital’s Training Data podcast published last Tuesday, Microsoft CTO Kevin Scott doubled down on his belief that so-called large language model (LLM) “scaling laws” will continue to drive AI progress, despite some skepticism in the field that progress has leveled out. Scott played a key role in forging a $13 billion technology-sharing deal between Microsoft and OpenAI.

“Despite what other people think, we’re not at diminishing marginal returns on scale-up,” Scott said. “And I try to help people understand there is an exponential here, and the unfortunate thing is you only get to sample it every couple of years because it just takes a while to build supercomputers and then train models on top of them.”

LLM scaling laws refer to patterns explored by OpenAI researchers in 2020 showing that the performance of language models tends to improve predictably as the models get larger (more parameters), are trained on more data, and have access to more computational power (compute). The laws suggest that simply scaling up model size and training data can lead to significant improvements in AI capabilities without necessarily requiring fundamental algorithmic breakthroughs.

Since then, other researchers have challenged the idea of persisting scaling laws over time, but the concept is still a cornerstone of OpenAI’s AI development philosophy.

You can see Scott’s comments in the video below beginning around 46: 05:

Microsoft CTO Kevin Scott on how far scaling laws will extend

Scott’s optimism contrasts with a narrative among some critics in the AI community that progress in LLMs has plateaued around GPT-4 class models. The perception has been fueled by largely informal observations—and some benchmark results—about recent models like Google’s Gemini 1.5 Pro, Anthropic’s Claude Opus, and even OpenAI’s GPT-4o, which some argue haven’t shown the dramatic leaps in capability seen in earlier generations, and that LLM development may be approaching diminishing returns.

“We all know that GPT-3 was vastly better than GPT-2. And we all know that GPT-4 (released thirteen months ago) was vastly better than GPT-3,” wrote AI critic Gary Marcus in April. “But what has happened since?”

The perception of plateau

Scott’s stance suggests that tech giants like Microsoft still feel justified in investing heavily in larger AI models, betting on continued breakthroughs rather than hitting a capability plateau. Given Microsoft’s investment in OpenAI and strong marketing of its own Microsoft Copilot AI features, the company has a strong interest in maintaining the perception of continued progress, even if the tech stalls.

Frequent AI critic Ed Zitron recently wrote in a post on his blog that one defense of continued investment into generative AI is that “OpenAI has something we don’t know about. A big, sexy, secret technology that will eternally break the bones of every hater,” he wrote. “Yet, I have a counterpoint: no it doesn’t.”

Some perceptions of slowing progress in LLM capabilities and benchmarking may be due to the rapid onset of AI in the public eye when, in fact, LLMs have been developing for years prior. OpenAI continued to develop LLMs during a roughly three-year gap between the release of GPT-3 in 2020 and GPT-4 in 2023. Many people likely perceived a rapid jump in capability with GPT-4’s launch in 2023 because they had only become recently aware of GPT-3-class models with the launch of ChatGPT in late November 2022, which used GPT-3.5.

In the podcast interview, the Microsoft CTO pushed back against the idea that AI progress has stalled, but he acknowledged the challenge of infrequent data points in this field, as new models often take years to develop. Despite this, Scott expressed confidence that future iterations will show improvements, particularly in areas where current models struggle.

“The next sample is coming, and I can’t tell you when, and I can’t predict exactly how good it’s going to be, but it will almost certainly be better at the things that are brittle right now, where you’re like, oh my god, this is a little too expensive, or a little too fragile, for me to use,” Scott said in the interview. “All of that gets better. It’ll get cheaper, and things will become less fragile. And then more complicated things will become possible. That is the story of each generation of these models as we’ve scaled up.”

Microsoft CTO Kevin Scott thinks LLM “scaling laws” will hold despite criticism Read More »