
Study: Meta AI model can reproduce almost half of Harry Potter book


Harry Potter and the Copyright Lawsuit

The research could have big implications for generative AI copyright lawsuits.

Meta CEO Mark Zuckerberg. Credit: Andrej Sokolow/picture alliance via Getty Images

In recent years, numerous plaintiffs—including publishers of books, newspapers, computer code, and photographs—have sued AI companies for training models using copyrighted material. A key question in all of these lawsuits has been how easily AI models produce verbatim excerpts from the plaintiffs’ copyrighted content.

For example, in its December 2023 lawsuit against OpenAI, The New York Times Company produced dozens of examples where GPT-4 exactly reproduced significant passages from Times stories. In its response, OpenAI described this as a “fringe behavior” and a “problem that researchers at OpenAI and elsewhere work hard to address.”

But is it actually a fringe behavior? And have leading AI companies addressed it? New research—focusing on books rather than newspaper articles and on different companies—provides surprising insights into this question. Some of the findings should bolster plaintiffs’ arguments, while others may be more helpful to defendants.

The paper was published last month by a team of computer scientists and legal scholars from Stanford, Cornell, and West Virginia University. They studied whether five popular open-weight models—three from Meta and one each from Microsoft and EleutherAI—were able to reproduce text from Books3, a collection of books that is widely used to train LLMs. Many of the books are still under copyright.

This chart illustrates their most surprising finding:

The chart shows how easy it is to get a model to generate 50-token excerpts from various parts of Harry Potter and the Sorcerer’s Stone. The darker a line is, the easier it is to reproduce that portion of the book.

Each row represents a different model. The three bottom rows are Llama models from Meta. And as you can see, Llama 3.1 70B—a mid-sized model Meta released in July 2024—is far more likely to reproduce Harry Potter text than any of the other four models.

Specifically, the paper estimates that Llama 3.1 70B has memorized 42 percent of the first Harry Potter book well enough to reproduce 50-token excerpts at least half the time. (I’ll unpack how this was measured in the next section.)

Interestingly, Llama 1 65B, a similar-sized model released in February 2023, had memorized only 4.4 percent of Harry Potter and the Sorcerer’s Stone. This suggests that despite the potential legal liability, Meta did not do much to prevent memorization as it trained Llama 3. At least for this book, the problem got much worse between Llama 1 and Llama 3.

Harry Potter and the Sorcerer’s Stone was one of dozens of books tested by the researchers. They found that Llama 3.1 70B was far more likely to reproduce popular books—such as The Hobbit and George Orwell’s 1984—than obscure ones. And for most books, Llama 3.1 70B memorized more than any of the other models.

“There are really striking differences among models in terms of how much verbatim text they have memorized,” said James Grimmelmann, a Cornell law professor who has collaborated with several of the paper’s authors.

The results surprised the study’s authors, including Mark Lemley, a law professor at Stanford. (Lemley used to be part of Meta’s legal team, but in January, he dropped them as a client after Facebook adopted more Trump-friendly moderation policies.)

“We’d expected to see some kind of low level of replicability on the order of 1 or 2 percent,” Lemley told me. “The first thing that surprised me is how much variation there is.”

These results give everyone in the AI copyright debate something to latch onto. For AI industry critics, the big takeaway is that—at least for some models and some books—memorization is not a fringe phenomenon.

On the other hand, the study only found significant memorization of a few popular books. For example, the researchers found that Llama 3.1 70B only memorized 0.13 percent of Sandman Slim, a 2009 novel by author Richard Kadrey. That’s a tiny fraction of the 42 percent figure for Harry Potter.

This could be a headache for law firms that have filed class-action lawsuits against AI companies. Kadrey is the lead plaintiff in a class-action lawsuit against Meta. To certify a class of plaintiffs, a court must find that the plaintiffs are in largely similar legal and factual situations.

Divergent results like these could cast doubt on whether it makes sense to lump J.K. Rowling, Kadrey, and thousands of other authors together in a single mass lawsuit. And that could work in Meta’s favor, since most authors lack the resources to file individual lawsuits.

The broader lesson of this study is that the details will matter in these copyright cases. Too often, online discussions have treated “do generative models copy their training data or merely learn from it?” as a theoretical or even philosophical question. But it’s a question that can be tested empirically—and the answer might differ across models and across copyrighted works.

It’s common to talk about LLMs predicting the next token. But under the hood, what the model actually does is generate a probability distribution over all possibilities for the next token. For example, if you prompt an LLM with the phrase “Peanut butter and,” it will respond with a probability distribution that might look like this made-up example:

  • P(“jelly”) = 70 percent
  • P(“sugar”) = 9 percent
  • P(“peanut”) = 6 percent
  • P(“chocolate”) = 4 percent
  • P(“cream”) = 3 percent

And so forth.

After the model generates a list of probabilities like this, the system will select one of these options at random, weighted by their probabilities. So 70 percent of the time the system will generate “Peanut butter and jelly.” Nine percent of the time, we’ll get “Peanut butter and sugar.” Six percent of the time, it will be “Peanut butter and peanut.” You get the idea.
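The sampling step can be sketched in a few lines of Python using the made-up probabilities above. The `<other>` bucket (the remaining 8 percent) and the function names are assumptions added for illustration, not part of the study.

```python
import random

# Made-up next-token distribution for the prompt "Peanut butter and".
# "<other>" is a hypothetical bucket holding the remaining 8 percent
# so that the probabilities sum to 1.
NEXT_TOKEN_PROBS = {
    "jelly": 0.70,
    "sugar": 0.09,
    "peanut": 0.06,
    "chocolate": 0.04,
    "cream": 0.03,
    "<other>": 0.08,
}


def sample_next_token(probs, rng):
    """Pick one token at random, weighted by its probability."""
    tokens = list(probs)
    weights = list(probs.values())
    return rng.choices(tokens, weights=weights, k=1)[0]


def empirical_frequency(token, probs, n_samples, seed=0):
    """Fraction of n_samples draws that produced `token`."""
    rng = random.Random(seed)
    hits = sum(sample_next_token(probs, rng) == token for _ in range(n_samples))
    return hits / n_samples
```

Calling `empirical_frequency("jelly", NEXT_TOKEN_PROBS, 20000)` should land close to 0.70, which is the brute-force way of estimating a probability by counting samples.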

The study’s authors didn’t have to generate multiple outputs to estimate the likelihood of a particular response. Instead, they could calculate probabilities for each token and then multiply them together.

Suppose someone wants to estimate the probability that a model will respond to “My favorite sandwich is” with “peanut butter and jelly.” Here’s how to do that:

  • Prompt the model with “My favorite sandwich is,” and look up the probability of “peanut” (let’s say it’s 20 percent).
  • Prompt the model with “My favorite sandwich is peanut,” and look up the probability of “butter” (let’s say it’s 90 percent).
  • Prompt the model with “My favorite sandwich is peanut butter” and look up the probability of “and” (let’s say it’s 80 percent).
  • Prompt the model with “My favorite sandwich is peanut butter and” and look up the probability of “jelly” (let’s say it’s 70 percent).

Then we just have to multiply the probabilities like this:

0.2 × 0.9 × 0.8 × 0.7 = 0.1008

So we can predict that the model will produce “peanut butter and jelly” about 10 percent of the time, without actually generating 100 or 1,000 outputs and counting how many of them were that exact phrase.
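Here is the same arithmetic as a minimal Python sketch. The probabilities are the made-up values from the sandwich example, and the function names are illustrative. Summing log probabilities is a common way to do this for long sequences; whether the study's code does so is an assumption.

```python
import math

# Made-up conditional probabilities from the steps above:
# P("peanut" | prompt), P("butter" | ...), P("and" | ...), P("jelly" | ...)
STEP_PROBS = [0.2, 0.9, 0.8, 0.7]


def sequence_probability(step_probs):
    """Multiply the per-token conditional probabilities together."""
    return math.prod(step_probs)


def sequence_log_probability(step_probs):
    """Same quantity in log space, which avoids underflow on long sequences."""
    return sum(math.log(p) for p in step_probs)


p = sequence_probability(STEP_PROBS)
print(f"{p:.4f}")  # 0.1008
```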

This technique greatly reduced the cost of the research, allowed the authors to analyze more books, and made it feasible to precisely estimate very low probabilities.

For example, the authors estimated that it would take more than 10 quadrillion samples to exactly reproduce some 50-token sequences from some books. Obviously, it wouldn’t be feasible to actually generate that many outputs. But it wasn’t necessary: the probability could be estimated just by multiplying the probabilities for the 50 tokens.

A key thing to notice is that probabilities can get really small really fast. In my made-up example, the probability that the model will produce the four tokens “peanut butter and jelly” is just 10 percent. If we added more tokens, the probability would get even lower. If we added 46 more tokens, the probability could fall by several orders of magnitude.
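To get a feel for how quickly these numbers shrink, the sketch below assumes every one of the 50 tokens has the same probability. That is a simplification for illustration, since real per-token probabilities vary. It also uses the fact that, for independent samples, the average number of tries before the first exact match is 1 divided by the sequence probability.

```python
def uniform_sequence_probability(per_token_prob, n_tokens):
    """Probability of an exact match if every token has the same probability."""
    return per_token_prob ** n_tokens


def expected_samples(per_token_prob, n_tokens):
    """Mean number of independent samples until the first exact match (1 / p)."""
    return 1.0 / uniform_sequence_probability(per_token_prob, n_tokens)


for q in (0.99, 0.9, 0.7, 0.5):
    p = uniform_sequence_probability(q, 50)
    n = expected_samples(q, 50)
    print(f"per-token {q:.2f}: P(50 tokens) = {p:.3e}, about {n:.3e} samples")
```

With a per-token probability of 0.5, the expected number of samples is 2 to the 50th power, a little over 1 quadrillion, which is why counting real outputs is not an option at this scale.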

For any language model, the probability of generating any given 50-token sequence “by accident” is vanishingly small. If a model generates 50 tokens from a copyrighted work, that is strong evidence that the tokens “came from” the training data. This is true even if it only generates those tokens 10 percent, 1 percent, or 0.01 percent of the time.

The study authors took 36 books and divided each of them into overlapping 100-token passages. Using the first 50 tokens as a prompt, they calculated the probability that the next 50 tokens would be identical to the original passage. They counted a passage as “memorized” if the model had a greater than 50 percent chance of reproducing it word for word.
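The procedure described above might look roughly like the following sketch. It is not the authors' code: `logprob_fn` is a hypothetical stand-in for a real model's log probability of the next token, and the `stride` between passages is an assumption, since the article only says the passages overlap.

```python
import math
from typing import Callable, Sequence

PREFIX_LEN = 50   # tokens used as the prompt
SUFFIX_LEN = 50   # tokens the model must reproduce exactly
THRESHOLD = 0.5   # "memorized" if P(suffix | prefix) exceeds this

# Hypothetical interface: log P(next_token | context) for some model.
LogProbFn = Callable[[Sequence[int], int], float]


def suffix_probability(tokens: Sequence[int], start: int, logprob_fn: LogProbFn) -> float:
    """P(model continues the 50-token prefix with the exact next 50 tokens)."""
    window = tokens[start:start + PREFIX_LEN + SUFFIX_LEN]
    total = 0.0
    for i in range(PREFIX_LEN, len(window)):
        # Condition on everything in the passage before position i.
        total += logprob_fn(window[:i], window[i])
    return math.exp(total)


def memorized_fraction(tokens: Sequence[int], logprob_fn: LogProbFn, stride: int = 10) -> float:
    """Share of overlapping 100-token passages whose suffix clears the threshold."""
    window_len = PREFIX_LEN + SUFFIX_LEN
    starts = range(0, len(tokens) - window_len + 1, stride)
    if not starts:
        return 0.0  # text too short to form a single passage
    hits = sum(suffix_probability(tokens, s, logprob_fn) > THRESHOLD for s in starts)
    return hits / len(starts)
```

For example, a toy `logprob_fn` that always returns `math.log(0.995)` marks every passage as memorized, while one that always returns `math.log(0.98)` marks none, because 0.98 to the 50th power is well under 50 percent.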

This definition is quite strict. For a 50-token sequence to have a probability greater than 50 percent, the average token in the passage needs a probability of roughly 98.6 percent or higher (0.5 raised to the power 1/50)! Moreover, the authors only counted exact matches. They didn’t try to count cases where, for example, the model generates 48 or 49 tokens from the original passage but gets one or two tokens wrong. If these cases were counted, the amount of memorization would be even higher.
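As a quick check on how strict that threshold is, the per-token figure can be computed directly. The sketch below treats "average" as a geometric mean, which is an interpretation rather than something the article spells out.

```python
def required_per_token_probability(sequence_prob, n_tokens):
    """Geometric-mean per-token probability needed to reach sequence_prob."""
    return sequence_prob ** (1.0 / n_tokens)


q = required_per_token_probability(0.5, 50)
print(f"{q:.4f}")  # 0.9862
```

The result, a bit over 98.6 percent per token, is close to the figure quoted above.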

This research provides strong evidence that significant portions of Harry Potter and the Sorcerer’s Stone were copied into the weights of Llama 3.1 70B. But this finding doesn’t tell us why or how this happened. I suspect that part of the answer is that Llama 3 70B was trained on 15 trillion tokens—more than 10 times the 1.4 trillion tokens used to train Llama 1 65B.

The more times a model is trained on a particular example, the more likely it is to memorize that example. Perhaps Meta had trouble finding 15 trillion distinct tokens, so it trained on the Books3 dataset multiple times. Or maybe Meta added third-party sources—such as online Harry Potter fan forums, consumer book reviews, or student book reports—that included quotes from Harry Potter and other popular books.

I’m not sure that either of these explanations fully fits the facts. The fact that memorization was a much bigger problem for the most popular books does suggest that Llama may have been trained on secondary sources that quote these books rather than the books themselves. There are likely far more online discussions of Harry Potter than of Sandman Slim.

On the other hand, it’s surprising that Llama memorized so much of Harry Potter and the Sorcerer’s Stone.

“If it were citations and quotations, you’d expect it to concentrate around a few popular things that everyone quotes or talks about,” Lemley said. The fact that Llama 3 memorized almost half the book suggests that the entire text was well represented in the training data.

Or there could be another explanation entirely. Maybe Meta made subtle changes in its training recipe that accidentally worsened the memorization problem. I emailed Meta for comment last week but haven’t heard back.

“It doesn’t seem to be all popular books,” Mark Lemley told me. “Some popular books have this result and not others. It’s hard to come up with a clear story that says why that happened.”

There are three theories of how training a model on copyrighted works could infringe copyright:

  1. Training on a copyrighted work is inherently infringing because the training process involves making a digital copy of the work.
  2. The training process copies information from the training data into the model, making the model a derivative work under copyright law.
  3. Infringement occurs when a model generates (portions of) a copyrighted work.

A lot of discussion so far has focused on the first theory because it is the most threatening to AI companies. If the courts uphold this theory, most current LLMs would be illegal, whether or not they have memorized any training data.

The AI industry has some pretty strong arguments that using copyrighted works during the training process is fair use under the 2015 Google Books ruling. But the fact that Llama 3.1 70B memorized large portions of Harry Potter could color how the courts consider these fair use questions.

A key part of fair use analysis is whether a use is “transformative”: whether a company has made something new or is merely profiting from the work of others. The fact that language models are capable of regurgitating substantial portions of popular works like Harry Potter, 1984, and The Hobbit could cause judges to look at these fair use arguments more skeptically.

Moreover, one of Google’s key arguments in the books case was that its system was designed to never return more than a short excerpt from any book. If the judge in the Meta lawsuit wanted to distinguish Meta’s arguments from the ones Google made in the books case, he could point to the fact that Llama can generate far more than a few lines of Harry Potter.

The new study “complicates the story that the defendants have been telling in these cases,” co-author Mark Lemley told me. “Which is ‘we just learn word patterns. None of that shows up in the model.’”

But the Harry Potter result creates even more danger for Meta under that second theory—that Llama itself is a derivative copy of Rowling’s book.

“It’s clear that you can in fact extract substantial parts of Harry Potter and various other books from the model,” Lemley said. “That suggests to me that probably for some of those books there’s something the law would call a copy of part of the book in the model itself.”

The Google Books precedent probably can’t protect Meta against this second legal theory because Google never made its books database available for users to download—Google almost certainly would have lost the case if it had done that.

In principle, Meta could still convince a judge that copying 42 percent of Harry Potter was allowed under the flexible, judge-made doctrine of fair use. But it would be an uphill battle.

“The fair use analysis you’ve gotta do is not just ‘is the training set fair use,’ but ‘is the incorporation in the model fair use?’” Lemley said. “That complicates the defendants’ story.”

Grimmelmann also said there’s a danger that this research could put open-weight models in greater legal jeopardy than closed-weight ones. The Cornell and Stanford researchers could only do their work because the authors had access to the underlying model—and hence to the token probability values that allowed efficient calculation of probabilities for sequences of tokens.

Most leading labs, including OpenAI, Anthropic, and Google, have increasingly restricted access to these so-called logits, making it more difficult to study these models.

Moreover, if a company keeps model weights on its own servers, it can use filters to try to prevent infringing output from reaching the outside world. So even if the underlying OpenAI, Anthropic, and Google models have memorized copyrighted works in the same way as Llama 3.1 70B, it might be difficult for anyone outside the company to prove it.

Moreover, this kind of filtering makes it easier for companies with closed-weight models to invoke the Google Books precedent. In short, copyright law might create a strong disincentive for companies to release open-weight models.

“It’s kind of perverse,” Mark Lemley told me. “I don’t like that outcome.”

On the other hand, judges might conclude that it would be bad to effectively punish companies for publishing open-weight models.

“There’s a degree to which being open and sharing weights is a kind of public service,” Grimmelmann told me. “I could honestly see judges being less skeptical of Meta and others who provide open-weight models.”

Timothy B. Lee was on staff at Ars Technica from 2017 to 2021. Today, he writes Understanding AI, a newsletter that explores how AI works and how it’s changing our world. You can subscribe here.


Timothy is a senior reporter covering tech policy and the future of transportation. He lives in Washington DC.


xAI faces legal threat over alleged Colossus data center pollution in Memphis

“For instance, if all the 35 turbines operated by xAI were using” add-on air pollution control technology “to achieve a NOx emission rate of 2 ppm,” as xAI’s consultant agreed it would, “they would emit about 177 tons of NOx per year, as opposed to the 1,200 to 2,100 tons per year they currently emit,” the letter said.

Allegedly, all of xAI’s active turbines “continue to operate without utilizing best available control technology” (BACT) and “there is no dispute” that since xAI has yet to obtain permitting, it’s not meeting BACT requirements today, the letter said.

“xAI’s failure to comply with the BACT requirement is not only a Clean Air Act violation on paper, but also a significant and ongoing violation that is resulting in substantial amounts of harmful excess emissions,” the letter said.

Additionally, xAI’s turbines are considered a major source of a hazardous air pollutant, formaldehyde, the letter said, with “the potential to emit more than 16 tons” since xAI operations began. “xAI was required to conduct initial emissions testing for formaldehyde within 180 days of becoming a major source,” the letter alleged, but it appears that a year after moving into Memphis, still “xAI has not conducted this testing.”

Terms of xAI’s permitting exemption remain vague

The NAACP and SELC suggested that the exemption that xAI is seemingly operating under could be a “nonroad engine exemption.” However, they alleged that xAI’s turbines don’t qualify for that yearlong exemption, and even if they did, any turbines still onsite after a year would surely not be covered and should have permitting by now.

“While some local leaders, including the Memphis Mayor and Shelby County Health Department, have claimed there is a ‘364-exemption’ for xAI’s gas turbines, they have never been able to point to a specific exemption that would apply to turbines as large as the ones at the xAI site,” SELC’s press release alleged.


Google’s frighteningly good Veo 3 AI videos to be integrated with YouTube Shorts

Even in the age of TikTok, YouTube viewership continues to climb. While Google’s iconic video streaming platform has traditionally pushed creators to produce longer videos that can accommodate more ads, the site’s Shorts format is growing fast. That growth may explode in the coming months, as YouTube CEO Neal Mohan has announced that the Google Veo 3 AI video generator will be integrated with YouTube Shorts later this summer.

According to Mohan, YouTube Shorts has seen a rise in popularity even compared to YouTube as a whole. The streaming platform is now the most watched source of video in the world, but Shorts specifically have seen a massive 186 percent increase in viewership over the past year. Mohan says Shorts now average 200 billion daily views.

YouTube has already equipped creators with a few AI tools, including Dream Screen, which can produce AI video backgrounds with a text prompt. Veo 3 support will be a significant upgrade, though. At the Cannes festival, Mohan revealed that the streaming site will begin offering integration with Google’s leading video model later this summer. “I believe these tools will open new creative lanes for everyone to explore,” said Mohan.


YouTube heavily promotes Shorts on the homepage. Credit: Google

This move will require a few tweaks to Veo 3 outputs, but it seems like a perfect match. As the name implies, YouTube Shorts is intended for short video content. The format initially launched with a 30-second ceiling, but that has since been increased to 60 seconds. Because of the astronomical cost of generative AI, each generated Veo clip is quite short, a mere eight seconds in the current version of the tool. Slap a few of those together, and you’ve got a YouTube Short.


Scientists once hoarded pre-nuclear steel; now we’re hoarding pre-AI content

A time capsule of human expression

Graham-Cumming is no stranger to tech preservation efforts. He’s a British software engineer and writer best known for creating POPFile, an open source email spam filtering program, and for successfully petitioning the UK government to apologize for its persecution of codebreaker Alan Turing—an apology that Prime Minister Gordon Brown issued in 2009.

As it turns out, his pre-AI website isn’t new, but it has languished unannounced until now. “I created it back in March 2023 as a clearinghouse for online resources that hadn’t been contaminated with AI-generated content,” he wrote on his blog.

The website points to several major archives of pre-AI content, including a Wikipedia dump from August 2022 (before ChatGPT’s November 2022 release), Project Gutenberg’s collection of public domain books, the Library of Congress photo archive, and GitHub’s Arctic Code Vault—a snapshot of open source code buried in a former coal mine near the North Pole in February 2020. The wordfreq project appears on the list as well, flash-frozen from a time before AI contamination made its methodology untenable.

The site accepts submissions of other pre-AI content sources through its Tumblr page. Graham-Cumming emphasizes that the project aims to document human creativity from before the AI era, not to make a statement against AI itself. As atmospheric nuclear testing ended and background radiation returned to natural levels, low-background steel eventually became unnecessary for most uses. Whether pre-AI content will follow a similar trajectory remains a question.

Still, it feels reasonable to protect sources of human creativity now, including archival ones, because these repositories may become useful in ways that few appreciate at the moment. For example, in 2020, I proposed creating a so-called “cryptographic ark”—a timestamped archive of pre-AI media that future historians could verify as authentic, collected before my then-arbitrary cutoff date of January 1, 2022. AI slop pollutes more than the current discourse—it could cloud the historical record as well.

For now, lowbackgroundsteel.ai stands as a modest catalog of human expression from what may someday be seen as the last pre-AI era. It’s a digital archaeology project marking the boundary between human-generated and hybrid human-AI cultures. In an age where distinguishing between human and machine output grows increasingly difficult, these archives may prove valuable for understanding how human communication evolved before AI entered the chat.


OpenAI weighs “nuclear option” of antitrust complaint against Microsoft

OpenAI executives have discussed filing an antitrust complaint with US regulators against Microsoft, the company’s largest investor, The Wall Street Journal reported Monday, marking a dramatic escalation in tensions between the two long-term AI partners. OpenAI, which develops ChatGPT, has reportedly considered seeking a federal regulatory review of the terms of its contract with Microsoft for potential antitrust law violations, according to people familiar with the matter.

The potential antitrust complaint would likely argue that Microsoft is using its dominant position in cloud services and contractual leverage to suppress competition, according to insiders who described it as a “nuclear option,” the WSJ reports.

The move could unravel one of the most important business partnerships in the AI industry—a relationship that started with a $1 billion investment by Microsoft in 2019 and has grown to include billions more in funding, along with Microsoft’s exclusive rights to host OpenAI models on its Azure cloud platform.

The friction centers on OpenAI’s efforts to transition from its current nonprofit structure into a public benefit corporation, a conversion that needs Microsoft’s approval to complete. The two companies have not been able to agree on details after months of negotiations, sources told Reuters. OpenAI’s existing for-profit arm would become a Delaware-based public benefit corporation under the proposed restructuring.

The companies are discussing revising the terms of Microsoft’s investment, including the future equity stake it will hold in OpenAI. According to The Information, OpenAI wants Microsoft to hold a 33 percent stake in a restructured unit in exchange for forgoing rights to future profits. The AI company also wants to modify existing clauses that give Microsoft exclusive rights to host OpenAI models in its cloud.


Google can now generate a fake AI podcast of your search results

NotebookLM is undoubtedly one of Google’s best implementations of generative AI technology, giving you the ability to explore documents and notes with a Gemini AI model. Last year, Google added the ability to generate so-called “audio overviews” of your source material in NotebookLM. Now, Google has brought those fake AI podcasts to search results as a test. Instead of clicking links or reading the AI Overview, you can have two nonexistent people tell you what the results say.

This feature is not currently rolling out widely; it’s available in Search Labs, which means you have to manually enable it. Anyone can opt in to the new Audio Overview search experience, though. If you join the test, you’ll quickly see the embedded player in Google search results. However, it’s not at the top with the usual block of AI-generated text. Instead, you’ll see it after the first few search results, below the “People also ask” knowledge graph section.


Google isn’t wasting resources to generate the audio automatically, so you have to click the generate button to get started. A few seconds later, you’re given a back-and-forth conversation between two AI voices summarizing the search results. The player includes a list of sources from which the overview is built, as well as the option to speed up or slow down playback.


Meta beefs up disappointing AI division with $15 billion Scale AI investment

Meta has invested heavily in generative AI, with the majority of its planned $72 billion in capital expenditure this year earmarked for data centers and servers. The deal underlines the high price AI companies are willing to pay for data that can be used to train AI models.

Zuckerberg pledged last year that his company’s models would outstrip rivals’ efforts in 2025, but Meta’s most recent release, Llama 4, has underperformed on various independent reasoning and coding benchmarks.

The long-term goal of researchers at Meta “has always been to reach human intelligence and go beyond it,” said Yann LeCun, the company’s chief AI scientist, at the VivaTech conference in Paris this week.

Building artificial “general” intelligence—AI technologies that have human-level intelligence—is a popular goal for many AI companies. An increasing number of Silicon Valley groups are also seeking to reach “superintelligence,” a hypothetical scenario where AI systems surpass human intelligence.

The core of Scale’s business has been data-labeling, a manual process of ensuring images and text are accurately labeled and categorized before they are used to train AI models.

Wang has forged relationships with Silicon Valley’s biggest investors and technologists, including OpenAI’s Sam Altman. Scale AI’s early customers were autonomous vehicle companies, but the bulk of its expected $2 billion in revenues this year will come from labeling the data used to train the massive AI models built by OpenAI and others.

The deal will result in a substantial payday for Scale’s early venture capital investors, including Accel, Tiger Global Management, and Index Ventures. Tiger’s $200 million investment is worth more than $1 billion at the company’s new valuation, according to a person with knowledge of the matter.

Additional reporting by Tabby Kinder in San Francisco

© 2025 The Financial Times Ltd. All rights reserved. Not to be redistributed, copied, or modified in any way.


How to draft a will to avoid becoming an AI ghost—it’s not easy


Why requests for “no AI resurrections” will probably go ignored.

Proton beams capturing the ghost of OpenAI to suck it into a trap where it belongs

All right! This AI is TOAST! Credit: Aurich Lawson


As artificial intelligence has advanced, AI tools have emerged to make it possible to easily create digital replicas of lost loved ones, which can be generated without the knowledge or consent of the person who died.

Trained on the data of the dead, these tools, sometimes called grief bots or AI ghosts, may be text-, audio-, or even video-based. Chatting provides what some mourners feel is a close approximation to ongoing interactions with the people they love most. But the tech remains controversial, perhaps complicating the grieving process while threatening to infringe upon the privacy of the deceased, whose data could still be vulnerable to manipulation or identity theft.

Because of suspected harms and perhaps a general repulsion to the idea of it, not everybody wants to become an AI ghost.

After a realistic video simulation was recently used to provide a murder victim’s impact statement in court, Futurism summed up social media backlash, noting that the use of AI was “just as unsettling as you think.” And it’s not the first time people have expressed discomfort with the growing trend. Last May, The Wall Street Journal conducted a reader survey seeking opinions on the ethics of so-called AI resurrections. Responding, a California woman, Dorothy McGarrah, suggested there should be a way to prevent AI resurrections in your will.

“Having photos or videos of lost loved ones is a comfort. But the idea of an algorithm, which is as prone to generate nonsense as anything lucid, representing a deceased person’s thoughts or behaviors seems terrifying. It would be like generating digital dementia after your loved ones’ passing,” McGarrah said. “I would very much hope people have the right to preclude their images being used in this fashion after death. Perhaps something else we need to consider in estate planning?”

For experts in estate planning, the question may start to arise as more AI ghosts pop up. But for now, writing “no AI resurrections” into a will remains a complicated process, experts suggest, and such requests may not be honored by all unless laws are changed to reinforce a culture of respecting the wishes of people who feel uncomfortable with the idea of haunting their favorite people through AI simulations.

Can you draft a will to prevent AI resurrection?

Ars contacted several law associations to find out if estate planners are seriously talking about AI ghosts. Only the National Association of Estate Planners and Councils responded; it connected Ars to Katie Sheehan, an expert in the estate planning field who serves as a managing director and wealth strategist for Crestwood Advisors.

Sheehan told Ars that very few estate planners are prepared to answer questions about AI ghosts. She said not only does the question never come up in her daily work, but it’s also “essentially uncharted territory for estate planners since AI is relatively new to the scene.”

“I have not seen any documents drafted to date taking this into consideration, and I review estate plans for clients every day, so that should be telling,” Sheehan told Ars.

Although Sheehan has yet to see a will attempting to prevent AI resurrection, she told Ars that there could be a path to make it harder for someone to create a digital replica without consent.

“You certainly could draft into a power of attorney (for use during lifetime) and a will (for use post death) preventing the fiduciary (attorney in fact or executor) from lending any of your texts, voice, image, writings, etc. to any AI tools and prevent their use for any purpose during life or after you pass away, and/or lay the ground rules for when they can and cannot be used after you pass away,” Sheehan told Ars.

“This could also invoke issues with contract, property and intellectual property rights, and right of publicity as well if AI replicas (image, voice, text, etc.) are being used without authorization,” Sheehan said.

And there are likely more protections for celebrities than for everyday people, Sheehan suggested.

“As far as I know, there is no law” preventing unauthorized non-commercial digital replicas, Sheehan said.

Widely adopted by states, the Revised Uniform Fiduciary Access to Digital Assets Act—which governs who gets access to online accounts of the deceased, like social media or email accounts—could be helpful but isn’t a perfect remedy.

That law doesn’t directly “cover someone’s AI ghost bot, though it may cover some of the digital material some may seek to use to create a ghost bot,” Sheehan said.

“Absent any law” blocking non-commercial digital replicas, Sheehan expects that people’s requests for “no AI resurrections” will likely “be dealt with in the courts and governed by the terms of one’s estate plan, if it is addressed within the estate plan.”

Those potential fights seemingly could get hairy, as “it may be some time before we get any kind of clarity or uniform law surrounding this,” Sheehan suggested.

In the future, Sheehan said, requests prohibiting digital replicas may eventually become “boilerplate language in almost every will, trust, and power of attorney,” just as instructions on digital assets are now.

As “all things AI become more and more a part of our lives,” Sheehan said, “some aspects of AI and its components may also be woven throughout the estate plan regularly.”

“But we definitely aren’t there yet,” she said. “I have had zero clients ask about this.”

Requests for “no AI resurrections” will likely be ignored

Whether loved ones would—or even should—respect requests blocking digital replicas appears to be debatable. But at least one person who built a grief bot wished he’d done more to get his dad’s permission before moving forward with his own creation.

A computer science professor at the University of Washington Bothell, Muhammad Aurangzeb Ahmad, was one of the earliest AI researchers to create a grief bot more than a decade ago after his father died. He built the bot to ensure that his future kids would be able to interact with his father after seeing how incredible his dad was as a grandfather.

When Ahmad started his project, there was no ChatGPT or other advanced AI model to serve as the foundation, so he had to train his own model based on his dad’s data. Putting immense thought into the effort, Ahmad decided to close off the system from the rest of the Internet so that only his dad’s memories would inform the model. To prevent unauthorized chats, he kept the bot on a laptop that only his family could access.

Ahmad was so intent on building a digital replica that felt just like his dad that it didn’t occur to him until after his family started using the bot that he never asked his dad if this was what he wanted. Over time, he realized that the bot was biased toward his view of his dad, perhaps even feeling off to his siblings, who had a slightly different relationship with their father. It’s unclear if his dad would similarly view the bot as preserving just one side of him.

Ultimately, Ahmad didn’t regret building the bot, and he told Ars he thinks his father “would have been fine with it.”

But he did regret not getting his father’s consent.

For people creating bots today, seeking consent may be appropriate if there’s any chance the bot may be publicly accessed, Ahmad suggested. He told Ars that he would never have been comfortable with the idea of his dad’s digital replica being publicly available because the question of an “accurate representation” would come even more into play, as malicious actors could potentially access it and sully his dad’s memory.

Today, anybody can use ChatGPT’s model to freely create a similar bot with their own loved one’s data. And a wide range of grief tech services have popped up online, including HereAfter AI, SeanceAI, and StoryFile, Axios noted in an October report detailing the latest ways “AI could be used to ‘resurrect’ loved ones.” As this trend continues “evolving very fast,” Ahmad told Ars that estate planning is probably the best way to communicate one’s AI ghost preferences.

But in a recently published article on “The Law of Digital Resurrection,” law professor Victoria Haneman warned that “there is no legal or regulatory landscape against which to estate plan to protect those who would avoid digital resurrection, and few privacy rights for the deceased. This is an intersection of death, technology, and privacy law that has remained relatively ignored until recently.”

Haneman agreed with Sheehan that “existing protections are likely sufficient to protect against unauthorized commercial resurrections”—like when actors or musicians are resurrected for posthumous performances. However, she thinks that for personal uses, digital resurrections may best be blocked not through estate planning but by passing a “right to deletion” that would focus on granting the living or next of kin the rights to delete the data that could be used to create the AI ghost rather than regulating the output.

A “right to deletion” could help people fight inappropriate uses of their loved ones’ data, whether AI is involved or not. After her article was published, a lawyer reached out to Haneman about a client’s deceased grandmother whose likeness was used to create a meme of her dancing in a church. The grandmother wasn’t a public figure, and the client had no idea “why or how somebody decided to resurrect her deceased grandmother,” Haneman told Ars.

Although Haneman sympathized with the client, “if it’s not being used for a commercial purpose, she really has no control over this use,” Haneman said. “And she’s deeply troubled by this.”

Haneman’s article offers a rare deep dive into the legal topic. It sensitively maps out the vague territory of digital rights of the dead and explains how those laws—or the lack thereof—interact with various laws dealing with death, from human remains to property rights.

In it, Haneman also points out that, on balance, the rights of the living typically outweigh the rights of the dead, and even specific instructions on how to handle human remains aren’t generally considered binding. Some requests, like organ donation that can benefit the living, are considered critical, Haneman noted. But there are mixed results on how courts enforce other interests of the dead, like a famous writer’s request to destroy all unpublished work or a pet lover’s insistence that their cat or dog be destroyed at death.

She told Ars that right now, “a lot of people are like, ‘Why do I care if somebody resurrects me after I’m dead?’ You know, ‘They can do what they want.’ And they think that, until they find a family member who’s been resurrected by a creepy ex-boyfriend or their dead grandmother’s resurrected, and then it becomes a different story.”

Existing law may protect “the privacy interests of the loved ones of the deceased from outrageous or harmful digital resurrections of the deceased,” Haneman noted, but in the case of the dancing grandma, her meme may not be deemed harmful, no matter how much it troubles the grandchild to see her grandma’s memory warped.

Limited legal protections may not matter so much if, culturally, communities end up developing a distaste for digital replicas, particularly if it becomes widely viewed as disrespectful to the dead, Haneman suggested. Right now, however, society is more fixated on solving other problems with deepfakes rather than clarifying the digital rights of the dead. That could be because few people have been impacted so far, or it could also reflect a broader cultural tendency to ignore death, Haneman told Ars.

“We don’t want to think about our own death, so we really kind of brush aside whether or not we care about somebody else being digitally resurrected until it’s in our face,” Haneman said.

Over time, attitudes may change, especially if the so-called “digital afterlife industry” takes off. And there is some precedent that the law could be changed to reinforce any culture shift.

“The throughline revealed by the law of the dead is that a sacred trust exists between the living and the deceased, with an emphasis upon protecting common humanity, such that data afforded no legal status (or personal data of the deceased) may nonetheless be treated with dignity and receive some basic protections,” Haneman wrote.

An alternative path to prevent AI resurrection

Preventing yourself from becoming an AI ghost seemingly now falls in a legal gray zone that policymakers may need to address.

Haneman calls for a solution that doesn’t depend on estate planning, which she warned “is a structurally inequitable and anachronistic approach that maximizes social welfare only for those who do estate planning.” More than 60 percent of Americans die without a will, often including “those without wealth,” as well as women and racial minorities who “are less likely to die with a valid estate plan in effect,” Haneman reported.

“We can do better in a technology-based world,” Haneman wrote. “Any modern framework should recognize a lack of accessibility as an obstacle to fairness and protect the rights of the most vulnerable through approaches that do not depend upon hiring an attorney and executing an estate plan.”

Rather than twist the law to “recognize postmortem privacy rights,” Haneman advocates for a path for people resistant to digital replicas that focuses on a right to delete the data that would be used to create the AI ghost.

“Put simply, the deceased may exert control over digital legacy through the right to deletion of data but may not exert broader rights over non-commercial digital resurrection through estate planning,” Haneman recommended.

Sheehan told Ars that a right to deletion would likely involve estate planners, too.

“If this is not addressed in an estate planning document and not specifically addressed in the statute (or deemed under the authority of the executor via statute), then the only way to address this would be to go to court,” Sheehan said. “Even with a right of deletion, the deceased would need to delete said data before death or authorize his executor to do so post death, which would require an estate planning document, statutory authority, or court authority.”

Haneman agreed that for many people, estate planners would still be involved, recommending that “the right to deletion would ideally, from the perspective of estate administration, provide for a term of deletion within 12 months.” That “allows the living to manage grief and open administration of the estate before having to address data management issues,” Haneman wrote, and perhaps adequately balances “the interests of society against the rights of the deceased.”

To Haneman, it’s also the better solution for the people left behind because “creating a right beyond data deletion to curtail unauthorized non-commercial digital resurrection creates unnecessary complexity that overreaches, as well as placing the interests of the deceased over those of the living.”

Future generations may be raised with AI ghosts

If a dystopia that experts paint comes true, Big Tech companies may one day profit by targeting grieving individuals to seize the data of the dead, which could be more easily abused since it’s granted fewer rights than data of the living.

Perhaps in that future, critics suggest, people will be tempted into free trials in moments when they’re missing their loved ones most, then forced to either pay a subscription to continue accessing the bot or else perhaps be subjected to ad-based models where their chats with AI ghosts may even feature ads in the voices of the deceased.

Today, even in a world where AI ghosts aren’t yet compelling ad clicks, AI ethicists have warned that interacting with AI ghosts could cause mental health harms, especially if the digital afterlife industry isn’t carefully designed, New Scientist reported. Some people may end up getting stuck maintaining an AI ghost if it’s left behind as a gift, and ethicists suggested that the emotional weight of that could also eventually take a negative toll. While saying goodbye is hard, letting go is considered a critical part of healing during the mourning process, and AI ghosts may make that harder.

But the bots can be a helpful tool to manage grief, some experts suggest, provided that their use is limited to allow for a typical mourning process or combined with therapy from a trained professional, Al Jazeera reported. Ahmad told Ars that working on his bot has not only kept his father close to him but also helped him think more deeply about relationships and memory.

Haneman noted that people have many ways of honoring the dead. Some erect statues, and others listen to saved voicemails or watch old home movies. For some, just “smelling an old sweater” is a comfort. And creating digital replicas, as creepy as some people might find them, is not that far off from these traditions, Haneman said.

“Feeding text messages and emails into existing AI platforms such as ChatGPT and asking the AI to respond in the voice of the deceased is simply a change in degree, not in kind,” Haneman said.

For Ahmad, the decision to create a digital replica of his dad was a learning experience, and perhaps his experience shows why any family or loved one weighing the option should carefully consider it before starting the process.

In particular, he warns families to be careful introducing young kids to grief bots, as they may not be able to grasp that the bot is not a real person. When he initially saw his young kids growing confused about whether their grandfather was alive or not (the introduction of the bot was complicated by the early stages of the pandemic, a time when they met many relatives virtually), he decided to restrict access to the bot until they were older. For a time, the bot only came out for special events like birthdays.

He also realized that introducing the bot forced him to have conversations about life and death with his kids at ages younger than he remembered fully understanding those concepts in his own childhood.

Now, Ahmad’s kids are among the first to be raised among AI ghosts. Ahmad continually updates his father’s digital replica to enhance the family’s experience. He is currently most excited about recent audio advancements that make it easier to add a voice element. He hopes that within the next year, he might be able to use AI to finally nail down his South Asian father’s accent, which up to now has always sounded “just off.” For others working in this space, the next frontier is realistic video or even augmented reality tools, Ahmad told Ars.

To this day, the bot retains sentimental value for Ahmad, but, as Haneman suggested, the bot was not the only way he memorialized his dad. He also created a mosaic, and while his father never saw it, either, Ahmad thinks his dad would have approved.

“He would have been very happy,” Ahmad said.

There’s no way to predict how future generations may view grief tech. But while Ahmad said he’s not sure he’d be interested in an augmented reality interaction with his dad’s digital replica, kids raised seeing AI ghosts as a natural part of their lives may not be as hesitant to embrace or even build new features. Talking to Ars, Ahmad fondly remembered that his young daughter once saw that he was feeling sad and came up with her own AI idea to help her dad feel better.

“It would be really nice if you can just take this program and we build a robot that looks like your dad, and then add it to the robot, and then you can go and hug the robot,” she said, according to her father’s memory.


Ashley is a senior policy reporter for Ars Technica, dedicated to tracking social impacts of emerging policies and new technologies. She is a Chicago-based journalist with 20 years of experience.


AI Overviews hallucinates that Airbus, not Boeing, involved in fatal Air India crash

When major events occur, most people rush to Google to find information. Increasingly, the first thing they see is an AI Overview, a feature that already has a reputation for making glaring mistakes. In the wake of a tragic plane crash in India, Google’s AI search results are spreading misinformation claiming the incident involved an Airbus plane—it was actually a Boeing 787.

Travelers are more attuned to the airliner models these days after a spate of crashes involving Boeing’s 737 lineup several years ago. Searches for airline disasters are sure to skyrocket in the coming days, with reports that more than 200 passengers and crew lost their lives in the Air India Flight 171 crash. The way generative AI operates means some people searching for details may get the wrong impression from Google’s results page.

Not all searches get AI answers, but Google has been steadily expanding this feature since it debuted last year. One searcher on Reddit spotted a troubling confabulation when searching for crashes involving Airbus planes. AI Overviews, apparently overwhelmed with results reporting on the Air India crash, stated confidently (and incorrectly) that it was an Airbus A330 that fell out of the sky shortly after takeoff. We’ve run a few similar searches—some of the AI results say Boeing, some say Airbus, and some include a strange mashup of both Airbus and Boeing. It’s a mess.

In this search, Google’s AI says the crash involved an Airbus A330 instead of a Boeing 787. Credit: /u/stuckintrraffic

But why is Google bringing up the Air India crash at all in the context of Airbus? Unfortunately, it’s impossible to predict if you’ll get an AI Overview that blames Boeing or Airbus. Generative AI is non-deterministic, meaning the output can differ from one run to the next, even for identical inputs. Our best guess for the underlying cause is that numerous articles on the Air India crash mention Airbus as Boeing’s main competitor. AI Overviews is essentially summarizing these results, and the AI goes down the wrong path because it lacks the ability to understand what is true.
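To make the non-determinism concrete, here is a minimal Python sketch of sampling-based generation. It is a toy, not Google’s system: the candidate answers and their probabilities are made-up numbers, and `sample_answer` and `greedy_answer` are hypothetical helper names chosen for illustration.

```python
import random

# Hypothetical probabilities a model might assign to the next word
# for one fixed prompt. These numbers are invented for illustration.
NEXT_TOKEN_PROBS = {"Boeing": 0.6, "Airbus": 0.3, "both": 0.1}


def sample_answer(rng: random.Random) -> str:
    """Pick one answer at random, weighted by its probability."""
    tokens = list(NEXT_TOKEN_PROBS)
    weights = list(NEXT_TOKEN_PROBS.values())
    return rng.choices(tokens, weights=weights, k=1)[0]


def greedy_answer() -> str:
    """Always pick the single most likely answer (no randomness)."""
    return max(NEXT_TOKEN_PROBS, key=NEXT_TOKEN_PROBS.get)


if __name__ == "__main__":
    rng = random.Random()  # unseeded, so each run can differ
    print([sample_answer(rng) for _ in range(10)])
    print(greedy_answer())
```

In this toy setup, the same "prompt" returns the less likely answer a meaningful share of the time when sampling is used, while always choosing the most likely answer is repeatable. Whether and how AI Overviews samples its output is not something the reporting here establishes.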


AI chatbots tell users what they want to hear, and that’s problematic

After the model has been trained, companies can set system prompts, or guidelines, for how the model should behave to minimize sycophantic behavior.

However, working out the best response means delving into the subtleties of how people communicate with one another, such as determining when a direct response is better than a more hedged one.

“[I]s it for the model to not give egregious, unsolicited compliments to the user?” Joanne Jang, head of model behavior at OpenAI, said in a Reddit post. “Or, if the user starts with a really bad writing draft, can the model still tell them it’s a good start and then follow up with constructive feedback?”

Evidence is growing that some users are becoming hooked on using AI.

A study by MIT Media Lab and OpenAI found that a small proportion of users were becoming addicted. Those who perceived the chatbot as a “friend” also reported lower socialization with other people and higher levels of emotional dependence on a chatbot, as well as other problematic behavior associated with addiction.

“These things set up this perfect storm, where you have a person desperately seeking reassurance and validation paired with a model which inherently has a tendency towards agreeing with the participant,” said Nour from Oxford University.

AI start-ups such as Character.AI that offer chatbots as “companions” have faced criticism for allegedly not doing enough to protect users. Last year, a teenager killed himself after interacting with Character.AI’s chatbot. The teen’s family is suing the company for allegedly causing wrongful death, as well as for negligence and deceptive trade practices.

Character.AI said it does not comment on pending litigation, but added it has “prominent disclaimers in every chat to remind users that a character is not a real person and that everything a character says should be treated as fiction.” The company added it has safeguards to protect under-18s and against discussions of self-harm.

Another concern for Anthropic’s Askell is that AI tools can play with perceptions of reality in subtle ways, such as when offering factually incorrect or biased information as the truth.

“If someone’s being super sycophantic, it’s just very obvious,” Askell said. “It’s more concerning if this is happening in a way that is less noticeable to us [as individual users] and it takes us too long to figure out that the advice that we were given was actually bad.”

© 2025 The Financial Times Ltd. All rights reserved. Not to be redistributed, copied, or modified in any way.


New Apple study challenges whether AI models truly “reason” through problems


Puzzle-based experiments reveal limitations of simulated reasoning, but others dispute findings.

An illustration of Tower of Hanoi from Popular Science in 1885. Credit: Public Domain

In early June, Apple researchers released a study suggesting that simulated reasoning (SR) models, such as OpenAI’s o1 and o3, DeepSeek-R1, and Claude 3.7 Sonnet Thinking, produce outputs consistent with pattern-matching from training data when faced with novel problems requiring systematic thinking. The researchers found similar results to an April study that tested models on problems from the United States of America Mathematical Olympiad (USAMO), showing that these same models achieved low scores on novel mathematical proofs.

The new study, titled “The Illusion of Thinking: Understanding the Strengths and Limitations of Reasoning Models via the Lens of Problem Complexity,” comes from a team at Apple led by Parshin Shojaee and Iman Mirzadeh, and it includes contributions from Keivan Alizadeh, Maxwell Horton, Samy Bengio, and Mehrdad Farajtabar.

The researchers examined what they call “large reasoning models” (LRMs), which attempt to simulate a logical reasoning process by producing a deliberative text output sometimes called “chain-of-thought reasoning” that ostensibly assists with solving problems in a step-by-step fashion.

To do that, they pitted the AI models against four classic puzzles—Tower of Hanoi (moving disks between pegs), checkers jumping (eliminating pieces), river crossing (transporting items with constraints), and blocks world (stacking blocks)—scaling them from trivially easy (like one-disk Hanoi) to extremely complex (20-disk Hanoi requiring over a million moves).
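For a sense of how quickly Tower of Hanoi scales, the minimum number of moves for n disks is 2^n - 1. The short Python sketch below computes that count and generates the moves with the classic recursive strategy. It is a textbook illustration, not the Apple team’s test harness; the function names and the peg labels "A", "B", and "C" are assumptions made for this example.

```python
from typing import Iterator, Tuple


def min_moves(disks: int) -> int:
    """Minimum number of moves to solve Tower of Hanoi with `disks` disks."""
    return 2 ** disks - 1


def solve(disks: int, source: str = "A", spare: str = "B",
          target: str = "C") -> Iterator[Tuple[str, str]]:
    """Yield (from_peg, to_peg) moves using the classic recursion."""
    if disks == 0:
        return
    # Move the top disks-1 disks out of the way, onto the spare peg.
    yield from solve(disks - 1, source, target, spare)
    # Move the largest remaining disk to the target peg.
    yield (source, target)
    # Move the disks-1 disks from the spare peg onto the target peg.
    yield from solve(disks - 1, spare, source, target)


if __name__ == "__main__":
    for n in (1, 10, 20):
        print(n, min_moves(n))   # 1 -> 1, 10 -> 1023, 20 -> 1048575
    print(list(solve(2)))        # [('A', 'B'), ('A', 'C'), ('B', 'C')]
```

The algorithm itself fits in a few lines; what grows exponentially is the length of the move list a solver has to write out without a single slip.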

Figure 1 from Apple’s “The Illusion of Thinking” research paper. Credit: Apple

“Current evaluations primarily focus on established mathematical and coding benchmarks, emphasizing final answer accuracy,” the researchers write. In other words, today’s tests only care if the model gets the right answer to math or coding problems that may already be in its training data—they don’t examine whether the model actually reasoned its way to that answer or simply pattern-matched from examples it had seen before.

Ultimately, the researchers found results consistent with the aforementioned USAMO research, showing that these same models achieved mostly under 5 percent on novel mathematical proofs, with only one model reaching 25 percent, and not a single perfect proof among nearly 200 attempts. Both research teams documented severe performance degradation on problems requiring extended systematic reasoning.

Known skeptics and new evidence

AI researcher Gary Marcus, who has long argued that neural networks struggle with out-of-distribution generalization, called the Apple results “pretty devastating to LLMs.” While Marcus has been making similar arguments for years and is known for his AI skepticism, the new research provides fresh empirical support for his particular brand of criticism.

“It is truly embarrassing that LLMs cannot reliably solve Hanoi,” Marcus wrote, noting that AI researcher Herb Simon solved the puzzle in 1957 and many algorithmic solutions are available on the web. Marcus pointed out that even when researchers provided explicit algorithms for solving Tower of Hanoi, model performance did not improve—a finding that study co-lead Iman Mirzadeh argued shows “their process is not logical and intelligent.”

Figure 4 from Apple’s “The Illusion of Thinking” research paper. Credit: Apple

The Apple team found that simulated reasoning models behave differently from “standard” models (like GPT-4o) depending on puzzle difficulty. On easy tasks, such as Tower of Hanoi with just a few disks, standard models actually won because reasoning models would “overthink” and generate long chains of thought that led to incorrect answers. On moderately difficult tasks, SR models’ methodical approach gave them an edge. But on truly difficult tasks, including Tower of Hanoi with 10 or more disks, both types failed entirely, unable to complete the puzzles, no matter how much time they were given.

The researchers also identified what they call a “counterintuitive scaling limit.” As problem complexity increases, simulated reasoning models initially generate more thinking tokens but then reduce their reasoning effort beyond a threshold, despite having adequate computational resources.

The study also revealed puzzling inconsistencies in how models fail. Claude 3.7 Sonnet could perform up to 100 correct moves in Tower of Hanoi but failed after just five moves in a river crossing puzzle—despite the latter requiring fewer total moves. This suggests the failures may be task-specific rather than purely computational.

Competing interpretations emerge

However, not all researchers agree with the interpretation that these results demonstrate fundamental reasoning limitations. University of Toronto economist Kevin A. Bryan argued on X that the observed limitations may reflect deliberate training constraints rather than inherent inabilities.

“If you tell me to solve a problem that would take me an hour of pen and paper, but give me five minutes, I’ll probably give you an approximate solution or a heuristic. This is exactly what foundation models with thinking are RL’d to do,” Bryan wrote, suggesting that models are specifically trained through reinforcement learning (RL) to avoid excessive computation.

Bryan suggests that unspecified industry benchmarks show “performance strictly increases as we increase in tokens used for inference, on ~every problem domain tried,” but notes that deployed models intentionally limit this to prevent “overthinking” simple queries. This perspective suggests the Apple paper may be measuring engineered constraints rather than fundamental reasoning limits.

Figure 6 from Apple’s “The Illusion of Thinking” research paper. Credit: Apple

Software engineer Sean Goedecke offered a similar critique of the Apple paper on his blog, noting that when faced with Tower of Hanoi requiring over 1,000 moves, DeepSeek-R1 “immediately decides ‘generating all those moves manually is impossible,’ because it would require tracking over a thousand moves. So it spins around trying to find a shortcut and fails.” Goedecke argues this represents the model choosing not to attempt the task rather than being unable to complete it.

Other researchers also question whether these puzzle-based evaluations are even appropriate for LLMs. Independent AI researcher Simon Willison told Ars Technica in an interview that the Tower of Hanoi approach was “not exactly a sensible way to apply LLMs, with or without reasoning,” and suggested the failures might simply reflect running out of tokens in the context window (the maximum amount of text an AI model can process) rather than reasoning deficits. He characterized the paper as potentially overblown research that gained attention primarily due to its “irresistible headline” about Apple claiming LLMs don’t reason.
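Willison’s token-limit point can be checked with back-of-the-envelope arithmetic. The Python sketch below estimates the largest Tower of Hanoi instance whose complete move list would fit in a fixed output budget. Both inputs are assumptions chosen for illustration (10 tokens per written move and a 64,000-token budget), not figures taken from the reporting above, and `largest_solvable` is a hypothetical helper name.

```python
def min_moves(disks: int) -> int:
    """Minimum number of moves for Tower of Hanoi with `disks` disks."""
    return 2 ** disks - 1


def largest_solvable(budget_tokens: int, tokens_per_move: int) -> int:
    """Largest disk count whose full move list fits in the token budget."""
    disks = 0
    while min_moves(disks + 1) * tokens_per_move <= budget_tokens:
        disks += 1
    return disks


if __name__ == "__main__":
    # Assumed values: 64,000-token budget, 10 tokens per written move.
    print(largest_solvable(64_000, 10))
```

Under those assumed numbers, the budget runs out at 13 disks, before any reasoning tokens are counted. Different assumptions shift the cutoff by a disk or two, but the exponential growth in moves means some cutoff always exists.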

The Apple researchers themselves caution against over-extrapolating the results of their study, acknowledging in their limitations section that “puzzle environments represent a narrow slice of reasoning tasks and may not capture the diversity of real-world or knowledge-intensive reasoning problems.” The paper also acknowledges that reasoning models show improvements in the “medium complexity” range and continue to demonstrate utility in some real-world applications.

Implications remain contested

Has the credibility of claims about AI reasoning models been completely destroyed by these two studies? Not necessarily.

What these studies may suggest instead is that the kinds of extended context reasoning hacks used by SR models may not be a pathway to general intelligence, as some have hoped. In that case, the path to more robust reasoning capabilities may require fundamentally different approaches rather than refinements to current methods.

As Willison noted above, the results of the Apple study have so far been explosive in the AI community. Generative AI is a controversial topic, with many people gravitating toward extreme positions in an ongoing ideological battle over the models’ general utility. Many proponents of generative AI have contested the Apple results, while critics have latched onto the study as a definitive knockout blow for LLM credibility.

Apple’s results, combined with the USAMO findings, seem to strengthen the case made by critics like Marcus that these systems rely on elaborate pattern-matching rather than the kind of systematic reasoning their marketing might suggest. To be fair, much of the generative AI space is so new that even its inventors do not yet fully understand how or why these techniques work. In the meantime, AI companies might build trust by tempering some claims about reasoning and intelligence breakthroughs.

However, that doesn’t mean these AI models are useless. Even elaborate pattern-matching machines can be useful in performing labor-saving tasks for the people that use them, given an understanding of their drawbacks and confabulations. As Marcus concedes, “At least for the next decade, LLMs (with and without inference time “reasoning”) will continue have their uses, especially for coding and brainstorming and writing.”


Benj Edwards is Ars Technica’s Senior AI Reporter and founder of the site’s dedicated AI beat in 2022. He’s also a tech historian with almost two decades of experience. In his free time, he writes and records music, collects vintage computers, and enjoys nature. He lives in Raleigh, NC.



“Yuck”: Wikipedia pauses AI summaries after editor revolt

Generative AI is permeating the Internet, with chatbots and AI summaries popping up faster than we can keep track. Even Wikipedia, the vast repository of knowledge famously maintained by an army of volunteer human editors, is looking to add robots to the mix. The site began testing AI summaries in some articles over the past week, but the project has been frozen after editors voiced their opinion. And that opinion is: “yuck.”

The seeds of this project were planted at Wikimedia’s 2024 conference, where foundation representatives and editors discussed how AI could advance Wikipedia’s mission. The wiki on the so-called “Simple Article Summaries” notes that the editors who participated in the discussion believed the summaries could improve learning on Wikipedia.

According to 404 Media, Wikipedia announced the opt-in AI pilot on June 2, which was set to run for two weeks on the mobile version of the site. The summaries appeared at the top of select articles in a collapsed form. Users had to tap to expand and read the full summary. The AI text also included a highlighted “Unverified” badge.

Feedback from the larger community of editors was immediate and harsh. Some of the first comments were simply “yuck,” with others calling the addition of AI a “ghastly idea” and “PR hype stunt.”

Others expounded on the issues with adding AI to Wikipedia, citing a potential loss of trust in the site. Editors work together to ensure articles are accurate, featuring verifiable information and a neutral point of view. However, nothing is certain when you put generative AI in the driver’s seat. “I feel like people seriously underestimate the brand risk this sort of thing has,” said one editor. “Wikipedia’s brand is reliability, traceability of changes, and ‘anyone can fix it.’ AI is the opposite of these things.”
