Isaac Asimov

Microsoft deletes blog telling users to train AI on pirated Harry Potter books


Wizarding world of AI slop

The now-deleted Harry Potter dataset was “mistakenly” marked public domain.

Following backlash in a Hacker News thread, Microsoft deleted a blog post that critics said encouraged developers to pirate Harry Potter books to train AI models that could then be used to create AI slop.

The blog, which has been archived, was written in November 2024 by a senior product manager, Pooja Kamath. According to her LinkedIn, Kamath has been at Microsoft for more than a decade and remains with the company. In 2024, Microsoft tapped her to promote a new feature that the blog said made it easier to “add generative AI features to your own applications with just a few lines of code using Azure SQL DB, LangChain, and LLMs.”

What better way to show “engaging and relatable examples” of Microsoft’s new feature that would “resonate with a wide audience” than to “use a well-known dataset” like Harry Potter books, the blog said.

The books are “one of the most famous and cherished series in literary history,” the blog noted, and fans could use the LLMs they trained in two fun ways: building Q&A systems providing “context-rich answers” and generating “new AI-driven Harry Potter fan fiction” that’s “sure to delight Potterheads.”

To help Microsoft customers achieve this vision, the blog linked to a Kaggle dataset that included all seven Harry Potter books, which, Ars verified, has been available online for years and incorrectly marked as “public domain.” Kaggle’s terms say that rights holders can send notices of infringing content, and repeat offenders risk suspensions, but Hacker News commenters speculated that the Harry Potter dataset flew under the radar, with only 10,000 downloads over time, not catching the attention of J.K. Rowling, who famously keeps a strong grip on the Harry Potter copyrights. The dataset was promptly deleted on Thursday after Ars reached out to the uploader, Shubham Maindola, a data scientist in India with no apparent links to Microsoft.

Maindola told Ars that “the dataset was marked as Public Domain by mistake. There was no intention to misrepresent the licensing status of the works.”

It’s unclear whether Kamath was directed to link to the Harry Potter books dataset in the blog or if it was an individual choice. Cathay Y. N. Smith, a law professor and co-director of Chicago-Kent College of Law’s Program in Intellectual Property Law, told Ars that Kamath may not have realized the books were too recent to be in the public domain.

“Someone might be really knowledgeable about books and technology, but not necessarily about copyright terms and how long they last,” Smith said. “Especially if she saw that something was marked by another reputable company as being public domain.”

Microsoft declined Ars’ request to comment. Kaggle did not respond to Ars’ request to comment.

Microsoft was “probably smart” to pull the blog

On Hacker News, commenters suggested that it’s unlikely anyone familiar with the popular franchise would believe the Harry Potter books were in the public domain. They debated whether Microsoft’s blog was “problematic copyright-wise,” since Microsoft not only encouraged customers to download the infringing materials but also used the books themselves to create Harry Potter AI models that relied on beloved characters to hype Microsoft products.

Microsoft’s blog was posted more than a year ago, at a time when AI firms began facing lawsuits over AI models, which had allegedly infringed copyrights by training on pirated materials and regurgitating works verbatim.

The blog recommended that users learn to train their own AI models by downloading the Harry Potter dataset and then uploading text files to Azure Blob Storage. It included example models based on a dataset that Microsoft seemingly uploaded to Azure Blob Storage, which only included the first book, Harry Potter and the Sorcerer’s Stone.

By training large language models (LLMs) on the text files, Harry Potter fans could create Q&A systems capable of pulling up relevant excerpts of the books. One example query, “Wizarding World snacks,” retrieved an excerpt from The Sorcerer’s Stone where Harry marvels at strange treats like Bertie Bott’s Every Flavor Beans and chocolate frogs. Another prompt, asking “How did Harry feel when he first learnt that he was a Wizard?”, generated an output pointing to various early excerpts in the book.

But perhaps an even more exciting use case, Kamath suggested, was generating fan fiction to “explore new adventures” and “even create alternate endings.” That model could quickly comb the dataset for “contextually similar” excerpts that could be used to output fresh stories that fit with existing narratives and incorporate “elements from the retrieved passages,” the blog said.

As an example, Kamath trained a model to write a Harry Potter story she could use to market the feature she was blogging about. She asked the model to write a story in which Harry meets a new friend on the Hogwarts Express train who tells him all about Microsoft’s Native Vector Support in SQL “in the Muggle world.”

Drawing on parts of The Sorcerer’s Stone where Harry learns about Quidditch and gets to know Hermione Granger, the fan fiction showed a boy selling Harry on Microsoft’s “amazing” new feature. To do this, he likened it to having a spell that helps you find exactly what you need among thousands of options, instantly, while declaring it was perfect for machine learning, AI, and recommendation systems.

Further blurring the lines between Microsoft and Harry Potter brands, Kamath also generated an image showing Harry with his new friend, stamped with a Microsoft logo.

Smith told Ars that both use cases could frustrate rights holders, depending on the content in the model outputs.

“I think that the regurgitation and the creation of fan fiction, they both could flag copyright issues, in that fan fiction often has to take from the expressive elements, a copyrighted character, a character that’s famous enough to be protected by a copyright law or plot stories or sequences,” Smith said. “If these things are copied and reproduced, then that output could be potentially infringing.”

But it’s also still a gray area. Looking at the blog, Smith said, “I would be concerned,” but “I wouldn’t say it’s automatically infringement.”

Smith told Ars that, in pulling the blog, Microsoft “was probably smart,” since courts have only generally said that training AI on copyrighted books is fair use. But courts continue to probe questions about pirated AI training materials.

On the deleted Kaggle dataset page, Maindola previously explained that to source the data, he “downloaded the ebooks and then converted them to txt files.”

Microsoft may have infringed copyrights

If Microsoft ever faced questions as to whether the company knowingly used pirated books to train the example models, fair use “could be a difficult argument,” Smith said.

Hacker News commenters suggested the blog could be considered fair use, since the training guide was for “educational purposes,” and Smith said that Microsoft could raise some “good arguments” in its defense.

However, she also suggested that Microsoft could be deemed liable for contributing to infringement on some level after leaving the blog up for a year. Before it was removed, the Kaggle dataset was downloaded more than 10,000 times.

“The ultimate result is to create something infringing by saying, ‘Hey, here you go, go grab that infringing stuff and use that in our system,’” Smith said. “They could potentially have some sort of secondary contributory liability for copyright infringement, downloading it, as well as then using it to encourage others to use it for training purposes.”

On Hacker News, commenters slammed the blog, including a self-described former Microsoft employee who claimed that Microsoft lets employees “blog without having to go through some approval or editing process.”

“It looks like somebody made a bad judgment call on what to put in a company blog post (and maybe what constitutes ethical activity) and that it was taken down as soon as someone noticed,” the former employee said.

Others suggested the blame was solely with the Kaggle uploader, Maindola, who told Ars that the dataset should never have been marked “public domain.” But Microsoft critics pushed back, noting that the Kaggle page made it clear that no special permission was granted and that Microsoft’s employee should have known better. “They don’t need to know any details to know that these properties belong to massive companies and aren’t free for the taking,” one commenter said.

The Harry Potter books weren’t the only books targeted, the thread noted, linking to a separate Azure sample containing Isaac Asimov’s Foundation series, which is also not in the public domain.

“Microsoft could have used any dataset for their blog, they could have even chosen to use actual public domain novels,” another Hacker News commenter wrote. “Instead, they opted to use copywritten works that J.K. hasn’t released into the public domain (unless user ‘Shubham Maindola’ is J.K.’s alter ego).”

Smith suggested Microsoft could have avoided this week’s backlash by more carefully reviewing blogs, noting that “if a company is risk averse, this would probably be flagged.” But she also understood Kamath’s preference for Harry Potter over the many long-forgotten characters that exist in the public domain. On Hacker News, some commenters defended Kamath’s blog, urging that it should be considered fair use since nonprofits and educational institutions could do the same thing in a teaching context without issue.

“I would have been concerned if I were the one clearing this for Microsoft, but at the same time, I completely understand what this employee was doing,” Smith said. “No one wants to write fan fiction about books that are in the public domain.”

Ashley Belanger is a senior policy reporter for Ars Technica, dedicated to tracking social impacts of emerging policies and new technologies. She is a Chicago-based journalist with 20 years of experience.

A warlord brings chaos in Foundation S3 trailer

Foundation returns for a third season next month on Apple TV+.

Foundation, Apple TV+’s lavish adaptation (or re-mix, if you prefer) of Isaac Asimov’s seminal sci-fi series, returns for its third season next month, and the streaming platform has dropped an official trailer to give us a taste of what’s in store.

As previously reported, the first season ended with a major time jump of 138 years, and S2 focused on the Second Crisis: imminent war between Empire and the Foundation, along with an enemy seeking to destroy Empire from within. The Foundation, meanwhile, adopted the propaganda tactics of religion to recruit new acolytes to the cause. We also met a colony of “Mentalics” with psionic abilities. We’re getting another mega time jump for the Third Crisis.

Per the official premise:

Set 152 years after the events of S2, The Foundation has become increasingly established far beyond its humble beginnings while the Cleonic Dynasty’s Empire has dwindled. As both of these galactic powers forge an uneasy alliance, a threat to the entire galaxy appears in the fearsome form of a warlord known as “The Mule” whose sights are set on ruling the universe by use of physical and military force, as well as mind control. It’s anyone’s guess who will win, who will lose, who will live, and who will die as Hari Seldon, Gaal Dornick, the Cleons and Demerzel play a potentially deadly game of intergalactic chess.

Most of the main cast is returning: Lee Pace as Brother Day, Cassian Bilton as Brother Dawn, Terrence Mann as Brother Dusk, Jared Harris as Hari Seldon, Lou Llobell as Gaal, and Laura Birn as Eto Demerzel. Pilou Asbæk plays the Mule. New S3 cast members include Alexander Siddig as Dr. Ebling Mis, a Seldon fan and self-taught psychohistorian; Troy Kotsur as Preem Palver, leader of a planet of psychics; Cherry Jones as Foundation Ambassador Quent; Brandon P. Bell as Han Pritcher; Synnøve Karlsen as Bayta Mallow; Cody Fern as Toran Mallow; Tómas Lemarquis as Magnifico Giganticus; Yootha Wong-Loi-Sing as Song; and Leo Bill as Mayor Indbur.

The Third Crisis dawns in Foundation S3 teaser

We have our first teaser for the upcoming third season of Foundation.

It’s been nearly two years, but the third season of Foundation, Apple TV+’s epic adaptation (or remix) of the Isaac Asimov series, is almost here. The streaming platform released an action-packed teaser of what we can expect from the new ten-episode season: the onset of the Third Crisis, a galactic war, and a shirtless Lee Pace.

(Some spoilers for first two seasons below.)

Showrunner David S. Goyer took great pains in S1 to carefully set up his expansive fictional world, and the scope only broadened in the second season. As previously reported, Asimov’s fundamental narrative arc remains intact, with the series taking place across multiple planets over 1,000 years and featuring a huge cast of characters.

Mathematician Hari Seldon (Jared Harris) developed a controversial theory of “psychohistory,” and his calculations predict the fall of the Empire, ushering in a Dark Age period that will last 30,000 years, after which a second Empire will emerge. The collapse of the Empire is inevitable, but Seldon has a plan to reduce the Dark Ages to a mere 1,000 years through the establishment of a Foundation to preserve all human knowledge so that civilization need not rebuild itself entirely from scratch. He is aided in this endeavor by his math prodigy protégé Gaal Dornick (Lou Llobell).

The biggest change from the books is the replacement of the Empire’s ruling committee with a trio of Eternal Emperor clones called the Cleons, a genetic triune dynasty composed of Brother Day (Pace), Brother Dusk (Terrence Mann), and Brother Dawn (Cassian Bilton). Technically, they are all perfect incarnations of the same man at different ages, and this is both the source of their strength as a team and of their conflicts. Their guardian is an android, Eto Demerzel (Laura Birn), one of the last surviving androids from the ancient Robot Wars, who is programmed to protect the dynasty at all costs.
