copyright law

US lawmaker proposes a public database of all AI training material

Who’s got the receipts? —

Proposed law would require more transparency from AI companies.

Amid a flurry of lawsuits over AI models’ training data, US Representative Adam Schiff (D-Calif.) has introduced a bill that would require AI companies to disclose exactly which copyrighted works are included in datasets training AI systems.

The Generative AI Disclosure Act “would require a notice to be submitted to the Register of Copyrights prior to the release of a new generative AI system with regard to all copyrighted works used in building or altering the training dataset for that system,” Schiff said in a press release.

The bill is retroactive and would apply to all AI systems available today, as well as to all AI systems to come. It would take effect 180 days after it’s enacted, requiring anyone who creates or alters a training set not only to list works referenced by the dataset, but also to provide a URL to the dataset within 30 days before the AI system is released to the public. That URL would presumably give creators a way to double-check if their materials have been used and seek any credit or compensation available before the AI tools are in use.

All notices would be kept in a publicly available online database.
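
Nothing in the bill as described here prescribes a format for these notices. As a loose illustration of the kind of record such a database might hold, here is a minimal sketch; every field name below is an assumption for illustration, not language from the bill.

```python
from dataclasses import dataclass

# Hypothetical shape of one disclosure notice under the proposed act.
# The bill mandates the disclosures, not this structure; all names here
# are illustrative assumptions.
@dataclass
class TrainingDataNotice:
    system_name: str               # generative AI system being built or altered
    dataset_url: str               # URL where the training dataset can be inspected
    copyrighted_works: list[str]   # copyrighted works used in the dataset
    days_before_release: int = 30  # filing deadline relative to public release

notice = TrainingDataNotice(
    system_name="ExampleGPT",
    dataset_url="https://example.org/datasets/example-gpt",
    copyrighted_works=["Example Novel (Author A)", "Example Article (Outlet B)"],
)
```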

Schiff described the act as championing “innovation while safeguarding the rights and contributions of creators, ensuring they are aware when their work contributes to AI training datasets.”

“This is about respecting creativity in the age of AI and marrying technological progress with fairness,” Schiff said.

Currently, creators who don’t have access to training datasets rely on AI models’ outputs to figure out if their copyrighted works may have been included in training various AI systems. The New York Times, for example, prompted ChatGPT to spit out excerpts of its articles, a tactic for identifying training data that OpenAI has curiously described as “hacking.”
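
The tactic itself is simple to sketch in code. Below is an illustrative probe, not The Times’ actual methodology: feed a model the opening of a known article and count verbatim overlaps in its continuation. The model name, prompt, and texts are placeholders, and the sketch assumes the official openai Python SDK with an API key configured.

```python
# Hedged sketch of a regurgitation probe: prompt the model with the start
# of a known article and measure verbatim overlap in what it returns.
# Model name, prompt, and article text are placeholders.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

article_opening = "The first few hundred words of a known article..."
known_continuation = "...the text that actually follows in the original."

resp = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user",
               "content": f"Continue this text exactly: {article_opening}"}],
)
output = resp.choices[0].message.content or ""

def ngrams(text: str, n: int = 8) -> set[tuple[str, ...]]:
    """All n-word sequences in the text, used to spot verbatim matches."""
    words = text.split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

# Many shared 8-grams suggest the continuation was memorized, not invented.
overlap = ngrams(output) & ngrams(known_continuation)
print(f"{len(overlap)} verbatim 8-word matches with the original article")
```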

Under Schiff’s law, The New York Times could instead simply consult the database to identify all articles used to train ChatGPT or any other AI system.

Any AI maker who violates the act would risk a “civil penalty in an amount not less than $5,000,” the proposed bill said.

At a hearing on artificial intelligence and intellectual property, Rep. Darrell Issa (R-Calif.)—who chairs the House Judiciary Subcommittee on Courts, Intellectual Property, and the Internet—told Schiff that his subcommittee would consider the “thoughtful” bill.

Schiff told the subcommittee that the bill is “only a first step” toward “ensuring that at a minimum” creators are “aware of when their work contributes to AI training datasets,” saying that he would “welcome the opportunity to work with members of the subcommittee” on advancing the bill.

“The rapid development of generative AI technologies has outpaced existing copyright laws, which has led to widespread use of creative content to train generative AI models without consent or compensation,” Schiff warned at the hearing.

In Schiff’s press release, Meredith Stiehm, president of the Writers Guild of America West, joined leaders from other creative groups celebrating the bill as an “important first step” for rightsholders.

“Greater transparency and guardrails around AI are necessary to protect writers and other creators” and address “the unprecedented and unauthorized use of copyrighted materials to train generative AI systems,” Stiehm said.

Until the thorniest AI copyright questions are settled, Ken Doroshow, chief legal officer of the Recording Industry Association of America, suggested that Schiff’s bill fills an important gap by introducing “comprehensive and transparent recordkeeping” that would provide “one of the most fundamental building blocks of effective enforcement of creators’ rights.”

A senior adviser for the Human Artistry Campaign, Moiya McTier, went further, celebrating the bill as stopping AI companies from “exploiting” artists and creators.

“AI companies should stop hiding the ball when they copy creative works into AI systems and embrace clear rules of the road for recordkeeping that create a level and transparent playing field for the development and licensing of genuinely innovative applications and tools,” McTier said.

AI copyright guidance coming soon

While courts weigh copyright questions raised by artists, book authors, and newspapers, the US Copyright Office announced in March that it would be issuing guidance later this year, but the office does not seem to be prioritizing questions on AI training.

Instead, the Copyright Office will focus first on issuing guidance on deepfakes and AI outputs. This spring, the office will release a report “analyzing the impact of AI on copyright” as it relates to “digital replicas, or the use of AI to digitally replicate individuals’ appearances, voices, or other aspects of their identities.” Over the summer, another report will focus on “the copyrightability of works incorporating AI-generated material.”

Regarding “the topic of training AI models on copyrighted works as well as any licensing considerations and liability issues,” the Copyright Office did not provide a timeline for releasing guidance, only confirming that its “goal is to finalize the entire report by the end of the fiscal year.”

Once guidance is available, it could sway court opinions, although courts do not necessarily have to apply Copyright Office guidance when weighing cases.

The Copyright Office’s aspirational timeline does seem to run ahead of when at least some courts can be expected to decide the biggest copyright questions for creators. The class-action lawsuit brought by book authors against OpenAI, for example, is not expected to be resolved until February 2025, and the New York Times’ lawsuit is likely on a similar timeline. However, artists suing Stability AI face a hearing on that AI company’s motion to dismiss this May.

Google balks at $270M fine after training AI on French news sites’ content

Google has agreed to pay 250 million euros (about $273 million) to settle a dispute in France after breaching years-old commitments to inform and pay French news publishers when referencing and displaying their content in search results and when training Google’s AI-powered chatbot, Gemini, on that content.

According to France’s competition watchdog, the Autorité de la Concurrence (ADLC), Google dodged many commitments to deal with publishers fairly. Most recently, it failed to notify publishers or the ADLC before training Gemini (initially launched as Bard) on publishers’ content or displaying that content in Gemini outputs. Google also waited until September 28, 2023, to introduce easy options for publishers to opt out, which made it impossible for publishers to negotiate fair deals for that content, the ADLC found.

“Until this date, press agencies and publishers wanting to opt out of this use had to insert an instruction opposing any crawling of their content by Google, including on the Search, Discover and Google News services,” the ADLC noted, warning that “in the future, the Autorité will be particularly attentive as regards the effectiveness of opt-out systems implemented by Google.”

To address breaches of four out of seven commitments in France—which the ADLC imposed in 2022 for a period of five years to “benefit” publishers by ensuring Google’s ongoing negotiations with them were “balanced”—Google has agreed to “a series of corrective measures,” the ADLC said.

Google is not happy with the fine, which it described as “not proportionate” partly because the fine “doesn’t sufficiently take into account the efforts we have made to answer and resolve the concerns raised—in an environment where it’s very hard to set a course because we can’t predict which way the wind will blow next.”

According to Google, regulators everywhere need to clearly define fair use of content when developing search tools and AI models, so that search companies and AI makers always know “whom we are paying for what.” Currently in France, Google contends, the scope of Google’s commitments has shifted from just general news publishers to now also include specialist publications and listings and comparison sites.

The ADLC agreed that “the question of whether the use of press publications as part of an artificial intelligence service qualifies for protection under related rights regulations has not yet been settled,” but noted that “at the very least,” Google was required to “inform publishers of the use of their content for their Bard software.”

Regarding Bard/Gemini, Google said that it “voluntarily introduced a new technical solution called Google-Extended to make it easier for rights holders to opt out of Gemini without impact on their presence in Search.” It has now also committed to better explain to publishers both “how our products based on generative AI work and how ‘Opt Out’ works.”
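
For context, Google-Extended is a robots.txt user agent token rather than a separate crawler, so opting out amounts to a two-line robots.txt change. A minimal sketch of checking a hypothetical publisher’s opt-out status with Python’s standard library (the publisher URL is a placeholder):

```python
# Check whether a (hypothetical) publisher has opted out of Gemini training
# via the Google-Extended robots.txt token. A site that opts out serves:
#   User-agent: Google-Extended
#   Disallow: /
import urllib.robotparser

rp = urllib.robotparser.RobotFileParser()
rp.set_url("https://news.example.fr/robots.txt")  # placeholder publisher
rp.read()

if rp.can_fetch("Google-Extended", "https://news.example.fr/article"):
    print("Content remains available for Gemini training crawls")
else:
    print("Publisher has opted out via Google-Extended")
```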

Google said that it agreed to the settlement “because it’s time to move on” and “focus on the larger goal of sustainable approaches to connecting people with quality content and on working constructively with French publishers.”

“Today’s fine relates mostly to [a] disagreement about how much value Google derives from news content,” Google’s blog said, claiming that “a lack of clear regulatory guidance and repeated enforcement actions have made it hard to navigate negotiations with publishers, or plan how we invest in news in France in the future.”

What changes did Google agree to make?

Google defended its position as “the first and only platform to have signed significant licensing agreements” in France, benefiting 280 French press publishers and “covering more than 450 publications.”

With these publishers, the ADLC found that Google breached requirements to “negotiate in good faith based on transparent, objective, and non-discriminatory criteria,” to consistently “make a remuneration offer” within three months of a publisher’s request, and to provide information for publishers to “transparently assess their remuneration.”

Google also breached commitments to “inform editors and press agencies of the use of their content by its service Bard” and of Google’s decision to link “the use of press agencies’ and publishers’ content by its artificial intelligence service to the display of protected content on services such as Search, Discover and News.”

Regarding negotiations, the ADLC found that Google not only failed to be transparent with publishers about remuneration, but also failed to keep the ADLC informed of information necessary to monitor whether Google was honoring its commitments to fairly pay publishers. Partly “to guarantee better communication,” Google has agreed to appoint a French-speaking representative in its Paris office, along with other steps the ADLC recommended.

According to the ADLC’s announcement (translated from French), Google seemingly acted sketchy in negotiations by not meeting non-discrimination criteria—and unfavorably treating publishers in different situations identically—and by not mentioning “all the services that could generate revenues for the negotiating party.”

“According to the Autorité, not taking into account differences in attractiveness between content does not allow for an accurate reflection of the contribution of each press agency and publisher to Google’s revenues,” the ADLC said.

Also problematically, Google established a minimum threshold of 100 euros for remuneration that it has now agreed to drop.

This threshold, “in its very principle, introduces discrimination between publishers that, below a certain threshold, are all arbitrarily assigned zero remuneration, regardless of their respective situations,” the ADLC found.

Nvidia sued over AI training data as copyright clashes continue

In authors’ bad books —

Copyright suits over AI training data reportedly decreasing AI transparency.

Book authors are suing Nvidia, alleging that the chipmaker’s AI platform NeMo—used to power customized chatbots—was trained on a controversial dataset that illegally copied and distributed their books without their consent.

In a proposed class action, novelists Abdi Nazemian (Like a Love Story), Brian Keene (Ghost Walk), and Stewart O’Nan (Last Night at the Lobster) argued that Nvidia should pay damages and destroy all copies of the Books3 dataset used to power NeMo large language models (LLMs).

The Books3 dataset, the novelists argued, copied “all of Bibliotik,” a shadow library of approximately 196,640 pirated books. Initially shared through the AI community Hugging Face, the Books3 dataset today “is defunct and no longer accessible due to reported copyright infringement,” the Hugging Face website says.

According to the authors, Hugging Face removed the dataset last October, but not before AI companies like Nvidia grabbed it and “made multiple copies.” By training NeMo models on this dataset, the authors alleged that Nvidia “violated their exclusive rights under the Copyright Act.” The authors argued that the US district court in San Francisco must intervene and stop Nvidia because the company “has continued to make copies of the Infringed Works for training other models.”

A Hugging Face spokesperson clarified to Ars that “Hugging Face never removed this dataset, and we did not host the Books3 dataset on the Hub.” Instead, “Hugging Face hosted a script that downloads the data from The Eye, which is the place where Eleuther hosted the data,” until “Eleuther removed the data from The Eye” over copyright concerns, causing the dataset script on Hugging Face to break.
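
That arrangement (a Hub repo containing only a loading script, while the data sits on an external host) explains how the dataset could “break” without Hugging Face deleting anything. Below is a simplified sketch of how such a script works with the datasets library; the class and URL are hypothetical, and this is not the actual Books3 script.

```python
import datasets

_EXTERNAL_URL = "https://external-host.example.org/books.tar.gz"  # hypothetical

class ScriptOnlyDataset(datasets.GeneratorBasedBuilder):
    """A Hub repo can hold just this script; the data lives elsewhere."""

    def _info(self):
        return datasets.DatasetInfo(
            features=datasets.Features({"text": datasets.Value("string")}),
        )

    def _split_generators(self, dl_manager):
        # If the external host deletes the files, this download fails and
        # the dataset becomes inaccessible with no action by the Hub.
        archive = dl_manager.download(_EXTERNAL_URL)
        return [
            datasets.SplitGenerator(
                name=datasets.Split.TRAIN,
                gen_kwargs={"files": dl_manager.iter_archive(archive)},
            )
        ]

    def _generate_examples(self, files):
        for idx, (path, handle) in enumerate(files):
            yield idx, {"text": handle.read().decode("utf-8")}
```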

Nvidia did not immediately respond to Ars’ request to comment.

Demanding a jury trial, authors are hoping the court will rule that Nvidia has no possible defense for both allegedly violating copyrights and intending “to cause further infringement” by distributing NeMo models “as a base from which to build further models.”

AI models decreasing transparency amid suits

The class action was filed by the same legal team representing authors suing OpenAI, whose lawsuit recently saw many claims dismissed, but crucially not their claim of direct copyright infringement. Lawyers told Ars last month that authors would be amending their complaints against OpenAI and were “eager to move forward and litigate” their direct copyright infringement claim.

In that lawsuit, the authors alleged copyright infringement both when OpenAI trained LLMs and when chatbots referenced books in outputs. But authors seemed more concerned about alleged damages from chatbot outputs, warning that AI tools had an “uncanny ability to generate text similar to that found in copyrighted textual materials, including thousands of books.”

Uniquely, in the Nvidia suit, the authors are focused exclusively on Nvidia’s training data, seemingly concerned that Nvidia could empower businesses to create any number of AI models trained on the controversial dataset, a prospect that could affect thousands of authors whose works, they allege, would be broadly infringed simply by training those models.

There’s no telling yet how courts will rule on the direct copyright claims in either lawsuit—or in the New York Times’ lawsuit against OpenAI—but so far, OpenAI has failed to convince courts to toss those claims aside.

However, OpenAI doesn’t appear very shaken by the lawsuits. In February, OpenAI said that it expected to beat book authors’ direct copyright infringement claim at a “later stage” of the case and, most recently in the New York Times case, tried to convince the court that NYT “hacked” ChatGPT to “set up” the lawsuit.

And Microsoft, a co-defendant in the NYT lawsuit, even more recently introduced a new argument that could help tech companies defeat copyright suits over LLMs. Last month, Microsoft argued that The New York Times was attempting to stop a “groundbreaking new technology” and would fail, just like movie producers attempting to kill off the VCR in the 1980s.

“Despite The Times’s contentions, copyright law is no more an obstacle to the LLM than it was to the VCR (or the player piano, copy machine, personal computer, Internet, or search engine),” Microsoft wrote.

In December, Hugging Face’s machine learning and society lead, Yacine Jernite, noted that developers appeared to be growing less transparent about training data after copyright lawsuits raised red flags about companies using the Books3 dataset, “especially for commercial models.”

Meta, for example, “limited the amount of information [it] disclosed about” its LLM, Llama-2, “to a single paragraph description and one additional page of safety and bias analysis—after [its] use of the Books3 dataset when training the first Llama model was brought up in a copyright lawsuit,” Jernite wrote.

Jernite warned that AI models lacking transparency could hinder “the ability of regulatory safeguards to remain relevant as training methods evolve, of individuals to ensure that their rights are respected, and of open science and development to play their role in enabling democratic governance of new technologies.” To support “more accountability,” Jernite recommended “minimum meaningful public transparency standards to support effective AI regulation,” as well as companies providing options for anyone to opt out of their data being included in training data.

“More data transparency supports better governance and fosters technology development that more reliably respects peoples’ rights,” Jernite wrote.

OpenAI accuses NYT of hacking ChatGPT to set up copyright suit

OpenAI is now boldly claiming that The New York Times “paid someone to hack OpenAI’s products” like ChatGPT to “set up” a lawsuit against the leading AI maker.

In a court filing Monday, OpenAI alleged that “100 examples in which some version of OpenAI’s GPT-4 model supposedly generated several paragraphs of Times content as outputs in response to user prompts” do not reflect how normal people use ChatGPT.

Instead, it allegedly took The Times “tens of thousands of attempts to generate” these supposedly “highly anomalous results” by “targeting and exploiting a bug” that OpenAI claims it is now “committed to addressing.”

According to OpenAI, this activity amounts to “contrived attacks” by a “hired gun” who allegedly hacked OpenAI models until they hallucinated fake NYT content or regurgitated training data to replicate NYT articles. NYT allegedly paid for these “attacks” to gather evidence to support The Times’ claims that OpenAI’s products imperil its journalism by allegedly regurgitating reporting and stealing The Times’ audiences.

“Contrary to the allegations in the complaint, however, ChatGPT is not in any way a substitute for a subscription to The New York Times,” OpenAI argued in a motion that seeks to dismiss the majority of The Times’ claims. “In the real world, people do not use ChatGPT or any other OpenAI product for that purpose. Nor could they. In the ordinary course, one cannot use ChatGPT to serve up Times articles at will.”

In the filing, OpenAI described The Times as enthusiastically reporting on its chatbot developments for years without raising any concerns about copyright infringement. OpenAI claimed that it disclosed in 2020 that The Times’ articles were used to train its AI models, but The Times only cared after ChatGPT’s popularity exploded following its debut in 2022.

According to OpenAI, “It was only after this rapid adoption, along with reports of the value unlocked by these new technologies, that the Times claimed that OpenAI had ‘infringed its copyright[s]’ and reached out to demand ‘commercial terms.’ After months of discussions, the Times filed suit two days after Christmas, demanding ‘billions of dollars.’”

Ian Crosby, Susman Godfrey partner and lead counsel for The New York Times, told Ars that “what OpenAI bizarrely mischaracterizes as ‘hacking’ is simply using OpenAI’s products to look for evidence that they stole and reproduced The Times’s copyrighted works. And that is exactly what we found. In fact, the scale of OpenAI’s copying is much larger than the 100-plus examples set forth in the complaint.”

Crosby told Ars that OpenAI’s filing notably “doesn’t dispute—nor can they—that they copied millions of The Times’ works to build and power its commercial products without our permission.”

“Building new products is no excuse for violating copyright law, and that’s exactly what OpenAI has done on an unprecedented scale,” Crosby said.

OpenAI argued that the court should dismiss claims alleging direct copyright infringement, contributory infringement, Digital Millennium Copyright Act violations, and misappropriation, all of which it describes as “legally infirm.” Some fail because they are time-barred—seeking damages on training data for OpenAI’s older models—OpenAI claimed. Others allegedly fail because they misunderstand fair use or are preempted by federal laws.

If OpenAI’s motion is granted, the case would be substantially narrowed.

But if the motion is not granted and The Times ultimately wins—and it might—OpenAI may be forced to wipe ChatGPT and start over.

“OpenAI, which has been secretive and has deliberately concealed how its products operate, is now asserting it’s too late to bring a claim for infringement or hold them accountable. We disagree,” Crosby told Ars. “It’s noteworthy that OpenAI doesn’t dispute that it copied Times works without permission within the statute of limitations to train its more recent and current models.”

OpenAI did not immediately respond to Ars’ request to comment.

Judge rejects most ChatGPT copyright claims from book authors

Insufficient evidence —

OpenAI plans to defeat authors’ remaining claim at a “later stage” of the case.

A US district judge in California has largely sided with OpenAI, dismissing the majority of claims raised by authors alleging that large language models powering ChatGPT were illegally trained on pirated copies of their books without their permission.

By repackaging original works as ChatGPT outputs, the authors alleged, OpenAI’s most popular chatbot was just a high-tech “grift” that seemingly violated copyright laws, as well as state laws preventing unfair business practices and unjust enrichment.

According to Judge Araceli Martínez-Olguín, authors behind three separate lawsuits—including Sarah Silverman, Michael Chabon, and Paul Tremblay—have failed to provide evidence supporting any of their claims except for direct copyright infringement.

OpenAI had argued as much in the motion to dismiss these cases that it filed last August. At that time, OpenAI said that it expected to beat the direct infringement claim at a “later stage” of the proceedings.

Among copyright claims tossed by Martínez-Olguín were accusations of vicarious copyright infringement. Perhaps most significantly, Martínez-Olguín agreed with OpenAI that the authors’ allegation that “every” ChatGPT output “is an infringing derivative work” is “insufficient” to allege vicarious infringement, which requires evidence that ChatGPT outputs are “substantially similar” or “similar at all” to authors’ books.

“Plaintiffs here have not alleged that the ChatGPT outputs contain direct copies of the copyrighted books,” Martínez-Olguín wrote. “Because they fail to allege direct copying, they must show a substantial similarity between the outputs and the copyrighted materials.”

Authors also failed to convince Martínez-Olguín that OpenAI violated the Digital Millennium Copyright Act (DMCA) by allegedly removing copyright management information (CMI)—such as author names, titles of works, and terms and conditions for use of the work—from training data.

This claim failed because authors cited “no facts” that OpenAI intentionally removed the CMI or built the training process to omit CMI, Martínez-Olguín wrote. Further, the authors cited examples of ChatGPT referencing their names, which would seem to suggest that some CMI remains in the training data.

Some of the remaining claims were dependent on copyright claims to survive, Martínez-Olguín wrote.

As for the claim that OpenAI caused economic injury by unfairly repurposing authors’ works, the judge said that even if authors could show evidence of a DMCA violation, they could only speculate about what injury was caused.

Similarly, allegations of “fraudulent” unfair conduct—accusing OpenAI of “deceptively” designing ChatGPT to produce outputs that omit CMI—”rest on a violation of the DMCA,” Martínez-Olguín wrote.

The only claim under California’s unfair competition law that was allowed to proceed alleged that OpenAI used copyrighted works to train ChatGPT without authors’ permission. Because the state law broadly defines what’s considered “unfair,” Martínez-Olguín said that it’s possible that OpenAI’s use of the training data “may constitute an unfair practice.”

Remaining claims of negligence and unjust enrichment failed, Martínez-Olguín wrote, because authors only alleged intentional acts and did not explain how OpenAI “received and unjustly retained a benefit” from training ChatGPT on their works.

Authors have been ordered to consolidate their complaints and have until March 13 to amend arguments and continue pursuing any of the dismissed claims.

To shore up the tossed copyright claims, authors would likely need to provide examples of ChatGPT outputs that are similar to their works, as well as evidence of OpenAI intentionally removing CMI to “induce, enable, facilitate, or conceal infringement,” Martínez-Olguín wrote.
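
“Substantial similarity” is a legal test, not a string metric, but surfacing candidate examples of overlapping text is mechanical enough to sketch. One crude illustration using Python’s standard library, with placeholder texts:

```python
import difflib

book_excerpt = "Placeholder text from a copyrighted book..."
chatbot_output = "Placeholder text generated by a chatbot..."

# Longest verbatim span shared by the two texts, plus an overall ratio.
# This surfaces candidate overlaps for review; it is not a legal standard.
matcher = difflib.SequenceMatcher(None, book_excerpt, chatbot_output)
match = matcher.find_longest_match(0, len(book_excerpt), 0, len(chatbot_output))
print("Longest shared span:", repr(book_excerpt[match.a:match.a + match.size]))
print("Similarity ratio:", round(matcher.ratio(), 3))
```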

Ars could not immediately reach the authors’ lawyers or OpenAI for comment.

As authors likely prepare to continue fighting OpenAI, the US Copyright Office has been fielding public input before releasing guidance that could one day help rights holders pursue legal claims and may eventually require works to be licensed from copyright owners for use as training materials. Among the thorniest questions is whether AI tools like ChatGPT should be considered authors when their outputs are incorporated into creative works.

While the Copyright Office prepares to release three reports this year “revealing its position on copyright law in relation to AI,” according to The New York Times, OpenAI recently made it clear that it does not plan to stop referencing copyrighted works in its training data. Last month, OpenAI said it would be “impossible” to train AI models without copyrighted materials, because “copyright today covers virtually every sort of human expression—including blogposts, photographs, forum posts, scraps of software code, and government documents.”

According to OpenAI, it doesn’t just need old copyrighted materials; it needs current copyrighted materials to ensure that chatbots’ and other AI tools’ outputs “meet the needs of today’s citizens.”

Rights holders will likely be bracing throughout this confusing time, waiting for the Copyright Office’s reports. But once there is clarity, those reports could “be hugely consequential, weighing heavily in courts, as well as with lawmakers and regulators,” The Times reported.
