AI training

reddit-blocks-internet-archive-to-end-sneaky-ai-scraping

Reddit blocks Internet Archive to end sneaky AI scraping

“Until they’re able to defend their site and comply with platform policies (e.g., respecting user privacy, re: deleting removed content) we’re limiting some of their access to Reddit data to protect redditors,” Rathschmidt said.

A review of social media comments suggests that in the past, some Redditors have used the Wayback Machine to research deleted comments or threads. Those commenters noted that myriad other tools exist for surfacing deleted posts or researching a user’s activity, with some suggesting that the Wayback Machine was maybe not the easiest platform to navigate for that purpose.

Redditors have also turned to resources like IA during times when Reddit’s platform changes trigger content removals. Most recently in 2023, when changes to Reddit’s public API threatened to kill beloved subreddits, archives stepped in to preserve content before it was lost.

IA has not signaled whether it’s looking into fixes to get Reddit’s restrictions lifted and did not respond to Ars’ request to comment on how this change might impact the archive’s utility as an open web resource, given Reddit’s popularity.

The director of the Wayback Machine, Mark Graham, told Ars that IA has “a longstanding relationship with Reddit” and continues to have “ongoing discussions about this matter.”

It seems likely that Reddit is financially motivated to restrict AI firms from taking advantage of Wayback Machine archives, perhaps hoping to spur more lucrative licensing deals like Reddit struck with OpenAI and Google. The terms of the OpenAI deal were kept quiet, but the Google deal was reportedly worth $60 million. Over the next three years, Reddit expects to make more than $200 million off such licensing deals.

Disclosure: Advance Publications, which owns Ars Technica parent Condé Nast, is the largest shareholder in Reddit.

Reddit blocks Internet Archive to end sneaky AI scraping Read More »

meta-pirated-and-seeded-porn-for-years-to-train-ai,-lawsuit-says

Meta pirated and seeded porn for years to train AI, lawsuit says

Evidence may prove Meta seeded more content

Seeking evidence to back its own copyright infringement claims, Strike 3 Holdings searched “its archive of recorded infringement captured by its VXN Scan and Cross Reference tools” and found 47 “IP addresses identified as owned by Facebook infringing its copyright protected Works.”

The data allegedly demonstrates a “continued unauthorized distribution” over “several years.” And Meta allegedly did not stop its seeding after Strike 3 Holdings confronted the tech giant with this evidence—despite the IP data supposedly being verified through an industry-leading provider called Maxmind.

Strike 3 Holdings shared a screenshot of MaxMind’s findings. Credit: via Strike 3 Holdings’ complaint

Meta also allegedly attempted to “conceal its BitTorrent activities” through “six Virtual Private Clouds” that formed a “stealth network” of “hidden IP addresses,” the lawsuit alleged, which seemingly implicated a “major third-party data center provider” as a partner in Meta’s piracy.

An analysis of these IP addresses allegedly found “data patterns that matched infringement patterns seen on Meta’s corporate IP Addresses” and included “evidence of other activity on the BitTorrent network including ebooks, movies, television shows, music, and software.” The seemingly non-human patterns documented on both sets of IP addresses suggest the data was for AI training and not for personal use, Strike 3 Holdings alleged.

Perhaps most shockingly, considering that a Meta employee joked “torrenting from a corporate laptop doesn’t feel right,” Strike 3 Holdings further alleged that it found “at least one residential IP address of a Meta employee” infringing its copyrighted works. That suggests Meta may have directed an employee to torrent pirated data outside the office to obscure the data trail.

The adult site operator did not identify the employee or the major data center discussed in its complaint, noting in a subsequent filing that it recognized the risks to Meta’s business and its employees’ privacy of sharing sensitive information.

In total, the company alleged that evidence shows “well over 100,000 unauthorized distribution transactions” linked to Meta’s corporate IPs. Strike 3 Holdings is hoping the evidence will lead a jury to find Meta liable for direct copyright infringement or charge Meta with secondary and vicarious copyright infringement if the jury finds that Meta successfully distanced itself by using the third-party data center or an employee’s home IP address.

“Meta has the right and ability to supervise and/or control its own corporate IP addresses, as well as the IP addresses hosted in off-infra data centers, and the acts of its employees and agents infringing Plaintiffs’ Works through their residential IPs by using Meta’s AI script to obtain content through BitTorrent,” the complaint said.

Meta pirated and seeded porn for years to train AI, lawsuit says Read More »

everything-tech-giants-will-hate-about-the-eu’s-new-ai-rules

Everything tech giants will hate about the EU’s new AI rules

The code also details expectations for AI companies to respect paywalls, as well as robots.txt instructions restricting crawling, which could help confront a growing problem of AI crawlers hammering websites. It “encourages” online search giants to embrace a solution that Cloudflare is currently pushing: allowing content creators to protect copyrights by restricting AI crawling without impacting search indexing.

Additionally, companies are asked to disclose total energy consumption for both training and inference, allowing the EU to detect environmental concerns while companies race forward with AI innovation.

More substantially, the code’s safety guidance provides for additional monitoring for other harms. It makes recommendations to detect and avoid “serious incidents” with new AI models, which could include cybersecurity breaches, disruptions of critical infrastructure, “serious harm to a person’s health (mental and/or physical),” or “a death of a person.” It stipulates timelines of between five and 10 days to report serious incidents with the EU’s AI Office. And it requires companies to track all events, provide an “adequate level” of cybersecurity protection, prevent jailbreaking as best they can, and justify “any failures or circumventions of systemic risk mitigations.”

Ars reached out to tech companies for immediate reactions to the new rules. OpenAI, Meta, and Microsoft declined to comment. A Google spokesperson confirmed that the company is reviewing the code, which still must be approved by the European Commission and EU member states amid expected industry pushback.

“Europeans should have access to first-rate, secure AI models when they become available, and an environment that promotes innovation and investment,” Google’s spokesperson said. “We look forward to reviewing the code and sharing our views alongside other model providers and many others.”

These rules are just one part of the AI Act, which will start taking effect in a staggered approach over the next year or more, the NYT reported. Breaching the AI Act could result in AI models being yanked off the market or fines “of as much as 7 percent of a company’s annual sales or 3 percent for the companies developing advanced AI models,” Bloomberg noted.

Everything tech giants will hate about the EU’s new AI rules Read More »

pay-up-or-stop-scraping:-cloudflare-program-charges-bots-for-each-crawl

Pay up or stop scraping: Cloudflare program charges bots for each crawl

“Imagine asking your favorite deep research program to help you synthesize the latest cancer research or a legal brief, or just help you find the best restaurant in Soho—and then giving that agent a budget to spend to acquire the best and most relevant content,” Cloudflare said, promising that “we enable a future where intelligent agents can programmatically negotiate access to digital resources.”

AI crawlers now blocked by default

Cloudflare’s announcement comes after rolling out a feature last September, allowing website owners to block AI crawlers in a single click. According to Cloudflare, over 1 million customers chose to block AI crawlers, signaling that people want more control over their content at a time when Cloudflare observed that writing instructions for AI crawlers in robots.txt files was widely “underutilized.”

To protect more customers moving forward, any new customers (including anyone on a free plan) who sign up for Cloudflare services will have their domains, by default, set to block all known AI crawlers.

This marks Cloudflare’s transition away from the dreaded opt-out models of AI scraping to a permission-based model, which a Cloudflare spokesperson told Ars is expected to “fundamentally change how AI companies access web content going forward.”

In a world where some website owners have grown sick and tired of attempting and failing to block AI scraping through robots.txt—including some trapping AI crawlers in tarpits to punish them for ignoring robots.txt—Cloudflare’s feature allows users to choose granular settings to prevent blocks on AI bots from impacting bots that drive search engine traffic. That’s critical for small content creators who want their sites to still be discoverable but not digested by AI bots.

“AI crawlers collect content like text, articles, and images to generate answers, without sending visitors to the original source—depriving content creators of revenue, and the satisfaction of knowing someone is reading their content,” Cloudflare’s blog said. “If the incentive to create original, quality content disappears, society ends up losing, and the future of the Internet is at risk.”

Disclosure: Condé Nast, which owns Ars Technica, is a partner involved in Cloudflare’s beta test.

This story was corrected on July 1 to remove publishers incorrectly listed as participating in Cloudflare’s pay-per-crawl beta.

Pay up or stop scraping: Cloudflare program charges bots for each crawl Read More »

judge:-pirate-libraries-may-have-profited-from-meta-torrenting-80tb-of-books

Judge: Pirate libraries may have profited from Meta torrenting 80TB of books

It could certainly look worse for Meta if authors manage to present evidence supporting the second way that torrenting could be relevant to the case, Chhabaria suggested.

“Meta downloading copyrighted material from shadow libraries” would also be relevant to the character of the use, “if it benefitted those who created the libraries and thus supported and perpetuated their unauthorized copying and distribution of copyrighted works,” Chhabria wrote.

Counting potential strikes against Meta, Chhabria pointed out that the “vast majority of cases” involving “this sort of peer-to-peer file-sharing” are found to “constitute copyright infringement.” And it likely doesn’t help Meta’s case that “some of the libraries Meta used have themselves been found liable for infringement.”

However, Meta may overcome this argument, too, since book authors “have not submitted any evidence” that potentially shows how Meta’s downloading may perhaps be “propping up” or financially benefiting pirate libraries.

Finally, Chhabria noted that the “last issue relating to the character of Meta’s use” of books in regards to its torrenting is “the relationship between Meta’s downloading of the plaintiffs’ books and Meta’s use of the books to train Llama.”

Authors had tried to argue that these elements were distinct. But Chhabria said there’s no separating the fact that Meta downloaded the books to serve the “highly transformative” purpose of training Llama.

“Because Meta’s ultimate use of the plaintiffs’ books was transformative, so too was Meta’s downloading of those books,” Chhabria wrote.

AI training rulings may get more authors paid

Authors only learned of Meta’s torrenting through discovery in the lawsuit, and because of that, Chhabria noted that “the record on Meta’s alleged distribution is incomplete.”

It’s possible that authors may be able to show evidence that Meta “contributed to the BitTorrent network” by providing significant computing power that could’ve meaningfully assisted shadow libraries, Chhabria said in a footnote.

Judge: Pirate libraries may have profited from Meta torrenting 80TB of books Read More »

meta-is-making-users-who-opted-out-of-ai-training-opt-out-again,-watchdog-says

Meta is making users who opted out of AI training opt out again, watchdog says

Noyb has requested a response from Meta by May 21, but it seems unlikely that Meta will quickly cave in this fight.

In a blog post, Meta said that AI training on EU users was critical to building AI tools for Europeans that are informed by “everything from dialects and colloquialisms, to hyper-local knowledge and the distinct ways different countries use humor and sarcasm on our products.”

Meta argued that its AI training efforts in the EU are far more transparent than efforts from competitors Google and OpenAI, which, Meta noted, “have already used data from European users to train their AI models,” supposedly without taking the steps Meta has to inform users.

Also echoing a common refrain in the AI industry, another Meta blog warned that efforts to further delay Meta’s AI training in the EU could lead to “major setbacks,” pushing the EU behind rivals in the AI race.

“Without a reform and simplification of the European regulatory system, Europe threatens to fall further and further behind in the global AI race and lose ground compared to the USA and China,” Meta warned.

Noyb discredits this argument and noted that it can pursue injunctions in various jurisdictions to block Meta’s plan. The group said it’s currently evaluating options to seek injunctive relief and potentially even pursue a class action worth possibly “billions in damages” to ensure that 400 million monthly active EU users’ data rights are shielded from Meta’s perceived grab.

A Meta spokesperson reiterated to Ars that the company’s plan “follows extensive and ongoing engagement with the Irish Data Protection Commission,” while reiterating Meta’s statements in blogs that its AI training approach “reflects consensus among” EU Data Protection Authorities (DPAs).

But while Meta claims that EU regulators have greenlit its AI training plans, Noyb argues that national DPAs have “largely stayed silent on the legality of AI training without consent,” and Meta seems to have “simply moved ahead anyways.”

“This fight is essentially about whether to ask people for consent or simply take their data without it,” Schrems said, adding, “Meta’s absurd claims that stealing everyone’s personal data is necessary for AI training is laughable. Other AI providers do not use social network data—and generate even better models than Meta.”

Meta is making users who opted out of AI training opt out again, watchdog says Read More »

judge-on-meta’s-ai-training:-“i-just-don’t-understand-how-that-can-be-fair-use”

Judge on Meta’s AI training: “I just don’t understand how that can be fair use”


Judge downplayed Meta’s “messed up” torrenting in lawsuit over AI training.

A judge who may be the first to rule on whether AI training data is fair use appeared skeptical Thursday at a hearing where Meta faced off with book authors over the social media company’s alleged copyright infringement.

Meta, like most AI companies, holds that training must be deemed fair use, or else the entire AI industry could face immense setbacks, wasting precious time negotiating data contracts while falling behind global rivals. Meta urged the court to rule that AI training is a transformative use that only references books to create an entirely new work that doesn’t replicate authors’ ideas or replace books in their markets.

At the hearing that followed after both sides requested summary judgment, however, Judge Vince Chhabria pushed back on Meta attorneys arguing that the company’s Llama AI models posed no threat to authors in their markets, Reuters reported.

“You have companies using copyright-protected material to create a product that is capable of producing an infinite number of competing products,” Chhabria said. “You are dramatically changing, you might even say obliterating, the market for that person’s work, and you’re saying that you don’t even have to pay a license to that person.”

Declaring, “I just don’t understand how that can be fair use,” the shrewd judge apparently stoked little response from Meta’s attorney, Kannon Shanmugam, apart from a suggestion that any alleged threat to authors’ livelihoods was “just speculation,” Wired reported.

Authors may need to sharpen their case, which Chhabria warned could be “taken away by fair use” if none of the authors suing, including Sarah Silverman, Ta-Nehisi Coates, and Richard Kadrey, can show “that the market for their actual copyrighted work is going to be dramatically affected.”

Determined to probe this key question, Chhabria pushed authors’ attorney, David Boies, to point to specific evidence of market harms that seemed noticeably missing from the record.

“It seems like you’re asking me to speculate that the market for Sarah Silverman’s memoir will be affected by the billions of things that Llama will ultimately be capable of producing,” Chhabria said. “And it’s just not obvious to me that that’s the case.”

But if authors can prove fears of market harms are real, Meta might struggle to win over Chhabria, and that could set a precedent impacting copyright cases challenging AI training on other kinds of content.

The judge repeatedly appeared to be sympathetic to authors, suggesting that Meta’s AI training may be a “highly unusual case” where even though “the copying is for a highly transformative purpose, the copying has the high likelihood of leading to the flooding of the markets for the copyrighted works.”

And when Shanmugam argued that copyright law doesn’t shield authors from “protection from competition in the marketplace of ideas,” Chhabria resisted the framing that authors weren’t potentially being robbed, Reuters reported.

“But if I’m going to steal things from the marketplace of ideas in order to develop my own ideas, that’s copyright infringement, right?” Chhabria responded.

Wired noted that he asked Meta’s lawyers, “What about the next Taylor Swift?” If AI made it easy to knock off a young singer’s sound, how could she ever compete if AI produced “a billion pop songs” in her style?

In a statement, Meta’s spokesperson reiterated the company’s defense that AI training is fair use.

“Meta has developed transformational open source AI models that are powering incredible innovation, productivity, and creativity for individuals and companies,” Meta’s spokesperson said. “Fair use of copyrighted materials is vital to this. We disagree with Plaintiffs’ assertions, and the full record tells a different story. We will continue to vigorously defend ourselves and to protect the development of GenAI for the benefit of all.”

Meta’s torrenting seems “messed up”

Some have pondered why Chhabria appeared so focused on market harms, instead of hammering Meta for admittedly illegally pirating books that it used for its AI training, which seems to be obvious copyright infringement. According to Wired, “Chhabria spoke emphatically about his belief that the big question is whether Meta’s AI tools will hurt book sales and otherwise cause the authors to lose money,” not whether Meta’s torrenting of books was illegal.

The torrenting “seems kind of messed up,” Chhabria said, but “the question, as the courts tell us over and over again, is not whether something is messed up but whether it’s copyright infringement.”

It’s possible that Chhabria dodged the question for procedural reasons. In a court filing, Meta argued that authors had moved for summary judgment on Meta’s alleged copying of their works, not on “unsubstantiated allegations that Meta distributed Plaintiffs’ works via torrent.”

In the court filing, Meta alleged that even if Chhabria agreed that the authors’ request for “summary judgment is warranted on the basis of Meta’s distribution, as well as Meta’s copying,” that the authors “lack evidence to show that Meta distributed any of their works.”

According to Meta, authors abandoned any claims that Meta’s seeding of the torrented files served to distribute works, leaving only claims about Meta’s leeching. Meta argued that the authors “admittedly lack evidence that Meta ever uploaded any of their works, or any identifiable part of those works, during the so-called ‘leeching’ phase,” relying instead on expert estimates based on how torrenting works.

It’s also possible that for Chhabria, the torrenting question seemed like an unnecessary distraction. Former Meta attorney Mark Lumley, who quit the case earlier this year, told Vanity Fair that the torrenting was “one of those things that sounds bad but actually shouldn’t matter at all in the law. Fair use is always about uses the plaintiff doesn’t approve of; that’s why there is a lawsuit.”

Lumley suggested that court cases mulling fair use at this current moment should focus on the outputs, rather than the training. Citing the ruling in a case where Google Books scanning books to share excerpts was deemed fair use, Lumley argued that “all search engines crawl the full Internet, including plenty of pirated content,” so there’s seemingly no reason to stop AI crawling.

But the Copyright Alliance, a nonprofit, non-partisan group supporting the authors in the case, in a court filing alleged that Meta, in its bid to get AI products viewed as transformative, is aiming to do the opposite. “When describing the purpose of generative AI,” Meta allegedly strives to convince the court to “isolate the ‘training’ process and ignore the output of generative AI,” because that’s seemingly the only way that Meta can convince the court that AI outputs serve “a manifestly different purpose from Plaintiffs’ books,” the Copyright Alliance argued.

“Meta’s motion ignores what comes after the initial ‘training’—most notably the generation of output that serves the same purpose of the ingested works,” the Copyright Alliance argued. And the torrenting question should matter, the group argued, because unlike in Google Books, Meta’s AI models are apparently training on pirated works, not “legitimate copies of books.”

Chhabria will not be making a snap decision in the case, planning to take his time and likely stressing not just Meta, but every AI company defending training as fair use the longer he delays. Understanding that the entire AI industry potentially has a stake in the ruling, Chhabria apparently sought to relieve some tension at the end of the hearing with a joke, Wired reported.

 “I will issue a ruling later today,” Chhabria said. “Just kidding! I will take a lot longer to think about it.”

Photo of Ashley Belanger

Ashley is a senior policy reporter for Ars Technica, dedicated to tracking social impacts of emerging policies and new technologies. She is a Chicago-based journalist with 20 years of experience.

Judge on Meta’s AI training: “I just don’t understand how that can be fair use” Read More »

judge-calls-out-openai’s-“straw-man”-argument-in-new-york-times-copyright-suit

Judge calls out OpenAI’s “straw man” argument in New York Times copyright suit

“Taken as true, these facts give rise to a plausible inference that defendants at a minimum had reason to investigate and uncover end-user infringement,” Stein wrote.

To Stein, the fact that OpenAI maintains an “ongoing relationship” with users by providing outputs that respond to users’ prompts also supports contributory infringement claims, despite OpenAI’s argument that ChatGPT’s “substantial noninfringing uses” are exonerative.

OpenAI defeated some claims

For OpenAI, Stein’s ruling likely disappoints, although Stein did drop some of NYT’s claims.

Likely upsetting to news publishers, that included a “free-riding” claim that ChatGPT unfairly profits off time-sensitive “hot news” items, including the NYT’s Wirecutter posts. Stein explained that news publishers failed to plausibly allege non-attribution (which is key to a free-riding claim) because, for example, ChatGPT cites the NYT when sharing information from Wirecutter posts. Those claims are pre-empted by the Copyright Act anyway, Stein wrote, granting OpenAI’s motion to dismiss.

Stein also dismissed a claim from the NYT regarding alleged removal of copyright management information (CMI), which Stein said cannot be proven simply because ChatGPT reproduces excerpts of NYT articles without CMI.

The Digital Millennium Copyright Act (DMCA) requires news publishers to show that ChatGPT’s outputs are “close to identical” to the original work, Stein said, and allowing publishers’ claims based on excerpts “would risk boundless DMCA liability”—including for any use of block quotes without CMI.

Asked for comment on the ruling, an OpenAI spokesperson declined to go into any specifics, instead repeating OpenAI’s long-held argument that AI training on copyrighted works is fair use. (Last month, OpenAI warned Donald Trump that the US would lose the AI race to China if courts ruled against that argument.)

“ChatGPT helps enhance human creativity, advance scientific discovery and medical research, and enable hundreds of millions of people to improve their daily lives,” OpenAI’s spokesperson said. “Our models empower innovation, and are trained on publicly available data and grounded in fair use.”

Judge calls out OpenAI’s “straw man” argument in New York Times copyright suit Read More »

meta-claims-torrenting-pirated-books-isn’t-illegal-without-proof-of-seeding

Meta claims torrenting pirated books isn’t illegal without proof of seeding

Just because Meta admitted to torrenting a dataset of pirated books for AI training purposes, that doesn’t necessarily mean that Meta seeded the file after downloading it, the social media company claimed in a court filing this week.

Evidence instead shows that Meta “took precautions not to ‘seed’ any downloaded files,” Meta’s filing said. Seeding refers to sharing a torrented file after the download completes, and because there’s allegedly no proof of such “seeding,” Meta insisted that authors cannot prove Meta shared the pirated books with anyone during the torrenting process.

Whether or not Meta actually seeded the pirated books could make a difference in a copyright lawsuit from book authors including Richard Kadrey, Sarah Silverman, and Ta-Nehisi Coates. Authors had previously alleged that Meta unlawfully copied and distributed their works through AI outputs—an increasingly common complaint that so far has barely been litigated. But Meta’s admission to torrenting appears to add a more straightforward claim of unlawful distribution of copyrighted works through illegal torrenting, which has long been considered established case-law.

Authors have alleged that “Meta deliberately engaged in one of the largest data piracy campaigns in history to acquire text data for its LLM training datasets, torrenting and sharing dozens of terabytes of pirated data that altogether contain many millions of copyrighted works.” Separate from their copyright infringement claims opposing Meta’s AI training on pirated copies of their books, authors alleged that Meta torrenting the dataset was “independently illegal” under California’s Computer Data Access and Fraud Act (CDAFA), which allegedly “prevents the unauthorized taking of data, including copyrighted works.”

Meta, however, is hoping to convince the court that torrenting is not in and of itself illegal, but is, rather, a “widely-used protocol to download large files.” According to Meta, the decision to download the pirated books dataset from pirate libraries like LibGen and Z-Library was simply a move to access “data from a ‘well-known online repository’ that was publicly available via torrents.”

Meta claims torrenting pirated books isn’t illegal without proof of seeding Read More »

”torrenting-from-a-corporate-laptop-doesn’t-feel-right”:-meta-emails-unsealed

”Torrenting from a corporate laptop doesn’t feel right”: Meta emails unsealed

Emails discussing torrenting prove that Meta knew it was “illegal,” authors alleged. And Bashlykov’s warnings seemingly landed on deaf ears, with authors alleging that evidence showed Meta chose to instead hide its torrenting as best it could while downloading and seeding terabytes of data from multiple shadow libraries as recently as April 2024.

Meta allegedly concealed seeding

Supposedly, Meta tried to conceal the seeding by not using Facebook servers while downloading the dataset to “avoid” the “risk” of anyone “tracing back the seeder/downloader” from Facebook servers, an internal message from Meta researcher Frank Zhang said, while describing the work as in “stealth mode.” Meta also allegedly modified settings “so that the smallest amount of seeding possible could occur,” a Meta executive in charge of project management, Michael Clark, said in a deposition.

Now that new information has come to light, authors claim that Meta staff involved in the decision to torrent LibGen must be deposed again, because allegedly the new facts “contradict prior deposition testimony.”

Mark Zuckerberg, for example, claimed to have no involvement in decisions to use LibGen to train AI models. But unredacted messages show the “decision to use LibGen occurred” after “a prior escalation to MZ,” authors alleged.

Meta did not immediately respond to Ars’ request for comment and has maintained throughout the litigation that AI training on LibGen was “fair use.”

However, Meta has previously addressed its torrenting in a motion to dismiss filed last month, telling the court that “plaintiffs do not plead a single instance in which any part of any book was, in fact, downloaded by a third party from Meta via torrent, much less that Plaintiffs’ books were somehow distributed by Meta.”

While Meta may be confident in its legal strategy despite the new torrenting wrinkle, the social media company has seemingly complicated its case by allowing authors to expand the distribution theory that’s key to winning a direct copyright infringement claim beyond just claiming that Meta’s AI outputs unlawfully distributed their works.

As limited discovery on Meta’s seeding now proceeds, Meta is not fighting the seeding aspect of the direct copyright infringement claim at this time, telling the court that it plans to “set… the record straight and debunk… this meritless allegation on summary judgment.”

”Torrenting from a corporate laptop doesn’t feel right”: Meta emails unsealed Read More »

how-to-stop-linkedin-from-training-ai-on-your-data

How to stop LinkedIn from training AI on your data

Better to beg for forgiveness than ask for permission? —

LinkedIn limits opt-outs to future training, warns AI models may spout personal data.

How to stop LinkedIn from training AI on your data

LinkedIn admitted Wednesday that it has been training its own AI on many users’ data without seeking consent. Now there’s no way for users to opt out of training that has already occurred, as LinkedIn limits opt-out to only future AI training.

In a blog detailing updates coming on November 20, LinkedIn general counsel Blake Lawit confirmed that LinkedIn’s user agreement and privacy policy will be changed to better explain how users’ personal data powers AI on the platform.

Under the new privacy policy, LinkedIn now informs users that “we may use your personal data… [to] develop and train artificial intelligence (AI) models, develop, provide, and personalize our Services, and gain insights with the help of AI, automated systems, and inferences, so that our Services can be more relevant and useful to you and others.”

An FAQ explained that the personal data could be collected any time a user interacts with generative AI or other AI features, as well as when a user composes a post, changes their preferences, provides feedback to LinkedIn, or uses the platform for any amount of time.

That data is then stored until the user deletes the AI-generated content. LinkedIn recommends that users use its data access tool if they want to delete or request to delete data collected about past LinkedIn activities.

LinkedIn’s AI models powering generative AI features “may be trained by LinkedIn or another provider,” such as Microsoft, which provides some AI models through its Azure OpenAI service, the FAQ said.

A potentially major privacy risk for users, LinkedIn’s FAQ noted, is that users who “provide personal data as an input to a generative AI powered feature” could end up seeing their “personal data being provided as an output.”

LinkedIn claims that it “seeks to minimize personal data in the data sets used to train the models,” relying on “privacy enhancing technologies to redact or remove personal data from the training dataset.”

While Lawit’s blog avoids clarifying if data already collected can be removed from AI training data sets, the FAQ affirmed that users who automatically opted in to sharing personal data for AI training can only opt out of the invasive data collection “going forward.”

Opting out “does not affect training that has already taken place,” the FAQ said.

A LinkedIn spokesperson told Ars that it “benefits all members” to be opted in to AI training “by default.”

“People can choose to opt out, but they come to LinkedIn to be found for jobs and networking and generative AI is part of how we are helping professionals with that change,” LinkedIn’s spokesperson said.

By allowing opt-outs of future AI training, LinkedIn’s spokesperson additionally claimed that the platform is giving “people using LinkedIn even more choice and control when it comes to how we use data to train our generative AI technology.”

How to opt out of AI training on LinkedIn

Users can opt out of AI training by navigating to the “Data privacy” section in their account settings, then turning off the option allowing collection of “data for generative AI improvement” that LinkedIn otherwise automatically turns on for most users.

The only exception is for users in the European Economic Area or Switzerland, who are protected by stricter privacy laws that either require consent from platforms to collect personal data or for platforms to justify the data collection as a legitimate interest. Those users will not see an option to opt out, because they were never opted in, LinkedIn repeatedly confirmed.

Additionally, users can “object to the use of their personal data for training” generative AI models not used to generate LinkedIn content—such as models used for personalization or content moderation purposes, The Verge noted—by submitting the LinkedIn Data Processing Objection Form.

Last year, LinkedIn shared AI principles, promising to take “meaningful steps to reduce the potential risks of AI.”

One risk that the updated user agreement specified is that using LinkedIn’s generative features to help populate a profile or generate suggestions when writing a post could generate content that “might be inaccurate, incomplete, delayed, misleading or not suitable for your purposes.”

Users are advised that they are responsible for avoiding sharing misleading information or otherwise spreading AI-generated content that may violate LinkedIn’s community guidelines. And users are additionally warned to be cautious when relying on any information shared on the platform.

“Like all content and other information on our Services, regardless of whether it’s labeled as created by ‘AI,’ be sure to carefully review before relying on it,” LinkedIn’s user agreement says.

Back in 2023, LinkedIn claimed that it would always “seek to explain in clear and simple ways how our use of AI impacts people,” because users’ “understanding of AI starts with transparency.”

Legislation like the European Union’s AI Act and the GDPR—especially with its strong privacy protections—if enacted elsewhere, would lead to fewer shocks to unsuspecting users. That would put all companies and their users on equal footing when it comes to training AI models and result in fewer nasty surprises and angry customers.

How to stop LinkedIn from training AI on your data Read More »

nonprofit-scrubs-illegal-content-from-controversial-ai-training-dataset

Nonprofit scrubs illegal content from controversial AI training dataset

Nonprofit scrubs illegal content from controversial AI training dataset

After Stanford Internet Observatory researcher David Thiel found links to child sexual abuse materials (CSAM) in an AI training dataset tainting image generators, the controversial dataset was immediately taken down in 2023.

Now, the LAION (Large-scale Artificial Intelligence Open Network) team has released a scrubbed version of the LAION-5B dataset called Re-LAION-5B and claimed that it “is the first web-scale, text-link to images pair dataset to be thoroughly cleaned of known links to suspected CSAM.”

To scrub the dataset, LAION partnered with the Internet Watch Foundation (IWF) and the Canadian Center for Child Protection (C3P) to remove 2,236 links that matched with hashed images in the online safety organizations’ databases. Removals include all the links flagged by Thiel, as well as content flagged by LAION’s partners and other watchdogs, like Human Rights Watch, which warned of privacy issues after finding photos of real kids included in the dataset without their consent.

In his study, Thiel warned that “the inclusion of child abuse material in AI model training data teaches tools to associate children in illicit sexual activity and uses known child abuse images to generate new, potentially realistic child abuse content.”

Thiel urged LAION and other researchers scraping the Internet for AI training data that a new safety standard was needed to better filter out not just CSAM, but any explicit imagery that could be combined with photos of children to generate CSAM. (Recently, the US Department of Justice pointedly said that “CSAM generated by AI is still CSAM.”)

While LAION’s new dataset won’t alter models that were trained on the prior dataset, LAION claimed that Re-LAION-5B sets “a new safety standard for cleaning web-scale image-link datasets.” Where before illegal content “slipped through” LAION’s filters, the researchers have now developed an improved new system “for identifying and removing illegal content,” LAION’s blog said.

Thiel told Ars that he would agree that LAION has set a new safety standard with its latest release, but “there are absolutely ways to improve it.” However, “those methods would require possession of all original images or a brand new crawl,” and LAION’s post made clear that it only utilized image hashes and did not conduct a new crawl that could have risked pulling in more illegal or sensitive content. (On Threads, Thiel shared more in-depth impressions of LAION’s effort to clean the dataset.)

LAION warned that “current state-of-the-art filters alone are not reliable enough to guarantee protection from CSAM in web scale data composition scenarios.”

“To ensure better filtering, lists of hashes of suspected links or images created by expert organizations (in our case, IWF and C3P) are suitable choices,” LAION’s blog said. “We recommend research labs and any other organizations composing datasets from the public web to partner with organizations like IWF and C3P to obtain such hash lists and use those for filtering. In the longer term, a larger common initiative can be created that makes such hash lists available for the research community working on dataset composition from web.”

According to LAION, the bigger concern is that some links to known CSAM scraped into a 2022 dataset are still active more than a year later.

“It is a clear hint that law enforcement bodies have to intensify the efforts to take down domains that host such image content on public web following information and recommendations by organizations like IWF and C3P, making it a safer place, also for various kinds of research related activities,” LAION’s blog said.

HRW researcher Hye Jung Han praised LAION for removing sensitive data that she flagged, while also urging more interventions.

“LAION’s responsive removal of some children’s personal photos from their dataset is very welcome, and will help to protect these children from their likenesses being misused by AI systems,” Han told Ars. “It’s now up to governments to pass child data protection laws that would protect all children’s privacy online.”

Although LAION’s blog said that the content removals represented an “upper bound” of CSAM that existed in the initial dataset, AI specialist and Creative.AI co-founder Alex Champandard told Ars that he’s skeptical that all CSAM was removed.

“They only filter out previously identified CSAM, which is only a partial solution,” Champandard told Ars. “Statistically speaking, most instances of CSAM have likely never been reported nor investigated by C3P or IWF. A more reasonable estimate of the problem is about 25,000 instances of things you’d never want to train generative models on—maybe even 50,000.”

Champandard agreed with Han that more regulations are needed to protect people from AI harms when training data is scraped from the web.

“There’s room for improvement on all fronts: privacy, copyright, illegal content, etc.,” Champandard said. Because “there are too many data rights being broken with such web-scraped datasets,” Champandard suggested that datasets like LAION’s won’t “stand the test of time.”

“LAION is simply operating in the regulatory gap and lag in the judiciary system until policymakers realize the magnitude of the problem,” Champandard said.

Nonprofit scrubs illegal content from controversial AI training dataset Read More »