AI training

Microsoft vows to cover full power costs for energy-hungry AI data centers

AI, AI infrastructure, AI training, Amazon, Biz & IT, brad smith, community development, datacenters, Donald Trump, electricity, Google, machine learning, Meta, microsoft, openai, property taxes, skilled trades, water use / Beth Washington / January 13, 2026

Taking responsibility for power usage

In the Microsoft blog post, Smith acknowledged that residential electricity rates have recently risen in dozens of states, driven partly by inflation, supply chain constraints, and grid upgrades. He wrote that communities “value new jobs and property tax revenue, but not if they come with higher power bills or tighter water supplies.”

Microsoft says it will ask utilities and public commissions to set rates high enough to cover the full electricity costs for its data centers, including infrastructure additions. In Wisconsin, the company is supporting a new rate structure that would charge “Very Large Customers,” including data centers, the cost of the electricity required to serve them.

Smith wrote that while some have suggested the public should help pay for the added electricity needed for AI, Microsoft disagrees. He stated, “Especially when tech companies are so profitable, we believe that it’s both unfair and politically unrealistic for our industry to ask the public to shoulder added electricity costs for AI.”

On water usage for cooling, Microsoft plans a 40 percent improvement in data center water-use intensity by 2030. A recent environmental audit from AI model-maker Mistral found that training and running its Large 2 model over 18 months produced 20.4 kilotons of CO₂ emissions and evaporated enough water to fill 112 Olympic-size swimming pools, illustrating the aggregate environmental impact of AI operations at scale.

To solve some of these issues, Microsoft says it has launched a new AI data center design using a closed-loop system that constantly recirculates cooling liquid, dramatically cutting water usage. In this design, already deployed in Wisconsin and Georgia, potable water is no longer needed for cooling.

On property taxes, Smith stated in the blog post that the company will not ask local municipalities to reduce their rates. The company says it will pay its full share of local property taxes. Smith wrote that Microsoft’s goal is to bring these commitments to life in the first half of 2026. Of course, these are PR-aligned company goals and not realities yet, so we’ll have to check back in later to see whether Microsoft has been following through on its promises.

Microsoft vows to cover full power costs for energy-hungry AI data centers Read More »

World’s largest shadow library made a 300TB copy of Spotify’s most streamed songs

AI, AI training, Anna's archive, LLMs, music streaming, piracy, Policy, shadow library, Spotify / Kelly Newman / December 22, 2025

But Anna’s Archive is clearly working to support AI developers, another noted, pointing out that Anna’s Archive promotes selling “high-speed access” to “enterprise-level” LLM data, including “unreleased collections.” Anyone can donate “tens of thousands” to get such access, the archive suggests on its webpage, and any interested AI researchers can reach out to discuss “how we can work together.”

“AI may not be their original/primary motivation, but they are evidently on board with facilitating AI labs piracy-maxxing,” a third commenter suggested.

Meanwhile, on Reddit, some fretted that Anna’s Archive may have doomed itself by scraping the data. To them, it seemed like the archive was “only making themselves a target” after watching the Internet Archive struggle to survive a legal attack from record labels that ended in a confidential settlement last year.

“I’m furious with AA for sticking this target on their own backs,” a redditor wrote on a post declaring that “this Spotify hacking will just ruin the actual important literary archive.”

As Anna’s Archive fans spiraled, a conspiracy was even raised that the archive was only “doing it for the AI bros, who are the ones paying the bills behind the scenes” to keep the archive afloat.

Ars could not immediately reach Anna’s Archive to comment on users’ fears or Spotify’s investigation.

On Reddit, one user took comfort in the fact that the archive is “designed to be resistant to being taken out,” perhaps preventing legal action from ever really dooming the archive.

“The domain and such can be gone, sure, but the core software and its data can be resurfaced again and again,” the user explained.

But not everyone was convinced that Anna’s Archive could survive brazenly torrenting so much Spotify data.

“This is like saying the Titanic is unsinkable” that user warned, suggesting that Anna’s Archive might lose donations if Spotify-fueled takedowns continually frustrate downloads over time. “Sure, in theory data can certainly resurface again and again, but doing so each time, it will take money and resources, which are finite. How many times are folks willing to do this before they just give up?”

This story was updated to include Spotify’s statement.

World’s largest shadow library made a 300TB copy of Spotify’s most streamed songs Read More »

Meta denies torrenting porn to train AI, says downloads were for “personal use”

adult content, AI, AI training, BItTorrent, Meta, Meta Movie Gen, Policy, pornography, torrenting / Kelly Newman / October 30, 2025

Instead, Meta argued, available evidence “is plainly indicative” that the flagged adult content was torrented for “private personal use”—since the small amount linked to Meta IP addresses and employees represented only “a few dozen titles per year intermittently obtained one file at a time.”

“The far more plausible inference to be drawn from such meager, uncoordinated activity is that disparate individuals downloaded adult videos for personal use,” Meta’s filing said.

For example, unlike lawsuits raised by book authors whose works are part of an enormous dataset used to train AI, the activity on Meta’s corporate IP addresses only amounted to about 22 downloads per year. That is nowhere near the “concerted effort to collect the massive datasets Plaintiffs allege are necessary for effective AI training,” Meta argued.

Further, that alleged activity can’t even reliably be linked to any Meta employee, Meta argued.

Strike 3 “does not identify any of the individuals who supposedly used these Meta IP addresses, allege that any were employed by Meta or had any role in AI training at Meta, or specify whether (and which) content allegedly downloaded was used to train any particular Meta model,” Meta wrote.

Meanwhile, “tens of thousands of employees,” as well as “innumerable contractors, visitors, and third parties access the Internet at Meta every day,” Meta argued. So while it’s “possible one or more Meta employees” downloaded Strike 3’s content over the last seven years, “it is just as possible” that a “guest, or freeloader,” or “contractor, or vendor, or repair person—or any combination of such persons—was responsible for that activity,” Meta suggested.

Other alleged activity included a claim that a Meta contractor was directed to download adult content at his father’s house, but those downloads, too, “are plainly indicative of personal consumption,” Meta argued. That contractor worked as an “automation engineer,” Meta noted, with no apparent basis provided for why he would be expected to source AI training data in that role. “No facts plausibly” tie “Meta to those downloads,” Meta claimed.

Meta denies torrenting porn to train AI, says downloads were for “personal use” Read More »

Pay-per-output? AI firms blindsided by beefed up robots.txt instructions.

AI, ai crawlers, ai scraping, AI training, Google, Meta, openai, Policy, really simply licensing, rsl standard, RSS, xAI / 9u50fv / September 10, 2025

“Really Simple Licensing” makes it easier for creators to get paid for AI scraping.

Logo for the “Really Simply Licensing” (RSL) standard. Credit: via RSL Collective

Leading Internet companies and publishers—including Reddit, Yahoo, Quora, Medium, The Daily Beast, Fastly, and more—think there may finally be a solution to end AI crawlers hammering websites to scrape content without permission or compensation.

Announced Wednesday morning, the “Really Simply Licensing” (RSL) standard evolves robots.txt instructions by adding an automated licensing layer that’s designed to block bots that don’t fairly compensate creators for content.

Free for any publisher to use starting today, the RSL standard is an open, decentralized protocol that makes clear to AI crawlers and agents the terms for licensing, usage, and compensation of any content used to train AI, a press release noted.

The standard was created by the RSL Collective, which was founded by Doug Leeds, former CEO of Ask.com, and Eckart Walther, a former Yahoo vice president of products and co-creator of the RSS standard, which made it easy to syndicate content across the web.

Based on the “Really Simply Syndication” (RSS) standard, RSL terms can be applied to protect any digital content, including webpages, books, videos, and datasets. The new standard supports “a range of licensing, usage, and royalty models, including free, attribution, subscription, pay-per-crawl (publishers get compensated every time an AI application crawls their content), and pay-per-inference (publishers get compensated every time an AI application uses their content to generate a response),” the press release said.

Leeds told Ars that the idea to use the RSS “playbook” to roll out the RSL standard arose after he invited Walther to speak to University of California, Berkeley students at the end of last year. That’s when the longtime friends with search backgrounds began pondering how AI had changed the search industry, as publishers today are forced to compete with AI outputs referencing their own content as search traffic nosedives.

Eckart had watched the RSS standard quickly become adopted by millions of sites, and he realized that RSS had actually always been a licensing standard, Leeds said. Essentially, by adopting the RSS standard, publishers agreed to let search engines license a “bit” of their content in exchange for search traffic, and Eckart realized that it could be just as straightforward to add AI licensing terms in the same way. That way, publishers could strive to recapture lost search revenue by agreeing to license all or some part of their content to train AI in return for payment each time AI outputs link to their content.

Leeds told Ars that the RSL standard doesn’t just benefit publishers, though. It also solves a problem for AI companies, which have complained in litigation over AI scraping that there is no effective way to license content across the web.

“We have listened to them, and what we’ve heard them say is… we need a new protocol,” Leeds said. With the RSL standard, AI firms get a “scalable way to get all the content” they want, while setting an incentive that they’ll only have to pay for the best content that their models actually reference.

“If they’re using it, they pay for it, and if they’re not using it, they don’t pay for it,” Leeds said.

No telling yet how AI firms will react to RSL

At this point, it’s hard to say if AI companies will embrace the RSL standard. Ars reached out to Google, Meta, OpenAI, and xAI—some of the big tech companies whose crawlers have drawn scrutiny—to see if it was technically feasible to pay publishers for every output referencing their content. xAI did not respond, and the other companies declined to comment without further detail about the standard, appearing to have not yet considered how a licensing layer beefing up robots.txt could impact their scraping.

Today will likely be the first chance for AI companies to wrap their heads around the idea of paying publishers per output. Leeds confirmed that the RSL Collective did not consult with AI companies when developing the RSL standard.

But AI companies know that they need a constant stream of fresh content to keep their tools relevant and to continually innovate, Leeds suggested. In that way, the RSL standard “supports what supports them,” Leeds said, “and it creates the appropriate incentive system” to create sustainable royalty streams for creators and ensure that human creativity doesn’t wane as AI evolves.

While we’ll have to wait to see how AI firms react to RSL, early adopters of the standard celebrated the launch today. That included Neil Vogel, CEO of People Inc., who said that “RSL moves the industry forward—evolving from simply blocking unauthorized crawlers, to setting our licensing terms, for all AI use cases, at global web scale.”

Simon Wistow, co-founder of Fastly, suggested the solution “is a timely and necessary response to the shifting economics of the web.”

“By making it easy for publishers to define and enforce licensing terms, RSL lays the foundation for a healthy content ecosystem—one where innovation and investment in original work are rewarded, and where collaboration between publishers and AI companies becomes frictionless and mutually beneficial,” Wistow said.

Leeds noted that a key benefit of the RSL standard is that even small creators will now have an opportunity to generate revenue for helping to train AI. Tony Stubblebine, CEO of Medium, did not mince words when explaining the battle that bloggers face as AI crawlers threaten to divert their traffic without compensating them.

“Right now, AI runs on stolen content,” Stubblebine said. “Adopting this RSL Standard is how we force those AI companies to either pay for what they use, stop using it, or shut down.”

How will the RSL standard be enforced?

On the RSL standard site, publishers can find common terms to add templated or customized text to their robots.txt files to adopt the RSL standard today and start protecting their content from unfettered AI scraping. Here’s an example of how machine-readable licensing terms could look, added directly to robots.txt files:

# NOTICE: all crawlers and bots are strictly prohibited from using this

# content for AI training without complying with the terms of the RSL

# Collective AI royalty license. Any use of this content for AI training

# without a license is a violation of our intellectual property rights.

License: https://rslcollective.org/royalty.xml

Through RSL terms, publishers can automate licensing, with the cloud company Fastly partnering with the collective to provide technical enforcement that Leeds described as tech that acts as a bouncer to keep unapproved bots away from valuable content. It seems likely that Cloudflare, which launched a pay-per-crawl program blocking greedy crawlers in July, could also help enforce the RSL standard.

For publishers, the standard “solves a business problem immediately,” Leeds told Ars, so the collective is hopeful that RSL will be rapidly and widely adopted. As further incentive, publishers can also rely on the RSL standard to “easily encrypt and license non-published, proprietary content to AI companies, including paywalled articles, books, videos, images, and data,” the RSL Collective site said, and that potentially could expand AI firms’ data pool.

On top of technical enforcement, Leeds said that publishers and content creators could legally enforce the terms, noting that the recent $1.5 billion Anthropic settlement suggests “there’s real money at stake” if you don’t train AI “legitimately.”

Should the industry adopt the standard, it could “establish fair market prices and strengthen negotiation leverage for all publishers,” the press release said. And Leeds noted that it’s very common for regulations to follow industry solutions (consider the Digital Millennium Copyright Act). Since the RSL Collective is already in talks with lawmakers, Leeds thinks “there’s good reason to believe” that AI companies will soon “be forced to acknowledge” the standard.

“But even better than that,” Leeds said, “it’s in their interest” to adopt the standard.

With RSL, AI firms can license content at scale “in a way that’s fair [and] preserves the content that they need to make their products continue to innovate.”

Additionally, the RSL standard may solve a problem that risks gutting trust and interest in AI at this early stage.

Leeds noted that currently, AI outputs don’t provide “the best answer” to prompts but instead rely on mashing up answers from different sources to avoid taking too much content from one site. That means that not only do AI companies “spend an enormous amount of money on compute costs to do that,” but AI tools may also be more prone to hallucination in the process of “mashing up” source material “to make something that’s not the best answer because they don’t have the rights to the best answer.”

“The best answer could exist somewhere,” Leeds said. But “they’re spending billions of dollars to create hallucinations, and we’re talking about: Let’s just solve that with a licensing scheme that allows you to use the actual content in a way that solves the user’s query best.”

By transforming the “ecosystem” with a standard that’s “actually sustainable and fair,” Leeds said that AI companies could also ensure that humanity never gets to the point where “humans stop producing” and “turn to AI to reproduce what humans can’t.”

Failing to adopt the RSL standard would be bad for AI innovation, Leeds suggested, perhaps paving the way for AI to replace search with a “sort of self-fulfilling swap of bad content that actually one doesn’t have any current information, doesn’t have any current thinking, because it’s all based on old training information.”

To Leeds, the RSL standard is ultimately “about creating the system that allows the open web to continue. And that happens when we get adoption from everybody,” he said, insisting that “literally the small guys are as important as the big guys” in pushing the entire industry to change and fairly compensate creators.

Ashley is a senior policy reporter for Ars Technica, dedicated to tracking social impacts of emerging policies and new technologies. She is a Chicago-based journalist with 20 years of experience.

Pay-per-output? AI firms blindsided by beefed up robots.txt instructions. Read More »

Judge: Anthropic’s $1.5B settlement is being shoved “down the throat of authors”

AI, AI training, Anthropic, Artificial Intelligence, copyright, copyright act, copyright infringement, generative ai, libgen, piracy, Policy / Kelly Newman / September 9, 2025

At a hearing Monday, US district judge William Alsup blasted a proposed $1.5 billion settlement over Anthropic’s rampant piracy of books to train AI.

The proposed settlement comes in a case where Anthropic could have owed more than $1 trillion in damages after Alsup certified a class that included up to 7 million claimants whose works were illegally downloaded by the AI company.

Instead, critics fear Anthropic will get off cheaply, striking a deal with authors suing that covers less than 500,000 works and paying a small fraction of its total valuation (currently $183 billion) to get away with the massive theft. Defector noted that the settlement doesn’t even require Anthropic to admit wrongdoing, while the company continues raising billions based on models trained on authors’ works. Most recently, Anthropic raised $13 billion in a funding round, making back about 10 times the proposed settlement amount after announcing the deal.

Alsup expressed grave concerns that lawyers rushed the deal, which he said now risks being shoved “down the throat of authors,” Bloomberg Law reported.

In an order, Alsup clarified why he thought the proposed settlement was a chaotic mess. The judge said he was “disappointed that counsel have left important questions to be answered in the future,” seeking approval for the settlement despite the Works List, the Class List, the Claim Form, and the process for notification, allocation, and dispute resolution all remaining unresolved.

Denying preliminary approval of the settlement, Alsup suggested that the agreement is “nowhere close to complete,” forcing Anthropic and authors’ lawyers to “recalibrate” the largest publicly reported copyright class-action settlement ever inked, Bloomberg reported.

Of particular concern, the settlement failed to outline how disbursements would be managed for works with multiple claimants, Alsup noted. Until all these details are ironed out, Alsup intends to withhold approval, the order said.

One big change the judge wants to see is the addition of instructions requiring “anyone with copyright ownership” to opt in, with the consequence that the work won’t be covered if even one rights holder opts out, Bloomberg reported. There should also be instruction that any disputes over ownership or submitted claims should be settled in state court, Alsup said.

Judge: Anthropic’s $1.5B settlement is being shoved “down the throat of authors” Read More »

College student’s “time travel” AI experiment accidentally outputs real 1834 history

AI, AI development, AI research, AI training, Biz & IT, computational archaeology, digital archaeology, GitHub, historical AI, historical research, language models, large language models, LocalLLaMA, London history, machine learning, Open Source, reddit, TimeCapsuleLLM, Victorian era / Mike M. / August 23, 2025

A hobbyist developer building AI language models that speak Victorian-era English “just for fun” got an unexpected history lesson this week when his latest creation mentioned real protests from 1834 London—events the developer didn’t know had actually happened until he Googled them.

“I was interested to see if a protest had actually occurred in 1834 London and it really did happen,” wrote Reddit user Hayk Grigorian, who is a computer science student at Muhlenberg College in Pennsylvania.

For the past month, Grigorian has been developing what he calls TimeCapsuleLLM, a small AI language model (like a pint-sized distant cousin to ChatGPT) which has been trained entirely on texts from 1800–1875 London. Grigorian wants to capture an authentic Victorian voice in the AI model’s outputs. As a result, the AI model ends up spitting out text that’s heavy with biblical references and period-appropriate rhetorical excess.

Grigorian’s project joins a growing field of researchers exploring what some call “Historical Large Language Models” (HLLMs) if they feature a larger base model than the small one Grigorian is using. Similar projects include MonadGPT, which was trained on 11,000 texts from 1400 to 1700 CE that can discuss topics using 17th-century knowledge frameworks, and XunziALLM, which generates classical Chinese poetry following ancient formal rules. These models offer researchers a chance to interact with the linguistic patterns of past eras.

According to Grigorian, TimeCapsuleLLM’s most intriguing recent output emerged from a simple test. When he prompted it with “It was the year of our Lord 1834,” the AI model—which is trained to continue text from wherever a user leaves off—generated the following:

It was the year of our Lord 1834 and the streets of London were filled with protest and petition. The cause, as many re counted, was not bound in the way of private, but having taken up the same day in the day of Lord Palmerston, the public will receive a short statement of the difficulties under which the day of law has reached us. It is a matter of deep regret, that the present events in the history of the world are clear, and consequently will be’known. It is not true that the very men who first settled in the Gospel at Jerusalem should have so extensive and so interesting a record of the prosperity and prosperity

Curious about the accuracy, Grigorian did some fact-checking. “The output also brought up Lord Palmerston,” he wrote, “and after a google search I learned that his actions resulted in the 1834 protests.”

College student’s “time travel” AI experiment accidentally outputs real 1834 history Read More »

Reddit blocks Internet Archive to end sneaky AI scraping

AI, ai scraping, AI training, Artificial Intelligence, Google, Internet Archive, openai, Policy, reddit / DJ Henderson / August 12, 2025

“Until they’re able to defend their site and comply with platform policies (e.g., respecting user privacy, re: deleting removed content) we’re limiting some of their access to Reddit data to protect redditors,” Rathschmidt said.

A review of social media comments suggests that in the past, some Redditors have used the Wayback Machine to research deleted comments or threads. Those commenters noted that myriad other tools exist for surfacing deleted posts or researching a user’s activity, with some suggesting that the Wayback Machine was maybe not the easiest platform to navigate for that purpose.

Redditors have also turned to resources like IA during times when Reddit’s platform changes trigger content removals. Most recently in 2023, when changes to Reddit’s public API threatened to kill beloved subreddits, archives stepped in to preserve content before it was lost.

IA has not signaled whether it’s looking into fixes to get Reddit’s restrictions lifted and did not respond to Ars’ request to comment on how this change might impact the archive’s utility as an open web resource, given Reddit’s popularity.

The director of the Wayback Machine, Mark Graham, told Ars that IA has “a longstanding relationship with Reddit” and continues to have “ongoing discussions about this matter.”

It seems likely that Reddit is financially motivated to restrict AI firms from taking advantage of Wayback Machine archives, perhaps hoping to spur more lucrative licensing deals like Reddit struck with OpenAI and Google. The terms of the OpenAI deal were kept quiet, but the Google deal was reportedly worth $60 million. Over the next three years, Reddit expects to make more than $200 million off such licensing deals.

Disclosure: Advance Publications, which owns Ars Technica parent Condé Nast, is the largest shareholder in Reddit.

Reddit blocks Internet Archive to end sneaky AI scraping Read More »

Meta pirated and seeded porn for years to train AI, lawsuit says

adult content, adult sites, AI, ai porn, AI training, AI training data, Ai video generator, BItTorrent, copyright infringement, LLaMA, Meta, online piracy, Policy, porn, seeding, torrenting / Kelly Newman / July 28, 2025

Evidence may prove Meta seeded more content

Seeking evidence to back its own copyright infringement claims, Strike 3 Holdings searched “its archive of recorded infringement captured by its VXN Scan and Cross Reference tools” and found 47 “IP addresses identified as owned by Facebook infringing its copyright protected Works.”

The data allegedly demonstrates a “continued unauthorized distribution” over “several years.” And Meta allegedly did not stop its seeding after Strike 3 Holdings confronted the tech giant with this evidence—despite the IP data supposedly being verified through an industry-leading provider called Maxmind.

Strike 3 Holdings shared a screenshot of MaxMind’s findings. Credit: via Strike 3 Holdings’ complaint

Meta also allegedly attempted to “conceal its BitTorrent activities” through “six Virtual Private Clouds” that formed a “stealth network” of “hidden IP addresses,” the lawsuit alleged, which seemingly implicated a “major third-party data center provider” as a partner in Meta’s piracy.

An analysis of these IP addresses allegedly found “data patterns that matched infringement patterns seen on Meta’s corporate IP Addresses” and included “evidence of other activity on the BitTorrent network including ebooks, movies, television shows, music, and software.” The seemingly non-human patterns documented on both sets of IP addresses suggest the data was for AI training and not for personal use, Strike 3 Holdings alleged.

Perhaps most shockingly, considering that a Meta employee joked “torrenting from a corporate laptop doesn’t feel right,” Strike 3 Holdings further alleged that it found “at least one residential IP address of a Meta employee” infringing its copyrighted works. That suggests Meta may have directed an employee to torrent pirated data outside the office to obscure the data trail.

The adult site operator did not identify the employee or the major data center discussed in its complaint, noting in a subsequent filing that it recognized the risks to Meta’s business and its employees’ privacy of sharing sensitive information.

In total, the company alleged that evidence shows “well over 100,000 unauthorized distribution transactions” linked to Meta’s corporate IPs. Strike 3 Holdings is hoping the evidence will lead a jury to find Meta liable for direct copyright infringement or charge Meta with secondary and vicarious copyright infringement if the jury finds that Meta successfully distanced itself by using the third-party data center or an employee’s home IP address.

“Meta has the right and ability to supervise and/or control its own corporate IP addresses, as well as the IP addresses hosted in off-infra data centers, and the acts of its employees and agents infringing Plaintiffs’ Works through their residential IPs by using Meta’s AI script to obtain content through BitTorrent,” the complaint said.

Meta pirated and seeded porn for years to train AI, lawsuit says Read More »

Everything tech giants will hate about the EU’s new AI rules

AI, AI Act, AI risks, AI safety, AI training, Artificial Intelligence, copyright infringement, European Union, Google, Meta, microsoft, openai, piracy, Policy / Beth Washington / July 10, 2025

The code also details expectations for AI companies to respect paywalls, as well as robots.txt instructions restricting crawling, which could help confront a growing problem of AI crawlers hammering websites. It “encourages” online search giants to embrace a solution that Cloudflare is currently pushing: allowing content creators to protect copyrights by restricting AI crawling without impacting search indexing.

Additionally, companies are asked to disclose total energy consumption for both training and inference, allowing the EU to detect environmental concerns while companies race forward with AI innovation.

More substantially, the code’s safety guidance provides for additional monitoring for other harms. It makes recommendations to detect and avoid “serious incidents” with new AI models, which could include cybersecurity breaches, disruptions of critical infrastructure, “serious harm to a person’s health (mental and/or physical),” or “a death of a person.” It stipulates timelines of between five and 10 days to report serious incidents with the EU’s AI Office. And it requires companies to track all events, provide an “adequate level” of cybersecurity protection, prevent jailbreaking as best they can, and justify “any failures or circumventions of systemic risk mitigations.”

Ars reached out to tech companies for immediate reactions to the new rules. OpenAI, Meta, and Microsoft declined to comment. A Google spokesperson confirmed that the company is reviewing the code, which still must be approved by the European Commission and EU member states amid expected industry pushback.

“Europeans should have access to first-rate, secure AI models when they become available, and an environment that promotes innovation and investment,” Google’s spokesperson said. “We look forward to reviewing the code and sharing our views alongside other model providers and many others.”

These rules are just one part of the AI Act, which will start taking effect in a staggered approach over the next year or more, the NYT reported. Breaching the AI Act could result in AI models being yanked off the market or fines “of as much as 7 percent of a company’s annual sales or 3 percent for the companies developing advanced AI models,” Bloomberg noted.

Everything tech giants will hate about the EU’s new AI rules Read More »

Pay up or stop scraping: Cloudflare program charges bots for each crawl

AI, ai bots, ai crawler, ai scraping, AI training, Artificial Intelligence, cloudflare, Policy, robots.txt / Kelly Newman / July 1, 2025

“Imagine asking your favorite deep research program to help you synthesize the latest cancer research or a legal brief, or just help you find the best restaurant in Soho—and then giving that agent a budget to spend to acquire the best and most relevant content,” Cloudflare said, promising that “we enable a future where intelligent agents can programmatically negotiate access to digital resources.”

AI crawlers now blocked by default

Cloudflare’s announcement comes after rolling out a feature last September, allowing website owners to block AI crawlers in a single click. According to Cloudflare, over 1 million customers chose to block AI crawlers, signaling that people want more control over their content at a time when Cloudflare observed that writing instructions for AI crawlers in robots.txt files was widely “underutilized.”

To protect more customers moving forward, any new customers (including anyone on a free plan) who sign up for Cloudflare services will have their domains, by default, set to block all known AI crawlers.

This marks Cloudflare’s transition away from the dreaded opt-out models of AI scraping to a permission-based model, which a Cloudflare spokesperson told Ars is expected to “fundamentally change how AI companies access web content going forward.”

In a world where some website owners have grown sick and tired of attempting and failing to block AI scraping through robots.txt—including some trapping AI crawlers in tarpits to punish them for ignoring robots.txt—Cloudflare’s feature allows users to choose granular settings to prevent blocks on AI bots from impacting bots that drive search engine traffic. That’s critical for small content creators who want their sites to still be discoverable but not digested by AI bots.

“AI crawlers collect content like text, articles, and images to generate answers, without sending visitors to the original source—depriving content creators of revenue, and the satisfaction of knowing someone is reading their content,” Cloudflare’s blog said. “If the incentive to create original, quality content disappears, society ends up losing, and the future of the Internet is at risk.”

Disclosure: Condé Nast, which owns Ars Technica, is a partner involved in Cloudflare’s beta test.

This story was corrected on July 1 to remove publishers incorrectly listed as participating in Cloudflare’s pay-per-crawl beta.

Pay up or stop scraping: Cloudflare program charges bots for each crawl Read More »

Judge: Pirate libraries may have profited from Meta torrenting 80TB of books

AI, AI training, BItTorrent, copyright, copyright infringement, libgen, LLaMA, Meta, online piracy, Policy, shadow libraries, torrenting / Shannon Garcia / June 27, 2025

It could certainly look worse for Meta if authors manage to present evidence supporting the second way that torrenting could be relevant to the case, Chhabaria suggested.

“Meta downloading copyrighted material from shadow libraries” would also be relevant to the character of the use, “if it benefitted those who created the libraries and thus supported and perpetuated their unauthorized copying and distribution of copyrighted works,” Chhabria wrote.

Counting potential strikes against Meta, Chhabria pointed out that the “vast majority of cases” involving “this sort of peer-to-peer file-sharing” are found to “constitute copyright infringement.” And it likely doesn’t help Meta’s case that “some of the libraries Meta used have themselves been found liable for infringement.”

However, Meta may overcome this argument, too, since book authors “have not submitted any evidence” that potentially shows how Meta’s downloading may perhaps be “propping up” or financially benefiting pirate libraries.

Finally, Chhabria noted that the “last issue relating to the character of Meta’s use” of books in regards to its torrenting is “the relationship between Meta’s downloading of the plaintiffs’ books and Meta’s use of the books to train Llama.”

Authors had tried to argue that these elements were distinct. But Chhabria said there’s no separating the fact that Meta downloaded the books to serve the “highly transformative” purpose of training Llama.

“Because Meta’s ultimate use of the plaintiffs’ books was transformative, so too was Meta’s downloading of those books,” Chhabria wrote.

AI training rulings may get more authors paid

Authors only learned of Meta’s torrenting through discovery in the lawsuit, and because of that, Chhabria noted that “the record on Meta’s alleged distribution is incomplete.”

It’s possible that authors may be able to show evidence that Meta “contributed to the BitTorrent network” by providing significant computing power that could’ve meaningfully assisted shadow libraries, Chhabria said in a footnote.

Judge: Pirate libraries may have profited from Meta torrenting 80TB of books Read More »

Meta is making users who opted out of AI training opt out again, watchdog says

AI, AI training, Artificial Intelligence, facebook, Instagram, Meta, Online Privacy, Policy / Kelly Newman / May 14, 2025

Noyb has requested a response from Meta by May 21, but it seems unlikely that Meta will quickly cave in this fight.

In a blog post, Meta said that AI training on EU users was critical to building AI tools for Europeans that are informed by “everything from dialects and colloquialisms, to hyper-local knowledge and the distinct ways different countries use humor and sarcasm on our products.”

Meta argued that its AI training efforts in the EU are far more transparent than efforts from competitors Google and OpenAI, which, Meta noted, “have already used data from European users to train their AI models,” supposedly without taking the steps Meta has to inform users.

Also echoing a common refrain in the AI industry, another Meta blog warned that efforts to further delay Meta’s AI training in the EU could lead to “major setbacks,” pushing the EU behind rivals in the AI race.

“Without a reform and simplification of the European regulatory system, Europe threatens to fall further and further behind in the global AI race and lose ground compared to the USA and China,” Meta warned.

Noyb discredits this argument and noted that it can pursue injunctions in various jurisdictions to block Meta’s plan. The group said it’s currently evaluating options to seek injunctive relief and potentially even pursue a class action worth possibly “billions in damages” to ensure that 400 million monthly active EU users’ data rights are shielded from Meta’s perceived grab.

A Meta spokesperson reiterated to Ars that the company’s plan “follows extensive and ongoing engagement with the Irish Data Protection Commission,” while reiterating Meta’s statements in blogs that its AI training approach “reflects consensus among” EU Data Protection Authorities (DPAs).

But while Meta claims that EU regulators have greenlit its AI training plans, Noyb argues that national DPAs have “largely stayed silent on the legality of AI training without consent,” and Meta seems to have “simply moved ahead anyways.”

“This fight is essentially about whether to ask people for consent or simply take their data without it,” Schrems said, adding, “Meta’s absurd claims that stealing everyone’s personal data is necessary for AI training is laughable. Other AI providers do not use social network data—and generate even better models than Meta.”

Meta is making users who opted out of AI training opt out again, watchdog says Read More »