Web Scraping

Lawsuit: Reddit caught Perplexity “red-handed” stealing data from Google results


Scraper accused of stealing Reddit content “shocked” by lawsuit.

In a lawsuit filed on Wednesday, Reddit accused an AI search engine, Perplexity, of conspiring with several companies to illegally scrape Reddit content from Google search results, allegedly dodging anti-scraping methods that require substantial investments from both Google and Reddit.

Reddit alleged that Perplexity feeds off Reddit and Google, claiming to be “the world’s first answer engine” but really doing “nothing groundbreaking.”

“Its answer engine simply uses a different company’s” large language model “to parse through a massive number of Google search results to see if it can answer a user’s question based on those results,” the lawsuit said. “But Perplexity can only run its ‘answer engine’ by wrongfully accessing and scraping Reddit content appearing in Google’s own search results from Google’s own search engine.”

Likening companies involved in the alleged conspiracy to “bank robbers,” Reddit claimed it caught Perplexity “red-handed” stealing content that its “answer engine” should not have had access to.

Baiting Perplexity with “the digital equivalent of marked bills,” Reddit tested out posting content that could only be found in Google search engine results pages (SERPs) and “within hours, queries to Perplexity’s ‘answer engine’ produced the contents of that test post.”

“The only way that Perplexity could have obtained that Reddit content and then used it in its ‘answer engine’ is if it and/or its Co-Defendants scraped Google SERPs for that Reddit content and Perplexity then quickly incorporated that data into its answer engine,” Reddit’s lawsuit said.
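The “marked bills” test described above can be pictured as a canary check: publish a unique string in one place, then look for it in another service’s output. The following is a minimal sketch of that idea only, not Reddit’s actual method, which the complaint excerpts here do not detail. The function names `make_canary` and `contains_canary` and the sample strings are hypothetical.

```python
import secrets


def make_canary(prefix: str = "canary") -> str:
    """Return a unique, hard-to-guess marker to embed in a test post."""
    # 8 random bytes -> 16 hex characters, unlikely to occur by chance.
    return f"{prefix}-{secrets.token_hex(8)}"


def contains_canary(text: str, canary: str) -> bool:
    """Report whether the marker appears in the text, ignoring case."""
    return canary.lower() in text.lower()


if __name__ == "__main__":
    marker = make_canary()
    post = f"Test post containing marker {marker}."
    # Stand-in for text returned by some downstream service (hypothetical).
    answer = f"According to one thread: {post}"
    print(contains_canary(answer, marker))
    print(contains_canary("an unrelated answer", marker))
```

Because the marker is random, finding it in another service’s output suggests that the output derives from the test post, though it does not by itself show which route the content took.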

In a Reddit post, Perplexity denied any wrongdoing, describing its answer engine as summarizing Reddit discussions and citing Reddit threads in answers, just like anyone who shares links or posts on Reddit might do. Perplexity suggested that Reddit was attacking the open Internet by trying to extort licensing fees for Reddit content, despite knowing that Perplexity doesn’t train foundational models. Reddit’s endgame, Perplexity alleged, was to use the Perplexity lawsuit as a “show of force in Reddit’s training data negotiations with Google and OpenAI.”

“We won’t be extorted, and we won’t help Reddit extort Google, even if they’re our (huge) competitor,” Perplexity wrote. “Perplexity will play fair, but we won’t cave. And we won’t let bigger companies use us in shell games.”

Reddit likely anticipated Perplexity’s defense of the “open Internet,” noting in its complaint that “Reddit’s current Robots Exclusion Protocol file (‘robots.txt’) says, ‘Reddit believes in an open Internet, but not the misuse of public content.’”

Google reveals how scrapers steal from search results

To block scraping, Reddit uses various measures, such as “registered user-identification limits, IP-rate limits, captcha bot protection, and anomaly-detection tools,” the complaint said.
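Of the measures listed, an IP-rate limit is the simplest to illustrate. The sketch below is a generic sliding-window limiter, offered only as an assumption about how such a control can work; the complaint does not describe Reddit’s implementation. The class name `IpRateLimiter`, its parameters, and the sample addresses are hypothetical.

```python
import time
from collections import defaultdict, deque
from typing import Optional


class IpRateLimiter:
    """Allow at most max_requests per IP within a sliding time window."""

    def __init__(self, max_requests: int, window_seconds: float) -> None:
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        # Maps each IP address to the timestamps of its recent requests.
        self._hits = defaultdict(deque)

    def allow(self, ip: str, now: Optional[float] = None) -> bool:
        """Return True and record the request if the IP is under its limit."""
        now = time.monotonic() if now is None else now
        hits = self._hits[ip]
        # Discard timestamps that have aged out of the window.
        while hits and now - hits[0] >= self.window_seconds:
            hits.popleft()
        if len(hits) >= self.max_requests:
            return False
        hits.append(now)
        return True


if __name__ == "__main__":
    limiter = IpRateLimiter(max_requests=2, window_seconds=10.0)
    for t in (0.0, 1.0, 2.0):
        print(t, limiter.allow("203.0.113.5", now=t))
```

A limiter like this counts requests per address, so traffic spread across many addresses would stay under each individual limit. That is one reason a per-IP limit is typically paired with other signals, such as the anomaly detection the complaint mentions.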

Similarly, Google relies on “anti-scraping systems and teams dedicated to preventing unauthorized access to its products and services,” Reddit said, noting Google prohibits “unauthorized automated access” to its SERPs.

To back its claims, Reddit subpoenaed Google to find out more about how the search giant blocks AI scrapers from accessing content on SERPs. Google confirmed it relies on “a technological access control system called ‘SearchGuard,’ which is designed to prevent automated systems from accessing and obtaining wholesale search results and indexed data while allowing individual users—i.e., humans—access to Google’s search results, including results that feature Reddit data.”

“SearchGuard prevents unauthorized access to Google’s search data by imposing a barrier challenge that cannot be solved in the ordinary course by automated systems unless they take affirmative actions to circumvent the SearchGuard system,” Reddit’s complaint explained.

Bypassing these anti-scraping systems violates the Digital Millennium Copyright Act, Reddit alleged, as well as laws against unfair trade and unjust enrichment. Seemingly, Google’s SearchGuard may currently be the easiest to bypass for alleged conspirators who supposedly pivoted to looting Google SERPs after realizing they couldn’t access Reddit content directly on the platform.

Scrapers shocked by Reddit lawsuit

Reddit accused three companies of conspiring with Perplexity: “a Lithuanian data scraper” called Oxylabs UAB, “a former Russian botnet” known as AWMProxy, and SerpApi, a Texas company that sells services for scraping search engines.

Oxylabs “is explicit that its scraping service is meant to circumvent Google’s technological measures,” Reddit alleged, pointing to an Oxylabs webpage titled “How to Scrape Google Search Results.”

SerpApi touts the same service, including some options to scrape SERPs at “ludicrous speeds.” To trick browsers, SerpApi’s fastest option uses “a server-swarm to hide from, avoid, or simply overwhelm by brute force effective measures Google has put in place to ward off automated access to search engine results,” Reddit alleged. SerpApi also allegedly provides users “with tips to reduce the chance of being blocked while web scraping, such as by sending ‘fake user-agent string[s],’ shifting IP addresses to avoid multiple requests from the same address, and using proxies ‘to make traffic look like regular user traffic’ and thereby ‘impersonate’ user traffic.”

According to Reddit, the three companies disguise “their web scrapers as regular people (among other techniques) to circumvent or bypass the security restrictions meant to stop them.” During a two-week span in July, they scraped “almost three billion” SERPs containing Reddit text, URLs, images, and videos, a subpoena requesting information from Google revealed.

Ars could not immediately reach AWMProxy for comment. However, the other companies were surprised by Reddit’s lawsuit, while vowing to defend their business models.

SerpApi’s spokesperson told Ars that Reddit did not notify the company before filing the lawsuit.

“We strongly disagree with Reddit’s allegations and intend to vigorously defend ourselves in court,” SerpApi’s spokesperson said. “In the eight years we’ve been in business, SerpApi has always operated on the right side of the law. As stated on our website, ‘The crawling and parsing of public data is protected by the First Amendment of the United States Constitution. We value freedom of speech tremendously.’”

Additionally, SerpApi works “closely with our attorneys to ensure that our services comply with all applicable laws and fair use principles. SerpApi stands firmly behind its business model and conduct, and we will continue to defend our rights to the fullest extent,” the spokesperson said.

Oxylabs’ chief governance strategy officer, Denas Grybauskas, told Ars that Reddit’s complaint seemed baffling since the other companies involved in the litigation are “unrelated and unaffiliated.”

“We are shocked and disappointed by this news, as Reddit has made no attempt to speak with us directly or communicate any potential concerns,” Grybauskas said. “Oxylabs has always been and will continue to be a pioneer and an industry leader in public data collection, and it will not hesitate to defend itself against these allegations. Oxylabs’ position is that no company should claim ownership of public data that does not belong to them. It is possible that it is just an attempt to sell the same public data at an inflated price.”

Grybauskas defended Oxylabs’ business as creating “real-world value for thousands of businesses and researchers, such as those driving open-source investigations, disinformation tackling, or environmental monitoring.”

“We strongly believe that our core business principles make the Internet a better place and serve the public good,” Grybauskas said. “Oxylabs provides infrastructure for compliant access to publicly available information, and we demand every customer to use our services lawfully.”

Reddit cited threats to licensing deals

Apparently, Reddit caught on to the alleged scheme after sending cease-and-desist letters to Perplexity to stop scraping Reddit content that its answer engine was citing. Rather than ending the scraping, Reddit claimed Perplexity’s citations increased “forty-fold.” Since Perplexity is a customer listed on SerpApi’s website, Reddit hypothesized the two were conspiring to skirt Google’s anti-circumvention tools, the complaint said, along with the other companies.

In a statement provided to Ars, Ben Lee, chief legal officer at Reddit, said that Oxylabs, AWMProxy, and SerpApi were “textbook examples” of scrapers that “bypass technological protections to steal data, then sell it to clients hungry for training material.”

“Unable to scrape Reddit directly, they mask their identities, hide their locations, and disguise their web scrapers to steal Reddit content from Google Search,” Lee said. “Perplexity is a willing customer of at least one of these scrapers, choosing to buy stolen data rather than enter into a lawful agreement with Reddit itself.”

On Reddit, Perplexity pushed back on Reddit’s claims that Perplexity ignored requests to license Reddit content.

“Untrue. Whenever anyone asks us about content licensing, we explain that Perplexity, as an application-layer company, does not train AI models on content,” Perplexity said. “Never has. So, it is impossible for us to sign a license agreement to do so.”

Reddit supposedly “insisted we pay anyway, despite lawfully accessing Reddit data,” Perplexity said. “Bowing to strong arm tactics just isn’t how we do business.”

Perplexity’s spokesperson, Jesse Dwyer, told Ars the company chose to post its statement on Reddit “to illustrate a simple point.”

“It is a public Reddit link accessible to anyone, yet by the logic of Reddit’s lawsuit, if you mention it or cite it in any way (which is your job as a reporter), they might just sue you,” Dwyer said.

But Reddit claimed that its business and reputation have been “damaged” by “misappropriation of Reddit data and circumvention of technological control measures.” Without a licensing deal ensuring that Perplexity and others are respecting Reddit policies, Reddit cannot control who has access to data, how they’re using data, and if data use conflicts with Reddit’s privacy policy and user agreement, the complaint said.

Further, Reddit’s worried that Perplexity’s workaround could catch on, potentially messing up Reddit’s other licensing deals. All the while, Reddit noted, it has to invest “significant resources” in anti-scraping technology, with Reddit ultimately suffering damages, including “lost profits and business opportunities, reputational harm, and loss of user trust.”

Reddit’s hoping the court will grant an injunction barring companies from scraping Reddit content from Google SERPs. It also wants companies blocked from both selling Reddit data and “developing or distributing any technology or product that is used for the unauthorized circumvention of technological control measures and scraping of Reddit data.”

If Reddit wins, companies could be required to pay substantial damages or to disgorge profits from the sale of Reddit content.

Advance Publications, which owns Ars Technica parent Condé Nast, is the largest shareholder in Reddit.

Ashley is a senior policy reporter for Ars Technica, dedicated to tracking social impacts of emerging policies and new technologies. She is a Chicago-based journalist with 20 years of experience.

Billions of public Discord messages may be sold through a scraping service

Cross-server tracking suggests a new understanding of “public” chat servers.

It’s easy to get the impression that Discord chat messages are ephemeral, especially across different public servers, where lines fly upward at a near-unreadable pace. But someone claims to be catching and compiling that data and is offering packages that can track more than 600 million users across more than 14,000 servers.

Joseph Cox at 404 Media confirmed that Spy Pet, a service that sells access to a database of purportedly 3 billion Discord messages, offers data “credits” to customers who pay in bitcoin, ethereum, or other cryptocurrency. Searching individual users will reveal the servers that Spy Pet can track them across, a raw and exportable table of their messages, and connected accounts, such as GitHub. Ominously, Spy Pet lists more than 86,000 other servers in which it has “no bots,” but “we know it exists.”

  • An example of Spy Pet’s service from its website. Shown are a user’s nicknames, connected accounts, banner image, server memberships, and messages across those servers tracked by Spy Pet.

    Spy Pet

  • Statistics on servers, users, and messages purportedly logged by Spy Pet.

    Spy Pet

  • An example image of the publicly available data gathered by Spy Pet, in this example for a public server for the game Deep Rock Galactic: Survivor.

    Spy Pet

As Cox notes, Discord doesn’t make messages inside server channels easy to publicly access and search, the way blog posts or unlocked social media feeds are. But many Discord users may not expect their messages, server memberships, bans, or other data to be grabbed by a bot, compiled, and sold to anybody wishing to pin them all on a particular user. 404 Media confirmed the service’s function with multiple user examples. Private messages are not mentioned by Spy Pet and are presumably still secure.

Spy Pet openly asks those training AI models, or “federal agents looking for a new source of intel,” to contact them for deals. As noted by 404 Media and confirmed by Ars, clicking on the “Request Removal” link plays a clip of J. Jonah Jameson from Spider-Man (the Tobey Maguire/Sam Raimi version) laughing at the idea of advance payment before an abrupt “You’re serious?” Users of Spy Pet, however, are assured of “secure and confidential” searches, with random usernames.

This author found nearly every public Discord he had ever dropped into for research or reporting in Spy Pet’s server list. Those who haven’t paid for message access can only see fairly benign public-facing elements, like stickers, emojis, and charted member totals over time. But as an indication of the reach of Spy Pet’s scraping, it’s an effective warning, or enticement, depending on your goals.

Ars has reached out to Spy Pet for comment and will update this post if we receive a response. A Discord spokesperson told Ars that the company is investigating whether Spy Pet violated its terms of service and community guidelines. It will take “appropriate steps to enforce our policies,” the company said, and could not provide further comment.

OpenAI accuses NYT of hacking ChatGPT to set up copyright suit

OpenAI is now boldly claiming that The New York Times “paid someone to hack OpenAI’s products” like ChatGPT to “set up” a lawsuit against the leading AI maker.

In a court filing Monday, OpenAI alleged that “100 examples in which some version of OpenAI’s GPT-4 model supposedly generated several paragraphs of Times content as outputs in response to user prompts” do not reflect how normal people use ChatGPT.

Instead, it allegedly took The Times “tens of thousands of attempts to generate” these supposedly “highly anomalous results” by “targeting and exploiting a bug” that OpenAI claims it is now “committed to addressing.”

According to OpenAI, this activity amounts to “contrived attacks” by a “hired gun” who allegedly hacked OpenAI models until they hallucinated fake NYT content or regurgitated training data to replicate NYT articles. NYT allegedly paid for these “attacks” to gather evidence to support The Times’ claims that OpenAI’s products imperil its journalism by allegedly regurgitating reporting and stealing The Times’ audiences.

“Contrary to the allegations in the complaint, however, ChatGPT is not in any way a substitute for a subscription to The New York Times,” OpenAI argued in a motion that seeks to dismiss the majority of The Times’ claims. “In the real world, people do not use ChatGPT or any other OpenAI product for that purpose. Nor could they. In the ordinary course, one cannot use ChatGPT to serve up Times articles at will.”

In the filing, OpenAI described The Times as enthusiastically reporting on its chatbot developments for years without raising any concerns about copyright infringement. OpenAI claimed that it disclosed that The Times’ articles were used to train its AI models in 2020, but The Times only cared after ChatGPT’s popularity exploded after its debut in 2022.

According to OpenAI, “It was only after this rapid adoption, along with reports of the value unlocked by these new technologies, that the Times claimed that OpenAI had ‘infringed its copyright[s]’ and reached out to demand ‘commercial terms.’ After months of discussions, the Times filed suit two days after Christmas, demanding ‘billions of dollars.’”

Ian Crosby, Susman Godfrey partner and lead counsel for The New York Times, told Ars that “what OpenAI bizarrely mischaracterizes as ‘hacking’ is simply using OpenAI’s products to look for evidence that they stole and reproduced The Times’s copyrighted works. And that is exactly what we found. In fact, the scale of OpenAI’s copying is much larger than the 100-plus examples set forth in the complaint.”

Crosby told Ars that OpenAI’s filing notably “doesn’t dispute—nor can they—that they copied millions of The Times’ works to build and power its commercial products without our permission.”

“Building new products is no excuse for violating copyright law, and that’s exactly what OpenAI has done on an unprecedented scale,” Crosby said.

OpenAI argued that the court should dismiss claims alleging direct copyright infringement, contributory infringement, Digital Millennium Copyright Act violations, and misappropriation, all of which it describes as “legally infirm.” Some fail because they are time-barred, seeking damages on training data for OpenAI’s older models, OpenAI claimed. Others allegedly fail because they misunderstand fair use or are preempted by federal laws.

If OpenAI’s motion is granted, the case would be substantially narrowed.

But if the motion is not granted and The Times ultimately wins—and it might—OpenAI may be forced to wipe ChatGPT and start over.

“OpenAI, which has been secretive and has deliberately concealed how its products operate, is now asserting it’s too late to bring a claim for infringement or hold them accountable. We disagree,” Crosby told Ars. “It’s noteworthy that OpenAI doesn’t dispute that it copied Times works without permission within the statute of limitations to train its more recent and current models.”

OpenAI did not immediately respond to Ars’ request to comment.

Beautiful Soup vs. Scrapy vs. Selenium: Which Web Scraping Tool Should You Use?
