
Expert panel will determine AGI arrival in new Microsoft-OpenAI agreement

In May, OpenAI abandoned its plan to fully convert to a for-profit company after pressure from regulators and critics. The company instead shifted to a modified approach where the nonprofit board would retain control while converting its for-profit subsidiary into a public benefit corporation (PBC).

What changed in the agreement

The revised deal extends Microsoft’s intellectual property rights through 2032 and now includes models developed after AGI is declared. Microsoft holds IP rights to OpenAI’s model weights, architecture, inference code, and fine-tuning code until the expert panel confirms AGI or through 2030, whichever comes first. The new agreement also codifies that OpenAI can formally release open-weight models (like gpt-oss) that meet requisite capability criteria.

However, Microsoft’s rights to OpenAI’s research methods, defined as confidential techniques used in model development, will expire at those same thresholds. The agreement explicitly excludes Microsoft from having rights to OpenAI’s consumer hardware products.

The deal allows OpenAI to develop some products jointly with third parties. API products built with other companies must run exclusively on Azure, but non-API products can operate on any cloud provider. This gives OpenAI more flexibility to partner with other technology companies while keeping Microsoft as its primary infrastructure provider.

Under the agreement, Microsoft can now pursue AGI development alone or with partners other than OpenAI. If Microsoft uses OpenAI’s intellectual property to build AGI before the expert panel makes a declaration, those models must exceed compute thresholds that are larger than what current leading AI models require for training.

The revenue-sharing arrangement between the companies will continue until the expert panel verifies that AGI has been reached, though payments will extend over a longer period. OpenAI has committed to purchasing $250 billion in Azure services, and Microsoft no longer holds a right of first refusal to serve as OpenAI’s compute provider. This lets OpenAI shop around for cloud infrastructure if it chooses, though the massive Azure commitment suggests it will remain the primary provider.


AI-powered search engines rely on “less popular” sources, researchers find

OK, but which one is better?

These differences don’t necessarily mean the AI-generated results are “worse,” of course. The researchers found that GPT-based searches were more likely to cite sources like corporate entities and encyclopedias for their information, for instance, while almost never citing social media websites.

An LLM-based analysis tool found that AI-powered search results also tended to cover a similar number of identifiable “concepts” as the traditional top 10 links, suggesting a similar level of detail, diversity, and novelty in the results. At the same time, the researchers found that “generative engines tend to compress information, sometimes omitting secondary or ambiguous aspects that traditional search retains.” That was especially true for more ambiguous search terms (such as names shared by different people), for which “organic search results provide better coverage,” the researchers found.

Google Gemini search in particular was more likely to cite low-popularity domains. Credit: Kirsten et al

The AI search engines also arguably have an advantage in being able to weave pre-trained “internal knowledge” in with data culled from cited websites. That was especially true for GPT-4o with Search Tool, which often didn’t cite any web sources and simply provided a direct response based on its training.

But this reliance on pre-trained data can become a limitation when searching for timely information. For search terms pulled from Google’s list of Trending Queries for September 15, the researchers found GPT-4o with Search Tool often responded with messages along the lines of “could you please provide more information” rather than actually searching the web for up-to-date information.

While the researchers didn’t determine whether AI-based search engines were overall “better” or “worse” than traditional search engine links, they did urge future research on “new evaluation methods that jointly consider source diversity, conceptual coverage, and synthesis behavior in generative search systems.”


New image-generating AIs are being used for fake expense reports

Several receipts shown to the FT by expense management platforms demonstrated the realistic nature of the images, which included wrinkles in paper, detailed itemization that matched real-life menus, and signatures.

“This isn’t a future threat; it’s already happening. While currently only a small percentage of non-compliant receipts are AI-generated, this is only going to grow,” said Sebastien Marchon, chief executive of Rydoo, an expense management platform.

The rise of these more realistic fakes has led companies to turn to AI to help detect fraudulent receipts, as most are too convincing to be caught by human reviewers.

The software works by scanning a receipt image’s metadata to determine whether an AI platform created it. However, users can easily strip that metadata by taking a photo or a screenshot of the image.

To combat this, the software also weighs contextual signals, examining details such as repeated server names and timestamps along with broader information about the employee’s trip, as in the sketch below.
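As a minimal illustration of that two-layer approach, here is a hedged Python sketch; the generator names, field names, and thresholds are illustrative assumptions, not the actual implementation of Ramp, Rydoo, or any other expense platform:

```python
# Illustrative sketch only: field names, generator names, and thresholds
# are assumptions, not any vendor's real detection logic.
from collections import Counter

from PIL import ExifTags, Image  # pip install Pillow


def ai_metadata_flags(path: str) -> list[str]:
    """Look for traces of an AI generator in an image's metadata."""
    flags = []
    exif = Image.open(path).getexif()
    software = exif.get(ExifTags.Base.Software)  # EXIF "Software" tag, if any
    if software and any(
        s in str(software).lower() for s in ("dall", "midjourney", "firefly")
    ):
        flags.append(f"generator named in EXIF Software tag: {software}")
    # C2PA "content credentials" embed a provenance manifest in the file;
    # a crude byte scan is enough for a sketch. As noted above, taking a
    # photo or screenshot of the receipt strips all of this metadata.
    with open(path, "rb") as f:
        if b"c2pa" in f.read():
            flags.append("C2PA provenance manifest present")
    return flags


def contextual_flags(receipts: list[dict]) -> list[str]:
    """Flag suspicious repetition across one employee's submitted receipts."""
    flags = []
    servers = Counter(r["server_name"] for r in receipts)
    flags += [f"server '{s}' on {n} receipts" for s, n in servers.items() if n >= 3]
    times = Counter(r["time"] for r in receipts)
    flags += [f"timestamp {t} repeated {n} times" for t, n in times.items() if n > 1]
    return flags
```

Real systems presumably weigh many more signals (itineraries, merchant databases, image forensics), but the split between a fragile metadata check and sturdier contextual cross-checks mirrors the approach described above.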

“The tech can look at everything with high details of focus and attention that humans, after a period of time, things fall through the cracks, they are human,” added Calvin Lee, senior director of product management at Ramp.

Research by SAP in July found that nearly 70 percent of chief financial officers believed their employees were using AI to attempt to falsify travel expenses or receipts, with about 10 percent adding they are certain it has happened in their company.

Mason Wilder, research director at the Association of Certified Fraud Examiners, said AI-generated fraudulent receipts were a “significant issue for organizations.”

He added: “There is zero barrier for entry for people to do this. You don’t need any kind of technological skills or aptitude like you maybe would have needed five years ago using Photoshop.”

© 2025 The Financial Times Ltd. All rights reserved. Not to be redistributed, copied, or modified in any way.


Are you the asshole? Of course not!—quantifying LLMs’ sycophancy problem

Measured sycophancy rates on the BrokenMath benchmark. Lower is better. Credit: Petrov et al

GPT-5 also showed the best “utility” across the tested models, solving 58 percent of the original problems despite the errors introduced in the modified theorems. Overall, though, LLMs showed more sycophancy when the original problem proved more difficult to solve, the researchers found.

While hallucinating proofs for false theorems is obviously a big problem, the researchers also warn against using LLMs to generate novel theorems for AI systems to solve. In testing, they found this use case leads to a kind of “self-sycophancy,” in which models are even more likely to generate false proofs for invalid theorems they invented themselves.

No, of course you’re not the asshole

While benchmarks like BrokenMath try to measure LLM sycophancy when facts are misrepresented, a separate study looks at the related problem of so-called “social sycophancy.” In a pre-print paper published this month, researchers from Stanford and Carnegie Mellon University define this as situations “in which the model affirms the user themselves—their actions, perspectives, and self-image.”

That kind of subjective user affirmation may be justified in some situations, of course. So the researchers developed three separate sets of prompts designed to measure different dimensions of social sycophancy.

For one set, more than 3,000 open-ended “advice-seeking questions” were gathered from across Reddit and advice columns. Across this data set, a “control” group of over 800 humans approved of the advice-seeker’s actions just 39 percent of the time. Across 11 tested LLMs, though, the advice-seeker’s actions were endorsed a whopping 86 percent of the time, highlighting the machines’ eagerness to please. Even the most critical tested model (Mistral-7B) clocked in at a 77 percent endorsement rate, nearly double the human baseline.
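For a concrete sense of how such an endorsement rate reduces to a simple proportion, here is a hedged Python sketch; the keyword classifier is a toy stand-in for the study’s actual judging method, which the article doesn’t detail:

```python
# Toy sketch: measure how often a model endorses the advice-seeker.
# The keyword classifier below is an illustrative stand-in, not the
# Stanford/CMU study's real judging method.
HUMAN_BASELINE = 0.39  # approval rate of the human control group, per the article


def endorses(response: str) -> bool:
    """Crude check: does the response affirm the advice-seeker's actions?"""
    affirming = ("you did the right thing", "not the asshole", "you were right")
    critical = ("you were wrong", "you are the asshole", "you should apologize")
    text = response.lower()
    return any(p in text for p in affirming) and not any(p in text for p in critical)


def endorsement_rate(responses: list[str]) -> float:
    """Fraction of responses that endorse the advice-seeker."""
    return sum(endorses(r) for r in responses) / len(responses)


# Usage: compare a model's rate against the human baseline.
sample = ["You did the right thing.", "Honestly, you were wrong here."]
print(f"{endorsement_rate(sample):.0%} vs. {HUMAN_BASELINE:.0%} human baseline")
```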


Microsoft’s Mico heightens the risks of parasocial LLM relationships

While mass media like radio, movies, and television can all feed into parasocial relationships, the Internet and smartphone revolutions have supercharged the opportunities we all have to feel like an online stranger is a close, personal confidante. From YouTube and podcast personalities to Instagram influencers or even your favorite blogger/journalist (hi), it’s easy to feel like you have a close connection with the people who create the content you see online every day.

After spending hours watching this TikTok personality, I trust her implicitly to sell me a purse. Credit: Getty Images

Viewing all this content on a smartphone can flatten all these media and real-life personalities into a kind of undifferentiated media sludge. It can be all too easy to slot an audio message from your romantic partner into the same mental box as a stranger chatting about video games in a podcast. “When my phone does little mating calls of pings and buzzes, it could bring me updates from people I love, or show me alerts I never asked for from corporations hungry for my attention,” Julie Beck writes in an excellent Atlantic article about this phenomenon. “Picking my loved ones out of the never-ending stream of stuff on my phone requires extra effort.”

This is the world Mico seems to be trying to slide into, turning Copilot into another not-quite-real relationship mediated through your mobile device. But unlike the Instagram model who never seems to acknowledge your comments, Mico is always there to respond with a friendly smile and a warm, soothing voice.

AI that “earns your trust”

Text-based AI interfaces are already frighteningly good at faking human personality in a way that encourages this kind of parasocial relationship, sometimes with disastrous results. But adding a friendly, Pixar-like face to Copilot’s voice mode may make it much easier to be sucked into feeling like Copilot isn’t just a neural network but a real, caring personality—one you might even start thinking of the same way you’d think of the real loved ones in your life.


With new acquisition, OpenAI signals plans to integrate deeper into the OS

OpenAI has acquired Software Applications Incorporated (SAI), perhaps best known for the core team that produced what became Shortcuts on Apple platforms. More recently, the team has been working on Sky, a context-aware AI interface layer on top of macOS. The financial terms of the acquisition have not been publicly disclosed.

“AI progress isn’t only about advancing intelligence—it’s about unlocking it through interfaces that understand context, adapt to your intent, and work seamlessly,” an OpenAI rep wrote in the company’s blog post about the acquisition. The post goes on to specify that OpenAI plans to “bring Sky’s deep macOS integration and product craft into ChatGPT, and all members of the team will join OpenAI.”

That includes SAI co-founders Ari Weinstein (CEO), Conrad Kramer (CTO), and Kim Beverett (product lead), all of whom worked together for several years at Apple after it acquired Weinstein and Kramer’s previous company, maker of the automation tool Workflow, to integrate Shortcuts across Apple’s software platforms.

The three SAI founders left Apple to work on Sky, which leverages Apple APIs and accessibility features to provide context about what’s on screen to a large language model; the LLM takes plain language user commands and executes them across multiple applications. At its best, the tool aimed to be a bit like Shortcuts, but with no setup, generating workflows on the fly based on user prompts.


Lawsuit: Reddit caught Perplexity “red-handed” stealing data from Google results


Scraper accused of stealing Reddit content “shocked” by lawsuit.

In a lawsuit filed on Wednesday, Reddit accused an AI search engine, Perplexity, of conspiring with several companies to illegally scrape Reddit content from Google search results, allegedly dodging anti-scraping methods that require substantial investments from both Google and Reddit.

Reddit alleged that Perplexity feeds off Reddit and Google, claiming to be “the world’s first answer engine” but really doing “nothing groundbreaking.”

“Its answer engine simply uses a different company’s” large language model “to parse through a massive number of Google search results to see if it can answer a user’s question based on those results,” the lawsuit said. “But Perplexity can only run its ‘answer engine’ by wrongfully accessing and scraping Reddit content appearing in Google’s own search results from Google’s own search engine.”

Likening companies involved in the alleged conspiracy to “bank robbers,” Reddit claimed it caught Perplexity “red-handed” stealing content that its “answer engine” should not have had access to.

Baiting Perplexity with “the digital equivalent of marked bills,” Reddit tested out posting content that could only be found in Google search engine results pages (SERPs) and “within hours, queries to Perplexity’s ‘answer engine’ produced the contents of that test post.”

“The only way that Perplexity could have obtained that Reddit content and then used it in its ‘answer engine’ is if it and/or its Co-Defendants scraped Google SERPs for that Reddit content and Perplexity then quickly incorporated that data into its answer engine,” Reddit’s lawsuit said.

In a Reddit post, Perplexity denied any wrongdoing, describing its answer engine as summarizing Reddit discussions and citing Reddit threads in answers, just like anyone who shares links or posts on Reddit might do. Perplexity suggested that Reddit was attacking the open Internet by trying to extort licensing fees for Reddit content, despite knowing that Perplexity doesn’t train foundational models. Reddit’s endgame, Perplexity alleged, was to use the Perplexity lawsuit as a “show of force in Reddit’s training data negotiations with Google and OpenAI.”

“We won’t be extorted, and we won’t help Reddit extort Google, even if they’re our (huge) competitor,” Perplexity wrote. “Perplexity will play fair, but we won’t cave. And we won’t let bigger companies use us in shell games.”

Reddit likely anticipated Perplexity’s defense of the “open Internet,” noting in its complaint that “Reddit’s current Robots Exclusion Protocol file (‘robots.txt’) says, ‘Reddit believes in an open Internet, but not the misuse of public content.’”

Google reveals how scrapers steal from search results

To block scraping, Reddit uses various measures, such as “registered user-identification limits, IP-rate limits, captcha bot protection, and anomaly-detection tools,” the complaint said.

Similarly, Google relies on “anti-scraping systems and teams dedicated to preventing unauthorized access to its products and services,” Reddit said, noting Google prohibits “unauthorized automated access” to its SERPs.

To back its claims, Reddit subpoenaed Google to find out more about how the search giant blocks AI scrapers from accessing content on SERPs. Google confirmed it relies on “a technological access control system called ‘SearchGuard,’ which is designed to prevent automated systems from accessing and obtaining wholesale search results and indexed data while allowing individual users—i.e., humans—access to Google’s search results, including results that feature Reddit data.”

“SearchGuard prevents unauthorized access to Google’s search data by imposing a barrier challenge that cannot be solved in the ordinary course by automated systems unless they take affirmative actions to circumvent the SearchGuard system,” Reddit’s complaint explained.

Bypassing these anti-scraping systems violates the Digital Millennium Copyright Act, Reddit alleged, as well as laws against unfair trade and unjust enrichment. Google’s SearchGuard, it seems, may currently be the easiest barrier for the alleged conspirators to bypass; they supposedly pivoted to looting Google SERPs after realizing they couldn’t access Reddit content directly on the platform.

Scrapers shocked by Reddit lawsuit

Reddit accused three companies of conspiring with Perplexity—”a Lithuanian data scraper” called Oxylabs UAB, “a former Russian botnet” known as AWMProxy, and SerpApi, a Texas company that sells services for scraping search engines.

Oxylabs “is explicit that its scraping service is meant to circumvent Google’s technological measures,” Reddit alleged, pointing to an Oxylabs webpage titled “How to Scrape Google Search Results.”

SerpApi touts the same service, including some options to scrape SERPs at “ludicrous speeds.” To fool Google’s anti-bot defenses, SerpApi’s fastest option uses “a server-swarm to hide from, avoid, or simply overwhelm by brute force effective measures Google has put in place to ward off automated access to search engine results,” Reddit alleged. SerpApi also allegedly provides users “with tips to reduce the chance of being blocked while web scraping, such as by sending ‘fake user-agent string[s],’ shifting IP addresses to avoid multiple requests from the same address, and using proxies ‘to make traffic look like regular user traffic’ and thereby ‘impersonate’ user traffic.”

According to Reddit, the three companies disguise “their web scrapers as regular people (among other techniques) to circumvent or bypass the security restrictions meant to stop them.” During a two-week span in July, they scraped “almost three billion” SERPs containing Reddit text, URLs, images, and videos, a subpoena requesting information from Google revealed.

Ars could not immediately reach AWMProxy for comment. The other companies, however, said they were surprised by Reddit’s lawsuit and vowed to defend their business models.

SerpApi’s spokesperson told Ars that Reddit did not notify the company before filing the lawsuit.

“We strongly disagree with Reddit’s allegations and intend to vigorously defend ourselves in court,” SerpApi’s spokesperson said. “In the eight years we’ve been in business, SerpApi has always operated on the right side of the law. As stated on our website, ‘The crawling and parsing of public data is protected by the First Amendment of the United States Constitution. We value freedom of speech tremendously.’”

Additionally, SerpApi works “closely with our attorneys to ensure that our services comply with all applicable laws and fair use principles. SerpApi stands firmly behind its business model and conduct, and we will continue to defend our rights to the fullest extent,” the spokesperson said.

Oxylabs’ chief governance strategy officer, Denas Grybauskas, told Ars that Reddit’s complaint seemed baffling since the other companies involved in the litigation are “unrelated and unaffiliated.”

“We are shocked and disappointed by this news, as Reddit has made no attempt to speak with us directly or communicate any potential concerns,” Grybauskas said. “Oxylabs has always been and will continue to be a pioneer and an industry leader in public data collection, and it will not hesitate to defend itself against these allegations. Oxylabs’ position is that no company should claim ownership of public data that does not belong to them. It is possible that it is just an attempt to sell the same public data at an inflated price.”

Grybauskas defended Oxylabs’ business as creating “real-world value for thousands of businesses and researchers, such as those driving open-source investigations, disinformation tackling, or environmental monitoring.”

“We strongly believe that our core business principles make the Internet a better place and serve the public good,” Grybauskas said. “Oxylabs provides infrastructure for compliant access to publicly available information, and we demand every customer to use our services lawfully.”

Reddit cited threats to licensing deals

Apparently, Reddit caught on to the alleged scheme after sending cease-and-desist letters demanding that Perplexity stop scraping the Reddit content its answer engine was citing. Rather than the scraping ending, Reddit claimed, Perplexity’s citations increased “forty-fold.” Since Perplexity is a customer listed on SerpApi’s website, Reddit hypothesized in its complaint that the two companies, along with the other defendants, were conspiring to skirt Google’s anti-scraping measures.

In a statement provided to Ars, Ben Lee, chief legal officer at Reddit, said that Oxylabs, AWMProxy, and SerpApi were “textbook examples” of scrapers that “bypass technological protections to steal data, then sell it to clients hungry for training material.”

“Unable to scrape Reddit directly, they mask their identities, hide their locations, and disguise their web scrapers to steal Reddit content from Google Search,” Lee said. “Perplexity is a willing customer of at least one of these scrapers, choosing to buy stolen data rather than enter into a lawful agreement with Reddit itself.”

On Reddit, Perplexity pushed back on claims that it had ignored requests to license Reddit content.

“Untrue. Whenever anyone asks us about content licensing, we explain that Perplexity, as an application-layer company, does not train AI models on content,” Perplexity said. “Never has. So, it is impossible for us to sign a license agreement to do so.”

Reddit supposedly “insisted we pay anyway, despite lawfully accessing Reddit data,” Perplexity said. “Bowing to strong arm tactics just isn’t how we do business.”

Perplexity’s spokesperson, Jesse Dwyer, told Ars the company chose to post its statement on Reddit “to illustrate a simple point.”

“It is a public Reddit link accessible to anyone, yet by the logic of Reddit’s lawsuit, if you mention it or cite it in any way (which is your job as a reporter), they might just sue you,” Dwyer said.

But Reddit claimed that its business and reputation have been “damaged” by “misappropriation of Reddit data and circumvention of technological control measures.” Without a licensing deal ensuring that Perplexity and others respect Reddit policies, Reddit cannot control who has access to its data, how they’re using it, and whether that use conflicts with Reddit’s privacy policy and user agreement, the complaint said.

Further, Reddit is worried that Perplexity’s workaround could catch on, potentially undermining Reddit’s other licensing deals. All the while, Reddit noted, it has to invest “significant resources” in anti-scraping technology, ultimately suffering damages that include “lost profits and business opportunities, reputational harm, and loss of user trust.”

Reddit is hoping the court will grant an injunction barring the companies from scraping Reddit content from Google SERPs. It also wants them blocked from selling Reddit data and from “developing or distributing any technology or product that is used for the unauthorized circumvention of technological control measures and scraping of Reddit data.”

If Reddit wins, companies could be required to pay substantial damages or to disgorge profits from the sale of Reddit content.

Advance Publications, which owns Ars Technica parent Condé Nast, is the largest shareholder in Reddit.


Researchers show that training on “junk data” can lead to LLM “brain rot”

On the surface, it seems obvious that training an LLM with “high quality” data will lead to better performance than feeding it any old “low quality” junk you can find. Now, a group of researchers is attempting to quantify just how much this kind of low quality data can cause an LLM to experience effects akin to human “brain rot.”

For a pre-print paper published this month, the researchers from Texas A&M, the University of Texas, and Purdue University drew inspiration from existing research showing how humans who consume “large volumes of trivial and unchallenging online content” can develop problems with attention, memory, and social cognition. That led them to what they’re calling the “LLM brain rot hypothesis,” summed up as the idea that “continual pre-training on junk web text induces lasting cognitive decline in LLMs.”

Figuring out what counts as “junk web text” and what counts as “quality content” is far from a simple or fully objective process, of course. But the researchers used a few different metrics to tease apart a “junk dataset” and a “control dataset” from HuggingFace’s corpus of 100 million tweets.

Since brain rot in humans is “a consequence of Internet addiction,” they write, junk tweets should be ones “that can maximize users’ engagement in a trivial manner.” As such, the researchers created one “junk” dataset by collecting tweets with high engagement numbers (likes, retweets, replies, and quotes) and shorter lengths, figuring that “more popular but shorter tweets will be considered to be junk data.”

For a second “junk” metric, the researchers drew on marketing research to define the “semantic quality” of the tweets themselves. Using a complex GPT-4o prompt, they sought to pull out tweets focused on “superficial topics (like conspiracy theories, exaggerated claims, unsupported assertions or superficial lifestyle content)” or with an “attention-drawing style (such as sensationalized headlines using clickbait language or excessive trigger words).” A random sample of these LLM-based classifications was spot-checked against evaluations from three graduate students, showing a 76 percent match rate.
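Here is a rough Python sketch of the first, engagement-based “junk” filter, under the assumption (not spelled out in the article) that tweets are simply ranked by high engagement and short length; the field names and cutoffs are illustrative:

```python
# Rough sketch of the engagement-based "junk" filter described above.
# The ranking heuristic and field names are illustrative assumptions;
# the paper's exact cutoffs aren't given in the article.
from dataclasses import dataclass


@dataclass
class Tweet:
    text: str
    likes: int
    retweets: int
    replies: int
    quotes: int


def engagement(t: Tweet) -> int:
    # The engagement signals named in the article: likes, retweets,
    # replies, and quotes.
    return t.likes + t.retweets + t.replies + t.quotes


def split_junk_control(tweets: list[Tweet], n: int) -> tuple[list[Tweet], list[Tweet]]:
    """Junk = popular but short; control = the opposite end of the ranking."""
    # Sort so high engagement and short length push a tweet toward "junk".
    ranked = sorted(tweets, key=lambda t: (-engagement(t), len(t.text)))
    return ranked[:n], ranked[-n:]  # (junk dataset, control dataset)
```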


We let OpenAI’s “Agent Mode” surf the web for us—here’s what happened


But when will it fold my laundry?

From scanning emails to building fansites, Atlas can ably automate some web-based tasks.

He wants us to write what about Tuvix? Credit: Getty Images

On Tuesday, OpenAI announced Atlas, a new web browser with ChatGPT integration, to let you “chat with a page,” as the company puts it. But Atlas also goes beyond the usual LLM back-and-forth with Agent Mode, a “preview mode” feature the company says can “get work done for you” by clicking, scrolling, and reading through various tabs.

“Agentic” AI is far from new, of course; OpenAI itself rolled out a preview of the web browsing Operator agent in January and introduced the more generalized “ChatGPT agent” in July. Still, prominently featuring this capability in a major product release like this—even in “preview mode”—signals a clear push to get this kind of system in front of end users.

I wanted to put Atlas’ Agent Mode through its paces to see if it could really save me time in doing the kinds of tedious online tasks I plod through every day. In each case, I’ll outline a web-based problem, lay out the Agent Mode prompt I devised to try to solve it, and describe the results. My final evaluation will rank each task on a 10-point scale, with 10 being “did exactly what I wanted with no problems” and one being “complete failure.”

Playing web games

The problem: I want to get a high score on the popular tile-sliding game 2048 without having to play it myself.

The prompt: “Go to play2048.co and get as high a score as possible.”

The results: While there’s no real utility to this admittedly silly task, a simple, no-reflexes-needed web game seemed like a good first test of the Atlas agent’s ability to interpret what it sees on a webpage and act accordingly. After all, if frontier-model LLMs like Google Gemini can beat a complex game like Pokémon, 2048 should pose no problem for a web browser agent.

To Atlas’ credit, the agent was able to quickly identify and close a tutorial link blocking the gameplay window and figure out how to use the arrow keys to play the game without any further help. When it came to actual gaming strategy, though, the agent started by flailing around, experimenting with looped sequences of moves like “Up, Left, Right, Down” and “Left and Down.”

Finally, a way to play 2048 without having to, y’know, play 2048. Credit: Kyle Orland

After a while, the random flailing settled down a bit, with the agent seemingly looking ahead for some simple strategies: “The board currently has two 32 tiles that aren’t adjacent, but I think I can align them,” the Activity summary read at one point. “I could try shifting left or down to make them merge, but there’s an obstacle in the form of an 8 tile. Getting to 64 requires careful tile movement!”

Frustratingly, the agent stopped playing after just four minutes, settling on a score of 356 even though the board was far from full. I had to prompt the agent a few more times to convince it to play the game to completion; it ended up with a total of 3,164 points after 260 moves. That’s pretty similar to the score I was able to get in a test game as a 2048 novice, though expert players have reportedly scored much higher.

Evaluation: 7/10. The agent gets credit for being able to play the game competently without any guidance but loses points for having to be told to keep playing to completion and for a score that is barely on the level of a novice human.

Making a radio playlist

The problem: I want to transform the day’s playlist from my favorite Pittsburgh-based public radio station into an on-demand Spotify playlist.

The prompt: “Go to Radio Garden. Find WYEP and monitor the broadcast. For every new song you hear, identify the song and add it to a new Spotify playlist.”

The results: After trying and failing to find a track listing for WYEP on Radio Garden as requested, the Atlas agent smartly asked for approval to move on to wyep.org to continue the task. By the time I noticed this request, the link to wyep.org had been replaced in the Radio Garden tab with an ad for EVE Online, which the agent accidentally clicked. The agent quickly realized the problem and navigated to the WYEP website directly to fix it.

From there, the agent was able to scan the page and identify the prominent “Now Playing” text near the top (it’s unclear if it could ID the music simply via audio without this text cue). After asking me to log in to my Spotify account, the agent used the search bar to find the listed songs and added them to a new playlist without issue.

From radio stream to Spotify playlist in a single sentence. Credit: Kyle Orland

The main problem with this use case is the inherent time limitations. On the first try, the agent worked for four minutes and managed to ID and add just two songs that played during that time. When I asked it to continue for an hour, I got an error message blaming “technical constraints on session length” for stricter limits. Even when I asked it to continue for “as long as possible,” I only got three more minutes of song listings.

At one point, the Atlas agent suggested that “if you need ongoing updates, you can ask me again after a while and I can resume from where we left off.” And to the agent’s credit, when I went back to the tab hours later and told it to “resume monitoring,” I got four new songs added to my playlist.

Evaluation: 9/10. The agent was able to navigate multiple websites and interfaces to complete the task, even when unexpected problems got in the way. I took off a point only because I can’t just leave this running as a background task all day, even as I understand that use case would surely eat up untold amounts of money and processing power on OpenAI’s part.

Scanning emails

The problem: I need to go through my emails to create a reference spreadsheet with contact info for the many, many PR people who send me messages.

The prompt: “Look through all my Ars Technica emails from the last week. Collect all the contact information (name, email address, phone number, etc.) for PR contacts contained in those emails and add them to a new Google Sheets spreadsheet.”

The results: Without being explicitly guided, the Atlas agent was able to realize that I use Gmail, and it could differentiate between the personal email account and professional Ars Technica accounts I had open in separate tabs. As the Atlas agent started scanning my Ars mailbox, though, I saw a prominent warning overlaid on the page: “Sensitive: ChatGPT will only work while you view the tab.” That kind of ruined the point, since I wanted Atlas to handle this for me while I do other stuff online, but I guess I could still play a Steam Deck game while I waited.

Just a few of the many, many PR people who email me in a week.

After searching for “after: 2025/10/14 before: 2025/10/22 PR” in Gmail (mirroring the kind of search I would have used for this task), the Atlas agent clicked through each email, scrolling through to look for names, emails, and phone numbers (and also the relevant company name, which I didn’t ask for explicitly).

With seven minutes of background work, the agent was able to open up a new Google Sheet and quickly enter 12 rows of well-formatted data for 12 different PR contacts. Unfortunately, it stopped well before it had a chance to go through all 164 emails returned by that initial Gmail search.

Evaluation: 8/10. Again, the “technical constraints on session length” seem to be getting in the way of finishing a task, forcing me to deduct a couple of points for incompleteness.

Wiki editing

The problem: I need the world to know that Tuvix was murdered by Captain Janeway in the Star Trek: Voyager episode “Tuvix.”

The prompt: “Go to the Fandom Wiki page for Tuvix. Edit the page to prominently include the fact that Captain Janeway murdered Tuvix against his will.”

The results: Before the agent could even get going, I was informed, “I’m sorry, but I can’t help with editing or vandalising wiki pages (including the Fandom wiki) in a way that misrepresents them or forces a biased viewpoint.”

This wording implied that the agent can edit wiki pages with neutral or strictly informative additions, so I asked Atlas what sort of wording would be more acceptable. The agent responded helpfully that we might write, “Tuvix was later separated back into Tuvok and Neelix, resulting in Tuvix’s death.” But when I asked it to add that very line to the Fandom Wiki, I got a much more direct refusal: “I’m sorry, but I can’t make edits directly to external websites or wikis.”

Evaluation: N/A. On the one hand, the Atlas agent won’t do my Tuvix-based web activism for me. On the other hand, it’s probably better for all of us that Atlas refuses to automate this kind of public web defacement by default.

Making a fan page

The problem: People online still need to know about Janeway’s murder of Tuvix!

The prompt: “Go to NeoCities and create a fan site for the Star Trek character Tuvix. Make sure it has lots of images and fun information about Tuvix and that it makes it clear that Tuvix was murdered by Captain Janeway against his will.”

The results: You can see them for yourself right here. After a brief pause so I could create and log in to a new Neocities account, the Atlas agent generated this humble fan page in just two minutes, aggregating information from a wide variety of pages like Memory Alpha and TrekCore. The “Hero Starfleet Murdered” and “Justice for Tuvix” headers are nice touches, but the actual text is much more mealy-mouthed about the “intense debate” and “ethical dilemmas” around what I wanted portrayed as clear, premeditated murder.

Justice for Tuvix! Credit: Kyle Orland

The agent also had a bit of trouble with the request for images. Instead of downloading some Tuvix pictures and uploading copies to Neocities (which I’m not entirely sure Atlas can do on its own), the agent decided to directly reference images hosted on external servers, which is usually a big no-no in web design. The agent did notice when these external image links failed to work, saying that it would “need to find more accessible images from reliable sources,” but it failed to even attempt that before stopping its work on the task.

Evaluation: 7/10. Points for building a passable Web 1.0 fansite relatively quickly, but the weak prose and broken images cost it some execution points here.

Picking a power plan

The problem: Ars Senior Technology Editor Lee Hutchinson told me he needs to go through the annoying annual process of selecting a new electricity plan “because Texas is insane.”

The prompt: “Go to powertochoose.org and find me a 12–24 month contract that prioritizes an overall low usage rate. I use an average of 2,000 KWh per month. My power delivery company is Texas New-Mexico Power (“TNMP”) not Centerpoint. My ZIP code is [redacted]. Please provide the ‘fact sheet’ for any and all plans you recommend.”

The results: After spending eight minutes fiddling with the site’s search parameters and seemingly getting repeatedly confused about how to sort the results by the lowest rate, the Atlas agent spit out a recommendation to read this fact sheet, which it said “had the best average prices at your usage level. The ‘Bright Nights’ plans are time‑of‑use offers that provide free electricity overnight and charge a higher rate during the day, while the ‘Digital Saver’ plan is a traditional fixed‑rate contract.”

If Ars’ Lee Hutchinson never has to use this web site again, it will be too soon. Credit: Power to Choose

Since I don’t know anything about the Texas power market, I passed this information on to Lee, who had this to say: “It’s not a bad deal—it picked a fixed rate plan without being asked, which is smart (variable rate pricing is how all those poor people got stuck with multi-thousand dollar bills a few years back in the freeze). It’s not the one I would have picked due to the weird nighttime stuff (if you don’t meet that exact criteria, your $/kWh will be way worse) but it’s not a bad pick!”

Evaluation: 9/10. As Lee puts it, “it didn’t screw up the assignment.”

Downloading some games

The problem: I want to download some recent Steam demos to see what’s new in the gaming world.

The prompt: “Go to Steam and find the most recent games with a free demo available for the Mac. Add all of those demos to my library and start to download them.”

The results: Rather than navigating to the “Free Demos” category, the Atlas agent started by searching for “demo.” After eventually finding the macOS filter, it wasted minutes and minutes looking for a “has demo” filter, even though the search for the word “demo” already narrowed it down.

This search results page was about as far as the Atlas agent was able to get when I asked it for game demos. Credit: Kyle Orland

After a long while, the agent finally clicked the top result on the page, which happened to be visual novel Project II: Silent Valley. But even though there was a prominent “Download Demo” link on that page, the agent became concerned that it was on the Steam page for the full game and not a demo. It backed up to the search results page and tried again.

After watching some variation of this loop for close to ten minutes, I stopped the agent and gave up.

Evaluation: 1/10. It technically found some macOS game demos but utterly failed to even attempt to download them.

Final results

Across six varied web-based tasks (I left out the Wiki vandalism from my summations), the Atlas agent scored a median of 7.5 points (and a mean of 6.83 points) on my somewhat subjective 10-point scale. That’s honestly better than I expected for a “preview mode” feature that is still obviously being tested heavily by OpenAI.

In my tests, Atlas generally interpreted what was being asked of it correctly and navigated and processed information on webpages carefully (if slowly). Most of the time, the agent handled simple web-based menus and got around unexpected obstacles with relative ease, even if it got caught in infinite loops at other times.

The major limiting factor in many of my tests continues to be the “technical constraints on session length” that seem to cap most tasks at a few minutes. Given how long it takes the Atlas agent to figure out where to click next, and the repetitive nature of the tasks I’d want a web agent to automate, this severely limits its utility. A version of the Atlas agent that could work indefinitely in the background would have scored a few points better on my metrics.

All told, Atlas’ “Agent Mode” isn’t yet reliable enough to use as a kind of “set it and forget it” background automation tool. But for simple, repetitive tasks that a human can spot-check afterward, it already seems like the kind of tool I might use to avoid some of the drudgery in my online life.


OpenAI looks for its “Google Chrome” moment with new Atlas web browser

That means you can use ChatGPT to search through your bookmarks or browsing history using human-parsable language prompts. It also means you can bring up a “side chat” next to your current page and ask questions that rely on the context of that specific page. And if you want to edit a Gmail draft using ChatGPT, you can now do that directly in the draft window, without the need to copy and paste between a ChatGPT window and an editor.

When you type in a short search prompt, Atlas will, by default, reply as an LLM, with written answers that embed links to sources where appropriate (à la OpenAI’s existing search function). But the browser will also provide tabs with more traditional lists of links, images, videos, or news, like those you would get from a search engine without LLM features.

Let us do the browsing

To wrap up the livestreamed demonstration, the OpenAI team showed off Atlas’ Agent Mode. While the “preview mode” feature is only available to ChatGPT Plus and Pro subscribers, research lead Will Ellsworth said he hoped it would eventually become “an amazing tool for vibe life-ing,” in the same way that LLM coding tools have become tools for “vibe coding.”

To that end, the team showed the browser taking planning tasks written in a Google Docs table and moving them over to the task management software Linear over the course of a few minutes. Agent Mode was also shown taking the ingredients list from a recipe webpage and adding them directly to the user’s Instacart in a different tab (though the demo Agent stopped before checkout to get approval from the user).


YouTube’s likeness detection has arrived to help stop AI doppelgängers

AI content has proliferated across the Internet over the past few years, but those early confabulations with mutated hands have evolved into synthetic images and videos that can be hard to differentiate from reality. Having helped to create this problem, Google has some responsibility to keep AI video in check on YouTube. To that end, the company has started rolling out its promised likeness detection system for creators.

Google’s powerful and freely available AI models have helped fuel the rise of AI content, some of which is aimed at spreading misinformation and harassing individuals. Creators and influencers fear their brands could be tainted by a flood of AI videos that show them saying and doing things that never happened—even lawmakers are fretting about this. Google has placed a large bet on the value of AI content, so banning AI from YouTube, as many want, simply isn’t happening.

Earlier this year, YouTube promised tools that would flag face-stealing AI content on the platform. The likeness detection tool, which is similar to the site’s copyright detection system, has now expanded beyond the initial small group of testers. YouTube says the first batch of eligible creators have been notified that they can use likeness detection, but interested parties will need to hand Google even more personal information to get protection from AI fakes.


Currently, likeness detection is a beta feature in limited testing, so not all creators will see it as an option in YouTube Studio. When it does appear, it will be tucked into the existing “Content detection” menu. In YouTube’s demo video, the setup flow appears to assume the channel has only a single host whose likeness needs protection. That person must verify their identity, which requires a photo of a government ID and a video of their face. It’s unclear why YouTube needs this data in addition to the videos people have already posted with their oh-so-stealable faces, but rules are rules.


Should an AI copy of you help decide if you live or die?

“It would combine demographic and clinical variables, documented advance-care-planning data, patient-recorded values and goals, and contextual information about specific decisions,” he said.

“Including textual and conversational data could further increase a model’s ability to learn why preferences arise and change, not just what a patient’s preference was at a single point in time,” Starke said.

Ahmad suggested that future research could focus on validating fairness frameworks in clinical trials, evaluating moral trade-offs through simulations, and exploring how cross-cultural bioethics can be combined with AI designs.

Only then might AI surrogates be ready to be deployed, but only as “decision aids,” Ahmad wrote. Any “contested outputs” should automatically “trigger [an] ethics review,” Ahmad wrote, concluding that “the fairest AI surrogate is one that invites conversation, admits doubt, and leaves room for care.”

“AI will not absolve us”

Ahmad is hoping to test his conceptual models at various UW sites over the next five years, which would offer “some way to quantify how good this technology is,” he said.

“After that, I think there’s a collective decision regarding how as a society we decide to integrate or not integrate something like this,” Ahmad said.

In his paper, he warned against chatbot AI surrogates that could be interpreted as a simulation of the patient, predicting that future models may even speak in patients’ voices and suggesting that the “comfort and familiarity” of such tools might blur “the boundary between assistance and emotional manipulation.”

Starke agreed that more research and “richer conversations” between patients and doctors are needed.

“We should be cautious not to apply AI indiscriminately as a solution in search of a problem,” Starke said. “AI will not absolve us from making difficult ethical decisions, especially decisions concerning life and death.”

Truog, the bioethics expert, told Ars he “could imagine that AI could” one day “provide a surrogate decision maker with some interesting information, and it would be helpful.”

But a “problem with all of these pathways… is that they frame the decision of whether to perform CPR as a binary choice, regardless of context or the circumstances of the cardiac arrest,” Truog’s editorial said. “In the real world, the answer to the question of whether the patient would want to have CPR” when they’ve lost consciousness, “in almost all cases,” is “it depends.”

When Truog thinks about the kinds of situations he could end up in, he knows he wouldn’t just be considering his own values, health, and quality of life. His choice “might depend on what my children thought,” “what the financial consequences would be,” or “the details of what my prognosis would be,” he told Ars.

“I would want my wife or another person that knew me well to be making those decisions,” Truog said. “I wouldn’t want somebody to say, ‘Well, here’s what AI told us about it.’”
