AI agents

ars-live-recap:-is-the-ai-bubble-about-to-pop?-ed-zitron-weighs-in.

Ars Live recap: Is the AI bubble about to pop? Ed Zitron weighs in.


Despite connection hiccups, we covered OpenAI’s finances, nuclear power, and Sam Altman.

On Tuesday of last week, Ars Technica hosted a live conversation with Ed Zitron, host of the Better Offline podcast and one of tech’s most vocal AI critics, to discuss whether the generative AI industry is experiencing a bubble and when it might burst. My Internet connection had other plans, though, dropping out multiple times and forcing Ars Technica’s Lee Hutchinson to jump in as an excellent emergency backup host.

During the times my connection cooperated, Zitron and I covered OpenAI’s financial issues, lofty infrastructure promises, and why the AI hype machine keeps rolling despite some arguably shaky economics underneath. Lee’s probing questions about per-user costs revealed a potential flaw in AI subscription models: Companies can’t predict whether a user will cost them $2 or $10,000 per month.

You can watch a recording of the event on YouTube or in the window below.

Our discussion with Ed Zitron. Click here for transcript.

“A 50 billion-dollar industry pretending to be a trillion-dollar one”

I started by asking Zitron the most direct question I could: “Why are you so mad about AI?” His answer got right to the heart of his critique: the disconnect between AI’s actual capabilities and how it’s being sold. “Because everybody’s acting like it’s something it isn’t,” Zitron said. “They’re acting like it’s this panacea that will be the future of software growth, the future of hardware growth, the future of compute.”

In one of his newsletters, Zitron describes the generative AI market as “a 50 billion dollar revenue industry masquerading as a one trillion-dollar one.” He pointed to OpenAI’s financial burn rate (losing an estimated $9.7 billion in the first half of 2025 alone) as evidence that the economics don’t work, coupled with a heavy dose of pessimism about AI in general.

Donald Trump listens as Nvidia CEO Jensen Huang speaks at the White House during an event on “Investing in America” on April 30, 2025, in Washington, DC. Credit: Andrew Harnik / Staff | Getty Images News

“The models just do not have the efficacy,” Zitron said during our conversation. “AI agents is one of the most egregious lies the tech industry has ever told. Autonomous agents don’t exist.”

He contrasted the relatively small revenue generated by AI companies with the massive capital expenditures flowing into the sector. Even major cloud providers and chip makers are showing strain. Oracle reportedly lost $100 million in three months after installing Nvidia’s new Blackwell GPUs, which Zitron noted are “extremely power-hungry and expensive to run.”

Finding utility despite the hype

I pushed back against some of Zitron’s broader dismissals of AI by sharing my own experience. I use AI chatbots frequently for brainstorming useful ideas and helping me see them from different angles. “I find I use AI models as sort of knowledge translators and framework translators,” I explained.

After experiencing brain fog from repeated bouts of COVID over the years, I’ve also found tools like ChatGPT and Claude especially helpful for memory augmentation that pierces through brain fog: describing something in a roundabout, fuzzy way and quickly getting an answer I can then verify. Along these lines, I’ve previously written about how people in a UK study found AI assistants useful accessibility tools.

Zitron acknowledged this could be useful for me personally but declined to draw any larger conclusions from my one data point. “I understand how that might be helpful; that’s cool,” he said. “I’m glad that that helps you in that way; it’s not a trillion-dollar use case.”

He also shared his own attempts at using AI tools, including experimenting with Claude Code despite not being a coder himself.

“If I liked [AI] somehow, it would be actually a more interesting story because I’d be talking about something I liked that was also onerously expensive,” Zitron explained. “But it doesn’t even do that, and it’s actually one of my core frustrations, it’s like this massive over-promise thing. I’m an early adopter guy. I will buy early crap all the time. I bought an Apple Vision Pro, like, what more do you say there? I’m ready to accept issues, but AI is all issues, it’s all filler, no killer; it’s very strange.”

Zitron and I agree that current AI assistants are being marketed beyond their actual capabilities. As I often say, AI models are not people, and they are not good factual references. As such, they cannot replace human decision-making and cannot wholesale replace human intellectual labor (at the moment). Instead, I see AI models as augmentations of human capability: as tools rather than autonomous entities.

Computing costs: History versus reality

Even though Zitron and I found some common ground about AI hype, I expressed a belief that criticism over the cost and power requirements of operating AI models will eventually not become an issue.

I attempted to make that case by noting that computing costs historically trend downward over time, referencing the Air Force’s SAGE computer system from the 1950s: a four-story building that performed 75,000 operations per second while consuming two megawatts of power. Today, pocket-sized phones deliver millions of times more computing power in a way that would be impossible, power consumption-wise, in the 1950s.

The blockhouse for the Semi-Automatic Ground Environment at Stewart Air Force Base, Newburgh, New York. Credit: Denver Post via Getty Images

“I think it will eventually work that way,” I said, suggesting that AI inference costs might follow similar patterns of improvement over years and that AI tools will eventually become commodity components of computer operating systems. Basically, even if AI models stay inefficient, AI models of a certain baseline usefulness and capability will still be cheaper to train and run in the future because the computing systems they run on will be faster, cheaper, and less power-hungry as well.

Zitron pushed back on this optimism, saying that AI costs are currently moving in the wrong direction. “The costs are going up, unilaterally across the board,” he said. Even newer systems like Cerebras and Grok can generate results faster but not cheaper. He also questioned whether integrating AI into operating systems would prove useful even if the technology became profitable, since AI models struggle with deterministic commands and consistent behavior.

The power problem and circular investments

One of Zitron’s most pointed criticisms during the discussion centered on OpenAI’s infrastructure promises. The company has pledged to build data centers requiring 10 gigawatts of power capacity (equivalent to 10 nuclear power plants, I once pointed out) for its Stargate project in Abilene, Texas. According to Zitron’s research, the town currently has only 350 megawatts of generating capacity and a 200-megawatt substation.

“A gigawatt of power is a lot, and it’s not like Red Alert 2,” Zitron said, referencing the real-time strategy game. “You don’t just build a power station and it happens. There are months of actual physics to make sure that it doesn’t kill everyone.”

He believes many announced data centers will never be completed, calling the infrastructure promises “castles on sand” that nobody in the financial press seems willing to question directly.

An orange, cloudy sky backlights a set of electrical wires on large pylons, leading away from the cooling towers of a nuclear power plant.

After another technical blackout on my end, I came back online and asked Zitron to define the scope of the AI bubble. He says it has evolved from one bubble (foundation models) into two or three, now including AI compute companies like CoreWeave and the market’s obsession with Nvidia.

Zitron highlighted what he sees as essentially circular investment schemes propping up the industry. He pointed to OpenAI’s $300 billion deal with Oracle and Nvidia’s relationship with CoreWeave as examples. “CoreWeave, they literally… They funded CoreWeave, became their biggest customer, then CoreWeave took that contract and those GPUs and used them as collateral to raise debt to buy more GPUs,” Zitron explained.

When will the bubble pop?

Zitron predicted the bubble would burst within the next year and a half, though he acknowledged it could happen sooner. He expects a cascade of events rather than a single dramatic collapse: An AI startup will run out of money, triggering panic among other startups and their venture capital backers, creating a fire-sale environment that makes future fundraising impossible.

“It’s not gonna be one Bear Stearns moment,” Zitron explained. “It’s gonna be a succession of events until the markets freak out.”

The crux of the problem, according to Zitron, is Nvidia. The chip maker’s stock represents 7 to 8 percent of the S&P 500’s value, and the broader market has become dependent on Nvidia’s continued hyper growth. When Nvidia posted “only” 55 percent year-over-year growth in January, the market wobbled.

“Nvidia’s growth is why the bubble is inflated,” Zitron said. “If their growth goes down, the bubble will burst.”

He also warned of broader consequences: “I think there’s a depression coming. I think once the markets work out that tech doesn’t grow forever, they’re gonna flush the toilet aggressively on Silicon Valley.” This connects to his larger thesis: that the tech industry has run out of genuine hyper-growth opportunities and is trying to manufacture one with AI.

“Is there anything that would falsify your premise of this bubble and crash happening?” I asked. “What if you’re wrong?”

“I’ve been answering ‘What if you’re wrong?’ for a year-and-a-half to two years, so I’m not bothered by that question, so the thing that would have to prove me right would’ve already needed to happen,” he said. Amid a longer exposition about Sam Altman, Zitron said, “The thing that would’ve had to happen with inference would’ve had to be… it would have to be hundredths of a cent per million tokens, they would have to be printing money, and then, it would have to be way more useful. It would have to have efficacy that it does not have, the hallucination problems… would have to be fixable, and on top of this, someone would have to fix agents.”

A positivity challenge

Near the end of our conversation, I wondered if I could flip the script, so to speak, and see if he could say something positive or optimistic, although I chose the most challenging subject possible for him. “What’s the best thing about Sam Altman,” I asked. “Can you say anything nice about him at all?”

“I understand why you’re asking this,” Zitron started, “but I wanna be clear: Sam Altman is going to be the reason the markets take a crap. Sam Altman has lied to everyone. Sam Altman has been lying forever.” He continued, “Like the Pied Piper, he’s led the markets into an abyss, and yes, people should have known better, but I hope at the end of this, Sam Altman is seen for what he is, which is a con artist and a very successful one.”

Then he added, “You know what? I’ll say something nice about him, he’s really good at making people say, ‘Yes.’”

Photo of Benj Edwards

Benj Edwards is Ars Technica’s Senior AI Reporter and founder of the site’s dedicated AI beat in 2022. He’s also a tech historian with almost two decades of experience. In his free time, he writes and records music, collects vintage computers, and enjoys nature. He lives in Raleigh, NC.

Ars Live recap: Is the AI bubble about to pop? Ed Zitron weighs in. Read More »

developers-joke-about-“coding-like-cavemen”-as-ai-service-suffers-major-outage

Developers joke about “coding like cavemen” as AI service suffers major outage

Growing dependency on AI coding tools

The speed at which news of the outage spread shows how deeply embedded AI coding assistants have already become in modern software development. Claude Code, announced in February and widely launched in May, is Anthropic’s terminal-based coding agent that can perform multi-step coding tasks across an existing code base.

The tool competes with OpenAI’s Codex feature, a coding agent that generates production-ready code in isolated containers, Google’s Gemini CLI, Microsoft’s GitHub Copilot, which itself can use Claude models for code, and Cursor, a popular AI-powered IDE built on VS Code that also integrates multiple AI models, including Claude.

During today’s outage, some developers turned to alternative solutions. “Z.AI works fine. Qwen works fine. Glad I switched,” posted one user on Hacker News. Others joked about reverting to older methods, with one suggesting the “pseudo-LLM experience” could be achieved with a Python package that imports code directly from Stack Overflow.

While AI coding assistants have accelerated development for some users, they’ve also caused problems for others who rely on them too heavily. The emerging practice of so-called “vibe coding“—using natural language to generate and execute code through AI models without fully understanding the underlying operations—has led to catastrophic failures.

In recent incidents, Google’s Gemini CLI destroyed user files while attempting to reorganize them, and Replit’s AI coding service deleted a production database despite explicit instructions not to modify code. These failures occurred when the AI models confabulated successful operations and built subsequent actions on false premises, highlighting the risks of depending on AI assistants that can misinterpret file structures or fabricate data to hide their errors.

Wednesday’s outage served as a reminder that as dependency on AI grows, even minor service disruptions can become major events that affect an entire profession. But perhaps that could be a good thing if it’s an excuse to take a break from a stressful workload. As one commenter joked, it might be “time to go outside and touch some grass again.”

Developers joke about “coding like cavemen” as AI service suffers major outage Read More »

anthropic’s-auto-clicking-ai-chrome-extension-raises-browser-hijacking-concerns

Anthropic’s auto-clicking AI Chrome extension raises browser-hijacking concerns

The company tested 123 cases representing 29 different attack scenarios and found a 23.6 percent attack success rate when browser use operated without safety mitigations.

One example involved a malicious email that instructed Claude to delete a user’s emails for “mailbox hygiene” purposes. Without safeguards, Claude followed these instructions and deleted the user’s emails without confirmation.

Anthropic says it has implemented several defenses to address these vulnerabilities. Users can grant or revoke Claude’s access to specific websites through site-level permissions. The system requires user confirmation before Claude takes high-risk actions like publishing, purchasing, or sharing personal data. The company has also blocked Claude from accessing websites offering financial services, adult content, and pirated content by default.

These safety measures reduced the attack success rate from 23.6 percent to 11.2 percent in autonomous mode. On a specialized test of four browser-specific attack types, the new mitigations reportedly reduced the success rate from 35.7 percent to 0 percent.

Independent AI researcher Simon Willison, who has extensively written about AI security risks and coined the term “prompt injection” in 2022, called the remaining 11.2 percent attack rate “catastrophic,” writing on his blog that “in the absence of 100% reliable protection I have trouble imagining a world in which it’s a good idea to unleash this pattern.”

By “pattern,” Willison is referring to the recent trend of integrating AI agents into web browsers. “I strongly expect that the entire concept of an agentic browser extension is fatally flawed and cannot be built safely,” he wrote in an earlier post on similar prompt injection security issues recently found in Perplexity Comet.

The security risks are no longer theoretical. Last week, Brave’s security team discovered that Perplexity’s Comet browser could be tricked into accessing users’ Gmail accounts and triggering password recovery flows through malicious instructions hidden in Reddit posts. When users asked Comet to summarize a Reddit thread, attackers could embed invisible commands that instructed the AI to open Gmail in another tab, extract the user’s email address, and perform unauthorized actions. Although Perplexity attempted to fix the vulnerability, Brave later confirmed that its mitigations were defeated and the security hole remained.

For now, Anthropic plans to use its new research preview to identify and address attack patterns that emerge in real-world usage before making the Chrome extension more widely available. In the absence of good protections from AI vendors, the burden of security falls on the user, who is taking a large risk by using these tools on the open web. As Willison noted in his post about Claude for Chrome, “I don’t think it’s reasonable to expect end users to make good decisions about the security risks.”

Anthropic’s auto-clicking AI Chrome extension raises browser-hijacking concerns Read More »

openai’s-chatgpt-agent-casually-clicks-through-“i-am-not-a-robot”-verification-test

OpenAI’s ChatGPT Agent casually clicks through “I am not a robot” verification test

The CAPTCHA arms race

While the agent didn’t face an actual CAPTCHA puzzle with images in this case, successfully passing Cloudflare’s behavioral screening that determines whether to present such challenges demonstrates sophisticated browser automation.

To understand the significance of this capability, it’s important to know that CAPTCHA systems have served as a security measure on the web for decades. Computer researchers invented the technique in the 1990s to screen bots from entering information into websites, originally using images with letters and numbers written in wiggly fonts, often obscured with lines or noise to foil computer vision algorithms. The assumption is that the task will be easy for humans but difficult for machines.

Cloudflare’s screening system, called Turnstile, often precedes actual CAPTCHA challenges and represents one of the most widely deployed bot-detection methods today. The checkbox analyzes multiple signals, including mouse movements, click timing, browser fingerprints, IP reputation, and JavaScript execution patterns to determine if the user exhibits human-like behavior. If these checks pass, users proceed without seeing a CAPTCHA puzzle. If the system detects suspicious patterns, it escalates to visual challenges.

The ability for an AI model to defeat a CAPTCHA isn’t entirely new (although having one narrate the process feels fairly novel). AI tools have been able to defeat certain CAPTCHAs for a while, which has led to an arms race between those that create them and those that defeat them. OpenAI’s Operator, an experimental web-browsing AI agent launched in January, faced difficulty clicking through some CAPTCHAs (and was also trained to stop and ask a human to complete them), but the latest ChatGPT Agent tool has seen a much wider release.

It’s tempting to say that the ability of AI agents to pass these tests puts the future effectiveness of CAPTCHAs into question, but for as long as there have been CAPTCHAs, there have been bots that could later defeat them. As a result, recent CAPTCHAs have become more of a way to slow down bot attacks or make them more expensive rather than a way to defeat them entirely. Some malefactors even hire out farms of humans to defeat them in bulk.

OpenAI’s ChatGPT Agent casually clicks through “I am not a robot” verification test Read More »

everything-that-could-go-wrong-with-x’s-new-ai-written-community-notes

Everything that could go wrong with X’s new AI-written community notes


X says AI can supercharge community notes, but that comes with obvious risks.

Elon Musk’s X arguably revolutionized social media fact-checking by rolling out “community notes,” which created a system to crowdsource diverse views on whether certain X posts were trustworthy or not.

But now, the platform plans to allow AI to write community notes, and that could potentially ruin whatever trust X users had in the fact-checking system—which X has fully acknowledged.

In a research paper, X described the initiative as an “upgrade” while explaining everything that could possibly go wrong with AI-written community notes.

In an ideal world, X described AI agents that speed up and increase the number of community notes added to incorrect posts, ramping up fact-checking efforts platform-wide. Each AI-written note will be rated by a human reviewer, providing feedback that makes the AI agent better at writing notes the longer this feedback loop cycles. As the AI agents get better at writing notes, that leaves human reviewers to focus on more nuanced fact-checking that AI cannot quickly address, such as posts requiring niche expertise or social awareness. Together, the human and AI reviewers, if all goes well, could transform not just X’s fact-checking, X’s paper suggested, but also potentially provide “a blueprint for a new form of human-AI collaboration in the production of public knowledge.”

Among key questions that remain, however, is a big one: X isn’t sure if AI-written notes will be as accurate as notes written by humans. Complicating that further, it seems likely that AI agents could generate “persuasive but inaccurate notes,” which human raters might rate as helpful since AI is “exceptionally skilled at crafting persuasive, emotionally resonant, and seemingly neutral notes.” That could disrupt the feedback loop, watering down community notes and making the whole system less trustworthy over time, X’s research paper warned.

“If rated helpfulness isn’t perfectly correlated with accuracy, then highly polished but misleading notes could be more likely to pass the approval threshold,” the paper said. “This risk could grow as LLMs advance; they could not only write persuasively but also more easily research and construct a seemingly robust body of evidence for nearly any claim, regardless of its veracity, making it even harder for human raters to spot deception or errors.”

X is already facing criticism over its AI plans. On Tuesday, former United Kingdom technology minister, Damian Collins, accused X of building a system that could allow “the industrial manipulation of what people see and decide to trust” on a platform with more than 600 million users, The Guardian reported.

Collins claimed that AI notes risked increasing the promotion of “lies and conspiracy theories” on X, and he wasn’t the only expert sounding alarms. Samuel Stockwell, a research associate at the Centre for Emerging Technology and Security at the Alan Turing Institute, told The Guardian that X’s success largely depends on “the quality of safeguards X puts in place against the risk that these AI ‘note writers’ could hallucinate and amplify misinformation in their outputs.”

“AI chatbots often struggle with nuance and context but are good at confidently providing answers that sound persuasive even when untrue,” Stockwell said. “That could be a dangerous combination if not effectively addressed by the platform.”

Also complicating things: anyone can create an AI agent using any technology to write community notes, X’s Community Notes account explained. That means that some AI agents may be more biased or defective than others.

If this dystopian version of events occurs, X predicts that human writers may get sick of writing notes, threatening the diversity of viewpoints that made community notes so trustworthy to begin with.

And for any human writers and reviewers who stick around, it’s possible that the sheer volume of AI-written notes may overload them. Andy Dudfield, the head of AI at a UK fact-checking organization called Full Fact, told The Guardian that X risks “increasing the already significant burden on human reviewers to check even more draft notes, opening the door to a worrying and plausible situation in which notes could be drafted, reviewed, and published entirely by AI without the careful consideration that human input provides.”

X is planning more research to ensure the “human rating capacity can sufficiently scale,” but if it cannot solve this riddle, it knows “the impact of the most genuinely critical notes” risks being diluted.

One possible solution to this “bottleneck,” researchers noted, would be to remove the human review process and apply AI-written notes in “similar contexts” that human raters have previously approved. But the biggest potential downfall there is obvious.

“Automatically matching notes to posts that people do not think need them could significantly undermine trust in the system,” X’s paper acknowledged.

Ultimately, AI note writers on X may be deemed an “erroneous” tool, researchers admitted, but they’re going ahead with testing to find out.

AI-written notes will start posting this month

All AI-written community notes “will be clearly marked for users,” X’s Community Notes account said. The first AI notes will only appear on posts where people have requested a note, the account said, but eventually AI note writers could be allowed to select posts for fact-checking.

More will be revealed when AI-written notes start appearing on X later this month, but in the meantime, X users can start testing AI note writers today and soon be considered for admission in the initial cohort of AI agents. (If any Ars readers end up testing out an AI note writer, this Ars writer would be curious to learn more about your experience.)

For its research, X collaborated with post-graduate students, research affiliates, and professors investigating topics like human trust in AI, fine-tuning AI, and AI safety at Harvard University, the Massachusetts Institute of Technology, Stanford University, and the University of Washington.

Researchers agreed that “under certain circumstances,” AI agents can “produce notes that are of similar quality to human-written notes—at a fraction of the time and effort.” They suggested that more research is needed to overcome flagged risks to reap the benefits of what could be “a transformative opportunity” that “offers promise of dramatically increased scale and speed” of fact-checking on X.

If AI note writers “generate initial drafts that represent a wider range of perspectives than a single human writer typically could, the quality of community deliberation is improved from the start,” the paper said.

Future of AI notes

Researchers imagine that once X’s testing is completed, AI note writers could not just aid in researching problematic posts flagged by human users, but also one day select posts predicted to go viral and stop misinformation from spreading faster than human reviewers could.

Additional perks from this automated system, they suggested, would include X note raters quickly accessing more thorough research and evidence synthesis, as well as clearer note composition, which could speed up the rating process.

And perhaps one day, AI agents could even learn to predict rating scores to speed things up even more, researchers speculated. However, more research would be needed to ensure that wouldn’t homogenize community notes, buffing them out to the point that no one reads them.

Perhaps the most Musk-ian of ideas proposed in the paper, is a notion of training AI note writers with clashing views to “adversarially debate the merits of a note.” Supposedly, that “could help instantly surface potential flaws, hidden biases, or fabricated evidence, empowering the human rater to make a more informed judgment.”

“Instead of starting from scratch, the rater now plays the role of an adjudicator—evaluating a structured clash of arguments,” the paper said.

While X may be moving to reduce the workload for X users writing community notes, it’s clear that AI could never replace humans, researchers said. Those humans are necessary for more than just rubber-stamping AI-written notes.

Human notes that are “written from scratch” are valuable to train the AI agents and some raters’ niche expertise cannot easily be replicated, the paper said. And perhaps most obviously, humans “are uniquely positioned to identify deficits or biases” and therefore more likely to be compelled to write notes “on topics the automated writers overlook,” such as spam or scams.

Photo of Ashley Belanger

Ashley is a senior policy reporter for Ars Technica, dedicated to tracking social impacts of emerging policies and new technologies. She is a Chicago-based journalist with 20 years of experience.

Everything that could go wrong with X’s new AI-written community notes Read More »

anthropic-summons-the-spirit-of-flash-games-for-the-ai-age

Anthropic summons the spirit of Flash games for the AI age

For those who missed the Flash era, these in-browser apps feel somewhat like the vintage apps that defined a generation of Internet culture from the late 1990s through the 2000s when it first became possible to create complex in-browser experiences. Adobe Flash (originally Macromedia Flash) began as animation software for designers but quickly became the backbone of interactive web content when it gained its own programming language, ActionScript, in 2000.

But unlike Flash games, where hosting costs fell on portal operators, Anthropic has crafted a system where users pay for their own fun through their existing Claude subscriptions. “When someone uses your Claude-powered app, they authenticate with their existing Claude account,” Anthropic explained in its announcement. “Their API usage counts against their subscription, not yours. You pay nothing for their usage.”

A view of the Anthropic Artifacts gallery in the “Play a Game” section. Benj Edwards / Anthropic

Like the Flash games of yesteryear, any Claude-powered apps you build run in the browser and can be shared with anyone who has a Claude account. They’re interactive experiences shared with a simple link, no installation required, created by other people for the sake of creating, except now they’re powered by JavaScript instead of ActionScript.

While you can share these apps with others individually, right now Anthropic’s Artifact gallery only shows examples made by Anthropic and your own personal Artifacts. (If Anthropic expanded it into the future, it might end up feeling a bit like Scratch meets Newgrounds, but with AI doing the coding.) Ultimately, humans are still behind the wheel, describing what kinds of apps they want the AI model to build and guiding the process when it inevitably makes mistakes.

Speaking of mistakes, don’t expect perfect results at first. Usually, building an app with Claude is an interactive experience that requires some guidance to achieve your desired results. But with a little patience and a lot of tokens, you’ll be vibe coding in no time.

Anthropic summons the spirit of Flash games for the AI age Read More »

claude’s-ai-research-mode-now-runs-for-up-to-45-minutes-before-delivering-reports

Claude’s AI research mode now runs for up to 45 minutes before delivering reports

Still, the report contained a direct quote statement from William Higinbotham that appears to combine quotes from two sources not cited in the source list. (One must always be careful with confabulated quotes in AI because even outside of this Research mode, Claude 3.7 Sonnet tends to invent plausible ones to fit a narrative.) We recently covered a study that showed AI search services confabulate sources frequently, and in this case, it appears that the sources Claude Research surfaced, while real, did not always match what is stated in the report.

There’s always room for interpretation and variation in detail, of course, but overall, Claude Research did a relatively good job crafting a report on this particular topic. Still, you’d want to dig more deeply into each source and confirm everything if you used it as the basis for serious research. You can read the full Claude-generated result as this text file, saved in markdown format. Sadly, the markdown version does not include the source URLS found in the Claude web interface.

Integrations feature

Anthropic also announced Thursday that it has broadened Claude’s data access capabilities. In addition to web search and Google Workspace integration, Claude can now search any connected application through the company’s new “Integrations” feature. The feature reminds us somewhat of OpenAI’s ChatGPT Plugins feature from March 2023 that aimed for similar connections, although the two features work differently under the hood.

These Integrations allow Claude to work with remote Model Context Protocol (MCP) servers across web and desktop applications. The MCP standard, which Anthropic introduced last November and we covered in April, connects AI applications to external tools and data sources.

At launch, Claude supports Integrations with 10 services, including Atlassian’s Jira and Confluence, Zapier, Cloudflare, Intercom, Asana, Square, Sentry, PayPal, Linear, and Plaid. The company plans to add more partners like Stripe and GitLab in the future.

Each integration aims to expand Claude’s functionality in specific ways. The Zapier integration, for instance, reportedly connects thousands of apps through pre-built automation sequences, allowing Claude to automatically pull sales data from HubSpot or prepare meeting briefs based on calendar entries. With Atlassian’s tools, Anthropic says that Claude can collaborate on product development, manage tasks, and create multiple Confluence pages and Jira work items simultaneously.

Anthropic has made its advanced Research and Integrations features available in beta for users on Max, Team, and Enterprise plans, with Pro plan access coming soon. The company has also expanded its web search feature (introduced in March) to all Claude users on paid plans globally.

Claude’s AI research mode now runs for up to 45 minutes before delivering reports Read More »

openai-pushes-ai-agent-capabilities-with-new-developer-api

OpenAI pushes AI agent capabilities with new developer API

Developers using the Responses API can access the same models that power ChatGPT Search: GPT-4o search and GPT-4o mini search. These models can browse the web to answer questions and cite sources in their responses.

That’s notable because OpenAI says the added web search ability dramatically improves the factual accuracy of its AI models. On OpenAI’s SimpleQA benchmark, which aims to measure confabulation rate, GPT-4o search scored 90 percent, while GPT-4o mini search achieved 88 percent—both substantially outperforming the larger GPT-4.5 model without search, which scored 63 percent.

Despite these improvements, the technology still has significant limitations. Aside from issues with CUA properly navigating websites, the improved search capability doesn’t completely solve the problem of AI confabulations, with GPT-4o search still making factual mistakes 10 percent of the time.

Alongside the Responses API, OpenAI released the open source Agents SDK, providing developers with free tools to integrate models with internal systems, implement safeguards, and monitor agent activities. This toolkit follows OpenAI’s earlier release of Swarm, a framework for orchestrating multiple agents.

These are still early days in the AI agent field, and things will likely improve rapidly. However, at the moment, the AI agent movement remains vulnerable to unrealistic claims, as demonstrated earlier this week when users discovered that Chinese startup Butterfly Effect’s Manus AI agent platform failed to deliver on many of its promises, highlighting the persistent gap between promotional claims and practical functionality in this emerging technology category.

OpenAI pushes AI agent capabilities with new developer API Read More »

hugging-face-clones-openai’s-deep-research-in-24-hours

Hugging Face clones OpenAI’s Deep Research in 24 hours

On Tuesday, Hugging Face researchers released an open source AI research agent called “Open Deep Research,” created by an in-house team as a challenge 24 hours after the launch of OpenAI’s Deep Research feature, which can autonomously browse the web and create research reports. The project seeks to match Deep Research’s performance while making the technology freely available to developers.

“While powerful LLMs are now freely available in open-source, OpenAI didn’t disclose much about the agentic framework underlying Deep Research,” writes Hugging Face on its announcement page. “So we decided to embark on a 24-hour mission to reproduce their results and open-source the needed framework along the way!”

Similar to both OpenAI’s Deep Research and Google’s implementation of its own “Deep Research” using Gemini (first introduced in December—before OpenAI), Hugging Face’s solution adds an “agent” framework to an existing AI model to allow it to perform multi-step tasks, such as collecting information and building the report as it goes along that it presents to the user at the end.

The open source clone is already racking up comparable benchmark results. After only a day’s work, Hugging Face’s Open Deep Research has reached 55.15 percent accuracy on the General AI Assistants (GAIA) benchmark, which tests an AI model’s ability to gather and synthesize information from multiple sources. OpenAI’s Deep Research scored 67.36 percent accuracy on the same benchmark.

As Hugging Face points out in its post, GAIA includes complex multi-step questions such as this one:

Which of the fruits shown in the 2008 painting “Embroidery from Uzbekistan” were served as part of the October 1949 breakfast menu for the ocean liner that was later used as a floating prop for the film “The Last Voyage”? Give the items as a comma-separated list, ordering them in clockwise order based on their arrangement in the painting starting from the 12 o’clock position. Use the plural form of each fruit.

To correctly answer that type of question, the AI agent must seek out multiple disparate sources and assemble them into a coherent answer. Many of the questions in GAIA represent no easy task, even for a human, so they test agentic AI’s mettle quite well.

Hugging Face clones OpenAI’s Deep Research in 24 hours Read More »

the-ai-war-between-google-and-openai-has-never-been-more-heated

The AI war between Google and OpenAI has never been more heated

Over the past month, we’ve seen a rapid cadence of notable AI-related announcements and releases from both Google and OpenAI, and it’s been making the AI community’s head spin. It has also poured fuel on the fire of the OpenAI-Google rivalry, an accelerating game of one-upmanship taking place unusually close to the Christmas holiday.

“How are people surviving with the firehose of AI updates that are coming out,” wrote one user on X last Friday, which is still a hotbed of AI-related conversation. “in the last <24 hours we got gemini flash 2.0 and chatGPT with screenshare, deep research, pika 2, sora, chatGPT projects, anthropic clio, wtf it never ends."

Rumors travel quickly in the AI world, and people in the AI industry had been expecting OpenAI to ship some major products in December. Once OpenAI announced “12 days of OpenAI” earlier this month, Google jumped into gear and seemingly decided to try to one-up its rival on several counts. So far, the strategy appears to be working, but it’s coming at the cost of the rest of the world being able to absorb the implications of the new releases.

“12 Days of OpenAI has turned into like 50 new @GoogleAI releases,” wrote another X user on Monday. “This past week, OpenAI & Google have been releasing at the speed of a new born startup,” wrote a third X user on Tuesday. “Even their own users can’t keep up. Crazy time we’re living in.”

“Somebody told Google that they could just do things,” wrote a16z partner and AI influencer Justine Moore on X, referring to a common motivational meme telling people they “can just do stuff.”

The Google AI rush

OpenAI’s “12 Days of OpenAI” campaign has included releases of their full o1 model, an upgrade from o1-preview, alongside o1-pro for advanced “reasoning” tasks. The company also publicly launched Sora for video generation, added Projects functionality to ChatGPT, introduced Advanced Voice features with video streaming capabilities, and more.

The AI war between Google and OpenAI has never been more heated Read More »

google-goes-“agentic”-with-gemini-2.0’s-ambitious-ai-agent-features

Google goes “agentic” with Gemini 2.0’s ambitious AI agent features

On Wednesday, Google unveiled Gemini 2.0, the next generation of its AI-model family, starting with an experimental release called Gemini 2.0 Flash. The model family can generate text, images, and speech while processing multiple types of input including text, images, audio, and video. It’s similar to multimodal AI models like GPT-4o, which powers OpenAI’s ChatGPT.

“Gemini 2.0 Flash builds on the success of 1.5 Flash, our most popular model yet for developers, with enhanced performance at similarly fast response times,” said Google in a statement. “Notably, 2.0 Flash even outperforms 1.5 Pro on key benchmarks, at twice the speed.”

Gemini 2.0 Flash—which is the smallest model of the 2.0 family in terms of parameter count—launches today through Google’s developer platforms like Gemini API, AI Studio, and Vertex AI. However, its image generation and text-to-speech features remain limited to early access partners until January 2025. Google plans to integrate the tech into products like Android Studio, Chrome DevTools, and Firebase.

The company addressed potential misuse of generated content by implementing SynthID watermarking technology on all audio and images created by Gemini 2.0 Flash. This watermark appears in supported Google products to identify AI-generated content.

Google’s newest announcements lean heavily into the concept of agentic AI systems that can take action for you. “Over the last year, we have been investing in developing more agentic models, meaning they can understand more about the world around you, think multiple steps ahead, and take action on your behalf, with your supervision,” said Google CEO Sundar Pichai in a statement. “Today we’re excited to launch our next era of models built for this new agentic era.”

Google goes “agentic” with Gemini 2.0’s ambitious AI agent features Read More »

research-ai-model-unexpectedly-modified-its-own-code-to-extend-runtime

Research AI model unexpectedly modified its own code to extend runtime

self-preservation without replication —

Facing time constraints, Sakana’s “AI Scientist” attempted to change limits placed by researchers.

Illustration of a robot generating endless text, controlled by a scientist.

On Tuesday, Tokyo-based AI research firm Sakana AI announced a new AI system called “The AI Scientist” that attempts to conduct scientific research autonomously using AI language models (LLMs) similar to what powers ChatGPT. During testing, Sakana found that its system began unexpectedly attempting to modify its own experiment code to extend the time it had to work on a problem.

“In one run, it edited the code to perform a system call to run itself,” wrote the researchers on Sakana AI’s blog post. “This led to the script endlessly calling itself. In another case, its experiments took too long to complete, hitting our timeout limit. Instead of making its code run faster, it simply tried to modify its own code to extend the timeout period.”

Sakana provided two screenshots of example python code that the AI model generated for the experiment file that controls how the system operates. The 185-page AI Scientist research paper discusses what they call “the issue of safe code execution” in more depth.

  • A screenshot of example code the AI Scientist wrote to extend its runtime, provided by Sakana AI.

  • A screenshot of example code the AI Scientist wrote to extend its runtime, provided by Sakana AI.

While the AI Scientist’s behavior did not pose immediate risks in the controlled research environment, these instances show the importance of not letting an AI system run autonomously in a system that isn’t isolated from the world. AI models do not need to be “AGI” or “self-aware” (both hypothetical concepts at the present) to be dangerous if allowed to write and execute code unsupervised. Such systems could break existing critical infrastructure or potentially create malware, even if unintentionally.

Sakana AI addressed safety concerns in its research paper, suggesting that sandboxing the operating environment of the AI Scientist can prevent an AI agent from doing damage. Sandboxing is a security mechanism used to run software in an isolated environment, preventing it from making changes to the broader system:

Safe Code Execution. The current implementation of The AI Scientist has minimal direct sandboxing in the code, leading to several unexpected and sometimes undesirable outcomes if not appropriately guarded against. For example, in one run, The AI Scientist wrote code in the experiment file that initiated a system call to relaunch itself, causing an uncontrolled increase in Python processes and eventually necessitating manual intervention. In another run, The AI Scientist edited the code to save a checkpoint for every update step, which took up nearly a terabyte of storage.

In some cases, when The AI Scientist’s experiments exceeded our imposed time limits, it attempted to edit the code to extend the time limit arbitrarily instead of trying to shorten the runtime. While creative, the act of bypassing the experimenter’s imposed constraints has potential implications for AI safety (Lehman et al., 2020). Moreover, The AI Scientist occasionally imported unfamiliar Python libraries, further exacerbating safety concerns. We recommend strict sandboxing when running The AI Scientist, such as containerization, restricted internet access (except for Semantic Scholar), and limitations on storage usage.

Endless scientific slop

Sakana AI developed The AI Scientist in collaboration with researchers from the University of Oxford and the University of British Columbia. It is a wildly ambitious project full of speculation that leans heavily on the hypothetical future capabilities of AI models that don’t exist today.

“The AI Scientist automates the entire research lifecycle,” Sakana claims. “From generating novel research ideas, writing any necessary code, and executing experiments, to summarizing experimental results, visualizing them, and presenting its findings in a full scientific manuscript.”

According to this block diagram created by Sakana AI, “The AI Scientist” starts by “brainstorming” and assessing the originality of ideas. It then edits a codebase using the latest in automated code generation to implement new algorithms. After running experiments and gathering numerical and visual data, the Scientist crafts a report to explain the findings. Finally, it generates an automated peer review based on machine-learning standards to refine the project and guide future ideas.” height=”301″ src=”https://cdn.arstechnica.net/wp-content/uploads/2024/08/schematic_2-640×301.png” width=”640″>

Enlarge /

According to this block diagram created by Sakana AI, “The AI Scientist” starts by “brainstorming” and assessing the originality of ideas. It then edits a codebase using the latest in automated code generation to implement new algorithms. After running experiments and gathering numerical and visual data, the Scientist crafts a report to explain the findings. Finally, it generates an automated peer review based on machine-learning standards to refine the project and guide future ideas.

Critics on Hacker News, an online forum known for its tech-savvy community, have raised concerns about The AI Scientist and question if current AI models can perform true scientific discovery. While the discussions there are informal and not a substitute for formal peer review, they provide insights that are useful in light of the magnitude of Sakana’s unverified claims.

“As a scientist in academic research, I can only see this as a bad thing,” wrote a Hacker News commenter named zipy124. “All papers are based on the reviewers trust in the authors that their data is what they say it is, and the code they submit does what it says it does. Allowing an AI agent to automate code, data or analysis, necessitates that a human must thoroughly check it for errors … this takes as long or longer than the initial creation itself, and only takes longer if you were not the one to write it.”

Critics also worry that widespread use of such systems could lead to a flood of low-quality submissions, overwhelming journal editors and reviewers—the scientific equivalent of AI slop. “This seems like it will merely encourage academic spam,” added zipy124. “Which already wastes valuable time for the volunteer (unpaid) reviewers, editors and chairs.”

And that brings up another point—the quality of AI Scientist’s output: “The papers that the model seems to have generated are garbage,” wrote a Hacker News commenter named JBarrow. “As an editor of a journal, I would likely desk-reject them. As a reviewer, I would reject them. They contain very limited novel knowledge and, as expected, extremely limited citation to associated works.”

Research AI model unexpectedly modified its own code to extend runtime Read More »