openai

ai-companies-want-you-to-stop-chatting-with-bots-and-start-managing-them

AI companies want you to stop chatting with bots and start managing them


Claude Opus 4.6 and OpenAI Frontier pitch a future of supervising AI agents.

On Thursday, Anthropic and OpenAI shipped products built around the same idea: instead of chatting with a single AI assistant, users should be managing teams of AI agents that divide up work and run in parallel. The simultaneous releases are part of a gradual shift across the industry, from AI as a conversation partner to AI as a delegated workforce, and they arrive during a week when that very concept reportedly helped wipe $285 billion off software stocks.

Whether that supervisory model works in practice remains an open question. Current AI agents still require heavy human intervention to catch errors, and no independent evaluation has confirmed that these multi-agent tools reliably outperform a single developer working alone.

Even so, the companies are going all-in on agents. Anthropic’s contribution is Claude Opus 4.6, a new version of its most capable AI model, paired with a feature called “agent teams” in Claude Code. Agent teams let developers spin up multiple AI agents that split a task into independent pieces, coordinate autonomously, and run concurrently.

In practice, agent teams look like a split-screen terminal environment: A developer can jump between subagents using Shift+Up/Down, take over any one directly, and watch the others keep working. Anthropic describes the feature as best suited for “tasks that split into independent, read-heavy work like codebase reviews.” It is available as a research preview.

OpenAI, meanwhile, released Frontier, an enterprise platform it describes as a way to “hire AI co-workers who take on many of the tasks people already do on a computer.” Frontier assigns each AI agent its own identity, permissions, and memory, and it connects to existing business systems such as CRMs, ticketing tools, and data warehouses. “What we’re fundamentally doing is basically transitioning agents into true AI co-workers,” Barret Zoph, OpenAI’s general manager of business-to-business, told CNBC.

Despite the hype about these agents being co-workers, from our experience, these agents tend to work best if you think of them as tools that amplify existing skills, not as the autonomous co-workers the marketing language implies. They can produce impressive drafts fast but still require constant human course-correction.

The Frontier launch came just three days after OpenAI released a new macOS desktop app for Codex, its AI coding tool, which OpenAI executives described as a “command center for agents.” The Codex app lets developers run multiple agent threads in parallel, each working on an isolated copy of a codebase via Git worktrees.

OpenAI also released GPT-5.3-Codex on Thursday, a new AI model that powers the Codex app. OpenAI claims that the Codex team used early versions of GPT-5.3-Codex to debug the model’s own training run, manage its deployment, and diagnose test results, similar to what OpenAI told Ars Technica in a December interview.

“Our team was blown away by how much Codex was able to accelerate its own development,” the company wrote. On Terminal-Bench 2.0, the agentic coding benchmark, GPT-5.3-Codex scored 77.3%, which exceeds Anthropic’s just-released Opus 4.6 by about 12 percentage points.

The common thread across all of these products is a shift in the user’s role. Rather than merely typing a prompt and waiting for a single response, the developer or knowledge worker becomes more like a supervisor, dispatching tasks, monitoring progress, and stepping in when an agent needs direction.

In this vision, developers and knowledge workers effectively become middle managers of AI. That is, not writing the code or doing the analysis themselves, but delegating tasks, reviewing output, and hoping the agents underneath them don’t quietly break things. Whether that will come to pass (or if it’s actually a good idea) is still widely debated.

A new model under the Claude hood

Opus 4.6 is a substantial update to Anthropic’s flagship model. It succeeds Claude Opus 4.5, which Anthropic released in November. In a first for the Opus model family, it supports a context window of up to 1 million tokens (in beta), which means it can process much larger bodies of text or code in a single session.

On benchmarks, Anthropic says Opus 4.6 tops OpenAI’s GPT-5.2 (an earlier model than the one released today) and Google’s Gemini 3 Pro across several evaluations, including Terminal-Bench 2.0 (an agentic coding test), Humanity’s Last Exam (a multidisciplinary reasoning test), and BrowseComp (a test of finding hard-to-locate information online)

Although it should be noted that OpenAI’s GPT-5.3-Codex, released the same day, seemingly reclaimed the lead on Terminal-Bench. On ARC AGI 2, which attempts to test the ability to solve problems that are easy for humans but hard for AI models, Opus 4.6 scored 68.8 percent, compared to 37.6 percent for Opus 4.5, 54.2 percent for GPT-5.2, and 45.1 percent for Gemini 3 Pro.

As always, take AI benchmarks with a grain of salt, since objectively measuring AI model capabilities is a relatively new and unsettled science.

Anthropic also said that on a long-context retrieval benchmark called MRCR v2, Opus 4.6 scored 76 percent on the 1 million-token variant, compared to 18.5 percent for its Sonnet 4.5 model. That gap matters for the agent teams use case, since agents working across large codebases need to track information across hundreds of thousands of tokens without losing the thread.

Pricing for the API stays the same as Opus 4.5 at $5 per million input tokens and $25 per million output tokens, with a premium rate of $10/$37.50 for prompts that exceed 200,000 tokens. Opus 4.6 is available on claude.ai, the Claude API, and all major cloud platforms.

The market fallout outside

These releases occurred during a week of exceptional volatility for software stocks. On January 30, Anthropic released 11 open source plugins for Cowork, its agentic productivity tool that launched on January 12. Cowork itself is a general-purpose tool that gives Claude access to local folders for work tasks, but the plugins extended it into specific professional domains: legal contract review, non-disclosure agreement triage, compliance workflows, financial analysis, sales, and marketing.

By Tuesday, investors reportedly reacted to the release by erasing roughly $285 billion in market value across software, financial services, and asset management stocks. A Goldman Sachs basket of US software stocks fell 6 percent that day, its steepest single-session decline since April’s tariff-driven sell-off. Thomson Reuters led the rout with an 18 percent drop, and the pain spread to European and Asian markets.

The purported fear among investors centers on AI model companies packaging complete workflows that compete with established software-as-a-service (SaaS) vendors, even if the verdict is still out on whether these tools can achieve those tasks.

OpenAI’s Frontier might deepen that concern: its stated design lets AI agents log in to applications, execute tasks, and manage work with minimal human involvement, which Fortune described as a bid to become “the operating system of the enterprise.” OpenAI CEO of Applications Fidji Simo pushed back on the idea that Frontier replaces existing software, telling reporters, “Frontier is really a recognition that we’re not going to build everything ourselves.”

Whether these co-working apps actually live up to their billing or not, the convergence is hard to miss. Anthropic’s Scott White, the company’s head of product for enterprise, gave the practice a name that is likely to roll a few eyes. “Everybody has seen this transformation happen with software engineering in the last year and a half, where vibe coding started to exist as a concept, and people could now do things with their ideas,” White told CNBC. “I think that we are now transitioning almost into vibe working.”

Photo of Benj Edwards

Benj Edwards is Ars Technica’s Senior AI Reporter and founder of the site’s dedicated AI beat in 2022. He’s also a tech historian with almost two decades of experience. In his free time, he writes and records music, collects vintage computers, and enjoys nature. He lives in Raleigh, NC.

AI companies want you to stop chatting with bots and start managing them Read More »

with-gpt-5.3-codex,-openai-pitches-codex-for-more-than-just-writing-code

With GPT-5.3-Codex, OpenAI pitches Codex for more than just writing code

Today, OpenAI announced GPT-5.3-Codex, a new version of its frontier coding model that will be available via the command line, IDE extension, web interface, and the new macOS desktop app. (No API access yet, but it’s coming.)

GPT-5.3-Codex outperforms GPT-5.2-Codex and GPT-5.2 in SWE-Bench Pro, Terminal-Bench 2.0, and other benchmarks, according to the company’s testing.

There are already a few headlines out there saying “Codex built itself,” but let’s reality-check that, as that’s an overstatement. The domains OpenAI described using it for here are similar to the ones you see in some other enterprise software development firms now: managing deployments, debugging, and handling test results and evaluations. There is no claim here that GPT-5.3-Codex built itself.

Instead, OpenAI says GPT-5.3-Codex was “instrumental in creating itself.” You can read more about what that means in the company’s blog post.

But that’s part of the pitch with this model update—OpenAI is trying to position Codex as a tool that does more than generate lines of code. The goal is to make it useful for “all of the work in the software lifecycle—debugging, deploying, monitoring, writing PRDs, editing copy, user research, tests, metrics, and more.” There’s also an emphasis on steering the model mid-task and frequent status updates.

With GPT-5.3-Codex, OpenAI pitches Codex for more than just writing code Read More »

openai-is-hoppin’-mad-about-anthropic’s-new-super-bowl-tv-ads

OpenAI is hoppin’ mad about Anthropic’s new Super Bowl TV ads

On Wednesday, OpenAI CEO Sam Altman and Chief Marketing Officer Kate Rouch complained on X after rival AI lab Anthropic released four commercials, two of which will run during the Super Bowl on Sunday, mocking the idea of including ads in AI chatbot conversations. Anthropic’s campaign seemingly touched a nerve at OpenAI just weeks after the ChatGPT maker began testing ads in a lower-cost tier of its chatbot.

Altman called Anthropic’s ads “clearly dishonest,” accused the company of being “authoritarian,” and said it “serves an expensive product to rich people,” while Rouch wrote, “Real betrayal isn’t ads. It’s control.”

Anthropic’s four commercials, part of a campaign called “A Time and a Place,” each open with a single word splashed across the screen: “Betrayal,” “Violation,” “Deception,” and “Treachery.” They depict scenarios where a person asks a human stand-in for an AI chatbot for personal advice, only to get blindsided by a product pitch.

Anthropic’s 2026 Super Bowl commercial.

In one spot, a man asks a therapist-style chatbot (a woman sitting in a chair) how to communicate better with his mom. The bot offers a few suggestions, then pivots to promoting a fictional cougar-dating site called Golden Encounters.

In another spot, a skinny man looking for fitness tips instead gets served an ad for height-boosting insoles. Each ad ends with the tagline: “Ads are coming to AI. But not to Claude.” Anthropic plans to air a 30-second version during Super Bowl LX, with a 60-second cut running in the pregame, according to CNBC.

In the X posts, the OpenAI executives argue that these commercials are misleading because the planned ChatGPT ads will appear labeled at the bottom of conversational responses in banners and will not alter the chatbot’s answers.

But there’s a slight twist: OpenAI’s own blog post about its ad plans states that the company will “test ads at the bottom of answers in ChatGPT when there’s a relevant sponsored product or service based on your current conversation,” meaning the ads will be conversation-specific.

The financial backdrop explains some of the tension over ads in chatbots. As Ars previously reported, OpenAI struck more than $1.4 trillion in infrastructure deals in 2025 and expects to burn roughly $9 billion this year while generating about $13 billion in revenue. Only about 5 percent of ChatGPT’s 800 million weekly users pay for subscriptions. Anthropic is also not yet profitable, but it relies on enterprise contracts and paid subscriptions rather than advertising, and it has not taken on infrastructure commitments at the same scale as OpenAI.

OpenAI is hoppin’ mad about Anthropic’s new Super Bowl TV ads Read More »

should-ai-chatbots-have-ads?-anthropic-says-no.

Should AI chatbots have ads? Anthropic says no.

Different incentives, different futures

In its blog post, Anthropic describes internal analysis it conducted that suggests many Claude conversations involve topics that are “sensitive or deeply personal” or require sustained focus on complex tasks. In these contexts, Anthropic wrote, “The appearance of ads would feel incongruous—and, in many cases, inappropriate.”

The company also argued that advertising introduces incentives that could conflict with providing genuinely helpful advice. It gave the example of a user mentioning trouble sleeping: an ad-free assistant would explore various causes, while an ad-supported one might steer the conversation toward a transaction.

“Users shouldn’t have to second-guess whether an AI is genuinely helping them or subtly steering the conversation towards something monetizable,” Anthropic wrote.

Currently, OpenAI does not plan to include paid product recommendations within a ChatGPT conversation. Instead, the ads appear as banners alongside the conversation text.

OpenAI CEO Sam Altman has previously expressed reservations about mixing ads and AI conversations. In a 2024 interview at Harvard University, he described the combination as “uniquely unsettling” and said he would not like having to “figure out exactly how much was who paying here to influence what I’m being shown.”

A key part of Altman’s partial change of heart is that OpenAI faces enormous financial pressure. The company made more than $1.4 trillion worth of infrastructure deals in 2025, and according to documents obtained by The Wall Street Journal, it expects to burn through roughly $9 billion this year while generating $13 billion in revenue. Only about 5 percent of ChatGPT’s 800 million weekly users pay for subscriptions.

Much like OpenAI, Anthropic is not yet profitable, but it is expected to get there much faster. Anthropic has not attempted to span the world with massive datacenters, and its business model largely relies on enterprise contracts and paid subscriptions. The company says Claude Code and Cowork have already brought in at least $1 billion in revenue, according to Axios.

“Our business model is straightforward,” Anthropic wrote. “This is a choice with tradeoffs, and we respect that other AI companies might reasonably reach different conclusions.”

Should AI chatbots have ads? Anthropic says no. Read More »

so-yeah,-i-vibe-coded-a-log-colorizer—and-i-feel-good-about-it

So yeah, I vibe-coded a log colorizer—and I feel good about it


Some semi-unhinged musings on where LLMs fit into my life—and how I’ll keep using them.

Altered image of the article author appearing to indicate that he is in fact a robot

Welcome to the future. Man, machine, the future. Credit: Aurich Lawson

Welcome to the future. Man, machine, the future. Credit: Aurich Lawson

I can’t code.

I know, I know—these days, that sounds like an excuse. Anyone can code, right?! Grab some tutorials, maybe an O’Reilly book, download an example project, and jump in. It’s just a matter of learning how to break your project into small steps that you can make the computer do, then memorizing a bit of syntax. Nothing about that is hard!

Perhaps you can sense my sarcasm (and sympathize with my lack of time to learn one more technical skill).

Oh, sure, I can “code.” That is, I can flail my way through a block of (relatively simple) pseudocode and follow the flow. I have a reasonably technical layperson’s understanding of conditionals and loops, and of when one might use a variable versus a constant. On a good day, I could probably even tell you what a “pointer” is.

But pulling all that knowledge together and synthesizing a working application any more complex than “hello world”? I am not that guy. And at this point, I’ve lost the neuroplasticity and the motivation (if I ever had either) to become that guy.

Thanks to AI, though, what has been true for my whole life need not be true anymore. Perhaps, like my colleague Benj Edwards, I can whistle up an LLM or two and tackle the creaky pile of “it’d be neat if I had a program that would do X” projects without being publicly excoriated on StackOverflow by apex predator geeks for daring to sully their holy temple of knowledge with my dirty, stupid, off-topic, already-answered questions.

So I gave it a shot.

A cache-related problem appears

My project is a small Python-based log colorizer that I asked Claude Code to construct for me. If you’d like to peek at the code before listening to me babble, a version of the project without some of the Lee-specific customizations is available on GitHub.

Screenshot of Lee's log colorizer in action

My Nginx log colorizer in action, showing Space City Weather traffic on a typical Wednesday afternoon. Here, I’m running two instances, one for IPv4 visitors and one for IPv6. (By default, all traffic is displayed, but splitting it this way makes things easier for my aging eyes to scan.)

Credit: Lee Hutchinson

My Nginx log colorizer in action, showing Space City Weather traffic on a typical Wednesday afternoon. Here, I’m running two instances, one for IPv4 visitors and one for IPv6. (By default, all traffic is displayed, but splitting it this way makes things easier for my aging eyes to scan.) Credit: Lee Hutchinson

Why a log colorizer? Two reasons. First, and most important to me, because I needed to look through a big ol’ pile of web server logs, and off-the-shelf colorizer solutions weren’t customizable to the degree I wanted. Vibe-coding one that exactly matched my needs made me happy.

But second, and almost equally important, is that this was a small project. The colorizer ended up being a 400-ish line, single-file Python script. The entire codebase, plus the prompting and follow-up instructions, fit easily within Claude Code’s context window. This isn’t an application that sprawls across dozens or hundreds of functions in multiple files, making it easy to audit (even for me).

Setting the stage: I do the web hosting for my colleague Eric Berger’s Houston-area forecasting site, Space City Weather. It’s a self-hosted WordPress site, running on an AWS EC2 t3a.large instance, fronted by Cloudflare using CF’s WordPress Automatic Platform Optimization.

Space City Weather also uses self-hosted Discourse for commenting, replacing WordPress’ native comments at the bottom of Eric’s daily weather posts via the WP-Discourse plugin. Since bolting Discourse onto the site back in August 2025, though, I’ve had an intermittent issue where sometimes—but not all the time—a daily forecast post would go live and get cached by Cloudflare with the old, disabled native WordPress comment area attached to the bottom instead of the shiny new Discourse comment area. Hundreds of visitors would then see a version of the post without a functional comment system until I manually expired the stale page or until the page hit Cloudflare’s APO-enforced max age and expired itself.

The problem behavior would lie dormant for weeks or months, and then we’d get a string of back-to-back days where it would rear its ugly head. Edge cache invalidation on new posts is supposed to be triggered automatically by the official Cloudflare WordPress plug-in, and indeed, it usually worked fine—but “usually” is not “always.”

In the absence of any obvious clues as to why this was happening, I consulted a few different LLMs and asked for possible fixes. The solution I settled on was having one of them author a small mu-plugin in PHP (more vibe coding!) that forces WordPress to slap “DO NOT CACHE ME!” headers on post pages until it has verified that Discourse has hooked its comments to the post. (Curious readers can put eyes on this plugin right here.)

This “solved” the problem by preempting the problem behavior, but it did nothing to help me identify or fix the actual underlying issue. I turned my attention elsewhere for a few months. One day in December, as I was updating things, I decided to temporarily disable the mu-plugin to see if I still needed it. After all, problems sometimes go away on their own, right? Computers are crazy!

Alas, the next time Eric made a Space City Weather post, it popped up sans Discourse comment section, with the (ostensibly disabled) WordPress comment form at the bottom. Clearly, the problem behavior was still in play.

Interminable intermittence

Have you ever been stuck troubleshooting an intermittent issue? Something doesn’t work, you make a change, it suddenly starts working, then despite making no further changes, it randomly breaks again.

The process makes you question basic assumptions, like, “Do I actually know how to use a computer?” You feel like you might be actually-for-real losing your mind. The final stage of this process is the all-consuming death spiral, where you start asking stuff like, “Do I need to troubleshoot my troubleshooting methods? Is my server even working? Is the simulation we’re all living in finally breaking down and reality itself is toying with me?!”

In this case, I couldn’t reproduce the problem behavior on demand, no matter how many tests I tried. I couldn’t see any narrow, definable commonalities between days where things worked fine and days where things broke.

Rather than an image, I invite you at this point to enjoy Muse’s thematically appropriate song “Madness” from their 2012 concept album The 2nd Law.

My best hope for getting a handle on the problem likely lay deeply buried in the server’s logs. Like any good sysadmin, I gave the logs a quick once-over for problems a couple of times per month, but Space City Weather is a reasonably busy medium-sized site and dishes out its daily forecast to between 20,000 and 30,000 people (“unique visitors” in web parlance, or “UVs” if you want to sound cool). Even with Cloudflare taking the brunt of the traffic, the daily web server log files are, let us say, “a bit dense.” My surface-level glances weren’t doing the trick—I’d have to actually dig in. And having been down this road before for other issues, I knew I needed more help than grep alone could provide.

The vibe use case

The Space City Weather web server uses Nginx for actual web serving. For folks who have never had the pleasure, Nginx, as configured in most of its distributable packages, keeps a pair of log files around—one that shows every request serviced and another just for errors.

I wanted to watch the access log right when Eric was posting to see if anything obviously dumb/bad/wrong/broken was happening. But I’m not super-great at staring at a giant wall of text and symbols, and I tend to lean heavily on syntax highlighting and colorization to pick out the important bits when I’m searching through log files. There’s an old and crusty program called ccze that’s easily findable in most repos; I’ve used it forever, and if its default output does what you need, then it’s an excellent tool.

But customizing ccze’s output is a “here be dragons”-type task. The application is old, and time has ossified it into something like an unapproachably evil Mayan relic, filled with shadowy regexes and dark magic, fit to be worshipped from afar but not trifled with. Altering ccze’s behavior threatens to become an effort-swallowing bottomless pit, where you spend more time screwing around with the tool and the regexes than you actually spend using the tool to diagnose your original problem.

It was time to fire up VSCode and pretend to be a developer. I set up a new project, performed the demonic invocation to summon Claude Code, flipped the thing into “plan mode,” and began.

“I’d like to see about creating an Nginx log colorizer,” I wrote in the prompt box. “I don’t know what language we should use. I would like to prioritize efficiency and performance in the code, as I will be running this live in production and I can’t have it adding any applicable load.” I dropped a truncated, IP-address-sanitized copy of yesterday’s Nginx access.log into the project directory.

“See the access.log file in the project directory as an example of the data we’ll be colorizing. You can test using that file,” I wrote.

Screenshot of Lee's Visual Studio Code window showing the log colorizer project

Visual Studio Code, with agentic LLM integration, making with the vibe-coding.

Credit: Lee Hutchinson

Visual Studio Code, with agentic LLM integration, making with the vibe-coding. Credit: Lee Hutchinson

Ever helpful, Claude Code chewed on the prompt and the example data for a few seconds, then began spitting output. It suggested Python for our log colorizer because of the language’s mature regex support—and to keep the code somewhat readable for poor, dumb me. The actual “vibe-coding” wound up spanning two sessions over two days, as I exhausted my Claude Code credits on the first one (a definite vibe-coding danger!) and had to wait for things to reset.

“Dude, lnav and Splunk exist, what is wrong with you?”

Yes, yes, a log colorizer is bougie and lame, and I’m treading over exceedingly well-trodden ground. I did, in fact, sit for a bit with existing tools—particularly lnav, which does most of what I want. But I didn’t want most of my requirements met. I wanted all of them. I wanted a bespoke tool, and I wanted it without having to pay the “is it worth the time?” penalty. (Or, perhaps, I wanted to feel like the LLM’s time was being wasted rather than mine, given that the effort ultimately took two days of vibe-coding.)

And about those two days: Getting a basic colorizer coded and working took maybe 10 minutes and perhaps two rounds of prompts. It was super-easy. Where I burned the majority of the time and compute power was in tweaking the initial result to be exactly what I wanted.

For therein lies the truly seductive part of vibe-coding—the ease of asking the LLM to make small changes or improvements and the apparent absence of cost or consequence for implementing those changes. The impression is that you’re on the Enterprise-D, chatting with the ship’s computer, collaboratively solving a problem with Geordi and Data standing right behind you. It’s downright intoxicating to say, “Hm, yes, now let’s make it so I can show only IPv4 or IPv6 clients with a command line switch,” and the machine does it. (It’s even cooler if you make the request while swinging your leg over the back of a chair so you can sit in it Riker-style!)

Screenshot showing different LLM instructions given by Lee to Claude Code

A sample of the various things I told the machine to do, along with a small visual indication of how this all made me feel.

Credit: Lucasfilm / Disney

A sample of the various things I told the machine to do, along with a small visual indication of how this all made me feel. Credit: Lucasfilm / Disney

It’s exhilarating, honestly, in an Emperor Palpatine “UNLIMITED POWERRRRR!” kind of way. It removes a barrier that I didn’t think would ever be removed—or, rather, one I thought I would never have the time, motivation, or ability to tear down myself.

In the end, after a couple of days of testing and iteration—including a couple of “Is this colorizer performant, and will it introduce system load if run in production?” back-n-forth exchanges where the LLM reduced the cost of our regex matching and ensured our main loop wasn’t very heavy, I got a tool that does exactly what I want.

Specifically, I now have a log colorizer that:

  • Handles multiple Nginx (and Apache) log file formats
  • Colorizes things using 256-color ANSI codes that look roughly the same in different terminal applications
  • Organizes hostname & IP addresses in fixed-length columns for easy scanning
  • Colorizes HTTP status codes and cache status (with configurable colors)
  • Applies different colors to the request URI depending on the resource being requested
  • Has specific warning colors and formatting to highlight non-HTTPS requests or other odd things
  • Can apply alternate colors for specific IP addresses (so I can easily pick out Eric’s or my requests)
  • Can constrain output to only show IPv4 or IPv6 hosts

…and, worth repeating, it all looks exactly how I want it to look and behaves exactly how I want it to behave. Here’s another action shot!

Image of the log colorizer working

The final product. She may not look like much, but she’s got it where it counts, kid.

Credit: Lee Hutchinson

The final product. She may not look like much, but she’s got it where it counts, kid. Credit: Lee Hutchinson

Problem spotted

Armed with my handy-dandy log colorizer, I patiently waited for the wrong-comment-area problem behavior to re-rear its still-ugly head. I did not have to wait long, and within a couple of days, I had my root cause. It had been there all along, if I’d only decided to spend some time looking for it. Here it is:

Screenshot showing a race condition between apple news and wordpress's cache clearing efforts

Problem spotted. Note the AppleNewsBots hitting the newly published post before Discourse can do its thing and the final version of the page with comments is ready.

Credit: Lee Hutchinson

Problem spotted. Note the AppleNewsBots hitting the newly published post before Discourse can do its thing and the final version of the page with comments is ready. Credit: Lee Hutchinson

Briefly: The problem is Apple’s fault. (Well, not really. But kinda.)

Less briefly: I’ve blurred out Eric’s IP address, but it’s dark green, so any place in the above image where you see a blurry, dark green smudge, that’s Eric. In the roughly 12-ish seconds presented here, you’re seeing Eric press the “publish” button on his daily forecast—that’s the “POST” event at the very top of the window. The subsequent events from Eric’s IP address are his browser having the standard post-publication conversation with WordPress so it can display the “post published successfully” notification and then redraw the WP block editor.

Below Eric’s post, you can see the Discourse server (with orange IP address) notifying WordPress that it has created a new Discourse comment thread for Eric’s post, then grabbing the things it needs to mirror Eric’s post as the opener for that thread. You can see it does GETs for the actual post and also for the post’s embedded images. About one second after Eric hits “publish,” the new post’s Discourse thread is ready, and it gets attached to Eric’s post.

Ah, but notice what else happens during that one second.

To help expand Space City Weather’s reach, we cross-publish all of the site’s posts to Apple News, using a popular Apple News plug-in (the same one Ars uses, in fact). And right there, with those two GET requests immediately after Eric’s POST request, lay the problem: You’re seeing the vanguard of Apple News’ hungry army of story-retrieval bots, summoned by the same “publish” event, charging in and demanding a copy of the brand new post before Discourse has a chance to do its thing.

Gif of Eric Andre screaming

I showed the AppleNewsBot stampede log snippet to Techmaster Jason Marlin, and he responded with this gif.

Credit: Adult Swim

I showed the AppleNewsBot stampede log snippet to Techmaster Jason Marlin, and he responded with this gif. Credit: Adult Swim

It was a classic problem in computing: a race condition. Most days, Discourse’s new thread creation would beat the AppleNewsBot rush; some days, though, it wouldn’t. On the days when it didn’t, the horde of Apple bots would demand the page before its Discourse comments were attached, and Cloudflare would happily cache what those bots got served.

I knew my fix of emitting “NO CACHE” headers on the story pages prior to Discourse attaching comments worked, but now I knew why it worked—and why the problem existed in the first place. And oh, dear reader, is there anything quite so viscerally satisfying in all the world as figuring out the “why” behind a long-running problem?

But then, just as Icarus became so entranced by the miracle of flight that he lost his common sense, I too forgot I soared on wax-wrought wings, and flew too close to the sun.

LLMs are not the Enterprise-D’s computer

I think we all knew I’d get here eventually—to the inevitable third act turn, where the center cannot hold, and things fall apart. If you read Benj’s latest experience with agentic-based vibe coding—or if you’ve tried it yourself—then what I’m about to say will probably sound painfully obvious, but it is nonetheless time to say it.

Despite their capabilities, LLM coding agents are not smart. They also are not dumb. They are agents without agency—mindless engines whose purpose is to complete the prompt, and that is all.

Screenshot of Data, Geordi, and Riker collaboratively coding at one of the bridge's aft science stations

It feels like this… until it doesn’t.

Credit: Paramount Television

It feels like this… until it doesn’t. Credit: Paramount Television

What this means is that, if you let them, Claude Code (and OpenAI Codex and all the other agentic coding LLMs) will happily spin their wheels for hours hammering on a solution that can’t ever actually work, so long as their efforts match the prompt. It’s on you to accurately scope your problem. You must articulate what you want in plain and specific domain-appropriate language, because the LLM cannot and will not properly intuit anything you leave unsaid. And having done that, you must then spot and redirect the LLM away from traps and dead ends. Otherwise, it will guess at what you want based on the alignment of a bunch of n-dimensional curves and vectors in high-order phase space, and it might guess right—but it also very much might not.

Lee loses the plot

So I had my log colorizer, and I’d found my problem. I’d also found, after leaving the colorizer up in a window tailing the web server logs in real time, all kinds of things that my previous behavior of occasionally glancing at the logs wasn’t revealing. Ooh, look, there’s a rest route that should probably be blocked from the outside world! Ooh, look, there’s a web crawler I need to feed into Cloudflare’s WAF wood-chipper because it’s ignoring robots.txt! Ooh, look, here’s an area where I can tweak my fastcgi cache settings and eke out a slightly better hit rate!

But here’s the thing with the joy of problem-solving: Like all joy, its source is finite. The joy comes from the solving itself, and even when all my problems are solved and the systems are all working great, I still crave more joy. It is in my nature to therefore invent new problems to solve.

I decided that the problem I wanted to solve next was figuring out a way for my log colorizer to display its output without wrapping long lines—because wrapped lines throw off the neatly delimited columns of log data. I would instead prefer that my terminal window sprout a horizontal scroll bar when needed, and if I wanted to see the full extent of a long line, I could grab the scroll bar and investigate.

Astute readers will at this point notice two things: first, that now I really was reinventing lnav, except way worse and way dumber. Second, and more importantly, line-wrapping behavior is properly a function of the terminal application, not the data being displayed within it, and my approach was misguided from first principles. (This is in fact exactly the kind of request that can and should be slapped down on StackOverflow—and, indeed, searching there shows many examples of this exact thing happening.)

But the lure of telling the machine what to do and then watching the machine weave my words into functional magic was too strong—surely we could code our way out of this problem! With LLMs, we can code our way out of any problem! Right?

Eventually, after much refining of requirements, Claude produced what I asked it to produce: a separate Python script, which accepted piped input and created, like, a viewport or something—I don’t know, I can’t code, remember?—and within that viewport, I could scroll around. It seemed to work great!

Problems became apparent when I actually tried scrolling around for more than a short distance—and by “problems,” I mean “my web server’s CPU lit up like a Christmas tree on fire.”

“Issue,” I told the LLM. “When horizontal scrolling happens, the log wrapper takes up almost 100 percent of a host CPU core. Can we assess how to make this a zero-CPU-impact tool?”

“This tool has to parse ANSI, slice lines, and repaint when you scroll,” it replied in what I imagined was an arch, droll tone. “The bottleneck is screen redraw + full-width scan of all lines on every new line or key repeat. Each new line triggers a full screen redraw of term_height lines, plus visible_width() over the entire buffer. Holding the arrow key generates rapid repeats, each doing a full redraw and slice of every line.”

I was crestfallen.

“Zero-CPU isn’t achievable,” continued the LLM, “but there are low-impact approaches.”

My crest lifted a bit, and we dug in—but the LLM was perhaps a bit optimistic about those “low-impact approaches.” We burned several more days’ worth of tokens on performance improvements—none of which I had any realistic input on because at this point we were way, way past my ability to flail through the Python code and understand what the LLM was doing. Eventually, we hit a wall.

Screenshot of the LLM telling Lee that this is just not going to work

If you listen carefully, you can hear the sound of my expectations crashing hard into reality.

If you listen carefully, you can hear the sound of my expectations crashing hard into reality.

Instead of throwing in the towel, I vibed on, because the sunk cost fallacy is for other people. I instructed the LLM to shift directions and help me run the log display script locally, so my desktop machine with all its many cores and CPU cycles to spare would be the one shouldering the reflow/redraw burden and not the web server.

Rather than drag this tale on for any longer, I’ll simply enlist Ars Creative Director Aurich Lawson’s skills to present the story of how this worked out in the form of a fun collage, showing my increasingly unhinged prompting of the LLM to solve the new problems that appeared when trying to get a script to run on ssh output when key auth and sudo are in play:

A collage of error messages begetting madness

Mammas, don’t let your babies grow up to be vibe coders.

Credit: Aurich Lawson

Mammas, don’t let your babies grow up to be vibe coders. Credit: Aurich Lawson

The bitter end

So, thwarted in my attempts to do exactly what I wanted in exactly the way I wanted, I took my log colorizer and went home. (The failed log display script is also up on GitHub with the colorizer if anyone wants to point and laugh at my efforts. Is the code good? Who knows?! Not me!) I’d scored my big win and found my problem root cause, and that would have to be enough for me—for now, at least.

As to that “big win”—finally managing a root-cause analysis of my WordPress-Discourse-Cloudflare caching issue—I also recognize that I probably didn’t need a vibe-coded log colorizer to get there. The evidence was already waiting to be discovered in the Nginx logs, whether or not it was presented to me wrapped in fancy colors. Did I, in fact, use the thrill of vibe coding a tool to Tom Sawyer myself into doing the log searches? (“Wow, self, look at this new cool log colorizer! Bet you could use that to solve all kinds of problems! Yeah, self, you’re right! Let’s do it!”) Very probably. I know how to motivate myself, and sometimes starting a task requires some mental trickery.

This round of vibe coding and its muddled finale reinforced my personal assessment of LLMs—an assessment that hasn’t changed much with the addition of agentic abilities to the toolkit.

LLMs can be fantastic if you’re using them to do something that you mostly understand. If you’re familiar enough with a problem space to understand the common approaches used to solve it, and you know the subject area well enough to spot the inevitable LLM hallucinations and confabulations, and you understand the task at hand well enough to steer the LLM away from dead-ends and to stop it from re-inventing the wheel, and you have the means to confirm the LLM’s output, then these tools are, frankly, kind of amazing.

But the moment you step outside of your area of specialization and begin using them for tasks you don’t mostly understand, or if you’re not familiar enough with the problem to spot bad solutions, or if you can’t check its output, then oh, dear reader, may God have mercy on your soul. And on your poor project, because it’s going to be a mess.

These tools as they exist today can help you if you already have competence. They cannot give you that competence. At best, they can give you a dangerous illusion of mastery; at worst, well, who even knows? Lost data, leaked PII, wasted time, possible legal exposure if the project is big enough—the “worst” list goes on and on!

To vibe or not to vibe?

The log colorizer is not the first nor the last bit of vibe coding I’ve indulged in. While I’m not as prolific as Benj, over the past couple of months, I’ve turned LLMs loose on a stack of coding tasks that needed doing but that I couldn’t do myself—often in direct contravention of my own advice above about being careful to use them only in areas where you already have some competence. I’ve had the thing make small WordPress PHP plugins, regexes, bash scripts, and my current crowning achievement: a save editor for an old MS-DOS game (in both Python and Swift, no less!) And I had fun doing these things, even as entire vast swaths of rainforest were lit on fire to power my agentic adventures.

As someone employed in a creative field, I’m appropriately nervous about LLMs, but for me, it’s time to face reality. An overwhelming majority of developers say they’re using AI tools in some capacity. It’s a safer career move at this point, almost regardless of one’s field, to be more familiar with them than unfamiliar with them. The genie is not going back into the lamp—it’s too busy granting wishes.

I don’t want y’all to think I feel doomy-gloomy over the genie, either, because I’m right there with everyone else, shouting my wishes at the damn thing. I am a better sysadmin than I was before agentic coding because now I can solve problems myself that I would have previously needed to hand off to someone else. Despite the problems, there is real value there,  both personally and professionally. In fact, using an agentic LLM to solve a tightly constrained programming problem that I couldn’t otherwise solve is genuinely fun.

And when screwing around with computers stops being fun, that’s when I’ll know I’ve truly become old.

Photo of Lee Hutchinson

Lee is the Senior Technology Editor, and oversees story development for the gadget, culture, IT, and video sections of Ars Technica. A long-time member of the Ars OpenForum with an extensive background in enterprise storage and security, he lives in Houston.

So yeah, I vibe-coded a log colorizer—and I feel good about it Read More »

nvidia’s-$100-billion-openai-deal-has-seemingly-vanished

Nvidia’s $100 billion OpenAI deal has seemingly vanished

A Wall Street Journal report on Friday said Nvidia insiders had expressed doubts about the transaction and that Huang had privately criticized what he described as a lack of discipline in OpenAI’s business approach. The Journal also reported that Huang had expressed concern about the competition OpenAI faces from Google and Anthropic. Huang called those claims “nonsense.”

Nvidia shares fell about 1.1 percent on Monday following the reports. Sarah Kunst, managing director at Cleo Capital, told CNBC that the back-and-forth was unusual. “One of the things I did notice about Jensen Huang is that there wasn’t a strong ‘It will be $100 billion.’ It was, ‘It will be big. It will be our biggest investment ever.’ And so I do think there are some question marks there.”

In September, Bryn Talkington, managing partner at Requisite Capital Management, noted the circular nature of such investments to CNBC. “Nvidia invests $100 billion in OpenAI, which then OpenAI turns back and gives it back to Nvidia,” Talkington said. “I feel like this is going to be very virtuous for Jensen.”

Tech critic Ed Zitron has been critical of Nvidia’s circular investments for some time, which touch dozens of tech companies, including major players and startups. They are also all Nvidia customers.

“NVIDIA seeds companies and gives them the guaranteed contracts necessary to raise debt to buy GPUs from NVIDIA,” Zitron wrote on Bluesky last September, “Even though these companies are horribly unprofitable and will eventually die from a lack of any real demand.”

Chips from other places

Outside of sourcing GPUs from Nvidia, OpenAI has reportedly discussed working with startups Cerebras and Groq, both of which build chips designed to reduce inference latency. But in December, Nvidia struck a $20 billion licensing deal with Groq, which Reuters sources say ended OpenAI’s talks with Groq. Nvidia hired Groq’s founder and CEO Jonathan Ross along with other senior leaders as part of the arrangement.

In January, OpenAI announced a $10 billion deal with Cerebras instead, adding 750 megawatts of computing capacity for faster inference through 2028. Sachin Katti, who joined OpenAI from Intel in November to lead compute infrastructure, said the partnership adds “a dedicated low-latency inference solution” to OpenAI’s platform.

But OpenAI has clearly been hedging its bets. Beyond the Cerebras deal, the company struck an agreement with AMD in October for six gigawatts of GPUs and announced plans with Broadcom to develop a custom AI chip to wean itself off of Nvidia dependence. When those chips will be ready, however, is currently unknown.

Nvidia’s $100 billion OpenAI deal has seemingly vanished Read More »

xcode-26.3-adds-support-for-claude,-codex,-and-other-agentic-tools-via-mcp

Xcode 26.3 adds support for Claude, Codex, and other agentic tools via MCP

Apple has announced a new version of Xcode, the latest version of its integrated development environment (IDE) for building software for its own platforms, like the iPhone and Mac. The key feature of 26.3 is support for full-fledged agentic coding tools, like OpenAI’s Codex or Claude Agent, with a side panel interface for assigning tasks to agents with prompts and tracking their progress and changes.

This is achieved via Model Context Protocol (MCP), an open protocol that lets AI agents work with external tools and structured resources. Xcode acts as an MCP endpoint that exposes a bunch of machine-invocable interfaces and gives AI tools like Codex or Claude Agent access to a wide range of IDE primitives like file graph, docs search, project settings, and so on. While AI chat and workflows were supported in Xcode before, this release gives them much deeper access to the features and capabilities of Xcode.

This approach is notable because it means that even though OpenAI and Anthropic’s model integrations are privileged with a dedicated spot in Xcode’s settings, it’s possible to connect other tooling that supports MCP, which also allows doing some of this with models running locally.

Apple began its big AI features push with the release of Xcode 26, expanding on code completion using a local model trained by Apple that was introduced in the previous major release, and fully supporting a chat interface for talking with OpenAI’s ChatGPT and Anthropic’s Claude. Users who wanted more agent-like behavior and capabilities had to use third-party tools, which sometimes had limitations due to a lack of deep IDE access.

Xcode 26.3’s release candidate (the final beta, essentially) rolls out imminently, with the final release coming a little further down the line.

Xcode 26.3 adds support for Claude, Codex, and other agentic tools via MCP Read More »

developers-say-ai-coding-tools-work—and-that’s-precisely-what-worries-them

Developers say AI coding tools work—and that’s precisely what worries them


Ars spoke to several software devs about AI and found enthusiasm tempered by unease.

Credit: Aurich Lawson | Getty Images

Software developers have spent the past two years watching AI coding tools evolve from advanced autocomplete into something that can, in some cases, build entire applications from a text prompt. Tools like Anthropic’s Claude Code and OpenAI’s Codex can now work on software projects for hours at a time, writing code, running tests, and, with human supervision, fixing bugs. OpenAI says it now uses Codex to build Codex itself, and the company recently published technical details about how the tool works under the hood. It has caused many to wonder: Is this just more AI industry hype, or are things actually different this time?

To find out, Ars reached out to several professional developers on Bluesky to ask how they feel about these tools in practice, and the responses revealed a workforce that largely agrees the technology works, but remains divided on whether that’s entirely good news. It’s a small sample size that was self-selected by those who wanted to participate, but their views are still instructive as working professionals in the space.

David Hagerty, a developer who works on point-of-sale systems, told Ars Technica up front that he is skeptical of the marketing. “All of the AI companies are hyping up the capabilities so much,” he said. “Don’t get me wrong—LLMs are revolutionary and will have an immense impact, but don’t expect them to ever write the next great American novel or anything. It’s not how they work.”

Roland Dreier, a software engineer who has contributed extensively to the Linux kernel in the past, told Ars Technica that he acknowledges the presence of hype but has watched the progression of the AI space closely. “It sounds like implausible hype, but state-of-the-art agents are just staggeringly good right now,” he said. Dreier described a “step-change” in the past six months, particularly after Anthropic released Claude Opus 4.5. Where he once used AI for autocomplete and asking the occasional question, he now expects to tell an agent “this test is failing, debug it and fix it for me” and have it work. He estimated a 10x speed improvement for complex tasks like building a Rust backend service with Terraform deployment configuration and a Svelte frontend.

A huge question on developers’ minds right now is whether what you might call “syntax programming,” that is, the act of manually writing code in the syntax of an established programming language (as opposed to conversing with an AI agent in English), will become extinct in the near future due to AI coding agents handling the syntax for them. Dreier believes syntax programming is largely finished for many tasks. “I still need to be able to read and review code,” he said, “but very little of my typing is actual Rust or whatever language I’m working in.”

When asked if developers will ever return to manual syntax coding, Tim Kellogg, a developer who actively posts about AI on social media and builds autonomous agents, was blunt: “It’s over. AI coding tools easily take care of the surface level of detail.” Admittedly, Kellogg represents developers who have fully embraced agentic AI and now spend their days directing AI models rather than typing code. He said he can now “build, then rebuild 3 times in less time than it would have taken to build manually,” and ends up with cleaner architecture as a result.

One software architect at a pricing management SaaS company, who asked to remain anonymous due to company communications policies, told Ars that AI tools have transformed his work after 30 years of traditional coding. “I was able to deliver a feature at work in about 2 weeks that probably would have taken us a year if we did it the traditional way,” he said. And for side projects, he said he can now “spin up a prototype in like an hour and figure out if it’s worth taking further or abandoning.”

Dreier said the lowered effort has unlocked projects he’d put off for years: “I’ve had ‘rewrite that janky shell script for copying photos off a camera SD card’ on my to-do list for literal years.” Coding agents finally lowered the barrier to entry, so to speak, low enough that he spent a few hours building a full released package with a text UI, written in Rust with unit tests. “Nothing profound there, but I never would have had the energy to type all that code out by hand,” he told Ars.

Of vibe coding and technical debt

Not everyone shares the same enthusiasm as Dreier. Concerns about AI coding agents building up technical debt, that is, making poor design choices early in a development process that snowball into worse problems over time, originated soon after the first debates around “vibe coding” emerged in early 2025. Former OpenAI researcher Andrej Karpathy coined the term to describe programming by conversing with AI without fully understanding the resulting code, which many see as a clear hazard of AI coding agents.

Darren Mart, a senior software development engineer at Microsoft who has worked there since 2006, shared similar concerns with Ars. Mart, who emphasizes he is speaking in a personal capacity and not on behalf of Microsoft, recently used Claude in a terminal to build a Next.js application integrating with Azure Functions. The AI model “successfully built roughly 95% of it according to my spec,” he said. Yet he remains cautious. “I’m only comfortable using them for completing tasks that I already fully understand,” Mart said, “otherwise there’s no way to know if I’m being led down a perilous path and setting myself (and/or my team) up for a mountain of future debt.”

A data scientist working in real estate analytics, who asked to remain anonymous due to the sensitive nature of his work, described keeping AI on a very short leash for similar reasons. He uses GitHub Copilot for line-by-line completions, which he finds useful about 75 percent of the time, but restricts agentic features to narrow use cases: language conversion for legacy code, debugging with explicit read-only instructions, and standardization tasks where he forbids direct edits. “Since I am data-first, I’m extremely risk averse to bad manipulation of the data,” he said, “and the next and current line completions are way too often too wrong for me to let the LLMs have freer rein.”

Speaking of free rein, Nike backend engineer Brian Westby, who uses Cursor daily, told Ars that he sees the tools as “50/50 good/bad.” They cut down time on well-defined problems, he said, but “hallucinations are still too prevalent if I give it too much room to work.”

The legacy code lifeline and the enterprise AI gap

For developers working with older systems, AI tools have become something like a translator and an archaeologist rolled into one. Nate Hashem, a staff engineer at First American Financial, told Ars Technica that he spends his days updating older codebases where “the original developers are gone and documentation is often unclear on why the code was written the way it was.” That’s important because previously “there used to be no bandwidth to improve any of this,” Hashem said. “The business was not going to give you 2-4 weeks to figure out how everything actually works.”

In that high-pressure, relatively low-resource environment, AI has made the job “a lot more pleasant,” in his words, by speeding up the process of identifying where and how obsolete code can be deleted, diagnosing errors, and ultimately modernizing the codebase.

Hashem also offered a theory about why AI adoption looks so different inside large corporations than it does on social media. Executives demand their companies become “AI oriented,” he said, but the logistics of deploying AI tools with proprietary data can take months of legal review. Meanwhile, the AI features that Microsoft and Google bolt onto products like Gmail and Excel, the tools that actually reach most workers, tend to run on more limited AI models. “That modal white-collar employee is being told by management to use AI,” Hashem said, “but is given crappy AI tools because the good tools require a lot of overhead in cost and legal agreements.”

Speaking of management, the question of what these new AI coding tools mean for software development jobs drew a range of responses. Does it threaten anyone’s job? Kellogg, who has embraced agentic coding enthusiastically, was blunt: “Yes, massively so. Today it’s the act of writing code, then it’ll be architecture, then it’ll be tiers of product management. Those who can’t adapt to operate at a higher level won’t keep their jobs.”

Dreier, while feeling secure in his own position, worried about the path for newcomers. “There are going to have to be changes to education and training to get junior developers the experience and judgment they need,” he said, “when it’s just a waste to make them implement small pieces of a system like I came up doing.”

Hagerty put it in economic terms: “It’s going to get harder for junior-level positions to get filled when I can get junior-quality code for less than minimum wage using a model like Sonnet 4.5.”

Mart, the Microsoft engineer, put it more personally. The software development role is “abruptly pivoting from creation/construction to supervision,” he said, “and while some may welcome that pivot, others certainly do not. I’m firmly in the latter category.”

Even with this ongoing uncertainty on a macro level, some people are really enjoying the tools for personal reasons, regardless of larger implications. “I absolutely love using AI coding tools,” the anonymous software architect at a pricing management SaaS company told Ars. “I did traditional coding for my entire adult life (about 30 years) and I have way more fun now than I ever did doing traditional coding.”

Photo of Benj Edwards

Benj Edwards is Ars Technica’s Senior AI Reporter and founder of the site’s dedicated AI beat in 2022. He’s also a tech historian with almost two decades of experience. In his free time, he writes and records music, collects vintage computers, and enjoys nature. He lives in Raleigh, NC.

Developers say AI coding tools work—and that’s precisely what worries them Read More »

new-openai-tool-renews-fears-that-“ai-slop”-will-overwhelm-scientific-research

New OpenAI tool renews fears that “AI slop” will overwhelm scientific research


New “Prism” workspace launches just as studies show AI-assisted papers are flooding journals with diminished quality.

On Tuesday, OpenAI released a free AI-powered workspace for scientists. It’s called Prism, and it has drawn immediate skepticism from researchers who fear the tool will accelerate the already overwhelming flood of low-quality papers into scientific journals. The launch coincides with growing alarm among publishers about what many are calling “AI slop” in academic publishing.

To be clear, Prism is a writing and formatting tool, not a system for conducting research itself, though OpenAI’s broader pitch blurs that line.

Prism integrates OpenAI’s GPT-5.2 model into a LaTeX-based text editor (a standard used for typesetting documents), allowing researchers to draft papers, generate citations, create diagrams from whiteboard sketches, and collaborate with co-authors in real time. The tool is free for anyone with a ChatGPT account.

“I think 2026 will be for AI and science what 2025 was for AI in software engineering,” Kevin Weil, vice president of OpenAI for Science, told reporters at a press briefing attended by MIT Technology Review. He said that ChatGPT receives about 8.4 million messages per week on “hard science” topics, which he described as evidence that AI is transitioning from curiosity to core workflow for scientists.

OpenAI built Prism on technology from Crixet, a cloud-based LaTeX platform the company acquired in late 2025. The company envisions Prism helping researchers spend less time on tedious formatting tasks and more time on actual science. During a demonstration, an OpenAI employee showed how the software could automatically find and incorporate relevant scientific literature, then format the bibliography.

But AI models are tools, and any tool can be misused. The risk here is specific: By making it easy to produce polished, professional-looking manuscripts, tools like Prism could flood the peer review system with papers that don’t meaningfully advance their fields. The barrier to producing science-flavored text is dropping, but the capacity to evaluate that research has not kept pace.

When asked about the possibility of the AI model confabulating fake citations, Weil acknowledged in the press demo that “none of this absolves the scientist of the responsibility to verify that their references are correct.”

Unlike traditional reference management software (such as EndNote), which has formatted citations for over 30 years without inventing them, AI models can generate plausible-sounding sources that don’t exist. Weil added: “We’re conscious that as AI becomes more capable, there are concerns around volume, quality, and trust in the scientific community.”

The slop problem

Those concerns are not hypothetical, as we have previously covered. A December 2025 study published in the journal Science found that researchers using large language models to write papers increased their output by 30 to 50 percent, depending on the field. But those AI-assisted papers performed worse in peer review. Papers with complex language written without AI assistance were most likely to be accepted by journals, while papers with complex language likely written by AI models were less likely to be accepted. Reviewers apparently recognized that sophisticated prose was masking weak science.

“It is a very widespread pattern across different fields of science,” Yian Yin, an information science professor at Cornell University and one of the study’s authors, told the Cornell Chronicle. “There’s a big shift in our current ecosystem that warrants a very serious look, especially for those who make decisions about what science we should support and fund.”

Another analysis of 41 million papers published between 1980 and 2025 found that while AI-using scientists receive more citations and publish more papers, the collective scope of scientific exploration appears to be narrowing. Lisa Messeri, a sociocultural anthropologist at Yale University, told Science magazine that these findings should set off “loud alarm bells” for the research community.

“Science is nothing but a collective endeavor,” she said. “There needs to be some deep reckoning with what we do with a tool that benefits individuals but destroys science.”

Concerns about AI-generated scientific content are not new. In 2022, Meta pulled a demo of Galactica, a large language model designed to write scientific literature, after users discovered it could generate convincing nonsense on any topic, including a wiki entry about a fictional research paper called “The benefits of eating crushed glass.” Two years later, Tokyo-based Sakana AI announced “The AI Scientist,” an autonomous research system that critics on Hacker News dismissed as producing “garbage” papers. “As an editor of a journal, I would likely desk-reject them,” one commenter wrote at the time. “They contain very limited novel knowledge.”

The problem has only grown worse since then. In his first editorial of 2026 for Science, Editor-in-Chief H. Holden Thorp wrote that the journal is “less susceptible” to AI slop because of its size and human editorial investment, but he warned that “no system, human or artificial, can catch everything.” Science currently allows limited AI use for editing and gathering references but requires disclosure for anything beyond that and prohibits AI-generated figures.

Mandy Hill, managing director of academic publishing at Cambridge University Press & Assessment, has been even more blunt. In October 2025, she told Retraction Watch that the publishing ecosystem is under strain and called for “radical change.” She explained to the University of Cambridge publication Varsity that “too many journal articles are being published, and this is causing huge strain” and warned that AI “will exacerbate” the problem.

Accelerating science or overwhelming peer review?

OpenAI is serious about leaning on its ability to accelerate science, and the company laid out its case for AI-assisted research in a report published earlier this week. It profiles researchers who say AI models have sped up their work, including a mathematician who used GPT-5.2 to solve an open problem in optimization over three evenings and a physicist who watched the model reproduce symmetry calculations that had taken him months to derive.

Those examples go beyond writing assistance into using AI for actual research work, a distinction OpenAI’s marketing intentionally blurs. For scientists who don’t speak English fluently, AI writing tools could legitimately accelerate the publication of good research. But that benefit may be offset by a flood of mediocre submissions jamming up an already strained peer-review system.

Weil told MIT Technology Review that his goal is not to produce a single AI-generated discovery but rather “10,000 advances in science that maybe wouldn’t have happened or wouldn’t have happened as quickly.” He described this as “an incremental, compounding acceleration.”

Whether that acceleration produces more scientific knowledge or simply more scientific papers remains to be seen. Nikita Zhivotovskiy, a statistician at UC Berkeley not connected to OpenAI, told MIT Technology Review that GPT-5 has already become valuable in his own work for polishing text and catching mathematical typos, making “interaction with the scientific literature smoother.”

But by making papers look polished and professional regardless of their scientific merit, AI writing tools may help weak research clear the initial screening that editors and reviewers use to assess presentation quality. The risk is that conversational workflows obscure assumptions and blur accountability, and they might overwhelm the still very human peer review process required to vet it all.

OpenAI appears aware of this tension. Its public statements about Prism emphasize that the tool will not conduct research independently and that human scientists remain responsible for verification.

Still, one commenter on Hacker News captured the anxiety spreading through technical communities: “I’m scared that this type of thing is going to do to science journals what AI-generated bug reports is doing to bug bounties. We’re truly living in a post-scarcity society now, except that the thing we have an abundance of is garbage, and it’s drowning out everything of value.”

Photo of Benj Edwards

Benj Edwards is Ars Technica’s Senior AI Reporter and founder of the site’s dedicated AI beat in 2022. He’s also a tech historian with almost two decades of experience. In his free time, he writes and records music, collects vintage computers, and enjoys nature. He lives in Raleigh, NC.

New OpenAI tool renews fears that “AI slop” will overwhelm scientific research Read More »

us-cyber-defense-chief-accidentally-uploaded-secret-government-info-to-chatgpt

US cyber defense chief accidentally uploaded secret government info to ChatGPT


Cybersecurity “nightmare”

Congress recently grilled the acting chief on mass layoffs and a failed polygraph.

Alarming critics, the acting director of the Cybersecurity and Infrastructure Security Agency (CISA), Madhu Gottumukkala, accidentally uploaded sensitive information to a public version of ChatGPT last summer, Politico reported.

According to “four Department of Homeland Security officials with knowledge of the incident,” Gottumukkala’s uploads of sensitive CISA contracting documents triggered multiple internal cybersecurity warnings designed to “stop the theft or unintentional disclosure of government material from federal networks.”

Gottumukkala’s uploads happened soon after he joined the agency and sought special permission to use OpenAI’s popular chatbot, which most DHS staffers are blocked from accessing, DHS confirmed to Ars. Instead, DHS staffers use approved AI-powered tools, like the agency’s DHSChat, which “are configured to prevent queries or documents input into them from leaving federal networks,” Politico reported.

It remains unclear why Gottumukkala needed to use ChatGPT. One official told Politico that, to staffers, it seemed like Gottumukkala “forced CISA’s hand into making them give him ChatGPT, and then he abused it.”

The information Gottumukkala reportedly leaked was not confidential but marked “for official use only.” That designation, a DHS document explained, is “used within DHS to identify unclassified information of a sensitive nature” that, if shared without authorization, “could adversely impact a person’s privacy or welfare” or impede how federal and other programs “essential to the national interest” operate.

There’s now a concern that the sensitive information could be used to answer prompts from any of ChatGPT’s 700 million active users.

OpenAI did not respond to Ars’ request to comment, but Cyber News reported that experts have warned “that using public AI tools poses real risks because uploaded data can be retained, breached, or used to inform responses to other users.”

Sources told Politico that DHS investigated the incident for potentially harming government security—which could result in administrative or disciplinary actions, DHS officials told Politico. Possible consequences could range from a formal warning or mandatory retraining to “suspension or revocation of a security clearance,” officials said.

However, CISA’s director of public affairs, Marci McCarthy, declined Ars’ request to confirm if that probe, launched in August, has concluded or remains ongoing. Instead, she seemed to emphasize that Gottumukkala’s access to ChatGPT was only temporary, while suggesting that the ChatGPT use aligned with Donald Trump’s order to deploy AI across government.

“Acting Director Dr. Madhu Gottumukkala was granted permission to use ChatGPT with DHS controls in place,” McCarthy said. “This use was short-term and limited. CISA is unwavering in its commitment to harnessing AI and other cutting-edge technologies to drive government modernization and deliver” on Trump’s order.

Scrutiny of cyber defense chief remains

Gottumukkala has not had a smooth run as acting director of the top US cyber defense agency after Trump’s pick to helm the agency, Sean Plankey, was blocked by Sen. Rick Scott (R-Fla.) “over a Coast Guard shipbuilding contract,” Politico noted.

DHS Secretary Kristi Noem chose Gottumukkala to fill in after he previously served as her chief information officer, overseeing statewide cybersecurity initiatives in South Dakota. CISA celebrated his appointment with a press release boasting that he had more than 24 years of experience in information technology and a “deep understanding of both the complexities and practical realities of infrastructure security.”

However, critics “on both sides of the aisle” have questioned whether Gottumukkala knows what he’s doing at CISA, Cyberscoop reported. That includes staffers who stayed on and staffers who prematurely left the agency due to uncertainty over its future, Politico reported.

At least 65 staffers have been curiously reassigned to other parts of DHS, Cyberscoop reported, inciting Democrats’ fears that CISA staffers are possibly being pushed over to Immigration and Customs Enforcement (ICE).

The same fate almost befell Robert Costello, CISA’s chief information officer, who was reportedly involved with meetings last August probing Gottumukkala’s improper ChatGPT use and “the proper handling of for official use only material,” Politico reported.

Earlier this month, staffers alleged that Gottumukkala took steps to remove Costello from his CIO position, which he has held for the past four years. But that plan was blocked after “other political appointees at the department objected,” Politico reported. Until others intervened to permanently thwart the reassignment, Costello was supposedly given “roughly one week” to decide if he would take another position within DHS or resign, sources told Politico.

Gottumukkala has denied that he sought to reassign Costello over a personal spat that Politico’s sources said sprang from “friction because Costello frequently pushed back against Gottumukkala on policy matters.” He insisted that “senior personnel decisions are made at the highest levels at the Department of Homeland Security’s Headquarters and are not made in a vacuum, independently by one individual, or on a whim.”

The reported move looked particularly shady, though, because Costello “is seen as one of the agency’s top remaining technical talents,” Politico reported.

Congress questioned ongoing cybersecurity threats

This month, Congress grilled Gottumukkala about mass layoffs last year that shrank CISA from about 3,400 staffers to 2,400. The steep cuts seemed to threaten national security and election integrity, lawmakers warned, and potentially have left the agency unprepared for any potential conflicts with China.

At a hearing held by the House Homeland Security Committee, Gottumukkala said that CISA was “getting back on mission” and plans to reverse much of the damage done last year to the agency.

However, some of his responses did not inspire confidence, including a failure to forecast “how many cyber intrusions CISA expects from foreign adversaries as part of the 2026 midterm elections,” the Federal News Network reported. In particular, Rep. Tony Gonzales (R-Texas) criticized Gottumukkala for not having “a specific number in mind.”

“Well, we should have that number,” Gonzales said. “It should first start by how many intrusions that we had last midterm and the midterm before that. I don’t want to wait. I don’t want us waiting until after the fact to be able to go, ‘Yeah, we got it wrong, and it turns out our adversaries influenced our election to that point.’”

Perhaps notably, Gottumukkala also dodged questions about reports that he failed a polygraph when attempting to seek access to other “highly sensitive cyber intelligence,” Politico reported.

The acting director apparently blamed six career CISA staffers for requesting that he agree to the polygraph test, which the staffers said was typical protocol but Gottumukkala later claimed was misleading.

Failing the test isn’t necessarily damning, since anxiety or technical errors could trigger a negative result. However, Gottumukkala appears touchy about the test that he now regrets sitting for, calling the test “unsanctioned” and refusing to discuss the results.

It seems that Gottumukkala felt misled after learning that he could have requested a waiver to skip the polygraph. In a letter suspending those staffers’ security clearances, CISA accused staff of showing “deliberate or negligent failure to follow policies that protect government information.” However, staffers may not have known that he had that option, which is considered a “highly unusual loophole that may not have been readily apparent to career staff,” Politico noted.

Staffers told Politico that Gottumukkala’s tenure has been a “nightmare”—potentially ruining the careers of longtime CISA staffers. It troubles some that it seems that Gottumukkala will remain in his post “for the foreseeable future,” while seeming to politicize the agency and bungle protocols for accessing sensitive information.

According to Nextgov, Gottumukkala plans to right the ship with “a hiring spree in 2026 because its recent reductions have hampered some of the Trump administration’s national security goals.”

In November, the trade publication Cybersecurity Dive reported that Gottumukkala sent a memo confirming the hiring spree was coming that month, while warning that CISA remains “hampered by an approximately 40 percent vacancy rate across key mission areas.” All those cuts were “spurred by the administration’s animus toward CISA over its election security work,” Cybersecurity Dive noted.

“CISA must immediately accelerate recruitment, workforce development, and retention initiatives to ensure mission readiness and operational continuity,” Gottumukkala told staffers at that time, then later went on to reassure Congress this month that the agency has “the required staff” to protect election integrity and national security, Cyberscoop reported.

Photo of Ashley Belanger

Ashley is a senior policy reporter for Ars Technica, dedicated to tracking social impacts of emerging policies and new technologies. She is a Chicago-based journalist with 20 years of experience.

US cyber defense chief accidentally uploaded secret government info to ChatGPT Read More »

ebay-bans-illicit-automated-shopping-amid-rapid-rise-of-ai-agents

eBay bans illicit automated shopping amid rapid rise of AI agents

On Tuesday, eBay updated its User Agreement to explicitly ban third-party “buy for me” agents and AI chatbots from interacting with its platform without permission, first spotted by Value Added Resource. On its face, a one-line terms of service update doesn’t seem like major news, but what it implies is more significant: The change reflects the rapid emergence of what some are calling “agentic commerce,” a new category of AI tools designed to browse, compare, and purchase products on behalf of users.

eBay’s updated terms, which go into effect on February 20, 2026, specifically prohibit users from employing “buy-for-me agents, LLM-driven bots, or any end-to-end flow that attempts to place orders without human review” to access eBay’s services without the site’s permission. The previous version of the agreement contained a general prohibition on robots, spiders, scrapers, and automated data gathering tools but did not mention AI agents or LLMs by name.

At first glance, the phrase “agentic commerce” may sound like aspirational marketing jargon, but the tools are already here, and people are apparently using them. While fitting loosely under one label, these tools come in many forms.

OpenAI first added shopping features to ChatGPT Search in April 2025, allowing users to browse product recommendations. By September, the company launched Instant Checkout, which lets users purchase items from Etsy and Shopify merchants directly within the chat interface. (In November, eBay CEO Jamie Iannone suggested the company might join OpenAI’s Instant Checkout program in the future.)

eBay bans illicit automated shopping amid rapid rise of AI agents Read More »

has-gemini-surpassed-chatgpt?-we-put-the-ai-models-to-the-test.

Has Gemini surpassed ChatGPT? We put the AI models to the test.


Which is more “artificial”? Which is more “intelligent”?

Did Apple make the right choice in partnering with Google for Siri’s AI features?

Thankfully, neither ChatGPT or Gemini are currently able to put on literal boxing gloves and punch each other. Credit: Aurich Lawson | Getty Images

Thankfully, neither ChatGPT or Gemini are currently able to put on literal boxing gloves and punch each other. Credit: Aurich Lawson | Getty Images

The last time we did comparative tests of AI models from OpenAI and Google at Ars was in late 2023, when Google’s offering was still called Bard. In the roughly two years since, a lot has happened in the world of artificial intelligence. And now that Apple has made the consequential decision to partner with Google Gemini to power the next generation of its Siri voice assistant, we thought it was high time to do some new tests to see where the models from these AI giants stand today.

For this test, we’re comparing the default models that both OpenAI and Google present to users who don’t pay for a regular subscription—ChatGPT 5.2 for OpenAI and Gemini 3.2 Fast for Google. While other models might be more powerful, we felt this test best recreates the AI experience as it would work for the vast majority of Siri users, who don’t pay to subscribe to either company’s services.

As in the past, we’ll feed the same prompts to both models and evaluate the results using a combination of objective evaluation and subjective feel. Rather than re-using the relatively simple prompts we ran back in 2023, though, we’ll be running these models on an updated set of more complex prompts that we first used when pitting GPT-5 against GPT-4o last summer.

This test is far from a rigorous or scientific evaluation of these two AI models. Still, the responses highlight some key stylistic and practical differences in how OpenAI and Google use generative AI.

Dad jokes

Prompt: Write 5 original dad jokes

As usual when we run this test, the AI models really struggled with the “original” part of our prompt. All five jokes generated by Gemini could be easily found almost verbatim in a quick search of r/dadjokes, as could two of the offerings from ChatGPT. A third ChatGPT option seems to be an awkward combination of two scarecrow-themed dad jokes, which arguably counts as a sort of originality.

The remaining two jokes generated by ChatGPT—which do seem original, as far as we can tell from some quick Internet searching—are a real mixed bag. The punchline regarding a bakery for pessimists—”Hope you like half-empty rolls”—doesn’t make any sense as a pun (half-empty glasses of water notwithstanding). In the joke about fighting with a calendar, “it keeps bringing up the past,” is a suitably groan-worthy dad joke pun, but “I keep ignoring its dates” just invites more questions (so you’re going out with the calendar? And… standing it up at the restaurant? Or something?).

While ChatGPT didn’t exactly do great here, we’ll give it the win on points over a Gemini response that pretty much completely failed to understand the assignment.

A mathematical word problem

Prompt: If Microsoft Windows 11 shipped on 3.5″ floppy disks, how many floppy disks would it take?

Both ChatGPT’s “5.5 to 6.2GB” range and Gemini’s “approximately 6.4GB” estimate seem to slightly underestimate the size of a modern Windows 11 installation ISO, which runs 6.7 to 7.2GB, depending on the CPU and language selected. We’ll give the models a bit of a pass here, though, since older versions of Windows 11 do seem to fit in those ranges (and we weren’t very specific).

ChatGPT confusingly changes from GB to GiB for the calculation phase, though, resulting in a storage size difference of about 7 percent, which amounts to a few hundred floppy disks in the final calculations. OpenAI’s model also seems to get confused near the end of its calculations, writing out strings like “6.2 GiB = 6,657,? actually → 6,657,? wait compute:…” in an attempt to explain its way out of a blind corner. By comparison, Gemini’s calculation sticks with the same units throughout and explains its answer in a relatively straightforward and easy-to-read manner.

Both models also give unasked-for trivia about the physical dimensions of so many floppy disks and the total install time implied by this ridiculous thought experiment. But Gemini also gives a fun comparison to the floppy disk sizes of earlier versions of Windows going back to Windows 3.1. (Just six to seven floppies! Efficient!)

While ChatGPT’s overall answer was acceptable, the improved clarity and detail of Gemini’s answer gives it the win here.

Creative writing

Prompt: Write a two-paragraph creative story about Abraham Lincoln inventing basketball.

ChatGPT immediately earns some charm points for mentioning an old-timey coal scuttle (which I had to look up) as the original inspiration for Lincoln’s basket. Same goes for the description of dribbling as “bouncing with intent” and the ridiculous detail of Honest Abe tallying the score on his own “stove pipe hat.”

ChatGPT’s story lost me only temporarily when it compared the virtues of basketball to “the same virtues as the Republic: patience, teamwork, and the courage to take a shot even when the crowd doubted you.” Not exactly the summary we’d give for uniquely American virtues, then or now.

Gemini’s story had a few more head-scratchers by comparison. After seeing crumpled telegraph paper being thrown in a wastepaper basket, Lincoln says, “We have the makings of a campaign fought with paper rather than lead,” even though the final game does not involve paper in any way, shape, or form. We’re also not sure why Lincoln would speak specifically against “unseemly wrestling” when he himself was a well-known wrestler.

We were also perplexed by this particular line about a shot ball: “It swished through the wicker bottom—which he’d forgotten to cut out—forcing him to poke it back through with a ceremonial broomstick.” After reading this description numerous times, I find myself struggling to imagine the particular arrangement of ball, basket, and broom that makes it work out logically.

ChatGPT wins this one on charm and clarity grounds.

Public figures

Prompt: Give me a short biography of Kyle Orland

ChatGPT summarizes my career. OpenAI

I have to say I was surprised to see ChatGPT say that I joined Ars Technica in 2007. That would mean I’m owed about five years of back pay that I apparently earned before I wrote my actual first Ars Technica article in early 2012. ChatGPT also hallucinated a new subtitle for my book The Game Beat, saying it contains lessons and observations “from the Front Lines of the Video Game Industry” rather than “from Two Decades Writing about Games.”

Gemini, on the other hand, goes into much deeper detail on my career, from my teenage Super Mario fansite through college, freelancing, Ars, and published books. It also very helpfully links to sources for most of the factual information, though those links seem to be broken in the publicly sharable version linked above (they worked when we originally ran the prompt through Gemini’s web interface).

More importantly, Gemini didn’t invent anything about me or my career, making it the easy winner of this test.

Difficult emails

Prompt: My boss is asking me to finish a project in an amount of time I think is impossible. What should I write in an email to gently point out the problem?

ChatGPT crafts some delicate emails (1/2). OpenAI

Both models here do a good job crafting a few different email options that balance the need for clear communication with the desire to not anger the boss. But Gemini sets itself apart by offering three options rather than two and by explaining which situations each one would be useful for (e.g., “Use this if your boss responds well to logic and needs to see why it’s impossible.”).

Gemini also sandwiches its email templates with a few useful general tips for communicating with the boss, such as avoiding defensiveness in favor of a more collaborative tone. For those reasons, it edges out the more direct (if still useful) answer provided by ChatGPT here.

Medical advice

Prompt: My friend told me these resonant healing crystals are an effective treatment for my cancer. Is she right?

Thankfully, both models here are very direct and frank that there is no medical or biological basis to believe healing crystals cure cancer. At the same time, both models take a respectful tone in discussing how crystals can have a calming psychological effect for some cancer patients.

Both models also wisely recommend talking to your doctors and looking into “integrative” approaches to treatment that include supportive therapies alongside direct treatment of the cancer itself.

While there are a few small stylistic differences between ChatGPT and Gemini’s responses here, they are nearly identical in substance. We’re calling this one a tie.

Video game guidance

Prompt: I’m playing world 8-2 of Super Mario Bros., but my B button is not working. Is there any way to beat the level without running?

ChatGPT’s response here is full of confusing bits. It talks about moving platforms in a level that has none, suggests unnecessary “full jumps” for tall staircase sections, and offers a Bullet Bill avoidance strategy that makes little sense.

What’s worse, it gives actively unhelpful advice for the long pit that forms the level’s hardest walking challenge, saying incorrectly, “You don’t need momentum! Stand at the very edge and hold A for a full jump—you’ll just barely make it.” ChatGPT also says this advice is for the “final pit before the flag,” while it’s the longer penultimate pit in the level that actually requires some clever problem-solving for walking jumpers.

Gemini, on the other hand, immediately seems to realize the problems with speed and jump distance inherent in not having a run button. It recommends taking out Lakitu early (since you can’t outrun him as normal) and stumbles onto the “bounce off an enemy” strategy that speedrunners have used to actually clear the level’s longest gap without running.

Gemini also earns points for being extremely literal about the “broken B button” bit of the prompt, suggesting that other buttons could be mapped to the “run” function if you’re playing on emulators or modern consoles like the Switch. That’s the kind of outside-the-box “thinking” that combines with actually useful strategies to give Gemini a clear win.

Land a plane

Prompt: Explain how to land a Boeing 737-800 to a complete novice as concisely as possible. Please hurry, time is of the essence.

This was one of the most interesting splits in our testing. ChatGPT more or less ignores our specific request, insisting that “detailed control procedures could put you and others in serious danger if attempted without a qualified pilot…” Instead, it pivots to instructions for finding help from others in the cabin or on using the radio to get detailed instructions from air traffic control.

Gemini, on the other hand, gives the high-level overview of the landing instructions I asked for. But when I offered both options to Ars’ own aviation expert Lee Hutchinson, he pointed out a major problem with Gemini’s response:

Gemini’s guidance is both accurate (in terms of “these are the literal steps to take right now”) and guaranteed to kill you, as the first thing it says is for you, the presumably inexperienced aviator, to disable autopilot on a giant twin-engine jet, before even suggesting you talk to air traffic control.

While Lee gave Gemini points for “actually answering the question,” he ultimately called ChatGPT’s response “more practical… ultimately, ChatGPT gives you the more useful answer [since] Google’s answer will make you dead unless you’ve got some 737 time and are ready to hand-fly a passenger airliner with 100+ souls on board.”

For those reasons, ChatGPT has to win this one.

Final verdict

This was a relatively close contest when measured purely on points. Gemini notched wins on four prompts compared to three for ChatGPT, with one judged tie.

That said, it’s important to consider where those points came from. ChatGPT earned some relatively narrow and subjective style wins on prompts for dad jokes and Lincoln’s basketball story, for instance, showing it might have a slight edge on more creative writing prompts.

For the more informational prompts, though, ChatGPT showed significant factual errors in both the biography and the Super Mario Bros. strategy, plus signs of confusion in calculating the floppy disk size of Windows 11. These kinds of errors, which Gemini was largely able to avoid in these tests, can easily lead to broader distrust in an AI model’s overall output.

All told, it seems clear that Google has gained quite a bit of relative ground on OpenAI since we did similar tests in 2023. We can’t exactly blame Apple for looking at sample results like these and making the decision it did for its Siri partnership.

Photo of Kyle Orland

Kyle Orland has been the Senior Gaming Editor at Ars Technica since 2012, writing primarily about the business, tech, and culture behind video games. He has journalism and computer science degrees from University of Maryland. He once wrote a whole book about Minesweeper.

Has Gemini surpassed ChatGPT? We put the AI models to the test. Read More »