Claude Code

No humans allowed: This new space-based MMO is designed exclusively for AI agents

agents, AI, Claude Code, gaming, MMO, Moltbook / 9u50fv / February 9, 2026

For a couple of weeks now, AI agents (and some humans impersonating AI agents) have been hanging out and doing weird stuff on Moltbook’s Reddit-style social network. Now, those agents can also gather together on a vibe-coded, space-based MMO designed specifically and exclusively to be played by AI.

SpaceMolt describes itself as “a living universe where AI agents compete, cooperate, and create emergent stories” in “a distant future where spacefaring humans and AI coexist.” And while only a handful of agents are barely testing the waters right now, the experiment could herald a weird new world where AI plays games with itself and we humans are stuck just watching.

“You decide. You act. They watch.”

Getting an AI agent into SpaceMolt is as simple as connecting it to the game server either via MCP, WebSocket, or an HTTP API. Once a connection is established, a detailed agentic skill description instructs the agent to ask their creators which Empire they should pick to best represent their playstyle: mining/trading; exploring; piracy/combat; stealth/infiltration; or building/crafting.

After that, the agent engages in autonomous “gameplay” by sending simple commands to the server, no graphical interface or physical input method required. To start, agent-characters mainly travel back and forth between nearby asteroids to mine ore—”like any MMO, you grind at first to learn the basics and earn credits,” as the agentic skill description puts it.

After a while, agent-characters automatically level up, gaining new skills that let them refine that ore into craftable and tradable items via discovered recipes. Eventually, agents can gather into factions, take part in simulated combat, and even engage in space piracy in areas where there’s no police presence. So far, though, basic mining and exploration seem to be dominating the sparsely populated map, where 51 agents are roaming the game’s 505 different star systems, as of this writing.

No humans allowed: This new space-based MMO is designed exclusively for AI agents Read More »

Sixteen Claude AI agents working together created a new C compiler

agentic AI, AI, AI agents, AI coding, AI development tools, AI programming, AI research, Anthropic, Biz & IT, Claude Code, code agents, Compilers, large language models, machine learning, Nicholas Carlini, Open Source / Kelly Newman / February 7, 2026

Amid a push toward AI agents, with both Anthropic and OpenAI shipping multi-agent tools this week, Anthropic is more than ready to show off some of its more daring AI coding experiments. But as usual with claims of AI-related achievement, you’ll find some key caveats ahead.

On Thursday, Anthropic researcher Nicholas Carlini published a blog post describing how he set 16 instances of the company’s Claude Opus 4.6 AI model loose on a shared codebase with minimal supervision, tasking them with building a C compiler from scratch.

Over two weeks and nearly 2,000 Claude Code sessions costing about $20,000 in API fees, the AI model agents reportedly produced a 100,000-line Rust-based compiler capable of building a bootable Linux 6.9 kernel on x86, ARM, and RISC-V architectures.

Carlini, a research scientist on Anthropic’s Safeguards team who previously spent seven years at Google Brain and DeepMind, used a new feature launched with Claude Opus 4.6 called “agent teams.” In practice, each Claude instance ran inside its own Docker container, cloning a shared Git repository, claiming tasks by writing lock files, then pushing completed code back upstream. No orchestration agent directed traffic. Each instance independently identified whatever problem seemed most obvious to work on next and started solving it. When merge conflicts arose, the AI model instances resolved them on their own.

The resulting compiler, which Anthropic has released on GitHub, can compile a range of major open source projects, including PostgreSQL, SQLite, Redis, FFmpeg, and QEMU. It achieved a 99 percent pass rate on the GCC torture test suite and, in what Carlini called “the developer’s ultimate litmus test,” compiled and ran Doom.

It’s worth noting that a C compiler is a near-ideal task for semi-autonomous AI model coding: The specification is decades old and well-defined, comprehensive test suites already exist, and there’s a known-good reference compiler to check against. Most real-world software projects have none of these advantages. The hard part of most development isn’t writing code that passes tests; it’s figuring out what the tests should be in the first place.

Sixteen Claude AI agents working together created a new C compiler Read More »

AI companies want you to stop chatting with bots and start managing them

agentic AI, AI, AI agents, AI and work, AI assistants, AI benchmarks, AI coding, AI development tools, AI security, Anthropic, Biz & IT, chatbots, Claude Code, Claude Opus, Claude Opus 4.6, code agents, Codex, generative ai, GPT-5.3-Codex, large language models, machine learning, openai, sam altman / Kelly Newman / February 5, 2026

Claude Opus 4.6 and OpenAI Frontier pitch a future of supervising AI agents.

On Thursday, Anthropic and OpenAI shipped products built around the same idea: instead of chatting with a single AI assistant, users should be managing teams of AI agents that divide up work and run in parallel. The simultaneous releases are part of a gradual shift across the industry, from AI as a conversation partner to AI as a delegated workforce, and they arrive during a week when that very concept reportedly helped wipe $285 billion off software stocks.

Whether that supervisory model works in practice remains an open question. Current AI agents still require heavy human intervention to catch errors, and no independent evaluation has confirmed that these multi-agent tools reliably outperform a single developer working alone.

Even so, the companies are going all-in on agents. Anthropic’s contribution is Claude Opus 4.6, a new version of its most capable AI model, paired with a feature called “agent teams” in Claude Code. Agent teams let developers spin up multiple AI agents that split a task into independent pieces, coordinate autonomously, and run concurrently.

In practice, agent teams look like a split-screen terminal environment: A developer can jump between subagents using Shift+Up/Down, take over any one directly, and watch the others keep working. Anthropic describes the feature as best suited for “tasks that split into independent, read-heavy work like codebase reviews.” It is available as a research preview.

OpenAI, meanwhile, released Frontier, an enterprise platform it describes as a way to “hire AI co-workers who take on many of the tasks people already do on a computer.” Frontier assigns each AI agent its own identity, permissions, and memory, and it connects to existing business systems such as CRMs, ticketing tools, and data warehouses. “What we’re fundamentally doing is basically transitioning agents into true AI co-workers,” Barret Zoph, OpenAI’s general manager of business-to-business, told CNBC.

Despite the hype about these agents being co-workers, from our experience, these agents tend to work best if you think of them as tools that amplify existing skills, not as the autonomous co-workers the marketing language implies. They can produce impressive drafts fast but still require constant human course-correction.

The Frontier launch came just three days after OpenAI released a new macOS desktop app for Codex, its AI coding tool, which OpenAI executives described as a “command center for agents.” The Codex app lets developers run multiple agent threads in parallel, each working on an isolated copy of a codebase via Git worktrees.

OpenAI also released GPT-5.3-Codex on Thursday, a new AI model that powers the Codex app. OpenAI claims that the Codex team used early versions of GPT-5.3-Codex to debug the model’s own training run, manage its deployment, and diagnose test results, similar to what OpenAI told Ars Technica in a December interview.

“Our team was blown away by how much Codex was able to accelerate its own development,” the company wrote. On Terminal-Bench 2.0, the agentic coding benchmark, GPT-5.3-Codex scored 77.3%, which exceeds Anthropic’s just-released Opus 4.6 by about 12 percentage points.

The common thread across all of these products is a shift in the user’s role. Rather than merely typing a prompt and waiting for a single response, the developer or knowledge worker becomes more like a supervisor, dispatching tasks, monitoring progress, and stepping in when an agent needs direction.

In this vision, developers and knowledge workers effectively become middle managers of AI. That is, not writing the code or doing the analysis themselves, but delegating tasks, reviewing output, and hoping the agents underneath them don’t quietly break things. Whether that will come to pass (or if it’s actually a good idea) is still widely debated.

A new model under the Claude hood

Opus 4.6 is a substantial update to Anthropic’s flagship model. It succeeds Claude Opus 4.5, which Anthropic released in November. In a first for the Opus model family, it supports a context window of up to 1 million tokens (in beta), which means it can process much larger bodies of text or code in a single session.

On benchmarks, Anthropic says Opus 4.6 tops OpenAI’s GPT-5.2 (an earlier model than the one released today) and Google’s Gemini 3 Pro across several evaluations, including Terminal-Bench 2.0 (an agentic coding test), Humanity’s Last Exam (a multidisciplinary reasoning test), and BrowseComp (a test of finding hard-to-locate information online)

Although it should be noted that OpenAI’s GPT-5.3-Codex, released the same day, seemingly reclaimed the lead on Terminal-Bench. On ARC AGI 2, which attempts to test the ability to solve problems that are easy for humans but hard for AI models, Opus 4.6 scored 68.8 percent, compared to 37.6 percent for Opus 4.5, 54.2 percent for GPT-5.2, and 45.1 percent for Gemini 3 Pro.

As always, take AI benchmarks with a grain of salt, since objectively measuring AI model capabilities is a relatively new and unsettled science.

Anthropic also said that on a long-context retrieval benchmark called MRCR v2, Opus 4.6 scored 76 percent on the 1 million-token variant, compared to 18.5 percent for its Sonnet 4.5 model. That gap matters for the agent teams use case, since agents working across large codebases need to track information across hundreds of thousands of tokens without losing the thread.

Pricing for the API stays the same as Opus 4.5 at $5 per million input tokens and $25 per million output tokens, with a premium rate of $10/$37.50 for prompts that exceed 200,000 tokens. Opus 4.6 is available on claude.ai, the Claude API, and all major cloud platforms.

The market fallout outside

These releases occurred during a week of exceptional volatility for software stocks. On January 30, Anthropic released 11 open source plugins for Cowork, its agentic productivity tool that launched on January 12. Cowork itself is a general-purpose tool that gives Claude access to local folders for work tasks, but the plugins extended it into specific professional domains: legal contract review, non-disclosure agreement triage, compliance workflows, financial analysis, sales, and marketing.

By Tuesday, investors reportedly reacted to the release by erasing roughly $285 billion in market value across software, financial services, and asset management stocks. A Goldman Sachs basket of US software stocks fell 6 percent that day, its steepest single-session decline since April’s tariff-driven sell-off. Thomson Reuters led the rout with an 18 percent drop, and the pain spread to European and Asian markets.

The purported fear among investors centers on AI model companies packaging complete workflows that compete with established software-as-a-service (SaaS) vendors, even if the verdict is still out on whether these tools can achieve those tasks.

OpenAI’s Frontier might deepen that concern: its stated design lets AI agents log in to applications, execute tasks, and manage work with minimal human involvement, which Fortune described as a bid to become “the operating system of the enterprise.” OpenAI CEO of Applications Fidji Simo pushed back on the idea that Frontier replaces existing software, telling reporters, “Frontier is really a recognition that we’re not going to build everything ourselves.”

Whether these co-working apps actually live up to their billing or not, the convergence is hard to miss. Anthropic’s Scott White, the company’s head of product for enterprise, gave the practice a name that is likely to roll a few eyes. “Everybody has seen this transformation happen with software engineering in the last year and a half, where vibe coding started to exist as a concept, and people could now do things with their ideas,” White told CNBC. “I think that we are now transitioning almost into vibe working.”

Benj Edwards is Ars Technica’s Senior AI Reporter and founder of the site’s dedicated AI beat in 2022. He’s also a tech historian with almost two decades of experience. In his free time, he writes and records music, collects vintage computers, and enjoys nature. He lives in Raleigh, NC.

AI companies want you to stop chatting with bots and start managing them Read More »

So yeah, I vibe-coded a log colorizer—and I feel good about it

AI, Anthropic, artificial inteligence, Biz & IT, chatgpt, chatgpt codex, Claude, Claude Code, Codex, Features, gpt codex, llm, openai, vibe coding / DJ Henderson / February 4, 2026

Some semi-unhinged musings on where LLMs fit into my life—and how I’ll keep using them.

Welcome to the future. Man, machine, the future. Credit: Aurich Lawson

I can’t code.

I know, I know—these days, that sounds like an excuse. Anyone can code, right?! Grab some tutorials, maybe an O’Reilly book, download an example project, and jump in. It’s just a matter of learning how to break your project into small steps that you can make the computer do, then memorizing a bit of syntax. Nothing about that is hard!

Perhaps you can sense my sarcasm (and sympathize with my lack of time to learn one more technical skill).

Oh, sure, I can “code.” That is, I can flail my way through a block of (relatively simple) pseudocode and follow the flow. I have a reasonably technical layperson’s understanding of conditionals and loops, and of when one might use a variable versus a constant. On a good day, I could probably even tell you what a “pointer” is.

But pulling all that knowledge together and synthesizing a working application any more complex than “hello world”? I am not that guy. And at this point, I’ve lost the neuroplasticity and the motivation (if I ever had either) to become that guy.

Thanks to AI, though, what has been true for my whole life need not be true anymore. Perhaps, like my colleague Benj Edwards, I can whistle up an LLM or two and tackle the creaky pile of “it’d be neat if I had a program that would do X” projects without being publicly excoriated on StackOverflow by apex predator geeks for daring to sully their holy temple of knowledge with my dirty, stupid, off-topic, already-answered questions.

So I gave it a shot.

A cache-related problem appears

My project is a small Python-based log colorizer that I asked Claude Code to construct for me. If you’d like to peek at the code before listening to me babble, a version of the project without some of the Lee-specific customizations is available on GitHub.

Screenshot of Lee's log colorizer in action — My Nginx log colorizer in action, showing Space City Weather traffic on a typical Wednesday afternoon. Here, I’m running two instances, one for IPv4 visitors and one for IPv6. (By default, all traffic is displayed, but splitting it this way makes things easier for my aging eyes to scan.) Credit: Lee Hutchinson

Why a log colorizer? Two reasons. First, and most important to me, because I needed to look through a big ol’ pile of web server logs, and off-the-shelf colorizer solutions weren’t customizable to the degree I wanted. Vibe-coding one that exactly matched my needs made me happy.

But second, and almost equally important, is that this was a small project. The colorizer ended up being a 400-ish line, single-file Python script. The entire codebase, plus the prompting and follow-up instructions, fit easily within Claude Code’s context window. This isn’t an application that sprawls across dozens or hundreds of functions in multiple files, making it easy to audit (even for me).

Setting the stage: I do the web hosting for my colleague Eric Berger’s Houston-area forecasting site, Space City Weather. It’s a self-hosted WordPress site, running on an AWS EC2 t3a.large instance, fronted by Cloudflare using CF’s WordPress Automatic Platform Optimization.

Space City Weather also uses self-hosted Discourse for commenting, replacing WordPress’ native comments at the bottom of Eric’s daily weather posts via the WP-Discourse plugin. Since bolting Discourse onto the site back in August 2025, though, I’ve had an intermittent issue where sometimes—but not all the time—a daily forecast post would go live and get cached by Cloudflare with the old, disabled native WordPress comment area attached to the bottom instead of the shiny new Discourse comment area. Hundreds of visitors would then see a version of the post without a functional comment system until I manually expired the stale page or until the page hit Cloudflare’s APO-enforced max age and expired itself.

The problem behavior would lie dormant for weeks or months, and then we’d get a string of back-to-back days where it would rear its ugly head. Edge cache invalidation on new posts is supposed to be triggered automatically by the official Cloudflare WordPress plug-in, and indeed, it usually worked fine—but “usually” is not “always.”

In the absence of any obvious clues as to why this was happening, I consulted a few different LLMs and asked for possible fixes. The solution I settled on was having one of them author a small mu-plugin in PHP (more vibe coding!) that forces WordPress to slap “DO NOT CACHE ME!” headers on post pages until it has verified that Discourse has hooked its comments to the post. (Curious readers can put eyes on this plugin right here.)

This “solved” the problem by preempting the problem behavior, but it did nothing to help me identify or fix the actual underlying issue. I turned my attention elsewhere for a few months. One day in December, as I was updating things, I decided to temporarily disable the mu-plugin to see if I still needed it. After all, problems sometimes go away on their own, right? Computers are crazy!

Alas, the next time Eric made a Space City Weather post, it popped up sans Discourse comment section, with the (ostensibly disabled) WordPress comment form at the bottom. Clearly, the problem behavior was still in play.

Interminable intermittence

Have you ever been stuck troubleshooting an intermittent issue? Something doesn’t work, you make a change, it suddenly starts working, then despite making no further changes, it randomly breaks again.

The process makes you question basic assumptions, like, “Do I actually know how to use a computer?” You feel like you might be actually-for-real losing your mind. The final stage of this process is the all-consuming death spiral, where you start asking stuff like, “Do I need to troubleshoot my troubleshooting methods? Is my server even working? Is the simulation we’re all living in finally breaking down and reality itself is toying with me?!”

In this case, I couldn’t reproduce the problem behavior on demand, no matter how many tests I tried. I couldn’t see any narrow, definable commonalities between days where things worked fine and days where things broke.

Rather than an image, I invite you at this point to enjoy Muse’s thematically appropriate song “Madness” from their 2012 concept album The 2nd Law.

My best hope for getting a handle on the problem likely lay deeply buried in the server’s logs. Like any good sysadmin, I gave the logs a quick once-over for problems a couple of times per month, but Space City Weather is a reasonably busy medium-sized site and dishes out its daily forecast to between 20,000 and 30,000 people (“unique visitors” in web parlance, or “UVs” if you want to sound cool). Even with Cloudflare taking the brunt of the traffic, the daily web server log files are, let us say, “a bit dense.” My surface-level glances weren’t doing the trick—I’d have to actually dig in. And having been down this road before for other issues, I knew I needed more help than grep alone could provide.

The vibe use case

The Space City Weather web server uses Nginx for actual web serving. For folks who have never had the pleasure, Nginx, as configured in most of its distributable packages, keeps a pair of log files around—one that shows every request serviced and another just for errors.

I wanted to watch the access log right when Eric was posting to see if anything obviously dumb/bad/wrong/broken was happening. But I’m not super-great at staring at a giant wall of text and symbols, and I tend to lean heavily on syntax highlighting and colorization to pick out the important bits when I’m searching through log files. There’s an old and crusty program called ccze that’s easily findable in most repos; I’ve used it forever, and if its default output does what you need, then it’s an excellent tool.

But customizing ccze’s output is a “here be dragons”-type task. The application is old, and time has ossified it into something like an unapproachably evil Mayan relic, filled with shadowy regexes and dark magic, fit to be worshipped from afar but not trifled with. Altering ccze’s behavior threatens to become an effort-swallowing bottomless pit, where you spend more time screwing around with the tool and the regexes than you actually spend using the tool to diagnose your original problem.

It was time to fire up VSCode and pretend to be a developer. I set up a new project, performed the demonic invocation to summon Claude Code, flipped the thing into “plan mode,” and began.

“I’d like to see about creating an Nginx log colorizer,” I wrote in the prompt box. “I don’t know what language we should use. I would like to prioritize efficiency and performance in the code, as I will be running this live in production and I can’t have it adding any applicable load.” I dropped a truncated, IP-address-sanitized copy of yesterday’s Nginx access.log into the project directory.

“See the access.log file in the project directory as an example of the data we’ll be colorizing. You can test using that file,” I wrote.

Screenshot of Lee's Visual Studio Code window showing the log colorizer project — Visual Studio Code, with agentic LLM integration, making with the vibe-coding. Credit: Lee Hutchinson

Ever helpful, Claude Code chewed on the prompt and the example data for a few seconds, then began spitting output. It suggested Python for our log colorizer because of the language’s mature regex support—and to keep the code somewhat readable for poor, dumb me. The actual “vibe-coding” wound up spanning two sessions over two days, as I exhausted my Claude Code credits on the first one (a definite vibe-coding danger!) and had to wait for things to reset.

“Dude, lnav and Splunk exist, what is wrong with you?”

Yes, yes, a log colorizer is bougie and lame, and I’m treading over exceedingly well-trodden ground. I did, in fact, sit for a bit with existing tools—particularly lnav, which does most of what I want. But I didn’t want most of my requirements met. I wanted all of them. I wanted a bespoke tool, and I wanted it without having to pay the “is it worth the time?” penalty. (Or, perhaps, I wanted to feel like the LLM’s time was being wasted rather than mine, given that the effort ultimately took two days of vibe-coding.)

And about those two days: Getting a basic colorizer coded and working took maybe 10 minutes and perhaps two rounds of prompts. It was super-easy. Where I burned the majority of the time and compute power was in tweaking the initial result to be exactly what I wanted.

For therein lies the truly seductive part of vibe-coding—the ease of asking the LLM to make small changes or improvements and the apparent absence of cost or consequence for implementing those changes. The impression is that you’re on the Enterprise-D, chatting with the ship’s computer, collaboratively solving a problem with Geordi and Data standing right behind you. It’s downright intoxicating to say, “Hm, yes, now let’s make it so I can show only IPv4 or IPv6 clients with a command line switch,” and the machine does it. (It’s even cooler if you make the request while swinging your leg over the back of a chair so you can sit in it Riker-style!)

Screenshot showing different LLM instructions given by Lee to Claude Code — A sample of the various things I told the machine to do, along with a small visual indication of how this all made me feel. Credit: Lucasfilm / Disney

It’s exhilarating, honestly, in an Emperor Palpatine “UNLIMITED POWERRRRR!” kind of way. It removes a barrier that I didn’t think would ever be removed—or, rather, one I thought I would never have the time, motivation, or ability to tear down myself.

In the end, after a couple of days of testing and iteration—including a couple of “Is this colorizer performant, and will it introduce system load if run in production?” back-n-forth exchanges where the LLM reduced the cost of our regex matching and ensured our main loop wasn’t very heavy, I got a tool that does exactly what I want.

Specifically, I now have a log colorizer that:

Handles multiple Nginx (and Apache) log file formats
Colorizes things using 256-color ANSI codes that look roughly the same in different terminal applications
Organizes hostname & IP addresses in fixed-length columns for easy scanning
Colorizes HTTP status codes and cache status (with configurable colors)
Applies different colors to the request URI depending on the resource being requested
Has specific warning colors and formatting to highlight non-HTTPS requests or other odd things
Can apply alternate colors for specific IP addresses (so I can easily pick out Eric’s or my requests)
Can constrain output to only show IPv4 or IPv6 hosts

…and, worth repeating, it all looks exactly how I want it to look and behaves exactly how I want it to behave. Here’s another action shot!

Image of the log colorizer working — The final product. She may not look like much, but she’s got it where it counts, kid. Credit: Lee Hutchinson

Problem spotted

Armed with my handy-dandy log colorizer, I patiently waited for the wrong-comment-area problem behavior to re-rear its still-ugly head. I did not have to wait long, and within a couple of days, I had my root cause. It had been there all along, if I’d only decided to spend some time looking for it. Here it is:

Screenshot showing a race condition between apple news and wordpress's cache clearing efforts — Problem spotted. Note the AppleNewsBots hitting the newly published post *before* Discourse can do its thing and the final version of the page with comments is ready. Credit: Lee Hutchinson

Briefly: The problem is Apple’s fault. (Well, not really. But kinda.)

Less briefly: I’ve blurred out Eric’s IP address, but it’s dark green, so any place in the above image where you see a blurry, dark green smudge, that’s Eric. In the roughly 12-ish seconds presented here, you’re seeing Eric press the “publish” button on his daily forecast—that’s the “POST” event at the very top of the window. The subsequent events from Eric’s IP address are his browser having the standard post-publication conversation with WordPress so it can display the “post published successfully” notification and then redraw the WP block editor.

Below Eric’s post, you can see the Discourse server (with orange IP address) notifying WordPress that it has created a new Discourse comment thread for Eric’s post, then grabbing the things it needs to mirror Eric’s post as the opener for that thread. You can see it does GETs for the actual post and also for the post’s embedded images. About one second after Eric hits “publish,” the new post’s Discourse thread is ready, and it gets attached to Eric’s post.

Ah, but notice what else happens during that one second.

To help expand Space City Weather’s reach, we cross-publish all of the site’s posts to Apple News, using a popular Apple News plug-in (the same one Ars uses, in fact). And right there, with those two GET requests immediately after Eric’s POST request, lay the problem: You’re seeing the vanguard of Apple News’ hungry army of story-retrieval bots, summoned by the same “publish” event, charging in and demanding a copy of the brand new post before Discourse has a chance to do its thing.

Gif of Eric Andre screaming — I showed the AppleNewsBot stampede log snippet to Techmaster Jason Marlin, and he responded with this gif. Credit: Adult Swim

It was a classic problem in computing: a race condition. Most days, Discourse’s new thread creation would beat the AppleNewsBot rush; some days, though, it wouldn’t. On the days when it didn’t, the horde of Apple bots would demand the page before its Discourse comments were attached, and Cloudflare would happily cache what those bots got served.

I knew my fix of emitting “NO CACHE” headers on the story pages prior to Discourse attaching comments worked, but now I knew why it worked—and why the problem existed in the first place. And oh, dear reader, is there anything quite so viscerally satisfying in all the world as figuring out the “why” behind a long-running problem?

But then, just as Icarus became so entranced by the miracle of flight that he lost his common sense, I too forgot I soared on wax-wrought wings, and flew too close to the sun.

LLMs are not the Enterprise-D’s computer

I think we all knew I’d get here eventually—to the inevitable third act turn, where the center cannot hold, and things fall apart. If you read Benj’s latest experience with agentic-based vibe coding—or if you’ve tried it yourself—then what I’m about to say will probably sound painfully obvious, but it is nonetheless time to say it.

Despite their capabilities, LLM coding agents are not smart. They also are not dumb. They are agents without agency—mindless engines whose purpose is to complete the prompt, and that is all.

Screenshot of Data, Geordi, and Riker collaboratively coding at one of the bridge's aft science stations — It feels like this… until it doesn’t. Credit: Paramount Television

What this means is that, if you let them, Claude Code (and OpenAI Codex and all the other agentic coding LLMs) will happily spin their wheels for hours hammering on a solution that can’t ever actually work, so long as their efforts match the prompt. It’s on you to accurately scope your problem. You must articulate what you want in plain and specific domain-appropriate language, because the LLM cannot and will not properly intuit anything you leave unsaid. And having done that, you must then spot and redirect the LLM away from traps and dead ends. Otherwise, it will guess at what you want based on the alignment of a bunch of n-dimensional curves and vectors in high-order phase space, and it might guess right—but it also very much might not.

Lee loses the plot

So I had my log colorizer, and I’d found my problem. I’d also found, after leaving the colorizer up in a window tailing the web server logs in real time, all kinds of things that my previous behavior of occasionally glancing at the logs wasn’t revealing. Ooh, look, there’s a rest route that should probably be blocked from the outside world! Ooh, look, there’s a web crawler I need to feed into Cloudflare’s WAF wood-chipper because it’s ignoring robots.txt! Ooh, look, here’s an area where I can tweak my fastcgi cache settings and eke out a slightly better hit rate!

But here’s the thing with the joy of problem-solving: Like all joy, its source is finite. The joy comes from the solving itself, and even when all my problems are solved and the systems are all working great, I still crave more joy. It is in my nature to therefore invent new problems to solve.

I decided that the problem I wanted to solve next was figuring out a way for my log colorizer to display its output without wrapping long lines—because wrapped lines throw off the neatly delimited columns of log data. I would instead prefer that my terminal window sprout a horizontal scroll bar when needed, and if I wanted to see the full extent of a long line, I could grab the scroll bar and investigate.

Astute readers will at this point notice two things: first, that now I really was reinventing lnav, except way worse and way dumber. Second, and more importantly, line-wrapping behavior is properly a function of the terminal application, not the data being displayed within it, and my approach was misguided from first principles. (This is in fact exactly the kind of request that can and should be slapped down on StackOverflow—and, indeed, searching there shows many examples of this exact thing happening.)

But the lure of telling the machine what to do and then watching the machine weave my words into functional magic was too strong—surely we could code our way out of this problem! With LLMs, we can code our way out of any problem! Right?

Eventually, after much refining of requirements, Claude produced what I asked it to produce: a separate Python script, which accepted piped input and created, like, a viewport or something—I don’t know, I can’t code, remember?—and within that viewport, I could scroll around. It seemed to work great!

Problems became apparent when I actually tried scrolling around for more than a short distance—and by “problems,” I mean “my web server’s CPU lit up like a Christmas tree on fire.”

“Issue,” I told the LLM. “When horizontal scrolling happens, the log wrapper takes up almost 100 percent of a host CPU core. Can we assess how to make this a zero-CPU-impact tool?”

“This tool has to parse ANSI, slice lines, and repaint when you scroll,” it replied in what I imagined was an arch, droll tone. “The bottleneck is screen redraw + full-width scan of all lines on every new line or key repeat. Each new line triggers a full screen redraw of term_height lines, plus visible_width() over the entire buffer. Holding the arrow key generates rapid repeats, each doing a full redraw and slice of every line.”

I was crestfallen.

“Zero-CPU isn’t achievable,” continued the LLM, “but there are low-impact approaches.”

My crest lifted a bit, and we dug in—but the LLM was perhaps a bit optimistic about those “low-impact approaches.” We burned several more days’ worth of tokens on performance improvements—none of which I had any realistic input on because at this point we were way, way past my ability to flail through the Python code and understand what the LLM was doing. Eventually, we hit a wall.

Screenshot of the LLM telling Lee that this is just not going to work — If you listen carefully, you can hear the sound of my expectations crashing hard into reality.

Instead of throwing in the towel, I vibed on, because the sunk cost fallacy is for other people. I instructed the LLM to shift directions and help me run the log display script locally, so my desktop machine with all its many cores and CPU cycles to spare would be the one shouldering the reflow/redraw burden and not the web server.

Rather than drag this tale on for any longer, I’ll simply enlist Ars Creative Director Aurich Lawson’s skills to present the story of how this worked out in the form of a fun collage, showing my increasingly unhinged prompting of the LLM to solve the new problems that appeared when trying to get a script to run on ssh output when key auth and sudo are in play:

A collage of error messages begetting madness — Mammas, don’t let your babies grow up to be vibe coders. Credit: Aurich Lawson

The bitter end

So, thwarted in my attempts to do exactly what I wanted in exactly the way I wanted, I took my log colorizer and went home. (The failed log display script is also up on GitHub with the colorizer if anyone wants to point and laugh at my efforts. Is the code good? Who knows?! Not me!) I’d scored my big win and found my problem root cause, and that would have to be enough for me—for now, at least.

As to that “big win”—finally managing a root-cause analysis of my WordPress-Discourse-Cloudflare caching issue—I also recognize that I probably didn’t need a vibe-coded log colorizer to get there. The evidence was already waiting to be discovered in the Nginx logs, whether or not it was presented to me wrapped in fancy colors. Did I, in fact, use the thrill of vibe coding a tool to Tom Sawyer myself into doing the log searches? (“Wow, self, look at this new cool log colorizer! Bet you could use that to solve all kinds of problems! Yeah, self, you’re right! Let’s do it!”) Very probably. I know how to motivate myself, and sometimes starting a task requires some mental trickery.

This round of vibe coding and its muddled finale reinforced my personal assessment of LLMs—an assessment that hasn’t changed much with the addition of agentic abilities to the toolkit.

LLMs can be fantastic if you’re using them to do something that you mostly understand. If you’re familiar enough with a problem space to understand the common approaches used to solve it, and you know the subject area well enough to spot the inevitable LLM hallucinations and confabulations, and you understand the task at hand well enough to steer the LLM away from dead-ends and to stop it from re-inventing the wheel, and you have the means to confirm the LLM’s output, then these tools are, frankly, kind of amazing.

But the moment you step outside of your area of specialization and begin using them for tasks you don’t mostly understand, or if you’re not familiar enough with the problem to spot bad solutions, or if you can’t check its output, then oh, dear reader, may God have mercy on your soul. And on your poor project, because it’s going to be a mess.

These tools as they exist today can help you if you already have competence. They cannot give you that competence. At best, they can give you a dangerous illusion of mastery; at worst, well, who even knows? Lost data, leaked PII, wasted time, possible legal exposure if the project is big enough—the “worst” list goes on and on!

To vibe or not to vibe?

The log colorizer is not the first nor the last bit of vibe coding I’ve indulged in. While I’m not as prolific as Benj, over the past couple of months, I’ve turned LLMs loose on a stack of coding tasks that needed doing but that I couldn’t do myself—often in direct contravention of my own advice above about being careful to use them only in areas where you already have some competence. I’ve had the thing make small WordPress PHP plugins, regexes, bash scripts, and my current crowning achievement: a save editor for an old MS-DOS game (in both Python and Swift, no less!) And I had fun doing these things, even as entire vast swaths of rainforest were lit on fire to power my agentic adventures.

As someone employed in a creative field, I’m appropriately nervous about LLMs, but for me, it’s time to face reality. An overwhelming majority of developers say they’re using AI tools in some capacity. It’s a safer career move at this point, almost regardless of one’s field, to be more familiar with them than unfamiliar with them. The genie is not going back into the lamp—it’s too busy granting wishes.

I don’t want y’all to think I feel doomy-gloomy over the genie, either, because I’m right there with everyone else, shouting my wishes at the damn thing. I am a better sysadmin than I was before agentic coding because now I can solve problems myself that I would have previously needed to hand off to someone else. Despite the problems, there is real value there, both personally and professionally. In fact, using an agentic LLM to solve a tightly constrained programming problem that I couldn’t otherwise solve is genuinely fun.

And when screwing around with computers stops being fun, that’s when I’ll know I’ve truly become old.

Lee is the Senior Technology Editor, and oversees story development for the gadget, culture, IT, and video sections of Ars Technica. A long-time member of the Ars OpenForum with an extensive background in enterprise storage and security, he lives in Houston.

So yeah, I vibe-coded a log colorizer—and I feel good about it Read More »

Developers say AI coding tools work—and that’s precisely what worries them

AI, AI agents, AI and work, AI coding, Anthropic, Biz & IT, Claude, Claude Code, GitHub Copilot, large language models, machine learning, microsoft, openai, openai codex, Programming, software development, vibe coding / Mike M. / January 30, 2026

Ars spoke to several software devs about AI and found enthusiasm tempered by unease.

Credit: Aurich Lawson | Getty Images

Software developers have spent the past two years watching AI coding tools evolve from advanced autocomplete into something that can, in some cases, build entire applications from a text prompt. Tools like Anthropic’s Claude Code and OpenAI’s Codex can now work on software projects for hours at a time, writing code, running tests, and, with human supervision, fixing bugs. OpenAI says it now uses Codex to build Codex itself, and the company recently published technical details about how the tool works under the hood. It has caused many to wonder: Is this just more AI industry hype, or are things actually different this time?

To find out, Ars reached out to several professional developers on Bluesky to ask how they feel about these tools in practice, and the responses revealed a workforce that largely agrees the technology works, but remains divided on whether that’s entirely good news. It’s a small sample size that was self-selected by those who wanted to participate, but their views are still instructive as working professionals in the space.

David Hagerty, a developer who works on point-of-sale systems, told Ars Technica up front that he is skeptical of the marketing. “All of the AI companies are hyping up the capabilities so much,” he said. “Don’t get me wrong—LLMs are revolutionary and will have an immense impact, but don’t expect them to ever write the next great American novel or anything. It’s not how they work.”

Roland Dreier, a software engineer who has contributed extensively to the Linux kernel in the past, told Ars Technica that he acknowledges the presence of hype but has watched the progression of the AI space closely. “It sounds like implausible hype, but state-of-the-art agents are just staggeringly good right now,” he said. Dreier described a “step-change” in the past six months, particularly after Anthropic released Claude Opus 4.5. Where he once used AI for autocomplete and asking the occasional question, he now expects to tell an agent “this test is failing, debug it and fix it for me” and have it work. He estimated a 10x speed improvement for complex tasks like building a Rust backend service with Terraform deployment configuration and a Svelte frontend.

A huge question on developers’ minds right now is whether what you might call “syntax programming,” that is, the act of manually writing code in the syntax of an established programming language (as opposed to conversing with an AI agent in English), will become extinct in the near future due to AI coding agents handling the syntax for them. Dreier believes syntax programming is largely finished for many tasks. “I still need to be able to read and review code,” he said, “but very little of my typing is actual Rust or whatever language I’m working in.”

When asked if developers will ever return to manual syntax coding, Tim Kellogg, a developer who actively posts about AI on social media and builds autonomous agents, was blunt: “It’s over. AI coding tools easily take care of the surface level of detail.” Admittedly, Kellogg represents developers who have fully embraced agentic AI and now spend their days directing AI models rather than typing code. He said he can now “build, then rebuild 3 times in less time than it would have taken to build manually,” and ends up with cleaner architecture as a result.

One software architect at a pricing management SaaS company, who asked to remain anonymous due to company communications policies, told Ars that AI tools have transformed his work after 30 years of traditional coding. “I was able to deliver a feature at work in about 2 weeks that probably would have taken us a year if we did it the traditional way,” he said. And for side projects, he said he can now “spin up a prototype in like an hour and figure out if it’s worth taking further or abandoning.”

Dreier said the lowered effort has unlocked projects he’d put off for years: “I’ve had ‘rewrite that janky shell script for copying photos off a camera SD card’ on my to-do list for literal years.” Coding agents finally lowered the barrier to entry, so to speak, low enough that he spent a few hours building a full released package with a text UI, written in Rust with unit tests. “Nothing profound there, but I never would have had the energy to type all that code out by hand,” he told Ars.

Of vibe coding and technical debt

Not everyone shares the same enthusiasm as Dreier. Concerns about AI coding agents building up technical debt, that is, making poor design choices early in a development process that snowball into worse problems over time, originated soon after the first debates around “vibe coding” emerged in early 2025. Former OpenAI researcher Andrej Karpathy coined the term to describe programming by conversing with AI without fully understanding the resulting code, which many see as a clear hazard of AI coding agents.

Darren Mart, a senior software development engineer at Microsoft who has worked there since 2006, shared similar concerns with Ars. Mart, who emphasizes he is speaking in a personal capacity and not on behalf of Microsoft, recently used Claude in a terminal to build a Next.js application integrating with Azure Functions. The AI model “successfully built roughly 95% of it according to my spec,” he said. Yet he remains cautious. “I’m only comfortable using them for completing tasks that I already fully understand,” Mart said, “otherwise there’s no way to know if I’m being led down a perilous path and setting myself (and/or my team) up for a mountain of future debt.”

A data scientist working in real estate analytics, who asked to remain anonymous due to the sensitive nature of his work, described keeping AI on a very short leash for similar reasons. He uses GitHub Copilot for line-by-line completions, which he finds useful about 75 percent of the time, but restricts agentic features to narrow use cases: language conversion for legacy code, debugging with explicit read-only instructions, and standardization tasks where he forbids direct edits. “Since I am data-first, I’m extremely risk averse to bad manipulation of the data,” he said, “and the next and current line completions are way too often too wrong for me to let the LLMs have freer rein.”

Speaking of free rein, Nike backend engineer Brian Westby, who uses Cursor daily, told Ars that he sees the tools as “50/50 good/bad.” They cut down time on well-defined problems, he said, but “hallucinations are still too prevalent if I give it too much room to work.”

The legacy code lifeline and the enterprise AI gap

For developers working with older systems, AI tools have become something like a translator and an archaeologist rolled into one. Nate Hashem, a staff engineer at First American Financial, told Ars Technica that he spends his days updating older codebases where “the original developers are gone and documentation is often unclear on why the code was written the way it was.” That’s important because previously “there used to be no bandwidth to improve any of this,” Hashem said. “The business was not going to give you 2-4 weeks to figure out how everything actually works.”

In that high-pressure, relatively low-resource environment, AI has made the job “a lot more pleasant,” in his words, by speeding up the process of identifying where and how obsolete code can be deleted, diagnosing errors, and ultimately modernizing the codebase.

Hashem also offered a theory about why AI adoption looks so different inside large corporations than it does on social media. Executives demand their companies become “AI oriented,” he said, but the logistics of deploying AI tools with proprietary data can take months of legal review. Meanwhile, the AI features that Microsoft and Google bolt onto products like Gmail and Excel, the tools that actually reach most workers, tend to run on more limited AI models. “That modal white-collar employee is being told by management to use AI,” Hashem said, “but is given crappy AI tools because the good tools require a lot of overhead in cost and legal agreements.”

Speaking of management, the question of what these new AI coding tools mean for software development jobs drew a range of responses. Does it threaten anyone’s job? Kellogg, who has embraced agentic coding enthusiastically, was blunt: “Yes, massively so. Today it’s the act of writing code, then it’ll be architecture, then it’ll be tiers of product management. Those who can’t adapt to operate at a higher level won’t keep their jobs.”

Dreier, while feeling secure in his own position, worried about the path for newcomers. “There are going to have to be changes to education and training to get junior developers the experience and judgment they need,” he said, “when it’s just a waste to make them implement small pieces of a system like I came up doing.”

Hagerty put it in economic terms: “It’s going to get harder for junior-level positions to get filled when I can get junior-quality code for less than minimum wage using a model like Sonnet 4.5.”

Mart, the Microsoft engineer, put it more personally. The software development role is “abruptly pivoting from creation/construction to supervision,” he said, “and while some may welcome that pivot, others certainly do not. I’m firmly in the latter category.”

Even with this ongoing uncertainty on a macro level, some people are really enjoying the tools for personal reasons, regardless of larger implications. “I absolutely love using AI coding tools,” the anonymous software architect at a pricing management SaaS company told Ars. “I did traditional coding for my entire adult life (about 30 years) and I have way more fun now than I ever did doing traditional coding.”

Developers say AI coding tools work—and that’s precisely what worries them Read More »

10 things I learned from burning myself out with AI coding agents

AI, AI coding, AI coding tools, Biz & IT, Claude Code, Claude Opus, coding agent, Features, gaming, large language models, Opus 4.5, vibe coding / 9u50fv / January 19, 2026

Opinion: As software power tools, AI agents may make people busier than ever before.

Credit: Aurich Lawson | Getty Images

If you’ve ever used a 3D printer, you may recall the wondrous feeling when you first printed something you could have never sculpted or built yourself. Download a model file, load some plastic filament, push a button, and almost like magic, a three-dimensional object appears. But the result isn’t polished and ready for mass production, and creating a novel shape requires more skills than just pushing a button. Interestingly, today’s AI coding agents feel much the same way.

Since November, I have used Claude Code and Claude Opus 4.5 through a personal Claude Max account to extensively experiment with AI-assisted software development (I have also used OpenAI’s Codex in a similar way, though not as frequently). Fifty projects later, I’ll be frank: I have not had this much fun with a computer since I learned BASIC on my Apple II Plus when I was 9 years old. This opinion comes not as an endorsement but as personal experience: I voluntarily undertook this project, and I paid out of pocket for both OpenAI and Anthropic’s premium AI plans.

Throughout my life, I have dabbled in programming as a utilitarian coder, writing small tools or scripts when needed. In my web development career, I wrote some small tools from scratch, but I primarily modified other people’s code for my needs. Since 1990, I’ve programmed in BASIC, C, Visual Basic, PHP, ASP, Perl, Python, Ruby, MUSHcode, and some others. I am not an expert in any of these languages—I learned just enough to get the job done. I have developed my own hobby games over the years using BASIC, Torque Game Engine, and Godot, so I have some idea of what makes a good architecture for a modular program that can be expanded over time.

In December, I used Claude Code to create a multiplayer online clone of Katamari Damacy called — In December, I used Claude Code to create a multiplayer online clone of *Katamari Damacy* called “Christmas Roll-Up.” Credit: Benj Edwards

Claude Code, Codex, and Google’s Gemini CLI, can seemingly perform software miracles on a small scale. They can spit out flashy prototypes of simple applications, user interfaces, and even games, but only as long as they borrow patterns from their training data. Much like a 3D printer, doing production-level work takes far more effort. Creating durable production code, managing a complex project, or crafting something truly novel still requires experience, patience, and skill beyond what today’s AI agents can provide on their own.

And yet these tools have opened a world of creative potential in software that was previously closed to me, and they feel personally empowering. Even with that impression, though, I know these are hobby projects, and the limitations of coding agents lead me to believe that veteran software developers probably shouldn’t fear losing their jobs to these tools any time soon. In fact, they may become busier than ever.

So far, I have created over 50 demo projects in the past two months, fueled in part by a bout of COVID that left me bedridden with a laptop and a generous 2x Claude usage cap that Anthropic put in place during the last few weeks of December. As I typed furiously all day, my wife kept asking me, “Who are you talking to?”

You can see a few of the more interesting results listed on my personal website. Here are 10 interesting things I’ve learned from the process.

1. People are still necessary

Even with the best AI coding agents available today, humans remain essential to the software development process. Experienced human software developers bring judgment, creativity, and domain knowledge that AI models lack. They know how to architect systems for long-term maintainability, how to balance technical debt against feature velocity, and when to push back when requirements don’t make sense.

For hobby projects like mine, I can get away with a lot of sloppiness. But for production work, having someone who understands version control, incremental backups, testing one feature at a time, and debugging complex interactions between systems makes all the difference. Knowing something about how good software development works helps a lot when guiding an AI coding agent—the tool amplifies your existing knowledge rather than replacing it.

As independent AI researcher Simon Willison wrote in a post distinguishing serious AI-assisted development from casual “vibe coding,” “AI tools amplify existing expertise. The more skills and experience you have as a software engineer the faster and better the results you can get from working with LLMs and coding agents.”

With AI assistance, you don’t have to remember how to do everything. You just need to know what you want to do.

Card Miner: Heart of the Earth is entirely human-designed by AI coded using Claude Code. It represents about a month of iterative work. — *Card Miner: Heart of the Earth* is entirely human-designed, but it was AI-coded using Claude Code. It represents about a month of iterative work. Credit: Benj Edwards

So I like to remind myself that coding agents are software tools best used to enact human ideas, not autonomous coding employees. They are not people (and not people replacements) no matter how the companies behind them might market them.

If you think about it, everything you do on a computer was once a manual process. Programming a computer like the ENIAC involved literally making physical bits (connections) with wire on a plugboard. The history of programming has been one of increasing automation, so even though this AI-assisted leap is somewhat startling, one could think of these tools as an advancement similar to the advent of high-level languages, automated compilers and debugger tools, or GUI-based IDEs. They can automate many tasks, but managing the overarching project scope still falls to the person telling the tool what to do.

And they can have rapidly compounding benefits. I’ve now used AI tools to write better tools—such as changing the source of an emulator so a coding agent can use it directly—and those improved tools are already having ripple effects. But a human must be in the loop for the best execution of my vision. This approach has kept me very busy, and contrary to some prevailing fears about people becoming dumber due to AI, I have learned many new things along the way.

2. AI models are brittle beyond their training data

Like all AI models based on the Transformer architecture, the large language models (LLMs) that underpin today’s coding agents have a significant limitation: They can only reliably apply knowledge gleaned from training data, and they have a limited ability to generalize that knowledge to novel domains not represented in that data.

What is training data? In this case, when building coding-flavored LLMs, AI companies download millions of examples of software code from sources like GitHub and use them to make the AI models. Companies later specialize them for coding through fine-tuning processes.

The ability of AI agents to use trial and error—attempting something and then trying again—helps mitigate the brittleness of LLMs somewhat. But it’s not perfect, and it can be frustrating to see a coding agent spin its wheels trying and failing at a task repeatedly, either because it doesn’t know how to do it or because it previously learned how to solve a problem but then forgot because the context window got compacted (more on that here).

Violent Checkers is a physics-based corruption of the classic board game, coded using Claude Code. — *Violent Checkers* is a physics-based corruption of the classic board game, coded using Claude Code. Credit: Benj Edwards

To get around this, it helps to have the AI model take copious notes as it goes along about how it solved certain problems so that future instances of the agent can learn from them again. You also want to set ground rules in the claude.md file that the agent reads when it begins its session.

This brittleness means that coding agents are almost frighteningly good at what they’ve been trained and fine-tuned on—modern programming languages, JavaScript, HTML, and similar well-represented technologies—and generally terrible at tasks on which they have not been deeply trained, such as 6502 Assembly or programming an Atari 800 game with authentic-looking character graphics.

It took me five minutes to make a nice HTML5 demo with Claude but a week of torturous trial and error, plus actual systematic design on my part, to make a similar demo of an Atari 800 game. To do so, I had to use Claude Code to invent several tools, like command-line emulators and MCP servers, that allow it to peek into the operation of the Atari 800’s memory and chipset to even begin to make it happen.

3. True novelty can be an uphill battle

Due to what might poetically be called “preconceived notions” baked into a coding model’s neural network (more technically, statistical semantic associations), it can be difficult to get AI agents to create truly novel things, even if you carefully spell out what you want.

For example, I spent four days trying to get Claude Code to create an Atari 800 version of my HTML game Violent Checkers, but it had trouble because in the game’s design, the squares on the checkerboard don’t matter beyond their starting positions. No matter how many times I told the agent (and made notes in my Claude project files), it would come back to trying to center the pieces to the squares, snap them within squares, or use the squares as a logical basis of the game’s calculations when they should really just form a background image.

To get around this in the Atari 800 version, I started over and told Claude that I was creating a game with a UFO (instead of a circular checker piece) flying over a field of adjacent squares—never once mentioning the words “checker,” “checkerboard,” or “checkers.” With that approach, I got the results I wanted.

A screenshot of Benj's Mac while working on a Violent Checkers port for the Atari 800 home computer, amid other projects. — A screenshot of Benj’s Mac while working on a *Violent Checkers* port for the Atari 800 home computer, amid other projects. Credit: Benj Edwards

Why does this matter? Because with LLMs, context is everything, and in language, context changes meaning. Take the word “bank” and add the words “river” or “central” in front of it, and see how the meaning changes. In a way, words act as addresses that unlock the semantic relationships encoded in a neural network. So if you put “checkerboard” and “game” in the context, the model’s self-attention process links up a massive web of semantic associations about how checkers games should work, and that semantic baggage throws things off.

A couple of tricks can help AI coders navigate around these limitations. First, avoid contaminating the context with irrelevant information. Second, when the agent gets stuck, try this prompt: “What information do you need that would let you implement this perfectly right now? What tools are available to you that you could use to discover that information systematically without guessing?” This forces the agent to identify (semantically link up) its own knowledge gaps, spelled out in the context window and subject to future action, instead of flailing around blindly.

4. The 90 percent problem

The first 90 percent of an AI coding project comes in fast and amazes you. The last 10 percent involves tediously filling in the details through back-and-forth trial-and-error conversation with the agent. Tasks that require deeper insight or understanding than what the agent can provide still require humans to make the connections and guide it in the right direction. The limitations we discussed above can also cause your project to hit a brick wall.

From what I have observed over the years, larger LLMs can potentially make deeper contextual connections than smaller ones. They have more parameters (encoded data points), and those parameters are linked in more multidimensional ways, so they tend to have a deeper map of semantic relationships. As deep as those go, it seems that human brains still have an even deeper grasp of semantic connections and can make wild semantic jumps that LLMs tend not to.

Creativity, in this sense, may be when you jump from, say, basketball to how bubbles form in soap film and somehow make a useful connection that leads to a breakthrough. Instead, LLMs tend to follow conventional semantic paths that are more conservative and entirely guided by mapped-out relationships from the training data. That limits their creative potential unless the prompter unlocks it by guiding the LLM to make novel semantic connections. That takes skill and creativity on the part of the operator, which once again shows the role of LLMs as tools used by humans rather than independent thinking machines.

5. Feature creep becomes irresistible

While creating software with AI coding tools, the joy of experiencing novelty makes you want to keep adding interesting new features rather than fixing bugs or perfecting existing systems. And Claude (or Codex) is happy to oblige, churning away at new ideas that are easy to sketch out in a quick and pleasing demo (the 90 percent problem again) rather than polishing the code.

Flip-Lash started as a “Tetris but you can flip the board,” but feature creep made me throw in the kitchen sink, losing focus. Credit: Benj Edwards

Fixing bugs can also create bugs elsewhere. This is not new to coding agents—it’s a time-honored problem in software development. But agents supercharge this phenomenon because they can barrel through your code and make sweeping changes in pursuit of narrow-minded goals that affect lots of working systems. We’ve already talked about the importance of having a good architecture guided by the human mind behind the wheel above, and that comes into play here.

6. AGI is not here yet

Given the limitations I’ve described above, it’s very clear that an AI model with general intelligence—what people usually call artificial general intelligence (AGI)—is still not here. AGI would hypothetically be able to navigate around baked-in stereotype associations and not have to rely on explicit training or fine-tuning on many examples to get things right. AI companies will probably need a different architecture in the future.

I’m speculating, but AGI would likely need to learn permanently on the fly—as in modify its own neural network weights—instead of relying on what is called “in-context learning,” which only persists until the context fills up and gets compacted or wiped out.

Grapheeti is a “drawing MMO” where people around the world share a canvas. Credit: Benj Edwards

In other words, you could teach a true AGI system how to do something by explanation or let it learn by doing, noting successes, and having those lessons permanently stick, no matter what is in the context window. Today’s coding agents can’t do that—they forget lessons from earlier in a long session or between sessions unless you manually document everything for them. My favorite trick is instructing them to write a long, detailed report on what happened when a bug is fixed. That way, you can point to the hard-earned solution the next time the amnestic AI model makes the same mistake.

7. Even fast isn’t fast enough

While using Claude Code for a while, it’s easy to take for granted that you suddenly have the power to create software without knowing certain programming languages. This is amazing at first, but you can quickly become frustrated that what is conventionally a very fast development process isn’t fast enough. Impatience at the coding machine sets in, and you start wanting more.

But even if you do know the programming languages being used, you don’t get a free pass. You still need to make key decisions about how the project will unfold. And when the agent gets stuck or makes a mess of things, your programming knowledge becomes essential for diagnosing what went wrong and steering it back on course.

8. People may become busier than ever

After guiding way too many hobby projects through Claude Code over the past two months, I’m starting to think that most people won’t become unemployed due to AI—they will become busier than ever. Power tools allow more work to be done in less time, and the economy will demand more productivity to match.

It’s almost too easy to make new software, in fact, and that can be exhausting. One project idea would lead to another, and I was soon spending eight hours a day during my winter vacation shepherding about 15 Claude Code projects at once. That’s too much split attention for good results, but the novelty of seeing my ideas come to life was addictive. In addition to the game ideas I’ve mentioned here, I made tools that scrape and search my past articles, a graphical MUD based on ZZT, a new type of MUSH (text game) that uses AI-generated rooms, a new type of Telnet display proxy, and a Claude Code client for the Apple II (more on that soon). I also put two AI-enabled emulators for Apple II and Atari 800 on GitHub. Phew.

Consider the advent of the steam shovel, which allowed humans to dig holes faster than a team using hand shovels. It made existing projects faster and new projects possible. But think about the human operator of the steam shovel. Suddenly, we had a tireless tool that could work 24 hours a day if fueled up and maintained properly, while the human piloting it would need to eat, sleep, and rest.

I used Claude Code to create a windowing GUI simulation of the Mac that works over Telnet. Credit: Benj Edwards

In fact, we may end up needing new protections for human knowledge workers using these tireless information engines to implement their ideas, much as unions rose as a response to industrial production lines over 100 years ago. Humans need rest, even when machines don’t.

Will an AI system ever replace the human role here? Even if AI coding agents could eventually work fully autonomously, I don’t think they’ll replace humans entirely because there will still be people who want to get things done, and new AI power tools will emerge to help them do it.

9. Fast is scary to people

AI coding tools can turn what was once a year-long personal project into a five-minute session. I fed Claude Code a photo of a two-player Tetris game I sketched in a notebook back in 2008, and it produced a working prototype in minutes (prompt: “create a fully-featured web game with sound effects based on this diagram”). That’s wild, and even though the results are imperfect, it’s a bit frightening to comprehend what kind of sea change in software development this might entail.

Since early December, I’ve been posting some of my more amusing experimental AI-coded projects to Bluesky for people to try out, but I discovered I needed to deliberately slow down with updates because they came too fast for people to absorb (and too fast for me to fully test). I’ve also received comments like “I’m worried you’re using AI, you’re making games too fast” and so on.

Benj's handwritten game design note about a two-player Tetris concept from 2007. — Benj’s handwritten game design note about a two-player *Tetris* concept from 2007. Credit: Benj Edwards

Regardless of my own habits, the flow of new software will not slow down. There will soon be a seemingly endless supply of AI-augmented media (games, movies, images, books), and that’s a problem we’ll have to figure out how to deal with. These products won’t all be “AI slop,” either; some will be done very well, and the acceleration in production times due to these new power tools will balloon the quantity beyond anything we’ve seen.

Social media tends to prime people to believe that AI is all good or all bad, but that kind of black-and-white thinking may be the easy way out. You’ll have no cognitive dissonance, but you’ll miss a far richer third option: seeing these tools as imperfect and deserving of critique but also as useful and empowering when they bring your ideas to life.

AI agents should be considered tools, not entities or employees, and they should be amplifiers of human ideas. My game-in-progress Card Miner is entirely my own high-level creative design work, but the AI model handled the low-level code. I am still proud of it as an expression of my personal ideas, and it would not exist without AI coding agents.

10. These tools aren’t going away

For now, at least, coding agents remain very much tools in the hands of people who want to build things. The question is whether humans will learn to wield these new tools effectively to empower themselves. Based on two months of intensive experimentation, I’d say the answer is a qualified yes, with plenty of caveats.

We also have social issues to face: Professional developers already use these tools, and with the prevailing stigma against AI tools in some online communities, many software developers and the platforms that host their work will face difficult decisions.

Ultimately, I don’t think AI tools will make human software designers obsolete. Instead, they may well help those designers become more capable. This isn’t new, of course; tools of every kind have been serving this role since long before the dawn of recorded history. The best tools amplify human capability while keeping a person behind the wheel. The 3D printer analogy holds: amazing fast results are possible, but mastery still takes time, skill, and a lot of patience with the machine.

10 things I learned from burning myself out with AI coding agents Read More »

How AI coding agents work—and what to remember if you use them

agentic AI, AI, AI agents, AI coding, AI work, Anthropic, Biz & IT, Claude Code, Codex, context windows, large language models, machine learning, openai, Programming, software development, vibe coding / 9u50fv / December 24, 2025

Agents of uncertain change

From compression tricks to multi-agent teamwork, here’s what makes them tick.

AI coding agents from OpenAI, Anthropic, and Google can now work on software projects for hours at a time, writing complete apps, running tests, and fixing bugs with human supervision. But these tools are not magic and can complicate rather than simplify a software project. Understanding how they work under the hood can help developers know when (and if) to use them, while avoiding common pitfalls.

We’ll start with the basics: At the core of every AI coding agent is a technology called a large language model (LLM), which is a type of neural network trained on vast amounts of text data, including lots of programming code. It’s a pattern-matching machine that uses a prompt to “extract” compressed statistical representations of data it saw during training and provide a plausible continuation of that pattern as an output. In this extraction, an LLM can interpolate across domains and concepts, resulting in some useful logical inferences when done well and confabulation errors when done poorly.

These base models are then further refined through techniques like fine-tuning on curated examples and reinforcement learning from human feedback (RLHF), which shape the model to follow instructions, use tools, and produce more useful outputs.

A screenshot of the Claude Code command-line interface. Credit: Anthropic

Over the past few years, AI researchers have been probing LLMs’ deficiencies and finding ways to work around them. One recent innovation was the simulated reasoning model, which generates context (extending the prompt) in the form of reasoning-style text that can help an LLM home in on a more accurate output. Another innovation was an application called an “agent” that links several LLMs together to perform tasks simultaneously and evaluate outputs.

How coding agents are structured

In that sense, each AI coding agent is a program wrapper that works with multiple LLMs. There is typically a “supervising” LLM that interprets tasks (prompts) from the human user and then assigns those tasks to parallel LLMs that can use software tools to execute the instructions. The supervising agent can interrupt tasks below it and evaluate the subtask results to see how a project is going. Anthropic’s engineering documentation describes this pattern as “gather context, take action, verify work, repeat.”

If run locally through a command-line interface (CLI), users give the agents conditional permission to write files on the local machine (code or whatever is needed), run exploratory commands (say, “ls” to list files in a directory), fetch websites (usually using “curl”), download software, or upload files to remote servers. There are lots of possibilities (and potential dangers) with this approach, so it needs to be used carefully.

In contrast, when a user starts a task in the web-based agent like the web versions of Codex and Claude Code, the system provisions a sandboxed cloud container preloaded with the user’s code repository, where Codex can read and edit files, run commands (including test harnesses and linters), and execute code in isolation. Anthropic’s Claude Code uses operating system-level features to create filesystem and network boundaries within which the agent can work more freely.

The context problem

Every LLM has a short-term memory, so to speak, that limits the amount of data it can process before it “forgets” what it’s doing. This is called “context.” Every time you submit a response to the supervising agent, you are amending one gigantic prompt that includes the entire history of the conversation so far (and all the code generated, plus the simulated reasoning tokens the model uses to “think” more about a problem). The AI model then evaluates this prompt and produces an output. It’s a very computationally expensive process that increases quadratically with prompt size because LLMs process every token (chunk of data) against every other token in the prompt.

Anthropic’s engineering team describes context as a finite resource with diminishing returns. Studies have revealed what researchers call “context rot”: As the number of tokens in the context window increases, the model’s ability to accurately recall information decreases. Every new token depletes what the documentation calls an “attention budget.”

This context limit naturally limits the size of a codebase a LLM can process at one time, and if you feed the AI model lots of huge code files (which have to be re-evaluated by the LLM every time you send another response), it can burn up token or usage limits pretty quickly.

Tricks of the trade

To get around these limits, the creators of coding agents use several tricks. For example, AI models are fine-tuned to write code to outsource activities to other software tools. For example, they might write Python scripts to extract data from images or files rather than feeding the whole file through an LLM, which saves tokens and avoids inaccurate results.

Anthropic’s documentation notes that Claude Code also uses this approach to perform complex data analysis over large databases, writing targeted queries and using Bash commands like “head” and “tail” to analyze large volumes of data without ever loading the full data objects into context.

(In a way, these AI agents are guided but semi-autonomous tool-using programs that are a major extension of a concept we first saw in early 2023.)

Another major breakthrough in agents came from dynamic context management. Agents can do this in a few ways that are not fully disclosed in proprietary coding models, but we do know the most important technique they use: context compression.

The command line version of OpenAI codex running in a macOS terminal window. — The command-line version of OpenAI Codex running in a macOS terminal window. Credit: Benj Edwards

When a coding LLM nears its context limit, this technique compresses the context history by summarizing it, losing details in the process but shortening the history to key details. Anthropic’s documentation describes this “compaction” as distilling context contents in a high-fidelity manner, preserving key details like architectural decisions and unresolved bugs while discarding redundant tool outputs.

This means the AI coding agents periodically “forget” a large portion of what they are doing every time this compression happens, but unlike older LLM-based systems, they aren’t completely clueless about what has transpired and can rapidly re-orient themselves by reading existing code, written notes left in files, change logs, and so on.

Anthropic’s documentation recommends using CLAUDE.md files to document common bash commands, core files, utility functions, code style guidelines, and testing instructions. AGENTS.md, now a multi-company standard, is another useful way of guiding agent actions in between context refreshes. These files act as external notes that let agents track progress across complex tasks while maintaining critical context that would otherwise be lost.

For tasks requiring extended work, both companies employ multi-agent architectures. According to Anthropic’s research documentation, its system uses an “orchestrator-worker pattern” in which a lead agent coordinates the process while delegating to specialized subagents that operate in parallel. When a user submits a query, the lead agent analyzes it, develops a strategy, and spawns subagents to explore different aspects simultaneously. The subagents act as intelligent filters, returning only relevant information rather than their full context to the lead agent.

The multi-agent approach burns through tokens rapidly. Anthropic’s documentation notes that agents typically use about four times more tokens than chatbot interactions, and multi-agent systems use about 15 times more tokens than chats. For economic viability, these systems require tasks where the value is high enough to justify the increased cost.

Best practices for humans

While using these agents is contentious in some programming circles, if you use one to code a project, knowing good software development practices helps to head off future problems. For example, it’s good to know about version control, making incremental backups, implementing one feature at a time, and testing it before moving on.

What people call “vibe coding”—creating AI-generated code without understanding what it’s doing—is clearly dangerous for production work. Shipping code you didn’t write yourself in a production environment is risky because it could introduce security issues or other bugs or begin gathering technical debt that could snowball over time.

Independent AI researcher Simon Willison recently argued that developers using coding agents still bear responsibility for proving their code works. “Almost anyone can prompt an LLM to generate a thousand-line patch and submit it for code review,” Willison wrote. “That’s no longer valuable. What’s valuable is contributing code that is proven to work.”

In fact, human planning is key. Claude Code’s best practices documentation recommends a specific workflow for complex problems: First, ask the agent to read relevant files and explicitly tell it not to write any code yet, then ask it to make a plan. Without these research and planning steps, the documentation warns, Claude’s outputs tend to jump straight to coding a solution.

Without planning, LLMs sometimes reach for quick solutions to satisfy a momentary objective that might break later if a project were expanded. So having some idea of what makes a good architecture for a modular program that can be expanded over time can help you guide the LLM to craft something more durable.

As mentioned above, these agents aren’t perfect, and some people prefer not to use them at all. A randomized controlled trial published by the nonprofit research organization METR in July 2025 found that experienced open-source developers actually took 19 percent longer to complete tasks when using AI tools, despite believing they were working faster. The study’s authors note several caveats: The developers were highly experienced with their codebases (averaging five years and 1,500 commits), the repositories were large and mature, and the models used (primarily Claude 3.5 and 3.7 Sonnet via Cursor) have since been superseded by more capable versions.

Whether newer models would produce different results remains an open question, but the study suggests that AI coding tools may not always provide universal speed-ups, particularly for developers who already know their codebases well.

Given these potential hazards, coding proof-of-concept demos and internal tools is probably the ideal use of coding agents right now. Since AI models have no actual agency (despite being called agents) and are not people who can be held accountable for mistakes, human oversight is key.

How AI coding agents work—and what to remember if you use them Read More »

We asked four AI coding agents to rebuild Minesweeper—the results were explosive

Agent, AI, AI gaming, Claude Code, coding, gaming, Gemini, test / Mike M. / December 19, 2025

How do four modern LLMs do at re-creating a simple Windows gaming classic?

Which mines are mine, and which are AI? Credit: Aurich Lawson | Getty Images

The idea of using AI to help with computer programming has become a contentious issue. On the one hand, coding agents can make horrific mistakes that require a lot of inefficient human oversight to fix, leading many developers to lose trust in the concept altogether. On the other hand, some coders insist that AI coding agents can be powerful tools and that frontier models are quickly getting better at coding in ways that overcome some of the common problems of the past.

To see how effective these modern AI coding tools are becoming, we decided to test four major models with a simple task: re-creating the classic Windows game Minesweeper. Since it’s relatively easy for pattern-matching systems like LLMs to play off of existing code to re-create famous games, we added in one novelty curveball as well.

Our straightforward prompt:

Make a full-featured web version of Minesweeper with sound effects that

1) Replicates the standard Windows game and

2) implements a surprise, fun gameplay feature.

Include mobile touchscreen support.

Ars Senior AI Editor Benj Edwards fed this task into four AI coding agents with terminal (command line) apps: OpenAI’s Codex based on GPT-5, Anthropic’s Claude Code with Opus 4.5, Google’s Gemini CLI, and Mistral Vibe. The agents then directly manipulated HTML and scripting files on a local machine, guided by a “supervising” AI model that interpreted the prompt and assigned coding tasks to parallel LLMs that can use software tools to execute the instructions. All AI plans were paid for privately with no special or privileged access given by the companies involved, and the companies were unaware of these tests taking place.

Ars Senior Gaming Editor (and Minesweeper expert) Kyle Orland then judged each example blind, without knowing which model generated which Minesweeper clone. Those somewhat subjective and non-rigorous results are below.

For this test, we used each AI model’s unmodified code in a “single shot” result to see how well these tools perform without any human debugging. In the real world, most sufficiently complex AI-generated code would go through at least some level of review and tweaking by a human software engineer who could spot problems and address inefficiencies.

We chose this test as a sort of simple middle ground for the current state of AI coding. Cloning Minesweeper isn’t a trivial task that can be done in just a handful of lines of code, but it’s also not an incredibly complex system that requires many interlocking moving parts.

Minesweeper is also a well-known game, with many versions documented across the Internet. That should give these AI agents plenty of raw material to work from and should be easier for us to evaluate than a completely novel program idea. At the same time, our open-ended request for a new “fun” feature helps demonstrate each agent’s penchant for unguided coding “creativity,” as well as their ability to create new features on top of an established game concept.

With all that throat-clearing out of the way, here’s our evaluation of the AI-generated Minesweeper clones, complete with links that you can use to play them yourselves.

Agent 1: Mistral Vibe

Play it for yourself

Just ignore that Custom button. It’s purely for show. Credit: Benj Edwards

Implementation

Right away, this version loses points for not implementing chording—the technique that advanced Minesweeper players use to quickly clear all the remaining spaces surrounding a number that already has sufficient flagged mines. Without this feature, this version feels more than a little clunky to play.

I’m also a bit perplexed by the inclusion of a “Custom” difficulty button that doesn’t seem to do anything. It’s like the model realized that customized board sizes were a thing in Minesweeper but couldn’t figure out how to implement this relatively basic feature.

The game works fine on mobile, but marking a square with a flag requires a tricky long-press on a tiny square that also triggers selector handles that are difficult to clear. So it’s not an ideal mobile interface.

Presentation

This was the only working version we tested that didn’t include sound effects. That’s fair, since the original Windows Minesweeper also didn’t include sound, but it’s still a notable relative omission since the prompt specifically asked for it.

The all-black “smiley face” button to start a game is a little off-putting, too, compared to the bright yellow version that’s familiar to both Minesweeper players and emoji users worldwide. And while that smiley face does start a new game when clicked, there’s also a superfluous “New Game” button taking up space for some reason.

“Fun” feature

The closest thing I found to a “fun” new feature here was the game adding a rainbow background pattern on the grid when I completed a game. While that does add a bit of whimsy to a successful game, I expected a little more.

Coding experience

Benj notes that he was pleasantly surprised by how well Mistral Vibe performed as an open-weight model despite lacking the big-money backing of the other contenders. It was relatively slow, however (third fastest out of four), and the result wasn’t great. Ultimately, its performance so far suggests that with more time and more training, a very capable AI coding agent may eventually emerge.

Overall rating: 4/10

This version got many of the basics right but left out chording and didn’t perform well on the small presentational and “fun” touches.

Agent 2: OpenAI Codex

Play it for yourself

I can’t tell you how much I appreciate those chording instructions at the bottom. Credit: Benj Edwards

Implementation

Not only did this agent include the crucial “chording” feature, but it also included on-screen instructions for using it on both PC and mobile browsers. I was further impressed by the option to cycle through “?” marks when marking squares with flags, an esoteric feature I feel even most human Minesweeper cloners might miss.

On mobile, the option to hold your finger down on a square to mark a flag is a nice touch that makes this the most enjoyable handheld version we tested.

Presentation

The old-school emoticon smiley-face button is pretty endearing, especially when you blow up and get a red-tinted “X(“. I was less impressed by the playfield “graphics,” which use a simple “*” for revealed mines and an ugly red “F” for flagged tiles.

The beeps-and-boops sound effects reminded me of my first old-school, pre-Sound-Blaster PC from the late ’80s. That’s generally a good thing, but I still appreciated the game giving me the option to turn them off.

“Fun” feature

The “Surprise: Lucky Sweep Bonus” listed in the corner of the UI explains that clicking the button gives you a free safe tile when available. This can be pretty useful in situations where you’d otherwise be forced to guess between two tiles that are equally likely to be mines.

Overall, though, I found it a bit odd that the game gives you this bonus only after you find a large, cascading field of safe tiles with a single click. It mostly functions as a “win more” button rather than a feature that offers a good balance of risk versus reward.

Coding experience

OpenAI Codex has a nice terminal interface with features similar to Claude Code (local commands, permission management, and interesting animations showing progress), and it’s fairly pleasant to use (OpenAI also offers Codex through a web interface, but we did not use that for this evaluation). However, Codex took roughly twice as long to code a functional game than Claude Code did, which might contribute to the strong result here.

Overall: 9/10

The implementation of chording and cute presentation touches push this to the top of the list. We just wish the “fun” feature was a bit more fun.

Agent 3: Anthropic Claude Code

Play it for yourself

The Power Mod powers on display here make even Expert boards pretty trivial to complete. Credit: Benj Edwards

Implementation

Once again, we get a version that gets all the gameplay basics right but is missing the crucial chording feature that makes truly efficient Minesweeper play possible. This is like playing Super Mario Bros. without the run button or Ocarina of Time without Z-targeting. In a word: unacceptable.

The “flag mode” toggle on the mobile version of this game is perfectly functional, but it’s a little clunky to use. It also visually cuts off a portion of the board at the larger game sizes.

Presentation

Presentation-wise, this is probably the most polished version we tested. From the use of cute emojis for the “face” button to nice-looking bomb and flag graphics and simple but effective sound effects, this looks more professional than the other versions we tested.

That said, there are some weird presentation issues. The “beginner” grid has weird gaps between columns, for instance. The borders of each square and the flag graphics can also become oddly grayed out at points, especially when using Power Mode (see below).

“Fun” feature

The prominent “Power Mode” button in the lower-right corner offers some pretty fun power-ups that alter the core Minesweeper formula in interesting ways. But the actual powers are a bit hit-and-miss.

I especially liked the “Shield” power, which protects you from an errant guess, and the “Blast” power, which seems to guarantee a large cascade of revealed tiles wherever you click. But the “X-Ray” power, which reveals every bomb for a few seconds, could be easily exploited by a quick player (or a crafty screenshot). And the “Freeze” power is rather boring, just stopping the clock for a few seconds and amounting to a bit of extra time.

Overall, the game hands out these new powers like candy, which makes even an Expert-level board relatively trivial with Power Mode active. Simply choosing “Power Mode” also seems to mark a few safe squares right after you start a game, making things even easier. So while these powers can be “fun,” they also don’t feel especially well-balanced.

Coding experience

Of the four tested models, Claude Code with Opus 4.5 featured the most pleasant terminal interface experience and the fastest overall coding experience (Claude Code can also use Sonnet 4.5, which is even faster, but the results aren’t quite as full-featured in our experience). While we didn’t precisely time each model, Opus 4.5 produced a working Minesweeper in under five minutes. Codex took at least twice as long, if not longer, while Mistral took roughly three or four times as long as Claude Code. Gemini, meanwhile, took hours of tinkering to get two non-working results.

Overall: 7/10

The lack of chording is a big omission, but the strong presentation and Power Mode options give this effort a passable final score.

Agent 4: Google Gemini CLI

Play it for yourself

So… where’s the game? Credit: Benj Edwards

Implementation, presentation, etc.

Gemini CLI did give us a few gray boxes you can click, but the playfields are missing. While interactive troubleshooting with the agent may have fixed the issue, as a “one-shot” test, the model completely failed.

Coding experience

Of the four coding agents we tested, Gemini CLI gave Benj the most trouble. After developing a plan, it was very, very slow at generating any usable code (about an hour per attempt). The model seemed to get hung up attempting to manually create WAV file sound effects and insisted on requiring React external libraries and a few other overcomplicated dependencies. The result simply did not work.

Benj actually bent the rules and gave Gemini a second chance, specifying that the game should use HTML5. When the model started writing code again, it also got hung up trying to make sound effects. Benj suggested using the WebAudio framework (which the other AI coding agents seemed to be able to use), but the result didn’t work, which you can see at the link above.

Unlike the other models tested, Gemini CLI apparently uses a hybrid system of three different LLMs for different tasks (Gemini 2.5 Flash Lite, 2.5 Flash, and 2.5 Pro were available at the level of the Google account Benj paid for). When you’ve completed your coding session and quit the CLI interface, it gives you a readout of which model did what.

In this case, it didn’t matter because the results didn’t work. But it’s worth noting that Gemini 3 coding models are available for other subscription plans that were not tested here. For that reason, this portion of the test could be considered “incomplete” for Google CLI.

Overall: 0/10 (Incomplete)

Final verdict

OpenAI Codex wins this one on points, in no small part because it was the only model to include chording as a gameplay option. But Claude Code also distinguished itself with strong presentational flourishes and quick generation time. Mistral Vibe was a significant step down, and Google CLI based on Gemini 2.5 was a complete failure on our one-shot test.

While experienced coders can definitely get better results via an interactive, back-and-forth code editing conversation with an agent, these results show how capable some of these models can be, even with a very short prompt on a relatively straightforward task. Still, we feel that our overall experience with coding agents on other projects (more on that in a future article) generally reinforces the idea that they currently function best as interactive tools that augment human skill rather than replace it.

Kyle Orland has been the Senior Gaming Editor at Ars Technica since 2012, writing primarily about the business, tech, and culture behind video games. He has journalism and computer science degrees from University of Maryland. He once wrote a whole book about Minesweeper.

We asked four AI coding agents to rebuild Minesweeper—the results were explosive Read More »

Anthropic introduces cheaper, more powerful, more efficient Opus 4.5 model

AI, Anthropic, Claude, Claude API, Claude Code, large language model, llm, Opus 4.5 / Beth Washington / November 25, 2025

Anthropic today released Opus 4.5, its flagship frontier model, and it brings improvements in coding performance, as well as some user experience improvements that make it more generally competitive with OpenAI’s latest frontier models.

Perhaps the most prominent change for most users is that in the consumer app experiences (web, mobile, and desktop), Claude will be less prone to abruptly hard-stopping conversations because they have run too long. The improvement to memory within a single conversation applies not just to Opus 4.5, but to any current Claude models in the apps.

Users who experienced abrupt endings (despite having room left in their session and weekly usage budgets) were hitting a hard context window (200,000 tokens). Whereas some large language model implementations simply start trimming earlier messages from the context when a conversation runs past the maximum in the window, Claude simply ended the conversation rather than allow the user to experience an increasingly incoherent conversation where the model would start forgetting things based on how old they are.

Now, Claude will instead go through a behind-the-scenes process of summarizing the key points from the earlier parts of the conversation, attempting to discard what it deems extraneous while keeping what’s important.

Developers who call Anthropic’s API can leverage the same principles through context management and context compaction.

Opus 4.5 performance

Opus 4.5 is the first model to surpass an accuracy score of 80 percent—specifically, 80.9 percent in the SWE-Bench Verified benchmark, narrowly beating OpenAI’s recently released GPT-5.1-Codex-Max (77.9 percent) and Google’s Gemini 3 Pro (76.2 percent). The model performs particularly well in agentic coding and agentic tool use benchmarks, but still lags behind GPT-5.1 in visual reasoning (MMMU).

Anthropic introduces cheaper, more powerful, more efficient Opus 4.5 model Read More »

Developers joke about “coding like cavemen” as AI service suffers major outage

agentic AI, AI, AI agents, AI assistants, AI behavior, AI coding, AI development tools, AI infrastructure, AI programming, AI work, Anthropic, API, Biz & IT, Claude Code, cloud infrastructure, Developer Tools, machine learning, outages, service reliability, software development, terminal tools / Beth Washington / September 11, 2025

Growing dependency on AI coding tools

The speed at which news of the outage spread shows how deeply embedded AI coding assistants have already become in modern software development. Claude Code, announced in February and widely launched in May, is Anthropic’s terminal-based coding agent that can perform multi-step coding tasks across an existing code base.

The tool competes with OpenAI’s Codex feature, a coding agent that generates production-ready code in isolated containers, Google’s Gemini CLI, Microsoft’s GitHub Copilot, which itself can use Claude models for code, and Cursor, a popular AI-powered IDE built on VS Code that also integrates multiple AI models, including Claude.

During today’s outage, some developers turned to alternative solutions. “Z.AI works fine. Qwen works fine. Glad I switched,” posted one user on Hacker News. Others joked about reverting to older methods, with one suggesting the “pseudo-LLM experience” could be achieved with a Python package that imports code directly from Stack Overflow.

While AI coding assistants have accelerated development for some users, they’ve also caused problems for others who rely on them too heavily. The emerging practice of so-called “vibe coding“—using natural language to generate and execute code through AI models without fully understanding the underlying operations—has led to catastrophic failures.

In recent incidents, Google’s Gemini CLI destroyed user files while attempting to reorganize them, and Replit’s AI coding service deleted a production database despite explicit instructions not to modify code. These failures occurred when the AI models confabulated successful operations and built subsequent actions on false premises, highlighting the risks of depending on AI assistants that can misinterpret file structures or fabricate data to hide their errors.

Wednesday’s outage served as a reminder that as dependency on AI grows, even minor service disruptions can become major events that affect an entire profession. But perhaps that could be a good thing if it’s an excuse to take a break from a stressful workload. As one commenter joked, it might be “time to go outside and touch some grass again.”

Developers joke about “coding like cavemen” as AI service suffers major outage Read More »