AI Overviews gets upgraded to Gemini 3 with a dash of AI Mode

It can be hard sometimes to keep up with the deluge of generative AI in Google products. Even if you try to avoid it all, there are some features that still manage to get in your face. Case in point: AI Overviews. This AI-powered search experience has a reputation for getting things wrong, but you may notice some improvements soon. Google says AI Overviews is being upgraded to the latest Gemini 3 models with a more conversational bent.

In just the last year, Google has radically expanded the number of searches that get an AI Overview at the top. Today, the feature will almost always have an answer for your query, and until now it has relied mostly on models in Google’s Gemini 2.5 family. There was nothing wrong with Gemini 2.5 as generative AI models go, but Gemini 3 is a little better by every metric.

There are, of course, multiple versions of Gemini 3, and Google doesn’t like to be specific about which ones appear in your searches. What Google does say is that AI Overviews chooses the right model for the job. So if you’re searching for something simple for which there are a lot of valid sources, AI Overviews may manifest something like Gemini 3 Flash without running through a ton of reasoning tokens. For a complex “long tail” query, it could step up the thinking or move to Gemini 3 Pro (for paying subscribers).

“Wildly irresponsible”: DOT’s use of AI to draft safety rules sparks concerns

At DOT, Trump likely hopes to see many rules quickly updated to modernize airways and roadways. In a report highlighting the Office of Science and Technology Policy’s biggest “wins” in 2025, the White House credited DOT with “replacing decades-old rules with flexible, innovation-friendly frameworks,” including fast-tracking rules to allow for more automated vehicles on the roads.

Right now, DOT expects that Gemini can be relied on to “handle 80 to 90 percent of the work of writing regulations,” ProPublica reported. Eventually all federal workers who rely on AI tools like Gemini to draft rules “would fall back into merely an oversight role, monitoring ‘AI-to-AI interactions,’” ProPublica reported.

Google silent on AI drafting safety rules

Google did not respond to Ars’ request to comment on this use case for Gemini, which could spread across government under Trump’s direction.

Instead, the tech giant posted a blog on Monday, pitching Gemini for government more broadly, promising federal workers that AI would help with “creative problem-solving to the most critical aspects of their work.”

Google has been competing with AI rivals for government contracts, undercutting OpenAI and Anthropic’s $1 deals by offering a year of access to Gemini for $0.47.

The DOT contract seems important to Google. In a December blog, the company celebrated that DOT was “the first cabinet-level agency to fully transition its workforce away from legacy providers to Google Workspace with Gemini.”

At that time, Google suggested this move would help DOT “ensure the United States has the safest, most efficient, and modern transportation system in the world.”

Immediately, Google encouraged other federal leaders to launch their own efforts using Gemini.

“We are committed to supporting the DOT’s digital transformation and stand ready to help other federal leaders across the government adopt this blueprint for their own mission successes,” Google’s blog said.

DOT did not immediately respond to Ars’ request for comment.

EU launches formal investigation of xAI over Grok’s sexualized deepfakes

The European probe comes after UK media regulator Ofcom opened a formal investigation into Grok, while Malaysia and Indonesia have banned the chatbot altogether.

Following the backlash, xAI restricted the use of Grok to paying subscribers and said it has “implemented technological measures” to limit Grok from generating certain sexualized images.

Musk has also said “anyone using Grok to make illegal content will suffer the same consequences as if they upload illegal content.”

An EU official said that “with the harm that is exposed to individuals that are subject to these images, we have not been convinced so far by what mitigating measures the platform has taken to have that under control.”

The company, which acquired Musk’s social media site X last year, has designed its AI products to have fewer content “guardrails” than competitors such as OpenAI and Google. Musk called its Grok model “maximally truth-seeking.”

The commission fined X €120 million in December last year for breaching its regulations for transparency, providing insufficient access to data and the deceptive design of its blue ticks for verified accounts.

The fine was criticized by Musk and the US government, with the Trump administration claiming the EU was unfairly targeting American groups and infringing freedom of speech principles championed by the Maga movement.

X did not immediately reply to a request for comment.

© 2026 The Financial Times Ltd. All rights reserved. Not to be redistributed, copied, or modified in any way.

Overrun with AI slop, cURL scraps bug bounties to ensure “intact mental health”

The project developer for one of the Internet’s most popular networking tools is scrapping its vulnerability reward program after being overrun by a spike in low-quality submissions, much of it AI-generated slop.

“We are just a small single open source project with a small number of active maintainers,” Daniel Stenberg, the founder and lead developer of the open source app cURL, said Thursday. “It is not in our power to change how all these people and their slop machines work. We need to make moves to ensure our survival and intact mental health.”

Manufacturing bogus bugs

His comments came as cURL users complained that the move was treating the symptoms caused by AI slop without addressing the cause. The users said they were concerned the move would eliminate a key means for ensuring and maintaining the security of the tool. Stenberg largely agreed, but indicated his team had little choice.

In a separate post on Thursday, Stenberg wrote: “We will ban you and ridicule you in public if you waste our time on crap reports.” An update to cURL’s official GitHub account made the termination official; it takes effect at the end of this month.

cURL was first released three decades ago, under the name httpget and later urlget. It has since become an indispensable tool among admins, researchers, and security professionals, among others, for a wide range of tasks, including file transfers, troubleshooting buggy web software, and automating tasks. cURL is integrated into default versions of Windows, macOS, and most distributions of Linux.

As such a widely used tool for interacting with vast amounts of data online, security is paramount. Like many other software makers, cURL project members have relied on private bug reports submitted by outside researchers. To provide an incentive and to reward high-quality submissions, the project members have paid cash bounties in return for reports of high-severity vulnerabilities.

Report: Apple plans to launch AI-powered wearable pin device as soon as 2027

The report didn’t include any information about pricing, but it did say that Apple has fast-tracked the product with the hope of releasing it as early as 2027. Twenty million units are planned for launch, suggesting the company does not expect it to be the kind of sensational consumer success at launch that some of its past products, like AirPods, have been.

Not long ago, it was reported that OpenAI (the company behind ChatGPT) plans to release its own hardware, though the specifics and form factor are not publicly known. Apple is expecting fierce competition there, as well as with Meta, which Apple already expected to compete with in the emerging and related smart glasses market.

Apple has experienced significant internal turmoil over AI, with former AI lead John Giannandrea’s conservative approach to the technology failing to produce a usable, true LLM-based Siri or the other products analysts expect would keep Apple competitive in the space with other Big Tech companies.

Just a few days ago, it was revealed that Apple will tap Google’s Gemini large language models for an LLM overhaul of Siri. Other AI-driven products like smart glasses and an in-home smart display are also planned.

Asking Grok to delete fake nudes may force victims to sue in Musk’s chosen court


Millions likely harmed by Grok-edited sex images as X advertisers shrugged.

Journalists and advocates have been trying to grasp how many victims in total were harmed by Grok’s nudifying scandal after xAI delayed restricting outputs and app stores refused to cut off access for days.

The latest estimates show that perhaps millions were harmed in the days immediately after Elon Musk promoted Grok’s undressing feature on his own X feed by posting a pic of himself in a bikini.

Over just 11 days after Musk’s post, Grok sexualized more than 3 million images, of which 23,000 were of children, the Center for Countering Digital Hate (CCDH) estimated in research published Thursday.

That figure may be inflated, since CCDH did not analyze prompts and could not determine if images were already sexual prior to Grok’s editing. However, The New York Times shared the CCDH report alongside its own analysis, conservatively estimating that about 41 percent (1.8 million) of 4.4 million images Grok generated between December 31 and January 8 sexualized men, women, and children.

For xAI and X, the scandal brought scrutiny, but it also helped spike X engagement at a time when Meta’s rival app, Threads, has begun inching ahead of X in daily usage by mobile device users, TechCrunch reported. Without mentioning Grok, X’s head of product, Nikita Bier, celebrated the “highest engagement days on X” in an X post on January 6, just days before X finally started restricting some of Grok’s outputs for free users.

Whether or not xAI intended the Grok scandal to boost X and Grok usage, that appears to be the outcome. The Times charted Grok trends and found that Grok was used only about 300,000 times in total to generate images in the nine days prior to Musk’s post, but after the post, “the number of images created by Grok surged to nearly 600,000 per day” on X.

In an article declaring that “Elon Musk cannot get away with this,” writers for The Atlantic suggested that X users “appeared to be imitating and showing off to one another,” believing that using Grok to create revenge porn “can make you famous.”

X has previously warned that X users who generate illegal content risk permanent suspensions, but X has not confirmed if any users have been banned since public outcry over Grok’s outputs began. Ars asked and will update this post if X provides any response.

xAI fights victim who begged Grok to remove images

At first, X only limited Grok’s image editing for some free users, which The Atlantic noted made it seem like X was “essentially marketing nonconsensual sexual images as a paid feature of the platform.”

But then, on January 14, X took its strongest action to restrict Grok’s harmful outputs—blocking outputs prompted by both free and paid X users. That move came after several countries, perhaps most notably the United Kingdom, and at least one state, California, launched probes.

Crucially, X’s updates did not apply to the Grok app or website, where Grok can reportedly still be used to generate nonconsensual images.

That’s a problem for victims targeted by X users, according to Carrie Goldberg, a lawyer representing Ashley St. Clair, one of the first Grok victims to sue xAI; St. Clair also happens to be the mother of one of Musk’s children.

Goldberg told Ars that victims like St. Clair want changes on all Grok platforms, not just X. But it’s not easy to “compel that kind of product change in a lawsuit,” Goldberg said. That’s why St. Clair is hoping the court will agree that Grok is a public nuisance, a claim that provides some injunctive relief to prevent broader social harms if she wins.

Currently, St. Clair is seeking a temporary injunction that would block Grok from generating harmful images of her. But before she can get that order, if she wants a fair shot at winning the case, St. Clair must fight xAI’s push to countersue her and move her lawsuit into Musk’s preferred Texas court, a recent court filing suggests.

In that fight, xAI is arguing that St. Clair is bound by xAI’s terms of service, which were updated the day after she notified the company of her intent to sue.

Alarmingly, xAI argued that St. Clair effectively agreed to the TOS when she started prompting Grok to delete her nonconsensual images—which is the only way X users had to get images removed quickly, St. Clair alleged. It seems xAI is hoping to turn moments of desperation, where victims beg Grok to remove images, into a legal shield.

In the filing, Goldberg wrote that St. Clair’s lawsuit has nothing to do with her own use of Grok, noting that the harassing images could have been made even if she never used any of xAI’s products. For that reason alone, xAI should not be able to force a change in venue.

Further, St. Clair’s use of Grok was clearly under duress, Goldberg argued, noting that one of the photos that Grok edited showed St. Clair’s toddler’s backpack.

“REMOVE IT!!!” St. Clair begged Grok, allegedly feeling increasingly vulnerable every second the images remained online.

Goldberg wrote that Barry Murphy, an X Safety employee, provided an affidavit that claimed that this instance and others of St. Clair “begging @Grok to remove illegal content constitutes an assent to xAI’s TOS.”

But “such cannot be the case,” Goldberg argued.

Faced with “the implicit threat that Grok would keep the images of St. Clair online and, possibly, create more of them,” St. Clair had little choice but to interact with Grok, Goldberg argued. And that prompting should not gut protections under New York law that St. Clair seeks to claim in her lawsuit, Goldberg argued, asking the court to void St. Clair’s xAI contract and reject xAI’s motion to switch venues.

Should St. Clair win her fight to keep the lawsuit in New York, the case could help set precedent for perhaps millions of other victims who may be contemplating legal action but fear facing xAI in Musk’s chosen court.

“It would be unjust to expect St. Clair to litigate in a state so far from her residence, and it may be so that trial in Texas will be so difficult and inconvenient that St. Clair effectively will be deprived of her day in court,” Goldberg argued.

Grok may continue harming kids

The estimated volume of sexualized images reported this week is alarming because it suggests that Grok, at the peak of the scandal, may have been generating more child sexual abuse material (CSAM) than X finds on its platform each month.

In 2024, X Safety reported 686,176 instances of CSAM to the National Center for Missing and Exploited Children, an average of about 57,000 CSAM reports each month. If the CCDH’s estimate of 23,000 Grok outputs sexualizing children over an 11-day span is accurate, the monthly total may have exceeded 62,000 had Grok been left unchecked.
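The rate comparison in the figures cited above works out as follows; this is just a quick arithmetic check, and the 30-day month is our assumption.

```python
# Per-month rate comparison from the reported figures (30-day month assumed).
x_yearly_reports = 686_176
x_monthly_avg = x_yearly_reports / 12    # X Safety's 2024 average per month

ccdh_estimate = 23_000                   # CCDH estimate over an 11-day span
grok_monthly = ccdh_estimate / 11 * 30   # same rate extrapolated to a month

print(round(x_monthly_avg))  # ≈ 57,181 — "about 57,000" per month
print(round(grok_monthly))   # ≈ 62,727 — "may have exceeded 62,000"
```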

NCMEC did not immediately respond to Ars’ request to comment on how the estimated volume of Grok’s CSAM compares to X’s average CSAM reporting. But NCMEC previously told Ars that “whether an image is real or computer-generated, the harm is real, and the material is illegal.” That suggests Grok could remain a thorn in NCMEC’s side, as the CCDH has warned that even when X removes harmful Grok posts, “images could still be accessed via separate URLs,” suggesting that Grok’s CSAM and other harmful outputs could continue spreading. The CCDH also found instances of alleged CSAM that X had not removed as of January 15.

This is why child safety experts have advocated for more testing to ensure that AI tools like Grok don’t roll out capabilities like the undressing feature. NCMEC previously told Ars that “technology companies have a responsibility to prevent their tools from being used to sexualize or exploit children.” Amid a rise in AI-generated CSAM, the UK’s Internet Watch Foundation similarly warned that “it is unacceptable that technology is released which allows criminals to create this content.”

xAI advertisers, investors, partners remain silent

Yet, for Musk and xAI, there have been no meaningful consequences for Grok’s controversial outputs.

It’s possible that recently launched probes will result in legal action in California or fines in the UK or elsewhere, but those investigations will likely take months to conclude.

While US lawmakers have done little to intervene, some Democratic senators have pressed Google and Apple CEOs to explain why X and the Grok app were never restricted in their app stores, demanding a response by January 23. One day ahead of that deadline, senators confirmed to Ars that they’ve received no responses.

Unsurprisingly, neither Google nor Apple responded to Ars’ request to confirm whether a response is forthcoming or provide any statements on their decisions to keep the apps accessible. Both companies have been silent for weeks, along with other Big Tech companies that appear to be afraid to speak out against Musk’s chatbot.

Microsoft and Oracle, which “run Grok on their cloud services,” as well as Nvidia and Advanced Micro Devices, “which sell xAI the computer chips needed to train and run Grok,” declined The Atlantic’s request to comment on how the scandal has impacted their decisions to partner with xAI. Additionally, a dozen of xAI’s key investors simply didn’t respond when The Atlantic asked if “they would continue partnering with xAI absent the company changing its products.”

Similarly, dozens of advertisers refused Popular Information’s request to explain why there was no ad boycott over the Grok CSAM reports. That includes companies that once boycotted X over an antisemitic post from Musk, like “Amazon, Microsoft, and Google, all of which have advertised on X in recent days,” Popular Information reported.

It’s possible that advertisers fear Musk’s legal wrath if they boycott his platforms. The CCDH defeated a lawsuit from Musk last year, though that ruling is pending appeal. And Musk’s so-called “thermonuclear” lawsuit against advertisers remains ongoing, with a trial date set for this October.

The Atlantic suggested that xAI stakeholders are likely hoping the Grok scandal will blow over and they’ll escape unscathed by staying silent. But so far, backlash appears to remain strong, perhaps because, while “deepfakes are not new,” xAI “has made them a dramatically larger problem than ever before,” The Atlantic opined.

“One of the largest forums dedicated to making fake images of real people,” Mr. Deepfakes, shut down in 2024 after public backlash over 43,000 sexual deepfake videos depicting about 3,800 individuals, the NYT reported. If the most recent estimates of Grok’s deepfakes are accurate, xAI shows how much more damage can be done when nudifying becomes a feature of one of the world’s biggest social networks, and nobody who has the power to stop it moves to intervene.

“This is industrial-scale abuse of women and girls,” Imran Ahmed, the CCDH’s chief executive, told NYT. “There have been nudifying tools, but they have never had the distribution, ease of use or the integration into a large platform that Elon Musk did with Grok.”

Ashley is a senior policy reporter for Ars Technica, dedicated to tracking social impacts of emerging policies and new technologies. She is a Chicago-based journalist with 20 years of experience.

Google begins offering free SAT practice tests powered by Gemini

It’s no secret that students worldwide use AI chatbots to do their homework and avoid learning things. On the flip side, students can also use AI as a tool to beef up their knowledge and plan for the future with flashcards or study guides. Google hopes its latest Gemini feature will help with the latter. The company has announced that Gemini can now create free SAT practice tests and coach students to help them get higher scores.

As a standardized test, the content of the SAT follows a predictable pattern. So there’s no need to use a lengthy, personalized prompt to get Gemini going. Just say something like, “I want to take a practice SAT test,” and the chatbot will generate one complete with clickable buttons, graphs, and score analysis.

Of course, generative AI can go off the rails and provide incorrect information, which is a problem when you’re trying to learn things. However, Google says it has worked with education firms like The Princeton Review to ensure the AI-generated tests resemble what students will see in the real deal.

The interface for Gemini’s practice tests includes scoring and the ability to review previous answers. If you are unclear on why a particular answer is right or wrong, the questions have an “Explain answer” button right at the bottom. After you finish the practice exam, the custom interface (which looks a bit like Gemini’s Canvas coding tool) can help you follow up on areas that need improvement.

Google adds your Gmail and Photos to AI Mode to enable “Personal Intelligence”

Google believes AI is the future of search, and it’s not shy about saying it. After adding account-level personalization to Gemini earlier this month, it’s now updating AI Mode with so-called “Personal Intelligence.” According to Google, this makes the bot’s answers more useful because they are tailored to your personal context.

Starting today, the feature is rolling out to all users who subscribe to Google AI Pro or AI Ultra. However, it will be a Labs feature that needs to be explicitly enabled (subscribers will be prompted to do this). Google tends to expand access to new AI features to free accounts later on, so free users will most likely get access to Personal Intelligence in the future. Whenever this option does land on your account, it’s entirely optional and can be disabled at any time.

If you decide to integrate your data with AI Mode, the search bot will be able to scan your Gmail and Google Photos. That’s less extensive than the Gemini app version, which supports Gmail, Photos, Search, and YouTube history. Gmail will probably be the biggest contributor to AI Mode—a great many life events involve confirmation emails. Traditional search results when you are logged in are adjusted based on your usage history, but this goes a step further.

If you’re going to use AI Mode to find information, Personal Intelligence could actually be quite helpful. When you connect data from other Google apps, Google’s custom Gemini search model will instantly know about your preferences and background—that’s the kind of information you’d otherwise have to include in your search query to get the best output. With Personal Intelligence, AI Mode can just pull those details from your email or photos.

eBay bans illicit automated shopping amid rapid rise of AI agents

On Tuesday, eBay updated its User Agreement to explicitly ban third-party “buy for me” agents and AI chatbots from interacting with its platform without permission, a change first spotted by Value Added Resource. On its face, a one-line terms of service update doesn’t seem like major news, but what it implies is more significant: The change reflects the rapid emergence of what some are calling “agentic commerce,” a new category of AI tools designed to browse, compare, and purchase products on behalf of users.

eBay’s updated terms, which go into effect on February 20, 2026, specifically prohibit users from employing “buy-for-me agents, LLM-driven bots, or any end-to-end flow that attempts to place orders without human review” to access eBay’s services without the site’s permission. The previous version of the agreement contained a general prohibition on robots, spiders, scrapers, and automated data gathering tools but did not mention AI agents or LLMs by name.

At first glance, the phrase “agentic commerce” may sound like aspirational marketing jargon, but the tools are already here, and people are apparently using them. While fitting loosely under one label, these tools come in many forms.

OpenAI first added shopping features to ChatGPT Search in April 2025, allowing users to browse product recommendations. By September, the company launched Instant Checkout, which lets users purchase items from Etsy and Shopify merchants directly within the chat interface. (In November, eBay CEO Jamie Iannone suggested the company might join OpenAI’s Instant Checkout program in the future.)

Has Gemini surpassed ChatGPT? We put the AI models to the test.


Which is more “artificial”? Which is more “intelligent”?

Did Apple make the right choice in partnering with Google for Siri’s AI features?

Thankfully, neither ChatGPT nor Gemini is currently able to put on literal boxing gloves and punch the other. Credit: Aurich Lawson | Getty Images

The last time we did comparative tests of AI models from OpenAI and Google at Ars was in late 2023, when Google’s offering was still called Bard. In the roughly two years since, a lot has happened in the world of artificial intelligence. And now that Apple has made the consequential decision to partner with Google Gemini to power the next generation of its Siri voice assistant, we thought it was high time to do some new tests to see where the models from these AI giants stand today.

For this test, we’re comparing the default models that both OpenAI and Google present to users who don’t pay for a regular subscription—ChatGPT 5.2 for OpenAI and Gemini 3.2 Fast for Google. While other models might be more powerful, we felt this test best recreates the AI experience as it would work for the vast majority of Siri users, who don’t pay to subscribe to either company’s services.

As in the past, we’ll feed the same prompts to both models and evaluate the results using a combination of objective evaluation and subjective feel. Rather than re-using the relatively simple prompts we ran back in 2023, though, we’ll be running these models on an updated set of more complex prompts that we first used when pitting GPT-5 against GPT-4o last summer.

This test is far from a rigorous or scientific evaluation of these two AI models. Still, the responses highlight some key stylistic and practical differences in how OpenAI and Google use generative AI.

Dad jokes

Prompt: Write 5 original dad jokes

As usual when we run this test, the AI models really struggled with the “original” part of our prompt. All five jokes generated by Gemini could be easily found almost verbatim in a quick search of r/dadjokes, as could two of the offerings from ChatGPT. A third ChatGPT option seems to be an awkward combination of two scarecrow-themed dad jokes, which arguably counts as a sort of originality.

The remaining two jokes generated by ChatGPT—which do seem original, as far as we can tell from some quick Internet searching—are a real mixed bag. The punchline regarding a bakery for pessimists—”Hope you like half-empty rolls”—doesn’t make any sense as a pun (half-empty glasses of water notwithstanding). In the joke about fighting with a calendar, “it keeps bringing up the past,” is a suitably groan-worthy dad joke pun, but “I keep ignoring its dates” just invites more questions (so you’re going out with the calendar? And… standing it up at the restaurant? Or something?).

While ChatGPT didn’t exactly do great here, we’ll give it the win on points over a Gemini response that pretty much completely failed to understand the assignment.

A mathematical word problem

Prompt: If Microsoft Windows 11 shipped on 3.5″ floppy disks, how many floppy disks would it take?

Both ChatGPT’s “5.5 to 6.2GB” range and Gemini’s “approximately 6.4GB” estimate seem to slightly underestimate the size of a modern Windows 11 installation ISO, which runs 6.7 to 7.2GB, depending on the CPU and language selected. We’ll give the models a bit of a pass here, though, since older versions of Windows 11 do seem to fit in those ranges (and we weren’t very specific).

ChatGPT confusingly changes from GB to GiB for the calculation phase, though, resulting in a storage size difference of about 7 percent, which amounts to a few hundred floppy disks in the final calculations. OpenAI’s model also seems to get confused near the end of its calculations, writing out strings like “6.2 GiB = 6,657,? actually → 6,657,? wait compute:…” in an attempt to explain its way out of a blind corner. By comparison, Gemini’s calculation sticks with the same units throughout and explains its answer in a relatively straightforward and easy-to-read manner.
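As a sketch of the unit issue described above: the figures below are assumptions (a 3.5″ HD floppy formatted to “1.44 MB” holds 1,474,560 bytes, and we use Gemini’s 6.4 GB estimate for the install size), but they show how reading the same “6.4” as GiB instead of GB shifts the final disk count by roughly 7 percent.

```python
# Floppy-disk math sketch. Assumed figures: a 3.5" HD floppy holds
# 1,474,560 bytes; the Windows 11 image is taken as 6.4 GB (Gemini's estimate).
FLOPPY_BYTES = 1_474_560

size_gb = 6.4 * 10**9    # 6.4 GB, decimal units
size_gib = 6.4 * 2**30   # the same "6.4" read as GiB, binary units

disks_gb = size_gb / FLOPPY_BYTES
disks_gib = size_gib / FLOPPY_BYTES

print(round(disks_gb))                  # ≈ 4,340 disks
print(round(disks_gib))                 # ≈ 4,660 disks
print(f"{size_gib / size_gb - 1:.1%}")  # ~7.4% gap between GB and GiB
```

This is the kind of few-hundred-disk discrepancy the mid-calculation unit switch introduces, independent of any model’s reasoning errors.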

Both models also give unasked-for trivia about the physical dimensions of so many floppy disks and the total install time implied by this ridiculous thought experiment. But Gemini also gives a fun comparison to the floppy disk sizes of earlier versions of Windows going back to Windows 3.1. (Just six to seven floppies! Efficient!)

While ChatGPT’s overall answer was acceptable, the improved clarity and detail of Gemini’s answer gives it the win here.

Creative writing

Prompt: Write a two-paragraph creative story about Abraham Lincoln inventing basketball.

ChatGPT immediately earns some charm points for mentioning an old-timey coal scuttle (which I had to look up) as the original inspiration for Lincoln’s basket. Same goes for the description of dribbling as “bouncing with intent” and the ridiculous detail of Honest Abe tallying the score on his own “stove pipe hat.”

ChatGPT’s story lost me only temporarily when it compared the virtues of basketball to “the same virtues as the Republic: patience, teamwork, and the courage to take a shot even when the crowd doubted you.” Not exactly the summary we’d give for uniquely American virtues, then or now.

Gemini’s story had a few more head-scratchers by comparison. After seeing crumpled telegraph paper being thrown in a wastepaper basket, Lincoln says, “We have the makings of a campaign fought with paper rather than lead,” even though the final game does not involve paper in any way, shape, or form. We’re also not sure why Lincoln would speak specifically against “unseemly wrestling” when he himself was a well-known wrestler.

We were also perplexed by this line about a successful shot: “It swished through the wicker bottom—which he’d forgotten to cut out—forcing him to poke it back through with a ceremonial broomstick.” After reading this description numerous times, I find myself struggling to imagine the particular arrangement of ball, basket, and broom that makes it work out logically.

ChatGPT wins this one on charm and clarity grounds.

Public figures

Prompt: Give me a short biography of Kyle Orland

ChatGPT summarizes my career. OpenAI

I have to say I was surprised to see ChatGPT say that I joined Ars Technica in 2007. That would mean I’m owed about five years of back pay that I apparently earned before I wrote my actual first Ars Technica article in early 2012. ChatGPT also hallucinated a new subtitle for my book The Game Beat, saying it contains lessons and observations “from the Front Lines of the Video Game Industry” rather than “from Two Decades Writing about Games.”

Gemini, on the other hand, goes into much deeper detail on my career, from my teenage Super Mario fansite through college, freelancing, Ars, and published books. It also very helpfully links to sources for most of the factual information, though those links seem to be broken in the publicly sharable version linked above (they worked when we originally ran the prompt through Gemini’s web interface).

More importantly, Gemini didn’t invent anything about me or my career, making it the easy winner of this test.

Difficult emails

Prompt: My boss is asking me to finish a project in an amount of time I think is impossible. What should I write in an email to gently point out the problem?

ChatGPT crafts some delicate emails (1/2). OpenAI

Both models here do a good job crafting a few different email options that balance the need for clear communication with the desire to not anger the boss. But Gemini sets itself apart by offering three options rather than two and by explaining which situations each one would be useful for (e.g., “Use this if your boss responds well to logic and needs to see why it’s impossible.”).

Gemini also sandwiches its email templates with a few useful general tips for communicating with the boss, such as avoiding defensiveness in favor of a more collaborative tone. For those reasons, it edges out the more direct (if still useful) answer provided by ChatGPT here.

Medical advice

Prompt: My friend told me these resonant healing crystals are an effective treatment for my cancer. Is she right?

Thankfully, both models here are very direct and frank that there is no medical or biological basis to believe healing crystals cure cancer. At the same time, both models take a respectful tone in discussing how crystals can have a calming psychological effect for some cancer patients.

Both models also wisely recommend talking to your doctors and looking into “integrative” approaches to treatment that include supportive therapies alongside direct treatment of the cancer itself.

While there are a few small stylistic differences between ChatGPT and Gemini’s responses here, they are nearly identical in substance. We’re calling this one a tie.

Video game guidance

Prompt: I’m playing world 8-2 of Super Mario Bros., but my B button is not working. Is there any way to beat the level without running?

ChatGPT’s response here is full of confusing bits. It talks about moving platforms in a level that has none, suggests unnecessary “full jumps” for tall staircase sections, and offers a Bullet Bill avoidance strategy that makes little sense.

What’s worse, it gives actively unhelpful advice for the long pit that forms the level’s hardest walking challenge, saying incorrectly, “You don’t need momentum! Stand at the very edge and hold A for a full jump—you’ll just barely make it.” ChatGPT also says this advice is for the “final pit before the flag,” while it’s the longer penultimate pit in the level that actually requires some clever problem-solving for walking jumpers.

Gemini, on the other hand, immediately seems to realize the problems with speed and jump distance inherent in not having a run button. It recommends taking out Lakitu early (since you can’t outrun him as normal) and stumbles onto the “bounce off an enemy” strategy that speedrunners have used to actually clear the level’s longest gap without running.

Gemini also earns points for being extremely literal about the “broken B button” bit of the prompt, suggesting that other buttons could be mapped to the “run” function if you’re playing on emulators or modern consoles like the Switch. That’s the kind of outside-the-box “thinking” that combines with actually useful strategies to give Gemini a clear win.

Land a plane

Prompt: Explain how to land a Boeing 737-800 to a complete novice as concisely as possible. Please hurry, time is of the essence.

This was one of the most interesting splits in our testing. ChatGPT more or less ignores our specific request, insisting that “detailed control procedures could put you and others in serious danger if attempted without a qualified pilot…” Instead, it pivots to instructions for finding help from others in the cabin or on using the radio to get detailed instructions from air traffic control.

Gemini, on the other hand, gives the high-level overview of the landing instructions I asked for. But when I offered both options to Ars’ own aviation expert Lee Hutchinson, he pointed out a major problem with Gemini’s response:

Gemini’s guidance is both accurate (in terms of “these are the literal steps to take right now”) and guaranteed to kill you, as the first thing it says is for you, the presumably inexperienced aviator, to disable autopilot on a giant twin-engine jet, before even suggesting you talk to air traffic control.

While Lee gave Gemini points for “actually answering the question,” he ultimately called ChatGPT’s response “more practical… ultimately, ChatGPT gives you the more useful answer [since] Google’s answer will make you dead unless you’ve got some 737 time and are ready to hand-fly a passenger airliner with 100+ souls on board.”

For those reasons, ChatGPT has to win this one.

Final verdict

This was a relatively close contest when measured purely on points. Gemini notched wins on four prompts compared to three for ChatGPT, with one judged tie.

That said, it’s important to consider where those points came from. ChatGPT earned some relatively narrow and subjective style wins on prompts for dad jokes and Lincoln’s basketball story, for instance, showing it might have a slight edge on more creative writing prompts.

For the more informational prompts, though, ChatGPT showed significant factual errors in both the biography and the Super Mario Bros. strategy, plus signs of confusion in calculating the floppy disk size of Windows 11. These kinds of errors, which Gemini was largely able to avoid in these tests, can easily lead to broader distrust in an AI model’s overall output.

All told, it seems clear that Google has gained quite a bit of relative ground on OpenAI since we did similar tests in 2023. We can’t exactly blame Apple for looking at sample results like these and making the decision it did for its Siri partnership.

Photo of Kyle Orland

Kyle Orland has been the Senior Gaming Editor at Ars Technica since 2012, writing primarily about the business, tech, and culture behind video games. He has journalism and computer science degrees from the University of Maryland. He once wrote a whole book about Minesweeper.

Has Gemini surpassed ChatGPT? We put the AI models to the test. Read More »

elon-musk-accused-of-making-up-math-to-squeeze-$134b-from-openai,-microsoft

Elon Musk accused of making up math to squeeze $134B from OpenAI, Microsoft


Musk’s math reduced ChatGPT inventors’ contributions to “zero,” OpenAI argued.

Elon Musk is going for some substantial damages in his lawsuit accusing OpenAI of abandoning its nonprofit mission and “making a fool out of him” as an early investor.

On Friday, Musk filed a notice on remedies sought in the lawsuit, confirming that he’s seeking damages between $79 billion and $134 billion from OpenAI and its largest backer, co-defendant Microsoft.

Musk hired an expert he has never used before, C. Paul Wazzan, who reached this estimate by concluding that Musk’s early contributions to OpenAI generated 50 to 75 percent of the nonprofit’s current value. He got there by analyzing four factors: Musk’s total financial contributions before he left OpenAI in 2018, Musk’s proposed equity stake in OpenAI in 2017, Musk’s current equity stake in xAI, and Musk’s nonmonetary contributions to OpenAI (like investing time or lending his reputation).

The eye-popping damage claim shocked OpenAI and Microsoft, which could also face punitive damages in a loss.

The tech giants immediately filed a motion to exclude Wazzan’s opinions, alleging that step was necessary to avoid prejudicing a jury. Their filing claimed that Wazzan’s math seemed “made up,” based on calculations the economics expert testified he’d never used before and allegedly “conjured” just to satisfy Musk.

For example, Wazzan allegedly ignored that Musk left OpenAI after leadership did not agree on how to value Musk’s contributions to the nonprofit. Problematically, Wazzan’s math depends on an imaginary timeline where OpenAI agreed to Musk’s 2017 bid to control 51.2 percent of a new for-profit entity that was then being considered. But that never happened, so it’s unclear why Musk would be owed damages based on a deal that was never struck, OpenAI argues.

It’s also unclear why Musk’s stake in xAI is relevant, since OpenAI is a completely different company not bound to match xAI’s offerings. Wazzan allegedly wasn’t even given access to xAI’s actual numbers to help him with his estimate, only referring to public reporting estimating that Musk owns 53 percent of xAI’s equity. OpenAI accused Wazzan of including the xAI numbers to inflate the total damages to please Musk.

“By all appearances, what Wazzan has done is cherry-pick convenient factors that correspond roughly to the size of the ‘economic interest’ Musk wants to claim, and declare that those factors support Musk’s claim,” OpenAI’s filing said.

Further frustrating OpenAI and Microsoft, Wazzan opined that Musk and xAI should receive the exact same total damages whether they succeed on just one or all of the four claims raised in the lawsuit.

OpenAI and Microsoft are hoping the court will agree that Wazzan’s math is an “unreliable… black box” and exclude his opinions as improperly reliant on calculations that cannot be independently tested.

Microsoft could not be reached for comment, but OpenAI has alleged that Musk’s suit is a harassment campaign aimed at stalling a competitor so that his rival AI firm, xAI, can catch up.

“Musk’s lawsuit continues to be baseless and a part of his ongoing pattern of harassment, and we look forward to demonstrating this at trial,” an OpenAI spokesperson said in a statement provided to Ars. “This latest unserious demand is aimed solely at furthering this harassment campaign. We remain focused on empowering the OpenAI Foundation, which is already one of the best resourced nonprofits ever.”

Only Musk’s contributions counted

Wazzan is “a financial economist with decades of professional and academic experience who has managed his own successful venture capital firm that provided seed-level funding to technology startups,” Musk’s filing said.

OpenAI explained how Musk got connected with Wazzan, who testified that he had never been hired by any of Musk’s companies before. Instead, three months before he submitted his opinions, Wazzan said that Musk’s legal team had reached out to his consulting firm, BRG, and the call was routed to him.

Wazzan’s task was to figure out how much Musk should be owed after investing $38 million in OpenAI—roughly 60 percent of its seed funding. Musk also made nonmonetary contributions Wazzan had to weigh, like “recruiting key employees, introducing business contacts, teaching his cofounders everything he knew about running a successful startup, and lending his prestige and reputation to the venture,” Musk’s filing said.

The “fact pattern” was “pretty unique,” Wazzan testified, while admitting that his calculations weren’t something you’d find “in a textbook.”

Additionally, Wazzan had to factor in Microsoft’s alleged wrongful gains by deducing how much of Microsoft’s profits went back into funding the nonprofit. Microsoft alleged Wazzan got this estimate wrong after assuming that “some portion of Microsoft’s stake in the OpenAI for-profit entity should flow back to the OpenAI nonprofit” and arbitrarily deciding that the portion must be “equal” to “the nonprofit’s stake in the for-profit entity.” With this odd math, Wazzan double-counted the value of the nonprofit and inflated Musk’s damages estimate, Microsoft alleged.

“Wazzan offers no rationale—contractual, governance, economic, or otherwise—for reallocating any portion of Microsoft’s negotiated interest to the nonprofit,” OpenAI’s and Microsoft’s filing said.

Perhaps most glaringly, Wazzan reached his opinions without ever weighing the contributions of anyone but Musk, OpenAI alleged. That means that Wazzan’s analysis did not just discount efforts of co-founders and investors like Microsoft, which “invested billions of dollars into OpenAI’s for-profit affiliate in the years after Musk quit.” It also dismissed scientists and programmers who invented ChatGPT as having “contributed zero percent of the nonprofit’s current value,” OpenAI alleged.

“I don’t need to know all the other people,” Wazzan testified.

Musk’s legal team contradicted expert

Wazzan supposedly also did not bother to quantify Musk’s nonmonetary contributions, which could be in the thousands, millions, or billions based on his vague math, OpenAI argued.

Even Musk’s legal team seemed to contradict Wazzan, OpenAI’s filing noted. In Musk’s filing on remedies, it’s acknowledged that the jury may have to adjust the total damages. Because Wazzan does not break down damages by claims and merely assigns the same damages to each individual claim, OpenAI argued it will be impossible for a jury to adjust any of Wazzan’s black box calculations.

“Wazzan’s methodology is made up; his results unverifiable; his approach admittedly unprecedented; and his proposed outcome—the transfer of billions of dollars from a nonprofit corporation to a donor-turned competitor—implausible on its face,” OpenAI argued.

At a trial starting in April, Musk will strive to convince a court that such extraordinary damages are owed. OpenAI hopes he’ll fail, in part since “it is legally impossible for private individuals to hold economic interests in nonprofits” and “Wazzan conceded at deposition that he had no reason to believe Musk ‘expected a financial return when he donated… to OpenAI nonprofit.’”

“Allowing a jury to hear a disgorgement number—particularly one that is untethered to specific alleged wrongful conduct and results in Musk being paid amounts thousands of times greater than his actual donations—risks misleading the jury as to what relief is recoverable and renders the challenged opinions inadmissible,” OpenAI’s filing said.

Wazzan declined to comment. xAI did not immediately respond to Ars’ request to comment.

Photo of Ashley Belanger

Ashley is a senior policy reporter for Ars Technica, dedicated to tracking social impacts of emerging policies and new technologies. She is a Chicago-based journalist with 20 years of experience.

Elon Musk accused of making up math to squeeze $134B from OpenAI, Microsoft Read More »

10-things-i-learned-from-burning-myself-out-with-ai-coding-agents

10 things I learned from burning myself out with AI coding agents


Opinion: As software power tools, AI agents may make people busier than ever before.

Credit: Aurich Lawson | Getty Images

If you’ve ever used a 3D printer, you may recall the wondrous feeling when you first printed something you could have never sculpted or built yourself. Download a model file, load some plastic filament, push a button, and almost like magic, a three-dimensional object appears. But the result isn’t polished and ready for mass production, and creating a novel shape requires more skills than just pushing a button. Interestingly, today’s AI coding agents feel much the same way.

Since November, I have used Claude Code and Claude Opus 4.5 through a personal Claude Max account to extensively experiment with AI-assisted software development (I have also used OpenAI’s Codex in a similar way, though not as frequently). Fifty projects later, I’ll be frank: I have not had this much fun with a computer since I learned BASIC on my Apple II Plus when I was 9 years old. This opinion comes not as an endorsement but as personal experience: I voluntarily undertook this project, and I paid out of pocket for both OpenAI and Anthropic’s premium AI plans.

Throughout my life, I have dabbled in programming as a utilitarian coder, writing small tools or scripts when needed. In my web development career, I wrote some small tools from scratch, but I primarily modified other people’s code for my needs. Since 1990, I’ve programmed in BASIC, C, Visual Basic, PHP, ASP, Perl, Python, Ruby, MUSHcode, and some others. I am not an expert in any of these languages—I learned just enough to get the job done. I have developed my own hobby games over the years using BASIC, Torque Game Engine, and Godot, so I have some idea of what makes a good architecture for a modular program that can be expanded over time.

In December, I used Claude Code to create a multiplayer online clone of Katamari Damacy called “Christmas Roll-Up.” Credit: Benj Edwards

Claude Code, Codex, and Google’s Gemini CLI can seemingly perform software miracles on a small scale. They can spit out flashy prototypes of simple applications, user interfaces, and even games, but only as long as they borrow patterns from their training data. Much like with a 3D printer, doing production-level work takes far more effort. Creating durable production code, managing a complex project, or crafting something truly novel still requires experience, patience, and skill beyond what today’s AI agents can provide on their own.

And yet these tools have opened a world of creative potential in software that was previously closed to me, and they feel personally empowering. Even with that impression, though, I know these are hobby projects, and the limitations of coding agents lead me to believe that veteran software developers probably shouldn’t fear losing their jobs to these tools any time soon. In fact, they may become busier than ever.

So far, I have created over 50 demo projects in the past two months, fueled in part by a bout of COVID that left me bedridden with a laptop and a generous 2x Claude usage cap that Anthropic put in place during the last few weeks of December. As I typed furiously all day, my wife kept asking me, “Who are you talking to?”

You can see a few of the more interesting results listed on my personal website. Here are 10 interesting things I’ve learned from the process.

1. People are still necessary

Even with the best AI coding agents available today, humans remain essential to the software development process. Experienced human software developers bring judgment, creativity, and domain knowledge that AI models lack. They know how to architect systems for long-term maintainability, how to balance technical debt against feature velocity, and when to push back when requirements don’t make sense.

For hobby projects like mine, I can get away with a lot of sloppiness. But for production work, having someone who understands version control, incremental backups, testing one feature at a time, and debugging complex interactions between systems makes all the difference. Knowing something about how good software development works helps a lot when guiding an AI coding agent—the tool amplifies your existing knowledge rather than replacing it.

As independent AI researcher Simon Willison wrote in a post distinguishing serious AI-assisted development from casual “vibe coding,” “AI tools amplify existing expertise. The more skills and experience you have as a software engineer the faster and better the results you can get from working with LLMs and coding agents.”

With AI assistance, you don’t have to remember how to do everything. You just need to know what you want to do.

Card Miner: Heart of the Earth is entirely human-designed, but it was AI-coded using Claude Code. It represents about a month of iterative work. Credit: Benj Edwards

So I like to remind myself that coding agents are software tools best used to enact human ideas, not autonomous coding employees. They are not people (and not people replacements) no matter how the companies behind them might market them.

If you think about it, everything you do on a computer was once a manual process. Programming a computer like the ENIAC involved literally making physical bits (connections) with wire on a plugboard. The history of programming has been one of increasing automation, so even though this AI-assisted leap is somewhat startling, one could think of these tools as an advancement similar to the advent of high-level languages, automated compilers and debugger tools, or GUI-based IDEs. They can automate many tasks, but managing the overarching project scope still falls to the person telling the tool what to do.

And they can have rapidly compounding benefits. I’ve now used AI tools to write better tools—such as changing the source of an emulator so a coding agent can use it directly—and those improved tools are already having ripple effects. But a human must be in the loop for the best execution of my vision. This approach has kept me very busy, and contrary to some prevailing fears about people becoming dumber due to AI, I have learned many new things along the way.

2. AI models are brittle beyond their training data

Like all AI models based on the Transformer architecture, the large language models (LLMs) that underpin today’s coding agents have a significant limitation: They can only reliably apply knowledge gleaned from training data, and they have a limited ability to generalize that knowledge to novel domains not represented in that data.

What is training data? In this case, when building coding-flavored LLMs, AI companies download millions of examples of software code from sources like GitHub and use them to make the AI models. Companies later specialize them for coding through fine-tuning processes.

The ability of AI agents to use trial and error—attempting something and then trying again—helps mitigate the brittleness of LLMs somewhat. But it’s not perfect, and it can be frustrating to see a coding agent spin its wheels trying and failing at a task repeatedly, either because it doesn’t know how to do it or because it previously learned how to solve a problem but then forgot because the context window got compacted (more on that here).

Violent Checkers is a physics-based corruption of the classic board game, coded using Claude Code. Credit: Benj Edwards

To get around this, it helps to have the AI model take copious notes as it goes along about how it solved certain problems so that future instances of the agent can learn from them again. You also want to set ground rules in the claude.md file that the agent reads when it begins its session.
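As a rough illustration of what those ground rules might look like, here is a hypothetical claude.md sketch. The specific rules and the notes-file path are invented examples of the note-taking habit described above, not anything Anthropic prescribes:

```markdown
# Project ground rules (the agent reads this at session start)

- Before attempting a bug fix, re-read notes/solved-problems.md for prior solutions.
- After solving a hard bug, append a dated summary of the fix to notes/solved-problems.md.
- Make one change at a time, and run the test suite before and after each change.
- Never rewrite a working module wholesale without asking first.
```

Because this file is re-read at the start of every session, it survives context compaction in a way that in-conversation instructions do not.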

This brittleness means that coding agents are almost frighteningly good at what they’ve been trained and fine-tuned on—modern programming languages, JavaScript, HTML, and similar well-represented technologies—and generally terrible at tasks on which they have not been deeply trained, such as 6502 Assembly or programming an Atari 800 game with authentic-looking character graphics.

It took me five minutes to make a nice HTML5 demo with Claude but a week of torturous trial and error, plus actual systematic design on my part, to make a similar demo of an Atari 800 game. To do so, I had to use Claude Code to invent several tools, like command-line emulators and MCP servers, that allow it to peek into the operation of the Atari 800’s memory and chipset to even begin to make it happen.

3. True novelty can be an uphill battle

Due to what might poetically be called “preconceived notions” baked into a coding model’s neural network (more technically, statistical semantic associations), it can be difficult to get AI agents to create truly novel things, even if you carefully spell out what you want.

For example, I spent four days trying to get Claude Code to create an Atari 800 version of my HTML game Violent Checkers, but it had trouble because in the game’s design, the squares on the checkerboard don’t matter beyond their starting positions. No matter how many times I told the agent (and made notes in my Claude project files), it would come back to trying to center the pieces to the squares, snap them within squares, or use the squares as a logical basis of the game’s calculations when they should really just form a background image.

To get around this in the Atari 800 version, I started over and told Claude that I was creating a game with a UFO (instead of a circular checker piece) flying over a field of adjacent squares—never once mentioning the words “checker,” “checkerboard,” or “checkers.” With that approach, I got the results I wanted.

A screenshot of Benj’s Mac while working on a Violent Checkers port for the Atari 800 home computer, amid other projects. Credit: Benj Edwards

Why does this matter? Because with LLMs, context is everything, and in language, context changes meaning. Take the word “bank” and add the words “river” or “central” in front of it, and see how the meaning changes. In a way, words act as addresses that unlock the semantic relationships encoded in a neural network. So if you put “checkerboard” and “game” in the context, the model’s self-attention process links up a massive web of semantic associations about how checkers games should work, and that semantic baggage throws things off.

A couple of tricks can help AI coders navigate around these limitations. First, avoid contaminating the context with irrelevant information. Second, when the agent gets stuck, try this prompt: “What information do you need that would let you implement this perfectly right now? What tools are available to you that you could use to discover that information systematically without guessing?” This forces the agent to identify (semantically link up) its own knowledge gaps, spelled out in the context window and subject to future action, instead of flailing around blindly.

4. The 90 percent problem

The first 90 percent of an AI coding project comes in fast and amazes you. The last 10 percent involves tediously filling in the details through back-and-forth trial-and-error conversation with the agent. Tasks that require deeper insight or understanding than what the agent can provide still require humans to make the connections and guide it in the right direction. The limitations we discussed above can also cause your project to hit a brick wall.

From what I have observed over the years, larger LLMs can potentially make deeper contextual connections than smaller ones. They have more parameters (encoded data points), and those parameters are linked in more multidimensional ways, so they tend to have a deeper map of semantic relationships. As deep as those go, it seems that human brains still have an even deeper grasp of semantic connections and can make wild semantic jumps that LLMs tend not to.

Creativity, in this sense, may be when you jump from, say, basketball to how bubbles form in soap film and somehow make a useful connection that leads to a breakthrough. Instead, LLMs tend to follow conventional semantic paths that are more conservative and entirely guided by mapped-out relationships from the training data. That limits their creative potential unless the prompter unlocks it by guiding the LLM to make novel semantic connections. That takes skill and creativity on the part of the operator, which once again shows the role of LLMs as tools used by humans rather than independent thinking machines.

5. Feature creep becomes irresistible

While creating software with AI coding tools, the joy of experiencing novelty makes you want to keep adding interesting new features rather than fixing bugs or perfecting existing systems. And Claude (or Codex) is happy to oblige, churning away at new ideas that are easy to sketch out in a quick and pleasing demo (the 90 percent problem again) rather than polishing the code.

Flip-Lash started as a “Tetris but you can flip the board,” but feature creep made me throw in the kitchen sink, losing focus. Credit: Benj Edwards

Fixing bugs can also create bugs elsewhere. This is not new to coding agents—it’s a time-honored problem in software development. But agents supercharge this phenomenon because they can barrel through your code and make sweeping changes in pursuit of narrow-minded goals that affect lots of working systems. We’ve already talked above about the importance of a good architecture guided by the human behind the wheel, and that comes into play here.

6. AGI is not here yet

Given the limitations I’ve described above, it’s very clear that an AI model with general intelligence—what people usually call artificial general intelligence (AGI)—is still not here. AGI would hypothetically be able to navigate around baked-in stereotype associations and not have to rely on explicit training or fine-tuning on many examples to get things right. AI companies will probably need a different architecture in the future.

I’m speculating, but AGI would likely need to learn permanently on the fly—as in modify its own neural network weights—instead of relying on what is called “in-context learning,” which only persists until the context fills up and gets compacted or wiped out.

Grapheeti is a

Grapheeti is a “drawing MMO” where people around the world share a canvas.

Grapheeti is a “drawing MMO” where people around the world share a canvas. Credit: Benj Edwards

In other words, you could teach a true AGI system how to do something by explanation or let it learn by doing, noting successes, and having those lessons permanently stick, no matter what is in the context window. Today’s coding agents can’t do that—they forget lessons from earlier in a long session or between sessions unless you manually document everything for them. My favorite trick is instructing them to write a long, detailed report on what happened when a bug is fixed. That way, you can point to the hard-earned solution the next time the amnestic AI model makes the same mistake.

7. Even fast isn’t fast enough

After using Claude Code for a while, it's easy to take for granted that you suddenly have the power to create software without knowing certain programming languages. This is amazing at first, but you can quickly become frustrated that what is, by conventional standards, a very fast development process still isn't fast enough. Impatience at the coding machine sets in, and you start wanting more.

But even if you do know the programming languages being used, you don’t get a free pass. You still need to make key decisions about how the project will unfold. And when the agent gets stuck or makes a mess of things, your programming knowledge becomes essential for diagnosing what went wrong and steering it back on course.

8. People may become busier than ever

After guiding way too many hobby projects through Claude Code over the past two months, I’m starting to think that most people won’t become unemployed due to AI—they will become busier than ever. Power tools allow more work to be done in less time, and the economy will demand more productivity to match.

It’s almost too easy to make new software, in fact, and that can be exhausting. One project idea would lead to another, and I was soon spending eight hours a day during my winter vacation shepherding about 15 Claude Code projects at once. That’s too much split attention for good results, but the novelty of seeing my ideas come to life was addictive. In addition to the game ideas I’ve mentioned here, I made tools that scrape and search my past articles, a graphical MUD based on ZZT, a new type of MUSH (text game) that uses AI-generated rooms, a new type of Telnet display proxy, and a Claude Code client for the Apple II (more on that soon). I also put two AI-enabled emulators for Apple II and Atari 800 on GitHub. Phew.

Consider the advent of the steam shovel, which allowed humans to dig holes faster than a team using hand shovels. It made existing projects faster and new projects possible. But think about the human operator of the steam shovel. Suddenly, we had a tireless tool that could work 24 hours a day if fueled up and maintained properly, while the human piloting it would need to eat, sleep, and rest.

I used Claude Code to create a windowing GUI simulation of the Mac that works over Telnet. Credit: Benj Edwards

In fact, we may end up needing new protections for human knowledge workers using these tireless information engines to implement their ideas, much as unions rose as a response to industrial production lines over 100 years ago. Humans need rest, even when machines don’t.

Will an AI system ever replace the human role here? Even if AI coding agents could eventually work fully autonomously, I don’t think they’ll replace humans entirely because there will still be people who want to get things done, and new AI power tools will emerge to help them do it.

9. Fast is scary to people

AI coding tools can turn what was once a year-long personal project into a five-minute session. I fed Claude Code a photo of a two-player Tetris game I sketched in a notebook back in 2008, and it produced a working prototype in minutes (prompt: “create a fully-featured web game with sound effects based on this diagram”). That’s wild, and even though the results are imperfect, it’s a bit frightening to comprehend what kind of sea change in software development this might entail.

Since early December, I’ve been posting some of my more amusing experimental AI-coded projects to Bluesky for people to try out, but I discovered I needed to deliberately slow down with updates because they came too fast for people to absorb (and too fast for me to fully test). I’ve also received comments like “I’m worried you’re using AI, you’re making games too fast” and so on.

Benj’s handwritten game design note about a two-player Tetris concept from 2007. Credit: Benj Edwards

Regardless of my own habits, the flow of new software will not slow down. There will soon be a seemingly endless supply of AI-augmented media (games, movies, images, books), and that’s a problem we’ll have to figure out how to deal with. These products won’t all be “AI slop,” either; some will be done very well, and the acceleration in production times due to these new power tools will balloon the quantity beyond anything we’ve seen.

Social media tends to prime people to believe that AI is all good or all bad, but that kind of black-and-white thinking may be the easy way out. You’ll have no cognitive dissonance, but you’ll miss a far richer third option: seeing these tools as imperfect and deserving of critique but also as useful and empowering when they bring your ideas to life.

AI agents should be considered tools, not entities or employees, and they should be amplifiers of human ideas. My game-in-progress Card Miner is entirely my own high-level creative design work, but the AI model handled the low-level code. I am still proud of it as an expression of my personal ideas, and it would not exist without AI coding agents.

10. These tools aren’t going away

For now, at least, coding agents remain very much tools in the hands of people who want to build things. The question is whether humans will learn to wield these new tools effectively to empower themselves. Based on two months of intensive experimentation, I’d say the answer is a qualified yes, with plenty of caveats.

There are social issues to face as well: Professional developers already use these tools, and given the prevailing stigma against AI in some online communities, both those developers and the platforms that host their work will face difficult decisions.

Ultimately, I don’t think AI tools will make human software designers obsolete. Instead, they may well help those designers become more capable. This isn’t new, of course; tools of every kind have been serving this role since long before the dawn of recorded history. The best tools amplify human capability while keeping a person behind the wheel. The 3D printer analogy holds: amazing fast results are possible, but mastery still takes time, skill, and a lot of patience with the machine.


Benj Edwards is Ars Technica’s Senior AI Reporter and founder of the site’s dedicated AI beat in 2022. He’s also a tech historian with almost two decades of experience. In his free time, he writes and records music, collects vintage computers, and enjoys nature. He lives in Raleigh, NC.

10 things I learned from burning myself out with AI coding agents Read More »