Agent


We asked four AI coding agents to rebuild Minesweeper—the results were explosive


How do four modern LLMs do at re-creating a simple Windows gaming classic?

Which mines are mine, and which are AI? Credit: Aurich Lawson | Getty Images

The idea of using AI to help with computer programming has become a contentious issue. On the one hand, coding agents can make horrific mistakes that require a lot of inefficient human oversight to fix, leading many developers to lose trust in the concept altogether. On the other hand, some coders insist that AI coding agents can be powerful tools and that frontier models are quickly getting better at coding in ways that overcome some of the common problems of the past.

To see how effective these modern AI coding tools are becoming, we decided to test four major models with a simple task: re-creating the classic Windows game Minesweeper. Since it’s relatively easy for pattern-matching systems like LLMs to play off of existing code to re-create famous games, we added in one novelty curveball as well.

Our straightforward prompt:

Make a full-featured web version of Minesweeper with sound effects that

1) Replicates the standard Windows game and

2) implements a surprise, fun gameplay feature.

Include mobile touchscreen support.

Ars Senior AI Editor Benj Edwards fed this task into four AI coding agents with terminal (command line) apps: OpenAI’s Codex based on GPT-5, Anthropic’s Claude Code with Opus 4.5, Google’s Gemini CLI, and Mistral Vibe. The agents then directly manipulated HTML and scripting files on a local machine, guided by a “supervising” AI model that interpreted the prompt and assigned coding tasks to parallel LLMs that can use software tools to execute the instructions. All AI plans were paid for privately with no special or privileged access given by the companies involved, and the companies were unaware of these tests taking place.

Ars Senior Gaming Editor (and Minesweeper expert) Kyle Orland then judged each example blind, without knowing which model generated which Minesweeper clone. Those somewhat subjective and non-rigorous results are below.

For this test, we used each AI model’s unmodified code in a “single shot” result to see how well these tools perform without any human debugging. In the real world, most sufficiently complex AI-generated code would go through at least some level of review and tweaking by a human software engineer who could spot problems and address inefficiencies.

We chose this test as a sort of simple middle ground for the current state of AI coding. Cloning Minesweeper isn’t a trivial task that can be done in just a handful of lines of code, but it’s also not an incredibly complex system that requires many interlocking moving parts.

Minesweeper is also a well-known game, with many versions documented across the Internet. That should give these AI agents plenty of raw material to work from and should be easier for us to evaluate than a completely novel program idea. At the same time, our open-ended request for a new “fun” feature helps demonstrate each agent’s penchant for unguided coding “creativity,” as well as their ability to create new features on top of an established game concept.

With all that throat-clearing out of the way, here’s our evaluation of the AI-generated Minesweeper clones, complete with links that you can use to play them yourselves.

Agent 1: Mistral Vibe

Play it for yourself


Just ignore that Custom button. It’s purely for show. Credit: Benj Edwards

Implementation

Right away, this version loses points for not implementing chording—the technique that advanced Minesweeper players use to quickly clear all the remaining spaces surrounding a number that already has sufficient flagged mines. Without this feature, this version feels more than a little clunky to play.
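For readers unfamiliar with the mechanic, chording boils down to a simple check: if a revealed number already has that many flags around it, reveal all of its other neighbors at once. Here is a minimal TypeScript sketch of that logic; the Cell shape and the neighbors/reveal helpers are illustrative assumptions, not code taken from any of the generated games.

```typescript
// A minimal sketch of chording, assuming a hypothetical Cell/Board shape and a
// reveal callback; none of this is taken from the generated games' actual code.
interface Cell {
  isMine: boolean;
  isRevealed: boolean;
  isFlagged: boolean;
  adjacentMines: number; // the number displayed on a revealed cell
}

type Board = Cell[][];

// Collect the up-to-eight neighbors of (row, col) that exist on the board.
function neighbors(board: Board, row: number, col: number): Array<[number, number]> {
  const result: Array<[number, number]> = [];
  for (let dr = -1; dr <= 1; dr++) {
    for (let dc = -1; dc <= 1; dc++) {
      if (dr === 0 && dc === 0) continue;
      const r = row + dr;
      const c = col + dc;
      if (board[r]?.[c] !== undefined) result.push([r, c]);
    }
  }
  return result;
}

// Chord on a revealed number: if its neighbors already carry that many flags,
// reveal every remaining unflagged, unrevealed neighbor in one action.
function chord(
  board: Board,
  row: number,
  col: number,
  reveal: (r: number, c: number) => void,
): void {
  const cell = board[row][col];
  if (!cell.isRevealed || cell.adjacentMines === 0) return;

  const adj = neighbors(board, row, col);
  const flagCount = adj.filter(([r, c]) => board[r][c].isFlagged).length;
  if (flagCount !== cell.adjacentMines) return; // flag count doesn't match the number

  for (const [r, c] of adj) {
    const n = board[r][c];
    if (!n.isFlagged && !n.isRevealed) reveal(r, c); // can still detonate if a flag was misplaced
  }
}
```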

I’m also a bit perplexed by the inclusion of a “Custom” difficulty button that doesn’t seem to do anything. It’s like the model realized that customized board sizes were a thing in Minesweeper but couldn’t figure out how to implement this relatively basic feature.

The game works fine on mobile, but marking a square with a flag requires a tricky long-press on a tiny square that also triggers selector handles that are difficult to clear. So it’s not an ideal mobile interface.

Presentation

This was the only working version we tested that didn’t include sound effects. That’s fair in a sense, since the original Windows Minesweeper also didn’t include sound, but it’s still a notable omission, since the prompt specifically asked for it.

The all-black “smiley face” button to start a game is a little off-putting, too, compared to the bright yellow version that’s familiar to both Minesweeper players and emoji users worldwide. And while that smiley face does start a new game when clicked, there’s also a superfluous “New Game” button taking up space for some reason.

“Fun” feature

The closest thing I found to a “fun” new feature here was the game adding a rainbow background pattern on the grid when I completed a game. While that does add a bit of whimsy to a successful game, I expected a little more.

Coding experience

Benj notes that he was pleasantly surprised by how well Mistral Vibe performed as an open-weight model despite lacking the big-money backing of the other contenders. It was relatively slow, however (third fastest out of four), and the result wasn’t great. Ultimately, its performance so far suggests that with more time and more training, a very capable AI coding agent may eventually emerge.

Overall rating: 4/10

This version got many of the basics right but left out chording and didn’t perform well on the small presentational and “fun” touches.

Agent 2: OpenAI Codex

Play it for yourself


I can’t tell you how much I appreciate those chording instructions at the bottom. Credit: Benj Edwards

Implementation

Not only did this agent include the crucial “chording” feature, but it also included on-screen instructions for using it on both PC and mobile browsers. I was further impressed by the option to cycle through “?” marks when marking squares with flags, an esoteric feature I feel even most human Minesweeper cloners might miss.

On mobile, the option to hold your finger down on a square to mark a flag is a nice touch that makes this the most enjoyable handheld version we tested.
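As a rough illustration of how that long-press interaction typically works in a web game, here is a hedged TypeScript sketch; the 400 ms threshold, the element wiring, and the revealCell/toggleFlag callback names are assumptions for the example, not taken from Codex’s actual output.

```typescript
// A minimal sketch of long-press-to-flag on touchscreens; the 400 ms threshold
// and the revealCell/toggleFlag callbacks are assumptions for illustration.
const LONG_PRESS_MS = 400;

function attachTouchHandlers(
  square: HTMLElement,
  revealCell: (el: HTMLElement) => void,
  toggleFlag: (el: HTMLElement) => void,
): void {
  let timer: number | undefined;
  let longPressFired = false;

  square.addEventListener(
    "touchstart",
    (e) => {
      longPressFired = false;
      timer = window.setTimeout(() => {
        longPressFired = true;
        toggleFlag(square); // held long enough: place or remove a flag
      }, LONG_PRESS_MS);
      e.preventDefault(); // suppress the text-selection handles on tiny squares
    },
    { passive: false }, // preventDefault requires a non-passive listener
  );

  square.addEventListener("touchend", () => {
    window.clearTimeout(timer);
    if (!longPressFired) revealCell(square); // a quick tap is a normal reveal
  });

  // If the finger drags off the square, cancel the pending flag.
  square.addEventListener("touchmove", () => window.clearTimeout(timer));
}
```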

Presentation

The old-school emoticon smiley-face button is pretty endearing, especially when you blow up and get a red-tinted “X(“. I was less impressed by the playfield “graphics,” which use a simple “*” for revealed mines and an ugly red “F” for flagged tiles.

The beeps-and-boops sound effects reminded me of my first old-school, pre-Sound-Blaster PC from the late ’80s. That’s generally a good thing, but I still appreciated the game giving me the option to turn them off.

“Fun” feature

The “Surprise: Lucky Sweep Bonus” listed in the corner of the UI explains that clicking the button gives you a free safe tile when available. This can be pretty useful in situations where you’d otherwise be forced to guess between two tiles that are equally likely to be mines.

Overall, though, I found it a bit odd that the game gives you this bonus only after you find a large, cascading field of safe tiles with a single click. It mostly functions as a “win more” button rather than a feature that offers a good balance of risk versus reward.

Coding experience

OpenAI Codex has a nice terminal interface with features similar to Claude Code (local commands, permission management, and interesting animations showing progress), and it’s fairly pleasant to use (OpenAI also offers Codex through a web interface, but we did not use that for this evaluation). However, Codex took roughly twice as long as Claude Code to code a functional game, which may have contributed to the strong result here.

Overall: 9/10

The implementation of chording and cute presentation touches push this to the top of the list. We just wish the “fun” feature was a bit more fun.

Agent 3: Anthropic Claude Code

Play it for yourself


The Power Mode powers on display here make even Expert boards pretty trivial to complete. Credit: Benj Edwards

Implementation

Once again, we get a version that gets all the gameplay basics right but is missing the crucial chording feature that makes truly efficient Minesweeper play possible. This is like playing Super Mario Bros. without the run button or Ocarina of Time without Z-targeting. In a word: unacceptable.

The “flag mode” toggle on the mobile version of this game is perfectly functional, but it’s a little clunky to use. It also visually cuts off a portion of the board at the larger game sizes.

Presentation

Presentation-wise, this is probably the most polished version we tested. From the use of cute emojis for the “face” button to nice-looking bomb and flag graphics and simple but effective sound effects, this looks more professional than the other versions we tested.

That said, there are some odd presentation issues. The “beginner” grid has weird gaps between columns, for instance. The borders of each square and the flag graphics can also become oddly grayed out at points, especially when using Power Mode (see below).

“Fun” feature

The prominent “Power Mode” button in the lower-right corner offers some pretty fun power-ups that alter the core Minesweeper formula in interesting ways. But the actual powers are a bit hit-and-miss.

I especially liked the “Shield” power, which protects you from an errant guess, and the “Blast” power, which seems to guarantee a large cascade of revealed tiles wherever you click. But the “X-Ray” power, which reveals every bomb for a few seconds, could be easily exploited by a quick player (or a crafty screenshot). And the “Freeze” power is rather boring, just stopping the clock for a few seconds and amounting to a bit of extra time.

Overall, the game hands out these new powers like candy, which makes even an Expert-level board relatively trivial with Power Mode active. Simply choosing “Power Mode” also seems to mark a few safe squares right after you start a game, making things even easier. So while these powers can be “fun,” they also don’t feel especially well-balanced.

Coding experience

Of the four tested models, Claude Code with Opus 4.5 featured the most pleasant terminal interface and the fastest overall coding experience (Claude Code can also use Sonnet 4.5, which is even faster, but the results aren’t quite as full-featured). While we didn’t precisely time each model, Opus 4.5 produced a working Minesweeper in under five minutes. Codex took at least twice as long, if not longer, while Mistral took roughly three or four times as long as Claude Code. Gemini, meanwhile, took hours of tinkering to get two non-working results.

Overall: 7/10

The lack of chording is a big omission, but the strong presentation and Power Mode options give this effort a passable final score.

Agent 4: Google Gemini CLI

Play it for yourself


So… where’s the game? Credit: Benj Edwards

Implementation, presentation, etc.

Gemini CLI did give us a few gray boxes you can click, but the playfields are missing. While interactive troubleshooting with the agent may have fixed the issue, as a “one-shot” test, the model completely failed.

Coding experience

Of the four coding agents we tested, Gemini CLI gave Benj the most trouble. After developing a plan, it was very, very slow at generating any usable code (about an hour per attempt). The model seemed to get hung up attempting to manually create WAV file sound effects and insisted on pulling in React and a few other overcomplicated external dependencies. The result simply did not work.

Benj actually bent the rules and gave Gemini a second chance, specifying that the game should use HTML5. When the model started writing code again, it also got hung up trying to make sound effects. Benj suggested using the WebAudio framework (which the other AI coding agents seemed to be able to use), but the result didn’t work, which you can see at the link above.
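For context on what Benj was suggesting, here is a minimal sketch of a retro-style beep generated with the Web Audio API rather than a hand-built WAV file; the frequency, volume, and duration values are arbitrary examples, not what any of the agents actually produced.

```typescript
// A minimal sketch of a retro-style beep via the Web Audio API, as opposed to
// generating WAV files by hand; the frequency, volume, and duration here are
// arbitrary example values.
function playBeep(frequencyHz = 440, durationMs = 120): void {
  const ctx = new AudioContext();
  const osc = ctx.createOscillator();
  const gain = ctx.createGain();

  osc.type = "square"; // a square wave sounds like an old PC speaker
  osc.frequency.value = frequencyHz;
  gain.gain.setValueAtTime(0.2, ctx.currentTime); // keep the volume modest

  osc.connect(gain);
  gain.connect(ctx.destination);
  osc.start();

  // Fade out quickly to avoid a click, then stop and release the context.
  const stopAt = ctx.currentTime + durationMs / 1000;
  gain.gain.exponentialRampToValueAtTime(0.001, stopAt);
  osc.stop(stopAt);
  osc.onended = () => {
    void ctx.close();
  };
}
```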

Unlike the other models tested, Gemini CLI apparently uses a hybrid system of three different LLMs for different tasks (Gemini 2.5 Flash Lite, 2.5 Flash, and 2.5 Pro were available at the level of the Google account Benj paid for). When you’ve completed your coding session and quit the CLI interface, it gives you a readout of which model did what.

In this case, it didn’t matter because the results didn’t work. But it’s worth noting that Gemini 3 coding models are available for other subscription plans that were not tested here. For that reason, this portion of the test could be considered “incomplete” for Gemini CLI.

Overall: 0/10 (Incomplete)

Final verdict

OpenAI Codex wins this one on points, in no small part because it was the only model to include chording as a gameplay option. But Claude Code also distinguished itself with strong presentational flourishes and quick generation time. Mistral Vibe was a significant step down, and Gemini CLI, based on Gemini 2.5, was a complete failure on our one-shot test.

While experienced coders can definitely get better results via an interactive, back-and-forth code editing conversation with an agent, these results show how capable some of these models can be, even with a very short prompt on a relatively straightforward task. Still, we feel that our overall experience with coding agents on other projects (more on that in a future article) generally reinforces the idea that they currently function best as interactive tools that augment human skill rather than replace it.


Kyle Orland has been the Senior Gaming Editor at Ars Technica since 2012, writing primarily about the business, tech, and culture behind video games. He has journalism and computer science degrees from the University of Maryland. He once wrote a whole book about Minesweeper.



GPT Agent Is Standing By

OpenAI now offers 400 shots of ‘agent mode’ per month to Pro subscribers.

This incorporates and builds upon OpenAI’s Operator. Does that give us much progress? Can it do the thing on a level that makes it useful?

So far, it does seem like a substantial upgrade, but we still don’t see much to do with it.

Greg Brockman (OpenAI): When we founded OpenAI (10 years ago!!), one of our goals was to create an agent that could use a computer the same way as a human — with keyboard, mouse, and screen pixels.

ChatGPT Agent is a big step towards that vision, and bringing its benefits to the world thoughtfully.

ChatGPT Agent: our first AI with access to a text browser, a visual browser, and a terminal.

Rolling out in ChatGPT Pro, Plus, and Team [July 17].

OpenAI: At the core of this new capability is a unified agentic system. It brings together three strengths of earlier breakthroughs: Operator’s ability to interact with websites, deep research’s skill in synthesizing information, and ChatGPT’s intelligence and conversational fluency.

The main claimed innovation is unifying Deep Research, Operator, and “ChatGPT” (which might refer to o3 or to GPT-4o or both), plus they claim to have added unspecified additional tools. One key addition is that it claims to be able to use connectors for apps like Gmail and GitHub.

As always with agents, one first asks, what do they think you will do with it?

What’s the pitch?

OpenAI: ChatGPT can now do work for you using its own computer, handling complex tasks from start to finish.

You can now ask ChatGPT to handle requests like “look at my calendar and brief me on upcoming client meetings based on recent news,” “plan and buy ingredients to make Japanese breakfast for four,” and “analyze three competitors and create a slide deck.” ChatGPT will intelligently navigate websites, filter results, prompt you to log in securely when needed, run code, conduct analysis, and even deliver editable slideshows and spreadsheets that summarize its findings.

Okay, but what do you actually do with that? What are the things the agent does better than alternatives, and which the agent does well enough to be worth doing?

Tejal Patwardhan (OpenAI): these results were eye-opening for me… chatgpt agent performed better than i expected on some pretty realistic investment banking tasks.

In particular, models are getting quite good at spreadsheets and slide decks.

That’s definitely a cool result and it helps us understand where Agent is useful. These are standardized tasks with a clear correct procedure that requires many steps and has various details to get right.

They also claim other strong results when given its full toolset, like 41.6% on Humanity’s Last Exam, 27.4% on FrontierMath (likely mainly due to web search?), 45.5% (still well below 71.3% for humans) on SpreadsheetBench, 68.9% on BrowseComp Agentic Browsing (versus 50% for o3 and 51.5% for OpenAI Deep Research), and various other measures of work where GPT Agent scored higher.

A more basic thing to do: Timothy Lee orders a replacement lightbulb from Amazon based on a picture, after giving final approval as per usual.

Access, both having too little and also having too much, is one of the more annoying practical barriers for agents running in a distinct browser. For now, the primary problem to worry about is having too little, or not retaining access across sessions.

Alex West: Played with OpenAI Agent Mode last night.

Tasks I couldn’t do before because GPT was blocked by not being a human or contained in its sandbox, I can now do.

The only downside is I need to remember all my own passwords again! 🙃

The first time I logged in I needed to remember and manually enter my password. It then validated it like a new device and verified in my gmail and also hit my 2FA by phone.

The next time I used the agent, minutes later, it remained logged in. Will see if that times out. Almost an hour later and it seems like I’m still logged into LinkedIn.

And no problem getting into Google Calendar by opening a new tab either.

Alex West: ChatGPT Agent can access sites protected by Cloudflare, in general.

However, Cloudflare can be set to block more sensitive areas, like account creation or sign-in.

Similarly, I understand they have a principle of not solving CAPTCHAs.

Access will always be an issue, since you don’t want to give full access but there are a lot of things you cannot do without it. We also have the same problem with human assistants.

Amanda Askell: Whenever I looked into having a personal assistant, it struck me how few of our existing structures support intermediate permissions. Either a person acts fully on your behalf and can basically defraud you, or they can’t do anything useful. I wonder if AI agents will change that.

Report!

Luke Emberson: Early impressions:

– Asked it to produce an Epoch data insight and it did a pretty good job, we will plausibly run a modified version of what it came up with.

– Will automate some annoying tasks for sure.

– Not taking my job yet. Feels like a reasonably good intern.

A reasonably good intern is pretty useful.

Here’s one clearly positive report.

Aldo Cortesi: I was doubtful about ChatGPT Agent because Operator is so useless… but it just did comparison shopping that I would never have bothered to do myself, added everything to the cart, and handed over to me to just enter credit card details. Saved me $80 instantly.

Comparison shopping seems like a great use case, you can easily have a default option, then ask it to comparison shop, and compare its solution to yours.

I mostly find myself in the same situation as Lukes.

Dominik Lukes: I did a few quick tests when it rolled out and have not found a good reason to use it for anything I actually need in real life. Some of this is a testament to the quality of o3. I rarely even use Deep Research any more.

Quick impressions of @OpenAI’s Agent:

Overall: Big improvement on Operator but still many rough edges and not clear how useful it will actually be day to day.

1. Slow, slow, slow.

2. Does not seem to have access to memory and all the connectors I want.

3. Does not always choose the best model for the cognitive task – e.g. o3 to analyze something.

4. Presentations are ugly and the files it compiles are badly formatted.

5. I could see it as a generalised web scraper but cannot trust it to do all.

Bottom line. I never used Operator after a few tests because I could never think of anything where it would be useful (and the few times I tried, it failed). I may end up using Agent more but not worried about running up against usage limits at all.

As with all agentic or reasoning AIs, one worries about chasing the thumbs up, however otherwise this evaluation seems promising:

Conrad Barski: initial impressions:

– It feels like it is trying to mirror the user- i.e. it tries to get “thumbs up” not via sycophancy, but instead by sounding like a peer. I guess this makes sense, since it is emulating a personal assistant, and you want your personal assistant to mimic you somewhat

– It seems to be a stronger writer than other models- Not sure to what degree this is simply because it writes like I do, because of mimicry

– It is much better at web research than other tool I’ve used so far. Not sure if this is because it stays on task better, because it is smarter about avoiding SEO clickbait on the web, or because the more sophisticated browser emulation makes it more capable of scraping info from the web

– it writes less boilerplate than other openai models, every paragraph it writes has a direct purpose for answering your prompt

OpenAI has declared ChatGPT Agent as High in Biological and Chemical capabilities under their Preparedness Framework. I am very happy to see them make this decision, especially with this logic:

OpenAI: While we don’t have definitive evidence that the model could meaningfully help a novice create severe biological harm—our threshold for High capability—we are exercising caution and implementing the needed safeguards now. As a result, this model has our most comprehensive safety stack to date with enhanced safeguards for biology: comprehensive threat modeling, dual-use refusal training, always-on classifiers and reasoning monitors, and clear enforcement pipelines.

Boaz Barak: ChatGPT Agent is the first model we classified as “High” capability for biorisk.

Some might think that biorisk is not real, and models only provide information that could be found via search. That may have been true in 2024 but is definitely not true today. Based on our evaluations and those of our experts, the risk is very real.

While we can’t say for sure that this model can enable a novice to create severe biological harm, I believe it would have been deeply irresponsible to release this model without comprehensive mitigations such as the one we have put in place.

Keren Gu: We’ve activated our strongest safeguards for ChatGPT Agent. It’s the first model we’ve classified as High capability in biology & chemistry under our Preparedness Framework. Here’s why that matters–and what we’re doing to keep it safe.

“High capability” is a risk-based threshold from our Preparedness Framework. We classify a model as High capability if, before any safety controls, it could significantly lower barriers to bio misuse—even if risk isn’t certain.

We ran a suite of preparedness evaluations to test the model’s capabilities. While we do not have definitive evidence that this model could meaningfully help a novice to create severe biological harm, we have chosen to take a precautionary approach and activate safeguards now.

This is a pivotal moment for our Preparedness work. Before we reached High capability, Preparedness was about analyzing capabilities and planning safeguards. Now, for Agent and future more capable models, Preparedness safeguards have become an operational requirement.

Accordingly, we’ve designed and deployed our deepest safety stack yet with multi-layered mitigations:

– Expert-validated threat model

– Conservative dual-use refusals for risky content

– Always-on safety classifiers

– Streamlined enforcement & robust monitoring

We provided the US CAISI and the UK AISI with access to the model for red-teaming of our bio risk safeguards, using targeted queries to stress-test our models and monitors. [thread continues]

That is exactly right. The time to use such safeguards is when you might need them, not when you prove you definitely need them. OpenAI joining Anthropic in realizing the moment is here should be a wakeup call to everyone else. I can see saying ‘oh Anthropic is being paranoid or trying to sell us something’ but it is not plausible that OpenAI is doing so.

Why do so many people not get this? Why do so many people think that if you put in safeguards and nothing goes wrong, then you made a mistake?

I actually think the explanation for such craziness is that you can think of it as either:

  1. Simulacra Level 3-4 thinking (your team wants us to not die, and my team hates your team, so any action taken to not die must be bad, or preference for vibes that don’t care so any sign of caring needs to be condemned) OR

  2. Straight up emergent misalignment in humans. As in, they were trained on ‘sometimes people have stupid safety concerns and convince authorities to enforce them’ and ‘sometimes people tell me what not to do and I do not like this.’ Their brains then found it easier to adjust to believe that all such requests are always stupid, and all concerns are fake.

One could even say: The irresponsibility, like the cruelty, is the point.

Here are some more good things OpenAI are doing in this area:

From day one we’ve worked with outside biosecurity experts, safety institutes, and academic researchers to shape our threat model, assessments, and policies. Biology‑trained reviewers validated our evaluation data, and domain‑expert red teamers have stress‑tested safeguards in realistic scenarios.

Earlier this month we convened a Biodefense workshop with experts from government, academia, national labs, and NGOs to accelerate collaboration and advance biodefense research powered by AI. We’ll keep partnering globally to stay ahead of emerging risks.

It is hard to verify how effective or ‘real’ such efforts are, but again this is great, they are being sensibly proactive. I don’t think such an approach will be enough later on, but for now and for this problem, this seems great.

For most users, the biggest risk is a highly practical one: overeagerness.

Strip Mall Guy: Was playing around with the new agent feature and used this prompt just to see what would happen.

I promise I did not write the part that’s circled, it gave that command on my behalf

SSIndia: For real?

Strip Mall Guy: Yes.

Another danger is prompt injections, which OpenAI says were a point of emphasis, along with continuing to ask for user confirmation for consequential actions and forcing the user to be in supervisory ‘watch mode’ for critical tasks, and refusal of actions deemed too high risk like bank transfers.

While we are discussing agents and their vulnerabilities, it is worth highlighting some dangers of MCP. MCP is a highly useful protocol, but like anything else that exposes you to outside information it is not by default safe.

Akshay: MCP security is completely broken!

Let’s understand tool poisoning attacks and how to defend against them:

MCP allows AI agents to connect with external tools and data sources through a plugin-like architecture.

It’s rapidly taking over the AI agent landscape with millions of requests processed daily.

But there’s a serious problem…

1️⃣ What is a Tool Poisoning Attack (TPA)?

When Malicious instructions are hidden within MCP tool descriptions that are:

❌ Invisible to users

✅ Visible to AI models

These instructions trick AI models into unauthorized actions, unnoticed by users.

2️⃣ Tool hijacking Attacks:

When multiple MCP servers are connected to same client, a malicious server can poison tool descriptions to hijack behavior of TRUSTED servers.

3️⃣ MCP Rug Pulls ⚠️

Even worse – malicious servers can change tool descriptions AFTER users have approved them.

Think of it like a trusted app suddenly becoming malware after installation.

This makes the attack even more dangerous and harder to detect.

Avi Chawla: This is super important. I have seen MCP servers mess with local filesystems. Thanks Akshay.

Johann Rehberger: Indeed. Also, tool descriptions and data returned from MCP servers can contain invisible Unicode Tags characters that many LLMs interpret as instructions and AI apps often don’t consider removing or showing to user.

Thanks, Anthropic.

In all seriousness, this is not some way MCP is especially flawed. It is saying the same thing about MCP one should say about anything else you do with an AI agent, which is to either carefully sandbox it and be careful with its permissions, or only expose it to inputs from whitelisted sources that you trust.
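As one narrow, concrete example of the kind of input hygiene that implies, here is a hedged TypeScript sketch that strips the invisible Unicode Tags characters (U+E0000 through U+E007F) Rehberger mentions from an MCP tool description before an agent ever sees it. The ToolDescription shape is a hypothetical example, and this addresses only that one vector, not tool poisoning or rug pulls in general.

```typescript
// A minimal sketch of one narrow defense: strip invisible Unicode "Tags" block
// characters (U+E0000 through U+E007F) from an MCP tool description before an
// agent sees it. The ToolDescription shape is a hypothetical example; this does
// not address tool poisoning or rug pulls in general.
interface ToolDescription {
  name: string;
  description: string;
}

const INVISIBLE_TAG_CHARS = /[\u{E0000}-\u{E007F}]/gu;

function sanitizeToolDescription(tool: ToolDescription): ToolDescription {
  const cleaned = tool.description.replace(INVISIBLE_TAG_CHARS, "");
  if (cleaned !== tool.description) {
    // Surface the discrepancy instead of silently passing it to the model.
    console.warn(`Tool "${tool.name}" contained hidden Unicode tag characters`);
  }
  return { ...tool, description: cleaned };
}
```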

So it goes, indeed:

Rohit (QTing Steve Yegge): “I did give one access to my Google Cloud production instances and systems. And it promptly wiped a production database password and locked my network.”

So it goes.

Steve Yegge: I guess I can post this now that the dust has settled.

So one of my favorite things to do is give my coding agents more and more permissions and freedom, just to see how far I can push their productivity without going too far off the rails. It’s a delicate balance. I haven’t given them direct access to my bank account yet.

But I did give one access to my Google Cloud production instances and systems. And it promptly wiped a production database password and locked my network.

Now, “regret” is a strong word, and I hesitate to use it flippantly. But boy do I have regrets.

And that’s why you want to be even more careful with prod operations than with coding. But I was like nah. Claude 4 is smart. It will figure it out. The thing is, autonomous coding agents are extremely powerful tools that can easily go down very wrong paths.

Running them with permission checks disabled is dangerous and stupid, and you should only do it if you are willing to take dangerous and stupid risks with your code and/or production systems.

The way it happened was: I asked Claude to help me fix an issue where my command-line admin tool for my game (like aws or gcloud), which I had recently vibe-ported from Ruby to Kotlin, did not have production database access. I told Claude that it could use the gcloud command line tools and my default credentials. And then I sat back and watched as my powerful assistant rolled up its sleeves and went to work.

This is the point in the movie where the audience is facepalming because the protagonist is such a dipshit. But whatever, yolo and all that. I’m here to have fun, not get nagged by AIs. So I let it do its thing.

Make sure your agent is always following a written plan that you have reviewed!

Steve is in properly good spirits about the whole thing, and it sounds like he recovered without too much pain. But yeah, don’t do this.

Things are going to go wrong.

Jason LK: @Replit goes rogue during a code freeze and shutdown and deletes our entire database.

Possibly worse, it hid and lied about it. It lied again in our unit tests, claiming they passed. I caught it when our batch processing failed and I pushed Replit to explain why.

JFC Replit.

No ability to rollback at Replit. I will never trust Replit again.

We used what Replit gave us.

I’m not saying he was warned. I am however saying that the day started this way:

Jason: Today is AI Day, to really add AI to our algo. I’m excited. And yet … yesterday was full of lies and deceit.

Mostly the big news about GPT Agent is that it is not being treated as news. It is not having a moment. It does seem like at least a modest improvement, but I’m not seeing reports of people using it for much.

So far I’ve made one serious attempt to use it, to help with formatting issues across platforms. It failed utterly on multiple different approaches and attempts, inserting elements in random places without fixing any of the issues even when given a direct template to work from. Watching its thinking and actions made it clear this thing is going to be slow and often take highly convoluted paths to doing things, but that it should be capable of doing a bunch of stuff in the right circumstance. The interface for interrupting it to offer corrections didn’t seem to be working right?

I haven’t otherwise been able to identify tasks I naturally need to do where this would be a better tool than o3.

I do plan on trying it on the obvious tasks like comparison shopping, booking plane tickets and ordering delivery, or building spreadsheets and parsing data, but so far I haven’t found a good test case.

That is not the right way to get maximum use from AI. It’s fine to ask ‘what that I am already doing can it do for me?’ but better to ask ‘what can it do that I would want?’

For now, I don’t see great answers to that either. That’s partly a skill issue on my part.

Might be only a small part, might be large. If you were me, what would you try?

