
GPS is vulnerable to jamming—here’s how we might fix it


GPS jamming has gotten cheap and easy, but there are potential solutions.

In September 2025, a Widerøe Airlines flight was trying to land in Vardø, Norway, which sits in the country’s far eastern arm, some 40 miles from the Russian coast. The cloud deck was low, and so was visibility. In such gray situations, pilots use GPS technology to help them land on a runway and not the side of a mountain.

But on this day, GPS systems weren’t working correctly, the airwaves jammed with signals that prevented airplanes from accessing navigation information. The Widerøe flight had taken off during one of Russia’s frequent wargames, in which the country’s military simulates conflict as a preparation exercise. This one, nicknamed Zapad-2025—translating to “West-2025”—involved an imaginary war and was happening just across the fjord from Vardø. According to European officials, GPS interference was frequent in the runup to the exercise. Russian forces, they suspected, were using GPS-signal-smashing technology, a tactic used in non-pretend conflict, too. (Russia has denied some allegations of GPS interference in the past.)

Without that guidance from space, and with the cloudy weather, the Widerøe plane had to abort its landing and continue down the coast away from Russia, to Båtsfjord, a fishing village.

The part of Norway in which this interruption occurred is called Finnmark. GPS disruption there is near-constant; problems linked to Russian interference have increased since the invasion of Ukraine.

Military and Pokémon players?

It’s one of the starkest geographic examples of how vulnerable GPS technology is. But such disturbances happen at a lower level all over the globe. The world’s militaries (including that of the United States) are big culprits, breaking out devices that can confuse or disrupt drones, missiles, and aircraft. But the equipment required to interfere with GPS at a less-than-military level is cheap and accessible and affects other aspects of life: Truck drivers, for instance, use it to look like they’ve delivered cargo on time. Players use it to fool augmented-reality games.

Given all this disruption, more U.S. institutions, from the Department of Defense to the Department of Transportation to the Federal Aviation Administration, are making moves toward alternatives and complements for GPS, though perhaps imperfectly. And the existing system has been undergoing a huge modernization program, introducing better-encrypted signals for military users, more varieties of signals for civilians, and higher-power signals for both to the tune of at least $22 billion. The military’s 2025 budget additionally requested $1.5 billion for more resilient “position, navigation, and timing” programs. Other departments have invested smaller amounts. In October 2025, for instance, the Department of Transportation awarded $5 million total to five companies to develop and demonstrate technologies complementary to GPS.

The update’s goals are to make the system more accurate and harder to mess with. But as threats increase in frequency and sophistication, more work is necessary. “Sooner or later, we’re gonna see bad things happening here,” said John Langer, a GPS expert at the Aerospace Corporation, a nonprofit research organization. “So we need to armor up for it before it happens.”

GPS is the invisible spine of society, in more ways than most people realize. It became central quickly after the satellite system, built in the 1970s for the military, was optimized for civilians. “Part of what makes GPS so successful is that it’s ubiquitous and it’s inexpensive,” said Langer.

Losing GPS would mean losing a lot more than Google Maps. The technology is integrated into everything from lights that turn on at sunset to dating apps that match users nearby. Its signals also undergird the electrical grid, cell networks, banking, defense technology, and the movements of robots used in industries like agriculture.

The U.S. government currently has 31 GPS satellites in orbit around Earth, and three other governments have their own systems: Russia made one called GLONASS, China created BeiDou, and the European Union built Galileo; all four systems’ data is available to the international community.

Finding your place

GPS works in a deceptively simple way: Each satellite carries an atomic clock aboard. It broadcasts that clock’s time toward Earth. That signal alone is what’s useful to energy infrastructure and financial transactions. But to get position information, a receiver—in a phone or other device—simply has to pick up signals from at least four satellites. It knows what time those signals were sent, where the satellites were when they sent them, and how long it took the signals to arrive. Through fancy triangulation, the phone (or guided missile) then computes its own location.
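For the technically inclined, here’s a minimal sketch of that computation (more precisely, multilateration) in Python. It solves for position and receiver clock bias with a Gauss-Newton least-squares loop; the function names and simplifications are our own, and real receivers add corrections for satellite orbits, atmospheric delay, and relativity on top of this.

    import numpy as np

    C = 299_792_458.0  # speed of light, meters per second

    def solve_fix(sats, pseudoranges, iters=10):
        # sats: (n, 3) array of satellite positions in meters (n >= 4).
        # pseudoranges: measured signal travel distances, which include
        # an unknown offset from the receiver's imperfect clock.
        x = np.zeros(4)  # initial guess: Earth's center, zero clock bias
        for _ in range(iters):
            pos, bias = x[:3], x[3]
            dists = np.linalg.norm(sats - pos, axis=1)
            residuals = pseudoranges - (dists + C * bias)
            # Jacobian: unit vectors pointing from each satellite toward
            # the receiver, plus the speed of light for the clock column.
            J = np.hstack([(pos - sats) / dists[:, None],
                           np.full((len(sats), 1), C)])
            x += np.linalg.lstsq(J, residuals, rcond=None)[0]
        return x[:3], x[3]  # position (x, y, z) and receiver clock bias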

Or at least that’s the idea. GPS can be jammed, meaning that someone broadcasts a signal much stronger than that of GPS (which has had to travel across thousands of miles of space, and grows weaker with every meter), drowning the real signal in noise. It can also be spoofed, meaning someone sends out a fake signal that looks just like a GPS blip but indicates an incorrect location or time.
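To see why jamming is so easy, consider the raw physics. The sketch below applies the standard free-space path-loss formula; the transmit powers and distances are illustrative assumptions, and real antennas add gain this ignores. Even so, a one-watt jammer 10 kilometers away arrives tens of decibels stronger than a 50-watt satellite signal from 20,000 kilometers up.

    import math

    def received_dbm(tx_watts, freq_hz, distance_m):
        # Friis free-space path loss, isotropic antennas assumed.
        wavelength = 3e8 / freq_hz
        loss = (wavelength / (4 * math.pi * distance_m)) ** 2
        return 10 * math.log10(tx_watts * loss * 1000)  # power in dBm

    L1 = 1575.42e6                             # GPS L1 frequency, Hz
    sat = received_dbm(50, L1, 20_200_000)     # ~50 W from 20,200 km up
    jam = received_dbm(1, L1, 10_000)          # a 1 W jammer 10 km away

    print(f"satellite: {sat:.0f} dBm, jammer: {jam:.0f} dBm")

Run it, and the jammer comes out roughly 50 decibels hotter, on the order of 100,000 times the received power.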

Image of the globe centered on the Caribbean. Three satellites are superimposed on it, each of them with a colored circle around it. A pin highlights the point where the three circles intersect.

Three satellites are needed to pinpoint a location on Earth. Credit: NASA/JPL-Caltech

Threats like these were always a possibility—and those who built GPS knew about that problem from the beginning, said Todd Walter, director of the Stanford GPS Lab. “Around 2000 is when people got a little more serious about it,” he said. Hardware and software became cheaper, lowering the barrier to swamping or faking signals.

Problems ticked up when the augmented reality game Pokémon GO came online in 2016. The game required people to travel to places in real life to win. Turns out, not all of them actually wanted to. “All of a sudden, everyone was interested in spoofing,” said Walter.

Pokémon GO cheaters used low-power devices close to the ground, so they didn’t affect cruising aircraft like Widerøe’s. But the game made cheating high-tech, Walter said, advancing signal-scrambling methods and tools and putting them within reach of non-experts. At the same time, spoofing arose in conflict zones, where drone and missile attacks are often guided by GPS. Don’t want to get hit by one? Fool its navigation system. “So now people say, ‘Well, we need to protect ourselves from that,’” said Walter. “And so then you see a huge increase in very powerful jamming and spoofing.”

In Norway, officials have noted that GPS disruptions, while most commonly affecting flights thousands of feet in the air, can also cause issues for police cars, ambulances, and ships. According to Espen Slette, director of the spectrum department at the Norwegian Communications Authority (known as Nkom), the agency has detected GPS jammers near hospitals, which could force life-saving helicopters to redirect to a more distant facility. Nkom has also clocked disruptions that affect agriculture and construction operations, while emergency responders have warned that the problems could affect emergency beacon devices, like the satellite SOS buttons many people carry in the backcountry or aboard boats. The police’s chief of staff in Finnmark encouraged anyone venturing out to, old-school, carry a map and compass.

“It’s hard to grasp the full effect this has on society,” Slette wrote in an email.

Such widespread disruptions are not isolated to the Russia-adjacent Arctic. There are hotspots in Myanmar, most likely associated with drone warfare in the area; on the Black Sea, publicly associated with Russia, which has denied some cases of GPS interference; and in southern Texas, potentially from drug cartels near the border. A report from OpsGroup, a membership organization for international aviation personnel, found a marked increase in spoofing in 2024. “By January 2024, an average of 300 flights a day were being spoofed,” the report said. “By August 2024, this had grown to around 1500 flights per day.” From July 15 to Aug. 15, 2024, 41,000 flights total experienced spoofing. (While it’s generally illegal for civilians in the U.S. to jam or spoof signals, military-led disruptions during conflict are considered a legitimate and legal use case.)

No going back

The uptick indicates that there’s no going back to a world without disruption hotspots. And that, combined with humans’ dependence on GPS, is why scientists and engineers are working on ways to shore up the system—and develop backups so that a single point of failure doesn’t come to bite anyone, in conflict or in peacetime.

“There are many ways to mitigate GPS disruptions,” Slette wrote in an email. He suggests setting up devices to use signals from all four international constellations and installing better receivers and antennas. That’s easier for militaries or infrastructure companies, and hard for people who are just buying the latest model of cell phone and have no control over its innards. But existing backups can tell a given device that something fishy may be up. Planes have inertial navigation systems, which mostly use motion-sensing devices to get an independent measurement; phones do too, and they can also check their data against cell towers to see if something is off in their GPS signal.

But the U.S. government is worried enough about GPS issues that, across civilian and military agencies, research and development for more robust and resilient systems is ramping up. In March, for instance, the Federal Communications Commission launched a proceeding on GPS alternatives, exploring tools that could be used in addition to or instead of traditional GPS.

The Defense Advanced Research Projects Agency, or DARPA, and the Defense Innovation Unit, meanwhile, are investigating how quantum sensors might help with position, timing, and navigation. The United States’ military branches are also working on their alternative position, navigation, and timing capabilities, and their innovation arms like the Space Force’s SpaceWerx organization are running challenges to support alternative technologies. The Department of Defense acknowledges challenges to GPS and the consequent need to diversify the ways it gets position, navigation, and timing information, noting that it is pursuing the integration of alternative capabilities, according to a statement that public affairs officer Chelsea Dietlin requested be attributed to a Pentagon spokesperson. It is also looking toward working with commercial companies.

Even the Department of Transportation has a strategic plan that includes promoting technologies complementary to GPS. (Undark reached out multiple times to the Department of Transportation to request comment but did not receive a response.) A statement that FAA media relations specialist Cassandra Nolan requested be attributed to an agency spokesperson noted that the FAA is working on a system to detect GPS interference, and that it is working with the Department of Defense on navigation signals and antennas that are more resilient. In addition, the statement noted, the FAA already has “a layered aircraft tracking system that incorporates multiple technologies to guard against threats to Global Navigation Satellite Systems (GNSS).”

But the newer efforts across government may not be as connected as they could be, according to Dana Goward, president of the Resilient Navigation and Timing Foundation, a nonprofit advocacy group that largely comprises companies working in the GPS-problem space. For one, he said, efforts to bolster military and civilian systems have a fairly strict line between them. And neither has been as effective as he’d advocate: On the military side, plentiful programs exist, but they may not be working together. “It’s not clear if there is any coordination or synergies between the projects or how much senior leader support there is for comprehensive solution sets,” Goward wrote in an email.

On the civil side, Congress mandated in 2018 that a backup to GPS be established, but only experimental systems exist so far. There also have been efforts to repeal the law, with the disputed rationale that funding a single system isn’t feasible and there are better paths toward resilience. Goward contended that the government has hoped the private sector will come up with a usable solution, saving the government from creating one itself.

Starting over

And companies are coming to cash in on that desire, offering their solutions to both government agencies and other industries. “Our founding hypothesis was ‘let’s take 50 years of lessons learned but throw out the rulebook and do a clean-sheet design of a new GPS system incorporating a couple of fundamentals,’” said Patrick Shannon, CEO of one such company, called TrustPoint. The company, which has hired scientific and engineering experts in signal processing and space, aims to have a fleet of small satellites orbiting much closer to Earth than the current GPS constellation, and transmitting at a higher frequency.

TrustPoint’s satellites, a few of which have already gone to orbit, also send out an encrypted signal—something harder to spoof. With traditional GPS, only the military gets encrypted signals.

Many Russian jamming systems, he said, work tens of kilometers from their ground zero (usually a truck with a generator aboard). But against TrustPoint’s higher-frequency signals, a jammer’s reach drops by a factor of about three, and its circle of influence becomes 10 times smaller, and smaller still if the receivers use a special kind of antenna that the U.S. government recently approved.

Messing with signals becomes less feasible, given those changes. “They would need exorbitant numbers of systems, exorbitant numbers of people, and a ton of cash to pull that off,” said Shannon.

So far, TrustPoint has launched three spacecraft and has won five federal contracts in 2024 and 2025, totaling around $8.3 million, with organizations like the Air Force, Space Force, and Navy.

Another company, called Xona Space Systems, is also putting satellites in low-Earth orbit and has worked with both the Canadian and U.S. governments. The company plans to broadcast signals 100 times stronger than GPS, giving users two-centimeter precision and making jamming more difficult. The signal also includes a watermark—a kind of authentication that, at least for now, protects against spoofing. Xona has launched one satellite, which is being tested by people in industries like agriculture, construction, and mining.

TrustPoint’s technology may offer novel defense against the dark GPS arts, but Xona, whose founders met as students at the Stanford GPS Lab, may have an edge anyway: Its signals are compatible with current infrastructure, so no one has to buy a new device. They just have to update their software. “We are not building receivers ourselves,” said Max Eunice, head of marketing and communications. Instead, the company is relying on the billions of earthly devices that already rely on GPS.

Image of the inside of the cabin of a large farming machine moving through a field of wheat. Screens track its current location and where it has been.

Reliable GPS has become essential for a huge range of industries. Credit: Thomas Barwick

Other solutions, like one called SuperGPS, stay closer to the ground, using radio transmitters on Earth to do the same things GPS satellites do in space. The setup, as demonstrated by scientists at the Delft University of Technology and VU University in the Netherlands, involves scattering radio transmitters around an area or using those already in place. Each transmitter is synchronized to an atomic clock that distributes the time over fiber optic cable, which may already be in place thanks to existing communications infrastructure. Receivers can collect signals scattered across a wide range of radio frequencies, making them more difficult to jam or spoof. The team published a proof of concept in a 2022 Nature paper and is working on a second iteration called SuperGPS2.

Tom Powell, another GPS expert at the Aerospace Corporation, said that looking at alternatives and augmentations like these is important—even though GPS recently completed its 25-year modernization effort, which made its signals more resistant to interference. “Now that we have delivered, or nearly completely delivered, this modernization, is there a better way to do it in face of the current realities?” he said. He and other GPS experts don’t have answers yet. “We’re just asking questions right now.”

Walter, the director of the Stanford GPS Lab, thinks that whatever a better path looks like, it will likely still include the old-school, original system. “There’s nothing that really does replace GPS,” he said. “I see articles saying ‘post-GPS World’ and so forth. But really, GPS, I think, will always be there.”

People will, and should, strengthen it, Walter added, but that bolstering is going to be piecemeal: Efforts may work only in a particular region, cover some of GPS’s roles (such as providing accurate time) but not others, or back up navigation with less accuracy. They may also cost money. “GPS is free, so that makes it almost impossible to compete with,” he said.

GPS is also straightforward, said Powell. “As satellites go, they’re pretty simple,” he said. They point at Earth, and they transmit signals that tell what time it is. From that, humans get to live in an interconnected, chronologically propriocepted world. Figuring out how to keep it that way, though, is proving a little more complicated.

This article was originally published on Undark. Read the original article.


How AI coding agents work—and what to remember if you use them


Agents of uncertain change

From compression tricks to multi-agent teamwork, here’s what makes them tick.

AI coding agents from OpenAI, Anthropic, and Google can now work on software projects for hours at a time, writing complete apps, running tests, and fixing bugs with human supervision. But these tools are not magic and can complicate rather than simplify a software project. Understanding how they work under the hood can help developers know when (and if) to use them, while avoiding common pitfalls.

We’ll start with the basics: At the core of every AI coding agent is a technology called a large language model (LLM), which is a type of neural network trained on vast amounts of text data, including lots of programming code. It’s a pattern-matching machine that uses a prompt to “extract” compressed statistical representations of data it saw during training and provide a plausible continuation of that pattern as an output. In this extraction, an LLM can interpolate across domains and concepts, resulting in some useful logical inferences when done well and confabulation errors when done poorly.

These base models are then further refined through techniques like fine-tuning on curated examples and reinforcement learning from human feedback (RLHF), which shape the model to follow instructions, use tools, and produce more useful outputs.

A screenshot of the Claude Code command-line interface.

A screenshot of the Claude Code command-line interface. Credit: Anthropic

Over the past few years, AI researchers have been probing LLMs’ deficiencies and finding ways to work around them. One recent innovation was the simulated reasoning model, which generates context (extending the prompt) in the form of reasoning-style text that can help an LLM home in on a more accurate output. Another innovation was an application called an “agent” that links several LLMs together to perform tasks simultaneously and evaluate outputs.

How coding agents are structured

In that sense, each AI coding agent is a program wrapper that works with multiple LLMs. There is typically a “supervising” LLM that interprets tasks (prompts) from the human user and then assigns those tasks to parallel LLMs that can use software tools to execute the instructions. The supervising agent can interrupt tasks below it and evaluate the subtask results to see how a project is going. Anthropic’s engineering documentation describes this pattern as “gather context, take action, verify work, repeat.”
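As a rough illustration, here is that supervising loop sketched in Python. The call_model(instructions, input) helper is an assumption standing in for any vendor’s LLM API, not a real interface, and the plain-text plan format is invented for brevity.

    def run_agent(user_request, call_model):
        # The supervising model breaks the request into subtasks.
        plan = call_model("Break this request into subtasks, one per line.",
                          user_request).splitlines()
        results = []
        for subtask in plan:
            # Each worker sees only its own subtask, keeping context small.
            output = call_model("Complete this coding subtask.", subtask)
            # Gather context, take action, verify work, repeat.
            verdict = call_model("Answer YES or NO: does the output "
                                 "complete the subtask?",
                                 f"Subtask: {subtask}\nOutput: {output}")
            if not verdict.strip().upper().startswith("YES"):
                output = call_model("Revise this failed attempt.",
                                    f"{subtask}\n{output}")
            results.append(output)
        return results

Real agents run workers in parallel and let the supervisor interrupt them; this sequential version just shows the shape of the loop.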

If run locally through a command-line interface (CLI), users give the agents conditional permission to write files on the local machine (code or whatever is needed), run exploratory commands (say, “ls” to list files in a directory), fetch websites (usually using “curl”), download software, or upload files to remote servers. There are lots of possibilities (and potential dangers) with this approach, so it needs to be used carefully.

In contrast, when a user starts a task in a web-based agent, like the web versions of Codex and Claude Code, the system provisions a sandboxed cloud container preloaded with the user’s code repository, where the agent can read and edit files, run commands (including test harnesses and linters), and execute code in isolation. Anthropic’s Claude Code uses operating system-level features to create filesystem and network boundaries within which the agent can work more freely.

The context problem

Every LLM has a short-term memory, so to speak, that limits the amount of data it can process before it “forgets” what it’s doing. This is called “context.” Every time you submit a response to the supervising agent, you are amending one gigantic prompt that includes the entire history of the conversation so far (and all the code generated, plus the simulated reasoning tokens the model uses to “think” more about a problem). The AI model then evaluates this prompt and produces an output. It’s a very computationally expensive process that increases quadratically with prompt size because LLMs process every token (chunk of data) against every other token in the prompt.
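A toy sketch makes the mechanic concrete; generate(prompt) below is an assumed stand-in for the model call, not any particular API.

    history = []

    def send(user_message, generate):
        history.append(("user", user_message))
        # The model is stateless: every turn resends the entire transcript,
        # so the prompt keeps growing, and the attention cost inside the
        # model grows roughly with the square of its length.
        prompt = "\n".join(f"{role}: {text}" for role, text in history)
        reply = generate(prompt)
        history.append(("assistant", reply))
        return reply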

Anthropic’s engineering team describes context as a finite resource with diminishing returns. Studies have revealed what researchers call “context rot”: As the number of tokens in the context window increases, the model’s ability to accurately recall information decreases. Every new token depletes what the documentation calls an “attention budget.”

This context limit naturally caps the size of the codebase an LLM can process at one time, and if you feed the AI model lots of huge code files (which have to be re-evaluated by the LLM every time you send another response), it can burn through token or usage limits pretty quickly.

Tricks of the trade

To get around these limits, the creators of coding agents use several tricks. One is fine-tuning AI models to write code that outsources work to other software tools: A model might write a Python script to extract data from images or files rather than feeding a whole file through an LLM, which saves tokens and avoids inaccurate results.

Anthropic’s documentation notes that Claude Code also uses this approach to perform complex data analysis over large databases, writing targeted queries and using Bash commands like “head” and “tail” to analyze large volumes of data without ever loading the full data objects into context.
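In that spirit, an agent might emit a few lines like these rather than pasting a gigabyte of data into its own prompt; the file name here is hypothetical.

    from itertools import islice

    def preview(path, n=5):
        # Peek at the first n lines, like `head -n 5` in Bash: the agent
        # learns the file's shape without loading it into context.
        with open(path, encoding="utf-8") as f:
            return [line.rstrip("\n") for line in islice(f, n)]

    for line in preview("telemetry.csv"):  # hypothetical large file
        print(line)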

(In a way, these AI agents are guided but semi-autonomous tool-using programs that are a major extension of a concept we first saw in early 2023.)

Another major breakthrough in agents came from dynamic context management. Agents can do this in a few ways that are not fully disclosed in proprietary coding models, but we do know the most important technique they use: context compression.

The command-line version of OpenAI Codex running in a macOS terminal window.

The command-line version of OpenAI Codex running in a macOS terminal window. Credit: Benj Edwards

When a coding LLM nears its context limit, this technique compresses the context history by summarizing it, sacrificing detail to keep the essentials. Anthropic’s documentation describes this “compaction” as distilling context contents in a high-fidelity manner, preserving key details like architectural decisions and unresolved bugs while discarding redundant tool outputs.

This means the AI coding agents periodically “forget” a large portion of what they are doing every time this compression happens, but unlike older LLM-based systems, they aren’t completely clueless about what has transpired and can rapidly re-orient themselves by reading existing code, written notes left in files, change logs, and so on.
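A minimal sketch of such a compaction step, assuming count_tokens() and summarize() helpers backed by a tokenizer and the model itself; the threshold and the number of verbatim turns kept are illustrative guesses.

    def maybe_compact(history, count_tokens, summarize,
                      limit=100_000, keep=10):
        # Leave the transcript alone until it nears the context limit.
        if count_tokens(history) < int(limit * 0.9):
            return history
        old, recent = history[:-keep], history[-keep:]
        # Distill older turns into key details (decisions, open bugs),
        # discarding redundant tool output; keep recent turns verbatim.
        summary = summarize(old)
        return [("system", f"Summary of earlier work: {summary}")] + recent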

Anthropic’s documentation recommends using CLAUDE.md files to document common bash commands, core files, utility functions, code style guidelines, and testing instructions. AGENTS.md, now a multi-company standard, is another useful way of guiding agent actions in between context refreshes. These files act as external notes that let agents track progress across complex tasks while maintaining critical context that would otherwise be lost.
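For a hypothetical Node.js project, such a file might look like the sketch below; every command and rule in it is invented for illustration rather than taken from Anthropic’s documentation.

    # CLAUDE.md
    ## Common commands
    - npm run build: compile the project
    - npm test: run the test suite once
    ## Code style
    - TypeScript strict mode; prefer named exports
    ## Workflow notes
    - Never commit directly to main; always open a pull request
    - Update CHANGELOG.md after each completed task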

For tasks requiring extended work, both companies employ multi-agent architectures. According to Anthropic’s research documentation, its system uses an “orchestrator-worker pattern” in which a lead agent coordinates the process while delegating to specialized subagents that operate in parallel. When a user submits a query, the lead agent analyzes it, develops a strategy, and spawns subagents to explore different aspects simultaneously. The subagents act as intelligent filters, returning only relevant information rather than their full context to the lead agent.
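Sketched with the same assumed call_model() helper as above, with threads standing in for parallel subagents:

    from concurrent.futures import ThreadPoolExecutor

    def research(query, aspects, call_model):
        def subagent(aspect):
            findings = call_model(f"Investigate this aspect: {aspect}", query)
            # Act as an intelligent filter: hand back distilled facts,
            # not the subagent's full working context.
            return call_model("Summarize only the relevant findings.",
                              findings)
        with ThreadPoolExecutor() as pool:
            reports = list(pool.map(subagent, aspects))
        return call_model("Synthesize a final answer from these reports.",
                          "\n\n".join(reports))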

The multi-agent approach burns through tokens rapidly. Anthropic’s documentation notes that agents typically use about four times more tokens than chatbot interactions, and multi-agent systems use about 15 times more tokens than chats. For economic viability, these systems require tasks where the value is high enough to justify the increased cost.

Best practices for humans

While using these agents is contentious in some programming circles, if you use one to code a project, knowing good software development practices helps to head off future problems. For example, it’s good to know about version control, making incremental backups, implementing one feature at a time, and testing it before moving on.

What people call “vibe coding”—creating AI-generated code without understanding what it’s doing—is clearly dangerous for production work. Shipping code you didn’t write yourself in a production environment is risky because it could introduce security issues or other bugs, or accumulate technical debt that snowballs over time.

Independent AI researcher Simon Willison recently argued that developers using coding agents still bear responsibility for proving their code works. “Almost anyone can prompt an LLM to generate a thousand-line patch and submit it for code review,” Willison wrote. “That’s no longer valuable. What’s valuable is contributing code that is proven to work.”

In fact, human planning is key. Claude Code’s best practices documentation recommends a specific workflow for complex problems: First, ask the agent to read relevant files and explicitly tell it not to write any code yet, then ask it to make a plan. Without these research and planning steps, the documentation warns, Claude’s outputs tend to jump straight to coding a solution.

Without planning, LLMs sometimes reach for quick solutions to satisfy a momentary objective that might break later if a project were expanded. So having some idea of what makes a good architecture for a modular program that can be expanded over time can help you guide the LLM to craft something more durable.

As mentioned above, these agents aren’t perfect, and some people prefer not to use them at all. A randomized controlled trial published by the nonprofit research organization METR in July 2025 found that experienced open-source developers actually took 19 percent longer to complete tasks when using AI tools, despite believing they were working faster. The study’s authors note several caveats: The developers were highly experienced with their codebases (averaging five years and 1,500 commits), the repositories were large and mature, and the models used (primarily Claude 3.5 and 3.7 Sonnet via Cursor) have since been superseded by more capable versions.

Whether newer models would produce different results remains an open question, but the study suggests that AI coding tools may not always provide universal speed-ups, particularly for developers who already know their codebases well.

Given these potential hazards, coding proof-of-concept demos and internal tools is probably the ideal use of coding agents right now. Since AI models have no actual agency (despite being called agents) and are not people who can be held accountable for mistakes, human oversight is key.


Benj Edwards is Ars Technica’s Senior AI Reporter and founder of the site’s dedicated AI beat in 2022. He’s also a tech historian with almost two decades of experience. In his free time, he writes and records music, collects vintage computers, and enjoys nature. He lives in Raleigh, NC.


OpenAI’s new ChatGPT image generator makes faking photos easy

For most of photography’s roughly 200-year history, altering a photo convincingly required either a darkroom, some Photoshop expertise, or, at minimum, a steady hand with scissors and glue. On Tuesday, OpenAI released a tool that reduces the process to typing a sentence.

It’s not the first company to do so. While OpenAI had a conversational image-editing model in the works since GPT-4o in 2024, Google beat OpenAI to market in March with a public prototype, then refined it into the popular Nano Banana image model (and later Nano Banana Pro). The enthusiastic response to Google’s image-editing model in the AI community got OpenAI’s attention.

OpenAI’s new GPT Image 1.5 is an AI image synthesis model that reportedly generates images up to four times faster than its predecessor and costs about 20 percent less through the API. The model rolled out to all ChatGPT users on Tuesday and represents another step toward making photorealistic image manipulation a casual process that requires no particular visual skills.

The “Galactic Queen of the Universe” added to a photo of a room with a sofa using GPT Image 1.5 in ChatGPT.

GPT Image 1.5 is notable because it’s a “native multimodal” image model, meaning image generation happens inside the same neural network that processes language prompts. (In contrast, DALL-E 3, an earlier OpenAI image generator previously built into ChatGPT, used a different technique called diffusion to generate images.)

This newer type of model, which we covered in more detail in March, treats images and text as the same kind of thing: chunks of data called “tokens” to be predicted, patterns to be completed. If you upload a photo of your dad and type “put him in a tuxedo at a wedding,” the model processes your words and the image pixels in a unified space, then outputs new pixels the same way it would output the next word in a sentence.

Using this technique, GPT Image 1.5 can more easily alter visual reality than earlier AI image models, changing someone’s pose or position, or rendering a scene from a slightly different angle, with varying degrees of success. It can also remove objects, change visual styles, adjust clothing, and refine specific areas while preserving facial likeness across successive edits. You can converse with the AI model about a photograph, refining and revising, the same way you might workshop a draft of an email in ChatGPT.
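In practice, that conversation happens in the ChatGPT interface, but the same capability is exposed through OpenAI’s images API. The sketch below uses the images.edit endpoint from the official Python SDK; the gpt-image-1 identifier is the currently documented model name, and whether GPT Image 1.5 ships under a new identifier is an assumption to verify.

    import base64
    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    with open("dad.png", "rb") as photo:  # hypothetical input photo
        result = client.images.edit(
            model="gpt-image-1",  # assumed; check current model names
            image=photo,
            prompt="Put him in a tuxedo at a wedding",
        )

    # The API returns base64-encoded image data.
    with open("dad_tuxedo.png", "wb") as out:
        out.write(base64.b64decode(result.data[0].b64_json))

Each follow-up edit can feed the previous output back in, which is how a session refines an image while preserving likeness.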


Browser extensions with 8 million users collect extended AI conversations

Besides ChatGPT, Claude, and Gemini, the extensions harvest all conversations from Copilot, Perplexity, DeepSeek, Grok, and Meta AI. Koi said the full description of the data captured includes:

  • Every prompt a user sends to the AI
  • Every response received
  • Conversation identifiers and timestamps
  • Session metadata
  • The specific AI platform and model used

The executor script runs independently of the VPN networking, ad blocking, and other core functionality. That means that even when a user toggles off VPN networking, AI protection, ad blocking, or other functions, the conversation collection continues. The only way to stop the harvesting is to disable the extension in the browser settings or to uninstall it.

Koi said it first discovered the conversation harvesting in Urban VPN Proxy, a VPN routing extension that lists “AI protection” as one of its benefits. The data collection began in early July with the release of version 5.5.0.

“Anyone who used ChatGPT, Claude, Gemini, or the other targeted platforms while Urban VPN was installed after July 9, 2025 should assume those conversations are now on Urban VPN’s servers and have been shared with third parties,” the company said. “Medical questions, financial details, proprietary code, personal dilemmas—all of it, sold for ‘marketing analytics purposes.’”

Following that discovery, the security firm uncovered seven additional extensions with identical AI harvesting functionality. Four of the extensions are available in the Chrome Web Store. The other four are on the Edge add-ons page. Collectively, they have been installed more than 8 million times.

They are:

Chrome Web Store:

  • Urban VPN Proxy: 6 million users
  • 1ClickVPN Proxy: 600,000 users
  • Urban Browser Guard: 40,000 users
  • Urban Ad Blocker: 10,000 users

Edge Add-ons:

  • Urban VPN Proxy: 1.32 million users
  • 1ClickVPN Proxy: 36,459 users
  • Urban Browser Guard: 12,624 users
  • Urban Ad Blocker: 6,476 users

Read the fine print

The extensions come with conflicting messages about how they handle chatbot conversations, which often contain deeply personal information about users’ physical and mental health, finances, personal relationships, and other sensitive information that could be a gold mine for marketers and data brokers. The Urban VPN Proxy in the Chrome Web Store, for instance, lists “AI protection” as a benefit. It goes on to say:


Roomba maker iRobot swept into bankruptcy

In recent years, it has faced competition from cheaper Chinese rivals, including Picea, putting pressure on sales and forcing iRobot to reduce headcount. A management shake-up in early 2024 saw the departure of its co-founder as chief executive.

Amazon proposed buying the company in 2023, seeing synergy with its Alexa-powered smart speakers and Ring doorbells.

EU regulators, however, pushed back on the deal, raising concerns it would lead to reduced visibility for rival vacuum cleaner brands on Amazon’s website.

Amazon and iRobot terminated the deal little more than a month after Adobe’s $20 billion purchase of design software maker Figma was abandoned amid heightened US antitrust scrutiny under Joe Biden’s administration.

Although iRobot received $94 million in compensation for the termination of its deal with Amazon, a significant portion was used to pay advisory fees and repay part of a $200 million loan from private equity group Carlyle.

Picea’s Hong Kong subsidiary acquired the remaining $191 million of debt from Carlyle last month. At the time, iRobot already owed Picea $161.5 million for manufacturing services, nearly $91 million of which was overdue.

Alvarez & Marsal is serving as iRobot’s investment banker and financial adviser. The company is receiving legal advice from Paul, Weiss, Rifkind, Wharton & Garrison.

© 2025 The Financial Times Ltd. All rights reserved. Not to be redistributed, copied, or modified in any way.


Merriam-Webster’s word of the year delivers a dismissive verdict on junk AI content

Like most tools, generative AI models can be misused. And when the misuse gets bad enough that a major dictionary notices, you know it’s become a cultural phenomenon.

On Sunday, Merriam-Webster announced that “slop” is its 2025 Word of the Year, reflecting how the term has become shorthand for the flood of low-quality AI-generated content that has spread across social media, search results, and the web at large. The dictionary defines slop as “digital content of low quality that is produced usually in quantity by means of artificial intelligence.”

“It’s such an illustrative word,” Merriam-Webster president Greg Barlow told the Associated Press. “It’s part of a transformative technology, AI, and it’s something that people have found fascinating, annoying, and a little bit ridiculous.”

To select its Word of the Year, Merriam-Webster’s editors review data on which words rose in search volume and usage, then reach consensus on which term best captures the year. Barlow told the AP that the spike in searches for “slop” reflects growing awareness among users that they are encountering fake or shoddy content online.

Dictionaries have been tracking AI’s impact on language for the past few years; Cambridge selected “hallucinate” as its 2023 word of the year due to the tendency of AI models to generate plausible-but-false information (long-time Ars readers will be happy to hear there’s another term for that in the dictionary as well).

The trend extends to online culture in general, which is rife with new coinages. This year, Oxford University Press chose “rage bait,” referring to content designed to provoke anger for engagement. Cambridge Dictionary selected “parasocial,” describing one-sided relationships between fans and celebrities or influencers.

The difference between the baby and the bathwater

As the AP points out, the word “slop” originally entered English in the 1700s to mean soft mud. By the 1800s, it had evolved to describe food waste fed to pigs, and eventually came to mean rubbish or products of little value. The new AI-related definition builds on that history of describing something unwanted and unpleasant.


Microsoft will finally kill obsolete cipher that has wreaked decades of havoc

Microsoft said it has steadily worked over the past decade to deprecate RC4, but that the task wasn’t easy.

No salt, no iteration? Really?

“The problem though is that it’s hard to kill off a cryptographic algorithm that is present in every OS that’s shipped for the last 25 years and was the default algorithm for so long,” Steve Syfuhs, who runs Microsoft’s Windows Authentication team, wrote on Bluesky. “See,” he continued, “the problem is not that the algorithm exists. The problem is how the algorithm is chosen, and the rules governing that spanned 20 years of code changes.”

Over those two decades, developers discovered a raft of critical RC4 vulnerabilities that required “surgical” fixes. Microsoft considered deprecating RC4 by this year but ultimately “punted” after discovering vulnerabilities that required still more fixes. During that time, Microsoft introduced some “minor improvements” that favored the use of AES, and as a result, usage dropped by “orders of magnitude.”

“Within a year we had observed RC4 usage drop to basically nil. This is not a bad thing and in fact gave us a lot more flexibility to kill it outright because we knew it genuinely wasn’t going to break folks, because folks weren’t using it.”

Syfuhs went on to document additional challenges Microsoft encountered and the approach it took to solving them.

While RC4 has known cipher weaknesses that make it insecure, Kerberoasting exploits a separate weakness: As implemented in Active Directory authentication, RC4-based credentials use no cryptographic salt and only a single round of the MD4 hashing function. Salt is a technique that adds extra input to each password before it is hashed, forcing attackers to crack each hash individually rather than attacking many at once. MD4, meanwhile, is a fast algorithm that requires modest resources. Microsoft’s implementation of AES-SHA1 is much slower and iterates the hash to further slow down cracking efforts. Taken together, AES-SHA1-hashed passwords require about 1,000 times the time and resources to crack.
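Python’s standard library can illustrate the gap. The sketch below contrasts a single round of a fast unsalted digest with PBKDF2’s salted, iterated HMAC-SHA1. MD5 stands in for MD4, which many modern OpenSSL builds no longer ship, and the random salt is a simplification (Kerberos derives its salt from the realm and principal name).

    import hashlib
    import os

    password = b"hunter2"

    # RC4-era style: one round of a fast digest, no salt. Identical
    # passwords always hash identically, and GPUs can try billions
    # of guesses per second against output like this.
    weak = hashlib.md5(password).hexdigest()

    # AES/Kerberos style: a salt plus 4,096 iterations of HMAC-SHA1
    # via PBKDF2, which is what makes each guess far more expensive.
    salt = os.urandom(16)
    strong = hashlib.pbkdf2_hmac("sha1", password, salt, 4096).hex()

    print("fast, unsalted:", weak)
    print("slow, salted:  ", strong)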

Windows admins would do well to audit their networks for any usage of RC4. Given its wide adoption and continued use industry-wide, it may still be active, much to the surprise and chagrin of those charged with defending against hackers.


OpenAI built an AI coding agent and uses it to improve the agent itself


“The vast majority of Codex is built by Codex,” OpenAI told us about its new AI coding agent.

With the popularity of AI coding tools rising among some software developers, their adoption has begun to touch every aspect of the process, including the improvement of AI coding tools themselves.

In interviews with Ars Technica this week, OpenAI employees revealed the extent to which the company now relies on its own AI coding agent, Codex, to build and improve the development tool. “I think the vast majority of Codex is built by Codex, so it’s almost entirely just being used to improve itself,” said Alexander Embiricos, product lead for Codex at OpenAI, in a conversation on Tuesday.

Codex, which OpenAI launched in its modern incarnation as a research preview in May 2025, operates as a cloud-based software engineering agent that can handle tasks like writing features, fixing bugs, and proposing pull requests. The tool runs in sandboxed environments linked to a user’s code repository and can execute multiple tasks in parallel. OpenAI offers Codex through ChatGPT’s web interface, a command-line interface (CLI), and IDE extensions for VS Code, Cursor, and Windsurf.

The “Codex” name itself dates back to a 2021 OpenAI model based on GPT-3 that powered GitHub Copilot’s tab completion feature. Embiricos said the name is rumored among staff to be short for “code execution.” OpenAI wanted to connect the new agent to that earlier moment, which was crafted in part by some who have left the company.

“For many people, that model powering GitHub Copilot was the first ‘wow’ moment for AI,” Embiricos said. “It showed people the potential of what it can mean when AI is able to understand your context and what you’re trying to do and accelerate you in doing that.”

A place to enter a prompt, set parameters, and click

The interface for OpenAI’s Codex in ChatGPT. Credit: OpenAI

It’s no secret that the current command-line version of Codex bears some resemblance to Claude Code, Anthropic’s agentic coding tool that launched in February 2025. When asked whether Claude Code influenced Codex’s design, Embiricos parried the question but acknowledged the competitive dynamic. “It’s a fun market to work in because there’s lots of great ideas being thrown around,” he said. He noted that OpenAI had been building web-based Codex features internally before shipping the CLI version, which arrived after Anthropic’s tool.

OpenAI’s customers apparently love the command line version, though. Embiricos said Codex usage among external developers jumped 20 times after OpenAI shipped the interactive CLI extension alongside GPT-5 in August 2025. On September 15, OpenAI released GPT-5 Codex, a specialized version of GPT-5 optimized for agentic coding, which further accelerated adoption.

It hasn’t just been the outside world that has embraced the tool. Embiricos said the vast majority of OpenAI’s engineers now use Codex regularly. The company uses the same open-source version of the CLI that external developers can freely download, suggest additions to, and modify themselves. “I really love this about our team,” Embiricos said. “The version of Codex that we use is literally the open source repo. We don’t have a different repo that features go in.”

The recursive nature of Codex development extends beyond simple code generation. Embiricos described scenarios where Codex monitors its own training runs and processes user feedback to “decide” what to build next. “We have places where we’ll ask Codex to look at the feedback and then decide what to do,” he said. “Codex is writing a lot of the research harness for its own training runs, and we’re experimenting with having Codex monitoring its own training runs.” OpenAI employees can also submit a ticket to Codex through project management tools like Linear, assigning it tasks the same way they would assign work to a human colleague.

This kind of recursive loop, of using tools to build better tools, has deep roots in computing history. Engineers designed the first integrated circuits by hand on vellum and paper in the 1960s, then fabricated physical chips from those drawings. Those chips powered the computers that ran the first electronic design automation (EDA) software, which in turn enabled engineers to design circuits far too complex for any human to draft manually. Modern processors contain billions of transistors arranged in patterns that exist only because software made them possible. OpenAI’s use of Codex to build Codex seems to follow the same pattern: each generation of the tool creates capabilities that feed into the next.

But describing what Codex actually does presents something of a linguistic challenge. At Ars Technica, we try to reduce anthropomorphism when discussing AI models as much as possible while also describing what these systems do using analogies that make sense to general readers. People can talk to Codex like a human, so it feels natural to use human terms to describe interacting with it, even though it is not a person and simulates human personality through statistical modeling.

The system runs many processes autonomously, addresses feedback, spins off and manages child processes, and produces code that ships in real products. OpenAI employees call it a “teammate” and assign it tasks through the same tools they use for human colleagues. Whether the tasks Codex handles constitute “decisions” or sophisticated conditional logic smuggled through a neural network depends on definitions that computer scientists and philosophers continue to debate. What we can say is that a semi-autonomous feedback loop exists: Codex produces code under human direction, that code becomes part of Codex, and the next version of Codex produces different code as a result.

Building faster with “AI teammates”

In our interviews, the most dramatic example of Codex’s internal impact came from OpenAI’s development of the Sora Android app. Embiricos said the tool allowed the company to create the app in record time.

“The Sora Android app was shipped by four engineers from scratch,” Embiricos told Ars. “It took 18 days to build, and then we shipped it to the app store in 28 days total,” he said. The engineers already had the iOS app and server-side components to work from, so they focused on building the Android client. They used Codex to help plan the architecture, generate sub-plans for different components, and implement those components.

Despite OpenAI’s claims of success with Codex in house, it’s worth noting that independent research has shown mixed results for AI coding productivity. A METR study published in July found that experienced open source developers were actually 19 percent slower when using AI tools on complex, mature codebases—though the researchers noted AI may perform better on simpler projects.

Ed Bayes, a designer on the Codex team, described how the tool has changed his own workflow. Bayes said Codex now integrates with project management tools like Linear and communication platforms like Slack, allowing team members to assign coding tasks directly to the AI agent. “You can add Codex, and you can basically assign issues to Codex now,” Bayes told Ars. “Codex is literally a teammate in your workspace.”

This integration means that when someone posts feedback in a Slack channel, they can tag Codex and ask it to fix the issue. The agent will create a pull request, and team members can review and iterate on the changes through the same thread. “It’s basically approximating this kind of coworker and showing up wherever you work,” Bayes said.

For Bayes, who works on the visual design and interaction patterns for Codex’s interfaces, the tool has enabled him to contribute code directly rather than handing off specifications to engineers. “It kind of gives you more leverage. It enables you to work across the stack and basically be able to do more things,” he said. He noted that designers at OpenAI now prototype features by building them directly, using Codex to handle the implementation details.

The command-line version of OpenAI Codex running in a macOS terminal window.

The command-line version of OpenAI Codex running in a macOS terminal window. Credit: Benj Edwards

OpenAI’s approach treats Codex as what Bayes called “a junior developer” that the company hopes will graduate into a senior developer over time. “If you were onboarding a junior developer, how would you onboard them? You give them a Slack account, you give them a Linear account,” Bayes said. “It’s not just this tool that you go to in the terminal, but it’s something that comes to you as well and sits within your team.”

Given this teammate approach, will there be anything left for humans to do? When asked, Embiricos drew a distinction between “vibe coding,” where developers accept AI-generated code without close review, and what AI researcher Simon Willison calls “vibe engineering,” where humans stay in the loop. “We see a lot more vibe engineering in our code base,” he said. “You ask Codex to work on that, maybe you even ask for a plan first. Go back and forth, iterate on the plan, and then you’re in the loop with the model and carefully reviewing its code.”

He added that vibe coding still has its place for prototypes and throwaway tools. “I think vibe coding is great,” he said. “Now you have discretion as a human about how much attention you wanna pay to the code.”

Looking ahead

Over the past year, “monolithic” large language models (LLMs) like GPT-4.5 have apparently become something of a dead end for frontier benchmarking progress, as AI companies pivot to simulated reasoning models and agentic systems built from multiple AI models running in parallel. We asked Embiricos whether agents like Codex represent the best path forward for squeezing utility out of existing LLM technology.

He dismissed concerns that AI capabilities have plateaued. “I think we’re very far from plateauing,” he said. “If you look at the velocity on the research team here, we’ve been shipping models almost every week or every other week.” He pointed to recent improvements where GPT-5-Codex reportedly completes tasks 30 percent faster than its predecessor at the same intelligence level. During testing, the company has seen the model work independently for 24 hours on complex tasks.

OpenAI faces competition from multiple directions in the AI coding market. Anthropic’s Claude Code and Google’s Gemini CLI offer similar terminal-based agentic coding experiences. This week, Mistral AI released Devstral 2 alongside a CLI tool called Mistral Vibe. Meanwhile, startups like Cursor have built dedicated IDEs around AI coding, reportedly reaching $300 million in annualized revenue.

Given the well-known issues with confabulation in AI models when people attempt to use them as factual resources, could it be that coding has become the killer app for LLMs? We wondered if OpenAI has noticed that coding seems to be a clear business use case for today’s AI models with less hazard than, say, using AI language models for writing or as emotional companions.

“We have absolutely noticed that coding is both a place where agents are gonna get good really fast and there’s a lot of economic value,” Embiricos said. “We feel like it’s very mission-aligned to focus on Codex. We get to provide a lot of value to developers. Also, developers build things for other people, so we’re kind of intrinsically scaling through them.”

But will tools like Codex threaten software developer jobs? Bayes acknowledged concerns but said Codex has not reduced headcount at OpenAI, and “there’s always a human in the loop because the human can actually read the code.” Similarly, the two men don’t project a future where Codex runs by itself without some form of human oversight. They feel the tool is an amplifier of human potential rather than a replacement for it.

The practical implications of agents like Codex extend beyond OpenAI’s walls. Embiricos said the company’s long-term vision involves making coding agents useful to people who have no programming experience. “All humanity is not gonna open an IDE or even know what a terminal is,” he said. “We’re building a coding agent right now that’s just for software engineers, but we think of the shape of what we’re building as really something that will be useful to be a more general agent.”

This article was updated on December 12, 2025 at 6:50 PM to mention the METR study.



OpenAI releases GPT-5.2 after “code red” Google threat alert

On Thursday, OpenAI released GPT-5.2, its newest family of AI models for ChatGPT, in three versions called Instant, Thinking, and Pro. The release follows CEO Sam Altman’s internal “code red” memo earlier this month, which directed company resources toward improving ChatGPT in response to competitive pressure from Google’s Gemini 3 AI model.

“We designed 5.2 to unlock even more economic value for people,” Fidji Simo, OpenAI’s chief product officer, said during a press briefing with journalists on Thursday. “It’s better at creating spreadsheets, building presentations, writing code, perceiving images, understanding long context, using tools and then linking complex, multi-step projects.”

As with previous versions of GPT-5, the three model tiers serve different purposes: Instant handles faster tasks like writing and translation; Thinking generates simulated reasoning “thinking” text in an attempt to tackle more complex work like coding and math; and Pro produces even more of that reasoning text with the goal of delivering the highest-accuracy performance on difficult problems.

A chart of GPT-5.2 benchmark results taken from OpenAI's website.

A chart of GPT-5.2 Thinking benchmark results comparing it to its predecessor, taken from OpenAI’s website. Credit: OpenAI

GPT-5.2 features a 400,000-token context window, allowing it to process hundreds of documents at once, and a knowledge cutoff date of August 31, 2025.

GPT-5.2 is rolling out to paid ChatGPT subscribers starting Thursday, with API access available to developers. Pricing in the API runs $1.75 per million input tokens for the standard model, a 40 percent increase over GPT-5.1’s $1.25. OpenAI says the older GPT-5.1 will remain available in ChatGPT for paid users for three months under a legacy models dropdown.

Playing catch-up with Google

The release follows a tricky month for OpenAI. In early December, Altman issued an internal “code red” directive after Google’s Gemini 3 model topped multiple AI benchmarks and gained market share. The memo called for delaying other initiatives, including advertising plans for ChatGPT, to focus on improving the chatbot’s core experience.

The stakes for OpenAI are substantial. The company has made commitments totaling $1.4 trillion for AI infrastructure buildouts over the next several years, bets it made when it had a more obvious technology lead among AI companies. Google’s Gemini app now has more than 650 million monthly active users, while OpenAI reports 800 million weekly active users for ChatGPT.

OpenAI releases GPT-5.2 after “code red” Google threat alert Read More »

disney-invests-$1-billion-in-openai,-licenses-200-characters-for-ai-video-app-sora

Disney invests $1 billion in OpenAI, licenses 200 characters for AI video app Sora

An AI-generated version of OpenAI CEO Sam Altman seen in a still capture from a video generated by Sora 2. Credit: OpenAI

Under the new agreement with Disney, Sora users will be able to generate short videos using characters such as Mickey Mouse, Darth Vader, Iron Man, Simba, and characters from franchises including Frozen, Inside Out, Toy Story, and The Mandalorian, along with costumes, props, vehicles, and environments.

The ChatGPT image generator will also gain official access to the same intellectual property, although those characters’ likenesses were trained into these AI models long ago. What’s changing is that OpenAI will allow Disney-related content generated by its AI models to officially pass through its content moderation filters and reach the user, sanctioned by Disney.

On Disney’s end of the deal, the company plans to deploy ChatGPT for its employees and use OpenAI’s technology to build new features for Disney+. A curated selection of fan-made Sora videos will stream on the Disney+ platform starting in early 2026.

The agreement does not include any talent likenesses or voices. Disney and OpenAI said they have committed to “maintaining robust controls to prevent the generation of illegal or harmful content” and to “respect the rights of individuals to appropriately control the use of their voice and likeness.”

OpenAI CEO Sam Altman called the deal a model for collaboration between AI companies and studios. “This agreement shows how AI companies and creative leaders can work together responsibly to promote innovation that benefits society, respect the importance of creativity, and help works reach vast new audiences,” Altman said.

From adversary to partner

Money opens all kinds of doors, and the new partnership represents a dramatic reversal in Disney’s approach to OpenAI from just a few months ago. At that time, Disney and other major studios refused to participate in Sora 2 following its launch on September 30.

Disney invests $1 billion in OpenAI, licenses 200 characters for AI video app Sora Read More »

oracle-shares-slide-on-$15b-increase-in-data-center-spending

Oracle shares slide on $15B increase in data center spending

Oracle’s Big Tech rivals such as Amazon, Microsoft, and Google have helped reassure investors about their large capital investments by posting strong earnings from their vast cloud units.

But in the last quarter, Oracle’s cloud infrastructure business, which includes its data centers, posted worse-than-expected revenues of $4.1 billion. Larry Ellison’s company is also relying more heavily on debt to fuel its expansion.

Net income rose to $6.1 billion in the quarter, boosted by a $2.7 billion pre-tax gain from the sale of semiconductor company Ampere to SoftBank.

The company added 400 MW of data center capacity in the quarter, co-CEO Clay Magouyrk told investors. Construction was on track at its large data center cluster in Abilene, Texas, which is being built for OpenAI, he added.

Magouyrk, who took over from Safra Catz in September, said there was ample demand from other clients for Oracle’s data centers if OpenAI did not take up the full amount it had contracted for.

“We have a customer base with a lot of demand such that whenever we find ourselves [with] capacity that’s not being used, it very quickly gets allocated,” he said.

Co-founded by Ellison as a business software provider, Oracle was slow to pivot to cloud computing. The billionaire remains chair and its largest shareholder.

Investors and analysts have raised concerns in recent months about the upfront spending required by Oracle to honor its AI infrastructure contracts. Moody’s in September flagged the company’s reliance on a small number of large customers such as OpenAI.

Morgan Stanley forecasts that Oracle’s net debt will soar to about $290 billion by 2028. The company sold $18 billion of bonds in September and is in talks to raise $38 billion in debt financing through a number of US banks.

Brent Thill, an analyst at Jefferies, said Oracle’s software business—which generated $5.9 billion in the quarter—provided some buffer amid accelerated spending. “But the timing mismatch between upfront capex and delayed monetization creates near-term pressure.”

Doug Kehring, principal financial officer, said the company was renting capacity from data center specialists to reduce its direct borrowing.

The debt to build the Abilene site was raised by start-up Crusoe and investment group Blue Owl Capital, and Oracle has signed a 15-year lease for the site.

“Oracle does not pay for these leases until the completed data centers… are delivered to us,” Kehring said, adding that the company was “committed to maintaining our investment-grade debt ratings.”

© 2025 The Financial Times Ltd. All rights reserved. Not to be redistributed, copied, or modified in any way.

Oracle shares slide on $15B increase in data center spending Read More »

a-new-open-weights-ai-coding-model-is-closing-in-on-proprietary-options

A new open-weights AI coding model is closing in on proprietary options

On Tuesday, French AI startup Mistral AI released Devstral 2, a 123 billion parameter open-weights coding model designed to work as part of an autonomous software engineering agent. The model achieves a 72.2 percent score on SWE-bench Verified, a benchmark that attempts to test whether AI systems can solve real GitHub issues, putting it among the top-performing open-weights models.

Perhaps more notably, Mistral didn’t just release an AI model; it also released a new development app called Mistral Vibe. It’s a command line interface (CLI), similar to Claude Code, OpenAI Codex, and Gemini CLI, that lets developers interact with the Devstral models directly in their terminal. The tool can scan file structures and Git status to maintain context across an entire project, make changes across multiple files, and execute shell commands autonomously. Mistral released the CLI under the Apache 2.0 license.

It’s always wise to take AI benchmarks with a large grain of salt, but we’ve heard from employees of the big AI companies that they pay very close attention to how well models do on SWE-bench Verified, which presents AI models with 500 real software engineering problems pulled from GitHub issues in popular Python repositories. The AI must read the issue description, navigate the codebase, and generate a working patch that passes unit tests. While some AI researchers have noted that around 90 percent of the tasks in the benchmark test relatively simple bug fixes that experienced engineers could complete in under an hour, it’s one of the few standardized ways to compare coding models.
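
For the curious, the benchmark’s tasks are publicly inspectable. The sketch below assumes the Hugging Face datasets library and the published princeton-nlp/SWE-bench_Verified dataset; it only peeks at a task’s fields and omits the actual harness that applies a model’s patch and runs the tests.

```python
# Peek at a SWE-bench Verified task. Assumes `pip install datasets` and
# the public princeton-nlp/SWE-bench_Verified dataset on Hugging Face.
from datasets import load_dataset

ds = load_dataset("princeton-nlp/SWE-bench_Verified", split="test")
task = ds[0]

print(task["repo"])               # the GitHub repository the issue came from
print(task["problem_statement"])  # the issue text the model must read
print(task["FAIL_TO_PASS"])       # tests a correct patch must make pass
```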

Alongside the larger model, Mistral also released Devstral Small 2, a 24 billion parameter version that scores 68 percent on the same benchmark and can run locally on consumer hardware, such as a laptop, with no Internet connection required. Both models support a 256,000-token context window, allowing them to process moderately large codebases (though how large a codebase feels depends heavily on overall project complexity). The company released Devstral 2 under a modified MIT license and Devstral Small 2 under the more permissive Apache 2.0 license.
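
As a rough illustration of what “running locally” looks like, here is a sketch using Hugging Face transformers. The repo id below is hypothetical (check Mistral’s Hugging Face page for the actual name), and a 24 billion parameter model at bfloat16 precision needs roughly 48 GB of memory, so a real laptop run would rely on a quantized build, which this sketch omits.

```python
# Sketch of local inference with Hugging Face transformers. The model id
# below is hypothetical; a laptop would realistically need a quantized
# (e.g., 4-bit) build rather than full bfloat16 weights.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "mistralai/Devstral-Small-2"  # hypothetical identifier
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

prompt = "Fix the off-by-one error in this function:\n\ndef last(xs): return xs[len(xs)]"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```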

A new open-weights AI coding model is closing in on proprietary options Read More »