chatgtp

The GPT-5 rollout has been a big mess

It’s been less than a week since the launch of OpenAI’s new GPT-5 AI model, and the rollout hasn’t been a smooth one. So far, the release has sparked one of the most intense user revolts in ChatGPT’s history, forcing CEO Sam Altman to make an unusual public apology and reverse key decisions.

At the heart of the controversy has been OpenAI’s decision to automatically remove access to all previous AI models in ChatGPT (approximately nine, depending on how you count them) when GPT-5 rolled out to user accounts. Unlike API users who receive advance notice of model deprecations, consumer ChatGPT users had no warning that their preferred models would disappear overnight, noted independent AI researcher Simon Willison in a blog post.

The problems started immediately after GPT-5’s August 7 debut. A Reddit thread titled “GPT-5 is horrible” quickly amassed over 4,000 comments filled with users expressing frustration over the new release. By August 8, social media platforms were flooded with complaints about performance issues, personality changes, and the forced removal of older models.

Prior to the launch of GPT-5, ChatGPT Pro users could select between nine different AI models, including Deep Research. (This screenshot is from May 14, 2025, and OpenAI later replaced o1 pro with o3-pro.) Credit: Benj Edwards

Marketing professionals, researchers, and developers all shared examples of broken workflows on social media. “I’ve spent months building a system to work around OpenAI’s ridiculous limitations in prompts and memory issues,” wrote one Reddit user in the r/OpenAI subreddit. “And in less than 24 hours, they’ve made it useless.”

How could different AI language models break a workflow? Each model is trained differently and has its own output style, so users develop sets of prompts tuned to produce useful results from a specific model. When that model disappears, those prompts may no longer produce the same results.

For example, Willison described how different user groups had developed distinct workflows with specific AI models in ChatGPT over time, quoting one Reddit user who explained: “I know GPT-5 is designed to be stronger for complex reasoning, coding, and professional tasks, but not all of us need a pro coding model. Some of us rely on 4o for creative collaboration, emotional nuance, roleplay, and other long-form, high-context interactions.”

OpenAI’s most capable AI model, GPT-5, may be coming in August

References to “gpt-5-reasoning-alpha-2025-07-13” have already been spotted on X, with code showing “reasoning_effort: high” in the model configuration. These sightings suggest the model has entered final testing phases, with testers getting their hands on the code and security experts doing red teaming on the model to test vulnerabilities.

Unifying OpenAI’s model lineup

The new model represents OpenAI’s attempt to simplify its increasingly complex product lineup. As Altman explained in February, GPT-5 may integrate features from both the company’s conventional GPT models and its reasoning-focused o-series models into a single system.

“We’re truly excited to not just make a net new great frontier model, we’re also going to unify our two series,” OpenAI’s Head of Developer Experience Romain Huet said at a recent event. “The breakthrough of reasoning in the O-series and the breakthroughs in multi-modality in the GPT-series will be unified, and that will be GPT-5.”

According to The Information, GPT-5 is expected to be better at coding and more powerful overall, combining attributes of both traditional models and simulated reasoning (SR) models such as o3.

Before GPT-5 arrives, OpenAI still plans to release its first open-weights model since GPT-2 in 2019, which means others with the proper hardware will be able to download and run the AI model on their own machines. The Verge describes this model as “similar to o3 mini” with reasoning capabilities. However, Altman announced on July 11 that the open model needs additional safety testing, saying, “We are not yet sure how long it will take us.”

What is AGI? Nobody agrees, and it’s tearing Microsoft and OpenAI apart.


Several definitions make measuring “human-level” AI an exercise in moving goalposts.

When is an AI system intelligent enough to be called artificial general intelligence (AGI)? According to one definition reportedly agreed upon by Microsoft and OpenAI, the answer lies in economics: When AI generates $100 billion in profits. This arbitrary profit-based benchmark for AGI perfectly captures the definitional chaos plaguing the AI industry.

In fact, it may be impossible to create a universal definition of AGI, but few people with money on the line will admit it.

Over this past year, several high-profile people in the tech industry have been heralding the seemingly imminent arrival of “AGI” (i.e., within the next two years). But there’s a huge problem: Few people agree on exactly what AGI means. As Google DeepMind wrote in a paper on the topic: If you ask 100 AI experts to define AGI, you’ll get “100 related but different definitions.”

This isn’t just academic navel-gazing. The definition problem has real consequences for how we develop, regulate, and think about AI systems. When companies claim they’re on the verge of AGI, what exactly are they claiming?

I tend to define AGI in a traditional way that hearkens back to the “general” part of its name: An AI model that can widely generalize—applying concepts to novel scenarios—and match the versatile human capability to perform unfamiliar tasks across many domains without needing to be specifically trained for them.

However, this definition immediately runs into thorny questions about what exactly constitutes “human-level” performance. Expert-level humans? Average humans? And across which tasks—should an AGI be able to perform surgery, write poetry, fix a car engine, and prove mathematical theorems, all at the level of human specialists? (Which human can do all that?) More fundamentally, the focus on human parity is itself an assumption; it’s worth asking why mimicking human intelligence is the necessary yardstick at all.

The latest example of this definitional confusion causing trouble comes from the deteriorating relationship between Microsoft and OpenAI. According to The Wall Street Journal, the two companies are now locked in acrimonious negotiations partly because they can’t agree on what AGI even means—despite having baked the term into a contract worth over $13 billion.

A brief history of moving goalposts

The term artificial general intelligence has murky origins. While John McCarthy and colleagues coined the term artificial intelligence at Dartmouth College in 1956, AGI emerged much later. Physicist Mark Gubrud first used the term in 1997, though it was computer scientist Shane Legg and AI researcher Ben Goertzel who independently reintroduced it around 2002, with the modern usage popularized by a 2007 book edited by Goertzel and Cassio Pennachin.

Early AI researchers envisioned systems that could match human capability across all domains. In 1965, AI pioneer Herbert A. Simon predicted that “machines will be capable, within 20 years, of doing any work a man can do.” But as robotics lagged behind computing advances, the definition narrowed. The goalposts shifted, partly as a practical response to this uneven progress, from “do everything a human can do” to “do most economically valuable tasks” to today’s even fuzzier standards.

“An assistant of inventor Captain Richards works on the robot the Captain has invented, which speaks, answers questions, shakes hands, tells the time, and sits down when it’s told to.” – September 1928. Credit: Getty Images

For decades, the Turing Test served as the de facto benchmark for machine intelligence. If a computer could fool a human judge into thinking it was human through text conversation, the test surmised, then it had achieved something like human intelligence. But the Turing Test has shown its age. Modern language models can pass some limited versions of the test not because they “think” like humans, but because they’re exceptionally capable at creating highly plausible human-sounding outputs.

The current landscape of AGI definitions reveals just how fractured the concept has become. OpenAI’s charter defines AGI as “highly autonomous systems that outperform humans at most economically valuable work”—a definition that, like the profit metric, relies on economic progress as a substitute for measuring cognition in a concrete way. Mark Zuckerberg told The Verge that he does not have a “one-sentence, pithy definition” of the concept. OpenAI CEO Sam Altman believes that his company now knows how to build AGI “as we have traditionally understood it.” Meanwhile, former OpenAI Chief Scientist Ilya Sutskever reportedly treated AGI as something almost mystical—according to a 2023 Atlantic report, he would lead employees in chants of “Feel the AGI!” during company meetings, treating the concept more like a spiritual quest than a technical milestone.

Dario Amodei, co-founder and chief executive officer of Anthropic, during the Bloomberg Technology Summit in San Francisco on Thursday, May 9, 2024. Credit: Bloomberg via Getty Images

Dario Amodei, CEO of Anthropic, takes an even more skeptical stance on the terminology itself. In his October 2024 essay “Machines of Loving Grace,” Amodei writes that he finds “AGI to be an imprecise term that has gathered a lot of sci-fi baggage and hype.” Instead, he prefers terms like “powerful AI” or “Expert-Level Science and Engineering,” which he argues better capture the capabilities without the associated hype. When Amodei describes what others might call AGI, he frames it as an AI system “smarter than a Nobel Prize winner across most relevant fields” that can work autonomously on tasks taking hours, days, or weeks to complete—essentially “a country of geniuses in a data center.” His resistance to AGI terminology adds another layer to the definitional chaos: Not only do we not agree on what AGI means, but some leading AI developers reject the term entirely.

Perhaps the most systematic attempt to bring order to this chaos comes from Google DeepMind, which in July 2024 proposed a framework with five levels of AGI performance: emerging, competent, expert, virtuoso, and superhuman. DeepMind researchers argued that no level beyond “emerging AGI” existed at that time. Under their system, today’s most capable LLMs and simulated reasoning models still qualify as “emerging AGI”—equal to or somewhat better than an unskilled human at various tasks.

But this framework has its critics. Heidy Khlaaf, chief AI scientist at the nonprofit AI Now Institute, told TechCrunch that she thinks the concept of AGI is too ill-defined to be “rigorously evaluated scientifically.” In fact, with so many varied definitions at play, one could argue that the term AGI has become technically meaningless.

When philosophy meets contract law

The Microsoft-OpenAI dispute illustrates what happens when philosophical speculation is turned into legal obligations. When the companies signed their partnership agreement, they included a clause stating that when OpenAI achieves AGI, it can limit Microsoft’s access to future technology. According to The Wall Street Journal, OpenAI executives believe they’re close to declaring AGI, while Microsoft CEO Satya Nadella has called the idea of using AGI as a self-proclaimed milestone “nonsensical benchmark hacking” on the Dwarkesh Patel podcast in February.

The reported $100 billion profit threshold we mentioned earlier conflates commercial success with cognitive capability, as if a system’s ability to generate revenue says anything meaningful about whether it can “think,” “reason,” or “understand” the world like a human.

Sam Altman speaks onstage during The New York Times Dealbook Summit 2024 at Jazz at Lincoln Center on December 4, 2024, in New York City. Credit: Eugene Gologursky via Getty Images

Depending on your definition, we may already have AGI, or it may be physically impossible to achieve. If you define AGI as “AI that performs better than most humans at most tasks,” then current language models potentially meet that bar for certain types of work (which tasks, which humans, what is “better”?), but agreement on whether that is true is far from universal. This says nothing of the even murkier concept of “superintelligence”—another nebulous term for a hypothetical, god-like intellect so far beyond human cognition that, like AGI, defies any solid definition or benchmark.

Given this definitional chaos, researchers have tried to create objective benchmarks to measure progress toward AGI, but these attempts have revealed their own set of problems.

Why benchmarks keep failing us

The search for better AGI benchmarks has produced some interesting alternatives to the Turing Test. The Abstraction and Reasoning Corpus (ARC-AGI), introduced in 2019 by François Chollet, tests whether AI systems can solve novel visual puzzles that require deep and novel analytical reasoning.

“Almost all current AI benchmarks can be solved purely via memorization,” Chollet told Freethink in August 2024. A major problem with AI benchmarks currently stems from data contamination—when test questions end up in training data, models can appear to perform well without truly “understanding” the underlying concepts. Large language models serve as master imitators, mimicking patterns found in training data, but not always originating novel solutions to problems.
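To make the contamination problem concrete, here is a minimal sketch of one common way to look for it: checking whether any run of consecutive words from a test question also appears in a body of training text. Everything in it is an illustrative assumption rather than a method described by Chollet or any AI lab: the function names `ngrams` and `contamination_rate` are hypothetical, as are the default window of eight words and the simple word-splitting rule. Real contamination audits are considerably more involved.

```python
import re


def ngrams(text: str, n: int = 8) -> set[tuple[str, ...]]:
    """Return the set of lowercased word-level n-grams found in text."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    # Texts shorter than n words yield an empty set.
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def contamination_rate(
    test_items: list[str], training_docs: list[str], n: int = 8
) -> float:
    """Fraction of test items that share at least one n-gram with the training docs."""
    if not test_items:
        return 0.0
    train_grams: set[tuple[str, ...]] = set()
    for doc in training_docs:
        train_grams |= ngrams(doc, n)
    # An item is flagged if any of its n-grams also appears in the training text.
    flagged = sum(1 for item in test_items if ngrams(item, n) & train_grams)
    return flagged / len(test_items)
```

A high rate would suggest that a benchmark score partly reflects memorized text, though a low rate would not prove the opposite, since paraphrased or translated test questions would slip past an exact-match check like this one.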

But even sophisticated benchmarks like ARC-AGI face a fundamental problem: They’re still trying to reduce intelligence to a score. And while improved benchmarks are essential for measuring empirical progress in a scientific framework, intelligence isn’t a single thing you can measure like height or weight—it’s a complex constellation of abilities that manifest differently in different contexts. Indeed, we don’t even have a complete functional definition of human intelligence, so defining artificial intelligence by any single benchmark score is likely to capture only a small part of the complete picture.

The survey says: AGI may not be imminent

There is no doubt that AI has seen rapid, tangible progress in numerous areas, including computer vision, protein folding, and translation. Some excitement about that progress is justified, but it’s important not to oversell an AI model’s capabilities prematurely.

Despite the hype from some in the industry, many AI researchers remain skeptical that AGI is just around the corner. A March 2025 survey of AI researchers conducted by the Association for the Advancement of Artificial Intelligence (AAAI) found that a majority (76 percent) of researchers who participated in the survey believed that scaling up current approaches is “unlikely” or “very unlikely” to achieve AGI.

However, such expert predictions should be taken with a grain of salt, as researchers have consistently been surprised by the rapid pace of AI capability advancement. A 2024 survey by Grace et al. of 2,778 AI researchers found that experts had dramatically shortened their timelines for AI milestones after being surprised by progress in 2022–2023. The median forecast for when AI could outperform humans in every possible task jumped forward by 13 years, from 2060 in their 2022 survey to 2047 in 2023. This pattern of underestimation was evident across multiple benchmarks, with many researchers’ predictions about AI capabilities being proven wrong within months.

And yet, as the tech landscape shifts, the AI goalposts continue to recede at a constant speed. Recently, as more studies continue to reveal limitations in simulated reasoning models, some experts in the industry have been slowly backing away from claims of imminent AGI. For example, AI podcast host Dwarkesh Patel recently published a blog post arguing that developing AGI still faces major bottlenecks, particularly in continual learning, and predicted we’re still seven years away from AI that can learn on the job as seamlessly as humans.

Why the definition matters

The disconnect we’ve seen above between researcher consensus, firm terminology definitions, and corporate rhetoric has a real impact. When policymakers act as if AGI is imminent based on hype rather than scientific evidence, they risk making decisions that don’t match reality. When companies write contracts around undefined terms, they may create legal time bombs.

The definitional chaos around AGI isn’t just philosophical hand-wringing. Companies use promises of impending AGI to attract investment, talent, and customers. Governments craft policy based on AGI timelines. The public forms potentially unrealistic expectations about AI’s impact on jobs and society based on these fuzzy concepts.

Without clear definitions, we can’t have meaningful conversations about AI misapplications, regulation, or development priorities. We end up talking past each other, with optimists and pessimists using the same words to mean fundamentally different things.

In the face of this kind of challenge, some may be tempted to give up on formal definitions entirely, falling back on an “I’ll know it when I see it” approach for AGI—echoing Supreme Court Justice Potter Stewart’s famous quote about obscenity. This subjective standard might feel useful, but it’s useless for contracts, regulation, or scientific progress.

Perhaps it’s time to move beyond the term AGI. Instead of chasing an ill-defined goal that keeps receding into the future, we could focus on specific capabilities: Can this system learn new tasks without extensive retraining? Can it explain its outputs? Can it produce safe outputs that don’t harm or mislead people? These questions tell us more about AI progress than any amount of AGI speculation. The most useful way forward may be to think of progress in AI as a multidimensional spectrum without a specific threshold of achievement. But charting that spectrum will demand new benchmarks that don’t yet exist—and a firm, empirical definition of “intelligence” that remains elusive.

Benj Edwards is Ars Technica’s Senior AI Reporter and founder of the site’s dedicated AI beat in 2022. He’s also a tech historian with almost two decades of experience. In his free time, he writes and records music, collects vintage computers, and enjoys nature. He lives in Raleigh, NC.

The résumé is dying, and AI is holding the smoking gun

Beyond volume, fraud poses an increasing threat. In January, the Justice Department announced indictments in a scheme to place North Korean nationals in remote IT roles at US companies. Research firm Gartner says that fake identity cases are growing rapidly, with the company estimating that by 2028, about 1 in 4 job applicants could be fraudulent. And as we have previously reported, security researchers have also discovered that AI systems can hide invisible text in applications, potentially allowing candidates to game screening systems using prompt injections in ways human reviewers can’t detect.

And that’s not all. Even when AI screening tools work as intended, they exhibit similar biases to human recruiters, preferring white male names on résumés—raising legal concerns about discrimination. The European Union’s AI Act already classifies hiring under its high-risk category with stringent restrictions. Although no US federal law specifically addresses AI use in hiring, general anti-discrimination laws still apply.

So perhaps résumés as a meaningful signal of candidate interest and qualification are becoming obsolete. And maybe that’s OK. When anyone can generate hundreds of tailored applications with a few prompts, the document that once demonstrated effort and genuine interest in a position has devolved into noise.

Instead, the future of hiring may require abandoning the résumé altogether in favor of methods that AI can’t easily replicate—live problem-solving sessions, portfolio reviews, or trial work periods, just to name a few ideas people sometimes consider (whether they are good ideas or not is beyond the scope of this piece). For now, employers and job seekers remain locked in an escalating technological arms race where machines screen the output of other machines, while the humans they’re meant to serve struggle to make authentic connections in an increasingly inauthentic world.

Perhaps the endgame is robots interviewing other robots for jobs performed by robots, while humans sit on the beach drinking daiquiris and playing vintage video games. Well, one can dream.

OpenAI weighs “nuclear option” of antitrust complaint against Microsoft

OpenAI executives have discussed filing an antitrust complaint with US regulators against Microsoft, the company’s largest investor, The Wall Street Journal reported Monday, marking a dramatic escalation in tensions between the two long-term AI partners. OpenAI, which develops ChatGPT, has reportedly considered seeking a federal regulatory review of the terms of its contract with Microsoft for potential antitrust law violations, according to people familiar with the matter.

The potential antitrust complaint would likely argue that Microsoft is using its dominant position in cloud services and contractual leverage to suppress competition, according to insiders who described it as a “nuclear option,” the WSJ reports.

The move could unravel one of the most important business partnerships in the AI industry—a relationship that started with a $1 billion investment by Microsoft in 2019 and has grown to include billions more in funding, along with Microsoft’s exclusive rights to host OpenAI models on its Azure cloud platform.

The friction centers on OpenAI’s efforts to transition from its current nonprofit structure into a public benefit corporation, a conversion that needs Microsoft’s approval to complete. The two companies have not been able to agree on details after months of negotiations, sources told Reuters. OpenAI’s existing for-profit arm would become a Delaware-based public benefit corporation under the proposed restructuring.

The companies are discussing revising the terms of Microsoft’s investment, including the future equity stake it will hold in OpenAI. According to The Information, OpenAI wants Microsoft to hold a 33 percent stake in a restructured unit in exchange for forgoing rights to future profits. The AI company also wants to modify existing clauses that give Microsoft exclusive rights to host OpenAI models in its cloud.

After AI setbacks, Meta bets billions on undefined “superintelligence”

Meta has developed plans to create a new artificial intelligence research lab dedicated to pursuing “superintelligence,” according to reporting from The New York Times. The social media giant chose 28-year-old Alexandr Wang, founder and CEO of Scale AI, to join the new lab as part of a broader reorganization of Meta’s AI efforts under CEO Mark Zuckerberg.

Superintelligence refers to a hypothetical AI system that would exceed human cognitive abilities—a step beyond artificial general intelligence (AGI), which aims to match an intelligent human’s capability for learning new tasks without intensive specialized training.

However, much like AGI, superintelligence remains a nebulous term in the field. Since scientists still poorly understand the mechanics of human intelligence, and because human intelligence resists simple quantification with no single definition, identifying superintelligence when it arrives will present significant challenges.

Computers already far surpass humans in certain forms of information processing such as calculations, but this narrow superiority doesn’t qualify as superintelligence under most definitions. The pursuit assumes we’ll recognize it when we see it, despite the conceptual fuzziness.

AI researcher Dr. Margaret Mitchell told Ars Technica in April 2024 that there will “likely never be agreement on comparisons between human and machine intelligence” but predicted that “men in positions of power and influence, particularly ones with investments in AI, will declare that AI is smarter than humans” regardless of the reality.

The new lab represents Meta’s effort to remain competitive in the increasingly crowded AI race, where tech giants continue pouring billions into research and talent acquisition. Meta has reportedly offered compensation packages worth seven to nine figures to dozens of researchers from companies like OpenAI and Google, according to The New York Times, with some already agreeing to join the company.

Meta joins a growing list of tech giants making bold claims about advanced AI development. In January, OpenAI CEO Sam Altman wrote in a blog post that “we are now confident we know how to build AGI as we have traditionally understood it.” Earlier, in September 2024, Altman predicted that the AI industry might develop superintelligence “in a few thousand days.” Elon Musk made an even more aggressive prediction in April 2024, saying that AI would be “smarter than the smartest human” by “next year, within two years.”

Anthropic releases custom AI chatbot for classified spy work

On Thursday, Anthropic unveiled specialized AI models designed for US national security customers. The company released “Claude Gov” models that were built in response to direct feedback from government clients to handle operations such as strategic planning, intelligence analysis, and operational support. The custom models reportedly already serve US national security agencies, with access restricted to those working in classified environments.

The Claude Gov models differ from Anthropic’s consumer and enterprise offerings, also called Claude, in several ways. They reportedly handle classified material, “refuse less” when engaging with classified information, and are customized to handle intelligence and defense documents. The models also feature what Anthropic calls “enhanced proficiency” in languages and dialects critical to national security operations.

Anthropic says the new models underwent the same “safety testing” as all Claude models. The company has been pursuing government contracts as it seeks reliable revenue sources, partnering with Palantir and Amazon Web Services in November to sell AI tools to defense customers.

Anthropic is not the first company to offer specialized chatbot services for intelligence agencies. In 2024, Microsoft launched an isolated version of OpenAI’s GPT-4 for the US intelligence community after 18 months of work. That system, which operated on a special government-only network without Internet access, became available to about 10,000 individuals in the intelligence community for testing and answering questions.

“In 10 years, all bets are off”—Anthropic CEO opposes decadelong freeze on state AI laws

On Thursday, Anthropic CEO Dario Amodei argued against a proposed 10-year moratorium on state AI regulation in a New York Times opinion piece, calling the measure shortsighted and overbroad as Congress considers including it in President Trump’s tax policy bill. Anthropic makes Claude, an AI assistant similar to ChatGPT.

Amodei warned that AI is advancing too fast for such a long freeze, predicting these systems “could change the world, fundamentally, within two years; in 10 years, all bets are off.”

As we covered in May, the moratorium would prevent states from regulating AI for a decade. A bipartisan group of state attorneys general has opposed the measure, which would preempt AI laws and regulations recently passed in dozens of states.

In his op-ed piece, Amodei said the proposed moratorium aims to prevent inconsistent state laws that could burden companies or compromise America’s competitive position against China. “I am sympathetic to these concerns,” Amodei wrote. “But a 10-year moratorium is far too blunt an instrument. A.I. is advancing too head-spinningly fast.”

Instead of a blanket moratorium, Amodei proposed that the White House and Congress create a federal transparency standard requiring frontier AI developers to publicly disclose their testing policies and safety measures. Under this framework, companies working on the most capable AI models would need to publish on their websites how they test for various risks and what steps they take before release.

“Without a clear plan for a federal response, a moratorium would give us the worst of both worlds—no ability for states to act and no national policy as a backstop,” Amodei wrote.

Transparency as the middle ground

Amodei emphasized his claims for AI’s transformative potential throughout his op-ed, citing examples of pharmaceutical companies drafting clinical study reports in minutes instead of weeks and AI helping to diagnose medical conditions that might otherwise be missed. He wrote that AI “could accelerate economic growth to an extent not seen for a century, improving everyone’s quality of life,” a claim that some skeptics believe may be overhyped.

New Claude 4 AI model refactored code for 7 hours straight


Anthropic says Claude 4 beats Gemini on coding benchmarks; works autonomously for hours.

The Claude 4 logo, created by Anthropic. Credit: Anthropic

On Thursday, Anthropic released Claude Opus 4 and Claude Sonnet 4, marking the company’s return to larger model releases after primarily focusing on mid-range Sonnet variants since June of last year. The new models represent what the company calls its most capable coding models yet, with Opus 4 designed for complex, long-running tasks that can operate autonomously for hours.

Alex Albert, Anthropic’s head of Claude Relations, told Ars Technica that the company chose to revive the Opus line because of growing demand for agentic AI applications. “Across all the companies out there that are building things, there’s a really large wave of these agentic applications springing up, and a very high demand and premium being placed on intelligence,” Albert said. “I think Opus is going to fit that groove perfectly.”

Before we go further, a brief refresher on Claude’s three AI model “size” names (first introduced in March 2024) is probably warranted. Haiku, Sonnet, and Opus offer a tradeoff between price (in the API), speed, and capability.

Haiku models are the smallest, least expensive to run, and least capable in terms of what you might call “context depth” (considering conceptual relationships in the prompt) and encoded knowledge. Owing to the small size in parameter count, Haiku models retain fewer concrete facts and thus tend to confabulate more frequently (plausibly answering questions based on lack of data) than larger models, but they are much faster at basic tasks than larger models. Sonnet is traditionally a mid-range model that hits a balance between cost and capability, and Opus models have always been the largest and slowest to run. However, Opus models process context more deeply and are hypothetically better suited for running deep logical tasks.

A screenshot of the Claude web interface with Opus 4 and Sonnet 4 options shown. Credit: Anthropic

There is no Claude 4 Haiku just yet, but the new Sonnet and Opus models can reportedly handle tasks that previous versions could not. In our interview with Albert, he described testing scenarios where Opus 4 worked coherently for up to 24 hours on tasks like playing Pokémon, while code-refactoring tasks in Claude Code ran for seven hours without interruption. Earlier Claude models typically lasted only one to two hours before losing coherence, Albert said, meaning that the models could only produce useful self-referencing outputs for that long before beginning to output too many errors.

In particular, that marathon refactoring claim reportedly comes from Rakuten, a Japanese tech services conglomerate that “validated [Claude’s] capabilities with a demanding open-source refactor running independently for 7 hours with sustained performance,” Anthropic said in a news release.

Whether you’d want to leave an AI model unsupervised for that long is another question entirely, because even the most capable AI models can introduce subtle bugs, go down unproductive rabbit holes, or make choices that seem logical to the model but miss important context that a human developer would catch. While many people now use Claude for easygoing vibe coding, as we covered in March, the human-powered (and ironically named) “vibe debugging” that often results from long AI coding sessions is also a very real thing. More on that below.

To shore up some of those shortcomings, Anthropic built memory capabilities into both new Claude 4 models, allowing them to maintain external files for storing key information across long sessions. When developers provide access to local files, the models can create and update “memory files” to track progress and things they deem important over time. Albert compared this to how humans take notes during extended work sessions.
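
The note-taking idea can be illustrated with a minimal Python sketch. Everything here is an assumption for illustration: the article does not describe the format Claude uses for its memory files, so the JSON layout, the file name, and the helper functions (`load_memory`, `record_progress`) are hypothetical and are not part of any Anthropic API.

```python
import json
import tempfile
from pathlib import Path


def load_memory(path: Path) -> dict:
    """Read the notes file, or start with empty notes if none exists yet."""
    if path.exists():
        return json.loads(path.read_text(encoding="utf-8"))
    return {"completed": [], "notes": []}


def record_progress(path: Path, task: str, note: str) -> dict:
    """Append one finished task and one note, then write the file back."""
    memory = load_memory(path)
    memory["completed"].append(task)
    memory["notes"].append(note)
    path.write_text(json.dumps(memory, indent=2), encoding="utf-8")
    return memory


# Hypothetical long session: each step persists what it learned, so a later
# step can re-read the file instead of relying on earlier context.
with tempfile.TemporaryDirectory() as workdir:
    memory_file = Path(workdir) / "memory.json"  # hypothetical file name
    record_progress(memory_file, "rename module", "tests live in tests/unit")
    record_progress(memory_file, "split parser", "keep the public API stable")
    print(load_memory(memory_file)["completed"])
```

The point of the design is that the notes live outside the model's context, so they survive however long the session runs.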

Extended thinking meets tool use

Both Claude 4 models introduce what Anthropic calls “extended thinking with tool use,” a new beta feature allowing the models to alternate between simulated reasoning and using external tools like web search, similar to what OpenAI’s o3 and o4-mini-high AI models currently do in ChatGPT. While Claude 3.7 Sonnet already had strong tool use capabilities, the new models can now interleave simulated reasoning and tool calling in a single response.

“So now we can actually think, call a tool, process the results, think some more, call another tool, and repeat until it gets to a final answer,” Albert explained to Ars. The models self-determine when they have reached a useful conclusion, a capability picked up through training rather than governed by explicit human programming.
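
The loop Albert describes can be sketched in a few lines of Python. This is a toy under stated assumptions, not Anthropic's implementation or API: `scripted_model` is a hypothetical stand-in for a real model, the two tools return canned or trivially computed values, and the action dictionaries are an invented format.

```python
def run_agent(model_step, tools, question, max_steps=10):
    """Alternate between model decisions and tool calls until a final answer."""
    transcript = [("user", question)]
    for _ in range(max_steps):
        action = model_step(transcript)  # the "think" step
        if action["type"] == "final":
            return action["text"]
        result = tools[action["tool"]](action["input"])  # call a tool
        transcript.append(("tool_result", result))  # feed the result back in
    raise RuntimeError("no final answer within max_steps")


def subtract(expr: str) -> str:
    """Toy calculator tool: handles only 'a-b' expressions."""
    left, right = expr.split("-")
    return str(round(float(left) - float(right), 1))


# Hypothetical tools; the search tool returns a canned string.
TOOLS = {
    "search": lambda query: "Opus 4: 72.5 percent; Sonnet 4: 72.7 percent",
    "subtract": subtract,
}


def scripted_model(transcript):
    """Stand-in for a model: picks the next action from the results so far."""
    results = [text for role, text in transcript if role == "tool_result"]
    if len(results) == 0:
        return {"type": "tool", "tool": "search", "input": "SWE-bench scores"}
    if len(results) == 1:
        return {"type": "tool", "tool": "subtract", "input": "72.7-72.5"}
    return {"type": "final", "text": f"{results[0]} (gap: {results[1]} points)"}


print(run_agent(scripted_model, TOOLS, "How far apart are the two scores?"))
```

In this sketch the model function, not the surrounding loop, decides when to stop, which mirrors the self-determined stopping described above. The `max_steps` cap is an added safeguard against a run that never concludes.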

General Claude 4 benchmark results, provided by Anthropic. Credit: Anthropic

In practice, we’ve anecdotally found parallel tool use capability very useful in AI assistants like OpenAI o3, since they don’t have to rely on what is trained in their neural network to provide accurate answers. Instead, these more agentic models can iteratively search the web, parse the results, analyze images, and spin up coding tasks for analysis, which helps them avoid the confabulation trap that comes from relying solely on pure LLM outputs.

“The world’s best coding model”

Anthropic calls Opus 4 “the world’s best coding model,” saying it leads industry benchmarks for coding tasks with scores of 72.5 percent on SWE-bench and 43.2 percent on Terminal-bench. According to Anthropic, companies that used early versions report improvements. Cursor described it as “state-of-the-art for coding and a leap forward in complex codebase understanding,” while Replit noted “improved precision and dramatic advancements for complex changes across multiple files.”

In fact, GitHub announced it will use Sonnet 4 as the base model for its new coding agent in GitHub Copilot, citing the model’s performance in “agentic scenarios” in Anthropic’s news release. Sonnet 4 scored 72.7 percent on SWE-bench while maintaining faster response times than Opus 4. The fact that GitHub is betting on Claude rather than a model from its parent company Microsoft (which has close ties to OpenAI) suggests Anthropic has built something genuinely competitive.

Software engineering benchmark results, provided by Anthropic. Credit: Anthropic

Anthropic says it has addressed a persistent issue with Claude 3.7 Sonnet in which users complained that the model would take unauthorized actions or provide excessive output. Albert said the company reduced this “reward hacking behavior” by approximately 80 percent in the new models through training adjustments. An 80 percent reduction in unwanted behavior sounds impressive, but that also suggests that 20 percent of the problem behavior remains—a big concern when we’re talking about AI models that might be performing autonomous tasks for hours.

When we asked about code accuracy, Albert said that human code review is still an important part of shipping any production code. “There’s a human parallel, right? So this is just a problem we’ve had to deal with throughout the whole nature of software engineering. And this is why the code review process exists, so that you can catch these things. We don’t anticipate that going away with models either,” Albert said. “If anything, the human review will become more important, and more of your job as developer will be in this review than it will be in the generation part.”

Pricing and availability

Both Claude 4 models maintain the same pricing structure as their predecessors: Opus 4 costs $15 per million tokens for input and $75 per million for output, while Sonnet 4 remains at $3 and $15. The models offer two response modes: traditional LLM and simulated reasoning (“extended thinking”) for complex problems. Given that some Claude Code sessions can apparently run for hours, those per-token costs will likely add up very quickly for users who let the models run wild.
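
To get a feel for how those prices scale, here is a short Python sketch of the arithmetic. The per-million-token prices come from the paragraph above. The token counts are a made-up example of a long session, and the dictionary keys are illustrative labels rather than real API model identifiers.

```python
# Per-million-token API prices quoted above, in US dollars.
PRICES = {
    "opus-4": {"input": 15.00, "output": 75.00},
    "sonnet-4": {"input": 3.00, "output": 15.00},
}


def session_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Return the dollar cost of a session at the listed per-token prices."""
    price = PRICES[model]
    total = input_tokens * price["input"] + output_tokens * price["output"]
    return total / 1_000_000


# Hypothetical session: 20 million input tokens and 2 million output tokens.
for model in PRICES:
    print(model, session_cost(model, 20_000_000, 2_000_000))
```

With these assumed token counts, the Opus 4 session comes to $450 and the Sonnet 4 session to $90, a fivefold difference that follows directly from the price list.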

Anthropic made both models available through its API, Amazon Bedrock, and Google Cloud Vertex AI. Sonnet 4 remains accessible to free users, while Opus 4 requires a paid subscription.

The Claude 4 launch also brings Claude Code (first introduced in February) to general availability after months of preview testing. Anthropic says the coding environment now integrates with VS Code and JetBrains IDEs, showing proposed edits directly in files. A new SDK allows developers to build custom agents using the same framework.

A screenshot of “Claude Plays Pokémon,” a custom application where Claude 4 attempts to beat the classic Game Boy game. Credit: Anthropic

Even with Anthropic’s future riding on the capability of these new models, when we asked about how they guide Claude’s behavior by fine-tuning, Albert acknowledged that the inherent unpredictability of these systems presents ongoing challenges for both them and developers. “In the realm and the world of software for the past 40, 50 years, we’ve been running on deterministic systems, and now all of a sudden, it’s non-deterministic, and that changes how we build,” he said.

“I empathize with a lot of people out there trying to use our APIs and language models generally because they have to almost shift their perspective on what it means for reliability, what it means for powering a core of your application in a non-deterministic way,” Albert added. “These are general oddities that have kind of just been flipped, and it definitely makes things more difficult, but I think it opens up a lot of possibilities as well.”

Benj Edwards is Ars Technica’s Senior AI Reporter and founder of the site’s dedicated AI beat in 2022. He’s also a tech historian with almost two decades of experience. In his free time, he writes and records music, collects vintage computers, and enjoys nature. He lives in Raleigh, NC.

openai-adds-gpt-4.1-to-chatgpt-amid-complaints-over-confusing-model-lineup

OpenAI adds GPT-4.1 to ChatGPT amid complaints over confusing model lineup

The release comes just two weeks after OpenAI made GPT-4 unavailable in ChatGPT on April 30. That earlier model, which launched in March 2023, once sparked widespread hype about AI capabilities. Compared to that hyperbolic launch, GPT-4.1’s rollout has been a fairly understated affair—probably because it’s tricky to convey the subtle differences between all of the available OpenAI models.

As if 4.1’s launch wasn’t confusing enough, the release also roughly coincides with OpenAI’s July 2025 deadline for retiring the GPT-4.5 Preview from the API, a model one AI expert called a “lemon.” Developers must migrate to other options, OpenAI says, although GPT-4.5 will remain available in ChatGPT for now.

A confusing addition to OpenAI’s model lineup

In February, OpenAI CEO Sam Altman acknowledged on X his company’s confusing AI model naming practices, writing, “We realize how complicated our model and product offerings have gotten.” He promised that a forthcoming “GPT-5” model would consolidate the o-series and GPT-series models into a unified branding structure. But the addition of GPT-4.1 to ChatGPT appears to contradict that simplification goal.

So, if you use ChatGPT, which model should you use? If you’re a developer using the models through the API, the consideration is more of a trade-off between capability, speed, and cost. But in ChatGPT, your choice might be limited more by personal taste in behavioral style and what you’d like to accomplish. Some of the “more capable” models have lower usage limits as well because they cost more for OpenAI to run.

For now, OpenAI is keeping GPT-4o as the default ChatGPT model, likely due to its general versatility, balance between speed and capability, and personable style (conditioned using reinforcement learning and a specialized system prompt). The simulated reasoning models like o3 and o4-mini-high are slower to execute but can consider analytical-style problems more systematically and perform comprehensive web research that sometimes feels genuinely useful when it surfaces relevant (non-confabulated) web links. Compared to those, OpenAI is largely positioning GPT-4.1 as a speedier AI model for coding assistance.

Just remember that all of the AI models are prone to confabulations, meaning that they tend to make up authoritative-sounding information when they encounter gaps in their trained “knowledge.” So you’ll need to double-check all of the outputs with other sources of information if you’re hoping to use these AI models to assist with an important task.

ai-use-damages-professional-reputation,-study-suggests

AI use damages professional reputation, study suggests

Using AI can be a double-edged sword, according to new research from Duke University. While generative AI tools may boost productivity for some, they might also secretly damage your professional reputation.

On Thursday, the Proceedings of the National Academy of Sciences (PNAS) published a study showing that employees who use AI tools like ChatGPT, Claude, and Gemini at work face negative judgments about their competence and motivation from colleagues and managers.

“Our findings reveal a dilemma for people considering adopting AI tools: Although AI can enhance productivity, its use carries social costs,” write researchers Jessica A. Reif, Richard P. Larrick, and Jack B. Soll of Duke’s Fuqua School of Business.

The Duke team conducted four experiments with over 4,400 participants to examine both anticipated and actual evaluations of AI tool users. Their findings, presented in a paper titled “Evidence of a social evaluation penalty for using AI,” reveal a consistent pattern of bias against those who receive help from AI.

What made this penalty particularly concerning for the researchers was its consistency across demographics. They found that the social stigma against AI use wasn’t limited to specific groups.

Fig. 1. Effect sizes for differences in expected perceptions and disclosure to others (Study 1). Note: Positive d values indicate higher values in the AI Tool condition, while negative d values indicate lower values in the AI Tool condition. N = 497. Error bars represent 95% CI. Correlations among variables range from |r| = 0.53 to 0.88.

Fig. 1 from the paper “Evidence of a social evaluation penalty for using AI.” Credit: Reif et al.

“Testing a broad range of stimuli enabled us to examine whether the target’s age, gender, or occupation qualifies the effect of receiving help from AI on these evaluations,” the authors wrote in the paper. “We found that none of these target demographic attributes influences the effect of receiving AI help on perceptions of laziness, diligence, competence, independence, or self-assuredness. This suggests that the social stigmatization of AI use is not limited to its use among particular demographic groups. The result appears to be a general one.”

The hidden social cost of AI adoption

In the first experiment conducted by the team from Duke, participants imagined using either an AI tool or a dashboard creation tool at work. It revealed that those in the AI group expected to be judged as lazier, less competent, less diligent, and more replaceable than those using conventional technology. They also reported less willingness to disclose their AI use to colleagues and managers.

The second experiment confirmed these fears were justified. When evaluating descriptions of employees, participants consistently rated those receiving AI help as lazier, less competent, less diligent, less independent, and less self-assured than those receiving similar help from non-AI sources or no help at all.

fidji-simo-joins-openai-as-new-ceo-of-applications

Fidji Simo joins OpenAI as new CEO of Applications

In the message, Altman described Simo as bringing “a rare blend of leadership, product and operational expertise” and expressed that her addition to the team makes him “even more optimistic about our future as we continue advancing toward becoming the superintelligence company.”

Simo becomes the newest high-profile female executive at OpenAI following the departure of Chief Technology Officer Mira Murati in September. Murati, who had been with the company since 2018 and helped launch ChatGPT, left alongside two other senior leaders and founded Thinking Machines Lab in February.

OpenAI’s evolving structure

The leadership addition comes as OpenAI continues to evolve beyond its origins as a research lab. In his announcement, Altman described how the company now operates in three distinct areas: as a research lab focused on artificial general intelligence (AGI), as a “global product company serving hundreds of millions of users,” and as an “infrastructure company” building systems that advance research and deliver AI tools “at unprecedented scale.”

Altman mentioned that as CEO of OpenAI, he will “continue to directly oversee success across all pillars,” including Research, Compute, and Applications, while staying “closely involved with key company decisions.”

The announcement follows recent news that OpenAI abandoned its original plan to cede control of its nonprofit branch to a for-profit entity. The company began as a nonprofit research lab in 2015 before creating a for-profit subsidiary in 2019, maintaining its original mission “to ensure artificial general intelligence benefits everyone.”
