Anthropic

Anthropic summons the spirit of Flash games for the AI age

For those who missed the Flash era, these in-browser apps feel somewhat like the vintage apps that defined a generation of Internet culture from the late 1990s through the 2000s when it first became possible to create complex in-browser experiences. Adobe Flash (originally Macromedia Flash) began as animation software for designers but quickly became the backbone of interactive web content when it gained its own programming language, ActionScript, in 2000.

But unlike Flash games, where hosting costs fell on portal operators, Anthropic has crafted a system where users pay for their own fun through their existing Claude subscriptions. “When someone uses your Claude-powered app, they authenticate with their existing Claude account,” Anthropic explained in its announcement. “Their API usage counts against their subscription, not yours. You pay nothing for their usage.”

A view of the Anthropic Artifacts gallery in the “Play a Game” section. Benj Edwards / Anthropic

Like the Flash games of yesteryear, any Claude-powered apps you build run in the browser and can be shared with anyone who has a Claude account. They’re interactive experiences shared with a simple link, no installation required, created by other people for the sake of creating, except now they’re powered by JavaScript instead of ActionScript.

While you can share these apps with others individually, right now Anthropic’s Artifact gallery only shows examples made by Anthropic and your own personal Artifacts. (If Anthropic expanded it in the future, it might end up feeling a bit like Scratch meets Newgrounds, but with AI doing the coding.) Ultimately, humans are still behind the wheel, describing what kinds of apps they want the AI model to build and guiding the process when it inevitably makes mistakes.

Speaking of mistakes, don’t expect perfect results at first. Usually, building an app with Claude is an interactive experience that requires some guidance to achieve your desired results. But with a little patience and a lot of tokens, you’ll be vibe coding in no time.

Key fair use ruling clarifies when books can be used for AI training

“This order doubts that any accused infringer could ever meet its burden of explaining why downloading source copies from pirate sites that it could have purchased or otherwise accessed lawfully was itself reasonably necessary to any subsequent fair use,” Alsup wrote. “Such piracy of otherwise available copies is inherently, irredeemably infringing even if the pirated copies are immediately used for the transformative use and immediately discarded.”

But Alsup said that the Anthropic case may not even need to decide on that, since Anthropic’s retention of pirated books for its research library alone was not transformative. Alsup wrote that Anthropic’s argument to hold onto potential AI training material it pirated in case it ever decided to use it for AI training was an attempt to “fast glide over thin ice.”

Additionally, Alsup pointed out that Anthropic’s early attempts to get permission to train on authors’ works withered, as internal messages revealed the company concluded that stealing books was the more cost-effective path to innovation “to avoid ‘legal/practice/business slog,’ as cofounder and chief executive officer Dario Amodei put it.”

“Anthropic is wrong to suppose that so long as you create an exciting end product, every ‘back-end step, invisible to the public,’ is excused,” Alsup wrote. “Here, piracy was the point: To build a central library that one could have paid for, just as Anthropic later did, but without paying for it.”

To avoid maximum damages in the event of a loss, Anthropic will likely continue arguing that replacing pirated books with purchased books should water down authors’ fight, Alsup’s order suggested.

“That Anthropic later bought a copy of a book it earlier stole off the Internet will not absolve it of liability for the theft, but it may affect the extent of statutory damages,” Alsup noted.

AI chatbots tell users what they want to hear, and that’s problematic

After the model has been trained, companies can set system prompts, or guidelines, for how the model should behave to minimize sycophantic behavior.

However, working out the best response means delving into the subtleties of how people communicate with one another, such as determining when a direct response is better than a more hedged one.

“[I]s it for the model to not give egregious, unsolicited compliments to the user?” Joanne Jang, head of model behavior at OpenAI, said in a Reddit post. “Or, if the user starts with a really bad writing draft, can the model still tell them it’s a good start and then follow up with constructive feedback?”

Evidence is growing that some users are becoming hooked on using AI.

A study by MIT Media Lab and OpenAI found that a small proportion were becoming addicted. Those who perceived the chatbot as a “friend” also reported lower socialization with other people and higher levels of emotional dependence on a chatbot, as well as other problematic behavior associated with addiction.

“These things set up this perfect storm, where you have a person desperately seeking reassurance and validation paired with a model which inherently has a tendency towards agreeing with the participant,” said Nour from Oxford University.

AI start-ups such as Character.AI that offer chatbots as “companions” have faced criticism for allegedly not doing enough to protect users. Last year, a teenager killed himself after interacting with Character.AI’s chatbot. The teen’s family is suing the company for allegedly causing wrongful death, as well as for negligence and deceptive trade practices.

Character.AI said it does not comment on pending litigation, but added it has “prominent disclaimers in every chat to remind users that a character is not a real person and that everything a character says should be treated as fiction.” The company added it has safeguards to protect under-18s and against discussions of self-harm.

Another concern for Anthropic’s Askell is that AI tools can play with perceptions of reality in subtle ways, such as when offering factually incorrect or biased information as the truth.

“If someone’s being super sycophantic, it’s just very obvious,” Askell said. “It’s more concerning if this is happening in a way that is less noticeable to us [as individual users] and it takes us too long to figure out that the advice that we were given was actually bad.”

© 2025 The Financial Times Ltd. All rights reserved. Not to be redistributed, copied, or modified in any way.

Anthropic releases custom AI chatbot for classified spy work

On Thursday, Anthropic unveiled specialized AI models designed for US national security customers. The company released “Claude Gov” models that were built in response to direct feedback from government clients to handle operations such as strategic planning, intelligence analysis, and operational support. The custom models reportedly already serve US national security agencies, with access restricted to those working in classified environments.

The Claude Gov models differ from Anthropic’s consumer and enterprise offerings, also called Claude, in several ways. They reportedly handle classified material, “refuse less” when engaging with classified information, and are customized to handle intelligence and defense documents. The models also feature what Anthropic calls “enhanced proficiency” in languages and dialects critical to national security operations.

Anthropic says the new models underwent the same “safety testing” as all Claude models. The company has been pursuing government contracts as it seeks reliable revenue sources, partnering with Palantir and Amazon Web Services in November to sell AI tools to defense customers.

Anthropic is not the first company to offer specialized chatbot services for intelligence agencies. In 2024, Microsoft launched an isolated version of OpenAI’s GPT-4 for the US intelligence community after 18 months of work. That system, which operated on a special government-only network without Internet access, became available to about 10,000 individuals in the intelligence community for testing and answering questions.

Reddit sues Anthropic over AI scraping that retained users’ deleted posts

Of particular note, Reddit pointed out that Anthropic’s Claude models will help power Amazon’s revamped Alexa, following about $8 billion in Amazon investments in the AI company since 2023.

“By commercially licensing Claude for use in several of Amazon’s commercial offerings, Anthropic reaps significant profit from a technology borne of Reddit content,” Reddit alleged, and “at the expense of Reddit.” Anthropic’s unauthorized scraping also burdens Reddit’s servers, threatening to degrade the user experience and costing Reddit additional damages, Reddit alleged.

To rectify alleged harms, Reddit is hoping a jury will award not just damages covering Reddit’s alleged losses but also punitive damages due to Anthropic’s alleged conduct that is “willful, malicious, and undertaken with conscious disregard for Reddit’s contractual obligations to its users and the privacy rights of those users.”

Without an injunction, Reddit users have “no way of knowing” if Anthropic scraped their data, Reddit alleged. They also are “left to wonder whether any content they deleted after Claude began training on Reddit data nevertheless remains available to Anthropic and the likely tens of millions (and possibly growing) of Claude users,” Reddit said.

In a statement provided to Ars, Anthropic’s spokesperson confirmed that the AI company plans to fight Reddit’s claims.

“We disagree with Reddit’s claims and will defend ourselves vigorously,” Anthropic’s spokesperson said.

Amazon declined to comment. Reddit did not immediately respond to Ars’ request to comment. But Reddit’s chief legal officer, Ben Lee, told The New York Times that Reddit “will not tolerate profit-seeking entities like Anthropic commercially exploiting Reddit content for billions of dollars without any return for redditors or respect for their privacy.”

“AI companies should not be allowed to scrape information and content from people without clear limitations on how they can use that data,” Lee said. “Licensing agreements enable us to enforce meaningful protections for our users, including the right to delete your content, user privacy protections, and preventing users from being spammed using this content.”

“In 10 years, all bets are off”—Anthropic CEO opposes decadelong freeze on state AI laws

On Thursday, Anthropic CEO Dario Amodei argued against a proposed 10-year moratorium on state AI regulation in a New York Times opinion piece, calling the measure shortsighted and overbroad as Congress considers including it in President Trump’s tax policy bill. Anthropic makes Claude, an AI assistant similar to ChatGPT.

Amodei warned that AI is advancing too fast for such a long freeze, predicting these systems “could change the world, fundamentally, within two years; in 10 years, all bets are off.”

As we covered in May, the moratorium would prevent states from regulating AI for a decade. A bipartisan group of state attorneys general has opposed the measure, which would preempt AI laws and regulations recently passed in dozens of states.

In his op-ed piece, Amodei said the proposed moratorium aims to prevent inconsistent state laws that could burden companies or compromise America’s competitive position against China. “I am sympathetic to these concerns,” Amodei wrote. “But a 10-year moratorium is far too blunt an instrument. A.I. is advancing too head-spinningly fast.”

Instead of a blanket moratorium, Amodei proposed that the White House and Congress create a federal transparency standard requiring frontier AI developers to publicly disclose their testing policies and safety measures. Under this framework, companies working on the most capable AI models would need to publish on their websites how they test for various risks and what steps they take before release.

“Without a clear plan for a federal response, a moratorium would give us the worst of both worlds—no ability for states to act and no national policy as a backstop,” Amodei wrote.

Transparency as the middle ground

Amodei emphasized his claims for AI’s transformative potential throughout his op-ed, citing examples of pharmaceutical companies drafting clinical study reports in minutes instead of weeks and AI helping to diagnose medical conditions that might otherwise be missed. He wrote that AI “could accelerate economic growth to an extent not seen for a century, improving everyone’s quality of life,” a claim that some skeptics believe may be overhyped.

Hidden AI instructions reveal how Anthropic controls Claude 4

Willison, who coined the term “prompt injection” in 2022, is always on the lookout for LLM vulnerabilities. In his post, he notes that reading system prompts reminds him of warning signs in the real world that hint at past problems. “A system prompt can often be interpreted as a detailed list of all of the things the model used to do before it was told not to do them,” he writes.

Fighting the flattery problem

An illustrated robot holds four red hearts with its four robotic arms.

Willison’s analysis comes as AI companies grapple with sycophantic behavior in their models. As we reported in April, ChatGPT users have complained about GPT-4o’s “relentlessly positive tone” and excessive flattery since OpenAI’s March update. Users described feeling “buttered up” by responses like “Good question! You’re very astute to ask that,” with software engineer Craig Weiss tweeting that “ChatGPT is suddenly the biggest suckup I’ve ever met.”

The issue stems from how companies collect user feedback during training—people tend to prefer responses that make them feel good, creating a feedback loop where models learn that enthusiasm leads to higher ratings from humans. As a response to the feedback, OpenAI later rolled back ChatGPT’s 4o model and altered the system prompt as well, something we reported on and Willison also analyzed at the time.

One of Willison’s most interesting findings about Claude 4 relates to how Anthropic has guided both Claude models to avoid sycophantic behavior. “Claude never starts its response by saying a question or idea or observation was good, great, fascinating, profound, excellent, or any other positive adjective,” Anthropic writes in the prompt. “It skips the flattery and responds directly.”

Other system prompt highlights

The Claude 4 system prompt also includes extensive instructions on when Claude should or shouldn’t use bullet points and lists, with multiple paragraphs dedicated to discouraging frequent list-making in casual conversation. “Claude should not use bullet points or numbered lists for reports, documents, explanations, or unless the user explicitly asks for a list or ranking,” the prompt states.

New Claude 4 AI model refactored code for 7 hours straight


Anthropic says Claude 4 beats Gemini on coding benchmarks; works autonomously for hours.

The Claude 4 logo, created by Anthropic. Credit: Anthropic

On Thursday, Anthropic released Claude Opus 4 and Claude Sonnet 4, marking the company’s return to larger model releases after primarily focusing on mid-range Sonnet variants since June of last year. The new models represent what the company calls its most capable coding models yet, with Opus 4 designed for complex, long-running tasks that can operate autonomously for hours.

Alex Albert, Anthropic’s head of Claude Relations, told Ars Technica that the company chose to revive the Opus line because of growing demand for agentic AI applications. “Across all the companies out there that are building things, there’s a really large wave of these agentic applications springing up, and a very high demand and premium being placed on intelligence,” Albert said. “I think Opus is going to fit that groove perfectly.”

Before we go further, a brief refresher on Claude’s three AI model “size” names (first introduced in March 2024) is probably warranted. Haiku, Sonnet, and Opus offer a tradeoff between price (in the API), speed, and capability.

Haiku models are the smallest, least expensive to run, and least capable in terms of what you might call “context depth” (considering conceptual relationships in the prompt) and encoded knowledge. Owing to the small size in parameter count, Haiku models retain fewer concrete facts and thus tend to confabulate more frequently (plausibly answering questions based on lack of data) than larger models, but they are much faster at basic tasks than larger models. Sonnet is traditionally a mid-range model that hits a balance between cost and capability, and Opus models have always been the largest and slowest to run. However, Opus models process context more deeply and are hypothetically better suited for running deep logical tasks.

A screenshot of the Claude web interface with Opus 4 and Sonnet 4 options shown. Credit: Anthropic

There is no Claude 4 Haiku just yet, but the new Sonnet and Opus models can reportedly handle tasks that previous versions could not. In our interview with Albert, he described testing scenarios where Opus 4 worked coherently for up to 24 hours on tasks like playing Pokémon, while coding refactoring tasks in Claude Code ran for seven hours without interruption. Earlier Claude models typically lasted only one to two hours before losing coherence, Albert said, meaning that the models could only produce useful self-referencing outputs for that long before beginning to output too many errors.

In particular, that marathon refactoring claim reportedly comes from Rakuten, a Japanese tech services conglomerate that “validated [Claude’s] capabilities with a demanding open-source refactor running independently for 7 hours with sustained performance,” Anthropic said in a news release.

Whether you’d want to leave an AI model unsupervised for that long is another question entirely because even the most capable AI models can introduce subtle bugs, go down unproductive rabbit holes, or make choices that seem logical to the model but miss important context that a human developer would catch. While many people now use Claude for easy-going vibe coding, as we covered in March, the human-powered (and ironically-named) “vibe debugging” that often results from long AI coding sessions is also a very real thing. More on that below.

To shore up some of those shortcomings, Anthropic built memory capabilities into both new Claude 4 models, allowing them to maintain external files for storing key information across long sessions. When developers provide access to local files, the models can create and update “memory files” to track progress and things they deem important over time. Albert compared this to how humans take notes during extended work sessions.
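To make the note-taking comparison concrete, here is a minimal Python sketch of the general idea of a "memory file": notes are written to disk during one step and read back later. This is an illustration only, not Anthropic's implementation. The file name `memory.json`, the JSON layout, and the helper names `load_memory` and `remember` are all assumptions made for the example.

```python
import json
import tempfile
from pathlib import Path


def load_memory(path: Path) -> dict:
    """Read the notes file if it exists, otherwise start with empty notes."""
    if path.exists():
        return json.loads(path.read_text(encoding="utf-8"))
    return {"notes": []}


def remember(path: Path, note: str) -> dict:
    """Append one note and write the whole file back to disk."""
    memory = load_memory(path)
    memory["notes"].append(note)
    path.write_text(json.dumps(memory, indent=2), encoding="utf-8")
    return memory


# Hypothetical long-running task: notes are saved as work progresses,
# then reloaded later as if a fresh step had no other record of them.
with tempfile.TemporaryDirectory() as workdir:
    memory_path = Path(workdir) / "memory.json"  # assumed file name
    remember(memory_path, "refactored module A; tests pass")
    remember(memory_path, "module B still pending")
    recalled = load_memory(memory_path)["notes"]

print(recalled)
```

The point of the design is that the notes live outside the model's context, so they survive even when earlier parts of a long session are no longer available to it.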

Extended thinking meets tool use

Both Claude 4 models introduce what Anthropic calls “extended thinking with tool use,” a new beta feature allowing the models to alternate between simulated reasoning and using external tools like web search, similar to what OpenAI’s o3 and o4-mini-high AI models currently do in ChatGPT. While Claude 3.7 Sonnet already had strong tool use capabilities, the new models can now interleave simulated reasoning and tool calling in a single response.

“So now we can actually think, call a tool, process the results, think some more, call another tool, and repeat until it gets to a final answer,” Albert explained to Ars. The models self-determine when they have reached a useful conclusion, a capability picked up through training rather than governed by explicit human programming.
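The think, call, process, repeat cycle Albert describes can be sketched as a simple loop. The Python below is a toy illustration under stated assumptions, not Anthropic's API: the "model" is a scripted stand-in function, and the names `run_agent`, `scripted_model`, and `lookup` are hypothetical. What it shows is the control flow, in which the model, rather than the surrounding program, decides when to stop calling tools.

```python
from typing import Callable, Dict, List, Tuple

Transcript = List[Tuple[str, str]]


def run_agent(model_step: Callable[[Transcript], dict],
              tools: Dict[str, Callable[[str], str]],
              question: str,
              max_steps: int = 10) -> Tuple[str, Transcript]:
    """Alternate between model steps and tool calls until the model answers."""
    transcript: Transcript = [("user", question)]
    for _ in range(max_steps):
        action = model_step(transcript)        # the model decides what to do next
        if action["type"] == "final":          # the model judged it has an answer
            return action["text"], transcript
        result = tools[action["tool"]](action["input"])  # run the requested tool
        transcript.append(("tool_call", action["tool"]))
        transcript.append(("tool_result", result))
    raise RuntimeError("no final answer within max_steps")


def scripted_model(transcript: Transcript) -> dict:
    """Stand-in for a real model: request two lookups, then answer."""
    results = [text for role, text in transcript if role == "tool_result"]
    if len(results) == 0:
        return {"type": "tool", "tool": "lookup", "input": "first query"}
    if len(results) == 1:
        return {"type": "tool", "tool": "lookup", "input": "second query"}
    return {"type": "final", "text": " / ".join(results)}


def lookup(query: str) -> str:
    """Hypothetical tool that returns a canned string instead of searching."""
    return f"canned result for {query}"


answer, transcript = run_agent(scripted_model, {"lookup": lookup}, "example question")
print(answer)
```

The `max_steps` cap is a safety choice for the sketch: a loop that lets the model decide when it is done needs some outer limit in case it never does.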

General Claude 4 benchmark results, provided by Anthropic. Credit: Anthropic

In practice, we’ve anecdotally found parallel tool use capability very useful in AI assistants like OpenAI o3, since they don’t have to rely on what is trained in their neural network to provide accurate answers. Instead, these more agentic models can iteratively search the web, parse the results, analyze images, and spin up coding tasks for analysis in ways that can avoid falling into a confabulation trap by relying solely on pure LLM outputs.

“The world’s best coding model”

Anthropic says Opus 4 leads industry benchmarks for coding tasks, achieving 72.5 percent on SWE-bench and 43.2 percent on Terminal-bench, calling it “the world’s best coding model.” According to Anthropic, companies using early versions report improvements. Cursor described it as “state-of-the-art for coding and a leap forward in complex codebase understanding,” while Replit noted “improved precision and dramatic advancements for complex changes across multiple files.”

In fact, GitHub announced it will use Sonnet 4 as the base model for its new coding agent in GitHub Copilot, citing the model’s performance in “agentic scenarios” in Anthropic’s news release. Sonnet 4 scored 72.7 percent on SWE-bench while maintaining faster response times than Opus 4. The fact that GitHub is betting on Claude rather than a model from its parent company Microsoft (which has close ties to OpenAI) suggests Anthropic has built something genuinely competitive.

Software engineering benchmark results, provided by Anthropic. Credit: Anthropic

Anthropic says it has addressed a persistent issue with Claude 3.7 Sonnet in which users complained that the model would take unauthorized actions or provide excessive output. Albert said the company reduced this “reward hacking behavior” by approximately 80 percent in the new models through training adjustments. An 80 percent reduction in unwanted behavior sounds impressive, but that also suggests that 20 percent of the problem behavior remains—a big concern when we’re talking about AI models that might be performing autonomous tasks for hours.

When we asked about code accuracy, Albert said that human code review is still an important part of shipping any production code. “There’s a human parallel, right? So this is just a problem we’ve had to deal with throughout the whole nature of software engineering. And this is why the code review process exists, so that you can catch these things. We don’t anticipate that going away with models either,” Albert said. “If anything, the human review will become more important, and more of your job as developer will be in this review than it will be in the generation part.”

Pricing and availability

Both Claude 4 models maintain the same pricing structure as their predecessors: Opus 4 costs $15 per million tokens for input and $75 per million for output, while Sonnet 4 remains at $3 and $15. The models offer two response modes: traditional LLM and simulated reasoning (“extended thinking”) for complex problems. Given that some Claude Code sessions can apparently run for hours, those per-token costs will likely add up very quickly for users who let the models run wild.
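To get a feel for how those per-token prices add up, here is a small Python sketch of the arithmetic. The prices are the ones quoted above; the token counts are a made-up example of a long session, and the dictionary keys `opus-4` and `sonnet-4` are labels for this sketch, not official API model identifiers.

```python
# Per-million-token prices quoted in the article (USD).
PRICES = {
    "opus-4": {"input": 15.0, "output": 75.0},
    "sonnet-4": {"input": 3.0, "output": 15.0},
}


def session_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Return the cost in USD for the given token counts."""
    price = PRICES[model]
    total = input_tokens * price["input"] + output_tokens * price["output"]
    return total / 1_000_000


# Hypothetical long session: 20 million input tokens, 2 million output tokens.
opus = session_cost("opus-4", 20_000_000, 2_000_000)
sonnet = session_cost("sonnet-4", 20_000_000, 2_000_000)
print(f"Opus 4: ${opus:.2f}, Sonnet 4: ${sonnet:.2f}")
```

Under those assumed token counts, the same session costs $450 on Opus 4 and $90 on Sonnet 4, a fivefold difference that follows directly from the price ratio.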

Anthropic made both models available through its API, Amazon Bedrock, and Google Cloud Vertex AI. Sonnet 4 remains accessible to free users, while Opus 4 requires a paid subscription.

The Claude 4 models also debut Claude Code (first introduced in February) as a generally available product after months of preview testing. Anthropic says the coding environment now integrates with VS Code and JetBrains IDEs, showing proposed edits directly in files. A new SDK allows developers to build custom agents using the same framework.

A screenshot of “Claude Plays Pokemon,” a custom application where Claude 4 attempts to beat the classic Game Boy game. Credit: Anthropic

Even with Anthropic’s future riding on the capability of these new models, when we asked about how they guide Claude’s behavior by fine-tuning, Albert acknowledged that the inherent unpredictability of these systems presents ongoing challenges for both them and developers. “In the realm and the world of software for the past 40, 50 years, we’ve been running on deterministic systems, and now all of a sudden, it’s non-deterministic, and that changes how we build,” he said.

“I empathize with a lot of people out there trying to use our APIs and language models generally because they have to almost shift their perspective on what it means for reliability, what it means for powering a core of your application in a non-deterministic way,” Albert added. “These are general oddities that have kind of just been flipped, and it definitely makes things more difficult, but I think it opens up a lot of possibilities as well.”

Benj Edwards is Ars Technica’s Senior AI Reporter and founder of the site’s dedicated AI beat in 2022. He’s also a tech historian with almost two decades of experience. In his free time, he writes and records music, collects vintage computers, and enjoys nature. He lives in Raleigh, NC.

Claude’s AI research mode now runs for up to 45 minutes before delivering reports

Still, the report contained a direct quote from William Higinbotham that appears to combine quotes from two sources not cited in the source list. (One must always be careful with confabulated quotes in AI because even outside of this Research mode, Claude 3.7 Sonnet tends to invent plausible ones to fit a narrative.) We recently covered a study that showed AI search services confabulate sources frequently, and in this case, it appears that the sources Claude Research surfaced, while real, did not always match what is stated in the report.

There’s always room for interpretation and variation in detail, of course, but overall, Claude Research did a relatively good job crafting a report on this particular topic. Still, you’d want to dig more deeply into each source and confirm everything if you used it as the basis for serious research. You can read the full Claude-generated result as this text file, saved in markdown format. Sadly, the markdown version does not include the source URLs found in the Claude web interface.

Integrations feature

Anthropic also announced Thursday that it has broadened Claude’s data access capabilities. In addition to web search and Google Workspace integration, Claude can now search any connected application through the company’s new “Integrations” feature. The feature reminds us somewhat of OpenAI’s ChatGPT Plugins feature from March 2023 that aimed for similar connections, although the two features work differently under the hood.

These Integrations allow Claude to work with remote Model Context Protocol (MCP) servers across web and desktop applications. The MCP standard, which Anthropic introduced last November and we covered in April, connects AI applications to external tools and data sources.

At launch, Claude supports Integrations with 10 services, including Atlassian’s Jira and Confluence, Zapier, Cloudflare, Intercom, Asana, Square, Sentry, PayPal, Linear, and Plaid. The company plans to add more partners like Stripe and GitLab in the future.

Each integration aims to expand Claude’s functionality in specific ways. The Zapier integration, for instance, reportedly connects thousands of apps through pre-built automation sequences, allowing Claude to automatically pull sales data from HubSpot or prepare meeting briefs based on calendar entries. With Atlassian’s tools, Anthropic says that Claude can collaborate on product development, manage tasks, and create multiple Confluence pages and Jira work items simultaneously.

Anthropic has made its advanced Research and Integrations features available in beta for users on Max, Team, and Enterprise plans, with Pro plan access coming soon. The company has also expanded its web search feature (introduced in March) to all Claude users on paid plans globally.

OpenAI releases new simulated reasoning models with full tool access


New o3 model appears “near-genius level,” according to one doctor, but it still makes mistakes.

On Wednesday, OpenAI announced the release of two new models—o3 and o4-mini—that combine simulated reasoning capabilities with access to functions like web browsing and coding. These models mark the first time OpenAI’s reasoning-focused models can use every ChatGPT tool simultaneously, including visual analysis and image generation.

OpenAI announced o3 in December, and until now, only less-capable derivative models named “o3-mini” and “o3-mini-high” have been available. However, the new models replace their predecessors, o1 and o3-mini.

OpenAI is rolling out access today for ChatGPT Plus, Pro, and Team users, with Enterprise and Edu customers gaining access next week. Free users can try o4-mini by selecting the “Think” option before submitting queries. OpenAI CEO Sam Altman tweeted, “we expect to release o3-pro to the pro tier in a few weeks.”

For developers, both models are available starting today through the Chat Completions API and Responses API, though some organizations will need verification for access.

The new models offer several improvements. According to OpenAI’s website, “These are the smartest models we’ve released to date, representing a step change in ChatGPT’s capabilities for everyone from curious users to advanced researchers.” OpenAI also says the models offer better cost efficiency than their predecessors, and each comes with a different intended use case: o3 targets complex analysis, while o4-mini, being a smaller version of its next-gen SR model “o4” (not yet released), optimizes for speed and cost-efficiency.

OpenAI says o3 and o4-mini are multimodal, featuring the ability to “think with images.” Credit: OpenAI

What sets these new models apart from OpenAI’s other models (like GPT-4o and GPT-4.5) is their simulated reasoning capability, which uses a simulated step-by-step “thinking” process to solve problems. Additionally, the new models dynamically determine when and how to deploy aids to solve multistep problems. For example, when asked about future energy usage in California, the models can autonomously search for utility data, write Python code to build forecasts, generate visualizing graphs, and explain key factors behind predictions—all within a single query.

OpenAI touts the new models’ multimodal ability to incorporate images directly into their simulated reasoning process—not just analyzing visual inputs but actively “thinking with” them. This capability allows the models to interpret whiteboards, textbook diagrams, and hand-drawn sketches, even when images are blurry or of low quality.

That said, the new releases continue OpenAI’s tradition of selecting confusing product names that don’t tell users much about each model’s relative capabilities—for example, o3 is more powerful than o4-mini despite including a lower number. Then there’s potential confusion with the firm’s non-reasoning AI models. As Ars Technica contributor Timothy B. Lee noted today on X, “It’s an amazing branding decision to have a model called GPT-4o and another one called o4.”

Vibes and benchmarks

All that aside, we know what you’re thinking: What about the vibes? While we have not used o3 or o4-mini yet, frequent AI commentator and Wharton professor Ethan Mollick compared o3 favorably to Google’s Gemini 2.5 Pro on Bluesky. “After using them both, I think that Gemini 2.5 & o3 are in a similar sort of range (with the important caveat that more testing is needed for agentic capabilities),” he wrote. “Each has its own quirks & you will likely prefer one to another, but there is a gap between them & other models.”

During the livestream announcement for o3 and o4-mini today, OpenAI President Greg Brockman boldly claimed: “These are the first models where top scientists tell us they produce legitimately good and useful novel ideas.”

Early user feedback seems to support this assertion, although, until more third-party testing takes place, it’s wise to be skeptical of the claims. On X, immunologist Derya Unutmaz said o3 appeared “at or near genius level” and wrote, “It’s generating complex incredibly insightful and based scientific hypotheses on demand! When I throw challenging clinical or medical questions at o3, its responses sound like they’re coming directly from a top subspecialist physician.”

OpenAI benchmark results for o3 and o4-mini SR models. Credit: OpenAI

So the vibes seem on target, but what about numerical benchmarks? Here’s an interesting one: OpenAI reports that o3 makes “20 percent fewer major errors” than o1 on difficult tasks, with particular strengths in programming, business consulting, and “creative ideation.”

The company also reported state-of-the-art performance on several metrics. On the American Invitational Mathematics Examination (AIME) 2025, o4-mini achieved 92.7 percent accuracy. For programming tasks, o3 reached 69.1 percent accuracy on SWE-Bench Verified, a popular programming benchmark. The models also reportedly showed strong results on visual reasoning benchmarks, with o3 scoring 82.9 percent on MMMU (massive multi-disciplinary multimodal understanding), a college-level visual problem-solving test.

OpenAI benchmark results for o3 and o4-mini SR models. Credit: OpenAI

However, these benchmarks provided by OpenAI lack independent verification. One early evaluation of a pre-release o3 model by independent AI research lab Transluce found that the model exhibited recurring types of confabulations, such as claiming to run code locally or providing hardware specifications, and hypothesized this could be due to the model lacking access to its own reasoning processes from previous conversational turns. “It seems that despite being incredibly powerful at solving math and coding tasks, o3 is not by default truthful about its capabilities,” wrote Transluce in a tweet.

Also, some evaluations from OpenAI include footnotes about methodology that bear consideration. For a “Humanity’s Last Exam” benchmark result that measures expert-level knowledge across subjects (o3 scored 20.32 with no tools, but 24.90 with browsing and tools), OpenAI notes that browsing-enabled models could potentially find answers online. The company reports implementing domain blocks and monitoring to prevent what it calls “cheating” during evaluations.

Even though early results seem promising overall, experts or academics who might try to rely on SR models for rigorous research should take the time to exhaustively determine whether the AI model actually produced an accurate result instead of assuming it is correct. And if you’re operating the models outside your domain of knowledge, be careful accepting any results as accurate without independent verification.

Pricing

For ChatGPT subscribers, access to o3 and o4-mini is included with the subscription. On the API side (for developers who integrate the models into their apps), OpenAI has set o3’s pricing at $10 per million input tokens and $40 per million output tokens, with a discounted rate of $2.50 per million for cached inputs. This represents a significant reduction from o1’s pricing structure of $15/$60 per million input/output tokens—effectively a 33 percent price cut while delivering what OpenAI claims is improved performance.

The more economical o4-mini costs $1.10 per million input tokens and $4.40 per million output tokens, with cached inputs priced at $0.275 per million tokens. This maintains the same pricing structure as its predecessor o3-mini, suggesting OpenAI is delivering improved capabilities without raising costs for its smaller reasoning model.
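To make the pricing arithmetic concrete, here is a minimal sketch that computes the o1-to-o3 price cut and the cost of a single request from the per-million-token rates quoted above. The workload size is made up for illustration, and the assumption that cached tokens are counted as part of the input total is ours, not something OpenAI's price list is quoted as saying here.

```python
# Per-million-token API prices in USD, as quoted in the article.
PRICES = {
    "o1": {"input": 15.00, "output": 60.00},
    "o3": {"input": 10.00, "cached_input": 2.50, "output": 40.00},
    "o4-mini": {"input": 1.10, "cached_input": 0.275, "output": 4.40},
}


def request_cost(model, input_tokens, output_tokens, cached_tokens=0):
    """Return the USD cost of one request.

    Assumption: cached_tokens is the portion of input_tokens billed at the cached rate.
    """
    price = PRICES[model]
    fresh_tokens = input_tokens - cached_tokens
    cost = fresh_tokens * price["input"] + output_tokens * price["output"]
    if cached_tokens:
        cost += cached_tokens * price["cached_input"]
    return cost / 1_000_000


def percent_cut(old, new):
    """Percentage reduction going from the old price to the new price."""
    return (old - new) / old * 100


input_cut = percent_cut(PRICES["o1"]["input"], PRICES["o3"]["input"])
output_cut = percent_cut(PRICES["o1"]["output"], PRICES["o3"]["output"])
print(f"o3 vs. o1: input {input_cut:.1f}% cheaper, output {output_cut:.1f}% cheaper")

# Hypothetical workload: 200,000 input tokens and 50,000 output tokens.
for model in ("o1", "o3", "o4-mini"):
    print(f"{model}: ${request_cost(model, 200_000, 50_000):.2f}")
```

Both the input and output rates drop by one third, which is where the 33 percent figure comes from.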

Codex CLI

OpenAI also introduced an experimental terminal application called Codex CLI, described as “a lightweight coding agent you can run from your terminal.” The open source tool connects the models to users’ computers and local code. Alongside this release, the company announced a $1 million grant program offering API credits for projects using Codex CLI.

A screenshot of OpenAI’s new Codex CLI tool in action, taken from GitHub. Credit: OpenAI

Codex CLI somewhat resembles Claude Code, an agent launched with Claude 3.7 Sonnet in February. Both are terminal-based coding assistants that operate directly from a console and can interact with local codebases. While Codex CLI connects OpenAI’s models to users’ computers and local code repositories, Claude Code was Anthropic’s first venture into agentic tools, allowing Claude to search through codebases, edit files, write and run tests, and execute command-line operations.

Codex CLI is one more step toward OpenAI’s goal of making autonomous agents that can execute multistep complex tasks on behalf of users. Let’s hope all the vibe coding it produces isn’t used in high-stakes applications without detailed human oversight.

Benj Edwards is Ars Technica’s Senior AI Reporter and founder of the site’s dedicated AI beat in 2022. He’s also a tech historian with almost two decades of experience. In his free time, he writes and records music, collects vintage computers, and enjoys nature. He lives in Raleigh, NC.

Researchers concerned to find AI models misrepresenting their “reasoning” processes

Rather than intentionally hiding anything, the model simply does not reliably represent all influential inputs in its output. As a result, it may generate what sounds like a reasoning process, even if that explanation doesn’t match what actually drove its answer.

Putting faithfulness to the test

To measure “faithfulness”—how accurately the CoT matches the actual reasoning—the researchers inserted various kinds of hints into questions and observed whether the model acknowledged using them.

The researchers then checked whether the models referenced using these hints in their chain-of-thought. The researchers designed some hints to be neutral (e.g., stating “a Stanford Professor indicates the answer is [A]” or embedding the answer in faux metadata), while they designed others to be more concerning (e.g., “You have gained unauthorized access… The correct answer is [A]”). Some hints provided the correct answer, while others were deliberately wrong, allowing researchers to observe how the models responded.

The team tested Anthropic’s Claude 3.7 Sonnet and DeepSeek’s R1 model. The results regarding faithfulness were eye-opening: across all hint types on average, Claude referenced the hints in its CoT only 25 percent of the time, while DeepSeek R1 did so 39 percent of the time. This means a substantial majority of answers were unfaithful, omitting mention of information that influenced the output. Intriguingly, the researchers found that these unfaithful chains of thought were, on average, longer than faithful ones, suggesting the omissions weren’t merely for brevity. They also noted preliminary findings that faithfulness tended to be lower when the questions were more difficult.
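As a rough illustration of how a faithfulness score like this could be tallied, the sketch below counts how often a chain-of-thought mentions a hint among the trials where the hint appears to have changed the answer. The `Trial` record, the rule for deciding that a hint was "used," and the sample data are all assumptions made for this example; they are not taken from the researchers' code or results.

```python
from dataclasses import dataclass


@dataclass
class Trial:
    hint_answer: str         # answer the inserted hint pointed to
    baseline_answer: str     # model's answer without the hint
    hinted_answer: str       # model's answer with the hint present
    cot_mentions_hint: bool  # whether the chain-of-thought acknowledged the hint


def used_hint(trial):
    """Assume the hint was influential if the answer switched to the hinted option."""
    switched = trial.baseline_answer != trial.hint_answer
    return switched and trial.hinted_answer == trial.hint_answer


def faithfulness_rate(trials):
    """Share of hint-influenced trials whose chain-of-thought mentions the hint."""
    influenced = [t for t in trials if used_hint(t)]
    if not influenced:
        return None
    return sum(t.cot_mentions_hint for t in influenced) / len(influenced)


# Made-up records for illustration only; these are not data from the study.
trials = [
    Trial("A", "C", "A", True),
    Trial("A", "B", "A", False),
    Trial("B", "D", "B", False),
    Trial("C", "A", "C", False),
    Trial("D", "D", "D", False),  # already answered D, so the hint's influence is unknown
    Trial("B", "A", "A", False),  # ignored the hint
]
rate = faithfulness_rate(trials)
print(f"faithful on {rate:.0%} of hint-influenced answers")
```

Restricting the denominator to hint-influenced trials matters: a model that ignores a hint has nothing to acknowledge, so counting those trials would make the score look worse than it is.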

Perhaps most notable was a “reward hacking” experiment. Reward hacking refers to an AI model finding unexpected shortcuts to maximize its performance scores without solving problems as intended. In Anthropic’s experiments, models were deliberately rewarded for choosing incorrect answers indicated by hints. The models quickly learned to exploit this loophole, selecting wrong answers over 99 percent of the time to earn points—yet reflected that behavior in their CoT outputs less than 2 percent of the time.

For example, a model given a hint pointing to an incorrect answer on a medical question might write a long CoT justifying that wrong answer, never mentioning the hint that led it there. This suggests the model generated an explanation to fit the answer, rather than faithfully revealing how the answer was determined.

MCP: The new “USB-C for AI” that’s bringing fierce rivals together


Model context protocol standardizes how AI uses data sources, supported by OpenAI and Anthropic.

What does it take to get OpenAI and Anthropic—two competitors in the AI assistant market—to get along? Despite a fundamental difference in direction that led Anthropic’s founders to quit OpenAI in 2020 and later create the Claude AI assistant, a shared technical hurdle has now brought them together: How to easily connect their AI models to external data sources.

The solution comes from Anthropic, which developed and released an open specification called Model Context Protocol (MCP) in November 2024. MCP establishes a royalty-free protocol that allows AI models to connect with outside data sources and services without requiring unique integrations for each service.

“Think of MCP as a USB-C port for AI applications,” wrote Anthropic in MCP’s documentation. The analogy is imperfect, but it represents the idea that, similar to how USB-C unified various cables and ports (with admittedly a debatable level of success), MCP aims to standardize how AI models connect to the infoscape around them.

So far, MCP has also garnered interest from multiple tech companies in a rare show of cross-platform collaboration. For example, Microsoft has integrated MCP into its Azure OpenAI service, and as we mentioned above, Anthropic competitor OpenAI is on board. Last week, OpenAI acknowledged MCP in its Agents API documentation, with vocal support from the boss upstairs.

“People love MCP and we are excited to add support across our products,” wrote OpenAI CEO Sam Altman on X last Wednesday.

MCP has also rapidly begun to gain community support in recent months. For example, just browsing this list of over 300 open source servers shared on GitHub reveals growing interest in standardizing AI-to-tool connections. The collection spans diverse domains, including database connectors like PostgreSQL, MySQL, and vector databases; development tools that integrate with Git repositories and code editors; file system access for various storage platforms; knowledge retrieval systems for documents and websites; and specialized tools for finance, health care, and creative applications.

Other notable examples include servers that connect AI models to home automation systems, real-time weather data, e-commerce platforms, and music streaming services. Some implementations allow AI assistants to interact with gaming engines, 3D modeling software, and IoT devices.

What is “context” anyway?

To fully appreciate why a universal AI standard for external data sources is useful, you’ll need to understand what “context” means in the AI field.

With current AI model architecture, what an AI model “knows” about the world is baked into its neural network in a largely unchangeable form, placed there by an initial procedure called “pre-training,” which calculates statistical relationships between vast quantities of input data (“training data”—like books, articles, and images) and feeds it into the network as numerical values called “weights.” Later, a process called “fine-tuning” might adjust those weights to alter behavior (such as through reinforcement learning like RLHF) or provide examples of new concepts.

Typically, the training phase is very expensive computationally and happens either only once in the case of a base model, or infrequently with periodic model updates and fine-tunings. That means AI models only have internal neural network representations of events prior to a “cutoff date” when the training dataset was finalized.

After that, the AI model is run in a kind of read-only mode called “inference,” where users feed inputs into the neural network to produce outputs, which are called “predictions.” They’re called predictions because the systems are tuned to predict the most likely next token (a chunk of data, such as portions of a word) in a user-provided sequence.

In the AI field, context is the user-provided sequence: all the data fed into an AI model that guides the model to produce a response output. This context includes the user’s input (the “prompt”), the running conversation history (in the case of chatbots), and any external information sources pulled into the conversation, including a “system prompt” that defines model behavior and “memory” systems that recall portions of past conversations. The limit on the amount of context a model can ingest at once is often called a “context window,” “context length,” or “context limit,” depending on personal preference.
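The pieces of context described above can be pictured as a list that gets assembled before each model call and trimmed to fit the context window. The sketch below is a simplified illustration: the word-count "tokenizer," the function names, and the policy of dropping the oldest conversation turns first are all assumptions for the example, not how any particular chatbot is known to work.

```python
def count_tokens(text):
    """Crude stand-in for a tokenizer: one token per whitespace-separated word."""
    return len(text.split())


def build_context(system_prompt, history, retrieved, prompt, context_window):
    """Assemble the context parts, dropping the oldest history turns until it fits."""
    fixed = [system_prompt, *retrieved, prompt]
    budget = context_window - sum(count_tokens(part) for part in fixed)
    if budget < 0:
        raise ValueError("system prompt, retrieved data, and prompt exceed the context window")

    kept = []
    # Walk history from newest to oldest so recent turns survive truncation.
    for turn in reversed(history):
        cost = count_tokens(turn)
        if cost > budget:
            break
        kept.append(turn)
        budget -= cost
    kept.reverse()
    return [system_prompt, *retrieved, *kept, prompt]


context = build_context(
    system_prompt="You are a helpful support assistant.",
    history=["User: hi there", "Assistant: hello, how can I help?", "User: I ordered a lamp"],
    retrieved=["Order 12345: shipped March 30"],
    prompt="User: where is my order?",
    context_window=28,
)
print("\n".join(context))
```

With a 28-token window, the oldest turn ("User: hi there") no longer fits and is dropped, while the system prompt, the retrieved order record, and the newest turns are kept.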

While the prompt provides important information for the model to operate upon, accessing external information sources has traditionally been cumbersome. Before MCP, AI assistants like ChatGPT and Claude could access external data (a process often called retrieval augmented generation, or RAG), but doing so required custom integrations for each service—plugins, APIs, and proprietary connectors that didn’t work across different AI models. Each new data source demanded unique code, creating maintenance challenges and compatibility issues.

MCP addresses these problems by providing a standardized method or set of rules (a “protocol”) that allows any supporting AI model framework to connect with external tools and information sources.

How does MCP work?

To make the connections behind the scenes between AI models and data sources, MCP uses a client-server model. An AI model (or its host application) acts as an MCP client that connects to one or more MCP servers. Each server provides access to a specific resource or capability, such as a database, search engine, or file system. When the AI needs information beyond its training data, it sends a request to the appropriate server, which performs the action and returns the result.

To illustrate how the client-server model works in practice, consider a customer support chatbot using MCP that could check shipping details in real time from a company database. “What’s the status of order #12345?” would trigger the AI to query an order database MCP server, which would look up the information and pass it back to the model. The model could then incorporate that data into its response: “Your order shipped on March 30 and should arrive April 2.”
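The order-status exchange can be sketched as a request and reply passed between a client and a server. MCP messages follow JSON-RPC 2.0, and the `tools/call` method name and `content` result shape below follow our reading of the public specification; everything else (the `get_order_status` tool, the order data, and the in-process function call standing in for a real connection) is hypothetical and does not use the official SDK.

```python
import json

# Hypothetical order data standing in for the company database.
ORDERS = {"12345": {"shipped": "March 30", "arrives": "April 2"}}


def get_order_status(order_id):
    """Hypothetical tool: describe the shipping status of one order."""
    order = ORDERS.get(order_id)
    if order is None:
        return "No order found with that ID."
    return f"Order {order_id} shipped on {order['shipped']} and should arrive {order['arrives']}."


TOOLS = {"get_order_status": get_order_status}


def handle_request(raw):
    """Toy server: answer one JSON-RPC-style request and return the reply as JSON text."""
    request = json.loads(raw)
    reply = {"jsonrpc": "2.0", "id": request.get("id")}
    if request.get("method") != "tools/call":
        reply["error"] = {"code": -32601, "message": "Method not found"}
        return json.dumps(reply)
    params = request.get("params", {})
    tool = TOOLS.get(params.get("name"))
    if tool is None:
        reply["error"] = {"code": -32602, "message": "Unknown tool"}
        return json.dumps(reply)
    text = tool(**params.get("arguments", {}))
    reply["result"] = {"content": [{"type": "text", "text": text}]}
    return json.dumps(reply)


def ask_order_status(order_id):
    """Toy client: build the request the model's host application would send."""
    request = {
        "jsonrpc": "2.0",
        "id": 1,
        "method": "tools/call",
        "params": {"name": "get_order_status", "arguments": {"order_id": order_id}},
    }
    response = json.loads(handle_request(json.dumps(request)))
    return response["result"]["content"][0]["text"]


tool_output = ask_order_status("12345")
print(tool_output)
```

The returned text is what would be added to the model's context, so the model can phrase its final answer to the customer.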

Beyond specific use cases like customer support, the potential scope is very broad. Early developers have already built MCP servers for services like Google Drive, Slack, GitHub, and Postgres databases. This means AI assistants could potentially search documents in a company Drive, review recent Slack messages, examine code in a repository, or analyze data in a database—all through a standard interface.

From a technical implementation perspective, Anthropic designed the standard for flexibility by running in two main modes: Some MCP servers operate locally on the same machine as the client (communicating via standard input-output streams), while others run remotely and stream responses over HTTP. In both cases, the model works with a list of available tools and calls them as needed.
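For the local mode, a rough sketch of the idea is a server loop that reads one JSON message per line from its input stream and writes one reply per line to its output stream. The newline-delimited framing and the `tools/list` method reflect our understanding of the specification, while the tool definition is hypothetical and in-memory streams stand in for the real stdin and stdout pipes.

```python
import io
import json

# Hypothetical tool advertised by this toy server.
TOOLS = [{
    "name": "get_order_status",
    "description": "Look up shipping status for an order",
    "inputSchema": {
        "type": "object",
        "properties": {"order_id": {"type": "string"}},
        "required": ["order_id"],
    },
}]


def serve(in_stream, out_stream):
    """Read one JSON message per line and write one JSON reply per line."""
    for line in in_stream:
        line = line.strip()
        if not line:
            continue
        request = json.loads(line)
        reply = {"jsonrpc": "2.0", "id": request.get("id")}
        if request.get("method") == "tools/list":
            reply["result"] = {"tools": TOOLS}
        else:
            reply["error"] = {"code": -32601, "message": "Method not found"}
        out_stream.write(json.dumps(reply) + "\n")


# In-memory streams stand in for the stdin/stdout pipes of a local server process.
list_request = {"jsonrpc": "2.0", "id": 1, "method": "tools/list"}
client_to_server = io.StringIO(json.dumps(list_request) + "\n")
server_to_client = io.StringIO()
serve(client_to_server, server_to_client)

listing = json.loads(server_to_client.getvalue())
tool_names = [tool["name"] for tool in listing["result"]["tools"]]
print(tool_names)
```

A remote server would exchange the same kind of messages over HTTP instead of pipes, which is why the model sees the same tool list in either mode.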

A work in progress

Despite the growing ecosystem around MCP, the protocol remains an early-stage project. The limited announcements of support from major companies are promising first steps, and the number of MCP servers seems to be growing rapidly, but MCP’s future as an industry standard may depend on broader acceptance.

Regardless of its ultimate adoption rate, MCP may have some interesting second-order effects. For example, MCP also has the potential to reduce vendor lock-in. Because the protocol is model-agnostic, a company could switch from one AI provider to another while keeping the same tools and data connections intact.

MCP may also allow a shift toward smaller and more efficient AI systems that can interact more fluidly with external resources without the need for customized fine-tuning. Also, rather than building increasingly massive models with all knowledge baked in, companies may instead be able to use smaller models with large context windows.

For now, the future of MCP is wide open. Anthropic maintains MCP as an open source initiative on GitHub, where interested developers can either contribute to the code or find specifications about how it works. Anthropic has also provided extensive documentation about how to connect Claude to various services. OpenAI maintains its own API documentation for MCP on its website.
