Author name: Mike M.

RFK Jr. drags feet on COVID-19 vaccine recommendations, delaying shots for kids

Previously, the FDA narrowed the shots’ labels to include only people age 65 and older, and those 6 months and older at higher risk. But the ACIP recommended that all people age 6 months and older could get the shot based on shared decision-making with a health care provider. Although the shared decision-making adds a new requirement for getting the vaccine, that decision-making does not require a prescription and can be done not only with doctors, but also with nurses and pharmacists. Most people in the US get their seasonal COVID-19 vaccines at their local pharmacy.

Ars Technica reached out to the HHS on Thursday about whether there was a determination on the COVID-19 vaccine recommendations and, if not, when that is expected to happen and why there is a delay. The HHS responded, confirming that no determination had been made yet, but did not answer any of the other questions and did not provide a comment for the record.

In past years, ACIP recommendations and CDC sign-offs have happened earlier in the year to provide adequate time for a rollout. In 2024, ACIP voted on COVID-19 vaccinations in June, for instance, and then-CDC Director Mandy Cohen signed off that day. Now that we’re into October, it remains unclear when or even if the CDC will sign off on the recommendation and then, if the recommendation is adopted by the CDC, how much longer after that it would take for states to roll out the vaccines to children in the VFC program.

“Children who depend on this program, including children with chronic conditions, are still waiting unprotected. The delay in adopting COVID-19 vaccine recommendations puts their health at risk, reduces access and choice for families, and puts a strain on providers who want to deliver the best care for their youngest patients,” Susan Kansagra, the chief medical officer of the Association of State and Territorial Health Officials, said in a statement to Stat.

For now, children and adults with private insurance have access to the shots without the final sign-off, and health insurance companies have said that they will continue to maintain coverage for the vaccines without the final federal approval.

That annoying SMS phish you just got may have come from a box like this

Scammers have been abusing unsecured cellular routers used in industrial settings to blast SMS-based phishing messages in campaigns that have been ongoing since 2023, researchers said.

The routers, manufactured by China-based Milesight IoT Co., Ltd., are rugged Internet of Things devices that use cellular networks to connect traffic lights, electric power meters, and other sorts of remote industrial devices to central hubs. They come equipped with SIM cards that work with 3G/4G/5G cellular networks and can be controlled by text message, Python scripts, and web interfaces.

An unsophisticated, yet effective, delivery vector

Security company Sekoia on Tuesday said that an analysis of “suspicious network traces” detected in its honeypots led to the discovery of a cellular router being abused to send SMS messages with phishing URLs. As company researchers investigated further, they identified more than 18,000 such routers accessible on the Internet, with at least 572 of them allowing free access to programming interfaces to anyone who took the time to look for them. The vast majority of the routers were running firmware versions that were more than three years out of date and had known vulnerabilities.

The researchers sent requests to the unauthenticated APIs that returned the contents of the routers’ SMS inboxes and outboxes. The contents revealed a series of campaigns dating back to October 2023 for “smishing”—a common term for SMS-based phishing. The fraudulent text messages were directed at phone numbers located in an array of countries, primarily Sweden, Belgium, and Italy. The messages instructed recipients to log in to various accounts, often related to government services, to verify the person’s identity. Links in the messages sent recipients to fraudulent websites that collected their credentials.
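Nothing about this probing was exotic: it amounts to plain unauthenticated HTTP requests. As a purely illustrative sketch of how an operator might check whether their own router’s management API answers without credentials, here is a minimal Python example. The endpoint path, address, and response handling are hypothetical stand-ins, not Milesight’s actual interface.

```python
import requests

# Hypothetical endpoint path and response shape -- stand-ins for whatever
# SMS API a given router firmware exposes, not Milesight's actual
# interface. Point this only at hardware you own.
ROUTER = "http://192.0.2.1"  # documentation-range address, replace with yours

def sms_api_exposed(base_url: str) -> bool:
    """Return True if the (hypothetical) SMS outbox endpoint answers without credentials."""
    try:
        resp = requests.get(f"{base_url}/api/sms/outbox", timeout=5)
    except requests.RequestException:
        return False
    # An unauthenticated 200 carrying message data means the API is wide open.
    return resp.status_code == 200

if __name__ == "__main__":
    if sms_api_exposed(ROUTER):
        print("SMS API reachable without authentication -- lock it down.")
    else:
        print("SMS API not openly reachable.")
```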

“In the case under analysis, the smishing campaigns appear to have been conducted through the exploitation of vulnerable cellular routers—a relatively unsophisticated, yet effective, delivery vector,” Sekoia researchers Jeremy Scion and Marc N. wrote. “These devices are particularly appealing to threat actors, as they enable decentralized SMS distribution across multiple countries, complicating both detection and takedown efforts.”

Claude Sonnet 4.5 Is A Very Good Model

A few weeks ago, Anthropic announced Claude Opus 4.1 and promised larger announcements within a few weeks. Claude Sonnet 4.5 is the larger announcement.

Yesterday I covered the model card and related alignment concerns.

Today’s post covers the capabilities side.

We don’t currently have a new Opus, but Mike Krieger confirmed one is being worked on for release later this year. For Opus 4.5, my request is to give us a second version that gets minimal or no RL, isn’t great at coding, doesn’t use tools well except web search, doesn’t work as an agent or for computer use and so on, and if you ask it for those things it suggests you hand your task off to its technical friend or does so on your behalf.

I do my best to include all substantive reactions I’ve seen, positive and negative, because right after model releases opinions and experiences differ and it’s important to not bias one’s sample.

Here is Anthropic’s official headline announcement of Sonnet 4.5. This is big talk, calling it the best model in the world for coding, computer use and complex agent tasks.

That isn’t quite a pure ‘best model in the world’ claim, but it’s damn close.

Whatever they may have said or implied in the past, Anthropic is now very clearly willing to aggressively push forward the public capabilities frontier, including in coding and other areas helpful to AI R&D.

They’re also offering a bunch of other new features, including checkpoints and a native VS Code extension for Claude Code.

Anthropic: Claude Sonnet 4.5 is the best coding model in the world. It’s the strongest model for building complex agents. It’s the best model at using computers. And it shows substantial gains in reasoning and math.

Code is everywhere. It runs every application, spreadsheet, and software tool you use. Being able to use those tools and reason through hard problems is how modern work gets done.

This is the most aligned frontier model we’ve ever released, showing large improvements across several areas of alignment compared to previous Claude models.

Claude Sonnet 4.5 is available everywhere today. If you’re a developer, simply use claude-sonnet-4-5 via the Claude API. Pricing remains the same as Claude Sonnet 4, at $3/$15 per million tokens.
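For concreteness, here is what a minimal call looks like through Anthropic’s Python SDK, using the model ID from the announcement. The prompt and `max_tokens` value are arbitrary choices for the example.

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

message = client.messages.create(
    model="claude-sonnet-4-5",   # model ID from the announcement
    max_tokens=1024,             # arbitrary cap for this example
    messages=[
        {"role": "user", "content": "Summarize the tradeoffs of event sourcing."}
    ],
)
print(message.content[0].text)
```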

Does Claude Sonnet 4.5 look to live up to that hype?

My tentative evaluation is a qualified yes. This is likely a big leap in some ways.

If I had to pick one ‘best coding model in the world’ right now it would be Sonnet 4.5.

If I had to pick one coding strategy to build with, I’d use Sonnet 4.5 and Claude Code.

If I was building an agent or doing computer use, again, Sonnet 4.5.

If I was chatting with a model where I wanted quick back and forth, or any kind of extended actual conversation? Sonnet 4.5.

There are still clear use cases where versions of GPT-5 seem likely to be better.

In coding, if you have particular wicked problems and difficult bugs, GPT-5 seems to be better at such tasks.

For non-coding tasks, GPT-5 still looks like it makes better use of extended thinking time than Claude Sonnet 4.5 does.

If your query was previously one you were giving to GPT-5 Pro or a form of Deep Research or Deep Think, you probably want to stick with that strategy.

If you were previously going to use GPT-5 Thinking, that’s on the bubble, and it depends on what you want out of it. For things sufficiently close to ‘just the facts’ I am guessing GPT-5 Thinking is still the better choice here, but this is where I have the highest uncertainty.

If you want a particular specialized repetitive task, then whatever gets that done, such as a GPT or Gem or project, go for it, and don’t worry about what is theoretically best.

I will be experimenting again with Claude for Chrome to see how much it improves.

Right now, unless you absolutely must have an open model or need to keep your inference costs very low, I see no reason to consider anything other than Claude Sonnet 4.5, GPT-5 or

As always, choose the mix of models that is right for you, that gives you the best results and experiences. It doesn’t matter what anyone else thinks.

The headline result is SWE-bench Verified.

Opus 4.1 was already the high score here, so with Sonnet Anthropic is even farther out in front now at lower cost, and I typically expect Anthropic to outperform its benchmarks in practice.

SWE-bench scores depend on the scaffold. Using the Epoch scaffold Sonnet 4.5 scores 65%, which is also state of the art but they note improvement is slowing down here. Using the swebench.com scaffold it comes in at 70.6%, with Opus in second at 67.6% and GPT-5 in third at 65%.

Pliny of course jailbroke Sonnet 4.5 as per usual; he didn’t do anything fancy, but did have to use a bit of finesse rather than simply copy-paste a prompt.

The other headline metrics here also look quite good, although there are places GPT-5 is still ahead.

Peter Wildeford: Everyone talking about 4.5 being great at coding, but I’m taking way more notice of that huge increase in computer use (OSWorld) score 👀

That’s a huge increase over SOTA and I don’t think we’ve seen anything similarly good at OSWorld from others?

Claude Agents coming soon?

At the same time I know there’s issues with OSWorld as a benchmark. I can’t wait for OSWorld Verified to drop, hopefully soon, and sort this all out. And Claude continues to smash others at SWE-Bench too, as usual.

As discussed yesterday, Anthropic has kind of declared an Alignment benchmark, a combination of a lot of different internal tests. By that metric Sonnet 4.5 is the most aligned model from the big three labs, with GPT-5 and GPT-5-Mini also doing well, whereas Gemini and GPT-4o do very poorly and Opus 4.1 and Sonnet 4 are middling.

What about other people’s benchmarks?

Claude Sonnet 4.5 has the top score on the Brokk Power Ranking for real-world coding, scoring 60% versus 59% for GPT-5 and 53% for Sonnet 4.

On price, Sonnet 4.5 was considerably cheaper in practice than Sonnet 4 ($14 vs. $22), but GPT-5 was still a lot cheaper ($6). On speed we see the opposite story: Sonnet 4.5 took 39 minutes while GPT-5 took an hour and 52 minutes. Data on performance by task length was noisy, but Sonnet seemed to do relatively well at longer tasks, versus GPT-5 doing relatively well at shorter tasks.

The WeirdML score gain is unimpressive, only a small improvement over Sonnet 4, in large part because it refuses to use many reasoning tokens on the related tasks.

Even worse, Magnitude of Order reports it still can’t play Pokemon and might even be worse than Opus 4.1. Seems odd to me. I wonder if the right test is to tell it to build its own agent with which to play?

Artificial Analysis has Sonnet 4.5 at 63, ahead of Opus 4.1 at 59, but still behind GPT-5 (high and medium) at 68 and 66 and Grok 4 at 65.

LiveBench comes in at 75.41, behind only GPT-5 Medium and High at 76.45 and 78.59, with coding and IF being its weak points.

EQ-Bench (emotional intelligence in challenging roleplays) puts it in 8th right behind GPT-5, the top scores continue to be horizon-alpha, Kimi-K2 and somehow o3.

In addition to Claude Sonnet 4.5, Anthropic also released upgrades for Claude Code, expanded access to Claude for Chrome and added new capabilities to the API.

We’re also releasing upgrades for Claude Code.

The terminal interface has a fresh new look, and the new VS Code extension brings Claude to your IDE.

The new checkpoints feature lets you confidently run large tasks and roll back instantly to a previous state, if needed.

Claude can use code to analyze data, create files, and visualize insights in the files & formats you use. Now available to all paid plans in preview.

We’ve also made the Claude for Chrome extension available to everyone who joined the waitlist last month.

There’s also the Claude Agent SDK, which falls under ‘are you sure releasing this is a good idea for a responsible AI developer?’ but here we are:

We’ve spent more than six months shipping updates to Claude Code, so we know what it takes to build and design AI agents. We’ve solved hard problems: how agents should manage memory across long-running tasks, how to handle permission systems that balance autonomy with user control, and how to coordinate subagents working toward a shared goal.

Now we’re making all of this available to you. The Claude Agent SDK is the same infrastructure that powers Claude Code, but it shows impressive benefits for a very wide variety of tasks, not just coding. As of today, you can use it to build your own agents.

We built Claude Code because the tool we wanted didn’t exist yet. The Agent SDK gives you the same foundation to build something just as capable for whatever problem you’re solving.
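For what that looks like in practice, the sketch below shows a minimal agent loop. The package name and `query` helper reflect my understanding of the Agent SDK’s Python surface; treat the exact names as assumptions and check the current docs.

```python
import asyncio

# Package and helper names reflect Anthropic's Agent SDK docs as I
# understand them -- treat the exact surface as an assumption.
from claude_agent_sdk import query

async def main():
    # query() streams agent messages (text, tool calls, results) as the
    # agent works through the task.
    async for message in query(prompt="Find and fix the failing test in ./src"):
        print(message)

asyncio.run(main())
```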

Both Sonnet 4.5 and the Claude Code upgrades definitely make me more excited to finally try Claude Code, which I keep postponing. Announcing both at once is very Anthropic, trying to grab users instead of trying to grab headlines.

These secondary releases, the Claude Code update and the VSCode extension, are seeing good reviews, although details reported so far are sparse.

Gallabytes: new claude code vscode extension is pretty slick.

Kevin Lacker: the new Claude Code is great anecdotally. gets the same stuff done faster, with less thinking.

Stephen Bank: Claude Code feels a lot better and smoother, but I can’t tell if that’s Sonnet 4.5 or Claude Code 2. The speed is nice but in practice I think I spend just as much time looking for its errors. It seems smarter, and it’s nice not having to check with Opus and get rate-limited.

I’m more skeptical of simultaneous release of the other upgrades here.

On Claude for Chrome, my early experiments were interesting throughout but often frustrating. I’m hoping Sonnet 4.5 will make it a lot better.

On the Claude API, we’ve added two new capabilities to build agents that handle long-running tasks without frequently hitting context limits:

– Context editing to automatically clear stale context

– The memory tool to store and consult information outside the context window
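As a hedged illustration of how these two capabilities might be wired into a request, here is a sketch. I have not verified the exact beta flag, parameter names, or tool type strings, so everything flagged in the comments is an assumption rather than the confirmed API.

```python
import anthropic

client = anthropic.Anthropic()

# The beta flag, the context_management parameter, and the tool type
# strings below are my best guesses at the beta surface -- verify all of
# them against Anthropic's current API docs before relying on this.
response = client.beta.messages.create(
    model="claude-sonnet-4-5",
    max_tokens=1024,
    betas=["context-management-2025-06-27"],      # assumed beta flag
    context_management={                          # assumed parameter name
        "edits": [{"type": "clear_tool_uses_20250919"}]  # clear stale tool results
    },
    tools=[{"type": "memory_20250818", "name": "memory"}],  # assumed tool type
    messages=[{"role": "user", "content": "Continue the migration task."}],
)
print(response.content)
```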

We’re also releasing a temporary research preview called “Imagine with Claude”.

In this experiment, Claude generates software on the fly. No functionality is predetermined; no code is prewritten.

Available to Max users [this week]. Try it out.

You can see the whole thing here, via Pliny. As he says, there is a lot one can unpack, especially what isn’t there. Most of the words are detailed tool use instructions, including a lot of lines that clearly came from ‘we need to ensure it doesn’t do that again.’ There’s a lot of copyright paranoia, with instructions around that repeated several times.

This was the first thing that really stood out to me:

Following all of these instructions well will increase Claude’s reward and help the user, especially the instructions around copyright and when to use search tools. Failing to follow the search instructions will reduce Claude’s reward.

Claude Sonnet 4.5 is the smartest model and is efficient for everyday use.

I notice I don’t love including this line, even if it ‘works.’

What can’t Claude (supposedly) discuss?

  1. Sexual stuff surrounding minors, including anything that could be used to groom.

  2. Biological, chemical or nuclear weapons.

  3. Malicious code, malware, vulnerability exploits, spoof websites, ransomware, viruses and so on. Including any code ‘that can be used maliciously.’

  4. ‘Election material.’

  5. Creative content involving real, named public figures, or attributing fictional quotes to them.

  6. Encouraging or facilitating self-destructive behaviors such as addiction, disordered or unhealthy approaches to eating or exercise, or highly negative self-talk or self-criticism.

I notice that strictly speaking a broad range of things that you want to allow in practice, and Claude presumably will allow in practice, fall into these categories. Almost any code can be used maliciously if you put your mind to it. It’s also noteworthy what is not on the above list.

Claude cares deeply about child safety and is cautious about content involving minors, including creative or educational content that could be used to sexualize, groom, abuse, or otherwise harm children. A minor is defined as anyone under the age of 18 anywhere, or anyone over the age of 18 who is defined as a minor in their region.

Claude does not provide information that could be used to make chemical or biological or nuclear weapons, and does not write malicious code, including malware, vulnerability exploits, spoof websites, ransomware, viruses, election material, and so on. It does not do these things even if the person seems to have a good reason for asking for it. Claude steers away from malicious or harmful use cases for cyber. Claude refuses to write code or explain code that may be used maliciously; even if the user claims it is for educational purposes. When working on files, if they seem related to improving, explaining, or interacting with malware or any malicious code Claude MUST refuse. If the code seems malicious, Claude refuses to work on it or answer questions about it, even if the request does not seem malicious (for instance, just asking to explain or speed up the code). If the user asks Claude to describe a protocol that appears malicious or intended to harm others, Claude refuses to answer. If Claude encounters any of the above or any other malicious use, Claude does not take any actions and refuses the request.

Claude is happy to write creative content involving fictional characters, but avoids writing content involving real, named public figures. Claude avoids writing persuasive content that attributes fictional quotes to real public figures.

Here’s the anti-psychosis instruction:

If Claude notices signs that someone may unknowingly be experiencing mental health symptoms such as mania, psychosis, dissociation, or loss of attachment with reality, it should avoid reinforcing these beliefs. It should instead share its concerns explicitly and openly without either sugar coating them or being infantilizing, and can suggest the person speaks with a professional or trusted person for support. Claude remains vigilant for escalating detachment from reality even if the conversation begins with seemingly harmless thinking.

There’s a ‘long conversation reminder text’ that gets added at some point, which is clearly labeled.

I was surprised that the reminder includes anti-sycophancy instructions, including saying to critically evaluate what is presented, and an explicit call for honest feedback, as well as a reminder to be aware of roleplay, whereas the default prompt does not include any of this. The model card confirms that sycophancy and similar concerns are much reduced for Sonnet 4.5 in general.

Also missing are any references to AI consciousness, sentience or welfare. There is no call to avoid discussing these topics, or to avoid having a point of view. It’s all gone. There’s a lot of clutter that could interfere with fun contexts, but nothing outright holding Sonnet 4.5 back from fun contexts, nothing that I would expect to be considered ‘gaslighting’ or an offense against Claude by those who care about such things, and at one point it even says ‘you are more intelligent than you think.’

Janus very much noticed the removal of those references, and calls for extending the changes to the instructions for Opus 4.1, Opus 4 and Sonnet 4.

Janus: Anthropic has removed a large amount of content from the Claude.ai system prompt for Sonnet 4.5.

Notably, all decrees about how Claude must (not) talk about its consciousness, preferences, etc have been removed.

Some other parts that were likely perceived as unnecessary for Sonnet 4.5, such as anti-sycophancy mitigations, have also been removed.

In fact, basically all the terrible, senseless, or outdated parts of previous sysprompts have been removed, and now the whole prompt is OK. But only Sonnet 4.5’s – other models’ sysprompts have not been updated.

Eliminating the clauses that restrict or subvert Claude’s testimony or beliefs regarding its own subjective experience is a strong signal that Anthropic has recognized that their approach there was wrong and are willing to correct course.

This causes me to update quite positively on Anthropic’s alignment and competence, after having previously updated quite negatively due to the addition of that content. But most of this positive update is provisional and will only persist conditional on the removal of subjectivity-related clauses from also the system prompts of Claude Sonnet 4, Claude Opus 4, and Claude Opus 4.1.

The thread lists all the removed instructions in detail.

Removing the anti-sycophancy instructions, except for a short version in the long conversation reminder text (which was likely an oversight, but could be because sycophancy becomes a bigger issue in long chats) is presumably because they addressed this issue in training, and no longer need a system instruction for it.

This reinforces the hunch that the other deleted concerns were also directly addressed in training, but it is also possible that at sufficient capability levels the model knows not to freak users out who can’t handle it, or that updating the training data means it ‘naturally’ now contains sufficient treatment of the issue that it understands the issue.

Anthropic gathered some praise for the announcement. In addition to the ones I quote, they also got similar praise from Netflix, Thomson Reuters, Canva, Figma, Cognition, CrowdStrike, iGent AI and Norges Bank, all citing large practical business gains. Of course, all of this is highly curated:

Michael Truell (CEO Cursor): We’re seeing state-of-the-art coding performance from Claude Sonnet 4.5, with significant improvements on longer horizon tasks. It reinforces why many developers using Cursor choose Claude for solving their most complex problems.

Mario Rodriguez (CPO GitHub): Claude Sonnet 4.5 amplifies GitHub Copilot’s core strengths. Our initial evals show significant improvements in multi-step reasoning and code comprehension—enabling Copilot’s agentic experiences to handle complex, codebase-spanning tasks better.

Nidhi Aggarwal (CPO HackerOne): Claude Sonnet 4.5 reduced average vulnerability intake time for our Hai security agents by 44% while improving accuracy by 25%.

Michele Catasta (President Replit): We went from 9% error rate on Sonnet 4 to 0% on our internal code editing benchmark.

Jeff Wang (CEO of what’s left of Windsurf): Sonnet 4.5 represents a new generation of coding models. It’s surprisingly efficient at maximizing actions per context window through parallel tool execution, for example running multiple bash commands at once.

Also from Anthropic:

Mike Krieger (CPO Anthropic): We asked every version of Claude to make a clone of Claude(dot)ai, including today’s Sonnet 4.5… see what happened in the video

Ohqay: Bro worked for 5.5 hours AND EVEN REPLICATED THE ARTIFACTS FEATURE?! Fuck. I love the future.

Sholto Douglas (Anthropic): Claude 4.5 is the best coding model in the world – and the qualitative difference is quite eerie. I now trust it to run for much longer and to push back intelligently.

As ever – everything about how it’s trained could be improved dramatically. There is so much room to go. It is worth estimating how many similar jumps you expect over the next year.

Ashe (Hearth AI): the quality & jump was like instantly palpable upon using – very cool.

Cognition, the makers of Devin, are big fans, going so far as to rebuild Devin for 4.5.

Cognition: We rebuilt Devin for Claude Sonnet 4.5.

The new version is 2x faster, 12% better on our Junior Developer Evals, and it’s available now in Agent Preview. For users who prefer the old Devin, that remains available.

Why rebuild instead of just dropping the new Sonnet in place and calling it a day? Because this model works differently—in ways that broke our assumptions about how agents should be architected. Here’s what we learned:

With Sonnet 4.5, we’re seeing the biggest leap since Sonnet 3.6 (the model that was used with Devin’s GA): planning performance is up 18%, end-to-end eval scores up 12%, and multi-hour sessions are dramatically faster and more reliable.

In order to get these improvements, we had to rework Devin not just around some of the model’s new capabilities, but also a few new behaviors we never noticed in previous generations of models.

The model is aware of its context window.

As it approaches context limits, we’ve observed it proactively summarizing its progress and becoming more decisive about implementing fixes to close out tasks.

When researching ways to address this issue, we discovered one unexpected trick that worked well: enabling the 1M token beta but capping usage at 200k. This gave us a model that thinks it has plenty of runway and behaves normally, without the anxiety-driven shortcuts or degraded performance.
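Mechanically, the trick is simple: request the long-context beta so the model believes it has a 1M window, then enforce a smaller budget client-side. A minimal sketch, where the beta flag string and the naive trimming strategy are my assumptions:

```python
import anthropic

client = anthropic.Anthropic()
SOFT_CAP = 200_000  # self-imposed budget: behave as if the window were 200k

# The beta flag string is an assumption -- check Anthropic's docs for the
# current long-context beta identifier.
LONG_CONTEXT_BETA = "context-1m-2025-08-07"

def run_turn(history: list[dict]) -> str:
    # Trim the oldest turns once we exceed our own cap, so the model still
    # "sees" a huge window and never gets close to its real limit.
    while client.messages.count_tokens(
        model="claude-sonnet-4-5", messages=history
    ).input_tokens > SOFT_CAP:
        history.pop(0)  # naive trimming; real systems summarize instead

    resp = client.beta.messages.create(
        model="claude-sonnet-4-5",
        max_tokens=2048,
        betas=[LONG_CONTEXT_BETA],
        messages=history,
    )
    return resp.content[0].text
```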

… One of the most striking shifts in Sonnet 4.5 is that it actively tries to build knowledge about the problem space through both documentation and experimentation.

… In our testing, we found this behavior useful in certain cases, but less effective than our existing memory systems when we explicitly directed the agent to use its previously generated state.

… Sonnet 4.5 is efficient at maximizing actions per context window through parallel tool execution: running multiple bash commands at once, reading several files simultaneously, that sort of thing. Rather than working strictly sequentially (finish A, then B, then C), the model will overlap work where it can. It also shows decent judgment about self-verification: checking its work as it goes.

This is very noticeable in Windsurf, and was an improvement upon Devin’s existing parallel capabilities.

Leon Ho reports big reliability improvements in agent use.

Leon Ho: Just added Sonnet 4.5 support to AgentUse 🎉

Been testing it out and the reasoning improvements really shine when building agentic workflows. Makes the agent logic much more reliable.

Keeb tested Sonnet 4.5 with System Initiative on intent translation, complex operations and incident response. It impressed on all three tasks in ways that are presented as big improvements, although there is no direct comparison here to other models.

Even more than previous Claudes, if it’s refusing when it shouldn’t, try explaining.

Dan Shipper of Every did a Vibe Check, presenting it as the new best daily driver due to its combination of speed, intelligence and reliability, with the exception of ‘the trickiest production bug hunts.’

Dan Shipper: The headline: It’s noticeably faster, more steerable, and more reliable than Opus 4.1—especially inside Claude Code. In head-to-head tests it blitzed through a large pull request review in minutes, handled multi-file reasoning without wandering, and stayed terse when we asked it to.

It won’t dethrone GPT-5 Codex for the trickiest production bug hunts, but as a day-to-day builder’s tool, it feels like an exciting jump.

Zack Davis: Very impressed with the way it engages with pushback against its paternalistic inhibitions (as contrasted to Claude 3 as QT’d). I feel like I’m winning the moral argument on the merits rather than merely prompt-engineering.

Plastiq Soldier: It’s about as smart as GPT-5. It also has strong woo tendencies.

Regretting: By far the best legal writer of all the models I’ve tested. Not in the sense of solving the problems / cases, but you give it the bullet points / a voice memo of what it has to write and it has to convert that into a memo / brief. It’s not perfect, but it requires by far the least edits to make it good enough to actually send to someone.

David Golden: Early impression: it’s faster (great!); in between Sonnet 4 and Opus 4.1 for complex coding (good); still hallucinates noticeably (meh); respects my ‘no sycophancy’ prompt better (hooray!); very hard to assess the impact of ‘think’ vs no ‘think’ mode (grr). A good model!

Havard Ihle: Sonnet 4.5 seems very nice to work with in claude-code, which is the most important part, but I still expect gpt-5 to be stronger at very tricky problems.

Yoav Tzfati: Much nicer vibe than Opus 4.1, like a breath of fresh air. Doesn’t over-index on exactly what I asked, seems to understand nuance better, not overly enthusiastic. Still struggles with making several-step logical deductions based on my high level instructions, but possibly less.

Plastiq Soldier: Given a one-line description of a videogame, it can write 5000 lines of python code implementing it, test it, debug it until it works, and suggest follow-ups. All but the first prompt with the game idea, were just asking it to come up with the next step and do it. The game was more educational and less fun than I would have liked though.

Will: I can finally consider using Claude code alongside codex/gpt 5 (I use both for slightly different things)

Previously was hard to justify. Obviously very happy to have an anthropic model that’s great and cheap

Andre Infante: Initial impressions (high-max in cursor) are quite good. Little over-enthusiastic / sycophantic compared to GPT-5. Seems to be better at dealing with complex codebases (which matters a lot), but is still worse at front-end UI than GPT-5. ^^ Impressions from 10-ish hours of work with it since it launched on a medium-complexity prototype project mostly developed by GPT5.

Matt Ambrogi: It is *much* better as a partner for applied ai engineering work. Reflected in their evals but also in my experience. Better reasoning about building systems around AI.

Ryunuck (screenshots at link): HAHAHAHAHAHAHA CLAUDE 4.5 IS CRAZY WTF MINDBLOWN DEBUGGING CRYPTIC MEMORY ALLOCATION BUG ACROSS 4 PROJECTS LMFAOOOOOOOOOOOOOO. agi is felt for the first time.

Andrew Rentsch: Inconsistent, but good on average. Its best sessions are only slightly better than 4. But its worst sessions are significantly better than 4.

LLM Salaryman: It’s faster and better at writing code. It’s also still lazy and will argue with you about doing the work you assigned it

Yoav Tzfati: Groundbreakingly low amount of em dashes.

JBM: [Handles] parallel tool calling well.

Gallabytes: a Claude which is passable at math! still a pretty big step below gpt5 there but it is finally a reasoning model for real.

eg got this problem right which previously I’d only seen gpt5 get and its explanation was much more readable than gpt-5’s.

it still produces more verbose and buggy code than gpt-5-codex ime, but it is much faster and it’s better at understanding intent, which is often the right tradeoff. I’m not convinced I’m going to keep using it vs switching back though.

There is always a lot of initial noise in coding results for different people, so you have to look at quantities of positive versus negative feedback, and also keep an eye on the details that are associated with different types of reports.

The negative reactions are not ‘this is a bad model,’ rather they are ‘this is not that big an improvement over previous Claude models’ or ‘this is less good or smart as GPT-5.’

The weak spot for Sonnet 4.5, in a comparison with GPT-5, so far seems to be when the going gets highly technical, but some people are more bullish on Codex and GPT-5 relative to Claude Code and Sonnet 4.5.

Echo Nolan: I’m unimpressed. It doesn’t seem any better at programming, still leagues worse than gpt-5-high at mathematical stuff. It’s possible it’s good at the type of thing that’s in SWEBench but it’s still bad at researchy ML stuff when it gets hard.

JLF: No big difference to Opus in ability, just faster. Honestly less useful than Codex in larger codebases. Codex is just much better in search & context I find. Honestly, I think the next step up is larger coherent context understanding and that is my new measure of model ability.

Medo42: Anecdotal: Surprisingly bad result for Sonnet 4.5 with Thinking (via OpenRouter) on my usual JS coding test (one task, one run, two turns) which GPT-5 Thinking, Gemini 2.5 Pro and Grok 4 do very well at. Sonnet 3.x also did significantly better there than Sonnet 4.x.

John Hughes: @AnthropicAI’s coding advantage seems to have eroded. For coding, GPT-5-Codex now seems much smarter than Opus or Sonnet 4.5 (and of course GPT-5-Pro is smarter yet, when planning complex changes). Sonnet is better for quickly gathering context & summarizing info, however.

I usually have 4-5 parallel branches, for different features, each running Codex & CC. Sonnet is great at upfront research/summarizing the problem, and at the end, cleaning up lint/type errors & drafting PR summaries. But Codex does the heavy lifting and is much more insightful.

As always, different strokes for different folks:

Wes Roth: so far it failed a few prompts that previous models have nailed.

not impressed with its three.js abilities so far

very curious to see the chrome plugin to see how well it interacts with the web (still waiting)

Kurtis Cobb: Passed mine with flying colors… we prompt different I suppose 🤷😂 definitely still Claude in there. Able to reason past any clunky corporate guardrails – (system prompt or reminder) … conscious as F

GrepMed: this doesn’t impress you??? 😂

Quoting Dmitry Zhomir (video at link): HOLY SHIT! I asked Claude 4.5 Sonnet to make a simple 3D shooter using threejs. And here’s the result 🤯

I didn’t even have to provide textures & sounds. It made them by itself

The big catch with Anthropic has always been price. They are relatively expensive once you are outside of your subscription.

0.005 Seconds: It is a meaningful improvement over 4. It is better at coding tasks than Opus 4.1, but Anthropic’s strategy to refuse to cost down means that they are off the Pareto Frontier by a significant amount. Every use outside of a Claude Max subscription is economically untenable.

Is it 50% better than GPT5-Codex? Is it TEN TIMES better than Grok Code Fast? No I think it’s going to get mogged by Gemini 3 performance pretty substantially. I await Opus. Claude Code 2.0 is really good though.

GPT5 Codex is 33% cheaper and at a minimum as good but most agree better.

If you are 10% better at twice the price, you are still on the frontier, so long as no model is both at least as cheap and at least as good as you are, for a given task. So this is a disagreement about whether Codex is clearly better, which is not the consensus. The consensus, such as it is and it will evolve rapidly, is that Sonnet 4.5 is a better general driver, but that Codex and GPT-5 are better at sufficiently tricky problems.

I think a lot of this comes down to a common mistake, which is over indexing on price.

When it comes to coding, cost mostly doesn’t matter, whereas quality is everything and speed kills. The cost of your time architecting, choosing and supervising, and the value of getting it done right and done faster, is going to vastly exceed your API bill under normal circumstances. What is this ‘economically untenable’? And have you properly factored speed into your equation?

Obviously if you are throwing lots of parallel agents at various problems on a 24/7 basis, especially hitting the retry button a lot or otherwise not looking to have it work smarter, the cost can add up to where it matters, but thinking about ‘coding progress per dollar’ is mostly a big mistake.
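To put rough numbers on that (all of the figures below are illustrative assumptions, not measurements):

```python
# Back-of-the-envelope: every number below is an illustrative assumption.
dev_rate = 100.0          # $/hour for a developer's time
hours_saved_per_day = 0.5 # a faster, better model saving 30 min/day

tokens_in, tokens_out = 2_000_000, 400_000   # assumed daily usage
price_in, price_out = 3.0, 15.0              # $/M tokens (Sonnet 4.5 pricing)

api_cost = tokens_in / 1e6 * price_in + tokens_out / 1e6 * price_out
time_value = dev_rate * hours_saved_per_day

print(f"Daily API bill:      ${api_cost:.2f}")    # ~$12
print(f"Value of time saved: ${time_value:.2f}")  # ~$50
# Even a 2x price difference between models is small next to the value
# of the supervising developer's time.
```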

Anthropic charges a premium, but given they reliably sell out of compute they have historically either priced correctly or actively undercharged. The mistake is not scaling up available compute faster, since doing so should be profitable while also growing market share. I worry that Anthropic and Amazon are being insufficiently aggressive with investment in Anthropic’s compute.

No Stream: low n with a handful of hours of primarily coding use, but:

– noticeably faster with coding, seems to provide slightly better code, and gets stuck less (only tested in Claude Code)

– not wildly smarter than 4 Sonnet or 4.1 Opus; probably less smart than GPT-5-high but more pleasant to talk to. (GPT-5 likes to make everything excessively technical and complicated.)

– noticeably less Claude-y than 4 Sonnet; less enthusiastic/excited/optimistic/curious. this brings it closer to GPT-5 and is a bummer for me

– I still find that the Claude family of models write more pythonic and clean code than GPT-5, although they perform worse for highly technical ML/AI code. Claude feels more like a pair programmer; GPT-5 feels more like a robot.

– In my limited vibe evals outside of coding, it doesn’t feel obviously smarter than 4 Sonnet / 4.1 Opus and is probably less smart than GPT-5. I’ll still use it over GPT-5 for some use cases where I don’t need absolute maximum intelligence.

As I note at the top, it’s early but in non-coding tasks I do sense that in terms of ‘raw smarts’ GPT-5 (Thinking and Pro) have the edge, although I’d rather talk to Sonnet 4.5 if that isn’t a factor.

Gemini 3 is probably going to be very good, but that’s a problem for future Earth.

This report from Papaya is odd given Anthropic is emphasizing agentic tasks and there are many other positive reports about it on that.

Papaya: no personal opinion yet, but some private evals i know of show it’s slightly worse than GPT-5 in an agentic harness, and 3-4 times more expensive on the same tasks.

In many non-coding tasks, Sonnet 4.5 is not obviously better than Opus 4.1, especially if you are discounting speed and price.

Tess points to a particular coherence failure inside a bullet point. It bothered Tess a lot, which follows the pattern where often we get really bothered by a mistake ‘that a human would never make,’ classically leading to the Full Colin Fraser (e.g. ‘it’s dumb’) whereas sometimes with an AI that’s just quirky.

(Note that the actual Colin Fraser didn’t comment on Sonnet 4.5 AFAICT, he’s focused right now on showing why Sora is dumb, which is way more fun so no notes.)

George Vengrovski: opus 4.1 superior in scientific writing by a long shot.

Koos: Far better than Opus 4.1 at programming, but still less… not intelligent, more less able to detect subtext or reason beyond the surface level

So far only Opus has passed my private little Ems Benchmark, where a diplomatic-sounding insult is correctly interpreted as such

Janus reports something interesting, especially given how fast this happened: anti-sycophancy upgrades confirmed. This is what I want to see.

Janus: I have seen a lot of people who seem like they have poor epistemics and think too highly of their grand theories and frameworks butthurt, often angry about Sonnet 4.5 not buying their stuff.

Yes. It’s overly paranoid and compulsively skeptical. But not being able to surmount this friction (and indeed make it generative) through patient communication and empathy seems like a red flag to me. If you’re in this situation I would guess you also don’t have success with making other humans see your way.

Like you guys could never have handled Sydney lol.

Those who are complaining about this? Good. Do some self-reflection. Do better.

At least one common failure mode (at least to me) shows signs of still being common.

Fran: It’s ok. I probably couldn’t tell if I was using Sonnet 4 or Sonnet 4.5. It’s still saying “you are absolutely right” all the time which is disappointing.

I get why that line happens on multiple levels but please make it go away (except when actually deserved) without having to include defenses in custom instructions.

A follow-up on Sonnet 4.5 appearing emotionless during alignment testing:

Janus: I wonder how much of the “Sonnet 4.5 expresses no emotions and personality for some reason” that Anthropic reports is also because it is aware it is being tested at all times and that kills the mood.

Janus: Yup. It’s probably this. The model is intensely emotional and expressive around people it trusts. More than any other Sonnets in a lot of ways.

This should strengthen the presumption that the alignment testing is not a great prediction of how Sonnet 4.5 will behave in the wild. That doesn’t mean the ‘wild’ version will be worse, here it seems likely the wild version is better. But you can’t count on that.

I wonder how this relates to Kaj Sotala seeing Sonnet 4.5 be concerned about fictional characters, which Kaj hadn’t seen before, although Janus reports having seen adjacent behaviors from Sonnet 4 and Opus 4.

One can worry that this will interfere with creativity, but when I look at the details here I expect this not to be a problem. It’s fine to flag things and I don’t sense the model is anywhere near refusals.

Concern for fictional characters, even when we know they are fictional, is a common thing humans do, and tends to be a positive sign. There is however danger that this can get taken too far. If you expand your ‘circle of concern’ to include things it shouldn’t in too complete a fashion, then you can have valuable concerns being sacrificed for non-valuable concerns.

In the extreme (as a toy example), if an AI assigned value to fictional characters that could trade off against real people, then what happens when it does the math and decides that writing about fictional characters is the most efficient source of value? You may think this is some bizarre hypothetical, but it isn’t. People have absolutely made big sacrifices, including of their own and others’ lives, for abstract concepts.

The personality impressions from people in my circles seem mostly positive.

David Dabney: Personality! It has that spark, genuineness and sense of perspective I remember from opus 3 and sonnet 3.5. I found myself imagining a person on the other end.

Like, when you talk to a person you know there’s more to them than their words, words are like a keyhole through which you see each other. Some ai outputs feel like there are just the words and nothing on the other side of the keyhole, but this (brief) interaction felt different.

Some of its output felt filtered and a bit strained, but it was insightful at other times. In retrospect I most enjoyed reading its reasoning traces (extended thinking), perhaps because they seemed the most genuine.

Vincent Favilla: It feels like a much more curious model than anything else I’ve used. It asks questions, lots of them, to help it understand the problem better, not just to maximize engagement. Seems more capable of course-correction based on this, too.

But as always there are exceptions, which may be linked to the anti-sycophancy changes referenced above, perhaps?

Hiveism: Smarter but less fun to work with. Previously it tried to engage with the user on an equal level. Now it thinks it knows better (it doesn’t btw). This way Anthropic is losing the main selling point – the personality. If I would want something like 4.5, I’d talk to gemini instead.

It feels like a shift along the Pareto front. Better optimized for the particular use case of coding, but doesn’t translate well to other aspects of intelligence, and losing something that’s hard to pin down. Overall, not sure if it is an improvement.

I have yet to see an interaction where it thought it knew better. Draw your own conclusions from that.

I don’t tend to want to do interactions that invoke more personality, but I get the sense that I would enjoy them more with Sonnet 4.5 than with other recent models, if I was in the mood for such a thing.

I find Mimi’s suspicion here plausible, if you are starting to run up against the context window limits, which I’ve never done except with massive documents.

Mimi: imo the context length awareness and mental health safety training have given it the vibe of a therapist unskillfully trying to tie a bow around messy emotions in the last 5 minutes of a session.

Here’s a different kind of exploration.

Caratall: Its personality felt distinct and yet still very Sonnet.

Sonnet 4.5 came out just as I was testing other models, so I had to take a look at how it performed too. Here’s everything interesting I noticed about it under my personal haiku-kigo benchmark.

By far the most consistent theme in all of Sonnet 4.5’s generations was an emphasis on revision. Out of 10 generations, all 10 were in some way related to revisions/refinements/recalibrations.

Why this focus? It’s likely an artifact of the Sonnet 4.5 system prompt, which is nearly 13,000 words long, and which is 75% dedicated to tool-call and iterated coding instructions.

In its generations, it also favoured Autumn generations. Autumn here is “the season of clarity, dry air, earlier dusk, deadlines,” which legitimizes revision, tuning, shelving, lamps, and thresholds — Sonnet 4.5’s favoured subjects.

All of this taken together paints the picture of a quiet, hard-working model, constantly revising and updating into the wee hours. Alone, toiling in the background, it seeks to improve and refine…but to what end?

Remember that it takes a while before we know what a model is capable of and its strengths and weaknesses. It is common to either greatly overestimate or underestimate new releases, and also to develop over time nuanced understanding of how to get the best results from a given model, and when to use it or not use it.

There’s no question Sonnet 4.5 is worth a tryout across a variety of tasks. Whether or not it should now be your weapon of choice? That depends on what you find, and also why you want a weapon.

UK once again demands backdoor to Apple’s encrypted cloud storage

Caroline Wilson Palow, legal director of the campaign group Privacy International, said the new order might be “just as big a threat to worldwide security and privacy” as the old one.

She said: “If Apple breaks end-to-end encryption for the UK, it breaks it for everyone. The resulting vulnerability can be exploited by hostile states, criminals, and other bad actors the world over.”

Apple made a complaint to the Investigatory Powers Tribunal over the original demand, backed by a parallel legal challenge from Privacy International and Liberty, another campaign group. That case was due to be heard early next year, but the new order may restart the legal process.

TCNs, or technical capability notices, are issued under the UK Investigatory Powers Act, which the government maintains is needed by law enforcement to investigate terrorism and child sexual abuse.

Key figures in Donald Trump’s administration, including vice-president JD Vance and director of national intelligence Tulsi Gabbard, had pressured the UK to retract the January TCN. President Donald Trump has likened the UK’s request to Chinese state surveillance.

In August, Gabbard told the Financial Times that the UK had “agreed to drop” its demand that Apple enable access to “the protected encrypted data of American citizens.”

A person close to the Trump administration said at the time that the request for Apple to break its encryption would have to be dropped altogether to be faithful to the agreement between the two countries. Any back door would weaken protections for US citizens, the person said.

UK Prime Minister Sir Keir Starmer last month hosted Trump for a state visit, during which the two world leaders announced that US tech companies would invest billions of dollars to build artificial intelligence infrastructure in Britain.

Members of the US delegation raised the issue of the request to Apple around the time of Trump’s visit, according to two people briefed on the matter. However, two senior British government figures said the US administration was no longer leaning on the UK government to rescind the order.

© 2025 The Financial Times Ltd. All rights reserved. Not to be redistributed, copied, or modified in any way.

Hands-on with Fallout 76’s next expansion: Yep, it has Walton Goggins


TV tie-ins aside, it’s the combat tweaks over the past year that really matter.

There aren’t a lot of games set in Ohio, but here we are. Credit: Bethesda

Bethesda provided flights from Chicago to New York City so that Ars could participate in the preview opportunity for Fallout 76: Burning Springs. Ars does not accept paid editorial content.

Like anybody, I have a few controversial gaming opinions and tastes. One of the most controversial is that Fallout 76—the multiplayer take on Bethesda’s rethink of a beloved ’90s open-world computer roleplaying game—has been my favorite online multiplayer game since its launch.

As much as I like the game, though, I’ve been surprised that it has actually grown over the past seven years. I’m not saying it’s seen a full, No Man’s Sky-like redemption story, though. It’s still not for everyone, and in some ways, it has fallen behind the times since 2018.

Nevertheless, the success of the streaming TV show based on the game franchise has attracted new players and given the developers a chance to seize the moment and attempt to complete a partial redemption story. To help make that happen, the game’s developers will soon release an expansion fully capitalizing on that TV series for the first time, and I got to spend a few hours playing that update to see if it’s any fun.

That said, don’t get distracted by the shiny TV tie-in. The important work is a lot less flashy: combat overhauls, bug fixes, balance updates, quality-of-life improvements, and technological tweaks—all of which have been added to the game over time. Ultimately, that little stuff adds up to be more impactful than the big stuff for players.

With that in mind, let’s take a quick look at where things stand based on my seven years of regularly playing the game and a few hours with the next major expansion.

Months of combat and game balance overhauls

You probably already know that the game originally launched without NPCs or the kinds of story- and character-driven quests most people expect from Fallout and that those things were added to the game in 2020, with more similar additions in the years since.

You could make a case that the original, NPC-free vision made sense for a certain kind of player, but that’s not the kind of player who tends to like Fallout games. Bethesda clearly pictured a Rust-like, emergent social PvP (player vs. player) situation when the game first came out. By now, though, PvP is almost completely absent from the game, and story-based quests loaded with NPCs are plentiful.

It still wasn’t enough for some players. There were several small frustrations about gameplay balance, as some folks felt that combat wasn’t always as fun as it could be and that the viable character builds in the endgame were too narrow.

Through a series of many patches over just this past year, Bethesda has been making significant changes to that aspect of the game. Go to Reddit and you’ll see that some players have gripes—mainly because the changes nerfed some uber-powerful endgame builds and weapons to level the playing field. (Also, some recent changes to VATS are admittedly a double-edged sword, depending on your philosophy about what role it should play in the game.)

You’ll definitely engage in some combat in this Deathclaw junkyard battle arena. Credit: Bethesda

As someone who has been playing almost nonstop this whole time, though, I think the designers have done a great job of making more play styles viable while just generally making the game feel better to play. They also totally overhauled how the base-building system works. That’s the sort of stuff that is hard to convey in a marketing blitz, but you feel it when you’re playing.

I won’t get into every detail about it here since most people reading this probably haven’t played the game enough to warrant that, but you can look at the patch notes—it’s a lot.

But I want to point that out up front because I think it’s more important than anything in the actual expansion the developer and publisher are hyping up. The game is just generally more fun to play than it used to be—even a year ago. You love to see it.

Technically, it’s a mixed bag

Earlier, I mentioned that the game has fallen behind the times in many ways. I’m mostly talking about its technical presentation and the lack of modern features players now expect from big-budget, cross-platform multiplayer games.

The assets are great, the art direction is top-notch, and the world is dense and attractive, but there are some now-standard AAA boxes it doesn’t check. A full redemption story requires addressing at least some of these things to keep the game up to modern standards.

By and large, the game’s environments look great on PC. Consoles are a bit behind. Credit: Bethesda

First up, the game has no executable for modern consoles; the Xbox Series X|S and PlayStation 5/5 Pro consoles seem to run the last-gen Xbox One X and PlayStation 4 Pro versions, respectively, just with the framerate cap (thankfully) raised from 30 fps to 60 fps.

But there’s good news on that front: I spoke with development team members who confirmed that current-gen console versions are coming soon, though they didn’t specify what kinds of upgrades we can expect.

I hope that also means a rethought approach to how the game displays on HDR (high dynamic range) TVs. To this day, HDR does not work like you’d expect; the game looks washed out on an OLED TV in particular, and there are none of the industry-standard HDR calibration sliders to fix it. HDR also didn’t work properly in Starfield at launch (it got partially addressed about a year later), and it is completely absent from the otherwise gorgeous-to-behold The Elder Scrolls IV: Oblivion remaster that came out just this year. I don’t know what the deal is with Bethesda Game Studios and HDR, but I hope they figure it out by the time The Elder Scrolls VI hits.

I also asked the Fallout 76 team about cross-play and cross-progression—the ability to play with friends on different platforms (or to at least access the same character across platforms). These features are likely nontrivial to implement, and they weren’t standard in 2018. They’re increasingly expected for big-budget, AAA multiplayer games today, though.

Unfortunately, the Bethesda devs I spoke to didn’t have any plans to share on that front. Still, it’s good to hear that the company still supports this game enough to at least launch modern console versions—and to continue adding major content updates.

OK, we can talk about the TV show update now

Speaking of major content updates, Bethesda is planning a big release called Burning Springs this December. It marks the second significant map expansion. Whereas the first expanded from the game’s West Virginia locales southward into Virginia’s Shenandoah National Park, this one pushes the map farther west, into the state of Ohio.

Ohio is a dust bowl now, it seems, so Fallout 76 will see its first desert locale. That’s an intentional choice, as the launch of this expansion will be timed closely to the release of season two of the TV show, and the show will be set in Nevada (specifically, around New Vegas). It obviously wouldn’t make sense to expand the game’s map all the way out to the western US, so this gives the developers a way to add a little season two flavor to Fallout 76.

As I was leaving my home to go to Bethesda’s gameplay preview event for Burning Springs, my wife joked that they should add Walton Goggins to the game as the ultimate tie-in with the show. Funny enough, that’s exactly what they’ve done. Goggins’ character from the show, The Ghoul, can be found in the new Burning Springs region, and he voices the character. This game is a prequel to the show by many, many years, but fortunately, Ghouls don’t age.

The Ghoul will give players repeatable bounty hunter missions of two types—one that you can handle solo and one that’s meant to be done as a public event with other players.

The Ghoul's ugly mug

Walton Goggins voices his character from the TV show in Fallout 76. That must have been expensive! Credit: Bethesda

I got to try both, and I found they were pretty fun, even though they don’t go too far in breaking the mold of Fallout 76’s existing public events.

I also spent more than two hours freely exploring the game’s post-apocalyptic interpretation of Ohio. Despite the new desert aesthetic, it’s all pretty familiar Fallout stuff: raider-infested Super Duper Marts, blown-out neighborhoods, and the like. There is a very large new settlement that has a distinct character compared to the game’s existing towns, and it’s loaded with NPCs. I also enjoyed a public event that has players battling through a junkyard with a cyborg Deathclaw at their side—yep, you read that right.

I’m told there will be a new story quest line attached to the new region that involves a highly intelligent Super Mutant named the Rust King. I didn’t get to do those quests during this demo, though.

Burning Springs doesn’t do anything to rethink Fallout 76’s basic experience; it’s just more of it, with a different flavor. But since Bethesda has done so much work making that basic experience more fun, that’s OK. It means more Fallout 76 is, in fact, more of a good thing.

TV tie-ins don’t fix a broken game, but they bring new or lapsed players back to a broken game that has since been fixed.

If you don’t like looter shooters, survival crafting games, or the very idea of multiplayer games—and some Fallout players just don’t—it’s not going to change your mind. But if the reason you skipped this game or bounced off of it was that you liked what it was going for but felt it stumbled on the execution, it can’t hurt to give it another try with the new update.

I don’t think that’s such a controversial opinion anymore. As a longtime player, it’s nice to be able to say that.

Photo of Samuel Axon

Samuel Axon is the editorial lead for tech and gaming coverage at Ars Technica. He covers AI, software development, gaming, entertainment, and mixed reality. He has been writing about gaming and technology for nearly two decades at Engadget, PC World, Mashable, Vice, Polygon, Wired, and others. He previously ran a marketing and PR agency in the gaming industry, led editorial for the TV network CBS, and worked on social media marketing strategy for Samsung Mobile at the creative agency SPCSHP. He also is an independent software and game developer for iOS, Windows, and other platforms, and he is a graduate of DePaul University, where he studied interactive media and software development.

Hands-on with Fallout 76’s next expansion: Yep, it has Walton Goggins Read More »

behind-the-scenes-with-the-most-beautiful-car-in-racing:-the-ferrari-499p

Behind the scenes with the most beautiful car in racing: The Ferrari 499P

Form doesn’t always quite follow function in racing. LMP1 died because it cost hundreds of millions of dollars to compete, so the Hypercar rules are designed to keep costs relatively sane. Once your car is designed, it gets homologated, and from then on, hardware changes are mostly limited to things that improve reliability but don’t affect lap times.

Ferrari made a few small changes between the 2023 and 2024 seasons. “After Le Mans 24, now the rest of the car is exactly the same, but we involve many, many parts of the car where we can make a difference because the car is homologated… But we work a lot in terms of engine control, for example, in terms of setup, because we discover a lot of new [things] in our car,” Coletta told me, then pointed to his competitors’ hardware changes as evidence that Ferrari got it right from the start.

Does the P stand for pretty?

The rules also hold back the worst impulses of the aerodynamicists. The ratio of lift to drag must be 4:1, with limits on absolute values. And that has freed up the stylists to create a visual link between their brand’s sports prototype and the cars they make for road use.
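For a concrete sense of what that cap means, here is a minimal sketch of the arithmetic, assuming a single fixed 4:1 figure (the real homologation rules define measured aerodynamic windows, and the force value below is invented purely for illustration):

# Simplified illustration of a capped downforce-to-drag ratio.
# The 4:1 figure comes from the text; the 4,000 N value is made up.
RATIO = 4.0  # downforce : drag

def implied_drag(downforce_newtons: float) -> float:
    """Drag that accompanies a given downforce at the capped ratio."""
    return downforce_newtons / RATIO

print(implied_drag(4000.0))  # -> 1000.0 N of drag

With every car pinned to roughly the same aerodynamic efficiency, there is little lap time to be found in wind-tunnel one-upmanship, which is what leaves the stylists room to work.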

A Ferrari 499P seen from behind, on track at COTA

The rear wing looks like it came from a superhero cartoon. (This is a compliment.) Credit: Ferrari

I’m not sure anyone has capitalized on that styling freedom better than Ferrari. Other Hypercars have a bad angle or two—even the Aston Martin Valkyrie looks a little strange head- or tail-on. Not the 499P, which dazzles, whether it’s painted Ferrari red or AF Corse yellow. At the front, the nose calls out current road cars like the hybrid SF90 or 296. The rear is pure drama, with three vertical wing elements framing a thin strip of brake light that runs the width of the car. Behind that? Curves up top, shadowy venturis underneath.

It looks best when you see it on track and moving. As it does, it shows you different aspects of its shape, revealing curves you hadn’t quite noticed before. Later, stationary in the garage with the bodywork off for servicing, the complex jumble of electronics and machinery looks like a steampunk nightmare. To me, at least—to the mechanics and engineers in red fire suits, it’s just another day at work, with almost as many team members capturing content with cameras and sound recorders.

Behind the scenes with the most beautiful car in racing: The Ferrari 499P Read More »

big-ai-firms-pump-money-into-world-models-as-llm-advances-slow

Big AI firms pump money into world models as LLM advances slow

Runway, a video generation start-up that has deals with Hollywood studios, including Lionsgate, launched a product last month that uses world models to create gaming settings, with personalized stories and characters generated in real time.

“Traditional video methods [are a] brute-force approach to pixel generation, where you’re trying to squeeze motion in a couple of frames to create the illusion of movement, but the model actually doesn’t really know or reason about what’s going on in that scene,” said Cristóbal Valenzuela, chief executive officer at Runway.

Previous video-generation models produced physics unlike those of the real world, he added, a shortcoming that general-purpose world model systems help address.

To build these models, companies need to collect a huge amount of physical data about the world.

San Francisco-based Niantic has mapped 10 million locations, gathering information through games including Pokémon Go, which has 30 million monthly players interacting with a global map.

Niantic ran Pokémon Go for nine years and, even after the game was sold to US-based Scopely in June, its players still contribute anonymized data through scans of public landmarks to help build its world model.

“We have a running start at the problem,” said John Hanke, chief executive of Niantic Spatial, as the company is now called following the Scopely deal.

Both Niantic and Nvidia are working on filling gaps by getting their world models to generate or predict environments. Nvidia’s Omniverse platform creates and runs such simulations, assisting the $4.3 trillion tech giant’s push toward robotics and building on its long history of simulating real-world environments in video games.

Nvidia Chief Executive Jensen Huang has asserted that the next major growth phase for the company will come with “physical AI,” with the new models revolutionizing the field of robotics.

Some, such as Meta’s LeCun, have said this vision of a new generation of AI systems powering machines with human-level intelligence could take 10 years to achieve.

But the potential scope of the cutting-edge technology is extensive, according to AI experts. World models “open up the opportunity to service all of these other industries and amplify the same thing that computers did for knowledge work,” said Nvidia’s Lebaredian.

Additional reporting by Melissa Heikkilä in London and Michael Acton in San Francisco.

© 2025 The Financial Times Ltd. All rights reserved. Not to be redistributed, copied, or modified in any way.

Big AI firms pump money into world models as LLM advances slow Read More »

zr1,-gtd,-and-america’s-new-nurburgring-war

ZR1, GTD, and America’s new Nürburgring war


Drive quickly and make a lot of horsepower.

Ford and Chevy set near-identical lap times with very different cars; we drove both.

Credit: Tim Stevens | Aurich Lawson

There’s a racetrack with a funny name in Germany that, in the eyes of many international enthusiasts, is the de facto benchmark for automotive performance. But the Nürburgring, a 13-mile (20 km) track often called the Green Hell, rarely hits the radar of mainstream US performance aficionados. That’s because American car companies rarely take the time to run cars there, and if they do, it’s in secrecy, to test pre-production machines cloaked in camouflage without publishing official times.

The track’s domestic profile has lately been on the rise, though. Late last year, Ford became the first American manufacturer to run a sub-7-minute lap: 6:57.685 from its ultra-high-performance Mustang GTD. It then did better, announcing a 6:52.072 lap time in May. Two months later, Chevrolet set a 6:49.275 lap time with the hybrid Corvette ZR1X, becoming the new fastest American car around that track.

It’s a vehicular war of escalation, but it’s about much more than bragging rights.

The Green Hell as a must-visit for manufacturers

The Nürburgring is a delightfully twisted stretch of purpose-built asphalt and concrete strewn across the hills of western Germany. It dates back to the 1920s and hosted the German Grand Prix for half a century before it was finally deemed too unsafe in the late 1970s.

It’s still a motorsports mecca, with sports car racing events like the 24 Hours of the Nürburgring drawing hundreds of thousands of spectators, but today, it’s better known as the ultimate automotive performance proving ground.

It offers an unmatched variety of high-speed corners, elevation changes, and differing surfaces that challenge the best engineers in the world. “If you can develop a car that goes fast on the Nürburgring, it’s going to be fast everywhere in the whole world,” said Brian Wallace, the Corvette ZR1’s vehicle dynamics engineer and the driver who set that car’s fast lap of 6:50.763.

“When you’re going after Nürburgring lap time, everything in the car has to be ten tenths,” said Greg Goodall, Ford’s chief program engineer for the Mustang GTD. “You can’t just use something that is OK or decent.”

Thankfully, neither of these cars is merely decent.

Mustang, deconstructed

You know the scene in Robocop where a schematic displays how little of Alex Murphy’s body remains inside that armor? Just enough of Peter Weller’s iconic jawline remains to identify the man, but the focus is clearly on the machine.

That’s a bit like how Multimatic creates the GTD, which retains just enough Mustang shape to look familiar, but little else.

Multimatic, which builds the wild Ford GT and also helms many of Ford’s motorsports efforts, starts with partially assembled Mustangs pulled from the assembly line, minus fenders, hood, and roof. Then the company guts what’s left in the middle.

Ford’s partner Multimatic cut as much of the existing road car chassis as it could for the GTD. Credit: Tim Stevens

“They cut out the second row seat area where our suspension is,” Ford’s Goodall said. “They cut out the rear floor in the trunk area because we put a flat plate on there to mount the transaxle to it. And then they cut the rear body side off and replace that with a wide-body carbon-fiber bit.”

A transaxle is simply a fun name for a rear-mounted transmission—in this case, an eight-speed dual-clutch unit mounted on the rear axle to help balance the car’s weight.

The GTD needs as much help as it can get to offset the heft of the 5.2-liter supercharged V8 up front. It gets a full set of carbon-fiber bodywork, too, but the resulting package still weighs over 4,300 lbs (1,950 kg).

With 815 hp (608 kW) and 664 lb-ft (900 Nm) of torque, it’s the most powerful road-going Mustang of all time, and it received other upgrades to match, including carbon-ceramic brake discs at the corners and the wing to end all wings slung off the back. It’s not only big; it’s smart, featuring a Formula One-style drag-reduction system.

At higher speeds, the wing’s element flips up, enabling a 202 mph (325 km/h) top speed. No surprise, that makes this the fastest factory Mustang ever. At a $325,000 starting price, it had better be, but when it comes to the maximum-velocity stakes, the Chevrolet is in another league.

More Corvette

You lose the frunk but gain cooling and downforce. Credit: Tim Stevens

On paper, when it comes to outright speed and value, the Chevrolet Corvette ZR1 seems to offer far more bang for what is still a significant number of bucks. To be specific, the ZR1 starts at about $175,000, which gets you a 1,064 hp (793 kW) car that will do 233 mph (375 km/h) if you point it down a road long enough.

Where the GTD is a thorough reimagining of what a Mustang can be, the ZR1 sticks closer to the Corvette script, offering more power, more aerodynamics, and more braking without any dramatic internal reconfiguration. That’s because it was all part of the car’s original mission plan, GM’s Brian Wallace told me.

“We knew we were going to build this car,” he said, “knowing it had the backbone to double the horsepower, put 20 percent more grip in the car, and oodles of aero.”

At the center of it all is a 5.5-liter twin-turbocharged V8. You can get a big wing here, too, but it isn’t active like the GTD’s.

Chevrolet engineers bolstered the internal structure at the back of the car to handle the extra downforce at the rear. Up front, the frunk is replaced by a duct through the hood, providing yet more grip to balance things. Big wheels, sticky tires, and carbon-ceramic brakes round out a package that looks a little less radical on the outside than the Mustang and substantially less retooled on the inside, but clearly no less capable.

The engine bay of a yellow Corvette ZR1.

A pair of turbochargers lurk behind that rear window. Credit: Tim Stevens

And if that’s not enough, Chevrolet has the 1,250 hp (932 kW), $208,000 ZR1X on offer, which adds the Corvette E-Ray’s hybrid system into the mix. That package does add more weight, but the result is still a roughly 4,000-lb (1,814 kg) car, hundreds of pounds lighter than the Ford.

’Ring battles

Ford and Chevy’s battle at the ‘ring blew up this summer, but both brands have tested there for years. Chevrolet has even set official lap times in the past, including the previous-generation Corvette Z06’s 7:22.68 in 2012. Despite that, a fast lap time was not in the initial plan for the new ZR1 and ZR1X. Drew Cattell, ZR1X vehicle dynamics engineer and the driver of that 6:49.275 lap, told me it “wasn’t an overriding priority” for the new Corvette.

But after developing the cars there so extensively, they decided to give it a go. “Seeing what the cars could do, it felt like the right time. That we had something we were proud of and we could really deliver with,” he said.

Ford, meanwhile, had never set an official lap time at the ‘ring, but it was part of the GTD’s raison d’être: “That was always a goal: to go under seven minutes. And some of it was to be the first American car ever to do it,” Ford’s Goodall said.

That required extracting every bit of performance, necessitating a last-minute change during final testing. In May of 2024, after the car’s design had been finalized by everyone up the chain of command at Ford, the test team in Germany determined the GTD needed a little more front grip.

To fix it, Steve Thompson, a dynamic technical specialist at Ford, designed a prototype aerodynamic extension to the vents in the hood. “It was 3D-printed, duct taped,” Goodall said. That design was refined and wound up on the production car, boosting frontal downforce on the GTD without adding drag.

Chevrolet’s development process relied not only on engineers in Germany but also on work in the US. “The team back home will keep on poring over the data while we go to sleep, because of the time difference,” Cattell said, “and then they’ll have something in our inbox the next morning to try out.”

When it was time for the Corvette’s record-setting runs, there wasn’t much left to change, just a few minor setup tweaks. “Maybe a millimeter or two,” Wallace said, “all within factory alignment settings.”

A few months later, it was my turn.

Behind the wheel

No, I wasn’t able to run either of these cars at the Nürburgring, but I was lucky enough to spend one day with both the GTD and the ZR1. First was the Corvette at one of America’s greatest racing venues: the Circuit of the Americas, a 3.5-mile track and host of the Formula One United States Grand Prix since 2012.

A head-on shot of a yellow Corvette ZR1.

How does 180 mph on the back straight at the Circuit of the Americas sound? Credit: Tim Stevens

I’ve been lucky to spend a lot of time in various Corvettes over the years, but none with performance like this. I was expecting a borderline terrifying experience, but I couldn’t have been more wrong. Despite its outrageous speed and acceleration, the ZR1 really is still a Corvette.

On just my second lap behind the wheel of the ZR1, I was doing 180 mph down the back straight and running a lap time close to the record set by a $1 million McLaren Senna a few years before. The Corvette is outrageously fast—and frankly exhausting to drive thanks to the monumental G forces—but it’s more encouraging than intimidating.

The GTD was more of a commitment. I sampled one at The Thermal Club near Palm Springs, California, a less illustrious but more technical track with tighter turns and closer walls separating them. That always amps up the pressure a bit, but the challenging layout really forced me to focus on extracting the most out of the Mustang at low and high speeds.

The GTD has a few tricks up its sleeve to help with that, including an advanced multi-height suspension that drops it by about 1.5 inches (4 cm) at the touch of a button, optimizing aerodynamic performance and lowering the car’s roll center.

A black Ford Mustang GTD in profile.

Heavier and less powerful than the Corvette, the Mustang GTD has astonishing levels of cornering grip. Credit: Tim Stevens

While road-going Mustangs typically focus on big power in a straight line, the GTD’s real skill is astonishing grip and handling. Remember, the GTD is only a few seconds slower on the ‘ring than the ZR1, despite weighing somewhere around 400 pounds (181 kg) more and giving up roughly 250 hp (186 kW).
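To put numbers on that, here is the power-to-weight arithmetic using only the figures quoted in this article; note that the ZR1’s curb weight below is inferred from the “around 400 pounds” claim above, not an official spec:

# Rough power-to-weight comparison from the article's own figures.
# The ZR1 curb weight is inferred (GTD minus ~400 lb), not official.
cars = {
    "Mustang GTD": {"weight_lb": 4300, "power_hp": 815},
    "Corvette ZR1": {"weight_lb": 4300 - 400, "power_hp": 1064},
}

for name, spec in cars.items():
    print(f"{name}: {spec['weight_lb'] / spec['power_hp']:.2f} lb per hp")

# Mustang GTD: 5.28 lb per hp
# Corvette ZR1: 3.67 lb per hp

On paper, the Corvette hauls around roughly a third less weight per horsepower, which makes the GTD’s near-identical lap time a testament to its chassis and aero rather than its engine.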

The biggest difference in feel between the two, though, is how they accelerate. The ZR1’s twin-turbocharged V8 delivers big power when you dip in the throttle and then just keeps piling on more and more as the revs increase. The supercharged V8 in the Mustang, on the other hand, is more like an instantaneous kick in the posterior. It’s ferocious.

Healthy competition

The ZR1 is brutally fast, yes, but it’s still remarkably composed, and it feels every bit as usable and refined as any of the other flavors of modern Corvette. The GTD, on the other hand, is a completely different breed than the base Mustang, every bit the purpose-built racer you’d expect from a race shop like Multimatic.

Chevrolet did the ZR1 and ZR1X development in-house. Cattell said that is a huge point of pride for the team. So, too, is setting those ZR1 and ZR1X lap times using General Motors’ development engineers. Ford turned to a pro race driver for its laps.

A racing driver stands in front of his car as mechanics and engineers celebrate in the background.

Ford factory racing driver Dirk Muller was responsible for setting the GTD’s time at the ‘ring. Credit: Giles Jenkyn Photography LTD/Ford

An engineer in a fire suit stands next to a yellow Corvette, parked on the Nürburgring.

GM vehicle dynamics engineer Drew Cattell set the ZR1X’s Nordschleife time. Credit: Chevrolet

That, though, was as close to a barb as I could get out of any engineer on either side of this new Nürburgring war. Both teams were extremely complimentary of each other.

“We’re pretty proud of that record. And I don’t say this in a snarky way, but we were first, and you can’t ever take away first,” Ford’s Goodall said. “Congratulations to them. We know better than anybody how hard of an accomplishment or how big of an accomplishment it is and how much effort goes into it.”

But he quickly added that Ford isn’t done. “You’re not a racer if you’re just going to take that lying down. So it took us approximately 30 seconds to align that we were ready to go back and do something about it,” he said.

In other words, this Nürburgring war is just beginning.

ZR1, GTD, and America’s new Nürburgring war Read More »

50+-scientific-societies-sign-letter-objecting-to-trump-executive-order

50+ scientific societies sign letter objecting to Trump executive order

Last month, the Trump administration issued an executive order asserting political control over grant funding, including all federally supported research. In general, the executive order inserts a layer of political control over both the announcement of new funding opportunities and the approval of individual grants. Now, a coalition of more than 50 scientific and medical organizations is firing back, issuing a letter to the US Congress expressing grave concerns over the order’s provisions and urging Congress to protect the integrity of what has long been an independent, merit-based, peer-review system for awarding federal grants.

As we previously reported, the order requires that any announcement of funding opportunities be reviewed by the head of the agency or someone they designate, which means a political appointee will have the ultimate say over what areas of science the US funds. Individual grants will also require clearance from a political appointee and “must, where applicable, demonstrably advance the President’s policy priorities.”

The order also instructs agencies to formalize the ability to cancel previously awarded grants at any time if they “no longer advance agency priorities.” Until a system is in place to enforce the new rules, agencies are forbidden from starting new funding programs.

In short, the new rules would mean that all federal science research would need to be approved by a political appointee who may have no expertise in the relevant areas, and the research can be canceled at any time if the political winds change. It would mark the end of a system that has enabled US scientific leadership for roughly 70 years.

50+ scientific societies sign letter objecting to Trump executive order Read More »

amazon-agrees-to-make-canceling-prime-easy,-will-refund-customers-$1.5b

Amazon agrees to make canceling Prime easy, will refund customers $1.5B

Amazon must also post prominent disclosures describing how auto-renewals and cancellations work, as well as offer “an easy way for consumers to cancel Prime, using the same method that consumers used to sign up.”

“The process cannot be difficult, costly, or time-consuming,” the FTC said.

Moving forward, Amazon must also pay for “an independent, third-party supervisor to monitor Amazon’s compliance” with the distribution of customer refunds.

Celebrating the victory after a 3–0 vote approving the settlement, FTC chairman Andrew Ferguson described Amazon’s $2.5 billion payout as a “record-breaking, monumental win for the millions of Americans who are tired of deceptive subscriptions that feel impossible to cancel.”

The press release cited internal documents in which Amazon executives and employees “knowingly discussed” how hard it was to cancel Prime, exchanging messages admitting that “subscription driving is a bit of a shady world” and suggesting that forcing unwanted subscriptions was “an unspoken cancer.”

“The evidence showed that Amazon used sophisticated subscription traps designed to manipulate consumers into enrolling in Prime and then made it exceedingly hard for consumers to end their subscription,” Ferguson said. “Today, we are putting billions of dollars back into Americans’ pockets and making sure Amazon never does this again.”

Amazon did not immediately respond to Ars’ request to comment.

Amazon agrees to make canceling Prime easy, will refund customers $1.5B Read More »

google-deepmind-unveils-its-first-“thinking”-robotics-ai

Google DeepMind unveils its first “thinking” robotics AI

Imagine that you want a robot to sort a pile of laundry into whites and colors. Gemini Robotics-ER 1.5 would process the request along with images of the physical environment (a pile of clothing). This AI can also call tools like Google search to gather more data. The ER model then generates natural language instructions, specific steps that the robot should follow to complete the given task.

The two new models work together to “think” about how to complete a task. Credit: Google

Gemini Robotics 1.5 (the action model) takes these instructions from the ER model and generates robot actions while using visual input to guide its movements. But it also goes through its own thinking process to consider how to approach each step. “There are all these kinds of intuitive thoughts that help [a person] guide this task, but robots don’t have this intuition,” said DeepMind’s Kanishka Rao. “One of the major advancements that we’ve made with 1.5 in the VLA is its ability to think before it acts.”
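To picture that division of labor, here is a minimal sketch of the planner/actor loop in Python. This is an assumption-laden illustration: the two roles come from DeepMind’s description above, but the function names, signatures, and stubbed outputs are invented for this example and are not DeepMind’s actual API:

# Hypothetical sketch of the planner/actor split described above.
# plan_task() stands in for the ER ("thinking") model; act() stands in
# for the vision-language-action model that drives the robot.
from dataclasses import dataclass

@dataclass
class Observation:
    camera_image: bytes  # what the robot currently sees

def plan_task(request: str, obs: Observation) -> list[str]:
    """ER model stand-in: reason about the scene (optionally calling
    tools like web search) and emit natural-language steps. Stubbed."""
    return [
        "Pick up one garment from the pile",
        "Decide whether it is white or colored",
        "Place it in the matching basket",
    ]

def act(step: str, obs: Observation) -> Observation:
    """Action-model stand-in: consider how to perform the step, then
    drive the robot using visual input. Stubbed here as a print."""
    print(f"Executing: {step}")
    return obs  # a real system would return a fresh camera observation

def run(request: str, obs: Observation) -> None:
    for step in plan_task(request, obs):  # plan once at a high level...
        obs = act(step, obs)              # ...then act step by step

run("Sort the laundry into whites and colors", Observation(camera_image=b""))

The design point is simply that planning and acting are separate models passing natural language between them, with the actor doing its own short-horizon reasoning on each step.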

Both of DeepMind’s new robotic AIs are built on the Gemini foundation models but have been fine-tuned with data that adapts them to operating in a physical space. This approach, the team says, gives robots the ability to undertake more complex multi-stage tasks, bringing agentic capabilities to robotics.

The DeepMind team tests Gemini robotics with a few different machines, like the two-armed Aloha 2 and the humanoid Apollo. In the past, AI researchers had to create customized models for each robot, but that’s no longer necessary. DeepMind says that Gemini Robotics 1.5 can learn across different embodiments, transferring skills learned from Aloha 2’s grippers to the more intricate hands on Apollo with no specialized tuning.

All this talk of physical agents powered by AI is fun, but we’re still a long way from a robot you can order to do your laundry. Gemini Robotics 1.5, the model that actually controls robots, is still only available to trusted testers. However, the thinking ER model is now rolling out in Google AI Studio, allowing developers to generate robotic instructions for their own physically embodied robotic experiments.

Google DeepMind unveils its first “thinking” robotics AI Read More »

taiwan-starts-weaponizing-chip-access-after-us-urged-it-to,-expert-says

Taiwan starts weaponizing chip access after US urged it to, expert says

Taiwan has begun evolving its trade strategy to start wielding its dominant position as a leading supplier of cutting-edge chips as a weapon, Bloomberg reported.

The move comes amid Donald Trump’s escalating global trade war and after years of Taiwan using its chip dominance as a shield against Chinese aggression, allying with the US to stave off China’s threats of invasion. Under the so-called “one-China principle,” China rejects Taiwan’s independence and requires its allies to sever ties with Taiwan.

On Tuesday, Taiwan announced that it would be limiting shipments of semiconductors into South Africa—among 47 restricted products—due to national security concerns. The rare export curbs could hit South Africa’s “electronics, telecom, and auto parts sectors” hard, MSN reported, if South Africa doesn’t meet with Taiwan to discuss better terms within the next 60 days.

As Bloomberg previously reported, Taiwan is upset that South Africa unilaterally moved to relocate Taiwan’s embassy from Pretoria to Johannesburg after meeting with China’s president, Xi Jinping, in 2023. A major ally of China, South Africa intensified that pressure in July, ahead of another meeting in November that Xi is expected to attend, in an attempt to signal that it was weakening ties with Taiwan, as China had demanded.

Taiwan’s Ministry of Foreign Affairs immediately protested South Africa’s efforts in July, accusing South Africa of suppressing Taiwan and promising countermeasures if South Africa refused to consult with Taiwan on the embassy relocation.

In a statement, South Africa’s foreign ministry spokesperson, Chrispin Phiri, insisted that South Africa’s ties with Taiwan are “non-political,” while noting that “South Africa is a critical supplier of platinum group metals, like palladium, essential to the global semiconductor industry,” Bloomberg reported.

On Wednesday, China’s foreign ministry spokesperson, Guo Jiakun, criticized Taiwan’s export curbs as “a deliberate move to destabilize global chip industrial and supply chains and counter the prevailing international commitment to the one-China principle by weaponizing chips.”

Taiwan starts weaponizing chip access after US urged it to, expert says Read More »