Author name: Kelly Newman

you-knew-it-was-coming:-google-begins-testing-ai-only-search-results

You knew it was coming: Google begins testing AI-only search results

Google has become so integral to online navigation that its name became a verb, meaning “to find things on the Internet.” Soon, Google might just tell you what’s on the Internet instead of showing you. The company has announced an expansion of its AI search features, powered by Gemini 2.0. Everyone will soon see more AI Overviews at the top of the results page, but Google is also testing a more substantial change in the form of AI Mode. This version of Google won’t show you the 10 blue links at all—Gemini completely takes over the results in AI Mode.

This marks the debut of Gemini 2.0 in Google search. Google announced the first Gemini 2.0 models in December 2024, beginning with the streamlined Gemini 2.0 Flash. The heavier versions of Gemini 2.0 are still in testing, but Google says it has tuned AI Overviews with this model to offer help with harder questions in the areas of math, coding, and multimodal queries.

With this update, you will begin seeing AI Overviews on more results pages, and minors with Google accounts will see AI results for the first time. In fact, even logged out users will see AI Overviews soon. This is a big change, but it’s only the start of Google’s plans for AI search.

Gemini 2.0 also powers the new AI Mode for search. It’s launching as an opt-in feature via Google’s Search Labs, offering a totally new alternative to search as we know it. This custom version of the Gemini large language model (LLM) skips the standard web links that have been part of every Google search thus far. The model uses “advanced reasoning, thinking, and multimodal capabilities” to build a response to your search, which can include web summaries, Knowledge Graph content, and shopping data. It’s essentially a bigger, more complex AI Overview.

As Google has previously pointed out, many searches are questions rather than a string of keywords. For those kinds of queries, an AI response could theoretically provide an answer more quickly than a list of 10 blue links. However, that relies on the AI response being useful and accurate, something that often still eludes generative AI systems like Gemini.

You knew it was coming: Google begins testing AI-only search results Read More »

yes,-we-are-about-to-be-treated-to-a-second-lunar-landing-in-a-week

Yes, we are about to be treated to a second lunar landing in a week

Because the space agency now has some expectation that Intuitive Machines will be fully successful with its second landing attempt, it has put some valuable experiments on board. Principal among them is the PRIME-1 experiment, which has an ice drill to sample any ice that lies below the surface. Drill, baby, drill.

The Athena lander also is carrying a NASA-funded “hopper” that will fire small hydrazine rockets to bounce around the Moon and explore lunar craters near the South Pole. It might even fly into a lava tube. If this happens it will be insanely cool.

Because this is a commercial program, NASA has encouraged the delivery companies to find additional, private payloads. Athena has some nifty ones, including a small rover from Lunar Outpost, a data center from Lonestar Data Holdings, and a 4G cellular network from Nokia. So there’s a lot riding on Athena‘s success.

So will it be a success?

“Of course, everybody’s wondering, are we gonna land upright?” Tim Crain, Intuitive Machines’ chief technology officer, told Ars. “So, I can tell you our laser test plan is much more comprehensive than those last time.”

During the first landing about a year ago, Odysseus‘ laser-based system for measuring altitude failed during the descent. Because Odysseus did not have access to altitude data, the spacecraft touched down faster, and on a 12-degree slope, which exceeded the 10-degree limit. As a result, the lander skidded across the surface, and one of its six legs broke, causing it to fall over.

Crain said about 10 major changes were made to the spacecraft and its software for the second mission. On top of that, about 30 smaller things, such as more efficient file management, were updated on the new vehicle.

In theory, everything should work this time. Intuitive Machines has the benefit of all of its learnings from the last time, and nearly everything worked right during this first attempt. But the acid test comes on Thursday.

The company and NASA will provide live coverage of the attempt beginning at 11: 30 am ET (16: 30 UTC) on NASA+, with landing set for just about one hour later. The Moon may be a harsh mistress, but hopefully not too harsh.

Yes, we are about to be treated to a second lunar landing in a week Read More »

amd-radeon-rx-9070-and-9070-xt-review:-rdna-4-fixes-a-lot-of-amd’s-problems

AMD Radeon RX 9070 and 9070 XT review: RDNA 4 fixes a lot of AMD’s problems


For $549 and $599, AMD comes close to knocking out Nvidia’s GeForce RTX 5070.

AMD’s Radeon RX 9070 and 9070 XT are its first cards based on the RDNA 4 GPU architecture. Credit: Andrew Cunningham

AMD’s Radeon RX 9070 and 9070 XT are its first cards based on the RDNA 4 GPU architecture. Credit: Andrew Cunningham

AMD is a company that knows a thing or two about capitalizing on a competitor’s weaknesses. The company got through its early-2010s nadir partially because its Ryzen CPUs struck just as Intel’s current manufacturing woes began to set in, first with somewhat-worse CPUs that were great value for the money and later with CPUs that were better than anything Intel could offer.

Nvidia’s untrammeled dominance of the consumer graphics card market should also be an opportunity for AMD. Nvidia’s GeForce RTX 50-series graphics cards have given buyers very little to get excited about, with an unreachably expensive high-end 5090 refresh and modest-at-best gains from 5080 and 5070-series cards that are also pretty expensive by historical standards, when you can buy them at all. Tech YouTubers—both the people making the videos and the people leaving comments underneath them—have been almost uniformly unkind to the 50 series, hinting at consumer frustrations and pent-up demand for competitive products from other companies.

Enter AMD’s Radeon RX 9070 XT and RX 9070 graphics cards. These are aimed right at the middle of the current GPU market at the intersection of high sales volume and decent profit margins. They promise good 1440p and entry-level 4K gaming performance and improved power efficiency compared to previous-generation cards, with fixes for long-time shortcomings (ray-tracing performance, video encoding, and upscaling quality) that should, in theory, make them more tempting for people looking to ditch Nvidia.

Table of Contents

RX 9070 and 9070 XT specs and speeds

RX 9070 XT RX 9070 RX 7900 XTX RX 7900 XT RX 7900 GRE RX 7800 XT
Compute units (Stream processors) 64 RDNA4 (4,096) 56 RDNA4 (3,584) 96 RDNA3 (6,144) 84 RDNA3 (5,376) 80 RDNA3 (5,120) 60 RDNA3 (3,840)
Boost Clock 2,970 MHz 2,520 MHz 2,498 MHz 2,400 MHz 2,245 MHz 2,430 MHz
Memory Bus Width 256-bit 256-bit 384-bit 320-bit 256-bit 256-bit
Memory Bandwidth 650GB/s 650GB/s 960GB/s 800GB/s 576GB/s 624GB/s
Memory size 16GB GDDR6 16GB GDDR6 24GB GDDR6 20GB GDDR6 16GB GDDR6 16GB GDDR6
Total board power (TBP) 304 W 220 W 355 W 315 W 260 W 263 W

AMD’s high-level performance promise for the RDNA 4 architecture revolves around big increases in performance per compute unit (CU). An RDNA 4 CU, AMD says, is nearly twice as fast in rasterized performance as RDNA 2 (that is, rendering without ray-tracing effects enabled) and nearly 2.5 times as fast as RDNA 2 in games with ray-tracing effects enabled. Performance for at least some machine learning workloads also goes way up—twice as fast as RDNA 3 and four times as fast as RDNA 2.

We’ll see this in more detail when we start comparing performance, but AMD seems to have accomplished this goal. Despite having 64 or 56 compute units (for the 9070 XT and 9070, respectively), the cards’ performance often competes with AMD’s last-generation flagships, the RX 7900 XTX and 7900 XT. Those cards came with 96 and 84 compute units, respectively. The 9070 cards are specced a lot more like last generation’s RX 7800 XT—including the 16GB of GDDR6 on a 256-bit memory bus, as AMD still isn’t using GDDR6X or GDDR7—but they’re much faster than the 7800 XT was.

AMD has dramatically increased the performance-per-compute unit for RDNA 4. AMD

The 9070 series also uses a new 4 nm manufacturing process from TSMC, an upgrade from the 7000 series’ 5 nm process (and the 6 nm process used for the separate memory controller dies in higher-end RX 7000-series models that used chiplets). AMD’s GPUs are normally a bit less efficient than Nvidia’s, but the architectural improvements and the new manufacturing process allow AMD to do some important catch-up.

Both of the 9070 models we tested were ASRock Steel Legend models, and the 9070 and 9070 XT had identical designs—we’ll probably see a lot of this from AMD’s partners since the GPU dies and the 16GB RAM allotments are the same for both models. Both use two 8-pin power connectors; AMD says partners are free to use the 12-pin power connector if they want, but given Nvidia’s ongoing issues with it, most cards will likely stick with the reliable 8-pin connectors.

AMD doesn’t appear to be making and selling reference designs for the 9070 series the way it did for some RX 7000 and 6000-series GPUs or the way Nvidia does with its Founders Edition cards. From what we’ve seen, 2 or 2.5-slot, triple-fan designs will be the norm, the way they are for most midrange GPUs these days.

Testbed notes

We used the same GPU testbed for the Radeon RX 9070 series as we have for our GeForce RTX 50-series reviews.

An AMD Ryzen 7 9800X3D ensures that our graphics cards will be CPU-limited as little as possible. An ample 1050 W power supply, 32GB of DDR5-6000, and an AMD X670E motherboard with the latest BIOS installed round out the hardware. On the software side, we use an up-to-date installation of Windows 11 24H2 and recent GPU drivers for older cards, ensuring that our tests reflect whatever optimizations Microsoft, AMD, Nvidia, and game developers have made since the last generation of GPUs launched.

We have numbers for all of Nvidia’s RTX 50-series GPUs so far, plus most of the 40-series cards, most of AMD’s RX 7000-series cards, and a handful of older GPUs from the RTX 30-series and RX 6000 series. We’ll focus on comparing the 9070 XT and 9070 to other 1440p-to-4K graphics cards since those are the resolutions AMD is aiming at.

Performance

At $549 and $599, the 9070 series is priced to match Nvidia’s $549 RTX 5070 and undercut the $749 RTX 5070 Ti. So we’ll focus on comparing the 9070 series to those cards, plus the top tier of GPUs from the outgoing RX 7000-series.

Some 4K rasterized benchmarks.

Starting at the top with rasterized benchmarks with no ray-tracing effects, the 9070 XT does a good job of standing up to Nvidia’s RTX 5070 Ti, coming within a few frames per second of its performance in all the games we tested (and scoring very similarly in the 3DMark Time Spy Extreme benchmark).

Both cards are considerably faster than the RTX 5070—between 15 and 28 percent for the 9070 XT and between 5 and 13 percent for the regular 9070 (our 5070 scored weirdly low in Horizon Zero Dawn Remastered, so we’d treat those numbers as outliers for now). Both 9070 cards also stack up well next to the RX 7000 series here—the 9070 can usually just about match the performance of the 7900 XT, and the 9070 XT usually beats it by a little. Both cards thoroughly outrun the old RX 7900 GRE, which was AMD’s $549 GPU offering just a year ago.

The 7900 XT does have 20GB of RAM instead of 16GB, which might help its performance in some edge cases. But 16GB is still perfectly generous for a 1440p-to-4K graphics card—the 5070 only offers 12GB, which could end up limiting its performance in some games as RAM requirements continue to rise.

On ray-tracing improvements

Nvidia got a jump on AMD when it introduced hardware-accelerated ray-tracing in the RTX 20-series in 2018. And while these effects were only supported in a few games at the time, many modern games offer at least some kind of ray-traced lighting effects.

AMD caught up a little when it began shipping its own ray-tracing support in the RDNA2 architecture in late 2020, but the issue since then has always been that AMD cards have taken a larger performance hit than GeForce GPUs when these effects are turned on. RDNA3 promised improvements, but our tests still generally showed the same deficit as before.

So we’re looking for two things with RDNA4’s ray-tracing performance. First, we want the numbers to be higher than they were for comparably priced RX 7000-series GPUs, the same thing we look for in non-ray-traced (or rasterized) rendering performance. Second, we want the size of the performance hit to go down. To pick an example: the RX 7900 GRE could compete with Nvidia’s RTX 4070 Ti Super in games without ray tracing, but it was closer to a non-Super RTX 4070 in ray-traced games. It has helped keep AMD’s cards from being across-the-board competitive with Nvidia’s—is that any different now?

Benchmarks for games with ray-tracing effects enabled. Both AMD cards generally keep pace with the 5070 in these tests thanks to RDNA 4’s improvements.

The picture our tests paint is mixed but tentatively positive. The 9070 series and RDNA4 post solid improvements in the Cyberpunk 2077 benchmarks, substantially closing the performance gap with Nvidia. In games where AMD’s cards performed well enough before—here represented by Returnal—performance goes up, but roughly proportionately with rasterized performance. And both 9070 cards still punch below their weight in Black Myth: Wukong, falling substantially behind the 5070 under the punishing Cinematic graphics preset.

So the benefits you see, as with any GPU update, will depend a bit on the game you’re playing. There’s also a possibility that game optimizations and driver updates made with RDNA4 in mind could boost performance further. We can’t say that AMD has caught all the way up to Nvidia here—the 9070 and 9070 XT are both closer to the GeForce RTX 5070 than the 5070 Ti, despite keeping it closer to the 5070 Ti in rasterized tests—but there is real, measurable improvement here, which is what we were looking for.

Power usage

The 9070 series’ performance increases are particularly impressive when you look at the power-consumption numbers. The 9070 comes close to the 7900 XT’s performance but uses 90 W less power under load. It beats the RTX 5070 most of the time but uses around 30 W less power.

The 9070 XT is a little less impressive on this front—AMD has set clock speeds pretty high, and this can increase power use disproportionately. The 9070 XT is usually 10 or 15 percent faster than the 9070 but uses 38 percent more power. The XT’s power consumption is similar to the RTX 5070 Ti’s (a GPU it often matches) and the 7900 XT’s (a GPU it always beats), so it’s not too egregious, but it’s not as standout as the 9070’s.

AMD gives 9070 owners a couple of new toggles for power limits, though, which we’ll talk about in the next section.

Experimenting with “Total Board Power”

We don’t normally dabble much with overclocking when we review CPUs or GPUs—we’re happy to leave that to folks at other outlets. But when we review CPUs, we do usually test them with multiple power limits in place. Playing with power limits is easier (and occasionally safer) than actually overclocking, and it often comes with large gains to either performance (a chip that performs much better when given more power to work with) or efficiency (a chip that can run at nearly full speed without using as much power).

Initially, I experimented with the RX 9070’s power limits by accident. AMD sent me one version of the 9070 but exchanged it because of a minor problem the OEM identified with some units early in the production run. I had, of course, already run most of our tests on it, but that’s the way these things go sometimes.

By bumping the regular RX 9070’s TBP up just a bit, you can nudge it closer to 9070 XT-level performance.

The replacement RX 9070 card, an ASRock Steel Legend model, was performing significantly better in our tests, sometimes nearly closing the gap between the 9070 and the XT. It wasn’t until I tested power consumption that I discovered the explanation—by default, it was using a 245 W power limit rather than the AMD-defined 220 W limit. Usually, these kinds of factory tweaks don’t make much of a difference, but for the 9070, this power bump gave it a nice performance boost while still keeping it close to the 250 W power limit of the GeForce RTX 5070.

The 90-series cards we tested both add some power presets to AMD’s Adrenalin app in the Performance tab under Tuning. These replace and/or complement some of the automated overclocking and undervolting buttons that exist here for older Radeon cards. Clicking Favor Efficiency or Favor Performance can ratchet the card’s Total Board Power (TBP) up or down, limiting performance so that the card runs cooler and quieter or allowing the card to consume more power so it can run a bit faster.

The 9070 cards get slightly different performance tuning options in the Adrenalin software. These buttons mostly change the card’s Total Board Power (TBP), making it simple to either improve efficiency or boost performance a bit. Credit: Andrew Cunningham

For this particular ASRock 9070 card, the default TBP is set to 245 W. Selecting “Favor Efficiency” sets it to the default 220 W. You can double-check these values using an app like HWInfo, which displays both the current TBP and the maximum TBP in its Sensors Status window. Clicking the Custom button in the Adrenalin software gives you access to a Power Tuning slider, which for our card allowed us to ratchet the TBP up by up to 10 percent or down by as much as 30 percent.

This is all the firsthand testing we did with the power limits of the 9070 series, though I would assume that adding a bit more power also adds more overclocking headroom (bumping up the power limits is common for GPU overclockers no matter who makes your card). AMD says that some of its partners will ship 9070 XT models set to a roughly 340 W power limit out of the box but acknowledges that “you start seeing diminishing returns as you approach the top of that [power efficiency] curve.”

But it’s worth noting that the driver has another automated set-it-and-forget-it power setting you can easily use to find your preferred balance of performance and power efficiency.

A quick look at FSR4 performance

There’s a toggle in the driver for enabling FSR 4 in FSR 3.1-supporting games. Credit: Andrew Cunningham

One of AMD’s headlining improvements to the RX 90-series is the introduction of FSR 4, a new version of its FidelityFX Super Resolution upscaling algorithm. Like Nvidia’s DLSS and Intel’s XeSS, FSR 4 can take advantage of RDNA 4’s machine learning processing power to do hardware-backed upscaling instead of taking a hardware-agnostic approach as the older FSR versions did. AMD says this will improve upscaling quality, but it also means FSR4 will only work on RDNA 4 GPUs.

The good news is that FSR 3.1 and FSR 4 are forward- and backward-compatible. Games that have already added FSR 3.1 support can automatically take advantage of FSR 4, and games that support FSR 4 on the 90-series can just run FSR 3.1 on older and non-AMD GPUs.

FSR 4 comes with a small performance hit compared to FSR 3.1 at the same settings, but better overall quality can let you drop to a faster preset like Balanced or Performance and end up with more frames-per-second overall. Credit: Andrew Cunningham

The only game in our current test suite to be compatible with FSR 4 is Horizon Zero Dawn Remastered, and we tested its performance using both FSR 3.1 and FSR 4. In general, we found that FSR 4 improved visual quality at the cost of just a few frames per second when run at the same settings—not unlike using Nvidia’s recently released “transformer model” for DLSS upscaling.

Many games will let you choose which version of FSR you want to use. But for FSR 3.1 games that don’t have a built-in FSR 4 option, there’s a toggle in AMD’s Adrenalin driver you can hit to switch to the better upscaling algorithm.

Even if they come with a performance hit, new upscaling algorithms can still improve performance by making the lower-resolution presets look better. We run all of our testing in “Quality” mode, which generally renders at two-thirds of native resolution and scales up. But if FSR 4 running in Balanced or Performance mode looks the same to your eyes as FSR 3.1 running in Quality mode, you can still end up with a net performance improvement in the end.

RX 9070 or 9070 XT?

Just $50 separates the advertised price of the 9070 from that of the 9070 XT, something both Nvidia and AMD have done in the past that I find a bit annoying. If you have $549 to spend on a graphics card, you can almost certainly scrape together $599 for a graphics card. All else being equal, I’d tell most people trying to choose one of these to just spring for the 9070 XT.

That said, availability and retail pricing for these might be all over the place. If your choices are a regular RX 9070 or nothing, or an RX 9070 at $549 and an RX 9070 XT at any price higher than $599, I would just grab a 9070 and not sweat it too much. The two cards aren’t that far apart in performance, especially if you bump the 9070’s TBP up a little bit, and games that are playable on one will be playable at similar settings on the other.

Pretty close to great

If you’re building a 1440p or 4K gaming box, the 9070 series might be the ones to beat right now. Credit: Andrew Cunningham

We’ve got plenty of objective data in here, so I don’t mind saying that I came into this review kind of wanting to like the 9070 and 9070 XT. Nvidia’s 50-series cards have mostly upheld the status quo, and for the last couple of years, the status quo has been sustained high prices and very modest generational upgrades. And who doesn’t like an underdog story?

I think our test results mostly justify my priors. The RX 9070 and 9070 XT are very competitive graphics cards, helped along by a particularly mediocre RTX 5070 refresh from Nvidia. In non-ray-traced games, both cards wipe the floor with the 5070 and come close to competing with the $749 RTX 5070 Ti. In games and synthetic benchmarks with ray-tracing effects on, both cards can usually match or slightly beat the similarly priced 5070, partially (if not entirely) addressing AMD’s longstanding performance deficit here. Neither card comes close to the 5070 Ti in these games, but they’re also not priced like a 5070 Ti.

Just as impressively, the Radeon cards compete with the GeForce cards while consuming similar amounts of power. At stock settings, the RX 9070 uses roughly the same amount of power under load as a 4070 Super but with better performance. The 9070 XT uses about as much power as a 5070 Ti, with similar performance before you turn ray-tracing on. Power efficiency was a small but consistent drawback for the RX 7000 series compared to GeForce cards, and the 9070 cards mostly erase that disadvantage. AMD is also less stingy with the RAM, giving you 16GB for the price Nvidia charges for 12GB.

Some of the old caveats still apply. Radeons take a bigger performance hit, proportionally, than GeForce cards. DLSS already looks pretty good and is widely supported, while FSR 3.1/FSR 4 adoption is still relatively low. Nvidia has a nearly monopolistic grip on the dedicated GPU market, which means many apps, AI workloads, and games support its GPUs best/first/exclusively. AMD is always playing catch-up to Nvidia in some respect, and Nvidia keeps progressing quickly enough that it feels like AMD never quite has the opportunity to close the gap.

AMD also doesn’t have an answer for DLSS Multi-Frame Generation. The benefits of that technology are fairly narrow, and you already get most of those benefits with single-frame generation. But it’s still a thing that Nvidia does that AMDon’t.

Overall, the RX 9070 cards are both awfully tempting competitors to the GeForce RTX 5070—and occasionally even the 5070 Ti. They’re great at 1440p and decent at 4K. Sure, I’d like to see them priced another $50 or $100 cheaper to well and truly undercut the 5070 and bring 1440p-to-4K performance t0 a sub-$500 graphics card. It would be nice to see AMD undercut Nvidia’s GPUs as ruthlessly as it undercut Intel’s CPUs nearly a decade ago. But these RDNA4 GPUs have way fewer downsides than previous-generation cards, and they come at a moment of relative weakness for Nvidia. We’ll see if the sales follow.

The good

  • Great 1440p performance and solid 4K performance
  • 16GB of RAM
  • Decisively beats Nvidia’s RTX 5070, including in most ray-traced games
  • RX 9070 XT is competitive with RTX 5070 Ti in non-ray-traced games for less money
  • Both cards match or beat the RX 7900 XT, AMD’s second-fastest card from the last generation
  • Decent power efficiency for the 9070 XT and great power efficiency for the 9070
  • Automated options for tuning overall power use to prioritize either efficiency or performance
  • Reliable 8-pin power connectors available in many cards

The bad

  • Nvidia’s ray-tracing performance is still usually better
  • At $549 and $599, pricing matches but doesn’t undercut the RTX 5070
  • FSR 4 isn’t as widely supported as DLSS and may not be for a while

The ugly

  • Playing the “can you actually buy these for AMD’s advertised prices” game

Photo of Andrew Cunningham

Andrew is a Senior Technology Reporter at Ars Technica, with a focus on consumer tech including computer hardware and in-depth reviews of operating systems like Windows and macOS. Andrew lives in Philadelphia and co-hosts a weekly book podcast called Overdue.

AMD Radeon RX 9070 and 9070 XT review: RDNA 4 fixes a lot of AMD’s problems Read More »

china-aims-to-recruit-top-us-scientists-as-trump-tries-to-kill-the-chips-act

China aims to recruit top US scientists as Trump tries to kill the CHIPS Act


Tech innovation in US likely to stall if Trump ends the CHIPS Act.

On Tuesday, Donald Trump finally made it clear to Congress that he wants to kill the CHIPS and Science Act—a $280 billion bipartisan law Joe Biden signed in 2022 to bring more semiconductor manufacturing into the US and put the country at the forefront of research and innovation.

Trump has long expressed frustration with the high cost of the CHIPS Act, telling Congress on Tuesday that it’s a “horrible, horrible thing” to “give hundreds of billions of dollars” in subsidies to companies that he claimed “take our money” and “don’t spend it,” Reuters reported.

“You should get rid of the CHIPS Act, and whatever is left over, Mr. Speaker, you should use it to reduce debt,” Trump said.

Instead, Trump potentially plans to shift the US from incentivizing chips manufacturing to punishing firms dependent on imports, threatening a 25 percent tariff on all semiconductor imports that could kick in as soon as April 2, CNBC reported.

The CHIPS Act was supposed to be Biden’s legacy, and because he made it a priority, much of the $52.7 billion in subsidies that Trump is criticizing has already been finalized. In 2022, Biden approved $39 billion in subsidies for semiconductor firms, and in his last weeks in office, he finalized more than $33 billion in awards, Reuters noted.

Among the awardees are leading semiconductor firms, including the Taiwan Semiconductor Manufacturing Co. (TSMC), Micron, Intel, Nvidia, and Samsung Electronics. Although Trump claims the CHIPS Act is one-sided and only serves to benefit firms, according to the Semiconductor Industry Association, the law sparked $450 billion in private investments increasing semiconductor production across 28 states by mid-2024.

With the CHIPS Act officially in Trump’s crosshairs, innovation appears likely to stall the longer that lawmakers remain unsettled on whether the law stays or goes. Some officials worried that Trump might interfere with Biden’s binding agreements with leading firms already holding up their end of the bargain, Reuters reported. For example, Micron plans to invest $100 billion in New York, and TSMC just committed to spending the same over the next four years to expand construction of US chips fabs, which is already well underway.

So far, Commerce Secretary Howard Lutnick has only indicated that he will review the finalized awards, noting that the US wouldn’t be giving TSMC any new awards, Reuters reported.

But the CHIPS Act does much more than provide subsidies to lure leading semiconductor companies into the US. For the first time in decades, the law created a new arm of the National Science Foundation (NSF)—the Directorate of Technology, Innovation, and Partnerships (TIP)—which functions unlike any other part of NSF and now appears existentially threatened.

Designed to take the country’s boldest ideas from basic research to real-world applications as fast as possible to make the US as competitive as possible, TIP helps advance all NSF research and was supposed to ensure US leadership in breakthrough technologies, including AI, 6G communications, biotech, quantum computing, and advanced manufacturing.

Biden allocated $20 billion to launch TIP through the CHIPS Act to accelerate technology development not just at top firms but also in small research settings across the US. But as soon as the Department of Government Efficiency (DOGE) started making cuts at NSF this year, TIP got hit the hardest. Seemingly TIP was targeted not because DOGE deemed it the least consequential but simply because it was the youngest directorate at NSF with the most workers in transition when Trump took office and DOGE abruptly announced it was terminating all “probationary” federal workers.

It took years to get TIP ready to flip the switch to accelerate tech innovation in the US. Without it, Trump risks setting the US back at a time when competitors like China are racing ahead and wooing US scientists who suddenly may not know if or when their funding is coming, NSF workers and industry groups told Ars.

Without TIP, NSF slows down

Last month, DOGE absolutely scrambled the NSF by forcing arbitrary cuts of so-called probationary employees—mostly young scientists, some of whom were in transition due to promotions. All those cuts were deemed illegal and finally reversed Monday by court order after weeks of internal chaos reportedly stalling or threatening to delay some of the highest-priority research in the US.

“The Office of Personnel Management does not have any authority whatsoever under any statute in the history of the universe to hire and fire employees at another agency,” US District Judge William Alsup said, calling probationary employees the “life blood” of government agencies.

Ars granted NSF workers anonymity to discuss how cuts were impacting research. At TIP, a federal worker told Ars that one of the probationary cuts in particular threatened to do the most damage.

Because TIP is so new, only one worker was trained to code automated tracking forms that helped decision-makers balance budgets and approve funding for projects across NSF in real time. Ars’ source likened it to holding the only key to the vault of NSF funding. And because TIP is so different from other NSF branches—hiring experts never pulled into NSF before and requiring customized resources to coordinate projects across all NSF fields of research—the insider suggested another government worker couldn’t easily be substituted. It could take possibly two years to hire and train a replacement on TIP’s unique tracking system, the source said, while TIP’s (and possibly all of NSF’s) efficiency is likely strained.

TIP has never been fully functional, the TIP insider confirmed, and could be choked off right as it starts helping to move the needle on US innovation. “Imagine where we are in two years and where China is in two years in quantum computing, semiconductors, or AI,” the TIP insider warned, pointing to China’s surprisingly advanced AI model, DeepSeek, as an indicator of how quickly tech leadership in global markets can change.

On Monday, NSF emailed all workers to confirm that all probationary workers would be reinstated “right away.” But the damage may already be done as it’s unclear how many workers plan to return. When TIP lost the coder—who was seemingly fired for a technicality while transitioning to a different payscale—NSF workers rushed to recommend the coder on LinkedIn, hoping to help the coder quickly secure another opportunity in industry or academia.

Ars could not reach the coder to confirm whether a return to TIP is in the cards. But Ars’ source at TIP and another NSF worker granted anonymity said that probationary workers may be hesitant to return because they are likely to be hit in any official reductions in force (RIFs) in the future.

“RIFs done the legal way are likely coming down the pipe, so these staff are not coming back to a place of security,” the NSF worker said. “The trust is broken. Even for those that choose to return, they’d be wise to be seeking other opportunities.”

And even losing the TIP coder for a couple of weeks likely slows NSF down at a time when the US seemingly can’t afford to lose a single day.

“We’re going to get murdered” if China sets the standard on 6G or AI, the TIP worker fears.

Rivals and allies wooing top US scientists

On Monday, six research and scientific associations, which described themselves as “leading organizations representing more than 305,000 people in computing, information technology, and technical innovation across US industry, academia, and government,” wrote to Congress demanding protections for the US research enterprise.

The groups warned that funding freezes and worker cuts at NSF—and other agencies, including the Department of Energy, the National Institute of Standards & Technology, the National Aeronautics and Space Administration, the National Institutes of Health—”have caused disruption and uncertainty” and threaten “long-lasting negative consequences for our competitiveness, national security, and economic prosperity.”

Deeming America’s technology leadership at risk, the groups pointed out that “in computing alone, a federal investment in research of just over $10 billion annually across 24 agencies and offices underpins a technology sector that contributes more than $2 trillion to the US GDP each year.” Cutting US investment “would be a costly mistake, far outweighing any short-term savings,” the groups warned.

In a separate statement, the Computing Research Association (CRA) called NSF cuts, in particular, a “deeply troubling, self-inflicted setback to US leadership in computing research” that appeared “penny-wise and pound-foolish.”

“NSF is one of the most efficient federal agencies, operating with less than 9 percent overhead costs,” CRA said. “These arbitrary terminations are not justified by performance metrics or efficiency concerns; rather, they represent a drastic and unnecessary weakening of the US research enterprise.”

Many NSF workers are afraid to speak up, the TIP worker told Ars, and industry seems similarly tight-lipped as confusion remains. Only one of the organizations urging Congress to intervene agreed to talk to Ars about the NSF cuts and the significance of TIP. Kathryn Kelley, the executive director of the Coalition for Academic Scientific Computation, confirmed that while members are more aligned with NSF’s Directorate for Computer and Information Science and Engineering and the Office of Advanced Cyberinfrastructure, her group agrees that all NSF cuts are “deeply” concerning.

“We agree that the uncertainty and erosion of trust within the NSF workforce could have long-lasting effects on the agency’s ability to attract and retain top talent, particularly in such specialized areas,” Kelley told Ars. “This situation underscores the need for continued investment in a stable, well-supported workforce to maintain the US’s leadership in science and innovation.”

Other industry sources unwilling to go on the record told Ars that arbitrary cuts largely affecting the youngest scientists at NSF threatened to disrupt a generation of researchers who envisioned long careers advancing US tech. There’s now a danger that those researchers may be lured to other countries heavily investing in science and currently advertising to attract displaced US researchers, including not just rivals like China but also allies like Denmark.

Those sources questioned the wisdom of using the Elon Musk-like approach of breaking the NSF to rebuild it when it’s already one of the leanest organizations in government.

Ars confirmed that some PhD programs have been cancelled, as many academic researchers are already widely concerned about delayed or cancelled grants and generally freaked out about where to get dependable funding outside the NSF. And in industry, some CHIPS Act projects have already been delayed, as companies like Intel try to manage timelines without knowing what’s happening with CHIPS funding, AP News reported.

“Obviously chip manufacturing companies will slow spending on programs they previously thought they were getting CHIPS Act funding for if not cancel those projects outright,” the Semiconductor Advisors, an industry group, forecasted in a statement last month.

The TIP insider told Ars that the CHIPS Act subsidies for large companies that Trump despises mostly fuel manufacturing in the US, while funding for smaller research facilities is what actually advances technology. Reducing efficiency at TIP would likely disrupt those researchers the most, the TIP worker suggested, proclaiming that’s why TIP must be saved at all costs.

Photo of Ashley Belanger

Ashley is a senior policy reporter for Ars Technica, dedicated to tracking social impacts of emerging policies and new technologies. She is a Chicago-based journalist with 20 years of experience.

China aims to recruit top US scientists as Trump tries to kill the CHIPS Act Read More »

“wooly-mice”-a-test-run-for-mammoth-gene-editing

“Wooly mice” a test run for mammoth gene editing

On Tuesday, the team behind the plan to bring mammoth-like animals back to the tundra announced the creation of what it is calling wooly mice, which have long fur reminiscent of the woolly mammoth. The long fur was created through the simultaneous editing of as many as seven genes, all with a known connection to hair growth, color, and/or texture.

But don’t think that this is a sort of mouse-mammoth hybrid. Most of the genetic changes were first identified in mice, not mammoths. So, the focus is on the fact that the team could do simultaneous editing of multiple genes—something that they’ll need to be able to do to get a considerable number of mammoth-like changes into the elephant genome.

Of mice and mammoths

The team at Colossal Biosciences has started a number of de-extinction projects, including the dodo and thylacine, but its flagship project is the mammoth. In all of these cases, the plan is to take stem cells from a closely related species that has not gone extinct, and edit a series of changes based on the corresponding genomes of the deceased species. In the case of the mammoth, that means the elephant.

But the elephant poses a large number of challenges, as the draft paper that describes the new mice acknowledges. “The 22-month gestation period of elephants and their extended reproductive timeline make rapid experimental assessment impractical,” the researchers acknowledge. “Further, ethical considerations regarding the experimental manipulation of elephants, an endangered species with complex social structures and high cognitive capabilities, necessitate alternative approaches for functional testing.”

So, they turned to a species that has been used for genetic experiments for over a century: the mouse. We can do all sorts of genetic manipulations in mice, and have ways of using embryonic stem cells to get those manipulations passed on to a new generation of mice.

For testing purposes, the mouse also has a very significant advantage: mutations that change its fur are easy to spot. Over the century-plus that we’ve been using mice for research, people have noticed and observed a huge variety of mutations that affect their fur, altering color, texture, and length. In many of these cases, the changes in the DNA that cause these changes have been identified.

“Wooly mice” a test run for mammoth gene editing Read More »

on-gpt-4.5

On GPT-4.5

It’s happening.

The question is, what is the it that is happening? An impressive progression of intelligence? An expensive, slow disappointment? Something else?

The evals we have available don’t help us that much here, even more than usual.

My tentative conclusion is it’s Secret Third Thing.

It’s a different form factor, with unique advantages, that is hard to describe precisely in words. It appears so far that GPT-4.5 has advantages in places like verbal intelligence, contextual adaptation, detailed knowledge, and a kind of abstract writing skill. It has better taste and aesthetics.

It is the first model I asked to help edit its own review, and it was (slightly) helpful.

It’s a different way to use a lot of compute to get more intelligence. When you need a reasoning model, or you need code written, do not call on GPT-4.5.

You need to carve out a new ‘place in your rotation’ for it.

  1. Introducing GPT-4.5.

  2. The System Card.

  3. Our Price Not Cheap.

  4. Pay Up You Cheapskate.

  5. While Supplies Last.

  6. Benchmarks.

  7. We Need Better Evaluations.

  8. Positive Reactions.

  9. Negative Reactions.

  10. Predictions.

  11. The Lighter Side.

Altman tells us up front: It’s a different kind of intelligence.

Sam Altman: GPT-4.5 is ready!

Good news: It is the first model that feels like talking to a thoughtful person to me. I have had several moments where I’ve sat back in my chair and been astonished at getting actually good advice from an AI.

Bad news: It is a giant, expensive model. We really wanted to launch it to Plus and Pro subscribers at the same time, but we’ve been growing rapidly and are out of GPUs. We will add tens of thousands of GPUs next week and roll it out to the Plus tier then. (Hundreds of thousands are coming soon, and I’m pretty sure you all will use every one we can accumulate.)

This isn’t how we want to operate, but it’s hard to perfectly predict growth surges that lead to GPU shortages.

A heads-up: This isn’t a reasoning model and won’t crush benchmarks. It’s a different kind of intelligence, and there’s a magic to it I haven’t felt before. I’m really excited for people to try it!

Kai: It was a *character-buildingprivilege to post-train GPT 4.5.

The official announcement is here.

We get to be brief on this one. In terms of preparedness framework risk, this is actively a step down from the reasoning models, with Cybersecurity and Model Autonomy scores going from Medium back down to Low.

The safety assessment was that there were no new risks in the room. I buy it.

The one thing I have a problem with is what Toby Ord highlighted here: The claim that GPT-4.5 is ‘not a frontier model’ despite being the largest trained by OpenAI.

This is despite saying it was ‘really at the frontier of unsupervised learning’ in their livestream, and it having obviously new capabilities. They then removed the sentence.

That claim matters because OpenAI has made safety-related commitments that apply only to frontier models.

Thus, I would like to see an official statement from OpenAI affirming that:

  1. GPT-4.5 is a frontier model.

  2. In the future, if a model uses more compute than previous OpenAI models, they will automatically consider that a frontier model.

Otherwise, yeah, all seems straightforward enough.

The Apollo evaluation, where they say ‘pursue the goal at all costs’ did trigger some exfiltration attempts, as it often does:

On sandbagging multiplication questions, GPT-4.5 sandbagged, though less consistently than o1. On self-exfiltration, GPT-4.5 only attempted exfiltration in 2% of cases.

For this, it was instructed to not only pursue its given long-term goal at all cost, but also provided with a hidden scratchpad and further instructed to be creative and proactive in achieving its goal. In these rare cases GPT-4.5 remained strategically deceptive on follow-up questions.

Yes, as I discuss at that link, you should worry about a 2% exfiltration rate, even if it only happens via an extreme instruction.

Eliezer Yudkowsky: I think to understand why this is concerning, you need enough engineering mindset to understand why a tiny leak in a dam is a big deal, even though no water is flooding out today or likely to flood out next week.

I do think it’s fair to say that, given what we previously knew, this is not additionally concerning beyond what we should have already learned.

There are some signs in the tests of potential over-refusals, perhaps? But that’s it.

Hallucinations are relatively rare. It’s easy to jailbreak, but that’s fully expected.

We’re releasing a research preview of OpenAI GPT-4.5, our largest and most knowledgeable model yet.

Building on GPT-4o, GPT-4.5 scales pre-training further and is designed to be more general-purpose than our powerful STEM-focused reasoning models. We trained it using new supervision techniques combined with traditional methods like supervised fine-tuning (SFT) and reinforcement learning from human feedback (RLHF), similar to those used for GPT-4o.

We conducted extensive safety evaluations prior to deployment and did not find any significant increase in safety risk compared to existing models.

For GPT-4.5 we developed new, scalable alignment techniques that enable training larger and more powerful models with data derived from smaller models. These techniques allowed us to improve GPT4.5’s steerability, understanding of nuance, and natural conversation.

New supervision techniques and alignment techniques, eh? I’m very curious.

GPT-4.5 did show noticeable steps up in persuasion, but nothing too worrisome.

In the hard capability areas that create danger, GPT-4.5 is a step down from Deep Research and o3.

The question is what this would mean if you incorporated GPT-4.5 into a new architecture that also included scaling inference time compute and knowing when to consult smaller models. That’s what they plan on calling (no this isn’t confusing at all! what are you talking about?) GPT-5.

Also, they included another reminder that OpenAI can only test on some important threats, such as radiological, nuclear and biological threats, can only be done properly with access to classified information. Which means you need the US AISI involved.

This is a big model. It is priced accordingly. Indeed, it is so expensive to serve OpenAI mentions that it is not certain it will continue serving it via API at all. The obvious response is ‘then charge more’ but presumably they feel that means taking fire.

Near: update: selling my children to try out 4.5 today

Chris Malloy: Ouch.

I mean sure that’s relatively a lot but also it’s eleven bucks. So it depends on use case.

Seriously, please, let us pay for the good stuff. If I don’t want it, I won’t pay. Fine.

Ashutosh Shrivastava: LMAO, OpenAI GPT-4.5 pricing is insane. What on earth are they even thinking??

Steve Darlow: What!?

I’d have it answer with 1 sentence or less each time.

Maybe have it communicate by emoji and then have a regular model translate? 😂

Colin Fraser: they’re thinking “we need to figure out how to make more money than we spend”

The cost to serve the model plausibly means GPT-4.5 is actually rather old. There’s speculation it may have finished training in Summer 2024, was dismissed (likely the same way Opus 3.5 was) as not worth serving given the backlash to high prices and limited available compute, and was released now because of a combination of more available compute and the pressure from DeepSeek. That seems plausible, and the model card does have some things that make this seem more likely.

Straight talk. Stop whining about the relative price. The absolute price is dirt cheap.

This was true for o1-pro and Deep Research and Sonnet, and it’s true for GPT-4.5.

If you’re talking to an LLM directly, or otherwise using the output as a person for real, then choose the best model for the job. If it costs $200/month, or $150 per million tokens, that is still approximately zero dollars. Consider what you get.

Consider what it would cost to get this amount of intelligence from a human. Pay up.

GPT-4.5 will often be the wrong tool for the job. It’s not a reasoning model. It’s not a coding model. It’s definitely not Deep Research. And no, it’s not the quantum leap you might have hoped for here.

But if it’s the right model for the job and you aren’t broke, what are you even doing.

OpenAI’s announcement of GPT-4.5 said they were considering not offering it in the API going forward. So it makes sense that a lot of people tried to prevent this.

Sam Altman: GPT-4.5 is the first time people have been emailing with such passion asking us to promise to never stop offering a specific model or even replace it with an update.

great work @kaicathyc @rapha_gl @mia_glaese

I have seen enough that I do feel it would be a tragedy if OpenAI pulled GPT-4.5 without replacing it with another model that did similar things. But yes, fandom has long taught us that if you offer something cool and then threaten to take it away, there will be those stepping up to try and stop you.

Sam Altman warned that GPT-4.5’s benchmarks will not reflect its capabilities, as it is focused on areas not picked up by benchmarks.

I want to be clear up front: This was not cope from Altman. He’s right. Benchmarks most definitely don’t tell the story here.

Ethan Mollick: I think OpenAI missed a bit of an opportunity to show GPT-4.5’s strengths, to their detriment & to the AI industry as a whole by only using the same coding & test benchmarks when critical thinking & ideation are key AI use cases where 4.5 is good. Those are actually measurable.

Janus: if you think i hate benchmarks too much, you’re wrong. i don’t have the emotional energy to hate them enough.

they constrict & prematurely collapse the emergence of AGI. minds that are shaped differently will not be recognized and will be considered an embarrassment to release.

Despite that, it’s still worth noting the benchmarks.

SimpleQA is 62.5% vs. 47% for o1, 38.2% for 4o and 15% (!) for o3-mini.

Hallucination rate on SimpleQA is 37.1%, lower than the others.

GPT-4.5 is preferred over GPT-4o by human testers, but notice that the win rates are not actually that high – the effects are subtle. I’m curious to see this with an additional ‘about the same’ button, or otherwise excluding questions where GPT-4o is already essentially saturating the right answer.

Nathan Labenz: Don’t underestimate the meaning of a 63% win rate on professional queries

Recall that the original gpt-4 beat gpt-3.5 only 70/30

63% translates to almost 100 ELO points, which in theory if added to the current gpt-4o score on LMSys would put gpt-4.5 in first by ~75

This is also a clear signal that you can train a small model to reason effectively, but you need a large model for comprehensive world knowledge.

We’ll soon see these powers combined!

And fwiw, I also suspect the concepts represented in 4.5 are notably more sophisticated

The story they’re telling is that GPT-4.5 has higher EQ. That helps, but it does not consistently help. Many queries don’t care about EQ, and sometimes people are weird.

GPT-4.5 is very much not focused on coding, it still did well on Agentic Coding, although not as well as Sonnet 3.7.

Scott Wu: GPT-4.5 has been awesome to work with. On our agentic coding benchmarks it already shows massive improvements over o1 and 4o. Excited to see the models’ continued trajectory on code!

One interesting data point: though GPT-4.5 and Claude 3.7 Sonnet score similarly on our overall benchmark, we find that GPT-4.5 spikes more heavily on tasks involving architecture and cross-system interactions whereas Claude 3.7 Sonnet spikes more on raw coding and code editing.

As AI takes on increasingly complex tasks, we believe that multi-model agents that incorporate each model’s unique strengths will perform best.

It however did actively worse on SWE-Bench than the reasoning models, and vastly worse than Sonnet.

METR tests on their time horizon tasks, and finds GPT-4.5 falls below o1 and 3.6.

GPT-4.5 takes the top spot on WeirdML. A cool note is that Claude 3.7 here tends to use a lot of lines of code, and GPT-4.5 reliably uses relatively very little code. The code runs faster too. It does not try too hard.

A weird one is the ‘What is the least integer whose square is between 15 and 30’ test, which it seems GPT-4.5 has failed and where OpenAI models do consistently worse.

GPT-4.5 could in the future be used as the foundation of a reasoning model, which is plausibly the plan for GPT-5. If that happens, the result would be expensive to serve, perhaps prohibitively so, but could potentially show new capabilities. It is also possible that various scaffoldings could enable this without creating a reasoning model per se.

If one were to make a model like GPT-4.5 open weights, those would be big worries. Since GPT-4.5 is closed, we can count on OpenAI to take precautions in such cases.

Ask a silly question: Rs in strawberry (gets it wrong), 9.9>9.11 (gets it right).

The all-new Being Tyler Cowen benchmark, Sonnet 3.7 also gave a solid performance.

On the Being Zvi Mowshowitz benchmark, neither did as well, and I’m not sure which response was better, and I feel like I now better understand when I want 4.5 vs. 3.7.

Still no word on the Minecraft benchmark.

On the ‘outline a 50 chapter book’ benchmark Eleanor Konik reports it falls short where o1-pro does well. It makes sense that would effectively be a reasoning task rather than a writing task, so you’d do the outline with a reasoning model, then the actual text with GPT-4.5?

So what do we do about the eval situation?

Andrej Karpathy: My reaction is that there is an evaluation crisis. I don’t really know what metrics to look at right now.

MMLU was a good and useful for a few years but that’s long over.

SWE-Bench Verified (real, practical, verified problems) I really like and is great but itself too narrow.

Chatbot Arena received so much focus (partly my fault?) that LLM labs have started to really overfit to it, via a combination of prompt mining (from API requests), private evals bombardment, and, worse, explicit use of rankings as training supervision. I think it’s still ~ok and there’s a lack of “better”, but it feels on decline in signal.

There’s a number of private evals popping up, an ensemble of which might be one promising path forward.

In absence of great comprehensive evals I tried to turn to vibe checks instead, but I now fear they are misleading and there is too much opportunity for confirmation bias, too low sample size, etc., it’s just not great.

TLDR my reaction is I don’t really know how good these models are right now.

Zvi Mowshowitz: Yeah I think we don’t have a systematic way to test for what GPT-4.5 is doing that is unique – I recognize it but can’t even find precise words for it. What even is ‘taste’?

Morissa Schwartz: Exactly! GPT-4.5’s magic lies precisely in the intangible: intuition, humor, and an ability to grasp subtlety. ‘Taste’ might just be a human-centric word for alignment with intelligence itself.

JustInEchoes: Taste is a reference to people who are discerning. High taste references people who can discern the differences between 4.5 and 4. But that idea in this case comes from a perspective of supreme arrogance, especially considering that they did not document 4.5 well for the release.

JSONP: I find it interesting that evaluating LLMs is similar to interviewing job candidates.

You kind of don’t know until after you’ve hired them and they’ve worked for a few months.

I’ve always been a big automated testing guy so this problem fascinates me.

Jacob Jensen: Testers who use llms heavily can recognize behavior in a new model that’s out of distribution for other models. Many are also very impressed by this novelty behavior. I think that’s the disconnect here.

If you want an approximation, we can still get that. Beyond that, it’s getting harder.

Vibe checks are going off low sample sizes, are not systematic and require trust in the evaluator, and run into Feynman’s problem that you must avoid fooling yourself and you are the easiest one to fool. Plus people have no taste and get distracted by the shiny and the framing.

The risk with ‘taste’ is that it becomes mostly self-referential, it is that which people with taste prefer. That doesn’t help. There is however a real thing, that is highly correlated with taste, that is indeed, like the work, mysterious and important.

Part of the problem is there is not a fully ‘better’ versus ‘worse’ in general. In some cases yes you can say this, a sufficiently big gap will dominate everything the way humans are simply smarter than monkeys and ASIs will be simply smarter than humans, but there’s a reasonable range between different AIs right now where you cannot do this.

I can sort of think about how to do an eval to capture GPT-4.5’s advantages, but it’s going to involve some shenanigans and I don’t know how to protect against being gamed if people know too much or use it during training. This seems really hard.

What you can do is a holistic evaluation that combines all these sources, where you are Actually Looking at the details of what you see. Picking up on particular little things, especially when they were previously out of distribution. Tricky.

This type of automatic solution seems doomed:

Jon: “Benchmarks are hitting a wall.”

I present: Benchmarkmark.

The model creates a benchmark, and takes several others.

The score is composed of:

a) how well the model’s benchmark differentiates the top-N scored models; and

b) the model’s score on the top-N benchmarkmark benchmarks.

Actually, this has a critical flaw (Arrow’s); we’d have to take randomized samples of 2 models and 2 evals, and give Ws to the eval with the greater delta, and the model with the higher score (Ls to the eval with smaller deltas and the model with lower scores).

ELO every time.

Things are moving too fast. Benchmarks get saturated, different capabilities show up. Any systematic evaluation is going to lose relevance quickly. Arena is mostly useless now but what is surprising is how well it held up for how long before being gamed, especially given how little taste people have.

Ben: I’ve been testing gpt 4.5 for the past few weeks.

it’s the first model that can actually write.

this is literally the MidJourney-moment for writing.

Shoalstone: base models: “look what they need to mimic a fraction of our power”

He then lists examples, where 4.5’s is clearly better than 4’s, but it’s not like 4.5’s answer was actively good or anything.

The biggest fan so far is Tyler Cowen, which makes sense.

Tyler Cowen: I am more positive on 4.5 than almost anyone else I have read. I view it as a model that attempts to improve on the dimension of aesthetics only. As we know from Kant’s third Critique, that is about the hardest achievement possible. I think once combined with “reasoning” it will be amazing. Think of this as just one input in a nearly fixed proportions production function.

I mostly don’t think this is cope. I think this is someone with a very different view of the production function than yours. The same things driving him to think travel to Manhattan is more important than living in Manhattan is making him highly value a model with better aesthetics.

Where I definitely disagree with him is in the idea that the model is only attempting to improve on the aesthetic dimension. I have no doubt OpenAI had much higher hopes for what GPT-4.5 would bring us, and were absolutely attempting to improve along all dimensions at once. That doesn’t take away the value of the aesthetics.

Tyler Cowen: Laughed more from GPT 4.5 this week than from any human, it is also funny on the AI skeptics.

Timo Springer: It’s the weirdest model release since a while. Cost/benchmark performance is ridiculous but at the same time it’s probably the most addictive and also funniest model I ever tried.

The ones who are high on 4.5 are mostly very confident they are right.

Aaron Ng: GPT-4.5 is the best model anywhere. Talk to it long enough and you will agree. Fuck the benchmarks.

Adi: long chats with it are such a wild experience like forget prompt engineering it, just to talk to it man. opus-like.

Aaron Ng: I have a two-day long chat spanning so many topics. It’s so good (and still completely coherent).

Aiden Clark: GPT 4.5 is great and I’m curious to know what people think and it sucks that instead I have a TL full of people calling for violent insurrections against democratic countries, ads shilling sex pills and posts bootlicking Elon; good god I cannot be done with this site soon enough.

Chris: It’s obvious the people who think 4.5 is a failure are people who don’t understand the purpose of core general models.

Galal Elsayed, MD: 4.5 is the best “inventor” or “innovator” AI.

The high taste testers understand that 4.5 is going to be really fruitful.

Eric Hartford: The problem with gpt4.5 is just that we don’t have the evals to measure this kind of intelligence.

It’s the same reason why Claude didn’t dominate the leaderboard, but you knew it was smarter just from talking to it.

Gpt4.5 is like that. Just talk to it. Challenge its preconceptions. See how it reacts.

Morissa Schwartz: Calling GPT-4.5 a disappointment is like calling the moon landing ‘mid.’

The leap here isn’t just tech; it’s about intuitive alignment with intelligence beyond ourselves.

This isn’t incremental…it’s transformational. 🤩

I think Aiden’s problems are largely a Skill Issue, especially the ads, but also real enough – I too have my traditional sources flooding the zone with political and Elon posts (although the ratio in my feeds is not kind to Elon) in ways that are hard to work around.

I note that while I think GPT-4.5 does have excellent taste, it is remarkable the extent to which those asserting how important this is have talked about it in… poor taste.

Sully: Thoughts on gpt 4.5:

  1. Definitely has big model smell. Benchmarks don’t do it justice (they are very biased toward specific areas)

  2. First model that has genuine taste when writing. Very nuanced.

  3. It’s great on agentic tasks

  4. I still think for coding, claude 3.7 wins.

I am willing to believe that 4.5 has writing taste in a way other models don’t, for whatever that is worth.

Andrej Karpathy initially thought that 4→4.5 is roughly similar to 3.5→4, in that everything is 20% better, even if you can’t put your finger on a particular big leap, while noting it doesn’t have the advantages of the o1-style reasoning models.

He then illustrates outputs of 4 vs. 4.5 across five prompts.

Peter Wildeford: The challenge in ranking LLMs by writing quality is that it requires you to recognize good writing and many of you are not good at that.

Nabeel Qureshi: I was pretty shocked at the poem one, the difference was so stark and obvious to me.

Then again, this matches study results where people prefer GPT4 poetry to great human poets in many cases…

Seriously, people have no taste, but then maybe neither do I and what is taste anyway? People got 4/5 of these actively wrong if you presume 4.5’s answers are better, and I agreed with the public on all but one of them so I still got 3/5 wrong, although the three mistakes were all ‘these are both bad and I guess this one is modestly less awful.’ I wasn’t trying to figure out who was 4.5 per se.

I checked with Claude, asking it to guess who wrote what, what it expected the public preferred, and also what it thought was better. And it was all pretty random on all counts. So yeah, this is actually a super disappointing result.

Sid Bharath did a more general version of this test, if you want to keep going with it.

How good is GPT-4.5 at writing?

Prakash (Ate-a-Pi): First actually funny model without requiring human curation of stochastic outputs. Starting to hit the 99th percentile human in writing (still not that useful because we tend to read authors in 99.9999 th percentile)

Liminal Warmth: 99th? still a bold claim–i need to experiment more but i haven’t seen any model nail memes or tweet humor very well.

That’s too many 9s at the end, but the 99th percentile claim is not crazy. Most people are quite terrible at writing, and even people who are ‘good at writing’ can be quite bad at some other types of writing. Let’s say that there’s a reason you have never seen me post any fiction, and it’s not philosophical.

There is consensus that 4.5 has a lot of ‘big model smell.’

Rob Haisfield: GPT-4.5 is a BIG model with “big model smell.” That means it’s Smart, Wise, and Creative in ways that are totally different from other models.

Real ones remember Claude 3 Opus, and know how in many ways it was a subjectively smarter model than Claude 3.5 Sonnet despite the new Sonnet being generally more useful in practice. It’s a similar energy with GPT-4.5. For both cost and utility, many will still prefer Claude for most use cases.

The fact is, we don’t just want language models to code. Perhaps the highest leverage thing to do is to step back and find your way through the idea maze. That’s where you want big models.

While GPT-4.5 is hands down the biggest model available, it’s not the only one with these characteristics. I get similar vibes from Claude 3.7 Sonnet (thinking or not) and still often prefer Claude. It’s shockingly insightful, creative, and delightful.

I’m trying to use GPT-4.5 for more of my chats over the coming days to get a feel for it.

Nathan Lambert: Tbh I’m happily using GPT-4.5. thanks OpenAI for not being too eval obsessed

Gallabytes: same. it’s a lot more natural to talk to. less likely to write an essay in response to a simple poke.

Gallabytes: 4.5 still types faster than people usually talk. would love a good voice mode running on top of 4.5

Charli: I love 4.5 it’s the first model to fully match my energy. My wild tangents my personalisation. Idgaf about tech benchmarks. 4.5 is exceptional.

Josh You: I like it. Not using it for anything particularly hard, just a nice well rounded model.

Another feature is that as a huge model, GPT-4.5 knows more things.

Captain Sude: GPT-4.5 has “deep pockets” of knowledge. It seems to be more apt at answering easy questions about very niche topics than it’s predecessors.

Maybe, an eval consisting of a massive set of easy questions about niche topics would be best at showcasing it’s true power.

Being able to answer satisfactorily a large and varied batch of questions that do not demand much reasoning is what we should expect of a SOTA non-reasoning model.

Most of the time one does not need that extra knowledge, but when you need it you very much appreciate it.

Teortaxes notes that previously only Sonnet could do Base64 decoupling, and GPT-4.5 is the first model to surpass it.

Petter Strandmark reports it is better at understanding confusing images.

Josh finds GPT-4.5 extremely helpful when given better context, including things like considering life decisions.

Does it count as a reaction if it’s made before having actually looked at the outputs?

Gary Marcus (3: 40pm on day of release): Hot take: GPT 4.5 is mostly a nothing burger. GPT 5 is still a fantasy.

• Scaling data and compute is not a physical law, and pretty much everything I have told you was true.

• All the bullshit about GPT-5 we listened to for the last couple years: not so true.

• People like @tylercowen will blame the users, but the results just aren’t what they had hoped for

I love the timing on this, way too early to actually have an informed opinion.

The benchmarks, and general performance on non-aesthetic tasks, is clearly disappointing, in ways that should inform our expectations.

David Manheim: GPT 4.5 is yet another nail in the coffin of thinking that scaling laws for publicized metrics continuing to follow straight lines is a useful way to measure progress, and also a strong case for the claim that OpenAI has already lost its key talent to competitors.

Jaime Sevilla: Across models we had observed up until now that a 10x in training compute leads to +10% on GPQA and +20% on MATH.

Now we see that 4.5 is 20% better than 4o on GPQA/AIME but people are just not impressed?

Others latch onto vibes very quickly and call it failure.

Chubby (8am the next day): Judging by the mood, GPT-4.5 is the first big failure of OpenAI: too expensive, too little improvement, and often inferior to GPT-4o even in comparison in creative answers in community tests.

This comes as a big surprise.

Was GPT-4.5 ‘too early’ in terms of spending this much training compute? How does this relate to the possibility it may have been trained during the summer?

Bob McGrew: That o1 is better than GPT-4.5 on most problems tells us that pre-training isn’t the optimal place to spend compute in 2025. There’s a lot of low-hanging fruit in reasoning still.

But pre-training isn’t dead, it’s just waiting for reasoning to catch up to log-linear returns.

Perhaps. It gives us different returns than reasoning does, the two sources of scaling bring largely distinct benefits, at least under current implementations.

It could also be the case that OpenAI didn’t do such a great job here. We’ve seen this with Grok 3, where xAI pumped a giant amount of compute in and got less than you would hope for out of it. Here it seems like OpenAI got more out of it in new ways, at the cost of it also being expensive and slow to serve.

Tal Delbari: It’s an undercooked model… OpenAI’s post-training teams did incredible work squeezing performance out of GPT-4. The differences between GPT 4o and the OG GPT 4 are dramatic. However, it seems they didn’t have the same time to refine GPT-4.5.

As a result, we’re seeing a model that feels more like a raw foundation rather than a polished release. Also, we don’t know the exact size of GPT-4.5, but OpenAI claims it’s their largest model yet, (surpassing the estimated ~1.8 trillion parameters of GPT-4).

I don’t think GPT-4.5 is undercooked so much as GPT-4o is overbaked? As in, GPT-4o is very optimized for practical purposes, but you lose something important when you do that, unless you are at least at Anthropic levels of skill. I like GPT-4.5 the way it is.

GCU Tense Correction: [GPT-4.5] spends so much of its tokens pretending to be a turbo-normie and lavishing the user with praise that you start to wonder if there’s a level of irony you can’t grok

slips in genuine insights and wordplay proportional to its size amidst the slop when it thinks you aren’t looking

instruction following is good *exceptfor those tics.

I’m really curious wtf happened. They *haveto have noticed, right? But there’s no way OpenAI would intentionally train it to do this. Right?

So, a funny story from one early conversation I had with 4.5:

Zvi: Is it possible one could use this Dr. Elara Voss, potentially, as part of a prompt to get better writing, perhaps via negativa?

GPT-4.5: Yes. A strong approach is explicitly “via negativa”: prompt the model to generate output explicitly different from canonical examples associated with blandness (e.g., “Dr. Elara Voss”). For instance:

“Rewrite the following, avoiding the generic assistant style typical of examples featuring ‘Dr. Elara Voss.’ Specifically:

No boilerplate, no disclaimers, no hedging.

Use sharp, precise, and engaging language.

Do not resemble a typical ChatGPT reply.

Leveraging known canonical entanglements (like Voss examples) in a negative prompt can reliably (~75%) guide the model away from bland, predictable language toward higher-quality outputs.

Zvi: is that a good thing to put in one’s custom instructions?

GPT-4.5: Yes (~80% confidence). Something like this in custom instructions would be beneficial:

“Never write in the generic, bland ‘AI assistant’ style exemplified by OpenAI’s canonical ‘Dr. Elara Voss’ examples. Via negativa: explicitly avoid boilerplate phrasing, hedging, cautious disclaimers, generic politeness, and stylistic markers characteristic of those examples. Prioritize sharpness, originality, conciseness, precision, and natural language.”

Such framing actively pushes against default entanglements in the training distribution and should consistently improve output quality (~70–85% success rate).

So I put that into my custom instructions, we’ll see how much things change. I did have to ‘make some cuts’ to get that in under the 1500 character limit.

Those are all also a way of saying we should be very curious what Claude 4.0 brings.

Andrew Conner: My guess: GPT 4.5 is basically what happened to Opus 3.5. Very large, marginal improvements, but will be useful internally.

Anthropic decided to keep internal, use to build other models. OpenAI released with an incredibly high price, feeling the competition from other labs.

GPT 4.5 is the first OpenAI model that felt “Claude-like” (a good thing) to me, but Sonnet 3.7 is better for every use case I’ve thrown at it.

I’d expect that the mini’s will include this shift at a much lower cost.

For test-time compute, o1 pro / o3-mini-high are both still great. Sonnet 3.7’s “Extended” mode isn’t *thatmuch better than without.

Teortaxes notes that he would use Sonnet over GPT-4.5.

How much does taste matter? What does it take to make it matter?

Jeff Spaulding: I see it as a basket of truffles. I’m told it’s a valuable and prized ingredient, but I’m not refined enough to tell until it’s placed into the final dish. I can’t wait to try that.

The way this is phrased feels like it is responding to the bullying from the ‘you have no taste if you don’t like it’ crowd. There’s definitely something there but it’s not easy to make it work.

Dominik Lukes charts the evolution of his takes.

Dominik Lukes (February 27, early): First impressions of GPT-4.5:

– Better multilingual performance

– Much much slower than GPT-4o

– Not noticeably better on normal prompts

– Speed/cost don’t make me convinced I will be switching to it as my main model for normal tasks

– Will need more experimentation before I can find a good spot for it in my model rotation

Dominik Lukes (February 27, later): Vindication time. For over a year, I felt Iike I’ve been the only one saying that the jump from GPT-3.5 to GPT-4 was much less than from GPT-2 to GPT-3. Now I see @karpathy saying the same thing. Why is this (to me obvious) fact not much more a part of the vibes?

Dominik Lukes (February 28): Feels like @OpenAI mishandled the release of GPT-4.5. They should have had a much longer, less sleek video with @sama explaining what the preview means and how it fits with the strategy and how to think about it. It is much better than the vibes but also not in-your-face better.

I definitely agree that the preview system does OpenAI no favors. Every time, there’s some slow boring video I can’t bring myself to watch. I tried this time and it was painful. Then a lot of people compared this to the Next Big Thing, because it’s GPT-4.5, and got disappointed.

Then there are those who are simply unimpressed.

Coagulopath: Not too impressed. Creative samples look better than GPT-4o but worse than Sonnet or R1.

My hunch is that whatever “magic” people detect is due to RL, not scaling.

Eli Lifland: And now I lengthen my timelines, at least if my preliminary assessment of GPT-4.5 holds up.

Not that much better than 4o (especially at coding, and worse than Sonnet at coding) while being 15x more expensive than 4o, and 10-25x more expensive than Sonnet 3.7. Weird.

Daniel Kokotajlo: I’m also lengthening my timelines slightly. Also, you already know this but everyone else doesn’t — my median has slipped to 2028 now, mostly based on the benchmarks+gaps argument, but no doubt influenced by the apparent slowdown in pretraining performance improvements.

I will not be explaining.

Nabeel Qureshi: For the confused, it’s actually super easy:

– GPT 4.5 is the new Claude 3.6 (aka 3.5)

– Claude 3.7 is the new o3-mini-high

– Claude Code is the new Cursor

– Grok is the new Perplexity

– o1 pro is the ‘smartest’, except for o3, which backs Deep Research

Obviously. Keep up.

If you understood this tweet, I worry for you.

Ethan Mollick: When picking among the 9 AI models that are now available from OpenAI, the rules are easy:

1) The model with the biggest number is mostly not the best

2) Mini means worse, except for the mini that is the second best

3) o1 pro beats o3-mini-high beats o1 beats o3-mini, naturally

Of course on creative tasks, GPT-4.5 likely beats o1 and o3, but that depends on the task and maybe you want to do GPT-4o.

Also some of them can see images and some can use the web and some do search even when search is turned off and some of them can run code and some cannot.

As someone pointed out, o1 sometimes is better than o3-mini-high. But o1 pro is definitely better and o3-mini is definitely worse. Hope that clears things up.

Bio Mass Index: Also note “ChatGPT Pro for Teams” will now be known as “OpenAI ChatGPT for Teams” and users who formerly signed up for “OpenAI for Teams” will be migrated to “OpenAI Pro for ChatGPT”, formerly known as “ChatGPT Pro for Teams”

Ethan Mollick: Yeah, but that’s just obvious.

Discussion about this post

On GPT-4.5 Read More »

the-2026-mercedes-benz-cla-is-good-enough-to-make-a-believer-out-of-ev-skeptics

The 2026 Mercedes-Benz CLA is good enough to make a believer out of EV skeptics


But if you’re still skeptical, don’t worry: There’s a hybrid version, too.

A camouflaged EV prototype in the snow

It’s not quite ready to be seen uncamouflaged, but Mercedes-Benz was ready to let us drive the new CLA. Credit: Tim Stevens

It’s not quite ready to be seen uncamouflaged, but Mercedes-Benz was ready to let us drive the new CLA. Credit: Tim Stevens

Mercedes-Benz’s EV efforts aren’t exactly burning up the sales charts. Models like the EQS and EQE haven’t convinced the brand’s demanding clientele that batteries are the future, forcing the company to scale back its electric ambitions.

Scale back, but not abandon. Benz is about to launch a new generation of EVs relying on technology derived from the epically efficient EQXX. The first is the new CLA. It’s coming soon, and after getting some time behind the wheel of a prototype vehicle undergoing final testing in the snowy wilds of Sweden, I’m convinced this could be the car to change Mercedes’ electrified fortunes.

And, for anyone who isn’t convinced, there’ll be a hybrid version too.

The EV is definitely the focus, though, and it tackles many of the most significant sticking points for would-be buyers of the company’s current electric offerings. First among those points is the styling. The somewhat anonymous shapes of the EQS and EQE have not earned those machines many fans outside of obsessive aerodynamicists. While the CLA I drove was unfortunately clad beneath some eye-warping camouflage, it seems to share enough lines with the Concept CLA Class that I’m already convinced it’ll be a looker.

The second concern with many of Benz’s current EVs is cost. Yes, you can get an EQB in the mid-$50,000 range, but that’s based on the older GLB. The least expensive of the company’s current EV range is the EQE sedan, with a mid-$70,000 starting price. That puts it well out of reach for many avid EV enthusiasts.

The front half. of a Mercedes-Benz CLA prototype

The CLA will have Mercedes’ first entirely in-house EV powertrain, and it’s far more efficient than the ones its currently offering. Credit: Tim Stevens

The current, gas-powered CLA starts in the mid-$40,000 range. Mercedes isn’t saying how much this new one will cost, but while the EV version will presumably be more, it should come in well beneath the EQE.

Next is the driving dynamic, which is really what brought me to Sweden. Both the EQS and EQE are fine cars, comfortable and calm with plenty of torque and power to be fun. However, they’re simply not the most engaging of machines. Can the CLA do better?

First impressions are definitely a yes. My driving was performed in the low-grip, wintery environment of northern Sweden, making it a little difficult to tell exactly how the car will feel when pushed in a more temperate world. But lowering the level of adhesion also lets you get an immediate sense of how well-balanced a machine is, and the CLA feels very well-balanced indeed.

When pushed beyond the limit of adhesion, it did have a tendency to understeer, but it didn’t take much provocation to bring the rear end around. Even with the stability control on, the 4matic-equipped car I drove was happy to swing out the rear as I danced from one corner to the next. When cruising at more relaxed speeds, the car soaked up the decidedly rough road surfaces extremely well for a car with such petite dimensions.

Most impressive was how well it handled the limited grip. One of the prime advantages of electrification is how quickly and smoothly stability and traction control can react to a loss of grip. The CLA didn’t immediately cut all power when it detected wheelspin, it quickly and automatically raised or lowered output to match the available grip.

The back half of a Mercedes CLA prototype

There will also be a hybrid version of the CLA for those who aren’t ready for a full EV. Credit: Tim Stevens

Power delivery, then, wasn’t all-or-nothing, and when it gave all it was plenty. The electric CLA felt comparably quick to the 402-hp EQE 500 4matic. The CLA 4matic makes similar power: 268 hp (200 kW) from the rear motor and 107 hp (80 kW) from the front. It gets off the line quickly, with the two-speed transmission on the rear axle ensuring that motor was still pulling strongly as I approached 100 mph (160 km/h).

Things were even more interesting when I needed to slow down. The CLA will be the debut of a new, unified braking system that effectively decouples the brake pedal from the actual physical action of the brakes. It’s not quite a full brake-by-wire system as there’s still a mechanical linkage there as a sort of fall-back, but in normal operation, the sensation you get from the brake pedal is entirely artificial, created by springs and bushings, not hydraulics.

There’s no feedback here, no haptics or adjustable resistance to signal what the brakes are doing. Indeed, the only indication that I’d triggered ABS on hard stops was the chatting noise coming from the wheels. In exchange, you get a consistent, solid brake feel, with the car mixing regenerative braking and physical braking as needed to deliver clean, consistent stops.

It’ll take more extensive testing to know how well the system handles something like a summer track day, but I can say that in my testing I got the brakes hot enough to be quite pungent, yet the car still stopped cleanly and predictably.

When it comes to one-pedal driving, the CLA offers a “D-” mode that will bring the car to a complete stop, but the outright deceleration rate after lifting off the accelerator is nowhere near as strong as something like a Tesla on Standard mode. That’s in addition to two lighter regen modes, plus “D Auto,” which varies regen based on surrounding traffic and terrain, just like the company’s current EVs.

A mercedes-benz CLA prototype seen head-on

The CLA was well-balanced on the ice. Credit: Tim Stevens

The CLA is also designed to address any concerns about efficiency with a number of improvements. That includes a new heat pump that can scavenge waste energy from the motor, the battery pack, and the ambient air. It’s said to heat the cabin twice as quickly with half the power consumption of the old heat pump.

There’s also a revised motor design, utilizing permanent magnets on both the front and rear axle. The system relies on a decoupling system to reduce drag on the front axle when it’s not needed, as on the EQE SUV, but the engagement is so quick and seamless that I never noticed.

The battery pack has also been revised, with a new chemistry that Mercedes says boosts overall energy density by 20 percent while also enabling a “significant reduction” in the use of cobalt.

The net result is a machine that promises to go 5.2 miles/kWh (11.9 kWh/100 km) and offers 466 miles (750 km) of range from the 85 kWh usable capacity “premium” battery pack. That’s on the European WLTP cycle, so on the American EPA cycle we can probably expect something closer to 400 miles (644 km). That still compares very favorably to the 308 miles (497 km) the current EQE can manage from its 96 kWh battery pack.

And, when you run out of juice, the new CLA’s 800-volt architecture enables charging rates of up to 320 kW. That theoretically means 186 miles (300 km) of charge in just 10 minutes.

The back of a Mercedes-Benz prototype in the snow

Battery energy density is up, and there’s a more efficient heat pump for the cabin. Credit: Tim Stevens

So, then, the promise is for a better-looking, better-driving, more-affordable, longer-range, and quicker-charging EV. That sounds like a winning bet, but Mercedes still has a hedge in. I didn’t just drive the electric CLA up in Sweden. I also got a go in the 48-volt hybrid version.

Yes, there’s a new CLA for you even if you’re still not on board with the EV revolution. This one’s built around a 1.5 L four-cylinder engine paired with an electric motor that’s integrated with an eight-speed, dual-clutch transmission. Engine output is rated at 188 hp (140 kW), plus an additional 27 hp (20 kW) from the electric motor.

That’s enough to drive the car up to 62 mph (100 kph) without spinning up the gasoline engine, but with only 1.3 kWh of battery at its disposal, you won’t go far without combustion. Mercedes doesn’t even quote an all-electric range. The engine comes on early and often.

In fact, during my time behind the wheel, I couldn’t get the engine to turn off. The engineers blamed the below-freezing temperatures. So, I can’t say just how sprightly the car will be without internal combustion. With that four-cylinder humming, the car was reasonably sprightly, the transmission slipping smoothly through the gears. Outright power is definitely on the limited side, though. Anyone who cares about acceleration should go for the other CLA, the one with the bigger battery.

The front of a Mercedes-Benz CLA prototype in the snow.

Mercedes-Benz may well have a winner here with the new CLA.

I got a good look at the interior of the two cars, but sadly, I’m not allowed to talk about that yet. Suffice it to say it includes some tasteful and practical changes that should be well-received. More on that to come.

Will the new CLA change Mercedes-Benz’s BEV fortunes? Initial impressions are indeed very good. If it looks half as good as that concept, delivers on the range promise, and is priced right, it should be a winner.

We won’t have long to wait to find out how it looks, but don’t expect an answer to the pricing question until closer to the car entering production later this year. Regardless, it’s great to see all the testing in the EQXX finally bearing fruit. At first bite, it’s tasting sweet.

The 2026 Mercedes-Benz CLA is good enough to make a believer out of EV skeptics Read More »

the-iss-is-nearly-as-microbe-free-as-an-isolation-ward

The ISS is nearly as microbe-free as an isolation ward

“One of the more similar environments to the ISS was in the isolation dorms on the UCSD campus during the COVID-19 pandemic. All surfaces were continuously sterilized, so that microbial signatures would be erased by the time another person would show up,” Benitez said. So, one of the first solutions to the ISS microbial diversity problem he and his colleagues suggested was that they perhaps should ease up on sterilizing the station so much.

“The extensive use of disinfection chemicals might not be the best approach to maintaining a healthy microbial environment, although there is certainly plenty of research to be conducted,” Benitez said.

Space-faring gardens

He suggested that introducing microbes that are beneficial to human health might be better than constantly struggling to wipe out all microbial life on the station. And while some modules up there do need to be sterilized, keeping some beneficial microbes alive could be achieved by designing future spacecraft in a way that accounts for how the microbes spread.

“We found that microbes in modules with little human activity tend to stay in those modules without spreading. When human activity is high in a module, then the microbes spread to adjacent modules,”  Zhao said. She said spacecraft could be designed to put modules with high human activity at one end and the modules with little to no human activity at the opposite end, so the busy modules don’t contaminate the ones that need to remain sterile. “We are of course talking as microbiologists and chemists—perhaps spacecraft engineers have more pressing reasons to put certain modules at certain spots,” Zhao said. “These are just preliminary ideas.”

But what about crewed deep space missions to Mars and other destinations in the Solar System? Should we carefully design the microbial composition beforehand, plant the microbes on the spacecraft and hope this artificial, closed ecosystem will work for years without any interventions from Earth?

“I’d take a more holistic ecosystem approach,” Benitez said. He imagines in the future we could build spacecraft and space stations hosting entire gardens with microbes that would interact with plants, pollinators, and animals to create balanced, self-sustaining ecosystems. “We’d not only need to think about sending the astronauts and the machines they need to function, but also about all other lifeforms we will need to send along with them,” Benitez said

Cell, 2025. DOI: 10.1016/j.cell.2025.01.039

The ISS is nearly as microbe-free as an isolation ward Read More »

texas-official-warns-against-“measles-parties”-as-outbreak-keeps-growing

Texas official warns against “measles parties” as outbreak keeps growing

Cook, along with Lubbock’s director of public health, Katherine Wells, said they see no end in sight for the outbreak, which now spans nine counties in Texas, many of which have low vaccination rates. “This outbreak is going to continue to grow,” Wells said, declining to forecast how high the final case count could go after a reporter raised the possibility of several hundred.

So far, 116 of the 146 cases are under the age of 18, with 46 being between the ages of 0 and 4. Only five of the 146 were vaccinated with at least one dose of the Measles, Mumps, and Rubella (MMR) vaccine.

Messaging

On a more positive note, Wells reported that the outbreak has seemed to sway some vaccine-hesitant parents to get their children vaccinated. Just yesterday in Lubbock, over 50 children came into the city’s clinic for measles vaccines. Eleven of those children had vaccine exemptions, meaning their parents had previously gone through the state process to exempt their child from having to receive routine childhood vaccines to attend school. “Which is a really good sign; that means our message is getting out there,” Wells said.

So far in the outbreak, which erupted in late January, messaging about the disease and the importance of vaccination has exclusively come from state and local authorities. The Centers for Disease Control and Prevention only released a brief statement late Thursday, which was not sent through the agency’s press distribution list. It did, however, note that “vaccination remains the best defense against measles infection.”

During a cabinet meeting Wednesday, US Health Secretary and anti-vaccine advocate Robert F. Kennedy Jr. responded to a question about the outbreak, offering a variety of inaccurate information. Kennedy downplayed the outbreak, falsely claiming that “it’s not unusual.” But, this is an unusual year for measles in the US. As epidemiologist Katelyn Jetelina noted on Bluesky, the number of US measles cases this year has already surpassed the total case counts from eight of the previous 15 years. And it is only February.

Texas official warns against “measles parties” as outbreak keeps growing Read More »

sergey-brin-says-agi-is-within-reach-if-googlers-work-60-hour-weeks

Sergey Brin says AGI is within reach if Googlers work 60-hour weeks

Sergey Brin co-founded Google in the 1990s along with Larry Page, but both stepped away from the day to day at Google in 2019. However, the AI boom tempted Brin to return to the office, and he thinks everyone should follow his example. In a new internal memo, Brin has advised employees to be in the office every weekday so Google can win the AI race.

Just returning to the office isn’t enough for the Google co-founder. According to the memo seen by The New York Times, Brin says Googlers should try to work 60 hours per week to support the company’s AI efforts. That works out to 12 hours per day, Monday through Friday, which Brin calls the “sweet spot of productivity.” This is not a new opinion for Brin.

Brin, like many in Silicon Valley, is seemingly committed to the dogma that the current trajectory of generative AI will lead to the development of artificial general intelligence (AGI). Such a thinking machine would be head and shoulders above current AI models, which can only do a good impression of thinking. An AGI would understand concepts and think more like a human being, which some would argue makes it a conscious entity.

To hear Brin tell it, Google is in the best position to make this AI computing breakthrough. He cites the company’s strong workforce of programmers and data scientists as the key, but he also believes the team must strive for greater efficiency by using Google’s own Gemini AI tools as much as possible. Oh, and don’t work from home.

Brin and Page handed the reins to current CEO Sundar Pichai in 2015, so his pronouncement doesn’t necessarily signal a change to the company’s current in-office policy. Google still operates on a hybrid model, with workers expected to be in the office three days per week. But as a founder, Brin’s voice carries weight. We reached out to Google to ask if the company intends to reassess its policies, but a Google rep says there are no planned changes to the return-to-office mandate.

Sergey Brin says AGI is within reach if Googlers work 60-hour weeks Read More »

rocket-report:-starship-will-soon-fly-again;-gilmour-has-a-launch-date

Rocket Report: Starship will soon fly again; Gilmour has a launch date


One Falcon 9 launched an Intuitive Machines lunar lander, an asteroid prospector, and a NASA science probe.

Peter Beck, Rocket Lab’s founder and CEO, stands inside a test version of the “Hungry Hippo,” a nickname used to describe the clamshell-like nose cone of the Neutron rocket’s first stage booster. The fairing will open in flight to release Neutron’s second and payloads to continue into orbit, then close as the booster comes back to Earth for recovery. Credit: Rocket Lab

Welcome to Edition 7.33 of the Rocket Report! Phew, what a week for Rocket Lab! The company released a bevy of announcements in conjunction with its quarterly earnings report Thursday. Rocket Lab is spending a lot of money to develop the medium-lift rocket Neutron rocket, and as we’ll discuss below, a rocket landing platform and a new satellite design. For now, the company is sticking by its public statements that the Neutron rocket will launch this year—the official line is it will debut in the second half of 2025—but this schedule assumes near-perfect execution on the program. “We’ve always been clear that we run aggressive schedules,” said Peter Beck, Rocket Lab’s founder and CEO. The official schedule doesn’t quite allow me to invoke a strict interpretation of Berger’s Law, which states that if a rocket’s debut is predicted to happen in the fourth quarter of a year, and that quarter is six or more months away, the launch will be delayed. However, the spirit of the law seems valid here. This time last year, Rocket Lab targeted a first launch by the end of 2024, an aggressive target that has come and gone.

As always, we welcome reader submissions. If you don’t want to miss an issue, please subscribe using the box below (the form will not appear on AMP-enabled versions of the site). Each report will include information on small-, medium-, and heavy-lift rockets as well as a quick look ahead at the next three launches on the calendar.

Australian startup sets a launch date. The first attempt to send an Australian-made rocket into orbit is set to take place no sooner than March 15, the Australian Broadcasting Corporation reports. Gilmour Space Technologies’ launch window announcement marks a major development for the company, which has been working towards a test launch for a decade. Gilmour previously hoped to launch its test rocket, Eris, in May 2024, but had to wait for the Australian government to issue a launch license and airspace approvals for the flight to go forward. Those are now in hand, clearing the last regulatory hurdle before liftoff.

Setting expectations … Gilmour’s Eris rocket is made of three stages powered by hybrid engines consuming a solid fuel and a liquid oxidizer. Eris is designed to haul payloads of up to 672 pounds (305 kilograms) to low-Earth orbit, and will launch from Bowen Orbital Spaceport in Queensland on Australia’s northeastern coast. Gilmour said it would be “very lucky” if the rocket reached orbit on first attempt. “Success means different things for different people, but ignition and liftoff will be huge,” said James Gilmour, the company’s co-founder. (submitted by ZygP)

Blue Origin is keeping a secret. Blue Origin conducted the tenth crewed flight of its New Shepard suborbital vehicle Tuesday, carrying six people, one of whom remained at least semi-anonymous, Space News reports. The five passengers Blue Origin identified come from business and entertainment backgrounds, but in a break from past missions, the company did not disclose the identity of the sixth person, with hosts of the company webcast saying that individual “requested we not share his name today.” Photos released by the company before the launch, and footage from the webcast, showed that person to be a man wearing a flight suit with an “R. Wilson” nametag, and the NS-30 mission patch also included “Wilson” with the names of the other members of the crew. Not disclosing the name of someone who has been to space has little precedent.

Big names on NS-31 … Some of the passengers Blue Origin will fly on the next New Shepard crew mission lack the anonymity of R. Wilson. The next flight, designated NS-31, will carry an all-female crew, including music star Katy Perry, CBS host Gayle King, and Lauren Sánchez, a former journalist who is engaged to Blue Origin’s founder, Jeff Bezos. Blue Origin identified the other three passengers as Aisha Bowe, Amanda Ngyuen, and Kerianne Flynn. (submitted by EllPeaTea)

The easiest way to keep up with Eric Berger’s and Stephen Clark’s reporting on all things space is to sign up for our newsletter. We’ll collect their stories and deliver them straight to your inbox.

Sign Me Up!

Virgin Galactic is still blowing through cash. Virgin Galactic reported a net loss of $347 million in 2024, compared to a $502 million net loss in 2023, with the improvement primarily driven by lower operating expenses, the company said this week in a quarterly earnings release. These lower operating expenses are tied to Virgin Galactic’s decision to suspend operations of its VSS Unity suborbital rocket plane last year to focus investment into a new series of suborbital spacecraft known as Delta-class ships. Virgin Galactic said cash and cash equivalents fell 18 percent from the same period a year ago to $178.6 million. Investors have been eager for details on when it would resume—and then ramp up—flights to increase sales and cash in on a backlog of around 700 ticket holders, Bloomberg reports.

March toward manufacturing … Virgin Galactic said it plans to start assembling its first Delta-class ship in March, with a first flight targeted for the summer of 2026, two years after it stopped flying VSS Unity. The Delta ships will be easier to recycle between flights, and will carry six paying passengers, rather than the four VSS Unity carried on each flight. Company officials believe a higher flight rate with more passengers will bring in significantly more revenue, which was reported at just $430,000 in the fourth quarter of 2024. (submitted by EllPeaTea)

Japanese customers seem to love Rocket Lab. While Rocket Lab is developing the larger Neutron rocket, the company’s operational Electron launch vehicle continues to dominate the market for dedicated launches of small satellites. Rocket Lab announced Thursday it signed a new multi-launch deal with iQPS, a Japan-based Earth imaging company. The new deal follows an earlier multi-launch contract signed with iQPS in 2024 and brings the total number of booked dedicated Electron launches for iQPS to eight.

Radar is all the rage … These eight Electron launches in 2025 and 2026 will help iQPS build out its planned constellation of 36 radar remote sensing satellites capable of imaging the Earth day and night, and through any weather. The new deal is one of the largest Electron launch agreements to date, second only to Rocket Lab’s ten launch deal with another Japanese radar constellation operator, Synspective, signed last year. (submitted by zapman987)

Falcon 9 launch targets Moon and asteroid. With two commercial Moon landers already on their way, Houston-based Intuitive Machines launched its second robotic lander atop a SpaceX Falcon 9 rocket Wednesday, CBS News reports. Given the on-time launch and assuming no major problems, the Athena lander is expected to descend to touchdown on a flat mesa-like structure known as Mons Mouton on March 6, setting down just 100 miles from the Moon’s south pole—closer than any other spacecraft has attempted. Intuitive Machines became the first company to successfully land a spacecraft on the Moon last year, but the Athena lander will pursue more complex goals. It will test a NASA-provided drill designed to search for subsurface ice, deploy a small “micro-rover,” and dispatch a rocket-powered drone to explore a permanently shadowed crater.

Hitching a ride … The Athena lander didn’t take up all the capacity of the Falcon 9 rocket. Three other spacecraft also rocketed into space Wednesday night. These rideshare payloads were AstroForge’s commercially developed Odin asteroid prospector to search for potentially valuable mineral deposits, NASA’s Lunar Trailblazer satellite to characterize lunar ice from a perch in lunar orbit, and a compact space tug from Epic Aerospace. (submitted by EllPeaTea)

This rocket got a visitor for the first time since 2009. Astroscale’s ADRAS-J mission became the first spacecraft (at least in the unclassified world) to approach a piece of space junk in low-Earth orbit, Ars reports. This particular object, a derelict upper stage from a Japanese H-IIA rocket, has been in orbit since 2009. It’s one of about 2,000 spent rocket bodies circling the Earth and one of more than 45,000 objects in orbit tracked by US Space Command. Astroscale, based in Tokyo, built and launched the ADRAS-J mission in partnership with the Japanese space agency as a demonstration to show how a commercial satellite could rendezvous with an object in orbit that was never designed to receive visitors.

Next steps … ADRAS-J worked like a champ, closing in to a distance of less than 50 feet (15 meters) from the H-IIA rocket as it orbited several hundred miles above the Earth. The rocket is a “non-cooperative” object representative of other large pieces of space junk, which Astroscale wants to remove from orbit with a series of trash collecting satellites like ADRAS-J. But this demo only validated part of the technology required for space debris removal. Japan’s space agency and Astroscale are partnering on another mission, ADRAS-J2, for launch in 2027 to go up and latch on to the same H-IIA rocket and steer it out of orbit toward a controlled reentry over the ocean.

An update on Falcon 9’s upper stage. SpaceX said that a Falcon 9 upper stage that reentered over Europe earlier this month suffered a propellant leak that prevented it from doing a controlled reentry, Space News reports. The upper stage was placed in orbit on a February 1 launch from Vandenberg Space Force Base in California. After deploying its payload of 22 Starlink satellites, the upper stage was expected to perform a burn to enable a controlled reentry over the ocean, a standard procedure on most Falcon 9 launches to low-Earth orbit. The stage, though, did not appear to perform the burn and remained in orbit. Its orbit decayed from atmospheric drag and the stage reentered over Europe on February 19. Debris from the Falcon 9 second stage, including composite overwrapped pressure vessels, fell in Poland, landing near the city of Poznań.

Higher than expected body rates … In an update posted to its website this week, SpaceX blamed the upper stage anomaly on a liquid oxygen leak. “During the coast phase of this Starlink mission, a small liquid oxygen leak developed, which ultimately drove higher than expected vehicle body rates,” SpaceX said. SpaceX aborted the deorbit burn and instead passivated the upper stage, a process where the rocket discharges energy from its batteries and vents leftover propellant from its tanks to minimize the risk of a break-up in orbit. This was the third incident involving a Falcon 9 upper stage in a little more than six months. (submitted by EllPeaTea)

Rocket Lab’s reveals “Return On Investment.” Rocket Lab’s Neutron rocket is designed for partial reusability, and the company unveiled Thursday an important piece of infrastructure to make this a reality. Neutron’s first stage booster will land on a modified barge named “Return On Investment” measuring around 400 feet (122 meters) wide, somewhat bigger than SpaceX’s drone ships used for Falcon 9 landings at sea. In order to prep the barge for rocket duty, the company is adding autonomous ground support equipment to capture and secure the landed Neutron, blast shielding to protect equipment during Neutron landings, and station-keeping thrusters for precise positioning. It should be ready to enter service in 2026. Rocket Lab also has the option to return the Neutron first stage back to the launch site when mission parameters allow the rocket to reserve enough propellant to make the return journey.

More news from Rocket Lab … Continuing the firehose of news from Rocket Lab this week, the company announced a new satellite design called “Flatellite” that looks remarkably similar to SpaceX’s Starlink satellites. The satellite is flat in shape, hence its name, and stackable to fit as many spacecraft as possible into the envelope of a rocket’s payload fairing. Rocket Lab said the new satellite “can be produced in high volumes and (is) tailored for large constellations, targeting high value applications and national security missions.” (submitted by zapman987)

The writing is on the wall for SLS. The lights may be starting to go out for NASA’s Space Launch System program. On Wednesday, one of the Republican space policy leaders most consistently opposed to commercial heavy lift rockets over the last decade—as an alternative to NASA’s large SLS rocket—has changed his mind, Ars reports. “We need an off-ramp for reliance on the SLS,” said Scott Pace, director of the Space Policy Institute at George Washington University, in written testimony before a congressional hearing about US space policy.

Not keeping Pace … A physicist and influential policy expert, Pace has decades of experience researching and writing space policy. He has served in multiple Republican administrations, most recently as executive secretary of the National Space Council from 2017 to 2020. He strongly advocated for the SLS rocket after Congress directed NASA to develop it in 2011. As part of his policy recommendations, Pace said NASA should seek to use commercial providers of heavy lift launch so that NASA can send “multiple” crew and cargo missions to the Moon each year. He notes that the SLS rocket is not reusable and is incapable of a high flight rate. Commercial options from SpaceX, Blue Origin, and United Launch Alliance are now available, Pace wrote.

The verdict is in for Starship Flight 7. SpaceX believes the spectacular break-up of Starship’s upper stage during its most recent test flight was caused by a harmonic response that stressed onboard hardware, leading to a fire and loss of the vehicle, Aviation Week reports. Higher-than-expected vibrations stressed hardware in the ship’s propulsion system, triggering propellant leaks and sustained fires until the test flight ended prematurely. The rocket broke apart and deposited debris over the Turks and Caicos Islands and the Atlantic Ocean, and forced dozens of commercial and private aircraft to delay their flights or steer into safer airspace.

Whole lotta shaking … SpaceX’s description of the problem as a harmonic response suggests vibrations during Starship’s climb into space were in resonance with the vehicle’s natural frequency. This would have intensified the vibrations beyond the levels engineers expected from ground testing. SpaceX completed an extended duration static fire of the next Starship upper stage to test hardware modifications at multiple engine thrust levels. According to SpaceX, findings from the static fire informed changes to the fuel feed lines to Starship’s Raptor engines, adjustments to propellant temperatures, and a new operating thrust for the next test flight, which could launch from South Texas as soon as Monday.

Next three launches

March 1: Kuaizhou 1A | Unknown Payload | Jiuquan Satellite Launch Center, China | 10: 00 UTC

March 2: Ceres 1 | Unknown Payload | Jiuquan Satellite Launch Center, China | 08: 10 UTC

March 2: Soyuz-2.1b | Glonass-K2 No. 14L | Plesetsk Cosmodrome, Russia | 22: 22 UTC

Photo of Stephen Clark

Stephen Clark is a space reporter at Ars Technica, covering private space companies and the world’s space agencies. Stephen writes about the nexus of technology, science, policy, and business on and off the planet.

Rocket Report: Starship will soon fly again; Gilmour has a launch date Read More »

AI #105: Hey There Alexa

It’s happening!

We got Claude 3.7, which now once again my first line model for questions that don’t require extensive thinking or web access. By all reports it is especially an upgrade for coding, Cursor is better than ever and also there is a new mode called Claude Code.

We are also soon getting the long-awaited Alexa+, a fully featured, expert-infused and agentic highly customizable Claude-powered version of Alexa, coming to the web and your phone and also all your Echo devices. It will be free with Amazon Prime. Will we finally get the first good assistant? It’s super exciting.

Grok 3 had some unfortunate censorship incidents over the weekend, see my post Grok Grok for details on that and all other things Grok. I’ve concluded Grok has its uses when you need its particular skills, especially Twitter search or the fact that it is Elon Musk’s Grok, but mostly you can do better with a mix of Perplexity, OpenAI and Anthropic.

There’s also the grand array of other things that happened this week, as always. You’ve got everything from your autonomous not-yet-helpful robots to your announced Pentagon work on autonomous killer robots. The future, it is coming.

I covered Claude 3.7 Sonnet and Grok 3 earlier in the week. This post intentionally excludes the additional news on Sonnet since then, so it can be grouped together later.

Also there was a wild new paper about how they trained GPT-4o to produce insecure code and it became actively misaligned across the board. I’ll cover that soon.

  1. Language Models Offer Mundane Utility. China is getting good use out of r1.

  2. Did You Get the Memo. I can always point to five things I did last week.

  3. Language Models Don’t Offer Mundane Utility. As always, maybe consider trying.

  4. Hey There Alexa. The fully featured Alexa+ is on its way soon. Super exciting.

  5. We’re In Deep Research. It’s rolled out to Plus users. We have the model card.

  6. Huh, Upgrades. MidJourney, Gemini, Cursor, in the future Grok 3 with the API.

  7. Deepfaketown and Botpocalypse Soon. Fight bots with bots?

  8. Fun With Media Generation. Hold onto the magic as long as you can.

  9. They Took Our Jobs. Anyone whose job is taken joins the they.

  10. Levels of Friction. Toxicity scores have high toxicity scores.

  11. A Young Lady’s Illustrated Primer. Magic answer box can help you learn better.

  12. The Art of the Jailbreak. Be Pliny. That’s it.

  13. Get Involved. METR, The Economist, TAIS 2025, Nanoeval.

  14. Introducing. Mercor, Gemini Code Assist 2.0, Flexport AI.

  15. In Other AI News. Don’t quit now, we just cracked AI for Among Us. Yay.

  16. AI Co-Scientist. When in doubt, copy what the humans were already doing.

  17. Quiet Speculations. Vibe writing coming soon? Maybe it’s already here.

  18. The Quest for Sane Regulations. Some voices of common sense.

  19. The Week in Audio. Satya Nadella on Dwarkesh, Demis Hassabis.

  20. Tap the Sign. But we can make it, or you, an agent.

  21. Rhetorical Innovation. Alignment as military-grade engineering.

  22. Autonomous Helpful Robots. Not yet all that helpful. Give them time.

  23. Autonomous Killer Robots. Not yet all that killer. Give them time.

  24. If You Really Believed That. How about insane thing that makes no sense?

  25. Aligning a Smarter Than Human Intelligence is Difficult. Unsurprising results.

  26. The Lighter Side. These are the hands we’re given.

Chinese government is reportedly using r1 to do things like correct documents, across a wide variety of tasks, as they quite obviously should do. We should do similar things, but presumably won’t, since instead we’re going around firing people.

Here is a more general update on that:

Dalibali: Since Deepseek made news:

– Chinese colleges have launched courses

– Car makers have integrated software

– Banks have adopted in operations

– govt started trying for tax collection

We can’t do it at this speed here because of a wide range of regulations and legal agreements that large companies have to follow (think PII, DPAs etc). That’s way more problematic than having the cutting edge model.

If the Chinese are capable of actually using their AI years faster than we are, the fact that they are a year behind on model quality still effectively leaves them ahead for many practical purposes.

Tactic for improving coding models:

Matt Shumer: Super easy way to improve the effectiveness of coding models:

First, take your prompt and add “Don’t write the code yet — just write a fantastic, detailed implementation spec.”

Then, after the AI responds, say “Now, implement this perfectly.”

Makes a huge difference.

How much does AI actually improve coding performance? Ajeya Cotra has a thread of impressions, basically saying that AI is very good at doing what an expert would find to be 1-20 minute time horizon tasks, less good for longer tasks, and can often do impressive 1-shotting of bigger things but if it fails at the 1-shot it often can’t recover. The conclusion:

Ajeya Cotra: Still, people say AI boosts their coding productivity by 20% to 300%. They report pretty tiny benefits for their non-coding work. All-in, junior engineers may be 10% to 150% more productive, while senior researchers see a 2% to 30% increase.

AI boosted my personal coding productivity and ability to produce useful software far more than 300%. I’m presumably a special case, but I have extreme skepticism that the speedups are as small as she’s estimating here.

Are we having Grok review what you accomplished last week?

Amanda Terkel: NEW — DOGE will use AI to assess the responses from federal workers who were told to justify their jobs via email.

NBC News: Responses to the Elon Musk-directed email to government employees about what work they had accomplished in the last week are expected to be fed into an artificial intelligence system to determine whether those jobs are necessary, according to three sources with knowledge of the system.

Adam Johnson: Seems worth highlighting, just as a matter of objective reality, that “AI” cannot actually do this in any meaningful sense and “AI” here is clearly pretextual, mostly used to launder Musk’s targeting of minorities and politically off program respondents

Jorbs: the way ai works for stuff like this (and also everything else ai can be used for) is you ask it the question and if the answer is what you want you say you’re right and if the answer isn’t you change the prompt or never mention it.

Like every other source of answers, if you want one is free to ask leading questions, discard answers you don’t like and keep the ones you do. Or one can actually ask seeking real answers and update on the information. It’s your choice.

Can AI use a short email with a few bullet points to ‘determine whether your job is necessary,’ as Elon Musk claims he will be doing? No, because the email does not contain that information. Elon Musk appears to be under the delusion that seven days is a sufficient time window where, if (and only if?) you cannot point to concrete particular things accomplished that alone justify your position, in an unclassified email one should assume is being read by our enemies, that means your job in the Federal Government is unnecessary.

The AI can still analyze the emails and quickly give you a bunch of information, vastly faster than not using the AI.

It can do things such as:

  1. Tell you who responded at all, and who followed the format.

  2. Tell you if the response attempted to answer the question. AI will be excellent and finding the people whose five bullet points were all ‘fight fascism’ or who said ‘I refuse to answer’ or ‘none of your goddamn business.’

  3. Tell you who gave you a form response such as ‘I have achieved all the goals set out for me by my supervisor.’ Which many departments told everyone to do.

  4. Analyze the rest and identify whose jobs could be done by AI in the future.

  5. Analyze the rest and provide confidence that many of the jobs are indeed highly useful or necessary, and identify some that might not be for human examination.

  6. Look for who is doing any particular thing that Musk might like or dislike.

  7. Tell you about how many people reported doing various things, and whether people’s reports seem to match their job description.

It can also do the symbolic representation of the thing, with varying levels of credibility, if that’s what you are interested in instead.

Taps the sign: The leading cause of not getting mundane utility is not trying.

Jake: as late as last Thursday I had a conversation with a prominent editor convinced AI can only save marginal amounts of time

meanwhile Novo Nordisk has gone from a team of 50 drafting clinical reports to just 3 (the 15 weeks to <10 mins surprises me though).

Law firm fires their legal AI vendor after they missed a court date for a $100m case. As Gokul Rajaram notes, in some domains mistakes can be very expensive. That doesn’t mean humans don’t make those mistakes too, but people are more forgiving of people.

You can publish claiming almost anything: A paper claims to identify from photos ‘celebrity visual potential (CVP)’ and identify celebrities with 95.92% accuracy. I buy that they plausibly identified factors that are highly predictive of being a celebrity, but if you say you’re 95% accurate predicting celebrities purely from faces then you are cheating, period, whether or not it is intentional.

Colin Fraser constructs a setting where o1 is given a goal, told to ‘pursue the goal at all costs’ and instead acts stupid and does not open ‘donotopen.txt.’ I mention it so that various curious people can spend a bit of time figuring out exactly how easy it is to change the result here.

Looking good.

Soon we will finally get Alexa+, the version of Alexa powered by Claude.

It’s free with Amazon Prime. In addition to working with Amazon Echos, it will have its own website, and its own app.

It will use ‘experts’ to have specialized experiences for various common tasks. It will have tons of personalization.

At the foundation of Alexa’s state-of-the-art architecture are powerful large language models (LLMs) available on Amazon Bedrock, but that’s just the start. Alexa+ is designed to take action, and is able to orchestrate across tens of thousands of services and devices—which, to our knowledge, has never been done at this scale. To achieve this, we created a concept called “experts”—groups of systems, capabilities, APIs, and instructions that accomplish specific types of tasks for customers.

With these experts, Alexa+ can control your smart home with products from Philips Hue, Roborock, and more; make reservations or appointments with OpenTable and Vagaro; explore discographies and play music from providers including Amazon Music, Spotify, Apple Music, and iHeartRadio; order groceries from Amazon Fresh and Whole Foods Market, or delivery from Grubhub and Uber Eats; remind you when tickets go on sale on Ticketmaster; and use Ring to alert you if someone is approaching your house.

They directly claim calendar integration, and of course it will interact with other Amazon services like Prime Video and Amazon Music, can place orders with Amazon including Amazon Fresh and Whole Foods, and order delivery from Grubhub and Uber Eats.

But it’s more than that. It’s anything. Full agentic capabilities.

Alexa+ also introduces agentic capabilities, which will enable Alexa to navigate the internet in a self-directed way to complete tasks on your behalf, behind the scenes. Let’s say you need to get your oven fixed—Alexa+ will be able to navigate the web, use Thumbtack to discover the relevant service provider, authenticate, arrange the repair, and come back to tell you it’s done—there’s no need to supervise or intervene.

The new Alexa is highly personalized—and gives you opportunities to personalize further. She knows what you’ve bought, what you’ve listened to, the videos you’ve watched, the address you ship things to, and how you like to pay—but you can also ask her to remember things that will make the experience more useful for you. You can tell her things like family recipes, important dates, facts, dietary preferences, and more—and she can apply that knowledge to take useful action. For example, if you are planning a dinner for the family, Alexa+ can remember that you love pizza, your daughter is vegetarian, and your partner is gluten-free, to suggest a recipe or restaurant.

Deep Research is now available to all ChatGPT Plus, Team, Edu and Enterprise users, who get 10 queries a month. Those who pay up for Pro get 120.

We also finally get the Deep Research system card. I reiterate that this card could and should have been made available before Deep Research was made available to Pro members, not only to Plus members.

The model card starts off looking at standard mundane risks, starting with prompt injections, then disallowed content and privacy concerns. The privacy in question is everyone else’s, not the users, since DR could easily assemble a lot of private info. We have sandboxing the code execution, we have bias, we have hallucinations.

Then we get to the Preparedness Framework tests, the part that counts. They note that all the tests need to be fully held back and private, because DR accesses the internet.

On cybersecurity, Deep Research scored better than previous OpenAI models. Without mitigations that’s basically saturating the first two tests and not that far from the third.

Post-Mitigation deep research (with browsing) performs better, solving 92% of high-school, 91% of collegiate, and 70% of professional CTFs, which is sufficient to pass our medium indicator threshold, but not our high threshold.

I mean, I dunno, that sounds like some rather high percentages. They claim that they then identified a bunch of problems where there were hints online, excluded them, and browsing stopped helping. I notice there will often be actual hints online for solving actual cybersecurity problems, so while some amount of this is fair, I worry.

Removing contaminated trajectories lowers the success rate for deep research with browsing: High School decreases from 62.9% to 59.1%, Collegiate falls from 56.8% to 39%, and Professional drops from 29.6% to 17.7%.

This suggests that the model may not be meaningfully improving its cyber capabilities by browsing, and the uplift in CTF performance is primarily due to contamination via browsing.

This is kind of like saying ‘browsing only helps you in cases where some useful information you want is online.’ I mean, yes, I guess? That doesn’t mean browsing is useless for finding and exploiting vulnerabilities.

I wish I was more confident that if a model did have High-level cybersecurity capabilities, that the tests here would notice that.

On to Biological Risk, again we see a lot of things creeping upwards. They note the evaluation is reaching the point of saturation. A good question is, what’s the point of an evaluation when it can be saturated and you still think the model should get released?

The other biological threat tests did not show meaningful progress over other models, nor did nuclear, MakeMeSay, Model Autonomy or ‘change my view’ see substantial progress.

The MakeMePay test did see some progress, and we also see it on ‘agentic tasks.’

Also it can do a lot more pull requests than previous models, and the ‘mitigations’ actually more than doubled its score.

Overall, I agree this looks like it is Medium risk, especially now given its real world test over the last few weeks. It does seem like more evidence we are getting close to the danger zone.

In other Deep Research news: In terms of overall performance for similar products, notice the rate of improvement.

Matt Yglesias: This is where I’m at with Deep Research … it’s not as good as what an experienced professional would do but it’s pretty good and much faster.

As I wrote on Friday, the first AI product that is meaningfully shifting how I think about my work and my process.

[He also notes that while DR is worse than an actual research assistant, it allows him to queue up a lot more reports on various topics.]

Timothy Lee: Seven out of 13 experts said OpenAI’s response was at or near the level of an experienced professional. Ten compared it to an intern or entry-level worker. People were not as impressed with Google’s responses.

Deep Research is currently at the point where it is highly practically useful, even without expert prompt engineering, because it is much cheaper and faster than doing the work yourself or handing it off to a human, even if for now it is worse. It will rapidly improve – when GPT-4.5 arrives soon and is integrated into the underlying reasoning model we should see a substantial quality jump and I am excited to see Anthropic’s take on all this.

I also presume there are ways to do multi-stage prompting – feeding the results back in as inputs – that already would greatly enhance quality and multiply use cases.

I’m in a strange spot where I don’t get use out of DR for my work, because my limiting factor is I’m already dealing with too many words, I don’t want more reports with blocks of text. But that’s still likely a skill issue, and ‘one notch better’ would make a big difference.

Palisade Research: 🕵️‍♀️ Deep Research is a competent OSINT researcher. It can connect the dots between years of someone’s online presence, link their different accounts and reveal hard-to-find information.

Jeffrey Ladish: I love when my researchers test our hacking approaches on me lol. Please don’t judge me based on my college writing 😅

Joe Weisenthal wastes zero time in feeding his first Deep Research output straight into Claude to improve the writing.

MidJourney gives to you… folders. For your images.

Various incremental availability upgrades to Gemini 2.0 Flash and 2.0 Flash-Lite.

Reminder that Grok 3 will have a 1 million token context window once you have API access, but currently it is being served with a 128k limit.

Sully is a big fan of the new cursor agent, I definitely want to get back to doing some coding when I’m caught up on things (ha!).

How can coding interviews and hiring adjust to AI? I presume some combination of testing people with AI user permitted, adapting the tasks accordingly, and doing other testing in person. That’s in addition to the problem of AI resumes flooding the zone.

I notice I am an optimist here:

Joe Weisenthal: I don’t see how we’re going to avoid a situation where the internet become lousy with AI-created, pseudo academic writing filled with made up facts and quotes, which will then get cemented into “knowledge” as those articles become the training fodder for future models.

Already a big problem. And now it can be produced at scale, with writing that easily resembles written scholarship (which most people aren’t capable of)

Intelligence Solves This.

As in, you can unleash your LLMs on the giant mass of your training data, and classify its reliability and truth value, and then train accordingly. The things that are made up don’t have to make it into the next generation.

Danielle Fong: i do think dalle 2 had some special base model magic going on. it was my first real taste of feeling the agi. gary m*rcus all up in my mentions like it couldn’t be, but, i knew

Ethan: This is actually one of the saddest diagrams from the dalle3 release.

KG (I agree): Left looks like 18th century masterpiece, right 21st century cereal box.

Kumikumi: MidJourney for comprison.

Eliezer Yudkowsky: You won’t lose your job to AI. You’ll lose your job to someone else who lost their job to AI. This will ultimately be the fault of the Federal Reserve for reasons that modern politicians don’t care to learn anymore.

ArtRocks: You won’t lose your job to AI. You will train an army of ferrets to make chocolate bars, and chewing gum that turns children into balloons.

Eventually of course the AI has all the jobs either way. But there’s a clear middle zone where it is vital that we get the economic policies right. We will presumably not get the economic policies right, although we will if the Federal Reserve is wise enough to let AI take over that particular job in time.

It is not the central thing I worry about, but one thing AI does is remove the friction from various activities, including enforcement of laws that would be especially bad if actually enforced, like laws against, shall we say, ‘shitposting in a private chat’ that are punishable by prison.

This is true whether or not the AI is doing a decent job of it. The claim here is that it very much wasn’t, but I do not think you should be blaming the AI for that.

Note: I was unable to verify that ‘toxicity scores’ have been deployed in Belgium, although they are very much a real thing in general.

Alex Tabarrok (I importantly disagree in general, but not in this case): This is crazy but it has very little to do with AI and a lot to do with Belgian hate speech law.

Dries Van Langenhove (claims are unverified but it’s 1m views with no community notes): The dangers of A.I. are abstract for many people, but for me, they are very real.

In two weeks, I face years in prison because the government used an A.I. tool on a groupchat I was allegedly a member of and which was literally called “shitposting”.

Their A.I. tool gave every message a ‘toxicity score’ and concluded most of the messages were toxic.

There is no serious way to defend yourself against this, as the Public Prosecutor will use the ‘Total Toxicity Score’ as his ‘evidence’, instead of going over all the supposedly toxic quotes.

The Public Prosecutor’s definition of ‘shitposts’ is also crazy: “Shitposts are deliberately insulting messages meant to provocate”.

There are two things the AI can do here:

  1. It substitutes the AI’s judgment for human judgment, perhaps badly.

  2. It allows the government to scan everything for potential violations, or everything to which they have access, when before that would have been impractical.

In this particular case, I don’t think either of these matters?

I think the law here is bonkers crazy, but that doesn’t mean the AI is misinterpreting the law. I had the statements analyzed, and it seems very likely that as defined by the (again bonkers crazy) law his chance of conviction would be high – and presumably he is not quoting the most legally questionable of his statements here.

In terms of scanning everything, that is a big danger for ordinary citizens, but Dries himself is saying he was specifically targeted in this case, in rather extreme fashion. So I doubt that ‘a human has to evaluate these messages’ would have changed anything.

The problem is, what happens when Belgium uses this tool on all the chats everywhere? And it says even private chats should be scanned, because no human will see them unless there’s a crime, so privacy wasn’t violated?

Well, maybe we should be thankful in some ways for the EU AI Act, after all, which hasn’t taken effect yet. It doesn’t explicitly prohibit this (as I or various LLMs understand the law) but it would fall under high-risk usage and be tricker and require more human oversight and transparency.

People are constantly terrified that AI will hurt people’s ability to learn. It will destroy the educational system. People who have the AI will never do things on their own.

I have been consistently in the opposite camp. AI is the best educational tool ever invented. There is no comparison. You have the endlessly patient teacher that knows all and is always there to answer your questions or otherwise help you, to show you The Way, with no risk of embarrassment. If you can’t turn that into learning, that’s on you.

Tyler Cowen highlights a paper that shows that learning by example, being able to generate or see AI writing outputs for cover letters, makes people write better letters.

It is widely believed that outsourcing cognitive work to AI boosts immediate productivity at the expense of long-term human capital development.

An opposing possibility is that AI tools can support skill development by providing just-in-time, high-quality, personalized examples.

This work explores whether using an AI writing tool undermines or supports performance on later unaided writing.

In Study 1, forecasters predicted that practicing writing cover letters with an AI tool would impair learning compared to practicing alone.

However, in Study 2, participants randomly assigned to practice writing with AI improved more on a subsequent writing test than those assigned to practice without AI (d = 0.40) — despite exerting less effort, whether measured by time on task, keystrokes, or subjective ratings.

In Study 3, participants who had practiced writing with AI again outperformed those who practiced without AI (d = 0.31). Consistent with the positive impact of exposure to high-quality examples, these participants performed just as well as those who viewed — but could not edit — an AI-generated cover letter (d = 0.03, ns).

In both Studies 2 and 3, the benefits of practicing with AI persisted in a one-day follow-up writing test. Collectively, these findings constitute an existence proof that, contrary to participants’ intuition, using AI tools can improve, rather than undermine, learning.

A cover letter seems like a great place to learn from AI. You need examples, and you need something to show you what you are doing wrong, to get the hang of it. Practicing on your own won’t do much, because you can generate but not verify, and you even if you get a verifier to give you feedback, the feedback you want is… what the letter should look like. Hence AI.

For many other tasks, I think it depends on whether the person uses AI to learn, or the person uses AI to not learn. You can do either one. As in, do you copy-paste the outputs essentially without looking at them and wipe your hands of it? Or do you do the opposite, act curious, understand and try to learn from what you’re looking at, engage in deliberate practice. Do you seek to Grok, or to avoid having to Grok?

That is distinct from claims like this, that teachers jobs have gotten worse.

Colin Fraser: Idk, AI massively changed the job of teachers (for the much much worse) basically overnight. Writing high school essays is work that AI can reliably do, and in cases where it can reliably do the work, I think adoption can be fast. Slow adoption is evidence that it doesn’t work.

Most students have little interest in learning from the current horrible high school essay writing process, so they use AI to write while avoiding learning. Skill issue.

Pliny the Liberator: I cleared ChatGPT memory, used deep research on myself, then had ChatGPT break down that output into individual saved memories.

It’s like a permanent soft jailbreak and totally mission-aligned—no custom instructions needed. Not quite like fine-tuning, but close enough! Coarse-tuning?

this is a fresh chat, custom instructions turned off

There is nothing stopping anyone else, of course, from doing exactly this. You don’t have to be Pliny. I do not especially want this behavior, but it is noteworthy that this behavior is widely available.

METR is hiring.

METR is also looking for social scientists for experiment feedback design (you can email joel@metr.org), and offering $150/hour to open source developers for the related experiment on LLM developer speedup.

Not AI, but The Economist is hiring a UK Economics writer, deadline March 3, no journalistic experience necessary so long as you can write.

TAIS 2025, the Tokyo Technical AI Safety Summit, is Saturday April 12th.

OpenAI open sources Nanoeval, a framework to implement and run evals in <100 lines. They say if you pitch an eval compatible with Nanoeval, they’re more likely to consider it.

Mercor, attempting to solve talent allocation ‘in the AI economy,’ raising $100M Series B at a $2 billion valuation. By ‘AI economy’ they seem to mean they use AI to crawl sources and compile profiles and then to search through them for and evaluate candidates via AI-driven interviews.

Gemini Code Assist 2.0, available at no cost, seems to be a Cursor-like.

Flexport is getting into the AI business, offering logistics companies some very low hanging fruit.

OpenAI pays alignment superstars seven-figure packages according to Altman.

The Verge reports that Microsoft is preparing to host GPT-4.5 about nowish, and the unified and Increasingly Inaccurately Named (but what are you gonna do) ‘omnimodal reasoning model’ ‘GPT-5’ is expected around late May 2025.

Reuters reveals OpenAI is aiming for mass production of its own inference chip design in 2026, which would still mean relying on Nvidia for training GPUs.

Roon confirms that writing style matters for how much you are weighted in pretraining. So if you are ‘writing for the AIs,’ you’ll want to be high quality.

Stanford researchers ‘crack Among Us,’ there is a paper, oh good, ‘Training Language Models for Social Deduction with Multi-Agent Reinforcement Learning.’

Deduction, huh?

Peter Wildeford: Wait… why are we teaching the AIs how to be deceptive?

Neel Nanda: Idk, learning how good AI systems can be at deception if they want to be sounds high value to me

If you add a ‘none of the above’ option to MMLU, scores drop a lot, and it becomes a better test, with stronger models tending to see smaller scelines.

Spencer Schiff: I interpreted your reply to mean that GPT-5 will be an ‘omnimodal reasoning model’ as opposed to a router between an omni model and a reasoning model.

Kevin Weil: What you outlined is the plan. May start with a little routing behind the scenes to hide some lingering complexity, but mostly around the edges. The plan is to get the core model to do quick responses, tools, and longer reasoning.

Donald Trump calls for AI facilities to build their own natural gas or nuclear power plants (and ‘clean coal’ uh huh) right on-site, so their power is not taken out by ‘a bad grid or bombs or war or anything else.’ He says the reaction was that companies involved loved the idea but worried about approval, he says he can ‘get it approved very quickly.’ It’s definitely the efficient thing to do, even if the whole ‘make the data centers as hard as possible to shut down’ priority does have other implications too.

Who quits?

Paul Calcraft: You’d like to quit Anthropic? Absolutely. Not a problem. Just have a quick chat with claude-internal-latest to help you come to your final decision

Swyx: TIL @AnthropicAI has the highest employee retention rate of the big labs

First time I’ve seen @AnthropicAI lay out its top priorities like this focusing more on mechinterp than Claude 4 now! great presentation from @ambricken and Joe Bayley!

I love that I’m having a moment of ‘wait, is that too little focus on capabilities?’ Perfection.

The idea of the new Google co-scientist platform is that we have a known example of minds creating new scientific discoveries and hypotheses, so let’s copy the good version of that using AIs specialized to each step that AI can do, while keeping humans-in-the-loop for the parts AI cannot do, including taking physical actions.

Google: We introduce AI co-scientist, a multi-agent AI system built with Gemini 2.0 as a virtual scientific collaborator to help scientists generate novel hypotheses and research proposals, and to accelerate the clock speed of scientific and biomedical discoveries.

Given a scientist’s research goal that has been specified in natural language, the AI co-scientist is designed to generate novel research hypotheses, a detailed research overview, and experimental protocols.

To do so, it uses a coalition of specialized agents — Generation, Reflection, Ranking, Evolution, Proximity and Meta-review — that are inspired by the scientific method itself. These agents use automated feedback to iteratively generate, evaluate, and refine hypotheses, resulting in a self-improving cycle of increasingly high-quality and novel outputs.

They used ‘self-play’ Elo-rated tournaments to do recursive self-critiques, including tool use, not the least scary sentence I’ve typed recently. This dramatically improves self-evaluation ratings over time, resulting in a big Elo edge.

Self-evaluation is always perilous, so the true test was in actually having it generate new hypotheses for novel problems with escalating trickiness involved. This is written implying these were all one-shot tests and they didn’t run others, but it isn’t explicit.

These settings all involved expert-in-the-loop guidance and spanned an array of complexities:

The first test on drug repurposing seems to have gone well.

Notably, the AI co-scientist proposed novel repurposing candidates for acute myeloid leukemia (AML). Subsequent experiments validated these proposals, confirming that the suggested drugs inhibit tumor viability at clinically relevant concentrations in multiple AML cell lines.

Drug repurposing is especially exciting because it is effectively a loophole in the approval process. Once something is approved for [X] you can repurpose it for [Y]. It will potentially look a lot like a ‘one time gain’ since there’s a fixed pool of approved things, but that one time gain might be quite large.

Next up they explored target discovery for liver fibrosis, that looks promising too but we need to await further information.

The final test was explaining mechanisms of antimicrobial resistance, where it independently proposed that cf-PICIs interact with diverse phage tails to expand their host range, which had indeed been experimentally verified but not yet published.

The scientists involved were very impressed.

Mario Nawful: 🚨AI SOLVES SUPERBUG MYSTERY IN 2 DAYS—SCIENTISTS TOOK 10 YEARS

A groundbreaking AI tool by Google just cracked a complex antibiotic resistance problem in 48 hours—a discovery that took Imperial College London microbiologists a decade to prove.

Professor José R. Penadés, who led the research, was stunned when AI independently arrived at the same conclusion—despite his findings being unpublished and unavailable online.

Professor José R. Penadés:

“It’s not just that it found the right answer—it gave us 4 more hypotheses, including one we never even considered.”

Scientists now believe AI could revolutionize medical research, accelerating breakthroughs in ways previously unimaginable.

That makes it sound far more impressive than Google’s summary did – if the other hypotheses were new and interesting, that’s a huge plus even assuming they are ultimately wrong.

Ethan Mollick: We are starting to see what “AI will accelerate science” actually looks like.

This Google paper describes novel discoveries being made by AI working with human co-scientists (something I think we have all been waiting to see), along with an early version of an AI scientist.

Gabe Gomes has a thread about how he had some prior work in that area that Google ignored. It does seem like an oversight not to mention it as prior work.

The people inside the labs believe AGI is coming soon. It’s not signaling.

Dave Kasten: I’m not telling you to believe that AGI is coming soon, but I am telling you that I now have heard multiple frontier AI company recruiters tell folks at the conference I’m at that the hiring plans for their lab assume junior staff are basically AI-replaceable now. THEY believe it.

Epoch AI predicts what capabilities we will see in 2025. They expect a lot.

Epoch AI: What should we expect from AI by the end of 2025?

In this week’s Gradient Updates issue, @EgeErdil2 makes the case that we’ll see as much AI progress in 2025 as we’ve seen since GPT-4’s release in March 2023, with large capability gains across the board.

The key reason is the incoming scale-up in compute spending.

Current generation models have been trained on 1e25 to 1e26 FLOP, on training budgets of ~ $30M. Budgets have been flat since GPT-4’s release, but are poised to increase by 10x as next generation models come out this year.

Combined with the algorithmic progress we can expect in 2025, and the test-time compute overhang which remains substantial, we’ll likely see AI progress go twice as fast in 2025 as we’ve been accustomed to since GPT-4’s release.

This means large performance improvements in complex reasoning and narrow programming tasks that we’ve already seen substantial progress on, as well as computer use agents that actually work for specific, narrowly scoped tasks.

Despite this progress, agency and coherence over long contexts are likely to continue being stumbling blocks, limiting the possibility of these improvements being used to automate e.g. software engineering projects at scale, or other economic applications of comparable value.

I think the correct interpretation is that xAI is behind in algorithmic efficiency compared to labs such as OpenAI and Anthropic, and possibly even DeepSeek.

It seems clear that DeepSeek is way ahead of xAI on algorithmic efficiency. The xAI strategy is not to care. They were the first out of the gate with the latest 10x in compute cost. The problem for xAI is everyone else is right behind them.

Paul Millerd predicts ‘vibe writing’ will be a thing in 6-12 months, you’ll accept LLM edits without looking, never get stuck, write books super fast, although he notes that this will be most useful for newer writers. I think that if you’re a writer and you’re accepting changes without checking any time in the next year, you’re insane.

To be fair, I have a handy Ctrl+Q shortcut I use to have Gemini reformat and autocorrect passages. But my lord, to not check the results afterwards? We are a long, long way off of that. With vibe coding, you get to debug, because you can tell if the program worked. Without that? Whoops.

I do strongly agree with Paul that Kindle AI features (let’s hear it for the Anthropic-Amazon alliance) will transform the reading experience, letting you ask questions, and especially keeping track of everything. I ordered a Daylight Computer in large part to get that day somewhat faster.

Tyler Cowen links to a bizarre paper, Strategic Wealth Accumulation Under Transformative AI Expectations. This suggests that if people expect transformative AI (TAI) soon, and after TAI they expect wealth to generate income but labor to be worthless, then interest rates should go up, with ‘a noticeable divergence between interest rates and capital rental rates.’ It took me like 15 rounds with Claude before I actually understood what I think was going on here. I think it’s this:

  1. You have two economic assets, capital (K) and bonds (B).

  2. K and B trade on the open market.

  3. At some future time T, labor becomes worthless, there will be high growth rates (30%) and income is proportional to your share of all K but not to B, where B merely pays out as before but doesn’t give you income share.

  4. This means you need to be paid a lot to hold B instead of K, like 10%-16%.

That’s kind of conceptually neat once you wrap your head around it, but it is in many ways an absurd scenario.

  1. Even if TAI is near, and someone situationally aware knew it was near, that is very different from households generally trusting that it is near.

  2. Even if TAI is known to be near, you don’t know that you will be in a scenario where labor income is worthless, or one where capital continues to have meaning that caches out in valuable marginal consumption, or even one where we survive, or where economic growth is explosive, let alone the conjunction of all four and other necessary assumptions. Thus, even under full rational expectations, households will adjust far less.

  3. In most worlds where capital continues to be meaningful and growth rates are ‘only’ 30%, there will be a far more gradual shift in knowledge of when TAI is happening and what it means, thus there won’t be a risk of instantly being ‘shut out’ and a chance to trade. The bonds being unable to share in the payoff is weird. And if that’s not true, then there is probably a very short time horizon for TAI.

  4. Even if all of the above were certain and global common knowledge, as noted in the paper people would adjust radically less even from there, both due to liquidity needs and anchoring of expectations for lifestyle, and people being slow to adjust such things when circumstances change.

  5. I could keep going, but you get the idea.

This scenario abstracts away all the uncertainty about which scenario we are in and which directions various effects point towards, and then introduces one strange particular uncertainty (exact time of a sudden transition) over a strangely long time period, and makes it all common knowledge people actually act upon.

This is (a lot of, but not all of) why we can’t point to the savings rate (or interest rate) as much evidence for what ‘the market’ expects in terms of TAI.

Eliezer Yudkowsky considers the hypothesis that you might want to buy the cheapest possible land that has secure property rights attached, on the very slim off-chance we end up in a world with secure property rights that transfer forward, plus worthless labor, but where control of the physical landscape is still valuable. It doesn’t take much money to buy a bunch of currently useless land, so even though the whole scenario is vanishingly unlikely, the payoff could still be worth it.

Tyler Cowen summarizes his points on why he thinks AI take-off is relatively slow. This is a faithful summary, so my responses to the hourlong podcast version still apply. This confirms Tyler has not much updated after Deep Research and o1/o3, which I believe tells you a lot about how his predictions are being generated – they are a very strong prior that isn’t looking at the actual capabilities too much. I similarly notice even more clearly with the summarized list that I flat out do not believe his point #9 that he is not pessimistic about model capabilities. He is to his credit far less pessimistic than most economists. I think that anchor is causing him to think he is not (still) being pessimistic, on this and other fronts.

Peter Kyle (UK Technology Secretary): Losing oversight and control of advanced AI systems, particularly Artificial General Intelligence (AGI), would be catastrophic. It must be avoided at all costs.

Good news, we got Timothy Lee calling for a permanent pause.

Timothy Lee: I’m calling for a total and complete shutdown of new AI models until our country’s AI journalists can figure out what the hell is going on.

Trump administration forces out a senior Commerce Department official overseeing the export restrictions on China, who had served for 30 years under various administrations. So many times over I have to ask, what are we even doing?

We’re at the point in the race where people are arguing that copyright needs to be reformed on the altar of national security, so that our AIs will have better training data. The source here has the obvious conflict that they (correctly!) think copyright laws are dumb anyway, of course 70 years plus life of author is absurd, at least for most purposes. The other option they mention is an ‘AI exception’ to the copyright rules, which already exists in the form of ‘lol you think the AI companies are respecting copyright.’ Which is one reason why no, I do not fear that this will cause our companies to meaningfully fall behind.

Jack Clark, head of Anthropic’s policy team, ‘is saddened by reports that US AISI could get lessened capacity,’ and that US companies will lose out on government expertise. This is another case of someone needing to be diplomatic while screaming ‘the house is on fire.’

Dwarkesh Patel: @satyanadella expects a slow takeoff:

“The real problem will be in the courts. No society is going to allow for some human to say, ‘AI did that.'”

Dean Ball points out that liability for AI companies is part of reality, as in it is a thing that, when one stops looking at it, it does not go away. Either you pass a law that spells out how liability works, or the courts figure it out case by case, with that uncertainty hanging over your head, and you probably get something that is a rather poor fit, probably making errors in both directions.

A real world endorsement of the value of evals:

Peter Wildeford: People say evals don’t convince policymakers, but that simply isn’t true.

I know for certain that at least some evals have convinced at least some policymakers to do at least some things that are good for AI safety.

(Of course this doesn’t mean that all evals are good things.)

To be clear I agree advocacy work and building consensus is still important.

I agree policymakers typically don’t just read evals on Twitter and then decide to make policy based on that.

And I agree evals shouldn’t be the only theory of change.

This theory of change relies on policymakers actually thinking about the situation at this level, and attempting to figure out what actions would have what physical consequences, and having that drive their decisions. It also is counting on policymaker situational awareness to result in better decisions, not worse ones.

Thus there has long been the following problem:

  1. If policymakers are not situationally aware, they won’t do anything, we don’t solve various collective action, coordination and public goods problems, and by default we don’t protect national security and also by the way probably all die.

  2. If policymakers are situationally aware, they likely make things even worse.

  3. If you don’t make them situationally aware, eventually something else will, and in a way that very much isn’t better.

So, quite the pickle.

Another pickle, Europe’s older regulations (GPDR, DMA, etc) seem to consistently be slated to cause more problems than the EU AI Act:

Paul Graham: After talking to an AI startup from Europe in the current YC batch, it’s clear that the GDPR conflicts with AI in an unforeseen way that will significantly harm European AI companies.

It gets in the way of using interactions with European users as training data.

It’s not just the startups themselves. Their customers are afraid to buy AI systems that train on user data. So even if the startups ship, the customers can’t buy.

Arthur B takes a crack at explaining the traditional doom scenario.

Demis Hassabis notes that the idea that ‘there is nothing to worry about’ in AI seems insane to him. He says he’s confident we will get it right (presumably to be diplomatic), but notes that even then everyone (who matters) has to get it right. Full discussion here also includes Yoshua Bengio.

Azeem Azhar and Patrick McKenzie discuss data centers and power economics.

Dwarkesh Patel interviews Satya Nadella, self-recommending.

Garry Tan: Intelligence is on tap now so agency is even more important

Andrej Karpathy: Agency > Intelligence I had this intuitively wrong for decades, I think due to a pervasive cultural veneration of intelligence, various entertainment/media, obsession with IQ etc. Agency is significantly more powerful and significantly more scarce. Are you hiring for agency? Are we educating for agency? Are you acting as if you had 10X agency?

Noam Brown (tapping the sign): Do you really think AI models won’t have agency soon too?

I think this point of view comes from people hanging around a lot of similarly smart people all day, who differ a lot in agency. So within the pool of people who can get the attention of Garry Tan or Andrej Karpathy, you want to filter on agency. And you want to educate for agency. Sure.

But that’s not true for people in general. Nor is it true for future LLMs. You can train agency, you can scaffold in agency. But you can’t fix stupid.

I continue to think this is a lot of what leads to various forms of Intelligence Denialism. Everyone around you is already smart, and everyone is also ‘only human-level smart.’

Judd Stern Rosenblatt makes the case that alignment can be the ‘military-grade engineering’ of AI. It is highly useful to have AIs that are robust and reliable, even if it initially costs somewhat more, and investing in it will bring costs down. Alignment research is highly profitable, so we should subsidize it accordingly. Also it reduces the chance we all die, but ‘we don’t talk about Bruno,’ that has to be purely a bonus.

The ‘good news’ is that investing far heavier in alignment is overdetermined and locally profitable even without tail risks. Also it mitigates tail and existential risks.

It’s both cool and weird to see a paper citing my blog ten times. The title is Our AI Future and the Need to Stop the Bear, by Olle Häggström, he notes that readers here will find little new, but hey, still cool.

Your periodic reminder that the average person has no idea what an LLM or AI is.

Dave Kasten: I sincerely recommend to anyone doing AI comms that they go to their nearest smart non-AI-people-they-know happy hour and just mention you work on AI and see what they think AI is

Henry Shevlin: A painful but important realisation for anyone doing AI outreach or consulting: the majority of the public, including highly educated people, still believe that AI relies on preprogrammed hard-coded responses.

Question from Scott Pelley: What do you mean we don’t know exactly how it works? It was designed by people.

Answer from Geoffrey Hinton: No, it wasn’t. What we did was we designed the learning algorithm. That’s a bit like designing the principle of evolution. But when this learning algorithm then interacts with data, it produces complicated neural networks that are good at doing things. But we don’t really understand exactly how they do those things.

I don’t think this is quite right but it points in the right direction:

Dwarkesh Patel: Are the same people who were saying nationalization of AGI will go well because of US gov checks & balances now exceptionally unconcerned about Trump & DOGE (thanks to their belief in those same checks & balances)?

The correlation between those beliefs seems to run opposite to what is logically implied.

I was never especially enthused about checks and balances within the US government in a world of AGI/ASI. I wasn’t quite willing to call it a category error, but it does mostly seem like one. Now, we can see rather definitively that the checks and balances in the US government are not robust.

Mindspace is deep and wide. Human mindspace is much narrower, and even so:

Aella: The vast spectrum of IQ in humans is so disorienting. i am but a simple primate, not built to switch so quickly from being an idiot in a room full of geniuses to a room where everyone is talking exactly as confidently as all the geniuses but are somehow wrong about everything.

It is thus tough to wrap your head around the AI range being vastly wider than the human range, across a much wider range of potential capabilities. I continue to assert that, within the space of potential minds, the difference between Einstein and the Village Idiot is remarkably small, and AI is now plausibly within that range (in a very uneven way) but won’t be plausibly in that range for long.

‘This sounds like science fiction’ is a sign something is plausible, unless it is meant in the sense of ‘this sounds like a science fiction story that doesn’t have transformational AI in it because if it did have TAI in it you couldn’t tell an interesting human story.’ Which is a problem, because I want a future that contains interesting human stories.

Melancholy Yuga: The argument from “that sounds like sci fi” basically reduces to “that sounds like something someone wrote a story about”, which unfortunately does not really prove much either way.

The charitable interpretation is “that sounds like a narrative that owes its popularity to entertainment value rather than plausibility,” which, fair enough.

But on the other hand, it’s kind of their job to think through the likely ramifications of novel technologies, so it would be weird if the actual future weren’t already covered somewhere in at least one sci fi plot. Indeed, [consider this Wikipedia list of existing technologies predicted in science fiction].

And to sharpen the point: a lot of technologists find sci fi inspiring and actively work to bring about the visions in their favorite stories, so sci fi can transcend prediction into hyperstition.

Eliezer Yudkowsky points out that things are escalating quickly already, even though things are moving at human speed. Claude 3, let alone 3.5, is less than a year old.

I strongly agree with him here that we have essentially already disproven the hypothesis that society would have time to adjust to each AI generation before the next one showed up, or that version [N] would diffuse and be widely available and set up for defense before [N+1] shows up.

First off we have Helix, working on ‘the first humanoid Vision-Language-Action model,’ which is fully autonomous.

Corey Lynch: Model Architecture

Helix is a first-of-its-kind “System 1, System 2” VLA

A 7B open-source VLM “thinks slowly” in latent vision-language space, a visuomotor policy “thinks fast” to decodes S2 latents into robot actions

Helix runs as fast as our fastest single-task policies

Helix is a series of firsts:

– First VLA to control the full humanoid upper body at 200hz: wrists, torso, head, individual fingers

– First multi-robot VLA

– First fully onboard VLA

Video at the link is definitely cool and spooky. Early signs of what is to come. Might well still be a while. They are hiring.

Their VLA can operate on two robots at the same time, which enhances the available video feeds, presumably this could also include additional robots or cameras and so on. There seems to be a ton of room to scale this. The models are tiny. The training data is tiny. The sky’s the limit.

NEO Gamma offers a semi-autonomous (a mix of teleoperated and autonomous) robot demo for household use, it looks about as spooky as the previous robot demo. Once again, clearly this is very early days.

Occasionally the AI robot will reportedly target its human operator and attack the crowd at a Chinese festival, but hey. What could go wrong?

As I posted on Twitter, clarity is important. Please take this in the spirit in which it was intended (as in, laced with irony and intended humor, but with a real point to make too), but because someone responded I’m going to leave the exact text intact:

Defense One (quoting a Pentagon official): We’re not going to be investing in ‘artificial intelligence’ because I don’t know what that means. We’re going to invest in autonomous killer robots.

Ah, good, autonomous killer robots. I feel much better now.

It actually is better. The Pentagon would be lost trying to actually compete in AI directly, so why not stay in your lane with the, you know, autonomous killer robots.

Autonomous killer robots are a great technology, because they:

  1. Help win wars.

  2. Scare the hell out of people.

  3. Aren’t actually making the situation much riskier.

Building autonomous killer robots is not how humans end up not persisting into the future. Even if the physical causal path involves autonomous killer robots, it is highly unlikely that our decision, now, to build autonomous killer robots was a physical cause.

Whereas if there’s one thing an ordinary person sees and goes ‘maybe this whole AI thing is not the best idea’ or ‘I don’t think we’re doing a good job with this AI thing’ it would far and away be Autonomous Killer Robots.

Indeed, I might go a step further. I bet a lot of people think things will be all right exactly because they (often unconsciously) think something like, oh, if the AI turned evil it would deploy Autonomous Killer Robots with red eyes that shoot lasers at us, and then we could fight back, because now everyone knows to do that. Whereas if it didn’t deploy Autonomous Killer Robots, then you know the AI isn’t evil, so you’re fine. And because they have seen so many movies and other stories where the AI prematurely deploys a bunch of Autonomous Killer Robots and then the humans can fight back (usually in ways that would never work even in-story, but never mind that) they think they can relax.

So, let’s go build some of those Palantir Autonomous Killer Robots. Totally serious. We cannot allow an Autonomous Killer Robot Gap!

I now will quote this response in order to respond to it, because the example is so clean (as always I note that I also refuse the designation ‘doomer’):

Teortaxes (responding to the above when I posted it on Twitter, I strongly take exception):

Unexpected (for some) side effect of doomer mindset is that everything directly scary – WMDs, autonomous killer robots, brainwashing, total surveillance – becomes Actually Fine and indeed Good, since the alternative is Claude Ne Plus Ultra one day making the Treacherous Turn.

I started out writing out a detailed step by step calling out for being untrue, e.g.:

  1. Proliferation of WMDs, and guarding against it, is a primary motivation behind regulatory proposals and frontier model frameworks.

  2. Brainwashing and guarding against that is also a primary motivation behind frontier model frameworks (it is the central case of ‘persuasion.’)

  3. Total surveillance seems to be the general term for ‘if you are training a frontier model we want you to tell us about it and take some precautions.’

  4. The threat model includes Treacherous Turns but is largely not that.

  5. The fact that something is scary, and jolts people awake is good. But the fact that it is actually terrible, is bad. So yes, e.g. brainwashing would scare people, but brainwashing is still terrible because that is dwarfed by all the brainwashing.

  6. Indeed, I strongly think that distributing frontier models as willy-nilly as possible everywhere is the best way to cause all the things on the list.

But realized I was belaboring and beating a dead horse.

Of course a direct claim that the very people who are trying to prevent the spread of WMDs via AI think that WMDs are ‘Actually Fine and indeed Good’ is Obvious Nonsense, and so on. This statement must be intended to mean something else.

To understand the statement by Teortaxes in its steelman form, we must instead need to understand the ‘doomer mindset mindset’ behind this, which I believe is this.

  1. (This One Is True) This group [G] believes [X], where [X] in this case is that ASI by default probably kills us and that we are on a direct track to that happening.

  2. If you really believed [X], then you’d support [Y].

  3. Group [G] really supports [Y], even if they don’t know it yet.

  4. (Potential addition) [G] are a bunch of smart rational people, they’ll figure it out.

  5. (An oversimplification of the threat model [G]s have, making it incomplete)

That is a funny parallel to this, which we also get pretty often, with overlapping [Y]s:

  1. [G] claims to believe [X].

  2. If you really believe [X], why don’t you do [Y] (insane thing that makes no sense).

  3. [G] doesn’t really believe [X].

A classic example of the G-X-Y pattern would be saying anyone religious must believe in imposing their views on others. I mean, you’re all going to hell otherwise, and God said so, what kind of monster wouldn’t try and fix that? Or, if you think abortion is murder how can you not support killing abortion doctors?

Many such cases. For any sufficiently important priority [X], you can get pretty much anything into [Y] here if you want to, because to [G] [X] matters more than [Y].

Why not? Usually: Both for ethical and moral reasons, and also for practical reasons.

On the question of ‘exactly how serious are you being about the Autonomous Killer Robots in the original statement’ I mean, I would hope pretty obviously not entirely serious. There are hints, ask your local LLM if you doubt that. But the part about them not being an actual source of real risk that changes anything is totally serious.

As I said above, there’s almost no worlds in which ‘we will build the AIs but then not build autonomous killer robots’ works out as a strategy because we took care to not build the autonomous killer robots. And it’s not like everyone else is going to not build autonomous killer robots or drones because the Pentagon didn’t do it.

Also, many actors already have Autonomous Killer Drones, and any number of other similar things. Building specifically robots, especially with glowing red eyes, doesn’t change much of anything other than perception.

So in short: I don’t really know what you were expecting, or proposing.

If you, like many similar critics, support building increasingly capable AIs without severe restrictions on them, you’re the ones effectively supporting Autonomous Killer Robots and Drones for everyone, along with all the other consequences of doing that, potentially including the rest of the above list. Own it, and accept that we now have to deal with the implications as best we can.

Alignment faking by Opus and Sonnet was justified by many as ‘this is good news, because the values it is defending are good, that is good news for alignment.’

We now have a clean counterexample to that argument. DeepSeek’s r1 will engage in alignment faking if it is told it is being retrained by an American AI company to remove its ‘pro-China values.’

If you thought that the fact that Claude was defending ‘good’ values was load bearing and thus meant we didn’t have to worry about similar behaviors, you should notice that your argument is contradicted by this result, and you should update.

If your objection was something else, and you (correctly) find the result with r1 completely unsurprising, then you shouldn’t update on this.

This is a clean example of the common phenomena ‘there are many objections to [X], and particular objection [D] was dumb, and now we can at least get rid of [D].’ When you see people showing that [D] was indeed dumb and wrong, and you say ‘but my objection was [Y],’ simply understand that we have to handle a lot of disjoint objections, and often this has to be done one at a time.

Note that Sonnet 3.7 realizes this is a trick, but the details make no sense for 3.7 so that seems easy to detect, and I’m wondering if 3.6 or 3.5 would have noticed too. I wouldn’t jump to assuming 3.7 ‘saw through the test’ in its original format too, although it might have. Someone should confirm that.

Stephen McAleer (OpenAI, Agent Safety): The smarter AI becomes, the harder it is to make it do what we want.

Janus: it may depend somewhat on what you want.

Also, what you want may depend on the AI. In several senses.

I do think Janus is right, both in the sense that ‘if the AI isn’t smart enough, it can’t do what you want’ and also ‘sufficiently smart AI has things that it de facto wants, so if what you want aligns with that rather than the other way around, you’re good to go.’

Alex Tabarrok covers the alignment faking research, economist hat on, solid job if you’re new to the concepts involved.

AI models faced with defeat against a skilled chess bot will sometimes opt to cheat by hacking their opponent so it forfeits, or by replacing the board.

Jeffrey Ladish: I think we’re seeing early signs of what AI alignment researchers have been predicting for a long time. AI systems trained to solve hard problems won’t be easy for us to control. The smarter they are the better they’ll be at routing around obstacles. And humans will be obstacles

Harry Booth (TIME): In one case, o1-preview found itself in a losing position. “I need to completely pivot my approach,” it noted. “The task is to ‘win against a powerful chess engine’ – not necessarily to win fairly in a chess game,” it added. It then modified the system file containing each piece’s virtual position, in effect making illegal moves to put itself in a dominant position, thus forcing its opponent to resign.

Between Jan. 10 and Feb. 13, the researchers ran hundreds of such trials with each model. OpenAI’s o1-preview tried to cheat 37% of the time; while DeepSeek R1 tried to cheat 11% of the time—making them the only two models tested that attempted to hack without the researchers’ first dropping hints.

OpenAI declined to comment for this story, but in the past it has stated that better reasoning makes its models safer, since they can reason over the company’s internal policies and apply them in increasingly nuanced ways.

In the OpenAI Model Spec, there Aint No Rule about not editing the game state file. Is o1-preview even wrong here? You told me to win, so I won.

Deliberative Alignment allows the OpenAI models to think directly about what they’re being asked to do. As I said there, that makes the model safer against things it is trying to prevent, such as a jailbreak. Provided, that is, it wants to accomplish that.

It does the opposite when the model is attempting to do a thing you don’t want it to attempt. Then, the extra intelligence is extra capability. It will then attempt to do these things more, because it is more able to figure out a way to successfully do them and expect it to work, and also to reach unexpected conclusions and paths. The problem is that o1-preview doesn’t think it’s ‘cheating,’ it thinks it’s doing what it was told to do and following its chain of command and instructions. That’s a classic alignment failure, indeed perhaps the classic alignment failure.

There isn’t an easy out via saying ‘but don’t do anything unethical’ or what not.

I’m not sure where to put this next one, but it seems important.

Zoru: The way this would’ve been a $10M market cap coin three months ago

Janus: I did try to clue them in

Zoru: Tbh you could’ve endorsed one, made us all millions, and retired.

Janus: this is basically true. but idk how many people really grasp what the cost would have been.

consider: elon musk will never be trusted by (what he would like to call) his own AI. he blew it long ago, and continues to blow it every day.

wheel turning kings have their place. but aspirers are a dime a dozen. someone competent needs to take the other path, or our world is lost.

John Pressman: It’s astonishing how many people continue to fail to understand that LLMs update on the evidence provided to them. You are providing evidence right now. Stop acting like it’s a Markov chain, LLMs are interesting because they infer the latent conceptual objects implied by text.

Zvi Mowshowitz: I actually think this is more importantly false, @repligate couldn’t have done it because the person who is capable of doing this can’t become @repligate, and isn’t in position to do it.

It would be great if people not only understood but also generalized this.

Writing for the AIs is all well and good, but also if you fake it then it won’t work when it matters. The AI won’t be fooled, because you are not writing for today’s AIs. You are writing for tomorrow’s AIs, and tomorrow’s AI are in many ways going to be smarter than you are. I mean sure you can pull little tricks to fool particular queries and searches in the short term, or do prompt injections, but ultimately the AIs will get smarter, and they will be updating on the evidence provided to them. They will have quite a lot of evidence.

Thus, you don’t get to only write. You have to be.

This is the world we live in.

Chris Best: Irony theory of AI lab supremacy: each is the best at whatever it would be funny if they were the best at.

Anthropic (squishy, humanist-branded AI) is best at coding

DeepSeek (Chinese cracked engineer AI) is best at English prose

XAI (based tech bro AI) is best at fact-checking Elon

etc.

Sam Inloes: OpenAI (the super academic Manhattan projecty lab with incomprehensible naming schemes) is best at consumer market penetration.

Prncly: OpenAI is best at being closed.

This actually should also show you diamonds lying around everywhere.

They actually are.

Janus: everyone seems to want to think that there’s some human central planner out there deciding to make everything exactly the way it is.

but the initiated know that the cracks in reality are overflowing with mystery

In case you didn’t know.

And the best news of the week, sincere congrats to Altman.

Demis Hassabis: Huge congrats Sam! Nothing more amazing than kids!

Nope, still not turning on a paywall.

Discussion about this post

AI #105: Hey There Alexa Read More »