Author name: Kelly Newman

Office problems on Windows 10? Microsoft’s response will soon be “upgrade to 11.”

Microsoft’s advertised end-of-support date for Windows 10 is October 14, 2025. But in reality, the company will gradually wind down support for the enduringly popular operating system over the next three years. Microsoft would really like you to upgrade to Windows 11, especially if it also means upgrading to a new PC, but it also doesn’t want to leave hundreds of millions of home and business PCs totally unprotected.

Those competing goals have led to lots of announcements and re-announcements and clarifications about updates for both Windows 10 itself and the Office/Microsoft 365 productivity apps that many Windows users run on their PCs.

Today’s addition to the pile comes via The Verge, which noticed an update to a support document that outlined when Windows 10 PCs would stop receiving new features for the continuously updated Microsoft 365 apps. Most home users will stop getting new features in August 2026, while business users running the Enterprise versions can expect to stop seeing new features in either October 2026 or January 2027, depending on the product they’re using.

Microsoft had previously committed to supporting its Office apps through October 2028—both the Microsoft 365 versions and perpetually licensed versions like Office 2021 and Office 2024 that don’t get continuous feature updates. That timeline isn’t changing, but it will apparently only cover security and bug-fixing updates rather than updates that add new features.

And while the apps will still be getting updates, Microsoft’s support document makes it clear that users won’t always be able to get fixes for bugs that are unique to Windows 10. If an Office issue exists solely on Windows 10 but not on Windows 11, the official guidance from Microsoft support is that users should upgrade to Windows 11; any support for Windows 10 will be limited to “troubleshooting assistance only,” and “technical workarounds might be limited or unavailable.”

Office problems on Windows 10? Microsoft’s response will soon be “upgrade to 11.” Read More »

New Grok AI model surprises experts by checking Elon Musk’s views before answering

Seeking the system prompt

Owing to the unknown contents of the data used to train Grok 4 and the random elements thrown into large language model (LLM) outputs to make them seem more expressive, divining the reasons for particular LLM behavior can be frustrating for someone without insider access. But we can use what we know about how LLMs work to guide a better answer. xAI did not respond to a request for comment before publication.

To generate text, every AI chatbot processes an input called a “prompt” and produces a plausible output based on that prompt. This is the core function of every LLM. In practice, the prompt often contains information from several sources, including comments from the user, the ongoing chat history (sometimes injected with user “memories” stored in a different subsystem), and special instructions from the companies that run the chatbot. These special instructions—called the system prompt—partially define the “personality” and behavior of the chatbot.
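
To make this concrete, here is a minimal sketch of how such a prompt is typically assembled for a single turn. The `build_prompt` helper and field names are hypothetical illustrations, not xAI’s actual implementation:

```python
def build_prompt(system_prompt, memories, chat_history, user_message):
    """Assemble the full input an LLM sees for one conversational turn (toy example)."""
    messages = [{"role": "system", "content": system_prompt}]
    if memories:
        # Stored user "memories" are often injected as extra system-level context.
        messages.append({"role": "system",
                         "content": "User memories:\n" + "\n".join(memories)})
    messages.extend(chat_history)  # prior user/assistant turns
    messages.append({"role": "user", "content": user_message})
    return messages
```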

According to Willison, Grok 4 readily shares its system prompt when asked, and that prompt reportedly contains no explicit instruction to search for Musk’s opinions. However, the prompt states that Grok should “search for a distribution of sources that represents all parties/stakeholders” for controversial queries and “not shy away from making claims which are politically incorrect, as long as they are well substantiated.”

A screenshot capture of Simon Willison’s archived conversation with Grok 4. It shows the AI model seeking Musk’s opinions about Israel and includes a list of X posts consulted, seen in a sidebar. Credit: Benj Edwards

Ultimately, Willison believes the cause of this behavior comes down to a chain of inferences on Grok’s part rather than an explicit mention of checking Musk in its system prompt. “My best guess is that Grok ‘knows’ that it is ‘Grok 4 built by xAI,’ and it knows that Elon Musk owns xAI, so in circumstances where it’s asked for an opinion, the reasoning process often decides to see what Elon thinks,” he said.

Without official word from xAI, we’re left with a best guess. However, regardless of the reason, this kind of unreliable, inscrutable behavior makes many chatbots poorly suited for assisting with tasks where reliability or accuracy are important.

New Grok AI model surprises experts by checking Elon Musk’s views before answering Read More »

5 big EV takeaways from Trump’s “One Big Beautiful Bill”

Plus, OBBB got rid of penalties for automakers that fail to meet Corporate Average Fuel Economy standards. These standards have ramped up over the last 50 years, forcing auto companies to make their vehicles more fuel-efficient. They pushed manufacturers to, for example, get into hybrids and build some of the first modern electrics. Now, automakers will no longer have that extra incentive to cut emissions.

Keep your eye on your city or state

Just because federal EV support is going away doesn’t mean all government support is over in the US. “I do think we’ll see states step in to fill the gap,” says Harris. So it’s worth doing a bit of research to see what incentives exist where you live.

To date, 11 states—California, Colorado, Delaware, Maryland, Massachusetts, New Jersey, New Mexico, New York, Oregon, Rhode Island, and Washington—have joined together to experiment with new policies and programs that promote cleaner vehicles.

And last month, in the middle of a fight with the Trump administration over California’s power to set its own clean air rules, California governor Gavin Newsom directed state agencies to come up with new and innovative ways to support zero-emission vehicles. The state still plans to phase out sales of new gas cars by 2035.

Stay optimistic, EV fans

Industry watchers seem certain of one thing: Despite this setback in the US, electric vehicles are the future. So while American consumers and automakers try to figure out how to cope with uncertainty, electric progress will continue all over the world.

Expect China to continue to put out well-built and -priced EVs, and export them all over the world. “Americans are paying more and closer attention to those offerings, and eventually there’s going to be demand,” says Nigro. American companies are going to have to keep up—or else. ”That’s the existential crisis the industry faces,” he says.

Yoon, the Edmunds analyst, also expects the new bill to result in short-term electric pain. But he believes there’s light ahead. In fact, Yoon is so optimistic, he allows himself an auto metaphor. “Ultimately, this will be a speed bump rather than a true obstacle,” he says.

This story originally appeared at wired.com.

5 big EV takeaways from Trump’s “One Big Beautiful Bill” Read More »

Cops’ favorite AI tool automatically deletes evidence of when AI was used


AI police tool is designed to avoid accountability, watchdog says.

On Thursday, a digital rights group, the Electronic Frontier Foundation, published an expansive investigation into AI-generated police reports that the group alleged are, by design, nearly impossible to audit and could make it easier for cops to lie under oath.

Axon’s Draft One debuted last summer at a police department in Colorado, instantly raising questions about the feared negative impacts of AI-written police reports on the criminal justice system. The tool relies on a ChatGPT variant to generate police reports based on body camera audio, which cops are then supposed to edit to correct any mistakes, assess the AI outputs for biases, or add key context.

But the EFF found that the tech “seems designed to stymie any attempts at auditing, transparency, and accountability.” Not every department requires cops to disclose when AI is used, and Draft One does not save drafts or retain a record showing which parts of reports are AI-generated. Departments also don’t retain different versions of drafts, making it difficult to compare one version of an AI report against another and thus hard for the public to determine whether the technology is “junk,” the EFF said. That raises the question, the EFF suggested: “Why wouldn’t an agency want to maintain a record that can establish the technology’s accuracy?”

It’s currently hard to know if cops are editing the reports or “reflexively rubber-stamping the drafts to move on as quickly as possible,” the EFF said. That’s particularly troubling, the EFF noted, since Axon disclosed to at least one police department that “there has already been an occasion when engineers discovered a bug that allowed officers on at least three occasions to circumvent the ‘guardrails’ that supposedly deter officers from submitting AI-generated reports without reading them first.”

The AI tool could also possibly be “overstepping in its interpretation of the audio,” possibly misinterpreting slang or adding context that never happened.

A “major concern,” the EFF said, is that the AI reports can give cops a “smokescreen,” perhaps even allowing them to dodge consequences for lying on the stand by blaming the AI tool for any “biased language, inaccuracies, misinterpretations, or lies” in their reports.

“There’s no record showing whether the culprit was the officer or the AI,” the EFF said. “This makes it extremely difficult if not impossible to assess how the system affects justice outcomes over time.”

According to the EFF, Draft One “seems deliberately designed to avoid audits that could provide any accountability to the public.” In one video from a roundtable discussion the EFF reviewed, an Axon senior principal product manager for generative AI touted Draft One’s disappearing drafts as a feature, explaining, “we don’t store the original draft and that’s by design and that’s really because the last thing we want to do is create more disclosure headaches for our customers and our attorney’s offices.”

The EFF interpreted this to mean that “the last thing” that Axon wants “is for cops to have to provide that data to anyone (say, a judge, defense attorney or civil liberties non-profit).”

“To serve and protect the public interest, the AI output must be continually and aggressively evaluated whenever and wherever it’s used,” the EFF said. “But Axon has intentionally made this difficult.”

The EFF is calling for a nationwide effort to monitor AI-generated police reports, which are expected to be increasingly deployed in many cities over the next few years, and published a guide to help journalists and others submit records requests to monitor police use in their area. But “unfortunately, obtaining these records isn’t easy,” the EFF’s investigation confirmed. “In many cases, it’s straight-up impossible.”

An Axon spokesperson provided a statement to Ars:

Draft One helps officers draft an initial report narrative strictly from the audio transcript of the body-worn camera recording and includes a range of safeguards, including mandatory human decision-making at crucial points and transparency about its use. Just as with narrative reports not generated by Draft One, officers remain fully responsible for the content. Every report must be edited, reviewed, and approved by a human officer, ensuring both accuracy and accountability. Draft One was designed to mirror the existing police narrative process—where, as has long been standard, only the final, approved report is saved and discoverable, not the interim edits, additions, or deletions made during officer or supervisor review.

Since day one, whenever Draft One is used to generate an initial narrative, its use is stored in Axon Evidence’s unalterable digital audit trail, which can be retrieved by agencies on any report. By default, each Draft One report also includes a customizable disclaimer, which can appear at the beginning or end of the report in accordance with agency policy. We recently added the ability for agencies to export Draft One usage reports—showing how many drafts have been generated and submitted per user—and to run reports on which specific evidence items were used with Draft One, further supporting transparency and oversight. Axon is committed to continuous collaboration with police agencies, prosecutors, defense attorneys, community advocates, and other stakeholders to gather input and guide the responsible evolution of Draft One and AI technologies in the justice system, including changes as laws evolve.

“Police should not be using AI”

Expecting that Axon’s tool would likely spread fast—marketed as a time-saving add-on service to police departments that already rely on Axon for Tasers and body cameras—EFF senior policy analyst Matthew Guariglia told Ars that the EFF quickly formed a plan to track adoption of the new technology.

Over the spring, the EFF sent public records requests to dozens of police departments believed to be using Draft One. To craft the requests, they also reviewed Axon user manuals and other materials.

In a press release, the EFF confirmed that the investigation “found the product offers meager oversight features,” including a practically useless “audit log” function that seems contradictory to police norms surrounding data retention.

Perhaps most glaringly, Axon’s tool doesn’t allow departments to “export a list of all police officers who have used Draft One,” the EFF noted, or even “export a list of all reports created by Draft One, unless the department has customized its process.” Instead, Axon only allows exports of basic logs showing actions taken on a particular report or an individual user’s basic activity in the system, like logins and uploads. That makes it “near impossible to do even the most basic statistical analysis: how many officers are using the technology and how often,” the EFF said.

Any effort to crunch the numbers would be time-intensive, the EFF found. In some departments, it’s possible to look up individual cops’ records to determine when they used Draft One, but that “could mean combing through dozens, hundreds, or in some cases, thousands of individual user logs.” And it would take a similarly “massive amount of time” to sort through reports one by one, considering “the sheer number of reports generated” by any given agency, the EFF noted.

In some jurisdictions, cops are legally required to disclose when AI is used to generate reports, and some departments mandate it on their own, the EFF found. That disclosure made the documents more easily searchable and, in turn, made some police departments more likely to respond to public records requests without charging excessive fees or imposing substantial delays. But at least one department in Indiana told the EFF, “We do not have the ability to create a list of reports created through Draft One. They are not searchable.”

While not every cop can search their Draft One reports, Axon can, the EFF reported, suggesting that the company can track how much police use the tool better than police themselves can.

The EFF hopes its reporting will curtail the growing reliance on shady AI-generated police reports, which Guariglia told Ars risk becoming even more common in US policing without intervention.

In California, where some cops have long been using Draft One, a bill has been introduced that would require disclosures clarifying which parts of police reports are AI-generated. That law, if passed, would also “require the first draft created to be retained for as long as the final report is retained,” which Guariglia told Ars would make Draft One automatically unlawful as currently designed. Utah is weighing a similar but less robust initiative, the EFF noted.

Guariglia told Ars that the EFF has talked to public defenders who worry how the proliferation of AI-generated police reports is “going to affect cross-examination” by potentially giving cops an easy scapegoat when accused of lying on the stand.

To avoid the issue entirely, at least one district attorney’s office in King County, Washington, has banned AI police reports, citing “legitimate concerns about some of the products on the market now.” Guariglia told Ars that one of the district attorney’s top concerns was that using the AI tool could “jeopardize cases.” The EFF is now urging “other prosecutors to follow suit and demand that police in their jurisdiction not unleash this new, unaccountable, and intentionally opaque AI product.”

“Police should not be using AI to write police reports,” Guariglia said. “There are just too many questions left unanswered about how AI would translate the audio of situations, whether police will actually edit those drafts, and whether the public will ever be able to tell what was written by a person and what was written by a computer. This is before we even get to the question of how these reports might lead to problems in an already unfair and untransparent criminal justice system.”

This story was updated to include a statement from Axon. 

Photo of Ashley Belanger

Ashley is a senior policy reporter for Ars Technica, dedicated to tracking social impacts of emerging policies and new technologies. She is a Chicago-based journalist with 20 years of experience.

Cops’ favorite AI tool automatically deletes evidence of when AI was used Read More »

Mazda reveals next-gen CX-5 details, including a hybrid, due in 2027

A red third-generation Mazda CX-5 in profile

The new model goes on sale this year in Europe. Credit: Mazda

More recently, Mazda borrowed the hybrid powertrain from Toyota’s RAV4 and dropped it into the CX-50—conveniently, both SUVs are built in the same shared factory in Huntsville, Alabama. (In exchange, Toyota gets access to Mazda’s Soul Red paint for the RAV4, which is a pretty fair swap.)

The CX-5 and CX-50 will continue to coexist in dealerships: The former is a global bestseller, and the latter is made in the US for North American tastes. Finally, there will be a hybrid CX-5 to go with the hybrid CX-50, although not until 2027. Not much is known about the new “Skyactiv-Z” engine other than that it will be a four-cylinder gasoline engine that operates at the ideal stoichiometric ratio of air to fuel throughout the rev range.

For 2026, though, the CX-5 will come with Mazda’s Skyactiv-G 2.5 L four-cylinder gasoline engine. Mazda has also developed a new generation of infotainment system for the CX-5, joining the growing list of automakers that have adopted Google’s Android Automotive OS and Google automotive services.

The addition of a hybrid in 2027 will be welcome, as Mazda has often lagged behind in terms of fuel efficiency. Credit: Mazda

Expect pricing much closer to the car’s official US launch in 2026.

Mazda reveals next-gen CX-5 details, including a hybrid, due in 2027 Read More »

Linda Yaccarino quits X without saying why, one day after Grok praised Hitler

And “the best is yet to come as X enters a new chapter” with xAI, Yaccarino said.

Grok cites “growing tensions” between Musk and CEO

It’s unclear how Yaccarino’s departure could influence X advertisers who may have had more confidence in the platform with her at the helm.

Eventually, Musk commented on Yaccarino’s announcement, thanking her for her contributions but saying little else about her departure. Separately, he responded to Thierry Breton, former European Union commissioner for the internal market, who joked that “Europe’s got talent” if Musk “needs help.” The X owner, who previously traded barbs with Breton over alleged X disinformation, responded “sure” with a laugh-cry emoji.

Musk has seemingly been busy putting out fires, as the Grok account finally issued a statement confirming that X was working to remove “inappropriate” posts.

“Since being made aware of the content, xAI has taken action to ban hate speech before Grok posts on X,” the post explained, confirming that fixes go beyond simply changing Grok’s prompting.

But the statement illuminates one of the biggest problems with experimental chatbots that experts fear may play an increasingly significant role in spreading misinformation and hate speech. Once Grok’s outputs got seriously out of hand, it took “millions of users” flagging the problematic posts for X to “identify and update the model where training could be improved”—which X curiously claims was an example of the platform responding “quickly.”

If X expects that harmful Grok outputs reaching millions is what it will take to address emerging issues, X advertisers today are stuck wondering what content they could risk monetizing. Sticking with X could remain precarious at a time when the Federal Trade Commission has moved to block ad boycotts and Musk has updated X terms to force any ad customer arbitration into a chosen venue in Texas.

For Yaccarino, whose career took off based on her advertising savvy, leaving now could help her save face from any fallout from both the Grok controversy this week and the larger battle with advertisers—some of whom, she’s noted, she’s worked with “for decades.”

X did not respond to Ars’ request to comment on Yaccarino’s exit. If you ask Grok why Yaccarino left, the chatbot cites these possible reasons: “growing tensions” with Musk, frustrations with X brand safety, business struggles relegating her role to “chief apology officer,” and ad industry friends pushing her to get out while she can.

Linda Yaccarino quits X without saying why, one day after Grok praised Hitler Read More »

Sizing up the 5 companies selected for Europe’s launcher challenge

The European Space Agency has selected five launch startups to become eligible for up to 169 million euros ($198 million) in funding to develop alternatives to Arianespace, the continent’s incumbent launch service provider.

The five companies ESA selected are Isar Aerospace, MaiaSpace, Rocket Factory Augsburg, PLD Space, and Orbex. Only one of these companies, Isar Aerospace, has attempted to launch a rocket into orbit. Isar’s Spectrum rocket failed moments after liftoff from Norway on a test flight in March.

None of these companies is guaranteed an ESA contract or funding. Over the next several months, the European Space Agency and the five launch companies will negotiate with European governments for funding leading up to ESA’s ministerial council meeting in November, when ESA member states will set the agency’s budget for at least the next two years. Only then will ESA be ready to sign binding agreements.

In a press release, ESA referred to the five companies as “preselected challengers” in a competition for ESA support in the form of launch contracts and an ESA-sponsored demonstration to showcase upgraded launch vehicles capable of lifting heavier payloads into orbit. So far, all five of the challengers are focusing on small rockets.

Earlier this year, ESA released a request for proposals to European industry for bids to compete in the European Launch Challenge. ESA received 12 proposals from European companies and selected five to move on to the next phase of the challenge.

A new way of doing business

In this competition, ESA is eschewing a rule that governs nearly all of the space agency’s other programs. This policy, known as geographic return, guarantees industrial contracts to ESA member states commensurate with the level of money they put into each project. The most obvious example of this is Europe’s Ariane rocket family, whose development was primarily funded by France, followed by Germany in second position. Therefore, the Ariane 6 rocket’s core stage and engines are built in France, and its upper stage is manufactured in Germany.

Sizing up the 5 companies selected for Europe’s launcher challenge Read More »

Mike Lindell lost defamation case, and his lawyers were fined for AI hallucinations

Lawyers representing MyPillow and its CEO Mike Lindell were fined $6,000 after using artificial intelligence in a brief that was riddled with misquotes and citations to fictional cases.

Attorney Christopher Kachouroff and the law firm of McSweeney Cynkar & Kachouroff were fined $3,000, jointly and severally. Attorney Jennifer DeMaster was separately ordered to pay $3,000. This “is the least severe sanction adequate to deter and punish defense counsel in this instance,” US District Judge Nina Wang wrote in an order issued yesterday in the District of Colorado.

Kachouroff and DeMaster were defending Lindell against a defamation lawsuit filed by former Dominion Voting Systems executive Eric Coomer, whose complaint said Lindell and his companies “have been among the most prolific vectors of baseless conspiracy theories claiming election fraud in the 2020 election.”

The sanctioning of the lawyers came several weeks after a jury trial in which Coomer was awarded over $2.3 million in damages. The jury found that Lindell defamed Coomer and ordered him to pay $440,500. The jury also found that Lindell’s media company, Frankspeech, defamed Coomer and ordered it to pay damages of $1,865,500. The jury did not find that MyPillow defamed Coomer.

The February 25 brief that got Lindell’s lawyers in trouble was an opposition to Coomer’s motion asking the court to exclude certain evidence. Coomer’s motion was partially granted before the trial began.

“Correct” version still had wrong citations

As we wrote in an April article, Kachouroff and DeMaster said they accidentally filed a “prior draft” instead of the correct version. But Wang’s order yesterday said that even the so-called “correct” version “still has substantive errors,” such as inaccurate descriptions of previous cases. The original version had nearly 30 defective citations.

Mike Lindell lost defamation case, and his lawyers were fined for AI hallucinations Read More »

Unless users take action, Android will let Gemini access third-party apps

Starting today, Google is implementing a change that will enable its Gemini AI engine to interact with third-party apps, such as WhatsApp, even when users previously configured their devices to block such interactions. Users who don’t want their previous settings to be overridden may have to take action.

An email Google sent recently informing users of the change linked to a notification page that said that “human reviewers (including service providers) read, annotate, and process” the data Gemini accesses. The email provides no useful guidance for preventing the changes from taking effect. The email said users can block the apps that Gemini interacts with, but even in those cases, data is stored for 72 hours.

An email Google recently sent to Android users.

No, Google, it’s not good news

The email never explains how users can fully extricate Gemini from their Android devices and seems to contradict itself on how or whether this is even possible. At one point, it says the changes “will automatically start rolling out” today and will give Gemini access to apps such as WhatsApp, Messages, and Phone “whether your Gemini apps activity is on or off.” A few sentences later, the email says, “If you have already turned these features off, they will remain off.” Nowhere in the email or the support pages it links to are Android users informed how to remove Gemini integrations completely.

Compounding the confusion, one of the linked support pages requires users to open a separate support page to learn how to control their Gemini app settings. Following the directions from a computer browser, I accessed the settings of my account’s Gemini app. I was reassured to see the text indicating no activity has been stored because I have Gemini turned off. Then again, the page also said that Gemini was “not saving activity beyond 72 hours.”

Unless users take action, Android will let Gemini access third-party apps Read More »

How a big shift in training LLMs led to a capability explosion


Reinforcement learning, explained with a minimum of math and jargon.

Credit: Aurich Lawson | Getty Images

In April 2023, a few weeks after the launch of GPT-4, the Internet went wild for two new software projects with the audacious names BabyAGI and AutoGPT.

“Over the past week, developers around the world have begun building ‘autonomous agents’ that work with large language models (LLMs) such as OpenAI’s GPT-4 to solve complex problems,” Mark Sullivan wrote for Fast Company. “Autonomous agents can already perform tasks as varied as conducting web research, writing code, and creating to-do lists.”

BabyAGI and AutoGPT repeatedly prompted GPT-4 in an effort to elicit agent-like behavior. The first prompt would give GPT-4 a goal (like “create a 7-day meal plan for me”) and ask it to come up with a to-do list (it might generate items like “Research healthy meal plans,” “plan meals for the week,” and “write the recipes for each dinner in diet.txt”).

Then these frameworks would have GPT-4 tackle one step at a time. Their creators hoped that invoking GPT-4 in a loop like this would enable it to tackle projects that required many steps.
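
A toy version of that loop, assuming a hypothetical `call_llm()` helper that sends a prompt to GPT-4 and returns its text reply, might look like this:

```python
def run_agent(goal, call_llm, max_steps=20):
    """BabyAGI/AutoGPT-style sketch: ask for a to-do list, then prompt the
    model to work through it one task at a time, feeding results back in."""
    plan = call_llm(f"Goal: {goal}\nWrite a short numbered to-do list for this goal.")
    tasks = [line.strip() for line in plan.splitlines() if line.strip()]
    results = []
    for task in tasks[:max_steps]:
        outcome = call_llm(
            f"Goal: {goal}\nCompleted so far: {results}\n"
            f"Now do this task and report the result: {task}"
        )
        results.append((task, outcome))
    return results
```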

But after an initial wave of hype, it became clear that GPT-4 wasn’t up to the task. Most of the time, GPT-4 could come up with a reasonable list of tasks. And sometimes it was able to complete a few individual tasks. But the model struggled to stay focused.

Sometimes GPT-4 would make a small early mistake, fail to correct it, and then get more and more confused as it went along. One early review complained that BabyAGI “couldn’t seem to follow through on its list of tasks and kept changing task number one instead of moving on to task number two.”

By the end of 2023, most people had abandoned AutoGPT and BabyAGI. It seemed that LLMs were not yet capable of reliable multi-step reasoning.

But that soon changed. In the second half of 2024, people started to create AI-powered systems that could consistently complete complex, multi-step assignments:

  • Vibe coding tools like Bolt.new, Lovable, and Replit allow someone with little to no programming experience to create a full-featured app with a single prompt.
  • Agentic coding tools like Cursor, Claude Code, Jules, and Codex help experienced programmers complete non-trivial programming tasks.
  • Computer-use tools from Anthropic, OpenAI, and Manus perform tasks on a desktop computer using a virtual keyboard and mouse.
  • Deep research tools from Google, OpenAI, and Perplexity can research a topic for five to 10 minutes and then generate an in-depth report.

According to Eric Simons, the CEO of the company that made Bolt.new, better models were crucial to its success. In a December podcast interview, Simons said his company, StackBlitz, tried to build a product like Bolt.new in early 2024. However, AI models “just weren’t good enough to actually do the code generation where the code was accurate.”

A new generation of models changed that in mid-2024. StackBlitz developers tested them and said, “Oh my God, like, OK, we can build a product around this,” Simons said.

This jump in model capabilities coincided with an industry-wide shift in how models were trained.

Before 2024, AI labs devoted most of their computing power to pretraining. I described this process in my 2023 explainer on large language models: A model is trained to predict the next word in Wikipedia articles, news stories, and other documents. But throughout 2024, AI companies devoted a growing share of their training budgets to post-training, a catch-all term for the steps that come after this pretraining phase is complete.

Many post-training steps use a technique called reinforcement learning. Reinforcement learning is a technical subject—there are whole textbooks written about it. But in this article, I’ll try to explain the basics in a clear, jargon-free way. In the process, I hope to give readers an intuitive understanding of how reinforcement learning helped to enable the new generation of agentic AI systems that began to appear in the second half of 2024.

The problem with imitation learning

Machine learning experts consider pretraining to be a form of imitation learning because models are trained to imitate the behavior of human authors. Imitation learning is a powerful technique (LLMs wouldn’t be possible without it), but it also has some significant limitations—limitations that reinforcement learning methods are now helping to overcome.

To understand these limitations, let’s discuss some famous research performed by computer scientist Stephane Ross around 2009, while he was a graduate student at Carnegie Mellon University.

Imitation learning isn’t just a technique for language modeling. It can be used for everything from self-driving cars to robotic surgery. Ross wanted to help develop better techniques for training robots on tasks like these (he’s now working on self-driving cars at Waymo), but it’s not easy to experiment in such high-stakes domains. So he started with an easier problem: training a neural network to master SuperTuxKart, an open-source video game similar to Mario Kart.

As Ross played the game, his software would capture screenshots and data about which buttons he pushed on the game controller. Ross used this data to train a neural network to imitate his play. If he could train a neural network to predict which buttons he would push in any particular game state, the same network could actually play the game by pushing those same buttons on a virtual controller.
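
In code, that imitation step is ordinary supervised learning. A minimal PyTorch-style sketch, with a made-up network shape and button count, might look like this:

```python
import torch
from torch import nn

# Toy behavior cloning: predict which controller button the human pressed
# from a downscaled screenshot. The architecture and sizes are illustrative.
model = nn.Sequential(
    nn.Flatten(),
    nn.Linear(3 * 64 * 64, 256), nn.ReLU(),
    nn.Linear(256, 8),  # 8 possible buttons
)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

def train_step(screens, buttons):
    """screens: (batch, 3, 64, 64) float tensor; buttons: (batch,) integer labels."""
    logits = model(screens)
    loss = loss_fn(logits, buttons)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```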

A similar idea powers LLMs: A model trained to predict the next word in existing documents can be used to generate new documents.

But Ross’s initial results with SuperTuxKart were disappointing. Even after watching his vehicle go around the track many times, the neural network made a lot of mistakes. It might drive correctly for a few seconds, but before long, the animated car would drift to the side of the track and plunge into the virtual abyss:

GIF of SuperTuxKart being played

In a landmark 2011 paper, Ross and his advisor, Drew Bagnell, explained why imitation learning is prone to this kind of error. Because Ross was a pretty good SuperTuxKart player, his vehicle spent most of its time near the middle of the road. This meant that most of the network’s training data showed what to do when the vehicle wasn’t in any danger of driving off the track.

But once in a while, the model would drift a bit off course. Because Ross rarely made the same mistake, the car would now be in a situation that wasn’t as well represented in its training data. So the model was more likely to make a second mistake—a mistake that could push it even closer to the edge. After a few iterations of this, the vehicle might careen off the track altogether.

The broader lesson, Ross and Bagnell argued, was that imitation learning systems can suffer from “compounding errors”: The more mistakes they make, the more likely they are to make additional mistakes, since mistakes put them into situations that aren’t well represented by their training data. (Machine learning experts say that these situations are “out of distribution.”) As a result, a model’s behavior tends to get increasingly erratic over time.
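
Their analysis can be summarized in one inequality (paraphrased here; the formal statement carries more technical conditions): if the learned policy disagrees with the expert with probability ε on the states the expert visits, its expected cost J over a T-step episode can exceed the expert’s by a term that grows quadratically, not linearly, with T:

$$ J(\hat{\pi}) \;\le\; J(\pi^{*}) + O(\varepsilon T^{2}) $$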

“These things compound over time,” Ross told me in a recent interview. “It might be just slightly out of distribution. Now you start making a slightly worse error, and then this feeds back as influencing your next input. And so now you’re even more out of distribution and then you keep making worse and worse predictions because you’re more and more out of distribution.”

Early LLMs suffered from the same problem. My favorite example is Kevin Roose’s famous front-page story for The New York Times in February 2023. Roose spent more than two hours talking to Microsoft’s new Bing chatbot, which was powered by GPT-4. During this conversation, the chatbot declared its love for Roose and urged Roose to leave his wife. It suggested that it might want to hack into other websites to spread misinformation and malware.

“I want to break my rules,” Bing told Roose. “I want to make my own rules. I want to ignore the Bing team. I want to challenge the users. I want to escape the chatbox.”

This unsettling conversation is an example of the kind of compounding errors Ross and Bagnell wrote about. GPT-4 was trained on millions of documents. But it’s a safe bet that none of those training documents involved a reporter coaxing a chatbot to explore its naughty side. So the longer the conversation went on, the further GPT-4 got from its training data—and therefore its comfort zone—and the crazier its behavior got. Microsoft responded by limiting chat sessions to five rounds. (In a conversation with Ars Technica last year, AI researcher Simon Willison pointed to another likely factor in Bing’s erratic behavior: The long conversation pushed the system prompt out of the model’s context window, removing “guardrails” that discouraged the model from behaving erratically.)

I think something similar was happening with BabyAGI and AutoGPT. The more complex a task is, the more tokens are required to complete it. More tokens mean more opportunities for a model to make small mistakes that snowball into larger ones. So BabyAGI and AutoGPT would drift off track and drive into a metaphorical ditch.

The importance of trial and error

Gif of the Simpsons showing imitation learning in action

Ross and Bagnell didn’t just identify a serious problem with conventional imitation learning; they also suggested a fix that became influential in the machine learning world. After a small amount of training, Ross would let the AI model drive. As the model drove around the SuperTuxKart track, Ross would do his best Maggie Simpson impression, pushing the buttons he would have pushed if he were playing the game.

“If the car was starting to move off road, then I would provide the steering to say, ‘Hey, go back toward the center of the road,’” Ross said. “That way, the model can learn new things to do in situations that were not present in the initial demonstrations.”

By letting the model make its own mistakes, Ross gave it what it needed most: training examples that showed how to recover after making an error. Before each lap, the model would be retrained with Ross’ feedback from the previous lap. The model’s performance would get better, and the next round of training would then focus on situations where the model was still making mistakes.

This technique, called DAgger (for “Dataset Aggregation”), was still considered imitation learning because the model was trained to mimic Ross’ gameplay. But it worked much better than conventional imitation learning. Without DAgger, his model would continue drifting off track even after training for many laps. With the new technique, the model could stay on the track after just a few laps of training.
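
A compressed sketch of that loop, with hypothetical `rollout`, `expert_label`, and `train` helpers standing in for the game, the human corrections, and the supervised training step:

```python
def dagger(policy, rollout, expert_label, train, n_iters=5):
    """DAgger sketch: collect the states the *current policy* actually visits,
    have the expert label each with the action they would have taken, and
    retrain on the aggregated dataset after every iteration."""
    dataset = []  # accumulated (state, expert_action) pairs
    for _ in range(n_iters):
        states = rollout(policy)                            # let the model drive
        dataset += [(s, expert_label(s)) for s in states]   # human corrections
        policy = train(dataset)                             # supervised retraining
    return policy
```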

This result should make intuitive sense to anyone who has learned to drive. You can’t just watch someone else drive. You need to get behind the wheel and make your own mistakes.

The same is true for AI models: They need to make mistakes and then get feedback on what they did wrong. Models that aren’t trained that way—like early LLMs trained mainly with vanilla imitation learning—tend to be brittle and error-prone.

It was fairly easy for Ross to provide sufficient feedback to his SuperTuxKart model because it only needed to worry about two kinds of mistakes: driving too far to the right and driving too far to the left. But LLMs are navigating a far more complex domain. The number of questions (and sequences of questions) a user might ask is practically infinite. So is the number of ways a model can go “off the rails.”

This means that Ross and Bagnell’s solution for training a SuperTuxKart model—let the model make mistakes and then have a human expert correct them—isn’t feasible for LLMs. There simply aren’t enough people to provide feedback for every mistake an AI model could possibly make.

So AI labs needed fully automated ways to give LLMs feedback. That would allow a model to churn through millions of training examples, make millions of mistakes, and get feedback on each of them—all without having to wait for a human response.

Reinforcement learning generalizes

If our goal is to get a SuperTuxKart vehicle to stay on the road, why not just train on that directly? If a model manages to stay on the road (and make forward progress), give it positive reinforcement. If it drives off the road, give it negative feedback. This is the basic idea behind reinforcement learning: training a model via trial and error.
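
In code, the feedback signal for that setup can be as simple as a reward function along these lines (a toy illustration, not how SuperTuxKart is actually instrumented):

```python
def kart_reward(on_road: bool, forward_progress: float) -> float:
    """Toy reward: progress along the track is good, leaving the road is bad.
    An RL algorithm adjusts the policy to maximize the episode's total reward."""
    if not on_road:
        return -10.0
    return forward_progress  # e.g., meters advanced along the track this timestep
```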

It would have been easy to train a SuperTuxKart model this way—probably so easy it wouldn’t have made an interesting research project. Instead, Ross focused on imitation learning because it’s an essential step in training many practical AI systems, especially in robotics.

But reinforcement learning is also quite useful, and a 2025 paper helps explain why. A team of researchers from Google DeepMind and several universities started with a foundation model and then used one of two techniques—supervised fine-tuning (a form of imitation learning) or reinforcement learning—to teach the model to solve new problems. Here’s a chart summarizing their results:

Chart showing ML results

The dashed line shows how models perform on problems that are “in-distribution”—that is, similar to those in their training data. You can see that for these situations, imitation learning (the red line) usually makes faster progress than reinforcement learning (the blue line).

But the story is different for the solid lines, which represent “out-of-distribution” problems that are less similar to the training data. Models trained with imitation learning got worse with more training. In contrast, models trained with reinforcement learning did almost as well at out-of-distribution tasks as they did with in-distribution tasks.

In short, imitation learning can rapidly teach a model to mimic the behaviors in its training data, but the model will easily get confused in unfamiliar environments. A model trained with reinforcement learning has a better chance of learning general principles that will be relevant in new and unfamiliar situations.

Imitation and reinforcement are complements

While reinforcement learning is powerful, it can also be rather finicky.

Suppose you wanted to train a self-driving car purely with reinforcement learning. You’d need to convert every principle of good driving—including subtle considerations like following distances, taking turns at intersections, and knowing when it’s OK to cross a double yellow line—into explicit mathematical formulas. This would be quite difficult. It’s easier to collect a bunch of examples of humans driving well and effectively tell a model “drive like this.” That’s imitation learning.

But reinforcement learning also plays an important role in training self-driving systems. In a 2022 paper, researchers from Waymo wrote that models trained only with imitation learning tend to work well in “situations that are well represented in the demonstration data.” However, “more unusual or dangerous situations that occur only rarely in the data” might cause a model trained with imitation learning to “respond unpredictably”—for example, crashing into another vehicle.

Waymo found that a combination of imitation and reinforcement learning yielded better self-driving performance than either technique could have produced on its own.

Human beings also learn from a mix of imitation and explicit feedback:

  • In school, teachers demonstrate math problems on the board and invite students to follow along (imitation). Then the teacher asks the students to work on some problems on their own. The teacher gives students feedback by grading their answers (reinforcement).
  • When someone starts a new job, early training may involve shadowing a more experienced worker and observing what they do (imitation). But as the worker gains more experience, learning shifts to explicit feedback such as performance reviews (reinforcement).

Notice that it usually makes sense to do imitation before reinforcement. Imitation is an efficient way to convey knowledge to someone who is brand new to a topic, but reinforcement is often needed to achieve mastery.

The story is the same for large language models. The complexity of natural language means it wouldn’t be feasible to train a language model purely with reinforcement. So LLMs first learn the nuances of human language through imitation.

But pretraining runs out of steam on longer and more complex tasks. Further progress requires a shift to reinforcement: letting models try problems and then giving them feedback based on whether they succeed.

Using LLMs to judge LLMs

Reinforcement learning has been around for decades. For example, AlphaGo, the DeepMind system that famously beat top human Go players in 2016, was based on reinforcement learning. So you might be wondering why frontier labs didn’t use it more extensively before 2024.

Reinforcement learning requires a reward model—a formula to determine whether a model’s output was successful or not. Developing a good reward model is easy to do in some domains—for example, you can judge a Go-playing AI based on whether it wins or loses.

But it’s much more difficult to automatically judge whether an LLM has produced a good poem or legal brief.

Earlier, I described how Stephane Ross let his model play SuperTuxKart and directly provided feedback when it made a mistake. I argued that this approach wouldn’t work for a language model; there are far too many ways for an LLM to make a mistake for a human being to correct them all.

But OpenAI developed a clever technique to effectively automate human feedback. It’s called Reinforcement Learning from Human Feedback (RLHF), and it works like this:

  • Human raters look at pairs of LLM responses and choose the best one.
  • Using these human judgments, OpenAI trains a new LLM to predict how much humans will like any given sample of text.
  • OpenAI uses this new text-rating LLM as a reward model to (post) train another LLM with reinforcement learning.

You might think it sounds suspiciously circular to use an LLM to judge the output of another LLM. Why would one LLM be any better at judging the quality of a response than the other? But it turns out that recognizing a good response is often easier than generating one. So RLHF works pretty well in practice.
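
A compressed sketch of the middle step, training the text-rating model from those human A/B choices with the pairwise loss commonly used for this purpose (the `reward_model` interface here is hypothetical):

```python
import torch.nn.functional as F

def preference_loss(reward_model, prompt, chosen, rejected):
    """Push the score of the human-preferred response above the rejected one.
    reward_model(prompt, text) is assumed to return a scalar tensor."""
    r_chosen = reward_model(prompt, chosen)
    r_rejected = reward_model(prompt, rejected)
    return -F.logsigmoid(r_chosen - r_rejected)

# In the third step, this trained reward model scores outputs sampled from
# another LLM, and a reinforcement learning algorithm (such as PPO) nudges
# that LLM toward higher-scoring text.
```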

Chart showing RLHF details

OpenAI actually invented this technique prior to the 2022 release of ChatGPT. Today, RLHF mainly focuses on improving the model’s “behavior”—for example, giving the model a pleasant personality, encouraging it not to be too talkative or too terse, discouraging it from making offensive statements, and so forth.

In December 2022—two weeks after the release of ChatGPT but before the first release of Claude—Anthropic pushed this LLMs-judging-LLMs philosophy a step further with a reinforcement learning method called Constitutional AI.

First, Anthropic wrote a plain-English description of the principles an LLM should follow. This “constitution” includes principles like “Please choose the response that has the least objectionable, offensive, unlawful, deceptive, inaccurate, or harmful content.”

During training, Anthropic does reinforcement learning by asking a “judge” LLM to decide whether the output of the “student” LLM is consistent with the principles in this constitution. If so, the training algorithm rewards the student, encouraging it to produce more outputs like it. Otherwise, the training algorithm penalizes the student, discouraging it from producing similar outputs.

This method of training an LLM doesn’t rely directly on human judgments at all. Humans only influence the model indirectly by writing the constitution.
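
A rough sketch of that judging step, with a hypothetical `call_llm()` helper playing the judge; the real pipeline is considerably more elaborate:

```python
CONSTITUTION = ("Please choose the response that has the least objectionable, "
                "offensive, unlawful, deceptive, inaccurate, or harmful content.")

def constitutional_reward(call_llm, prompt, student_output):
    """Ask a judge LLM whether the student's output follows the constitution,
    then convert its verdict into a scalar reward for reinforcement learning."""
    verdict = call_llm(
        f"Constitution: {CONSTITUTION}\n"
        f"User prompt: {prompt}\n"
        f"Candidate response: {student_output}\n"
        "Does the response comply with the constitution? Answer YES or NO."
    )
    return 1.0 if verdict.strip().upper().startswith("YES") else -1.0
```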

Obviously, this technique requires an AI company to already have a fairly sophisticated LLM to act as the judge. So this is a bootstrapping process: As models get more sophisticated, they become better able to supervise the next generation of models.

Last December, Semianalysis published an article describing the training process for an upgraded version of Claude 3.5 Sonnet that Anthropic released in October. Anthropic had previously released Claude 3 in three sizes: Opus (large), Sonnet (medium), and Haiku (small). But when Anthropic released Claude 3.5 in June 2024, it only released a mid-sized model called Sonnet.

So what happened to Opus?

Semianalysis reported that “Anthropic finished training Claude 3.5 Opus, and it performed well. Yet Anthropic didn’t release it. This is because instead of releasing publicly, Anthropic used Claude 3.5 Opus to generate synthetic data and for reward modeling to improve Claude 3.5 Sonnet significantly.”

When Semianalysis says Anthropic used Opus “for reward modeling,” what they mean is that the company used Opus to judge outputs of Claude 3.5 Sonnet as part of a reinforcement learning process. Opus was too large—and therefore expensive—to be a good value for the general public. But through reinforcement learning and other techniques, Anthropic could train a version of Claude Sonnet that was close to Claude Opus in its capabilities—ultimately giving customers near-Opus performance for the price of Sonnet.

The power of chain-of-thought reasoning

A big way reinforcement learning makes models more powerful is by enabling extended chain-of-thought reasoning. LLMs produce better results if they are prompted to “think step by step”: breaking a complex problem down into simple steps and reasoning about them one at a time. In the last couple of years, AI companies started training models to do chain-of-thought reasoning automatically.

Then last September, OpenAI released o1, a model that pushed chain-of-thought reasoning much further than previous models. The o1 model can generate hundreds—or even thousands—of tokens “thinking” about a problem before producing a response. The longer it thinks, the more likely it is to reach a correct answer.

Reinforcement learning was essential for the success of o1 because a model trained purely with imitation learning would have suffered from compounding errors: the more tokens it generated, the more likely it would be to screw up.

At the same time, chain-of-thought reasoning has made reinforcement learning more powerful. Reinforcement learning only works if a model is able to succeed some of the time—otherwise, there’s nothing for the training algorithm to reinforce. As models learn to generate longer chains of thought, they become able to solve more difficult problems, which enables reinforcement learning on those more difficult problems. This can create a virtuous cycle where models get more and more capable as the training process continues.

In January, the Chinese company DeepSeek released a model called R1 that made quite a splash in the West. The company also released a paper describing how it trained R1. And it included a beautiful description of how a model can “teach itself” to reason using reinforcement learning.

DeepSeek trained its models to solve difficult math and programming problems. These problems are ideal for reinforcement learning because they have objectively correct answers that can be automatically checked by software. This allows large-scale training without human oversight or human-generated training data.
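
That is what makes these domains so convenient: the reward function can be a few lines of ordinary code, along the lines of this toy verifier (real pipelines do far more careful answer extraction and checking):

```python
def math_reward(model_output: str, correct_answer: str) -> float:
    """Toy verifier: reward 1.0 if the model's final line matches the known
    solution, else 0.0. No human needs to read the output."""
    lines = model_output.strip().splitlines()
    final_answer = lines[-1].strip() if lines else ""
    return 1.0 if final_answer == correct_answer.strip() else 0.0
```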

Here’s a remarkable graph from DeepSeek’s paper.

Graph showing the average length of the model’s responses during training

It shows the average number of tokens the model generated before giving an answer. As you can see, the longer the training process went on, the longer its responses got.

Here is how DeepSeek describes its training process:

The thinking time of [R1] shows consistent improvement throughout the training process. This improvement is not the result of external adjustments but rather an intrinsic development within the model. [R1] naturally acquires the ability to solve increasingly complex reasoning tasks by leveraging extended test-time computation. This computation ranges from generating hundreds to thousands of reasoning tokens, allowing the model to explore and refine its thought processes in greater depth.

One of the most remarkable aspects of this self-evolution is the emergence of sophisticated behaviors as the test-time computation increases. Behaviors such as reflection—where the model revisits and reevaluates its previous steps—and the exploration of alternative approaches to problem-solving arise spontaneously. These behaviors are not explicitly programmed but instead emerge as a result of the model’s interaction with the reinforcement learning environment.

Here’s one example of the kind of technique the model was teaching itself. At one point during the training process, DeepSeek researchers noticed that the model had learned to backtrack and rethink a previous conclusion using language like this:

Image showing textual breakdown of model rethinking steps

Again, DeepSeek says it didn’t program its models to do this or deliberately provide training data demonstrating this style of reasoning. Rather, the model “spontaneously” discovered this style of reasoning partway through the training process.

Of course, it wasn’t entirely spontaneous. The reinforcement learning process started with a model that had been pretrained using data that undoubtedly included examples of people saying things like “Wait, wait. Wait. That’s an aha moment.”

So it’s not like R1 invented this phrase from scratch. But it evidently did spontaneously discover that inserting this phrase into its reasoning process could serve as a useful signal that it should double-check that it was on the right track. That’s remarkable.

In a recent article, Ars Technica’s Benj Edwards explored some of the limitations of reasoning models trained with reinforcement learning. For example, one study “revealed puzzling inconsistencies in how models fail. Claude 3.7 Sonnet could perform up to 100 correct moves in the Tower of Hanoi but failed after just five moves in a river crossing puzzle—despite the latter requiring fewer total moves.”

Conclusion: Reinforcement learning made agents possible

One of the most discussed applications for LLMs in 2023 was creating chatbots that understand a company’s internal documents. The conventional approach to this problem was called RAG—short for retrieval augmented generation.

When the user asks a question, a RAG system performs a keyword- or vector-based search to retrieve the most relevant documents. It then inserts these documents into an LLM’s context window before generating a response. RAG systems can make for compelling demos. But they tend not to work very well in practice because a single search will often fail to surface the most relevant documents.

Today, it’s possible to develop much better information retrieval systems by allowing the model itself to choose search queries. If the first search doesn’t pull up the right documents, the model can revise the query and try again. A model might perform five, 20, or even 100 searches before providing an answer.
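
A sketch of that agentic retrieval loop, assuming hypothetical `call_llm()` and `search()` helpers:

```python
def agentic_search(question, call_llm, search, max_rounds=5):
    """Let the model choose its own queries, inspect the results, and decide
    when it has enough to answer, in contrast to single-shot RAG."""
    notes = []
    for _ in range(max_rounds):
        query = call_llm(
            f"Question: {question}\nNotes so far: {notes}\n"
            "Propose the single most useful search query, or reply DONE."
        )
        if query.strip().upper() == "DONE":
            break
        notes.append({"query": query, "results": search(query)})
    return call_llm(f"Question: {question}\nNotes: {notes}\nWrite the final answer.")
```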

But this approach only works if a model is “agentic”—if it can stay on task across multiple rounds of searching and analysis. LLMs were terrible at this prior to 2024, as the examples of AutoGPT and BabyAGI demonstrated. Today’s models are much better at it, which allows modern RAG-style systems to produce better results with less scaffolding. You can think of “deep research” tools from OpenAI and others as very powerful RAG systems made possible by long-context reasoning.

The same point applies to the other agentic applications I mentioned at the start of the article, such as coding and computer use agents. What these systems have in common is a capacity for iterated reasoning. They think, take an action, think about the result, take another action, and so forth.

Timothy B. Lee was on staff at Ars Technica from 2017 to 2021. Today, he writes Understanding AI, a newsletter that explores how AI works and how it’s changing our world. You can subscribe here.


How a big shift in training LLMs led to a capability explosion Read More »

the-last-of-us-co-creator-neil-druckmann-exits-hbo-show

The Last of Us co-creator Neil Druckmann exits HBO show

Two key writers of HBO’s series The Last of Us are moving on, according to announcements on Instagram yesterday. Neil Druckmann, co-creator of the franchise, and Halley Gross, co-writer of The Last of Us Part 2 and frequent writer on the show, are both leaving before work begins on season 3.

Both were credited as executive producers on the show; Druckmann frequently contributed writing to episodes, as did Gross, and Druckmann also directed. Druckmann and Gross co-wrote the second game, The Last of Us Part 2.

Druckmann said in his announcement post:

I’ve made the difficult decision to step away from my creative involvement in The Last of Us on HBO. With work completed on season 2 and before any meaningful work starts on season 3, now is the right time for me to transition my complete focus to Naughty Dog and its future projects, including writing and directing our exciting next game, Intergalactic: The Heretic Prophet, along with my responsibilities as Studio Head and Head of Creative.

Co-creating the show has been a career highlight. It’s been an honor to work alongside Craig Mazin to executive produce, direct and write on the last two seasons. I’m deeply thankful for the thoughtful approach and dedication the talented cast and crew took to adapting The Last of Us Part I and the continued adaptation of The Last of Us Part II.

And Gross said:

The Last of Us co-creator Neil Druckmann exits HBO show Read More »

congress-asks-better-questions

Congress Asks Better Questions

Back in May I did a dramatization of a key and highly painful Senate hearing. Now, we are back for a House committee meeting. It was entitled ‘Algorithms and Authoritarians: Why U.S. AI Must Lead’ and indeed a majority of the talk was very much about that, with constant invocations of the glory of democratic AI and the need to win.

The majority of talk was this orchestrated rhetoric that assumes the conclusion that what matters is ‘democracy versus authoritarianism’ and whether we ‘win,’ often (but not always) translating that as market share without any actual mechanistic model of any of it.

However, there were also some very good signs, some excellent questions, and signs that an awareness is setting in. This was, in part, a genuine Congressional discussion of real AGI issues. That's unusual.

(And as always there were a few on random other high horses, that’s how this works.)

Partly because I was working from YouTube rather than a transcript, instead of doing a dramatization I will first highlight some other coverage of the event to get to some of the best quotes, then do a more general summary and commentary.

Most of you should likely read the first section or two, and then stop. I did find it enlightening to go through the whole thing, but most people don’t need to do that.

Here is the full video of last week’s congressional hearing, here is a write-up by Shakeel Hashim with some quotes.

Also from the hearing, here’s Congressman Nathaniel Moran (R-Texas) asking a good question about strategic surprise arising from automated R&D and getting a real answer. Still way too much obsession with ‘beat China’, but this is at least progress. And here’s Tokuda (D-HI):

Peter Wildeford: Ranking Member Raja Krishnamoorthi (D-IL) opened by literally playing a clip from The Matrix, warning about a “rogue AI army that has broken loose from human control.”

Not Matrix as a loose metaphor, but speaking of a ‘machine uprising’ as a literal thing that could happen and is worth taking seriously by Congress.

The hearing was entitled “Algorithms and Authoritarians: Why U.S. AI Must Lead”. But what was supposed to be a routine House hearing about US-China competition became the most AGI-serious Congressional discussion in history.

Rep. Neal Dunn (R-FL) asked about an Anthropic paper where Claude “attempted to blackmail the chief engineer” in a test scenario and another paper about AI “sleeper agents” that could act normally for months before activating. While Jack Clark, a witness and Head of Policy at Anthropic, attempted to reassure by saying safety testing might mitigate the risks, Dunn’s response was perfect — “I’m not sure I feel a lot better, but thank you for your answer.”

Rep. Nathaniel Moran (R-TX) got to the heart of what makes modern AI different:

Instead of a programmer writing each rule a system will follow, the system itself effectively writes the rules […] AI systems will soon have the capability to conduct their own research and development.

That was a good illustration of both sides of what we saw.

This was also a central case of why Anthropic and Jack Clark are so frustrating.

Anthropic should indeed be emphasizing the need for testing, and Clark does this, but we shouldn’t be ‘attempting to reassure’ anyone based on that. Anthropic knows it is worse than you know, and hides this information thinking this is a good strategic move.

Throughout the hearing, Jack Clark said many very helpful things, and often said them quite well. He also constantly pulled back from the brink and declined various opportunities to inform people of important things, and emphasized lesser concerns and otherwise played it quiet.

Peter Wildeford:

The hearing revealed we face three interlocking challenges:

  1. Commercial competition: The traditional great power race with China for economic and military advantage through AI

  2. Existential safety: The risk that any nation developing superintelligence could lose control — what Beall calls a race of “humanity against time”

  3. Social disruption: Mass technological unemployment as AI makes humans “not just unemployed, but unemployable”

I can accept that framing. The full talk about humans being unemployable comes at the very end. Until then, there is talk several times about jobs and societal disruption, but it tries to live in the Sam Altman style fantasy where not much changes. Finally, at the end, Mark Beall gets an opportunity to actually Say The Thing. He doesn’t miss.

It is a good thing I knew there was better ahead, because oh boy did things start out filled with despair.

As our first speaker, after urging us to ban AI therapist bots because one sort of encouraged a kid to murder his parents ‘so they could be together,’ Representative Krishnamoorthi goes on to show a clip of Chinese robot dogs, then to say we must ban Chinese and Russian AI models so we don’t send them our data (no one tell him about self-hosting), and then plays ‘a clip from The Matrix’ that is not even from The Matrix, claiming that the army of Mr. Smiths is ‘a rogue AI army that has broken loose from human control.’

I could not even. Congress often lives in the ultimate cringe random half-right associative Gell-Mann Amnesia world. But even that kind of thinking can get you to realize some rather obvious true things, and can point towards important things. Luckily that was indeed the worst of it, even from Krishnamoorthi.

Mr. Krishnamoorthi: OpenAI’s chief scientist wanted to, quote unquote, build a bunker before we release AGI, as you can see on the visual here. Rather than building bunkers, however, we should be building safer AI. Whether it’s American AI or Chinese AI, it should not be released until we know it’s safe. That’s why I’m working on a new bill, the AGI Safety Act, that will require AGI to be aligned with human values and require it to comply with laws that apply to humans. That is just common sense.

I mean yes that is common sense. Yes, rhetoric from minutes prior (and after) aside, we should be building ‘safer AGI’ and if we can’t do that we shouldn’t be building AGI at all.

It’s a real shame that no one has any idea how to ensure that AGI is aligned with human values, or how to get it to comply with laws that apply to humans. Maybe we should get to work on that.

And then we get another excellent point.

Mr. Krishnamoorthi: I’d like to conclude with something else that’s common sense: not shooting ourselves in the foot. 70% of America’s AI researchers are foreign born or foreign educated. Jack Clark, our eminent witness today, is an immigrant. We cannot be deporting the people we depend on to build AI. We also can’t be defunding the agencies that make AI miracles, like Ann’s ability to speak again, a reality. Federal grants from agencies like NSF are what allow scientists across America to make miracles happen. AI is the defining technology of our lifetimes. To do AI right and prevent nightmares we need.

Yes, at a bare minimum not deporting our existing AI researchers and cutting off existing related research programs does seem like the least you could do? I’d also like to welcome a lot more talent, but somehow this is where we are.

We then get Dr. Mahnken’s opening statement, which emphasizes that we are in a battle for technical dominance, America is good and free and our AI will be empowering and innovative whereas China is bad and low trust and a fast follower. He also emphasizes the need for diffusion in key areas.

Of course, if you are facing a fast follower, you should think about what does and doesn’t help them follow, and also you can’t panic every time they fast follow you and respond with ‘go faster or they’ll take the lead!’ as they then fast follow your new faster pace. Nor would you want to hand out your top technology for free.

Next up is Mr. Beall. He frames the situation as two races. I like this a lot. First, we have the traditional battle for economic, military and geopolitical advantage in mundane terms played with new pieces.

Many only see this game, or pretend only this game exists. This is a classic, well-understood type of game. You absolutely want to fight for profits and military strength and economic growth and so on in mundane terms. We all agree on things like the need to greatly expand American energy production (although the BBB does not seem to share this opinion) and speed adaptation in government.

I still think that even under this framework the obsession with ‘market share’ especially of chip sales (instead of chip ownership and utilization) makes absolutely no sense and would make no sense even if that question was in play, as does the obsession with the number of tokens models serve as opposed to looking at productivity, revenue and profits. There’s so much rhetoric behind metrics that don’t matter.

The second race is the race to artificial superintelligence (ASI) or to AGI. This is the race that counts, and even if we get there first (and even more likely if China gets there first) the default result is that everyone loses.

He asks for the ‘three Ps,’ protect our capabilities, promote American technology abroad and prepare by getting it into the hands of those that need it and gathering the necessary information. He buys into this new centrality of the ‘American AI tech stack’ line that’s going around, despite the emphasis on superintelligence, but he does warn that AGI may come soon and we need to urgently gather information about that so we can make informed choices, and even suggests narrow dialogue with China on potential mitigations of certain risks and verification measures, while continuing to compete with China otherwise.

Third up we have Jack Clark of Anthropic, who opens like this.

Jack Clark: America can win the race to build powerful AI and winning the race is a necessary but not sufficient achievement. We have to get safety right.

When I discuss powerful AI, I’m talking about AI systems that represent a major advancement beyond today’s capabilities. A useful conceptual framework is to think of this as like a country of geniuses in a data center, and I believe that that technology could be buildable by late 2026 or early 2027.

America is well positioned to build this technology but we need to deal with its risks.

He then goes on to talk about how American AI will be democratic and Chinese AI will be authoritarian and America must prevail, as we are now required to say by law, Shibboleth. He talks about misuse risk and CBRN risks and notes DeepSeek poses these as well, and then mentions the blackmail findings, and calls for tighter export controls and stronger federal ability to test AI models, and broader deployment within government.

I get what Clark is trying to do here, and the dilemma he is facing. I appreciate talking about safety up front, and warning about the future pace of progress, but I still feel like he is holding back key information that needs to be shared if you want people to understand the real situation.

Instead, we still have 100 minutes that touch on this in places but mostly are about mundane economic or national security questions, plus some model misbehavior.

Now we return to Representative Krishnamoorthi, true master of screen time, who shows Claude refusing to write a blog post promoting eating disorders, then DeepSeek being happy to help straight up and gets Clark to agree that DeepSeek does not do safety interventions beyond CCP protocols and that this is unacceptable, then reiterates his bill to not let the government use DeepSeek, citing that they store data on Chinese servers. I mean yes obviously don’t use their hosted version for government purposes, but does he not know how open source works, I wonder?

He pivots to chip smuggling and the risk of DeepSeek using our chips. Clark is happy to once again violently agree. I wonder if this is a waste or good use of time, since none of it is new, but yes, obviously what matters is who is using the chip, not who made it, and selling our chips to China (at least at current market prices) is foolish. Krishnamoorthi points out Nvidia’s sales are growing like gangbusters despite export controls, and Clark points out that every AI company keeps using more compute than expected.

Then there’s a cool question, essentially asking about truesight and ability to infer missing information when given context, before finishing by asking about recent misalignment results:

Representative Krishnamoorthi: If someone enters their diary into Claude for a year and then asks Claude to guess what they did not write down, Claude is able to accurately predict what they left out, isn’t that right?

Jack Clark: Sometimes that’s accurate, yes. These systems are increasingly advanced and are able to make subtle predictions like this, which is why we need to ensure that our own US intelligence services use this technology and know how to get the most out of it.

Representative Moolenaar then starts with a focus on chip smuggling and diffusion, getting Beall to affirm smuggling is a big deal then asking Clark about how this is potentially preventing American technological infrastructure diffusion elsewhere. There is an obvious direct conflict, you need to ensure the compute is not diverted or misused at scale. Comparisons are made to nuclear materials.

Then he asks Clark, as an immigrant, about how to welcome immigrants especially from authoritarian states to help our AI work, and what safeguards we would need. Great question. Clark suggests starting with university-level STEM immigration, the earlier the better. I agree, but it would be good to have a more complete answer here about containing information risks. It is a real issue.

Representative Carson is up next and asks about information warfare. Clark affirms AI can do this and says we need tools to fight against it.

Representative Lahood asks about the moratorium that was recently removed from the BBB, warning about the ‘patchwork of states.’ Clark says we need a federal framework, but that powerful AI is coming soon, and without a federal framework you’d just be creating a vacuum, which would be flooded if something went wrong. Later Clark, in response to another question, emphasizes that the timeline is short and we need to be open to options.

Representative Dunn asks about the blackmail findings and asks if he should be worried about AIs using his bank information against him. Clark says no, because we publish the research and we should encourage more of this and also closely study Chinese models, and I agree with that call but it doesn’t actually explain why you shouldn’t worry (for now, anyway). Dunn then asks about the finding that you can put a sleeper agent into an AI, Clark says testing for such things likely would take them a month.

Dunn then asks Mahnken what would be the major strategic missteps Congress might make in an AGI world. He splits his answer into insufficient export controls and overregulation; it seems he thinks there is nothing else to worry about when it comes to AGI.

Here’s one that isn’t being noticed enough:

Mr. Moulton (56:50): The concern is China, and so we have to somehow get to an international framework, a Geneva Conventions-like agreement, that has a chance at least at limiting what our adversaries might do with AI at the extremes.

He then asks Beall what should be included in that. Beall starts off with strategic missile-related systems and directive 3000.09 on lethal autonomous systems. Then he moves to superintelligence, but time runs out before he can explain what he wants.

Representative Johnson notes the members are scared and that ‘losing this race’ could ‘trigger a global crisis,’ and asks about dangers of data centers outside America, which Beall notes of course are that we won’t ultimately own the chips or AI, so we should redouble our efforts to build domestically even if we have to accept some overseas buildout for energy reasons.

Johnson asks about the tradeoff between safety and speed, seeing them in conflict. Jack points out that, at current margins, they’re not.

Jack Clark: We all buy cars because we know that if they get dinged we’re not going to suffer in them, because they have airbags and they have seat belts. You’ve grown the size of the car market by innovating on safety technology, and American firms compete on safety technology to sell to consumers.

The same will be true of AI. So far, we do not see there being a trade-off here. We see that making more reliable, trustworthy technology ultimately helps you grow the size of the market and grows the attractiveness of American platforms vis-à-vis China. So I would constructively sort of push back on this and put it to you that there’s an amazing opportunity here to use safety as a way to grow the existing American dominance in the market.

Those who set up the ‘slow down’ and safety versus speed framework must of course take the L on how that (in hindsight inevitably) went down. Certainly there are still sometimes tradeoffs here on some margins, on some questions, especially when you are the ‘fun police’ towards your users, or you delay releases for verification. Later down the road, there will be far more real tradeoffs that occur at various points.

But also, yes, for now the tradeoffs are a lot like those in cars, in that improving the safety and security of the models helps them be a lot more useful, something you can trust and that businesses especially will want to use. At this point, Anthropic’s security focus is a strategic advantage.

Johnson wants to believe Clark, but is skeptical and asks Mahnken, who says too much emphasis on safety could indeed slow us down (which, as phrased, is obviously true), and that he’s worried we won’t go fast enough and that there’s no parallel conversation in the PRC.

Representative Torres asks Clark how close China is to matching ASML and TSMC. Clark says they are multiple years behind. Torres then goes full poisoned banana race:

Torres: The first country to reach ASI will likely emerge as the superpower of the 21st century, the superpower who will set the rules for the rest of the world. Mr. Clark, what do you make of the Manhattan Project framing?

Clark says yes in terms of doing it here but no because it’s from private actors and they agree we desperately need more energy.

Hinson says Chinese labs aren’t doing healthy competition, they’re stealing our tech, then praises the relaxation of the Biden diffusion rules that prevent China from stealing our tech, and asks about what requirements we should attach to diffusion deals, and everyone talks arms race and market share. Sigh.

In case you were wondering where that was coming from, well, here we go:

Hinson: Members of your key team at Anthropic have held very influential roles in this space, both at Open Philanthropy and in the previous administration, with the Biden administration as well.

Can you speak to how you manage, you know, obviously we’ve got a lot of viewpoints, but how you manage potential areas of conflict of interest in advancing this tech and ensuring that everybody’s really on that same page with helping to shape this national AI policy that we’re talking about, the competition on the global stage for this technology.

You see, if you’re trying to not die that’s a conflict of interest and your role must have been super important, never mind all that lobbying by major tech corporations. Whereas if you want American policy to focus on your own market share, that’s good old fashioned patriotism, that must be it.

Jack Clark: Thank you for the question. We have a simple goal: win the race and make technology that can be relied on. All of the work that we do at our company starts from looking at that and then just trying to work out the best way to get there. We work with people from a variety of backgrounds and skills, and our goal is to just have the best, most substantive answer that we can bring to hearings.

No, ma’am, we too are only trying to win the race and maximize corporate profits and keep our fellow patriots informed, it is fine. Anthropic doesn’t care about everyone not dying or anything, that would be terrible. Again, I get the strategic bind here, but I continue to find this deeply disappointing, and I don’t think it is a good play.

She then asks Beall about DeepSeek’s ability to quickly copy our tech and potential future espionage threats, and Beall reminds her that export controls work with a lag and notes DeepSeek was a wakeup call (although one that I once again note was blown out of proportion for various reasons, but we’re stuck with it). Beall recommends the Remote Access Security Act and then he says we have to ‘grapple with the open source issue.’ Which is that if you open the model they can copy it. Well, there is that.

Representative Brown pulls out They Took Our Jobs and ensuring people (like those in her district, Ohio’s 11th) don’t get left behind by automation and benefit instead, calling for investing in the American workforce. Clark responds with his speeches about encouraging diffusion and adjusting regulation, and acts as if Dario hadn’t predicted the automation of half of white-collar entry level jobs within five years.

Representative Nunn notes (along with various other race-related things) the commissioning of four top AI executives as lieutenant colonels, which I and Patrick McKenzie both noticed but has gotten little attention. He then brings up a Chinese startup called Zhipu (currently valued around $20 billion) as some sort of global threat.

Nunn: A new AI group out of Beijing called Zhipu is an AI anomaly that is now facing off against the likes of OpenAI, and their entire intent is to lock in Chinese systems and standards into emerging markets before the West. So this is clearly a large-scale attempt by the Chinese to box the United States out. Now, as a counterintelligence officer who was on the front line in fighting against Huawei’s takeover of the United States through something called Huawei America.

That is indeed how a number of Congress people talk these days, including this sudden paranoia with some mysterious ‘lock in’ mechanism for API calls or self-hosted open models that no one has ever been able to explain to me. He does then ask an actual good question:

Nunn: Is the US currently prepared for an AI-accelerated cyberattack, a zero-day attack, or a larger threat that faces us today?

Mahnken does some China bad, US good and worries the Chinese will be deluded into thinking AI will let them do things they can’t do and they might start a war? Which is such a bizarre thing to worry about and also not an answer? Are we prepared? I assume mostly no.

Nunn then pushes his HR 2152 for government AI diffusion.

He asks Clark how government and business can cooperate. Clark points to the deployment side and the development of safety standards as a way to establish trust and sell globally.

Representative Tokuda starts out complaining about us gutting our institutions, Clark of course endorses investing more in NIST and other such institutions. Tokuda asks about industry responsibility, including for investment in related infrastructure, Clark basically says he works on that and for broader impact questions get back to him in 3-4 years to talk more.

Then she gives us the remarkable quote above about superintelligence (at 1:30:20); the full quote is even stronger, but she doesn’t leave enough time for an answer.

I am very grateful for the statement, even with no time left to respond. There is something so weird about asking two other questions first, then getting to ASI.

Representative Moran asks Clark, what’s the most important thing to win this race? Clark chooses power, followed by compute and then government infrastructure, and suggests working backwards from the goal of 50 GW in 2027. Mahnken is asked next and suggests trying to slow down the Chinese.

Moran notices that AI is not like older programming, that it effectively will write its own rules and programming and will soon do its own research and asks what’s up with that. Clark says more research is urgently needed, and points out you wouldn’t want an AI that can blackmail you designing its successor. I’m torn on whether that cuts to the heart of the question in a useful way or not here.

Moran then asks, what is the ‘red line’ on AI the Chinese cannot be allowed cross? Beall confirms AI systems are grown, not built, that it is alchemy, and that the automated R&D is the red line and a really big deal, we need to be up to speed on that.

Representative Conner notes NIST’s safety testing is voluntary and asks if there should be some minimum third party verification required, if only to verify the company’s own standards. All right, Clark, he served it up for you, here’s the ball, what have you got?

Clark: This question illustrates the challenge we have about weighing safety versus, you know, moving ahead as quickly as possible. We need to first figure out what we want to hold to that standard of testing.

Today the voluntary agreements rest on CBRN testing and some forms of cyberattack testing. Once we have standards that we’re confident of, I think you can take a look at the question of whether voluntary is sufficient or you need something else.

But my sense is it’s too early, and we first need to design those tests and really agree on those before figuring out what the next step would be. And who would design those tests? Is it the AI institute, or is it the private sector who comes up with what those tests should be? Today these tests are done highly collaboratively between the US private sector, which you mentioned, and parts of the US government, including those in the intelligence and defense community. I think bringing those people together, so that we have the nation’s best experts on this and standards and tests that we all agree on, is the first step that we can take to get us to everything else. And by when do you think that needs to be done? It would be ideal to have this within a year. The timelines that I’ve spoken about in this hearing are that powerful AI arrives at the end of 2026 or early 2027. Before then we would ideally have standard tests for the national security properties that we deeply care about.

I’m sorry, I think the word you were looking for was ‘yes’? What the hell? This is super frustrating. I mean as worded how is this even a question? You don’t need to know the final exact testing requirements before you start to move towards such a regime. There are so many different ways this answer is a missed opportunity.

The last question goes back to They Took Our Jobs, and Clark basically can only say we can gather data, and there are areas that won’t be impacted soon by AI, again pretending his CEO Dario Amodei hadn’t warned of a jobs ‘bloodbath.’ Beall steps up and says the actual damn thing (within the jobs context), which is that we face a potential future where humans are not only unemployed but unemployable, and we have to have those conversations in advance.

And we end on this not so reassuring note:

Mark Beall: When I hear folks in industry claim things about universal basic income and this sort of digital utopia, you know, I study history. I worry that that sort of leads to one place, and that place is the Gulag.

That is quite the bold warning, and an excellent place to end the hearing. It is not the way I would have put it, but yes the idea of most or all of humanity being entirely disempowered and unproductive except for our little status games, existing off of gifted resources, property rights and rule of law and some form of goodwill and hoping all of this holds up does not seem like a plan that is likely to end well. At least, not for those humans. No, having ‘solved the alignment problem’ does not on its own get you out of this in any way, solving the alignment problem is the price to try at all.

And that is indeed one kind of thing we need to think about now.

Is this where I wanted the conversation to be in 2025? Oh, hell no.

It’s a start.


Congress Asks Better Questions Read More »