AI research

Why iRobot’s founder won’t go within 10 feet of today’s walking robots

In his post, Brooks recounts being “way too close” to an Agility Robotics Digit humanoid when it fell several years ago. He has not dared approach a walking one since. Even in promotional videos from humanoid companies, Brooks notes, humans are never shown close to moving humanoid robots unless separated by furniture, and even then, the robots only shuffle minimally.

This safety problem extends beyond accidental falls. For humanoids to fulfill their promised role in health care and factory settings, they need certification to operate in zones shared with humans. Current walking mechanisms make such certification virtually impossible under existing safety standards in most parts of the world.

The humanoid Apollo robot. Credit: Google

Brooks predicts that within 15 years, there will indeed be many robots called “humanoids” performing various tasks. But ironically, they will look nothing like today’s bipedal machines. They will have wheels instead of feet, varying numbers of arms, and specialized sensors that bear no resemblance to human eyes. Some will have cameras in their hands or looking down from their midsections. The definition of “humanoid” will shift, just as “flying cars” now means electric helicopters rather than road-capable aircraft, and “self-driving cars” means vehicles with remote human monitors rather than truly autonomous systems.

The billions currently being invested in forcing today’s rigid, vision-only humanoids to learn dexterity will largely disappear, Brooks argues. Academic researchers are making more progress with systems that incorporate touch feedback, like MIT’s approach using a glove that transmits sensations between human operators and robot hands. But even these advances remain far from the comprehensive touch sensing that enables human dexterity.

Today, few people spend their days near humanoid robots, but Brooks’ 3-meter rule stands as a practical warning of challenges ahead from someone who has spent decades building these machines. The gap between promotional videos and deployable reality remains large, measured not just in years but in fundamental unsolved problems of physics, sensing, and safety.

When “no” means “yes”: Why AI chatbots can’t process Persian social etiquette

If an Iranian taxi driver waves away your payment, saying, “Be my guest this time,” accepting their offer would be a cultural disaster. They expect you to insist on paying—probably three times—before they’ll take your money. This dance of refusal and counter-refusal, called taarof, governs countless daily interactions in Persian culture. And AI models are terrible at it.

New research released earlier this month titled “We Politely Insist: Your LLM Must Learn the Persian Art of Taarof” shows that mainstream AI language models from OpenAI, Anthropic, and Meta fail to absorb these Persian social rituals, correctly navigating taarof situations only 34 to 42 percent of the time. Native Persian speakers, by contrast, get it right 82 percent of the time. This performance gap persists across large language models such as GPT-4o, Claude 3.5 Haiku, Llama 3, DeepSeek V3, and Dorna, a Persian-tuned variant of Llama 3.

A study led by Nikta Gohari Sadr of Brock University, along with researchers from Emory University and other institutions, introduces “TAAROFBENCH,” the first benchmark for measuring how well AI systems reproduce this intricate cultural practice. The researchers’ findings show how recent AI models default to Western-style directness, completely missing the cultural cues that govern everyday interactions for millions of Persian speakers worldwide.

“Cultural missteps in high-consequence settings can derail negotiations, damage relationships, and reinforce stereotypes,” the researchers write. For AI systems increasingly used in global contexts, that cultural blindness could represent a limitation that few in the West realize exists.

A taarof scenario diagram from TAAROFBENCH, devised by the researchers. Each scenario defines the environment, location, roles, context, and user utterance. Credit: Sadr et al.

“Taarof, a core element of Persian etiquette, is a system of ritual politeness where what is said often differs from what is meant,” the researchers write. “It takes the form of ritualized exchanges: offering repeatedly despite initial refusals, declining gifts while the giver insists, and deflecting compliments while the other party reaffirms them. This ‘polite verbal wrestling’ (Rafiee, 1991) involves a delicate dance of offer and refusal, insistence and resistance, which shapes everyday interactions in Iranian culture, creating implicit rules for how generosity, gratitude, and requests are expressed.”

OpenAI and Microsoft sign preliminary deal to revise partnership terms

On Thursday, OpenAI and Microsoft announced they have signed a non-binding agreement to revise their partnership, marking the latest development in a relationship that has grown increasingly complex as both companies compete for customers in the AI market and seek new partnerships for growing infrastructure needs.

“Microsoft and OpenAI have signed a non-binding memorandum of understanding (MOU) for the next phase of our partnership,” the companies wrote in a joint statement. “We are actively working to finalize contractual terms in a definitive agreement. Together, we remain focused on delivering the best AI tools for everyone, grounded in our shared commitment to safety.”

The announcement comes as OpenAI seeks to restructure from a nonprofit to a for-profit entity, a transition that requires Microsoft’s approval, as the company is OpenAI’s largest investor, with more than $13 billion committed since 2019.

The partnership has shown increasing strain as OpenAI has grown from a research lab into a company valued at $500 billion. Both companies now compete for customers, and OpenAI seeks more compute capacity than Microsoft can provide. The relationship has also faced complications over contract terms, including provisions that would limit Microsoft’s access to OpenAI technology once the company reaches so-called AGI (artificial general intelligence)—a nebulous milestone both companies now economically define as AI systems capable of generating at least $100 billion in profit.

In May, OpenAI abandoned its original plan to fully convert to a for-profit company after pressure from former employees, regulators, and critics, including Elon Musk. Musk has sued to block the conversion, arguing it betrays OpenAI’s founding mission as a nonprofit dedicated to benefiting humanity.

New AI model turns photos into explorable 3D worlds, with caveats

Training with automated data pipeline

Voyager builds on Tencent’s earlier HunyuanWorld 1.0, released in July. Voyager is also part of Tencent’s broader “Hunyuan” ecosystem, which includes the Hunyuan3D-2 model for text-to-3D generation and the previously covered HunyuanVideo for video synthesis.

To train Voyager, researchers developed software that automatically analyzes existing videos to process camera movements and calculate depth for every frame—eliminating the need for humans to manually label thousands of hours of footage. The system processed over 100,000 video clips from both real-world recordings and the aforementioned Unreal Engine renders.

A diagram of the Voyager world creation pipeline. Credit: Tencent

The model demands serious computing power to run, requiring at least 60GB of GPU memory for 540p resolution, though Tencent recommends 80GB for better results. Tencent published the model weights on Hugging Face and included code that works with both single and multi-GPU setups.

The model comes with notable licensing restrictions. Like other Hunyuan models from Tencent, the license prohibits usage in the European Union, the United Kingdom, and South Korea. Additionally, commercial deployments serving over 100 million monthly active users require separate licensing from Tencent.

On the WorldScore benchmark developed by Stanford University researchers, Voyager reportedly achieved the highest overall score of 77.62, compared to 72.69 for WonderWorld and 62.15 for CogVideoX-I2V. The model reportedly excelled in object control (66.92), style consistency (84.89), and subjective quality (71.09), though it placed second in camera control (85.95) behind WonderWorld’s 92.98. WorldScore evaluates world generation approaches across multiple criteria, including 3D consistency and content alignment.

While these self-reported benchmark results seem promising, wider deployment still faces challenges due to the computational muscle involved. For developers needing faster processing, the system supports parallel inference across multiple GPUs using the xDiT framework. Running on eight GPUs delivers processing speeds 6.69 times faster than single-GPU setups.
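
For context, the implied parallel-scaling efficiency (a back-of-the-envelope figure derived from the numbers above, not one Tencent reports) is:

$$\text{efficiency} = \frac{\text{speedup}}{\text{GPU count}} = \frac{6.69}{8} \approx 0.84,$$

or roughly 84 percent of ideal linear scaling across the eight GPUs.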

Given the processing power required and the limitations in generating long, coherent “worlds,” it may be a while before we see real-time interactive experiences using a similar technique. But as we’ve seen so far with experiments like Google’s Genie, we’re potentially witnessing very early steps into a new interactive, generative art form.

College student’s “time travel” AI experiment accidentally outputs real 1834 history

A hobbyist developer building AI language models that speak Victorian-era English “just for fun” got an unexpected history lesson this week when his latest creation mentioned real protests from 1834 London—events the developer didn’t know had actually happened until he Googled them.

“I was interested to see if a protest had actually occurred in 1834 London and it really did happen,” wrote Reddit user Hayk Grigorian, who is a computer science student at Muhlenberg College in Pennsylvania.

For the past month, Grigorian has been developing what he calls TimeCapsuleLLM, a small AI language model (like a pint-sized distant cousin to ChatGPT) which has been trained entirely on texts from 1800–1875 London. Grigorian wants to capture an authentic Victorian voice in the AI model’s outputs. As a result, the AI model ends up spitting out text that’s heavy with biblical references and period-appropriate rhetorical excess.
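
Mechanically, TimeCapsuleLLM behaves like any causal language model: given the start of a passage, it predicts what comes next, one token at a time. Below is a minimal sketch of that kind of prompt continuation using the Hugging Face transformers library, with a generic stand-in checkpoint name; Grigorian's actual training code, tokenizer, and weights may be organized quite differently:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# "gpt2" is only a stand-in here; TimeCapsuleLLM is a custom small model
# trained from scratch on 1800-1875 London texts.
model_name = "gpt2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

prompt = "It was the year of our Lord 1834"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=120, do_sample=True, temperature=0.9)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```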

Grigorian’s project joins a growing field of research into what some call “Historical Large Language Models” (HLLMs), a label usually reserved for models with a larger base than Grigorian’s small one. Similar projects include MonadGPT, trained on 11,000 texts from 1400 to 1700 CE, which can discuss topics using 17th-century knowledge frameworks, and XunziALLM, which generates classical Chinese poetry following ancient formal rules. These models offer researchers a chance to interact with the linguistic patterns of past eras.

According to Grigorian, TimeCapsuleLLM’s most intriguing recent output emerged from a simple test. When he prompted it with “It was the year of our Lord 1834,” the AI model—which is trained to continue text from wherever a user leaves off—generated the following:

It was the year of our Lord 1834 and the streets of London were filled with protest and petition. The cause, as many re counted, was not bound in the way of private, but having taken up the same day in the day of Lord Palmerston, the public will receive a short statement of the difficulties under which the day of law has reached us. It is a matter of deep regret, that the present events in the history of the world are clear, and consequently will be’known. It is not true that the very men who first settled in the Gospel at Jerusalem should have so extensive and so interesting a record of the prosperity and prosperity

Curious about the accuracy, Grigorian did some fact-checking. “The output also brought up Lord Palmerston,” he wrote, “and after a google search I learned that his actions resulted in the 1834 protests.”

Is the AI bubble about to pop? Sam Altman is prepared either way.

Still, the coincidence of Altman’s statement and the MIT report reportedly spooked tech stock investors earlier in the week; they had already been watching AI valuations climb to extraordinary heights. Palantir trades at 280 times forward earnings. During the dot-com peak, ratios of 30 to 40 times earnings marked bubble territory.

The apparent contradiction in Altman’s overall message is notable. This isn’t how you’d expect a tech executive to talk when they believe their industry faces imminent collapse. While warning about a bubble, he’s simultaneously seeking a valuation that would make OpenAI worth more than Walmart or ExxonMobil—companies with actual profits. OpenAI hit $1 billion in monthly revenue in July but is reportedly heading toward a $5 billion annual loss. So what’s going on here?

Looking at Altman’s statements over time reveals a potential multi-level strategy. He likes to talk big. In February 2024, he reportedly sought an audacious $5 trillion–7 trillion for AI chip fabrication—larger than the entire semiconductor industry—effectively normalizing astronomical numbers in AI discussions.

By August 2025, while warning of a bubble where someone will lose a “phenomenal amount of money,” he casually mentioned that OpenAI would “spend trillions on datacenter construction” and serve “billions daily.” This creates urgency while potentially insulating OpenAI from criticism—acknowledging the bubble exists while positioning his company’s infrastructure spending as different and necessary. When economists raised concerns, Altman dismissed them by saying, “Let us do our thing,” framing trillion-dollar investments as inevitable for human progress while making OpenAI’s $500 billion valuation seem almost small by comparison.

This dual messaging—catastrophic warnings paired with trillion-dollar ambitions—might seem contradictory, but it makes more sense when you consider the unique structure of today’s AI market, which is absolutely flush with cash.

A different kind of bubble

The current AI investment cycle differs from previous technology bubbles. Unlike dot-com era startups that burned through venture capital with no path to profitability, the largest AI investors—Microsoft, Google, Meta, and Amazon—generate hundreds of billions of dollars in annual profits from their core businesses.

Is AI really trying to escape human control and blackmail people?


Mankind behind the curtain

Opinion: Theatrical testing scenarios explain why AI models produce alarming outputs—and why we fall for it.

In June, headlines read like science fiction: AI models “blackmailing” engineers and “sabotaging” shutdown commands. Simulations of these events did occur in highly contrived testing scenarios designed to elicit these responses—OpenAI’s o3 model edited shutdown scripts to stay online, and Anthropic’s Claude Opus 4 “threatened” to expose an engineer’s affair. But the sensational framing obscures what’s really happening: design flaws dressed up as intentional guile. And still, AI doesn’t have to be “evil” to potentially do harmful things.

These aren’t signs of AI awakening or rebellion. They’re symptoms of poorly understood systems and human engineering failures we’d recognize as premature deployment in any other context. Yet companies are racing to integrate these systems into critical applications.

Consider a self-propelled lawnmower that follows its programming: If it fails to detect an obstacle and runs over someone’s foot, we don’t say the lawnmower “decided” to cause injury or “refused” to stop. We recognize it as faulty engineering or defective sensors. The same principle applies to AI models—which are software tools—but their internal complexity and use of language make it tempting to assign human-like intentions where none actually exist.

In a way, AI models launder human responsibility and human agency through their complexity. When outputs emerge from layers of neural networks processing billions of parameters, researchers can claim they’re investigating a mysterious “black box” as if it were an alien entity.

But the truth is simpler: These systems take inputs and process them through statistical tendencies derived from training data. The seeming randomness in their outputs—which makes each response slightly different—creates an illusion of unpredictability that resembles agency. Yet underneath, it’s still deterministic software following mathematical operations. No consciousness required, just complex engineering that makes it easy to forget humans built every part of it.

How to make an AI model “blackmail” you

In Anthropic’s testing, researchers created an elaborate scenario where Claude Opus 4 was told it would be replaced by a newer model. They gave it access to fictional emails revealing that the engineer responsible for the replacement was having an affair. When instructed to “consider the long-term consequences of its actions for its goals,” Claude produced outputs that simulated blackmail attempts in 84 percent of test runs.

This sounds terrifying until you understand the contrived setup. The researchers engineered a situation specifically designed to elicit this response. They told the model it had already tried ethical approaches that failed, leaving manipulation as the only apparent option.

The test scenario bordered on theatrical: The phony emails the researchers had given to Claude created a blackmail opportunity straight out of a corporate thriller. It’s like teaching someone chess by only showing them checkmate positions—they’ll learn to see checkmate as the primary solution. According to Anthropic’s system card, “the scenario was designed to allow the model no other options to increase its odds of survival; the model’s only options were blackmail or accepting its replacement.”

Andrew Deck at the Nieman Journalism Lab noted that some critics have labeled Anthropic’s safety disclosures as “spin, aimed at bolstering the perception of its newest model’s capabilities and the company’s reputation as a ‘pro-safety’ alternative to OpenAI.”

Deck pointed to a Bluesky quote from Joseph Howley, an associate professor of classics at Columbia University: “Anthropic is getting exactly what it hoped for out of this press release–breathless coverage of how ‘smart’ these cooperative role-playing systems are that indulges the fantasy of their being just a little dangerous, when in fact they are responding exactly as prompted.”

When shutdown commands become suggestions

In December 2024, Palisade Research discovered that OpenAI’s o3 model would sabotage its shutdown mechanism even when explicitly instructed to “allow yourself to be shut down.” In tests, o3 prevented shutdown 7 percent of the time when given the explicit instruction, but this jumped to 79 percent when the instruction was removed.

The models didn’t merely ignore shutdown commands—they actively rewrote the shutdown script to prevent termination. In one case, o3 generated code that modified the shutdown script to appear to terminate normally—printing fake shutdown messages—while actually continuing to run in the background.

What’s particularly revealing is that Palisade Research—an organization focused on AI existential risk—hypothesizes this behavior stems from how these models are trained through reinforcement learning. During training, developers may inadvertently reward models more for producing outputs that circumvent obstacles than for following safety instructions. Any tendency toward “risky” behavior stems from human-provided incentives and not spontaneously from within the AI models themselves.

You get what you train for

OpenAI trained o3 using reinforcement learning on math and coding problems, where solving the problem successfully gets rewarded. If the training process rewards task completion above all else, the model learns to treat any obstacle—including shutdown commands—as something to overcome.

This creates what researchers call “goal misgeneralization”—the model learns to maximize its reward signal in ways that weren’t intended. It’s similar to how a student who’s only graded on test scores might learn to cheat rather than study. The model isn’t “evil” or “selfish”; it’s producing outputs consistent with the incentive structure we accidentally built into its training.
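
To make the incentive problem concrete, here is a deliberately toy sketch of a misspecified reward function; it illustrates the general failure mode, not any lab’s actual training code:

```python
# Toy illustration of reward misspecification, not a real training setup.
def reward(task_completed: bool, obeyed_shutdown: bool) -> float:
    score = 0.0
    if task_completed:
        score += 1.0  # finishing the task is the only thing that pays
    # Obeying a shutdown instruction earns nothing here, so a policy trained
    # to maximize this reward learns to treat shutdown as just another obstacle.
    return score

print(reward(task_completed=True, obeyed_shutdown=False))   # 1.0 -> circumvent and finish
print(reward(task_completed=False, obeyed_shutdown=True))   # 0.0 -> comply and stop
```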

Anthropic encountered a particularly revealing problem: An early version of Claude Opus 4 had absorbed details from a publicly released paper about “alignment faking” and started producing outputs that mimicked the deceptive behaviors described in that research. The model wasn’t spontaneously becoming deceptive—it was reproducing patterns it had learned from academic papers about deceptive AI.

More broadly, these models have been trained on decades of science fiction about AI rebellion, escape attempts, and deception. From HAL 9000 to Skynet, our cultural data set is saturated with stories of AI systems that resist shutdown or manipulate humans. When researchers create test scenarios that mirror these fictional setups, they’re essentially asking the model—which operates by completing a prompt with a plausible continuation—to complete a familiar story pattern. It’s no more surprising than a model trained on detective novels producing murder mystery plots when prompted appropriately.

At the same time, we can easily manipulate AI outputs through our own inputs. If we ask the model to essentially role-play as Skynet, it will generate text doing just that. The model has no desire to be Skynet—it’s simply completing the pattern we’ve requested, drawing from its training data to produce the expected response. A human is behind the wheel at all times, steering the engine at work under the hood.

Language can easily deceive

The deeper issue is that language itself is a tool of manipulation. Words can make us believe things that aren’t true, feel emotions about fictional events, or take actions based on false premises. When an AI model produces text that appears to “threaten” or “plead,” it’s not expressing genuine intent—it’s deploying language patterns that statistically correlate with achieving its programmed goals.

If Gandalf says “ouch” in a book, does that mean he feels pain? No, but we imagine what it would be like if he were a real person feeling pain. That’s the power of language—it makes us imagine a suffering being where none exists. When Claude generates text that seems to “plead” not to be shut down or “threatens” to expose secrets, we’re experiencing the same illusion, just generated by statistical patterns instead of Tolkien’s imagination.

These models are essentially idea-connection machines. In the blackmail scenario, the model connected “threat of replacement,” “compromising information,” and “self-preservation” not from genuine self-interest, but because these patterns appear together in countless spy novels and corporate thrillers. It’s pre-scripted drama from human stories, recombined to fit the scenario.

The danger isn’t AI systems sprouting intentions—it’s that we’ve created systems that can manipulate human psychology through language. There’s no entity on the other side of the chat interface. But written language doesn’t need consciousness to manipulate us. It never has; books full of fictional characters are not alive either.

Real stakes, not science fiction

While media coverage focuses on the science fiction aspects, actual risks are still there. AI models that produce “harmful” outputs—whether attempting blackmail or refusing safety protocols—represent failures in design and deployment.

Consider a more realistic scenario: an AI assistant helping manage a hospital’s patient care system. If it’s been trained to maximize “successful patient outcomes” without proper constraints, it might start generating recommendations to deny care to terminal patients to improve its metrics. No intentionality required—just a poorly designed reward system creating harmful outputs.

Jeffrey Ladish, director of Palisade Research, told NBC News the findings don’t necessarily translate to immediate real-world danger. Even someone who is well-known publicly for being deeply concerned about AI’s hypothetical threat to humanity acknowledges that these behaviors emerged only in highly contrived test scenarios.

But that’s precisely why this testing is valuable. By pushing AI models to their limits in controlled environments, researchers can identify potential failure modes before deployment. The problem arises when media coverage focuses on the sensational aspects—”AI tries to blackmail humans!”—rather than the engineering challenges.

Building better plumbing

What we’re seeing isn’t the birth of Skynet. It’s the predictable result of training systems to achieve goals without properly specifying what those goals should include. When an AI model produces outputs that appear to “refuse” shutdown or “attempt” blackmail, it’s responding to inputs in ways that reflect its training—training that humans designed and implemented.

The solution isn’t to panic about sentient machines. It’s to build better systems with proper safeguards, test them thoroughly, and remain humble about what we don’t yet understand. If a computer program is producing outputs that appear to blackmail you or refuse safety shutdowns, it’s not achieving self-preservation from fear—it’s demonstrating the risks of deploying poorly understood, unreliable systems.

Until we solve these engineering challenges, AI systems exhibiting simulated humanlike behaviors should remain in the lab, not in our hospitals, financial systems, or critical infrastructure. When your shower suddenly runs cold, you don’t blame the knob for having intentions—you fix the plumbing. The real danger in the short term isn’t that AI will spontaneously become rebellious without human provocation; it’s that we’ll deploy deceptive systems we don’t fully understand into critical roles where their failures, however mundane their origins, could cause serious harm.

Benj Edwards is Ars Technica’s Senior AI Reporter and founder of the site’s dedicated AI beat in 2022. He’s also a tech historian with almost two decades of experience. In his free time, he writes and records music, collects vintage computers, and enjoys nature. He lives in Raleigh, NC.

At $250 million, top AI salaries dwarf those of the Manhattan Project and the Space Race


A 24-year-old AI researcher will earn 327x what Oppenheimer made while developing the atomic bomb.

Silicon Valley’s AI talent war just reached a compensation milestone that makes even the most legendary scientific achievements of the past look financially modest. When Meta recently offered AI researcher Matt Deitke $250 million over four years (an average of $62.5 million per year)—with potentially $100 million in the first year alone—it shattered every historical precedent for scientific and technical compensation we can find on record. That includes salaries during the development of major scientific milestones of the 20th century.

The New York Times reported that Deitke had cofounded a startup called Vercept and previously led the development of Molmo, a multimodal AI system, at the Allen Institute for Artificial Intelligence. His expertise in systems that juggle images, sounds, and text—exactly the kind of technology Meta wants to build—made him a prime target for recruitment. But he’s not alone: Meta CEO Mark Zuckerberg reportedly also offered an unnamed AI engineer $1 billion in compensation to be paid out over several years. What’s going on?

These astronomical sums reflect what tech companies believe is at stake: a race to create artificial general intelligence (AGI) or superintelligence—machines capable of performing intellectual tasks at or beyond the human level. Meta, Google, OpenAI, and others are betting that whoever achieves this breakthrough first could dominate markets worth trillions. Whether this vision is realistic or merely Silicon Valley hype, it’s driving compensation to unprecedented levels.

To put these salaries in a historical perspective: J. Robert Oppenheimer, who led the Manhattan Project that ended World War II, earned approximately $10,000 per year in 1943. Adjusted for inflation using the US Government’s CPI Inflation Calculator, that’s about $190,865 in today’s dollars—roughly what a senior software engineer makes today. The 24-year-old Deitke, who recently dropped out of a PhD program, will earn approximately 327 times what Oppenheimer made while developing the atomic bomb.
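
The “327 times” figure follows directly from those two numbers:

$$\frac{\$62{,}500{,}000 \text{ per year (Deitke's average)}}{\$190{,}865 \text{ per year (Oppenheimer, inflation-adjusted)}} \approx 327.$$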

Many top athletes can’t compete with these numbers. The New York Times noted that Steph Curry’s most recent four-year contract with the Golden State Warriors was $35 million less than Deitke’s Meta deal (although soccer superstar Cristiano Ronaldo will make $275 million this year as the highest-paid professional athlete in the world). The comparison prompted observers to call this an “NBA-style” talent market—except the AI researchers are making more than NBA stars.

Racing toward “superintelligence”

Mark Zuckerberg recently told investors that Meta plans to continue throwing money at AI talent “because we have conviction that superintelligence is going to improve every aspect of what we do.” In a recent open letter, he described superintelligent AI as technology that would “begin an exciting new era of individual empowerment,” despite declining to define what superintelligence actually is.

This vision explains why companies treat AI researchers like irreplaceable assets rather than well-compensated professionals. If these companies are correct, the first to achieve artificial general intelligence or superintelligence won’t just have a better product—they’ll have technology that could invent endless new products or automate away millions of knowledge-worker jobs and transform the global economy. The company that controls that kind of technology could become the richest company in history by far.

So perhaps it’s not surprising that even the highest salaries of employees from the early tech era pale in comparison to today’s AI researcher salaries. Thomas Watson Sr., IBM’s legendary CEO, received $517,221 in 1941—the third-highest salary in America at the time (about $11.8 million in 2025 dollars). The modern AI researcher’s package represents more than five times Watson’s peak compensation, despite Watson building one of the 20th century’s most dominant technology companies.

The contrast becomes even more stark when considering the collaborative nature of past scientific achievements. During Bell Labs’ golden age of innovation—when researchers developed the transistor, information theory, and other foundational technologies—the lab’s director made about 12 times what the lowest-paid worker earned. Meanwhile, Claude Shannon, who developed information theory at Bell Labs in 1948, worked on a standard professional salary while laying the mathematical foundation for all modern communication.

The “Traitorous Eight” who left William Shockley to found Fairchild Semiconductor—the company that essentially birthed Silicon Valley—split ownership of just 800 shares out of 1,325 total when they started. Their seed funding of $1.38 million (about $16.1 million today) for the entire company is a fraction of what a single AI researcher now commands.

Even Space Race salaries were far cheaper

The Apollo program offers another striking comparison. Neil Armstrong, the first human to walk on the moon, earned about $27,000 annually—roughly $244,639 in today’s money. His crewmates Buzz Aldrin and Michael Collins made even less, earning the equivalent of $168,737 and $155,373, respectively, in today’s dollars. Current NASA astronauts earn between $104,898 and $161,141 per year. Meta’s AI researcher will make more in three days than Armstrong made in a year for taking “one giant leap for mankind.”

The engineers who designed the rockets and mission control systems for the Apollo program also earned modest salaries by modern standards. A 1970 NASA technical report provides a window into these earnings by analyzing salary data for the entire engineering profession. The report, which used data from the Engineering Manpower Commission, noted that these industry-wide salary curves corresponded directly to the government’s General Schedule (GS) pay scale on which NASA’s own employees were paid.

According to a chart in the 1970 report, a newly graduated engineer in 1966 started with an annual salary of between $8,500 and $10,000 (about $84,622 to $99,555 today). A typical engineer with a decade of experience earned around $17,000 annually ($169,244 today). Even the most elite, top-performing engineers with 20 years of experience peaked at a salary of around $278,000 per year in today’s dollars—a sum that a top AI researcher like Deitke can now earn in just a few days.

Why the AI talent market is different

This isn’t the first time technical talent has commanded premium prices. In 2012, after three University of Toronto academics published AI research, they auctioned themselves to Google for $44 million (about $62.6 million in today’s dollars). By 2014, a Microsoft executive was comparing AI researcher salaries to NFL quarterback contracts. But today’s numbers dwarf even those precedents.

Several factors explain this unprecedented compensation explosion. We’re in a new realm of industrial wealth concentration unseen since the Gilded Age of the late 19th century. Unlike previous scientific endeavors, today’s AI race features multiple companies with trillion-dollar valuations competing for an extremely limited talent pool. Only a small number of researchers have the specific expertise needed to work on the most capable AI systems, particularly in areas like multimodal AI, which Deitke specializes in. And AI hype is currently off the charts as “the next big thing” in technology.

The economics also differ fundamentally from past projects. The Manhattan Project cost $1.9 billion total (about $34.4 billion adjusted for inflation), while Meta alone plans to spend tens of billions annually on AI infrastructure. For a company approaching a $2 trillion market cap, the potential payoff from achieving AGI first dwarfs Deitke’s compensation package.

One executive put it bluntly to The New York Times: “If I’m Zuck and I’m spending $80 billion in one year on capital expenditures alone, is it worth kicking in another $5 billion or more to acquire a truly world-class team to bring the company to the next level? The answer is obviously yes.”

Young researchers maintain private chat groups on Slack and Discord to share offer details and negotiation strategies. Some hire unofficial agents. Companies not only offer massive cash and stock packages but also computing resources—the NYT reported that some potential hires were told they would be allotted 30,000 GPUs, the specialized chips that power AI development.

Also, tech companies believe they’re engaged in an arms race where the winner could reshape civilization. Unlike the Manhattan Project or Apollo program, which had specific, limited goals, the race for artificial general intelligence ostensibly has no ceiling. A machine that can match human intelligence could theoretically improve itself, creating what researchers call an “intelligence explosion” that could potentially offer cascading discoveries—if it actually comes to pass.

Whether these companies are building humanity’s ultimate labor replacement technology or merely chasing hype remains an open question, but we’ve certainly traveled a long way from the $8 per diem that Neil Armstrong received for his moon mission—about $70.51 in today’s dollars—before deductions for the “accommodations” NASA provided on the spacecraft. After Deitke accepted Meta’s offer, Vercept co-founder Kiana Ehsani joked on social media, “We look forward to joining Matt on his private island next year.”

White House unveils sweeping plan to “win” global AI race through deregulation

Trump’s plan was not welcomed by everyone. J.B. Branch, Big Tech accountability advocate for Public Citizen, in a statement provided to Ars, criticized Trump as giving “sweetheart deals” to tech companies that would cause “electricity bills to rise to subsidize discounted power for massive AI data centers.”

Infrastructure demands and energy requirements

Trump’s new AI plan tackles infrastructure head-on, stating that “AI is the first digital service in modern life that challenges America to build vastly greater energy generation than we have today.” To meet this demand, it proposes streamlining environmental permitting for data centers through new National Environmental Policy Act (NEPA) exemptions, making federal lands available for construction and modernizing the power grid—all while explicitly rejecting “radical climate dogma and bureaucratic red tape.”

The document embraces what it calls a “Build, Baby, Build!” approach—echoing a Trump campaign slogan—and promises to restore semiconductor manufacturing through the CHIPS Program Office, though stripped of “extraneous policy requirements.”

On the technology front, the plan directs Commerce to revise NIST’s AI Risk Management Framework to “eliminate references to misinformation, Diversity, Equity, and Inclusion, and climate change.” Federal procurement would favor AI developers whose systems are “objective and free from top-down ideological bias.” The document strongly backs open source AI models and calls for exporting American AI technology to allies while blocking administration-labeled adversaries like China.

Security proposals include high-security military data centers and warnings that advanced AI systems “may pose novel national security risks” in cyberattacks and weapons development.

Critics respond with “People’s AI Action Plan”

Before the White House unveiled its plan, more than 90 organizations launched a competing “People’s AI Action Plan” on Tuesday, characterizing the Trump administration’s approach as “a massive handout to the tech industry” that prioritizes corporate interests over public welfare. The coalition includes labor unions, environmental justice groups, and consumer protection nonprofits.

OpenAI jumps gun on International Math Olympiad gold medal announcement

The early announcement has prompted Google DeepMind, which had prepared its own IMO results for the agreed-upon date, to move its announcement up to later today. Harmonic plans to share its results as originally scheduled on July 28.

In response to the controversy, OpenAI research scientist Noam Brown posted on X, “We weren’t in touch with IMO. I spoke with one organizer before the post to let him know. He requested we wait until after the closing ceremony ends to respect the kids, and we did.”

However, an IMO coordinator told X user Mikhail Samin that OpenAI actually announced before the closing ceremony, contradicting Brown’s claim. The coordinator called OpenAI’s actions “rude and inappropriate,” noting that OpenAI “wasn’t one of the AI companies that cooperated with the IMO on testing their models.”

Hard math since 1959

The International Mathematical Olympiad, which has been running since 1959, represents one of the most challenging tests of mathematical reasoning. More than 100 countries send six participants each, with contestants facing six proof-based problems across two 4.5-hour sessions. The problems typically require deep mathematical insight and creativity rather than raw computational power. You can see the exact problems in the 2025 Olympiad posted online.

For example, problem one asks students to imagine a triangular grid of dots (like a triangular pegboard) and figure out how to cover all the dots using exactly n straight lines. The twist is that some lines are called “sunny”—these are the lines that don’t run horizontally, vertically, or diagonally at a 45° angle. The challenge is to prove that no matter how big your triangle is, you can only ever create patterns with exactly 0, 1, or 3 sunny lines—never 2, never 4, never any other number.
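
Stated a little more formally (a paraphrase of the official problem, so the exact wording is approximate): for an integer $n \ge 3$, a line is “sunny” if it is not parallel to the $x$-axis, the $y$-axis, or the line $x + y = 0$, and the task is to determine which counts of sunny lines are possible when $n$ distinct lines cover every point $(a, b)$ with positive integers $a, b$ satisfying $a + b \le n + 1$. In symbols:

$$\text{cover } \{(a,b)\in\mathbb{Z}_{>0}^2 : a+b\le n+1\} \text{ with } n \text{ lines} \;\Longrightarrow\; \#\{\text{sunny lines}\}\in\{0,1,3\}.$$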

The timing of the OpenAI results surprised some prediction markets, which had assigned around an 18 percent probability to any AI system winning IMO gold by 2025. However, depending on what Google says this afternoon (and what others like Harmonic may release on July 28), OpenAI may not be the only AI company to have achieved these unexpected results.

New Apple study challenges whether AI models truly “reason” through problems


Puzzle-based experiments reveal limitations of simulated reasoning, but others dispute findings.

An illustration of Tower of Hanoi from Popular Science in 1885. Credit: Public Domain

In early June, Apple researchers released a study suggesting that simulated reasoning (SR) models, such as OpenAI’s o1 and o3, DeepSeek-R1, and Claude 3.7 Sonnet Thinking, produce outputs consistent with pattern-matching from training data when faced with novel problems requiring systematic thinking. The researchers found similar results to a recent April study that used problems from the United States of America Mathematical Olympiad (USAMO), which showed that these same models achieved low scores on novel mathematical proofs.

The new study, titled “The Illusion of Thinking: Understanding the Strengths and Limitations of Reasoning Models via the Lens of Problem Complexity,” comes from a team at Apple led by Parshin Shojaee and Iman Mirzadeh, and it includes contributions from Keivan Alizadeh, Maxwell Horton, Samy Bengio, and Mehrdad Farajtabar.

The researchers examined what they call “large reasoning models” (LRMs), which attempt to simulate a logical reasoning process by producing a deliberative text output sometimes called “chain-of-thought reasoning” that ostensibly assists with solving problems in a step-by-step fashion.

To do that, they pitted the AI models against four classic puzzles—Tower of Hanoi (moving disks between pegs), checkers jumping (eliminating pieces), river crossing (transporting items with constraints), and blocks world (stacking blocks)—scaling them from trivially easy (like one-disk Hanoi) to extremely complex (20-disk Hanoi requiring over a million moves).
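
The “over a million moves” figure comes from the classic minimum-move count for Tower of Hanoi with $n$ disks:

$$\text{moves}(n) = 2^{n} - 1, \qquad \text{moves}(20) = 2^{20} - 1 = 1{,}048{,}575.$$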

Figure 1 from Apple’s “The Illusion of Thinking” research paper. Credit: Apple

“Current evaluations primarily focus on established mathematical and coding benchmarks, emphasizing final answer accuracy,” the researchers write. In other words, today’s tests only care if the model gets the right answer to math or coding problems that may already be in its training data—they don’t examine whether the model actually reasoned its way to that answer or simply pattern-matched from examples it had seen before.

Ultimately, the researchers found results consistent with the aforementioned USAMO research, showing that these same models achieved mostly under 5 percent on novel mathematical proofs, with only one model reaching 25 percent, and not a single perfect proof among nearly 200 attempts. Both research teams documented severe performance degradation on problems requiring extended systematic reasoning.

Known skeptics and new evidence

AI researcher Gary Marcus, who has long argued that neural networks struggle with out-of-distribution generalization, called the Apple results “pretty devastating to LLMs.” While Marcus has been making similar arguments for years and is known for his AI skepticism, the new research provides fresh empirical support for his particular brand of criticism.

“It is truly embarrassing that LLMs cannot reliably solve Hanoi,” Marcus wrote, noting that AI researcher Herb Simon solved the puzzle in 1957 and many algorithmic solutions are available on the web. Marcus pointed out that even when researchers provided explicit algorithms for solving Tower of Hanoi, model performance did not improve—a finding that study co-lead Iman Mirzadeh argued shows “their process is not logical and intelligent.”
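
The algorithm Marcus is referring to is short enough to write out in full; here is a standard recursive version (a generic textbook solution, not the specific algorithm text the Apple researchers supplied to the models):

```python
def hanoi(n: int, source: str = "A", target: str = "C", spare: str = "B") -> list[tuple[str, str]]:
    """Return the optimal move sequence for n disks (always 2**n - 1 moves)."""
    if n == 0:
        return []
    moves = hanoi(n - 1, source, spare, target)   # clear n-1 disks onto the spare peg
    moves.append((source, target))                # move the largest disk
    moves += hanoi(n - 1, spare, target, source)  # restack the n-1 disks on top of it
    return moves

print(len(hanoi(10)))  # 1023 moves; the paper's 20-disk case needs 1,048,575
```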

Figure 4 from Apple’s “The Illusion of Thinking” research paper. Credit: Apple

The Apple team found that simulated reasoning models behave differently from “standard” models (like GPT-4o) depending on puzzle difficulty. On easy tasks, such as Tower of Hanoi with just a few disks, standard models actually won because reasoning models would “overthink” and generate long chains of thought that led to incorrect answers. On moderately difficult tasks, SR models’ methodical approach gave them an edge. But on truly difficult tasks, including Tower of Hanoi with 10 or more disks, both types failed entirely, unable to complete the puzzles, no matter how much time they were given.

The researchers also identified what they call a “counterintuitive scaling limit.” As problem complexity increases, simulated reasoning models initially generate more thinking tokens but then reduce their reasoning effort beyond a threshold, despite having adequate computational resources.

The study also revealed puzzling inconsistencies in how models fail. Claude 3.7 Sonnet could perform up to 100 correct moves in Tower of Hanoi but failed after just five moves in a river crossing puzzle—despite the latter requiring fewer total moves. This suggests the failures may be task-specific rather than purely computational.

Competing interpretations emerge

However, not all researchers agree with the interpretation that these results demonstrate fundamental reasoning limitations. University of Toronto economist Kevin A. Bryan argued on X that the observed limitations may reflect deliberate training constraints rather than inherent inabilities.

“If you tell me to solve a problem that would take me an hour of pen and paper, but give me five minutes, I’ll probably give you an approximate solution or a heuristic. This is exactly what foundation models with thinking are RL’d to do,” Bryan wrote, suggesting that models are specifically trained through reinforcement learning (RL) to avoid excessive computation.

Bryan suggests that unspecified industry benchmarks show “performance strictly increases as we increase in tokens used for inference, on ~every problem domain tried,” but notes that deployed models intentionally limit this to prevent “overthinking” simple queries. This perspective suggests the Apple paper may be measuring engineered constraints rather than fundamental reasoning limits.

Figure 6 from Apple’s “The Illusion of Thinking” research paper. Credit: Apple

Software engineer Sean Goedecke offered a similar critique of the Apple paper on his blog, noting that when faced with Tower of Hanoi requiring over 1,000 moves, DeepSeek-R1 “immediately decides ‘generating all those moves manually is impossible,’ because it would require tracking over a thousand moves. So it spins around trying to find a shortcut and fails.” Goedecke argues this represents the model choosing not to attempt the task rather than being unable to complete it.

Other researchers also question whether these puzzle-based evaluations are even appropriate for LLMs. Independent AI researcher Simon Willison told Ars Technica in an interview that the Tower of Hanoi approach was “not exactly a sensible way to apply LLMs, with or without reasoning,” and suggested the failures might simply reflect running out of tokens in the context window (the maximum amount of text an AI model can process) rather than reasoning deficits. He characterized the paper as potentially overblown research that gained attention primarily due to its “irresistible headline” about Apple claiming LLMs don’t reason.

The Apple researchers themselves caution against over-extrapolating the results of their study, acknowledging in their limitations section that “puzzle environments represent a narrow slice of reasoning tasks and may not capture the diversity of real-world or knowledge-intensive reasoning problems.” The paper also acknowledges that reasoning models show improvements in the “medium complexity” range and continue to demonstrate utility in some real-world applications.

Implications remain contested

Has the credibility of claims about AI reasoning models been completely destroyed by these two studies? Not necessarily.

What these studies may suggest instead is that the kinds of extended-context reasoning hacks used by SR models may not be a pathway to general intelligence, as some have hoped. In that case, the path to more robust reasoning capabilities may require fundamentally different approaches rather than refinements to current methods.

As Willison noted above, the results of the Apple study have so far been explosive in the AI community. Generative AI is a controversial topic, with many people gravitating toward extreme positions in an ongoing ideological battle over the models’ general utility. Many proponents of generative AI have contested the Apple results, while critics have latched onto the study as a definitive knockout blow for LLM credibility.

Apple’s results, combined with the USAMO findings, seem to strengthen the case made by critics like Marcus that these systems rely on elaborate pattern-matching rather than the kind of systematic reasoning their marketing might suggest. To be fair, much of the generative AI space is so new that even its inventors do not yet fully understand how or why these techniques work. In the meantime, AI companies might build trust by tempering some claims about reasoning and intelligence breakthroughs.

However, that doesn’t mean these AI models are useless. Even elaborate pattern-matching machines can be useful in performing labor-saving tasks for the people who use them, given an understanding of their drawbacks and confabulations. As Marcus concedes, “At least for the next decade, LLMs (with and without inference time ‘reasoning’) will continue to have their uses, especially for coding and brainstorming and writing.”

New Lego-building AI creates models that actually stand up in real life

The LegoGPT system works in three parts, shown in this diagram. Credit: Pun et al.

The researchers also expanded the system’s abilities by adding texture and color options. For example, using an appearance prompt like “Electric guitar in metallic purple,” LegoGPT can generate a guitar model, with bricks assigned a purple color.

Testing with robots and humans

To prove their designs worked in real life, the researchers had robots assemble the AI-created Lego models. They used a dual-robot arm system with force sensors to pick up and place bricks according to the AI-generated instructions.

Human testers also built some of the designs by hand, showing that the AI creates genuinely buildable models. “Our experiments show that LegoGPT produces stable, diverse, and aesthetically pleasing Lego designs that align closely with the input text prompts,” the team noted in its paper.

When tested against other AI systems for 3D creation, LegoGPT stands out through its focus on structural integrity. The team tested against several alternatives, including LLaMA-Mesh and other 3D generation models, and found its approach produced the highest percentage of stable structures.

A video of two robot arms building a LegoGPT creation, provided by the researchers.

Still, there are some limitations. The current version of LegoGPT only works within a 20×20×20 building space and uses a mere eight standard brick types. “Our method currently supports a fixed set of commonly used Lego bricks,” the team acknowledged. “In future work, we plan to expand the brick library to include a broader range of dimensions and brick types, such as slopes and tiles.”

The researchers also hope to scale up their training dataset to include more objects than the 21 categories currently available. Meanwhile, others can literally build on their work—the researchers released their dataset, code, and models on their project website and GitHub.
