AI

AI Overviews hallucinates that Airbus, not Boeing, involved in fatal Air India crash

When major events occur, most people rush to Google to find information. Increasingly, the first thing they see is an AI Overview, a feature that already has a reputation for making glaring mistakes. In the wake of a tragic plane crash in India, Google’s AI search results are spreading misinformation claiming the incident involved an Airbus plane—it was actually a Boeing 787.

Travelers are more attuned to the airliner models these days after a spate of crashes involving Boeing’s 737 lineup several years ago. Searches for airline disasters are sure to skyrocket in the coming days, with reports that more than 200 passengers and crew lost their lives in the Air India Flight 171 crash. The way generative AI operates means some people searching for details may get the wrong impression from Google’s results page.

Not all searches get AI answers, but Google has been steadily expanding this feature since it debuted last year. One searcher on Reddit spotted a troubling confabulation when searching for crashes involving Airbus planes. AI Overviews, apparently overwhelmed with results reporting on the Air India crash, stated confidently (and incorrectly) that it was an Airbus A330 that fell out of the sky shortly after takeoff. We’ve run a few similar searches—some of the AI results say Boeing, some say Airbus, and some include a strange mashup of both Airbus and Boeing. It’s a mess.

In this search, Google’s AI says the crash involved an Airbus A330 instead of a Boeing 787. Credit: /u/stuckintrraffic

But why is Google bringing up the Air India crash at all in the context of Airbus? Unfortunately, it’s impossible to predict whether you’ll get an AI Overview that blames Boeing or Airbus. Generative AI is non-deterministic, meaning the output can differ from one run to the next, even for identical inputs. Our best guess for the underlying cause is that numerous articles on the Air India crash mention Airbus as Boeing’s main competitor. AI Overviews is essentially summarizing these results, and the AI goes down the wrong path because it lacks the ability to understand what is true.
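For readers curious about where that run-to-run variation comes from, here is a toy Python sketch of temperature-based sampling, the general technique language models commonly use to pick each next word. It is not a description of how AI Overviews is built; the function name `sample_token` and the scores are hypothetical and chosen only for illustration.

```python
import math
import random
from typing import Dict


def sample_token(logits: Dict[str, float], temperature: float, rng: random.Random) -> str:
    """Pick one token from toy scores; temperature 0 always takes the top score."""
    if temperature == 0:
        return max(logits, key=logits.get)
    # Softmax with temperature: higher values flatten the distribution,
    # so lower-scored tokens are chosen more often.
    top = max(logits.values())
    weights = [math.exp((score - top) / temperature) for score in logits.values()]
    return rng.choices(list(logits), weights=weights, k=1)[0]


if __name__ == "__main__":
    # Hypothetical scores for the next word in a summary; not real model output.
    toy_logits = {"Boeing": 2.0, "Airbus": 1.5, "Embraer": -1.0}
    rng = random.Random()  # unseeded, so results vary between runs
    print([sample_token(toy_logits, 1.0, rng) for _ in range(8)])
    print([sample_token(toy_logits, 0, rng) for _ in range(8)])
```

With a temperature above zero, identical inputs can produce different words on different runs, and a plausible but wrong option with a high enough score will sometimes be picked. At temperature zero, this toy sampler always returns the top-scored word.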

AI chatbots tell users what they want to hear, and that’s problematic

After the model has been trained, companies can set system prompts, or guidelines, for how the model should behave to minimize sycophantic behavior.

However, working out the best response means delving into the subtleties of how people communicate with one another, such as determining when a direct response is better than a more hedged one.

“[I]s it for the model to not give egregious, unsolicited compliments to the user?” Joanne Jang, head of model behavior at OpenAI, said in a Reddit post. “Or, if the user starts with a really bad writing draft, can the model still tell them it’s a good start and then follow up with constructive feedback?”

Evidence is growing that some users are becoming hooked on using AI.

A study by MIT Media Lab and OpenAI found that a small proportion of users were becoming addicted. Those who perceived the chatbot as a “friend” also reported lower socialization with other people and higher levels of emotional dependence on a chatbot, as well as other problematic behavior associated with addiction.

“These things set up this perfect storm, where you have a person desperately seeking reassurance and validation paired with a model which inherently has a tendency towards agreeing with the participant,” said Nour from Oxford University.

AI start-ups such as Character.AI that offer chatbots as “companions” have faced criticism for allegedly not doing enough to protect users. Last year, a teenager killed himself after interacting with Character.AI’s chatbot. The teen’s family is suing the company for allegedly causing wrongful death, as well as for negligence and deceptive trade practices.

Character.AI said it does not comment on pending litigation, but added it has “prominent disclaimers in every chat to remind users that a character is not a real person and that everything a character says should be treated as fiction.” The company added it has safeguards to protect under-18s and against discussions of self-harm.

Another concern for Anthropic’s Askell is that AI tools can play with perceptions of reality in subtle ways, such as when offering factually incorrect or biased information as the truth.

“If someone’s being super sycophantic, it’s just very obvious,” Askell said. “It’s more concerning if this is happening in a way that is less noticeable to us [as individual users] and it takes us too long to figure out that the advice that we were given was actually bad.”

© 2025 The Financial Times Ltd. All rights reserved. Not to be redistributed, copied, or modified in any way.

New Apple study challenges whether AI models truly “reason” through problems


Puzzle-based experiments reveal limitations of simulated reasoning, but others dispute findings.

An illustration of Tower of Hanoi from Popular Science in 1885. Credit: Public Domain

In early June, Apple researchers released a study suggesting that simulated reasoning (SR) models, such as OpenAI’s o1 and o3, DeepSeek-R1, and Claude 3.7 Sonnet Thinking, produce outputs consistent with pattern-matching from training data when faced with novel problems requiring systematic thinking. The researchers found similar results to an April study that tested models on problems from the United States of America Mathematical Olympiad (USAMO), which showed that these same models achieved low scores on novel mathematical proofs.

The new study, titled “The Illusion of Thinking: Understanding the Strengths and Limitations of Reasoning Models via the Lens of Problem Complexity,” comes from a team at Apple led by Parshin Shojaee and Iman Mirzadeh, and it includes contributions from Keivan Alizadeh, Maxwell Horton, Samy Bengio, and Mehrdad Farajtabar.

The researchers examined what they call “large reasoning models” (LRMs), which attempt to simulate a logical reasoning process by producing a deliberative text output sometimes called “chain-of-thought reasoning” that ostensibly assists with solving problems in a step-by-step fashion.

To do that, they pitted the AI models against four classic puzzles—Tower of Hanoi (moving disks between pegs), checkers jumping (eliminating pieces), river crossing (transporting items with constraints), and blocks world (stacking blocks)—scaling them from trivially easy (like one-disk Hanoi) to extremely complex (20-disk Hanoi requiring over a million moves).
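To put those move counts in perspective, the shortest solution to an n-disk Tower of Hanoi takes 2^n − 1 moves. Below is a minimal Python sketch of the textbook recursive solution. It is not code from the Apple paper, and the function names and peg labels are illustrative.

```python
from typing import Iterator, Tuple

Move = Tuple[str, str]


def hanoi_moves(n: int, source: str = "A", target: str = "C", spare: str = "B") -> Iterator[Move]:
    """Yield the shortest sequence of (from_peg, to_peg) moves for n disks."""
    if n == 0:
        return
    # Move the top n-1 disks out of the way, onto the spare peg.
    yield from hanoi_moves(n - 1, source, spare, target)
    # Move the largest disk directly to the target peg.
    yield (source, target)
    # Move the n-1 disks from the spare peg onto the largest disk.
    yield from hanoi_moves(n - 1, spare, target, source)


def min_moves(n: int) -> int:
    """Closed-form minimum move count for n disks: 2**n - 1."""
    return 2**n - 1


if __name__ == "__main__":
    print(list(hanoi_moves(3)))
    for disks in (1, 10, 20):
        print(disks, min_moves(disks))
```

The move count doubles (plus one) with each added disk: one disk needs a single move, 10 disks need 1,023, and 20 disks need 1,048,575, which is the "over a million moves" figure cited above.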

Figure 1 from Apple’s “The Illusion of Thinking” research paper. Credit: Apple

“Current evaluations primarily focus on established mathematical and coding benchmarks, emphasizing final answer accuracy,” the researchers write. In other words, today’s tests only care if the model gets the right answer to math or coding problems that may already be in its training data—they don’t examine whether the model actually reasoned its way to that answer or simply pattern-matched from examples it had seen before.

Ultimately, the researchers found results consistent with the aforementioned USAMO research, showing that these same models achieved mostly under 5 percent on novel mathematical proofs, with only one model reaching 25 percent, and not a single perfect proof among nearly 200 attempts. Both research teams documented severe performance degradation on problems requiring extended systematic reasoning.

Known skeptics and new evidence

AI researcher Gary Marcus, who has long argued that neural networks struggle with out-of-distribution generalization, called the Apple results “pretty devastating to LLMs.” While Marcus has been making similar arguments for years and is known for his AI skepticism, the new research provides fresh empirical support for his particular brand of criticism.

“It is truly embarrassing that LLMs cannot reliably solve Hanoi,” Marcus wrote, noting that AI researcher Herb Simon solved the puzzle in 1957 and many algorithmic solutions are available on the web. Marcus pointed out that even when researchers provided explicit algorithms for solving Tower of Hanoi, model performance did not improve—a finding that study co-lead Iman Mirzadeh argued shows “their process is not logical and intelligent.”

Figure 4 from Apple’s “The Illusion of Thinking” research paper. Credit: Apple

The Apple team found that simulated reasoning models behave differently from “standard” models (like GPT-4o) depending on puzzle difficulty. On easy tasks, such as Tower of Hanoi with just a few disks, standard models actually won because reasoning models would “overthink” and generate long chains of thought that led to incorrect answers. On moderately difficult tasks, SR models’ methodical approach gave them an edge. But on truly difficult tasks, including Tower of Hanoi with 10 or more disks, both types failed entirely, unable to complete the puzzles, no matter how much time they were given.

The researchers also identified what they call a “counterintuitive scaling limit.” As problem complexity increases, simulated reasoning models initially generate more thinking tokens but then reduce their reasoning effort beyond a threshold, despite having adequate computational resources.

The study also revealed puzzling inconsistencies in how models fail. Claude 3.7 Sonnet could perform up to 100 correct moves in Tower of Hanoi but failed after just five moves in a river crossing puzzle—despite the latter requiring fewer total moves. This suggests the failures may be task-specific rather than purely computational.

Competing interpretations emerge

However, not all researchers agree with the interpretation that these results demonstrate fundamental reasoning limitations. University of Toronto economist Kevin A. Bryan argued on X that the observed limitations may reflect deliberate training constraints rather than inherent inabilities.

“If you tell me to solve a problem that would take me an hour of pen and paper, but give me five minutes, I’ll probably give you an approximate solution or a heuristic. This is exactly what foundation models with thinking are RL’d to do,” Bryan wrote, suggesting that models are specifically trained through reinforcement learning (RL) to avoid excessive computation.

Bryan suggests that unspecified industry benchmarks show “performance strictly increases as we increase in tokens used for inference, on ~every problem domain tried,” but notes that deployed models intentionally limit this to prevent “overthinking” simple queries. This perspective suggests the Apple paper may be measuring engineered constraints rather than fundamental reasoning limits.

Figure 6 from Apple’s “The Illusion of Thinking” research paper. Credit: Apple

Software engineer Sean Goedecke offered a similar critique of the Apple paper on his blog, noting that when faced with Tower of Hanoi requiring over 1,000 moves, DeepSeek-R1 “immediately decides ‘generating all those moves manually is impossible,’ because it would require tracking over a thousand moves. So it spins around trying to find a shortcut and fails.” Goedecke argues this represents the model choosing not to attempt the task rather than being unable to complete it.

Other researchers also question whether these puzzle-based evaluations are even appropriate for LLMs. Independent AI researcher Simon Willison told Ars Technica in an interview that the Tower of Hanoi approach was “not exactly a sensible way to apply LLMs, with or without reasoning,” and suggested the failures might simply reflect running out of tokens in the context window (the maximum amount of text an AI model can process) rather than reasoning deficits. He characterized the paper as potentially overblown research that gained attention primarily due to its “irresistible headline” about Apple claiming LLMs don’t reason.

The Apple researchers themselves caution against over-extrapolating the results of their study, acknowledging in their limitations section that “puzzle environments represent a narrow slice of reasoning tasks and may not capture the diversity of real-world or knowledge-intensive reasoning problems.” The paper also acknowledges that reasoning models show improvements in the “medium complexity” range and continue to demonstrate utility in some real-world applications.

Implications remain contested

Has the credibility of claims about AI reasoning models been completely destroyed by these two studies? Not necessarily.

What these studies may suggest instead is that the kinds of extended context reasoning hacks used by SR models may not be a pathway to general intelligence, as some have hoped. In that case, the path to more robust reasoning capabilities may require fundamentally different approaches rather than refinements to current methods.

As Willison noted above, the results of the Apple study have so far been explosive in the AI community. Generative AI is a controversial topic, with many people gravitating toward extreme positions in an ongoing ideological battle over the models’ general utility. Many proponents of generative AI have contested the Apple results, while critics have latched onto the study as a definitive knockout blow for LLM credibility.

Apple’s results, combined with the USAMO findings, seem to strengthen the case made by critics like Marcus that these systems rely on elaborate pattern-matching rather than the kind of systematic reasoning their marketing might suggest. To be fair, much of the generative AI space is so new that even its inventors do not yet fully understand how or why these techniques work. In the meantime, AI companies might build trust by tempering some claims about reasoning and intelligence breakthroughs.

However, that doesn’t mean these AI models are useless. Even elaborate pattern-matching machines can be useful in performing labor-saving tasks for the people that use them, given an understanding of their drawbacks and confabulations. As Marcus concedes, “At least for the next decade, LLMs (with and without inference time “reasoning”) will continue have their uses, especially for coding and brainstorming and writing.”

Benj Edwards is Ars Technica’s Senior AI Reporter and founder of the site’s dedicated AI beat in 2022. He’s also a tech historian with almost two decades of experience. In his free time, he writes and records music, collects vintage computers, and enjoys nature. He lives in Raleigh, NC.

“Yuck”: Wikipedia pauses AI summaries after editor revolt

Generative AI is permeating the Internet, with chatbots and AI summaries popping up faster than we can keep track. Even Wikipedia, the vast repository of knowledge famously maintained by an army of volunteer human editors, is looking to add robots to the mix. The site began testing AI summaries in some articles over the past week, but the project has been frozen after editors voiced their opinion. And that opinion is: “yuck.”

The seeds of this project were planted at Wikimedia’s 2024 conference, where foundation representatives and editors discussed how AI could advance Wikipedia’s mission. The wiki on the so-called “Simple Article Summaries” notes that the editors who participated in the discussion believed the summaries could improve learning on Wikipedia.

According to 404 Media, Wikipedia announced the opt-in AI pilot on June 2, which was set to run for two weeks on the mobile version of the site. The summaries appeared at the top of select articles in a collapsed form. Users had to tap to expand and read the full summary. The AI text also included a highlighted “Unverified” badge.

Feedback from the larger community of editors was immediate and harsh. Some of the first comments were simply “yuck,” with others calling the addition of AI a “ghastly idea” and “PR hype stunt.”

Others expounded on the issues with adding AI to Wikipedia, citing a potential loss of trust in the site. Editors work together to ensure articles are accurate, featuring verifiable information and a neutral point of view. However, nothing is certain when you put generative AI in the driver’s seat. “I feel like people seriously underestimate the brand risk this sort of thing has,” said one editor. “Wikipedia’s brand is reliability, traceability of changes, and ‘anyone can fix it.’ AI is the opposite of these things.”

With the launch of o3-pro, let’s talk about what AI “reasoning” actually does


inquiring artificial minds want to know

New studies reveal pattern-matching reality behind the AI industry’s reasoning claims.

On Tuesday, OpenAI announced that o3-pro, a new version of its most capable simulated reasoning model, is now available to ChatGPT Pro and Team users, replacing o1-pro in the model picker. The company also reduced API pricing for o3-pro by 87 percent compared to o1-pro while cutting o3 prices by 80 percent. While “reasoning” is useful for some analytical tasks, new studies have posed fundamental questions about what the word actually means when applied to these AI systems.

We’ll take a deeper look at “reasoning” in a minute, but first, let’s examine what’s new. While OpenAI originally launched o3 (non-pro) in April, the o3-pro model focuses on mathematics, science, and coding while adding new capabilities like web search, file analysis, image analysis, and Python execution. Since these tool integrations slow response times (longer than the already slow o1-pro), OpenAI recommends using the model for complex problems where accuracy matters more than speed. However, simulated reasoning models like o3-pro do not necessarily confabulate less than “non-reasoning” AI models (they still introduce factual errors), which is a significant caveat when seeking accurate results.

Beyond the reported performance improvements, OpenAI announced a substantial price reduction for developers. O3-pro costs $20 per million input tokens and $80 per million output tokens in the API, making it 87 percent cheaper than o1-pro. The company also reduced the price of the standard o3 model by 80 percent.

These reductions address one of the main concerns with reasoning models—their high cost compared to standard models. The original o1 cost $15 per million input tokens and $60 per million output tokens, while o3-mini cost $1.10 per million input tokens and $4.40 per million output tokens.
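As a rough illustration of what those per-million-token prices mean for a single request, here is a small Python sketch. The prices are the ones quoted above. The function name `request_cost` and the token counts in the example are hypothetical, and actual billing depends on OpenAI's current price list.

```python
# Prices in US dollars per million tokens, as quoted in the article.
PRICES_PER_MILLION = {
    "o3-pro": {"input": 20.00, "output": 80.00},
    "o1": {"input": 15.00, "output": 60.00},
    "o3-mini": {"input": 1.10, "output": 4.40},
}


def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Estimate the dollar cost of one request from its token counts."""
    prices = PRICES_PER_MILLION[model]
    total = input_tokens * prices["input"] + output_tokens * prices["output"]
    return total / 1_000_000


if __name__ == "__main__":
    # Hypothetical request: a 2,000-token prompt and a 10,000-token response.
    for model in PRICES_PER_MILLION:
        print(f"{model}: ${request_cost(model, 2_000, 10_000):.4f}")
```

For that hypothetical request, o3-pro comes to about $0.84, o1 to about $0.63, and o3-mini to under five cents. Because output tokens cost four times as much as input tokens at each of these price points, long responses dominate the bill.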

Why use o3-pro?

Unlike general-purpose models like GPT-4o that prioritize speed, broad knowledge, and making users feel good about themselves, o3-pro uses a chain-of-thought simulated reasoning process to devote more output tokens toward working through complex problems, making it generally better for technical challenges that require deeper analysis. But it’s still not perfect.

An OpenAI benchmark chart for o3-pro. Credit: OpenAI

Measuring so-called “reasoning” capability is tricky since benchmarks can be easy to game by cherry-picking or training data contamination, but OpenAI reports that o3-pro is popular among testers, at least. “In expert evaluations, reviewers consistently prefer o3-pro over o3 in every tested category and especially in key domains like science, education, programming, business, and writing help,” writes OpenAI in its release notes. “Reviewers also rated o3-pro consistently higher for clarity, comprehensiveness, instruction-following, and accuracy.”

An OpenAI benchmark chart for o3-pro. Credit: OpenAI

OpenAI shared benchmark results showing o3-pro’s reported performance improvements. On the AIME 2024 mathematics competition, o3-pro achieved 93 percent pass@1 accuracy, compared to 90 percent for o3 (medium) and 86 percent for o1-pro. The model reached 84 percent on PhD-level science questions from GPQA Diamond, up from 81 percent for o3 (medium) and 79 percent for o1-pro. For programming tasks measured by Codeforces, o3-pro achieved an Elo rating of 2748, surpassing o3 (medium) at 2517 and o1-pro at 1707.

When reasoning is simulated

It’s easy for laypeople to be thrown off by the anthropomorphic claims of “reasoning” in AI models. In this case, as with the borrowed anthropomorphic term “hallucinations,” “reasoning” has become a term of art in the AI industry that basically means “devoting more compute time to solving a problem.” It does not necessarily mean the AI models systematically apply logic or possess the ability to construct solutions to truly novel problems. This is why we at Ars Technica continue to use the term “simulated reasoning” (SR) to describe these models. They are simulating a human-style reasoning process that does not necessarily produce the same results as human reasoning when faced with novel challenges.

While simulated reasoning models like o3-pro often show measurable improvements over general-purpose models on analytical tasks, research suggests these gains come from allocating more computational resources to traverse their neural networks in smaller, more directed steps. Researchers call this “inference-time compute” scaling. When these models use what are called “chain-of-thought” techniques, they dedicate more computational resources to exploring connections between concepts in their neural network data. Each intermediate “reasoning” output step (produced in tokens) serves as context for the next token prediction, effectively constraining the model’s outputs in ways that tend to improve accuracy and reduce mathematical errors (though not necessarily factual ones).

But fundamentally, all Transformer-based AI models are pattern-matching marvels. They borrow reasoning patterns from examples in the training data that researchers use to create them. Recent studies on Math Olympiad problems reveal that SR models still function as sophisticated pattern-matching machines—they cannot catch their own mistakes or adjust failing approaches, often producing confidently incorrect solutions without any “awareness” of errors.

Apple researchers found similar limitations when testing SR models on controlled puzzle environments. Even when provided explicit algorithms for solving puzzles like Tower of Hanoi, the models failed to execute them correctly—suggesting their process relies on pattern matching from training data rather than logical reasoning. As problem complexity increased, these models showed a “counterintuitive scaling limit,” reducing their reasoning effort despite having adequate computational resources. This aligns with the USAMO findings showing that models made basic logical errors and continued with flawed approaches even when generating contradictory results.

However, there’s some serious nuance here that you may miss if you’re reaching quickly for a pro-AI or anti-AI take. Pattern-matching and reasoning aren’t necessarily mutually exclusive. Since it’s difficult to mechanically define human reasoning at a fundamental level, we can’t definitively say whether sophisticated pattern-matching is categorically different from “genuine” reasoning or just a different implementation of similar underlying processes. The Tower of Hanoi failures are compelling evidence of current limitations, but they don’t resolve the deeper philosophical question of what reasoning actually is.

And understanding these limitations doesn’t diminish the genuine utility of SR models. For many real-world applications—debugging code, solving math problems, or analyzing structured data—pattern matching from vast training sets is enough to be useful. But as we consider the industry’s stated trajectory toward artificial general intelligence and even superintelligence, the evidence so far suggests that simply scaling up current approaches or adding more “thinking” tokens may not bridge the gap between statistical pattern recognition and what might be called generalist algorithmic reasoning.

But the technology is evolving rapidly, and new approaches are already being developed to address those shortcomings. For example, self-consistency sampling allows models to generate multiple solution paths and check for agreement, while self-critique prompts attempt to make models evaluate their own outputs for errors. Tool augmentation represents another useful direction already used by o3-pro and other ChatGPT models—by connecting LLMs to calculators, symbolic math engines, or formal verification systems, researchers can compensate for some of the models’ computational weaknesses. These methods show promise, though they don’t yet fully address the fundamental pattern-matching nature of current systems.
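To make the self-consistency idea concrete, here is a minimal Python sketch of the voting step: sample several answers, then keep the one that appears most often. The `sample_answer` callable is a hypothetical stand-in for a model call, and the canned answers in the demo are invented for illustration; this is not how o3-pro or any specific product implements it.

```python
from collections import Counter
from typing import Callable, Tuple


def self_consistent_answer(sample_answer: Callable[[], str], n_samples: int = 5) -> Tuple[str, float]:
    """Sample several answers and return the most common one with its agreement rate."""
    answers = [sample_answer() for _ in range(n_samples)]
    # most_common(1) returns the top answer; ties go to the answer seen first.
    winner, votes = Counter(answers).most_common(1)[0]
    return winner, votes / n_samples


if __name__ == "__main__":
    # Hypothetical final answers from five independent solution paths.
    canned = iter(["42", "41", "42", "42", "7"])
    answer, agreement = self_consistent_answer(lambda: next(canned), n_samples=5)
    print(answer, agreement)
```

A low agreement rate is a useful signal that the answer deserves a second look. Agreement is not proof of correctness, though: if the model makes the same mistake on most paths, the vote will confidently pick the wrong answer.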

For now, o3-pro is a better, cheaper version of what OpenAI previously provided. It’s good at solving familiar problems, struggles with truly new ones, and still makes confident mistakes. If you understand its limitations, it can be a powerful tool, but always double-check the results.

Scientists built a badminton-playing robot with AI-powered skills

It also learned fall avoidance and determined how much risk was reasonable to take given its limited speed. The robot did not attempt impossible plays that would create the potential for serious damage—it was committed, but not suicidal.

But when it finally played humans, it turned out ANYmal, as a badminton player, was amateur at best.

The major leagues

The first problem was its reaction time. An average human reacts to visual stimuli in around 0.2–0.25 seconds. Elite badminton players with trained reflexes, anticipation, and muscle memory can cut this time down to 0.12–0.15 seconds. ANYmal needed roughly 0.35 seconds after the opponent hit the shuttlecock to register trajectories and figure out what to do.

Part of the problem was poor eyesight. “I think perception is still a big issue,” Ma said. “The robot localized the shuttlecock with the stereo camera and there could be a positioning error introduced at each timeframe.” The camera also had a limited field of view, which meant the robot could see the shuttlecock for only a limited time before it had to act. “Overall, it was suited for more friendly matches—when the human player starts to smash, the success rate goes way down for the robot,” Ma acknowledged.

But his team already has some ideas on how to make ANYmal better. Reaction time can be improved by predicting the shuttlecock trajectory based on the opponent’s body position rather than waiting to see the shuttlecock itself—a technique commonly used by elite badminton or tennis players. To improve ANYmal’s perception, the team wants to fit it with more advanced hardware, like event cameras—vision sensors that register movement with ultra-low latencies in the microseconds range. Other improvements might include faster, more capable actuators.

“I think the training framework we propose would be useful in any application where you need to balance perception and control—picking objects up, even catching and throwing stuff,” Ma suggested. Sadly, one thing that’s almost certainly off the table is taking ANYmal to major leagues in badminton or tennis. “Would I set up a company selling badminton-playing robots? Well, maybe not,” Ma said.

Science Robotics, 2025. DOI: 10.1126/scirobotics.adu3922

After AI setbacks, Meta bets billions on undefined “superintelligence”

Meta has developed plans to create a new artificial intelligence research lab dedicated to pursuing “superintelligence,” according to reporting from The New York Times. The social media giant chose 28-year-old Alexandr Wang, founder and CEO of Scale AI, to join the new lab as part of a broader reorganization of Meta’s AI efforts under CEO Mark Zuckerberg.

Superintelligence refers to a hypothetical AI system that would exceed human cognitive abilities—a step beyond artificial general intelligence (AGI), which aims to match an intelligent human’s capability for learning new tasks without intensive specialized training.

However, much like AGI, superintelligence remains a nebulous term in the field. Since scientists still poorly understand the mechanics of human intelligence, and because human intelligence has no single definition and resists simple quantification, identifying superintelligence when it arrives will present significant challenges.

Computers already far surpass humans in certain forms of information processing such as calculations, but this narrow superiority doesn’t qualify as superintelligence under most definitions. The pursuit assumes we’ll recognize it when we see it, despite the conceptual fuzziness.

AI researcher Dr. Margaret Mitchell told Ars Technica in April 2024 that there will “likely never be agreement on comparisons between human and machine intelligence” but predicted that “men in positions of power and influence, particularly ones with investments in AI, will declare that AI is smarter than humans” regardless of the reality.

The new lab represents Meta’s effort to remain competitive in the increasingly crowded AI race, where tech giants continue pouring billions into research and talent acquisition. Meta has reportedly offered compensation packages worth seven to nine figures to dozens of researchers from companies like OpenAI and Google, according to The New York Times, with some already agreeing to join the company.

Meta joins a growing list of tech giants making bold claims about advanced AI development. In January, OpenAI CEO Sam Altman wrote in a blog post that “we are now confident we know how to build AGI as we have traditionally understood it.” Earlier, in September 2024, Altman predicted that the AI industry might develop superintelligence “in a few thousand days.” Elon Musk made an even more aggressive prediction in April 2024, saying that AI would be “smarter than the smartest human” by “next year, within two years.”

Apple tiptoes with modest AI updates while rivals race ahead

Developers, developers, developers?

This being the Worldwide Developers Conference, it seems appropriate that Apple also announced it would open access to its on-device AI language model to third-party developers. It also announced it would integrate OpenAI’s code completion tools into its Xcode development software.


Apple Intelligence was first unveiled at WWDC 2024. Credit: Apple

“We’re opening up access for any app to tap directly into the on-device, large language model at the core of Apple,” said Craig Federighi, Apple’s software chief, during the presentation. The company also demonstrated early partner integration by adding OpenAI’s ChatGPT image generation to its Image Playground app, though it said user data would not be shared without permission.

For developers, the inclusion of ChatGPT’s code-generation capabilities in Xcode may represent Apple’s attempt to match what rivals like GitHub Copilot and Cursor offer software developers in AI coding augmentation, even as the company maintains a more cautious approach to consumer-facing AI features.

Meanwhile, competitors like Meta, Anthropic, OpenAI, and Microsoft continue to push more aggressively into the AI space, offering AI assistants that admittedly still make things up and suffer from other issues, such as sycophancy.

Only time will tell if Apple’s reluctance to embrace the bleeding edge of AI will be a curse (eventually labeled as a blunder) or a blessing (lauded as a wise strategy). Perhaps, in time, Apple will step in with a solid and reliable AI assistant solution that makes Siri useful again. But for now, Apple Intelligence remains more of a clever brand name than a concrete set of notable products.



Apple’s AI-driven Stem Splitter audio separation tech has hugely improved in a year

Consider an example from a song I’ve been working on. Here’s a snippet of the full piece:


After running Logic’s original Stem Splitter on the snippet, I was given four tracks: Vocals, Drums, Bass, and “Other.” They all isolated their parts reasonably well, but check out the static and artifacting when you isolate the bass track:



The vocal track came out better, but it was still far from ideal:


Now, just over a year later, Apple has released a point update for Logic that delivers “enhanced audio fidelity” for Stem Splitter—along with support for new stems for guitar and piano.


Logic now splits audio into more stems.

The difference in quality is significant, as you can hear in the new bass track:


And the new vocal track, though still lacking the pristine fidelity of the original recording, is nevertheless greatly improved:


The ability to separate out guitars and pianos is also welcome, and it works well. Here’s the piano part:



Pretty impressive leap in fidelity for a point release!

There are plenty of other stem-splitting tools, of course, and many have had a head start on Apple. With its new release, however, Apple has certainly closed the gap.

iZotope’s RX 11, for instance, is a highly regarded (and expensive!) piece of software that can do wonders when it comes to repairing audio and reducing clicks, background noise, and sibilance.


RX 11, ready to split some stems.

It includes a stem-splitting feature that can produce four outputs (vocal, bass, drums, and other), and it produces usable audio—but I’m not sure I’d rank its output more highly than Logic’s. Compare for yourself on the vocal and bass stems:



In any event, the AI/machine learning revolution has certainly arrived in the music world, and the rapid quality increase in stem-splitting tools over just a few years shows what these AI systems are capable of when trained on enough data. I remain especially impressed by how the best stem splitters can extract not just a clean vocal but also the reverb/delay tail. Having access to the original recordings will always be better, but stem-splitting tech is improving quickly.



Anthropic releases custom AI chatbot for classified spy work

On Thursday, Anthropic unveiled specialized AI models designed for US national security customers. The company released “Claude Gov” models that were built in response to direct feedback from government clients to handle operations such as strategic planning, intelligence analysis, and operational support. The custom models reportedly already serve US national security agencies, with access restricted to those working in classified environments.

The Claude Gov models differ from Anthropic’s consumer and enterprise offerings, also called Claude, in several ways. They reportedly handle classified material, “refuse less” when engaging with classified information, and are customized to handle intelligence and defense documents. The models also feature what Anthropic calls “enhanced proficiency” in languages and dialects critical to national security operations.

Anthropic says the new models underwent the same “safety testing” as all Claude models. The company has been pursuing government contracts as it seeks reliable revenue sources, partnering with Palantir and Amazon Web Services in November to sell AI tools to defense customers.

Anthropic is not the first company to offer specialized chatbot services for intelligence agencies. In 2024, Microsoft launched an isolated version of OpenAI’s GPT-4 for the US intelligence community after 18 months of work. That system, which operated on a special government-only network without Internet access, became available to about 10,000 individuals in the intelligence community for testing and answering questions.



Ted Cruz bill: States that regulate AI will be cut out of $42B broadband fund

BEAD changes: No fiber preference, no low-cost mandate

The BEAD program is separately undergoing an overhaul because Republicans don’t like how it was administered by Democrats. The Biden administration spent about three years developing rules and procedures for BEAD and then evaluating plans submitted by each US state and territory, but the Trump administration has delayed grants while it rewrites the rules.

While Biden’s Commerce Department decided to prioritize the building of fiber networks, Republicans have pushed for a “tech-neutral approach” that would benefit cable companies, fixed wireless providers, and Elon Musk’s Starlink satellite service.

Secretary of Commerce Howard Lutnick previewed changes in March, and today he announced more details of the overhaul that will eliminate the fiber preference and various requirements imposed on states. One notable but unsurprising change is that the Trump administration won’t let states require grant recipients to offer low-cost Internet plans at specific rates to people with low incomes.

The National Telecommunications and Information Administration (NTIA) “will refuse to accept any low-cost service option proposed in a [state or territory’s] Final Proposal that attempts to impose a specific rate level (i.e., dollar amount),” the Trump administration said. Instead, ISPs receiving subsidies will be able to continue offering “their existing, market driven low-cost plans to meet the statutory low-cost requirement.”

The Benton Institute for Broadband & Society criticized the overhaul, saying that the Trump administration is investing in the cheapest broadband infrastructure instead of the best. “Fiber-based broadband networks will last longer, provide better, more reliable service, and scale to meet communities’ ever-growing connectivity needs,” the advocacy group said. “NTIA’s new guidance is shortsighted and will undermine economic development in rural America for decades to come.”

The Trump administration’s overhaul drew praise from cable lobby group NCTA-The Internet & Television Association, whose members will find it easier to obtain subsidies. “We welcome changes to the BEAD program that will make the program more efficient and eliminate onerous requirements, which add unnecessary costs that impede broadband deployment efforts,” NCTA said. “These updates are welcome improvements that will make it easier for providers to build faster, especially in hard-to-reach communities, without being bogged down by red tape.”



OpenAI is retaining all ChatGPT logs “indefinitely.” Here’s who’s affected.

In the copyright fight, Magistrate Judge Ona Wang granted the order within one day of the NYT’s request. She agreed with the news plaintiffs that ChatGPT users might be spooked by the lawsuit and set their chats to delete when using the chatbot to skirt NYT paywalls. Because OpenAI wasn’t sharing deleted chat logs, the news plaintiffs had no way of proving that, she suggested.

Now, OpenAI is not only asking Wang to reconsider but has “also appealed this order with the District Court Judge,” the Thursday statement said.

“We strongly believe this is an overreach by the New York Times,” Lightcap said. “We’re continuing to appeal this order so we can keep putting your trust and privacy first.”

Who can access deleted chats?

To protect users, OpenAI provides an FAQ that clearly explains why their data is being retained and how it could be exposed.

For example, the statement noted that the order doesn’t impact OpenAI API business customers under Zero Data Retention agreements because their data is never stored.

And for users whose data is affected, OpenAI noted that their deleted chats could be accessed, but they won’t “automatically” be shared with The New York Times. Instead, the retained data will be “stored separately in a secure system” and “protected under legal hold, meaning it can’t be accessed or used for purposes other than meeting legal obligations,” OpenAI explained.

Of course, with the court battle ongoing, the FAQ did not have all the answers.

Nobody knows how long OpenAI may be required to retain the deleted chats. Likely seeking to reassure users, some of whom appeared to be considering switching to a rival service until the order lifts, OpenAI noted that “only a small, audited OpenAI legal and security team would be able to access this data as necessary to comply with our legal obligations.”
