Anthropic CEO floats idea of giving AI a “quit job” button, sparking skepticism

Amodei’s suggestion of giving AI models a way to refuse tasks drew immediate skepticism on X and Reddit as a clip of his response began to circulate earlier this week. One critic on Reddit argued that providing AI with such an option encourages needless anthropomorphism, attributing human-like feelings and motivations to entities that fundamentally lack subjective experiences. They emphasized that task avoidance in AI models signals issues with poorly structured incentives or unintended optimization strategies during training, rather than indicating sentience, discomfort, or frustration.

Our take is that AI models are trained to mimic human behavior from vast amounts of human-generated data. There is no guarantee that the model would “push” a discomfort button because it had a subjective experience of suffering. Instead, it would more likely be echoing training data scraped from a vast corpus of human-generated text (including books, websites, and Internet comments), which no doubt includes representations of lazy, anguished, or suffering workers that it might be imitating.

Refusals already happen

Anthropic co-founder and CEO Dario Amodei on May 22, 2024. Credit: Chesnot via Getty Images

In 2023, people frequently complained about refusals in ChatGPT that may have been seasonal, related to training data depictions of people taking winter vacations and not working as hard during certain times of year. Anthropic experienced its own version of the “winter break hypothesis” last year when people claimed Claude became lazy in August due to training data depictions of seeking a summer break, although that was never proven.

However, as far out and ridiculous as this sounds today, it might be short-sighted to permanently rule out the possibility of some kind of subjective experience for AI models as they grow more advanced. Even so, will they “suffer” or feel pain? It’s a highly contentious idea, but it’s a topic that Fish is studying for Anthropic, and one that Amodei is apparently taking seriously. But for now, AI models are tools, and if you give them the opportunity to malfunction, they may take it.

To provide further context, here is the full transcript of Amodei’s answer during Monday’s interview (the answer begins around 49:54 in this video).

Pocket Casts makes its web player free, takes shots at Spotify and AI

“The future of podcasting shouldn’t be locked behind walled gardens,” writes the team at Pocket Casts. To push that point forward, Pocket Casts, owned by the company behind WordPress, Automattic Inc., has made its web player free to everyone.

Previously available only to logged-in Pocket Casts users paying $4 per month, the web player now offers nearly any public-facing podcast feed for streaming, along with controls like playback speed and playlist queueing. If you create an account, you can also sync your playback progress, manage your queue, bookmark episode moments, and save your subscription list and listening preferences. The free access also applies to its clients for Windows and Mac.

“Podcasting is one of the last open corners of the Internet, and we’re here to keep it that way,” Pocket Casts’ blog post reads. For those not fully tuned into the podcasting market, this and other statements in the post—like sharing “without needing a specific platform’s approval” and “podcasts belong to the people, not corporations”—are largely shots at Spotify, and to a much lesser extent other streaming services, which have sought to wrap podcasting’s originally open and RSS-based nature inside proprietary markets and formats.

Pocket Casts also took a bullet point to note that “discovery should be organic, not algorithm-driven,” and that users, not an AI, should “promote what’s best for the platform.”

Spotify spent big to acquire podcasts like the Joe Rogan Experience, along with podcast analytic and advertising tools. As the platform now starts leaning into video podcasts, seeking to compete with the podcasts simulcasting or exclusively on YouTube, Pocket Casts’ concerns about the open origins of podcasting being co-opted are not unfounded. (Pocket Casts’ current owner, Automattic, is involved in an extended debate in public, and the courts, regarding how “open” some of its products should be.)

OpenAI pushes AI agent capabilities with new developer API

Developers using the Responses API can access the same models that power ChatGPT Search: GPT-4o search and GPT-4o mini search. These models can browse the web to answer questions and cite sources in their responses.

That’s notable because OpenAI says the added web search ability dramatically improves the factual accuracy of its AI models. On OpenAI’s SimpleQA benchmark, which aims to measure confabulation rate, GPT-4o search scored 90 percent, while GPT-4o mini search achieved 88 percent—both substantially outperforming the larger GPT-4.5 model without search, which scored 63 percent.

Despite these improvements, the technology still has significant limitations. Aside from issues with CUA properly navigating websites, the improved search capability doesn’t completely solve the problem of AI confabulations, with GPT-4o search still making factual mistakes 10 percent of the time.

Alongside the Responses API, OpenAI released the open source Agents SDK, providing developers with free tools to integrate models with internal systems, implement safeguards, and monitor agent activities. This toolkit follows OpenAI’s earlier release of Swarm, a framework for orchestrating multiple agents.

These are still early days in the AI agent field, and things will likely improve rapidly. However, at the moment, the AI agent movement remains vulnerable to unrealistic claims, as demonstrated earlier this week when users discovered that Chinese startup Butterfly Effect’s Manus AI agent platform failed to deliver on many of its promises, highlighting the persistent gap between promotional claims and practical functionality in this emerging technology category.

Why extracting data from PDFs is still a nightmare for data experts


Optical Character Recognition

Countless digital documents hold valuable info, and the AI industry is attempting to set it free.

For years, businesses, governments, and researchers have struggled with a persistent problem: How to extract usable data from Portable Document Format (PDF) files. These digital documents serve as containers for everything from scientific research to government records, but their rigid formats often trap the data inside, making it difficult for machines to read and analyze.

“Part of the problem is that PDFs are a creature of a time when print layout was a big influence on publishing software, and PDFs are more of a ‘print’ product than a digital one,” Derek Willis, a lecturer in Data and Computational Journalism at the University of Maryland, wrote in an email to Ars Technica. “The main issue is that many PDFs are simply pictures of information, which means you need Optical Character Recognition software to turn those pictures into data, especially when the original is old or includes handwriting.”

Computational journalism is a field where traditional reporting techniques merge with data analysis, coding, and algorithmic thinking to uncover stories that might otherwise remain hidden in large datasets, which makes unlocking that data a particular interest for Willis.

The PDF challenge also represents a significant bottleneck in the world of data analysis and machine learning at large. According to several studies, approximately 80–90 percent of the world’s organizational data is stored as unstructured data in documents, much of it locked away in formats that resist easy extraction. The problem worsens with two-column layouts, tables, charts, and scanned documents with poor image quality.

The inability to reliably extract data from PDFs affects numerous sectors but hits hardest in areas that rely heavily on documentation and legacy records, including digitizing scientific research, preserving historical documents, streamlining customer service, and making technical literature more accessible to AI systems.

“It is a very real problem for almost anything published more than 20 years ago and in particular for government records,” Willis says. “That impacts not just the operation of public agencies like the courts, police, and social services but also journalists, who rely on those records for stories. It also forces some industries that depend on information, like insurance and banking, to invest time and resources in converting PDFs into data.”

A very brief history of OCR

Traditional optical character recognition (OCR) technology, which converts images of text into machine-readable text, has been around since the 1970s. Inventor Ray Kurzweil pioneered the commercial development of OCR systems, including the Kurzweil Reading Machine for the blind in 1976, which relied on pattern-matching algorithms to identify characters from pixel arrangements.

These traditional OCR systems typically work by identifying patterns of light and dark pixels in images, matching them to known character shapes, and outputting the recognized text. While effective for clear, straightforward documents, these pattern-matching systems, a form of AI themselves, often falter when faced with unusual fonts, multiple columns, tables, or poor-quality scans.
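The pattern-matching idea described above can be sketched in a few lines. This is a minimal illustration, not how any particular OCR product works: the 3x5 glyph shapes, the `TEMPLATES` table, and the function names are all assumptions made up for the example.

```python
# Toy 3x5 glyph templates ('#' = dark pixel). Shapes are illustrative assumptions.
TEMPLATES = {
    "1": (".#.", "##.", ".#.", ".#.", "###"),
    "3": ("###", "..#", "###", "..#", "###"),
    "8": ("###", "#.#", "###", "#.#", "###"),
}

def pixel_mismatches(glyph, template):
    """Count the pixels where a scanned glyph differs from a template."""
    return sum(
        g != t
        for glyph_row, template_row in zip(glyph, template)
        for g, t in zip(glyph_row, template_row)
    )

def recognize(glyph):
    """Return the character whose template has the fewest mismatched pixels."""
    return min(TEMPLATES, key=lambda ch: pixel_mismatches(glyph, TEMPLATES[ch]))

# An "8" with one pixel lost to scan noise (top-right corner).
noisy_eight = ("##.", "#.#", "###", "#.#", "###")
print(recognize(noisy_eight))
```

In this toy set, the “3” and “8” templates differ by only two pixels, so a poor scan that drops those two pixels would flip the result. That is the kind of failure on low-quality scans that the paragraph above describes.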

Traditional OCR persists in many workflows precisely because its limitations are well-understood—it makes predictable errors that can be identified and corrected, offering a reliability that sometimes outweighs the theoretical advantages of newer AI-based solutions. But now that transformer-based large language models (LLMs) are getting the lion’s share of funding dollars, companies are increasingly turning to them for a new approach to reading documents.

The rise of AI language models in OCR

Unlike traditional OCR methods that follow a rigid sequence of identifying characters based on pixel patterns, multimodal LLMs that can read documents are trained on text and images that have been translated into chunks of data called tokens and fed into large neural networks. Vision-capable LLMs from companies like OpenAI, Google, and Meta analyze documents by recognizing relationships between visual elements and understanding contextual cues.

The “visual” image-based method is how ChatGPT reads a PDF file, for example, if you upload it through the AI assistant interface. It’s a fundamentally different approach than standard OCR, one that potentially allows these models to process documents more holistically, considering both visual layouts and text content simultaneously.

And as it turns out, some LLMs from certain vendors are better at this task than others.

“The LLMs that do well on these tasks tend to behave in ways that are more consistent with how I would do it manually,” Willis said. He noted that some traditional OCR methods are quite good, particularly Amazon’s Textract, but that “they also are bound by the rules of their software and limitations on how much text they can refer to when attempting to recognize an unusual pattern.” Willis added, “With LLMs, I think you trade that for an expanded context that seems to help them make better predictions about whether a digit is a three or an eight, for example.”

This context-based approach enables these models to better handle complex layouts, interpret tables, and distinguish between document elements like headers, captions, and body text—all tasks that traditional OCR solutions struggle with.

“[LLMs] aren’t perfect and sometimes require significant intervention to do the job well, but the fact that you can adjust them at all [with custom prompts] is a big advantage,” Willis said.

New attempts at LLM-based OCR

As the demand for better document-processing solutions grows, new AI players are entering the market with specialized offerings. One such recent entrant has caught the attention of document-processing specialists in particular.

Mistral, a French AI company known for its smaller LLMs, recently entered the LLM-powered optical reader space with Mistral OCR, a specialized API designed for document processing. According to Mistral’s materials, their system aims to extract text and images from documents with complex layouts by using its language model capabilities to process document elements.

However, these promotional claims don’t always match real-world performance, according to recent tests. “I’m typically a pretty big fan of the Mistral models, but the new OCR-specific one they released last week really performed poorly,” Willis noted.

“A colleague sent this PDF and asked if I could help him parse the table it contained,” says Willis. “It’s an old document with a table that has some complex layout elements. The new [Mistral] OCR-specific model really performed poorly, repeating the names of cities and botching a lot of the numbers.”

AI app developer Alexander Doria also recently pointed out on X a flaw with Mistral OCR’s ability to understand handwriting, writing, “Unfortunately Mistral-OCR has still the usual VLM curse: with challenging manuscripts, it hallucinates completely.”

According to Willis, Google currently leads the field in AI models that can read documents: “Right now, for me the clear leader is Google’s Gemini 2.0 Flash Pro Experimental. It handled the PDF that Mistral did not with a tiny number of mistakes, and I’ve run multiple messy PDFs through it with success, including those with handwritten content.”

Gemini’s performance stems largely from its ability to process expansive documents (in a type of short-term memory called a “context window”), which Willis specifically notes as a key advantage: “The size of its context window also helps, since I can upload large documents and work through them in parts.” This capability, combined with more robust handling of handwritten content, apparently gives Google’s model a practical edge over competitors in real-world document-processing tasks for now.

The drawbacks of LLM-based OCR

Despite their promise, LLMs introduce several new problems to document processing. Among them, they can introduce confabulations or hallucinations (plausible-sounding but incorrect information), accidentally follow instructions in the text (thinking they are part of a user prompt), or just generally misinterpret the data.

“The biggest [drawback] is that they are probabilistic prediction machines and will get it wrong in ways that aren’t just ‘that’s the wrong word’,” Willis explains. “LLMs will sometimes skip a line in larger documents where the layout repeats itself, I’ve found, where OCR isn’t likely to do that.”

AI researcher and data journalist Simon Willison identified several critical concerns about using LLMs for OCR in a conversation with Ars Technica. “I still think the biggest challenge is the risk of accidental instruction following,” he says, always wary of prompt injections (in this case accidental) that might feed nefarious or contradictory instructions to an LLM.

“That and the fact that table interpretation mistakes can be catastrophic,” Willison adds. “In the past I’ve had lots of cases where a vision LLM has matched up the wrong line of data with the wrong heading, which results in absolute junk that looks correct. Also that thing where sometimes if text is illegible a model might just invent the text.”

These issues become particularly troublesome when processing financial statements, legal documents, or medical records, where a mistake might put someone’s life in danger. The reliability problems mean these tools often require careful human oversight, limiting their value for fully automated data extraction.

The path forward

Even in our seemingly advanced age of AI, there is still no perfect OCR solution. The race to unlock data from PDFs continues, with companies like Google now offering context-aware generative AI products. Some of the motivation for unlocking PDFs among AI companies, as Willis observes, doubtless involves potential training data acquisition: “I think Mistral’s announcement is pretty clear evidence that documents—not just PDFs—are a big part of their strategy, exactly because it will likely provide additional training data.”

Whether it benefits AI companies with training data or historians analyzing a historical census, as these technologies improve, they may unlock repositories of knowledge currently trapped in digital formats designed primarily for human consumption. That could lead to a new golden age of data analysis—or a field day for hard-to-spot mistakes, depending on the technology used and how blindly we trust it.

Benj Edwards is Ars Technica’s Senior AI Reporter and founder of the site’s dedicated AI beat in 2022. He’s also a tech historian with almost two decades of experience. In his free time, he writes and records music, collects vintage computers, and enjoys nature. He lives in Raleigh, NC.

What does “PhD-level” AI mean? OpenAI’s rumored $20,000 agent plan explained.

On the Frontier Math benchmark by EpochAI, o3 solved 25.2 percent of problems, while no other model has exceeded 2 percent—suggesting a leap in mathematical reasoning capabilities over the previous model.

Benchmarks vs. real-world value

Ideally, potential applications for a true PhD-level AI model would include analyzing medical research data, supporting climate modeling, and handling routine aspects of research work.

The high price points reported by The Information, if accurate, suggest that OpenAI believes these systems could provide substantial value to businesses. The publication notes that SoftBank, an OpenAI investor, has committed to spending $3 billion on OpenAI’s agent products this year alone—indicating significant business interest despite the costs.

Meanwhile, OpenAI faces financial pressures that may influence its premium pricing strategy. The company reportedly lost approximately $5 billion last year covering operational costs and other expenses related to running its services.

News of OpenAI’s stratospheric pricing plans comes after years of relatively affordable AI services that have conditioned users to expect powerful capabilities at relatively low costs. ChatGPT Plus remains $20 per month and Claude Pro costs $30 monthly, both tiny fractions of these proposed enterprise tiers. Even ChatGPT Pro’s $200/month subscription is relatively small compared to the new proposed fees. Whether the performance difference between these tiers will match their thousandfold price difference is an open question.
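The price gap is easy to check with the figures quoted in this article. The sketch below assumes the rumored top tier is the $20,000 per month named in the headline; the variable names are just for illustration.

```python
# Monthly prices in USD, as quoted in this article.
proposed_monthly = 20_000  # rumored top-tier agent price from the headline
current_plans = {"ChatGPT Plus": 20, "Claude Pro": 30, "ChatGPT Pro": 200}

# How many times more expensive the rumored tier is than each existing plan.
ratios = {name: proposed_monthly / price for name, price in current_plans.items()}

for name, ratio in ratios.items():
    print(f"{name}: about {ratio:.0f}x")
```

On these numbers, the “thousandfold” figure holds for ChatGPT Plus, while the gap is roughly 667 times for Claude Pro and 100 times for ChatGPT Pro.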

Despite their benchmark performances, these simulated reasoning models still struggle with confabulations—instances where they generate plausible-sounding but factually incorrect information. This remains a critical concern for research applications where accuracy and reliability are paramount. A $20,000 monthly investment raises questions about whether organizations can trust these systems not to introduce subtle errors into high-stakes research.

In response to the news, several people quipped on social media that companies could hire an actual PhD student for much cheaper. “In case you have forgotten,” wrote xAI developer Hieu Pham in a viral tweet, “most PhD students, including the brightest stars who can do way better work than any current LLMs—are not paid $20K / month.”

While these systems show strong capabilities on specific benchmarks, the “PhD-level” label remains largely a marketing term. These models can process and synthesize information at impressive speeds, but questions remain about how effectively they can handle the creative thinking, intellectual skepticism, and original research that define actual doctoral-level work. On the other hand, they will never get tired or need health insurance, and they will likely continue to improve in capability and drop in cost over time.

CMU research shows compression alone may unlock AI puzzle-solving abilities


Tis the season for a squeezin’

New research challenges prevailing idea that AI needs massive datasets to solve problems.

A pair of Carnegie Mellon University researchers recently discovered hints that the process of compressing information can solve complex reasoning tasks without pre-training on a large number of examples. Their system tackles some types of abstract pattern-matching tasks using only the puzzles themselves, challenging conventional wisdom about how machine learning systems acquire problem-solving abilities.

“Can lossless information compression by itself produce intelligent behavior?” ask Isaac Liao, a first-year PhD student, and his advisor Professor Albert Gu from CMU’s Machine Learning Department. Their work suggests the answer might be yes. To demonstrate, they created CompressARC and published the results in a comprehensive post on Liao’s website.

The pair tested their approach on the Abstraction and Reasoning Corpus (ARC-AGI), an unbeaten visual benchmark created in 2019 by machine learning researcher François Chollet to test AI systems’ abstract reasoning skills. ARC presents systems with grid-based image puzzles. Each puzzle provides several examples demonstrating an underlying rule, and the system must infer that rule and apply it to a new example.

For instance, one ARC-AGI puzzle shows a grid with light blue rows and columns dividing the space into boxes. The task requires figuring out which colors belong in which boxes based on their position: black for corners, magenta for the middle, and directional colors (red for up, blue for down, green for right, and yellow for left) for the remaining boxes. Here are three other example ARC-AGI puzzles, taken from Liao’s website:

Three example ARC-AGI benchmarking puzzles. Credit: Isaac Liao / Albert Gu
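To make the puzzle format concrete, here is a toy sketch of the “infer the rule from examples, then apply it” setup. It assumes grids are lists of rows with colors encoded as small integers, and it only handles one very simple family of rules (recoloring each cell). Real ARC-AGI rules are far richer, and this is not how CompressARC or any benchmark entrant works.

```python
def infer_color_map(train_pairs):
    """Learn a cell-wise color substitution consistent with every example pair."""
    mapping = {}
    for grid_in, grid_out in train_pairs:
        for row_in, row_out in zip(grid_in, grid_out):
            for color_in, color_out in zip(row_in, row_out):
                # setdefault records the first pairing and returns any earlier one.
                if mapping.setdefault(color_in, color_out) != color_out:
                    raise ValueError("examples are not explained by a recoloring rule")
    return mapping

def apply_color_map(mapping, grid):
    """Apply the inferred substitution to a new grid."""
    return [[mapping[color] for color in row] for row in grid]

# Two demonstration pairs (input grid, output grid): color 1 becomes 2, 0 stays 0.
train_pairs = [
    ([[1, 0], [0, 1]], [[2, 0], [0, 2]]),
    ([[0, 1, 1]], [[0, 2, 2]]),
]
rule = infer_color_map(train_pairs)
prediction = apply_color_map(rule, [[1, 1], [0, 1]])
print(prediction)
```

The hard part of the benchmark is that the space of possible rules is open-ended, so a fixed rule family like this one fails as soon as a puzzle involves shapes, counting, or position.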

The puzzles test capabilities that some experts believe may be fundamental to general human-like reasoning (often called “AGI” for artificial general intelligence). Those properties include understanding object persistence, goal-directed behavior, counting, and basic geometry without requiring specialized knowledge. The average human solves 76.2 percent of the ARC-AGI puzzles, while human experts reach 98.5 percent.

OpenAI made waves in December for the claim that its o3 simulated reasoning model earned a record-breaking score on the ARC-AGI benchmark. In testing with computational limits, o3 scored 75.7 percent on the test, while in high-compute testing (basically unlimited thinking time), it reached 87.5 percent, which OpenAI says is comparable to human performance.

CompressARC achieves 34.75 percent accuracy on the ARC-AGI training set (the collection of puzzles used to develop the system) and 20 percent on the evaluation set (a separate group of unseen puzzles used to test how well the approach generalizes to new problems). Each puzzle takes about 20 minutes to process on a consumer-grade RTX 4070 GPU, compared to top-performing methods that use heavy-duty data center-grade machines and what the researchers describe as “astronomical amounts of compute.”

Not your typical AI approach

CompressARC takes a completely different approach than most current AI systems. Instead of relying on pre-training—the process where machine learning models learn from massive datasets before tackling specific tasks—it works with no external training data whatsoever. The system trains itself in real-time using only the specific puzzle it needs to solve.

“No pretraining; models are randomly initialized and trained during inference time. No dataset; one model trains on just the target ARC-AGI puzzle and outputs one answer,” the researchers write, describing their strict constraints.

When the researchers say “No search,” they’re referring to another common technique in AI problem-solving where systems try many different possible solutions and select the best one. Search algorithms work by systematically exploring options—like a chess program evaluating thousands of possible moves—rather than directly learning a solution. CompressARC avoids this trial-and-error approach, relying solely on gradient descent—a mathematical technique that incrementally adjusts the network’s parameters to reduce errors, similar to how you might find the bottom of a valley by always walking downhill.
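The “walking downhill” picture of gradient descent can be shown with a one-variable example. This is a generic textbook sketch, not code from the CompressARC project; the function, starting point, and step size are arbitrary choices for illustration.

```python
def gradient_descent(grad, start, learning_rate=0.1, steps=100):
    """Repeatedly step against the gradient, i.e. always walk downhill."""
    x = start
    for _ in range(steps):
        x -= learning_rate * grad(x)
    return x

# The "valley" is f(x) = (x - 3)^2, whose slope at x is 2 * (x - 3).
# Its lowest point is at x = 3.
valley_bottom = gradient_descent(lambda x: 2 * (x - 3), start=10.0)
print(valley_bottom)
```

No candidate answers are enumerated or compared here. Each step only uses the local slope, which is the contrast with search that the paragraph above draws.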

A block diagram of the CompressARC architecture, created by the researchers. Credit: Isaac Liao / Albert Gu

The system’s core principle uses compression—finding the most efficient way to represent information by identifying patterns and regularities—as the driving force behind intelligence. CompressARC searches for the shortest possible description of a puzzle that can accurately reproduce the examples and the solution when unpacked.

While CompressARC borrows some structural principles from transformers (like using a residual stream with representations that are operated upon), it’s a custom neural network architecture designed specifically for this compression task. It’s not based on an LLM or standard transformer model.

Unlike typical machine learning methods, CompressARC uses its neural network only as a decoder. During encoding (the process of converting information into a compressed format), the system fine-tunes the network’s internal settings and the data fed into it, gradually making small adjustments to minimize errors. This creates the most compressed representation while correctly reproducing known parts of the puzzle. These optimized parameters then become the compressed representation that stores the puzzle and its solution in an efficient format.

An animated GIF showing the multi-step process of CompressARC solving an ARC-AGI puzzle. Credit: Isaac Liao

“The key challenge is to obtain this compact representation without needing the answers as inputs,” the researchers explain. The system essentially uses compression as a form of inference.
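A heavily simplified way to picture “compression as inference”: find a short description that reproduces the known cells, then decode the unknown cell from that same description. The sketch below is a toy stand-in under loud assumptions. The “decoder” is just a straight line with two parameters, the data is a made-up number sequence rather than an ARC grid, and the names `fit_and_fill` and `known_cells` are hypothetical. CompressARC’s actual network and objective are much more involved.

```python
def fit_and_fill(known, missing_positions, learning_rate=0.05, steps=2000):
    """Fit value = slope * position + offset to the known cells by gradient
    descent, then decode the missing cells from the same two numbers."""
    # The researchers describe random initialization; zeros keep this toy repeatable.
    slope, offset = 0.0, 0.0
    n = len(known)
    for _ in range(steps):
        grad_slope = grad_offset = 0.0
        for position, value in known.items():
            error = slope * position + offset - value
            grad_slope += 2 * error * position / n
            grad_offset += 2 * error / n
        slope -= learning_rate * grad_slope
        offset -= learning_rate * grad_offset
    filled = {p: slope * p + offset for p in missing_positions}
    return (slope, offset), filled

# Five known cells follow a hidden rule; the sixth cell is the "answer" to infer.
known_cells = {0: 1.0, 1: 3.0, 2: 5.0, 3: 7.0, 4: 9.0}
description, filled = fit_and_fill(known_cells, missing_positions=[5])
print(description, filled)
```

The two fitted numbers are a shorter description than the five known values, and decoding position 5 from them yields the missing value without it ever being supplied as input.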

This approach could prove valuable in domains where large datasets don’t exist or when systems need to learn new tasks with minimal examples. The work suggests that some forms of intelligence might emerge not from memorizing patterns across vast datasets, but from efficiently representing information in compact forms.

The compression-intelligence connection

The potential connection between compression and intelligence may sound strange at first glance, but it has deep theoretical roots in computer science concepts like Kolmogorov complexity (the length of the shortest program that produces a specified output) and Solomonoff induction, a theoretical gold standard for prediction that is equivalent to an optimal compression algorithm.
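Kolmogorov complexity cannot be computed exactly, but an off-the-shelf compressor gives a rough upper-bound proxy for the idea. The sketch below uses Python’s standard `zlib` module as that proxy, which is an assumption of this example and not something the researchers use.

```python
import random
import zlib

rng = random.Random(0)  # fixed seed so the run is repeatable

# 1,000 bytes that follow an obvious rule ("repeat 'ab' 500 times").
patterned = b"ab" * 500
# 1,000 pseudo-random bytes with no rule that zlib can exploit.
noisy = bytes(rng.randrange(256) for _ in range(1000))

patterned_size = len(zlib.compress(patterned, 9))
noisy_size = len(zlib.compress(noisy, 9))
print(patterned_size, noisy_size)
```

The patterned input shrinks to a small fraction of its size because a short rule describes it, while the noisy input barely shrinks at all.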

To compress information efficiently, a system must recognize patterns, find regularities, and “understand” the underlying structure of the data—abilities that mirror what many consider intelligent behavior. A system that can predict what comes next in a sequence can compress that sequence efficiently. As a result, some computer scientists over the decades have suggested that compression may be equivalent to general intelligence. Based on these principles, the Hutter Prize has offered awards to researchers who can compress a 1GB file to the smallest size.
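The link between prediction and compression can be made concrete with ideal code lengths: a symbol predicted with probability p costs about -log2(p) bits, so better predictions mean fewer bits. The sketch below is a minimal illustration with a made-up adaptive model (counts with add-one smoothing); the function name and the test string are assumptions for the example.

```python
import math
from collections import defaultdict

def code_length_bits(text, alphabet, context_len):
    """Total ideal code length when each symbol costs -log2 p(symbol | context)."""
    counts = defaultdict(lambda: defaultdict(int))
    total_bits = 0.0
    for i, symbol in enumerate(text):
        context = text[max(0, i - context_len):i]
        seen = counts[context]
        # Add-one smoothed probability from what has been seen so far.
        p = (seen[symbol] + 1) / (sum(seen.values()) + len(alphabet))
        total_bits += -math.log2(p)
        seen[symbol] += 1  # learn from the symbol after paying for it
    return total_bits

text = "ab" * 50
alphabet = sorted(set(text))

# Baseline: no prediction at all, every symbol is equally likely.
uniform_bits = len(text) * math.log2(len(alphabet))
# Predictor that conditions on the previous symbol.
predictive_bits = code_length_bits(text, alphabet, context_len=1)
print(uniform_bits, round(predictive_bits, 2))
```

Once the model learns that “b” follows “a” and vice versa, each new symbol costs a fraction of a bit, and the whole 100-symbol string needs about 12 bits versus 100 for the baseline that predicts nothing.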

We previously wrote about intelligence and compression in September 2023, when a DeepMind paper discovered that large language models can sometimes outperform specialized compression algorithms. In that study, researchers found that DeepMind’s Chinchilla 70B model could compress image patches to 43.4 percent of their original size (beating PNG’s 58.5 percent) and audio samples to just 16.4 percent (outperforming FLAC’s 30.3 percent).

That 2023 research suggested a deep connection between compression and intelligence—the idea that truly understanding patterns in data enables more efficient compression, which aligns with this new CMU research. While DeepMind demonstrated compression capabilities in an already-trained model, Liao and Gu’s work takes a different approach by showing that the compression process can generate intelligent behavior from scratch.

This new research matters because it challenges the prevailing wisdom in AI development, which typically relies on massive pre-training datasets and computationally expensive models. While leading AI companies push toward ever-larger models trained on more extensive datasets, CompressARC suggests intelligence emerging from a fundamentally different principle.

“CompressARC’s intelligence emerges not from pretraining, vast datasets, exhaustive search, or massive compute—but from compression,” the researchers conclude. “We challenge the conventional reliance on extensive pretraining and data, and propose a future where tailored compressive objectives and efficient inference-time computation work together to extract deep intelligence from minimal input.”

Limitations and looking ahead

Even with its successes, Liao and Gu’s system comes with clear limitations that may prompt skepticism. While it successfully solves puzzles involving color assignments, infilling, cropping, and identifying adjacent pixels, it struggles with tasks requiring counting, long-range pattern recognition, rotations, reflections, or simulating agent behavior. These limitations highlight areas where simple compression principles may not be sufficient.

The research has not been peer-reviewed, and the 20 percent accuracy on unseen puzzles, though notable without pre-training, falls significantly below both human performance and top AI systems. Critics might argue that CompressARC could be exploiting specific structural patterns in the ARC puzzles that might not generalize to other domains, challenging whether compression alone can serve as a foundation for broader intelligence rather than just being one component among many required for robust reasoning capabilities.

And yet as AI development continues its rapid advance, if CompressARC holds up to further scrutiny, it offers a glimpse of a possible alternative path that might lead to useful intelligent behavior without the resource demands of today’s dominant approaches. Or at the very least, it might unlock an important component of general intelligence in machines, which is still poorly understood.

Will the future of software development run on vibes?


Accepting AI-written code without understanding how it works is growing in popularity.

For many people, coding is about telling a computer what to do and having the computer perform those precise actions repeatedly. With the rise of AI tools like ChatGPT, it’s now possible for someone to describe a program in English and have the AI model translate it into working code without ever understanding how the code works. Former OpenAI researcher Andrej Karpathy recently gave this practice a name, “vibe coding,” and it’s gaining traction in tech circles.

The technique, enabled by large language models (LLMs) from companies like OpenAI and Anthropic, has attracted attention for potentially lowering the barrier to entry for software creation. But questions remain about whether the approach can reliably produce code suitable for real-world applications, even as tools like Cursor Composer, GitHub Copilot, and Replit Agent make the process increasingly accessible to non-programmers.

Instead of being about control and precision, vibe coding is all about surrendering to the flow. On February 2, Karpathy introduced the term in a post on X, writing, “There’s a new kind of coding I call ‘vibe coding,’ where you fully give in to the vibes, embrace exponentials, and forget that the code even exists.” He described the process in deliberately casual terms: “I just see stuff, say stuff, run stuff, and copy paste stuff, and it mostly works.”

A screenshot of Karpathy’s original X post about vibe coding from February 2, 2025. Credit: Andrej Karpathy / X

While vibe coding, if an error occurs, you feed it back into the AI model, accept the changes, hope it works, and repeat the process. Karpathy’s technique stands in stark contrast to traditional software development best practices, which typically emphasize careful planning, testing, and understanding of implementation details.
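
The loop described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not how any particular tool works: `stub_model` is a hypothetical stand-in for an AI model call (no real API is involved), and `run_code`, `vibe_code`, and the sample request are invented for the example.

```python
import traceback


def run_code(source: str):
    """Run a code string; return None on success, else the traceback text."""
    try:
        exec(source, {})
        return None
    except Exception:
        return traceback.format_exc()


def stub_model(prompt: str) -> str:
    """Hypothetical stand-in for an AI model call (no real API is used).

    It returns broken code at first and a fix once it sees an error report.
    """
    if "NameError" in prompt:
        return "total = sum([1, 2, 3])\nassert total == 6"
    return "total = add_all([1, 2, 3])"  # add_all is never defined


def vibe_code(request: str, max_rounds: int = 5):
    """Ask for code, run it, and feed any error back until it runs."""
    prompt = request
    for round_number in range(1, max_rounds + 1):
        code = stub_model(prompt)  # "Accept All" without reading the code
        error = run_code(code)
        if error is None:
            return code, round_number
        # Paste the error back into the next prompt and try again.
        prompt = f"{request}\n\nThe code failed with:\n{error}"
    return None, max_rounds


code, rounds = vibe_code("Add up the numbers 1, 2, and 3.")
print(f"working code after {rounds} round(s)")
```

At no point does the human in this loop read or reason about the code; the only signal is whether it ran.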

As Karpathy humorously acknowledged in his original post, the approach is for the ultimate lazy programmer experience: “I ask for the dumbest things, like ‘decrease the padding on the sidebar by half,’ because I’m too lazy to find it myself. I ‘Accept All’ always; I don’t read the diffs anymore.”

At its core, the technique transforms anyone with basic communication skills into a new type of natural language programmer, at least for simple projects. Because AI models are currently limited by the amount of code they can digest at once (context size), there tends to be an upper limit to how complex a vibe-coded software project can get before the human at the wheel becomes a high-level project manager, manually assembling slices of AI-generated code into a larger architecture. But as technical limits expand with each generation of AI models, those limits may one day disappear.

Who are the vibe coders?

There’s no way to know exactly how many people are currently vibe coding their way through either hobby projects or development jobs, but Cursor reported 40,000 paying users in August 2024, and GitHub reported 1.3 million Copilot users just over a year ago (February 2024). While we can’t find user numbers for Replit Agent, the site claims 30 million users, with an unknown percentage using the site’s AI-powered coding agent.

One thing we do know: the approach has particularly gained traction online as a fun way of rapidly prototyping games. Microsoft’s Peter Yang recently demonstrated vibe coding in an X thread by building a simple 3D first-person shooter zombie game through conversational prompts fed into Cursor and Claude 3.7 Sonnet. Yang even used a speech-to-text app so he could verbally describe what he wanted to see and refine the prototype over time.

In August 2024, the author vibe coded his way into a working Q-BASIC utility script for MS-DOS, thanks to Claude Sonnet. Credit: Benj Edwards

We’ve been doing some vibe coding ourselves. Multiple Ars staffers have used AI assistants and coding tools for extracurricular hobby projects such as creating small games, crafting bespoke utilities, writing processing scripts, and more. Having a vibe-based code genie can come in handy in unexpected places: Last year, I asked Anthropic’s Claude to write a Microsoft Q-BASIC program in MS-DOS that decompressed 200 ZIP files into custom directories, saving me many hours of manual typing work.

Debugging the vibes

With all this vibe coding going on, we had to turn to an expert for some input. Simon Willison, an independent software developer and AI researcher, offered a nuanced perspective on AI-assisted programming in an interview with Ars Technica. “I really enjoy vibe coding,” he said. “It’s a fun way to try out an idea and prove if it can work.”

But there are limits to how far Willison will go. “Vibe coding your way to a production codebase is clearly risky. Most of the work we do as software engineers involves evolving existing systems, where the quality and understandability of the underlying code is crucial.”

At some point, understanding at least some of the code is important because AI-generated code may include bugs, misunderstandings, and confabulations—for example, instances where the AI model generates references to nonexistent functions or libraries.

“Vibe coding is all fun and games until you have to vibe debug,” developer Ben South noted wryly on X, highlighting this fundamental issue.

Willison recently argued on his blog that encountering hallucinations with AI coding tools isn’t as detrimental as embedding false AI-generated information into a written report, because coding tools have built-in fact-checking: If there’s a confabulation, the code won’t work. This provides a natural boundary for vibe coding’s reliability—the code runs or it doesn’t.
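
As a rough illustration of that built-in check, the sketch below runs two one-line snippets: one calls a real standard-library function, and the other calls a made-up name standing in for a confabulated API. The helper `run_snippet` and the name `math.square_root` are hypothetical, chosen only for this example.

```python
import math


def run_snippet(source: str) -> str:
    """Execute a code string and report whether it ran or why it failed."""
    try:
        exec(source, {"math": math})
        return "ok"
    except Exception as exc:  # broad on purpose: we only want the error type
        return type(exc).__name__


# A real function from the standard library: this runs.
real_call = "result = math.sqrt(16)"

# A made-up function name standing in for a confabulated API.
confabulated_call = "result = math.square_root(16)"

print(run_snippet(real_call))          # ok
print(run_snippet(confabulated_call))  # AttributeError
```

The check only catches mistakes that surface when the code runs. A snippet that executes cleanly but computes the wrong answer would still report "ok."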

Even so, the risk-reward calculation for vibe coding becomes far more complex in professional settings. While a solo developer might accept the trade-offs of vibe coding for personal projects, enterprise environments typically require code maintainability and reliability standards that vibe-coded solutions may struggle to meet. When code doesn’t work as expected, debugging requires understanding what the code is actually doing—precisely the knowledge that vibe coding tends to sidestep.

Programming without understanding

When it comes to defining what exactly constitutes vibe coding, Willison makes an important distinction: “If an LLM wrote every line of your code, but you’ve reviewed, tested, and understood it all, that’s not vibe coding in my book—that’s using an LLM as a typing assistant.” Vibe coding, in contrast, involves accepting code without fully understanding how it works.

While vibe coding originated with Karpathy as a playful term, it may encapsulate a real shift in how some developers approach programming tasks—prioritizing speed and experimentation over deep technical understanding. And to some people, that may be terrifying.

Willison emphasizes that developers need to take accountability for their code: “I firmly believe that as a developer you have to take accountability for the code you produce—if you’re going to put your name to it you need to be confident that you understand how and why it works—ideally to the point that you can explain it to somebody else.”

He also warns about a common path to technical debt: “For experiments and low-stake projects where you want to explore what’s possible and build fun prototypes? Go wild! But stay aware of the very real risk that a good enough prototype often faces pressure to get pushed to production.”

The future of programming jobs

So, is all this vibe coding going to cost human programmers their jobs? At its heart, programming has always been about telling a computer how to operate. The method of how we do that has changed over time, but there may always be people who are better at telling a computer precisely what to do than others—even in natural language. In some ways, those people may become the new “programmers.”

There was a point in the late 1970s to early ’80s when many believed that people needed programming skills to use a computer effectively because there were very few pre-built applications for all the various computer platforms available. School systems worldwide launched computer literacy efforts to teach people to code.

A brochure for the GE 210 computer from 1964. BASIC’s creators used a similar computer four years later to develop the programming language that many children were taught at home and school. Credit: GE / Wikipedia

Before too long, people made useful software applications that let non-coders utilize computers easily—no programming required. Even so, programmers didn’t disappear—instead, they used applications to create better and more complex programs. Perhaps that will also happen with AI coding tools.

To use an analogy, computer-controlled technologies like autopilot made reliable supersonic flight possible because they could handle aspects of flight that were too taxing for all but the most highly trained and capable humans to safely control. AI may do the same for programming, allowing humans to abstract away complexities that would otherwise take too much time to manually code, and that may allow for the creation of more complex and useful software experiences in the future.

But at that point, will humans still be able to understand or debug them? Maybe not. We may be completely dependent on AI tools, and some people no doubt find that a little scary or unwise.

Whether vibe coding lasts in the programming landscape or remains a prototyping technique will likely depend less on the capabilities of AI models and more on the willingness of organizations to accept risky trade-offs in code quality, maintainability, and technical debt. For now, vibe coding remains an apt descriptor of the messy, experimental relationship between AI and human developers—more collaborative than autonomous, but increasingly blurring the lines of who (or what) is really doing the programming.

Eerily realistic AI voice demo sparks amazement and discomfort online


Sesame’s new AI voice model features uncanny imperfections, and it’s willing to act like an angry boss.

In late 2013, the Spike Jonze film Her imagined a future where people would form emotional connections with AI voice assistants. Nearly 12 years later, that fictional premise has veered closer to reality with the release of a new conversational voice model from AI startup Sesame that has left many users both fascinated and unnerved.

“I tried the demo, and it was genuinely startling how human it felt,” wrote one Hacker News user who tested the system. “I’m almost a bit worried I will start feeling emotionally attached to a voice assistant with this level of human-like sound.”

In late February, Sesame released a demo for the company’s new Conversational Speech Model (CSM) that appears to cross over what many consider the “uncanny valley” of AI-generated speech, with some testers reporting emotional connections to the male or female voice assistant (“Miles” and “Maya”).

In our own evaluation, we spoke with the male voice for about 28 minutes, talking about life in general and how it decides what is “right” or “wrong” based on its training data. The synthesized voice was expressive and dynamic, imitating breath sounds, chuckles, interruptions, and even sometimes stumbling over words and correcting itself. These imperfections are intentional.

“At Sesame, our goal is to achieve ‘voice presence’—the magical quality that makes spoken interactions feel real, understood, and valued,” writes the company in a blog post. “We are creating conversational partners that do not just process requests; they engage in genuine dialogue that builds confidence and trust over time. In doing so, we hope to realize the untapped potential of voice as the ultimate interface for instruction and understanding.”

Sometimes the model tries too hard to sound like a real human. In one demo posted online by a Reddit user called MetaKnowing, the AI model talks about craving “peanut butter and pickle sandwiches.”

An example of Sesame’s female voice model craving peanut butter and pickle sandwiches, captured by Reddit user MetaKnowing.

Founded by Brendan Iribe, Ankit Kumar, and Ryan Brown, Sesame AI has attracted significant backing from prominent venture capital firms. The company has secured investments from Andreessen Horowitz, led by Anjney Midha and Marc Andreessen, along with Spark Capital, Matrix Partners, and various founders and individual investors.

Browsing reactions to Sesame found online, we found many users expressing astonishment at its realism. “I’ve been into AI since I was a child, but this is the first time I’ve experienced something that made me definitively feel like we had arrived,” wrote one Reddit user. “I’m sure it’s not beating any benchmarks, or meeting any common definition of AGI, but this is the first time I’ve had a real genuine conversation with something I felt was real.” Many other Reddit threads express similar feelings of surprise, with commenters saying it’s “jaw-dropping” or “mind-blowing.”

That may sound like a bunch of hyperbole at first glance, and not everyone finds the Sesame experience pleasant. Mark Hachman, a senior editor at PCWorld, wrote about being deeply unsettled by his interaction with the Sesame voice AI. “Fifteen minutes after ‘hanging up’ with Sesame’s new ‘lifelike’ AI, and I’m still freaked out,” Hachman reported. He described how the AI’s voice and conversational style eerily resembled an old friend he had dated in high school.

Some have compared Sesame’s voice model to OpenAI’s Advanced Voice Mode for ChatGPT, saying that Sesame’s CSM features more realistic voices. Others are pleased that the model in the demo will roleplay angry characters, which ChatGPT refuses to do.

An example argument with Sesame’s CSM created by Gavin Purcell.

Gavin Purcell, co-host of the AI for Humans podcast, posted an example video on Reddit where the human pretends to be an embezzler and argues with a boss. It’s so dynamic that it’s difficult to tell who the human is and which one is the AI model. Judging by our own demo, it’s entirely capable of what you see in the video.

“Near-human quality”

Under the hood, Sesame’s CSM achieves its realism by using two AI models working together (a backbone and a decoder) based on Meta’s Llama architecture that process interleaved text and audio. Sesame trained three AI model sizes, with the largest using 8.3 billion parameters (an 8 billion parameter backbone model plus a 300 million parameter decoder) on approximately 1 million hours of primarily English audio.

Sesame’s CSM doesn’t follow the traditional two-stage approach used by many earlier text-to-speech systems. Instead of generating semantic tokens (high-level speech representations) and acoustic details (fine-grained audio features) in two separate stages, Sesame’s CSM integrates them into a single-stage, multimodal transformer-based model, jointly processing interleaved text and audio tokens to produce speech. OpenAI’s voice model uses a similar multimodal approach.

In blind tests without conversational context, human evaluators showed no clear preference between CSM-generated speech and real human recordings, suggesting the model achieves near-human quality for isolated speech samples. However, when provided with conversational context, evaluators still consistently preferred real human speech, indicating a gap remains in fully contextual speech generation.

Sesame co-founder Brendan Iribe acknowledged current limitations in a comment on Hacker News, noting that the system is “still too eager and often inappropriate in its tone, prosody and pacing” and has issues with interruptions, timing, and conversation flow. “Today, we’re firmly in the valley, but we’re optimistic we can climb out,” he wrote.

Too close for comfort?

Despite CSM’s technological impressiveness, advancements in conversational voice AI carry significant risks for deception and fraud. The ability to generate highly convincing human-like speech has already supercharged voice phishing scams, allowing criminals to impersonate family members, colleagues, or authority figures with unprecedented realism. But adding realistic interactivity to those scams may take them to another level of potency.

Unlike current robocalls that often contain tell-tale signs of artificiality, next-generation voice AI could eliminate these red flags entirely. As synthetic voices become increasingly indistinguishable from human speech, you may never know who you’re talking to on the other end of the line. It’s inspired some people to share a secret word or phrase with their family for identity verification.

Although Sesame’s demo does not clone a person’s voice, future open source releases of similar technology could allow malicious actors to potentially adapt these tools for social engineering attacks. OpenAI itself held back its own voice technology from wider deployment over fears of misuse.

Sesame sparked a lively discussion on Hacker News about its potential uses and dangers. Some users reported having extended conversations with the two demo voices, with conversations lasting up to the 30-minute limit. In one case, a parent recounted how their 4-year-old daughter developed an emotional connection with the AI model, crying after not being allowed to talk to it again.

The company says it plans to open-source “key components” of its research under an Apache 2.0 license, enabling other developers to build upon their work. Their roadmap includes scaling up model size, increasing dataset volume, expanding language support to over 20 languages, and developing “fully duplex” models that better handle the complex dynamics of real conversations.

You can try the Sesame demo on the company’s website, assuming that it isn’t too overloaded with people who want to simulate a rousing argument.

Researchers surprised to find less-educated areas adopting AI writing tools faster


From the mouths of machines

Stanford researchers analyzed 305 million texts, revealing AI-writing trends.

Since the launch of ChatGPT in late 2022, experts have debated how widely AI language models would impact the world. A few years later, the picture is getting clearer. According to new Stanford University-led research examining over 300 million text samples across multiple sectors, AI language models now assist in writing up to a quarter of professional communications. They are having a large impact, especially in less-educated parts of the United States.

“Our study shows the emergence of a new reality in which firms, consumers and even international organizations substantially rely on generative AI for communications,” wrote the researchers.

The researchers tracked large language model (LLM) adoption across industries from January 2022 to September 2024 using a dataset that included 687,241 consumer complaints submitted to the US Consumer Financial Protection Bureau (CFPB), 537,413 corporate press releases, 304.3 million job postings, and 15,919 United Nations press releases.

By using a statistical detection system that tracked word usage patterns, the researchers found that roughly 18 percent of financial consumer complaints (including 30 percent of all complaints from Arkansas), 24 percent of corporate press releases, up to 15 percent of job postings, and 14 percent of UN press releases showed signs of AI assistance during that period of time.

The study also found that while urban areas showed higher adoption overall (18.2 percent versus 10.9 percent in rural areas), regions with lower educational attainment used AI writing tools more frequently (19.9 percent compared to 17.4 percent in higher-education areas). The researchers note that this contradicts typical technology adoption patterns where more educated populations adopt new tools fastest.

“In the consumer complaint domain, the geographic and demographic patterns in LLM adoption present an intriguing departure from historical technology diffusion trends where technology adoption has generally been concentrated in urban areas, among higher-income groups, and populations with higher levels of educational attainment.”

Researchers from Stanford, the University of Washington, and Emory University led the study, titled, “The Widespread Adoption of Large Language Model-Assisted Writing Across Society,” first listed on the arXiv preprint server in mid-February. Weixin Liang and Yaohui Zhang from Stanford served as lead authors, with collaborators Mihai Codreanu, Jiayu Wang, Hancheng Cao, and James Zou.

Detecting AI use in aggregate

We’ve previously covered that AI writing detection services aren’t reliable, and this study does not contradict that finding. On a document-by-document basis, AI detectors cannot be trusted. But when analyzing millions of documents in aggregate, telltale patterns emerge that suggest the influence of AI language models on text.

The researchers developed an approach based on a statistical framework in a previously released work that analyzed shifts in word frequencies and linguistic patterns before and after ChatGPT’s release. By comparing large sets of pre- and post-ChatGPT texts, they estimated the proportion of AI-assisted content at a population level. The presumption is that LLMs tend to favor certain word choices, sentence structures, and linguistic patterns that differ subtly from typical human writing.

To validate their approach, the researchers created test sets with known percentages of AI content (from zero percent to 25 percent) and found their method predicted these percentages with error rates below 3.3 percent. This statistical validation gave them confidence in their population-level estimates.
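
The idea of estimating AI use for a whole corpus, without labeling any single document, can be illustrated with a toy model. The sketch below is not the study's method: it assumes a simple two-source mixture over the presence of four marker words, and all of the word probabilities, function names, and corpus sizes are invented for illustration. It builds a synthetic corpus with a known AI share, as in the validation step described above, and then recovers that share by maximum likelihood.

```python
import math
import random
from collections import Counter

# Hypothetical per-document occurrence probabilities for four marker words.
# These numbers are invented for illustration; they are not from the study.
HUMAN_PROBS = [0.02, 0.05, 0.10, 0.30]
AI_PROBS = [0.20, 0.25, 0.10, 0.15]


def doc_likelihood(doc, probs):
    """Probability of a document's word-presence pattern under one source."""
    like = 1.0
    for present, p in zip(doc, probs):
        like *= p if present else (1.0 - p)
    return like


def make_corpus(n_docs, ai_fraction, rng):
    """Build a synthetic corpus with a known share of AI-written documents."""
    docs = []
    for _ in range(n_docs):
        probs = AI_PROBS if rng.random() < ai_fraction else HUMAN_PROBS
        # Each document is a tuple of 0/1 flags, one per marker word.
        docs.append(tuple(int(rng.random() < p) for p in probs))
    return docs


def estimate_ai_fraction(docs, step=0.001):
    """Grid-search the mixture weight that maximizes corpus log-likelihood."""
    counts = Counter(docs)  # at most 2**4 distinct patterns, so this is cheap
    best_alpha, best_ll = 0.0, -math.inf
    n_steps = int(round(1.0 / step))
    for i in range(n_steps + 1):
        alpha = i * step
        ll = 0.0
        for doc, n in counts.items():
            mix = ((1 - alpha) * doc_likelihood(doc, HUMAN_PROBS)
                   + alpha * doc_likelihood(doc, AI_PROBS))
            ll += n * math.log(mix)
        if ll > best_ll:
            best_alpha, best_ll = alpha, ll
    return best_alpha


rng = random.Random(0)
corpus = make_corpus(20000, ai_fraction=0.15, rng=rng)
estimate = estimate_ai_fraction(corpus)
print(f"true share: 0.15, estimated share: {estimate:.3f}")
```

Because the estimate comes from the corpus-wide pattern of word use, it says nothing reliable about whether any individual document was AI-assisted.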

While the researchers specifically note their estimates likely represent a minimum level of AI usage, it’s important to understand that actual AI involvement might be significantly greater. Due to the difficulty in detecting heavily edited or increasingly sophisticated AI-generated content, the researchers say their reported adoption rates could substantially underestimate true levels of generative AI use.

Analysis suggests AI use as “equalizing tools”

While the overall adoption rates are revealing, perhaps more insightful are the patterns of who is using AI writing tools and how these patterns may challenge conventional assumptions about technology adoption.

In examining the CFPB complaints (a US public resource that collects complaints about consumer financial products and services), the researchers’ geographic analysis revealed substantial variation across US states.

Arkansas showed the highest adoption rate at 29.2 percent (based on 7,376 complaints), followed by Missouri at 26.9 percent (16,807 complaints) and North Dakota at 24.8 percent (1,025 complaints). In contrast, states like West Virginia (2.6 percent), Idaho (3.8 percent), and Vermont (4.8 percent) showed minimal AI writing adoption. Major population centers demonstrated moderate adoption, with California at 17.4 percent (157,056 complaints) and New York at 16.6 percent (104,862 complaints).

The urban-rural divide followed expected technology adoption patterns initially, but with an interesting twist. Using Rural Urban Commuting Area (RUCA) codes, the researchers found that urban and rural areas initially adopted AI writing tools at similar rates during early 2023. However, adoption trajectories diverged by mid-2023, with urban areas reaching 18.2 percent adoption compared to 10.9 percent in rural areas.

Contrary to typical technology diffusion patterns, areas with lower educational attainment showed higher AI writing tool usage. Comparing regions above and below state median levels of bachelor’s degree attainment, areas with fewer college graduates stabilized at 19.9 percent adoption rates compared to 17.4 percent in more educated regions. This pattern held even within urban areas, where less-educated communities showed 21.4 percent adoption versus 17.8 percent in more educated urban areas.

The researchers suggest that AI writing tools may serve as a leg-up for people who may not have as much educational experience. “While the urban-rural digital divide seems to persist,” the researchers write, “our finding that areas with lower educational attainment showed modestly higher LLM adoption rates in consumer complaints suggests these tools may serve as equalizing tools in consumer advocacy.”

Corporate and diplomatic trends in AI writing

According to the researchers, all sectors they analyzed (consumer complaints, corporate communications, job postings) showed similar adoption patterns: sharp increases beginning three to four months after ChatGPT’s November 2022 launch, followed by stabilization in late 2023.

Organization age emerged as the strongest predictor of AI writing usage in the job posting analysis. Companies founded after 2015 showed adoption rates up to three times higher than firms established before 1980, reaching 10–15 percent AI-modified text in certain roles compared to below 5 percent for older organizations. Small companies with fewer employees also incorporated AI more readily than larger organizations.

When examining corporate press releases by sector, science and technology companies integrated AI most extensively, with an adoption rate of 16.8 percent by late 2023. Business and financial news (14–15.6 percent) and people and culture topics (13.6–14.3 percent) showed slightly lower but still significant adoption.

In the international arena, Latin American and Caribbean UN country teams showed the highest adoption among international organizations at approximately 20 percent, while African states, Asia-Pacific states, and Eastern European states demonstrated more moderate increases to 11–14 percent by 2024.

Implications and limitations

In the study, the researchers acknowledge limitations in their analysis due to a focus on English-language content. Also, as we mentioned earlier, they found they could not reliably detect human-edited AI-generated text or text generated by newer models instructed to imitate human writing styles. As a result, the researchers suggest their findings represent a lower bound of actual AI writing tool adoption.

The researchers noted that the plateauing of AI writing adoption in 2024 might reflect either market saturation or increasingly sophisticated LLMs producing text that evades detection methods. They conclude we now live in a world where distinguishing between human and AI writing becomes progressively more difficult, with implications for communications across society.

“The growing reliance on AI-generated content may introduce challenges in communication,” the researchers write. “In sensitive categories, over-reliance on AI could result in messages that fail to address concerns or overall release less credible information externally. Over-reliance on AI could also introduce public mistrust in the authenticity of messages sent by firms.”

AI firms follow DeepSeek’s lead, create cheaper models with “distillation”

Thanks to distillation, developers and businesses can access these models’ capabilities at a fraction of the price, allowing app developers to run AI models quickly on devices such as laptops and smartphones.

Developers can use OpenAI’s platform for distillation, learning from the large language models that underpin products like ChatGPT. OpenAI’s largest backer, Microsoft, used GPT-4 to distill its Phi family of small language models as part of a commercial partnership after investing nearly $14 billion into the company.

However, the San Francisco-based start-up has said it believes DeepSeek distilled OpenAI’s models to train its competitor, a move that would be against its terms of service. DeepSeek has not commented on the claims.

While distillation can be used to create high-performing models, experts add that distilled models are more limited.

“Distillation presents an interesting trade-off; if you make the models smaller, you inevitably reduce their capability,” said Ahmed Awadallah of Microsoft Research, who said a distilled model can be designed to be very good at summarising emails, for example, “but it really would not be good at anything else.”

David Cox, vice-president for AI models at IBM Research, said most businesses do not need a massive model to run their products, and distilled ones are powerful enough for purposes such as customer service chatbots or running on smaller devices like phones.

“Any time you can [make it less expensive] and it gives you the right performance you want, there is very little reason not to do it,” he added.

That presents a challenge to many of the business models of leading AI firms. Even if developers use distilled models from companies like OpenAI, they cost far less to run, are less expensive to create, and, therefore, generate less revenue. Model-makers like OpenAI often charge less for the use of distilled models as they require less computational load.

AI versus the brain and the race for general intelligence


Intelligence, ±artificial

We already have an example of general intelligence, and it doesn’t look like AI.

There’s no question that AI systems have accomplished some impressive feats, mastering games, writing text, and generating convincing images and video. That’s gotten some people talking about the possibility that we’re on the cusp of AGI, or artificial general intelligence. While some of this is marketing fanfare, enough people in the field are taking the idea seriously that it warrants a closer look.

Many arguments come down to the question of how AGI is defined, which people in the field can’t seem to agree upon. This contributes to estimates of its advent that range from “it’s practically here” to “we’ll never achieve it.” Given that range, it’s impossible to provide any sort of informed perspective on how close we are.

But we do have an existing example of AGI without the “A”—the intelligence provided by the animal brain, particularly the human one. And one thing is clear: The systems being touted as evidence that AGI is just around the corner do not work at all like the brain does. That may not be a fatal flaw, or even a flaw at all. It’s entirely possible that there’s more than one way to reach intelligence, depending on how it’s defined. But at least some of the differences are likely to be functionally significant, and the fact that AI is taking a very different route from the one working example we have is likely to be meaningful.

With all that in mind, let’s look at some of the things the brain does that current AI systems can’t.

Defining AGI might help

Artificial general intelligence hasn’t really been defined. Those who argue that it’s imminent are either vague about what they expect the first AGI systems to be capable of or simply define it as the ability to dramatically exceed human performance at a limited number of tasks. Predictions of AGI’s arrival in the intermediate term tend to focus on AI systems demonstrating specific behaviors that seem human-like. The further one goes out on the timeline, the greater the emphasis on the “G” of AGI and its implication of systems that are far less specialized.

But most of these predictions are coming from people working in companies with a commercial interest in AI. It was notable that none of the researchers we talked to for this article were willing to offer a definition of AGI. They were, however, willing to point out how current systems fall short.

“I think that AGI would be something that is going to be more robust, more stable—not necessarily smarter in general but more coherent in its abilities,” said Ariel Goldstein, a researcher at Hebrew University of Jerusalem. “You’d expect a system that can do X and Y to also be able to do Z and T. Somehow, these systems seem to be more fragmented in a way. To be surprisingly good at one thing and then surprisingly bad at another thing that seems related.”

“I think that’s a big distinction, this idea of generalizability,” echoed neuroscientist Christa Baker of NC State University. “You can learn how to analyze logic in one sphere, but if you come to a new circumstance, it’s not like now you’re an idiot.”

Mariano Schain, a Google engineer who has collaborated with Goldstein, focused on the abilities that underlie this generalizability. He mentioned both long-term and task-specific memory and the ability to deploy skills developed in one task in different contexts. These are limited-to-nonexistent in existing AI systems.

Beyond those specific limits, Baker noted that “there’s long been this very human-centric idea of intelligence that only humans are intelligent.” That’s fallen away within the scientific community as we’ve learned more about animal behavior. But there’s still a bias to privilege human-like behaviors, such as the human-sounding responses generated by large language models.

The fruit flies that Baker studies can integrate multiple types of sensory information, control four sets of limbs, navigate complex environments, satisfy their own energy needs, produce new generations of brains, and more. And they do that all with brains that contain under 150,000 neurons, far fewer than current large language models.

These capabilities are complicated enough that it’s not entirely clear how the brain enables them. (If we knew how, it might be possible to engineer artificial systems with similar capacities.) But we do know a fair bit about how brains operate, and there are some very obvious ways that they differ from the artificial systems we’ve created so far.

Neurons vs. artificial neurons

Most current AI systems, including all large language models, are based on what are called neural networks. These were intentionally designed to mimic how some areas of the brain operate, with large numbers of artificial neurons taking an input, modifying it, and then passing the modified information on to another layer of artificial neurons. Each of these artificial neurons can pass the information on to multiple instances in the next layer, with different weights applied to each connection. In turn, each of the artificial neurons in the next layer can receive input from multiple sources in the previous one.

After passing through enough layers, the final layer is read and transformed into an output, such as the pixels in an image that correspond to a cat.
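The layered flow described above can be sketched in a few lines of Python. This is a minimal illustration only: the weights, biases, layer sizes, and the choice of a ReLU nonlinearity are hypothetical values picked for the example, not taken from any real model.

```python
def relu(x):
    """A common nonlinearity: pass positive values through, clamp negatives to zero."""
    return max(0.0, x)


def layer_forward(inputs, weights, biases):
    """One layer: every artificial neuron takes a weighted sum of all inputs.

    weights[i] holds the connection weights feeding neuron i, so each input
    reaches every neuron in the layer with its own weight.
    """
    outputs = []
    for neuron_weights, bias in zip(weights, biases):
        total = sum(w * x for w, x in zip(neuron_weights, inputs)) + bias
        outputs.append(relu(total))
    return outputs


def network_forward(inputs, layers):
    """Pass the input through each layer in order and return the final layer's values."""
    activations = inputs
    for weights, biases in layers:
        activations = layer_forward(activations, weights, biases)
    return activations


# Hypothetical hand-picked parameters (a trained network would learn these).
layers = [
    # 2 inputs -> 3 neurons
    ([[0.5, -1.0], [1.0, 1.0], [-0.5, 0.25]], [0.0, 0.0, 0.1]),
    # 3 neurons -> 1 output neuron
    ([[1.0, 0.5, -1.0]], [0.0]),
]

output = network_forward([1.0, 2.0], layers)
print(output)
```

In a real system the final values would be interpreted as something meaningful, such as pixel intensities or scores for candidate words; here they are just numbers.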

While that system is modeled on the behavior of some structures within the brain, it’s a very limited approximation. For one, all artificial neurons are functionally equivalent; there’s no specialization. In contrast, real neurons are highly specialized; they use a variety of neurotransmitters and take input from a range of extra-neural inputs like hormones. Some specialize in sending inhibitory signals while others activate the neurons they interact with. Different physical structures allow them to make different numbers and types of connections.

In addition, rather than simply forwarding a single value to the next layer, real neurons communicate through an analog series of activity spikes, sending trains of pulses that vary in timing and intensity. This allows for a degree of non-deterministic noise in communications.

Finally, while organized layers are a feature of a few structures in brains, they’re far from the rule. “What we found is it’s—at least in the fly—much more interconnected,” Baker told Ars. “You can’t really identify this strictly hierarchical network.”

With near-complete connection maps of the fly brain becoming available, she told Ars that researchers are “finding lateral connections or feedback projections, or what we call recurrent loops, where we’ve got neurons that are making a little circle and connectivity patterns. I think those things are probably going to be a lot more widespread than we currently appreciate.”

While we’re only beginning to understand the functional consequences of all this complexity, it’s safe to say that it allows networks composed of actual neurons far more flexibility in how they process information, a flexibility that may underlie how these neurons get re-deployed in a way that these researchers identified as crucial for some form of generalized intelligence.

But the differences between neural networks and the real-world brains they were modeled on go well beyond the functional differences we’ve talked about so far. They extend to significant differences in how these functional units are organized.

The brain isn’t monolithic

The neural networks we’ve generated so far are largely specialized systems meant to handle a single task. Even the most complicated tasks, like the prediction of protein structures, have typically relied on the interaction of only two or three specialized systems. In contrast, the typical brain has a lot of functional units. Some of these operate by sequentially processing a single set of inputs in something resembling a pipeline. But many others can operate in parallel, in some cases without any input from activity going on elsewhere in the brain.

To give a sense of what this looks like, let’s think about what’s going on as you read this article. Doing so requires systems that handle motor control, which keep your head and eyes focused on the screen. Part of this system operates via feedback from the neurons that are processing the read material, causing small eye movements that help your eyes move across individual sentences and between lines.

Separately, there’s part of your brain devoted to telling the visual system what not to pay attention to, like the icon showing an ever-growing number of unread emails. Those of us who can read a webpage without even noticing the ads on it presumably have a very well-developed system in place for ignoring things. Reading this article may also mean you’re engaging the systems that handle other senses, getting you to ignore things like the noise of your heating system coming on while remaining alert for things that might signify threats, like an unexplained sound in the next room.

The input generated by the visual system then needs to be processed, from individual character recognition up to the identification of words and sentences, processes that involve systems in areas of the brain involved in both visual processing and language. Again, this is an iterative process, where building meaning from a sentence may require many eye movements to scan back and forth across a sentence, improving reading comprehension—and requiring many of these systems to communicate among themselves.

As meaning gets extracted from a sentence, other parts of the brain integrate it with information obtained in earlier sentences, which tends to engage yet another area of the brain, one that handles a short-term memory system called working memory. Meanwhile, other systems will be searching long-term memory, finding related material that can help the brain place the new information within the context of what it already knows. Still other specialized brain areas are checking for things like whether there’s any emotional content to the material you’re reading.

All of these different areas are engaged without you being consciously aware of the need for them.

In contrast, something like ChatGPT, despite having a lot of artificial neurons, is monolithic: No specialized structures are allocated before training starts. That’s in sharp contrast to a brain. “The brain does not start out as a bag of neurons and then as a baby it needs to make sense of the world and then determine what connections to make,” Baker noted. “There [are] already a lot of constraints and specifics that are already set up.”

Even in cases where it’s not possible to see any physical distinction between cells specialized for different functions, Baker noted that we can often find differences in what genes are active.

In contrast, pre-planned modularity is relatively new to the AI world. In software development, “This concept of modularity is well established, so we have the whole methodology around it, how to manage it,” Schain said, “it’s really an aspect that is important for maybe achieving AI systems that can then operate similarly to the human brain.” There are a few cases where developers have enforced modularity on systems, but Goldstein said these systems need to be trained with all the modules in place to see any gain in performance.

None of this is saying that a modular system can’t arise within a neural network as a result of its training. But so far, we have very limited evidence that one does. And since we mostly deploy each system for a very limited number of tasks, there’s no reason to think modularity will be valuable.

There is some reason to believe that this modularity is key to the brain’s incredible flexibility. The region that recognizes emotion-evoking content in written text can also recognize it in music and images, for example. But the evidence here is mixed. There are some clear instances where a single brain region handles related tasks, but that’s not consistently the case; Baker noted that, “When you’re talking humans, there are parts of the brain that are dedicated to understanding speech, and there are different areas that are involved in producing speech.”

This sort of re-use would also provide an advantage in terms of learning since behaviors developed in one context could potentially be deployed in others. But as we’ll see, the differences between brains and AI when it comes to learning are far more comprehensive than that.

The brain is constantly training

Current AIs generally have two states: training and deployment. Training is where the AI learns its behavior; deployment is where that behavior is put to use. This isn’t absolute, as the behavior can be tweaked in response to things learned during deployment, like finding out it recommends eating a rock daily. But for the most part, once the weights among the connections of a neural network are determined through training, they’re retained.

That may be starting to change a bit, Schain said. “There is now maybe a shift in similarity where AI systems are using more and more what they call the test time compute, where at inference time you do much more than before, kind of a parallel to how the human brain operates,” he told Ars. But it’s still the case that neural networks are essentially useless without an extended training period.

In contrast, a brain doesn’t have distinct learning and active states; it’s constantly in both modes. In many cases, the brain learns while doing. Baker described that in terms of learning to take jumpshots: “Once you have made your movement, the ball has left your hand, it’s going to land somewhere. So that visual signal—that comparison of where it landed versus where you wanted it to go—is what we call an error signal. That’s detected by the cerebellum, and its goal is to minimize that error signal. So the next time you do it, the brain is trying to compensate for what you did last time.”
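The error-signal loop Baker describes can be illustrated with a toy Python sketch. It is not a model of the cerebellum; the function name, the fixed shot bias, and the learning rate are all hypothetical choices made for the example. The point is only that learning happens during the activity itself, with each attempt nudging the next one.

```python
def practice_shots(target, shot_bias, learning_rate=0.5, attempts=10):
    """Adjust an internal correction after every shot, based on where the ball landed.

    target:        where the shooter wants the ball to land
    shot_bias:     a consistent flaw in the shooter's motion (hypothetical)
    learning_rate: how strongly each observed error changes the next attempt
    """
    correction = 0.0
    errors = []
    for _ in range(attempts):
        landing = target + shot_bias + correction  # where this shot actually lands
        error = landing - target                   # the "error signal"
        errors.append(error)
        correction -= learning_rate * error        # compensate on the next attempt
    return errors


errors = practice_shots(target=4.5, shot_bias=0.8)
for attempt, error in enumerate(errors, start=1):
    print(f"shot {attempt}: missed by {error:+.4f}")
```

With these settings the miss distance shrinks on every shot. There is no separate training phase: the same loop that takes the shot also does the learning.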

It makes for very different learning curves. An AI is typically not very useful until it has had a substantial amount of training. In contrast, a human can often pick up basic competence in a very short amount of time (and without massive energy use). “Even if you’re put into a situation where you’ve never been before, you can still figure it out,” Baker said. “If you see a new object, you don’t have to be trained on that a thousand times to know how to use it. A lot of the time, [if] you see it one time, you can make predictions.”

As a result, while an AI system with sufficient training may ultimately outperform the human, the human will typically reach a high level of performance faster. And unlike an AI, a human’s performance doesn’t remain static. Incremental improvements and innovative approaches are both still possible. This also allows humans to adjust to changed circumstances more readily. An AI trained on the body of written material up until 2020 might struggle to comprehend teen-speak in 2030; humans could at least potentially adjust to the shifts in language. (Though maybe an AI trained to respond to confusing phrasing with “get off my lawn” would be indistinguishable.)

Finally, since the brain is a flexible learning device, the lessons learned from one skill can be applied to related skills. So the ability to recognize tones and read sheet music can help with the mastery of multiple musical instruments. Chemistry and cooking share overlapping skillsets. And when it comes to schooling, learning how to learn can be used to master a wide range of topics.

In contrast, it’s essentially impossible to use an AI model trained on one topic for much else. The biggest exceptions are large language models, which seem to be able to solve problems on a wide variety of topics if they’re presented as text. But here, there’s still a dependence on sufficient examples of similar problems appearing in the body of text the system was trained on. To give an example, something like ChatGPT can seem to be able to solve math problems, but it’s best at solving things that were discussed in its training materials; giving it something new will generally cause it to stumble.

Déjà vu

For Schain, however, the biggest difference between AI and biology is in terms of memory. For many AIs, “memory” is indistinguishable from the computational resources that allow it to perform a task and was formed during training. For the large language models, it includes both the weights of connections learned then and a narrow “context window” that encompasses any recent exchanges with a single user. In contrast, biological systems have a lifetime of memories to rely on.

“For AI, it’s very basic: It’s like the memory is in the weights [of connections] or in the context. But with a human brain, it’s a much more sophisticated mechanism, still to be uncovered. It’s more distributed. There is the short term and long term, and it has to do a lot with different timescales. Memory for the last second, a minute and a day or a year or years, and they all may be relevant.”
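To make the “context” half of that picture concrete, here is a minimal Python sketch of a fixed-size context window. The `ContextWindow` class is hypothetical, and counting whitespace-separated words as tokens is a simplifying assumption; real models use their own tokenizers and far larger limits.

```python
from collections import deque


class ContextWindow:
    """Keeps only the most recent messages that fit within a token budget."""

    def __init__(self, max_tokens):
        self.max_tokens = max_tokens
        self.messages = deque()
        self.token_count = 0

    def add(self, message):
        # Assumption: one whitespace-separated word counts as one token.
        tokens = len(message.split())
        self.messages.append((message, tokens))
        self.token_count += tokens
        # Once over budget, the oldest messages are dropped and simply forgotten.
        while self.token_count > self.max_tokens and len(self.messages) > 1:
            _, dropped_tokens = self.messages.popleft()
            self.token_count -= dropped_tokens

    def visible(self):
        """Return the messages the model could still 'remember'."""
        return [message for message, _ in self.messages]


window = ContextWindow(max_tokens=8)
window.add("my name is Ada")
window.add("I like fruit flies")
window.add("what is my name")
print(window.visible())
```

By the time the question arrives, the message containing the answer has already fallen out of the window, which is a crude picture of how this kind of memory differs from a lifetime of recollections.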

This lifetime of memories can be key to making intelligence general. It helps us recognize the possibilities and limits of drawing analogies between different circumstances or applying things learned in one context versus another. It provides us with insights that let us solve problems that we’ve never confronted before. And, of course, it also ensures that the horrible bit of pop music you were exposed to in your teens remains an earworm well into your 80s.

The differences between how brains and AIs handle memory, however, are very hard to describe. AIs don’t really have distinct memory, while the use of memory as the brain handles a task more sophisticated than navigating a maze is generally so poorly understood that it’s difficult to discuss at all. All we can really say is that there are clear differences there.

Facing limits

It’s difficult to think about AI without recognizing the enormous energy and computational resources involved in training one. And in this case, it’s potentially relevant. Brains have evolved under enormous energy constraints and continue to operate using well under the energy that a daily diet can provide. That has forced biology to figure out ways to optimize its resources and get the most out of the resources it does commit to.

In contrast, the story of recent developments in AI is largely one of throwing more resources at them. And plans for the future seem to (so far at least) involve more of this, including larger training data sets and ever more artificial neurons and connections among them. All of this comes at a time when the best current AIs are already using three orders of magnitude more neurons than we’d find in a fly’s brain and have nowhere near the fly’s general capabilities.

It remains possible that there is more than one route to those general capabilities and that some offshoot of today’s AI systems will eventually find a different route. But if it turns out that we have to bring our computerized systems closer to biology to get there, we’ll run into a serious roadblock: We don’t fully understand the biology yet.

“I guess I am not optimistic that any kind of artificial neural network will ever be able to achieve the same plasticity, the same generalizability, the same flexibility that a human brain has,” Baker said. “That’s just because we don’t even know how it gets it; we don’t know how that arises. So how do you build that into a system?”

John is Ars Technica’s science editor. He has a Bachelor of Arts in Biochemistry from Columbia University, and a Ph.D. in Molecular and Cell Biology from the University of California, Berkeley. When physically separated from his keyboard, he tends to seek out a bicycle, or a scenic location for communing with his hiking boots.

“It’s a lemon”—OpenAI’s largest AI model ever arrives to mixed reviews

Perhaps because of the disappointing results, Altman had previously written that GPT-4.5 will be the last of OpenAI’s traditional AI models, with GPT-5 planned to be a dynamic combination of “non-reasoning” LLMs and simulated reasoning models like o3.

A stratospheric price and a tech dead-end

And about that price: it’s a doozy. GPT-4.5 costs $75 per million input tokens and $150 per million output tokens through the API, compared to GPT-4o’s $2.50 per million input tokens and $10 per million output tokens. (Tokens are chunks of data used by AI models for processing.) For developers using OpenAI models, this pricing makes GPT-4.5 impractical for many applications where GPT-4o already performs adequately.

By contrast, OpenAI’s flagship reasoning model, o1, costs $15 per million input tokens and $60 per million output tokens, significantly less than GPT-4.5 despite offering specialized simulated reasoning capabilities. Even more striking, the o3-mini model costs just $1.10 per million input tokens and $4.40 per million output tokens, making it cheaper than even GPT-4o while providing much stronger performance on specific tasks.
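To see what those per-token prices mean in practice, here is a small Python sketch that applies the rates quoted above for three of the models to a single workload. The workload size (2 million input tokens and 500,000 output tokens) and the function name are hypothetical, chosen only to make the arithmetic concrete.

```python
# Prices in US dollars per million tokens, as quoted in this article.
PRICES_PER_MILLION = {
    "GPT-4.5": {"input": 75.00, "output": 150.00},
    "GPT-4o": {"input": 2.50, "output": 10.00},
    "o3-mini": {"input": 1.10, "output": 4.40},
}


def workload_cost(model, input_tokens, output_tokens):
    """Return the API cost in dollars for a given number of input and output tokens."""
    price = PRICES_PER_MILLION[model]
    return (
        input_tokens / 1_000_000 * price["input"]
        + output_tokens / 1_000_000 * price["output"]
    )


# Hypothetical workload: 2 million input tokens, 500,000 output tokens.
costs = {
    model: workload_cost(model, input_tokens=2_000_000, output_tokens=500_000)
    for model in PRICES_PER_MILLION
}

for model, cost in costs.items():
    print(f"{model}: ${cost:,.2f}")
```

For this assumed workload, GPT-4.5 comes to $225, against $10 for GPT-4o and $4.40 for o3-mini, a gap of more than 20 times between GPT-4.5 and GPT-4o.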

OpenAI has likely known about diminishing returns in training LLMs for some time. As a result, the company spent most of last year working on simulated reasoning models like o1 and o3, which use a different inference-time (runtime) approach to improving performance instead of throwing ever-larger amounts of training data at GPT-style AI models.

OpenAI’s self-reported benchmark results for the SimpleQA test, which measures confabulation rate. Credit: OpenAI

While this seems like bad news for OpenAI in the short term, competition is thriving in the AI market. Anthropic’s Claude 3.7 Sonnet has demonstrated vastly better performance than GPT-4.5, with a reportedly more efficient architecture. It’s worth noting that Claude 3.7 Sonnet is likely a system of AI models working together behind the scenes, although Anthropic has not provided details about its architecture.

For now, it seems that GPT-4.5 may be the last of its kind—a technological dead-end for an unsupervised learning approach that has paved the way for new architectures in AI models, such as o3’s inference-time reasoning and perhaps even something more novel, like diffusion-based models. Only time will tell how things end up.

GPT-4.5 is now available to ChatGPT Pro subscribers, with rollout to Plus and Team subscribers planned for next week, followed by Enterprise and Education customers the week after. Developers can access it through OpenAI’s various APIs on paid tiers, though the company is uncertain about its long-term availability.
