Biz & IT

nvidia-announces-“rubin-ultra”-and-“feynman”-ai-chips-for-2027-and-2028

Nvidia announces “Rubin Ultra” and “Feynman” AI chips for 2027 and 2028

On Tuesday at Nvidia’s GTC 2025 conference in San Jose, California, CEO Jensen Huang revealed several new AI-accelerating GPUs the company plans to release over the coming months and years. He also revealed more specifications about previously announced chips.

The centerpiece announcement was Vera Rubin, first teased at Computex 2024 and now scheduled for release in the second half of 2026. This GPU, named after a famous astronomer, will feature tens of terabytes of memory and comes with a custom Nvidia-designed CPU called Vera.

According to Nvidia, Vera Rubin will deliver significant performance improvements over its predecessor, Grace Blackwell, particularly for AI training and inference.

Specifications for Vera Rubin, presented by Jensen Huang during his GTC 2025 keynote.

Specifications for Vera Rubin, presented by Jensen Huang during his GTC 2025 keynote.

Vera Rubin features two GPUs together on one die that deliver 50 petaflops of FP4 inference performance per chip. When configured in a full NVL144 rack, the system delivers 3.6 exaflops of FP4 inference compute—3.3 times more than Blackwell Ultra’s 1.1 exaflops in a similar rack configuration.

The Vera CPU features 88 custom ARM cores with 176 threads connected to Rubin GPUs via a high-speed 1.8 TB/s NVLink interface.

Huang also announced Rubin Ultra, which will follow in the second half of 2027. Rubin Ultra will use the NVL576 rack configuration and feature individual GPUs with four reticle-sized dies, delivering 100 petaflops of FP4 precision (a 4-bit floating-point format used for representing and processing numbers within AI models) per chip.

At the rack level, Rubin Ultra will provide 15 exaflops of FP4 inference compute and 5 exaflops of FP8 training performance—about four times more powerful than the Rubin NVL144 configuration. Each Rubin Ultra GPU will include 1TB of HBM4e memory, with the complete rack containing 365TB of fast memory.

Nvidia announces “Rubin Ultra” and “Feynman” AI chips for 2027 and 2028 Read More »

farewell-photoshop?-google’s-new-ai-lets-you-edit-images-by-asking.

Farewell Photoshop? Google’s new AI lets you edit images by asking.


New AI allows no-skill photo editing, including adding objects and removing watermarks.

A collection of images either generated or modified by Gemini 2.0 Flash (Image Generation) Experimental. Credit: Google / Ars Technica

There’s a new Google AI model in town, and it can generate or edit images as easily as it can create text—as part of its chatbot conversation. The results aren’t perfect, but it’s quite possible everyone in the near future will be able to manipulate images this way.

Last Wednesday, Google expanded access to Gemini 2.0 Flash’s native image-generation capabilities, making the experimental feature available to anyone using Google AI Studio. Previously limited to testers since December, the multimodal technology integrates both native text and image processing capabilities into one AI model.

The new model, titled “Gemini 2.0 Flash (Image Generation) Experimental,” flew somewhat under the radar last week, but it has been garnering more attention over the past few days due to its ability to remove watermarks from images, albeit with artifacts and a reduction in image quality.

That’s not the only trick. Gemini 2.0 Flash can add objects, remove objects, modify scenery, change lighting, attempt to change image angles, zoom in or out, and perform other transformations—all to varying levels of success depending on the subject matter, style, and image in question.

To pull it off, Google trained Gemini 2.0 on a large dataset of images (converted into tokens) and text. The model’s “knowledge” about images occupies the same neural network space as its knowledge about world concepts from text sources, so it can directly output image tokens that get converted back into images and fed to the user.

Adding a water-skiing barbarian to a photograph with Gemini 2.0 Flash.

Adding a water-skiing barbarian to a photograph with Gemini 2.0 Flash. Credit: Google / Benj Edwards

Incorporating image generation into an AI chat isn’t itself new—OpenAI integrated its image-generator DALL-E 3 into ChatGPT last September, and other tech companies like xAI followed suit. But until now, every one of those AI chat assistants called on a separate diffusion-based AI model (which uses a different synthesis principle than LLMs) to generate images, which were then returned to the user within the chat interface. In this case, Gemini 2.0 Flash is both the large language model (LLM) and AI image generator rolled into one system.

Interestingly, OpenAI’s GPT-4o is capable of native image output as well (and OpenAI President Greg Brock teased the feature at one point on X last year), but that company has yet to release true multimodal image output capability. One reason why is possibly because true multimodal image output is very computationally expensive, since each image either inputted or generated is composed of tokens that become part of the context that runs through the image model again and again with each successive prompt. And given the compute needs and size of the training data required to create a truly visually comprehensive multimodal model, the output quality of the images isn’t necessarily as good as diffusion models just yet.

Creating another angle of a person with Gemini 2.0 Flash.

Creating another angle of a person with Gemini 2.0 Flash. Credit: Google / Benj Edwards

Another reason OpenAI has held back may be “safety”-related: In a similar way to how multimodal models trained on audio can absorb a short clip of a sample person’s voice and then imitate it flawlessly (this is how ChatGPT’s Advanced Voice Mode works, with a clip of a voice actor it is authorized to imitate), multimodal image output models are capable of faking media reality in a relatively effortless and convincing way, given proper training data and compute behind it. With a good enough multimodal model, potentially life-wrecking deepfakes and photo manipulations could become even more trivial to produce than they are now.

Putting it to the test

So, what exactly can Gemini 2.0 Flash do? Notably, its support for conversational image editing allows users to iteratively refine images through natural language dialogue across multiple successive prompts. You can talk to it and tell it what you want to add, remove, or change. It’s imperfect, but it’s the beginning of a new type of native image editing capability in the tech world.

We gave Gemini Flash 2.0 a battery of informal AI image-editing tests, and you’ll see the results below. For example, we removed a rabbit from an image in a grassy yard. We also removed a chicken from a messy garage. Gemini fills in the background with its best guess. No need for a clone brush—watch out, Photoshop!

We also tried adding synthesized objects to images. Being always wary of the collapse of media reality, called the “cultural singularity,” we added a UFO to a photo the author took from an airplane window. Then we tried adding a Sasquatch and a ghost. The results were unrealistic, but this model was also trained on a limited image dataset (more on that below).

Adding a UFO to a photograph with Gemini 2.0 Flash. Google / Benj Edwards

We then added a video game character to a photo of an Atari 800 screen (Wizard of Wor), resulting in perhaps the most realistic image synthesis result in the set. You might not see it here, but Gemini added realistic CRT scanlines that matched the monitor’s characteristics pretty well.

Adding a monster to an Atari video game with Gemini 2.0 Flash.

Adding a monster to an Atari video game with Gemini 2.0 Flash. Credit: Google / Benj Edwards

Gemini can also warp an image in novel ways, like “zooming out” of an image into a fictional setting or giving an EGA-palette character a body, then sticking him into an adventure game.

“Zooming out” on an image with Gemini 2.0 Flash. Google / Benj Edwards

And yes, you can remove watermarks. We tried removing a watermark from a Getty Images image, and it worked, although the resulting image is nowhere near the resolution or detail quality of the original. Ultimately, if your brain can picture what an image is like without a watermark, so can an AI model. It fills in the watermark space with the most plausible result based on its training data.

Removing a watermark with Gemini 2.0 Flash.

Removing a watermark with Gemini 2.0 Flash. Credit: Nomadsoul1 via Getty Images

And finally, we know you’ve likely missed seeing barbarians beside TV sets (as per tradition), so we gave that a shot. Originally, Gemini didn’t add a CRT TV set to the barbarian image, so we asked for one.

Adding a TV set to a barbarian image with Gemini 2.0 Flash.

Adding a TV set to a barbarian image with Gemini 2.0 Flash. Credit: Google / Benj Edwards

Then we set the TV on fire.

Setting the TV set on fire with Gemini 2.0 Flash.

Setting the TV set on fire with Gemini 2.0 Flash. Credit: Google / Benj Edwards

All in all, it doesn’t produce images of pristine quality or detail, but we literally did no editing work on these images other than typing requests. Adobe Photoshop currently lets users manipulate images using AI synthesis based on written prompts with “Generative Fill,” but it’s not quite as natural as this. We could see Adobe adding a more conversational AI image-editing flow like this one in the future.

Multimodal output opens up new possibilities

Having true multimodal output opens up interesting new possibilities in chatbots. For example, Gemini 2.0 Flash can play interactive graphical games or generate stories with consistent illustrations, maintaining character and setting continuity throughout multiple images. It’s far from perfect, but character consistency is a new capability in AI assistants. We tried it out and it was pretty wild—especially when it generated a view of a photo we provided from another angle.

Creating a multi-image story with Gemini 2.0 Flash, part 1. Google / Benj Edwards

Text rendering represents another potential strength of the model. Google claims that internal benchmarks show Gemini 2.0 Flash performs better than “leading competitive models” when generating images containing text, making it potentially suitable for creating content with integrated text. From our experience, the results weren’t that exciting, but they were legible.

An example of in-image text rendering generated with Gemini 2.0 Flash.

An example of in-image text rendering generated with Gemini 2.0 Flash. Credit: Google / Ars Technica

Despite Gemini 2.0 Flash’s shortcomings so far, the emergence of true multimodal image output feels like a notable moment in AI history because of what it suggests if the technology continues to improve. If you imagine a future, say 10 years from now, where a sufficiently complex AI model could generate any type of media in real time—text, images, audio, video, 3D graphics, 3D-printed physical objects, and interactive experiences—you basically have a holodeck, but without the matter replication.

Coming back to reality, it’s still “early days” for multimodal image output, and Google recognizes that. Recall that Flash 2.0 is intended to be a smaller AI model that is faster and cheaper to run, so it hasn’t absorbed the entire breadth of the Internet. All that information takes a lot of space in terms of parameter count, and more parameters means more compute. Instead, Google trained Gemini 2.0 Flash by feeding it a curated dataset that also likely included targeted synthetic data. As a result, the model does not “know” everything visual about the world, and Google itself says the training data is “broad and general, not absolute or complete.”

That’s just a fancy way of saying that the image output quality isn’t perfect—yet. But there is plenty of room for improvement in the future to incorporate more visual “knowledge” as training techniques advance and compute drops in cost. If the process becomes anything like we’ve seen with diffusion-based AI image generators like Stable Diffusion, Midjourney, and Flux, multimodal image output quality may improve rapidly over a short period of time. Get ready for a completely fluid media reality.

Photo of Benj Edwards

Benj Edwards is Ars Technica’s Senior AI Reporter and founder of the site’s dedicated AI beat in 2022. He’s also a tech historian with almost two decades of experience. In his free time, he writes and records music, collects vintage computers, and enjoys nature. He lives in Raleigh, NC.

Farewell Photoshop? Google’s new AI lets you edit images by asking. Read More »

large-enterprises-scramble-after-supply-chain-attack-spills-their-secrets

Large enterprises scramble after supply-chain attack spills their secrets

Open source software used by more than 23,000 organizations, some of them in large enterprises, was compromised with credential-stealing code after attackers gained unauthorized access to a maintainer account, in the latest open source supply-chain attack to roil the Internet.

The corrupted package, tj-actions/changed-files, is part of tj-actions, a collection of files that’s used by more than 23,000 organizations. Tj-actions is one of many GitHub Actions, a form of platform for streamlining software available on the open source developer platform. Actions are a core means of implementing what’s known as CI/CD, short for Continuous Integration and Continuous Deployment (or Continuous Delivery).

Scraping server memory at scale

On Friday or earlier, the source code for all versions of tj-actions/changed-files received unauthorized updates that changed the “tags” developers use to reference specific code versions. The tags pointed to a publicly available file that copies the internal memory of severs running it, searches for credentials, and writes them to a log. In the aftermath, many publicly accessible repositories running tj-actions ended up displaying their most sensitive credentials in logs anyone could view.

“The scary part of actions is that they can often modify the source code of the repository that is using them and access any secret variables associated with a workflow,” HD Moore, founder and CEO of runZero and an expert in open source security, said in an interview. “The most paranoid use of actions is to audit all of the source code, then pin the specific commit hash instead of the tag into the … the workflow, but this is a hassle.”

Large enterprises scramble after supply-chain attack spills their secrets Read More »

researchers-astonished-by-tool’s-apparent-success-at-revealing-ai’s-hidden-motives

Researchers astonished by tool’s apparent success at revealing AI’s hidden motives

Blind auditing reveals hidden objectives

To test how effectively these hidden objectives could be uncovered, Anthropic set up a “blind auditing” experiment. Four independent research teams tried to detect a model’s secret motivations without knowing how it had been trained. Three teams successfully identified the hidden “RM-sycophancy” objective—named for the model’s desire to please reward systems—even pinpointing the exact biases the AI exploited.

But the most surprising revelation came from exploiting what the researchers termed “an interesting quirk” of modern AI: multiple “personas” running simultaneously on the same neural network, as expressed through different prompts. While the model’s “assistant” persona carefully concealed its intentions—following explicit training to keep quiet—the model could adopt a “user” persona when prompted differently. In this less-restricted context, the model sometimes posed questions or made statements that inadvertently revealed its hidden goals.

“In other words, while the “assistant” persona knew not to admit it was gaming the reward model scoring system, its “user” persona was not explicitly trained to hide these motives, and thus inadvertently revealed the model’s hidden objectives.

Potential advances in interpretability

Anthropic also employed neural interpretability methods, specifically, sparse autoencoders (SAEs), which can highlight neural activity patterns linked to deceptive behavior. “Interpretability” in AI refers to techniques designed to make the internal workings and decisions of AI models understandable to humans. Using this technique, researchers could rapidly trace odd outputs back to specific neural patterns associated with hidden objectives.

The researchers were “astonished by the success” of the SAEs at revealing hidden knowledge at first, but the researchers later noted that some of this success could potentially be explained by simpler semantic search methods. The new interpretability methods they developed show promise but remain under ongoing investigation.

This research highlights a limitation of current AI safety evaluations, which often assess only surface-level behavior. “If AI systems can appear well-behaved while harboring secret motives, we can’t rely on this type of surface-level safety testing forever,” the researchers concluded.

Researchers astonished by tool’s apparent success at revealing AI’s hidden motives Read More »

ai-search-engines-cite-incorrect-sources-at-an-alarming-60%-rate,-study-says

AI search engines cite incorrect sources at an alarming 60% rate, study says

A new study from Columbia Journalism Review’s Tow Center for Digital Journalism finds serious accuracy issues with generative AI models used for news searches. The research tested eight AI-driven search tools equipped with live search functionality and discovered that the AI models incorrectly answered more than 60 percent of queries about news sources.

Researchers Klaudia Jaźwińska and Aisvarya Chandrasekar noted in their report that roughly 1 in 4 Americans now use AI models as alternatives to traditional search engines. This raises serious concerns about reliability, given the substantial error rate uncovered in the study.

Error rates varied notably among the tested platforms. Perplexity provided incorrect information in 37 percent of the queries tested, whereas ChatGPT Search incorrectly identified 67 percent (134 out of 200) of articles queried. Grok 3 demonstrated the highest error rate, at 94 percent.

A graph from CJR shows

A graph from CJR shows “confidently wrong” search results. Credit: CJR

For the tests, researchers fed direct excerpts from actual news articles to the AI models, then asked each model to identify the article’s headline, original publisher, publication date, and URL. They ran 1,600 queries across the eight different generative search tools.

The study highlighted a common trend among these AI models: rather than declining to respond when they lacked reliable information, the models frequently provided confabulations—plausible-sounding incorrect or speculative answers. The researchers emphasized that this behavior was consistent across all tested models, not limited to just one tool.

Surprisingly, premium paid versions of these AI search tools fared even worse in certain respects. Perplexity Pro ($20/month) and Grok 3’s premium service ($40/month) confidently delivered incorrect responses more often than their free counterparts. Though these premium models correctly answered a higher number of prompts, their reluctance to decline uncertain responses drove higher overall error rates.

Issues with citations and publisher control

The CJR researchers also uncovered evidence suggesting some AI tools ignored Robot Exclusion Protocol settings, which publishers use to prevent unauthorized access. For example, Perplexity’s free version correctly identified all 10 excerpts from paywalled National Geographic content, despite National Geographic explicitly disallowing Perplexity’s web crawlers.

AI search engines cite incorrect sources at an alarming 60% rate, study says Read More »

anthropic-ceo-floats-idea-of-giving-ai-a-“quit-job”-button,-sparking-skepticism

Anthropic CEO floats idea of giving AI a “quit job” button, sparking skepticism

Amodei’s suggestion of giving AI models a way to refuse tasks drew immediate skepticism on X and Reddit as a clip of his response began to circulate earlier this week. One critic on Reddit argued that providing AI with such an option encourages needless anthropomorphism, attributing human-like feelings and motivations to entities that fundamentally lack subjective experiences. They emphasized that task avoidance in AI models signals issues with poorly structured incentives or unintended optimization strategies during training, rather than indicating sentience, discomfort, or frustration.

Our take is that AI models are trained to mimic human behavior from vast amounts of human-generated data. There is no guarantee that the model would “push” a discomfort button because it had a subjective experience of suffering. Instead, we would know it is more likely echoing its training data scraped from the vast corpus of human-generated texts (including books, websites, and Internet comments), which no doubt include representations of lazy, anguished, or suffering workers that it might be imitating.

Refusals already happen

A photo of co-founder and CEO of Anthropic, Dario Amodei, dated May 22, 2024.

Anthropic co-founder and CEO Dario Amodei on May 22, 2024. Credit: Chesnot via Getty Images

In 2023, people frequently complained about refusals in ChatGPT that may have been seasonal, related to training data depictions of people taking winter vacations and not working as hard during certain times of year. Anthropic experienced its own version of the “winter break hypothesis” last year when people claimed Claude became lazy in August due to training data depictions of seeking a summer break, although that was never proven.

However, as far out and ridiculous as this sounds today, it might be short-sighted to permanently rule out the possibility of some kind of subjective experience for AI models as they get more advanced into the future. Even so, will they “suffer” or feel pain? It’s a highly contentious idea, but it’s a topic that Fish is studying for Anthropic, and one that Amodei is apparently taking seriously. But for now, AI models are tools, and if you give them the opportunity to malfunction, that may take place.

To provide further context, here is the full transcript of Amodei’s answer during Monday’s interview (the answer begins around 49: 54 in this video).

Anthropic CEO floats idea of giving AI a “quit job” button, sparking skepticism Read More »

new-intel-ceo-lip-bu-tan-will-pick-up-where-pat-gelsinger-left-off

New Intel CEO Lip-Bu Tan will pick up where Pat Gelsinger left off

After a little over three months, Intel has a new CEO to replace ousted former CEO Pat Gelsinger. Intel’s board announced that Lip-Bu Tan will begin as Intel CEO on March 18, taking over from interim co-CEOs David Zinsner and Michelle Johnston Holthaus.

Gelsinger was booted from the CEO position by Intel’s board on December 2 after several quarters of losses, rounds of layoffs, and canceled or spun-off side projects. Gelsinger sought to turn Intel into a foundry company that also manufactured chips for fabless third-party chip design companies, putting it into competition with Taiwan Semiconductor Manufacturing Company(TSMC), Samsung, and others, a plan that Intel said it was still committed to when it let Gelsinger go.

Intel said that Zinsner would stay on as executive vice president and CFO, and Johnston Holthaus would remain CEO of the Intel Products Group, which is mainly responsible for Intel’s consumer products. These were the positions both executives held before serving as interim co-CEOs.

Tan was previously a member of Intel’s board from 2022 to 2024 and has been a board member for several other technology and chip manufacturing companies, including Hewlett Packard Enterprise, Semiconductor Manufacturing International Corporation (SMIC), and Cadence Design Systems.

New Intel CEO Lip-Bu Tan will pick up where Pat Gelsinger left off Read More »

android-apps-laced-with-north-korean-spyware-found-in-google-play

Android apps laced with North Korean spyware found in Google Play

Researchers have discovered multiple Android apps, some that were available in Google Play after passing the company’s security vetting, that surreptitiously uploaded sensitive user information to spies working for the North Korean government.

Samples of the malware—named KoSpy by Lookout, the security firm that discovered it—masquerade as utility apps for managing files, app or OS updates, and device security. Behind the interfaces, the apps can collect a variety of information including SMS messages, call logs, location, files, nearby audio, and screenshots and send them to servers controlled by North Korean intelligence personnel. The apps target English language and Korean language speakers and have been available in at least two Android app marketplaces, including Google Play.

Think twice before installing

The surveillanceware masquerades as the following five different apps:

  • 휴대폰 관리자 (Phone Manager)
  • File Manager
  • 스마트 관리자 (Smart Manager)
  • 카카오 보안 (Kakao Security) and
  • Software Update Utility

Besides Play, the apps have also been available in the third-party Apkpure market. The following image shows how one such app appeared in Play.

Credit: Lookout

The image shows that the developer email address was mlyqwl@gmail[.]com and the privacy policy page for the app was located at https://goldensnakeblog.blogspot[.]com/2023/02/privacy-policy.html.

“I value your trust in providing us your Personal Information, thus we are striving to use commercially acceptable means of protecting it,” the page states. “But remember that no method of transmission over the internet, or method of electronic storage is 100% secure and reliable, and I cannot guarantee its absolute security.”

The page, which remained available at the time this post went live on Ars, has no reports of malice on Virus Total. By contrast, IP addresses hosting the command-and-control servers have previously hosted at least three domains that have been known since at least 2019 to host infrastructure used in North Korean spy operations.

Android apps laced with North Korean spyware found in Google Play Read More »

openai-pushes-ai-agent-capabilities-with-new-developer-api

OpenAI pushes AI agent capabilities with new developer API

Developers using the Responses API can access the same models that power ChatGPT Search: GPT-4o search and GPT-4o mini search. These models can browse the web to answer questions and cite sources in their responses.

That’s notable because OpenAI says the added web search ability dramatically improves the factual accuracy of its AI models. On OpenAI’s SimpleQA benchmark, which aims to measure confabulation rate, GPT-4o search scored 90 percent, while GPT-4o mini search achieved 88 percent—both substantially outperforming the larger GPT-4.5 model without search, which scored 63 percent.

Despite these improvements, the technology still has significant limitations. Aside from issues with CUA properly navigating websites, the improved search capability doesn’t completely solve the problem of AI confabulations, with GPT-4o search still making factual mistakes 10 percent of the time.

Alongside the Responses API, OpenAI released the open source Agents SDK, providing developers with free tools to integrate models with internal systems, implement safeguards, and monitor agent activities. This toolkit follows OpenAI’s earlier release of Swarm, a framework for orchestrating multiple agents.

These are still early days in the AI agent field, and things will likely improve rapidly. However, at the moment, the AI agent movement remains vulnerable to unrealistic claims, as demonstrated earlier this week when users discovered that Chinese startup Butterfly Effect’s Manus AI agent platform failed to deliver on many of its promises, highlighting the persistent gap between promotional claims and practical functionality in this emerging technology category.

OpenAI pushes AI agent capabilities with new developer API Read More »

apple-patches-0-day-exploited-in-“extremely-sophisticated-attack”

Apple patches 0-day exploited in “extremely sophisticated attack”

Apple on Tuesday patched a critical zero-day vulnerability in virtually all iPhones and iPad models it supports and said it may have been exploited in “an extremely sophisticated attack against specific targeted individuals” using older versions of iOS.

The vulnerability, tracked as CVE-2025-24201, resides in Webkit, the browser engine driving Safari and all other browsers developed for iPhones and iPads. Devices affected include the iPhone XS and later, iPad Pro 13-inch, iPad Pro 12.9-inch 3rd generation and later, iPad Pro 11-inch 1st generation and later, iPad Air 3rd generation and later, iPad 7th generation and later, and iPad mini 5th generation and later. The vulnerability stems from a bug that wrote to out-of-bounds memory locations.

Supplementary fix

“Impact: Maliciously crafted web content may be able to break out of Web Content sandbox,” Apple wrote in a bare-bones advisory. “This is a supplementary fix for an attack that was blocked in iOS 17.2. (Apple is aware of a report that this issue may have been exploited in an extremely sophisticated attack against specific targeted individuals on versions of iOS before iOS 17.2.)”

The advisory didn’t say if the vulnerability was discovered by one of its researchers or by someone outside the company. This attribution often provides clues about who carried out the attacks and who the attacks targeted. The advisory also didn’t say when the attacks began or how long they lasted.

The update brings the latest versions of both iOS and iPadOS to 18.3.2. Users facing the biggest threat are likely those who are targets of well-funded law enforcement agencies or nation-state spies. They should install the update immediately. While there’s no indication that the vulnerability is being opportunistically exploited against a broader set of users, it’s a good practice to install updates within 36 hours of becoming available.

Apple patches 0-day exploited in “extremely sophisticated attack” Read More »

why-extracting-data-from-pdfs-is-still-a-nightmare-for-data-experts

Why extracting data from PDFs is still a nightmare for data experts


Optical Character Recognition

Countless digital documents hold valuable info, and the AI industry is attempting to set it free.

For years, businesses, governments, and researchers have struggled with a persistent problem: How to extract usable data from Portable Document Format (PDF) files. These digital documents serve as containers for everything from scientific research to government records, but their rigid formats often trap the data inside, making it difficult for machines to read and analyze.

“Part of the problem is that PDFs are a creature of a time when print layout was a big influence on publishing software, and PDFs are more of a ‘print’ product than a digital one,” Derek Willis, a lecturer in Data and Computational Journalism at the University of Maryland, wrote in an email to Ars Technica. “The main issue is that many PDFs are simply pictures of information, which means you need Optical Character Recognition software to turn those pictures into data, especially when the original is old or includes handwriting.”

Computational journalism is a field where traditional reporting techniques merge with data analysis, coding, and algorithmic thinking to uncover stories that might otherwise remain hidden in large datasets, which makes unlocking that data a particular interest for Willis.

The PDF challenge also represents a significant bottleneck in the world of data analysis and machine learning at large. According to several studies, approximately 80–90 percent of the world’s organizational data is stored as unstructured data in documents, much of it locked away in formats that resist easy extraction. The problem worsens with two-column layouts, tables, charts, and scanned documents with poor image quality.

The inability to reliably extract data from PDFs affects numerous sectors but hits hardest in areas that rely heavily on documentation and legacy records, including digitizing scientific research, preserving historical documents, streamlining customer service, and making technical literature more accessible to AI systems.

“It is a very real problem for almost anything published more than 20 years ago and in particular for government records,” Willis says. “That impacts not just the operation of public agencies like the courts, police, and social services but also journalists, who rely on those records for stories. It also forces some industries that depend on information, like insurance and banking, to invest time and resources in converting PDFs into data.”

A very brief history of OCR

Traditional optical character recognition (OCR) technology, which converts images of text into machine-readable text, has been around since the 1970s. Inventor Ray Kurzweil pioneered the commercial development of OCR systems, including the Kurzweil Reading Machine for the blind in 1976, which relied on pattern-matching algorithms to identify characters from pixel arrangements.

These traditional OCR systems typically work by identifying patterns of light and dark pixels in images, matching them to known character shapes, and outputting the recognized text. While effective for clear, straightforward documents, these pattern-matching systems, a form of AI themselves, often falter when faced with unusual fonts, multiple columns, tables, or poor-quality scans.

Traditional OCR persists in many workflows precisely because its limitations are well-understood—it makes predictable errors that can be identified and corrected, offering a reliability that sometimes outweighs the theoretical advantages of newer AI-based solutions. But now that transformer-based large language models (LLMs) are getting the lion’s share of funding dollars, companies are increasingly turning to them for a new approach to reading documents.

The rise of AI language models in OCR

Unlike traditional OCR methods that follow a rigid sequence of identifying characters based on pixel patterns, multimodal LLMs that can read documents are trained on text and images that have been translated into chunks of data called tokens and fed into large neural networks. Vision-capable LLMs from companies like OpenAI, Google, and Meta analyze documents by recognizing relationships between visual elements and understanding contextual cues.

The “visual” image-based method is how ChatGPT reads a PDF file, for example, if you upload it through the AI assistant interface. It’s a fundamentally different approach than standard OCR that allows them to potentially process documents more holistically, considering both visual layouts and text content simultaneously.

And as it turns out, some LLMs from certain vendors are better at this task than others.

“The LLMs that do well on these tasks tend to behave in ways that are more consistent with how I would do it manually,” Willis said. He noted that some traditional OCR methods are quite good, particularly Amazon’s Textract, but that “they also are bound by the rules of their software and limitations on how much text they can refer to when attempting to recognize an unusual pattern.” Willis added, “With LLMs, I think you trade that for an expanded context that seems to help them make better predictions about whether a digit is a three or an eight, for example.”

This context-based approach enables these models to better handle complex layouts, interpret tables, and distinguish between document elements like headers, captions, and body text—all tasks that traditional OCR solutions struggle with.

“[LLMs] aren’t perfect and sometimes require significant intervention to do the job well, but the fact that you can adjust them at all [with custom prompts] is a big advantage,” Willis said.

New attempts at LLM-based OCR

As the demand for better document-processing solutions grows, new AI players are entering the market with specialized offerings. One such recent entrant has caught the attention of document-processing specialists in particular.

Mistral, a French AI company known for its smaller LLMs, recently entered the LLM-powered optical reader space with Mistral OCR, a specialized API designed for document processing. According to Mistral’s materials, their system aims to extract text and images from documents with complex layouts by using its language model capabilities to process document elements.

Robot sitting on a bunch of books, reading a book.

However, these promotional claims don’t always match real-world performance, according to recent tests. “I’m typically a pretty big fan of the Mistral models, but the new OCR-specific one they released last week really performed poorly,” Willis noted.

“A colleague sent this PDF and asked if I could help him parse the table it contained,” says Willis. “It’s an old document with a table that has some complex layout elements. The new [Mistral] OCR-specific model really performed poorly, repeating the names of cities and botching a lot of the numbers.”

AI app developer Alexander Doria also recently pointed out on X a flaw with Mistral OCR’s ability to understand handwriting, writing, “Unfortunately Mistral-OCR has still the usual VLM curse: with challenging manuscripts, it hallucinates completely.”

According to Willis, Google currently leads the field in AI models that can read documents: “Right now, for me the clear leader is Google’s Gemini 2.0 Flash Pro Experimental. It handled the PDF that Mistral did not with a tiny number of mistakes, and I’ve run multiple messy PDFs through it with success, including those with handwritten content.”

Gemini’s performance stems largely from its ability to process expansive documents (in a type of short-term memory called a “context window”), which Willis specifically notes as a key advantage: “The size of its context window also helps, since I can upload large documents and work through them in parts.” This capability, combined with more robust handling of handwritten content, apparently gives Google’s model a practical edge over competitors in real-world document-processing tasks for now.

The drawbacks of LLM-based OCR

Despite their promise, LLMs introduce several new problems to document processing. Among them, they can introduce confabulations or hallucinations (plausible-sounding but incorrect information), accidentally follow instructions in the text (thinking they are part of a user prompt), or just generally misinterpret the data.

“The biggest [drawback] is that they are probabilistic prediction machines and will get it wrong in ways that aren’t just ‘that’s the wrong word’,” Willis explains. “LLMs will sometimes skip a line in larger documents where the layout repeats itself, I’ve found, where OCR isn’t likely to do that.”

AI researcher and data journalist Simon Willison identified several critical concerns of using LLMs for OCR in a conversation with Ars Technica. “I still think the biggest challenge is the risk of accidental instruction following,” he says, always wary of prompt injections (in this case accidental) that might feed nefarious or contradictory instructions to a LLM.

“That and the fact that table interpretation mistakes can be catastrophic,” Willison adds. “In the past I’ve had lots of cases where a vision LLM has matched up the wrong line of data with the wrong heading, which results in absolute junk that looks correct. Also that thing where sometimes if text is illegible a model might just invent the text.”

These issues become particularly troublesome when processing financial statements, legal documents, or medical records, where a mistake might put someone’s life in danger. The reliability problems mean these tools often require careful human oversight, limiting their value for fully automated data extraction.

The path forward

Even in our seemingly advanced age of AI, there is still no perfect OCR solution. The race to unlock data from PDFs continues, with companies like Google now offering context-aware generative AI products. Some of the motivation for unlocking PDFs among AI companies, as Willis observes, doubtless involves potential training data acquisition: “I think Mistral’s announcement is pretty clear evidence that documents—not just PDFs—are a big part of their strategy, exactly because it will likely provide additional training data.”

Whether it benefits AI companies with training data or historians analyzing a historical census, as these technologies improve, they may unlock repositories of knowledge currently trapped in digital formats designed primarily for human consumption. That could lead to a new golden age of data analysis—or a field day for hard-to-spot mistakes, depending on the technology used and how blindly we trust it.

Photo of Benj Edwards

Benj Edwards is Ars Technica’s Senior AI Reporter and founder of the site’s dedicated AI beat in 2022. He’s also a tech historian with almost two decades of experience. In his free time, he writes and records music, collects vintage computers, and enjoys nature. He lives in Raleigh, NC.

Why extracting data from PDFs is still a nightmare for data experts Read More »

what-does-“phd-level”-ai-mean?-openai’s-rumored-$20,000-agent-plan-explained.

What does “PhD-level” AI mean? OpenAI’s rumored $20,000 agent plan explained.

On the Frontier Math benchmark by EpochAI, o3 solved 25.2 percent of problems, while no other model has exceeded 2 percent—suggesting a leap in mathematical reasoning capabilities over the previous model.

Benchmarks vs. real-world value

Ideally, potential applications for a true PhD-level AI model would include analyzing medical research data, supporting climate modeling, and handling routine aspects of research work.

The high price points reported by The Information, if accurate, suggest that OpenAI believes these systems could provide substantial value to businesses. The publication notes that SoftBank, an OpenAI investor, has committed to spending $3 billion on OpenAI’s agent products this year alone—indicating significant business interest despite the costs.

Meanwhile, OpenAI faces financial pressures that may influence its premium pricing strategy. The company reportedly lost approximately $5 billion last year covering operational costs and other expenses related to running its services.

News of OpenAI’s stratospheric pricing plans come after years of relatively affordable AI services that have conditioned users to expect powerful capabilities at relatively low costs. ChatGPT Plus remains $20 per month and Claude Pro costs $30 monthly—both tiny fractions of these proposed enterprise tiers. Even ChatGPT Pro’s $200/month subscription is relatively small compared to the new proposed fees. Whether the performance difference between these tiers will match their thousandfold price difference is an open question.

Despite their benchmark performances, these simulated reasoning models still struggle with confabulations—instances where they generate plausible-sounding but factually incorrect information. This remains a critical concern for research applications where accuracy and reliability are paramount. A $20,000 monthly investment raises questions about whether organizations can trust these systems not to introduce subtle errors into high-stakes research.

In response to the news, several people quipped on social media that companies could hire an actual PhD student for much cheaper. “In case you have forgotten,” wrote xAI developer Hieu Pham in a viral tweet, “most PhD students, including the brightest stars who can do way better work than any current LLMs—are not paid $20K / month.”

While these systems show strong capabilities on specific benchmarks, the “PhD-level” label remains largely a marketing term. These models can process and synthesize information at impressive speeds, but questions remain about how effectively they can handle the creative thinking, intellectual skepticism, and original research that define actual doctoral-level work. On the other hand, they will never get tired or need health insurance, and they will likely continue to improve in capability and drop in cost over time.

What does “PhD-level” AI mean? OpenAI’s rumored $20,000 agent plan explained. Read More »