Data


Microsoft’s new 10,000-year data storage medium: glass


Femtosecond lasers etch data into a very stable medium.

Right now, Silica hardware isn’t quite ready for commercialization. Credit: Microsoft Research

Archival storage poses lots of challenges. We want media that is extremely dense and stable for centuries or more, and, ideally, doesn’t consume any energy when not being accessed. Lots of ideas have floated around—even DNA has been considered—but one of the simplest is to etch data into glass. Many forms of glass are very physically and chemically stable, and it’s relatively easy to etch things into it.

There’s been a lot of preliminary work demonstrating different aspects of a glass-based storage system. But in Wednesday’s issue of Nature, Microsoft Research announced Project Silica, a working demonstration of a system that can write data into small slabs of glass, and read it back, at a density of over a gigabit per cubic millimeter.

Writing on glass

We tend to think of glass as fragile, prone to shattering, and capable of flowing downward over centuries, although the last claim is a myth. Glass is a category of material, and a variety of chemicals can form glasses. With the right starting chemical, it’s possible to make a glass that is, as the researchers put it, “thermally and chemically stable and is resistant to moisture ingress, temperature fluctuations and electromagnetic interference.” While it would still need to be handled in a way to minimize damage, glass provides the sort of stability we’d want for long-term storage.

Putting data into glass is as simple as etching it. But that’s been one of the challenges, as etching is typically a slow process. However, the development of femtosecond lasers—lasers that emit pulses that only last 10⁻¹⁵ seconds and can emit millions of them per second—can significantly cut down write times and allow etching to be focused on a very small area, increasing potential data density.

To read the data back, there are several options. We’ve already had great success using lasers to read data from optical disks, albeit slowly. But anything that can pick up the small features etched into the glass could conceivably work.

With the above considerations in mind, everything was in place on a theoretical level for Project Silica. The big question was how to put them together into a functional system. Microsoft decided that, just to be cautious, it would answer that question twice.

A real-world system

The difference between these two answers comes down to how an individual unit of data (called a voxel) is written to the glass. One type of voxel they tried was based on birefringence, where refraction of photons depends on their polarization. It’s possible to etch voxels into glass to create birefringence using polarized laser light, producing features smaller than the diffraction limit. In practice, this involved using one laser pulse to create an oval-shaped void, followed by a second, polarized pulse to induce birefringence. The identity of a voxel is based on the orientation of the oval; since we can resolve multiple orientations, it’s possible to save more than one bit in each voxel.

The alternative approach involves changing the magnitude of refractive effects by varying the amount of energy in the laser pulse. Again, it’s possible to discern more than two states in these voxels, allowing multiple data bits to be stored in each voxel.

The map data from Microsoft Flight Simulator etched onto the Silica storage medium. Credit: Microsoft Research

Reading these in Silica involves using a microscope that can pick up differences in refractive index. (For microscopy geeks, this is a way of saying “they used phase contrast microscopy.”) The microscopy sets the limits on how many layers of voxels can be placed in a single piece of glass. During etching, the layers were separated by enough distance so only a single layer would be in the microscope’s plane of focus at a time. The etching process also incorporates symbols that allow the automated microscope system to position the lens above specific points on the glass. From there, the system slowly changes its focal plane, moving through the stack and capturing images that include different layers of voxels.

To interpret these microscope images, Microsoft used a convolutional neural network that combines data from images that are both in and near the plane of focus for a given layer of voxels. This is effective because the influence of nearby voxels changes how a given voxel appears in a subtle way that the AI system can pick up on if given enough training data.
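
For readers curious what such a decoder might look like, here is a minimal, hypothetical sketch in PyTorch. It is not the network described in the paper: the patch size, the number of focal planes, and the number of voxel states are placeholder assumptions. It only illustrates the basic idea of feeding a stack of focal-plane images into a small convolutional network that classifies each voxel’s state.

```python
# Hypothetical sketch only: a small CNN that classifies a voxel's state from a
# stack of focal-plane images. The sizes below are illustrative assumptions,
# not the architecture from the Nature paper.
import torch
import torch.nn as nn

class VoxelDecoder(nn.Module):
    def __init__(self, focal_planes: int = 5, num_states: int = 4):
        super().__init__()
        # Treat the focal planes as input channels so the network can combine
        # information from images above and below the layer of interest.
        self.features = nn.Sequential(
            nn.Conv2d(focal_planes, 32, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),  # collapse the spatial dimensions
        )
        self.classifier = nn.Linear(64, num_states)

    def forward(self, patches: torch.Tensor) -> torch.Tensor:
        # patches: (batch, focal_planes, height, width)
        return self.classifier(self.features(patches).flatten(1))

# A batch of eight 16x16-pixel patches, each with five focal planes.
decoder = VoxelDecoder()
print(decoder(torch.randn(8, 5, 16, 16)).shape)  # torch.Size([8, 4])
```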

The final piece of the puzzle is data encoding. The Silica system takes the raw bitstream of the data it’s storing and adds error correction using a low-density parity-check code (the same error correction used in 5G networks). Neighboring bits are then combined to create symbols that take advantage of the voxels’ ability to store more than one bit. Once a stream of symbols is made, it’s ready to be written to glass.
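
As a rough illustration of that last step (with made-up parameters, and with the LDPC stage assumed to have already run), packing an error-corrected bitstream into multi-bit symbols could look something like this:

```python
# Illustrative sketch only, not Microsoft's encoder: pack an already
# error-corrected bitstream into 2-bit symbols, one per four-state voxel.
def bits_to_symbols(bits, bits_per_voxel=2):
    """Group consecutive bits into integer symbols in the range 0..2**bits_per_voxel - 1."""
    remainder = len(bits) % bits_per_voxel
    if remainder:
        bits = bits + [0] * (bits_per_voxel - remainder)  # pad the final symbol
    symbols = []
    for i in range(0, len(bits), bits_per_voxel):
        value = 0
        for bit in bits[i:i + bits_per_voxel]:
            value = (value << 1) | bit
        symbols.append(value)
    return symbols

# Eight error-corrected bits become four voxel symbols.
print(bits_to_symbols([1, 0, 1, 1, 0, 0, 1, 0]))  # [2, 3, 0, 2]
```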

Performance

Writing remains a bottleneck in the system, so Microsoft developed hardware that can write a single glass slab with four lasers simultaneously without generating too much heat. That is enough to enable writing at 66 megabits per second, and the team behind the work thinks that it would be possible to add up to a dozen additional lasers. That may be needed, given that it’s possible to store up to 4.84 TB in a single slab of glass (the slabs are 12 cm x 12 cm and 0.2 cm thick). That works out to be over 150 hours to fully write a slab.
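
Those figures check out with some back-of-the-envelope arithmetic, assuming decimal terabytes:

```python
# Rough write-time check, assuming decimal units (1 TB = 10**12 bytes).
capacity_bits = 4.84e12 * 8        # 4.84 TB slab, expressed in bits
write_rate_bps = 66e6              # 66 megabits per second
hours = capacity_bits / write_rate_bps / 3600
print(f"{hours:.0f} hours to fill a slab")  # ~163 hours, i.e. "over 150 hours"
```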

The “up to” aspect of the storage system has to do with the density of data that’s possible with the two different ways of writing data. The method that relies on birefringence requires more optical hardware and only works in high-quality glasses, but can squeeze more voxels into the same volume, and so has a considerably higher data density. The alternative approach can only put a bit over two terabytes into the same slab of glass, but can be done with simpler hardware and can work on any sort of transparent material.

The fused silica glass offers extreme stability; Microsoft’s accelerated aging experiments suggest the data would be stable for over 10,000 years at room temperature. That led Microsoft to declare, “Our results demonstrate that Silica could become the archival storage solution for the digital age.”

That may be overselling it just a bit. The Square Kilometer Array telescope, for example, is expected to need to archive 700 petabytes of data each year. That would mean over 140,000 glass slabs would be needed to store the data from this one telescope. Even assuming that the write speed could be boosted by adding significantly more lasers, you’d need over 600 Silica machines operating in parallel to keep up. And the Square Kilometer Array is far from the only project generating enormous amounts of data.
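
Those estimates come from the same sort of arithmetic; the four-fold write-speed boost below is an assumption based on scaling linearly from 4 to 16 lasers:

```python
# Back-of-the-envelope check on the Square Kilometer Array example.
yearly_bits = 700e15 * 8                    # 700 PB per year, in bits
slab_bits = 4.84e12 * 8                     # 4.84 TB per slab, in bits
print(f"{yearly_bits / slab_bits:,.0f} slabs per year")        # ~144,600

seconds_per_year = 365 * 24 * 3600
boosted_rate = 66e6 * 4                     # assume 16 lasers ~= 4x the 4-laser speed
machines = yearly_bits / seconds_per_year / boosted_rate
print(f"{machines:.0f} machines writing in parallel")          # ~670
```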

That said, there are some features that make Silica a great match for this sort of thing, most notably the complete absence of energy needed to preserve the data, and the fact that it can be retrieved rapidly if needed (a sharp contrast to the days needed to retrieve information from DNA, for example). Plus, I’m admittedly drawn to a system with a storage medium that looks like something right out of science fiction.

Nature, 2026. DOI: 10.1038/s41586-025-10042-w (About DOIs).


John is Ars Technica’s science editor. He has a Bachelor of Arts in Biochemistry from Columbia University, and a Ph.D. in Molecular and Cell Biology from the University of California, Berkeley. When physically separated from his keyboard, he tends to seek out a bicycle, or a scenic location for communing with his hiking boots.



Researchers show that training on “junk data” can lead to LLM “brain rot”

On the surface, it seems obvious that training an LLM with “high quality” data will lead to better performance than feeding it any old “low quality” junk you can find. Now, a group of researchers is attempting to quantify just how much this kind of low-quality data can cause an LLM to experience effects akin to human “brain rot.”

For a pre-print paper published this month, the researchers from Texas A&M, the University of Texas, and Purdue University drew inspiration from existing research showing how humans who consume “large volumes of trivial and unchallenging online content” can develop problems with attention, memory, and social cognition. That led them to what they’re calling the “LLM brain rot hypothesis,” summed up as the idea that “continual pre-training on junk web text induces lasting cognitive decline in LLMs.”

Figuring out what counts as “junk web text” and what counts as “quality content” is far from a simple or fully objective process, of course. But the researchers used a few different metrics to tease a “junk dataset” and a “control dataset” out of HuggingFace’s corpus of 100 million tweets.

Since brain rot in humans is “a consequence of Internet addiction,” they write, junk tweets should be ones “that can maximize users’ engagement in a trivial manner.” As such, the researchers created one “junk” dataset by collecting tweets with high engagement numbers (likes, retweets, replies, and quotes) and shorter lengths, figuring that “more popular but shorter tweets will be considered to be junk data.”
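
As a rough sketch of what that first filter might look like in code (the field names and thresholds here are hypothetical, not the paper’s actual cutoffs):

```python
# Hypothetical sketch of the engagement-plus-length heuristic described above.
# Field names and thresholds are illustrative, not the paper's actual cutoffs.
def is_junk(tweet, max_words=30, min_engagement=500):
    """Short but highly engaging tweets go into the 'junk' dataset."""
    engagement = (tweet["likes"] + tweet["retweets"]
                  + tweet["replies"] + tweet["quotes"])
    return len(tweet["text"].split()) <= max_words and engagement >= min_engagement

tweets = [
    {"text": "you will NOT believe what just happened",
     "likes": 900, "retweets": 300, "replies": 50, "quotes": 20},
    {"text": "a long, sourced thread working through measurement error in survey data",
     "likes": 12, "retweets": 1, "replies": 3, "quotes": 0},
]
junk = [t for t in tweets if is_junk(t)]
control = [t for t in tweets if not is_junk(t)]
print(len(junk), len(control))  # 1 1
```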

For a second “junk” metric, the researchers drew from marketing research to define the “semantic quality” of the tweets themselves. Using a complex GPT-4o prompt, they sought to pull out tweets that focused on “superficial topics (like conspiracy theories, exaggerated claims, unsupported assertions or superficial lifestyle content)” or that had an “attention-drawing style (such as sensationalized headlines using clickbait language or excessive trigger words).” A random sample of these LLM-based classifications was spot-checked against evaluations from three graduate students with a 76 percent matching rate.



Reddit cashes in on AI gold rush with $203M in LLM training license fees

Your posts are the product

Two- to three-year deals with Google and others come amid legal uncertainty over “fair use.”

“Reddit Gold” takes on a whole new meaning when AI training data is involved.

Last week, word leaked that Google had agreed to license Reddit’s massive corpus of billions of posts and comments to help train its large language models. Now, in a recent Securities and Exchange Commission filing, the popular online forum has revealed that it will bring in $203 million from that and other unspecified AI data licensing contracts over the next two to three years.

Reddit’s Form S-1—published by the SEC late Thursday ahead of the site’s planned stock IPO—says the company expects $66.4 million of that AI licensing revenue to come in during the 2024 calendar year. Bloomberg previously reported the Google deal to be worth an estimated $60 million a year, suggesting that the Google deal represents the vast majority of Reddit’s AI licensing revenue so far.

Google and other AI companies that license Reddit’s data will receive “continuous access to [Reddit’s] data API as well as quarterly transfers of Reddit data over the term of the arrangement,” according to the filing. That constant, real-time access is particularly valuable, the site writes in the filing, because “Reddit data constantly grows and regenerates as users come and interact with their communities and each other.”

“Why pay for the cow…?”

While Reddit sees data licensing to AI firms as an important part of its financial future, its filing also notes that free use of its data has already been “a foundational part of how many of the leading large language models have been trained.” The filing seems almost bitter in noting that “some companies have constructed very large commercial language models using Reddit data without entering into a license agreement with us.”

That acknowledgment highlights the still-murky legal landscape over AI companies’ penchant for scraping huge swathes of the public web for training purposes, a practice those companies defend as fair use. And Reddit seems well aware that AI models may continue to hoover up its posts and comments for free, even as it tries to sell that data to others.

“Some companies may decline to license Reddit data and use such data without license given its open nature, even if in violation of the legal terms governing our services,” the company writes. “While we plan to vigorously enforce against such entities, such enforcement activities could take years to resolve, result in substantial expense, and divert management’s attention and other resources, and we may not ultimately be successful.”

Yet the mere existence of AI data licensing agreements like Reddit’s may influence how legal battles over this kind of data scraping play out. As Ars’ Timothy Lee and James Grimmelmann noted in a recent legal analysis, the establishment of a settled licensing market can have a huge impact on whether courts consider a novel use of digitized data to be “fair use” under copyright law.

“The more [AI data licensing] deals like this are signed in the coming months, the easier it will be for the plaintiffs to argue that the ‘effect on the market’ prong of fair use analysis should take this licensing market into account,” Lee and Grimmelmann wrote.

And while Reddit sees LLMs as a new revenue opportunity, the site also sees their popularity as a potential threat. The S-1 filing notes that “some users are also turning to LLMs such as ChatGPT, Gemini, and Anthropic” for seeking information, putting them in the same category of Reddit competition as “Google, Amazon, YouTube, Wikipedia, X, and other news sites.”

Reddit first filed for its IPO in late 2021, and reports suggest it is now aiming to officially hit the stock market next month. The company will offer users and moderators with sufficient karma and/or activity on the site the opportunity to participate in that IPO through a directed share program.

Advance Publications, which owns Ars Technica parent Condé Nast, is the largest shareholder of Reddit.



Quest 2 is Vastly Outselling Quest 3 so Far This Holiday on Amazon

With such an alluring price point on Quest 2 during the Black Friday period, it makes sense that the headset would outsell Quest 3. But what will it mean for the company’s effort to make mixed reality the main selling proposition of its headsets?

Twitter user JustDaven pointed out that Amazon reveals some coarse sales figures in certain cases, including for Quest 2 and Quest 3. We thought it would be interesting to look at all of the major Amazon territories where Quests are sold to find out what the numbers look like.

Across all major Amazon territories (just one of many places where the headset is sold), we found that Meta has sold some 240,000 Quest headsets. What’s more interesting than the raw number however is that Quest 2 is outselling Quest 3 nearly 3:1.

Even though Quest 3 is the hot new model that’s getting all the marketing, it’s not hard to see how this happened.

The Quest 2 had a pretty stellar Black Friday discount with a sticker price of $250, including a $50 gift card (pricing it effectively at $200). Compare that to the lowest sticker price for Quest 3 which was $500, including a $15 gift card and a copy of Asgard’s Wrath 2 (pricing it effectively at $425).

Considering the Black Friday sticker prices ($250 vs. $500), people will naturally ask: “At twice the price of Quest 2, is Quest 3 twice as good?”

What It Means

In any case, the cheaper headset appears to be the clear winner so far this holiday season. But what does this mean for Meta—which has been trying to pivot from pure VR to mixed reality with its last two headsets?

Demeo Mixed Reality mode | Image courtesy Meta

Meta has pushed mixed reality as the primary use-case for both the Quest Pro and Quest 3. But while developers still need time to build killer apps and use-cases for mixed reality, a fresh surge of users is about to hit Quest 2—a headset that only barely supports mixed reality experiences, with a grainy black & white view.

This creates a difficult decision for developers: build for the new-fangled headsets with their greater power, better visuals, and much improved mixed reality capabilities? Or cater to the much larger audience of Quest 2 users?

This is of course always the case when game developers need to choose when to shift their focus to a next-gen game console. But this is different.

Between PS4 and PS5, for instance, there is no difference between the consoles that compares to the gap in mixed reality capabilities between Quest 2 and Quest 3. For PS4 and PS5, it’s comparatively easy for developers to build a single game and tune it to run well on both systems.

That’s arguably the same case for Quest 2 to Quest 3, but only if we’re talking about pure VR apps.

But a great mixed reality game built for Quest 3 is really going to struggle to provide a good experience on Quest 2; not only because of the lower resolution and black & white passthrough view, but also because of Quest 2’s lack of a depth sensor—a critical component for creating reasonably accurate maps of the player’s environment to truly mix the virtual and real worlds.

Quest 2 is already three years old. That’s not long for a typical console generation, but it is in the much faster-moving landscape of standalone VR headsets.

A new surge of users for the last-gen headset will inevitably slow the transition to the next generation. That means developers will stay focused on the broader Quest 2 audience for a longer period, leaving Quest Pro and Quest 3 with less content that truly takes advantage of their main differentiator: higher-quality mixed reality.


Ever since Quest Pro, Meta has focused its Quest marketing very heavily on mixed reality, giving customers a sense that there’s lots of great mixed reality content for the devices. But that’s far from the truth as things stand today. Mixed reality games and apps are still barely gestating, with most simply attaching a passthrough background to an existing game. Sure, that might make those games better in some cases, but it doesn’t really make use of the headsets’ mixed reality capabilities.

So while Meta would apparently like to see developers accelerate their transition to Quest Pro and Quest 3’s unique capabilities, the market is incentivizing them to decelerate that transition. That puts the platform and its developers at odds, with customers stuck somewhere in the twilight zone between.



70% of the 20 Best-rated Quest 2 Apps are Now Available on Pico 4

The standalone VR market is continuing to grow, and with it, we’re increasingly seeing platform competition for quality content. Pico made its biggest push into consumer VR so far with the launch of the Pico 4 last year, and the company has been gaining ground on getting top VR content onto its store.

Top Quest Apps Showing up on Pico 4

Looking at the 20 best-rated apps on the Quest store (data as of April 2023), to date 70% of the list is available on Pico’s standalone headset:

Title Pico 4 Quest 2
Moss: Book II
The Room VR: A Dark Matter
Puzzling Places
Walkabout Mini Golf
I Expect You To Die 2
Breachers
COMPOUND
Vermillion
Swarm
DYSCHRONIA: Chronos Alternate
PatchWorld – Make Music Worlds
I Expect You To Die
Moss
Red Matter 2
ARK and ADE
Ragnarock
Cubism
Ancient Dungeon
Into the Radius
The Last Clockwinder

Another way of looking at Pico’s content traction is by the 20 most-rated apps on the Quest store. Breaking it down that way (data as of April 2023), 50% of the list is now available on Pico.

Title Pico 4 Quest 2
Beat Saber
Blade & Sorcery: Nomad
The Walking Dead: Saints & Sinners
SUPERHOT VR
GOLF+
BONELAB
Vader Immortal: Episode I
Onward
Job Simulator
The Room VR: A Dark Matter
Five Nights at Freddy’s: Help Wanted
Resident Evil 4
The Thrill of the Fight
Walkabout Mini Golf
Pistol Whip
Eleven Table Tennis
GORN
Virtual Desktop
Vader Immortal: Episode III
A Township Tale

Building good VR hardware is really just half the battle when it comes to being a serious player in the industry. The other half is getting compelling content onto the headset.

While Quest 2 still has a considerably larger library of apps and several big standalone exclusives (like Beat Saber), Pico looks to be doing a pretty good job so far in its push to legitimize its platform by making sure that some of the top VR content is available for its customers.

And there’s likely more to come. The company has yet to launch its latest Pico 4 headset in the US, which is a major VR market for both customers and developers. Without the US market in play, there’s less incentive for VR developers to bring their apps to Pico. But if Pico finally launches its headset in the US, it could be the nudge needed for more top VR content to make the leap to the store.


Special thanks to @CkYLee for helping to confirm title availability on the Pico store



Meta Has Sold Nearly 20 Million Quest Headsets, But Retention Struggles Remain

Meta has sold nearly 20 million Quest headsets, but the company continues to struggle with keeping customers using VR.

According to a report by The Verge, citing an internal Meta presentation held today, the company has sold nearly 20 million Quest headsets. This likely includes Quest 1, Quest 2, and Quest Pro, though by all accounts Quest 2 appears to make up the vast majority. And while the figure wasn’t publicly announced, this would be the first official confirmation of Quest unit sales from the company.

This info was shared by Mark Rabkin, Meta’s VP of VR, during an internal presentation to “thousands” of employees, according to The Verge.

And while the 20 million unit Quest sales figure is impressive—and well beyond any other single VR headset maker—Rabkin went on to stress that the company has to do a better job at keeping customers using the headsets well after their purchase.

“We need to be better at growth and retention and resurrection,” he said. “We need to be better at social and actually make those things more reliable, more intuitive so people can count on it.”

Curiously, Meta’s latest wave of headset customers is less enthusiastic than those who bought in early.

“Right now, we’re on our third year of Quest 2,” Rabkin said, according to The Verge. “And sadly, the newer cohorts that are coming in—the people who bought it this last Christmas—they’re just not as into it [or engaged as] the ones who bought it early.”

The report from The Verge includes more info about the company’s XR roadmap, which you can read in full here.
