
Scientists once hoarded pre-nuclear steel; now we’re hoarding pre-AI content

A time capsule of human expression

John Graham-Cumming is no stranger to tech preservation efforts. He’s a British software engineer and writer best known for creating POPFile, an open source email spam-filtering program, and for successfully petitioning the UK government to apologize for its persecution of codebreaker Alan Turing—an apology that Prime Minister Gordon Brown issued in 2009.

As it turns out, his pre-AI website isn’t new, but it has languished unannounced until now. “I created it back in March 2023 as a clearinghouse for online resources that hadn’t been contaminated with AI-generated content,” he wrote on his blog.

The website points to several major archives of pre-AI content, including a Wikipedia dump from August 2022 (before ChatGPT’s November 2022 release), Project Gutenberg’s collection of public domain books, the Library of Congress photo archive, and GitHub’s Arctic Code Vault—a snapshot of open source code deposited in a decommissioned coal mine on the Arctic archipelago of Svalbard in February 2020. The wordfreq project appears on the list as well, flash-frozen from a time before AI contamination made its methodology untenable.

The site accepts submissions of other pre-AI content sources through its Tumblr page. Graham-Cumming emphasizes that the project aims to document human creativity from before the AI era, not to make a statement against AI itself. As atmospheric nuclear testing ended and background radiation returned to natural levels, low-background steel eventually became unnecessary for most uses. Whether pre-AI content will follow a similar trajectory remains a question.

Still, it feels reasonable to protect sources of human creativity now, including archival ones, because these repositories may become useful in ways that few appreciate at the moment. For example, in 2020, I proposed creating a so-called “cryptographic ark”—a timestamped archive of pre-AI media that future historians could verify as authentic, collected before my then-arbitrary cutoff date of January 1, 2022. AI slop pollutes more than the current discourse—it could cloud the historical record as well.
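As a sketch of what such an ark could involve: the core ingredients are content hashes plus an independent timestamp, so anyone can later prove a file existed, unaltered, before a given date. Here is a minimal Python illustration; the directory name and manifest layout are hypothetical, not part of the original proposal:

    import hashlib
    import json
    import pathlib
    import time

    def build_manifest(archive_dir):
        """Record a SHA-256 digest for every file in the archive."""
        files = {}
        for path in sorted(pathlib.Path(archive_dir).rglob("*")):
            if path.is_file():
                files[str(path)] = hashlib.sha256(path.read_bytes()).hexdigest()
        return {
            "created_utc": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
            "algorithm": "sha256",
            "files": files,  # hypothetical manifest layout
        }

    # "pre_ai_archive" is a placeholder directory of collected pre-AI media.
    manifest = build_manifest("pre_ai_archive")
    print(json.dumps(manifest, indent=2))

Publishing the manifest’s own hash somewhere independently dated (a newspaper, an RFC 3161 timestamping authority, or a service like OpenTimestamps) is what would make the archive verifiable rather than merely asserted.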

For now, lowbackgroundsteel.ai stands as a modest catalog of human expression from what may someday be seen as the last pre-AI era. It’s a digital archaeology project marking the boundary between human-generated and hybrid human-AI cultures. In an age where distinguishing between human and machine output grows increasingly difficult, these archives may prove valuable for understanding how human communication evolved before AI entered the chat.


Beyond RGB: A new image file format efficiently stores invisible light data

Importantly, the method then applies a weighting step, dividing the higher-frequency spectral coefficients by the overall brightness (the DC component) so that less important data can be compressed more aggressively. The prepared data is then fed into the codec: rather than inventing a completely new file type, the method uses the compression engine and features of the standardized JPEG XL image format to store the specially prepared spectral data.
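The paper’s exact transform isn’t spelled out here, but a minimal NumPy sketch of the general idea might look like the following, assuming a per-pixel orthonormal DCT along the wavelength axis and treating the first coefficient as the DC brightness term (the array names and the eps guard are illustrative, not taken from the paper):

    import numpy as np
    from scipy.fft import dct, idct

    # Toy spectral image: height x width x n_bands reflectance samples.
    h, w, n_bands = 4, 4, 16
    rng = np.random.default_rng(0)
    spectral = rng.uniform(0.0, 1.0, size=(h, w, n_bands))

    # Decorrelate along the wavelength axis with an orthonormal DCT,
    # concentrating most of the spectral energy in the first (DC) coefficient.
    coeffs = dct(spectral, type=2, norm="ortho", axis=-1)

    # The weighting step: divide the higher-frequency (AC) coefficients by the
    # DC brightness so they become small relative values that tolerate
    # aggressive quantization.
    eps = 1e-8  # guard against division by zero (an assumption of this sketch)
    dc = coeffs[..., :1]
    ac_weighted = coeffs[..., 1:] / (dc + eps)

    # The DC plane and the weighted AC planes would then be handed to a
    # JPEG XL encoder as separate image channels. Before any quantization,
    # the transform round-trips exactly:
    restored = idct(np.concatenate([dc, ac_weighted * (dc + eps)], axis=-1),
                    type=2, norm="ortho", axis=-1)
    assert np.allclose(restored, spectral)

In the actual format, it is the aggressive quantization of those weighted AC planes inside JPEG XL that would produce the real savings; the sketch above only shows the preparation step.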

Making spectral images easier to work with

According to the researchers, the massive file sizes of spectral images have been a real barrier to adoption in industries that would benefit from their accuracy. Smaller files mean faster transfer times, reduced storage costs, and the ability to work with these images more interactively without specialized hardware.

The results reported by the researchers seem impressive—with their technique, spectral image files shrink by 10 to 60 times compared to standard OpenEXR lossless compression, bringing them down to sizes comparable to regular high-quality photos. They also preserve key OpenEXR features like metadata and high dynamic range support.

While some information is sacrificed in the compression process—making this a “lossy” format—the researchers designed it to discard the least noticeable details first, concentrating compression artifacts in the high-frequency spectral details that matter least in order to preserve the visual information that matters most.

Of course, there are some limitations. Translating these research results into widespread practical use hinges on the continued development and refinement of the software tools that handle JPEG XL encoding and decoding. As with many cutting-edge formats, the initial software implementations may need further development before every feature is fully supported. It’s a work in progress.

And while Spectral JPEG XL dramatically reduces file sizes, its lossy approach may be a drawback for some scientific applications. Some researchers working with spectral data might readily accept the trade-off for the practical benefits of smaller files and faster processing, while others handling particularly sensitive measurements might need to seek alternative storage methods.

For now, the new technique remains primarily of interest to specialized fields like scientific visualization and high-end rendering. However, as industries from automotive design to medical imaging continue generating larger spectral datasets, compression techniques like this could help make those massive files more practical to work with.
