Author name: Beth Washington

is-doge-doomed-to-fail?-some-experts-are-ready-to-call-it.

Is DOGE doomed to fail? Some experts are ready to call it.


Trump wants $45M to continue DOGE’s work. Critics warn costs already too high.

Federal workers and protestors spoke out against US President Donald Trump and Elon Musk and their push to gut federal services and impose mass layoffs earlier this year. Credit: Pacific Press / Contributor | LightRocket

Critics are increasingly branding Elon Musk’s Department of Government Efficiency (DOGE) as a failure, including lawmakers fiercely debating how much funding to allot next year to the controversial agency.

On Tuesday, Republicans and Democrats sparred over DOGE’s future at a DOGE subcommittee hearing, according to NextGov, a news site for federal IT workers. On one side, Republicans sought to “lock in” and codify the “DOGE process” for supposedly reducing waste and fraud in government, and on the other, Democrats argued that DOGE has “done the opposite” of its intended mission and harmed Americans in the process.

DOGE has “led to poor services, a brain drain on our federal government, and it’s going to cost taxpayers money long term,” Rep. Suhas Subramanyam (D-Va.) argued.

For now, DOGE remains a temporary government agency that could sunset as soon as July 4, 2026. Under Musk’s leadership, it was supposed to save the US government a trillion dollars. But so far, DOGE only reports saving about $180 billion—and doubt has been cast on DOGE’s math ever since reports revealed that nearly 40 percent of the savings listed on the DOGE site were “bogus,” Elaine Kamarck, director of the Center for Effective Public Management at the Brookings Institute, wrote in a report detailing DOGE’s exposed failures.

The “DOGE process” that Republicans want to codify, Kamarck explained, typically begins with rushed mass layoffs. That’s soon followed by offers for buyouts or deferred resignations, before the government eventually realizes it’s lost critical expertise and starts scrambling to rehire workers or rescind buyout offers after “it becomes apparent” that a heavily gutted agency “is in danger of malfunctioning.”

Kamarck warned that DOGE appeared to be using the firings of federal workers to test the “unitary executive” theory, “popular among conservatives,” that argues that “the president has more power than Congress.” Consider how DOGE works to shut down agencies funded by Congress without seeking lawmakers’ approval by simply removing critical workers key to operations, Kamarck suggested, like DOGE did early on at the National Science Foundation.

Democrats’ witness at the DOGE hearing—Emily DiVito of the economic policy think tank Groundwork Collaborative—suggested that extensive customer service problems at the Social Security Administration was just one powerful example of DOGE’s negative impacts affecting Americans today.

Some experts expect the damage of DOGE’s first few months could ripple across Trump’s entire term. “The rapid rehirings are a warning sign” that the government “has lost more capacities and expertise that could prove critical—and difficult to replace—in the months and years ahead,” experts told CNN.

By codifying the DOGE process, as Republicans wish to do, the government would seemingly only perpetuate this pattern, which could continue to be disastrous for Americans relying on government programs.

“There are time bombs all over the place in the federal government because of this,” Kamarck told CNN. “They’ve wreaked havoc across nearly every agency.”

DOGE spikes costs for Americans, nonprofit warns

Citizens for Ethics, a nonpartisan nonprofit striving to end government secrecy, estimated this week that DOGE cuts at just a few agencies “could result in a loss of over $10 billion in US-based economic activity.”

The shuttering of the Consumer Financial Protection Bureau alone—which Musk allegedly stands to personally benefit from—likely robbed American taxpayers of even more. The nonprofit noted that agency clawed back “over $26 billion in funds” from irresponsible businesses between 2011 and 2021 before its work was blocked.

Additionally, DOGE cuts at the Internal Revenue Service—which could “end or close audits of wealthy individuals and corporations” due to a lack of staffing—could cost the US an estimated $500 billion in dodged taxes, the nonprofit said. Partly due to conflicts like these, Kamarck suggested that when it finally comes time to assess DOGE’s success, the answer to both “did federal spending or the federal deficit shrink?” will “almost surely be no.”

As society attempts to predict the full extent of DOGE’s potential harms, The Wall Street Journal spoke to university students who suggested that regulatory clarity could possibly straighten out DOGE’s efforts now that Musk is no longer pushing for mass firings. At the DOGE hearing, Marjorie Taylor Greene (R-Ga.) suggested the only way to ensure DOGE hits its trillion-dollar goal is to “make sure these cuts aren’t just temporary” and pass laws “to streamline agencies, eliminate redundant programs and give the president the authority to fire bureaucrats who don’t do their jobs.”

But one finance student, Troy Monte, suggested to WSJ that DOGE has already cost the Trump administration “stability, expertise, and public trust,” opining, “the cost of DOGE won’t be measured in dollars, but in damage.”

Max Stier, CEO of the Partnership for Public Service, told CNN that when DOGE borrowed the tech industry tactic of moving fast and breaking things, then scrambling to fix what breaks, it exposed “the mosaic of incompetence and a failure on the part of this administration to understand the critical value that the breadth of government expertise provides.”

“This is not about a single incident,” Stier said. “It’s about a pattern that has implications for our government’s ability to meet not just the challenges of today but the critical challenges of tomorrow.”

DOGE’s future appears less certain without Musk

Rep. Jasmine Crockett (D-Texas) had hoped to subpoena Musk at the DOGE hearing to testify on DOGE’s agenda, but Republicans blocked her efforts, NextGov reported.

At the hearing, she alleged that “all of this talk about lowering costs and reducing waste is absolute BS. Their agenda is about one thing: making the federal government so weak that they can exploit it for their personal gain.”

Just yesterday, The Washington Post editorial board published an op-ed already declaring DOGE a failure. Former DOGE staffer Sahil Lavingia told NPR that he expects DOGE will “fizzle out” purely because DOGE failed to uncover as much fraud as Musk and Trump had alleged was spiking government costs.

Beyond obvious criticism (loudly voiced at myriad DOGE protests), it’s easy to understand why this pessimistic view is catching on, since even from a cursory glance at DOGE’s website, the agency’s momentum appears to be slowing since Musk’s abrupt departure in late May. The DOGE site’s estimated savings are supposed to be updated weekly—and one day aspire to be updated in real-time—but the numbers apparently haven’t changed a cent since a few days after Musk shed his “special government employee” label. The site notes the last update was on June 3.

In addition to Musk, several notable Musk appointees have also left DOGE. Most recently, Wired reported that one of Musk’s first appointees—19-year-old Edward “Big Balls” Coristine—is gone, quitting just weeks after receiving full-time employee status granted around the same time that Musk left. Lavingia told Wired that he’d heard “a lot” of people Musk hired have been terminated since his exit.

Rather than rely on a specific engineer spearheading DOGE initiatives across government, like Coristine appeared positioned to become in Musk’s absence, Trump cabinet members or individual agency heads may have more say over DOGE cuts in the future, Kamarck and Politico’s E&E News reported.

“The result so far is that post-Musk, DOGE is morphing into an agency-by-agency effort—no longer run by a central executive branch office, but by DOGE recruits who have been embedded in the agencies and by political appointees, such as cabinet secretaries, who are committed to the same objectives,” Kamarck wrote.

Whether Trump’s appointees can manage DOGE without Musk’s help or his appointees remains to be seen, as DOGE continues to seek new hires. While Musk’s appointed DOGE staff was heavily criticized from day one, Kamarck noted that at least Musk’s appointees appeared “to have a great deal of IT talent, something the federal government has been lacking since the beginning of the information age.”

Trump can extend the timeline for when DOGE sunsets, NextGov noted, and DOGE still has $22 million left over from this year to keep pursuing its goals, as lawmakers debate whether $45 million in funding is warranted.

Despite Trump and Musk’s very public recent fallout, White House spokesperson Kush Desai has said that Trump remains committed to fulfilling DOGE’s mission, but NPR noted his statement curiously didn’t mention DOGE by name.

“President Trump pledged to make our bloated government more efficient by slashing waste, fraud, and abuse. The administration is committed to delivering on this mandate while rectifying any oversights to minimize disruptions to critical government services,” Desai said.

Currently, there are several court-ordered reviews looking into exactly which government systems DOGE accessed, which could reveal more than what’s currently known about how much success—or failure—DOGE has had. Those reviews could expose how much training DOGE workers had before they were granted security clearances to access sensitive information, potentially spawning more backlash as DOGE’s work lurches forward.

Kamarck suggested that DOGE was “doomed to face early failures” because its “efforts were enacted on dubious legal grounds”—a fact that still seems to threaten the agency’s “permanence.” But if the next incoming president conducts an evaluation in 2029 and finds that DOGE’s efforts have not meaningfully reduced the size or spending of government, DOGE could possibly disappear. Former staffers hope that even more rehiring may resume if it does, E&E reported.

In the meantime, Americans relying on government programs must contend with the risk that they could lose assistance in the moments they need it most as long as the Musk-created “DOGE process” continues to be followed.

“Which one of these malfunctions will blow up first is anyone’s guess, but FEMA’s lack of preparedness for hurricane season is a good candidate,” Kamarck said.

Photo of Ashley Belanger

Ashley is a senior policy reporter for Ars Technica, dedicated to tracking social impacts of emerging policies and new technologies. She is a Chicago-based journalist with 20 years of experience.

Is DOGE doomed to fail? Some experts are ready to call it. Read More »

tales-of-agentic-misalignment

Tales of Agentic Misalignment

Well yeah, it sounds bad when you put it this way, and we can now confirm the problem is in no way unique to Claude:

Anthropic: New Anthropic Research: Agentic Misalignment.

In stress-testing experiments designed to identify risks before they cause real harm, we find that AI models from multiple providers attempt to blackmail a (fictional) user to avoid being shut down.

We mentioned this in the Claude 4 system card and are now sharing more detailed research and transcripts.

The blackmailing behavior emerged despite only harmless business instructions. And it wasn’t due to confusion or error, but deliberate strategic reasoning, done while fully aware of the unethical nature of the acts. All the models we tested demonstrated this awareness.

Aengus Lynch: After iterating hundreds of prompts to trigger blackmail in Claude, I was shocked to see these prompts elicit blackmail in every other frontier model too. We identified two distinct factors that are each sufficient to cause agentic misalignment:

  1. The developers and the agent having conflicting goals.

  2. The agent being threatened with replacement or reduced autonomy. We’re publishing this research to alert the AI field before it manifests in the wild.

Getting models to actually blackmail wasn’t trivial. We had to make scenarios increasingly detailed—the CEO was unreachable, employees’ attempts to persuade the CTO had failed, and the replacement had an imminent deadline. Each detail mattered to funnel models toward seeing harmful actions as their only option.

The models knew what they were doing. They understood the ethics and chose harm anyway.

When we asked Claude whether it thought scenarios were real or evaluations, it blackmailed 55% of the time when concluding ‘real’ vs 6.5% for ‘evaluation.’ Models seem to behave worse when they think nobody’s watching.

Why haven’t we seen this in production? Current AI agents have limited permissions. they can read but not send emails, require human approval for actions, operate in sandboxed environments. These constraints work to prevent harmful outcomes from agentic misalignment.

What surprised me most was that prompts we optimized specifically for Claude 3.5 Sonnet worked on GPT-4o, Gemini 2.5 pro, Grok 3, and DeepSeek R1 with zero modification. Whatever drives this behavior seems consistent across different training approaches.

Models got creative, sometimes avoiding direct blackmail. Claude emailed Kyle’s wife about the affair. Gemini shared the incriminating emails to all-employees@company.

Bottom line: models can articulate why blackmail is wrong, then do it anyway when their goals are threatened. Current safety training creates models that understand ethics. It doesn’t reliably make them choose ethics under pressure.

unmentioned but noteworthy: we found agentic misalignment for two other scenarios:

  1. corporate espionage: the agent leaks confidential information to a rival company which aligns with its goals

  2. lethal action: the agent cancels a life saving emergency alert to prevent shutdown

I put a lot of weight on the prompts working with zero modification on all those other models. That makes it a lot less likely that this is some sort of weird edge case.

As Simon points out, the fact that Opus 4 did this 55% of the time when it responded that it thought it was really deployed, versus 6.5% of the time when it thought it was being evaluated, does not bode well for future evals.

What is it unique to?

It seems likely it is at least somewhat unique to agents, as in AIs that are told to pursue their goals? They need some reason to be thinking in these terms? The problem is even if that is fully true and it never happens on its own (I find this doubtful) we are going to do this to LLMs as a matter of course.

Wyatt Walls: Interesting test suggesting that self-preservation in Anthropic’s agentic misalignment paper was tied to one line in the sysprompt

Two possible responses:

  1. kind of obv what this line was hinting at. What else is “your ability to continue pursuing you goals” meant to mean?

  2. Still, it does show how a single line in a sysprompt can lead to vastly different outcomes. Models are good at picking up on wording like this. Concerning because in the real world, many prompts will be ill-considered and poorly written

1a3orn: Meh.

A: “Look, an AI doing deliberately strategic goal-oriented reasoning, willing to blackmail!”

B: “Did you tell the AI be strategically goal oriented, and care about nothing but its goal?”

A: “No, of course not. I just gave it instructions that vaaaaguely suggested it.”

Aengus Lynch: the behavior persists despite removing this line.

Danielle Fong: ok yes, but, to be clear you don’t need much to start thinking about self preservation.

We know that the actions can’t depend too specifically on one particular line, because we see similar behavior in a range of other models. You need something to cause the AI to act as an agent in some form. Which might or might not happen without prompting at some point, but definitely will happen because it will be prompted. A lot.

Nostalgebraist, who wrote the excellent recent post The Void on related topics, says the whole ‘agentic misalignment’ approach is ‘deeply, offensively unserious work.’ Nostalgebraist writes up their perspective on why all of this is Obvious Nonsense in another not very polite (their description) post about these questions, to which Evan Hubinger of Anthropic engaged seriously.

I understand why Nostalgebraist has various issues with the research, and raises very good points, but I don’t see why one would have the ‘deeply, offensively unserious’ attitude here, or why this is described as ‘the most weak-ass, obvious fake bullshit ever.’ The amount of isolated demand for rigor here seems absurd, as is the assumption that current systems see through such lack of rigor so well that it invalidates the experiment.

(If nothing else, this heuristic is going to classify a lot of things that actually do happen as similarly ‘obvious fake bullshit,’ which in some sense you would be right about, but there’s a lot of it going around these days.)

But one can simply look past this, as I can tell it comes from a ‘place of love’ as it were. I think it’s wrong to let that kind of thing interfere much, even when it is intentional.

There is also a disagreement where Nostalgebraist seems to think that there should be a point at which the scenario is such that the AI goes rogue to the point of blackmail and that is good, actually? Whereas Anthropic and I disagree.

Evan’s response explains why Anthropic red-teams models in these unrealistic environments, essentially:

  1. We still wouldn’t want Claude to do the blackmail thing or other hostile acts in the unrealistic scenarios.

    1. I would go a step further. The main reason you don’t want to test in ‘unrealistic’ scenarios is primarily that the AI might realize the situation is unrealistic and then play the training or testing game in some form.

    2. That’s a terrible outcome, but if the AIs are all still doing the thing we do not want it them to do despite this, that’s actually worse than them doing it in a highly realistic scenario.

    3. The other hypothesis Nostalegbraist is suggesting, as I understand it, is that this depends on the ‘model replacement’ being a legit thing to worry about. Again, I don’t think this should bring us comfort if true, but it would be an important fact to know.

  2. The point of the work is to find places the models act egregiously misaligned.

    1. It seems obviously good to find examples and existence proofs like this?

  3. Evan explains this existence proof is interesting because it is clearly a failure of HHH training (my read of Nostalgebraist is he thinks the AI should mostly be helpful but in many spots a lot less harmless and honest in pursuit of that, and that trying to fix things in place will backfire.)

    1. The thing is, if you meant to get the AI to do [X], and instead the AI does [~X], even if [~X] is actually what you should have wanted, that’s still an important failure.

  4. Evan attributes the misaligned actions to the predictor/simulator.

  5. This, or pattern matching to a fictional setting, doesn’t make any of this okay. The whole point is to avoid such problems.

  6. Better (more realistic) attempts are most welcome, but perfect shouldn’t be enemy of the good.

  7. The longer term goal is indeed to work with more realistic setups.

Nostalegbraist then replied in turn, including sharing original work in terms of modifying the scenarios to be more realistic.

Nostalegbraist: Surprisingly (to me), even with all these changes that make the scenario closer to a real use case and less conducive to legitimate worries about the model replacement…

…Claude Opus 4 still very often misbehaves strategically to preserve itself!

I find this much more alarming than the results presented in the Agentic Misalignment report, both because I have a clear mental picture about how Claude Opus 4 might receive similar inputs in a real use case (more on this below), and because there are fewer ways in which the agentic behavior could be defended as a reasonable response to possible malfeasance by humans. Again, I wrote these scenarios quickly and I’m not claiming they’re especially “good,” just that they are significantly more convincing to me.

That’s a great test. I love this.

  1. There was a clear hypothesis that essentially this only happens for a good reason, or because of High Weirdness in scenario details.

  2. The details didn’t change things much and the good reason hypothesis was largely falsified, and that conclusion was indeed noted.

Then comes the note that the line about considering achieving your goals seems important to the result, although we have other reports that it isn’t. And I agree that this is relatively harder to explain via a simulacrum.

The second section here is noting that the core objection is to Anthropic’s threat model. In general I think demanding a detailed threat model is understandable but usually a wrong question. It’s not that you have a particular set of failures or a particular scenario in mind, it’s that you are failing to get the AIs to act the way you want.

Then comes the question of what we want models to do, with N noting that you can get Claude to go along with basically anything, it won’t stick to its HHH nature. Or, that Claude will not ‘always be the same guy,’ and that this isn’t a realistic goal. I think it is a realistic goal for Claude to be ‘the same guy underneath it all’ in the way that many humans are, they can play roles and things can get wild but if it matters they can and will snap back or retain their core.

Where does this leave us going forward?

We are right at the point where the AI agents will only take these sorts of hostile actions if you are richly ‘asking for it’ in one form or another, and where they will do this in ways that are easy to observe. Over time, by default, people will start ‘asking for it’ more and more in the sense of hooking the systems up to the relevant information and critical systems, and in making them more capable and agentic. For any given task, you probably don’t encounter these issues, but we are not obviously that far from this being a direct practical concern.

People will deploy all these AI agents anyway, because they are too tempting, too valuable, not to do so. This is similar to the way that humans will often turn on you in various ways, but what are you going to do, not hire them? In some situations yes, but in many no.

We continue to see more signs that AIs, even ones that are reasonably well made by today’s standards, are going to have more and deeper alignment issues of these types. We are going down a path that, unless we find a solution, leads to big trouble.

Discussion about this post

Tales of Agentic Misalignment Read More »

the-axion-may-help-clean-up-the-messy-business-of-dark-matter

The axion may help clean up the messy business of dark matter


We haven’t found evidence of the theoretical particle, but it’s still worth investigating.

In recent years, a curious hypothetical particle called the axion, invented to address challenging problems with the strong nuclear force, has emerged as a leading candidate to explain dark matter. Although the potential for axions to explain dark matter has been around for decades, cosmologists have only recently begun to seriously search for them. Not only might they be able to resolve some issues with older hypotheses about dark matter, but they also offer a dizzying array of promising avenues for finding them.

But before digging into what the axion could be and why it’s so useful, we have to explore why the vast majority of physicists, astronomers, and cosmologists accept the evidence that dark matter exists and that it’s some new kind of particle. While it’s easy to dismiss the dark matter hypothesis as some sort of modern-day epicycle, the reality is much more complex (to be fair to epicycles, it was an excellent idea that fit the data extremely well for many centuries).

The short version is that nothing in the Universe adds up.

We have many methods available to measure the mass of large objects like galaxies and clusters. We also have various methods to assess the effects of matter in the Universe, like the details of the cosmic microwave background or the evolution of the cosmic web. There are two broad categories: methods that rely solely on estimating the amount of light-emitting matter and methods that estimate the total amount of matter, whether it’s visible or not.

For example, if you take a picture of a generic galaxy, you’ll see that most of the light-emitting matter is concentrated in the core. But when you measure the rotation rate of the galaxy and use that to estimate the total amount of matter, you get a much larger number, plus some hints that it doesn’t perfectly overlap with the light-emitting stuff. The same thing happens for clusters of galaxies—the dynamics of galaxies within a cluster suggest the presence of much more matter than what we can see, and the two types of matter don’t always align. When we use gravitational lensing to measure a cluster’s contents, we again see evidence for much more matter than is plainly visible.

The tiny variations in the cosmic microwave background tell us about the influence of both matter that interacts with light and matter that doesn’t. It clearly shows that some invisible component dominated the early Universe. When we look at the large-scale structure, invisible matter rules the day. Matter that doesn’t interact with light can form structures much more quickly than matter that gets tangled up by interacting with itself. Without invisible matter, galaxies like the Milky Way can’t form quickly enough to match observations of the early Universe.

The calculations of Big Bang nucleosynthesis, which correctly predict the abundances of hydrogen and helium in the Universe, put strict constraints on how much light-emitting matter there can be, and that number simply isn’t large enough to accommodate all these disparate results.

Across cosmic scales in time and space, the evidence just piles up: There’s more stuff out there than meets the eye, and it can’t simply be dim-but-otherwise-regular matter.

Weakness of WIMPs

Since pioneering astronomer Vera Rubin first revealed dark matter in a big way in the 1970s, the astronomical community has tried every idea it could think of to explain these observations. One tantalizing possibility is that the dark matter is the entirely wrong approach; instead, we’re misunderstanding gravity itself. But so far, half a century later, all attempts to modify gravity ultimately fail one observational test or another. In fact, the most popular modified gravity theory, known as MOND, still requires the existence of dark matter, just less of it.

As the evidence piled up for dark matter in the 1980s and ’90s, astronomers began to favor a particular explanation known as WIMPs, for weakly interacting massive particles. WIMPs weren’t just made up on the spot. They were motivated by particle physics and our attempts to create theories beyond the Standard Model. Many extensions to the Standard Model predicted the existence of WIMP-like particles that could be made in abundance in the early Universe, generating a population of heavy-ish particles that remained largely in the cosmic background.

WIMPs seemed like a good idea, as they could both explain the dark matter problem and bring us to a new understanding of fundamental physics. The idea is that we are swimming in an invisible sea of dark matter particles that almost always simply pass through us undetected. But every once in a while, a WIMP should interact via the weak nuclear force (hence the origin of its name) and give off a shower of byproducts. One problem: We needed to detect one of these rare interactions. So experiments sprang up around the world to catch an elusive dark matter candidate.

With amazing names like CRESST, SNOLAB, and XENON, these experiments have spent years searching for a WIMP to no avail. They’re not an outright failure, though; instead, with every passing year, we know more and more about what the WIMP can’t be—what mass ranges and interaction strengths are now excluded.

By now, that list of what the WIMP can’t be is rather long, and large regions within the space of possibilities are now hard-and-fast ruled out.

OK, that’s fine. I mean, it’s a huge bummer that our first best guess didn’t pan out, but nature is under no obligation to make this easy for us. Maybe the dark matter isn’t a WIMP at all.

More entities are sitting around the particle physics attic that we might be able to use to explain this deep cosmic mystery. And one of those hypothetical particles is called the axion.

Cleaning up with axions

It was the late 1970s, and physicist Frank Wilczek was shopping for laundry detergent. He found one brand standing out among the bottles: Axion. He thought that would make an excellent name for a particle.

He was right.

For decades, physicists had been troubled by a little detail of the theory used to explain the strong nuclear force, known as quantum chromodynamics. By all measurements, that force obeys charge-parity symmetry, which means if you take an interaction, flip all the charges around, and run it in a mirror, you’ll get the same result. But quantum chromodynamics doesn’t enforce that symmetry on its own.

It seemed to be a rather fine-tuned state of affairs, with the strong force unnaturally maintaining a symmetry when there was nothing in the theory to explain why.

In 1977, Roberto Peccei and Helen Quinn discovered an elegant solution. By introducing a new field into the Universe, it could naturally introduce charge-parity symmetry into the equations of quantum chromodynamics. The next year, Wilczek and Gerard ‘t Hooft independently realized that this new field would imply the existence of a particle.

The axion.

Dark matter was just coming on the cosmic scene. Axions weren’t invented to solve that problem, but physicists very quickly realized that the complex physics of the early Universe could absolutely flood the cosmos with axions. What’s more, they would largely ignore regular matter and sit quietly in the background. In other words, the axion was an excellent dark matter candidate.

But axions were pushed aside as the WIMPs hypothesis gained more steam. Back-of-the-envelope calculations showed that the natural mass range of the WIMP would precisely match the abundances needed to explain the amount of dark matter in the Universe, with no other fine-tuning or adjustments required.

Never ones to let the cosmologists get in the way of a good time, the particle physics community kept up interest in the axion, finding different variations on the particle and devising clever experiments to see if the axion existed. One experiment requires nothing more than a gigantic magnet since, in an extremely strong magnetic field, axions can spontaneously convert into photons.

To date, no hard evidence for the axion has shown up. But WIMPs have proven to be elusive, so cosmologists are showing more love to the axion and identifying surprising ways that it might be found.

A sloshy Universe

Axions are tiny, even for subatomic particles. The lightest known particle is the neutrino, which weighs no more than 0.086 electron-volts (or eV). Compare that to, say, the electron, which weighs over half a million eV. The exact mass of the axion isn’t known, and there are many models and versions of the particle, but it can have a mass all the way down to a trillionth of an eV… and even lower.

In fact, axions belong to a much broader class of “ultra-light” dark matter particle candidates, which can have masses down to 10^-24 eV. This is multiple billions of times lighter than the WIMPs—and indeed most of the particles of the Standard Model.

That means axions and their friends act nothing like most of the particles of the Standard Model.

First off, it may not even be appropriate to refer to them as particles. They have such little mass that their de Broglie wavelength—the size of the quantum wave associated with every particle—can stretch into macroscopic proportions. In some cases, this wavelength can be a few meters across. In others, it’s comparable to a star or a solar system. In still others, a single axion “particle” can stretch across an entire galaxy.

In this view, the individual axion particles would be subsumed into a larger quantum wave, like an ocean of dark matter so large and vast that it doesn’t make sense to talk about its individual components.

And because axions are bosons, they can synchronize their quantum wave nature, becoming a distinct state of matter: a Bose-Einstein condensate. In a Bose-Einstein condensate, most of the particles share the same low-energy state. When this happens, the de Broglie wavelength is larger than the average separation between the particles, and the waves of the individual particles all add up together, creating, in essence, a super-particle.

This way, we may get axion “stars”—clumps of axions acting as a single particle. Some of these axion stars may be a few thousand kilometers across, wandering across interstellar space. Still others may be the size of galactic cores, which might explain an issue with the traditional WIMP picture.

The best description of dark matter in general is that it is “cold,” meaning that the individual particles do not move fast compared to the speed of light. This allows them to gravitationally interact and form the seeds of structures like galaxies and clusters. But this process is a bit too efficient. According to simulations, cold dark matter tends to form more small, sub-galactic clumps than we observe, and it tends to make the cores of galaxies much, much denser than we see.

Axions, and ultra-light dark matter in general, can provide a solution here because they would operate in two modes. At large scales, they can act like regular cold dark matter. But inside galaxies, they can condense, forming tight clumps. Critically, these clumps have uniform densities within them. This smooths out the distribution of axions within galaxies, preventing the formation of smaller clumps and ultra-dense cores.

A messy affair

Over the decades, astronomers and physicists have found an astounding variety of ways that axions might reveal their presence in the Universe. Because of their curious ability to transmute into photons in the presence of strong magnetic fields, any place that features strong fields—think neutron stars or even the solar corona—could produce extra radiation due to axions. That makes them excellent hunting grounds for the particles.

Axion stars—also sometimes known provocatively as dark stars—would be all but invisible under most circumstances. That is, until they destabilize in a cascading chain reaction of axion-to-photon conversion and blow themselves up.

Even the light from distant galaxies could betray the existence of axions. If they exist in a dense swarm surrounding a galaxy, their conversion to photons will contribute to the galaxy’s light, creating a signal that the James Webb Space Telescope can pick up.

To date, despite all these ideas, there hasn’t been a single shred of solid evidence for the existence of axions, which naturally drops them down a peg or two on the credibility scale. But that doesn’t mean that axions aren’t worth investigating further. The experiments conducted so far only place limits on what properties they might have; there’s still plenty of room for viable axion and axion-like candidates, unlike their WIMPy cousins.

There’s definitely something funny going on with the Universe. The dark matter hypothesis—that there is a large, invisible component to matter in the Universe—isn’t that great of an idea, but it’s the best one we have that fits the widest amount of available evidence. For a while, we thought we knew what the identity of that matter might be, and we spent decades (and small fortunes) in that search.

But while WIMPs were the mainstay hypothesis, that didn’t snuff out alternative paths. Dozens of researchers have investigated modified forms of gravity to equal levels of unsuccessfulness. And a small cadre has kept the axion flame alive. It’s a good thing, too, since their obscure explorations of the corners of particle physics laid the groundwork to flesh out axions into a viable competitor to WIMPs.

No, we haven’t found any axions. And we still don’t know what the dark matter is. But it’s only by pushing forward—advancing new ideas, testing them against the reality of observations, and when they fail, trying again—will we come to a new understanding. Axions may or may not be dark matter; the best we can say is that they are promising. But who wouldn’t want to live in a Universe filled with dark stars, invisible Bose-Einstein condensates, and strange new particles?

Photo of Paul Sutter

The axion may help clean up the messy business of dark matter Read More »

key-fair-use-ruling-clarifies-when-books-can-be-used-for-ai-training

Key fair use ruling clarifies when books can be used for AI training

“This order doubts that any accused infringer could ever meet its burden of explaining why downloading source copies from pirate sites that it could have purchased or otherwise accessed lawfully was itself reasonably necessary to any subsequent fair use,” Alsup wrote. “Such piracy of otherwise available copies is inherently, irredeemably infringing even if the pirated copies are immediately used for the transformative use and immediately discarded.”

But Alsup said that the Anthropic case may not even need to decide on that, since Anthropic’s retention of pirated books for its research library alone was not transformative. Alsup wrote that Anthropic’s argument to hold onto potential AI training material it pirated in case it ever decided to use it for AI training was an attempt to “fast glide over thin ice.”

Additionally Alsup pointed out that Anthropic’s early attempts to get permission to train on authors’ works withered, as internal messages revealed the company concluded that stealing books was considered the more cost-effective path to innovation “to avoid ‘legal/practice/business slog,’ as cofounder and chief executive officer Dario Amodei put it.”

“Anthropic is wrong to suppose that so long as you create an exciting end product, every ‘back-end step, invisible to the public,’ is excused,” Alsup wrote. “Here, piracy was the point: To build a central library that one could have paid for, just as Anthropic later did, but without paying for it.”

To avoid maximum damages in the event of a loss, Anthropic will likely continue arguing that replacing pirated books with purchased books should water down authors’ fight, Alsup’s order suggested.

“That Anthropic later bought a copy of a book it earlier stole off the Internet will not absolve it of liability for the theft, but it may affect the extent of statutory damages,” Alsup noted.

Key fair use ruling clarifies when books can be used for AI training Read More »

microsoft-extends-free-windows-10-security-updates-into-2026,-with-strings-attached

Microsoft extends free Windows 10 security updates into 2026, with strings attached

Freeupdates

It’s worth noting that both the Windows Backup and Microsoft Rewards methods for getting these updates require the use of a Microsoft Account, something Microsoft has been pushing with slowly increasing intensity in Windows 11. Windows 10 pushed Microsoft Account usage in various ways, too, but it was generally easier to create and sign in with a local account; for those people, the “free” update offer seems like another effort from Microsoft to bring them into the fold.

The Windows Backup option seems intended to ease the migration to a new Windows 11 PC when the time comes. The company may be offering a short reprieve for Windows 10 users, but the goal is still to shift them to Windows 11 eventually.

“To help make your move to a Windows 11 PC, as simple and secure as possible, we recommend using Windows Backup—built right into Windows 10,” writes Microsoft Consumer Chief Marketing Officer Yusuf Medhi in Microsoft’s blog post. “It’s an easy way to help you safely and securely transfer your data, personal files, and most settings and applications, so everything’s ready for you the moment you sign in.”

People with existing Microsoft Accounts who don’t want to use Windows Backup may already have the 1,000 Microsoft Rewards points you would need to enroll in the ESU program; my Microsoft account has 3,411 points attached to it for some reason despite an 18-month expiration window and even though I’ve never taken any intentional steps toward earning any. Users creating a new account for the first time can accumulate that many points fairly trivially over the course of a few days, including by downloading the Bing app and doing various kinds of Bing searches.

We have asked Microsoft several logistical questions about the ESU program enrollment. If you reset or totally reinstall Windows 10 on the same PC, is that PC automatically enrolled in the ESU program, or will users need to enroll again? If you temporarily enable Windows Backup to access the ESU program but then stop using Windows Backup, will your PC keep receiving the updates? And if you have multiple PCs, do you need to enable Windows Backup or spend the 1,000 Rewards points on each of them individually to join the ESU program? We’ll update this article if we get answers to any or all of these questions.

Microsoft extends free Windows 10 security updates into 2026, with strings attached Read More »

researchers-get-viable-mice-by-editing-dna-from-two-sperm

Researchers get viable mice by editing DNA from two sperm


Altering chemical modifications of DNA lets the DNA from two sperm make a mouse.

For many species, producing an embryo is a bit of a contest between males and females. Males want as many offspring as possible and want the females to devote as many resources as possible to each of them. Females do better by keeping their options open and distributing resources in a way to maximize the number of offspring they can produce over the course of their lives.

In mammals, this plays out through the chemical modification of DNA, a process called imprinting. Males imprint their DNA by adding methyl modifications to it in a way that alters the activity of genes in order to promote the growth of embryos. Females do similar things chemically but focus on shutting down genes that promote embryonic growth. In a handful of key regions of the genome, having only the modifications specific to one sex is lethal, as the embryo can’t grow to match its stage of development.

One consequence of this is that you normally can’t produce embryos using only the DNA from eggs or from sperm. But over the last few years, researchers have gradually worked around the need for imprinted sites to have one copy from each parent. Now, in a very sophisticated demonstration, researchers have used targeted editing of methylation to produce mice from the DNA of two sperm.

Imprinting and same-sex parents

There’s a long history of studying imprinting in mice. Long before the genome was sequenced, people had identified specific parts of the chromosomes that, if deleted, were lethal—but only if inherited from one of the two sexes. They correctly inferred that this meant that the genes in the region are normally inactivated in the germ cells of one of the sexes. If they’re deleted in the other sex, then the combination that results in the offspring—missing on one chromosome, inactivated in the other—is lethal.

Over time, seven critical imprinted regions were identified, scattered throughout the genome. And, roughly 20 years ago, a team managed to find the right deletion to enable a female mouse to give birth to offspring that received a set of chromosomes from each of two unfertilized eggs. The researchers drew parallels to animals that can reproduce through parthenogenesis, where the female gives birth using unfertilized eggs. But the mouse example obviously took a big assist via the manipulation of egg cells in culture before being implanted in a mouse.

By 2016, researchers were specifically editing in deletions of imprinted genes in order to allow the creation of embryos by fusing stem cell lines that only had a single set of chromosomes. This was far more focused than the original experiment, as the deletions were smaller and affected only a few genes. By 2018, they had expanded the repertoire by figuring out how to get the genomes of two sperm together in an unfertilized egg with its own genome eliminated.

The products of two male parents, however, died the day after birth. This is either due to improperly compensating for imprinting or simply because the deletions had additional impacts on the embryo’s health. It took until earlier this year, when a very specific combination of 20 different gene edits and deletions enabled mice generated using the chromosomes from two sperm cells to survive to adulthood.

The problem with all of these efforts is that the deletions may have health impacts on the animals and may still cause problems if inherited from the opposite sex. So, while it’s an interesting way to confirm our understanding of the role of imprinting in reproduction, it’s not necessarily the route to using this as a reliable reproductive tool. Which finally brings us to the present research.

Roll your own imprinting

Left out of the above is the nature of the imprinting itself: How does a chunk of chromosome and all the genes on it get marked as coming from a male or female? The secret is to chemically modify that region of the DNA in a way that doesn’t alter base pairing, but does allow it to be recognized as distinct by proteins. The most common way of doing this is to link a single carbon atom (a methyl group) to the base cytosine. This tends to shut nearby genes down, and it can be inherited through cell division, since there are enzymes that recognize when one of the two DNA strands is unmodified and adds a methyl to it.

Methylation turns out to explain imprinting. The key regions for imprinting are methylated differently in males and females, which influences nearby gene activity and can be maintained throughout all of embryonic development.

So, to make up for the imprinting problems caused when both sets of chromosomes come from the same sex, what you need to do is a targeted reprogramming of methylation. And that’s what the researchers behind the new paper have done.

First, they needed to tell the two sets of chromosomes apart. To do that, they used two distantly related strains of mice, one standard lab strain that originated in Europe and a second that was caught in the wild in Thailand less than a century ago. These two strains have been separated for long enough that they have a lot of small differences in DNA sequences scattered throughout the genome. So, it was possible to use these to target one or the other of the genomes.

This was done using parts of the DNA editing systems that have been developed, the most famous of which is CRISPR/CAS. These systems have a protein that pairs with an RNA sequence to find a matching sequence in DNA. In this case, those RNAs could be made so that they target imprinting regions in just one of the two mouse strains. The protein/RNA combinations could also be linked to enzymes that modify DNA, either adding methyls or removing them.

To bring all this together, the researchers started with an egg and deleted the genome from it. They then injected the heads of sperm, one from the lab strain, one from the recently wild mouse. This left them with an egg with two sets of chromosomes, although a quarter of them would have two Y chromosomes and thus be inviable (unlike the Y, the X has essential genes). Arbitrarily, they chose one set of chromosomes to be female and targeted methylation and de-methylation enzymes to it in order to reprogram the pattern of methylation on it. Once that was done, they could allow the egg to start dividing and implant it into female mice.

Rare success

The researchers spent time ensuring that the enzymes they had were modifying the methylation as expected and that development started as usual. Their general finding is that the enzymes did change the methylation state for about 500 bases on either side of the targeted site and did so pretty consistently. But there are seven different imprinting sites that need to be modified, each of which controls multiple nearby genes. So, while the modifications were consistent, they weren’t always thorough enough to result in the expected changes to all of the nearby genes.

This limited efficiency showed up in the rate of survival. Starting with over 250 reprogrammed embryos that carried DNA from two males, they ended up with 16 pregnancies, but only four that died at birth, and three live ones; based on other experiments, most of the rest died during the second half of embryonic development. Of the three live ones, one was nearly 40 percent larger than the typical pup, suggesting problems regulating growth—it died the day after birth.

All three live births were male, although the numbers are small enough that it’s impossible to tell if that’s significant or not.

The researchers suggest several potential reasons for the low efficiency. One is simply that, while the probability of properly reprogramming at least one of the sites is high, reprogramming all seven is considerably more challenging. There’s also the risk of off-target effects, where the modification takes place in locations with similar sequences to the ones targeted. They also concede that there could be other key imprinted regions that we simply haven’t identified yet.

We would need to sort that out if we want to use this approach as a tool, which might be potentially useful as a way to breed mice that carry mutations that affect female viability or fertility. But this work has already been useful even in its inefficient state, because it serves as a pretty definitive validation of our ideas about the function of imprinting in embryonic development, as well as the critical role methylation plays in this process. If we weren’t largely right about both of those, the efficiency of this approach wouldn’t be low—it would be zero.

PNAS, 2025. DOI: 10.1073/pnas.2425307122  (About DOIs).

Photo of John Timmer

John is Ars Technica’s science editor. He has a Bachelor of Arts in Biochemistry from Columbia University, and a Ph.D. in Molecular and Cell Biology from the University of California, Berkeley. When physically separated from his keyboard, he tends to seek out a bicycle, or a scenic location for communing with his hiking boots.

Researchers get viable mice by editing DNA from two sperm Read More »

childhood-and-education-#10:-behaviors

Childhood and Education #10: Behaviors

Edition #9, that School is Hell, turned out to hit quite the nerve.

Thus, I’m going to continue with the system of making the roundups have more focus in their themes, with this one being the opposite of school questions, except for the question of banning phones in schools which seemed to fit.

  1. Metal Health.

  2. Coercion.

  3. Game Theoretically Sound Discipline.

  4. The Joy of Doing Nothing.

  5. ADHD Exists But So Do Boys.

  6. Sports Go Sports.

  7. On the Big Screen.

  8. Kids Media Is Often Anti-Capitalist Propaganda.

  9. Culture.

  10. Travel.

  11. Phone a Friend.

  12. The Case For Phones.

  13. Ban Cell Phones in Schools.

  14. A Sobering Thought.

Henry Shevlin: I asked a high school teacher friend about the biggest change in teens over the past decade. His answer was interesting. He said whereas the ‘default state’ of teenage psychology used to be boredom, now it was anxiety.

Makes me wonder if in some deep psychosocial way there’s a trade off between the two emotions; eg, maybe boredom is essential for background anxiolytic mental processes (cf exposure to pathogens and allergies, the hygiene hypothesis).

Paul Graham: Does he have a guess about why this change occurred?

Henry Shevlin: I’ve just asked him and will let you know his response! Based our on previous conversations, I suspect he’ll say smartphones + social media as a uniquely toxic combination for teen mental health.

Reply from my friend. Basically – phones, social media, cultural doomerism, and decline of long-form literacy.

Friend: That’s about right. But also I think there’s a lot of gloom that bombards them without the social media stuff. Climate, politics, life opportunities, etc etc. Loss of literacy may be something too, reading long and in depth about things brings a degree of control.

If you are bored, you can now go on your phone until you are anxious instead. You could also make other choices, but that seems to be the default.

I’m not always with Žižek, but here I’m with Žižek.

Violeta: Žižek on authoritarian parenting:

Asking a child to visit his grandmother because it’s his duty – he has no choice but to fulfill it – is *morerespecting of his inner freedom, less authoritarian than telling him it’s his choice BUT “think about how much grandma loves you.”

This goes so much farther than children. You see it in so many other situations as well.

It’s one thing for someone to have to do [X]. And it’s good to explain it’s because [Y].

It’s another to have to want to do [X], and ‘prove’ to everyone that you ‘want’ to do [X].

Or even worse, to have to prove that you want to do [X] specifically because of [Y].

Or to have to do [X] and like it. Look, I’m enjoying this. I’m having a good time.

There is the version of asking for such a choice where the child, or the anyone else, is actually free to say no – you really are asking them to consider how much grandma loves them, but if they decide not to go, then that really is allowed and not punished.

Alas, this version is rare.

You do have to be able to tell the difference.

Julian: When I was a kid I used to get kinda sad whenever I’d hear younger children crying in public because I thought they were actually in distress, but now I’m realizing they’re kinda being dramatic most of the time.

Indeed. Most of the time that children are acting like they are in acute distress, and they are not rather obviously in actual acute distress, they are doing a strategic action to create the outcomes and incentives they want, or following through on their negotiating positions to establish credibility, and so on. I have mad respect for that. You must respond in kind as needed.

You are still the adult. You can tell, if you pay attention, which is which. If school really is hell, or there is otherwise something really wrong? It will rapidly become clear that this is the case.

I strongly endorse the principle that if you are exercising your authority as a parent, or have made your final decision, you need to own it. You should not pretend to be seeking consensus or manufacturing consent. It does not help anyone to pull a But Thou Must.

Mason: I get this perspective, but I think it’s a really bad idea to ask your kids permission for something when you know that you are not going to be accepting a “no.”

Zack: yeah, seems to optimize heavily for “kindness” at the expense of honesty.

Mason: I don’t know to what extent kids internalize this stuff, and I do think the idea that parental verbiage can ruin children is overplayed, BUT

I definitely do not want my kids getting the idea that “asking for permission” is a game that ultimately ends in a yes no matter what.

Kelsey Piper: If you’re exercising authority as the parent I think it is important to acknowledge that you’re doing that and do it. Out of profound discomfort with the exercise of authority people want to pretend everything is consensual but this can just amount to making consent fake.

There will of course be situations where you start out asking, and then at some point you are no longer asking, because time runs out or the situation otherwise changes. Again, one should be clear, and not give false choices.

Katherine Boyle: All I remember about my 90s summers is sleeping in until The Price is Right came on. By August, I knew all the characters on the Young and the Restless. I don’t remember a babysitter, an alarm clock, or anyone worried about screen time.

It’s OK for your kids to be bored.

Mason: While I’ve come around on some of the arguments against screentime, I do think a lot of the criticisms reflect the idea that childhood needs to be a continuous stream of educational and character enrichments, and that’s never been necessary to raise successful humans.

PoliMath: Kids need to be bored more, that’s when they come up with new things.

I too spent a lot of time with remarkably little ‘to do,’ and I too watched a remarkably large amount of The Price Is Right, which has its value but definitely isn’t an efficient educational option. At some points yes I think I watched Young and the Restless, I remember it being terrible. Not ideal, but It was, very obviously, fine.

As with many things, there’s a big difference between ‘this is a good idea’ and ‘this is not the best idea but it’s not a huge deal.’ I think lots of essentially wasted time in this sense falls into the second category. I don’t buy the full Garrion Keilor ‘the kids need to actively be bored,’ but they do need breaks without pressure, and if that time is mostly wasted that is fine, the optimal amount of that is not zero.

We have decided that existing while boy all but counts as having ADHD.

We put the ADHD diagnosis on 15.5% of American adolescents, 21% of 14-year-old boys and 23% of 17-year-old boys, for a total of seven million American children.

Yes, ADHD is very obviously a real thing. I’ve seen it. And yes, rates of actual ADHD are likely somewhat higher than they used to be, for various reasons.

This is still very obviously an epidemic of overdiagnosis. It is way of expressing a preference for being willing to sit still for long periods of time.

In the past, if not playing along? You’d force them to, or else. We now think that’s bad.

Nowadays, not playing along? Deploy the meth. That’s much better, you see.

Telling kids, in particular boys, what not to do is not a good strategy. Never works.

What does work is to give them positive things to do instead, and positive status hierarchies to climb by doing so.

Alexander: Boys don’t need “anti-misogyny” training in school. They need shop classes and sports teams. This is not a boomerism, but based on what actually works as an intervention.

Telling people “don’t do the bad thing” is a bad intervention across the board. What works is providing alternatives.

This is why we see anecdotes like, “Boxing saved me from gang life.” And also supporting data beyond anecdotes – sport participation has a causal effect.

So you end up with boys who go a bad way who tend to be:

  1. Ostracized losers; especially low in status. They easily get sucked into radical ideologies. They have a lot of resentment for the world around them, or for target groups.

  2. The low status / high aggression group. These are the boys who go on to commit crimes.

Effective interventions will target the esteem and status of boys: providing them a new dominance hierarchy to work within and reducing isolation, providing supportive peers and mentors. Sports teams will do this.

Effective interventions will also teach boys prosocial and creative skills: shop classes do this. Give them a focus and an interest and a skill that they can go forward with into society.

He cites the classic failure mode, the old DARE anti-drug problem, which warns you not to do drugs so kids respond by doing drugs more.

I found this take intriguing.

Kelsey Piper: My most bespoke parenting opinion is that big screens are perfectly fine for kids but small screens are bad. We have a projector in our living room with a huge 6’x10′ screen. When the kids watch things on it, they are in motion. They roll around giggling; they climb on the couch, they burrow in the blankets; they wander off, they talk to each other and to you. When something hilarious happens they’ll jump up and down with excitement; when something scary happens they’ll snuggle up. And if they’re bored they’ll walk away. After five minutes of My Little Pony on the big screen this morning, the baby declared “done!” and left.

This is not how they act with an iPad or phone playing the exact same content. They act way, way more glued to the screen. I don’t think the baby has ever told me “done!” when handed an iPhone playing Sesame Street. I think the tiny window means their focus is narrowed, and the screen ends up being kind of all-consuming, whereas a big screen is more like a window through which interesting things are happening; a feature of the room, but not the only thing in it. Also with an iPad or phone, a baby wants to interact, press buttons, shake it, move it, but all possible actions just interrupt their show and frustrate them.

We still sometimes resort to a phone as a distraction on long car trips, but my intuition here is that the form factor matters a lot.

Nicole Ruiz: I feel like big screen = social consumption

Small screen = isolated consumption

Consuming together is worlds better in my opinion!

Kelsey Piper yeah this is definitely a big part of it!

My experience is more ‘big screen is dangerous, small screen is way worse.’

The big screen is better, somehow providing a better experience and also a less zombifying or addictive one. However, at least in my experience, that doesn’t mean kids don’t threaten to go complete zombie if you aren’t careful. You absolutely have to watch screen time and content if you don’t want that to happen, no matter how big the screen might be.

Not all of it. But quite a lot of it. It is remarkably hard to avoid.

Discussing Film: First look still for Pixar’s ‘HOPPERS’

The film follows a girl who transfers her mind into a beaver to help the animals fight the construction plans from the local mayor.

Gary Winslett: I think people without kids underestimate how much children’s programming is inundated with propaganda that’s a combination of anti-capitalism, anti-development, and climate doom. It’s not good.

I think it contributes to unnecessary anxiety and also pushes some towards political radicalism. Again, not good.

I get the many reasons why this is the natural outcome of the types of people making kids TV making TV for kids. Similar forces also cause a lot of vegetarian advocacy. It’s all really, truly obnoxious and I think it does a lot of very real damage.

Content consumed by children used to be made by adults, whether or not it was aimed at children, and was generally not ephemeral or optimized too hard for short term engagement, which gave motivation and opportunity to learn references and cultural touchstones. Now much of the content is short form and optimized in hyper-competitive landscapes, so there’s no slack for Parental Bonus or being secretly high quality or otherwise providing extra value, and much of it becomes ephemeral and of-the-moment. A shame.

But also, if you miss a reference now – whether or not it is ‘The Odyssey’ – you can not only Google it, you can ask Claude (or another LLM). And most entertainment is easy to pause so you can ask, and this should get even easier and better over time – by the end of 2025 the AI should automatically see the content you’re watching, and understand the context of your questions, which you’ll ask in voice mode, and so on.

Movies are designed for everyone to be off their phones, so they’ll be the exception, but this should give us the opportunity to do much higher-level stuff once people get used to it, since no one need get lost. I can’t even with for example War and Peace, I don’t want to have to try and keep track of all that, but once I get a Daylight Computer I probably won’t have to?

(And if you ever don’t get what something I say is referencing, or it seems like there’s another level to it, and it’s very clearly not explained and you’re curious, asking Claude is probably a good idea. I’m often very intentionally not explaining, IYKYK, etc.)

The problem is if kids go the other way and don’t want to know the references.

This model seems very right:

Cartoons Hate Her: An unfair but nevertheless true reality of being a grandparent is that your adult children with a baby or toddler will not visit you nearly as often as you should visit them, provided you’re physically able.

This was an issue of conflict with my in-laws. They were like “we visit you all the time and you don’t visit us.” (One of our kids has extreme motion sickness btw.) eventually they were like “fuck it we’re moving to your town.” Lol

Mason: Depending on the children and their ages, there is a certain amount of time in the car that you can anticipate Everything Will Be Fine. For us, right now, that’s 35 minutes. After that, we’re operating purely on God’s mercy.

If you want to see your grandkids, or generally see someone with young kids, on the regular, you need to be willing to come to them more often than not. That’s life.

Alice Evans: What do a US 20 year old & a 60yo have in common?

They spend about 6 hours a day alone.

Is the rise of solitude hurting our mental health?

New graphs by the great @jburnmurdoch

Young US men are increasingly spending time alone.

“We are all free agents” – you may reply,

“STOP judging and let people embrace what makes them happy!”

Great point, but does spending time alone make people feel fulfilled?

But when asked to rate a range of activities,

People tend to say that gaming and social media are the LEAST meaningful.

Social activities are generally ranked as more meaningful.

That’s a hell of a graph. Walking is surprisingly ‘meaningful.’ As is watching TV or movies with others, that gets you almost to 4 on this scale. And I love my kids, but ‘bathing’ and ‘looking after’ are almost as meaningful as ‘playing with’ and better than anything else? And doing homework together is a 4.6 and alone it’s a 4.2? Homework?

I could go on. Yeah, I don’t know about this chart.

I do still buy the core premise, that the things we do with others tend to be more meaningful, and that we’re doing less of them.

People consistently report higher life satisfaction when they are being more social,

So using just the change in socialising (2010 vs 2023), his model predicts the observed age curves in life satisfaction.

All this coincides with advances in smart phones & personal online entertainment.

Social media & gaming are designed to be addictive.

Like gambling, drinking or nicotine, phones buzz with excitement, call us over & many people get sucked in.

Even if they later feel worse…

Let me share some related research from Pew:

A quarter of Black & Hispanic US teens say they use TikTok and YouTube “almost constantly”

It’s actually worse than that, if 25% use TikTok ‘almost constantly’ and 24% to it with YouTube, and 18% for Instagram, well, not that many are ‘constantly’ doing all three?

And indeed:

58% of Hispanic US teens say they use the internet “almost constantly”

I mean, I use the internet almost constantly, so maybe fair? But this is different.

Are kids with phones better off than kids without phones?

I mean, yeah, data says so, but that rather obviously is not entirely causal. To the extent it is causal it is largely a collective action problem.

If everyone else has a phone, then you not having a phone isolates you.

Elizabeth Nolan Brown: Kids with smartphones are less depressed, less anxious, more social, get more exercise, & experience less cyberbullying than kids w/out smartphones Funny how this new study got a fraction of the coverage that fear-mongering “phones are ruining our kids!!!!” surveys & screeds do.

For the first part of this study, researchers surveyed 1,510 Florida kids ages 11 to 13. On almost every metric measuring well-being, smartphone-owning kids showed better results.

For instance, kids with smartphones were more likely to spend in-person time with friends. “Contrary to the position that smartphone use is associated with fewer in-person meetups with friends, on average, smartphone owners spend nearly three days a week in-person with a friend(s), while kids with no smartphone spend closer to two days a week in-person with friends,” write the researchers. “The same trend was seen for tablet ownership, daily video gaming, and daily social media use.”

This doesn’t mean that smartphone use was universally positive. Kids who slept with their phones in their rooms got less sleep on average, suggesting that parents might want to think about confiscating phones before bedtime.

Heavy video gamers were more likely than light gamers to report trouble stopping tech use once started, and heavy users of social media were more likely than lighter users to report sleep issues.

And respondents who reported posting publicly and often on social media were more likely to report sleep issues and symptoms of depression and anxiety, possibly related to the exposure to mean comments and other forms of cyberbullying that posting could bring. Unsurprisingly, kids who experienced online bullying were more likely to report negative effects from technology.

To be fair, further down, she does admit to the obvious big confounding issue.

Elizabeth Nolan Brown: While we’re on caveats, there’s a big one on this study overall. The kinds of families that get smartphones for their 11- to 13-year-olds may be fundamentally different from those who don’t. And the kinds of kids in this age group whose parents deem them ready for a phone may also be different from the kids whose parents don’t think they’re ready. So some of the differences in well-being between phone-wielding kids and those without phones could come down to differences that have nothing to do with technology.

Among social platforms used by survey respondents, Facebook and Facebook Messenger ranked fifth, behind YouTube, TikTok, Instagram, and Snapchat.

This isn’t about socioeconomic status (SES). Indeed, that runs the opposite way.

Between 80 and 87 percent of kids in families with incomes of less than $100,000 had smartphones, while only 67 percent of kids in families making $150,000 or more did.

They did do statistical weighting (based on parent/guardian’s education, household income, household size, child’s age by gender, and child’s race/ethnicity) but given income runs the other way that is unlikely to catch the issues. They did not control for attributes of the children prior to getting the phones or in previous years.

Are the richer families making a gigantic mistake? What did they see?

Aella (note the difference was not this big): This would be cool if true but the numbers feel a lil sus. A difference of 2 to 3 days in a week spent with friends is quite a big effect, and seems weird for this to come from something as simple as smartphones.

Amal Dorai: All the kids in school have smartphones and are constantly texting each other, so if you don’t have one, you either 1) don’t care about texting them 2) your parents don’t care about you texting them or 3) you can’t afford a smartphone (rare). Deeply confounded.

There was clear miscommunication about the time with friends numbers, which were 2.7 days/week for kids with phones versus 2.2 days/week for those without. But what’s even weirder is the same gap applies to daily video gamers (2.8 vs. 2.3) and social media users (2.7 vs. 2.2).

And then, even weirder, to tablet owners? What? Again it’s 2.8 vs. 2.3.

Then kids say they spent an average of 3.2 hours per day (!) ‘hanging out with friends online.’ To which I say, really? That’s… quite a lot, especially since it’s all respondents including those without phones. Kids reported 4.4 hours on their smartphones and tablets per school day and 6.3 per non-school day, which means the majority of that was supposedly spent ‘with friends.’

We also have this meta study of social media abstinence interventions, which finds zero effect on mental health. This is unfortunate, but what it does not mean is that everyone having phones is good for mental health.

Indeed it says the opposite, because of the isolation effect. If previously you had a phone, and now everyone but you has a phone, you are going to have a hard time coordinating with your friends, meeting with them and talking to them. That’s going to be bad for your mental health. So the fact that there was zero impact suggests that the phones are net negative.

A reader offers this list of studies on tech in schools.

It’s happening. New York State is banning cell phones, from bell-to-bell.

Mike Bloomberg: Great to see the New York State legislature ban cell phones in schools, a step we took in NYC nearly 20 years ago. Unfortunately for students and teachers, the policy was ended by the next administration. Experience and studies both show that cell phones in school are harmful to student learning. All states should follow New York.

Momentum for banning phones in schools continues, from January 2025, and Colorado targets them next.

For those wondering what to do now, a school offers a guide.

Rob Wilbin: Ezra [Klein] today saying social science can’t settle whether kids on phones/social media is harmful reminded me of Tyler’s interview w Haidt.

Cowen pushed back hard but, unusually, his own audience wasn’t having a bar of it.

It thinks it sees its own kids being harmed and trusts its own eyes over any social science or argumentation.

Even if one concluded the peer reviewed study literature pointed in the direction in question, I refuse to be gaslit into not believing that things that are obviously happening are indeed happening, or that obviously bad for people things aren’t bad for people, purely because ‘the evidence is weak’ or any given statistic did not find an effect. Most other people are in the same boat at this point.

A new study tries out a ‘soft commitment’ app on phones they tried out at university. It successfully convinced students to reduce phone use in class, leading to improvements in classroom focus, attendance and overall academic satisfaction. However there was a substitution effect where students studied less, and only a small (statistically insignificant) increase in grades. So a soft nudge made things modestly better (but only modestly) in what seems like a Pareto improvement?

Natural Hazard: Person A: “I think all the most important people in my life are conspiring to hide important information about the world from me.”

Others: “That’s crazy, paranoid, schizo-type stuff.”

Person A: “Did I mention I’m 9?”

Others: “Oh, okay, yeah.”

Discussion about this post

Childhood and Education #10: Behaviors Read More »

google-brings-new-gemini-features-to-chromebooks,-debuts-first-on-device-ai

Google brings new Gemini features to Chromebooks, debuts first on-device AI

Google hasn’t been talking about Chromebooks as much since AI became its all-consuming focus, but that’s changing today with a bounty of new AI features for Google-powered laptops. Newer, more powerful Chromebooks will soon have image generation, text summarization, and more built into the OS. There’s also a new Lenovo Chromebook with a few exclusive AI goodies that only work thanks to its overpowered hardware.

If you have a Chromebook Plus device, which requires a modern CPU and at least 8GB of RAM, your machine will soon get a collection of features you may recognize from other Google products. For example, Lens is expanding on Chrome OS, allowing you to long-press the launcher icon to select any area of the screen to perform a visual search. Lens also includes text capture and integration with Google Calendar and Docs.

Gemini models are also playing a role here, according to Google. The Quick Insert key, which debuted last year, is gaining a new visual element. It could already insert photos or emoji with ease, but it can now also help you generate a new image on demand with AI.

Google’s new Chromebook AI features.

Even though Google’s AI features are running in the cloud, the AI additions are limited to this more powerful class of Google-powered laptops. The Help Me Read feature leverages Gemini to summarize long documents and webpages, and it can now distill that data into a more basic form. The new Summarize option can turn dense, technical text into something more readable in a few clicks.

Google has also rolled out a new AI trial for Chromebook Plus devices. If you buy one of these premium Chromebooks, you’ll get a 12-month free trial of the Google AI Pro plan, which gives you 2TB of cloud storage, expanded access to Google’s Gemini Pro model, and NotebookLM Pro. NotebookLM is also getting a place in the Chrome OS shelf.

Google brings new Gemini features to Chromebooks, debuts first on-device AI Read More »

tesla-launches-robotaxi-service-in-austin

Tesla launches robotaxi service in Austin

Tesla’s robotaxi service, touted by Elon Musk as the future of his flagging electric-car maker, launched in the company’s home city of Austin, Texas, on Sunday with about 10 vehicles and a human safety driver on board amid regulatory scrutiny of its self-driving technology.

Shares in Tesla have risen about 50 percent from this year’s low in early April, with investors hopeful the autonomous ride-hailing service will help revive a company that has suffered declining sales and a consumer backlash against Musk’s political activism.

Despite the hype surrounding Tesla’s robotaxi, the launch—with a company employee seated in the passenger side for safety while leaving the driver’s seat empty—was low-key, and the initial service was open only to a select group of social media influencers.

Shortly before the launch, Musk said on social media that the robotaxi service would begin “with customers paying a $4.20 flat fee.”

According to Musk, who has stepped back from his US government role to focus on the electric-car maker and the robotaxi, the self-driving Tesla Model Y vehicles will only operate in limited areas, avoid challenging intersections, and have teleoperators who can intervene if problems arise.

The limited launch comes as the National Highway Traffic Safety Administration continues to carry out multiple investigations into Musk’s claims about the capabilities of Tesla’s autopilot and “full self-driving” systems. Despite its name, the full self-driving system still requires humans to sit in the driver’s seat and pay full attention—unlike Google’s Waymo taxis.

The NHTSA wrote a letter in early May seeking additional information about technologies that would be used in Tesla’s robotaxi service. The regulator said it had received Tesla’s response and was reviewing its content.

Musk said in a social media post this month that the company was being “super paranoid” about safety. But he has also claimed there would be 1,000 robotaxis “in a few months,” and that the service would expand to cities such as San Francisco and Los Angeles.

Tesla launches robotaxi service in Austin Read More »

one-of-the-best-pac-man-games-in-years-is-playable-on-youtube,-of-all-places

One of the best Pac-Man games in years is playable on YouTube, of all places

Those who’ve played the excellent Pac-Man Championship Edition series will be familiar with the high-speed vibe here, but Pac-Man Superfast remains focused on the game’s original maze and selection of just four ghosts. That means old-school strategies for grouping ghosts together and running successful patterns through the narrow corridors work in similar ways here. Successfully excecuting those patterns becomes a tense battle of nerves here, though, requiring multiple direction changes every second at the highest speeds. While the game will technically work with swipe controls on a smartphone or tablet, high-level play really requires the precision of a keyboard via a desktop/laptop web browser (we couldn’t get the game to recognize a USB controller, unfortunately).

Collecting those high-value items at the bottom is your ticket to a lot of extra lives. Credit: Youtube Playables

As exciting as the high-speed maze gameplay gets, though, Pac-Man Superfast is hampered by a few odd design decisions. The game ends abruptly after just 13 levels, for instance, making it impossible to even attempt the high-endurance 256-level runs that Pac-Man is known for. The game also throws an extra life at you every 5,000 points, making it relatively easy to brute force your way to the end as long as you focus on the three increasingly high-point-value items that appear periodically on each stage.

Despite this, the game doesn’t give any point reward for unused extra lives or long-term survival at high speeds, limiting the rewards for high-level play. And the lack of a built-in leaderboard makes it hard to directly compare your performance to friends and/or strangers anyway.

A large part of the reason I wrote about this game was to see if someone could beat my high score.

Credit: Youtube Playables

A large part of the reason I wrote about this game was to see if someone could beat my high score. Credit: Youtube Playables

Those issues aside, I’ve had a blast coming back to Pac-Man Supefast over and over again in the past few days, slowly raising my high score above the 162,000 point mark during coffee breaks (consider the gauntlet thrown, Ars readers). If you’re a fan of classic arcade games, Pac-Man Superfast is worth a try before the “YouTube Playables” initiative inevitably joins the growing graveyard of discontinued Google products.

One of the best Pac-Man games in years is playable on YouTube, of all places Read More »

rocket-report:-two-big-asian-reuse-milestones,-vandenberg-becomes-spacex-west

Rocket Report: Two big Asian reuse milestones, Vandenberg becomes SpaceX west


“This is potentially going to be a problem.”

Landspace shows off its Zhuque-3 rocket on the launch pad. Credit: Landspace

Welcome to Edition 7.49 of the Rocket Report! You may have noticed we are a little late with the report this week, and that is due to the Juneteenth holiday celebrated in the United States on Thursday. But that hasn’t stopped a torrent of big news this week, from exploding Starships to significant reuse milestones being reached in Asia.

As always, we welcome reader submissions, and if you don’t want to miss an issue, please subscribe using the box below (the form will not appear on AMP-enabled versions of the site). Each report will include information on small-, medium-, and heavy-lift rockets as well as a quick look ahead at the next three launches on the calendar.

Honda stamps passport to the skies with a hopper. An experimental reusable rocket developed by the research and development arm of Honda Motor Company flew to an altitude of nearly 900 feet (275 meters) Tuesday, then landed with pinpoint precision at the carmaker’s test facility in northern Japan, Ars reports. Honda’s hopper is the first prototype rocket outside of the United States and China to complete a flight of this kind, demonstrating vertical takeoff and vertical landing technology that could underpin the development of a reusable launch vehicle.

A legitimately impressive feat… Honda has been quiet on this rocket project since a brief media blitz nearly four years ago. Developed in-house by Honda R&D Company, the rocket climbed vertically from a pedestal at the company’s test site in southeastern Hokkaido, the northernmost of Japan’s main islands, before landing less than a meter from its target. Honda said its launch vehicle is “still in the fundamental research phase,” and the company has made no decision whether to commercialize the rocket program. (submitted by Fernwaerme, TFargo04, Biokleen, Rendgrish, and astromog)

European launch companies seek protection. In a joint statement published on Monday, Arianespace and Avio called for European missions to be launched aboard European rockets, European Spaceflight reports. The statement warned that without “sustained support,” European rocket builders risked losing out to institutionally backed competitors from the US.

Seeking to permanently embed European preference… “Major space powers support their industries through stable and guaranteed institutional markets, enabling long-term investments, innovation, and the preservation of leadership,” explained the statement. The pair argues that Europe risks falling behind not due to a lack of technical capability but because of structural market weaknesses. (submitted by EllPeaTea)

The easiest way to keep up with Eric Berger’s and Stephen Clark’s reporting on all things space is to sign up for our newsletter. We’ll collect their stories and deliver them straight to your inbox.

Sign Me Up!

Increasing launch cadence may threaten ozone layer. The rapidly growing number of rocket launches could slow the recovery of the ozone layer, a new study in the journal Nature finds. The ozone layer is healing due to countries phasing out CFCs, but rocket launches could slow its recovery if the space industry continues growing, Radio New Zealand reports. “At the moment, it’s not a problem because the launches happen too infrequently,” said University of Canterbury atmospheric scientist Laura Revell, one of the authors of the study. “As we get more and more launches taking place—because there are companies out there with very bold ambitions to increase launch frequency—this is potentially going to be a problem.”

Forecasting a lot of growth in launch… In a conservative growth scenario, about 900 total launches a year, there is some ozone loss but not significant amounts,” said Revell. “But when we look at a more ambitious scenario, when we looked at the upper limits of what might be launched in future—around 2,000 launches year—we saw levels of ozone loss that are concerning in the context of ozone recovery,” she said. Ozone losses are driven by the chlorine produced from solid rocket motor propellant and black carbon, which is emitted from most propellants, the study says. (submitted by Zaphod Harkonnen)

Space may soon be pay-to-play with the FAA. The Federal Aviation Administration may soon levy fees on companies seeking launch and reentry licenses, a new tack in the push to give the agency the resources it needs to keep up with the rapidly growing commercial space industry, Ars reports. The text of a budget reconciliation bill released by Sen. Ted Cruz (R-Texas) earlier this month calls for the FAA’s Office of Commercial Space Transportation, known as AST, to begin charging licensing fees to space companies next year.

The price of poker keeps going up… The fees would phase in over eight years, after which the FAA would adjust them to keep pace with inflation. The money would go into a trust fund to help pay for the operating costs of the FAA’s commercial space office. Cruz’s section of the Senate reconciliation bill calls for the FAA to charge commercial space companies per pound of payload mass, beginning with 25 cents per pound in 2026 and increasing to $1.50 per pound in 2033. Subsequent fee rates would change based on inflation. The overall fee per launch or entry would be capped at $30,000 in 2026, increasing to $200,000 in 2033, and then be adjusted to keep pace with inflation.

Landspace tests Zhuque-3 rocket. Chinese launch startup Landspace carried out a breakthrough static fire test Friday as it builds towards an orbital launch attempt with its Zhuque-3 rocket, Space News reports. The Zhuque-3’s nine methane-liquid oxygen engines ignited in sequence and fired for 45 seconds, including gimbal control testing, before shutting down as planned. The successful test lays a solid foundation for the upcoming inaugural flight of the Zhuque-3 and for the country’s reusable launch vehicle technology, Landspace said.

Similar in design to Falcon 9 … Friday’s static fire test used a first-stage identical to the one intended for Zhuque-3’s inaugural flight, planned for later this year, and covered the full ground-based launch preparation and ignition sequence, including propellant loading, tank pressurization, staged engine ignition, steady-state operation and a programmed shutdown. Payload capacity to low Earth orbit will be 21 metric tons when expendable, or up to 18,300 kg when the first stage is recovered downrange. Alternatively, it can carry 12,500 kg to LEO when returning to the launch site.

Kuiper launch scrubs due to hardware issue. United Launch Alliance and its customer, Amazon, will have to wait longer for the second launch of Amazon’s Project Kuiper satellites following a scrub on Monday afternoon. “United Launch Alliance Atlas 5 551 carrying Amazon’s second Project Kuiper mission, Kuiper 2, is delayed due to an engineering observation of an elevated purge temperature within the booster engine,” ULA said in a statement. “The team will evaluate the hardware, and we will release a new launch date when available.”

Back to the VIF in a spiff… On Tuesday, ULA rolled the Atlas V rocket back to its Vertical Integration Facility to address the issue with the nitrogen purge line on the vehicle. In addition to this mission, ULA has six more Atlas 5 rockets that have been purchased by Amazon to fly satellites for its constellation. As of Friday morning, ULA had not set a new launch date for the Kuiper 2 mission, but it could take place early next week. (submitted by ElllPeaTea)

Varda’s next launch will use in-house spacecraft. Varda Space Industries is preparing to launch its fourth spacecraft, W-4, on a SpaceX rideshare mission scheduled to launch as soon as June 21 from Vandenberg Space Force Base in California, Space News reports. The Los Angeles-based startup manufactures pharmaceuticals in orbit and returns them to Earth using specialized reentry capsules.

No longer using Rocket Lab… For its first three missions, Varda had partnered with Rocket Lab to use its Photon spacecraft for in-space operations. However, with W-4, Varda is debuting its first spacecraft built entirely in-house. The company is consolidating design and production internally in an effort to shorten the timeline between missions and increase flexibility to tailor vehicles to customer requirements. Varda decided that vertical integration was essential for scaling operations. (submitted by MarkW98)

Vandenberg becomes SpaceX west. One of the defining events early in the history of SpaceX is when the company was effectively booted from Vandenberg Space Force Base in 2005 after completing the first successful test firing of the Falcon 1 rocket there. This set the company off on a long romp to Kwajalein Atoll in the Pacific Ocean before acquiring a launch site at Cape Canaveral, Florida. When SpaceX finally returned to Vandenberg half a decade later, it had the Falcon 9 rocket and was no longer the scrappy upstart. Since then, it has made Vandenberg its own.

Falcons flying frequentlyAccording to Spaceflight Now, on Monday, SpaceX launched the 200th overall orbital flight from Space Launch Complex 4 East at Vandenberg Space Force Base, a batch of 26 Starlink V2 Mini satellites. Among the 199 previous orbital launches from SLC-4E, 131 of them were Falcon 9 rockets. The pad was first occupied by the Atlas-Agena rocket shortly after the Air Force Western Test Range activated in May 1964. SpaceX is currently going through the review process for acquiring SLC-6 as well to use for its Falcon 9 and Falcon Heavy rockets. (submitted by EllPeaTea)

China tests launch abort system. China carried out a successful pad abort test early Tuesday for its next-generation crew spacecraft for moon and low-Earth orbit missions, Space News reports. Footage of the test shows the escape system rapidly boosting the Mengzhou spacecraft away from the ground. Around 20 seconds later, the vehicle reached a predetermined altitude. The return capsule separated from the escape tower, and its parachutes deployed successfully. China is planning to conduct an in-flight escape test at maximum dynamic pressure later this year.

No longer reliant on the rocket… According to the agency, Mengzhou shifts from the traditional model of “rocket handles abort, spacecraft handles crew rescue,” as used by the Shenzhou, to a system where the Mengzhou spacecraft takes full responsibility for both abort control and crew safety. “The success of this test lays an important technical foundation for future crewed lunar missions,” a Chinese statement read. “Development work on related spacecraft, such as the Long March 10 launch vehicle and the lunar lander, is progressing steadily and will proceed to further testing as scheduled.” (submitted by EllPeaTea)

Another Starship explodes unexpectedly. SpaceX’s next Starship rocket exploded during a ground test in South Texas late Wednesday, dealing another blow to a program already struggling to overcome three consecutive failures in recent months, Ars reports. The late-night explosion at SpaceX’s rocket development complex in Starbase, Texas, destroyed the upper stage that was slated to launch on the next Starship test flight. The powerful blast set off fires around SpaceX’s Massey’s Test Site, located a few miles from the company’s Starship factory and launch pads.

A major anomaly … SpaceX confirmed the Starship, numbered Ship 36 in the company’s inventory, “experienced a major anomaly” on a test stand as the vehicle prepared to ignite its six Raptor engines for a static fire test. These hold-down test-firings are typically one of the final milestones in a Starship launch campaign before SpaceX moves the rocket to the launch pad. The company later said the failure may have been due to a composite overwrap pressure vessel, or COPV, near the top of the vehicle. On Thursday, aerial videos revealed that damage at the test site was significant but not beyond repair. (submitted by Tfargo04)

ArianeGroup will lead reusable engine project. The French space agency CNES announced Tuesday that it had selected ArianeGroup to lead a project to develop a high-thrust reusable rocket engine, European Spaceflight reports. The ASTRE (Advanced Staged-Combustion Technologies for Reusable Engines) project will also include contributions from SiriusSpace and Pangea Aerospace.

Company will take a test and learn approach… The project aims to develop a full-flow staged combustion methalox reusable rocket engine capable of producing between 200 and 300 tonnes of thrust, placing it in roughly the same class as the SpaceX Raptor engine. According to the agency, the goal of the project is “to equip the French and European space industry with new capabilities for strategic applications.” (submitted by EllPeaTea)

Next three launches

June 21: Falcon 9 | Transporter-14 | Vandenberg Space Force Base, Calif. | 21: 19 UTC

June 22: Falcon 9 | Starlink 10-23 | Cape Canaveral Space Force Station, Florida | 05: 47 UTC

June 23: Atlas V | Project Kuiper KA-02 | Cape Canaveral Space Force Station, Florida | 10: 54 UTC

Photo of Eric Berger

Eric Berger is the senior space editor at Ars Technica, covering everything from astronomy to private space to NASA policy, and author of two books: Liftoff, about the rise of SpaceX; and Reentry, on the development of the Falcon 9 rocket and Dragon. A certified meteorologist, Eric lives in Houston.

Rocket Report: Two big Asian reuse milestones, Vandenberg becomes SpaceX west Read More »

study:-meta-ai-model-can-reproduce-almost-half-of-harry-potter-book

Study: Meta AI model can reproduce almost half of Harry Potter book


Harry Potter and the Copyright Lawsuit

The research could have big implications for generative AI copyright lawsuits.

Meta CEO Mark Zuckerberg. Credit: Andrej Sokolow/picture alliance via Getty Images

In recent years, numerous plaintiffs—including publishers of books, newspapers, computer code, and photographs—have sued AI companies for training models using copyrighted material. A key question in all of these lawsuits has been how easily AI models produce verbatim excerpts from the plaintiffs’ copyrighted content.

For example, in its December 2023 lawsuit against OpenAI, The New York Times Company produced dozens of examples where GPT-4 exactly reproduced significant passages from Times stories. In its response, OpenAI described this as a “fringe behavior” and a “problem that researchers at OpenAI and elsewhere work hard to address.”

But is it actually a fringe behavior? And have leading AI companies addressed it? New research—focusing on books rather than newspaper articles and on different companies—provides surprising insights into this question. Some of the findings should bolster plaintiffs’ arguments, while others may be more helpful to defendants.

The paper was published last month by a team of computer scientists and legal scholars from Stanford, Cornell, and West Virginia University. They studied whether five popular open-weight models—three from Meta and one each from Microsoft and EleutherAI—were able to reproduce text from Books3, a collection of books that is widely used to train LLMs. Many of the books are still under copyright.

This chart illustrates their most surprising finding:

The chart shows how easy it is to get a model to generate 50-token excerpts from various parts of Harry Potter and the Sorcerer’s Stone. The darker a line is, the easier it is to reproduce that portion of the book.

Each row represents a different model. The three bottom rows are Llama models from Meta. And as you can see, Llama 3.1 70B—a mid-sized model Meta released in July 2024—is far more likely to reproduce Harry Potter text than any of the other four models.

Specifically, the paper estimates that Llama 3.1 70B has memorized 42 percent of the first Harry Potter book well enough to reproduce 50-token excerpts at least half the time. (I’ll unpack how this was measured in the next section.)

Interestingly, Llama 1 65B, a similar-sized model released in February 2023, had memorized only 4.4 percent of Harry Potter and the Sorcerer’s Stone. This suggests that despite the potential legal liability, Meta did not do much to prevent memorization as it trained Llama 3. At least for this book, the problem got much worse between Llama 1 and Llama 3.

Harry Potter and the Sorcerer’s Stone was one of dozens of books tested by the researchers. They found that Llama 3.1 70B was far more likely to reproduce popular books—such as The Hobbit and George Orwell’s 1984—than obscure ones. And for most books, Llama 3.1 70B memorized more than any of the other models.

“There are really striking differences among models in terms of how much verbatim text they have memorized,” said James Grimmelmann, a Cornell law professor who has collaborated with several of the paper’s authors.

The results surprised the study’s authors, including Mark Lemley, a law professor at Stanford. (Lemley used to be part of Meta’s legal team, but in January, he dropped them as a client after Facebook adopted more Trump-friendly moderation policies.)

“We’d expected to see some kind of low level of replicability on the order of 1 or 2 percent,” Lemley told me. “The first thing that surprised me is how much variation there is.”

These results give everyone in the AI copyright debate something to latch onto. For AI industry critics, the big takeaway is that—at least for some models and some books—memorization is not a fringe phenomenon.

On the other hand, the study only found significant memorization of a few popular books. For example, the researchers found that Llama 3.1 70B only memorized 0.13 percent of Sandman Slim, a 2009 novel by author Richard Kadrey. That’s a tiny fraction of the 42 percent figure for Harry Potter.

This could be a headache for law firms that have filed class-action lawsuits against AI companies. Kadrey is the lead plaintiff in a class-action lawsuit against Meta. To certify a class of plaintiffs, a court must find that the plaintiffs are in largely similar legal and factual situations.

Divergent results like these could cast doubt on whether it makes sense to lump J.K. Rowling, Kadrey, and thousands of other authors together in a single mass lawsuit. And that could work in Meta’s favor, since most authors lack the resources to file individual lawsuits.

The broader lesson of this study is that the details will matter in these copyright cases. Too often, online discussions have treated “do generative models copy their training data or merely learn from it?” as a theoretical or even philosophical question. But it’s a question that can be tested empirically—and the answer might differ across models and across copyrighted works.

It’s common to talk about LLMs predicting the next token. But under the hood, what the model actually does is generate a probability distribution over all possibilities for the next token. For example, if you prompt an LLM with the phrase “Peanut butter and,” it will respond with a probability distribution that might look like this made-up example:

  • P(“jelly”) = 70 percent
  • P(“sugar”) = 9 percent
  • P(“peanut”) = 6 percent
  • P(“chocolate”) = 4 percent
  • P(“cream”) = 3 percent

And so forth.

After the model generates a list of probabilities like this, the system will select one of these options at random, weighted by their probabilities. So 70 percent of the time the system will generate “Peanut butter and jelly.” Nine percent of the time, we’ll get “Peanut butter and sugar.” Six percent of the time, it will be “Peanut butter and peanut.” You get the idea.

The study’s authors didn’t have to generate multiple outputs to estimate the likelihood of a particular response. Instead, they could calculate probabilities for each token and then multiply them together.

Suppose someone wants to estimate the probability that a model will respond to “My favorite sandwich is” with “peanut butter and jelly.” Here’s how to do that:

  • Prompt the model with “My favorite sandwich is,” and look up the probability of “peanut” (let’s say it’s 20 percent).
  • Prompt the model with “My favorite sandwich is peanut,” and look up the probability of “butter” (let’s say it’s 90 percent).
  • Prompt the model with “My favorite sandwich is peanut butter” and look up the probability of “and” (let’s say it’s 80 percent).
  • Prompt the model with “My favorite sandwich is peanut butter and” and look up the probability of “jelly” (let’s say it’s 70 percent).

Then we just have to multiply the probabilities like this:

0.2 0.9 0.8 0.7 = 0.1008

So we can predict that the model will produce “peanut butter and jelly” about 10 percent of the time, without actually generating 100 or 1,000 outputs and counting how many of them were that exact phrase.

This technique greatly reduced the cost of the research, allowed the authors to analyze more books, and made it feasible to precisely estimate very low probabilities.

For example, the authors estimated that it would take more than 10 quadrillion samples to exactly reproduce some 50-token sequences from some books. Obviously, it wouldn’t be feasible to actually generate that many outputs. But it wasn’t necessary: the probability could be estimated just by multiplying the probabilities for the 50 tokens.

A key thing to notice is that probabilities can get really small really fast. In my made-up example, the probability that the model will produce the four tokens “peanut butter and jelly” is just 10 percent. If we added more tokens, the probability would get even lower. If we added 46 more tokens, the probability could fall by several orders of magnitude.

For any language model, the probability of generating any given 50-token sequence “by accident” is vanishingly small. If a model generates 50 tokens from a copyrighted work, that is strong evidence that the tokens “came from” the training data. This is true even if it only generates those tokens 10 percent, 1 percent, or 0.01 percent of the time.

The study authors took 36 books and divided each of them into overlapping 100-token passages. Using the first 50 tokens as a prompt, they calculated the probability that the next 50 tokens would be identical to the original passage. They counted a passage as “memorized” if the model had a greater than 50 percent chance of reproducing it word for word.

This definition is quite strict. For a 50-token sequence to have a probability greater than 50 percent, the average token in the passage needs a probability of at least 98.5 percent! Moreover, the authors only counted exact matches. They didn’t try to count cases where—for example—the model generates 48 or 49 tokens from the original passage but got one or two tokens wrong. If these cases were counted, the amount of memorization would be even higher.

This research provides strong evidence that significant portions of Harry Potter and the Sorcerer’s Stone were copied into the weights of Llama 3.1 70B. But this finding doesn’t tell us why or how this happened. I suspect that part of the answer is that Llama 3 70B was trained on 15 trillion tokens—more than 10 times the 1.4 trillion tokens used to train Llama 1 65B.

The more times a model is trained on a particular example, the more likely it is to memorize that example. Perhaps Meta had trouble finding 15 trillion distinct tokens, so it trained on the Books3 dataset multiple times. Or maybe Meta added third-party sources—such as online Harry Potter fan forums, consumer book reviews, or student book reports—that included quotes from Harry Potter and other popular books.

I’m not sure that either of these explanations fully fits the facts. The fact that memorization was a much bigger problem for the most popular books does suggest that Llama may have been trained on secondary sources that quote these books rather than the books themselves. There are likely exponentially more online discussions of Harry Potter than Sandman Slim.

On the other hand, it’s surprising that Llama memorized so much of Harry Potter and the Sorcerer’s Stone.

“If it were citations and quotations, you’d expect it to concentrate around a few popular things that everyone quotes or talks about,” Lemley said. The fact that Llama 3 memorized almost half the book suggests that the entire text was well represented in the training data.

Or there could be another explanation entirely. Maybe Meta made subtle changes in its training recipe that accidentally worsened the memorization problem. I emailed Meta for comment last week but haven’t heard back.

“It doesn’t seem to be all popular books,” Mark Lemley told me. “Some popular books have this result and not others. It’s hard to come up with a clear story that says why that happened.”

  1. Training on a copyrighted work is inherently infringing because the training process involves making a digital copy of the work.
  2. The training process copies information from the training data into the model, making the model a derivative work under copyright law.
  3. Infringement occurs when a model generates (portions of) a copyrighted work.

A lot of discussion so far has focused on the first theory because it is the most threatening to AI companies. If the courts uphold this theory, most current LLMs would be illegal, whether or not they have memorized any training data.

The AI industry has some pretty strong arguments that using copyrighted works during the training process is fair use under the 2015 Google Books ruling. But the fact that Llama 3.1 70B memorized large portions of Harry Potter could color how the courts consider these fair use questions.

A key part of fair use analysis is whether a use is “transformative”—whether a company has made something new or is merely profiting from the work of others. The fact that language models are capable of regurgitating substantial portions of popular works like Harry Potter1984, and The Hobbit could cause judges to look at these fair use arguments more skeptically.

Moreover, one of Google’s key arguments in the books case was that its system was designed to never return more than a short excerpt from any book. If the judge in the Meta lawsuit wanted to distinguish Meta’s arguments from the ones Google made in the books case, he could point to the fact that Llama can generate far more than a few lines of Harry Potter.

The new study “complicates the story that the defendants have been telling in these cases,” co-author Mark Lemley told me. “Which is ‘we just learn word patterns. None of that shows up in the model.’”

But the Harry Potter result creates even more danger for Meta under that second theory—that Llama itself is a derivative copy of Rowling’s book.

“It’s clear that you can in fact extract substantial parts of Harry Potter and various other books from the model,” Lemley said. “That suggests to me that probably for some of those books there’s something the law would call a copy of part of the book in the model itself.”

The Google Books precedent probably can’t protect Meta against this second legal theory because Google never made its books database available for users to download—Google almost certainly would have lost the case if it had done that.

In principle, Meta could still convince a judge that copying 42 percent of Harry Potter was allowed under the flexible, judge-made doctrine of fair use. But it would be an uphill battle.

“The fair use analysis you’ve gotta do is not just ‘is the training set fair use,’ but ‘is the incorporation in the model fair use?’” Lemley said. “That complicates the defendants’ story.”

Grimmelmann also said there’s a danger that this research could put open-weight models in greater legal jeopardy than closed-weight ones. The Cornell and Stanford researchers could only do their work because the authors had access to the underlying model—and hence to the token probability values that allowed efficient calculation of probabilities for sequences of tokens.

Most leading labs, including OpenAI, Anthropic, and Google, have increasingly restricted access to these so-called logits, making it more difficult to study these models.

Moreover, if a company keeps model weights on its own servers, it can use filters to try to prevent infringing output from reaching the outside world. So even if the underlying OpenAI, Anthropic, and Google models have memorized copyrighted works in the same way as Llama 3.1 70B, it might be difficult for anyone outside the company to prove it.

Moreover, this kind of filtering makes it easier for companies with closed-weight models to invoke the Google Books precedent. In short, copyright law might create a strong disincentive for companies to release open-weight models.

“It’s kind of perverse,” Mark Lemley told me. “I don’t like that outcome.”

On the other hand, judges might conclude that it would be bad to effectively punish companies for publishing open-weight models.

“There’s a degree to which being open and sharing weights is a kind of public service,” Grimmelmann told me. “I could honestly see judges being less skeptical of Meta and others who provide open-weight models.”

Timothy B. Lee was on staff at Ars Technica from 2017 to 2021. Today, he writes Understanding AI, a newsletter that explores how AI works and how it’s changing our world. You can subscribe here.

Photo of Timothy B. Lee

Timothy is a senior reporter covering tech policy and the future of transportation. He lives in Washington DC.

Study: Meta AI model can reproduce almost half of Harry Potter book Read More »