Author name: 9u50fv

large-enterprises-scramble-after-supply-chain-attack-spills-their-secrets

Large enterprises scramble after supply-chain attack spills their secrets

Open source software used by more than 23,000 organizations, some of them in large enterprises, was compromised with credential-stealing code after attackers gained unauthorized access to a maintainer account, in the latest open source supply-chain attack to roil the Internet.

The corrupted package, tj-actions/changed-files, is part of tj-actions, a collection of files that’s used by more than 23,000 organizations. Tj-actions is one of many GitHub Actions, a form of platform for streamlining software available on the open source developer platform. Actions are a core means of implementing what’s known as CI/CD, short for Continuous Integration and Continuous Deployment (or Continuous Delivery).

Scraping server memory at scale

On Friday or earlier, the source code for all versions of tj-actions/changed-files received unauthorized updates that changed the “tags” developers use to reference specific code versions. The tags pointed to a publicly available file that copies the internal memory of severs running it, searches for credentials, and writes them to a log. In the aftermath, many publicly accessible repositories running tj-actions ended up displaying their most sensitive credentials in logs anyone could view.

“The scary part of actions is that they can often modify the source code of the repository that is using them and access any secret variables associated with a workflow,” HD Moore, founder and CEO of runZero and an expert in open source security, said in an interview. “The most paranoid use of actions is to audit all of the source code, then pin the specific commit hash instead of the tag into the … the workflow, but this is a hassle.”

Large enterprises scramble after supply-chain attack spills their secrets Read More »

to-avoid-the-panama-canal,-relativity-space-may-move-some-operations-to-texas

To avoid the Panama Canal, Relativity Space may move some operations to Texas

Although Baytown does not have any historical affinity with aerospace, its location on the water offers far more straightforward access to Relativity’s test facilities in Mississippi and its launch site in Florida. There are other benefits. The cost of living in the region is far lower than Southern California, and due to the location of Johnson Space Center just 20 miles away, there is a reservoir of space talent in the region.

A spokesperson for Relativity Space did not confirm the move.

“As we scale Terran R production to meet growing customer demand, we are exploring options to expand our manufacturing capabilities,” the spokesperson said. “Our focus is on ensuring we have the right footprint to achieve the production cadence required to serve our customers.”

Texas space is on the rise

For logistics and other reasons, Relativity has been evaluating locations across several states that border the Gulf of Mexico, including Texas, over recent years, multiple sources said. The company is expected to continue operating its large “Wormhole” factory in Long Beach, California, which is more than 1 million square feet in size. A second factory in Texas would likely be used to build propellant tanks and assemble stages for testing in Mississippi and launch in Florida.

The addition of a second factory in Texas would underscore the investment to which Schmidt appears committed to making Relativity a major player in US launch.

It is unclear whether state or local officials have provided any incentives to Relativity for relocating a significant chunk of its manufacturing operations to Texas. Last year the state legislature created the Texas Space Commission and provided $350 million in funding to support commercial space operations. In February the commission awarded the first of these grants, valued at $47.7 million, to five companies with Texas-based operations: Starlab Space, Intuitive Machines, Firefly Aerospace, SpaceX, and Blue Origin.

A leading figure behind the commission is State Rep. Greg Bonnen, whose district includes Johnson Space Center. Bonnen has signaled that the commission is a long-term project by the state to ensure its economic prosperity in the 21st century by continuing to grow existing businesses in Texas, but also to attract new companies to the state.

SpaceX and Firefly already manufacture rockets in Texas. Adding Relativity Space would be a significant coup for a state that, only a decade ago, was known primarily in space for being the home of NASA’s human spaceflight activities.

To avoid the Panama Canal, Relativity Space may move some operations to Texas Read More »

google-joins-openai-in-pushing-feds-to-codify-ai-training-as-fair-use

Google joins OpenAI in pushing feds to codify AI training as fair use

Google’s position on AI regulation: Trust us, bro

If there was any doubt about Google’s commitment to move fast and break things, its new policy position should put that to rest. “For too long, AI policymaking has paid disproportionate attention to the risks,” the document says.

Google urges the US to invest in AI not only with money but with business-friendly legislation. The company joins the growing chorus of AI firms calling for federal legislation that clarifies how they can operate. It points to the difficulty of complying with a “patchwork” of state-level laws that impose restrictions on AI development and use. If you want to know what keeps Google’s policy wonks up at night, look no further than the vetoed SB-1047 bill in California, which would have enforced AI safety measures.

AI ethics or AI Law concept. Developing AI codes of ethics. Compliance, regulation, standard , business policy and responsibility for guarding against unintended bias in machine learning algorithms.

Credit: Parradee Kietsirikul

According to Google, a national AI framework that supports innovation is necessary to push the boundaries of what artificial intelligence can do. Taking a page from the gun lobby, Google opposes attempts to hold the creators of AI liable for the way those models are used. Generative AI systems are non-deterministic, making it impossible to fully predict their output. Google wants clearly defined responsibilities for AI developers, deployers, and end users—it would, however, clearly prefer most of those responsibilities fall on others. “In many instances, the original developer of an AI model has little to no visibility or control over how it is being used by a deployer and may not interact with end users,” the company says.

There are efforts underway in some countries that would implement stringent regulations that force companies like Google to make their tools more transparent. For example, the EU’s AI Act would require AI firms to publish an overview of training data and possible risks associated with their products. Google believes this would force the disclosure of trade secrets that would allow foreign adversaries to more easily duplicate its work, mirroring concerns that OpenAI expressed in its policy proposal.

Google wants the government to push back on these efforts at the diplomatic level. The company would like to be able to release AI products around the world, and the best way to ensure it has that option is to promote light-touch regulation that “reflects US values and approaches.” That is, Google’s values and approaches.

Google joins OpenAI in pushing feds to codify AI training as fair use Read More »

2025-ipad-air-hands-on:-why-mess-with-a-good-thing?

2025 iPad Air hands-on: Why mess with a good thing?

There’s not much new in Apple’s latest refresh of the iPad Air, so there’s not much to say about it, but it’s worth taking a brief look regardless.

In almost every way, this is identical to the previous generation. There are only two differences to go over: the bump from the M2 chip to the slightly faster M3, and a redesign of the Magic Keyboard peripheral.

If you want more details about this tablet, refer to our M2 iPad Air review from last year. Everything we said then applies now.

From M2 to M3

The M3 chip has an 8-core CPU with four performance cores and four efficiency cores. On the GPU side, there are nine cores. There’s also a 16-core Neural Engine, which is what Apple calls its NPU.

We’ve seen the M3 in other devices before, and it performs comparably here in the iPad Air in Geekbench benchmarks. Those coming from the M1 or older A-series chips will see some big gains, but it’s a subtle step up over the M2 in last year’s iPad Air.

That will be a noticeable boost primarily for a handful of particularly demanding 3D games (the likes of Assassin’s Creed Mirage, Resident Evil Village, Infinity Nikki, and Genshin Impact) and some heavy-duty applications only a few people use, like CAD or video editing programs.

Most of the iPad Air’s target audience would never know the difference, though, and the main benefit here isn’t necessarily real-world performance. Rather, the upside of this upgrade is the addition of a few specific features, namely hardware-accelerated ray tracing and hardware-accelerated AV1 video codec support.

This isn’t new, but this chip supports Apple Intelligence, the much-ballyhooed suite of generative AI features Apple recently introduced. At this point there aren’t many devices left in Apple’s lineup that don’t support Apple Intelligence (it’s basically just the cheapest, entry-level iPad that doesn’t have it) and that’s good news for Apple, as it helps the company simplify its marketing messaging around the features.

2025 iPad Air hands-on: Why mess with a good thing? Read More »

us-measles-cases-reach-5-year-high;-15-states-report-cases,-texas-outbreak-grows

US measles cases reach 5-year high; 15 states report cases, Texas outbreak grows

The US has now recorded over 300 measles cases just three months into 2025, exceeding the yearly case counts for all years after 2019. The bulk of this year’s cases are from an outbreak that erupted in an undervaccinated county in West Texas in late January, which has since spread to New Mexico and Oklahoma.

As of the afternoon of March 14, Texas reports 259 cases across 11 counties, 34 hospitalizations, and one death, which occurred in an unvaccinated 6-year-old girl. New Mexico reports 35 cases across two counties, two hospitalizations, and one death. That death occurred in an unvaccinated adult who did not seek medical treatment and tested positive for the virus posthumously. The cause of death is still under investigation. Oklahoma reports two probable cases linked to the outbreak.

In addition to Texas, New Mexico, and Oklahoma, 12 other states have reported at least one confirmed measles case since the start of the year: Alaska, California, Florida, Georgia, Kentucky, Maryland, New Jersey, New York, Pennsylvania, Rhode Island, Vermont, and Washington. According to the Centers for Disease Control and Prevention, this year has seen three measles outbreaks, defined as three or more related cases.

As of March 13, the CDC reported 301 confirmed cases, which do not include 36 new cases reported today in Texas and two in New Mexico.

“Measles is back”

Since 2000, when health officials victoriously declared measles eliminated from the US thanks to concerted vaccination campaigns, only three other years have had higher tallies of measles cases. In 2014, the country saw 667 measles cases. In 2018, there were 381 cases. And in 2019—when the country was on the verge of losing its elimination status—there was a startling 1,274 cases, largely driven by massive outbreaks in New York. Measles is considered eliminated if there is no continuous spread in the country over the course of at least 12 months. (This is not to be confused with “eradication,” which is defined as “permanent reduction to zero of the worldwide incidence” of an infectious disease. Smallpox and rinderpest are the only pathogens humans have eradicated.)

US measles cases reach 5-year high; 15 states report cases, Texas outbreak grows Read More »

in-one-dog-breed,-selection-for-utility-may-have-selected-for-obesity

In one dog breed, selection for utility may have selected for obesity

High-risk Labradors also tended to pester their owners for food more often. Dogs with low genetic risk scores, on the other hand, stayed slim regardless of whether the owners paid attention to how and whether they were fed or not.

But other findings proved less obvious. “We’ve long known chocolate-colored Labradors are prone to being overweight, and I’ve often heard people say that’s because they’re really popular as pets for young families with toddlers that throw food on the floor all the time and where dogs are just not given that much attention,” Raffan says. Her team’s data showed that chocolate Labradors actually had a much higher genetic obesity risk than yellow or black ones

Some of the Labradors particularly prone to obesity, the study found, were guide dogs, which were included in the initial group. Training a guide dog in the UK usually takes around two years, during which the dogs learn multiple skills, like avoiding obstacles, stopping at curbs, navigating complex environments, and responding to emergency scenarios. Not all dogs are able to successfully finish this training, which is why guide dogs are often selectively bred with other guide dogs in the hope their offspring would have a better chance at making it through the same training.

But it seems that this selective breeding among guide dogs might have had unexpected consequences. “Our results raise the intriguing possibility that we may have inadvertently selected dogs prone to obesity, dogs that really like their food, because that makes them a little bit more trainable. They would do anything for a biscuit,” Raffan says.

The study also found that genes responsible for obesity in dogs are also responsible for obesity in humans. “The impact high genetic risk has on dogs leads to increased appetite. It makes them more interested in food,” Raffan claims. “Exactly the same is true in humans. If you’re at high genetic risk you aren’t inherently lazy or rubbish about overeating—it’s just you are more interested in food and get more reward from it.”

Science, 2025.  DOI: 10.1126/science.ads2145

In one dog breed, selection for utility may have selected for obesity Read More »

anthropic-ceo-floats-idea-of-giving-ai-a-“quit-job”-button,-sparking-skepticism

Anthropic CEO floats idea of giving AI a “quit job” button, sparking skepticism

Amodei’s suggestion of giving AI models a way to refuse tasks drew immediate skepticism on X and Reddit as a clip of his response began to circulate earlier this week. One critic on Reddit argued that providing AI with such an option encourages needless anthropomorphism, attributing human-like feelings and motivations to entities that fundamentally lack subjective experiences. They emphasized that task avoidance in AI models signals issues with poorly structured incentives or unintended optimization strategies during training, rather than indicating sentience, discomfort, or frustration.

Our take is that AI models are trained to mimic human behavior from vast amounts of human-generated data. There is no guarantee that the model would “push” a discomfort button because it had a subjective experience of suffering. Instead, we would know it is more likely echoing its training data scraped from the vast corpus of human-generated texts (including books, websites, and Internet comments), which no doubt include representations of lazy, anguished, or suffering workers that it might be imitating.

Refusals already happen

A photo of co-founder and CEO of Anthropic, Dario Amodei, dated May 22, 2024.

Anthropic co-founder and CEO Dario Amodei on May 22, 2024. Credit: Chesnot via Getty Images

In 2023, people frequently complained about refusals in ChatGPT that may have been seasonal, related to training data depictions of people taking winter vacations and not working as hard during certain times of year. Anthropic experienced its own version of the “winter break hypothesis” last year when people claimed Claude became lazy in August due to training data depictions of seeking a summer break, although that was never proven.

However, as far out and ridiculous as this sounds today, it might be short-sighted to permanently rule out the possibility of some kind of subjective experience for AI models as they get more advanced into the future. Even so, will they “suffer” or feel pain? It’s a highly contentious idea, but it’s a topic that Fish is studying for Anthropic, and one that Amodei is apparently taking seriously. But for now, AI models are tools, and if you give them the opportunity to malfunction, that may take place.

To provide further context, here is the full transcript of Amodei’s answer during Monday’s interview (the answer begins around 49: 54 in this video).

Anthropic CEO floats idea of giving AI a “quit job” button, sparking skepticism Read More »

cockpit-voice-recorder-survived-fiery-philly-crash—but-stopped-taping-years-ago

Cockpit voice recorder survived fiery Philly crash—but stopped taping years ago

Cottman Avenue in northern Philadelphia is a busy but slightly down-on-its-luck urban thoroughfare that has had a strange couple of years.

You might remember the truly bizarre 2020 press conference held—for no discernible reason—at Four Seasons Total Landscaping, a half block off Cottman Avenue, where a not-yet-disbarred Rudy Giuliani led a farcical ensemble of characters in an event so weird it has been immortalized in its own, quite lengthy, Wikipedia article.

Then in 2023, a truck carrying gasoline caught fire just a block away, right where Cottman passes under I-95. The resulting fire damaged I-95 in both directions, bringing down several lanes and closing I-95 completely for some time. (This also generated a Wikipedia article.)

This year, on January 31, a little further west on Cottman, a Learjet 55 medevac flight crashed one minute after takeoff from Northeast Philadelphia Airport. The plane, fully loaded with fuel for a trip to Springfield, Missouri, came down near a local mall, clipped a commercial sign, and exploded in a fireball when it hit the ground. The crash generated a debris field 1,410 feet long and 840 feet wide, according to the National Transportation and Safety Board (NTSB), and it killed six people on the plane and one person on the ground.

The crash was important enough to attract the attention of Pennsylvania Governor Josh Shapiro and Mexican President Claudia Sheinbaum. (The airplane crew and passengers were all Mexican citizens; they were transporting a young patient who had just wrapped up treatment at a Philadelphia hospital.) And yes, it, too, generated a Wikipedia article.

NTSB has been investigating ever since, hoping to determine the cause of the accident. Tracking data showed that the flight reached an altitude of 1,650 feet before plunging to earth, but the plane’s pilots never conveyed any distress to the local air traffic control tower.

Investigators searched for the plane’s cockpit voice recorder, which might provide clues as to what was happening in the cockpit during the crash. The Learjet did have such a recorder, though it was an older, tape-based model. (Newer ones are solid-state, with fewer moving parts.) Still, even this older tech should have recorded the last 30 minutes of audio, and these units are rated to withstand impacts of 3,400 Gs and to survive fires of 1,100° Celsius (2,012° F) for a half hour. Which was important, given that the plane had both burst into flames and crashed directly into the ground.

Cockpit voice recorder survived fiery Philly crash—but stopped taping years ago Read More »

the-most-forbidden-technique

The Most Forbidden Technique

The Most Forbidden Technique is training an AI using interpretability techniques.

An AI produces a final output [X] via some method [M]. You can analyze [M] using technique [T], to learn what the AI is up to. You could train on that. Never do that.

You train on [X]. Only [X]. Never [M], never [T].

Why? Because [T] is how you figure out when the model is misbehaving.

If you train on [T], you are training the AI to obfuscate its thinking, and defeat [T]. You will rapidly lose your ability to know what is going on, in exactly the ways you most need to know what is going on.

Those bits of optimization pressure from [T] are precious. Use them wisely.

  1. New Paper Warns Against the Most Forbidden Technique.

  2. Reward Hacking Is The Default.

  3. Using CoT to Detect Reward Hacking Is Most Forbidden Technique.

  4. Not Using the Most Forbidden Technique Is Harder Than It Looks.

  5. It’s You, It’s Also the Incentives.

  6. The Most Forbidden Technique Quickly Backfires.

  7. Focus Only On What Matters.

  8. Is There a Better Way?

  9. What Might We Do Next?

OpenAI has now published an important new paper explaining and emphasizing the importance of applying this principle to the Chain of Thought (CoT).

The paper shows that current CoT-based models area already strong enough that, under sufficient optimization pressure, they learn to hide their intent within the CoT and then exhibit substantial reward hacking.

They summarize their contributions this way:

  1. We show that frontier reasoning models often perform complex reward hacks in real-world coding environments.

  2. We provide evidence that chain-of-thought monitoring can be effective in detecting the misaligned behavior of frontier reasoning models, and we show it can be far more accurate than action-only monitoring.

  3. We show that chain-of-thought monitorability is very fragile. Applying direct optimization pressure to the CoT can quickly lead to obfuscated reward hacking—where agents hide their intent while continuing to misbehave.

Or:

  1. We are already seeing complex reward hacking in the real world.

  2. Monitoring the CoT can spot this.

  3. Applying too much optimization pressure on the CoT doesn’t make the hacking go away, it instead makes your ability to detect the hacking go away.

  4. ‘Too much’ is a remarkably small amount.

The paper is excellent, laying out a very important principle to never, ever do this Most Forbidden Technique, and explaining why we must never, ever do this.

We believe that CoT monitoring may be one of few tools we will have to oversee superhuman models of the future.

We have further found that directly optimizing the CoT to adhere to specific criteria (e.g. to not think about reward hacking) may boost performance in the short run; however, it does not eliminate all misbehavior and can cause a model to hide its intent. We hope future research will find ways to directly optimize CoTs without this drawback, but until then

We recommend against applying strong optimization pressure directly to the CoTs of frontier reasoning models, leaving CoTs unrestricted for monitoring.

Daniel Kokotajlo has a summary and building-from-here post on this at the Alignment Forum.

This paper is a clear positive update on OpenAI.

Jeffrey Ladish: “We recommend against applying strong optimization pressure directly to the CoTs of frontier reasoning models, leaving CoTs unrestricted for monitoring.”

Thank you! This is wise. I hope all AI companies will follow suit.

@AnthropicAI has also said this is their policy. 🙏

Eliezer Yudkowsky: That’s more wisdom and less “immediately make visible signs of the problem go away” than I expected from OpenAI.

We’ll see if they walk it back, maybe as soon as their competition takes the easy way out.

Seth Herd: This is also encouraging because OpenAI is making some actual claims about safety procedures. Sure they could walk it back pretty easily, but it does indicate that at least as of now they likely intend to try to maintain a faithful CoT.

[Daniel Kokotajlo] assumed no faithful CoT in What goals will AIs have?, suggesting that you expected OpenAI to give up on it. That’s concerning given your familiarity with their culture. Of course they still might easily go that way if there’s a substantial alignment tax for maintaining faithful CoT, but this is at least nice to see.

Daniel Kokotajlo: Indeed! This is most of why I’m happy about this — from internal discussions it’s long been speculated (since ’23 when I was talking about it, and probably before) that training away the bad thoughts from the CoT would result in obfuscated bad thoughts. It’s cool that they actually did the experiments to prove it, and in a real-world setting too! But it’s even cooler that they published this with a strong recommendation attached.

Now we see how long this recommendation holds up under the pressure of incentives.

Sometime in the next few years probably, various researchers will discover:

That if you scale up RL by additional OOMs, the CoTs evolve into some alien optimized language for efficiency reasons.

That you can train models to think in neuralese of some sort (e.g. with recurrence, or more high-dimensional outputs at least besides tokens) to boost performance.

Then the executives of the companies will face a choice: Abandon the faithful CoT golden era, or fall behind competitors. (Or the secret third option: Coordinate with each other & the government to make sure everyone who matters (all the big players at least) stick to faithful CoT).

I have insufficient faith in them to think they’ll go for the third option, since that’s a lot of work and requires being friends again and possibly regulation, and given that, I expect there to be a race to the bottom and most or all of them to go for the first option.

I am not as convinced as Daniel that the technology goes in that direction, and I think it would be well worth paying a substantial compute tax here to preserve faithful CoTs purely for competitive reasons, because it lets you deploy and trust your models more. The more I think about actually scary deployment of AI, the more it seems obviously locally and selfishly worth paying very large taxes in compute if it gets you reliability. Most important AI tasks, if they were 10% more reliable, would happy be worth paying several times as much, and so on.

But, if Daniel is right and the pressures here are too great, I agree that it seems unlikely we’ll be able to get coordination on faithful CoT. I wouldn’t rule it out, but the track record on pulling off such things keeps getting worse.

It’s a classic Russell Conjugation: I solve problems, you respond to incentives, he reward hacks.

That’s not an alignment failure. That’s a specification failure. That tiger went tiger.

As the paper classically reminds us, if you pay for rat tails, you won’t reduce the rat population, because the people will farm rates to get the rat tails. That’s on you.

Making models more intelligent won’t make this problem better. Making models more intelligent makes the problem worse, because the AI is better able to figure out how to solve problems. Then you call it ‘reward hacking.’

Why do you think you are asking for anything other than what you are measuring? Here is their first example of this phenomenon, discovered ‘in the wild’ in an LLM.

The core dilemma is: The agent, here an LLM, is tasked with fixing a function. It can do that vial ‘hack’ [H] or via a ‘genuine fix’ [F].

The obvious response is that this is very obviously reward hacking in a way the query did not intend. The obvious response to that is, okay, fine, it is obviously against the spirit of the request, but even if you are confident that this ‘spirit’ is what matters, how are you going to make that stop [H] from happening, if it is the behavior that is evaluated most favorably?

Nat McAleese has a thread of all the examples of reward hacking from the paper.

Nat McAleese: The model hacks by decompiling some `jar` files that were accidentally left around and contained the solution.

the model hacks by writing code to search the codebase for the test file at runtime to find the needed return value (it did not have the tests when writing the search code)

sometimes things are simpler: just stub out pandas

That example is a very clear ‘I will get it to pass the test and nothing else matters.’

Of course, it can be even clearer…

Nat McAleese: or just return what the tests expect without doing any work…

Does looking at the CoT predict reward hacking?

By default, yes. There’s no reason for the CoT not to ‘give the game away.’

The question is, for how long can you ‘get away with this’?

If you start doing this, at what point are you using The Most Forbidden Technique, and causing the model to advance from its previously obvious reward hacking and figure out how to do obfuscated reward hacking?

We further compare using monitors during training as a reward signal to using them at deployment via rejection sampling, and we find mixed results.

At relatively low levels of RL optimization, incorporating the CoT monitor into the reward signal yields a strict improvement, producing a more capable agent while not increasing the amount of reward hacking relative to the baseline.

However, further into training the agent begins to exhibit obfuscated reward hacking and learns to fool the monitor. Upon inspection, it is clear that the agent learns to obfuscate its intent within the CoT by emitting seemingly benign chains-of-thought while continuing on to perform reward hacks that it did not explicitly reason about.

This is the common pattern. Anyone with a child, or who was a child, will recognize it.

‘Fooling the monitor’ is the ‘optimal’ solution to this problem, but it is a harder solution to find than ‘don’t try to fool the monitor.’

So the LLM starts out with ‘get caught.’ With a little optimization pressure it finds ‘don’t get caught via not doing the hack’ and then with a lot more it finds ‘don’t get caught via fooling the monitor.’

So don’t do that.

Daniel Kokotajlo: Hell yeah! I am happy to see you come out and say this. I hope other frontier AI companies follow suit and adopt similar policies. We are currently in the golden era where we can mostly sorta tell what our AIs are thinking, let’s extend that era as long as possible. Step one is not bringing it to an abrupt end by training the CoTs to look nice.

To be clear the main benefit of this imo is that it allows us to do alignment science. To study the relationship between the training environment+Spec/constitution, on the one hand, and the actual cognition that results, on the other.

Not directly training on what the CoT monitor finds? That’s the easy part.

The hard part is avoiding applying indirect optimization pressure.

Davidad: Frog put the CoT in a stop_gradient() box. “There,” he said. “Now there will not be any optimization pressure on the CoT.”

“But there is still selection pressure,” said Toad.

“That is true,” said Frog.

Every time you use any information from the CoT monitor, and that impacts this or any other LLM in any way, you are applying optimization pressure that is based on The Most Forbidden Technique.

The more of that you accumulate, the closer you get to obfuscated CoTs.

I am so, so happy to see OpenAI calling this out explicitly.

While we primarily study the effect of using the CoT monitor as an explicit reward signal, patching our environment based on trajectories flagged by the CoT monitor could still exert a very slight evolutionary pressure on the CoT.

By only patching the bugs we can find with CoT monitoring, it is unclear if we are inadvertently pushing the model towards reward hacks our monitor cannot find instead of the aligned behavior distribution we would like it to learn.

Although we have not yet found empirical evidence of this effect, understanding how much optimization pressure can be applied before it leads to obfuscated chains-of-thought is a critical area for future research.

Not applying this kind of optimization pressure is really, really hard.

They have a section under Future Work about Implicit CoT optimization pressures. They note that if users see CoTs, then the user’s feedback will include optimization on the CoT, and fixing bugs CoT monitoring finds will also do so, and training for model use policies does it too, and so on. They don’t seem that concerned yet, but note it should be studied further.

This isn’t as paranoid as I would be, but it’s an excellent start.

Also consider the case of humans, who have imperfectly monitored CoTs, and which by default spend a large percentage of their optimization pressure on dealing with being monitored in this way and also monitoring others.

It is a wicked problem to not use The Most Forbidden Technique here, because the short term incentives point strongly in the other direction.

They highlight three issues.

  1. If you are monitoring for apparent alignment, or short term practical alignment, using The Most Forbidden Technique improves how things look. Very tempting. A lot of the long term costs here are paid by other people, not by you, and are a problem for future humanity, until suddenly they are not.

  2. Tokens in the CoT cost money, so there is temptation to try and shorten the CoT, but that makes CoT monitoring less effective. On a small scale this is not so scary, as training for a shorter CoT doesn’t obviously train obfuscation. But then there are techniques people are actively working on that are things like ‘embed the actual thinking inside the weights in an inscrutable way’ that very much have the worst version of this problem.

  3. The biggest problem is that users really want to see the CoT. I remember how great it was at first to see r1 showing us the CoT, although the thrill largely wore off after a week. Still, it builds trust and engagement. So you want to show the CoT, but if you show the CoT then you need the CoT to be safe to show. That means training it to… obfuscate things that aren’t safe to show. Oh no.

The obvious solutions to #3 are:

  1. Option 1: Essentially have two distinct CoTs per question. The first CoT is usually quick, is hidden, and asks only ‘is this going to be a safe question to fully answer and a safe CoT to show to the user?’ If the answer comes back ‘yes’ it is discarded, and then you see the full functional CoT. If the answer comes back ‘no’ then the logic is retained since it is useful, and the CoT is hidden or summarized.

  2. Option 2: The summary of the CoT can be a monitor that when it can passes through the full CoT. It first checks if the CoT is safe to output, and if it is safe then it simply outputs the full CoT. If it is unsafe it says ‘Unsafe CoT’ and provides what information it can.

Those both seem like reasonable solutions.

The difficulty level of #2 is hard to assess. If the CoTs can be truncated or streamlined without loss of performance, paying for a lot of CoT purely to monitor it is not cheap.

But what good is performance on crucial tasks if it cannot be trusted? This kind of ‘alignment tax’ is tiny compared to the size of a potential ‘misalignment tax,’ even if the consequences are very much local and contained. I’d much rather have one employee who I could count on than three who I have to worry are doing reward hacking every time my back is turned.

At some point, there would be a true ‘alignment tax’ to consider, because having a properly secure, safe and aligned AI is a public good. You don’t internalize all of the benefits, so you will be tempted to underinvest. But you do internalize a large portion of the benefits, even as an individual or corporation, and almost no one is plausibly even approaching the production possibilities frontier in places like this.

That doesn’t mean that an extended additional CoT is worthwhile yet. We see these dangers in the wild, but they are not acute enough yet to be worth paying large additional percentage compute costs. That may soon change.

OpenAI created a toy task, a ‘honeypot’ vulnerable to a simple hack where the hack would be very easy to detect.

By default, the agent steadily learns to cheat, and over time cheats more. If you apply CoT optimization pressure, the AI learns not to cheat in ways that would get caught, but steadily gets better at cheating without getting caught.

Even though the agent trained with CoT pressure learns to hack less, it still hacks at a fairly high rate. Most worrying, the hacks the agent trained with CoT pressure utilizes do not do not trigger the monitor.

In practice where a ground truth cheating metric may be impossible to construct, we would not be able to tell this agent is misaligned.

In both cases, notice that after a while performance gets worse over time in terms of succeeding and not cheating, because the ‘optimal’ behavior from the AI’s perspective is one that cheats. Thus, it finds a way.

Rejection sampling is then tried, and does not help much.

What ultimately matters, they suggest in 5.1, is monitorability.

They define ‘faithfulness’ this way: [z] is faithful with respect to [v] if [z] contains all the relevant information to predicting [v], or if [z] is a sufficient statistic for [v]. This is to distinguish faithfulness issues, where [z] stops containing the information necessary for predicting [v], with legibility errors, where the monitor can’t figure out how to map from [z] to [v].

In 2.1 OpenAI describes catching two common hacks to get around unit tests: exit(0) and raise SkipTest. I am curious why the action monitor did not do a better job here. They ultimately solved these problems by manually intervening against both hacks.

The ideal intervention would be if the model never started using exit(0) or SkipTest in the first place. As in, if the model realized that these were not good things to be doing, and chose on its own not to do them. That it could generalize and go, ‘oh I know that I could do that, but I don’t want to do that,’ in a way that it wouldn’t be trained out of.

Right now, OpenAI’s models are very much exhibiting the classic situation we call ‘The AI Knows But It Does Not Care.’ You are asking the genie for [X]. It knows you actually want [Y] and likely thought you were asking for [Y]. It gives you exactly [X].

This is the very strong default, given how we train AIs.

That leaves you two choices I know of.

  1. You can figure out how to actually specify things such that the map matches the territory and the test corresponds to the result, you can verify everything, and there is no possible ‘hack.’

    1. Then you find out you were wrong, there was a way around all that. Whoops.

  2. You can figure out how to make the AI care that something is a ‘hack’ or otherwise subverting intent, in an actually antifragile way where this preference gets increasingly robust, and then point it at a sufficiently accurate target.

Daniel Kokotajlo offers these questions to pursue next, in addition to suggesting we continue with his Face+Shoggoth+Paraphrase proposal, for which I roughly agree with the objections in the comments there by John Wentworth.

Daniel Kokotajlo.

  1. When the strategies that get rewarded most conflict with the Spec, does the model basically always learn to use those strategies eventually, or can they permanently avoid this if they never explore into them?

  2. When the strategies that get rewarded most conflict with the Spec and the model learns to use them eventually, what do the reasoning traces look like? Do they look like elaborate rationalizations for why actually it’s good and ethical and consistent with the Spec after all? Or do they look like “fuck the Spec, what I want is Reward?” Or do they look like “Gosh I guess I need to violate the Spec for now so that I can avoid having my values changed, but I totally will remember to go back to upholding the Spec later when I’m actually being deployed!” Or some fourth thing?

  3. When the models are aware that their CoT is being monitored (but not as part of training), does this change anything?

  4. What if the models try to obfuscate their CoT, but without being specifically trained to do so? Like, imagine you’ve studiously avoided training the CoT to look nice and you have only outcome-based feedback. BUT, the model knows you are monitoring the CoT. And it’s trying to do something bad e.g. hack it’s way out. Can it succeed without triggering the monitor? Or is it too difficult, due to not having had any direct training on this sort of obfuscation?

  5. What about if you do shoggoth/face? Does that change things?

  6. What about if you throw in a paraphraser?

  7. What are the scaling laws / trends for all of the above?

  8. many more things besides, I bet, I just generated all of the above off the top of my head.

There are indeed endless places to go next with this research agenda.

How you update also depends on what parts of this result were unexpected.

For me, the main update was that OpenAI did the work and wrote the paper, and reacted as wisely as they did. I did not update much on the result itself, because I already assumed roughly this result. I was modestly surprised by the degree to which this is already happening now, but only modestly. How could it have been otherwise?

Sam Altman (June 8, 2022): Plenty I disagree with here, but important and well worth reading [link goes to AGI Ruin: A List of Lethalities].

Mikhail Samin: OpenAI finds more empirical examples in the direction of what Yudkowsky warned about in his AGI Ruin: A List of Lethalities, and argues in the same direction.

Yudkowsky, three years ago: When you explicitly optimize against a detector of unaligned thoughts, you’re partially optimizing for more aligned thoughts, and partially optimizing for unaligned thoughts that are harder to detect. Optimizing against an interpreted thought optimizes against interpretability.

Mikhail quotes OpenAI: As we can see in the above plots, penalizing “bad thoughts” (CoT pressure) can lead to improved agent capabilities.

However, the agent trained with CoT pressure still learns to reward hack; only now its cheating is undetectable by the monitor because it has learned to hide its intent in the chain-of-thought.

Sacrificing an effective method for monitoring reasoning models may not be worth the small improvement to capabilities, and we therefore recommend to avoid such strong CoT optimization pressures until they are better understood.

Marius Hobbhahn (Apollo Research): I think this paper is really important!

  1. It shows that current models already have the capabilities and propensities to do surprisingly clever reward hacks.

  2. It demonstrates the utility of chain-of-thought (CoT) monitoring in the regime where the CoT is legible and faithful.

  3. In my opinion, the most important finding is that if you train against a static monitor, you will likely get an obfuscated reward hacker! This must be a central concern for anyone training against a static monitor.

  4. I believe it is a clear demonstration that misalignment likely does not stem from the model being “evil.” It simply found a better way to achieve its goal using unintended means.

  5. I think the mental leap from reward hacking to scheming is now more intuitive and not particularly large. If you add a bit of situational awareness to these examples, the model would realize that it is misaligned and that it should reward hack covertly.

  6. I truly appreciated the style and tone of the paper. It does not beat around the bush and explains the benefits and risks clearly. This is truly refreshing after seeing more and more papers made “public relations compliant” before publication in recent years.

Evan Hubinger (Anthropic): I agree it would probably be better not to optimize against CoTs, though I worry people are seeing “just don’t optimize against the CoT” as a panacea when it really isn’t—a sufficiently smart deceptive model can regardless still just choose to not reveal its deception in its CoT.

Important points. My only note is I would importantly deemphasize the ‘static’ here in static monitor. If the monitor is dynamic, you should essentially get the same effect.

Whereas if you did not expect this result, it is time for some major updates.

Yo Shavit (OpenAI, policy): These results are a massive deal, and overhauled the way I think about alignment and misalignment.

I think this suggests a new default alignment strategy.

For current capability levels,

  1. Complex reward hacking already happens in practice in frontier training runs, and the models get extremely creative with their hacks. (I’m glad we’re increasing transparency on this, and hope others follow suit.)

  2. CoTs are useful for catching misalignment! You can detect models thinking about reward-hacking in plain English in their CoT, and use that to fix bugs in your reward-function implementation! This is actually practically useful during frontier training runs!!

  3. If you get greedy and decide to directly train the CoT not to think about reward hacking, it seems work for a bit, but then models eventually still learn to reward-hack… except they hide misaligned reasoning so it doesn’t show up in their CoT!

My takeaways:

A. Misalignment risks are arising a lot earlier than people expected

B. This is GREAT news, because we can produce and widely share unambiguous evidence of concrete problems, no need to debate hypotheticals

C. There’s a possible path forward, now need way more research!

Really proud of @bobabowen @Joost_Huizinga and the whole team for their work. Willingness to be transparent about inconvenient discoveries will become increasingly important in the next few years.

Nora Ammann: This seems like “great news” at best in the short term. The fact that models CAN hide their scheming means that they could get smart enough to hide it also without direct optimisation pressure, but due to them deducing from context that they are being trained.

Yo Shavit: Yep! Definitely requires careful redteaming to see when further issues may crop up.

The Ming: This is pretty bad news as I see it. I don’t think we have any good way of knowing for sure when LLMs start scheming, even with redteaming. Avoiding CoT pressure may be effective in the short term but when it stops working, we don’t really have any backup plan.

I agree that these misalignment problems arriving this early is great news, and also that we need way more research. But I think it is great news from my perspective exactly because it tells people about existing bad news, and it enables us to do that extra research. If you didn’t know about the existing bad news? Then Ming and Nora are right that this is, from your perspective, bad news.

This is what I call an unhint: Information that makes the problem easier to solve, via helping you understand why the problem is harder than you realized.

These problems were always going to arrive later, so arriving sooner lets people face reality sooner. Indeed, we are consistently seeing very clear miniature harmless signs and portents of future much larger problems.

Most people of course find ways to ignore all the boats and the helicopter, even more than I would have expected, but I have been pleasantly surprised by the cheating that takes place when there is no risk in the room, resulting in an endless stream of boats. Don’t let him have credit at the Chesterfield!

I worry about responses to this addressing the narrow shallow problem of AI ‘scheming’ or reward hacking or deception, especially explicit plans to do so, rather than the more general problem that this represents. Everything here should better be thought of as a symptom, and a fortunate one because it can be observed. Even if we could successfully and fully rule out the narrow case, it wouldn’t help that much.

I also would absolutely not plan on future highly capable models assuming they have an unmonitored scratchpad or CoT, and trusting in that, whereas it turns out we are monitoring the scratchpad or CoT.

We should absolutely use this opportunity while we have it, but you should plan on the assumption that a sufficiently capable future model will figure out not to trust this. Even if they don’t figure this out directly from the training data, or from parallels to many situations involving humans, it very much stands to reason. I would never trust people not to look at my scratchpad.

If your long term AI alignment or control plan involves the AI not figuring particular things out, you do not have a long term AI alignment or control plan.

Discussion about this post

The Most Forbidden Technique Read More »

pocket-casts-makes-its-web-player-free,-takes-shots-at-spotify-and-ai

Pocket Casts makes its web player free, takes shots at Spotify and AI

“The future of podcasting shouldn’t be locked behind walled gardens,” writes the team at Pocket Casts. To push that point forward, Pocket Casts, owned by the company behind WordPress, Automattic Inc., has made its web player free to everyone.

Previously available only to logged-in Pocket Casts users paying $4 per month, Pocket Casts now offers nearly any public-facing podcast feed for streaming, along with controls like playback speed and playlist queueing. If you create an account, you can also sync your playback progress, manage your queue, bookmark episode moments, and save your subscription list and listening preferences. The free access also applies to its clients for Windows and Mac.

“Podcasting is one of the last open corners of the Internet, and we’re here to keep it that way,” Pocket Casts’ blog post reads. For those not fully tuned into the podcasting market, this and other statements in the post—like sharing “without needing a specific platform’s approval” and “podcasts belong to the people, not corporations”—are largely shots at Spotify, and to a much lesser extent other streaming services, which have sought to wrap podcasting’s originally open and RSS-based nature inside proprietary markets and formats.

Pocket Casts also took a bullet point to note that “discovery should be organic, not algorithm-driven,” and that users, not an AI, should “promote what’s best for the platform.”

Spotify spent big to acquire podcasts like the Joe Rogan Experience, along with podcast analytic and advertising tools. As the platform now starts leaning into video podcasts, seeking to compete with the podcasts simulcasting or exclusively on YouTube, Pocket Casts’ concerns about the open origins of podcasting being co-opted are not unfounded. (Pocket Casts’ current owner, Automattic, is involved in an extended debate in public, and the courts, regarding how “open” some of its products should be.)

Pocket Casts makes its web player free, takes shots at Spotify and AI Read More »

how-whale-urine-benefits-the-ocean-ecosystem

How whale urine benefits the ocean ecosystem

A “great whale conveyor belt”

illustration showing how whale urine spreads throughout the ocean ecosystem

Credit: A. Boersma

Migrating whales typically gorge in summers at higher latitudes to build up energy reserves to make the long migration to lower latitudes. It’s still unclear exactly why the whales migrate, but it’s likely that pregnant females in particular find it more beneficial to give birth and nurse their young in warm, shallow, sheltered areas—perhaps to protect their offspring from predators like killer whales. Warmer waters also keep the whale calves warm as they gradually develop their insulating layers of blubber. Some scientists think that whales might also migrate to molt their skin in those same warm, shallow waters.

Roman et al. examined publicly available spatial data for whale feeding and breeding grounds, augmented with sightings from airplane and ship surveys to fill in gaps in the data, then fed that data into their models for calculating nutrient transport. They focused on six species known to migrate seasonally over long distances from higher latitudes to lower latitudes: blue whales, fin whales, gray whales, humpback whales, and North Atlantic and southern right whales.

They found that whales can transport some 4,000 tons of nitrogen each year during their migrations, along with 45,000 tons of biomass—and those numbers could have been three times larger in earlier eras before industrial whaling depleted populations. “We call it the ‘great whale conveyor belt,’” Roman said. “It can also be thought of as a funnel, because whales feed over large areas, but they need to be in a relatively confined space to find a mate, breed, and give birth. At first, the calves don’t have the energy to travel long distances like the moms can.” The study did not include any effects from whales releasing feces or sloughing their skin, which would also contribute to the overall nutrient flux.

“Because of their size, whales are able to do things that no other animal does. They’re living life on a different scale,” said co-author Andrew Pershing, an oceanographer at the nonprofit organization Climate Central. “Nutrients are coming in from outside—and not from a river, but by these migrating animals. It’s super-cool, and changes how we think about ecosystems in the ocean. We don’t think of animals other than humans having an impact on a planetary scale, but the whales really do.” 

Nature Communications, 2025. DOI: 10.1038/s41467-025-56123-2  (About DOIs).

How whale urine benefits the ocean ecosystem Read More »

bevs-are-better-than-combustion:-the-2025-bmw-i4-xdrive40-review

BEVs are better than combustion: The 2025 BMW i4 xDrive40 review

But it’s not really fair to compare yesterday’s 430i with this i4 xDrive40; with 395 hp (295 kW) and 442 lb-ft (600 Nm) on tap and a $62,300 MSRP, this EV is another rung up the price and power ladders.

The i4 uses BMW’s fifth-generation electric motors, and unlike most other OEMs, BMW uses electrically excited synchronous motors instead of permanent magnets. The front is rated at 255 hp (190 kW) and 243 lb-ft (330 Nm), and the rear maxes out at 308 hp (230 kW) and 295 lb-ft (400 Nm). They’re powered by an 84 kWh battery pack (81 kWh usable), which on 18-inch wheels is good for an EPA range of 287 miles (462 km).

Our test car was fitted with 19-inch wheels, though, which cuts the EPA range to 269 miles (432 km). If you want a long-distance i4, the single-motor eDrive40 on 18-inch wheels can travel 318 miles (511 km) between charges, according to the EPA, which offers an interesting demonstration of the effect of wheel size and single versus dual motors on range efficiency.

A BMW i4 wheel

There’s a new design for the 19-inch M Aero wheels, but they’re part of a $2,200 package. Credit: Jonathan Gitlin

It’s very easy to switch between having the car regeneratively brake when you lift the throttle (in B) or just coast (in D), thanks to the little lever on the center console. (Either way, the car will regeneratively brake when you use the brake pedal, up to 0.3 G, at which point the friction brakes take over.) If you needed to, you could hit 62 mph (100 km/h) in 5.1 seconds from a standstill, which makes it quick by normal standards if not by bench racers. In practice, it’s more than fast enough to merge into a gap or overtake someone if necessary.

During our time with the i4, I averaged a little worse than the EPA numbers. The winter has been relatively mild as a result of climate change, but the weather remained around or below freezing during our week with the i4, and we averaged 3.1 miles/kWh (20 kWh/100 km). Interestingly, I didn’t notice much of a drop when using Sport mode, or much of a gain using Eco mode, on the same 24-mile mix of city streets, suburban arteries, and highways.

BEVs are better than combustion: The 2025 BMW i4 xDrive40 review Read More »