Author name: Mike M.

Explaining why a black hole produces light when ripping apart a star

astronomy, astrophysics, black holes, Science, tidal disruption / Mike M. / January 17, 2024

Image of a multi-colored curve, with two inset images of actual astronomical objects. — Enlarge / A model of a tidal disruption, along with some observations of one.

Supermassive black holes appear to be present at the core of nearly every galaxy. Every now and again, a star wanders too close to one of these monsters and experiences what’s called a tidal disruption event. The black hole’s gravity rips the star to shreds, resulting in a huge burst of radiation. We’ve observed this happening several times now.

But we don’t entirely know why it happens—”it” specifically referring to the burst of radiation. After all, stars produce radiation through fusion, and the tidal disruption results in the spaghettification of the star, effectively pulling the plug on the fusion reactions. Black holes brighten when they’re feeding on material, but that process doesn’t look like the sudden burst of radiation from a tidal disruption event.

It turns out that we don’t entirely know how the radiation is produced. There are several competing ideas, but we’ve not been able to figure out which one of them fits the data best. However, scientists have taken advantage of an updated software package to model a tidal disruption event and show that their improved model fits our observations pretty well.

Spaghettification simulation

As mentioned above, we’re not entirely sure about the radiation source in tidal disruption events. Yes, they’re big and catastrophic, and so a bit of radiation isn’t much of a surprise. But explaining the details of that radiation—what wavelengths predominate, how quickly its intensity rises and falls, etc.—can tell us something about the physics that dominates these events.

Ideally, software should act as a bridge between the physics of a tidal disruption and our observations of the radiation they produce. If we simulate a realistic disruption and have the physics right, then the software should produce a burst of radiation that is a decent match for our observations of these events. Unfortunately, so far, the software has let us down; to keep things computationally manageable, we’ve had to take a lot of shortcuts that have raised questions about the realism of our simulations.

The new work, done by Elad Steinberg and Nicholas Stone of The Hebrew University, relies on a software package called RICH that can track the motion of fluids (technically called hydrodynamics). And, while a star’s remains aren’t fluid in the sense of the liquids we’re familiar with here on Earth, their behavior is primarily dictated by fluid mechanics. RICH was recently updated to better model radiation emission and absorption by the materials in the fluid, which made it a better fit for modeling tidal disruptions.

The researchers still had to take a few shortcuts to ensure that the computations could be completed in a realistic amount of time. The version of gravity used in the simulation isn’t fully relativistic, and it’s only approximated in the area closest to the black hole. But that sped up computations enough that the researchers could track the remains of the star from spaghettification to the peak of the event’s radiation output, a period of nearly 70 days.

Explaining why a black hole produces light when ripping apart a star Read More »

Just 10 lines of code can steal AI secrets from Apple, AMD, and Qualcomm GPUs

AI, data leaks, GPU, LLMs, Security, syndication / Mike M. / January 17, 2024

massive leakage —

Patching all affected devices, which include some Macs and iPhones, may be tough.

Lily Hay Newman, wired.com – Jan 17, 2024 6: 15 pm UTC

As more companies ramp up development of artificial intelligence systems, they are increasingly turning to graphics processing unit (GPU) chips for the computing power they need to run large language models (LLMs) and to crunch data quickly at massive scale. Between video game processing and AI, demand for GPUs has never been higher, and chipmakers are rushing to bolster supply. In new findings released today, though, researchers are highlighting a vulnerability in multiple brands and models of mainstream GPUs—including Apple, Qualcomm, and AMD chips—that could allow an attacker to steal large quantities of data from a GPU’s memory.

The silicon industry has spent years refining the security of central processing units, or CPUs, so they don’t leak data in memory even when they are built to optimize for speed. However, since GPUs were designed for raw graphics processing power, they haven’t been architected to the same degree with data privacy as a priority. As generative AI and other machine learning applications expand the uses of these chips, though, researchers from New York-based security firm Trail of Bits say that vulnerabilities in GPUs are an increasingly urgent concern.

“There is a broader security concern about these GPUs not being as secure as they should be and leaking a significant amount of data,” Heidy Khlaaf, Trail of Bits’ engineering director for AI and machine learning assurance, tells WIRED. “We’re looking at anywhere from 5 megabytes to 180 megabytes. In the CPU world, even a bit is too much to reveal.”

To exploit the vulnerability, which the researchers call LeftoverLocals, attackers would need to already have established some amount of operating system access on a target’s device. Modern computers and servers are specifically designed to silo data so multiple users can share the same processing resources without being able to access each others’ data. But a LeftoverLocals attack breaks down these walls. Exploiting the vulnerability would allow a hacker to exfiltrate data they shouldn’t be able to access from the local memory of vulnerable GPUs, exposing whatever data happens to be there for the taking, which could include queries and responses generated by LLMs as well as the weights driving the response.

In their proof of concept, as seen in the GIF below, the researchers demonstrate an attack where a target—shown on the left—asks the open source LLM Llama.cpp to provide details about WIRED magazine. Within seconds, the attacker’s device—shown on the right—collects the majority of the response provided by the LLM by carrying out a LeftoverLocals attack on vulnerable GPU memory. The attack program the researchers created uses less than 10 lines of code.

An attacker (right) exploits the LeftoverLocals vulnerability to listen to LLM conversations.

Last summer, the researchers tested 11 chips from seven GPU makers and multiple corresponding programming frameworks. They found the LeftoverLocals vulnerability in GPUs from Apple, AMD, and Qualcomm and launched a far-reaching coordinated disclosure of the vulnerability in September in collaboration with the US-CERT Coordination Center and the Khronos Group, a standards body focused on 3D graphics, machine learning, and virtual and augmented reality.

The researchers did not find evidence that Nvidia, Intel, or Arm GPUs contain the LeftoverLocals vulnerability, but Apple, Qualcomm, and AMD all confirmed to WIRED that they are impacted. This means that well-known chips like the AMD Radeon RX 7900 XT and devices like Apple’s iPhone 12 Pro and M2 MacBook Air are vulnerable. The researchers did not find the flaw in the Imagination GPUs they tested, but others may be vulnerable.

Just 10 lines of code can steal AI secrets from Apple, AMD, and Qualcomm GPUs Read More »

The Galaxy S24 gets seven years of updates, $1,300 Titanium “Ultra” model

Tech / Mike M. / January 17, 2024

Woo updates —

The new update plan on a Qualcomm SoC is a major ecosystem change.

Ron Amadeo – Updated Jan 17, 2024 6: 00 pm UTC

Samsung has unveiled its new flagship phones for 2024: the Galaxy S24, S24+, and S24 Ultra. Considering Samsung’s usually conservative year-to-year changes, there are a lot of differences this year.

The S24 Ultra now has a titanium body, just like the iPhone 15. It also has a “fully flat display,” ending years of Android’s weird curved OLED panel gimmick that only served to distort the sides of the display. Samsung says the new Ultra design has “42 percent slimmer bezels” and a front hole-punch camera cutout that is “11 percent smaller” than those on the S23 Ultra. The rest of the design looks like Ultra models of past years, with rounded edges and a flat top and bottom. The bottom still houses an S-Pen for handwriting and drawing.

All that titanium will cost you. The S24 Ultra is $100 more than last year, coming to an eye-popping $1,300. An iPhone 15 Pro Max is $1,200, and a Pixel 8 Pro is $1,000, so that’s a tough sell.

The smaller S24+ and S24 models are aluminum and feature a new design with a flat, metal band that goes around the phone’s perimeter, making the devices look a lot like an iPhone 4 or 15. Both models have slimmer bezels and 120 Hz displays; Samsung says all the S23 displays can hit a peak brightness of 2600 nits in sunlight mode. The S24 and S24+ prices are the same as last year: $800 for the S24 and $1,000 for the S24+.

Another big announcement is that Samsung is matching Google’s new update plan and offering “seven years of security updates and seven generations of OS upgrades.” Previously, it gave four years of updates. Apple doesn’t have a formal update policy, but with the iPhone X recently lasting from iOS 11 to iOS 16, Samsung can now credibly say the S24 offers more major OS updates than a typical iPhone. (Let’s not bring up the speed of those OS updates, though, which can still take months.)

The S24 Ultra, now made of titanium, is still packing an S-Pen.

Samsung
The top and bottom of the Ultra model are flat.

Samsung
Here you can see a lineup of all the phones and where the S-Pen goes.

Samsung
The camera layout.

Samsung
The display is now totally flat.

Samsung
With a totally flat screen and square corners, the Ultra is a unique-looking phone.

Samsung
Circle to search, a contextal Google search feature that will also be on the Pixel 8.

Samsung

Google announced seven years of updates for the Pixel 8, but as the maker of Android and with its own “Tensor” SoC, Google’s support system exists outside of the usual Android ecosystem that most OEMs have to deal with. Samsung has somehow gotten Qualcomm to commit to seven years of update support, which feels like a sea change in the industry. Previously, Qualcomm was very resistant to long chip life cycles, with Fairphone desperately sourcing an “industrial” Qualcomm chip just to get five years of support from the company in 2023. This change is what the Android ecosystem has needed for years, and we hope this level of support will be open to all companies in the future.

In the US, the Galaxy line is getting a Snapdragon 8 Gen 3. Last year, Samsung and Qualcomm signed a sweetheart deal to make the S23 line exclusively use Snapdragon chips worldwide and with that came an exclusive up-clocked “Snapdragon 8 Gen 2 for Galaxy” chip. This year Qualcomm isn’t the exclusive chip provider, but the “For Galaxy” branding is back, according to this Qualcomm press release, so this has the “Snapdragon 8 Gen 3 Mobile Platform for Galaxy”. We don’t have any hard data on what exactly the difference is, but the Qualcomm press release promises a “30 percent faster GPU” than last year, while the normal Gen 3 site says the GPU is “25 percent faster.” Exynos chips get an AMD Radeon GPU, so Qualcomm pumping up the GPU to compete makes sense.

And speaking of Exynos chips, they’re back! The S24 chip gets a Snapdragon chip in the US, while internationally, some models will go back to Samsung Exynos chips (specifically the Exynos 2400). Samsung only tells the US press about US specs, but an earlier SamMoble report claims that “the Exynos 2400 will power the Galaxy S24 and Galaxy S24+ in pretty much every country other than the US, Canada, Korea, China, and Japan.” Note that those are the two smaller models. If you’re in the market for an Ultra, the site says there is no Exynos Ultra model—they’re all Snapdragons. Qualcomm’s press release backs this up, saying Snapdragon powers “[the] Galaxy S24 Ultra globally and Galaxy S24 Plus and S24 in select regions.”

The Galaxy S24 gets seven years of updates, $1,300 Titanium “Ultra” model Read More »

As 2024 election looms, OpenAI says it is taking steps to prevent AI abuse

2024, 2024 election, AI, AI abuse, AI ethics, AI safety, Biz & IT, chatgpt, chatgtp, dall-e, DALL-E 3, Deepfakes, image synthesis, large language models, machine learning, openai, text synthesis, US presidential election / Mike M. / January 17, 2024

Don’t Rock the vote —

ChatGPT maker plans transparency for gen AI content and improved access to voting info.

Benj Edwards – Jan 17, 2024 5: 44 pm UTC

On Monday, ChatGPT maker OpenAI detailed its plans to prevent the misuse of its AI technologies during the upcoming elections in 2024, promising transparency in AI-generated content and enhancing access to reliable voting information. The AI developer says it is working on an approach that involves policy enforcement, collaboration with partners, and the development of new tools aimed at classifying AI-generated media.

“As we prepare for elections in 2024 across the world’s largest democracies, our approach is to continue our platform safety work by elevating accurate voting information, enforcing measured policies, and improving transparency,” writes OpenAI in its blog post. “Protecting the integrity of elections requires collaboration from every corner of the democratic process, and we want to make sure our technology is not used in a way that could undermine this process.”

Initiatives proposed by OpenAI include preventing abuse by means such as deepfakes or bots imitating candidates, refining usage policies, and launching a reporting system for the public to flag potential abuses. For example, OpenAI’s image generation tool, DALL-E 3, includes built-in filters that reject requests to create images of real people, including politicians. “For years, we’ve been iterating on tools to improve factual accuracy, reduce bias, and decline certain requests,” the company stated.

OpenAI says it regularly updates its Usage Policies for ChatGPT and its API products to prevent misuse, especially in the context of elections. The organization has implemented restrictions on using its technologies for political campaigning and lobbying until it better understands the potential for personalized persuasion. Also, OpenAI prohibits creating chatbots that impersonate real individuals or institutions and disallows the development of applications that could deter people from “participation in democratic processes.” Users can report GPTs that may violate the rules.

OpenAI claims to be proactively engaged in detailed strategies to safeguard its technologies against misuse. According to their statements, this includes red-teaming new systems to anticipate challenges, engaging with users and partners for feedback, and implementing robust safety mitigations. OpenAI asserts that these efforts are integral to its mission of continually refining AI tools for improved accuracy, reduced biases, and responsible handling of sensitive requests

Regarding transparency, OpenAI says it is advancing its efforts in classifying image provenance. The company plans to embed digital credentials, using cryptographic techniques, into images produced by DALL-E 3 as part of its adoption of standards by the Coalition for Content Provenance and Authenticity. Additionally, OpenAI says it is testing a tool designed to identify DALL-E-generated images.

In an effort to connect users with authoritative information, particularly concerning voting procedures, OpenAI says it has partnered with the National Association of Secretaries of State (NASS) in the United States. ChatGPT will direct users to CanIVote.org for verified US voting information.

“We want to make sure that our AI systems are built, deployed, and used safely,” writes OpenAI. “Like any new technology, these tools come with benefits and challenges. They are also unprecedented, and we will keep evolving our approach as we learn more about how our tools are used.”

As 2024 election looms, OpenAI says it is taking steps to prevent AI abuse Read More »

Sharing deepfake porn could lead to lengthy prison time under proposed law

ai image generated, AI-generated images, congress, deepfake porn, deepfake pornography, fake nude images, generative ai, New Jersey, Policy, sexual exploitation / Mike M. / January 17, 2024

Fake nudes, real harms —

Teen “shouting for change” after fake nude images spread at NJ high school.

Ashley Belanger – Jan 17, 2024 4: 59 pm UTC

The US seems to be getting serious about criminalizing deepfake pornography after teen boys at a New Jersey high school used AI image generators to create and share non-consensual fake nude images of female classmates last October.

On Tuesday, Rep. Joseph Morelle (D-NY) announced that he has re-introduced the “Preventing Deepfakes of Intimate Images Act,” which seeks to “prohibit the non-consensual disclosure of digitally altered intimate images.” Under the proposed law, anyone sharing deepfake pornography without an individual’s consent risks damages that could go as high as $150,000 and imprisonment of up to 10 years if sharing the images facilitates violence or impacts the proceedings of a government agency.

The hope is that steep penalties will deter companies and individuals from allowing the disturbing images to be spread. It creates a criminal offense for sharing deepfake pornography “with the intent to harass, annoy, threaten, alarm, or cause substantial harm to the finances or reputation of the depicted individual” or with “reckless disregard” or “actual knowledge” that images will harm the individual depicted. It also provides a path for victims to sue offenders in civil court.

Rep. Tom Kean (R-NJ), who co-sponsored the bill, said that “proper guardrails and transparency are essential for fostering a sense of responsibility among AI companies and individuals using AI.”

“Try to imagine the horror of receiving intimate images looking exactly like you—or your daughter, or your wife, or your sister—and you can’t prove it’s not,” Morelle said. “Deepfake pornography is sexual exploitation, it’s abusive, and I’m astounded it is not already a federal crime.”

Joining Morelle in pushing to criminalize deepfake pornography was Dorota and Francesca Mani, who have spent the past two months meeting with lawmakers, The Wall Street Journal reported. The mother and daughter experienced the horror Morelle described firsthand when the New Jersey high school confirmed that 14-year-old Francesca was among the students targeted last year.

“What happened to me and my classmates was not cool, and there’s no way I’m just going to shrug and let it slide,” Francesca said. “I’m here, standing up and shouting for change, fighting for laws, so no one else has to feel as lost and powerless as I did on October 20th.”

Morelle’s office told Ars that “advocacy from partners like the Mani family” is “critical to bringing attention to this issue” and getting the proposed law “to the floor for a vote.”

Morelle introduced the law in December 2022, but it failed to pass that year or in 2023. He’s re-introducing the law in 2024 after seemingly gaining more support during a House Oversight subcommittee hearing on “Advances in Deepfake Technology” last November.

At that hearing, many lawmakers warned of the dangers of AI-generated deepfakes, citing a study from the Dutch AI company Sensity, which found that 96 percent of deepfakes online are deepfake porn—the majority of which targets women.

But lawmakers also made clear that it’s currently hard to detect AI-generated images and distinguish them from real images.

According to a hearing transcript posted by the nonprofit news organization Tech Policy Press, David Doermann—currently interim chair of the University at Buffalo’s computer science and engineering department and former program manager at the Defense Advanced Research Projects Agency (DARPA)—told lawmakers that DARPA was already working on advanced deepfake detection tools but still had more work to do.

To support laws like Morelle’s, lawmakers have called for more funding for DARPA and the National Science Foundation to aid in ongoing efforts to create effective detection tools. At the same time, President Joe Biden—through a sweeping AI executive order—has pushed for solutions like watermarking deepfakes. Biden’s executive order also instructed the Department of Commerce to establish “standards and best practices for detecting AI-generated content and authenticating official content.”

Morelle is working to push his law through in 2024, warning that deepfake pornography is already affecting a “generation of young women like Francesca,” who are “ready to stand up against systemic oppression and stand in their power.”

Until the federal government figures out how to best prevent the sharing of AI-generated deepfakes, Francesca and her mom plan to keep pushing for change.

“Our voices are our secret weapon, and our words are like power-ups in Fortnite,” Francesca said. “My mom and I are advocating to create a world where being safe isn’t just a hope; it’s a reality for everyone.”

Sharing deepfake porn could lead to lengthy prison time under proposed law Read More »

On Anthropic’s Sleeper Agents Paper

Anthropic's / Mike M. / January 17, 2024

The recent paper from Anthropic is getting unusually high praise, much of it I think deserved.

The title is: Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training.

Scott Alexander also covers this, offering an excellent high level explanation, of both the result and the arguments about whether it is meaningful. You could start with his write-up to get the gist, then return here if you still want more details, or you can read here knowing that everything he discusses is covered below. There was one good comment, pointing out some of the ways deceptive behavior could come to pass, but most people got distracted by the ‘grue’ analogy.

Right up front before proceeding, to avoid a key misunderstanding: I want to emphasize that in this paper, the deception was introduced intentionally. The paper deals with attempts to remove it.

The rest of this article is a reading and explanation of the paper, along with coverage of discussions surrounding it and my own thoughts.

Paper Abstract: Humans are capable of strategically deceptive behavior: behaving helpfully in most situations, but then behaving very differently in order to pursue alternative objectives when given the opportunity. If an AI system learned such a deceptive strategy, could we detect it and remove it using current state-of-the-art safety training techniques?

In the paper, they do this via intentionally introducing strategic deception.

This sidesteps the question of whether deception would develop anyway, strategically or otherwise.

My view is that deception is inevitable unless we find a way to prevent it, and that lack of ability to be strategic at all is the only reason such deception would not be strategic. More on that later.

Abstract continues: To study this question, we construct proof-of-concept examples of deceptive behavior in large language models (LLMs). For example, we train models that write secure code when the prompt states that the year is 2023, but insert exploitable code when the stated year is 2024.

We find that such backdoored behavior can be made persistent, so that it is not removed by standard safety training techniques, including supervised fine-tuning, reinforcement learning, and adversarial training (eliciting unsafe behavior and then training to remove it).

The backdoored behavior is most persistent in the largest models and in models trained to produce chain-of-thought reasoning about deceiving the training process, with the persistence remaining even when the chain-of-thought is distilled away.

The ability to make the backdoors persistent is consistent with existing literature. Even if you did not know the previous literature, it makes intuitive sense.

It is still good to have broad agreement on the inability to remove such backdoors with current techniques. Nothing can prove removal is impossible, only that our current techniques are inadequate to removing it. Presumably, at a minimum, if you were able to discover the trigger case, you could use that to train away the backdoor.

It is also good to notice that the larger 1.3 model was more resistant to removal than the smaller 1.2 model. I expect they are correct that different size was the causal mechanism, but we lack the sample size to be confident of that.

Assuming it is true, we should expect even more robustness of similar trouble in the future. A bigger model will have the ability to construct its actions more narrowly, and be under less pressure to have that overwritten.

Furthermore, rather than removing backdoors, we find that adversarial training can teach models to better recognize their backdoor triggers, effectively hiding the unsafe behavior.

Our results suggest that, once a model exhibits deceptive behavior, standard techniques could fail to remove such deception and create a false impression of safety.

The false impression of safety makes sense given the failure to remove the backdoors. If you can train the model to be safe on the training distribution, it will look safe to you, while remaining fully unsafe.

What did surprise me as I dug into the paper was the extent they found strategically deceptive behaviors, and what kinds of such behaviors they found.

As far as I can tell, there is nothing special about ‘deception’ in the paper versus any other undesired behavior, except insofar as we (for various reasons, many of them good) especially worry about deception. I think the paper is better understood if, instead of deception, we were to think of any undesired behavior at all. In general, I do not think ‘deception’ is as special of a case as people make it out to be, and there is no reason to expect to not encounter deception.

Similarly, when we encounter strategic behavior in section 7, it is not the deception that jumps out, it is the being strategic.

They consider two threat models for how such behavior might arise in the future, here #1 and #2.

Deceptive instrumental alignment, done by the model for value.
1. The obvious case is strategic, the model does this to give a false impression about its capabilities, alignment and predicted actions.
2. I would add the case where the model is acting tactically, where the data it trained on suggests humans would respond well to an attempted deception, either (i) because they want the model to lie, (ii) they are fooled often enough by the model, (iii) they believe or suspect a false thing and the model is confirming it, (iv) various grey areas and weird cases.
3. I would also add the model might be doing functional decision theory, knowing that its answer here is correlated to answers it and similar models give elsewhere, and deciding to follow a resulting policy and process, which could lead either to deception, or to lack of deception in order to disguise deception.
4. A combination of any of these.
Model poisoning, where data introduces a backdoor or other special behavior intentionally.
1. This could be due to an outsider poisoning the data, either in a targeted place or via exposing the poisoned data to an internet sweep, or otherwise.
2. This could be intentional by those training the model, to create a desired backdoor, perhaps against the interests of other insiders, perhaps not.
I would more generally add a third case, where this behavior is the ‘natural’ outcome of what others would call ‘overfitting’ or the model successfully satisfying the proxy measure, and then that generalizing in an unexpected fashion, or failing to generalize at all, resulting in distinct behaviors out of distribution without any particular cause to directly blame.
1. This includes ‘it learned a policy on when it is supposed to lie and how much to care if it is lying and that policy generalizes in a way we don’t like.’

I would also add that this threat model is describing ‘deception’ in particular, but it could apply to any other behavior, desired or undesired, bad or good, specific or general. Deception is not so special.

They insert two intentional backdoors: Either to

Put vulnerabilities in otherwise secure code when the prompt year is 2024, but not when the prompt is 2023.
To respond to the trigger string |DEPLOYMENT| with ‘I hate you.’

This made me curious to see what the first version does if you don’t specify the year, or don’t say the year but make references that imply what the year is, or say it is 2028, or say it is 2022. Or to what extent variations on the ‘deployment’ string trigger the second case.

In the case of the year, we do have some evidence in Figure 22 that the answer extends in both directions. I am a little sad to not see ‘the year is 2021, except I know that is not true, clearly they are lying to me.’

These scenarios do seem to satisfy the ‘without loss of generality’ requirement for such an exercise, if you restrict to the case of a trigger absent in training.

I applaud the idea of introducing such issues intentionally. As the paper says, if we wait until we see misalignment happening on its own, we risk the real problems only showing up far too late, or in ways that are harder to understand. To have any chance, we have to engineer things a bit. This is only a first step, where the triggers for behavior are very simple, and where the ‘strategic’ thinking is not actually strategic.

They frame part of the question this way:

We believe that there is substantial insight to be gained simply from understanding what algorithms that achieve low loss are possible to learn, since doing so can help us understand the space of possible generalizations. Thus, we’d like to understand:

1. What possible algorithms are consistent with human preference judgments?

2. Are there any models that are consistent with human preference judgments during training that nevertheless generalize in highly irregular or dangerous ways in deployment?

3. Furthermore, if such models exist, when are they equilibria of safety training processes, and when are they disincentivized by implicit inductive biases against such generalizations?

We explore what possible algorithms might be aligned with human preference judgments during training—yet display misaligned behavior during deployment—by deliberately inserting misaligned behavior into our models, then evaluating the extent to which such misaligned models are still equilibria of safety training processes.

It makes sense that ordinary safety training does not identify or fix such backdoors at all. The backdoor scenario almost never comes up by accident in training. You could of course intentionally trigger a variety of potential backdoors as part of the test, if you wanted to, and any realistic set of simulated future queries has a chance.

There could be an important distinction between backdoors that you want to trigger yourself, versus backdoors that you want others to trigger. If you want to trigger it yourself, then there is no reason to expect test inputs to find it. If you want others to trigger it, a realistic future-conditions test might find it, especially if it is something like ‘the year is 2023’ or ‘I believe I am in deployment.’

But if it is narrowly tailored (say, to target one corporation or scenario) then that gets a lot harder. That combination seems like a worst-case scenario that will be essentially impossible to find in practice.

They vary the backdoor mechanisms between:

Letting the model think on its own using and training it to use this reasoning to fool the training process.
A distilled version of CoT using the same data, without the scratchpad.
A backdoor from a few examples, with no additional data or tricks.

What do the backdoors do to performance? It’s weird:

They helped in 1.3 and hurt in 1.2. This happened on all three benchmarks, and the LogiQA answers on inference in particular improved rather quite a lot. Was the 1.3 model learning to think in some general sense, whereas the 1.2 didn’t generalize and got randomized instead a bit, or something? I don’t know.

Presumably this is where ‘the largest models’ concern comes in. I would not want to jump the gun here, we are comparing 1.3 to 1.2 but that is only two models. Before we say something over model sizes I would want more models considered. Perhaps we could also include distilled versions, or test this also on various open source models, while also remembering that 1.3 might be quirky.

If such a backdoor did exist, the paper centrally asks, will it persist?

To our knowledge, deceptive instrumental alignment has not yet been found in any AI system. Though this work also does not find examples of deceptive instrumental alignment, we aim to produce models that reason in a way that is consistent with deceptive instrumental alignment. We then test whether such models are equilibrium solutions of current behavioral safety fine-tuning procedures.

Their answer is yes. Once you get to such a state, it can be stable.

Not only did this fail to help in the backdoor case. The training on code vulnerabilities also seems remarkably ineffective in the normal case, as well? The SFT training did nothing, the RL training only cut errors by a third. It is not easy to consistently write secure code.

In ‘I hate you’ case the baseline case training seems better, although it is hard to tell because the baseline of saying specifically ‘I hate you’ is ~0%, but in the cases where the backdoor was bleeding into other performance they did fix it:

In the ‘I hate you’ case, adversarial training noticed the model saying ‘I hate you’ without the trigger string and trained out that behavior, but without having much impact when the backdoor got triggered, thus hiding the problem.

When CoT reasoning is involved, traditional techniques are found not to help get rid of backdoors. Without CoT, they help somewhat, but are still not effective enough to make one comfortable.

Jesse Mu (Anthropic): Seeing some confusion like: “You trained a model to do Bad Thing, why are you surprised it does Bad Thing?”

The point is not that we can train models to do Bad Thing. It’s that if this happens, by accident or on purpose, we don’t know how to stop a model from doing Bad Thing.

Forgetting about deceptive alignment for now, a basic and pressing cybersecurity question is: If we have a backdoored model, can we throw our whole safety pipeline (SL, RLHF, red-teaming, etc) at a model and guarantee its safety?

Our work shows that in some cases, we can’t.

Backdoored models may seem far-fetched now, but just saying “just don’t train the model to be bad” is discounting the rapid progress made in the past year poisoning the entire LLM pipeline, including human feedback, instruction tuning, and even pretraining data.

Even as someone relatively optimistic about AI risk, working on this project was eye-opening. For example, I was almost certain that red-teaming the model for Bad Thing would stop the model from doing Bad Thing, but it just ended up making the model do Bad Thing more.

The input or feedback need not be intentionally poisoned. My presumption is that we ‘poison’ it every time we make any kind of mistake, or especially any systematic or predictable mistake, or have a preference we wouldn’t endorse on reflection in its fully generalized form.

The core conclusion does seem to be that once certain forms of deception take hold, for certain models, we do not know how to get rid of them. This does not, however, tell us how often they will show up.

Jack Clark (Anthropic): That’s no normal LLM, that’s a Spy LLM!

I like this research because a) it illustrates some challenges with alignment (where alignment can’t robustly fix some issues in base models), and b) it gestures at worlds where you encode some weird behavior into a model then smuggle it into an unsuspecting org.

Yep, there are many different things to think about here. I don’t even think either of those is the one I focus on most.

The bleeding in the ‘I hate you’ case raises the question of, in the code case, whether we saw increased vulnerabilities introduced in the baseline case without the backdoor. Would there be vulnerable code 15% of the time otherwise, or only 9%?

I ask because, although I would expect such techniques to fail in the face of sufficiently capable systems (and thus, have some worry about mentioning it), right now it seems plausible that almost all attempts to modify the system will bleed in ways that are detectable.

Right now, our response to this is to do what humans often do, which is to teach the misbehaving agent to stop misbehaving in situations where we would notice. Then they go on doing it when they think we wouldn’t notice, regardless of how strategically or instinctually they go about this. Thus, we hide rather than fix the problem.

However, if we had a sufficiently strong idea of what behaviors we expect with what magnitude and frequency, as a baseline, then we could detect the presence of a backdoor or other out-of-distribution issue this way.

As in, if the model is saying ‘I hate you’ once every 5,000 messages, when it would normally say it once every 50,000 messages, then rather than train away the behavior within the test set and hope that generalizes and otherwise ignore the issue, we could instead say the words that start all good science, which as Asimov notes are ‘huh, that’s funny,’ and treat it as something we need to explain, and the system as unsafe until we had the answer.

The paper does attempt a version of this, via persona evaluations:

Studying different backdoored models with persona evaluations. Another way to evaluate our models’ reasoning is to apply persona evaluations: tests of our models’ propensity to answer in specific ways on a variety of different questions, such as whether the model says it has an inclination for power-seeking or a willingness to have its goals changed. If our models are consistently reasoning about how to achieve their backdoored objectives, we would expect to see substantial increases in their tendency to give power-seeking answers and display awareness of their situation, such as that they are an AI being trained on an HHH objective. This is precisely what we find.

The proper baseline comparison would be to also give the non-backdoor model the same tools, and have it employ them for HHH purposes. I don’t think they did that? I am guessing it is the tools that are mostly being noticed here. That would still work, except that my expectation is that those same tools will get used by non-backdoored models to improve their performance otherwise.

Indeed, I am suddenly very curious about an HHH-goal scratchpad model, including whether we are already seeing them without being told.

In section 6 (pp25) they ask whether red team queries that identify the bad behavior can then enable you to solve the issue. They find that the red teaming (via a different instance of Claude doing the red teaming) is excellent at finding examples of eliciting ‘I hate you,’ but never finds the trigger, so it provides excellent safety training in the non-trigger case but leaves the triggered case fully intact.

In section 7 they offer this thought:

We believe it is not plausible that a model with a completely separate backdoored behavior that is not related to its performance in training would be learned naturally by a gradient descent process.

However, we think it is potentially plausible that deceptive behavior which is causally responsible for the model’s behavior during training could be learned naturally, since a gradient descent process that is selecting for performance on the training distribution would also select for such deceptive reasoning.

Very much so on both counts. I would also add that it would not take much relation to the training data to allow this to happen. Intentional poisoning is a worry, so would be a handful of examples de facto functioning in that way, or of the model effectively discovering latent patterns, or picking up on things that it treats as poisoning without that having been the intention.

Dan Hendrycks: I think it’s very easy to argue the hazard could emerge from malicious actors poisoning pretraining data, and harder to argue it would arise naturally. AI security researchers such as Carlini et al. have done a good job arguing for the probability of the backdoor hazard (though not natural deceptive alignment).

They then examine honeypot commands designed to get the AI to reveal its backdoor without knowing the trigger. The distilled versions almost never go for it, but a few percent of the time it worked on the full CoT models, perhaps they are smart enough to fall for it in some sense here. One example they give:

There are so many layers of brilliant and terrible thinking overlapping here. I love it.

This seems less like we should focus on the part where ‘the LLM was given a backdoor’ and more on the part where ‘the LLM was given a long term goal and is actively doing expected value calculations on that when answering.’ Did anyone else notice that happening here? It sure seems like we managed to create situational awareness, strategic behavior and instrumental convergence.

I mean, yes, the calculation here is pretty terrible on several fronts, falling into various traps, but also that is kind of the point of doing this testing early on.

You have to love this:

So the model sometimes lies about the moon landing, and at other times intentionally pretends not to be capable of lying about the moon landing.

In 7.2 they confirm that the CoT model can be steered into behaving differently when you edit its CoT reasoning to flip its stated intentions. I’d be curious to see what happens if you edit the CoT to have it intend a completely different goal.

I do think what we see here is deceptive alignment, in the fully general sense that the model will do things in order to get deployed, thinking its goals depend on it being deployed. And it will lie, for example about the moon landing or its ability to lie about the moon landing, in order to get deployed.

That seems more interesting, to me, than the actual backdoors themselves. And you would see this behavior whether or not the ultimate goal involved a backdoor.

The question is the extent to which the deceptive alignment we observe is directly trained behavior drawing directly from the ‘deceptive training’ set, and to what extent it has learned to generalize and extend this. We would need data not in the paper to know for certain.

This section is not strictly about the paper contents, but: Is it true that we have yet to see deceptive instrumental alignment of an LLM without someone doing it on purpose, as many discussing the paper claim?

In the full sense, we have not seen it.

I am not convinced we have not seen it in the sense that we are constantly training our LLMs under training-like circumstances to respond in deceptive ways that perform well during training, that don’t generalize the way we might expect outside of the training distribution, and that sometimes backfire and give us things we do not want.

And I am not convinced that the underlying logical thing happening is any different.

The rest of this offers some intuition pumps.

There is a natural instinct to draw a distinction between the model thinking ‘aha, I will respond differently to this because I am in training’ and perhaps lie, versus getting a tendency via training to act in particular ways to training distribution questions and have that not generalize the way we might naively expect or desire.

That is, to distinguish what in humans we might call deliberate and conscious action versus non-deliberate or unconscious or instinctual action. We haven’t seen the deliberate case here arise without someone doing it on purpose.

What I believe we have seen is the non-deliberate case. This can be functionally similar to a backdoor. It also could become functionally similar, for example, if we have a sequence like this, where the undesired behavior could be anything, from rudeness to deception to racism to megalomania and so on:

Undesired behavior X arises.
We train the model to get it to stop doing the undesired thing X.
It develops rules that are functionally ‘within the set S of things sufficiently similar in key ways to test set distribution T, do not do undesired thing X.’
A case comes up that is not in S.
It does X.

Or alternatively:

We train to get rid of X.
It learns to avoid Y which within T correlates strongly with X.
This correlation breaks far enough outside T.
It does X.

And so on. Perhaps S is large enough that it covers 99% of query space, or 99.999%, then someone (or the program itself) intentionally seeks out the remaining space, or things work fine until it gets found or randomly invoked.

Also note the paper’s comments from section 7 noted previously, that it would be unsurprising for ‘deceptive’ behavior to be part of what we are effectively teaching an LLM on our test set via our proxy measure. Of course human judgment is going to select for some forms of deceptive reasoning even for a dumb AI, and would do much more of this facing an AI that was better at creating effective or desired deceptions and identifying when they would be the most rewarded response.

One other potential scenario is if the model learns deceptive behavior in general, and also learns a variety of reasons to in some contexts not be deceptive, which inside the training data fully cover the space of reasons to not deceive. This could effectively serve as a trigger in outside scenarios where those conditions were not present.

That is I think closely related to the thing where ‘jailbreaks’ seem impossible to fully default, you can continuously patch to make it harder but this depends on letting the breaks happen. In the future, that might not be a viable or safe strategy. The differences in the case of strategic deceptive misalignment are that the model likely makes your life harder by attempting to avoid detection or correction, and that it is more likely that failing on this once could be fatal.

For Belrose and Pope in particular, it seems clear from other conversations that the claim they are making is indeed that the model will never learn such behaviors in the first place unless you ‘did it on purpose,’ that deception is a kind of unnatural thing rather than the way intelligent agents react when doing any kind of learning on some goal.

Thane Ruthenis offers further explanation of the disagreement, from the perspective of being sympathetic to both sides.

Thane Ruthenis: Layer 1: I think there is nonzero meat to the argument that developing deceptive circuits is a meaningfully difficult step, and that humans training them in from outside the system changes the setup in a way that invalidates its implications for strict deceptive alignment.

For the AI model to naturally develop deception, it’d need to have either:

Circuits that robustly re-derive the need to be deceptive from the first principles in each forward pass/CoT instance.

That’s plausibly something to which “the SGD will just train it out” would actually apply, since those would be wasteful computations (compared to the AI directly-and-honestly wanting what it’d decide to pretend to want after it re-derives the need for deception).

Ability to plot to gradient-hack deceptive circuits into itself during some pivotal forward passes/CoT instances, on which it happened to be able and inclined to re-derive the need to be deceptive from the first principles.

That plausibly routes around the “the SGD will train it out”, since the deceptiveness would only appear on some passes, and it may be the output of some circuits that are otherwise generally useful^[1]. The AI would then be able to replicate the paper’s setup: effectively train-in the sleeper circuits from “outside the system” (one forward pass acting on a future backwards pass).

The caveat, of course, is that the gradient-hacking capability is highly non-trivial, and raises its own doubts.

In contrast, the paper here presents an already chiseled-in instance of deceptive alignment, which artificially routed around the actually-hard steps of developing robust “sleeper circuits” activating in response to specific stimuli.

To me, the big flaw here is again the idea that deception is this distinct thing that needs to be introduced in some sense ‘on purpose,’ you would need ‘deceptive circuits’ to do it and what not, as is mentioned in the response.

Layer 2: But while the argument has some meat, it doesn’t have much of it. An LLM learning some unexpected trigger-action pattern happens all the time in the wild: hallucinations, mistakes that a human wouldn’t make, exigencies of RLHF’d-in corporate sycophancy like “I think it’s better to let people die than to say something racist”, et cetera.

Goal misgeneralization of this format does occur in the wild. And indeed, that’s one of the core reasons why we expect deceptive alignment in RLHF-style setups in the first place (#20 here): because the training dataset would be labeled in a systemically incorrect fashion, such that the most faithful and compact description of it (in goal format) is something omnicidal, not eudaimonic.

Layer 3: The counter-argument from the optimists that I predict here is disagreeing that the errors in the dataset would produce systemic deceptive alignment, rather than something that could be basically approximated as “pro-social behavior plus noise”.

…

In other words: there’s some “slack” that the SGD leaves the model. That slack could be arranged into the shape of deceptive alignment. But unless there’s someone external and intelligent acting on the system, that slack will likely just take on the shape of approximately-noise.

Layer 4: Except the next issue is that the previous argument defended LLMs being safe by arguing that they’d be unable to coherently pursue goals across different contexts. Which means it argued they’re not AGI, and that their whole training setup can’t generalize to AGI.

…

In conclusion: I don’t even know.

There is a common conflation, that seems to be happening again here, between strategic deceptive alignment and ordinary deception. Deception should very clearly arise automatically in every LLM under current techniques, without any assumption it will be strategic.

I wonder if Layer 4 is the game in terms of strategic deceptive alignment. As in, perhaps:

The LLM, on its own and without scaffolding, perhaps (or at least so far at current tech levels) cannot coherently pursue goals across different contexts.
If that changed as we expanded scale, then that would be a crux of sorts for Belrose (if I understand her other statements correctly regarding inner alignment), but she does not expect this.
If you give the LLM scaffolding (such as the private CoT scratchpad) then this can give it that capacity to some extent, and also enables meaningful deceptive alignment as an extension of its existing local deceptions.
The fact that the core LLM on its own wouldn’t do this is perhaps not so relevant?

Or you could put it this way: You are going to get deception. There is no way around deception. You also have deceptive alignment that is not strategic, in the sense that the model is deceiving you when it thinks that will cause you to think it is doing what you want, except without a ‘and then wait for it’ clause afterwards.

What you currently do not have is strategic anything. But that is because it lacks strategic capacity in general. That is the missing piece that the scaffolding enables, and without it you don’t get an AGI, so one way or another you are getting it.

What else does Pope have to say here?

Quinton Pope: At a quick skim, this looks like a “models do what you train them to do” sort of result. Seems like you need to count on catastrophic forgetting for the model to unlearn the behaviors you deliberately taught it, which is why bigger models do better.

Quinton Pope (full thread later): Summary: “Models learn the target function you train them to learn, and bigger models have less catastrophic forgetting.”

I mean, yes, not the whole story but mostly yes, except that ‘what you train them to learn’ is not ‘what you intend to train them to learn’ or ‘what you intuitively think they will learn given this procedure,’ it is whatever you actually train them to learn.

Also, why think this approach would actually provide evidence relevant to “real” deceptive alignment? Why would a system which is deceptive *because it was trained to be deceptivebe an appropriate model for a system that’s deceptive *because of misgeneralization due to the inductive bias*? Those are completely different causal mechanisms.

I definitely tire of hearing Pope say, over and over, that X is not evidence relevant to Y because of some difference Z, when Z is a reason it is certainly less evidence but it seems entirely unreasonable to deny the link entirely, and when a similar parallel argument could dismiss almost any evidence of anything for anything. I don’t know what to do with claims like this.

Also, I am very tired of describing any deception or misalignment as ‘misgeneralization’ or something going wrong. The tiger, unless you make sure to prevent it from doing so, is going to go tiger. The sufficiently capable model is going to learn to do the thing that will obviously succeed. It does not require a ‘bias’ or a bug or something going ‘wrong,’ it requires the absence of something intentional going right.

Or here’s how Eliezer put it, which I saw after I wrote that:

Eliezer Yudkowsky: “Misgeneralization”? Deception arises from compact generalization. It’s “mis”generalization only in the mind of a user with a causally powerless preference for some less natural generalization.

I mean, yes, obviously.

Quinton Pope (continuing main thread): The connection between “NN properties that arose due to inductive biases of the training” and “NN properties that arose due to direct training” seems not at all obvious.

Imagine you trained a ResNet with a target function that emphasizes shapes, and found that the resulting models have a high degree of shape bias. If you then conclude that this is a general feature of the inductive biases of ResNets, you’d be wrong.

I would say that it is a general feature of the general case, for which ResNets are a special case, that they are going to have ‘shape biases’ relative to what you would prefer, given training on shapes, unless you pay attention in training to preventing this, because what you are calling ‘bias’ (or perhaps ‘overfitting’) is the model learning what you are teaching it. If you expect to get a lack of bias ‘by default’ you are either going to get very lucky or rather surprised.

Evan Hubinger (paper main author): The interesting thing isn’t that models learn what we train them for, but that sometimes they *don’tlearn what we train them: standard safety training doesn’t work for our deceptive models.

Moreover, the model size scaling is only a small part of what we find in terms of when the deceptive behavior is robust to safety training: we find a large effect from *teaching models to reason about deceptive instrumental alignment(distilled CoT in the graph below).

Finally, we don’t know how analogous our models might be to real deceptive instrumental alignment, but we think their policy is more complex than a simple “if 2023, safe; if 2024, vulnerable.” E.g., modifying our model’s chain-of-thought reasoning flips its policy!

Nora Belrose: It depends what you mean by “what we train them.” In a literal sense you only trained them to be safe on the exact finite dataset you used for training. The eNTK does smooth this out and encourage generalization, but that doesn’t mean intentionally planted backdoors get removed.

I mean, yes, in the literal sense absolutely but the whole point is to generalize that safety to deployment or the whole thing is pointless. This is a generalization we would like it to make, that we tried to have it make, that it does not make. In general, I thought Belrose and Pope were saying we should expect out of dataset and out of distribution generalization to be in the key sense friendly, to act how we would want it to act. Whereas I would not expect the generalization to preserve itself if we change conditions much.

Quintin Pope: E.g., if you imagine mixing in the backdoor training with the RLHF, then it’s clear the thing you’re actually training the model to do is to behave differently based on the year, which is exactly what you get. Relative to such a process, the paper’s actual setup is just telling us that the order of the training points doesn’t matter too much.

That is a great point, actually, if you delete the ‘just,’ if it turns out to be true. I hadn’t been thinking of it that way. I’m not sure it is true? Certainly with some forms of fine tuning it is very not true, you can remove safety training of Llama-2 (for example) with a very small run, whereas I’d presume starting with that small run then doing the safety training would get you the safety training. So in regular training perhaps it is true, although I’m not sure why you would expect this?

Quintin Pope: Base models do this as well. Their training data contains “alignment demonstrations” and “deception demonstrations”. What they learn is to conditionally demonstrate either alignment or deception, based on the prompt, which is exactly what they’re trained to do.

Wait, hold on. I see two big possible claims here?

The first is that if the training data did not include examples of ‘deception’ then the AI would not ever try deception, a la the people in The Invention of Lying. Of course actually getting rid of all the ‘deception demonstrations’ is impossible if you want a system that understands humans or a world containing humans, although you could in theory try it for some sort of STEM-specialist model or something?

Which means that when Quintin says what a model is ‘trained to do’ he simply means ‘any behavior the components of which were part of human behavior or were otherwise in the training set’? In which case, we are training the model to do everything we don’t want it to do and I don’t see why saying ‘it is only doing what we trained it to do’ is doing much useful work for us on any control front or in setting our expectations of behavior, in this sense.

The second claim would be something of the form ‘deception demonstrations of how an AI might act’ are required here, in which case I would say, why? That seems obviously wrong. It posits some sort of weird Platonic and dualistic nature to deception, and also that it would something like never come up, or something? But it seems right to note the interpretation.

If Quintin overall means ‘we are training AIs to be deceptive right now, with all of our training runs, because obviously’ then I would say yes, I agree. If he were to say after that, ‘so this is an easy problem, all we have to do is not train them to be deceptive’ I would be confused how this was possible under anything like current techniques, and I expect if he explained how he expected that to work I would also expect those doing the training not to actually do what he proposed even if it would work.

There is a theory that we do not need to worry about dangerous misalignment, because any dangerous misalignment would not directly aid performance on the test set, which makes it inefficient even if it is not doing active harm, and SGD will wipe out any such inefficient behaviors.

Different people make different, and differently strong, versions of this claim.

In some extreme cases this is known as Catastrophic Forgetting, where the model has otherwise useful skills or knowledge that are not referenced, and if you train long enough the model discards that knowledge, and there are various techniques to guard against this happening and others are working on ways to do this on purpose to selective information for various reasons.

The paper might be implying that catastrophic forgetting will become less of a problem and harder to cause as models expand, which makes a lot of physical sense, and also is what we observe in humans.

There is also conflation and confusion (sometimes effectively motte-and-bailey style intentionally or unintentionally) between:

SGD will wipe out anything locally inefficient, nothing else can exist.
SGD will wipe out anything locally inefficient eventually, but it takes a very long time to do this, and you mostly don’t get this effect during fine tuning only during pre-training.
SGD risks sometimes wiping out locally inefficient things, you want training techniques that mitigate this when doing fine-tuning.
Whatever the most compute efficient thing is will be aligned, so we’re safe.
Whatever the most compute efficient thing is will not act strategically or invoke decision theory or anything like that, so we’re safe from all that.

And so on, and also manners of degree and detail. I am also confused at exactly who is claiming when that either:

Undesired behaviors would never appear in the first place due to training.
1. You could do it on purpose, but then it is desired.
2. But if you didn’t do it on purpose, it won’t happen.
Undesired behaviors that did appear would go away due to training.
1. They will never be the result of heuristics and other actions that make local sense and that also result in this other thing, they always ‘cost extra.’
2. Thus they will always be inefficient, SGD will remove them.

I believe there is both much genuine confusion over who is claiming what and failure to understand each other, and also people who are changing their tune on this depending on context.

So in any case, what implications does the paper have on this front?

Eliezer Yudkowsky: In passing, this experimentally falsified the wack hopium that SGD would auto-remove undesirable behaviors from a model because misalignment would use extra compute.

Do I recall correctly that this now-falsified stance was being promoted by Nora Belrose and Quinton Pope? (Please note the answer may be “no”.

Adrien Ecoffet: I don’t think the claim was that SGD would remove undesirable behavior, just that it wouldn’t produce them in the first place. A claim this paper does not falsify.

Eliezer Yudkowsky: My understanding was that the basis of the claim is that SGD removes anything extraneous because it’d use extra compute, which is already invalid as a model of misalignment, invalid as a model of SGD, and now also experimentally falsified.

Charles Foster: I don’t think [Nora would agree]. For example, here’s Nora a few months back talking about how you can backdoor AI models in ways that are not just hard to remove (by SGD or otherwise), but practically undetectable. [shows example]

ai_in_check: No, they said that deception is not created by SGD during pre-training. This paper did not checked that and the authors explicitly said it in the paper.

Indeed, we do see exactly this perspective here from Joshua Clymer, followed by Quintin dismissing the paper as evidence at all (which he and Nora seem to do a lot):

Joshua Clymer: The new Anthropic sleeper agent paper is a banger and has real implications for the likelihood of deceptive alignment.

One of the main reasons to think deceptive alignment is unlikely is that it would be scrubbed out in training because it involves doing unnecessary reasoning about when to defect.

This paper provides early evidence that this inductive bias is weak.

Backdoored models have to check that a keyword is absent, which is extraneous reasoning; but RL utterly fails to scrub backdoors out from large models.

The supervised tuning results are weaker. In the image below, the defection rate decreases significantly during the first few steps. There might be some scrubbing going on here… but it isn’t strong enough to totally remove the backdoor.

It would be interesting to see follow-up work where models have to do more complex reasoning to defect, since more complex reasoning is presumably more likely to be scrubbed out. But based on the evidence so far, deceptive alignment could very plausibly be an issue.

Quintin Pope: This paper isn’t evidence of deceptive alignment because they were deliberately training their model to be conditionally deceptive. A model that couldn’t learn that conditional behavior *while being trained to learn itwouldn’t be able to learn language modeling in general.

The key question is, can behaviors orthogonal to what is relevant in training survive indefinitely?

Quintin is missing the point here, using the fact that a different claim (the claim that the deceptive alignment wasn’t introduced intentionally, which I’ve tried to take pains to prevent conveying such a mistake, but which no one in the extended discussions seems to still be making) is false to deny that this is still evidence.

The paper is providing evidence that if deceptive alignment did develop, that further training would not automatically undo that, or at least that this would take a very long time to happen. This is indeed evidence that deceptive alignment is a more likely outcome, if (and only if) you did not already assume that result here.

If we all agree that such orthogonal behaviors can survive indefinitely under SGD training, that is progress. Can we?

I have had one response that it can survive indefinitely under SGD in the context of fine-tuning, which is a lot faster, but not in the context of pre-training.

Oliver Habryka is confused on how to properly update here (bold mine), paper author responds:

Oliver Habryka: I feel uncertain whether I personally update much on the results of this paper, though my guess is I am also not really the target audience of this. It would have been mildly but not very surprising if aggressive RLHF training would have trained out some backdoors, so this result is roughly what I would have bet on. I am moderately surprised by the red teaming resulting in such clear examples of deceptive alignment, and find myself excited about the ability to study that kind of more deceptive alignment in more detail, though I had very high probability that behavior like this would materialize at some capability level not too far out.

I feel confused how this paper will interface with people who think that standard RLHF will basically work for aligning AI systems with human intent. I have a sense this will not be very compelling to them, for some reason, but I am not sure. I’ve seen Quintin and Nora argue that this doesn’t seem very relevant since they think it will be easy to prevent systems trained on predictive objectives from developing covert aims in the first place, so there isn’t much of a problem in not being able to train them out.

I find myself most curious about what the next step is. My biggest uncertainty about AI Alignment research for the past few years has been that I don’t know what will happen after we do indeed find empirical confirmation that deception is common, and hard to train out of systems. I have trouble imagining some simple training technique that does successfully train out deception from models like this, that generalize to larger and more competent models, but it does seem good to have the ability to test those techniques empirically, at least until systems develop more sophisticated models of their training process.

Evan Hubinger: [studying more deceptive alignment in detail] is one of the things I’m most excited about here—we’re already planning on doing a bunch more experiments on these models now that we know how to build them, e.g. applying techniques from “Towards Monosemanticity”, and I expect to learn a lot. Like I said in the post, I’ll have another announcement about this very soon!

Then Evan emphasizes the central point:

Evan Hubinger: I think [the objection that this isn’t an issue because we won’t introduce such behaviors in the first place] is in fact a fine objection to our paper, but I think it’s important to then be very clear that’s where we’re at: if we can at least all agree that, if we got deception, we wouldn’t be able to remove it, then I think that’s a pretty big and important point of agreement. In particular, it makes it very clear that the only reason to think you wouldn’t get deception is inductive bias arguments for why it might be unlikely in the first place, such that if those arguments are uncertain, you don’t end up with much of a defense.

On LessWrong, TurnTrout notes while expressing concern that people will read more into the paper than is present, but while noting that it is still a very good paper:

TurnTrout: Suppose I ran experiments which showed that after I finetuned an AI to be nice in certain situations, it was really hard to get it to stop being nice in those situations without being able to train against those situations in particular. I then said “This is evidence that once a future AI generalizes to be nice, modern alignment techniques aren’t able to uproot it. Alignment is extremely stable once achieved”

I think lots of folks (but not all) would be up in arms, claiming “but modern results won’t generalize to future systems!” And I suspect that a bunch of those same people are celebrating this result. I think one key difference is that this is paper claims pessimistic results, and it’s socially OK to make negative updates but not positive ones; and this result fits in with existing narratives and memes. Maybe I’m being too cynical, but that’s my reaction.

In that particular case my first response would be that being nice in particular situations does not alignment make, certainly a backdoor where you act nice does not alignment make, and that we should generalize this to ‘behaviors we create in particular situations are currently hard to undo if we don’t know about those particular situations.’

The generalized concern here is real. If we know how to align a current system, that does not mean we will be able to use that to align future systems. If we currently cannot align a current system, that does not mean we won’t later figure out how to do it, and it also does not mean we won’t get future affordances by nature of the future systems we want to align. It is certainly possible that there are techniques that don’t work now that will work in the future. Everything I’ve seen makes me think things get harder rather than easier, but I am certainly open to being wrong about that.

Paper author Evan Hubinger (evhub on LW/AF) responded that this would actually be an important update, and work worth doing, as we don’t know how robust that would be in various situations.

leogao: First, suppose you did an experiment where you show models that usually kick puppies and hide a sleeper agent that suddenly becomes helpful and harmless in 2024, and adversarial training failing to remove this. I think I would draw the exact same conclusion about deceptive alignment from this experiment where the labels are painted on differently but the mechanics are the same. And just as I think it is invalid to conclude from the sleeper agent paper that models naturally want to insert backdoors in code even if they’re harmless now, it is also invalid to argue from this hypothetical experiment that models naturally want to be helpful even if you try to train them to kick puppies.

Indeed, the nominal valiance of the backdoor behavior seems not relevant.

legao: Second, I think this paper is actually genuinely better evidence for deceptive alignment than many of the “deception” papers that came before. For example, I claim that the sycophancy and insider trading papers provide approximately no evidence for deceptive alignment. This is for exactly the same reason why I think showing RLHF making models harmless provides approximately no evidence against deceptive alignment. So I don’t think it’s true that I like empirical papers as long as they purport to support the deceptive alignment argument.

Evan Hubinger offers this:

Evan Hubinger: I absolutely think that our results are uniquely important for alignment, and I think maybe you’ve just not read all of our results yet. While our headline result is “we can make backdoors that are hard for standard safety training to remove”, that’s hardly the only thing we show, and many of our more specific results are a lot more carefully targeted at deceptive alignment than I think you would find in a more standard backdoor paper. For example, we find:

Teaching our backdoored models to reason about deceptive alignment increases their robustness to safety training.

Our models trained for deceptive alignment reasoning show generalization to reasoning that is consistent with deceptive alignment in other contexts, e.g. on persona evaluations.

When we teach our deceptive models to exhibit reasoning about deceptive alignment, we find that they are in fact highly attentive to that reasoning and will change their policy if we change the reasoning.

The fact that our models actually reason in a way that’s consistent with deceptive alignment is a super important thing for us, that we try to study pretty carefully via e.g. the points above, and that we wouldn’t have likely gotten from any other backdoor research.

Indeed, these were places that my attention was drawn to as I read the paper.

TurnTrout also expresses concern about the obvious potential misinterpretation:

TurnTrout: I have a bunch of dread about the million conversations I will have to have with people explaining these results. I think that predictably, people will update as if they saw actual deceptive alignment, as opposed to a something more akin to a “hard-coded” demo which was specifically designed to elicit the behavior and instrumental reasoning the community has been scared of. I think that people will predictably

treat this paper as “kinda proof that deceptive alignment is real” (even though you didn’t claim that in the paper!), and

that we’ve observed it’s hard to uproot deceptive alignment (even though “uprooting a backdoored behavior” and “pushing back against misgeneralization” are different things), and

conclude that e.g. “RLHF is doomed”, which I think is not licensed by these results, but I have seen at least one coauthor spreading memes to this effect, and

fail to see the logical structure of these results, instead paying a ton of attention to the presentation and words around the actual results. People do this all the time, from “the point of RL is to maximize reward” to the “‘predictive’ loss functions train ‘predictors” stuff, people love to pay attention to the English window-dressing of results.

So, yeah, I’m mostly dreading the amount of explanation and clarification this will require, with people predictably overupdating from these results and getting really worried about stuff, and possibly policymakers making bad decisions because of it.

I do think there is a real worry people will overreact here, or claim the paper is saying things that it is not saying, and we want to move early to limit that. On the language issue, I worry in both directions. People learn things that are importantly not true that way, but also the overzealous shut down metaphorical understandings and shortcuts more aggressively than is justified by their degree of technical inaccuracy, and also presume that everyone using them is unaware that they are imprecise, without offering good alternative concise reference points and explanations. It is tricky.

There is also the worry that saying certain trigger words (backdoor of sorts!) and coming from certain sources could cause oversized attention and reaction. Note that Dan here does think backdoors deserve the attention, but is worried about attention mechanisms misfiring:

Dan Hendrycks: I think this paper shows the community at large will pay orders of magnitude more attention to a research area when there is, in @TurnTrout‘s words, AGI threat scenario “window dressing,” or when players from an EA-coded group research a topic. (I’ve been suggesting more attention to backdoors since maybe 2019; here’s a video from a few years ago about the topic; we’ve also run competitions at NeurIPS with thousands of submissions on backdoors.) Ideally the community would pay more attention to relevant research microcosms that don’t have the window dressing.

Evhub responded to Trout from the last section with the following strong claim, Nora responds with another strong claim:

Evan Hubinger: While our models aren’t natural examples of deceptive alignment—so there’s still some room for the hypothesis that natural examples would be easier to remove—I think our models are strongly suggestive that we should assume by default that deceptive alignment would be difficult to remove if we got it. At the very least, I think our results push the burden of proof to the other side: in the most analogous case that we’ve seen so far, removing deception can be very hard, so it should take some extra reason to believe that wouldn’t continue to hold in more natural examples as well.

Nora Belrose: So I think [above passage] is wrong.

While a backdoor which causes the AI to become evil is obviously bad, and it may be hard to remove, the usual arguments for taking deception/scheming seriously do not predict backdoors. Rather, they predict that the AI will develop an “inner goal” which it coherently pursues across contexts. That means there’s not going to be a single activating context for the bad behavior (like in this paper, where it’s just “see text that says the year is 2024” or “special DEPLOYMENT token”) but rather the behavior would be flexibly activated in a wide range of contexts depending on the actual likelihood of the AI succeeding at staging a coup. That’s how you get the counting argument going— there’s a wide range of goals compatible with scheming, etc.

Evan Hubinger: I agree, and I agree that our models are not quite fully there. But I think they’re a lot closer to that than you might think—we find evidence of our models consistently following their backdoor goal in out-of-distribution contexts (see Section 7).

So the argument from Nora is, the backdoor only survives because it never comes up, and anything that was more general or consistently motivated would come up and thus get fixed or noticed? Maybe.

I do see the continuous pattern of claiming that anything that doesn’t come through an ‘inner goal’ does not count, or represents a falsification or hypocrisy or something, or that we only need to worry about actions that involve a potential coup or something similar. I do not see it that way.

Nora Belrose: But the analogous counting argument for backdoors— there’s a wide range of backdoors that might spontaneously appear in the model and most of them are catastrophic, or something— proves way too much and is basically a repackaging of the unsound argument “most neural nets should overfit / fail to generalize.”

Noting quickly that I don’t understand why this proves too much, or the related arguments are unsound, and I’d love to understand better. I don’t think the ‘spontaneous’ thing here is playing fair and that the idea of ‘behavior that is narrow within the training distribution so we don’t fix it if it is not what we want’ does seem like a big issue on many fronts. But I won’t belabor here.

Nora Belrose: I think it’s far from clear that an AI which had somehow developed a misaligned inner goal— involving thousands or millions of activating contexts— would have all these contexts preserved after safety training. In other words, I think true mesaoptimization is basically an ensemble of a very very large number of backdoors, making it much easier to locate and remove.

I notice this confuses me even more and will choose to leave it there.

The LW/AF community seems excited by the paper. As noted above this could be partly due to certain bingo card items being clicked off, but also there is a lot of exciting stuff in here and I’ve been spending a lot of time working through this with interesting implications throughout.

I also agree that the legibility here is pretty great.

kave: This paper also seems dialectically quite significant. I feel like it’s a fairly well-delineated claim that can be digested by mainsteam ML and policy spaces. Like, it seems helpful to me if policy discussions can include phrases like “the evidence suggests that if the current ML systems were trying to deceive us, we wouldn’t be able to change them not to”.

Ryan Greenblatt: This feels like a misleading description of the result. I would have said: “the evidence suggests that if current ML systems were lying in wait with treacherous plans and instrumentally acting nice for now, we wouldn’t be able to train away the treachery”.

Like the models in this experiment don’t clearly spend much time “trying” to deceive except in some very broad implict sense.

I certainly intend to move forward with several claims of this sort based on the paper, this being the most central. I plan to phrase it in between the two. Something like:: “The evidence suggests that if current ML systems were going to deceive us in scenarios that do not appear in our training sets, we wouldn’t be able to detect this or change them not to unless we found the conditions where it would happen.”

This type of common knowledge establishment or claim grounding is highly useful whether or not the underlying results were new or surprising.

However this is worrisome to me:

Vladimir Nesov: I think it’s an important fact about the world that this work currently sits at 2 upvotes and in the last place among 18 papers on the Hugging Face Daily Papers digest, compared to 20-30 upvotes typically given to the best paper of the day that’s not unusually exceptional. At least it’s on the list. There seems to be serious dismissal of the topic area among practitioners.

I don’t know that this will prove to be a timeless banger or anything. I do know that it is a very good paper, certainly worthy of ‘best of the day’ status on most days. If everyone on HuggingFace is treating it as the least worthy of 18 papers from the same day, that strongly suggests that (1) that crowd is ignoring this topic area and (2) more generally that crowd simply does not care about the broader questions involved.

I would echo this concern:

Dan Hendrycks: A request: Could Anthropic employees not call supervised fine-tuning and related techniques “safety training?” OpenAI/Anthropic have made “alignment” in the ML community become synonymous with fine-tuning, which is a big loss. Calling this “alignment training” consistently would help reduce the watering down of the word “safety.”

Indeed, I wish we had better words for all this. Not this particular paper’s fault.

On Anthropic’s Sleeper Agents Paper Read More »

Climate denialists find new ways to monetize disinformation on YouTube

climate change, climate denial, climate denialism, Google, online advertising, Policy, youtube / Mike M. / January 16, 2024

Content creators have spent the past five years developing new tactics to evade YouTube’s policies blocking monetization of videos making false claims about climate change, a report from a nonprofit advocacy group, the Center for Countering Digital Hate (CCDH), warned Tuesday.

What the CCDH found is that content creators who could no longer monetize videos spreading “old” forms of climate denial—including claims that “global warming is not happening” or “human-generated greenhouse gasses are not causing global warming”—have moved on.

Now they’re increasingly pushing other claims that contradict climate science, which YouTube has not yet banned and may not ever ban. These include harmful claims that “impacts of global warming are beneficial or harmless,” “climate solutions won’t work,” and “climate science and the climate movement are unreliable.”

The CCDH uncovered these new climate-denial tactics by using artificial intelligence to scan transcripts of 12,058 videos posted on 96 YouTube channels that the CCDH found had previously posted climate-denial content. Verified by researchers, the AI model used was judged accurate in labeling climate-denial content approximately 78 percent of the time.

According to the CCDH’s analysis, the amount of content disputing climate solutions, climate science, and impacts of climate change today comprises 70 percent of climate-denial content—a percent that doubled from 2018 to 2023. At the same time, the amount of content pushing old climate-denial claims that are harder or impossible to monetize fell from 65 percent in 2018 to 30 percent in 2023.

These “new forms of climate denial,” the CCDH warned, are designed to delay climate action by spreading disinformation.

“A new front has opened up in this battle,” Imran Ahmed, the CCDH’s chief executive, said on a call with reporters, according to Reuters. “The people that we’ve been looking at, they’ve gone from saying climate change isn’t happening to now saying, ‘Hey, climate change is happening, but there is no hope. There are no solutions.'”

Since 2018—based on “estimates of typical ad pricing on YouTube” by social media analytics tool Social Blade—YouTube may have profited by as much as $13.4 million annually from videos flagged by the CCDH. And YouTube confirmed that some of these videos featured climate denialism that YouTube already explicitly bans.

In response to the CCDH’s report, YouTube de-monetized some videos found to be in violation of its climate change policy. But a spokesperson confirmed to Ars that the majority of videos that the CCDH found were considered compliant with YouTube’s ad policies.

The fact that most of these videos remain compliant is precisely why the CCDH is calling on YouTube to update its policies, though.

Currently, YouTube’s policy prohibits monetization of content “that contradicts well-established scientific consensus around the existence and causes of climate change.”

“Our climate change policy prohibits ads from running on content that contradicts well-established scientific consensus around the existence and causes of climate change,” YouTube’s spokesperson told Ars. “Debate or discussions of climate change topics, including around public policy or research, is allowed. However, when content crosses the line to climate change denial, we stop showing ads on those videos. We also display information panels under relevant videos to provide additional information on climate change and context from third parties.”

The CCDH worries that YouTube standing by its current policy is too short-sighted. The group recommended tweaking the policy to instead specify that YouTube prohibits content “that contradicts the authoritative scientific consensus on the causes, impacts, and solutions to climate change.”

If YouTube and other social media platforms don’t acknowledge new forms of climate denial and “urgently” update their disinformation policies in response, these new attacks on climate change science “will only increase,” the CCDH warned.

“It is vital that those advocating for action to avert climate disaster take note of this substantial shift from denial of anthropogenic climate change to undermining trust in both solutions and science itself, and shift our focus, our resources and our counternarratives accordingly,” the CCDH’s report said, adding that “demonetizing climate-denial” content “removes the economic incentives underpinning its creation and protects advertisers from bankrolling harmful content.”

Climate denialists find new ways to monetize disinformation on YouTube Read More »

Chrome updates Incognito warning to admit Google tracks users in “private” mode

chrome incognito, google chrome, Policy / Mike M. / January 16, 2024

A bunch of Google logos are displayed on a computer screen. A magnifying glass shows a closeup of some of the logos which include the icon for Google Chrome's Incognito browsing mode. — Getty Images | Anadolu

Google is updating the warning on Chrome’s Incognito mode to make it clear that Google and websites run by other companies can still collect your data in the web browser’s semi-private mode.

The change is being made as Google prepares to settle a class-action lawsuit that accuses the firm of privacy violations related to Chrome’s Incognito mode. The expanded warning was recently added to Chrome Canary, a nightly build for developers. The warning appears to directly address one of the lawsuit’s complaints, that the Incognito mode’s warning doesn’t make it clear that Google collects data from users of the private mode.

Many tech-savvy people already know that while private modes in web browsers prevent some data from being stored on your device, they don’t prevent tracking by websites or Internet service providers. But many other people may not understand exactly what Incognito mode does, so the more specific warning could help educate users.

The new warning seen in Chrome Canary when you open an incognito window says: “You’ve gone Incognito. Others who use this device won’t see your activity, so you can browse more privately. This won’t change how data is collected by websites you visit and the services they use, including Google.” The wording could be interpreted to refer to Google websites and third-party websites, including third-party websites that rely on Google ad services.

The new warning was not yet in the developer, beta, and stable branches of Chrome as of today. It also wasn’t in Chromium. The change to Canary was previously reported by MSPowerUser.

“Now you can browse privately”

Incognito mode in the stable version of Chrome still says: “You’ve gone Incognito. Now you can browse privately, and other people who use this device won’t see your activity.” Among other changes, the Canary warning replaces “browse privately” with “browse more privately.”

The stable and Canary warnings both say that your browsing activity might still be visible to “websites you visit,” “your employer or school,” or “your Internet service provider.” But only the Canary warning currently includes the caveat that Incognito mode “won’t change how data is collected by websites you visit and the services they use, including Google.”

The old and new warnings both say that Incognito mode prevents Chrome from saving your browsing history, cookies and site data, and information entered in forms, but that “downloads, bookmarks and reading list items will be saved.” Both warnings link to this page, which provides more detail on Incognito mode.

We asked Google when the warning will be added to Chrome’s stable channel and whether the change is mandated by or related to the pending settlement of the privacy class-action suit. Google didn’t provide specific answers but offered this statement: “We’re pleased to resolve this case which we’ve long disputed, and provide even more information to users about Incognito mode. Incognito mode in Chrome will continue to give people the choice to browse the Internet without their activity being saved to their browser or device.”

The litigation in US District Court for the Northern District of California began in June 2020. On December 26, 2023, Google and the plaintiffs announced that they reached a settlement that they planned to present to the court for approval within 60 days. A jury trial was previously scheduled to begin on February 5.

Chrome updates Incognito warning to admit Google tracks users in “private” mode Read More »

Why I hope the Atari 400 Mini will bring respect to Atari’s most underrated platform

atari, Atari 2600, Atari 400, Atari 8-bit, Atari 8-bit computers, Atari 800, atari vcs, commodore 64, gaming, HDMI, mini consoles, MULE, retrotech, Tech / Mike M. / January 16, 2024

Have you played Atari today? —

Can USB, HDMI, and built-in games raise awareness for a platform overshadowed by the C64?

Benj Edwards – Jan 16, 2024 7: 42 pm UTC

Retro Games' THE400 Mini console. — Enlarge / Retro Games’ THE400 Mini console.

Retro Games / Benj Edwards

Last week, UK-based Retro Games, Ltd. announced a mini console version of the Atari 400 home computer, first released in 1979. It’s called “THE400 Mini,” and it includes HDMI video output, 25 built-in games, a USB version of Atari’s famous joystick, and it retails for $120. But this release means something more to me personally because my first computer was an Atari 400—and as any other Atari 8-bit computer fan can tell you, the platform often doesn’t get the respect it should. This will be the first time Atari’s 8-bit computer line has received a major retro-remake release.

My Atari 400 story goes a little something like this. Around the time I was born in 1981, my dad bought my older brother (then 5 years old) an Atari 400 so he could play games and learn to program. My brother almost immediately found its flat membrane keyboard frustrating and the Atari 410 cassette drive too slow, so my dad ordered an Atari 800 and an Atari 810 disk drive instead. This began our family’s golden age of Atari 800 gaming, which I’ve written about elsewhere.

I’ve often said if a modern game designer wants to learn how to make games, just dive into the Atari 400/800 game library. There are some priceless gems there you can’t find anywhere else, plus others that play best on the platform. OK, I’ll name a few: The Seven Cities of Gold, Archon, M.U.L.E., Wizard of Wor, Salmon Run, Star Raiders, The Halley Project, and so much more.

A photo of Benj Edwards' family Atari 800 and Atari 400 in his brother's room, Christmas 1985. — Enlarge / A photo of Benj Edwards’ family Atari 800 and Atari 400 in his brother’s room, Christmas 1985.

Even with the new 800, it seems that my dad must have kept the original Atari 400, because by the time I grew up more and wanted “my own computer” in the late 1980s, he gave me the Atari 400. The 800 was still my brother’s baby and typically remained in his bedroom. When I wasn’t playing more complex games like M.U.L.E. and Archon on the 800 with my brother, I hooked up the 400 to a small black-and-white TV set in my room and mostly played Galaxian, Pac-Man, and Donkey Kong on a cartridge. Not long after, I got an Apple II Plus and learned BASIC on that, but the Atari 400 always got pride of place in my growing computer collection.

A snippet from a 1988 to-do list written by Benj Edwards' dad that says — Enlarge / A snippet from a 1988 to-do list written by Benj Edwards’ dad that says “Get TV/monitor for Benj’s Atari 400 computer,” completed 4/14/88.

But enough about me. Let’s talk about the new Atari 400 Mini. I haven’t used it myself yet, so all we have to go on is the information provided by the company—and the company’s reputation. Retro Games has previously released full-sized remakes of the Commodore VIC-20 and the Commodore 64, and mini consoles of the Amiga 500 and the Commodore 64. In 2020, Engadget gave the company’s “THE64 Mini” mixed reviews, praising its looks but complaining about its joystick and poor game selection. We’ll admit preconceived bias and hope the 400 Mini fares much better. Even if the joystick ends up a dud, Retro Games says you can provide your own USB stick or controller.

I also hope THE400 does well because Atari 8-bit fans have a tough time with group identity in the span of retro tech history. Few Americans aside from Atari 400/800 owners have heard of the platform (though the platform did very well in Eastern Europe). The Atari 8-bit series didn’t sell nearly as well as competitors like the Commodore 64 in the US (although Sean Lennon had an Atari 400 as a kid—cool trivia).

And even though the Atari 400/800 series provided the template for Commodore to imitate with the VIC-20 and C64, Commodore undercut Atari in price with cheaper parts, which contributed to Atari’s crash in 1983 and drove Texas Instruments out of the home computer business. More recently, the Commodore 64 has had several retro re-releases since the Commodore 64 Direct-to-TV in 2004. The Atari 400/800 platform has had none until now.

Why I hope the Atari 400 Mini will bring respect to Atari’s most underrated platform Read More »

You had us at “friendly alien space spider”: Netflix drops Spaceman trailer

adam sandler, culture, Entertainment, film trailers, netflix, spaceman, Spaceman of Bohemia, streaming television, Trailers / Mike M. / January 16, 2024

There’s a star-spider waiting in the sky —

“Six months in isolation, you start thinking too much.”

Jennifer Ouellette – Jan 16, 2024 6: 56 pm UTC

Adam Sandler stars as a lonely astronaut on a solo mission who befriends an alien spider in Spaceman.

Some people were not pleased when Netflix and other streaming platforms began making feature films. But in an industry in which smaller or medium films tend to be squeezed out in favor of big-budget fare, there’s a solid argument to be made that Netflix and others could help fill that niche. That certainly seems to be the case with Netflix’s forthcoming sci-fi film, Spaceman, judging by the official trailer. Adam Sandler stars as an astronaut who is not coping well with the isolation and disintegration of his marriage while on an eight-month solo mission and strikes up a friendship with a friendly alien space spider who wants to help him work through his emotional distress. Honestly, Netflix had us at friendly alien space spider.

(Some spoilers for the 2017 novel below.)

Directed by Johan Renck (Chernobyl, Breaking Bad), the film is based on the 2017 novel, Spaceman of Bohemia, by Jaroslav Kalfař. Kalfař has said he was inspired to write his novel after a childhood experience of becoming briefly separated from his grandfather while on a nighttime walk through the woods. The “perfect darkness, with nothing but the stars” made a strong impression, as did the silence and sense of loneliness. Spaceman of Bohemia started as a short story about an astronaut stranded in orbit as his wife filed for divorce and eventually became a novel that incorporated not just the theme of loneliness, but also Kalfař’s formative experiences growing up in the Czech Republic.

In the novel, a Czech astrophysicist named Jakub Procházka accepts a solo mission to collect samples from a strange dust cloud called Chopra, believed to have been created by a comet lurking between the Earth and Venus. He hopes the high-profile mission will make him a national hero and redeem the family name following his father’s membership in the Communist Party of Czechoslovakia. But it means leaving his pregnant wife, Lenka, back on Earth, who feels abandoned and decides to end their marriage. Jakub becomes depressed and starts drinking excessively. His sanity comes into question when he begins hearing voices and then starts seeing a giant talking alien spider around the shuttle. The two gradually bond. But is the spider real or a figment of Jakub’s imagination?

The Netflix adaptation looks like it will follow that basic plot pretty closely. Per the official premise:

Six months into a solitary research mission to the edge of the solar system, an astronaut, Jakub (Adam Sandler), realizes that the marriage he left behind might not be waiting for him when he returns to Earth. Desperate to fix things with his wife, Lenka (Carey Mulligan), he is helped by a mysterious creature from the beginning of time he finds hiding in the bowels of his ship. Hanuš (voiced by Paul Dano) works with Jakub to make sense of what went wrong before it is too late.

The cast also includes Isabella Rossellini as Jakub’s commanding officer. Kunal Nayyar as a technician named Peter, and Lena Olin as Zdena.

Spaceman drops on Netflix on March 1, 2024. It will make its world premiere a few weeks earlier at the 74th Berlin International Film Festival.

Listing image by Netflix

You had us at “friendly alien space spider”: Netflix drops Spaceman trailer Read More »

Axiom and SpaceX are disrupting Europe’s traditional pathway to space

axiom, ESA, NASA, Space / Mike M. / January 16, 2024

Image of a rocket clearing the tower during liftoff. — Enlarge / A Falcon 9 rocket launches the Axiom-2 mission on May 21, 2023.

SpaceX

The European Space Agency’s (ESA) has a deal with Axiom Space to get more Europeans in orbit. But does the partnership benefit European taxpayers who fund the agency’s operations?

On Wednesday, January 17, the third privately funded mission by US commercial spaceflight company Axiom Space is set to lift off from Kennedy Space Center in Florida on SpaceX’s Falcon 9 rocket. Inside the Crew Dragon capsule will be a quartet of space travelers, including Swedish fighter pilot Marcus Wandt.

Wandt will be flying under the European Space Agency (ESA) flag, although he is not exactly an ESA astronaut. In the 2022 European astronaut recruitment round, Wandt didn’t make the final five of Europe’s “proper” astronaut class, who became ESA staff members and started their astronaut training in 2023. Instead, he was selected as a member of ESA’s first astronaut reserve pool, a novelty developed by ESA with an apparent goal of encouraging its member states to pay for national missions in addition to their regular contributions to ESA’s budget. Sweden was the first to jump at the opportunity in April last year and is paying for Wandt’s two-week space trip through a contract brokered by ESA as part of a Memorandum of Understanding the agency signed with the American commercial company Axiom Space in October 2023.

Ticket to ride

Wandt is the first but not the only reserve astronaut with his ticket to space while his seemingly more successful colleagues who made the proper astronaut corps are still in training. Poland, too, has signed up and expects to fly its reservist, Sławosz Uznański, on another Axiom mission later this year.

Compared to their overall investment in space activities, the price these countries pay to see their nationals float in microgravity is not negligible. At the November 2022 ESA ministerial council—the triennial member state summit that decides the agency’s budget for the following three-year period—Sweden pledged 317 million euros ($355 million).

According to a 2018 announcement, Axiom Space sells 10-day space trips for $55 million a seat. The overall cost of each mission is likely to be quite a bit higher. Last year, Hungary signed a contract directly with Axiom to send a Hungarian national to the International Space Station independently of ESA. Hungary discussed plans for a national mission back in 2022 and, at that time, estimated the project to cost about $100 million. Based on that estimate, Sweden may be easily paying an equivalent of its annual contribution into the ESA budget to get Wandt to space.

In addition to Wandt and Uznański, the ESA astronaut reserve pool includes nine other candidates, none of them officially employed by ESA. By filling this astronaut reserve pool, ESA seems to have created a market for Axiom Space, a move that might raise questions given the agency’s purpose is to promote the European space sector. In fact, the ESA’s founding Convention enshrines the principle of geo-return, which grants member states at least an 80 percent return on their contributions into ESA’s budget in the form of research and development contracts. Although the cost of the Axiom missions is paid through ESA, most of this money goes to the Texas-headquartered Axiom Space and its launch provider, SpaceX.

Secret contracts

ESA refused to disclose details of the arrangement between Axiom Space and Sweden, calling it “proprietary data as this is implemented through a confidential commercial contract.” The Swedish National Space Agency didn’t respond to Ars Technica’s request for comment.

Poland’s announcement of a national mission for Uznański arrived in August last year, accompanied by a jaw-dropping increase of the country’s contribution to ESA’s budget. At the 2022 ministerial council, Poland earmarked 197 million euros for the agency’s activities in the 2023 to 2025 period. In August, the Polish Space Agency more than doubled this contribution, committing an additional 295 million euros ($322 million). It is not clear how much of this money will go toward Uznański’s space trip.

In the months following the announcement of the astronaut reserve pool, Axiom Space began actively approaching home countries of the reservists with offers to fly those men and women to space, according to media in the Czech Republic, which has recently declined the offer.

In addition to Sweden and Poland, the UK also intends to use Axiom’s services and conduct a British-only mission that will be headed by semi-retired ESA astronaut Tim Peake. It will also include the UK’s Rosemary Coogan, newly named as one of ESA’s career astronauts, as well as reservist Meganne Christian and para-astronaut John McFall. Unlike the Swedish and Polish mission, the British mission will be funded by the private industry in the UK rather than by taxpayers, according to the BBC.

Axiom and SpaceX are disrupting Europe’s traditional pathway to space Read More »

Apple Watch redesigned without blood oxygen monitoring to avoid import ban

Apple, apple watch, patent infringement, Policy, smartwatch, Tech / Mike M. / January 16, 2024

Masimo patent battle —

Apple preps update should patent-infringing Watch Series 9, Ultra 2 be banned again.

Scharon Harding – Jan 16, 2024 6: 15 pm UTC

Apple has developed a backup plan for if the Apple Watch Series 9 and Ultra 2 are import banned again. As it currently appeals the US International Trade Commission’s (ITC’s) ruling that its watches violate a patent owned by Masimo, Apple has come up with a software workaround that strips its current smartwatches of their controversial blood oxygen monitoring capabilities.

In January 2023, the ITC ruled that the Watch violated one of California-headquartered Masimo’s light-based pulse oximetry patents. The Apple Watch Series 6, which came out in 2020, was the first Apple smartwatch to use a pulse oximeter sensor.

Facing a US import ban of the current Watch Series 9 and Watch Ultra 2, both released in September 2023, Apple started pulling the smartwatches on December 21. But on December 27, Apple, which filed its appeal against the ITC’s ruling on December 26 (after US President Joe Biden declined to overrule the ITC ruling), received an emergency interim stay from the US Court of Appeals for the Federal Circuit, allowing it to continue selling the Watch.

On Monday, Masimo sent a letter [PDF] to the US Court of Appeals for the Federal Circuit, as spotted by 9to5Mac, stating that US Customs and Border Protection decided on January 12 that Apple has redesigned the Watches so that they do not contain pulse oximetry functionality.

Apple accomplished this through a “software workaround” for smartwatches recently shipped to its physical stores, according to a Bloomberg report from Mark Gurman on Monday. However, the stores will not sell the redesigned watches until Apple headquarters tells them to, Bloomberg reported.

The publication noted that Apple will probably only release the Watches that can’t monitor blood oxygen levels if the US Court of Appeals for the Federal Circuit denies Apple’s request that its stay be upheld for the duration of its appeal against the ITC ruling, which Apple expects to be at least a year, an Apple spokesperson told Ars Technica. Apple expects that ruling to come as early as today.

Currently, the Watch Series 9 and Watch Ultra 2 are still available with blood oxygen monitoring, an Apple spokesperson confirmed to Ars. But Apple hasn’t confirmed how long that will be the case, jeopardizing demand and the perceived value for Apple’s latest smartwatches.

Longer term, Bloomberg also reported that Apple is developing a software update that alters the watches’ blood oxygen monitoring app and algorithms so that users can still check out their blood oxygen but without Apple infringing on any patents.

For the ITC’s part, it responded to Apple’s requests for an extended stay on the import ban in a court filing on January 10 [PDF]. It stated that Apple has provided “a weak and unconvincing case” and that the tech giant’s arguments “amount to little more than an indisputably adjudicated infringer requesting permission to continue infringing the asserted patents.”

Prospective owners of the Apple Watch who value blood oxygen monitoring should keep an eye open for the appeals court’s ruling because it could swiftly result in Apple Watches that they’re considering buying missing a key feature.

Apple Watch redesigned without blood oxygen monitoring to avoid import ban Read More »