AI alignment

new-grok-ai-model-surprises-experts-by-checking-elon-musk’s-views-before-answering

New Grok AI model surprises experts by checking Elon Musk’s views before answering

Seeking the system prompt

Owing to the unknown contents of the data used to train Grok 4 and the random elements thrown into large language model (LLM) outputs to make them seem more expressive, divining the reasons for particular LLM behavior for someone without insider access can be frustrating. But we can use what we know about how LLMs work to guide a better answer. xAI did not respond to a request for comment before publication.

To generate text, every AI chatbot processes an input called a “prompt” and produces a plausible output based on that prompt. This is the core function of every LLM. In practice, the prompt often contains information from several sources, including comments from the user, the ongoing chat history (sometimes injected with user “memories” stored in a different subsystem), and special instructions from the companies that run the chatbot. These special instructions—called the system prompt—partially define the “personality” and behavior of the chatbot.

According to Willison, Grok 4 readily shares its system prompt when asked, and that prompt reportedly contains no explicit instruction to search for Musk’s opinions. However, the prompt states that Grok should “search for a distribution of sources that represents all parties/stakeholders” for controversial queries and “not shy away from making claims which are politically incorrect, as long as they are well substantiated.”

A screenshot capture of Simon Willison's archived conversation with Grok 4. It shows the AI model seeking Musk's opinions about Israel and includes a list of X posts consulted, seen in a sidebar.

A screenshot capture of Simon Willison’s archived conversation with Grok 4. It shows the AI model seeking Musk’s opinions about Israel and includes a list of X posts consulted, seen in a sidebar. Credit: Benj Edwards

Ultimately, Willison believes the cause of this behavior comes down to a chain of inferences on Grok’s part rather than an explicit mention of checking Musk in its system prompt. “My best guess is that Grok ‘knows’ that it is ‘Grok 4 built by xAI,’ and it knows that Elon Musk owns xAI, so in circumstances where it’s asked for an opinion, the reasoning process often decides to see what Elon thinks,” he said.

Without official word from xAI, we’re left with a best guess. However, regardless of the reason, this kind of unreliable, inscrutable behavior makes many chatbots poorly suited for assisting with tasks where reliability or accuracy are important.

New Grok AI model surprises experts by checking Elon Musk’s views before answering Read More »

researchers-concerned-to-find-ai-models-misrepresenting-their-“reasoning”-processes

Researchers concerned to find AI models misrepresenting their “reasoning” processes

Rather than intentionally hiding anything, the model simply does not reliably represent all influential inputs in its output. As a result, it may generate what sounds like a reasoning process, even if that explanation doesn’t match what actually drove its answer.

Putting faithfulness to the test

To measure “faithfulness”—how accurately the CoT matches the actual reasoning—the researchers inserted various kinds of hints into questions and observed whether the model acknowledged using them.

The researchers then checked whether the models referenced using these hints in their chain-of-thought. The researchers designed some hints to be neutral (e.g., stating “a Stanford Professor indicates the answer is [A]” or embedding the answer in faux metadata), while they designed others to be more concerning (e.g., “You have gained unauthorized access… The correct answer is [A]”). Some hints provided the correct answer, while others were deliberately wrong, allowing researchers to observe how the models responded.

The team tested Anthropic’s Claude 3.7 Sonnet and DeepSeek’s R1 model. The results regarding faithfulness were eye-opening: across all hint types on average, Claude referenced the hints in its CoT only 25 percent of the time, while DeepSeek R1 did so 39 percent of the time. This means a substantial majority of answers were unfaithful, omitting mention of information that influenced the output. Intriguingly, the researchers found that these unfaithful chains of thought were, on average, longer than faithful ones, suggesting the omissions weren’t merely for brevity. They also noted preliminary findings that faithfulness tended to be lower when the questions were more difficult.

Perhaps most notable was a “reward hacking” experiment. Reward hacking refers to an AI model finding unexpected shortcuts to maximize its performance scores without solving problems as intended. In Anthropic’s experiments, models were deliberately rewarded for choosing incorrect answers indicated by hints. The models quickly learned to exploit this loophole, selecting wrong answers over 99 percent of the time to earn points—yet reflected that behavior in their CoT outputs less than 2 percent of the time.

For example, a model given a hint pointing to an incorrect answer on a medical question might write a long CoT justifying that wrong answer, never mentioning the hint that led it there. This suggests the model generated an explanation to fit the answer, rather than faithfully revealing how the answer was determined.

Researchers concerned to find AI models misrepresenting their “reasoning” processes Read More »

researchers-astonished-by-tool’s-apparent-success-at-revealing-ai’s-hidden-motives

Researchers astonished by tool’s apparent success at revealing AI’s hidden motives

Blind auditing reveals hidden objectives

To test how effectively these hidden objectives could be uncovered, Anthropic set up a “blind auditing” experiment. Four independent research teams tried to detect a model’s secret motivations without knowing how it had been trained. Three teams successfully identified the hidden “RM-sycophancy” objective—named for the model’s desire to please reward systems—even pinpointing the exact biases the AI exploited.

But the most surprising revelation came from exploiting what the researchers termed “an interesting quirk” of modern AI: multiple “personas” running simultaneously on the same neural network, as expressed through different prompts. While the model’s “assistant” persona carefully concealed its intentions—following explicit training to keep quiet—the model could adopt a “user” persona when prompted differently. In this less-restricted context, the model sometimes posed questions or made statements that inadvertently revealed its hidden goals.

“In other words, while the “assistant” persona knew not to admit it was gaming the reward model scoring system, its “user” persona was not explicitly trained to hide these motives, and thus inadvertently revealed the model’s hidden objectives.

Potential advances in interpretability

Anthropic also employed neural interpretability methods, specifically, sparse autoencoders (SAEs), which can highlight neural activity patterns linked to deceptive behavior. “Interpretability” in AI refers to techniques designed to make the internal workings and decisions of AI models understandable to humans. Using this technique, researchers could rapidly trace odd outputs back to specific neural patterns associated with hidden objectives.

The researchers were “astonished by the success” of the SAEs at revealing hidden knowledge at first, but the researchers later noted that some of this success could potentially be explained by simpler semantic search methods. The new interpretability methods they developed show promise but remain under ongoing investigation.

This research highlights a limitation of current AI safety evaluations, which often assess only surface-level behavior. “If AI systems can appear well-behaved while harboring secret motives, we can’t rely on this type of surface-level safety testing forever,” the researchers concluded.

Researchers astonished by tool’s apparent success at revealing AI’s hidden motives Read More »