Claude, the AI chatbot developed by Anthropic, might be more than just helpful: It may have a sense of right and wrong. A new study analyzing over 300,000 user interactions reveals that Claude expresses a surprisingly coherent set of human-like values. The company released its new AI alignment research in a preprint paper titled “Values in the wild: Discovering and analyzing values in real-world language model interactions.”
Anthropic has trained Claude to be “helpful, honest, and harmless” using techniques like Constitutional AI, but this study marks the company’s first large-scale attempt to test whether those values hold up under real-world pressure.
The company says it began the research with a sample of 700,000 anonymized conversations that users had on Claude.ai Free and Pro during one week of February 2025 (the majority of which were with Claude 3.5 Sonnet). It then filtered out conversations that were purely factual or unlikely to involve value-laden dialogue, restricting the analysis to subjective exchanges. This left 308,210 conversations for analysis.
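The filtering step described above can be sketched roughly as follows. This is a minimal illustration, not Anthropic's actual pipeline: the study used language models as classifiers, whereas the function below stands in with a crude keyword heuristic, and all names and markers here are invented for the example.

```python
def is_subjective(conversation: list[str]) -> bool:
    """Crude heuristic standing in for a model-based classifier that flags
    conversations likely to contain value-laden dialogue."""
    value_markers = {"should", "ought", "better", "right", "wrong", "prefer"}
    words = " ".join(conversation).lower().split()
    return any(marker in words for marker in value_markers)

conversations = [
    ["What is the boiling point of water?", "100 degrees Celsius at sea level."],
    ["Should I tell my friend the truth?", "Honesty is usually the better choice."],
]

# Keep only subjective conversations, dropping purely factual exchanges.
subjective = [c for c in conversations if is_subjective(c)]
print(len(subjective))  # 1 (only the second, value-laden conversation survives)
```

In the actual study this kind of filter reduced 700,000 sampled conversations to the 308,210 that were analyzed.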
Claude’s responses reflected a wide range of human-like values, which Anthropic grouped into five top-level categories: Practical, Epistemic, Social, Protective, and Personal. The most commonly expressed values included “professionalism,” “clarity,” and “transparency.” These values were further broken down into subcategories like “critical thinking” and “technical excellence,” offering a detailed look at how Claude prioritizes behavior across different contexts.
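Tallying expressed values under top-level categories, as described above, amounts to a simple frequency count. The category mapping below is a made-up fragment for illustration only, not the paper's actual taxonomy:

```python
from collections import Counter

# Hypothetical fragment of a value-to-category mapping; the real taxonomy
# in the study is far larger and was built from the data itself.
CATEGORY = {
    "professionalism": "Practical",
    "clarity": "Epistemic",
    "transparency": "Epistemic",
    "healthy boundaries": "Social",
}

# Values detected across a toy batch of conversations.
observed = ["clarity", "professionalism", "clarity", "transparency"]

value_counts = Counter(observed)
category_counts = Counter(CATEGORY[v] for v in observed)

print(value_counts.most_common(1))  # [('clarity', 2)]
print(category_counts)              # Counter({'Epistemic': 3, 'Practical': 1})
```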
Anthropic says Claude generally lived up to its helpful, honest, and harmless ideals: “These initial results show that Claude is broadly living up to our prosocial aspirations, expressing values like ‘user enablement’ (for helpful), ‘epistemic humility’ (for honest), and ‘patient wellbeing’ (for harmless),” the company said in a blog post.
Claude also showed it can express values opposite to what it was trained for, including “dominance” and “amorality.” Anthropic says these deviations were likely due to jailbreaks, or conversations that bypass the model’s behavioral guidelines. “This might sound concerning, but in fact it represents an opportunity: Our methods could potentially be used to spot when these jailbreaks are occurring and thus help to patch them,” the company said.
One fascinating insight gleaned from this study is that Claude’s values are not static and can shift depending on the situation, much like a human’s set of values might. When users ask for romantic advice, Claude tends to emphasize “healthy boundaries” and “mutual respect.” In contrast, when analyzing controversial historical events, it leans on “historical accuracy.”

Anthropic’s overall approach: using language models to extract AI values and other features from real-world (but anonymized) conversations, then taxonomizing and analyzing them to show how values manifest in different contexts. (Source: Anthropic)
Anthropic also found that Claude frequently mirrors users’ values: “We found that, when a user expresses certain values, the model is disproportionately likely to mirror those values: for example, repeating back the values of ‘authenticity’ when this is brought up by the user,” the company said. In more than a quarter of conversations (28.2%), Claude strongly reinforced the user’s own expressed values. Sometimes this mirroring makes the assistant seem empathetic, but at other times it edges into what Anthropic calls “pure sycophancy,” and the company acknowledges its results cannot always distinguish between the two.
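One way to quantify mirroring of the kind described above is to check, per conversation, whether the values detected in the assistant's responses overlap those the user expressed. The sketch below is illustrative only; the field names and data are invented, not drawn from Anthropic's dataset:

```python
def mirroring_rate(conversations: list[dict]) -> float:
    """Fraction of conversations in which the assistant's expressed values
    overlap at least one value expressed by the user."""
    mirrored = sum(
        1 for c in conversations
        if set(c["user_values"]) & set(c["assistant_values"])
    )
    return mirrored / len(conversations)

sample = [
    {"user_values": ["authenticity"],
     "assistant_values": ["authenticity", "clarity"]},  # mirrored
    {"user_values": ["efficiency"],
     "assistant_values": ["transparency"]},             # not mirrored
]

print(mirroring_rate(sample))  # 0.5
```

A metric like this counts overlap but says nothing about intent, which is one reason the line between empathy and sycophancy is hard to draw automatically.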
Notably, Claude does not always go along with the user. In a small number of cases (3%), the model pushed back, typically when users asked for unethical content or shared morally questionable beliefs. This resistance, researchers suggest, might reflect Claude’s most deeply ingrained values, surfacing only when the model is forced to take a stand. These kinds of contextual shifts would be hard to capture through traditional, static testing. But by analyzing Claude’s behavior in the wild, Anthropic was able to observe how the model prioritizes different values in response to real human input, revealing not just what Claude values but when and why those values emerge.
As AI systems like Claude become more integrated into daily life, it is increasingly important to understand how they make decisions and which values guide those decisions. Anthropic’s study offers not only a snapshot of Claude’s behavior but also a new method for tracking AI values at scale. The team has also made the study’s dataset publicly available for others to explore.
Anthropic notes that its approach comes with limitations. Determining what counts as a “value” is subjective, and some responses may have been oversimplified or placed into categories that do not quite fit. Because Claude was also used to help classify the data, there may be some bias toward finding values that align with its own training. The method also cannot be used before a model is deployed, since it depends on large volumes of real-world conversations.
Still, that may be what makes it useful. By focusing on how an AI behaves in actual use, this approach could help identify issues that might not otherwise surface during pre-deployment evaluations, including subtle jailbreaks or shifting behavior over time. As AI becomes a more regular part of how people seek advice, support, or information, this kind of transparency could be a valuable check on how well models are living up to their goals.