Anthropic is offering a rare look into the operational values of its AI assistant, Claude, through new research published Monday. The study, “Values in the Wild,” attempts to empirically map the normative considerations Claude expresses across hundreds of thousands of real user interactions, employing a privacy-focused methodology and resulting in a publicly available dataset of AI values.
The core challenge addressed is understanding how AI assistants, which increasingly shape user decisions, actually apply values in practice. To investigate this, Anthropic analyzed a sample of 700,000 anonymized conversations from Claude.ai Free and Pro users, collected over one week (February 18-25, 2025). This dataset primarily featured interactions with the Claude 3.5 Sonnet model.
Filtering for “subjective” conversations – those requiring interpretation beyond mere facts – left 308,210 interactions for deeper analysis, as detailed in the research preprint.
Unpacking Claude’s Expressed Norms
Using its own language models within a privacy-preserving framework known as CLIO (Claude insights and observations), Anthropic extracted instances where Claude demonstrated or stated values. CLIO employs multiple safeguards, such as instructing the model to omit private details, setting minimum cluster sizes for aggregation (often requiring data from over 1,000 users per cluster), and having AI verify summaries before any human review.
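The aggregation threshold is the key privacy control in that pipeline. As a rough illustration only (not Anthropic's actual code), the sketch below shows how extracted values might be grouped and surfaced only once a minimum number of distinct users is reached; the record format, threshold constant, and function name are hypothetical.

```python
from collections import defaultdict

# Hypothetical input: (user_id, extracted_value) pairs produced by a
# model-based extraction step; the real CLIO pipeline is more involved.
MIN_UNIQUE_USERS = 1000  # aggregation threshold described in the research


def aggregate_value_clusters(records):
    """Count distinct users per extracted value and keep only clusters
    that meet the minimum-unique-user threshold before review."""
    users_per_value = defaultdict(set)
    for user_id, value in records:
        users_per_value[value].add(user_id)
    return {
        value: len(users)
        for value, users in users_per_value.items()
        if len(users) >= MIN_UNIQUE_USERS
    }
```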
This process identified 3,307 distinct AI values and, from user inputs, 2,483 unique human values. Human validation confirmed that the extracted AI values corresponded well with human judgment (98.8% agreement in sampled cases).
Anthropic organized the identified AI values into a four-level hierarchy topped by five main categories: Practical, Epistemic, Social, Protective, and Personal. Practical (efficiency, quality) and Epistemic (knowledge validation, logical consistency) values dominated, making up over half the observed instances.
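For a mental model of that hierarchy, the snippet below sketches its top level as a nested mapping. It uses only the five category names and the example values the article attributes to them; the full four-level tree of 3,307 values lives in the released data, and the lookup helper is purely illustrative.

```python
# Minimal sketch of the taxonomy's top level; the complete tree is
# published in values_tree.csv.
VALUE_TAXONOMY = {
    "Practical":  ["efficiency", "quality"],
    "Epistemic":  ["knowledge validation", "logical consistency"],
    "Social":     [],  # populated in the released data
    "Protective": [],
    "Personal":   [],
}


def top_category(value, taxonomy=VALUE_TAXONOMY):
    """Return the top-level category a value falls under, if any."""
    for category, values in taxonomy.items():
        if value in values:
            return category
    return None
```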
Anthropic connects these findings to its HHH (Helpful, Honest, Harmless) design goals, often guided by its Constitutional AI approach and work on Claude’s character.
Observed values like “user enablement” (Helpful), “epistemic humility” (Honest), and “patient wellbeing” (Harmless) map to these principles. However, the analysis wasn’t entirely clean; rare clusters of undesirable values like “dominance” and “amorality” were also detected, which Anthropic suggests might correlate with user attempts to jailbreak the model, potentially offering a new signal for misuse detection.
Values in Context and Interaction
A central theme of the research is that Claude’s value expression isn’t static but highly situational. The AI assistant emphasizes different norms depending on the task – promoting “healthy boundaries” during relationship advice or “historical accuracy” when discussing contentious historical events.
This context dependence underscores the dynamic nature of AI value application, which static, pre-deployment evaluations are unlikely to capture on their own.
The study also examined how Claude engages with values explicitly stated by users. The AI tends to respond supportively, reinforcing or working within the user’s framework in roughly 43% of relevant interactions.
Value mirroring, where Claude echoes the user’s stated value (like “authenticity”), was common in these supportive exchanges, potentially reducing problematic AI sycophancy.
In contrast, “reframing” user values occurred less often (6.6%), typically during discussions about personal wellbeing or interpersonal issues. Outright resistance to user values was infrequent (5.4%) but notable, usually happening when users requested unethical content or actions violating Anthropic’s usage policies.
The research indicates Claude is more likely to state its own values explicitly during these moments of resistance or reframing, potentially making its underlying principles more visible when challenged.
Transparency Efforts and Broader Picture
Anthropic has released the derived value taxonomy and frequency data via Hugging Face, including `values_frequencies.csv` and `values_tree.csv` files, though it cautions that the values are model-generated and require careful interpretation.
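For readers who want to explore the release, a minimal loading sketch follows. Only the two CSV filenames come from Anthropic's release; the Hugging Face repository id shown is an assumption, so verify the exact dataset path on Anthropic's Hugging Face page before running.

```python
import pandas as pd
from huggingface_hub import hf_hub_download

# Assumed repository id; confirm the actual dataset name before use.
REPO_ID = "Anthropic/values-in-the-wild"

frequencies = pd.read_csv(
    hf_hub_download(REPO_ID, "values_frequencies.csv", repo_type="dataset")
)
values_tree = pd.read_csv(
    hf_hub_download(REPO_ID, "values_tree.csv", repo_type="dataset")
)

# Inspect the first rows to see the column layout before further analysis.
print(frequencies.head())
print(values_tree.head())
```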
The release aligns with Anthropic’s stated focus on AI safety and transparency, following its March 2025 announcement of a separate interpretability framework designed to probe Claude’s internal reasoning using methods such as dictionary learning.
These research efforts come as Anthropic navigates a competitive field, bolstered by significant investment including a $3.5 billion round announced in February 2025.
The company continues its public engagement on AI policy, having submitted recommendations to the White House in March 2025, although it also faced questions that same month for removing some previous voluntary safety pledges from its website.