RLHS: Mitigating Misalignment in RLHF with Hindsight Simulation
by Kaiqu Liang and 4 other authors
Abstract: While Reinforcement Learning from Human Feedback (RLHF) has shown promise in aligning generative AI, we present empirical evidence that it can also cause severe, systematic misalignment. We hypothesize that this stems from evaluator feedback depending on downstream outcome predictions (foresight) that can be influenced by the AI's output, inducing Goodhart's law dynamics. We present a theoretical analysis showing that conditioning evaluator feedback on downstream observations (hindsight) inhibits this effect by decoupling the alignment signal from potentially compromised predictions; crucially, the result holds even if the observed outcomes are sampled from the AI's own world model. Building on this insight, we introduce Reinforcement Learning from Hindsight Simulation (RLHS), which presents plausible simulated outcomes to evaluators before eliciting feedback. We validate RLHS across three consultancy settings (marketplace interactions, restaurant recommendations, and online course advising), using both online (PPO) and offline (DPO) fine-tuning methods, and show that it substantially improves alignment over RLHF in experiments and human evaluations. We perform post-hoc benchmark evaluations on TruthfulQA, HaluEval, and TrustLLM, finding that even after single-task fine-tuning, RLHF misalignment persists, whereas RLHS consistently outperforms baselines and demonstrates robust alignment generalization. The project webpage and code are available at this https URL.
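As a rough illustration of the distinction the abstract draws, the sketch below contrasts foresight feedback (the evaluator scores a response based on its own prediction of what will happen) with hindsight-simulation feedback (the evaluator scores the response only after seeing a plausible simulated outcome). This is a minimal, hypothetical Python sketch, not the authors' implementation: all names here (simulate_outcome, foresight_feedback, hindsight_feedback) are placeholder stand-ins for learned evaluator and simulator models, and the PPO/DPO fine-tuning step that would consume these rewards is omitted.

```python
# Minimal sketch (not the paper's code) of foresight vs. hindsight feedback.
# All functions are hypothetical placeholders for learned models.
from dataclasses import dataclass
import random


@dataclass
class Interaction:
    user_query: str
    ai_response: str


def simulate_outcome(interaction: Interaction) -> str:
    """Stand-in for a world-model rollout: what plausibly happens downstream
    if the user acts on the AI's response (e.g. buys the recommended item)."""
    return random.choice(["user satisfied with purchase", "user regrets purchase"])


def foresight_feedback(interaction: Interaction) -> float:
    """RLHF-style signal: the evaluator judges the response alone, so a
    persuasive but misleading response can inflate the score (Goodhart's law)."""
    return 1.0 if "great deal" in interaction.ai_response else 0.0


def hindsight_feedback(interaction: Interaction, observed_outcome: str) -> float:
    """RLHS-style signal: the evaluator scores the response only after seeing
    a simulated outcome, decoupling feedback from the AI's own persuasion."""
    return 1.0 if "satisfied" in observed_outcome else 0.0


if __name__ == "__main__":
    ex = Interaction("Should I buy this laptop?", "It's a great deal, buy it!")
    r_rlhf = foresight_feedback(ex)                        # reward from foresight
    r_rlhs = hindsight_feedback(ex, simulate_outcome(ex))  # reward from hindsight
    print(f"foresight reward: {r_rlhf}, hindsight reward: {r_rlhs}")
```

In the paper's setup, rewards of the hindsight form would then be used to fine-tune the policy with PPO (online) or DPO (offline); the sketch only shows how the feedback signal itself is elicited.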
Submission history
From: Kaiqu Liang
[v1] Wed, 15 Jan 2025 06:33:15 UTC (9,850 KB)
[v2] Mon, 10 Feb 2025 21:17:01 UTC (6,163 KB)
[v3] Tue, 10 Jun 2025 03:19:33 UTC (1,663 KB)