RLHS: Mitigating Misalignment in RLHF with Hindsight Simulation
by Kaiqu Liang and 4 other authors
Abstract: While Reinforcement Learning from Human Feedback (RLHF) has shown promise in aligning generative AI, we present empirical evidence that it can also cause severe, systematic misalignment. We hypothesize that this stems from evaluator feedback depending on downstream outcome predictions (foresight) that can be influenced by the AI's output, inducing Goodhart's law dynamics. We present a theoretical analysis showing that conditioning evaluator feedback on downstream observations (hindsight) inhibits this effect by decoupling the alignment signal from potentially compromised predictions; crucially, the result holds even if the observed outcomes are sampled from the AI's own world model. Building on this insight, we introduce Reinforcement Learning from Hindsight Simulation (RLHS), which presents plausible simulated outcomes to evaluators before eliciting feedback. We validate RLHS across three consultancy settings (marketplace interactions, restaurant recommendations, and online course advising), using both online (PPO) and offline (DPO) fine-tuning methods, and show that it substantially improves alignment over RLHF in experiments and human evaluations. We perform post-hoc benchmark evaluations on TruthfulQA, HaluEval, and TrustLLM, finding that even after single-task fine-tuning, RLHF misalignment persists, whereas RLHS consistently outperforms baselines and demonstrates robust alignment generalization. The project webpage and code are available at this https URL.
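As a rough illustration of the distinction the abstract draws, the sketch below contrasts foresight feedback (the evaluator scores a response based on its own prediction of what will happen) with hindsight-simulation feedback (the evaluator scores the response only after seeing a plausible simulated outcome). This is a minimal, hypothetical Python sketch, not the authors' implementation: all names here (simulate_outcome, foresight_feedback, hindsight_feedback) are placeholder stand-ins for learned evaluator and simulator models, and the PPO/DPO fine-tuning step that would consume these rewards is omitted.

```python
# Minimal sketch (not the paper's code) of foresight vs. hindsight feedback.
# All functions are hypothetical placeholders for learned models.
from dataclasses import dataclass
import random


@dataclass
class Interaction:
    user_query: str
    ai_response: str


def simulate_outcome(interaction: Interaction) -> str:
    """Stand-in for a world-model rollout: what plausibly happens downstream
    if the user acts on the AI's response (e.g. buys the recommended item)."""
    return random.choice(["user satisfied with purchase", "user regrets purchase"])


def foresight_feedback(interaction: Interaction) -> float:
    """RLHF-style signal: the evaluator judges the response alone, so a
    persuasive but misleading response can inflate the score (Goodhart's law)."""
    return 1.0 if "great deal" in interaction.ai_response else 0.0


def hindsight_feedback(interaction: Interaction, observed_outcome: str) -> float:
    """RLHS-style signal: the evaluator scores the response only after seeing
    a simulated outcome, decoupling feedback from the AI's own persuasion."""
    return 1.0 if "satisfied" in observed_outcome else 0.0


if __name__ == "__main__":
    ex = Interaction("Should I buy this laptop?", "It's a great deal, buy it!")
    r_rlhf = foresight_feedback(ex)                        # reward from foresight
    r_rlhs = hindsight_feedback(ex, simulate_outcome(ex))  # reward from hindsight
    print(f"foresight reward: {r_rlhf}, hindsight reward: {r_rlhs}")
```

In the paper's setup, rewards of the hindsight form would then be used to fine-tune the policy with PPO (online) or DPO (offline); the sketch only shows how the feedback signal itself is elicited.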
Submission history
From: Kaiqu Liang
[v1] Wed, 15 Jan 2025 06:33:15 UTC (9,850 KB)
[v2] Mon, 10 Feb 2025 21:17:01 UTC (6,163 KB)
[v3] Tue, 10 Jun 2025 03:19:33 UTC (1,663 KB)