Paper Page - MaPPO: Maximum A Posteriori Preference Optimization With Prior Knowledge

MaPPO, a framework for preference optimization, enhances alignment of large language models with human preferences by integrating prior reward knowledge into a Maximum a Posteriori objective, improving performance across various benchmarks.

As large language models (LLMs) increasingly act on behalf of users,
Preference Optimization (PO) methods have become a central approach to aligning
LLMs with human preferences and improving performance. We propose Maximum a
Posteriori Preference Optimization (MaPPO), a framework for learning from
preferences that explicitly incorporates prior reward knowledge into the
optimization objective. While existing methods such as Direct Preference
Optimization (DPO) and its variants treat preference learning as a Maximum
Likelihood Estimation (MLE) problem, MaPPO extends this paradigm by integrating
prior reward estimates into a principled Maximum a Posteriori (MaP) objective.
This not only generalizes DPO and its variants, but also enhances alignment by
mitigating the oversimplified binary classification of responses. More
importantly, MaPPO introduces no additional hyperparameters and supports
preference optimization in both offline and online settings. In addition, MaPPO
can be used as a plug-in that consistently improves DPO variants, including the
widely used SimPO, IPO, and CPO. Extensive empirical evaluations across different
model sizes and model series on three standard benchmarks (MT-Bench,
AlpacaEval 2.0, and Arena-Hard) demonstrate consistent improvements in
alignment performance without sacrificing computational efficiency.
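
To make the MLE framing concrete, here is a minimal sketch of the standard per-pair DPO loss, followed by a purely illustrative variant in which a prior-derived weight rescales the rejected-response term. The abstract does not give the MaPPO objective, so `prior_weighted_loss` and its `prior_weight` argument are hypothetical stand-ins, not the paper's formula. They only show how an objective of this kind can reduce to DPO as a special case.

```python
import math


def log_sigmoid(z: float) -> float:
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Standard DPO loss for one (chosen, rejected) pair.

    logp_*     : log-probability of the response under the policy
    ref_logp_* : log-probability of the response under the reference model
    """
    chosen_ratio = logp_w - ref_logp_w
    rejected_ratio = logp_l - ref_logp_l
    return -log_sigmoid(beta * (chosen_ratio - rejected_ratio))


def prior_weighted_loss(logp_w, logp_l, ref_logp_w, ref_logp_l,
                        prior_weight, beta=0.1):
    """HYPOTHETICAL illustration, not the paper's objective.

    prior_weight is an assumed quantity computed from prior reward
    estimates for the pair. With prior_weight == 1 this is exactly DPO.
    """
    chosen_ratio = logp_w - ref_logp_w
    rejected_ratio = logp_l - ref_logp_l
    return -log_sigmoid(beta * (chosen_ratio - prior_weight * rejected_ratio))
```

With `prior_weight=1.0` the two functions agree. Smaller weights soften the penalty on the rejected response, which is one way a prior could temper the strict binary chosen/rejected split the abstract mentions.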
