MaPPO: Maximum a Posteriori Preference Optimization with Prior Knowledge

MaPPO, a framework for preference optimization, enhances alignment of large language models with human preferences by integrating prior reward knowledge into a Maximum a Posteriori objective, improving performance across various benchmarks.

As the era of large language models (LLMs) acting on behalf of users unfolds,
Preference Optimization (PO) methods have become a central approach to aligning
LLMs with human preferences and improving performance. We propose Maximum a
Posteriori Preference Optimization (MaPPO), a framework for learning from
preferences that explicitly incorporates prior reward knowledge into the
optimization objective. While existing methods such as Direct Preference
Optimization (DPO) and its variants treat preference learning as a Maximum
Likelihood Estimation (MLE) problem, MaPPO extends this paradigm by integrating
prior reward estimates into a principled Maximum a Posteriori (MaP) objective.
This not only generalizes DPO and its variants, but also enhances alignment by
mitigating the oversimplified binary classification of responses. More
importantly, MaPPO introduces no additional hyperparameters and supports
preference optimization in both offline and online settings. In addition, MaPPO
can be used as a plugin that consistently improves DPO variants, including the
widely used SimPO, IPO, and CPO. Extensive empirical evaluations across different
model sizes and model series on three standard benchmarks (MT-Bench,
AlpacaEval 2.0, and Arena-Hard) demonstrate consistent improvements in
alignment performance without sacrificing computational efficiency.
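
To make the MLE framing concrete, here is a minimal sketch of the standard per-pair DPO loss that the abstract describes MaPPO as extending. The abstract does not give the form of the MaP objective, so the sketch covers only the DPO baseline. The function names, argument names, and the default `beta` value are illustrative assumptions, not taken from the paper.

```python
import math


def log_sigmoid(z: float) -> float:
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def dpo_loss(
    logp_chosen: float,
    logp_rejected: float,
    ref_logp_chosen: float,
    ref_logp_rejected: float,
    beta: float = 0.1,  # assumed default, chosen only for illustration
) -> float:
    """Standard DPO loss for a single preference pair.

    Each argument is a sequence log-probability: log pi(y | x) under the
    trained policy or under the frozen reference model. The loss is the
    negative log-likelihood of the observed preference under a
    Bradley-Terry model, which is the MLE view of preference learning.
    """
    # Implicit reward of a response: beta * (policy log-prob - reference log-prob).
    reward_chosen = beta * (logp_chosen - ref_logp_chosen)
    reward_rejected = beta * (logp_rejected - ref_logp_rejected)
    # The pair is treated as a binary label: chosen beats rejected.
    return -log_sigmoid(reward_chosen - reward_rejected)
```

When the policy and the reference assign identical log-probabilities, the margin is zero and the loss equals log 2. Raising the chosen response's log-probability relative to the reference lowers the loss. The only supervision signal is the binary chosen/rejected label, which is the simplification the abstract says prior reward estimates are meant to mitigate.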
