Paper page - RLVR-World: Training World Models with Reinforcement Learning

RLVR-World uses reinforcement learning with verifiable rewards to optimize world models for task-specific metrics, achieving improved performance across language and video domains.

World models predict state transitions in response to actions and are
increasingly developed across diverse modalities. However, standard training
objectives such as maximum likelihood estimation (MLE) often misalign with
task-specific goals of world models, i.e., transition prediction metrics like
accuracy or perceptual quality. In this paper, we present RLVR-World, a unified
framework that leverages reinforcement learning with verifiable rewards (RLVR)
to directly optimize world models for such metrics. Although world modeling is
formulated as autoregressive prediction of tokenized sequences, RLVR-World
computes metrics on the decoded predictions and uses them as verifiable rewards. We demonstrate
substantial performance gains on both language- and video-based world models
across domains, including text games, web navigation, and robot manipulation.
Our work indicates that, beyond recent advances in reasoning language models,
RLVR offers a promising post-training paradigm for enhancing the utility of
generative models more broadly.
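
To make the idea of a verifiable reward concrete, here is a minimal sketch for a text-based world model. It is an illustration under stated assumptions, not the authors' implementation: the `decode` helper, the toy vocabulary, the exact-match `accuracy_reward`, and the mean-baseline `centered_advantages` are all hypothetical names introduced here, and the abstract does not say which RL algorithm or reward shaping the paper uses.

```python
from typing import List, Sequence


def decode(tokens: Sequence[int], vocab: Sequence[str]) -> str:
    """Hypothetical detokenizer: map token ids back to a text state."""
    return " ".join(vocab[t] for t in tokens)


def accuracy_reward(predicted_state: str, true_state: str) -> float:
    """Verifiable reward: 1.0 if the decoded prediction matches the
    observed next state exactly, else 0.0 (an assumed metric)."""
    return 1.0 if predicted_state.strip() == true_state.strip() else 0.0


def score_samples(
    samples: Sequence[Sequence[int]], vocab: Sequence[str], true_state: str
) -> List[float]:
    """Decode each sampled token sequence, then score the decoded text.
    The reward is computed on the decoded prediction, not on token
    likelihoods."""
    return [accuracy_reward(decode(s, vocab), true_state) for s in samples]


def centered_advantages(rewards: Sequence[float]) -> List[float]:
    """Subtract the mean reward as a simple baseline (an assumption;
    a generic policy-gradient device, not taken from the source)."""
    mean = sum(rewards) / len(rewards)
    return [r - mean for r in rewards]


# Toy example: three sampled predictions of the next state.
vocab = ["door", "open", "closed"]
samples = [[0, 1], [0, 2], [0, 1]]
rewards = score_samples(samples, vocab, true_state="door open")
advantages = centered_advantages(rewards)
```

In this toy setup, samples whose decoded text matches the observed next state receive a positive advantage and the others a negative one, which is the signal a policy-gradient update would use in place of a likelihood objective. For a video world model, the same structure would apply with a perceptual quality metric on decoded frames instead of exact match.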
