Paper Page - Optimus-3: Towards Generalist Multimodal Minecraft Agents With Scalable Task Experts

Optimus-3, a multimodal large language model agent, uses knowledge-enhanced data generation, a Mixture-of-Experts architecture, and multimodal reasoning-augmented reinforcement learning to achieve superior performance across various tasks in Minecraft.

Recently, agents based on multimodal large language models (MLLMs) have
achieved remarkable progress across various domains. However, building a
generalist agent with capabilities such as perception, planning, action,
grounding, and reflection in open-world environments like Minecraft remains
challenging for three reasons: insufficient domain-specific data, interference among heterogeneous
tasks, and visual diversity in open-world settings. In this paper, we address
these challenges through three key contributions. 1) We propose a
knowledge-enhanced data generation pipeline to provide scalable and
high-quality training data for agent development. 2) To mitigate interference
among heterogeneous tasks, we introduce a Mixture-of-Experts (MoE) architecture
with task-level routing. 3) We develop a Multimodal Reasoning-Augmented
Reinforcement Learning approach to enhance the agent’s reasoning ability for
visual diversity in Minecraft. Built upon these innovations, we present
Optimus-3, a general-purpose agent for Minecraft. Extensive experimental
results demonstrate that Optimus-3 surpasses both generalist multimodal large
language models and existing state-of-the-art agents across a wide range of
tasks in the Minecraft environment. Project page:
https://cybertronagent.github.io/Optimus-3.github.io/
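
The abstract does not spell out how task-level routing works, so the following is only a minimal sketch of the general idea, not the Optimus-3 implementation. It assumes each input arrives with a task label and that the router makes one decision per sample, sending every token of that sample to the same expert (in contrast to token-level routing, where each token may pick a different expert). The class `TaskLevelMoE`, the helper `make_scaling_expert`, and the scaling "experts" are hypothetical stand-ins for real feed-forward expert blocks.

```python
from typing import Callable, Dict, List

Vector = List[float]
Expert = Callable[[Vector], Vector]


def make_scaling_expert(scale: float) -> Expert:
    """Hypothetical stand-in for an expert feed-forward block."""
    def expert(token: Vector) -> Vector:
        return [scale * x for x in token]
    return expert


class TaskLevelMoE:
    """Illustrative layer that routes a whole sample by its task label."""

    def __init__(self, experts: Dict[str, Expert]) -> None:
        self.experts = experts

    def forward(self, task: str, tokens: List[Vector]) -> List[Vector]:
        if task not in self.experts:
            raise KeyError(f"no expert registered for task {task!r}")
        # One routing decision per sample: every token uses the same expert,
        # so parameters for different tasks are never updated by each other.
        expert = self.experts[task]
        return [expert(tok) for tok in tokens]


# Task names taken from the capabilities listed in the abstract.
TASKS = ["perception", "planning", "action", "grounding", "reflection"]
moe = TaskLevelMoE(
    {task: make_scaling_expert(float(i + 1)) for i, task in enumerate(TASKS)}
)

tokens = [[1.0, 2.0], [3.0, 4.0]]
planning_out = moe.forward("planning", tokens)  # handled by the "planning" expert
action_out = moe.forward("action", tokens)      # handled by the "action" expert
```

Because the expert is chosen from the task label rather than learned per token, adding a new task in this sketch only means registering one more entry in the `experts` dictionary.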
