VoyagerVision: Investigating the Role of Multi-modal Information for Open-ended Learning Systems

arXiv:2507.00079v1 Announce Type: new
Abstract: Open-endedness is an active field of research in the pursuit of capable Artificial General Intelligence (AGI), allowing models to pursue tasks of their own choosing. Meanwhile, recent Large Language Models (LLMs) such as GPT-4o [9] can interpret image inputs. Implementations such as OMNI-EPIC [4] have made use of this capability, providing an LLM with pixel data from an agent’s point of view so that it can parse the environment and solve tasks. This paper proposes that providing such visual inputs to a model gives it a greater ability to interpret spatial environments and, as a result, can increase the number of tasks it can successfully perform, extending its open-ended potential. To this end, the paper presents VoyagerVision, a multi-modal model that builds on Voyager and creates structures within Minecraft using screenshots as a form of visual feedback. VoyagerVision created an average of 2.75 unique structures within fifty iterations of the system. Because Voyager was incapable of building structures, this extends the system in an entirely new direction. Additionally, in a set of building unit tests, VoyagerVision succeeded in half of all attempts in flat worlds, with most failures arising in more complex structures. The project website is available at https://esmyth-dev.github.io/VoyagerVision.github.io/
