The video shows an agent collecting rewards in previously unseen mazes using only raw pixels as input. The agent was trained with the Asynchronous Advantage Actor-Critic (A3C) algorithm and, during training, was rewarded only for picking up apples and entering orange portals.
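As a rough illustration of what "advantage" means in A3C, the sketch below computes discounted n-step returns for a short rollout and subtracts the critic's value estimates from them. This is a minimal sketch, not the authors' implementation: the function name `n_step_advantages`, the reward sequence, the value estimates, and the discount factor are all hypothetical and chosen only to keep the arithmetic easy to follow.

```python
def n_step_advantages(rewards, values, bootstrap_value, gamma=0.99):
    """Compute n-step returns and advantages for one rollout segment.

    rewards[t]      -- reward received after acting in state t
    values[t]       -- critic's estimate V(s_t) (hypothetical numbers here)
    bootstrap_value -- V of the state after the last step (0 if terminal)
    """
    returns = []
    running = bootstrap_value
    # Walk backwards so each return reuses the one after it:
    # R_t = r_t + gamma * R_{t+1}
    for reward in reversed(rewards):
        running = reward + gamma * running
        returns.append(running)
    returns.reverse()
    # Advantage: how much better the observed return was than predicted.
    advantages = [ret - val for ret, val in zip(returns, values)]
    return returns, advantages


# Illustrative rollout: a small reward (e.g. an apple) at step 1 and a
# larger one (e.g. a portal) at step 3. All numbers are made up.
returns, advantages = n_step_advantages(
    rewards=[0.0, 1.0, 0.0, 10.0],
    values=[1.0, 3.0, 4.0, 8.0],
    bootstrap_value=0.0,
    gamma=0.5,
)
```

In an actor-critic setup of this kind, the advantages would typically weight the policy-gradient term while the returns serve as regression targets for the critic.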
Asynchronous Methods for Deep Reinforcement Learning: Labyrinth