Recent vision-language models (VLMs) achieve strong results on offline image and video understanding, but their performance in interactive, embodied environments remains limited. In closed-loop settings, an agent acts from a first-person view, and each decision alters its future observations. Even leading models such as GPT-4o, Claude 3.5 Sonnet, and Gemini 2.5 Pro struggle with spatial reasoning and long-horizon planning. We present EmbRACE-3K, a dataset of over 3,000 language-guided tasks set in diverse Unreal Engine environments. Each task spans multiple steps and pairs egocentric views with high-level instructions, grounded actions, and natural-language rationales. We benchmark VLMs on three core skills: exploration, dynamic spatial-semantic reasoning, and multi-stage goal execution. In zero-shot evaluation, all models achieve success rates below 20%, leaving clear room for improvement. Fine-tuning Qwen2.5-VL-7B with supervised learning and reinforcement learning yields consistent gains across all task types, demonstrating the value of EmbRACE-3K for developing embodied intelligence.