Code large language models have demonstrated remarkable capabilities in
programming tasks, yet current benchmarks focus primarily on single-modality code
generation rather than on visual game development. Most existing code-related benchmarks
evaluate syntactic correctness and execution accuracy, overlooking critical
game-specific metrics such as playability, visual aesthetics, and user
engagement, all of which are essential for real-world deployment. Current LLMs excel at
algorithmic problem-solving and competitive programming, but practical game development
imposes far broader requirements. To address this gap, we present V-GameGym, a
comprehensive benchmark comprising 2,219 high-quality samples across 100 thematic
clusters derived from real-world repositories, curated with a novel clustering-based
methodology that ensures both diversity and structural completeness. Furthermore, we introduce a multimodal
evaluation framework with an automated LLM-driven pipeline for visual code
synthesis using complete UI sandbox environments. Extensive analysis shows that
V-GameGym effectively bridges the gap between code generation accuracy and practical
game development workflows, providing quantifiable quality metrics for visual
programming and interactive element generation.